  • Analysis
Development and validation of DNA methylation scores in two European cohorts augment 10-year risk prediction of type 2 diabetes

Abstract

Type 2 diabetes mellitus (T2D) presents a major health and economic burden that could be alleviated with improved early prediction and intervention. While standard risk factors have shown good predictive performance, we show that the use of blood-based DNA methylation information leads to a significant improvement in the prediction of 10-year T2D incidence risk. Previous studies have been largely constrained by linear assumptions, the use of cytosine–guanine pairs one at a time and binary outcomes. We present a flexible approach (via an R package, MethylPipeR) based on a range of linear and tree-ensemble models that incorporate time-to-event data for prediction. Using the Generation Scotland cohort (training set n_cases = 374, n_controls = 9,461; test set n_cases = 252, n_controls = 4,526) our best-performing model (area under the receiver operating characteristic curve (AUC) = 0.872, area under the precision–recall curve (PRAUC) = 0.302) showed notable improvement in 10-year onset prediction beyond standard risk factors (AUC = 0.839, PRAUC = 0.227). Replication was observed in the German-based KORA study (n = 1,451, n_cases = 142, P = 1.6 × 10^−5).
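To put the two metrics quoted in the abstract in context, the following is a minimal sketch, not the authors' code or the MethylPipeR API. It uses only the test-set counts from the abstract to compute the case prevalence (about 5.3%), which is the PRAUC a random classifier would be expected to reach, and it shows one common way to compute AUC and PRAUC on a tiny invented example. The function names, the toy labels and scores, and the choice of average precision as the PRAUC estimator are all assumptions made for illustration.

```python
# Test-set counts quoted in the abstract (Generation Scotland).
N_CASES = 252
N_CONTROLS = 4526


def prevalence(n_cases, n_controls):
    """Fraction of cases; the expected PRAUC of a random classifier."""
    return n_cases / (n_cases + n_controls)


def roc_auc(labels, scores):
    """AUC as the probability that a random case outranks a random control.

    Tied scores count as half a win.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def average_precision(labels, scores):
    """Step-wise area under the precision-recall curve.

    Assumes no tied scores (true for the toy data below).
    """
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    n_pos = sum(labels)
    true_pos = 0
    total = 0.0
    for rank, (_, y) in enumerate(ranked, start=1):
        if y == 1:
            true_pos += 1
            total += true_pos / rank  # precision at this recall step
    return total / n_pos


# Hypothetical toy data, NOT taken from the study.
toy_labels = [1, 0, 1, 0, 0, 0, 0, 0]
toy_scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.2, 0.15, 0.1]

baseline = prevalence(N_CASES, N_CONTROLS)
print(f"Random-classifier PRAUC baseline: {baseline:.4f}")
print(f"Reported full-model PRAUC / baseline: {0.302 / baseline:.2f}")
print(f"Reported risk-factor PRAUC / baseline: {0.227 / baseline:.2f}")
print(f"Toy AUC: {roc_auc(toy_labels, toy_scores):.4f}")
print(f"Toy average precision: {average_precision(toy_labels, toy_scores):.4f}")
```

Because cases make up only a small share of the test set, the PRAUC values are best read against this prevalence baseline rather than against 0.5, which is the chance level for AUC.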

Fig. 1: The prediction pipeline and functionality provided in MethylPipeR.
Fig. 2: ROC and precision–recall curves for the full models.
Fig. 3: Confusion matrix plot of true positives/false negatives and false positives/true negatives in the Generation Scotland test dataset.

Data availability

According to the terms of consent for Generation Scotland participants, access to data must be reviewed by the Generation Scotland Access Committee. Applications should be made to [email protected]. The informed consent given by the KORA S4 study participants does not cover data posting in public databases. However, data are available upon request from the KORA Project Application Self-Service Tool (https://epi.helmholtz-muenchen.de/). Data requests can be submitted online and are subject to approval by the KORA board.

Code availability

Analysis scripts for this study are available at https://github.com/marioni-group/episcores-diabetes-prediction and https://doi.org/10.5281/zenodo.7628959. MethylPipeR v.0.1.0 is available at https://github.com/marioni-group/MethylPipeR and https://doi.org/10.5281/zenodo.7628816. MethylPipeR-UI is available at https://github.com/marioni-group/MethylPipeR-UI and https://doi.org/10.5281/zenodo.7635952.

Acknowledgements

This research was funded in whole, or in part, by the Wellcome Trust (nos. 104036/Z/14/Z, 108890/Z/15/Z and 216767/Z/19/Z). For the purpose of open access, we have applied a CC BY public copyright licence to any author accepted manuscript version arising from this submission. Generation Scotland received core support from the Chief Scientist Office of the Scottish Government Health Directorates (no. CZD/16/6) and the Scottish Funding Council (no. HR03006) and is currently supported by the Wellcome Trust (no. 216767/Z/19/Z). DNAm profiling of the Generation Scotland samples was carried out by the Genetics Core Laboratory at the Edinburgh Clinical Research Facility and was funded by the Medical Research Council UK and the Wellcome Trust (Wellcome Trust Strategic Award ‘STratifying Resilience and Depression Longitudinally’ (ref. no. 104036/Z/14/Z)). The DNAm data assayed for Generation Scotland was partially funded by a 2018 NARSAD Young Investigator Grant from the Brain & Behavior Research Foundation (ref. no. 27404; awardee: D. M. Howard) and by a JMAS SIM fellowship from the Royal College of Physicians of Edinburgh (awardee: H. C. Whalley). Y.C. is supported by the University of Edinburgh and University of Helsinki joint PhD program in Human Genomics. D.A.G. is supported by funding from the Wellcome Trust 4-year PhD in Translational Neuroscience—training the next generation of basic neuroscientists to embrace clinical research (no. 108890/Z/15/Z). C.A.V. is a Chancellor’s Fellow funded by the University of Edinburgh. D.L.M. and R.E.M. are supported by an Alzheimer’s Research UK major project grant no. ARUK-PG2017B-10. R.E.M. is supported by an Alzheimer’s Society major project grant no. AS-PG-19b-010. M.J.S., N.S. and E.L. are supported by the United Kingdom Research and Innovation (grant EP/S02431X/1), UKRI Centre for Doctoral Training in Biomedical AI at the University of Edinburgh, School of Informatics. 
Recruitment to the CovidLife study was facilitated by the Scottish Health Research Register (SHARE) and Biobank. SHARE is supported by NHS Research Scotland, the universities of Scotland and the Chief Scientist Office of the Scottish Government. The KORA S4 study was initiated and financed by the Helmholtz Zentrum München—German Research Center for Environmental Health, which is funded by the German Federal Ministry of Education and Research and by the State of Bavaria. Furthermore, the KORA research has been supported by the Munich Center of Health Sciences, Ludwig-Maximilians-Universität München as part of LMUinnovativ and is supported by the German Centre for Cardiovascular Research. The KORA S4 study is funded by the Bavarian State Ministry of Health and Care through the research project DigiMed Bayern (www.digimed-bayern.de).

Author information

Contributions

Y.C., D.A.G. and C.G. performed the data analysis. Y.Z., I.B., M.J.S., N.S. and E.L. conducted the preliminary analyses. Generation Scotland cohort: N.W. and L.M. were responsible for the data collection. A.C., C.N., R.M.W., K.L.E. and D.L.M. prepared the data. D.J.P. and A.M.M. were responsible for data collection and cohort management. KORA S4 cohort: A.P., W.R. and M.W. were responsible for data collection and preparation. CovidLife: C.F.-R. performed the data collection and preparation. K.M.-G. and T.I.C. provided input on statistical modeling. A.G., C.A.V. and R.E.M. were responsible for study design and methodology. All authors contributed to drafting the paper and the figures.

Corresponding authors

Correspondence to Catalina A. Vallejos or Riccardo E. Marioni.

Ethics declarations

Competing interests

R.E.M. has received a speaker fee from Illumina and is an advisor to the Epigenetic Clock Development Foundation. A.M.M. has previously received speaker fees from Janssen and Illumina and research funding from The Sackler Trust. L.M. has received payment from Illumina for presentations and consultancy. The other authors declare no competing interests.

Peer review

Peer review information

Nature Aging thanks Srikanth Bellary and the other, anonymous, reviewer(s) for their contribution to the peer review of this work.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Extended data

Extended Data Fig. 1 Preprocessing steps for Generation Scotland and KORA S4.

The numbers of individuals (cases and controls) are given after each step.

Extended Data Fig. 2 Calibration plots for incremental models in Generation Scotland.

Plots are shown for the full model (risk factors + composite protein epigenetic score + Cox PH lasso direct epigenetic score) (top-left) and the risk-factors-only model (bottom-left). The black line shows the loess calibration regression curve. The grey area shows 95% confidence intervals calculated from 2,000 bootstrap samples. The ideal calibration line (observed = predicted) is shown in red. The histogram shows the distribution of predicted probabilities. The wider confidence intervals at higher predicted probabilities are due to the small number of predictions in those ranges. Most predictions fall in the low end of the probability range, as emphasised in the zoomed-in plots (top-right and bottom-right).
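The caption's point about wider intervals where predictions are sparse can be illustrated with a small simulation. This is a hedged sketch only: it swaps the loess curve for simple equal-width probability bins, uses simulated (not study) data, and all function names are hypothetical. It keeps the 2,000 bootstrap resamples mentioned in the caption and compares the width of the percentile 95% interval in a crowded low-risk bin against a sparse high-risk bin.

```python
import random


def bin_index(p, n_bins):
    """Equal-width bin for a probability in [0, 1]."""
    return min(int(p * n_bins), n_bins - 1)


def observed_rates(pairs, n_bins):
    """Observed event rate per bin; None for empty bins."""
    events = [0] * n_bins
    counts = [0] * n_bins
    for p, y in pairs:
        b = bin_index(p, n_bins)
        events[b] += y
        counts[b] += 1
    return [e / c if c else None for e, c in zip(events, counts)]


def bootstrap_ci(pairs, n_bins=5, n_boot=2000, seed=0):
    """Percentile 95% interval of the observed event rate in each bin."""
    rng = random.Random(seed)
    draws = [[] for _ in range(n_bins)]
    for _ in range(n_boot):
        # Resample individuals with replacement, then recompute bin rates.
        sample = rng.choices(pairs, k=len(pairs))
        for b, rate in enumerate(observed_rates(sample, n_bins)):
            if rate is not None:
                draws[b].append(rate)
    intervals = []
    for d in draws:
        d.sort()
        if d:
            intervals.append((d[int(0.025 * len(d))], d[int(0.975 * len(d)) - 1]))
        else:
            intervals.append(None)
    return intervals


# Simulated, perfectly calibrated data with predictions skewed towards zero.
sim = random.Random(1)
preds = [sim.random() ** 4 for _ in range(1000)]
pairs = [(p, 1 if sim.random() < p else 0) for p in preds]

intervals = bootstrap_ci(pairs)
low_width = intervals[0][1] - intervals[0][0]    # bin [0.0, 0.2): many predictions
high_width = intervals[3][1] - intervals[3][0]   # bin [0.6, 0.8): few predictions
print(f"CI width, low-risk bin:  {low_width:.3f}")
print(f"CI width, high-risk bin: {high_width:.3f}")
```

In this simulation the high-risk bin holds far fewer individuals than the low-risk bin, so its bootstrap interval is noticeably wider even though the underlying model is calibrated by construction.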

Extended Data Fig. 3 An example from the MethylPipeR-UI Shiny app.

The left-hand panel provides functionality for uploading data and specifying pipeline parameters. The right-hand tabs show output such as model diagnostics, performance metrics and console output.

Supplementary information

Supplementary Information

Supplementary Methods.

Reporting Summary

Supplementary Tables

Supplementary Tables 1–14.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Cheng, Y., Gadd, D.A., Gieger, C. et al. Development and validation of DNA methylation scores in two European cohorts augment 10-year risk prediction of type 2 diabetes. Nat Aging 3, 450–458 (2023). https://doi.org/10.1038/s43587-023-00391-4

  • DOI: https://doi.org/10.1038/s43587-023-00391-4
