Abstract
Type 2 diabetes mellitus (T2D) presents a major health and economic burden that could be alleviated with improved early prediction and intervention. While standard risk factors have shown good predictive performance, we show that the use of blood-based DNA methylation information leads to a significant improvement in the prediction of 10-year T2D incidence risk. Previous studies have been largely constrained by linear assumptions, the use of cytosine–guanine pairs one at a time and binary outcomes. We present a flexible approach (via an R package, MethylPipeR) based on a range of linear and tree-ensemble models that incorporate time-to-event data for prediction. Using the Generation Scotland cohort (training set n_cases = 374, n_controls = 9,461; test set n_cases = 252, n_controls = 4,526), our best-performing model (area under the receiver operating characteristic curve (AUC) = 0.872, area under the precision–recall curve (PRAUC) = 0.302) showed notable improvement in 10-year onset prediction beyond standard risk factors (AUC = 0.839, PRAUC = 0.227). Replication was observed in the German-based KORA study (n = 1,451, n_cases = 142, P = 1.6 × 10^−5).
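To put the two reported metrics in context, the following Python sketch computes AUC and a step-wise precision–recall area (average precision) on toy data. It also computes the PRAUC expected from a non-informative classifier, which equals the case prevalence (252 / (252 + 4,526) ≈ 0.053 in the Generation Scotland test set). The function names and toy data are illustrative assumptions. This is not the MethylPipeR implementation, and the study may have computed PRAUC with a different interpolation.

```python
from typing import Sequence


def roc_auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """AUC as the probability that a random case scores above a random control.

    Ties between a case and a control count as half a win.
    """
    cases = [s for y, s in zip(labels, scores) if y == 1]
    controls = [s for y, s in zip(labels, scores) if y == 0]
    wins = 0.0
    for c in cases:
        for k in controls:
            if c > k:
                wins += 1.0
            elif c == k:
                wins += 0.5
    return wins / (len(cases) * len(controls))


def average_precision(labels: Sequence[int], scores: Sequence[float]) -> float:
    """Step-wise area under the precision-recall curve.

    Averages the precision observed at the rank of each case.
    Tied scores are not treated specially in this sketch.
    """
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    n_cases = sum(labels)
    true_positives = 0
    total = 0.0
    for rank, (_, y) in enumerate(ranked, start=1):
        if y == 1:
            true_positives += 1
            total += true_positives / rank
    return total / n_cases


def chance_level_prauc(n_cases: int, n_controls: int) -> float:
    """Expected PRAUC of a non-informative classifier: the case prevalence."""
    return n_cases / (n_cases + n_controls)


# Toy data (hypothetical): 3 cases and 5 controls with made-up risk scores.
toy_labels = [0, 0, 1, 0, 1, 0, 0, 1]
toy_scores = [0.10, 0.20, 0.80, 0.30, 0.40, 0.50, 0.05, 0.90]

print("toy AUC:", roc_auc(toy_labels, toy_scores))
print("toy average precision:", average_precision(toy_labels, toy_scores))
# Test-set counts taken from the abstract.
print("chance-level PRAUC:", chance_level_prauc(252, 4526))
```

Because cases make up only about 5% of the test set, a PRAUC of 0.302 sits well above the chance level even though it looks small next to the AUC.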
Data availability
According to the terms of consent for Generation Scotland participants, access to data must be reviewed by the Generation Scotland Access Committee. Applications should be made to [email protected]. The informed consent given by the KORA S4 study participants does not cover data posting in public databases. However, data are available upon request from the KORA Project Application Self-Service Tool (https://epi.helmholtz-muenchen.de/). Data requests can be submitted online and are subject to approval by the KORA board.
Code availability
Analysis scripts for this study are available at https://github.com/marioni-group/episcores-diabetes-prediction and https://doi.org/10.5281/zenodo.7628959. MethylPipeR v.0.1.0 is available at https://github.com/marioni-group/MethylPipeR and https://doi.org/10.5281/zenodo.7628816. MethylPipeR-UI is available at https://github.com/marioni-group/MethylPipeR-UI and https://doi.org/10.5281/zenodo.7635952.
Acknowledgements
This research was funded in whole, or in part, by the Wellcome Trust (nos. 104036/Z/14/Z, 108890/Z/15/Z and 216767/Z/19/Z). For the purpose of open access, we have applied a CC BY public copyright licence to any author accepted manuscript version arising from this submission. Generation Scotland received core support from the Chief Scientist Office of the Scottish Government Health Directorates (no. CZD/16/6) and the Scottish Funding Council (no. HR03006) and is currently supported by the Wellcome Trust (no. 216767/Z/19/Z). DNAm profiling of the Generation Scotland samples was carried out by the Genetics Core Laboratory at the Edinburgh Clinical Research Facility and was funded by the Medical Research Council UK and the Wellcome Trust (Wellcome Trust Strategic Award ‘STratifying Resilience and Depression Longitudinally’ (ref. no. 104036/Z/14/Z)). The DNAm data assayed for Generation Scotland was partially funded by a 2018 NARSAD Young Investigator Grant from the Brain & Behavior Research Foundation (ref. no. 27404; awardee: D. M. Howard) and by a JMAS SIM fellowship from the Royal College of Physicians of Edinburgh (awardee: H. C. Whalley). Y.C. is supported by the University of Edinburgh and University of Helsinki joint PhD program in Human Genomics. D.A.G. is supported by funding from the Wellcome Trust 4-year PhD in Translational Neuroscience—training the next generation of basic neuroscientists to embrace clinical research (no. 108890/Z/15/Z). C.A.V. is a Chancellor’s Fellow funded by the University of Edinburgh. D.L.M. and R.E.M. are supported by an Alzheimer’s Research UK major project grant no. ARUK-PG2017B-10. R.E.M. is supported by an Alzheimer’s Society major project grant no. AS-PG-19b-010. M.J.S., N.S. and E.L. are supported by the United Kingdom Research and Innovation (grant EP/S02431X/1), UKRI Centre for Doctoral Training in Biomedical AI at the University of Edinburgh, School of Informatics. 
Recruitment to the CovidLife study was facilitated by the Scottish Health Research Register (SHARE) and Biobank. SHARE is supported by NHS Research Scotland, the universities of Scotland and the Chief Scientist Office of the Scottish Government. The KORA S4 study was initiated and financed by the Helmholtz Zentrum München—German Research Center for Environmental Health, which is funded by the German Federal Ministry of Education and Research and by the State of Bavaria. Furthermore, the KORA research has been supported by the Munich Center of Health Sciences, Ludwig-Maximilians-Universität München as part of LMUinnovativ and is supported by the German Centre for Cardiovascular Research. The KORA S4 study is funded by the Bavarian State Ministry of Health and Care through the research project DigiMed Bayern (www.digimed-bayern.de).
Author information
Contributions
Y.C., D.A.G. and C.G. performed the data analysis. Y.Z., I.B., M.J.S., N.S. and E.L. conducted the preliminary analyses. Generation Scotland cohort: N.W. and L.M. were responsible for the data collection. A.C., C.N., R.M.W., K.L.E. and D.L.M. prepared the data. D.J.P. and A.M.M. were responsible for data collection and cohort management. KORA S4 cohort: A.P., W.R. and M.W. were responsible for data collection and preparation. CovidLife: C.F.-R. performed the data collection and preparation. K.M.-G. and T.I.C. provided input on statistical modeling. A.G., C.A.V. and R.E.M. were responsible for study design and methodology. All authors contributed to drafting the paper and the figures.
Ethics declarations
Competing interests
R.E.M. has received a speaker fee from Illumina and is an advisor to the Epigenetic Clock Development Foundation. A.M.M. has previously received speaker fees from Janssen and Illumina and research funding from The Sackler Trust. L.M. has received payment from Illumina for presentations and consultancy. The other authors declare no competing interests.
Peer review
Peer review information
Nature Aging thanks Srikanth Bellary and the other, anonymous, reviewer(s) for their contribution to the peer review of this work.
Additional information
Publisher’s note: Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Extended data
Extended Data Fig. 1 Preprocessing steps for Generation Scotland and KORA S4.
The numbers of individuals (cases and controls) are given after each step.
Extended Data Fig. 2 Calibration plots for incremental models in Generation Scotland.
Plots are shown for the full model (risk factors + composite protein epigenetic score + Cox PH lasso direct epigenetic score) (top-left) and the risk-factors-only model (bottom-left). The black line shows the loess calibration regression curve. The grey area shows 95% confidence intervals calculated from 2,000 bootstrap samples. The ideal calibration line (observed = predicted) is shown in red. The histogram shows the distribution of predicted probabilities. The wider confidence intervals at higher predicted probabilities are due to the small number of predictions in those ranges. Most predictions lie at the low end of the probability range, as emphasised in the zoomed-in plots (top-right and bottom-right).
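The relationship between the number of predictions and interval width described in this caption can be illustrated with a small Python sketch. It swaps the loess curve for simple equal-width probability bins and uses a percentile bootstrap for the observed event rate. The function names, bin count and toy data are assumptions made for illustration and are not taken from the study's code.

```python
import random
from typing import List, Sequence, Tuple


def binned_calibration(
    probs: Sequence[float], outcomes: Sequence[int], n_bins: int = 5
) -> List[Tuple[float, float, int]]:
    """Return (mean predicted, observed rate, count) for each non-empty bin.

    Bins are equal-width intervals of predicted probability on [0, 1].
    """
    bins: List[List[Tuple[float, int]]] = [[] for _ in range(n_bins)]
    for p, y in zip(probs, outcomes):
        idx = min(int(p * n_bins), n_bins - 1)  # p == 1.0 goes in the last bin
        bins[idx].append((p, y))
    rows = []
    for members in bins:
        if members:
            mean_p = sum(p for p, _ in members) / len(members)
            rate = sum(y for _, y in members) / len(members)
            rows.append((mean_p, rate, len(members)))
    return rows


def bootstrap_rate_ci(
    outcomes: Sequence[int], n_boot: int = 2000, alpha: float = 0.05, seed: int = 0
) -> Tuple[float, float]:
    """Percentile bootstrap interval for the observed event rate in one bin."""
    rng = random.Random(seed)
    n = len(outcomes)
    # Resample individuals with replacement and recompute the event rate.
    rates = sorted(sum(rng.choices(outcomes, k=n)) / n for _ in range(n_boot))
    lower = rates[int((alpha / 2) * n_boot)]
    upper = rates[int((1 - alpha / 2) * n_boot) - 1]
    return lower, upper


# Toy bins (hypothetical): same 30% event rate, very different sample sizes.
sparse_bin = [1] * 3 + [0] * 7        # 10 predictions
dense_bin = [1] * 300 + [0] * 700     # 1,000 predictions

sparse_lo, sparse_hi = bootstrap_rate_ci(sparse_bin)
dense_lo, dense_hi = bootstrap_rate_ci(dense_bin)
print("sparse bin interval width:", sparse_hi - sparse_lo)
print("dense bin interval width:", dense_hi - dense_lo)
```

With the same underlying event rate, the bin holding only ten predictions yields a much wider interval than the bin holding a thousand.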
Extended Data Fig. 3 An example from the MethylPipeR-UI Shiny app.
The left-hand panel provides functionality for uploading data and specifying pipeline parameters. The right-hand tabs show output such as model diagnostics, performance metrics and console output.
Supplementary information
Supplementary Information
Supplementary Methods.
Supplementary Tables
Supplementary Tables 1–14.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Cheng, Y., Gadd, D.A., Gieger, C. et al. Development and validation of DNA methylation scores in two European cohorts augment 10-year risk prediction of type 2 diabetes. Nat Aging 3, 450–458 (2023). https://doi.org/10.1038/s43587-023-00391-4
DOI: https://doi.org/10.1038/s43587-023-00391-4