MSGene: a multistate model using genetic risk and the electronic health record applied to lifetime risk of coronary artery disease

Urbut, Sarah M.; Yeung, Ming Wai; Khurshid, Shaan; Cho, So Mi Jemma; Schuermans, Art; German, Jakob; Taraszka, Kodi; Paruchuri, Kaavya; Fahed, Akl C.; Ellinor, Patrick T.; Trinquart, Ludovic; Parmigiani, Giovanni; Gusev, Alexander; Natarajan, Pradeep

doi:10.1038/s41467-024-49296-9

Open access
Published: 07 June 2024

Nature Communications volume 15, Article number: 4884 (2024)

Abstract

Coronary artery disease (CAD) is the leading cause of death among adults worldwide. Accurate risk stratification can support optimal lifetime prevention. Current methods lack the ability to incorporate new information throughout the life course or to combine innate genetic risk factors with acquired lifetime risk. We designed a general multistate model (MSGene) to estimate age-specific transitions across 10 cardiometabolic states, dependent on clinical covariates and a CAD polygenic risk score. This model is designed to handle longitudinal data over the lifetime to address this unmet need and support clinical decision-making. We analyze longitudinal data from 480,638 UK Biobank participants and compare predicted lifetime risk with the 30-year Framingham risk score. MSGene improves discrimination (C-index 0.71 vs 0.66), age of high-risk detection (C-index 0.73 vs 0.52), and overall prediction (RMSE 1.1% vs 10.9%) in held-out data. We also use MSGene to refine estimates of lifetime absolute risk reduction from statin initiation. Our findings underscore our multistate model’s potential public health value for accurate lifetime CAD risk estimation using clinical factors and increasingly available genetics toward earlier, more effective prevention.

Introduction

Coronary artery disease (CAD) remains the leading cause of morbidity and mortality worldwide¹. Estimating an individual’s risk of developing CAD over their lifetime is essential for timely and effective prevention and intervention^2,3,4,5. Traditional risk prediction models, such as the Pooled Cohort Equations (PCE) 10-year risk score, have guided clinical decisions and preventive strategies; however, these models come with inherent limitations^6,7,8. A 30-year or 10-year window provides only a fixed, albeit extended, snapshot of risk. It neither captures the entirety of an individual’s lifetime risk nor provides dynamic, age-specific insights beyond these arbitrary periods. Most importantly, there is a growing need for models capable of both recognizing undertreated younger patients and reducing over-estimation in older patients^7,9,10.

Current guidelines^9,11,12 recommend the consideration of primordial risk factors in risk-stratifying patients, and call for better methods of estimating lifetime risk. Recent evidence suggests that lifetime risk assessment provides a more comprehensive picture of an individual’s propensity for developing CAD across time^13,14. Over a longer horizon, traditional factors in combination with genomic risk can confer a disproportionately elevated risk for CAD when compared to short-term static risk^2,15,16,17. For this reason, integrating genomic and traditional features into a lifetime risk assessment allows for more effective patient counseling, tailored preventive measures, and earlier interventions that may delay or prevent the onset of CAD altogether^18,19.

Because of the multifactorial nature of CAD, there is an increasing need for continuously updated, dynamic, and individualized CAD risk predictions that span a patient’s entire life^2,14,20. Understanding risk from this perspective allows for more informed and timely interventions, potentially even before the conventional risk windows are applicable.

Here we introduce the MSGene model, a multistate model designed to predict the lifetime risk of CAD, conditional on both time-invariant and time-dependent variables. Multistate models allow for the estimation of the risk of an individual transitioning between health states^21,22,23,24,25 through flexible estimation of conditional probabilities by modeling the transitions between states over time. By modeling the different health states simultaneously, these models naturally account for competing risks.

MSGene is capable of modeling the dynamic transitions from risk factor states to CAD with age-specific coefficients. Critically, our approach differs from a traditional Markov-based multistate model^21,22 by extending our model to the time-inhomogeneous case and allowing our transitions to vary with age; in addition, our model differs from traditional Cox models by allowing for non-proportional hazards.

In this study, we develop and validate the MSGene model in an application for estimating lifetime risk of CAD. We evaluate the performance compared to the traditionally employed Framingham 30-year²⁶ and PCE 10-year^5,6 models. Here we show the potential ability of MSGene to reduce CAD events by guiding timely initiation of statin therapy and demonstrate the benefit of a multistate framework to incorporate dynamic changes in treatment decisions for unique patient profiles.

Results

Novel multistate model with time-dependent transitions

We build a novel time-dependent multistate model in which age is the time scale.

For each age a and current state j (Fig. 1), we model the one-year probability of transition to state k for individual i at age a, $\pi_{jkia}$, as a logistic regression conditional on both time-invariant covariates (e.g., sex, CAD-PRS) and time-dependent covariates (e.g., smoking, use of anti-hypertensives or statins) (“Methods”, Supplementary Table 1). This methodology defines an inhomogeneous Markov transition model, which can be used to compute, among other quantities, the probability of reaching any state of interest during one’s lifetime. Our transition model is Markovian in that only the current state, and not earlier states, is needed to determine the probability of the next move. It is “inhomogeneous” in that the probabilities of transitioning between health states are allowed to vary over time and depend on covariates. Here, we focus on the lifetime risk of CAD.
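As a minimal sketch of the logistic form described above, the snippet below computes a single one-year transition probability from an age-specific intercept and coefficient vector. The function name, covariate names, and all parameter values are hypothetical placeholders for illustration, not estimates from MSGene.

```python
import math

def transition_probability(intercept, coefs, covariates):
    """One-year transition probability from a logistic model.

    intercept, coefs: hypothetical parameters for one (state j -> state k)
    transition at one age a.
    covariates: dict of covariate values for individual i at that age.
    """
    linear = intercept + sum(coefs[name] * covariates[name] for name in coefs)
    return 1.0 / (1.0 + math.exp(-linear))

# Hypothetical parameters for a healthy -> CAD transition at a single age.
coefs_age_50 = {"male": 0.6, "cad_prs": 0.4, "smoker": 0.5, "statin": -0.2}
person = {"male": 1, "cad_prs": 1.5, "smoker": 0, "statin": 0}
p = transition_probability(-6.0, coefs_age_50, person)
```

In an age-inhomogeneous model of this kind, a separate set of coefficients would be held for each age and each allowed transition, so the same person can have different covariate effects at different ages.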

**Fig. 1: Multistate transitions over time.**

We chose the set of covariates above for comparability with existing approaches. We also report results for smaller subsets of covariates in a sensitivity analysis (“Methods”, Supplementary Table 1). To improve estimation efficiency of state- and age-specific covariate coefficients, we smooth these coefficients across age using tricube distance-weighted least squares regression^27,28. This approach accounts for the reliability of each raw estimate by weighting adjacent ages by their proximity and their inverse variance, so that transitions with small sample sizes receive proportionately less weight. Smoothing in this way shares information across ages, which is especially useful when the number of individuals making a particular transition at a given age is small. We calculate risk under statin-treated and statin-untreated strategies by imputing the relative risk reduction of statins using estimates from 24 clinical trials²⁹ on each annual age-specific transition (“Methods”). We provide an interactive application for users to calculate CAD risk based on various covariates (https://surbut.shinyapps.io/risk/).
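The smoothing step can be illustrated with a small local linear fit in which each raw age-specific coefficient is weighted by a tricube kernel on age distance multiplied by its inverse variance. This is a sketch under stated assumptions: the bandwidth, the local linear form, and the function names are illustrative choices, not the authors' implementation.

```python
def tricube(u):
    """Tricube kernel: (1 - |u|^3)^3 inside the window, 0 outside."""
    u = abs(u)
    return (1.0 - u ** 3) ** 3 if u < 1.0 else 0.0

def smooth_coefficients(ages, estimates, variances, bandwidth=10.0):
    """Smooth raw per-age coefficients with a weighted local linear fit.

    Weight for age a when fitting at a0 = tricube(|a - a0| / bandwidth) / variance,
    so imprecise estimates (large variance) contribute less.
    The bandwidth of 10 years is an assumed, illustrative value.
    """
    smoothed = []
    for a0 in ages:
        w = [tricube((a - a0) / bandwidth) / v for a, v in zip(ages, variances)]
        sw = sum(w)
        mean_a = sum(wi * a for wi, a in zip(w, ages)) / sw
        mean_y = sum(wi * y for wi, y in zip(w, estimates)) / sw
        sxx = sum(wi * (a - mean_a) ** 2 for wi, a in zip(w, ages))
        sxy = sum(wi * (a - mean_a) * (y - mean_y)
                  for wi, a, y in zip(w, ages, estimates))
        slope = sxy / sxx if sxx > 0 else 0.0
        smoothed.append(mean_y + slope * (a0 - mean_a))
    return smoothed

# Hypothetical raw coefficients: a linear trend with one noisy, imprecise age.
ages = list(range(40, 51))
raw = [0.1 * a + 1.0 for a in ages]
var = [1.0] * len(ages)
raw[5] += 5.0      # outlying estimate at age 45 ...
var[5] = 1e6       # ... based on very few transitions, hence a huge variance
smoothed = smooth_coefficients(ages, raw, var)
```

Because the outlying estimate at age 45 carries a very large variance, its weight is negligible and the smoothed value at that age is drawn from its neighbours.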

Baseline characteristics

We considered 480,638 individuals after excluding 20,534 who lacked quality-controlled genotypes or had CAD at baseline (Fig. 2). Of these, 260,653 (54.2%) were female, and there were 43,855 (11.1%) incident CAD diagnoses (Table 1). Median follow-up was 29.9 years [22.4–35.1], and median age at first observation in the EHR was 24.3 years [IQR: 18.0, 37.1]. We visualize the proportional representation by risk factor at each age (Fig. 1) at both baseline and throughout: ~39.6% are ultimately diagnosed with hypertension, 23.6% with hyperlipidemia, and 9.9% with diabetes mellitus. Furthermore, 10.5% report currently smoking and 20.3% began antihypertensive use during the course of our study. General practice data linkage was available for 46.1% of the cohort, and sensitivity analysis showed the distribution of risk factors was homogeneous (Supplementary Table 2).

Table 1 Distribution of overall cohort

Model interpretation

We split the cohort of participants, with 80% of individuals in the training set (384,510 individuals) and 20% as a testing set (79,117 individuals) (Fig. 2). We report the lifetime risk remaining at any age for a given individual i of progressing to state k from state j from age A₁ to A₂, where a indexes the current age and A₂ is set conservatively at 80 (Eq. (1)).

$$\text{Interval Risk}=1-\prod_{a={A}_{1}}^{{A}_{2}}\left(1-{\pi }_{jkia}\right) \qquad (1)$$
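Eq. (1) is a product of annual "no transition" probabilities over the requested age range. A minimal sketch, assuming the annual probabilities for one individual have already been computed; the function name and the numbers are hypothetical:

```python
def interval_risk(annual_probs):
    """Risk of the transition occurring at least once across the listed ages.

    annual_probs: one-year transition probabilities for ages A1..A2, in order.
    """
    survival = 1.0
    for p in annual_probs:
        survival *= 1.0 - p   # probability of not transitioning at this age
    return 1.0 - survival

# Two years at 10% annual risk: 1 - 0.9 * 0.9 = 0.19
two_year = interval_risk([0.10, 0.10])

# Remaining lifetime risk can be re-evaluated at a later age by dropping the
# ages already survived (hypothetical annual risks for ages 40 to 80).
annual = {age: 0.002 + 0.0002 * (age - 40) for age in range(40, 81)}
risk_from_40 = interval_risk([annual[a] for a in range(40, 81)])
risk_from_70 = interval_risk([annual[a] for a in range(70, 81)])
```

With these toy inputs the remaining lifetime risk from age 70 is smaller than from age 40 even though annual risk rises with age, simply because fewer years remain in the product.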

Modeling transitions

Using our multistate approach, MSGene, we describe the overall state distribution across the lifespan in our cohort, pictured to exclude censoring at each age (Fig. 1) and also described above. At age 40 years, for example, 94.4% of individuals are in the healthy category and 0.3% in the CAD category before exclusions, with 4.1% in the hypertensive category. By age 76 years, CAD state occupancy peaks at 12.5% of uncensored individuals, and occupancy of the healthy state is reduced to 27.6% of uncensored individuals. By age 80 years, 7.4% of all individuals enrolled have died.

Improved detection of early events when compared to 10-year risk

When compared to the PCE, a 10% lifetime threshold using MSGene uniquely identifies 5315 (59.3%) cases versus 123 (1.3%) cases using the 10-year PCE (5% threshold) alone at age 40. This reduces to <1% of cases at age 68 (vs 81% with PCE) (Supplementary Fig. 1). At age 40, MSGene showed substantially greater sensitivity for lifetime CAD events compared to PCE (event reclassification 58.2%, 95% CI 58.1–58.3%), at the cost of moderate inappropriate up-classification of lifetime non-events (non-event reclassification –37.3%, 95% CI –37.4 to –37.2%). At age 70, MSGene showed substantially greater specificity compared to PCE (non-event reclassification 32.1%, 95% CI 31.9–32.1%), at the cost of some inappropriate down-classification of events (event reclassification –12.5%, 95% CI –12.6 to –12.4%). Overall, reclassification was consistently favorable (median net reclassification index 0.12) over 40 years of consideration. Notably, MSGene identified more individuals with high genetic risk. Among individuals with predicted lifetime risk >10%, 9.7% (95% CI 9.6–9.8%) have PRS in the top quintile, while only 3.1% (95% CI 2.9–3.2%) have PRS in the lowest quintile (Supplementary Fig. 1).
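The event and non-event reclassification figures above follow the usual two-category net reclassification arithmetic. A minimal sketch, assuming each individual carries a binary high-risk flag under the reference score (here standing in for PCE) and under the new score (standing in for MSGene); the function name and toy data are hypothetical:

```python
def net_reclassification(old_high, new_high, event):
    """Return (event NRI, non-event NRI) for two binary high-risk flags."""
    up_e = down_e = up_ne = down_ne = n_e = n_ne = 0
    for old, new, ev in zip(old_high, new_high, event):
        up = new and not old      # moved to high risk by the new score
        down = old and not new    # moved to low risk by the new score
        if ev:
            n_e += 1
            up_e += up
            down_e += down
        else:
            n_ne += 1
            up_ne += up
            down_ne += down
    event_nri = (up_e - down_e) / n_e          # positive: more events flagged
    nonevent_nri = (down_ne - up_ne) / n_ne    # negative: more non-events flagged
    return event_nri, nonevent_nri

# Toy data: four events followed by four non-events (hypothetical).
old = [False, False, True, False, False, False, True, False]
new = [True, True, True, False, True, False, False, False]
evt = [True, True, True, True, False, False, False, False]
event_nri, nonevent_nri = net_reclassification(old, new, evt)
```

The overall net reclassification index is the sum of the two components, which is why a gain in sensitivity at age 40 can coexist with a negative non-event component.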

Improved calibration when compared to 30-year risk score

MSGene demonstrated globally improved calibration when compared to FRS30RC, the Framingham 30-year risk score recalibrated for our population. We compared the average predicted risk by sex and genomic risk strata with empirical overall incidence rates. In healthy individuals, the RMSE of MSGene is 1.06% (1.04% males, 1.09% females, SEM 0.06) while that of FRS30RC is 10.9% (12.1% males, 10.1% females, SEM 0.07, Supplementary Fig. 3). In contrast to MSGene, FRS30RC 30-year risk increases monotonically across the lifespan. When restricting the analysis to ages 40 and 50, for which 30 years of follow-up is available, the RMSE is 0.98% with MSGene compared to 5.68% for FRS30RC. We further compute the RMSE starting from additional single risk factor phenotype states (hypertension, hyperlipidemia, and diabetes) across a subset of covariate choices (Supplementary Table 1). We found the improvement to be robust across states and covariate choices.
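The calibration comparison reduces to a root-mean-square error between the mean predicted risk and the empirical incidence within each stratum. A minimal sketch with hypothetical stratum values (not figures from this study):

```python
import math

def rmse(predicted, observed):
    """Root-mean-square error between per-stratum predicted and observed risk."""
    squared = [(p - o) ** 2 for p, o in zip(predicted, observed)]
    return math.sqrt(sum(squared) / len(squared))

# Hypothetical mean predicted risk and observed incidence in two strata.
predicted_by_stratum = [0.10, 0.20]
observed_by_stratum = [0.13, 0.16]
error = rmse(predicted_by_stratum, observed_by_stratum)
```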

Dynamic effects of 10-year, 30-year, and remaining lifetime risk

MSGene allows for the estimation of survival curves for an individual starting from a given age, and for updated remaining lifetime risk curves evaluated over a range of ages. We compute the remaining lifetime risk and compare it with FRS30RC as recalibrated for our population³⁰. First, we depict the predicted survival curve for individuals of six different genetic and sex strata starting in the healthy state at age 40 (Fig. 3). Under this traditional analysis, CAD-free survival is projected to decline monotonically as a function of sex and genetic risk to 96.8% (95% CI 96.78–96.82) for a female in the lowest genetic stratum and to 81.26% (95% CI 81.24–81.28) for a male in the highest genetic stratum. However, a remaining lifetime risk curve reveals the opposite behavior: for example, a high genetic risk male has a 22.9% (95% CI 22.7–23.1%) risk without treatment at age 40, but the same high-risk male has only a 10.21% (95% CI 10.20–10.22%) risk of developing CAD if he remains CAD-free at age 70. This contrasts with the 10-year risk prediction, in which 10-year risk rises from 2.84% at age 40 to 10.21% at age 70 (Fig. 3 and Supplementary Data Tables 1–16). We compare this to FRS30RC projections²⁶ and note that while remaining lifetime risk declines with age, the extended fixed-window risk (FRS30RC) increases monotonically across age and genetic risk strata. In our cohort, the FRS30RC risk for a high genetic risk male rises from 13.4% at age 40 to 33.0% at age 70 (Fig. 3) using the recalibrated measure. When applying trial-estimated statin benefit by introducing a trial-estimated relative risk reduction to each annual transition probability²⁹ (“Methods”, Eq. (6)) under MSGene lifetime projections, predicted absolute risk under treatment for the same high genetic risk male at age 40 improves from 22.86% (95% CI 22.85–22.87%) to 18.70% (95% CI 18.69–18.71%) over the 40-year span. This is compared to a smaller decline from 10.21% (95% CI 10.19–10.22%) to 8.25% (95% CI 8.24–8.26%) at age 70. In general, MSGene assigns higher risks to younger individuals with high genetic risk, while FRS30RC, a 30-year fixed-window approach, recognizes older individuals regardless of genetic strata.

**Fig. 3: Survival, 10-year, and lifetime risk curves.**

MSGene demonstrates improved dynamic projection on time-dependent transitions

An updated lifetime prediction, conditional on a patient’s current state, can be made per year, using age-specific coefficients. We use these updated predictions as covariates in a time-dependent extended model^31,32 to evaluate the performance of our model on predicting time-to-event (“Methods”). Though the scores arise from a non-Cox model, since the Cox model score statistic is well defined for time-dependent covariates, the concordance is also well defined for a time-dependent risk score: at each event time the current risk score of the subject who failed is compared to the current (time-dependent) scores of all those still at risk. The Cox model is never used for estimating the MSGene transition probabilities themselves: it is only used in the model assessment stage to evaluate concordance in a time-dependent manner³³.
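The comparison described above, in which the failing subject's current score is set against the current scores of everyone still at risk at that event time, can be sketched directly on counting-process data. This is an illustrative pure-Python version under stated assumptions (scores are constant on each `(start, stop]` interval, ties count as half); it is not the software used in the study, and the function name and toy rows are hypothetical.

```python
def time_dependent_concordance(rows):
    """Concordance for a time-varying risk score.

    rows: tuples (subject, start, stop, score, event), where `score` applies
    on the interval (start, stop] and `event` is 1 if the subject fails at
    `stop`, else 0.
    """
    concordant = tied = pairs = 0
    for subj, start, stop, score, event in rows:
        if not event:
            continue
        # Compare against every other subject at risk at this event time.
        for o_subj, o_start, o_stop, o_score, o_event in rows:
            if o_subj == subj or not (o_start < stop <= o_stop):
                continue
            if o_event and o_stop == stop:
                continue  # both fail at the same time: not comparable
            pairs += 1
            if score > o_score:
                concordant += 1
            elif score == o_score:
                tied += 1
    return (concordant + 0.5 * tied) / pairs if pairs else float("nan")

# Hypothetical annual scores for three subjects, with age as the time scale.
rows = [
    ("A", 40, 41, 0.10, 0), ("A", 41, 42, 0.30, 1),
    ("B", 40, 41, 0.20, 0), ("B", 41, 42, 0.15, 0), ("B", 42, 43, 0.15, 0),
    ("C", 40, 41, 0.05, 0), ("C", 41, 42, 0.05, 0), ("C", 42, 43, 0.10, 1),
]
c_index = time_dependent_concordance(rows)
```

In the toy data, subject A fails at age 42 with a higher current score than both B and C (two concordant pairs), while C fails at age 43 with a lower current score than B (one discordant pair).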

We first consider the distribution of the age at which an individual first exceeds a lifetime risk threshold of 10% using MSGene or FRS30RC, or a PCE-derived 10-year risk threshold of >5%. Using MSGene to assess lifetime risk, 44.8% of individuals exceed this threshold at age 40, while 38.9% never do. With FRS30RC, 44.1% exceed this threshold at age 40, but virtually all (99.8%) exceed this threshold by age 80. Using the first age exceeded under each model as a time-dependent predictor of CAD status, we find that MSGene improves model concordance over FRS30RC by 21% (C-index 0.73 vs 0.52, P < $2.1 \times 10^{-140}$) and over the 10-year index by 17.4% (C-index 0.55, P < $2.1 \times 10^{-103}$) (Fig. 4a–d).
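The threshold analysis depends on one derived quantity per person and per score: the first age at which predicted risk exceeds the chosen cutoff, or no age at all if it never does. A minimal sketch with a hypothetical helper and toy trajectories:

```python
def first_age_exceeding(risk_by_age, threshold=0.10):
    """Return the earliest age whose predicted risk exceeds the threshold.

    risk_by_age: dict mapping age -> predicted risk at that age.
    Returns None when the threshold is never exceeded.
    """
    for age in sorted(risk_by_age):
        if risk_by_age[age] > threshold:
            return age
    return None

# Hypothetical trajectories: a declining remaining-lifetime risk and a
# rising fixed-window risk for the same person.
lifetime = {40: 0.14, 50: 0.12, 60: 0.09, 70: 0.05}
fixed_window = {40: 0.04, 50: 0.08, 60: 0.15, 70: 0.25}
age_lifetime = first_age_exceeding(lifetime)
age_fixed = first_age_exceeding(fixed_window)
```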

**Fig. 4: Time-dependent threshold analysis.**

We then use the yearly time- and state-varying predictions as predictors in a time-dependent Cox proportional hazards model in which one’s score is recorded annually in nonoverlapping intervals, and we estimate the concordance of this model. The concordance of this time-dependent model using dynamic MSGene predictions exceeds that of the updated FRS30RC predictions (C-index 0.71 vs 0.66, P < $2.9 \times 10^{-17}$) (Fig. 4e–g). We repeat these analyses using the subset with general practice (GP) records alone for both training (80%) and testing (20%), and the results hold for both the thresholding analysis (C-index 0.71 vs 0.53, P < $2 \times 10^{-16}$) and the continuous time-dependent analysis (C-index 0.73 vs 0.67, P < $2 \times 10^{-16}$, Supplementary Figs. 3 and 4).

Estimated benefit

Our model incorporates the estimated benefit of a treatment strategy that is imputed conditional on starting age and risk status. Using a randomized clinical trial (RCT)-imputed annual risk reduction of 20% for statins on statin-free individuals^34,35, we observe an inverse relationship between predicted 10-year risk and expected benefit. An individual with the highest genetic risk at age 40 has a predicted 10-year risk (4.2%, SD 0.01) roughly equivalent to that of the lowest genetic risk individual at age 70 (3.9%, SD 0.01), but an expected lifetime absolute risk reduction of 5% (SD 0.01) at age 40 versus only 0.8% (SD $5 \times 10^{-2}$) at age 70 (Fig. 5). When we consider the distribution of all starting states, we see that the mean absolute risk reduction is greatest for younger individuals (4.6–7.2%; SD 0.01) across risk states at age 40, declining to a mean absolute risk reduction of 0.3–3.5% (SD 0.01) at age 79.
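The inverse relationship between short-term risk and lifetime benefit can be reproduced with simple arithmetic. The sketch below assumes the relative risk reduction scales each annual transition probability multiplicatively; the exact form of the Methods equation is not shown in this section, so that scaling, the function names, and the toy inputs are all assumptions.

```python
def interval_risk(annual_probs):
    """1 minus the product of annual 'no event' probabilities."""
    survival = 1.0
    for p in annual_probs:
        survival *= 1.0 - p
    return 1.0 - survival

def absolute_risk_reduction(annual_probs, rrr=0.20):
    """Untreated minus treated interval risk.

    Assumes treatment multiplies every annual probability by (1 - rrr).
    """
    treated = [(1.0 - rrr) * p for p in annual_probs]
    return interval_risk(annual_probs) - interval_risk(treated)

# Two hypothetical people with the same annual (and hence 10-year) risk
# but different horizons to age 80.
young = [0.004] * 40   # starting treatment at 40
older = [0.004] * 10   # starting treatment at 70
arr_young = absolute_risk_reduction(young)
arr_older = absolute_risk_reduction(older)
```

Although both toy profiles share an identical 10-year risk, the younger profile accrues the reduction over four times as many years, so its lifetime absolute risk reduction is larger.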

**Fig. 5: Absolute risk reduction: Short-term and lifetime risk.**

Improvement in discrimination over the cumulative horizon

When considering only the presence or absence of disease over observed time without regard to timing, the AUC-ROC of a model predicting cumulative occurrence using the updated MSGene lifetime score shows greater performance than that of either FRS30 or FRS30RC early in the life course (Supplementary Fig. 5) (0.69 vs. 0.65, P < $2 \times 10^{-16}$ at age 40) and also based on precision-recall (0.20 vs 0.16 at age 40, P < 0.01). Both metrics exceeded the estimation of lifetime risk using genetics as a predictor alone. In general, when comparing individuals captured by MSGene but not by FRS30RC, MSGene identified more women and individuals at higher genetic risk. With time, these differences became more pronounced (Supplementary Fig. 6).

External validation

We then performed external validation of MSGene in the Framingham Offspring (FOS) cohort, using first measurements to ensure optimal follow-up duration. FOS is a community-based cohort recruited in 1971 with a median 39 years of follow-up [IQR 38–40] and a median age of enrollment of 35 years [IQR 28–44] (Supplementary Fig. 7). Our analyses were on the subset of 2595 individuals who remained after applying exclusion criteria (Supplementary Fig. 7). MSGene again had favorable discrimination (age 40: 0.75 [95% CI 0.69–0.82] vs. 0.73 [95% CI 0.66–0.80]; age 55: 0.63 [95% CI 0.42–0.84] vs. 0.53 [95% CI 0.29–0.76]) and calibration (RMSE 8.4% vs. 11.3%, P < $2 \times 10^{-16}$) when compared to FRS30 (Supplementary Fig. 8).

Discussion

Our study introduces a novel method called MSGene, which aims to assess the risk of developing CAD and other health states over an individual’s lifespan. We demonstrate that dynamic modeling of lifetime risk using longitudinal data and our novel multistate approach can improve both calibration and discrimination when compared with existing gold-standard approaches, such as the PCE and FRS30. Furthermore, incorporating genetic risk and the flexibility to estimate remaining lifetime risk improves the identification of younger individuals at high risk without overestimating risk in older adults, in contrast to existing fixed-window approaches^6,30. Our projected benefit analysis shows that this might result in a large reduction in preventable CAD events if statin therapy is guided by MSGene.

The technique utilizes generalized linear models (GLMs) to compute the transition probabilities between different states (e.g., from a healthy state or risk factor to CAD, death, or intermediate risk) for every age over the observed lifespan. The novelty derives from four features: (1) the provision of unique age-dependent models via GLMs that allow the relationship of each covariate on the outcome to vary freely with time; (2) the calculation of risk conditional on time-dependent states; (3) the assessment of a multistate model via time-dependent Cox modeling; and (4) the unique use of the UKB EHR as a comprehensive longitudinal data resource. The study follows individuals from adulthood through their enrollment in the linked health record. By incorporating age and time dependence, this method provides annual risk estimates over the lifespan, here focused on risk assessment from ages 40 to 80 years.

Over a lifetime horizon, the dynamic change in risk makes accurate lifetime risk estimations challenging^4,7,11. However, leveraging genetics and multistate modeling, MSGene enhances lifetime risk predictions. This effectively identifies individuals previously deemed low-risk. The model’s age-dependent features, producing age-sensitive coefficients, negate the need to rely on fixed parametric interactions between each covariate and time, a prevalent limitation in traditional models⁶. We show that using updated estimates conditional on the dynamic state of an individual improves time-to-event prediction overall.

Through the incorporation of treatment effects, we show that those individuals with the greatest and least expected absolute risk reduction from statin therapy actually have a similar 10-year risk. However, this short-term focus is what current clinical methods rely upon⁷. Presented effects are conservative as statin effects may magnify with duration and on CAD-PRS background^19,36,37,38.

Our approach facilitates accurate event prediction both for undercaptured young individuals and also lower-risk older individuals who might otherwise be included in a fixed-window approach that extends the time horizon: our median global net reclassification when compared with a 10-year approach is 12.2% [IQR 5.5–18.6%] over 40 years. This in part explains the improvement in overall time-dependent performance when incorporated into a time-to-event framework. Using a time-dependent evaluation, the distribution of the first age at which a lifetime threshold is exceeded demonstrates that MSGene optimally identifies at-risk individuals without indiscriminately calling all patients “at-risk”. However, future work is warranted to determine optimal thresholds of lifetime risk to maximize potential benefits among high-risk younger individuals while reducing unnecessary costs and harms to low-risk older individuals.

One of the strengths of our method is access to a significant history of electronic health records that allow us to derive estimates informed by a greater group of patients throughout the life course. Existing scores^26,39 imply that the levels of covariates will stay fixed over the life course or require recalculation, which ignores the information within transitions through the life course. Here, our longitudinal outlook allows for individuals to be followed over a lifetime and quickly estimates what their updated risk trajectory would look like under an alternative profile.

Estimation of remaining lifetime risk is conducted using age-specific predictions informed only by individuals in the at-risk set at a given age, thus making this a true lifetime estimate. In our work, we choose a conservatively estimated age of 80 as the maximum lifetime age given the density of age estimation within our set. This estimation is possible under the assumption that risk trajectory is similar across shifting windows of age at risk but falls apart with strong calendar time trends. Given that our cohort was required to be between 40 and 69 years old in 2006, we reduced the variation in calendar effects^5,40.

When combined with genetic information, an emphasis on dynamically updated lifetime risk projections can uncover latent risks in seemingly healthy individuals. Determining an appropriate lifetime risk threshold is a laudable goal^2,7. Indeed, current guidelines^12,40 note that genetic risk scores can identify individuals at birth with a high propensity to develop disease, but few approaches have coupled this information with realized risk stages dynamically. As age increases, short-term risk increases, and the remaining lifetime risk is reduced, meaning that a metric focusing on short-term risk will preferentially focus on disease in older individuals, thwarting the efforts of true prevention. It is not enough to increase the lifetime threshold to account for younger individuals as proposed in European Society of Cardiology guidelines; additional years add additional uncertainty, and thus, having tools capable of dynamically incorporating new information over the life course in combination with more comprehensive time assessments is critical to moving prevention forward. The current MSGene model is available as a risk assessment tool at https://surbut.shinyapps.io/risk, where users can compute lifetime risk of CAD based on different risk factor states and covariates (Supplementary Fig. 9). Critically, we also provide the code to rapidly refit and compute this model for a new cohort (https://github.com/surbut/MSGene).

In this study, we use a composite of phenotypic codes to define our risk factor states. One of the challenges of developing a lifetime assessment tool surrounds the availability of continuously updated laboratory data. Using EHR data, an unbiased ascertainment of updated biometric variables at uniform intervals is challenging. We added baseline continuous laboratory data from the age of enrollment to our grid search, and this added little to our model (Supplementary Fig. 10).

A second limitation surrounds the heterogeneity of phenotyping. We define hyperlipidemia and hypertension according to validated diagnostic codes⁴¹. However, there exists heterogeneity in the severity and duration of these conditions. The potential benefit of adding additional states must be balanced with the uncertainty imposed and the reduction in sample size caused by dispersion across grades of each condition. Our model is capable of incorporating health history in state specification: this resolves the loss in underlying latent risk that is often erroneously captured in EHR data when an individual’s nominal laboratory value falls secondary to medication.

One of the advantages of heterogeneous data collection is a wealth of available phenotyping modalities: the UKB has access through linkages to routinely available national health systems enhanced by self-report and previous records⁴². Although not all individuals included had GP data, we demonstrate that the age and prevalence of conditions are homogeneous between individuals in the GP subset and otherwise (Supplementary Fig. 1) and that analysis on this subset alone results in similar model discrimination.

Third, the generalizability of our findings may be impacted by study design and sample specificity. The UK Biobank included healthier and less socioeconomically deprived individuals who were predominantly Caucasian individuals living in the United Kingdom⁴³. We provide detailed analysis in our supplement documenting that these results held by self-reported ethnicity (Supplementary Table 3). Furthermore, given that the minimum age for genotyping was 40 years old, we began our inference for risk modeling at age 40, provided they were captured in the EHR before then. Although individuals who reached age 40 prior to enrollment were appropriately at risk for the primary CAD outcome given their capture in the longitudinal EHR, they were protected from death until the time of enrollment, which may affect estimates related to the competing risk of death. For time-dependent evaluation of our prediction, we conservatively left-censored at age of enrollment to eliminate years protected from death and found that the improvements in discrimination over FRS30RC remained unchanged. We note consistent performance in external validation in the FOS cohort, where all death and CAD events occurred exclusively after enrollment. Finally, our dynamic logistic regression approach can readily be adapted to any population with minimal computational resources (https://github.com/surbut/MSGene) and we provide R code to do so.

Leveraging a unique resource of genetic and longitudinal clinical data spanning over 80 years in nearly 500,000 participants of the UK Biobank prospective cohort study, we develop MSGene, a multistate model for dynamic transitions throughout the life course to estimate lifetime risk of CAD. MSGene is well-calibrated and discriminates early and late events both in the UK Biobank and an external validation sample. We anticipate that by providing interpretable and dynamic estimates of CAD lifetime risk, MSGene may inform future therapeutic decisions to enable more efficient and effective CAD prevention throughout the life course.

Methods

Data source

The UK Biobank (UKB) is a prospective UK population-based study that enrolled approximately half a million adults aged 40–69 between 2006 and 2010 designed to investigate the genetic and lifestyle determinants for a wide range of diseases. Participants underwent genome-wide genotyping, with linkage to longitudinal hospitalization, primary care (GP), and self-report data dating back to 1940 (Fig. 2 and Supplementary Figs. 11 and 12)⁴¹. Using the ukbpheno package (version 1.0)⁴¹, we assembled detailed longitudinal data from the various sources documenting events from 1940 until December 2021 for 481,927 individuals after excluding 20,534 who lacked quality control genotyping or risk factor information (Fig. 2 and Supplementary Figs. 11 and 13). At the time of analysis, linkage to the United Kingdom General Practice (GP) Registry was available for a subset of 221,351 individuals. This assembly across data sources generated phenotypes for hypertension (Htn), diabetes mellitus (DM) (type 1 or 2), hyperlipidemia (Hld), or coronary artery disease (CAD) based on validated collections of hospitalization (HESIN), diagnostic, operation, general practice (GP) clinical and script as well as death information⁴¹. We found high overlap between these phenotypes and our own lab’s previously generated HESIN-restricted phenotypes^36,44 (Supplementary Fig. 13). These phenotypes subsequently became the risk factor states in our model. Informed consent was obtained from all participants, and secondary data analyses were approved by the Mass General Brigham Institutional Review Board 2021P002228. Secondary data analysis of UKB was performed under application number 7089.

Because of the longitudinal nature of this cohort, every individual is observed at first encounter with the electronic health record (EHR) in early adulthood (median age 24.2 years). We selected UKB participants free of CAD at age 40 and followed them until the occurrence of CAD, death, or loss to follow-up (median follow-up 29.9 years). We categorize individuals by their condition at entry into our cohort at age 40 years provided they have been observed in the EHR (Fig. 2). We then re-evaluate at each age the risk set as those individuals who (1) have been observed and (2) have not been censored for a given phenotype. We demonstrate the diversity of data sources and the corresponding availability of each data source over time for all considered phenotypes (Supplementary Fig. 12). In general, our model allows for the progression from CAD to death, but we report here the risk of progression to CAD for individuals who are CAD-free at baseline.

Polygenic risk

An additional novelty of our model is the incorporation of the dynamic effects of genetics over time. We use the CAD polygenic risk score (PRS) released through the UKB resource⁴⁵, computed on individuals with adequate genotype information after quality control and after controlling for the principal component axes obtained from the common genotype data in the 1000 Genomes reference data set using standard methods⁴⁵. Data supporting these scores were entirely from external GWAS data (the Standard PRS set) as conducted by Genomics PLC (Oxford, UK) under UKB project 9659⁴⁵. We demonstrate that the distribution of PRS is similar across entry age (Supplementary Fig. 14).

States and competing risk

Our multistate model features ten mutually exclusive states and restricts one-year transitions as follows (Fig. 1), with death as the final absorbing state from which one cannot exit. At any age across the life course, cumulative one-step transitions can be assessed (Fig. 1). Possible transitions are as follows:

1. Healthy to a single risk factor (Hypertension, Hyperlipidemia, Diabetes), CAD or death (healthy to healthy is also allowed but not displayed for clarity);
2. Single risk factor to corresponding double risk factor, CAD or death;
3. Double risk factor to triple risk factor, CAD or death;
4. Triple risk factor to CAD or death;
5. CAD to death.
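
The transition rules listed above can be written down as a small lookup table. The sketch below is illustrative only: the state labels and the function name `build_transitions` are assumptions for this example, and self-transitions (remaining in the same state) are left implicit.

```python
from itertools import combinations

RISK_FACTORS = ("Ht", "Hld", "DM")  # hypertension, hyperlipidemia, diabetes


def state_name(factors):
    """Label a risk factor state, e.g. {'Ht', 'DM'} -> 'DM & Ht'."""
    return " & ".join(sorted(factors)) if factors else "Health"


def build_transitions():
    """Map each state to the set of states reachable in one year (excluding staying put)."""
    transitions = {"CAD": {"Death"}, "Death": set()}  # death is absorbing
    for n in range(len(RISK_FACTORS) + 1):
        for combo in combinations(RISK_FACTORS, n):
            current = set(combo)
            targets = {"CAD", "Death"}
            # A state may gain exactly one additional risk factor per step.
            for rf in RISK_FACTORS:
                if rf not in current:
                    targets.add(state_name(current | {rf}))
            transitions[state_name(current)] = targets
    return transitions


transitions = build_transitions()
print(len(transitions))  # 10 states in total
print(sorted(transitions["Health"]))
```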

Predictions with age as the time scale

Our model inferences are made per year using the individuals who are in a particular risk state at a given age (Fig. 2 and Supplementary Fig. 9). Predictions can, therefore, be made over a requested time interval using the product of age-specific risks for which coefficients were estimated from individuals who were in the at-risk subset during a given period. While enrollment in the UK Biobank required that an individual be alive at age 40 to enroll for genotyping, it did not require that the individual be risk factor-free, and therefore we use this information to assign individuals into risk categories for inference from age 40 onward. We exclude individuals with CAD at baseline from our predictions. After deriving the model construction, we describe the computation and evaluation of state- and age-specific risks below.

Statistical analysis

Let ${\pi }_{{jkia}}$ represent the annual transition probability from state j to state k for individual i during year a. We let the states j and k represent phenotypes ascertained from the electronic health record. “From” states include Health; single risk factor states: Hypertension (Ht), Hyperlipidemia (Hld), and Diabetes Mellitus Type 1 and Type 2 (DM); double risk factor states: Ht & Hld, Ht & DM, and DM & Hld; the triple risk factor state: DM & Hld & Ht; and Coronary Artery Disease (CAD). “To” states include all of the “From” states and Death. For our purposes, we report the progression to CAD or death from any of the “From” states.

For p covariates for a given individual transitioning from state j to k, we refer to the following.

$$\log \frac{{\pi }_{jkia}}{1-{\pi }_{jkia}}={\hat{\beta }}_{jka0}+{\hat{\beta }}_{jka1}{x}_{1}+\ldots +{\hat{\beta }}_{jkap}{x}_{p}$$

(2)

Where ${\hat{\beta }}_{{jkar}}$ represents the coefficient of variable r in the prediction of the transition probability from state j to state k at age a. Taking the inverse logit of the estimate returns the absolute risk for any individual i as a function of the age-specific coefficients and their p covariates, such that the annual risk estimate from state j to state k is given by:

$${\pi }_{jkia}=\frac{\exp \left({\boldsymbol{X}}_{ia}{\boldsymbol{B}}_{jka}\right)}{1+\exp \left({\boldsymbol{X}}_{ia}{\boldsymbol{B}}_{jka}\right)}$$

(3)
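
The inverse logit step can be illustrated numerically. In this sketch the coefficient and covariate values are invented for illustration, and `annual_transition_probability` is a hypothetical helper, not a function from the MSGene package.

```python
import math


def annual_transition_probability(covariates, coefficients):
    """Inverse logit of the linear predictor.

    coefficients[0] is the intercept; the remaining entries pair with `covariates`.
    """
    linear_predictor = coefficients[0] + sum(
        b * x for b, x in zip(coefficients[1:], covariates)
    )
    return 1.0 / (1.0 + math.exp(-linear_predictor))


# Invented example: intercept -5.0, sex coefficient 0.3, PRS coefficient 0.4,
# for a male (x = 1) with a PRS 1.5 standard deviations above the mean.
p = annual_transition_probability([1.0, 1.5], [-5.0, 0.3, 0.4])
print(p)
```

Writing the probability as 1 / (1 + exp(-η)) is algebraically the same as exp(η) / (1 + exp(η)) and avoids overflow for large positive η.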

Here we let ${\boldsymbol{X}}_{ia}$ represent the 1 × P vector of covariates for individual i at a given age a, and ${\boldsymbol{B}}_{jka}$ the P × 1 vector of age- and state-to-state-specific smoothed coefficients. Smoothing is described in subsequent sections.

When estimating Eq. (3), state j represents the “from” state and state k represents the “to” state. To account for censoring, an individual exits the “at risk” group for transition inference when they are lost to follow-up. We discretize age into 1-year intervals and independently estimate the age-dependent state-to-state transitions ${\pi }_{{jkia}}$. We have both fixed and time-varying covariates, and the effect of all covariates on transitions varies with age. Time-invariant covariates include sex and polygenic risk score (PRS). UKB assesses current smoker status at enrollment; subsequent change in smoking status is not sufficiently reliable for our purposes, so we use smoking as a time-invariant covariate for model estimation. Time-dependent covariates include both antihypertensive and statin prescriptions, which are reevaluated each year using prescription data from the UKB⁴⁶. Our final prediction model allows for continuous updates of smoking and medication usage in evaluating age-specific transition probabilities. We use 80% of our data for training and 20% for testing (Fig. 2), yielding a training set of 384,510 samples for model fitting and a testing set of 79,117 unique individuals. Before carrying out the analysis, we selected these covariates for compatibility with the existing Pooled Cohort and Framingham 30-year equations^6,26. As a sensitivity analysis, we also report the results after removing certain covariates (Supplementary Table 1).

Predicted interval risk

Predicted risk of progressing to state k from state j for individual i over any period ranging from age A₁ to A₂ is (Eq. (4)):

$$\text{Interval Risk}=1-\prod_{a={A}_{1}}^{{A}_{2}}\left(1-{\pi }_{jkia}\right)$$

(4)

where a indexes the current age. Accordingly, we compute risk for an individual i of progressing to state k from state j where L is the maximum age of life and a is the currently observed age (Eq. (5)). For our purposes, we choose L = 80 in line with the available data by age in the UK Biobank.

$$\text{Remaining Lifetime Risk}=1-\prod_{a={A}_{1}}^{L}\left(1-{\pi }_{jkia}\right)$$

(5)

The remaining lifetime risk can be modified to account for treatments by applying a constant relative risk reduction to the age-specific transition probabilities in Eq. (4). The interval risk under treatment can then be calculated using the per-year risk reduction RR of progressing to state k from state j over an interval from age A₁ to A₂ as (Eq. (6)):

$$\text{Interval Risk under treatment}=1-\prod_{a={A}_{1}}^{{A}_{2}}\left(1-\left(1-RR\right)\times {\pi }_{jkia}\right)$$

(6)
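
The interval, remaining lifetime, and treated risks share one computation: the complement of the product of annual event-free probabilities. The sketch below uses invented annual probabilities; `interval_risk` is a hypothetical helper name, and passing the probabilities from the current age up to L gives the remaining lifetime risk.

```python
def interval_risk(annual_probs, rr=0.0):
    """Risk over an interval given per-year transition probabilities.

    `rr` is a constant relative risk reduction applied to every year
    (rr=0.0 reproduces the untreated interval risk).
    """
    event_free = 1.0
    for p in annual_probs:
        event_free *= 1.0 - (1.0 - rr) * p
    return 1.0 - event_free


annual = [0.01, 0.02]                      # invented annual risks for two ages
untreated = interval_risk(annual)          # approximately 0.0298
treated = interval_risk(annual, rr=0.20)   # approximately 0.023872
print(round(untreated, 6), round(treated, 6), round(untreated - treated, 6))
```

The difference between the untreated and treated values is the absolute risk reduction over the interval.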

For the purposes of this manuscript, we are interested in state k = CAD. We impute the relative risk reduction of 0.20 from 27 trials of statin therapy²⁹. Within our model, we constrain each individual’s predicted probabilities across states per year to sum to one, such that for each age a, the probability of staying within the given state is the complement of the sum of the transition probabilities to the alternative states k:

$${\pi }_{{jjia}}=1- {\sum }_{k\ne j}{\pi }_{{jkia}}$$

(7)
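
A small numeric sketch of this constraint follows, with invented exit probabilities from the Health state; `stay_probability` is a hypothetical helper.

```python
def stay_probability(exit_probs):
    """Probability of remaining in the current state: one minus the sum of exits."""
    total = sum(exit_probs.values())
    if total > 1.0:
        raise ValueError("exit probabilities sum to more than 1")
    return 1.0 - total


# Invented one-year exit probabilities from Health at a single age.
exits_from_health = {"Ht": 0.02, "Hld": 0.03, "DM": 0.005, "CAD": 0.004, "Death": 0.001}
print(round(stay_probability(exits_from_health), 3))  # 0.94
```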

We choose remaining in state j as the reference category because its probability is mostly above 50%, and the constraint in Eq. (7) guarantees that, for a given age, the probabilities for an individual with a particular covariate profile sum to 1. The alternative of fitting a polytomous regression is computationally much more demanding and gives approximately the same answer. Here we report the product of these conditional one-step transitions from the healthy state as the state of most interest for primary prevention.

Flexible smoothing of regression coefficients across ages

The unadjusted observed coefficients may be inherently noisy, and certain transitions may have low sample sizes. Therefore, we extract the unsmoothed coefficients ${\hat{\beta }}_{{jka}}$ for each age and state transition from the logistic regressions in Eq. (2). To borrow information across ages, we fit a locally estimated polynomial regression (LOESS)^27,28 for each state-to-state transition and each covariate (Supplementary Fig. 15). The LOESS weights are proportional to the product of the inverse variance of each estimated coefficient and the tricube distance function of nearby ages. This weights neighboring ages according to the cube of their distance D from the age in question, where:

$${D}_{i}=\left|\mathrm{age}-{\mathrm{age}}_{i}\right|$$

(8)

We consider the neighboring unsmoothed coefficients as those within an adjusted window length, and if the age in question is within 5 years of the minimum or maximum age, we extend the adjusted window by 5 years.

$$\begin{array}{l}\mathrm{neighbors}=\left\{i:{D}_{i}\le \mathrm{adjusted\_window\_width}\right\}\\ {\mathrm{weights}}_{\mathrm{tricube}}=1-{\left(\frac{{D}_{\mathrm{neighbors}}}{\mathrm{window\_width}}\right)}^{3}\\ \mathrm{weights}\leftarrow {\mathrm{weights}}_{\mathrm{tricube}}\times \frac{1}{{\sigma }^{2}}\end{array}$$

(9)

We then use weighted least squares regression to obtain the weighted sum of neighboring coefficients, where the design matrix X is the N × (degree + 1) matrix of polynomial terms in age for the N neighbors and y is the N × 1 vector of unsmoothed coefficients.

$$\begin{array}{l}WX=\sqrt{\mathrm{weights}}\,X\\ Wy=\sqrt{\mathrm{weights}}\times \mathrm{coefficients}\left[\mathrm{neighbors}\right]\\ \beta \leftarrow {\left(W{X}^{\prime }WX\right)}^{-1}W{X}^{\prime }Wy\\ {\mathrm{smoothed\_coefficient}}_{i}={\beta }_{0}+{\beta }_{1}\,{\mathrm{Age}}_{i}+\ldots +{\beta }_{d}\,{\mathrm{Age}}_{i}^{d}\end{array}$$

(10)
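
The smoothing steps above can be sketched in plain Python as follows. This is an illustrative reading, not the MSGene implementation (which is in R). Three choices are assumptions made for this sketch: the default window of 10 years, dividing distances by the adjusted window width so that the weights stay non-negative, and centering ages on the target age (so the fitted intercept is the smoothed value) for numerical stability. The distance weight follows the expression as written in Eq. (9); the textbook tricube kernel, (1 − u³)³, could be substituted.

```python
def solve_linear(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(m[i][c] * x[c] for c in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def smooth_coefficients(ages, coefs, variances, window_width=10, degree=1):
    """Inverse-variance and distance-weighted local polynomial smoothing."""
    youngest, oldest = min(ages), max(ages)
    smoothed = []
    for age in ages:
        # Extend the window by 5 years near the age extremes.
        width = window_width
        if age - youngest <= 5 or oldest - age <= 5:
            width = window_width + 5
        idx = [i for i, a in enumerate(ages) if abs(a - age) <= width]
        x = [ages[i] - age for i in idx]  # ages centered on the target age
        w = [(1.0 - (abs(xi) / width) ** 3) / variances[i] for xi, i in zip(x, idx)]
        # Weighted normal equations: (X'WX) beta = X'Wy
        p = degree + 1
        xtwx = [[sum(wi * xi ** (r + c) for wi, xi in zip(w, x)) for c in range(p)]
                for r in range(p)]
        xtwy = [sum(wi * xi ** r * coefs[i] for wi, xi, i in zip(w, x, idx))
                for r in range(p)]
        beta = solve_linear(xtwx, xtwy)
        smoothed.append(beta[0])  # polynomial evaluated at the (centered) target age
    return smoothed


ages = list(range(40, 81))
raw = [0.5 + 0.01 * a for a in ages]  # invented, exactly linear coefficients
result = smooth_coefficients(ages, raw, [1.0] * len(ages))
print(max(abs(s - r) for s, r in zip(result, raw)))  # close to zero
```

With coefficients that are exactly linear in age, a local linear fit reproduces them, which is a convenient sanity check before applying the smoother to noisy estimates.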

A vignette showing this process on a sample calculation is shown here https://surbut.github.io/MSGene/vignette.html. Furthermore, flexible window choices and polynomial degrees can be found here: https://surbut.shinyapps.io/testapp/. All analyses were performed with R (version 4.3.1). The underlying MSGene framework which could be adapted for other datasets is available at https://github.com/surbut/MSGene with accompanying vignettes. All plots are available at https://surbut.github.io/multistate2/index.html.

Distance weighting refers to the fact that, for each age point, neighboring data within a dynamic window is incorporated, and expanded at the age extremes to mitigate boundary bias. We then apply a tricube weight function that assigns higher weights to nearer neighbors, tapering to zero beyond the window, capturing the assumption of locality. The regression is stabilized by inverse variance weighting, emphasizing more reliable (less variable) observations, in line with the assumption that points with lower variance provide more accurate information. The design matrix, accounting for polynomial terms up to a specified degree, facilitates flexible modeling of the age-coefficient relationship, without imposing a global functional form. This approach assumes that polynomials can locally approximate the age-related trends in coefficients and that these local fits can be aggregated to represent the global trend. It also presumes the initial coefficient estimates are sensible and variances are correctly specified for accurate weighting.

Standard error of projection

We sample with replacement (“bootstrap”) our training data 1000 times and extract the corresponding means and standard errors of each projection across bootstrapping iterations. We compute the remaining lifetime risk by setting the maximum age to 80, according to the density of observations in our training data. We impute a relative risk reduction (RR) in CAD from statins of 0.20^34,47,48; notably, the RR may be larger for some groups, such as those with elevated CAD-PRS^36,49, and for longer periods of time, and thus this reflects a conservative estimate⁵⁰. We apply this benefit only to individuals not previously on statins.

For the RMSE, we report the standard error of the mean across strata. For proportions, we report the standard error of the sample proportion as $\sqrt{\hat{p}\left(1-\hat{p}\right)/n}$, where $\hat{p}$ represents the sample proportion.
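
The two standard error calculations can be sketched as below. This is a simplified stand-in under stated assumptions: the paper resamples the training data and recomputes each projection, whereas `bootstrap_mean_se` here resamples a plain list of values and applies an arbitrary statistic. Both function names, the seed, and the toy data are hypothetical.

```python
import math
import random
import statistics


def se_proportion(p_hat, n):
    """Standard error of a sample proportion."""
    return math.sqrt(p_hat * (1.0 - p_hat) / n)


def bootstrap_mean_se(values, statistic, n_boot=1000, seed=0):
    """Mean and standard error of `statistic` across bootstrap resamples."""
    rng = random.Random(seed)
    estimates = []
    for _ in range(n_boot):
        resample = rng.choices(values, k=len(values))  # sample with replacement
        estimates.append(statistic(resample))
    return statistics.fmean(estimates), statistics.stdev(estimates)


print(se_proportion(0.5, 100))  # approximately 0.05
toy_risks = [0.10, 0.12, 0.08, 0.15, 0.11]  # invented values
print(bootstrap_mean_se(toy_risks, statistics.fmean))
```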

Performance metrics

For each age, we compare the average predicted score by genomic (<20%, 20–80%, and >80%) and sex strata, and report the root mean squared error (RMSE) as the difference in the average empirical and predicted cumulative incidence rate for each PRS and sex group.

$$\mathrm{RMSE}=\sqrt{\mathrm{mean}\left[{\left({\text{Empirical Incidence}}_{PRS\times sex}-\mathrm{mean}\left({\text{Predicted Rate}}_{PRS\times sex}\right)\right)}^{2}\right]}$$

(11)
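
One reading of this metric, assuming the usual root-mean-square definition taken across PRS-by-sex strata, is sketched below. The stratum labels and values are invented, and `rmse_across_strata` is a hypothetical helper.

```python
import math


def rmse_across_strata(empirical, predicted):
    """Root mean squared difference between the empirical incidence and the
    mean predicted risk, taken over strata.

    empirical: stratum -> observed cumulative incidence
    predicted: stratum -> list of individual predicted risks
    """
    squared_errors = []
    for stratum, observed in empirical.items():
        risks = predicted[stratum]
        mean_predicted = sum(risks) / len(risks)
        squared_errors.append((observed - mean_predicted) ** 2)
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Invented values for two of the PRS-by-sex strata.
empirical = {("PRS<20%", "F"): 0.10, ("PRS>80%", "M"): 0.20}
predicted = {("PRS<20%", "F"): [0.12, 0.14], ("PRS>80%", "M"): [0.15, 0.17]}
print(rmse_across_strata(empirical, predicted))
```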

For the area under the receiver operator curve (AUC-ROC) and precision-recall analysis, we compute the area under each curve using each score as a predictor of cumulative case or control status computed using values for individuals at each year plotted.
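
For intuition, the area under the ROC curve equals the probability that a randomly chosen case has a higher score than a randomly chosen control, with ties counted as one half. The sketch below computes it directly from that definition on invented data; it is not the routine used in the paper, and the quadratic pairwise loop is only suitable for small examples.

```python
def auc_roc(scores, labels):
    """AUC as the fraction of case-control pairs ranked correctly (ties count 0.5)."""
    cases = [s for s, y in zip(scores, labels) if y]
    controls = [s for s, y in zip(scores, labels) if not y]
    if not cases or not controls:
        raise ValueError("need at least one case and one control")
    wins = 0.0
    for c in cases:
        for k in controls:
            if c > k:
                wins += 1.0
            elif c == k:
                wins += 0.5
    return wins / (len(cases) * len(controls))


print(auc_roc([0.1, 0.4, 0.35, 0.8], [0, 0, 1, 1]))  # 0.75
```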

Comparison to 10-year PCE and 30-year Framingham CAD risks

For comparison of 10-year risk, we use the 2018 PCE with baseline covariates (total cholesterol, HDL-cholesterol and systolic blood pressure, current smoking) obtained from UKB enrollment data and update each prediction²⁶ with age, diabetes, and medication use according to available records. This technique was used in the Framingham 30-year risk development to validate new longer window estimates in which age was iteratively updated with all other risk factors at their baseline values²⁶.

For comparison of calibration to 30-year risk, we used the 2009 complete (lipids, non-BMI) Framingham 30-year equation (FRS30) and updated each prediction²⁶ with age, diabetes, and antihypertensive use according to available records, consistent with detailed formulae within the FRS30. Given the differing populations, we recalibrated⁵¹ according to the mean levels of each covariate and baseline hazard in the UKB sample (FRS30RC). For fair comparison, we report our results against FRS30RC (Supplementary Fig. 16). Precision and discrimination analyses were performed as described above. We compute and display the predicted 30-year risk for individuals from ages 40–70 according to this model.

Age-dependent model assessment

To evaluate model concordance, we first use the age- and state-specific predicted risk scores for each individual, which arise from our MSGene system of smoothed logistic regressions, as covariates in a time-dependent Cox model, in which an individual is featured in nonoverlapping intervals with their respective score and event status. In the evaluation stage, we conservatively left censor individuals until enrollment. The way in which the test set is defined reflects the clinical application of the model. The Cox model is never used for estimating the MSGene transition probabilities themselves: it is only used in the model assessment stage to evaluate concordance in a time-dependent manner³³.

We also calculate the minimum age at which an individual would exceed pre-specified risk thresholds for MSGene, PCE, and FRS30. We divide every individual’s observed trajectory into nonoverlapping intervals, indicating when one or all thresholds are achieved and when an event occurs. For example, if an individual is observed from ages 40–70, exceeds one risk score at age 45 and the other at age 52, and has an event at age 68, his period of study will be divided into 4 intervals: the period from age 40 to 44 in which he exceeds the threshold with neither score, the period from 45 to 51 in which he exceeds the threshold only with score 1, the period from 52 to 67 in which he exceeds the threshold with both scores, and the period from 68 to 70 in which he has had an event and exceeded the threshold with both scores. We fit independent time-dependent Cox models³¹ to this expanded data set, and again conservatively left censor until the age of enrollment. For both analyses, we report the concordance index (Harrell’s C) with confidence intervals derived from 1000 bootstrapping iterations⁵². Concordance asks, at each time interval, whether the level of one time-varying score exceeds the level of an alternative score for individuals with an event versus those without³³.
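
The interval construction in the worked example can be sketched as follows. The function `split_follow_up`, its argument names, and the row layout are hypothetical; the flags mirror the description above (status at the start of each interval) rather than any particular survival package's input format.

```python
def split_follow_up(start, end, threshold_ages, event_age=None):
    """Split [start, end] into nonoverlapping intervals at each age where a
    score first exceeds its threshold and at the event age.

    threshold_ages: score name -> first age at which the threshold is exceeded
    """
    change_points = set(threshold_ages.values())
    if event_age is not None:
        change_points.add(event_age)
    cuts = sorted(a for a in change_points if start < a <= end)
    bounds = [start] + cuts
    rows = []
    for i, lo in enumerate(bounds):
        hi = bounds[i + 1] - 1 if i + 1 < len(bounds) else end
        rows.append({
            "start": lo,
            "end": hi,
            "exceeded": {name: age <= lo for name, age in threshold_ages.items()},
            "event": event_age is not None and event_age <= lo,
        })
    return rows


rows = split_follow_up(40, 70, {"score1": 45, "score2": 52}, event_age=68)
for row in rows:
    print(row)
```

For the individual in the example (observed from 40 to 70), this yields the four intervals 40–44, 45–51, 52–67, and 68–70.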

Internal and external model assessment

We internally assess the RMSE (Supplementary Table 1) of models using a finite number of covariates (here sex, polygenic risk, and time-dependent smoking and antihypertensive use) for eight state-specific transitions built on a training set and independently assessed on our testing set. In addition, external validation was performed on individuals in the Framingham Heart Study Offspring cohort (FOS)⁵³ by comparing the model fits estimated in the UKB with 10-year and lifetime risk estimates with individuals in the FOS (Supplementary Fig. 7) for whom genetic data are available. This is a community-based Northeastern United States cohort that was recruited in 1971 (median age [IQR] 33.0 years [27.0, 41.0]) and followed through 2013. Clinical data and incident disease for 3836 participants, and genetic data for a subset (2611), were available through the database of Genotypes and Phenotypes (dbGaP; accession phs000007.v33.p14). We compare these with the PCE and FRS30 (original score, calibrated for this population) estimates calculated at Exam 1 and compute the RMSE and AUC over the 30-year follow-up period. Informed consent was obtained from all participants, and secondary data analyses of dbGaP-based FOS and UKB were approved by the Mass General Brigham Institutional Review Board applications 2016P002395 and 2021P002228.

Calculating net reclassification

For net reclassification indices, at each age of consideration, we defined NRI_event as the net proportion of cases correctly reclassified by MSGene Lifetime (MSGene_LT >10%) as compared to a ten-year PCE:

$${\mathrm{NRI}}_{event}=\frac{\left({MSGene}_{LT} > 10\%\,\cap\,PCE < 5\%\,\cap\,CAD\right)-\left({MSGene}_{LT} < 10\%\,\cap\,PCE > 5\%\,\cap\,CAD\right)}{\text{Develops CAD}}$$

(12)

We defined NRI_non-event as the net proportion of controls correctly reclassified by MSGene lifetime risk <10%:

$${\mathrm{NRI}}_{non\text{-}event}=\frac{\left({MSGene}_{LT} < 10\%\,\cap\,PCE > 5\%\,\cap\,\text{No CAD}\right)-\left({MSGene}_{LT} > 10\%\,\cap\,PCE < 5\%\,\cap\,\text{No CAD}\right)}{\text{Does not develop CAD}}$$

(13)
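
A numeric sketch of the two indices follows, assuming the 10% MSGene lifetime threshold and the 5% PCE threshold apply in both directions and that the intersections denote counts of individuals. The records and the helper name `net_reclassification` are invented for illustration.

```python
def net_reclassification(records, lt_threshold=0.10, pce_threshold=0.05):
    """Return (NRI_event, NRI_non_event).

    records: iterable of (msgene_lifetime_risk, pce_ten_year_risk, develops_cad)
    """
    def counts(group):
        # "up": flagged high by MSGene lifetime risk but low by PCE; "down": the reverse.
        up = sum(1 for lt, pce in group if lt > lt_threshold and pce < pce_threshold)
        down = sum(1 for lt, pce in group if lt < lt_threshold and pce > pce_threshold)
        return up, down

    cases = [(lt, pce) for lt, pce, cad in records if cad]
    controls = [(lt, pce) for lt, pce, cad in records if not cad]
    up_cases, down_cases = counts(cases)
    up_controls, down_controls = counts(controls)
    nri_event = (up_cases - down_cases) / len(cases)
    nri_non_event = (down_controls - up_controls) / len(controls)
    return nri_event, nri_non_event


records = [
    (0.15, 0.03, True),   # case reclassified upward
    (0.20, 0.04, True),   # case reclassified upward
    (0.08, 0.07, True),   # case reclassified downward
    (0.12, 0.08, True),   # case above both thresholds, no reclassification
    (0.05, 0.06, False),  # control reclassified downward
    (0.04, 0.02, False),  # control below both thresholds, no reclassification
]
print(net_reclassification(records))  # (0.25, 0.5)
```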

Marginal calculation

We also allow, for the absorbing states of CAD and death, the possibility of computing the probability of progressing through any path to CAD (“marginal”). The probability of progressing to state k from state j through any path over N years is obtained from the product of N transition matrices T, in which the j,k element of matrix T_ia is the probability of progressing from state j to k at age a for an individual of covariate profile i:

$$\text{Marginal Interval risk}={\left[\prod_{a={A}_{1}}^{{A}_{2}}{T}_{ia}\right]}_{jk}$$

(14)

For absorbing states, the k,k probability is 1. This vignette is available at https://surbut.github.io/MSGene/usingMarginal.html.
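
A toy version of the marginal calculation is sketched below with four states (Health, Ht, CAD, Death) and invented one-year probabilities, treating CAD and Death as absorbing. The helper names are hypothetical; the linked vignette shows the authors' own implementation.

```python
def matmul(a, b):
    """Multiply two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def marginal_interval_risk(matrices, from_idx, to_idx):
    """(from, to) element of the ordered product of yearly transition matrices."""
    product = matrices[0]
    for t in matrices[1:]:
        product = matmul(product, t)
    return product[from_idx][to_idx]


# State order: Health, Ht, CAD, Death (invented one-year probabilities).
T = [
    [0.95, 0.03, 0.01, 0.01],  # Health: stay, Ht, CAD, Death
    [0.00, 0.96, 0.03, 0.01],  # Ht: stay, CAD, Death
    [0.00, 0.00, 1.00, 0.00],  # CAD treated as absorbing
    [0.00, 0.00, 0.00, 1.00],  # Death is absorbing
]
# Two-year risk of CAD from Health through any path, including via Ht.
print(round(marginal_interval_risk([T, T], 0, 2), 4))  # 0.0204
```

Here the two-year risk combines the direct path in year one, the direct path in year two after staying healthy, and the path through hypertension.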

Data availability

All data from the UK Biobank (https://www.ukbiobank.ac.uk/enable-your-research/apply-for-access) are made available to researchers from universities and other institutions with genuine research inquiries following institutional review board and UK Biobank approval. This research was conducted using the UK Biobank resource under application number 7089 and approved by the Mass General Brigham institutional review board. All data from the Framingham Offspring Study were made available from dbGaP (https://www.ncbi.nlm.nih.gov/gap/) to researchers from universities and other institutions with genuine research inquiries following institutional review board approval. Data described here were accessed using accession number phs000007.v32.p13. All data generated during this study are included in this published article and its supplementary information files.

Code availability

The code for running the MSGene model is available at https://github.com/surbut/MSGene. Vignettes for running the analyses are available at https://surbut.github.io/MSGene/vignette.html and https://surbut.github.io/MSGene/usingMarginal.html. Shiny app for calculating interval risk is available at https://surbut.shinyapps.io/risk/. Code to reproduce all plots is available at https://surbut.github.io/multistate2/index.html.

References

Tsao, C. W. et al. Heart disease and stroke statistics—2023 update: a report from the American Heart Association. Circulation https://doi.org/10.1161/CIR.0000000000001123 (2023).
Lloyd-Jones, D. M. et al. Prediction of lifetime risk for cardiovascular disease by risk factor burden at 50 years of age. Circulation 113, 791–798 (2006).
Wilkins, J. T. et al. Data resource profile: the cardiovascular disease lifetime risk pooling project. Int. J. Epidemiol. 44, 1557–1564 (2015).
Bundy, J. D. et al. Cardiovascular health score and lifetime risk of cardiovascular disease. Circulation: Cardiovascular Quality and Outcomes https://doi.org/10.1161/CIRCOUTCOMES.119.006450 (2020).
Grundy, S. M. et al. 2018 AHA/ACC/AACVPR/AAPA/ABC/ACPM/ADA/AGS/ APhA/ASPC/NLA/PCNA guideline on the management of blood cholesterol: executive summary. Circulation 139, e1082–e1143 (2019).
Yadlowsky, S. et al. Clinical implications of revised pooled cohort equations for estimating atherosclerotic cardiovascular disease risk. Ann. Intern. Med. 169, 20–29 (2018).
Navar, A. M. et al. Earlier treatment in adults with high lifetime risk of cardiovascular diseases: what prevention trials are feasible and could change clinical practice? Report of a National Heart, Lung, and Blood Institute (NHLBI) Workshop. Am. J. Preventive Cardiol. 12, 100430 (2022).
Jaspers, N. E. M. et al. Prediction of individualized lifetime benefit from cholesterol lowering, blood pressure lowering, antithrombotic therapy, and smoking cessation in apparently healthy people. Eur. Heart J. 41, 1190–1199 (2020).
Navar, A. M., Fonarow, G. C. & Pencina, M. J. Time to revisit using 10-year risk to guide statin therapy. JAMA Cardiol. 7, 785 (2022).
Zeitouni, M. et al. Performance of guideline recommendations for prevention of myocardial infarction in young adults. J. Am. Coll. Cardiol. 76, 653–664 (2020).
Lloyd-Jones, D. M., Albert, M. A. & Elkind, M. The American Heart Association’s focus on primordial prevention. Circulation 144, e233–e235 (2021).
2021 ESC Guidelines on cardiovascular disease prevention in clinical practice | European Heart Journal | Oxford Academic. https://academic.oup.com/eurheartj/article/42/34/3227/6358713 (2021).
Berry, J. D. et al. Lifetime risks of cardiovascular disease. New Engl. J. Med. 366, 321–329 (2012).
Conner, S. C. et al. A comparison of statistical methods to predict the residual lifetime risk. Eur. J. Epidemiol. 37, 173–194 (2022).
Michos, E. D. & Choi, A. D. Coronary artery disease in young adults. J. Am. Coll. Cardiol. 74, 1879–1882 (2019).
O’Sullivan, J. W. et al. Polygenic risk scores for cardiovascular disease: a scientific statement from the American Heart Association. Circulation 146, e93–e118 (2022).
Inouye, M. et al. Genomic risk prediction of coronary artery disease in 480,000 adults. J. Am. Coll. Cardiol. 72, 1883–1893 (2018).
Sniderman, A. D. & Furberg, C. D. Age as a modifiable risk factor for cardiovascular disease. Lancet 371, 1547–1549 (2008).
Wang, N., Woodward, M., Huffman, M. D. & Rodgers, A. Compounding benefits of cholesterol-lowering therapy for the reduction of major cardiovascular events: systematic review and meta-analysis. Circulation: Cardiovasc. Qual. Outcomes 15, e008552 (2022).
Marma, A. K., Berry, J. D., Ning, H., Persell, S. D. & Lloyd-Jones, D. M. Distribution of 10-year and lifetime predicted risks for cardiovascular disease in US adults: findings from the National Health and Nutrition Examination Survey 2003 to 2006. Circ. Cardiovasc. Qual. Outcomes 3, 8–14 (2010).
Le-Rademacher, J. G., Therneau, T. M. & Ou, F.-S. The utility of multistate models: a flexible framework for time-to-event data. Curr. Epidemiol. Rep. 9, 183–189 (2022).
Wreede, L. C, de, Fiocco, M. & Putter, H. mstate: an R package for the analysis of competing risks and multi-state models. J. Stat. Soft. 38, 1–30 (2011).
Brookmeyer, R. & Abdalla, N. Multistate models and lifetime risk estimation: application to Alzheimer’s disease. Stat. Med. 38, 1558–1565 (2019).
Neumann, J. T. et al. A multistate model of health transitions in older people: a secondary analysis of ASPREE clinical trial data. Lancet Healthy Longev. 3, e89–e97 (2022).
Jack, C. R. et al. Rates of transition between amyloid and neurodegeneration biomarker states and to dementia among non-demented individuals: a population-based cohort study. Lancet Neurol. 15, 56–64 (2016).
Pencina, M. J. et al. Predicting the 30-year risk of cardiovascular disease. Circulation https://doi.org/10.1161/CIRCULATIONAHA.108.816694 (2009).
Cleveland, W. S. Robust locally weighted regression and smoothing scatterplots. J. Am. Stat. Assoc. 74, 829–836 (1979).
Cleveland, W. S. & Devlin, S. J. Locally-weighted regression: an approach to regression analysis by local fitting. J. Am. Stat. Assoc. 83, 596–610 (1988).
Cholesterol Treatment Trialists’ (CTT) Collaborators. et al. The effects of lowering LDL cholesterol with statin therapy in people at low risk of vascular disease: meta-analysis of individual data from 27 randomised trials. Lancet 380, 581–590 (2012).
Rospleszcz, S. et al. Validation of the 30-Year Framingham risk score in a german population-based cohort. Diagnostics 12, 965 (2022).
Therneau, T. M. (n.d.). Using Time dependent covariates and time dependent coefficients in the cox model [PDF file]. Retrieved from https://cran.r-project.org/web/packages/survival/vignettes/timedep.pdf.
Therneau, T. M. & Grambsch, P. M. Modeling Survival Data: Extending the Cox Model. (Springer New York, 2000).
Therneau, T. M. & Watson, D. A. The concordance statistic and the cox model. Technical Report No. 85, Department of Health Sciences Research, Mayo Clinic. (2017).
Cholesterol Treatment Trialists’ (CTT) Collaborators. et al. The effects of lowering LDL cholesterol with statin therapy in people at low risk of vascular disease: meta-analysis of individual data from 27 randomised trials. Lancet 380, 581–590 (2012).
Chou, R. et al. Statin use for the primary prevention of cardiovascular disease in adults: updated evidence report and systematic review for the US Preventive Services Task Force. JAMA 328, 754 (2022).
Natarajan, P. et al. Polygenic risk score identifies subgroup with higher burden of atherosclerosis and greater relative benefit from statin therapy in the primary prevention setting. Circulation 135, 2091–2101 (2017).
Thanassoulis, G., Sniderman, A. D. & Pencina, M. J. A long-term benefit approach vs standard risk-based approaches for statin eligibility in primary prevention. JAMA Cardiol. 3, 1090–1095 (2018).
Pencina, M. J. et al. The expected 30-year benefits of early versus delayed primary prevention of cardiovascular disease by lipid lowering. Circulation 142, 827–837 (2020).
Hippisley-Cox, J. et al. Predicting cardiovascular risk in England and Wales: prospective derivation and validation of QRISK2. BMJ 336, 1475–1482 (2008).
Arnett, D. K. et al. 2019 ACC/AHA guideline on the primary prevention of cardiovascular disease. J. Am. Coll. Cardiol. 74, e177–e232 (2019).
Yeung, M. W., Van Der Harst, P. & Verweij, N. ukbpheno v1.0: an R package for phenotyping health-related outcomes in the UK Biobank. STAR Protoc. 3, 101471 (2022).
Sudlow, C. et al. UK biobank: an open access resource for identifying the causes of a wide range of complex diseases of middle and old age. PLoS Med. 12, e1001779 (2015).
Fry, A. et al. Comparison of sociodemographic and health-related characteristics of UK Biobank participants with those of the general population. Am. J. Epidemiol. 186, 1026–1034 (2017).
Klarin, D. et al. Genetic analysis in UK Biobank links insulin resistance and transendothelial migration pathways to coronary artery disease. Nat. Genet. 49, 1392–1397 (2017).
Thompson, D. J. et al. UK Biobank release and systematic evaluation of optimised polygenic risk scores for 53 diseases and quantitative traits. Preprint at https://doi.org/10.1101/2022.06.16.22276246 (2022).
Darke, P. et al. Curating a longitudinal research resource using linked primary care EHR data—a UK Biobank case study. J. Am. Med. Inform. Assoc. 29, 546–552 (2022).
Ference, B. A. et al. Effect of long-term exposure to lower low-density lipoprotein cholesterol beginning early in life on the risk of coronary heart disease: a Mendelian randomization analysis. J. Am. Coll. Cardiol. 60, 2631–2639 (2012).
Ference, B. A. How to use Mendelian randomization to anticipate the results of randomized trials. Eur. Heart J. 39, 360–362 (2018).
Mega, J. et al. Genetic risk, coronary heart disease events, and the clinical benefit of statin therapy. Lancet 385, 2264–2271 (2015).
Marston, N. A. et al. Predicting benefit from evolocumab therapy in patients with atherosclerotic disease using a genetic risk score: results from the FOURIER trial. Circulation 141, 616–623 (2020).
Bycroft, C. et al. The UK Biobank resource with deep phenotyping and genomic data. Nature 562, 203–209 (2018).
Harrell, F. E. et al. Evaluating the yield of medical tests. JAMA 247, 2543–2546 (1982).
Feinleib, M., Kannel, W. B., Garrison, R. J., McNamara, P. M. & Castelli, W. P. The Framingham Offspring Study. Design and preliminary data. Prev. Med. 4, 518–525 (1975).


Acknowledgements

S.M.U. is supported by T32HG010464 from the National Human Genome Research Institute. S.K. is supported by the NIH (K23HL169839) and the American Heart Association (23CDA1050571). S.J.C. is supported by a grant of the Korea Health Technology R&D Project through the Korea Health Industry Development Institute (KHIDI), funded by the Ministry of Health and Welfare, Republic of Korea (grant no.: HI19C1330). A.C.F. is supported by grants 1K08HL161448 and R01HL164629 from the National Heart, Lung, and Blood Institute. P.T.E. reported receiving grants from the NIH (1RO1HL092577, 1R01HL157635, and 5R01HL139731), the American Heart Association Strategically Focused Research Networks (18SFRN34110082), the European Union (MAESTRIA 965286), Bayer AG (to the Broad Institute), IBM Health (to the Broad Institute), Bristol Myers Squibb (to Massachusetts General Hospital), and Pfizer (to Massachusetts General Hospital). A.G. is supported by National Institutes of Health (NIH) grant nos R01CA227237, R01CA244569, and R21HG010748, and awards from the Claudia Adams Barr Foundation, Louis B. Mayer Foundation, Doris Duke Charitable Foundation, Emerson Collective, and Phi Beta Psi Sorority. P.N. is supported by grants R01HL1427, R01HL148565, and R01HL148050 from the National Heart, Lung, and Blood Institute, and grant 1U01HG011719 from the National Human Genome Research Institute. The authors would like to acknowledge Leslie Gaffney of the MIT-Broad Communications Lab for her invaluable graphics and copyediting advice.

Author information

Authors and Affiliations

Cardiology Division, Massachusetts General Hospital, Harvard Medical School, Boston, MA, USA
Sarah M. Urbut, Kaavya Paruchuri, Akl C. Fahed, Patrick T. Ellinor & Pradeep Natarajan
Broad Institute of MIT and Harvard, Cambridge, MA, USA
Sarah M. Urbut, Shaan Khurshid, So Mi Jemma Cho, Art Schuermans, Kaavya Paruchuri, Akl C. Fahed, Patrick T. Ellinor, Alexander Gusev & Pradeep Natarajan
Center for Genomic Medicine, Department of Medicine, Massachusetts General Hospital, Boston, MA, USA
Sarah M. Urbut, So Mi Jemma Cho, Art Schuermans, Akl C. Fahed & Pradeep Natarajan
Cardiovascular Research Center, Massachusetts General Hospital, Boston, MA, USA
Sarah M. Urbut, Shaan Khurshid, Kaavya Paruchuri & Patrick T. Ellinor
Department of Cardiology, University of Groningen, University Medical Center Groningen, Groningen, The Netherlands
Ming Wai Yeung
Demoulas Center for Cardiac Arrhythmias, Massachusetts General Hospital, Boston, MA, USA
Shaan Khurshid & Patrick T. Ellinor
Integrative Research Center for Cerebrovascular and Cardiovascular Diseases, Yonsei University College of Medicine, Seoul, Republic of Korea
So Mi Jemma Cho
Faculty of Medicine, KU Leuven, Leuven, Belgium
Art Schuermans
Institute for Molecular Medicine Finland (FIMM), HiLIFE, University of Helsinki, Helsinki, Finland
Jakob German
Eric and Wendy Schmidt Center, Broad Institute of MIT and Harvard, Cambridge, MA, USA
Jakob German
Division of Population Sciences, Harvard Medical School and Dana-Farber Cancer Institute, Boston, MA, USA
Kodi Taraszka & Alexander Gusev
Institute for Clinical Research and Health Policy Studies (ICRHPS), Tufts Medical Center, Boston, MA, USA
Ludovic Trinquart
Tufts Clinical and Translational Science Institute (CTSI), Tufts University, Boston, MA, USA
Ludovic Trinquart
Department of Data Science, Dana Farber Cancer Institute, Boston, MA, USA
Giovanni Parmigiani
Department of Biostatistics, Harvard T.H. Chan School of Public Health, Boston, MA, USA
Giovanni Parmigiani

Authors

Sarah M. Urbut
Ming Wai Yeung
Shaan Khurshid
So Mi Jemma Cho
Art Schuermans
Jakob German
Kodi Taraszka
Kaavya Paruchuri
Akl C. Fahed
Patrick T. Ellinor
Ludovic Trinquart
Giovanni Parmigiani
Alexander Gusev
Pradeep Natarajan

Contributions

S.M.U., G.P., A.G., and P.N. conceived and designed the analysis. S.M.U. performed the analysis and wrote the software. S.M.U., G.P., and A.G. developed the statistical method. M.W.Y. provided critical data organization tools and critical confirmatory data analysis. L.T., S.K., and K.T. provided critical statistical insight. S.K., S.M.C., A.S., J.G., K.P., A.C.F., P.T.E., and P.N. provided critical insights into the clinical relevance of the findings. S.M.U. drafted the manuscript and M.W.Y., G.P., A.G., and P.N. provided critical revision. All authors contributed to the manuscript and approved the submitted version.

Corresponding author

Correspondence to Pradeep Natarajan.

Ethics declarations

Competing interests

During the course of the project, M.W.Y. became an employee and stock owner of GSK. A.C.F. is co-founder of Goodpath. P.T.E. reports personal fees from Bayer AG, Novartis, and MyoKardia. G.P. holds equity in Phaeno Biotechnologies, is on the SAB of RealmIDX, and currently consults for Delphi Diagnostics. P.N. reports research grants from Allelica, Apple, Amgen, Boston Scientific, Genentech/Roche, and Novartis; personal fees from Allelica, Apple, AstraZeneca, Blackstone Life Sciences, Foresite Labs, Genentech/Roche, GV, HeartFlow, Magnet Biomedicine, and Novartis; scientific advisory board membership of Esperion Therapeutics, Preciseli, and TenSixteen Bio; being a scientific co-founder of TenSixteen Bio; equity in MyOme, Preciseli, and TenSixteen Bio; and spousal employment at Vertex Pharmaceuticals, all unrelated to the present work. The remaining authors declare no competing interests.

Peer review

Peer review information

Nature Communications thanks Johannes Neumann and the other, anonymous, reviewers for their contribution to the peer review of this work. A peer review file is available.

Additional information

Publisher’s note: Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary information

Supplementary Information

Peer Review File

Description of Additional Supplementary Files

Supplementary Data 1-18

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

About this article

Cite this article

Urbut, S.M., Yeung, M.W., Khurshid, S. et al. MSGene: a multistate model using genetic risk and the electronic health record applied to lifetime risk of coronary artery disease. Nat Commun 15, 4884 (2024). https://doi.org/10.1038/s41467-024-49296-9

Received: 23 February 2024
Accepted: 30 May 2024
Published: 07 June 2024
DOI: https://doi.org/10.1038/s41467-024-49296-9