Introduction

Due to its high incidence, recurrence, mortality, and disability rates, stroke continues to be the second leading cause of death worldwide, according to the Global Burden of Disease Study (GBD) 20191. Intravenous thrombolysis and endovascular thrombectomy are the main treatments for ischemic stroke at the moment2. A limited therapeutic window presents a significant challenge for nations with insufficient or unbalanced medical resources3. Therefore, the best way to lessen the burden of stroke is early prevention.

More than half of ischemic stroke patients are diagnosed with hypertension, making it one of the most significant modifiable risk factors for stroke4. Additionally, there is a causal relationship between homocysteine (Hcy) concentration and stroke5. In China, hyperhomocysteinemia (HHcy) is present in about three-quarters of the hypertensive population. H-type hypertension is defined as hypertension combined with an elevated Hcy level6. Notably, Hcy and hypertension act synergistically to worsen vascular damage7. Together, these findings suggest that patients with H-type hypertension warrant close monitoring for the risk of ischemic stroke.

The Framingham stroke risk profile (FSP) and the CHA2DS2-VASc score are widely used to assess the risk of stroke in the general population and in patients with nonvalvular atrial fibrillation (AF), respectively8,9. However, few validated tools are available for assessing the risk of ischemic stroke in patients with H-type hypertension. In the cardiovascular field, clinicians typically manage high-risk populations using a combination of demographic characteristics, medical history, and selected laboratory indicators. Following this strategy, our study aims to identify groups at high risk of ischemic stroke, to develop and validate a high-performance prediction model for ischemic stroke in patients with H-type hypertension, and to support further risk-stratified management of these patients by clinicians.

Results

Baseline characteristics

According to the inclusion and exclusion criteria, among the 11,631 patients diagnosed with H-type hypertension at Beijing Anzhen Hospital between January 2022 and December 2023, 4,632 suffered an ischemic stroke. A total of 3,305 had medical records at the same hospital between January 2018 and December 2021; of these, 2,340 were assigned to the training set and 965 to the testing set (Supplementary Fig. 1). Another 103 H-type hypertension patients, including 61 patients without ischemic stroke and 42 patients with ischemic stroke, were enrolled as an external validation cohort from the China-Japan Friendship Hospital (Supplementary Fig. 2). Detailed characteristics of patients in the total cohort and in the training and internal validation sets are shown in Table 1 and Supplementary Table 1, respectively. As shown in Table 1, compared with non-stroke patients, patients with ischemic stroke were older, had higher SBP, and were more likely to be smokers and to have a history of cardiovascular disease (all P < 0.05).

Table 1 Baseline clinical and biochemical characteristics of all patients.

Predictor selections

Sixteen variables had P < 0.05 in univariate logistic regression (Supplementary Table 2). After stepwise regression, 13 variables were ultimately retained: age, antihypertensive therapy, hyperlipidemia, atrial fibrillation (AF), diabetes mellitus (DM), BMI, SBP, DBP, hs-CRP, K, Mg, Hcy, and proteinuria. In best subset selection regression, the BIC reached its minimum when the model included eight variables: age, antihypertensive therapy, hyperlipidemia, AF, hs-CRP, K, Mg, and proteinuria (Fig. 1A and B). In LASSO regression, 17 variables were selected at the lambda within 1 standard error (SE) of the minimum (lambda.1se): age, gender, antihypertensive therapy, antiplatelet therapy, hyperlipidemia, AF, DM, coronary artery disease (CAD), BMI, SBP, DBP, hs-CRP, K, Mg, Hcy, proteinuria, and carotid artery stenosis (Fig. 1C and D). Ultimately, the eight variables common to all three methods were used to develop the models: age, antihypertensive therapy, hyperlipidemia, AF, hs-CRP, K, Mg, and proteinuria (Fig. 2).
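The final step of predictor selection is a set intersection, which can be checked directly from the variable lists reported above. The sketch below uses only the names given in the text; the helper `common_features` is a hypothetical name introduced for illustration and is not part of the original analysis code.

```python
# Variables retained by each selection method, as listed in the text.
stepwise = {
    "age", "antihypertensive therapy", "hyperlipidemia", "AF", "DM", "BMI",
    "SBP", "DBP", "hs-CRP", "K", "Mg", "Hcy", "proteinuria",
}
best_subset = {
    "age", "antihypertensive therapy", "hyperlipidemia", "AF",
    "hs-CRP", "K", "Mg", "proteinuria",
}
lasso = {
    "age", "gender", "antihypertensive therapy", "antiplatelet therapy",
    "hyperlipidemia", "AF", "DM", "CAD", "BMI", "SBP", "DBP", "hs-CRP",
    "K", "Mg", "Hcy", "proteinuria", "carotid artery stenosis",
}


def common_features(*selections):
    """Return the variables chosen by every selection method, sorted by name."""
    return sorted(set.intersection(*(set(s) for s in selections)))


final_features = common_features(stepwise, best_subset, lasso)
print(len(final_features), final_features)
```

With these lists, the intersection coincides with the eight variables chosen by best subset selection, because each of them was also retained by stepwise and LASSO regression.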

Fig. 1

Best subset selection regression and LASSO regression for the selection of variables. (A) Variation of BIC with the change of model size. (B) Features included when BIC reaches its minimum value. (C) Coefficient of each variable in LASSO regression with the change of log lambda. First vertical dotted line: lambda value at which the binomial deviance was minimal (lambda min). Second vertical dotted line: largest lambda whose deviance is within 1 SE of the minimum (lambda 1se). (D) Variation of binomial deviance with the change of log lambda in LASSO regression. Vertical dotted lines as in (C).
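The two vertical lines in panels C and D follow the usual cross-validation convention: lambda min minimises the cross-validated deviance, and lambda 1se is the largest lambda whose deviance stays within one standard error of that minimum. A minimal sketch of this rule is given below; the function name and the toy numbers are illustrative assumptions, not values from this study.

```python
def select_lambdas(lambdas, cv_mean, cv_se):
    """Return (lambda_min, lambda_1se) from per-lambda cross-validation summaries.

    lambdas : candidate penalty values
    cv_mean : mean cross-validated deviance for each lambda
    cv_se   : standard error of that mean for each lambda
    """
    i_min = min(range(len(lambdas)), key=lambda i: cv_mean[i])
    cutoff = cv_mean[i_min] + cv_se[i_min]
    # Largest (most penalised, sparsest) lambda still within 1 SE of the minimum.
    lambda_1se = max(lam for lam, err in zip(lambdas, cv_mean) if err <= cutoff)
    return lambdas[i_min], lambda_1se


# Toy cross-validation curve (hypothetical numbers).
lambdas = [0.001, 0.005, 0.01, 0.05, 0.1]
cv_mean = [0.84, 0.80, 0.81, 0.815, 0.95]
cv_se = [0.02, 0.02, 0.02, 0.03, 0.03]
lam_min, lam_1se = select_lambdas(lambdas, cv_mean, cv_se)
print(lam_min, lam_1se)
```

Choosing lambda 1se rather than lambda min trades a small loss in fit for a sparser model, which is why it typically retains fewer variables.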

Fig. 2

The common variables were confirmed by all three feature selection methods: stepwise regression, LASSO regression, and best subset selection regression.

Model development and validation

The eight variables were entered into a multivariable logistic regression model, a linear kernel SVM model, a random forest model, and an XGBoost model. The four models yielded AUCs of 0.905 (95% CI: 0.887–0.924), 0.896 (95% CI: 0.876–0.915), 0.893 (95% CI: 0.872–0.914), and 0.909 (95% CI: 0.890–0.927), respectively, for the risk of ischemic stroke (Fig. 3; Table 2). The difference in AUC between the logistic regression model and the XGBoost model was not significant (DeLong test, P = 0.406). Based on the maximal Youden’s index, the thresholds of the four models were 55%, 46%, 37%, and 43%, respectively. The XGBoost model had the highest sensitivity, 0.825, with a specificity of 0.860.
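Threshold selection by Youden's index can be sketched as follows: for each candidate cut-off, compute sensitivity and specificity and keep the cut-off that maximises J = sensitivity + specificity - 1. The function name and the toy data are assumptions made for illustration; they do not reproduce the study's predictions.

```python
def youden_threshold(y_true, y_prob):
    """Return (threshold, sensitivity, specificity) maximising Youden's J.

    A patient is classified as positive when the predicted probability
    is greater than or equal to the threshold.
    """
    positives = sum(y_true)
    negatives = len(y_true) - positives
    best = None
    for t in sorted(set(y_prob)):
        tp = sum(1 for y, p in zip(y_true, y_prob) if y == 1 and p >= t)
        tn = sum(1 for y, p in zip(y_true, y_prob) if y == 0 and p < t)
        sens = tp / positives
        spec = tn / negatives
        j = sens + spec - 1
        if best is None or j > best[0]:
            best = (j, t, sens, spec)
    return best[1], best[2], best[3]


# Toy outcomes (1 = ischemic stroke) and predicted probabilities.
y_true = [0, 0, 0, 0, 1, 1, 0, 1, 1, 1]
y_prob = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.65, 0.7, 0.8, 0.9]
threshold, sensitivity, specificity = youden_threshold(y_true, y_prob)
print(threshold, sensitivity, specificity)
```

Youden's index weights sensitivity and specificity equally; a screening setting that prioritises sensitivity might reasonably choose a lower threshold.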

Fig. 3

Discrimination of the four models. Receiver operating characteristic (ROC) curves of the four models with AUC and 95% CI.

Table 2 Predictive performance of the four models on the testing set.

Calibration plots were used to assess the calibration of the models. As shown in Fig. 4, all four models were well calibrated. Among them, the predicted probabilities of the logistic regression model and the XGBoost model were closest to the observed probabilities (Fig. 4A and D). All four models yielded a high net benefit, especially the logistic regression model and the XGBoost model (Fig. 5). In summary, the logistic regression model and the XGBoost model exhibited excellent discrimination and calibration. Considering the visualization and scalability of the prediction model, we ultimately chose the logistic regression model as the optimal model. The weight coefficients of the eight variables in the logistic regression model are shown in Supplementary Fig. 3. Serum magnesium, serum potassium, AF, and hyperlipidemia carried the highest weights in the optimal model.

Fig. 4

Calibration of the four models. (A) Logistic model. (B) SVM model. (C) Random forest model. (D) XGBoost model.
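One common way to build the data behind a calibration plot is to group patients by predicted probability and compare the mean prediction with the observed event rate in each group. The sketch below uses equal-width bins; the binning scheme and function name are assumptions, since the text does not state how the plots were constructed.

```python
def calibration_table(y_true, y_prob, n_bins=5):
    """Return one (mean predicted, observed rate, n) tuple per non-empty bin.

    Predictions are grouped into equal-width probability bins on [0, 1].
    A well-calibrated model has mean predicted close to observed rate.
    """
    bins = [[] for _ in range(n_bins)]
    for y, p in zip(y_true, y_prob):
        idx = min(int(p * n_bins), n_bins - 1)  # p == 1.0 falls in the last bin
        bins[idx].append((y, p))
    rows = []
    for members in bins:
        if members:
            n = len(members)
            mean_pred = sum(p for _, p in members) / n
            observed = sum(y for y, _ in members) / n
            rows.append((mean_pred, observed, n))
    return rows


# Toy example with two bins.
rows = calibration_table([0, 0, 1, 1], [0.10, 0.15, 0.85, 0.90], n_bins=2)
for mean_pred, observed, n in rows:
    print(f"predicted={mean_pred:.3f} observed={observed:.3f} n={n}")
```

Plotting observed rate against mean predicted probability for each bin gives the calibration curve; points on the diagonal indicate perfect calibration.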

Fig. 5

Decision-curve analysis of the four models. Decision curves of the four models showing the net benefit of using each model according to different threshold probabilities in the internal validation cohort.
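The net benefit plotted in a decision curve is conventionally computed as TP/n - (FP/n) x pt/(1 - pt), where pt is the threshold probability. The sketch below implements that standard definition along with the "treat all" reference strategy; function names and toy data are illustrative assumptions.

```python
def net_benefit(y_true, y_prob, threshold):
    """Net benefit of treating patients whose predicted risk >= threshold."""
    n = len(y_true)
    tp = sum(1 for y, p in zip(y_true, y_prob) if p >= threshold and y == 1)
    fp = sum(1 for y, p in zip(y_true, y_prob) if p >= threshold and y == 0)
    # False positives are weighted by the odds of the threshold probability.
    return tp / n - (fp / n) * threshold / (1 - threshold)


def net_benefit_treat_all(y_true, threshold):
    """Net benefit of the reference strategy that treats every patient."""
    prevalence = sum(y_true) / len(y_true)
    return prevalence - (1 - prevalence) * threshold / (1 - threshold)


# Toy data: the "treat none" strategy always has a net benefit of 0.
y_true = [1, 1, 0, 0]
y_prob = [0.9, 0.6, 0.7, 0.2]
nb_model = net_benefit(y_true, y_prob, 0.5)
nb_all = net_benefit_treat_all(y_true, 0.5)
print(nb_model, nb_all)
```

A model is clinically useful at a given threshold when its net benefit exceeds both the "treat all" and "treat none" strategies.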

In the external cohort, the logistic regression model achieved an AUC of 0.872 (95% CI: 0.805–0.939), showing good discrimination (Supplementary Fig. 4 and Supplementary Table 3). The logistic regression model was also well calibrated and had a high net benefit in the external cohort (Supplementary Figs. 5 and 6).

Model visualization

The eight variables, namely age (A), antihypertensive therapy (A), biomarkers (B) (serum magnesium, serum potassium, proteinuria, and hypersensitive C-reactive protein), and comorbidities (C) (atrial fibrillation and hyperlipidemia), were fitted in a logistic regression model to predict the risk of ischemic stroke in H-type hypertension patients. The model was termed the A2BC ischemic stroke model and is presented as a nomogram (Fig. 6). Each variable is listed on its own axis, and the cumulative score is mapped to a predicted risk.

Fig. 6

Nomogram of the optimal model for the probability of ischemic stroke in patients with H-type hypertension. The value of each clinical indicator is located on its variable axis, and a vertical line is drawn from that value to the points scale at the top to obtain the score for that predictor. The sum of the scores across all variables corresponds to the probability of ischemic stroke in patients with H-type hypertension.
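The arithmetic behind a logistic-regression nomogram can be sketched as follows: each predictor's contribution to the linear predictor is rescaled so that the predictor with the widest contribution range spans 0 to 100 points, and the total points map back to a probability through the logistic function. The coefficients, intercept, and value ranges below are hypothetical placeholders chosen for illustration; they are NOT the fitted A2BC model.

```python
import math

# Hypothetical values for illustration only (not the fitted A2BC model).
INTERCEPT = -2.0
PREDICTORS = {
    # name: (beta, minimum value, maximum value)
    "age": (0.04, 30.0, 90.0),
    "atrial_fibrillation": (1.2, 0.0, 1.0),
    "serum_magnesium": (-3.0, 0.5, 1.2),
}

# The predictor with the largest contribution range defines the 100-point scale.
MAX_RANGE = max(abs(b) * (hi - lo) for b, lo, hi in PREDICTORS.values())


def baseline_value(beta, lo, hi):
    """Value of a predictor that contributes least to the risk (0 points)."""
    return lo if beta >= 0 else hi


def points(name, value):
    """Nomogram points for one predictor value."""
    beta, lo, hi = PREDICTORS[name]
    return 100.0 * beta * (value - baseline_value(beta, lo, hi)) / MAX_RANGE


def risk_from_points(total_points):
    """Map total points back to a predicted probability."""
    base_lp = INTERCEPT + sum(
        b * baseline_value(b, lo, hi) for b, lo, hi in PREDICTORS.values()
    )
    lp = base_lp + total_points * MAX_RANGE / 100.0
    return 1.0 / (1.0 + math.exp(-lp))


def risk_direct(patient):
    """Predicted probability computed directly from the logistic model."""
    lp = INTERCEPT + sum(PREDICTORS[k][0] * v for k, v in patient.items())
    return 1.0 / (1.0 + math.exp(-lp))


patient = {"age": 70.0, "atrial_fibrillation": 1.0, "serum_magnesium": 0.8}
total = sum(points(name, value) for name, value in patient.items())
print(round(total, 1), round(risk_from_points(total), 3))
```

Reading the nomogram and evaluating the regression equation are equivalent by construction, so `risk_from_points(total)` matches `risk_direct(patient)`.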

Discussion

Based on two independent retrospective cohorts with a large sample size, our study developed and internally and externally validated a model to predict the risk of ischemic stroke. This model included 8 variables: age (A), antihypertensive therapy (A), biomarkers (B) (serum magnesium, serum potassium, proteinuria, and hypersensitive C-reactive protein), and comorbidities (C) (atrial fibrillation and hyperlipidemia), and we termed it the A2BC ischemic stroke model. The A2BC ischemic stroke model showed strong discrimination and calibration for the risk of ischemic stroke, with similar findings on external validation.

At present, the most effective treatment for acute ischemic stroke (AIS) is reperfusion therapy within the therapeutic time window, including intravenous thrombolysis (IVT) and endovascular therapy (EVT), but about three-quarters of patients present more than 4.5 h after stroke onset or with an unknown time of onset10. In addition, IVT has numerous contraindications that must be carefully considered11. Between 2019 and 2020, the rates of IVT and EVT in China were 5.64% and 1.45%, respectively12. Recurrent ischemic stroke is another challenge: even with improved secondary prevention, recurrence rates of ischemic stroke seem unchanged over time13. For these reasons, primary prevention in high-risk populations may be another effective way to reduce the burden of ischemic stroke. However, there is an unmet need for accurate and validated models for estimating the risk of ischemic stroke.

Some guidelines propose the FSP as a reliable tool for estimating 10-year stroke risk14. Despite its widespread application, the validity of the FSP has not been sufficiently studied in populations of different age ranges or ethnicities. A prospective study showed that the FSP overestimates the risk of stroke in the Chinese population15. Likewise, both the CV risk calculator and the Stroke Riskometer need to be validated and adapted for the Chinese population16,17. Moreover, although most risk factors have an independent effect on ischemic stroke, interactions may exist between these factors when predicting overall risk. A combined analysis of hypertension and Hcy showed that they act additively to increase the risk of stroke22. Therefore, it is necessary to establish a prediction model for ischemic stroke specific to the H-type hypertension subset. One of the purposes of risk assessment is to guide an appropriate primary prevention program. Additional folic acid significantly reduces the risk of first stroke in patients with hypertension, compared with antihypertensive therapy alone18.

Serum magnesium, an inorganic ion, is given the most weight in our model, and serum potassium also carries substantial weight. Magnesium and potassium are essential minerals. Magnesium helps to prevent ischemic stroke: through various mechanisms, it lowers blood pressure more effectively than potassium19. Inflammation, endothelial dysfunction, and platelet dysfunction have all been linked to low magnesium levels20. Stroke risk was 2.5 times higher for diuretic users with low serum potassium than for those with high serum potassium21. Among adults receiving antihypertensive therapy, hypokalemia is independently associated with an increased risk of ischemic stroke, regardless of diuretic use.

Dyslipidemia is an independent risk factor for stroke22. The risk of ischemic stroke can be decreased by lowering atherogenic lipoproteins23,24. New lipid-lowering medications have made it possible to lower LDL-C to extremely low levels, but doing so will raise the risk of hemorrhagic stroke25. AF is a significant risk factor for stroke. Hypertrophic hypertensive cardiac disease complicated by AF is the most frequent cardiac source of emboli in cardioembolic stroke26. A thrombus from the left atrial (LA) cavity, particularly the left atrial appendage (LAA), is primarily responsible for ischemic stroke associated with AF27. Plasma Hcy levels were found to be associated with LA/LAA thrombus and could be used to predict the risk of LA/LAA thrombus in non-valvular AF patients with low CHA2DS2-VASc scores28.

Numerous studies have demonstrated that hypertension can cause cerebrovascular disease through a variety of mechanisms, including adaptation of the autoregulation of cerebral blood flow (CBF) to hypertension29, endothelial dysfunction, reduced nitric oxide (NO)30, and elevated levels of angiotensin II (Ang II) leading to cerebral artery hypertrophy and inward remodeling31. Fortunately, the negative effects of hypertension can be offset by a variety of antihypertensive medications. Use of a long-acting dihydropyridine Ca2+ channel blocker contributes to the normalization of CBF autoregulation32. Other types of antihypertensive drugs have a similar effect33. Additionally, combining antihypertensive medications in suboptimal doses can establish tolerance and effectively treat the remodeling of cerebral arteries brought on by hypertension34.

Proteinuria, another factor in our model, is a common sign of renal damage and has a particularly strong association with stroke35,36. Researchers have proposed the term “cerebro-renal interaction” because kidney disease and cerebrovascular disease are closely related37. This relationship may be explained by the idea that an increased urinary protein excretion rate reflects significant vascular damage38. According to epidemiological studies, people over 65 account for the majority of stroke cases, and the risk rises with age39,40. Changes in circulating factors in the systemic environment, cellular senescence, and hypertension during human aging can all increase the risk of stroke41.

The A2BC ischemic stroke model consists of 8 general variables that are simple to collect in clinical practice. It requires no specialized examination equipment or technical personnel, which allows community hospitals to conduct rapid screening and saves medical resources. Our study also had some limitations. First, the differences between ischemic stroke subtypes were not considered. Hypertension is the primary cardiovascular risk factor for cerebral infarction only for lacunar strokes and atherothrombotic infarctions, that is, ischemic stroke associated with small and large artery disease42. Building on the findings of this study, future research could confirm these results in the different ischemic stroke subtypes, particularly lacunar infarcts, for which hypertension and diabetes are the main risk factors43. Second, certain H-type hypertension-specific risk factors for ischemic stroke, such as MTHFR polymorphism, were not studied. However, few community hospitals in China offer MTHFR polymorphism testing. Third, because our research was retrospective and the prediction model produced by the machine learning algorithm reflects only statistical association, no causal relationship can be inferred. Therefore, even though we have demonstrated the model’s good performance in an external validation cohort, more clinical data and prospective cohorts are required to improve the model’s performance in specific clinical application scenarios. Fourth, we excluded patients with secondary hypertension, as the causes of secondary hypertension are diverse and the predictive factors are complex. Therefore, our model cannot be used for secondary hypertension patients with elevated Hcy levels. Finally, the clinical efficacy of Hcy-lowering therapy in primary and secondary ischemic stroke prevention is still controversial. Identifying individuals who respond effectively to Hcy-lowering treatment and implementing precision therapy (including optimized dosage, duration, combination with other drugs, and consideration of genotype) are crucial for reducing stroke risk in high-risk populations. Achieving these goals relies on an accurate and reliable risk prediction model.

Materials and methods

Study population

This retrospective cohort study consecutively included inpatients diagnosed with H-type hypertension, whether or not they had suffered a first ischemic stroke, at Beijing Anzhen Hospital, Capital Medical University from January 2022 to December 2023. Patients with secondary hypertension or a history of ischemic stroke were excluded, as were patients with no data at Beijing Anzhen Hospital from January 2018 to December 2021. In addition, we extracted an external validation cohort from the China-Japan Friendship Hospital between January 2023 and June 2023.

Hypertension was diagnosed according to the International Society of Hypertension recommendations44. Patients with an office or clinic systolic blood pressure (SBP) ≥ 140 mmHg and/or diastolic blood pressure (DBP) ≥ 90 mmHg on repeated examination were considered to have hypertension. In addition, according to the guideline, patients with a history of hypertension who were currently using antihypertensive drugs were still diagnosed with hypertension even if their blood pressure was < 140/90 mmHg. Essential hypertension was identified when secondary hypertension had been excluded. Patients with hypertension and serum Hcy concentrations ≥ 10 µmol/L were identified as having H-type hypertension45. Ischemic stroke was confirmed via computed tomography (CT) or brain magnetic resonance imaging (MRI) combined with clinical symptoms and signs. Our study was conducted according to the Declaration of Helsinki and was approved by the hospital’s ethical review board (Beijing Anzhen Hospital, Capital Medical University, Beijing, China). The need to obtain informed consent was waived by the ethical review board of Beijing Anzhen Hospital, Capital Medical University.
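The diagnostic criteria for H-type hypertension described above can be written as a simple rule. The sketch below encodes only the thresholds stated in the text; the function and parameter names are hypothetical, and the rule is no substitute for the clinical requirement of repeated blood pressure measurement.

```python
def has_hypertension(sbp, dbp, history=False, on_antihypertensives=False):
    """Hypertension per the stated criteria (blood pressure in mmHg).

    Either SBP >= 140 and/or DBP >= 90, or a history of hypertension with
    current antihypertensive treatment even if pressure is < 140/90.
    """
    if sbp >= 140 or dbp >= 90:
        return True
    return history and on_antihypertensives


def is_h_type_hypertension(sbp, dbp, hcy_umol_l, history=False,
                           on_antihypertensives=False, secondary=False):
    """H-type hypertension: essential hypertension with serum Hcy >= 10 umol/L."""
    if secondary:
        return False  # secondary hypertension was excluded in this study
    hypertensive = has_hypertension(sbp, dbp, history, on_antihypertensives)
    return hypertensive and hcy_umol_l >= 10


# Controlled blood pressure on treatment still counts as hypertension.
print(is_h_type_hypertension(128, 82, 16.5, history=True, on_antihypertensives=True))
print(is_h_type_hypertension(150, 95, 8.0))
```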

Data collection

Based on a literature review, 23 candidate variables associated with stroke were included in the study. Clinical characteristics, including age, gender, smoking, drinking, SBP, DBP, body mass index (BMI), medical history, medication history, and carotid ultrasonography results, were collected from the hospital information system (HIS). Hypersensitive C-reactive protein (hs-CRP), serum potassium (K), serum sodium (Na), serum magnesium (Mg), total bilirubin (TBil), and direct bilirubin (DBil) were measured with an automatic biochemical analyzer (Roche Cobas C702). Proteinuria was analyzed with an automatic urine analyzer (Mindray UA-5800). All laboratory values were the first results obtained during hospitalization and were collected from the laboratory information system (LIS).

Model development, validation, and statistical analysis

Our study is reported according to the TRIPOD (Transparent Reporting of a Multivariable Prediction Model for Individual Prognosis or Diagnosis) statement (TRIPOD checklist). All statistical analyses were conducted with R software for macOS (Version 4.2.1, https://www.r-project.org/). Continuous variables were described as mean ± standard deviation (SD), or as median (25th–75th percentile) for skewed data, and compared using the Mann-Whitney test. Categorical variables were described as frequencies and percentages and compared using χ2 tests. Variables with > 20% missing values were excluded. Multiple imputation was applied to variables with < 20% missing values using the R package mice, and one imputed dataset was finally used. Hcy was transformed into a categorical variable with two cut-points at 15 µmol/L and 30 µmol/L; the remaining variables were left unchanged.
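The two preprocessing rules described above, the 20% missingness filter and the three-level Hcy categorization, can be illustrated with a short sketch. The original analysis was performed in R; the Python below is only an illustration. Two details are assumptions not stated in the text: a variable with exactly 20% missing values is kept, and each cut-point belongs to the higher category.

```python
from bisect import bisect_right

HCY_CUTPOINTS = (15.0, 30.0)                   # umol/L, as stated in the text
HCY_LABELS = ("<15", "15 to <30", ">=30")     # boundary handling is an assumption


def categorize_hcy(hcy_umol_l):
    """Map a serum Hcy concentration to one of three categories."""
    return HCY_LABELS[bisect_right(HCY_CUTPOINTS, hcy_umol_l)]


def drop_sparse_variables(columns, max_missing=0.20):
    """Keep variables whose fraction of missing values (None) is <= max_missing.

    columns: dict mapping a variable name to its list of values.
    """
    kept = {}
    for name, values in columns.items():
        missing_fraction = sum(v is None for v in values) / len(values)
        if missing_fraction <= max_missing:
            kept[name] = values
    return kept


# Toy data: "x" has 10% missing values and is kept; "y" has 30% and is dropped.
data = {
    "x": [1.0] * 9 + [None],
    "y": [1.0] * 7 + [None] * 3,
}
kept = drop_sparse_variables(data)
print(sorted(kept), categorize_hcy(12.0), categorize_hcy(22.4), categorize_hcy(41.0))
```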

Patients from Beijing Anzhen Hospital, Capital Medical University were randomly divided into a training set and a testing set in a 70:30 ratio. Logistic regression, LASSO regression, and best subset selection were each applied to select features in the training cohort. In logistic regression, univariate regression was first performed for all variables, and statistically significant variables (P < 0.05) were then entered into a bidirectional stepwise multivariate logistic regression analysis. In LASSO regression, the beta coefficients of variables that are not strongly associated with the outcome are shrunk to zero, which removes these variables from the model. We applied 10-fold cross-validation to obtain a suitable lambda (i.e., lambda.1se) and selected the variables with non-zero coefficients. In best subset selection regression, the variables in the model with the minimum Bayesian information criterion (BIC) were selected. The models were fitted with the common features confirmed by all three methods.
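The BIC criterion used in best subset selection penalises model size through BIC = k ln(n) - 2 ln(L), where k is the number of estimated parameters, n the number of observations, and L the maximised likelihood. The sketch below picks the model size with the smallest BIC; the log-likelihood values are hypothetical, and only the training-set size (2,340) comes from the text.

```python
import math


def bic(log_likelihood, n_params, n_obs):
    """Bayesian information criterion: k * ln(n) - 2 * ln(L)."""
    return n_params * math.log(n_obs) - 2.0 * log_likelihood


def best_size_by_bic(log_likelihoods, n_obs):
    """Return (best size, {size: BIC}).

    log_likelihoods maps the number of predictors in the best model of that
    size to its maximised log-likelihood; one parameter is added for the
    intercept.
    """
    scores = {
        size: bic(ll, size + 1, n_obs) for size, ll in log_likelihoods.items()
    }
    return min(scores, key=scores.get), scores


# Hypothetical log-likelihoods for the best model of each size.
candidate_fits = {6: -905.0, 8: -880.0, 10: -876.0, 13: -872.0}
best_size, scores = best_size_by_bic(candidate_fits, n_obs=2340)
print(best_size, {k: round(v, 1) for k, v in scores.items()})
```

Because the penalty grows with ln(n), BIC tends to favour smaller models than criteria such as AIC when the sample is large.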

Prediction models were derived using four machine learning algorithms: logistic regression, linear kernel support vector machine (SVM), random forest, and eXtreme gradient boosting (XGBoost). The AUC, calibration plots, and decision curve analysis (DCA) were used to assess the discrimination, calibration, and net benefit of the models. All P values are two-sided, with values < 0.05 considered significant.
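The AUC used to measure discrimination equals the probability that a randomly chosen patient with the outcome receives a higher predicted score than a randomly chosen patient without it, with ties counted as one half. A minimal sketch of this pairwise definition is shown below; it is meant to clarify the metric and is not the routine used in the original R analysis.

```python
def auc(y_true, y_score):
    """Area under the ROC curve via pairwise comparison of scores.

    Counts the fraction of (positive, negative) pairs in which the positive
    case has the higher score; tied pairs contribute 0.5.
    """
    pos = [s for y, s in zip(y_true, y_score) if y == 1]
    neg = [s for y, s in zip(y_true, y_score) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Toy example: 3 of the 4 positive-negative pairs are ranked correctly.
example_auc = auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
print(example_auc)
```

The pairwise form is quadratic in the sample size, so rank-based implementations are preferable for large cohorts; the result is the same.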