Introduction

Cerebral ischemic stroke (IS) is a significant global health concern, and millions of people suffer from its worldwide each year. The survivors have varying degrees of sequelae, seriously affecting the quality of life of patients and their families1,2,3. Due to occlusion of intracranial blood vessels during ischemic stroke, brain cells have soon died of apoptosis, necrosis and/or pyroptosis. Other pathophysiological alterations elicited by IS include impairment of neurovascular unit, free radical accumulation, secondary inflammatory responses, ion channel alterations, mitochondrial damage, autophagy, and others, all of which contribute the final cell death4,5,6,7. Current effective treatment for IS includes thrombolysis or thrombectomy, which only applicable to less than 5% of patients due to their short intervention window. Thus, it has become critically important to understand detailed molecular mechanisms involved in ischemic stroke and identify potential therapeutic targets for drug development.

Pyroptosis is one of the major types of programmed cell death initiated by inflammation8. Formation of an inflammasome, composed of Nod-like receptor protein-3 (NLRP3), pro-caspase-1, and apoptosis-related speckle protein (ASC), is a critical step for pyroptosis. The inflammasome activates caspase -1, which cleaves IL-1β, IL-18, and GasderminD (GSDMD)9,10,11,12,13. Cleaved IL-1β and IL-18 further exacerbate inflammatory responses, while the N-terminal fragment of GSDMD translocates to the plasma membrane to induce pore formation and pyroptotic cell death. Alternatively, pyroptosis could also be induced by the non-classical cysteine proteinase-4, 5 and 11-dependent pathway14 or granzyme-mediated pathways and caspase-3-mediated pathways15. Pyroptosis plays an essential role in mediating cell death and clinical prognosis of various neurological disorders such as central nervous system infections, autoimmune disorders, neurodegenerative diseases, traumatic brain and spinal cord injuries, cerebral hemorrhagic and ischemic stroke16,17,18,19. Recently, an increasing number of studies have been conducted to search for pyroptosis-related biomarkers as potential indicators for predicting the development of disease and therapeutic target of drug discovery20.

Several gene expression profiling datasets have been published in the gene expression omnibus (GEO) (www.ncbi.nlm.nih.gov/geo/). These include GSE22255, which consists of 40 samples of gene expression profiles of peripheral blood mononuclear cells from IS patients and age-matched healthy adults; GSE140275 that includes gene expression profiles from blood samples of 3 acute ischemic stroke (AIS) patients and 3 healthy controls; and GSE16561 dataset that consists of 63 whole blood samples from 39 patients with IS and 24 healthy control subjects. In addition, information about pyroptosis-related genes (PRGs) has been made available in the GeneCards database (www.genecards.org/). These datasets have laid a foundation for detailed machine learning analysis to gain insight into genes that might be involved in IS and pyroptosis.

Using machine learning to search for differentially expressed genes in IS patient data set has drawn great interest. Qin et al.21 investigated the role of anoikis-related genes (ARGs) in ischemic stroke (IS), which employed machine learning to classify IS samples, identified key ARGs, and constructed diagnostic models. Ren et al.22 discovered inflammation-related biomarkers for ischemic stroke (IS) using machine learning. They identified differentially expressed genes (DEGs) between healthy controls and IS patients. According to Wang et al.23, immunosuppression-related genes have been identified in ischemic stroke (IS). It identified crosstalk genes linking immunosuppression and IS. Wei et al. has studied microRNA Expression Profile in Ischemic Stroke24. It constructed an IS diagnostic signature using logistic regression, identifying 14 differentially expressed miRNAs. Up to date, there has no report to search for pyroptosis-related differentially expressed genes (PRDEGs) in IS dataset.

Martha et al.25 used machine learning to predict stroke outcomes in acute ischemic stroke patients by analyzing inflammatory gene expression. It identified key genes and subject factors associated with infarct and edema volumes. These genes indicated chemoattraction and proliferation of immune cells, influencing stroke outcomes. The findings suggested that machine learning can help develop prognostic biomarkers for stroke outcomes and understand acute gene expression changes. O’Connell et al.26 used machine learning to validate a stroke-associated gene expression pattern using an independent patient population. It confirmed the robustness and temporal stability of this gene expression pattern as a potential diagnostic marker for stroke.

The aim of this study is to analyze the gene expression pattern of PRDEGs based on an integrated IS dataset using machine-learning methods in order to build predictive models of ischemic stroke samples and identify critically important PRDEGs involved in ischemic stroke. Classification models have been built to distinguish the healthy control samples and the IS patient samples using three machine learning methods (namely, LASSO, Random Forest and Support Vector Machine). As a result, a set of 10 PRDEGs was identified by all three machine learning models. Further analysis of the ten critical genes identified above has revealed the importance of these genes in IS development as well as potential therapeutic targets for IS.

Results

Overall workflow

In this study, we conducted a bioinformatics analysis to explore the expression characteristics of pyroptosis-related genes between normal and ischemic stroke (IS) tissues. Initially, we collected pyroptosis-related genes from literature and databases and combined them with RNA expression data of IS patients from specific GEO datasets. Within these datasets, we identified pyroptosis-related differentially expressed genes (Pyroptosis-Related DEGs).

Next, utilizing a LASSO regression model, we pinpointed a set of significant gene markers. These markers facilitated consensus clustering analysis to determine pyroptosis-associated IS subtypes. A substantial number of DEGs were identified between the different IS subtypes, from which we extracted core hub genes.

Throughout this process, we also conducted analyses of gene expression levels, correlation analysis, and gene ___location on chromosomes. These analyses helped us to understand the biological backdrop of the DEGs and their potential roles in disease progression. Further validation of our findings was ensured through additional validation of the GEO dataset and experimental verification. Moreover, we applied Gene Ontology (GO) and Kyoto Encyclopedia of Genes and Genomes (KEGG) analysis tools to investigate the specific functions of these genes in the pyroptosis pathways and the biological processes involved. Protein–Protein Interaction (PPI) analysis and immune infiltration analysis were also performed to deepen our understanding of the role of DEGs in IS.

Finally, based on the identified hub genes, we screened for potential drugs or compounds for the treatment of ischemic stroke within the Drug Gene Interaction Database (DGIdb). This series of exhaustive analytical steps provided not only new biomarkers for the diagnosis of ischemic stroke but also potential drug targets for the disease’s treatment, thereby hoping to advance the development of therapeutic strategies for IS. The detailed schematic of this flowchart is illustrated in Fig. 1 of the article.

Fig. 1
figure 1

Overall workflow. IS ischemic stroke, DEGs differentially expressed genes, PRDEGs pyroptosis related differentially expressed genes, SVM support vector machine, LASSO least absolute shrinkage and selection operator.

The technical flowchart is shown in Fig. 1.

Ischemic stroke related DEGs

The IS dataset (Fig. 2A,B) was obtained by first removing batch effects from the GSE2225, GSE140275 and GSE16561 IS datasets using the R package sva and then normalizing the combined datasets using the limma package: the expression data in the IS dataset were normalized to remove the batch effect, and the expression matrices were standardized to be consistent between samples. Principal Component Analysis (PCA) was performed on the expression matrix of the dataset to verify the effect of removing the batch effect (Fig. 2C,D), and the results showed that the batch effect in the IS dataset was largely eliminated after batch removal.

Fig. 2
figure 2

IS dataset de-batching. (A,B) Boxplot plots of IS dataset before (A) and after (B) the batch effect treatment. (C,D) PCA plots of IS dataset before (C) and after (D) the batch effect treatment. IS ischemic stroke, PCA principal component analysis.

Differentially expressed genes (DEGs) were obtained from the IS dataset using the limma package to analyze the differences in gene expression between different subgroups of the IS dataset compared to the control group. The results are as follows: a total of 12,733 DEGs were obtained from the IS dataset, of which 1422 met the threshold of |logFC|> 0 and P.adj < 0.05. Under this threshold, the number of genes that were highly expressed in the IS group (low in the Control group with positive logFC) was 376; the number of genes with low expression in the IS group (high expression in the Control group with negative logFC) was 1046, and the results of the differential analysis of the IS dataset were plotted on a volcano plot (Fig. 3A).

Fig. 3
figure 3

Analysis of the IS dataset for Ischemic Stroke Related DEGs. (A) Volcano plots for analysis of DEGs in the IS group and control group. (B) Venn plots for DEGs and PRGs in the IS dataset. (C) Heat map of complex values for PRDEGs. (D) Heat map of chromosomal localization for PRDEGs. Chromosomal localization map of PRDEGs. IS ischemic stroke, DEGs differentially expressed genes, PRGs pyroptosis-related genes, PRDEGs pyroptosis-related differentially expressed genes.

The intersection of all the DEGs and the 372 PRGs obtained from the IS dataset was taken and a total of 32 PRDEGs were obtained and plotted in a Venn diagram (Fig. 3B), namely: AHSA1, APAF1, ATF6, ATG7, BIRC2, BNIP3L, CEBPB, CTSG, DNMT1, ELAVL1, GSK3B, IKBKE, IL18BP, IL32, IRAK3, JUN, LRPPRC, NDUFA13, NFE2L2, NFS1, NLRP3, OSM, PTGS2, SERPINB1, STAT3, TCEA3, TLR2, TNFSF13B, TRAF3, TREM1, UBE2D3, and VIM. The results obtained from Venn diagrams were used to analyze the differential expression of the 32 PRDEGs in different subgroups of the IS dataset and show the specific differential analysis results in a heat map (Fig. 3C). The expression of the 32 PRDEGs in the different subgroups of IS dataset (IS/Control) was significantly different.

The position annotation of 32 PRDEGs was used to analyze the ___location of PRDEGs on human chromosomes (Fig. 3D), which shows that they are mainly distributed on chromosomes 1, 2, 6, 11, 14 and 19, with the most distributed on chromosome 1, with a total of 6 PRDEGs, indicating that these PRDEGs, which are located close to each other on the chromosomes. This indicates that these PRDEGs in close proximity to each other at the genomic level are more closely linked.

Differential expression analysis of PRDEGs

The Wilcoxon rank sum test was used to analyze the differences in expression levels of the 32 PRDEGs between the different subgroups (IS/Control) in the IS dataset, and the results of the expression difference analysis are presented in a subgroup comparison plot (Fig. 4A). As can be seen from the figure, the expression level of the 32 PRDEGs between the IS group (IS) and the control group (Control) was highly statistically significant (P < 0.01), except for JUN and AHSA1, 10 genes (DNMT1, ELAVL1, IKBKE, IL18BP, IL32, LRPPRC, NDUFA13, NFS1, TCEA3 and TRAF3) were all expressed at higher levels in the control group (Control) than in the IS group (IS); while 20 genes (APAF1, ATF6, ATG7, BIRC2, BNIP3L, CEBPB, CTSG GSK3B, IRAK3, NFE2L2, NLRP3, OSM, PTGS2, SERPINB1, STAT3, TLR2, TNFSF13B, TREM1, UBE2D3, VIM) level were expressed at higher levels in the IS group (IS) than in the control group (Control).

Fig. 4
figure 4

Analysis of differential expression of PRDEGs in the IS dataset. (A) Plot of PRDEGs grouped in the IS dataset comparing the IS group and the control group (Control). (BI) ROC curves based on PRDEGs: AHSA1, APAF1, ATF6, ATG7 (B); BIRC2, BNIP3L, CEBPB, CTSG (C); DNMT1, ELAVL1, GSK3B, IKBKE (D); IL18BP, IL32, IRAK3, JUN (E); LRPPRC, NDUFA13, NFE2L2, NFS1 (F); NLRP3, OSM, PTGS2, SERPINB1 (G); STAT3, TCEA3, TLR2, TNFSF13B (H); TRAF3, TREM1, UBE2D3, VIM (I). Symbol ns equals P ≥ 0.05, which is not statistically significant; symbol * equals P.value < 0.05, which is statistically significant; symbol ** equals P.value < 0.01, which is highly statistically significant; and symbol *** equals P.value < 0.001, which is highly statistically significant. The closer the AUC in the ROC curve is to 1.0, the better the distinguishing power. ROC receiver operating characteristic curve, AUC area under curve.

The ROC curves of the 32 PRDEGs in the IS and control groups (Control) were then plotted and the results presented (Fig. 4B–I). As shown in Fig. 4, 20 PRDEGs (AHSA1, APAF1, ATF6, ATG7, CEBPB, CTSG, DNMT1, GSK3B, IL18BP, JUN, NFS1, NLRP3, OSM, PTGS2, SERPINB1, TCEA3, TLR2, TRAF3, TREM1, UBE2D3) in the IS expression in the dataset has low accurate for the diagnosis of IS occurrence (0.500 < AUC < 0.700, Fig. 4B–I); whereas the other 12 PRDEGs (BIRC2, BNIP3L, ELAVL1, IKBKE, IL32, IRAK3, LRPPRC, NDUFA13 NFE2L2, STAT3, TNFSF13B, and VIM) expression in the IS dataset had moderate accuracy in the detection of IS occurrence (0.700 < AUC < 0.900, Fig. 4B–I).

Machine learning models based on PRDEGs

To determine the diagnostic value of the 32 PRDEGs in the IS dataset, we first constructed SVM models based on the 32 PRDEGs. The SVM (Support Vector Machine) algorithm should obtain the lowest error rate (Fig. 5A) and the highest accuracy (Fig. 5B). The results showed that the accuracy of the SVM model was highest when the number of genes was 29; and the genes include BIRC2, ELAVL1, UBE2D3, ATF6, JUN, DNMT1, GSK3B, TCEA3, IRAK3, SERPINB1, APAF1, AHSA1, STAT3, BNIP3L, NDUFA13, LRPPRC, CTSG, VIM, OSM, TNFSF13B, ATG7, IL18BP, TRAF3, TREM1, TLR2, NFS1, IKBKE, NFE2L2, CEBPB.

Fig. 5
figure 5

Construction of a PRDEGs-based model for IS vs Control. (A) Number of genes with the lowest error rate obtained by the SVM modeling. (B) Number of genes with the highest accuracy rate obtained by the SVM modeling. (C) Plot of model training error for the random forest algorithm. (D) Random forest model showing top 30 PRDEGs (in descending order of IncNodePurity). (E) Forest plot of the logistic (logistic) regression model. (F) Diagnostic plot for the LASSO regression model. (G) Variable trajectory plot for the LASSO regression model. (H) Column line plot (Nomogram) of the 12-PRDEGs in the diagnostic model. (I,J) Calibration curve plot (I), DCA plot (J) for the PRDEGs-based model. SVM support vector machine, PRDEGs pyroptosis-related differentially expressed genes, LASSO least absolute shrinkage and selection operator, DCA decision curve analysis.

The expression data of the 32 PRDEGs in the IS dataset was also analyzed using the Random Forest algorithm (RandomForest, RF; rf) (Fig. 5C) to determine the predictive value of the 32 PRDEGs in the IS dataset. IncNodePurity (Increase in Node Purity) indicates an increase in node purity. The higher the node purity, the lower the number of impurities contained. The results of the specific analysis were screened for IncNodePurity > 0.5, and 21 diagnostic markers for IS disease were obtained from the 32 PRDEGs (Fig. 5D), namely: BIRC2, NDUFA13, ELAVL1, IL32, TNFSF13B, BNIP3L, NFE2L2, IRAK3, DNMT1, VIM, JUN, CTSG, STAT3, CEBPB, LRPPRC, IKBKE, PTGS2, NLRP3, GSK3B, ATF6, and TRAF3.

Finally, we performed a logistic regression analysis and constructed a logistic regression model based on the 32 PRDEGs. A total of 32 PRDEGs were included in the logistic regression model, and the expression profiles of the 32 PRDEGs were visualized in a Forest Plot (Fig. 5E). The PRDEGs-based predictive model was then constructed by LASSO (Least Absolute Shrinkage and Selection Operator) regression analysis, and the results were visualized by plotting the LASSO regression model (Fig. 5F) and the LASSO variable trajectories (Fig. 5G). The results show that the model contains a total of 12 PRDEGs, namely: APAF1, ATF6, BIRC2, BNIP3L, CTSG, ELAVL1, GSK3B, IKBKE, NDUFA13, OSM, STAT3, and VIM. The Nomogram demonstrates the contribution of the 12 PRDEGs to the diagnostic model (Fig. 5H).

To determine the accuracy and discriminatory power of the PRDEGs-based model, calibration curves (Calibration Curve) were plotted using Calibration analysis to assess the predictive effectiveness of the model for actual outcomes based on the fit of the optimal theoretical probability (solid line) to the probability predicted by the model (dashed line) under different scenarios (Fig. 5I). In addition, decision curve analysis (DCA) was used to assess the usefulness of the model in terms of the clinical utility and to present the results (Fig. 5J). When the model’s line was stable over a range above All positive and All negative, the larger the range, the greater the net gain and the more effective the model is. It is clear from the results (Fig. 5I,J) that the model constructed in this experiment has a high accuracy in predicting the occurrence of IS over Control.

The PRDEGs from the LASSO regression model, the PRDEGs from the SVM model and those from the random forest model were intersected to obtain a total of 10 PRDEGs and plotted in a Venn diagram (Fig. 6A), which were: ATF6, BIRC2, BNIP3L, CTSG, ELAVL1, GSK3B, IKBKE, NDUFA13, STAT3, and VIM. A chord plot was then drawn to show the correlation between the 10 Common PRDEGs based on their specific expression in the IS dataset (Fig. 6B), which showed that the 10 Common PRDEGs were mostly positively correlated with each other, i.e. the PRDEGs have mostly positive interactions with each other.

Fig. 6
figure 6

Correlation analysis of Common PRDEGs. (A) LASSO regression models for PRDEGs and SVM models, Venn plots for PRDEGs in random forest models. (B,C) Chord plots for the 10 Common PRDEGs (B), correlation heat map (C). (DI) Correlation scatter plots for Common PRDEGs: GSK3B and STAT3 (D), ATF6 and GSK3B (E), IKBKE and NDUFA13 (F), BNIP3L and IKBKE (G), ELAVL1 and STAT3 (H), and BIRC2 and NDUFA13 (I) correlation scatterplot results are shown. P ≥ 0.05, not statistically significant; P < 0.05, statistically significant; P < 0.01, highly statistically significant; P < 0.001, highly statistically significant. Correlation coefficients (r) in correlation scatter plots with absolute values above 0.8 are strongly correlated; absolute values between 0.5 and 0.8 are moderately correlated; absolute values between 0.3 and 0.5 are weakly correlated; absolute values below 0.3 are weakly or not correlated. LASSO least absolute shrinkage and selection operator, SVM support vector machine, PRDEGs pyroptosis-related differentially expressed genes.

Further correlation analysis was performed on the expression of the 10 Common PRDEGs in the IS dataset and the results were presented by correlation pheatmap (Fig. 6C), which showed that most of the correlations between the 10 Common PRDEGs in the IS dataset were statistically significant (P < 0.05), and then the results were presented by plotting correlation scatter plots separately for each of the three pairs with the highest and lowest correlations (Fig. 6D–I), with the three pairs with the highest positive correlations GSK3B and STAT3 (Fig. 6D, r = 0.606, P < 0.001); ATF6 and GSK3B (Fig. 6E, r = 0.564, P < 0.001); IKBKE and NDUFA13 (Fig. 6F, r = 0.500, P < 0.001) and the three most negatively correlated gene pairs: BNIP3L and IKBKE (Fig. 6G, r = − 0.566, P < 0.001); ELAVL1 and STAT3 (Fig. 6H, r = − 0.403, P < 0.001); BIRC2 and NDUFA13 (Fig.  6I, r = − 0.396, P < 0.001).

Analysis of differences between high and low-risk groups and functional enrichment analysis (GO)

To analyze the differences in gene expression in the IS samples in the high and low-risk groups of the PRDEGs-based diagnostic model, we used the limma package for differential analysis of the IS dataset to obtain DEGs between the different subgroups (High/Low-Riskscore). A total of 12,733 DEGs were obtained from the IS dataset, of which 12 genes met the threshold of |logFC|> 0.5 and P.adj < 0.05. Under this threshold, 9 genes were highly expressed in the High-Riskscore group (low expression in the Low-Riskscore group with positive logFC and up-regulated genes), and they are PF4V1, LY96, PROS1, GNG11, SH3BGRL2, PPBP, RGS18, IFNB1, and XK; and three genes have low expression in the High-Riskscore group (high expression in the Low-Riskscore group with logFC is negative), and they are GZMK, VPREB3, and HLA-DQA1. The results of the differential analysis of the IS dataset were plotted on a volcano map (Fig. 7A).

Fig. 7
figure 7

Analysis of the differences between high and low-risk groups and functional enrichment analysis (GO). (A) Volcano plot of DEGs analysis between different subgroups (High/ Low-Riskscore) in IS dataset. (B) Complex numerical heat map of DEGs in the IS dataset. (C) Bubble plot display of GO functional enrichment analysis results of DEGs (A). (D) Network plot display of GO functional enrichment analysis results of DEGs. (E) Bubble plot display of GO functional enrichment analysis of PRDEGs combined with logFC. Bubble plot display of GO functional enrichment analysis for PRDEGs combined with logFC. The vertical coordinate in the bubble plot (C) is the GO terms, and the bubble color indicates the number of molecules contained in the GO terms. The orange dots in the network diagram (D) represent specific genes and the cyan circles represent specific pathways. The cyan dots in the bubble diagram (E) represent the BP pathway, the orange circles represent the CC pathway; the dark blue circles represent the MF pathway. The screening criteria for GO enrichment entries are P < 0.05 and FDR value (q.value) < 0.05. IS ischemic stroke, DEGs differentially expressed genes, GO gene ontology, BP biological process.

The expression of the 12 DEGs (PF4V1, LY96, PROS1, GNG11, SH3BGRL2, PPBP, RGS18, IFNB1, XK, GZMK, VPREB3, HLA-DQA1) in different subgroups (High/Low-Riskscore) of the IS dataset was analyzed and the results of the differential analysis were plotted according to the results obtained from the volcano plot (Fig. 7B). The differential expression profiles were analyzed, and heat maps were plotted to show the results of the differential analysis (Fig. 7B), from which it can be seen that the 12 DEGs had significantly different expression profiles in different subgroups (High/Low-Riskscore) (Fig. 7B).

To analyze the relationship between the biological processes, molecular functions, cellular components and biological pathways of the 12 DEGs and IS disease, a GO (Gene Ontology) gene function enrichment analysis (Table 1) was performed on the 12 DEGs. The screening criteria for enrichment entries were P < 0.05 and FDR values (q.value) < 0.05, which were considered statistically significant. The results showed that the 12 DEGs in IS disease were mainly enriched in humoral immune response, leukocyte migration, blood coagulation, chemokine-mediated signaling pathway, response to chemokine, neutrophil chemotaxis, and other biological processes, as well as cellular components such as platelet alpha granule lumen and platelet alpha granule. These 12 DEGs were also enriched in cytokine activity, cytokine receptor binding, receptor ligand activity, chemokine activity, chemokine receptor binding, G-protein-coupled receptor binding and other molecular functions (MF). The results of GO function enrichment analysis are presented as bubble plots (Fig. 7C). In addition, the results were also presented as a network diagram (Fig. 7D). A joint logFC/GO enrichment analysis of these 12 DEGs was then performed, which is based on the enrichment analysis by providing the logFC values of the DEGs and calculating the corresponding z-score for each gene. The results were presented as bubble plots (Fig. 7E), from which it can be seen that the GO enrichment analysis results of the 12 DEGs were mainly concentrated in the BP pathway.

Table 1 GO enrichment analysis results of IS dataset pyroptosis model high and low risk group DEGs.

Pathway enrichment (KEGG) analysis of DEGs in high and low-risk groups

The 12 DEGs were analyzed for KEGG enrichment (Table 2) and the screening criteria of P < 0.05 and FDR value (q.value) < 0.05 were considered statistically significant. The results showed that the 12 DEGs were significantly enriched in chemokine signaling pathway, cytokine-cytokine receptor interaction, viral protein interaction with cytokine and cytokine receptor, Toll-like receptor signaling pathway, and toxoplasmosis in a total of five KEGG pathways (Fig. 8A). In addition, the pathway enrichment (KEGG) analysis was presented in the form of a chord diagram (Fig. 9B), and the pathway maps (Fig. 8C–G) were plotted and drawn by the R package Pathview to show KEGG enrichment in the Chemokine signaling pathway (Fig. 8C), Cytokine-cytokine receptor interaction (Fig. 8D), Viral protein interaction with cytokine and cytokine receptor (Fig. 8E), Toll-like receptor signaling pathway (Fig. 8F), and Toxoplasmosis (Fig. 8G) gene expression in the five KEGG pathways.

Table 2 KEGG enrichment analysis results of IS dataset Pyroptosis Model high and low risk group DEGs.
Fig. 8
figure 8

KEGG pathway enrichment analysis of the DEGs. (A,B) KEGG enrichment analysis of the IS dataset DEGs are shown in histogram (A), chord diagram (B). (CG) KEGG pathway enrichment analysis of DEGs: Chemokine signaling pathway (C), Cytokine-cytokine receptor interaction (D), Viral protein interaction with cytokine and cytokine receptor (E), Toll-like receptor signaling pathway (F), Toxoplasmosis (G) KEGG. IS ischemic stroke, DEGs differentially expressed genes, KEGG Kyoto encyclopedia of genes and genomes. Screening criteria for KEGG-enriched entries were P < 0.05 and FDR value (q.value) < 0.05.

Fig. 9
figure 9

GSEA enrichment analysis of the high and low-risk groups of the IS diagnostic model. (A) Analysis of the main 6 biological features of GSEA enrichment between high and low-risk groups in the diagnostic model of the IS dataset. B-G. Significant enrichment of genes between high and low-risk groups in the PRDEGs-based diagnostic model: Brain Hcp With H3k27me3 (B), Neuroactive Ligand Receptor Interaction (C), Regulation Of Autophagy (D), Sars Coronavirus And Innate Immunity (E), Interferon Alpha Beta Signaling (F), IL2 Stat5 Pathway (G) pathway. IS ischemic stroke, PRDEGs pyroptosis-related differentially expressed genes, DEGs differentially expressed genes, GSEA gene set enrichment analysis. The significant enrichment screening criteria for GSEA enrichment analysis were P.adj < 0.05 and FDR value (q.value) < 0.05.

GSEA and GSVA enrichment analysis of the high and low-risk groups of the IS dataset

To determine the effect of the expression levels of metabolism-related DEGs in ischaemic stroke (IS) on the occurrence of IS, the entire gene expression and involvement of DEGs in the IS dataset was analyzed by GSEA between high and low-risk groups of the PRDEGs-based diagnostic model. The association between gene expression and biologic processes, cellular components affected, and molecular functions performed was analyzed by GSEA, with P < 0.05 and FDR value (q.value) < 0.05 as significant enrichment screening criteria.

The results show significant enrichment of genes between high and low-risk groups in the PRDEGs-based diagnostic model include Brain Hcp With H3k27me3 (Fig. 9B), Neuroactive Ligand Receptor Interaction (Fig. 9C), Regulation Of Autophagy (Fig. 9D), SARS Coronavirus and Innate Immunity (Fig. 9E), Interferon Alpha Beta Signaling (Fig. 10F), Il2 Stat5 Pathway (Fig. 9G) and other pathways (Fig. 9A–G, Table 3).

Fig. 10
figure 10

GSVA enrichment analysis between high and low-risk groups of the IS dataset. Complex numerical heat map display of GSVA enrichment analysis data between high and low-risk groups of the IS dataset (A), group comparison plot (B). IS ischemic stroke, PRDEGs pyroptosis-related differentially expressed genes, GSVA gene set variation analysis, GSVA gene set variation analysis. GSVA enrichment analysis results were based on P < 0.05 and |logFC|> 0.35 as significant enrichment screening criteria.

Table 3 GSEA enrichment analysis results of IS dataset Pyroptosis Model high and low risk group genes.

GSVA (Gene Set Variation Analysis) analysis was performed on all genes in the IS dataset between the high and low-risk groups to explore the variability of the c2.cp.all.v2022.1.Hs.symbols.gmt gene set between the high and low-risk groups of the IS dataset, with P < 0.05 and |logFC|> 0.35 as significant enrichment screening criteria (Fig. 10, Table 4). GSVA of all genes between high and low-risk groups in the IS dataset showed Gdp Fucose Biosynthesis, Synthesis Of Dolichyl Phosphate, Cytosine Methylation, Formation Of Xylulose 5 Phosphate, Beta Oxidation Of Butanoyl Coa To Acetyl Coa, Aurka Targets, Wax And Plasmalogen Biosynthesis, Cerebral Organic Acidurias Including Diseases, Chylomicron Clearance, Diseases Of Base Excision Repair, Mirna Biogenesis, Synthesis Of Ketone Bodies, Runx1 Regulates Transcription Of Genes Involved In Wnt Signaling, Melanin Biosynthesis, Aml Methylation Cluster 7 Dn, Cohesin Loading Onto Chromatin, Covid19 Thrombosis And Anticoagulation, Regulation Of Gene Expression By Hypoxia Inducible Factor, Signaling To p38 Via Rit And Rin a total of 19 gene set pathways showing differences between different subgroups in the IS dataset, with Melanin Biosynthesis, Aml Methylation Cluster 7 Dn, Cohesin Loading Onto Chromatin, Covid19 Thrombosis And Anticoagulation, Regulation Of Gene Expression By Hypoxia Inducible Factor, Signaling To P38 Via Rit And Rin, six pathways had significantly higher enrichment scores in the high-risk group of the IS dataset than in the low-risk group, while the other 13 pathways had significantly higher enrichment scores in the low-risk group of the IS dataset than in the high-risk group. The results of the differential expression of the 19 pathways in the different subgroups of the IS dataset were also analyzed and the results of the specific differential analysis were presented in a heat map (Fig. 10A). The results showed that the 19 pathways were significantly differentially expressed in the different subgroups (High/Low-Riskscore) of the IS dataset. The Mann–Whitney U-test analysis also showed that the expression of the 19 pathways in the IS dataset differed significantly in different subgroups (High/Low-Riskscore), except for the Wax And Plasmalogen Biosynthesis and Covid19 Thrombosis And Anticoagulation did not have statistically significant differences in enrichment scores (P > 0.05); and all other pathways had at least statistically significant differences (P < 0.05).

Table 4 GSVA enrichment analysis results of IS dataset Pyroptosis model high and low risk group genes.

8.Construction of protein–protein interaction networks (PPI networks), mRNA-miRNA, mRNA-RBP and mRNA-TF interaction networks.

The STRING database was used to perform protein–protein interaction analysis [minimum required interaction score: low confidence (0.150)] on 12 DEGs (PF4V1, LY96, PROS1, GNG11, SH3BGRL2, PPBP, RGS18, IFNB1, XK, GZMK, VPREB3, HLA-DQA1) between high and low risk groups of the PRDEGs diagnostic model in the IS dataset. protein–protein interaction analysis [minimum required interaction score: low confidence (0.150)], and a protein–protein interaction network (PPI network; Protein–protein interaction network) consisting of 12 DEGs was constructed (Fig. 11A). network) (Fig. 11A), and then the PRDEGs in the PPI network with links to other genes were used as key genes (hub genes) for IS disease, and 11 key genes (hub genes) were obtained, namely: PF4V1, LY96, PROS1, GNG11, PPBP, RGS18 These 11 hub genes were then analysed for functional similarity, and the GO terms, sets of GO terms, gene products and gene clusters were calculated using the R package GOSemSim. The results showed that among the 11 hub genes, VPREB3 had no functional similarity value with other hub genes, and LY96 had the highest functional similarity value with other hub genes. genes had the highest functional similarity values (Fig. 11B).

Fig. 11
figure 11

Construction of protein–protein interaction network (PPI network), mRNA-miRNA, mRNA-RBP and mRNA-TF interaction networks. (A) IS dataset PRDEGs diagnostic model protein interaction network (PPI network) between high and low risk groups of DEGs. (B) Functional similarity analysis results of key genes are presented. (CE) mRNA-miRNA (C), mRNA-RBP (D), mRNA-TF (E) interaction networks of key genes. mRNA-miRNA (C), mRNA-RBP (D), mRNA-TF (E) interaction networks. -miRNA (C) Interacting network with sky-blue circular blocks as mRNA; orange hexagonal blocks as miRNA. mRNA-RBP (D) Interacting network with sky-blue oval blocks as mRNA; lavender diamond-shaped blocks as RBP. mRNA-TF (E) Interacting network with sky-blue oval blocks as mRNA; light green triangular blocks as transcription factor (TF). IS ischemic stroke, PRDEGs pyroptosis-related differentially expressed genes, DEGs differentially expressed genes, PPI network protein–protein interaction network, RBP RNA binding protein, TF transcription factors.

The mRNA-miRNA data in the miRDB database were used to predict the miRNAs that interacted with 11 key genes, and the mRNA-miRNA interaction network was drawn for visualization (Fig. 11C). mRNA-miRNA interaction network in the sky-blue oval blocks are mRNAs; orange hexagonal blocks are miRNAs. by mRNA-miRNA interaction The mRNA-miRNA interaction network was composed of 9 hub genes and 30 miRNA molecules, constituting a total of 32 pairs of mRNA-miRNA interactions.

The RNA binding protein (RBP) interacting with 11 key genes was predicted using the ENCORI database, and the mRNA-RBP interaction network was mapped for visualization (Fig. 11D). mRNA-RBP interaction network has sky blue circular blocks for mRNA and lavender diamond-shaped dot blocks for RBP. The mRNA-RBP interaction network consisted of 7 hub genes and 68 RBP molecules, constituting a total of 120 pairs of mRNA-RBP interactions, of which the hub genes PROS1 had interactions with 54 RBP molecules.

Finally, the CHIPBase database (version 3.0) and hTFtarget database were used to find transcription factors (TF) that bind to hub genes. The interaction relationships were intersected with 11 hub genes, and the interaction data of 7 hub genes and 70 transcription factors (TF) were obtained and visualized. mRNA-TF interaction network with sky blue oval blocks for mRNA; light green triangular blocks for transcription factors (TF) (Fig. 11E). mRNA-TF interaction network with key The key gene VPREB3 interacted with 56 transcription factors (TF).

9.Analysis of differences in ssgsea immune characteristics between high and low risk groups in the IS dataset dataset PRDEGs diagnostic model.

To investigate the differences in immune infiltration between the high and low risk groups of the IS dataset PRDEGs diagnostic model, the ssGSEA algorithm was used to calculate the abundance of infiltration of 28 immune cells in the samples between the high and low risk groups of the IS dataset, and then the Mann–Whitney U test was used to analyse the differences in infiltration of 28 immune cells between different IS disease subtypes. The degree of difference in the infiltration of 28 immune cells between different IS disease subtypes was then analysed by the Mann–Whitney U test and the results were presented by grouped comparison plots (Fig. 12A). The results showed statistically significant differences (p < 0.05) in the infiltration abundance of 10 immune cells between the high and low risk groups of the PRDEGs diagnostic model of the IS dataset, namely Activated B cell, Activated CD8 T cell, CD56dim natural killer cell Central memory CD4 T cell, Effector memory CD4 T cell, Effector memory CD8 T cell, Immature B cell, MDSC, Memory B cell, T follicular helper cell. The correlation between the infiltration abundance of the 10 immune cell types with statistical differences between the high-risk and low-risk groups was further calculated and the results were presented (Fig. 12B,C), showing that in both subgroups of the IS dataset (Fig. 12C), the infiltration abundance of the 10 immune cell types were mostly positively correlated with each other, and that Effector memory CD4 T cells and Activated CD8 T cells were positively correlated with each other. cells and Activated CD8 T cells (Fig. 12B,C).

Fig. 12
figure 12

Analysis of differences in ssgsea immune characteristics between high and low risk groups in the IS dataset dataset PRDEGs diagnostic model. (A) IS dataset dataset of the PRDEGs diagnostic model of ssGSEA immune infiltration analysis results grouped between high and low risk groups is presented. (B,C) IS dataset dataset of the low risk group (B), high risk group (C) correlation analysis results of immune cell infiltration abundance are presented. (D,E) IS dataset dataset of the low risk group (D), high risk group (E), and immune cell infiltration analysis results are presented. Heat map of correlation between immune cells and hub genes in the high-risk group (E). Symbol ns equals P ≥ 0.05, not statistically significant; symbols * equals P < 0.05, statistically significant; symbols ** equals P < 0.01, highly statistically significant; symbols *** equals P < 0.001, highly statistically significant. IS ischemic stroke, PRDEGs pyroptosis related differentially expressed genes, ssGSEA single-sample gene-set enrichment analysis.

The correlation between the abundance of 10 immune cell infiltrates and the expression of 11 key genes (hub genes: PF4V1, LY96, PROS1, GNG11, PPBP, RGS18, IFNB1, XK, GZMK, VPREB3, HLA-DQA1) were also calculated for the low-risk (Fig. 12D) and high-risk (Fig. 12E) patient samples, respectively. The correlations between the amounts, screened at P < 0.05 and the results presented by correlation heat map (Fig. 12D,E), showed some amount of significant correlation between immune cell content and the expression of 11 hub genes in both subgroups, with the hub genes in the low risk group of the IS dataset (Fig. 12D) There was a negative correlation between IFNB1 and the infiltration abundance of most immune cells, while there was a positive correlation for the hub genes VPREB3. In the high-risk group of the IS dataset (Fig. 12E) there was a negative correlation between hub genes IFNB1, LY96 and the infiltration abundance of most immune cells, while there were relatively more significant positive correlations for hub genes VPREB3, HLA-DQA1, and GZMK.

Discussion

Ischemic stroke (IS) is a major global health problem. Despite rapid advances in medical care worldwide, stroke continues to be the leading cause of death and disability27. The disease does not only occur in elderly people, but the chances of developing it in younger people are increasing every year28,29,30, and the common risk factors are smoking, hypertension, hypercholesterolemia hyperglycemia. Similarly, stroke is a dangerous disease in the life and health of children and can even lead to serious neurological deficits31. Up to date, there is no reliable molecular biomarkers could be used in clinical practice to accurately predict IS prognosis.

Pyroptosis, an inflammation-caused type of cell death, is thought to be one of the major types of brain cell death after IS and factor that affects the progression of IS32. Suppression of pyroptosis could promote neuronal survival and ameliorate neurological damage, thereby improving the prognosis and reducing mortality in IS patients33,34,35. As brain tissue biopsy has not been well accepted by the patients and their relatives, studying the expression pattern of PRDEGs in the blood samples of IS patients deposited to IS dataset may provide a tool to predict the IS prognosis and targets for drug development.

Our study encompassed a comprehensive machine learning analysis of both the Control and the ischemic stroke (IS) samples. We first collected and combined three IS expression profile datasets into one integrated IS dataset consisting of 62 IS patients and 47 controls, with a total of 12,733 gene expression data in the blood samples. Among them, 1422 genes met the threshold of |logFC|> 0 and P. adj < 0.05 and of which 376 were highly expressed in IS group. Intersection of PRGs and DEGs further narrowed the number of genes down to 32. Logistic/LASSO regression analysis, random forest and SVM algorithms were performed on each of the 32 PRDEGs to construct accurate predicting models and determine the predicting value of PRDEGs in the dataset. Through a detailed evaluation involving differential gene expression pattern analysis and ROC analysis (Fig. 4), we successfully identified a group of 10 critical PRDEGs (ATF6, BIRC2, BNIP3L, CTSG, ELAVL1, GSK3B, IKBKE, NDUFA13, STAT3, and VIM). These genes participate in diverse cellular processes, and their interconnection with IS is a subject of ongoing investigation. We have delved into some of the implications of these gene associations below and discuss some potential therapeutic interventions for IS.

The differential gene expression analysis of IS samples indicates an up-regulation of ATF6 (Activating Transcription Factor 6) (Fig. 4A). Furthermore, ATF6 demonstrates a moderate level of predictive power for IS against the Control, as evidenced by an AUROC value of 0.67 (Fig. 4B). This observation can be attributed to ATF6’s role as a cellular response mechanism that aids cells in managing stress resulting from the accumulation of misfolded proteins in the endoplasmic reticulum (ER), induced by various cellular stressors including ischemia36. This alignment with existing literature underscores ATF6’s significance in the context of ischemic stroke. The practical implication of this discovery lies in the potential utility of ATF6-activating compounds to mitigate ischemic stroke-related cellular stress. Notably, compounds such as BiX (1-(3,4-dihydroxyphenyl)-2-thiocyanate-ethanone) and compound 147, as mentioned in reference36, have demonstrated the ability to activate ATF6, offering a foundation for the development of novel pharmacological therapies for ischemic stroke.

The analysis also revealed an up-regulation of BIRC2 (Baculoviral IAP Repeat Containing 2) in the IS group relative to the Control group, supported by a P-value of less than 0.001 (Fig. 4A). BIRC2 is a member of the inhibitor of apoptosis (IAP) family37. It plays a role in regulating cysteine aspartic enzyme pathway, inhibiting apoptosis and promoting cell cycle progression38,39,40. In cerebral ischemia, suppression of BIRC2 leads to exacerbated neurological functional deficits and increased apoptosis and infarct volume37. In contrast, upregulation of BIRC2 provides protection against type 2 diabetes and hypercholesterolemia, both of which are risk factors for inducing and aggravating ischemic stroke41,42,43. Remarkably, BIRC2 emerges as a potent diagnostic discriminator, as evidenced by an area under the ROC curve (AUROC) value of 0.79 (Fig. 4C), effectively distinguishing IS samples from the Controls. This finding mirrors a recent study by Zhang et al.44, aligning with our investigation and providing valuable insights for exploring the underlying mechanisms of ischemic stroke, as well as fostering the development of innovative diagnostic and therapeutic avenues.

BNIP3L (BCL2 Interacting Protein 3 Like), which is a mitochondrial autophagy receptor45,46, plays a neuroprotective role in IS47,48. The up-regulation of BNIP3L identified in our analysis (Fig. 4A) holds particular significance due to its involvement in activating mitophagy, a process that confers protective effects against IS damage. The AUROC value of 0.701 (Fig. 4C) underscores BNIP3L’s efficacy as a diagnostic marker, illuminating its role as a response mechanism to IS, aimed at safeguarding the ischemic brain region. Corroborating this, research by Li et al.49 indicates that BNIP3L knockout mice exhibit larger brain infarct areas and exacerbated neurological deficits, which can be ameliorated through BNIP3L overexpression. Consequently, compounds capable of stimulating BNIP3L expression offer promise as therapeutic agents against IS. For instance, hydroxamic acid-based histone deacetylase (HDAC) inhibitors have been identified as inducers of BNIP3L expression50, warranting exploration as potential small molecule drug candidates for this purpose.

Cathepsin G (CTSG) is a serine protease of the trypsin C family51 that is released directly at the site of inflammation, causing platelet secretion and aggregation, and plays an important role in cell signaling, extracellular matrix degradation and cytokine processing, and could be pro-inflammatory or promote thrombosis when platelets interact with neutrophils52,53,54. Our analysis further indicates an up-regulation of CTSG (Cathepsin G) in the context of ischemic stroke (IS) compared to the Control group (Fig. 4A). Notably, this up-regulation is statistically significant, denoted by a P-value of 0.01. Moreover, the area under the ROC curve (AUROC) value of 0.66 (Fig. 4C) underscores CTSG’s diagnostic potential for distinguishing IS samples from Controls. This observation assumes significance given CTSG’s implication as an inflammation-related gene product. Building upon prior research, which demonstrated that knockdown of CTSG inhibits apoptosis55,56 and that the reduction of myocyte death post cardiac ischemia and reperfusion upon CTSG inhibition57, the prospect of CTSG inhibitors emerges as a promising avenue for potential therapeutic intervention against IS. A pertinent example is the Cathepsin G inhibitor Bis-Napthyl Beta-Ketophosphonic Acid (DrugBank ID: DB02360), which holds potential as a small molecule drug candidate for ischemic stroke.

Within our analysis, ELAVL1 exhibits down-regulation in the IS group relative to the Control group (Figs. 4A and D). This modulation is hypothesized to be a response mechanism to mitigate neuronal impairment in cerebral ischemia/reperfusion events. Support for this notion is derived from a study by Du et al.58, which underscores the role of ELAVL1 in ferroptosis-induced neuronal protection. Although explicit ELAVL1-downregulating compounds are not yet established, ongoing research aims to identify small molecules capable of disrupting the interaction between ELAVL1 and its target mRNAs. Such molecules have the potential to modulate the stabilizing influence of ELAVL1 on mRNA molecules, thereby influencing changes in gene expression59.

GSK3B (glycogen synthase kinase 3 beta) exhibits up-regulation in the IS samples based on our analysis (Fig. 4A). The AUROC value of 0.65 (Fig. 4D) highlights GSK3B’s robust discriminatory power in separating the IS samples from the Control samples. Research by Li et al.60 underscores the potential of GSK3 inhibition in reducing cerebral ischemia/reperfusion injury. In light of this, GSK3B inhibitors emerge as a promising avenue for the treatment of ischemic stroke. For instance, Tideglusib, a chemical inhibitor of GSK3B, has demonstrated a capacity to attenuate hypoxic-ischemic brain injury in neonatal mice61. This supports the potential utility of GSK3B inhibitors, such as Tideglusib, as candidates for consideration in the context of therapeutic interventions for IS patients.

IKBKE (inhibitor of nuclear factor kappa B kinase subunit epsilon) is a gene involved in pyroptosis during the ischemic stroke (IS). Its expression is down-regulated compared to the Control group (Fig. 4A). The area under the ROC curve (AUROC) is 0.72 (Fig. 4D), indicating a moderate level of statistical power in distinguishing IS samples from Control samples. Given the context-dependent nature of IKBKE’s function62, it is suggested, based on our analysis, that further in vitro and in vivo tests should be conducted with various modulators of this target to identify potential drug candidates relevant for treating IS.

STAT3 (Signal transducer and activator of transcription 3) (Fig. 4A) is another factor differentially expressed, with an AUROC of 0.705, suggesting a moderate level of power in distinguishing the IS samples from the Control samples. Studies have highlighted the essential role of STAT3 in maintaining cerebrovascular integrity and its significance in providing protection against cerebral ischemia. Notably, endothelial STAT3 could serve as a potential therapeutic target for mitigating endothelial dysfunction post-stroke63.

NDUFA13 (NADH dehydrogenase [ubiquinone] 1 alpha subcomplex subunit 13) is down-regulated (Fig. 4A), with an AUROC of 0.79 (Fig. 4F). Notably, studies have shown that a moderate down-regulation of NDUFA13 is associated with suppressed superoxide burst and reduced infarct size during the ischemia–reperfusion process64, highlighting the significant role that NDUFA13 plays in ischemic stroke.

Our analysis demonstrates an upregulation of VIM (Vimentin) (Fig. 4A), with an AUROC of 0.704 (Fig. 4I). VIM is a cytoskeletal intermediate filament protein65,66 that maintains the capacity for tissue repair and tissue integrity67,68. VIM, a protein coding gene, recruits inflammatory cells from the bloodstream and enhances inflammatory responses. It impairs vascular function by regulating the vascular tone and affects blood vessel remodeling67, which could exacerbate stroke progression and hinder the recovery. Interestingly, circulating VIM is positively associated with prediction of future IS incidence in a population–based cohort study69. Furthermore, VIM is associated with stroke risk factors with a high expression in diabetic pancreas islet cells70,71 and mediates hyperglycemia caused disperse of Golgi apparatus72. VIM deficiency prevents high-fat diet-induced obesity and insulin resistance in mice73. These results suggest that VIM is closely associated with IS risk factors and IS itself. And furthermore, Fasipe et al.74 found that extracellular VIM significantly promotes the formation of von Willebrand factor (VWF) strings through A2 ___domain binding and pharmacologically targeting VIM/VWF interactions promotes improvements in reperfusion after IS. These studies demonstrate the critical role of VIM in stroke pathology and further investigation is necessary to confirm whether inhibiting VIM could contribute to the recovery of IS patients from ischemic brain injury.

By using |logFC|> 0.5 and P.adj < 0.05 as screening criteria, a total of 12 DEGs (PF4V1, LY96, PROS1, GNG11, SH3BGRL2, PPBP, RGS18, IFNB1, XK, GZMK, VPREB3, HLA-DQA1) with significant differences between high and low risk groups of the IS dataset were screened. To understand the potential roles, pathways and biological properties of the 12 DEGs in the high and low-risk groups in the IS dataset, KEGG, GO, GSEA and GSVA enrichment analyses were performed in this experiment. KEGG results showed that these 12 genes were mainly enriched in the five KEGG pathways of Chemokine signaling pathway, Cytokine-cytokine receptor interaction, Viral protein interaction with cytokine and cytokine receptor, Toll-like receptor signaling pathway, Toxoplasmosis.

GO enrichment analysis showed that these genes were mainly enriched in BP pathway such as humoral immune response, leukocyte migration, blood coagulation, MF such as cytokine activity, cytokine receptor binding, receptor ligand activity and CC such as platelet alpha granule lumen, platelet alpha granule. Taken together, the results of these enrichment analyses allow an assessment of the potential value of PRDEGs in disease diagnosis. Numerous studies have shown that pyroptosis is also involved in immunity, this and the result of the analysis of the GO in our study and KEGG is consistent. Scorching leads to a range of inflammatory and immune responses including cell swelling, plasma membrane lysis, and chromatin rupture75,76. Evidence suggests that ischemic stroke patients frequently exhibit pronounced immune-inflammatory responses77,78. The immune system has been shown to be involved in all phases of ischemic stroke, from the early damage events that have occurred with arterial occlusion to later tissue repair, and the damaged brain in turn exerts immunosuppressive effects that promote fatal disturbances and dangerous post-stroke patient survival79,80.

Materials and methods

Data sources and pre-processing

We downloaded the expression profile datasets GSE2225581, GSE14027582 and GSE1656183 for patients with IS from the GEO database84 using the R package GEOquery85. All datasets were obtained from Homo sapiens.

The GSE22255 dataset consists of 40 samples of gene expression profiles of peripheral blood mononuclear cells (PBMCs) from IS patients and age-matched healthy adults: 20 IS patients and 20 age-matched healthy adults. The GSE140275 dataset includes gene expression profiles from blood samples of 3 acute ischemic stroke (AIS) patients and 3 healthy controls. The GSE16561 dataset consists of 63 samples of GEP data from whole blood samples from 39 patients with IS and 24 healthy control subjects.

The data platforms for GSE22255, GSE140275, and GSE16561 are GPL570 (HG-U133_Plus_2) Affymetrix Human Genome U133 Plus 2.0 Array, GPL16791 Illumina HiSeq 2500 (Homo sapiens), and GPL6883 Illumina HumanRef-8 v3.0 expression beadchip, respectively. The probe annotation for these datasets was done using the corresponding GPL platform files. All samples from these datasets were included for further analysis. Please refer to Tables 5,6 for specific dataset information.

Table 5 Ischemic stroke data set information list.
Table 6 The clinical characteristics of datasets.

We collected Pyroptosis-Related Genes (PRGs) from the GeneCards database86 (https://www.genecards.org/), which provides comprehensive information on human genes. A total of 372 PRGs at the mRNA level were obtained using the term “pyroptosis” as a search term and “protein coding” as a filtering criterion. In addition, we used MSigDB (Molecular Signatures Database)87, with "pyroptosis " as the search term, to find 27 genes from the REACTOME_PYROPTOSIS.v2022.1.Hs reference gene set. After merging and de-duplication, we obtained a total of 372 PRGs as described in Table S1.

Differentially expressed genes associated with ischemic stroke

To identify the potential mechanisms, biological features, and pathways of differential genes in IS, the datasets GSE22255, GSE140275 and GSE16561 were combined to remove the batch effect, and then the three datasets were standardized to obtain the IS dataset. The expression matrix of the dataset before and after removal of the batch effect was subjected to principal component analysis (PCA)88. The effect of removing the batch effect was validated.

Differential analysis was then performed on the IS dataset using the LIMMA package, and DEGs (differentially expressed genes) were selected for further study using the following criteria: |logFC|> 0 and P.adj < 0.05. Among these DEGs, genes with logFC > 0 and P.adj < 0.05 were considered upregulated DEGs, while genes with logFC < 0 and P.adj < 0.05 were considered downregulated DEGs.

The DEGs were intersected with the 372 Pyroptosis-Related Genes (PRGs) and a Venn diagram was used to obtain the IS disease-related Pyroptosis-Related Differentially Expressed Genes (PRDEGs) (Table 7). The results of the analysis are presented in a volcano map and a heat map.

Table 7 IS dataset pyroptosis related differentially expressed genes list.

Constructing predictive models for ischemic stroke based on PRDEGs

Logistic regression analysis of the integrated IS dataset was performed to obtain a predictive model based on the PRDEGs. Logistic regression models were constructed by screening PRDEGs with a P-value < 0.10, and the expression of the PRDEGs included in the logistic regression models was presented in a Forest Plot.

Next, we applied LASSO (Least Absolute Shrinkage and Selection Operator) regression analysis to the PRDEGs included in the above logistic regression model using the R package glmnet89. The parameters set. seed (500) and family = “binomial” were used, and the analysis was run for 200 cycles. LASSO regression analysis, based on linear regression, reduces model overfitting and improves model generalization by adding a penalty term (lambda × absolute value of slope). The results of the LASSO regression analysis were plotted on a Nomogram by R package rms90.

RandomForest (RF)91 is an algorithm that integrates multiple decision trees through ensemble learning and belongs to the bagging (bootstrap aggregation) category of ensemble algorithms. By constructing multiple decision trees, it predicts a sample by aggregating the predictions from each tree and selecting the final result through voting. We used the RandomForest package92 to perform the model construction with the expression matrix of PRDEGs in the IS dataset. Parameters set. seed (234) and ntree = 1000.

Furthermore, we constructed an SVM (Support Vector Machine)93 model based on the Pyroptosis-Related DEGs (PRDEGs) and selected those PRDEGs based on the highest accuracy and lowest error rate. The PRDEGs included in the LASSO regression model, SVM model, and Random Forest model were intersected, and a Venn diagram was plotted to obtain the common PRDEGs.

Finally, we evaluated the accuracy and discriminative power of the PRDEGs diagnostic model using calibration analysis and plotted a calibration curve based on the pyroptosis-related hub genes. Decision Curve Analysis (DCA)94 (a simple method for evaluating clinical prediction models, diagnostic tests, and molecular markers) was used to assess the accuracy and discriminative power of the PRDEGs diagnostic model. The DCA plot evaluating the PRDEGs diagnostic model was generated using the R package ggDCA based on the pyroptosis-related DEGs (PRDEGs).

Differentially expressed genes associated with high and low-risk groups in a diagnostic model of PRDEGs

To identify the potential mechanisms, biological features, and pathways associated with differentially expressed genes (DEGs) in the high-risk and low-risk groups of the IS dataset PRDEGs diagnostic model, we performed differential analysis using the LIMMA package on the IS dataset. DEGs were selected based on the criteria of |logFC|> 0.5 and P < 0.05 for comparisons between different groups (High vs. Low). These DEGs, meeting the criteria, were further investigated in our study. Among them, genes with logFC > 0.5 and P < 0.05 were considered up-regulated DEGs, while genes with logFC < -0.5 and P < 0.05 were considered down-regulated DEGs.

Gene function enrichment analysis (GO) and pathway enrichment (KEGG) analysis

Gene Ontology (GO) analysis95 is a commonly used method for large-scale functional enrichment studies, including biological processes (BP), molecular functions (MF), and cellular components (CC). Kyoto Encyclopedia of Genes and Genomes (KEGG)96 is a widely used database that stores information on genomes, biological pathways, diseases, drugs, and more. We used the clusterProfiler package97 in R to perform GO annotation analysis on the DEGs. The criteria for selecting enriched terms were P.adj < 0.05 and false discovery rate (FDR) value (q-value) < 0.05, indicating statistical significance. The Benjamini-Hochberg (BH) method was used for P correction.

Gene set enrichment analysis and gene set variation analysis

GSEA (Gene Set Enrichment Analysis)98 is a computational method used to determine whether a set of pre-defined genes show statistical differences between two biological states and is commonly used to estimate changes in pathway and biological process activity in samples from expression datasets. In this study, we first divided the genes in the high and low-risk groups of the PRDEGs diagnostic model in the IS dataset into two groups of high and low phenotypic correlation based on phenotypic correlation ranking, and then we used the clusterProfiler package to perform an enrichment analysis of all genes in the two groups of high and low phenotypic correlation. The parameters used in this GSEA were as follows: seed of 2022, calculated number of times was 10,000, each gene set contained at least 10 genes and at most 500 genes, and the P correction method was Benjamini-Hochberg (BH). We obtained the "c2.all.v7.5.1.symbols.gmt" gene set from the Molecular Signatures Database (MSigDB). The screening criteria for significant enrichment were P.adj < 0.05 and FDR value (q. value) < 0.05.

GSVA (Gene Set Variation Analysis)99, known as Gene Set Variation Analysis, is a non-parametric, unsupervised analysis method that evaluates gene set enrichment in microarray transcriptomes by converting gene expression matrices between samples into gene set expression matrices between samples. This is used to assess whether different pathways are enriched across samples. We obtained the "c2.cp.all.v2022.1.Hs.symbols.gmt" gene set from the MSigDB database and performed GSVA on all genes in the PRDEGs diagnostic model high and low-risk groups in the IS dataset to calculate the functional enrichment of genes in the PRDEGs diagnostic model high and low-risk groups. The functional enrichment differences between the risk-groups were calculated, and P < 0.05 and |logFC|> 0.35 were used as the significant enrichment screening criteria.

Protein–protein interaction networks

The STRING database100 was used to construct DEGs-related protein–protein interaction networks from DEGs in the high and low risk groups of the filtered IS dataset PRDEGs diagnostic model, visualise the PPI network using Cytoscape101 and include DEGs in the PPI network with links to other nodes and as key genes (hub genes) for IS diseases.

Construction of mRNA-miRNA, mRNA-RBP, and mRNA-TF interactions networks

Using the miRDB database102 predicted miRNAs interacting with hub genes and found mRNA-miRNA interactions for key genes (hub genes) with a Target Score ≥ 90 on the data section to map mRNA-miRNA interaction networks. The ENCORI database was also used103 RNA binding proteins (RBPs) interacting with key genes were predicted and mRNA-RBP pairs were screened using clusterNum > 1 and clipExpNum > 1 as screening criteria and mRNA-RBP interaction networks were mapped. Through the CHIPBase database104 (version 3.0) and the hTFtarget database105. Transcription factors (TFs) that bind to key genes (hub genes) were identified and the common parts of both databases were retained. Cytoscape software was then used to visualize mRNA-miRNA, mRNA-RBP, and mRNA-TF interactions networks.

Statistical analysis

All data processing and analysis in this article was based on R software (Version 4.2.2), and continuous variables are presented as mean ± standard deviation. Comparisons between two groups were made using the Wilcoxon rank sum test (i.e., Wilcoxon rank sum test), while comparisons between three or more groups were made using the Kruskal–Wallis test.

Limitation

In this study, we described the gene expression profiles related to pyroptosis in ischemic stroke using machine learning to identify potential targets for drug repurposing. However, owing to the limitations of single-cell sequencing, such as the inability to accurately describe low-expression genes, large sample sizes are required for reliable analysis. Additionally, owing to the lack of corresponding clinical specimen research, it cannot be analyzed in combination with clinical information. We may conduct future research in this direction using multigroup science and space transcriptome technology.

Conclusion

Our comprehensive machine learning analysis of Control and ischemic stroke (IS) samples has yielded valuable insights into the complexity of PRDEGs and their relevance to IS. Through detailed examination of differential gene expression patterns and ROC analysis, we have identified a critical set of 10 PRDEGs that contribute to diverse cellular processes. Our findings underscore the potential interplay between these genes and IS, shedding light on their diagnostic and therapeutic significance.

ATF6’s up-regulation in IS samples reflects its crucial role as a cellular stress-response mechanism. This discovery opens avenues for ATF6-targeted interventions to alleviate IS-induced cellular stress. BIRC2 emerges as a potent diagnostic discriminator with clinical implications, aligning with recent research, and offering insights into IS mechanisms. BNIP3L’s up-regulation and its involvement in mitophagy indicate its protective role against IS damage. This observation emphasizes the prospect of stimulating BNIP3L expression for IS therapeutics. CTSG’s up-regulation and its potential as a diagnostic marker underscore its significance as an inflammation-related gene product. ELAVL1’s down-regulation reflects its potential role in mitigating neuronal impairment, guiding future research into its regulatory mechanisms. GSK3B’s up-regulation holds promise for IS treatment, with Tideglusib as a potential candidate for drug repurposing. IKBKE’s down-regulation and STAT3’s differential expression underscore their roles, warranting further investigations. NDUFA13’s down-regulation implicates its contribution to IS pathophysiology, and VIM’s upregulation opens prospects for potential therapeutic interventions.

In summary, our study provides a comprehensive view of the 10 PRDEGs’ roles in IS, offering insights into potential targets for diagnostic and therapeutic approaches. By unraveling the intricate network of gene expressions associated with IS, this work contributes to the advancement of knowledge in this field, potentially paving the way for novel strategies in managing and treating ischemic stroke.