Introduction

It is estimated that about a quarter of the world’s population is infected with Mycobacterium tuberculosis (MTB). After being infected with MTB, the risk of tuberculosis is the highest in the first two years (about 5%), and then it is much lower1. Latent tuberculosis infection (LTBI) refers to the persistent immune response to MTB antigen, but there is no clinical evidence of ATB. The diagnosis of LTBI lacks the gold standard2,3.

In the Stop TB Strategy, the World Health Organization proposes to “stop the global tuberculosis epidemic” in 2035 4. However, the current tuberculosis epidemic has not shown any trend of being contained. Developing a new tuberculosis vaccine, finding biomarkers to distinguish LTBI from ATB, and developing drugs to treat drug-resistant tuberculosis are crucial in controlling the epidemic5.

MTB can survive in the host in a dormant state for a long time, and it can also be reactivated and progress to ATB. The development from LTBI to ATB is a series of heterogeneous processes with specific microbiological and immunological characteristics6. Among them, the change of host immune state is the initial factor. Studying the host immune mechanism and then implementing preventive anti-tuberculosis treatment for high-risk populations is of great significance for reducing the incidence of tuberculosis. In addition, understanding the characteristics of immune function induced by MTB is also helpful in finding biomarkers to predict the progress of LTBI to ATB and developing immunotherapy drugs for tuberculosis7.

Although the immune response to MTB is mainly concentrated in the lungs, the circulating immune cells in peripheral blood determine the pathological state of tuberculosis infection. Whole blood transcriptome analysis can not only help us to determine the molecular components of potential infection but also help us to understand the host immune response in tuberculosis8. The developed signature of blood transcriptome, which can distinguish LTBI from ATB, and analyze the immune status of different ATB may help us to evaluate the risk of tuberculosis reactivation, and then carry out preventive treatment for high-risk infected people, which is very important for reducing the incidence of tuberculosis and controlling the epidemic situation of tuberculosis.

The datasets in the public database provide us with the possibility of investigating molecular patterns through bioinformatics. Studies have shown that the ability of single-gene biomarkers to predict diseases is insufficient, and a signature containing several genes may be a better choice9. Therefore, we use meta-analysis to integrate transcriptome datasets from different studies and screen biomarkers that distinguish LTBI from ATB.

Materials and methods

Experimental design

The experimental design flow chart is shown in Fig. 1.

Fig. 1
figure 1

Flow chart of finding immune-related signature through multi-cohort study.

Data source

The dataset included in this study was downloaded from GEO database (http://www.ncbi.nlm.nih.gov/geo). The downloaded dataset meets the following criteria: (1) The cases contained in the dataset are mRNA expression data of blood. (2) The specimens contain cases of LTBI and ATB. (3) ATB cases were collected before treatment. (4) The patient’s age is over 15 years old. (5) HIV negative. Based on these four standards, 13 datasets were included, and we excluded the GSE19444 dataset because it is a subset of the GSE19491 dataset. To better understand racial disparities in TB, the specimens of these datasets come from eight different countries and regions, including high- and medium-incidence regions. The details of datasets are shown in Table 1 and supplement Table 1.

Table 1 The detailed information of 13 datasets.

We selected GSE19491, GSE28623, and GSE37250 for WGCNA analysis and merged three datasets into a training cohort, in which there were 177 LTBI and 204 cases of ATB. GSE144127, GSE152532, GSE39939, GSE39940, GSE40553, GSE54992, and GSE62525 were merged into testing cohort. There are 315 cases of LTBI and 268 cases of ATB in testing cohort. The training and testing cohorts merged into the total cohort, including 492 LTBI and 490 cases of ATB.

There is no expression of FCGR1B in GSE25534 dataset, and there is no expression of FCGR1B and SLC26A8 in GSE74092 dataset, so these two datasets are not merged into the total cohort, but verified separately.

Weighted correlation network analysis (WGCNA)

WGCNA can be used to find modules with highly correlated genes, and the characteristic gene network method is used to associate modules with each other, correlate the module with the external sample traits, and calculate the module member metrics. The relevance network promotes the network-based gene screening method, which can be used to identify candidate biomarkers or therapeutic targets10. In the datasets of GSE19491, GSE28623, and GSE37250, the top 25% genes with the highest variance are used for WGCNA analysis to ensure the accuracy of quality results, and the “pickSoftThreshold” function is used to select the best soft threshold (β). Then a weighted adjacency matrix is constructed. Using the TOM dissimilarity measure (1-TOM) based on a hierarchical clustering tree algorithm, the module has more than 50 genes. Each module is assigned a random color. Modular characteristic genes represent the overall gene expression profile of each module. Gene significance (GS) indicates the correlation between gene and clinical phenotype. The relationship between module and disease state is reflected by module significance (MS). Genes with critical values of MS > 0.8 and GS > 0.5 in the significant module were identified as key genes.

Determination of key genes

We use the “veen” package of R language to analyze the intersection genes of key genes in GSE19491, GSE28623 and GSE37250, and identify the intersection genes as hub genes. We use the “sva” software package to merge the datasets and conduct standardized analysis.

Construction of predictive model based on multiple machine learning algorithms

In the training cohort, we use the R language package “caret” to build machine learning models, including random forest (RF), support vector machine (SVM), generalized linear model (GLM), and extreme gradient lifting (XGB). We use “pROC” to visualize the AUC; then, we use “DALEX” to explain these four models. Finally, we draw the residual distribution of each model11.

To verify whether the machine learning model can distinguish between ATB and LTBI, we analyzed the ROC curves of signatures in each dataset: training cohort, testing cohort, and total cohort. We also compared our signature with other signatures in the total cohort 12,13,14,15,16,17,18,19,20,21,22,23.

In order to verify whether the machine learning model can distinguish LTBI from ATB in HIV-positive patients, we further verify it in the GSE37250 dataset. In the GSE37250 dataset, there are 84 cases of HIV-positive LTBI patients and 98 cases of HIV-positive ATB patients.

Immune infiltration analysis

CIBERSORT is an immunoassay tool developed by Stanford University. It was published in Nature Methods in 2015 and has been cited nearly 6000 times24. CIBERSORT performs deconvolution analysis based on the linear support vector regression principle and provides gene expression feature sets of 22 immune cell subtypes25.

Clustering analysis of ATB cases

Based on the expression of genes in the machine learning model, we used the R package “ConsensusClusterPlus” for unsupervised clustering analysis26. We use the k-means algorithm to classify 472 ATB cases into clusters and make 1,000 iterations on the clusters. The value of k is defined from 1 to 9 to generate different subtypes. We choose the best number of clusters according to the clustering consensus score. We use principal component analysis (PCA) to visualize the distribution of clusters.

Gene set variation analysis (GSVA)

We use the R package “GSVA” to analyze GSVA enrichment and clarify the differences in enrichment gene sets among different clusters. Gene matrix transposition (gmt) files including “c2.cp.kegg.symbols.gmt” and “c5.go.symbols.gmt” were downloaded from the database of MSigDB website for further GSVA analysis27. We use the R package “limma” to identify the pathways and biological functions of differential expression by comparing GSVA scores among different clusters. P value < 0.05 was statistically significant.

Screening of potential therapeutic drugs

We then used the “pRRophetic” R package to assess the treatment response, as indicated by the IC50 of targeted and immunotherapy drugs28.

Statistical analysis

We use R language (version 4.1.3) for statistical analysis. Spearman correlation calculated by the “cor.test” function was used to analyze the relationship between the expression genes and immune cells. P value < 0.05 was statistically significant.

Results

Identification of key gene by WGCNA

In GSE37250 dataset, there were 83 cases of LTBI and 97 cases of ATB (HIV negative). We selected the top 25% genes (7853 genes) with significant fluctuation for WGCNA analysis. When the soft threshold is 2, we establish a scale-independent topological network. The correlation analysis between MS value and sample traits found that the red module was most closely related to the progress from LTBI to ATB (cor = 0.89, p = 3e-33). Under the precise cut-off criteria (MS > 0.8 and GS > 0.5), seven genes from the central module were identified as important genes (Fig. 2A and B; Supplement Fig. 1).

Fig. 2
figure 2

Evaluating the module-trait association through the correlation between the module characteristic genes and the sample traits in GSE37250 (A), GSE28623 (C), and GSE19141 (E) dataset. B. The gene scatter plot in the red module in the GSE37250 dataset. D. The gene scatter plot in the yellow module in the GSE28623 dataset. F. The gene scatter plot in the brown module in the GSE19141 dataset. G. The intersection gene of important genes in three datasets. H. The expression of hub genes in LTBI and ATB cases in the training cohort. I. The expression of hub genes in LTBI and ATB cases in testing cohort. J. ROC curve analyses for SLC26A8, CD274, FCGR1B, SERPING1, and ANKRD22 in the training cohort. K. Residual distribution of each machine learning model. L. ROC analysis of four machine learning models in the training cohort.

In GSE28623 dataset, there are 25 cases of LTBI and 46 cases of ATB. The top 25% genes (4924 genes) with significant fluctuation were selected for WGCNA analysis. We establish a scale-independent topological network when the soft threshold is 4. The analysis shows that the yellow module is most closely related to the progress from LTBI to ATB (cor = 0.76, p = 1.9e-104). 113 genes from the central module were identified as important genes (Fig. 2C and D; Supplement Fig. 2).

In GSE19141 dataset, there are 69 cases of LTBI and 61 cases of ATB, and there are 6228 genes in the top 25% with significant fluctuation. WGCNA analyzes these genes. When the soft threshold is 1, we establish a scale-independent topological network. It is found that the brown module is most closely related to the progress from LTBI to ATB (cor = 0.92, p = 1.8e-168). According to the criteria of MS > 0.8 and GS > 0.5, we selected 16 genes from the central module as important genes (Fig. 2E and F; Supplement Fig. 3).

Determination of hub genes

We used the “veen” package to analyze whether the important genes of GSE19491, GSE28623, and GSE37250 intersect. We found that SLC26A8, CD274, FCGR1B, SERPING1, and ANKRD22 are the intersection genes of three datasets, and we identified these genes as hub genes. (Fig. 2G). In the training cohort, compared with LTBI patients, the expression of hub genes in ATB patients increased significantly (Fig. 2H), and similar findings were also observed in the testing dataset (Fig. 2I). ROC curve analysis of training datasets shows that the AUC values of SLC26A8, CD274, FCGR1B, SERPING1, and ANKRD22 are 0.745, 0.750, 0.798, 0.769, and 0.783, respectively (Fig. 2J).

Construction of machine learning models

In training cohort, we constructed four machine learning models based on the expression of hub genes, including RF, SVM, GLM and XGB. It is found that the residuals of SVM and RF machine learning models are low (Fig. 2K). The AUC of the four models is 0.918 for RF, 0.924 for SVM, 0.905 for XGB and 0.904 for GLM, respectively (Fig. 2L). Based on residual and AUC, the SVM machine learning model best distinguishes LTBI from ATB. We select the three-gene signature in the SVM model (SLC26A8, ANKRD22, and FCGR1B) for further analysis.

ROC analysis showed that the AUC of the three-gene signature was 0.947 in the training cohort (Fig. 3A), 0.700 in the testing cohort (Fig. 3B), and 0.801 (Fig. 3C) in the total cohort.

In the total cohort, we compared the identification potential of the three-gene signature with other signatures. The AUC curve shows that except for Jan’s Sig (22-gene signature), the ability of the three-gene signature to distinguish LTBI from ATB is not inferior to other models (Fig. 3D-O).

Fig. 3
figure 3

To verify the diagnostic value of the three-gene signature by ROC analysis in the training set (A), testing cohort (B), and total cohort (C). D-O. In the total cohort, the ability of other signatures to distinguish LTBI from ATB.

Table 2 shows the AUC of the three-gene signature in each dataset. The GSE25534 dataset does not have FCGR1B gene expression, and the AUC of SLC26A8 and ANKRD22 in differentiating LTBI from ATB is 0.988 (95% CI: 0.961-1.000). The platform file GPL21040 of GSE74092 data set only has 281 genes expressed (excluding FCGR1B and SLC26A8), so no further analysis was conducted.

Table 2 In each dataset, the diagnostic value of the three-gene signature was verified by ROC analysis.

Among HIV-positive cases in GSE37250 dataset, the AUC of identifying LTBI and ATB by the three-gene signature is 0.892 (95% CI: 0.844–0.938) (Fig. 4A).

Fig. 4
figure 4

A. The AUC curve of the three-gene signature in HIV positive cases. B. Difference of immune cell infiltration between LTBI and ATB. C. Correlation analysis between SLC26A8, ANKRD22, FCGR1B, and infiltrated immune cells. D. The cluster number is most stable when the k value is set to 2. E. PCA showed significant differences between the two clusters. F. SLC26A8, ANKRD22, and FCGR1B expression in C1 and C2 cases in total cohort. G. Difference of immune cell infiltration between C1 and C2. H. The difference in biological function between C1 and C2 was analyzed by the GSVA method.

Immune cell infiltration in LTBI and ATB

CIBERSORT was used to analyze the difference of immune cell infiltration between LTBI and ATB. It was found that the expressions of B cells naïve, B cells memory, T cells CD8, T cells CD4 memory resting, NK cells resting, NK cells activated, macrophages M2 and eosinophils increased in LTBI. The expression of T cells gamma delta, monocytes, macrophages M0, dendritic cells activated, mast cells activated and neutrophils increased in ATB (Fig. 4B).

We used spearman correlation to analyze the correlation between FCGR1B, SLC26A8, and ANKRD22 and immune cell infiltration. The study showed that the expression of FCGR1B, SLC26A8, and ANKRD22 was positively correlated with the levels of dendritic cells activated, macrophages M0, mast cells activated, monocytes and neutrophils. It was negatively correlated with the expression levels of B cells naïve, B cells memory, macrophages M2, mast cells resting, NK cells activated, NK cells resting, T cells CD4 memory activated, T cells CD4 memory resting and T cells CD4 naïve (Fig. 4C).

Clustering analysis of ATB cases

In the total cohort, based on the expressions of SLC26A8, ANKRD22, and FCGR1B, we use a consensus clustering algorithm to cluster 472 ATB cases. When the value of k is set to 2(k = 2), the optimal number of clusters is observed (Fig. 4D). PCA analysis showed that 472 patients with tuberculosis could be well divided into the cluster 1 (C1, n = 189) and the cluster 2 (C2, n = 283) (Fig. 4E).

Compared with C1, the expression of three genes in C2 is higher (Fig. 4F).

Through the analysis of immune cell infiltration, we found that there were significant differences in immune state between C1 and C2 (Fig. 4G). In C1, B cells naïve, B cells memory, T cells CD8, T cells CD4 naïve, T cells CD4 memory resting, T cells gamma delta, NK cells resting, NK cells activated, macrophages M2, the expressions of mast cells resting and eosinophils increased significantly (Fig. 4G). In C2, the expressions of monocytes, macrophages M0, dendritic cells activated, mast cells activated and neutrophils increased significantly (Fig. 4G).

GSVA analysis results

We identified the pathway activity and biological function related to each cluster by GSVA. Functional enrichment analysis showed that inositol phosphate metabolism, Huntingtons disease, long-term potentiation, tryptophan metabolism, Glycosaminoglycan biosynthesis - heparan sulfate, beta-Alanine metabolism, RNA polymerase, RNA degradation, spliceosome and aminoacyl-tRNA biosynthesis were significantly enriched in C1 (Fig. 4H).

In contrast, Cytosolic DNA-sensing pathway, apoptosis, type II diabetes mellitus, dorsoventral axis formation, vascular smooth muscle cell, EMC receptor interaction, O glycan biosynthesis, JAK-STAT signaling pathway, glycosaminoglycan degradation, chemokine signaling pathway and epidermal cell signaling in helicobacter infection were enriched in C2 (Fig. 4H).

Screening of potential therapeutic drugs

We have identified 35 kinds of immunotherapy or targeted drugs, and the IC50 values of these drugs in C2 are lower than those in C1, including Bicalutamide, CGP.60,474, KU.55,933, Parthenolide and NSC.87,877(Fig. 5A-E, Supplement Table 2). We also identified 62 drugs with IC50 values lower than C2 in C1, including AZD8055, Camptothecin, Lenalidomide, Metformin and Metformin (Fig. 5F-J, Supplement Table 2).

Fig. 5
figure 5

Drug sensitivity analysis. A-E showed that the drug had a lower IC50 value in the C2 group. F-J showed that the drug had a lower IC50 value in the C1 group.

Discussion

Although the tuberculin skin test (TST) and the interferon gamma release assays (IGRA) have been widely used in LTBI screening among students and special people with immunosuppression, and in clinical auxiliary diagnosis of ATB disease, TST and IGRA are of limited value in distinguishing ATB cases from LTBI29.

MTB infection is a dynamic process. Host and bacteria are the main factors that induce LTBI to develop into ATB, and the change of host immune state is the most direct reason for LTBI to develop into ATB. The immunological mechanism of tuberculosis infection is very complicated, and innate immunity and adaptive immunity are very important in maintaining the state of latent infection. In innate immunity, alveolar macrophages, dendritic cells, neutrophils, and natural killer cells play an important role in tuberculosis infection. Adaptive immunity includes cellular immunity and humoral immunity. Cellular immunity mediated by T cells is the main immune mechanism of anti-TB6. It was correlated with the expression of many immune cells. However, we should have elaborated on the dynamic changes of the host immune state from normal to LTBI and then to ATB, this is not the main purpose of this study.

Our study found that SLC26A8, CD274, FCGR1B, SERPING1 and ANKRD22 may be hub genes for LTBI to develop into ATB. Among these five genes, the interesting one is the CD274 gene. In some literatures focusing on the study of tuberculosis signature, it is found that CD274 may play an important role in the process of tuberculosis activity 30,31. Many studies show that anti-PD-(L)1(CD274) therapy may lead to ATB32. However, some studies have shown that patients with both malignant tumors and tuberculosis benefit from anti-PD-(L)1 treatment, while the anti-tuberculosis response of ATB patients is not affected. It is worth noting that the combination of anti-PD-(L)1(CD274) and anti-tuberculosis therapy is well tolerated and has no obvious accidental toxicity33. Therefore, the effects of CD274 and anti-PD-(L)1 therapy on ATB need further study. Compared with the normal control, the expression of SERPING1 in tuberculosis patients increased continuously, and its coding protein C1-INH was positively correlated with the level of complement C1q. In tuberculosis patients, the increase of C1-INH and C1q may represent the immune escape mechanism of Mtb34. However, no more research exists on the correlation between SLC26A8, FCGR1B, and ANKRD22 and tuberculosis.

It is of great significance to find biomarkers to distinguish LTBI from ATB for the prevention and treatment of tuberculosis35. Relying on a single biomarker to distinguish ATB disease from LTBI may not achieve ideal results, but the combination of multiple biomarkers may bring new hope. The prospective study conducted by Sweeney et al. shows that the mRNA models based on three genes (GBP5, DUSP3, KLF2) have a reasonable correlation with the development of LTBI into ATB, but its exact value needs more research21. In distinguishing the predictive ability of LTBI and ATB, the three-gene signature is not lower than Sweeney’s signature. In addition to Sweeney’s signature, researchers also reported some other signatures. Except for Jan’s signature15, the discrimination ability of the 3-gene model is not lower than that of other models, and the number of genes is also the lowest. At the same time, our gene model also showed stable discrimination performance in HIV-positive patients.

The three-gene model can distinguish LTBI from ATB and is closely related to immune cells infiltration. According to the expression of three genes, ATB can be well divided into two clusters. These two clusters have different immune states. However, whether different immune states are related to the therapeutic effect of anti-tuberculosis drugs, the choice of therapeutic drugs and drug resistance needs further study. GSVA analysis showed that C1 was enriched in metabolism-related pathways. In contrast, C2 was mainly enriched in apoptosis, JAK-STAT signaling pathway and immune-related pathways, suggesting that C1 and C2 might differ in tuberculosis infection, immunity and treatment, and drug sensitivity analysis verified our conjecture.

Although chemotherapy should always be the primary treatment for tuberculosis at present, immunotherapy can only be used as an auxiliary treatment method, and clinicians lack understanding of immunotherapy for ATB. Tuberculosis immunotherapy includes immunoactive substances, TB therapeutic vaccines, chemical agents, cellular therapy, etc. Immunoactive substances include cytokines such as IL-2 and GM-CSF and immune blockers such as IL-4, denileukin diftitox, and PD-1 blockers36. Immunotherapy, such as cytokines and immune blockers, is limited to refractory pulmonary tuberculosis or in vitro experiments, including IL-237, GM-CSF38, and PD-1 blockers39. However, these drugs have not been identified as biomarker indications, which need improvement. Moreover, unlike the pathological diagnosis of tumors, we do not detect pathological immune responses in the pathological diagnosis of tuberculosis patients, which also needs improvement. The only immunotherapy for HIV-positive patients is TB therapeutic vaccines. Whether the immune change process between LTBI and ATB is similar to that of HIV-positive patients needs further study. Generally speaking, immunotherapy for tuberculosis still has great potential, especially for drug-resistant tuberculosis.

In this study, we included all the datasets containing LTBI and ATB in the GEO database. Compared with previous studies, we included the largest number of LTBI cases (492 cases of LTBI). Of course, our research also has limitations. First of all, we did not include the normal sample. There are reliable and economical methods to distinguish normal from LTBI, such as TST, and IGRAs35. The focus of this study is how to identify LTBI and ATB, so we did not include normal cases. Moreover, if too many datasets are included, the expression of some genes may be missing during the batch removal process. A total of 10 datasets were included in the total cohort, while six datasets used the GPL10558 platform, which makes the analysis results more reliable. Secondly, the three-gene signature can distinguish LTBI from ATB, but whether it can identify high-risk LTBI still needs further study. Third, our research results need to be further verified in a larger multi-center study. Because the number of specimens we included is large enough, we have verified it not only in the testing and total cohorts but also in various datasets. To some extent, it makes up for this defect. Finally, the effectiveness of our signature in children, tumors, rheumatic immune system diseases and patients undergoing immunotherapy needs further verification.

Conclusion

Generally speaking, our research has found a three-gene signature that can not only distinguish LTBI from ATB but also be closely related to tuberculosis immunity. The results of this study may be helpful to the clinical management of tuberculosis patients and guide the testing of new drugs in clinical trials, but they still need further verification.