Introduction

Colorectal cancer (CRC) is one of the most prevalent cancers worldwide1,2, and its incidence is closely related to global cancer mortality3. About 50% of postoperative deaths among CRC patients are related to distant metastasis (DM)4. About 75%–90% of patients diagnosed with CRC are considered unable to undergo surgical treatment5. In CRC patients, the lung is one of the common sites of metastasis6. Studies have indicated that patients who present with lung metastases constitute approximately 10% to 15% of the total CRC patient population7. Early detection of lung metastasis is crucial in the clinical management of these patients. Compared with other metastases, such as liver and peritoneal metastasis, lung metastasis generally has a relatively better prognosis8,9,10. Studies have shown that early diagnosis of lung metastasis in CRC patients, coupled with appropriate treatment, can lead to a 5-year survival rate of more than 50% for some patients11. It is therefore imperative to establish an efficient and feasible prediction model that helps clinicians diagnose lung metastasis early and treat it promptly, ultimately improving patient prognosis.

In prior studies, researchers have developed predictive models for lung metastasis in patients with CRC11,12. However, these models lack external validation data to assess their feasibility, and their performance requires further improvement.

Machine learning (ML) comprises algorithmic models that learn patterns and relationships from data13. As technology has advanced, medical data have become increasingly vast and complex, creating numerous opportunities for ML to address clinical challenges. ML techniques have been applied to various clinical problems and have demonstrated superior predictive performance compared with traditional algorithms14. Our study seeks to develop an explainable ML algorithm that can predict lung metastasis in CRC.

Methods

Study population

Our dataset was sourced from the Surveillance, Epidemiology, and End Results (SEER) database in the United States using SEER*Stat software (version 8.4.13). The study selected CRC patients diagnosed from 2010 to 2015. The patient data screening process is shown in Fig. 1. Only patients with colorectal cancer as the primary cancer were considered; patients with incomplete data, unknown pathological diagnosis information, or ambiguous histology were excluded. We collected the following data: age at diagnosis, gender (male or female), race, year of diagnosis, marital status, T stage, N stage, and histological type (8140/3, 8210/3, 8261/3, 8263/3, 8480/3, 8490/3). Additional variables were grade (degree of tumor differentiation), primary tumor site (colon or rectum), primary tumor size, and CEA level. The tumor staging utilized in this study was based on version 0204. The ICD-O-3 manual served as the reference for filtering histological type codes, and the selected site codes were C18.0 through C18.9 and C20.9. The AJCC 7th edition TNM staging system was adopted for this study. Because the SEER database comprises publicly available data, neither informed consent from patients nor ethical approval was required.

Fig. 1 Study design and data processing flow diagram.

We selected patient data from Beijing Electric Power Hospital of Capital Medical University for external validation and, to test the predictive performance of the optimal model, included patients who did not receive neoadjuvant radiotherapy before surgery. The research was retrospective and did not violate patient privacy in any respect; therefore, this study was granted an ethical exemption.

SPSS software (version 26.0) was used for statistical analyses. We performed Spearman correlation tests on all variables included in this study and presented the results as a heatmap to illustrate the correlations between variables. Categorical variables were expressed as counts and percentages. We used the chi-square test, Fisher’s exact test, and the Mann–Whitney U test to compare variables between the two groups. Variables that were statistically significant in the univariable logistic regression (LR) were entered into a multivariable LR. All p-values were two-sided, and p < 0.05 was considered statistically significant.
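The Spearman correlation used for the heatmap can be illustrated with a minimal sketch. The study ran this analysis in SPSS; the pure-Python version below is only an illustration of the statistic itself (Pearson correlation applied to tie-averaged ranks). The function names `average_ranks` and `spearman_rho` and the six-patient toy data are hypothetical and are not taken from the study.

```python
from math import sqrt


def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j across the run of values tied with values[order[i]].
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman_rho(x, y):
    """Spearman's rho: Pearson correlation of the rank-transformed data.

    Assumes both inputs have the same length and are not constant.
    """
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(x)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sqrt(sum((a - mx) ** 2 for a in rx))
    sy = sqrt(sum((b - my) ** 2 for b in ry))
    return cov / (sx * sy)


# Hypothetical ordinal codes for six patients (illustrative only).
t_stage = [1, 2, 2, 3, 4, 4]
n_stage = [0, 0, 1, 1, 2, 2]
rho = spearman_rho(t_stage, n_stage)
print(round(rho, 3))
```

Computing this coefficient for every pair of variables gives the matrix that a correlation heatmap displays.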

We used Python 3.9.12 to construct the machine learning models. The selected variables were incorporated into the ML algorithms to establish a model for CRC lung metastasis. The sampled data were randomly divided into a training set and a test set in an 8:2 ratio. We used seven standard ML algorithms to establish models for CRC lung metastasis. By comparing the predictive performance of the models (accuracy, precision, recall, F1 score, and AUC), we selected the optimal model and evaluated its performance using fivefold cross-validation. Random Forest (RF) is an ML algorithm that can develop predictive models based on sample data. Previous studies have employed the RF algorithm to forecast renal disease with high accuracy15. Decision Tree (DT) is one of the more successful algorithms in machine learning16. Building a model with the DT algorithm involves three key steps: variable selection, node splitting, and tree pruning17. Support Vector Machine (SVM) is primarily employed to address classification problems18. Naive Bayes (NB) is a simple variant of Bayesian networks that demonstrates strong predictive performance on classification problems19. K-Nearest Neighbor (KNN) is a simple ML algorithm that functions as a non-parametric classifier20. eXtreme Gradient Boosting (XGBoost), a representative ensemble learning method, is widely popular because of its excellent predictive performance21. Gradient Boosting Machine (GBM) uses a diverse array of smaller models, which are then combined to produce the final aggregated prediction22. This study uses the area under the receiver operating characteristic curve (AUC) and the area under the precision-recall curve (AUPR) to evaluate the predictive performance of the model.
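The comparison metrics named above can be made concrete with a short sketch. It is not the code used in the study; it computes accuracy, precision, recall, and F1 score from the confusion-matrix counts, and the AUC as the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative case. The function names, the eight toy labels and scores, and the 0.5 decision threshold are all assumptions made for illustration.

```python
def classification_metrics(y_true, y_pred):
    """Accuracy, precision, recall and F1 for binary labels (1 = metastasis)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {
        "accuracy": (tp + tn) / len(y_true),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


def roc_auc(y_true, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    Assumes at least one positive and one negative label.
    """
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical test-set labels and predicted probabilities (illustrative only).
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.7, 0.3, 0.2, 0.1, 0.05]
y_pred = [1 if s >= 0.5 else 0 for s in scores]  # assumed 0.5 threshold

metrics = classification_metrics(y_true, y_pred)
auc = roc_auc(y_true, scores)
print(metrics, round(auc, 3))
```

Accuracy, precision, recall, and F1 depend on the chosen threshold, whereas the AUC depends only on how the scores rank the cases, which is one reason the two kinds of metric are reported together.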

Sampling methods such as oversampling and undersampling mainly address class imbalance in the original data. We used the Synthetic Minority Oversampling Technique (SMOTE) to oversample the raw data and RandomUnderSampler to undersample it; both are widely used to improve prediction models.
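The idea behind the two resampling strategies can be sketched in a few lines. This is a simplified pure-Python stand-in, not the library implementations used in the study: SMOTE-style oversampling creates a synthetic minority case by interpolating between a minority case and one of its nearest minority neighbours, while random undersampling keeps only a random subset of the majority class. The function names, the parameter `k`, the seeds, and the toy feature vectors are hypothetical.

```python
import random


def smote_like(minority, n_new, k=2, seed=0):
    """Create n_new synthetic minority samples by neighbour interpolation.

    Assumes the minority class contains at least two samples.
    """
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # Rank the other minority samples by squared Euclidean distance.
        others = [m for m in minority if m is not base]
        others.sort(key=lambda m: sum((a - b) ** 2 for a, b in zip(base, m)))
        neighbour = rng.choice(others[:k])
        gap = rng.random()  # position along the segment base -> neighbour
        synthetic.append([a + gap * (b - a) for a, b in zip(base, neighbour)])
    return synthetic


def random_undersample(majority, n_keep, seed=0):
    """Keep a random subset of the majority class without replacement."""
    rng = random.Random(seed)
    return rng.sample(majority, n_keep)


# Hypothetical two-feature data: 3 minority cases and 10 majority cases.
minority = [[1.0, 1.0], [1.2, 0.9], [0.8, 1.1]]
majority = [[float(i), float(i % 3)] for i in range(10)]

synthetic = smote_like(minority, n_new=7)
oversampled_minority = minority + synthetic       # 10 vs 10 after oversampling
undersampled_majority = random_undersample(majority, n_keep=3)  # 3 vs 3
print(len(oversampled_minority), len(undersampled_majority))
```

Oversampling keeps every original record and adds interpolated ones, whereas undersampling discards majority records, so the two strategies yield training sets of very different sizes.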

We used SHapley Additive exPlanations (SHAP) to visualize and explain the importance of the variables in the optimal model. SHAP is an ML interpretation technique that explains the output of a model by quantifying the impact of each feature on the outcome, which it does by calculating Shapley values.
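A worked toy example may help show what a Shapley value is. The sketch below computes exact Shapley values by enumerating every feature subset, which is feasible only for a handful of features; it is not the SHAP library and not the study's RF model. The toy risk function, its coefficients, and the all-zero baseline are assumptions chosen for illustration, and "absent" features are simply set to their baseline value, a simplification of how SHAP handles missing features.

```python
from itertools import combinations
from math import factorial


def shapley_values(model, instance, baseline):
    """Exact Shapley values for one instance by enumerating all subsets."""
    n = len(instance)

    def value(subset):
        # Features in the subset take the instance value, others the baseline.
        x = [instance[i] if i in subset else baseline[i] for i in range(n)]
        return model(x)

    phis = []
    for i in range(n):
        others = [j for j in range(n) if j != i]
        phi = 0.0
        for size in range(n):
            # Shapley weight |S|! (n - |S| - 1)! / n! for subsets of this size.
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                s = set(subset)
                phi += weight * (value(s | {i}) - value(s))
        phis.append(phi)
    return phis


def toy_risk(x):
    """Hypothetical risk score with one interaction term (illustrative only)."""
    deposits, cea_positive, large_tumor = x
    return (0.05 + 0.30 * deposits + 0.20 * cea_positive
            + 0.10 * deposits * large_tumor)


instance = [1, 1, 1]   # hypothetical patient with all three features present
baseline = [0, 0, 0]   # hypothetical reference patient
phi = shapley_values(toy_risk, instance, baseline)
print([round(p, 3) for p in phi])
```

The contributions sum to the difference between the prediction for the instance and the prediction for the baseline, and the interaction term is split equally between the two features involved; this additivity is what allows per-patient contributions to be stacked in a force plot.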

The SEER database indicates that only a small proportion of colorectal cancer patients exhibit lung metastasis. We addressed this imbalance by applying under-sampling and over-sampling to the original data. A correlation matrix was then used to evaluate the relationships between variables in the sampled data. As illustrated in Fig. 2, the relationships between variables become more pronounced after sampling.

Fig. 2 Correlation heatmaps. (A) Correlation heatmap in the over-sampled dataset. (B) Correlation heatmap in the under-sampled dataset.

Results

This study included 39,674 CRC patients, of whom 1,369 (3.5%) had lung metastasis. We collected data from 207 patients in a hospital in China from 2010 to 2015 to validate the feasibility of the model. The baseline data for this study are detailed in Table 1. Nine risk factors were included in the multivariable LR model: age, histological type, grade, primary tumor site, T stage, N stage, tumor size, CEA level, and tumor deposits. Further details are presented in Table 2. The LR model achieved an AUC of 0.854 (95% CI 0.844–0.863; p < 0.001).

Table 1 The baseline characteristics.
Table 2 Univariable and multivariable LR analysis.

Seven ML algorithms were established and compared on accuracy, precision, recall, F1 score, and AUC. Models trained with oversampling outperformed those trained with undersampling; detailed results for the seven models built with oversampling are given in Table 3. Fig. 3 presents the performance of the 7 ML algorithms under both oversampling and undersampling. Notably, all models built with oversampled data achieved an AUC greater than 0.800. The RF algorithm was superior to the other algorithms: in the test set, RF achieved an AUC of 0.980 and an AUPR of 0.941, and in fivefold cross-validation it achieved an average accuracy of 0.936.

Table 3 Prediction performance of 7 ML algorithms on the over-sampled dataset.
Fig. 3 AUC values of 7 machine learning algorithms on over-sampled and under-sampled datasets. (A) The ROC in the test set with over-sampling. (B) The ROC in the training set with over-sampling. (C) The ROC in the test set with under-sampling. (D) The ROC in the training set with under-sampling.

On the external validation data from the Chinese hospital, RF also demonstrated strong predictive ability, with an accuracy of 0.961, an AUC of 0.927, and an AUPR of 0.657 (Fig. 4). We compared the RF algorithm with LR using AUC values; the RF algorithm (AUC = 0.980) outperformed the LR model (AUC = 0.854).

Fig. 4 Performance of the RF algorithm in predicting CRC lung metastasis, measured by AUPR and AUC. (A) The AUPR in the internal test set. (B) The ROC in the external validation set. (C) The AUPR curves in the external validation set.

We interpreted the RF model using SHAP values. In Fig. 5A, clinical features are ranked according to their average absolute SHAP values to illustrate their relative significance. Figure 5B offers a comprehensive visualization of how the various factors influence the RF output.

Fig. 5 The SHapley Additive exPlanations of the RF model. (A) SHAP feature importance quantified through the average absolute Shapley values. This plot illustrates the significance of each feature in the development of the predictive model. (B) Influence of each feature on the final model output, shown as the distribution of SHAP values. Each data point in a row represents an individual patient. The color indicates whether a continuous feature is at a high level (red) or a low level (blue) for that observation. For categorical features, blue signifies “no” and red signifies “yes”.

We selected two patients diagnosed with CRC, one with lung metastasis and one without, and interpreted their individual predictions from our constructed model (Fig. 6). For the patient with lung metastasis, significant risk factors included tumor deposits, tumor size ≥ 5 cm, CEA positive status, and age ≥ 60 years (Fig. 6A). Conversely, for the patient without lung metastasis, protective factors encompassed CEA negative status, Grade I tumor, absence of tumor deposits, colon involvement (T3), and belonging to other ethnicities (Fig. 6B).

Fig. 6 SHAP force plots for interpreting individual prediction outcomes. These plots offer a visual illustration of the RF model’s predictions, wherein the blue and red bars signify risk factors and protective factors, respectively. The length of the bars corresponds to the extent of feature importance. (A) Poor outcome; (B) favorable outcome.

As illustrated in Fig. 5, tumor deposits emerge as the most significant predictive factor for lung metastasis in CRC.

We built an online web calculator based on the RF algorithm (http://121.43.117.60:8003/).

Discussion

CRC is a prevalent malignancy. Approximately 25% of patients are diagnosed with distant metastasis, which remains the primary cause of mortality in this population23. Generally, CRC metastasis primarily involves tumor cells entering the liver via the portal venous system, after which they can disseminate to other organs, including the lungs24. However, lung metastases may circumvent the portal system by way of venous drainage that enters the systemic circulation directly25. Because the patterns of lung metastasis may differ, and because early detection and timely intervention are crucial, the risk factors for lung metastasis in CRC patients should be investigated separately. Studies have indicated that patients whose lung metastasis is identified at an early stage and who undergo surgical intervention have a 30% higher five-year survival rate than those who do not receive surgical treatment26. Existing detection methods for lung metastasis, such as PET-CT and biopsy, have limitations: PET-CT is not suitable for early screening because of its prohibitive cost and the potential risk of radiation damage, and biopsy, while able to confirm lung metastasis, carries risks of tumor dissemination and false negatives27. Artificial intelligence is already widely applied in medicine and serves as a crucial tool for precision medicine, assisting in the selection of optimal diagnostic and treatment strategies1. The machine learning model, built on clinical and pathological data, not only avoids high examination costs, limited scalability, and potential bodily harm but also exhibits strong predictive performance (accuracy of 0.961), helping clinicians identify high-risk patients and formulate tailored treatment plans.

We used an explainable ML algorithm that integrates clinical and pathological features to develop predictive models for lung metastasis in CRC, and we performed a comparative analysis of these models.

Among the approaches evaluated, the machine learning models demonstrated commendable performance. In previous research, explainable ML algorithms have not been used to make such predictions. The AUCs of the 7 algorithms predominantly exceed 0.800; consequently, we believe that our ML-based models have good predictive performance, with the RF algorithm performing best. Currently, research on CRC patients primarily emphasizes prognostic factors28,29,30. In prior studies, researchers have developed models while investigating the associated risk factors12,31. Some studies have predicted lung metastasis and assessed overall survival among these individuals32. Previous studies on lung metastasis of colorectal cancer were primarily based on traditional logistic regression12,33, and the AUC of these models was often less than 0.8; their predictive performance still requires improvement. In addition, our use of interpretable machine learning algorithms has substantially improved the interpretability of the models in practical applications, helping users understand how the models reach their predictions.

The LR revealed that age, histological type, grade, primary tumor site, T stage, N stage, tumor size, carcinoembryonic antigen (CEA) level, and tumor deposits were all independent predictors of lung metastasis in CRC patients. The RF feature importance ranking, consistent with the multivariable logistic regression analysis, indicated that tumor deposits are a key predictor, followed by CEA level and T stage. Notably, the T stage is a critical criterion for assessing tumor progression. Tumor deposits are observed in 20%–25% of patients with colon cancer33. At present, research on tumor deposits is limited, and most of it concentrates on their impact on prognosis34. CEA is one of the essential predictive factors in this study; the CEA level is closely related to distant metastasis35. Studies have demonstrated that the T stage is directly associated with distant metastasis in various regions of the tumor32. In this study, T stage emerged as one of the most significant predictors within the RF model.

Furthermore, multivariable logistic analysis demonstrated a positive correlation between T stage and the risk of lung metastasis. This association may be attributed to the direct invasion of cancer cells into the systemic circulation via the venous system35. The primary tumor site also emerged as a significant predictor in this study: the rectum exhibited a higher propensity for lung metastasis than the colon (including the sigmoid), which aligns with previous findings12,31. However, some studies suggest that the risk of lung metastasis differs minimally between these sites32. Consistent with previous research12, the N stage is a significant predictor of lung metastasis in CRC. This may be because the lymphatic system is the primary pathway for metastasis36 and the lung is among the organs with the highest abundance of lymph27. Furthermore, it has been indicated that lymphatic drainage from positive regional lymph nodes can directly reach the lungs, thereby facilitating lung metastasis37.

ML is a scientific discipline that employs computational techniques to learn from data. While statistics focuses on elucidating the relationships among data, computer science prioritizes the development of more efficient algorithms for computation. Machine learning thus represents the convergence of these two fields38. Whereas traditional statistics primarily emphasizes the testing of causal hypotheses, machine learning places greater emphasis on the predictive performance of models39. As a compilation of diverse algorithms, machine learning must take into account a range of factors, including data availability, data type, and the specific aspects that require prediction when addressing practical problems to identify the most suitable algorithm40. Predictive models developed through machine learning can enable healthcare professionals to deliver more personalized and precise diagnostic and therapeutic strategies for patients. As databases continue to expand and algorithms undergo optimization, machine learning algorithms are poised to assume an even more significant role within the medical domain.

We developed seven models to predict lung metastasis in colorectal cancer. The performance of these seven algorithms was assessed based on accuracy, precision, recall, F1 score, and AUC. Among them, RF demonstrated the best predictive performance, with an AUC of 0.980, surpassing the logistic regression model (AUC of 0.854). Consequently, RF emerged as the most effective algorithm.

Our study has several limitations: (1) The validation cohort consisted of single-center data with a limited number of patients, all of whom were Asian. (2) The accuracy of the model might be further enhanced by incorporating more risk factors associated with metastasis in future research. (3) The SEER database does not provide details on specific treatment plans; further analysis is required to assess their impact on patient prognosis. (4) Regional differences were not included in the decision-making process of this model.

Our study developed and validated a model utilizing explainable ML algorithms, which incorporates clinical features to quantify the primary factors contributing to lung metastasis. Among these factors, tumor deposits, CEA levels, and T stage emerged as the three most significant predictors of lung metastasis in CRC patients. In comparison to logistic regression models, the random forest algorithm demonstrated superior predictive capability; thus, it offers the potential for personalized treatment strategies. We have built a web calculator (http://121.43.117.60:8003/).