Explainable machine learning for predicting lung metastasis of colorectal cancer

Guo, Zhentian; Zhang, Zongming; Liu, Limin; Zhao, Yue; Liu, Zhuo; Zhang, Chong; Qi, Hui; Feng, Jinqiu; Yao, Peijie

doi:10.1038/s41598-025-98188-5


Article
Open access
Published: 19 April 2025


Zhentian Guo^1,2,
Zongming Zhang^1,2,
Limin Liu^1,2,
Yue Zhao^1,2,
Zhuo Liu^1,2,
Chong Zhang^1,2,
Hui Qi^1,2,
Jinqiu Feng² &
Peijie Yao²

Scientific Reports volume 15, Article number: 13611 (2025)


Abstract

Patients with lung metastasis of colorectal cancer typically have a poor prognosis, so an effective screening and diagnosis model is needed. Our study seeks to construct and validate a machine learning (ML) model that evaluates the risk of lung metastasis in patients with newly diagnosed colorectal cancer (CRC), and to explain it using Shapley Additive exPlanations (SHAP). From the Surveillance, Epidemiology, and End Results (SEER) database, 39,674 patients, all pathologically diagnosed with CRC between 2010 and 2015, were extracted for model development. We built models with seven ML algorithms on these data: Random Forest (RF), Decision Tree, Support Vector Machine, Naive Bayes, K-Nearest Neighbor, eXtreme Gradient Boosting, and Gradient Boosting Machine. We selected the best algorithm and visualized it using SHAP. We then validated the model on data from a Chinese hospital to assess its practicality. Of the 39,674 patients included, 1369 (3.5%) presented with distant lung metastasis. The RF algorithm demonstrated the highest predictive capability on the internal test set (AUC of 0.980, AUPR of 0.941) and also performed well on the external validation set. Based on this model, we built an open web calculator (http://121.43.117.60:8003/). The RF algorithm has demonstrated excellent predictive performance and can assist clinicians in devising more personalized treatment plans.

Introduction

Colorectal cancer (CRC) is highly prevalent worldwide^1,2, and its incidence rate is closely related to global cancer mortality³. About 50% of postoperative deaths among CRC patients are related to distant metastasis (DM)⁴. About 75%–90% of patients diagnosed with CRC are considered to be unable to undergo surgical treatment⁵. For colorectal cancer patients, the lung is one of the common sites of metastasis⁶. Studies have indicated that patients with CRC who present with lung metastases constitute approximately 10% to 15% of the total CRC patient population⁷. Early detection of lung metastasis is crucial in the clinical management of these patients. Compared with other metastases, such as liver metastasis and peritoneal metastasis, lung metastasis generally has a relatively better prognosis^8,9,10. Studies have shown that early diagnosis of colorectal cancer with lung metastasis, coupled with appropriate treatment, can lead to a 5-year survival rate of more than 50% for some patients¹¹. Therefore, it is imperative to establish an efficient and feasible prediction model. Such a model will assist clinicians in diagnosing lung metastasis early and treating it promptly, ultimately improving patient prognosis.

In prior studies, researchers have developed predictive models for lung metastasis in patients with CRC^11,12. However, these models lack external validation data to assess their feasibility, and their performance leaves room for improvement.

Machine learning (ML) comprises algorithmic models that learn patterns from data¹³. With the advancement of technology, medical data has become increasingly vast and complex, which has created numerous opportunities for machine learning to address clinical challenges. Machine learning techniques have been employed to tackle various clinical issues and have demonstrated superior predictive performance compared to traditional algorithms¹⁴. Our study seeks to develop an explainable ML algorithm that can predict CRC lung metastasis.

Methods

Study population

Our data set is sourced from the SEER database in the United States and was extracted using SEER*Stat software (version 8.4.13). The study selected CRC patients diagnosed from 2010 to 2015. The patient data screening process is shown in Fig. 1. Only colorectal cancer as a primary cancer was considered; patients with incomplete data, unknown pathological diagnosis information, or ambiguous histology were excluded. We collected the following data: age at diagnosis, gender (male or female), race, year of diagnosis, marital status, T stage, N stage, and histological type (8140/3, 8210/3, 8261/3, 8263/3, 8480/3, 8490/3). Additional variables encompassed grade (degree of tumor differentiation), primary tumor site (colon or rectum), primary tumor size, and CEA levels. The tumor staging utilized in this study was based on version 0204. The ICD-O-3 manual served as the reference for filtering histological type codes. The site codes we selected were C18.0 through C18.9 and C20.9. The AJCC 7th edition TNM staging system was adopted for this study. Because the SEER database comprises publicly available data, neither patient informed consent nor ethical approval was required for this study.

We selected patient data from Beijing Electric Power Hospital of Capital Medical University for external validation, including only patients who did not receive neoadjuvant radiotherapy before surgery, to test the predictive performance of the optimal model. The research was conducted retrospectively and did not compromise patient privacy; therefore, this study was granted an ethical exemption.

SPSS software (version 26.0) was used to perform statistical analyses. We performed Spearman correlation tests on all variables included in this study and presented the results as a heatmap to illustrate the correlations between variables. Categorical variables are expressed as counts and percentages. We used the chi-square test, Fisher’s exact test, and the Mann–Whitney U test to compare variables between the two groups. Variables that were statistically significant in the univariable logistic regression (LR) were entered into a multivariable LR. All p-values were two-sided, and p < 0.05 was taken as the criterion for statistical significance.
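The Spearman correlation mentioned above is the Pearson correlation computed on ranks. The following is a minimal standard-library sketch of that definition, not the SPSS procedure used in the study; the function names and the six example values are hypothetical.

```python
from math import sqrt


def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the run while the next value ties with the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2 + 1  # ranks are 1-based
        for k in range(start, end + 1):
            ranks[order[k]] = mean_rank
        start = end + 1
    return ranks


def spearman_rho(x, y):
    """Spearman correlation: the Pearson correlation of the two rank vectors."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mean_x, mean_y = sum(rx) / n, sum(ry) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    var_x = sum((a - mean_x) ** 2 for a in rx)
    var_y = sum((b - mean_y) ** 2 for b in ry)
    return cov / sqrt(var_x * var_y)


# Hypothetical ordinal codes for six patients (not taken from the study data).
t_stage = [1, 2, 2, 3, 4, 4]
n_stage = [0, 0, 1, 1, 2, 2]
rho = spearman_rho(t_stage, n_stage)
print(f"Spearman rho = {rho:.3f}")
```

Computing this coefficient for every pair of variables gives the matrix that a correlation heatmap displays.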

We used Python 3.9.12 to construct the machine learning algorithms, incorporating the selected variables to establish a model for CRC lung metastasis. The sampled data were randomly divided into a training set and a test set in an 8:2 ratio. We used seven standard ML algorithms to establish the models. By comparing their predictive performance (accuracy, precision, recall, F1 score, and AUC), we selected the optimal model and evaluated its performance using fivefold cross-validation. This study uses AUC and AUPR to evaluate the predictive performance of the model.

Random Forest (RF) is an ML algorithm that can develop predictive models based on sample data. Previous studies have employed the RF algorithm to forecast renal disease, demonstrating high levels of accuracy¹⁵. Decision Tree (DT) is one of the more successful algorithms in machine learning¹⁶. Establishing a model with the DT algorithm involves three key steps: variable selection, node splitting, and tree pruning¹⁷. Support Vector Machine (SVM) is primarily employed to address classification problems¹⁸. Naive Bayes (NB) is a straightforward variant of Bayesian networks that demonstrates strong predictive performance on classification problems¹⁹. K-Nearest Neighbor (KNN) is a straightforward ML algorithm that functions as a non-parametric classifier²⁰. eXtreme Gradient Boosting (XGBoost), a representative of ensemble learning, is widely popular due to its excellent predictive performance²¹. Gradient Boosting Machine (GBM) uses a diverse array of smaller models, which are then integrated to produce the final aggregated prediction²².
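The evaluation metrics named above can be written out directly from their definitions. This is a minimal standard-library sketch, not the authors' code: the function names, the 0.5 decision threshold, and the eight example labels and scores are assumptions for illustration. AUPR is estimated here by average precision, one common estimate of the area under the precision-recall curve; the paper does not state which estimator was used.

```python
def classification_metrics(y_true, y_pred):
    """Accuracy, precision, recall and F1 for binary labels (1 = metastasis)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {
        "accuracy": (tp + tn) / len(y_true),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


def roc_auc(y_true, scores):
    """AUC as the probability that a random positive outscores a random
    negative, with ties counted as one half."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def average_precision(y_true, scores):
    """Mean of the precision values at each true positive, scanning from the
    highest score down. Assumes no tied scores."""
    ranked = sorted(zip(scores, y_true), reverse=True)
    hits, total = 0, 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            hits += 1
            total += hits / rank
    return total / sum(y_true)


# Hypothetical labels and predicted probabilities for eight patients.
y_true = [1, 0, 1, 0, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.6, 0.2, 0.1]
y_pred = [1 if s >= 0.5 else 0 for s in scores]  # assumed 0.5 threshold

metrics = classification_metrics(y_true, y_pred)
auc = roc_auc(y_true, scores)
aupr = average_precision(y_true, scores)
print(metrics, round(auc, 3), round(aupr, 3))
```

Accuracy, precision, recall, and F1 depend on the chosen threshold, whereas AUC and AUPR summarize performance across all thresholds.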

Sampling methods such as oversampling and undersampling address class imbalance in the original data. We used the Synthetic Minority Oversampling Technique (SMOTE) to oversample the raw data and RandomUnderSampler to undersample it; both approaches are widely used to improve prediction models.
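To make the two resampling ideas concrete, here is a simplified standard-library sketch. It is not the implementation used in the study (the class name RandomUnderSampler matches the imbalanced-learn library, but the paper does not name the package). The function names, the choice of k, and the toy numeric rows are hypothetical, and the sketch treats all features as continuous.

```python
import random


def random_undersample(majority, minority, rng):
    """Keep every minority row and an equal-sized random subset of majority rows."""
    return rng.sample(majority, len(minority)), list(minority)


def smote_like_oversample(minority, n_new, k, rng):
    """Create n_new synthetic rows, each interpolated between a minority row
    and one of its k nearest minority neighbours (the core idea of SMOTE)."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        neighbours = sorted(
            (row for row in minority if row is not base),
            key=lambda row: sq_dist(base, row),
        )[:k]
        partner = rng.choice(neighbours)
        gap = rng.random()  # position along the segment from base to partner
        synthetic.append([b + gap * (p - b) for b, p in zip(base, partner)])
    return synthetic


# Toy data: 40 majority rows and 4 minority rows with two numeric features.
rng = random.Random(0)
majority = [[float(i), float(i % 7)] for i in range(40)]
minority = [[100.0, 1.0], [102.0, 2.0], [101.0, 3.0], [103.0, 1.5]]

# Oversampling: add synthetic minority rows until the classes are balanced.
synthetic = smote_like_oversample(minority, len(majority) - len(minority), k=2, rng=rng)
oversampled_minority = minority + synthetic

# Undersampling: shrink the majority class to the size of the minority class.
kept_majority, kept_minority = random_undersample(majority, minority, rng)
print(len(oversampled_minority), len(kept_majority))
```

Oversampling keeps all original rows and adds synthetic ones, while undersampling discards most of the majority class, which may partly explain why the two strategies can lead to different model performance.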

We use Shapley Additive exPlanations (SHAP) to visualize and explain the importance of the variables in the optimal model. SHAP is a technique that explains the output of an ML model by attributing it to the input features: it evaluates the effect of each variable on the outcome by calculating its Shapley value.
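A Shapley value averages a feature's marginal contribution over every subset of the other features. The sketch below computes exact Shapley values for a three-feature toy risk function using only the standard library. The risk function, its coefficients, and the feature indices are invented for illustration and are not the study's RF model; SHAP tooling uses more efficient algorithms than this brute-force enumeration.

```python
from itertools import combinations
from math import factorial


def shapley_values(value, n_features):
    """Exact Shapley values for a set function value(frozenset_of_indices)."""
    players = range(n_features)
    phi = [0.0] * n_features
    for i in players:
        others = [j for j in players if j != i]
        for size in range(len(others) + 1):
            # Weight of a coalition of this size: |S|! (n - |S| - 1)! / n!
            weight = (
                factorial(size) * factorial(n_features - size - 1)
                / factorial(n_features)
            )
            for subset in combinations(others, size):
                s = frozenset(subset)
                phi[i] += weight * (value(s | {i}) - value(s))
    return phi


def toy_risk(present):
    """Hypothetical risk score. Feature 0 = tumor deposits, 1 = CEA positive,
    2 = advanced T stage. All numbers are made up for this example."""
    risk = 0.05  # baseline risk with no feature present
    if 0 in present:
        risk += 0.30
    if 1 in present:
        risk += 0.15
    if 2 in present:
        risk += 0.10
    if 0 in present and 1 in present:
        risk += 0.08  # interaction, split equally between features 0 and 1
    return risk


phi = shapley_values(toy_risk, 3)
print([round(p, 3) for p in phi])
```

The values sum to the difference between the prediction with all features present and the baseline, which is the additivity property that lets per-patient SHAP plots decompose a single prediction.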

The SEER database indicates that few colorectal cancer patients present with lung metastasis. We addressed this class imbalance by applying under-sampling and over-sampling to the original data. A correlation matrix was then used to evaluate the relationships between variables in the sampled data. As illustrated in Fig. 2, the relationships between variables become more pronounced after sampling.

Results

This study included 39,674 CRC patients, among whom 1,369 (3.5%) had lung metastasis. To validate the feasibility of the model, we collected data from 207 patients in a hospital in China from 2010 to 2015. The baseline data for this study are detailed in Table 1. Nine risk factors were included in the multivariable LR model: age, histological type, grade, primary tumor site, T stage, N stage, tumor size, CEA levels, and tumor deposits. Further details are presented in Table 2. The LR model achieved an AUC of 0.854 (95% CI 0.844–0.863; p < 0.001).
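The reported prevalence follows directly from the two counts, and the same arithmetic shows why accuracy alone says little on data this imbalanced. The short sketch below uses only the counts stated above; the "always predict no metastasis" rule is a hypothetical baseline introduced for illustration, not a model from the study.

```python
n_patients = 39_674   # CRC patients in the SEER cohort
n_lung_mets = 1_369   # patients with lung metastasis

prevalence = n_lung_mets / n_patients

# Hypothetical baseline: label every patient as "no lung metastasis".
baseline_accuracy = (n_patients - n_lung_mets) / n_patients
baseline_recall = 0 / n_lung_mets  # it never identifies a true case

print(f"prevalence        = {prevalence:.1%}")
print(f"baseline accuracy = {baseline_accuracy:.1%}")
print(f"baseline recall   = {baseline_recall:.1%}")
```

On the unsampled cohort such a rule would be right about 96.5% of the time while detecting no metastases, which is why recall, F1, AUC, and AUPR are reported alongside accuracy.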

Table 1 The baseline characteristics.


Table 2 Univariable and multivariable LR analysis.


Seven ML algorithms were built and compared on accuracy, precision, recall, F1 score, and AUC. Models trained on oversampled data outperformed those trained on undersampled data; detailed results for the seven models built with oversampling are given in Table 3. Figure 3 presents the performance of the seven ML algorithms under both the oversampling and undersampling approaches. Notably, all models built with oversampled data achieved an AUC greater than 0.800. The RF algorithm was superior to the other algorithms. On the test set, RF demonstrated excellent predictive performance, with an AUC of 0.980 and an AUPR of 0.941; it also performed well in fivefold cross-validation, with an average accuracy of 0.936.
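Fivefold cross-validation, as used for the average accuracy above, can be sketched with the standard library as follows. The paper does not say whether the folds were stratified, so stratification is an assumption here; the function names, the threshold classifier, and the toy data are all hypothetical stand-ins for the study's RF model and cohort.

```python
import random


def stratified_k_fold(labels, k, seed=0):
    """Yield (train_idx, test_idx) pairs in which every fold keeps roughly
    the overall class ratio."""
    rng = random.Random(seed)
    folds = [[] for _ in range(k)]
    for cls in sorted(set(labels)):
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        for pos, i in enumerate(idx):
            folds[pos % k].append(i)  # deal indices out round-robin
    for f in range(k):
        test_idx = sorted(folds[f])
        train_idx = sorted(i for g in range(k) if g != f for i in folds[g])
        yield train_idx, test_idx


def mean_cv_accuracy(features, labels, fit, predict, k=5):
    """Train on k-1 folds, score accuracy on the held-out fold, and average."""
    scores = []
    for train_idx, test_idx in stratified_k_fold(labels, k):
        model = fit([features[i] for i in train_idx], [labels[i] for i in train_idx])
        correct = sum(predict(model, features[i]) == labels[i] for i in test_idx)
        scores.append(correct / len(test_idx))
    return sum(scores) / k


def fit_threshold(rows, ys):
    """Toy classifier: threshold halfway between the two class means."""
    mean0 = sum(r[0] for r, y in zip(rows, ys) if y == 0) / ys.count(0)
    mean1 = sum(r[0] for r, y in zip(rows, ys) if y == 1) / ys.count(1)
    return (mean0 + mean1) / 2


def predict_threshold(threshold, row):
    return 1 if row[0] > threshold else 0


# Toy, perfectly separable data: 20 negatives and 10 positives.
labels = [0] * 20 + [1] * 10
features = [[float(i)] for i in range(20)] + [[float(30 + i)] for i in range(10)]

folds = list(stratified_k_fold(labels, 5))
cv_accuracy = mean_cv_accuracy(features, labels, fit_threshold, predict_threshold)
print(f"mean fivefold accuracy = {cv_accuracy:.3f}")
```

Each patient appears in exactly one held-out fold, so the averaged score reflects performance on data the model did not see during that round of training.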

Table 3 Prediction performance of 7 ML algorithms on the over-sampled dataset.


When tested on data from a Chinese hospital, RF also demonstrated strong predictive ability, with an accuracy of 0.961, an AUC of 0.927, and an AUPR of 0.657, as illustrated in Fig. 4. We also compared the RF algorithm with LR using AUC values: the RF algorithm (AUC = 0.980) outperformed the LR model (AUC = 0.854).

We interpreted the RF model using SHAP values. In Fig. 5A, clinical features are ranked according to their average absolute SHAP values to illustrate their relative significance. Figure 5B offers a comprehensive visualization of how the various factors influence the RF model.

We selected two patients diagnosed with CRC, one with lung metastasis and one without, based on our constructed model (Fig. 6A). For the patient with lung metastasis, significant risk factors included tumor deposits, tumor size ≥ 5 cm, CEA positive status, and age ≥ 60 years. Conversely, for the patient without lung metastasis, protective factors included CEA negative status, a Grade I tumor, absence of tumor deposits, colon involvement (T3), and belonging to other ethnicities (Fig. 6B).

As illustrated in Fig. 5, tumor deposits emerge as the most significant predictive factor for lung metastasis in CRC.

We built an online web calculator based on the RF algorithm (http://121.43.117.60:8003/).

Discussion

CRC is a prevalent malignancy. Approximately 25% of patients are diagnosed with distant metastasis, which remains the primary cause of mortality among this patient population²³. Colorectal cancer metastasis generally begins with tumor cells entering the liver via the portal vein system, from which they can disseminate to other organs, including the lungs²⁴. However, lung metastases may bypass the portal system through venous drainage and enter the systemic circulation directly²⁵. Because the patterns of lung metastasis may differ, and because early detection and timely intervention are crucial, the risk factors for lung metastasis in CRC patients need to be investigated separately. Studies have indicated that patients whose lung metastasis is identified at an early stage and who undergo surgical intervention have a 30% higher five-year survival rate than those who do not receive surgical treatment²⁶. Existing detection methods for lung metastasis, such as PET-CT and biopsy, have limitations: PET-CT is not suitable for early screening because of its high cost and potential risk of radiation damage, and biopsy, while able to confirm lung metastasis, carries the risks of tumor dissemination and false negatives²⁷. Artificial intelligence is already widely applied in medicine and serves as a crucial tool for precision medicine, assisting in the selection of optimal diagnostic and treatment strategies¹. A machine learning model built on clinical and pathological data avoids the high examination costs, limited scalability, and potential bodily harm of these methods, and it also exhibits strong predictive performance (an accuracy of 0.961), helping clinicians identify high-risk patients and formulate tailored treatment plans.
We used explainable ML algorithms that integrate clinical and pathological features to develop predictive models for lung metastasis of CRC. Additionally, we performed a comparative analysis of these models.

Among the various approaches evaluated, the machine learning models performed well. In our previous research, explainable ML algorithms were not used to make such predictions. The AUCs of the seven algorithms predominantly exceed 0.800, so we consider that the ML-based models have good predictive performance, with the RF algorithm performing best. Currently, research on CRC patients primarily emphasizes prognostic factors^28,29,30. In prior studies, researchers have developed models while investigating the associated risk factors^12,31, and some studies have predicted lung metastasis and assessed overall survival among these patients³². Previous studies on lung metastasis of colorectal cancer were primarily based on traditional logistic regression, and the AUC of these models was often less than 0.8^12,33, so their predictive performance still requires improvement. In addition, using interpretable machine learning algorithms makes our models considerably easier to explain in practical applications and helps users understand how the model reaches its predictions.

The LR revealed that age, histological type, grade, primary tumor site, T stage, N stage, tumor size, carcinoembryonic antigen (CEA) levels, and tumor deposits were all independent predictors of lung metastasis in CRC patients. The RF feature ranking, consistent with the multivariable logistic regression analysis, indicated that tumor deposits are a key predictor, followed by CEA levels and T stage. Tumor deposits are observed in 20%–25% of patients with colon cancer³³. At present, research on tumor deposits is limited, and most of it concentrates on their impact on prognosis³⁴. CEA is one of the essential predictive factors in this study; the level of CEA is closely related to distant metastasis³⁵. The T stage is a critical criterion for assessing tumor progression, and studies have demonstrated that it is directly associated with distant metastasis in various regions of the tumor³². In this study, T stage emerged as one of the most significant predictors within the RF model.

Furthermore, multivariable logistic analysis demonstrated a positive correlation between T stage and the risk of lung metastasis. This association may be attributed to the direct invasion of cancer cells into the systemic circulation via the venous system³⁵. The primary tumor site emerged as a significant predictor in this study. The rectum exhibited a higher propensity for lung metastasis than the colon (including the sigmoid), which aligns with previous findings^12,31. However, some studies suggest that primary site makes minimal difference to the risk of lung metastasis³². Consistent with previous research¹², the N stage is a significant predictor of lung metastasis in CRC. This may be because the lymphatic system is the primary pathway for metastasis³⁶ and the lung is among the organs that contain the highest abundance of lymph²⁷. Furthermore, it has been indicated that the lymphatic drainage from positive regional lymph nodes can directly reach the lungs, thereby facilitating lung metastasis³⁷.

ML is a scientific discipline that employs computational techniques to learn from data. While statistics focuses on elucidating the relationships among data, computer science prioritizes the development of more efficient algorithms for computation. Machine learning thus represents the convergence of these two fields³⁸. Whereas traditional statistics primarily emphasizes the testing of causal hypotheses, machine learning places greater emphasis on the predictive performance of models³⁹. Because machine learning is a collection of diverse algorithms, identifying the most suitable one for a practical problem requires weighing a range of factors, including data availability, data type, and the specific outcome to be predicted⁴⁰. Predictive models developed through machine learning can enable healthcare professionals to deliver more personalized and precise diagnostic and therapeutic strategies for patients. As databases continue to expand and algorithms are optimized, machine learning algorithms are poised to assume an even more significant role within the medical domain.

We developed seven predictive models to predict lung metastasis in colorectal cancer. The performance of these seven algorithms was assessed based on accuracy, precision, recall, F1 score, and AUC. Among them, RF demonstrated superior predictability with an AUC of 0.980, surpassing that of the logistic regression, which had an AUC of 0.854. Consequently, RF emerged as the most effective algorithm.

Our study has several limitations: (1) The validation cohort consisted of single-center data with a limited number of patients, all of whom were Asian. (2) We anticipate that the accuracy of the model can be further enhanced by incorporating more risk factors associated with metastasis in future research. (3) Additionally, the SEER database does not provide details on specific treatment plans; further analysis is required to assess their impact on patient prognosis. Furthermore, regional differentiation was not included in the decision-making process within this model.

Our study developed and validated a model utilizing explainable ML algorithms, which incorporates clinical features to quantify the primary factors contributing to lung metastasis. Among these factors, tumor deposits, CEA levels, and T stage emerged as the three most significant predictors of lung metastasis in CRC patients. In comparison to logistic regression models, the random forest algorithm demonstrated superior predictive capability; thus, it offers the potential for personalized treatment strategies. We have built a web calculator (http://121.43.117.60:8003/).

Data availability

The datasets generated for this study are available on request to the first author or the corresponding author.

Abbreviations

ML: Machine learning
CRC: Colorectal cancer
SHAP: Shapley additive exPlanations
DM: Distant metastasis
LR: Logistic regression
RF: Random forest
DT: Decision tree
SVM: Support vector machine
NB: Naive Bayes
KNN: K-nearest neighbor
XGBoost: eXtreme gradient boosting
GBM: Gradient boosting machine

References

Mao, Y. et al. Machine learning algorithms are comparable to conventional regression models in predicting distant metastasis of follicular thyroid carcinoma. Clin. Endocrinol. (Oxf.) 98(1), 98–109. https://doi.org/10.1111/cen.14693 (2023).
Bray, F. et al. Global cancer statistics 2018: GLOBOCAN estimates of incidence and mortality worldwide for 36 cancers in 185 countries. CA Cancer J. Clin. 68(6), 394–424. https://doi.org/10.3322/caac.21492 (2018).
Erratum: Global cancer statistics 2018: GLOBOCAN estimates of incidence and mortality worldwide for 36 cancers in 185 countries. CA Cancer J. Clin. 70(4), 313. https://doi.org/10.3322/caac.21609 (2020).
Li, T. et al. Predictive models based on machine learning for bone metastasis in patients with diagnosed colorectal cancer. Front. Public Health 10, 984750. https://doi.org/10.3389/fpubh.2022.984750 (2022).
Yoon, K. D. & Suman, P. Invited commentary: Survival outcome of palliative primary tumor resection for colorectal cancer patients with synchronous liver and/or lung metastases: A retrospective cohort study in the SEER database by propensity score matching analysis. Int. J. Surg. 82, 85–86. https://doi.org/10.1016/j.ijsu.2020.08.008 (2020).
Kopetz, S. et al. Improved survival in metastatic colorectal cancer is associated with adoption of hepatic resection and improved chemotherapy. J. Clin. Oncol. 27(22), 3677–3683. https://doi.org/10.1200/jco.2008.20.5278 (2009).
Limmer, S. & Unger, L. Optimal management of pulmonary metastases from colorectal cancer. Expert Rev. Anticancer Ther. 11(10), 1567–1575. https://doi.org/10.1586/era.11.123 (2011).
Zhang, G. Q. et al. Aggressive multimodal treatment and metastatic colorectal cancer survival. J. Am. Coll. Surg. 230(4), 689–698. https://doi.org/10.1016/j.jamcollsurg.2019.12.024 (2020).
Luo, D. et al. Prognostic value of distant metastasis sites and surgery in stage IV colorectal cancer: A population-based study. Int. J. Colorectal Dis. 33(9), 1241–1249. https://doi.org/10.1007/s00384-018-3091-x (2018).
Yi, C. et al. Is Primary tumor excision and specific metastases sites resection associated with improved survival in stage IV colorectal cancer? Results from SEER database analysis. Am. Surg. 86(5), 499–507. https://doi.org/10.1177/0003134820919729 (2020).
Huang, Y. et al. Pulmonary metastasis in newly diagnosed colon-rectal cancer: A population-based nomogram study. Int. J. Colorectal Dis. 34(5), 867–878. https://doi.org/10.1007/s00384-019-03270-w (2019).
Li, Y. et al. Predictive and prognostic factors of synchronous colorectal lung-limited metastasis. Gastroenterol. Res. Pract. 2020, 6131485. https://doi.org/10.1155/2020/6131485 (2020).
Liu, W. et al. Prediction of lung metastases in thyroid cancer using machine learning based on SEER database. Cancer Med. 11(12), 2503–2515. https://doi.org/10.1002/cam4.4617 (2022).
Frizzell, J. D. et al. Prediction of 30-day all-cause readmissions in patients hospitalized for heart failure: Comparison of machine learning and other statistical approaches. JAMA Cardiol. 2(2), 204–209. https://doi.org/10.1001/jamacardio.2016.3956 (2017).
Xing, F. et al. A new random forest algorithm-based prediction model of postoperative mortality in geriatric patients with hip fractures. Front. Med. (Lausanne) 9, 829977. https://doi.org/10.3389/fmed.2022.829977 (2022).
Che, D., Liu, Q., Rasheed, K. & Tao, X. Decision tree and ensemble learning algorithms with their applications in bioinformatics. Adv. Exp. Med. Biol. 696, 191–199. https://doi.org/10.1007/978-1-4419-7046-6_19 (2011).
Chern, C. C., Chen, Y. J. & Hsiao, B. Decision tree-based classifier in providing telehealth service. BMC Med. Inform. Decis. Mak. 19(1), 104. https://doi.org/10.1186/s12911-019-0825-9 (2019).
Lee, Y. W., Choi, J. W. & Shin, E. H. Machine learning model for predicting malaria using clinical information. Comput. Biol. Med. 129, 104151. https://doi.org/10.1016/j.compbiomed.2020.104151 (2021).
Lee, S. M., Park, J. H. & Park, H. J. Implications of systematic review for breast cancer prediction. Cancer Nurs. 31(5), E40–E46. https://doi.org/10.1097/01.NCC.0000305765.34851.e9 (2008).
Pal, M., Parija, S., Panda, G., Dhama, K. & Mohapatra, R. K. Risk prediction of cardiovascular disease using machine learning classifiers. Open Med. (Wars) 17(1), 1100–1113. https://doi.org/10.1515/med-2022-0508 (2022).
Ma, B. et al. Diagnostic classification of cancers using extreme gradient boosting algorithm and multi-omics data. Comput. Biol. Med. 121, 103761. https://doi.org/10.1016/j.compbiomed.2020.103761 (2020).
Lynch, C. M. et al. Prediction of lung cancer patient survival via supervised machine learning classification techniques. Int. J. Med. Inform. 108, 1–8. https://doi.org/10.1016/j.ijmedinf.2017.09.013 (2017).
Roth, E. S. et al. Does colon cancer ever metastasize to bone first? A temporal analysis of colorectal cancer progression. BMC Cancer 9, 274. https://doi.org/10.1186/1471-2407-9-274 (2009).
Mitry, E. et al. Epidemiology, management and prognosis of colorectal cancer with lung metastases: A 30-year population-based study. Gut 59(10), 1383–1388. https://doi.org/10.1136/gut.2010.211557 (2010).
Robinson, J. R., Newcomb, P. A., Hardikar, S., Cohen, S. A. & Phipps, A. I. Stage IV colorectal cancer primary site and patterns of distant metastasis. Cancer Epidemiol. 48, 92–95. https://doi.org/10.1016/j.canep.2017.04.003 (2017).
Heinemann, V. et al. FOLFIRI plus cetuximab versus FOLFIRI plus bevacizumab as first-line treatment for patients with metastatic colorectal cancer (FIRE-3): A randomised, open-label, phase 3 trial. Lancet Oncol. 15(10), 1065–1075. https://doi.org/10.1016/s1470-2045(14)70330-4 (2014).
Qiu, B., Shen, Z., Yang, D. & Wang, Q. Applying machine learning techniques to predict the risk of lung metastases from rectal cancer: a real-world retrospective study. Front. Oncol. 13, 1183702. https://doi.org/10.3389/fonc.2023.1183072 (2023).
Deng, S. et al. Development and validation of a prognostic scoring system for patients with colorectal cancer hepato-pulmonary metastasis: a retrospective study. BMC Cancer 22(1), 643. https://doi.org/10.1186/s12885-022-09738-3 (2022).
Huang, B. et al. Smaller tumor size is associated with poor survival in stage II colon cancer: An analysis of 7,719 patients in the SEER database. Int. J. Surg. 33(Pt A), 157–163. https://doi.org/10.1016/j.ijsu.2016.07.073 (2016).
Ding, X. et al. Risk and prognostic nomograms for colorectal neuroendocrine neoplasm with liver metastasis: A population-based study. Int. J. Colorectal Dis. 36(9), 1915–1927. https://doi.org/10.1007/s00384-021-03920-y (2021).
Nordholm-Carstensen, A., Krarup, P. M., Jorgensen, L. N., Wille-Jørgensen, P. A. & Harling, H. Occurrence and survival of synchronous pulmonary metastases in colorectal cancer: A nationwide cohort study. Eur. J. Cancer 50(2), 447–456. https://doi.org/10.1016/j.ejca.2013.10.009 (2014).
Mo, S. et al. Nomograms for predicting specific distant metastatic sites and overall survival of colorectal cancer patients: A large population-based real-world study. Clin. Transl. Med. 10(1), 169–181. https://doi.org/10.1002/ctm2.20 (2020).
Nagtegaal, I. D. et al. Tumor deposits in colorectal cancer: Improving the value of modern staging-a systematic review and meta-analysis. J. Clin. Oncol. 35(10), 1119–1127. https://doi.org/10.1200/jco.2016.68.9091 (2017).
Guo, Z. et al. Machine learning for predicting liver and/or lung metastasis in colorectal cancer: A retrospective study based on the SEER database. Eur. J. Surg. Oncol. 50(7), 108362. https://doi.org/10.1016/j.ejso.2024.108362 (2024).
Hughes, E. S. & Cuthbertson, A. M. Recurrence after curative excision of carcinoma of the large bowel. JAMA 182, 1303–1306. https://doi.org/10.1001/jama.1962.03050520001001 (1962).
Dumont, F. et al. Significance of lymph node involvement in local recurrence of colorectal cancer. J. Surg. Oncol. 120(4), 722–728. https://doi.org/10.1002/jso.25631 (2019).
Kato, Y. et al. Lymph node metastasis is strongly associated with lung metastasis as the first recurrence site in colorectal cancer. Surgery 170(3), 696–702. https://doi.org/10.1016/j.surg.2021.03.017 (2021).
Deo, R. C. Machine learning in medicine. Circulation 132(20), 1920–1930. https://doi.org/10.1161/circulationaha.115.001593 (2015).
Handelman, G. S. et al. eDoctor: Machine learning and the future of medicine. J. Intern. Med. 284(6), 603–619. https://doi.org/10.1111/joim.12822 (2018).
Weiss, J., Kuusisto, F., Boyd, K., Liu, J. & Page, D. Machine learning for treatment assignment: Improving individualized risk attribution. AMIA Annu. Symp. Proc. 2015, 1306–1315 (2015).

Funding

This study was supported by Beijing Municipal Science & Technology Commission (No. Z171100000417056) and the Key Support Project of Guo Zhong Health Care of China General Technology Group (GZKJ-KJXX-QTHT-20240429).

Author information

Authors and Affiliations

Department of General Surgery, Beijing Electric Power Hospital, State Grid Corporation of China, Capital Medical University, Beijing, 100073, China
Zhentian Guo, Zongming Zhang, Limin Liu, Yue Zhao, Zhuo Liu, Chong Zhang & Hui Qi
China Clinical Medical Research Center for Hepatobiliary Diseases in General Surgery, China General Technology Group, Beijing, 100073, China
Zhentian Guo, Zongming Zhang, Limin Liu, Yue Zhao, Zhuo Liu, Chong Zhang, Hui Qi, Jinqiu Feng & Peijie Yao

Contributions

Zhentian Guo: Conceptualization, methodology, software, Investigation, Writing—original draft. Zongming Zhang: Conceptualization, Funding acquisition, Investigation, Formal Analysis, Supervision, Writing—review and editing. Limin Liu: Data curation, Investigation. Yue Zhao: Data curation, Investigation. Zhuo Liu: Data curation, Investigation. Chong Zhang: Data curation, Investigation. Hui Qi: Data curation, Investigation. Jinqiu Feng: Data curation, Investigation, Writing—review and editing. Peijie Yao: Data curation, Investigation.

Corresponding author

Correspondence to Zongming Zhang.

Ethics declarations

Competing interests

The authors declare no competing interests.

Additional information

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Electronic supplementary material

Below is the link to the electronic supplementary material.

Supplementary Material 1

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.

About this article

Cite this article

Guo, Z., Zhang, Z., Liu, L. et al. Explainable machine learning for predicting lung metastasis of colorectal cancer. Sci Rep 15, 13611 (2025). https://doi.org/10.1038/s41598-025-98188-5

Received: 29 November 2024
Accepted: 09 April 2025
Published: 19 April 2025
DOI: https://doi.org/10.1038/s41598-025-98188-5