Performance of machine learning algorithms for lung cancer prediction: a comparative approach

Maurya, Satya Prakash; Sisodia, Pushpendra Singh; Mishra, Rahul; Singh, Devesh Pratap

doi:10.1038/s41598-024-58345-8

Article
Open access
Published: 09 August 2024

Satya Prakash Maurya¹,
Pushpendra Singh Sisodia²,
Rahul Mishra ORCID: orcid.org/0000-0002-7395-306X³ &
…
Devesh Pratap Singh¹

Scientific Reports volume 14, Article number: 18562 (2024)

Abstract

Due to the excessive growth of PM 2.5 in aerosol, cases of lung cancer are increasing rapidly, and lung cancer is the most severe cancer type, with the highest mortality rate. In most cases, lung cancer is detected at a late stage because it shows few symptoms earlier. Hence, clinical records may play a vital role in diagnosing this disease at the correct stage so that suitable medication can be given. Detecting lung cancer requires an accurate and reliable prediction method. In the era of digital clinical records, advances in computing algorithms, including machine learning techniques, open an opportunity to ease the process. Various machine learning algorithms may be applied to realistic clinical data, but their predictive power is not yet well understood. This paper compares twelve potential machine learning algorithms on clinical data with eleven symptoms of lung cancer, along with two major habits of patients, to predict a positive case accurately. The results are based on classification and heat map correlation. The K-Nearest Neighbor model and the Bernoulli Naive Bayes model are found to be the most significant methods for early lung cancer prediction.

Introduction

Respiratory disease has increased enormously over the last decades, which may be directly associated with human exposure to a polluted atmosphere. The Sustainable Development Goals (SDGs) set out an aspiration of health and well-being for all¹; target 3.9 concerns reducing death and illness from air, water, and soil pollution². Lung cancer is one of the most lethal diseases caused by air pollution, with increasing mortality rates globally. Usually, this type of cancer begins in the lungs and may spread to other parts of the body, and its causes include smoking, air pollution, and exposure to particular chemicals³. The prognosis for lung cancer varies depending on the type, stage, and overall health of the individual. The initial phases of lung cancer usually do not manifest symptoms. If early symptoms do appear, they may include shortness of breath, in addition to unexpected symptoms such as back pain. Tumors can lead to back pain by exerting pressure on the lungs or by spreading to the patient’s spinal cord and ribs⁴. Additional initial symptoms of lung cancer may include a persistent or worsening cough; expectorating phlegm or blood; chest pain that worsens during deep breathing, laughter, or coughing; hoarseness; wheezing; weakness and fatigue; reduced appetite and weight loss; and recurring respiratory infections such as pneumonia or bronchitis⁵. The initial manifestations of lung cancer may be subtle; however, an early diagnosis is crucial for effective treatment options and better outcomes.

However, detecting and diagnosing lung cancer at an early stage remains a great challenge for doctors and researchers. Advances in the storage of health records on digital platforms and in data visualization have improved pattern analysis⁶. Early prediction of disease based on symptoms and textual information may enhance the diagnosis system. Aside from medical methods, soft computing techniques, such as applying machine learning algorithms to the main features of large, complicated lung cancer datasets, may help a specialist find the disease early. At the same time, the precision of detection depends on the availability of data and on the process of selecting important measures, which in turn affects whether adequate treatment decisions are reached.

Diverse mathematical models have already been utilized for the detection and prevention of diseases to facilitate early treatment. However, if lung cancer is diagnosed three years after its onset, it becomes unpreventable, and the likelihood of survival is extremely poor⁷,⁸. Nevertheless, it is possible to treat the disease when the earliest signs are present, before metastasis. Thus, if cancer is found within the time frame in which it is curable, along with various risk factors for further diagnosis, a suitable therapy can be provided to the patient and appropriate preventive measures can be implemented. Several computational methods have been used to detect or predict lung cancer, which helps doctors determine the best way to treat patients and estimate their chances of survival after diagnosis. Researchers in the medical sciences have employed machine learning and soft computing approaches to diagnose several forms of cancer accurately in their early stages using classification methods. Furthermore, researchers have identified various cutting-edge methods for early-stage prognosis of cancer therapy outcomes⁹. However, it is crucial to determine an appropriate learning algorithm for detecting lung cancer and its correlation with the patient’s habits. This research aims to conduct a comparative analysis of several machine learning algorithms on the characteristics related to lung cancer, specifically focusing on the symptoms exhibited by patients and their habits.

Machine learning algorithms in lung cancer prediction

Lung cancer, also referred to as lung carcinoma in medical terminology, is a malignant tumor that grows uncontrollably in lung cells and can be identified by cell proliferation. Recent advancements in computer vision have enabled scientists to introduce various diagnostic methods using temporal image analysis¹⁰. However, with the growth in clinical data repositories, not only image analysis but also text data has come to play a vital role in diagnosis. Several lung cancer studies focus on detection using symptom data and on treatment decisions based on artificial intelligence, image processing, and learning algorithms. Several researchers have applied machine learning algorithms to clinical datasets to predict the recurrence of lung cancer and its survivability, including neural networks, support vector machines, and decision trees¹⁰; convolutional neural network based non-linear cellular automata¹¹; and Random Forest, XGBoost, and Logistic Regression¹². A few comparative studies have also been presented, such as the ensemble techniques of Bagging and AdaBoost together with K-Nearest Neighbors, Decision Tree, and Neural Networks on the Surveillance, Epidemiology and End Results (SEER) dataset¹³, and XGBoost, GridSearchCV, Logistic Regression, Support Vector Machine, Gaussian Naïve Bayes, Decision Tree, and K-Nearest Neighbor classifiers¹⁴, which evaluate lung cancer prediction through precision, recall, and F1-score computed from the confusion matrix, and through Area Under Curve (AUC) and Receiver Operating Characteristic (ROC) analysis. Further machine learning classifiers, such as Logistic Regression, Naïve Bayes, Random Forest, Support Vector Machine (SVM), Artificial Neural Network (ANN), k-Nearest Neighbors (KNN), Radial Basis Function Network (RBF), J48, MLP, Gradient Boosted Tree, and Majority Voting, have also been tried to observe the performance of lung cancer prediction¹⁵,¹⁶. Specifically, standard machine learning techniques such as decision tree, boosting, random forest, neural network, naïve Bayes, KNN, and SVM are frequently used in lung cancer prediction¹⁷. These machine learning algorithms have shown their applicability to larger temporal real-world lung cancer datasets for risk prediction¹⁸,¹⁹.
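To make the evaluation measures mentioned above concrete, the following is a minimal sketch of how precision, recall, and F1-score can be derived from the four cells of a binary confusion matrix. It is illustrative only: the function names and the example labels are hypothetical, and it is not the code used in any of the cited studies.

```python
from collections import Counter


def confusion_counts(y_true, y_pred, positive=1):
    """Return (TP, FP, FN, TN) for a binary classification problem."""
    counts = Counter()
    for truth, pred in zip(y_true, y_pred):
        if pred == positive:
            counts["tp" if truth == positive else "fp"] += 1
        else:
            counts["fn" if truth == positive else "tn"] += 1
    return counts["tp"], counts["fp"], counts["fn"], counts["tn"]


def precision_recall_f1(y_true, y_pred, positive=1):
    """Compute precision, recall, and F1 from the confusion-matrix counts."""
    tp, fp, fn, _ = confusion_counts(y_true, y_pred, positive)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Hypothetical labels: 1 = lung cancer positive, 0 = negative.
y_true = [1, 1, 1, 0, 0, 1, 0, 1]
y_pred = [1, 0, 1, 0, 1, 1, 0, 1]

precision, recall, f1 = precision_recall_f1(y_true, y_pred)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

In a screening setting, recall is often the measure watched most closely, because a false negative corresponds to a missed positive case.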

In binary classification, especially in diagnostic, prognostic, and predictive research, Receiver Operating Characteristic (ROC) and Area Under the Curve (AUC) analysis is an effective technique for measuring the discriminating ability of a method²⁰. The ROC curve is used to assess a test’s overall diagnostic performance and to compare the performance of two or more diagnostic tests²¹. In other words, the ROC curve describes performance over a series of thresholds and can be summarized by the AUC, which is a single number²². In addition, a gender- and age-based study of a lung cancer dataset has been performed using machine learning, which shows the potential applicability of naïve Bayes, SVM, KNN, random forest, decision tree, AdaBoostM1, and neural network²³.
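The relationship between the ROC curve and the AUC can be illustrated with a short sketch. Assuming a classifier that outputs a score per patient (the scores and labels below are hypothetical), the decision threshold is swept over every distinct score to obtain (false positive rate, true positive rate) points, and the AUC is then approximated with the trapezoidal rule.

```python
def roc_points(y_true, scores):
    """Return (FPR, TPR) points, one per distinct threshold, from strict to lenient."""
    pos = sum(1 for y in y_true if y == 1)
    neg = len(y_true) - pos
    if pos == 0 or neg == 0:
        raise ValueError("ROC needs at least one positive and one negative label")
    points = [(0.0, 0.0)]  # threshold above every score: nothing predicted positive
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for y, s in zip(y_true, scores) if s >= threshold and y == 1)
        fp = sum(1 for y, s in zip(y_true, scores) if s >= threshold and y == 0)
        points.append((fp / neg, tp / pos))
    return points


def auc(points):
    """Area under a piecewise-linear curve via the trapezoidal rule."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area


# Hypothetical labels (1 = positive case) and classifier scores.
y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]

curve = roc_points(y_true, scores)
print(curve)
print(auc(curve))
```

An AUC of 0.5 corresponds to a classifier that ranks positives and negatives no better than chance, while 1.0 corresponds to perfect separation.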

Apart from the above analysis, it is essential to inter-relate a patient’s habits and symptoms in order to be more precise in diagnosing and treating lung cancer. Moreover, it is equally important to find a suitable method of analyzing these datasets. Very few attempts have been made to compare different machine learning methods for lung cancer prediction.

Dataset preparation and analysis

The dataset for lung cancer prediction has been collected from the source. The dataset consists of a total of 16 attributes with 310 instances. The instances are distributed over gender, i.e., male and female. Table 1 gives a detailed description of all 16 input feature attributes in the lung cancer study dataset, which are used in the prediction of lung cancer. The attributes are divided into two categories, Habits and Symptoms, which take positive or negative values represented numerically as 2 (yes) and 1 (no), respectively (Table 2). There were thirty-three duplicate entries among the given instances in this dataset, which were removed before processing. An instance frequency count was performed, and the distribution of positive cases was analyzed gender-wise. Further, a gender-wise frequency analysis was performed for each patient habit and symptom individually. Pearson’s correlation was also plotted as a heat map to assess the importance of each attribute relative to the others. The attributes of the clinical dataset were chosen based on input from experts in this specialization and to measure the effectiveness of the cancer prediction system, which further helps patients learn their cancer risk at low cost and make decisions about appropriate treatment.

The data are split into two sets: 80% for training and 20% for testing. During the training process, each model underwent 10-fold cross-validation. This involved splitting the training set into a training subset and a validation subset with a ratio of 10:1 to fine-tune the attributes. The final accuracy metric was established using the outcomes from the ten cross-validated models and the Area Under the Curve (AUC) of the Receiver Operating Characteristic (ROC) curve.
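The preparation steps described above (duplicate removal, Pearson correlation, an 80/20 split, and 10-fold cross-validation) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' pipeline: the function names are hypothetical, the rows are synthetic stand-ins rather than the study data, and the folds follow the standard k-fold scheme, in which each fold holds out one tenth of the training set (a 9:1 ratio within a fold), which may differ from the exact procedure used in the study.

```python
import math
import random


def drop_duplicates(rows):
    """Keep the first occurrence of each row, preserving order."""
    seen, unique = set(), []
    for row in rows:
        key = tuple(row)
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return float("nan")  # undefined for a constant column
    return cov / math.sqrt(var_x * var_y)


def train_test_split(rows, test_fraction=0.2, seed=0):
    """Shuffle a copy of the rows and split it into (train, test)."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


def k_fold_indices(n_samples, k=10):
    """Yield (train_idx, val_idx); every index is used for validation exactly once."""
    sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in sizes:
        val = list(range(start, start + size))
        train = list(range(start)) + list(range(start + size, n_samples))
        yield train, val
        start += size


# Synthetic stand-in data (NOT the study dataset): 277 distinct rows of nine
# attributes coded 2 (yes) / 1 (no), plus 33 repeated rows, 310 rows in total.
distinct = [[((i >> bit) & 1) + 1 for bit in range(9)] for i in range(277)]
rows = distinct + distinct[:33]

unique_rows = drop_duplicates(rows)
train_rows, test_rows = train_test_split(unique_rows)
folds = list(k_fold_indices(len(train_rows), k=10))

# One cell of the correlation matrix that a heat map would display.
first = [row[0] for row in unique_rows]
second = [row[1] for row in unique_rows]
r_01 = pearson(first, second)

print(len(rows), len(unique_rows), len(train_rows), len(test_rows), len(folds))
```

Computing `pearson` for every pair of attribute columns gives the full matrix that a heat map visualizes. Because the coefficient is unchanged by a positive linear rescaling of either variable, the 2/1 coding gives the same values as a 1/0 coding would.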

Table 1 Description of all 16 input attributes in lung cancer study dataset.
