Abstract
Early diagnosis of breast cancer is exceptionally important in signifying the treatment results, of women’s health. The present study outlines a novel approach for analyzing breast cancer data by using the CatBoost classification model with a multi-layer perceptron neural network (CatBoost+MLP). Explainable artificial intelligence techniques are used to cohere with the proposed CatBoost with the MLP model. The proposed model aims to enhance the interpretability of predictions in breast cancer diagnosis by leveraging the benefits of CatBoost classification technique in feature identification and also contributing towards the interpretability of the decision model. The proposed CatBoost+MLP has been evaluated using the Shapley additive explanations values to analyze the feature significance in decision-making. Initially, the feature engineering is done using the analysis of variance technique to identify the significant features. The MLP model alone and the CatBoost+MLP model are being analyzed using divergent performance metrics, and the results obtained are compared with contemporary breast cancer identification techniques.
Similar content being viewed by others
Introduction
Breast cancer is a difficult health burden that affects not just the physical health of people afflicted but also the emotional and socioeconomic well-being of families and communities. Early identification is critical in reducing these effects by allowing prompt intervention and more effective treatment choices. When breast cancer is detected early, therapy is frequently less intrusive, resulting in increased survival rates and improved patient quality of life1. In 2020, breast cancer overtook lung cancer as the most common tumor in the world with 11.7% incidence, and it will continue to cause many cancer deaths worldwide1. Population growth and aging are the only factors anticipated to increase breast cancer incidence to three million and deaths to one million by 2040. Even with breakthroughs in breast cancer treatment, early breast cancer patients may have a five-year survival rate of over 85%. Personalized and precise breast cancer treatment requires accurate molecular diagnosis and prognosis2. At the moment, mammography, magnetic resonance imaging (MRI), ultrasound scans, Positron emission tomography (PET) scans, and puncture cytology are the mainstays of breast cancer diagnosis3. The probability of effective treatment and the prognosis greatly rise once breast cancer is diagnosed in its earliest stages4,5. Therefore, regular checks and a quick diagnosis are essential for avoiding and early detection of breast cancer. The significance of preventative actions in dealing with this health condition is highlighted by the fact that early discovery increases the likelihood of a cure and improves the general prospects for patients6.
Breast cancer presents in two forms: benign, recognized non-hazardous, and malignant, which is further categorized as normal or aberrant. Benign diseases are often not seen as life-threatening despite the potential for malignancy. Malignant breast cancer begins with uncontrolled cell development and may quickly spread to other organs. In malignant tissues, the nuclei appear significantly larger compared to healthy tissues, particularly as the disease progresses. Medical professionals may conduct needle biopsies or microscopic examination of tissues to confirm a diagnosis7. To ensure effective and prompt detection, it’s crucial to analyze recent investigation data along with relevant information obtained from archived records. Medical imaging and artificial intelligence (AI) algorithms play crucial roles in facilitating this diagnosis. A deeper knowledge of breast cancer is made possible by using different machine intelligent approaches designed to help interpret medical images8. The current study tries to address the following research questions (RQ) concerning the smart diagnosis of breast cancer.
-
RQ1: What is the significance of feature engineering in dealing with classification models and their impact on the outcome?
-
RQ2: The role of Categorical Boosting in handling the categorical data and regularization of the smaller datasets in preventing the overfitting issue.
-
RQ3: What is the significance of XAI techniques, particularly the utilization of Shapley values, in enhancing the interpretability of decision models, with a focus on determining the importance of features in the decision-making process?
-
RQ4: Can two or more features be interdependent on the decision process, and can the dependency of two features be assessed.
The motivation for machine learning in breast cancer diagnosis stems from the desire to improve patient outcomes, reduce the global impact of the illness, and advance state-of-the-art healthcare technology and research. This research offers several benefits, notably enhancing early diagnosis and facilitating individualized treatment. Beyond these advantages, its potential extends to a transformative impact on breast cancer management, potentially saving lives while influencing research methodologies, cost-effectiveness, and global accessibility to healthcare services. The incorporation of machine intelligence in breast cancer diagnostics not only contributes to improved diagnosis outcomes but also plays a role in reducing overall healthcare expenses. By harnessing the capabilities of machine learning, this study has the potential to influence breast cancer management, paving the way for more efficient diagnosis and healthcare. The proposed CatBoost with MLP model for analyzing breast cancer is presented using Fig. 1.
The current study makes the following contributions:
-
Feature engineering involves identifying crucial features and assigning optimal weights through ANOVA.
-
The framework driven by explainable artificial intelligence illustrates the relation and association between the features by utilizing Shapely values, making the contribution of each feature comprehended in the decision process.
-
The Catboost algorithm is used for categorical analysis of the features, followed by the multilayer perceptron neural network, to classify abnormal healthcare records.
-
The results of the proposed model are analyzed using other contemporary techniques to assess the performance effectively.
The subsequent sections of the study are organized as a literature section that presents the contemporary techniques used in breast cancer prediction. The third section presents the methods and material used in evaluating the current study, which discusses the feature selection technique, Implementation environment, and dataset details. Section four presents the multilayer perceptron neural network in classifying medical records. Section five presents the experimental results and the comparison with other contemporary techniques. Section six presents the conclusion and the future scope of the current study.
Literature review
Machine-intelligent models have become more prominent in developing breast cancer prediction models. Various ML methods, including support vector machines (SVM), decision tree (DT), neural networks (NN), J48 algorithm, Naive Bayes (NB), logistic regression (LR), Multi-layer perceptron (MLP) K-nearest neighbourhood(KNN), and random forests (RF), have been used by researchers to analyze intricate datasets and detect patterns that may suggest the presence of malignancies. The use of AI-based prediction models improves the precision of diagnostic procedures and shows the potential to provide individualized therapy suggestions. Waikato environment for knowledge analysis (WEKA) is a predominantly used platform for evaluating cancer diagnostic models9.
In the study by Silva-Aravena10, a hybrid model-based XGBoost algorithm and the Shapley-based XAI models are being experimented on over an Indonesian women’s Mendeley dataset11 for cancer prediction. The hybrid model yielded an accuracy of 81%, which is considered fair for medical-related diagnostic models. The study has not done any feature engineering, such as pre-processing the data before classifying the records, which could be the reason for low accuracy, and the class imbalance is also not being handled in the current study. Another study on the ultrasound and breast histopathology images using a hybrid DL model was conducted by Zakareya et al.12, and the authors have combined both the GoogLeNet13 and residual block14. The model has yielded a promising accuracy of 93% in evaluating the ultrasound images and 95% accuracy in evaluating over-breast histopathology images. The experimental analysis has stated that a semantic gap in content-based image retrieval (CBIR) systems is the fundamental obstacle that must be bridged for successful retrieval. To overcome this problem, various strategies, such as those described above, may be used to integrate semantic concerns.
In a research conducted by Uysal Fatih and Mehmet Murat Köse15, DL models were explored for predicting breast cancer. The study utilized various DL models to classify breast cancer ultrasound images into benign, malignant, or normal categories. The dataset comprised 780 samples, and the researchers employed data augmentation and pre-processing techniques. The study investigated three different models for classification. ResNet50 obtained an accuracy of 85.4%, ResNeXt50 earned an accuracy of 85.83%, and VGG16 achieved an accuracy of 81.11%. A study on the gene sequencing for breast cancer prediction using 9 different classification techniques that include linear discriminant analysis model (LDA), k nearest neighbours’ algorithm (KNN), gradient boosting (GB), adaboost (AB), SVM, RF, DR, LR, and NB are being evaluated16. The experimental results have proven that the DT has yielded a maximum accuracy of 94.03%. Authors have stated that granular computing may be used on the characteristics to choose strong datasets for training the model. Another study by Md. Mehedi Hassan et al.17, evaluated various classification techniques over the SEER breast cancer dataset from IEEE DataPort18. Techniques like SVM exhibited an accuracy of 89.19%, MLP attained a fair accuracy of 88.20%, RF with an accuracy of 90.44%, GB attained an accuracy of 90.06%, KNN attained an accuracy of 88.08%, and LR exhibited an accuracy of 89.07%, without any feature selection technique. On using the least absolute shrinkage and selection operator (LASSO) technique, the SVM has attained an accuracy of 89.57%, MLP has attained an accuracy of 89.44%, RF has attained an accuracy of 90.68%, GB has attained an accuracy of 90.56%, KNN has attained an accuracy of 88.82%, and LR has attained an accuracy of 89.19%.
In their research focused on the early detection of breast cancer, Bazazeh and Shubair19 applied various machine learning approaches, including SVM, RF, and NB methods, to analyze the Wisconsin breast cancer dataset (WBCD). NB exhibited superior performance in terms of specificity and sensitivity values, as evaluated through the WEKA software. Meanwhile, the RF technique yielded the best outcomes when considering the ROC curve parameter. SVM excelled in accuracy, precision, and creativity. In another study, Chougrad et al. 20 introduced a deep convolutional neural network (CNN) model with transfer learning to mitigate overfitting issues associated with small datasets. The model’s evaluation utilized four datasets: DDSM21, INbreast22, BCDR23, and MIAS24. Notably, the DDSM dataset achieved an accuracy of 97.35% with an area under the curve (AUC) of 0.98. The INbreast dataset demonstrated 95.5% accuracy and an AUC of 0.97, while the BCDR database attained 96.67% accuracy with an AUC of 0.96. Some of the recent studies on the similar dataset that is considered in the current study are being tabulated to analyze the performances of the models. Table 1 outlines the accuracies that are attained by various existing classification techniques over the WBCD dataset.
The size of the dataset has a significant impact on the performance of the model. The models that are trained over the smaller size dataset are prone to overfitting and more variance in the classification outcome. To handle the smaller size datasets, the algorithms like CatBoost are used in regularization of the dataset.
With recent advancements in artificial intelligence and its potential, our study primary focus on revolutionizing breast cancer prediction by leveraging electronic healthcare records. Early identification of breast cancer is crucial as it significantly enhances the treatment and patient survival. The propose approach aims to develop a prediction model that is capable of analyzing diagnostic records to identify subtle patterns and detect early indications of breast cancer. The motto is to provide healthcare practitioners with a powerful tool that can detect potential hazards at an early stage. By leveraging advanced technology and utilizing the vast amount of information stored in electronic records, we can ultimately enable more personalized and effective therapies. This hybridized machine learning models would have the potential to go beyond simple diagnostics, opening the path for a paradigm change in breast cancer care in which preventive strategies may be taken to reduce risks and enhance patient outcomes. The main efforts are to bring a significant transformation in the field of breast cancer prediction, that would better assist in patient-centric healthcare.
Methods and material
The current section of the manuscript discusses the dataset used in the experimentation of the proposed model, the feature selection mechanisms, and the experimental setup used in the current study.
Dataset description
The Breast Cancer Wisconsin (diagnostic) data set is openly accessible on the Kaggle platform for evaluation30. The dataset consists of a total of 569 records that are classified as binary classes, i.e., benign and malignant tissues. The benign tissues are the controlled cellular growth tissues confined to their tissue of origin and have limited boundaries. Benign tissues do not pose any threat to the body. On the other hand, the malignant tissues grow unrestrained and intrude on the neighboring tissues, which are harmful and assumed to be cancerous growth. The dataset consists of 357 records of benign tissues and 212 records of malignant tissue. The dataset is partitioned as 70%, 10%, and 20% for training, validation, and testing the proposed model. This setup allowed us to thoroughly evaluate the proposed CatBoost with MLP model using 114 samples. The labelled dataset consists of 31 features for each of the records, along with the other metadata presented in Table 2. Among the features discussed in the table below, the diagnosis holds the label of the record. The minimum and maximum values associated with each of the feature are also shown, which are generally used in assessing the mean and the standard deviation for each of the features.
Using a heatmap to depict the correlation among characteristics within a dataset offers a graphical depiction of the interconnections between variables. It provides valuable insights into possible patterns and dependencies within the data by presenting color-coded values that show the intensity and direction of relationships. The distribution of the features in the dataset is shown using a heatmap in Fig. 2. The deep red cells in the heatmap indicate a strong positive correlation between two features. The deep blue cells show a strong negative correlation between features. The different shades of red and blue represent varying levels of positive and negative correlations between the features. These correlations impact the decision-making process by identifying the features contribute similar information (positive correlation) or opposing effects (negative correlation) in the model.
Feature engineering
The current sub-section of the manuscript discusses the analysis of variance based feature selection mechanism. ANOVA31 is a strong statistical approach for comparing means across several groups frequently utilized in many industries. Its main goal is to see how statistically significant variations exist in the averages of two or more groups. ANOVA is considered effective in experimental designs when the study seeks to evaluate the influence of multiple constraints on a dependent variable. The ANOVA F-test implies that the data is normally distributed and that the variances of every group being compared are identical. If the determined F-value is greater than the F-distribution’s critical value, it shows significant variation among at least one pair of groups. Because of its efficiency in cases wherein one variable is numeric, and the other is categorical, the ANOVA test might be a viable option for feature selection in classification problems. The F-test based on ANOVA assessed using the variable \(ANOV{A}_{{f}_{test}}\) as shown in Eq. from (1) to (3).
In the above equations, the notation \({\sigma }_{inter}^{2}\) designates the inter-class variance and \({\sigma }_{intra}^{2}\) is used to determine the intra-class variance. The notation \({p}_{i}\) designates the mean of feature values associated with class \(i\) and the notation \({s}_{c}\) designates the sum of all classes being considered in the given problem. The notation \({s}_{s}\) represents the sum of samples in the given problem. The significance of each feature is further analyzed using the shapely-based evaluation to determine the significance of the data classification.
Implementation environment
The current section of the study outlines the details associated with the implementation environment. The implementation is carried forward in a standalone environment using the Jupiter notebook. The Python programming language implements the multilayer perceptron neural network for classification. The details of the interface of implementations is discussed in Table 3.
CatBoost with multilayer perceptron neural network
The current section of the study outlines the impact of the CatBoost algorithm using the multilayer perceptron neural network algorithm. The CatBoost algorithm and the objective function are discussed, and then the role of the multilayer perceptron neural network is discussed. Finally, the integration mechanism that is followed in the current study is discussed.
CatBoost algorithm
CatBoost advances the concept of categorical boosting, which optimizes category variable ordering throughout the boosting process, which does not need any feature encoding32. This is accomplished by taking into account the statistics of the target variable for various permutations of the categorical feature levels. CatBoost is a modern algorithm that thrives in situations involving category information. It is particularly suitable for machine learning applications that include categorical and heterogeneous data33. The framework is designed over a symmetric decision tree for the main learners, using fewer trainable parameters. It is designed to handle class variables and achieve high accuracy in processing aspects of class types efficiently and fairly. Furthermore, it effectively addresses the issue of gradient bias and prediction shift, mitigating the likelihood of overfitting. Catboost is a type category of boosting approach that is efficient in handling category features without introducing bias.
The training set consists of \(r\) samples, denoted as \(S\{\left({x}_{i},{y}_{i}\right), For i=\text{1,2},3\dots ,r\} .\) For each sample \({x}_{i}=\{{x}_{{1}_{i}},{x}_{{2}_{i}},\dots ,{x}_{{p}_{i}}\}\) that designates a \(p\)-dimensional input features, while \({y}_{i}\in L\) represents the labeled values. The trained model obtained is denoted as \({T}_{n-1}\). The objective in the subsequent round of training is to create a decision tree \({t}_{n}\) from the set of decision trees \(R\), to minimize the expectation \(E()\) corresponding to the loss function \(L()\). The parameter \({t}_{n}\) the following calculation is determined, where (x,y) represent test samples unrelated to the training set.
The final model \(C\), as represented by Equation (5), is obtained through the combination of the initial weak learner \({L}_{o}\) and the step size \({s}_{z}\) during the \({n}^{th}\) round of training, which is repeated \(N\) times.
CatBoost adds order boosting in the training set against noise points. CatBoost uses the ordered target statistics approach to automatically transform category features into numerical features, increasing immediate support for the categorical parameters. CatBoost utilizes categorical features, significantly augmenting feature dimensions. The base model is a completely symmetric tree, employing the same splitting criteria across all layers for enhanced stability and prediction speed. The formula for introducing prior distribution terms to mitigate the impact of noise and low-frequency features on the data distribution in greedy target-based statistics is expressed as shown in Eq. (6)34.
As discussed in the above equation, the notation \(p\) designates the prior value, and \(\omega\) denotes the corresponding weight. The notation \(d=\{{d}_{1},{d}_{2},\dots {d}_{n}\}\) denotes the permutation over the dataset observation \(\left(x,y\right).\)
Multilayer perceptron neural network
An MLP network architecture comprises three parts: an input layer, a hidden layer, and an output layer. The input layer assigns every input variable one particular input neuron. The buried layer contains the network’s underlying logic. The expected values provided by the output layer aid in precisely classifying the breast cancer samples. The actual computational power of the MLP is characterized by the unrestricted number of hidden units inserted across the input and output layers. An MLP, like a feed-forward network, moves data forward. The backpropagation method is used to train the neuron. MLPs are designed to approximate integrals, enabling the solution of non-linearly separable problems. Comprised of perceptrons, or neurons, an MLP processes \(F\) features as input \(I= {I}_{1},{I}_{2},\dots ,{I}_{n}\) assigning each feature a weight, with weights typically being integer values. It is important to note that non-numeric input features need to be converted into numeric ones before utilizing the MLP algorithm. For instance, a feature vector with \(n\) values can be transformed into \(n\) input features that signify the presence of these values. As depicted in Eq. (7), the input feature vector undergoes the function \({f}_{s}(I)\), which is assessed as the sum of the input instances35.
The notation \({\omega }_{i}\) the above equation denotes the weight assigned to the relevant data item \(i\). The weights are labeled as \({\omega }_{ih}\) , that connects the input and hidden layers, while the weights are designated as \({\omega }_{ho}\) that link that is associated with hidden to the output layer. The notation \({t}_{h}\) denotes the threshold at the buried layers. The network learns the relationship between input units and predicted output feedback by altering the weight and bias settings. Consequently, using the Eq. (8) method, the MLP network predicted output for the \({v}^{th}\) neuron over the \({g}^{th}\) node with an presumed inputs as \(\{{q}_{1},{q}_{2},\dots ,{q}_{i}\}\) at the corresponding layer.
Equation (9) shows the sigmoid function used in the previous equation to evaluate the neuron’s fitness, typically with a value between − 1 and 1 in the MLP model, where n is a set of real numbers.
The subsequent sub-layer within the hidden layer would receive the output \({O}_{h}(x),\) as input for further processing. The \({\omega }_{ho}\) i.e., the weight for hidden to the output layer and \({b}_{ho}\) i.e., bias corresponding to the hidden and output layer values are used to compute the neuron’s output, \({O}_{v}\), according to Equation (10).
Integrating the CatBoost with multilayer perceptron neural network
The interaction between gradient-boosting methods and neural networks has received a lot of interest in the area of machine learning. The integration of CatBoost, which is a robust gradient boosting technique best known for handling categorical data, with an MLP network is being analyzed in the current study. The motivation for the blend of these algorithms is that each has its strengths that are efficient in classifying the data. CatBoost can easily pre-process and improve the representation of category features, and MLPs are good at expressing complex patterns in data. The proposed method aims to combine the best parts of both approaches to improve prediction accuracy when dealing with datasets that have both number and category factors, as seen in Fig. 3. CatBoost is used as a pre-processing step to deal with categorical features in the dataset. This is the first step in making the algorithms work together. Through training data, the CatBoost model learns to turn category features into number representations that show the trends in the data. The changed traits are then put into an MLP structure to learn more about complex relationships and make them more general. The combination of these two lets the model use CatBoost’s skills at handling category features. This gives the MLP better information to learn complex patterns and relationships, which increases the total predictive power of the combined model.
The combined model often makes better predictions than single MLPs and traditional gradient-boosting methods. This shows that mixing CatBoost and MLPs makes it easier to predict success, especially when dealing with real-world datasets with both number and category factors. The current study outlines the impact of hybridizing machine learning approaches for improved model interpretability and performance across various applications.
SHAP-based feature analysis
The model’s explainability is analyzed using SHapley Additive exPlanations (SHAP)36 based feature evaluation. The SHAP assesses a fair contribution for every feature within a predictive model, allowing an unambiguous understanding of their effect on individual forecasts. SHAP provides a uniform and consistent method for understanding the decision outcome, that assists in a better understanding of each input feature that contributes to determining the final decision37. This has contributed to enhancing the transparency of black-box models and given users a systematic and straightforward way to grasp the complex association among the input features and model outputs. The feature significance corresponding to the Breast Cancer Prediction dataset is shown in Fig. 4.
In the model evaluation, the dependence graphs show the SHAP values associated with the features’ dependencies38. Dependency graphs are graphical representations of the relationship between model output and changes to model parameters that help to clarify how a machine learning model arrives at its decision. Figure 5 shows the relationships between features that are most important in the decision-making process, whereas Fig. 6 shows the elements that are least important.
It can be observed from the dependency graphs shown in Fig. 5 that the samples’ colours range from shades of red to blue, conveying crucial information about feature significance in the decision process and directionality of effects. Positive contributions are shown by the red colour, indicating that greater values of the feature are related to higher predictions. The blue colour shows the negative contributions, implying that higher feature values correspond to lower predictions. The underlying relationships may be better understood, and decisions can be made with the help of these colour shades, which indicate the effect of the specific feature on the model prediction outcome.
The SHAP data provided useful insights into the model’s internal workings by providing detailed knowledge of the influence of specific variables on the decision-making process. These dependency graphs not only demonstrated the relative significance of various features, but showed the relationships and interactions between them. This increased comprehensibility of features made it easier to identify significant factors impacting the decision model’s outputs, allowing for more informed interpretations and conclusions. However, SHAP-based analysis does not influence the performance of the model, it is used solely to analyze the influence of features on the decision-making process.
Hyperparameters
The model’s performance is further evaluated using hyperparameters like training and validation loss measures and training and validation accuracy measures. The model overfitting and underfitting could be appropriately assessed using the loss measures39. In the current study the number of samples in the dataset is 569, out of which 20% of the data, i.e., 114 images are used in evaluating the performance of the model. The model feature engineering uses ANOVA techniques for both MLP and CatBoost with MLP models to best fit the model. The feature engineering of the data would result in better regularization of the model, which can be observed from the graphs that are shown in the current sub-section. The training loss quantifies the extent to which the model performs on the data used for training throughout the training phase. It is obtained by comparing the expected results of the model to what is desired in the training data utilizing a loss function. The purpose of training is to minimize its loss, which denotes that the model is improving at generating predictions using the training data. The validation loss quantifies how well the model performs on a second data collection, known as the validation set. The model being trained has not been exposed to validation data during training. Instead, the model’s performance on the validation set is assessed after a particular amount of training epochs. The validation loss determines how successfully the model generalizes to previously unknown data. If the training loss reduces, but the loss due to validation increases, this may suggest overfitting, which occurs when the model fits the data used for training too well and does not generalize well to new data. The loss measures for the CatBoost with MLP and MLP alone are measured. Figure 7 shows the loss measures for MLP alone and the CatBoost with MLP.
From the loss graphs above, we can see that the CatBoost with MLP model converged more quickly than the MLP model alone. The CatBoost with MLP model was fine-tuned well before the 10th epoch, while the MLP model needed more epochs to generalize. The sudden spike in the validation loss curve of the CatBoost with MLP model is likely due to the small size of the dataset, which resulted in a smaller validation set, causing fluctuations in the validation loss during the early epochs.
The training accuracy of a model is an indicator of how well it operates on the training dataset. It is determined by contrasting the model’s predictions on training data to the actual labels and calculating the proportion of accurately predicted cases. High training accuracy implies that the model learned to match the training data accurately. High training accuracy, on the other hand, does not ensure strong performance on fresh, unknown data. Validation accuracy measures how successfully a model generalizes to new, previously unexplored data. It is determined by assessing the model using validation samples, which the model did not view during the training phase. The testing accuracy indicates how the model will likely function on actual data. Figure 8 shows the accuracy measures for MLP alone and the Catboost with MLP.
The accuracy graphs clearly show that the CatBoost with MLP model converged much faster than the MLP model alone. The overlap of the training and validation curves for the CatBoost with MLP model indicates that it is well generalized. The small spikes at the start of the validation curves are due to the smaller validation set size. However, the feature engineering, handling of categorical data, and ability to capture complex non-linear relationships between features helped manage the small dataset effectively. Both the loss and accuracy graphs demonstrate that the proposed CatBoost with MLP model performs better than the MLP model alone. The CatBoost with MLP model is well-fitted, with refined features.
Results and discussion
The current section of the manuscript presents the experimental results obtained on experimenting with the MLP and CatBoost with MLP models over the breast cancer prediction dataset. The proposed model is evaluated using divergent metrics like precision, recall, accuracy, F1-score, and Matthews correlation coefficient (MCC). The proposed models are further evaluated with existing models to assess the model’s impact on precision diagnosis. The assessment is based on the outcomes of true positive, true negative, false positive, and false negative. To evaluate the performance of the proposed model, the dataset is divided into three subsets 70, 10,20 as training, validation, and testing, respectively. All the instances in the dataset are randomly distributed in all three partitions to minimize the risk of introducing bias and artifacts into the subsets. The model’s performance is evaluated without the use of the ANOVA model for feature engineering, and the results obtained are shown using the confusion matrix in Fig. 9 and the corresponding evaluation metrics in Table 4.
In binary classification, a true positive is when a model correctly classifies a positive event as positive. It denotes the model’s precise recognition of the target class. False positive arises when a model mistakenly recognizes a negative event as positive. It illustrates a circumstance in which the model incorrectly shows the existence of a class that does not exist. True negative denotes the model’s precision in the identification of a negative instance as negative. It represents the model’s correct identification of the absence of the target class. False negative is when a model incorrectly labels a positive case as negative. This circumstance shows that the model failed to identify the existence of the target class. The confusion matrix corresponding to the MLP model alone and CatBoost with MLP are presented in Fig. 10. The summary of obtained results concerning various parameters is shown in Table 5.
The suggested model is being evaluated in comparison to other models that have been used in previous research for breast cancer recognition, as shown in Table 6. To maintain uniformity in the analysis, the same set of metrics are being considered, and those metrics that are not considered by other studies are being mentioned as not applicable (NA). As part of the comparative, various existing techniques like KNN, RF, LR, SVM, DT, XGBoost, NCC, and AB are being considered in the current study.
The suggested model’s performance is evaluated using receiver operating characteristic (ROC) curves42. These curves show the balance of recall and specificity at various categorization levels, giving useful information about the model’s capacity to discriminate between positive and negative instances. By comparing the true positive rate to the false positive rate, we gained an extensive understanding of the model’s performance characteristics. Figure 11 shows graphs of MLP alone and CatBoost with MLP.
The diagnostic odds ratio (DOR)43 was considered to be an important metric to evaluate the diagnostic efficacy of the diagnostic models. DOR value is calculated by dividing the probabilities of true positives by the probability of false positives, providing a comprehensive evaluation of the diagnostic accuracy of the model. This statistical metric provides a comprehensive view of a diagnostic test’s efficacy, considering both recall and specificity. The higher value of DOR indicates a more trustworthy and assured assessment result, hence, it is desired to have a higher DOR value. We got useful insights on overall diagnostic effectiveness by including the DOR in the research, allowing for a better comprehensive view of the trade-off between true and false positives. The mathematical equations corresponding to the DOR is shown in Eq. 11.
The obtained results of DOR associated with MLP alone and CatBoost with MLP are shown in Fig. 12. The results obtained from the DOR show that the proposed model outperformed.
The results that are presented in the current study are obtained on experimentation on a single round. The dataset size is comparatively small, which has resulted in negligible differences in the execution process over multiple rounds. The size of the dataset is one of the potential limitations of the current study, where the model performance is not deeply analyzed, and computational overhead is not properly justified, considering the potential for overfitting. We believe the current study would elucidate the crucial features that are considered in the classification of abnormal records in a precise manner, and Shapley would assist in better transparency in the decision process. The current study may be considered the reference model that can be used in real-time scenarios to evaluate the records on the fly. Smart diagnosis technology like the proposed model would assist medical practitioners in planning surgical procedures and personalized medicine44.
The CatBoost algorithm that is used in the current study would assist in dealing with the training data, which can effectively handle the regularization that can help prevent overfitting, making it effective even with smaller datasets. The results of CatBoosting with MLP has shown a better results over the conventional MLP model, the hyperparameter graphs that are shown in section "Hyperparameters" clearly illustrates the model has struggled initial to generalize, but over the epochs the performance of CatBoost with MLP is much superior than the MLP model alone. The ANOVA based feature engineering to identify the significant features for the decision making process. The Shapley analysis would assist in better interpretability of the decision process of the model. The analysis of the feature dependencies using Shapley values provides a comprehensive and transparent view of how features influence model predictions
To sum up the findings in the current study concerning to the research questions that are discussed in the introduction section of the manuscript, the RQ1 is concerning to feature significance, the ANOVA techniques are used in the current study for the feature selection. The performance of the proposed model with and without feature engineering is evaluated, and it is observed from the results that there is a difference of 0.013 in accuracy in the MLP model with and without feature engineering and a 0.040 difference in the case of the CatBoost+MLP model. The RQ2 is primarily about the impact of the categorical boosting algorithm and its impact on dealing with the categorical data, which efficiently recognizes the inherent dependencies and relations among the categorical features. The performance of the CatBoost technique with MLP is being analyzed to determine the impact of categorical boosting techniques in data classification. The results showed a slight enhancement in the performance of the model, with a 0.027 improvement in the model’s accuracy. The RQ3 is mainly about model interpretability, where SHAP-based evaluations are used in the current study to analyze the significance of the features and dependency graphs shown in the proposed methodology section, which illustrate the relationships and interdependency of the features in the decision process. The feature analysis and the resultant accuracies of the models have made it evident that the Shapley analysis would make the decision process evident. The RQ4 is concerned with the association and interdependencies of two or more features in the decision process, whereas in the dependency graphs, the range of color shades from red to blue would assist in determining the contribution of the features. However, the 8 more significant and 8 less significant features are being considered in the current study to confine the number of comparisons. More statistical analysis would assist in better comprehensibility of the hybridization of the classification algorithms with the CatBoost technique.
Conclusion
The proposed CatBoost with a multi-layer perceptron model for breast cancer prediction has outperformed the other conventional techniques that are being considered in the current study. The CatBoost with MLP model has exhibited an overall accuracy of 99.3%, when the training and testing partitions are considered in the ration 80 and 20, respectively, the validation is considered as an integral part of the training set. The model is being evaluated concerning to divergent metrics along with the measures like receiver operating characteristic Curves and diagnostic odds ratio. The proposed model has outperformed the other recent studies being considered in the current study. The XAI-based analysis using the SHAP values would ensure the transparency of the decision model, which makes the study evident. SHAP values provided a visual representation of the complex relationships between features, offering intuitive insights into how changes in one variable may propagate through the model and affect the final predictions. It desired to use SHAP-based feature analysis for better transparency of the internal work of the classification models. The current model can be further incorporated in real-time implementation alongside the existing smart diagnosis for better precision.
The proposed model can be further statistically enhanced using other approaches like stacking, and voting methods with a multi-layer perceptron model. The feature significance can be further enhanced with the local interpretable model-agnostic explanations (LIME) approach and the SHAP-based feature analysis. The proposed model can be further evaluated through fine-tuning the hyperparameters and changing the sequential training of the proposed model.
Data availability
The datasets analyzed during this work can be found at: https://www.kaggle.com/code/liamarguedas/predicting-breast-cancer-a-step-by-step-guide
References
Sandbank, J. et al. Validation and real-world clinical application of an artificial intelligence algorithm for breast cancer detection in biopsies. npj Breast Cancer 8, 129. https://doi.org/10.1038/s41523-022-00496-w (2022).
Tian, Y. et al. PI3K/AKT signaling activates HIF1α to modulate the biological effects of invasive breast cancer with microcalcification. npj Breast Cancer 9, 93. https://doi.org/10.1038/s41523-023-00598-z (2023).
Eswara Rao, G. V., Rajitha, B., Srinivasu, P. N., Ijaz, M. F. & Woźniak, M. Hybrid framework for respiratory lung diseases detection based on classical CNN and quantum classifiers from chest X-rays. Biomed. Signal Process. Control 88(Part B), 105567. https://doi.org/10.1016/j.bspc.2023.105567 (2024).
Patrício, M. et al. Using resistin, glucose, age, and BMI to predict the presence of breast cancer. BMC Cancer 18, 29. https://doi.org/10.1186/s12885-017-3877-1 (2018).
Ijaz, M. F., Alfian, G., Syafrudin, M. & Rhee, J. Hybrid prediction model for type 2 diabetes and hypertension using DBSCAN-based outlier detection, synthetic minority over sampling technique (SMOTE), and random forest. Appl. Sci. 8, 1325. https://doi.org/10.3390/app8081325 (2018).
Chen, H. et al. Classification prediction of breast cancer based on machine learning. Comput. Intell. Neurosci. https://doi.org/10.1155/2023/6530719 (2023).
Jalloul, R., Chethan, H. K. & Alkhatib, R. A review of machine learning techniques for the classification and detection of breast cancer from medical images. Diagnostics 13, 2460. https://doi.org/10.3390/diagnostics13142460 (2023).
Rana, M. & Bhushan, M. Machine learning and deep learning approach for medical image analysis: Diagnosis to detection. Multimed. Tools Appl. 82, 26731–26769. https://doi.org/10.1007/s11042-022-14305-w (2023).
Fatima, N., Liu, L., Hong, S. & Ahmed, H. Prediction of breast cancer, comparative review of machine learning techniques, and their analysis. IEEE Access 8, 150360–150376. https://doi.org/10.1109/ACCESS.2020.3016715 (2020).
Silva-Aravena, F., Delafuente, H. N., Gutiérrez-Bahamondes, J. H. & Morales, J. A hybrid algorithm of ML and XAI to prevent breast cancer: A strategy to support decision making. Cancers 15(9), 2443. https://doi.org/10.3390/cancers15092443 (2023).
Dataset of Indonesian Women’s Reproductive, High-Fat Diet and Body Mass Index Risk Factors for Breast Cancer. https://data.mendeley.com/datasets/xfcyrffhy7/2. Last Accessed on 26 Nov 2023.
Zakareya, S., Izadkhah, H. & Karimpour, J. A new deep-learning-based model for breast cancer diagnosis from medical images. Diagnostics 13(11), 1944. https://doi.org/10.3390/diagnostics13111944 (2023).
Yuesheng, F. et al. Circular fruit and vegetable classification based on optimized GoogLeNet. IEEE Access 9, 113599–113611. https://doi.org/10.1109/ACCESS.2021.3105112 (2021).
Praveen, S. P. et al. ResNet-32 and FastAI for diagnoses of ductal carcinoma from 2D tissue slides. Sci. Rep. 12, 20804. https://doi.org/10.1038/s41598-022-25089-2 (2022).
Uysal, F. & Köse, M. M. "Classification of breast cancer ultrasound images with deep learning-based models. Eng. Proc. 31(1), 8. https://doi.org/10.3390/ASEC2022-13791 (2023).
Kurian, B. & Jyothi, V. Breast cancer prediction using an optimal machine learning technique for next generation sequences. Concurr. Eng. 29(1), 49–57. https://doi.org/10.1177/1063293X21991808 (2021).
Hassan, Md. M. et al. A comparative assessment of machine learning algorithms with the Least Absolute Shrinkage and Selection Operator for breast cancer detection and prediction. Decis. Anal. J. 7, 100245. https://doi.org/10.1016/j.dajour.2023.100245 (2023).
Teng, J. SEER breast cancer data. IEEE Dataport https://doi.org/10.21227/a9qy-ph35 (2019).
Bazazeh D, & Shubair R. Comparative study of machine learning algorithms for breast cancer detection and diagnosis. In 2016 5th international conference on electronic devices, systems and applications (ICEDSA), p. 1–4 (2016).
Chougrad, H., Zouaki, H. & Alheyane, O. Deep convolutional neural networks for breast cancer screening. Comput Methods Prog Biomed 157, 19–30 (2018).
DDSM: Digital Database for Screening Mammography, http://www.eng.usf.edu/cvprg/Mammography/Database.html. Last Accessed on 26 Nov 2023.
Moreira, I. C. et al. INbreast: Toward a full-field digital mammographic database. Acad. Radiol. 19(2), 236–248. https://doi.org/10.1016/j.acra.2011.09.014 (2012).
BCDR: Breast Cancer Digital Repository, https://bcdr.eu/information/about. Last Accessed on 30 Aug 2024.
MIAS Mammography, https://www.kaggle.com/datasets/kmader/mias-mammography. Last Accessed on 30 Aug 2024.
Sharma, D., Kumar, R. & Jain, A. Breast cancer prediction based on neural networks and extra tree classifier using feature ensemble learning. Meas. Sensors 24, 100560. https://doi.org/10.1016/j.measen.2022.100560 (2022).
Reshan, M. S. A. et al. Enhancing breast cancer detection and classification using advanced multi-model features and ensemble machine learning techniques. Life 13, 2093. https://doi.org/10.3390/life13102093 (2023).
Wadhwa, K., Singh, S., Sharma, A. & Wadhwa, S. Machine learning-based breast cancer prediction model. Int. J. Perform. Eng. 19(1), 55–63 (2023).
Strelcenia, E. & Prakoonwit, S. Effective feature engineering and classification of breast cancer diagnosis: A comparative study. BioMedInformatics 3, 616–631. https://doi.org/10.3390/biomedinformatics3030042 (2023).
Chen, H. et al. Classification prediction of breast cancer based on machine learning. Comput. Intell. Neurosci. https://doi.org/10.1155/2023/6530719 (2023).
Breast Cancer Wisconsin (Diagnostic) Data set. https://www.kaggle.com/datasets/uciml/breast-cancer-wisconsin-data. Last Accessed on 30 Aug 2024.
Fraiman, D. & Fraiman, R. An ANOVA approach for statistical comparisons of brain networks. Sci. Rep. 8, 4746. https://doi.org/10.1038/s41598-018-23152-5 (2018).
Wang, D. & Qian, H. CatBoost-based automatic classification study of river network. ISPRS Int. J. Geo-Inf. 12, 416. https://doi.org/10.3390/ijgi12100416 (2023).
Prokhorenkova, L., Gusev, G., Vorobev, A., Dorogush, A. V. & Gulin, A. CatBoost: Unbiased boosting with categorical features. Adv. Neural Inf. Process. Syst. 31, 6638–6648 (2018).
Chang, W., Wang, X., Yang, J. & Qin, T. An improved CatBoost-based classification model for ecological suitability of blueberries. Sensors 2023, 23. https://doi.org/10.3390/s23041811 (1811).
Ahmed, S. A software framework for predicting the maize yield using modified multi-layer perceptron. Sustainability 15, 3017. https://doi.org/10.3390/su15043017 (2023).
Naga Srinivasu, P., Ijaz, M. F. & Woźniak, M. XAI-driven model for crop recommender system for use in precision agriculture. Comput. Intell. https://doi.org/10.1111/coin.12629 (2024).
Srinivasu, P. N., Sandhya, N., Jhaveri, R. H. & Raut, R. From blackbox to explainable AI in healthcare: existing tools and case studies. Mobile Inf. Syst. https://doi.org/10.1155/2022/8167821 (2022).
Zuo, D. et al. Machine learning-based models for the prediction of breast cancer recurrence risk. BMC Med. Inform. Decis. Mak. 23, 276. https://doi.org/10.1186/s12911-023-02377-z (2023).
Ying, H. et al. Enhancing deep neural network training efficiency and performance through linear prediction. Sci. Rep. 14, 15197. https://doi.org/10.1038/s41598-024-65691-0 (2024).
Kumar, M., Singhal, S., Shekhar, S., Sharma, B. & Srivastava, G. Optimized stacking ensemble learning model for breast cancer detection and classification using machine learning. Sustainability 14, 13998. https://doi.org/10.3390/su142113998 (2022).
Uddin, K. M. M., Biswas, N., Rikta, S. T. & Dey, S. K. Machine learning-based diagnosis of breast cancer utilizing feature optimization technique, computer methods and programs in biomedicine. Update 3, 100098. https://doi.org/10.1016/j.cmpbup.2023.100098 (2023).
Tan, Y. K. et al. Receiver operating characteristic analysis using a novel combined thermal and ultrasound imaging for assessment of disease activity in rheumatoid arthritis. Sci. Rep. 12, 22115. https://doi.org/10.1038/s41598-022-26728-4 (2022).
Chakraborty, D. et al. Diagnostic validation study of rapid urinary tract infection diagnosis kit at peripheral health facilities of West Bengal, India. Sci. Rep. 14, 297. https://doi.org/10.1038/s41598-023-49489-0 (2024).
Srinivasu, P. N. et al. Enhancing medical image classification via federated learning and pre-trained model. Egypt. Inf. J. 27, 100530. https://doi.org/10.1016/j.eij.2024.100530 (2024).
Acknowledgments
This study is supported via funding from Prince sattam bin Abdulaziz University Project Number (PSAU/2024/R/1445).
Funding
The authors also acknowledge contributions to this project from the Rector of the Silesian University of Technology under a proquality Grant No. 09/010/RGJ24/0031.
Author information
Authors and Affiliations
Contributions
Conceptualization, P.N.S., G.J., A.G., S.C.N.J.S., ; methodology, M.W. M.F.I.; software, P.N.S., G.J., A.G., S.C.N., J.S., M.W., M.F.I.,.; validation P.N.S., G.J., A.G., S.C.N., J.S.,; formal analysis, P.N.S., G.J., A.G., S.C.N.,.,; investigation, J.S., M.W., M.F.I.; resources, P.N.S., G.J., A.G., S.C.N., data curation, P.N.S., G.J., A.G., S.C.N.,, writing—original draft preparation, P.N.S., G.J., A.G., S.C.N., ; writing—review and editing, J.S., M.W ., M.F.I.; visualization, P.N.S., G.J., A.G., S.C.N., supervision M.W., and M.F.I ; project administration, P.N.S.; M.W.,; and M.F.I ; funding acquisition, M.W., M.F.I..; All authors have read and agreed to the published version of the manuscript.
Corresponding authors
Ethics declarations
Competing interests
The authors declare no competing interests.
Additional information
Publisher’s note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.
About this article
Cite this article
Srinivasu, P.N., Jaya Lakshmi, G., Gudipalli, A. et al. XAI-driven CatBoost multi-layer perceptron neural network for analyzing breast cancer. Sci Rep 14, 28674 (2024). https://doi.org/10.1038/s41598-024-79620-8
Received:
Accepted:
Published:
DOI: https://doi.org/10.1038/s41598-024-79620-8
Keywords
This article is cited by
-
An explainable AI-driven deep neural network for accurate breast cancer detection from histopathological and ultrasound images
Scientific Reports (2025)
-
MicrobeNet: An Automated Approach for Microbe Organisms Prediction Using Feature Fusion and Weighted CNN Model
International Journal of Computational Intelligence Systems (2025)