Introduction

As one of the most common malignant tumors worldwide, lung cancer necessitates early diagnosis to achieve higher cure and survival rates. However, the complex pathophysiological processes and elusive symptoms associated with lung cancer often pose enormous challenges to its early identification1,2. Conventional diagnostic methods rely heavily on the experience and expertise of physicians, but their accuracy and efficiency are limited, especially when dealing with large-scale medical data3.

With the rapid development of medical imaging and bioinformatics technologies, leveraging machine learning (ML) and data mining techniques for lung cancer identification has become a hot research topic4,5,6. In the digital age, the collection, storage, and analysis of data have dramatically promoted development across various sectors7. However, with the continuous growth and sharing of data, privacy protection has become an increasingly serious challenge, especially in the medical field. Safeguarding personal health data is of great significance, as these data involve sensitive information such as medical history and genetic data8,9. Meanwhile, privacy protection for medical data has become one of the major challenges in lung cancer identification research. Conventional ML algorithms typically require training on centralized datasets, which implies that sensitive medical data need to be aggregated for processing, thus introducing potential risks of privacy breaches10,11.

In recent years, privacy protection technologies such as differential privacy (DP) have received considerable attention as effective means of safeguarding sensitive information in data analysis and ML tasks12,13. DP provides a rigorous mathematical framework for quantifying and controlling privacy risks associated with the publication of statistical information derived from sensitive data. As an important privacy protection mechanism, DP has been widely applied in various scenarios, including healthcare, social networks, and financial services, to ensure the proper protection of individual privacy while allowing for the meaningful analysis and utilization of relevant data14. By introducing noise or perturbation to obfuscate data, DP techniques can provide robust protection for individual data privacy while maintaining data utility. This technology is favored not only in academia but also in policy-making and legal regulations. For example, some countries and regions have incorporated DP into data protection laws to ensure respect for individual privacy rights during data processing and sharing15.

Kernel methods, particularly Gaussian kernel functions, have demonstrated remarkable effectiveness in various ML tasks, notably in clustering16,17. Gaussian kernels can efficiently map data into a high-dimensional feature space, in which complex patterns can be captured and exploited for clustering purposes. The adaptability of Gaussian kernels arises from their ability to capture non-linear relationships inherent in the data, enabling robust and accurate clustering performance18,19. In practice, Gaussian kernel-based approaches can effectively handle complex data distributions, uncovering hidden structures that support accurate and insightful clustering outcomes20. The versatility and power of Gaussian kernels make them a cornerstone of modern ML, enhancing its ability to interpret and utilize intricate data patterns in tasks such as lung cancer identification and other challenging applications21.

This study focuses on developing and optimizing privacy-preserving clustering methods tailored for medical data, addressing the critical challenge of safeguarding sensitive patient information while enabling effective data utilization. Privacy breaches in healthcare not only compromise individual rights but also hinder data sharing and collaborative research, both essential for advancing diagnostic methods such as those for lung cancer. In this paper, a novel algorithm based on Gaussian kernel functions and DP techniques, integrated with Fuzzy C-Means (FCM) clustering, is proposed for lung cancer identification22,23,24. By integrating DP with the clustering algorithm, this work aims to balance privacy protection with data utility. Specifically, the proposed approach ensures that sensitive medical information remains secure while preserving the accuracy and interpretability of clustering results, enabling reliable diagnostic insights.

The main contributions of this paper are elucidated as follows.

  1. Propose a Gaussian kernel-based DP technique that can preserve model predictive performance while protecting individual privacy.

  2. Construct a lung cancer identification algorithm based on DP techniques, which is capable of effectively mining potential features from medical data, thus achieving accurate cancer identification.

  3. Conduct experimental validation of the proposed algorithm on multiple public datasets, demonstrating that it achieves favorable lung cancer identification performance while protecting privacy.

Material and method

Data sources

In this study, data collected from three different platforms are used as experimental datasets: the open-source dataset from the ML Repository of the University of California, Irvine (UCIML, http://archive.ics.uci.edu/dataset/62/)25, the open dataset from the National Cancer Institute (NLST, https://cdas.cancer.gov/datasets/nlst/)26, and the publicly available dataset released by Stanford University (NSCLC, https://wiki.cancerimagingarchive.net/display/Public/NSCLC+Radiogenomics)27. These datasets comprise patient demographics, including age distribution, gender, body mass index (BMI), and follow-up treatments, sourced from systems maintained by the University of California, Irvine, the National Cancer Institute, and Stanford University. The collection of all data has been formally reviewed and approved by the respective institutions, allowing access to these data for research purposes. Because this retrospective study uses publicly available databases without any sensitive personal information, no approval from ethics review boards is required for these datasets.

The effectiveness and portability of the proposed algorithm are further validated using the Lung Lesions and Cancer Set (LLCS), a lung cancer dataset collected by Gaochun Hospital, an affiliate of Jiangsu University. The LLCS comprises 256 cases, including 128 lung cancer cases and 128 cases of other lung diseases, and covers various patient characteristics, such as height, weight, and imaging data. All LLCS data used in this study are collected in compliance with relevant guidelines and regulations (Ministry of Science and Technology of the People’s Republic of China, Policy No. 2006398). This study has been approved by the Ethics Committee of Gaochun Hospital, Jiangsu University (No. K202300107). As a retrospective study, it collects only the clinical data of patients, without interfering with their treatment plans or imposing any physical risk; informed consent from participants is therefore not required. Nevertheless, the research team takes all necessary measures to ensure the confidentiality of patient information. This study is conducted in accordance with good clinical practice and the ethical standards set out in the Declaration of Helsinki of 1964 and its subsequent amendments.

DP technology

As a cutting-edge approach in the field of data privacy and security, DP aims to enable the sharing and analysis of sensitive data while safeguarding individual privacy. This revolutionary concept has received accumulating attention in recent years due to the increasing concern over data breaches and privacy violations in various domains, such as healthcare, finance, and social media28,29. In essence, DP provides a mathematical framework for quantifying the privacy guarantees of a data-sharing mechanism. The fundamental idea is to add carefully calibrated noise to query responses or data records to mask sensitive information under the premise of ensuring the meaningful analysis of relevant data. The noise addition ensures that any single individual’s data has a negligible impact on the overall results of the analysis, thereby preserving privacy30.

One of the key advantages of DP is its rigorous formalization, which allows precise privacy guarantees to be quantified and enforced. By carefully controlling the amount of noise added to the data, DP mechanisms can strike a balance between privacy and utility, ensuring that valuable insights can still be extracted from the data without compromising individuals’ privacy31. The notations and descriptions used in the following derivations are listed in Table 1.

Table 1 Notations and descriptions.

Specifically, given a dataset D and a randomized query function Q, DP requires that the output distributions of Q on any two adjacent datasets D and \(\:{D}^{{\prime\:}}\) differ only by a small, bounded factor.

For any adjacent datasets D and \(\:{D}^{{\prime\:}}\) and any output set S, the following inequality must hold,

$$\:\text{Pr}\left[Q\left(D\right)\in\:S\right]\le\:\text{exp}\left(\epsilon\:\right)\text{*}\text{Pr}\left[Q\left({D}^{{\prime\:}}\right)\in\:S\right]$$
(1)

where, \(\:\text{P}\text{r}\) represents the probability function and \(\:\epsilon\:\) represents the privacy parameter of DP, also known as the privacy budget. The smaller \(\:\epsilon\:\) is, the larger the amount of added noise, which implies a lower probability of individual privacy disclosure and a higher degree of privacy protection.

As an important concept in DP, global sensitivity measures the maximum change in a query function’s output caused by the deletion or addition of any single data sample across adjacent datasets. Specifically, for a sensitive function f: D→R, where D represents the ___domain of the dataset and R represents the range of the sensitive function, the global sensitivity can be defined as follows,

$$\:\varDelta\:f=\underset{D,{D}^{{\prime\:}}}{max}\left|f\left(D\right)-f\left({D}^{{\prime\:}}\right)\right|$$
(2)

where, \(\:\left|f\left(D\right)-f\left({D}^{{\prime\:}}\right)\right|\) represents the norm distance from \(\:f\left(D\right)\) to \(\:f\left({D}^{{\prime\:}}\right)\), and the global sensitivity \(\:\varDelta\:f\) represents the maximum change in \(\:f\) over any pair of adjacent datasets; it determines the scale of the noise added to the dataset D. Here, the noise is generated randomly using the Laplacian mechanism, and the Laplacian noise density can be defined as follows,

$$\:Lap\left(x\right)=\frac{1}{2b}\text{exp}\left(-\frac{\left|x-\mu\:\right|}{b}\right)$$
(3)

where, b represents the scale parameter; \(\:\mu\:\) represents the position parameter; x represents the variable. In this function, when x=\(\:\mu\:\), the density attains its maximum \(\:Lap\left(x|b\right)=\frac{1}{2b}\) and decays exponentially on both sides. The scale of the Laplace distribution, b, is determined by the sensitivity of the function and the desired privacy level. Sensitivity measures the maximum possible change in the function’s output due to the addition or removal of a single record in the dataset. The scale is given by:

$$\:b=\frac{\varDelta\:f}{\epsilon\:}$$
(4)

In the case of a constant\(\:\:\varDelta\:f\), the smaller the privacy budget, the larger the scale parameter b, and the greater the added noise.

For any function \(\:f:\text{D}\to\:{R}^{k}\), the Laplacian mechanism can be expressed as:

$$\:Q\left(x,f\left(\text{*}\right),\epsilon\:\right)=f\left(x\right)+({C}_{1},{C}_{2},\dots\:,{C}_{k})$$

where \(\:{C}_{i}\) are independent and identically distributed random variables drawn from \(\:{C}_{i}\sim\text{L}\text{a}\text{p}\left(\frac{\varDelta\:f}{\epsilon\:}\right)\). In effect, noise is added to each component of the query result \(\:f\left(x\right)\) of the database: the first output component is \(\:f\left(x\right)+{C}_{1}\), and the k-th output component is \(\:f\left(x\right)+{C}_{k}\).
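
As a minimal illustration of Eqs. (3)–(4) and the vector-valued Laplace mechanism above, the following Python sketch (function and variable names are illustrative, not taken from the released code) adds Lap(Δf/ε) noise to each component of a query result:

```python
import numpy as np

def laplace_mechanism(query_result, sensitivity, epsilon, rng=None):
    """Add i.i.d. Laplace noise Lap(mu=0, b=sensitivity/epsilon) to each
    component of a query result, i.e. Q(x) = f(x) + (C_1, ..., C_k)."""
    rng = np.random.default_rng() if rng is None else rng
    b = sensitivity / epsilon  # scale parameter b = Δf / ε, Eq. (4)
    noise = rng.laplace(loc=0.0, scale=b, size=np.shape(query_result))
    return np.asarray(query_result, dtype=float) + noise

# Example: Δf = 0.05 and ε = 0.5 give b = 0.1 (cf. Fig. 1C)
noisy = laplace_mechanism([0.42, 0.13, 0.88], sensitivity=0.05, epsilon=0.5)
```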

In the theory of Laplace noise distribution, \(\:\varDelta\:f\), \(\:\epsilon\:\), \(\:\mu\:\), and \(\:b\) are the four most critical parameters. The proper configuration of these parameters directly influences the strength of privacy protection and the performance of clustering (as shown in Fig. 1). The position parameter (\(\:\mu\:\)) represents the mean of the Laplace or Gaussian noise distribution, determining the center of the noise distribution. In this study, \(\:\mu\:\) is set to 0 to ensure that the added noise does not introduce systematic bias to the clustering centers, thereby maintaining the randomness of privacy protection and the stability of model performance.

According to the properties of the Laplace noise distribution, the noise scale parameter \(\:b\) is proportional to \(\:\varDelta\:f\) and inversely proportional to \(\:\epsilon\:\) (\(\:b=\varDelta\:f/\epsilon\:\)). As shown in Fig. 1B, D, and E, different combinations of \(\:\varDelta\:f\) and \(\:\epsilon\:\) can yield the same scale parameter \(\:b\). To simplify the experimental process, \(\:\varDelta\:f\) is fixed at 0.05 in this study, while \(\:\epsilon\:\) is dynamically adjusted to accommodate different scale parameters and sensitivity settings.

Fig. 1
figure 1

The impact of Laplace noise settings with different parameters on image information. (A) The original lung CT image. (B) Laplacian noise distribution with different parameter settings. (C) The original lung cancer image after adding Laplacian noise with parameters \(\:\mu\:=0,\epsilon\:=0.5\),\(\:\varDelta\:f=0.05\:and\:b=0.1\). (D) The original lung cancer image after adding Laplacian noise with parameters \(\:\mu\:=0,\epsilon\:=0.2\),\(\:\varDelta\:f=0.05\:and\:b=0.25\). (E) The original lung cancer image after adding Laplacian noise with parameters \(\:\mu\:=0,\epsilon\:=0.5\),\(\:\varDelta\:f=0.125\:and\:b=0.25.\) (F) The original lung cancer image after adding Laplacian noise with parameters \(\:\mu\:=0.5,\epsilon\:=0.5\),\(\:\varDelta\:f=0.05\:and\:b=0.1.\).

FCM

Fuzzy C-Means (FCM) is a clustering algorithm based on fuzzy theory. Compared with conventional K-means clustering algorithms, FCM allows data points to belong to multiple clusters at the same time by assigning a membership degree to each data point to describe its degree of belonging to each cluster32,33. FCM is widely used in fuzzy clustering and has achieved good results in image segmentation, pattern identification, data mining, and other fields.

Assuming that the dataset \(\:D=\{{x}_{1},{x}_{2},{x}_{3},\dots\:,{x}_{N}\}\) consists of n-dimensional samples, FCM obtains the cluster centers by minimizing an objective function, which is essentially the membership-weighted sum of squared Euclidean distances from the samples to the cluster centers,

$$\:{J}_{m}=\sum\:_{i=1}^{N}\sum\:_{j=1}^{C}{\mu\:}_{ij}^{m}{\left|\right|{x}_{i}-{c}_{j}\left|\right|}^{2}\:\:\:\:\:\:\:\:\:\:\:\:\:\:\:\:\:\:\:\:\:\:\:\:1\le\:m<\infty$$
(5)

where, m represents the fuzzification exponent (weighting exponent) that controls the degree of cluster fuzziness; N represents the number of samples; C represents the number of cluster centers; \(\:{x}_{i}\) represents the i-th sample; |||| represents the Euclidean norm; \(\:{c}_{j}\) represents the j-th cluster center, which has the same feature dimension as the samples; \(\:{\mu\:}_{ij}\) represents the membership of the sample \(\:{x}_{i}\) to the cluster center \(\:{c}_{j}\) (namely the probability of \(\:{x}_{i}\) belonging to \(\:{c}_{j}\)). Numerically, \(\:{\mu\:}_{ij}\) and \(\:{c}_{j}\) can be calculated as follows,

$$\:\left\{\begin{array}{c}{\mu\:}_{ij}=\frac{1}{\sum\:_{k=1}^{C}{\left(\frac{{\left|\right|x}_{i}-{c}_{j}\left|\right|}{{\left|\right|x}_{i}-{c}_{k}\left|\right|}\right)}^{\frac{2}{m-1}}}\\\:{c}_{j}=\frac{\sum\:_{i=1}^{N}{\mu\:}_{ij}^{m}\cdot\:{x}_{i}}{\sum\:_{i=1}^{N}{\mu\:}_{ij}^{m}}\end{array}\right.$$
(6)

where, \(\:{\mu\:}_{ij}\) and \(\:{c}_{j}\) depend on each other. Hence, in actual calculations, the two quantities are updated alternately in an iterative manner. When Eq. (7) is satisfied, the iteration ends.

$$\:\underset{i,j}{\text{max}}\left\{|{\mu\:}_{ij}^{t+1}-{\mu\:}_{ij}^{t}|\right\}<\epsilon$$
(7)

where, \(\:\epsilon\) represents the convergence threshold (distinct from the privacy budget), usually set to 0.01 or 0.001.
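
For reference, the standard FCM updates in Eqs. (5)–(7) can be sketched in NumPy as follows (a plain re-implementation for illustration; it is not the exact code released with this paper):

```python
import numpy as np

def fcm(X, C, m=2.0, tol=1e-3, max_iter=100, rng=None):
    """Plain Fuzzy C-Means: returns (cluster centers, membership matrix U of shape (N, C))."""
    rng = np.random.default_rng() if rng is None else rng
    U = rng.random((X.shape[0], C))
    U /= U.sum(axis=1, keepdims=True)               # memberships of each sample sum to 1
    for _ in range(max_iter):
        Um = U ** m
        centers = (Um.T @ X) / Um.sum(axis=0)[:, None]                  # c_j update, Eq. (6)
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2) + 1e-12
        U_new = 1.0 / np.sum((d[:, :, None] / d[:, None, :]) ** (2.0 / (m - 1)), axis=2)
        if np.max(np.abs(U_new - U)) < tol:         # convergence test, Eq. (7)
            return centers, U_new
        U = U_new
    return centers, U
```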

Combining DP with FCM based on the Gaussian kernel function

Based on the analysis of the above privacy leakage problems, DP protection can be achieved by adding random noise satisfying the Laplace distribution to the cluster centers during the clustering iterations34,35. If the same amount of noise is added to the membership matrix and the cluster centers in every iteration, the cluster centers deviate substantially, which eventually increases the number of algorithm iterations and reduces the availability of the relevant data. Therefore, a privacy budget allocation method combining DP with FCM based on the Gaussian kernel function (DPFCM_GK) is constructed as follows.

The radial basis function (RBF) is a commonly used method for function approximation. An RBF is defined with respect to a center point and determines the output value of a sample by the distance between the input sample and that center; in practice, the Euclidean distance is typically used36,37,38. As an important technique in ML, the Gaussian kernel can be related to fuzzy sets in that it quantifies the similarity between objects,

$$\:\text{k}\left(x,{x}^{{\prime\:}}\right)=\text{e}\text{x}\text{p}\left(-\frac{{\left|\right|x-{x}^{{\prime\:}}\left|\right|}^{2}}{2{\sigma\:}^{2}}\right)$$
(8)

where, \(\:{x}^{{\prime\:}}\) represents the center of the kernel function; \(\:\sigma\:\) represents the width parameter of the function, and it is used to control the radial range of the function.
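
For clarity, Eq. (8) corresponds to the following short NumPy function (illustrative only):

```python
import numpy as np

def rbf_kernel(x, x_center, sigma=1.0):
    """Gaussian (RBF) kernel k(x, x') = exp(-||x - x'||^2 / (2 * sigma^2)), Eq. (8)."""
    diff = np.asarray(x, dtype=float) - np.asarray(x_center, dtype=float)
    return np.exp(-np.dot(diff, diff) / (2.0 * sigma ** 2))
```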

Let the sensitive data x denote the data to be published, with \(\:f\left(x\right)=x\in\:{R}^{k}\), and let D=x and \(\:{D}^{{\prime\:}}={x}^{{\prime\:}}\) be two neighboring datasets, so that \(\:\parallel f\left(D\right)-f\left({D}^{{\prime\:}}\right)\parallel \le\:\varDelta\:f\). For any output \(\:o\in\:{R}^{k}\), the probability density of the Laplace mechanism is:

$$\:\left\{\begin{array}{c}Pr[f\left(D\right)=o]=\prod\:_{i=1}^{k}\frac{\epsilon\:}{2\varDelta\:f}\text{exp}\left(-\frac{\epsilon\:\left|{f\left(D\right)}_{i}-{o}_{i}\right|}{\varDelta\:f}\right)\\\:Pr[f\left({D}^{{\prime\:}}\right)=o]=\prod\:_{i=1}^{k}\frac{\epsilon\:}{2\varDelta\:f}\text{exp}\left(-\frac{\epsilon\:\left|{f\left({D}^{{\prime\:}}\right)}_{i}-{o}_{i}\right|}{\varDelta\:f}\right)\end{array}\right.$$
(9)

The ratio of probabilities for D and \(\:{D}^{{\prime\:}}\) is:

$$\:\frac{\text{Pr}\left[f\left(D\right)=o\right]}{\text{Pr}\left[f\left({D}^{{\prime\:}}\right)=o\right]}=\frac{\prod\:_{i=1}^{k}\frac{\epsilon\:}{2\varDelta\:f}\text{exp}\left(-\frac{\epsilon\:\left|{f\left(D\right)}_{i}-{o}_{i}\right|}{\varDelta\:f}\right)}{\prod\:_{i=1}^{k}\frac{\epsilon\:}{2\varDelta\:f}\text{exp}\left(-\frac{\epsilon\:\left|{f\left({D}^{{\prime\:}}\right)}_{i}-{o}_{i}\right|}{\varDelta\:f}\right)}\:=\prod\:_{i=1}^{k}\text{e}\text{x}\text{p}\left(\frac{\epsilon\:\left(\left|{f\left({D}^{{\prime\:}}\right)}_{i}-{o}_{i}\right|-\left|{f\left(D\right)}_{i}-{o}_{i}\right|\right)}{\varDelta\:f}\right)$$
(10)

Using the triangle inequality, we know:

$$\:\left|{f\left({D}^{{\prime\:}}\right)}_{i}-{o}_{i}\right|-\left|{f\left(D\right)}_{i}-{o}_{i}\right|\le\:\left|{f\left({D}^{{\prime\:}}\right)}_{i}-{f\left(D\right)}_{i}\right|$$
(11)

Summing over all i, we have:

$$\:\sum\:_{i=1}^{k}\left|{f\left({D}^{{\prime\:}}\right)}_{i}-{f\left(D\right)}_{i}\right|\le\:\varDelta\:f$$
(12)

Thus, the probability ratio becomes:

$$\begin {aligned}\:\frac{\text{Pr}\left[f\left(D\right)=o\right]}{\text{Pr}\left[f\left({D}^{{\prime\:}}\right)=o\right]}&=\prod\:_{i=1}^{k}\text{e}\text{x}\text{p}\left(\frac{\epsilon\:\left(\left|{f\left({D}^{{\prime\:}}\right)}_{i}-{o}_{i}\right|-\left|{f\left(D\right)}_{i}-{o}_{i}\right|\right)}{\varDelta\:f}\right)=\text{exp}\left(\sum\:_{i=1}^{k}\frac{\epsilon\:\left(\left|{f\left({D}^{{\prime\:}}\right)}_{i}-{o}_{i}\right|-\left|{f\left(D\right)}_{i}-{o}_{i}\right|\right)}{\varDelta\:f}\right)\\&\le\:exp\left(\frac{\epsilon\:}{\varDelta\:f}\sum\:_{i=1}^{k}\left|{f\left({D}^{{\prime\:}}\right)}_{i}-{f\left(D\right)}_{i}\right|\right)\le\:exp\left(\frac{\epsilon\:}{\varDelta\:f}\varDelta\:f\right)=\text{exp}\left(\epsilon\:\right)\end {aligned}$$
(13)

For any measurable output set S, the same ratio bound holds by integrating the density function over S. This ensures:

$$\:\text{Pr}\left[Q\left(D\right)\in\:S\right]\le\:\text{exp}\left(\epsilon\:\right)\text{*}\text{Pr}\left[Q\left({D}^{{\prime\:}}\right)\in\:S\right]$$
(14)
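
The bound in Eqs. (13)–(14) can also be checked empirically. The short sketch below (an illustrative Monte Carlo check, not part of the released code) estimates the output probabilities for two neighboring scalar outputs and compares their ratio with exp(ε):

```python
import numpy as np

eps, delta_f = 0.5, 0.05
f_D, f_Dp = 0.30, 0.35                       # neighboring outputs with |f(D) - f(D')| <= Δf
rng = np.random.default_rng(0)
b = delta_f / eps                            # Laplace scale b = Δf / ε
out_D  = f_D  + rng.laplace(0.0, b, 1_000_000)
out_Dp = f_Dp + rng.laplace(0.0, b, 1_000_000)
S = (0.3, 0.4)                               # an arbitrary output set
p_D  = np.mean((out_D  > S[0]) & (out_D  < S[1]))
p_Dp = np.mean((out_Dp > S[0]) & (out_Dp < S[1]))
ratio = max(p_D / p_Dp, p_Dp / p_D)
print(ratio, "<=", np.exp(eps))              # the estimated ratio stays below exp(ε)
```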

Gaussian kernel functions have local properties and are suitable for allocating privacy budgets during cluster iterations. In each cluster, points farther away from the cluster center have smaller Gaussian kernel values, while those closer to the cluster center have larger Gaussian kernel values. The value of Gaussian kernel functions reflects the influence of the cluster center. A larger Gaussian value of the cluster center point suggests that the cluster points are more densely distributed around the cluster center, indicating a better clustering effect. Under this circumstance, a smaller privacy budget can be allocated to achieve a higher level of privacy protection21.

Gaussian weights (\(\:{\omega\:}_{i}\)) play a pivotal role in determining the allocation of privacy budgets to cluster centers. The proposed algorithm leverages Gaussian functions to quantify the influence of data points on cluster centroids, with the weights reflecting the relative significance of each cluster in the overall clustering process. The algorithm defines \(\:G\left({c}_{i}\right)\) as the Gaussian function value for the i-th cluster center \(\:{c}_{i}\), computed from the distances between the center \(\:{c}_{i}\) and all data points \(\:{x}_{j}\:\)in the dataset. The Gaussian function \(\:G\left({c}_{i}\right)\) is formulated as,

$$\:G\left({c}_{i}\right)=\sum\:_{j=1}^{N}\text{exp}\left(-\frac{{\left|\left|{x}_{j}-{x}_{i}^{{\prime\:}}\right|\right|}^{2}}{2{\sigma\:}^{2}}\right)$$
(15)

Where \(\:{\left|\right|{x}_{j}-{x}_{i}^{{\prime\:}}\left|\right|}^{2}\) represents the squared Euclidean distance between data point \(\:{x}_{j}\)and cluster center\(\:\:{x}_{i}^{{\prime\:}}\). \(\:\sigma\:\) is a parameter controlling the width of the Gaussian function, influencing how data points contribute to \(\:G\left({c}_{i}\right)\).

The Gaussian weight \(\:{\omega\:}_{i}\) for the i-th cluster is then calculated as:

$$\:{\omega\:}_{i}=\frac{G\left({c}_{i}\right)}{\sum\:_{j=1}^{k}G\left({c}_{j}\right)}$$
(16)

Where k is the total number of clusters. This normalization ensures that \(\:{\omega\:}_{i}\) represents the proportion of the total Gaussian function value contributed by cluster i, thereby quantifying its importance in the clustering process.

The allocation of privacy budgets \(\:{\epsilon\:}_{i}^{t}\:\)to each cluster center \(\:{c}_{i}\) during iteration t is governed by the following formula:

$$\:{\epsilon\:}_{i}^{t}={\epsilon\:}^{t}\cdot\:\frac{1+\underset{j}{\text{min}}{\omega\:}_{j}}{1+{\omega\:}_{i}}$$
(17)

Where \(\:{\epsilon\:}_{i}^{t}\) denotes the privacy budget allocated to the i-th cluster center in iteration t. \(\:{\epsilon\:}^{t}\) represents the overall privacy budget available for the current iteration t. \(\:\underset{j}{\text{min}}{\omega\:}_{j}\) denotes the minimum Gaussian weight among all clusters, ensuring a baseline for privacy protection relative to the least influential cluster. \(\:{\omega\:}_{i}\) is the Gaussian weight of the i-th cluster, determining the proportion of \(\:{\epsilon\:}^{t}\) allocated based on its importance in the clustering process.

Impact of Gaussian Weights on Privacy Budgets:

  • Higher \(\:{\omega\:}_{i}\) values indicate that the i-th cluster center \(\:{c}_{i}\) has a greater influence on the clustering result. As a result, \(\:{\epsilon\:}_{i}^{t}\:\)decreases because stricter privacy protections are required around influential centers to prevent privacy breaches.

  • Lower \(\:{\omega\:}_{i}\) values suggest that the corresponding cluster center \(\:{c}_{i}\) has less impact on the clustering outcome. Consequently, \(\:{\epsilon\:}_{i}^{t}\)increases, allowing for more relaxed privacy protection due to the reduced importance of data around these centers.

The formula\(\:\:{\epsilon\:}_{i}^{t}={\epsilon\:}^{t}\cdot\:\frac{1+\underset{j}{\text{min}}{\omega\:}_{j}}{1+{\omega\:}_{i}}\)​ ensures mathematical rigor by scaling the global privacy budget \(\:{\epsilon\:}^{t}\) inversely to 1+\(\:{\omega\:}_{i}\), thereby adjusting for the impact of each cluster center’s weight \(\:{\omega\:}_{i}\)​ on the privacy guarantee. Utilizing \(\:1+\underset{j}{\text{min}}{\omega\:}_{j}\) normalizes the privacy allocation across clusters, ensuring equitable distribution relative to the least contributing cluster, thereby maintaining fairness in privacy protection across all clusters.
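
Putting Eqs. (15)–(17) together, one iteration of the weight computation, budget allocation, and center perturbation can be sketched as follows (a simplified illustration under the parameter choices stated above, e.g. Δf = 0.05; it is not the exact released implementation):

```python
import numpy as np

def allocate_and_perturb(X, centers, eps_t, sigma=1.0, delta_f=0.05, rng=None):
    """Compute Gaussian weights for each center (Eqs. 15-16), allocate per-center
    privacy budgets (Eq. 17), and return Laplace-perturbed cluster centers."""
    rng = np.random.default_rng() if rng is None else rng
    # G(c_i): sum of Gaussian kernel values between center i and all samples, Eq. (15)
    d2 = np.sum((X[:, None, :] - centers[None, :, :]) ** 2, axis=2)   # shape (N, k)
    G = np.exp(-d2 / (2.0 * sigma ** 2)).sum(axis=0)                  # shape (k,)
    w = G / G.sum()                                                   # Gaussian weights, Eq. (16)
    eps_i = eps_t * (1.0 + w.min()) / (1.0 + w)                       # per-center budgets, Eq. (17)
    # Denser clusters (larger w) receive a smaller budget and hence larger Laplace noise
    noisy_centers = centers + rng.laplace(0.0, delta_f / eps_i[:, None], size=centers.shape)
    return noisy_centers, eps_i, w
```

In a full DPFCM_GK iteration, the perturbed centers would then be fed back into the FCM membership update sketched earlier.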

The proposed model is designed to streamline operations through a structured workflow. The operational summary encompasses three key stages: input processing, core computation, and output generation, ensuring efficiency and accuracy. To clarify the interactions between these stages, Fig. 2 illustrates the operational summary of the proposed model.

Fig. 2
figure 2

Diagram depicting the operational summary of the proposed model.

In this study, a multimodal dataset comprising both image data and clinical feature data is collected. The image data consist of medical images, while the clinical feature data include physiological indicators, such as height, weight, and the number of structures. To leverage these data effectively, convolutional neural networks (CNNs) are employed to extract features from the image data39,40,41.

The DPFCM_GK algorithm was implemented in Python using libraries such as NumPy, SciPy, and scikit-learn for numerical computations and clustering tasks, and CNN-based feature extraction was performed with TensorFlow. Experiments were conducted on a workstation with an Intel Core i7 processor, 16 GB RAM, and an NVIDIA GeForce RTX 3080 GPU. Optimization techniques, including vectorized operations, batch processing, and early stopping, ensured computational efficiency. The CNN architecture for lung cancer image feature extraction consists of grayscale input images (256 × 256 pixels), four convolutional layers with 64 filters (3 × 3) and ReLU activation, max-pooling layers (2 × 2, stride 2), and fully connected layers (sizes 512 and 256). The network was trained using stochastic gradient descent (SGD) with momentum, a learning rate of 0.001, and dropout regularization (dropout rate 0.5) to reduce overfitting. Hyperparameters, including the learning rate, batch size, and dropout rate, were optimized through cross-validation to balance convergence and computational efficiency. The extracted image features are combined with the clinical feature data to form the final feature set D. This integrated feature set D not only contains advanced representations of image data features but also incorporates information from clinical features, thus providing a comprehensive and diverse feature input for the subsequent data analysis and modeling.
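
The described feature-extraction network can be reproduced approximately with the Keras sketch below. Layer sizes, activations, and optimizer settings follow the text; details the text leaves open, such as the momentum value, the number of pooling layers interleaved with the convolutions, and the training head and loss, are assumptions:

```python
import tensorflow as tf
from tensorflow.keras import layers, models

def build_feature_extractor():
    """CNN feature extractor: 256x256 grayscale input, four 3x3 conv layers (64 filters,
    ReLU), 2x2 max pooling (stride 2), fully connected layers of 512 and 256 units,
    dropout 0.5; compiled with SGD + momentum at a learning rate of 0.001."""
    model = models.Sequential()
    model.add(layers.Input(shape=(256, 256, 1)))
    for _ in range(4):
        model.add(layers.Conv2D(64, 3, activation="relu", padding="same"))
        model.add(layers.MaxPooling2D(pool_size=2, strides=2))
    model.add(layers.Flatten())
    model.add(layers.Dense(512, activation="relu"))
    model.add(layers.Dropout(0.5))
    model.add(layers.Dense(256, activation="relu"))   # 256-d image feature vector
    model.compile(optimizer=tf.keras.optimizers.SGD(learning_rate=0.001, momentum=0.9),
                  loss="binary_crossentropy")         # assumed training head/loss
    return model

extractor = build_feature_extractor()  # its outputs are combined with clinical features to form D
```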

Code availability

The custom code used in this study is available at Zenodo (DOI: https://doi.org/10.5281/zenodo.14789535) and can be accessed by readers for reproduction of the results and further analysis. The code is provided with no access restrictions and includes all necessary files for implementation. For additional details or questions, please contact the corresponding author.

Evaluation index

To verify the effectiveness of the joint clustering algorithm based on DP and Gaussian kernel functions, DPFCM_GK is employed to perform clustering on the four experimental datasets. The results are compared with those obtained by FCM and by the differential privacy fuzzy C-means clustering algorithm (DPFCM). Accuracy (ACC), precision (PRE), recall (REC), F1 score (F1-score)42, and adjusted Rand index (ARI)43 are selected to evaluate the effectiveness of the algorithm,

$$\:\left\{\begin{array}{c}ACC=\frac{TP+TN}{TP+TN+FP+FN}\\\:PRE=\frac{TP}{TP+FP}\\\:REC=\frac{TP}{TP+FN}\\\:F1\_Score=\frac{2*PRE*REC}{PRE+REC}\\\:ARI=\frac{RI-E\left(RI\right)}{Max\left(RI\right)-E\left(RI\right)}\end{array}\right.$$
(18)

where, TP (true positive), FN (false negative), FP (false positive), and TN (true negative) represent the numbers of true positive, false negative, false positive, and true negative samples, respectively; RI represents the Rand index, which is numerically equal to \(\:\frac{2(a+b)}{n*(n-1)}\); a represents the number of sample pairs assigned to the same cluster in both partitions; b represents the number of sample pairs assigned to different clusters in both partitions. The calculation results of each model were recorded using Excel software and subsequently analyzed with SPSS version 20.0. Continuous variables were expressed as mean ± standard deviation (\(\:\stackrel{-}{x}\pm\:s)\). For repeated measurements across multiple datasets, repeated measures analysis of variance (ANOVA) was performed. Comparisons of continuous variables between two groups were conducted using paired t-tests. Categorical data were presented as percentages (%) and analyzed using chi-square (\(\:{\chi\:}^{2}\)) tests to evaluate differences between groups. A p-value of less than 0.05 was considered statistically significant.
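
For completeness, the indices in Eq. (18) can be computed directly from the predicted and true labels, e.g. with scikit-learn (a generic illustration rather than the released analysis script):

```python
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, adjusted_rand_score)

def evaluate(y_true, y_pred):
    """Compute the indices of Eq. (18); y_pred holds cluster assignments mapped to class labels."""
    return {
        "ACC": accuracy_score(y_true, y_pred),
        "PRE": precision_score(y_true, y_pred),
        "REC": recall_score(y_true, y_pred),
        "F1":  f1_score(y_true, y_pred),
        "ARI": adjusted_rand_score(y_true, y_pred),   # invariant to label permutations
    }
```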

Result

Effectiveness analysis results

In this study, the clinical feature data of patients are used as the feature set for the UCIML dataset due to the absence of image data. The NLST and NSCLC datasets contain image data, thus requiring the use of CNNs for feature extraction from the image data, followed by combination with the clinical features to form the feature set. FCM, DPFCM, and DPFCM_GK are implemented in Python, and the clustering analysis is performed on the UCIML, NLST, and NSCLC datasets, with the results shown in Fig. 3, where the privacy budget ε ranges from 0 to 5 with a step size of 0.01. The numbers of TP, FN, FP, and TN are recorded under different privacy budget values for the FCM, DPFCM, and DPFCM_GK models, and the values of ACC, PRE, REC, F1_Score, and ARI are calculated accordingly.

Fig. 3
figure 3

Results of the effectiveness analysis of various algorithms based on the experimental data. Figure 3(A, B, C) illustrate the ACC calculation results based on the UCIML, NLST, and NSCLC datasets, respectively. Figure 3(D, E, F) illustrate the PRE calculation results based on the UCIML, NLST, and NSCLC datasets, respectively. Figure 3(G, H, I) illustrate the REC calculation results based on the UCIML, NLST, and NSCLC datasets, respectively. Figure 3(J, K, L) illustrate the F1-score calculation results based on the UCIML, NLST, and NSCLC datasets, respectively. Figure 3(M, N, O) illustrate the ARI calculation results based on the UCIML, NLST, and NSCLC datasets, respectively. The horizontal axis in all figures represents the privacy budget ε. The black line represents the identification results of FCM, the blue line represents the identification results of DPFCM, and the red line represents the identification results of DPFCM_GK.

Table 2 Results of the effectiveness analysis of DPFCM and DPFCM_GK.

It can be observed that the identification results of FCM based on the UCIML, NLST, and NSCLC datasets are independent of the privacy budget. However, the identification performance of DPFCM and DPFCM_GK gradually improves with an increase in the privacy budget ε. Specifically, for the UCIML, NLST, and NSCLC datasets, DPFCM reaches its optimal clustering results at privacy budgets of 2.65, 2.83, and 3.04, respectively, while DPFCM_GK reaches its optimal clustering results at privacy budgets of 1.52, 1.24, and 2.32, respectively. As shown in Table 2, DPFCM_GK outperforms DPFCM across privacy budgets ε, achieving better results in accuracy (ACC), precision (PRE), recall (REC), F1_Score, and adjusted Rand index (ARI). Additionally, the P-value analysis confirms that the differences are statistically significant, highlighting that DPFCM_GK strikes a better balance between privacy protection and model performance.

Comparison results of spatial and time complexity

To calculate the spatial complexity of each model, 12 different privacy budget values ε are adopted in this study, namely 0.01, 0.1, 0.5, 1, 1.5, 2, 2.5, 3, 3.5, 4, 4.5, and 5. Besides, a clustering analysis is performed on DPFCM and DPFCM_GK based on the UCIML, NLST, and NSCLC datasets, with the iteration counts of each algorithm recorded in Fig. 4.

Fig. 4
figure 4

Analysis results of clustering iteration times for various algorithms based on the experimental data. Figure 4(A, B, C) illustrate the calculation results of clustering iteration counts for each algorithm based on the UCIML, NLST, and NSCLC datasets, respectively.

Since noise can disrupt the original clustering convergence process, algorithms with DP protection require more iterations than those without it. In other words, when the privacy budget is small, the iteration counts of DPFCM_GK and DPFCM are higher than those of FCM. As the privacy budget gradually increases, the added random noise decreases, so the average iteration counts of the two DP protection algorithms decrease and approach those of FCM. Meanwhile, DPFCM_GK converges significantly faster than DPFCM, with statistically significant differences (T = 23.08, 43.47, and 48.93; P < 0.05).

Furthermore, this study conducted an in-depth analysis of the time complexity of the proposed DPFCM_GK method and compared it with several existing privacy-preserving methods, including LDP-SGD, LDP-Fed, LDP-FedSGD, MGM-DPL, and LDP-FL44,45,46,47,48. The iteration times of FCM, DPFCM, DPFCM_GK, LDP-SGD, LDP-Fed, LDP-FedSGD, MGM-DPL, and LDP-FL are calculated for each privacy budget, as shown in Fig. 5. For the UCIML dataset, under the same privacy budget, the DPFCM_GK model demonstrates significantly lower time complexity, greatly reducing the runtime compared with models such as DPFCM, LDP-SGD, LDP-Fed, LDP-FedSGD, MGM-DPL, and LDP-FL. Even at higher privacy budget values, DPFCM_GK continues to perform efficiently. The reduction in runtime is statistically significant (T = 21.08, 316.24, 102.35, 222.37, 162.23, and 159.25; P < 0.05). For the NLST and NSCLC datasets, the DPFCM_GK algorithm still maintains the lowest runtime across all models (T = 26.88, 217.66, 132.19, 267.33, 182.12, and 144.13; P < 0.05 and T = 31.78, 267.89, 156.71, 265.38, 191.22, and 124.56; P < 0.05, respectively), demonstrating its superior time efficiency.

Fig. 5
figure 5

Analysis results of clustering running time (ms) of various algorithms based on the experimental data. Figure 5(A, B, C) illustrate the calculation results of clustering iteration time (ms) for each algorithm based on the UCIML, NLST, and NSCLC datasets, respectively.

Algorithm availability verification

To validate the utility of the data, a classification analysis is performed on the lung cancer dataset collected from our hospital using FCM, DPFCM, and DPFCM_GK to verify the robustness and portability of these algorithms. The privacy budget ε is set to 0.01, 0.1, 0.5, 1, 1.5, 2, 2.5, 3, 3.5, 4, 4.5, and 5, respectively. The numbers of TP, FN, FP, and TN identified by FCM, DPFCM, and DPFCM_GK under various conditions are also recorded, and their identification results are simultaneously calculated, as shown in Fig. 6.

Fig. 6
figure 6

Performance evaluation and availability verification results of DPFCM_GK, DPFCM, and FCM using LLCS. (A) Comparison of identified cases for DPFCM_GK, DPFCM, and FCM. The numbers inside the colored blocks represent the count of identified cases corresponding to the horizontal axis. Red blocks indicate the results of DPFCM_GK, blue blocks indicate the results of DPFCM, and gray blocks indicate the results of FCM. M1, M2, and M3 denote DPFCM_GK, DPFCM, and FCM, respectively. The symbol “*” represents a statistically significant difference between two algorithms with a p-value less than 0.05, while “**” represents a p-value less than 0.01. (B, D, F) ROC curves for DPFCM_GK, DPFCM, and FCM, respectively, illustrating their performance under LLCS classification. (C, E, G) PR curves for DPFCM_GK, DPFCM, and FCM, respectively, demonstrating their precision-recall relationships under LLCS classification.

For the lung cancer dataset collected by our hospital, as ε increases, the clustering results of the DPFCM_GK and DPFCM algorithms gradually converge to those of the FCM algorithm, achieving an accuracy rate of 94.5%. Within the range of 0.5 ≤ ε ≤ 2.5, the clustering performance of the DPFCM_GK algorithm is superior to that of the DPFCM algorithm, with statistically significant differences (\(\:{\chi\:}^{2}=\) 4.54, 9.68, 29.12, 21.21, and 4.34; P < 0.05). The ROC and PR curves for the LLCS dataset are also drawn. As the privacy budget ε increases, the Area Under the Curve (AUC) values of the ROC curves predicted by the DPFCM and DPFCM_GK algorithms on the LLCS dataset converge to 0.84. The DPFCM_GK algorithm demonstrates faster convergence and better recognition performance compared with DPFCM.

By setting different privacy budgets, the study evaluated the classification accuracy and data loss rate of each model on the LLCS dataset, providing a comprehensive comparison of the performance of different algorithms under varying privacy budget conditions. Furthermore, with the privacy budget fixed at 5, this study analyzed the relationship between the number of iterations and the data loss rate for each model. The results, presented in Fig. 7, clearly demonstrate the advantages of the DPFCM_GK method in balancing convergence speed and data loss rate. The DPFCM_GK method consistently achieves higher accuracy compared to the other methods (DPFCM, LDP-SGD, LDP-Fed, LDP-FedSGD, MGM-DPL, and LDP-FL) across all privacy budget levels. The DPFCM_GK method achieves the lowest misclassification rate among all models, particularly when the privacy budget is small. As the privacy budget increases, the misclassification rate decreases for all methods, but DPFCM_GK maintains its advantage (T = 4.20, 8.44, 10.92, 3.95, 7.16, 8.51, P < 0.05). The DPFCM_GK method converges faster to a lower misclassification rate compared to other methods, highlighting its efficiency in achieving better results within fewer iterations. When the privacy budget ε is set to 5, the confusion matrices of the DPFCM, DPFCM_GK, LDP-Fed, LDP-FL, LDP-SGD, MGM-DPFL, and MGM-FedSGD models are evaluated using the LLCS dataset (as shown in Fig. 7D–J).

Fig. 7
figure 7

Performance evaluation of DPFCM_GK and other methods under varying privacy budgets using LLCS. (A) The relationship between the test accuracy (%) of various models and privacy budget values. (B) The relationship between the misclassification rate (%) of various models and privacy budget values. (C) The relationship between the misclassification rate (%) of various models and the number of iterations, with the privacy budget fixed at 5. (D-J) When the privacy budget ε is set to 5, the confusion matrices of the DPFCM, DPFCM_GK, LDP-Fed, LDP-FL, LDP-SGD, MGM-DPFL, and MGM-FedSGD models are evaluated using the LLCS dataset.

Discussion

Medical data refer to various information related to patients’ health, and they are collected, stored, and processed in the healthcare field, including personal physical conditions, medical records, diagnostic results, treatment plans, and prescription medications49. These data flow among medical institutions, healthcare providers, researchers, and insurance companies for disease diagnosis, treatment planning, medical research, and healthcare service management. However, the privacy breach of medical data could pose serious risks50,51,52. Firstly, the leakage of sensitive health information may lead to infringement of personal privacy rights, even exposing individuals to risks like identity theft and financial loss. Secondly, the leakage of medical data may weaken individuals’ trust in the healthcare system and reduce their willingness to seek medical care and share health information, thereby affecting the quality and effectiveness of healthcare services. Additionally, the unauthorized use of leaked medical data could be exploited by malicious actors for cyberattacks, resulting in the paralysis of healthcare systems and posing serious threats to patient safety. Therefore, there is an urgent demand for realizing the sharing and fusion of data under the premise of protecting the privacy of medical data53,54.

To optimize or solve the problems associated with medical data security and privacy protection, data encryption and data protection technologies have been proposed and explored by many researchers55,56. DP is a data processing framework that protects personal privacy information by adding noise to the original data, making it difficult for attackers to infer sensitive personal information from the noise. The application of DP in medical data security has exhibited great potential, which can effectively protect the privacy of medical data while allowing for data sharing and analyses.

With the advancement of deep learning (DL) algorithms, particularly convolutional neural networks (CNNs), significant progress has been made in applying DL-based models to computer-aided diagnosis (CAD) systems for lung tumor detection. These models aim to improve diagnostic accuracy, reduce false positive rates, and enhance execution efficiency (Table 3)57,58. Much like feature-based CAD systems, DL-based systems typically follow a three-step workflow: nodule detection and segmentation, feature extraction, and clinical judgment inference59. However, unlike traditional methods that rely on manually crafted features, DL-based CAD systems can automatically learn and extract intrinsic features from nodules60,61, as well as model their 3D shapes. This automatic feature extraction capability allows for more accurate and comprehensive analysis of lung nodules. For instance, Ciompi et al.62 developed a model based on OverFeat63,64, which extracts 2D-view feature vectors from CT scans in three planes: axial, coronal, and sagittal. These vectors enable a multi-view analysis of the nodule, capturing essential features for classification. Recent CNN models have further advanced this approach by providing a global, holistic view of nodules for more detailed feature characterization from CT images. Buty et al.59 proposed a complementary CNN model that incorporates both shape and appearance features for lung nodule classification. In their approach, a spherical harmonic model60 is used for nodule segmentation, which yields shape-based descriptors of the nodule. Simultaneously, a deep convolutional neural network (DCNN)63 is employed to extract texture and intensity features, referred to as appearance features. These two types of features are then combined to enhance the downstream classification process, enabling the model to effectively distinguish between benign and malignant nodules. Similarly, Venkadesh et al.66 implemented an ensemble approach by combining two different CNN models: a 2D-ResNet50-based model67 and a 3D-Inception-V1 model68. Each of these models specializes in extracting specific features from pulmonary nodules. The 2D-ResNet50 model focuses on extracting 2D features, while the 3D-Inception-V1 model captures 3D information. By concatenating the features extracted by these models, the ensemble model achieves superior performance in identifying malignant nodules of varying sizes from raw CT images. The ensemble CNN approach offers several advantages, particularly in its ability to accurately classify nodules of diverse shapes, sizes, and textures. Leveraging the capabilities of state-of-the-art CNNs, these models can generate high-quality features that significantly improve diagnostic outcomes. Once these features are extracted, clinical judgment inference can be performed using various machine learning (ML) techniques, including logistic regression (LR), random forest (RF), support vector machine (SVM), and neural networks (NNs).

Another study utilized the Inception-V3 network to classify whether the tissue was LUAD, LUSC, or normal from H&E-stained histopathology whole-slide images69. A highlight of this study is that the model can also predict whether a given tissue has somatic mutations in several lung cancer driver genes, including STK11, EGFR, FAT1, SETBP1, KRAS, and TP53. Lin et al.70 used a two-step model — a DCGAN to generate synthetic lung cancer images and an AlexNet for lung cancer classification using both original and synthetic datasets. Similar work was also done by Ren and colleagues71. They also used DCGAN for data augmentation. To improve performance, they then designed a regularization-enhanced transfer learning model called VGG-DF for data discrimination to prevent overfitting problems with pre-trained model auto-selection.

Although the aforementioned studies have made significant advancements in the identification of lung nodules and lung cancer using artificial intelligence (AI) and computer-aided diagnosis (CAD) systems, the issue of patient privacy protection has received limited attention in these works. This lack of focus on privacy is particularly concerning, given that AI-based systems often require access to sensitive medical data, such as patient CT scans and health records, to train and validate their models. Most existing methods prioritize improving diagnostic accuracy, reducing false positives, and optimizing computational efficiency, but they largely overlook the critical ethical and legal implications of handling patient data. Traditional data anonymization techniques, such as removing identifiable information, are no longer sufficient to guarantee privacy in the era of AI. Advanced re-identification techniques can potentially reconstruct patient identities from de-identified medical images or data patterns, especially when combined with external data sources. This highlights the critical need for robust privacy-preserving mechanisms in AI-driven CAD systems. However, privacy-preserving approaches, such as federated learning and differential privacy, have been rarely applied in the context of lung cancer detection or CAD systems. This oversight leaves a significant gap in the current research landscape.

In DP research, clustering algorithms are employed to classify data into sets with similar characteristics, which can be used to discover hidden patterns and correlations in the data. However, conventional DP clustering algorithms often face a conflict between privacy protection and data utility. On the one hand, the noise added to protect privacy degrades data utility, leading to inaccurate clustering results72. On the other hand, reducing the noise to improve the accuracy of clustering results may compromise the effectiveness of privacy protection.

In this study, a clustering algorithm is proposed by combining DP with the Gaussian kernel function. The Gaussian kernel function is a commonly used kernel function that maps data into high-dimensional space for better clustering and classification. The controlled noise allows for more precise clustering results while protecting data privacy. In this study, the DP technique based on the Gaussian kernel function provides an effective means of securing medical data. Through noise addition, DP can protect sensitive information while maintaining the accuracy of clustering results. Based on four datasets, such clustering metrics as ACC, PRE, ReCall, F1_Score, and ARI gradually align with those of FCM as the privacy budget increases, indicating that the Gaussian kernel function can perform the clustering analysis reliably while preserving data privacy. Additionally, the proposed DPFCM_GK converges rapidly with an increase in the privacy budget. Compared with conventional DP algorithms, DPFCM_GK exhibits faster convergence, fewer iterations, and lower time complexity. Moreover, DPFCM_GK is verified using a lung cancer dataset collected from our hospital. The results indicate that as the privacy budget ε increases, DPFCM_GK can effectively perform lung cancer identification and classification. This validates that this algorithm can also effectively limit the risk of privacy leakage under the premise of ensuring data utility.

Table 3 Publications relevant to ML on detection and diagnosis.

The success of DPFCM_GK lies in its innovative integration of differential privacy mechanisms with the robust fuzzy clustering framework. By adding carefully calibrated noise to the data or computations, DPFCM_GK effectively protects sensitive information while maintaining high clustering accuracy. Its fuzzy clustering approach assigns degrees of membership to each data point, enhancing tolerance to noise and uncertainty, which is crucial in privacy-preserving scenarios. The algorithm is designed to balance privacy and utility by optimizing parameters such as the privacy budget ε, ensuring minimal performance degradation while achieving strong privacy guarantees. Furthermore, the mathematical enhancements to the traditional FCM model, such as adjusting the calculation of membership degrees and cluster centers with privacy-preserving constraints, allow DPFCM_GK to perform well under differential privacy constraints. This combination of privacy-preserving design, robustness to noise, and optimized performance makes DPFCM_GK highly effective for sensitive data clustering in practical applications.

In this study, the effect of the privacy budget ε on the convergence behavior of the DPFCM_GK algorithm was analyzed. When ε decreases, representing stricter privacy constraints, the Laplace mechanism introduces more significant noise into the system. This noise perturbs the iterative update of the cluster centers by distorting the update direction, which slows the convergence of the algorithm and increases fluctuations in the stability of the clustering results. On the other hand, higher values of ε, corresponding to weaker privacy constraints, reduce the noise added by the mechanism. This allows the algorithm to achieve faster convergence and enhances the consistency and reliability of clustering outcomes. This relationship highlights the inherent trade-off between ensuring rigorous privacy protection and maintaining optimal model performance. Stricter privacy measures safeguard data confidentiality but at the cost of computational efficiency and clustering accuracy. Conversely, relaxing privacy constraints benefits algorithm stability and convergence but reduces the level of privacy protection.

However, the limitations of the DPFCM_GK approach must be acknowledged to provide a balanced view of its performance. First, while the joint clustering algorithm is effective on small and medium-sized datasets, it faces computational challenges on larger datasets: as the dataset size increases, the time complexity of the algorithm rises, which may hinder its scalability for real-world applications involving large volumes of data. Second, the integrated DP mechanism introduces a trade-off between ensuring data privacy and maintaining data utility; the noise added to safeguard privacy may reduce the accuracy of the clustering results, particularly for sensitive datasets, and this trade-off should be considered when implementing the approach in practical, privacy-sensitive domains. Third, the algorithm was primarily evaluated on a limited set of datasets. Although the results were promising, further exploration is needed to evaluate its adaptability to diverse data types or datasets from different fields, as its performance may vary when applied to data with different structures or characteristics, suggesting that more testing is needed to ensure its generalizability.

While the results highlight the effectiveness of our method, we recognize the importance of integrating more recent advancements in the field to further strengthen the robustness of our approach. One such advancement is the introduction of quantum machine learning algorithms, as discussed by Rishabh Gupta et al. in “Quantum Machine Learning Driven Malicious User Prediction for Cloud Network Communications.” This paper introduces an innovative model for predicting malicious users in cloud network communications, leveraging quantum machine learning to efficiently identify and predict malicious users. This approach enhances system security by proactively detecting malicious entities before a data breach occurs73. Furthermore, in their subsequent research, they developed the DT-PPM model, which employs techniques such as data partitioning, noise injection, and classification to safeguard medical data in cloud environments. This model offers a secure way to share outsourced data from pathology centers while ensuring privacy protection and significantly reducing computation time74.

Their research shares similarities with our approach, particularly in their use of noise injection to protect data privacy. However, their focus on multi-center data, especially data from various pathology centers, presents challenges that are not yet addressed in our current study. While we focus on adding Laplace noise to medical data using a Gaussian kernel, we have not yet explored the complexities of privacy protection in a multi-center environment. Future work will aim to incorporate these advanced techniques to tackle privacy issues in multi-center data environments. Specifically, we plan to extend our method to include mechanisms that protect data from multiple pathology centers, where privacy concerns are more pronounced due to the involvement of diverse data sources. This will require adapting our Gaussian kernel-based approach to effectively handle data partitioning and multi-center noise injection, building on the foundation set by the reference model.

Conclusion

In conclusion, the proposed DPFCM_GK offers a robust solution for cancer identification, ensuring both data accuracy and privacy. This algorithm exhibits faster convergence, fewer iterations, and lower time complexity compared with conventional DP algorithms. Therefore, it is considered an appealing choice for real-world applications in medical data analyses. Overall, it represents a significant step forward in securing sensitive medical data, especially in the field of cancer identification. Its applicability to other medical datasets and its potential for broader healthcare applications may be explored in further studies.