Introduction

AI has created untapped opportunities for accelerating precision in medicine, including transformations in medical imaging that offer improved efficiency and enhanced disease diagnosis, therapy planning, and surveillance. To date, even with relatively small datasets, studies have shown promise of AI across imaging modalities, clinical subspecialties, and organ domains, such as disease delineation1,2,3,4, diagnosis5,6,7,8,9, outcomes10,11, and underlying genomics12,13, some of which surpass human performance.

Ultimately, “big data” is a key to the success of AI in medicine. For this reason, many AI investigations have focused on relatively common adult diseases pooled from one or a few large centers, e.g., breast or lung cancer, pneumonias, heart disease, intracranial hemorrhage, or acute ischemic strokes. However, many rare or pediatric diseases with data scattered across hospitals currently do not benefit from the advancements in AI. Even CheXNet, a chest radiograph dataset with >100,000 annotations5 and considered “large” among medical datasets, is dwarfed by the non-medical ImageNet, which contains >14 million annotated images14. Current AI models on cross-sectional imaging, e.g., MRI or CT (often considered the clinical workhorse), are trained on significantly less data, raising questions of AI reliability and generalization and thereby further exacerbating a general lag in AI adoption in healthcare.

Medical data is not scarce, however. Large troves exist across the world in the form of electronic health records and imaging archives and are fertile grounds for large-scale AI developments. Unfortunately, barriers to data sharing across institutions, while necessary for patient privacy, have impeded progress in AI for healthcare. Federated learning (FL) has emerged as one potential solution that enables model training across multiple, decentralized datasets, without direct patient data sharing15. It offers better privacy and local data autonomy while facilitating learning from a distributed data source in which diverse factors contribute to disease phenotypes and their outcomes. From such a network, AI can learn complex relationships and potentially uncover new clinical perspectives.

Recent FL works in medicine have shown feasibility16,17,18,19 but with limited scope, e.g., few participating FL sites or small range of classes or datasets that are limited in diversity and size. Prior FL investigations20,21,22 have examined segmentation of adult gliomas that typically arise in the supratentorial brain. Children, however, present with more diverse brain tumor pathologies, the majority occurring in the posterior fossa (PF) spaces that include the brainstem and the cerebellum.

In this work, we present an end-to-end, MRI-based FL platform for PF tumors, FL-PedBrain, on a large international pediatric dataset of 19 institutions from North America, Europe, West Asia, North Africa, and Australia (Fig. 1). We targeted pediatric tumors, given both their pathologic diversity and general scarcity even within subspecialty pediatric hospitals. Hence, a successful FL platform could uniquely benefit this data-sparse, yet vulnerable population.

Fig. 1: Federated Learning Platform.

Participating sites (a) and FL procedure (b). a North America: Stanford Children’s Hospital (ST—Palo Alto, California), Seattle Children’s Hospital (SE—Seattle, Washington), Phoenix Children’s Hospital (PH—Phoenix, Arizona), Primary Children’s Hospital (UT—Salt Lake City, Utah), Children’s Hospital Orange County (CH—Orange County, California), Dayton Children’s Hospital (DY—Dayton, Ohio), Indiana University Riley Children’s (IN—Indianapolis, Indiana), Lurie Children’s Hospital of Chicago (CG—Chicago, Illinois), NYU Langone Medical Center (NY—New York City, New York), Children’s Hospital of Philadelphia (CP—Philadelphia, Pennsylvania), Duke Children’s Hospital (DU—Durham, North Carolina), Boston Children’s Hospital (BO—Boston, Massachusetts), Toronto Sick Kids Hospital (TO—Toronto, Canada); Europe: Great Ormond Street Hospital (GO—London, United Kingdom), Tepecik Health Sciences (TK—Izmir, Turkey), Koç University (KC—Istanbul, Turkey); North Africa: Centre International Carthage Médical (TU—Monastir, Tunisia); West Asia: Tehran University of Medical Sciences (TM—Tehran, Iran); Australia: The Children’s Hospital at Westmead (AU—Sydney, Australia). b Our FL framework incorporates FL warm-up on the largest sites and proximal regularization to learn on heterogeneous sites, but we report the best results with μ = 0.

We examine a heterogeneous group of pediatric PF tumors with diverse clinical outcomes, dependent on tumor pathology or genomics, surgical resection margins, or their candidacy for new drug therapies. We conducted tumor classification and segmentation jointly, as success of such tasks prior to surgery can directly impact precision in surgical margins and radiation targets and alter other therapy strategies that aim for cure with minimal risk. Specifically, we orchestrated FL training with real data from 19 participating sites from five continents and compared its efficacy against the traditional pooled data approach, i.e., centralized data sharing (CDS). We investigated real-life scenarios where some sites provide missing or imbalanced data. Finally, we explored the underlying sources of data heterogeneity, such as variations in image quality or site-specific tumor features.

Results

Study cohort

A total of 1468 unique PF tumor subjects (mean age 7.6 years; 48% females) were included, comprising 596 MB, 210 EP, 335 PA, and 327 DIPG. Table 1 summarizes the demographic information and the tumor pathologies distributed across the 19 institutions.

Table 1 Demographic table

Classification

Table 2 summarizes the model performances, comparing FL and CDS. FL achieved classification performance on par with CDS, without a statistically discernible difference. We present FL and CDS confusion matrices summarizing the classification performance on the four tumor pathologies (MB, EP, PA, DIPG) in Fig. 2a, b. Figure 2c also illustrates the overall accuracy per site. Compared to either FL or CDS, Siloed training significantly underperforms and shows large performance variance across the sites (Fig. 2c).

Table 2 Classification accuracies for CDS and FL on all validation sets
Fig. 2: Performance of FL on the validation sites compared to CDS and siloed training using the ST site model.

Confusion matrices (a, b) for the classification task and per site performance (c). The hospitals TK, AU, and TU are external and independent hold-out sites. Source data are provided as a Source Data file.

Segmentation

As shown in Tables 3 and 4, FL achieves an overall segmentation performance that approaches CDS on both the 16 validation datasets and the 3 hold-out test sites. Compared to either FL or CDS, Siloed training underperforms by >20%, a performance drop that is greater for segmentation than for classification (Fig. 2c). Both FL and CDS yielded the best segmentation performance on DIPG tumors, whereas performance on MB was lower than on the other tumors (Table 3). Within the tumor subgroups, FL matched CDS performance on MB and PA tumors (no statistically significant difference on the t-test), while FL slightly underperformed compared to CDS on EP and DIPG. Such variations might suggest heterogeneity in tumor voxel volume between the sites (see section on Heterogeneity). While the mean Dice Similarity Coefficient (DSC) on the validation sets was congruent for both FL and CDS, FL exhibited slightly higher variability, i.e., greater standard deviation, suggesting underlying differences in the model behavior.

Table 3 Segmentation Statistics for CDS and FL on Validation Sets
Table 4 Segmentation DSC performance on the three independent hold-out sites

Supplementary Table 1 presents the classification and segmentation results for the 16 independently trained models, each using its respective site-specific dataset. The outcomes suggest subpar performance across the board, attributable to the limited size of individual datasets. Notably, models from sites UT and CP showed the highest segmentation DSCs, reaching 0.57. However, models from five sites did not converge.

Visualizations and quality of FL training

Figure 3 illustrates sample segmentation outputs from FL-PedBrain compared to the ground-truth segmentations. We also present a t-SNE23 visualization of projected embedded features from the FL-PedBrain classification model on the validation set (Fig. 4a). Note the unique tumor features that are also distinct from normal pediatric brains. A corresponding violin plot of all per-example Dice scores (DSC) in the 16-site population is also shown in Fig. 4b. Finally, Fig. 5 illustrates convergence during training, comparing CDS to FL. Consistent with prior observations15, CDS requires fewer learning updates to converge.

Fig. 3: Sample FL segmentation predictions compared to ground-truth segmentations.

Sample predictions of the FL-trained model compared to ground-truth segmentations across various tumor types sampled at different depth regions of the brain. The Source model is provided in the GitHub link.

Fig. 4: Classification and Segmentation Results.

Visualization of the FL-trained model features (a) and DSCs (Dice scores) (b). The violin plot displays the median, quartiles, and minimum and maximum values of the distribution. Source data are provided as a Source Data file.

Fig. 5: FL Training Runs.

Training runs comparing CDS (a) to FL (b), and ablation study showing impact of participating sites in the network (c). Runs using CDS (a) and FL (b) show fast convergence for both the classification and segmentation tasks. A federated warm-up was performed on the two largest sites first. These runs (c) show the influence of adding sites into the FL training network on the worst-case class prediction (Ependymoma) measured at 100, 150, 200, 250, and 300 FL rounds. The error bars represent one standard deviation of variation among five independent FL runs. Source data are provided as a Source Data file.

Heterogeneity

Since data heterogeneity is a key consideration in AI studies, we examined various sources of data heterogeneity. One notable factor was the significant class imbalances across the participating sites, both in the sample sizes and the pathologic subgroups, with some sites completely missing certain tumor types, as shown in Table 1. Interestingly, we observed differences in T2-MRI pixel variance in Fig. 6, especially for DIPG and PA, possibly reflecting larger variances in solid, hemorrhagic, necrotic, or cystic components, or other tumoral habitats unique to astrocytomas. We also found significant variation in relative tumor volumes across sites. Despite such sources of data heterogeneity—including extreme class imbalances—we found no evidence such factors impacted FL convergence.

Fig. 6: Visualization of heterogeneity.

Differences in T2-MRI pixel variance, especially for DIPG and PA, possibly reflecting larger variances in solid, hemorrhagic, necrotic, or cystic components. Significant variation in tumor volumes across sites was found. Source data are provided as a Source Data file.

Impact of FL Warm-up

FL across just the two largest centers achieved >70% classification accuracy and segmentation DSC, except for the EP cases. Adding the remaining sites significantly enhanced performance, as shown in Fig. 5c. We found that Federated Warm-up was important; without warm-up, training times were up to 10 times longer and overall performance was lower, especially for EP classification and segmentation.

Better performance with more active FL sites

We assessed the impact of site activity on FL performance with an ablation experiment that measures the FL system’s performance relative to the number of active training sites, as depicted in Fig. 5c. We reran the full FL experiment while integrating progressively more sites into the training process (x axis), prioritizing those with larger datasets. The performance evaluation is based on the F1 score of the label that performs the poorest. Our findings indicate a positive correlation between the number of active sites and the F1 score: as more sites participate in the FL network, the F1 score improves, eventually equaling the peak score achieved when all available sites are active.
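A minimal sketch of a worst-class F1 metric of the kind described above, assuming per-class F1 is computed one-vs-rest from predicted and true labels and the minimum is reported. The function names and the toy labels are hypothetical illustrations and are not taken from the FL-PedBrain code.

```python
from collections import Counter

def per_class_f1(y_true, y_pred, labels):
    """One-vs-rest F1 for each label: 2TP / (2TP + FP + FN)."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1  # predicted class gains a false positive
            fn[t] += 1  # true class gains a false negative
    scores = {}
    for c in labels:
        denom = 2 * tp[c] + fp[c] + fn[c]
        scores[c] = 2 * tp[c] / denom if denom else 0.0
    return scores

def worst_class_f1(y_true, y_pred, labels):
    """Return the label with the lowest F1 and its score."""
    scores = per_class_f1(y_true, y_pred, labels)
    worst = min(scores, key=scores.get)
    return worst, scores[worst]

# Toy example (fabricated labels, for illustration only).
labels = ["MB", "EP", "PA", "DIPG"]
y_true = ["MB", "MB", "EP", "EP", "PA", "DIPG"]
y_pred = ["MB", "MB", "MB", "EP", "PA", "DIPG"]
worst_label, worst_score = worst_class_f1(y_true, y_pred, labels)
```

Tracking the minimum rather than the mean makes the ablation sensitive to the hardest class, which in this study is EP.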

Future challenges and practical implementation

FL-PedBrain introduces logistical challenges, communication overhead, model synchronization, and computational demands.

Communication and logistical challenges

In FL, every participating hospital must regularly exchange model updates, specifically the model weights after each FL training round. For our classification-segmentation model, this equates to transmitting ~125 MB of model weights per round. This culminates in a data transfer of ~74 GB per hospital for each training session with 200 rounds. Training the largest dataset for 1 epoch consumes ~3–4 minutes on a V100 GPU, and the time to then transfer all 16 models from each hospital to the central parameter server (coordinating hospital) in Fig. 1a is roughly 7 minutes at a 1 MB/s internet upload rate, assuming that the central server’s download rate is much faster than 1 MB/s. This equates to about 10 minutes per round (1 epoch per round) and 2000 minutes to ship one trained model. Although CDS only requires a one-time collection of 200–1000 GB of DICOMs, FL offers benefits by removing the need for data use agreements and deidentification, which can take a long time to establish and verify. Finally, FL provides advantages such as continuous quality control and oversight from each of the sites’ technical model builders. The provided figures are rough estimates; actual performance will vary as hospitals differ in computing power, communication standards, and data transfer speeds. Asynchronous FL is particularly beneficial in environments where hospitals differ not just in data but also in computational and networking resources. Recent methods for heterogeneous FL24 can potentially alleviate communication and compute overheads.
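The wall-clock estimate above can be reproduced with a short calculation. This is a back-of-the-envelope sketch that assumes training and transfer happen sequentially within a round and uses only the per-round figures quoted in the text; the function name is hypothetical.

```python
def training_wall_clock_minutes(rounds, train_min_per_round, transfer_min_per_round):
    """Total minutes if each round trains locally and then transfers weights."""
    return rounds * (train_min_per_round + transfer_min_per_round)

# Quoted figures: 200 rounds, ~3-4 min of training and ~7 min of transfer per round.
low_estimate = training_wall_clock_minutes(200, 3, 7)
high_estimate = training_wall_clock_minutes(200, 4, 7)
hours_low = low_estimate / 60.0
```

With the lower training time the total is 2000 minutes, or roughly 33 hours, which matches the figure stated above; slower sites or links would lengthen each synchronous round accordingly.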

Need for on-site technical expertise

Additionally, having both clinical and AI experts per site would greatly enhance and streamline the FL workflow, enabling them to (1) inspect the training and evaluation data for any obvious imaging artifacts or integrity of diagnosis and (2) monitor the training process as the model evolves. We intend our FL framework not to be used just for static datasets like in the CDS case but rather as a bedrock for active learning on growing datasets. Therefore, human integration into the FL pipeline is a very promising future direction.

Discussion

We present an FL system for pediatric cancer, FL-PedBrain, specifically targeting PF tumors. While brain tumors represent the most common solid neoplasm of childhood, they remain sparse compared to adult tumors and are dispersed across pediatric or subspecialty centers. Thus, a successful collaborative platform that enables large-scale AI learning across institutions could uniquely benefit this population. Here, we capitalize on a large and diverse brain MRI dataset of pediatric PF tumors from 19 global institutions and present and evaluate an FL design that jointly conducts tumor pathology prediction and segmentation, optimized for this relatively data-sparse population.

Overall, we found robust generalization of FL-PedBrain across all sites, including the three external holdouts. Compared to CDS that uses pooled data from all sites, FL deviates by less than 1.5% in the classification and only 3% in the segmentation performances, with no statistical difference between CDS and FL on classification and slightly lower segmentation performance of FL on two of the four tumor groups. On the other hand, Siloed training—or training confined to a local site—performs ~20% worse compared to either FL or CDS, highlighting the risks of AI generalization and brittle models.

Prior FL studies on brain tumors have exclusively focused on segmentation of adult gliomas20,21,22. In this work, we trained the classifier and segmentation jointly. Unlike prior FL studies that employed extensive image manipulations, e.g., skull-stripping and rigid atlas-based brain co-registration20,21,22, we used real-life, raw MRI data that included brain tissue, skull, scalp, and head sizes of all ages, so that FL-PedBrain could be used in an end-to-end clinical deployment. Despite the heterogeneous dataset (infant to adult head sizes and diverse tumor pathologies beyond gliomas, e.g., embryonal and glial tumor cells of origin) and without requiring image manipulation prior to FL training, FL-PedBrain performed segmentation on par with prior adult studies. FL-PedBrain also outperformed (F1 scores of 0.877 and 0.856 for CDS and FL, respectively) a prior pilot study9 that used pooled data for PF tumor prediction (F1 score of 0.800). Recent advances in FL strategies19,25,26,27,28 tackle learning on heterogeneous data and environments. Federated Proximal learning (FedProx25) is an adjustment to Federated Averaging that can accommodate model drift. One important ingredient is the proximal weight penalty, which ensures that the local updates do not stray too far from the global model, thereby making the training process more robust to data heterogeneity among clients. We found that Federated Averaging (μ = 0) achieves higher and more consistent segmentation performance across the 19 sites on average compared to other advanced strategies such as weight transfer, exploiting synthetic data, and knowledge distillation26.
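To make the proximal weight penalty concrete, the following sketch uses the standard FedProx form, in which the local objective adds (μ/2)·‖w − w_global‖² to the task loss. Weights are represented as flat lists of floats purely for illustration, and the function names are hypothetical rather than part of FL-PedBrain.

```python
def proximal_penalty(local_weights, global_weights, mu):
    """(mu / 2) * squared L2 distance between local and global weights."""
    squared_distance = sum(
        (lw - gw) ** 2 for lw, gw in zip(local_weights, global_weights)
    )
    return 0.5 * mu * squared_distance

def fedprox_local_objective(task_loss, local_weights, global_weights, mu):
    """Task loss plus the proximal term that discourages client drift."""
    return task_loss + proximal_penalty(local_weights, global_weights, mu)

# With mu = 0 the penalty vanishes and the objective is the plain task loss,
# i.e. the local objective used by Federated Averaging.
objective_fedavg = fedprox_local_objective(0.42, [1.0, 2.0], [0.0, 0.0], mu=0.0)
objective_fedprox = fedprox_local_objective(0.42, [1.0, 2.0], [0.0, 0.0], mu=0.1)
```

Setting μ = 0 therefore corresponds to the Federated Averaging configuration reported here as giving the best results.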

Moreover, we presented the Federated Warm-up method to combat challenges of severe non-uniform distributions of sample sizes across the FL network. This allows the training process to learn from the sites with the largest data samples for a few federated rounds. Thereafter, the learning proceeds to all the sites, including the ones with missing classes.

FL-PedBrain jointly classifies and segments brain tumors, which addresses several clinical needs. First, a more precise, pre-surgical knowledge of the PF tumor pathology could impact therapy. For example, less aggressive, safer resection margins may be desirable for more radio-sensitive MB compared to EP. Patient counseling and therapeutic strategy may vastly differ for non-resectable DIPG versus less aggressive but “infiltrative”-appearing PA tumors that can mimic DIPG. Second, since PF tumors often plague critical brain regions such as the brainstem, more precise tumor localization via segmentation that ports into surgical navigation can also optimize maximal resection for cure (e.g., EP or PA tumors) while minimizing risks. It may also enhance radiation targets and offer efficiency in radiomics or other quantitative tumor analytics.

Currently, manual maximal, linear measurements (x, y, z dimensions) are used to calculate tumor growth or regression. While useful, these are crude metrics for tumor tissues that are asymmetric or irregular, and also prone to interobserver variability. Thus, FL-PedBrain could be used to more reliably calculate tumor volumes across serial imaging. Segmentation masks generated from FL-PedBrain can also be plugged into a radiomics pipeline for tumor genomics4 or enhanced risk-stratification10 and potentially clarify patient candidacy for various individualized therapies.

We recognize several limitations of this work. First, the MRI scans originated from various MRI hardware across different sites, each employing unique protocols, leading to disparities in image quality. For example, we found site-to-site variations in T2-MRI image intensities. Second, differences in clinical practices and culture or site-specific barriers to MRI may impact tumor features at the time of diagnosis. For example, sites that use MRI to screen children at high risk of brain tumors, e.g., patients with specific genetic syndromes, might find tumors at an earlier phase versus under-resourced communities that catch tumors at a later period, when the patient finally decompensates. Alternatively, a particular subspecialty center may attract more complex or advanced tumors due to referral patterns. For example, we found that some sites tend to host larger tumors, e.g., PA tumors, compared to others. Class imbalance within the TK and AU sites likely contributes to a slightly lower classification performance. For the segmentation task, where the scores are calculated per pixel across the entire 256 × 256 × 64 head volume, we observe similar performance between FL and CDS.

Nevertheless, we incorporated such heterogeneous conditions to properly investigate FL feasibility in a real-life setting. Real-world datasets are generally non-IID (non-independent and identically distributed) and can thus impact the final FL performance compared to baseline CDS. Here, we highlight multiple sources of heterogeneity inherent in our data: (1) imbalanced number of samples per site; (2) age and sex differences across sites (Table 1); (3) imbalanced or missing tumor classes at a few sites of the federated network; (4) site-specific variations in the MRI signal intensities; and (5) site-specific variations in the tumor sizes. More sophisticated FL techniques such as Federated Proximal techniques and variations25 might improve training convergence with imbalanced classes. However, we have not found improvement using such methods. Despite such sources of data heterogeneity, we show robustness of FL-PedBrain, as shown by the training convergence graph (Fig. 5b), with an FL performance that not only closely matches CDS but also offers the advantages of AI training without data sharing across the sites within the FL network. Lastly, while the study does not account for human inter-reader variability, our segmentation masks reflect a consensus-based ground truth validated by six experts in the field.

In conclusion, we presented and evaluated a federated platform for pediatric brain tumors that is privacy-preserving and does not require sharing of data and showed its feasibility on a heterogeneous tumor pathology and diverse MRI dataset from 19 geographic centers. We emphasize the potential of FL in accelerating large-scale, clinically translatable AI for pediatric datasets and other heterogeneous, privacy-preserving data.

Next steps will include a study on the prospective deployment of real-time FL-PedBrain at local hospitals, requiring no additional data processing to enhance clinical usability. The methodology and results of this work lay the groundwork for future applications of FL in radiology and beyond, towards collaborative, efficient, and ethical AI-driven developments.

The key results are as follows:

  1. FL-PedBrain, a platform for MRI-based FL, performs on par with the traditional CDS AI method for the concurrent classification and segmentation of pediatric posterior fossa brain tumors. Both FL and CDS yield 20 to 30% higher segmentation performance compared to siloed learning from localized, limited data sources.

  2. Heterogeneity is inherent in real-world medical image data and can be quantitatively described by class imbalances, MRI signal intensities, and even tumor sizes across different centers.

  3. Despite data heterogeneity, FL-PedBrain achieves high generalization performance across 19 sites across the world.

Methods

This multi-center, retrospective study underwent approval by the Stanford University institutional review board (IRB) and execution of data use agreements across the participating sites, with a waiver of consent/assent (IRB No. 51059: Deep Learning Analysis of Radiologic imaging). Nineteen institutions from North America, Europe, West Asia, North Africa, and Australia participated in the study (Fig. 1a). Waiver of consent was granted by the IRB for the following reasons: (1) As a retrospective study, the research involves no more than minimal risk to the participants as the materials involved (data, documents, records) have already been collected and precautions will be taken to ensure confidentiality, (2) the waiver will not adversely affect the rights and welfare of the participants as there are procedures in place that protect confidentiality, and (3) the information learned during the study will not affect the treatment or clinical outcome of the participants.

The inclusion criteria were: patients who presented with a new, treatment-naive PF tumor; had pathologic confirmation for any of the following benign or malignant tumors: medulloblastoma (MB); ependymoma (EP); pilocytic astrocytoma (PA); and in the case of diffuse intrinsic pontine glioma (DIPG), MRI and/or biopsy-based diagnosis; obtained pre-treatment brain MRI that included axial T2-weighted imaging (T2-MRI). Subjects were excluded if the imaging was non-diagnostic due to severe motion degradation or other artifacts. Table 1 summarizes cohort demographics and site-specific tumor pathology.

Tumor segmentation was performed on axial T2-MRI by an expert board-certified, pediatric neuroradiologist (KY, >15 years’ experience), followed by a consensus agreement among three pediatric neuroradiologists (AJ, JW, MK) and two pediatric neurosurgeons (SC, RL). Segmentation was performed over the whole tumor, inclusive of cystic, hemorrhagic, or necrotic components within the tumor niche. T2-MRI was selected as it is most frequently acquired on routine MRI protocols; is embedded within pre-surgical navigation; and most reliably identifies the tumor margins regardless of enhancement, hence, recommended for pediatric glioma assessment29.

MRI acquisition

MRI of the brain was obtained using either 1.5 or 3 T MRI systems. The following vendors were employed across sites: GE Healthcare, Waukesha, WI; Siemens Healthineers, Erlangen, Germany; Philips Healthcare, Andover, MA; and Toshiba Canon Medical Systems USA Inc., Tustin, CA. The T2-weighted MRI (T2-MRI) sequence parameters were: T2 TSE clear/sense, T2 FSE, T2 propeller, T2 blade, T2 drive sense (TR/TE 2475.6–9622.24/80–146.048 ms); slice thickness 1–5 mm with 0.5 or 1 mm skip; matrix ranges of 224–1024 × 256–1024.

Study design

Dataset distribution

Of the 19 sites, 16 sites were selected to participate in the model training and validation; the remaining three sites served as independent, external hold-out sites. A dataset from a database of normal pediatric brain MRI (N = 1667 from ST site) was used for pretraining. Within each of the 16 sites that participated in model training and validation, 75% of the MRI data was used in the training set; the remaining 25% was used as the hold-out validation set. Sex and/or gender were not considered in sample selection.

Statistics and reproducibility

No statistical method was used to predetermine sample sizes of the training, validation, and external, independent validation sites. All data collected from the 19 sites were used. The training runs showed minor variations in convergence for different random seeds.

Data preprocessing

Each site must possess the small but important knowledge to manage consistent data preprocessing, a task that, under CDS, would typically be centralized by a trusted party. To streamline preprocessing, we avoided complex preprocessing steps (e.g., brain registration to a common atlas or skull-stripping). Preprocessing only includes: (1) normalization of each 3D image to a simple 0–255 intensity range and (2) volume extraction of 64 contiguous axial slices of 256 × 256. These preprocessing steps are executed via an automated script applied to the DICOM data across all 19 sites. The number of 64 slices was chosen such that it can handle virtually all of the variations of the individual sites’ T2 sequence parameters (e.g., TSE, FSE, Propeller, etc.) with a large range of slice thicknesses (e.g., 1–5 mm) based on site-specific scanner technology and protocols. Therefore, our FL system can accommodate a large range of sequence parameters and axial slices. While MRI data of the normal pediatric brain were not required in the FL experiments, we observed that they could help pretrain the model to identify the geometry and spatial locations of the pediatric brain across all ages, i.e., infants to adult head sizes of teenagers. The normal dataset (N = 1667) was shared and distributed amongst the participating sites for both CDS and FL approaches. However, the normal cohort was not used in the validation or the hold-out test sets.
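The two preprocessing steps could look roughly like the sketch below. The source does not specify how the 64 slices are selected or how the in-plane matrix is brought to 256 × 256, so min-max scaling, center crop or zero-padding along the slice axis, and nearest-neighbor in-plane resampling are all assumptions made here for illustration; the function names are hypothetical, and DICOM loading is omitted.

```python
import numpy as np

def normalize_to_uint8(volume):
    """Min-max scale a 3D volume (slices, rows, cols) to the 0-255 range."""
    vol = volume.astype(np.float32)
    lo, hi = float(vol.min()), float(vol.max())
    if hi == lo:  # constant image: avoid division by zero
        return np.zeros(vol.shape, dtype=np.uint8)
    return np.round((vol - lo) / (hi - lo) * 255.0).astype(np.uint8)

def fit_slices(volume, n_slices):
    """Assumed strategy: center-crop or zero-pad the slice axis to n_slices."""
    size = volume.shape[0]
    if size >= n_slices:
        start = (size - n_slices) // 2
        return volume[start:start + n_slices]
    before = (n_slices - size) // 2
    after = n_slices - size - before
    return np.pad(volume, ((before, after), (0, 0), (0, 0)), mode="constant")

def resize_plane_nearest(volume, plane):
    """Assumed strategy: nearest-neighbor resampling of rows and columns."""
    rows = np.round(np.linspace(0, volume.shape[1] - 1, plane)).astype(int)
    cols = np.round(np.linspace(0, volume.shape[2] - 1, plane)).astype(int)
    return volume[:, rows][:, :, cols]

def preprocess(volume, n_slices=64, plane=256):
    """Normalize intensities, then produce an n_slices x plane x plane volume."""
    vol = normalize_to_uint8(volume)
    vol = fit_slices(vol, n_slices)
    return resize_plane_nearest(vol, plane)

# Synthetic stand-in for one T2-MRI series (40 slices of 300 x 280).
example = np.arange(40 * 300 * 280, dtype=np.float32).reshape(40, 300, 280)
processed = preprocess(example)
```

Keeping the pipeline to these two steps means each site only needs to run one script, which is the design motivation stated above.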

Federated model development and evaluation

We developed a 3D model that jointly performs tumor pathology prediction (MB, EP, PA, DIPG, normal) and segmentation using FL (Fig. 1b). In the CDS approach, we combined the datasets from all 16 sites into a single pool, on which we trained the model. We also examined a Siloed model trained using the training and validation data from a single site only (Site ST, which hosted the largest single institution dataset), which was then evaluated on the 16 hold-out validation sets and 3 external independent sites. In contrast, the FL strategy used a method known as Federated Averaging15. Within this framework, the 16 sites did not share data. Instead, they only shared information via model parameters learned on each site-specific dataset.

Each FL round began with local model training at the individual sites, after which each site transmitted the learned weights back to a central server. Here, the model weights from each of the 16 sites were averaged, creating a unified, global set of weights. These weights were then distributed back to each site to initiate the next FL round, where local training resumed. This iterative process, alternating between local training and centralized averaging, continued through many FL rounds. Eventually, the finalized global model underwent evaluation across the 16 validation sets and three hold-out test sets, its performance reflecting the collaborative—yet segregated—approach that characterizes the FL paradigm.
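The server-side averaging step described above can be sketched as follows. Model parameters are represented as dictionaries of flat float lists for illustration only. The text describes a plain average, which is the default here; the optional `site_sizes` argument shows the sample-size-weighted variant commonly used with Federated Averaging and is an assumption, not something the source confirms.

```python
def federated_average(site_weights, site_sizes=None):
    """Average per-site parameter dicts into one global parameter dict.

    site_weights: list of {param_name: [float, ...]} with identical keys/shapes.
    site_sizes:   optional list of sample counts for a weighted average.
    """
    n_sites = len(site_weights)
    if site_sizes is None:
        coeffs = [1.0 / n_sites] * n_sites  # plain (unweighted) mean
    else:
        total = float(sum(site_sizes))
        coeffs = [size / total for size in site_sizes]

    global_weights = {}
    for name, reference in site_weights[0].items():
        global_weights[name] = [
            sum(c * w[name][i] for c, w in zip(coeffs, site_weights))
            for i in range(len(reference))
        ]
    return global_weights

# Two hypothetical sites returning their locally trained weights.
site_a = {"conv1": [1.0, 3.0]}
site_b = {"conv1": [3.0, 5.0]}
plain = federated_average([site_a, site_b])
weighted = federated_average([site_a, site_b], site_sizes=[1, 3])
```

The resulting global dictionary is what would be sent back to every site to start the next round.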

We modified the conventional FL strategy by creating a “warm-up” phase for the initial model, called Federated Warm-up, which enabled an efficient FL training to hasten convergence, given the large disparity in data distributions that underlie the 16 participating centers. The FL training consists of two stages enabling efficient learning: an initial 50 rounds of Federated Averaging on the ST and SE sites followed by 150 additional rounds of Federated Averaging across all 16 sites. A convergence plot that illustrates this Federated Averaging “warm-up” is shown in Fig. 5.
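The two-stage schedule can be expressed as a simple rule for which sites are active in a given round. In this sketch the site codes are taken from the Fig. 1 caption, with the 16 training sites inferred by excluding the three hold-out sites (TK, AU, TU); the function and constant names are hypothetical.

```python
WARMUP_SITES = ["ST", "SE"]
ALL_TRAINING_SITES = [
    "ST", "SE", "PH", "UT", "CH", "DY", "IN", "CG",
    "NY", "CP", "DU", "BO", "TO", "GO", "KC", "TM",
]

def active_sites(round_index, warmup_rounds=50):
    """Sites participating in a given (0-based) FL round."""
    if round_index < warmup_rounds:
        return WARMUP_SITES       # Federated Warm-up on the two named sites
    return ALL_TRAINING_SITES     # full federation afterwards

# 50 warm-up rounds followed by 150 rounds across all 16 sites.
schedule = [active_sites(r) for r in range(200)]
n_warmup_rounds = sum(1 for sites in schedule if len(sites) == 2)
n_full_rounds = sum(1 for sites in schedule if len(sites) == 16)
```

In each round, only the sites returned by `active_sites` would train locally and contribute weights to the averaging step.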

We employed a 3D-UNet architecture, incorporating a Kinetics-pretrained encoder that was initially trained on large-scale video data30. The 3D architecture allowed for processing 64 slices of high-resolution planes per datum, necessitating substantial GPU memory to manage large batch sizes. For the CDS training, 200 epochs were conducted with a combined loss function of Cross-Entropy and Dice Score Loss, using the Adam optimizer with a learning rate of 0.0001. This combined loss function facilitated the learning of both classification and segmentation predictions.
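As an illustration of a combined classification and segmentation loss, the sketch below sums a cross-entropy term over class probabilities and a soft Dice loss over voxel probabilities. Equal weighting of the two terms, the smoothing constant, and the flat-list representation are assumptions; the source only states that Cross-Entropy and Dice losses were combined.

```python
import math

def cross_entropy(class_probs, true_index, eps=1e-7):
    """Negative log-probability assigned to the true class."""
    return -math.log(max(class_probs[true_index], eps))

def soft_dice_loss(pred_mask, true_mask, eps=1e-7):
    """1 - soft Dice overlap between predicted probabilities and a binary mask."""
    intersection = sum(p * t for p, t in zip(pred_mask, true_mask))
    denominator = sum(pred_mask) + sum(true_mask)
    return 1.0 - (2.0 * intersection + eps) / (denominator + eps)

def combined_loss(class_probs, true_index, pred_mask, true_mask, dice_weight=1.0):
    """Cross-entropy for the pathology label plus weighted Dice for the mask."""
    return (cross_entropy(class_probs, true_index)
            + dice_weight * soft_dice_loss(pred_mask, true_mask))

# Toy example: 5 classes (MB, EP, PA, DIPG, normal) and a 4-voxel mask.
loss_perfect = combined_loss([1.0, 0.0, 0.0, 0.0, 0.0], 0, [1, 0, 1, 0], [1, 0, 1, 0])
loss_partial = combined_loss([0.5, 0.5, 0.0, 0.0, 0.0], 0, [1, 1, 0, 0], [1, 0, 0, 0])
```

Because both terms are driven by the same network, a single backward pass updates the shared encoder for both tasks.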

For classification performance, we calculated raw model accuracies and F1 scores. For segmentation, we used the same model as in the classification task to calculate the DSCs. The DSC determines the overlap between the predicted and ground-truth segmentations and thus offers insights into the quality of segmentation. We also conducted a two-sided t-test on the DSCs to compare the performance between CDS and FL. The predictions are approximately normally distributed due to the large test sample sizes.
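The evaluation metrics can be sketched with the standard library alone. The DSC follows the usual definition 2|A∩B| / (|A| + |B|). For the comparison of CDS and FL, the sketch assumes an unequal-variance (Welch) statistic and a normal approximation for the two-sided p-value, in line with the large-sample remark above; the exact test variant used in the study is not stated, and the DSC values below are fabricated toy numbers.

```python
import math
from statistics import NormalDist, mean, variance

def dice_coefficient(pred, truth):
    """DSC = 2|A ∩ B| / (|A| + |B|) for two binary masks given as flat sequences."""
    intersection = sum(1 for p, t in zip(pred, truth) if p and t)
    total = sum(1 for p in pred if p) + sum(1 for t in truth if t)
    return 1.0 if total == 0 else 2.0 * intersection / total  # empty masks: assumed 1.0

def two_sided_test(scores_a, scores_b):
    """Welch-style statistic with a normal-approximation two-sided p-value."""
    standard_error = math.sqrt(
        variance(scores_a) / len(scores_a) + variance(scores_b) / len(scores_b)
    )
    t_stat = (mean(scores_a) - mean(scores_b)) / standard_error
    p_value = 2.0 * (1.0 - NormalDist().cdf(abs(t_stat)))
    return t_stat, p_value

dsc_example = dice_coefficient([1, 1, 0, 0], [1, 0, 0, 0])

# Toy per-case DSCs for two models (illustrative values only).
cds_scores = [0.8, 0.9, 0.7]
fl_scores = [0.7, 0.8, 0.9]
t_stat, p_value = two_sided_test(cds_scores, fl_scores)
```

A p-value above the chosen significance threshold would be read as no statistically discernible difference between the two training strategies.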

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.