Introduction

Differentiated thyroid cancer (DTC) accounts for approximately 90% of thyroid cancers, with papillary thyroid cancer (PTC) representing the most common type1. Iodine-131 (131I) therapy is a standard treatment for patients with DTC who have undergone near-total or total thyroidectomy, aiming to reduce the risk of disease recurrence and metastasis2. Evaluating postoperative disease status is pivotal in determining the necessity and strategy for 131I therapy, which is typically based on a variety of findings mainly including surgical records, pathology reports, serum thyroglobulin (Tg) levels, and neck ultrasonography3. Post-therapeutic 131I scan can identify any 131I-avid structural disease, helping to assess disease status and guiding disease management and follow-up strategies2. Nevertheless, patients who did not meet the criteria for 131I treatment or chose to omit 131I therapy were unable to undergo post-therapeutic 131I scan imaging, thereby impeding physicians from accessing this information. Tg is a powerful tumor marker for monitoring disease persistence, recurrence, and progression, informing treatment decisions and predicting outcomes. However, Tg levels may be influenced by several factors, such as postoperative timing, residual thyroid tissue, thyroglobulin antibodies (TgAb), thyroid stimulating hormone (TSH) levels, and disparate assays4,5,6. Therefore, identifying new reliable biomarkers is imperative. The American Thyroid Association (ATA) guidelines’ three-tiered risk assessment system is widely used to guide 131I residual ablation and adjuvant therapy decisions, emphasizing pathology report data encompassing lesion resection extent, peripheral infiltration degree, tumor subtype, and molecular expression2. Inspired by this guideline, we sought to extract more information from pathological tissues for predicting lymph node and/or distant metastases in post-therapeutic 131I scan.

Pathomics emerges as a promising tool, extracting sub-visual features from digital pathological slices to depict disease manifestations and behaviors7. Previous studies demonstrate pathomics’ ability to differentiate benign and malignant thyroid nodules and predict genetic mutations8,9,10. Advancements in digital pathology allow researchers to explore novel applications. Recently, a study by Kim et al. highlighted the feasibility and potential benefits of digital pathology in predicting lateral cervical lymph node metastasis, suggesting a correlation between the information gleaned from whole-slide images (WSIs) and progression-free survival (PFS) in thyroid cancer patients11. However, such research remains nascent, constrained by predefined morphological parameters, limiting full utilization of WSI information.

This study aimed to develop and validate a digital pathology image-based model that predicts lymph node and/or distant metastases in PTC patients following initial 131I therapy in post-therapeutic 131I scan. The proposed pathology model enables early treatment regimen adjustment for intermediate- and high-risk PTC patients, optimizing therapy dosage.

Materials and methods

Patients

A retrospective study was conducted using the institutional database of Henan Provincial People’s Hospital. Patients who received initial 131I treatment between January 2022 and August 2023 were included. Inclusion criteria were: (1) underwent near/total thyroidectomy; (2) histologically confirmed PTC; (3) met the ATA guidelines for being intermediate- or high-risk patients. Exclusion criteria were: (1) age less than 18 years; (2) positive TgAb (TgAb > 4 IU/mL); (3) patients with other malignant tumors; (4) incomplete clinical data or missing follow-up; (5) missing or poor-quality pathology slides. Eligible patients were randomly allocated to training and test cohorts in a 7:3 ratio. The study flow diagram is shown in Fig. 1.

Fig. 1
figure 1

The flow diagram of the study population.

This study was reviewed and approved by the Institutional Review Board of the Henan Provincial People’s Hospital & Zhengzhou University People’s Hospital (No: 2024-625). Due to its retrospective nature, informed consent requirements were waived by the Ethics Committee of the Henan Provincial People’s Hospital & Zhengzhou University People’s Hospital. All methods were performed in accordance with the relevant guidelines and regulations.

131I therapy

All patients were required to consume a low-iodine diet and withdraw from the levothyroxine (LT4) treatment for 4 weeks. The serum TSH level was > 30 mU/L in all patients. TSH, stimulated Tg (sTg), and TgAb measurements, neck ultrasound, and chest computed tomography (CT) scan were completed prior to 131I treatment. Tg levels were determined using a Roche Cobas E602 Automatic electrochemiluminescence immunoanalyser from Roche Diagnostics Products (Shanghai) Co., Ltd.; and the Tg value detection range (0.04–500 ng/mL) exceeded the upper limit of dilution detection. TgAb levels were determined using the UniCel DxI 800 (Beckman Coulter USA). Next, 131I uptake (RAIU) was measured 24 h after the administration of 0.111 MBq of 131I using a gamma counter during hospitalization. The administered 131I activity was 3.70 GBq (100 mCi) for remnant ablation, 3.70–5.55 GBq (100–150 mCi) for adjuvant therapy, and 5.55–7.4 GBq (150–200 mCi) for distant metastases. Radiation protection precautions were managed in strict accordance with national and local regulations.

Post-therapeutic 131I scan

A posttherapy whole-body scan (WBS) with routinely 1 bed position single photon emission computed tomography/computed tomography (SPECT/CT) (scan range includes neck and chest, and if necessary, additional scans should be performed for suspected positive areas in WBS) was performed on the fifth day following 131I administration using dual-head SPECT-CT scanners (GE Discovery NM/CT670). Image analysis was conducted by two nuclear medicine physicians with a minimum of 10 years of diagnostic experience. Discrepancies were resolved by consensus. The 131I-avid foci identified in WBS were further assessed on SPECT/CT, and categorized based on intensity, ___location, and nature. Only 131I-avid lymph nodes and/or distant metastases were defined as 131I scan positive, otherwise as 131I scan negative.

Preparation of digital WSIs and image annotation

Specimens were fixed in a 3.7% neutral formalin, dehydrated, and embedded in paraffin. Each section was approximately 4 μm thick. Slides were scanned using the Digital Pathology scanner (KF-PRO-400, KFBIO, China) to obtain the WSIs. Select the open-source software Qupath version 0.3.2 for annotation12. Two experienced pathologists manually annotated regions of interest (ROIs) within WSIs, including tumors, stroma, and calcifications.

Data processing

WSIs were segmented into non-overlapping 512 × 512 patches at 0.5 μm/patch resolution. H&E-stained tissue was separated from the background by Otsu’s method13. Macenko’s method was applied for color normalization14. Z-score normalization standardized the RGB intensity distribution, which served as model input. During the training phase, we utilized online data augmentation methods such as random cropping, horizontal flipping, and vertical flipping. For test patches, only normalization was performed.

Multiple instance learning-based deep learning pipeline

The deep learning pipeline consists of two parts: patch-level prediction and WSI-level prediction. During patch-level prediction, each patch was assigned a binary label (0 or 1) based on the patient’s diagnostic status (positive or negative). Several well-known networks, including ResNet-18, ResNet-50, ResNet-101, Inception-v3, VGG-19, and DenseNet-121, were evaluated for performance optimization. Labels and corresponding probabilities were assigned to all patches. Subsequently, we utilized the Patch Likelihood Histogram (PLH) pipeline and the Bag of Words (BoW) pipeline to consolidate the patch likelihoods, leading to the WSI-level prediction. During the model training, we harness transfer learning by initializing the model with pre-trained weights from the ImageNet dataset to faster learning and improved prediction accuracy.

Feature selection and pathology model construction

Only one feature was retained from each pair with a Pearson’s correlation coefficient greater than 0.9. We further refined the features using lasso regression. Then, six machine learning algorithms were utilized, including logistic regression (LR), support vector machine (SVM), random forest, extratrees, light gradient boosting machine (LightGBM), and multilayer perceptron (MLP), to develop our models. WSI-level area under the receiver operator curve (ROCAUC) serves as the performance metric. A 5-fold cross-validation method was employed in the training cohort and combined with a grid search algorithm to optimize the hyperparameters of the model.

Clinicopathological data

Clinicopathologic data identified as potential predictors include the following: age, sex, histologic type, T-stage, N-stage, presence or absence of capsular invasion, presence or absence of vascular invasion, presence or absence of nerve invasion, risk of ATA recurrence stratification, serum TSH, sTg, and TgAb levels, and 24-h 131I uptake (RAIU).

Model building and performance comparison

Univariate analysis identified covariates associated with metastasis. Factors with P < 0.05 were subjected to multivariate logistic regression to construct a clinical model. Subsequently, we further developed an integrated model with pathomics features and clinicopathology features. Calibration curves and receiver operating characteristic (ROC) were used to assess the calibration of three models. The model construction workflow is shown in Fig. 2.

Fig. 2
figure 2

The workflow of the models’ development (Created by Figdraw). A whole slide image (WSI) of each patient was acquired, annotated, and segmented into patches. The patches were then processed by various convolutional neural networks (CNNs). DenseNet-121 demonstrating the best performance was selected for patch-level prediction. The PLH pipeline and BoW pipeline were used to summarize patch likelihoods. Multiple machine learning approaches were employed to develop a WSI-level prediction model. The clinical model was developed based on important clinicopathological data. The integrated model was developed with pathomics features and clinicopathology features. The performance of different models was evaluated in a test dataset.

Statistical analysis

Shapiro-Wilks test was used to assess the normality of the distribution of clinical characteristics in these cohorts. The Student’s t-test or Mann–Whitney U test was used for continuous variables and the Chi-square or Fisher’s exact test for categorical variables. Statistics were conducted using two-sided tests, and statistical significance was defined as P < 0.05. Youden Index was used to estimate the optimal cutoff points of the ROC curves. The area under the curve (AUC) of the different models was compared using the Delong test. Correlation analysis was conducted using Spearman’s rank correlation coefficients.

Results

Clinicopathological characteristics

Clinicopathological characteristics and comparisons of 131I scan results (negative vs. positive) between training and test cohorts in 229 eligible patients were described in Table 1. In the training cohort, 21.7% (35/161) of patients had metastatic disease: 30 with cervical lymph node metastases, 2 with lung metastases, and 3 with cervical lymph node and lung metastases. In the test cohort, 14.7% (10/68) of patients had metastatic disease: 8 with cervical lymph node metastases, 1 with lung metastases, and 1 with cervical lymph node and lung metastases. In the training cohort, significant disparities in sTg (P = 0.001) and RAIU (P = 0.018) were observed between patients with positive and negative 131I scan results. No statistically significant differences were observed in the test cohort.

Table 1 Clinicopathological characteristics and comparisons of 131I scan results (negative vs. positive) between training and test cohorts in 229 PTC patients.

Development and performance evaluation of pathomics-based model

At the patch-level, ROC curves were utilized to compare and evaluate six CNN-based models (Fig. 3a and Table S1). DenseNet-121 showed the best performance, with AUC values of 0.973 (95% CI 0.972–0.974) for the training cohort and 0.649 (95% CI 0.644–0.654) for the test cohort. To elucidate the decision-making process of CNNs, the gradient-weighted class activation mapping (Grad-CAM) method was employed, generating attention heatmaps that highlight important regions for the model prediction (Fig. 3b).

Fig. 3
figure 3

Model evaluation and visualization. (a) The difference between various deep learning models in the test cohort; (b) Visualization of features extracted from DenseNet-121 model using the Grad-CAM method; (c) The results of downscaling the model data visualized by the t-SNE algorithm; (d) The final features selected for the pathology model; (e) WSI-level AUCs for various machine learning algorithms in the test cohort; (f) Heatmap for WSI-level prediction phase.

A total of 26 features were integrated through multi-instance learning, yielding 11 probabilistic features and 2 predictive label features per process. Feature selection was performed based on Pearson’s correlation coefficient, retaining only one feature from a pair of highly correlated features if the correlation coefficient exceeded 0.9. To effectively visualize these constructed features, the t-Distributed Stochastic Neighbor Embedding (t-SNE) algorithm was utilized, as shown in Fig. 3c. These features underwent further selection and optimization using a lasso regression model, culminating in the retention of 5 features, including a histogram probabilistic feature “prob-0.2”, three TF-IDF probabilistic features “prob01”, “prob05”, “prob06”, and a TF-IDF predictive feature “pred1” with the highest feature weight (Fig. 3d). The 5 features were retained for subsequent analysis.

The performance metrics of six machine learning models were described in Fig. 3e; Table 2. In the test cohort, the AUC values of these classifiers range from 0.613 to 0.805. The LR classifier demonstrated the highest performance with an AUC of 0.805 (95% CI 0.660–0.951), and the sensitivity, specificity, positive predictive value (PPV) and negative predictive value (NPV) were 0.600, 0.776, 0.316 and 0.918, respectively. Consequently, we selected the LR model as the classification algorithm based on its superior AUC. The probability heatmap of the pathology model is depicted in Fig. 3f.

Table 2 The WSI-level performances for predicting the presence of 131I-avid metastases in 131I scan.

Integration with clinicopathological characteristics

In univariate logistic regression analysis, lymph node and/or lung metastases in post-therapeutic 131I scan was significantly correlated with ≥ 55 years (P = 0.039) and sTg (P = 0.007) (Table S2). The multivariate logistic regression analysis identified that only sTg (P = 0.015) was associated with lymph node and/or lung metastases, which was subsequently used to construct the clinical model (Table S3). The AUC of the clinical model was 0.652 (95% CI 0.569–0.735) in the training cohort and 0.548 (95% CI 0.386–0.711) in the test cohort (Table 3). To streamline the model, the optimal critical value of sTg was determined based on the Youden index, resulting in a sTg cutoff value of 3.50 in the training cohort, which was subsequently incorporated into the combined model.

Table 3 The performance of models.

Model performance

Table 3 summarises the model performance results. The clinical model showed unsatisfactory predictive performance in the test cohort with an AUC of 0.548 (95% CI 0.386–0.711). The pathology model demonstrated commendable performance with an AUC of 0.805 (95% CI 0.660–0.951). The combined model, despite exceptional performance in the training cohort (AUC of 0.973, 95% CI 0.939–1.000), showed diminished efficacy in the test cohort (AUC of 0.790, 95% CI 0.631–0.948).

The Hosmer-Lemeshow goodness-of-fit test indicated a robust fit for the pathology model in both the training and test cohorts, with P-values of 0.898 and 0.532, respectively (Table S4). Statistical analyses revealed that the pathology model, as well as the combined model, exhibited significantly enhanced performance when compared with the clinical model, with P-values of 0.026 and 0.019, respectively (Figure S1). No significant difference was found between the pathology and combined models (P = 0.645).

The calibration performance of the pathology model was notably good (Fig. 4a,b). The decision curve analysis (Fig. 4c) demonstrated that the pathology model had a good net clinical benefit. Notably, both the pathology and combined models delivered superior net benefits compared to the clinical model.

Fig. 4
figure 4

ROC curves (a), calibration curve analysis (b), and decision curves analysis (c) in the test cohort.

Interpretation of the DenseNet-121 classification

To visualize and interpret the relative importance of different regions in our classification task, Fig. 3b and f exhibited representative class activation maps (CAMs) alongside their corresponding patches and WSIs, respectively. The CAMs highlighted regions that contributed most to the classification task in red. These highlighted regions were then evaluated and interpreted from a pathologist’s perspective. As illustrated in Fig. 3b, when distinguishing between metastatic and non-metastatic cases, our model captured areas with higher microvessel density (MVD). This finding was corroborated in Fig. 3f, where the WSI in the first row, featuring a tumor region with higher MVD, was predicted as metastatic. Conversely, the WSI in the second row, characterized by lower MVD among tissues, was predicted as non-metastatic.

Discussion

Accurate postoperative assessment is critical for 131I therapy in patients with PTC, forming the foundation for individualized treatment and minimizing risks associated with under- or over-dosing. This study developed a digital pathology image-based model that predicts lymph node and/or distant metastases on initial post-therapeutic 131I scan in PTC patients following near/total thyroidectomy. Our results demonstrated that the pathology model outperforms the clinical model, indicating its potential to provide reliable information for postoperative assessment and guide treatment decisions.

In our multivariate logistic regression, sTg was the only statistically significant predictor, and sTg above 3.5 mg/L was associated with a higher risk of metastasis after surgery. This result aligns with previous studies. Ciappuccini et al. found that sTg > 28 ng/ml is the best cutoff value to predict the persistent/recurrent disease in patients with DTC15. Prabhu et al. found that ps-Tg < 1 ng/ml was able to exclude PTC metastatic disease when TgAb was negative, while ps-Tg > 5 ng/ml was associated with the risk of lymph node or distant metastasis16. However, the predictive value of postoperative Tg is highly controversial due to various influencing factors, such as residual thyroid, TSH levels, and the time interval between surgery and 131I therapy6,17,18. Grünwald et al. hypothesized that the ratio of serum Tg to 131I uptake (Tg/131I uptake) in the thyroid bed can be used to normalize Tg values even in the presence of residual thyroid tissue19. They demonstrated that the Tg/131I uptake ratios might be used to identify patients at high risk of recurrence or metastasis. Giovanella et al. normalized Tg values by integrating TSH levels and residual thyroid estimates (RAIU), and culminated in a decision tree model which achieved AUCs ranging from 0.610 to 0.708 across different centers for the identification of persistent/metastatic DTC by incorporating N stages and Tg values20. It is worth noting that the optimal threshold for postoperative Tg levels varies considerably among data sets due to variability of patients and data collection processes. Although various approaches have been taken during recent years, there is still no universally accepted and reliable conclusion. Therefore, it is crucial to find new reliable biomarkers for postoperative evaluation of DTC patients.

Previous studies have demonstrated the utility of morphometric parameters as diagnostic and prognostic tools owing to their objectivity and quantifiability11,21,22,23,24. It has been shown that the longest diameters and largest areas of metastatic foci in lymph nodes are associated with PFS of lateral cervical lymph node recurrence after surgery in PTC patients11. In particular, when the longest diameter exceeded 8.0 mm and the largest area was greater than 35.6 mm2, patients had a significantly poorer prognosis. Similarly, in a study of 305 patients, Bardet et al. have shown that PTC patients with initial microscopic lymph node involvement (less than 10 mm) have an intermediate outcome between N0–Nx patients and macroscopic N1 patients (greater than 10 mm)21. These critical thresholds offer valuable guidance for clinical decision-making. However, the utility of morphometrics is constrained by its predefined feature extraction protocols and therefore may fall short in capturing the subtleties and latent information embedded within pathological images. In contrast, deep learning-based pathomics leverages advanced techniques to automatically learn and extract deep features from pathology images, thereby introducing new avenues for pathology research22. Anand et al. trained a deep neural network using weakly supervised learning techniques to predict BRAF V600E mutation status9. The model demonstrated excellent predictive performance in a large independent cohort of 444 patients with an AUC of 0.98 (95% CI 0.97–1.00). Tsou et al. developed a convolutional neural network model to classify PTCs into BRAFV600E or RAS mutations using digital pathology images from The Cancer Genome Atlas (TCGA) database10. The model achieved an AUC of 0.878 in the test cohort. In our study, image features are extracted based on the DenseNet-121 network, and patch-level predictions are aggregated into WSI-level predictions using PLH and BoW pipelines. Notably, our model demonstrated promising performance in the test cohort, achieving an area under the curve (AUC) of 0.805 (95% CI  0.660–0.951). Although an AUC of 0.805 is not exceptionally high, we have observed that the AUC for advanced tasks such as predicting recurrence using pathomics generally falls within the range of 0.75 to 0.85, which aligns with our findings25,26. Additionally, based on the visualization of CAMs, we find that the pathology model accurately captured areas with higher MVD, which contributed significantly to the prediction of metastasis. This finding is consistent with previous research and demonstrats the interpretability of our digital pathology model27,28.

In this study, two critical factors were instrumental in constructing the pathology model. The first factor involves the integration of patch-level and WSI-level predictions. We decomposed the WSI into multiple patches and extracted local pathologic features for each patch. Subsequently, the patch-level predictions were aggregated through two pipelines, PLH and Bow, to achieve a comprehensive prediction of the whole WSI. A similar approach was employed by Cao et al., who introduced an ensemble multi-instance deep learning model for predicting microsatellite status based on digital pathology images, achieving an AUC of 0.8504 (95% CI 0.7591–0.9323) in an external validation set of Asian colorectal cancer patients29. The second factor involves the exploration of various neural network structures in patch-level prediction to identify the most suitable model for this study. In our study, DenseNet-121 emerged as a focal point owing to its distinctive dense connectivity property30. DenseNet-121 enhances feature reuse and transfer by directly connecting the output of each layer to all subsequent layers, thereby effectively mitigating the problem of vanishing gradients and improving the richness of the feature representation. In contrast to certain prior studies, which predominantly focused on a singular or limited array of neural network structures-for instance, Tsou et al. exclusively utilized the Inception-v3 network for feature extraction, and another study employed Resnet-101 as the feature extractor10,31. These studies failed to comprehensively assess the performance of multiple neural networks in pathology image analysis tasks, and thus may not be able to identify the optimal model.

Our study initially included 404 patients, which was reduced to 229 after stringent exclusion criteria were applied. Despite the modest sample size, we successfully demonstrated that the pathology model surpassed the clinical model in predictive accuracy. We initially hypothesized that models combining pathological and clinical features would yield superior predictive capabilities. However, the results were suboptimal, as the fusion model exhibited slightly lower AUC values compared to the pathology model. This discrepancy might be attributed to the inferior performance of the clinical model, which solely relied on sTg values and failed to provide sufficient predictive enhancement for the combined model. Similar findings have been reported in previous study32. Therefore, we selected the pathology model as our final model. This model is not affected by sTg values and has demonstrated excellent predictive performance.

This study has several limitations. Firstly, its retrospective design is susceptible to selection bias and information bias. Secondly, as a single-center study, despite employing internal cross-validation to mitigate overfitting risks, the predictive performance of the pathology model still warrants further validation in a larger and more diverse patient cohort. Thirdly, our model is inapplicable to patients with TgAb-positive, and future research should consider this patient subset. Fourthly, our study did not incorporate molecular data including BRAF, RAS, or TERT promoter mutations. Fifthly, the postoperative clinical model is based on TSH-stimulated Tg, rather than TSH-suppressed Tg. Lastly, we only investigated 131I avid foci in this study. However, non-radioiodine avid foci remain a significant challenge in radioiodine refractory patients, who might have poor treatment outcomes and prognosis.

In summary, the integration of pathomics into the clinical workflow holds promise for enhancing the precision of postoperative assessments, thereby facilitating more informed and personalized treatment strategies for patients with DTC. The development and validation of our deep learning-based pathology model represent a significant step forward in this endeavor, with the potential to improve patient outcomes through more accurate prediction of disease states.