Abstract
Whole-body bone scan (WBS) is widely used as an effective method for the early and comprehensive diagnosis of bone metastases in breast cancer. WBS images of breast cancer bone metastasis are characterized by low resolution, small foregrounds, and multiple lesions, which hinders the widespread application of deep learning-based models. Automatically detecting a large number of densely distributed small lesions on low-resolution WBS images remains a challenge. We aim to develop a unified framework for detecting multiple dense bone metastases from low-resolution WBS images. We propose a novel unified detection framework for multiple bone metastases in WBS images. Considering the difficulty of feature extraction caused by low resolution and multiple lesions, we propose a plug-and-play position auxiliary extraction module and a feature fusion module to enhance global information extraction. To accurately detect small metastases in WBS, we designed a self-attention, transformer-based target detection head. This retrospective study included 512 patients with breast cancer bone metastases from Peking Union Medical College Hospital. The data are whole-body bone scan images. The training, validation, and test sets were split in a ratio of about 6:2:2. The benchmarks are four representative baselines: SSD, YOLOR, Faster_RCNN_R, and Scaled-YOLOv4. The performance metrics are Average Precision (AP), precision, and recall. Detection results were assessed using the Bonferroni-adjusted Wilcoxon rank test, with the significance level adjusted according to the number of multiple comparisons. We conducted extensive experiments and ablation studies on a private dataset of breast cancer WBS images and a public dataset of bone scans from West China Hospital to validate the effectiveness and generalization of our method. First, compared with different network architectures, our method obtained an AP of 55.0 ± 6.4% (95% confidence interval (CI) 49.9–60.1%, \(p<0.05\)), an improvement of 45.2 percentage points over the SSD baseline with AP 9.8 ± 2% (95% CI 8.1–11.4%). For recall, our method achieved an average of 54.3 ± 4.2% (95% CI 50.9–57.6%, \(p<0.05\)), an improvement of 49.01 percentage points over the SSD model with 5.2 ± 12.7% (95% CI 10–21.3%). Second, we conducted ablation studies. On the private dataset, adding the detection head module and the position auxiliary extraction module increased AP by 14.3 percentage points (from 33.3 ± 2% to 47.6 ± 4.4%) and 19.3 percentage points (from 33.3 ± 2% to 52.6 ± 6.1%), respectively. In addition, the generalization of the method was verified on the public dataset BS-80K from West China Hospital. Extensive experimental results demonstrate the superiority and effectiveness of our method. To the best of our knowledge, our work is the first attempt to develop an automatic detector that considers the unique characteristics of low resolution, small foregrounds, and multiple lesions in breast cancer WBS images. Our framework is tailored for whole-body WBS and can be used as a clinical decision support tool for early decision-making in breast cancer bone metastases.
Introduction
Breast cancer is a major health problem for women1,2, and its incidence is increasing. According to the World Health Organization's International Agency for Research on Cancer (IARC), the number of new breast cancer cases worldwide in 2020 reached 2.26 million, making it "the number one cancer in the world"3. Breast cancer is characterized by high heterogeneity and a propensity for recurrence and metastasis4, and bone is one of the most common sites of metastasis5,6. According to an earlier study, 70% of breast cancer patients developed bone metastasis in the late stage of cancer7. Early detection of bone metastasis is important not only for decreasing morbidity but also for disease staging, outcome prediction, and treatment planning.
Common imaging methods include whole-body bone scan (WBS, or bone scintigraphy), computed tomography (CT), magnetic resonance imaging (MRI), and single-photon emission computed tomography (SPECT)8. WBS is widely accepted as the standard method for investigating the existence and extent of bone metastasis9. Compared with other modalities, WBS is inexpensive, provides whole-body evidence of metastasis with low-dose radiation, and has high accuracy in diagnosing bone metastasis. Radionuclides are used in WBS to detect sites of abnormal metabolism in bone tissue. Due to the increased activity of osteoblasts at metastatic sites, tracer aggregation can be seen in bone scan images10,11. These abnormalities in WBS images are called hot spots and generally appear as higher-intensity signals than their surroundings, which are the key factors for diagnosing bone metastasis12. Although WBS is effective for diagnosing bone metastasis, image analysis remains a difficult and subjective task that requires extensive experience. The recognition results rely heavily on subjective, error-prone interpretation, with large intra-observer and inter-observer variability. An automated lesion detection system for breast cancer bone metastases is therefore desired.
For the processing of WBS, artificial intelligence (AI), and especially deep learning, has become a new research focus. Some researchers address the task of classifying bone scan data. Papandrianos et al.13 built a Convolutional Neural Network (CNN) model to perform the first three-class classification (malignant, normal, degenerative) of bone scan images of prostate cancer patients. Hajianfar et al.14 applied ten different CNN networks to distinguish normal cases from abnormal cases and non-neoplastic disease cases from malignant bone diseases. Other researchers treat the diagnosis of bone metastases as a segmentation task. For example, Shimizu et al.15 designed a butterfly network (BtrflyNet) based on two U-nets for bone segmentation and hot spot extraction of bone metastases. Furthermore, considering that hot spots often show left-right asymmetry, Saito et al.16 extracted hot spots with ensembled butterfly-type networks from a pair of anterior and posterior images. However, these tasks require large amounts of pixel-level annotations, which are time-consuming to produce.
Lesion detection is very useful for helping physicians locate lesion areas and save time in analyzing WBS images, but the related research is still in its infancy. Some researchers directly applied object detection frameworks designed for natural images (e.g., Faster-RCNN, YOLO) to WBS lesion detection tasks17. Subsequent works made targeted improvements. For example, when applying YOLOv4 to detect metastasis lesions from prostate cancer, Li et al.18 proposed negative sample mining to reduce false positives: they trained the detection model with non-metastasis WBS images to obtain false positive targets, and then fine-tuned the model with both true and false positive targets. Cheng et al.19 designed a body part detection network using Faster-RCNN, and then fed the detected sternum images into a YOLOv3-based lesion localization model, narrowing the detection range, reducing interference, and enhancing localization.
In summary, for the task of lesion detection, many current works directly apply traditional detection models to bone metastasis in WBS, lacking specific analysis and consideration of WBS image characteristics. We found that WBS images have very low resolution and a large, dense distribution of lesions, many with small foregrounds. In our private dataset, each patient has an average of 10.4 lesions, and the average lesion area per patient is 0.25% of the body area. Figure 1 shows our WBS images containing dense, small hot spots. These problems seriously affect the detection and localization of metastasis sites in WBS images. In preliminary experiments, we also found that directly applying common object detection methods to lesion detection in WBS achieves only very low accuracy17,20. Furthermore, as shown in the first line of Table 2, directly using the SSD object detection method without any improvements yields only an AP of 9.8 ± 2% and a recall of 5.2 ± 12.7%. Automatically detecting a large number of densely distributed small lesions on low-resolution WBS images is therefore challenging.
We first searched the Web of Science database up to April 2024, without language restrictions, using the queries bone metastasis, deep learning, and bone scan, and collected a total of 103 results. We then searched the PubMed database under the same conditions and collected 63 results. We systematically reviewed the literature and identified 34 original studies that used deep learning techniques to analyze bone scan images. Of these, 14 used classification models, 9 used segmentation techniques to distinguish metastases in bone scan images, and 5 used object detection techniques to mark metastases; the remaining 6 combined the above techniques on bone scan images to calculate the patient's bone scan index for systematic evaluation. We found no study investigating an automatic detector that considers the unique characteristics of WBS images.
To address these problems, we propose the Bone Metastases Detector with Self-Attention and Auxiliary Information (SAAI-BMDetector), a novel unified framework to detect multiple bone metastases in WBS images, which is our most significant contribution. To the best of our knowledge, this is the first attempt to develop an automatic detector that considers the unique characteristics of low resolution, small foreground, and multiple lesions in WBS images. This study makes two further contributions. First, considering the difficulties of feature extraction caused by low resolution and multiple lesions, we propose a multi-level feature fusion module (FF) and a plug-and-play position auxiliary extraction module (PAE) to enhance global information extraction. The PAE is an off-the-shelf module that can be applied to subsequent datasets that lack pixel-level labels. Second, to accurately detect small metastases, we designed a self-attention, transformer-based target detection head, which achieves higher accuracy than other automated object detection methods. Finally, we conduct extensive experiments and discussions on a private breast cancer bone metastasis WBS dataset and a publicly available bone scan dataset. Compared with the baselines, our SAAI-BMDetector performs well and achieves state-of-the-art performance, and extensive ablation studies clearly show the effectiveness of its different components. Our framework is tailored for whole-body WBS and can be used as a clinical decision support tool for early decision-making in breast cancer bone metastases. The experimental results on the public dataset also demonstrate the robustness and generalization of the method. The major contributions of our work are as follows:
(1) We propose a novel unified automatic lesion detection framework for breast cancer bone metastasis based on WBS images. To our knowledge, this is the first work considering the unique characteristics of low resolution, small foreground, and multiple lesions of breast cancer WBS images.
(2) Considering the difficulties of feature extraction caused by low resolution and multiple lesions, we designed a multi-level feature fusion module and a plug-and-play position auxiliary extraction module. For densely distributed small lesions in WBS, we also provide a self-attention, transformer-based target detection head to help localize small metastases.
(3) Compared with other state-of-the-art baselines (e.g., SSD), our SAAI-BMDetector improves AP by 45.2 percentage points and recall by 49.01 percentage points, which verifies the superiority and effectiveness of our method. We also verified its robustness on a public WBS dataset.
Materials and methods
Datasets
Private dataset
In this study, 512 imaging studies were retrospectively collected from Peking Union Medical College Hospital (PUMCH). This is a retrospective analysis of image data. All experimental protocols were approved by the Ethics Review Committee of Peking Union Medical College Hospital, Chinese Academy of Medical Sciences (IRB: I-23PJ1889). Informed consent was obtained from all study participants. All methods were performed in accordance with the relevant guidelines and regulations. Scans were acquired using four different scanners: Discovery NM 630 and Infinia from GE Healthcare, Precedence from Philips, and E.Cam from Siemens. Whole-body bone scans were performed 3–4 hours after intravenous injection of 740–925 MBq (20–25 mCi) of 99mTc-MDP. The scans were conducted on a dual-head gamma camera equipped with low-energy parallel-hole high-resolution collimators. The energy acquisition window was centered at 140 keV with a 20% window. Scanning was performed at a velocity of 15–20 cm/min, with a matrix size of 256 × 1024. The age of subjects in the private dataset is 53.03 ± 11.63 years.
99mTc-MDP is a radiopharmaceutical injected into a patient's vein that can enter bone cells and precipitate with mineral components. As a result, 99mTc-MDP tends to accumulate in areas of active bone formation, producing increased local radiopharmaceutical activity that manifests as a hot spot on WBS and allows physicians to identify bone metastases. Images in the private dataset contain two types of annotations: (1) object detection bounding box annotations and (2) pixel-level annotations. Both were manually labeled, and only hot spots identified as bone metastases were annotated. That is, only sites identified as breast cancer metastatic lesions were labeled, and our study focuses on the detection of malignant lesions without benign-malignant differentiation. The annotations came from two experts, one with 3 years of experience and the other with 10 years of experience. Annotation examples are shown in Fig. 2: the raw WBS image, the WBS image with target detection box annotations (the green box indicates the ___location and size of the bone metastasis lesion), and the WBS image with pixel-level annotations (the black mask shows the lesion shape). The data are two-dimensional (2D) WBS images. The training, validation and testing datasets are split 6:2:2.
Public dataset
A publicly available dataset, BS-80K20, was also used in the experiments. BS-80K contains 5479 WBS images of patients with bone metastasis from West China Hospital. All data are stored as grayscale images in JPEG format, also containing anterior and posterior views (Fig. 3). The differences between BS-80K and our private dataset are: (1) bone metastasis lesions in BS-80K are not only from breast cancer but also from other types of cancer; (2) BS-80K only provides object detection box annotations, labeled as true lesions (Abnormal) and non-bone-metastatic lesions (Normal). The data are 2D WBS images. The training, validation and testing datasets are split 6:2:2.
Preprocessing
Some pre-processing steps are applied before feeding WBS images into the detector. First, we expand the dataset by dividing each whole-body image into four sub-images. Second, to ensure consistent input, we use the Letterbox resizing method21 to scale each original image to \(640 \times 640\). Third, the edges are filled with gray to reach the fixed input size without distorting the skeletal structure. Fourth, hybrid augmentation techniques such as Mosaic and MixUp22 are used to enrich sample diversity. In this way, the model can adapt to variations of skeletal features under different backgrounds and conditions.
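To make the resizing step concrete, the sketch below shows a minimal Letterbox-style resize-and-pad in Python. The function name and the gray fill value are illustrative assumptions; YOLOv5's own implementation additionally rescales the annotation boxes to the padded coordinates.

```python
import cv2
import numpy as np

def letterbox(img: np.ndarray, new_size: int = 640, fill: int = 114) -> np.ndarray:
    """Resize to new_size x new_size while preserving aspect ratio.

    The image is scaled so its longer side equals new_size, then the
    remaining borders are padded with a constant gray value so the
    skeletal structure is not distorted.
    """
    h, w = img.shape[:2]
    scale = new_size / max(h, w)  # scale factor for the longer side
    resized = cv2.resize(img, (int(round(w * scale)), int(round(h * scale))))
    pad_h, pad_w = new_size - resized.shape[0], new_size - resized.shape[1]
    top, left = pad_h // 2, pad_w // 2  # split padding between opposite edges
    return cv2.copyMakeBorder(resized, top, pad_h - top, left, pad_w - left,
                              cv2.BORDER_CONSTANT, value=fill)
```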
Model overview
We propose the SAAI-BMDetector framework to address the bone metastasis lesion detection task, which is illustrated in Fig. 4. Our model can be divided into four parts: the Main Feature Extraction module (MFE module), Position Auxiliary Extraction module (PAE module), Feature Fusion module (FF module), and Detection Head module (DH module).
The MFE module uses a backbone based on YOLOv521 to extract general features from the input images. In view of the difficulty of feature extraction from low-resolution WBS images, we design the PAE module and FF module. The PAE module is a network independent of the main framework and is used to extract semantic features from the input images. We calculate the pixel-wise entropy of the image from its prediction results, thereby providing auxiliary features that are difficult to extract in the MFE module. In the FF module, we select three locations to fuse three sequences of features of different sizes provided by the PAE module. In the down-sampling stage, inspired by the transformer's attention mechanism, we place three ST_Encoder blocks to enhance global information extraction, and a T_Decoder that receives the information entropy as a query. To address the small lesion sizes in WBS images, we add a detection head for small targets in the DH module to improve model performance.
Model overview. It consists of four modules. The main feature extraction module (MFE module) and the position auxiliary extraction module (PAE module) simultaneously extract features from the input WBS image. The features extracted by MFE are the main feature source of the model; the auxiliary features and entropy obtained from PAE are the sources of lesion texture and ___location information. The main and auxiliary features are merged in the feature fusion module (FF module). The fused features are first analyzed for correlation by the Swin Transformer Encoder (ST_Encoder), and then the ___location information contained in the entropy is fused by the Transformer Decoder (T_Decoder). Finally, features of different sizes enter the detection head module (DH module) for detection. The largest feature maps, containing the most detail, are fed into the added small-target detection head to capture more small lesions.
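The composition of the four modules can be summarized in a short PyTorch sketch. The class and attribute names below are hypothetical placeholders for the components described above; only the data flow (frozen PAE, fusion of main and auxiliary features, multi-scale detection heads) follows the design in Fig. 4.

```python
import torch
import torch.nn as nn

class SAAIBMDetector(nn.Module):
    """Schematic data flow of the four modules (names are placeholders)."""

    def __init__(self, mfe: nn.Module, pae: nn.Module, ff: nn.Module, dh: nn.Module):
        super().__init__()
        self.mfe, self.pae, self.ff, self.dh = mfe, pae, ff, dh
        for p in self.pae.parameters():
            p.requires_grad = False  # PAE is trained separately on pixel-level labels, then frozen

    def forward(self, x: torch.Tensor):
        main_feats = self.mfe(x)                         # multi-scale main features (backbone)
        aux_feats, entropy = self.pae(x)                 # auxiliary features + pixel-wise entropy
        fused = self.ff(main_feats, aux_feats, entropy)  # ST_Encoders + T_Decoder fusion
        return self.dh(fused)                            # four detection heads, incl. small-target head
```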
Main feature extraction module (MFE module)
The MFE module uses CSPDarknet53 connected to an SPPF pooling layer as the backbone, similar to YOLOv521. CSPDarknet53 maintains high detection accuracy while removing redundant gradients to improve detection efficiency. SPPF captures features from feature maps at different scales.
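For reference, a simplified version of the SPPF layer is sketched below; the real YOLOv5 block also applies batch normalization and a SiLU activation after each convolution.

```python
import torch
import torch.nn as nn

class SPPF(nn.Module):
    """Spatial pyramid pooling - fast, as in YOLOv5 (simplified sketch).

    Three chained 5x5 max-pools emulate 5/9/13 pooling kernels; the
    concatenated outputs capture context at several receptive fields.
    """

    def __init__(self, c_in: int, c_out: int, k: int = 5):
        super().__init__()
        c_hidden = c_in // 2
        self.cv1 = nn.Conv2d(c_in, c_hidden, 1, 1)
        self.cv2 = nn.Conv2d(c_hidden * 4, c_out, 1, 1)
        self.m = nn.MaxPool2d(kernel_size=k, stride=1, padding=k // 2)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = self.cv1(x)
        y1 = self.m(x)
        y2 = self.m(y1)
        return self.cv2(torch.cat([x, y1, y2, self.m(y2)], dim=1))
```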
Position auxiliary extraction module (PAE module)
To adapt to the low resolution, multiple lesions, and difficult feature extraction of WBS images, we design the PAE module. It is an off-the-shelf module that provides coarse auxiliary features for lesion detection. Unlike the other modules, the weights of the PAE module are obtained by training on pixel-level annotated data and fixed thereafter. For an input WBS image, the module produces two sets of outputs: auxiliary features (containing both high-level semantic features and shallow texture features) and entropy (providing lesion ___location information).
Inspired by UNet23, this module consists of an encoder and a decoder block. The encoder uses two types of operations, convolution and max pooling, to extract initial features and gradually enlarge the receptive field. The decoder performs up-sampling while fusing shallow features collected from the encoder. To use both low-level and high-level features, the feature sequences obtained in the up-sampling stage of the decoder block are selected as the first output set of the PAE module. To facilitate feature fusion, we select three sequences whose feature map dimensions are similar to those of the receiving sites; here, different sequences refer to feature maps of different dimensions. These three sequences contain both low-level texture features and high-level global features.
To accurately localize lesions, we apply an entropy calculation to refine the predicted mask. By computing the entropy of the predicted mask, the PAE module provides accurate lesion ___location information. As the visualization in Fig. 5a shows, binary entropy effectively outlines the lesion boundary. These features containing lesion locations serve as the second output set of the PAE module and are also provided to the FF module for fusion.
(a) Pixel-level mask and entropy. (b) Architecture of ST_Encoder. (MLP = multi-layer perceptron, LN = LayerNorm). W-MSA and SW-MSA denote multi-head self-attention modules with regular and shifted windowing configurations, respectively. (c) Architecture of T_Decoder. (MHA = multi-head attention, FFN = feed forward network). This block treats entropy as an object query. After converting entropy into a suitable vector format, T_Decoder feeds the processed query into the MHA block together with the origin features as key and value to perform cross-attention.
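The binary entropy computation itself is straightforward; a minimal sketch is given below. Entropy peaks where the predicted probability is closest to 0.5, i.e., at uncertain boundary pixels, which is why it outlines lesions as in Fig. 5a.

```python
import torch

def binary_entropy(prob: torch.Tensor, eps: float = 1e-7) -> torch.Tensor:
    """Pixel-wise binary entropy of a predicted lesion-probability map.

    prob: tensor of per-pixel foreground probabilities in [0, 1].
    Returns a map that is largest where the mask prediction is most
    uncertain (p ~ 0.5), outlining lesion boundaries.
    """
    p = prob.clamp(eps, 1.0 - eps)  # avoid log(0)
    return -(p * p.log() + (1 - p) * (1 - p).log())
```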
Feature fusion module (FF module)
After obtaining the three feature sequences from the MFE and PAE modules, we improve the backbone for feature fusion and correlation extraction. First, we select three sites with different feature sizes to receive and fuse the multi-level features obtained by the PAE module. Second, we use three ST_Encoder (Swin Transformer Encoder) blocks to introduce attention mechanisms that improve the model's perception of global features. Third, we add a T_Decoder (Transformer Decoder) block to receive and fuse the ___location features gathered from the PAE module, improving localization.
(1) ST_Encoder: object correlation extraction
Breast cancer bone metastatic lesions have characteristic distributions. For example, lesions are concentrated in the spine, sternum and ribs, and patients often have multiple lesions. It is therefore necessary to extract correlation information between foreground and background and among different foreground objects. Inspired by TPH-YOLOv524, we add three ST_Encoders in the down-sampling stage to explore the relationships between objects through the self-attention mechanism, making better use of the information contained in the feature sequences. Inspired by the Swin Transformer25, we use a self-attention mechanism with sliding-window operations to quickly extract relevant information between pixels. As shown in Fig. 5b, after layer normalization, the ST_Encoder first executes in-window self-attention to capture semantic relationships between pixels in the same window. Then, a nonlinear transformation of the token vectors is applied through the MLP layer. Finally, the module exchanges information between windows through window shifting and repeats the above steps to complete global feature extraction.
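As an illustration, the window partition at the heart of W-MSA can be written in a few lines of PyTorch. This follows the standard Swin Transformer formulation25 rather than our exact implementation; self-attention is then computed independently within each returned window.

```python
import torch

def window_partition(x: torch.Tensor, win: int) -> torch.Tensor:
    """Split a (B, H, W, C) feature map into non-overlapping win x win windows.

    Returns (num_windows * B, win * win, C) token sequences. Shifting the
    windows by win // 2 in the next block (SW-MSA) lets information flow
    across window borders, giving global coverage over two blocks.
    """
    b, h, w, c = x.shape
    x = x.view(b, h // win, win, w // win, win, c)
    return x.permute(0, 1, 3, 2, 4, 5).reshape(-1, win * win, c)
```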
(2) T_Decoder: lesion ___location feature fusion
The T_Decoder (Transformer Decoder) is added to the shallowest target detection layer to fuse the features contained in the entropy gathered from the PAE module, as shown in Fig. 5c. We use the entropy as the query and the original features as the key and value to fuse the features extracted by the MFE and PAE modules. The lesion ___location features contained in the entropy guide the model to focus on sites that are more likely to be lesions.
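A minimal sketch of this cross-attention is given below. The projection of the one-channel entropy map to the feature dimension and the residual placement are illustrative assumptions, not the exact implementation.

```python
import torch
import torch.nn as nn

class TDecoder(nn.Module):
    """Sketch of the T_Decoder cross-attention (layer sizes are illustrative).

    The entropy map is flattened into a query sequence; the fused features
    act as key and value, steering attention toward likely lesion sites.
    """

    def __init__(self, dim: int, heads: int = 8):
        super().__init__()
        self.proj = nn.Linear(1, dim)  # lift 1-channel entropy to feature dim
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(dim, dim * 4), nn.GELU(),
                                 nn.Linear(dim * 4, dim))
        self.norm1, self.norm2 = nn.LayerNorm(dim), nn.LayerNorm(dim)

    def forward(self, feats: torch.Tensor, entropy: torch.Tensor) -> torch.Tensor:
        # feats: (B, HW, dim) flattened features; entropy: (B, HW, 1)
        q = self.proj(entropy)             # entropy acts as the object query
        x, _ = self.attn(q, feats, feats)  # cross-attention: key/value from features
        x = self.norm1(x + feats)          # residual connection back to the features
        return self.norm2(x + self.ffn(x))
```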
Detection head module (DH module)
To improve the ability to detect small lesions, we designed the DH module. Since the area of a single bone metastasis lesion in a WBS image is usually small, the detection heads of the original YOLOv5 are not well suited to this task. We therefore add a head for small-target detection and feed it with larger feature maps containing more detail to improve overall detection performance.
Training process
To make full use of both the private and public datasets, we adopt the following training strategy. First, for the MFE, FF and DH modules, we adopted the pre-trained weights provided by YOLOv5 to ensure the initial detection capability of the model. Second, we used the public dataset BS-80K to pre-train the MFE, FF and DH modules; to maintain consistency with the private dataset, we trained only with the true lesion (Abnormal) labels of the public dataset. Since the public dataset contains no pixel-level annotations, the PAE module was not trained at this stage. Instead, the PAE module was trained with the pixel-level annotations of the private dataset, after which its weights were fixed for subsequent use as the off-the-shelf module. Finally, we fine-tuned and tested the entire model on our private dataset. The overall complexity of the model comes from four main components: the backbone of the main framework, the PAE module, and the ST_Encoder and T_Decoder in the FF module. The time complexity of the backbone takes the typical form for convolutional networks. The total complexity of our model is \(O(N^2)\), where N is the spatial dimension of the feature map.
Statistic analysis
Statistical analyses in this study were performed using SPSS software (version 22.0.0, IBM SPSS Statistics, Armonk, NY, USA) and Python (version 3.6.1). The performance of deep learning object detection is presented with mean, standard deviation, median, and 95% confidence interval. The detection results were assessed using the Bonferroni-adjusted Wilcoxon rank test. All statistical tests are two-sided, and a p value less than 0.05 was considered statistically significant. Multiple comparisons were conducted between the proposed method and the baselines. The significance level \((\alpha )\) for multiple comparisons was adjusted to \(\frac{\alpha }{m}\), where m is the number of comparisons. For example, with \(\alpha = 0.05\) and four comparisons, the adjusted threshold is 0.0125. The same Bonferroni adjustment is applied in the ablation studies. Furthermore, we calculate effect sizes, estimated using Cohen's d.
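As an illustration of the testing procedure, the snippet below runs paired two-sided Wilcoxon signed-rank tests against each baseline at the Bonferroni-adjusted level. The per-run AP arrays are randomly generated stand-ins for the real experimental results, not our data.

```python
import numpy as np
from scipy.stats import wilcoxon

rng = np.random.default_rng(0)
# Hypothetical per-run AP values (20 runs each); substitute the real results.
ours = rng.normal(0.55, 0.06, 20)
baselines = {
    "SSD": rng.normal(0.10, 0.02, 20),
    "YOLOR": rng.normal(0.385, 0.02, 20),
    "Faster_RCNN_R": rng.normal(0.394, 0.03, 20),
    "Scaled-YOLOv4": rng.normal(0.427, 0.06, 20),
}

alpha, m = 0.05, len(baselines)  # Bonferroni: alpha / m = 0.0125 for four comparisons
for name, ap in baselines.items():
    stat, p = wilcoxon(ours, ap)  # paired, two-sided by default
    print(f"{name}: p={p:.4g}, significant at adjusted level: {p < alpha / m}")
```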
Experiment configuration
Implementation details
We implemented SAAI-BMDetector with PyTorch 1.11.0, using an NVIDIA RTX3090 GPU for training and testing. Our model adopts an end-to-end training strategy. On the public dataset, the model usually converged within 130 epochs; on the private dataset, the model was typically trained for 100 epochs due to the smaller amount of data. We used SGD as the optimizer. In the warm-up stage (the initial 3 epochs), one-dimensional linear interpolation was adopted to update the learning rate at each iteration; after warm-up, the cosine annealing algorithm with an initial learning rate of 0.01 was adopted. The model input was 640 × 640 images produced by Mosaic data augmentation22, each composed of four random WBS images. The output sizes of the four detection heads were 160 × 160 (output of the shallowest detection layer, used to detect the smallest targets), 80 × 80, 40 × 40, and 20 × 20 (output of the deepest detection layer, used to detect the largest targets). The hyperparameter configuration used in our experiments is shown in Table 1.
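The learning-rate schedule can be sketched as follows. The per-epoch warm-up and the final-LR fraction `lrf` are simplifications of YOLOv5's per-iteration warm-up and default hyperparameters, so the constants may differ from our exact configuration.

```python
import math

def lr_at(epoch: int, total_epochs: int = 100, warmup_epochs: int = 3,
          lr0: float = 0.01, lrf: float = 0.01) -> float:
    """Learning-rate schedule sketch: linear warm-up, then cosine annealing.

    lrf is the final learning rate expressed as a fraction of lr0.
    """
    if epoch < warmup_epochs:
        return lr0 * (epoch + 1) / warmup_epochs  # linear warm-up
    t = (epoch - warmup_epochs) / max(1, total_epochs - warmup_epochs)
    # cosine decay from lr0 down to lr0 * lrf
    return lr0 * (lrf + (1 - lrf) * 0.5 * (1 + math.cos(math.pi * t)))
```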
Evaluation metrics
Our work involves three evaluation metrics: AP, precision and recall, the latter two of which can be calculated from the confusion matrix. The confusion matrix is a square matrix of two rows and two columns, whose four cells represent TP (True Positive, the number of positive samples predicted to be positive), FN (False Negative, the number of positive samples predicted to be negative), FP (False Positive, the number of negative samples predicted to be positive), and TN (True Negative, the number of negative samples predicted to be negative).
In target detection tasks, TP is the number of true boxes matched by detection boxes whose confidence exceeds the confidence threshold and whose Intersection over Union (IoU) exceeds the IoU threshold. FP is the number of remaining detection boxes (detected, but not counted as positive because the IoU is too small or the true box was already matched). FN is the number of remaining true boxes (positive but not detected). The equations of precision and recall are as follows: \(\text{Precision} = \frac{TP}{TP+FP}\), \(\text{Recall} = \frac{TP}{TP+FN}\).
When calculating the above two metrics, the concepts of confidence threshold and IoU threshold are involved. When the IoU threshold is fixed, precision and recall will change with the change of the confidence threshold, resulting in a PR (precision-recall) curve.
The evaluation metric AP, first introduced in VOC2007, is the area under the smoothed PR curve at an IoU threshold of 0.5. It not only evaluates the classification ability of an object detection model but also reflects its localization ability.
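For clarity, the sketch below computes AP as the area under the smoothed PR curve using the all-points method; the original VOC2007 protocol used 11-point interpolation, but the smoothing step (a monotone precision envelope) is the same.

```python
import numpy as np

def voc_ap(recall: np.ndarray, precision: np.ndarray) -> float:
    """Area under the smoothed PR curve (all-points VOC-style AP).

    recall must be sorted ascending, with precision aligned to it.
    Precision is first made monotonically non-increasing (the
    "smoothing"), then the area is accumulated over recall steps.
    """
    r = np.concatenate(([0.0], recall, [1.0]))
    p = np.concatenate(([0.0], precision, [0.0]))
    for i in range(p.size - 2, -1, -1):  # envelope: smooth the PR curve
        p[i] = max(p[i], p[i + 1])
    idx = np.where(r[1:] != r[:-1])[0]   # points where recall changes
    return float(np.sum((r[idx + 1] - r[idx]) * p[idx + 1]))
```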
Baselines
In this section, we describe the representative baselines used to compare lesion detection performance with our method. We choose four representative baselines: SSD26, YOLOR17, Faster_RCNN_R17, and Scaled-YOLOv427. These four models have been validated by other researchers on bone scan target detection tasks, which makes them a proper control group for evaluating new models. Specifically, SSD is a basic object detection framework with fast detection. YOLOR is good at small-target detection and can probe our model's sensitivity to lesion details. Faster_RCNN_R is a typical two-stage object detection method; replacing the support vector machine and linear regression models of earlier R-CNN variants improved its detection performance, making it suitable for evaluating accuracy improvements. Scaled-YOLOv4 provides a benchmark with a good balance between detection precision and speed. By comparing against these models, we can fully assess the relative performance of our model in terms of accuracy and the ability to detect small lesions.
Results
Comparisons with state-of-the-art methods
We compared the performance of SAAI-BMDetector with other target detection algorithms applied to WBS images over the past few years. All models were tested on the private dataset. Our SAAI-BMDetector achieves the best overall detection performance.
In the experiments, we used AP, precision, and recall as comparison metrics. The results are shown in Table 2, with significant differences reported between each baseline and our proposed SAAI-BMDetector. On our dataset, the proposed method achieved the highest AP, precision and recall. Specifically, for AP we obtained a mean of 55.0% (standard deviation: 6.4%, median: 54.8%, 95% confidence interval: 49.9%, 60.1%). For precision, the mean is 61.5% (standard deviation: 6.4%, median: 61.4%, 95% confidence interval: 56.4%, 66.6%). For recall, the mean is 54.3% (standard deviation: 4.2%, median: 53.2%, 95% confidence interval: 50.9%, 57.6%).
Compared with the baseline SSD, which applies a basic object detection method to WBS lesion detection without any modifications, our SAAI-BMDetector achieved better AP \((p<0.0125)\), precision \((p<0.0125)\) and recall \((p<0.0125)\). SSD, as proposed in26, is an object detection model that takes the chest skeletal region as input and is not well adapted to our task; this may be because the original model was designed for lung cancer patients, whose lesions are larger. Compared with the baselines YOLOR17, Faster_RCNN_R17 and Scaled-YOLOv427, our proposed method also performed better on the private dataset. YOLOR shows an AP with mean 38.5% (standard deviation: 1.9%, median: 38.5%, 95% confidence interval: 36.9%, 40%, \(p<0.0125\)). Faster_RCNN_R shows an AP with mean 39.4% (standard deviation: 2.7%, median: 39.2%, 95% confidence interval: 37.2%, 41.5%, \(p<0.0125\)). Scaled-YOLOv4 shows an AP with mean 42.7% (standard deviation: 6.4%, median: 42.5%, 95% confidence interval: 37.5%, 47.8%, \(p<0.0125\)). None of these three methods considers the unique characteristics of lesions in breast cancer patients' WBS images, resulting in relatively poor performance. The results indicate that our SAAI-BMDetector has a strong ability to identify metastasis lesions in WBS images.
Ablation study
We conducted ablation studies on both the public and private datasets. The results are shown in Table 3. The Backbone consists of the main feature extraction module without any other modifications. The metrics are AP, precision and recall, and significant differences are computed between the Backbone and each modification.
For the public dataset, owing to the lack of pixel-level annotations, we did not train the PAE module and therefore analyze the roles of the DH module and ST_Encoder. On the public dataset, the AP, precision and recall of the Backbone are 60.5 ± 3.9%, 64.2 ± 7.8%, and 40.5 ± 8.7%, respectively, with significance again determined at Bonferroni-adjusted levels. The experimental results show that adding the DH module improves detection performance; for example, AP improved from 60.5 ± 3.9% to 66.9 ± 3.2%. This indicates that the model's capability to find all lesions is greatly improved and that it can "see" more small lesions. Furthermore, overall detection performance continues to improve after adding ST_Encoder, with AP improving from 60.5 ± 3.9% to 68.7 ± 3.2%. This indicates that the self-attention mechanism enables the model to capture more global information and detect more lesions.
For the private dataset, adding the different components in sequence gradually increases the performance metrics. The basic Backbone shows an AP of 33.3 ± 2%; adding the DH module, ST_Encoder, PAE module and T_Decoder raises AP to 47.6 ± 4.4%, 49.5 ± 5.3%, 52.6 ± 6.1%, and 55.0 ± 6.4%, respectively. In particular, overall detection performance continues to improve after adding the PAE module; the improvements in AP and precision in Table 3 show that both the high-resolution texture features and the high-level semantic features provided by the PAE module benefit the model. Adding the T_Decoder improves detection further, verifying the value of the features provided by the PAE module for detecting bone metastases as well as the role of the decoder block. With each adaptation, more small objects can be detected. Compared with the Backbone, both AP and recall of the final SAAI-BMDetector are improved: AP increases by 21.7 percentage points and recall by 41.3 percentage points. These results demonstrate the effectiveness of our design for the characteristics of WBS images.
The effectiveness of PAE module
In the early stages of model design, we considered embedding the PAE module into the main model rather than using it as an off-the-shelf module, in order to handle datasets without pixel-level annotations. Keeping the number of auxiliary feature sequences and the fusion mode unchanged, we embed the PAE module into the main model so that its parameters are trained together with the main model. The following experiments demonstrate that the PAE module can improve detection even without pixel-level annotated data.
We designed experiments on the public dataset to verify this hypothesis. In particular, we compare the performance of the state-of-the-art (SOTA) models reported in20 with our proposed embedded PAE module. The experimental results are shown in Table 4, with significant differences calculated between each baseline and our proposed SAAI-BMDetector (with PAE module). The training and testing results on the public BS-80K dataset show that embedding the PAE module into the main model improves the ability to detect bone metastases. Compared with the SOTA models (Faster R-CNN and Cascade R-CNN20), our proposed scheme performs best, with an AP of 63.1 ± 0.7% (median: 63.1%, 95% confidence interval: 62.6%, 63.6%).
Visualization results
We visualize some prediction results in Fig. 6. For each group, the green boxes on the left are the ground truth (GT) and the red boxes on the right are the predictions. Most of the GT boxes are covered by the prediction results, and we can intuitively see that most lesions are indeed small and concentrated mainly in the chest and pelvis.
Visualization. GT are the green boxes on the left image and prediction results are the red boxes on the right. The red boxes in (a, b, c) indicate false positive results. Examples of missing detection in (a) and (c) are marked with blue boxes. Boxes in (b) and (c) with low precision are marked with yellow. The orange boxes in (b) represent overlapping detection boxes.
Discussion
In recent years, various medical imaging modalities, such as CT28, X-ray18, and SPECT26, have been used to evaluate bone metastasis in multiple cancers. Because of the challenge posed by the low spatial resolution of WBS images, analyzing bone metastases from WBS images has only recently attracted attention. For WBS bone metastasis lesion detection, many current works directly apply traditional detection models without considering that WBS images of breast cancer bone metastases have low resolution, small foregrounds, and multiple lesions, which affects the detection of metastatic sites. To address this problem, we propose a breast cancer bone metastasis detection method tailored for WBS images.
In this study, we created a novel lesion detection framework for multiple bone metastases in WBS images based on self-attention and position auxiliary information. As shown in Table 2, our proposed method achieved the highest AP, precision and recall among all baselines. To further reduce random error, we summarized the effect sizes, estimated with Cohen's d, for our method against the SOTA baselines. Compared with SSD, the effect sizes are 5.912, 2.435 and 3.048, respectively. These statistics indicate that our method is effective relative to the controls. When comparing the recall of YOLOR and our method, although the p value is below 0.0125 and thus statistically significant, the effect size is 0.288. This may be because our backbone is similar in form to YOLOR, and both methods can recognize small lesions. For the other evaluation metrics, such as AP and precision, the effect sizes are large, which also shows that our method is effective.
For a comprehensive analysis, we also performed ablation experiments to evaluate the contributions of the proposed DH module, PAE module, ST_Encoder and T_Decoder blocks. As shown in Table 3, on the public dataset, adding the DH module improves AP from 60.5 ± 3.9% to 66.9 ± 3.2%, and further adding ST_Encoder improves AP from 60.5 ± 3.9% to 68.7 ± 3.2%. The experiments on the public dataset also verify the generalizability of our method. Moreover, on the private dataset, adding the DH module, ST_Encoder, PAE module and T_Decoder increases AP from 33.3 ± 2.0% to 47.6 ± 4.4%, 49.5 ± 5.3%, 52.6 ± 6.1%, and 55.0 ± 6.4%, respectively, which proves the effectiveness of these modules. These results also show that self-attention and auxiliary information are very important for detecting dense, small lesions in low-resolution WBS images. In addition, since the BS-80K dataset does not distinguish between cancer types, the experiments on this dataset also verify the generalization of our method to other cancers.
As shown in Fig. 6, the visual output results are also clearly observed. We can intuitively see that most lesions are indeed small and most of them are concentrated in the chest and pelvis. This is because breast cancer bone metastasis is mainly concentrated in the chest and pelvis, which is consistent with the knowledge of nuclear medicine experts. As can be seen from red boxes in Fig. 6a, there are some false positives in the predicted results, which may be due to the sensitivity of the model to abnormal concentrations of some non-metastatic sites in the WBS image. As shown in the yellow boxes in Fig. 6b and c, the lesions circled by some prediction boxes are consistent with those located by GT, but the size of the boxes is different, which belongs to the problem of low precision. Where there is a high concentration of lesions, the model may give overlapping prediction boxes (orange boxes in Fig. 6b), which is related to the setting of the IOU threshold. Despite the problems described above, we have fewer missed cases (blue boxes in Fig. 6a and c), even though most lesions are very small.
The deep learning models developed using convolutional neural networks in this study showed high lesion detection capabilities on breast cancer bone scan images, which was in line with the expectations of the nuclear medicine physicians who collaborated in this study. These models help accelerate the clinical examination process, improve hospital resource allocation, and promote the most appropriate and timely patient care. Inclusion of data from other centers may enhance the generalizability of machine learning models and produce more consistent and reliable results. Further research on prospectively acquired data is necessary to interpret more clinical information and distinguish different types of benign lesions, such as spondylarthritis.
Limitations and future scope
There are several limitations in our study. First, in most settings, disease identification is not limited to a single type of imaging but is based on a combination of diagnostic imaging techniques, biochemical test results, patient medical records, and pathological evidence. This study relies solely on bone scan images; further integration of multimodal data into diagnostic models is a future direction. Second, there is currently a lack of public datasets for breast cancer bone scan research. To verify the generalization of the method, we selected the public bone scan dataset BS-80K, which contains bone scan images of various diseases. Third, since the annotation of our dataset included only malignant metastatic lesions, our study focused on the detection of malignant lesions rather than the differentiation of benign and malignant lesions. However, benign lesions such as chondromas and lumbar vertebral degenerative changes do exist and have certain display patterns on WBS images; more studies on benign-malignant differentiation are needed in the future. Fourth, this study focused mainly on lesion detection in breast cancer bone scans. In the future, we can address more tasks, such as patient survival prognosis and differentiation of lumbar degenerative lesions.
Conclusion
In this work, we proposed a new framework for detecting multiple bone metastases that accounts for the unique challenges of low resolution, small foregrounds and multiple lesions in WBS images. To address the difficulty of feature extraction caused by low resolution and multiple lesions, we designed the position auxiliary extraction module and the multi-level feature fusion module, and we added a self-attention-based head for detecting small lesions. Compared with the baselines, our model achieves state-of-the-art performance on both the private dataset and the public BS-80K dataset. The experimental results show that the proposed architecture can effectively identify multiple bone metastases in breast cancer patients even from low-resolution WBS images. Our framework is tailored for whole-body WBS and can serve as a clinical decision support tool for early decision-making in breast cancer bone metastases.
Data availability
The datasets used and analyzed during the current study are available from the corresponding author on reasonable request.
References
Kashyap, D. et al. Global increase in breast cancer incidence: risk factors and preventive measures. Biomed. Res. Int. 2022(1), 9605439 (2022).
Giannakeas, V., Lim, D. W. & Narod, S. A. Bilateral mastectomy and breast cancer mortality. JAMA Oncol. 10(9), 1228–1236 (2024).
Fuentes, J. D. B. et al. Global stage distribution of breast cancer at diagnosis: A systematic review and meta-analysis. JAMA Oncol. 10(1), 71–78 (2023).
Riggio, A. I., Varley, K. E. & Welm, A. L. The lingering mysteries of metastatic recurrence in breast cancer. Br. J. Cancer 124(1), 13–26 (2021).
Kolahi Azar, H. et al. The progressive trend of modeling and drug screening systems of breast cancer bone metastasis. J. Biol. Eng. 18(1), 14 (2024).
Yu, X. & Zhu, L. Nanoparticles for the treatment of bone metastasis in breast cancer: Recent advances and challenges. Int. J. Nanomed. 19, 1867–1886 (2024).
Coleman, R. E. & Rubens, R. D. The clinical course of bone metastases from breast cancer. Br. J. Cancer 55(1), 61–66 (1987).
Yang, M., Liu, C. & Yu, X. Skeletal-related adverse events during bone metastasis of breast cancer: Current status. Discov. Med. 27(149), 211–220 (2019).
Iagaru, A. & Minamimoto, R. Nuclear medicine imaging techniques for detection of skeletal metastases in breast cancer. PET Clin. 13(3), 383–393 (2018).
Coleman, R. E., Brown, J. & Holen, I. Bone metastases. Abeloff’s Clinical Oncology e803, 809–830 (2020).
Cook, G. J. & Fogelman, I. Skeletal metastases from breast cancer: Imaging with nuclear medicine. Seminars in Nuclear Medicine (1999).
Kakhki, V. R. D., Anvari, K., Sadeghi, R., Mahmoudian, A.-S. & Torabian-Kakhki, M. Pattern and distribution of bone metastases in common malignant tumors. Nucl. Med Rev. 16(2), 66–69 (2013).
Papandrianos, N., Papageorgiou, E., Anagnostis, A. & Papageorgiou, K. Efficient bone metastasis diagnosis in bone scintigraphy using a fast convolutional neural network architecture. Diagnostics 10(8), 532 (2020).
Hajianfar, G., Sabouri, M., Salimi, Y. et al. Artificial intelligence-based analysis of whole-body bone scintigraphy: The quest for the optimal deep learning algorithm and comparison with human observer performance. Zeitschrift für Medizinische Physik (2023).
Shimizu, A. et al. Automated measurement of bone scan index from a whole-body bone scintigram. Int. J. Comput. Assist. Radiol. Surg. 15, 389–400 (2020).
Saito, A. et al. Extraction of metastasis hotspots in a whole-body bone scintigram based on bilateral asymmetry. Int. J. Comput. Assist. Radiol. Surg. 16, 2251–2260 (2021).
Liao, C.-W. et al. Artificial intelligence of object detection in skeletal scintigraphy for automatic detection and annotation of bone metastases. Diagnostics 13(4), 685 (2023).
Li, J. et al. Primary bone tumor detection and classification in full-field bone radiographs via YOLO deep learning model. Eur. Radiol. 33(6), 4237–4248 (2023).
Cheng, D.-C., Liu, C.-C., Hsieh, T.-C., Yen, K.-Y. & Kao, C.-H. Bone metastasis detection in the chest and pelvis from a whole-body bone scan using deep learning and a small dataset. Electronics 10(10), 1201 (2021).
Huang, Z. et al. BS-80K: The first large open-access dataset of bone scan images. Comput. Biol. Med. 151, 106221 (2022).
Wang, C.-Y., Liao, H.-Y.M., Wu, Y.-H., Chen, P.-Y., Hsieh, J.-W. & Yeh, I.-H. CSPNet: A new backbone that can enhance learning capability of CNN. Paper presented at: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition workshops (2020).
Carion, N., Massa, F., Synnaeve, G., Usunier, N., Kirillov, A. & Zagoruyko, S. End-to-end object detection with transformers. Paper presented at: European conference on computer vision (2020).
Ronneberger, O., Fischer, P. & Brox, T. U-net: Convolutional networks for biomedical image segmentation. Paper presented at: Medical Image Computing and Computer-Assisted Intervention - MICCAI 2015: 18th International Conference, Munich, Germany, October 5–9, 2015, Proceedings, Part III (2015).
Zhu, X., Lyu, S., Wang, X., & Zhao, Q. TPH-YOLOv5: Improved YOLOv5 based on transformer prediction head for object detection on drone-captured scenarios. Paper presented at: Proceedings of the IEEE/CVF international conference on computer vision (2021).
Liu, Z., Lin, Y., Cao, Y. et al. Swin transformer: Hierarchical vision transformer using shifted windows. Paper presented at: Proceedings of the IEEE/CVF international conference on computer vision (2021).
Lin, Q. et al. Detecting multiple lesions of lung cancer-caused metastasis with bone scans using a self-defined object detection model based on SSD framework. Phys. Med. Biol. 67(22), 225009 (2022).
Moustakidis, S., Siouras, A., Papandrianos, N., Ntakolia, C. & Papageorgiou, E. Deep learning for bone metastasis localisation in nuclear imaging data of breast cancer patients. Paper presented at: 2021 12th International Conference on Information, Intelligence, Systems and Applications (IISA) (2021).
Faghani, S. et al. A deep learning algorithm for detecting lytic bone lesions of multiple myeloma on CT. Skeletal Radiol. 52(1), 91–98 (2023).
Acknowledgements
This research was supported by the CAMS Innovation Fund for Medical Sciences (CIFMS) 2021-I2M-1-014; Guangdong Basic and Applied Basic Research Foundation (2023A1515110721); The Key Research and Development Program of NingXia (2023BEG02060); The Fundamental Research Funds for the Central Universities (FRF-TP-22-050A1).
Author information
Contributions
J.S. and R.Z. implemented the algorithms, designed and performed the experiments and wrote the manuscript. Z.Y. and Z.C. performed the data analysis. Z.H., L.H. and Q.S. provided the data. J.W. and Y.X. designed and supervised the research.
Ethics declarations
Competing interests
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
Ethical approval
All experiments were approved by the Ethics Review Committee of Peking Union Medical College Hospital, Chinese Academy of Medical Sciences, and the approval details can be found in the supplementary Files. Informed consent was obtained from all study participants.
Additional information
Publisher’s note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.