Introduction

The butterfly-shaped thyroid is an essential endocrine gland in the neck, and thyroid hormones play a crucial role in regulating metabolism, heart rate, and calorie burning1. Cancerous cells, known as thyroid nodules, can grow on the surface of the thyroid. Various conditions, including simple excessive growth, diffuse cysts, infections, or tumors, can form nodules (benign or malignant). Approximately 6% of women and 1%–2% of men are affected by thyroid nodules, making this a common nodular disease2,3,4. Therefore, strengthening the detection of thyroid nodules, particularly improving the screening capabilities for early-stage nodules, and facilitating timely intervention are of significant clinical importance.

Ultrasound (US) imaging is a non-invasive method for soft tissue imaging and has become the primary clinical tool for assessing thyroid nodules5. Figure 1 shows a thyroid ultrasound image, with nodules indicated by red arrows and other tissues marked by white arrows. In Fig. 1a, structures such as blood vessels, trachea, and muscles are present, making it difficult to distinguish the nodule from surrounding tissues, which affects detection accuracy. Figure 1b highlights the characteristics of early nodules, which are typically small in size and exhibit indistinct features.

Fig. 1
figure 1

Examples of various thyroid nodule ultrasound images, with thyroid nodules indicated by red arrows and other tissues marked by white arrows. (a) An ultrasound image of a thyroid nodule containing various tissues. (b) An ultrasound image of an early, small nodule.

During the diagnostic process, physicians analyze thyroid nodules by identifying features such as increased blood flow, punctate echogenic foci, irregular borders, and height-to-width ratios6. However, this process presents several challenges: first, the variability in shape, size, and margins of nodules complicates the distinction of different types of abnormalities on thyroid ultrasound; second, the inherent speckle noise of ultrasound can reduce contrast resolution, hindering visual quality and obscuring important information in the diagnostic process, thereby affecting the radiologist’s interpretation7,8,9. Although intelligent diagnostic systems can alleviate medical pressures and reduce human errors, they still face the following challenges:

  • Superficial feature extraction: Superficial features (such as size, shape, texture, and aspect ratio) contain rich information and are crucial for enhancing model detection performance. Therefore, extracting high-quality superficial features from ultrasound images is a primary concern.

  • Early detection: Early identification is critical for diagnosing and treating thyroid nodules. However, the small size and indistinct features of nodules make enhancing early detection performance the second major challenge.

  • Background interference: The background of thyroid ultrasound images is complex, containing various tissues and substantial noise interference, significantly impacting detection accuracy. Therefore, suppressing background interference to enhance the overall detection performance of the model is the third research challenge.

To address the critical issues in our research, we conducted a literature review and experimental analysis, proposing a detection model specifically for nodules in thyroid ultrasound images. First, we collected substantial ultrasound image data from collaborating hospitals and performed comparative analyses using various advanced algorithms. After balancing detection efficiency and performance, we selected the YOLOv8-L model as our core model, which demonstrated the best overall performance. The main contributions of this study are as follows:

  • Proposed a thyroid nodule detection framework based on YOLOv8-L, enabling rapid localization of nodules in ultrasound images and distinguishing between benign and malignant cases, thereby improving diagnostic efficiency and alleviating medical burdens.

  • We improved the baseline model for the thyroid nodule detection task by integrating various modules that enhance representation capabilities into the backbone network, considering the unique characteristics of ultrasound backgrounds and nodules. To boost the diagnostic ability for early nodules, we introduced the Gather and Distribute Mechanism (GD)23 during the feature fusion stage, facilitating the interaction and fusion of global and multi-level features through convolution and self-attention mechanisms. This approach enhanced multi-scale feature fusion capability and improved the model’s detection performance for early nodules through fine-grained feature extraction.

Related work

Computer-aided diagnostic tools can effectively improve radiologists’ efficiency when detecting thyroid nodules. They mainly include manually annotated feature methods based on machine learning and automatic feature extraction methods based on deep learning. The former primarily uses shallow features such as image texture10, geometric morphology11, and statistical distribution12 and then classifies the target through a classifier, but its accuracy and efficiency are not as good as deep learning. Deep learning, represented by convolutional neural networks, can automatically extract deep semantic features in images, avoiding the need for manual feature design and extraction. It has become a research hotspot in diagnosing thyroid nodules on ultrasound images.

The target detection algorithm based on deep learning is constantly evolving in the ultrasound field and is widely used to detect thyroid nodule ultrasound images, achieving excellent performance. Zheng et al. proposed the Cascade Mask RCNN model, which optimizes Region of Interest (ROI) classification and bounding box correction by designing a more effective detector and improving the loss function to balance easy and complex samples, thereby improving the detection effect13. Fang et al. introduced a feature pyramid, spatial re-mapping, and anchor box redesign into the Faster R-CNN framework, significantly improving detection accuracy and speed14. Song et al. optimized the SSD model and used the MC-CNN framework of the spatial pyramid network to capture the multi-level features of the target for accurate detection15. Wang et al. fused YOLOv2 and ResNet to generate more precise feature maps by combining shallow and deep features for efficient detection16. Ma et al. combined YOLOv3 with DMRF-CNN to obtain deep edge and texture features and used multi-scale detection to identify nodules of different sizes17. Zhang et al. improved YOLOv3 by replacing the original backbone network with a high-resolution network (HRNet) and optimizing anchor boxes using a clustering algorithm to improve the accuracy of nodule localization and identification18. Yang et al. introduced involution, deformable convolution, and an attention mechanism into YOLOX, significantly improving thyroid nodules’ detection, especially early nodules19.

Early detection is crucial for preventing complications associated with thyroid cancer, so developing models that can accurately diagnose early-stage nodules promptly is of great significance for the clinical treatment of thyroid nodules. This study addresses the challenges of insufficient detection accuracy and difficulty in identifying the morphology of early nodules. By optimizing the algorithm structure, the model’s ability to characterize nodule features is improved. Using YOLOv8-L as the baseline model, we propose a detection framework specifically for thyroid nodule ultrasound images and systematically evaluate its application in ultrasound detection. The structure of the research process is shown in Fig. 2.

Fig. 2
figure 2

Process architecture diagram: It consists of four stages: data acquisition, data processing, model building, and experimental analysis.

Method

The overall architecture of this study is shown in Fig. 3, which includes the backbone (feature extraction network), neck (feature fusion network), and head (detection head). In the backbone, the traditional convolution in the C2f. module of S2 is replaced with the Deformable Convolutional Networks (DCN)21, and the Channel Prior Convolutional Attention (CPCA)22 is introduced to weight the offset learned by DCN so that it pays more attention to the target region. DCN can extract multi-scale features at different levels, more comprehensively describe the diversity and complexity of nodules, and has strong adaptability, effectively capturing the rich local information in shallow features.

Fig. 3
figure 3

Propose the general structure of the method.

Before feature fusion, we use CPCA to perform weighted operations on the four feature maps extracted from the backbone network to reduce the impact of background interference in thyroid ultrasound images. To address the problem of small individual nodules and insignificant features in early nodules, we use a feature fusion network based on the GD mechanism23, which achieves a balance between latency and accuracy and significantly enhances the ability to fuse multi-scale features. With the self-attention mechanism, nodule features are better captured and enhanced, improving the model’s ability to detect early nodules.

In the first detection step, we use a decoupled head, which includes three branches: cls, reg, and projection Conv. Label Assign adopts the AnchorFree format and uses TAL dynamic matching to select high-quality anchors as positive samples from the perspective of task alignment and incorporate them into the loss design. Calculating the loss with the aid of a border further enhances the model’s generalization ability20.

Backbone

This study divides the backbone feature extraction network into five stages (as shown in Fig. 4, S1–S5). In stage S1, the height and width information of the ultrasound image is first converted into channel features using the first layer of the base structure. Then, a second channel is used to compress the feature map. Further, the number of channels increases, and the feature map is observed for down sampling. Then, a feature map with a rich gradient flow is extracted through the C2f. module.

Fig. 4
figure 4

Backbone feature extraction network.

In stages S2 to S5, the feature map’s width and height are further compressed, increasing the number of channels. The spatial pyramid pooling module (SPPF) converts the feature map of variable size to a uniform size to achieve multi-pixel feature fusion. Stages S2 to S5 generate four feature maps as input to the neck network.

We replaced the traditional convolution in stage S2 during the optimization process with the DCN structure. DCN automatically adjusts the receptive field by adaptively learning the offset of the feature map, and at the same time, through attention weighting, it has more flexible sampling ability and adaptability than traditional convolution. DCN can better adapt to nodules with different morphological features and enhance the model’s ability to capture shallow features. To address the problem of background interference in ultrasound image data, CPCA is introduced after the SPPF module to perform attention weighting on the input feature map, reducing the impact of background information on detection performance and improving the representational power and robustness of the model.

Deformable convolution

The superficial features of thyroid nodules are rich in local information, and the efficient extraction of superficial features is crucial to the nodule detection process. However, due to their fixed geometry, traditional convolution units cause the receptive field size of the same CNN to be the same when processing different regions of the image and cannot adapt to features that change in position.

To solve this problem, we replace the traditional convolution in the C2f. module with DCN21 in the S2 stage of the backbone network. DCN adapts the offset of the feature map through an additional convolution layer and applies it to the sampling position of the standard convolution, enabling the network to respond flexibly to deformation. In addition, to further correct and refine the range of new sampling points, CPCA is incorporated into the deformable convolution structure to improve the model’s ability to resist spatial transformations. Figure 5 shows the structure of the deformable convolution, and Fig. 6 compares the sampling positions of the standard convolution and the deformable convolution.

Fig. 5
figure 5

It improved the deformable convolution structure. Deformable convolution learns offsets based on parallel networks, causing the convolution kernel to offset the sampling points of the feature map. The introduction of the CPCA mechanism helps to make it pay more attention to the ROI.

Fig. 6
figure 6

CPCA structure diagram.

In the deformable convolution, based on the input image, the traditional convolution kernel is first used to obtain the feature map \({F}_{in}\), and then the obtained feature map is used as input. A convolution is applied, followed by CPCA for weighting to obtain the feature map \({F}_{0}\). The width and height of \({F}_{0}\) correspond to the input feature map, with each pixel representing \(\text{K}\times \text{K}\) sampling points, with an offset \(\Delta \text{p}(\Delta \text{x},\Delta \text{y})\). The learning of these offsets is conducted through interpolation and backpropagation. The weighted values of the offsets are then applied to the corresponding sampling areas (green grid). Finally, the convolution operation is performed in the adjusted sampling area (blue grid), achieving the effect of deformable convolution. This process can be mathematically represented as follows.

First, a set of pixel points R is sampled from the input Feature Map, as shown in Eq. (1). R defines the receptive field. Taking a 3 × 3 convolution as an example, the elements in R represent the relative positions of the sampling points.

$$\text{R}=((-\text{1,0}),(\text{1,0}),...(\text{0,1}),(\text{1,1}))$$
(1)

Then, based on R, the sampled results are calculated using a convolution operation, as shown in Eq. (2).

$$\text{y}({\text{p}}_{0})=\sum_{{\text{p}}_{\text{n}}\in \text{R}}\text{W}({\text{p}}_{\text{n}})\bullet \text{X}({\text{p}}_{0}+{\text{p}}_{\text{n}})$$
(2)

Finally, an offset \(\Delta {p}_{n}\) is added to each sampling point \({p}_{0}\)+\({p}_{n}\), where \(\{\Delta {p}_{n}|n=\text{1,2},3,...,N\}\), and the output pixel value \(\text{y}({p}_{0})\) at point \({p}_{0}\) in the Feature Map is obtained through convolution operations. The sampling positions in the Feature Map are usually integers, but the offset \(\Delta {p}_{n}\) is typically a small fractional value. Therefore, bilinear interpolation is used to retrieve values in the input Feature Map. As shown in Eq. (3).

$$\text{y}({\text{P}}_{0})=\sum_{{\text{p}}_{\text{n}}}\text{w}({\text{p}}_{\text{n}})\bullet \text{x}({\text{p}}_{0}+{\text{p}}_{\text{n}}+\Delta {\text{p}}_{\text{n}})$$
(3)

For a floating-point sampling point p,first,the actual pixel points \(\text{x}(\text{q})\) in its four neighboring regions are retrieved.The function G(.) is used to compute the weights for each \(\text{x}(\text{q})\). Finally, the weighted values of \(\text{x}(\text{q})\) are summed and used as the pixel value \(\text{x}(\text{p})\) at position in the input feature map. As shown in Eq. (4).

$$\text{x}(\text{p})=\sum_{\text{q}}\text{G}(\text{p},\text{q})\bullet \text{x}(\text{q})$$
(4)

Attention mechanism

In the thyroid nodule detection method proposed in this study, the backbone network is used to initially extract the nodule features from ultrasound images and output four feature maps for the neck network to use. To reduce the interference of background information on the model, we introduced CPCA22 between the backbone and neck networks. Specifically, before the feature maps extracted by the backbone network are input into the neck network, the attention mechanism is applied to suppress noise and reduce background interference.

The structure of CPCA, as shown in Fig. 6, consists of channel attention and spatial attention sequentially. It dynamically allocates weights across both channel and spatial dimensions, allowing it to capture spatial relationships while retaining channel priorities. First, the spatial information of the feature map is aggregated through average pooling and max pooling operations, and a channel attention map is generated via a shared Multi-Layer Perceptron (MLP). The input features are then multiplied by the channel attention map to produce channel-refined features, which are fed into a deep convolution block to generate a spatial attention map. After generating the spatial attention map, the convolution block continues to receive and perform channel-spatial fusion operations, enhancing feature representation. Finally, the result of channel-spatial fusion is multiplied with the channel-refined features to obtain the optimized feature output. The mathematical expression of this process is shown below:

$${\text{CA}}\left( {\text{F}} \right) = \sigma \left( {{\text{MLP(Avgpool(F)}}) + {\text{MLP(MaxPool(F))}}} \right)$$
(5)
$${F}_{c}=CA(F)\otimes F$$
(6)
$$\text{SA}(\text{F})={\text{Conv}}_{1\times 1}(\sum_{\text{i}=0}^{3}{\text{Branch}}_{\text{i}}(\text{DwConv}(\text{F})))$$
(7)
$$\widehat{F}=SA({F}_{c})\otimes {F}_{c}$$
(8)

In this context, \(\text{CA}(\text{F})\) represents the channel attention map, \({F}_{c}\) denotes the channel prior, \(\text{SA}(\text{F})\) stands for the spatial attention map, and \(\widehat{F}\) refers to the refined features output.

Neck

This study introduces the GD mechanism23 as the Neck structure in the YOLO series to improve the integration of cross-layer information. This mechanism solves the difficulties and losses in the information interaction and fusion process of the traditional FPN structure. Through convolution and self-attention mechanisms, the GD mechanism can efficiently fuse multi-scale features on a global scale and inject global information into features at each layer, overcoming the shortcomings of FPN, improving the efficiency of information transmission, and ultimately enhancing the model’s ability to detect early thyroid nodules.

The GD mechanism includes a Low- Gather and Distribute (Low-GD) and a High-Gathering and Distribute (High-GD), as shown in Fig. 7a, b. The aggregation and distribution process consists of three parts: the Feature Alignment Module (FAM), the Information Fusion Module (IFM), and the Injection module (Inject). FAM collects and aligns features at each level, IFM is responsible for fusing the aligned features to generate global information, and Inject is accountable for distributing the fused global information to the feature maps at each level and injecting it into the features at each level through a self-attention mechanism.

Fig. 7
figure 7

GD mechanism structure: (a) low-gather and distribute branch structure; (b) high- gather and distribute branch structure.

The input to the Neck is derived from the four effective feature layers, B2, B3, B4, and B5, extracted from the backbone network, where \({B}_{i}\in {R}^{N\times {C}_{{B}_{i}}\times {R}_{{B}_{i}}}\), with batch size represented by N, channels by C, and spatial dimensions by R(where \(\text{R}=\text{HXW}\)). Additionally, the dimensions of \({R}_{B2}\),\({R}_{B3}\),\({R}_{B4}\),\({R}_{B5}\) are \(R\),\(\frac{1}{2}R\),\(\frac{1}{4}R\),\(\frac{1}{8}R\), respectively.

Low-GD

Low-GD includes LOW-FAM, LOW-IFM, and the Injection module, which perform fusion on B2, B3, B4, and B5 to capture high-resolution features of small target information.

In LOW-FAM, AvgPool is used to down sampling the input features, thereby unifying the feature sizes. LOW-IFM, on the other hand, consists of multiple layers of \(RepBlock\) and \(Split\). The \(RepBlock\) aligns the input features \({F}_{align}\) (where \((\text{channel}=\text{sum}({\text{C}}_{\text{B}2},{\text{C}}_{\text{B}3},{\text{C}}_{\text{B}4},{\text{C}}_{\text{B}5},)\)) to generate \({F}_{fuse}(\text{channel}=({\text{C}}_{\text{B}4}+{\text{C}}_{\text{B}5})\). The intermediate channels are adjusted according to the size of the target model. The features generated by \(RepBlock\) are split into two dimensions, denoted as \({F}_{inj\_P3}\) and \({F}_{inj\_P4}\), which correspond to different levels of feature fusion. The Injection module divides the input features \({F}_{align}\), as well as the globally injected information \({F}_{inj}\), for fusion. Through Conv calculation, the local features \({F}_{\text{act}}\) and the global features \({F}_{\text{global}\_\text{embed}}\) are obtained. The local features \({F}_{\text{local}\_\text{embed}}\) are then fused through Conv and attention mechanisms to generate the final output features \({F}_{out}\). Since there is a size difference between the local and global features, average pooling and bilinear interpolation are employed to adjust the sizes of \({F}_{inj}\), \({F}_{\text{global}\_\text{embed}}\) and \({F}_{\text{act}}\), ensuring proper alignment. To more effectively inject global information into different layers, a self-attention mechanism is used to control the fusion, along with adding \(RepBlock\) to further extract and fuse information. Among these,\({F}_{local}\) is equivalent to \(\text{Bi}\). The mathematical representation of the above process is shown below:

$${F}_{align}=LOW\_FAM([B2,B3,B4,B5])$$
(9)
$${F}_{fuse}=RepBlock({F}_{align})$$
(10)
$$F_{inj\_P3} ,\;F_{inj\_P4} = Split\left( {F_{fuse} } \right)$$
(11)
$${F}_{\text{global}\_\text{act}\_\text{Pi}}=resize(Sigmoid({Conv}_{act}({F}_{inj\_Pi})))$$
(12)
$${F}_{\text{global}\_\text{embed}\_\text{Pi}}=resize({Conv}_{global\_embed\_Pi}({F}_{inj\_Pi}))$$
(13)
$${F}_{\text{act}\_\text{fuse}\_\text{Pi}}={Conv}_{local\_embed\_Pi}(Bi)*{F}_{global\_act\_Pi}+{F}_{global\_embed\_Pi}$$
(14)
$${P}_{i}=RepBlock({F}_{att\_fuse\_Pi})$$
(15)

High-GD

High-GD comprises High-FAM, High-IFM, and the Injection module, responsible for further fusing the outputs of Low-GD from {P3, P4, P5}.

In High-FAM, the input feature dimensions are reduced and unified into a consistent size. High-IFM includes a Transformer module and \(Split\) process, which operates in three steps: (1) The Transformer merges \({F}_{align}\) to obtain \({F}_{fuse}\)0. (2)The Conv layer reduces \({F}_{fuse}\) to the sum of \(({\text{C}}_{\text{P}4},{\text{C}}_{\text{P}5})\)0. (3) The \(Split\) process divides \({F}_{fuse}\) along the channel dimension into \({F}_{inj\_N4}\) and \({F}_{inj\_N5}\), which are then fused with the corresponding features from the previous layers.The Injection process is identical to that in Low-GD. In this context, \({F}_{local}\) is equivalent to \(\text{Pi}\). The mathematical representation of the above process is as follows:

$${F}_{align}=High\_FAM([P3,P4,P5])$$
(16)
$${F}_{fuse}=Transformer({F}_{align})$$
(17)
$${F}_{inj\_N4},{F}_{inj\_N5}=Split(Conv1\times 1({F}_{fuse}))$$
(18)
$${F}_{global\_act\_Ni}=resize(Sigmoid({Conv}_{act}({F}_{inj\_Ni})))$$
(19)
$${F}_{global\_embed\_Ni}=resize({Conv}_{global\_embed\_Ni}({F}_{inj\_Ni}))$$
(20)
$${F}_{att\_fuse\_Ni}={Conv}_{local\_embed\_Ni}(Pi)*{F}_{global\_act\_Ni}+{F}_{global\_embed\_Ni}$$
(21)
$$\text{Ni}=\text{RepBlock}({\text{F}}_{\text{att}\_\text{fuse}\_\text{Ni}})$$
(22)

Head

In the detection head step, a decoupled head structure24 and AnchorFree25 strategies are used, as shown in Fig. 8. This structure separates the classification branch and the regression branch to correspond to different task requirements, where Varifocal Loss (VFL Loss)26 is the classification loss and Distribution Focal Loss (DFL Loss)27 + CIOU Loss is the bounding box regression loss. In addition, the Anchor-Free strategy is used, where the target is represented by multiple key points or centroid points and corresponding boundary information without preset anchors so that target detection can be performed directly. This avoids the complexity of manually designing and adjusting the prior box hyperparameters25.

Fig. 8
figure 8

Structure of the decoupling detection head.

Experimentation and results

Experimental data

A partner hospital provided the thyroid ultrasound image data used in this study. Under the premise of fully protecting patient privacy, 3,055 ultrasound images of 205 patients aged 10 to 77 years were collected (2019 to 2023). Each image contains at least one thyroid nodule region for 3088 nodules, including 1752 benign and 1336 malignant nodules, divided into training and test sets in a 7:3 ratio. In addition, we have also included the currently publicly accessible DDTI public database28 as part of the test set. All data were labeled by professional physicians in the ultrasound department to ensure the authenticity and reliability of the data. An example of the ultrasound image data is shown in Fig. 9.

Fig. 9
figure 9

Ultrasound data of thyroid nodules: (a) Ultrasound image of benign thyroid nodule (b) Ultrasound image data of malignant thyroid nodule.

Data processing

Data quality is affected by equipment acquisition, doctor’s operation, and patient cooperation. Deep learning relies on high-quality data to ensure the accuracy and reliability of the model. Therefore, this study processed the thyroid ultrasound image data by sharpening and other means to enhance the image contrast and highlight the nodules and their boundary information. The processed ultrasound images (as shown in Fig. 10) have higher contrast, more explicit nodule boundaries, and more prominent features.

Fig. 10
figure 10

Ultrasound data of thyroid nodules (a) Unprocessed thyroid ultrasound image (b) Processed thyroid ultrasound image.

Experimental environment and training strategies

The system utilized in this research is equipped with an RTX 4070 GPU (32 GB memory), an Intel Core i9-13980HX processor, and is configured with CUDA 11.1 and cuDNN 8.0.5 for accelerated GPU computing. The development platform is PyCharm, with the project built on PyTorch 2.0.During model training, a transfer learning strategy was employed to optimize performance. Initial training was conducted using the VOC dataset to obtain pre-trained weights, followed by fine-tuning. The optimizer used is Adaptive Moment Estimation (Adam), with a momentum of 0.95 and an initial learning rate of 0.01. Training was conducted for a total of 200 epochs with a batch size of 16.Fig. 11 illustrates the change in loss and mAP@50 as the number of training epochs increases. By the 185th epoch, both loss and mAP stabilized. From the 190th epoch onwards, the loss decreased significantly and quickly plateaued. To prevent overfitting, Mosaic data augmentation was disabled during the final 10 epochs. Each experiment was repeated three times to ensure data reliability, and the average values were taken to minimize the influence of random error.

Fig. 11
figure 11

Loss and Map50 changes during training.

Comparative experiments

Experimental index assessment

This study employs three key metrics to comprehensively evaluate the performance of the thyroid nodule detection model: Precision (P), Recall (R), and mAP@50. Precision reflects the proportion of predicted positive samples that are truly positive, measuring the model’s ability to minimize false positives, thereby reducing unnecessary examinations and treatments. Recall evaluates the proportion of actual positive samples correctly identified by the model, which is critical for reducing missed diagnoses a particularly important aspect in medical diagnosis.

To balance precision and recall, we introduce mAP@50 as a comprehensive performance evaluation metric. mAP@50 considers not only the overlap between the predicted bounding box and the ground truth but also measures accuracy across different thresholds, fully reflecting the model’s localization and classification capabilities. A higher mAP@50 indicates better classification accuracy and bounding box localization in nodule detection.

In summary, these three indicators Precision, Recall, and mAP@50 offer a comprehensive evaluation of the model’s performance in thyroid nodule detection, ensuring its accuracy, sensitivity, and reliability in practical application.

Comparative experiments on different models

YOLOv8 demonstrated outstanding detection performance in thyroid ultrasound diagnosis. Based on comparative experiments with different model weights, and considering both inference efficiency and detection accuracy, YOLOv8-L showed the best overall performance, as illustrated in Table 1. Therefore, YOLOv8-L was selected as the baseline model for this study.

Table 1 Compare the results of experiments with different weights.

To meet the characteristics of ultrasound imaging and clinical diagnostic needs, we optimized the baseline algorithm to enhance its feature representation capabilities and improve the detection of small nodules. Experimental results indicate that the mAP@50 of the improved model increased by 6.6% compared to the baseline algorithm, with all performance metrics surpassing the baseline model, particularly in the detection of malignant nodules.

To verify the feasibility and effectiveness of our model, we compared it with several single- and two-stage models as well as their baseline algorithms. All experiments were conducted using the same thyroid ultrasound dataset, training strategies, and evaluation metrics. The comparison results are shown in Table 2. Zheng et al.’s Cascade Mask-RCNN model demonstrated better performance in detecting benign nodules, improving mAP@50 by 4.7% over its baseline. Although Yang et al.'s Improved-YOLOX model accounted for the diversity of nodule morphology and the needs of early diagnosis, it was limited by the randomness of deformable convolutional offset sampling and the FPN architecture, achieving only a 5.2% increase in mAP@50, which still falls short compared to our model.

Table 2 Compare the experimental results of different algorithms.

Comparison of attention mechanisms

In this study, four attention mechanisms were integrated into the network to reduce the impact of redundant information in ultrasound images on model performance: Coordinate Attention (CA)29, Spatial Transformer Networks (STN)30, Convolutional Block Attention Module (CBAM)31, and CPCA. The comparison results of these four attention mechanisms are presented in Table 3. Experimental results show that CPCA significantly improved the detection performance of thyroid nodules, with an mAP@50 of 91.5%. In comparison, CA and STN only improved mAP@50 by 0.2% and 0.4%, respectively, over the baseline model. As a hybrid attention mechanism that combines spatial and channel information, CBAM achieved an mAP@50 of 90.8% when integrated into the baseline model but still fell short of CPCA. This is because CBAM enforces a uniform spatial attention distribution across the channel dimension, which limits its ability to dynamically adjust attention weights between channel and spatial dimensions.

Table 3 Experimental results comparing different attention mechanisms.

Ablation experiments

To verify the effectiveness of the improvements made to the baseline model in this study, ablation experiments were conducted on individual modules (DCN, CPCA, and GD), and the results are presented in Table 4.

Table 4 Results of the ablation experiment.

DCN module: This replaces the traditional convolution in the C2f. module at stage S2, enhancing the model’s ability to capture shallow features. Experimental results show that the DCN module increased the mAP@50 for thyroid nodules to 89.8%, which is a 0.9% improvement over the baseline model. The DCN’s ability to flexibly adjust the sampling area according to different nodule shapes strengthens the network’s representational and detection capabilities.

Enhanced DCN (with CPCA): While DCN enables flexible sampling by learning offsets, it further improves the model’s robustness to spatial transformations through attention-weighted mechanisms. The combined use of DCN and CPCA resulted in a 2.8% increase in mAP@50.

CPCA module: CPCA applies attention weighting to the four feature maps output by the backbone network to mitigate the interference of background information. The results show that the introduction of CPCA improved the mAP@50 by 1.5%.

GD module: The GD mechanism efficiently integrates and exchanges information through convolution and self-attention, enhancing the model’s ability to fuse multi-scale features. Results indicate that the GD module increased the mAP@50 by 1.4%.

Results

To provide a clear comparison of the detection performance of our proposed model against advanced algorithms in thyroid ultrasound imaging, we conducted comparative experiments using the test set as the sample. The results, as shown in Fig. 12, indicate that our model significantly outperforms others in terms of precision, recall, and mAP@50. This demonstrates the model’s effectiveness in accurately locating and identifying nodules in thyroid ultrasound images.

Fig. 12
figure 12

Experimental metric score radar chart. Red indicates the score of the metric in this study, while blue and green indicate the scores of other excellent algorithm metrics, respectively.

Additionally, we conducted follow-up evaluations with users, and the results suggest that the proposed method surpasses clinical doctors’ visual diagnosis in overall performance while requiring minimal time for the entire diagnostic process. Therefore, this method demonstrates notable accuracy and efficiency in the diagnosis and treatment of thyroid nodules in ultrasound images, indicating its vast potential for clinical application.

Conclusion and discussion

Conclusion

The objective of this study is to enhance the detection accuracy of nodules in thyroid ultrasound images, particularly for early-stage nodules. By utilizing the YOLOv8-L model as a baseline, we propose a novel framework that integrates advanced feature extraction techniques and model optimization strategies, achieving precise localization and classification of nodular lesions. Experimental results demonstrate that the optimized model performs exceptionally well in detecting early small nodules, indicating its promising clinical application potential.

In this study, statistical uncertainty is also a crucial consideration. Although our method shows favorable performance in improving detection sensitivity and accuracy, there are inherent statistical uncertainties introduced by factors such as sample selection, data noise, and model parameter estimation. This uncertainty may affect the stability and reliability of the detection results. Therefore, we recommend that future research employ larger and more diverse datasets to mitigate the impact of sample variation on the results. Additionally, incorporating statistical methods such as confidence intervals and error analysis can provide a more comprehensive evaluation of model performance, thereby enhancing the credibility of the results.

By improving detection sensitivity and accuracy, we anticipate that this method can serve as an effective auxiliary tool for radiologists, subsequently increasing the efficiency of early screening for thyroid diseases and ultimately improving patient diagnosis and prognosis. Furthermore, our optimization approach is not limited to YOLO-based models but is applicable to various mature object detection models. This broad applicability suggests that the method has the potential for widespread adoption, particularly in the early screening and diagnosis of thyroid diseases. By promoting the advancement of intelligent medical technology in the field of imaging, we aim to provide more precise and efficient healthcare services to a wider patient population, further enhancing the scientific basis and timeliness of medical decision-making, thereby improving patient prognosis and quality of life.

Discussion

The proposed method in this study primarily focuses on improving nodule detection accuracy and early nodule detection rates, resulting in some sacrifices in efficiency. As such, it does not yet meet the requirements for real-time diagnostics. However, in clinical settings, radiologists typically perform better when analyzing ultrasound videos, indicating significant potential for real-time diagnostic development in video imaging. Therefore, we hope that future researchers can build on the results of this study to achieve lightweight optimizations, developing more efficient deep learning solutions that maintain detection accuracy while enhancing the system’s real-time performance. This would further promote the application of intelligent medical technology.

Currently, the use of deep learning technologies for intelligent diagnosis of thyroid ultrasound images has become a research hotspot, yielding significant progress. However, this field still faces challenges such as the limited availability of publicly accessible datasets and the lack of standardized evaluation metrics, which need to be addressed in future work. As a data-driven discipline, the quality of data directly influences the final performance of the model. The inherent presence of speckle noise and low contrast obscures important information in visual diagnosis, reducing the quality and reliability of ultrasound images. Therefore, future work should focus on the development of denoising and filtering algorithms for thyroid ultrasound images to further improve the visual interpretability of ultrasound data and ensure data quality.