Introduction

Polyps are common mucosal lesions of the gastrointestinal tract, occurring most frequently in the colon, stomach, and rectum. The majority of gastrointestinal polyps are benign and asymptomatic, while only the adenomatous type has the potential to progress to malignancy. Hence, polyps require early diagnosis and complete excision1,2,3,4. In clinical practice, endoscopy allows direct visualization of the gastrointestinal tract, facilitating the observation of polyps; except for some special types, polyps can also be excised under endoscopy. However, this requires well-trained clinicians, and the manual process is not only labor-intensive but also prone to variability. An efficient alternative is the development of automated polyp segmentation algorithms that assist clinicians in precisely locating and delineating polyp regions for further analysis5,6,7,8.

Polyps exhibit considerable variability in size and shape at different stages of development, and accurately segmenting them is challenging due to their similarity to surrounding mucosa. To address these challenges, several deep learning models have been implemented, showing promise in polyp segmentation. For instance, Akbari et al.9 employed a fully convolutional network (FCN) combined with Otsu thresholding to delineate prominent polyp regions. Sun et al.10 introduced an FCN-based framework that incorporates dilated convolution to capture expansive semantic features without losing resolution. UNet-based11 architectures, particularly those with encoder-decoder structures like ResUNet++12 and UNet++13, have demonstrated effective performance by merging high-level features from the decoder with low-level features from the encoder through skip connections, thereby integrating semantic and contextual information efficiently. Nevertheless, these methods often focus predominantly on segmenting the polyp’s entire area, occasionally overlooking crucial boundary details. Recent advancements have included area-boundary constraints14 and multi-task frameworks to enhance boundary delineation, significantly improving segmentation accuracy15. Furthermore, Fan et al.16 introduced a reverse attention module to leverage boundary cues for more precise polyp segmentation.

Despite notable progress in automatic polyp segmentation, significant challenges remain. Flat lesions or inadequate bowel preparation can obscure polyp margins, compromising segmentation accuracy. It is crucial to utilize boundary-specific information to improve the delineation between polyps and adjacent tissues. Image processing typically associates inner body structures with low-frequency components, while edges correlate with high-frequency components. Figure 1 illustrates the different frequency components in the red, green, and blue bands of a colonoscopy image. The high-frequency components (LH, HL and HH) contain rich edge-related information, whereas the low-frequency component (LL) captures the inner body features of a polyp. This differentiation underscores that effective models must integrate both low and high-frequency details to achieve a comprehensive understanding of both the inner consistency and the edge details of polyps.
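To make the decomposition in Fig. 1 concrete, the sketch below applies a single-level 2D DWT to each color band of a colonoscopy image using PyWavelets; the file name and the choice of the Haar wavelet are illustrative assumptions, as this section does not fix a mother wavelet.

```python
# Minimal sketch: per-band single-level 2D DWT, as visualized in Fig. 1.
# "colonoscopy.png" and the Haar wavelet are illustrative assumptions.
import numpy as np
import pywt
from PIL import Image

img = np.asarray(Image.open("colonoscopy.png"), dtype=np.float32) / 255.0

for band, name in zip(np.moveaxis(img, -1, 0), ("red", "green", "blue")):
    LL, (LH, HL, HH) = pywt.dwt2(band, "haar")  # low-frequency and three detail bands
    print(name, LL.shape, LH.shape, HL.shape, HH.shape)  # each about H/2 x W/2
```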

Fig. 1

Visualizations of different components of a given colonoscopy image, where LL, LH, HL and HH denote the low-frequency, horizontal, vertical and high-frequency components, respectively.

In summary, two primary challenges persist in polyp segmentation using CNN-based models. First, despite their commendable performance, CNNs are restricted by local receptive fields, which limit their ability to capture the comprehensive contextual information crucial for accurate inference. Although attention-enhanced networks improve performance substantially by modeling long-range dependencies, they frequently generate blurred polyp boundaries, resulting in considerable misclassification of adjacent pixels. Second, conventional learning paradigms in Euclidean space cannot effectively capture the high-frequency details that are vital for representing edge distributions.

To overcome these limitations, we introduce a Frequency Attention Model (FAM) designed to preserve edge details more effectively during feature map encoding and decoding. FAM operates by segregating feature maps into high and low-frequency components, enhancing them within and across components through Intra-Component Attention (ICA) and Cross-Component Attention (CCA), respectively. This process not only refines the features within each frequency band but also enhances inter-component interactions, significantly improving feature distinguishability. The main contributions are as follows:

  • We propose a FAM, which projects feature maps into the frequency ___domain, enhancing edge detail preservation through a dual attention mechanism. This mechanism refines both high and low-frequency components, substantially sharpening polyp edges in segmentation tasks.

  • We design the Frequency Attention-Embedded Network (FAENet) for polyp segmentation, which incorporates FAMs instead of original convolution layers to refine feature maps across various scales efficiently.

  • Through extensive testing on two challenging datasets, FAENet demonstrates superior performance over state-of-the-art models. Our results are further validated by an ablation study that highlights the advantages of FAM over convolution layers and conventional self-attention models.

Related works

Semantic segmentation of medical images

In medical imaging, segmentation is critical for delineating relevant areas within computer-aided diagnosis systems17,18. Conventional segmentation methods, such as random walk and cantilever beam approaches, have been applied successfully in medical image analysis due to their simplicity and low computational cost19,20,21. These methods rely on probabilistic boundary delineation or physical models, making them efficient and effective in scenarios with limited computational resources or simpler segmentation tasks, such as single-organ contour extraction. However, their performance is often constrained in complex scenarios involving irregular structures, subtle boundaries, or varying imaging conditions. In contrast, FAENet incorporates frequency-based attention mechanisms, which significantly enhance segmentation accuracy by refining edge details and capturing contextual information. While this comes at the cost of increased computational requirements, it positions FAENet as a more robust solution for demanding applications like polyp segmentation, where precision and adaptability are paramount.

CNN-based approaches have shown promising results, with UNet achieving widespread success. Innovations such as ResUNet++12, which integrates additional layers into UNet22, and H-DenseUNet23, which combines 2D and 3D networks for feature extraction and context aggregation, exemplify advancements aimed at refining segmentation accuracy. UNet++13 further optimizes this by redesigning skip connections to minimize the semantic gap between the encoder and decoder stages. Deep learning networks have often matched the precision of manual delineation in segmenting medical images, yet accurately defining certain critical regions, such as organ boundaries with low contrast, continues to pose challenges24,25,26,27. To address these issues, several innovative strategies have been introduced. One effective technique integrates attention mechanisms into the network architecture, enhancing the model’s focus on salient features within the medical images28. This integration aids in accentuating important areas while diminishing the impact of non-essential data, thereby boosting the accuracy of segmentation29,30. Another strategy leverages advanced imaging processing techniques to further refine the detection and delineation of subtle features in medical scans31. These approaches collectively contribute to overcoming inherent limitations in traditional imaging segmentation, particularly in areas where contrast is minimal and detail is crucial32,33.

In this study, we propose to segregate feature maps into high and low-frequency bands and apply targeted enhancements with ICA and CCA. ICA focuses on refining the details within each frequency band, while CCA facilitates interactions between the bands, thereby enhancing the overall distinguishability of features. This dual approach ensures that both the nuances within each frequency range and the synergy between them are optimized, leading to more precise feature representation.

Learning in frequency ___domain

Owing to their joint strengths in frequency and spatial representation, wavelet transforms have been integrated into deep neural networks to enhance semantic segmentation, with various approaches documented34,35,36,37,38. Common methodologies typically utilize wavelet transforms for preprocessing or post-processing tasks, or as replacements for certain CNN operations such as up- and down-sampling. Preprocessing techniques play a vital role in enhancing the suitability of input data for advanced machine learning models. Studies have demonstrated that preprocessing can improve feature extraction by balancing data distributions and reducing input noise39,40. In the context of frequency-based segmentation, preprocessing steps such as normalization and denoising could align input images more effectively with the requirements of frequency transformations, leading to more accurate attention maps and segmentation results. Despite their effectiveness, many of these methods tend to be object-specific, thus limiting their broader applicability37.

For instance, Aerial LaneNet41, a symmetric CNN architecture augmented with wavelet transform, is specifically tailored for lane-marking segmentation in aerial images. Similarly, the Constrained Wavelet Neural Network (CWNN) employs wavelet-constrained pooling layers as a substitute for traditional pooling operations, focusing on synthetic aperture radar image segmentation42. Wavelet Segmentation Network (WaveSNet)43 integrates wavelet transforms to enhance detail during down-sampling and employs inverse transforms to restore details in the up-sampling phase. In contrast, we use wavelet transform to generate low-frequency and high-frequency images as dual-branch input to extract corresponding features.

Method

Overview

As shown in Fig. 2, FAENet inherits the U-shaped network design, whose stability for medical image segmentation is well established. The architecture processes an input image through multiple convolution layers (batch normalization and ReLU are omitted from the figures for brevity), each followed by a pooling operation to reduce spatial dimensions and capture higher-level features. These features are then processed by our novel FAM, which enhances feature representation by attending to the low-frequency and high-frequency components separately. After the FAM, features from both branches are merged using concatenation followed by element-wise summation, ensuring that both detailed textural information and broader contextual details are preserved and emphasized. This combination allows the network to maintain sharp boundaries and improve segmentation accuracy, particularly in challenging areas of the image. Subsequently, the merged features undergo several convolution layers to refine the feature maps, which are then progressively up-sampled back to the original image dimensions. Each up-sampling stage is paired with a convolution operation to further refine details and maintain spatial consistency. The final output is generated by a Softmax layer that classifies each pixel into the appropriate label, producing the segmented output.
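The sketch below outlines this encoder-decoder data flow under stated assumptions: the stage widths, the three-stage depth, and the `fam` constructor argument (a placeholder for the module defined in the next section) are ours for illustration, not the exact released implementation.

```python
# Schematic sketch of the FAENet data flow in Fig. 2; stage widths, depth, and the
# `fam` placeholder are illustrative assumptions, not the authors' released code.
import torch
import torch.nn as nn
import torch.nn.functional as F

def conv_block(cin, cout):
    # BN/ReLU are included here even though the paper's figures omit them.
    return nn.Sequential(nn.Conv2d(cin, cout, 3, padding=1),
                         nn.BatchNorm2d(cout), nn.ReLU(inplace=True))

class FAENetSketch(nn.Module):
    def __init__(self, fam, widths=(64, 128, 256), num_classes=2):
        super().__init__()
        chans = [3] + list(widths)
        self.enc = nn.ModuleList(
            nn.ModuleDict({"conv": conv_block(chans[i], chans[i + 1]),
                           "fam": fam(chans[i + 1])})
            for i in range(len(widths)))
        self.dec = nn.ModuleList(
            conv_block(widths[i] + widths[i - 1], widths[i - 1])
            for i in range(len(widths) - 1, 0, -1))
        self.head = nn.Conv2d(widths[0], num_classes, 1)

    def forward(self, x):
        skips = []
        for i, stage in enumerate(self.enc):
            if i > 0:
                x = F.max_pool2d(x, 2)          # reduce spatial dimensions
            x = stage["fam"](stage["conv"](x))  # FAM refines features at each scale
            skips.append(x)
        x = skips.pop()                         # deepest-scale features
        for dec in self.dec:
            x = F.interpolate(x, scale_factor=2, mode="bilinear", align_corners=False)
            x = dec(torch.cat([x, skips.pop()], dim=1))  # merge with skip features
        return self.head(x)  # per-pixel logits; Softmax is applied at the output

# net = FAENetSketch(fam=lambda c: nn.Identity())  # swap in the real FAM per width
```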

Fig. 2

Overview of the proposed FAENet.

Frequency attention model

The Frequency Attention Model (FAM) is the core module of FAENet, designed to extract and process both low-frequency and high-frequency details in parallel. As illustrated in Fig. 3, using the Discrete Wavelet Transform (DWT), the input feature map \(F_{input}^{fa} \in {\mathbb {R}}^{H \times W \times C}\), where H, W, and C represent height, width, and number of channels, is decomposed into four components: \(F_{LL}, F_{LH}, F_{HL}, F_{HH}\). These components capture the low-frequency (contextual) and high-frequency (edge) features at half the spatial resolution \(\frac{H}{2} \times \frac{W}{2} \times C\).
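As a reference for this decomposition step, a minimal single-level Haar DWT over feature maps can be written as below; the Haar basis and the sub-band sign conventions are one common choice and may differ from the authors' exact wavelet.

```python
# Single-level orthonormal Haar DWT over a feature map; sub-band naming conventions
# vary across libraries, so this is one common assignment rather than the only one.
import torch

def haar_dwt(x: torch.Tensor):
    """x: (N, C, H, W) with even H, W -> four (N, C, H/2, W/2) sub-bands."""
    a = x[..., 0::2, 0::2]  # top-left of each 2x2 block
    b = x[..., 0::2, 1::2]  # top-right
    c = x[..., 1::2, 0::2]  # bottom-left
    d = x[..., 1::2, 1::2]  # bottom-right
    ll = (a + b + c + d) / 2  # low-frequency approximation
    lh = (a + b - c - d) / 2  # horizontal detail
    hl = (a - b + c - d) / 2  # vertical detail
    hh = (a - b - c + d) / 2  # diagonal detail
    return ll, lh, hl, hh
```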

Low-frequency features: The low-frequency component \(F_{LL}\), which contains global contextual information, is processed through a \(3 \times 3\) convolution layer, followed by element-wise summation and a \(1 \times 1\) convolution layer, resulting in the initialized low-frequency feature \(F_{lc} \in {\mathbb {R}}^{\frac{H}{2} \times \frac{W}{2} \times C}\).

High-frequency features: The high-frequency components \(F_{LH}, F_{HL}, F_{HH}\), which represent boundary and texture details, are concatenated and compressed into a single feature map using a \(1 \times 1\) convolution layer. The resulting high-frequency feature \(F_{hc} \in {\mathbb {R}}^{\frac{H}{2} \times \frac{W}{2} \times C}\) preserves critical edge details for further enhancement.
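A minimal sketch of these two initialization paths follows. The text does not specify the second operand of the element-wise summation in the low-frequency path; a residual connection with \(F_{LL}\) is our assumption here.

```python
# Sketch of the low- and high-frequency branch initializations; the residual sum
# with the LL band in the low-frequency path is an assumption, not confirmed by the text.
import torch
import torch.nn as nn

class BranchInit(nn.Module):
    def __init__(self, channels: int):
        super().__init__()
        self.low3x3 = nn.Conv2d(channels, channels, 3, padding=1)
        self.low1x1 = nn.Conv2d(channels, channels, 1)
        self.high1x1 = nn.Conv2d(3 * channels, channels, 1)  # compresses LH, HL, HH

    def forward(self, ll, lh, hl, hh):
        f_lc = self.low1x1(ll + self.low3x3(ll))             # assumed residual sum
        f_hc = self.high1x1(torch.cat([lh, hl, hh], dim=1))  # fused edge features
        return f_lc, f_hc
```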

Intra-component attention (ICA): ICA is applied to independently enhance \(F_{lc}\) and \(F_{hc}\). The process includes the following steps (a minimal sketch is given after the list):

  • Generating three feature maps \(F_{1}^{ica}, F_{2}^{ica}, F_{3}^{ica} \in {\mathbb {R}}^{H \times W \times C}\) using \(1 \times 1\) convolutions.

  • Computing position-wise similarity via attention maps:

    $$\begin{aligned} AM_{ica} = \text {Softmax}\left( \left( F_{1}^{ica}\right) ^\top \times F_{2}^{ica}\right) , \end{aligned}$$
    (1)

    where \(AM_{ica} \in {\mathbb {R}}^{(HW) \times (HW)}\).

  • Aggregating enhanced features:

    $$\begin{aligned} F_{output}^{ica} = F_{input}^{ica} + AM_{ica} \times F_{3}^{ica}, \end{aligned}$$
    (2)

    resulting in \(F_{output}^{ica} \in {\mathbb {R}}^{H \times W \times C}\) with enriched feature representation.
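Under these definitions, ICA amounts to standard non-local self-attention with a residual sum. The sketch below follows Eqs. (1)-(2); keeping full channel width in the three \(1 \times 1\) projections is our simplification for clarity.

```python
# Minimal ICA sketch following Eqs. (1)-(2): non-local self-attention plus residual.
import torch
import torch.nn as nn

class ICA(nn.Module):
    def __init__(self, channels: int):
        super().__init__()
        self.f1 = nn.Conv2d(channels, channels, 1)  # produces F1 (query)
        self.f2 = nn.Conv2d(channels, channels, 1)  # produces F2 (key)
        self.f3 = nn.Conv2d(channels, channels, 1)  # produces F3 (value)

    def forward(self, x):
        n, c, h, w = x.shape
        q = self.f1(x).flatten(2)  # (N, C, HW)
        k = self.f2(x).flatten(2)
        v = self.f3(x).flatten(2)
        am = torch.softmax(q.transpose(1, 2) @ k, dim=-1)  # (N, HW, HW), Eq. (1)
        out = v @ am.transpose(1, 2)                       # aggregate values by AM
        return x + out.view(n, c, h, w)                    # residual sum, Eq. (2)
```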

Cross-component attention (CCA): CCA refines the interactions between low-frequency (\(F_{lc}^{ica}\)) and high-frequency (\(F_{hc}^{ica}\)) features (a sketch is given after the list):

  • Attention maps are computed as:

    $$\begin{aligned} AM_{cca} = \text {Softmax}\left( \left( F_{lc}^{ica}\right) ^\top \times F_{hc}^{ica}\right) , \end{aligned}$$
    (3)

    where \(AM_{cca} \in {\mathbb {R}}^{(HW) \times (HW)}\).

  • Refined feature maps \(F_{loutput}^{cca}\) and \(F_{houtput}^{cca}\) are produced, maintaining the dimensions \(H \times W \times C\).
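The text specifies the shared attention map of Eq. (3) but not the exact read-out producing the two refined maps; the sketch below assumes each branch aggregates the other branch's features through \(AM_{cca}\) (and its transpose) with residual sums, which is one plausible realization rather than the confirmed design.

```python
# Hedged CCA sketch around Eq. (3); the cross-branch read-out is our assumption.
import torch
import torch.nn as nn

class CCA(nn.Module):
    def __init__(self, channels: int):
        super().__init__()
        self.proj_l = nn.Conv2d(channels, channels, 1)
        self.proj_h = nn.Conv2d(channels, channels, 1)

    def forward(self, f_lc, f_hc):
        n, c, h, w = f_lc.shape
        l = self.proj_l(f_lc).flatten(2)                    # (N, C, HW)
        hb = self.proj_h(f_hc).flatten(2)
        am = torch.softmax(l.transpose(1, 2) @ hb, dim=-1)  # (N, HW, HW), Eq. (3)
        # Low branch reads high-frequency values; high branch reads low-frequency ones.
        f_lout = f_lc + (hb @ am.transpose(1, 2)).view(n, c, h, w)
        f_hout = f_hc + (l @ am).view(n, c, h, w)
        return f_lout, f_hout
```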

Reconstruction: The inverse DWT (iDWT) integrates the refined low-frequency and high-frequency features back into the spatial ___domain. This process ensures minimal distortion while preserving both contextual and edge-specific details.
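For completeness, the inverse of the Haar decomposition sketched earlier interleaves the four refined sub-bands back to full resolution; reconstruction is exact when the sub-bands are left unmodified.

```python
# Inverse of the Haar DWT sketched above: recompose the four sub-bands.
import torch

def haar_idwt(ll, lh, hl, hh):
    """Four (N, C, H/2, W/2) sub-bands -> (N, C, H, W) spatial-___domain tensor."""
    n, c, h2, w2 = ll.shape
    x = ll.new_zeros(n, c, 2 * h2, 2 * w2)
    x[..., 0::2, 0::2] = (ll + lh + hl + hh) / 2  # top-left of each 2x2 block
    x[..., 0::2, 1::2] = (ll + lh - hl - hh) / 2  # top-right
    x[..., 1::2, 0::2] = (ll - lh + hl - hh) / 2  # bottom-left
    x[..., 1::2, 1::2] = (ll - lh - hl + hh) / 2  # bottom-right
    return x
```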

Summary: FAM is a modular frequency-based enhancement block that processes and enriches feature maps without altering their dimensions. By integrating both contextual and boundary information, FAM ensures robust feature representation, making it highly effective for segmentation tasks.

Fig. 3

Detailed pipelines of the Frequency Attention Model (FAM), Intra-Component Attention (ICA), and Cross-Component Attention (CCA). The Discrete Wavelet Transform (DWT) decomposes the input into frequency components (LL, LH, HL, HH). ICA enriches features within each frequency band, while CCA enhances interactions between low-frequency and high-frequency components. The refined features are integrated into the spatial ___domain using the Inverse DWT (iDWT).

Experiments and results

Datasets

We evaluated the segmentation capabilities of FAENet on polyps using the publicly available Kvasir-SEG and CVC-ClinicDB datasets, which comprise optical colonoscopy images. As listed in Table 1, each dataset is partitioned into training, validation, and test sets at a ratio of 3:1:1. The Kvasir-SEG dataset, sourced from Vestre Viken Health Trust in Norway, includes 1,000 images with resolutions ranging from 487 \(\times\) 332 to 1920 \(\times\) 1072 pixels. These images were annotated by a medical doctor and validated by an experienced gastroenterologist. The CVC-ClinicDB dataset, comprising 612 frames with a resolution of 288 \(\times\) 368 pixels, was derived from video sequences captured during standard optical colonoscopy procedures on 13 patients.
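For reproducibility, a 3:1:1 split of Kvasir-SEG's 1,000 images corresponds to 600/200/200 images. The sketch below is illustrative; the directory layout and random seed are assumptions rather than the authors' exact partition.

```python
# Illustrative 3:1:1 (60/20/20%) split; seed and paths are assumptions.
import random
from pathlib import Path

paths = sorted(Path("Kvasir-SEG/images").glob("*.jpg"))
random.Random(0).shuffle(paths)

n = len(paths)
n_train, n_val = int(0.6 * n), int(0.2 * n)
train = paths[:n_train]               # 600 images for Kvasir-SEG
val = paths[n_train:n_train + n_val]  # 200 images
test = paths[n_train + n_val:]        # 200 images
```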

Table 1 Dataset properties.

Evaluation metrics

In this study, we evaluated the performance of our predictions on the test set using four standard evaluation metrics:

$$\begin{aligned} \text {Dice}=\frac{2\times TP}{2\times TP+FP+FN},~~\text {IoU}=\frac{TP}{TP+FP+FN},~~\text {Sensitivity}=\frac{TP}{TP+FN},~~\text {Specificity}=\frac{TN}{TN+FP}, \end{aligned}$$
(4)

where TP, TN, FP, and FN denote the counts of true positives, true negatives, false positives, and false negatives, respectively. Moreover, mDice and mIoU are computed by averaging Dice and IoU over the entire test set.
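A direct implementation of Eq. (4) for binary masks is shown below; mDice and mIoU then follow by averaging the per-image scores over the test set.

```python
# Per-image metrics from Eq. (4); eps guards against empty-mask division.
import numpy as np

def metrics(pred: np.ndarray, gt: np.ndarray, eps: float = 1e-8):
    pred, gt = pred.astype(bool), gt.astype(bool)
    tp = np.sum(pred & gt)
    tn = np.sum(~pred & ~gt)
    fp = np.sum(pred & ~gt)
    fn = np.sum(~pred & gt)
    return {
        "Dice": 2 * tp / (2 * tp + fp + fn + eps),
        "IoU": tp / (tp + fp + fn + eps),
        "Sensitivity": tp / (tp + fn + eps),
        "Specificity": tn / (tn + fp + eps),
    }
```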

Implementation details

FAENet and the benchmark models were implemented on a Linux platform using PyTorch, accelerated by an NVIDIA A40 GPU. We employed data augmentation techniques such as random flipping and cropping to improve model robustness. The batch size is set to 64, with training sub-patches of size \(256\times 256\). The training regimen sets the initial learning rate at 0.02 over a maximum of 500 epochs, using the SGD optimizer with a polynomial learning-rate decay strategy and a momentum of 0.9. Softmax cross-entropy is used as the loss function. The model version with the lowest validation loss was selected for subsequent evaluations. FAENet was benchmarked against leading models such as SegNet44, UNet22, UNet++13, PraNet16, ResUNet++12, SSCNet35 and XNet45, ensuring consistent settings for all comparisons.
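These optimization settings translate into the following sketch; the polynomial decay power (0.9) is a common default that the text does not specify, and the stand-in model is a placeholder for FAENet.

```python
# Sketch of the stated training setup: SGD (lr 0.02, momentum 0.9), poly decay,
# 500 epochs, softmax cross-entropy. Decay power 0.9 is an assumed common default.
import torch

model = torch.nn.Conv2d(3, 2, 1)  # placeholder stand-in for FAENet
max_epochs, base_lr = 500, 0.02

optimizer = torch.optim.SGD(model.parameters(), lr=base_lr, momentum=0.9)
scheduler = torch.optim.lr_scheduler.LambdaLR(
    optimizer, lambda epoch: (1 - epoch / max_epochs) ** 0.9)  # poly decay
criterion = torch.nn.CrossEntropyLoss()  # softmax cross-entropy over pixel logits
# Training loop: forward, criterion, backward, optimizer.step();
# call scheduler.step() once per epoch.
```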

Comparisons with state-of-the-art methods

The results presented in Table 2 demonstrate the performance of FAENet against a variety of state-of-the-art segmentation models, including CNN-based (e.g., UNet, PraNet) and Transformer-based (e.g., Swin-Unet, MedT) methods. FAENet achieves the highest Dice coefficient (0.917 on Kvasir-SEG and 0.933 on CVC-ClinicDB) and mIoU (0.863 on Kvasir-SEG and 0.883 on CVC-ClinicDB), outperforming all baseline methods. Notably, Transformer-based methods like Swin-Unet and MedT, despite their advanced architectures, show marginally lower performance (e.g., MedT achieves 0.847 Dice on Kvasir-SEG) compared to FAENet. This highlights the effectiveness of FAENet’s frequency attention mechanisms in preserving edge details and handling diverse polyp morphologies.

Table 2 Comparisons on two datasets with other methods, where the bold text indicates the best.

A closer look at sensitivity and specificity metrics reveals FAENet’s ability to balance true positive identification with reduced false positives. On CVC-ClinicDB, FAENet achieves a sensitivity of 0.993 and specificity of 0.960, surpassing Swin-Unet (0.989 sensitivity and 0.883 specificity) and MedT (0.988 sensitivity and 0.889 specificity). These improvements underscore FAENet’s robustness in identifying challenging polyp boundaries, even in edge cases involving small or flat lesions.

As shown in Fig. 4, visual inspections on the Kvasir-SEG dataset illustrate FAENet’s ability to handle complex polyp structures with high precision. Compared to Swin-Unet and MedT, FAENet demonstrates sharper boundary delineation and fewer false positives in flat or small lesions. For example, in cases where lesions are nearly indistinguishable from surrounding mucosa, FAENet’s frequency-based attention mechanisms ensure accurate segmentation, as evident in its ability to preserve edge details without compromising sensitivity.

On the CVC-ClinicDB dataset, FAENet consistently outperforms baseline methods, including Transformer-based models, by correctly segmenting irregular and faint polyp boundaries (see Fig. 5). Visual comparisons show that FAENet produces results closest to the ground truth, with minimal boundary artifacts and higher consistency across varied lesion sizes. Particularly in edge cases involving extremely small lesions, FAENet’s sensitivity ensures that no significant regions are overlooked, providing robust segmentation even under challenging imaging conditions.

Fig. 4

Visual inspections of random samples from Kvasir-SEG test set, where input image, ground truth label and predicted results by different methods are visualized.

Fig. 5

Visual inspections of random samples from CVC-ClinicDB test set, where input image, ground truth label and predicted results by different methods are visualized.

Effects of FAM

The experimental results detailed in Table 2 provide a robust evaluation of the proposed FAENet against its two variants: FAENet-C, where the FAM is replaced with standard convolution layers, and FAENet-S, where FAM is substituted by self-attention layers. These comparisons are crucial to understanding the specific impact and advantage of incorporating the FAM into the network architecture.

FAENet demonstrates superior performance across all metrics on both the Kvasir-SEG and CVC-ClinicDB datasets, illustrating the effectiveness of the frequency attention mechanism. Notably, FAENet achieves a Dice score of 0.917 and an mIoU of 0.863 on Kvasir-SEG, significantly outperforming FAENet-C (Dice: 0.817, mIoU: 0.745) and FAENet-S (Dice: 0.832, mIoU: 0.781). This trend persists across the datasets, where FAENet consistently leads with higher sensitivity and specificity. For instance, on CVC-ClinicDB, FAENet records a specificity of 0.960, considerably higher than that of FAENet-C (0.841) and FAENet-S (0.874). These results underline the efficacy of FAENet in maintaining high precision in segmenting polyps while reducing false positives more effectively than its counterparts.

Effects of ICA and CCA

As shown in Table 2, removing ICA causes a noticeable decline in segmentation performance, with the Dice coefficient dropping from 0.917 to 0.875 on Kvasir-SEG and from 0.933 to 0.887 on CVC-ClinicDB. Similarly, the mIoU decreases by approximately 5%, highlighting the reduced capability to model detailed context within each frequency band. These results underscore that ICA effectively refines frequency-specific features, improving segmentation precision, particularly in scenarios requiring nuanced boundary delineation. Excluding CCA from FAENet results in moderate but significant performance degradation, with the Dice coefficient decreasing to 0.882 on Kvasir-SEG and 0.890 on CVC-ClinicDB. The mIoU also suffers, reflecting reduced synergy between contextual and edge-specific features. These findings highlight the importance of CCA in modeling cross-frequency correlations, which is crucial for accurately segmenting complex structures with diverse spatial and frequency characteristics.

Efficiency analysis

To evaluate the computational efficiency of FAENet, we measured the inference time per image on the Kvasir-SEG and CVC-ClinicDB datasets and compared it with state-of-the-art segmentation methods. As shown in Table 3, FAENet achieves an average inference time of 11.0 ± 0.4 ms for 256\(\times\)256 images from the Kvasir-SEG dataset and 12.9 ± 0.5 ms for 384\(\times\)288 images from the CVC-ClinicDB dataset. These times are competitive with other methods, such as UNet and PraNet, while being faster than transformer-based models like Swin-Unet (15.4 ± 0.7 ms on Kvasir-SEG) and MedT (16.2 ± 0.6 ms on Kvasir-SEG). This balance of efficiency and accuracy highlights FAENet’s suitability for real-time clinical applications, where computational performance is as critical as segmentation accuracy.
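A typical way to obtain such per-image GPU timings is sketched below; the warm-up iterations and explicit CUDA synchronization are standard measurement practice that the text does not spell out.

```python
# Sketch of per-image inference timing; warm-up and synchronization avoid
# counting CUDA launch overheads in the measured interval.
import time
import torch

@torch.no_grad()
def time_inference(model, size=(1, 3, 256, 256), warmup=10, iters=100):
    x = torch.randn(size, device="cuda")
    model.eval().cuda()
    for _ in range(warmup):
        model(x)                      # warm-up passes, not timed
    torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(iters):
        model(x)
    torch.cuda.synchronize()          # wait for all kernels before stopping the clock
    return (time.perf_counter() - start) / iters * 1000  # ms per image
```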

Table 3 Inference times (mean ± standard deviation) for each method on Kvasir-SEG (256\(\times\)256) and CVC-ClinicDB (384\(\times\)288) datasets.

Discussion and conclusion

The extensive comparison and evaluations demonstrate that the FAENet sets a new benchmark in polyp segmentation across challenging datasets. By innovatively integrating frequency-based attention mechanisms, FAENet addresses common limitations in polyp segmentation such as capturing fine-grained details and handling complex textures and shapes. The distinct use of frequency attention to refine both low and high-frequency components allows the network to perform robustly under diverse imaging conditions, making it particularly effective in scenarios where precision and reliability are paramount.

Real-time segmentation is essential for clinical applications, where computational delays can hinder decision-making. Recent advancements in parallel computation, such as heterogeneous system-on-chip architectures and FPGA-based accelerations, have shown promise in reducing computational complexity for tasks like blood flow simulation48,49,50. Incorporating similar parallelization techniques into FAENet could optimize its computational efficiency, enabling faster inference without sacrificing accuracy. Future work will explore these avenues to further enhance FAENet’s clinical utility.

FAENet, while effective, has several limitations that warrant attention. First, its computational complexity and memory requirements may hinder deployment in real-time or resource-constrained environments. Moreover, the method’s sensitivity to hyperparameters and its reliance on high-quality input images suggest the need for further optimization and preprocessing strategies to ensure consistent performance across diverse datasets and imaging modalities. Future work will focus on reducing computational demands through model optimization and exploring automated hyperparameter tuning methods. Additionally, we aim to validate FAENet on broader medical imaging tasks and diverse datasets to enhance its generalizability and clinical applicability.