Introduction

Recently, image-guided radiotherapy (IGRT) and adaptive radiotherapy (RT) have become effective methods alongside technical advancements in imaging modalities such as magnetic resonance imaging and cone-beam computed tomography (CBCT)1,2,3,4. During RT, geometrical variations of target volumes and organs at risk, weight changes (loss/gain), and tumor regression and/or progression can significantly affect the treatment outcome5. A patient's anatomy can vary over time, leading to inaccurate dose delivery (or accumulated dose calculation) when it is derived from static (stationary) treatment planning6. Adaptation through daily dose accumulation can resolve the limitations of earlier methods based on independent computed tomography (CT) scans7.

In this context, synthetic CT generation from CBCT images for adaptive RT has also become prevalent. In particular, deformable image registration (DIR) makes it possible to generate a synthetic CT from the planning CT (pCT) and the daily CBCT image. By applying the original pCT plan to this synthetic CT, the change in delivered dose relative to the original plan can be calculated, and the remaining plan can be modified to account for daily anatomical changes. Furthermore, the structure set contoured on the pCT can be propagated using the same DIR applied to the pCT and CBCT pair. However, artifacts present in both pCT and CBCT undermine the reliability of DIR8,9,10,11,12. In parallel, attempts have been made to calculate the patient dose directly on CBCT; deep learning-based dose calculation on CBCT has also been proposed and has shown great feasibility. However, when applying the Hounsfield unit (HU)-to-electron density conversion curve to CBCT, information missing due to CBCT artifacts hinders accurate dose calculation13.

Many methods exist for removing artifacts in CBCT, but most target scattering, motion, and cupping artifacts. In head and neck RT, however, strong metal artifacts significantly degrade the accuracy of dose calculation and the performance of DIR, as well as the image quality of CBCT. In contrast to the CT scanners used for pCT, to our knowledge, CBCT systems attached to linear accelerators (LINACs) have no dedicated software to reduce metal artifacts.

Existing metal artifact reduction (MAR) methods for typical dental CBCT utilize passive inpainting, prioritizing preservation of tissue over aggressive artifact reduction14,15. Recently, a deep learning method was applied to MAR in dental CBCT16. Because generative adversarial networks have shown effective results and paired data are difficult to acquire, unsupervised approaches to MAR using the cycleGAN17 have also been proposed, as have methods using the dual ___domain18, i.e., both sinogram and image domains19,20. The CBCT image data used in the aforementioned studies were acquired from dental CBCT scanners. More recently, several studies have presented deep learning-based MAR for CBCT attached to a LINAC21,22. In those studies, MV-CBCT images were used to synthesize target kV-CBCT images free of metal artifacts17,23,24.

In this study, we aim to develop a deep learning-based MAR model to reduce metal artifacts in CBCT for head and neck cancer RT. To train the proposed MAR model, we synthesized kV-CBCT images containing metallic implants, with and without metal artifacts (simulated image pairs), through a sinogram handling process. We propose a deep learning architecture that operates only in the image ___domain and focuses on both artifact removal and reconstruction of anatomical structure.

Methods

Ethics declarations

All procedures performed in this study involving human participants were in accordance with the 1964 Helsinki Declaration and its later amendments or comparable ethical standards. The experimental protocols were approved by the Seoul National University Hospital Institutional Review Board (IRB approval No. H-2111-198-1280), and written informed consent was waived due to the retrospective nature of our study.

Generation of synthetic dataset

Since pairs of metal artifact CBCT images and artifact-free CBCT slices cannot be acquired from a single patient, we synthesized the CBCT dataset following a previous work25. The metal artifact images and metal artifact-free images were selected randomly from our database, as shown in Table 1. The CBCT images were obtained between 2019 and 2020 from the C-arm CBCT attached to the VitalBeam LINAC (Varian Medical Systems/Siemens Healthineers, CA, USA) at Seoul National University Hospital.

Table 1 Description of the CBCT database for MAR model development and performance test.

The overall synthetic data generation process is illustrated in Fig. 1. Metal regions were segmented by HU thresholding; we assumed a metal HU value of 8000. An erosion filter and Gaussian blur were applied for boundary smoothing. The positions of the metals segmented from the metal artifact image were slightly adjusted to match the anatomical structure of the clean (artifact-free) image. The position-corrected metal images were then converted into synthetic metal artifact CBCT images via Radon transformation and inverse Radon transformation. During sinogram handling, various artifact strengths can be generated by multiplying sinogram values by factors in an arbitrary range between the Radon and inverse Radon transforms. The position-corrected metal images were composited into the clean images, and these composites served as target images, while the synthetic metal artifact images served as inputs to the proposed MAR model. In this way, we synthesized paired data of metal artifact-free CBCT images (i.e., ground truth) and metal artifact CBCT images. We focused the synthesized dataset on cases containing the oral cavity, since it is the most difficult region to restore owing to its complex anatomical structure.
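
As a rough illustration of this pipeline, the sketch below composites a segmented metal region into a clean slice and re-projects it with scikit-image's Radon/inverse Radon transforms. The 8000 HU metal value follows the text; the blur width, the `artifact_scale` knob, and the projection geometry are illustrative assumptions, not the study's exact settings.

```python
import numpy as np
from scipy.ndimage import binary_erosion, gaussian_filter
from skimage.transform import radon, iradon

def make_training_pair(clean_hu, metal_mask, artifact_scale=2.0):
    """Return (synthetic metal-artifact image, artifact-free target).

    clean_hu: 2-D square CBCT slice in HU; metal_mask: boolean metal region
    already position-corrected to this slice's anatomy.
    """
    # Boundary smoothing: erosion filter followed by Gaussian blur (as in the text).
    soft = gaussian_filter(binary_erosion(metal_mask).astype(float), sigma=1.0)

    # Composite the metal (assumed 8000 HU) into the clean slice -> target image.
    target = clean_hu * (1.0 - soft) + 8000.0 * soft

    # Forward-project, amplify the metal trace in the sinogram to mimic beam
    # hardening, then reconstruct so streaks are baked into the image.
    theta = np.linspace(0.0, 180.0, max(target.shape), endpoint=False)
    sino = radon(target, theta=theta, circle=False)
    metal_sino = radon(8000.0 * soft, theta=theta, circle=False)
    corrupted = sino + (artifact_scale - 1.0) * metal_sino  # artifact-strength knob
    artifact = iradon(corrupted, theta=theta, circle=False,
                      output_size=clean_hu.shape[0])
    return artifact, target
```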

Fig. 1 Overall workflow of the synthetic dataset generation for training the proposed MAR model.

Architecture of MAR models

Single-encoder models

The first model is based on the UNet26 encoder (Fig. 2(a)), which is built for artifact removal rather than realistic restoration of the artifact-corrupted region. It consists of convolution layers and max-pooling layers. Max-pooling is effective at detecting and removing the artifact region because pixel values in artifact regions are unusually high compared to tissue and bone.
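
A minimal PyTorch sketch of such a UNet-style encoder stage is shown below; the channel widths, kernel sizes, and two-convolution layout are assumptions for illustration.

```python
import torch
import torch.nn as nn

class UNetDown(nn.Module):
    """One UNet-style encoder stage: two convolutions, then max-pooling."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(out_ch, out_ch, 3, padding=1), nn.ReLU(inplace=True),
        )
        # Max-pooling keeps locally dominant responses, which helps flag the
        # unusually bright artifact pixels described in the text.
        self.pool = nn.MaxPool2d(2)

    def forward(self, x):
        skip = self.conv(x)          # passed to the decoder via a skip connection
        return self.pool(skip), skip
```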

Fig. 2 Model architectures of the four single-encoder models (see the text for detailed descriptions).

The second model is based on the FusionNet27 encoder (Fig. 2(b)), with several structural modifications for restoring anatomical structures. For better restoration, a residual connection is added around each convolution block, and max-pooling is replaced with strided convolution layers. This architecture greatly strengthens restoration at a small cost to artifact reduction, because the model tries to preserve as much input information as possible. The purpose of the second model is anatomy restoration rather than radical artifact reduction. The decoder consists of upsampling layers and convolution layers with residual connections.
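
The sketch below illustrates the corresponding FusionNet-style stage under the same assumptions: a residual connection around the convolutions and a strided convolution in place of max-pooling.

```python
import torch
import torch.nn as nn

class ResidualDown(nn.Module):
    """FusionNet-style stage: residual convolutions, strided-conv downsampling."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(out_ch, out_ch, 3, padding=1),
        )
        self.match = nn.Conv2d(in_ch, out_ch, 1)  # align channels for the residual add
        # Strided convolution replaces max-pooling, retaining more input detail.
        self.down = nn.Conv2d(out_ch, out_ch, 3, stride=2, padding=1)

    def forward(self, x):
        feat = torch.relu(self.body(x) + self.match(x))  # residual connection
        return self.down(feat), feat
```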

The third model modifies the UNet encoder by adding a residual convolutional block to the first model (denoted UNet/L, indicating a large UNet encoder) (Fig. 2(c)).

The fourth model modifies the FusionNet by adding a max-pooling layer and a spatial attention layer to the second model (denoted FusionNet/L, indicating a large FusionNet encoder) (Fig. 2(d)). The third and fourth models are designed to merge the strengths of the UNet (i.e., artifact removal) and the FusionNet (i.e., restoration of anatomical structure) into single-encoder models. Since these models combine the UNet and the FusionNet, both convolutional and attentional features are extracted and passed to the decoder through skip connections.

Dual-encoder models

Our proposed model architecture is based on a dual encoder-decoder design. The first encoder (encoder1 in Fig. 3) is a UNet adapted for artifact removal, while the second encoder (encoder2 in Fig. 3) is a FusionNet adapted for restoring the metal artifact region. The basic structures of the two encoders are the same as the UNet and FusionNet described for the first and second models, respectively.

Fig. 3 Model architectures of the three dual-encoder models (see the text for detailed descriptions).

After each encoder produces a convolutional feature and an attentional feature, the two features are mixed in a regionally appropriate way via feature fusion. The convolutional feature is produced by convolution operations in a convolution block, and the attentional feature is produced by spatial and channel-wise attention28,29 operations on the feature maps output by the convolution block.

Our purpose is to add anatomy restoration detail to the radically artifact-removed feature map. Therefore, three dual-encoder models were designed according to the feature fusion method used in the decoder. The first dual-encoder model fuses the two convolutional features by simple addition (Fig. 3(a)). The second uses one convolutional feature and one attentional feature with spatial attention (Fig. 3(b)). The third passes the convolutional feature and the spatial attention feature to the decoder through a channel-wise attention operation (denoted the channel-wise attention block in Fig. 3(c)). This design aims for high restoration performance by providing the decoder with sufficient local (where to focus) and global (what to focus on) information through attention over the information extracted from each encoder; a sketch of this fusion step is given below. The third dual-encoder model is our final proposed model (ours), with the best performance in this task.
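
The sketch below shows one plausible form of this fusion step: the artifact-removal feature is enriched with restoration detail gated by a spatial map ("where") and channel-wise weights ("what"). The exact block layout of the proposed model may differ; the names and layer sizes here are assumptions.

```python
import torch
import torch.nn as nn

class AttentionFusion(nn.Module):
    """Fuse an artifact-removal feature with a restoration feature via
    a spatial map ("where") and channel-wise gates ("what")."""
    def __init__(self, ch, reduction=8):
        super().__init__()
        self.spatial = nn.Sequential(nn.Conv2d(2, 1, 7, padding=3), nn.Sigmoid())
        self.channel = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(ch, ch // reduction, 1), nn.ReLU(inplace=True),
            nn.Conv2d(ch // reduction, ch, 1), nn.Sigmoid(),
        )

    def forward(self, f_removal, f_restore):
        # Spatial map from the channel-wise mean and max of the restoration feature.
        s = self.spatial(torch.cat([f_restore.mean(1, keepdim=True),
                                    f_restore.amax(1, keepdim=True)], dim=1))
        c = self.channel(f_restore)          # squeeze-and-excitation style gates
        # Add restoration detail to the artifact-removed feature, weighted by
        # where (s) and what (c) the restoration branch considers informative.
        return f_removal + f_restore * s * c
```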

Since we proposed and experimented with four single-encoder models and three dual-encoder models, the model structures are complicated; Table 2 therefore summarizes the characteristics of each model.

Table 2 Characteristics of single-encoder and dual-encoder models.

Loss function

Since the synthesized metal artifact dataset includes metal artifact-free CBCT images as labels, training is supervised. The loss function is a combination of mean squared error (MSE) loss, structural similarity index (SSIM) loss, and an auxiliary loss. The MSE and SSIM losses are given in Eqs. (1) and (2):

$$L_{MSE}(y,\hat{y})=\frac{1}{n}\sum_{i=1}^{n}\left(y_{i}-\hat{y}_{i}\right)^{2}$$
(1)
$$L_{SSIM}(x,y)=1-\frac{\left(2\mu_{x}\mu_{y}+C_{1}\right)\left(2\sigma_{xy}+C_{2}\right)}{\left(\mu_{x}^{2}+\mu_{y}^{2}+C_{1}\right)\left(\sigma_{x}^{2}+\sigma_{y}^{2}+C_{2}\right)}$$
(2)

The FusionNet-based encoder2 for anatomy restoration tends to over-preserve the metal artifact, so a direct MSE loss between the intermediate encoder features and the target (artifact-removed) image is added as an auxiliary loss. This auxiliary loss encourages discarding the metal artifact region while preserving the proper features. For this, we applied a convolution layer to each encoder layer's feature and directly computed the loss between the target and the convolution output.
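
A minimal sketch of this combined objective follows, using the global SSIM of Eq. (2); the loss weights, the projection convolutions, and the use of resized targets for the auxiliary terms are assumptions.

```python
import torch
import torch.nn.functional as F

def ssim_loss(x, y, c1=0.01 ** 2, c2=0.03 ** 2):
    """Global per-image SSIM loss matching Eq. (2) (windowed SSIM is also common)."""
    mu_x, mu_y = x.mean(dim=(1, 2, 3)), y.mean(dim=(1, 2, 3))
    var_x = x.var(dim=(1, 2, 3), unbiased=False)
    var_y = y.var(dim=(1, 2, 3), unbiased=False)
    cov = ((x - mu_x[:, None, None, None]) *
           (y - mu_y[:, None, None, None])).mean(dim=(1, 2, 3))
    ssim = ((2 * mu_x * mu_y + c1) * (2 * cov + c2)) / \
           ((mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2))
    return (1.0 - ssim).mean()

def total_loss(pred, target, aux_feats, aux_convs, w_ssim=1.0, w_aux=0.1):
    """MSE + SSIM on the output, plus auxiliary MSE on projected encoder2 features."""
    loss = F.mse_loss(pred, target) + w_ssim * ssim_loss(pred, target)
    for feat, conv in zip(aux_feats, aux_convs):
        proj = conv(feat)  # per-layer convolution projecting the feature to image space
        # Compare against the target resized to the feature's resolution, which
        # discourages the restoration branch from carrying artifact content.
        t = F.interpolate(target, size=proj.shape[-2:],
                          mode='bilinear', align_corners=False)
        loss = loss + w_aux * F.mse_loss(proj, t)
    return loss
```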

Experiments

The synthesized dataset comprised 29,104 training, 3,795 validation, and 7,201 test images. Training was performed for 100 epochs with a batch size of 32, and random cropping was used for data augmentation.
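
For concreteness, a hedged sketch of this training configuration follows. The crop size, optimizer, learning rate, and the `train_dataset` and `model` objects are assumptions not specified in the text; `total_loss` refers to the loss sketch above, with the auxiliary terms omitted for brevity.

```python
import torch
import torchvision.transforms as T
from torch.utils.data import DataLoader

crop = T.RandomCrop(256)  # assumed crop size
loader = DataLoader(train_dataset, batch_size=32, shuffle=True)  # hypothetical paired dataset
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)        # assumed optimizer/LR

for epoch in range(100):
    for artifact_img, clean_img in loader:                           # each (B, 1, H, W)
        stacked = crop(torch.cat([artifact_img, clean_img], dim=1))  # crop pair jointly
        x, y = stacked[:, :1], stacked[:, 1:]
        optimizer.zero_grad()
        loss = total_loss(model(x), y, aux_feats=[], aux_convs=[])   # loss sketch above
        loss.backward()
        optimizer.step()
```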

Quantitative evaluation

The following four evaluation metrics were used to quantitatively evaluate MAR performance. The MSE measures the difference in pixel values between the output and target images, as in Eq. (3); it is particularly useful for highlighting large errors.

$$MSE(\hat{y},y)=\frac{1}{n}\sum_{i=1}^{n}\left(y_{i}-\hat{y}_{i}\right)^{2}.$$
(3)

The SSIM assesses the quality of MAR by comparing luminance, contrast, and structural information between the output and target images, as in Eq. (4). This metric is effective for evaluating how well essential details are preserved during artifact reduction.

$$SSIM(x,y)=\frac{\left(2\mu_{x}\mu_{y}+C_{1}\right)\left(2\sigma_{xy}+C_{2}\right)}{\left(\mu_{x}^{2}+\mu_{y}^{2}+C_{1}\right)\left(\sigma_{x}^{2}+\sigma_{y}^{2}+C_{2}\right)}.$$
(4)

The peak signal-to-noise ratio (PSNR) evaluates MAR performance by measuring the ratio between the maximum possible signal power and the corrupting noise, as in Eq. (5). It quantifies the extent to which the output image retains the quality of the target image after artifact removal,

$$PSNR(x,y)=10\log_{10}\left(\frac{MAX_{I}^{2}}{MSE}\right),$$
(5)

and the Pearson correlation coefficient (PCC) was used to assess the linear relationship between images, as in Eq. (6):

$$PCC(x,y)=\frac{\sum_{i}\left(x_{i}-\bar{x}\right)\left(y_{i}-\bar{y}\right)}{\sqrt{\sum_{i}\left(x_{i}-\bar{x}\right)^{2}\sum_{i}\left(y_{i}-\bar{y}\right)^{2}}}.$$
(6)
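
All four metrics can be computed with off-the-shelf routines; a sketch follows, where the `data_range` value is an assumption that must match the image normalization.

```python
import numpy as np
from skimage.metrics import structural_similarity, peak_signal_noise_ratio

def evaluate(output, target, data_range=1.0):
    """Compute the four MAR evaluation metrics for a pair of 2-D images."""
    mse = float(np.mean((target - output) ** 2))                            # Eq. (3)
    ssim = structural_similarity(target, output, data_range=data_range)    # Eq. (4)
    psnr = peak_signal_noise_ratio(target, output, data_range=data_range)  # Eq. (5)
    pcc = float(np.corrcoef(target.ravel(), output.ravel())[0, 1])         # Eq. (6)
    return {"MSE": mse, "SSIM": ssim, "PSNR": psnr, "PCC": pcc}
```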

Artifact index method

To evaluate MAR performance, the artifact index30 was also calculated. The artifact index, computed from the green and red circular regions shown in Fig. 5 of the qualitative results, is given by Eq. (7):

$$\text{Artifact Index}=\sqrt{SD_{green}^{2}-SD_{red}^{2}}.$$
(7)

The standard deviation (SD) of the green circle is taken in a region with high standard deviation, indicating significant metal artifacts and image distortion. In contrast, the SD of the red circle is taken in the region with the lowest standard deviation, representing minimal artifacts in each CT image.
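
A direct transcription of Eq. (7) is sketched below; the ROI masks for the green and red circles are assumed to be given as boolean arrays.

```python
import numpy as np

def artifact_index(img, green_mask, red_mask):
    """Eq. (7): SD in the artifact-affected ROI vs. SD in the quiet reference ROI."""
    sd_green = img[green_mask].std()
    sd_red = img[red_mask].std()
    return float(np.sqrt(max(sd_green ** 2 - sd_red ** 2, 0.0)))  # clamp for safety
```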

Results

Quantitative evaluation

The quantitative evaluation was performed on our synthesized dataset, with results shown in Table 3. Our model, with a dual encoder and spatial and channel-wise attention, showed the best results among all single-encoder and dual-encoder models. Because residual connections tend to over-preserve the input, the metrics of the FusionNet are slightly lower than those of the UNet among the single-encoder models.

Table 3 Quantitative evaluation results depending on architectures.

In addition, the single-encoder models with large encoders (i.e., UNet/L and FusionNet/L) did not show effective results. Single-encoder models whose merged encoders had structures similar to the dual-encoder models still showed lower quantitative values than the dual-encoder models.

We also compared the feature fusion methods. The dual encoder with simple addition performed between the single-encoder UNet and FusionNet, and the method using spatial attention showed slightly higher numerical performance than simple addition. Adding both spatial and channel-wise attention increased performance further; this model achieved the highest numerical evaluation results among all models.

Artifact index

The artifact index improved significantly for all models compared to the input image, as shown in Fig. 4. Because of the differing characteristics of the deep learning models, the numerical differences between them were small. Models based on FusionNet with residual connection blocks exhibited a wide range of values due to their over-preserving nature, while the single UNet-based models showed a relatively narrow distribution. Our proposed model, using spatial and channel-wise attention, effectively removed artifacts, achieving the lowest mean index and the narrowest distribution.

Fig. 4 Artifact index results for the input image and for the seven models (single-encoder and dual-encoder architectures). The yellow line represents the mean value and the green triangle indicates the median value.

Qualitative evaluation

The comparison focused on areas frequently subject to adaptive RT within the oral cavity and on dental regions with frequent metal insertion. Tissue restoration performance is indicated by orange arrows, over-preservation of tissue by red arrows, and reduction of heavy streak artifacts by purple arrows. Given the high intensity of metal artifacts in the input images due to high-density implants, the UNet removed the artifact region well but mostly produced grayish, blurry tissue regions (orange arrows in Fig. 5). Unlike the UNet, the FusionNet over-preserved the input image's information (red arrows in Fig. 5) and did not remove the metal artifact effectively, although its preservation and restoration of the artifact-corrupted region were better than the UNet's. The UNet/L and FusionNet/L models showed some improvements over the UNet and FusionNet but still did not demonstrate effective results. The proposed dual-encoder architecture gained the positive effects of both encoders: it effectively removes metal artifacts while restoring tissue without blurring or creating new grayish regions (Fig. 5(a)). In particular, the tooth area indicated by the yellow arrow in Fig. 5(b) for patient 2 shows noticeably good performance.

Fig. 5 Comparison between the single-encoder models and the dual-encoder model incorporating spatial and channel-wise attention (our proposed model).

Figure 6 presents comparison results for the different feature fusion methods of the dual-encoder models. As shown by the red arrows in Fig. 6, the model that simply added convolutional features in the decoder failed to effectively remove streak artifacts from the input image. In contrast, the model using spatial attention removed streak artifacts more effectively. As indicated by the orange arrow in Fig. 6(a), spatial attention also improved tissue restoration over simple addition.

Fig. 6 Comparison between the feature fusion methods adopted for the dual-encoder models.

Our proposed model, which utilizes both spatial and channel-wise attention, demonstrated superior performance in streak artifact reduction and tissue restoration. The tissue area marked by the orange arrow was realistically restored, and the streak artifacts (red arrow in Fig. 6(a)) were effectively removed. In the severe metal artifact case in Fig. 6(b), the models with simple addition and spatial attention left substantial streak artifacts (purple arrows in Fig. 6(b)) and failed to adequately restore the tissue areas (orange arrows in Fig. 6(b)). Our proposed model, however, effectively restored the tissue area (orange arrow in Fig. 6(b)) and showed strong reduction of even severe streak artifacts (purple arrow in Fig. 6(b)).

Discussion

We developed a synthetic dataset exclusively from kV-CBCT data and used it to train our models. Our proposed model, which uses a dual encoder instead of a single encoder, exhibited superior performance in reducing metal artifacts. The dual encoder facilitates more nuanced data processing and integration: each encoder is tailored to a different aspect of the imaging data, so their contributions to artifact reduction are synergistic. One encoder focuses on identifying and removing the artifacts, while the other extracts and preserves fine anatomical detail. This specialized approach ensures a thorough analysis of both the image structure and the potential disruptions. Additionally, by employing attention mechanisms, we selectively utilized essential information, thereby improving MAR performance. This method has demonstrated the potential to enhance the quality of output images and provide benefits in clinical settings.

Despite the accuracy achievable for synthetic pCT via DIR, deforming structures in the oral cavity remains problematic. We noted geometric discrepancies between the deformed structures, attributable to the reduction of metal artifacts in CBCT. Moreover, we observed that the volumetric similarity between the deformed structures (from both the original and the MAR-enhanced CBCT) and the ground truth contour was actually lower than that between the structure on the original pCT and the ground truth contour. Although a commercial MAR technique (O-MAR, Philips Healthcare, Netherlands) was applied to the pCT, pronounced streak artifacts remained. These residual metal artifacts on the pCT could hinder the generation of accurate deformed structures.

In future research, we plan to employ deep learning-based MAR techniques on both pCT and CBCT to create synthetic CT and deformed structures. Furthermore, future studies should explore direct dose calculation on CBCT, which was not addressed in our current study, to observe potential dose differences between the original and MAR-enhanced CBCT.

This study has some limitations. First, the dataset was acquired from a single institution. The model's performance was evaluated fairly using patient data not used in training, and it is expected to perform well on data from other institutions; nevertheless, although a sufficient amount of data was used for evaluation, a larger dataset or multi-institutional data would greatly help to further improve the model. Second, although various experiments with different CNN-based architectures were conducted to find the optimal model structure, recently introduced architectures, such as diffusion-based models, were not applied. As shown in Table 3, our model demonstrated performance metrics sufficient for clinical use; however, applying the latest AI architectures in future work could further improve MAR performance.

Conclusion

As daily CBCT image acquisition has become common, the importance of CBCT images in adaptive RT has grown. Acquiring high-quality CBCT images and correcting artifacts enables more accurate synthetic planning CTs and deformed structures, and also makes accurate dose calculation on CBCT possible. In this study, we developed a dual-encoder deep learning model for MAR in CBCT for head and neck RT. Our proposed deep learning-based method dramatically improved the image quality of CBCT. Applying the proposed MAR model to CBCT can yield more accurate synthetic pCTs for head and neck patients, as well as better tissue recognition and structure delineation. For further improvements in adaptive RT for head and neck cancer patients, deep learning-based MAR should be applied to both the pCT and the CBCT.