Abstract
In recent years, the pyramid-based encoder-decoder architecture has become a popular solution to large-deformation image registration owing to its excellent multi-scale deformation field prediction ability. However, existing research has two main limitations: first, it over-focuses on the fusion of multi-layer deformation sub-fields along the decoding path while ignoring the impact of the feature encoder on network performance; second, it lacks designs tailored to the characteristics of feature maps at different scales. To this end, we propose an innovative pyramid network that hybridizes the Transformer with convolutional iterative optimization for large-deformation brain image registration. Specifically, four encoder variants are designed to study how different structures affect the performance of the pyramid registration network. In addition, the Swin-Transformer module is combined with a convolutional iterative strategy, and each decoder layer is carefully designed according to the semantic characteristics of the feature maps at that layer. Extensive experiments on three public brain magnetic resonance imaging datasets show that our method achieves the highest registration accuracy compared with nine state-of-the-art registration methods, which fully verifies the effectiveness and application potential of our model design.
Introduction
Deformable medical image registration technology1 plays a crucial role in medical image segmentation and fusion2,3, image-guided surgical navigation4,5, tumor growth monitoring, and the evaluation of treatment effects6. Given a pair of images to be registered, referred to as the moving image \(\:{I}_{m}\) and the fixed image \(\:{I}_{f}\), this technology primarily predicts a dense deformation field ∅ by learning the nonlinear spatial correspondence between the two images. Subsequently, \(\:{I}_{m}\) is interpolated according to ∅ to achieve deformation, so that the warped image \(\:{I}_{w}\) and the fixed image \(\:{I}_{f}\) attain consistency in both spatial and anatomical structure. Notably, the images to be registered may originate from different acquisition devices, time points, or modalities. By performing this registration, clinicians are able to conduct more comprehensive and accurate diagnoses and treatments.
Traditional registration methods7,8,9,10,11,12,13 determine optimal transformation parameters through continuous iterations, which can be very time-consuming. Each time a new pair of images is input, the registration process must be repeated from scratch. This approach is particularly inadequate for the volume data generated in clinical applications, as it does not meet the demand for real-time registration. In contrast, registration methods based on deep learning can train a registration network using the entire training dataset. Once trained, the model can complete the registration of a pair of images in a matter of seconds or even less, demonstrating significant potential for clinical applications14. Based on the training methodology, these methods can be primarily categorized into supervised and unsupervised approaches. However, registration methods based on supervised learning often face limitations in network performance due to their reliance on gold standards. The Spatial Transformer Network (STN)15 enables back-propagation of the network's learning error through the warping operation, which has made unsupervised learning the mainstream approach to registration.
Unsupervised registration methods primarily rely on Convolutional Neural Networks (CNNs). The widely used VoxelMorph16 framework employs a network architecture akin to U-Net17 to frame the deformable image registration challenge as an optimization problem over function parameters \(\:{\uptheta\:}\). Specifically, this can be expressed as \(\:{F}_{\theta\:}\left({I}_{m},{I}_{f}\right)={\varnothing}\). The framework trains an end-to-end CNN to directly predict and output a dense deformation field ∅, which is subsequently utilized by the Spatial Transformer Network (STN) to warp the moving image \(\:{I}_{m}\), resulting in the warped image \(\:{I}_{w}\). During training, the model continuously updates the network parameters by calculating the dissimilarity between \(\:{I}_{w}\) and \(\:{I}_{f}\) to learn the optimal deformation field ∅. Subsequent scholars have made improvements based on VoxelMorph18,19,20. However, these methods only predict the deformation field at the highest-resolution layer of the decoder network and struggle to address large deformations present in the image. When medical images are affected by lesions, their tissue structures undergo significant deformation. For instance, patients with Alzheimer’s disease exhibit pronounced brain atrophy21. Consequently, the challenge of large deformation image registration has become a focal point of research in this field.
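To make this unsupervised formulation concrete, the following is a minimal PyTorch sketch of one training step; the module names reg_net and spatial_transformer, the use of MSE as the dissimilarity term, and the loss weight gamma are illustrative assumptions rather than the exact VoxelMorph implementation.

import torch
import torch.nn.functional as F

def train_step(reg_net, spatial_transformer, optimizer, I_m, I_f, gamma=1.0):
    # Predict a dense displacement field of shape (B, 3, H, W, L) from the concatenated image pair.
    phi = reg_net(torch.cat([I_m, I_f], dim=1))
    # Differentiable warping (STN) of the moving image.
    I_w = spatial_transformer(I_m, phi)
    # Dissimilarity between warped and fixed images (MSE here; this work uses local NCC).
    sim_loss = F.mse_loss(I_w, I_f)
    # Diffusion-style smoothness penalty via finite differences of the displacement field.
    smooth_loss = (phi[:, :, 1:] - phi[:, :, :-1]).pow(2).mean() \
                + (phi[:, :, :, 1:] - phi[:, :, :, :-1]).pow(2).mean() \
                + (phi[:, :, :, :, 1:] - phi[:, :, :, :, :-1]).pow(2).mean()
    loss = sim_loss + gamma * smooth_loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()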
At present, the decomposition idea is mainly used to solve the large deformation image registration problem, which can be specifically divided into cascade decomposition method and pyramid decomposition method. The cascade decomposition method breaks down the large deformation registration problem into multiple smaller deformation registration tasks with identical components by cascading several sub-networks of the same architecture22,23,24. Each subsequent sub-network refines and enhances the predictions made by the preceding one, thereby progressively constructing the deformation field until the final optimized deformation field is achieved. However, this approach typically incurs high computational costs, making it difficult to implement clinically. In contrast, the pyramid decomposition method utilizes a multi-scale feature map input strategy to predict the deformation field from coarse to fine25,26,27,28,29. This method decomposes the large deformation registration issue into multiple smaller deformation registration problems with varying components. The low-resolution layer is adept at capturing global deformations, while the high-resolution layer focuses on local deformations. Its multi-scale input approach significantly alleviates the training burden on the network.
Specifically, PiViT27 adopts a pyramid iterative composite structure to solve the large deformation registration problem. By using the Swin-Transformer30 in the low-scale decoding layers for small-scale iterative registration, it can not only capture the global deformation in the feature map but also reduce the computational burden of the network. However, this method has low registration accuracy and does not fully design the dual-stream feature encoder. NICE-Trans26 uses non-iterative pyramid decoding to predict and fuse multi-layer deformation fields, achieving better registration performance than PiViT, but it only performs deformation decomposition once in each decoding layer and does not consider the differences among the feature maps of different decoding layers. ModeTv229 also uses a non-iterative pyramid decoding structure to predict and fuse multi-layer deformation fields. Different from NICE-Trans, this method uses a multi-head neighborhood cross-attention mechanism to decompose multiple motion modes in the feature map to achieve a more sophisticated registration process, and uses CUDA extensions to improve the training and inference speed of the model. Additionally, many existing pyramid-based registration methods place excessive emphasis on the multi-layer deformation field fusion operations within the decoding path, neglecting the design of hierarchical feature encoders. Some researchers have recognized this issue and have incorporated channel attention mechanisms31 or residual modules32 into the encoder architecture; however, no study has conducted a comprehensive comparative analysis of these various encoder designs.
To this end, we propose a novel pyramid network that integrates the Transformer33 with convolutional iterative optimization, making full use of the adaptive global modeling capability of the Transformer and the efficient local feature extraction capability of the CNN pyramid structure to better solve the large deformation registration problem in brain Magnetic Resonance (MR) images. The motivation for our approach is that it is imperative to design the layer-wise feature encoders appropriately and to carefully design each layer of the decoder based on the differences in the layer-wise feature maps. First, we introduce an enhanced dual-branch encoder and develop three additional variant encoders for a focused and in-depth comparative analysis. Second, we incorporate convolutional iterative operations into various pyramid decoding layers to facilitate deformation decomposition within each resolution layer. In comparison to PiViT27, we use deconvolution modules for iterative operations and study the number of iterations; in contrast to NICE-Trans26, our pyramid decoding structure takes into account the differences between different feature maps. Compared with ModeTv229, we do not use a cross-attention mechanism for deformation decomposition, but use a simpler convolution iteration strategy to achieve this goal. We extensively verify the proposed method on three public brain MR imaging datasets. The experimental results show that the proposed method significantly outperforms multiple existing advanced registration algorithms in terms of registration accuracy and robustness. Specifically, the main contributions of this study can be summarized as follows:
(1) Based on the pyramid decomposition approach to large deformations, we introduce the concept of convolutional iteration. According to the semantic characteristics of the feature maps at each layer of the decoding path, we meticulously design a pyramid network that hybridizes the Transformer with convolutional iteration to solve the challenge of large-deformation brain MR image registration.
(2) We designed an enhanced dual-branch encoder to fully extract the hierarchical features of the image. Additional downsampling of the original input image provides prior information to compensate for the loss of semantic spatial information caused by successive hierarchical downsampling operations. Furthermore, we designed three encoder variants: a pure CNN structure, a residual convolution structure, and a combination of convolution and squeeze-and-excitation modules. This allows for a more effective comparison and analysis of their respective advantages, disadvantages, and impacts on model performance.
(3) We conducted a large number of quantitative and qualitative analysis experiments on three public brain MRI datasets. Compared with other advanced registration methods, our method has better registration performance, which fully demonstrates the effectiveness of our network architecture design and its potential for clinical application.
Related work
Traditional iterative registration method
Traditional medical image registration methods primarily rely on mathematical models and can be categorized into feature-based and intensity-based approaches. Feature-based methods involve extracting key feature points from images to guide the registration process. This approach exhibits strong robustness to variations in image quality, with the effectiveness largely dependent on the design of the feature extraction algorithm. Notable feature extraction algorithms include the Scale-Invariant Feature Transform (SIFT)34 and Speeded-Up Robust Features (SURF)35. In contrast, intensity-based methods leverage the grayscale information of the images to minimize the differences between fixed and moving images through optimization algorithms. Common registration models in this category include elastic mapping7, B-spline-based methods8, statistical parameter transformation36, and optical flow methods9. However, these methods tend to have high computational complexity, and their registration accuracy can be adversely affected by noise and intensity variations present in the images. Overall, traditional registration methods can be viewed as iterative optimization processes grounded in mathematical models. While the results of these methods are often highly interpretable and generalizable, their registration speed is relatively slow, which makes it difficult to meet the practical requirements of clinical scenarios.
Single-scale registration method
At present, registration methods utilizing deep learning predominantly rely on unsupervised techniques. The encoder-decoder architecture based on CNN is a widely adopted framework. This approach directly predicts a dense deformation field through end-to-end training16,37. However, the performance of registration networks is limited by the receptive field of the convolutional kernels. To address this limitation, some researchers have incorporated attention modules20 or residual modules38 within the encoder, decoder, or skip connection components of the network, thereby enhancing the network’s receptive field and improving registration accuracy. Subsequently, as the Vision Transformer showed great potential in the field of medical image processing, Chen et al. first proposed the ViT-V-Net39 model, which mixes CNN and Transformer, to verify the effectiveness of the Transformer in solving 3D medical image registration tasks. To reduce the training cost of the network, Chen et al. used the Swin-Transformer to build the encoder of the network and proposed the classic TransMorph40 model. Since then, hybrid network models based on CNN and Transformer have attracted the research interest of a number of scholars, leading to several related studies41,42.
Whether based on U-Net or on a hybrid of CNN and Transformer, these registration models all directly concatenate the moving image and the fixed image and feed them into the encoder, and the decoder is forced to predict a dense deformation field only at its last layer; the registration process is therefore implicit and its interpretability is relatively low. Registration networks based on the cross-attention mechanism are consequently considered a promising alternative, since they explicitly learn the spatial correspondence between feature image pairs. This approach utilizes the query-key matching idea of the cross-attention mechanism: Q is generated from one image, K and V from the other, and a dot-product operation then makes the registration process rational and natural43,44,45. However, similar to the U-Net-based registration networks, this method still only predicts the deformation field in the last layer of the decoding path, and is therefore unable to cope with registration between medical images with large deformations.
Multi-scale registration method
Cascade and pyramid-based registration methods have been proposed for multi-scale learning of deformation fields. They decompose the large deformation registration problem into multiple small deformation registration problems. Although both methods can solve the large deformation registration problem to a certain extent, the cascade method brings high computational cost, while the pyramid method estimates the deformation field only once in each decoder layer, leaving considerable room for improvement. Therefore, the cascade pyramid network structure32,46,47 was proposed to combine the advantages of these two architectures. However, these works all focus on the design of the decoding path, while the design of the feature encoder remains overly simple, mainly in order to balance the computational overhead of the encoder and decoder. We believe that in a pyramid registration network, both the design of the hierarchical feature extraction encoder and the reasonable fusion of the multi-layer deformation fields along the decoding path are very important. Therefore, it is worth studying how to improve the feature extraction ability of the hierarchical feature encoder and reduce the redundant calculation in the multi-scale decoder. To this end, we propose a new enhanced feature encoder. At the same time, we introduce the idea of convolution iteration to supplement the decomposition of the deformation motion in each layer of the decoder. We also design different modules for low-resolution and high-resolution semantic features to minimize the computational overhead of the network.
Methods
Network overview
The registration framework we propose consists primarily of an enhanced dual-branch feature encoder (shown in Fig. 1 (d)) and a multi-scale pyramid iterative decoder (shown in Fig. 2). The dual-branch feature encoders share weight parameters and are designed to extract multi-scale features from the moving image \(\:{I}_{m}\) and the fixed image \(\:{I}_{f}\), thereby generating a set of feature maps. Further details will be provided in the Enhanced Pyramid Feature Encoder section. The multi-scale pyramid iterative decoder comprises a multi-scale registration path and a convolutional iterative path. The multi-scale registration path can address large deformation registration challenges, while the convolutional iterative path simulates local complex motion in moving images with smaller steps, thus facilitating more refined registration. The specific implementation process will be outlined in the Multi-scale Pyramid Iterative Decoder section.
Network structure diagram of the four hierarchical feature encoders designed. Among them, \(\:{I}_{m}\) represents the moving image and \(\:{I}_{f}\) represents the fixed image. Conv3d-ic represents the convolution unit, where i represents a multiplier and c represents the number of convolution channels; in our experiments, the initial channel number c is set to 16. Avgpool(j) represents the average pooling operation, where j represents the size of the pooling kernel. ResBlock(kc) represents the residual convolution module, where k represents the coefficient of channel c. SEblock represents the squeeze-and-excitation module.
Enhanced pyramid feature encoder
We propose a 4-layer enhanced pyramid feature encoder, as shown in (d) of Fig. 1, denoted as ExtraPrior_ConvBlock. Specifically, we first input the moving image \(\:{I}_{m}\in{\mathbb{R}}^{h\times\:w\times\:l}\) and the fixed image \(\:{I}_{f}\in{\mathbb{R}}^{h\times\:w\times\:l}\) into the 4-layer feature encoder respectively, where \(\:\text{h},\text{w},\text{l}\) represent the length, width, and height of the input image. In the first layer (i.e., L = 1), \(\:{I}_{m}/{I}_{f}\) only passes through the convolution unit to obtain the initial hierarchical features \(\:{M}_{1}/{F}_{1}\in{\mathbb{R}}^{c\times\:h\times\:w\times\:l}\). The convolution unit is shown in the blue box in Fig. 1(d) and consists of two convolution blocks; each convolution block contains a 3 × 3 × 3 convolution layer, followed by a LeakyReLU layer with an activation parameter of 0.2 and an instance normalization layer. In the last three layers (i.e., L = 2, 3, 4), \(\:{M}_{L-1}/{F}_{L-1}\) is first downsampled by a 2× average pooling operation, halving the resolution, while the number of channels c of the feature map is doubled; at the same time, \(\:{I}_{m}/{I}_{f}\) is directly downsampled by an average pooling operation to obtain the prior feature information of the corresponding layer. Then, the feature maps of the same level obtained by downsampling \(\:{M}_{L-1}/{F}_{L-1}\) and by directly downsampling \(\:{I}_{m}/{I}_{f}\) are concatenated in the channel dimension and fed into a convolution module to fully extract image features. The additional prior feature information can make up for the loss of semantic information caused by downsampling only the previous-layer feature map \(\:{M}_{L-1}/{F}_{L-1}\). Finally, the corresponding hierarchical features \(\:{M}_{2}/{F}_{2}\), \(\:{M}_{3}/{F}_{3}\) and \(\:{M}_{4}/{F}_{4}\) are output.
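As a rough PyTorch sketch of one ExtraPrior_ConvBlock stage (class names, channel handling, and the single-channel image prior are our own assumptions, not the exact implementation), the fusion of pooled previous-layer features with the directly pooled original image can be written as:

import torch
import torch.nn as nn

class ConvUnit(nn.Sequential):
    # Two 3x3x3 convolution blocks, each followed by LeakyReLU(0.2) and instance normalization.
    def __init__(self, in_ch, out_ch):
        super().__init__(
            nn.Conv3d(in_ch, out_ch, 3, padding=1), nn.LeakyReLU(0.2), nn.InstanceNorm3d(out_ch),
            nn.Conv3d(out_ch, out_ch, 3, padding=1), nn.LeakyReLU(0.2), nn.InstanceNorm3d(out_ch),
        )

class ExtraPriorStage(nn.Module):
    # Encoder layer L (1 < L < 5): pool previous features, pool the raw image as a prior, fuse them.
    def __init__(self, prev_ch, out_ch):
        super().__init__()
        self.pool_feat = nn.AvgPool3d(2)            # downsample the previous-layer features
        self.conv = ConvUnit(prev_ch + 1, out_ch)   # +1 channel for the pooled raw image prior

    def forward(self, prev_feat, image, level):
        feat_down = self.pool_feat(prev_feat)                        # halve the resolution of M_{L-1}/F_{L-1}
        prior = nn.functional.avg_pool3d(image, 2 ** (level - 1))    # pool I_m/I_f to the same resolution
        return self.conv(torch.cat([feat_down, prior], dim=1))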
In order to more fully study the impact of feature encoders on the performance of pyramid networks, we designed three encoder variants, as shown in Fig. 1, namely: (a) Pure_ConvBlock, (b) Res_ConvBlock, and (c) ExtraPrior + SE_ConvBlock. Specifically, encoder (a) uses the pure CNN continuous downsampling structure adopted by most pyramid registration networks; encoder (b) adds a residual convolution module to (a), which specifically includes a 3D convolution module and an activation layer; encoder (c) is similar to the encoder structure we use, and both add prior feature information from the original input image to the last three layers. The difference is that (c) uses a squeeze-and-excitation (SE) module to dynamically recalibrate the weights of the concatenated feature maps. We conduct an in-depth comparative analysis of these four encoder structures in the Comparative analysis of four variant encoders section.
Multi-scale pyramid iterative decoder
Corresponding to the four-layer encoder structure, our decoder also employs a four-layer architecture (L = 4, 3, 2, 1) to facilitate a coarse-to-fine registration process. Each layer of the decoder is meticulously designed using a multi-scale pyramid strategy alongside a convolutional iteration strategy. Initially, since low-scale feature maps typically encompass a significant amount of global semantic information, we utilize four consecutive Swin-Transformer modules at the low-scale layers (L = 4, 3) to explicitly model the global spatial correlation between pairs of feature maps. In the high-scale layers (L = 2, 1), given the high image resolution and the relatively few motion patterns present in the feature maps, we use lightweight deconvolution modules (called DeconvBlock) in the last two layers to extract local complex spatial relationships in the feature maps. Detailed descriptions of the Swin-Transformer and DeconvBlock modules are provided in the Decoder submodule section. Furthermore, each layer of the decoder learns the correspondence between hierarchical feature pairs through a single path, which may overlook displacement changes among local small structures. To mitigate this issue, we incorporate an additional round of convolutional iteration in the first three layers of the decoder (L = 4, 3, 2) to simulate displacement movements between hierarchical feature maps in a more refined manner.
Specifically, the entire framework of the decoder is shown in Fig. 2. When L = 4, we first concatenate the coarsest-resolution hierarchical features \(\:{M}_{4}\) and \(\:{F}_{4}\) as the initial input (the concatenation result is denoted as \(\:{C}_{4}\)), which is then fed into a Swin-Transformer module to generate an initial attention feature map \({Z}_{42}\); \(\:{Z}_{42}\) is then directly sent to a 3D convolutional registration head to obtain the initial residual deformation field \(\:{\phi\:}_{4}\). After that, we utilize the spatial transformer network (STN) to deform \(\:{M}_{4}\) according to \(\:{\phi\:}_{4}\) to obtain the warped image \(\:{W}_{4}\), and then continue to concatenate \(\:{W}_{4}\) and \(\:{F}_{4}\) (the concatenation result is denoted as \(\:{C}_{41}\)) and send the result to the DeconvBlock module and the 3D convolutional registration head for an iterative registration that generates the residual deformation field \(\:{\phi\:}_{41}\). The initial deformation field \(\:{{\varnothing}}_{4}\) is fused from \(\:{\phi\:}_{4}\) and \(\:{\phi\:}_{41}\). The entire process of the fourth decoding layer can be explained by formula group (1):
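A compact way to express this fourth-layer process is given below; here Fuse(·) denotes the fusion of the two residual sub-fields (typically implemented as a composition through the STN), and the operator names are our own notation.

\[
\begin{aligned}
Z_{42} &= \mathrm{SwinT}\big(\mathrm{Concat}(M_{4},F_{4})\big), & \phi_{4} &= \mathrm{RegHead}(Z_{42}),\\
W_{4} &= \mathrm{STN}(M_{4},\phi_{4}), & \phi_{41} &= \mathrm{RegHead}\big(\mathrm{DeconvBlock}(\mathrm{Concat}(W_{4},F_{4}))\big),\\
\varnothing_{4} &= \mathrm{Fuse}(\phi_{4},\phi_{41}). &&
\end{aligned}
\]

Formula groups (2)–(4) referenced in the following paragraphs follow the same pattern at the corresponding layers.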
When L = 3, we upsample the attention feature map \(\:{Z}_{42}\) and the deformation field \(\:{{\varnothing}}_{4}\) generated by the previous layer to obtain \(\:{Z}_{31}\) and \(\:{{\varnothing}}_{4\_up}\) respectively, and then use the STN and \(\:{{\varnothing}}_{4\_up}\) to deform \(\:{M}_{3}\) to obtain the warped image \(\:{W}_{3}\). Then, \(\:{W}_{3},{F}_{3}\) and \(\:{Z}_{31}\) are concatenated together (the result of the concatenation is denoted as \(\:{C}_{3}\)) and sent to the Swin-Transformer module to generate the attention feature map \(\:{Z}_{32}\) of this layer, and \(\:{Z}_{32}\) is then sent to a 3D registration head to obtain the residual deformation field \(\:{\varphi\:}_{3}\) of this layer. In addition, as in the fourth layer, we use \(\:{\varphi\:}_{3}\) to continue to deform \(\:{M}_{3}\) to obtain the warped image \(\:{W}_{31}\), and then concatenate \(\:{W}_{31}\) and \(\:{F}_{3}\) (the result of the concatenation is denoted as \(\:{C}_{31}\)) and send them to the DeconvBlock module and the 3D convolution registration head for an iterative registration that generates the residual deformation field \(\:{\varphi}_{31}\). The deformation field \(\:{{\varnothing}}_{3}\:\) of the third-layer decoder is the fusion of \(\:{\varphi\:}_{3}\) and \(\:{\varphi\:}_{31}\). The whole process can be explained by formula group (2):
When L = 2, we again upsample the attention feature map \(\:{Z}_{32}\) and the deformation field \(\:{{\varnothing}}_{3}\:\) generated by the previous layer to obtain \(\:{Z}_{21}\) and \(\:{{\varnothing}}_{3\_up}\) respectively, and then use the STN and \(\:{{\varnothing}}_{3\_up}\) to deform \(\:{M}_{2}\:\) to obtain the warped image \(\:{W}_{2}\). Then, \(\:{W}_{2}\), \(\:{F}_{2}\), and \(\:{Z}_{21}\) are concatenated together (the concatenated result is denoted as \(\:{C}_{2}\)) and sent to the DeconvBlock module to generate the convolutional feature map \(\:{Z}_{22}\) of this layer. \(\:{Z}_{22}\) is sent to a 3D registration head to obtain the residual deformation field \(\:{\varphi\:}_{2}\) of this layer; at the same time, as in the third layer, we use \(\:{\varphi\:}_{2}\) to continue to deform \(\:{M}_{2}\:\) to obtain the warped image \(\:{W}_{21}\), and then concatenate \(\:{W}_{21}\) and \(\:{F}_{2}\) (the concatenation result is denoted as \(\:{C}_{21}\)) and send the result to the second DeconvBlock module and the 3D convolution registration head for an iterative registration that generates the residual deformation field \(\:{\varphi}_{21}\). The deformation field \(\:{{\varnothing}}_{2}\:\) of the second-layer decoder is fused from \(\:{\varphi\:}_{2}\) and \(\:{\varphi\:}_{21}\). The whole process can be explained by formula group (3):
When L = 1, which is the last layer of the decoder, we no longer add an extra round of convolutional iteration, but directly predict and generate the final deformation field \(\:{{\varnothing}}_{1}\). The specific approach is similar to the non-iterative path of the second layer. First, the attention feature map \(\:{Z}_{22}\) and the deformation field \(\:{{\varnothing}}_{2}\) generated by the previous layer are upsampled to obtain \(\:{Z}_{11}\) and \(\:{{\varnothing}}_{2\_up}\) respectively, and then the STN and \(\:{{\varnothing}}_{2\_up}\) are used to deform \(\:{M}_{1}\) to obtain the warped image \(\:{W}_{1}\). Then \(\:{W}_{1}\), \(\:{F}_{1}\), and \(\:{Z}_{11}\) are concatenated together (the concatenated result is denoted as \(\:{C}_{1}\)) and sent to the DeconvBlock module and the 3D registration head in sequence to directly obtain the final deformation field \(\:{{\varnothing}}_{1}\) of this layer. The entire process of the decoder of this layer can be explained by formula group (4):
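To summarize the layer-wise procedure in code form, the following PyTorch sketch outlines one decoding layer with a single convolutional iteration; the module names, the additive field fusion, and the upsampling details are simplifying assumptions for illustration only.

import torch
import torch.nn as nn
import torch.nn.functional as F

class IterativeDecoderLayer(nn.Module):
    # One decoding layer: warp with the upsampled coarse field, predict a residual field,
    # then run one convolutional iteration and fuse the two residual sub-fields.
    def __init__(self, block, iter_block, reg_head, iter_reg_head, stn):
        super().__init__()
        self.block = block                  # Swin-Transformer module (L = 4, 3) or DeconvBlock (L = 2)
        self.iter_block = iter_block        # DeconvBlock used on the iterative path
        self.reg_head = reg_head            # 3D convolutional registration head for the main path
        self.iter_reg_head = iter_reg_head  # registration head for the iterative path
        self.stn = stn                      # spatial transformer network (differentiable warping)

    def forward(self, M, F_fix, Z_prev=None, field_prev=None):
        if field_prev is not None:
            # Upsample the coarser field (scaling the displacements) and the previous feature map.
            field_prev = 2.0 * F.interpolate(field_prev, scale_factor=2, mode='trilinear')
            Z_prev = F.interpolate(Z_prev, scale_factor=2, mode='trilinear')
            M = self.stn(M, field_prev)
        feats = [M, F_fix] if Z_prev is None else [M, F_fix, Z_prev]
        Z = self.block(torch.cat(feats, dim=1))
        phi = self.reg_head(Z)                                    # residual field of this layer
        W_iter = self.stn(M, phi)                                 # extra convolutional iteration
        phi_iter = self.iter_reg_head(self.iter_block(torch.cat([W_iter, F_fix], dim=1)))
        field = phi + phi_iter                                    # fuse the two residual sub-fields
        if field_prev is not None:
            field = field + field_prev                            # accumulate with the coarse field
        return Z, field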
Decoder submodule
In this section, we explain in detail the Swin-Transformer module and the DeconvBlock module mentioned in the previous section. As shown in (a) of Fig. 3, the concatenated feature map \(\:{C}_{L}\) (L = 4, 3) first passes through a 3D convolution layer to adjust the channel dimension, and is then sent to the standard Swin-Transformer module to calculate the spatial correlation between feature maps. The standard Swin-Transformer module mainly includes layer normalization (LN), the window/shifted-window multi-head self-attention mechanism (W-MSA/SW-MSA), a multi-layer perceptron (MLP) and residual connections. Then, on the one hand, the calculated initial attention feature map is sent to a 3D registration head to obtain the residual deformation field \(\:{\phi\:}_{L}\) (L = 4, 3) of the layer; on the other hand, it is input into a DeconvBlock module to obtain the final attention feature map \(\:{Z}_{ij}\). As shown in (b) of Fig. 3, the DeconvBlock module mainly includes a convolution layer, an instance normalization layer and a LeakyReLU activation function.
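A minimal PyTorch sketch of the DeconvBlock as described above (the kernel size and activation slope are our assumptions) is:

import torch.nn as nn

class DeconvBlock(nn.Sequential):
    # Lightweight decoder block: convolution, instance normalization, LeakyReLU.
    def __init__(self, in_ch, out_ch):
        super().__init__(
            nn.Conv3d(in_ch, out_ch, kernel_size=3, padding=1),
            nn.InstanceNorm3d(out_ch),
            nn.LeakyReLU(0.2),
        )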
Loss function
Depending on the application scenario of the registration task, the choice of similarity metric also differs. For unsupervised single-modality registration tasks, the negative local normalized cross-correlation (NCC) and the mean squared error (MSE) are popular choices16,40; for complex multi-modality registration tasks, mutual information (MI)48, the local cross-correlation of multi-contrast registration5, the modality-independent neighborhood descriptor (MIND)49, and other methods have been proposed. In this work, we mainly focus on the large deformation registration problem in single-modality brain MR, and the total loss function \(\:{\mathcal{L}}_{total}\) can be expressed as formula (5):
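Consistent with the term definitions below, this objective takes the standard form:

\[
\mathcal{L}_{total}=\mathcal{L}_{sim}\left(I_{f},\,I_{m}\circ\phi_{1}\right)+\gamma\,\mathcal{L}_{smooth}\left(\phi_{1}\right)
\]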
Here, \(\:{\mathcal{L}}_{sim}\) represents the similarity loss function, which is used to penalize the appearance dissimilarity between the warped image \(\:{I}_{m}\circ\:{\phi\:}_{1}\) and the fixed image \(\:{I}_{f}\); \(\:{\mathcal{L}}_{smooth}\) denotes the smoothing loss term, which primarily regularizes the gradient of the final deformation field \(\:{\phi\:}_{1}\:\) to make it smooth; and \(\:{\upgamma\:}\) represents the regularization parameter.
For the similarity metric loss \(\:{\mathcal{L}}_{sim}\), we adopt the widely used local normalized cross-correlation40; the calculation is presented in formula (6):
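In its standard form, consistent with the term definitions below, the negative local NCC reads:

\[
\mathcal{L}_{sim}=-\sum_{p\in\Omega}\frac{\left(\sum_{p_{i}}\big(I_{f}(p_{i})-\overline{I_{f}}(p)\big)\big((I_{m}\circ\phi_{1})(p_{i})-\overline{I_{m}\circ\phi_{1}}(p)\big)\right)^{2}}{\left(\sum_{p_{i}}\big(I_{f}(p_{i})-\overline{I_{f}}(p)\big)^{2}\right)\left(\sum_{p_{i}}\big((I_{m}\circ\phi_{1})(p_{i})-\overline{I_{m}\circ\phi_{1}}(p)\big)^{2}\right)}
\]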
Here, Ω represents the entire three-dimensional spatial ___domain, and \(\:{p}_{i}\) represents a voxel in the neighborhood of the local-window center voxel \(\:\text{p}\). In our experiments, the local window size \(\:\text{n}\) is set to 9, which means the neighborhood encompasses a range of \(\:{n}^{3}\). The terms \(\:\overline{{I}_{f}}\left(p\right)\) and \(\:\overline{{I}_{m}\circ\:{{\varnothing}}_{1}}\left(p\right)\) represent the average voxel values within the local window at voxel p in the fixed image \(\:{I}_{f}\) and the warped image \(\:{I}_{m}\circ\:{\phi\:}_{1}\), respectively.
For the smoothing loss term \(\:{\mathcal{L}}_{smooth}\), we experimented with two regularizers. The first is the diffusion regularizer16,37, which penalizes the first-order spatial gradient of the deformation field. Its calculation formula is provided in formula (7):
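In its standard form this regularizer can be written as follows, where \(u\) denotes the displacement component of \(\:{\phi\:}_{1}\):

\[
\mathcal{L}_{diffusion}\left(\phi_{1}\right)=\gamma_{1}\sum_{p\in\Omega}\left\|\nabla u(p)\right\|^{2}
\]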
where \(\:{\gamma\:}_{1}\) represents the regularization parameter. The second is the bending energy18,26, which penalizes the second-order spatial derivatives of the deformation field. It can penalize severe bending deformations and may therefore be helpful for large deformation registration tasks. Its calculation formula is provided in formula (8):
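The standard three-dimensional bending energy referred to here is:

\[
\mathcal{L}_{bending}\left(\phi_{1}\right)=\gamma_{2}\sum_{p\in\Omega}\left[\left\|\frac{\partial^{2}\phi_{1}(p)}{\partial x^{2}}\right\|^{2}+\left\|\frac{\partial^{2}\phi_{1}(p)}{\partial y^{2}}\right\|^{2}+\left\|\frac{\partial^{2}\phi_{1}(p)}{\partial z^{2}}\right\|^{2}+2\left\|\frac{\partial^{2}\phi_{1}(p)}{\partial x\partial y}\right\|^{2}+2\left\|\frac{\partial^{2}\phi_{1}(p)}{\partial x\partial z}\right\|^{2}+2\left\|\frac{\partial^{2}\phi_{1}(p)}{\partial y\partial z}\right\|^{2}\right]
\]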
where \(\:{\gamma\:}_{2}\) represents the regularization parameter.
Experimental setting
Dataset and preprocessing
We selected the public datasets LPBA4050, Mindboggle10151, and OASIS52 to conduct the scan-to-scan registration task for our method. For each image in these three datasets, we first applied max-min normalization, followed by center cropping. The LPBA40 and Mindboggle101 datasets were cropped to a size of 160 × 192 × 160, while the OASIS dataset was cropped to a size of 160 × 192 × 224. The LPBA40 dataset consists of 40 T1-weighted brain MR scans, which we divide into 30 × 29 pairs for training and 10 × 9 pairs for testing, using expert-annotated segmentation maps of 54 structures as the ground truth for evaluation. For the Mindboggle101 dataset, we selected the NKI-RS-22 and NKI-TRT-20 subsets to create 40 × 39 pairs of scans for training and the OASIS-TRT-20 subset to form 20 × 19 pairs for testing, using expert-annotated segmentation maps of 62 structures as the ground truth for evaluation. For the OASIS dataset, since its officially published test set does not provide corresponding anatomical labels, we follow a division similar to that of TransMorph40 and randomly divide its training set into 394 images for training and the remaining 20 images for testing, using expert-annotated segmentation maps of 30 structures as the ground truth for evaluation.
Evaluation metrics
In our experiments, we quantitatively evaluate our registration results using the following metrics: (1) the Dice score (DSC), (2) the average symmetric surface distance (ASSD), (3) the percentage of voxels with a non-positive Jacobian determinant %\(\left|{J}_{{{\varnothing}}_{1}}\right|\le\:0\), (4) the average time required for the model to register a pair of images on the GPU, (5) the amount of memory utilized during model training, and (6) the total number of trainable parameters in the model. DSC measures the overlap between corresponding anatomical segmentation volumes; the higher its value, the higher the registration accuracy. ASSD evaluates the boundary similarity between corresponding anatomical segmentation volumes; the lower its value, the smaller the boundary error of the registered image. The value of %\(\left|{J}_{{{\varnothing}}_{1}}\right|\le\:0\) quantifies the quality of the deformation field, where a smaller value indicates fewer unreasonable folding points and thus a higher-quality deformation field. In addition, the registration time, memory usage, and total number of trainable parameters reflect the registration efficiency of the model; for all of them, lower values are better.
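For illustration, the following NumPy sketch computes two of these metrics, the mean DSC over a set of labels and the folding percentage of a displacement field; the function names and array layouts are our assumptions.

import numpy as np

def dice_score(seg_warped, seg_fixed, labels):
    # Mean Dice similarity coefficient over a list of anatomical labels.
    scores = []
    for lab in labels:
        a, b = (seg_warped == lab), (seg_fixed == lab)
        denom = a.sum() + b.sum()
        if denom > 0:
            scores.append(2.0 * np.logical_and(a, b).sum() / denom)
    return float(np.mean(scores))

def folding_ratio(disp):
    # Percentage of voxels with a non-positive Jacobian determinant; disp has shape (3, H, W, L).
    grads = [np.gradient(disp[i]) for i in range(3)]            # spatial gradients of each component
    J = np.stack([np.stack(g, axis=0) for g in grads], axis=0)  # (3, 3, H, W, L), J[i, j] = du_i/dx_j
    J = J + np.eye(3).reshape(3, 3, 1, 1, 1)                    # Jacobian of the full map x + u(x)
    det = np.linalg.det(np.moveaxis(J, (0, 1), (-2, -1)))       # determinant per voxel
    return 100.0 * float((det <= 0).mean())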
Implementation details
Specifically, we run the traditional registration method on a workstation equipped with a 13th Gen Intel(R) Core(TM) i5-13500H 2.60 GHz CPU, and implement the learning-based registration models using the PyTorch framework on an Ubuntu workstation equipped with an NVIDIA GeForce RTX 3090 24 GB GPU. In our experiments, the batch size is set to 1, and the Adam optimizer with a fixed learning rate \(\:{l}_{r}=0.0001\) is used to train our network for 30 epochs on the LPBA40 and Mindboggle101 datasets and 50 epochs on the OASIS dataset. The diffusion regularization coefficient \(\:{{\upgamma\:}}_{1}\) is empirically set to 1. In the encoder, the initial number of channels of the network is set to c = 16, so the hierarchical feature map dimensions of the 4-layer encoder are \(\:\text{d}1\in\:\left\{\text{16,32,64,128}\right\}\), and the corresponding 4-layer decoder output dimensions are \(\:\text{d}2\in\:\left\{\text{256,128,64,32}\right\}\). In addition, we set the number of attention heads of the four consecutive Swin-Transformer modules embedded in the decoder to {8, 4, 2, 1}, and the corresponding window size to (5, 5, 5).
Comparison methods
We employ nine advanced registration methods for comparison. (1) SyN12: a traditional diffeomorphic registration method that requires no training. We directly perform the registration task between image scans on the test sets of LPBA40, Mindboggle101, and OASIS. (2) VoxelMorph16 (abbreviated as VM): a single-scale convolutional registration network based on U-Net. (3) VoxelMorph_diff37 (abbreviated as VM_diff): a diffeomorphic variant of the VoxelMorph method. (4) TransMorph40 (abbreviated as TM): a U-shaped single-scale registration network utilizing a Swin-Transformer encoder. (5) TransMorph_diff40 (abbreviated as TM_diff): a diffeomorphic version corresponding to the TransMorph method. (6) TransMatch44: a single-scale registration network that employs a cross-attention mechanism and a dual-path encoder. (7) PiViT27: a registration network utilizing a low-scale iterative combined pyramid framework. (8) NICE-Trans26: a pyramid registration network featuring a dual-branch encoder and a pure Swin-Transformer decoder. (9) ModeTv229: a pyramid registration network based on a multi-head neighborhood cross-attention mechanism.
Result and discussion
Quantitative comparison of the results of different registration methods
The quantitative registration results of the various registration methods on the LPBA40, Mindboggle101 and OASIS datasets are presented in Tables 1, 2 and 3, respectively. It can be seen that our method achieves the best DSC and ASSD scores on the LPBA40 and OASIS datasets, and achieves the second-best results on the Mindboggle101 dataset, which exhibits a large degree of deformation; this fully demonstrates the high accuracy and good generalization of our method. Specifically, the traditional method SyN achieves higher DSC scores than the learning-based method VM on all three datasets, while the quality of its generated deformation field is also the second best on all three datasets. Among the learning-based registration methods, the pyramid-based methods significantly outperform the methods based on a single-scale predicted deformation field. For example, in Table 1, PiViT, NICE-Trans, ModeTv2, and our method are 5.8%, 7.5%, 7.5%, and 8% higher than VM in terms of the DSC metric, respectively; 4%, 5.7%, 5.7%, and 6.2% higher than TM, respectively; and 2.4%, 4.1%, 4.1%, and 4.6% higher than TransMatch, respectively. In Table 2, although the registration accuracy of our method on the Mindboggle101 dataset is slightly lower than that of ModeTv2, in the subsequent cross-dataset registration experiments the proposed method shows better generalization than ModeTv2, as shown in Table 13. In addition, compared with Table 1, in Tables 2 and 3 the registration accuracy of TM is 0.2% and 0.3% higher than that of PiViT, respectively. The score distribution of the other methods is basically consistent with Table 1.
From the perspective of registration efficiency, the learning-based methods exhibit significant advantages in registration time compared to SyN. In Table 1, we present the average time required for each method to register a pair of images on the GPU, along with the memory usage (Mem) and the total number of trainable parameters (Params). It is evident that PiViT achieves the fastest registration time of 0.029 s, followed closely by VM_diff. PiViT also demonstrates the lowest memory usage at 5248 MiB, again followed by VM_diff. VM has the smallest total number of trainable parameters, whereas TransMatch has the largest. Overall, our method does not have a clear advantage over the other methods in terms of registration efficiency, especially compared with ModeTv2, which is the second-best method overall. It should be noted that ModeTv2 uses a dedicated CUDA extension of its algorithm to improve computational efficiency. Nevertheless, our method registers a pair of images in under 1 s, which still meets the requirements of real-time clinical registration.
Visualization analysis of registration results
Figure 4 presents the axial visual registration results for each registration method applied to the Mindboggle101 and LPBA40 data sets. The first and fourth rows display the registration outcomes for each method, while the second and fifth rows provide grid visualizations of the deformation fields corresponding to these results. The third and sixth rows depict the difference maps comparing the registration results with the fixed image. Focusing on the Mindboggle101 dataset, it can be seen from the area marked by the blue box that our method can better handle large deformations in local areas. Given the significant degree of deformation present in the Mindboggle101 dataset, it is observed that the registration accuracy across all methods is generally low. In contrast, the grayscale distribution of our method in the difference map of the LPBA40 dataset is significantly more uniform.
The boxplot in Fig. 5 illustrates the distribution of DSC scores for anatomical regions generated by various registration methods applied to the LPBA40 and Mindboggle101 datasets. Similar to FAIM18, for the LPBA40 dataset, we merged the corresponding 54 ROIs into 7 regions for display: Frontal, Parietal, Occipital, Temporal, Cingulate, Hippocampus, and Putamen. As shown in (a) of Fig. 5, our method achieves the highest DSC score in all areas except for the Frontal. For the Mindboggle101 dataset, we merged the corresponding 62 ROIs into 5 regions for display: Frontal, Parietal, Occipital, Temporal, and Cingulate. As indicated in (b), our method achieves the highest DSC score in all areas except for the Frontal and Parietal.
Further analysis of model registration performance in extreme cases
Evaluating the performance of our method under extreme conditions (such as extremely atrophied or severely deformed brains) is crucial to verify the robustness of the model. To this end, we selected two samples with significant structural differences from the public dataset Mindboggle101, which exhibits large deformations, and mainly evaluated the performance of several advanced pyramid registration methods under these conditions. The detailed experimental results are shown in Table 4. It can be observed that although the registration accuracy of all methods decreases to varying degrees under such large-deformation conditions, our method and ModeTv2 show significant performance advantages over NICE-Trans and PiViT, which indicates that our proposed model has good robustness. In order to make a clearer and more comprehensive comparison, we further visualized the registration results and the corresponding deformation fields, as shown in Fig. 6, which mainly shows the axial (first two rows) and sagittal (last two rows) slices of the samples. The red rectangles mark regions where the registration results differ considerably. It can be found that there is a certain degree of difference between the fixed image and the warped image, accompanied by many cracks, and the corresponding deformation field grid visualizations also contain varying degrees of folding points.
Analysis of continuous deformation of decoding layer
The visualization results presented in Fig. 7 illustrate that the Swin-Transformer modules employed in the low-scale decoder layers (i.e., Levels 4 and 3) effectively emphasize the global features of the feature map. For example, in the heat maps corresponding to the initial residual deformation fields \(\:{\varphi\:}_{4}\) and \(\:{\varphi\:}_{41}\), the brighter areas are more evenly distributed and cover the entire image area. As the number of layers increases, the heat maps for the deformation fields of different layers demonstrate a gradual concentration on local large deformations. This observation supports our decision to utilize deconvolution modules in the high-scale layers of the decoder (i.e., Levels 2 and 1), which sufficiently highlight locally significant information. Furthermore, the comparison of the heat maps for \(\:{\phi\:}_{3}\) and \(\:{\phi\:}_{31}\)confirms that the deformation field resulting from the optimization output of the convolution iteration addresses the issue of insufficient deformation decomposition in a single decoder layer to a certain extent. Additionally, the grid visualization results of the deformation field across each layer of the decoder indicate that as the residual deformation sub-fields continue to merge, the deformation field produced by the subsequent decoder layer becomes finer than that of the previous layer, culminating in the final deformation field \(\:{{\varnothing}}_{1}\).
Ablation studies
Comparative analysis of four variant encoders
In this section, we conduct a focused quantitative comparative analysis of the four encoders designed in Fig. 1. First, Fig. 8 shows line graphs of the validation-set DSC scores of the models trained with different encoders on the (a) LPBA40, (b) Mindboggle101, and (c) OASIS datasets. As can be seen from (a) and (b) in Fig. 8, the DSC score of ExtraPrior_ConvBlock (green curve) increases steadily and reaches the highest level compared with the other three methods. In Fig. 8(c), the performance of ExtraPrior_ConvBlock (green curve) is second only to ExtraPrior + SE_ConvBlock (blue curve). In addition, Tables 5, 6 and 7 show the quantitative registration results on the three datasets, respectively. Compared with the other three variant models, the ExtraPrior_ConvBlock method we adopted achieves the highest registration accuracy on all three datasets, confirming the importance of the prior feature information. The second-best performance is attributed to Pure_ConvBlock, while the registration accuracy of Res_ConvBlock, which contains the residual structure, decreases. However, the residual structure has a significant advantage in accelerating network convergence, and the registration time of this model is consistently the fastest on the three datasets, which may be attributed to the fact that its total number of trainable parameters is the smallest, as shown in Table 5. In contrast, the ExtraPrior + SE_ConvBlock method does not show outstanding performance relative to our method, which indicates that simply adding a channel attention mechanism cannot improve model performance. We have to admit that convolution kernels are still very effective in extracting image features.
Research on the number of convolution iterations
In this section, we study the number of convolution iterations in each layer of the decoding path and verify it on the LPBA40 and Mindboggle101 datasets. Specifically, we conducted five experiments, namely: no convolution iteration operation in the decoding layer (denoted as baseline), performing one convolution iteration operation in each of the first two layers of the decoder (denoted as 1_1iter), performing two convolution iterations in each of the first two layers of the decoder (denoted as 2_2iter), performing one convolution iteration operation in each of the first three layers of the decoder (denoted as 1_1_1iter), and performing two convolution iterations in each of the first three layers of the decoder (denoted as 2_2_2iter), the results are presented in Table 8. We observe that the introduction of convolution iterations has some effect on improving the quality of the deformation field compared to the baseline. As for the model registration accuracy, we observed that in the first three layers of the decoder, the model’s registration accuracy did not exhibit a positive correlation with the increase in the number of convolution iterations. Furthermore, while increasing the number of convolution iterations does not augment the total number of trainable parameters, it does elevate the computational burden of the network, consequently slowing down the registration time. Therefore, it is essential to determine the optimal number of convolution iterations.
Comparative analysis of multiple loss functions
For the total loss function of the model (see formula (5)), with the similarity loss \(\:{\mathcal{L}}_{sim}\) fixed, we adopted three regularization combinations to verify the impact of different loss function combinations on model performance. Based on experience, the values of \(\:{\gamma\:}_{1}\) and \(\:{\gamma\:}_{2}\:\) are both set to 1.0. The specific experimental results are shown in Table 9. It can be observed that the diffusion regularizer achieves the best registration accuracy on the Mindboggle101 and OASIS datasets, and performs second best on the LPBA40 dataset. The combination of diffusion regularization and bending energy achieves the best registration accuracy on the LPBA40 dataset, and the second-best registration accuracy on the Mindboggle101 and OASIS datasets. The worst performance is obtained when only bending energy is used as the regularization term. In general, this experiment shows the potential of combining diffusion regularization and bending energy to improve the registration accuracy of the model, and parameter tuning may further improve the registration results.
Hyperparameter analysis
Here we take the LPBA40 and OASIS datasets as examples to analyze the impact of different values of the regularization weight parameter \(\:{\gamma\:}_{1}\)and the number of different attention heads in the decoding layer on the model performance.
For \(\:{\gamma\:}_{1}\), we test the effects of 0.5, 1.0, and 2.0 on model performance. The detailed results are shown in Table 10. It can be observed that when \(\:{\gamma\:}_{1}=1.0\), the model achieves the best balance between registration accuracy and deformation field quality. When \(\:{\gamma\:}_{1}=2.0\), the model strengthens the deformation field constraint; the deformation field generated under this value has the least folding, but the registration accuracy is lower than when \(\:{\gamma\:}_{1}=1.0\). When \(\:{\gamma\:}_{1}=0.5\), the model relaxes the deformation field constraint; the deformation field generated under this value has a higher degree of folding, and the registration accuracy is also lower than when \(\:{\gamma\:}_{1}=1.0\).
In order to analyze the impact of the number of attention heads in the decoding layer on the model performance, we set up 4 sets of comparative experiments to illustrate the rationality of the choice of (8, 4, 2, 1). The detailed experimental results are shown in Table 11, from which it can be observed that compared with other choices, setting the number of attention heads to (8, 4, 2, 1) has a significant performance advantage. Reducing the number of heads in different decoding layers does not bring much advantage in registration efficiency, but affects the model registration accuracy; increasing the number of heads does not further improve the model registration accuracy, which shows that the number of heads is not positively correlated with the registration accuracy. Reasonable setting of the number of heads is very important for balancing model accuracy and computational efficiency.
Study on the generalization of the model
Our generalization experiments are divided into two groups. The first group focuses on the cross-dataset and cross-task setting, using the model trained on the OASIS dataset to perform atlas-based registration tasks on the LPBA40 and Mindboggle101 test sets. The results are shown in Table 12, which shows that our method consistently achieves excellent registration accuracy. It is worth noting that, due to the inconsistency of the data distributions, we only evaluated NICE-Trans and our method. The second group is entirely a cross-dataset setting, using the model trained on the Mindboggle101 dataset to perform inter-subject registration tasks on the LPBA40 test set. The results are shown in Table 13, where the values in brackets indicate the difference from the corresponding results in Table 1. It can be seen that the registration accuracy of each method decreases to varying degrees relative to the results in Table 1; however, our method continues to maintain the highest registration accuracy.
Conclusion
We propose a novel registration method that integrates iterative concepts with a pyramid encoder-decoder. First, through a comprehensive analysis of the impact of various encoders on the pyramid registration network, we demonstrate that prior information from the original image can, to a certain extent, compensate for the loss of spatial feature information resulting from continuous convolution and pooling operations. Second, for the feature information of both high-scale and low-scale layers within the decoder, we implement distinct designs aimed at minimizing the computational complexity of the network while preserving its capacity to perceive both global and local information. Additionally, an iterative optimization strategy is employed to address the inadequate decomposition of deformation present in each layer of the pyramid decoder. Extensive experimental results substantiate the effectiveness of our model design.
Extensive experimental analysis also reveals some limitations of our method. First, the computational efficiency of the model needs to be improved, especially compared with some baseline methods that only use CNNs; the reason is that our model uses a large number of attention heads and introduces convolutional iteration operations. Second, when dealing with extreme registration cases (severe atrophy or severe deformation of brain structures), the registration accuracy decreases and the deformation field produces more folds. This is mainly because, in these extreme cases, the complex large deformations contained in brain MR images pose great challenges to the registration task. Third, our method is only trained and tested on single-modality datasets, and its effective transfer to multi-modality registration scenarios requires further research.
In future work, we believe that several directions warrant further exploration. First, study effective model optimization strategies to reduce its computational overhead and inference time. Methods such as mixed precision training, attention weight pruning, and knowledge distillation are possible directions to consider. Second, expand the training set and include samples with severe brain atrophy or severe deformation into the model training set to enhance the model’s registration performance in extreme cases. Third, extend the proposed method to multimodal or multi-sequence data, and combine it with existing multimodal strategies to further verify the adaptability and generalization ability of the method in different modalities. We hope this study can offer valuable insights into the application of deep learning in the field of registration.
Data availability
The datasets generated and/or analyzed during the current study are available at the following links: LPBA40: https://resource.loni.usc.edu/resources/atlases-downloads, Mindboggle101: https://osf.io/yhkde, OASIS: https://sites.wustl.edu/oasisbrains.
References
Sotiras, A., Davatzikos, C. & Paragios, N. Deformable medical image registration: A survey. IEEE Trans. Med. Imaging. 32 (7), 1153–1190 (2013).
Xu, Z. & Niethammer, M. DeepAtlas: joint semi-supervised learning of image registration and segmentation. In Medical Image Computing and Computer Assisted Intervention – MICCAI 2019: 22nd International Conference, Shenzhen, China, October 13–17, 2019, Proceedings, Part II, 420–429 (Springer, 2019).
Du, J., Li, W., Lu, K. & Xiao, B. An overview of multi-modal medical image fusion. Neurocomputing 215, 3–20 (2016).
Peters, T. & Cleary, K. Image-guided Interventions: Technology and Applications (Springer Science & Business Media, 2008).
Heinrich, M. P., Jenkinson, M., Papież, B. W., Brady, S. M. & Schnabel, J. A. Towards realtime multimodal fusion for image-guided interventions using self-similarities. In Medical Image Computing and Computer-Assisted Intervention – MICCAI 2013: 16th International Conference, Nagoya, Japan, September 22–26, 2013, Proceedings, Part I, 187–194 (Springer, 2013).
Boes, J. L. et al. Image registration for quantitative parametric response mapping of cancer treatment response. Transl Oncol. 7 (1), 101–110 (2014).
Bajcsy, R. & Kovačič, S. Multiresolution elastic matching. Comput. Vis. Graphics Image Process. 46 (1), 1–21 (1989).
Rueckert, D. et al. Nonrigid registration using free-form deformations: application to breast MR images. IEEE Trans. Med. Imaging. 18 (8), 712–721 (1999).
Thirion, J. P. Image matching as a diffusion process: an analogy with Maxwell’s demons. Med. Image Anal. 2 (3), 243–260 (1998).
Rueckert, D., Aljabar, P., Heckemann, R. A., Hajnal, J. V. & Hammers, A. Diffeomorphic registration using B-splines. In Medical Image Computing and Computer-Assisted Intervention – MICCAI 2006: 9th International Conference, Copenhagen, Denmark, October 1–6, 2006, Proceedings, Part II, 702–709 (Springer, 2006).
Glaunes, J., Qiu, A., Miller, M. I. & Younes, L. Large deformation diffeomorphic metric curve mapping. Int. J. Comput. Vis. 80, 317–336 (2008).
Avants, B. B., Epstein, C. L., Grossman, M. & Gee, J. C. Symmetric diffeomorphic image registration with cross-correlation: evaluating automated labeling of elderly and neurodegenerative brain. Med. Image Anal. 12 (1), 26–41 (2008).
Shen, D. & Davatzikos, C. HAMMER: hierarchical attribute matching mechanism for elastic registration. IEEE Trans. Med. Imaging. 21 (11), 1421–1439 (2002).
Boveiri, H. R., Khayami, R., Javidan, R. & Mehdizadeh, A. Medical image registration using deep neural networks: a comprehensive review. Comput. Electr. Eng. 87, 106767 (2020).
Jaderberg, M., Simonyan, K. & Zisserman, A. Spatial transformer networks. Adv. Neural Inf. Process. Syst. 28, (2015).
Balakrishnan, G. et al. An unsupervised learning model for deformable medical image registration. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 9252–9260 (IEEE, 2018).
Ronneberger, O., Fischer, P. & Brox, T. U-Net: convolutional networks for biomedical image segmentation. In Medical Image Computing and Computer-Assisted Intervention – MICCAI 2015: 18th International Conference, Munich, Germany, October 5–9, 2015, Proceedings, Part III, 234–241 (Springer, 2015).
Kuang, D. & Schmah, T. FAIM – a ConvNet method for unsupervised 3D medical image registration. In Machine Learning in Medical Imaging: 10th International Workshop, MLMI 2019, Held in Conjunction with MICCAI 2019, Shenzhen, China, October 13, 2019, Proceedings, 646–654 (Springer, 2019).
Fan, J., Cao, X., Yap, P. T. & Shen, D. BIRNet: brain image registration using dual-supervised fully convolutional networks. Med. Image Anal. 54, 193–206 (2019).
Li, Y. X., Tang, H., Wang, W., Zhang, X. F. & Qu, H. Dual attention network for unsupervised medical image registration based on VoxelMorph. Sci. Rep. 12 (1), 16250 (2022).
Cao, X. et al. Image registration using machine and deep learning. Handbook of Medical Image Computing and Computer Assisted Intervention, 319–342 (Elsevier, 2020).
De Vos, B. D. et al. A deep learning framework for unsupervised affine and deformable image registration. Med. Image Anal. 52, 128–143 (2019).
Zhao, S. et al. Unsupervised 3D end-to-end medical image registration with volume tweening network. IEEE J. Biomed. Health Inf. 24 (5), 1394–1404 (2019).
Zhao, S., Dong, Y., Chang, E. I. & Xu, Y. Recursive cascaded networks for unsupervised medical image registration. Proceedings of the IEEE/CVF International Conference on Computer Vision, 10600–10610 (2019).
Kang, M., Hu, X., Huang, W., Scott, M. R. & Reyes, M. Dual-stream pyramid registration network. Med. Image Anal. 78, 102379 (2022).
Meng, M., Bi, L., Fulham, M., Feng, D. & Kim, J. Non-iterative coarse-to-fine transformer networks for joint affine and deformable image registration. International Conference on Medical Image Computing and Computer-Assisted Intervention, 750–760 (Springer, 2023).
Ma, T., Dai, X., Zhang, S. & Wen, Y. PIViT: Large deformation image registration with pyramid-iterative vision transformer. International Conference on Medical Image Computing and Computer-Assisted Intervention, 602–612 (Springer, 2023).
Wang, H., Ni, D. & Wang, Y. ModeT: Learning deformable image registration via motion decomposition transformer. International Conference on Medical Image Computing and Computer-Assisted Intervention, 740–749 (Springer, 2023).
Wang, H., Wang, Z., Ni, D. & Wang, Y. ModeTv2: GPU-accelerated motion decomposition transformer for pairwise optimization in medical image registration. arXiv preprint arXiv:2403.16526 (2024).
Liu, Z. et al. Swin Transformer: Hierarchical vision transformer using shifted windows. Proceedings of the IEEE/CVF International Conference on Computer Vision, 10012–10022 (2021).
Wang, Z., Wang, H. & Wang, Y. Pyramid attention network for medical image registration. arXiv preprint arXiv:2402.09016 (2024).
Wang, H., Ni, D. & Wang, Y. Recursive deformable pyramid network for unsupervised medical image registration. IEEE Trans. Med. Imaging (2024).
Vaswani, A. et al. Attention is all you need. Adv. Neural Inf. Process. Syst. 30 (2017).
Lowe, D. G. Object recognition from local scale-invariant features. Proceedings of the Seventh IEEE International Conference on Computer Vision, 1150–1157 (IEEE, 1999).
Bay, H., Ess, A., Tuytelaars, T. & Van Gool, L. Speeded-up robust features (SURF). Comput. Vis. Image Underst. 110 (3), 346–359 (2008).
Ashburner, J. & Friston, K. J. Voxel-based morphometry—the methods. Neuroimage 11 (6), 805–821 (2000).
Balakrishnan, G., Zhao, A., Sabuncu, M. R., Guttag, J. & Dalca, A. V. Voxelmorph: a learning framework for deformable medical image registration. IEEE Trans. Med. Imaging. 38 (8), 1788–1800 (2019).
Cao, C., Cao, L., Li, G., Zhang, T. & Gao, X. BIRGU net: deformable brain magnetic resonance image registration using gyral-net map and 3D Res-Unet. Med. Biol. Eng. Comput. 61 (2), 579–592 (2023).
Chen, J. et al. ViT-V-Net: Vision transformer for unsupervised volumetric medical image registration. arXiv preprint arXiv:2104.06468 (2021).
Chen, J. et al. Transmorph: transformer for unsupervised medical image registration. Med. Image Anal. 82, 102615 (2022).
Liu, L., Huang, Z., Liò, P., Schönlieb, C. B. & Aviles-Rivero, A. I. PC-SwinMorph: Patch representation for unsupervised medical image registration and segmentation. arXiv preprint arXiv:2203.05684 (2022).
Zhu, Y. & Lu, S. Swin-VoxelMorph: A symmetric unsupervised learning model for deformable medical image registration using Swin Transformer. International Conference on Medical Image Computing and Computer-Assisted Intervention, 78–87 (Springer, 2022).
Shi, J. et al. XMorpher: Full transformer for deformable medical image registration via cross attention. International Conference on Medical Image Computing and Computer-Assisted Intervention, 217–226 (Springer, 2022).
Chen, Z., Zheng, Y. & Gee, J. C. TransMatch: A transformer-based multilevel dual-stream feature matching network for unsupervised deformable image registration. IEEE Trans. Med. Imaging. 43 (1), 15–27 (2024).
Chen, J., Liu, Y., He, Y. & Du, Y. Deformable cross-attention transformer for medical image registration. International Workshop on Machine Learning in Medical Imaging, 115–125 (Springer, 2023).
Hu, B., Zhou, S., Xiong, Z. & Wu, F. Recursive decomposition network for deformable image registration. IEEE J. Biomed. Health Inf. 26 (10), 5130–5141 (2022).
Mok, T. C. & Chung, A. C. Large deformation diffeomorphic image registration with Laplacian pyramid networks. Medical Image Computing and Computer Assisted Intervention – MICCAI 2020: 23rd International Conference, Lima, Peru, October 4–8, 2020, Proceedings, Part III 23, 211–221 (Springer, 2020).
Wells, W. M. III, Viola, P., Atsumi, H., Nakajima, S. & Kikinis, R. Multi-modal volume registration by maximization of mutual information. Med. Image Anal. 1 (1), 35–51 (1996).
Heinrich, M. P. et al. Modality independent neighbourhood descriptor for multi-modal deformable registration. Med. Image Anal. 16 (7), 1423–1435 (2012).
Shattuck, D. W. et al. Construction of a 3D probabilistic atlas of human cortical structures. Neuroimage 39 (3), 1064–1080 (2008).
Klein, A. & Tourville, J. 101 Labeled brain images and a consistent human cortical labeling protocol. Front. Neurosci. 6, 171 (2012).
Marcus, D. S. et al. Open access series of imaging studies (OASIS): cross-sectional MRI data in young, middle aged, nondemented, and demented older adults. J. Cogn. Neurosci. 19 (9), 1498–1507 (2007).
Acknowledgements
The authors thank the School of Medical Information Engineering at Gansu University of Traditional Chinese Medicine and Quanzhou Orthopedic Hospital for their support of this study.
Author information
Contributions
X.C.: Conceptualization, Methodology, Writing - original draft. Y.Z.: Visualization & Data curation. C.W.: Validation. G.S.: Formal analysis. F.J.: Investigation. J.Y.: Writing - Review & editing, Software, Supervision. All authors have read and agreed to the published version of the manuscript.
Ethics declarations
Competing interests
The authors declare no competing interests.
Additional information
Publisher’s note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.
About this article
Cite this article
Cui, X., Zhou, Y., Wei, C. et al. Hybrid transformer and convolution iteratively optimized pyramid network for brain large deformation image registration. Sci Rep 15, 15707 (2025). https://doi.org/10.1038/s41598-025-00403-w