Fig. 2
From: Prior-guided attention fusion transformer for multi-lesion segmentation of diabetic retinopathy

The overall architecture of the proposed network. (a) is the overall architecture, consisting of three components: a dual-branch encoder, a decoder and a classification head, where \(C_{1}=32\), \(C_{2}=64\), \(C_{3}=128\), \(C_{4}=256\) and \(C_{5}=512\) denote the numbers of feature channels at the five scales, respectively, and \(H=512\) and \(W=512\) denote the height and width of the pre-processed images. (b) and (c) are the structures of Conv1 and Conv2, where \(h\) and \(w\) denote the height and width of the features at the current scale. In our experiments, \(c'=2c\) when downsample is True.
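A minimal sketch of the channel rule described in the caption, assuming a PyTorch implementation: a convolutional block whose output width is \(c'=2c\) when downsample is True (with the spatial size halved), and \(c'=c\) otherwise. The exact layer composition of Conv1 and Conv2 is given only in Fig. 2(b) and (c), so the 3×3 conv + BN + ReLU stack below is an illustrative assumption, not the authors' implementation.

```python
import torch
import torch.nn as nn


class ConvBlock(nn.Module):
    """Hypothetical sketch of a conv block following the caption's channel rule.

    If downsample is True, the output has c' = 2c channels and half the spatial
    size; otherwise c' = c and the spatial size is unchanged. The internal
    layers (3x3 conv + BatchNorm + ReLU) are assumed for illustration only.
    """

    def __init__(self, c: int, downsample: bool = False):
        super().__init__()
        c_out = 2 * c if downsample else c      # c' = 2c when downsampling
        stride = 2 if downsample else 1         # halve h and w when downsampling
        self.block = nn.Sequential(
            nn.Conv2d(c, c_out, kernel_size=3, stride=stride, padding=1, bias=False),
            nn.BatchNorm2d(c_out),
            nn.ReLU(inplace=True),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.block(x)


# Shape check: starting from C1 = 32 channels at the pre-processed 512x512
# resolution and downsampling four times reproduces the caption's channel
# progression 32 -> 64 -> 128 -> 256 -> 512 (spatial sizes are assumed).
x = torch.randn(1, 32, 512, 512)
for _ in range(4):
    x = ConvBlock(x.shape[1], downsample=True)(x)
print(x.shape)  # torch.Size([1, 512, 32, 32])
```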