Table 3 Results of temporal transformers using sequences of 8 images (8 F). The models stack a varying number of Non-Shuffled Transformer Attention Blocks (NTAB) or Channel-Shuffled Transformer Attention Blocks (CSTB) after the output of the CNN backbone. A model using a Gated Recurrent Unit (GRU) is included for comparison. Performance is reported as mean average precision (mAP) and top-k accuracy (k = 1, 5, 10, 20) in the All-Search (all cameras used) and Indoor-Only (only indoor cameras used) settings.

From: Channel-shuffled transformers for cross-modality person re-identification in video

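| Models | All-Search mAP (%) | All-Search top-1 (%) | All-Search top-5 (%) | All-Search top-10 (%) | All-Search top-20 (%) | Indoor-Only mAP (%) | Indoor-Only top-1 (%) | Indoor-Only top-5 (%) | Indoor-Only top-10 (%) | Indoor-Only top-20 (%) |
|---|---|---|---|---|---|---|---|---|---|---|
| 0x NTAB (8 F) | 79.57 | 80.71 | 95.67 | 97.66 | 98.83 | 88.41 | 85.48 | 96.72 | 98.17 | 99.92 |
| 1x GRU (8 F) | 64.61 | 65.70 | 90.18 | 95.93 | 98.87 | 80.19 | 75.53 | 94.37 | 97.67 | 99.36 |
| 1x NTAB (8 F) | 53.05 | 49.69 | 82.58 | 91.28 | 96.57 | 67.52 | 58.49 | 88.39 | 95.48 | 99.53 |
| 1x CGTB (8 F) | 66.77 | 66.28 | 87.84 | 93.44 | 97.08 | 79.66 | 73.49 | 92.69 | 96.25 | 98.58 |
| 1x CSTB (8 F) | 80.60 | 82.39 | 96.73 | 98.03 | 98.77 | 89.19 | 85.96 | 97.10 | 98.41 | 99.99 |
| 2x CSTB (8 F) | 79.90 | 81.42 | 95.29 | 97.45 | 98.50 | 88.59 | 86.43 | 96.08 | 97.67 | 99.84 |
| 3x CSTB (8 F) | 80.60 | 82.42 | 95.98 | 98.09 | 98.65 | 89.00 | 86.48 | 96.47 | 98.49 | 100.00 |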
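
For readers unfamiliar with the two metrics in the table, the following is a minimal sketch of how mAP and top-k accuracy are commonly computed in re-identification, assuming each query yields a gallery ranking marked as correct or incorrect matches. The function names (`average_precision`, `top_k_hit`, `evaluate`) are hypothetical, and the paper's exact protocol (gallery construction and the camera filtering behind All-Search and Indoor-Only) is not given here, so this may differ from the evaluation actually used.

```python
from typing import Dict, Sequence


def average_precision(ranked_matches: Sequence[bool]) -> float:
    """AP for one query: mean of the precision values at each rank
    where a correct identity appears in the ranked gallery."""
    hits = 0
    precision_sum = 0.0
    for rank, is_match in enumerate(ranked_matches, start=1):
        if is_match:
            hits += 1
            precision_sum += hits / rank
    # Assumption: a query with no correct match contributes an AP of 0.
    return precision_sum / hits if hits else 0.0


def top_k_hit(ranked_matches: Sequence[bool], k: int) -> bool:
    """True if at least one correct match appears in the first k ranks."""
    return any(ranked_matches[:k])


def evaluate(
    all_ranked_matches: Sequence[Sequence[bool]],
    ks: Sequence[int] = (1, 5, 10, 20),
) -> Dict[str, float]:
    """Average the per-query scores and report them as percentages."""
    n = len(all_ranked_matches)
    results = {
        "mAP": 100.0 * sum(average_precision(r) for r in all_ranked_matches) / n
    }
    for k in ks:
        hits = sum(top_k_hit(r, k) for r in all_ranked_matches)
        results[f"top-{k}"] = 100.0 * hits / n
    return results


# Toy example with two queries (illustrative data, not from the paper).
toy_rankings = [
    [True, False, True],   # correct matches at ranks 1 and 3
    [False, True, False],  # correct match at rank 2
]
toy_scores = evaluate(toy_rankings, ks=(1, 2))
```

In this toy example the first query has an AP of (1/1 + 2/3) / 2 and the second an AP of 1/2, so top-1 accuracy is 50% (only the first query is matched at rank 1) and top-2 accuracy is 100%.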