Table 3 Results of temporal transformers using sequences of 8 images (8 F). The models stack a varying number of Non-Shuffled Transformer Attention Blocks (NTAB) or Channel-Shuffled Transformer Attention Blocks (CSTB) after the output of the CNN backbone. A model using a Gated Recurrent Unit (GRU) is included for comparison. Performance is reported as mean average precision (mAP) and top-k accuracy (k = 1, 5, 10, 20) in the All-Search (all cameras used) and Indoor-Only (only indoor cameras used) settings.
From: Channel-shuffled transformers for cross-modality person re-identification in video
Models | All-Search mAP (%) | All-Search top-1 (%) | All-Search top-5 (%) | All-Search top-10 (%) | All-Search top-20 (%) | Indoor-Only mAP (%) | Indoor-Only top-1 (%) | Indoor-Only top-5 (%) | Indoor-Only top-10 (%) | Indoor-Only top-20 (%) |
---|---|---|---|---|---|---|---|---|---|---|
0x NTAB (8 F) | 79.57 | 80.71 | 95.67 | 97.66 | 98.83 | 88.41 | 85.48 | 96.72 | 98.17 | 99.92 |
1x GRU (8 F) | 64.61 | 65.70 | 90.18 | 95.93 | 98.87 | 80.19 | 75.53 | 94.37 | 97.67 | 99.36 |
1x NTAB (8 F) | 53.05 | 49.69 | 82.58 | 91.28 | 96.57 | 67.52 | 58.49 | 88.39 | 95.48 | 99.53 |
1x CGTB (8 F) | 66.77 | 66.28 | 87.84 | 93.44 | 97.08 | 79.66 | 73.49 | 92.69 | 96.25 | 98.58 |
1x CSTB (8 F) | 80.60 | 82.39 | 96.73 | 98.03 | 98.77 | 89.19 | 85.96 | 97.10 | 98.41 | 99.99 |
2x CSTB (8 F) | 79.90 | 81.42 | 95.29 | 97.45 | 98.50 | 88.59 | 86.43 | 96.08 | 97.67 | 99.84 |
3x CSTB (8 F) | 80.60 | 82.42 | 95.98 | 98.09 | 98.65 | 89.00 | 86.48 | 96.47 | 98.49 | 100.00 |
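
As a reading aid for the metrics in the table, the following is a minimal sketch of how mAP and top-k accuracy are commonly computed for a retrieval task such as person re-identification. It is not the authors' evaluation code. The function name `rank_metrics`, its input format, and the choice to skip queries that have no correct match in the gallery are assumptions made for illustration; dataset-specific steps such as camera filtering or averaging over repeated trials are not modeled.

```python
def rank_metrics(ranked_matches, ks=(1, 5, 10, 20)):
    """Compute mAP and top-k accuracy, both in percent.

    ranked_matches: one list per query. Each list holds booleans for the
    gallery items sorted by descending similarity to the query
    (True = same identity as the query). Hypothetical input format.
    """
    ap_values = []
    hits_at_k = {k: 0 for k in ks}

    for matches in ranked_matches:
        if not any(matches):
            # Assumption: queries with no correct gallery match are skipped.
            continue

        # top-k: the query counts as a hit if the first correct match
        # appears within the first k ranks.
        first_hit = matches.index(True)  # 0-based rank of first correct match
        for k in ks:
            if first_hit < k:
                hits_at_k[k] += 1

        # Average precision: mean of the precision values measured at
        # each rank where a correct match occurs.
        hits = 0
        precision_sum = 0.0
        for rank, is_match in enumerate(matches, start=1):
            if is_match:
                hits += 1
                precision_sum += hits / rank
        ap_values.append(precision_sum / hits)

    n_queries = len(ap_values)
    if n_queries == 0:
        raise ValueError("no query has a correct match in the gallery")

    mean_ap = 100.0 * sum(ap_values) / n_queries
    top_k = {k: 100.0 * hits_at_k[k] / n_queries for k in ks}
    return mean_ap, top_k
```

For example, calling `rank_metrics([[True, False, True], [False, True, False]], ks=(1, 2))` evaluates two queries: the first has its correct matches at ranks 1 and 3, the second at rank 2 only. Top-k accuracy depends only on the rank of the first correct match, whereas mAP rewards placing all correct matches early in the ranking.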