Scientific Reports

Table 3 Results of temporal transformers using sequences of 8 images (8 F). The models stack a varying number of Non-Shuffled Transformer Attention Blocks (NTAB) or Channel-Shuffled Transformer Attention Blocks (CSTB) after the output of the CNN backbone. A model using a Gated Recurrent Unit (GRU) is included for comparison. Performance is reported as mean average precision (mAP) and top-k accuracy (k = 1, 5, 10, 20) in the All-Search (all cameras used) and Indoor-Only (only indoor cameras used) settings.

From: Channel-shuffled transformers for cross-modality person re-identification in video

Models	All-Search					Indoor-Only
Models	mAP (%)	top-1 (%)	top-5 (%)	top-10 (%)	top-20 (%)	mAP (%)	top-1 (%)	top-5 (%)	top-10 (%)	top-20 (%)
0x NTAB (8 F)	79.57	80.71	95.67	97.66	98.83	88.41	85.48	96.72	98.17	99.92
1x GRU (8 F)	64.61	65.70	90.18	95.93	98.87	80.19	75.53	94.37	97.67	99.36
1x NTAB (8 F)	53.05	49.69	82.58	91.28	96.57	67.52	58.49	88.39	95.48	99.53
1x CGTB (8 F)	66.77	66.28	87.84	93.44	97.08	79.66	73.49	92.69	96.25	98.58
1x CSTB (8 F)	80.60	82.39	96.73	98.03	98.77	89.19	85.96	97.10	98.41	99.99
2x CSTB (8 F)	79.90	81.42	95.29	97.45	98.50	88.59	86.43	96.08	97.67	99.84
3x CSTB (8 F)	80.60	82.42	95.98	98.09	98.65	89.00	86.48	96.47	98.49	100.00
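
The mAP and top-k columns above can be illustrated with a minimal sketch. It assumes the common re-identification definitions: average precision is computed per query over a ranked gallery and then averaged, and top-k counts a query as correct if any true match appears in its first k results. The exact evaluation protocol behind the table is not stated here, so the function names and the toy data below are hypothetical and only show how such numbers are typically derived.

```python
from typing import Dict, List, Sequence, Tuple


def average_precision(ranked_matches: Sequence[bool]) -> float:
    """AP for one query: mean of the precision at each rank holding a true match."""
    hits = 0
    precision_sum = 0.0
    for rank, is_match in enumerate(ranked_matches, start=1):
        if is_match:
            hits += 1
            precision_sum += hits / rank
    # A query with no true match in the gallery contributes 0 (an assumption).
    return precision_sum / hits if hits else 0.0


def top_k_hit(ranked_matches: Sequence[bool], k: int) -> bool:
    """True if at least one true match appears within the first k results."""
    return any(ranked_matches[:k])


def evaluate(
    all_ranked_matches: List[Sequence[bool]],
    ks: Tuple[int, ...] = (1, 5, 10, 20),
) -> Tuple[float, Dict[int, float]]:
    """Return mAP and top-k accuracy, both in percent, averaged over all queries."""
    n = len(all_ranked_matches)
    m_ap = 100.0 * sum(average_precision(r) for r in all_ranked_matches) / n
    top_k = {
        k: 100.0 * sum(top_k_hit(r, k) for r in all_ranked_matches) / n
        for k in ks
    }
    return m_ap, top_k


# Hypothetical toy data: two queries, each with a ranked gallery of five items.
toy = [
    [True, False, True, False, False],   # AP = (1/1 + 2/3) / 2
    [False, True, False, False, False],  # AP = 1/2
]
m_ap, top_k = evaluate(toy, ks=(1, 5))
print(f"mAP = {m_ap:.2f}%, top-1 = {top_k[1]:.2f}%, top-5 = {top_k[5]:.2f}%")
```

Each inner list marks, in ranked order, whether a gallery item shares the query's identity. In practice these flags would come from sorting gallery features by distance to the query feature.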
