Extended Data Fig. 3: The graphical illustrations of the key components in BiomedGPT.
From: A generalist vision–language foundation model for diverse biomedical tasks

(a) Head-scale multi-head attention module in BiomedGPT. A trainable scalar γh is applied to each head's output prior to the output projection. (b) Instead of adding the absolute positional embedding Pi to the input embedding Ii (left), we compute the positional correlation and the input correlation separately with different projection matrices and add them together in the self-attention module (right). (c) Graphical illustration of the relative position bias. Such an inductive bias Bj−i is a learnable parameter that can be viewed as an embedding of the relative position j−i; it is injected into the query–key product, \(\frac{1}{\sqrt{d}}({I}_{i}{W}^{\,Q}){({I}_{j}{W}^{\,K})}^{T}+{B}_{j-i}\), and shared across all layers. (d) An example of trie-based beam search: along the path through ‘Lipid’ and ‘breakdown’, BiomedGPT sets the logits of all invalid tokens (‘mechanism’ and ‘pathway’) to −∞ when computing log-probabilities for the target token ‘in’. It is worth noting that trie-based search is also applied during the validation phase of the fine-tuning stage for acceleration (approximately a 16× speedup in our experiments).
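The head-scale mechanism of panel (a) and the relative position bias of panel (c) can be illustrated together in a minimal NumPy sketch. This is not the paper's implementation; the function and variable names are hypothetical, and the per-head scalar γh is simply multiplied into each head's attention output before the final output projection, with Bj−i looked up from a shared bias table:

```python
import numpy as np

def softmax(z):
    # Numerically stable softmax over the last axis.
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def head_scale_attention(X, Wq, Wk, Wv, Wo, gamma, bias):
    """Illustrative sketch (not the reference implementation).

    X:          (n, d_model) input embeddings I
    Wq, Wk, Wv: (H, d_model, d_head) per-head projections
    Wo:         (H * d_head, d_model) output projection
    gamma:      (H,) trainable per-head scale, applied before Wo
    bias:       (2n - 1,) table holding B_{j-i} for j - i in [-(n-1), n-1]
    """
    n, _ = X.shape
    H, _, d = Wq.shape
    # B[i, j] = bias[j - i]; the table is shared across heads and layers.
    idx = np.arange(n)
    B = bias[idx[None, :] - idx[:, None] + n - 1]
    heads = []
    for h in range(H):
        Q, K, V = X @ Wq[h], X @ Wk[h], X @ Wv[h]
        scores = Q @ K.T / np.sqrt(d) + B  # bias injected into query-key product
        heads.append(gamma[h] * (softmax(scores) @ V))  # head-scale before Wo
    return np.concatenate(heads, axis=-1) @ Wo
```

Because γh multiplies each head before the shared output projection, it lets the model learn to up- or down-weight individual heads without changing the projection matrices themselves.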
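The decoupling in panel (b) can be sketched as follows, again as an assumption-laden illustration rather than the paper's code: rather than projecting the sum Ii + Pi with one pair of query/key matrices, the input and positional correlations are computed with separate projection pairs and added inside the attention score:

```python
import numpy as np

def decoupled_scores(I, P, WqI, WkI, WqP, WkP):
    """Sketch of decoupled attention scores (hypothetical names).

    I, P: (n, d_model) input and absolute positional embeddings
    WqI, WkI: (d_model, d_head) projections for the input correlation
    WqP, WkP: (d_model, d_head) projections for the positional correlation
    """
    d = WqI.shape[1]
    input_corr = (I @ WqI) @ (I @ WkI).T  # content-content term
    pos_corr = (P @ WqP) @ (P @ WkP).T    # position-position term
    return (input_corr + pos_corr) / np.sqrt(d)
```

The left-hand variant in the figure would instead compute ((I + P) @ Wq) @ ((I + P) @ Wk).T, entangling content and position through a single projection pair.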