Fig. 1: gLM training and inference schematics.

From: Genomic language model predicts protein co-regulation and function

A For training, contigs (contiguous genomic sequences) containing up to 30 genes are first translated into proteins, which are then embedded using a protein language model (pLM) encoder (ESM2). Masked inputs are generated by randomly masking proteins at 15% probability, and a genomic language model (gLM; a transformer encoder) is trained to make four predictions for each masked protein, each with an associated likelihood. Training loss is computed on both the predictions and the likelihoods. B At inference time, inputs are generated from a contig using ESM2 output. Contextualized protein embeddings (hidden layers of the gLM) and attention patterns are used for various downstream tasks. See Supplementary Fig. 1 for detailed schematics. Source data are provided as a Source Data file.
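The masked-training step in panel A can be illustrated with a short sketch. The following is a minimal PyTorch illustration under stated assumptions, not the authors' implementation: the embedding dimension, number of encoder layers, module and function names (GLMSketch, pred_head, lik_head, training_loss), and the closest-prediction loss formulation are all assumptions made here for clarity. Only the 15% masking rate, the four predictions with associated likelihoods per masked protein, the transformer-encoder architecture over ESM2 protein embeddings, and the loss on both predictions and likelihoods come from the caption.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

EMB_DIM = 1280   # assumed ESM2 per-protein embedding size
N_PRED = 4       # four predictions per masked protein (from the caption)
MASK_P = 0.15    # 15% masking probability (from the caption)

class GLMSketch(nn.Module):
    """Hypothetical stand-in for the gLM: a transformer encoder over protein embeddings."""
    def __init__(self, n_layers=4, n_heads=8):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=EMB_DIM, nhead=n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.mask_token = nn.Parameter(torch.zeros(EMB_DIM))   # learned mask embedding
        self.pred_head = nn.Linear(EMB_DIM, N_PRED * EMB_DIM)  # four candidate embeddings
        self.lik_head = nn.Linear(EMB_DIM, N_PRED)             # one likelihood logit each

    def forward(self, prot_emb):                               # (batch, genes, EMB_DIM)
        mask = torch.rand(prot_emb.shape[:2], device=prot_emb.device) < MASK_P
        x = prot_emb.clone()
        x[mask] = self.mask_token                              # replace masked proteins
        h = self.encoder(x)                                    # contextualized embeddings
        preds = self.pred_head(h).view(*h.shape[:2], N_PRED, EMB_DIM)
        lik_logits = self.lik_head(h)
        return preds, lik_logits, mask, h

def training_loss(preds, lik_logits, target, mask):
    """Loss on both predictions and likelihoods (closest-match scheme is an assumption)."""
    err = ((preds - target.unsqueeze(2)) ** 2).mean(-1)        # (batch, genes, N_PRED)
    best = err.argmin(dim=-1)                                  # index of closest prediction
    pred_loss = err.min(dim=-1).values[mask].mean()            # prediction loss on masked genes
    lik_loss = F.cross_entropy(lik_logits[mask], best[mask])   # likelihoods should favour it
    return pred_loss + lik_loss

# Toy usage: two contigs of 30 ESM2-embedded genes each (random stand-ins here).
model = GLMSketch()
contig = torch.randn(2, 30, EMB_DIM)
preds, lik_logits, mask, h = model(contig)
loss = training_loss(preds, lik_logits, contig, mask)
loss.backward()
```

In this sketch, the encoder output h plays the role of the contextualized protein embeddings used for downstream tasks in panel B; the encoder's attention weights (not exposed here) correspond to the attention patterns mentioned in the caption.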
