Fig. 3: Contextualization of gene function. | Nature Communications

From: Genomic language model predicts protein co-regulation and function

A Linear probe enzyme commission (EC) number classification accuracy for pLM (ESM2) representations and gLM (1st hidden layer) representations. Data are presented as mean values ± standard deviation over five technical replicates. B F1-score comparisons for EC classes with statistically significant differences (t-test, two-sided, Benjamini/Hochberg corrected p value < 0.05, technical replicates = 5) in performance between pLM- and gLM-based EC number linear probes. EC classes are ordered from the largest gain with contextualization on the left to the largest loss with contextualization on the right. Data are presented as mean values ± standard deviation. The adjusted p value (to two significant figures) for each class is specified above the bars. C Precision-Recall curves of pLM- and gLM-based EC number linear probes. D Histogram of variance (# bins = 100) calculated using contextualized embeddings (gLM; orange) and contig-averaged pLM (blue) embeddings of MGYPs that occur at least 100 times in the database. Histograms for the unannotated and annotated fractions of the MGYPs are plotted separately and bars are not stacked. Annotated examples in the long right tail include phage proteins and transposases, reflecting their ability to self-mobilize (see annotations of the top ten most variant genes in Supplementary Table 4). Source data are provided as a Source Data file.
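
The Benjamini/Hochberg correction mentioned for panel B can be illustrated with a minimal sketch. The caption does not say which software the authors used, so the function below is a generic textbook version of the procedure, not their implementation. The class names and raw p values are hypothetical placeholders, not data from the paper.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p values in the input order."""
    m = len(p_values)
    # Indices of the p values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p value down to the smallest, scaling each by
    # m / rank and taking a running minimum so that adjusted values stay
    # monotone in rank and never exceed 1.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


# Hypothetical raw two-sided t-test p values, one per EC class
# (illustrative only; not taken from the figure or its source data).
raw = {"EC_a": 0.001, "EC_b": 0.02, "EC_c": 0.03, "EC_d": 0.5}

adj = dict(zip(raw, benjamini_hochberg(list(raw.values()))))

# Classes that would pass the caption's threshold of adjusted p < 0.05.
significant = [name for name, p in adj.items() if p < 0.05]
print(significant)
```

In this toy input, the three smallest p values remain below 0.05 after adjustment, while the largest does not. Only classes passing such a filter would appear as bars in a plot like panel B.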
