Fig. 5: Potential for transfer learning. | Nature Communications

From: Genomic language model predicts protein co-regulation and function

A ModA and ModC interaction (Protein Data Bank structure 2ONK; ref. 47). B UMAP projection of predictions (orange) and labels (blues) of paralogs (ModAC shown in A); correct predictions are colored green. C Predicted embeddings colored by predicted confidence. Out-of-distribution predictions and predictions closer to the mean generally have lower confidence, while correct predictions have higher confidence. D, E Random 30-gene contigs from representative bacterial (“bac”) and archaeal (“arch”) genomes and reference viral (“vir”) genomes were embedded by mean-pooling ESM2 protein embeddings (context-free contig embeddings, D) and by mean-pooling the last hidden layer of gLM (contextualized contig embeddings, E). F Micro-averaged precision-recall curves and average precisions for logistic regression classifiers trained on context-free contig embeddings (grey lines) and contextualized contig embeddings (colored lines) for the class-level taxonomy classification task. Each line represents one fold of stratified k-fold cross-validation (k = 5). Class-level taxonomy for each contig is shown in Supplementary Fig. 9A, B, and the confusion matrices for the logistic regression classifiers are shown in Supplementary Fig. 9C, D. Source data are provided as a Source Data file.
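
The evaluation steps named in panels D to F (mean-pooling per-gene embeddings into a contig embedding, stratified 5-fold splitting, and micro-averaged average precision) can be illustrated with a minimal pure-Python sketch. This is not the authors' code: the function names `mean_pool`, `stratified_folds` and `micro_average_precision` are hypothetical, the embeddings and scores are toy values, and the logistic regression step is omitted, so the scores are assumed to come from some already-fitted classifier.

```python
from collections import defaultdict


def mean_pool(gene_embeddings):
    """Average per-gene vectors (one list of floats per gene) into one contig vector."""
    n_genes = len(gene_embeddings)
    dim = len(gene_embeddings[0])
    return [sum(vec[i] for vec in gene_embeddings) / n_genes for i in range(dim)]


def stratified_folds(labels, k=5):
    """Assign sample indices to k folds, round-robin within each class,
    so every fold keeps roughly the overall class proportions."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        for pos, idx in enumerate(indices):
            folds[pos % k].append(idx)
    return folds


def micro_average_precision(scores, labels, classes):
    """Micro-averaged average precision.

    Every (sample, class) pair is treated as one binary decision, all pairs
    are ranked by score, and precision is averaged over the ranks of the
    true pairs. Tied scores are ordered arbitrarily by the sort (a
    simplification made for this sketch).
    """
    pairs = []
    for sample_scores, label in zip(scores, labels):
        for cls in classes:
            pairs.append((sample_scores[cls], 1 if cls == label else 0))
    pairs.sort(key=lambda p: p[0], reverse=True)
    total_pos = sum(hit for _, hit in pairs)
    true_pos = 0
    ap = 0.0
    for rank, (_, hit) in enumerate(pairs, start=1):
        if hit:
            true_pos += 1
            ap += true_pos / rank
    return ap / total_pos


# Toy usage: two 2-dimensional gene embeddings pooled into one contig vector.
contig_vector = mean_pool([[1.0, 2.0], [3.0, 4.0]])

# Toy class scores for two contigs, as a fitted classifier might produce.
toy_scores = [{"bac": 0.9, "arch": 0.1}, {"bac": 0.2, "arch": 0.8}]
toy_ap = micro_average_precision(toy_scores, ["bac", "arch"], ["bac", "arch"])
```

Micro-averaging pools all class decisions before computing precision and recall, so classes with more contigs contribute more to the curve than rare classes do.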
