Fig. 6: ESM1b effect predictions generalize to any coding variant.
From: Genome-wide prediction of disease variant effects with a deep protein language model

a, Top: functional effect scores are assigned to in-frame indels by invoking ESM1b on both the WT and mutated protein sequences and calculating the PLLR between them. Bottom: the distribution of ESM1b effect scores over 1,679 benign and 1,791 pathogenic in-frame indels from ClinVar. b, Comparison between three versions of ESM1b-based effect scores, CADD (a supervised VEP method) and three baseline models as classifiers of pathogenic versus benign in-frame indels (over the same set of variants as in a). One-tailed P values are shown for the differences between the performance of CADD and the ESM1b-based effect scores (Methods). Right: partitioning of the 3,470 in-frame indels into deletions, insertions and deletion–insertion combinations (delins). c, Functional effect scores are also assigned to stop-gain variants; each score is defined as the LLR of the missense variant predicted to be the most deleterious among all possible missense variants in the lost region of the protein. Illustrated example: substitution of glutamine with a stop codon at position 25. d, Assessment of ESM1b and three baseline models as classifiers of pathogenic versus benign stop-gain variants, over (1) variants expected not to undergo NMD (3,672 pathogenic and 147 benign variants), (2) variants expected to undergo NMD (32,362 pathogenic and 198 benign variants) and (3) all stop-gain variants (36,034 pathogenic and 345 benign variants). Error bars show the s.d. of the ROC-AUC scores, centered on the mean (estimated by bootstrapping).
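
As a minimal sketch of the scoring rules described for panels a and c, the Python below computes a PLLR from per-position log-probabilities and a stop-gain score from a table of missense LLRs. The function names, the dictionary layout, the toy values and the sign convention (mutant minus WT, with lower scores taken as more deleterious) are assumptions made for illustration and are not taken from the ESM1b codebase; no language model is invoked here.

```python
def pseudo_log_likelihood(seq, log_probs):
    """Sum the log-probabilities assigned to the observed residues.

    log_probs[i] maps an amino acid to its log-probability at 0-based
    position i. In practice these values would come from a protein
    language model; here they are hypothetical inputs.
    """
    if len(seq) != len(log_probs):
        raise ValueError("sequence and log_probs must have equal length")
    return sum(log_probs[i][aa] for i, aa in enumerate(seq))


def pllr(wt_seq, wt_log_probs, mut_seq, mut_log_probs):
    """Pseudo-log-likelihood ratio between mutant and WT sequences.

    Assumed convention: mutant minus WT, so a negative value means the
    mutated sequence is less likely than the WT sequence.
    """
    return (pseudo_log_likelihood(mut_seq, mut_log_probs)
            - pseudo_log_likelihood(wt_seq, wt_log_probs))


def stop_gain_score(llr, stop_position):
    """Score a stop-gain variant as the minimum missense LLR in the lost region.

    llr[i] maps each alternative amino acid to its LLR at 0-based position i.
    stop_position is 1-based; the lost region is assumed to run from that
    position to the end of the protein.
    """
    lost_region = llr[stop_position - 1:]
    if stop_position < 1 or not lost_region:
        raise ValueError("stop_position must lie within the protein")
    return min(score for position in lost_region for score in position.values())


# Toy in-frame deletion: WT "MKQ" loses the lysine to give "MQ".
wt_seq, mut_seq = "MKQ", "MQ"
wt_lp = [{"M": -0.1}, {"K": -0.5}, {"Q": -0.4}]   # hypothetical model output
mut_lp = [{"M": -0.2}, {"Q": -1.3}]               # hypothetical model output
indel_score = pllr(wt_seq, wt_lp, mut_seq, mut_lp)

# Toy LLR table for a three-residue protein (two alternatives per position).
toy_llr = [
    {"A": -12.0, "C": -2.0},
    {"A": -7.5, "C": -3.0},
    {"A": -4.0, "C": -1.0},
]
stop_score = stop_gain_score(toy_llr, 2)  # stop codon introduced at position 2
```

In this toy setup the strongly negative LLR at position 1 does not affect `stop_score`, because that residue is still present in the truncated protein; only positions from the new stop codon onward contribute.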