Fig. 2: ESM1b is suitable for genome-wide disease prediction of coding variants. | Nature Genetics

Fig. 2: ESM1b is suitable for genome-wide disease prediction of coding variants.

From: Genome-wide prediction of disease variant effects with a deep protein language model

Fig. 2

a, Top: the distribution of ESM1b effect scores across two sets of variants that are assumed to be mostly pathogenic (‘ClinVar: pathogenic’ and ‘HGMD: disease causing’) and two sets of variants assumed to be mostly benign (‘ClinVar: benign’ and ‘gnomAD: MAF > 0.01’). Bottom: Venn diagram of the variants extracted from HGMD, ClinVar and gnomAD. b, Comparison between ESM1b and EVE in their capacity to distinguish between pathogenic and benign variants (measured by global ROC-AUC scores), as labeled by ClinVar (36,537 variants in 2,765 unique genes) or HGMD/gnomAD (30,497 variants in 1,991 unique genes). c, The distribution of ESM1b effect scores across ClinVar missense VUS, decomposed as a mixture of two Gaussian distributions capturing variants predicted as more likely pathogenic (orange) or more likely benign (blue). d, The distribution of ESM1b effect scores across all common ClinVar labels, including the two Gaussian components from c. Boxes mark Q1–Q3 of the distributions, with midpoints marking the medians (Q2) and whiskers stretching 1.5× IQR. Altogether there are ~300,000 missense variants labeled in ClinVar. e,f, Evaluation of 19 VEP methods against the same two benchmarks: ClinVar (e) and HGMD/gnomAD (f). Performance was measured by two metrics for binary classification as follows: ROC-AUC (light red) and a balanced version of PRC-AUC (light blue; Methods). Performance was evaluated on the sets of variants available for all 19 methods. g,h, Head-to-head comparison between ESM1b and each of the 18 other VEP methods over the same two dataset benchmarks (in terms of ROC-AUC). Because ESM1b provides scores for all missense mutations, the comparison against each other method is performed on the set of variants with effect predictions for that method. The percentage of variants considered for each method is shown at the bottom of each bar. IQR, interquartile range.

Back to article page