Extended Data Fig. 4: Evaluation and comparison of different ESM models.
From: Genome-wide prediction of disease variant effects with a deep protein language model

Tested ESM models: ESM1b, ESM1, the five ESM1v models, and an ensemble of the five ESM1v models that averages their log-likelihood ratio (LLR) scores (ESM1v-avg). (a) Performance of the different ESM models on the clinical benchmarks (ClinVar and HGMD/gnomAD). Each model was evaluated as a binary classifier of pathogenic vs. benign missense variants on each of the two benchmarks using the global ROC-AUC metric. Only proteins shorter than 1,022 aa were considered in this evaluation (thereby avoiding the sliding window approach). (b) Performance of the ESM models on the DMS benchmark.
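The ensemble averaging and the ROC-AUC evaluation described in this caption can be illustrated with a minimal sketch. It is not the authors' code: the function names, the toy LLR values, and the labels are all hypothetical. It also assumes that a more negative LLR indicates a more damaging variant (so the negated LLR serves as the pathogenicity score) and that "global" means all variants are pooled across proteins before a single ROC-AUC is computed.

```python
from statistics import mean

def average_llr(per_model_scores):
    """Average LLR scores variant-by-variant across models (ESM1v-avg style)."""
    return [mean(scores) for scores in zip(*per_model_scores)]

def roc_auc(labels, scores):
    """ROC-AUC as the probability that a random positive outscores a random
    negative, counting ties as 0.5 (equivalent to the Mann-Whitney U statistic)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical LLR scores: 5 models (rows) x 6 variants (columns),
# pooled across proteins. These are made-up numbers for illustration only.
llr_by_model = [
    [-12.0,  -9.5, -3.0, -8.0, -1.5, -4.0],
    [-11.0, -10.5, -2.0, -7.0, -2.5, -5.0],
    [-13.0,  -8.5, -4.0, -9.0, -0.5, -3.0],
    [-12.5,  -9.0, -3.5, -8.5, -1.0, -4.5],
    [-11.5, -10.0, -2.5, -7.5, -2.0, -3.5],
]
# Hypothetical labels: 1 = pathogenic, 0 = benign.
labels = [1, 1, 0, 0, 0, 1]

ensemble = average_llr(llr_by_model)

# Assumption: lower LLR means more damaging, so negate to get a score
# where higher means "more likely pathogenic".
auc = roc_auc(labels, [-s for s in ensemble])
print(f"ensemble LLRs: {ensemble}")
print(f"global ROC-AUC: {auc:.3f}")
```

The pairwise formulation is quadratic in the number of variants and is used here only for transparency; on benchmark-sized data a rank-based implementation such as scikit-learn's `roc_auc_score` would be the practical choice.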