Extended Data Fig. 6: Evaluation of different sliding window approaches and window sizes. | Nature Genetics

From: Genome-wide prediction of disease variant effects with a deep protein language model

(a) Evaluation of the approaches as binary classifiers of variant pathogenicity over the ClinVar dataset (global ROC-AUC metric). (b) Evaluation over short proteins (640 to 900 aa), comparing the scores obtained by processing each entire sequence through a single window with those obtained from multiple windows. Three metrics are used to compare the scores: Spearman’s correlation (left), mean squared error (center) and 95th percentile of the absolute difference (right). The comparison was performed over 500 randomly chosen proteins of length 640 to 900 aa. To accommodate different window sizes with the weighted-average approach, we rescaled the range of the sigmoid function (described in Extended Data Fig. 5) in proportion to the window size. Points along the curves correspond to the mean metric values across the 500 proteins; error bars correspond to 95% confidence intervals for the means.
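The three comparison metrics in panel (b), and the per-point summary across proteins, can be illustrated with a minimal sketch. It is not the authors' code. It assumes each protein yields two equal-length lists of per-variant scores (one from a single window, one from multiple windows). All function names are hypothetical, the percentile uses linear interpolation between order statistics, and the confidence interval uses a normal approximation (mean ± 1.96 standard errors); the caption does not specify either convention.

```python
import math
from statistics import mean, stdev


def average_ranks(values):
    """Return 1-based ranks, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of tied values that starts at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length, non-constant sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def spearman(x, y):
    """Spearman's correlation: Pearson correlation of the rank vectors."""
    return pearson(average_ranks(x), average_ranks(y))


def percentile(values, q):
    """q-th percentile (0 to 100), linearly interpolated (an assumed convention)."""
    s = sorted(values)
    pos = (len(s) - 1) * q / 100
    lo, hi = math.floor(pos), math.ceil(pos)
    return s[lo] + (s[hi] - s[lo]) * (pos - lo)


def compare_scores(single, multi):
    """Compute the three panel (b) metrics for one protein."""
    abs_diff = [abs(a - b) for a, b in zip(single, multi)]
    return {
        "spearman": spearman(single, multi),
        "mse": mean((a - b) ** 2 for a, b in zip(single, multi)),
        "p95_abs_diff": percentile(abs_diff, 95),
    }


def mean_with_ci95(values):
    """Mean and normal-approximation 95% CI (assumed; needs at least 2 values)."""
    m = mean(values)
    half_width = 1.96 * stdev(values) / math.sqrt(len(values))
    return m, m - half_width, m + half_width


if __name__ == "__main__":
    # Made-up toy scores for two hypothetical proteins, for illustration only.
    proteins = [
        ([-7.1, -2.3, -9.8, -0.4], [-7.0, -2.5, -9.6, -0.5]),
        ([-3.2, -8.8, -1.1, -5.0], [-3.0, -8.9, -1.4, -5.1]),
    ]
    per_protein = [compare_scores(s, m) for s, m in proteins]
    print(mean_with_ci95([p["mse"] for p in per_protein]))
```

Applying `compare_scores` to each protein and then `mean_with_ci95` to each metric across proteins gives one point and one error bar per window size, which is how the caption describes the curves.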
