Extended Data Fig. 5: The sliding-window approach to tile long protein sequences with ESM1b.
From: Genome-wide prediction of disease variant effects with a deep protein language model

(a) The variant weights over each window’s coordinates (1 ≤ i ≤ 1022), defined by the function: w(i) = 1 / (1 + exp(-(i - 128)/16)) for 1 ≤ i < 256, w(i) = 1 for 256 ≤ i < 1022 - 256, and w(i) = 1 / (1 + exp((i - 1022 + 128)/16)) for 1022 - 256 ≤ i ≤ 1022. (b) An example tiling of a protein sequence of length 1,479 aa. Left: raw window weights (as in (a)). Right: normalized weights (summing to 1 at each protein position). (c) Example of how a specific protein isoform (UniProt ID Q7Z460-5) is tiled. Top panel: ESM1b effect scores over the left window (1 ≤ i ≤ 1022; orange), the right window (458 ≤ i ≤ 1479; green), and the final weighted average over the protein’s entire length (blue). Middle: ESM1b effect scores over the left window. Bottom: ESM1b effect scores over the right window. (d) An example tiling of a larger protein sequence of length 3,703 aa, as in (b). Top: the locations of the 7 windows used to tile the sequence. Middle: raw window weights. Bottom: normalized weights. (e) Example of how a specific protein (UniProt ID Q15911) is tiled, as in (c). As the two examples show, the effect scores tend to be consistent across different windows (with edge effects sometimes being more pronounced).
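
The weighting in (a) and the normalization in (b) can be sketched in a few lines of Python with NumPy. This is a minimal illustration based only on the formula and window coordinates given in this caption, not the authors' implementation. The function names (window_weights, normalized_weights, combine_scores) are hypothetical, the window start positions are supplied by the caller because the caption does not state how they are chosen, and the sketch assumes w(i) is applied identically to every window, including those at the protein termini.

```python
import numpy as np

WINDOW = 1022  # window length, from the caption
EDGE = 256     # width of the down-weighted region at each window end
MID = 128      # midpoint of the sigmoid ramp
SCALE = 16     # sigmoid scale


def window_weights(window=WINDOW):
    """Raw weights w(i) for window coordinates i = 1..window, as in panel (a)."""
    i = np.arange(1, window + 1, dtype=float)
    w = np.ones(window)
    left = i < EDGE              # 1 <= i < 256
    right = i >= window - EDGE   # 1022 - 256 <= i <= 1022
    w[left] = 1.0 / (1.0 + np.exp(-(i[left] - MID) / SCALE))
    w[right] = 1.0 / (1.0 + np.exp((i[right] - window + MID) / SCALE))
    return w


def normalized_weights(protein_len, starts, window=WINDOW):
    """Per-window weights that sum to 1 at each protein position, as in panel (b).

    `starts` holds 1-based window start positions (an input assumed here).
    Returns an array of shape (number of windows, protein_len).
    """
    w = window_weights(window)
    raw = np.zeros((len(starts), protein_len))
    for k, s in enumerate(starts):
        if s < 1 or s - 1 + window > protein_len:
            raise ValueError(f"window starting at {s} falls outside the protein")
        raw[k, s - 1 : s - 1 + window] = w
    total = raw.sum(axis=0)
    if np.any(total == 0):
        raise ValueError("some protein positions are not covered by any window")
    return raw / total


def combine_scores(scores, norm):
    """Weighted average of per-window scores; NaN marks positions outside a window."""
    return np.nansum(scores * norm, axis=0)


# The two windows described in panel (c) for a 1,479 aa protein.
norm = normalized_weights(1479, starts=[1, 458])
```

Because the weights are normalized per position, any position covered by a single window receives a normalized weight of 1 regardless of its raw weight, so the ramp only matters where windows overlap.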