Supplementary Figure 2: Deep learning model outputs from the hold-out test set (n = 13,530 variants) are well scaled across all predicted classes (ambiguous, fail, and somatic).

The correlation between the model output and the manual review call was assessed for each of the three call classes (ambiguous, fail, and somatic). For each class, model outputs were binned into ten groups spanning 0.00 to 1.00. For each bin, the numbers of manual review calls that agreed and disagreed with the given class were plotted. The ratio of agreement to disagreement was also plotted for each bin and compared to the identity line (x = y) using Pearson’s correlation coefficient (r).
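
The binning and comparison described above can be sketched as follows. This is a minimal illustration, not the authors' code: the function names `reliability_by_bin` and `pearson_vs_bin_centers` are hypothetical, bins are assumed to be equal-width, and agreement is assumed to be expressed as agree / (agree + disagree) so that it lies on the same 0 to 1 scale as the identity line.

```python
import numpy as np

def reliability_by_bin(scores, agrees, n_bins=10):
    """Bin one class's model outputs and summarize manual-review agreement.

    scores: model output for the class, one value in [0, 1] per variant.
    agrees: True where the manual review call matches the class.
    """
    scores = np.asarray(scores, dtype=float)
    agrees = np.asarray(agrees, dtype=bool)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    # Map each score to a bin index; clip so a score of exactly 1.0
    # falls into the last bin instead of overflowing.
    idx = np.clip(np.digitize(scores, edges) - 1, 0, n_bins - 1)
    n_agree = np.bincount(idx[agrees], minlength=n_bins)
    n_disagree = np.bincount(idx[~agrees], minlength=n_bins)
    total = n_agree + n_disagree
    centers = (edges[:-1] + edges[1:]) / 2
    # Assumption: agreement fraction = agree / (agree + disagree).
    # Empty bins are left as NaN.
    frac = np.full(n_bins, np.nan)
    np.divide(n_agree, total, out=frac, where=total > 0)
    return centers, n_agree, n_disagree, frac

def pearson_vs_bin_centers(centers, frac):
    """Pearson r between bin centers and agreement fractions (non-empty bins)."""
    keep = ~np.isnan(frac)
    return float(np.corrcoef(centers[keep], frac[keep])[0, 1])
```

Under these assumptions, a well-scaled class would show agreement fractions that rise with the bin centers and lie close to x = y. The function would be called once per class (ambiguous, fail, somatic), with `agrees` recomputed for each.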