Extended Data Fig. 1: Detecting clustered mutations and simulating processes that generate clustered mutations.
From: DNA mismatch repair promotes APOBEC3-mediated diffuse hypermutation in human cancers

a, Method to determine significant mutation clustering using HyperClust. A baseline distribution is generated by repeatedly shuffling mutations (R1, R2, …, Rn) within 1 Mbp windows to loci with matching trinucleotide contexts. For every mutation, the observed intermutational distance to its nearest neighbour (nIMD) is compared with distributions of expected IMDs (from randomized data) to determine a local FDR (lfdr). Thresholding by lfdr yields clustered mutation calls (blue). b, Overview of study. c, Precision-recall curves for models in Fig. 1a, derived from simulated data with spiked-in mutation clusters: kataegis (top; with five mutations per cluster at an average 600 bp pairwise distance) or omikli_M (bottom; two mutations at 101 bp). Two examples of high mutation burden tumors (TCGA-AP-A0LD, TCGA-AP-A0LE) were used to generate the background mutation distributions. d, e, Testing accuracy of mutation cluster calling methods using simulated data. Points represent randomized tumor samples into which spiked-in mutation clusters were introduced. Samples are ordered according to total mutation burden (panel d). Columns show different performance metrics: F1 score, precision, and recall, all at lfdr = 20%. Rows represent different types of spiked-in mutation clusters (IMD distributions plotted in panel e, where kataegis clusters have five mutations and omikli_K/M/O clusters have two mutations). Boxplots compare cluster calling methods, including implementations of some previous methodologies (details in Methods). The “strand-clonality-lfdr” method (blue) is the HyperClust method used throughout our work. f, g, Poisson mixture modelling (related to Fig. 1d) of the number of mutations per cluster, showing the relative likelihood (panel f) of models with an increasing number of components and the density functions (panel g) of a model with two Poisson components. The solid line represents the mean and the dashed lines the 95% C.I.
h, Number of mutation events per tumor sample (x axis, n) for each local hypermutation type (rows), shown separately for A3-context TCW>K mutations and for all remaining mutations (columns).
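
The comparison described in panel a can be illustrated with a minimal sketch. This is not the HyperClust implementation: the function names `nearest_imd` and `binned_lfdr`, the distance bins, and the simple per-bin ratio of expected to observed counts are all assumptions made for illustration, and the randomized nIMD values are taken as given rather than produced by trinucleotide-matched shuffling.

```python
from bisect import bisect_left
import math


def nearest_imd(positions):
    """Nearest-neighbour intermutational distance (nIMD) for each mutation.

    `positions` are coordinates on a single chromosome. Results are
    returned in sorted-position order; a lone mutation gets math.inf.
    """
    pos = sorted(positions)
    out = []
    for i, p in enumerate(pos):
        left = p - pos[i - 1] if i > 0 else math.inf
        right = pos[i + 1] - p if i < len(pos) - 1 else math.inf
        out.append(min(left, right))
    return out


def binned_lfdr(observed, randomized_runs, edges):
    """Hypothetical per-bin local FDR estimate (an assumption, not the
    published estimator): expected count in a distance bin, averaged over
    the randomized runs, divided by the observed count, capped at 1.

    `edges` are ascending upper bounds of the bins; bin i covers
    (edges[i-1], edges[i]]. Distances above the last edge are ignored.
    """
    def counts(values):
        c = [0] * len(edges)
        for v in values:
            i = bisect_left(edges, v)
            if i < len(edges):
                c[i] += 1
        return c

    obs = counts(observed)
    exp = [0.0] * len(edges)
    for run in randomized_runs:
        for i, n in enumerate(counts(run)):
            exp[i] += n / len(randomized_runs)
    # Bins with no observed mutations carry no evidence of clustering.
    return [min(1.0, e / o) if o else 1.0 for e, o in zip(exp, obs)]


# Toy data: two tight pairs of mutations and one isolated mutation.
observed_nimd = nearest_imd([100, 150, 1000, 5000, 5010])
# Hypothetical nIMDs from two randomizations (R1, R2) of the same sample.
randomized_nimd = [[500, 500, 2000, 2000, 80], [700, 700, 3000, 3000, 3000]]
lfdr_per_bin = binned_lfdr(observed_nimd, randomized_nimd, [100, 1000, 10000])
```

In this toy case, short distances are far more frequent in the observed data than in the randomizations, so the shortest-distance bin receives a low lfdr and its mutations would pass a threshold such as lfdr = 20%.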