Extended Data Fig. 9: Bioinformatics pipeline for identification of disease alleles.
From: Precise therapeutic gene correction by a simple nuclease-induced double-stranded break

Schematic of the bioinformatics pipeline used to identify all microduplications amenable to efficient MMEJ-mediated collapse within the ‘coding’ regions (exome_calling_regions.v1; mainly exons plus 50 flanking bases) of the gnomAD genome and exome databases (version 2.0.2). Insertion variants observed in the two databases were used for analysis (variants occurring in both databases were counted once). Insertions that neither add a repeat unit to an existing tandem repeat nor are themselves a perfect repeat were then filtered to retain only duplications that span 2–40 bp and are amenable to CRISPR–Cas9 targeting. This dataset was cross-referenced against the ClinVar database (clinvar_20180225.vcf) to select variants reported as pathogenic, which ultimately yielded 143 likely disease-causing microduplications.
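
The sequence-level filters described above can be illustrated with a minimal Python sketch. It is not the authors' code: the function names, the toy reference strings, the coordinate convention (`pos` is the number of reference bases to the left of the insertion point) and the periodicity test used to detect an existing tandem repeat are all assumptions made for illustration. The CRISPR–Cas9 targetability check is omitted, and the set arithmetic at the end only mimics the "counted once" pooling and the ClinVar cross-reference on hypothetical variant keys.

```python
from typing import Set, Tuple

# Hypothetical variant key: (chromosome, position, inserted sequence).
VariantKey = Tuple[str, int, str]


def is_perfect_repeat(seq: str) -> bool:
    """True if seq is two or more copies of a shorter unit (e.g. 'ATAT', 'AA')."""
    return (seq + seq).find(seq, 1) < len(seq)


def is_simple_microduplication(ref: str, pos: int, inserted: str,
                               min_len: int = 2, max_len: int = 40) -> bool:
    """Assumed criteria: the insertion duplicates adjacent reference sequence,
    is 2-40 bp long, is not itself a perfect repeat, and does not extend a
    tandem repeat that already exists in the reference."""
    n = len(inserted)
    if not (min_len <= n <= max_len) or is_perfect_repeat(inserted):
        return False
    # The insertion must copy the reference sequence immediately beside it.
    if ref[pos:pos + n] == inserted:
        start = pos
    elif pos >= n and ref[pos - n:pos] == inserted:
        start = pos - n
    else:
        return False
    end = start + n
    # Grow the matched copy while the reference stays periodic with period n.
    while start > 0 and ref[start - 1] == ref[start - 1 + n]:
        start -= 1
    while end < len(ref) and ref[end] == ref[end - n]:
        end += 1
    # A periodic tract of two or more unit lengths means the reference
    # already carried a tandem repeat, so the insertion only extends it.
    return end - start < 2 * n


def pathogenic_candidates(genomes: Set[VariantKey], exomes: Set[VariantKey],
                          pathogenic: Set[VariantKey]) -> Set[VariantKey]:
    """Pool both call sets (shared variants counted once), then keep only
    keys that also appear in a set of pathogenic variants."""
    return (genomes | exomes) & pathogenic


# Toy examples on made-up sequences.
print(is_simple_microduplication("TTGACCATGGCA", 4, "CCAT"))  # True: single-copy duplication
print(is_simple_microduplication("GGACTACTGG", 3, "CTA"))     # False: extends an ACT repeat
print(is_simple_microduplication("TTGACCATGGCA", 4, "GTC"))   # False: not a duplication
```

The periodicity test is used here so that an insertion written as a rotated unit (such as `CTA` inside an `ACTACT` tract) is still recognised as a repeat expansion; a real pipeline would also need to handle variant normalisation and strand, which this sketch ignores.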