Main

Rare diseases collectively affect 3.5% to 5.9% of people worldwide5. Despite advances in genomic sequencing, molecular diagnosis continues to elude 50% to 80% of patients presenting to genetic clinics1. Furthermore, fewer than half of the 10,000 rare Mendelian diseases in the Online Mendelian Inheritance in Man (OMIM) database6 have an established genetic basis. Diagnostic failure may arise because of a lack of routine screening for non-coding7 or structural variants1. However, it is likely that a substantial proportion of the pathogenic variants responsible for patients undiagnosed with rare disease (cases) reside in those yet to be discovered genes associated with (possibly very rare) disorders. The scale of rare disease sequencing studies, such as the Undiagnosed Disease Network8, Centers for Mendelian Genomics9, Deciphering Developmental Disorders10 and the 100,000 Genomes Project (100KGP)2, offers expanded opportunities to provide insight into pathogenic mechanisms of inherited disease, including the possibility of establishing disease–gene associations through case–control analyses, akin to methods used previously to identify common genetic variants influencing the risk of complex disorders. Such an approach provides much-needed power to identify genes harbouring rare pathogenic variants.

To identify disease-associated genes, we recently developed a framework that analyses rare protein-coding variants identified by the Exomiser variant prioritization tool11, in a preliminary version of the 100KGP data1, to conduct gene-based burden testing of single probands (first people in a family identified as affected by a rare genetic disease) and family members relative to control families. In silico triage in this previous study highlighted 22 new disease–gene associations, three of which have also been reported in independent studies12,13,14.

In this study, we have enhanced our gene burden analytical framework with a refined rare variant filtering and change of statistical modelling that are more tailored to Mendelian diseases and unbalanced case–control studies with rare events, extended it for generic application to any large-scale, rare disease sequencing cohorts, complemented it with visualization scripts and released it as an open-source R analytical framework called geneBurdenRD. In addition, we report on the application of the approach to a larger cohort from the final 100KGP data, including 34,851 families, 226 rare diseases and a starting pool of 4,643,230 rare candidate variants with improved in silico and added clinical expert triage of 69 probable new disease–gene associations.

Gene burden analytical framework

We have developed an open-source R framework (https://github.com/whri-phenogenomics/geneBurdenRD) allowing gene burden testing of variants in user-defined cases (sequenced probands affected by tested disease along with sequenced family members where available) versus controls from rare disease sequencing cohorts. The minimal input to the framework is (1) a file of rare, putative disease-causing variants obtained from merging and processing Exomiser output files for each of the cohort samples, (2) a file containing a label for each case–control association analysis to perform within the cohort and (3) a (set of) corresponding file(s) with user-defined identifiers and case–control assignment per sample. Cases and controls in a cohort could be defined in many ways, for example, by recruited disease category as we have done for the application to the 100KGP below, by specific phenotypic annotations or by phenotypic clustering. The 100KGP is perhaps atypical, compared with other projects, in having a virtual panel-based approach for all types of rare disease that required detailed inclusion and exclusion criteria for recruitment to specific disease categories, allowing accurate case–control definitions. Phenotype-based selection strategies would be required when this is not the case and can overcome false recruitment but will require comprehensive and accurate phenotyping for maximum effectiveness. As part of the variant quality control, the initial input of rare, putative disease-causing variants is further filtered down to remove possible false positive variant calls and/or relatively common variants in the project itself. Finally, for each case–control analysis, variants observed in cases are discarded if seen in at least one control to mimic a Mendelian, fully penetrant-like disease model. The framework then assesses false discovery rate (FDR)-adjusted disease–gene associations through a cohort allelic sums test (CAST) statistic used as covariate in a Firth’s logistic regression model (Methods). Genes are tested for enrichment in cases versus controls of rare, protein-coding, segregating variants that are (1) predicted loss-of-function (LoF), (2) highly predicted pathogenic (Exomiser variant score at least 0.8), (3) highly predicted pathogenic and present in a constrained coding region (CCR)15 or (4) de novo (restricted to only trios or larger families where de novo calling is possible and provided by the user) (Methods). As well as various output files annotating these case–control association tests, volcano plots are generated summarizing the FDR-adjusted P values of all the gene-based tests for each case–control association analysis, along with lollipop plots of the relevant variants in cases and controls and plots of the hierarchical distribution of the Human Phenotype Ontology (HPO) case annotations for individual disease–gene associations.

Application to 100KGP

A rare variant gene burden analysis was performed on a cohort of 34,851 single probands and larger families (72,690 genomes), filtered down from an initial total of 35,548 (Methods), available from the 100KGP rare disease pilot and main programme (Data Release v.11) (Fig. 1 and Supplementary Table 1). The distribution of sex and genetically inferred ancestry in the overall cohort (Supplementary Table 1) is largely as expected from the reported ethnicity of the UK population (86% people of European descent, 8% people of Asian descent, 3% people of African/Caribbean descent, 2% mixed and 1% other in the 2011 census of England and Wales). A starting pool of 4,643,230 rare, protein-coding, segregating and most predicted pathogenic (per gene) variants for the analysis was derived by running Exomiser for each single proband and family and applying the initial variant quality control. Our pipeline was then used to detect statistically significant gene-based enrichment in relevant variant categories (predicted LoF, highly predicted pathogenic, highly predicted pathogenic in CCR regions, de novo) for ‘cases’ in each of 226 ‘specific diseases’ used for patient recruitment by the 100KGP1 versus ‘controls’ who were defined as probands from any other 20 broad ‘disease groups’ in the project (for example, intellectual disability cases were compared with all non-neurological probands as controls). A pairwise phenotypic comparison (Supplementary Fig. 1) highlights that the highest levels of similarity between samples are within these ‘specific disease’ categories, with extensive similarity often seen to other diseases in the same broad group (excluded in our control strategy), and no or more modest levels to the control disease categories (note that when combining all the control diseases, the median phenotypic similarity score between cases and controls was zero for all tested diseases). Applied cutoffs required at least five case probands per ‘specific disease’ and at least four probands (of which, at least two cases) with a relevant rare variant per disease–gene burden test. Overall, we performed 161 case–control gene burden analyses. Only gene-based enrichments that were more frequent in cases than controls (putative ‘disease-causing’ as opposed to ‘protective’) were considered in the statistical correction for multiple testing (51,899 association tests) (Fig. 1 and Supplementary Table 2; Methods).

Fig. 1: Rare variant gene burden analysis of 100KGP data.
figure 1

Flowchart of the rare variant gene-based analytical framework, including triaging of results. aAfter sample quality control. bCAST statistic within Firth’s logistic regression model adjusted for sex, age, family size and inferred genetic ancestry. AD, autosomal dominant; oe_lof, observed/expected LoF variants.

We identified 165 previously known and 141 new potential disease–gene associations (Supplementary Table 3 and Extended Data Fig. 1), imposing a 0.5% FDR. At this threshold, we observed a reasonable recall (3.2%) and precision (47.4%) of currently known associations (using green genes in PanelApp16 only), a balanced number of known and new signals and no enrichment at all for synonymous variants (empirical negative control burden test). Not previously known (new) signals were initially defined, at the first round of analysis in March 2021, as having no documented evidence for an association within OMIM and absence from the ‘specific disease’ curated panel of high confidence (green) genes in PanelApp16. Enrichment of predicted LoF, highly predicted pathogenic, also in CCR and de novo variants was observed in 57%, 17%, 10% and 16% of known and 47%, 50%, 1% and 2% of new associations respectively, revealing discovery was driven mostly by predicted pathogenic, missense variants.

Of our 141 new signals, 5 have had interim independent supporting evidence (relevant OMIM entry and/or green gene in PanelApp) emerge since our initial assessment was performed in 2021: mitochondrial disorders with MORC2 (ref. 17); craniosynostosis with SOX6 (ref. 18); osteogenesis imperfecta with COPB2 (ref. 19); cystic kidney disease with IFT140 (ref. 20) and epidermolysis bullosa with TUFT1 (ref. 21).

The remaining potential associations were further filtered and prioritized to 69 of 136 (57%) by removing (1) those in which the gene was a non-protein-coding RNA (12/136); (2) those for signals driven by dominant, predicted LoF variants where the Genome Aggregation Database (gnomAD v.2.1.1)22 indicates there is no evidence for haplo-insufficiency (gnomAD observed/expected LoF at least 0.5) (19/136); or (3) those in which any of the cases driving the signals had already received an alternative genetic diagnosis (45/136) (Fig. 1 and Supplementary Table 4). It is possible but unlikely that criteria (3) will have removed genuine associations because of incorrect/partial diagnoses as the Genomics England diagnostic pipeline is fairly conservative, generally requiring predicted LoF, de novo or known pathogenic variants in well-established genes for a diagnosis. In comparison with the new association signals, 163 of 165 (99%) of signals from known disease genes passed criteria from (1) and (2). Variants responsible for the 69 associations were classified automatically according to ACMG using Exomiser, including whether predicted LoF variants fulfilled the latest recommendations for PVS1 (ref. 23) (Supplementary Table 5).

An extensive review of the literature as well as the phenotype evidence from Exomiser, in collaboration with members of the Genomics England Clinical Interpretation Partnerships (GeCIPs), was performed to identify supporting evidence from the biological function of each of the 69 genes, known disease associations or the phenotypes of the gene-deficient mouse and/or other animal models. In silico analyses were also undertaken to identify high quality StringDB24 protein–protein associations between the gene signal and any other genes known to be associated with the disease, or with highly specific expression in the most relevant tissue for the disease. This combined curation highlighted 30 associations supported by experimental evidence: 21 from literature curation of a gene function fitting the likely disease mechanism with further lines of evidence for many, 5 on the basis of mouse models and other evidence for some, and 4 on protein–protein evidence only (Fig. 1 and Supplementary Table 4 (highlighted in bold in the summary column)).

ClinGen25 has developed a robust set of criteria to assess the evidence for disease–gene associations and we applied these to our 69 associations. Evidence of causality was moderate for 27 associations and limited for the remainder (Fig. 1 and Supplementary Table 4). Of the 69 new associations, we chose candidates with previous functional data for the gene fitting the likely disease mechanism (Supplementary Table 4) and a ClinGen classification score of at least 8 to restrict to a manageable size of five candidates for further highlighting here.

Monogenic diabetes with UNC13A

We identified a dominant association (ClinGen moderate score of 9) between variants in UNC13A and specific disease ‘diabetes with additional phenotypes suggestive of a monogenic aetiology’. The association is driven by rare predicted LoF variants in two singleton cases with the only recorded phenotypes being diabetes mellitus in both and one further phenotype in one: p.Ala53Serfs*50 and p.Gly44* (adjusted P value, 0.0005; odds ratio, 329.8; 95% confidence interval, 58.2–1334.0; Fig. 2a and Supplementary Table 4). Both variants are absent from gnomAD v.4.1.0, classified as ‘likely pathogenic’ and predicted to undergo nonsense-mediated decay (NMD) (PVS1). The gene is depleted for rare LoF variants in gnomAD (observed/expected LoF = 0.09 (0.05–0.16) and probability of being LoF intolerant (pLI) = 1). UNC13A is a diacylglycerol and phorbol ester receptor gene with evidence for a role in the regulation of β cells3. Neonatal pancreatic β cells extracted from UNC13A-knockout mice and knock-in mice lacking the DAG binding ___domain show impaired second phase of insulin secretion in response to glucose stimulation4 and the heterozygous mouse knockout model shows impaired glucose tolerance26. In addition, co-expression network analysis of UNC13A and known monogenic diabetes genes from PanelApp shows the most significant enrichment in the pancreas Genotype Tissue Expression (GTEx) v.6 module (FDR-adjusted P value of 0.01), with UNC13A co-expressed with 6 out of 43 of the known genes. However, predicted LoF variants (three splice site, five stop gain or frameshift) were also seen in controls with no apparent history of diabetes indicating incomplete penetrance, later onset (year of birth of the two was 1973 and 1974 compared with 1956–2007 (mean 1980) for controls), or that the variants in controls are not genuinely LoF.

Fig. 2: Evidence for disease–gene associations.
figure 2

a, Location of rare, predicted pathogenic UNC13A variants associated with monogenic diabetes observed in two singleton cases. b, Location of rare, predicted pathogenic RBFOX3 variants associated with generalized epilepsies observed in two affected sisters and one singleton case; GTEx data showing specific brain expression. Het, heterozygous; hom ref, homozygous reference; TPM, transcripts per million.

Epilepsy with RBFOX3

We identified a dominant association (ClinGen moderate score of 11) between variants in RBFOX3 and specific disease ‘familial genetic generalized epilepsies’ (adjusted P value, 0.0023; odds ratio, 197.8; 95% confidence interval, 28.6–1,383.9; Fig. 2b and Supplementary Table 4). The association is driven by rare predicted pathogenic variants in two cases with seizure phenotypes: p.Asn105Asp in two affected sisters and p.Gln71* in a proband with further learning disability phenotypes; both variants absent from gnomAD v.4.1.0 and ClinVar and classified as variant of uncertain significance (VUS). A mouse model shows increased susceptibility to seizures27 and RBFOX3 is expressed specifically in the brain, particularly the cerebellum. A potential association between variants in RBFOX3 and epilepsy was published back in 201328 but no link is curated in OMIM or PanelApp and Gene Curation Coalition (GenCC) records limited evidence for the association. Our findings add strength to this association and a recent study has shown that RBFOX3 plays a critical role in the regulation of epilepsy and establishes it as a possible treatment path29.

Charcot–Marie–Tooth disease with ARPC3

We identified an association (ClinGen moderate score of 8) between variants in Actin Related Protein 2/3 Complex Subunit 3 (ARPC3) and specific disease ‘Charcot–Marie–Tooth (CMT) disease’. The association is driven by rare heterozygous variants in four cases: a p.Leu21Gln VUS observed in a duo and unrelated singleton case (allele frequency, 0.00001984 in gnomAD v.4.1.0 but in people of South Asian descent only), a p.Lys84dup disruptive inframe insertion VUS (absent from gnomAD v.4.1.0) in a singleton and a c.6 G>C VUS splice region variant (absent from gnomAD v.4.1.0) in a singleton (adjusted P value, 0.0015; odds ratio, 38.5; 95% confidence interval 9.6–144.6; Fig. 3a and Supplementary Table 4). The four cases show strong phenotypic similarity to each other (mean PhenoDigm score from pairwise, reciprocal, non-self hits was 0.84). Only 10 out of 1,000 randomly sampled CMT sets of the same size achieved the same mean score or higher, indicating the ARPC3 families with common features of distal upper and lower limb muscle weakness and peripheral axonal neuropathy are phenotypically distinct from the other CMT cases. Protein–protein associations are observed with known CMT genes: DNM2 and SYT2. Actin filaments play a key role in the neuronal cytoskeleton, the dysregulation of which is associated with various neurological conditions including CMT30, and ARPC3 may regulate dendritic spine morphology downstream of miR-29a/b31. An ArpC3 conditional knockout mice fails to ensheath axons32.

Fig. 3: Evidence for disease–gene associations.
figure 3

a, Location of rare, predicted pathogenic ARPC3 variants associated with CMT disease observed in three singleton cases and one proband not seen in the unaffected mother. b, Location of rare, predicted pathogenic POMK variants associated with corneal abnormalities observed in one singleton case and two trios with co-segregation from the affected mothers in the female probands; evidence of highly specific expression in the corneal epithelium from publicly available RNA-seq datasets (Methods). c, Location of rare, predicted pathogenic GPR17 variants associated with schizophrenia observed in two singleton cases. Comp het, compound heterozygous.

Corneal abnormalities with POMK

We identified a dominant association (ClinGen moderate score of 9) between variants in POMK and specific disease ‘corneal abnormalities’. The association is driven by rare, predicted pathogenic variants (two predicted LoF, one missense) in three cases with recorded phenotypes collectively suggestive of anterior segment dysgenesis (ASD) (adjusted P value, 0.0001; odds ratio, 151.5; 95% confidence interval, 37.6–473.3; Fig. 3b and Supplementary Table 4). ASD is a spectrum of developmental disorders affecting the anterior segment of the eye, often with incomplete penetrance and/or variable expressivity33. Co-segregation was apparent in two trios where a heterozygous, splice acceptor variant c.-21-1G>A (gnomAD v.4.1 allele frequency, 0.000011) and a heterozygous, frameshift stop gain variant p.Arg339* (gnomAD v.4.1 allele frequency, 0.000001) are inherited from the affected mothers in the female probands (Fig. 3b). Both predicted LoF variants were classified as VUS, with no NMD (PVS1) assignment. A further heterozygous missense variant p.Thr79Arg (gnomAD v.4.1 allele frequency, 0.000011) was observed in a singleton case and classified as ‘likely benign’. In the independent cohort described in Supplementary Table 4, one rare (gnomAD v.4.1 allele frequency, 0.000038), heterozygous missense variant p.Arg86His was identified in a mother and a son from a family from the Czech Republic diagnosed with ASD. POMK is involved in the presentation of the laminin-binding O-linked carbohydrate chain of alpha-dystroglycan, which forms transmembrane linkages between the extracellular matrix and the exoskeleton. Given the absence of corneal-specific expression data in the GTEx Project, we interrogated publicly available bulk RNA sequencing (RNA-seq) datasets34,35, which showed expression across all corneal cell types analysed, with highest amounts detected in the corneal epithelium (Fig. 3b). Bi-allelic (predicted LoF) variants in POMK are associated with autosomal recessive muscular dystrophy-dystroglycanopathy (congenital with brain and eye anomalies), type A, 12 (OMIM:615249)—a disease that also includes several ocular abnormalities (microphthalmia, buphthalmos, coloboma, retinal degeneration and cataract)—indicating that POMK plays a crucial role in ocular development36. Morpholino knockdown of the pomk gene in zebrafish has been reported to show several defects, including developmental ocular abnormalities36. Whether the identified rare variants here could induce ASD by POMK haplo-insufficiency, or by exerting dominant gain-of-function effects, deserves future investigation. With POMK apparently LoF-tolerant (pLI = 0) and ASD not having been reported in carriers of muscular dystrophy-dystroglycanopathy type A, 12-associated variants, the latter seems more likely.

Schizophrenia with GPR17

We identified an association (ClinGen moderate score of 9) between variants in GPR17 and specific disease ‘schizophrenia plus additional features’. The association is driven by rare predicted LoF variants in two singleton cases with schizophrenia and other psychiatric phenotypes: a p.Trp6* VUS (gnomAD v.4.1 allele frequency, 0.00018) (in compound-heterozygosity with p.Arg248Gln; allele frequency, 0.00032) and a heterozygous p.Glu129* VUS variant in the second case (allele frequency, 0.00001 in gnomAD v.4.1.0), both of which are predicted not to undergo NMD (adjusted P value, 0.0022; odds ratio, 189.7; 95% confidence interval, 29.1–882.0; Fig. 3c and Supplementary Table 4). GnomAD evidence (observed/expected LoF = 0.79 (0.45–1.46) and pLI = 0) does not support a haplo-insufficient mechanism for this gene. This may be explained by the variants acting in a recessive manner, or the high observed/expected LoF may represent the later onset, incomplete penetrance aspects of the disease and/or a polygenic mechanism of disease—the latter being probably given the phenotype. Direct protein–protein associations are recorded in StringDB for numerous other neurodevelopmental disorder genes: GABBR2, GNAI1, GNB1, GNB5, GPSM2, GRM1, PIK3CA, PIK3R2, PLCB1, PSAP, SPR and TRIO. Highly specific expression is observed in the brain, especially the cerebral cortex. Finally, GPR17 has been shown to regulate the oligodendrocyte differentiation and myelination that plays a role in several neurological diseases, including schizophrenia37.

Discussion

In this study, we have described a gene burden analysis of a large cohort of rare disease cases and identified 69 new disease–gene associations after triaging of the statistically significant signals. We highlight five with strong genetic and experimental evidence: monogenic diabetes associated with UNC13A, epilepsy with RBFOX3, CMT disease with ARPC3, anterior segment ocular abnormalities with POMK and schizophrenia with GPR17. However, further evidence is necessary before many of the new associations described here can be used clinically for diagnostics, counselling and management as a ClinGen classification of at least moderate evidence is required for inclusion in diagnostic genetic test panels38. For example, the addition of further strong functional study evidence would increase the score of the 42 limited ClinGen classification evidence candidates shown in Supplementary Table 4, such that they would be re-classified as moderate. We have submitted our evidence on the 69 associations to the GenCC database39. We are also pursuing the identification of variants in further independent cases through GeneMatcher (and linked Matchmaker Exchange nodes) for all our highlighted candidates, so far without success. Collection of more affected family members for cases in collaboration with the original recruiting clinicians could also raise the ClinGen category, although this is likely to be more difficult (for example, for a dominant association at least two large families with five affected members are needed to add one point of evidence).

Rare variant gene burden approaches have been used originally in the context of complex genetics where a steadily increased number of (typically) unrelated individuals, now reaching hundreds of thousands, are tested for genetic associations often across several quantitative and/or binary traits (genome-wide/phenome-wide association studies). In this context, variants that confer susceptibility to a certain disease are of interest and, also depending on the applied allele frequency threshold for variant filtering, people contributing several rare variants per gene are (typically) allowed in burden testing, this way increasing the chance to test variants with opposite effect (both protective and deleterious) in the same gene. Indeed, one of the main motivations behind the development of many of the existing gene burden tests beyond some basic implementations, for example, the CAST statistic used in our geneBurdenRD approach, is the need to tackle the presence of variants with different effect directions and/or effect sizes in the same gene40. Selection of the input rare variants is crucial in any gene burden approach and can affect the statistical power, and becomes paramount in the context of Mendelian diseases, where we are not after several susceptibility variants (in any effect direction), but rather a single putative disease-causative variant (or pair of variants in a compound heterozygous genotype). Therefore, a scenario where someone (whether case or control) contributes several rare genotypes to the same gene burden should not be contemplated. Our analytical framework (geneBurdenRD) is based on a convenient use of the established Exomiser variant prioritization tool for rare Mendelian diseases to perform the non-trivial step of variant annotation, scoring and segregation (using variant call format (VCF) files from single probands or family-based VCF files from nuclear families) and selection of the most predicted pathogenic, rare, segregating (putative disease-causing) variant(s) per proband/family and each gene (and each compatible mode of inheritance). geneBurdenRD also provides case–control variant count filtering that mimics a Mendelian fully penetrant-like disease model with variants seen in controls not included in the analysis.

Among state-of-the-art software for whole-genome regression modelling of genome-wide/phenome-wide association studies is REGENIE41, which has been shown to be substantially faster than other similar approaches such as FastGWA42, BOLT-LMM43 and SAIGE44. In addition to several statistical tests that are relevant mostly to complex genetics, such as single-variant, gene–gene and gene–environment interaction tests as well as conditional analyses, REGENIE and some of the other related software can perform various gene burden tests, for example, tailored tests to binary traits in both unbalanced and balanced case–control studies with rare variants, such as the saddle point approximation (SPA) approach45 and Firth’s logistic regression model46 (our statistical model of choice). Notably, both the SPA and the Firth’s correction have been shown to provide good control of Type 1 error, but the SPA approach implemented in SAIGE resulted in inflated effect-size estimates—a feature that was not observed with the Firth’s logistic regression in REGENIE41. Despite the ability to handle most commonly used input genetic and phenotypic files such as, for example, BGEN or PLINK/PLINK2, and tackle population structure and relatedness, REGENIE and similar approaches are not natively tailored to the analysis of rare, putative disease-causing, segregating variants within families for rare Mendelian diseases. The crucial and often cumbersome step of annotation, scoring and selection of the pool of relevant rare variants to test for gene burden is left to the user as a preliminary step with the suggestion to use external tools such as snpEFF or VEP. Our analytical framework geneBurdenRD overcomes this limitation and covers an important need for more tailored gene burden analytical tools for the analysis of rare Mendelian diseases, with approaches such as REGENIE remaining to be preferred for complex genetics analyses, across several quantitative and/or binary traits in (typically) unrelated people.

Although our approach applied to a large, rare disease cohort has successfully highlighted numerous known associations and indicated many previously unreported associations, it is not without limitations, pointing towards opportunities for future developments and re-analysis. For example, although we considered only single-nucleotide variants and small insertions or deletions in protein-coding genes in the current study, the inclusion of non-coding rare variation as well as structural variants may help unveil further molecular diagnoses and disease mechanisms. Despite the large-scale size of the 100KGP, 29 of the 226 diseases observed in this analysis have fewer than five cases and a further 36 did not pass our testing criterium of at least two same disease cases with relevant variants in a certain gene. This highlights the current limited power in discovering new (ultra-)rare monogenic disease associations and the need for even larger rare disease sequencing efforts. As the input to the statistical testing of each gene in our analytical framework is simply a matrix recording the presence or absence of a proband’s genotype passing each criterion for each case and control, one possibility to increase statistical power is federated analysis across rare disease sequencing projects as such (preferably, Exomiser-based) processed data should be sharable under most ethical and legal frameworks. Finally, whereas our focus was rare Mendelian, monogenic fully penetrant associations, further analytical developments and research are needed to uncover those missing genetic etiologies because of incomplete penetrance and/or variable expressivity as well as digenic and/or polygenic effects in rare diseases.

An alternative to our frequentist approach is offered by a Bayesian method for rare diseases called BeviMed47 (Bayesian evaluation of variant involvement in Mendelian disease), in which disease risk depends on the genotypes at rare variants in a locus, a latent mode of inheritance and a latent partition of variants into pathogenic and non-pathogenic subsets. BeviMed was used in a recent analysis of approximately the same cohort of rare disease patients in this study, where 241 known but only 19 new associations were found48. Indeed, 113 out of 165 of our known, and only 4 out of 141 of the potentially new (before triage), signals were also detected by BeviMed: TUFT1 associated with epidermolysis bullosa; SRP9 associated with ductal plate malformations, MORC2 associated with mitochondrial disorders and ARPC3 associated with CMT disease. Although a direct comparison is not straightforward because of the differences in the case–control selection strategies (the BeviMed study analysed 269 case sets as opposed to 161 that passed our testing criteria), let alone in the statistical approach, the relatively small overlap in signals highlights a possible complementarity of the two methods to discover new disease–gene associations from the same cohort.

There are 553 cases with no molecular diagnosis but with variant(s) contributing to one of the known disease–gene association signals that had not already been considered and classified as VUS or benign in the diagnostic report, giving an upper bound on the increase in diagnostic yield from review of these variants of 1.6% (553 of 34,851 cases analysed). Furthermore, 155 molecularly unsolved cases had a variant contributing to one of the 69 new associations giving an upper bound on the potential increase in diagnostic yield of 0.4% (155 of 34,851 cases analysed), if all genes were confirmed and the variants considered penetrant enough to be deemed pathogenic rather than just predictive. By making our analytical framework openly available for wider application to similar cohort data globally, we hope to substantially aid disease–gene discovery and new molecular diagnoses in rare Mendelian diseases in numerous other cohorts.

Methods

Rare disease genomes from the 100KGP

The Health Research Authority Research Ethics Committee East of England—Cambridge South (Ref. 14/EE/111) gave ethical approval for the 100KGP. Patients with rare diseases and affected and unaffected family members were enrolled to the 100KGP through 1 of the 13 NHS Genomic Medicine Centres across England, Northern Ireland, Scotland and Wales2. Consent was obtained from all participants to the 100KGP. The recruiting clinicians assigned each proband to a specific disease (according to a hierarchical disease classification available in the project described below) and provided patient’s phenotypic data according to the HPO49. An initial cohort of 74,061 genomes (35,548 single probands and larger families) from the rare disease pilot and main programme of the 100KGP (data release v.11) was available for analysis (March 2021). Genomes were sequenced using the TruSeq DNA polymerase-chain-reaction-free sample preparation kit (Illumina) on a HiSeq 2500 sequencer, which generates a mean depth of 32× (range, 27–54) and a depth greater than 15× for at least 95% of the reference human genome. Whole-genome sequencing reads were aligned to either the Genome Reference Consortium human genome build 37 (GRCh37) for the minority of earlier samples, or build 38 (GRCh38), with the use of Isaac Genome Alignment Software. Family-based variant calling of single-nucleotide variants and insertions or deletions for chromosomes 1 to 22, the X chromosome, and the mitochondrial genome (mean coverage, 2,814×; range, 142–16,581) was performed with the use of the Platypus variant caller50. Quality control performed by Genomics England highlighted that 81 of the probands had been recruited and sequenced twice and these duplicates were removed from our cohort. In addition, the required data for our Exomiser-based gene burden analysis, for example, recruited disease category and phenotypic terms, was not available for 16 families and these were also excluded from our cohort. The demographics of the cohort, presented in Supplementary Table 1, were obtained using Labkey in the Genomics England Research Environment. Genetic ancestry inference was performed by Genomics England by principal component analysis. A random forest model was subsequently trained to predict ancestry across five super-populations (European, African, Admixed American, South Asian, East Asian), with people assigned to ancestries on the basis of a probability threshold of greater than 0.8 (https://re-docs.genomicsengland.co.uk/ancestry_inference/).

Pool of rare, putative disease-causing variants for gene burden testing

The variant prioritization tool Exomiser11 (v.12.1.0 with default settings and latest 2007* (July 2020) databases) was then run on all available 35,451 single proband and family-based VCF files to obtain a pool of rare, protein-coding, segregating and most predicted pathogenic (per gene) variants to use in an rare variant gene-based burden testing analysis for the discovery of new rare Mendelian disease–gene associations as described below. For each proband/family and each gene, Exomiser selected a single configuration of contributing variants, that is, the most predicted (REVEL and MVP) pathogenic, rare (less than 0.1% autosomal/X-linked dominant or homozygous recessive, less than 2% autosomal/X-linked compound-heterozygote recessive; using publicly available sequencing datasets including gnomAD) protein-coding homozygous/heterozygous variant or compound-heterozygote variants that segregated with disease for each possible mode of inheritance. Coding variants (including canonical splice acceptor and donor and splice region (bases one to three of exon to three to eight of intron)) were selected by Exomiser by removing all those classified as FIVE_PRIME_UTR_EXON_VARIANT, FIVE_PRIME_UTR_INTRON_VARIANT, THREE_PRIME_UTR_EXON_VARIANT, THREE_PRIME_UTR_INTRON_VARIANT, NON_CODING_TRANSCRIPT_EXON_VARIANT, UPSTREAM_GENE_VARIANT, INTERGENIC_VARIANT, REGULATORY_REGION_VARIANT, CODING_TRANSCRIPT_INTRON_VARIANT, NON_CODING_TRANSCRIPT_INTRON_VARIANT and DOWNSTREAM_GENE_VARIANT. The Exomiser analysis did not return any candidate variants for 29 families, generally for larger families with several affected people where no rare, putative disease-causing variants remained after filtering, leading to an interim dataset size of 35,422 single probands and larger families. To control for false positive variant calls and/or relatively common variants in the project itself, we further discarded variants on the basis of how often they were observed in the Exomiser master dataset itself (frequency greater than 2% for variants in a compound-heterozygote genotype, greater than 0.2% for mitochondrial DNA genome variants, greater than 0.1% for heterozygote/homozygote variants). This led us to discard data from a further 41 families. Finally, potentially digenic probands with more than one recruited disease category were discarded from the analysis, leaving a total of 35,008 probands. As part of the sample quality control, kinship coefficients were used to control for cryptic relatedness. Genomics England provided kinship coefficients only for 29,180 of the 35,008 probands. Therefore, we calculated a genetic variant overlap measure (number of variants in common in the Exomiser results for two probands per total number of variants in the Exomiser results for the two probands) for all pairwise combinations of the 35,008 probands, and demonstrated that this was correlated strongly with the available kinship coefficients (Supplementary Fig. 2). Within each recruited disease category, we then identified genetically related probands (kinship coefficient greater than 0.088 corresponding to second-degree relatives or above, or an equivalent variant overlap score (0.1) threshold where kinship coefficient was not available). The proband in the pair with the most Exomiser results (less refined list of rare, putative disease-causing variants, usually from a smaller family size) was then dropped from all further analyses. This removed 157 probands/families, leading to a final input analysis dataset of 34,851 single probands and larger families (40,402 probands and affected family members and 32,288 unaffected family members) and 4,643,230 Exomiser-based candidate heterozygote/homozygote variants and compound-heterozygote genotypes (Supplementary Table 1 and Fig. 1). Furthermore, for each case–control analysis (see case–control definition below), a further variant quality control was applied with variants seen in at least one case (as heterozygote, homozygote or in a compound-heterozygote genotype) being discarded if seen in at least one control (as heterozygote/homozygote/compound-heterozygote, homozygote/compound-heterozygote, homozygote/compound-heterozygote, respectively) to mimic a Mendelian, fully penetrant-like disease model.

Exomiser-based rare variant gene burden testing for Mendelian diseases

A rare variant gene-based burden case–control analytical framework that exploits rare, putative disease-causing variants as annotated, filtered and scored by the variant prioritization tool Exomiser was used to identify new rare Mendelian disease–gene associations. The annotation of variants to genes comes from this Exomiser analysis using its default settings to identify the most damaging consequence to the set of Gencode-basic tagged Ensembl transcripts. The framework has been described previously1 and extended in this study (https://github.com/whri-phenogenomics/geneBurdenRD). Briefly, as to the application of the analytical framework to the rare disease component of the 100KGP, cases and controls were defined exploiting the hierarchical disease classification in the project itself where the recruiting clinicians assigned each proband to any of 228 ‘specific diseases’ (level 4); the ‘specific diseases’ are in turn grouped into less specific 91 ‘disease sub-groups’ (level 3), each of which corresponds to 1 of 20 broad ‘disease groups’ (level 2) (Supplementary Table 6). Two specific diseases (pontine tegmental cap dysplasia and childhood onset leukodystrophy) were never used for recruitment in the end, leaving a final set of 226 level 4 disease categories. A case set was then defined as all probands recruited under each of the 226 level 4 disease categories and its corresponding control set as all recruited probands except those under the level 2 category containing the specific level 4 disease, for example, hypertrophic cardiomyopathy cases were compared with all non-cardiovascular disorders probands as controls. As with the gene burden testing, the gene-based enrichment of variants in cases versus controls was quantified using the cohort allelic sums test (CAST)51 statistic under four proband genotype scenarios (irrespective of the mode of inheritance): (1) presence of at least one rare, predicted LoF variant; (2) presence of at least one rare, highly predicted pathogenic variant (Exomiser variant score of at least 0.8 (either predicted LoF or missense variants predicted to be pathogenic)); (3) presence of at least one rare, highly predicted pathogenic variant in a CCR and (4) presence of a rare, de novo variant (restricted to only trios or larger families where de novo calling is possible and provided by the user). These CCR regions were defined previously by looking for the absence of variation in gnomAD15 at various levels of certainty and in the application to the rare disease component of the 100KGP we used the 95% percentile download. Given that Exomiser selects by default a single configuration of ‘contributing’ variants for each proband/family, each gene and each possible mode of inheritance as compatible with available family-based data and we calculated the CAST statistic as the best observed irrespective of the mode of inheritance, the CAST statistic corresponds to a sum test statistic40. The gene burden association is then assessed either using a binary case–control status versus the CAST statistic in a right-tailed Fisher’s exact test as in the original implementation of the analytical framework1, or using the CAST statistic as a covariate in a Firth’s logistic regression model46 that is tailored to testing unbalanced case–control datasets with rare events. In the application to the rare disease component of the 100KGP, Firth’s logistic regression models were adjusted for age, sex, family size (single proband/duos/trios and larger families) and inferred genetic ancestry (Supplementary Table 1). To maintain statistical validity and power, the analysis was limited to those disease–gene associations where an arbitrary set of at least five cases exist for the specific disease tested and, for each of the four gene-based proband genotype scenarios above, where relevant variants in the gene were seen in at least four probands, of which at least two were cases (we would not follow up associations signals driven by single cases/families in the first instance). Only gene-based enrichments that were more frequent in cases than controls (putative disease-causing as opposed to protective) were considered in the statistical correction for multiple testing. The Benjamini and Hochberg method52 was used to correct for multiple testing; an overall FDR-adjusted P value threshold of 0.5% was used for claiming statistically significant disease–gene associations for further triaging.

Triaging

First in silico triage

The statistically significant associations were further filtered for those where (1) the gene was protein-coding as the Exomiser coding variant filtering settings also identified variants disrupting non-protein-coding RNA genes (gene type definitions from the human gene nomenclature committee (HGNC) website); (2) for dominant, LoF signals there was gnomAD evidence for haplo-insufficiency (gnomAD v.2.1.1 observed/expected LoF less than 0.5) and (3) none of the cases driving the signal were already assigned a molecular diagnosis in other genes as part of the 100KGP routine diagnostic pipeline.

Application of ClinGen framework for gene–disease validity

Classification of the disease–gene associations according to ClinGen criteria (https://clinicalgenome.org/docs/gene-disease-validity-standard-operating-procedures-version-10/) was applied using in silico approaches where possible. The case-level variant score was calculated from scoring and summing all case variants that support a particular mode of inheritance for a disease–gene association. LoF variants (stop gain, frameshift or splice acceptor/donor) scored 1.5 points or 2 if de novo, whereas others scored 0.1 points or 0.5 if de novo. A case–control study score of 5 points for an odds ratio greater than 5, 4 points for odds ratio greater than 3 or 3 points for odds ratio less than 3 was assigned. The larger of the case-level variant score or case–control study score was used as the genetic evidence score, capped at a maximum of 12 for those associations that had many supporting case variants. Experimental evidence categories were calculated using a variety of sources. Existing evidence for a gene function fitting the likely disease mechanism was assessed using PubMed searches using the disease and gene name and the background knowledge of the experts in the various disease-specific GeCIPs. Scores of 0, 1 or 2 were awarded depending on whether there was no, some tenuous or lots of evidence. Gene expression was assessed using GTEx Project data through the web portal of the Human Protein Atlas (https://www.proteinatlas.org/ (ref. 53); and/or publicly available relevant RNA-seq datasets34,35 processed with STAR v.2.7.6a and Salmon v.1.4.0, and a score of 0, 1 or 2 assigned for no, widespread or solely specific expression in the relevant disease tissue. Defaults of one point for protein–protein association evidence (high quality, direct experimental interactions scoring greater than 0.7 in StringDB with genes on the disease panel from PanelApp16) and 2 points for mouse/zebrafish evidence (phenotypic similarity as calculated by Exomiser between the patient’s phenotypes and the mouse/zebrafish phenotypes where the orthologous gene was disrupted) were used. The rounded sum of genetic and experimental evidence points was used to assign the final ClinGen classification of the evidence for the association as being limited (0.1–6 points), moderate (7–11 points) or strong (12–18 points). Definitive evidence for an association is considered to be a score of 12–18 as well as convincing replication of the result in more than two publications over more than 3 years. Therefore, none of our associations will be classified as definitive at this early stage.

Visual representation of variant ___location in lollipop plots

Visual representations of the variant locations in the protein were generated by extending the Mutplot software54. The x axis represents the amino acid chain and its annotated protein ___domain from UniProt. Each lolly indicates a variant by its protein change annotated on the MANE Select transcript or MANE Plus Clinical if a stronger impact is predicted (the transcript used is specified in the plot) and the frequency is shown on the y axis. Its shape indicates the genotype found in the proband. The colour indicates the type of variant and the variant’s functional annotation. If the variant has both a p. change annotation and a number in parenthesis it means that the original p. change was annotated on a different transcript and the amino acid position in parenthesis indicates the re-annotation on the selected transcript. If the only annotation available indicates a number in parenthesis it means that the variant was in the non-coding region for that transcript; therefore, the lolly was placed on the closest amino acid.

PhenoDigm patient similarity comparisons

During the assessment of some disease–gene associations, the phenotypic similarity between the probands driving the signal was calculated using their HPO term annotations and the Exomiser API to give a PhenoDigm55 score between 0 and 1. The mean of the pairwise, reciprocal, non-self hits was calculated and compared with those obtained from 1,000 iterations when the same number of probands was selected at random from the set of cases with that disease.

Co-expression network analysis

Co-expression network analysis of our candidate genes and known genes linked to the potentially associated disease (green genes in PanelApp v.1.120) was performed using GTEx v.6 tissue-specific modules and the CoExp tool accessible at https://rytenlab.com/coexp (ref. 56).

Peripheral blood mononuclear cell expression analysis

RNA-seq data from peripheral blood mononuclear cells collected from three volunteer donors was analysed (poly A-selected libraries, mean of two replicates untreated and two replicates treated with cycloheximide for 1 h to inhibit protein translation and mimic integrated stress response)57. In-house, R was used for DeSeq2 normalizations per library and calculation of the mean values for each transcript for the two replicate libraries per donor per condition. For global evaluations, across all three donors, the mean base value, log2 fold change post cycloheximide and Benjamini–Hochberg adjusted P value were then calculated.

Gene and variant look-up in independent rare disease cohorts

In a cohort of patients from the Irish Kidney Gene Project58 (278 cystic kidney disease and 141 chronic kidney disease cases), rare (gnomAD minor allele frequency less than 0.1%) LoF, missense, splicing or intronic variants were extracted for our new renal disease-associated genes. A further cohort of more than 3,000 Dutch renal patients was queried for likely pathogenic/pathogenic variants in those genes using the Alissa bioinformatics pipeline. Similarly, a sequencing cohort of 212 participants with inherited corneal diseases, recruited in the United Kingdom and Czech Republic and pre-screened for known genetic causes, was interrogated for any rare variants in the candidate gene POMK.

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.