Main

OCD is a chronic psychiatric disorder that affects 1–3% of the population1 and is characterized by obsessions and compulsions that vary in type and severity and over time. OCD is responsible for profound personal and societal costs2, including increased risk of suicide3 and overall mortality4. OCD is moderately heritable; twin-based heritability estimates range between 27% and 47% in adults and between 45% and 65% in children5,6,7,8, with SNP-based heritability estimates between 28% and 37%9,10,11.

Two earlier OCD GWAS meta-analyses, both containing a subset of the data included in this analysis12,13, showed SNP-based heritabilities of 8.5% (assuming a 3% population prevalence) and 16% (assuming a 2% population prevalence). The first GWAS (ncases = 14,140, ncontrols = 562,117)12 found one genome-wide significant locus associated with OCD, while the second (ncases = 37,015, ncontrols = 948,616)13 identified 15 independent genome-wide significant loci. As with other complex traits, increased sample sizes are needed for a more comprehensive understanding of the underlying genetic etiology of OCD and its genetic relationships with related disorders.

The current study combines data from the two unpublished OCD GWASs described above and includes additional cohorts (~9,000 cases). This results in one of the largest and most well-powered GWAS of OCD so far, with a ~20-fold increase of OCD cases compared to the previously published OCD GWASs10. Based on the results from the meta-analysis, we conducted secondary analyses, including positional and functional fine-mapping of SNPs and genes, structural equation modeling to examine possible genetic differences in sample ascertainment across cohorts, protein and transcriptome-wide association analyses, single-cell enrichment and genetic correlations with other traits (Supplementary Fig. 1). Our results provide more detailed insight into the genetic underpinnings and biology of OCD.

Results

GWAS meta-analysis identifies 30 genome-wide significant loci

We conducted a GWAS meta-analysis of 28 OCD case–control cohorts of European ancestry, comprising 53,660 cases and 2,044,417 controls (effective sample size, ~210,000 individuals). Ascertainment of cases varied across cohorts: OCD diagnosis was determined (1) by a healthcare professional in a clinical setting (18 cohorts, n = 9,089 cases), (2) from health records or biobanks (seven cohorts, n = 9,138 cases), (3) in a clinical setting or from health records with the additional characteristic that all OCD cases were primarily collected for another psychiatric disorder (three cohorts, n = 5,266 cases) or (4) by self-reported diagnosis in a consumer-based setting (23andMe, Inc., n = 30,167 cases). Cohort details, including phenotypic assessment, quality control and individual cohort GWAS analyses, are described in Supplementary Note 2 and Supplementary Table 1. We identified 30 independent (defined in Supplementary Note 3) loci among the 1,672 SNPs that exceeded the genome-wide threshold for significance (\(P < 5{\times 10}^{-8}\); Manhattan plot in Fig. 1, regional association plots and forest plots in Supplementary Figs. 231 and a list of all independent genome-wide significant SNPs in Table 1 with additional details in Supplementary Tables 2 and 3). The independence of the 30 lead SNPs was subsequently validated using conditional and joint analysis (GCTA-COJO)14 (Supplementary Table 4). Analysis of the X chromosome, conducted in a subset of the data for which this information was available (23andMe), yielded no significant associations (Supplementary Note 4 and Supplementary Fig. 37e). Of the 15 genome-wide significant loci previously reported in preprints12,13, 13 were genome-wide significant in the current GWAS, with the remaining two showing suggestive significance (\(P=5.23{\times 10}^{-8}\,{\rm{and}}\,{P}=2.2{\times 10}^{-7}\); Supplementary Table 5). Using MiXeR15, we estimated that approximately 11,500 (standard error of the effect estimate (s.e.) = 607) causal variants account for 90% of the OCD SNP-based heritability.

Fig. 1: Manhattan plot of OCD GWAS meta-analysis.
figure 1

The y axis represents −log10 (P values) (two sided, not adjusted for multiple testing) for the association of variants with OCD using an inverse-variance-weighted fixed-effects model (ncases = 53,660 and ncontrols = 2,044,417). The x axis shows chromosomes 1–22. The horizontal red line represents the threshold for genome-wide significance (\(P=5\times {10}^{-8})\). Index variants of genome-wide significant loci are highlighted as green diamonds.

Table 1 Genome-wide-significant loci associated with OCD

No statistically significant heterogeneity was observed across individual cohorts for the 30 genome-wide significant loci, as assessed with Cochran’s Q-test (Supplementary Fig. 32), the I2 statistic and the genomic structural equation modeling (GenomicSEM) \({Q}_{{\rm{SNP}}}\) statistic16 (Supplementary Table 2). Genome-wide analyses of samples grouped by clinical, comorbid, biobank and 23andMe information (Supplementary Table 3 and Supplementary Figs. 3337) showed evidence that sample ascertainment impacted results at a genome-wide scale, although not beyond what is observed with closely related psychiatric disorders17,18. We observed moderate to high genetic correlations across the subgroups (between 0.63, s.e. = 0.11 for biobanks and comorbid information and 0.92, s.e. = 0.07 for 23andMe and comorbid information; Supplementary Table 7) and a satisfactory fit for a one-factor GenomicSEM model (Supplementary Table 8 and Supplementary Fig. 39). A common factor GWAS based on the one-factor GenomicSEM model resulted in 20 significant loci, all of which were also significant in the primary GWAS (Supplementary Table 8 and Supplementary Fig. 40; analysis details in Supplementary Note 5). SNP heritability (assuming a 1% population prevalence) was 6.7% (s.e. = 0.3%), with slightly higher estimates for the clinical (\({h}_{{\rm{SNP}}}^{2}\) = 16.4%, s.e. = 1.5%) and comorbid (\({h}_{{\rm{SNP}}}^{2}\) = 13.3%, s.e. = 1.7%) subgroups (Supplementary Table 1).

Gene-based findings

We prioritized putative risk genes for OCD using six positional and functional QTL gene-based mapping approaches. Positional mapping was performed with mBAT-combo19. Functional expression quantitative trait locus (eQTL) mapping was performed with transcriptome-wide association study (TWAS)20, using PsychENCODE gene expression weights21, and summary-based Mendelian randomization (SMR)22 using the whole-blood eQTLGen23 and MetaBrain24 datasets. Functional protein QTL mapping was done using a protein-wide association study (PWAS) of human brain protein expression panels25. Finally, we used the psychiatric omnilocus prioritization score (PsyOPS)26, which combines positional mapping with biological annotations, to further prioritize risk genes within genome-wide significant loci. We identified 207 significant genes (Bonferroni correction, P < 2.67 × 106) with mBAT-combo and 24 genes using TWAS (P < 4.76 × 10−6), 14 of which were conditionally independent. The SMR–eQTLGen analysis identified 39 significant risk genes (P < 4.28 × 106), and the SMR–MetaBrain analysis identified 14 risk genes (P < 9.23 × 10−6). The PWAS identified three significant genes (P < 3.39 × 10−5), while PsyOPS prioritized 29 genes. In total, 251 genes were significantly associated with OCD through at least one gene-based approach, and 48 were implicated by at least two methods (Methods, Supplementary Note 7 and Supplementary Tables 914).

From the 48 genes implicated by at least two approaches, we prioritized likely causal genes for OCD using colocalization (TWAS-COLOC)27,28 and SMR–heterogeneity in dependent instruments (SMR-HEIDI)22 tests. Colocalization was used to identify significant TWAS associations for which the underlying GWAS and eQTL summary statistics are likely to share a single causal variant. Similarly, HEIDI was used to select SMR associations for which the same causal variant affects gene expression and trait variation. Of the 48 genes implicated by at least two gene-based tests, 25 were also significant in either the TWAS-COLOC or the SMR-HEIDI tests, suggesting causality (Fig. 2a). Only 2 of these 25 genes were prioritized by both TWAS-COLOC and SMR-HEIDI: WDR6 (WD repeat ___domain 6) and DALRD3 (DALR anticodon binding ___domain-containing 3). Another gene of interest, CTNND1 (catenin δ1), was implicated by three of our five approaches (multivariate set-based association test (mBAT-combo), TWAS, PWAS) and showed evidence for colocalization. Only three genes were implicated in the PWAS; of these, CTNND1 was the only gene also implicated in the TWAS. In the PWAS, downregulation of CTNND1 protein expression in the human dorsolateral prefrontal cortex (dlPFC) was significantly associated with OCD risk (\(Z=-4.49,P=7.11\times {10}^{-6}\); Supplementary Table 13), consistent with the downregulation of CTNND1 gene expression in the prefrontal cortex seen in the TWAS (\(Z=-6.86,P=6.90\times {10}^{-12};\) Supplementary Table 10). For a discussion of the overlap between the gene findings with rare coding variants in OCD, see Supplementary Table 6 and Supplementary Note 7.

Fig. 2: Gene-based, tissue and cell type enrichment analyses.
figure 2

a, List of 25 genes that were implicated in at least two of the five different gene-based tests (significance indicated by gray dots) and passed the TWAS colocalization and/or SMR-HEIDI filters (significance indicated by orange dots). Conditionally independent (cond. ind.) genes within each locus are indicated by blue dots. b, Enrichment of OCD GWAS signal in human brain-related tissues from GTEx (version 8). No significant enrichment was observed in the peripheral tissues (not included in the figure). The horizontal bar size represents the significance of the enrichment measured using the MAGMA gene set enrichment test or partitioned LDSC. c, Top 20 groups of brain cell types (n = 35 total tested) enriched with OCD GWAS signal using MAGMA. Dots represent −log10(P values) from MAGMA gene set enrichment tests of individual neuronal cell types from Zeisel et al.30. Vertical crosses represent the mean −log10(P value) observed for each brain cell type group. Blue crosses represent a significant enrichment of OCD GWAS signals (FDR across 35 groups, FDR < 0.05), while pink crosses indicate nonsignificant enrichment. Gray points represent the association (−log10(P value)) for each single cell cluster (‘level 5’ analysis defined by Zeisel et al.30) in a given cell type (for example, excitatory neurons, cerebral cortex). CCK, cholecystokinin-expressing; R-LM, stratum radiatum-stratum lacunosum-moleculare.

Tissue and cell type enrichment analysis

After mapping significantly associated SNPs from the GWAS meta-analysis to likely causal genes, we explored which tissues or cell types showed enriched gene expression of OCD-associated genetic signals using a previously described approach29 on published human gene expression datasets from bulk tissue RNA-seq data from the Genotype–Tissue Expression (GTEx) project and single-cell RNA-sequencing data from the adult mouse central and peripheral nervous systems30. We found enrichment of OCD GWAS signals in six of 13 human brain tissue types in GTEx but no enrichment in human peripheral tissues (Fig. 2b and Supplementary Table 15). In the adult mouse central and peripheral nervous systems, we found enrichment of OCD GWAS signals in 41 of 166 tested specific single cell types using the MAGMA gene set enrichment test (Supplementary Table 16). When summarizing results of individual single cell types into groups of cell types defined by the same region or tissue and cell type, nine of 35 were enriched for OCD GWAS signals (top 20 shown in Fig. 2c). Strong enrichment of OCD GWAS signal was especially observed in excitatory neurons of the hippocampus and the cerebral cortex as well as in D1 and D2 medium spiny neurons (MSNs).

Genetic relationship of OCD with other phenotypes

Using phenome-wide association analysis, we examined whether the 30 independent OCD-associated loci identified by our GWAS meta-analysis have previously been associated with other phenotypes (see Supplementary Tables 17a–d for lookups in four, partially overlapping GWAS databanks and Table 1 for highlighted associations). We found that 22 of the 30 loci were associated with other phenotypes, including schizophrenia (seven loci), depression and major depressive disorder (two loci), bipolar disorder (one locus), neuroticism (seven loci), educational attainment (seven loci) and body fat mass or body mass index (eight loci).

We further used bivariate linkage disequilibrium score regression (LDSC)31 to investigate the extent of genetic correlations between OCD and 112 previously published GWASs encompassing psychiatric, substance use and neurological phenotypes, among others (Fig. 3). We found that 65 phenotypes were significantly correlated with OCD after correcting for multiple testing using the Benjamini–Hochberg32 procedure to control the false discovery rate (FDR) at a threshold of 0.05. OCD was significantly positively correlated with all tested psychiatric phenotypes; the highest correlations were with anxiety (\({r}_{\rm{G}}=0.70\)), depression (\({r}_{\rm{G}}=0.60\)), anorexia nervosa (\({r}_{\rm{G}}=0.52\)), Tourette syndrome (\({r}_{\rm{G}}=0.47\)) and post-traumatic stress disorder (PTSD; \({r}_{\rm{G}}=0.48\)). Significant positive genetic correlations were also obtained for neuroticism (\({r}_{\rm{G}}=0.53\)), in particular for the worry subcluster (\({r}_{\rm{G}}=0.64\)), and all individual items in the worry subcluster, with slightly lower estimates for the depressive subcluster (\({r}_{\rm{G}}=0.35\)). Suicide attempt (\({r}_{\rm{G}}=0.40\)), history of childhood maltreatment (\({r}_{\rm{G}}=0.37\)) and tiredness (\({r}_{\rm{G}}=0.36\)) were also notable for strong positive associations with OCD. Of the assessed neurological disorders, OCD was only significantly correlated with migraine (\({r}_{\rm{G}}=0.15\)). Some autoimmune disorders, such as Crohn’s disease (\({r}_{\rm{G}}=-0.13\)), ulcerative colitis (\({r}_{\rm{G}}=-0.14\)) and inflammatory bowel disease (\({r}_{\rm{G}}=-0.14\)), showed negative correlations with OCD (see Fig. 3 and Supplementary Table 18 for all genetic correlation estimates, 95% confidence intervals and P values, Supplementary Note 6 for a more in-depth discussion of all significant genetic correlations and Supplementary Table 19 and Supplementary Figs. 41 and 42 for subgroup-specific genetic correlation estimates).

Fig. 3: Genetic correlations (rG) between OCD and 112 phenotypes.
figure 3

This includes psychiatric, substance use, cognition–socioeconomic status (SES), personality, psychological, neurological, autoimmune, cardiovascular (cardiovasc.), anthropomorphic–diet, fertility and other phenotypes. References and sample sizes of the corresponding summary statistics of the GWAS studies can be found in Supplementary Table 18. The OCD summary statistics are of the main meta-analysis (ncases = 53,660 and ncontrols = 2,044,417). Error bars represent the 95% confidence intervals for the genetic correlation estimates (rG). Red circles indicate significant associations with a P value adjusted for multiple testing with the Benjamini–Hochberg procedure to control the FDR (<0.05). Black circles indicate associations that are not significant. a., after; ADHD, attention-deficit hyperactivity disorder; ALS, amyotrophic lateral sclerosis; BMI, body mass index; embarras., embarrassment; freq, frequency; fr., from; HDL, high-density lipoprotein; IQ, intelligence quotient; LDL, low-density lipoprotein; neurot., neuroticism; nr., number; PTSD, post-traumatic stress disorder; sat., satisfaction; VN, verbal-numerical.

Discussion

The OCD GWAS reported here, comprising over 53,000 cases, identified 30 independent genome-wide significant loci. Common SNPs explained 6.7% of the variation in OCD risk in our meta-analysis (LDSC with an assumed population prevalence of 1%), a significant reduction from the 28% reported previously10. However, differences in the assumed population prevalence (where a lower assumed prevalence for LDSC heritability calculation results in a lower heritability estimate) and an increase in sample heterogeneity likely contributed to this discrepancy. The reduction in SNP heritability is in line with previous observations for closely related psychiatric disorders such as attention deficit hyperactivity disorder (ADHD)33,34 or depression17,35,36,37, where expanding the phenotype definition increased genetic heterogeneity, potentially accounting for the observed decrease in SNP heritability. This aligns with the fact that heritability estimates for more homogeneous OCD subgroups were higher: 16.4% for the clinically ascertained subgroup and 13.3% for the comorbid subgroup (Supplementary Note 10). The current estimates are comparable to those of other psychiatric and substance use disorders, with SNP heritability estimates ranging between 9% and 28%38.

The most significant SNP (rs78587207 (\(P=5.28\times 1{0}^{-12}\))) identified in the GWAS is located on chr11q12.1 and has been previously associated with several traits, including neuropsychiatric phenotypes39 such as depressive symptoms40 and neuroticism40. Gene-based analyses identified four putative causal genes within this locus. The closest gene to rs78587207 is CTNND1, which encodes the cell adhesion molecule p120 catenin. This gene was associated with OCD using three gene-based tests (mBAT-combo, TWAS and PWAS), and we found strong evidence for colocalization of the TWAS signal for CTNND1 in the dlPFC. The dlPFC has been consistently implicated in the neural circuitry of OCD as well as in compulsivity more broadly as part of the cortico–striatal–thalamo–cortical circuitry41,42. The protein product of CTNND1 is a regulator of cell–cell adhesion43 and has a crucial role in gene transcription, Rho GTPase activity and cytoskeletal organization44,45,46. Other credible causal genes in the locus include CLP1 (cleavage factor polyribonucleotide kinase subunit 1), TMX2 (thioredoxin-related transmembrane protein 2) and ZDHHC5 (zinc finger DHHC type palmitoyltransferase 5). Rare genetic mutations in CLP1 are associated with pontocerebellar hypoplasia type 10, a very rare autosomal recessive neurodegenerative disease characterized by brain atrophy and delayed myelination resulting in intellectual disability47. TMX2 is associated with increased risk of neurodevelopmental disorders with microcephaly, cortical malformations, spasticity and congenital nervous system abnormalities48. ZDHHC5 is broadly expressed in the brain, including the frontal cortex. ZDHHC5 has not been implicated in brain development but has been linked to lung acinar adenocarcinoma and lung papillary adenocarcinoma in prior studies49.

Our finding that approximately 11,500 (s.e. = 607) causal variants account for 90% of the SNP-based heritability of OCD suggests that OCD is more polygenic than other complex traits such as height (ncausal = 4,000), schizophrenia (ncausal = 9,600) and ADHD (ncausal = 5,600) but less polygenic than major depression (ncausal = 14,500) and educational attainment (ncausal = 13,200)50.

We identified a total of 25 credible causal genes based on robust evidence using multiple positional and functionally informed gene-based approaches. Notably, DLGAP1, which has been previously implicated in OCD pathogenesis10,51, was not identified in either the GWAS or in the gene-based analyses. Of the 25 genes that were implicated, 15 were within 6.5 kb of a SNP that surpassed genome-wide significance in the meta-analysis. In addition to the four genes discussed above, several others are of particular interest, including WDR6 and DALRD3, which had the strongest evidence from the gene-based analyses. These genes lie in a gene-rich region on chr3p21.31, which, in addition to harboring multiple genome-wide significant SNPs, has been previously associated with a broad range of psychiatric disorders and related traits, including schizophrenia39, well-being52 and the worry subcluster of neuroticism53.

WDR6 is broadly expressed in the brain, particularly the hypothalamus. Its protein product is involved in cell growth arrest54, and recent studies have implicated it in anorexia nervosa55 and Parkinson’s disease56. DALRD3 is located on chromosome 3 in the same region as WDR6. DALRD3, when fully disrupted, is implicated in a form of epileptic encephalopathy with associated developmental delay57. Finally, a third gene in the 3p21 locus, CELSR3 (cadherin EGF LAG seven-pass G type receptor 3), encodes a protocadherin that is highly expressed in the developing basal ganglia58. Multiple loss-of-function mutations in CELSR3 have been associated with Tourette syndrome59,60, which co-occurs with OCD in 10–20% of patients.

Four other genes identified through these analyses are located in the MHC locus, a region on chromosome 6 that has a major role in the adaptive immune system and has been repeatedly linked to major psychiatric disorders61. The newly identified MHC association for OCD is noteworthy given evidence linking OCD with autoimmune disorders62,63,64. Genetic pleiotropy may underlie this connection, with variants predisposing individuals to both autoimmune conditions and OCD65. Furthermore, some OCD subtypes, such as pediatric acute-onset neuropsychiatric disorders associated with Streptococcus and pediatric acute-onset neuropsychiatric syndrome, may have autoimmune origins66,67. Nevertheless, we were surprised to discover several negative genetic correlations between OCD and autoimmune disorders such as Crohn’s disease, ulcerative colitis and inflammatory bowel disease in our analyses, suggesting that there is heterogeneity (and perhaps pleiotropy) in the genetic relationships between autoimmune disorders and OCD.

Tissue and cell type enrichment analysis revealed significant enrichment of OCD SNP heritability in several tissues and cell types, with the strongest enrichment in excitatory neurons of the hippocampus and the cerebral cortex and in dopamine D1 receptor (D1R)-positive and dopamine D2 receptor (D2R)-positive MSNs in the striatum. These findings are in line with traditional neural circuitry models of OCD, which focus on frontal cortical–striatal pathways68,69. These findings are consistent with and build on previous work linking various neuronal cell types to psychiatric and cognitive phenotypes70.

Interestingly, the frontal and anterior cingulate cortices, which were enriched in our tissue-based analyses, as well as the hippocampus and the striatum, which were implicated in our cell type-based analyses, are among the regions that are consistently implicated in neuroimaging studies of OCD41,71,72,73. Enrichment in MSNs in the striatum is consistent with their role in the observed aberrant circuitry in OCD, where the D1 MSNs project to the globus pallidus interna and the substantia nigra in the direct pathway and the D2 type MSNs project to the globus pallidus externa in the indirect pathway74. However, MSNs are also enriched in major depressive disorder75, schizophrenia76 and intelligence77, suggesting that the observed enrichment is not specific for OCD.

Our analyses of the shared genetic risk between OCD and other psychiatric disorders provides further insights into the etiology of OCD. In line with previous observations38,78, OCD was significantly genetically correlated with multiple psychiatric disorders and traits. The strongest genetic correlations were observed for anxiety disorders, depression and anorexia nervosa, all of which are highly comorbid with OCD79. This aligns with previous findings from cross-disorder analyses suggesting a shared genetic susceptibility among most psychiatric disorders38,80,81. A notable exception is our finding that risk variants for OCD are protective for alcohol dependence82, which is at odds with epidemiological evidence strongly linking OCD and alcohol-related disorders83 but in line with a recent paper79 reporting a lower-than-expected lifetime comorbidity of substance use disorders in OCD. The observed pattern of correlations with other phenotypes can be thought of as falling into two categories: compulsivity–impulsivity and rumination–worry–neuroticism. In both categories, the patterns of genetic correlations appear to follow a gradient across disorders and traits. For example, in the compulsivity–impulsivity category, strong positive correlations are seen with anorexia nervosa and Tourette syndrome, which are disorders with strong compulsive features, with less positive associations seen with ADHD and negative correlations with alcohol dependence and risk-taking behaviors, which are all phenotypes characterized by impulsivity. A similar gradient is observed for the rumination–worry–neuroticism-related phenotypes, with strong positive correlations with anxiety and other ruminative phenotypes such as worry, transitioning to less strong correlations with individual depression-related items.

This study marks the transition from the flat (sample-building) phase of SNP discovery described for GWAS84 (Supplementary Fig. 20), where few to no genome-wide significant loci are identified10,12,51,85, to the linear phase of SNP discovery, where even relatively small increases in sample size identify additional genome-wide significant loci18. The strengths of the current study therefore include the marked increase in the number of OCD cases and the rigorous analytic methods, including two multivariate approaches (multi-trait analysis of GWAS (MTAG) and GenomicSEM) to control for potential overlapping study participants and to examine potential heterogeneity between the multiple ascertainment approaches. Potential weaknesses include the inability to document comorbid psychiatric disorders in the majority of cases that were not ascertained from clinical collections or electronic registries, the lack of inclusion of non-European ancestries and the limited availability of sex chromosome data. Owing to the nature of our study, imputation references used in the different cohorts were heterogeneous and did not allow for confident analysis of rare variant associations. Future larger-scale sequencing studies that are currently underway will be needed to identify associations in this allele frequency spectrum. We also note that the genetic correlation analyses are impacted by residual heterogeneity in genetic signals owing to the employment of heterogeneous ascertainment strategies.

In summary, this work substantially advances the field of OCD genetics by identifying new OCD genetic risk loci and multiple credible candidate causal genes, including those expressed in brain regions and cell types previously implicated in OCD86. We have also shown that OCD is highly polygenic in nature, with many variants implicated not only in OCD but also in commonly comorbid disorders or traits, in particular, anxiety, neuroticism, anorexia nervosa and depression. The observation that common variants explain only a modest amount of the phenotypic variation in OCD suggests that other types of genetic variation may also contribute to the etiology of OCD. Notably, whole-exome-sequencing studies have suggested that a substantial proportion of OCD cases (22%) may be influenced by rare de novo coding variants87, especially in genes that are intolerant to loss of function88. Similarly, rare potentially damaging copy number variations represent part of the risk architecture for OCD9. These findings emphasize the need for a comprehensive exploration of the contribution of both common and rare genetic factors as well as their interplay to OCD risk. Finally, with the implication of the MHC complex, we provide additional evidence for potential shared genetic influences underlying both OCD and increased liability to autoimmune processes, although the directionality of those relationships remains to be definitively elucidated. In addition to continuing to increase sample sizes, future studies will require ancestrally diverse samples to further facilitate the discovery of additional OCD risk variants. Similarly, sex-specific analyses and additional clinical phenotyping will allow for the further elucidation of genetic and clinical relationships between OCD and co-occurring disorders. Finally, with the emergence of drug databases describing the relations between drugs and molecular phenotypes89, our results may be useful for drug repurposing (that is, identifying existing drugs targeting OCD risk genes), leading to new opportunities to find more effective treatments.

Methods

Ethics

All relevant ethics approvals have been obtained by the respective cohort’s institutions, and a list of all respective approvals can be found in Supplementary Note 2.

Study participants

We analyzed genomic data from 28 OCD case–control cohorts including 53,660 OCD cases and 2,044,417 controls of European ancestry. Supplementary Table 1 provides an overview of the individual cohorts. A subset of the cases and controls have been included in previous studies10,51,85 and preprints12,13, as described in Supplementary Note 2. Among all included individuals, 323 cases were part of a parent–proband trio; in these cases, parents were used as pseudocontrols. A total of 20,427 cases met DSM-5 (ref. 93) or ICD-10 (https://icd.who.int/) criteria for OCD as assessed by a healthcare professional or derived from (electronic) health records, while the remaining 32,233 cases were based on self-reported OCD diagnosis (23andMe, AGDS and parts of UKBB). Cohort-specific sample and analytic details can be found in Supplementary Note 2. Data collections were approved by the relevant institutional review boards at all participating sites, and all participants provided written informed consent.

Individual GWAS analyses and harmonizing of results

First, the data of each participating cohort were analyzed individually (see Supplementary Note 2 for details). Genetic data were imputed using either the Haplotype Reference Consortium (HRC)94 or 1000 Genomes Project Phase 3 reference panels95. The resulting GWAS summary statistics were then harmonized before a conjoint meta-analysis of all autosomes was conducted. Each summary statistic dataset was transformed to the ‘daner’ file format following RICOPILI96 specifications. All variants had to meet the following criteria for inclusion: minor allele frequency (MAF) > 1% in cases and controls, INFO score > 0.8 and <1.2. If the effect measure, P value or s.e. was missing or was out of bounds (infinite), the SNP was removed. Once cleaned summary statistics were produced, all datasets were aligned to the HRC reference panel. If variants were reported on different strands, they were flipped to the orientation in the HRC reference. Furthermore, strand-ambiguous A/T and C/G SNPs were removed if their MAF was >0.4. In the case that A/T and C/G SNPs showed a MAF < 0.4, allele frequencies were compared to frequencies in the HRC reference. If an allele frequency match was found, that is, minor alleles were the same in the summary statistics and the HRC reference, the same strand orientation was assumed. If an allele mismatch was found, that is, the allele had a frequency > 0.5 in the HRC reference, it was assumed that alleles were reported on different strands, and alleles were flipped subsequently. Marker names were uniformly switched to those present in the HRC reference. If a variant did not overlap with the variants in the HRC reference, it was removed.

GWAS meta-analysis

Inverse-variance-weighted meta-analysis was conducted on 28 European cohorts using METAL97. Weighting was based on standard error primarily to account for the large case–control imbalances in cohorts that used linear mixed model approaches in their primary GWAS. Heterogeneity was assessed with Cochran’s Q statistic and the I2 statistic98,99 (see Supplementary Note 5 for details). The genomic control factor lambda (λ) was calculated for each individual GWAS and for the overall meta-analysis to identify residual population stratification or systematic technical artifacts. GWAS summary statistics were subjected to LDSC analyses on high-quality common SNPs (INFO score > 0.9) to examine the LDSC intercept to distinguish polygenicity from other types of inflation and to estimate the genetic heritability from the meta-analysis and genetic correlations between cohorts. The genomic inflation factor λ was estimated at 1.330 with a λ1000 of 1.033, while the LDSC intercept was 1.0155 (s.e. = 0.0085), indicating that the inflation was mostly due to polygenic signal and unlikely to be substantially confounded by population structure. The genome-wide significance threshold for the GWAS was set at a P value of \(5.0\times 1{0}^{-8}\). The 23andMe data included information on the X chromosome; as this information was not present for all other cohorts, analysis of the X chromosome was only conducted in this subcohort (see Supplementary Note 4 for details).

We further conducted GWAS meta-analyses on the following four subgroups, defined by differences in their sample ascertainment: (1) clinical OCD cases diagnosed by a healthcare professional in a clinical setting (ncases = 9,089, ncontrols = 21,077; including IOCDF, IOCDF_trio, EPOC, NORDiC-nor, NORDiC-swe, EGOS, OCGAS, OCGAS-ab, OCGAS-gh, OCGAS-nes, Psych_Broad, WWF, MVP, Michigan/Toronto IGS, YalePenn, Chop, CoGa), (2) comorbid individuals who were primarily ascertained for another comorbid psychiatric disorder (ncases = 5,266, ncontrols = 43,760; AGDS, iPSYCH), (3) biobank data from large-scale biobanks or registries with ICD or DSM codes (ncases = 9,138, ncontrols = 1,049,776; BioVU, EstBB, FinnGen, HUNT, MoBa, UKBB) or (4) 23andMe data (ncases = 30,167, ncontrols = 929,804). While these groups are not exclusive (for example, diagnoses in health records were originally given in a clinical setting or comorbid cases were also assessed in a clinical setting or derived from health records), we defined these groups by the cohort’s primary characteristic. We also conducted one meta-analysis including all clinical, comorbid and biobank subgroups, while excluding the 23andMe data, resulting in 23,493 cases and 1,114,613 controls. As 23andMe is the only consumer-based dataset, we intended to compare this dataset to all others.

Number of trait-specific causal variants (MiXeR analysis)

We applied MiXeR version 1.3 (ref. 15) to quantify the polygenicity of OCD (that is, estimate the total number of trait-influencing genetic variants). MiXeR fits a Gaussian mixture model assuming that common genetic effects on a trait are a mixture of causal variants and noncausal variants. Polygenicity is reported as the number of causal variants that explain 90% of SNP heritability of OCD (to avoid extrapolating model parameters into the area of infinitesimally small effects).

SNP-based fine-mapping (GCTA-COJO)

We performed a conditional and joint analysis (GCTA-COJO)14 to identify independent signals within significant OCD loci. This approach performs a conditional and joint analysis on the basis of conditional P values before calculating the joint effects of all selected SNPs. We used the stepwise model selection procedure to select independently associated SNPs. The linkage disequilibrium reference sample was created from 73,005 individuals from the QIMR Berghofer Medical Research Institute genetic epidemiology cohort. The distance assumed for complete linkage disequilibrium was 10 Mb, and we used the default P-value threshold of \(5\times {10}^{-8}\) to define a genome-wide significant hit.

Multi-trait analysis of ascertainment subgroups

We used MTAG100 to conduct multivariable GWAS analyses, reporting GWAS results for each of the ascertainment-specific subgroups. Through this approach, we aimed to address potential concerns about heterogeneity in genetic liability for individual subgroups following different ascertainment strategies. MTAG is a multi-trait analysis that is usually used to combine different but related traits into one meta-analysis by leveraging the shared heritability among the different traits and thereby gaining power. In this case, our aim was to generate ascertainment-specific estimates, while boosting power by leveraging the high shared heritability between the subgroups. The MTAG analysis resulted in four different GWAS summary statistics, one for each subgroup (clinical, comorbid, biobanks, 23andMe). We performed maxFDR analyses to approximate the upper bound on the FDR of MTAG results.

GenomicSEM

Similarly, we used GenomicSEM16 to model the joint genetic architecture of the four subgroups. First, we ran a common factor model without individual SNP effects, following the tutorial ‘Models without individual SNP effects’ on the GenomicSEM GitHub website (Code availability). Second, we ran a multivariate GWAS of the common factor (see Supplementary Note 5 for details). We specified the model using unit variance identification, for which the latent factor variance is fixed to 1 and the loadings of the traits are estimated freely. This ensures that we capture how much of each subgroup contributes to the latent factor. GenomicSEM also generates \({Q}_{{\rm{SNP}}}\) values, which indicate possible heterogeneous effects across the subgroups. The \({Q}_{{\rm{SNP}}}\) statistic is mathematically similar to the Q statistic from standard meta-analysis and is a \({ X}^{2}\)-distributed test statistic, with larger values indexing a violation of the null hypothesis that the SNP acts entirely through the common factor.

SNP heritability estimation

The proportion of the phenotypic variance that could be explained by the aggregated effect of all included SNPs (SNP-based heritability, \({h}_{{\rm{SNP}}}^{2}\)) was estimated using LDSC31. The analysis was performed using precomputed linkage disequilibrium scores from samples restricted to European ancestry in the 1000 Genomes Project95, filtered for SNPs included in the HapMap 3 reference panel101. SNP heritability was estimated based on the slope of the LDSC, with heritability on the liability scale calculated assuming a 1% population prevalence of OCD1. To omit a downward bias in our estimates of liability-scale heritability, following Grotzinger et al.102, we accounted for varying levels of ascertainment across cohorts in our meta-analysis by summing the effective sample sizes across the contributing cohorts and using that as the input sample size for LDSC. For conversion to the liability scale (1%), the sample prevalence was then specified as 0.5. The SNP heritability was calculated for the whole OCD sample as well as for ascertainment-specific subgroups.

Genetic correlations

We used cross-trait LDSC31, a method that computes genetic correlations between GWASs without bias from ancestry differences or sample overlap to calculate genetic correlations between the primary OCD meta-analysis and other phenotypes of interest. The selection of traits was based on phenotypic relevance and/or prior report of a genetic relationship with OCD. The genetic correlation between traits was based on the estimated slope from the regression of the product of Z scores from two GWASs on the linkage disequilibrium score and represents the genetic covariation between two traits based on all polygenic effects captured by the included SNPs. The genome-wide linkage disequilibrium information used by these methods was based on European populations from the HapMap 3 reference panel101, and GWAS summary statistics were filtered to only include SNPs that were part of the 1,290,028 HapMap 3 SNPs.

To ensure the internal consistency of the datasets included in our meta-analysis, we calculated genetic correlations between all cohorts we considered to have a sample size large enough for LDSC (effective sample size of ≥1,000) and between the four ascertainment-specific subgroups.

We further calculated genetic correlations between OCD and 112 other disorders and traits. The source studies of the GWAS summary statistics can be found in Supplementary Table 18. As a follow-up, we also calculated genetic correlations between the 112 phenotypes and each ascertainment-specific subcohort and compared the genetic correlation patterns between the four groups. For all cross-phenotype genetic correlation analyses, we adjusted P values for multiple testing using the Benjamini–Hochberg procedure to control for the FDR (<0.05).

Gene-based analyses

To match the significant SNPs to the genes for which they likely influence function, we conducted a series of positional and functional gene-mapping analyses. The positional mapping employed MBAT-combo19, while the functional mapping tested whether genetic variants associated with OCD were also associated with differential expression of nearby genes (within a 1-Mb window) using (1) TWAS20 using PsychENCODE data and included colocalization with COLOC27,28, and (2) SMR22 using whole-blood eQTL information and brain tissues from MetaBrain, alongside the HEIDI test, which tests for heterogeneity in GWAS signal and eQTL association. Furthermore, a PWAS was conducted. As a final step, genes within each locus were prioritized using PsyOPS26, which integrates both positional and functional information. The details of each method are described below.

Positional gene mapping (MBAT-combo)

A gene-based analysis was conducted using mBAT-combo19 within GCTA version 1.94.1 (ref. 14). The European subsample (n = 503 individuals) from phase 3 of the 1000 Genomes Project95 was used as the linkage disequilibrium reference panel with the fastBAT default linkage disequilibrium cutoff of 0.9 applied. After filtering SNPs with MAF > 0.01, there were 6,629,124 SNPs for analysis in our sample. A gene list consisting of 19,899 protein-coding genes was used to map the base pair position of genes using genome build hg19 (see Supplementary Note 7 for details).

Functional gene mapping

Transcriptome-wide association study

We used TWAS FUSION20 to perform a TWAS of OCD. We used brain gene expression weights from the PsychENCODE103 and linkage disequilibrium information from the 1000 Genomes Project Phase 3 (ref. 95). TWAS FUSION uses reference linkage disequilibrium and reference gene expression panels with GWAS summary statistics to estimate the association between gene expression and OCD risk. These data were processed with the test statistics from the OCD GWAS to estimate the expression–GWAS association statistic. We corrected for multiple testing using Bonferroni correction.

We performed colocalization analyses using the COLOC R function27,28 implemented in TWAS FUSION. Colocalization is a Bayesian method used to calculate the posterior probabilities (PP) that individual lead SNPs within a significant TWAS locus are (1) independent (for example, two causal SNPs in linkage disequilibrium, one affecting transcription and one affecting OCD; PP3) or (2) share the same associated variant (for example, a single causal SNP affects both transcription and OCD (PP4)). We also performed a conditional analysis to determine whether identified associations represented independent associations. This was performed using the FUSION software, which jointly estimates the effect of all significant features within each locus by using residual SNP associations with OCD after accounting for the predicted expression of other features.

Summary-based Mendelian randomization

SMR22 was performed using default settings and eQTL meta-analysis summary statistics from European populations for whole blood from eQTLGen23 and all five nervous system tissues from MetaBrain (basal ganglia, cerebellum, cortex, hippocampus and spinal cord)104. The HEIDI test was performed alongside SMR to test for effect size heterogeneity between the GWAS and eQTL summary statistics. Both SMR and TWAS have a number of important assumptions and limitations, which we discuss in Supplementary Note 9.

Psychiatric omnilocus prioritization score

We used the gene prioritization method PsyOPS26 to rank genes within genome-wide significant loci. This supervised approach integrates biological annotations on mutational intolerance, brain-specific expression and involvement in neurodevelopmental disorder for genes within significant loci. Genes with the top PsyOPS score within each locus were used for further gene prioritization (Gene prioritization). In the instance where two genes in the same locus had the same PsyOPS score, the gene nearest the index SNP was prioritized.

Protein-wide association study

We performed a PWAS using protein expression data from human brain samples. Human brain proteome reference weight data were obtained using the Religious Orders Study and Rush Memory and Aging Project (ROS/MAP) and the Banner Sun Health Research Institute (Banner) study. The ROS/MAP proteomes were generated from the dlPFC of 376 participants of European ancestry and included 1,476 proteins with significant SNP-based heritability (P < 0.01). The Banner PWAS weights were generated from 152 individuals of European ancestry and included 1,147 proteins with significant SNP-based heritability. The PWAS was performed using the TWAS FUSION software20 with linkage disequilibrium reference information from the 1000 Genomes Project Phase 3 (ref. 95). We corrected for multiple testing using Bonferroni correction.

Gene prioritization

We created a list of prioritized genes using both gene-based tests and colocalization–HEIDI filters. Results from each gene-based test were first restricted to protein-coding genes with unique gene identifiers based on the release from GENCODE (version 40) for hg19. The following criteria were then used to prioritize genes: (1) a significant (Bonferroni-corrected) association from at least two gene-based tests (mBAT-combo, TWAS FUSION, SMR or PsyOPS) and (2) evidence of colocalization (COLOC PP4 > 0.8) and/or significant SMR association with HEIDI P > 0.05. Joint–conditional tests of association and significant PWAS associations were used as ancillary approaches to further annotate the prioritized gene list.

Tissue and cell type enrichment analysis

An analysis of tissue and cell type enrichment of OCD GWAS association signals was conducted using MAGMA (version 1.08)105 and partitioned LDSC106. We used the previously described approach29 to determine gene expression specificity in bulk tissue RNA-seq data from 37 tissues in GTEx (version 8) and single-cell RNA-sequencing data from 19 regions in the mouse central and peripheral nervous systems30. The analysis was limited to protein-coding genes with 1:1 orthologs between mice and humans. Gene expression in each tissue or cell type was calculated relative to total expression across all tissues or cell types. Enrichment analysis was performed on genes with the top 10% specificity values in each tissue or cell type, as previously defined29.

To evaluate the enrichment of tissue- and cell type-specific genes in OCD genetic association signals, we applied MAGMA and partitioned LDSC. We restricted the analysis to summary statistics for SNPs with a high INFO score (>0.6) and frequency in the entire cohort (MAF > 0.01). Using MAGMA (version 1.08), we tested whether genes with the top 10% specificity in a tissue or cell type showed enrichment in gene-level genetic associations for OCD, with the 1000 Genomes Phase 3 European sample genotypes serving as the linkage disequilibrium reference panel. We used standard gene boundaries (35 kb upstream of the transcription start site to 10 kb downstream of the transcription stop site). Partitioned LDSC was used to examine whether SNPs within 100-kb regions of the top 10% specifically expressed genes were enriched for SNP-based heritability for OCD. All results were corrected for multiple testing with an FDR threshold of 0.05.

SNP and gene findings in the context of previous analyses

Previously reported associations for significant SNPs (PheWAS)

Multiple resources were used to identify previously reported associations of our 30 significant SNPs with other phenotypes. We used the IEU Open GWAS project92, PheWAS analysis of GWAS ATLAS52 and the NHGRI-EBI GWAS Catalog91 and identified credible SNPs through CAUSALdb90. CAUSALdb estimates causal probabilities of all genetic variants in GWAS significant loci using three state-of-the-art fine-mapping tools including PAINTOR, CAVIARBF and FINEMAP107,108,109,110. We used default settings for our CAUSALdb queries.

Lookup of previous OCD GWAS findings

We performed a lookup of SNPs identified to be significantly associated with OCD-related phenotypes in previous GWASs. Note that this is not an independent replication, as previous studies partially overlap with the cohorts included in this GWAS.

Overlap of previous rare coding variants in OCD and GWAS gene findings

We performed a bidirectional lookup, assessing (1) whether gene findings from our GWAS showed evidence for rare variant involvement and (2) vice versa, whether findings from rare variant testing showed evidence of common variant association in our GWAS.

First, we comprehensively assessed the overlap between 251 genes that we highlighted in our study as carrying common risk variation for OCD (Supplementary Table 14) and current gene-based summary statistics from OCD exome-sequencing data. We used data from Halvorsen et al.88 because it is the largest published exome-sequencing study of OCD presently. The supplementary materials from that paper include de novo variant calls from 771 case trios and 1,911 controls (supplementary table 14 in ref. 88). We compared the burden of de novo variants, partitioned by variant annotation (synonymous, missense, loss of function) in trio cases versus trio controls within these 251 GWAS-prioritized genes. As described previously88, we only included de novo variants that were in loci well covered in both case and control data (In_Jointly_Covered_Loci==TRUE). We also excluded all calls from quartet samples in ref. 88 (Cohort!=“OCD_JHU_quartets”). For each of the four variant annotation classes, we compared the proportion of cases that had at least one qualifying de novo variant to the proportion of controls using a two-sided Fisher’s exact test.

Second, as Halvorsen et al.88 describe an overall excess of loss-of-function variants in OCD cases relative to controls specifically within loss-of-function intolerant genes (supplementary table 13 in ref. 88), we analyzed the overlap between those genes and our GWAS-derived genes. We looked up 200 genes with a probability of loss-of-function intolerance > 0.995 (derived from ref. 111) and effect size estimate > 1. We further tested for a difference in the proportion of these pLI > 0.995 genes with effect size estimate > 1 versus ≤1 within the set of genes highlighted in the OCD GWAS (n = 251) versus outside this set using a two-sided Fisher’s exact test.

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.