Main

Genome-wide association studies (GWAS) have identified tens of thousands of loci associated with complex traits1. Most GWAS loci lie in noncoding genomic regions and their functional mechanisms are predominantly elusive. Genetic studies on molecular phenotypes such as Genotype-Tissue Expression (GTEx) project2 and the eQTLGen consortium3 have identified expression quantitative trait loci (eQTLs) for nearly every gene, yet the proportion of GWAS signals attributable to eQTLs has been modest4,5. To identify the genetic effects underlying complex diseases, it is important to investigate disease-relevant cell types during relevant differentiation stages under pertinent perturbations6. Single-cell RNA sequencing (scRNA-seq) enables unbiased examination of cell types7,8 and massively parallel perturbation measurements of transcriptional regulation9.

Splicing QTLs (sQTLs) are a mediator of genetic effects on complex traits. Growing evidence shows that sQTLs and eQTLs exert largely orthogonal effects on the genetic risk of complex diseases10,11. Cell-type-specific sQTLs are critical for dissecting complex diseases, yet they are heavily underexplored compared to eQTLs. FACS has enabled cell-type-specific dissection of sQTLs12,13,14, but they are biased toward known cell types defined by established surface markers. Genetics-coupled single-cell RNA sequencing (scRNA-seq) studies7,8 enabled unbiased detection of cell types and cellularly resolved eQTLs but did not investigate alternative splicing (AS) due to 3′ bias of their scRNA-seq libraries15,16. Full-length single-cell technologies, on the other hand, currently suffer from insufficient throughput and elevated costs for population-scale profiling17,18.

In this study, we leveraged 5′ library preparation (10x Genomics) and stochastic mRNA cleavage and recapping—an endogenous cellular phenomenon that creates multiple 5′ ends per isoform19,20—to increase the exon coverage of scRNA-seq. Using ~1 million single-cell transcriptomic profiles from 474 donors of Eastern, Southeastern and South Asian ancestries, we identified 48 sex-biased differential splicing events (DSEs) across 32 genes. Specifically, sex-biased splicing of FLNA was putatively driven by female-biased expression of the isoform ENST00000498491. We also identified 1,031 ancestry-biased DSEs affecting a total of 509 genes. In particular, ancestry-biased SPSB2 splicing was probably driven by cross-population allele frequency differences in rs11064437 that disrupted the 3′ splice site of SPSB2 exon 2. We identified 10,874 and 703 cis-sQTLs for protein-coding genes and long noncoding RNAs, respectively, many of which were cell-type-specific or sex-biased. We also identified 865 dynamic intron use events and 107 dynamic sQTLs along B cell differentiation. Our analysis revealed 607 trans-sQTL genes (sGenes) and cell-type-specific genetic coregulation between a trans-sQTL for PTPRC, a protein tyrosine kinase critical for T cell development, and a cis-eQTL for HNRNPLL, a master regulator of T cell activation-induced AS. Finally, we observed strong enrichment of cis-sQTL effects in autoimmune and inflammatory disease heritability. Colocalization analysis identified 563 putative effector genes. In particular, we functionally validated an Asian-specific sQTL that putatively modulated Graves’ disease (GD) risk by disrupting the 5′ splice site of TCHP exon 4 in East Asian populations.

Results

Quantification of AS across cell types

The Asian Immune Diversity Atlas (AIDA) Data Freeze v.1 contained 474 Eastern, Southeastern and South Asian donors from Japan, Korea and Singapore (Fig. 1a, Extended Data Fig. 1a and Supplementary Note). Each donor was sampled to an average of 1,959 single cells (Extended Data Fig. 1b). Each cell was sampled to an average of 53,846 reads, of which 25,228 (46.9%) overlapped at least one splice junction (Extended Data Fig. 1d,e). Although read 1 was biased toward the 5′ end, read 2 was spread more evenly because of normally distributed insert lengths21 (Fig. 1b and Supplementary Fig. 1a). Furthermore, incomplete reverse transcription, along with stochastic mRNA cleavage and recapping—or ‘exon painting’—created multiple 5′ ends per isoform19,20 (Supplementary Fig. 1b) to extend coverage throughout the gene body. The covered bases for each gene increased with its read count, with a median of 85.3% (interquartile range (IQR) = 51.8–97.3%; Fig. 1c and Extended Data Fig. 1f).

Fig. 1: Population-scale 5′ scRNA-seq identified 21 cell types and thousands of alternatively spliced genes per cell type.
figure 1

a, The AIDA cohort and study design. b, Profile plot and heatmap showing that read 1 of 5′ scRNA-seq was biased toward the transcription start site and read 2 was spread more evenly across the gene body. c, The base coverage rate per gene increased with the read count (fraction of base coverage = covered bases/all bases). Left, box plot showing the fraction of base coverage across different read count bins (n = 4,034, 4,114, 4,803, 4,883 and 491, from left to right). Outliers are not shown. Right, box plot showing that a median of 85.3% of exonic bases (red line) are covered across all expressed genes. d, Replication of LeafCutter intron discoveries in GENCODE, PacBio MAS-seq and Snaptron. Top, 59.3% of LeafCutter discoveries were annotated in GENCODE and 85.9% replicated in PacBio long-read sequencing from four individuals. Bottom, close to 93% of detected splice junctions appeared in more than 1,000 samples, 98.8% in more than 100 and 99.5% in more than ten. e, We examined 21 distinct PBMC subtypes with sufficient cell counts. Cell types are colored according to their hematopoietic lineage. The numbers below the cell type labels indicate the sample size for differential splicing analysis and sQTL calling. f, Number of alternatively spliced genes detected per cell across 21 cell types at the single-cell level (see Supplementary Table 1 for the number of cells used (n)). The red diamonds indicate the average number of detected genes (NODGs) per cell. The dashed blue line indicates the number of AS genes detected using the OneK1K dataset. g, NODGs positively correlated with the number of AS genes. Linear regression lines (black) are shown for AIDA and OneK1K, respectively. h, Number of detected AS genes per pseudobulk cell type (see Supplementary Table 1 for the number of cells used (n)). i, Number of detected AS genes scaled with the number of cells in a pseudobulk, plateauing at ~11,500 genes. A sigmoid curve was fitted to the data and plotted. cDC, conventional dendritic cell; GZMB, granzyme B; GZMK, granzyme K; IGHM, immunoglobulin heavy constant Mu; pDC, plasmacytoid dendritic cell; RPKM, reads per kilobase of transcript per million mapped reads; TES, transcription end site; TSS, transcription start site.

We classified scRNA-seq data into 34 cell types based on their gene expression profiles22 and examined AS in 21 peripheral blood mononuclear cell (PBMC) subtypes at two levels (Methods, Fig. 1e and Extended Data Fig. 1c). At the single-cell level, SpliZ23 detected a median of 1,146 AS genes per cell (range of medians = 1,013–2,081; Fig. 1f and Supplementary Table 1). Our estimate was 4.3-fold of OneK1K7 using identical pipelines (1,146 versus 267). The number of AS and expressed genes per cell were highly correlated (Pearson’s r = 0.95). An average of 66% (ordinary least squares (OLS) 95% confidence interval (CI) = 65.8–66.3%) of expressed genes had detectable AS events in AIDA, compared to 12.1% in OneK1K (OLS 95% CI = 12.0–12.2%; Fig. 1g), suggesting that the difference was not due to sequencing depth but rather to exon painting and other factors. At the pseudobulk level, LeafCutter24 detected a median of 7,721 AS genes per pseudobulk (range of medians = 5,341–9,683; Fig. 1h and Supplementary Table 1). AS gene counts scaled sigmoidally with cell counts per pseudobulk, saturating at ~11,500 genes (coefficient of determination R2 = 0.92; Fig. 1i).

As a quality control (QC), we compared pseudobulk introns to GENCODE (v.32): 59.3% were canonical introns, while 38.9% and 1.8% contained one and two new junctions, respectively, which is consistent with findings from the Transcriptomic Resource of Immune Cells using Long-read Sequencing25. New introns had a higher Gini index, indicating cell-type-specific expression, than canonical introns (P < 2.2 × 10−308, t = −55.874, d.f. = 40,909; Extended Data Fig. 2a). Furthermore, we sequenced complementary DNA (cDNA) from four donors with PacBio multiplexed arrays sequencing (MAS-seq); 85.9% of short-read junctions were replicated in long-read data, while 13.9% and 0.17% contained one and two new junctions, respectively (Fig. 1d and Extended Data Fig. 2b). Lastly, compared with Snaptron26, ~93%, 98.8% and 99.5% of pseudobulk junctions appeared in more than 1,000, more than 100 and more than ten Snaptron samples, respectively (Fig. 1d and Extended Data Fig. 2c). For both canonical and new junctions defined by GENCODE, 5′ splice sites were highly enriched for the canonical motif 5′-AG|GU; 3′ splice sites were highly enriched for AG|G-3′ (Extended Data Fig. 2d).

Cell-type-dependent and context-dependent AS

AS is ubiquitous across immune cells27, so we asked whether AS-based clustering recapitulated our current understanding of the hematopoietic lineage. Hierarchical clustering using pseudobulk quantifications separated myeloid from lymphoid cells, which further clustered into B, cytotoxic T and natural killer (NK), and noncytotoxic T cells (Fig. 2a and Extended Data Fig. 3a). To verify the robustness, we repeated hierarchical clustering using single-cell quantifications and observed congruent clustering patterns (Fig. 2a).

Fig. 2: Cell-type-dependent and context-dependent AS.
figure 2

a, Hierarchical clustering of single-cell and pseudobulk quantification of AS recapitulated well-known hematopoietic lineages. The heatmap shows the Spearman’s rank correlation coefficient. Within the T and NK cluster, two subclusters demarcated cytotoxic and noncytotoxic cell types. The cytotoxic cellular cluster contained CD4+ T cytotoxic, mucosal-associated invariant (MAIT), γδ T, NK and CD8+ T (GZMKhi and GZMBhi) cells, whereas CD4+ T cell (naive, TCM and TEM), regulatory T (Treg) cells and CD8+ T naive cells fell within the noncytotoxic cluster. b,c, Alternative intron use of PTPRC and CD44 reflected isoform-specific roles in T cell development. In b, the mRNA encoding the CD45RO isoform (red) was the lowest in naive T cells and was more abundant in activated and memory T cells. This trend was reversed for the mRNA encoding the CD45RA+ isoforms. log-transformed splicing ratio = log2(CD45RX/CD45RO), where RX indicates any isoforms other than RO. For CD45RO, log-transformed splicing ratio = log2(CD45RO/ΣCD45RX). In c, the standard CD44 (CD44s) isoform (red) was highest in naive T cells and was less abundant in activated and memory T cells. d, Discovery and sharing of sex-biased differentially spliced genes (DSGs) (FDR < 0.05). e, The sex-biased isoform expression of FLNA was cell-type-specific. The ENST00000498491 isoform (red boxes) exhibited strong female bias in T cells but not in B cells. f, Ancestry-biased DSGs discovered through pairwise comparisons across Eastern, Southeastern and South Asian individuals. Left, relative contributions of the three pairwise comparisons to the total number of DSGs in each cell type. Right, total number of DSGs across all cell types. g, Allele frequency difference in rs11064437 led to ancestry-biased isoform use of SPSB2 in CD8+ T GZMBhi. rs11064437 disrupted the canonical splice site, thereby promoting use of the new splice site. Black, annotated canonical intron; red, new intron missing from GENCODE. Inset, MAF of rs11064437 decreased from Eastern to Southeastern to South Asian individuals.

We asked whether well-known isoforms were captured using 10x 5′ scRNA-seq assays. PTPRC encodes the CD45 transmembrane protein tyrosine phosphatase, which is critical in T cell differentiation28. The mRNA encoding the CD45RA+ isoform, which included one or more of exons 4, 5 and 6, was preferentially expressed in naive T cells to facilitate T cell activation. In contrast, the mRNA encoding the CD45RO isoform, which skipped exons 4–6, was preferentially expressed in activated and memory T cells29. We observed that use of junctions corresponding to the CD45RA+ isoform was highest in naive T cells and decreased in activated and memory T cells, whereas the CD45RO isoform exhibited an opposite trend (Fig. 2b). Another example is CD44, which encodes a key surface glycoprotein during T cell activation and homing. The mRNA encoding the standard CD44 isoform (CD44s) was most abundant in naive T cells, whereas isoforms containing exons 2–10 were more likely to be expressed after activation30. We observed elevated use of junctions corresponding to the CD44s isoform in naive T cells compared to other T cell subtypes (Fig. 2c).

Sexual dimorphism influences a broad range of autoimmune and inflammatory diseases. Using LeafCutter, we identified 48 sex-biased DSEs in 32 genes across all cell types (false discovery rate (FDR) < 0.05; Fig. 2d and Supplementary Table 2). An example of shared sex-biased splicing involved filamin A (FLNA), an auto-antigen targeted by T and B cells in more than 50% of patients with rheumatoid arthritis (RA)31. Female donors had higher expression of a short isoform, ENST00000498491, consisting of exons 44–49 in mature T and NK cells (Fig. 2e and Extended Data Fig. 3b).

We next determined the influence of ancestry on AS and identified 1,031 DSEs affecting 509 genes (Fig. 2f and Supplementary Table 3). We observed the smallest difference between Singaporean Chinese and Singaporean Malay individuals, potentially because they are more genetically similar to each other than they are to Singaporean Indian individuals32. We identified an SNP, rs11064437, whose minor allele frequency (MAF) decreased from Eastern to Southeastern to South Asian individuals in AIDA (EAS = 30.7%; SEA = 13.9%; SAS = 3.4%; Fig. 2g) and in the 1000 Genomes Project (Extended Data Fig. 3c). Functionally, rs11064437 disrupted the 3′ splice site of SPSB2 exon 2 (5′-AG|C-3′ to 5′-AA|C-3′), prompting the use of an alternative 3′ splice site 12-bp downstream. Notably, the EAS-biased alternative intron was missing from GENCODE, highlighting the lack of ancestral diversity in a widely used gene annotation database.

cis-sQTLs are cell-type-specific and context-biased

We followed an established pipeline33,34 to identify cis-sQTL within a ± 1-Mb window across 19 cell types (Supplementary Table 1). We detected 10,874 and 703 sQTLs for protein-coding and long noncoding RNA genes, respectively (FDR < 0.05), representing 39.9% of all tested genes (Fig. 3a and Supplementary Table 4). Lead sQTL variants (sVariants) were enriched near splice sites, introns and splice regions (Fig. 3b,c). Sample size was positively correlated with sQTL discovery (Pearson’s r = 0.95; Fig. 3d), but were inversely correlated with sQTL effect size (Pearson’s r = −0.95; Extended Data Fig. 4a), indicating that large sample sizes were better-powered to detect small-effect variants33. sQTLs discovery was also positively correlated with pseudobulk library size (Pearson’s r = 0.96; Extended Data Fig. 4b). Sample size and library size independently contributed to sQTL discovery power (likelihood ratio test; sample size: P = 3.68 × 10−10; library size: P = 1.14 × 10−10). Across all cell types, we observed a maximum of 16.1% of sQTL genes (sGenes) with two or more regulatory variants (Fig. 3a). We observed that allelic heterogeneity scaled with the number of significant sGenes to detect sQTLs (Fig. 3e).

Fig. 3: Single-cell sQTL analysis revealed cell-type-specific and sex-biased regulation of splicing.
figure 3

a, Numbers of sGenes (red dots) and proportions of sGenes (stacked bars) with various numbers of independent sQTLs across 19 cell types (adjusted beta-approximated P < 0.05). b, cis-sVariants preferentially located near splice junctions and in the affected introns. c, A Bayesian hierarchical model revealed that sVariants were enriched in the splice region and as missense and synonymous variants. The dot plot shows the mean ± s.e.m. of functional annotations (n of sVariants = 11,577). d, Number of sGenes scaled with the number of donors and junction read count across 19 cell types. The shaded area on either side of the linear regression line represents the 95% CI. e, The proportion of sGenes with more than one independent sVariant increased with the power of sGene discovery across 19 cell types. The shaded area on either side of the linear regression line represents the 95% CI. f, AIDA cis-sQTLs were well replicated in BLUEPRINT, DICE, GTEx LCL, GTEx whole-blood and ImmuNexUT. Each dot represents one cell type from AIDA, colored as in a. g, Fractions of lead cis-sQTLs shared according to sign and magnitude in one or more cell types. Sharing according to sign was defined as a cis-sQTL sharing the same sign with the top cis-sQTL across 19 cell types. Sharing according to magnitude was defined as the effect size of a cis-sQTL being within a factor of two of the top cis-sQTLs across 19 cell types. h, Pairwise sQTL sharing according to magnitude across 19 cell types. A total of 2,488 sQTLs that were significant (linear feedback shift register (LFSR) < 0.05) in at least one cell type were considered to avoid random noise in association testing. i, Number of sex-biased sQTLs discovered in 19 cell types (FDR < 0.05). Cell type coloring as in a. j, CLEC2D sQTLs in CD4+ TEM cells colocalized with the GWAS of lymphocyte count. This colocalization was primarily driven by a female-biased sQTL. The sQTL lead variant rs3764022 was an exonic variant located in the splice region of CLEC2D exon 2. The unadjusted two-sided P value was calculated using QTLtools.

Source data

To replicate cis-sQTLs with external datasets, we used four publicly available PBMC-related bulk RNA-seq datasets: (1) the BLUEPRINT dataset13 (n = 197); (2) the DICE dataset12 (n = 91); (3) the GTEx whole-blood (n = 670) and lymphoblastoid cell lines (LCLs) (n = 147); and (4) ImmuNexUT35 (n = 416) (Extended Data Fig. 5). We used the π1 statistics to assess replication between AIDA and matching cell types from the aforementioned datasets. π1 estimates ranged from 0.91 to 0.93 for BLUEPRINT, 0.83 to 0.91 for DICE, 0.70 to 1 for GTEx whole-blood (except for cDC2; π1 = 0.48), 0.71 to 0.86 for GTEx LCL and 0.70 to 1 for ImmuNexUT (except for CD56+ NK; π1 = 0.42) (Fig. 3f and Supplementary Fig. 2), suggesting that AIDA cis-sQTLs replicated well. Despite high replication rates, we identified an average of 14.31% (range = 4.76–19.32%) new sQTLs compared with ImmuNexUT, the largest existing PBMC dataset (Extended Data Fig. 5).

To identify cell-type-specific sQTLs, we quantified sQTL sharing in terms of sign and magnitude using mash36. sQTLs tended to have the same direction of effects but different effect size magnitudes (Fig. 3g). Neighboring cell-type pairs on the hematopoietic lineage shared sQTLs more than other pairs (Fig. 3h). For example, myeloid and B cells formed two distinct clusters. We identified many cell-type-specific sQTLs in known autoimmune risk genes37. One example was CCL4, which encodes a cytokine upregulated in systemic lupus erythematosus (SLE)-associated autoimmune hemolytic anemia38. Its sQTL effects preferentially appeared in T cell subtypes (Extended Data Fig. 6).

Sex disparities in disease risk might also be attributed to sex-biased sQTLs (sQTL)39. For each independent lead cis-sQTL, we performed sex-biased sQTL analysis with a linear model to test for genotype-by-sex (G × S) interaction. We identified a total of 27 sex-biased sQTLs across 20 genes (FDR < 0.05; Fig. 3i and Supplementary Table 5). Among these, two sex-biased sQTLs had significant effects in only one sex. For example, the lead variant rs930090 modulated TECR intron use (chromosome 19: 14529711–14562525) in CD16+ NK cells for females (P = 6.42 × 10−19; \(\hat{\beta }\) = 0.359) but not males (P = 0.043; \(\hat{\beta }\) = 0.125; Extended Data Fig. 4d). Another 25 sex-biased sQTLs appeared in the two sexes with different effect sizes. The lead sVariant rs17713729 had a significant sQTL effect on the SH3YL1 intron (chromosome 2: 253115–264782) in CD4+ central memory T (TCM) cells for both males (P = 6.41 × 10−30) and females (P = 2.49 × 10−14), but its effect size was larger in males (\(\hat{\beta }\) = 0.523) than in females (\(\hat{\beta }\) = 0.319; Extended Data Fig. 4e). Using 20 GWAS traits conducted in East Asian populations40,41,42,43,44 (Supplementary Table 6), we identified a sex-biased colocalization between lymphocyte count and CLEC2D sQTL in CD4+ effector memory T (TEM) cells. The sQTL had a stronger effect on CLEC2D splicing in females (\(\hat{\beta }\) = 0.817, P = 5.5 × 10−21) than males (\(\hat{\beta }\) = 0.499, P = 3.3 × 10−6) and stronger colocalization with lymphocyte count in females (H4 = 0.935) than males (H4 = 0.676; Fig. 3j). CLEC2D encodes a transmembrane C-type lectin receptor that causes inflammation and tissue injury45. Further experimental evidence is needed to understand its function affecting lymphocyte count.

We performed ancestry-biased sQTL detection by stratifying the AIDA cohort into East Asian, Southeast Asian and South Asian. We observed eight Malay-biased cis-sQTLs and 19 Indian-biased cis-sQTLs (Supplementary Table 7). One example of Malay-biased sQTL is the lead variant rs492083, which modulates ATP5MPL intron use (chromosome 14: 103914633–103915066) in CD16+ monocytes (P = 2.10 × 10−8; \(\hat{\beta }\) = -0.995; Extended Data Fig. 4f). One example of Indian-biased sQTL is the lead variant rs6576010, which modulates POLB intron use (chromosome 8: 42338685–42344953) in naive CD4+ T cells (P = 1.31 × 10−7; \(\hat{\beta }\) = −0.946; Extended Data Fig. 4g).

Dynamic intron use and sQTL across B cell development

B cell differentiation is crucial for the immune system46. We focused on naive (n = 26,617), IGHMlo memory (n = 16,280) and IGHMhi memory B cells (n = 10,067) to explore how splicing regulation correlates with the different stages of B cell development (Fig. 4a). We assigned pseudotime to each cell along a differentiation trajectory (Fig. 4b) and split the population into six quantiles based on pseudotime values (Fig. 4c). In this trajectory, memory B cells underwent class switch recombination from IgM (IGHMhi) to other isotypes (IGHMlo), in agreement with previous studies (Fig. 4d)47. We identified 865 introns whose use correlated with pseudotime (FDR < 0.05; Supplementary Table 8). Dynamic introns showed three distinct modes of pseudotime-dependent use (Fig. 4e). The first mode was a stepwise change between naive and memory B cells. PAX5 is a B cell-specific transcription factor that has a critical role in B cell development (Fig. 4f and Extended Data Fig. 7)48,49. Its two isoforms were expressed throughout B cell development, but there was no consensus regarding their isoform-specific roles in the current literature49,50,51. We observed PAX5A downregulation and PAX5B upregulation as B cells matured. The second mode showed a steady linear change in intron junction use, like the decrease in exon 4 use for PTPRC during naive to memory B cell differentiation. The final quadratic mode involved transient intron use, such as transient upregulation of exon 1 in DOCK8 during B cell development (Fig. 4f and Extended Data Fig. 7).

Fig. 4: Dynamic intron use and sQTLs identified through B cell development.
figure 4

a, Principal component (PC) projections of single-cell gene expression for naive, IGHMhi memory and IGHMlo memory B cells. b, Pseudotime projection of 52,964 B cells. The direction of the curve and the intensity of the green color indicate the dynamic process of B cell maturation from naive to IGHMhi memory and to IGHMlo memory B cells. c, B cells were partitioned into six quantiles according to pseudotime values. d, Dynamic expression of IGHM during cellular development agreed with B cell class switch recombination from producing IgM to other isotypes. IGHM ratio: IGHM expression level/(IGHM + IGHG1 + IGHG2 + IGHG3 + IGHG4 + IGHA1 + IGHA2 + IGHD + IGHE) expression level. e, Three distinct patterns were identified for pseudotime-dependent intron use: stepwise, linear and quadratic. f, Dynamic intron use across six quantiles of B cell development. Three example genes with different dynamic intron use patterns (top, stepwise change in PAX5; middle, linear change in PTPRC; bottom, quadratic change in DOCK8). The dot color corresponds to the six quantiles in c and the dot size reflects the mean intron usage in that quantile. g, Left, heatmap of scaled mean intron use across pseudotime, with the color bar corresponding to the three dynamic intron use patterns in e. sVariant–intron pairs with significant interaction effects with B cell pseudotime are shown. Both linear (genotype × time) and quadratic (genotype × time2) models were used to assess the interaction between genetic and pseudotime quantiles. Middle, scaled effect size estimates of sVariant–intron pairs. Right, three example genes (CLEC2D, CCND3, ORMDL3) with dynamic effect sizes across pseudotime. The samples sizes for each quantile are: Q1 (n = 419), Q2 (n = 425), Q3 (n = 427), Q4 (n = 450), Q5 (n = 448) and Q6 (n = 449).

Next, we sought to identify dynamic genetic effects on splicing across the B cell differentiation trajectory. We found 107 lead sQTLs with significant pseudotime interactions (FDR < 0.05, Fig. 4g and Supplementary Table 9). Of these dynamic sQTLs, the effect sizes of 62 (57.94%) grew stronger, 37 (35.58%) grew weaker and 11 (10.28%) showed transient effects along the differentiation trajectory. Several dynamic sGenes were known to be immune-related, including CLEC2D45, CCND3 (ref. 52) and ORMDL3 (ref. 53). Notably, we found several dynamic sQTLs that colocalized with autoimmune disease GWAS loci. For example, the dynamic sQTL for CD83, whose effect of lead variant rs6936285 weakened during B cell maturation, strongly colocalized with RA (COLOC H4 = 0.980) (Extended Data Fig. 8a). Additionally, the dynamic sQTL for BCL2A1, whose effect of lead variant rs16971619 increased during B cell maturation, colocalized with lymphocyte count (COLOC H4 = 0.848). These findings suggest a functional link between dynamic sQTLs and autoimmune diseases (Extended Data Fig. 8b).

trans-sQTLs are highly cell-type-specific

We performed trans-sQTL mapping to identify distal genetic effects on splicing (Methods) and identified 607 trans-sGenes (FDR < 0.01; Fig. 5a). The number of detected trans-sGenes was positively correlated with sample size (Spearman’s ρ = 0.74) and with the number of cis-sGenes (Spearman’s ρ = 0.62; Fig. 5b).

Fig. 5: trans-sQTL analysis revealed a regulatory relationship between HNRNPLL and PTPRC.
figure 5

a, Upset plot showing discovery and sharing of trans-sQTLs across cell types. Right, the bar plot shows the number of trans-sQTLs per cell type. Top, the bar plot shows the number of trans-sQTLs in each category. The x axis is truncated at a minimum of five sQTLs. b, The number of trans-sGenes scaled with the number of donors. The two-sided P value was calculated using Spearman’s rank correlation. c, Box plot of the π1 statistics for cis-sQTLs and trans-sQTLs. The P value was calculated using a two-sided paired t-test (n = 251 for trans-sQTLs; n = 251 for cis-sQTLs). d, Circos plot revealing the cis-regulatory effects (cis-eQTLs) underlying trans-sQTLs (links colored according to cell type as in a). A link is black if a colocalization event occurred in multiple cell types. e, Bar plot and heatmaps showing the colocalization probability (COLOC PP: H4) between HNRNPLL cis-eQTL and PTPRC trans-sQTL and QTL P values. In e,f,j, Unadjusted P values were obtained using Matrix eQTL (cis-eQTL) and QTLtools (trans-sQTL). f, LocusCompare plots showing the colocalization between HNRNPLL cis-eQTL and PTPRC trans-sQTL in CD4+ T (naive, TCM and TEM) cells. g, Higher SpliZ scores (representing more isoforms with longer intron length) were observed in single cells with greater HNRNPLL expression. The dot plot shows the mean and 95% CI. The P value was calculated using a two-sided t-test (n = 214,504 for ‘not expressed’; n = 53,064 for ‘expressed’). h, Violin and box plots showing that rs6751481 was associated with the ratio between naive and memory CD4+ T cells across AIDA donors. The P value and β were determined using linear regression (red line; n = 96 for TT; n = 217 for TC; n = 114 for CC). i, SMR revealed strong pleiotropy between HNRNPLL cis-eQTLs and GWAS on activated T cell proportion. The P value was obtained using SMR (n = 3579 for all the input variants). The SMR effect plot shows the mean ± s.e.m. of the variant effects. j, LocusZoom plot showing that naive and CD4+ TEM cells harbored two independent lead SNPs for HNRNPLL cis-eQTLs (square: lead SNPs for naive and CD4+ TEM cells; triangle: remaining SNPs for naive CD4+ T cells; circle: remaining SNPs for CD4+ TEM cells). Bottom, SuSiE posterior inclusion probability (PIP). The LD between rs6751481 and rs74258942 was modest (r2 = 0.28). k, Schematic showing the proposed regulatory relationship between HNRNPLL cis-eQTLs and PTPRC trans-sQTLs.

Source data

Most detected trans-sGenes were cell-type-specific. Out of all trans-sGenes, 393 (64.7%) appeared in only one cell type, which was similar to the OneK1K estimate (63.6%). Trans-sQTLs π1 replication between cell types was significantly lower than cis-sQTLs (two-sided t-test P = 2.91 × 10−79, t = −28.1, d.f. = 250; Fig. 5c). Additionally, we noticed that the effect size differences between discovery and replication cell types were larger for trans-sQTLs than cis-sQTLs (two-sided t-test P = 2.81 × 10−91, t = −31.4, d.f. = 250; Supplementary Fig. 3). This suggests that trans-sQTLs are more cell-type-specific than cis-sQTLs.

We asked whether the cis-effects on RNA-binding proteins could underlie trans-sQTLs. We identified 17 colocalization events between AIDA cell-type-specific cis-eQTLs and trans-sQTLs (COLOC H4 > 0.75; Fig. 5d and Supplementary Table 10). Notably, PTPRC trans-sQTLs exhibited T cell-biased colocalization with HNRNPLL cis-eQTLs (Fig. 5e,f and Extended Data Fig. 9a,b). Furthermore, single-cell quantifications also suggested that higher HNRNPLL expression led to shorter PTPRC isoforms (two-sided t-test P = 3.07 × 10−197, t = −30.0, d.f. = 81578.31; Fig. 5g).

Notably, we observed a lead switch across CD4+ T cell subtypes. In naive CD4+ T cells, rs6751481 was the lead SNP for HNRNPLL cis-eQTLs and PTPRC trans-sQTLs (H4 = 1.00). However, the lead SNP rs74258942 in CD4+ TEM cells (H4 = 0.998) was 36,158 bp away and in modest linkage disequilibrium (LD) with rs6751481 (r2 = 0.28). Fine-mapping with SuSiE54 confirmed that both lead SNPs were the most likely causal SNPs for their respective cell types (Fig. 5j). We observed that both lead SNPs overlapped functional elements on HaploReg: rs6751481 overlapped a blood-specific promoter region, while rs74258942 overlapped a blood-specific enhancer region. In addition, we also replicated rs6751481 in OneK1K. It was the lead SNP for HNRNPLL in CD4+ naive T cells (FDR < 0.15) but not in CD4+ TEM cells (FDR > 0.9). The variant rs74258942 was Asian-biased (OneK1K MAF = 0.07; AIDA MAF = 0.32; 1KG_EUR = 0.07; 1KG_EAS = 0.29) and was not significant in OneK1K (FDR > 0.99).

Isoform-specific expression of PTPRC (CD45) is associated with naive CD4+ T cell differentiation. Because rs6751481 was associated with the CD45RA+ and RO isoform ratios in CD4+ T naive cells, we tested whether it would be associated with CD4+ T cell proportions and observed a modest but significant association (\(\hat{\beta }\) = 0.043; P = 0.027; Fig. 5h). As an orthogonal validation, we leveraged a previous GWAS study on T cell proportions55 and conducted summary-based Mendelian randomization (SMR)56. We observed strong pleiotropy between HNRNPLL expression and the proportion of memory CD4+ T cells (P = 9.21 × 10−7; Fig. 5i) without significant linkage effect (HEIDI P = 0.103). These results showed that germline variations influencing HNRNPLL expression and PTPRC splicing also influenced T cell proportions (Fig. 5k).

Aberrant splicing mediates complex diseases

To identify the AS mechanisms that mediated polygenic disease risk, we compiled GWAS summary statistics for 20 traits focused on Asian populations (Supplementary Table 6). We conducted colocalization between cis-sQTLs from 19 cell types and GWAS for 20 complex traits with COLOC57 (Fig. 6a and Extended Data Fig. 10a). The proportion of colocalized GWAS loci varied across different GWAS traits, with immune-related diseases having the highest proportion, while nonimmunological traits had the lowest proportion of colocalization. We applied stratified LD score regression (S-LDSC) to estimate heritability enrichment from GWAS summary statistics58. The heritability of SLE, atopic dermatitis (AD), GD and RA showed higher enrichment in PBMC sQTLs than other traits (Fig. 6b and Extended Data Fig. 10e,f).

Fig. 6: Aberrant splicing mediates complex diseases.
figure 6

a, Cell-type-specific colocalization between cis-sQTLs from 19 cell types and 20 complex traits. b, Heritability enrichment (proportion h2/proportion variant) for 20 traits mediated by cis-sQTLs from 19 cell types. Autoimmune and inflammatory diseases are highlighted in bold. c, Colocalization for 28 example sGenes across 19 cell types in the five disease traits. The color of each circle indicates the associated diseases. The inset shows the total number of colocalized loci across the five diseases. d, Gene expression, eQTLs, junction reads, sQTLs and H4 posterior probability (sQTL-GWAS colocalization) for TCHP across 19 cell types. High junction use between exons 4 and 5 led to sQTL and sQTL-GWAS colocalization. e, Cell-type-specific colocalization of GD GWAS and TCHP sQTLs in seven cell types. rs74416240 was the lead GWAS risk variant. The unadjusted, two-sided P value was calculated using QTLtools. f, MAF of rs74416240 in five AIDA populations and five major populations in the 1000 Genomes Project showed an East Asian bias of the rs74416240 minor allele. g, Gene model of TCHP with three isoforms. rs74416240 was located in the 5′ splice site of the intron junction between exons 4 and 5. h, Minigene experiment to validate the effect of rs74416240 on TCHP exon 4 splicing in K562 cells. The universal minigene vector (UMV) backbone alone corresponded to the band with the smallest molecular weight on the gel image. The test region, containing the 57-nt long exon 4 plus the 200-bp flanking sequences, was cloned into the UMV. Two identical minigene constructs with one nucleotide difference at rs74416240 (reference = G; alternative = A) were transfected into K562 cells. The reference allele (G) predominantly led to the normal isoform; the alternative allele (A) led to intron retention. BAS, basophil count; BMI, body mass index; EOS, eosinophil; Hb, hemoglobin; Ht, hematocrit; MCH, mean corpuscular Hb; LYM, lymphocyte; MCHC, MCH concentration; MCV, mean corpuscular volume; MON, monocyte count; NEU, neutrophil; PLT, platelet count; RBC, red blood cell; WBC, white blood cell count.

Source data

We identified 53 colocalized loci among the five autoimmune and inflammatory diseases (COLOC H4 > 0.75; Fig. 6c and Supplementary Tables 11 and 12). Our results identified cell-type-specific colocalization between the SLE GWAS and an sQTL that regulates the splicing of IRF5 exon 1 (Extended Data Fig. 10b,c). IRF5 is a well-known risk gene for SLE59. The putative causal SNP rs2004640 disrupted the 5′ splice site of exon 1B, leading to nonsense-mediated decay and downregulation of IRF5 expression (Extended Data Fig. 10d).

We turned our attention to GD, an autoimmune hyperthyroidism with a significantly higher incidence rate in Asian than European populations60. Colocalization analysis captured TCHP as a cell-type-specific risk gene for GD. Cell types with high use of the intron junction between TCHP exons 4 and 5 harbored cis-sQTLs, which led to cell-type-specific sQTL-GWAS colocalization (Fig. 6d). Furthermore, cis-sQTL effects led to intron retention between exons 4 and 5 and nonsense-mediated decay, which manifested as cis-eQTLs (Fig. 6d). A recent study in the Japanese population identified a GWAS locus associated with GD (P = 8.6 × 10−14)42. The GWAS lead variant rs74416240 had a significant genetic effect on TCHP exon 4 use in monocytes, NK, CD4+ and CD8+ T cells (FDR < 0.05) and strongly colocalized with TCHP cis-sQTLs (Fig. 6d,e). Notably, the MAF of rs74416240 was high in the East Asian population (Japanese = 0.18; Chinese = 0.15, Korean = 0.13), modest in the Southeast Asian (Malay = 0.04) population and absent in the South Asian population in AIDA. We observed the same pattern of allele frequencies in 1000 Genomes Project populations (Fig. 6f).

The lead variant rs74416240 resided in the last nucleotide of TCHP exon 4 and was predicted to disrupt the 5′ splice junction (Fig. 6g). To validate this effect, we conducted a minigene experiment to test the effect of rs74416240 on TCHP exon 4 splicing in K562 cells. The test region contained a 57-nt-long exon 4 plus the 200-bp flanking sequence (Fig. 6h). We transfected two identical minigene constructs with one nucleotide difference at rs74416240. A low level of intron retention was observed for the reference allele (G) construct. The alternative allele (A) construct, where the 5′ splice site was disrupted, revealed a nearly complete intron retention isoform (Fig. 6h). These results suggest that rs74416240 is a causal variant for the TCHP sQTL and possibly contributes to GD risk via this differential splicing.

Discussion

AS is a key mediator of genetic effects on complex diseases. Our study leveraged droplet-based 5′ scRNA-seq to investigate cell-type-specific AS and sQTLs. Unlike FACS-based bulk RNA-seq studies12,13,35, our study conducted unbiased sampling to capture low-abundance PBMC cell types. Compared to previous population-scale scRNA-seq PBMC studies7,8, our study demonstrated a 4.3-fold increase in exon coverage by leveraging 5′ scRNA-seq, enabling better power for sQTL detection. In addition, our data filled a critical gap in expanding the catalog of under-represented Asian populations, especially Southeast and South Asians.

The ancestral diversity of our cohort enabled the identification of ancestry-biased splicing and sQTLs. Like previous studies61, we found that ancestry had a stronger influence than sex on AS. Such influence can be partially attributed to differences in allele frequency: ancestry-biased SPSB2 splice site use was caused by an MAF gradient from Eastern to Southeastern to South Asian populations. In addition, most Malay-biased and Indian-biased sQTLs had higher MAFs in the respective discovery populations, while the rest were explained by differences in effect sizes. Notably, we captured an Asian-specific causal variant that disrupted the 5′ splice site of TCHP exon 4 and led to intron retention. This variant colocalized with an Asian-specific GD risk locus identified in the Japanese population42.

We identified 607 trans-sGenes across 19 cell types, which was significantly higher than previously identified by tissue-level bulk analysis33. Most trans-sGenes are specific to one cell type, which probably explains the scarcity of trans-sGenes when studying homogenized tissue samples. Notably, trans-sQTLs for PTPRC colocalized with cis-eQTLs for HNRNPLL in a cell-type-specific manner. Lead SNPs in naive and CD4+ TEM cells were more than 36-kbp apart and in modest LD; fine-mapping analysis suggested distinct causal variants across the T cell subtypes.

Our study highlights mechanistic findings and disease implications from single-cell sQTLs and underscores the importance of ancestral diversity. However, our findings must be interpreted within the limitations of our study. One limitation is that dynamic B cell analysis missed transitional states such as activated B cells and plasmablasts, which are typically observed during the adaptive immune response and not in healthy donors. Individuals undergoing active immune challenges are better suited to study B cell development. A second limitation of this study is that rare cell types are underpowered for sQTL discovery. Thus, we probably missed true positive sQTLs in these rare cell types. A third limitation was the uneven sequencing depth across the length of the gene. This problem could be addressed using full-length scRNA-seq library preparation methods, such as SmartSeq, or using long-read sequencing, such as Oxford Nanopore and PacBio. Because of the higher cost per sample and lower throughput, these methods have not been widely applied to large-scale sQTL studies. With the rapid advancement in genomic technologies, high-throughput and low-cost full-length scRNA-seq will enable unbiased and comprehensive coverage of the transcriptomic landscape.

Methods

Inclusion and ethics statement

Local researchers from the AIDA consortium member countries were involved in study design, study implementation, data ownership, intellectual property and authorship of publications throughout the research process. All participants were approved by local ethics review committees before study enrollment.

All study protocols were approved by the institutional review boards (IRBs) of the institutions that the laboratories are affiliated with (Genome Institute of Singapore: IRB-2020-012 and IRB-2022-051; Nanyang Technological University: IRB-2016-11-030-01, IRB-2016-11-030 and IRB-18IC4698; RIKEN: IRB-H30-9; Samsung Genome Institute: IRB-2019-09-121; Faculty of Medicine Siriraj Hospital, Mahidol University: IRB-725/2563(IRB3); National Institute of Biomedical Genomics: IRB-NIBMG/2022/1/0022) before dataset generation. All donors provided written informed consent for sample and metadata collection and subsequent analyses.

The AIDA cohort

The AIDA Data Freeze v.1 performed 10x 5′ scRNA-seq on 503 donors of Eastern (Singaporean Chinese, Japanese, Korean), Southeastern (Singaporean Malay) and South Asian (Singaporean Indian) descent. After removing donors whose samples failed genotyping or scRNA-seq QC, and related donors (up to third-degree cousins), 484 nonrelated individuals remained. Taking the intersection between genotype and scRNA-seq resulted in 474 individuals for sQTL analysis. The donor characteristics are provided in Supplementary Table 1.

Genotype data processing and QC

Genotyping was performed using the Illumina Infinium Global Screening Array v.3 according to the manufacturer’s protocol. We exported genotype data into PLINK data format using the GenomeStudio PLINK Input Report plug-in v.2.1.4 to perform QC and imputation. A total of 477,889 post-QC variants were used for genome-wide imputation with prephasing using the Michigan imputation server62. All populations in 1000 Genomes Project high-coverage (hg38) were selected as reference panel. After imputation, only imputed variants with an imputation quality of R2 > 0.8 were retained. Variants with missingness greater than 0.05, MAF < 0.05 or Hardy–Weinberg equation P < 1 × 10−6 were excluded from further analysis. Monoallelic and multiallelic variants in each population were removed, leaving 5,065,361 genetic variants for downstream analysis. We extracted autosomal nonpalindromic variants and merged the AIDA genotypes with the 1000 Genomes (n = 2,504) for PC analysis (PCA). We merged the AIDA samples with the 1000 Genomes samples using bcftools63 v.1.9 and performed PCA using PLINK64 v.1.9.

scRNA-seq data processing and QC

The single-cell experiments used the 10x Genomics 5′ v.2 RNA-seq method. Each batch included a pooling of 15 Asian donors and one European control sample. The initial data preprocessing and QC steps, such as doublet identification and demultiplexing, were conducted using DRAGEN v.3.8.4, Cell Ranger v.7.0.1 and Freemuxlet (https://github.com/statgen/popscle;v0.1-beta). We used the GENCODE release 32 (GRCh38, Ensembl 98, 5 September 2019) as our gene annotation reference. Cells with fewer than 300 GENCODE release 32 genes detected (NODG < 300) were filtered out. Subsequent analyses involved Seurat v.4.1.1R65 and RCA v.2.0 (ref. 65) for the initial cell type annotation for the doublet identification workflow. Further QC, both at the library-specific and cell-type-specific levels, was performed to remove low-quality cells. AIDA Data Freeze v.1 included 1,058,909 PBMCs from 503 Asian donors and five European controls profiled in Japan, Singapore and South Korea. Cells expressing heightened platelet marker genes in B cells, plasmacytoid dendritic cells, myeloid cells, innate lymphoid cells, NK cells and T cells were removed during cell-population-specific QC. Data integration was achieved using the Seurat anchor integration reciprocal PCA algorithm. Subsequent subclustering annotation was based on marker genes curated from the literature and an examination of gene expression across clusters in our dataset. In the final annotation, 34 cell types were identified. In these 34 subtypes, we focused on 21 PBMC subtypes, each including more than 3,500 cells. Red blood cells, platelets and clusters with ambiguous identities (for example, expression of marker genes of other cell types) were excluded from this analysis (the details are described in ref. 22).

Quantification of mRNA AS

RNA-seq data were aligned to the human reference genome GRCh38 primary assembly and GENCODE v.32 using STARsolo66 v.2.7.10a with the options --soloCBmatchWLtype 1MM --soloUMIdedup 1MM_Directional_UMItools. The cell barcode whitelist is available inside the Cell Ranger installation (cellranger-x.y.z/lib/python/cellranger/barcodes/737K-august-2016.txt). A two-pass mode was used to enable new splice junction discovery. The --waspOutputMode option was used to reduce allelic mapping bias. For each sample, the corresponding post-QC VCF file was used for WASP filtering. Deduplication of reads was performed based on cell barcodes and unique molecular identifier tags from the BAM files using the MarkDuplicates function from Picard. Only uniquely mapped reads in proper pairs that passed the WASP filter were retained for downstream analysis. To calculate the percentage of exonic bases covered by sequencing reads, BEDTools67 v.2.27.1 merge was used to merge overlapping exon regions. To vizualize read distribution, deepTools68 v.3.5.1 was used.

Reads from the same cell type of each donor were extracted using custom scripts to make pseudobulk BAM files; reads from the same individuals were pooled using SAMtools69 v.1.16.1 merge. We used RegTools70 v.0.0.1 to extract intron junctions and LeafCutter24 v.0.2.9 to quantify intron use levels. The prepare_phenotype_table.py script from LeafCutter was used to generate phenotype files in sQTL mapping. Introns with zero read counts in more than 40% of the samples or with insufficient variation (s.d. < 0.005) were removed. Replication of intron junctions obtained using RegTools is detailed in the Supplementary Methods. SpliZ was used to quantify the gene-level splicing of each cell. This pipeline generated a scalar score for each gene–cell pair. A larger negative score indicates that the introns for the gene in a given cell are shorter than average, while a larger positive score indicates the opposite. The GENCODE v.32 annotation file was used as the reference annotation file. The --lower_bound parameter was set to 1 to include all junction reads in the calculation of SpliZ.

Differential splicing analysis across sex and ancestry

Sex-biased differential splicing was performed with LeafCutter v.0.2.7 using default parameters, with age, sequencing center and five genotype PCs as covariates. We calculated FDRs with the Benjamin–Hochberg method and used an FDR < 0.05 as a significance cutoff. The R package gggenes v.0.5.1 was used to plot the gene model and visualize the read coverage on FLNA.

Ancestry-biased differential splicing was performed with LeafCutter using default parameters, with sex and age as covariates. To minimize the batch effects caused by different sequencing centers, we focused on the Singapore cohort (n = 75 for Singaporean Chinese; n = 54 for Singaporean Malay; n = 60 for Singaporean Indian) for this analysis. We determined the influence of ancestry on AS by performing one-versus-one differential splicing analysis between the Singaporean Chinese, Malay and Indian populations. We calculated FDRs with the Benjamin–Hochberg method and used an FDR < 0.05 as a significance cutoff. Sashimi plots used to show the junction use difference between ancestries were generated using ggsashimi34 v.1.1.5.

cis-sQTL mapping

cis-sQTL mapping was performed using QTLtools71 v.1.2, using intron excision ratios and a cis-window of 1 Mb on both sides of the junction. Each cell type needed a minimum of ten cells per individual. Eight PCs derived from splicing ratios, five genotype PCs, sex and age information were used as covariates in the linear model. The number of phenotypes and genotype PCs were chosen to maximize sQTL discovery. Grouped permutations (--grp option) were used to jointly compute an empirical P value over all intron clusters of a gene. QTLtools was run using the permutation mode (1,000 permutations); beta-approximated permutation P values were adjusted for multiple testing correlation using the qvalue package72 v.2.30.0. The significance threshold was set at FDR < 0.05. Conditional sQTL analysis was done by forward stepwise regression followed by a backward selection step. The gene-level significance threshold was set to the maximum beta-adjusted P value over all sGenes in a given cell type. Scanning for cis-sQTLs using QTLtools was performed to correct for all previously discovered variants and all covariates. If the beta-adjusted P value for the lead variant was insignificant at the gene-level threshold, the forward stage was complete and the procedure moved on to the backward stage. If this P value was significant, the lead variant was added to the list of discovered cis-QTLs as an independent signal and the forward step moved on to the next iteration. The backward stage consisted of testing each variant separately, controlling for all other discovered variants. Enrichment of lead sQTL variants in the annotations was calculated using qtlBHM v.1.0 (https://github.com/rajanil/qtlBHM).

To evaluate the replication of AIDA cis-sQTLs in the external datasets, we used five data sources: (1) BLUEPRINT13, which had CD14+ monocytes and CD4+ T cells; (2) DICE12, which had CD14+ monocytes, CD16+ monocytes, CD16+ NK cells, CD4+ naive T cells, CD8+ naive T cells and naive B cells; (3) GTEx whole-blood; (4) GTEx LCL; and (5) ImmuNexUT35. During replication, we selected cis-sQTLs from the corresponding cell types and queried their P value in the aforementioned datasets. The replication was quantified using the π1 statistic as implemented in the qvalue package.

We used mashr36 v.0.2.79 to estimate sQTL sharing across cell types. We used as input the nominal P values from sQTLs (sIntron-sVariant) for each cell type. We defined the cis-sQTL with the most significant P value in each gene as the top cis-sQTL for this gene and combined all the top cis-sQTLs as the strong tests. Random tests were chosen by randomly selecting 20,000 cis-sQTLs from sQTL nominal P values. Strong tests were used to learn data-driven covariance matrices; random tests were used to learn correlation structures. Finally, we used the aforementioned fitted model to compute the posterior means and LFSR for the strong tests.

Sex-biased cis-sQTL

Sex-biased sQTLs were defined as those cis-sQTLs with a significant genotype-by-sex (G × S) interaction effect. For each significant intron–variant pair identified in the independent sQTL analysis, a linear regression model was fitted for each cell type to test for genotype-by-sex (G × S) interaction while adjusting for known and unknown confounders. Male-specific or female-specific means that the allelic effect of the sQTL was significant in the discovery sex (P < 5 × 10−4) and nonsignificant in the other sex (P > 0.05). Differential effect means that the allelic effects of the sQTL were significant in both sexes, but their effect sizes differed.

Dynamic intron use and dynamic sQTL mapping in B cell development

We inferred the pseudotime trajectory from naive to memory B cells using slingshot73 v.2.10.0 and divided the trajectory into six discrete quantiles. To identify dynamic splicing junctions during B cell development, we used analysis of variance to identify intron junctions with a significant change (FDR < 0.05) through six quantiles. To identify dynamic sQTLs, we used linear mixed models with the R package lme4 (ref. 74) v.1.1-35.3. The model assessed the interaction between genotype and pseudotime quantiles, incorporating random effects to account for individual identity and fixed effects for genotype, pseudotime, sex, age and ancestry PCs, and phenotype PCs from each quantile (Supplementary Methods). We tested both linear (genotype × pseudotime) and quadratic interactions (genotype × pseudotime2). An FDR of 0.05 was used as a threshold to call dynamic sQTLs.

trans-sQTL mapping

We followed a previously established permutation-based pipeline33 to map regulatory variants at least 5 Mb away from their trans-sGenes and applied stringent mappability filters to minimize false positive findings74. Only variants with a 50-mer mappability greater than 0.9 were retained as test variants. To exclude genes susceptible to mapping artifacts, we excluded genes with a mappability of less than 0.8 and any variant–gene pair where the trans-sGene cross-mapped with any gene within the cis-window of the variant. Mappability was calculated with a k-mer length of 75 for exons and 36 for untranslated regions75. We performed trans-sQTL mapping using QTLtools34 to test for associations between the test genes and all variants beyond 5 Mb of the same chromosome and interchromosomal associations. For gene-level FDR control, we obtained beta-approximated empirical P values based on 50,000 permutations. To correct for multiple intron phenotypes per gene, we used the most significant empirical P values of variant–phenotype pairs to compute the distribution for the P value across k phenotypes using 1 − (1 − F(x))k, where F(x) is the empirical cumulative distribution function. Benjamin–Hochberg correction was used across all genes and an FDR of 0.01 was used as a threshold to define trans-sQTLs. The circos plot was generated using circlize76 v.0.4.15.

Complex trait associations

We obtained publicly available GWAS summary statistics of 20 phenotypes covering a broad range of categories including blood traits (hemoglobin, platelet count, white blood cell count, monocyte count, basophil count, hematocrit, mean corpuscular volume, lymphocyte, neutrophil, mean corpuscular hemoglobin concentration, mean corpuscular hemoglobin, red blood cell, eosinophil), anthropometric traits (height, BMI) and immune-related traits (AD, asthma, GD, RA, SLE). We estimated sQTL enrichment in the 20 traits using S-LDSC v.1.0.1 with default parameters from the partitioned heritability pipeline (https://github.com/bulik/ldsc/wiki/Partitioned-Heritability) in a window of ±100 kb around each sGene with at least one independent cis-sQTL. To assess colocalization between GWAS loci and cis-sQTLs, we tested colocalization between GWAS and sQTL signals for genes with at least ten variants using the coloc.abf function of the COLOC R package57 v.5.2.3 (with default prior, using beta coefficients from the GWAS and sQTL analysis). We used the MAFs of our sQTL study to estimate the standard deviation of the quantitative trait. H4 > 0.75 was set as the threshold for colocalization. The colocalization results were visualized using LocusCompare77 (http://locuscompare.com/) v.1.0.0.

Experimental validation

To validate the effect of rs74416240 on TCHP exon 4 splicing, we built two minigene constructs that were identical except at rs74416240, with a G allele in the reference construct and an A allele in the alternative construct (the complete sequences can be found in the Supplementary Information). The minigenes consisted of three exon and two intron fragments cloned into an expression plasmid, whereby the middle exon may be subject to splicing changes by the variant. We inserted the sequences into the MCAD minigene we built previously78, which had two outer constitutive exons with their intronic portions and an intronic multiple cloning site to clone the fragments. We cloned both reference and alternative alleles using PCR amplification of genomic DNA fragments. We transfected minigene constructs into the K562 cell line. Forty-eight hours after transfection, we examined the splicing patterns using RNA extraction, PCR with reverse transcription and agarose gels. All oligonucleotides were synthesized by Integrated DNA Technologies (Supplementary Data 1).

Statistics and reproducibility

We included all donors with both quality-controlled genotype and scRNA-seq data. For each cell type, donors with fewer than ten cells were excluded from the analysis. This filtering process resulted in the final sample sizes for each cell type, as detailed in Supplementary Table 1. Sample sizes ranged from 114 to 459 (>400 for most cell types). These numbers are comparable to those reported in previous studies. Donors whose samples failed genotyping or scRNA-seq QC, as well as related donors (up to third-degree cousins), were removed. Genetic variants with a missingness >0.05, MAF < 0.05 or Hardy–Weinberg equation P < 0.000001 were excluded from the analysis. The experiments were not randomized. The investigators were not blinded to allocation during the experiments and outcome assessment. All statistical tests described in the article were two-sided. Box plots generated using the R ggplot2 function show the median and IQR; the whiskers are 1.5 times the IQR. Any data points outside the whiskers were considered outliers and were plotted as individual points.

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.