Main

Alternative splicing (AS) has the potential to expand the number of functional peptides encoded in messenger RNA. Large-scale transcriptomics studies have confirmed that nearly all protein-coding genes generate multiple—sometimes dozens—of distinct mRNA isoforms. This finding is often interpreted as supporting the role of AS in diversifying the proteome; yet, most alternatively spliced isoforms are lowly expressed and lack cross-species conservation1,2,3,4,5,6. To explain these observations, multiple studies have suggested that the vast majority of isoforms are nonfunctional transcripts resulting from mis-splicing rather than regulated AS4,5,6,7,8,9.

Mis-splicing from aberrant activation of unconserved ‘cryptic’ splice sites often introduces frameshifts in the mRNA coding sequence, resulting in premature termination codons (PTCs). Consequently, downstream exon junction complexes, which would normally be displaced by translating ribosomes, recruit nonsense-mediated decay (NMD) machinery to the mRNA for degradation. Thus, most transcripts with one or more aberrant splicing events are considered to be ‘unproductive’, as they are expected to undergo rapid NMD.

Unproductive transcripts can also result from regulated AS10,11,12,13,14,15,16,17,18,19,20,21. For example, some splicing factors control AS of their own pre-mRNA, relying on the coupling between AS and NMD (AS–NMD) to autoregulate their expression levels12,13,14,17,21. However, regulated AS–NMD has only been documented in a handful of genes. While most genes have annotated unproductive isoform structures22, the extent to which these isoforms influence gene expression levels, impact phenotypes and/or are important for organismal fitness is unknown. Assessing these questions is complicated partly because the rapid decay of unproductive isoforms obscures quantitative measurements of their splicing7,23,24.

An early study of unproductive splicing estimated that up to a third of transcript isoforms inferred from expressed sequence tags supported unproductive rather than productive splicing and hypothesized that AS–NMD may be a widespread regulatory mechanism7. Since then, multiple studies have attempted to test this hypothesis using improved methods, for example, by measuring gene expression levels before and after knocking down core NMD factors such as UPF1 or UPF2. These studies revealed a modest impact on gene expression levels, noting that only a small fraction (<10%) of genes show appreciable change in gene expression levels upon knockdown25,26, providing evidence against a widespread role of AS–NMD27. By contrast, recent studies23,24,28,29,30,31,32 support partial redundancy between core NMD factors, which can obscure knockdowns of single NMD factors and underestimate the impact of AS–NMD on mRNA expression levels. Thus, the impact of AS on gene expression levels remains unclear10,15,33,34,35,36.

Results

High-throughput measurements of AS before mRNA decay

To assess the impact of AS on steady-state gene expression levels, we must jointly consider multiple stages of gene regulation that reflect mRNA before and after the influence of cytoplasmic decay processes. To do this, we leveraged a large collection of molecular assays in lymphoblastoid cell lines (LCLs) derived primarily from 40–86 Yoruba individuals (Fig. 1a and Supplementary Fig. 1). These datasets37,38,39,40 have been used to study the impact of genetic variants on molecular phenotypes and consists of measurements tracking major steps of mRNA biogenesis including chromatin activity at enhancers (H3K4me1 and H3K27ac combining chromatin immunoprecipitation sequencing (ChIP-seq)) and promoters (H3K27ac and H3K4me3 ChIP-seq), newly transcribed polyA RNAs (4sU pulse-labeled for 30 or 60 min) and steady-state mRNA levels (RNA sequencing, RNA-seq). However, these data fail to capture spliced mRNA before potential cytoplasmic degradation, preventing us from capturing rapidly degraded mRNA transcripts.

Fig. 1: Genomic data captured before, during, and after transcription reveal an abundance of NMD isoforms.
figure 1

a, Subset of the population-scale datasets we analyzed, covering stages of mRNA biogenesis, from activation of enhancers/promoters (that is, H3K27ac and H3K4me3 ChIP-seq) to steady-state RNA (polyA RNA-seq). The gene expression correlation matrix (using promoter peak coverage for H3K27ac and H3K4me3, gene body coverage for H3K36me3 and exonic read coverage for RNA-seq) relative to steady-state RNA samples is shown as a heat map. b, Left: fraction of splice junction reads in each RNA-seq sample (columns, grouped by dataset) that are in Gencode-annotated productive transcript structures (blue) versus unannotated or annotated unproductive transcript structures (gray). The dashed lines indicate the median for each dataset. Right: for the 2.4% of splice junctions in naRNA-seq data that are not in annotated productive transcript structures, we checked for their unique presence in annotated unproductive transcript structures (for example, transcripts tagged by Gencode as ‘retained_intron’), or if unannotated, we attempted to translate sequences surrounding the splice sites (Supplementary Methods). The stacked bars indicate fraction of naRNA-seq splice junctions in each subcategory. c, Similar to b, comparing steady-state RNA from shRNA scramble control (n = 6) and shRNA dKD (n = 3) of SMG6 and SMG7 in HeLa cells24. d, Fraction of cassette exons that are symmetric (that is, length divisible by three) as function of their usage, estimated as percent spliced in PSI. The error bars represent standard error of values for LCL lines treated as replicates (circular markers, same dataset as b) and standard error across replicate shRNA knockdown experiments (triangular markers, same dataset as c). e, Cumulative distribution of log fold differences in steady-state gene expression versus gene transcription (measured by H3K4me3 promoter activity, naRNA, H3K36me3 or 30 m 4sU-labeled RNA), a proxy for degradation rate. Genes are grouped by quintiles based on percent of unproductive junction reads. The quintile of genes with the most unproductive splicing (darkest) show the strongest signature of mRNA degradation. The correlation between unproductive splicing and 30 min labeled 4sU RNA/steady-state RNA is weaker than in comparisons using the other degradation rate proxies, consistent with rapid decay of unproductive transcripts.

To measure AS of pre-mRNA splicing before mRNA decay, we used nascent RNA-seq (naRNA-seq). We obtained a total of 22.4 billion naRNA-seq reads across 86 LCLs. To further increase the temporal resolution of our dataset, we also collected CUT&Tag data from 95 LCLs to profile H3K36me3, a mark associated with active transcription elongation (Supplementary Fig. 2).

The correlation between gene expression measurements in steady-state RNA and gene expression measurement at previous stages of RNA processing reveals a clear temporal pattern, matching our expectation that our naRNA-seq data capture mRNA at an earlier stage of maturation than 4sU sequencing labeled for 30 min (Fig. 1a). We further verified that naRNA-seq captures RNA associated with chromatin and nascent pre-mRNA in several ways (Extended Data Fig. 1), showing that our naRNA dataset allows high resolution analysis of splicing outcomes before cytoplasmic degradation.

Unproductive mRNA splicing is pervasive

We used naRNA-seq data to estimate the prevalence of ‘unproductive splicing’, that is, splicing outcomes that are expected to induce NMD of the host transcript. In contrast, ‘productive splicing’ is expected to preserve the proper reading frame of the host transcript. Thus, exon–exon junction reads from unproductive transcripts are expected to be depleted in steady-state RNA-seq, while the junction reads identified in naRNA would reflect unbiased rates of unproductive splicing. We find 0.44% of all junction reads overlapping protein-coding genes from naRNA-seq map to splice junctions uniquely annotated in NMD-targeted transcripts (Gencode annotations22) compared with 0.15% in steady-state RNA (Extended Data Fig. 2a). Moreover, we find that 2.4% of junction reads in naRNA are not attributable to annotated junctions of stable protein-coding transcripts, compared with 1.1% in 30 min labeled 4sU RNA and 0.60% in steady-state RNA (Fig. 1b). As only a small fraction of these 2.4% of junction reads can be attributed to annotated NMD transcripts, we sought to understand whether the remaining represent unproductive isoforms. We find that about half of these remaining junctions are annotated in transcripts that are expected to be NMD substrates, albeit, not explicitly defined as so by Gencode (for example, transcripts associated with a retained intron) (Fig. 1b, Extended Data Fig. 2b,c and Methods). Most of these splice junctions are greatly depleted in steady-state RNA (Supplementary Fig. 3a), consistent with their rapid decay. To categorize the unannotated splice junctions, we developed a method to predict their effect on transcript coding potential (Supplementary Methods). We find that these unannotated junctions are overwhelmingly expected to either frameshift or introduce a PTC in the coding sequence, thus, resulting in a transcript targeted by NMD. To confirm the quality of our productive/unproductive categorizations of splice junctions, we assessed the change in abundance of splice junctions in short hairpin RNA (shRNA)-induced knockdowns of core NMD components, by reprocessing RNA-seq data from single knockdowns of UPF1, SMG6 and SMG7 and a double knockdown (dKD) of SMG6 and SMG7 (ref. 24). As expected, annotated and unannotated splice junctions that we classify as productive are relatively unchanged upon NMD knockdown, while unproductive junctions increase in abundance (Supplementary Fig. 3b). Single knockdown of core NMD factors displayed much smaller abundances of unproductive splice junctions compared with double SMG6 and SMG7 knockdowns (Extended Data Fig. 3), suggesting functional redundancy between these NMD factors. Overall, we find a similar enrichment of unproductive splice junctions in the dKD experiments (Fig. 1c) as in our naRNA (Fig. 1b), suggesting NMD is the primary mechanism explaining the abundance of unannotated junctions in naRNA. Thus, we estimate that ~2.3% of splicing events target transcripts for NMD, as measured in naRNA, compared with ~0.55% in steady-state RNA (Extended Data Fig. 2c).

We further describe the abundance of unproductive splicing revealed by naRNA in a qualitative manner by analyzing exons whose length is divisible by three (‘symmetric exons’) and, thus, are frame preserving whether the exon is skipped or included. Previous studies using steady-state RNA-seq data found that highly included and conserved alternatively spliced exons (‘cassette exons’) are biased toward being symmetric1,2,3,4,41, suggesting that there is selective pressure to maintain coding frame for highly included cassette exons (Extended Data Fig. 4a). Indeed, ~55% of highly included cassette exons are symmetric compared with only ~35% of rarely used cassette exons (percent spliced in (PSI) <1%). However, we find that in naRNA the fraction of symmetric exons is low, under 40%, even for the most highly included cassette exons (Fig. 1d and Extended Data Fig. 4b). We find similar results in steady-state RNA after knockdown of NMD factors (Fig. 1d and Extended Data Fig. 4b). These observations show that in the absence of NMD, there is no bias for alternatively spliced exons to be symmetric. These findings question the widely held belief that the main function of AS is protein diversification.

Having observed such abundant unproductive splicing, we wondered to what extent AS–NMD influences gene expression levels genome wide. We find that the genewise fraction of unproductive splicing correlates with multiple measures of degradation rate (Fig. 1e), and we conservatively estimate that AS–NMD explains at least 9% of post-transcriptional gene expression variance across genes (Supplementary Note 1).

Despite the abundant effects of AS–NMD on gene expression levels, it is not clear to what extent AS–NMD is functionally regulated and optimized through selection versus a result of splicing mistakes, which nonetheless impact gene expression. To better understand this, we explored which classes of genes and introns are most prone to AS–NMD. We found that highly expressed (Extended Data Fig. 5a,b) and evolutionarily constrained genes (Extended Data Fig. 5c) have among the lowest rates of unproductive splicing. Because evolutionarily constrained genes are expected to be under strong selective pressure to be tightly and accurately regulated42, the fact that we see the least unproductive splicing in them suggests that unproductive splicing generally represents molecular noise, rather than a form of regulation. However, we find exceptions—for example, the highly conserved SR-family proteins have among the highest rates of unproductive splicing even when controlling for expression (Extended Data Fig. 5b and Supplementary Note 2), consistent with previous reports of conserved AS–NMD-based autoregulatory loops in these genes12,13,14,17,19,21. Still, splicing regulators represent only a small fraction of genes with the highest rates of unproductive splicing (Supplementary Note 2). Further, we observe that introns with weak splice sites (Supplementary Fig. 4) and long introns (Fig. 2a) have higher unproductive splicing rates, consistent with splicing errors caused by appreciable competition between bona fide functional splice sites and cryptic splice sites. We also reasoned that if unproductive splicing is regulated and optimized, most unproductive splicing would be derived from a single alternative event that could be efficiently regulated by cis-regulatory elements, as is the case for SR protein autoregulatory loops. Yet, for most genes, we were unable to attribute a majority of unproductive splice events to a single splice junction (Fig. 2b and Supplementary Fig. 5).

Fig. 2: Unproductive mis-splicing accumulates across transcripts.
figure 2

a, Correlation between percent of unproductive junction reads in LeafCutter clusters versus length of the most used productive intron in the cluster (Spearman’s rho coefficient of 0.19; two-sided correlation test P value = 1.3 × 10−243). Correlation presented as cumulative distribution of percent unproductive splice junctions, for increasing groups of intron length. b, Percent unproductive splice junctions for each gene that are attributed to the most common (rank 1) unproductive junction, the second most common unproductive junction and so on. c, Model of unproductive splicing compounding across multi-intronic transcripts. A low unproductive rate at many independently spliced introns produces a high rate of unproductive molecules at the transcript level. d, Nanopore long-read sequencing quantifies the percent of full-length reads that are targeted by NMD, defined as containing at least one unproductive junction, as a function of the number of splice junctions in the read. The vertical dashed line marks eight splice junctions, corresponding to a typical full-length human transcript. The total RNA were isolated from shRNA-mediated dKD of NMD factors SMG6 and SMG7 or shRNA scramble control23 in HeLa cells. The naRNA data were from K562 cells43. Multiple points of the same color indicate replicate experiments. The blue and orange shaded area represents the binomial expectation when assuming 1.5–2.5% and 0.2–0.7% of unproductive junction reads at each independent junction.

Based on these observations, we propose a model in which every protein-coding intron of a gene can generate unproductive junctions at a specific ‘error’ rate, which depends on intron length, splice site strength and other genomic features. This model predicts that unproductive junctions will compound along the entire length of pre-mRNA, making long genes—or genes with many introns—more likely to produce unproductive transcripts than those with fewer introns (Fig. 2c). Under the simplest binomial model, assuming that 2.3% of splicing events are unproductive, we predicted that ~17% of mRNA transcripts would be unproductive for a typical human gene with eight introns. To test this prediction, we analyzed long-read RNA-seq (LRS) data from datasets collected using Oxford Nanopore technology23,43. We investigated the number of transcripts containing one or more unproductive junctions, binned LRS reads by their number of junctions and calculated the fraction of reads with an unproductive junction for each bin to estimate the abundance of unproductive transcripts. We first analyzed LRS naRNA from chromatin cell fractions from K562 cells43. We found that ~15% of naRNA reads spanning a typical human gene contained one or more unproductive junctions (Fig. 2d). Importantly, we also investigated LRS of steady-state mRNA following shRNA knockdown of NMD machinery23 and found that ~15% of reads spanning a typical human gene had one or more unproductive junctions (Fig. 2d). We further corroborated these findings by analyzing the small subset of LRS reads that probably represent full-length mRNA transcripts, which we directly evaluated for adherence to known molecular rules governing NMD44 (Extended Data Fig. 6). Taken together, we estimate that for a typical multi-intronic gene, ~15% of all transcript molecules generated are NMD targets, though rates of unproductive splicing vary greatly across genes. For example, at lowly expressed genes, we estimate the fraction of NMD targets for transcripts with ten exons is 25% of mRNA molecules, and for transcripts with 15 or more exons, this number increases to over 50% (Supplementary Fig. 6).

Artificially induced cryptic splicing widely triggers NMD

Our findings collectively suggest that a large fraction of AS events result in unproductive transcripts that are targeted by NMD for rapid degradation. To validate this finding experimentally, we measured the effects of a splice-switching drug on gene expression levels in LCLs (Fig. 3a). Risdiplam was originally identified as a small molecule splicing modulator that upregulates SMN2 by activating a productive GA|GU 5′ splice site (5′ss) at the expense of an unproductive isoform45,46. Recent studies47,48 find that at higher doses, risdiplam’s effects are less selective, extending to GA|GU 5′ss in many other genes. We predicted increases in these cryptic GA|GU splice sites would generally lead to an abundance of unproductive, NMD-targeted transcripts and, consequently, decreased expression of the affected genes. To test this, we treated LCLs with risdiplam and sequenced steady-state RNA at eight increasing doses of risdiplam and naRNA at two doses (Fig. 3a). We observed widespread, genome-wide induction of GA|GU 5′ ss in a dose-dependent manner but no effect on canonical AG|GU 5′ss or other noncanonical splice sites (Fig. 3b and Supplementary Fig. 7). For example, MYB, a transcription factor protooncogene targeted by a number of antitumor therapies, contains a risdiplam-induced exon (Fig. 3c). This exon is uniquely annotated in an unproductive transcript isoform, and MYB is downregulated in a dose-dependent manner.

Fig. 3: Risdiplam-induced splicing alterations mediate expression changes at hundreds of genes.
figure 3

a, Overview of risdiplam-based approach to assess pervasiveness of NMD after splicing perturbations. LCLs were treated with eight doses of risdiplam. Splicing changes at cryptic exons assessed for NMD potential, and gene expression changes were estimated. b, Genome-wide mean splicing dose-dependent changes at various classes of 5′ss in naRNA and steady-state RNA. Bootstrapped 95% confidence intervals shaded around the mean activation level across n introns in each group. c, Left: dose-dependent splicing response at a risdiplam-induced exon in MYB. Right: dose-dependent expression response of MYB. d, Predicted translation result of 316 risdiplam-induced exons. Unproductive exons (expected to induce NMD) versus those that maintain transcript stability are red and blue, respectively. Annotated and unannotated exons are dark and light colors, respectively. e, Empirically measured effect of host gene expression (log2 fold-change) as measured in steady-state RNA and naRNA in the presence of risdiplam at 3.16 µM. Each point is a gene hosting a risdiplam-induced exon colored by the same exon classifications used in d. P values for two-sided Mann–Whitney U test, comparing productive (n = 85 genes hosting annotated or unannotated risdiplam-induced exons) and NMD sets (n = 219 genes) are shown. The box represents median and innerquartiles. The whiskers extend from hinge to most extreme value no greater than 1.5× IQR from hinge. f, Conventional small molecule drugs usually operate at the level of protein binding and are disproportionately skewed for particular classes of ‘druggable’ genes (for example, G-protein-coupled receptors, GPCRs) and against other classes (for example, transcription factors). Risdiplam-induced NMD targets are more representative of all genes. Disease genes with therapeutic potential by downregulation (Online Mendelian Inheritance in Man dominant negative genes) are similarly distributed across categories of previously ‘druggable’ genes. g, Cumulative distribution of gene length for genes with predicted NMD-induced exons, a similarly sized set of expression-matched genes, all genes or a set of gene targets for Food and Drug Administration (FDA)-approved small molecules.

In total, we identified 316 risdiplam-induced cassette exons, 281 (92%) of which are unannotated. We found that 257/281 (91%) of the unannotated risdiplam-induced exons are within the coding region of the transcript (not in UTRs). Of these, 214/257 (61%) result in a frameshift, matching the expectation that two-thirds of cryptic splicing changes should produce a frameshift. An additional 56/281 (20%) of the unannotated exons contain an in-frame PTC and are also expected to induce NMD (Fig. 3d). The effects of these risdiplam-induced exons on expression matched our expectation that inclusion of predicted frame-shifting- and in-frame-PTC-containing exons generally results in downregulation of steady-state steady RNA but not naRNA (Fig. 3e and Supplementary Fig. 8). This set of 230 unproductive risdiplam-induced exons in 219 genes included some with robust expression effects in potentially disease-modifying genes (Supplementary Fig. 9). Notably, these 219 risdiplam-regulated genes were not limited to gene families traditionally thought of as ‘druggable’ by protein-targeting small molecule drugs (for example, kinases) (Fig. 3f and Supplementary Fig. 10a). Consistent with our observation that unproductive junctions accumulate across the length of the transcript and that long introns and long genes are most susceptible to AS–NMD, we found that introns and genes hosting these risdiplam-induced exons tend to be longer than matched controls (Fig. 3g and Supplementary Fig. 10). Finally, we identified an asymmetry in risdiplam-induced post-transcriptional effects (Extended Data Fig. 7) favoring downregulation. Thus, random splicing changes, such as those introduced by similar nonspecific splice-switching drugs or mutations, usually result in NMD-targeted transcripts and subsequent downregulation, as opposed to frame preserving or frame correcting (for example, SMN2) transcript changes.

Global effects of sQTLs on gene expression levels

The prevalence of unproductive splicing, along with the observation that unproductive splicing anti-correlates with gene expression levels genomewide, predicts that genetic effects on RNA splicing would often impact RNA expression levels. To test this prediction, we used quantitative trait loci (QTL) mapping (Fig. 4a) to identify genetic variants associated with expression (eQTLs) and splicing (splice junction abundance, sQTLs) in naRNA-seq, 4sU sequencing and steady-state RNA-seq data. To better distinguish splicing-mediated expression effects from transcriptional effects, we mapped histone QTLs (hQTLs), reflective of variants impacting promoter and enhancer activity (H3K27ac, H3K4me1 and H3K4me3) and transcription across gene bodies (H3K36me3). In total, we identified 57,981 QTLs for 620,020 tested molecular traits (Supplementary Fig. 11). Consistent with previous work, we find a large fraction of eQTLs are explained by transcriptional regulation, as indicated by the high degree of sharing between eQTL and hQTL signals. For example, we estimate that 67% of eQTL-containing genes (eGenes, steady-state RNA) contain hQTL effects at the gene’s promoter (Storey’s π1 = 0.67; Supplementary Fig. 12a), consistent with transcriptional regulation. As expected, we observe a strong concordance in the direction of hQTL and eQTL effects, wherein hQTL alleles that increase H3K27ac signal at the promoter overwhelmingly have corresponding upregulating signals at the level of mRNA (Fig. 4b). The remaining steady-state RNA eQTLs that do not have hQTL signal probably function during or shortly after transcription, as the effects of an additional 24% of steady-state eQTLs are also detected in 30 min 4sU-labeled RNA (π1 = 0.91; Supplementary Fig. 12a).

Fig. 4: Genetic variants that alter expression post-transcriptionally are enriched in splice-altering variants that are predicted to induce NMD.
figure 4

a, Approach for identifying mechanisms of gene regulation, using data from CCHCR1 locus as illustrative example. Genetic variants that associate with chromatin peak height, RNA expression, splicing and so on (molQTL) are identified. The boxplots grouped by genotype at lead SNP depict innerquartiles of phenotype values with whiskers extending to most extreme value no greater than 1.5× IQR from hinge. Multitrait colocalization (Methods) compares alignment of molQTL signals (scatter plots54) to identify molQTLs that probably share causal SNP. b, Effect size (β) of H3K27ac hQTLs at promoter versus eQTL effect sizes estimated in each RNA-seq dataset. Correlation summarized with Spearman’s rho coefficient and two-sided significance test. c, Tally of steady-state eGenes by their colocalizing molQTLs. A total of 831 eGenes colocalize (coloc) with an hQTL (*, H3K27ac, H3K4me1, H3K4me3 or H3K36me3), and 518 ‘post-transcriptional’ (post-txn) eGenes (**) do not colocalize with any hQTL but do colocalize with other molQTL, some of which may be sQTL or apaQTL. d, Example eQTL functioning through AS–NMD from poison exon (red). H3K27ac ChIP-seq and RNA-seq coverage grouped by genotype of lead eQTL SNP. The pink inset region (chr22:46,289,343-46,294,094) shows effect on splicing of poison exon, depicted as sashimi plots with relative usage (intronic PSI) of splice junction arcs. e, QQ plot shows inflation of eQTL signal among groups of SNPs (lead SNPs for p-sQTLs and u-sQTLs within the host gene, H3K27ac QTLs within 100 kb of test gene or random test SNPs). f, Similar to b. Top: effect size of p-sQTLs (only significantly affecting productive splice junctions) versus host gene eQTL β. Bottom: effect size of u-sQTLs (significantly affecting at least one unproductive splice junction) versus effect on host gene expression. g, eQTLs consistent with transcriptional regulation (eQTL/hQTLs, purple) in discovery dataset (LCLs) compared with eQTLs consistent with AS–NMD (eQTL/u-sQTLs, red). Left: each eQTL (SNP:gene pair, rows) assessed for eQTL effects in GTEx tissues (columns). Right: cumulative distribution of number tissues with significant effects. The P value for a two-sided Mann–Whitney test is shown.

To better interpret molecular mechanisms of eQTLs while accounting for linkage disequilibrium, we performed a multitrait colocalization analysis to identify molecular QTLs that probably share causal variants (Fig. 4a, Supplementary Fig. 13a and Methods). For example, at the CCHRC1 locus, we identify an eQTL, an hQTL and an sQTL (Fig. 4a). While both the sQTL and hQTL are nominally significant for an eQTL association, colocalization analysis reveals that only the hQTL and eQTL are probably caused by the same genetic variant (posterior probability of full colocalization, PPFC >0.5). This suggests that transcriptional regulation explains the eQTL effect, while the sQTL constitutes a distinct genetic effect that does not explain the primary eQTL signal. Across all 3,970 steady-state RNA eQTLs, we find 831 (~20%) colocalized with an hQTL (PPFC >0.5) (Fig. 4c and Supplementary Fig. 13a). We do not interpret this to suggest that only 20% of eQTLs are caused by transcriptional regulation but rather that 20% represents a lower limit of the degree of transcriptional regulation. Nevertheless, we reasoned that the abundance of eQTL colocalizations with hQTLs relative to their colocalizations with other molecular QTLs may be informative. When considering eQTLs that colocalized with any molecular QTL (molQTL) we tested, 62% colocalized to an hQTL, consistent with our previous finding that approximately two-thirds of eQTLs are driven by chromatin effects. Among the remaining 38%, which we hereafter refer to as ‘post-transcriptional eQTLs’, nearly half colocalized with an sQTL (Fig. 4c and Supplementary Fig. 13b), suggesting that AS may be a major contributor to interindividual variation in gene expression levels. By comparison, alternative polyadenylation QTLs (apaQTLs) colocalized with only ~5% of the non-hQTL colocalizations, suggesting that alternative polyadenylation plays a comparatively minor role.

The degree of sQTL–eQTL colocalization suggests that a large number of genetic variants may impact gene expression levels through AS. However, we and others have noted that alternative transcription initiation or polyadenylation can alter splicing quantifications and vice versa40,49,50,51. Furthermore, preferential decay of specific mRNA isoforms may manifest as sQTL–eQTL colocalizations52, without being mediated by AS. To better assess whether splicing changes causally drive eQTL–sQTL colocalizations, we asked which genomic annotations are most enriched among post-transcriptional eQTLs, compared with transcriptional eQTLs as controls. As expected, post-transcriptional eQTLs are strongly depleted in enhancers and promoters. In terms of enriched genomic regions, we find significant enrichment of post-transcriptional eQTLs near polyadenylation sites and in splice donor, branch site and splice acceptor regions (Supplementary Fig. 13c). While polyadenylation sites and splice sites are similarly enriched among post-transcriptional eQTLs over transcriptional eQTLs, splice sites account for a far greater fraction of post-transcriptional eQTLs (Supplementary Fig. 13d). These observations indicate that changes in RNA splicing causally drive many of these post-transcriptional eQTLs.

As an illustrative example of a splicing-mediated eQTL, we highlight the TTC38 gene (Fig. 4d) for which the lead eQTL variant has no detectable effect on promoter activity. Rather, we observed a clear effect of the lead eQTL variant on splicing. The allele associated with decreased expression level is also associated with an increase in splicing of unproductive introns. These unproductive splice junctions are substantially more abundant in naRNA than steady-state RNA, consistent with their rapid degradation, though the sQTL effect size is similar in naRNA and steady-state RNA (Fig. 4d). More generally, when we stratify sQTLs by whether they affect an unproductive splice junction (henceforth ‘u-sQTLs’), versus merely switching between alternate productive protein-coding isoforms (henceforth ‘p-sQTLs’), we find that u-sQTLs are particularly enriched in eQTL signal for the host gene (Fig. 4e and Supplementary Fig. 14a), with even stronger eQTL signal than H3K27ac QTLs within 100 kb of the gene (Fig. 4e). Furthermore, though we identify similar numbers of p-sQTLs versus u-sQTLs, u-sQTLs explain 77% of sQTL colocalizations with post-transcriptional eQTLs, compared with 23% for p-sQTLs (P = 5.9 × 10−12, hypergeometric test; Supplementary Fig. 14b). These results again suggest that AS–NMD is a major contributor to the genetic regulation of expression levels.

To further validate that u-sQTLs affect gene expression levels through post-transcriptional decay, we considered the concordance of normalized effect sizes between u-sQTLs and eQTLs (Methods). As expected, p-sQTL variants are largely inert with respect to expression levels (Fig. 4f). In contrast, u-sQTL effects are strongly anticorrelated with expression, such that alleles increasing the unproductive junction are associated with decreased expression of the host gene (Fig. 4f, steady-state facet). These effects on expression are most apparent in steady-state RNA, weaker in our 4sU RNA-seq data and largely absent in naRNA (Fig. 4f and Supplementary Fig. 15), which is expected, as NMD occurs post-transcriptionally and in the cytoplasm.

While we observed an overwhelming enrichment of eQTLs among u-sQTL SNPs (Fig. 4e), previous studies37,50,53 have found that the leading eQTL and sQTL signals for a given gene tend to be independent. To reconcile these observations, we provide two mutually compatible explanations: (1) an eGene’s top sQTL signal is often a p-QTL, not an expression-modifying u-sQTL (Supplementary Fig. 16a,b) and (2) many genes have multiple eQTLs53,54, and u-sQTLs may contribute to expression even when they are not the lead eQTL.

Given reports that lead eQTLs can sometimes change across tissues53,55 and our previous observation that sQTL effects are generally more stable across tissues than eQTLs56, we hypothesized that the effects of splicing-mediated eQTLs would be more consistent across tissues than the effects of transcription-mediated eQTLs. Indeed, we find that eQTLs mediated by splicing mechanisms are eQTLs in a larger number of GTEx tissues than eQTLs mediated by transcription regulation (Fig. 4g and Extended Data Fig. 8), indicating that the regulatory impact of variants that function through AS–NMD are more tissue-pervasive than those that function through transcription. Such variants could be relevant for interpreting genome-wide association study (GWAS) signals that have yet to be explained by eQTLs in known cell-types.

AS–NMD impacts GWAS traits

Given the pervasive effects of AS–NMD on gene expression, we wondered to what extent AS that affects complex phenotypes are mediated by NMD versus protein-diversifying stable isoforms. To answer this question, we compiled summary statistics from 45 GWAS for blood and immune-related traits and evaluated the enrichment of GWAS signal among various classes of molecular QTLs: eQTLs, H3K27ac QTLs, p-sQTLs and u-sQTLs. For example, using multiple sclerosis GWAS summary statistics, we find enrichment of u-sQTLs on par with that of p-sQTLs, hQTLs and eQTLs (Fig. 5a). Similar results are found in most complex traits we examined (Supplementary Fig. 17).

Fig. 5: Splicing-mediated NMD contributes to complex trait biology.
figure 5

a, QQ plot of multiple sclerosis GWAS signal, grouped by categories of SNPs. p-sQTLs that impact the balance of protein-coding isoforms and u-sQTLs that impact usage of unproductive splice junctions are similarly inflated for GWAS signals. b, Fraction of GWAS loci that colocalize with various sets of molQTLs in each of 45 blood or immune-related traits. Number of loci for which colocalization was attempted is indicated at the top of each column. ‘Other combinations’ includes loci that colocalize with alternative polyadenylation QTLs, hQTLs and sQTLs or other combinations that may include sQTLs and other molQTLs and are difficult to interpret mechanistically. c, Histogram of usage of unique sQTL junctions that colocalize with a GWAS signal, grouped by sQTL type. Intronic PSI (junction read count divided by most abundant junction in LeafCutter cluster) for each junction was summarized as the median from steady-state RNA samples which are homozygous for the PSI-increasing allele. Many GWAS sQTLs, especially u-sQTLs, have low PSI, even in samples with genotypes that favor higher usage. d, Effect size (β) of sQTLs and eQTLs for distinct u-sQTLs that colocalize with a GWAS signal. Correlation was summarized with Spearman’s rho coefficient and two-sided significance test.

To better resolve the mechanisms at these GWAS loci, we used multitrait colocalization to identify GWAS signals that colocalize with hQTLs, eQTLs, sQTLs or various combinations of QTLs. Across all complex traits, approximately 70% of GWAS loci could not be colocalized with any molecular QTL (Fig. 5b), in line with previous studies with similar sample sizes57,58. Approximately 18% of GWAS loci colocalize with either an hQTL, eQTL or some combination of molecular QTLs (Fig. 5b). The remaining 12% of loci colocalize with an sQTL but not an hQTL, consistent with splicing-mediated impacts on traits. We next sought to assess whether these GWAS/sQTL loci possess characteristics consistent with an AS–NMD mechanism versus protein diversification.

Notably, we find that these sQTLs largely affect low-usage splice junctions, such that 57% of these splice junctions are spliced in at PSI <5% (Fig. 5c). This observation is consistent with that of a recent study57 and naturally poses questions as to how such low-usage isoforms might impact traits. We found that most of these low-usage sQTLs are u-sQTLs that alter the balance of unproductive and productive isoforms, consistent with AS–NMD mediating these loci. We also found that the allelic effects of u-sQTLs that colocalize with GWAS loci, similar to that of u-sQTLs in general (Fig. 4f), were anticorrelated with their effects on gene expression levels, again consistent with NMD (Fig. 5d). These u-sQTLs are also enriched among the GWAS/sQTL loci that also colocalize with eQTL signal (Extended Data Fig. 9a). For example, a reticulocyte-count-associated GWAS signal colocalizes with a u-sQTL in NUDT14 gene, as well as NUDT14 eQTL signal (Extended Data Fig. 10a,b). As expected, the allele that increases usage of the unproductive splice junction is associated with NUDT14 downregulation (Fig. 5c). While the effect on splicing is similar in both steady-state RNA-seq and naRNA-seq (Extended Data Fig. 10c), the effect on NUDT14 expression is only apparent in steady-state RNA (Extended Data Fig. 10d), again consistent with AS–NMD.

More generally, u-sQTLs that colocalize with GWAS and eQTL signals tend to also display lower usage in steady-state RNA than naRNA (Extended Data Fig. 9b), with sQTL and eQTL effects that are also consistent with AS–NMD (Extended Data Fig. 9c). In contrast, the sQTLs that colocalize with GWAS but not eQTL do not share these characteristics and probably function by tuning the expression levels of alternative protein-coding isoforms. Given that there are a similar number of these GWAS loci that colocalize with both sQTL and eQTL (and not hQTL), as compared with just sQTLs (Fig. 5b), we conclude that AS–NMD carries similar importance as splicing-mediated protein diversification for complex organism-level traits.

Discussion

The molecular impact of AS has been challenging to verify experimentally, as it has been difficult to study the function of individual isoforms at the protein level, and the rapid decay of unproductive isoforms obscure their quantification at the mRNA level. Through detailed analysis of molecular measurements that capture the major steps of RNA maturation, we found that aberrant splicing produces remarkably high levels of unproductive transcripts bearing a PTC. Unproductive mRNAs account for around 15% of all mRNA transcripts from the average human gene, even exceeding 50% for many long genes expressed at low levels. These estimates may be surprising, given earlier studies that utilized single knockdown of UPF1 or UPF2 to identify a relatively small subset of AS–NMD-regulated genes25,26,27,59, suggesting that unproductive isoforms are produced at such low rates that expression levels of most genes are unaffected by AS–NMD27. However, we find that early estimates are consistent with incomplete inhibition of NMD due to partial redundancy between core NMD factors. Indeed, the levels of unproductive splicing in steady-state mRNA stabilized from double SMG6 and SMG7 knockdowns—but not that of single UPF1 knockdowns—were nearly identical with that estimated in our nascent mRNA dataset. Thus, previous studies using knockdowns of single NMD factors appear to have substantially underestimated the impact of AS–NMD.

Notably, we also show that an important previous observation—that highly used AS exons are enriched for being frame-preserving—is largely due to NMD surveillance rather than selection at the level of splicing regulation. Thus, our study suggests that the molecular impact of AS is largely shouldered by NMD, which regulates protein output by targeting unproductive transcripts for degradation. Supporting this view, we identified nearly as many genetic variants that impact production of these unproductive transcripts as compared with those that tune the balance of stable mRNA isoforms. Importantly, this observation also holds for sQTLs that colocalize with GWAS signals to influence organism-level traits. These unproductive sQTLs can have variable effects on expression, wherein small changes in splicing are sometimes associated with large changes in expression or vice versa. While the biological basis for this variability is unknown, we speculate that multiple mechanisms, including transcriptional adaptation60 and differential efficiency of NMD44, may be at play. Nonetheless, the general tissue-pervasive nature of these AS–NMD-mediated eQTLs compared with transcription-mediated eQTLs may have particular relevance both in enhancing the phenotypic impact of AS–NMD-based regulation and in mapping the regulatory mechanisms underlying genetic associations for complex traits.

What fraction of mRNA isoforms encode functionally diverse peptides has long been under debate7,27,35,36,61. Novel isoforms continue to be discovered as RNA-seq experiments increase in depth, making this question particularly timely. Our observation that introns are often mis-spliced into substrates of NMD indicates that splicing is inherently noisy and that novel isoforms uncovered by RNA-seq generally do not encode functional proteins. This view is consistent with previous findings that most AS events are lowly used and do not show cross-species conservation.

Still, we posit that future research will reveal a preponderance of cases where regulated AS functions by tuning protein expression levels rather than by creating protein diversity, as the sheer abundance of AS–NMD events presents opportunities for evolution to co-opt AS–NMD as a functional regulatory mechanism. Regulated AS–NMD has been identified in the past, but it has largely been found in genes encoding splicing regulators. While we confirmed that splicing regulators are enriched among genes with high levels of unproductive transcripts, splicing regulators represent only a small fraction of all genes with very high levels of unproductive transcripts. We predict that future work using long-read sequencing of RNA across multiple species, stages of maturation and biological systems will provide us with a much more complete understanding of the mechanisms by which AS functionally impacts cellular function, organismal phenotypes and evolution.

Methods

We confirm that this research complies with all relevant ethical regulations. Ethical approval was not required because the Yoruba in Ibadan, Nigeria (population code YRI) panel of cell lines used here (derived from consenting individuals with publicly available genotypes62), from which molecular sequencing data was generated, are not considered human subjects research by the institutional review board at the University of Chicago.

Statistics and reproducibility

No statistical method was used to predetermine sample size for novel data (naRNA-seq, H3K36ME3 CUT&Tag, RNA-seq and naRNA-seq at various risdiplam doses). We note that our sample sizes are similar to previous QTL studies using this panel of cell lines37,38,40 and risdiplam transcriptomics studies45,48. For eQTL and sQTL mapping using naRNA, we excluded line NA18855. This was not a predetermined decision. Rather, principal component analysis revealed this sample as an outlier, and differential expression analysis against other lines identified differentially expressed genes clustered along large sections of chromosomes, suggestive of chromosomal abnormalities. The experiments were not randomized. The investigators were not blinded to allocation during experiments and outcome assessment, as cell lines (for QTL studies) and treatment were required knowledge for carrying out the experiments. When data could not be assumed to follow assumptions for parametric tests, nonparametric tests were used. Tests for significance of spearman correlation were obtained using ‘cor.test(…, method = ’s’)’ in R.

Molecular trait quantification

We aligned the Illumina short-read RNA-seq datasets from LCL and HeLa cells to the human genome version GRCh38 with Gencode v34 annotation using STAR63 version 2.7.7a. For the LCL lines, we used STAR’s WASP mode64 to account for genetic variants. We mapped ChIP-seq and CUT&Tag reads to the human genome using HISAT65 version 2.2.1, and we used Hornet (https://github.com/TheFraserLab/Hornet, a reimplementation of WASP pipeline64) to find reads that overlap with SNPs for remapping. For the ChIP-seq datasets, we used MACS2 (ref. 66) version 2.2.7.1 to call peaks.

We quantified gene expression and histone modification coverage of H3K27ac, H3K4me1 and H3K4me3 using featureCounts67 version 2.0.3. We used ‘bedtools multicov’ to count the number of H3K36me3 CUT&Tag reads overlapping each protein-coding gene. For each phenotype, we normalized the raw counts into counts per million (CPM) and selected the top 14,000 autosomal genes with the highest median CPM in standard RNA-seq data for downstream analysis.

To quantify splicing efficiency of introns, we used SPLICE-q68 version 1.0.0 to calculate the reverse intron expression ratio in protein-coding genes. To quantify splicing in the RNA-seq datasets, we extracted junction reads by running regtools version 0.5.2 (ref. 69) on the filtered Binary Alignment Map (BAM) files. We obtained intron clusters using Leafcutter56

We developed a method to predict the effect of splice junctions on transcript coding potential. Our method attempts to reconcile junctions identified from short-read RNA-seq with introns of annotated transcripts and predicts whether the junction is compatible with the open reading frame of the annotated protein-coding transcript. See the ‘Classification of unannotated splice junctions’ section in Supplementary Methods for a detailed description.

See Supplementary Methods for a detailed description of our mapping and quantification pipeline and for a description of our analysis of Oxford Nanopore Technologies long-read RNA-seq data.

Molecular cis-QTL mapping

To prepare the LCL RNA-seq data for eQTL mapping, we Z-score normalized the top 14,000 expressed protein-coding genes samples. Then, we enforced normality of the data by applying the rank-based inverse normal transform to each sample across all genes. We utilized the same procedure to normalize the gene coverage quantification for H3K36me3. For hQTLs from ChIP-seq data, we normalized the read counts quantified by featureCounts at the peaks called by MACS2.

For sQTL mapping, we used the prepare_phenotype_table.py script from Leafcutter. This script calculates the PSI of each junction, it applies Z-score normalization to each junction across all samples, and enforces normality by rank-normalizing each sample across all junctions.

To build a covariate matrix for QTL mapping for each dataset, we applied principal component analysis on the standard and rank-normalized feature-by-sample matrix. We selected the number of principal components that explain more variance than in a version of the data matrix obtained by randomly permuting each feature across the samples.

For molecular cis-QTL calling, we used as input for each dataset the standard and rank-normalized data matrices, the principal components (PCs) obtained as previously described and the corresponding VCF files from the 1000 Genomes Project.

We ran QTLTools70 version 1.3.1 for molecular cis-QTL mapping, using both the permutation and nominal pass versions. For the permutation pass, we used 1,000 permutations using the permute 1000 flag. For the nominal pass, we used the nominal 1 flag to obtain the QTL statistics of all SNPs irrespective of their P value. For eQTLs and hQTLs, we used a cis-window of 100,000 bp. For sQTLs and splicing efficiency QTLs, we used a cis-window of 10,000 bp. For the permutation pass of sQTLs, we used the grp-best flag to get only the statistics of the best hit per intron cluster. For the permutation pass, we applied the Benjamini–Hochberg correction to the adjusted β distribution P values to account for the false discovery rate.

For analyses that require unstandardized effect size estimates, we considered the top QTL SNP from the QTLtools pass using standardized and rank-normalized data matrices and reran QTLtools using the same PC covariates but providing unstandardized expression matrix (that is, log2RPKM for expression or gene level PSI for splicing), such that the β estimate is in more directly interpretable units.

See ‘eQTL calling on GTEx gene expression data’ section in Supplementary Methods for details on the QTLTools runs on GTEx data.

naRNA-seq experimental methods

Cell growth

LCLs were grown in Roswell Park Memorial Institute media (RPMI) + glutamine + penicillin–streptomycin + 20% fetal bovine serum. Cells were grown in four batches, with approximately 20–35 cell lines per batch. The day before collection, cell cultures were counted and normalized across cell lines to approximately 35 million live cells, supplemented with media to 50 ml, such that cells were in log-phase growth with approximately 50 million cells per culture at the time of collection.

Isolation of naRNA

We collected naRNA from LCLs by first isolating nuclei through a sucrose cushion followed by high-salt washes to dissociate nucleoplasm and weakly bound RNAs and proteins from chromatin with slight modifications from previous protocols for cellular fractionation71, as detailed below.

Cells were collected by centrifugation (300g for 3 min) in 50 ml conical tubes. Cells were washed twice in cold phosphate-buffered saline with 1 mM EDTA. We reasoned that the inclusion of EDTA (and exclusion of magnesium) in buffers used during cellular fractionation would inhibit splicing during sample processing. However, we have noted that chelation of magnesium yields fragile nuclei that are more prone to premature bursting, and care should be taken to pipet gently in subsequent steps. Washed cell pellets were resuspended in 400 µl BufferA (10 mM pH 7.5 HEPES, 10 mM KCl, 10% (v/v) glycerol, 11.6% (w/v) sucrose, 1 mM dithiothreitol (DTT), 1× ROCHE complete protease inhibitor) and transferred to 2 ml 96-well plates for convenience. An equal volume of BufferA supplemented with 0.2% TritonX was gently mixed to the resuspended cells, bringing the final TritonX concentration to 0.1% for a 12 min incubation on ice with periodic inversion. Nuclei were isolated by centrifugation (1200g, 5 min). Supernatant was discarded, and the nuclei pellet was washed with 500 µl BufferA, followed by centrifugation (1,200g, 5 min), removal of supernatant, and resuspension in 250 µl nuclear resuspension buffer (20 mM pH 7.5 HEPES, 50% (v/v) glycerol, 75 mM NaCl, 1 mM DTT, 0.5 mM EDTA, 1× protease inhibitor). An equal volume of NUN (high salt) buffer (50 mM pH 7.5 HEPES, 1 M urea, 300 mM NaCl, 1 mM DTT, 1× protease inhibitor) was added to nuclei pellets and gently mixed, following by 5 min incubation on ice and periodic inversion. Chromatin pellets were isolated by centrifugation (1,200g, 5 min). Nonchromatin-bound supernatant was removed. Chromatin pellets were washed with 500 µl BufferA supplemented with 0.2% NP-40, followed by centrifugation and removal of supernatant. The usually insoluble chromatin pellet was resuspended in 100 µl nuclear resuspension buffer, added to 1 ml Trizol, and stored in 1.5 ml centrifuge tubes at −20 °C for further processing.

Trizol samples were vigorously mixed with periodic heating at 50 °C until the pellet dissolved. After adding 200 µl chloroform, vigorously mixing and centrifugation (16,000g, 15 min), the aqueous phase was transferred to clean tubes or 96-well plates. An equal volume of ethanol was mixed, and samples were bound to Zymo spin I-96-XL plate (Catalog no. C2010) or individual Zymo Spin ii (Catalog no. C1008-50) columns by centrifugation (2,000g, 2 min). Columns were washed twice with 500 µl wash buffer (80% ethanol, 10 mM pH 8.0 Tris buffer). Samples were treated with DNAseI (2.5 µl RQ1 Promega DNAseI, 1× DNAse buffer in 25 µl total volume) while bound to columns and incubated at room temperature for 15 min. Columns were washed with RNA binding buffer (2 M guanidinium, 75% isopropanol), followed by two washes with 80 µl wash buffer and a dry spin. RNA was eluted by incubation with 25 µl water for 5 min followed by centrifugation. RNA yield was quantified by NanoDrop, typically yielding 5–20 µg naRNA per sample.

RNA-seq library preparation

rRNA was depleted using Lexogen ribocop v2 kit according to the manufacturer’s protocol. RNA was eluted in 8 µL water, all of which was used as input for NEB Ultra Directional II RNA-seq kits according to manufacturer’s protocol. Fragmentation time was adjusted from the recommended 15 min to 5 min to obtain larger insert sizes. Samples were pooled and sequenced on a single NovaSeq flow-cell (2 × 150 bp paired end reads) by UChicago sequencing core.

H3K36me3 CUT&Tag

Cells were grown as described above (naRNA-seq experimental methods) and 100,000 cells were frozen in 10% dimethylsulfoxide (DMSO). CUT&Tag was performed as previously described72 (detailed protocol at https://doi.org/10.17504/protocols.io.z6hf9b6) using polyclonal antibody (‘Reporting summary’ section), with the following modifications. Rather than 13 polymerase chain reaction (PCR) cycles, we used 14 cycles. We determined this on the basis of a test qPCR with ~10% of the pre-PCR library, estimating that 14 PCR cycles with the remaining 90% would yield fluorescent signal at about halfway to the plateau, ensuring we have enough DNA material to quantify and sequence.

Molecular QTL sharing

Sharing of molQTLs between datasets and colocalization

We used Storey’s π1 statistic73, an estimate of the fraction of non-null hypothesis from a distribution of P values, to get the fraction of molecular QTLs that are shared across multiple phenotypes (for example, ‘what fraction of eQTLs discovered in steady-state RNA are eQTLs in naRNA?’). See ‘π1 sharing of eQTLs between RNA-seq datasets’ and ‘π1 sharing of eQTLs and hQTLs at TSS’ sections in Supplementary Methods for details.

We simultaneously assessed colocalization of molQTLs around each gene using hyprcoloc74. Summary statistics for a 100 kb cis-window surrounding the gene were obtained for each molQTL from QTLtools nominal pass. Only molQTLs with a permutation pass P < 0.01 were considered for colocalization.

Effect size concordance of sQTLs and eQTLs

For each cluster with a significant sQTL intron, we classified the sQTL as a u-sQTL if it contains at least one sQTL intron (false discovery rate (FDR) <10%) in an unproductive intron. sQTLs with a nominally significant hQTL (P < 0.01, for any H3K4me3, H3K27ac or H3K36me3 trait) were filtered out. To avoid plotting nonindependent sQTLs, we selected only a single sQTL intron per cluster, retaining the intron with the largest absolute value of sQTL β. The top SNP of remaining sQTLs was used to look up the corresponding eQTL effect size in the host gene. Test for significance of correlation was performed with ‘cor.test(method=’spearman’)’ in R.

Effects of AS–NMD-mediated eQTLs and transcription-mediated eQTLs across tissues

We first identified a set of 359 eQTLs, which we were most confident are mediated by AS–NMD. Specifically, the steady-state eQTL must colocalize with an u-sQTL within the gene (‘Colocalization of molQTLs’ section in Supplementary Methods) and not with any hQTL. Furthermore, the lead SNP for the colocalizing u-sQTL junction must not be nominally significant (P > 0.01) for any hQTL test feature (considering all H3K34me3, H3K27ac peaks within 100 kb of the gene and H3K36me3 for the gene body). The top colocalized eQTL/u-sQTL SNP was then assessed for effect size across GTEx tissues (‘eQTL calling on GTEx expression data’ section in Supplementary Methods). We similarly identified a set 313 eQTLs which we were most confident are mediated by expression effects: the steady-state eQTL must colocalize with an hQTL at the eQTL gene’s promoter (see ‘π1 sharing of eQTLs and hQTLs at TSS’ section in Supplementary Methods for linking hQTLs to promoters; H3K4me3, H3K27ac and H3K4me1 peaks were considered). The top colocalized eQTL/hQTL SNP was then assessed for effect size across GTEx tissues (‘eQTL calling on GTEx expression data’ section in Supplementary Methods). To assess effect sizes across tissues, we used the (not quantile-normalized) log2CPM matrix for eQTL mapping with QTLtools, enabling interpretable comparisons of effect sizes (β) across tissues in units of log2 fold-change per alt allele.

Colocalization of molQTLs with GWAS loci

Summary statistics for GWAS75,76,77,78,79,80,81 were downloaded from the GWAS Catalog or other source datasets. One mega-base windows centered at lead GWAS SNPs were determined as previously described58, using a lead SNP threshold of 5 × 10−8 to consider locus for colocalization. Summary statistics for all molQTL features within each GWAS locus window were obtained using QTLtools nominal pass, after selecting only those with a QTLtools permutation pass P < 0.01. Hyprcoloc was used with default settings to colocalize molQTLs and GWAS signals. Loci were categorized as ‘Only hQTL’, ‘Only eQTL’ or ‘Only sQTL’, if the only molQTLs to colocalize with the GWAS signal were either hQTLs (H3K36me3, H3K27ac, H3K4me3 or H3K4me1), eQTLs (in naRNA, 4sU or steady-state RNA-seq) or sQTLs (in naRNA, 4sU or steady-state RNA-seq), respectively. If all of the molQTLs could be classified as eQTLs or hQTLs, or eQTLs and sQTLs, the loci was classified as ‘hQTL+eQTL’ or ‘sQTL+eQTL’, respectively. All other loci with a molQTL colocalization were classified as ‘other combinations’. We chose to classify these loci as ‘other combinations’ because sQTLs that also colocalize with both eQTL and hQTL or with apaQTL or other combinations are hard to interpret, and we wanted to refrain from suggesting these loci may be mediated by AS–NMD or alternative protein isoforms caused by AS.

Risdiplam dosage series experiment

Cell growth, library preparation and RNA-seq

LCL growth, naRNA isolation and conversion to sequencing libraries were performed as described above (naRNA-seq experimental methods). Risdiplam was added 24 h before cell collection. Total (‘steady-state’) RNA was converted to sequencing libraries using NEB polyA capture kit (product no.) followed by NEB Ultra directional II RNA-seq library kits. Steady-state RNA-seq libraries were sequenced by (2 × 150 bp paired end). naRNA-seq libraries were sequenced by UChicago sequencing core (2 × 150 bp paired end).

Identification and quantification of risdiplam-induced exons

All splice junctions containing GA|GU in reference genome sequence at the 5′ss were assessed for a significant positive dose:response correlation using leafcutter’s intron excision ratio as the response. Significance was assessed using R’s ‘cor.test(…, method=’spearman’)’ function, and P values were adjusted for multiple testing with Storey’s q-value method. All significant (q < 0.1) GA|GU introns with a leafcutter-clustered splice junction that has a 3′ splice site (3′ss) 500 bp upstream of the GA|GU splice donor were considered as risdiplam-induced cassette exons. Splicing at these cassette exons was requantified using the cassette exon PSI metric (see ‘Analysis of alternative splicing and symmetry of cassette exons’ in Supplementary Methods).

Prediction of transcripts effects

We utilized Gencode transcript structures and their predictions for coding potential (‘basic’-tagged transcripts being deemed as productive) to annotate cassette exons that use an annotated GA|GU downstream splice junction. Cassette exons with an unannotated downstream GA|GU splice junction were translated in-frame using the most-expressed ‘basic’-tagged transcript (transcript quantifications derived from Salmon82) as a reference, using custom scripts. For simplicity, we classified unannotated exons as unproductive if the exon-included translation is shorter than the exon–excluded translation. While splicing in UTRs may trigger NMD by 50 nt rule, all risdiplam-induced exons in UTRs are, by definition, already in an intron-containing UTR and, thus, we did not classify those as unproductive.

Quantification of expression and splicing

Gene expression was quantified using featureCounts as described above. naRNA-seq samples were analyzed using edgeR83, with two contrasts: (1) 100 nM risdiplam versus DMSO and (2) 3,160 nM risdiplam versus DMSO contrast. The unconventional experimental design of the titration series experiment (a single replicate at eight doses) precluded use of many standard differential splicing or differential gene expression analysis approaches. For those samples, we fit splicing quantifications (in units of cassette exon PSI) and gene expression quantifications (in units of log2CPM) to a four-parameter log-logistic curve using the drc package84 in R: ‘drc::drm(formula = cpm ~ dose, fact=LL.4(), …)’. Effect size estimates and standard errors at 100 nM and 3,160 nM were extracted from the model fits using ‘predict(…, se.fit=T)’. FDR was estimated with Storey’s q-value. Genes were classified as transcriptionally regulated if they had an absolute log2FC between naRNA and steady-state RNA less than 1.5, with FDR <0.1 in both naRNA and steady-state RNA. If genes had FDR <0.1 in steady-state RNA but with an effect size difference between naRNA and steady-state greater than log2(1.5), they were classified as post-transcriptionally downregulated. Similarly, genes had FDR <0.1 in steady-state RNA but with an effect size difference between naRNA and steady state less than log2(1.5), they were classified as post-transcriptionally upregulated.

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.