Aberrant splicing prediction across human tissues

Wagner, Nils; Çelik, Muhammed H.; Hölzlwimmer, Florian R.; Mertes, Christian; Prokisch, Holger; Yépez, Vicente A.; Gagneur, Julien

doi:10.1038/s41588-023-01373-3

Technical Report
Published: 04 May 2023

Nature Genetics volume 55, pages 861–870 (2023)

Abstract

Aberrant splicing is a major cause of genetic disorders, but its direct detection in transcriptomes is limited to clinically accessible tissues such as skin or body fluids. While DNA-based machine learning models can prioritize rare variants for affecting splicing, their performance in predicting tissue-specific aberrant splicing remains unassessed. Here we generated an aberrant splicing benchmark dataset, spanning over 8.8 million rare variants in 49 human tissues from the Genotype-Tissue Expression (GTEx) dataset. At 20% recall, state-of-the-art DNA-based models achieve at most 12% precision. By mapping and quantifying tissue-specific splice site usage transcriptome-wide and modeling isoform competition, we increased precision threefold at the same recall. Integrating RNA-sequencing data of clinically accessible tissues into our model, AbSplice, brought precision to 60%. These results, replicated in two independent cohorts, substantially contribute to noncoding loss-of-function variant identification and to genetic diagnostics design and analytics.
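The abstract reports precision at a fixed recall of 20%. As a minimal sketch of how such a number can be read off a ranked list of predictions, the snippet below computes the best precision among all score thresholds that reach a target recall. The function name `precision_at_recall` and the toy scores and labels are illustrative assumptions, not part of the AbSplice code base or the benchmark data.

```python
from typing import List


def precision_at_recall(scores: List[float], labels: List[int],
                        target_recall: float) -> float:
    """Best precision among score thresholds whose recall is >= target_recall."""
    n_pos = sum(labels)
    if n_pos == 0:
        raise ValueError("need at least one positive label")
    # Rank predictions from the highest to the lowest score.
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    best = 0.0
    true_positives = 0
    for i, (score, label) in enumerate(ranked):
        true_positives += label
        # Evaluate a threshold only at the end of a run of tied scores.
        if i + 1 < len(ranked) and ranked[i + 1][0] == score:
            continue
        recall = true_positives / n_pos
        precision = true_positives / (i + 1)
        if recall >= target_recall:
            best = max(best, precision)
    return best


# Hypothetical toy data: 5 aberrant cases among 20 gene-sample pairs.
toy_scores = [0.95, 0.90, 0.80, 0.70, 0.65, 0.60, 0.50, 0.45, 0.40, 0.35,
              0.30, 0.25, 0.20, 0.15, 0.12, 0.10, 0.08, 0.05, 0.03, 0.01]
toy_labels = [1, 0, 1, 0, 0, 1, 0, 0, 0, 1,
              0, 0, 0, 0, 0, 0, 1, 0, 0, 0]
p20 = precision_at_recall(toy_scores, toy_labels, target_recall=0.20)
print(f"precision at 20% recall: {p20:.2f}")
```

On real benchmark data the positives are extremely rare relative to the negatives, which is why precision at a fixed recall is a more informative summary than accuracy.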

**Fig. 1: Study design and main findings.**

**Fig. 2: Tissue-specific splice site map improves prediction performance.**

**Fig. 3: Quantitative splicing levels further improve prediction performance.**

**Fig. 4: Application of AbSplice-DNA on independent data.**

**Fig. 5: Integrating RNA-seq data of CATs to predict aberrant splicing in nonaccessible tissues.**

Data availability

No primary data were generated for this study. Rare variants from gnomAD v.3.1.2 are publicly available at https://gnomad.broadinstitute.org. The GTEx v8 dataset is available, under dbGaP protection, at https://gtexportal.org/home. The ALS dataset is available at http://dataportal.answerals.org after a registration and approval process. The mitochondrial dataset is described by Yépez et al.²⁷. Precomputed SpliceAI scores are publicly available at Illumina Basespace, https://basespace.illumina.com/s/otSPW8hnhaZR, after registration. SpliceMaps for all 49 GTEx tissues and iPSC-derived spinal motor neurons from ALS (hg38) are available at Zenodo, https://doi.org/10.5281/zenodo.6387937. Precomputed AbSplice-DNA scores (hg38) in all 49 GTEx tissues are available at Zenodo, https://doi.org/10.5281/zenodo.6408331. Because revealing rare variants could allow donor re-identification, the benchmark dataset cannot be shared without restrictions. Users with access to the GTEx data can reproduce the benchmark using the code repository below.

Code availability

SpliceMaps can be generated using the custom-written Python package ‘splicemap’, publicly available at https://github.com/gagneurlab/splicemap (ref. ⁶¹). AbSplice predictions using the enhanced SpliceMap annotation can be performed with the custom-written Python package ‘absplice’, publicly available at https://github.com/gagneurlab/absplice (ref. ⁶²). We also provide a fast SpliceAI wrapper that looks up precomputed scores for existing variants in a database and runs SpliceAI only for variants that are not precomputed, at https://github.com/gagneurlab/spliceai_rocksdb (ref. ⁶³). Fast lookup of all gnomAD variants can be performed with https://github.com/gagneurlab/gnomad_rocksdb (ref. ⁶⁴). The analysis code is available at https://github.com/gagneurlab/AbSplice_analysis (ref. ⁶⁵).

References

1. Zappala, Z. & Montgomery, S. B. Non-coding loss-of-function variation in human genomes. Hum. Hered. 81, 78–87 (2016).
2. Jaganathan, K. et al. Predicting splicing from primary sequence with deep learning. Cell 176, 535–548.e24 (2019).
3. Cheng, J. et al. MMSplice: modular modeling improves the predictions of genetic variant effects on splicing. Genome Biol. 20, 48 (2019).
4. Yeo, G. & Burge, C. B. Maximum entropy modeling of short sequence motifs with applications to RNA splicing signals. J. Comput. Biol. 11, 377–394 (2004).
5. Rosenberg, A. B., Patwardhan, R. P., Shendure, J. & Seelig, G. Learning the sequence determinants of alternative splicing from millions of random sequences. Cell 163, 698–711 (2015).
6. Xiong, H. Y. et al. The human splicing code reveals new insights into the genetic determinants of disease. Science 347, 1254806 (2015).
7. Rentzsch, P., Schubach, M., Shendure, J. & Kircher, M. CADD-Splice—improving genome-wide variant effect prediction using deep learning-derived splice scores. Genome Med. 13, 31 (2021).
8. Danis, D. et al. Interpretable prioritization of splice variants in diagnostic next-generation sequencing. Am. J. Hum. Genet. 108, 2205 (2021).
9. Cheng, J., Çelik, M. H., Kundaje, A. & Gagneur, J. MTSplice predicts effects of genetic variants on tissue-specific splicing. Genome Biol. 22, 94 (2021).
10. Mertes, C. et al. Detection of aberrant splicing events in RNA-seq data using FRASER. Nat. Commun. 12, 529 (2021).
11. Jenkinson, G. et al. LeafCutterMD: an algorithm for outlier splicing detection in rare diseases. Bioinformatics 36, 4609–4615 (2020).
12. Ferraro, N. M. et al. Transcriptomic signatures across human tissues identify functional rare genetic variation. Science 369, eaaz5900 (2020).
13. Karczewski, K. J. et al. The mutational constraint spectrum quantified from variation in 141,456 humans. Nature 581, 434–443 (2020).
14. Wilks, C. et al. recount3: summaries and queries for large-scale RNA-seq expression and splicing. Genome Biol. 22, 323 (2021).
15. Ling, J. P. et al. ASCOT identifies key regulators of neuronal subtype-specific splicing. Nat. Commun. 11, 137 (2020).
16. Kremer, L. S. et al. Genetic diagnosis of Mendelian disorders via RNA sequencing. Nat. Commun. 8, 15824 (2017).
17. Dawes, R., Joshi, H. & Cooper, S. T. Empirical prediction of variant-activated cryptic splice donors using population-based RNA-Seq data. Nat. Commun. 13, 1655 (2022).
18. Frankish, A. et al. GENCODE reference annotation for the human and mouse genomes. Nucleic Acids Res. 47, D766–D773 (2019).
19. Elliott, D. J. & Grellscheid, S. N. Alternative RNA splicing regulation in the testis. Reproduction 132, 811–819 (2006).
20. de la Grange, P., Gratadou, L., Delord, M., Dutertre, M. & Auboeuf, D. Splicing factor and exon profiling across human tissues. Nucleic Acids Res. 38, 2825–2838 (2010).
21. Dobin, A. et al. STAR: ultrafast universal RNA-seq aligner. Bioinformatics 29, 15–21 (2013).
22. Cotto, K. C., Feng, Y. Y., Ramu, A. et al. Integrated analysis of genomic and transcriptomic data for the discovery of splice-associated variants in cancer. Nat. Commun. 14, 1589 (2023).
23. Glinos, D. A. et al. Transcriptome variation in human tissues revealed by long-read sequencing. Nature 608, 353–359 (2022).
24. Amarasinghe, S. L. et al. Opportunities and challenges in long-read sequencing data analysis. Genome Biol. 21, 30 (2020).
25. Baeza-Centurion, P., Miñana, B., Schmiedel, J. M., Valcárcel, J. & Lehner, B. Combinatorial genetics reveals a scaling law for the effects of mutations on splicing. Cell 176, 549–563.e23 (2019).
26. Cheng, J., Çelik, M. H., Nguyen, T. Y. D., Avsec, Ž. & Gagneur, J. CAGI 5 splicing challenge: improved exon skipping and intron retention predictions with MMSplice. Hum. Mutat. 40, 1243–1251 (2019).
27. Yépez, V. A. et al. Clinical implementation of RNA sequencing for Mendelian disease diagnostics. Genome Med. 14, 38 (2022).
28. Abel, O., Powell, J. F., Andersen, P. M. & Al-Chalabi, A. ALSoD: a user-friendly online bioinformatics tool for amyotrophic lateral sclerosis genetics. Hum. Mutat. 33, 1345–1351 (2012).
29. Gregory, J. M., Fagegaltier, D., Phatnani, H. & Harms, M. B. Genetics of amyotrophic lateral sclerosis. Curr. Genet. Med. Rep. 8, 121–131 (2020).
30. Pecoraro, V. et al. The NGS technology for the identification of genes associated with the ALS. A systematic review. Eur. J. Clin. Invest. 50, e13228 (2020).
31. Hardiman, O. et al. Amyotrophic lateral sclerosis. Nat. Rev. Dis. Primers 3, 17071 (2017).
32. McCann, E. P. et al. Evidence for polygenic and oligogenic basis of Australian sporadic amyotrophic lateral sclerosis. J. Med. Genet. https://doi.org/10.1136/jmedgenet-2020-106866 (2020).
33. Cummings, B. B. et al. Improving genetic diagnosis in Mendelian disease with transcriptome sequencing. Sci. Transl. Med. 9, eaal5209 (2017).
34. Frésard, L. et al. Identification of rare-disease genes using blood transcriptome sequencing and large control cohorts. Nat. Med. 25, 911–919 (2019).
35. Aicher, J. K., Jewell, P., Vaquero-Garcia, J., Barash, Y. & Bhoj, E. J. Mapping RNA splicing variations in clinically accessible and nonaccessible tissues to facilitate Mendelian disease diagnosis using RNA-seq. Genet. Med. 22, 1181–1190 (2020).
36. Yates, A. D. et al. Ensembl 2020. Nucleic Acids Res. 48, D682–D688 (2020).
37. Pertea, M., Lin, X. & Salzberg, S. L. GeneSplicer: a new computational method for splice site prediction. Nucleic Acids Res. 29, 1185–1190 (2001).
38. Desmet, F.-O. et al. Human Splicing Finder: an online bioinformatics tool to predict splicing signals. Nucleic Acids Res. 37, e67 (2009).
39. Ke, S. et al. Quantitative evaluation of all hexamers as exonic splicing elements. Genome Res. 21, 1360–1374 (2011).
40. Jian, X., Boerwinkle, E. & Liu, X. In silico prediction of splice-altering single nucleotide variants in the human genome. Nucleic Acids Res. 42, 13534–13544 (2014).
41. Xiong, H. Y. et al. RNA splicing. The human splicing code reveals new insights into the genetic determinants of disease. Science 347, 1254806 (2015).
42. Sonnenburg, S., Schweikert, G., Philips, P., Behr, J. & Rätsch, G. Accurate splice site prediction using support vector machines. BMC Bioinf. 8, S7 (2007).
43. Barash, Y. et al. Deciphering the splicing code. Nature 465, 53–59 (2010).
44. Xiong, H. Y., Barash, Y. & Frey, B. J. Bayesian prediction of tissue-regulated splicing using RNA sequence and cellular context. Bioinformatics 27, 2554–2562 (2011).
45. Wainberg, M. et al. Opportunities and challenges for transcriptome-wide association studies. Nat. Genet. 51, 592–599 (2019).
46. Gonorazky, H. D. et al. Expanding the boundaries of RNA sequencing as a diagnostic tool for rare Mendelian disease. Am. J. Hum. Genet. 104, 466–483 (2019).
47. Martorella, M. Noninvasive, low-cost RNA-sequencing enhances discovery potential of transcriptome studies. Preprint at bioRxiv https://www.biorxiv.org/content/10.1101/2022.09.06.506813v1 (2022).
48. Teran, N. A. et al. Nonsense-mediated decay is highly stable across individuals and tissues. Am. J. Hum. Genet. 108, 1401–1408 (2021).
49. Rogalska, M. E., Vivori, C. & Valcárcel, J. Regulation of pre-mRNA splicing: roles in physiology and disease, and therapeutic prospects. Nat. Rev. Genet. https://doi.org/10.1038/s41576-022-00556-8 (2022).
50. Zhang, Y., Zhou, R. & Wang, Y. Sashimi.py: a flexible toolkit for combinatorial analysis of genomic data. Preprint at bioRxiv https://doi.org/10.1101/2022.11.02.514803 (2022).
51. McLaren, W. et al. The Ensembl Variant Effect Predictor. Genome Biol. 17, 122 (2016).
52. Tang, A. D. et al. Full-length transcript characterization of SF3B1 mutation in chronic lymphocytic leukemia reveals downregulation of retained introns. Nat. Commun. 11, 1438 (2020).
53. Yépez, V. A. et al. Detection of aberrant gene expression events in RNA sequencing data. Nat. Protoc. 16, 1276–1296 (2021).
54. Pervouchine, D. D., Knowles, D. G. & Guigo, R. Intron-centric estimation of alternative splicing from RNA-seq data. Bioinformatics 29, 273–274 (2013).
55. Pedregosa, F. et al. Scikit-learn: machine learning in Python. J. Mach. Learn. Res. 12, 2825–2830 (2011).
56. Stovner, E. B. & Sætrom, P. PyRanges: efficient comparison of genomic intervals in Python. Bioinformatics 36, 918–919 (2020).
57. Dong, S., Kryczka, A., Jin, Y. & Stumm, M. RocksDB: evolution of development priorities in a key-value store serving large-scale applications. ACM Trans. Storage 17, 26:1–26:32 (2021).
58. Nori, H., Jenkins, S., Koch, P. & Caruana, R. InterpretML: a unified framework for machine learning interpretability. Preprint at arXiv https://doi.org/10.48550/arXiv.1909.09223 (2019).
59. Mertes, C., Scheller, I. & Gagneur, J. FRASER code used in AbSplice publication. Zenodo https://doi.org/10.5281/zenodo.7447804 (2022).
60. Kopajtich, R. et al. Integration of proteomics with genomics and transcriptomics increases the diagnostic rate of Mendelian disorders. Preprint at medRxiv https://www.medrxiv.org/content/10.1101/2021.03.09.21253187v1 (2021).
61. Wagner, N. et al. SpliceMap code used in AbSplice publication. Zenodo https://doi.org/10.5281/zenodo.7626022 (2022).
62. Wagner, N. et al. AbSplice code used in AbSplice publication. Zenodo https://doi.org/10.5281/zenodo.7626035 (2022).
63. Wagner, N. et al. Code to generate SpliceAI rocksdb used in AbSplice publication. Zenodo https://doi.org/10.5281/zenodo.7626078 (2022).
64. Wagner, N. et al. Code to generate gnomAD rocksdb used in AbSplice publication. Zenodo https://doi.org/10.5281/zenodo.7625641 (2022).
65. Wagner, N. et al. Analysis code used in AbSplice publication. Zenodo https://doi.org/10.5281/zenodo.7628868 (2022).

Acknowledgements

We thank I. Scheller for feedback and advice. M.H.C. thanks X. Xie and A. Mortazavi for institutional support. The German Bundesministerium für Bildung und Forschung (BMBF) supported the study through the Model Exchange for Regulatory Genomics project (MERGE; grant no. 031L0174A to F.R.H. and J.G.), the German Network for Mitochondrial Disorders (mitoNET; grant no. 01GM1906B to V.A.Y.), the ERA PerMed project PerMiM (grant no. 01KU2016A to H.P. and grant no. 01KU2016B to J.G.) and the EJP RD project GENOMIT (grant no. 01GM1920A to H.P.). N.W. is supported by the Helmholtz Association under the joint research school ‘Munich School for Data Science – MUDS’. This study was funded by the Deutsche Forschungsgemeinschaft (DFG, German Research Foundation), via the projects ‘Identification of host genetic variation predisposing to severe COVID-19 by genetics, transcriptomics and functional analyses’ (grant no. 466168909 to V.A.Y. and J.G.) and NFDI 1/1 ‘GHGA - German Human Genome-Phenome Archive’ (grant no. 441914366 to C.M. and J.G.). Figure 1 was created with BioRender.com. The Genotype-Tissue Expression (GTEx) project was supported by the Common Fund of the Office of the Director of the National Institutes of Health, and by the NCI, NHGRI, NHLBI, NIDA, NIMH and NINDS. This study was supported by data provided by the Answer ALS Consortium, administered by the Robert Packard Center for ALS at Johns Hopkins. The funders had no role in study design, data collection and analysis, decision to publish or preparation of the manuscript.

Author information

These authors contributed equally: Nils Wagner, Muhammed H. Çelik.

Authors and Affiliations

School of Computation, Information and Technology, Technical University of Munich, Garching, Germany
Nils Wagner, Muhammed H. Çelik, Florian R. Hölzlwimmer, Christian Mertes, Vicente A. Yépez & Julien Gagneur
Helmholtz Association – Munich School for Data Science (MUDS), Munich, Germany
Nils Wagner & Julien Gagneur
Center for Complex Biological Systems, University of California, Irvine, Irvine, CA, USA
Muhammed H. Çelik
Munich Data Science Institute, Technical University of Munich, Garching, Germany
Christian Mertes
Institute of Human Genetics, School of Medicine, Technical University of Munich, Munich, Germany
Holger Prokisch & Julien Gagneur
Computational Health Center, Helmholtz Center Munich, Neuherberg, Germany
Holger Prokisch & Julien Gagneur

Contributions

J.G. conceptualized the project. N.W., M.H.C. and J.G. designed the methodology. N.W. and M.H.C. provided the software. N.W., M.H.C., F.R.H., H.P. and V.A.Y. performed validations. N.W., M.H.C., F.R.H., V.A.Y. and C.M. performed the formal analysis. N.W., M.H.C., F.R.H. and V.A.Y. curated the data. N.W., M.H.C., V.A.Y. and J.G. wrote the original draft of the manuscript. All authors reviewed and edited the manuscript. N.W., M.H.C., F.R.H., V.A.Y. and J.G. performed visualizations. J.G. supervised the project.

Corresponding author

Correspondence to Julien Gagneur.

Ethics declarations

Competing interests

The authors declare no competing interests.

Peer review

Peer review information

Nature Genetics thanks Jamie Ellingford, Xin Gao and the other, anonymous, reviewer(s) for their contribution to the peer review of this work. Peer reviewer reports are available.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Extended data

Extended Data Fig. 1 Performance comparison with different outlier detection methods and different differential splicing cutoffs.

a, Distribution of the area under the precision-recall curve across GTEx tissues (n = 49) of different prediction methods (SpliceAI, SpliceAI using SpliceMap annotation, SpliceAI using SpliceMap annotation along with quantitative reference levels of splicing, MMSplice using GENCODE annotation, MMSplice using SpliceMap annotation, MMSplice using SpliceMap annotation along with quantitative reference levels of splicing, and the integrative model AbSplice-DNA) taking as ground truth three different aberrant splicing callers: FRASER, LeafCutterMD and SPOT. A gene was considered aberrantly spliced if it contained at least one significant splicing outlier reported by the aberrant splicing caller, without applying any additional replication or rare variant filter (see Extended Data Fig. 4a for FRASER). Center line, median; box limits, first and third quartiles; whiskers span all data within 1.5 interquartile ranges of the lower and upper quartiles. P values were computed using the paired one-sided Wilcoxon test. b, Precision-recall curves comparing the overall prediction performance on all GTEx tissues of the same models as in a, using FRASER as the outlier caller and the rare variant filter in Extended Data Fig. 4c with 250 bp, together with different differential splicing cutoffs, namely |ΔΨ| = 0.1, 0.2 and 0.3.
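The caption above compares methods across tissues with a paired one-sided Wilcoxon test. The following is a minimal sketch of an exact signed-rank test for small samples without ties or zero differences. The function name `wilcoxon_paired_one_sided` and the per-tissue values are hypothetical, and the study's own P values were presumably computed with a standard statistics library rather than with this code.

```python
from typing import Sequence


def wilcoxon_paired_one_sided(x: Sequence[float], y: Sequence[float]) -> float:
    """Exact P value for the alternative that x tends to exceed y.

    Sketch only: assumes no zero differences and no tied absolute differences.
    """
    diffs = [a - b for a, b in zip(x, y)]
    if any(d == 0 for d in diffs):
        raise ValueError("zero differences are not handled in this sketch")
    abs_sorted = sorted(abs(d) for d in diffs)
    if len(set(abs_sorted)) != len(abs_sorted):
        raise ValueError("tied absolute differences are not handled in this sketch")
    # Rank absolute differences from 1 (smallest) to n (largest).
    rank = {value: i + 1 for i, value in enumerate(abs_sorted)}
    n = len(diffs)
    w_plus = sum(rank[abs(d)] for d in diffs if d > 0)
    # counts[s] = number of sign assignments whose positive ranks sum to s.
    max_sum = n * (n + 1) // 2
    counts = [0] * (max_sum + 1)
    counts[0] = 1
    for r in range(1, n + 1):
        for s in range(max_sum, r - 1, -1):
            counts[s] += counts[s - r]
    # P(W+ >= observed) under the null of symmetric differences.
    return sum(counts[w_plus:]) / 2 ** n


# Hypothetical auPRC values of two methods in six tissues.
auprc_a = [0.12, 0.15, 0.11, 0.18, 0.14, 0.16]
auprc_b = [0.05, 0.06, 0.07, 0.08, 0.03, 0.02]
p_value = wilcoxon_paired_one_sided(auprc_a, auprc_b)
print(p_value)
```

With all six paired differences positive, only one of the 64 possible sign assignments is as extreme as the observed one, which gives the smallest P value attainable for six pairs.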

Extended Data Fig. 2 Splicing outliers with a rare variant in the vicinity are enriched for replicated events.

a, Enrichment of replicated splicing outliers across tissues with respect to the distance to the nearest rare variant. Note that there is an enrichment up to a distance of 250 bp. ‘Number of tissues’ denotes the minimum number of tissues from an individual with a shared splicing outlier such that the outlier is considered to be replicated. b, Replication rate of aberrant splicing events between tissues (n = 49) of a sample for all aberrant splicing events (red) compared with aberrant splicing events that contain a rare variant within a 250 bp window (blue). Filtering for aberrant splicing events with a rare variant reduces the number of singletons, probably by filtering out technical artifacts. Center line, median; box limits, first and third quartiles; whiskers span all data within 1.5 interquartile ranges of the lower and upper quartiles. c, Percentage of singletons (aberrant splicing events that are observed in only one tissue) among all outliers (in red) and among outliers with a rare variant (in blue) for each tissue. There are nearly no replicated RNA-seq samples in the GTEx dataset. Therefore, among all singleton events, genuinely tissue-specific aberrant splicing events are hard to distinguish from non-reproducible technical artifacts.
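To make the notions of replicated events and singletons concrete, here is a simplified sketch that groups outlier calls by individual and gene and reports the fraction observed in exactly one tissue. The tuple layout, the function names and the toy calls are assumptions for illustration. The figure itself reports the percentage per tissue, which this sketch does not reproduce.

```python
from collections import defaultdict
from typing import Dict, Iterable, Set, Tuple

# Each outlier call is an (individual, gene, tissue) triple; values are toy data.
Outlier = Tuple[str, str, str]


def tissues_per_event(outliers: Iterable[Outlier]) -> Dict[Tuple[str, str], Set[str]]:
    """Group outlier calls by (individual, gene) and collect the tissues involved."""
    grouped: Dict[Tuple[str, str], Set[str]] = defaultdict(set)
    for individual, gene, tissue in outliers:
        grouped[(individual, gene)].add(tissue)
    return dict(grouped)


def singleton_fraction(outliers: Iterable[Outlier]) -> float:
    """Fraction of (individual, gene) events seen in exactly one tissue."""
    grouped = tissues_per_event(outliers)
    singletons = sum(1 for tissues in grouped.values() if len(tissues) == 1)
    return singletons / len(grouped)


toy_calls = [
    ("ind1", "GENE_A", "blood"),
    ("ind1", "GENE_A", "skin"),   # same individual and gene: replicated
    ("ind1", "GENE_B", "blood"),  # singleton
    ("ind2", "GENE_A", "lung"),   # singleton
]
print(singleton_fraction(toy_calls))
```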

Extended Data Fig. 3 Outlier filtering.

Visualization of different cases for the rare variant outlier filter (corresponds to Filter 3 in Extended Data Fig. 4). a, Exons 1, 3 and 4 were annotated in SpliceMap. Exon 2 is a novel exon, detected in one individual, whose splice sites are not in SpliceMap. If there exists a rare variant within 250 bp of any splice site (in SpliceMap or not) that shares a junction with either the donor or acceptor site of the outlier event, the outlier passes the ‘rare variant filter’. Cases 1 and 2: The individual has a rare variant within 250 bp of either the donor site of exon 1 or the acceptor site of exon 2, which are the splice sites of the outlier junction. Importantly, exon 2 was not quantified by SpliceMap, but the outlier filter solely depends on split reads. Case 3: The individual has a rare variant within 250 bp of the donor site of exon 2. However, this donor site is not part of the outlier event. Case 4: The individual has a rare variant within 250 bp of the acceptor site of exon 3, which forms a splicing junction with the donor site of exon 1. Case 5: The individual has two rare variants, one further than 250 bp from any splice site, the other within 250 bp of the acceptor site of exon 4. Notably, a variant can be far from the outlier junction and still be involved in the outlier event. b, Exon elongation detected as a splicing efficiency outlier. For splicing efficiency outliers, only the affected splice site with altered splicing efficiency is considered for the variant filter. Case 1: The individual has a rare variant within 250 bp of the donor site of exon 1. Case 2: The individual has a rare variant that overlaps the acceptor site of the elongated exon 3, but is further than 250 bp from the acceptor site of exon 3. Case 3: The individual has a rare variant within 250 bp of the acceptor site of exon 3. Case 4: The individual has a rare variant within 250 bp of the donor site of exon 3, but the donor is not related to the exon elongation.
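The rule for panel a can be sketched in a few lines. The snippet below is one possible reading of the caption, not the authors' implementation: it collects the splice sites of the outlier junction plus those of any observed junction sharing its donor or acceptor, and passes the outlier if a rare variant lies within 250 bp of any of them. The `Junction` type, the function name and all coordinates are hypothetical, and only cases 1 to 4 of panel a are modeled.

```python
from typing import Iterable, NamedTuple, Tuple

WINDOW_BP = 250  # window size stated in the caption


class Junction(NamedTuple):
    chrom: str
    donor: int     # position of the donor site
    acceptor: int  # position of the acceptor site


def passes_rare_variant_filter(outlier: Junction,
                               observed: Iterable[Junction],
                               rare_variants: Iterable[Tuple[str, int]],
                               window: int = WINDOW_BP) -> bool:
    """True if a rare variant lies within `window` bp of a relevant splice site."""
    # Start with the splice sites of the outlier junction itself.
    sites = {outlier.donor, outlier.acceptor}
    # Add sites of junctions that share a donor or acceptor with the outlier.
    for j in observed:
        if j.chrom == outlier.chrom and (
                j.donor == outlier.donor or j.acceptor == outlier.acceptor):
            sites.update((j.donor, j.acceptor))
    return any(chrom == outlier.chrom and abs(pos - site) <= window
               for chrom, pos in rare_variants for site in sites)


# Toy gene: exon 1 donor 1000; exon 2 acceptor 2000, donor 2100;
# exon 3 acceptor 3000, donor 3100; exon 4 acceptor 4000.
outlier = Junction("chr1", 1000, 2000)  # exon 1 -> novel exon 2
junctions = [outlier, Junction("chr1", 1000, 3000),
             Junction("chr1", 2100, 3000), Junction("chr1", 3100, 4000)]

case1 = passes_rare_variant_filter(outlier, junctions, [("chr1", 1100)])
case2 = passes_rare_variant_filter(outlier, junctions, [("chr1", 1900)])
case3 = passes_rare_variant_filter(outlier, junctions, [("chr1", 2300)])
case4 = passes_rare_variant_filter(outlier, junctions, [("chr1", 2900)])
print(case1, case2, case3, case4)
```

In this toy layout the case 3 variant sits 200 bp from the donor of exon 2 but 300 bp from the nearest relevant site, so it is the only one that fails the filter.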

Extended Data Fig. 4 Performance with different filters.

Precision-recall curve comparing the overall prediction performance on all GTEx tissues of SpliceAI, SpliceAI using SpliceMap annotation, SpliceAI using SpliceMap annotation along with quantitative reference levels of splicing, MMSplice using GENCODE annotation, MMSplice using SpliceMap annotation, MMSplice using SpliceMap annotation along with quantitative reference levels of splicing, and the integrative model AbSplice-DNA, using different filters for aberrantly spliced genes. a, Filter 1: FRASER default cutoffs (|ΔΨ| > 0.3, FDR < 0.05; 126,308 aberrant events). b, Filter 2: same as a, but restricting to genes that are aberrantly spliced in at least two different tissues from the same individual (32,886 aberrant events). c, Filter 3: same as a, but restricting to genes that have a rare variant within 250 bp of the splice sites (22,766 aberrant events). While the results are best with Filter 3, the relative improvements in precision at the same recall between the methods are the same as with Filter 2. In particular, restricting to variants within 250 bp of any detected split-read boundary (Filter 3) did not bias our analysis in favor of the splice-site-centric method MMSplice over SpliceAI. d, After applying Filter 3, outliers were stratified into ‘replicated’ (14,030 aberrant events), that is, appearing in at least two different tissues of the same individual, and ‘not replicated’ (8,736 aberrant events). All models showed a significantly higher performance for aberrant splicing events replicated in two or more samples compared to those reported in a single sample only.
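The three filters can be expressed as successive selections over a table of outlier calls. The sketch below uses the cutoffs quoted in the caption (|ΔΨ| > 0.3, FDR < 0.05, at least two tissues, rare variant within 250 bp). The `Call` record, its field names and the toy rows are assumptions for illustration and do not reflect the output format of FRASER.

```python
from collections import defaultdict
from typing import List, NamedTuple


class Call(NamedTuple):
    individual: str
    tissue: str
    gene: str
    delta_psi: float
    fdr: float
    has_rare_variant_250bp: bool  # precomputed flag, hypothetical field


def filter1(calls: List[Call], dpsi_cutoff: float = 0.3,
            fdr_cutoff: float = 0.05) -> List[Call]:
    """Filter 1: default cutoffs on effect size and FDR."""
    return [c for c in calls
            if abs(c.delta_psi) > dpsi_cutoff and c.fdr < fdr_cutoff]


def filter2(calls: List[Call]) -> List[Call]:
    """Filter 2: Filter 1, then keep genes aberrant in >= 2 tissues of an individual."""
    passed = filter1(calls)
    tissues = defaultdict(set)
    for c in passed:
        tissues[(c.individual, c.gene)].add(c.tissue)
    return [c for c in passed if len(tissues[(c.individual, c.gene)]) >= 2]


def filter3(calls: List[Call]) -> List[Call]:
    """Filter 3: Filter 1, then keep calls with a rare variant within 250 bp."""
    return [c for c in filter1(calls) if c.has_rare_variant_250bp]


toy_calls = [
    Call("ind1", "blood", "G1", 0.50, 0.010, True),
    Call("ind1", "skin", "G1", -0.40, 0.020, False),
    Call("ind1", "blood", "G2", 0.35, 0.040, False),
    Call("ind2", "lung", "G1", 0.20, 0.001, True),   # fails the |dPsi| cutoff
    Call("ind2", "lung", "G3", 0.60, 0.200, True),   # fails the FDR cutoff
]
print(len(filter1(toy_calls)), len(filter2(toy_calls)), len(filter3(toy_calls)))
```

Filters 2 and 3 are both applied on top of Filter 1 rather than on top of each other, which matches the "same as a, but restricting to" wording of the caption.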

Extended Data Fig. 5 Variant scoring of SpliceAI, MMSplice, MMSplice + SpliceMap and AbSplice-DNA.

a, A gene model with three annotated exons in the standard annotation (1, 3 and 4) and three exons detected by SpliceMap (1, 2 and 4). SpliceAI scores every bp in a 50 bp window around a variant (shown as a red star) and reports the maximum score, independent of the distance to a junction. MMSplice provides a score in a 100 bp window around a variant as long as there is a junction in that window. b, Case with a variant within 100 bp of an annotated junction in SpliceMap, but further than 100 bp from any exon in the standard annotation. MMSplice + SpliceMap is able to score the variant, while MMSplice is not. c, Case with a variant within 100 bp of an annotated exon in the standard annotation, but further than 100 bp from any exon in SpliceMap. Therefore, MMSplice is able to score the variant, while MMSplice + SpliceMap is not. d, The variant is not within 100 bp of any annotated junction in the standard annotation or SpliceMap. Therefore, neither MMSplice nor MMSplice + SpliceMap can score the variant. However, SpliceAI is always able to score a variant. Consequently, AbSplice is always able to score a variant.
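Panels b to d reduce to a distance check against two sets of splice sites. As a rough illustration, the function below reports which models could score a variant, assuming that "within 100 bp" means an absolute distance of at most 100 bp on a single chromosome. The function name `can_score`, the site lists and the variant positions are hypothetical.

```python
from typing import Dict, Iterable

MMSPLICE_WINDOW_BP = 100  # window size stated in the caption


def can_score(variant_pos: int,
              standard_sites: Iterable[int],
              splicemap_sites: Iterable[int],
              window: int = MMSPLICE_WINDOW_BP) -> Dict[str, bool]:
    """Which models can score a variant, given splice sites of two annotations."""
    def near(sites: Iterable[int]) -> bool:
        return any(abs(variant_pos - s) <= window for s in sites)

    return {
        "SpliceAI": True,  # scores any variant, regardless of annotation
        "MMSplice": near(standard_sites),
        "MMSplice + SpliceMap": near(splicemap_sites),
        "AbSplice-DNA": True,  # can always fall back on the SpliceAI score
    }


# Toy splice-site positions for the two annotations.
standard = [1000, 3000, 3100, 4000]   # exons 1, 3 and 4
splicemap = [1000, 2000, 2100, 4000]  # exons 1, 2 and 4

panel_b = can_score(2050, standard, splicemap)  # near a SpliceMap-only site
panel_c = can_score(3050, standard, splicemap)  # near a standard-only site
panel_d = can_score(5000, standard, splicemap)  # far from every site
print(panel_b, panel_c, panel_d, sep="\n")
```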

Extended Data Fig. 6 Comparison of annotated splice-sites in SpliceMap and GENCODE.

Number of introns, acceptor sites, and donor sites annotated in GENCODE and the SpliceMap of each GTEx tissue (first row), GENCODE only (second row) and SpliceMap only (third row).

Extended Data Fig. 7 The variant effect depends on the reference isoform proportion.

a, Ψ against Δlogit(Ψ) showing the non-linear splicing scaling law. The mutation effect of a variant can lead to different changes in Ψ in natural scale, depending on the reference splicing level of the intron. For example, the same variant can lead to a large change in Ψ if Ψ_ref is initially at an intermediate level and almost no change if Ψ_ref is initially at an extreme value (here low). b, Distribution of Ψ_ref in SpliceMap. Most of the introns are not alternatively spliced, so the reference level of those introns is either 0 or 1. c, Cumulative distribution function of the maximum difference of Ψ_ref (defined as: max(Ψ_ref) - min(Ψ_ref)) across tissues per intron. d, Heatmap of the Ψ_ref of the most variable introns (defined as: max(Ψ_ref) - min(Ψ_ref) > 0.3) across tissues.
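The scaling law in panel a can be illustrated numerically. The sketch below assumes the usual logistic form, in which a variant shifts the reference level additively on the logit scale, so that Ψ_alt = sigmoid(logit(Ψ_ref) + Δlogit(Ψ)). The function names and the clipping constant `eps` are assumptions introduced here to keep the logit finite at Ψ_ref values of exactly 0 or 1.

```python
import math


def logit(p: float) -> float:
    return math.log(p / (1.0 - p))


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def psi_after_variant(psi_ref: float, delta_logit_psi: float,
                      eps: float = 1e-6) -> float:
    """Apply a variant effect on the logit scale and map back to Psi."""
    clipped = min(max(psi_ref, eps), 1.0 - eps)  # avoid logit(0) and logit(1)
    return sigmoid(logit(clipped) + delta_logit_psi)


effect = -2.0  # the same variant effect on the logit scale in both cases
change_intermediate = psi_after_variant(0.5, effect) - 0.5
change_extreme = psi_after_variant(0.01, effect) - 0.01
print(f"Psi_ref = 0.50: delta Psi = {change_intermediate:+.3f}")
print(f"Psi_ref = 0.01: delta Psi = {change_extreme:+.3f}")
```

The same logit-scale effect produces a large drop in Ψ from an intermediate reference level and a negligible one from a reference level that is already close to zero, which is the behavior the panel describes.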

Extended Data Fig. 8 Calibration of AbSplice-DNA.

a, Histogram of AbSplice-DNA scores for gene, sample, tissue combinations that do not contain an aberrant splicing event. The dashed red line indicates the median. b, Histogram of AbSplice-DNA scores for gene, sample, tissue combinations that contain an aberrant splicing event. The peak at logit(AbSplice-DNA) ≈ −3.1 corresponds to AbSplice-DNA scores that are low due to small SpliceAI and MMSplice scores, but with an expressed splice site as annotated in SpliceMap. The peak at logit(AbSplice-DNA) ≈ −4.3 corresponds to small SpliceAI and MMSplice scores with an unused splice site as annotated in SpliceMap. c, Odds of aberrant splicing events as a function of logit-transformed AbSplice-DNA scores (binned in bins of width 0.1). The line represents the diagonal. Note the linear relationship (especially in the high AbSplice-DNA score region) and the (extrapolated) intersection at an AbSplice-DNA score of 0.5 (logit(AbSplice-DNA) = 0), corresponding to odds of 1 (log odds of 0), indicating a well-calibrated model.
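The calibration check in panel c can be approximated as follows: bin the logit-transformed scores in bins of width 0.1 and compute the empirical log odds of an aberrant event in each bin. For a well-calibrated model these log odds track the logit score. The function name `log_odds_by_logit_bin` and the toy scores are assumptions, and scores of exactly 0 or 1 are not handled in this sketch.

```python
import math
from collections import defaultdict
from typing import Dict, Sequence


def log_odds_by_logit_bin(scores: Sequence[float], labels: Sequence[int],
                          bin_width: float = 0.1) -> Dict[float, float]:
    """Empirical log odds of a positive label per bin of logit(score)."""
    counts = defaultdict(lambda: [0, 0])  # bin index -> [negatives, positives]
    for score, label in zip(scores, labels):
        bin_index = math.floor(math.log(score / (1.0 - score)) / bin_width)
        counts[bin_index][label] += 1
    result = {}
    for bin_index, (neg, pos) in sorted(counts.items()):
        if neg > 0 and pos > 0:  # log odds undefined otherwise
            result[round(bin_index * bin_width, 10)] = math.log(pos / neg)
    return result


# Toy data: half of the 0.5-scored cases and 9 of 10 of the 0.9-scored
# cases are positives, as expected from a calibrated model.
toy_scores = [0.5] * 6 + [0.9] * 10
toy_labels = [1, 1, 1, 0, 0, 0] + [1] * 9 + [0]
calibration = log_odds_by_logit_bin(toy_scores, toy_labels)
for left_edge, log_odds in calibration.items():
    print(f"bin starting at {left_edge:+.1f}: log odds = {log_odds:+.2f}")
```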

Extended Data Fig. 9 Performance analysis of additional state-of-the-art models and AbSplice-DNA trained with different model methods.

a, Precision-recall performance of CADD-Splice, SQUIRLS, MTSplice, MMSplice and SpliceAI. b, Distribution of the area under the precision-recall curve (auPRC) across all GTEx tissues (n = 49) of the AbSplice-DNA models trained with varying feature sets using the models in a; for example, ‘AbSplice-DNA (+ CADD-Splice)’ additionally used CADD-Splice scores during training. Center line, median; box limits, first and third quartiles; whiskers span all data within 1.5 interquartile ranges of the lower and upper quartiles. Shown in red is the AbSplice-DNA model used in the manuscript. Models are sorted by auPRC. P values were computed using the paired two-sided Wilcoxon test. c,d, AbSplice-DNA was trained using a generalized additive model (GAM), random forest and logistic regression. AbSplice-DNA with GAM is the one used in the manuscript. c, Precision-recall curve across all GTEx tissues. d, Distribution of the area under the precision-recall curve of the models in c across tissues (n = 49). Center line, median; box limits, first and third quartiles; whiskers span all data within 1.5 interquartile ranges of the lower and upper quartiles.

Extended Data Fig. 10 RNA-based predictions from CAT improve DNA-based scores.

a, Precision-recall curves comparing the overall prediction performance on non-accessible GTEx tissues using the gene-level FRASER P values from the CAT, AbSplice-RNA trained on a single CAT and AbSplice-DNA. Each panel shows a different CAT and the number of matching samples in the non-accessible tissues. b, Same as a, but for samples having RNA-seq from both blood and fibroblasts. AbSplice-RNA (all CATs) was trained using RNA-seq data from blood, fibroblasts and lymphocytes. Note that AbSplice-RNA (fibroblasts) gave a similar performance to AbSplice-RNA (all CATs). We did not restrict the samples to the ones also having lymphocytes as this would result in a low number of samples (N = 2,258). c, Model performance for genes not expressed or expressed in the clinically accessible tissue fibroblasts. The cutoff for calling a gene expressed was TPM > 1 (transcripts per million). AbSplice-RNA improves for genes expressed in fibroblasts and remains on par with AbSplice-DNA for genes not expressed in fibroblasts.

Supplementary information

Supplementary Information

Supplementary Figs. 1–7 and Table 1.

Reporting Summary

Peer Review File

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Wagner, N., Çelik, M.H., Hölzlwimmer, F.R. et al. Aberrant splicing prediction across human tissues. Nat Genet 55, 861–870 (2023). https://doi.org/10.1038/s41588-023-01373-3

Received: 05 April 2022
Accepted: 14 March 2023
Published: 04 May 2023
Issue Date: May 2023
DOI: https://doi.org/10.1038/s41588-023-01373-3
