  • Technical Report

Aberrant splicing prediction across human tissues

Abstract

Aberrant splicing is a major cause of genetic disorders but its direct detection in transcriptomes is limited to clinically accessible tissues such as skin or body fluids. While DNA-based machine learning models can prioritize rare variants for affecting splicing, their performance in predicting tissue-specific aberrant splicing remains unassessed. Here we generated an aberrant splicing benchmark dataset, spanning over 8.8 million rare variants in 49 human tissues from the Genotype-Tissue Expression (GTEx) dataset. At 20% recall, state-of-the-art DNA-based models achieve maximum 12% precision. By mapping and quantifying tissue-specific splice site usage transcriptome-wide and modeling isoform competition, we increased precision by threefold at the same recall. Integrating RNA-sequencing data of clinically accessible tissues into our model, AbSplice, brought precision to 60%. These results, replicated in two independent cohorts, substantially contribute to noncoding loss-of-function variant identification and to genetic diagnostics design and analytics.

Fig. 1: Study design and main findings.
Fig. 2: Tissue-specific splice site map improves prediction performance.
Fig. 3: Quantitative splicing levels further improve prediction performance.
Fig. 4: Application of AbSplice-DNA on independent data.
Fig. 5: Integrating RNA-seq data of CATs to predict aberrant splicing in nonaccessible tissues.

Data availability

No primary data were generated for this study. Rare variants from gnomAD v.3.1.2 are publicly available at https://gnomad.broadinstitute.org. The GTEx v8 dataset is available (under dbGaP protection) at https://gtexportal.org/home. The ALS dataset is available at http://dataportal.answerals.org after a registration and approval process. The mitochondrial dataset is described by Yépez et al. (ref. 27). Precomputed SpliceAI scores are publicly available at Illumina Basespace, https://basespace.illumina.com/s/otSPW8hnhaZR, after registration. SpliceMaps for all 49 GTEx tissues and iPSC-derived spinal motor neurons from ALS (hg38) are available at Zenodo, https://doi.org/10.5281/zenodo.6387937. Precomputed AbSplice-DNA scores (hg38) in all 49 GTEx tissues are available at Zenodo, https://doi.org/10.5281/zenodo.6408331. Because revealing rare variants could allow donor re-identification, the benchmark dataset cannot be shared without restrictions. Users with access to the GTEx data can reproduce the benchmark using the code repository below.

Code availability

SpliceMaps can be generated using the custom-written Python package ‘splicemap’ (publicly available at https://github.com/gagneurlab/splicemap; ref. 61). AbSplice predictions using the enhanced SpliceMap annotation can be performed with the custom-written Python package ‘absplice’ (publicly available at https://github.com/gagneurlab/absplice; ref. 62). We also provide a fast implementation for computing SpliceAI predictions: a wrapper that looks up existing variants in a database of precomputed scores and runs SpliceAI only for variants that are not precomputed (https://github.com/gagneurlab/spliceai_rocksdb; ref. 63). Fast lookup of all gnomAD variants can be performed with https://github.com/gagneurlab/gnomad_rocksdb (ref. 64). The analyses are available at https://github.com/gagneurlab/AbSplice_analysis (ref. 65).
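The lookup-with-fallback idea behind the SpliceAI wrapper can be illustrated with a minimal Python sketch. This is not the API of spliceai_rocksdb: the class name `CachedScorer`, the variant key format and the function `fake_spliceai` are hypothetical, and a plain dictionary stands in for the RocksDB key-value store.

```python
from typing import Callable, Dict


class CachedScorer:
    """Return precomputed scores when available, otherwise compute and cache them.

    Hypothetical illustration only; a dict stands in for the on-disk database.
    """

    def __init__(self, precomputed: Dict[str, float], compute: Callable[[str], float]):
        self._db = dict(precomputed)    # variant key -> score
        self._compute = compute         # slow path, e.g. running the model
        self.n_computed = 0             # number of times the slow path was taken

    def score(self, variant: str) -> float:
        if variant in self._db:         # fast path: database hit
            return self._db[variant]
        value = self._compute(variant)  # slow path: database miss
        self.n_computed += 1
        self._db[variant] = value       # store so the next lookup is fast
        return value


def fake_spliceai(variant: str) -> float:
    """Stand-in for an expensive model call (an assumption, not SpliceAI itself)."""
    return 0.5


scorer = CachedScorer({"chr1:100:A>G": 0.91}, fake_spliceai)
hit = scorer.score("chr1:100:A>G")    # served from the precomputed table
miss = scorer.score("chr1:200:C>T")   # computed once, then cached
again = scorer.score("chr1:200:C>T")  # now served from the cache
```

The design choice illustrated here is that the expensive model runs at most once per variant, while variants already in the database never trigger it.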

References

  1. Zappala, Z. & Montgomery, S. B. Non-coding loss-of-function variation in human genomes. Hum. Hered. 81, 78–87 (2016).

  2. Jaganathan, K. et al. Predicting splicing from primary sequence with deep learning. Cell 176, 535–548.e24 (2019).

  3. Cheng, J. et al. MMSplice: modular modeling improves the predictions of genetic variant effects on splicing. Genome Biol. 20, 48 (2019).

  4. Yeo, G. & Burge, C. B. Maximum entropy modeling of short sequence motifs with applications to RNA splicing signals. J. Comput. Biol. 11, 377–394 (2004).

  5. Rosenberg, A. B., Patwardhan, R. P., Shendure, J. & Seelig, G. Learning the sequence determinants of alternative splicing from millions of random sequences. Cell 163, 698–711 (2015).

  6. Xiong, H. Y. et al. The human splicing code reveals new insights into the genetic determinants of disease. Science 347, 1254806 (2015).

  7. Rentzsch, P., Schubach, M., Shendure, J. & Kircher, M. CADD-Splice—improving genome-wide variant effect prediction using deep learning-derived splice scores. Genome Med. 13, 31 (2021).

  8. Danis, D. et al. Interpretable prioritization of splice variants in diagnostic next-generation sequencing. Am. J. Hum. Genet. 108, 2205 (2021).

  9. Cheng, J., Çelik, M. H., Kundaje, A. & Gagneur, J. MTSplice predicts effects of genetic variants on tissue-specific splicing. Genome Biol. 22, 94 (2021).

  10. Mertes, C. et al. Detection of aberrant splicing events in RNA-seq data using FRASER. Nat. Commun. 12, 529 (2021).

  11. Jenkinson, G. et al. LeafCutterMD: an algorithm for outlier splicing detection in rare diseases. Bioinformatics 36, 4609–4615 (2020).

  12. Ferraro, N. M. et al. Transcriptomic signatures across human tissues identify functional rare genetic variation. Science 369, eaaz5900 (2020).

  13. Karczewski, K. J. et al. The mutational constraint spectrum quantified from variation in 141,456 humans. Nature 581, 434–443 (2020).

  14. Wilks, C. et al. recount3: summaries and queries for large-scale RNA-seq expression and splicing. Genome Biol. 22, 323 (2021).

  15. Ling, J. P. et al. ASCOT identifies key regulators of neuronal subtype-specific splicing. Nat. Commun. 11, 137 (2020).

  16. Kremer, L. S. et al. Genetic diagnosis of Mendelian disorders via RNA sequencing. Nat. Commun. 8, 15824 (2017).

  17. Dawes, R., Joshi, H. & Cooper, S. T. Empirical prediction of variant-activated cryptic splice donors using population-based RNA-Seq data. Nat. Commun. 13, 1655 (2022).

  18. Frankish, A. et al. GENCODE reference annotation for the human and mouse genomes. Nucleic Acids Res. 47, D766–D773 (2019).

  19. Elliott, D. J. & Grellscheid, S. N. Alternative RNA splicing regulation in the testis. Reproduction 132, 811–819 (2006).

  20. de la Grange, P., Gratadou, L., Delord, M., Dutertre, M. & Auboeuf, D. Splicing factor and exon profiling across human tissues. Nucleic Acids Res. 38, 2825–2838 (2010).

  21. Dobin, A. et al. STAR: ultrafast universal RNA-seq aligner. Bioinformatics 29, 15–21 (2013).

  22. Cotto, K. C. et al. Integrated analysis of genomic and transcriptomic data for the discovery of splice-associated variants in cancer. Nat. Commun. 14, 1589 (2023).

  23. Glinos, D. A. et al. Transcriptome variation in human tissues revealed by long-read sequencing. Nature 608, 353–359 (2022).

  24. Amarasinghe, S. L. et al. Opportunities and challenges in long-read sequencing data analysis. Genome Biol. 21, 30 (2020).

  25. Baeza-Centurion, P., Miñana, B., Schmiedel, J. M., Valcárcel, J. & Lehner, B. Combinatorial genetics reveals a scaling law for the effects of mutations on splicing. Cell 176, 549–563.e23 (2019).

  26. Cheng, J., Çelik, M. H., Nguyen, T. Y. D., Avsec, Ž. & Gagneur, J. CAGI 5 splicing challenge: improved exon skipping and intron retention predictions with MMSplice. Hum. Mutat. 40, 1243–1251 (2019).

  27. Yépez, V. A. et al. Clinical implementation of RNA sequencing for Mendelian disease diagnostics. Genome Med. 14, 38 (2022).

  28. Abel, O., Powell, J. F., Andersen, P. M. & Al-Chalabi, A. ALSoD: a user-friendly online bioinformatics tool for amyotrophic lateral sclerosis genetics. Hum. Mutat. 33, 1345–1351 (2012).

  29. Gregory, J. M., Fagegaltier, D., Phatnani, H. & Harms, M. B. Genetics of amyotrophic lateral sclerosis. Curr. Genet. Med. Rep. 8, 121–131 (2020).

  30. Pecoraro, V. et al. The NGS technology for the identification of genes associated with the ALS. A systematic review. Eur. J. Clin. Invest. 50, e13228 (2020).

  31. Hardiman, O. et al. Amyotrophic lateral sclerosis. Nat. Rev. Dis. Primers 3, 17071 (2017).

  32. McCann, E. P. et al. Evidence for polygenic and oligogenic basis of Australian sporadic amyotrophic lateral sclerosis. J. Med. Genet. https://doi.org/10.1136/jmedgenet-2020-106866 (2020).

  33. Cummings, B. B. et al. Improving genetic diagnosis in Mendelian disease with transcriptome sequencing. Sci. Transl. Med. 9, eaal5209 (2017).

  34. Frésard, L. et al. Identification of rare-disease genes using blood transcriptome sequencing and large control cohorts. Nat. Med. 25, 911–919 (2019).

  35. Aicher, J. K., Jewell, P., Vaquero-Garcia, J., Barash, Y. & Bhoj, E. J. Mapping RNA splicing variations in clinically accessible and nonaccessible tissues to facilitate Mendelian disease diagnosis using RNA-seq. Genet. Med. 22, 1181–1190 (2020).

  36. Yates, A. D. et al. Ensembl 2020. Nucleic Acids Res. 48, D682–D688 (2020).

  37. Pertea, M., Lin, X. & Salzberg, S. L. GeneSplicer: a new computational method for splice site prediction. Nucleic Acids Res. 29, 1185–1190 (2001).

  38. Desmet, F.-O. et al. Human Splicing Finder: an online bioinformatics tool to predict splicing signals. Nucleic Acids Res. 37, e67 (2009).

  39. Ke, S. et al. Quantitative evaluation of all hexamers as exonic splicing elements. Genome Res. 21, 1360–1374 (2011).

  40. Jian, X., Boerwinkle, E. & Liu, X. In silico prediction of splice-altering single nucleotide variants in the human genome. Nucleic Acids Res. 42, 13534–13544 (2014).

  41. Xiong, H. Y. et al. RNA splicing. The human splicing code reveals new insights into the genetic determinants of disease. Science 347, 1254806 (2015).

  42. Sonnenburg, S., Schweikert, G., Philips, P., Behr, J. & Rätsch, G. Accurate splice site prediction using support vector machines. BMC Bioinf. 8, S7 (2007).

  43. Barash, Y. et al. Deciphering the splicing code. Nature 465, 53–59 (2010).

  44. Xiong, H. Y., Barash, Y. & Frey, B. J. Bayesian prediction of tissue-regulated splicing using RNA sequence and cellular context. Bioinformatics 27, 2554–2562 (2011).

  45. Wainberg, M. et al. Opportunities and challenges for transcriptome-wide association studies. Nat. Genet. 51, 592–599 (2019).

  46. Gonorazky, H. D. et al. Expanding the boundaries of RNA sequencing as a diagnostic tool for rare Mendelian disease. Am. J. Hum. Genet. 104, 466–483 (2019).

  47. Martorella, M. Noninvasive, low-cost RNA-sequencing enhances discovery potential of transcriptome studies. Preprint at bioRxiv https://www.biorxiv.org/content/10.1101/2022.09.06.506813v1 (2022).

  48. Teran, N. A. et al. Nonsense-mediated decay is highly stable across individuals and tissues. Am. J. Hum. Genet. 108, 1401–1408 (2021).

  49. Rogalska, M. E., Vivori, C. & Valcárcel, J. Regulation of pre-mRNA splicing: roles in physiology and disease, and therapeutic prospects. Nat. Rev. Genet. https://doi.org/10.1038/s41576-022-00556-8 (2022).

  50. Zhang, Y., Zhou, R. & Wang, Y. Sashimi.py: a flexible toolkit for combinatorial analysis of genomic data. Preprint at bioRxiv https://doi.org/10.1101/2022.11.02.514803 (2022).

  51. McLaren, W. et al. The ensembl variant effect predictor. Genome Biol. 17, 122 (2016).

  52. Tang, A. D. et al. Full-length transcript characterization of SF3B1 mutation in chronic lymphocytic leukemia reveals downregulation of retained introns. Nat. Commun. 11, 1438 (2020).

  53. Yépez, V. A. et al. Detection of aberrant gene expression events in RNA sequencing data. Nat. Protoc. 16, 1276–1296 (2021).

  54. Pervouchine, D. D., Knowles, D. G. & Guigo, R. Intron-centric estimation of alternative splicing from RNA-seq data. Bioinformatics 29, 273–274 (2013).

  55. Pedregosa, F. et al. Scikit-learn: machine learning in Python. J. Mach. Learn. Res. 12, 2825–2830 (2011).

  56. Stovner, E. B. & Sætrom, P. PyRanges: efficient comparison of genomic intervals in Python. Bioinformatics 36, 918–919 (2020).

  57. Dong, S., Kryczka, A., Jin, Y. & Stumm, M. RocksDB: evolution of development priorities in a key-value store serving large-scale applications. ACM Trans. Storage 17, 26:1–26:32 (2021).

  58. Nori, H., Jenkins, S., Koch, P. & Caruana, R. InterpretML: a unified framework for machine learning interpretability. Preprint at arXiv https://doi.org/10.48550/arXiv.1909.09223 (2019).

  59. Mertes, C., Scheller, I. & Gagneur, J. FRASER code used in AbSplice publication. Zenodo https://doi.org/10.5281/zenodo.7447804 (2022).

  60. Kopajtich, R. et al. Integration of proteomics with genomics and transcriptomics increases the diagnostic rate of Mendelian disorders. Preprint at https://www.medrxiv.org/content/10.1101/2021.03.09.21253187v1 (2021).

  61. Wagner, N. et al. SpliceMap code used in AbSplice publication. Zenodo https://doi.org/10.5281/zenodo.7626022 (2022).

  62. Wagner, N. et al. AbSplice code used in AbSplice publication. Zenodo https://doi.org/10.5281/zenodo.7626035 (2022).

  63. Wagner, N. et al. Code to generate SpliceAI rocksdb used in AbSplice publication. Zenodo https://doi.org/10.5281/zenodo.7626078 (2022).

  64. Wagner, N. et al. Code to generate gnomAD rocksdb used in AbSplice publication. Zenodo https://doi.org/10.5281/zenodo.7625641 (2022).

  65. Wagner, N. et al. Analysis code used in AbSplice publication. Zenodo https://doi.org/10.5281/zenodo.7628868 (2022).

Acknowledgements

We thank I. Scheller for feedback and advice. M.H.C. thanks X. Xie and A. Mortazavi for institutional support. The German Bundesministerium für Bildung und Forschung (BMBF) supported the study through the Model Exchange for Regulatory Genomics project (MERGE; grant no. 031L0174A to F.R.H. and J.G.), the German Network for Mitochondrial Disorders (mitoNET; grant no. 01GM1906B to V.A.Y.), the ERA PerMed project PerMiM (grant no. 01KU2016A to H.P. and grant no. 01KU2016B to J.G.) and the EJP RD project GENOMIT (grant no. 01GM1920A to H.P.). N.W. is supported by the Helmholtz Association under the joint research school ‘Munich School for Data Science – MUDS’. This study was funded by the Deutsche Forschungsgemeinschaft (DFG, German Research Foundation), via the projects ‘Identification of host genetic variation predisposing to severe COVID-19 by genetics, transcriptomics and functional analyses’ (grant no. 466168909 to V.A.Y. and J.G.) and NFDI 1/1 ‘GHGA - German Human Genome-Phenome Archive’ (grant no. 441914366 to C.M. and J.G.). Figure 1 was created with BioRender.com. The Genotype-Tissue Expression (GTEx) project was supported by the Common Fund of the Office of the Director of the National Institutes of Health, and by the NCI, NHGRI, NHLBI, NIDA, NIMH and NINDS. This study was supported by data provided by the Answer ALS Consortium, administered by the Robert Packard Center for ALS at Johns Hopkins. The funders had no role in study design, data collection and analysis, decision to publish or preparation of the manuscript.

Author information

Contributions

J.G. conceptualized the project. N.W., M.H.C. and J.G. designed the methodology. N.W. and M.H.C. provided the software. N.W., M.H.C., F.R.H., H.P. and V.A.Y. performed validations. N.W., M.H.C., F.R.H., V.A.Y. and C.M. performed the formal analysis. N.W., M.H.C., F.R.H. and V.A.Y. curated the data. N.W., M.H.C., V.A.Y. and J.G. wrote the original draft of the manuscript. All authors reviewed and edited the manuscript. N.W., M.H.C., F.R.H., V.A.Y. and J.G. performed visualizations. J.G. supervised the project.

Corresponding author

Correspondence to Julien Gagneur.

Ethics declarations

Competing interests

The authors declare no competing interests.

Peer review

Peer review information

Nature Genetics thanks Jamie Ellingford, Xin Gao and the other, anonymous, reviewer(s) for their contribution to the peer review of this work. Peer reviewer reports are available.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Extended data

Extended Data Fig. 1 Performance comparison with different outlier detection methods and different differential splicing cutoffs.

a, Distribution of the area under the precision-recall curve across GTEx tissues (n = 49) of different prediction methods (SpliceAI, SpliceAI using SpliceMap annotation, SpliceAI using SpliceMap annotation along with quantitative reference levels of splicing, MMSplice using GENCODE annotation, MMSplice using SpliceMap annotation, MMSplice using SpliceMap annotation along with quantitative reference levels of splicing, and the integrative model AbSplice-DNA) taking as ground truth 3 different aberrant splicing callers: FRASER, LeafCutterMD and SPOT. A gene was considered aberrantly spliced if it contained at least one significant splicing outlier reported by the aberrant splicing caller without applying any additional replication or rare variant filter (Extended Data Fig. 4a for FRASER). Center line, median; box limits, first and third quartiles; whiskers span all data within 1.5 interquartile ranges of the lower and upper quartiles. P values were computed using the paired one-sided Wilcoxon test. b, Precision-recall curves comparing the overall prediction performance on all GTEx tissues of the same models as in a, using FRASER as the outlier caller and the rare variant filter in Extended Data Fig. 4c with 250 bp together with different differential splicing cutoffs, namely |ΔΨ| = 0.1, 0.2, 0.3.
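As an illustration of the two metrics used throughout (area under the precision-recall curve, and precision at a fixed recall), the following Python sketch computes both from ranked scores. It is a simplified stand-in and not the analysis code of the study: the function names are hypothetical, the area is approximated by step-wise average precision, and tied scores are not treated specially.

```python
from typing import List, Sequence, Tuple


def precision_recall_points(scores: Sequence[float], labels: Sequence[int]) -> List[Tuple[float, float]]:
    """Return (recall, precision) after each prediction, ranked by decreasing score."""
    n_pos = sum(labels)
    if n_pos == 0:
        raise ValueError("at least one positive label is required")
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    points = []
    true_pos = 0
    for k, (_, label) in enumerate(ranked, start=1):
        true_pos += label
        points.append((true_pos / n_pos, true_pos / k))
    return points


def average_precision(scores: Sequence[float], labels: Sequence[int]) -> float:
    """Step-wise approximation of the area under the precision-recall curve."""
    area = 0.0
    prev_recall = 0.0
    for recall, precision in precision_recall_points(scores, labels):
        area += (recall - prev_recall) * precision
        prev_recall = recall
    return area


def precision_at_recall(scores: Sequence[float], labels: Sequence[int], target_recall: float) -> float:
    """Best precision among all cutoffs that reach at least the target recall."""
    return max(p for r, p in precision_recall_points(scores, labels) if r >= target_recall)


# Toy example: 1 marks an aberrantly spliced gene, 0 a normally spliced one.
example_scores = [0.9, 0.8, 0.7, 0.6, 0.5]
example_labels = [1, 0, 1, 0, 0]
ap = average_precision(example_scores, example_labels)
p20 = precision_at_recall(example_scores, example_labels, 0.2)
```

In a per-tissue comparison, `average_precision` would be evaluated once per tissue and model, giving the paired values that a paired Wilcoxon test compares.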

Extended Data Fig. 2 Splicing outliers with a rare variant in the vicinity are enriched for replicated events.

a, Enrichment of replicated splicing outliers across tissues with respect to the distance to the nearest rare variant. Note that there is an enrichment up to a distance of 250 bp. ‘Number of tissues’ denotes the minimum number of tissues from an individual with a shared splicing outlier such that the outlier is considered to be replicated. b, Replication rate of aberrant splicing events between tissues (n = 49) of a sample for all aberrant splicing events (red) compared with aberrant splicing events that contain a rare variant within a 250 bp window (blue). Filtering for aberrant splicing events with a rare variant reduces the number of singletons, probably by filtering out technical artifacts. Center line, median; box limits, first and third quartiles; whiskers span all data within 1.5 interquartile ranges of the lower and upper quartiles. c, Percentage of singletons (aberrant splicing events that are observed only in one tissue) among all outliers (in red) and among outliers with a rare variant (in blue) for each tissue. There are nearly no replicated RNA-seq samples in the GTEx dataset. Therefore, among all singleton events, genuinely tissue-specific aberrant splicing events are hard to distinguish from non-reproducible technical artifacts.

Extended Data Fig. 3 Outlier filtering.

Visualization of different cases for the rare variant outlier filter (corresponds to Filter 3 in Extended Data Fig. 4). a, Exons 1, 3 and 4 were annotated in SpliceMap. Exon 2 is a novel exon detected in an individual; its splice sites are not in SpliceMap. If there exists a rare variant within 250 bp of any splice site (in SpliceMap or not) that shares a junction with either the donor or acceptor site of the outlier event, the outlier passes the ‘rare variant filter’. Cases 1 and 2: The individual has a rare variant within 250 bp of either the donor site of exon 1 or the acceptor site of exon 2, which are the splice sites of the outlier junction. Importantly, exon 2 was not quantified by SpliceMap, but the outlier filter solely depends on split reads. Case 3: The individual has a rare variant within 250 bp of the donor site of exon 2. However, this donor site is not part of the outlier event. Case 4: The individual has a rare variant within 250 bp of the acceptor site of exon 3, which forms a splicing junction with the donor site of exon 1. Case 5: The individual has two rare variants, one further than 250 bp from any splice site, the other within 250 bp of the acceptor site of exon 4. Notably, a variant can be far from the outlier junction and still be involved in the outlier event. b, Exon elongation detected as a splicing efficiency outlier. For splicing efficiency outliers, only the affected splice site with altered splicing efficiency is considered for the variant filter. Case 1: The individual has a rare variant within 250 bp of the donor site of exon 1. Case 2: The individual has a rare variant that overlaps the acceptor site of the elongated exon 3, but is further than 250 bp from the acceptor site of exon 3. Case 3: The individual has a rare variant within 250 bp of the donor site of exon 3, but the donor is not related to the exon elongation.
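The junction-sharing rule of panel a can be sketched in Python as follows. This is one reading of the caption, not the filter implementation used in the study: junctions are reduced to hypothetical (donor, acceptor) coordinate pairs on a single strand, all coordinates are invented, and the pass or fail outcome assigned to each case is an interpretation of the text.

```python
from typing import Iterable, Set, Tuple

Junction = Tuple[int, int]  # (donor position, acceptor position), hypothetical coordinates


def candidate_sites(outlier: Junction, junctions: Iterable[Junction]) -> Set[int]:
    """Splice sites of the outlier junction plus every site sharing a junction with them."""
    donor, acceptor = outlier
    sites = {donor, acceptor}
    for d, a in junctions:
        if d == donor:       # other acceptors used by the outlier donor
            sites.add(a)
        if a == acceptor:    # other donors used by the outlier acceptor
            sites.add(d)
    return sites


def passes_rare_variant_filter(outlier: Junction, junctions: Iterable[Junction],
                               variant_positions: Iterable[int], max_dist: int = 250) -> bool:
    """True if any rare variant lies within max_dist bp of a candidate splice site."""
    sites = candidate_sites(outlier, junctions)
    return any(abs(v - s) <= max_dist for v in variant_positions for s in sites)


# Invented layout: exon 1 donor 1000; exon 2 acceptor 3000, donor 3100;
# exon 3 acceptor 5000, donor 5150; exon 4 acceptor 7000.
observed = [(1000, 3000), (3100, 5000), (1000, 5000), (5150, 7000)]
outlier_junction = (1000, 3000)  # exon 1 donor spliced to the novel exon 2 acceptor

case1 = passes_rare_variant_filter(outlier_junction, observed, [1100])  # near exon 1 donor
case2 = passes_rare_variant_filter(outlier_junction, observed, [2900])  # near exon 2 acceptor
case3 = passes_rare_variant_filter(outlier_junction, observed, [3300])  # near exon 2 donor only
case4 = passes_rare_variant_filter(outlier_junction, observed, [4900])  # near exon 3 acceptor
```

Under this reading, cases 1, 2 and 4 pass and case 3 fails, because the donor of exon 2 shares no junction with either splice site of the outlier junction.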

Extended Data Fig. 4 Performance with different filters.

Precision-recall curve comparing the overall prediction performance on all GTEx tissues of SpliceAI, SpliceAI using SpliceMap annotation, SpliceAI using SpliceMap annotation along with quantitative reference levels of splicing, MMSplice using GENCODE annotation, MMSplice using SpliceMap annotation, MMSplice using SpliceMap annotation along with quantitative reference levels of splicing, and the integrative model AbSplice-DNA, using different filters for aberrantly spliced genes. a, Filter 1: FRASER default cutoffs (|ΔΨ| > 0.3, FDR < 0.05; 126,308 aberrant events). b, Filter 2: same as a, but restricting to genes that are aberrantly spliced in at least two different tissues from the same individual (32,886 aberrant events). c, Filter 3: same as a, but restricting to genes that have a rare variant within 250 bp of the splice sites (22,766 aberrant events). While the results are best with Filter 3, the relative improvements in precision at the same recall between the methods are the same as with Filter 2. In particular, restricting to variants within 250 bp of any detected split read boundary (Filter 3) did not bias our analysis in favor of the splice-site-centric method MMSplice over SpliceAI. d, After applying Filter 3, outliers were stratified into ‘replicated’ (14,030 aberrant events), that is, appearing in at least two different tissues of the same individual, and ‘not replicated’ (8,736 aberrant events). All models showed a significantly higher performance for aberrant splicing events replicated in two or more samples compared to those reported in a single sample only.
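The three filters can be summarized in a short Python sketch. The record type `OutlierCall` and its field names are hypothetical, the example values are invented, and the rare variant check is reduced to a single precomputed distance, which is simpler than the junction-based rule of Extended Data Fig. 3.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import List, Optional, Sequence, Set, Tuple


@dataclass(frozen=True)
class OutlierCall:
    individual: str
    tissue: str
    gene: str
    delta_psi: float
    fdr: float
    nearest_rare_variant_bp: Optional[int]  # None if the individual has no rare variant nearby


def filter1(calls: Sequence[OutlierCall], min_abs_delta_psi: float = 0.3,
            max_fdr: float = 0.05) -> List[OutlierCall]:
    """Filter 1: default cutoffs on effect size and significance."""
    return [c for c in calls if abs(c.delta_psi) > min_abs_delta_psi and c.fdr < max_fdr]


def replicated_keys(calls: Sequence[OutlierCall], min_tissues: int = 2) -> Set[Tuple[str, str]]:
    """(individual, gene) pairs called in at least min_tissues different tissues."""
    tissues = defaultdict(set)
    for c in calls:
        tissues[(c.individual, c.gene)].add(c.tissue)
    return {key for key, seen in tissues.items() if len(seen) >= min_tissues}


def filter2(calls: Sequence[OutlierCall]) -> List[OutlierCall]:
    """Filter 2: Filter 1 plus replication across tissues of the same individual."""
    passed = filter1(calls)
    keys = replicated_keys(passed)
    return [c for c in passed if (c.individual, c.gene) in keys]


def filter3(calls: Sequence[OutlierCall], max_dist: int = 250) -> List[OutlierCall]:
    """Filter 3: Filter 1 plus a rare variant within max_dist bp of the splice sites."""
    return [c for c in filter1(calls)
            if c.nearest_rare_variant_bp is not None and c.nearest_rare_variant_bp <= max_dist]


example_calls = [
    OutlierCall("ind1", "blood", "GENE_A", 0.45, 0.01, 120),
    OutlierCall("ind1", "skin", "GENE_A", -0.50, 0.02, 120),
    OutlierCall("ind1", "blood", "GENE_B", 0.35, 0.04, None),
    OutlierCall("ind2", "liver", "GENE_A", 0.20, 0.001, 30),   # fails the effect size cutoff
    OutlierCall("ind2", "lung", "GENE_C", 0.60, 0.20, 10),     # fails the FDR cutoff
    OutlierCall("ind2", "lung", "GENE_D", 0.40, 0.03, 400),    # rare variant too far away
]
```

Each stricter filter starts from the Filter 1 calls, which mirrors the caption's wording "same as a, but restricting to".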

Extended Data Fig. 5 Variant scoring of SpliceAI, MMSplice, MMSplice + SpliceMap and AbSplice-DNA.

a, A gene model with 3 annotated exons in the standard annotation (1, 3 and 4) and 3 exons detected by SpliceMap (1, 2 and 4). SpliceAI computes scores for every bp in a 50 bp window around a variant (shown as a red star) and reports the maximum score, independent of the distance to a junction. MMSplice provides a score in a 100 bp window around a variant as long as there is a junction in that window. b, Case with a variant within 100 bp of an annotated junction in SpliceMap, but further than 100 bp from any exon in the standard annotation. MMSplice + SpliceMap is able to score the variant, while MMSplice is not. c, Case with a variant within 100 bp of an annotated exon in the standard annotation, but further than 100 bp from any exon in SpliceMap. Therefore, MMSplice is able to score the variant, while MMSplice + SpliceMap is not. d, The variant is not within 100 bp of any annotated junction in the standard annotation or SpliceMap. Therefore, neither MMSplice nor MMSplice + SpliceMap can score the variant. However, SpliceAI is always able to score a variant. Consequently, AbSplice is always able to score a variant.
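The coverage logic of panels b to d can be written down as a small Python sketch. It only models which method can produce a score, not the scores themselves; the function names and all coordinates are hypothetical, and the 100 bp window is taken to mean a distance of at most 100 bp between variant and splice site.

```python
from typing import Dict, Iterable

MMSPLICE_WINDOW = 100  # bp, as stated in the caption


def near_junction(variant_pos: int, splice_sites: Iterable[int],
                  window: int = MMSPLICE_WINDOW) -> bool:
    """True if any annotated splice site lies within `window` bp of the variant."""
    return any(abs(variant_pos - site) <= window for site in splice_sites)


def scorable_by(variant_pos: int, standard_sites: Iterable[int],
                splicemap_sites: Iterable[int]) -> Dict[str, bool]:
    """Which methods can score the variant, following panels b to d."""
    return {
        "SpliceAI": True,  # needs no annotated junction
        "MMSplice": near_junction(variant_pos, standard_sites),
        "MMSplice + SpliceMap": near_junction(variant_pos, splicemap_sites),
        "AbSplice": True,  # falls back on SpliceAI, so always scorable
    }


# Invented splice-site coordinates: exons 1, 3, 4 (standard) and exons 1, 2, 4 (SpliceMap).
standard = [1000, 5000, 5150, 7000]
splicemap = [1000, 3000, 3100, 7000]

panel_b = scorable_by(3050, standard, splicemap)  # close to a SpliceMap-only junction
panel_c = scorable_by(5050, standard, splicemap)  # close to a standard-only junction
panel_d = scorable_by(4000, standard, splicemap)  # far from every annotated junction
```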

Extended Data Fig. 6 Comparison of annotated splice-sites in SpliceMap and GENCODE.

Number of introns, acceptor sites, and donor sites annotated in GENCODE and the SpliceMap of each GTEx tissue (first row), GENCODE only (second row) and SpliceMap only (third row).

Extended Data Fig. 7 The variant effect depends on the reference isoform proportion.

a, Ψ against Δlogit(Ψ) showing the non-linear splicing scaling law. The mutation effect of a variant can lead to different changes in Ψ in natural scale, depending on the reference splicing level of the intron. For example, the same variant can lead to a large change in Ψ if Ψref is initially at an intermediate level and almost no change if Ψref is initially at an extreme value (here low). b, Distribution of Ψref in SpliceMap. Most of the introns are not alternatively spliced, so the reference level of those introns is either 0 or 1. c, Cumulative distribution function of the maximum difference of Ψref (defined as: max(Ψref) - min(Ψref)) across tissues per intron. d, Heatmap of the Ψref of the most variable introns (defined as: max(Ψref) - min(Ψref) > 0.3) across tissues.
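The scaling law in panel a can be made concrete with a short numerical sketch in Python. It assumes that a variant effect is additive on the logit scale, so that Ψalt = sigmoid(logit(Ψref) + Δlogit(Ψ)); the function names and the clipping constant `eps` are illustrative choices, not taken from the study.

```python
import math
from typing import Dict


def logit(p: float, eps: float = 1e-5) -> float:
    """Logit with clipping so that reference levels of exactly 0 or 1 stay finite."""
    p = min(max(p, eps), 1.0 - eps)
    return math.log(p / (1.0 - p))


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def psi_alt(psi_ref: float, delta_logit_psi: float) -> float:
    """Splicing level after applying a variant effect on the logit scale."""
    return sigmoid(logit(psi_ref) + delta_logit_psi)


def delta_psi(psi_ref: float, delta_logit_psi: float) -> float:
    """Change in natural scale caused by the same logit-scale effect."""
    return psi_alt(psi_ref, delta_logit_psi) - psi_ref


def psi_ref_range(psi_ref_by_tissue: Dict[str, float]) -> float:
    """max(Ψref) - min(Ψref) across tissues for one intron (panels c and d)."""
    values = list(psi_ref_by_tissue.values())
    return max(values) - min(values)


effect = -2.0                                # the same variant effect in both settings
change_intermediate = delta_psi(0.5, effect)  # about -0.38: large change
change_extreme = delta_psi(0.01, effect)      # about -0.009: almost no change
variable_intron = psi_ref_range({"brain": 0.9, "liver": 0.4}) > 0.3
```

The same logit-scale effect therefore moves Ψ by roughly 0.38 when Ψref is 0.5 but by less than 0.01 when Ψref is 0.01, which is why tissue-specific reference levels matter for predicting |ΔΨ|.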

Extended Data Fig. 8 Calibration of AbSplice-DNA.

a, Histogram of AbSplice-DNA scores for gene, sample, tissue combinations that do not contain an aberrant splicing event. The dashed red line indicates the median. b, Histogram of AbSplice-DNA scores for gene, sample, tissue combinations that contain an aberrant splicing event. The peak at logit(AbSplice-DNA) ≈ −3.1 corresponds to AbSplice-DNA scores that are low due to small SpliceAI and MMSplice scores, but with an expressed splice site as annotated in SpliceMap. The peak at logit(AbSplice-DNA) ≈ −4.3 corresponds to small SpliceAI and MMSplice scores with an unused splice site as annotated in SpliceMap. c, Odds of aberrant splicing events as a function of logit-transformed AbSplice-DNA scores (binned in bins of width 0.1). The line represents the diagonal. Note the linear relationship (especially in the high AbSplice-DNA score region) and the (extrapolated) intersection at an AbSplice-DNA score of 0.5 (logit(AbSplice-DNA) = 0), corresponding to odds of 1 (log odds of 0), indicating a well-calibrated model.
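The calibration check of panel c can be sketched in Python as follows. This is an assumed reconstruction of the procedure from the caption (bins of width 0.1 on the logit scale, empirical odds per bin), with hypothetical function names; scores must lie strictly between 0 and 1, and bins lacking either class are skipped because their odds are undefined.

```python
import math
from collections import defaultdict
from typing import List, Sequence, Tuple


def logit(p: float) -> float:
    return math.log(p / (1.0 - p))


def empirical_log_odds_by_bin(scores: Sequence[float], labels: Sequence[int],
                              bin_width: float = 0.1) -> List[Tuple[float, float]]:
    """Return (bin center on the logit scale, empirical log odds) for each usable bin."""
    counts = defaultdict(lambda: [0, 0])  # bin index -> [negatives, positives]
    for score, label in zip(scores, labels):
        index = math.floor(logit(score) / bin_width)
        counts[index][label] += 1
    curve = []
    for index in sorted(counts):
        negatives, positives = counts[index]
        if negatives > 0 and positives > 0:
            center = (index + 0.5) * bin_width
            curve.append((center, math.log(positives / negatives)))
    return curve


# Toy data: a score of 0.5 with equally many positives and negatives,
# and a score of 0.75 with three positives per negative.
curve_half = empirical_log_odds_by_bin([0.5] * 100, [1] * 50 + [0] * 50)
curve_three_quarters = empirical_log_odds_by_bin([0.75] * 4, [1, 1, 1, 0])
```

For a well-calibrated model the returned points fall close to the diagonal, that is, the empirical log odds in each bin are close to the bin center.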

Extended Data Fig. 9 Performance analysis of additional state-of-the-art models and AbSplice-DNA trained with different model methods.

a, Precision-recall performance of CADD-Splice, SQUIRLS, MTSplice, MMSplice and SpliceAI. b, Distribution of the area under the precision-recall curve (auPRC) across all GTEx tissues (n = 49) of the AbSplice-DNA models trained with varying feature sets using the models in a, that is ‘AbSplice-DNA (+ CADD-Splice)’ additionally used CADD-Splice scores during training. Center line, median; box limits, first and third quartiles; whiskers span all data within 1.5 interquartile ranges of the lower and upper quartiles. Shown in red is the AbSplice-DNA model used in the manuscript. Models are sorted by auPRC. P-values were computed using the paired two-sided Wilcoxon test. c-d, AbSplice-DNA was trained using a generalized additive model (GAM), random forest and logistic regression. AbSplice-DNA with GAM is the one used in the manuscript. c, Precision-recall curve across all GTEx tissues. d, Distribution of the area under the precision-recall curve of the models in c across tissues (n = 49). Center line, median; box limits, first and third quartiles; whiskers span all data within 1.5 interquartile ranges of the lower and upper quartiles.

Extended Data Fig. 10 RNA-based predictions from CAT improve DNA-based scores.

a, Precision-recall curves comparing the overall prediction performance on non-accessible GTEx tissues using the gene-level FRASER P values from the CAT, AbSplice-RNA trained on a single CAT and AbSplice-DNA. Each panel shows a different CAT and the number of matching samples in the non-accessible tissues. b, Same as a, but for samples having RNA-seq from both blood and fibroblasts. AbSplice-RNA (all CATs) was trained using RNA-seq data from blood, fibroblasts and lymphocytes. Note that AbSplice-RNA (fibroblasts) gave a similar performance to AbSplice-RNA (all CATs). We did not restrict the samples to the ones also having lymphocytes as this would result in a low number of samples (N = 2,258). c, Model performance for genes not expressed or expressed in fibroblasts, a clinically accessible tissue. The cutoff for calling a gene expressed was TPM > 1 (transcripts per million). AbSplice-RNA improves for genes expressed in fibroblasts and remains on par with AbSplice-DNA for genes not expressed in fibroblasts.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Wagner, N., Çelik, M.H., Hölzlwimmer, F.R. et al. Aberrant splicing prediction across human tissues. Nat Genet 55, 861–870 (2023). https://doi.org/10.1038/s41588-023-01373-3

  • DOI: https://doi.org/10.1038/s41588-023-01373-3
