  • Brief Communication

Quantifying batch effects for individual genes in single-cell data

A preprint version of the article is available at Research Square.

Abstract

Batch effects substantially impede the comparison of multiple single-cell experiment batches. Existing methods for batch effect removal and quantification primarily emphasize cell alignment across batches, often overlooking gene-level batch effects. Here we introduce group technical effects (GTE)—a quantitative metric to assess batch effects on individual genes. Using GTE, we show that batch effects unevenly impact genes within the dataset. A portion of highly batch-sensitive genes (HBGs) differ between datasets and dominate the batch effects, whereas non-HBGs exhibit low batch effects. We demonstrate that as few as three HBGs are sufficient to introduce substantial batch effects. Our method also enables the assessment of cell-level batch effects, outperforming existing batch effect quantification methods. We also observe that biologically similar cell types undergo similar batch effects, informing the development of data integration strategies. The GTE method is versatile and applicable to various single-cell omics data types.

Fig. 1: Quantification of batch effects for individual genes.
Fig. 2: Batch effect removal guided by GTE.

Data availability

All datasets used in this paper are publicly available. Specifically, the mouse MOp dataset can be accessed via the CELLxGENE portal at https://cellxgene.cziscience.com/collections/ae1420fe-6630-46ed-8b3d-cc6056a66467. The mouse retina and human cell line datasets are available at https://github.com/JinmiaoChenLab/Batch-effect-removal-benchmarking/tree/master/Data. The human cortical development and human PBMCs (CITE-seq) datasets can be accessed from the Gene Expression Omnibus (GEO) under accession codes GSE168408 and GSE156473, respectively. The human cortex dataset is available from https://cellxgene.cziscience.com/collections/35928d1c-36fc-4f93-9a8d-0b921ab41745. The human MTG dataset can be accessed via the Allen Brain Map portal at https://portal.brain-map.org/atlases-and-data/rnaseq. The human heart dataset is available from the Heart Cell Atlas project at https://www.heartcellatlas.org. The human PBMCs (scRNA-seq) dataset is accessible at https://github.com/satijalab/seurat-data (ref. 33). The TCGA READ bulk RNA-seq dataset can be accessed through Zenodo at https://doi.org/10.5281/zenodo.6392171 (ref. 34). The mouse brain scATAC-seq datasets (peak and gene activity versions) are available at https://doi.org/10.6084/m9.figshare.12420968.v8 (ref. 35). The mouse cell line proteomics dataset can be accessed via https://scproteomicsdb.com. Refer to Supplementary Table 2 for further details of the datasets. Processed datasets used for the analyses have been deposited to Zenodo at https://doi.org/10.5281/zenodo.13358933 (ref. 36). Source data are provided with this paper.

Code availability

The code and source data used to generate the results in this paper are available on GitHub (https://github.com/yzhou1999/GTEs; ref. 37) and Zenodo (https://doi.org/10.5281/zenodo.15412860; ref. 38).

References

  1. Youden, W. J. Enduring values. Technometrics 14, 1–11 (1972).

  2. Lander, E. S. Array of hope. Nat. Genet. 21, 3–4 (1999).

  3. Akey, J. M. et al. On the design and analysis of gene expression studies in human populations. Nat. Genet. 39, 807–808 (2007).

  4. Leek, J. T. et al. Tackling the widespread and critical impact of batch effects in high-throughput data. Nat. Rev. Genet. 11, 733–739 (2010).

  5. Johnson, W. E., Li, C. & Rabinovic, A. Adjusting batch effects in microarray expression data using empirical Bayes methods. Biostatistics 8, 118–127 (2007).

  6. Smyth, G. K. & Speed, T. Normalization of cDNA microarray data. Methods 31, 265–273 (2003).

  7. Haghverdi, L. et al. Batch effects in single-cell RNA-sequencing data are corrected by matching mutual nearest neighbors. Nat. Biotechnol. 36, 421–427 (2018).

  8. Butler, A. et al. Integrating single-cell transcriptomic data across different conditions, technologies, and species. Nat. Biotechnol. 36, 411–420 (2018).

  9. Lopez, R. et al. Deep generative modeling for single-cell transcriptomics. Nat. Methods 15, 1053–1058 (2018).

  10. Xu, C. et al. Probabilistic harmonization and annotation of single-cell transcriptomics data with deep generative models. Mol. Syst. Biol. 17, e9620 (2021).

  11. De Donno, C. et al. Population-level integration of single-cell datasets enables multi-scale analysis across samples. Nat. Methods 20, 1683–1692 (2023).

  12. Yao, Z. et al. A transcriptomic and epigenomic cell atlas of the mouse primary motor cortex. Nature 598, 103–110 (2021).

  13. Büttner, M. et al. A test metric for assessing single-cell RNA-seq batch correction. Nat. Methods 16, 43–49 (2019).

  14. Korsunsky, I. et al. Fast, sensitive and accurate integration of single-cell data with Harmony. Nat. Methods 16, 1289–1296 (2019).

  15. Luecken, M. D. et al. Benchmarking atlas-level data integration in single-cell genomics. Nat. Methods 19, 41–50 (2022).

  16. Luecken, M. D. & Theis, F. J. Current best practices in single-cell RNA-seq analysis: a tutorial. Mol. Syst. Biol. 15, e8746 (2019).

  17. Subramanian, A. et al. Biology-inspired data-driven quality control for scientific discovery in single-cell transcriptomics. Genome Biol. 23, 267 (2022).

  18. Ilicic, T. et al. Classification of low quality cells from single-cell RNA-seq data. Genome Biol. 17, 29 (2016).

  19. CZI Cell Science Program et al. CZ CELLxGENE Discover: a single-cell data platform for scalable exploration, analysis and modeling of aggregated data. Nucleic Acids Res. 53, D886–D900 (2024).

  20. Stuart, T. et al. Comprehensive integration of single-cell data. Cell 177, 1888–1902.e21 (2019).

  21. Chazarra-Gil, R. et al. Flexible comparison of batch correction methods for single-cell RNA-seq using BatchBench. Nucleic Acids Res. 49, e42 (2021).

  22. Lütge, A. et al. CellMixS: quantifying and visualizing batch effects in single-cell RNA-seq data. Life Sci. Alliance 4, e202001004 (2021).

  23. Zappia, L., Phipson, B. & Oshlack, A. Splatter: simulation of single-cell RNA sequencing data. Genome Biol. 18, 174 (2017).

  24. Kang, H. M. et al. Multiplexed droplet single-cell RNA-sequencing using natural genetic variation. Nat. Biotechnol. 36, 89–94 (2018).

  25. Molania, R. et al. Removing unwanted variation from large-scale RNA sequencing data with PRPS. Nat. Biotechnol. 41, 82–95 (2023).

  26. Leduc, A. et al. Exploring functional protein covariation across single cells using nPOP. Genome Biol. 23, 261 (2022).

  27. Derks, J. et al. Increasing the throughput of sensitive proteomics by plexDIA. Nat. Biotechnol. 41, 50–59 (2023).

  28. Mimitou, E. P. et al. Scalable, multimodal profiling of chromatin accessibility, gene expression and protein levels in single cells. Nat. Biotechnol. 39, 1246–1258 (2021).

  29. McCarthy, D. J. et al. Scater: pre-processing, quality control, normalization and visualization of single-cell RNA-seq data in R. Bioinformatics 33, 1179–1186 (2017).

  30. Hao, Y. et al. Integrated analysis of multimodal single-cell data. Cell 184, 3573–3587.e29 (2021).

  31. Becht, E. et al. Dimensionality reduction for visualizing single-cell data using UMAP. Nat. Biotechnol. 37, 38–44 (2019).

  32. Yu, G. et al. ClusterProfiler: an R package for comparing biological themes among gene clusters. Omics 16, 284–287 (2012).

  33. Satija, R. et al. seurat-data. GitHub https://github.com/satijalab/seurat-data (2025).

  34. Molania, R. Vignettes: removing unwanted variation from TCGA RNA-seq data. Zenodo https://doi.org/10.5281/zenodo.6392171 (2025).

  35. Luecken, M. et al. Benchmarking atlas-level data integration in single-cell genomics—integration task datasets. figshare https://doi.org/10.6084/m9.figshare.12420968.v8 (2022).

  36. Zhou, Y., Sheng, Q., Wang, G., Xu, L. & Jin, S. Quantifying batch effects for individual genes in single-cell data. Zenodo https://doi.org/10.5281/zenodo.13358933 (2024).

  37. Zhou, Y. GTEs. GitHub https://github.com/yzhou1999/GTEs (2025).

  38. Zhou, Y. GTEs R package. Zenodo https://doi.org/10.5281/zenodo.15412860 (2025).

  39. Cusanovich, D. A. et al. A single-cell atlas of in vivo mammalian chromatin accessibility. Cell 174, 1309–1324.e18 (2018).

  40. Fang, R. et al. Comprehensive analysis of single cell ATAC-seq data with SnapATAC. Nat. Commun. 12, 1337 (2021).

Acknowledgements

This work was supported by the National Natural Science Foundation of China (grant no. 62271173 to S.J., no. 62172122 to L.X. and no. 124B2027 to Y.Z.), the Key Research and Development Program of Heilongjiang (grant no. 2022ZX01A19 to S.J.), the Natural Science Foundation of Heilongjiang Province, China (grant no. JQ2023A003 to S.J.), and the Fundamental Research Funds for the Central Universities (grant no. HIT.DZJJ.2023133 to Q.S. and no. HIT.DZJJ.2024043 to Y.Z.).

Author information

Contributions

S.J. supervised the study. Y.Z. conceived and developed the method, and designed the analysis. Y.Z. and Q.S. performed the analysis. G.W. and L.X. checked the analysis results. Y.Z., Q.S., G.W., L.X. and S.J. wrote the paper. All authors read and approved the final paper.

Corresponding authors

Correspondence to Li Xu or Shuilin Jin.

Ethics declarations

Competing interests

The authors declare no competing interests.

Peer review

Peer review information

Nature Computational Science thanks Lachlan Coin, Debashis Ghosh and the other, anonymous, reviewer(s) for their contribution to the peer review of this work. Peer reviewer reports are available. Primary Handling Editor: Ananya Rastogi, in collaboration with the Nature Computational Science team.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary information

Supplementary Information

Supplementary Notes 1–3, Algorithm 1 and Figs. 1–22.

Reporting Summary

Peer Review File

Supplementary Table 1

Identified common HBGs and non-HBGs.

Supplementary Table 2

Details of datasets used in the manuscript.

Source data

Source Data Fig. 1

The numerical Source data for Fig. 1.

Source Data Fig. 2

The numerical Source data for Fig. 2.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Zhou, Y., Sheng, Q., Wang, G. et al. Quantifying batch effects for individual genes in single-cell data. Nat Comput Sci (2025). https://doi.org/10.1038/s43588-025-00824-7

  • DOI: https://doi.org/10.1038/s43588-025-00824-7
