Abstract
Batch effects substantially impede the comparison of multiple single-cell experiment batches. Existing methods for batch effect removal and quantification primarily emphasize cell alignment across batches, often overlooking gene-level batch effects. Here we introduce group technical effects (GTE), a quantitative metric to assess batch effects on individual genes. Using GTE, we show that batch effects impact genes within a dataset unevenly. A portion of highly batch-sensitive genes (HBGs) differ between datasets and dominate the batch effects, whereas non-HBGs exhibit low batch effects. We demonstrate that as few as three HBGs are sufficient to introduce substantial batch effects. Our method also enables the assessment of cell-level batch effects, outperforming existing batch effect quantification methods. In addition, we observe that biologically similar cell types undergo similar batch effects, informing the development of data integration strategies. The GTE method is versatile and applicable to various single-cell omics data types.
Data availability
All datasets used in this paper are publicly available. Specifically, the mouse MOp dataset can be accessed via the CELLxGENE portal at https://cellxgene.cziscience.com/collections/ae1420fe-6630-46ed-8b3d-cc6056a66467. The mouse retina and human cell line datasets are available at https://github.com/JinmiaoChenLab/Batch-effect-removal-benchmarking/tree/master/Data. The human cortical development and human PBMCs (CITE-seq) datasets can be accessed from the Gene Expression Omnibus (GEO) under accession codes GSE168408 and GSE156473, respectively. The human cortex dataset is available from https://cellxgene.cziscience.com/collections/35928d1c-36fc-4f93-9a8d-0b921ab41745. The human MTG dataset can be accessed via the Allen Brain Map portal at https://portal.brain-map.org/atlases-and-data/rnaseq. The human heart dataset is available from the Heart Cell Atlas project at https://www.heartcellatlas.org. The human PBMCs (scRNA-seq) dataset is accessible at https://github.com/satijalab/seurat-data (ref. 33). The TCGA READ bulk RNA-seq dataset can be accessed through Zenodo at https://doi.org/10.5281/zenodo.6392171 (ref. 34). The mouse brain scATAC-seq datasets (peak and gene activity versions) are available at https://doi.org/10.6084/m9.figshare.12420968.v8 (ref. 35). The mouse cell line proteomics dataset can be accessed via https://scproteomicsdb.com. Refer to Supplementary Table 2 for further details of the datasets. Processed datasets used for the analyses have been deposited to Zenodo at https://doi.org/10.5281/zenodo.13358933 (ref. 36). Source data are provided with this paper.
Code availability
The code and source data used to generate the results in this paper are available on GitHub (https://github.com/yzhou1999/GTEs; ref. 37) and on Zenodo (https://doi.org/10.5281/zenodo.15412860; ref. 38).
References
Youden, W. J. Enduring values. Technometrics 14, 1–11 (1972).
Lander, E. S. Array of hope. Nat. Genet. 21, 3–4 (1999).
Akey, J. M. et al. On the design and analysis of gene expression studies in human populations. Nat. Genet. 39, 807–808 (2007).
Leek, J. T. et al. Tackling the widespread and critical impact of batch effects in high-throughput data. Nat. Rev. Genet. 11, 733–739 (2010).
Johnson, W. E., Li, C. & Rabinovic, A. Adjusting batch effects in microarray expression data using empirical Bayes methods. Biostatistics 8, 118–127 (2007).
Smyth, G. K. & Speed, T. Normalization of cDNA microarray data. Methods 31, 265–273 (2003).
Haghverdi, L. et al. Batch effects in single-cell RNA-sequencing data are corrected by matching mutual nearest neighbors. Nat. Biotechnol. 36, 421–427 (2018).
Butler, A. et al. Integrating single-cell transcriptomic data across different conditions, technologies, and species. Nat. Biotechnol. 36, 411–420 (2018).
Lopez, R. et al. Deep generative modeling for single-cell transcriptomics. Nat. Methods 15, 1053–1058 (2018).
Xu, C. et al. Probabilistic harmonization and annotation of single‐cell transcriptomics data with deep generative models. Mol. Syst. Biol. 17, e9620 (2021).
De Donno, C. et al. Population-level integration of single-cell datasets enables multi-scale analysis across samples. Nat. Methods 20, 1683–1692 (2023).
Yao, Z. et al. A transcriptomic and epigenomic cell atlas of the mouse primary motor cortex. Nature 598, 103–110 (2021).
Büttner, M. et al. A test metric for assessing single-cell RNA-seq batch correction. Nat. Methods 16, 43–49 (2019).
Korsunsky, I. et al. Fast, sensitive and accurate integration of single-cell data with Harmony. Nat. Methods 16, 1289–1296 (2019).
Luecken, M. D. et al. Benchmarking atlas-level data integration in single-cell genomics. Nat. Methods 19, 41–50 (2022).
Luecken, M. D. & Theis, F. J. Current best practices in single‐cell RNA‐seq analysis: a tutorial. Mol. Syst. Biol. 15, e8746 (2019).
Subramanian, A. et al. Biology-inspired data-driven quality control for scientific discovery in single-cell transcriptomics. Genome Biol. 23, 267 (2022).
Ilicic, T. et al. Classification of low quality cells from single-cell RNA-seq data. Genome Biol. 17, 29 (2016).
CZI Cell Science Program et al. CZ CELLxGENE Discover: a single-cell data platform for scalable exploration, analysis and modeling of aggregated data. Nucleic Acids Res. 53, D886–D900 (2024).
Stuart, T. et al. Comprehensive integration of single-cell data. Cell 177, 1888–1902.e21 (2019).
Chazarra-Gil, R. et al. Flexible comparison of batch correction methods for single-cell RNA-seq using BatchBench. Nucleic Acids Res. 49, e42 (2021).
Lütge, A. et al. CellMixS: quantifying and visualizing batch effects in single-cell RNA-seq data. Life Sci. Alliance 4, e202001004 (2021).
Zappia, L., Phipson, B. & Oshlack, A. Splatter: simulation of single-cell RNA sequencing data. Genome Biol. 18, 174 (2017).
Kang, H. M. et al. Multiplexed droplet single-cell RNA-sequencing using natural genetic variation. Nat. Biotechnol. 36, 89–94 (2018).
Molania, R. et al. Removing unwanted variation from large-scale RNA sequencing data with PRPS. Nat. Biotechnol. 41, 82–95 (2023).
Leduc, A. et al. Exploring functional protein covariation across single cells using nPOP. Genome Biol. 23, 261 (2022).
Derks, J. et al. Increasing the throughput of sensitive proteomics by plexDIA. Nat. Biotechnol. 41, 50–59 (2023).
Mimitou, E. P. et al. Scalable, multimodal profiling of chromatin accessibility, gene expression and protein levels in single cells. Nat. Biotechnol. 39, 1246–1258 (2021).
McCarthy, D. J. et al. Scater: pre-processing, quality control, normalization and visualization of single-cell RNA-seq data in R. Bioinformatics 33, 1179–1186 (2017).
Hao, Y. et al. Integrated analysis of multimodal single-cell data. Cell 184, 3573–3587.e29 (2021).
Becht, E. et al. Dimensionality reduction for visualizing single-cell data using UMAP. Nat. Biotechnol. 37, 38–44 (2019).
Yu, G. et al. ClusterProfiler: an R package for comparing biological themes among gene clusters. Omics 16, 284–287 (2012).
Satija, R. et al. seurat-data. GitHub https://github.com/satijalab/seurat-data (2025).
Molania, R. Vignettes: removing unwanted variation from TCGA RNA-seq data. Zenodo https://doi.org/10.5281/zenodo.6392171 (2025).
Luecken, M. et al. Benchmarking atlas-level data integration in single-cell genomics—integration task datasets. figshare https://doi.org/10.6084/m9.figshare.12420968.v8 (2022).
Zhou, Y., Sheng, Q., Wang, G., Xu, L. & Jin, S. Quantifying batch effects for individual genes in single-cell data. Zenodo https://doi.org/10.5281/zenodo.13358933 (2024).
Zhou, Y. GTEs. GitHub https://github.com/yzhou1999/GTEs (2025).
Zhou, Y. GTEs R package. Zenodo https://doi.org/10.5281/zenodo.15412860 (2025).
Cusanovich, D. A. et al. A single-cell atlas of in vivo mammalian chromatin accessibility. Cell 174, 1309–1324.e18 (2018).
Fang, R. et al. Comprehensive analysis of single cell ATAC-seq data with SnapATAC. Nat. Commun. 12, 1337 (2021).
Acknowledgements
This work was supported by the National Natural Science Foundation of China (grant no. 62271173 to S.J., no. 62172122 to L.X. and no. 124B2027 to Y.Z.), the Key Research and Development Program of Heilongjiang (grant no. 2022ZX01A19 to S.J.), the Natural Science Foundation of Heilongjiang Province, China (grant no. JQ2023A003 to S.J.), and the Fundamental Research Funds for the Central Universities (grant no. HIT.DZJJ.2023133 to Q.S. and no. HIT.DZJJ.2024043 to Y.Z.).
Author information
Contributions
S.J. supervised the study. Y.Z. conceived and developed the method, and designed the analysis. Y.Z. and Q.S. performed the analysis. G.W. and L.X. checked the analysis results. Y.Z., Q.S., G.W., L.X. and S.J. wrote the paper. All authors read and approved the final paper.
Ethics declarations
Competing interests
The authors declare no competing interests.
Peer review
Peer review information
Nature Computational Science thanks Lachlan Coin, Debashis Ghosh and the other, anonymous, reviewer(s) for their contribution to the peer review of this work. Peer reviewer reports are available. Primary Handling Editor: Ananya Rastogi, in collaboration with the Nature Computational Science team.
Additional information
Publisher’s note: Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Supplementary information
Supplementary Information
Supplementary Notes 1–3, Algorithm 1 and Figs. 1–22.
Supplementary Table 1
Identified common HBGs and non-HBGs.
Supplementary Table 2
Details of datasets used in the manuscript.
Source data
Source Data Fig. 1
The numerical Source data for Fig. 1.
Source Data Fig. 2
The numerical Source data for Fig. 2.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Zhou, Y., Sheng, Q., Wang, G. et al. Quantifying batch effects for individual genes in single-cell data. Nat Comput Sci (2025). https://doi.org/10.1038/s43588-025-00824-7
DOI: https://doi.org/10.1038/s43588-025-00824-7