Abstract
Data integration to align cells across batches has become a cornerstone of single-cell data analysis, critically affecting downstream results. Currently, there are no guidelines for when the biological differences between samples are separable from batch effects. Here we show that current paradigms for single-cell data integration remove biologically meaningful variation and introduce distortion. We present a statistical model and computationally scalable algorithm, CellANOVA (cell state space analysis of variance), that harnesses experimental design to explicitly recover biological signals that are erased during single-cell data integration. CellANOVA uses a ‘pool-of-controls’ design concept, applicable across diverse settings, to separate unwanted variation from biological variation of interest and allow the recovery of subtle biological signals. We apply CellANOVA to diverse contexts and validate the recovered biological signals by orthogonal assays. In particular, we show that CellANOVA is effective in the challenging case of single-cell and single-nucleus data integration, where it recovers subtle biological signals that can be validated and replicated by external data.
This is a preview of subscription content, access via your institution
Access options
Access Nature and 54 other Nature Portfolio journals
Get Nature+, our best-value online-access subscription
27,99 € / 30 days
cancel any time
Subscribe to this journal
Receive 12 print issues and online access
209,00 € per year
only 17,42 € per issue
Buy this article
- Purchase on SpringerLink
- Instant access to full article PDF
Prices may be subject to local taxes which are calculated during checkout






Similar content being viewed by others
Data availability
The mouse radiation therapy dataset is available from the GEO under accession number GSE280883. The JAK inhibition and PD1 immunotherapy trial dataset was obtained from Mathew et al.30 and is availabe from the Gene Expression Omnibus (GEO) under accession number GSE266219. The T1D dataset was obtained from Fasolino et al.29 and is available from the GEO under accession number GSE148073. The human kidney multiomics atlas dataset was obtained from Abedini et al.31 and is available from the GEO under accession number GSE211785. The scRNA-seq data by Lake et al.36 can be downloaded from KPMP (https://www.kpmp.org.; DK133081, DK133091, DK133092, DK133093, DK133095, DK1330971, DK114866, DK114908, DK133090, DK133113, DK133766, DK133768, DK114907, DK114920, DK114923, DK114933 and DK114886). The NSCLC T cell dataset was obtained from Salcher et al.32 and downloaded from another study59.
Code availability
All code used in this study, including the CellANOVA software and the analysis code, can be found on GitHub (https://github.com/Janezjz/cellanova).
References
Hicks, S. C., Townes, F. W., Teng, M. & Irizarry, R. A. Missing data and technical variability in single-cell RNA-sequencing experiments. Biostatistics 19, 562–578 (2018).
Tung, P.-Y. et al. Batch effects and the effective design of single-cell gene expression studies. Sci. Rep. 7, 39921 (2017).
Luecken, M. D. & Theis, F. J. Current best practices in single-cell RNA-seq analysis: a tutorial. Mol. Syst. Biol. 15, e8746 (2019).
Zhang, Y. et al. Sample-multiplexing approaches for single-cell sequencing. Cell. Mol. Life Sci. 79, 466 (2022).
Kang, H. M. et al. Multiplexed droplet single-cell RNA-sequencing using natural genetic variation. Nat. Biotechnol. 36, 89–94 (2018).
Stoeckius, M. et al. Cell hashing with barcoded antibodies enables multiplexing and doublet detection for single cell genomics. Genome Biol. 19, 224 (2018).
Xu, J. et al. Genotype-free demultiplexing of pooled single-cell RNA-seq. Genome Biol. 20, 290 (2019).
Gehring, J., Hwee Park, J., Chen, S., Thomson, M. & Pachter, L. Highly multiplexed single-cell RNA-seq by DNA oligonucleotide tagging of cellular proteins. Nat. Biotechnol. 38, 35–38 (2020).
McGinnis, C. S. et al. MULTI-seq: sample multiplexing for single-cell RNA sequencing using lipid-tagged indices. Nat. Methods 16, 619–626 (2019).
Heaton, H. et al. Souporcell: robust clustering of single-cell RNA-seq data by genotype without reference genotypes. Nat. Methods 17, 615–620 (2020).
Haghverdi, L., Lun, A. T., Morgan, M. D. & Marioni, J. C. Batch effects in single-cell RNA-sequencing data are corrected by matching mutual nearest neighbors. Nat. Biotechnol. 36, 421–427 (2018).
Lopez, R., Regier, J., Cole, M. B., Jordan, M. I. & Yosef, N. Deep generative modeling for single-cell transcriptomics. Nat. Methods 15, 1053–1058 (2018).
Korsunsky, I. et al. Fast, sensitive and accurate integration of single-cell data with harmony. Nat. Methods 16, 1289–1296 (2019).
Welch, J. D. et al. Single-cell multi-omic integration compares and contrasts features of brain cell identity. Cell 177, 1873–1887 (2019).
Hie, B., Bryson, B. & Berger, B. Efficient integration of heterogeneous single-cell transcriptomes using Scanorama. Nat. Biotechnol. 37, 685–691 (2019).
Stuart, T. et al. Comprehensive integration of single-cell data. Cell 177, 1888–1902 (2019).
Hao, Y. et al. Integrated analysis of multimodal single-cell data. Cell 184, 3573–3587 (2021).
Lin, Y. et al. scMerge leverages factor analysis, stable expression, and pseudoreplication to merge multiple single-cell RNA-seq datasets. Proc. Natl Acad. Sci. USA 116, 9775–9784 (2019).
Huang, M., Zhang, Z. & Zhang, N. R. Dimension reduction and denoising of single-cell rna sequencing data in the presence of observed confounding variables. Preprint at bioRxiv https://doi.org/10.1101/2020.08.03.234765 (2020).
Song, F., Chan, G. M. A. & Wei, Y. Flexible experimental designs for valid single-cell RNA-sequencing experiments allowing batch effects correction. Nat. Commun. 11, 3274 (2020).
Luecken, M. D. et al. Benchmarking atlas-level data integration in single-cell genomics. Nat. Methods 19, 41–50 (2022).
Tran, H. T. N. et al. A benchmark of batch-effect correction methods for single-cell RNA sequencing data. Genome Biol. 21, 12 (2020).
Goh, W. W. B., Wang, W. & Wong, L. Why batch effects matter in omics data, and how to avoid them. Trends Biotechnol. 35, 498–507 (2017).
Leek, J. T. et al. Tackling the widespread and critical impact of batch effects in high-throughput data. Nat. Rev. Genet. 11, 733–739 (2010).
Molania, R. et al. Removing unwanted variation from large-scale RNA sequencing data with PRPS. Nat. Biotechnol. 41, 82–95 (2023).
Leek, J. T., Johnson, W. E., Parker, H. S., Jaffe, A. E. & Storey, J. D. The sva package for removing batch effects and other unwanted variation in high-throughput experiments. Bioinformatics 28, 882–883 (2012).
Sun, Y., Zhang, N. R. & Owen, A. B. Multiple hypothesis testing adjusted for latent variables, with an application to the AGEMAP gene expression data. Ann. Appl. Stat. 6, 1664–1688 (2012).
Risso, D., Ngai, J., Speed, T. P. & Dudoit, S. Normalization of RNA-seq data using factor analysis of control genes or samples. Nat. Biotechnol. 32, 896–902 (2014).
Fasolino, M. et al. Single-cell multi-omics analysis of human pancreatic islets reveals novel cellular states in type 1 diabetes. Nat. Metab. 4, 284–299 (2022).
Mathew, D. et al. Combined JAK inhibition and PD-1 immunotherapy for non-small cell lung cancer patients. Science 384, eadf1329 (2024).
Abedini, A. et al. Single-cell multi-omic and spatial profiling of human kidneys implicates the fibrotic microenvironment in kidney disease progression. Nat. Genet. 56, 1712–1724 (2024).
Salcher, S. et al. High-resolution single-cell atlas reveals diversity and plasticity of tissue-resident neutrophils in non-small cell lung cancer. Cancer Cell 40, 1503–1520 (2022).
Butler, A., Hoffman, P., Smibert, P., Papalexi, E. & Satija, R. Integrating single-cell transcriptomic data across different conditions, technologies, and species. Nat. Biotechnol. 36, 411–420 (2018).
Liu, J. et al. Jointly defining cell types from multiple single-cell datasets using LIGER. Nat. Protoc. 15, 3632–3662 (2020).
Kang, J. B. et al. Efficient and precise single-cell reference atlas mapping with Symphony. Nat. Commun. 12, 5890 (2021).
Lake, B. B. et al. An atlas of healthy and injured cell states and niches in the human kidney. Nature 619, 585–594 (2023).
Liberzon, A. et al. The Molecular Signatures Database (MSigDB) hallmark gene set collection. Cell Syst. 1, 417–425 (2015).
Schröder, M. & Kaufman, R. J. The mammalian unfolded protein response. Annu. Rev. Biochem. 74, 739–789 (2005).
Wang, J., Yang, X. & Zhang, J. Bridges between mitochondrial oxidative stress, ER stress and mTOR signaling in pancreatic β cells. Cell. Signal. 28, 1099–1104 (2016).
Sarbassov, D. D., Ali, S. M. & Sabatini, D. M. Growing roles for the mTOR pathway. Curr. Opin. Cell Biol. 17, 596–603 (2005).
Wang, X. & Proud, C. G. The mTOR pathway in the control of protein synthesis. Physiology 21, 362–369 (2006).
Thompson, E. B. The many roles of c-Myc in apoptosis. Ann. Rev. Physiol. 60, 575–600 (1998).
Dang, C. V. c-Myc target genes involved in cell growth, apoptosis, and metabolism. Mol. Cell. Biol. 19, 1–11 (1999).
Yates, F. The analysis of multiple classifications with unequal numbers in the different classes. J. Am. Stat. Assoc. 29, 51–66 (1934).
Fisher, R. A. Statistical Methods for Research Workers (Oliver and Boyd, 1970).
Mootha, V. K. et al. PGC-1α-responsive genes involved in oxidative phosphorylation are coordinately downregulated in human diabetes. Nat. Genet. 34, 267–273 (2003).
Subramanian, A. et al. Gene set enrichment analysis: a knowledge-based approach for interpreting genome-wide expression profiles. Proc. Natl Acad. Sci. USA 102, 15545–15550 (2005).
Wolf, F. A. et al. PAGA: graph abstraction reconciles clustering with trajectory inference through a topology preserving map of single cells. Genome Biol. 20, 59 (2019).
Haghverdi, L., Büttner, M., Wolf, F. A., Buettner, F. & Theis, F. J. Diffusion pseudotime robustly reconstructs lineage branching. Nat. Methods 13, 845–848 (2016).
Satija, R., Farrell, J. A., Gennert, D., Schier, A. F. & Regev, A. Spatial reconstruction of single-cell gene expression data. Nat. Biotechnol. 33, 495–502 (2015).
Xu, C. et al. Probabilistic harmonization and annotation of single-cell transcriptomics data with deep generative models. Mol. Syst. Biol. 17, e9620 (2021).
Lotfollahi, M., Wolf, F. A. & Theis, F. J. scGen predicts single-cell perturbation responses. Nat. Methods 16, 715–721 (2019).
Rousseeuw, P. J. Silhouettes: a graphical aid to the interpretation and validation of cluster analysis. J. Comput. Appl. Math. 20, 53–65 (1987).
Büttner, M., Miao, Z., Wolf, F. A., Teichmann, S. A. & Theis, F. J. A test metric for assessing single-cell RNA-seq batch correction. Nat. Methods 16, 43–49 (2019).
Cover, T. M. & Thomas, J. A. Elements of Information Theory 2nd edn (Wiley, 2006).
Hubert, L. & Arabie, P. Comparing partitions. J. Classif. 2, 193–218 (1985).
Pearson, K. On the criterion that a given system of deviations from the probable in the case of a correlated system of variables is such that it can be reasonably supposed to have arisen from random sampling. Philos. Mag. 50, 157–175 (1900).
Monaco, G. et al. RNA-seq signatures normalized by mRNA abundance allow absolute deconvolution of human immune cell types. Cell Rep. 26, 1627–1640 (2019).
Megill, C. et al. cellxgene: a performant, scalable exploration platform for high dimensional sparse matrices. Preprint at bioRxiv https://doi.org/10.1101/2021.04.05.438318 (2021).
Acknowledgements
We would like to acknowledge support from the National Science Foundation (NSF; DMS-2345215 and DMS-2210104 to Z.M. and DMS-2245575 to Z.M., Z.Z. and N.R.Z.), National Institutes of Health (R01-HG006137, U2C-CA233285, and 5R01GM125301 to N.R.Z., 5R01DK087635 to K.S. and TL1TR001880 to T.L.L.), Mark Foundation Center for Radiobiology and Immunology (to N.R.Z. and A.M.) and Parker Institute for Cancer Immunotherapy (to D.M.). The content is solely the responsibility of the authors and does not necessarily represent the official views of the funding organizations.
Author information
Authors and Affiliations
Contributions
Conceptualization, Z.Z., Z.M. and N.R.Z. Algorithm development, Z.Z., Z.M. and N.R.Z. Software implementation, Z.Z. Design of computational experiments, Z.Z., Z.M. and N.R.Z. Immunotherapy trial experiment, D.M. Mouse irradiation experiment, C.M.M. and T.L.L. Kidney disease experiment, K.S. Data analysis, Z.Z., with help from D.M., T.L.L., K.M., S.H., Z.M. and N.R.Z. Manuscript writing, Z.Z., Z.M. and N.R.Z., with feedback from T.L., D.M. and K.M. Supervision, A.J.M., E.J.W., K.S., Z.M. and N.R.Z.
Corresponding authors
Ethics declarations
Competing interests
The authors declare no competing interests.
Peer review
Peer review information
Nature Biotechnology thanks Jinmiao Chen and the other, anonymous, reviewer(s) for their contribution to the peer review of this work.
Additional information
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Extended data
Extended Data Fig. 1 Benchmarking of Harmony and CellANOVA integrations of single-cell samples measured with multiple technologies on the NSCLC atlas dataset.
Benchmarking of Harmony and CellANOVA integrations of single-cell samples measured with multiple technologies on the NSCLC atlas dataset32. Harmony is used in the first step of CellANOVA. (a) UMAP visualizations of uncorrected data, Harmony-integrated data, and CellANOVA-integrated data. Each sample was treated as a separate batch. UMAPs were colored by technology, sample origin, and cell type in the top, the middle, and the bottom row, respectively. (b) Comparisons of batch correction (left), gene-wise distortion (middle) and global distortion (right) between Harmony and CellANOVA (scRNA-seq data: 20000 cells, 26023 genes; snRNA-seq data: 20000 cells, 30899 genes.) Left: Boxplots of iLISI scores across cells of each hold-out sample based on uncorrected data, Harmony-integrated data, and CellANOVA-integrated data. In each box plot, the box covers the middle 50% of iLISI scores and the whiskers stop at the largest iLISI score value below the upper quartile plus 1.5 * IQR and the smallest iLISI score value above the lower quartile minus 1.5 * IQR. Middle: Comparisons of p-values obtained from DE analysis pre- and post-Harmony integration and pre- and post-CellANOVA integration. Red dashed line corresponds to equal p-values pre- and post-integration. P-values for DEG analysis are calculated using a two-sided Wilcoxon signed-rank test. Right: Violin plots of Pearson correlations between pre- and post-integration gene expressions per cell with Harmony integration (top) and with CellANOVA integration (bottom). (c) Signal preservation evaluation of batch-corrected embeddings by out-of-batch nearest neighbor proportion distributions. We focused on the combined group of ‘T cell CD8’, ‘T cell CD4’, and ‘T regulatory’ cells, considering sample origin as the primary biological condition of interest. For cells with each origin, we computed out-of-batch nearest neighbor proportions across all sample origins. With successful signal preservation, cells with the same biological condition should be enriched in out-of-batch nearest neighbors.
Extended Data Fig. 2 Top ten upregulated pathways the disease condition for each specific cell type after Harmony integration without CellANOVA recovery.
Top ten upregulated pathways identified within the disease condition for each specific cell type in the Abedini et al. data and the KPMP data after Harmony integration without CellANOVA recovery. P-values for DEG analysis are calculated using a two-sided Wilcoxon signed-rank test.
Extended Data Fig. 3 Additional evaluation of CellANVOA in multimodal data integration.
Left: Scatter plots of epithelial-mesenchymal transition (EMT) pathway activity versus fibroblast ratio, with p-values obtained from a two-sided t-test of slope in linear regression. Middle: Epithelial-mesenchymal transition pathway activity. Right: Abundance of fibroblast cells from deconvolution.
Extended Data Fig. 4 Comparison of pathway enrichment analysis based on scRNA-seq versus flow cytometry of corresponding markers in NSCLC immunotherapy trial data.
(a) G2-M checkpoint pathway enrichment (scRNA-seq) versus Ki-67 frequency (flow cytometry). A positive normalized enrichment score (NES) from GSEA indicates higher pathway enrichment in the later time points. Both Ki-67 and G2-M checkpoint pathway activity measure cell proliferation. For the aPD1 group, comparing Ki-67 frequency (flow cytometry), the Chi-squared test yielded a p-value of 0.03824 between cycle 1 and cycle 2, and a p-value of 0.00043 between cycle 2 and cycle 4. (b) Interferon alpha/gamma response pathway enrichment (scRNA-seq) versus ISG15 mean fluorescence intensity (flow cytometry). For the aPD1 group, comparing between cycle 1 and cycle 2, the Chi-squared test yielded a p-value of 0.07008. (MFI: mean fluorescence intensity). (c) Cell-subtype-specific gene set enrichment analysis within each response group between cycle 2 and cycle 4 after Harmony-based CellANOVA integration. Top 5 up-regulated and down-regulated pathways in cycle 4 compared to cycle 2 are shown. P-value for DEG analysis are calculated using a two-sided Wilcoxon signed-rank test.
Extended Data Fig. 5 Benchmarking analysis on type 1 diabetes dataset.
(a) Comparison of methods from three complementary perspectives: batch correction, cell type prediction, and preservation of biological signals beyond cell type. (b) Performance of each method measured by different metrics from each perspective and overall performance percentiles of those methods. Performance scores for individual metrics are rank-based and have been normalized within a 0 to 1 range. For each metric, a higher score indicates better performance.
Extended Data Fig. 6 Benchmarking analysis on immunotherapy trial dataset.
(a) Comparison of methods from three complementary perspectives: batch correction, cell type prediction, and preservation of biological signals beyond cell type. (b) Performance of each method measured by different metrics from each perspective and overall performance percentiles of those methods. Performance scores for individual metrics are rank-based and have been normalized within a 0 to 1 range. For each metric, a higher score indicates better performance.
Extended Data Fig. 7 Benchmarking analysis on mouse radiation dataset.
(a) Comparison of methods from three complementary perspectives: batch correction, cell type prediction, and preservation of biological signals beyond cell type. (b) Performance of each method measured by different metrics from each perspective and overall performance percentiles of those methods. Performance scores for individual metrics are rank-based and have been normalized within a 0 to 1 range. For each metric, a higher score indicates better performance.
Extended Data Fig. 8 Benchmarking analysis on human kidney multi-omics dataset.
(a) Comparison of methods from three complementary perspectives: batch correction, cell type prediction, and preservation of biological signals beyond cell type. (b) Performance of each method measured by different metrics from each perspective and overall performance percentiles of those methods. Performance scores for individual metrics are rank-based and have been normalized within a 0 to 1 range. For each metric, a higher score indicates better performance.
Extended Data Fig. 9 Additional benchmarking of CellANOVA and other batch correction methods on real and simulated datasets.
(a) UMAP visualizations of the ‘hold-one-celltype-out’ experiment using immunotherapy data before and after batch correction, colored by cell type, condition, and batch. (b) Box plots showing the LISI score of cells from the control pool (left) and cells belonging to treatment-specific cell type Terminal Effector (right) in the experiment in (a) (30153 cells, 18126 genes). In each box plot, the box covers the middle 50% of iLISI scores and the whiskers stop at the largest iLISI score value below the upper quartile plus 1.5 * IQR and the smallest iLISI score value above the lower quartile minus 1.5 * IQR. (c) UMAP visualizations of simulated data before and after batch correction, colored by cell type, condition, and batch. The simulation setting is described in the Section ‘Additional benchmarks and cell type hold-out experiments’ of the main article. (d) Box plots showing the LISI score of cells from the control pool (left) and cells belonging to treatment-specific cell type CT6 (right) in the simulation in (b). Box plots constructed in the same way as in (b). (simulation data: 21000 cells, 5000 genes). (e) ROC curves obtained from differential expression analysis between control and treatment groups, using the batch-corrected expressions of CT6 cells from different integration methods.
Extended Data Fig. 10 Descriptive analysis of CellANOVA-estimated batch effects with initial integration by Seurat.
(a) UMAP visualizations of batch effects estimated by Seurat-based CellANOVA on three datasets, colored by batch and cell type. (b) Top ten batch-affected pathways of each study based on batch-susceptibility score (BSS) with Seurat-based CellANOVA. P-values are calculated using a two-sided Wilcoxon signed-rank test.
Supplementary information
Supplementary Information
Supplementary Figs. 1–22 and Discussion.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Zhang, Z., Mathew, D., Lim, T.L. et al. Recovery of biological signals lost in single-cell batch integration with CellANOVA. Nat Biotechnol (2024). https://doi.org/10.1038/s41587-024-02463-1
Received:
Accepted:
Published:
DOI: https://doi.org/10.1038/s41587-024-02463-1