Recovery of biological signals lost in single-cell batch integration with CellANOVA

Zhang, Zhaojun; Mathew, Divij; Lim, Tristan L.; Mason, Kaishu; Martinez, Clara Morral; Huang, Sijia; Wherry, E. John; Susztak, Katalin; Minn, Andy J.; Ma, Zongming; Zhang, Nancy R.

doi:10.1038/s41587-024-02463-1

Article
Published: 26 November 2024

Recovery of biological signals lost in single-cell batch integration with CellANOVA

Nature Biotechnology (2024)Cite this article

9014 Accesses
9 Citations
56 Altmetric
Metrics details

Subjects

Abstract

Data integration to align cells across batches has become a cornerstone of single-cell data analysis, critically affecting downstream results. Currently, there are no guidelines for when the biological differences between samples are separable from batch effects. Here we show that current paradigms for single-cell data integration remove biologically meaningful variation and introduce distortion. We present a statistical model and computationally scalable algorithm, CellANOVA (cell state space analysis of variance), that harnesses experimental design to explicitly recover biological signals that are erased during single-cell data integration. CellANOVA uses a ‘pool-of-controls’ design concept, applicable across diverse settings, to separate unwanted variation from biological variation of interest and allow the recovery of subtle biological signals. We apply CellANOVA to diverse contexts and validate the recovered biological signals by orthogonal assays. In particular, we show that CellANOVA is effective in the challenging case of single-cell and single-nucleus data integration, where it recovers subtle biological signals that can be validated and replicated by external data.

Access through your institution

Buy or subscribe

This is a preview of subscription content, access via your institution

Access options

Access through your institution

Buy this article

Purchase on SpringerLink
Instant access to full article PDF

Buy now

Prices may be subject to local taxes which are calculated during checkout

**Fig. 1: Examples of control pool construction and integration results.**

**Fig. 3: Benchmarking of CellANOVA and other integration methods on batch effect correction and data distortion.**

**Fig. 4: Benchmarking of CellANOVA and other integration methods on biological signal preservation.**

**Fig. 5: Evaluation of CellANOVA in multimodal data integration.**

**Fig. 6: Descriptive analysis of CellANOVA-estimated batch effects.**

Comparison of bias and resolvability in single-cell and single-transcript methods

Article Open access 02 June 2021

Online single-cell data integration through projecting heterogeneous datasets into a common cell-embedding space

Article Open access 17 October 2022

Data-driven comparison of multiple high-dimensional single-cell expression profiles

Article Open access 01 November 2021

Data availability

The mouse radiation therapy dataset is available from the GEO under accession number GSE280883. The JAK inhibition and PD1 immunotherapy trial dataset was obtained from Mathew et al.³⁰ and is availabe from the Gene Expression Omnibus (GEO) under accession number GSE266219. The T1D dataset was obtained from Fasolino et al.²⁹ and is available from the GEO under accession number GSE148073. The human kidney multiomics atlas dataset was obtained from Abedini et al.³¹ and is available from the GEO under accession number GSE211785. The scRNA-seq data by Lake et al.³⁶ can be downloaded from KPMP (https://www.kpmp.org.; DK133081, DK133091, DK133092, DK133093, DK133095, DK1330971, DK114866, DK114908, DK133090, DK133113, DK133766, DK133768, DK114907, DK114920, DK114923, DK114933 and DK114886). The NSCLC T cell dataset was obtained from Salcher et al.³² and downloaded from another study⁵⁹.

Code availability

All code used in this study, including the CellANOVA software and the analysis code, can be found on GitHub (https://github.com/Janezjz/cellanova).

References

Hicks, S. C., Townes, F. W., Teng, M. & Irizarry, R. A. Missing data and technical variability in single-cell RNA-sequencing experiments. Biostatistics 19, 562–578 (2018).
Article PubMed Google Scholar
Tung, P.-Y. et al. Batch effects and the effective design of single-cell gene expression studies. Sci. Rep. 7, 39921 (2017).
Article CAS PubMed PubMed Central Google Scholar
Luecken, M. D. & Theis, F. J. Current best practices in single-cell RNA-seq analysis: a tutorial. Mol. Syst. Biol. 15, e8746 (2019).
Article PubMed PubMed Central Google Scholar
Zhang, Y. et al. Sample-multiplexing approaches for single-cell sequencing. Cell. Mol. Life Sci. 79, 466 (2022).
Article CAS PubMed PubMed Central Google Scholar
Kang, H. M. et al. Multiplexed droplet single-cell RNA-sequencing using natural genetic variation. Nat. Biotechnol. 36, 89–94 (2018).
Article CAS PubMed Google Scholar
Stoeckius, M. et al. Cell hashing with barcoded antibodies enables multiplexing and doublet detection for single cell genomics. Genome Biol. 19, 224 (2018).
Article CAS PubMed PubMed Central Google Scholar
Xu, J. et al. Genotype-free demultiplexing of pooled single-cell RNA-seq. Genome Biol. 20, 290 (2019).
Article CAS PubMed PubMed Central Google Scholar
Gehring, J., Hwee Park, J., Chen, S., Thomson, M. & Pachter, L. Highly multiplexed single-cell RNA-seq by DNA oligonucleotide tagging of cellular proteins. Nat. Biotechnol. 38, 35–38 (2020).
McGinnis, C. S. et al. MULTI-seq: sample multiplexing for single-cell RNA sequencing using lipid-tagged indices. Nat. Methods 16, 619–626 (2019).
Article CAS PubMed PubMed Central Google Scholar
Heaton, H. et al. Souporcell: robust clustering of single-cell RNA-seq data by genotype without reference genotypes. Nat. Methods 17, 615–620 (2020).
Article CAS PubMed Google Scholar
Haghverdi, L., Lun, A. T., Morgan, M. D. & Marioni, J. C. Batch effects in single-cell RNA-sequencing data are corrected by matching mutual nearest neighbors. Nat. Biotechnol. 36, 421–427 (2018).
Article CAS PubMed PubMed Central Google Scholar
Lopez, R., Regier, J., Cole, M. B., Jordan, M. I. & Yosef, N. Deep generative modeling for single-cell transcriptomics. Nat. Methods 15, 1053–1058 (2018).
Article CAS PubMed PubMed Central Google Scholar
Korsunsky, I. et al. Fast, sensitive and accurate integration of single-cell data with harmony. Nat. Methods 16, 1289–1296 (2019).
Article CAS PubMed PubMed Central Google Scholar
Welch, J. D. et al. Single-cell multi-omic integration compares and contrasts features of brain cell identity. Cell 177, 1873–1887 (2019).
Article CAS PubMed PubMed Central Google Scholar
Hie, B., Bryson, B. & Berger, B. Efficient integration of heterogeneous single-cell transcriptomes using Scanorama. Nat. Biotechnol. 37, 685–691 (2019).
Article CAS PubMed PubMed Central Google Scholar
Stuart, T. et al. Comprehensive integration of single-cell data. Cell 177, 1888–1902 (2019).
Article CAS PubMed PubMed Central Google Scholar
Hao, Y. et al. Integrated analysis of multimodal single-cell data. Cell 184, 3573–3587 (2021).
Article CAS PubMed PubMed Central Google Scholar
Lin, Y. et al. scMerge leverages factor analysis, stable expression, and pseudoreplication to merge multiple single-cell RNA-seq datasets. Proc. Natl Acad. Sci. USA 116, 9775–9784 (2019).
Article CAS PubMed PubMed Central Google Scholar
Huang, M., Zhang, Z. & Zhang, N. R. Dimension reduction and denoising of single-cell rna sequencing data in the presence of observed confounding variables. Preprint at bioRxiv https://doi.org/10.1101/2020.08.03.234765 (2020).
Song, F., Chan, G. M. A. & Wei, Y. Flexible experimental designs for valid single-cell RNA-sequencing experiments allowing batch effects correction. Nat. Commun. 11, 3274 (2020).
Article CAS PubMed PubMed Central Google Scholar
Luecken, M. D. et al. Benchmarking atlas-level data integration in single-cell genomics. Nat. Methods 19, 41–50 (2022).
Article CAS PubMed Google Scholar
Tran, H. T. N. et al. A benchmark of batch-effect correction methods for single-cell RNA sequencing data. Genome Biol. 21, 12 (2020).
Article CAS PubMed PubMed Central Google Scholar
Goh, W. W. B., Wang, W. & Wong, L. Why batch effects matter in omics data, and how to avoid them. Trends Biotechnol. 35, 498–507 (2017).
Article CAS PubMed Google Scholar
Leek, J. T. et al. Tackling the widespread and critical impact of batch effects in high-throughput data. Nat. Rev. Genet. 11, 733–739 (2010).
Article CAS PubMed Google Scholar
Molania, R. et al. Removing unwanted variation from large-scale RNA sequencing data with PRPS. Nat. Biotechnol. 41, 82–95 (2023).
Article CAS PubMed Google Scholar
Leek, J. T., Johnson, W. E., Parker, H. S., Jaffe, A. E. & Storey, J. D. The sva package for removing batch effects and other unwanted variation in high-throughput experiments. Bioinformatics 28, 882–883 (2012).
Article CAS PubMed PubMed Central Google Scholar
Sun, Y., Zhang, N. R. & Owen, A. B. Multiple hypothesis testing adjusted for latent variables, with an application to the AGEMAP gene expression data. Ann. Appl. Stat. 6, 1664–1688 (2012).
Risso, D., Ngai, J., Speed, T. P. & Dudoit, S. Normalization of RNA-seq data using factor analysis of control genes or samples. Nat. Biotechnol. 32, 896–902 (2014).
Article CAS PubMed PubMed Central Google Scholar
Fasolino, M. et al. Single-cell multi-omics analysis of human pancreatic islets reveals novel cellular states in type 1 diabetes. Nat. Metab. 4, 284–299 (2022).
Article CAS PubMed PubMed Central Google Scholar
Mathew, D. et al. Combined JAK inhibition and PD-1 immunotherapy for non-small cell lung cancer patients. Science 384, eadf1329 (2024).
Abedini, A. et al. Single-cell multi-omic and spatial profiling of human kidneys implicates the fibrotic microenvironment in kidney disease progression. Nat. Genet. 56, 1712–1724 (2024).
Salcher, S. et al. High-resolution single-cell atlas reveals diversity and plasticity of tissue-resident neutrophils in non-small cell lung cancer. Cancer Cell 40, 1503–1520 (2022).
Article Google Scholar
Butler, A., Hoffman, P., Smibert, P., Papalexi, E. & Satija, R. Integrating single-cell transcriptomic data across different conditions, technologies, and species. Nat. Biotechnol. 36, 411–420 (2018).
Article CAS PubMed PubMed Central Google Scholar
Liu, J. et al. Jointly defining cell types from multiple single-cell datasets using LIGER. Nat. Protoc. 15, 3632–3662 (2020).
Article CAS PubMed PubMed Central Google Scholar
Kang, J. B. et al. Efficient and precise single-cell reference atlas mapping with Symphony. Nat. Commun. 12, 5890 (2021).
Article CAS PubMed PubMed Central Google Scholar
Lake, B. B. et al. An atlas of healthy and injured cell states and niches in the human kidney. Nature 619, 585–594 (2023).
Article CAS PubMed PubMed Central Google Scholar
Liberzon, A. et al. The Molecular Signatures Database (MSigDB) hallmark gene set collection. Cell Syst. 1, 417–425 (2015).
Article CAS PubMed PubMed Central Google Scholar
Schröder, M. & Kaufman, R. J. The mammalian unfolded protein response. Annu. Rev. Biochem. 74, 739–789 (2005).
Article PubMed Google Scholar
Wang, J., Yang, X. & Zhang, J. Bridges between mitochondrial oxidative stress, ER stress and mTOR signaling in pancreatic β cells. Cell. Signal. 28, 1099–1104 (2016).
Article CAS PubMed Google Scholar
Sarbassov, D. D., Ali, S. M. & Sabatini, D. M. Growing roles for the mTOR pathway. Curr. Opin. Cell Biol. 17, 596–603 (2005).
Article CAS PubMed Google Scholar
Wang, X. & Proud, C. G. The mTOR pathway in the control of protein synthesis. Physiology 21, 362–369 (2006).
Article CAS PubMed Google Scholar
Thompson, E. B. The many roles of c-Myc in apoptosis. Ann. Rev. Physiol. 60, 575–600 (1998).
Article CAS Google Scholar
Dang, C. V. c-Myc target genes involved in cell growth, apoptosis, and metabolism. Mol. Cell. Biol. 19, 1–11 (1999).
Article CAS PubMed PubMed Central Google Scholar
Yates, F. The analysis of multiple classifications with unequal numbers in the different classes. J. Am. Stat. Assoc. 29, 51–66 (1934).
Article Google Scholar
Fisher, R. A. Statistical Methods for Research Workers (Oliver and Boyd, 1970).
Mootha, V. K. et al. PGC-1α-responsive genes involved in oxidative phosphorylation are coordinately downregulated in human diabetes. Nat. Genet. 34, 267–273 (2003).
Article CAS PubMed Google Scholar
Subramanian, A. et al. Gene set enrichment analysis: a knowledge-based approach for interpreting genome-wide expression profiles. Proc. Natl Acad. Sci. USA 102, 15545–15550 (2005).
Article CAS PubMed PubMed Central Google Scholar
Wolf, F. A. et al. PAGA: graph abstraction reconciles clustering with trajectory inference through a topology preserving map of single cells. Genome Biol. 20, 59 (2019).
Haghverdi, L., Büttner, M., Wolf, F. A., Buettner, F. & Theis, F. J. Diffusion pseudotime robustly reconstructs lineage branching. Nat. Methods 13, 845–848 (2016).
Article CAS PubMed Google Scholar
Satija, R., Farrell, J. A., Gennert, D., Schier, A. F. & Regev, A. Spatial reconstruction of single-cell gene expression data. Nat. Biotechnol. 33, 495–502 (2015).
Article CAS PubMed PubMed Central Google Scholar
Xu, C. et al. Probabilistic harmonization and annotation of single-cell transcriptomics data with deep generative models. Mol. Syst. Biol. 17, e9620 (2021).
Article PubMed PubMed Central Google Scholar
Lotfollahi, M., Wolf, F. A. & Theis, F. J. scGen predicts single-cell perturbation responses. Nat. Methods 16, 715–721 (2019).
Article CAS PubMed Google Scholar
Rousseeuw, P. J. Silhouettes: a graphical aid to the interpretation and validation of cluster analysis. J. Comput. Appl. Math. 20, 53–65 (1987).
Article Google Scholar
Büttner, M., Miao, Z., Wolf, F. A., Teichmann, S. A. & Theis, F. J. A test metric for assessing single-cell RNA-seq batch correction. Nat. Methods 16, 43–49 (2019).
Article PubMed Google Scholar
Cover, T. M. & Thomas, J. A. Elements of Information Theory 2nd edn (Wiley, 2006).
Hubert, L. & Arabie, P. Comparing partitions. J. Classif. 2, 193–218 (1985).
Article Google Scholar
Pearson, K. On the criterion that a given system of deviations from the probable in the case of a correlated system of variables is such that it can be reasonably supposed to have arisen from random sampling. Philos. Mag. 50, 157–175 (1900).
Article Google Scholar
Monaco, G. et al. RNA-seq signatures normalized by mRNA abundance allow absolute deconvolution of human immune cell types. Cell Rep. 26, 1627–1640 (2019).
Article CAS PubMed PubMed Central Google Scholar
Megill, C. et al. cellxgene: a performant, scalable exploration platform for high dimensional sparse matrices. Preprint at bioRxiv https://doi.org/10.1101/2021.04.05.438318 (2021).

Download references

Acknowledgements

We would like to acknowledge support from the National Science Foundation (NSF; DMS-2345215 and DMS-2210104 to Z.M. and DMS-2245575 to Z.M., Z.Z. and N.R.Z.), National Institutes of Health (R01-HG006137, U2C-CA233285, and 5R01GM125301 to N.R.Z., 5R01DK087635 to K.S. and TL1TR001880 to T.L.L.), Mark Foundation Center for Radiobiology and Immunology (to N.R.Z. and A.M.) and Parker Institute for Cancer Immunotherapy (to D.M.). The content is solely the responsibility of the authors and does not necessarily represent the official views of the funding organizations.

Author information

These authors contributed equally: Zongming Ma, Nancy R. Zhang.

Authors and Affiliations

Department of Statistics and Data Science, The Wharton School, University of Pennsylvania, Philadelphia, PA, USA
Zhaojun Zhang, Kaishu Mason & Nancy R. Zhang
Department of Systems Pharmacology and Translational Therapeutics, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA, USA
Divij Mathew & E. John Wherry
Institute for Immunology, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA, USA
Divij Mathew, E. John Wherry & Andy J. Minn
Parker Institute for Cancer Immunotherapy, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA, USA
Divij Mathew, E. John Wherry & Andy J. Minn
Department of Radiation Oncology, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA, USA
Tristan L. Lim, Clara Morral Martinez & Andy J. Minn
Mark Foundation Center for Immunotherapy, Immune Signaling and Radiation, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA, USA
Clara Morral Martinez, E. John Wherry & Andy J. Minn
Penn Institute of Biomedical Informatics, University of Pennsylvania, Philadelphia, PA, USA
Sijia Huang
Renal, Electrolyte and Hypertension Division, Department of Medicine, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA, USA
Katalin Susztak
Institute for Diabetes, Obesity and Metabolism, University of Pennsylvania, Philadelphia, PA, USA
Katalin Susztak
Department of Genetics, University of Pennsylvania, Philadelphia, PA, USA
Katalin Susztak
Department of Statistics and Data Science, Yale University, New Haven, CT, USA
Zongming Ma

Authors

Zhaojun Zhang
View author publications
Search author on:PubMed Google Scholar
Divij Mathew
View author publications
Search author on:PubMed Google Scholar
Tristan L. Lim
View author publications
Search author on:PubMed Google Scholar
Kaishu Mason
View author publications
Search author on:PubMed Google Scholar
Clara Morral Martinez
View author publications
Search author on:PubMed Google Scholar
Sijia Huang
View author publications
Search author on:PubMed Google Scholar
E. John Wherry
View author publications
Search author on:PubMed Google Scholar
Katalin Susztak
View author publications
Search author on:PubMed Google Scholar
Andy J. Minn
View author publications
Search author on:PubMed Google Scholar
Zongming Ma
View author publications
Search author on:PubMed Google Scholar
Nancy R. Zhang
View author publications
Search author on:PubMed Google Scholar

Contributions

Conceptualization, Z.Z., Z.M. and N.R.Z. Algorithm development, Z.Z., Z.M. and N.R.Z. Software implementation, Z.Z. Design of computational experiments, Z.Z., Z.M. and N.R.Z. Immunotherapy trial experiment, D.M. Mouse irradiation experiment, C.M.M. and T.L.L. Kidney disease experiment, K.S. Data analysis, Z.Z., with help from D.M., T.L.L., K.M., S.H., Z.M. and N.R.Z. Manuscript writing, Z.Z., Z.M. and N.R.Z., with feedback from T.L., D.M. and K.M. Supervision, A.J.M., E.J.W., K.S., Z.M. and N.R.Z.

Corresponding authors

Correspondence to Zongming Ma or Nancy R. Zhang.

Ethics declarations

Competing interests

The authors declare no competing interests.

Peer review

Peer review information

Nature Biotechnology thanks Jinmiao Chen and the other, anonymous, reviewer(s) for their contribution to the peer review of this work.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Extended data

Extended Data Fig. 1 Benchmarking of Harmony and CellANOVA integrations of single-cell samples measured with multiple technologies on the NSCLC atlas dataset.

Benchmarking of Harmony and CellANOVA integrations of single-cell samples measured with multiple technologies on the NSCLC atlas dataset³². Harmony is used in the first step of CellANOVA. (a) UMAP visualizations of uncorrected data, Harmony-integrated data, and CellANOVA-integrated data. Each sample was treated as a separate batch. UMAPs were colored by technology, sample origin, and cell type in the top, the middle, and the bottom row, respectively. (b) Comparisons of batch correction (left), gene-wise distortion (middle) and global distortion (right) between Harmony and CellANOVA (scRNA-seq data: 20000 cells, 26023 genes; snRNA-seq data: 20000 cells, 30899 genes.) Left: Boxplots of iLISI scores across cells of each hold-out sample based on uncorrected data, Harmony-integrated data, and CellANOVA-integrated data. In each box plot, the box covers the middle 50% of iLISI scores and the whiskers stop at the largest iLISI score value below the upper quartile plus 1.5 * IQR and the smallest iLISI score value above the lower quartile minus 1.5 * IQR. Middle: Comparisons of p-values obtained from DE analysis pre- and post-Harmony integration and pre- and post-CellANOVA integration. Red dashed line corresponds to equal p-values pre- and post-integration. P-values for DEG analysis are calculated using a two-sided Wilcoxon signed-rank test. Right: Violin plots of Pearson correlations between pre- and post-integration gene expressions per cell with Harmony integration (top) and with CellANOVA integration (bottom). (c) Signal preservation evaluation of batch-corrected embeddings by out-of-batch nearest neighbor proportion distributions. We focused on the combined group of ‘T cell CD8’, ‘T cell CD4’, and ‘T regulatory’ cells, considering sample origin as the primary biological condition of interest. For cells with each origin, we computed out-of-batch nearest neighbor proportions across all sample origins. With successful signal preservation, cells with the same biological condition should be enriched in out-of-batch nearest neighbors.

Extended Data Fig. 2 Top ten upregulated pathways the disease condition for each specific cell type after Harmony integration without CellANOVA recovery.

Top ten upregulated pathways identified within the disease condition for each specific cell type in the Abedini et al. data and the KPMP data after Harmony integration without CellANOVA recovery. P-values for DEG analysis are calculated using a two-sided Wilcoxon signed-rank test.

Extended Data Fig. 3 Additional evaluation of CellANVOA in multimodal data integration.

Left: Scatter plots of epithelial-mesenchymal transition (EMT) pathway activity versus fibroblast ratio, with p-values obtained from a two-sided t-test of slope in linear regression. Middle: Epithelial-mesenchymal transition pathway activity. Right: Abundance of fibroblast cells from deconvolution.

Extended Data Fig. 4 Comparison of pathway enrichment analysis based on scRNA-seq versus flow cytometry of corresponding markers in NSCLC immunotherapy trial data.

(a) G2-M checkpoint pathway enrichment (scRNA-seq) versus Ki-67 frequency (flow cytometry). A positive normalized enrichment score (NES) from GSEA indicates higher pathway enrichment in the later time points. Both Ki-67 and G2-M checkpoint pathway activity measure cell proliferation. For the aPD1 group, comparing Ki-67 frequency (flow cytometry), the Chi-squared test yielded a p-value of 0.03824 between cycle 1 and cycle 2, and a p-value of 0.00043 between cycle 2 and cycle 4. (b) Interferon alpha/gamma response pathway enrichment (scRNA-seq) versus ISG15 mean fluorescence intensity (flow cytometry). For the aPD1 group, comparing between cycle 1 and cycle 2, the Chi-squared test yielded a p-value of 0.07008. (MFI: mean fluorescence intensity). (c) Cell-subtype-specific gene set enrichment analysis within each response group between cycle 2 and cycle 4 after Harmony-based CellANOVA integration. Top 5 up-regulated and down-regulated pathways in cycle 4 compared to cycle 2 are shown. P-value for DEG analysis are calculated using a two-sided Wilcoxon signed-rank test.

Extended Data Fig. 5 Benchmarking analysis on type 1 diabetes dataset.

(a) Comparison of methods from three complementary perspectives: batch correction, cell type prediction, and preservation of biological signals beyond cell type. (b) Performance of each method measured by different metrics from each perspective and overall performance percentiles of those methods. Performance scores for individual metrics are rank-based and have been normalized within a 0 to 1 range. For each metric, a higher score indicates better performance.

Extended Data Fig. 6 Benchmarking analysis on immunotherapy trial dataset.

(a) Comparison of methods from three complementary perspectives: batch correction, cell type prediction, and preservation of biological signals beyond cell type. (b) Performance of each method measured by different metrics from each perspective and overall performance percentiles of those methods. Performance scores for individual metrics are rank-based and have been normalized within a 0 to 1 range. For each metric, a higher score indicates better performance.

Extended Data Fig. 7 Benchmarking analysis on mouse radiation dataset.

(a) Comparison of methods from three complementary perspectives: batch correction, cell type prediction, and preservation of biological signals beyond cell type. (b) Performance of each method measured by different metrics from each perspective and overall performance percentiles of those methods. Performance scores for individual metrics are rank-based and have been normalized within a 0 to 1 range. For each metric, a higher score indicates better performance.

Extended Data Fig. 8 Benchmarking analysis on human kidney multi-omics dataset.

(a) Comparison of methods from three complementary perspectives: batch correction, cell type prediction, and preservation of biological signals beyond cell type. (b) Performance of each method measured by different metrics from each perspective and overall performance percentiles of those methods. Performance scores for individual metrics are rank-based and have been normalized within a 0 to 1 range. For each metric, a higher score indicates better performance.

Extended Data Fig. 9 Additional benchmarking of CellANOVA and other batch correction methods on real and simulated datasets.

(a) UMAP visualizations of the ‘hold-one-celltype-out’ experiment using immunotherapy data before and after batch correction, colored by cell type, condition, and batch. (b) Box plots showing the LISI score of cells from the control pool (left) and cells belonging to treatment-specific cell type Terminal Effector (right) in the experiment in (a) (30153 cells, 18126 genes). In each box plot, the box covers the middle 50% of iLISI scores and the whiskers stop at the largest iLISI score value below the upper quartile plus 1.5 * IQR and the smallest iLISI score value above the lower quartile minus 1.5 * IQR. (c) UMAP visualizations of simulated data before and after batch correction, colored by cell type, condition, and batch. The simulation setting is described in the Section ‘Additional benchmarks and cell type hold-out experiments’ of the main article. (d) Box plots showing the LISI score of cells from the control pool (left) and cells belonging to treatment-specific cell type CT6 (right) in the simulation in (b). Box plots constructed in the same way as in (b). (simulation data: 21000 cells, 5000 genes). (e) ROC curves obtained from differential expression analysis between control and treatment groups, using the batch-corrected expressions of CT6 cells from different integration methods.

Extended Data Fig. 10 Descriptive analysis of CellANOVA-estimated batch effects with initial integration by Seurat.

(a) UMAP visualizations of batch effects estimated by Seurat-based CellANOVA on three datasets, colored by batch and cell type. (b) Top ten batch-affected pathways of each study based on batch-susceptibility score (BSS) with Seurat-based CellANOVA. P-values are calculated using a two-sided Wilcoxon signed-rank test.

Supplementary information

Supplementary Information

Supplementary Figs. 1–22 and Discussion.

Reporting Summary

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

Reprints and permissions

About this article

Cite this article

Zhang, Z., Mathew, D., Lim, T.L. et al. Recovery of biological signals lost in single-cell batch integration with CellANOVA. Nat Biotechnol (2024). https://doi.org/10.1038/s41587-024-02463-1

Download citation

Received: 02 October 2023
Accepted: 02 October 2024
Published: 26 November 2024
DOI: https://doi.org/10.1038/s41587-024-02463-1