Fig. 2: Binarizing scATAC-seq data is unnecessary and hides quantitative information.
From: Modeling fragment counts improves single-cell ATAC-seq analysis

a, Comparison of the Poisson VAE, Binary VAE and PeakVI models on reconstructing the binarized cell-peak matrix of the NeurIPS, the Satpathy, the Fly and the sci-ATAC-seq3 datasets for ten cross-validation (CV) runs. Poisson VAE and Binary VAE use the observed total fragment count. The horizontal line denotes the median. P values were computed using a two-sided paired Wilcoxon test and Benjamini–Hochberg corrected. **P = 0.0019, *P = 0.0195, NS, not significant, P = 0.0695. b, Uniform Manifold Approximation and Projection (UMAP) of the integrated latent space of all NeurIPS batches, colored by cell type for the Poisson VAE model. The isolated label ID2-hi myeloid progenitors and the erythrocyte lineage are annotated. UMAPs for all other methods and datasets are in Extended Data Figs. 5–8. c, Enrichment (odds ratio, one-sided Fisher exact test) of distal regulatory elements, super-enhancers in bone marrow, promoters of highly expressed genes and promoters of highly variable genes in the scATAC-seq peaks of the NeurIPS dataset. Peaks are sorted by the fraction of counts above the binarization threshold and grouped according to different quantiles. *P < 0.0001. d, Correlation of expression of the SLC4A1 gene and fragment counts in its promoter. The two-sided Spearman correlation analysis was computed on cells with at least one fragment count in the promoter (n = 775). The P values were adjusted for multiple testing using the Benjamini–Hochberg correction. We restricted the plot to cells of similar total fragment count (0.25–0.75 quantile) to not capture effects driven by total fragment count. e–g, log-normalized gene expression over normalized accessibility of the SLC4A1 gene for the Poisson VAE (e), Binary VAE model (f) and cisTopic model (g). Cell type separation is measured with the silhouette width and area under the receiver operating characteristic (ROC) curve and is better with the Poisson VAE model.
In all boxplots, the central line denotes the median, boxes represent the interquartile range (IQR) and whiskers show the distribution except for outliers. Outliers are all points outside 1.5 × IQR. AUC, area under the curve. B, B cell; T, T cell; Mono, Monocyte; prog, progenitor; HSC, Hematopoietic stem cell; ILC, Innate lymphoid cell; Lymph, Lymphoid; MK/E, Megakaryocyte and Erythrocyte; G/M, Granulocyte and Myeloid; NK, Natural Killer cell; cDC2, Classical dendritic cell type 2; pDCs, Plasmacytoid dendritic cells.
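
The outlier rule in the boxplot legend can be illustrated with a minimal sketch. It assumes the common convention that the 1.5 × IQR fences are measured from the first and third quartiles; the caption does not state this explicitly. The function name `boxplot_outliers` and the example values are hypothetical and are not taken from the paper.

```python
import numpy as np


def boxplot_outliers(values, k=1.5):
    """Return (lower_fence, upper_fence, outliers) under the k * IQR rule.

    Assumption: fences sit k * IQR below the first quartile and
    k * IQR above the third quartile (the usual boxplot convention).
    """
    values = np.asarray(values, dtype=float)
    q1, q3 = np.percentile(values, [25, 75])  # box edges
    iqr = q3 - q1                             # interquartile range
    lower = q1 - k * iqr
    upper = q3 + k * iqr
    outliers = values[(values < lower) | (values > upper)]
    return lower, upper, outliers


# Made-up scores for ten hypothetical CV runs, for illustration only.
scores = [0.71, 0.72, 0.72, 0.73, 0.74, 0.74, 0.75, 0.76, 0.77, 0.95]
lower, upper, outliers = boxplot_outliers(scores)
print(f"fences: [{lower:.3f}, {upper:.3f}], outliers: {outliers}")
```

Under this convention the whiskers would extend to the most extreme points that still lie inside the fences, and any point beyond them would be drawn individually.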