Introduction

Metagenomic next-generation sequencing (mNGS) encompasses various high-throughput DNA profiling techniques that enable environment-agnostic taxonomic profiling of microorganisms, including bacteria, archaea, fungi, and viruses1. mNGS has shown increasing adoption in clinical contexts for diagnosing infectious diseases, surveilling microbial pathogens, and predicting antibiotic efficacy2,3 across fecal, skin, and tissue samples. Utilizing mNGS in these settings is attractive due to its untargeted and high-throughput characteristics; however, its untargeted nature can result in substantial and confounding amounts of non-microbial DNA (e.g., human DNA) when processing low-microbial biomass samples, especially at higher sequencing depths.

To mitigate the influence of non-microbial DNA on metagenomic studies, diverse host depletion techniques have been developed, ranging from experimental modification of DNA extraction steps (e.g., differential lysis)4 to real-time sorting of reads during sequencing5. Computational host filtration, or simply host filtration, refers to computational approaches for removing host DNA from sequenced samples, regardless of whether prior host depletion steps were performed. Separating host genetic information from microbial counterparts is a crucial step in mNGS workflows, especially in the analysis of low microbial biomass samples, such as those derived from skin, saliva, or tumors6,7. Additionally, these methodologies are imperative to increase the rigor with which debated microbial communities, like putative blood-borne microbiota, can be assayed.

When human DNA reads from mNGS are not correctly identified during the host filtration step, they may be incorrectly classified as microbial reads, creating potentially misidentified taxonomic classifications and biased effect sizes (inflation or deflation of mismapped taxa)7. These misclassifications can manifest as false positive taxonomic classifications, and in this work we further demonstrate these false positives can alter biological conclusions. Further, failure to remove human DNA from metagenomic sequencing samples can leak private genetic information about the host into putatively microbial data, enabling re-identification of study participants8.

Although host filtration is generally a common preprocessing step9, the algorithmic choice for host filtration and employed human reference database(s) can result in substantially different biological results7. Tools differ between pipelines, but most host filtration approaches map reads to a host reference genome followed by sequence-based computational subtraction of host reads to obtain human-filtered data10.

Most host filtration tools10 and recommended host filtration workflows11,12 exclusively use a single human reference, which fails to capture the diversity of human genomes and cannot remove population-specific variation. Portions of the human genome that are incomplete in these references, such as the Y chromosome in GRCh38 or earlier versions of T2T-CHM13 (v1.0), can permit flow-through of human reads from those regions to microbial mapping steps, leading to the mismapping of taxa during classification and artifactual data distributions (e.g., false sex differences in the low biomass microbial profiles). Moreover, regions of population-specific genome variation or haplotypes not well covered in singular reference genomes can allow leakage of patient-identifying information in microbial reads8. To date, previous work13 has either failed to incorporate pangenome references14 or to provide computationally efficient methods capable of host filtering across dozens of human genomes7, and both are needed to protect patient privacy and improve output quality. Therefore, we were motivated to explore more efficient methods for host filtration using the most comprehensive human references available to protect the privacy of disseminated metagenomics datasets and mitigate artifactual biases associated with missing genome regions.

In this work, we identify and resolve an artifactual technical effect caused by insufficient host filtration in quantifying the microbial profiles associated with tumor tissue from mNGS. We then implement and benchmark three improved host filtration approaches that leverage two complementary algorithmic approaches and a wide variety of human reference genomes to maximize host read removal. We apply these novel methodologies towards multiple sample types in both low and high biomass conditions. Finally, we show that improved host filtration prevents host re-identification from mNGS datasets with matched genotyping information. These efforts support utilization of comprehensive host filtration preprocessing for current and future mNGS studies to increase data robustness and protect patient privacy.

Results

Artifactual sex splitting across metastatic cancers

We initially and incidentally discovered detrimental effects from improper host filtration when exploring sequencing data from a cohort of metastatic human tumor tissues (Hartwig Medical Foundation15 hereafter “HMF”). These data were originally processed and published before the release of T2T-CHM13v2.016, which added a complete human Y chromosome, and have since been independently analyzed for microbiomes17. After isolating non-human reads from deep, whole-genome sequenced samples of 4902 metastatic tumors (Supplementary Data 1), we applied quality and length filtering, followed by re-alignment against GRCh38.p7. Surprisingly, our initial analysis of putative metastatic tumor low biomass microbial profiles revealed significant differences between male and female-labeled samples (Fig. 1a; p = 0.00025, RPCA18-PERMANOVA). Subsequent re-analysis of the same data using T2T-CHM13v2.016, which included the first complete Y chromosome, abolished the male-female sex separation in our data (Fig. 1b; p = 0.29, RPCA18-PERMANOVA). These results suggested that missing regions of human reference chromosomes can directly cause related artifactual biases in downstream microbiome data.

Fig. 1: Sex biases identified in inadequately host-filtered human tumor tissue data.
figure 1

a RPCA of microbial relative abundance quantification from tumor samples in the Hartwig Medical Foundation Database, which was originally subject to GRCh38.p7 filtration exclusively. Statistically significant differences were found between male and female groups (PERMANOVA; pseudo-F = 65.4, p = 0.00025). b Identical dataset and pre-processing steps done in a but with the addition of the T2T-CHM13v2.0 reference genome in host filtration. Differences were not statistically significant between male and female groups (PERMANOVA; pseudo-F = 1.23, p = 0.29).

To validate that this result was not unique to RPCA, we re-calculated results with Weighted and Unweighted UniFrac19, Bray-Curtis20 dissimilarity, and Jaccard21 similarity index. We also tested another microbial database, Web of Life release 222, all with and without rarefaction. Notably, neither the choice of rarefaction level nor microbial database affected the identification of artifactual sex differences. The results were less affected by filtration against T2T-CHM13v2.0 when qualitative distance metrics were used (i.e., Unweighted UniFrac and Jaccard similarity index). However, quantitative metrics (i.e., Weighted UniFrac and Bray-Curtis) reproduced our original findings that additional filtration against T2T-CHM13v2.0 abolished sex differences (Supplementary Data 2).

To investigate why quantitative metrics were affected but qualitative metrics were not, we examined a subset of 100 samples from HMF. We randomly selected metastatic tissue samples across various body sites, all of which had metastasized from a primary colorectal tumor. We isolated reads from these samples that were removed when filtered with T2T-CHM13v2.0 but retained when filtered with GRCh38.p7. We refer to these reads as “T2T-filtered”. Using Woltka and RefSeq release 200, we mapped the T2T-filtered reads to their corresponding operational genomic units (OGUs)23 (Supplementary Data 3). Notably, 99.895% of the reads (5,590,189/5,596,038) matched only four taxa: Toxoplasma gondii (G000006565), Bifidobacterium tibiigranuli (G009371885), Alcanivorax hongdengensis (G000300995), and Tetrasphaera japonica (G001046855). When reads we observed as being mistakenly mapped to microbial taxa are removed with T2T-CHM13v2.0 filtering, the quantitative diversity metrics more accurately reflect the per-sample microbial diversity (Fig. 1b). However, qualitative metrics relying on presence-absence (and not abundance) are less affected, and thereby more robust to, falsely inflated abundances of T.gondii, B. tibiigranuli, A. hongdengensis, and T. japonica. These results demonstrate the importance of validating conclusions with qualitative metagenomic methods when using older or incomplete genome references—although the human references are now essentially complete, the same is not true for other species, so this category of validation will continue to be important into the future.

To test whether these four taxa shared regions of sequence similarity with the human genome, we took the same T2T-filtered reads from the subset of 100 HMF samples described above, and mapped those sequences against RS210-clean, a version of RefSeq release 210 (2022-01-01) in which regions of microbial genomes shared with human genomes based on Exhaustive1 and Conterminator24 were masked. We found that using a human-scrubbed microbial database eliminated some of the T2T-filtered reads from mapping to microbes (Supplementary Data 4). 5,596,038 T2T-filtered reads mapped to microbes using RefSeq release 200, but only 53 mapped to microbes using RS210-clean. This result suggested that a cleaned microbial database may alone abolish the false male-female sex difference. However, a strong human-filtration pipeline that uses T2T-CHM13v2.0 would have removed all these T2T-filtered reads, so that none of them would map to microbes regardless of which database was used.

To confirm that a cleaned microbial database would abolish the sex difference without any additional filtration, we then applied RS210-clean on a larger subset of 477 metastatic tissue samples across various body sites, all of which had metastasized from the colon using GRCh38.p7 filtration alone. Importantly, we found that the sex differences were eliminated (p = 0.14, RPCA18-PERMANOVA). These results indeed demonstrate that host filtration of either the reads (prior to mapping) or the microbial database is sufficient to prevent sex biases.

To verify that the T2T-filtered reads from the subset of 100 HMF samples described above was in fact derived from the human Y chromosome, we aligned the T2T-filtered reads using minimap225 (v. 2.26) and an index based only on the Y chromosome portion of T2T-CHM13v2.0, and observed an overall alignment of 88.99% to the Y chromosome. We speculate that the remaining 11.01% are likely due to additions of regions in other chromosomes with the T2T-CHM13v2.0 release. We spot-checked alignments of a subset of Y chromosome mapped reads using the BLAST26 web portal. For example, a read that aligns with 100% identity to the Y chromosome is also identified as T. gondii at 98.67% identity when using Nucleotide BLAST26. To confirm the T2T-filtered reads were likely mismapped to microbial genomes, we calculated the depth and breadth for the top ten multi-mapped organisms (Supplementary Fig. 1a). We observe a large coverage peak within each genome with low mean coverage depth, suggesting an artifactual signal. We extracted the genomic regions corresponding to the coverage peak for each organism and confirmed they correspond to low complexity regions of each respective microbial reference genome (Supplementary Data 5). Finally we include coverage depth and breadth assessments for these same reads against the Y chromosome from the T2T-CHM13v2.0 and note a more uniform distribution, suggesting the true origin of the reads corresponds to the more complete Y chromosome in T2T-CHM13v2.0 rather than any microbial genome (Supplementary Fig. 1b). Overall, these data suggest that the sex differences identified in the HMF dataset are attributable to human Y chromosome sequences leaking through the GRCh38.p7 filter, which were subsequently mapped to microbial taxa containing genomic regions common to the human genome.

Inspired by this resolution to the problem of artifactual sex-specific differences, we sought to create and evaluate pipelines for thorough host filtration in a computationally efficient manner (described below). These pipelines can be conservatively combined with microbial database cleaning/masking approaches, as we and others have described elsewhere7,24. However, we caution that microbial database masking alone may not adequately address patient re-identification concerns, because human reads remain mixed with microbial reads, as addressed later in this work.

Improved host filtration approach and validation

We thus proposed and benchmarked three methods for improved host filtration that utilizes traditional sequence alignment25 and a novel indexing-based approach called Movi27. We evaluate multiple human references, including the most updated versions of GRCh38.p14, T2T-CHM13v2.016, and HPRC-2023 release14, to maximize captured human genomic diversity. Our methods are as follows: 1) Alignment with minimap2 to GRCh38.p14 and T2T-CHM13v2.0, and indexing with Movi to GRCh38.p14, T2T-CHM13v2.0, and HPRC, 2) Alignment with minimap2 to GRCh38.p14, T2T-CHM13v2.0, and HPRC, and indexing with Movi to GRCh38.p14, T2T-CHM13v2.0, and HPRC, 3) Indexing with Movi to GRCh38.p14, T2T-CHM13v2.0, and HPRC (Fig. 2a; see Methods for details). Additionally, we compared our methods to the only other publication using HPRC for host filtration, which used all three human genome reference sets with minimap225 in both paired-end and single-end mode. We also benchmarked our methods against the strict host filtration method described by Sepich-Poore et al.7 (Supplementary Fig. 2a).

Fig. 2: Host filtration pipeline and runtime evaluation.
figure 2

a Pipeline of host filtration methods. b Using simulated data with a 50/50 mix of human data from HPRC and microbial data from FDA-ARGOS, we applied the 3 host filtration methods with 3 different sample sizes. Runtimes were averaged across 10 runs per sample size. HG38: GRCH38.p14, T2T: T2T-CHM13v2.0, HPRC: Human Pangenome Reference Consortium 2024 release.

To compare the run time of host filtration methods, we simulated data of 50% human and 50% microbial reads using ten sampled genomes from HPRC and over 800 complete bacterial assemblies from the FDA-ARGOS database28. Using these ten simulated datasets, we subsampled them at ten thousand, 1 million, and 10 million reads, followed by applying all filtration methods on each to assess their scalability (Fig. 2b). Methods 1 and 3 had comparable runtimes: 11.13 min and 11.15 min at 1 million reads, respectively. In comparison, Method 2’s use of HPRC alignment with minimap2 created exponentially increasing run times (46 min, 55 min, and 2.5 h at ten thousand, 1 million, and 10 million reads, respectively) as the dataset size increased. The strict host filtering method described by Sepich-Poore et al.7 also took the longest to complete, or 1.59 h for 1 million reads (Supplementary Fig. 2b).

We next applied the three host filtration methods to assess sensitivity on the aforementioned ten samples of 1 million reads each, excluding the ten pangenomes we used to simulate the human data during filtration. An ideal host filtration method would result in zero remaining human reads (Fig. 3a, Supplementary Data 6) and a minimal number of lost microbial reads (Fig. 3b, Supplementary Data 6). For the remaining human reads (Fig. 3a), we found significant differences between Method 1 and Method 3, as well as Method 2 and Method 3 (Wilcoxon signed-rank test, p = 0.0020), indicating that the combination of alignment and indexing-based approaches for host filtration outperforms indexing based approaches alone. For microbial reads lost (Fig. 3b), we found significant differences across all three methods (Wilcoxon signed-rank test, p = 0.0020 for all comparisons). We find that the indexing-based host filtration approach alone (Method 3) retains the greatest number of microbial reads, while alignment-based steps, as in the initial steps of Method 1 and Method 2, inadvertently discard an increasing number of microbial reads proportional to the number of human references used for alignment. Although Method 2 was most effective at removing human reads, it also removed 242.5 and 288.5 more microbial reads on average compared with Method 1 and 3, respectively. In contrast, Method 3 maximized the number of microbial reads kept, losing only 43.5 microbial reads on average, but also allowed an average of 4.5 human reads through. Method 1 struck a balance, losing 89.5 microbial reads on average and eliminating all the human reads in 8 out of the 10 cases. We found that the prior Sepich-Poore et al. 7 method performed identically to Method 2 regarding the number of human reads removed (Supplementary Fig. 3a) and unnecessarily removed an additional ten microbial reads (Supplementary Fig. 3b). Because host filtration is used in a wide range of applications, it is crucial to allow users to choose between methods and determine if, for a given application and regulatory environment, it is acceptable to lose more microbial reads while ensuring maximum human read removal; conversely, one may want to maximize the number of microbial reads retained while still removing the majority of host reads. We note that microbial reads may be lost inadvertently due to sequence similarity between microbial input reads and human reference databases when using both alignment and indexing-based approaches (see Fig. 3 and Supplementary Fig. 4).

Fig. 3: Host filtration pipeline simulated data validation.
figure 3

Using the 10 simulated datasets of 1 million reads as described in Fig. 2b, we a calculated the number of human reads remaining, and b number of microbial reads remaining, for host filtration Methods 1–3 (HPRC host filtration performed excluding the 10 genomes used for data simulation). HG38: GRCH38.p14, T2T: T2T-CHM13v2.0, HPRC: Human Pangenome Reference Consortium 2024 release. Box plots show the median (center line), interquartile range (IQR; Q1–Q3; box), whiskers extending to Q1 − 1.5 × IQR and Q3 + 1.5 × IQR, minimum and maximum values at whisker ends, and points representing individual observations both within and beyond the whisker range.

Application of host filtration methods to low and high microbial biomass samples

To determine the robustness of these three methods across a range of microbial biomasses, we evaluated each method on human exome data as well as tissue, skin, and fecal metagenomic samples. First, we obtained 30 International Genome Sample Resource (IGSR) phase 3 human exome sequencing samples29, which are putatively human. After sampling 1 million reads each, we examined the number of human reads remaining, with an ideal host filtering method having zero reads left. We found Method 2 left the smallest amount of human exome reads followed by Method 1, then Method 3 (average reads remaining; Method 1: 32.66, Method 2: 24, Method 3: 351.53). There were significant differences between Method 1 and Method 2 (Wilcoxon signed-rank test, p = 2.6e−05), between Method 2 and Method 3 (Wilcoxon signed-rank test, p = 8.2e−06), and between Method 1 and Method 3 (Wilcoxon signed-rank test, p = 3.8e−05) (Fig. 4a, Supplementary Data 6). Mirroring the distributions seen in human simulated data benchmarks (Fig. 3a), Method 2 removed the largest number of human sequences, followed by Method 1, then Method 3. Interestingly, we found nearly ten times as many human exome reads remained compared to the simulated human data (Fig. 3a). However, without access to the samples, it is not possible to determine whether the increased number of reads in the human exome data compared to the simulated human data is due to real microbial presence (contamination or biological) in the exome sample, imperfect amplification or selection chemistry, and/or reduced performance of the host filtration procedure.

Fig. 4: Comparing human exome and tumor tissue samples across host filtration methods.
figure 4

a The number of reads remaining after host-filtering 30 human exomes subset to 1 million reads across methods. b 100 metastatic colorectal cancer tissue samples were selected from HMF and read counts were calculated following application of improved host filtration methods. HG38 GRCH38.p14, T2T T2T-CHM13v2.0, HPRC Human Pangenome Reference Consortium 2024 release. Box plots show the median (center line), interquartile range (IQR; Q1–Q3; box), whiskers extending to Q1 − 1.5 × IQR and Q3 + 1.5 × IQR, minimum and maximum values at whisker ends, and points representing individual observations both within and beyond the whisker range.

Using these three host filtration methods, we re-analyzed the aforementioned 100 colorectal tissue tumor samples from HMF, finding additional human reads removed compared to T2T-CHM13v2.0 alone (Fig. 4b, Supplementary Data 6). For HMF total read count following host filtration, we found significant differences between Method 1 and Method 2 (Wilcoxon signed-rank test, p = 3.9e−18), between Method 2 and Method 3 (Wilcoxon signed-rank test, p = 1.2e−17), and between Method 1 and Method 3 (Wilcoxon signed-rank test, p = 3.9e−18). Again, Method 2 has the least reads followed by Method 1 and then Method 3 (average reads remaining; Method 1: 84,663.12, Method 2: 84,009.03, Method 3: 84,692.71). Although we cannot verify if the remaining reads are all microbial, we can conclude, based on the simulations, that Method 2 likely has lower read counts due to removal of true microbial reads.

Next, we applied our host filtration methods to mNGS data from skin samples, where microbial and human DNA would be expected in varying proportions. Specifically, we analyzed 77 skin swab samples from pediatric healthy controls and subjects with atopic dermatitis (Fig. 5a, Supplementary Fig. 5, Supplementary Data 6). The percentage of non-human reads remaining across skin samples varied, consistent with distinct levels of host background within each sample, with Method 2 providing the lowest percentage of reads remaining, followed by Method 1, then Method 3. For the total percentage of reads remaining of these skin samples following host filtration, we found significant differences across all three methods (Wilcoxon signed-rank p = 2.5e−14 for all comparisons).

Fig. 5: Comparing human skin and fecal samples across host filtration methods.
figure 5

a 87 human skin samples were host-filtered with the improved methods, we then calculated the percentage of reads remaining. b We calculated the percentage of reads remaining on a per-sample basis for each of the 50 human fecal samples examined. HG38: GRCH38.p14, T2T: T2T-CHM13v2.0, HPRC: Human Pangenome Reference Consortium 2024 release. Box plots show the median (center line), interquartile range (IQR; Q1–Q3; box), and whiskers extending to Q1 − 1.5 × IQR and Q3 + 1.5 × IQR. Box plots show the median (center line), interquartile range (IQR; Q1–Q3; box), whiskers extending to Q1 − 1.5 × IQR and Q3 + 1.5 × IQR, minimum and maximum values at whisker ends, and points representing individual observations both within and beyond the whisker range.

Lastly, we evaluated a high microbial biomass dataset of 50 fecal samples from older adults consisting of healthy controls and subjects with Alzheimer’s disease30 (Fig. 5b, Supplementary Data 6). As expected, we observed nominal reductions in the percentage of total reads, although still greater than 1% of reads. For the total percentage of reads remaining in these fecal samples following host filtration, we found significant differences between all three methods (Wilcoxon signed-rank test, p = 7.6e−10 for all comparisons), and the same pattern of Method 2 having the lowest percentage of reads remaining followed by Method 1, then Method 3.

Proper host filtration protects against private genomic data leakage

Improper host filtration of metagenomic samples can leak sensitive genomic information. In a recent study, Tomofuji et al.8 re-identified patients from human reads that leaked through fecal metagenomic data, matching them to blood-derived genotype data from the same individuals (Supplementary Data 1). Their study initially used host filtration steps derived from traditional filtration methods8. To test the effectiveness of our approaches to disrupt a host re-identification signal, we applied the above methods to the 343 fecal samples from Tomofuji et al.8 re-filtering host data with steps outlined in Methods 1 and 2. Using the 343 paired genotype samples to test whether re-identification (from the fecal samples) was still possible (see Methods for details), we found that filtration with any combination of two human references (GRCh38.p14, T2T-CHM13v2.0, HPRC) was sufficient to prevent patient re-identification, haplotype reconstruction, and phenotype prediction (Fig. 6). These data demonstrate the importance of thorough host filtration prior to public upload of mNGS data while providing computationally efficient tools to do so.

Fig. 6: Re-identification from a set of genotype data based on the human reads in fecal samples prevented with improved host filtration.
figure 6

The 343 fecal samples from Tomofuji et al. Nature Microbiology 2023, with paired genotype data, were re-analyzed with various combinations of updated host filtration methods (GRCh38.p14, T2T-CHM13v2.0, Human Pangenome Reference Consortium 2024 release) resolving host data leakage. The x-axis of the plots indicates the number of bases used for the calculation of the likelihood scores. The y-axis of the plot indicates the two-sided P values calculated using a standard normal distribution based on the standardized likelihood scores. The red and blue dashed lines indicate p = 4.3 × 10−7 (0.05/117,649 tests) and p = 1.5 × 10−4 (0.05/343 tests), respectively. The results of the 117,649 tests (343 genotype data × 343 metagenome data) are indicated as the colors of the points. Some samples could not be used for the re-identification analysis because too few reads remained after filtering, hence the fewer dots shown across host filtration methods. Full description on the calculation of P values can be found in the Methods.

Discussion

While host filtration and host depletion are important steps of careful mNGS analyses, only host filtration has relevant use cases for thousands of already-generated human datasets, especially those initially generated without the original intent of microbial analysis. Through processing whole genome sequencing (WGS) data from metastatic tumor samples, we incidentally identified the impact of insufficient host filtration through artifactual sex biases that human DNA introduced in downstream analyses. These biases had larger impacts on abundance-based metrics and can be mitigated by using qualitative approaches. Nonetheless, such biases introduced by insufficient host filtration likely persist for other metagenomic sequencing datasets, particularly those generated prior to the release of the T2T-CHM13v2.0 reference genome containing the full Y chromosome. Thus, it remains prudent to continue developing and refining techniques to easily scrub human DNA from pre-existing and future mNGS datasets.

Beyond biasing microbial data, an additional consequence of insufficient host filtration in human studies is the retention of personally-identifying human DNA sequences in mNGS datasets. Many mNGS data generation and usage agreements dictate that metagenomic applications of whole genome sequencing will be used to generate and analyze microbial DNA, not human DNA: this is particularly relevant in clinical environments where patients may consent to microbial analysis via mNGS of samples derived from a human host but may not consent to their host genomic content analysis31. Moreover, patients are often unaware that untargeted sequencing approaches intended for microbial study will also sequence some amount of host DNA, and failure to comprehensively remove host DNA from metagenomic sequencing data may violate data usage and patient consent agreements by inadvertently enabling deanonymization and reidentification. This is particularly important when depositing microbiome datasets in public repositories, which typically do not require restricted/controlled access8. Even when efforts are devoted to preventing re-identification of individuals (for example, by only sharing summary-level data), others have shown that re-identification of participants from specific GWAS is still possible, or even in DNA mixtures where an individual contributes less than 0.1% of the total genomic material32. Therefore, as sequencing methods continue to provide higher mNGS throughput, it is imperative that computationally efficient techniques countering their unintended privacy consequences are made available.

To address this need, we have proposed an efficient, customizable, and effective host filtering pipeline that accurately separates human and microbial reads from mNGS datasets, with demonstrated applications across real samples of varying biomass. We recommend using Method 1, which is time efficient (Fig. 2b) while leaving a majority of the microbial reads (Fig. 3b), but removes enough human reads to disrupt subject reidentification. In cases where the maximum amount of human reads must be removed, even at the cost of losing microbial reads, then Method 2 would be more appropriate. Finally, if a user wants to maximize the remaining microbial reads, then Method 3 would be best. Given the scalability of Movi27, we anticipate that additional human references can be easily incorporated as they are released without major impacts on runtime, making the approach future proof.

Tomofuji et al. describe the process of performing patient reidentification using relatively few human DNA sequences retained in fecal sequencing data by combining paired genotype data8. Using their data, we demonstrated how our methods can prevent patient reidentification, thereby protecting patient privacy. As interest in metagenomic studies of human biological processes grows, increased emphasis should be placed on applying end-to-end privacy-protecting methodologies, inclusive of computational workflows. Sequencing human DNA is a byproduct of mNGS, even in fecal samples, and may persist despite molecular host depletion protocols. Thus, applying computational host filtration techniques remains imperative when performing (or uploading) human-associated microbiome studies.

All host filtration methods remain imperfect due to the under-explored genomic diversity of the human population and the concomitant lack of complete and individualized human reference genomes. Our proposed workflows incorporate more genetic diversity than any computational host filtration approach to date while remaining computationally efficient. Since existing reference databases do not account for the complete set of human genetic variation or non-germline sequence variants frequently found in cancer and other diseases, read-based host filtration approaches may leave a small number of human reads in the data while performing a negative selection for human DNA.

Although removing human reads is an important part of downstream microbial analysis data, this is just one part of the puzzle in properly detecting the true microbial profiles of low biomass human samples. Many other factors, including various ways the sample may become contaminated from the collection process through sequencing, require other tools and strategies beyond our host filtration pipeline to be accounted for.

If researchers desire extra protections to ensure no human reads are inadvertently mapped, a positive selection for microbial reads can be performed using reference databases confirmed to be fully microbial. We conducted this type of analysis using a microbial database scrubbed of human reads derived from Sepich-Poore et al.7 and similarly noted a resolution of the artifactual sex-difference effect with the cleaned database alone. Additionally, researchers may choose to use a broader range of microbial reference databases beyond RefSeq, which may have more low-complexity regions masked, potentially eliminating some of the mismapping issues leading to sex differences. Additionally, we provide a list to the community of ‘false positive taxa’ we identified with the addition of T2T-CHM-13v2.0 filtering across common microbial databases (RefSeq release 200 and 21033, Web of Life version 223, Genome Taxonomy Database release 22034) as a resource to the community (Supplementary Data 7). However, we caution that cleaning microbial reference databases or using alternative microbial databases in principle cannot address the retention of human reads in metagenomics datasets due to bias from incomplete representation of variation in the human genome.

Nevertheless, this work highlights the importance of and provides appropriate tools for thorough host filtration to mitigate false alignments and erroneous conclusions. The methods here provide an important starting point for conducting host filtration using state-of-the-art methods while being readily expandable for future improvements and reference databases.

Methods

All participants from which fecal and skin samples were derived provided informed consent, and protocols were approved by the Institutional Review Boards at the University of Wisconsin-Madison (#2015-0030) and the University of California San Diego (#200844) respectively.

Human references

Human reference genomes GRCh38.p7, GRCh38.p14, and T2T-CHM-13v2.0 were retrieved from NCBI (Supplementary Data 1). All 94 currently published reference assemblies from the Human Pangenome Reference Consortium website. More information on downloading each reference can be found in Supplementary Data 1.

HMF data processing

The Hartwig Medical Foundation15 (HMF) performed DNA sequencing of tumor tissue (n = 9973 samples) and mapped reads to reference genome GRCh37.p13 using BWA-MEM35 (v. 0.7.x) to create BAM files. Pre-aligned BAM files were downloaded from HMF in October 2021, and unmapped reads were extracted from the BAM files. The files were then filtered through fastp36 (v. 0.20.1) with a length cutoff of 45 bp minimum and default adapter removal, then using minimap2 (v. 2.17) mapped to either GRCh38.p737 or GRCh38.p7 + T2T-CHM13v2.016 human databases. Finally, samtools38 (v. 1.11) was used to extract reads which did not align to the human reference. The full command used is: fastp -l 45 -i $R1 -I $R2 -w 16 --stdout | minimap2 -ax sr -t 16 $human_database - -a | samtools fastq -@ 16 -f 12 -F 256 -1 $R1_out -2 $R2_out. Following host filtration, reads were aligned to the RefSeq release 200 database using the SHOGUN39 protocol with Woltka23 in the Qiita40 platform. RefSeq33 release 200 includes the NCBI representative and reference microbial genomes corresponding to release date 2020-05-14. Dimensionality reduction of the corresponding BIOM table41 was then performed through Gemelli’s RPCA function (v. 0.0.6)18 to create a distance matrix on which PERMANOVA42 differences across sex and a Robust Aitchison PCA plot were created, both using QIIME243 (v. 2022.2.0).

Parameters used in alignment and indexing host filtration

For all three methods, raw FASTQ files are quality filtered using fastp36 (v. 0.23.4) with a length cutoff of 45 bp minimum and subject to adapter removal using the full list of adapters hardcoded in fastp (as opposed to relying on fastp’s automated detection which is limited to a subset of sequences).

For host filtration methods that utilize sequence alignment, we generated individual minimap225 (v. 2.26) indexes using default parameters for each individual human reference genomes. Sequence alignment was then performed with parameters: -2 -ax sr. To host filter, sequences were aligned sequentially to each genome discarding reads which mapped. We used samtools38 (v. 1.19) to reverse unmapped sequences from SAM back into FASTQ format (with arguments -f 12 -F 256 -N for paired-end data and -f 4 -F 256 for single-end data) after each consecutive alignment. Finally, for paired-end data we used fastq-pair44 (v. 0.4) to sort and match filtered read pairs into individual files.

For host filtration methods that utilize indexing, we generated a full index over all 94 published pangenome references from HPRC release 2023, along with GRCh38.p14 and T2T-CHM-13v2.0 using Movi27 (unversioned; git commit hash 76d5a6da1ec0aeb0121b5ac7c59b295936e23cc1). Movi generates pseudo-matching lengths (PMLs) which are approximations of sequence similarity between the query and the index. We used movi-default to generate PML distributions for each queried read and explored several different mathematical transformations of the resulting PML distributions into singular per-read scores. The PML distributions produced by Movi roughly approximate matching statistics previously validated for sequence classification tasks27. Thus, we reasoned that reads with larger PML distribution values had higher similarity to human genomic regions within the index than those with lower values. We tested several transformations of PML distributions into summative scores and devised approach-specific threshold values based on a theoretical human read of length 150 with a singular contiguous matching run of length 31. First, we utilized the maximum PML score within the distribution as the test value and thus computed the threshold for the maximum approach as 31. Next, we calculated the average PML score within the distribution as the test value and thus computed the threshold for the average approach as 3.306. To maximally distinguish PML distributions that feature discontinuous runs of matching nucleotides, we devised a custom metric that magnifies the summative score for read distributions with long stretches of matches above a minimum run-length threshold (denoted \(w\)). In concordance with the previous maximum and average metrics, we computed the threshold for the custom approach as 0.175 with a minimum run-length value of 5. Through experimental validation on mixed human/microbe datasets, we found that our custom metric had the best discriminative performance. To verify this approach, we ran a grid search over various thresholds for the custom metric (thresholds: 0.145 to 0.200 by increments of 0.005; minimum run-length: 2 to 12 by increments of 1) on our simulated data for which we had labeled ground truth (see “Simulated data”). The results showed strong recall for our use cases at the default threshold, but we acknowledge that some users may prefer higher stringency on human DNA removal even at the cost of inadvertent microbial DNA removal. Thus, we implemented both the metric (“maximum”, “average”, or “custom”) and the numerical threshold as configurable options in our host filtration pipeline to meet the needs of all users. The equation for the custom metric is denoted below:

$$ \frac{1}{2L}\left(\max \left(\right.{PML\; distribution}\right) \\ +\left(\left({\sum}_{r\epsilon R\,}r\right)\cdot \log (\left|R\right|+1\left)\right.)\right){where}\,r\epsilon \,R\,{if\; len}\left(r\right)\, > \,w$$
(1)

Host filtration benchmarking data processing

We benchmarked three methods to compare different combinations of the aforementioned alignment-based and indexing-based host filtration approaches. We use minimap2 for alignment based on its support for this application in prior work11.

Method 1

Step i (filled circle): aligned reads to human reference GRCh38.p14 with minimap2 (v. 2.26), then used samtools (v. 1.19) to extract reads that did not align to the human reference. Step ii (filled square): aligned remaining reads from step i to human reference T2T-CHM13v2.0 with minimap2 (v. 2.26), then used samtools (v. 1.19) to extract reads that did not align to the human reference. Step iii (star): matched remaining reads from step ii to an aggregated human reference set consisting of GRCh38.p14, T2T-CHM13v2.0, and the 94 HPRC pangenomes using indexing-based filtration with Movi, as described above.

Method 2

Step i (filled star): aligned reads sequentially to GRCh38.p14, T2T-CHM13v2.0, and the 94 HPRC pangenomes with minimap2 (v. 2.26), then used samtools (v. 1.19) to extract reads that did not align with each iteration. Step ii (star): matched remaining reads from step i to an aggregated human reference set consisting of GRCh38.p14, T2T-CHM13v2.0, and the 94 HPRC pangenomes using indexing-based filtration with Movi, as described above.

Method 3

Step i (star): matched reads to an aggregated human reference set consisting of GRCh38.p14, T2T-CHM13v2.0, and the 94 HPRC pangenomes using indexing-based filtration with Movi, as described above.

Although all Methods use GRCh38.p14, T2T-CHM13v2.0, and HPRC as part of the Movi indexing approach, we want to emphasize that only Method 2 also uses them for minimap2-based alignment. Although minimap2 and Movi are both useful tools for linking short reads to reference genomes, their underlying algorithms are distinct and thus the two tools produce differing results (Fig. 3) and runtimes (Fig. 2b). Minimap2 uses a traditional seed-chain-align approach to identify exact matches to a reference, while Movi uses the Move structure introduced by Nishimoto and Tabei in 2021 to compute pseudo-matching lengths to a reference45. Combining minimap2 and Movi utilizes the respective strengths of each of their implementations, and appears to result in the highest number of reads removed.

We calculated the number of microbial, bacterial, eukaryota, viral, and archaea genome frequencies listed in Supplementary Data 6 using the SHOGUN39 protocol via the Qiita40 platform with the RefSeq release 210 (2022-01-01) database. We listed forward and reverse read counts separately in Supplementary Data 6 to directly report the number of reads mapped to microbial taxa. Read counts listed throughout all other portions of the manuscript count forward and reverse reads as a single count. Additionally, in the case of the Tissue Samples from Various Metastatic Cancer, the ‘Before host filtration number’ is following GRCh38.p7 host filtration and not the purely raw reads as in the other cases.

As a resource to the community, we calculate microbial alignments from reads retained following GRCh38.p14 filtration, but removed with the addition of T2T-CHM-13v2.0 filtration. We do this by tabulating the additional reads removed from the aforementioned 100 colorectal tissue tumor samples from HMF when incorporating T2T-CHM-13v2.0 for host filtration (Method 1 step ii) referred to as “T2T-filtered” reads above. We aligned these removed reads to a diverse set of microbial reference databases – RefSeq release 20033, RefSeq release 21033, Web of Life release 223, and GTDB release 22022 – and reported the resulting spurious microbial alignments (Supplementary Data 7).

Statistics and reproducibility

We used two-sided Wilcoxon signed-rank tests using SciPy46 (v. 1.8.0) to assess differences in medians between reads retained or removed across differing methods. Because this test involves ranking the absolute differences between pairs, and since reads retained or reads removed tend to either decrease or increase respectively across pairs, the resulting test statistic reported is 0 for all comparisons. For all boxplots, throughout the figures, the box represents the interquartile range (IQR), with the centerline being the median and the top and bottom of the box representing Q1 and Q3. Boxplots were generated using matplotlib47 (v. 3.8.0), and the whiskers of the plot were left at matplotlib defaults, making them +/−1.5 from the IQR. Outliers were removed from the boxplots since a scatter plot with all dots was overlaid. To facilitate computationally efficient benchmarking of the host filtration methods, we analyzed a subset of n = 100 samples from the HMF dataset, and n = 50 samples from the Alzheimer’s disease fecal dataset while using complete datasets for all other sample types. All p values are rounded two decimal places, and all test statistics are reported with three significant figures unless otherwise specified.

Simulated data

Data was simulated using ART Illumina46,48 (v. 2.5.8). Human reads were simulated using HPRC genomes, and microbial reads were simulated from FDA-ARGOS28. Supplementary Data 1 further describes simulated dataset accession.

Re-identification analysis with updated host filtration methods

For the re-identification analysis, we utilized human reads extracted from mNGS data and imputed SNP array data in the previous study8. The main steps in the human read extraction were as follows: (i) trimming of low-quality bases, (ii) identification of candidate human reads, (iii) removal of duplicated reads, and (iv) removal of the potential bacterial reads. We trimmed the raw reads to clip Illumina adapters and cut off low-quality bases using the Trimmomatic49 (v. 0.39; parameters: ILLUMINACLIP:TruSeq3-PE-2.fa:2:30:10:8:true TRAILING:20 MINLEN:60). We discarded reads less than 60 bp in length after trimming. Then, we mapped the trimmed reads to the human reference genome (GRCh37, human_g1k_v37_decoy) using bowtie250 (v. 2.3.5.1) with the ‘—no-discordant’ option and retained only the properly mapped reads. Next, we performed duplicate removal by Picard MarkDuplicates (v. 2.22.8) with ‘VALIDATION_STRINGENCY = LENIENT’ option. Finally, we mapped the duplicate removed reads to the bacterial reference genome set constructed in Kishikawa et al.51. This reference was composed of the 7881 genomes including those derived from Nishijima et al.52 and those identified in the cultivated human gut bacteria projects53,54,55. We kept only reads of which both paired ends failed to align. The resulting reads were defined as human reads and used in the subsequent analyses. Then, extracted human reads were subjected to the host filtration methods, namely the first steps two steps of Method 1: GRCh38.p14 alignment, and GRCh38.p14/T2T-CHM13v2.0 alignment, along with the first step of Method 2: GRCh38.p14/T2T-CHM13v2.0/HPRC alignment.

For the re-identification analysis, we utilized likelihood score-based method introduced in the previous study8. We calculated the likelihood that each sample in the genotype dataset produced the observed human reads in the fecal samples from two input data; (i) human reads in the gut-derived mNGS data which were mapped to the human reference genome and (ii) genotype dataset of the multiple samples. We extracted the SNP sites which were covered by at least a read and included in the reference panel by ‘bcftools mpileup’56 with the ‘-T’ option. To get independent SNP sites, we applied clumping to the list of the SNPs which were covered by at least a read. We used ‘–indep-pairwise 100 30 0.1’ option in PLINK for clumping at Rsq = 0.1. Then, we calculated the likelihood according to the model proposed in Li et al.57. Suppose an SNP site \(i\) was covered by \({n}_{i}\) reads in the gut-derived mNGS data, \({k}_{i}\) reads were from the reference allele, and \({n}_{i}-{k}_{i}\) reads were from the alternative allele. bcftools56 (v. 1.10.2) was used to calculate the read coverage with ‘-q 40 -Q 20’ options. The error probability of the read bases was \(\varepsilon\) and error independency was assumed. In this study, \(\varepsilon\) was set at 1 × 10−6 following the assumption in Li et al.57. At the SNP site \(i\), the number of the alternative allele of an individual \(j\) (\({g}_{i,j}\)) could be 0 (Ref / Ref), 1 (Ref / Alt), or 2 (Alt / Alt). Then, the likelihood that the sample with a \({g}_{i}\) alternative alleles at SNP site \(i\) produced the observed human reads in the gut-derived mNGS data was expressed as

$${L}_{i,j}({g}_{i,j},{n}_{i},{k}_{i})=\frac{1}{{2}^{{n}_{i}}}[(2-{g}_{i,j})\varepsilon+{g}_{i,j}({1-\varepsilon })]^{{n}_{i}-{k}_{i}}{[\left(\right.{g}_{i,j}\varepsilon+(2-{g}_{i,j})(1-\varepsilon )]}^{{k}_{i}}$$
(2)

When the clumping procedure retained \(N\) independent SNP sites, a log-transformed likelihood (likelihood score; \({LS}\)) that a genotype data produced the observed human reads in the gut-derived mNGS data was expressed as

$${{LS}}_{j}={\sum }_{i=1}^{N}\log \left({L}_{i,j}\left({g}_{i,j},{n}_{i},{k}_{i}\right)\right)$$
(3)

Next, we drew the background distribution of the likelihood score from (i) human reads in the gut-derived mNGS data which were mapped to the human reference genome, and (ii) allele frequency data for the SNP sites used for calculating the likelihood score. In this study, Japanese subjects in the combined reference panel of 1KG Project Phase 358 version 5 genotype (n = 104) and Japanese WGS data (n = 1037) were used to calculate the allele frequency59. When an alternative allele frequency at SNP site \(i\) was \({p}_{i}\) and the number of the alternative allele was \({g}_{i,{pop}}\) (=0, 1, or 2), theoretical genotype frequencies at SNP site \(i\) were expressed as

$$P\left({g}_{i,{pop}},{p}_{i}\right)=\left\{\begin{array}{c}{(1-{p}_{i})}^{2},({g}_{i,{pop}}=0)\hfill \\ 2{p}_{i}\left(1-{p}_{i}\right),\,({g}_{i,{pop}}=1)\\ {{p}_{i}}^{2},({g}_{i,{pop}}=2)\hfill \end{array}\right.$$
(4)

Then, the expected log transformed likelihood that a genotype data randomly drawn from the specified population produced the observed human reads in the mNGS data was expressed as

$$E\left({{LS}}_{{pop}}\right)={\sum }_{i=1}^{N}E\left({{LS}}_{i,{pop}}\right)\,={\sum }_{i=1}^{N}{\sum }_{{g}_{i,{pop}}=0}^{2}P\left({g}_{i,{pop}},{p}_{i}\right){\log }({L}_{i}({g}_{i,{pop}},{n}_{i},{k}_{i}))$$
(5)

Given that SNP sites were independent, the variance of the likelihood score in a specific population was expressed as

$$ V\left({{LS}}_{{pop}}\right)={\sum }_{i=1}^{N}V\left({{LS}}_{i,{pop}}\right) \\ ={\sum }_{i=1}^{N}{\sum }_{{g}_{i,{pop}}=0}^{2}P\left({g}_{i,{pop}},{p}_{i}\right)\bigg[{\log }({L}_{i}\left({g}_{i,{pop}},{n}_{i},{k}_{i}\right))-{E\left({{LS}}_{i,{pop}}\right)\bigg]}^{2}$$
(6)

Using \(E\left({{LS}}_{{pop}}\right)\) and \(V\left({{LS}}_{{pop}}\right)\), we calculated the standardized likelihood score of the individual \(j\) as \(\frac{({{LS}}_{j}-E\left({{LS}}_{{pop}}\right))}{\sqrt{V\left({{LS}}_{{pop}}\right)}}\). We transformed standardized likelihood scores to P values based on the normal distribution. We identified the pair of the gut-derived mNGS and genotype data (imputed SNP array data was used in this study) derived from the same individuals based on the P values.

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.