Fig. 6: Re-identification from a set of genotype data based on the human reads in fecal samples prevented with improved host filtration. | Nature Communications

From: Incomplete human reference genomes can drive false sex biases and expose patient-identifying information in metagenomic data

The 343 fecal samples from Tomofuji et al. Nature Microbiology 2023, with paired genotype data, were re-analyzed with various combinations of updated host filtration methods (GRCh38.p14, T2T-CHM13v2.0, Human Pangenome Reference Consortium 2024 release) to resolve host data leakage. The x-axis of each plot indicates the number of bases used to calculate the likelihood scores. The y-axis indicates the two-sided P values calculated from the standardized likelihood scores using a standard normal distribution. The red and blue dashed lines indicate p = 4.3 × 10⁻⁷ (0.05/117,649 tests) and p = 1.5 × 10⁻⁴ (0.05/343 tests), respectively. The results of the 117,649 tests (343 genotype data × 343 metagenome data) are indicated by the colors of the points. Some samples could not be used for the re-identification analysis because too few reads remained after filtering, which is why fewer points appear under some host filtration methods. A full description of the calculation of P values can be found in the Methods.
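
The two dashed-line thresholds in the caption follow from dividing 0.05 by the number of tests, and the plotted P values come from standardized scores under a standard normal distribution. The following is a minimal sketch of that arithmetic only, not the authors' pipeline: the function names are hypothetical, and the two-sided P value is computed with the textbook identity P = erfc(|z|/√2), which is an assumption about the form of the calculation described in the Methods.

```python
import math

# Counts taken from the caption.
N_GENOTYPES = 343
N_METAGENOMES = 343


def bonferroni_threshold(alpha: float, n_tests: int) -> float:
    """Significance threshold after dividing alpha by the number of tests."""
    return alpha / n_tests


def two_sided_p(z: float) -> float:
    """Two-sided P value of a standardized score under a standard normal.

    Uses P = erfc(|z| / sqrt(2)); assumed form, not taken from the paper.
    """
    return math.erfc(abs(z) / math.sqrt(2.0))


# All genotype x metagenome pairs: 343 * 343 = 117,649 tests.
n_pairs = N_GENOTYPES * N_METAGENOMES

red_line = bonferroni_threshold(0.05, n_pairs)         # about 4.3e-7
blue_line = bonferroni_threshold(0.05, N_METAGENOMES)  # about 1.5e-4

if __name__ == "__main__":
    print(f"tests: {n_pairs}")
    print(f"red threshold:  {red_line:.2e}")
    print(f"blue threshold: {blue_line:.2e}")
    # Hypothetical standardized score, for illustration only.
    print(f"P for z = 5.5:  {two_sided_p(5.5):.2e}")
```

A point falls below the red line only if its P value survives correction across every genotype and metagenome pairing, whereas the blue line corrects for the 343 tests only.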