Fig. 3: OMArk results for simulated proteomes.
From: Quality assessment of gene repertoire annotations with OMArk

a–d, Three example species of the model dataset (left) and the representative dataset (right) are shown for each simulation. Each simulated error in panels a–d was applied to 10%–90% of the proteome (x axis). a, Simulated incompleteness. OMArk (top) and BUSCO (bottom) results for the datasets. Colors represent the parts of the conserved gene set that are found in a single copy (green), duplicated (light green) or missing (red). The simulated completeness corresponds to the percentage of the genome that has been randomly selected in each simulation. Horizontal black lines show the expected completeness (that is, the measured completeness for the source proteome). b, Erroneous sequence simulation. Colors represent proteins that map to the correct lineage (consistent, blue), map to another lineage (inconsistent, violet) or have no homologs (unknown, black). Hashes indicate structural inconsistency relative to the gene family: either partial mapping (black hashes) or fragmented genes (white hashes). The appended error (x axis) corresponds to the number of erroneous sequences added to the proteome, as a percentage of its original protein number. Horizontal red lines indicate the expected number of structurally and taxonomically consistent genes, considering the proportion in the source proteome and the known introduced error. c, Fragmented sequence simulation. The x axis corresponds to the percentage of the proteome that has been fragmented. Each gene in the pool of artificially fragmented genes is cut randomly to between 10% and 90% of the original length of the protein. Horizontal red lines indicate the expected number of nonfragmented, taxonomically consistent genes, considering the proportion in the source proteome and the known fragment rates; horizontal pink lines indicate this proportion if half of the fragments are detected. d, Fused sequence simulation. The x axis corresponds to the percentage of the proteome that has been fused.
Pairs of proteins are selected randomly and appended together to simulate fusion. Each fused protein is added to the proteome and the two original proteins are removed. Horizontal red lines indicate the expected number of structurally and taxonomically consistent genes, considering the proportion of the source proteome that has been fused. Ancestral lineages for the six species shown are Homo sapiens, Hominidae; Drosophila melanogaster, melanogaster subdivision; Arabidopsis thaliana, Brassicaceae; Mytilus coruscus, Lophotrochozoa; Reticulomyxa filosa, SAR (Stramenopiles-Alveolata-Rhizaria) supergroup; and Hibiscus syriacus, Malvaceae.
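
The perturbations described in panels a, c and d can be illustrated with a minimal Python sketch. This is not the authors' simulation code: the function names (`subsample`, `fragment`, `fuse`), the representation of a proteome as a dictionary of identifier to sequence, the choice of a contiguous window at a random position for fragments, and the reading of the fused percentage as the share of proteins taking part in a fusion are all assumptions made for illustration.

```python
import random


def subsample(proteome, fraction, rng):
    """Panel a (assumed form): keep a random `fraction` of the proteins."""
    k = round(len(proteome) * fraction)
    kept = rng.sample(sorted(proteome), k)
    return {pid: proteome[pid] for pid in kept}


def fragment(proteome, fraction, rng):
    """Panel c (assumed form): cut a random `fraction` of the proteins
    to between 10% and 90% of their original length."""
    out = dict(proteome)
    k = round(len(proteome) * fraction)
    for pid in rng.sample(sorted(proteome), k):
        seq = proteome[pid]
        keep = max(1, round(len(seq) * rng.uniform(0.1, 0.9)))
        # Assumption: the retained piece is a contiguous window at a random offset.
        start = rng.randint(0, len(seq) - keep)
        out[pid] = seq[start:start + keep]
    return out


def fuse(proteome, fraction, rng):
    """Panel d (assumed form): append random pairs of proteins; the fused
    protein is added and the two original proteins are removed."""
    out = dict(proteome)
    # Assumption: `fraction` is the share of proteins that take part in a fusion.
    n_pairs = round(len(proteome) * fraction) // 2
    chosen = rng.sample(sorted(proteome), 2 * n_pairs)
    for a, b in zip(chosen[0::2], chosen[1::2]):
        out[f"{a}+{b}"] = out.pop(a) + out.pop(b)
    return out


# Toy proteome: ten hypothetical proteins of 100 residues each.
rng = random.Random(0)
toy = {f"P{i:02d}": "M" + "A" * 99 for i in range(10)}
incomplete = subsample(toy, 0.5, rng)
fragmented = fragment(toy, 0.3, rng)
fused = fuse(toy, 0.4, rng)
```

Passing a seeded `random.Random` instance keeps each simulated proteome reproducible, which matters when the same perturbation level is compared across tools.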