Fig. 2: Overview of the OMArk methodology.
From: Quality assessment of gene repertoire annotations with OMArk

a, Sequences from the query proteome are placed into known hierarchical orthologous groups (HOGs) using the k-mer-based fast-mapping method OMAmer. Shown is a gene tree with nested gene families (HOGs), delineated by speciation and duplication events. OMAmer provides accurate placement of protein sequences in their correct subfamily. b, The specific taxon of the query species is automatically determined by OMArk. Here, the species tree is shown, with protein placements represented by red dots. In a typical scenario, the size of the dots is logarithmically proportional to the number of placements; it is simplified in this schematic. The path to the query taxon (blue) is inferred based on the maximal number of placements, and the path(s) to contaminant taxa (gold) are determined as those with more placements than expected by chance. c, OMArk defines the ancestral reference lineage for a given query species as the most recent taxonomic level that includes the species and is represented by at least five species in the OMA database. Here, a species tree is shown with colored bars representing individual genes. d, The conserved and lineage-specific gene sets. The conserved repertoire contains all the HOGs defined at the reference ancestral level that cover at least 80% of the species in the clade. These are gene families inferred to be present since the common ancestor. The lineage repertoire is a superset of the conserved repertoire, with the addition of genes that originated later in the lineage and are still present in at least one species in the OMA database. In the repertoires, genes from the different species are grouped into their HOGs. e, OMArk assesses completeness by comparing the conserved ancestral repertoire to the query protein sequences and classifying them as single copy, duplicated or missing. f, OMArk assesses consistency by comparing the query protein sequences to the lineage repertoire and classifying them as taxonomically consistent, inconsistent, unknown or contaminant.
OMArk also assesses gene model structure by classifying query proteins as partial mapping or fragment. Shapes of species shown in a and b reprinted from Phylopic (www.phylopic.org). Silhouettes of Homo sapiens and Canis familiaris dingo by T. M. Keesey (public domain), Pongo abelii by Gareth Monger (CC-BY 3.0), Pan troglodytes by J. Lawley (public domain) and Xenopus laevis by Ian Quigley (CC-BY 3.0). Silhouettes of Saccharomyces cerevisiae by W. Decature (public domain), Laccaria by R. Percudani (public domain), Caenorhabditis elegans by J. Warner (public domain) and Mus musculus by S. Miranda-Rottman (CC-BY 3.0).
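
The thresholds described in panels c to e can be illustrated with a minimal Python sketch. It is a toy model and not OMArk's implementation or API: the function names, the dictionary-based inputs and the taxon, species and HOG identifiers are all hypothetical, and the real tool works on OMAmer placements and handles cases (such as expected duplications) that are ignored here.

```python
from collections import Counter

MIN_SPECIES = 5      # panel c: clade must contain at least five species
MIN_COVERAGE = 0.80  # panel d: HOG must cover at least 80% of the clade's species


def pick_reference_level(lineage, species_per_taxon):
    """Return the most recent taxon with enough species.

    `lineage` is ordered from the most recent taxon to the oldest
    (hypothetical input format).
    """
    for taxon in lineage:
        if species_per_taxon.get(taxon, 0) >= MIN_SPECIES:
            return taxon
    return None


def conserved_repertoire(hog_species, clade_species):
    """Keep HOGs found in at least MIN_COVERAGE of the clade's species."""
    n = len(clade_species)
    return {
        hog
        for hog, species in hog_species.items()
        if len(species & clade_species) / n >= MIN_COVERAGE
    }


def completeness(conserved, placements):
    """Label each conserved HOG by how many query proteins were placed in it."""
    counts = Counter(hog for hog in placements if hog in conserved)
    labels = {}
    for hog in conserved:
        if counts[hog] == 0:
            labels[hog] = "missing"
        elif counts[hog] == 1:
            labels[hog] = "single copy"
        else:
            labels[hog] = "duplicated"
    return labels


# Toy data with made-up identifiers.
lineage = ["taxonA", "taxonB", "taxonC"]          # most recent first
species_per_taxon = {"taxonA": 3, "taxonB": 6, "taxonC": 20}
reference = pick_reference_level(lineage, species_per_taxon)

clade = {"s1", "s2", "s3", "s4", "s5"}
hog_species = {
    "HOG1": {"s1", "s2", "s3", "s4", "s5"},  # 100% coverage
    "HOG2": {"s1", "s2", "s3", "s4"},        # 80% coverage
    "HOG3": {"s1", "s2"},                    # 40% coverage, excluded
    "HOG4": {"s1", "s2", "s3", "s4", "s5"},  # conserved, absent from query
}
conserved = conserved_repertoire(hog_species, clade)
labels = completeness(conserved, ["HOG1", "HOG2", "HOG2", "HOG3"])
print(reference, sorted(labels.items()))
```

In this toy run, the second taxon is chosen as the reference level because the first has fewer than five species, and the three conserved HOGs end up with one label each from the set used in panel e.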