Table 2 Overview of the steps and considerations in a typical workflow for generating new active enzyme variants
From: Computational scoring and experimental evaluation of enzymes generated by neural networks
| Step | Description | Examples and considerations |
| --- | --- | --- |
| Curate training data | Gather a list of natural sequences likely to have the target activity and to express in the target system. | In addition to UniProt and/or NCBI nr, search expanded databases, such as MGnify for prokaryotic enzymes or NCBI TSA for eukaryotic enzymes. |
| | | Pay attention to the ___domain content. Unusual ___domain content may indicate neofunctionalization. In some cases, the ___domain with the activity of interest will retain its function in the absence of the other domains; therefore, it may be safe to remove extraneous domains. |
| | | Pay attention to the presence of localization tags and transmembrane domains. In many cases, these interfere with expression. In some cases, they can be removed without impacting enzyme function. |
| | | Filter out unusually short or long sequences, or sequences with other indications that they may be pseudogenes, fragments or derived from poor gene calling. |
| | | Use HMM-profile or structure searches in addition to sequence searches to find a broader diversity of training sequences. |
| | | Use a clustering algorithm, such as CD-HIT, to reduce the overrepresentation of enzymes from highly sequenced phyla. |
| Generate new sequences | Use generative models to generate additional members of the enzyme family. Most of these generative models rely on training or fine-tuning on the natural sequences curated in the first step of the workflow. | Ancestral sequence reconstruction (ASR) |
| | | Generative adversarial networks (ProteinGAN) |
| | | Language models (such as ProtGPT2, ProGen or ESM-MSA) |
| | | Variational autoencoders (VAEs) |
| | | Direct coupling analysis (DCA)-based methods |
| | | Inverse folding models (ProteinMPNN, ESM-IF) |
| Select sequences | Select a subset of natural and generated sequences for experimental evaluation. In campaigns where all sequences are natural or ancestral reconstructions, random selection of candidates may be effective, particularly if care is taken in training-data curation. For generative models that produce a lower proportion of active sequences, additional filtering may be required. | Randomly select candidate sequences. |
| | | Select sequences with high similarity to the best candidates from previous screening rounds or the literature (phylogeny-based selection). |
| | | Select sequences with mutations known to be associated with the target phenotype. |
| | | The same curation criteria used for natural sequences are also applicable to generated sequences. |
| | | Additional criteria may be used to address failure modes common to the generative models used; for example, models may tend to produce overly short or repetitive sequences. |
| | | Sequences can be scored and ranked based on various metrics. Reasonable score thresholds for these metrics can be estimated from natural sequences; alternatively, candidates can be selected from the highest-scoring sequences. |
| | | In this study, we settled on a filter composed of six criteria. |
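The curation step's filtering of duplicates and length outliers can be sketched in a few lines of Python. This is a minimal illustration, not the authors' pipeline: the `curate` helper and the ±20% length bounds around the median are illustrative assumptions, and a real campaign would tune such bounds per enzyme family.

```python
from statistics import median

def curate(seqs, low=0.8, high=1.2):
    """Toy curation pass: drop exact duplicates and length outliers.

    `low`/`high` are assumed bounds around the median length; sequences
    outside them are treated as likely fragments or fusions.
    """
    seqs = list(dict.fromkeys(seqs))            # deduplicate, keeping order
    m = median(len(s) for s in seqs)
    return [s for s in seqs if low * m <= len(s) <= high * m]

# Toy family: one duplicate and one short fragment should be removed.
family = [
    "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ",
    "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ",   # exact duplicate
    "MKTAYIAKQRQISFVKSHFARQLEERLGLIEVQ",   # close homologue, kept
    "MKTAYIAKQRQI",                        # unusually short, likely a fragment
]
kept = curate(family)
```

In practice the same filter would also consult annotations (pseudogene flags, gene-calling quality) that a length check alone cannot capture.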
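The redundancy-reduction step the table delegates to CD-HIT can be illustrated with a toy greedy clustering pass. This is only a sketch of the idea: it uses k-mer Jaccard similarity as a cheap stand-in for the alignment-based identity CD-HIT computes, and the `greedy_cluster` function and 0.9 threshold are assumptions for illustration.

```python
def kmers(seq, k=3):
    """Set of overlapping k-mers of a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def greedy_cluster(seqs, threshold=0.9, k=3):
    """Greedy representative picking in the spirit of CD-HIT.

    Sequences are visited longest first; a sequence joins an existing
    cluster if its k-mer Jaccard similarity to that cluster's
    representative reaches `threshold`, otherwise it seeds a new cluster.
    Returns the representatives (one per cluster).
    """
    reps = []
    for s in sorted(seqs, key=len, reverse=True):
        ks = kmers(s, k)
        if not any(len(ks & kmers(r, k)) / len(ks | kmers(r, k)) >= threshold
                   for r in reps):
            reps.append(s)
    return reps

# Two identical sequences collapse into one cluster; the dissimilar
# low-complexity sequence seeds its own.
seqs = ["MKTAYIAKQRQISFVK", "MKTAYIAKQRQISFVK", "GGGGSGGGGSGGGGSG"]
reps = greedy_cluster(seqs)
```

Keeping one representative per cluster counteracts the overrepresentation of heavily sequenced phyla in the training set.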
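For the generation step, the simplest possible baseline is sampling each position independently from the residue frequencies of the curated alignment. This sketch is far weaker than the GAN, language-model, VAE, DCA or inverse-folding approaches listed in the table, since it ignores couplings between positions, but it makes the shared idea concrete: fit a distribution to the natural family, then draw new members from it. The helper names and the toy ungapped alignment are assumptions for illustration.

```python
import random
from collections import Counter

def column_profiles(msa):
    """Per-column residue counts of an ungapped, equal-length alignment."""
    return [Counter(col) for col in zip(*msa)]

def sample(profiles, rng):
    """Draw one sequence, sampling each column independently
    (a site-independent baseline generator)."""
    return "".join(
        rng.choices(list(p.keys()), weights=list(p.values()))[0]
        for p in profiles
    )

rng = random.Random(0)
msa = ["MKTAYIAK", "MKTGYIAK", "MRTAYLAK"]
profiles = column_profiles(msa)
new_seqs = [sample(profiles, rng) for _ in range(5)]
```

Fully conserved columns (such as the initial Met here) are reproduced exactly, while variable columns yield novel combinations not present in the training alignment.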
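The selection step's filter-then-rank pattern can be sketched as a list of predicate filters followed by score-based ranking. The three criteria below (plausible length, limited repetitiveness, initiator Met) are illustrative stand-ins, not the six criteria the study settled on, and the score values are hypothetical model outputs.

```python
def max_repeat_run(seq):
    """Length of the longest single-residue run (a cheap repetitiveness proxy)."""
    best = run = 1
    for a, b in zip(seq, seq[1:]):
        run = run + 1 if a == b else 1
        best = max(best, run)
    return best

NATURAL_LEN = (30, 40)   # assumed bounds estimated from natural sequences

FILTERS = [
    lambda s: NATURAL_LEN[0] <= len(s) <= NATURAL_LEN[1],  # plausible length
    lambda s: max_repeat_run(s) <= 4,                      # not overly repetitive
    lambda s: s.startswith("M"),                           # has an initiator Met
]

def select(candidates, scores, n=2):
    """Keep candidates that pass every filter, then take the n highest-scoring."""
    passed = [s for s in candidates if all(f(s) for f in FILTERS)]
    return sorted(passed, key=lambda s: scores[s], reverse=True)[:n]

cands = [
    "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ",   # passes all filters
    "MKKKKKKAYIAKQRQISFVKSHFSRQLEERLGL",   # rejected: repetitive run of K
    "AKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ",   # rejected: no initiator Met
]
scores = {s: sc for s, sc in zip(cands, [0.90, 0.95, 0.80])}
picked = select(cands, scores)
```

Note that the repetitive candidate is rejected despite having the highest score: hard filters targeting known failure modes run before ranking.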