Fig. 3: Performance of LassoPred’s annotator.
From: LassoPred: a tool to predict the 3D structure of lasso peptides

A Microcin J25 as an example of splitting a sequence into three parts: ring (cyan), loop (yellow), and tail (pink). The sequence is split into overlapping dipeptides, and fragment categories are shown in the right column, with “NA” for boundaries. The cartoon and stick model highlight the isopeptide and plug residues using the same color scheme as the sequence.
B Comparison of plug prediction performance using various sequence featurization methods through repeated holdout validation. Details of featurization are in Table S10. Using dipeptide fragmentation on ring-truncated sequences, each of 100 ring-truncated splits was tested using random forest classifier (RFC), K-neighbors classifier (KNC), gradient boosting classifier (GBC), and support vector classifier (SVC) models. Models were tuned by grid search; the best-performing model was used. Box plots represent the accuracy distribution (n = 100 splits): center line = median; box = 25th–75th percentiles. Summary values are labeled; if the mean overlaps with a quartile, it is shown in parentheses.
C, D Distribution of the sequence length and loop length, respectively, for the entire dataset (47 LaPs, grey) and the selected split (10 LaPs, orange).
E, F ROC curves for the isopeptide and plug classifiers, respectively. For isopeptide classification, Classes 0, 1, and 2 correspond to isopeptide, ring, and loop; for plug classification, they represent plug, loop, and tail.
G Data splitting test for plug prediction accuracy using ring-truncated dipeptide fragmentation, ESM2 L33 embedding, and the SVC model with optimized hyperparameters on the selected holdout set. For each splitting ratio, Top 1 and Top 3 accuracy were assessed via repeated holdout validation of 100 splits, applying a 4:1 training-to-test ratio and stratified sampling. Accuracy is represented as mean ± standard error (SE) based on 100 splits (n = 100).
H Model performance comparison between the original and clean datasets using ring-truncated dipeptide fragmentation, ESM2 L33 embedding, and the SVC model with optimized hyperparameters on the selected holdout set. The clean dataset (sequence similarity <80%) includes 36 LaP sequences. Performance metrics for both datasets were evaluated using 100 repeated holdout splits with a 4:1 train–test ratio and stratified sampling. Performance values are shown as mean ± standard deviation (SD).
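
The caption's three procedural ideas (overlapping dipeptide fragmentation, stratified repeated holdout, and reporting as mean ± SE or mean ± SD) can be illustrated with a minimal standard-library Python sketch. This is not the LassoPred implementation. The function names `dipeptide_fragments`, `stratified_holdout`, and `summarize` are hypothetical, the rule that a boundary-spanning dipeptide is labeled “NA” is an assumed reading of panel A, and the variable used for stratification is not stated in the caption, so `labels` is a placeholder.

```python
import random
import statistics
from collections import defaultdict


def dipeptide_fragments(seq, ring_len, loop_len):
    """Split seq into overlapping dipeptides and label each fragment.

    Assumption: a fragment gets a region label only when both residues lie
    in the same region; fragments that straddle a boundary are labeled "NA".
    """
    if ring_len + loop_len > len(seq):
        raise ValueError("ring_len + loop_len exceeds the sequence length")
    tail_len = len(seq) - ring_len - loop_len
    regions = ["ring"] * ring_len + ["loop"] * loop_len + ["tail"] * tail_len
    fragments = []
    for i in range(len(seq) - 1):
        label = regions[i] if regions[i] == regions[i + 1] else "NA"
        fragments.append((seq[i:i + 2], label))
    return fragments


def stratified_holdout(labels, test_fraction=0.2, rng=random):
    """Return (train_idx, test_idx), splitting every class at ~test_fraction.

    test_fraction=0.2 corresponds to a 4:1 train-to-test ratio.
    Each class contributes at least one item to the test set.
    """
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    train, test = [], []
    for indices in by_class.values():
        shuffled = indices[:]
        rng.shuffle(shuffled)
        n_test = max(1, round(len(shuffled) * test_fraction))
        test.extend(shuffled[:n_test])
        train.extend(shuffled[n_test:])
    return sorted(train), sorted(test)


def summarize(scores):
    """Return (mean, SD, SE) of per-split scores; SE = SD / sqrt(n)."""
    mean = statistics.mean(scores)
    sd = statistics.stdev(scores)  # sample standard deviation
    se = sd / len(scores) ** 0.5
    return mean, sd, se


if __name__ == "__main__":
    # Microcin J25; the 8-residue ring is well established, while
    # loop_len=10 is an illustrative value, not taken from the caption.
    mccj25 = "GGAGHVPEYFVGIGTPISFYG"
    for fragment, label in dipeptide_fragments(mccj25, ring_len=8, loop_len=10):
        print(fragment, label)
```

Repeating `stratified_holdout` 100 times with a seeded `random.Random` instance, scoring a model on each test set, and passing the scores to `summarize` gives both spreads reported in the caption: SE as in panel G and SD as in panel H. Because SE shrinks with the number of splits while SD does not, the two are not interchangeable when comparing panels.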