Fig. 2: ProtBFN is an effective de novo generator of protein sequences, generating novel protein sequences that are in distribution under both local and global metrics. | Nature Communications

From: Protein sequence modelling with Bayesian flow networks

The local metrics considered are amino acid (a) and oligomer (b) frequencies, both of which show a strong match to the training distribution of the model. The 50,256 oligomers used are taken from the vocabulary of ProtGPT2 [15], and the plot is overlaid with a linear correlation fit and the associated Pearson correlation coefficient. Metrics computed as functions of the entire protein sequence are used to assess the global coherence of the generations. Both the sequence lengths (c) and the mean predicted local distance difference test (pLDDT) scores from ESMFold [7] (d) are shown. Predictions with pLDDT > 70 are generally considered high confidence. Also included are two baselines: an autoregressive model, ProtGPT2 [15], and a discrete diffusion model, EvoDiff [32]. This structural analysis is extended to show the secondary structure along the length of the generated sequences, as predicted by NetSurfP-3.0 [34] (e). In all cases, ProtBFN closely matches the natural training distribution, UniProtCC. Additional results, including equivalent figures for baseline models and metrics not presented here, can be found in Supplementary Note 1. Finally, the maximum sequence identity of the ProtBFN-generated sequences to the UniProtCC training data is plotted (f) to demonstrate that the training data is not being memorised. The violin plots (c, d) show a kernel density estimate (KDE) of the data along with a box plot that displays the three quartile values of each distribution (lower quartile, median and upper quartile), with whiskers extending to points within 1.5 interquartile ranges (IQRs) of the lower and upper quartiles. Statistics are calculated using 10,000 samples for each method or data distribution. Source data are provided as a Source Data file.
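As a rough illustration of the statistics named in the caption, the following Python sketch computes amino acid frequencies, a Pearson correlation coefficient between two frequency profiles, and the quartile and whisker values of a box plot. It is not the authors' analysis code: the function names, the toy sequences, and the choice of the "inclusive" quartile interpolation are all assumptions made for this example.

```python
from collections import Counter
from math import sqrt
from statistics import quantiles

# The 20 standard amino acids (one-letter codes).
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def amino_acid_frequencies(sequences):
    """Relative frequency of each standard amino acid over all sequences."""
    counts = Counter()
    for seq in sequences:
        counts.update(ch for ch in seq if ch in AMINO_ACIDS)
    total = sum(counts.values())
    return [counts[aa] / total for aa in AMINO_ACIDS]


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def box_plot_stats(values):
    """Quartiles plus whiskers at the extreme points within 1.5 IQRs.

    Assumption: quartiles use the "inclusive" interpolation method; other
    plotting libraries may interpolate slightly differently.
    """
    q1, median, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low_fence, high_fence = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    inside = [v for v in values if low_fence <= v <= high_fence]
    return {
        "q1": q1,
        "median": median,
        "q3": q3,
        "whisker_low": min(inside),
        "whisker_high": max(inside),
    }


# Toy sequences (hypothetical, for illustration only).
training = ["MKTAYIAKQR", "MKLVVAGGSA"]
generated = ["MKTAYLAKQR", "MKLVIAGGSA"]

r = pearson(amino_acid_frequencies(training), amino_acid_frequencies(generated))
length_stats = box_plot_stats([len(s) for s in training + generated] + [95, 120, 400])
print(f"Pearson r between frequency profiles: {r:.3f}")
print(length_stats)
```

The same `pearson` helper would apply to oligomer frequencies once each sequence is tokenised with the chosen vocabulary. That tokenisation step is omitted here because it depends on the ProtGPT2 tokenizer.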