Fig. 4: ProtBFN generates sequences that fold into naturally occurring structural motifs.
From: Protein sequence modelling with Bayesian flow networks

Generated samples are folded with ESMFold, and the resulting structures are searched against the CATH S40 database, with a hit defined as TM1 and TM2 scores both above 0.5. ProtBFN is found to generate hits significantly more often (65.7% of samples) than the baseline ProtGPT2 (25.3%) and EvoDiff (12.0%) models. The distribution of TM1 and TM2 scores (a), and of TM1 against sequence similarity (b), highlight how frequently ProtBFN also generates higher TM scores whilst maintaining novelty with respect to the reference sequence of a CATH domain. A more granular analysis of the structural hits is given by sequential structure alignment programme (SSAP) scores (refs. 39, 40) (c). SSAP scores of 60-70, 70-80, and 80-100 correspond to similarity at the architecture, topology, and homologous superfamily levels, respectively, in the CATH nomenclature. ProtBFN hits score higher than those of both baseline models, with the vast majority showing similarity at the topology level or above. The boxplots display the three quartile values of the distributions (lower quartile, median and upper quartile), with whiskers extending to points within 1.5 IQRs (interquartile ranges) of the lower and upper quartiles, and data outside this range displayed as individual points. Statistics are calculated using 10,000 samples for each method. The highlighted elements on the scatter plots are detailed in (d) and were selected to illustrate the diversity of structures and functional types generated by ProtBFN (a detailed discussion can be found in the main text). Each selected sample is annotated with sample number, TM1, SSAP, Similarity, Length and TM2. The sample structure is displayed in purple, while the CATH domain is displayed in yellow. Finally, Merizo-based domain segmentation of ProtBFN samples (e) reveals that zero-, single- and multi-domain samples are generated. Panel (d) is adapted from an original rendering (Created in BioRender. Copoiu, L. (2025) https://BioRender.com/v21o080).
Source data are provided as a Source Data file.
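
The thresholds quoted in this caption can be restated as a short Python sketch. This is an illustration under stated assumptions, not the authors' analysis code: the function names `is_hit`, `ssap_level` and `box_stats` are hypothetical, TM1 and TM2 are treated simply as two numbers per sample, the assignment of boundary values (for example an SSAP score of exactly 70) to the upper bin is an assumption, and the quartile interpolation method is also an assumption, since the caption does not specify either.

```python
from statistics import quantiles


def is_hit(tm1: float, tm2: float, threshold: float = 0.5) -> bool:
    """A sample counts as a hit when both TM scores are above the threshold."""
    return tm1 > threshold and tm2 > threshold


def ssap_level(ssap: float) -> str:
    """Map an SSAP score to the CATH level named in the caption.

    Assumption: a score on a bin boundary goes to the upper bin.
    """
    if 80 <= ssap <= 100:
        return "homologous superfamily"
    if 70 <= ssap < 80:
        return "topology"
    if 60 <= ssap < 70:
        return "architecture"
    return "below architecture"


def box_stats(values):
    """Quartiles, whisker ends and outliers using the 1.5 IQR rule.

    Assumption: quartiles use linear interpolation over the data
    (the "inclusive" method of statistics.quantiles).
    """
    q1, median, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low_fence, high_fence = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    inside = [v for v in values if low_fence <= v <= high_fence]
    outliers = sorted(v for v in values if v < low_fence or v > high_fence)
    return {
        "q1": q1,
        "median": median,
        "q3": q3,
        "whisker_low": min(inside),    # most extreme point within the fence
        "whisker_high": max(inside),
        "outliers": outliers,          # drawn as individual points
    }


if __name__ == "__main__":
    print(is_hit(0.62, 0.58), ssap_level(75.0))
    print(box_stats([1, 2, 3, 4, 5, 100]))
```

The whiskers end at the most extreme data points that still lie inside the 1.5 IQR fences, rather than at the fences themselves, which matches the caption's description of whiskers "extending to points within 1.5 IQRs".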