Abstract
The CRISPR-Cas system, an adaptive immune mechanism found in bacteria and archaea, has evolved into a promising genomic editing tool, with various types of Cas proteins playing a crucial role. In this study, we developed a set of strategies for mining and identifying Cas1 proteins. First, we analyzed in detail the characteristic differences of 14 types of Cas proteins in the embedding space of a protein large language model; we then converted the proteins into the Simplified Molecular Input Line Entry System (SMILES) format, thereby constructing graph data representing atom and bond features. Next, based on the characteristic differences among Cas proteins, we designed and trained an ensemble model composed of two Directed Message Passing Neural Network (DMPNN) models for high-precision identification of Cas1 proteins. This ensemble model performed excellently on both the training data and newly designed datasets. A comparison of this method with other methods, such as CRISPRCasFinder, demonstrated its effectiveness. Finally, the ensemble model was successfully employed to identify potential Cas1 proteins in the Ensembl database, further highlighting its robustness and practicality. The strategies and models from this research may potentially be extended to other types of Cas proteins, though this would require further investigation and validation. Moreover, our work highlights SMILES encoding as a versatile tool for studying biological macromolecules, enabling efficient structural representation and advanced computational applications in protein research and beyond.
Introduction
Clustered Regularly Interspaced Short Palindromic Repeats (CRISPR) and CRISPR-associated proteins (CRISPR-Cas) form an adaptive immune system found in prokaryotes, providing defense by cleaving exogenous DNA that invades the host1,2,3. Over the past decade, this system has been harnessed as a robust genome editing tool, enabling precise mutagenesis at specific genomic sites in a wide range of organisms, including plants and animals4,5,6. While different types of CRISPR-Cas systems share similarities in genome editing principles, their recognition and cleavage mechanisms vary, influencing the choice of the most suitable CRISPR-Cas tool for specific experimental needs7,8. Therefore, exploring additional CRISPR-Cas systems is crucial to expand the toolbox for gene editing and enhance our understanding of bacterial immune mechanisms. Cas proteins, as essential components of the CRISPR-Cas system, are involved in acquiring and executing bacterial immune functions. They also serve as key markers for discovering new CRISPR-Cas systems9,10,11. Traditional bioinformatics approaches for identifying Cas enzymes include the use of hidden Markov models (HMMs) to predict Cas enzymes similar to known ones, and analysis of the sequences upstream and downstream of the Cas1 sequence or CRISPR region12,13. These methods are limited to predicting activities resembling those of known Cas enzymes and rely on the proximity of a marker sequence to the CRISPR-Cas system. Additionally, traditional approaches mainly depend on protein sequence features, making it difficult to capture complex nonlinear relationships and structural information. Moreover, traditional models usually have relatively simple architectures with a small number of parameters, which hinders their ability to effectively extract deep, high-dimensional information. These limitations significantly restrict the potential for identifying and mining a broader range of Cas proteins. Consequently, there is a clear need for more advanced methodologies capable of overcoming these challenges to fully explore the diversity and functionality of CRISPR-Cas systems.
Graph neural networks (GNNs)14,15, a subset of deep learning tailored for non-Euclidean data, excel in tasks involving graph-structured data. Even simple, untrained GNNs can perform well by uncovering intricate patterns that are challenging for standard neural networks to extract16,17. With the growing availability of biological network data, GNNs have become a valuable tool in bioinformatics18,19,20. In various biological networks, graphs can be used to capture interactions among biomolecules such as RNA, DNA, and proteins21,22,23. GNNs have seen considerable success in addressing problems related to small molecules, with many reported applications24,25,26. There are also instances of GNNs being applied to macromolecular challenges27,28,29. A classic GNN implementation is the Message Passing Neural Network (MPNN), which updates node representations by aggregating and transforming messages passed along the edges between nodes30,31,32. The Directed Message Passing Neural Network (DMPNN) builds upon the MPNN concept by computing messages on directed edges (bonds) rather than vertices (atoms), offering advantages in terms of stability and robustness33,34. This approach was successfully used by Stokes et al. in the discovery of the broad-spectrum antibiotic ‘Halicin’35.
SMILES encoding has demonstrated broad utility in protein-related tasks. It effectively describes peptide structures with modified amino acid residues, such as glycosylation motifs36. The PeptideSmilesEncoder37 tool has validated its accuracy in maintaining molecular mass. Additionally, SMILES-based molecular similarity kernels show significant computational advantages in protein-drug interaction tasks38. SMILES Pair Encoding39 (SPE) outperforms traditional atom-level tokenization methods in molecular generation, while SMILES-based Transformer models (e.g., MTL-BERT) excel in ADMET prediction tasks40, particularly in leveraging large-scale unlabeled data for pre-training within multi-task learning frameworks. These findings collectively highlight SMILES as a powerful tool for protein-related tasks, offering both efficiency and accuracy.
In this study, we devised strategies to mine and identify Cas1 proteins by analyzing their characteristics, converting them into SMILES format for graph data construction, and training an ensemble of two DMPNN models, which proved effective in identifying Cas1 proteins in the Ensembl database. Our approach also innovatively applies SMILES encoding to biomacromolecule studies. Cas1 is a highly conserved protein known for its role in forming a complex with Cas2 to integrate foreign nucleic acids into the CRISPR sequence as new spacers, effectively storing information about invasive nucleic acids41,42. In light of our analysis and understanding of other types of Cas proteins, such as the similarities revealed through protein large language model representations, we hypothesize that the strategy proposed in this study may also be applicable to the prediction of other Cas proteins. However, this necessitates further validation.
Results
Essential strategy overview
Figure 1 illustrates the comprehensive workflow from raw data acquisition to final prediction analysis. It encompasses the initial data, the preprocessing steps undertaken, the model representation phase, subsequent predictive analysis of the processed data, and the final evaluation of the prediction outcomes. We experimented with various strategies and found that combining two independent models, namely Model 1 and Model 2, proved effective in identifying potential Cas1 proteins. Both models shared identical structures and hyperparameters; however, they were trained on distinct training datasets. For Model 1, the positive samples consisted of all the Cas1 proteins we had gathered, while the negative samples were non-Cas proteins selected from UniRef50. The primary function of Model 1 was to distinguish Cas1 proteins from non-Cas proteins. Nonetheless, it struggled to efficiently differentiate Cas1 from other types of Cas proteins, which was the primary role of Model 2. Model 2 used the same positive samples as Model 1, but its negative samples comprised various other types of Cas proteins ranging from Cas2 to Cas14; detailed information is provided in the following sections.
Cas1 protein prediction and analysis
In this study, our primary focus was on identifying potential Cas1 proteins within an archaea-associated protein dataset obtained from Ensembl Bacteria43. The preliminary screening process of the data is shown in Table 1. As of April 25th, 2023, Ensembl Bacteria had a total of 31,332 entries (68 GB) related to bacteria and archaea. Each entry contained different types of bioinformatic data, mainly gene and amino acid sequences. By utilizing the archaea classification information provided in Supplementary Table 1, we identified 767 archaea entries, encompassing a total of 2,085,962 proteins in FASTA format, equivalent to 1.045 GB. We proceeded to extract and analyze the header information for each archaea protein. It was noteworthy that among these proteins, 643,734 were less than 400 amino acids (aa) in length and were labeled as ‘hypothetical proteins’. These labels indicated that the existence of these proteins had been predicted, but there was a lack of experimental evidence confirming their in vivo expression.
We applied Model 1 to predict potential Cas1 proteins among these 643,734 proteins and identified 12,574 Cas1 candidates with prediction scores exceeding 0.9. This result was encouraging, as it implied that 99.81% of the screened proteins were classified as non-Cas1, which statistically aligned with the actual rarity of Cas1 proteins. Taking into account the distribution of amino acid sequence lengths for Cas1 proteins (as depicted in Fig. 5a), we further filtered this list and selected a total of 1,113 proteins with amino acid lengths falling within the range of 250 to 350aa. Subsequently, we employed Model 2 to calculate the likelihood of these proteins being Cas1. The predicted scores from both models were summed and sorted. Table 2 summarizes this screening process.
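To make the two-stage screening concrete, the following minimal Python sketch reproduces its logic; model1, model2, and their predict method are hypothetical stand-ins for the trained ensemble models and their inference routine, while the thresholds are those described above.

# Minimal sketch of the two-stage screening described above; `model1`,
# `model2`, and `predict` are hypothetical stand-ins for the trained
# ensemble models and their inference routine.
def screen_candidates(proteins, model1, model2):
    """proteins: iterable of (accession, amino_acid_sequence) pairs."""
    # Stage 1: Model 1 separates Cas1 candidates from non-Cas proteins.
    stage1 = []
    for acc, seq in proteins:
        s1 = model1.predict(seq)
        if s1 > 0.9 and 250 <= len(seq) <= 350:  # score and length filters
            stage1.append((acc, seq, s1))
    # Stage 2: Model 2 separates Cas1 from other Cas types; rank by the sum.
    ranked = sorted(((acc, s1 + model2.predict(seq))
                     for acc, seq, s1 in stage1),
                    key=lambda item: item[1], reverse=True)
    return ranked  # highest combined score first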
To further validate these proteins without conducting biological wet experiments, we undertook the following steps. Firstly, we performed a BLAST search44 with an E-value threshold of 1e−5 for these 1,113 Cas1 candidates (query proteins) against a database built from known Cas1 proteins. This step led to the identification of 17 proteins with ‘hits found’ greater than or equal to 1. Secondly, we extracted the values of the keyword ‘Region name’ for these 1,113 proteins from NCBI45. Among these, we found 17 proteins clearly labeled as ‘Cas1’ (including Cas1, Cas1_I-II-III, and Cas1_HMARI). Notably, these are exactly the same 17 proteins identified through BLAST. Thirdly, we conducted an alignment of Cas1 functional domains based on their three-dimensional structures using SMART46. In this analysis, 12 of the 17 proteins were successfully matched (entries marked ‘–’ indicate that no information was available). After this three-step analysis, we were confident that these 17 proteins were indeed Cas1 proteins (please refer to Table 3 for specific details).
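For reference, the BLAST cross-check in the first step can be reproduced with the NCBI BLAST+ command-line tools, as sketched below; the file names are illustrative placeholders.

# Sketch of the BLAST cross-check (first validation step); file names
# are illustrative placeholders.
import subprocess

# Build a protein database from the known Cas1 sequences.
subprocess.run(["makeblastdb", "-in", "known_cas1.fasta",
                "-dbtype", "prot", "-out", "cas1_db"], check=True)

# Query the 1,113 candidates with an E-value cutoff of 1e-5;
# outfmt 6 writes one tab-separated line per hit.
subprocess.run(["blastp", "-query", "candidates.fasta", "-db", "cas1_db",
                "-evalue", "1e-5", "-outfmt", "6", "-out", "hits.tsv"],
               check=True)

# Count hits per query to obtain the 'hits found' values.
hits = {}
with open("hits.tsv") as fh:
    for line in fh:
        query = line.split("\t")[0]
        hits[query] = hits.get(query, 0) + 1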
The predicted scores of these 17 proteins generated by our model, along with their rankings among the 1,113 proteins, are presented in the fourth column of the table. Encouragingly, the top three predicted scores all belonged to Cas1 proteins. Moreover, 10 of the 17 confirmed Cas1 proteins (58.8%) ranked among the top 100 by predicted score. It can be inferred from the table that 94.1% (16 out of 17) of the Cas1 proteins ranked in the top 35.3% (393 out of 1,113) by predicted score. Additionally, it is worth noting that 94.1% of the Cas1 proteins had a predicted total score exceeding 1.91, which validates the effectiveness of our stepwise prediction with the two models. Furthermore, we observed that our model still assigned high prediction scores to three proteins (RLE96390, RLE64005, OYT29113) despite their low BLAST results. This suggests that if the E-value threshold for BLAST had been set below 1e−5, these proteins with limited ‘hits found’ would not have been identified as Cas1 proteins. However, our method was able to recognize them with relatively high predicted scores. This demonstrates, to some extent, that our model can effectively extract latent features of Cas1 proteins.
We also conducted a similar analysis for proteins with a length of less than 250aa, and the results are presented in Table 4. As can be seen, the BLAST search identified 15 proteins, with the exception of KPU62407, all of which had relatively low ‘hits found’ values, not exceeding 66. Furthermore, there was minimal information available, indicated as ‘–’ in the table, related to Cas1 for these proteins in both NCBI and SMART databases. These observations collectively suggested that these proteins were less likely to be Cas1. The predictions generated by our model were in agreement with this observation, as the predicted scores were relatively low, with the highest score not exceeding 1.37. However, it should be pointed out that our model did not correctly predict KPU62407, KKG18251, and EFC93806 as Cas1 proteins, despite their classification as such. One possible reason for this discrepancy may be the relatively low representation of short-sequence proteins in our training dataset. In summary, our model performs well in identifying Cas1 proteins, particularly those with amino acid lengths ranging from 250 to 400. However, it is important to acknowledge that the model still yields a relatively high proportion of false positives, and a high prediction score does not necessarily guarantee that a protein is a Cas1 protein.
Discussion
The CRISPR-Cas system serves as a prokaryotic immune system, aimed at defending against the intrusion of foreign genetic material, such as phage viruses and plasmids. Effectively coordinating various Cas proteins is crucial for the proper functioning of the CRISPR-Cas system. The objective of this study was to explore the integration of SMILES encoding and graph neural networks for the identification of Cas1 proteins without the need to explicitly specify Cas1 protein features. Based on the collected data and targeted analysis, we designed a suite of pipelines that automatically recognize Cas1 proteins within the DMPNN framework. ROC and AUC results obtained from both the training and newly designed test datasets validated the effectiveness of this approach. Our strategy demonstrated strong performance in identifying Cas1 proteins ≤ 400aa, achieving higher recall than CRISPRCasFinder, and showed potential for further improvement with additional long-sequence data. It faced challenges with non-Cas1 proteins due to their diverse feature space, indicating a need for expanded training data to enhance overall effectiveness. The effectiveness of our strategy on Cas1 proteins naturally leads us to consider that this approach may also be applicable to other types of Cas proteins, given their potential structural and functional similarities. However, these possibilities require further validation.
The success of this study in identifying Cas1 proteins stems from the effective integration of SMILES encoding and the DMPNN framework, highlighting the crucial role of edges and nodes in representing molecular features. Below is an analysis of the key factors: (1) SMILES encoding provides a compact, standardized representation of molecular structures, enabling efficient processing and implicit feature extraction. This eliminates the need for manual feature engineering, making it ideal for capturing the structural complexity of Cas1 proteins. (2) DMPNN excels in handling molecular graphs through its message-passing mechanism, capturing both local and global structural patterns. Its scalability allows generalization to proteins of varying lengths, even beyond the training data, making it highly versatile for protein identification tasks. (3) The combination of SMILES encoding and DMPNN creates a robust pipeline. SMILES provides efficient input representation, while DMPNN extracts complex structural features, enabling superior implicit feature extraction and high performance on both training and test datasets. (4) The ablation analysis suggests prioritizing edge feature encoding, followed by node feature enhancement, and iterative optimization of the message-passing mechanism for molecular graph neural network design. In conclusion, the strategy provides a powerful and flexible approach for Cas1 protein identification. The success of this method highlights the importance of leveraging advanced computational techniques to address complex biological challenges, paving the way for future applications in protein research and beyond.
However, a rational assessment reveals several areas in our work that require further optimization: (1) Dataset limitations: our dataset primarily consists of short protein sequences (< 400aa), and further comprehensive studies are needed to confirm whether the method is applicable to longer proteins. (2) Annotation credibility: some of these proteins lack experimental validation, leading to uncertainties regarding their biological relevance and accuracy. This could result in the model learning inaccurate features. (3) Model complexity: while SMILES encoding and graph neural networks (GNNs) are capable of capturing complex molecular structural features, the model’s complexity may make it more sensitive to noise and outliers, potentially degrading performance. (4) Negative sample diversity: the selection of negative samples may not be sufficiently diverse, which could lead to inadequate learning of negative sample features by the model. (5) Hyperparameter tuning: although Bayesian optimization was used for hyperparameter tuning, some hyperparameter choices might not have fully considered the model’s generalization ability. These issues could contribute to a higher false positive rate and reduced model accuracy. Addressing these limitations will be a key focus of future optimizations. Additionally, in future work, it may be beneficial to incorporate certain features of Cas1 proteins into the graph neural network in the form of node and edge attributes. Extending the two-dimensional graph neural network of proteins to three dimensions to incorporate more structural information could also enhance accuracy.
Methods
Construction of protein graphs
Protein encoding with SMILES
We converted each protein molecule into a SMILES string using the open-source package RDKit47. SMILES is a notation method for representing the chemical structure of molecules using concise ASCII strings48,49. One notable feature of SMILES is its uniqueness; each molecule corresponds to a single, standardized SMILES string. Additionally, SMILES is known for its efficient use of storage space compared to other structural representation methods. Figure 2a illustrates the SMILES string of a CRISPR-associated endonuclease protein selected from Listeria monocytogenes. However, its SMILES string length extends to 1622 characters, which is more than 20 times the length of the amino acid sequence. On average, when a protein is converted into a SMILES string, the string length increases by approximately 10 times. This length expansion can, to a certain extent, reveal potential structural and attribute characteristics of a protein in a more explicit manner, facilitating further analysis using other methods. Figure 2b displays the secondary structure of the protein derived from the SMILES sequence, where the connectivity and type of atoms (single bonds, double bonds, triple bonds, etc.) as well as the cyclic structures are clearly revealed. Figure 2c depicts the 3D structure predicted from SMILES using RDKit. It is evident that, aside from the side chains, it provides almost no effective folding information, and it is even impossible to render it in cartoon mode. Figure 2d, on the other hand, shows the 3D structure predicted directly from the protein sequence using AlphaFold350, with a confidence level generally above 90. Comparing the two reveals a significant difference in the results. However, this does not mean that the protein represented by SMILES lacks 3D information; it simply indicates that there is currently no effective method or tool to accurately map the 3D information from SMILES.
An example demonstrating the conversion of a Cas1 protein into a SMILES code, together with its 2D conformation and predicted 3D structures. (a) SMILES string of a CRISPR-associated endonuclease protein from Listeria monocytogenes; (b) secondary structure of the protein derived from the SMILES sequence; (c) 3D structure predicted from SMILES using RDKit; (d) 3D structure predicted from the protein sequence using AlphaFold.
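The conversion itself is essentially a one-liner in RDKit. The snippet below is a minimal illustration of this step; the ten-residue sequence is a toy stand-in for a real Cas1 protein.

# Converting an amino acid sequence to a canonical SMILES string with RDKit;
# the short sequence here is only a toy example.
from rdkit import Chem

seq = "MKTAYIAKQR"                              # one-letter amino acid sequence
mol = Chem.MolFromSequence(seq)                 # build the peptide molecule
smiles = Chem.MolToSmiles(mol, canonical=True)  # unique, standardized string
print(len(seq), len(smiles))                    # SMILES is roughly 10x longer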
It is essential to note that when using SMILES encoding for molecular representation, issues of consistency and uniqueness may arise. Specifically, the same molecule may generate inconsistent encodings when processed by different methods or software, while distinct molecules (e.g., isomers, cis–trans isomers, or enantiomers) might produce identical encodings under the same method. However, this problem is less likely to occur in the encoding of macromolecules such as proteins, due to their structural complexity and the rarity of isomerism (e.g., structural or stereoisomerism) comparable to small molecules. To ensure the absence of such issues, we implemented the following measures: (1) Tool validation: We compared widely used methods (e.g., RDKit, Indigo Toolkit and Open Babel) and selected RDKit for macromolecular conversion based on its robustness. (2) Parameter standardization: Key parameters were strictly controlled during encoding to ensure reproducibility (details in Supplementary Table 2). (3) Cross-validation: All generated SMILES strings were pairwise compared to detect inconsistencies. Under these protocols, no consistency or uniqueness issues were observed, confirming the validity of our approach.
In short, the advantages of representing protein macromolecules using SMILES are: (1) it provides a unique characterization for each protein macromolecule; (2) it offers a more detailed representation than the single-letter amino acid format; (3) SMILES can encode various complexities in protein molecules, such as stereochemistry, cyclic structures, and functional groups. These features make it easier for deep learning models to learn the features, commonalities, and differences within and between proteins, somewhat reducing the demands on model structure. However, its drawbacks are also evident: (1) there might be issues with isomers represented by SMILES; (2) SMILES mainly encodes the 2D topology of molecules and does not directly include 3D spatial information; (3) as the size of the molecule increases, the length and complexity of the SMILES string grow rapidly, which may affect the efficiency of data processing.
Atom and bond features
The properties of atoms and bonds in proteins are crucial and need to be explicitly defined as attributes of graph nodes and edges51. In this study, we incorporated a range of atom features, including atomic number, degree, formal charge, chirality, number of bonded hydrogens, hybridization (SP, SP2, SP3, SP3D, SP3D2), aromaticity, and atomic mass. Additionally, our bond features encompassed bond type (single, double, triple), aromaticity, conjugation, and ring membership. To illustrate this process, we take vinyl alcohol (C2H3OH) as an example. Figure 3 provides a detailed breakdown of how these atom and bond features are transformed into numerical vectors, with lengths of 133 for atom features and 6 for bond features. In the figure, the ‘x’ indicates that a specific property is not applicable or defined for the given molecule. This same process is applied to every protein molecule in our study.
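The sketch below illustrates this featurization with RDKit; the one-hot choices mirror the feature sets listed above, though the exact layout of the 133-dimensional atom vector is simplified in this illustration.

# Sketch of the atom and bond featurization using RDKit; the exact layout
# of the 133-dimensional atom vector is simplified in this illustration.
from rdkit import Chem

def one_hot(value, choices):
    return [int(value == c) for c in choices]

HYBRIDIZATIONS = [Chem.HybridizationType.SP, Chem.HybridizationType.SP2,
                  Chem.HybridizationType.SP3, Chem.HybridizationType.SP3D,
                  Chem.HybridizationType.SP3D2]

def atom_features(atom):
    return (one_hot(atom.GetAtomicNum(), range(1, 101))           # atomic number
            + one_hot(atom.GetTotalDegree(), range(6))            # degree
            + one_hot(atom.GetFormalCharge(), [-2, -1, 0, 1, 2])  # formal charge
            + one_hot(int(atom.GetChiralTag()), range(4))         # chirality
            + one_hot(atom.GetTotalNumHs(), range(5))             # bonded hydrogens
            + one_hot(atom.GetHybridization(), HYBRIDIZATIONS)    # hybridization
            + [int(atom.GetIsAromatic()), atom.GetMass() * 0.01]) # aromaticity, mass

def bond_features(bond):
    bt = bond.GetBondType()
    return [int(bt == Chem.BondType.SINGLE), int(bt == Chem.BondType.DOUBLE),
            int(bt == Chem.BondType.TRIPLE), int(bond.GetIsAromatic()),
            int(bond.GetIsConjugated()), int(bond.IsInRing())]

mol = Chem.MolFromSmiles("C=CO")  # vinyl alcohol from the example above
print(len(atom_features(mol.GetAtomWithIdx(0))),
      len(bond_features(mol.GetBondWithIdx(0))))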
Mathematical methods
A graph can usually be defined as \(G = \left( {V,E} \right)\), where \(V\) and \(E\) represent the set of nodes (atoms) and edges (bonds), respectively. Suppose \(v\) and \(w\) are two directly connected nodes (i.e., atoms), \(e_{vw}\) (i.e., a directional bond, also used as the bond feature here) is the directional edge from \(v\) to \(w\), and \(e_{wv}\) is the reverse bond corresponding to \(e_{vw}\); they represent two different bonds. Then, the updated message \(m_{vw}^{t + 1}\) and hidden states \(h_{vw}^{t + 1}\) of \(e_{vw}\) at step \(t + 1\) (t = 1,2,3…T − 1, with \(T\) the total number of message passing steps) can be expressed as:

\(m_{vw}^{t + 1} = \sum\nolimits_{k \in N\left( v \right)\backslash \left\{ w \right\}} {M_{t} \left( {x_{v}^{t} ,x_{k}^{t} ,h_{kv}^{t} } \right)}\)  (1)

\(h_{vw}^{t + 1} = U_{t} \left( {h_{vw}^{t} ,m_{vw}^{t + 1} } \right)\)  (2)
where \(N\left( v \right)\) is the set of neighbors of \(v\); \(x_{v}^{t}\) and \(x_{k}^{t}\) refer to the atom features at time \(t\); it is worth noting that \(k \ne w\), namely, the new message of \(e_{vw}\) is independent of \(e_{wv}\) generated in the previous step; \(h_{kv}^{t}\) and \(h_{vw}^{t}\) are the hidden states of bonds \(e_{kv}\) and \(e_{vw}\), respectively; \(M_{t}\) is the message function, which determines how to combine atom features and bond hidden states; and \(U_{t}\) is the updating function of bond hidden states. For the sake of simplicity and efficiency, we opted for a simplified message function and ensured that the programs utilize the same updating function \(U\) at different iterations:

\(M_{t} \left( {x_{v}^{t} ,x_{k}^{t} ,h_{kv}^{t} } \right) = h_{kv}^{t}\)  (3)

\(h_{vw}^{t + 1} = U\left( {h_{vw}^{t} ,m_{vw}^{t + 1} } \right) = \sigma_{2} \left( {h_{vw}^{0} + W_{2} \cdot m_{vw}^{t + 1} } \right)\)  (4)
In Eq. (4), \(h_{vw}^{0} = \sigma_{1} \left( {W_{1} \cdot e_{vw} } \right)\) uses the learnable matrix \(W_{1} \in {\mathbf{R}}^{{f_{b} \times h}}\) and activation function \(\sigma_{1}\) (we chose the Rectified Linear Unit, ReLU, for all activation functions) to convert the original bond feature \(e_{vw}\) of length \(f_{b}\) into the initial hidden state \(h_{vw}^{0}\) of size \(h\) before message passing. The learnable matrix \(W_{2} \in {\mathbf{R}}^{h \times h}\) and activation function \(\sigma_{2}\) are the core parts of the updating function, extracting the key information that in turn determines the hidden states of the bonds.
Message passing for all bonds can be implemented by applying the previous four equations driven by the iteration step \(t\). When \(t = T\), the bond message passing ends, and the final hidden states of all bonds are obtained. In order to facilitate protein visualization based on the final atom representation (atom hidden states), it is necessary to transform the final bond hidden states into final atom hidden states. For this purpose, a summation of the final hidden states of all the incoming bonds of atom \(v\), expressed as Eq. (5), is performed to generate the final atom message \(m_{v}^{T}\). Then, the atom hidden states \(h_{v}^{T}\) are calculated using Eq. (6) based on the results from Eq. (5), where \({\text{cat}}\left( {x_{v} ,m_{v}^{T} } \right) \in {\mathbf{R}}^{{f_{a} + h}}\) concatenates the initial atom feature \(x_{v}\) of length \(f_{a}\) and the atom message \(m_{v}^{T}\) before multiplying by the learnable matrix \(W_{3} \in {\mathbf{R}}^{{\left( {f_{a} + h} \right) \times h}}\) and performing the activation operation with \(\sigma_{3}\). This process can be viewed as atom hidden state generation depending on the bond hidden states,

\(m_{v}^{T} = \sum\nolimits_{k \in N\left( v \right)} {h_{kv}^{T} }\)  (5)

\(h_{v}^{T} = \sigma_{3} \left( {W_{3} \cdot {\text{cat}}\left( {x_{v} ,m_{v}^{T} } \right)} \right)\)  (6)
In the readout phase, the protein representation \(p_{i}\) of protein \(i\) is obtained by calculating the mean vector of the hidden states of all the atoms that constitute this protein, as expressed in Eq. (7),

\(p_{i} = \frac{1}{M}\sum\nolimits_{v \in M\left( i \right)} {h_{v}^{T} }\)  (7)
where \(M\left( i \right)\) is the set of all \(M\) atoms that constitute the protein \(i\).
Finally, a feed-forward neural network \(f\left( \cdot \right)\) is applied to estimate the probability \(\hat{y}_{i}\) that indicates how likely the protein is to be Cas1,

\(\hat{y}_{i} = f\left( {p_{i} } \right)\)  (8)
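For clarity, a compact PyTorch sketch of Eqs. (1)–(8) is given below; the tensor shapes, index arrays, and class name are our illustrative assumptions, not the exact implementation used in this work.

# Compact PyTorch sketch of Eqs. (1)-(8); tensor shapes and index arrays
# are illustrative assumptions, not the exact implementation of this work.
import torch
import torch.nn as nn

class DMPNNSketch(nn.Module):
    def __init__(self, f_a, f_b, h, T=4):
        super().__init__()
        self.W1 = nn.Linear(f_b, h, bias=False)      # bond features -> h^0
        self.W2 = nn.Linear(h, h, bias=False)        # bond update, Eq. (4)
        self.W3 = nn.Linear(f_a + h, h, bias=False)  # atom readout, Eq. (6)
        self.ffn = nn.Sequential(nn.Linear(h, h), nn.ReLU(),
                                 nn.Linear(h, h), nn.ReLU(),
                                 nn.Linear(h, 1))    # classification head
        self.act, self.T = nn.ReLU(), T

    def forward(self, x, e, b2a, a2b, rev):
        # x: [n_atoms, f_a] atom features; e: [n_bonds, f_b] directed bond features
        # b2a[b]: source atom of directed bond b (a Python list of ints)
        # a2b[v]: list of incoming bond indices for atom v
        # rev: LongTensor, rev[b] = index of the reverse bond of b
        h0 = self.act(self.W1(e))                    # h^0 = sigma1(W1 . e), Eq. (4)
        hb = h0
        for _ in range(self.T - 1):
            # Eqs. (1) + (3): sum incoming bond states at the source atom,
            # then subtract the reverse bond so that k != w.
            m = torch.stack([hb[a2b[b2a[b]]].sum(0)
                             for b in range(e.shape[0])]) - hb[rev]
            hb = self.act(h0 + self.W2(m))           # Eqs. (2) + (4)
        # Eq. (5): atom message = sum of final states of incoming bonds.
        ma = torch.stack([hb[a2b[v]].sum(0) for v in range(x.shape[0])])
        ha = self.act(self.W3(torch.cat([x, ma], dim=1)))  # Eq. (6)
        p = ha.mean(dim=0)                           # Eq. (7): mean readout
        return torch.sigmoid(self.ffn(p))            # Eq. (8): Cas1 probability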
Model construction
MPNN serves as a general computing framework within the broader category of GNNs. Its forward propagation process involves two primary phases: the message passing phase and the readout phase. DMPNN is an improved version of MPNN, with the ‘D’ signifying directed edges. In DMPNN, an edge, rather than an atom, serves as both the source and receiver of messages. This distinction has two key advantages: it prevents the information within a message from being repeatedly sent back to its source, and it effectively reduces noise in the network52. It is important to note that although the specific objects involved in message passing differ between the two models (MPNN and DMPNN), they share a common structure in terms of the message passing and readout phases51.
Figure 4 shows the end-to-end training and prediction process of DMPNN model based on the above mathematical method, which mainly includes three stages. In the first stage, a unique graph of atoms (nodes) and bonds (edges) of proteins is established based on the SMILES strings, and also matrices composed of the original features of nodes and edges are respectively formed as the input data for DMPNN. Note that in the graph, both single bond (e.g. C–C) and double bond (e.g. C=O) are defined as two edges in opposite directions, while triple bond (C≡C) is treated as two sets of reverse-edges (i.e. four edges).
The flowchart of model construction and architectures. The figure illustrates the three main stages: (1) Graph construction from SMILES strings, forming node and edge feature matrices. (2) Bond message passing to generate hidden states, crucial for DMPNN. (3) Protein representation construction using atom hidden states, culminating in Cas1 prediction via a classification network.
The second stage focuses on the implementation of bond message passing so as to generate the final bond hidden states, which is the core of DMPNN. Before message passing, the model applies a fully connected layer \(W_{1}\) to convert each original bond feature into a hidden vector \(h_{vw}^{0}\) (step 3), which is not only used as input for message passing (step 4) but also added to the primary bond hidden vector (step 10) to form the final bond hidden vector in the current iteration step (step 11). In addition, correct local bond message passing depends on accurately extracting the topological relationships of the related bonds and atoms from the graph (steps 5, 7, and 9), which requires that the topological relationships of atoms and bonds be translated into precise numerical position vectors during graph construction.
The third stage primarily focuses on the construction of protein representations using the hidden states of atoms. The key points of this stage are the formation of atom messages based on the final bond hidden states (step 13) and the subsequent incorporation of the initial atom features into the atom messages (step 15), in order to ensure that the representation of a protein includes both bond and atom information (step 17). Finally, we appended a classification network comprising three identical fully connected layers to the protein representation. This network is responsible for determining the probability that the given protein is Cas1, producing a scalar output \(\hat{y}_{i}\) (step 18).
Currently, many protein-related models utilize graph neural networks based on message-passing mechanisms. Taking the ProteinMPNN53 model as an example for comparison, the inputs for ProteinMPNN include: (1) distances between Cα–Cα atoms, (2) relative Cα–Cα–Cα frame orientations and rotations, and (3) backbone dihedral angles. In contrast, our model uses the characteristics of all atoms, the directional bonds between atoms, and bond features. The former employs the MPNN framework to pass messages between nodes and edges, while the latter utilizes DMPNN; the use of directional bonds is one of the critical distinctions between the two models. Given that ProteinMPNN is a generative model and ours is a classification model, there are significant differences in their overall architectures. The former mainly follows the “Encoder-Decoder” pattern and employs random masking, whereas our model first implements bond representation using DMPNN, then combines it with atom features to form the overall representation of the protein molecule, and finally uses a discriminative network to compute the error and update the parameters. Evidently, there are substantial differences in the overall framework and model details between the two, which may also be regarded as distinguishing features of our model.
Data preparation
Positive samples
The choice of samples has a significant impact on the model’s ability to generalize, and it must be handled with great care. In our approach, we utilized the National Center for Biotechnology Information (NCBI) and employed the keyword ‘CRISPR associated’ to search for potential Cas1 proteins that already existed. Subsequently, we checked the ‘gene’ value from the detailed information of each protein we found. If the ‘gene’ value was ‘Cas1’, we temporarily considered that protein as a Cas1 candidate. Through this process, we identified a total of 11,664 potential Cas1 proteins. We further refined our selection by scrutinizing the description information of each protein in the FASTA format file, which led to the removal of 69 proteins that were highly unlikely to be Cas1. This left us with a final count of 11,595 proteins that we regarded as Cas1 candidates.
To gain insight into these proteins, we conducted an analysis of their amino acid sequence lengths. This analysis revealed that the shortest and longest sequence lengths among these proteins were 41 and 1,283, respectively. Figure 5a illustrates these results, indicating that the majority of these proteins have lengths ranging from 280 to 350aa. It is reasonable to believe that this subset of proteins represents the key features of Cas1. To optimize the computational efficiency of our model, which is constrained by the longest sequences, and considering that only about 10% of these proteins exceed a length of 400aa, we chose to retain the 10,377 proteins with sequence lengths of less than 400aa.
Subsequently, we converted these proteins into SMILES strings using RDKit. We also conducted a statistical analysis of the distribution of string lengths, as depicted in Fig. 5b. The shortest and longest SMILES strings consisted of 789 and 11,605 characters, respectively. If we were to consider all these Cas1 proteins as positive samples, both the input and coefficient matrices would be extensive and sparse, leading to high memory consumption and a significant reduction in computational efficiency, without necessarily improving the results. To strike a balance between memory usage, computational efficiency, and the validity of our results, we ultimately selected 10,354 Cas1 proteins with SMILES lengths less than 8,000 as the positive samples.
Negative samples
We did not have readily available negative samples to directly combine with the previously mentioned positive samples to create the training set for Model 1. To address this, we turned to UniProt for assistance54. We applied a uniform random sampling method to screen 9000 proteins from UniRef50, a public protein dataset from UniProt. These proteins were selected with the criteria that their amino acid sequence length was less than 400, and their SMILES string length was less than 8000. Subsequently, we used the BLAST algorithm to compare these selected proteins with the previously identified 10,354 Cas1 proteins, setting a low E-value threshold of 0.00005 to ensure that proteins with high homology to Cas1 were not included as negative samples. This process resulted in a negative sample set comprising 8,839 proteins, which, when combined with the positive samples, formed the training set for Model 1.
As for Model 2, the negative samples encompassed all types of Cas proteins except for Cas1, as specified in Table 5. The second column of the table provides information on the number of proteins collected for each Cas type, ranging from Cas2 to Cas14, sourced from various databases, including NCBI. There is a noticeable data imbalance between different Cas types. The negative samples for Model 2 were selected from the raw data set presented in the second column of this table, with the exact numbers chosen for each type displayed in the third column (a total of 9950 proteins). If the number of negative samples in the third column was smaller than that available in the raw data set, we applied random sampling without replacement, such as in the case of Cas2, to select a subset of these Cas proteins as negative samples. In instances where the number of negative samples in the third column exceeded that of the raw data set, we employed multiple random samplings without replacement, as exemplified by Cas11.
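This sampling rule can be expressed compactly as below; pool and quota are hypothetical placeholders for one Cas type's raw protein list and the corresponding count in the third column of Table 5.

# Sketch of negative-sample selection for Model 2; `pool` and `quota`
# are placeholders for one Cas type's raw data and its target count.
import random

def sample_negatives(pool, quota):
    if quota <= len(pool):
        # e.g., Cas2: a single random draw without replacement
        return random.sample(pool, quota)
    # e.g., Cas11: repeated full draws without replacement until the quota is met
    selected = []
    while len(selected) < quota:
        selected.extend(random.sample(pool, min(len(pool), quota - len(selected))))
    return selected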
t-SNE analysis for the two training sets
To further validate the rationale behind our selection of positive and negative samples, we employed the pre-trained ESM-2 model55 from Facebook to embed each protein into a numerical vector of length 1280. As of November 2022, ESM-2 represented the state of the art in protein language models and could be conveniently fine-tuned for various protein-related tasks, including protein classification. Multiple ESM-2 models were available, each pre-trained on different protein datasets with varying parameter sizes. The specific model we utilized was ‘esm2_t33_650M_UR50D’, which has 650 million parameters and was trained on UniRef50. Next, scikit-learn’s56 t-distributed stochastic neighbor embedding (t-SNE)57 was used for non-linear dimensionality reduction of the protein embeddings, reducing them from 1280 features to 2 components, which were then plotted. Scikit-learn’s default values were used for all t-SNE parameters, except perplexity = 30 and n_iter = 2000.
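The embedding and projection steps can be reproduced roughly as follows, using the fair-esm package and scikit-learn; the two placeholder sequences are illustrative only, and a real run would embed the full training sets (t-SNE with a perplexity of 30 requires far more than two samples).

# Sketch of the ESM-2 embedding and t-SNE projection; the sequences shown
# are placeholders, and a real run embeds the full training sets.
import torch
import esm
from sklearn.manifold import TSNE

model, alphabet = esm.pretrained.esm2_t33_650M_UR50D()
model.eval()
batch_converter = alphabet.get_batch_converter()

data = [("pos_1", "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ"),
        ("neg_1", "GSHMLEDPVANAVVNHQGE")]
_, _, tokens = batch_converter(data)
with torch.no_grad():
    out = model(tokens, repr_layers=[33])
reps = out["representations"][33]

# Mean-pool per-residue embeddings (skipping BOS/EOS) -> one 1280-d vector each.
embeddings = torch.stack([reps[i, 1:len(seq) + 1].mean(0)
                          for i, (_, seq) in enumerate(data)]).numpy()

# Project to 2 components; perplexity = 30 and n_iter = 2000 as in the text.
coords = TSNE(n_components=2, perplexity=30, n_iter=2000).fit_transform(embeddings)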
Figure 6a and b show the t-SNE results of the training sets for Model 1 and Model 2, respectively. First, as depicted in Fig. 6a, the distribution of positive samples (red dots) is more tightly clustered compared to the negative samples (blue dots). This observation aligns with the real-world scenario where Cas1 proteins share common features, while negative samples lack uniform characteristics since they were randomly selected from UniRef50. Second, the clear separation between the positive and negative samples demonstrates the effectiveness of this training set.
t-SNE visualizations of training sets for Model 1 and Model 2, respectively. (a) shows a tight clustering of positive Cas1 samples (red) compared to diverse negative samples (blue), validating the training set’s effectiveness. (b) highlights distinct clusters of various Cas proteins, with Cas1 proteins (outlined) forming a large, distinct group, supporting the rationality of sample selection for Model 2.
In Fig. 6b, different Cas protein clusters are indicated with different colored markers, highlighting the aggregation of the same Cas proteins and the distinct gaps between different clusters. These gaps indicate that different types of Cas proteins possess unique characteristics. The regions outlined by solid lines predominantly contain Cas1 proteins, with most of them converging into one large, distinct region, albeit with a few scattered parts in other regions. This result not only demonstrates recognizable characteristic differences between various Cas proteins but also validates the rationality of our selection of positive and negative samples for Model 2.
Data bias analysis
Our dataset primarily originates from NCBI and UniRef50, which may introduce inherent biases, such as a high proportion of short Cas1 proteins (< 400aa) and hypothetical Cas1 proteins. The prevalence of short sequences may limit the model’s ability to learn features from longer proteins, which often exhibit more complex structures and functions. Our analysis shows that most Cas1 proteins have lengths between 280 and 350aa, indicating a bias in sequence length distribution, potentially linked to their biological functions and structural characteristics. While focusing solely on Cas1 proteins within this length range could enhance recognition accuracy, it would compromise the model’s ability to identify Cas1 proteins of other lengths. As a trade-off, we included all Cas1 proteins up to 400aa.
Hypothetical Cas1 proteins, predicted computationally without experimental validation, introduce uncertainties in biological relevance and accuracy, potentially leading to inaccurate feature learning and reduced model generalization. To mitigate this, we applied stringent criteria during Cas1 protein selection from public databases.
Our negative samples, primarily sourced from UniRef50, may lack diversity, impacting comprehensive feature learning. We addressed this by uniformly random sampling and using BLAST to ensure low homology with Cas1 proteins, aiming for a comprehensive negative sample feature space. Additionally, we employed weighted loss functions during training to balance dataset imbalance. Despite these measures, data biases, particularly concerning Cas1 sequence length, persist. Incorporating more confirmed Cas1 proteins into the positive sample set would further improve model performance.
Model training and evaluation
Bayesian optimization of hyperparameters
It is widely recognized that a model’s performance is significantly influenced by its hyperparameters. To fully unleash a model’s potential, it is crucial to find an optimal combination of hyperparameter values. Commonly employed methods for hyperparameter optimization include random search58,59, grid search60,61 and Bayesian search62,63. Bayesian hyperparameter optimization applies Bayesian principles to construct a probability model based on historical experimental data to determine the optimal hyperparameters. The key concept is to strike the right balance between exploration and exploitation when searching for performance improvements under various conditions. In comparison to random search, the Bayesian method can save time by requiring fewer iterations. The primary hyperparameter values we focus on during Bayesian hyperparameter optimization are shown in Table 6. We chose 4 message-passing steps to ensure adequate information flow in large and complex protein graphs while avoiding over-smoothing. The hidden size of the model was set to 1500, providing rich feature representations necessary for intricate protein structures while maintaining computational efficiency. We used 3 feedforward layers to capture hierarchical patterns without introducing excessive model complexity, ensuring effective learning of both local and global features. A dropout probability of 0.3 was selected to effectively mitigate overfitting, which is crucial for large and complex datasets like protein graphs. Finally, a batch size of 80 was chosen to ensure stable gradient updates and efficient use of GPU memory, balancing computational efficiency and stability for large graph data. These hyperparameter choices are tailored to handle the intricacies of protein structures and the scale of the graph data, ensuring that our model can accurately capture the necessary features while maintaining robust performance.
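An illustrative Bayesian search over the hyperparameters in Table 6 is sketched below using hyperopt's Tree-structured Parzen Estimator (TPE); the search ranges and the train_and_validate routine are our assumptions, not the exact configuration used in this work.

# Illustrative Bayesian search over the hyperparameters of Table 6 using
# hyperopt's TPE; search ranges and `train_and_validate` are assumptions.
from hyperopt import fmin, tpe, hp, Trials

space = {
    "depth": hp.quniform("depth", 2, 6, 1),            # message-passing steps
    "hidden_size": hp.quniform("hidden_size", 300, 2400, 100),
    "ffn_num_layers": hp.quniform("ffn_num_layers", 1, 3, 1),
    "dropout": hp.quniform("dropout", 0.0, 0.4, 0.05),
    "batch_size": hp.quniform("batch_size", 40, 120, 20),
}

def objective(params):
    params = {k: (float(v) if k == "dropout" else int(v))
              for k, v in params.items()}
    # Hypothetical routine: trains one configuration, returns validation loss.
    return train_and_validate(**params)

trials = Trials()
best = fmin(fn=objective, space=space, algo=tpe.suggest,
            max_evals=30, trials=trials)
print(best)  # values chosen here: depth=4, hidden_size=1500,
             # ffn_num_layers=3, dropout=0.3, batch_size=80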
Ensemble strategy
The ensemble strategy is an important method for dealing with the large variance and small bias typical of complex models. In particular, this approach involves training multiple copies of a model with different initial parameters on various training sets. The outputs of all these model copies for a given sample are then averaged to produce the final result for that sample. Ensemble models are effective in avoiding local minima, narrowing the gap between the training and search spaces, and reducing generalization errors. However, when the training and search spaces align well and the model excels at extracting key features, ensemble strategies may not yield significant improvements. In our study, recognizing that a single model might lack sufficient feature extraction ability, we employed an ensemble strategy for both Model 1 and Model 2.
Each of the two ensemble models consists of 5 submodels, and each submodel is trained on a randomly selected subsample of the training set we previously constructed. We employed threefold cross-validation (CV) to assess the generalization performance and stability of the model. With more than 10,000 training samples, empirical evidence supports the adequacy of threefold CV (after random shuffling), which we verified by testing 3-, 5-, and tenfold CV (Supplementary Fig. 1; the variance of the mean loss in each case is 2.22e−07, 3.76e−06, and 1.24e−06, respectively). Minimal intra-fold loss variance (nearly identical curves) confirmed a uniform data distribution, while inter-fold comparisons showed no significant deviation, validating threefold CV's reliability and its 60% time efficiency advantage over tenfold CV.

Within each fold, the dataset was randomly divided into three portions: a training set (80%), a validation set (10%), and a test set (10%). The submodel underwent 40 epochs of training (one pass over all training samples per epoch) on the training set. At the end of each epoch, the submodel's performance was assessed on the validation set, and the parameters yielding the best performance were saved. After completing all epochs, these best-performing parameters were used to evaluate the submodel on the test set, producing the test result representing the submodel's performance in that fold. This process was repeated three times to generate three test results, and their average was taken as the final performance measure for the submodel. It is worth noting that the purpose of training these three sets of submodel parameters was solely to obtain as realistic an estimate of submodel performance as possible; they were not used for actual prediction. Subsequently, we redivided the entire training set into a 90% training set and a 10% validation set. At this stage, there was no need to test the submodel's performance, so a separate test set was unnecessary, thereby increasing the number of samples in the training set. The submodel was retrained on this new training set for 40 epochs and, similarly, was validated on the validation set after each epoch. Ultimately, the submodel parameters with the best performance on the new validation set were chosen as the final parameters for that submodel, contributing to the formation of our ensemble model. This generation process was identical for the other four submodels comprising the ensemble model.
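The submodel training scheme can be summarized in code as follows; train_submodel and auc_on are hypothetical helpers standing in for one DMPNN training run (with best-on-validation checkpointing) and its test-set evaluation.

# Sketch of the submodel scheme: 3 folds of random 80/10/10 splits for
# performance estimation, then a 90/10 retraining for the final parameters.
# `train_submodel` and `auc_on` are hypothetical helpers.
import numpy as np
from sklearn.model_selection import train_test_split

def estimate_performance(X, y, seed):
    fold_scores = []
    for fold in range(3):
        X_tr, X_rest, y_tr, y_rest = train_test_split(
            X, y, test_size=0.2, shuffle=True, random_state=seed + fold)
        X_val, X_te, y_val, y_te = train_test_split(
            X_rest, y_rest, test_size=0.5, random_state=seed + fold)
        model = train_submodel(X_tr, y_tr, X_val, y_val, epochs=40)
        fold_scores.append(auc_on(model, X_te, y_te))
    return np.mean(fold_scores)  # realistic estimate, not used for prediction

def final_submodel(X, y, seed):
    X_tr, X_val, y_tr, y_val = train_test_split(
        X, y, test_size=0.1, shuffle=True, random_state=seed)
    return train_submodel(X_tr, y_tr, X_val, y_val, epochs=40)

# Five submodels with different seeds form one ensemble model.
ensemble = [final_submodel(X, y, seed=s) for s in range(5)]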
Evaluation of ensemble model performance
All models were trained using the Adam optimizer on the binary cross-entropy loss function, and both the ROC (receiver operating characteristic) curve and the AUC (area under the ROC curve) were used to evaluate model performance. Figure 7a shows the mean AUC of Model 1 at each epoch for the five submodels (each submodel has 3 AUC results from the 3 folds at each epoch) during the training process. The figure clearly illustrates that during the initial 11 epochs, the AUC grows rapidly and surpasses 0.95 by epoch 11. This not only indicates the model's impressive ability to converge quickly but also highlights its exceptional feature extraction capability. Subsequently, from the 12th to the 30th epoch, the AUC of each submodel increases more gradually, with all approaching a stable state. The figure suggests that a total of 15 epochs strikes a good balance between training time and effectiveness; however, for thoroughness, we chose to continue training for 30 epochs. A similar pattern for Model 2 can be observed in Fig. 7c, affirming the consistency of these findings across both models.
Evaluation of the performance of the ensemble Models 1 and 2: a (Model 1) and c (Model 2) demonstrate the changes in the mean AUC values for the five submodels as training progresses; b (Model 1) and d (Model 2) show the ROC curves and the corresponding AUC values of Model 1 and Model 2, respectively, on three different datasets.
Figure 7b provides an overview of the performance of the final ensemble Model 1 on three distinct datasets outlined in Table 7. As expected, Dataset 1, which essentially constitutes the training set for Model 1, delivers outstanding performance as indicated by the blue ROC curve and the highest AUC value of 0.985. This result is consistent with the model’s familiarity with this dataset. To challenge the model’s capability to handle proteins with more than 400aa, we designed Dataset 2, comprising 1221 Cas1 proteins and 1198 randomly selected proteins from UniRef50. The red ROC curve in the graph demonstrates that, although the performance slightly decreased compared to Dataset 1, the AUC value remained high at 0.957. This outcome strongly suggests that the model can generalize its performance to proteins with longer amino acid sequences. Dataset 3 was crafted to assess the model’s ability to distinguish Cas1 from other Cas proteins. This dataset encompassed all 11,595 Cas1 proteins collected and 11,276 other Cas proteins sampled from Cas2 to Cas14. The green ROC curve and the AUC value of 0.919 indicate that the model excelled at differentiating Cas1 from other Cas proteins. The slight decline in performance on Dataset 3 compared to the other two cases is likely attributable to the fact that other types of Cas proteins share some sequential or structural similarities with Cas1, making it challenging for the model to detect subtle distinctions.
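For completeness, the ensemble scores and the corresponding ROC/AUC values on each dataset can be computed as sketched below; submodels, X_eval, and y_true are placeholders for the five trained submodels and one evaluation dataset with its labels.

# Computing the ensemble score and ROC/AUC on one evaluation dataset;
# `submodels`, `X_eval`, and `y_true` are placeholders.
import numpy as np
from sklearn.metrics import roc_curve, auc

scores = np.mean([m.predict(X_eval) for m in submodels], axis=0)  # 5-submodel average
fpr, tpr, _ = roc_curve(y_true, scores)
print("AUC =", auc(fpr, tpr))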
Figure 7d illustrates the results of Model 2 on the same three datasets. The green curve demonstrates superior performance with an AUC of 0.989, which is expected because Model 2 was trained on Dataset 3. Unlike Datasets 1 and 2, Dataset 3 consists of positive samples (Cas1 proteins) and negative samples (other Cas proteins), which poses a higher demand on the model. This is due to two main reasons: first, the protein sequence lengths have a wider range and more uneven distribution; second, the differences between Cas1 and other Cas proteins are subtler compared to non-Cas proteins, necessitating a stronger model to distinguish between them. Fortunately, as evidenced by the results, Model 2 indeed possesses this capability.
The performance drop of Model 2 on Dataset 2 can be clearly explained. Although Model 2 was trained on proteins with fewer than 1300aa, the proportion of Cas1 proteins longer than 400aa among the total 11,595 Cas1 proteins is relatively small. This means that Model 2 must not only distinguish Cas1 from other Cas proteins but also extract features from the fewer Cas1 proteins that exceed 400aa, making it a more challenging task. In short, the insufficient number of Cas1 proteins exceeding 400aa in Dataset 3 led to Model 2’s inability to effectively differentiate these proteins, resulting in suboptimal performance on Dataset 2. This conclusion is further supported by Model 2’s better performance on Dataset 1, where such long sequences are less frequent. The results in Fig. 7d effectively demonstrate that the model can learn complex molecular deep features and patterns, thereby achieving a deeper understanding of protein function and structure. However, to fully leverage these capabilities, an adequate amount of training samples is essential. Based on our practical applications across the three databases discussed in this paper, we estimate that having approximately 10,000 samples per class is appropriate for optimal model performance.
From the preceding analysis, we understand that Model 2 exhibits a noticeable performance decline on Datasets 1 and 2, primarily due to imbalanced training samples. Nonetheless, this result remains acceptable. It is also important to note that although Model 2 was not trained on non-Cas proteins, its ability to identify non-Cas proteins remains strong. This can be attributed not only to the model’s inherent capabilities but also to the significant feature differences between non-Cas and Cas proteins, as analyzed in previous sections.
Ablation analysis
We conducted a controlled ablation analysis on three architectural pillars (DMPNN, edge features, and node features) using Model 1 and Model 2 under the principle of single-variable control. Six comparative experimental groups were generated (Fig. 8), with baseline models (solid lines L1-L3) and ablated variants (dashed lines L4-L6) evaluated through a three-dataset cross-validation strategy (Data1-Data3), maintaining identical hyperparameters across all trials.
Ablation analysis of Model 1 and Model 2. (a) Replacing DMPNN with MPNN reduced AUC by 5.36% on average, highlighting DMPNN’s directed message-passing advantage. (b) Node feature removal caused an 11.49% AUC decline, emphasizing atomic properties’ role. (c) Edge homogenization led to a 14.47% AUC drop, underscoring bond attributes’ importance. (d–f) Model 2’s ablation patterns mirrored Model 1, confirming consistent component roles.
(1) Component efficacy decomposition: Fig. 8a demonstrates that replacing DMPNN with standard MPNN reduced Data1’s AUC from 0.985 to 0.938 (Δ = 4.77%), with an average degradation of 5.36% across datasets. This performance gap stems from DMPNN’s directed edge message-passing mechanism, which mitigates MPNN’s cyclic redundancy through bond-oriented convolution64. Despite architectural similarities, DMPNN achieves superior molecular representation learning by encoding directional bond information during feature aggregation65. (2) Topological feature analysis: Node feature removal (Fig. 8b) caused an 11.49% average AUC decline, validating the auxiliary role of atomic properties (electronegativity, hybridization state) in molecular prediction. Notably, node ablation induced performance degradation rather than system failure, confirming their supplementary nature66. Edge homogenization (Fig. 8c) triggered the most severe impairment (Δ = 14.47%), highlighting chemical bond attributes (bond order, conjugation state) as critical molecular graph descriptors. This aligns with molecular dynamics principles where bond dissociation energy directly governs molecular stability. Remarkably, Model 2’s ablation patterns (Fig. 8d–f) exhibited strong consistency with Model 1, confirming component role generalizability across architectures.
The ablation study reveals the following insights: (1) edge features, as the core topological descriptors of molecular graphs, play a decisive role in model performance due to their information completeness; (2) node features enhance prediction accuracy by supplementing atomic-level physicochemical properties, but they are not a systematic necessity; (3) although structural optimization of DMPNN brings marginal benefits, the intrinsic advantages of the message-passing paradigm have already been fully realized in the baseline model. These findings provide important guidance for the design of molecular graph neural networks: priority should be given to the refined encoding of edge features, followed by supplementary enhancement of node features, and finally iterative optimization of the message-passing mechanism.
Comparison with other methods
We compared our models with two open-source tools, CRISPRcasIdentifier (CCIdentifier67) and CRISPRCasTyper (CCTyper68), which rely on traditional machine learning methods like hidden Markov models and decision trees. We evaluated their performance on three datasets (Table 7) and present the results in Fig. 9. Our method provides a probability value (0–1), while the others yield a Bitscore (higher values indicate greater reliability). For comparison, we ranked proteins by CCTyper’s Bitscore (red triangular line) and plotted CCIdentifier’s results (blue pentagram line) on the right y-axis. Our model’s results are shown as light cyan circles on the left y-axis. We also performed statistical analysis on predicted results based on protein sequences and scores.
Comparison of our models with CRISPRcasIdentifier (CCIdentifier) and CRISPRCasTyper (CCTyper). (a–c) Show differences in Bitscores and prediction scores. Our model (light cyan circles) consistently distinguishes between positive and negative samples, except in (c, d) where performance is weaker due to longer proteins in Dataset 2.
Figure 9 shows clear differences between CCIdentifier and CCTyper. For example, in Fig. 9a, CCTyper has the lowest Bitscore on the far right, while CCIdentifier has the highest. In Fig. 9c, CCTyper’s Bitscore is mid-range, whereas CCIdentifier consistently yields the highest values. This may be due to their reliance on explicit rules. In contrast, our model demonstrates consistent performance across all proteins, as shown by the evenly distributed light cyan circles and the rectangular histograms at the top of each subplot. The vertical histograms display the predicted score distribution of our model. Ideally, positive samples should cluster around 1, and negative samples around 0. Our method effectively distinguishes between most negative and positive samples, except in Fig. 9c and 9d, where performance is slightly weaker due to Dataset 2 containing proteins longer than 400aa, while our model was trained on shorter proteins.
Another critical aspect is setting an appropriate threshold for Cas1 protein scores, as improper thresholds can lead to misclassification. This requires careful optimization for practical applications.
To further validate the convenience and effectiveness of our strategy, we compared it with the CRISPR-Cas system detection tool CRISPRCasFinder69. Unlike CRISPRCasFinder, MacSyFinder70, and CRISPRDetect71, our method takes protein sequences directly as input, bypassing the limitations associated with relying on genomic sequences. This makes our approach particularly suitable for scenarios where only protein sequences are available. The comparison was conducted on Dataset 1 and Dataset 2 (Table 7), with a protein classified as Cas1 by our model when the sum of the prediction scores from Model 1 and Model 2 exceeded 1.4. The results are presented in Table 8.
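The decision rule used in this comparison reduces to a one-line consensus check, sketched below; the function and argument names are ours, chosen for clarity, and only the 1.4 threshold comes from the text.

```python
def is_cas1(score_model1: float, score_model2: float, threshold: float = 1.4) -> bool:
    """Dual-model consensus: call a protein Cas1 only when the two
    prediction scores jointly exceed the threshold."""
    return (score_model1 + score_model2) > threshold

# Example: 0.92 + 0.88 = 1.80 > 1.4, so the protein is called Cas1;
# 0.60 + 0.55 = 1.15 <= 1.4, so it is not.
assert is_cas1(0.92, 0.88) and not is_cas1(0.60, 0.55)
```

Requiring the summed score to exceed 1.4 means a very confident call from one model (for example 0.99) still needs at least moderate support (> 0.41) from the other, which is the robustness rationale behind the dual-model consensus recommended later in this section.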
In Dataset 1, for the critical Cas1 proteins (≤ 400 aa), our model achieved a recall of 94.4%, outperforming CRISPRCasFinder’s 93.3%. Of the 690 proteins missed by CRISPRCasFinder, our model identified 372 (53.9%); conversely, of the 580 proteins missed by our model, CRISPRCasFinder identified only 261 (45.0%). For non-Cas1 proteins, our model’s recall was 93.4%, below CRISPRCasFinder’s 99.8%. In Dataset 2, the recalls for Cas1 proteins (1221 in total) were 72.5% for our model and 99.6% for CRISPRCasFinder, while for non-Cas1 proteins (1198 in total) they were 95.4% and 99.9%, respectively; CRISPRCasFinder thus outperformed our model on Dataset 2.
Overall, the results matched expectations. Our model’s superior performance on Cas1 proteins ≤ 400 aa highlights its strength in extracting deep features. Its slightly weaker performance on non-Cas1 proteins can be attributed to the large feature space of these samples, which are drawn from Uniref50 and are therefore difficult for the model to focus on effectively. For proteins > 400 aa, our model performed worse than CRISPRCasFinder, which is unsurprising given that it was trained primarily on Cas1 proteins ≤ 400 aa; even so, its performance exceeded our initial expectations. We believe that with more long-sequence Cas1 training data, the model’s performance will improve further.
Tests on independent datasets
To evaluate the strategy comprehensively, we conducted rigorous testing on a carefully curated independent dataset. Using the Advanced Search interface of the PDB, we retrieved 62,053 initial records with the keywords ‘CRISPR-associated endonuclease Cas1’ and ‘Cas1’. After applying strict ‘macromolecule’ filters (restricting hits to biological polymers such as proteins), 41 candidate proteins remained. Structure-based deduplication and cross-checking against experimental validation then yielded 25 Cas1 proteins with verified biochemical activity and 3D structural data, spanning bacterial (19), archaeal (3) and bacteriophage (3) lineages. For comparison, 36 non-Cas1 proteins with validated structures (20 bacterial, 12 eukaryotic, 2 bacteriophage, 2 viral) were randomly selected.
The performance of our two models in Cas1 recognition is summarized in Fig. 10a, b and Supplementary Table 3: Model 1 (green histogram) achieved > 0.9 prediction scores for 44% of samples but exhibited significant misjudgments (e.g., 6OPM scored 0.373). This limitation stems from the heterogeneity of training data (14 distinct Cas protein families), which hindered the model’s ability to capture Cas1-specific signatures; Model 2 (red histogram) demonstrated superior performance, with 84% of samples scoring > 0.9 (minimum: 0.748) and accurate identification of 6OPM (0.973). Its enhanced capability likely arises from an optimized co-evolutionary information processing module, enabling robust feature extraction.
Statistical analysis of the predicted scores of Model 1 and Model 2 on the independent dataset. (a, b) Model 1 showed moderate prediction accuracy with notable misjudgments, while Model 2 demonstrated higher accuracy and reliable identification of key samples. (c, d) In non-Cas1 protein tests, Model 1 exhibited moderate specificity with a significant false-positive rate, whereas Model 2 achieved high specificity and lower false-positive scores, confirming its enhanced recognition capability.
From a taxonomic perspective, several preliminary conclusions can be drawn, although further validation with more data is needed. Predicted scores for bacterial proteins are generally high and consistent between Model 1 and Model 2, indicating accurate and stable predictions for Bacteria. In contrast, scores for archaeal proteins are more scattered, with particularly low values from Model 1, suggesting some instability or uncertainty in archaeal predictions. Finally, scores for phage proteins are lower than those for Bacteria, and the agreement between Model 1 and Model 2 is poor, implying that the models may have limited predictive power for phages.
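A per-lineage consistency check of this kind can be computed directly from the per-protein scores in Supplementary Table 3. The pandas sketch below uses fabricated placeholder rows purely to show the aggregation; the column names and values are assumptions, not our data.

```python
import pandas as pd

# Placeholder rows for illustration only; real scores are in Supplementary Table 3.
scores = pd.DataFrame({
    "protein": ["p1", "p2", "p3", "p4"],
    "lineage": ["Bacteria", "Bacteria", "Archaea", "Phage"],
    "model1":  [0.95, 0.91, 0.41, 0.62],
    "model2":  [0.97, 0.94, 0.83, 0.78],
})

summary = (
    scores.assign(disagreement=(scores["model1"] - scores["model2"]).abs())
          .groupby("lineage")[["model1", "model2", "disagreement"]]
          .mean()
)
print(summary)  # mean score per lineage plus mean |Model 1 - Model 2| gap
```

A low mean disagreement alongside high mean scores (as for Bacteria above) indicates the stable, consistent behavior described in the text, whereas large gaps flag lineages where the two models diverge.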
In non-Cas1 protein tests (Fig. 10c, d and Supplementary Table 4): Model 1 showed moderate specificity (50% of samples scored < 0.4) but suffered from a 36% false-positive rate (e.g., 2AKI/5ABB approached 1.0); Model 2 achieved 94% specificity (scores < 0.4) with a maximum false-positive score of 0.708, confirming its enhanced recognition of Cas1-specific domains, such as metal-dependent DNase active sites.
These results validate Model 2’s reliability in Cas1 prediction. However, relying solely on Model 2 remains risky due to inherent algorithmic uncertainties. A dual-model consensus approach, integrating prediction scores from both models, is recommended to enhance robustness.
Software and hardware conditions
We implemented our method on a Linux system with NVIDIA GPUs; the specific configuration is given in Table 9. Under these hardware and software conditions, 3-, 5- and 10-fold cross-validation on Dataset 1 (seq ≤ 400 aa; 10,374 + 8839 proteins in total) with 40 epochs per fold took 7.6, 14.36 and 31.25 days, respectively. With the final trained model, predicting 10,000 proteins (seq ≤ 400 aa) took 1821 s, i.e., roughly 0.18 s per protein.
Data availability
The dataset used for our analysis is accessible at: https://github.com/chengaoxiang1985/Cas1_pre/tree/main/data.
Code availability
All code associated with this study, including data preprocessing, training and inference scripts, is publicly available at: https://github.com/chengaoxiang1985/Cas1_pre/tree/main. Software used: SMART (February 27, 2024; https://smart.embl-heidelberg.de/); RDKit (March 4, 2024; https://github.com/rdkit/rdkit); AlphaFold3 (October 17, 2024; https://alphafoldserver.com/about); BLAST (2.14.1; https://blast.ncbi.nlm.nih.gov/Blast.cgi); Scikit-learn (1.4.2; https://scikit-learn.org/stable/); ESM-2 (April 2023; https://github.com/facebookresearch/esm); CRISPRCasFinder (1.2.1; https://crisprcas.i2bc.paris-saclay.fr/).
References
Butiuc-Keul, A., Farkas, A., Carpa, R. & Iordache, D. CRISPR-Cas system: the powerful modulator of accessory genomes in prokaryotes. Microb. Physiol. 32, 2–17 (2022).
Devi, V., Harjai, K. & Chhibber, S. Repurposing prokaryotic clustered regularly interspaced short palindromic repeats-Cas adaptive immune system to combat antimicrobial resistance. Future Microbiol. 18, 443–459 (2023).
Li, J. et al. Clustered regularly interspaced short palindromic repeat/CRISPR-associated protein and its utility all at sea: Status, challenges, and prospects. Microorganisms 12, 118 (2024).
Shimatani, Z. et al. Targeted base editing in rice and tomato using a CRISPR-Cas9 cytidine deaminase fusion. Nat. Biotechnol. 35, 441–443 (2017).
Lemaire, C., Le Gallou, B., Lanotte, P., Mereghetti, L. & Pastuszka, A. Distribution, diversity and roles of CRISPR-cas systems in human and animal pathogenic Streptococci. Front. Microbiol. 13, 828031 (2022).
Huang, Y. et al. A bimetallic nanoplatform for STING activation and CRISPR/Cas mediated depletion of the methionine transporter in cancer cells restores anti-tumor immune responses. Nat. Commun. 14, 4647 (2023).
Liu, Z., Dong, H., Cui, Y., Cong, L. & Zhang, D. Application of different types of CRISPR/Cas-based systems in bacteria. Microb. Cell Fact. 19, 1–14 (2020).
Katz, M. A. et al. Diverse viral cas genes antagonize CRISPR immunity. Nature 1, 1–7 (2024).
Hillary, V. E. & Ceasar, S. A. A review on the mechanism and applications of CRISPR/Cas9/Cas12/Cas13/Cas14 proteins utilized for genome engineering. Mol. Biotechnol. 65, 311–325 (2023).
Guk, K. et al. Hybrid CRISPR/Cas protein for one-pot detection of DNA and RNA. Biosens. Bioelectron. 219, 114819 (2023).
Stella, G. & Marraffini, L. Type III CRISPR-Cas: beyond the Cas10 effector complex. Trends Biochem. Sci. 49, 28–37 (2024).
Shmakov, S. et al. Diversity and evolution of class 2 CRISPR–Cas systems. Nat. Rev. Microbiol. 15, 169–182 (2017).
Chai, G., Yu, M., Jiang, L., Duan, Y. & Huang, J. HMMCAS: A web tool for the identification and ___domain annotations of Cas proteins. IEEE/ACM Trans. Comput. Biol. Bioinf. 16, 1313–1315 (2017).
Dwivedi, V. P. et al. Benchmarking graph neural networks. J. Mach. Learn. Res. 24, 1–48 (2023).
Wu, L. et al. Graph neural networks for natural language processing: A survey. Found. Trends Mach. Learn. 16, 119–328 (2023).
Rey, S., Segarra, S., Heckel, R. & Marques, A. G. Untrained graph neural networks for denoising. IEEE Trans. Signal Process. 70, 5708–5723 (2022).
Berk, A. et al. A coherence parameter characterizing generative compressed sensing with Fourier measurements. IEEE J. Sel. Areas Inf. Theory 3, 502–512 (2022).
Zhang, X. M. et al. Graph neural networks and their current applications in bioinformatics. Front. Genet. 12, 690049 (2021).
Wang, R. H., Luo, T., Zhang, H. L. & Du, P. F. PLA-GNN: Computational inference of protein subcellular ___location alterations under drug treatments with deep graph neural networks. Comput. Biol. Med. 157, 106775 (2023).
Galluzzo, Y. A comprehensive review of the data and knowledge graphs approaches in bioinformatics. Comput. Sci. Inf. Syst. 00, 27–27 (2024).
Sagendorf, J. M., Mitra, R., Huang, J., Chen, X. S. & Rohs, R. Structure-based prediction of protein-nucleic acid binding using graph neural networks. Biophys. Rev. 16, 297–314 (2024).
Li, F. et al. Advancing mRNA subcellular localization prediction with graph neural network and RNA structure. Bioinformatics 40, 504 (2024).
Zheng, M., Sun, G., Li, X. & Fan, Y. EGPDI: Identifying protein–DNA binding sites based on multi-view graph embedding fusion. Brief. Bioinform. 25, 330 (2024).
Sun, R., Dai, H. & Yu, A. W. Does gnn pretraining help molecular representation. Adv. Neural. Inf. Process. Syst. 35, 12096–12109 (2022).
Niu, Z. et al. Prediction of small molecule drug-miRNA associations based on GNNs and CNNs. Front. Genet. 14, 1201934 (2023).
Zhao, Z. et al. DIG-Mol: A contrastive dual-interaction graph neural network for molecular property prediction. IEEE J. Biomed. Health Inform. (2024).
Aldeghi, M. & Coley, C. W. A graph representation of molecular ensembles for polymer property prediction. Chem. Sci. 13, 10486–10498 (2022).
Dong, C., Li, D. & Liu, J. Glass transition temperature prediction of polymers via graph reinforcement learning. Langmuir 40, 18568–18580 (2024).
Zhang, X. et al. Polymer-unit graph: Advancing interpretability in graph neural network machine learning for organic polymer semiconductor materials. J. Chem. Theory Comput. 20, 2908–2920 (2024).
Flynn, N. R. & Swamidass, S. J. Message passing neural networks improve prediction of metabolite authenticity. J. Chem. Inf. Model. 63, 1675–1694 (2023).
Liu, C., Sun, Y., Davis, R., Cardona, S. T. & Hu, P. ABT-MPNN: An atom-bond transformer-based message-passing neural network for molecular property prediction. J. Cheminform. 15, 29 (2023).
Gale-Day, Z. J., Shub, L., Chuang, K. V. & Keiser, M. J. Proximity graph networks: Predicting ligand affinity with message passing neural networks. J. Chem. Inf. Model. 64, 5439–5450 (2024).
Shan, M. et al. Predicting hERG channel blockers with directed message passing neural networks. RSC Adv. 12, 3423–3430 (2022).
Malhotra, P., Biswas, S. & Sharma, G. D. Directed message passing neural network for predicting power conversion efficiency in organic solar cells. ACS Appl. Mater. Interfaces. 15, 37741–37747 (2023).
Stokes, J. M. et al. A deep learning approach to antibiotic discovery. Cell 180, 688–702 (2020).
Minkiewicz, P., Iwaniak, A. & Darewicz, M. Annotation of peptide structures using SMILES and other chemical codes-practical solutions. Molecules 22, 2075 (2017).
Zeng, W. F. et al. AlphaPeptDeep: A modular deep learning framework to predict peptide properties for proteomics. Nat. Commun. 13, 7238 (2022).
Öztürk, H., Ozkirimli, E. & Özgür, A. A comparative study of SMILES-based compound similarity functions for drug-target interaction prediction. BMC Bioinform. 17, 128 (2016).
Li, X. & Fourches, D. SMILES pair encoding: A data-driven substructure tokenization algorithm for deep learning. J. Chem. Inf. Model. 61, 1560–1569 (2021).
Aksamit, N. et al. Hybrid fragment-SMILES tokenization for ADMET prediction in drug discovery. BMC Bioinform. 25, 255 (2024).
Levy, A. et al. CRISPR adaptation biases explain preference for acquisition of foreign DNA. Nature 520, 505–510 (2015).
Nuñez, J. et al. Integrase-mediated spacer acquisition during CRISPR-Cas adaptive immunity. Nature 519, 193–198 (2015).
Yates, A. D. et al. Ensembl Genomes 2022: An expanding genome resource for non-vertebrates. Nucleic Acids Res. 50, D996–D1003 (2022).
McGinnis, S. & Madden, T. L. BLAST: At the core of a powerful and diverse set of sequence analysis tools. Nucleic Acids Res. 32, W20–W25 (2004).
Tatusova, T. et al. NCBI prokaryotic genome annotation pipeline. Nucleic Acids Res. 44, 6614–6624 (2016).
Letunic, I., Khedkar, S. & Bork, P. SMART: Recent updates, new developments and status in 2020. Nucleic Acids Res. 49, D458–D460 (2021).
Landrum, G. RDKit: Open-Source Cheminformatics. https://rdkit.org/docs/index.html (2006).
Weininger, D. SMILES, a chemical language and information system. 1. Introduction to methodology and encoding rules. J. Chem. Inf. Comput. Sci. 28, 31–36 (1988).
Weininger, D., Weininger, A. & Weininger, J. L. SMILES. 2. Algorithm for generation of unique SMILES notation. J. Chem. Inf. Comput. Sci. 29, 97–101 (1989).
Abramson, J. et al. Accurate structure prediction of biomolecular interactions with AlphaFold 3. Nature 630, 493–500 (2024).
Yang, K. et al. Analyzing learned molecular representations for property prediction. J. Chem. Inf. Model 59, 3370–3388 (2019).
Qian, C., Xiong, Y. & Chen, X. Directed graph attention neural network utilizing 3d coordinates for molecular property prediction. Comput. Mater. Sci. 200, 110761 (2021).
Dauparas, J. et al. Robust deep learning-based protein sequence design using ProteinMPNN. Science 378, 49–56 (2022).
The UniProt Consortium. UniProt: the Universal Protein Knowledgebase in 2023. Nucleic Acids Res. 51, D523–D531 (2023).
Lin, Z. et al. Evolutionary-scale prediction of atomic-level protein structure with a language model. Science 379, 1123–1130 (2023).
Pedregosa, F. et al. Scikit-learn: Machine learning in Python. J. Mach. Learn. Res. 12, 2825–2830 (2011).
Van der Maaten, L. & Hinton, G. Visualizing data using t-SNE. J. Mach. Learn. Res. 9, 2579–2605 (2008).
Bischl, B. et al. Hyperparameter optimization: Foundations, algorithms, best practices, and open challenges. Wiley Interdiscip. Rev. Data Min. Knowl. Discov. 13, e1484 (2023).
Ali, Y. A., Awwad, E. M., Al-Razgan, M. & Maarouf, A. Hyperparameter search for machine learning algorithms for optimizing the computational complexity. Processes 11, 349 (2023).
Belete, D. M. & Huchaiah, M. D. Grid search in hyperparameter optimization of machine learning models for prediction of HIV/AIDS test results. Int. J. Comput. Appl. 44, 875–886 (2022).
Ogunsanya, M., Isichei, J. & Desai, S. Grid search hyperparameter tuning in additive manufacturing processes. Manuf. Lett. 35, 1031–1042 (2023).
Albahli, S. Efficient hyperparameter tuning for predicting student performance with Bayesian optimization. Multim. Tools Appl. 83, 52711–52735 (2024).
Aghaabbasi, M., Ali, M., Jasiński, M. & Leonowicz, Z. Hyperparameter optimization of machine learning methods using a Bayesian optimization algorithm to predict work travel mode choice. IEEE Access 11, 19762–19774 (2023).
Soleimany, A. P. et al. Evidential deep learning for guided molecular property prediction and discovery. ACS Cent. Sci. 7, 1356–1367 (2021).
Wang, Y., Zhou, K., Liu, N., Wang, Y., & Wang, X. Efficient sharpness-aware minimization for molecular graph transformer models. arXiv:2406.13137. (2024).
Yang, Y. & Li, D. Nenn: Incorporate node and edge features in graph neural networks. In Asian Conference on Machine Learning, 593–608. (PMLR, 2020).
Padilha, V. A. et al. CRISPRcasIdentifier: Machine learning for accurate identification and classification of CRISPR-Cas systems. GigaScience 9, giaa062 (2020).
Russel, J. et al. CRISPRCasTyper: Automated identification, annotation, and classification of CRISPR-Cas Loci. CRISPR J. 3, 462–469 (2020).
Couvin, D. et al. CRISPRCasFinder, an update of CRISRFinder, includes a portable version, enhanced performance and integrates search for Cas proteins. Nucleic Acids Res. 46(W1), W246–W251 (2018).
Abby, S. S. et al. MacSyFinder: A program to mine genomes for molecular systems with an application to CRISPR-Cas systems. PLoS ONE 9, e110726 (2014).
Biswas, A. et al. CRISPRDetect: A flexible algorithm to define CRISPR arrays. BMC Genom. 17, 1–14 (2016).
Acknowledgements
This research is supported by Key Research Project of Zhejiang Lab (No. 117005-AC2106/001) and Zhejiang Province ‘Vanguard and Leading Goose + X’ R&D Plan (No. 2024C03004). The authors would also like to express their gratitude to EditSprings (https://www.editsprings.cn) for the expert linguistic services provided.
Funding
This research is supported by Key Research Project of Zhejiang Lab (No. 117005-AC2106/001) and Zhejiang Province ‘Vanguard and Leading Goose + X’ R&D Plan (No. 2024C03004).
Author information
Authors and Affiliations
Contributions
In this research project, GXC played a pivotal role, taking charge of conceptualization, methodology development, software creation, data curation, investigation and validation; he also drafted the initial manuscript and revised it for submission. LYH contributed formal analysis and provided overall supervision, ensuring the quality and progress of the research. ZWL provided essential resources, which were crucial for the smooth execution of experiments and studies. BX’s expertise was evident in software development and data curation, offering technical support and managing the research data effectively. YQL participated in the writing process, contributing to the original draft and to the subsequent review and editing, which were vital for the accuracy and clarity of the research findings. Each team member’s professional skills and diligent work collectively propelled the project’s success.
Corresponding author
Ethics declarations
Competing interests
The authors declare no competing interests.
Additional information
Publisher’s note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Supplementary Information
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.
About this article
Cite this article
Chen, G., Hou, L., Li, Z. et al. A new strategy for Cas protein recognition based on graph neural networks and SMILES encoding. Sci Rep 15, 15236 (2025). https://doi.org/10.1038/s41598-025-99999-2