Table 7. Properties of the three datasets used to evaluate model performance.
From: A new strategy for Cas protein recognition based on graph neural networks and SMILES encoding
Dataset for prediction | Number of positive samples | Number of negative samples |
---|---|---|
Dataset 1: sequence length ≤ 400 aa | Cas1: 10374 | Uniref50: 8839 |
Dataset 2: sequence length > 400 aa | Cas1: 1221 | Uniref50: 1198 |
Dataset 3: sequence length ≤ 1300 aa | Cas1: 11595 | Cas2-Cas14: 11276 |
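
As a quick sanity check on the table, the class balance of each dataset can be computed directly from the listed counts. The snippet below is a minimal sketch: the names `TABLE_7` and `class_balance` are hypothetical and introduced only for illustration, and the numbers are copied from the table rather than taken from any released data files.

```python
# Sample counts copied from Table 7 as (positive, negative) per dataset.
# The dictionary and function names are illustrative, not from the paper.
TABLE_7 = {
    "Dataset 1": (10374, 8839),   # Cas1 vs. Uniref50, length <= 400 aa
    "Dataset 2": (1221, 1198),    # Cas1 vs. Uniref50, length > 400 aa
    "Dataset 3": (11595, 11276),  # Cas1 vs. Cas2-Cas14, length <= 1300 aa
}


def class_balance(counts):
    """Return {name: (total samples, fraction of positives)}."""
    summary = {}
    for name, (pos, neg) in counts.items():
        total = pos + neg
        summary[name] = (total, pos / total)
    return summary


if __name__ == "__main__":
    for name, (total, frac) in class_balance(TABLE_7).items():
        print(f"{name}: {total} samples, {frac:.1%} positive")
```

All three datasets come out close to balanced. The Cas1 count of Dataset 3 (11595) equals the sum of the Cas1 counts of Datasets 1 and 2 (10374 + 1221), which is consistent with Dataset 3 pooling the positives of the other two, although the table does not state this explicitly.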