Table 7 Properties of the three datasets used for evaluating the performance of the models.

From: A new strategy for Cas protein recognition based on graph neural networks and SMILES encoding

| Dataset for prediction | Number of positive samples | Number of negative samples |
| --- | --- | --- |
| Dataset 1: sequence length ≤ 400 aa | Cas1: 10374 | Uniref50: 8839 |
| Dataset 2: sequence length > 400 aa | Cas1: 1221 | Uniref50: 1198 |
| Dataset 3: sequence length ≤ 1300 aa | Cas1: 11595 | Cas2-Cas14: 11276 |
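
The arithmetic behind these counts can be checked with a short sketch. The counts are copied from Table 7; the dictionary layout and the helper name `positive_fraction` are illustrative assumptions, not part of the original work. The sketch reports the total size and class balance of each dataset, and shows that the Cas1 count in Dataset 3 equals the sum of the Cas1 counts in Datasets 1 and 2. That equality is an observation from the numbers alone; the table does not state how Dataset 3 was assembled.

```python
# Sample counts copied from Table 7; this data layout is an assumption
# made for illustration only.
datasets = {
    "Dataset 1 (<= 400 aa)": {"positive": 10374, "negative": 8839},
    "Dataset 2 (> 400 aa)": {"positive": 1221, "negative": 1198},
    "Dataset 3 (<= 1300 aa)": {"positive": 11595, "negative": 11276},
}


def positive_fraction(counts):
    """Return the share of positive samples in one dataset."""
    return counts["positive"] / (counts["positive"] + counts["negative"])


# Report total size and class balance for each dataset.
for name, counts in datasets.items():
    total = counts["positive"] + counts["negative"]
    print(f"{name}: total={total}, positive fraction={positive_fraction(counts):.3f}")

# Sum of the Cas1 positives in Datasets 1 and 2, for comparison with Dataset 3.
cas1_sum = (
    datasets["Dataset 1 (<= 400 aa)"]["positive"]
    + datasets["Dataset 2 (> 400 aa)"]["positive"]
)
print("Cas1 in Datasets 1 + 2:", cas1_sum)
```

All three datasets are close to balanced, so plain accuracy would not be dominated by one class under these counts.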