Table 2 Description of the datasets used in our experiments
From: Optimized model architectures for deep learning on genomic data
Class | FASTA Files: Training | FASTA Files: Validation | FASTA Files: Test | Sequences (L = 150): Training | Sequences (L = 150): Validation | Sequences (L = 150): Test | Sequences (L = 10k): Training | Sequences (L = 10k): Validation | Sequences (L = 10k): Test |
---|---|---|---|---|---|---|---|---|---|
Bacteria | 15,826 | 4523 | 2263 | 373,404,076 | 118,398,785 | 59,111,408 | 5,579,451 | 1,772,121 | 881,031 |
Virus (non-Phage) | 18,093 | 5171 | 2588 | 2,262,526 | 568,717 | 458,526 | 20,075 | 5552 | 5350 |
Bacteriophage | 9987 | 2855 | 1428 | 4,702,821 | 1,301,375 | 609,088 | 64,937 | 18,031 | 8400 |