Table 2 Description of the datasets used in our experiments

From: Optimized model architectures for deep learning on genomic data

 

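| Class | FASTA files (Training) | FASTA files (Validation) | FASTA files (Test) | Sequences, L = 150 (Training) | Sequences, L = 150 (Validation) | Sequences, L = 150 (Test) | Sequences, L = 10k (Training) | Sequences, L = 10k (Validation) | Sequences, L = 10k (Test) |
|---|---|---|---|---|---|---|---|---|---|
| Bacteria | 15,826 | 4523 | 2263 | 373,404,076 | 118,398,785 | 59,111,408 | 5,579,451 | 1,772,121 | 881,031 |
| Virus (non-Phage) | 18,093 | 5171 | 2588 | 2,262,526 | 568,717 | 458,526 | 20,075 | 5552 | 5350 |
| Bacteriophage | 9987 | 2855 | 1428 | 4,702,821 | 1,301,375 | 609,088 | 64,937 | 18,031 | 8400 |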