Background & Summary

Ilisha elongata is an important species in clupeoid fisheries across Asian countries1 and contributes significantly to local economies. Ecologically, it occupies a crucial position in the marine food chain, linking primary producers to larger predators and contributing to ecosystem stability. During the day it is mainly active in the middle and lower layers of the water column, whereas in the evening it is more likely to occupy the middle and upper layers2. Spawning occurs near river mouths in summer, with peak activity in June, as inferred from ichthyoplankton data collected in March, May, July, and November3.

Environmental pressures, including climate change, habitat destruction, and overfishing, have adversely impacted I. elongata populations, leading to observable trends like miniaturization and early sexual maturity. These changes threaten the sustainability of the species and underscore the urgent need for conservation measures. In-depth exploration in this field not only provides scientific support for the sustainable development of fisheries, but also plays a key role in protecting wild fish stocks and maintaining the ecological balance of waters4. The study of fish genomes, including that of I. elongata, helps solve biological questions related to reproduction, adaptation, and population structure.

Despite significant progress, there remains a gap in understanding the specific genetic mechanisms behind the environmental adaptability and reproductive biology of I. elongata. Fish genome studies not only address biological questions about fish themselves but also provide new perspectives and important clues for understanding the genetic evolution of vertebrates as a whole5. Advances in high-throughput sequencing and assembly technologies have enabled genome studies to expand beyond evolution into areas such as gene function, ecology, and environmental adaptation6.

In this context, genome sequencing of I. elongata is timely and essential. It will provide insights to support sustainable fisheries management, preserve genetic resources, and ensure the long-term viability of this valuable species.

Methods

Sample collection for genome assembly

The I. elongata specimen used for genome assembly was collected from the offshore waters of Zhoushan, Zhejiang Province, China. The specimen was a one-year-old female. Tissue samples, including muscle, liver, gill, intestine and heart, were collected and placed in separate cryopreservation tubes, which were then stored in liquid nitrogen until DNA/RNA extraction.

DNA and RNA extraction

DNA was extracted from the liver tissue of I. elongata using the classical phenol/chloroform method. After extraction, the integrity of the DNA sample was assessed by 1% agarose gel electrophoresis, and the concentration was measured using NanoDrop7 and quantified with Qubit 2.0 (Invitrogen)8 to ensure that a high-quality DNA sample was obtained. RNA was extracted from tissue samples of I. elongata, including back muscle, liver, gill, intestine and heart, using a TRIzol kit (Invitrogen, USA). The NanoDrop ND-1000 spectrophotometer (LabTech) and the 2100 Bioanalyzer (Agilent Technologies) were then used for quality assessment. The qualified RNA samples were sequenced individually, with 2 µg of RNA used for transcriptome library preparation and sequencing.

Library construction, high throughput sequencing and raw data quality control

Following the standard operating procedure for MGI sequencing, a paired-end sequencing library with an insert size of 350 bp was first constructed from the qualified DNA samples. After the library passed quality checks, it was sequenced with DNBSEQ-T7 second-generation sequencing technology. Fastp (v0.12.4)9 and Trimmomatic (v0.39)10 were used for quality control of the raw sequencing data generated by the MGI platform, removing adapter sequences and filtering out low-quality reads.

A SMRTbell long-read library with fragments of about 20 kb was then constructed according to the official specification of the SMRTbell Template Prep Kit. After quality verification, the library was sequenced on the PacBio Sequel IIe platform. The SequelQC11 tool was used to remove adapters and low-quality regions.

The Hi-C library was constructed from liver cells of I. elongata by MboI enzyme digestion and formaldehyde cross-linking, and its quality was verified. The library was sequenced on the DNBSEQ-T7 platform to produce 150 bp paired-end reads, and the raw data were quality-controlled and filtered to ensure that valid read pairs were available for alignment.

For each RNA sample, a cDNA library was constructed individually and sequenced on the DNBSEQ-T7 platform with 150 bp paired-end reads. Reads containing adapters, poly-N sequences, or low-quality bases were then removed to generate a clean data set. Every stage, from DNA and RNA library preparation and sequencing to Hi-C library construction and sequencing, adhered to strict quality control protocols, providing a robust foundation for subsequent bioinformatics analysis.
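The read-filtering criteria described above (removal of reads with ambiguous bases or low quality) can be illustrated with a minimal sketch. It is not the fastp or Trimmomatic implementation: the function names and the thresholds `max_n_fraction` and `min_mean_quality` are hypothetical, Phred+33 quality encoding is assumed, and adapter trimming is not covered.

```python
from typing import Iterator, List, Tuple

Read = Tuple[str, str, str]  # (name, sequence, quality string)


def parse_fastq(lines: List[str]) -> Iterator[Read]:
    """Yield (name, sequence, quality) from plain four-line FASTQ records."""
    for i in range(0, len(lines) - 3, 4):
        yield lines[i][1:], lines[i + 1], lines[i + 3]


def mean_phred(quality: str, offset: int = 33) -> float:
    """Mean Phred score of a quality string (Phred+33 assumed)."""
    return sum(ord(ch) - offset for ch in quality) / len(quality)


def keep_read(sequence: str, quality: str,
              max_n_fraction: float = 0.1,
              min_mean_quality: float = 20.0) -> bool:
    """Keep a read only if it has few N bases and adequate mean quality.

    Both thresholds are illustrative values, not those used in the study.
    """
    if not sequence or len(sequence) != len(quality):
        return False
    n_fraction = sequence.upper().count("N") / len(sequence)
    return n_fraction <= max_n_fraction and mean_phred(quality) >= min_mean_quality


def filter_reads(lines: List[str], **thresholds: float) -> List[Read]:
    """Return the reads that pass keep_read."""
    return [r for r in parse_fastq(lines) if keep_read(r[1], r[2], **thresholds)]


example = [
    "@read1", "ACGTACGT", "+", "IIIIIIII",   # clean read, Q40 throughout
    "@read2", "NNNNACGT", "+", "IIIIIIII",   # half of the bases are N
    "@read3", "ACGTACGT", "+", "!!!!!!!!",   # Q0 throughout
]
print([name for name, _, _ in filter_reads(example)])  # ['read1']
```

For paired-end data, a pair would normally be kept only if both mates pass, so that read pairs stay synchronized for alignment.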

Genome survey analysis

GCE (v1.0.2)12 was used to estimate the genome size of I. elongata before assembly, based on the sequencing data from the MGI platform. GCE estimates genome characteristics by modelling the k-mer depth distribution, which is expected to follow a Poisson distribution. In this study, a k-mer size of 17 was selected, and the total number and coverage depth of k-mers were calculated. The genome survey analysis comprised two stages: first, the kmerfreq program was used to calculate the k-mer frequency distribution of the second-generation sequencing data; second, the gce program was used to estimate the genome size, heterozygosity, proportion of repetitive sequences and other key indicators from this distribution. These preliminary estimates provided reference data for the subsequent genome assembly and helped in selecting appropriate assembly strategies and computational tools.
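The principle behind k-mer-based size estimation can be sketched as follows: the genome size is approximately the total number of k-mers divided by the depth of the main peak of the k-mer distribution. This is a simplified illustration, not the GCE model (which also handles heterozygous and repeat peaks); the function names, the `min_depth` cutoff for error k-mers, and the toy histogram are assumptions made for the example.

```python
from typing import Dict


def main_peak_depth(histogram: Dict[int, int], min_depth: int = 3) -> int:
    """Depth with the most distinct k-mers, ignoring low-depth error k-mers."""
    candidates = {d: n for d, n in histogram.items() if d >= min_depth}
    return max(candidates, key=candidates.get)


def estimate_genome_size(histogram: Dict[int, int], min_depth: int = 3) -> float:
    """Genome size ~ total k-mer occurrences / main peak depth."""
    total_kmers = sum(d * n for d, n in histogram.items() if d >= min_depth)
    return total_kmers / main_peak_depth(histogram, min_depth)


# Toy histogram: depth -> number of distinct k-mers observed at that depth.
# Depths 1-2 mimic sequencing errors; the main peak is at depth 20.
toy_histogram = {1: 5000, 2: 800, 18: 50, 19: 120, 20: 200, 21: 120, 22: 50}

print(main_peak_depth(toy_histogram))       # 20
print(estimate_genome_size(toy_histogram))  # 540.0
```

In a heterozygous genome the distribution shows a second peak at about half the main depth, so choosing the homozygous peak matters; tools such as GCE fit this explicitly rather than taking a simple maximum.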

Genome assembly process

NextDenovo (v2.5.2)13 was used to assemble the long-read sequence data generated by PacBio. To improve the accuracy of the assembled sequences, the resulting contigs were polished with Racon (v1.5.0)14. The contigs were then further error-corrected with Pilon (v1.23)15 using the second-generation sequencing data. Comparison of the initial assembly with the genome size estimated in the survey analysis indicated low heterozygosity and a low proportion of repetitive sequences in the preliminary assembly of the I. elongata genome, supporting the quality of the assembly.

A series of specialized tools, including Juicer (v1.5)16, 3D-DNA17, and Juicebox (v1.9.8)18, were used in the later stages of genome assembly to anchor the initially assembled sequences to chromosomes. The key steps were as follows: (1) Hi-C data alignment: Juicer was used to align the Hi-C sequencing data to the draft genome assembly, providing the basic data for the subsequent chromosome-level assembly. (2) 3D-DNA analysis and correction: 3D-DNA was used to analyze the Hi-C alignment results, identify and correct possible misjoins in the contigs, and generate optimized and corrected genome files through its Polish, Split, Seal and Merge stages. (3) Juicebox correction and visualization: following the genome assembly manual provided by the Aiden Laboratory, Juicebox was used for graphical manual correction, which further improved the accuracy of the chromosome-level reference genome. After these steps, the assembly was raised from the contig level to the chromosome level, forming a high-quality reference genome.

After the de novo assembly of the I. elongata genome, the quality of the assembly was evaluated comprehensively in three respects: (1) Sequence continuity: the continuity of the assembly was assessed using QUAST (v5.0.2, http://cab.cc.spbu.ru/quast/)19 with the assembled genome sequence as the input file; contig N50, scaffold N50 and other metrics were calculated and plotted with the Matplotlib library. (2) Sequence consistency: the quality-controlled MGI and PacBio sequencing data were aligned to the assembled I. elongata reference genome using BWA (v0.7.17)20 and Minimap221, and the mapping rate and genome coverage were evaluated with Qualimap (v2.2.2)22. (3) Genome completeness: completeness was analyzed using BUSCO23 (https://gitlab.com/ezlab/busco). As I. elongata is a ray-finned fish (Actinopterygii), the actinopterygii_odb10 database was selected as the reference gene set to evaluate the completeness of the genome assembly.
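For readers unfamiliar with the continuity metric, N50 is the length of the shortest sequence in the smallest set of longest sequences that together cover at least half of the total assembly length. A minimal sketch follows; the function name and toy lengths are illustrative and this is not QUAST's code.

```python
from typing import Iterable


def n50(lengths: Iterable[int]) -> int:
    """Return the N50 of a collection of contig or scaffold lengths."""
    ordered = sorted(lengths, reverse=True)
    if not ordered:
        raise ValueError("at least one sequence length is required")
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        # The first sequence that brings the running sum to half of the
        # total assembly length defines the N50.
        if running >= half_total:
            return length
    return ordered[-1]  # not reached for positive lengths


toy_contigs = [2, 2, 2, 3, 3, 4, 8, 8]  # total length 32, half is 16
print(n50(toy_contigs))  # 8
```

The same function applies to contigs and scaffolds; only the input list of lengths differs.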

Genome annotation

Two complementary methods, de novo prediction and homology-based prediction, were used to annotate the repeat sequences of the I. elongata genome. The specific steps were as follows: (1) A repeat library was established using RepeatModeler (v2.0.1)24 and LTR-FINDER (v1.07)25, and RepeatMasker (v4.1.0)26 was used to search the genome against this library to predict the repetitive sequences in the I. elongata genome. Tandem Repeat Finder (v4.09)27 (https://tandem.bu.edu/trf/trf.whatnew.html) was used to identify tandem repeats in the genome. (2) Repeat sequences in the I. elongata genome similar to those in the Repbase repetitive sequence database (http://www.girinst.org/repbase) were identified using RepeatMasker (v4.1.0)26 and RepeatProteinMask (v4.1.0)28 (http://www.repeatmasker.org). Finally, the results of homology-based prediction and de novo prediction were integrated to obtain a comprehensive annotation of the repeat sequences in the I. elongata genome. This strategy captures repetitive elements in the genome and provides a basis for subsequent functional annotation and evolutionary studies.
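When the homology-based and de novo repeat annotations are integrated, regions found by both methods should be counted once, so the combined length is the length of the union of the intervals rather than the sum of the two totals. The sketch below shows this for a single sequence using half-open coordinates; the function names and toy intervals are assumptions for illustration and do not reproduce the pipeline used in the study.

```python
from typing import List, Tuple

Interval = Tuple[int, int]  # half-open [start, end) on one sequence


def merge_intervals(intervals: List[Interval]) -> List[Interval]:
    """Merge overlapping or adjacent intervals into a non-redundant set."""
    merged: List[List[int]] = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            # Overlaps (or touches) the previous interval: extend it.
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return [(s, e) for s, e in merged]


def covered_length(intervals: List[Interval]) -> int:
    """Total number of bases covered by the union of the intervals."""
    return sum(end - start for start, end in merge_intervals(intervals))


homology_repeats = [(0, 100), (200, 300)]   # toy homology-based hits
de_novo_repeats = [(50, 150), (400, 500)]   # toy de novo hits

combined = merge_intervals(homology_repeats + de_novo_repeats)
print(combined)                                             # [(0, 150), (200, 300), (400, 500)]
print(covered_length(homology_repeats + de_novo_repeats))   # 350
```

In a real genome the merge would be done separately for each chromosome or scaffold and the lengths summed.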

Based on the covariance models of Rfam families, miRNA, snRNA and rRNA sequences were predicted using INFERNAL (v1.1.3)29, the software that accompanies Rfam. The specific process was as follows: (1) download the Rfam database file and verify its location; (2) apply the cmpress program to the Rfam database file (Rfam.cm) to build the search library; (3) apply the cmscan program to search the I. elongata genome against the Rfam database and obtain the sequence information from the resulting alignments. Because rRNA is highly conserved, BLASTN30 was also used to align the genome with known rRNA sequences of related species to identify rRNA sequences within the genome. Furthermore, tRNAscan-SE (v2.0.9)31 was used to detect tRNA sequences in the genome of I. elongata. tRNAscan-SE integrates multiple analysis programs to analyze tRNA secondary structure, conserved sequence patterns of promoter elements and transcriptional control elements.
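The Rfam search steps above can be sketched as a pair of command lines assembled in Python. This is only an outline under stated assumptions: the file names and the `cpus` value are hypothetical, only basic options (`--cpu`, `--tblout`) are shown, and the exact options used in the study are not given in the text.

```python
from typing import List


def infernal_commands(cm_db: str, genome_fasta: str, table_out: str,
                      cpus: int = 4) -> List[List[str]]:
    """Build the cmpress and cmscan command lines for an Rfam search.

    cm_db is the Rfam covariance model file (for example Rfam.cm);
    genome_fasta and table_out are placeholder paths.
    """
    return [
        # Step 2: index the covariance model database.
        ["cmpress", cm_db],
        # Step 3: search the genome against the indexed database and
        # write a tabular summary of hits.
        ["cmscan", "--cpu", str(cpus), "--tblout", table_out, cm_db, genome_fasta],
    ]


commands = infernal_commands("Rfam.cm", "genome.fa", "rfam_hits.tbl")
for command in commands:
    print(" ".join(command))
    # To execute (requires Infernal on PATH):
    # subprocess.run(command, check=True)
```

Building the commands as lists rather than shell strings avoids quoting problems if they are later passed to `subprocess.run`.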

Gene structure prediction of the I. elongata genome was performed using three different methods: de novo prediction, homology prediction and transcriptome-based prediction. De novo prediction uses algorithms to identify genes directly from genomic sequences without external references. Homology prediction detects genes by sequence similarity to known genes of reference species. Transcriptome-based prediction incorporates RNA data to confirm gene expression and refine gene models, so the three methods are complementary. The steps were as follows: (1) gene structure was predicted de novo using GENSCAN32 and Augustus (v3.3.3)33; (2) protein sequences from related species, including Clupea harengus, Denticeps clupeoides, Alosa alosa, Oncorhynchus keta, and Danio rerio, were aligned to the I. elongata genome using the TBLASTN34 program (E-value ≤ 1e-05), and the alignments were analyzed using GeneWise (v2.4.1)35 to predict the structures of the protein-coding genes; (3) the transcriptome data were aligned to the reference genome using HISAT2 (v2.2.1)36 to generate SAM files, transcriptome assembly was performed using Trinity (v2.1.1)37, and transcriptome-based annotation was conducted using the PASApipeline (v2.4.1)38 via the Launch_PASA_pipeline.pl script. Finally, the results of the three prediction methods were integrated using EvidenceModeler (v1.1.1)39 and then filtered using the GFF3Clear script of GETA (v2.5.6) (https://github.com/chenlianfu/geta) to obtain a comprehensive gene set. Next, the InterPro40, GO41, NR42, SwissProt43, TrEMBL44, KEGG45 and KOG46 databases were downloaded.
The blastp subcommand of Diamond (v2.0.2)47 software was then used to compare and annotate the predicted gene set with each database (the comparison parameter was set to E-value ≤ 1e-09). InterProScan40 was used to compare the gene set of the I. elongata genome with the InterPro protein database. In addition, Blast2GO (v5.2.5)48 software was used to perform GO annotation on genes of I. elongata based on GO database.
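The functional assignment step can be illustrated by selecting, for each predicted gene, the best database hit that passes an E-value cutoff. The sketch assumes the standard 12-column BLAST tabular layout (query, subject, ..., E-value in column 11, bit score in column 12), which DIAMOND also writes by default; the function name and example rows are hypothetical, and the cutoff is passed explicitly by the caller.

```python
from typing import Dict, Iterable, Tuple

Hit = Tuple[str, float, float]  # (subject id, E-value, bit score)


def best_hits(rows: Iterable[str], max_evalue: float) -> Dict[str, Hit]:
    """Keep the highest-scoring hit per query among hits passing the cutoff."""
    best: Dict[str, Hit] = {}
    for row in rows:
        fields = row.rstrip("\n").split("\t")
        if len(fields) < 12:
            continue  # skip malformed or comment lines
        query, subject = fields[0], fields[1]
        evalue, bitscore = float(fields[10]), float(fields[11])
        if evalue > max_evalue:
            continue
        if query not in best or bitscore > best[query][2]:
            best[query] = (subject, evalue, bitscore)
    return best


example_rows = [
    "g1\tP1\t90.0\t100\t5\t0\t1\t100\t1\t100\t1e-30\t200",
    "g1\tP2\t95.0\t100\t2\t0\t1\t100\t1\t100\t1e-40\t250",
    "g2\tP3\t40.0\t50\t20\t1\t1\t50\t1\t50\t0.5\t30",
]
annotated = best_hits(example_rows, max_evalue=1e-5)
print(annotated)  # {'g1': ('P2', 1e-40, 250.0)}
```

Genes with no hit below the cutoff (g2 in the toy data) simply remain unannotated for that database.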

The raw sequencing data from the DNBSEQ-T7 platform were subjected to quality control and filtering, resulting in 78.92 Gb of MGI second-generation data, which were used for genome survey analysis and assembly error correction. Additionally, 69.52 Gb of Hi-C data were obtained to assist in the assembly of the genome at the chromosome level. Furthermore, 28.15 Gb of transcriptome data were collected for gene annotation of the genome. The HiFi data from the PacBio Sequel IIe platform were also subjected to quality control and filtering, ultimately yielding 35.37 Gb of long-read data, which were used for genome assembly (Table 1). These long reads comprised 1,995,442 reads with an average length of 17,692 bp and an N50 of 16,795 bp. Of these, 434,775 reads exceeded 20 kb in length (Fig. 1).

Table 1 The statistical information of the sequenced genome data of I. elongata.
Fig. 1 2D statistical chart of the quality and length of the third-generation PacBio sequencing results of I. elongata genome.

Before genome assembly, the characteristics of the genome were estimated from the quality-controlled MGI second-generation data by k-mer analysis, which predicts genome size, heterozygosity and repeat content. Statistics from the kmerfreq program gave a total k-mer count of 61,709,089,203, and the genome size of I. elongata was estimated at approximately 793 Mb, corrected to 785 Mb. The genome heterozygosity was 1.2%, and the proportion of repetitive sequences was 39.6%. The k-mer frequency distribution was visualized (Fig. 2) and showed two distinct peaks, indicating a high level of heterozygosity in the I. elongata genome.

Fig. 2 Frequency and depth distribution plot of the I. elongata genome surveyed through genome analysis.

The PacBio sequencing data were assembled using NextDenovo. The initial genome assembly of I. elongata had a size of 831.89 Mb and contained 980 contigs, with a contig N50 of 4.73 Mb (Table 2). The assembly was then corrected and refined using both third-generation and second-generation data, giving a final corrected genome of 839.57 Mb. In the corrected genome, the contig N50 reached 4.82 Mb, and the longest sequence was 16,686,348 bp (Table 2).

Table 2 Assembly results of I. elongata genome.

In this study, Hi-C sequencing technology was employed to assist in the chromosomal-level assembly of the I. elongata genome. The quality-controlled Hi-C paired-end data were aligned with the contig-level assembly, and a chromosome interaction map was constructed for visualization. The results showed that 769.94 Mb of contig sequences were anchored to 24 chromosomes (Fig. 3), with chromosome lengths ranging from 11.95 Mb to 45.42 Mb (Table 3; Fig. 4). Further statistical analysis of the assembly, conducted with Python scripts, revealed that the I. elongata genome size is 815 Mb, the scaffold N50 is 32.61 Mb, and the chromosome anchoring rate is 96.54%.

Fig. 3 Hi-C interaction heatmap of I. elongata chromosome genome.

Table 3 Statistics of the chromosome-level assembly results of I. elongata genome.
Fig. 4 Circle map of I. elongata genome at the chromosome level. From outside to inside: (A) Chromosome length, (B) GC content distribution, (C) Second-generation sequencing depth distribution, (D) Third-generation sequencing depth distribution, (E) Gene density distribution, (F) Distribution of long tandem repeat sequences.

Assembly quality evaluation

The genome of I. elongata was assembled at the chromosome level with a contig N50 of 4.82 Mb and a scaffold N50 of 32.61 Mb, showing good sequence continuity. The mapping rates of the quality-controlled second- and third-generation data to the assembled genome were 96.54% and 97.56%, with coverage of 99.65% and 99.86%, respectively (Table 4). In addition, BUSCO was used to assess the I. elongata genome against the Actinopterygii lineage dataset, which contains 3,640 conserved genes. The results showed that 3,489 genes (95.8%) were completely matched in the I. elongata genome, of which 92.2% were single-copy genes and 3.6% were multi-copy genes. A further 75 genes (2.1%) were only partially matched, and 76 genes (2.1%) were not matched (Fig. 5).

Table 4 Alignment statistics of second-generation and third-generation sequencing data.
Fig. 5 BUSCO evaluation results of I. elongata genome at the chromosome level.

Repeat sequences and non-coding RNA

Homology-based prediction annotated 90.74 Mb of repeat sequences (10.76%), and de novo prediction annotated 244.06 Mb (28.94%). Combining the results of the two strategies gave a total of 295.7 Mb of repeats, accounting for 35.08% of the genome sequence of I. elongata (Table 5). Repeats fall into two main categories: interspersed repeats and tandem repeats (TRs). In the genome of I. elongata, tandem repeats accounted for 7.41%, simple repeats for 7.12% and satellites for 0.07%. Among interspersed repeats, DNA transposons accounted for 23.91%, long interspersed nuclear elements (LINEs) for 0.82%, short interspersed nuclear elements (SINEs) for 0.44% and long terminal repeats (LTRs) for 0.65% (Table 5). In addition, some repeat sequences could not be matched to existing databases and were classified as Unknown repeat types. These sequences totaled 0.41 Mb, accounting for 0.04% of the whole genome. With tRNAscan-SE, 705 tRNAs were identified through structural prediction. Based on the Rfam database, 794 miRNAs, 105 rRNAs and 564 snRNAs were annotated; 2,826 lncRNAs were annotated using GETA (Table 6).

Table 5 The statistical analysis of repetitive sequences in the genome of I. elongata.
Table 6 Annotation of non-coding RNA genes in I. elongata genome.

Gene structure and function annotation

De novo prediction with AUGUSTUS annotated 46,804 genes. The trained AUGUSTUS model had an accuracy of 66.3%, and at the nucleotide level its sensitivity reached 88.2% and its specificity 79.5%. A total of 33,951 genes were predicted by homology, including 10,747 complete genes, 7,800 5′ partial genes, 3,434 3′ partial genes, and 11,970 internal genes. The alignment rate of the transcriptome data reached 99.05%, and 25,923 genes were predicted from the transcriptome, including 14,022 complete genes, 5,437 5′ partial genes, 3,157 3′ partial genes, and 3,307 internal genes. By combining the three gene prediction methods and filtering the gene models, a total of 26,381 structural genes were identified, of which 5,777 showed alternative splicing; 12,363 were derived from AUGUSTUS predictions, 10,790 from transcripts, and 3,228 from homology. The average CDS length was 182.6 bp and the average gene length was 13,451.3 bp; each gene contained an average of 8.65 introns and 9.64 exons, with an average intron length of 1,290.81 bp and an average exon length of 235.86 bp (Table 7). Functional annotation was carried out on the predicted gene structures of I. elongata by comparing the predicted protein sequences with the InterPro, GO, KEGG, SwissProt, TrEMBL, NR and KOG databases. The functions of 24,596 genes were successfully annotated, accounting for 93.23% of the total number of genes (Table 8).
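The gene counts reported in this paragraph can be cross-checked with simple arithmetic: the category counts should sum to each method's total, the three evidence sources should sum to the final gene set, and the functional annotation rate should follow from the two counts. The sketch below uses only numbers stated above; the variable and function names are illustrative.

```python
from typing import Dict

homology_genes: Dict[str, int] = {
    "complete": 10_747, "5_prime_partial": 7_800,
    "3_prime_partial": 3_434, "internal": 11_970,
}
transcriptome_genes: Dict[str, int] = {
    "complete": 14_022, "5_prime_partial": 5_437,
    "3_prime_partial": 3_157, "internal": 3_307,
}
final_gene_sources: Dict[str, int] = {
    "AUGUSTUS": 12_363, "transcripts": 10_790, "homology": 3_228,
}
TOTAL_GENES = 26_381
ANNOTATED_GENES = 24_596


def percentage(part: int, whole: int, digits: int = 2) -> float:
    """Return part/whole as a percentage rounded to the given digits."""
    return round(100 * part / whole, digits)


print(sum(homology_genes.values()))              # 33951
print(sum(transcriptome_genes.values()))         # 25923
print(sum(final_gene_sources.values()))          # 26381
print(percentage(ANNOTATED_GENES, TOTAL_GENES))  # 93.23
```

All four values agree with the totals given in the text.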

Table 7 Predicted protein-coding genes in I. elongata genome.
Table 8 Table of functional annotations for protein-coding genes.

Data Records

The genomic MGI short-read, PacBio, Hi-C, and transcriptomic sequencing data were deposited in the Sequence Read Archive (SRA) at NCBI under the accession number SRP530534 (ref. 49).

The final chromosome assembly was deposited in GenBank under the accession JBLRXZ000000000 (ref. 50).

The annotation files have been uploaded to the figshare database, with the DOI: https://doi.org/10.6084/m9.figshare.28151450.v1 (ref. 51).

Technical Validation

High-throughput sequencing was used to obtain the raw genome data, which were then subjected to quality control and filtering to generate clean data. Prior to assembly, a survey analysis was conducted on the second-generation sequencing data of the I. elongata genome. The results of the survey analysis indicated that the I. elongata genome size is 785 Mb, with a heterozygosity rate of 1.2% and a repetitive sequence proportion of 39.6%.

In this study, third-generation PacBio sequencing data were used for genome assembly, followed by sequence correction using both second- and third-generation data to generate an initial genome draft, with a contig N50 size of 4.82 Mb. Subsequently, Hi-C data were employed for assisted assembly, enabling the I. elongata genome to be assembled to the chromosomal level. The final genome size was 815 Mb, and the construction of the chromosome interaction map clearly revealed that contig sequences were anchored to 24 chromosomes (Fig. 3), with an anchoring rate of 96.54%. The scaffold N50 was 32.61 Mb (Table 2). Genome evaluation using BUSCO indicated a gene completeness rate of 95.8% (Fig. 5).

Based on the I. elongata genome assembly, further annotation analysis was conducted, which identified 295.7 Mb of repetitive sequences, accounting for 35.08% of the I. elongata genome. This includes 7.41% of tandem repeats, 3.34% of interspersed repeats, and 0.04% of repetitive sequences that could not be annotated to any homologous databases (Table 5). Structural annotation of the I. elongata genome revealed a total of 26,381 protein-coding genes (Table 7), of which 93.23% were successfully annotated with functional information (Table 8).