Background & Summary

The family Siganidae (also known as rabbitfish), are small and medium-sized marine fish. Rabbitfish inhabit nearshore reef areas and are found in the Indo-Pacific from the Red Sea and the coast of eastern Africa through the Pacific Ocean as far as Pitcairn Island1. As a group of perciform fishes, rabbitfish only includes one genus, namely Siganus Forsskål 1775 and currently 28 species are recognized2. However, natural hybridization are also found between both close related species or morphs and distantly related ones within rabbitfish3, making taxonomy and phylogenetic studies of this taxa a little difficult and complicated. Rabbitfish are herbivorous and feed on benthic algae, consisting of a important community in coral reef ecosystem. Due to this feeding characteristic, they are usually introduced in culture ponds to clean net cages4. In aquaculture, there are several species (e.g., S. canaliculatus, S. guttatus and S. fuscescens) that are heavily explored because of their high protein content and delicious meat4. In addition, some species in Siganidae are very popular in the Indo-Pacific and Mediterranean regions as ornamental fishes due to their gorgeous appearance, such as S. vermicularisi and S. corallinus5. In China, 14 Siganidae species are formally described or recorded with a distribution across South China Sea to East China Sea5.

Among these species, the White-spotted spinefoot S. canaliculatus (synonym of S. oramin), is an important member for various reasons. First, S. canaliculatus is a common commercial fish in the family Siganidae and widely distributed in tropical and subtropical areas of the Indo-Pacific Ocean1. It is especially abundant in the wild along the coast of South China. Most of the rabbitfish have beautiful body color and appearance while S. canaliculatus has many small oblong yellow spots on the head and side of the body, which are relatively unremarkable5. Interestingly, its color can change sharply when inspired by external stimulus. As other species in this genus, S. canaliculatus is also featured by possessing poisonous glands in its dorsal and pelvic fin spines. The toxins likely originate from its food resource such as algae. However, its muscle is nontoxic and full of unsaturated fatty acids as well as minerals and trace elements4. The large gallbladder could be responsible for this special phenomenon (equal to 30% of its body length). These above valuable traits have made S. canaliculatus as one of the most important marine aquaculture species in the past decades in China costal provinces. For example, in Fujian province, more than 1000 tons have been reported for the annual production of this fish5.

Meanwhile, as a saltwater fish, S. canaliculatus has the characteristics as freshwater fish. In general, the fertilized eggs of freshwater fish are heavy and sticky, while the fertilized eggs of marine fish are floating (caused by differences between the density of freshwater and seawater). However, as a true marine fish, S. canaliculatus is unusual by laying heavy and sticky fertilized eggs6. Moreover, freshwater fish usually have the ability to synthesize highly unsaturated fatty acids (HUFAs) while seawater fish generally lack or are poor at this ability. Their demands for HUFAs mainly depend on direct food intake, so the diet of seawater fish are highly dependent on fish oil. S. canaliculatus is the first seawater fish that has been found to possess the ability to convert linolenic acid and linoleic acid into HUFAs7. The elovl gene family was shown to function underlying biosynthesis of HUFAs8,9.

Apart from nutrition studies, in recent years, there are many investigations of S. canaliculatus covering divers topics. For instance, morphology6, genetic structures10,11, phylogenetics3,12, reproduction13, net cage culture14 as well as disease control15. However, our knowledge of S. canaliculatus have still been limited due to lack of genetic resources and genomic information. The advancements of third-generation sequencing and high-throughput chromatin conformation capture (Hi-C) technologies have provided an unprecedented opportunity for producing high quality and chromosome-level genomes for various organisms on the earth.

In this study, we employed an integrated strategy of HiFi long reads, Hi-C, Iso-seq and RNA-seq sequencing technologies to assemble a high-quality genome of S. canaliculatus. This genome was 547.39 Mb with contig N50 of 21.41 Mb and scaffold N50 of 21.79 Mb. Approximately 95.32% (521.76 Mb) of assembled sequences were placed into 24 pseudochromosomes with the support of Hi-C contact map. 25,323 protein-coding genes were predicted and 96.96% were functionally annotated. BUSCOs assessment of the assembly showed 3589 (98.6%) BUSCOs was complete. This high-quality S. canaliculatus reference genome will provide an important genomic resource for genetic breeding and molecular mechanism related studies.

Methods

Ethics statement

The fish in our experiments were collected from Shenzhen City, Guangdong Province, China. Furthermore, the methods used in this work are strictly in accordance with the Guidelines for the Care and Use of Laboratory Animals and approved by Laboratory Animal Ethics Committee of South China Sea Fisheries Research Institute, Chinese Academy of Fishery Sciences (permit reference number No. 2024-MRB-00-001). Fish was collected for experiment utilization only and sacrificed using MS-222 (Sigma).

Sample collection and DNA extraction

A wild female S.canaliculatus (body mass: 250.2 g) was collected from Da Peng, Shenzhen, Guangdong, China (22°38′32.31″N; 114°24′40.87 E). The muscle was isolated and flash-frozen for ~30 minutes. Total DNA was extracted using QIAGEN Genomic DNA extraction kit and was used for PacBio sequencing and Hi-C sequencing. The extracted high molecular weight was assessed by 1% agarose gel and Qubit 3.0 Fluorometer (Invitrogen, USA).

Library construction and DNA sequencing

a SMRTbell Express Template Prep Kit 2.0 was used to generate a 20 kb long library for PacBio HiFi sequencing. The library was then sequenced on a PacBio Revio System (Pacifc Biosciences, Menlo Park, CA, USA). HiFi reads were obtained using the CCS module in SMRT Link v9.016. After HiFi reads calling, 25.14 Gb PacBio HiFi reads were generated (N50: 20.47 kb, 45.02× in depth) (Table 1).

Table 1 Sequencing data for Siganus canaliculatus genome assembly.

For Hi-C sequencing, a GrandOmics Hi-C kit with DpnII enzyme (GrandOmics, China) was used to construct libraries following the standard manufacturer’s protocol. The resulted Hi-C libraries were sequenced on a MGISEQ-2000 platform (MGI, BGI Shenzhen, China). 101.66 Gb raw reads were produced. These raw reads were filtered by using fastp v0.19.517 to filter low quality reads. 96.75 Gb (173.26 × in depth) clean reads were obtained in total. This clean Hi-C data was subsequently used for placing contigs onto psedochromosomes.

RNA extraction and sequencing

Both RNA-seq and Iso-seq were employed to assist RNA evidence based gene prediction. Seven tissues (skin, fin, heart, liver, gill, muscle and gonad) from the same individual as DNA extraction were equally mixed and extracted by using a TRIZOL Kit (Invitrogen, Carlsbad, CA, USA) following the manufacturer’s instructions. RNA integrity and quality was checked by the Nanodrop 2000 spectrophotometer and the Agilent 2100 Bioanalyzer System (Agilent Technologies, Santa Clara, CA, USA). RNA with RIN (RNA integrity number) ≥7.0 were selected for library construction. Procedures described in our previous study18 were performed for Iso-seq. Briefly, the extracted RNA was used for cDNA synthesis followed by a large-scale PCR amplification step. PCR products were purified and subjected to the construction of SMRTbell template libraries. Finally, SMRT cells were sequenced on a PacBio Revio platform. For RNA-seq, cDNA libraries with insert sizes of ~350 bp were constructed and sequenced on a MGISEQ-2000 platform (MGI, BGI Shenzhen, China). 96.30 Gb and 18.14 Gb raw data were generated from Iso-seq and RNA-seq, respectively (Table 1).

Genome assembly and telomere identification

HiFi reads were first assembled using hifiasm v0.19.5-r58719 with default parameters to generate a contig-level assembly which had a size of 558.39 Mb with 108 contigs (N50: 21.41 Mb). The mitochondrial sequences were removed in this step. After hifiasm assembly, purge_dups v1.2.620 was used to remove haplotigs and contig overlaps based on read depth following the standard pipeline. AutoHiC v1.3.321 was then used to scaffold these contigs using deep learning-based methods for automatic error correction. Briefly, this newly developed software utilizes Hi-C reads and input draft reference assembly to generate a candidate assembly. With built-in AutoHiC deep learning models, AutoHiC can automatically correct errors during genome assembly and generate a chromosome-level genome. The resulted draft genome was then polished by NextPolish v1.4.122 to fix base errors (SNV/Indel) with HiFi long reads. Telomere sequences at ends of each chromosome was identified quarTeT v1.2.523. The size of the final assembly version was 547.39 Mb, of which 95.32% (521.76 Mb) were placed onto 24 chromosomes with Hi-C heat map support (Figs. 1, 2; Table 4). 70 sequences were presented in the final assembly with N50 length of 21.79 Mb. The length of 24 chromosome-level sequences ranged from 12.47 Mb to 27.41 Mb. The 24 chromosome numbers suggested by the Hi-C heat map was identical with a karyotype study of S. canaliculatus24. Telomere sequences were found to be presented at both ends of three chromosomes while only single telomere sequences were identified at one end of 20 chromosomes (Table 4).

Fig. 1
figure 1

Circos plot of Siganus canaliculatus genome. (a) chromosome sizes, (b) gene density, (c) GC density, (d) repeat elements abundance, (e) DNA transposons, (f) LTRs, and (g) ncRNAs.

Fig. 2
figure 2

Chromosome heatmaps of Hi-C data of Siganus canaliculatus. The bar beside indicates chromatin interactions quantified based on the count of Hi-C reads.

Repeat elements annotation

EDTA pipeline25 was used to annotate repeat elements in the S. canaliculatus genome. This pipeline was developed for automated whole-genome de-novo TE annotation. It first utilizes LTR-FINDER v1.0.626, LTRharvest27, HelitronScanner28 and TIR-Learner29 to predict LTR, TIR and Helitron, respectively. Then, LTR_retriever v3.0.330 was used to filter false positive results of LTR. Subsequently, basic and advance filter in EDTA were applied to do additional filtering and resulted in raw TE library. This raw library was used for RepeatMasker v4.1.2-p131 to mask the target genome followed by RepeatModeler v2.0.332 to predict the remaining TE in the genome. The results showed 89,597,434 bp (16.37%) was identified to be repetitive sequences (Table 2), in which LTR accounting for 2.58%, TIR 4.19%, nonLTR 0.38%, nonTIR 0.58% and repeat_region 8.1%.

Table 2 Statistics of repetitive sequences.

Gene structure prediction and functional annotation

The masked genome generated in the repeat annotation step was used as an input for gene structure prediction. Three approaches which were commonly adopted was employed in this study: (1) Ab initio prediction: AUGUSTUS v3.5.033 and GeneMark-ET34 were performed to do ab initio prediction; (2) Homology-based prediction: Protein sequences from five representative species (Danio rerio, Oreochromis niloticus, Oryzias latipes, Scatophagus argus, Takifugu rubripes) were download from the NCBI database. Using these data as references, gene structures in the S. canaliculatus genome were predicted using blastx v2.2.2635 and exonerate v2.236; (3) Transcriptome-based: for RNA-seq based predictions, raw RNA-seq reads were filtered using fastp17 (-a auto --adapter_sequence_r2 auto --dedup --dup_calc_accuracy 3). After filtering, 16.96 Gb clean reads were mapped onto the S. canaliculatus genome using HISAT2 v2.2.137 and stringtie v2.2.138 and merged with TACO v0.7.339. For Iso-seq based predictions, raw Iso-seq read was processed using isoseq pipeline40. GMAP41 was introduced to align cDNA to the S. canaliculatus genome. Finally, gene structures predicted from above three methods were integrated by MAKER v3.01.0342. Genes with a Annotation Edit Distance (AED) ≤1 were retained in the final dataset.

For functional annotation of predicted genes, protein sequences were extracted from the S. canaliculatus genome and blasted against nine commonly used protein databases (NR, Swissprot, KEGG, KOG, GO, Pfam, TrEMBL, eggNOG, InterPro) using DIAMOND v0.9.2543 with an E value of 1e−5 and InterProscan v5.59-91.044.

Non-coding RNA (ncRNAs, i.e., tRNAs, rRNAs, miRNAs, snRNAs and snoRNAs) in the S. canaliculatus genome were also annotated. We first utilized tRNAscan-SE v1.3.145 to predict tRNAs in the assembly. For the rRNA genes, RNAmmer v1.246 was used (-S euk -m lsu,ssu,tsu -gff). MiRNAs, snRNAs and snoRNAs were searched by CMSAN v1.1.247 against the Rfam v14.10 database48 (--cut_ga --rfam --nohmmonly --tblout --fmt 2).

For ab initio prediction, AUGUSTUS v3.5.033 and GeneMark-ET34 found 38789 and 38161 genes in the S. canaliculatus genome, respectively. Homology-based approach predicted 37191 to 49829 genes depending on reference genomes. RNA-seq based evidence predicted 30416 genes while Iso-seq based evidence found 35972 genes (Table 3). After integrated by MAKER v3.01.0342, 25323 protein-coding genes were finally annotated with a range from 572 to 1415 genes across each chromosome (Table 4). Functional annotation results showed 71.45% to 96.68% of proteins can be blasted in one of nine databases (Fig. 3). After removing redundancy, 96.96% proteins had at least one database hits (Table 5). For ncRNA annotation, 1352 miRNA, 1551 tRNA, 2968 rRNA, 260 snRNA and 209 snoRNA were predicted in the S. canaliculatus genome (Table 6).

Table 3 Statistics of gene prediction.
Table 4 Statistics of gene numbers predicted across each chromosome.
Fig. 3
figure 3

Upset plot showing protein sequences of Siganus canaliculatus annotated in nine databases. Only the first 30 intersections have been shown.

Table 5 Statistics of gene functional annotation.
Table 6 Statistics of non-coding genes.

Data Records

Raw reads sequenced in this study have been submitted to the National Genomics Data Center (https://ngdc.cncb.ac.cn/, BioProject number: PRJCA02996149, Run IDs: CRR1288946-CRR1288949). The genome sequences and annotation files were deposited at figshare (https://doi.org/10.6084/m9.figshare.2711716950) and NCBI (accession number: JBLRWB00000000051).

Technical Validation

The quality of the assembly was assessed using BUSCO v5.5.052 with the actinopterygii_odb10 database (3,640 BUSCOs). The BUSCO assessment showed that 3589 (98.6%) BUSCOs were identified as complete, of which 3574 (98.2%) and 15 (0.4%) were single-copy and duplicated, respectively. Chromosome numbers of the S. canaliculatus genome were confirmed by the Hi-C heat map (Fig. 2). Completeness assessment of proteins showed that a total of 3518 (96.6%) BUSCOs were identified as complete. Of these, 3488 (95.8%) were single-copy and 30 (0.8%) were duplicated BUSCOs (Fig. 4). Taking all above results and quality assessment metrics together, we concluded that the S. canaliculatus genome was high quality and has good annotations.

Fig. 4
figure 4

BUSCO assessment results of Siganus canaliculatus gene and protein sequences.