Chromosome-level genome assembly and annotation of the White-spotted spinefoot Siganus canaliculatus

Huang, Xiaolin; Lu, Yanke; Zhang, Hui; Xian, Lin; Huang, Shiting; Yang, Yukai; Wang, Lei; Zhang, Dianchang; Li, Chao

doi:10.1038/s41597-025-04844-w

Data Descriptor
Open access
Published: 23 March 2025

Xiaolin Huang^1,2,3,
Yanke Lu⁴,
Hui Zhang⁴,
Lin Xian^1,2,5,
Shiting Huang⁴,
Yukai Yang^1,2,3,
Lei Wang⁴,
Dianchang Zhang^1,2,3 &
…
Chao Li ORCID: orcid.org/0000-0002-1012-4723⁴

Scientific Data volume 12, Article number: 482 (2025)

Abstract

The White-spotted spinefoot, S. canaliculatus, is an economically important marine fish in South China and is characterized by poisonous glands in its fin spines. However, the lack of an S. canaliculatus genome has been a serious obstacle to genetic breeding and to basic research, such as uncovering the genomic basis of its toxigenic glands. Here, we present a chromosome-level genome assembly coupled with a good annotation of S. canaliculatus, generated using multiple omics technologies. The assembled genome is 547.39 Mb, with a contig N50 of 21.41 Mb and a scaffold N50 of 21.79 Mb. Approximately 95.32% (521.76 Mb) of the assembled sequence was placed into 24 pseudochromosomes with the support of a Hi-C contact map. Repetitive elements make up around 16.37% of the genome. A BUSCO assessment of the assembly identified 98.6% of BUSCO genes as complete. A total of 25,323 protein-coding genes were predicted by integrating three kinds of evidence, of which 96.96% were functionally annotated in at least one of nine protein databases. In sum, the chromosome-level genome assembly and annotation provide fundamental resources for genetic breeding and for studies of molecular mechanisms in S. canaliculatus.

Background & Summary

The family Siganidae (rabbitfish) comprises small to medium-sized marine fish. Rabbitfish inhabit nearshore reef areas and are found in the Indo-Pacific from the Red Sea and the coast of eastern Africa through the Pacific Ocean as far as Pitcairn Island¹. This group of perciform fishes includes only one genus, Siganus Forsskål 1775, in which 28 species are currently recognized². However, natural hybridization occurs both between closely related species or morphs and between distantly related ones³, which complicates taxonomic and phylogenetic studies of this taxon. Rabbitfish are herbivorous and feed on benthic algae, forming an important community in coral reef ecosystems. Because of this feeding habit, they are often introduced into culture ponds to clean net cages⁴. In aquaculture, several species (e.g., S. canaliculatus, S. guttatus and S. fuscescens) are heavily exploited because of their high protein content and delicious meat⁴. In addition, some species of Siganidae, such as S. vermicularisi and S. corallinus, are popular ornamental fishes in the Indo-Pacific and Mediterranean regions because of their gorgeous appearance⁵. In China, 14 Siganidae species have been formally described or recorded, with a distribution from the South China Sea to the East China Sea⁵.

Among these species, the White-spotted spinefoot S. canaliculatus (synonym: S. oramin) is important for several reasons. First, it is a common commercial fish in the family Siganidae and is widely distributed in tropical and subtropical areas of the Indo-Pacific Ocean¹. It is especially abundant in the wild along the coast of South China. Most rabbitfish have beautiful body coloration and appearance, whereas S. canaliculatus has many small oblong yellow spots on the head and sides of the body, which are relatively unremarkable⁵. Interestingly, its color can change sharply in response to external stimuli. Like other species in this genus, S. canaliculatus possesses poisonous glands in its dorsal and pelvic fin spines. The toxins likely originate from its food, such as algae. However, its muscle is nontoxic and rich in unsaturated fatty acids as well as minerals and trace elements⁴. The large gallbladder (equal to 30% of its body length) could be responsible for this phenomenon. These valuable traits have made S. canaliculatus one of the most important marine aquaculture species in China's coastal provinces in past decades. For example, in Fujian province, an annual production of more than 1000 tons has been reported for this fish⁵.

Meanwhile, although it is a saltwater fish, S. canaliculatus shares some characteristics with freshwater fish. In general, the fertilized eggs of freshwater fish are heavy and sticky, while those of marine fish float (a consequence of the difference in density between freshwater and seawater). However, S. canaliculatus, a true marine fish, is unusual in laying heavy and sticky fertilized eggs⁶. Moreover, freshwater fish can usually synthesize highly unsaturated fatty acids (HUFAs), whereas seawater fish generally lack this ability or have only a weak capacity for it. Seawater fish therefore meet their demand for HUFAs mainly through direct food intake, so their diets are highly dependent on fish oil. S. canaliculatus is the first seawater fish found to be able to convert linolenic acid and linoleic acid into HUFAs⁷. The elovl gene family has been shown to function in the biosynthesis of HUFAs^8,9.

Apart from nutrition studies, recent investigations of S. canaliculatus have covered diverse topics, including morphology⁶, genetic structure^10,11, phylogenetics^3,12, reproduction¹³, net cage culture¹⁴ and disease control¹⁵. However, our knowledge of S. canaliculatus remains limited by the lack of genetic resources and genomic information. Advances in third-generation sequencing and high-throughput chromatin conformation capture (Hi-C) technologies have provided an unprecedented opportunity to produce high-quality, chromosome-level genomes for a wide range of organisms.

In this study, we employed an integrated strategy of HiFi long-read, Hi-C, Iso-seq and RNA-seq sequencing to assemble a high-quality genome of S. canaliculatus. The genome is 547.39 Mb, with a contig N50 of 21.41 Mb and a scaffold N50 of 21.79 Mb. Approximately 95.32% (521.76 Mb) of the assembled sequence was placed into 24 pseudochromosomes with the support of a Hi-C contact map. A total of 25,323 protein-coding genes were predicted, of which 96.96% were functionally annotated. A BUSCO assessment of the assembly showed that 3589 (98.6%) BUSCOs were complete. This high-quality S. canaliculatus reference genome will provide an important genomic resource for genetic breeding and for studies of molecular mechanisms.

Methods

Ethics statement

The fish used in our experiments were collected from Shenzhen City, Guangdong Province, China. The methods used in this work were strictly in accordance with the Guidelines for the Care and Use of Laboratory Animals and were approved by the Laboratory Animal Ethics Committee of South China Sea Fisheries Research Institute, Chinese Academy of Fishery Sciences (permit reference number No. 2024-MRB-00-001). The fish were collected for experimental use only and were sacrificed using MS-222 (Sigma).

Sample collection and DNA extraction

A wild female S. canaliculatus (body mass: 250.2 g) was collected from Da Peng, Shenzhen, Guangdong, China (22°38′32.31″N; 114°24′40.87″E). Muscle tissue was isolated and flash-frozen for ~30 minutes. Total DNA was extracted using the QIAGEN Genomic DNA extraction kit and was used for PacBio sequencing and Hi-C sequencing. The extracted high-molecular-weight DNA was assessed by 1% agarose gel electrophoresis and with a Qubit 3.0 Fluorometer (Invitrogen, USA).

Library construction and DNA sequencing

A SMRTbell Express Template Prep Kit 2.0 was used to generate a 20 kb library for PacBio HiFi sequencing. The library was then sequenced on a PacBio Revio System (Pacific Biosciences, Menlo Park, CA, USA). HiFi reads were obtained using the CCS module in SMRT Link v9.0¹⁶. After HiFi read calling, 25.14 Gb of PacBio HiFi reads were generated (N50: 20.47 kb; 45.02× depth) (Table 1).

Table 1 Sequencing data for Siganus canaliculatus genome assembly.

For Hi-C sequencing, a GrandOmics Hi-C kit with the DpnII enzyme (GrandOmics, China) was used to construct libraries following the manufacturer’s standard protocol. The resulting Hi-C libraries were sequenced on an MGISEQ-2000 platform (MGI, BGI Shenzhen, China), producing 101.66 Gb of raw reads. Low-quality reads were removed using fastp v0.19.5¹⁷, leaving 96.75 Gb (173.26× depth) of clean reads in total. These clean Hi-C data were subsequently used to place contigs onto pseudochromosomes.
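The depth values quoted for the HiFi and Hi-C data can be checked with a short calculation. The sketch below divides the sequencing yield by a genome size. The text does not say which genome size was used as the denominator, so the use of the 558.39 Mb contig-level assembly here is an assumption, chosen because it reproduces the reported depths to within rounding. The function and variable names are illustrative.

```python
def sequencing_depth(total_bases_gb: float, genome_size_mb: float) -> float:
    """Mean depth = total sequenced bases / genome size."""
    return (total_bases_gb * 1e9) / (genome_size_mb * 1e6)


# Yields reported in the text (Gb).
HIFI_GB = 25.14
HIC_CLEAN_GB = 96.75

# Assumption: the 558.39 Mb contig-level assembly is the denominator.
# The paper does not state this explicitly.
ASSUMED_GENOME_MB = 558.39

hifi_depth = sequencing_depth(HIFI_GB, ASSUMED_GENOME_MB)
hic_depth = sequencing_depth(HIC_CLEAN_GB, ASSUMED_GENOME_MB)

print(f"HiFi depth: {hifi_depth:.2f}x")
print(f"Hi-C depth: {hic_depth:.2f}x")
```

With the 547.39 Mb final assembly as the denominator instead, the same yields would give somewhat higher depths, so the choice of genome size should be stated whenever depth is reported.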

RNA extraction and sequencing

Both RNA-seq and Iso-seq were employed to provide RNA evidence for gene prediction. Seven tissues (skin, fin, heart, liver, gill, muscle and gonad) from the same individual used for DNA extraction were mixed in equal amounts, and RNA was extracted using a TRIZOL Kit (Invitrogen, Carlsbad, CA, USA) following the manufacturer’s instructions. RNA integrity and quality were checked with a Nanodrop 2000 spectrophotometer and an Agilent 2100 Bioanalyzer System (Agilent Technologies, Santa Clara, CA, USA). RNA samples with an RNA integrity number (RIN) ≥7.0 were selected for library construction. Iso-seq followed the procedures described in our previous study¹⁸. Briefly, the extracted RNA was used for cDNA synthesis followed by a large-scale PCR amplification step. PCR products were purified and used to construct SMRTbell template libraries. Finally, SMRT cells were sequenced on a PacBio Revio platform. For RNA-seq, cDNA libraries with insert sizes of ~350 bp were constructed and sequenced on an MGISEQ-2000 platform (MGI, BGI Shenzhen, China). In total, 96.30 Gb and 18.14 Gb of raw data were generated from Iso-seq and RNA-seq, respectively (Table 1).

Genome assembly and telomere identification

HiFi reads were first assembled using hifiasm v0.19.5-r587¹⁹ with default parameters to generate a contig-level assembly of 558.39 Mb in 108 contigs (N50: 21.41 Mb). Mitochondrial sequences were removed at this step. After the hifiasm assembly, purge_dups v1.2.6²⁰ was used to remove haplotigs and contig overlaps based on read depth, following the standard pipeline. AutoHiC v1.3.3²¹, which uses deep learning-based methods for automatic error correction, was then used to scaffold these contigs. Briefly, this recently developed software takes Hi-C reads and a draft reference assembly as input to generate a candidate assembly; its built-in deep learning models then automatically correct assembly errors to produce a chromosome-level genome. The resulting draft genome was polished with NextPolish v1.4.1²² to fix base errors (SNVs/indels) using HiFi long reads. Telomere sequences at the ends of each chromosome were identified using quarTeT v1.2.5²³. The final assembly was 547.39 Mb, of which 95.32% (521.76 Mb) was placed onto 24 chromosomes with Hi-C heat map support (Figs. 1, 2; Table 4). The final assembly contained 70 sequences with an N50 length of 21.79 Mb. The 24 chromosome-level sequences ranged in length from 12.47 Mb to 27.41 Mb. The chromosome number of 24 suggested by the Hi-C heat map was identical to that reported in a karyotype study of S. canaliculatus²⁴. Telomere sequences were found at both ends of three chromosomes and at only one end of 20 chromosomes (Table 4).
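For readers unfamiliar with the N50 statistic used above, the following minimal sketch shows how it is usually computed from a list of sequence lengths, and checks the reported proportion of sequence anchored to chromosomes. The toy lengths are invented for illustration and are not taken from the assembly; only the 521.76 Mb and 547.39 Mb totals come from the text.

```python
def n50(lengths):
    """Return the length L such that sequences of length >= L
    together cover at least half of the total assembly length."""
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length
    raise ValueError("no sequence lengths were given")


# Toy example (hypothetical lengths, in Mb): total is 30, half is 15.
toy_lengths = [10, 8, 6, 4, 2]
toy_n50 = n50(toy_lengths)

# Share of the final assembly placed on the 24 pseudochromosomes,
# using the totals reported in the text.
anchored_pct = 100 * 521.76 / 547.39

print(f"toy N50: {toy_n50}")
print(f"anchored: {anchored_pct:.2f}%")
```

Because N50 depends only on the sorted length distribution, it can rise after scaffolding even when the total assembly size falls, as happens here between the contig-level and final assemblies.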

Repeat elements annotation

The EDTA pipeline²⁵, which was developed for automated whole-genome de novo TE annotation, was used to annotate repeat elements in the S. canaliculatus genome. It first uses LTR-FINDER v1.0.6²⁶ and LTRharvest²⁷ to predict LTR retrotransposons, HelitronScanner²⁸ to predict Helitrons and TIR-Learner²⁹ to predict TIR elements. LTR_retriever v3.0.3³⁰ was then used to filter false-positive LTR results. Subsequently, the basic and advanced filters in EDTA were applied for additional filtering, resulting in a raw TE library. This raw library was used by RepeatMasker v4.1.2-p1³¹ to mask the target genome, followed by RepeatModeler v2.0.3³² to predict the remaining TEs in the genome. In total, 89,597,434 bp (16.37%) were identified as repetitive sequences (Table 2), with LTR accounting for 2.58%, TIR 4.19%, nonLTR 0.38%, nonTIR 0.58% and repeat_region 8.1%.

Table 2 Statistics of repetitive sequences.

Gene structure prediction and functional annotation

The masked genome generated in the repeat annotation step was used as input for gene structure prediction. Three commonly adopted approaches were employed in this study: (1) Ab initio prediction: AUGUSTUS v3.5.0³³ and GeneMark-ET³⁴ were used; (2) Homology-based prediction: protein sequences from five representative species (Danio rerio, Oreochromis niloticus, Oryzias latipes, Scatophagus argus, Takifugu rubripes) were downloaded from the NCBI database and used as references to predict gene structures in the S. canaliculatus genome with blastx v2.2.26³⁵ and exonerate v2.2³⁶; (3) Transcriptome-based prediction: for RNA-seq, raw reads were filtered using fastp¹⁷ (-a auto --adapter_sequence_r2 auto --dedup --dup_calc_accuracy 3). After filtering, 16.96 Gb of clean reads were mapped onto the S. canaliculatus genome using HISAT2 v2.2.1³⁷, assembled with StringTie v2.2.1³⁸ and merged with TACO v0.7.3³⁹. For Iso-seq, raw reads were processed using the isoseq pipeline⁴⁰, and GMAP⁴¹ was used to align the cDNA to the S. canaliculatus genome. Finally, the gene structures predicted by the three approaches were integrated using MAKER v3.01.03⁴². Genes with an Annotation Edit Distance (AED) ≤1 were retained in the final dataset.
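The AED-based retention step can be illustrated with a small filter over GFF3 records. This is a sketch under stated assumptions, not the authors' script (the paper reports that no custom scripts were developed): it assumes the AED is stored in the `_AED` attribute of `mRNA` features, as in MAKER's GFF3 output, and the helper names, gene IDs and coordinates are hypothetical.

```python
def aed_of(gff_line):
    """Return the _AED value of a GFF3 mRNA record, or None if absent."""
    fields = gff_line.rstrip("\n").split("\t")
    if len(fields) != 9 or fields[2] != "mRNA":
        return None
    for item in fields[8].split(";"):
        key, _, value = item.partition("=")
        if key == "_AED":
            return float(value)
    return None


def keep_by_aed(gff_lines, max_aed=1.0):
    """Keep mRNA records whose AED is at or below max_aed."""
    kept = []
    for line in gff_lines:
        aed = aed_of(line)
        if aed is not None and aed <= max_aed:
            kept.append(line)
    return kept


def make_line(feature, attributes):
    # Hypothetical record: sequence name, source and coordinates are made up.
    return "\t".join(["chr1", "maker", feature, "100", "900", ".", "+", ".", attributes])


example = [
    make_line("gene", "ID=geneA"),
    make_line("mRNA", "ID=geneA-RA;Parent=geneA;_AED=0.08;_eAED=0.08"),
    make_line("mRNA", "ID=geneB-RA;Parent=geneB;_AED=0.74;_eAED=0.80"),
]

print(len(keep_by_aed(example)))               # threshold from the text (AED <= 1)
print(len(keep_by_aed(example, max_aed=0.5)))  # a stricter, illustrative cutoff
```

Since AED ranges from 0 (perfect agreement with the evidence) to 1 (no supporting evidence), a cutoff of 1 keeps every scored model; the `max_aed` parameter is exposed only to show how a stricter threshold would behave.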

For functional annotation of the predicted genes, protein sequences were extracted from the S. canaliculatus genome and searched against nine commonly used protein databases (NR, Swissprot, KEGG, KOG, GO, Pfam, TrEMBL, eggNOG, InterPro) using DIAMOND v0.9.25⁴³ with an E-value threshold of 1e⁻⁵ and InterProScan v5.59-91.0⁴⁴.
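As an illustration of the E-value threshold, the sketch below collects the query proteins that have at least one hit at or below 1e-5 in a tabular alignment report. It assumes the 12-column BLAST-style tabular layout (query ID first, E-value in the eleventh column); the paper does not state which output format was used, and the protein and subject IDs are invented.

```python
EVALUE_COLUMN = 10  # 0-based index of the E-value in 12-column tabular output


def annotated_queries(tabular_text, max_evalue=1e-5):
    """Return the set of query IDs with at least one hit passing the threshold."""
    hits = set()
    for line in tabular_text.splitlines():
        fields = line.split("\t")
        if len(fields) < 12:
            continue  # skip blank or malformed lines
        if float(fields[EVALUE_COLUMN]) <= max_evalue:
            hits.add(fields[0])
    return hits


# Hypothetical alignment rows: qseqid, sseqid, pident, length, mismatch,
# gapopen, qstart, qend, sstart, send, evalue, bitscore.
example = "\n".join([
    "\t".join(["gene1", "sp|P1", "90.0", "200", "20", "0", "1", "200", "1", "200", "1e-80", "350"]),
    "\t".join(["gene2", "sp|P2", "35.0", "60", "39", "0", "1", "60", "5", "64", "0.002", "40"]),
    "\t".join(["gene3", "sp|P3", "70.0", "150", "45", "0", "1", "150", "1", "150", "3.5e-20", "180"]),
])

passing = annotated_queries(example)
print(sorted(passing))
```

In practice the same cutoff can be applied at search time rather than afterwards, so a post hoc filter like this is mainly useful for re-examining an existing report with a different threshold.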

Non-coding RNAs (ncRNAs; i.e., tRNAs, rRNAs, miRNAs, snRNAs and snoRNAs) in the S. canaliculatus genome were also annotated. We first used tRNAscan-SE v1.3.1⁴⁵ to predict tRNAs in the assembly. rRNA genes were predicted with RNAmmer v1.2⁴⁶ (-S euk -m lsu,ssu,tsu -gff). miRNAs, snRNAs and snoRNAs were searched with cmscan v1.1.2⁴⁷ against the Rfam v14.10 database⁴⁸ (--cut_ga --rfam --nohmmonly --tblout --fmt 2).

For ab initio prediction, AUGUSTUS v3.5.0³³ and GeneMark-ET³⁴ found 38,789 and 38,161 genes in the S. canaliculatus genome, respectively. The homology-based approach predicted 37,191 to 49,829 genes, depending on the reference species. RNA-seq-based evidence predicted 30,416 genes, while Iso-seq-based evidence found 35,972 genes (Table 3). After integration with MAKER v3.01.03⁴², 25,323 protein-coding genes were annotated, with 572 to 1415 genes per chromosome (Table 4). Functional annotation showed that 71.45% to 96.68% of proteins had hits, depending on the database (Fig. 3). After removing redundancy, 96.96% of proteins had a hit in at least one database (Table 5). For ncRNA annotation, 1352 miRNAs, 1551 tRNAs, 2968 rRNAs, 260 snRNAs and 209 snoRNAs were predicted in the S. canaliculatus genome (Table 6).
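The difference between the per-database annotation rates and the combined "at least one database" rate comes down to a set union. The sketch below shows that calculation on toy data; the gene IDs and hit sets are invented, and only two of the nine databases are shown for brevity.

```python
def percent_annotated(all_genes, hits_by_database):
    """Percentage of genes with a hit in at least one database."""
    annotated = set().union(*hits_by_database.values()) & all_genes
    return 100 * len(annotated) / len(all_genes)


# Hypothetical gene set and per-database hits.
all_genes = {"g1", "g2", "g3", "g4", "g5"}
hits_by_database = {
    "NR": {"g1", "g2", "g3"},
    "Pfam": {"g2", "g4"},
}

per_database = {
    name: 100 * len(hits & all_genes) / len(all_genes)
    for name, hits in hits_by_database.items()
}
combined = percent_annotated(all_genes, hits_by_database)

print(per_database)
print(f"at least one database: {combined:.1f}%")
```

A gene found in several databases is counted once in the union, which is why the combined rate can exceed every individual rate without being their sum.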

Table 3 Statistics of gene prediction.

Table 4 Statistics of gene numbers predicted across each chromosome.

Table 5 Statistics of gene functional annotation.

Table 6 Statistics of non-coding genes.

Data Records

Raw reads sequenced in this study have been submitted to the National Genomics Data Center (https://ngdc.cncb.ac.cn/, BioProject number: PRJCA029961⁴⁹, Run IDs: CRR1288946-CRR1288949). The genome sequences and annotation files were deposited at figshare (https://doi.org/10.6084/m9.figshare.27117169⁵⁰) and NCBI (accession number: JBLRWB000000000⁵¹).

Technical Validation

The quality of the assembly was assessed using BUSCO v5.5.0⁵² with the actinopterygii_odb10 database (3,640 BUSCOs). The assessment showed that 3589 (98.6%) BUSCOs were complete, of which 3574 (98.2%) were single-copy and 15 (0.4%) were duplicated. The chromosome number of the S. canaliculatus genome was confirmed by the Hi-C heat map (Fig. 2). A completeness assessment of the predicted proteins showed that 3518 (96.6%) BUSCOs were complete, of which 3488 (95.8%) were single-copy and 30 (0.8%) were duplicated (Fig. 4). Taking these results and quality metrics together, we conclude that the S. canaliculatus genome assembly is of high quality and is well annotated.
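The BUSCO percentages above follow directly from the reported counts and the 3,640 BUSCOs in actinopterygii_odb10. The short check below recomputes them and confirms that the single-copy and duplicated counts add up to the complete count; the dictionary layout and function name are illustrative.

```python
TOTAL_BUSCOS = 3640  # actinopterygii_odb10, as stated in the text

# Counts reported in the text.
genome_counts = {"complete": 3589, "single_copy": 3574, "duplicated": 15}
protein_counts = {"complete": 3518, "single_copy": 3488, "duplicated": 30}


def as_percentages(counts, total=TOTAL_BUSCOS):
    """Convert BUSCO counts to percentages rounded to one decimal place."""
    return {name: round(100 * count / total, 1) for name, count in counts.items()}


genome_pct = as_percentages(genome_counts)
protein_pct = as_percentages(protein_counts)

for label, counts, pct in (
    ("genome", genome_counts, genome_pct),
    ("proteins", protein_counts, protein_pct),
):
    consistent = counts["complete"] == counts["single_copy"] + counts["duplicated"]
    print(label, pct, "counts consistent:", consistent)
```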

Code availability

No new scripts or pipelines were developed for this study. The software used for raw data quality control, genome assembly, annotation and quality assessment is described in the Methods section of this paper, with parameters specified where applicable.

References

Froese, R. & Pauly, D. Family Siganidae. FishBase (2023).
Randall, J. E. & Kulbicki, M. Siganus woodlandi, new species of rabbitfish (Siganidae) from New Caledonia. Cybium 29, 185–189 (2005).
Kuriiwa, K., Hanzawa, N., Yoshino, T., Kimura, S. & Nishida, M. Phylogenetic relationships and natural hybridization in rabbitfishes (Teleostei: Siganidae) inferred from mitochondrial and nuclear DNA analyses. Mol Phylogenet Evol 45, 69–80, https://doi.org/10.1016/j.ympev.2007.04.018 (2007).
Yang, Y. et al. Comparative analysis of nutritional composition of muscle from Siganus oramin living in different habitats (in Chinese). South China Fisheries Science 19, 128–134 (2023).
Ma, Q. & Lu, J. Introduction and prospect of the systematics study of Siganidae in China (in Chinese). South China Fisheries Science 2 (2006).
Huang, X. et al. Morphology and growth of larval, juvenile and young Siganus oramin (in Chinese). South China Fisheries Science 14, 88–94 (2018).
Li, Y. et al. Vertebrate fatty acyl desaturase with Delta4 activity. Proc Natl Acad Sci USA 107, 16840–16845, https://doi.org/10.1073/pnas.1008429107 (2010).
Li, Y. et al. Genome wide identification and functional characterization of two LC-PUFA biosynthesis elongase (elovl8) genes in rabbitfish (Siganus canaliculatus). Aquaculture 522 https://doi.org/10.1016/j.aquaculture.2020.735127 (2020).
Wen, Z., Li, Y., Bian, C., Shi, Q. & Li, Y. Characterization of two kcnk3 genes in rabbitfish (Siganus canaliculatus): Molecular cloning, distribution patterns and their potential roles in fatty acids metabolism and osmoregulation. Gen Comp Endocrinol 296, 113546, https://doi.org/10.1016/j.ygcen.2020.113546 (2020).
Huang, X. et al. Genetic variations among Siganus oramin populations in coastal waters of southeast China based on mtDNA control region sequences (in Chinese). Journal of Tropical Oceanography 37, 45–51, https://doi.org/10.11978/2017109 (2018).
Peng, M. et al. Genetic diversity analysis of different geographical populations of Siganus canaliculatus along the South China Coast (in Chinese). Journal of Hydroecology 43, 127–133, https://doi.org/10.15928/j.1674-3075.202104280127 (2022).
Huang, X. et al. Phylogenetic information analysis of mitochondrial genome sequences in Siganus (Perciformes: Siganidae) (in Chinese). Journal of Biology 35, 33–36 (2018).
Huang, X. et al. Gonadal development of first sexual maturation of Siganus oramin cultured in pond (in Chinese). South China Fisheries Science 16, 99–107, https://doi.org/10.12131/20200051 (2020).
Feng, G. et al. Feeding habit and growth characteristics of Siganus canaliculatus cultured in sea net cage (in Chinese). Marine Fisheries 30, 37–42 (2008).
Jiang, B. et al. Transcriptome analysis provides insights into molecular immune mechanisms of rabbitfish, Siganus oramin against Cryptocaryon irritans infection. Fish Shellfish Immunol 88, 111–116, https://doi.org/10.1016/j.fsi.2019.02.039 (2019).
Rhoads, A. & Au, K. F. PacBio Sequencing and Its Applications. Genomics Proteomics Bioinformatics 13, 278–289, https://doi.org/10.1016/j.gpb.2015.08.002 (2015).
Chen, S., Zhou, Y., Chen, Y. & Gu, J. fastp: an ultra-fast all-in-one FASTQ preprocessor. Bioinformatics 34, i884–i890, https://doi.org/10.1093/bioinformatics/bty560 (2018).
Li, C. et al. Full-Length Transcriptome Data for the White Cloud Mountain Minnow (Tanichthys albonubes) From a Wild Population Based on Isoform Sequencing. Frontiers in Marine Science 9 https://doi.org/10.3389/fmars.2022.831148 (2022).
Cheng, H., Concepcion, G. T., Feng, X., Zhang, H. & Li, H. Haplotype-resolved de novo assembly using phased assembly graphs with hifiasm. Nat Methods 18, 170–175, https://doi.org/10.1038/s41592-020-01056-5 (2021).
Guan, D. et al. Identifying and removing haplotypic duplication in primary genome assemblies. Bioinformatics 36, 2896–2898, https://doi.org/10.1093/bioinformatics/btaa025 (2020).
Jiang, Z. et al. A deep learning-based method enables the automatic and accurate assembly of chromosome-level genomes. Nucleic Acids Res https://doi.org/10.1093/nar/gkae789 (2024).
Hu, J., Fan, J., Sun, Z. & Liu, S. NextPolish: a fast and efficient genome polishing tool for long-read assembly. Bioinformatics 36, 2253–2255, https://doi.org/10.1093/bioinformatics/btz891 (2020).
Lin, Y. et al. quarTeT: a telomere-to-telomere toolkit for gap-free genome assembly and centromeric repeat identification. Hortic Res 10, uhad127, https://doi.org/10.1093/hr/uhad127 (2023).
Shu, H., Huang, C., Zhang, H. & Wang, Y. Studies on the karyotype of Siganus canaliculatus (in Chinese). Journal of Guangzhou University (Natural Science Edition) 9, 90–93 (2010).
Ou, S. et al. Benchmarking transposable element annotation methods for creation of a streamlined, comprehensive pipeline. Genome Biol 20, 275, https://doi.org/10.1186/s13059-019-1905-y (2019).
Xu, Z. & Wang, H. LTR_FINDER: an efficient tool for the prediction of full-length LTR retrotransposons. Nucleic acids research 35, W265–W268 (2007).
Ellinghaus, D., Kurtz, S. & Willhoeft, U. LTRharvest, an efficient and flexible software for de novo detection of LTR retrotransposons. BMC Bioinformatics 9, 18, https://doi.org/10.1186/1471-2105-9-18 (2008).
Xiong, W., He, L., Lai, J., Dooner, H. K. & Du, C. HelitronScanner uncovers a large overlooked cache of Helitron transposons in many plant genomes. Proc Natl Acad Sci USA 111, 10263–10268, https://doi.org/10.1073/pnas.1410068111 (2014).
Su, W., Gu, X. & Peterson, T. TIR-Learner, a New Ensemble Method for TIR Transposable Element Annotation, Provides Evidence for Abundant New Transposable Elements in the Maize Genome. Mol Plant 12, 447–460, https://doi.org/10.1016/j.molp.2019.02.008 (2019).
Ou, S. & Jiang, N. LTR_retriever: A Highly Accurate and Sensitive Program for Identification of Long Terminal Repeat Retrotransposons. Plant Physiol 176, 1410–1422, https://doi.org/10.1104/pp.17.01310 (2018).
Tarailo‐Graovac, M. & Chen, N. Using RepeatMasker to identify repetitive elements in genomic sequences. Current protocols in bioinformatics 25, 4.10.1–4.10.14 (2009).
Abrusán, G., Grundmann, N., DeMester, L. & Makalowski, W. TEclass—a tool for automated classification of unknown eukaryotic transposable elements. Bioinformatics 25, 1329–1330 (2009).
Stanke, M. et al. AUGUSTUS: ab initio prediction of alternative transcripts. Nucleic Acids Res 34, W435–439, https://doi.org/10.1093/nar/gkl200 (2006).
Lukashin, A. & Borodovsky, M. GeneMark.hmm: new solutions for gene finding. Nucleic acids research 26, 1107–1115 (1998).
Camacho, C. et al. BLAST+: architecture and applications. BMC Bioinformatics 10, 421, https://doi.org/10.1186/1471-2105-10-421 (2009).
Slater, G. S. & Birney, E. Automated generation of heuristics for biological sequence comparison. BMC Bioinformatics 6, 31, https://doi.org/10.1186/1471-2105-6-31 (2005).
Kim, D., Paggi, J. M., Park, C., Bennett, C. & Salzberg, S. L. Graph-based genome alignment and genotyping with HISAT2 and HISAT-genotype. Nat Biotechnol 37, 907–915, https://doi.org/10.1038/s41587-019-0201-4 (2019).
Pertea, M. et al. StringTie enables improved reconstruction of a transcriptome from RNA-seq reads. Nat Biotechnol 33, 290–295, https://doi.org/10.1038/nbt.3122 (2015).
Niknafs, Y. S., Pandian, B., Iyer, H. K., Chinnaiyan, A. M. & Iyer, M. K. TACO produces robust multisample transcriptome assemblies from RNA-seq. Nat Methods 14, 68–70, https://doi.org/10.1038/nmeth.4078 (2017).
PacificBiosciences. IsoSeq. github, https://github.com/PacificBiosciences/IsoSeq?tab=readme-ov-file (2024).
Wu, T. D. & Watanabe, C. K. GMAP: a genomic mapping and alignment program for mRNA and EST sequences. Bioinformatics 21, 1859–1875, https://doi.org/10.1093/bioinformatics/bti310 (2005).
Cantarel, B. L. et al. MAKER: an easy-to-use annotation pipeline designed for emerging model organism genomes. Genome Res 18, 188–196, https://doi.org/10.1101/gr.6743907 (2008).
Buchfink, B., Xie, C. & Huson, D. H. Fast and sensitive protein alignment using DIAMOND. Nat Methods 12, 59–60, https://doi.org/10.1038/nmeth.3176 (2015).
Quevillon, E. et al. InterProScan: protein domains identifier. Nucleic Acids Res 33, W116–120, https://doi.org/10.1093/nar/gki442 (2005).
Chan, P. P. & Lowe, T. M. tRNAscan-SE: Searching for tRNA Genes in Genomic Sequences. Methods Mol Biol 1962, 1–14, https://doi.org/10.1007/978-1-4939-9173-0_1 (2019).
Lagesen, K. et al. RNAmmer: consistent and rapid annotation of ribosomal RNA genes. Nucleic Acids Res 35, 3100–3108, https://doi.org/10.1093/nar/gkm160 (2007).
Nawrocki, E. P. & Eddy, S. R. Infernal 1.1: 100-fold faster RNA homology searches. Bioinformatics 29, 2933–2935, https://doi.org/10.1093/bioinformatics/btt509 (2013).
Kalvari, I. et al. Rfam 14: expanded coverage of metagenomic, viral and microRNA families. Nucleic Acids Res 49, D192–D200, https://doi.org/10.1093/nar/gkaa1047 (2021).
Chao, L. White-spotted spinefoot genome data archieve. National Genomics Data Center https://bigd.big.ac.cn/gsa/browse/CRA018870 (2024).
Chao, L. Chromosome-level genome assembly and annotation of the White-spotted spinefoot Siganus canaliculatus. figshare https://doi.org/10.6084/m9.figshare.27117169 (2024).
Chao, L. White-spotted spinefoot genome. GenBank https://identifiers.org/ncbi/insdc:JBLRWB000000000 (2025).
Manni, M., Berkeley, M. R., Seppey, M., Simao, F. A. & Zdobnov, E. M. BUSCO Update: Novel and Streamlined Workflows along with Broader and Deeper Phylogenetic Coverage for Scoring of Eukaryotic, Prokaryotic, and Viral Genomes. Mol Biol Evol 38, 4647–4654, https://doi.org/10.1093/molbev/msab199 (2021).

Acknowledgements

This study was financially supported by the Core Technology Research Project for Suitable Species of Modern Marine Ranch in Guangdong Province (2024-MRB-00-001), Central Public-interest Scientific Institution Basal Research Fund (CAFS2023TD58). Chao Li was funded by the Natural Science Foundation of China (32300366), Guangdong Basic and Applied Basic Research Foundation (2023A1515010991;2022A1515110391), Guangzhou Basic and Applied Basic Research Foundation (2024A04J00318), China Postdoctoral Science Foundation (2022M711218), Open Project of Institute of Zoology, Guangdong Academy of Sciences (GIZ-KF202302).

Author information

These authors contributed equally: Xiaolin Huang, Yanke Lu, Hui Zhang, Lin Xian.

Authors and Affiliations

Chinese Academy of Fishery Sciences, Key Laboratory of South China Sea Fishery Resources Exploitation and Utilization, Ministry of Agriculture and Rural Affairs, South China Sea Fisheries Research Institute, Chinese Academy of Fishery Sciences, Guangzhou, 510300, China
Xiaolin Huang, Lin Xian, Yukai Yang & Dianchang Zhang
Sanya Tropical Fisheries Research Institute, Hainan Engineering Research Center of deep-sea aquaculture and processing, Sanya, 572018, China
Xiaolin Huang, Lin Xian, Yukai Yang & Dianchang Zhang
National Fishery Resources and Environment Dapeng Observation and Experimental Station, Shenzhen Base of South China Sea Fisheries Research Institute, Chinese Academy of Fishery Sciences, Shenzhen, 518121, China
Xiaolin Huang, Yukai Yang & Dianchang Zhang
Guangzhou Key Laboratory of Subtropical Biodiversity and Biomonitoring, Guangdong Provincial Key Laboratory for Healthy and Safe Aquaculture, Guangdong Provincial Engineering Technology Research Center for Environmentally Friendly Aquaculture, School of Life Sciences, South China Normal University, Guangzhou, China
Yanke Lu, Hui Zhang, Shiting Huang, Lei Wang & Chao Li
State Key Laboratory of Agricultural Genomics, BGI-Shenzhen, Shenzhen, 518083, China
Lin Xian

Contributions

C.L., X.H. and D.Z. conceived this project; H.Z., Y.L. and S.H. collected and identified the samples; C.L., Y.L., L.W. and X.H. did the genome assembly and annotation. C.L., H.X., Y.L. and L.X. wrote the manuscript. All authors have read and approved the final manuscript for publication.

Corresponding authors

Correspondence to Dianchang Zhang or Chao Li.

Ethics declarations

Competing interests

The authors declare no competing interests.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.

About this article

Cite this article

Huang, X., Lu, Y., Zhang, H. et al. Chromosome-level genome assembly and annotation of the White-spotted spinefoot Siganus canaliculatus. Sci Data 12, 482 (2025). https://doi.org/10.1038/s41597-025-04844-w

Received: 11 October 2024
Accepted: 17 March 2025
Published: 23 March 2025
DOI: https://doi.org/10.1038/s41597-025-04844-w