Background & Summary

The Plagiognathops microlepis, commonly known as smallscale yellowfin, is a benthic fish belonging to the subfamily Xenocyprinae within the family Cyprinidae, the largest family of freshwater fishes1,2,3. P. microlepis is the sole species of the genus Plagiognathops and is widely distributed in lakes and rivers across East Asia4. Cultivated since the 1970s, it is highly regarded for its delicious taste5. P. microlepis possesses distinctive characteristics such as medium body size, strong adaptability, rapid growth, high reproductive ability, and low susceptibility to breeding diseases, making it economically valuable6. The successful implementation of artificial breeding techniques has elevated P. microlepis to a crucial source of animal protein for local people in recent years7. Its delectable and tender flesh has garnered popularity among consumers, resulting in a high commercial value3. It has been included in the list of key protected economic aquatic animal and plant resources for China (first batch).

In its natural habitat, P. microlepis primarily uses its keratinized jaw to scrape humus, organic debris, fixed algae and higher plant debris from the bottom of water bodies1. Therefore, P. microlepis can effectively clear organic matter and algae from the water, which helps improve water quality and mitigate eutrophication processes. It can form a population in shallow lakes or reservoirs with abundant humus and plays an important role in maintaining local ecological balance8,9. Consequently, P. microlepis is recognized as an environmentally beneficial fish and is often used in maintaining aquatic ecological systems10. Furthermore, its population distribution across distinct and isolated basins makes it suitable for comparative genetic investigations11,12. Despite its ecological and commercial significance, the population size of P. microlepis has declined in the wild in recent years13,14. Market demand has driven the growing interest in P. microlepis aquaculture over the past decades, as wild populations are threatened by overfishing and water pollution. However, despite the increasing interest, there is a need for reported genome resources for P. microlepis and other fishes in the subfamily Xenocyprinae, significantly impeding research on their phylogeny and evolution biology.

Decoding a high-quality reference genome could provide valuable insights into conservation biology, genetics, and breeding studies for the species15. Previous studies on P. microlepis have primarily focused on its biological and physiological characteristics, reproductive behavior, age and growth patterns, dietary habits, muscle composition, population genetics, breeding strategies and nutritional requirements16. Additionally, mitochondrial cytochrome b and microsatellite markers have been used to assess their genetic structure, while isozymes have been used to identify their germplasm3,16. Whole-genome sequencing and analysis of P. microlepis genomic data offer a foundational framework for investigating species diversity, population dynamics, and developing strategic conservation measures, aquaculture, and breeding practices.

This study presents the first high-quality chromosome-level genome assembly of P. microlepis, achieved using PacBio HiFi and Hi-C sequencing methodologies. The assembly yielded a 976.41 Mb genome with a contig N50 length of 34.41 Mb and a scaffold N50 of 38.38 Mb. Using Hi-C data, 99.84% of the assembled bases were successfully anchored to 24 chromosomes, resulting in a high-quality reference genome. This achievement provides a solid foundation for future population and conservation genetic studies. Moreover, the chromosome-scale genome of P. microlepis will contribute to comparative genomic studies on the genome evolution of the Xenocyprinae subfamily and beyond.

Methods

Sample collection and DNA extraction

A mature male P. microlepis was obtained from the Fisheries Research Institute, Anhui Academy of Agricultural Sciences. Muscle tissue from this specimen was used for DNA extraction and subsequent whole-genome sequencing, including both short-read and long-read sequencing, as well as Hi-C sequencing. All experiments adhered to the recommendations of the Ethics Committee of the Fisheries Research Institute, Anhui Academy of Agricultural Sciences. Genomic DNA was extracted from the muscle tissues using the Qiagen DNeasy Blood and Tissue Kit (Qiagen, USA) following the manufacturer’s instructions. The quality and concentration of DNA were assessed with a NanoDrop One spectrophotometer (Thermo Scientific, USA) and 1% agarose gel electrophoresis.

Genome sequencing

The extracted DNA was randomly sheared into approximately 350 bp fragments, and a short fragment library was constructed using the MGIEasy PCR-Free DNA library prep set (BGI, China). Sequencing was conducted on the DNBSEQ-T7 platform (MGI, China), yielding a total of 59.19 Gb of paired-end raw reads, each 150 bp in length (Table 1). For PacBio sequencing, a SMRTbell Express Template Prep Kit 2.0 was used for HiFi sequencing based on PacBio’s standard protocol (Pacific Biosciences, USA), and a long-read library was constructed using the PacBio Sequel II system. This process generated 35.65 Gb of raw PacBio continuous long reads (HiFi) with an average length of 15.91 kb (Table 1). For Hi-C sequencing, approximately 1 g of muscle tissue from the male P. microlepis was used. The GrandOmics Hi-C kit (with DpnII as the restriction enzyme; GrandOmics, China) was used according to the manufacturer’s protocol. The Hi-C library was sequenced on the DNBSEQ-T7 platform (MGI, China), producing a total of 114.16 Gb of Hi-C read data for accessory chromosomes (Table 1).

Table 1 Statistics of sequencing data.

RNA extraction and transcriptome sequencing

To facilitate genome annotation, total RNA was extracted from a total of 11 tissues, including the spleen, kidney, brain, muscle, testis, liver, heart, intestine, skin, etc. The quality of the RNA was assessed using a NanoDrop 2000 spectrophotometer (Thermo Fisher Scientific, USA) and an Agilent 2100 Bioanalyzer (Agilent Technologies, USA). A composite RNA sample was then used to construct a cDNA library with the MGIEasy library prep set (BGI, China), following the manufacturer’s instructions. The library was sequenced on the DNBSEQ-T7 platform using a paired-end 150 bp layout, yielding 50.60 Gb of transcriptome data to assist in genome and gene annotations (Table 1).

Genome size and heterozygosity estimation

The genome size of P. microlepis was estimated through a k-mer analysis using MGI clean reads. Initially, Jellyfish (v2.3.0)17 was used to calculate the frequency of 17-mers and generate a k-mer frequency table. Subsequently, GenomeScope (v2.0)18 analyzed the 17-mer frequency table, identifying a total of 839,321,169 17-mers, with 0.522% heterozygosity and 46.9% unique sequences (Fig. 1).

Fig. 1
figure 1

The 17-mer frequency distribution analysis chart for the P. microlepis genome.

Genome assembly

The genome assembly employed the default parameters of Hifiasm (v0.19.5)19. HiFi long reads were used as input for Hifiasm to construct primary assembly contig graphs. Overlaps were precomputed and subsequently recalculated from the corrected reads. Haploid duplications were removed using Hifiasm, followed by three rounds of error correction. The assembled graph yielded 99 contigs with a total length of 976.41 Mb, a maximum contig size of 57.12 Mb, and an N50 of 34.41 Mb (Table 2).

Table 2 Summary statistics of P. microlepis genome assembly.

Juicer (v1.6)20 combined with 3D-DNA (v201008)21 was used for scaffolding. Initially, BWA (v0.7.17)22 was used to index the contig-level genome, and Juicer was used to create restriction enzyme cutting sites. Clean Hi-C (paired-end) reads were mapped to the contigs using Juicer, and Hi-C-assisted initial chromosome assembly was conducted using the 3D-DNA algorithm with the standard procedure. Chromosome boundaries were adjusted, and scaffolds were corrected using the Juicerbox (v1.11.08)23 module, resulting in the resolution of 24 chromosomes (Figs. 2, 3). The results from the modified file output by Juicebox were revised and used as input for 3D-DNA to conduct re-scaffolding on a per-chromosome basis. The “FINAL” assembly consisted of a total of 67 scaffolds, with a maximum scaffold size of 57.81 Mb and an N50 size of 38.38 Mb (Tables 2, 3).

Fig. 2
figure 2

Hi-C contact map produced by 3D-DNA.

Fig. 3
figure 3

Features of the P. microlepis genome. From outside to inside: (a) The 24 pseudo-chromosomes, (b) GC content, (c) Gene density repeats, (d) Repeats content, (e) LTR content, (f) LINE content, (g) DNA-TE content and (h) Collinear regions (each line connects paralogous genes on different chromosomes).

Table 3 Pseudo-chromosome length statistics after Hi-C assisted assembly of P. microlepis.

Repeat annotation

Recognizing the importance of tandem repeats, we used the software tools GMATA (v2.2.1)24 and Tandem Repeats Finder (TRF, v4.10.0)25 to conduct a genome-wide search for tandem repeat sequences with default parameters. GMATA specializes in detecting SSRs with shorter repeat units, while TRF identifies tandem repeats encompassing all types of repeat units. Subsequently, we analyzed the dispersed repetitive sequences. Initially, MITE-hunter26 was employed to detect a small transposon known as MITE within the genome, constructing a MITE library file. Following this, a hard mask operation was executed on the genome, marking repeated sequences as “N,” and RepeatModeler27 was used to conduct a de novo search for repeated sequences to form a de novo library file (RepMod.lib). Given the presence of numerous unknown repeated sequences in RepMod.lib, TEclass28 was employed for classification. Lastly, the MITE.lib, RepMod.lib, and Repbase29 libraries were integrated to form a general library file. This comprehensive library file was then used with RepeatMasker30 to search for repeated sequences throughout the entire genome.

The results indicate that dispersed repetitive sequences account for 45.18% of the total genome length (Table 4). Among transposable elements (TEs), DNA elements are the most prevalent type, accounting for 33.13% of the genome, followed by long terminal repeat (LTR) retrotransposons at 6.86%, long interspersed nuclear elements (LINEs) at 3.55%, and short interspersed nuclear elements (SINEs) at 0.26% of the genome. Ultimately, a total of 499,767,501 base pairs of repetitive sequences were identified, comprising 51.18% of the entire genome (Table 4, Fig. 4).

Table 4 Repetitive sequences in the genome of P. microlepis.
Fig. 4
figure 4

Four types of TE sequence divergence distribution diagram annotated by RepeatMasker.

Gene prediction and function assignment

Gene structure prediction was conducted using three methodologies: homology-based, transcriptome-based, and ab initio annotations. For the homology-based prediction, we employed GEMOMA (v1.6.1)31 to compare homologous proteins from five related species (Danio rerio, Carassius auratus, Cyprinus carpio, Megalobrama amblycephala, Sinocyclocheilus grahami) with our assembled genome. Transcriptome sequence annotation was performed with PASA (v2.3.3)32 to obtain gene information, which was then used in semi-supervised self-training of gene models with GeneMark-ST33 software for prediction. The predicted genes were compared with the SwissProt Database34 using Blastp, and alignment results were filtered based on identity (identity ≥ 95%). The top 3,000 genes with the highest alignment scores from GeneMark-ST were chosen as the training set for AUGUSTUS35 model training. Subsequently, AUGUSTUS was employed to predict genes within the genome using the developed predictive model.

The predicted gene set information files from ab initio prediction, homology-based annotation, and transcriptome prediction were converted into a file format compatible with EVM32. These files were integrated using the EVM software with default parameters, resulting in an initial non-redundant gene set. In total, our predictions identified 25,389 genes in the genome, with an average gene length of 17,691.90 bp, an average exon sequence length of 178.06 bp, and an average of 9.90 exons per gene (Table 5, Fig. 5).

Table 5 Gene structures and function annotation of P. microlepis.
Fig. 5
figure 5

Venn diagram of function annotations from various databases.

Phylogenetic tree

We searched the NCBI database for the following species: Lepisosteus oculatus, Takifugu rubripes, Oryzias latipes, Astyanax mexicanus, Misgurnus anguillicaudatus, Danio rerio, Cyprinus carpio, Puntigrus tetrazona, Gymnocypris eckloni, Onychostoma macrolepis, Cirrhinus molitorella, Ctenopharyngodon idella, Chanodichthys erythropterus, Megalobrama amblycephala, Hypophthalmichthys nobilis. Gene information for P. microlepis and these fifteen species was annotated and filtered. Subsequent gene family identification was carried out using OrthoFinder (v2.5.4)36 with default parameters. Firstly, pairwise comparisons (E-value ≤ 1e-5) of protein sequence datasets from all studied species were performed using BLASTP (v2.2.26)37 software to obtain similarity information between protein sequences. Secondly, genes were clustered to identify direct homologous genes, paralogous genes, and single-copy homologous genes using the MCL algorithm. Based on the single-copy immediate homologous genes identified in the family identification results, MAFFT38 software was utilized to generate a protein multiple sequence alignment dataset (protein-MSA). Then, the PAL2NAL tool was employed to perform protein sequence comparison, converting it into corresponding codon-based sequences (CDS), i.e., CDS-MSA. Finally, a species tree was constructed using the GTRGAMMA model in RAXML39, with Lepisosteus oculatus designated as the out group, and further edited using Figtree. Additionally, the MCMCtree tool in the PAML (v4.10.6)40 package was used to estimate divergence times among the fifteen selected fish species. The phylogenetic tree revealed that the closest species to P. microlepis are Chanodichthys erythropterus, the ancestor of P. microlepis separated from the ancestor of C. erythropterus approximately 8.8 million years ago. The ancestor of P. microlepis separated from the ancestor of Cyprinus carpio around 33.4 million years (Fig. 6).

Fig. 6
figure 6

Phylogenetic analysis and divergence time tree of P. microlepis and other 15 representative species.

Genome synteny analysis

To compare whole genome synteny, two chromosome-level genomes, Danio rerio41 and Chanodichthys erythropterus42, were aligned with the genome assembly of P. microlepis using MCscan43 with default parameters. Syntenic relationships were plotted using the JCVI44 package. The results indicate that chromosome 1 of P. microlepis corresponds to chromosomes 10 and 22 of D. rerio, while the remaining 23 chromosomes exhibit a one-to-one correspondence. P. microlepis and C. erythropterus displayed a high level of conserved synteny, with a nearly perfect one-to-one correspondence between their chromosomes (Fig. 7). Some minor segmental exchanges were observed between P. microlepis and D. rerio, as well as C. erythropterus (Fig. 7). Nevertheless, the chromosome systems and homologies of these species remained largely unchanged, indicating a relatively conserved synteny.

Fig. 7
figure 7

Synteny analysis of chromosomes between P. microlepis, D. rerio and C. erythropterus.

Data Records

All the raw sequencing data utilized in this study, including WGS, Hifi, Hi-C and RNA-Seq have been deposited in the NCBI database under the BioProject accession number PRJNA111591645. The genome assembly is available for public access at the NCBI GenBank under the accession number GCA_040144785.146. Genome annotations, predicted coding sequences and protein sequences can be accessed through the Figshare47.

Technical Validation

Assessment of genome assembly

The accuracy of the P. microlepis genome assembly was assessed by evaluating its completeness using the conserved metazoan gene set ‘actinopterygii_odb10’ from BUSCO48 (v5.4.3). The analysis demonstrated a high level of completeness, with an overall completeness of 98.08%. Specifically, 96.79% genes were complete and single-copy, 1.3% genes were complete and duplicated, 0.55% genes were fragmented and genes 1.37% were missing. These results indicate that the P. microlepis genome assembly is of high quality (Table 6).

Table 6 BUSCO analysis of the genome assembly and genes.

Gene annotation validation

To assess the integrity of the annotated gene set, a BUSCO analysis was conducted using conserved single-copy homologous genes from the actinopterygii_odb10 library. The results showed that approximately 96.98% of the complete gene elements are present in the annotated gene set, indicating a high level of completeness in the conserved gene predictions. Specifically, 95.60% of the genes were complete and single-copy BUSCOs, with only 0.30% fragmented and 2.72% missing from the assembly (Table 7). These findings underscore the remarkable integrity and conservation of gene content in the dace genome assembly, resulting in highly reliable prediction outcomes.

Table 7 BUSCO analysis of the genome annotation and genes.