Background & Summary

The mud carp (Cirrhinus molitorella) is classified within the order Cypriniformes, family Cyprinidae, subfamily Labeoninae, and the genus Cirrhinus1. This species is distributed across southern China and Southeast Asia and is economically significant in southern China, with aquaculture production reaching 72,800 tons in 20212. Mud carp are recognized for their diverse dietary preferences, robust disease resistance, high productivity, and adaptability to various aquatic environments. In South China, it is a prominent species in pond aquaculture and is often listed among the four major economic carp species in Guangdong province, alongside grass carp (Ctenopharyngodon idella), silver carp (Hypophthalmichthys molitrix), and bighead carp (Hypophthalmichthys nobilis)3. Renowned for its tender, flavorful meat at a reasonable price, mud carp enjoys popularity as aquatic products in the market. Additionally, mud carp holds significant value in food processing, with products such as canned mud carp with black beans, mud carp cakes, mud carp balls, and mud carp skin being highly sought after by consumers4. Moreover, mud carp is an important species for leisure fishing in southern China, offering anglers an exhilarating experience with its strong forward rush when hooked, which attracts many fishing enthusiasts.

Genomics is the scientific field focused on studying biological genomes and gene functions5. This field includes sequencing, assembly, functional analysis, and other methodologies, providing crucial insights into the fundamental principles of life6. Recent advancements in second- and third-generation sequencing technologies have greatly improved sequencing efficiency, reduced costs, and expanded the application of these techniques7. Concurrently, genomic research has advanced significantly, particularly in the realm of fish genetics, witnessing a burgeoning number of genomic studies8. Currently, there are over 3,700 species of Cyprinidae distributed across 210 genera worldwide. To date, genome sequencing and assembly have been completed for only a few Cyprinidae species, including Zebrafish (Danio rerio)9, grass carp (Ctenopharyngodon idella)10,11, common carp (Cyprinus carpio)12, goldfish (Carassius auratus)13,14, silver carp (Hypophthalmichthys molitrix)15,16, bighead carp (Hypophthalmichthys nobilis)15,17, Blunt Snout Bream (Megalobrama amblycephala)18, and topmouth culter (Culter alburnus)19. These genome assemblies have facilitated research on species evolution, chromosome rearrangement, and genetic analysis of economic traits, serving as critical foundations for further investigation20,21.

To date, there are no reports on the chromosome-level genome assembly of fishes within the Labeoninae subfamily. Species in the Labeoninae are widely distributed across various water systems south of the Qinling Mountains in China, extending to Southeast Asia and South Asia22. In China alone, there are approximately 40 species and subspecies across 17 genera. Within the Labeoninae subfamily, mud carp holds significant importance in aquaculture. Renowned as the ‘pond scavenger’, mud carp has been cultivated in China for centuries23. As an omnivorous species, mud carp can consume a variety of diets, which facilitating its integration with other fish species to enhance production24. Previous studies on mud carp have primarily focused on its biological and physiological characteristics, growth patterns, dietary habits, population genetics, and sex determination4,25,26. The completion of whole genome sequencing and chromosome-level genome assembly for mud carp will provide a crucial foundation and reference for extensive research, not only on mud carp itself but also on other fish species within the Labeoninae subfamily.

In this study, we utilized PacBio Hifi long-read sequencing and Hi-C technology to generate a high-quality chromosome-level assembly of the mud carp genome. With the development of this high-quality reference genome, we anticipate significant advancements in the field of population genetics and the identification of functional genes associated with key economic traits in mud carp. The elucidation of this genomic foundation holds the potential to pave the way for molecular breeding and gene editing in mud carp.

Methods

Sample collection and DNA extraction

A mature male mud carp was collected from the Pearl River in Guangzhou, China. Muscle tissue from this specimen was utilized for DNA extraction, which was subsequently used for whole-genome sequencing, including short-read, long-read, and Hi-C sequencing. All experiments were conducted following the guidelines set forth by the Ethics Committee of the Pearl River Fisheries Research Institute, Chinese Academy of Fishery Sciences. Genomic DNA was extracted from the muscle tissue using a Qiagen DNeasy Blood and Tissue Kit (Qiagen, USA) according to the manufacturer’s instructions. The quality and concentration of the extracted DNA were evaluated using a NanoDrop One spectrophotometer (Thermo Scientific, USA) and 1% agarose gel electrophoresis.

Genome sequencing

The extracted DNA was randomly sheared into approximately 350 bp fragments, and a short-read library was constructed using the MGIEasy Universal DNA Library Prep Set (MGI, China). Sequencing was performed on the MGISEQ T7 platform (MGI, China), generating a total of 73.91 Gb of paired-end raw reads, each 150 bp in length (Table 1). For PacBio sequencing, a SMRTbell Express Template Prep Kit 2.0 (Pacific Biosciences, USA) was employed according to PacBio’s standard protocol for HiFi sequencing, and a long-read library was constructed using the PacBio Sequel II system. This process yielded 33.47 Gb of raw PacBio continuous long reads (HiFi) with an average length of 17.13 kb (Table 1). For Hi-C sequencing, approximately 1 g of muscle tissue from the male mud carp was dissected and processed with the GrandOmics Hi-C kit (DpnII restriction enzyme; GrandOmics, China) following the manufacturer’s protocol. The Hi-C library was sequenced on the MGISEQ T7 platform (MGI, China), producing a total of 112.00 Gb of Hi-C read data (Table 1).

Table 1 Statistics of sequencing data.

RNA extraction and transcriptome sequencing

To facilitate genome annotation, total RNA was extracted from various tissues, including the spleen, kidney, brain, muscle, ovary, and liver. RNA quality was assessed using a NanoDrop 2000 spectrophotometer (Thermo Fisher Scientific, USA) and an Agilent 2100 Bioanalyzer (Agilent Technologies, USA). A mixed RNA sample was then used to construct a cDNA library with the MGIEasy Universal DNA Library Prep Set (MGI, China) following the manufacturer’s protocol. This library was sequenced on the MGISEQ T7 platform (MGI, China) with a paired-end 150 bp layout, resulting in 49.63 Gb of transcriptome data to support genome and gene annotations (Table 1).

Genome size and heterozygosity estimation

To estimate the genome size of the mud carp, a k-mer analysis was performed using MGI clean reads. Initially, Jellyfish (v2.3.0)27 was used to calculate the frequency of 17-mers and generate a k-mer frequency table. Subsequently, GenomeScope (v2.0)28 was used to analyse the 17-mer frequency table, resulting in an estimated genome size of 884,402,008 bp, with 0.461% heterozygosity and 46.4% unique sequences (Fig. 1).

Fig. 1
figure 1

The 17-mer frequency distribution analysis chart for the Cirrhinus molitorella genome.

Genome assembly

Genome assembly was performed using the default parameters of Hifiasm (v0.16.1)29. HiFi long reads were used as input for Hifiasm to construct primary assembly contig graphs. Overlaps were precomputed and then recalculated from the corrected reads. Haploid duplications were removed using Hifiasm, followed by three rounds of error correction. The assembly yielded 69 contigs with a total length of 1033.41 Mb. The maximum contig size was 54.85 Mb, with an N50 of 33.29 Mb (Table 2). Scaffolding was performed using Juicer (v1.6)30 combined with 3D-DNA (v201008)31. Initially, BWA (v0.7.17)32 was used to index the contig-level genome, followed by Juicer to create restriction enzyme cutting sites. Juicer was used to map clean Hi-C (paired-end) reads to the contigs, and Hi-C-assisted initial chromosome assembly was conducted using the 3D-DNA algorithm following standard procedures. Chromosome boundaries were adjusted, and scaffolds corrected using the manually operated Juicerbox (v1.11.08)33 module, resulting in the resolution of 25 chromosomes (Figs. 2, 3). The modified file output by Juicebox was revised and used as input for 3D-DNA to perform re-scaffolding on a per-chromosome basis. The final assembly consisted of 42 scaffolds, with a maximum scaffold size of 67.41 Mb and an N50 size of 39.86 Mb (Tables 2, 3).

Table 2 Summary statistics of Cirrhinus molitorella genome assembly.
Fig. 2
figure 2

Hi-C contact map produced by 3D-DNA.

Fig. 3
figure 3

Features of the Cirrhinus molitorella genome. From outside to inside: (I) the 25 pseudo-chromosomes, (II) gene density, (III) repeats, (IV) GC content, (V) variant density and (VI) collinear regions (each line connects paralogous genes on diferent chromosomes).

Table 3 Pseudo-chromosome length statistics after Hi-C assisted assembly.

Repeat annotation

In recognition of the significance of tandem repeats, we utilized two software tools, GMATA (v2.2.1)34 and Tandem Repeats Finder (TRF, v4.10.0)35, to conduct a genome-wide search for tandem repeat sequences with default parameters. GMATA primarily identifies SSRs with shorter repeat units, while TRF explores tandem repeats encompassing all types of repeat units. The results indicate that SSRs constitute 0.19% of the total genome length, while tandem repeat sequences account for1.06% of the genome length. Subsequently, we scrutinized the dispersed repetitive sequences. First, we used MITE-hunter36 to identify miniature inverted-repeat transposable elements (MITEs) within the genome, constructing a MITE library file. Subsequently, a hard-masking operation was executed on the genome, marking repeated sequences as ‘N’, and RepeatModeler (v2.05)37 was then used to perform a de novo search for additional repeated sequences, generating a de novo library file (RepMod.lib). Considering that RepMod.lib contains many Unknown repeated sequences, TEclass38 was used for classification. Finally, the MITE.lib, RepMod.lib and Repbase (v19.06)39 libraries were combined to create a comprehensive library file. This combined library file was then employed with RepeatMasker (v4.1.6)40 to search for repeated sequences throughout the entire genome. The results show that dispersed repetitive sequences make up 45.18% of the genome length (Table 4). Among transposable elements (TEs), DNA elements are the most prevalent, comprising 29.37% of the genome, followed by long terminal repeat (LTR) retrotransposons at 5.55%, long interspersed nuclear elements (LINEs) at 4.34%, and short interspersed nuclear elements (SINEs) at 0.55%. In total, 539,498,149 base pairs of repetitive sequences were identified, representing 52.21% of the entire genome (Table 4).

Table 4 Repetitive sequences in the genome of Cirrhinus molitorella.

Gene prediction and function assignment

Gene structure prediction was conducted through three distinct methodologies: homology-based, transcriptome-based, and ab initio annotations. For homology-based prediction, we employed GEMOMA (v1.6.1)41 to compare homologous proteins from five related species (Danio rerio, Carassius auratus, Cyprinus carpio, Megalobrama amblycephala, Sinocyclocheilus grahami) with our assembled genome. Transcriptome sequence annotation was conducted using PASA (v2.3.3)42, which provided gene information for semi-supervised self-training of gene models via GeneMark-ST (v5.1)43. The predicted genes were then compared with the SwissProt database44 using BLASTP, with alignments filtered for identity ( ≥ 95%). We selected the top 3,000 genes with the highest alignment scores from GeneMark-ST as the training set for AUGUSTUS model training. AUGUSTUS (v3.5.0)45 was subsequently employed to predict genes within the genome using the trained model. The gene prediction results from ab initio, homology-based, and transcriptome-based methods were converted into a file format compatible with EVM (v2.1.0)42. These files were integrated using EVM with default parameters, producing an initial non-redundant gene set. Our predictions identified a total of 25,865 genes in the genome, with an average gene length of 20,822.68 bp, an average coding sequence length of 1,675.78 bp, and an average of 10.08 exons per gene (Table 5, Fig. 4).

Table 5 Gene structures and function annotation.
Fig. 4
figure 4

Venn diagram of function annotations from various databases.

Data Records

The raw sequencing reads of all libraries have been deposited into NCBI SRA database via the accession number PRJNA100116446. The assembled genome has been deposited at Genbank under the accession number GCA_033026305.147. Moreover, data of the genome annotations, predicted coding sequences and protein sequences are available at Figshare48.

Technical Validation

Assessment of genome assembly

The accuracy of the mud carp genome assembly was evaluated by assessing its completeness using the conserved metazoan gene set ‘actinopterygii_odb10’ from BUSCO (v5.4.3)49. The analysis demonstrated high completeness, with an overall completeness of 98.1%. Specifically, 96.8% of the genes were complete and single-copy, 1.3% were complete and duplicated, 0.9% were fragmented, and 1.0% were missing. These findings indicate the high quality of the mud carp genome assembly (Table 6).

Table 6 BUSCO analysis of the genome assembly and genes.

Gene annotation validation

To evaluate the integrity of the annotated gene set, we conducted BUSCO analysis using conserved single-copy homologous genes from the ‘actinopterygii_odb10’ library. The results revealed that approximately 96.54% of the complete gene elements are present in the annotated gene set, indicating a high level of completeness in the conserved gene predictions. Specifically, 95.08% of the genes were complete and single-copy BUSCOs, with only 0.47% fragmented and 2.99% missing from the assembly (Table 7). These findings highlight the exceptional integrity and conservation of gene content in the dace genome assembly, leading to highly confident prediction outcomes.

Table 7 BUSCO analysis of the genome annotation and genes.