Background & Summary

The ratmouth barbel (Ptychidio jordani Myers, 1930) is a freshwater ray-finned fish in the Cyprinidae family, native to the Pearl River drainage basin in China1. It prefers fast-flowing, clear streams and rivers with gravel substrates, occupying a bottom-dwelling ecological niche and primarily feeding on mollusks and algae2. It is known to reach a maximum length of 37 cm and is easily identified by its notable barbels, which are critical for its foraging behavior3. Due to its high muscle fat content, tender and delicate flesh, and excellent taste, P. jordani has high economic value and is considered one of the four most renowned fish in the Pearl River region. Historically an important species for local fisheries, P. jordani has experienced a drastic population decline due to overfishing and habitat disruption, primarily caused by dam construction4,5. In 2011, the IUCN classified P. jordani as critically endangered6. Subsequently, in 2016, P. jordani (wild populations) was recorded in the “Red List of China’s Vertebrates” and classified as a critically endangered species in China7. Additionally, P. jordani presents significant challenges for artificial breeding, as it has proven difficult to reproduce in controlled environments8. Given the sharp decline in wild populations and the difficulties in achieving artificial reproduction, urgent conservation measures are needed for P. jordani.

However, developing effective conservation strategies is challenging due to the current lack of detailed research on the biology and ecological characteristics of P. jordani. Gaining insight into the genetic composition of P. jordani is vital for identifying the evolutionary adaptations that have enabled its persistence in specific freshwater environments. While the mitochondrial DNA and microsatellite marker studies have provided preliminary insights into its genetic diversity and population structure8,9,10, the lack of a chromosome-level genome sequence significantly limits the depth of current research. A chromosome-level genome sequence provides a high-resolution resource to uncover key genetic and evolutionary traits, including population-specific adaptations, genetic diversity, and sex determination mechanisms11,12,13. This information is fundamental for designing targeted conservation strategies, such as identifying at-risk populations, preserving genetic diversity, and implementing effective restocking and habitat restoration programs. Furthermore, it can facilitate advancements in artificial breeding techniques by identifying genes linked to reproduction, growth, and adaptive traits, thereby enhancing breeding efficiency and promoting aquaculture sustainability.

In this study, we present the chromosome-level genome assembly of P. jordani (Fig. 1), utilizing multiple sequencing technologies, including next-generation short-read sequencing and third-generation long-read sequencing, supplemented by high-throughput chromosome conformation capture (Hi-C) techniques. Ultimately, we sequenced and assembled a high-quality chromosome-level genome of P. jordani, with the final genome size of 1.14 Gb, comprising 25 chromosomes with a contig N50 of 25.14 Mb and a scaffold N50 of 42.91 Mb. We also evaluated the methylation level of the assembled genome and identified the distribution of methylation loci across the chromosomes. Additionally, we conducted transcriptomic analyses across muscle, caudal fin, liver, gill, heart, spleen, and kidney. These analyses provided a deep understanding of gene expression patterns in P. jordani. Thus, we constructed a detailed and comprehensive genomic map of P. jordani, facilitating the identification of critical genes and regulatory elements related to the species’ growth, reproduction, and adaptation to the environmental conditions. Our research substantially enriches the genomic resources available for P. jordani, providing valuable data for comparative genomic studies within the Cyprinidae family and supporting both conservation efforts and the development of artificial breeding techniques.

Fig. 1
figure 1

The genome structure of Ptychidio jordani. From outer to inner circles: (a) chromosomes (Chr1-Chr25), (b) gene density, (c) repeats coverage, (d) distribution of non-coding RNAs, including rRNA (black), snRNA (red), tRNA (green), miRNA (blue), and (e) GC ratio.

Methods

Ethics statement

This study was performed in compliance with ethical standards and approved by the Laboratory Animal Ethics Committee of the Pearl River Fisheries Research Institute under License No. LAEC-PRFRI-2024-04-01.

Sample collection

A cultured specimen of P. jordani was collected from Guangdong Hanyu Ecological Technology Co., LTD, Guangzhou city, Guangdong Province, China (113°30′45″N, 22°55′54″E). Following lethal anesthesia of the specimen, multiple tissue samples, including muscle, caudal fin, liver, gill, heart, spleen, and kidney, were collected, frozen in liquid nitrogen, and stored at −80 °C for subsequent DNA and RNA extraction.

Genomic DNA extraction, species identification and sequencing

Muscle tissue was utilized for next-generation sequencing, PacBio HiFi sequencing, and Hi-C sequencing. Genomic DNA was extracted using the DNeasy Blood and Tissue Kit (Qiagen, Germany) following the manufacturer’s instructions. The concentration and purity of the DNA were measured using a NanoDrop 2000 spectrophotometer (NanoDrop Technologies, USA).

Due to the ambiguous morphology of the Ptychidio genus, we initially identified the species using the COI mitochondrial marker. COI primers and PCR protocols were adopted from14. We then blasted the target COI sequence against the other species in the genus Ptychidio and conducted a phylogenetic tree using all available COI sequences in the CoSFISH website and its online analysis tools15. The COI phylogenetic result confirmed that our specimen is P. jordani (Figure S1).

For the genome survey, next-generation sequencing was conducted using a DNA library prepared with the MGIEasy Universal DNA Preparation Reagent. The library was constructed through a series of steps, including fragmentation, end-repair, adapter ligation, and subsequent amplification. Sequencing was performed on the BGISEQ DNBSEQ-T7 platform, producing 150 bp paired-end reads with an insert size of 350 bp. SOAPnuke (version 2.1.0)16 was used to filter the raw reads, removing low-quality sequences and adapter contaminants, resulting in 55.57 Gb of clean data.

The PacBio HiFi library was prepared using SMRTbell prep kit 3.0 (PacBio 102-182-700) following the manufacturer’s protocols. High-quality genomic DNA was sheared to a target size range of 15–18 kb using the Megaruptor 3 system (Diagenode), followed by size selection with the AMPure PB bead kit (PacBio 102-182-500). The enriched fragments were then end-repaired, A-tailed, and ligated to SMRTbell adapters. The prepared library was then sequenced on the PacBio Revio platform, generating 42.95 Gb of HiFi reads.

The Hi-C library was constructed according to the Hi-C library protocol by Rao et al.17, utilizing frasergen Hi-C kit. Sequencing was carried out on the BGISEQ DNBSEQ-T7 platform, resulting in 123.62 Gb of raw reads. The sequencing depth for Hi-C was 108.6X, ensuring comprehensive coverage for chromosomal interaction mapping. After trimming with Trimmomatic (version 0.40)18, we obtained 110.55 Gb of polished Hi-C reads.

Transcriptome sequencing

Tissue samples of muscle, caudal fin, liver, gill, heart, spleen, and kidney were used for RNA-seq analysis. RNA was extracted from each tissue sample using the TRIzol reagent (Invitrogen, MA, USA) according to the manufacturer’s instructions. The concentration and purity of the extracted RNA were measured using Nanodrop 2000 spectrophotometer (NanoDrop Technologies, USA). RNA integrity was evaluated through agarose gel electrophoresis and Agilent 2100 bioanalyzer (Agilent Technologies, Palo Alto, CA). Subsequently, RNA-seq libraries were individually constructed for each tissue sample using the qualified RNA. The mRNA was purified from the total RNA using poly‐T oligo‐attached magnetic beads. Sequencing libraries were then generated from the purified mRNA using the VAHTS Universal V6 RNA-seq Library Kit for MGI (Vazyme, Nanjing, China) following the manufacturer’s recommendations. Finally, the libraries were sequenced on the BGISEQ DNBSEQ-T7 platform, producing 150 bp paired-end reads for each tissue sample. Detailed RNA-seq data information for each tissue sample is provided in Table 1.

Table 1 Sequencing strategies employed in this study.

Genome size estimation

Based on next-generation short reads, the k-mer frequency with k = 17 was used to estimate the genome size, heterozygosity, and repeat content of P. jordani using GCE software (version 1.0.2)19 (Figure S2). The analysis revealed that the estimated genome size was approximately 1.08 Gb, with a heterozygosity rate of 0.43% and a repeat content of 57.88%.

Genome assembly

The draft genome was assembled using PacBio Hi-Fi reads processed with Hifiasm (version 0.19.5)20 under default parameters, resulting in 251 contigs with a contig N50 of 25.14 Mb and an estimated genome size of approximately 1.15 Gb. Polished Hi-C reads were then mapped to the draft contig-level genome assembly using Juicer (version 1.6)21 to generate interaction matrices. These matrices enabled the anchoring and scaffolding of 100 contigs, accounting for 97.48% of the total sequences, into 25 chromosomes using 3D-DNA (version 180922)22 under default parameters (Fig. 2). The remaining 132 unanchored contigs were analyzed using BLAST (version 2.11.0+)23 against the NCBI NT (Nucleotide Sequence Database) to identify and remove potential contaminants, including bacterial and organelle-derived sequences. The final genome assembly consisted of 232 contigs, with a contig N50 of 25.14 Mb, a scaffold N50 of 42.91 Mb, and a genome size of 1.14 Gb. The number of chromosomes is consistent with karyotypic findings from a previous study24. Detailed chromosome information is provided in Table S1. The summarized assembly information is provided in Table 2.

Fig. 2
figure 2

Hi-C interaction heatmap of the P. jordani genome. The x- and y-axes represent genomic positions (in Mb), with darker red indicating stronger chromosomal interactions. The color scale (right) reflects interaction frequencies.

Table 2 Assembly statistics of the genome of P. jordani.

Repeat identification

Identification of repetitive sequences was performed using a combination of de novo and homology-based approaches. RepeatModeler (version 2.0.1, default parameters) and LTR-FINDER (version 1.0.7, parameter -w 2 -C)25 were utilized to construct a de novo library of repetitive sequences. This library was subsequently employed to identify repetitive regions using RepeatMasker (version 4.1.2)26 with the parameter -nolow -no_is -norna -parallel 2. Additionally, tandem repeats were detected using the Tandem Repeats Finder (Trf) software (version 4.09.1)27 with the parameter 2 7 7 80 10 50 2000 -d -h. Homologous repeats were identified by aligning the assembled genome sequences against the RepBase database (http://www.girinst.org/repbase) using RepeatMasker (version 4.1.2) and RepeatProteinMask (version 1.36)26. By merging the results from both methods, we identified a total of 751.75 Mb repeats, accounting for 66.04% of the genome (Table 3). Among these repetitive sequences, transposable elements (TEs) accounted for 56% of the genome. Among TEs, the most abundant type was DNA transposons (Table 4).

Table 3 Identification of repeats in the genome of P. jordani.
Table 4 Information on different types of transposable elements.

Gene prediction and functional annotation

Gene prediction was conducted using a combination of ab initio prediction, homology-based annotation, and transcriptome-based prediction. The ab initio prediction was performed using Augustus (version 3.4.0)28 with the parameter –noInFrameStop = true -strand = both and Genscan (version 1.0)29 with default parameters. For homology searching, the protein sequences of the four phylogenetically closely related species, including Cirrhinus molitorella, Onychostoma macrolepis30, Labeo rohita31 and Puntigrus tetrazona12, were downloaded from NCBI and aligned against the assembled genome using BLAST (version 2.11.0+)23 with an e-value ≤ 1e–05. Subsequently, Exonerate (version 2.4.0)32 with the parameter –model protein2genome was used to predict gene structures. For transcriptome-based prediction, we first used SOAPnuke (version 2.1.0)16 to filter the raw RNA-seq data, with parameters -lowQual = 20, -nRate = 0.005 and -qualRate = 0.5. The clean RNA-seq reads, combined from all tissue samples, were assembled using Trinity (version 2.8.5)33 and then aligned to the assembled genome using HISAT2 (version 2.2.1)34. The aligned reads were subsequently assembled into transcripts with StringTie (version 2.1.7)35. The non-redundant reference gene set was compiled by integrating gene predictions from three different methods using MAKER (version 3.01.03)36 and subsequently annotated with PASA (version 2.4.1)37. Finally, a total of 25,183 protein-coding genes were identified (Table 5). This gene set was functionally annotated by aligning it against the InterPro38, SwissProt39, TrEMBL39, NR (ftp://ftp.ncbi.nlm.nih.gov/blast/db/FASTA/nr.gz), GO40, and KEGG41 databases using Diamond (version 2.0.7)42. Consequently, 98.28% (24,750) of the predicted protein-coding genes obtained functional annotations (Table 6).

Table 5 Gene predictions of the genome of P. jordani.
Table 6 Summary of the functionally annotated genes.

Non-coding RNA annotation

We also annotated the non-coding RNAs using various tools: tRNAs were identified with tRNAscan-SE (version 2.0.9)43, rRNAs were predicted using RNAmmer (version 1.2)44 with the parameter -S euk -multi -m lsu,ssu,tsu. Additionally, miRNAs and snRNAs were detected with infernal (version 1.1.4)45 as implemented in the Rfam database46. Finally, 19,373 ncRNAs were annotated, including 458 miRNAs, 9,692 tRNAs, 8,533 rRNAs and 690 snRNAs (Table 7).

Table 7 Non-coding RNA annotation of the genome of P. jordani.

Methylation profiling

HiFi reads were aligned to the assembled genome using pbmm2 (version 1.13.1) (https://github.com/PacificBiosciences/pbmm2), an implementation within minimap2 (version 2.21)47, with the parameters–preset HIFI–sort -j 16 -J 16. The resulting sorted BAM file was then used as input for pb-CpG-tools (version 2.2.0-x86_64) (https://github.com/PacificBiosciences/pb-CpG-tools/) to determine the site-level modification proportions, producing probabilistic estimates of CpG methylation patterns. Based on the coverage of C sites, we used loci with sequencing depth greater than 5 to calculate DNA methylation levels and to count the number of methylation loci in different methylation regions. Ultimately, the overall 5mC methylation proportion of the sample was 79.41%. The distribution of methylation loci was analyzed at the assembled chromosome level. Chromosomes were divided into bins, and the average methylation loci per bin was calculated using a 100 kb window (Fig. 3). The majority of chromosomes had methylation loci ranging from approximately 1,000 to 3,000 per bin. Outliers were observed across all chromosomes, with some bins showing exceptionally high numbers of methylation loci, reaching up to 8,000, while the lowest values were close to zero. These outliers may highlight potential hotspots or areas of differential regulation within the genome.

Fig. 3
figure 3

The distribution of methylation loci across chromosomes in the genome of P. jordani.

Gene expression analysis

Gene expression levels were quantified using the fragments per kilobase of transcript per million mapped reads (FPKM) method with RSEM (version 1.3.3)48 under default parameters. Differentially expressed genes (DEGs) were identified with a false discovery rate (FDR) <0.05 and a log2 fold change (log2FC) >1 or <−1 using edgeR (version 3.22.5)49. The differentially expressed genes (DEGs) among all tissue samples were displayed in Fig. 4. Hierarchical clustering was performed on both genes and tissues to highlight similarities and differences in gene expression profiles. In the comparative analysis with muscle tissue, 12,135 DEGs were identified in the caudal fin, 11,465 in the liver, 14,438 in the gill, 12,413 in the heart, 8,301 in the spleen, and 3,578 in the kidney. Detailed information on the up-regulated and down-regulated DEGs between each pair of tissues is provided in Table S2.

Fig. 4
figure 4

Heatmap of differentially expressed genes (DEGs) across different tissues. The color key indicates the expression levels, with blue representing lower expression and red representing higher expression. The colored bars on the vertical axis represent hierarchical clustering of DEGs, with each color corresponding to a distinct gene cluster based on similar expression patterns across tissues.

Data records

All raw sequencing data from this study have been deposited in the NCBI SRA database. Specifically, next-generation short reads (SRR29932654)50, PacBio third-generation reads (SRR29932653)51, Hi-C reads (SRR29932652)52, and RNA-seq reads (SRR29932645-SRR29932651)53,54,55,56,57,58,59 are available under BioProject accession number PRJNA1137526. The final chromosome-level assembled genome file is accessible in the GenBank database under accession number JBGFUG00000000060, and the genome annotation files are available on Figshare61. The gene expression matrix for all tissues has been deposited in the GEO database under accession number GSE28230662, and the methylation profiling data are available under accession number GSE28231163.

Technical Validation

The completeness of the genome assembly was evaluated using BUSCO (version 5.2.2)64, which indicated a high level of completeness with a score of 97.3%. Additionally, we mapped the next-generation reads to the assembled genome using BWA (version 0.7.17)65 with the parameter mem -t 50, achieving a mapping rate of 99.47% and an overall coverage of 99.75%. Similarly, for the HiFi reads, mapped using minimap2 (version 2.21)47 with the parameter -ax map-hifi, the mapping rate was 99.93% with an overall coverage of 100%. We also evaluated the assembly using Merqury (version 1.3)66 to assess the consistency of the genome. Merqury k-mer-based analysis showed a QV (Quality Value) score of 45.40 (Table 2). To evaluate gene prediction accuracy and reliability, we examined the distribution of exon number, intron number, and GC content in the genes and coding sequences (CDS) of P. jordani and four phylogenetically related species (C. molitorella, O. macrolepis, L. rohita and P. tetrazona). The consistent distribution patterns observed among these species further validated the annotated gene dataset for P. jordani (Fig. 5). All the aforementioned results indicate that the assembled genome is complete and highly reliable. The quality of RNA-seq clean reads across all tissues was assessed using SOAPnuke (version 2.1.0)16, yielding Q20 and Q30 values ranging from 98.70% to 99.99% and 89.32% to 97.52%, respectively (Table S3). These results underscore the high reliability and robustness of the RNA-seq data.

Fig. 5
figure 5

Comparative analysis of genomic elements, including exon number, intro number, GC rate of genes and coding sequences (CDS), among P. jordani and the other four phylogenetically closely related species.