Background & Summary

Triplophysa bombifrons (Herzenstein1), a species of bony fish within the Nemacheilidae family, was first identified by Russian scientist Herzenstein in 1888 and originally named Nemachilus bombifrons Herzenstein1. This fish has a limited distribution in China, predominantly inhabiting the Tarim and Aksu Rivers2. Over recent decades, the wild population of T. bombifrons has experienced significant declines due to environmental disturbances such as pollution and habitat destruction. As a result, it has been classified as critically endangered in the Red List of China’s Vertebrates (2016)3. Belonging to the genus Triplophysa, which is part of the Cobitoidea superfamily and includes over 141 species, T. bombifrons is a key component of the ichthyofauna on the Qinghai-Tibetan Plateau (QTP). Despite its significance, the phylogenetic position of Triplophysa remains unresolved due to limited molecular data. To date, 28 complete mitochondrial genomes (mitogenomes) of Triplophysa species have been documented, using mitochondrial genes or complete mitogenome data to explore phylogenetic relationships within the genus2,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19. However, the specific phylogenetic links between T. bombifrons and other species within Triplophysa are still unclear20,21. Although sequencing technologies have advanced significantly and reduced sequencing costs, only five chromosome-level genomes from this genus have been sequenced: T. tibetana, T. rosa, T. bleekeri, T. dalaica, and T. siluroides. Unfortunately, T. bombifrons still lacks a reference genome, limiting comprehensive genetic and evolutionary studies22,23,24,25,26.

Sequencing the genome of T. bombifrons could provide essential data for analyzing its genetic diversity and population history. Such genomic information is crucial for understanding the genetic adaptations and molecular mechanisms behind the species’ endangerment, aiding in the development of more effective management and conservation strategies.

In this research, we successfully compiled and annotated a chromosome-level genome of T. bombifrons using PacBio single-molecular DNA sequencing complemented by high-throughput chromatin conformation capture (Hi-C) technology. The chromosome-level genome assembly of T. bombifrons measured 655.95 Mb, with a contig N50 length of 14.6 Mb and a scaffold N50 of 24.25 Mb. Approximately 628.24 Mb of sequences were anchored onto 25 chromosomes, covering 99.38% of the total assembly. A chromosome-scale genome assembly of of T. bombifrons was 628.24 Mb with the scaffold N50 length of 24.25 Mb. The sequence continuity of T. bombifrons is confirmed through the high N50 lengths, which align with those found in similar species. Our analysis revealed that the genome is notably complete, as indicated by the presence of over 97.4% of BUSCO genes. We identified 34,211 protein-coding genes in the T. bombifrons genome, of which 99% were successfully annotated against public protein databases. This high-quality genomic resource will support future studies in comparative genomics, phylogenomics, and evolutionary research within the Nemacheilidae family. The research establishes a high-quality genome resource that will underpin future investigations into the population dynamics, conservation strategies, and environmental adaptations of fish in extreme habitats.

Methods

Experimental sampling

A female T. bombifrons was captured using nets in the 2nd Yurungkax River, located at coordinates 37°6′39.6” E, 79°54′46.8” N, in the Hotan district of the Xinjiang Uygur Autonomous Region, China. The specimen was preserved and catalogued at Tarim University under the accession number GYQ2022030001, and is under the care of Xinyue Wang (Fig. 1). The species and sex were confirmed by Professors Adark and YongSong through dissection and examination of the gonads. Pectoral fin samples were collected, preserved in 75% ethanol, and stored at -80°C for subsequent DNA/RNA extraction. All procedures adhered to the relevant regulations on animal care and use.

Fig. 1
figure 1

Picture of T. bombifrons. Note: Photographs of vouchers was taken by Shengao Chen.

Library construction, and sequencing

Sequencing of the T. bombifrons genome was conducted using multiple platforms. For short-read sequencing, the sample was randomly fragmented into approximately 350 bp DNA fragments using an ultrasonic processor (Covaris S220; Woburn, MA, USA). The fragments were then prepared by performing end repair, adding a 3′ A tail, and ligating adapters. The resulting library was sequenced on a MGIDNBSEQ T7 platform (BENAGEN Company, Wuhan, China). Raw short reads were filtered using Fastp to eliminate adapters and low-quality reads, yielding a total of 85.1 Gb of clean data for T. bombifrons, providing an approximate coverage of 135 × for the T. bombifrons genome.

To construct the PacBio HiFi library, the DNA template was sheared to an average size of 20 kb using a g-TUBE (Covaris, Inc., MA, USA), and target DNA fragments were recovered through size selection using the BluePippin system (Sage Science, Inc., MA, USA). The SMRTbell library was prepared with the SMRTbell Express Template Prep kit 2.0 (Pacific Biosciences, California, USA) following the manufacturer’s instructions. Sequencing was performed on the PacBio Sequel II platform (Pacific Biosciences, Menlo Park, USA). We confirm that the genome sequencing was conducted in CCS (HiFi) mode, generating Circular Consensus Sequencing (CCS) reads (also referred to as HiFi reads). These HiFi reads were produced using the CCS software (https://github.com/pacificbiosciences/unanimity) with the parameter ‘-minPasses 3’, ensuring highly accurate consensus sequences. This process yielded approximately 22.47 Gb of data, with the average and N50 lengths of these reads being 14.404 kb and 49.465 kb, respectively, achieving approximately 30x coverage of the T. bombifrons genome (Table 1).

Table 1 Sequencing data for the T. bombifrons genome assembly.

Fresh musscle tissue of T. bombifrons was used to construct a library for Hi-C analysis. The tissue was first cross-linked with formaldehyde, followed by cell lysis using Nuclear Isolation Buffer. Chromatin DNA was then digested with the restriction enzyme MboI, creating sticky ends at the cleavage sites. These sticky ends were biotinylated and proximity-ligated to form chimeric junctions, which were enriched. The DNA was purified, and impurities were removed before being randomly fragmented into 300–500 bp pieces for library construction. The purified DNA underwent blunt-end repair, A-tailing, and adaptor ligation, followed by biotin-streptavidin pull-down and PCR amplification. The Hi-C libraries were quantified and sequenced on the Illumina NovaSeq platform (Illumina, San Diego, CA, USA). The resulting 113.84 Gb of raw data provided 166 × coverage of the genome (Table 1).

RNA sequencing

RNA samples were extracted from muscle, liver, and brain tissues using standard Trizol reagent (Invitrogen, CA, USA) and then combined in equal proportions for sequencing. The purity and integrity of the RNA were assessed using a NanoDrop 2000 spectrophotometer (NanoDrop Technologies, Wilmington, DE, USA) and an Agilent 2100 Bioanalyzer (Agilent Technologies, CA, USA). RNA contamination was also checked using 1.5% agarose gel electrophoresis. Full-length cDNA was synthesized using the Clontech SMARTer PCR cDNA Synthesis Kit (Takara Biotechnology, China), followed by SMRTbell library construction with the Pacific Biosciences SMRTbell Template Prep Kit (Pacific Biosciences, USA). The transcriptome was then sequenced using the PacBio Iso-Seq protocol in CCS mode on the PacBio Sequel II platform by Frasergen Bioinformatics Co., Ltd. (Wuhan, China). Importantly, no Illumina-based RNA sequencing was performed in this study. After removing adaptors from the polymerase reads, a total of 47.19 Gb of Iso-Seq subreads were obtained (Table 1).

Quality control of sequencing data For the PacBio sequencing data (including both genome assembly HiFi CCS reads and Iso-Seq RNA-seq reads), any reads with a read quality value (RQ-value) below 0.99 were discarded to ensure high accuracy. For the DNA paired-end (PE) sequencing data, which included only genomic short-insert and Hi-C reads, raw data were filtered using Fastp (version 0.23.2)27 to remove adapter sequences and low-quality reads (those containing more than 10% unknown bases or more than 40 low-quality bases). No RNA-seq data were generated using this DNA PE sequencing approach. All remaining high-quality sequencing data from each platform were retained for subsequent analyses.

Genome estimate and assembly

The estimated sizes of the predicted genomes were determined in silico using JELLYFISH (version 2.2.3) and Genomescope (http://qb.cshl.edu/genomescope/), which analyzed the frequency distribution of 27-mers28,29. K-mer analysis suggests that the T. bombifrons genome spans approximately 510.674 Mb, with 217 Mb consisting of repeated sequences and a heterozygosity rate estimated at 0.849% (Fig. 2).

Fig. 2
figure 2

Genome size estimates for T. bombifrons using Kmer-based method.

To create a high-quality genome assembly for T. bombifrons, we employed a hierarchical strategy utilizing various sequencing data types—PacBio long reads, genomic short-insert reads, and Hi-C reads. Initially, PacBio HiFi reads were assembled into contigs using Hifiasm (version 0.17.4) with default settings30. Subsequently, high-accuracy short-read Illumina data were aligned to these contigs using BWA (version 0.7.12), and the draft assembly was refined using Pilon (version 1.21)31,32. Following this, valid Hi-C paired-end reads were processed and a contact matrix with 100 kb resolution was generated using HiC-Pro33. This tool also filtered out low-quality, unmapped, and invalid paired reads, and constructed a contact matrix based on interaction frequencies. After filtering, uniquely mapped sequences from valid paired-end reads were used for additional assembly steps. The genome was then assembled using the Hi-C data with Lachesis, which managed clustering, ordering, and orientation of sequences. A heat map was employed to evaluate the assembly quality. Using PacBio long reads, the T. bombifrons genome was assembled into a structure measuring 655.95 Mb, comprising 609 contigs, with an N50 length of 14.6 Mb. The longest contig measured 26.65 Mb (Table 2).

Table 2 Assembly and annotation statistics of the T. bombifrons genome.

Hi-C technology utilizes interaction information between different chromosomal regions, predicated on the notion that interactions are more frequent among proximal regions than distant ones. In this study, we generated 113.84 Gb of sequencing data through Hi-C library sequencing. Utilizing this interaction data, we achieved a chromosome assembly totaling 656.20 Gb with a scaffold N50 length of 24.25 Mb. Remarkably, over 628 Mb of sequences were successfully anchored across 25 chromosomes, demonstrating a high chromosome anchoring rate of 99.38% at the base level (Fig. 3). The integrity of the assembled genome was assessed using BUSCO (v5.5.0) with the Vertebrata odb10 database34.

Fig. 3
figure 3

The Hi-C contact map of the T. bombifrons genome. chr 01–25 represented for the 25 pseudo-chromosomes. The color bar showed the contact density from white (low) to red (high).

Gene prediction and annotation

We employed two strategies—homology-based and de novo prediction—to identify repetitive content in the T. bombifrons genome. In the homology-based approach, known transposable elements (TEs) within the T. bombifrons genome were identified using RepeatMasker (version 4.0.7) and the Repbase TE library35,36. For the de novo prediction, we developed a repeat library specifically for the T. bombifrons genome using RepeatModeler (version 2.0.6) (http://www.repeatmasker.org/RepeatModeler/). This tool automatically engages two core de novo repeat-finding programs, RECON (version 1.08) and RepeatScout (version 1.0.5), to generate, refine, and classify consensus models of potential interspersed repeats37. Additionally, we conducted a de novo search for long terminal repeat (LTR) retrotransposons within the genome sequences using LTR_FINDER (version 1.0.7), LTR_harvest (version 1.5.11), and LTR_retriever (version 2.7)38,39,40. Tandem repeats were identified using the Tandem Repeat Finder (TRF) package, and simple sequence repeats (SSRs) were detected using MISA (version 1.0)41. Subsequently, we merged the library files from both methods and used RepeatMasker to catalog the repeat contents. Our annotation pipeline revealed that 228.3 Mb of the genome sequences, representing 34.60% of the genome, were annotated as repetitive elements in the T. bombifrons genome (Tab.4). Specifically, these included 2.63% DNA transposons (17.2 Mb), 3.19% long interspersed nuclear elements (LINEs) (20.931 Mb), and 5.58% long terminal repeats (LTRs) (36.63 Mb) (Table 3).

Table 3 Genome quality assessment statistics of the T. bombifrons genome.

The protein-coding genes of the T. bombifrons genome were predicted using three distinct methods: ab initio gene prediction, homology-based gene prediction, and RNA-Seq-guided gene prediction. Initially, the assembled T. bombifrons genome was processed for both hard and soft masking using RepeatMasker (version 4.1.6). For the ab initio gene prediction, we utilized Augustus (version 3.3.3), training models on a set of high-quality proteins derived from our RNA-Seq dataset42. Homology-based gene prediction was conducted using MAKER (version 2.31.10). This involved aligning protein sequences and transcript sequences to the genome assembly, and predicting coding genes with default parameters in MAKER43. For RNA-Seq-guided gene prediction, we first mapped clean RNA-Seq reads to the genome using Minimap2 (version 2.24). Gene structures were then deduced using Transdecoder (version 5.5) and MAKER244. Ultimately, the results from all three methods were consolidated into a comprehensive gene set using EVidenceModeler (version 1.1.1)45.

Functional annotations for the predicted protein-coding genes were determined by aligning their sequences to several public databases. These databases include Gene Ontology (http://geneontology.org/), the Integrated Resource of Protein Domains and Functional Sites (InterPro: https://www.ebi.ac.uk/interpro/), Kyoto Encyclopedia of Genes and Genomes (KEGG: https://www.kegg.jp/), Clusters of Orthologous Groups of Proteins (COG: https://www.ncbi.nlm.nih.gov/COG/), Swiss-Prot (www.uniprot.org), TrEMBL (www.uniprot.org), and the nonredundant proteins database (NR: https://ftp.ncbi.nlm.nih.gov/blast/db). Additionally, four types of non-coding RNAs—microRNAs, transfer RNAs (tRNA), ribosomal RNAs (rRNA), and small nuclear RNAs (snRNA)—were predicted within the T. bombifrons genome. These predictions were performed using tRNAscan-SE (version 1.3.1) for tRNAs, and Infernal (version 1.1.3) with the Rfam database for the remaining RNAs46,47.The efforts encompassed protein and non-coding gene prediction as well as functional annotation Table 4.

Table 4 Statistics of repetitive sequences in the T. bombifrons genome.

Using de novo, homology, and RNA-seq data methods, a total of 34,211 protein-coding genes were predicted in the T. bombifrons genome. Among these genes, approximately 99%, 95%, and 68% exhibited homologous sequences in the NCBI NR, TrEMBL, and Swiss-Prot databases, respectively. Furthermore, 70% of these genes were associated with Pfam domains, while 45%, 40%, and 50% were assigned Gene Ontology (GO), Kyoto Encyclopedia of Genes and Genomes (KEGG), and Clusters of Orthologous Groups (KOG) terms, respectively (Table 5). Overall, 34,151 of the protein-coding genes, representing 99% of the total, received functional annotations.

Table 5 Functional annotation of protein-coding genes for T. bombifrons.

Data Records

The DNA and RNA sequencing data were submitted to the NCBI Sequence Read Archive (SRA) database under the SRA IDs: SRR22343166,SRR22343167,SRR22343168 and SRR22343165, which is associated with the BioProject accession number PRJNA90322748. The assembled genome of Triplophysa bombifrons have been deposited at the NCBI GenBank (https://identifiers.org/ncbi/insdc.gca:GCA_029783895.1)49.The annotation results of repeated sequences, gene structure and functional prediction have been deposited at the Figshare database (https://doi.org/10.6084/m9.figshare.27054901)50.

Technical Validation

Evaluation of the genome assembly

To assess the integrity of the assembly, short reads were mapped to the genomes using BWA v0.7.17, giving a mapping rate of 96.6% and a genome coverage of 99.70%. The completeness and accuracy of the final genome assembly were checked by Benchmarking Universal Single-Copy Orthologs (BUSCO) v5.5.0 with vertebrata_odb10. The BUSCO analysis revealed that 97.4% of the complete genes were retrieved in the genome, with 95.8% being single-copy and 1.6% duplicated. Only 0.5% and 2.1% of BUSCO genes were fragmented and missing, respectively(Table 3).