Background & Summary

The short-faced mole (Scaptochirus moschatus) belonging to the genus Scaptochirus of the family Talpidae is endemic to China, mainly distributed in Hebei, Shandong, Inner Mongolia, Jilin, Shanxi, Gansu, etc1. It is similar to the Ussuri Mole (Mogera robusta), but it has a smaller body size. The short-faced moles are stout, with a short and sharp mouth. Their small eyes hidden in the fur. Their claws are flat, strong and sharp. Their whole body are covered with brown and metallic fine fur. The base of the fur is dark gray, and the top of the fur is dyed brown. The short-faced moles live underground for all the life. Their hearing and smell are very sensitive. They crawl fast in the ground and rarely climb out. They like to live in sandy areas with dry and loose soil and deep soil layer. The extensive ecotype and morphological features make the short-faced mole an interesting model for studying adaptive evolution2.

Previous studies on moles have focused on morphology2,3, taxonomy4, karyology5, phylogeny6 and gut microbiology1. The follow-up evolutionary biological and molecular ecological studies on the short-faced mole is constrained by the limitation of the reference genome. In this study, multiple sequencing technologies, including short-read sequencing (Illumina), long-read PacBio high-fidelity sequencing (PacBio), and proximity ligation-based chromatin capture sequencing (Hi-C) were integrated to construct a high-quality genome assembly of the short-faced mole. The Illumina data was used to survey the genome yielded a genome size of 2.17 Gb. A total of 2.25 Gb genome contigs was assembled from the HiFi data, with a contig N50 of 67.66 Mb (Table 1). The genome was annotated to the chromosome level by assisted assembly using Hi-C data combined with HiFi data. The total length of genome scaffolds is 2.25 Gb, with a N50 of 110.51 Mb (Table 1). The assembled genome was successfully anchored to 24 chromosomes with the anchor rate of 94.33%. 0.88 Gb of repetitive sequences were identified, representing for about 39.41% of the genome. 21,139 genes were predicted of which 94.6% were functionally annotated, with an average gene length of 30.01Kb. 8.58 exons per gene were predicted. The average lengths of the exons and introns are 0.17Kb and 3.76Kb, respectively. The annotated gene set were evaluated by using the single-copy direct orthologous gene library, and they yield 94.1% complete BUSCOs.

Table 1 Statistics of the assembled genome of S. moschatus.

Methods

Sample collection and ethics statement

A male short-faced mole was sampled from Liaocheng city, Shandong province, China, in May 13th, 2024. It was euthanized with ether. Samples of heart, liver, spleen, lungs, kidneys, muscles and blood were collected and subsequently stored at Qufu Normal University and kept at −80 °C before DNA and RNA extraction. All experiments were conducted in accordance with the Guidelines for the Care and Use of Laboratory Animals in China, and approved by the Biomedical Ethics Committee of Qufu Normal University with the registration number 2024118.

Genome sequencing

Blood samples were taken from the short-faced mole for Illumina and PacBio sequencing. Muscle tissues were used for Hi-C sequencing. Heart, liver, lung, spleen, kidney and muscle tissues were taken from the same mole for transcriptome sequencing. All sequencing analyses were performed by the Novogene Co., Ltd (Beijing, China). The QIAGEN AllPrep DNA/RNA Mini Kit (Qiagen, Germany) was used to extract Genomic DNA and total RNA. The quality of the DNA and RNA was checked by the NanoDrop 2000c spectrophotometer, Qubit 3.0 (Invitgen, USA) and the 2100 bioanalyzer (Agilent, USA).

Illumina sequencing library was generated using the Illumina PE Cluster Kit (Illumina, USA) according to the manufacturer’s instructions. DNA libraries were sequenced on Illumina Novaseq platform and 150 bp paired-end reads were generated. A total of 103.14 Gb short reads were derived from Illumina sequencing. Muscle samples were used for extracting genome DNA for Hi-C sequencing after cross-linked by 4% formaldehyde solution and marked with biotin-14-dCTP. The obtained genome DNA was randomly interrupted into fragments by Covaris crusher and the Hi-C libraries were constructed according the standard protocol described previously7. Total RNA extracted from multi-tissues was used to construct cDNA libraries by TruSeq RNA Sample Prep Kit v2 (Illumina, USA). Both the libraries for Hi-C and transcriptome sequencing were sequenced on Illumina Novaseq platform by using a paired-end strategy. A total of 202.85 Gb data were derived from Hi-C sequencing, and each sample obtained more than 6 G data from transcriptome sequencing. Fastp v0.20 was used to clean the raw data with default parameters8.

High quality DNA extracted from blood samples (primary band >30 kb) that passed the assay were selected to be randomly sheared into fragments (15–18 kb). Large fragments of DNA were enriched and purified using magnetic beads. Damage repair and end repair were performed on the fragmented DNA. Stem-loop sequencing junctions were attached to the ends of the DNA fragments, and exonucleases were used to remove fragments that failed to connect. The constructed library was sequenced by PacBio Revio/Sequel II/IIe platform followed the standard protocol (Pacific Biosciences, CA, USA), and it produced a total of 98.9 Gb HiFi long reads.

Genome survey and assembly

To estimate the genome size, the clean Illumina sequencing reads were used for k -mer analysis. Jellyfish v2.2.69 was used to calculate the optimal k-mer and GenomeScope v2.01210 was used to estimate the genome size for corresponding k-mers. Referring to the genome size of affiliate species in family Talpidae11, we selected k-mer = 17 for genome survey. The corresponding genome size of the short-faced mole is 2.17 Gb (Table S1, Fig. S1). The heterozygosity rate (obtained by calculating the proportion of heterozygous sites) in the surveyed genome is 0.34%.

The long reads obtained from PacBio sequencing were broken from junctions, and subreads were obtained after filtering out the junction sequences. The mean length and N50 length of the subreads were 18,852 bp and 18,761 bp, respectively. The CCS (https://github.com/PacificBiosciences/css) was used to generate high-precision HiFi reads, with the parameter of min-rq = 0.99. The HiFi reads obtained was used for genome assembly by using Hifiasm 0.19.9 software to give contigs12. A total of 2.25 Gb genome contigs were produced with an N50 of 67.66 Mb (Table 1). The GC content of the assembled contig genome was 42.87%. The integrity of the assembled genome was BUSCO v5.4.513 evaluated based on the Single-Copy Orthologs library (metazoa_odb10) using the software such as Metaeuk and HMMER. From 954 total BUSCO groups searched, 99.6% of complete BUSCOs (including 90.7% of complete and single-copy BUSCOs, 8.9% of complete duplicated BUSCOs), 0.2% of fragmented BUSCOs, and 0.2% of missing BUSCOs were identified, which further emphasized the precision and completeness of gene prediction.

Chromosome level genome assembly

The contig genome obtained from the Hifiasm 0.19.9 was combined with Hi-C data for chromosome clustering, orientation and sorting using ALLHiC 0.9.87 (parameters: enz = Dpn II, CLUSTER = n) to obtain the near-chromosome level genome. Juicebox v1.11.08 was employed to undertake an examination, perform a manual curation of the identification according to the chromosome interaction intensity14, and to build the final chromosome-level genome assembly. After the above assembly, we obtained a total of 2.25 Gb scaffolds, with the lengths of N50 of 110.51 Mb, respectively (Table 1). Reflecting the previous report about the 2 N = 48 karyotype of the short-faced mole5, a total of 2.13 Gb of genome data were anchored to 24 chromosomes with an anchor rate of 94.33% (Fig. 1), of which the 24th chromosome (Chr 24) probably contained the X and Y chromosomes. The anterior segment (about 16 Mb) of Chr 24 may be the Y chromosome and the posterior segment (about 103 Mb) may be the X chromosome (Fig. S2). However, since there is no other support for the annotation results of sex chromosomes of related species in the family Talpidae, this inference needs to be further verified. The length of the chromosomes ranged from 28.66 Mb to 156.99 Mb (Table S2 and Fig. 2).

Fig. 1
figure 1

Heat map of Hi-C linkage density of Scaptochirus moschatus. The x-axis and y-axis represent the genomic positions. Red dots indicate regions with a high density of paired reads, suggesting that they are more likely to be on the same chromosome.

Fig. 2
figure 2

Genome characteristics of Scaptochirus moschatus. From the outer ring to the inner ring are the distributions of RNA TEs, DNA TEs, gene density and GC content.

Repeats and non-coding RNAs annotations

Following the completion of the genome assembly, the annotation was performed on three kinds of repetitive sequences, and on ncRNAs and PCGs. Repetitive elements were delineated by employing RepeatModeler v2.0.115 following its preset parameters, and a new repeat sequence library was assembled. A customized library was built in conjunction with Dfam 3.116 and Repbase databases17. Repeated elements in custom libraries were masked in homology prediction by using RepeatMasker v4.1.018. The analysis yielded the detection of 0.88 GB of repetitive sequences within the short-faced mole genome, representing 39.41% of the total genome. The primary categories of TEs (tandem repeats) and IEs (interspersed repeats) identified are as follows: long interspersed nuclear elements (LINEs) at 26.45%, short interspersed nuclear elements (SINEs) at 0.44%, long terminal repeats (LTRs) at 8.19% and DNA transposon elements at 3.90% (Table 2).

Table 2 Statistics of repeat sequence classification results.

The annotation of all non-coding RNAs (rRNAs, tRNAs, snRNAs, and miRNAs) was performed using Infernal v1.1.319 and tRNAscan-SE v2.0.720. Non-coding RNA annotation is performed at the genome-wide level, and genome that are masked for repetitive sequences are used for subsequent gene structure annotation. In our annotation process, transcriptome data are used only for prediction of genes and their structures, and are not used for prediction of non-coding RNAs. ncRNA predictions are obtained from sequence comparisons of databases with the whole genome, and there are no predictions of overlap with protein-coding genes. A variety of non-coding RNAs were catalogued, resulting in the identification of 89,096 rRNAs, 5,958 transfer RNAs, 3,386 snRNAs, and 4,915 miRNAs (Table 3).

Table 3 Annotations of non-coding RNAs in the S. moschatus genome.

Gene structure and function predictions

The De novo annotation21 was performed using Augustus v3.5 and Snap v2013.11.29. Homologous encoded protein prediction22 was performed using Blastall v2.2.26, Solar v0.9.6 and GeneWise v2.4.1. The gene annotation results were integrated using EVM v1.1.1 with default parameters. The NCBI RefSeq database23 was queried to obtain the protein sequences of two affiliate species (Talpa occidentalis, Condylura cristata) and a model species (Rattus norvegicus), and the gene structure was predicted by blat (http://genome.ucsc.edu/cgi-bin/hgBlat)24. The gene structure was annotated by using transcriptome data. Combining the above prediction results and the transcriptome alignment data, EVidenceModeler (http://evidencemodeler.sourceforge.net)25 was used to integrate the gene sets predicted by various methods into a non-redundant, more complete gene set. Finally, PASA (http://pasa.sourceforge.net) was used to combine transcriptome assembly results to correct the EVM, and UTR and variable shear were added to obtain the final gene set. There were 21,139 protein coding genes in the genome of the short-faced mole, 94.6% of the genes were predicted to have a function, and the mean length of these genes was 30.01 Kb. The mean count of exons per gene stands at 8.58, while the typical exon spans 0.17 Kb, an average CDs extends to 1.49Kb and an average intron extends to 3.76 Kb (Table 4, Fig. S3).

Table 4 Basic statistical results on gene structure of affiliate species and a model species.

The gene sets obtained from annotation of the gene structure were used for gene function prediction based on the NR (non-redundant protein records) (http://www.ncbi.nlm.nih.gov/protein), Swiss-Prot26 (http://www.uniprot.org), Pfam (http://pfam.xfam.org), KEGG27 (http://www.genome.jp/kegg), and InterPro (https://www.ebi.ac.uk/interpro) databases by using Interproscan v5.59-91.028, Blastp v2.2.26 and Diamond v0.8.22. 94.6% of the 21,139 annotated genes were predicted successfully, each identified as having at least one homologous gene, with the information corroborated across three public databases (Fig. 3).

Fig. 3
figure 3

Statistical results of gene function annotation of the Scaptochirus moschatus genome.

Data Records

All sequencing data and genome assembly have been deposited in the National Center for Biotechnology Information (NCBI) database (https://www.ncbi.nlm.nih.gov/bioproject/PRJNA1127535), which include SRR3161760029, SRR3147821330, SRR3210808431 in SRA and JBFCTN00000000032 in GenBank. The genome annotations deposited in the Figshare database (https://doi.org/10.6084/m9.figshare.28430021)33.

Technical Validation

BUSCO13 was used to assess the integrity of the assembled genome yielded 99.6% of the genome were identified as complete BUSCOs (metazoa_odb10). BWA 0.7.8 software34 was used to calculate the alignment of the fragment reads to the assembled genome to evaluate the completeness of the assembly and the uniformity of sequencing. It yielded a 99.68% alignment rate and a 99.98% genome coverage indicating good consistency between reads and assembled genome. The SNP calling of the alignment results performed using SAMtools 0.1.1935 yielded a 0.001% heterozygous SNP ratio and a 0.000% homozygous SNP ratio, indicating that the assembly has a high single base accuracy (Table S3). The genomic sequence accuracy was evaluated using Merqury36 yielded a quality value of 50.97 indicating that the sequence accuracy was more than 99.99%. The assembled genome was evaluated by multiple methods and showed good genomic consistency, completeness, and accuracy.