Background & Summary

In entomological research, genome sequencing of individual species is critical for understanding their biology, ecological adaptation mechanisms, and potential applications. Hermonassa cecilia, a member of the genus Hermonassa in the family Noctuidae, is found primarily in East Asia1. The genus reaches its highest species diversity in the Sino-Himalayan mountains, which straddle the border between the Palaearctic and Oriental regions2. Several known lepidopteran pests that damage the underground parts of plants belong to the same subfamily as H. cecilia, the Noctuinae. Research on H. cecilia has so far focused on its morphology, geographical distribution, and preliminary ecological observations. Although existing studies have outlined its biological characteristics and life habits, its genome has not yet been analyzed in depth.

Genomics has become an indispensable tool for studying biodiversity, tracing evolutionary history, and analyzing ecological adaptation mechanisms. Genome research on other noctuid insects (such as Agrotis segetum3, Spodoptera litura4 and Spodoptera frugiperda5) has made significant progress, providing valuable insights into their genetic mechanisms, ecological adaptability, and management. Similarly, studies on other lepidopteran pests, such as Ostrinia furnacalis6 and Conogethes punctiferalis7, have expanded our understanding of pest evolution, ecological niche differentiation, and control strategies across diverse species. However, the genetic differences and ecological niche differentiation among species give independent research on the H. cecilia genome its own scientific value and practical significance. A reference genome not only provides a comprehensive view of the species’ genetic information but also enables comparative genomic analysis of its phylogenetic relationships with related species, gene family evolution, and potential adaptive evolutionary events.

This study aims to construct a high-quality reference genome for H. cecilia and to annotate and analyze it systematically. To this end, we combined PacBio HiFi and Hi-C sequencing, and processed, assembled, and annotated the sequencing data with established bioinformatics tools. We obtained a chromosome-level genome assembly for H. cecilia, with a genome size of 626.10 Mb and a scaffold N50 of 21.00 Mb. BUSCO analysis indicated an assembly completeness of 99.4%. We identified 44.21% repeat sequences and 22,662 protein-coding genes. Functional prediction against four databases yielded functional annotation for a total of 20,221 genes.

The publication of the H. cecilia genome enriches the genomic data available for Noctuinae species and supports a more comprehensive understanding of their genetic basis, evolutionary history, and relationships with their environment. The results also offer a reference for genomic research on other noctuid insects and contribute new knowledge for ecological protection and pest management.

Methods

Sample collection, DNA and RNA extraction

The sequencing samples of H. cecilia were all male adults, collected by light trapping in October 2023 from cropland in Xingping City, Xianyang, Shaanxi Province, China. Genomic DNA (gDNA) from a single male adult, intended for both PacBio HiFi and Hi-C sequencing, was extracted using the Qiagen Genomic DNA Kit (Cat#13323, Qiagen). The quality and quantity of the extracted gDNA were evaluated using a NanoDrop One UV-Vis spectrophotometer (Thermo Fisher Scientific) and a Qubit 3.0 Fluorometer (Invitrogen), following the manufacturers’ protocols. Total RNA was extracted individually from each of four adult insects using the RNeasy Mini Extraction Kit (Qiagen), in accordance with the manufacturer’s instructions. The integrity and purity of the RNA were assessed using the same methods employed for gDNA quality assessment.

Genome and transcriptome sequencing

The DNA libraries were examined and sequenced on the PacBio Sequel II platform at GrandOmics in Wuhan. The raw sequencing data were preprocessed with the CCS program, generating 41.00 Gb of Circular Consensus Sequencing (CCS) bases, corresponding to a sequencing depth of 65.49×. The Hi-C library was constructed from the same male moth used for DNA extraction. The DNA was cross-linked in situ, extracted, and digested with the restriction enzyme DpnII. The resulting fragments were ligated to form chimeric junctions and purified, and the Hi-C libraries were amplified by PCR. These libraries were then sequenced on the Illumina NovaSeq 6000 platform using a 150 bp paired-end configuration, producing a total of 109.41 Gb of clean reads, representing approximately 174.47× coverage. For the genome survey, short-insert libraries with insert sizes of 300 to 500 base pairs (bp) were constructed at BGI (Beijing Genomics Institute) in Shenzhen and sequenced on the BGISEQ-500 platform using 150 bp paired-end reads, following standard protocols. In total, 60 Gb of sequencing data was obtained. The extracted RNA was used to prepare indexed cDNA libraries with the NEBNext Ultra RNA Library Prep Kit for Illumina. These libraries, with insert sizes of 250–300 bp, were sequenced on the Illumina NovaSeq 6000 platform using a 150 bp paired-end configuration, generating a total of 29.42 Gb of filtered sequencing data (Table 1).
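The depth figures above follow from dividing the sequenced bases by the genome size. The sketch below illustrates that arithmetic only; the function name is hypothetical, and it assumes that the 626.10 Mb assembly size is the denominator and that 1 Gb = 1,000 Mb, neither of which the text states explicitly. With the rounded HiFi inputs it lands close to the reported 65.49×.

```python
def sequencing_depth(total_bases_gb: float, genome_size_mb: float) -> float:
    """Mean sequencing depth (x-fold) = sequenced bases / genome size.

    Assumption (not stated in the text): 1 Gb is treated as 1,000 Mb.
    """
    if genome_size_mb <= 0:
        raise ValueError("genome size must be positive")
    return total_bases_gb * 1000.0 / genome_size_mb


# Rounded values from the text: 41.00 Gb of CCS bases, 626.10 Mb assembly.
# Using the assembly size as the denominator is an assumption.
hifi_depth = sequencing_depth(41.00, 626.10)
print(f"PacBio HiFi depth: {hifi_depth:.2f}x")
```

Because the inputs are rounded, small deviations from the published depth values are expected.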

Table 1 Statistics for sequencing data for Hermonassa cecilia genome assembly.

Genome assembly and chromosome anchoring

Before genome assembly, the genome size was estimated using the Kmerfreq v4.0 software8. The k-value was set to 17, resulting in an approximate size of 631.91 Mb, with repetitive sequences accounting for 44.16% (Fig. 1). The primary genome assembly was generated using HiFiasm v0.16.19 with PacBio CCS reads and the --primary option, supplemented by Hi-C data (--h1/--h2 parameters) for haplotype phasing9. The initial assembly comprised 136 contigs and was subsequently processed with purge_dups v1.2.5 (https://github.com/dfguan/purge_dups) to eliminate residual heterozygous sequences. Potential microbial contamination in the assembled genome was identified by Kraken2 screening against NCBI RefSeq microbial databases, and putative contaminant contigs were then validated using BLASTn (e-value < 1e-5). This two-step purification process ensured a contamination-free haploid assembly. The assembly’s quality was assessed using BUSCO v5.7.110 with the lepidoptera_odb10 geneset11, which indicated a completeness level of 99.4%. After quality control of the Hi-C reads with Trimmomatic-0.39-212, Juicer’s generate_site_positions.py was used to build the restriction enzyme site file from the assembled genome, and the length of each contig was obtained. The filtered reads were mapped to the assembled contigs using Juicer13 with default parameters. Subsequently, the 3D de novo assembly (3D-DNA) v18092214 pipeline was employed to cluster, sort, and orient the contigs or scaffolds to achieve a chromosome-level genome assembly. Juicebox v1.11.081915 was then used for visual inspection and correction of errors in the order and orientation of the contigs. The final chromosome-level genome of H. cecilia was obtained, with a genome size of 626.10 Mb and a scaffold N50 of 21.00 Mb (Fig. 2). The GC content was 37.87%.
BUSCO analysis indicated that 99.4% (98.9% single-copy genes and 0.5% duplicated genes) of the 5286 BUSCOs were complete orthologs (Table 2). The completeness of the chromosome-level genome assembly is very close to that of the initial assembly, with one more complete BUSCO and two more single-copy BUSCOs compared to the initial assembly. Moreover, its BUSCO results outperform those of several published genomes of Noctuidae species, including Agrotis ipsilon16, A. segetum3, and Athetis lepigone17, among others. These results suggest the assembly of a high-quality chromosome-level genome for H. cecilia.
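For readers unfamiliar with the scaffold N50 statistic reported above, the following minimal sketch shows how it is conventionally defined: the length of the shortest sequence in the smallest set of longest sequences that together cover at least half of the assembly. The function name and the toy lengths are illustrative assumptions and are not taken from the H. cecilia assembly.

```python
def n50(lengths):
    """Return the N50 of a collection of contig or scaffold lengths."""
    if not lengths:
        raise ValueError("at least one sequence length is required")
    total = sum(lengths)
    running = 0
    # Walk from the longest sequence downward until half the total is covered.
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length


# Toy example (hypothetical lengths in Mb, not real scaffold sizes).
toy_lengths = [2, 3, 4, 5, 6]
toy_n50 = n50(toy_lengths)
print(toy_n50)
```

In practice the lengths would be read from the assembly FASTA index rather than typed in by hand.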

Fig. 1

K-mer analysis of Hermonassa cecilia. The estimated genome size is 631.91 Mb, with repetitive sequences accounting for 44.16%.
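The idea behind k-mer based genome size estimation can be sketched as follows: the total number of k-mers is divided by the depth at the main peak of the k-mer frequency histogram. This is a simplified textbook version, not the model implemented in the software used here; the function name, the `min_depth` cutoff for discarding error k-mers, and the toy histogram are all assumptions made for illustration.

```python
def estimate_genome_size(histogram, min_depth=5):
    """Estimate genome size from a k-mer depth histogram.

    histogram maps k-mer depth -> number of distinct k-mers seen at that depth.
    Depths below min_depth are treated as sequencing errors and ignored
    (the cutoff value is an assumption for this sketch).
    """
    usable = {d: c for d, c in histogram.items() if d >= min_depth}
    if not usable:
        raise ValueError("no k-mers at or above min_depth")
    # The main peak is the depth with the most distinct k-mers.
    peak_depth = max(usable, key=usable.get)
    # Total k-mer occurrences = sum over depths of depth * distinct k-mers.
    total_kmers = sum(d * c for d, c in usable.items())
    return total_kmers / peak_depth


# Toy histogram (hypothetical counts): error k-mers at low depth,
# a main peak at depth 20, and a small repeat peak at depth 40.
toy_hist = {1: 1000, 2: 300, 19: 50, 20: 100, 21: 50, 40: 10}
toy_size = estimate_genome_size(toy_hist)
print(toy_size)
```

Real histograms also require handling of heterozygous peaks, which this sketch deliberately omits.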

Fig. 2

Circos diagram of the Hermonassa cecilia genome. This diagram offers a comprehensive visualization of the H. cecilia genome structure. From the outermost layer inward, the layers represent: I. Distribution of markers across 31 chromosomes, at a megabase scale. II. Gene density in each chromosomal region. III. GC ratio. IV-VII. Distribution of different repeat elements (DNA repeats, Long Terminal Repeats (LTR), Long Interspersed Nuclear Elements (LINE), and Short Interspersed Nuclear Elements (SINE)) across the chromosomes. An illustration of an adult H. cecilia at the core highlights the subject of this genomic study. The photo of H. cecilia comes from https://www.inaturalist.org/observations/187888582.

Table 2 Assembly features for genomes of Hermonassa cecilia and other Noctuidae species.

Repeat element annotation

To characterize the repetitive elements in the H. cecilia genome, we first used homology-based prediction against the established repeat sequence database Repbase. Predictions were performed using RepeatMasker v4.1.518 and RepeatProteinMask v4.1.5. Additionally, we constructed a de novo repeat library from the genomic sequence using RepeatModeler v2.0.419 with the -LTRStruct option to enable comprehensive LTR retrotransposon annotation, followed by repeat masking and classification using RepeatMasker. Furthermore, the Tandem Repeats Finder v4.09 (TRF)20 program was used to identify tandem repeat sequences in the genome. The results from the four prediction methods were then integrated to obtain the final repeat sequence information. A total of 281.45 Mb of repetitive sequences were identified, accounting for 44.21% of the assembled genome (Fig. 1). Among these, DNA transposons accounted for 3.06%, long terminal repeats (LTRs) for 1.93%, long interspersed nuclear elements (LINEs) for 11.56%, and short interspersed nuclear elements (SINEs) for 6.60% (Table 3).
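Integrating repeat calls from several tools requires that overlapping annotations are not counted twice. One common way to do this, sketched below, is to pool the intervals, merge overlaps, and divide the covered length by the sequence length. This is an illustrative assumption about the integration step, not the authors’ documented procedure; the function names, the half-open coordinate convention, and the toy intervals are hypothetical.

```python
def merge_intervals(intervals):
    """Merge overlapping or touching half-open intervals (start, end)."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            # Overlaps (or touches) the previous interval: extend it.
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return [tuple(iv) for iv in merged]


def repeat_fraction(calls_by_tool, sequence_length):
    """Fraction of one sequence covered by repeats from any tool."""
    pooled = [iv for calls in calls_by_tool.values() for iv in calls]
    covered = sum(end - start for start, end in merge_intervals(pooled))
    return covered / sequence_length


# Toy repeat calls on a hypothetical 1,000 bp sequence.
toy_calls = {
    "RepeatMasker": [(0, 100), (200, 300)],
    "RepeatProteinMask": [],
    "RepeatModeler": [(250, 400)],
    "TRF": [(50, 150)],
}
toy_fraction = repeat_fraction(toy_calls, 1000)
print(f"{toy_fraction:.2%}")
```

For a whole genome the same calculation would be repeated per chromosome and the covered lengths summed before dividing by the total assembly size.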

Table 3 Statistics for repeat elements in the genome of Hermonassa cecilia.

Protein-coding gene annotation

Protein-coding genes were predicted on the repeat-masked genome using three methods: de novo prediction, transcriptome-based prediction, and homology-based prediction. De novo prediction was performed using Augustus v3.321 with the etraining and augustus commands and default parameters. Transcriptome data from four moths were used for gene prediction. The transcriptome sequencing data were first filtered using Trimmomatic v0.39-212 and then aligned to the assembled genome using Tophat v2.1.122 to identify splice sites. Transcripts were constructed and their abundance estimated using Cufflinks v2.2.122, and the four resulting transcript sets were merged using Cuffmerge. Homology-based prediction was conducted using the homolog_genewise command of the Genome-wide electronic tool for annotation (Geta) v2.6.1 (https://github.com/chenlianfu/geta/releases/tag/v2.6.1), based on the publicly available protein sequences of four noctuid species: A. ipsilon16, Xestia c-nigrum23, S. frugiperda5, and Helicoverpa armigera24. Finally, the predictions from these three methods were integrated using EVidenceModeler v2.0.025, resulting in the prediction of 22,662 protein-coding sequences. The predicted gene models showed high BUSCO completeness (lepidoptera_odb10; n = 5,286), with 94.6% of conserved genes present (90.9% single-copy, 3.7% duplicated) and only 3.7% missing, demonstrating high annotation quality. Functional annotation of the predicted protein-coding genes was performed using the blastp command of Diamond v2.1.8.162 with the parameters --evalue 1e-6 --max-target-seqs 1 --outfmt 6, based on the public protein databases UniProt_invertebrates, the non-redundant protein database (NR), Kyoto Encyclopedia of Genes and Genomes (KEGG), and Evolutionary Genealogy of Genes: Non-supervised Orthologous Groups (EggNOG)26. A total of 20,221 genes were annotated with functions in at least one database, accounting for 89.29% of the predicted proteins (Fig. 3a,b, Table 4).
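The count of genes annotated in at least one database is the size of the union of the per-database hit sets. The sketch below illustrates this with tabular (outfmt 6) search results, whose first column is the query identifier. The helper names, gene identifiers, and hit lines are hypothetical, and the sketch is not the authors’ actual script.

```python
def queries_with_hits(tabular_lines):
    """Collect query IDs (column 1) from tab-separated outfmt 6 lines."""
    return {line.split("\t")[0] for line in tabular_lines if line.strip()}


def annotation_summary(hits_by_db, all_genes):
    """Return (number, percentage) of genes with a hit in >= 1 database."""
    genes = set(all_genes)
    annotated = set().union(*hits_by_db.values()) & genes
    return len(annotated), 100.0 * len(annotated) / len(genes)


# Toy data: five hypothetical genes and made-up hit lines per database.
toy_genes = ["g1", "g2", "g3", "g4", "g5"]
toy_hits = {
    "UniProt": queries_with_hits(["g1\tP1\t90.0", "g2\tP2\t85.5"]),
    "NR": queries_with_hits(["g1\tN1\t99.0", "g2\tN2\t80.0", "g3\tN3\t70.1"]),
    "KEGG": queries_with_hits(["g2\tK1\t75.0"]),
    "EggNOG": queries_with_hits(["g4\tE1\t60.0"]),
}
toy_count, toy_percent = annotation_summary(toy_hits, toy_genes)
print(toy_count, f"{toy_percent:.1f}%")
```

The same per-database sets can be passed directly to a Venn diagram routine to show how the annotations overlap.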

Fig. 3

Functional annotations of Hermonassa cecilia protein-coding genes using public databases. (a) Bar chart showing the number of genes annotated by each of the four databases: UniProt, NR, KEGG, and EggNOG. (b) Venn diagram displaying the annotation results from the four databases.

Table 4 Functional annotation of Hermonassa cecilia proteins.

Data Records

The raw PacBio sequencing data27 have been deposited in the NCBI Sequence Read Archive (SRA) under accession number SRR31742934. The Hi-C reads28 are stored in the NCBI SRA under accession number SRR31742933. The RNA-seq reads29,30,31,32 are available in the NCBI SRA under accession numbers SRR31744080 to SRR31744083. The BGISEQ raw reads33 are available in the NCBI SRA under accession number SRR32082304. The final genome assembly34 is deposited in GenBank under accession number PRJNA1198691. Genome annotations are available in the FigShare repository35.

Technical Validation

After genomic DNA extraction, DNA integrity was assessed by agarose gel electrophoresis, and DNA concentration was measured using the NanoDrop and the Qubit 3.0 fluorometer, with an A260/A280 absorbance ratio of approximately 2.0. Low-quality Hi-C and transcriptome raw data were filtered using Trimmomatic-0.39, retaining only high-quality sequencing reads. The assembled genome size is 626.10 Mb, with repetitive sequences accounting for 44.21%. Both values are consistent with the k-mer based estimates of 631.91 Mb and 44.16%, respectively. The assembled genome size is also similar to the genome sizes of A. ipsilon16 and Agrotis segetum3 in the Noctuinae subfamily. The initial HiFiasm assembly comprised only 136 contigs, indicating high continuity and integrity. BUSCO analysis revealed a genome assembly completeness of 99.4%, with only 0.5% duplicated BUSCOs, indicating that duplication is not a significant issue in the assembly. This confirms that the quality of the genome assembly is comparable to that of other recently published lepidopteran genomes7,36. These results demonstrate that we have successfully obtained a high-quality genome for H. cecilia.
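The agreement between the k-mer estimate and the final assembly can be quantified as a relative difference. The short sketch below does this with the rounded values reported in the text; the function name is hypothetical, and taking the estimate as the reference value is an assumption.

```python
def relative_difference_percent(estimated: float, observed: float) -> float:
    """Absolute difference as a percentage of the estimated value."""
    if estimated == 0:
        raise ValueError("estimated value must be non-zero")
    return abs(estimated - observed) / abs(estimated) * 100.0


# Rounded values from the text (Mb): k-mer estimate vs. assembled size.
size_gap = relative_difference_percent(631.91, 626.10)
# Repeat content is already a percentage, so compare in percentage points.
repeat_gap_points = abs(44.21 - 44.16)

print(f"genome size differs by {size_gap:.2f}% of the estimate")
print(f"repeat content differs by {repeat_gap_points:.2f} percentage points")
```

A size difference below one percent of the estimate supports the statement that the assembly and the survey are consistent.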

Our integrated annotation approach identified 22,662 protein-coding genes, and the high BUSCO completeness score of the gene set (94.6%) indicates high annotation quality. The complementary use of de novo prediction, transcriptomic evidence, and homology-based methods supported the detection of both evolutionarily conserved and species-specific gene content, with functional annotations assigned to 89.29% of predicted genes. While these results establish a reliable genomic framework, emerging single-nucleus RNA sequencing technologies offer opportunities to improve our understanding of cell-type-specific gene expression patterns and to refine current gene models37. The 10.7% of genes lacking functional annotation may include important lineage-specific adaptations and non-coding elements, and are valuable targets for future functional characterization. This genome annotation provides a foundation for advancing Lepidoptera genomic research and facilitating downstream functional studies.