Background & Summary

Crambidae is one of the most speciose families of Lepidoptera, currently containing 15 subfamilies, 1,015 genera and over 11,500 species globally1,2,3. Many species are economically important pests, affecting crops and stored food products. Spilomelinae is one of the species-rich subfamilies with 4,135 described species belonging to 344 genera worldwide, it is the most speciose group among pyraloids3. Their host plants range from ferns4 over gymnosperms5 to a wide spectrum of angiosperms. Many Spilomelinae tribes have a narrow food spectrum, with the larvae feeding on plants of only one or a few plant families6, including a variety of economically important crops.

Species of the genus Parotis Hübner, 1831 from the subfamily Spilomelinae are easily distinguishable taxa with typical morphological characters that uniform the whole body with green or yellow-green color. This genus comprises 43 recognized species distributed across the Palaearctic, Oriental, and Australian regions, with 15 species documented in China, particularly in the southern region. Parotis larvae are leaf-folders, which fold both sides of a leaf to be a bag-like shape, host plants including Rubiaceae, Apocynaceae and Euphorbiaceae7,8. Larvae of Parotis prefer to feed on tender leaves as a window-feeder by removing discrete patches of mesophyll and overlying epidermis to avoid the latex secreted veins7. Despite its taxonomic distinctiveness, Parotis has received relatively limited research attention, with most studies focusing on species identification and taxonomic revisions. The limited genomic resource extremely hinders deeper understanding of evolution, adaptation, and phylogenetic relationships of this genus.

Parotis chlorochroalis was first described by Hampson (1912)9, and distributes in Cameroon, Nigeria Congo10,11,12 and China13. This species is characterized by its pale green body, fulvous-marked palpi, and slight fulvous stripes on the shoulders. The forewings have a pale fulvous costal edge with black discoidal and terminal points, while the hindwings also feature a black discoidal point, both with whitish cilia. Males possess a prominent fuscous-black anal tuft mixed with silvery scales. To enhance the understanding of the evolution and ecology of Parotis, a chromosome-level genome of P. chlorochroalis (Hampson, 1912) was obtained through the combination of PacBio Hi-Fi long reads, Illumina short reads, and Hi-C data. The repeats, non-coding RNAs (ncRNAs), and protein-coding genes (PCGs) were annotated, and conducted gene family evolution analysis. The high-quality genome of P. chlorochroalis is an important milestone in understanding of Parotis and will contribute to the study of Parotis evolution and ecology.

Methods

Sample collection and sequencing

The P. chlorochroalis samples used in this study were collected in Xishuangbanna Tropical Botanical Garden, Chinese Academy of Sciences in Yunnan Province, China on 15 July 2022 (Figure S1). Adult individuals were collected by light-trap, brought back to laboratory alive, and stored in −80 °C freeze after instantly freezing with liquid nitrogen. Two male adult specimens were used for PacBio Hi-Fi and Hi-C sequencing, one female specimen for transcriptome sequencing. Besides, to identify the sexual link chromosome, both one male and female adult specimen were used for genome survey sequencing. Genomic DNA and RNA from specimens were extracted using the DNeasy Blood & Tissue Kit and TRIzoTM Reagent, following the manufacturer’s instructions. The abdomen of all specimens was removed before DNA extraction to avoid contamination of intestinal contents. PCR-free short-read libraries of 150 bp paired-end read with a 350 bp insert size were generated using the Truseq DNA PCR-free Kit. The Hi-C sequencing was carried out by digesting extracted DNA with the Mbol restriction enzyme. The Illumina NovaSeq6,000 platform was utilized to sequence all short-read libraries.

After examination of the quality of isolated DNA, the library of 15 kb was constructed using a SMRTbell Express Template Prep Kit 2.0 (Pacific Biosciences, CA, USA). The construction included DNA shearing, AMPure PB Bead purification, ssDNA overhangs removing, damage repair, end repair, hairpin adapter ligation, and bead purification of the library. After quality control test, a SMRTbell library was obtained. The library was sequenced using a single 8 M SMAT Cell on the PacBio Sequel II platform (Pacific Biosciences, CA, USA) (PacBio Sequel II System).

Berry Genomics (Beijing, China) carried out all library construction and sequencing. Finally, a total of 384.50 Gb of sequencing data were obtained, comprising 37.63 Gb (82.48 × coverage) of PacBio Hi-Fi reads, 71.48 GB of Illumina reads (31.72 GB (69.53×) for male, 39.76 GB (87.15x) for female), 54.03 Gb (118.43 × coverage) of Hi-C data, and 6.89 Gb of transcriptome data (Table 1). The raw PacBio Hi-Fi reads had a scaffold N50 and an average length of 17.53 and 17.74 kb, respectively.

Table 1 Statistics of the sequencing data used for genome assembly.

Genome size estimation and assembly

Quality control on raw Illumina data performed using fastp v0.23.214 using default parameters. The strategy of short-read k-mer distributions was employed to estimate the genome size. The histogram of k-mer frequencies was computed with 17-mers using Jellyfish v2.3.015, and the k-mer histogram was provided to the R package findGSE v1.016 to estimate the genome size. As a result, the genome size was estimated to be 460.17 Mb (Figure S2).

The primary assembly of PacBio Hi-Fi long reads was generated using hifiasm v 0.16.1-r37517 and wtdbg2 v2.518. The haplotypic duplication was identified and removed with purge_dups v1.2.51319 for hifiasm assembly. NextPolish v1.4.120 was used to polish the wtdbg2 assembly with Illumina and PacBio reads. After then, two assemblies were merged using quickmerge v0.321. Hi-C reads were aligned to the merged assembly after performing quality control using Juicer v1.622. Subsequently, contigs were anchored onto chromosomes using 3D-DNA v180923. To ensure accuracy, manually review and correction were performed with Juicebox v1.11.0824.

Potential contaminants were screened using blastn (BLAST + v2.11)25 against the NCBI nucleotide database, and sequences shorter than 1,000 bp and non-target sequences were filtered using BlobToolKit environment26. Besides, univec contaminants regions were removed by alignment against the UniVec database27, and the final non-contaminated genome assembly was extracted using seqkit v2.2.028 and bedtools v2.30.029.

The final chromosome-level genome assembly of P. chlorochroalis had a size of 456.23 Mb, comprising 362 scaffolds and 569 contigs, with the scaffold and contig N50 sizes of 16.46 Mb and 15.53 Mb, respectively. Among them, 178 contigs (98.62%, 449.84 Mb) were anchored into 31 chromosomes with lengths ranging from 7.16 to 69.80 Mb and the GC content was 37.08% (Table 2, 3; Figs. 1, 2). The genome completeness was assessed with BUSCO v 5.4.230 with the reference lepidoptera gene set (n = 5,286). The final genome assembly showed a BUSCO completeness of 96.1%, consisting of 946 (93.4%) single-copy BUSCOs, 2.7% were duplicated, 0.7% were fragmented, and 3.2% were missing. The mapping rates of PacBio, Illumina, and RNA reads to the genome were 99.90%, 98.52% (female) /97.89% (male), and 97.71%, respectively.

Table 2 Genome assembly statistics for Parotis chlorochroalis.
Fig. 1
figure 1

Genome-wide chromosomal heatmap of Parotis chlorochroalis, with each chromosome and contig framed in blue and green, respectively.

Fig. 2
figure 2

Genome characteristics of Parotis chlorochroalis. From the outer ring to the inner ring are the distributions of chromosome length, gene density, GC content, simple repeats and TEs (LINE, SINE, DNA, and LTR).

Sex chromosomes detection

To identify sex-linked fragments in the genome, high-quality clean reads were mapped to the pseudochromosome sequences using bwa v0.7.1731 and samtools v1.15.132. The depth coverages of male and female samples were calculated using bamdst v1.0.9 (https://github.com/shiquan/bamdst). Subsequently, the male to female (M: F) coverage ratio calibrated by the average depth coverages was used to determine sex-linked scaffolds. Chromosome 1 with a log2 (F: M coverage ratio) value approximately 1 was defined as Z chromosome (Z chromosome possess approximately twice greater coverage in male than in female), other pseudochromosome with value approximately 0 as autosome (Figure S3).

Genome annotation

A custom repeat library was generated using RepeatModeler v2.0.333. RepeatMasker v4.1.2-p134 was utilized to identify repetitive elements in the P. chlorochroalis genome by aligning it against the custom library. The analysis revealed that the P. chlorochroalis genome contains approximately 39.85% (181.82 Mb) repetitive elements, comprising unknow elements (8.61%), LTR elements (1.21%), DNA transposons (2.26%), LINE (11.48%), SINE (5.00%), simple repeats (0.87%) (Table 3), as well as other elements (Table S1). Furthermore, sequence divergence estimates revealed a peak at low divergence rates (1%) in TE sequences of P. chlorochroalis, indicating a recent expansion of TEs (Figure S4).

Non-coding RNAs (ncRNAs) and tRNAs were identified using Infernal v1.1.435 and tRNAscan-SE v2.0.936, respectively. The low-confidence tRNAs were filtered using EukHighConfidenceFilter from tRNAscan-SE. A total of 817 ncRNAs in the genome of P. chlorochroalis were identified (Table 3), including 90 ribosomal RNAs, 81 microRNAs, 101 small nuclear RNAs, 478 transfer RNAs, four ribozymes, and 63 other ncRNAs (Table S2).

Protein-coding genes (PCGs) were annotated using MAKER v3.01.0337 based on three strategies, containing ab initio predictions, homology-based, and transcriptome-based approaches. To maximize ab initio predictions, the BRAKER v2.1.638 were employed with transcriptome and protein evidence, and combined their results as the ab initio input for MAKER. BRAKER used Augustus v3.4.039 and GeneMark-ES Suite 4.71_lic40 as predictors and automatically trained them from reference proteins mined from OrthoDB v10 database41. Protein sequences from five species (Apis mellifera (GCF_003254395.2), Drosophila melanogaster (GCF_000001215.4), Bombyx mori (GCF_030269925.1), Chilo suppressalis (GCA_902850365.2), and Diatraea saccharalis (GCA_918026875.4)) were used for homologous gene annotation. The transcriptome used for MAKER pipeline was assembled under a genome-guided method via HISAT2 v2.1.142 and StringTie v2.2.143, and redundant isoforms were removed with cdhit v4.8.144.

The final annotation predicted 16,299 protein coding genes, with an average length of 7291.36 bp for genes (Table 3). The average number of exons, introns, and CDS of each gene were 6.44, 5.44, and 6.33, respectively, and their corresponding mean length was 301.59, 983.57, and 223.71 bp, respectively (Table S3). BUSCO completeness of the protein sequences was 91.6% (n = 5,286), including 90.7% (4797) single-copy, 0.9% (47) duplicated, 1.9% (101) fragmented, and 6.5% (341) missing BUSCOs, indicating high-quality predictions.

Table 3 Genome assembly and annotation statistics for chromosome-level assemblies of Parotis chlorochroalis.

Gene functional annotation was conducted by searching against the UniProtKB database45 using Diamond v2.0.15.15346 in sensitive mode with the parameters “--sensitive -e 1e-5”. eggNOGmapper v2.1.1247 and InterProScan 5.53–87.048 were employed to assign Gene Ontology (GO) and (KEGG, Reactome) pathway annotations and to identify protein domains. The InterProScan analyses included five databases: Pfam49, SMART50, Superfamily51, Gene3D52, and CDD53. The results predicted by the above tools were integrated to obtain the final gene function prediction. Genes with 8,141 GO terms, 4,176 KEGG pathways, 3,032 Reactome pathways, 2,430 Enzyme Codes, and 11,690 COG categories were assigned by integrating the InterProScan and eggNOG annotation results (Table S3).

Data Records

The raw sequencing data and genome assembly of P. chlorochroalis have been deposited at the Genome Sequence Archive54 and Genome Warehouse55 in National Genomics Data Center (NGDC)56. The raw sequence data of Illumina, transcriptome, Hi-C, and PacBio can be found under identification numbers CRR1371036-CRR137104057,58,59,60,61 for NGDC and SRP50644662 for NCBI, the whole genome sequence assembly can be found under identification numbers GWHFIDL00000000.163 as well as deposited in the NCBI assembly with the accession number GCA_047302205.164. Additionally, the results of annotation for repeated sequences, gene structure, functional prediction and supplementary files have been deposited in the ScienceDB database65.

Technical Validation

Phylogeny and gene family evolution

Orthology analyses was performed on PCG sequences across 26 Lepidoptera species, comprising 21 moth species and five butterfly species (Table 4). The redundant isoforms were eliminated using cdhit (-c 0.98), after then orthogroup (gene families) inference using OrthoFinder v2.5.566 with Diamond mode (“-S diamond”) for sequence alignment. A total of 417,889 (93.53%) genes were assigned to 20,282 orthogroups, of which 3,231 were shared by all eight species and 454 were single-copy genes (Table S4). For P. chlorochroalis, 15,190 genes (93.20%) were contained in 10,600 gene families, of which 49 families and 808 genes were specific to this species.

Table 4 Species taxonomic information and accession code of all samples used in this study.

Single-copy orthologues identified by OrthoFinder were aligned using MAFFT v7.49067 with the high-accuracy LINS-I strategy. Alignment gaps were removed using trimAl v1.4.168 with the “automated1” parameter, and all sequences were concatenated using FASconCAT-G v1.0469. Finally, the phylogenetic tree was reconstructed on the single-copy orthologs using IQ-TREE v2.0770 with the LG site-homogeneous model. The ultrametric tree was transformed using r8s v1.8.171 and the time-calibrated by the divergence time between Cnaphalocrocis medinalis and Chilo suppressalis (68.8 Mya), Danaus plexippus and Kallima inachus (72.9 Mya), Bombyx mori and Manduca sexta (74.5 Mya), Pieris napi and Zerene cesonia (69.5 Mya) from the TimeTree database72. The analysis showed that P. chlorochroalis is closely related to Cnaphalocrocis medinalis, which both belong to subfamily Spilomelinae, and six Crambidae species format a cluster (Fig. 3). The phylogenetic results were consistent with previous studies, supporting Pyralidae (Ephestia elytella) as a sister group to Crambidae73,74 (Fig. 3).

Fig. 3
figure 3

The phylogeny and gene family changes among 26 Lepidoptera species. The divergence times were estimated by r8s using the calibration time from Timetree. The values labeled at terminals denote the number of significantly expanded and contracted gene families. “1:1:1” represents universal single-copy genes in all species, “N:N:N” represents multi-copy genes, “others” represents unclassified orthologues, and “unassigned” represents orthologues that cannot be assigned to any orthogroups.

Gene family evolution (expansion or contraction) was estimated using CAFÉ v4.2.175 based on the generated phylogenetic tree, revealing 820 expanded and 1,449 contracted gene families in P. chlorochroalis, including 39 gene families that underwent rapid evolution (35 expansions and 4 contractions). The significantly expanded families included the Reverse transcriptase, CCHC-type ___domain-containing protein, Ribonuclease H protein, Chitin-binding and other families that play important roles in the development, metabolism, and adaptive evolution of P. chlorochroalis (Table S5). Subsequently, functional enrichment (GO and KEGG) analysis on PCGs from significantly expanded families were performed using ClusterProfiler v4.0.176 with default parameters. The enrichment of GO and KEGG in rapidly expanding families further indicates their function in the membrane biogenesis, cell-cell junction, lipid biosynthesis, and signaling pathway, among others (Figure S5a,b).

Chromosome synteny

To investigate interspecific chromosomal evolution, the genome of P. chlorochroalis was compared against with that of B. mori and D. saccharalis. The pairwise synteny were searched, filtered, and visualized using JCVI77, the subset of blocks were extracted with following options “--minspan = 30 --simple”. Syntenic analyses showed that 80 syntenic blocks (8,500 gene pairs contained 16,998 collinear genes) between P. chlorochroalis and B. mori and 99 syntenic blocks (8,851 gene pairs contained 17,694 collinear genes) between P. chlorochroalis and D. saccharalis were conserved. The average number of genes per block was 106 and 89, while a notable 33.75% (27 blocks) and 23.23% (23 blocks) contained over 100 collinear genes for P. chlorochroalis vs. B. mori and P. chlorochroalis vs. D. saccharalis, respectively. Notably, the analysis revealed that the three B. mori chromosomes 11, 23 and 24 were clearly divided into three pairs in P. chlorochroalis: 10 and 29, 13 and 30, 27 and 31, respectively (Fig. 4). The D. saccharalis chromosomes 1~5, 7, 8 and 20 were divided into eight pairs in P. chlorochroalis: 1 and 21, 16 and 18, 17 and 22, 8 and 24, 20 and 25, 7 and 23, 19 and 26, 27 and 31, respectively; D. saccharalis chromosomes 6 divided into 3 chromosomes in P. chlorochroalis (13, 28 and 29); while D. saccharalis chromosomes 22 and 23 merged into one chromosome in P. chlorochroalis (30); D. saccharalis chromosomes 21 should be the chromosome W which not represented in current P. chlorochroalis genome assembly.

Fig. 4
figure 4

Chromosomal synteny between Parotis chlorochroalis, Diatraea saccharalis and Bombyx mori.