Background & Summary

Callosobruchus maculatus (Coleoptera: Chrysomelidae), commonly known as cowpea weevil, is a kind of universal stored grain insect pest1. C. maculatus feeds on a diverse range of legume seeds, and was originally distributed in the tropical and subtropical areas especially Africa and South Asia2. However, with the global climate change and international communication, C. maculatus were observed in wider regions3 and caused great losses to agricultural grain storage. Each female beetle lays huge amount of eggs on the surface of seeds, then the first instar larvae hatch and tunnel into the seeds where larvae will grow up and pupate by feeding on cotyledon, and complete its lifecycle by emerging as adult beetle1. To control pests and reduce food waste, dozens of approaches have been implemented in the past decades4,5,6,7,8.

A high-quality chromosome-level genome for C. maculatus (Fig. 1a) supplies a practical and much-needed resource for C. maculatus agricultural pest control due to its immeasurably damage to stored products, which is also valuable for understanding the molecular mechanisms of its physiological activity and evolution relationships in Coleoptera. In the past, a contig-level genome of C. maculatus had been assembled9, but it was smaller than the expected genome size due to unknow reasons10. The C. maculatus genome is highly heterozygous with large proportion of repetitive DNA sequences, rendering substantial challenges for the genome assembly which is also a momentous reason for difference between K-mer analysis and flow cytometry11,12 in genome size13. Compared to the result of K-mer analysis, flow cytometry assay experiment was more credible11,12. The K-mer distribution analysis of the new assembled genome also indicated C. maculatus genome was large with high heterozygosity compared to other known Coleoptera species14,15. To address the challenges posed by high heterozygosity and repetitive DNA sequences, long-read sequencing (Nanopore), next-generation sequencing (Illumina), and high-throughput chromosome conformation capture (Hi-C) mapping have been employed, and these approaches have proven successful in achieving a high level of completeness and continuity in genome assemblies for various plant species16,17 as well as animals14,18.

Fig. 1: Photograph of C. maculatus and Hi-C interaction heatmap.
figure 1

(a) A picture of male C. maculatus on cowpea. (b) Hi-C heatmap of C. maculatus genome showing interactions among the 10 assembled chromosomes at 1 Mb resolution. Boxes in blue and green colour indicate scaffolds and contigs. (c) The relative gene density, GC content, repeat density, as well as the coverage of second-generation, third-generation, and Hi-C sequencing data in chromosome level are represented from the outside to the center of the circle.

To facilitate C. maculatus control and prevention, a 1.2 Gb novel high-quality chromosome-level genome of C. maculatus was provided in this research stemming from the new whole genome assembly and correct strategies. Here, we sequenced and assembled the high-quality chromosomal-level reference genome of cowpea weevil combining Illumina, Nanopore and Hi-C sequencing technologies. The current version of high-quality genome offers superior data resources study of adaptive evolution in of C. maculatus.

Methods

Sampling and genome sequencing

Adult insects of C. maculatus were intercepted in imported cowpeas originally from Nigeria and transferred via Ethiopia by Beijing Customs, P.R. China, and then inbreed for about 20 generations with full sib-pair mating strategy which were reared on 28 ± 2 °C, 75% relative humidity, 16 h/8 h light/dark photoperiod in ΦA = 90 mm petri dishes and nurtured with cowpeas. Genomic DNA were extracted from 10 male beetles for sequencing using DNeasy Blood & Tissue Kit (QIAGEN, Germany).

Short reads libraries (insert size: 350 bp) were generated using a Next Ultra DNA Library Prep Kit (NEB, USA), and sequenced on Illumina Novaseq6000 platform (PE-150) at Novogene. The raw reads ran through quality control before the next analysis. For short-read data (i.e., Illumina data), the reads which included adapters, the low-quality reads and N bases were removed, a total of 142.48 Gb clean data was obtained (Table 1). The rest of short data were used to estimate genome size based on the K-mer size using Jellyfish19 (2.3.0) with option -s 1.2 G. The genome size was also estimated by flow cytometry (Table 2) using 5 males and 5 female beetles as described before18. Additionally, these short-read data were used to correct the potential base errors in de novo genome assembly. For long-read data (i.e., Nanopore data), genomic DNA was extracted to construct 20-kb libraries and used for Oxford Nanopore sequencing platform at NextOmics and generated 137.33 Gb raw data (Table 1).

Table 1 Summary of sequence reads.
Table 2 Result of flow cytometry assay.

The Hi-C sequencing was also conducted at NextOmics, which followed the standard protocol and was sequenced on the Illumina NovaSeq6000 platform. Restriction enzyme DpnII was used to lyse and digest the isolate cells which were from sliced tissues and cross-linked before overnight. The cohesive ends were blunted, reversed, and marked with biotin-14-dATP and purified the DNA by removing biotin from unligated ends. DNA was sheared to 200–300 bp fragments via a Covaris M220 and pulled down the point ligation junctions by Dynabeads MyOneTM Streptavidin C1 after size selection with AMPure XP beads. Finally, a total of 668.22 Gb raw data of Hi-C sequencing was obtained and used to assist genome assemble on chromosome-level.

For transcriptome sequencing of adult female/male C. maculatus, each sample consisted of three beetles, with three replicates for each stage or gender. RNA was extracted using Trizol (Invitrogen, USA), library construction, sequencing on the Illumina NovaSeq6000 platform (PE-150) and quality control was performed in Novogene, and clean data was generated for further analysis.

De novo genome assembly

To combine the advantages of high accuracy of second-generation sequencing and long reads of third-generation sequencing20, this research employed the strategy that both sequencing methods were combinedly used for genome assembly. Genomic DNA was extracted and sequenced via Illumina platform and in prior to the Oxford Nanopore sequencing, the K-mer distribution analysis indicated that the genome size of C. maculatus was 1,358.89 Mb based on 21-mers using the illumina data, which was used to evaluate its size and characteristics, and consistent with the result of flow cytometry assay experiment (1,182.11 Mb, Table 2).

The long-read data was used to assemble the primary genome. NextDenovo21 (2.3.1) package was used to assemble the genome, which had several scripts and seq_stat, a binary script to statistic the information of sequencing data, is one of them. After using seq_stat with option -g = 1,200,000,000 to generate the seed cutoff value, which was required by the main program, run command NextDenovo, the main program, to get the primary genome with default configure. The primary genome was adjusted with option -a = 50 by purge_dups22 from 3 Gb into 1.2 Gb which was confirmed by K-mer analysis and flow cytometry previously23. NextPolish24 (1.2.3) package was used to polish the primary genome with default configure, which integrated the short-read data and the assembled the 1.2 Gb draft genome with an N50 of 1.03 Mb.

The Hi-C paired-end reads were mapped to the above draft genome iteratively using chromap25 (0.2.3) and yahs26 (1.2a1). Finally, juicebox27,28 (2.18) was applied to correct the contig orientation and move the suspicious fragments into unanchored scaffolds via visual exploration of Hi-C heatmaps. After all, 10 assembled chromosomes (Fig. 1b and Table 3), which holds 9 autosomes and one X chromosome (2n = 20, n = 9 + X), consistent with that by karyotype analyses23,29. Totally, 78.5% fragments were anchored to the assembled chromosomes, and the genome length is 1,222.50 Mb (N50 = 117.71 Mb) (Fig. 1b and c).

Table 3 Statistics of C. maculatus genome.

Identification of X chromosomes

Female and male beetle transcriptomes proved that Chr-10 is X-chromosome since female beetle transcripts showed significant larger coverage on Chr-10 (Table 4). In addition, we also performed integrating collinearity analysis of multiple homologous species genomes to identify X chromosomes. Chromosome X (Chr-X) is a distinctive group, which had been identified in Tribolium castaneum (GCA_000002335.3)30, Harmonia axyridis (GCA_914767665.1)31, and Coccinella septempunctata (GCA_907165205.1)32, and analyzed among the seven Coleoptera species (Fig. 2a). Furthermore, it was reported that several genes, including Trx, Spastin, ARNTH, etc., were located on the Chr-X in T. castaneum, which could be found in all other six Coleoptera species Chr-X (Fig. 2b). Besides, the GO enrichment analysis of all seven Chr-X showed a strong connection to sexual reproduction (Supplementary Fig. 1).

Table 4 Coverage between female and male beetles on chromosomes.
Fig. 2: Chromosomes synteny and comparative analysis on chromosome X.
figure 2

(a) Gene synteny analysis of seven coleopteran species. Homologous genes are linked by grey lines between chromosomes, while X-chromosome homologous genes are linked by blue lines. (b) Gene density of seven coleopteran species X-chromosomes (including predicted chromosomes) is showed from blue (low) to red (high), and white blank meant that no gene are annotated. Nine key genes are located on the chromosomes. Hob: Holotrichia oblita, Psh: Psylliodes chrysocephala, Bra: Brassicogethes aeneus, Cma: C. maculatus, Tca: T. castaneum, Hax: H. axyridis, Cse: C. septempunctata.

Repeat annotation

Repetitive sequence annotation was divided into two types: homologous sequence alignment and ab initio prediction. The homologous sequence alignment was based on the repeat sequence database (RepBase library version: RepeatmaskerEdition-20181026), using Repeatmasker33 (4.1.5) and RepeatProteinMasker to identify the repetitive sequence had known. Ab initio prediction used LTR FINDER34 (0.3.1), RepeatScout35 (1.0.6) and RepeatModeler36 (2.0.4) combined with Repbase nucleotides library and Repbase proteins library. In the beginning, de novo repeat library was established, and then used Repeatmasker to predict them. In addition, in the method of ab initio prediction, tandem repeat finder (TRF)37 (4.0.9) was applied to find tandem repeats (TEs) in the draft genome. Integrated with the result of ab initio prediction, 797.47 Mb (65.17%) repeat elements were identified totally (Fig. 3a).

Fig. 3: Genome functional annotation of whole genome.
figure 3

(a) Taking Repbase as the library, the tandem repeat sequences (TEs) divergence distribution map was obtained by RepeatMasker annotation. The abscissa shows the divergence between the TEs annotated in the C. maculatus genome and the corresponding sequences in Repbase; the ordinate is the percentage of TEs in the genome under a specific divergence, and different TEs are marked with different colours. (b) Venn diagram how the overlap of five databases, Pfam, Swiss-port, KEGG, GO and NR, used in the annotation. 14,458 genes are annotated from five databases and 9,098 genes are in all the databases.

Gene structure prediction

The strategy of Gene structure prediction combined multiple prediction methods including homology prediction (seven species), ab initio prediction and RNA sequences-based prediction. Homology prediction was to compare the coding protein sequence of Drosophila melanogaster (GCA_000001215.4)38, Diabrotica virgifera (GCA_917563875.2)39, T. castaneum, Anoplophora glabripennis (GCA_000390285.2)40, Sitophilus oryzae (GCA_002938485.2)41, Leptinotarsa decemlineata (GCA_000500325.2)42, and Aethina tumida (GCA_024364675.1)43 with the genome sequence of C. maculatus via blast and genewise to predict the gene structure in the genome. Geneid44 (1.4.5), Augustus45 (3.5.0), GlimmerHMM46 (3.0.4), SNAP47 (2006-07-28), and Genscan48 (1.0) were employed with default configure in the ab initio prediction which relied on the statistical characteristics of genome sequence data, codon frequency and exon-intron distribution, to predict gene structure. Program to Assemble Spliced Alignments (PASA)49 (2.5.3) and Cufflinks50 (2.2.1) were applied in the RNA sequences-based prediction method with default settings. Based on the above prediction results, combined with the transcriptome comparison data, the EVidenceModeler (EVM)49 (2.1.0) and Liftoff51 (1.6.3) was used to integrate the gene sets predicted by various methods into a non-redundant and more complete gene set with different weights (Supplementary Fig. 2). Finally, used PASA to correct the EVM annotation results combined with the transcriptome assembly results, added untranslated region (UTR) and variable splicing and other information to get the final gene set. Using homology prediction, ab initio prediction and RNA sequences-based prediction, totally 14,458 genes were predicted (Table 5, Fig. 3b), and 91.9% proteins were conserved in BUSCO52 (5.4.1) analysis with protein mode based on Insecta odb10 (Table 6). And using the same method for previous version of the genome showed that annotated gene-sets have completeness value was 84.2% with 63.6% single copy, while the original completeness value was 75% which based on arthropod datasets10.

Table 5 Comparison of gene structures in related species.
Table 6 BUCSO evaluation for genome assembly.

Gene function prediction and Non-coding RNA annotation

The gene functions were identified by aligning to Swiss‐prot, the nonredundant sequence databases: Nucleotide collection (NR and NT), eukaryotic orthologous groups of proteins (KOG), KEGG, TrEMBL and using BLAST53 (2.15.0 + ) with an E-value cutoff of 1e-5. The Blast2GO54 (6.0) was employed to annotate gene functions in the GO database based on the aligned results from the NR database. The molecular pathways of predicted genes, which might be involved, were detected through search and annotation for the KEGG database. Using Interproscan55 (5.62–94.0) to search in the Pfam (35.0), PRINTS (42.0), SMART (9.0) databases, known motifs and domains in the C. maculatus genome were found. The ___domain boundaries of interesting proteins were searched on the Pfam website. In all, 14,013 of 14,458 genes were supported by functional annotation from the databases (Fig. 3b). It is worth noting that several genes related to hypoxia, odorant, and immunity were not well annotated in the previous genome, such as olfactory receptors and clip-___domain serine proteases which had been spot checked.

According to the structural characteristics, tRNAscan-SE56 (1.3.1) was used to identify the tRNA in the genome. Meanwhile, because of the highly conservative of it, the rRNA sequences of related species were used as a reference sequence and aligned with C. maculatus genome via blast. Additionally, the covariance model of the Rfam family was used, and the INFERNAL57 (1.1.4) that came with Rfam to predict the information of miRNA and snRNA on the genome. There were 493,139 bp non-coding RNA, and most of them is tRNA (2,827 genes for 206,000 bp), while it was 6,948 tRNA genes in the previous genome10.

Data Records

The whole raw data has been deposited at the NCBI Sequence Read Archive under BioProject number PRJNA1048654 and BioSample ID SAMN38657795 for C. maculatus. Raw sequencing data (Illumina, Nanopore, Hi-C and RNA-seq data) have been deposited in the Sequence Read Archive database as SRP47724758. The final genome assembly and gene annotation results have been deposited in GenBank59 and Figshare60.

Technical Validation

To evaluate the completeness of the assembly, BWA61 (0.7.17) was used to align the short-reads data with genome while Minimap262 (2.17) aligned the long-reads data and the coverage depth for assembled chromosomes were calculated via SAMtools63 (1.16.1) (Table 7). The chromosome-level genome was also evaluated via BUSCO52 (5.4.1) which was compared with Insecta odb10 with 1367 genes, and the results showed that 98.4% and 0.3% conserved core genes were identified as completed and fragmented (Table 6). These results showed that the assembled C. maculatus chromosome-level genome has an elevated level of completeness.

Table 7 Statistics of genome mapping.