Background & Summary

The papaya mealybug, Paracoccus marginatus Williams and Granara de Willink (Hemiptera: Pseudococcidae) (Fig. 1), is a highly polyphagous pest that affects more than 200 plant species1. It poses significant economic risks to various crops and ornamental plants, such as avocado (Persea americana)2, cassava (Manihot esculenta)3, papaya (Carica papaya)4, tomato (Solanum lycopersicum)5, chili pepper (Capsicum annuum)5, and eggplant (Solanum melongena)6. Originally native to Mexico and Central America7,8, P. marginatus was not a major pest in its native habitat, likely due to the presence of endemic natural enemies. However, it became a serious pest after its introduction to the Caribbean in the 1990s, and has since spread to over 60 countries, including tropical and subtropical regions in the Americas, Africa, and Asia1,7,9. Studies have shown that infestations in papaya orchards in Bangladesh result in average economic losses of approximately USD 700 per hectare annually due to P. marginatus damage10. In Kenya, the papaya mealybug causes losses exceeding USD 29.8 million per year, with individual households experiencing losses between USD 51 and USD 740, depending on local conditions and market access11. Crop losses due to P. marginatus can range from 10% to 60% depending on the crop, but in severe cases, they can exceed 90%12,13,14. These significant economic and yield losses highlight the global threat posed by the papaya mealybug, making it a critical concern for agricultural sustainability and food security.

Fig. 1
figure 1

Ecological photos of papaya mealybug. (A) adult female. (B) adult male; (C) eggs; (D) feeding on papaya. (Collected site: Sanya city, Hainan Province, China, June 2022; Images were provided by Linjia Xue, the one of author).

Recent studies have revealed that P. marginatus in Asia consists of a single haplotype, suggesting that its presence in the region is the result of a recent invasion (unpublished data)8. This could be attributed to the short length of the gene segments used, which might not capture sufficient genetic diversity. Therefore, longer molecular markers are needed for more comprehensive future studies. The lack of high-quality genomic information has limited our understanding of the molecular mechanisms underlying P. marginatus ecology and the development of new molecular markers. While extensive research has been conducted on biological control15,16, risk assessment7, biology17, and genetics8, the molecular mechanisms behind the papaya mealybug’s adaptability to various environments and host plants remain unclear.

In the current study, a chromosome-scale genome assembly for papaya mealybug was achieved by integrating Hi-Fi sequencing technologies with the Hi-C scaffolding approach. The assembled genome had contig and scaffold N50s of 20.2 Mb and 48.01 Mb, respectively. The assembled genome, with a total size of 213.81 Mb, was anchored onto four chromosomes. The genome exhibited contig and scaffold N50s of 20.2 Mb and 48.01 Mb, respectively. A total of 13,367 protein-coding genes were predicted, with 11,719 of these genes functionally annotated.

Methods

Sampling and sequencing

The papaya mealybugs used for sequencing were collected from Sanya City, Hainan Province, China, in June 2022. These mealybugs were not exposed to any insecticides and were found on papaya (Carica papaya L.). The specimens were reared in the laboratory at the College of Plant Protection, Shanxi Agricultural University. A total of 150 female adult individuals were used for genome sequencing, with the entire bodies of these samples utilized to construct the library.

High-quality DNA was extracted from the whole body of adult female by the QIAGEN DNeasy Blood and Tissue kit (Qiagen, Germantown, USA). Illumina and third generation sequencing (PacBio) were used respectively. An excellent integrity of DNA molecules was observed in the current study. Genome sequencing was carried out by Biomaker Biotechnology Company in Beijing. The raw data will be used in the laboratory for subsequent genome assembly, annotation, and other related tasks.

Genome size estimation and contig assembly

A genome survey of the papaya mealybug was performed using K-mer analysis (K = 19) with a 350 bp library dataset. The Illumina short-read DNA library produced 41.31 Gb of high-quality filtered data, providing 143.69 × genome coverage, with Q20 and Q30 scores exceeding 99.21% and 95.43%, respectively. Based on the 19-mer frequency distribution analysis (Fig. 2), the papaya mealybug genome was estimated to be approximately 287.46 Mb (287,457,296 bp) in length. The heterozygosity of the genome was calculated to be 0.57%. Repetitive sequences constituted 49.26% of the genome, and the GC content was 32.82%.

Fig. 2
figure 2

The distribution of 19-mer frequency in the papaya mealybug. Frequency distribution of k-mers of different occurrences in two paired-end libraries. K-mer occurrences (x-axis) were plotted against their frequencies (y-axis).

De novo assembly of the PacBio HiFi reads was performed using Hifiasm (version 0.16)18, with redundant sequences filtered out using purge_dups19. The completeness of the genome was assessed using CEGMA (version 2.5)20 and BUSCO (version 4)21. A total of 14.01 Gb of HiFi reads, with a maximum read length of 43,103 bp and an average read length of 14,691 bp, were obtained (Tables S1, S2). The initial genomic contig sequence had a total length of 245.75 Mb and a contig N50 of 20.02 Mb. After removing redundancy, the length of the genomic contig was reduced to 215.04 Mb, maintaining a contig N50 of 20.02 Mb (Table S2). CEGMA assessment indicated a completeness of 99.56%, while BUSCO analysis yielded a completeness score of 95.50% based on the Hemiptera database in OrthoDB 10.

Hi-C analysis and chromosome assembly

Hi-C libraries were constructed and sequenced using the Illumina NovaSeq. 6000 platform (Biomaker Technologies Co., Ltd., Beijing, China). Burrows-Wheeler Aligner (version 0.7.17-r1188) was used to align the paired-end clean reads to the assembled genome sequence, yielding uniquely mapped read pairs. Final valid reads for downstream analysis were selected after removing invalid read pairs, including dangling ends, self-cycles, re-ligation, and discarded products, using Hic-Pro (version 2.10.0). The chromosome matrix was visualized as a heatmap, displaying diagonal patches indicative of strong linkages. Gene density, transposable element (TE) density, and GC content were plotted as a circus plot using the Circlize package (version 0.4.1)22.

A total of 208,871,684 read pairs (61.87 Gb of clean data; 42 × genome coverage) were generated from the Hi-C library (Table S1). Of these, 82.96% (173,267,369 bp) were uniquely mapped to the assembled genome. Among the uniquely mapped read pairs, 64.25% (111,328,377 bp) were valid interaction pairs used for the Hi-C assembly. The Hi-C assembly anchored 213.81 Mb (213,808,492 bp) of the genome sequences to 4 chromosomes, with the order and direction of 213.54 Mb (99.87%) of these sequences accurately determined (Table 1, Fig. 3A). The heatmap of the Hi-C interaction bins confirms the high quality of the genome assembly (Fig. 3B). Detailed chromosome sequence distribution is provided in Table S3. The final assembled genome size for P. marginatus is 213.81 Mb, consisting of 98 contigs and 31 scaffolds, with contig and scaffold N50 values of 20.2 Mb and 48.01 Mb, respectively (Table 1).

Table 1 Summary of Genome sequencing and assembly of papaya mealybug.
Fig. 3
figure 3

Heatmap of genome-wide Hi-C data and overview of the genomic landscape of P. marginatus. (A) Circos plot of distribution of the genomic elements in papaya mealybug was visualized by Circos. Blocks on the outmost circle represent all four chromosomes of P. marginatus. Peak plots from outer to inner circles represent the length of each chromosome, TE density, Gene density and GC content. (B) The heatmap showing all-by-all Hi-C interactions among four pseudochromosomes of P. marginatus. The frequency of Hi-C interaction links is represented by colours, which ranges from white (low) to red (high).

Genome annotation

The repetitive elements in the papaya mealybug genome were identified and annotated using RepeatModeler2 (version 2.0.1)23, which includes two tools: RECON (version 1.0.8)24 and RepeatScout (version 1.0.6)25. The combined results from RepeatModeler, along with databases Repbase (v19.06)26, REXdb (v3.0)27, and Dfam (version 1.07)28, were used to create the final repeat sequence libraries for subsequent analysis with RepeatMasker (v4.1.2)29. Tandem repeats (TRs) were also identified using MISA (Microsatellite Identification tool) (v2.1) and TRF (Tandem Repeat Finder) (v4.09). In total, 55,387,198 bp of transposable elements (TEs) were annotated, representing 25.76% of the entire genome (Table S4). Among these TEs, DNA transposons comprised 22.79% of the whole genome, making them the largest category of repeats. Tandem repeat sequences accounted for 11,202,955 bp, which is 5.21% of the total TEs (Table S5).

Genome gene structure prediction was carried out using a combination of three approaches: ab initio, homology-based, and RNA-seq-based methods. Ab initio gene models were predicted using Augustus (version 3.1.0)30 and SNAP (version 2006-07-08)31. Homology-based predictions were performed with GeMoMa (version 1.7)32, utilizing homologous genes from three species: Aphis gossypii, Phenacoccus solenopsis, and Metopolophium dirhodum. GeneMarkS-T (version 5.1)33 was employed for predictions based on assembled transcripts, while PASA (version 2.4.1)34 was used to refine gene predictions based on unigenes assembled by Trinity (version 2.11)35. The final gene models were integrated using EVM (version 1.1.1)36 and further modified with PASA (version 2.4.1). The completeness of the genome prediction for the papaya mealybug was evaluated using BUSCO (version 5.2.2)21 by searching against the Hemiptera database. Additionally, various non-coding RNAs were identified, including tRNAs, rRNAs, miRNAs, snRNAs, and snoRNAs. tRNAs were determined using tRNAscan-SE (version 1.3.1)37; rRNAs were predicted with barrnap (version 0.9)38; miRNAs, snoRNAs, and snRNAs were identified using Infernal (version 1.1)39 based on the Rfam (version 14.5)40 database. Pseudogenes were searched using GenBlastA (version 1.0.4)41 and GeneWise (version 2.4.1).

A total of 13,367 protein-coding genes were predicted by integrating ab initio, homology-based, and RNA-seq strategies (Table S6, Fig. 4). The average gene length is 6,164 bp, with an average exon length of 2,001 bp, average coding sequence length of 1,524 bp, and average intron length of 4,162 bp (Table 2, Fig. 4). Additionally, 2,346 BUSCOs were identified, with a completeness of 93.47%, 0.52% fragmentation (13 BUSCOs), and 6.02% missing (151 BUSCOs), indicating a high accuracy in gene prediction across the papaya mealybug genome. Non-coding RNAs were found to have an average length of 246 bp, including 32 rRNA, 195 tRNA, 7 miRNA, and 13 snRNA (Table S7). A total of 10 pseudogenes were identified, with a combined length of 16,406 bp and an average length of 1,640.6 bp (Table S7).

Fig. 4
figure 4

The protein-coding genes were predicted by integrating three methods. Venn diagram illustrate protein-coding genes were predicted based on 3 different strategies.

Table 2 The basic information statistics of assembled genome.

Gene function prediction was conducted using various databases, including NR, eggNOG (Evolutionary Genealogy of Genes: Non-supervised Orthologous Groups)42, KEGG (Kyoto Encyclopedia of Genes and Genomes)43, TrEMBL, KOG, SWISS-PROT, and Pfam44. Motifs and domains were identified using InterProScan (version 5.34–73.0)45. In total, 11,719 protein-coding genes in the papaya genome were functionally annotated through alignments with these public databases (NR, eggNOG, GO, KEGG, Swiss-Prot, Pfam, KOG, and TrEMBL) (Table S8). Additionally, 591 motifs and 19,572 domains were annotated in this study.

Data Records

The PacBio and Hi-C sequencing data that were used for the genome assembly have been deposited in the NCBI Sequence Read Archive with accession number SRR3255413446 and SRR2549786647 under BioProject accession number PRJNA992385. The final chromosome assembly were deposited in the GenBank with accession number JBLZST00000000048. The annotation results of repeated sequences, gene structure and functional prediction were deposited in the Figshare database49.

Technical Validation

DNA quality was evaluated by 0.75% agarose gel electrophoresis, and DNA concentration was quantified using a NanoDrop 2000 spectrophotometer (Thermo Fisher, USA) in combination with a Qubit 3.0 Fluorometer (Invitrogen, USA).

Genome completeness was assessed using CEGMA (v2.5)20 and BUSCO (v4.0)21. The final assembly spanned 213.81 Mb, with 99.56% of sequences anchored to four chromosomes. Assembly continuity metrics revealed contig N50 of 20.2 Mb and scaffold N50 of 48.01 Mb. Benchmarking analyses demonstrated high completeness: BUSCO assessment yielded a score of 95.5%, while CEGMA identified 99.56% core eukaryotic genes, collectively supporting the high-quality annotation of P. marginatus genome.