Background & Summary

Youcha, in a broad sense, encompasses over more than 60 shrubs belonging to the genus Camellia (Theaceae)1. As one of the world’s most productive woody plants for edible oil, Youcha trees have a cultivation history exceeding 2300 years and possess a diverse wide of uses2,3,4. Among the Camellia species, the Camellia oleifera is the most extensively cultivated and is predominant in oil production, which is valued for its high nutritional content and health benefits5,6,7. Additionally, five other species are cultivated for edible oil production, including C. meiocarpa, C. reticulata, C. chekangoleosa, C. yuhsiensis, and C. vietnamensis3,8. The Youcha trees currently cover more than 4.6 million hectares in South China9. As evergreen and productive oil crops, oil-Camellia trees have garnered increasing attention, especially amidst global food crises.

C. meiocarpa, among the commonly cultivated Youcha plants, is distinguished by its leafy tree habit, large flowers, and abundant fruit (Fig. 1a–d), indicating its potential for agricultural utilization with high oil content and yield10,11,12. It has been reported that the tetraploid C. meiocarpa may have originated from hybridization between closely related diploid species, resulting in an allotetraploid genome13. However, there remains a dearth of chromosomal-scale genomic data supplementation. Polyploidy, a frequent and recurrent phenomenon, is not random and always associated with adaptation to periods of environmental upheaval, providing rich evolutionary material for unique phenotypes14,15,16. In crop improvement, induced polyploidy is considered a powerful tool to increase production, enhance quality, and improve stress adaptation17,18,19. Therefore, it is imperative to explore the genomic feature of Youcha species, particularly those that are naturally polyploid. Within the group of Youcha species, polyploid complexes is frequently observed, from diploid to dodecaploid20,21, yet few high-quality genomes of Youcha species have been published4,6,22,23,24, and none provide a haplotype-scale scenario for genomic divergence.

Fig. 1
figure 1

Morphological and genomic characteristics of Camellia meiocarpa Hu. (a) The complete plant and its natural habitat. (b) Close-up of the flower, highlighting the stamen and pistil. (c) A bud in development. (d) The fruit, which is rich in oil. (e) Estimations of genome size and heterozygosity rate based on K-mer counting.

In this study, we used a century-old C. meiocarpa ancient tree, which was identified as a tetraploid after karyotype analysis and ploidy detection and planted at the National Engineering Research Center of Youcha, as the research subject. We constructed and annotated a high-quality chromosome-level reference genome of C. meiocarpa, resolving the four sets of haplotypes for the tetraploid genome (Fig. 2a). The assembly determined a high continuity genome with contig N50s of 44.46 Mb and 41.40 Mb. After anchoring the contigs to 4 × 15 pseudo-chromosomes, the final assembled genome of the four haplotypes were obtained, with lengths of 2967.25 Mb for haplotype A, 3041.66 Mb for haplotype B, 2946.45 Mb for haplotype C, and 3030.54 Mb for haplotype D, covering more than 95% of the K-mer-based estimation of genome size (Table 1; Fig. 1e). A total of 51336, 52010, 51586, and 52631 protein-coding genes were predicted for each of the four haplotypes, of which 99.4% were functionally annotated (Tables 1, 3). The more than 82% of the C. meiocarpa genome is annotated as repetitive sequences (Table 2). This haplotype-resolved tetraploid genome of C. meiocarpa provides an in-depth understanding on the influence of polyploidy on important phenotypic traits and the potential for future utilization in genetic study and breeding programs.

Fig. 2
figure 2

Haplotype assembly and genomic characteristics of Camellia meiocarpa genome. (a) Hi-C interaction heatmap illustrating the scaffolding of the four haplotypes. (b) A circus plot representation of the 4 × 15 chromosomes, detailing their genomic attributes: (I) GC content, (II) transposable element abundance, (III) gene density, (IV) and collinearity blocks among chromosomes. This circos figure was generated using the R package “circlize”75.

Table 1 Statistics of genome assembly and gene annotation for the four haplotypes.
Table 2 Identification of the repetitive elements in the haplotype-resolved Camellia meiocarpa genome.
Table 3 Functional annotation of the genes in each haplotypes of Camellia meiocarpa.

Methods

Plant materials and sequencing

C. meiocarpa was cultivated at the National Germplasm Resources Gardern of Youcha (NGRGY) in the Experimental Forestry Farm of Hunan Academy of Forestry (113°01′ E, 28°06′ N). Healthy young leaf samples of C. meiocarpa were collected and subsequently stored in liquid nitrogen. Genomic DNA was extracted from leaf sample by using CTAB method25. Genomic DNA was processed into a Nextera DNA Flex Library, targeting insert sizes between 200 to 400 bp, and assessed for quality and quantity using a Qubit 3.0 Fluorometer and Agilent 2100 system. The library was sequenced on the Illumina Nova-Seq 6000 platform (Illumina, San Diego, CA, USA) to produce paired-end 250 bp reads. The PacBio HiFi library was prepared with the SMRTbell Express Template Prep Kit and sequenced on the PacBio Sequel II platform (Pacifc Biosciences, California, USA), yielding 374.15 Gb HiFi reads with an N50 length of 17.57 kb after removing adapters, assessment of contaminated reads and quality control. For chromosomal scaffolding, young leaf tissues were cross-linked, digested with Dpn II, and ligated. DNA fragments of 300 to 700 bp were selected, and the library was sequenced on the Illumina Nova-Seq platform (Illumina, San Diego, CA, USA) to acquire a total of 383,901,279 paired reads for further chromosomal scaffolding. Transcriptome sequencing was conducted on six different tissues, including young leaf, adult leaf, leaf shoot, stem, bark, and young fruit. The total RNA from these tissues was extracted and sequenced using the Illumina Nova-Seq platform (Illumina, San Diego, CA, USA). Additionally, a merged RNA sample from all six tissues was sequenced on the PacBio Sequel II platform (Pacifc Biosciences, California, USA) for gene structure annotation.

Genome assembly and haplotype-resolved chromosome scaffolding

Initially, we estimated the genome size and heterozygosity rate of C. meiocarpa from HiFi data based on k-mer analysis. We utilized KMC (v3.2.4)26 to count the k-mer frequency at k = 23, and then submitted the results to GenomeScope2 (v2.0.1)27 with the parameter “-k = 23”. Consequently, we determined the genome size of C. meiocarpa to be approximately 3.092 Gb, with a heterozygosity rate of 2.28%.

To resolve the four haplotype sets of the tetraploid C. meiocarpa genome, we employed Hi-C integrated assembly using Hifiasm (v0.18.6) with the default parameters28. The pipeline of purge_dups (v1.2.5) was utilized to remove the redundant haplotigs29, leading to the draft contig-level assemblies with N50 values of 44.46 Mb and 41.40 Mb. After primary filtration using Fastp (v0.23.2)30, the clean Hi-C reads were aligned to the contig-level genome using BWA aligner (v0.7.18) with default parameters31. Only uniquely mapped read pairs were retained for subsequently analysis. The PCR-derived duplicates would be discarded.

Then, the ALLHiC (v0.9.8) workflow32 was applied to classify the contigs and construct the pseudo-chromosome structures. These scaffolding step was conducted to all the contigs from different haplotypes, in order to assign the contigs to correct haplotypes. The pipeline of 3D-DNA (v180419)33 and juicer (v1.6)34 was used to order and orient contigs. Juicerbox (v1.11.08)34 was employed for manual correction of assembly errors, resolving 15 pseudo-chromosomes for each of the four haplotypes (Fig. 2a; Table 1). Finally, we acquired the chromosome-level assemblies of the four haplotypes with final lengths of 2967.25 Mb for haplotype A, 3041.66 Mb for haplotype B, 2946.45 Mb for haplotype C, and 3030.54 Mb for haplotype D (Fig. 2b; Table 1).

Repetitive elements annotation

Repetitive elements were annotated using a combination of de novo and homology-based approaches. First, the LTR_FINDER_parallel35 was ran with the parameters “-threads 16 -harvest_out -size 1000000 -time 300”, while the LTRharvest (v1.62)36 used the parameters “-minlenltr 100 -maxlenltr 7000 -mintsd 4 -maxtsd 6 -motif TGCA -motifmis 1 -similar 85 -vic 10 -seed 20 -seqids yes”. The raw candidates of long terminal repeat retrotransposons (LTR-RTs) identified by LTR_FINDER_parallel and LTRharvest were refined by the LTR_retriever pipeline37. In addition, RepeatModeler (v2.0.4)38 was also used to identify and model de novo transposable element (TE) families. We employed TEcalss (v2.1.3)39 to classify the members from the library merged from the results of LTR_retriever and RepeatModeler. This de novo library was combined with the public repetitive sequence database RepBase (v20181026)40, which was analyzed with RepeatMasker (v4.1.5) and RepeatPriteinMask (v4.1.5)41 for genome-wide identifications of repetitive elements in out haplotype-resolved C. meiocarpa genome.

The analysis revealed that over 2.4 Gb of repetitive sequences were identified in both of the four haplotype sets, constituting more than 82% of the tetraploid C. meiocarpa genome (Table 2). Long terminal repeat retrotransposons (LTR-RTs) predominated among these repetitive elements, comprising 70.57%, 70.81%, 69.31%, and 70.25% of the genomes for haplotypes A, B, C, and D, respectively (Table 2). Furthermore, DNA transposons were found to constitute approximately 10% of the C. meiocarpa genome (Table 2). The substantial proportion of repetitive elements, particularly the proliferation of LTR-RTs, may underlie the ‘genome obesity’ observed in C. meiocarpa.

Gene prediction and function assignment

Gene structures were predicted using an integrated approach of transcriptome-based, ab initio, and homology-based strategies. The repetitive sequences would be masked in the haplotype-resolved genome for gene prediction. In this study, we conducted sequencing of the NGS and full-length transcriptome across six diverse tissues to inform gene prediction (Table S1). The clean NGS reads were aligned to the reference genome of four haplotypes, which encompasses four haplotypes, using Hisat2 (2.2.1) with default parameters42. Stringtie2 (v2.2.1) was then applied to assemble the NGS RNA-seq transcripts43. Furthermore, the PacBio full-length RNA-seq data underwent correction, clustering, and filtering through the IsoSeq3 program within the SMRTlink framework (Pacific Biosciences) (https://github.com/PacificBiosciences/IsoSeq). The Minimap2 aligner44 in conjunction with the cDNA_Cupcake annotator (https://github.com/Magdoll/cDNA_Cupcake) was instrumental in identifying transcripts for subsequent prediction of protein-coding genes. The transcripts from both NGS and full-length RNA-seq were consolidated using the TAMA program (v1.0)45, facilitating the prediction of open reading frames (ORFs) with TransDecoder (v5.7.0)46. The transcripts that encompassed complete ORFs were subsequently designated as transcriptome-based candidates.

For the homology-based gene prediction, we sourced protein-coding genes from five closely related species— C. oleifera6, and C. sinensis47, C. chekiangoleosa22, C. lanceoleosa23, and Arabidopsis thaliana48—to identify homologous regions within the C. meiocarpa genome using tBLASTn49. Subsequently, the Exonerate (v2.4.0) tool50 was utilized to dissect the homology findings, producing a set of homology-based gene candidates.

The gene structures of the C. meiocarpa genome were refined using Augustus (v3.5.0)51 to generate de novo gene annotations. Subsequently, the MAKER (v3.01.03) pipeline52 was employed to synthesize the annotation data from the three distinct sources, complemented by the exclusion of transposon proteins via TransposonPSI (v1.0.0) (https://transposonpsi.sourceforge.net/). Furthermore, we employed the IGV-GSAman program53 to manually refine the gene structures and GFF3toolkit54 to confirm the format of final gff3 annotation file for the four haplotypes of C. meiocarpa. For example, tandem duplicated genes with incomplete exon structures would be discarded (Fig. 3a). Additionally, genes that are specific to a single haplotype and located between pairs of syntenic genes would also be removed based on comparisons among the four haplotypes (Fig. 3b). After the manual corrections of 6,119 genes, the genomes of haplotypes A, B, C, and D were annotated with 51,336, 52,010, 51,586, and 52,631 protein-coding genes, respectively (Fig. 2b; Table 1). Analysis using OMArk and BUSCO55,56 revealed that our gene structure annotations achieved a completeness of over 96% for all four haplotypes (Table S2).

Fig. 3
figure 3

Examples of manual gene structure corrections using IGV-GSAman. (a) Incomplete tandem duplicated genes. Genes with red names are identified as tandem duplicates with incomplete exon structures relative to their upstream and/or downstream genes. These genes would be removed. (b) Simple genes without any syntenic counterparts among the four haplotypes.

For functional annotation of the protein-coding genes, we aligned the protein sequences against various databases using DIAMOND (v2.1.7)57, such as UniProt and the Non-Redundant database, resulting in GO and KOG category annotations (Table 3). We also applied InterProScan (v5.55–88.0)58 and HMMER (v3.3.2)59 to identify motifs and domains, and compared these with the KOfam and Pfam databases60,61, culminating in the acquisition of KEGG pathway annotations (Table 3). Collectively, over 94% of the protein-coding genes across all haplotypes received annotations from at least one functional database, signifying the high quality of annotations within the C. meiocarpa genome.

Additionally, non-coding RNA in the C. meiocarpa genome was annotated, encompassing miRNA, tRNA, rRNA, and snRNA (Table 1). The tRNA structure was discerned using tRNAscan-SE (v2.0.12)62, while rRNA was forecasted using RNAmmer (v1.2)63. The remaining non-coding RNAs were categorized through the INFERNAL (v1.1.4)64 and Rfam database (v14.9)65.

Haplotype clustering

Recently, the haplotype-resolved genome of tetraploid C. oleifera has been reported66, providing valuable insights into the evolution, agronomic trait development, and genetic architecture of oil Camellia plants. The genomic relationship between the tetraploid C. meiocarpa and C. oleifera could be explored based on the analyses of genomic content. Based on the LTR_retriever outcomes, we conducted clustering on the complete set of 120 chromosomes to confirm the partitions of the haplotypes of both C. meiocarpa and C. oleifera (Fig. 4). Utilizing the SSM method67, we constructed a similarity matrix for these chromosomes. Our findings reveal that the four haplotypes of C. oleifera form a distinct cluster, yet they exhibit genomic contents that are notably different from those of C. meiocarpa (Fig. 4). These results provide robust evidence supporting the classification of C. meiocarpa and C. oleifera as two separate species at the genomic level. The chromosomes of C. meiocarpa can be distinctly categorized into four haplotype groups, which implies their independent evolutionary origins. This genetic partitioning implies that the tetraploid genome of C. meiocarpa may have arisen from two distinct hybridization events, which have resulted in varied LTR similarities among the homologous chromosome sets.

Fig. 4
figure 4

Cluster analysis of the chromosomes based on LTR similarity. The tetraploid geonomes of Camellia oleifera and C. meiocarpa, are investigated here, suggesting the relationships and genomic structures between these two species.

Data Records

The PacBio HiFi long reads, Hi-C interaction data, and multi-tissue RNA-seq datasets RNA-seq data of multiple tissues have been submitted to the National Center for Biotechnology Information (NCBI) Sequence Read Archive (SRA) database under the SRA accession number SRP53126768. The final chromosome assembly has been deposited at ENA under the accession number GCA_965213565.169, as well as at National Genomics Data Center (accession GWHFIBF0000000070). The genome assembly and annotation results have also been deposited in the figshare71.

Technical Validation

The continuity of the genome assembly was evaluated by remapping NGS and HiFi genomic reads to the assembled haplotype-resolved genome (Fig. 5a). The NGS reads were aligned using BWA (v0.7.17) program31, achieving high coverage rates of over 99.9% for all haplotypes. For the HiFi reads, Minimap2 aligner44 was utilized to assess coverage at minimum depths of 1x, 10x, and 20x. The results demonstrated that more than 99.9% of the assembled genome was covered at least 1x for all four haplotypes, with approximately 95% and 83% of the genome covered at 10x and 20x depths, respectively (Fig. 5a).

Fig. 5
figure 5

Evaluations of genome assembly quality. (a) Mapping rates of NGS and HiFi reads to the assembled genomes, demonstrating the coverage of the assembled haplotype genomes. (b) BUSCO analysis results for each haplotype, indicating the completeness of assembly.

The completeness of the genome assembly was assessed using BUSCO (v5.2.2) with the embryophyta_odb10 orthologous database, which contains 1,614 conserved single-copy genes56. The BUSCO analysis indicated a completeness range of 94 to 95.2% at the chromosome level for the four haplotypes (Fig. 5b). Additionally, the LTR Assembly Index (LAI) was calculated using LTR_retriever37,72, yielding values between 12.13 and 15.49. These assessments confirm that the haplotype-resolved genome of C. meiocarpa exhibits excellent assembly quality, characterized by high completeness and continuity.

We further assessed genome completeness using Merqury (v1.3) program73 with HiFi reads, achieving a high-quality value (QV) of 75.90. Additionally, we utilized CRAQ (Clipping information for Revealing Assembly Quality)74 to generate further evaluations of the genome assembly quality. These analyses indicated an R-AQI (Reference Assembly Quality Index) of 94.67 and an S-AQI (Scaffold Assembly Quality Index) of 92.97, both of which are indicative of a high-quality genome assembly.