Introduction

Natural rubber (NR, cis-1,4-polyisoprene), an elastomer of high-molecular mass with exceptional physical and mechanical performance, is irreplaceable in many important industrial applications, such as heavy-duty airplane tires1. The world’s commercial NR supplies rely on Hevea brasiliensis (para rubber tree), a tropical tree species native to the Amazon basin. Hybridization-based rubber tree breeding has largely resolved the issues of increasing global NR demand via frequently releasing new cultivars with incremental yield improvement2. Strikingly, the considerable achievements in rubber tree breeding in almost 150 years are mainly based on the Wickham population that originated from only 22 seedlings, representing a narrow genetic diversity that urgently requires infusion of exotic germplasms for further advancement of this crop2. Moreover, the current devastating effect of a fungal disease, the South American leaf blight (SALB), on rubber plantations in the Americas and the rapid expansion of rubber tree cultivation to marginal regions necessitate the breeding strategies that incorporate aspects of SALB resistance and environmental resilience without sacrificing yield. The rich genetic variations in H. brasiliensis wild accessions, and the SALB resistance and drought tolerance in some other Hevea species, including H. nitida, H. pauciflora and H. benthamiana2,3,4 offer great opportunities for their utilization in the development of the next-generation of elite rubber cultivars.

Since the establishment of the genus Hevea by Aublet in 1775, species delineation within this genus has been a matter of dispute. For instance, the number of proposed species varied from 11 in 1874 to 24 in 1906, and then to 9 in 19705. This inconsistency can be attributed to limited surveys of natural occurrences6 and the likely clonal reproduction in the genus, which may result in the survival of inter-lineage hybrids that obscures the boundaries between phenotypic entities7. All described species within the genus until now have been established solely based on morphological and ecological characteristics observed in type specimens and a limited number of other individuals2,3. These traditional approaches lack statistical comparisons in both phenotypic and genetic distinctions across multiple individuals/populations that can accurately distinguish independently evolving lineages7. Therefore, the collation of genomic data from H. brasiliensis and other Hevea species will help identify these lineages and substantially enhance our understanding of species delineation in this genus. Furthermore, it will have a meaningful impact on breeding practices related to rubber tree cultivation using the identified wild lineages.

In the Hevea genus to date, the genomes have been de novo assembled for only one species, H. brasiliensis, including two8,9 by short-read sequencing technology, two10,11 by the hybrid of short and long-read sequencing technologies, and three12,13,14 by long-read sequencing technology. Our previous studies have suggested the importance of a REF/SRPP (rubber elongation factor/small rubber particle protein) tandem gene cluster in rubber production9.

In this study, we generate eight high-quality de novo genome assemblies from three H. brasiliensis cultivars, two H. brasiliensis wild accessions and three other Hevea species (H. nitida, H. pauciflora and H. benthamiana). Thereafter, we perform pan-genome analysis in Hevea together with the resequencing data of 94 Hevea accessions, the vast RNAseq and rubber yield data as well as the integrative analysis of multiple high-quality angiosperm and gymnosperm genomes. Our study provides compelling evidence for species delineation for the genus Hevea and the association of a large REF/SRPP gene cluster with evolution of the traits governing rubber production.

Results

Genome assembly and annotation of eight accessions from four Hevea species

To capture genome diversities in the genus Hevea, we selected eight representative accessions for de novo genome assembly. These accessions comprised three high-yielding H. brasiliensis (Hb) cultivars (RY879, RY73397 and RRIM600), two Hb wild germplasms (XJA868, medium-yielding; XJA968, low-yielding), and three other Hevea species (H. benthamiana, Hbe; H. nitida, Hn; H. pauciflora, Hp) with two of these (Hn and Hp) being highly SALB-resistant4. These three sets represent three categories of Hevea germplasms, viz. the 1876 Wickham collection, the 1981 IRRDB collection, and allied Hevea species that are currently available in the major rubber-producing countries2. A genomic survey using the k-mer-based method was conducted to estimate the genome size, the repeat content, the GC content, and heterozygosity (Supplementary Fig. 1). The genome sizes estimated for the eight accessions ranged from 1.48 to 1.58 Gb (Table 1). The size (1.50 Gb) estimated for the RY73397 reference genome was confirmed by flow cytometry that measured 1.49 and 1.50 Gb against the genomes of Zea mays and Solanum lycopersicum, respectively, as internal standards (Supplementary Fig. 2). Moreover, the genome sizes estimated here are comparable to the reported genome sizes (1.47 to 1.60 Gb) from 53 cultivated Hb germplasms determined by flow cytometry15, demonstrating little variations in genome size among Hevea accessions.

Table 1 Statistics of eight Hevea genome assemblies

We de novo assembled genomes from these eight Hevea accessions using long-read sequencing on ONT PromethION and short-read Illumina sequencing (HiSeq) with an average coverage depth of 95× and 12×, respectively (Table 1). The contig N50 sizes of the eight whole-genome assemblies ranged from 1.33 to 2.72 Mb with a mean of 2.23 Mb, substantially higher than or comparable to six out of the seven published Hb genomes, i.e., four assembled by short-read sequencing technologies8,9,10,11 and two by long-read sequencing technologies12,13, but apparently lower than the latest CATAS8-79 (synonym of RY879) genome with a contig N50 of 11.21 Mb14 (Fig. 1a; Supplementary Data 1). The final assembled sizes of our genomes, 1.49 to 1.58 Gb with a mean of 1.54 Gb, corresponded to approximately 99.1% of their estimated size on average. The high quality and completeness of our assemblies were verified by all five well-established assessment methods, viz. Illumina reads mapping, the core eukaryotic genes mapping approach (CEGMA), benchmarking universal single-copy orthologs (BUSCO), Merqury, and LTR assembly index (LAI) (Table 1). We also generated chromosome conformation capture (Hi-C) data for RY73397 (Hb), Hbe, Hn and Hp, and anchored >99.9% of their contigs into 18 pseudomolecules, with lengths ranging from 46.06 to 114.09 Mb (Fig. 1b, c; Supplementary Fig. 3; Supplementary Data 2-3). The relative sizes of 18 pseudomolecules generated for RY73397 significantly correlated with those of the 18 chromosomes photographed in the somatic cells at mitotic metaphase for the same cultivar (r = 0.985; p = 1.285E-11) (Fig. 1c, d; Supplementary Data 4). Moreover, the chromosome-level scaffolds of our eight genome assemblies of four Hevea species showed overall high collinearity with the CATAS8-79 genome14 (Supplementary Fig. 4). By contrast, the Hp genome displayed the lowest collinearity, consistent with its longest evolutionary distance from the RY879 (Fig. 1e).

Fig. 1: Assembly and annotation of eight genomes from four Hevea species.
figure 1

a Statistics for the initial contig assembly of the Hevea genomes generated here, together with the seven previously published H. brasiliensis (Hb) genomes. Red, the Hb cultivar RY73397 genome as a representative for the eight Hevea genomes assembled this study; Gray, genome of the Hb CATAS8-7914; Cyan, genome of a Hb wild accession (MT)13; Green, genome of the Hb GT115; Purple, genome of the Hb RY733979; Blue, genome of the Hb BPM2411; Orange, genome of the Hb RRIM60010; Tawny, genome of the Hb RRIM6008; Red dotted line denotes the position of contig N50. b Eighteen pseudomolecules scaffolding with Hi-C data for four Hevea species: Hb (RY73397), Hp, Hbe and Hn. c Genome features depicted for Hb and H. pauciflora (Hp) across the eighteen chromosomes. Units on the circumference (track I) are in megabase (Mb) values. Tracks II, gene density (range 1-128 per Mb); III, Gypsy (blue) and Copia (red), respectively (range 0-13 per Mb); IV-VI, gene expressions in leaf (FPKM range 0-11,115 per gene), bark (FPKM range 0-32,014 per gene) and latex (FPKM range 0-139,158 per gene); VII, GC content (range 30%–39% per Mb); VIII, synteny between the genomes of Hb and Hp. d Photograph of Hb RY73397 somatic chromosomes. Scale bars=10 μm. The chromosome photographs of five independent cells at metaphase were selected, and one representative was shown here. e Phylogenetic tree of eight Hevea accessions and eight representative plant species, viz. Manihot esculenta, Jatropha curcas, Ricinus communis, Populus trichocarpa, Arabidopsis thaliana, Carica papaya, Vitis vinifera and Oryza sativa. f Schematic representation of the karyotypes for Hb. Eudicots are assumed to have evolved from a paleo-hexaploid ancestor formed from three diploid ancestors. The seven-color code illustrates the seven protochromosomes in any of three diploid ancestors18. Based on the orthologous gene blocks shared with ancestral core eudicots karyotypes (ACEK)19, the karyotype pictures of H. brasiliensis and the other four species are depicted according to the example files of WGDI. The sequence of M. esculenta chromosomes is adjusted to that of Hb based on the similarity in the karyotypes.

Repetitive DNA accounted for 81.8% of each genome, ranging from 81.0% to 82.1% (Table 1), which was higher than for the previously reported Hb genomes8,9,10,11,12,13,14. Among the repetitive sequences, Copia and Gypsy long terminal repeat (LTR)-retrotransposons were the most abundant, covering an average of 69.8% in each genome (Supplementary Data 3). A mean of 43,938 protein-coding genes, 8629 pseudogenes, 195 microRNA (miRNA), 274 ribosomal RNA (rRNA), and 615 transfer RNA (tRNA) genes per genome were predicted (Supplementary Data 3). Over 97% of the predicted protein-coding genes showed homology to the genes or proteins in at least one of the five public databases (NCBI-nr, GO, KEGG, KOG and TrEMBL) (Supplementary Fig. 5; Supplementary Data 5).

High similarities were observed across the 18 chromosomes between Hb and its most divergent sequenced species relative, Hp, as shown in chromosome size, gene density, Gypsy and Copia distribution, gene expression in three tissues (leaf, bark and latex), GC content, and synteny (Fig. 1c). To further explore the relationships among the eight accessions of the four Hevea species, a phylogenetic tree was constructed using a stringent set of 199 single-copy orthologues (Supplementary Data 6) from these Hevea accessions and other eight representative plant species with their divergence age calibrated based on two fossils (the monocot-eudicot and Arabidopsis-papaya split times)16,17. The phylogenetic tree revealed that all sampled Hevea accessions started to diverge around 3.77 million years ago (MYA) (Fig. 1e). To understand the evolutionary process of Hevea chromosomes and facilitate later phylogenomic analysis, a schematic representation of the karyotypes was deduced as previously described18,19 for Hb and four eudicot species, viz. V. vinifera, A. thaliana, P. trichocarpa, and M. esculenta. A high similarity in karyotype was observed between Hb and its close relative M. esculenta (Fig. 1f) in the family Euphorbiaceae.

Distinct genetic lineages in the sampled Hevea accessions

The crown age of the genus Hevea at 3.77 MYA (Fig. 1e) is consistent with the onset of species/lineage diversification for most genera that took place predominantly within five MYA20,21,22,23,24,25,26,27,28,29,30,31.

To further discern genetic lineages and genetic diversity for the Hevea genus across multiple sampled accessions, we further exploited the resequenced genomes of 94 Hevea germplasms, including 25 Hb cultivars, 62 Hb wild accessions and seven from other species or variants (Supplementary Data 7) to identify high-confidence nuclear and chloroplast DNA variants. Hb cultivars originated from a limited number of the Wickham germplasms back in 1876, while Hb wild accessions from the Brazilian states of Acre (AC), Rondônia (RO) and Mato Grosso (MT) were gathered by the International Rubber Research and Development Board (IRRDB) in 1981 (Fig. 2a). Based on the nuclear variants (6,272,766 SNPs and 264,425 Indels), a total of 94 Hevea accessions (including eight de novo genomes) could be roughly clustered into five main lineages using phylogenetic and admixture analyses (Fig. 2b, Supplementary Fig. 6). Similar results were obtained using principal component analysis (PCA) (Fig. 2c). Lineage 1 consisted mostly of the Hb cultivars and H. benthamiana, lineages 2 to 4 corresponded roughly to the wild accessions of MT, RO and AC, respectively, and lineage 5 contained two other Hevea species (H. pauciflora, Hp; H. sprueana, Hs) and three H. nitida (Hn) variants (HnV5, 7 and 8). Notably, many accessions showed genomic admixture with hybrid origins from these five lineages, especially the sampled accessions of two species of Hn and Hp, showing the genetic admixtures with hybrid origins between pure Hs and Hb cultivars or wild Hb lineages, respectively. Moreover, the PCA using genome-scale nuclear variations similarly suggest hybrid origins of these accessions due to their intermediate populations between five genetic groups (Fig. 2c) although maternally inherited chloroplast DNA variants (393 SNPs and 234 Indels) grouped them into five lineages (Fig. 2d). Furthermore, the genetic tree of Hevea germplasms with mixture events were inferred by TreeMix (http://treemix.googlecode.com) (Fig. 2e), and strongly supported and complemented the results of Admixture (Fig. 2b). From the inferred tree, we detected widespread gene flow among the five lineages, and revealed that the Hb cultivar lineage (Cul) received genetic introgression from two other wild lineages and also suggested natural genetic exchanges between wild lineages (Fig. 2e).

Fig. 2: Population genetic and phylogenetic analyses in the Hevea genus.
figure 2

a Origin, domestication and dispersal of Hevea germplasms. The Hevea genus is native to South America, and its germplasm then spread to South Asia, Southeast Asia and West Africa through two major introductions, the Wickham germplasm in 1876 (blue arrow) and the IRRDB germplasm in 1981 (orange arrow). The current annual yields of natural rubber in producer countries are displayed in colors of light (low yield) to dark green (high yield). b Phylogenetic tree (left panel) and ADMIXTURE plot (right panel) for 94 Hevea accessions, showing the distribution of K = 5 genetic clusters (for other K-values, see Supplementary Fig. 6). The phylogenetic tree construction and admixture analysis based on the nuclear DNA variants (SNPs and Indels) from 25 H. brasiliensis cultivars (Cul), 61 wild germplasms (AC, RO and MT) and 7 other Hevea species (Hbe, Hn, Hp and Hs) or variants (HnV), with the 16 germplasms resequenced in this study appearing on the right. The three Brazilian states (Acre, AC; Rondônia, RO; Mato Grosso, MT) where the H. brasiliensis wild germplasms were gathered, and the approximate locations of other Hevea species are as shown in the inset of Fig. 2a. c PCA plot of H. brasiliensis cultivars, wild germplasms and other Hevea species or species variants based on nuclear DNA variants. d PCA plot of H. brasiliensis cultivars, wild germplasms and other Hevea species or species variants based on chloroplast DNA variants (SNPs and Indels). e Inferred tree of Hevea germplasms with mixture events. Plotted is the structure of the graph inferred by TreeMix (http://treemix.googlecode.com) for H. brasiliensis cultivars, wild germplasms and other Hevea species or species variants, allowing four migration events. Migration arrows are colored according to their weight. Horizontal branch lengths are proportional to the amount of genetic drift that has occurred on the branch. The scale bar shows ten times the average standard error of the entries in the covariance matrix. Source data are provided as a Source Data file.

We observed a significant reduction of genetic diversity in the Hb cultivars (Hb-cul) (π = 6.0 × 10−3) relative to the Hb wild accessions (Hb-wild) (π = 8.3 × 10−3) and other Hevea species or variants (π = 6.9 × 10−3) (Supplementary Fig. 7). Significant reduction of genetic diversity in cultivar vs. wild populations is also common in other crops32,33. However, the ratio of π wild/π cultivar (1.38) in Hb was apparently smaller than that reported in other crops32. This might be due to the short domestication history of rubber tree, merely 150 years with four breeding generations so far as well as the clonal propagation2 that retained the genetic variations of the initial selected variants.

Pan-genome analysis of eight accessions from four Hevea species

We further performed pan-genome analyses for the eight de novo assembled genomes of four morphologically defined Hevea species using the strategies reported earlier34,35,36. All genes from the eight genomes were classified into 42,967 families by ortholog investigation. The total gene sets grew with increasing number of genomes and reached a plateau when n equaled 7 (Fig. 3a). The number of genomes required for a representative in the genus Hevea is much smaller than that reported in other crops, e.g., 25 in soybean35 and 67 in rice37, which might reflect a slower rate of differentiation or a limited sampling that has so far been utilized for de novo genome assembly in Hevea. Of the total gene sets, 25,171 families that resided in all eight accessions were defined as core genes, 4,959 families in seven accessions as softcore genes, 12,437 families in two to six accessions as dispensable genes, and 400 families in only one accession as private genes (Fig. 3b; Supplementary Data 8). The core, soft-core, dispensable and private genes covered an average of 73.2%, 12.2%, 14.3% and 0.3% of the pre-clustering genes in individual accessions, respectively (Supplementary Data 9). The total of core and softcore gene families that accounted for 70.1% of the total gene sets in the eight accessions comprised an average of 85.4% of the pre-clustering genes in individual accessions (Fig. 3b–d; Supplementary Data 9). In all three examined tissues of latex, leaf and bark, the expressional rankings were broadly the same for the four gene compositions: core > softcore > private > dispensable (Fig. 3e). Kyoto Encyclopedia of Genes and Genomes (KEGG) pathway analyses revealed contrasting enrichment in pathways between core and dispensable genes (Supplementary Data 10). Of the top 20 most enriched pathways under the class “A09100 Metabolism” (Supplementary Data 10), no overlaps were observed between core and dispensable genes (Fig. 3f). The core genes were mainly enriched in primary metabolism, whereas dispensable genes were mostly in secondary metabolism, such as biosynthesis of phenylpropanoids, flavonoids, isoprenoids, and antibiotics. In contrast, substantial overlaps existed among the softcore, dispensable and private genes, all of which exhibited a large proportion of enrichment in secondary metabolism (Supplementary Fig. 8; Supplementary Data 10). These results indicate that the core genes may play important roles in basic metabolism, whereas the softcore, dispensable and private genes may contribute to the divergence of environmental fitness among Hevea accessions. In previous crop pan-genome studies34,35,36,37, similar findings have been achieved in regard to the variations in functional enrichment among different gene compositions.

Fig. 3: Pan- and core-genome analyses of eight Hevea accessions.
figure 3

a Variation of gene families in the pan- and core-genomes along with an additional Hevea genome. b Compositions of the pan-genome and individual genomes. The histogram shows the number of gene families in the eight genomes with different frequencies. Pie shows the proportion of the gene family marked by each composition. c Presence and absence of information for pan gene families in the eight Hevea genomes. d Gene number of each composition in individual genomes. e Expression level of the genes in each composition represented by FPKM for three tissues, viz. latex, bark, and leaf. Statistical differences were analyzed by one-way ANOVA with Tukey’s HSD test, and are indicated by the lower-case letters. The bold line in the box shows the median, box edges represent the 25th and 75th percentiles, vertical lines represent the maximum and minimum data points within 1.5× interquartile range outside box edges, solid circles represent the outlier data. f Kyoto Encyclopedia of Genes and Genomes (KEGG) pathway analyses of core and dispensable gene families. Significance was tested by two tailed Fisher’s exact test method. Source data and p values are provided as a Source Data file.

To discover the genomic variations among Hevea accessions, we anchored the genome sequences of Hbe, Hp and Hn onto the Hb (RY73397) genome to identify single nucleotide polymorphisms (SNPs), small insertions/deletions (Indels) and five types of structural variations (SVs), i.e., presence/absence variation (PAV; >50 bp insertion or deletion), translocation, inversion, duplication, and copy number variation (CNV) (Supplementary Data 11). Consistent with the phylogenetic divergence of Hp from the other Hevea accessions (Fig. 1e), the Hp genome had a higher number of SNPs but fewer small Indels compared to the genomes of Hn and Hbe; moreover, the Hp genome constituted a much greater number and length of all five types of SVs, which was comparable to or even greater than those of Hn and Hbe in combination. Most of the PAVs had a length of <2 kb, the inversions mainly ranged from 10 to 200 kb, the translocations were generally from 1 to 12 kb (Supplementary Fig. 9), and the CNVs varied from 2 to >10 with a prevalence of 2 and 3. The PAVs collectively constituted an average of 16 Mb per genome, accounting for only ~1.07% in each Hevea accession.

REF/SRPP expansion and divergence is crucial to rubber trait evolution

The participation of laticifer-predominant REF/SRPP proteins in rubber production has been backed up by a number of in vitro rubber biosynthesis assays38,39,40 and genetic studies41,42. Nevertheless, more work is needed to understand, especially from an evolutionary perspective, the laticifer-predominant divergence of the REF/SRPP gene family and its roles in rubber production, a supposed trait of defense against herbivory43 and attack by boring insects. Our previous work also provided clues to the association of expansion and divergence of the REF/SRPP gene family with rubber production in Hb9. Here, its relevance with the trait of rubber production is reinforced by its being the sole gene family expanding within the rubber biosynthesis pathway in four Hevea species (Hb, Hp, Hbe and Hn) vs. two non-rubber-producing species, cassava, the closest species so far known to the genus Hevea in Euphorbiaceae, and Arabidopsis (Supplementary Fig. 10). However, due to limitations of technology at the time, our previous investigation9 of REF/SRPP gene numbers was based on the genomes of a limited number of plant species, most of which were assembled at a middling level. Now many high-quality genomes are available from a wider range of plant species owing to the development of long-read sequencing technologies44,45. Therefore, a deeper phylogenomic assessment of the REF/SRPP gene family in rubber trait evolution becomes feasible using recently published genomes together with the eight de novo assembled Hevea genomes herein.

To achieve an overall representation of seed plants, a total of 152 high-quality genomes with ~80% of them being released in the past five years, covering 48 orders and 97 families of both angiosperms and gymnosperms, were downloaded and searched for REF/SRPP genes (Supplementary Data 12). In the family Euphorbiaceae, the REF/SRPP gene family seems to be expanded only in the genus Hevea, having 14 or higher gene copies compared to only 2-7 copies in other species. Moreover, the number of REF/SRPP genes in the genomes of the Hevea genus is significantly higher than those in most angiosperms or gymnosperms (Fig. 4a; Supplementary Data 12). Even though six species outside the genus Hevea did harbor comparable numbers of REF/SRPP gene copies (12 to 33) (Supplementary Data 12), these REF/SRPP genes were scattered across their genomes, whereas in the genus Hevea the vast majority of the REF/SRPP genes were located on a large tandem gene cluster of >10 copies spanning a genomic region of <300 kb (Supplementary Fig. 11a, b). Interestingly, the genomic region of chr02 flanking the REF/SRPP cluster in H. brasiliensis RY73397 was highly collinear with chromosome 14 of the grapevine (V. vinifera), a representative palaeo-hexaploid ancestral dicot (Fig. 4b, c). More intriguingly, such collinearity resided across a wide range of plant lineages (Fig. 4d), and substantial collinearity was observed even in the most ancient extant angiosperm, Amborella trichopoda46. However, within these collinear genomic regions, large REF/SRPP clusters emerged only in the accessions of the genus Hevea and other rubber-producing plants, e.g., two dandelion species (Taraxacum kok-saghyz and T. mongolicum)47 and lettuce48 (Fig. 4d; Supplementary Fig. 11c, d). Only one exception came from the genome assembly of a wild H. brasiliensis accession (WMT)13, which did not contain a large REF/SRPP cluster featured in Hevea (Supplementary Fig. 11a). Nevertheless, this WMT genome assembly was highly collinear with the RY73397 genome in the REF/SRPP cluster-harboring regions (Supplementary Fig. 12a), indicating either the uniqueness of this accession or, more probably, the incorrect assembly of the REF/SRPP cluster genes (Supplementary Fig. 12b).

Fig. 4: Analysis of the REF/SRPP gene cluster in Hevea.
figure 4

a Copy number of REF/SRPP genes in 152 plant genomes, including 11 from four Hevea species (Hevea), 7 from seven Euphorbiaceae genera other than Hevea (Eup), 122 from 90 angiosperm families other than Euphorbiaceae (Ang), and 12 from six gymnosperm families (Gym). Solid circles represent the gene number, and see Fig. 3e legend for box blot details. Significance was determined using a two tailed t-test. b Diagram of the 190-kb genomic REF/SRPP cluster in H. brasiliensis RY73397 chr02 (middle panel). Assembly authenticity is validated by RNA-seq (upper panel) of three tissues and genomic (lower panel) reads mapping. Red solid boxes depict the SRPP/REF gene models, and blue and green boxes depict other genes. The SRPP/REF gene models 1-13 are SRPP11, SRPP9, SRPP8, SRPP1, REF7, SRPP3, REF3, REF4, REF1, RFE9, REF8, REF5, and SRPP5, respectively. See supplementary Data 13 for details. c The genomic region flanking the Hb REF/SRPP cluster in chr02 is highly collinear with regions in another Hb chr07 and the V. vinifera chr14. The chromosome karyotypes are plotted with JCVI, and grey lines indicate background collinearity. In the close-up view, the REF/SRPP genes are shown in red boxes. d Collinearity plots of the REF/SRPP cluster locus from 11 representative plant species, i.e., Hb, Hbe, Hn, Hp, M. esculenta (Me), R. communis (Rc), J. curcas (Jc), P. trichocarpa (Pt), A. thaliana (At), V. vinifera (Vv) and A. trichopoda (Atr). Red boxes indicate the REF/SRPP orthologues in each species. Collinear gene pairs were shown in red (REF/SRPPs) or grey (other genes). e Postulated evolution of the REF/SRPP cluster genes (right) according to their phylogenic tree (left) and genomic locations as in b. The cassava orthologue Manes.09G170000.1.v7.1 as shown in d was used as the outgroup. f log2FPKM expressions of the REF/SRPP cluster genes in latex (Lx), leaf (Lf), and bark (Bk) of four Hevea species. Gray circles indicate the absence of corresponding genes. g Latex yield dynamics with successive tapping of virgin rubber trees. Results are the means ± SD from 3 biological replicates. h Expression dynamics of the top four laticifer-dominant REF/SRPP isoforms in three cultivars across the first, third, fifth, seventh, and ninth tappings. Two-tailed Pearson’s correlations (insets) were calculated between yield and gene expression with the confidence interval 95%. Source data are provided as a Source Data file.

The large REF/SRPP gene clusters specific to rubber-producing plants prompted us to further explore the roles of such a tandem duplication-elicited event in rubber trait evolution. The phylogenic tree and genomic locations allowed us to deduce the evolutionary path of the REF/SRPP gene cluster genes post the Hevea-cassava split, revealing that the 13-gene cluster might have been created by 12 rounds of tandem gene duplication from a REF/SRPP ancestor (Fig. 4e). Intriguingly, all the laticifer-predominantly expressed REF/SRPP isoforms resided in the gene cluster (Fig. 4f; Supplementary Data 13). Among the 10 to 15 REF/SRPP cluster genes in the accessions of four Hevea species, 3 to 5 exhibited FPKM values of >3000 in the latex (rubber particle bearing cytoplasm of laticifers) that was >20-fold higher than that in the other tissues, and had cumulative values varying from 37,826 in a H. brasiliensis wild accession (XJA968) to an incredible value of 191,283 in H. nitida (Supplementary Data 13). Outside the REF/SRPP cluster, none of the REF/SRPP genes exhibited comparable patterns of laticifer-predominant expression. In their native rainforest habitat, Hevea trees are repeatedly afflicted by fungal pathogens and borer insects, and the production of rubber-containing latex is believed to be an important defense mechanism evolved in Hevea43,49,50,51,52. When virgin (i.e., previously untapped) Hevea trees are first brought to latex harvesting, the latex output increases conspicuously with successive tappings, a mechanical wounding treatment to harvest latex that could mimic the process of repeated attacks by the bark borer insects in the rainforest53. The genes involved in tapping-stimulated latex production are postulated as candidate participants in defense mechanisms of Hevea. In the present study, mature virgin trees of three Hevea cultivars were exploited to investigate the roles of laticifer-predominant REF/SRPP genes in the stimulation of latex output. Consistent with our previous results54,55, the latex yield of three cultivars increased gradually after successive tapping, reaching a plateau at the seventh tapping with a relatively steady latex production thereafter (Fig. 4g). Correspondingly, expressions of the four laticifer-predominant REF/SRPP genes, REF1, REF3, REF7 and SRPP1, were stimulated by the tapping treatment (Supplementary Data 14). Furthermore, significant correlations were observed between the top two REF/SRPP genes (REF1 and SRPP1) and the latex yields in all three cultivars (Fig. 4h).

Discussion

Broadening the genetic basis for rubber breeding is urgently needed to meet the increasing global NR demand under changing environmental conditions. The pan-genome dataset of the eight high-quality genome assemblies from this study, which includes other morphologically delineated Hevea species besides H. brasiliensis, will greatly facilitate future in-depth functional genomic studies for various research purposes and breeding practices.

Here, we provide genomic evidence for the taxonomically disputed genus Hevea. Through sampling 94 individuals of five species, we identified five distinct genetic lineages that did not align well with their morphological delineation (Fig. 2). For example, H. benthamiana exhibited the same genomic composition as the widely cultivated H. brasiliensis lineage (Fig. 2b). Notably, we found evidence of genomic admixture for many accessions between five relatively pure lineages (four H. brasiliensis lineages and one H. sprueana lineage), suggesting the widespread gene flow and incomplete reproductive isolation between them (Fig. 2b–d). Three sampled H. nitida individuals (V5, V7 and V8) showed obvious genomic admixtures between H. sprueana and the H. brasiliensis cultivar lineage. Similarly, the investigated H. pauciflora individual displayed genomic admixtures from two H. brasiliensis lineages and H. sprueana. Out of the four wild Hevea lineages identified here, only two have been introduced into the cultivated H. brasiliensis accessions, with the resultant genomic admixture (Fig. 2b). These findings indicate that further investigations are required to delineate Hevea species. Such an operation should involve genomic sampling of more individuals for each previously delineated species56, as well as the examination of both morphological units and reproductive barriers at the population level7,57. Given the incomplete reproductive barriers between these Hevea lineages and clonal reproduction of para rubber, our study suggests that the wild lineages should be further explored and utilized in the future to enhance the genetic diversity of this crop and develop more productive and widespread-resistant cultivars.

Our study provides high-quality genomes of eight Hevea accessions, along with extensive transcriptome datasets, which offer valuable insights into the evolutionary process of rubber traits that has a defense role against insect borers43,52. The emergence of a large REF/SRPP cluster and its specialization in laticiferous cells play a crucial role in this evolution (Fig. 4; Supplementary Fig. 11; Supplementary Data 12 to 14). In mature trees that are regularly tapped, the commercial rubber yield depends on the duration of latex flow after tapping and latex regeneration before the next tapping53. The abundant expression of laticifer-predominant REF/SRPP isoforms, e.g. REF1 and SRPP1 is necessary for this process38,39,40 and is efficiently facilitated by the tapping procedure itself (Fig. 4g, h). However, the REF/SRPP genes have not been among the candidate domestication genes identified by genome-wide association studies13,14. This can be reasoned by the Liebig’s barrel (Liebig’s law of the minimum), where the high abundance of REF/SRPP transcripts in Hevea laticifers is essential to rubber production, but yet this can be easily achieved after a few of latex harvesting tappings (Fig. 4h), and thus, does not consist of one of the shortest staves that affect the commercial latex yield. In conclusion, our high-quality pan-genome assemblies and the findings of this study would significantly benefit both the rubber breeding and research communities.

Methods

De novo genome sequencing

All eight de novo sequenced Hevea accessions in this study were sourced from the National Rubber Tree Germplasm Repository in Danzhou, Hainan, China. These accessions included three H. brasiliensis (Hb) cultivars (RY73397, RY879, and RRIM600), two wild accessions (XJA868 and XJA968), and three other Hevea species (H. pauciflora, Hp; H. benthamiana, Hbe; H. nitida, Hn). All the trees of these accessions were 12 years old, and had been tapped for four years in a system of half spiral, every three days (S/2, d/3), without ethephon stimulation. Leaves of three trees of each accession were harvested at 10 a.m., and snap frozen in liquid nitrogen for gDNA extraction.

gDNA was ultrasonicated to a fragment size of 500 bp and the library was then prepared using a NEB Ultra DNA library prep kit (NEB, UK) according to the manufacturer’s instructions. The resulting library was qualified using Agilent 2100 and qPCR methods. Sequencing of the library was performed on HiSeq X ten using the run configuration 2 × 350 bp. The generated reads were trimmed using fastq_quality_trimmer in the FASTX-Toolkit (ver. 0.0.11) with default parameters. An average of 74 Gb (~50×) data for each accession was obtained.

A total of two μg of gDNA were processed using the NEB Next FFPE DNA Repair Mix kit (M6630, USA) and subsequently processed using the ONT Template prep kit (SQK-LSK109, UK) according to the manufacturer’s instructions. The large segments library was premixed with loading beads and then pipetted into a previously used and washed R9 flow cell. The library was sequenced on the ONT PromethION platform with Corresponding R9 cell and ONT sequencing reagents kit (EXP-FLP001.PRO.6, UK) according to the manufacturer’s instructions.

A Hi-C library was constructed from young leaves of four Hevea accessions (RY73397, Hbe, Hn and Hp) using the kit from BioMarker Technologies, Inc. (Beijing, China) according to the manufacture’s manual58. Qualified libraries were sequenced on the NovaSeq X plus platform. A total of 2.4 billion 150 bp paired-end reads were produced on the Illumina HiSeq X ten platform.

Genome assembly and evaluation

The genome size was estimated using K-mer analysis. We counted the 17-kmers with Jellyfish59 and calculated the genome characteristics using Genomescope60 based on the high-quality Illumina PE reads.

The clean ONT data were further corrected using Canu software, and then assembled using SMARTdenovo software. To further improve the accuracy of reference assembled contigs, three-step polishing strategies were performed. First, three rounds of correction were made using Racon61 software with the ONT long reads. Then, three rounds of correction were carried out using high-quality Illumina reads with Pilon62 software. Finally, the genome was subjected to redundancy removal using Purge Haplotigs (parameters: l = 3, m = 41, h = 100).

The assembly was first validated by comparing the Illumina short reads to the reference assembly using SAMtools; more than 99.6% were successfully mapped back to the final assembly in all accessions. We also assessed the continuity and completeness of assembly by CEGMA, BUSCO (version 4.1.4), LAI63, and Merqury64.

Hi-C technology help anchor contigs

For anchoring contigs, ~600 million clean read pairs for each of the four accessions (RY73397, Hp, Hbe, Hn) were generated from the Hi-C library, and were mapped to our assembly using BWA (bwa-0.7.17) with the default parameters. Paired reads with mate mapped to a different contig were used to construct the Hi-C associated scaffolding. Self-ligation, non-ligation and other invalid reads, such as Start NearRsite, PCR amplification, random break, and too large or small fragments, were filtered. As a result, 99.5% of the contigs were successfully clustered into 18 groups with the Lachesis agglomerative hierarchical clustering method65. Lachesis was further applied to order and orient the clustered contigs, and 94.5% were successfully ordered and oriented. Finally, we completed the chromosomal-level high-quality assembly for four Hevea species, with chromosomal lengths from 46 Mb to 109 Mb containing more than 95% of the assembly sequence. The contigs of the other four genomes without Hi-C data were assembled into chromosomes using the RagTag software (https://github.com/malonge/RagTag) with the RY73397 genome as the reference. For comparison, 3D-DNA66, another popular Hi-C scaffolder, was also used for Hi-C-based scaffolding of the four genomes, and resulted in high-quality assemblies too (Supplementary Fig. 13). In addition, the collinearity comparison revealed a very good consistency between the two versions of genome assemblies (Supplementary Fig. 14; Supplementary Data 15-16).

Genome annotation

Transposable elements (TEs) and tandem repeats were annotated by the following workflows. TEs were identified by a combination of homology-based and de novo approaches. We first customized a de novo repeat library of the genome using RepeatModeler (http://www.repeatmasker.org/RepeatModeler/)67, which automatically executed two de novo repeat finding programs, including RECON and RepeatScout. Then, full-length long terminal repeat retrotransposons (fl-LTR-RTs) were identified using both LTRharvest (v1.5.9)68 and LTR_finder69 (v2.8). The high-quality intact fl-LTR-RTs and non-redundant LTR library were then produced by LTR_retriever70. A non-redundant species-specific TE library was constructed by combining the de novo TE sequence library above with the known Dfam (v3.5) database. Final TE sequences in the genomes were identified and classified by homology search against the library using RepeatMasker71 (v4.12). Tandem repeats were annotated by Tandem Repeats Finder72 and the MIcroSAtellite identification tool73 (MISA v2.1). Non-coding RNAs including miRNA, rRNA and tRNA were predicted. Briefly, Rfam database and miRBase database together with Infenal 1.1 (1e-5) were respectively used to predict the rRNA and microRNA. The tRNA sequences were recognized using tRNAscan-SE v1.3.1 with option “-E -H”. For pseudogene annotation, the GenBlastA74 (v1.0.4) program was used to scan the whole genomes after masking predicted functional genes. Putative candidates were then analyzed by searching for non-mature mutations and frame-shift mutations using GeneWise75.

We integrated three approaches, namely, de novo prediction, homology search, and transcript-based assembly, to annotate protein-coding genes in the genome. The de novo gene models were predicted using two ab initio gene-prediction software tools, Augustus76 (v3.1.0) and SNAP. For the homolog-based approach, GeMoMa (v1.7) software was used with reference gene models from O. sativa, A. thaliana and H. brasiliensis. For the transcript-based prediction, RNA-sequencing data were mapped to the reference genome using Hisat77 (v2.1.0) and assembled by Stringtie78 (v 2.1.4). GeneMarkS-T (v5.1) was used to predict genes based on the assembled transcripts. The PASA (v2.4.1) software was used to predict genes based on the unigenes (and full-length transcripts from the PacBio sequencing) assembled by Trinity79 (v2.11). Gene models from these different approaches were combined using the EVM software (v1.1.1) and updated by PASA. In total, 351,502 protein-coding genes with an average length of 4253 bp were predicted in the assembled eight Hevea genomes. Gene functions were inferred according to the best match of the alignments to the National Center for Biotechnology Information Non-Redundant (NCBI-nr), EggNOG, KOG, TrEMBL, InterPro and Swiss-Prot protein databases using diamond blastp (diamond v2.0.4.142) and the Kyoto Encyclopedia of Genes and Genomes (KEGG) database with an E-value threshold of 1E-5. The protein domains were annotated using InterProScan (v5.34-73.0) based on InterPro protein databases. The motifs and domains within gene models were identified by PFAM databases. Gene Ontology (GO) IDs for each gene was obtained from TrEMBL, InterPro and EggNOG. In total, approximately 34,065 (about 97%) of the predicted protein-coding genes could be functionally annotated with known genes, conserved domains, and Gene Ontology terms.

Evolutionary analyses

The MCMCTree program in the PAML80 (v.4.9) package was used to estimate the divergence time. Single-copy genes from 12 representative species, including O. sativa, V. vinifera, C. papaya, A. thaliana, P. trichocarpa, R. communis, J. curcas, M. esculenta, three Hb cultivars (RY73397, RY879 and RRIM600), two wild accessions (XJA868 and XJA968), and three other Hevea species (Hp, Hbe and Hn) were selected for a rough estimate of the substitution rate, with the divergence time of A. thaliana-C. papaya (68-72 MYA)16 and monocot-eudicot (120-140 MYA) (http://timetree.org/) used as calibrators.

Syntenic analysis

To reconstruct the karyotype evolution model of H. brasiliensis, we compared the genomes of A. thaliana, P. trichocarpa, M. esculenta and H. brasiliensis with that of V. vinifera, and inferred the chromosome composition of each species according to their collinear relationships. In detail, first, to identify syntenic blocks, we performed an all-against-all LAST81 and connected the LAST hits at a distance cut-off of 20 genes while requiring at least five pairs for each syntenic block using JCVI82. Then, we obtained an anchor file containing the homologs identified via LAST. According to the position of V. vinifera genes on the ancestral chromosome karyotype, combined with the collinear relationship between V. vinifera and other analyzed species, we could infer which ancestral chromosome the genes of analyzed species were on. The final visualization of the karyotype result was achieved through WGDI or the graphics module of JCVI with parameter --seed 14 to lock color.

Pan-genome analysis

The pan-genome analysis was conducted using the gene-based approach. We used protein sequences of all eight Hevea accessions to identify ortholog groups shared by all proteomes, among different numbers of proteomes and unique to each proteome using OrthoFinder83. The core genome was defined as orthogroups present in all accessions, the sofcore was defined as orthogroups shared in seven accessions, while the dispensable genome was defined as orthogroups shared in 2 to 6 accessions. The private genome was defined as orthogroups unique to each isolate.

Resequencing

DNAs from leaf tissue of an additional 11 Hevea accessions, trees of which were sourced as the aforementioned de novo sequenced Hevea accessions in regard to ___location, age, tapping history and sampling. The samples were used for the construction of 500-bp paired-end libraries, and were then sequenced using the Illumina HiSeq2000 platform. In total, 812.9 Gb of data were obtained. Ninety-four accessions, including 17 sequenced in this study and 77 downloaded from the BIG database, were subjected to subsequent genome resequencing analysis.

Resequencing data for the 94 accessions were mapped to the RY73397 genome using BWA84 mem (v.0.7.17). The mapped reads were then sorted according to genomic coordinates using Samtools85. Sequence data generated from different libraries were combined using ‘samtools merge’, after which duplicates were removed using Picard86 (v.2.5.0). HaplotypeCaller from the GATK87 (v.3.8) was then used to call individual-specific gvcf files. Finally, the GenotypeGVCFs was used for joint calling of SNPs. After quality control of the SNPs using bcftools88, the SNPs were hard filtered using GATK VariantFiltration (DP < 300 | | DP > 3000 | | QD < 2.0 | | FS > 60.0 | | MQ < 40.0 || MQRankSum <−12.5 || ReadPosRankSum <−8.0), and only biallelic SNPs were selected for further analysis.

We carried out analysis of admixtures using the ADMIXTURE89 software (v.1.3.0). Before analysis, SNPs in LD were dropped out in Plink90 using a sliding window of 50 SNPs with 10 SNP step size, SNPs with R2 > 0.1 were filtered out, these settings being as recommended in the ADMIXTURE user manual (–indep-pairwise 50 10 0.1). The optimal number of populations (K) was selected using tenfold cross-validation.

Gene expression analysis

RNAseq data were generated for the three tissues of latex, leaf, and bark from the eight aforementioned de novo sequenced accessions of four Hevea species, and for the latex samples from the tapping treatment of three virgin (previously untapped) Hb cultivars, RY879, RY72059 and PR107 at the first, third, fifth, seventh, and ninth tappings. For de novo sequenced Hevea accessions, the tissues of latex and bark were harvested at 6-7 a.m., while the leaves were sampled as did for de novo genome sequencing. For the tapping treatment, eight-year-old trees of the three Hb cultivars were growing in neighboring fields at the Experimental Farm of the Chinese Academy of Tropical Agricultural Sciences in Danzhou, Hainan, China. Three replicates of ten trees with similar girth were selected for each cultivar. All harvested tissues were snap frozen in liquid nitrogen for RNA extraction and sequencing. Paired-end reads mapped by Bowtie291 were used for quantifying transcript abundance by RSEM92 software. The fragments per kilobase per million mapped reads (FPKM) values of expression genes were calculated.

REF/SRPP genes

A total of 152 quality genomes, covering 48 orders and 97 families of angiosperms and gymnosperms, were downloaded and searched for the REF/SRPP genes using BLASTP93 (cutoff E value < 1 × 10−5). Hmmscan was further used to search for the Pfam REF (PF05755) ___domain (cutoff E value < 1 × 10−5) (Supplementary Data 12). To further verify the accuracy of sequence assembly where the REF/SRPP gene cluster was located on the chr02 of RY73397 genome, both RNA-seq and genomic ONT/CLR reads were mapped against this genomic fragment using BWA94 and Minimap295. To postulate evolution of the REF/SRPP cluster genes, CDS sequences of 13 REF/SRPP cluster gene and one M. esculenta REF/SRPP Manes.09G170000.1.v7.1 were aligned using MAFFT96 v7.475, and phylogenetic analysis was performed with FastTree97 v2.1.11. Collinearity of the REF/SRPP gene cluster-containing loci was investigated across 11 representative plant species, i.e., Hb, Hbe, Hn, Hp, M. esculenta (Me), R. communis (Rc), J. curcas (Jc), P. trichocarpa (Pt), A. thaliana (At), V. vinifera (Vv) and A. trichopoda (Atr), using MCScanX98 and TBtools99.

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.