Introduction

Theaceae is one of the most diverse plant family, having its origin in the late L-Cretaceous and showing early radiation during the Early Eocene Climatic Optimum1. Currently it consists of 372 accepted species falling into three tribes and nine genera2. Due to the presence of similar morphological traits in some species, the prevalence of specific self-incompatibility, and frequent inter-genus (species) hybridization and polyploidization, the classification of this family is challenging3,4,5, especially when it comes to the subgenera of Camellia among which conflicting or unresolved relationships still exist6. Moreover, controversial taxonomic outcomes also arise from genetic relationship analysis based on morphology and evolutionary analysis based on molecular information.

Camellia is the largest genus in the family Theaceae. The common ancestor of this genus and other genera in the same family underwent changes at diverse frequencies in the early Miocene, followed by increased rate of speciation. Camellia diverged from the common ancestor with Tulcheria 38.4 Mya (millions of years ago), experienced a radiative divergence of major clades about 23 − 19 Mya in the late Miocene, and then a subsequent species burst happened in it 10 − 5 Mya7. The theory on the classification of Camellia has been continuously supplemented since its formation; three main Camellia taxonomy systems which are contradictory to a certain extent were proposed by Sealy, Chang, and Min, respectively8. There are differences in the boundaries on the family, subgenera, and species as well as in the boundaries and relationships between species among the three systems on one hand, and subgeneric division was not mentioned in the Sealy system but constitutes one part of the Chang system9 and the Min system10 on the other hand. Furthermore, some species are treated differently and different classification bases are used in the three systems. Taking morphological traits for example, Sealy11 and Min12 attached more importance to the traits of pedicel, bracteoles and sepals, whereas Chang9,13 probably gave a higher taxonomic value to the characteristics of filaments. All these classification approaches are based on morphological traits and differences among them might be unavoidable due to cognitive discrepancies among taxonomists. The rise of molecular biology has provided more valuable information for plant classification. The availability of vast amount of molecular bioinformation such as chloroplast and mitochondrial genomes, transcriptomes and various molecular markers such as ddRAD, ISSR, nrITS, SNP, and AFLP for phylogenetic relationship studies14,15,16,17,18 allows for more accurate understanding of species-level phylogenetic relationships. Zan et al. investigated the phylogeny of and interspecific relationships among 87 species in the genus Camellia using 95 transcriptomes and found that these species could be categorized into 8 major clades and 2 independent clades. Using four chloroplast genome regions (matK, rbcL, ycf1, and trnL-F) as markers, Pang et al.19 studied the phylogeny of and interspecific relationships among 161 species representing all four subgenera of Camellia and concluded that these 161 species could be grouped into 13 clades. These findings are inconsistent with those based on existing Camellia taxonomy systems.

Golden camellia species are a group of evergreen shrubs or small trees characterized by yellow flowers in the section Chrysantha of the subgenus Thea of the genus Camellia of the family Theaceae. According to literature search, more than 50 species of golden camellia have been found all over the world20,21, mainly distributed in southern Guangxi, China and a region of northern Vietnam bordering Guangxi22. Over 30 species and 5 varieties were found in China23, and 42 species were distributed in Vietnam, 32 of which were endemic to Vietnam21. In addition, several golden camellia species were distributed sporadically in Yunnan and Guizhou, China24 as well as Myanmar, Thailand, Malaysia, and Japan25. Species in the section Chrysantha started to diverged 20 − 17 Mya, most after14-12 Mya1,7. Differences exist in the divergence time and phylogenetic relationships among different scientific documents; the evolutionary time of species retrieved from http://timetree.org/ is generally later than that in other literature. For example, the divergence time of C. flavida and C. impressinervis in Zan et al.7 and Cheng et al.1 is 9.7 Mya and 11 Mya, respectively, whilst that from http://timetree.org/ is 4.3 Mya; the initial divergence time of the section Chrysantha is approximately 16.9 Mya and 20 Mya in Zan et al.7 and Cheng et al.1, but that from http://timetree.org/ is around 13 Mya. The classification of the section Chrysantha has been complicated by discrepancies in phylogenetic relationships among different literature. Until now, species in the section Chrysantha have not been reviewed and studied in a systematic manner, thus the exact quantity of species in this section is yet to be determined, and the naming of some species is confused by homonyms and synonyms17,18. Therefore, it is of great significance to incorporate the taxonomic knowledge of novel species belonging in the section Chrysantha to improve the existing taxonomy systems.

Due to influences from climate, geographical conditions, soil, and other aspects, the distribution of wild resources of golden camellia species is restricted to a narrow area, 90% of the world’s wild golden camellia species are only distributed near Lanshan Branch of Shiwan Mountain in Fangchenggang, Guangxi26. Considering the scarce resources of wild golden camellia species, China has listed them as endangered species in need of protection. Golden camellia species show a high soil specificity, most of them only grow in calcareous limestone soil in karst areas, and merely a few ones grow in acidic soil; moreover, golden camellia species show differences in their adaptability to these two kinds of soil27,28,29. In addition, golden camellia species are thermophilic, hygrophilous, and sciophilous, their survival decreasing as illumination intensity increases30. Overall the phylogenetic structure of golden camellia species are closely associated with their geographic distribution, those with the same or similar geographic distribution are also genetically closely related14,15,31,32,33,34. A long period of geographical isolation has limited the flow of genes between golden camellia species growing in different geographic regions, resulting in a high degree of genetic variation between different geographic regions35,36. Genetic variations are higher among ex situ populations as compared with natural populations, but artificial populations developed using seeds and cuttings have a low genetic diversity relative to wild ex situ populations37,38.

Species in the section Chrysantha may be a paraphyletic or polyphyletic group19,39, reticulate evolution can result in inconsistent phylogenetic outcomes based on nuclear and chloroplast sequences17, and the complex genetic background of these species further adversely affects their accurate classification. Phenotypic traits are similar in different golden camellia species but may be diverse in the same species40, increasing the difficulty in effective classification and accurate authentication of the section Chrysantha. Some researchers regarded an ecotype as a species, but different habitat conditions tend to contribute to variations in golden camellia species; and some formas were also treated as new species, causing confusion in the classification of species in the section Chrysantha41. Ye believed that the collection of flower colors and other similar traits is an important basis for enabling subclassification of the genus Camellia42. For instance, species in the section Chrysantha are characterized by yellow flowers; most species in the genus Camellia have axillary flowers, without true terminal flowers. Chang and Ye43 noted that the combination of floral organ morphology with flower colors along the phylogenetic tree forms a basis for classification of the section Chrysantha. The classification of the section Chrysantha may also be performed on the basis of anatomical characteristics of leaf blades and some chemical components21,44.

As science and technology advance, molecular biotechnology is widely applied in species classification and identification. For example, phylogenetic relationships in the section Chrysantha can be resolved using mitochondrial and chloroplast genomes45,46. However, reticulate gene evolution of species results in inconsistency between phylogenetic results based on nuclear and chloroplast sequences. Wei et al.17 found that results of phylogenetic analysis using nuclear ddRAD-seq and RNA-seq are consistent with morphological traits of golden camellia species relative to using chloroplast genomes. Classification of the same species using different molecular markers may produce inconsistent results47, making it necessary to choose an appropriate molecular biotechnology for plant classification. Analyses of phylogenetic relationships and divergence time based on the whole genome and transcriptome sequences of species enable us to understand the order of inter-specific divergence from the entire level of genetic development and thus classify and identify species more accurately and effectively1. Without doubt, the information on distribution of an existing species alone is not sufficient to fully understand the evolution and distribution history of that species, and more accurate analysis results may be obtained when fossil information is available48.

All golden camellia species are endangered species with important ornamental and medicinal value49,50,51,52,53,54,55,56. Accurate authentication and classification of golden camellia species will facilitate discovery of new species, germplasm conservation, utilization of alternative herbal drugs, and medicinal quality testing of the section Chrysantha. So far no studies on the phylogenetic relationships among golden camellia species have covered all golden camellia species, and the methods used therein are different, leading to differences in results among different literature. For example, Zheng57 regarded C. huana and C. tianeensis as the same species, but Zheng et al.58 believed that C. tianeensis is a variety of C. huana. The differences in the literature have further complicated the phylogenetic relationships among golden camellia species. In the present study, 20 golden camellia species were analyzed using the transcriptomics technique and conserved single-copy homologous genes based on sequential or structural differences in proteins, in order to answer the following two questions: (1) What are the phylogenetic relationships between these species? and (2) When is the divergence time of these species?

Materials and methods

Sampling, quantification of phenotypic traits and RNA-seq sequencing

Living specimens of 20 golden camellia species are preserved in the gene bank of Chongfeng Golden Camellia Plantation Base (108°7′43″E, 21°38′40″N, 36.8 m asl) located at Dongxing, Fangchenggang, Guangxi. They are all wild individuals transplanted from the habitats of golden camellia in Guangxi and Vietnam in 2013, and were identified to be genuine species in the Section Chrysantha of the genus Camellia of the family Theaceae by Liandong Huang, a Senior Landscape Engineer and the former manager of the gene bank of Nanning Golden Camellia Park in Nanning, Guangxi. In November 2022, leaves producing in that year were collected randomly from all the 20 species, quick-frozen with liquid nitrogen, and stored at -80℃ in a refrigerator for subsequent RNA extraction. Phenotypic traits of the 20 species were quantified, at least 10 plant individuals for each species. Tapeline and rule were used to measure plant height, canopy diameter, petiole length, leaf length and width, and leaf area and leaf length-width ratio were calculated; electronic scales were used to measure wet and dry leaf weight. Pictures of leaves, flowers, and flower buds of each species were taken. Information on sampled species including original habitat, soil type, and NCBI accession number (SRA number) is listed in Table 1.

Total RNA was extracted from quick-frozen leaf samples of the 20 species using Omega Plant RNA kit (Omega Bio-Tek, R6827) according to its instructions for use, subjected to quality testing by Agilent 2100, and then used to construct transcriptome libraries with NEB#7530 kit (#E7530, New England Biolabs). The resulting libraries were tested for their quality using DNA 1000 Assay Kit (Agilent Technologies, 5067 − 1504). The cDNA libraries were sequenced on the Illumina sequencing platform by Genedenovo Biotechnology Co., Ltd (Guangzhou, China). Rhododendron griersonianum, Diospyros lotus, and Impatiens glandulifera were used as the outgroups (see Table 1), and their transcriptomic data were downloaded from GenBank.

Table 1 Information on sampled species including original habitat, soil type, and NCBI accession number (SRA number).

Sequence assembly and homologous gene identification

fastp was used for quality control of raw reads59 to filter low-quality data to obtain clean reads. The quality control procedure included removal of reads containing the adapter, reads containing more than 10% of N, reads whose bases were all A, and reads with a low quality (the number of bases with the quality value Q equal to or lower than 20 accounts for more than 50% of the whole read). The Trinity v2.8.4 software60 was used for de novo assembly of clean reads of each species. N-free assembled fragments were obtained through reads overlapping and taken as Unigenes. The quality of assembly results was evaluated using N50 values, sequence length, and Benchmarking Universal Single-Copy Orthologs (BUSCO) (http://busco.ezlab.org/). Using blastx, Diamond alignment61 (evalue < 1e-5) was performed between Unigene sequences and protein databases Nr, Swiss-Prot, KEGG62,63,64, and COG/KOG. The running priority order of these protein databases was the same as their order of occurrence here. The alignment of any Unigene sequence terminated when it was aligned with a protein in a high-priority database, otherwise auto alignment with a lower-quality database would proceed until alignment with all databases was run. After being determined by aligned proteins, the coding region sequence (in 5’->3’ direction) in a Unigene was translated into amino acid sequence according to the standard codon table. For any Unigenes failing in alignment with all of the above protein databases, the nucleotide sequence in their coding regions (in 5’->3’ direction) was predicted using TransDecoder Release v5.7.1 (a software used to Find Coding Regions Within Transcripts; https://github.com/TransDecoder/TransDecoder/wiki) and also translated into amino acid sequence. Alignment parameters used were as follows: -no-unlink -k 0 -f 6 -p 8 -e 1e-5 --query-cover 30% --salltitles --more-sensitive. KOG, KEGG, and GO annotation was performed to coding regions, and the combination of Diamond and OrthoMCL65 was used to identify inter-specific homologous Unigenes (see Appendix 1 in the Supplementary Materials for OrthoMCL parameters).

Phylogenetic analysis

Based on data on phenotypic traits, principal component analysis (PCA) and orthogonal partial least squares discriminant analysis (OPLS-DA) were conducted using SIMCA14.1, and agglomerative clustering was performed using R hclust functions. The permutation test was performed to validate the presence or absence of overfitting. Multiple sequence alignment was performed on single-copy homologous genes using the Muscle software66, followed by the construction of phylogeny trees using Neighbor-Joining (NJ) method (command: megacc -a infer_NJ_nucleotide.mao -d out_cds.gblocks.gbo.fa -f Fasta -o tree_nj) in MEGA10.0.167, Maximum Likelihood (ML) method (model: GTR + I + G; command: iqtree -nt 12 -s out_cds.phy -m GTR + I + G -bb 1000 -st DNA) in iqtree 1.6.368 with 1000 ultrafast bootstrap repetitions, and Bayesian Inference (BI) method (model: GTR + I + G; see Appendix 3 in the Supplementary Materials for operating parameters) in MrBayes 3.2.6 × 6469.

Estimation of inter-specific divergence time

Based on the ML topology, mcmctree in the PAML software package70 (http://abacus.gene.ucl.ac.uk/software/paml.html) was used to estimate the time of divergence between species using the HKY85 model and the GAMMA prior (see Appendix 2 in the Supplementary Materials for parameters). The root of the divergence time tree was determined by the divergence time between the family Theaceae and the outgroups available at http://www.timetree.org/, and the divergence time between all species concerned was calibrated using the divergence time between C. pubipetala and C. longzhouensis, between C. flavida and C. impressinervis, between C. limonia and C. fusuiensis, and betwen C. euphlebia and C. impressinervis as documented at http://www.timetree.org and in Zan et al.7 and Cheng et al.1. Based on the BI topology, MrBayes 3.2.6 × 6469 was used to generate the divergence time tree using the HKY85 model and the GAMMA prior (see Appendix 2 in the Supplementary Materials for parameters), and the divergence time between all species concerned was calibrated using the divergence time between C. limonia and C. fusuiensis and between C. euphlebia and C. impressinervis as recorded at http://www.timetree.org/.

Results

Phenotypic traits of and relatedness between the 20 golden camellia species

Measurements of main phenotypic traits of plants and leaf blades are listed in Table S1, and the average values are provided in Table 2. Fig. S1 shows leaves, flowers, and flower buds of the 20 species (see the Supplementary Materials). The number of principal components obtained from PCA and OPLS-DA is presented together with R2X, R2Y, and Q2 in Table 3, from which we know that 1 principal component was obtained from each of the two analyses, and that R2X > 0.4 in PCA and both R2Y and Q2 were higher than 0.5 in OPLS-DA, indicating a high goodness of prediction of the model. In PCA, variables that contributed the most to the principal component were leaf weight and leaf area (see Fig. 1A); samples of the same species had similar scores, and the scores of C. macrophylla, C. euphlebia Merr, and C. pubipetala had no intersection with those of other species (see Fig. 1B), which means that these species may be distinguished from other species based on phenotypic traits. The VIP (Variable Importance in the Projection) values of variables in OPLS-DA are shown in Fig. 2A, which displays that leaf weight and leaf area contributed significantly to the overall goodness-of-fit and classification capacity of the model whereas leaf length-width ratio made the minimum contribution. Sample scores in OPLS-DA (see Fig. 2B) showed the similar trend to that in PCA. Via Permutation Test, internal 200-fold cross validation was run on main parameters of the model with Q2 =-0.0516 and R2 = 0.0243, suggesting that there was no overfitting by the OPLS-DA model (Fig. 2C).

Table 2 Quantitative results of phenotypic traits of 20 golden camellia species.
Table 3 The number of principal components obtained from PCA and OPLS-DA as well as R2X, R2Y, and Q2.
Fig. 1
figure 1

Results of PCA analysis. (A) Principal component contribution rate of 7 variables (plant height, canopy, wet leaf weight, dry leaf weight, leaf area, petiole length, and leaf length-width ratio), the vertical coordinate is weight coefficient representing the principal component contribution rate, and the whiskers denote standard deviation; (B) Principal component scores, the horizontal coordinate showing 240 samples from 20 golden camellia species and the vertical coordinate being principal component 1 (t1).

The result of agglomerative clustering on phenotypic traits is shown in Fig. 3A, and the results of agglomerative clustering on PCA and OPLS-DA scores of samples are present in Fig. 3B and C, respectively. These three types of clustering had consistent outcomes when the 20 golden camellia species were grouped into three clusters: One cluster only consisted of C. macrophylla which had the biggest plant height, canopy diameter, wet leaf weight, dry leaf weight, leaf area, and petiole length; one cluster comprised C. pubipetala, C. euphlebia Merr, C. impressinervis, C. enchengensis, and C. insularis, all of which had medium-level integrated scores for all of their variables; the last cluster was composed of the remaining species, all of which had low integrated scores for their variables. When the 20 species were grouped into more clusters, clustering based on PCA and OPLS-DA sample scores obtained completely consistent results, but there was a difference between these results and that obtained by using the hclust function in R. All of these three types of clustering revealed that C. longanensis and C. longruiensis were closely related, and C. changtsoansis, C. limonia and C. limonia f. obovata were closely related.

Fig. 2
figure 2

Results of OPLS-DA analysis. (A) VIP (Variable Importance in the Projection) values of 7 variables (plant height, canopy, wet leaf weight, dry leaf weight, leaf area, petiole length, and leaf length-width ratio); (B) OPLS-DA scores, the horizontal coordinate showing 240 samples from 20 golden camellia species and the vertical coordinate being the product of the weight and principal component 1 (t1); (C) Permutation Test result showing no overfitting by the OPLS-DA model.

Fig. 3
figure 3

Results of agglomerative clustering (A) on phenotypic traits; (B) on PCA sample scores; (C) on OPLS-DA sample scores for 20 golden camellia species.

Phylogenetic relationships between golden camellia species

Characteristics of transcriptomes and dataset analysis (Appendix 4), genetic statistics (Table S2), gene length distribution density (Fig. S2), gene family statistics (Fig. S3), and gene family Venn diagram (Fig. S4) of the 20 golden camellia species are presented in the Supplementary Materials. KOG, KEGG, and GO annotations were performed on coding regions, and the combination of Diamond and OrthoMCL was used to find the homologous Unigenes between species. Rhododendron griersonianum, Diospyros lotus, and Impatiens glandulifera were chosen as the outgroups. Comparative sequence analysis was performed between homologous genes of the outgroups and those of the 20 golden camellia species to construct phylogenetic trees. Figure 4A shows the phylogenetic tree constructed by ML method (ML_tree), the phylogenetic tree generated using NJ method (NJ_tree) is shown in Fig. 4B, and the phylogenetic tree generated using BI method (BI_tree) is displayed in Fig. 4C. ML_tree and BI_tree demonstrate that there was a distant evolutionary relationship between the 3 outgroups and the 20 golden camellia species, whereas NJ_tree shows that C. tunghinensis was evolutionarily closely related to the Diospyros lotus. With the outgroups as a reference, differences were found in the evolutionary sequence of all species among the three trees, but the genetic relationships among all species were similar on ML_tree and BI_tree. On the phylogenetic trees constructed using the three methods, the following species were evolutionarily closely related: C. achrysantha and C. changtsoansis; C. limonia and C. limonia f. obovata; C. insularis, C. pubipetala and C. impressinervis; C. chrysanthoides and C. flavida; and C. longgangensis and C. parvifolia.

Divergence time estimation

The divergence time tree constructed based on the ML topology (ML_Timetree) is shown in Fig. 5A, and that generated based on the BI topology (BI_Timetree) is presented in Fig. 5B. As searched for from http://www.timetree.org/, the divergence time between the section Chrysantha and the three outgroups is around 102.646 Mya (93.49-113.58 Mya). The divergence time on ML-Timetree was calibrated by that between C. pubipetala and C. longzhouensis (about 7.4 Mya), between C. flavida and C. impressinervis (about 9.7 Mya), between C. limonia and C. fusuiensis (about 2.5 Mya), and between C. euphlebia and C. impressinervis (about 4.3 Mya) as recorded in the literature. The divergence time on Bi_Timetree was calibrated by that between C. limonia and C. fusuiensis (about 2.5 Mya) and between C. euphlebia and C. impressinervis (about 4.3 Mya) as retrieved from http://www.timetree.org/. The divergence time of individual species varied much between ML_Timetree and Bi_Timetree due to differences in tree construction methods. On ML_Timetree, among the 20 golden camellia species, C. parvifolia was the first species to emerge (around 39.652 Mya (18.994–67.896 Mya)), followed by C. longruiensis, C. longgangensis, C. flavida, and C. chrysanthoides successively; C. achrysantha and C. fusuiensis, whose divergence time was about 1.186 Mya (0.852–1.53 Mya), were the last species to emerge; 14 species diverged after 3.452 Mya (2.622–4.077 Mya). On BI_Timetree, C. tunghinensis and C. euphlebia diverged 2.621 Mya (2.229–3.047 Mya), C. limonia, C. limonia f. obovata, C. changtsoansis, C. achrysantha, and C. fusuiensis diverged 2.504 Mya (2.115–2.872 Mya), and C. longruiensis and C. parvifolia were the last species to emerge (1.026 Mya [0.821–1.237 Mya]).

Fig. 4
figure 4

Phylogenetic trees of 20 golden camellia species inferred from single-copy homologous genes using (A) Maximum Likelihood method, (B) Neighbor-Joining method and (C) Bayesian Inference. ML bootstrap (A), NJ bootstrap (B), or Bayesian posterior probability (C) is indicated near each node the tree.

Fig. 5
figure 5figure 5

Chronograms of 20 golden camellia species derived from divergence time estimation based on maximum likelihood topology (A) and Bayesian Inference topology (B). The ages of stratigraphic boundaries, with a scale as millions of years ago (Mya), were obtained from the International Chronostratigraphic Chart16.

Discussion

Phylogeny and taxonomic basis of golden camellia species

The taxonomy of primitive species primarily depends on morphological discrepancies, such as differences in leaf shape and size, plant height, floral organ colors and size, etc. The primitiveness of some species in the genus Camellia is reflected by stalkless flowers and having no bracts and sepals; some species evolved and have elongated pedicels resulting in the emergence of bracts and sepals, bracts being isolated or persistent on pedicels or abscissive or decreased in number; in the case of absence of differentiated bracts and sepals, flowers and fruits may evolve, indicating that phylogeny and ontogeny are not consistent in the subtaxa of Camellia71. The collection of floral colors and other similar traits is an important basis on which to define subtaxa of Camellia. For instance, species in the section Chrysantha are characterized by yellow or golden flowers; generally all species in Camellia have axillary flowers but no true terminal flowers43. In our study, the 20 golden camellia species also showed differences in leaf blades and plant size (see Table 2); PCA and OPLS-DA results showed that some species could be distinguished from others based on phenotypic traits and phenotypic traits may serve as a tool for authenticating those species. However, many golden camellia species have similar phenotypic traits, and the use of phenotypic traits alone is not sufficient to identify and classify them. Comparative scrutiny of Figs. 3 and 4 revealed that there was no high correlation between clustering trees based on differences in phenotypic traits and phylogeny trees based on gene sequences, suggesting that species which are evolutionarily closely related may differ much in phenotypic traits as a result of species differentiation. This means that although morphological differences are to some extent useful to species authentication and taxonomy, they provide no references for the analysis of evolutionary relationships between species. Since species morphology is greatly affected by the environment, species taxonomy based on morphology alone has its limitations. For example, C. nitidissima var. phaeopubisperma is phenotypically diverse, so it is impossible to accurately identify this species according to morphological discrepancy alone40. Currently three taxonomic systems for the genus Camellia, established by Sealy, Zhang, and Min, respectively, are available, However, these systems focus on different phenotypic traits and remain controversial8. Therefore, it is of significance to correct the classification of the section Chrysantha using molecular biology methods.

In addition, maybe due to similar phenotypes, there are synonyms and homonyms present in golden camellia species, complicating the classification of the section Chrysantha. These include: 1) C. quinqueloculosa is synonymous with C. multipetala, C. longzhouensis is synonymous with C. chrysanthoides, and C. micrantha is synonymous with C. xiashiensis and C. parvipetala17; C. aurea Chang is synonymous with C. multipetala and C. longgangensis, C. chuongtsoensis is synonymous with C. perpatua; Zheng57 stated that C. huana is synonymous with C. tianeensis, but Zheng et al.58 argued that C. tianeensis is a variety of C. huana; C. aurea Chang and C. ptilosperma are names of the same species, and C. flavida, C. xiashiensis and C. micrantha should be merged into a single species18. In summary, both Xiao et al.18 and Wei et al.17 stated that C. micrantha is synonymous with C. xiashiensis, but it is debatable whether C. parvipetala and C. flavida should be merged into a single species. Also, further studies should be carried out to confirm that C. quinqueloculosa is synonymous with C. aurea Chang and C. longgangensis17,57. Wei et al.17 reversed a previous decision and proposed that C. nitidissima Chi. and C. petelotii refer to two different species. Lu et al.36 supported a previous decision that C. nitidissima var. microcarpa is a variety of C. nitidissima Chi., but Zheng57 held the opposite opinion. Tang et al.72 supported the viewpoint that C. nitidissima Chi. is very genetically closely related to C. nitidissima var. microcarpa. Zheng57 and Wei et al.17 proposed that C. terminalis is not a variety of C. pingguoensis. C. micrantha is a variety of C. chrysanthoides, and C. tunghinensis is not a variety of C. limonia57. C. ptilosperma is a natural hybrid of C. flavida and C. micrantha17. Our study showed that C. longzhouensis and C. chrysanthoides evolved in different periods of time and they were two different species, inconsistent with the findings of Wei et al.17; C. tunghinensis was temporally distant from C. limonia in terms of evolution, and the former was not a variety of the latter, in agreement with the result of Zheng57.

Evolutionary relationships between species in the section Chrysantha

Generally there is a close tie between phylogenetic relationships among golden camellia species and their geographical distribution, those having the same or similar geographical distribution are genetically closely related14,15,31,32,33,34. Gene flow between geographical regions has been limited by a long period of geographical isolation, and as a result, a high level of genetic variation exists between golden camellia species growing in different geographical regions35,36. Our study showed that most geographically short-distance species were evolutionarily closely related; for example, C. euphlebia and C. tunghinensis, both of which were distributed in Dongxing, Guangxi, were close in terms of genetic relationship and divergence time, and C. achrysantha and C. fusuiensis growing in Fusui, Guangxi were also evolutionarily closely related, basically consistent with the findings of previous studies.

The availability of information on interspecific genetic relationships will facilitate the discovery of new species, utilization of alternative herbal medicine, and taxonomic identification. For genetic relationships between golden camellia species, different researchers obtained different results, likely due to the difference in evidence they used, such as different molecular markers being used57,73; the differences may also result from the fact that the definitions of species are inconsistent, for example, C. perpatua is thought to be an alias of C. ptilosperma, but it is also believed to be the same species as C. chuongtsoensis57. There is uncertainty in the results of published papers because of synonyms and homonyms. Nevertheless, the following consensus can be reached by combining the results of various researchers: C. huana is very closely related to C. tianeensis genetically15,32; C. nitidissima Chi. is genetically closely related to C. nitidissima var. phaeopubisperma, C. euphlebia, and C. tunghinensis32,45,46,47,72,74; C. longruiensis is genetically closely related to C. flavida, C. wumingensis, C. longzhouensis, C. limonia, and C. xiashiensis47,57,74,75; C. murauchii is genetically closely related to C. longanensis and C. fusuiensis76; C. micrantha has close genetic relations with C. parvipetala, C. chrysanthoides, C. multipetala, C. xiashiensis, and C. longzhouensis32,73; C. mingii has close genetic relations with C. debaoensis and C. fascicularis72; C. tamdaoensis has close genetic relations with C. crassiphylla, C. ampletifolia, C. murauchii, C. hirsuta, and C. phanii33; C. insularis is genetically closely related to C. hirsuta and C. longanensis76,77; C. aurea is genetically closely related to C. impressinervis47,72,74,75,76,77; C. impressinervis has close genetic relations with C. wumingensis, C. limonia f. obovata, C. terminalis, C. chrysanthoides, C. aurea, and C. parvifolia; C. flavida is genetically closely related to C. aurea, C. longgangensis var. grandis, and C. longruiensis17,32,47,73, inconsistent with the findings of Zheng57. Taken together, geographically close species are genetically closely related. Three methods, i.e. NJ, ML, and BI were used in our study to generate evolutionary trees. There were differences in evolutionary relationships among individual species due to the discrepancy in tree construction methods. NJ is closely related to sequence homology and suitable for highly similar species, so it is usually used for intra-specific analyses. By contrast, ML and BI obtain better results in case of species with low sequence similarity, and they are generally used for inter-specific or inter-subspecific analyses. According to Hall78, BI_tree is more accurate than NJ_tree and ML-tree, so we mainly compared the BI_tree obtained in our study and BI trees in the literature. Our results also showed that C. fusuiensis and C. achrysantha from Fusui, Guangxi were very closely related to each other genetically; C. flavida and C. chrysanthoides from Longzhou, Guangxi had very close genetic relations; C. parvifolia and C. longruiensis from Longzhou and Ningming, Guangxi were genetically closely related; C. euphlebia and C. tunghinensis from Dongxing, Guangxi were very closely related to each other genetically (Fig. 4), consistent with the results of Yongqing et al.32. The above inconsistent results were caused by different designations and evaluation indicators used by different researchers.

Evolutionary time

As a general rule, an outgroup chosen to estimate the phylogenetic relationship is a species genetically closely related to the species under study, but when it comes to the study of evolutionary history of species, an outgroup should have an earlier evolutionary history than the species under study. Since some species in the section Chrysantha are at the early evolutionary stage of the family Theaceae, using other species in this family as outgroups will fail to accurately infer the evolutionary time of those golden camellia species. Therefore, Rhododendron griersonianum, Diospyros lotus, and Impatiens glandulifera were used as the outgroups for analyzing the divergence time between the 20 golden camellia species. The divergence time on ML_Timetree was calibrated by the divergence time between C. pubipetala and C. longzhouensis (about 7.4 Mya), between C. flavida and C. impressinervis (about 9.7 Mya), between C. limonia and C. fusuiensis (about 2.5 Mya), and between C. euphlebia and C. impressinervis (about 4.3 Mya) as retrieved from http://www.timetree.org/ and Cheng et al.1 and Zan et al.7. Since the topology we used to generate BI_Timetree was different from that we used to construct M_Timetree, both trees showed different evolutionary distances from the outgroups and species evolutionary sequences. The divergence time between C. pubipetala and C. longzhouensis (about 7.4 Mya) and between C. flavida and C. impressinervis (about 9.7 Mya) were unfit for BI_Timetree calibration, and only the divergence time obtained from http://www.timetree.org/ was used to calibrate BI_Timetree. Due to the difference in the basis on which the two kinds of trees were constructed, there was much variation in the divergence time between them: First, evolutionary time of individual species on BI_Timetree was later than that on ML_Timetree and the earliest divergence time on BI_Timetree was postponed from 39.652 Mya on ML_Timetree to 2.621 Mya because the divergence time of individual species recorded at http://www.timetree.org/ is later than the divergence time in Cheng et al.1 and Zan et al.7; second, differences existed in the evolutionary time sequences of individual species between the two divergence time trees. Nevertheless, the divergence time and sequence of 14 species were basically consistent on both divergence time trees despite inconsistencies concerning 6 other species (C. longzhouensis, C. chrysanthoides, C. flavida, C. longgangensis, C. longruiensis, and C. parvifolia). Most species in our study emerged after 2.621 Mya, later than C. flavida (12 Mya), C. impressinervis (9.7 Mya), C. pubipetala (8.7 Mya), and C. longzhouensis (8 Mya) in Zan et al.7 as well as C. flavida (11.9 Mya) and C. impressinervis (11 Mya) in Cheng et al.1, but almost at the same time as C. limonia, C. fusuiensis, and C. chrysanthoides (2.5 Mya for all) as predicted by http://www.timetree.org/.

Theaceae originated in the late Upper Cretaceous period, followed by the early radiation after the onset of the Early Eocene Climatic Optimum. Under favorable climate conditions in the early Miocene, the species differentiation rate of the common ancestor of Theaceae changed, and accelerated thereafter1. Theaceae emerged 79.8-102.5 Mya16,79,80. The crown age of the tribe Theeae was estimated as 22.1 Mya (19.2–25.2)16. Molecular dating using four fossil calibration points indicated that the tree age of Camellia was 39.5 Mya81; a radiative divergence of major clades occurred at 23 − 19 Mya and a species burst occurred at 10 − 5 Mya within this genus7. The section Chrysantha diverged earlier than the section Thea1; C. sinensis (Thea) diverged around 6 Mya, whereas C. flavida (Chrysantha) and C nitidissima (Chrysantha) diverged about 10 Mya1,16. C. sinensis and C. lanceoleosa diverged from a common ancestor 6–7 Mya3. The 20 species in our study diverged at a later time as compared with that recorded in the literature, inconsistent with the finding of Cheng et al.1 and Zan et al.7, likely due to the use of different methods and reference time for calibration.

Rationality of genetic diversity analysis theories

Presently genetic diversity analyses based on gene sequences and DNA molecular markers are performed on the basis of the molecular clock hypothesis. According to this hypothesis, the larger the divergence time interval between two species, the higher the variation in amino acid sequences between them. The precondition of the molecular clock hypothesis is that any given macromolecule (DNA or protein sequence) has an approximately constant evolutionary rate in all evolutionary lineages. However, a growing body of research has found that molecular clocks are not all the same in different species and that they can be influenced by “generation-time effect”, life history of species, and DNA methylation. The neutral theory proposes that any mutation in a group will be totally subject to the influence of various random factors if it dose not lead to any phenotypic difference. If, as the neutral theory expects, the vast majority of mutations are neutral and the mutation rate remains constant for a long time, then the population replacement rate also remains the same for a long time. This is also the principle of the molecular clock hypothesis. The neutral theory and the molecular clock hypothesis are circular arguments with almost the same meaning, and the proposition of the neutral theory is for the purpose of explaining the molecular clock hypothesis. The neutral theory is based on the assumption that most genetic variations are neutral, rather than on an empirical conclusion. However, tests of genetic diversity in real life has soundly rejected the neutral theory82, and molecular biological function research over the past 60 years has discovered that a large number of sequences considered as neutral are actually biologically functional. Another recent study83 has shown that the human mtDNA genome is largely functional rather than neutral, again directly falsifying the neutral theory. In addition, short tandem repeats have been assumed to be neutral but have recently been shown to bind transcription factors84, also disproving the neutral theory.

The maximum genetic diversity (MGD) theory has been proposed by Huang85,86 to supersede the neutral theory. The author believes that genetic diversity is influenced by natural selection, and that genetic variations will not undergo unlimited linear growth but will be impacted by species complexity. The MGD hypothesis takes in proven merits of the modern evolutionary theory and its assumption merely relates to a more limited range. According to a priori logic common sense reasoning, there must an upper limit to genetic distance or diversity. This upper limit determines that evolutionary variations are not linearly related to time. As such, both the molecular clock hypothesis and the neutral theory are false. Therefore, the divergence time estimated by them might not be accurate.

The divergence time is a hot topic in current evolutionary analyses. Generally the divergence time between species is estimated by the degree of divergence between gene sequences and molecular clocks. Unfortunately, since the molecular clock hypothesis is a mistaken interpretation of the genetic equidistance phenomenon87, the accuracy of inter-specific divergence time prediction based on the molecular clock model is questioned. The information on the distribution of an existing species alone is not sufficient to fully understand the evolutionary and distribution history of that species, and results in the literature are only prediction about divergence trends; a large amount of fossil information and other evidence are needed to make it possible to obtain more accurate divergence time48.

Conclusions

No consensus about phylogenetic relationships among species in the section Chrysantha has been reached until now. Different results are obtained in the literature due to the use of different molecular materials and methods. In the present study, we found that species with close geographic distances are evolutionarily close to each other, consistent with the results of previous studies. The divergence time of 14 species was close to that predicted at http://www.timetree.org/, but later than that in other literature. Our conclusions are tentative as our study is based on the failed molecular clock hypothesis as other genetic diversity analyses and can not cover all golden camellia species.