Comparative chloroplast genomes analysis of nine Primulina (Gesneriaceae) rare species, from karst region of southwest China

Gu, Jiangmiao; Li, Meijun; He, Songtao; Li, Zhi; Wen, Fang; Tan, Ke; Bai, Xinxiang; Hu, Guoxiong

doi:10.1038/s41598-024-81524-6

Download PDF

Article
Open access
Published: 04 December 2024

Comparative chloroplast genomes analysis of nine Primulina (Gesneriaceae) rare species, from karst region of southwest China

Jiangmiao Gu¹,
Meijun Li¹,
Songtao He¹,
Zhi Li^1,2,
Fang Wen^3,4,
Ke Tan^3,4,
Xinxiang Bai¹ &
…
Guoxiong Hu⁵

Scientific Reports volume 14, Article number: 30256 (2024) Cite this article

1344 Accesses
3 Citations
Metrics details

Subjects

Abstract

Guizhou Province is one of the most important karst regions of southwest China, with 22 Primulina species. These species are highly habitat-specialized and dependent on the soils of the karst region, and many inter-species classifications are unclear. Therefore, studying the chloroplast genomes and estimating the divergence times of there species can not only provide a better understanding of interspecific relationships but also help to know the species speciation and divergence in the karst environment. So, we sequenced and assembled the chloroplast genomes of nine Primulina species (including six endemic species of Guizhou) and conducted chloroplast genome comparison analysis and phylogenetic study. The chloroplast genome structures of the nine Primulina species were quadripartite with total lengths of 152,869–153,364 base pairs (bp) and GC content of 37.55–37.64%. There are 132 functional genes annotated, respectively. A total of 375 simple sequence repeats and 375 interspersed nuclear elements were identified. The 30 highly preferred codons identified were used at similar frequencies in different species, respectively. A phylogenetic tree constructed on the basis of the 38 chloroplast genomes showed that Primulina species form a monophyletic group. Eleven mutational hotspot regions that could serve as potential molecular markers were identified, of which two regions near the 3′ and 5′ ends of the ycf1 gene were of appropriate size and can serve as molecular markers for phylogenetic studies of Primulina. The results of molecular clock analyses indicate that the three major branches of Primulina begin to diverge in the Miocene, and the number of species proliferated in the Pliocene and Pleistocene. Most of the species of Primulina in Guizhou Province were formed in the Pleistocene and rapidly diverged within a short period of time. This research study enriches the genetic resource information of Primulina and deepens the understanding of the phylogenetic relationships of the genus.

The complete chloroplast genome sequences of five pinnate-leaved Primula species and phylogenetic analyses

Article Open access 27 November 2020

Chloroplast genome characterization of Uncaria guianensis and Uncaria tomentosa and evolutive dynamics of the Cinchonoideae subfamily

Article Open access 24 May 2023

The complete chloroplast genome of Gleditsia sinensis and Gleditsia japonica: genome organization, comparative analysis, and development of taxon specific DNA mini-barcodes

Article Open access 01 October 2020

Introduction

Karst landforms are specialized landforms formed by the dissolution of soluble rocks on the surface and underground¹. Due to the high surface relief, diverse topography, high soil calcium content, and lack of continuous soil coverage, a wide variety of microhabitats and "terrestrial island habitats" have been formed, with significant habitat segregation and heterogeneity, giving rise to many karst-exclusive species with a high degree of regional endemism^2,3,4,5.

Primulina Hance is one of the most diverse taxa in karst regions, and most species of the genus are dependent on limestone soils, creating taxa of great diversity and endemism in segregated and heterogeneous habitats in karst regions^6,7,8,9,10. To date, chloroplast gene fragments (atpB-rbcL, rpl32-trnL, trnL-trnF, rpl32-trnL, rps16 and matK, etc.) and nuclear gene fragments (ITS, 7FR, 13FR, 97FR, 117FR, 155FR, 165FR, 166FR, 248FR, 302FR and 383FR, etc.) have been used to construct phylogenetic trees to explore phylogenetic relationships in Primulina^{6,11,12,13,14,15,16,17}. However, the support for some branch nodes in the phylogenetic tree of the above cited studies was not high, and the phylogenetic positions of some species remain uncertain, such as in the case of the phylogenetic positions of P. lunglinensis, P. lunglinensis var. amblyosepala, P. liguliformis, P. subulatisepala, P. swinglei and P. nandanensis were inconsistent in different studies. Therefore, screening high-resolution molecular markers is one of the main directions for phylogenetic studies in this genus.

Guizhou Province is one of the most important karst regions in China, and its special geological landscape makes it one of the main areas in which Primulina (Fig. 1), with 22 species (the 10 species of which are endemic to Guizhou)^18,19,20. Species in the genus are highly habitat-specialized, and changes in the environment often lead to drastic reductions in the number of individuals in the population^8,9,10, which is problematic for the phylogenetic studies of Primulina. For example, the caves of the P. flexusa type locality (Duyun city, Guizhou Province, China) collapsed due to quarrying, and only a few individuals remained.

The chloroplast genome, due to its simple structure, moderate evolutionary rate, conserved sequence, and primarily matrilineal inheritance, is not only effective in improving the support and resolution of phylogenetic trees in phylogenetic studies^21,22 but also is effective for identifying natural hybridization, which is one of the very important data in phylogenetic analyses^23,24. However, the chloroplast genomic information is under-reported of the Primulina species distributed in Guizhou Province. To date, Feng et al.²⁵, Hsieh et al.²⁶, Hou et al.²⁷, Lu et al.²⁸, Peng et al.²⁹, and Wen et al.³⁰ sequenced and analyzed the chloroplast genomes of several Primulina species. Only 31 chloroplast genomes of 28 Primulina species have been uploaded to the National Center for Biotechnology Information (NCBI) database (https://www.ncbi.nlm.nih.gov/), with most of these species being endemic to Guangxi (39.28%) or Guangdong (28.57%) provinces, China. There are fewer chloroplast genomes endemic to Guizhou Province that have been assembled.

To supplement and preserve molecular data on the endemic and understudied species, the chloroplast genomes of nine Primulina species (including six endemic species of Guizhou Province) (Fig. 2) were sequenced, assembled, and analyzed. Based on these data, (1) comparative analyses of the structure, characteristics, gene content, repetitive sequence content and distribution, and codon usage bias of nine chloroplast genomes were performed to increase knowledge of the Primulina chloroplast genomes; (2) comparative analysis of nucleotide differences in nine Primulina chloroplast genomes and screening of 11 nucleotide mutation hotspot regions that can be used as potential molecular markers were conducted; (3) a phylogenetic tree was constructed based on 36 Primulina chloroplast genomes and two Petrocodon chloroplast genomes, and the phylogenetic relationships of Primulina were briefly explored; and (4) estimated divergence times for these species are based on 36 Primulina chloroplast genomes and two Petrocodon chloroplast genomes.

Results

Chloroplast genome structure, features, and gene content

The chloroplast genomes of the nine Primulina species, ranging from 152,869 bp (P. obtusidentata) to 153,364 bp (P. silaniae), displayed a typical quadripartite structure: a large single-copy region (LSC) with a length of 84,333–84,709 bp, a small single-copy region (SSC) with a length of 17,648–17,885 bp, and two inverted repeat regions (IRs) with a length of 25,235–25,522 bp (Fig. 3; Table 1). The GC content of the nine chloroplast genomes were similar (37.55–37.64%), with the content of the LSC and SSC regions ranging from 35.50 to 35.60% and 31.18–31.39%, respectively, and those of the IRs ranging from 43.08 to 43.23%, which were significantly greater than those of the LSC and SSC regions (Table 1). The GC content of the coding sequences (CDSs) were 37.71–38.80%, with the first, second, and third positions of their codons (GC₁, GC₂, and GC₃) being 45.27–45.39%, 38.10–38.19%, and 29.76–29.84%, respectively, showing the regular pattern of GC₁ > GC₂ > GC₃. Notably, different base types were unevenly represented in the three positions of the codons.

Table 1 Characterization of the nine Primulina chloroplast genomes.

Full size table

The nine chloroplast genomes were each annotated with 132 functional genes, including 87 protein-coding genes (PCGs), 37 transfer ribonucleic acids (tRNAs), and 8 ribosomal ribonucleic acids (rRNAs) (Table 1). Based on the different functions of the genes, these genes were classified into protein synthesis and DNA-replication genes (74), photosynthesis genes (45), other genes coding for proteins (6), and genes of unknown function (7) (Supplementary Table S1). Eighteen unique genes had introns, including clpP1 and ycf3, which had two introns. Due to the inverted repeat nature of the IR regions, the IRb/SSC boundaries had a pseudogene, ycf1, that was partially duplicated and has lost the ability to encode a complete protein. In addition, the rps12 gene in the chloroplast genome consists of two divided parts, 5′-rps12 and 3′-rps12. 5′-rps12 is generally located in the LSC region; because of the expansion or contraction of the IR regions, 3′-rps12 will be located in the LSC region or IR regions ^31,32. In this study, 5′-rps12 was located in the LSC region, and 3′-rps12 was located in the IR regions with double copies (Fig. 3).

Identification of simple sequence repeats and interspersed nuclear elements

SSRs have strong polymorphism and good stability and are therefore ideal molecular markers that are widely used in species kinship identification, genetic diversity research, and molecular breeding ^33,34. In this study, a total of 375 SSRs were identified (Supplementary Table S2), with 35 (Primulina silaniae) to 50 (P. obtusidentata) SSRs identified for each species, including 15–25 mononucleotide (Mono-) repeats, 6–13 dinucleotide (Di-) repeats, 2–5 trinucleotide (Tri-) repeats, 6–12 tetranucleotide (Tetra-) repeats, 0–2 pentanucleotide (Penta-) repeats, and 0–2 hexanucleotide (Hexa-) repeats (Fig. 4A). The repeat units were mainly composed of adenine (A) and thymine (T), with a predominance of mononucleotide repeats A/T, followed by dinucleotide repeats AT/AT (Fig. 4B). Among the nine species, all six types of SSRs were detected only in the chloroplast genome of P. silaniae (Fig. 4A). The SSRs were mainly distributed in the LSC region (Fig. 4C). The percentages of the length of SSRs in the LSC, SSC, and IR regions were calculated separately, and it was found that there was no regular pattern in the percentage of the length of SSRs in these regions (Fig. 4D).

INEs were dispersed throughout the genome and were of four types: forward repeats, reverse repeats, palindromic repeats, and complement repeats ³⁵. In this study, a total of 375 INEs were identified in nine chloroplast genomes (Supplementary Table S2), with the number of INEs in each species ranging from 29 (Primulina liboensis) to 53 (P. flexusa), including 12–28 forward repeats, 16–25 palindromic repeats, 0–2 reverse repeats, and 0–2 complement repeats (Fig. 4E). Forward repeats and palindromic repeats were the most numerous; reverse repeats and complement repeats were absent in some species (Fig. 4E). The INEs ranged from 30 to 99 bp in length, with only one forward repeat of more than 90 bp present in P. malingheensis (Fig. 4F).

Codon usage bias analysis

Codon usage bias refers to the presence of a particular codon or codons in a species that is used more frequently than its synonymous codon and can directly affect the efficiency and accuracy of protein synthesis ^36,37. In this study, each species had 58 CDSs of no less than 300 bp in length, of which 64 codons were detected, with the total number ranging from 24,243 (Primulina malingheensis) to 24,714 (P. flexusa) (Supplementary Table S3). UGA (stop codon) and AUU (encoding isoleucine) were, respectively, the least and most used codons of the nine species; cysteine and leucine were, respectively, the least and most common amino acids of the nine species (Supplementary Table S3).

The nine chloroplast genomes were statistically analyzed and visualized based on relative synonymous codon usage (RSCU) (Fig. 5). Thirty of the 64 codons with RSCU > 1 were highly preferred codons, of which 29 ended in A/U, while 32 codons with RSCU < 1 were less preferred codons, of which 29 ended in G/C. Notably, methionine and tryptophan were encoded by the unique codons AUG and UGG only, with RSCU = 1, and without bias.

Comparative analysis of SC/IR boundaries

The nine chloroplast genomes displayed a typical quadripartite structure with four boundaries between the LSC, SSC, and IR regions (SC/IR boundaries) (Fig. 6). Comparative analyses revealed that the SC/IR boundaries were relatively conserved among the nine species. The LSC/IRb boundary (JLB line) is located in the rps19 gene, the IRb/SSC boundary (JSB line) is located in the overlap between the ndhF gene and the pseudogene ycf1, the SSC/IRa boundary (JSA line) is located in the ycf1 gene, and the IRa/LSC boundary (JLA line) is located in the intergenic sequences of the rpl2 gene and the trnH^GUG gene (Fig. 6).

At the LSC/IRb boundary, the rps19 gene spanned the JLB line, with very few base pairs extending into the IRb region. It can be divided into three categories: (1) 251 bp in the LSC region and 28 bp in the IRb region, with this category containing Primulina flexusa, P. malingheensis, P. secundiflora, P. tenuituba, and P. vestita; (2) 249 bp in the LSC region and 30 bp in the IRb region, with this category containing P. liboensis, P. fordii, and P. obtusidentata; and (3) 247 bp in the LSC region and 32 bp in the IRb region, with this category containing only P. silaniae.

At the IRb/SSC boundary, the ndhF gene spanned the JSB line, with very few base pairs extending into the IRb region. The ndhF genes of Primulina flexusa, P. liboensis, P. secundiflora, P. tenuituba, and P. vestita all had 107 bp on the left side of the JSB line; on the right side of the JSB line, they had 2185 bp for P. liboensis and 2179 bp for the remaining four species. The ndhF genes of P. fordii, P. malingheensis, P. obtusidentata, and P. silaniae all had 2195 bp on the right side of the JSB line; on the left side of the JSB line, it was 70 bp for P. fordii and P. obtusidentata and 91 bp for P. malingheensis and P. silaniae.

At the SSC/IRa boundary, the ycf1 gene spanned the JSA line, with very few base pairs extending into the IRa region. The ycf1 genes of Primulina flexusa, P. liboensis, P. secundiflora, P. tenuituba, and P. vestita all had 802 bp on the left side of the JSA line; on the right side of the JSA line, the length ranges from 4454 to 4664 bp. The ycf1 genes of P. fordii, P. malingheensis, P. obtusidentata, and P. silaniae ranged in length from 765 to 786 bp on the right side of the JSA line and from 4644 to 4704 bp on the left side of the JSA line.

At the IRa/LSC boundary, the rpl2 gene was located to the left of the JLA line, at 83 bp (Primulina flexusa, P. malingheensis, P. secundiflora, P. tenuituba, and P. vestita), 85 bp (P. liboensis, P. fordii, and P. obtusidentata), or 87 bp (P. silaniae); the trnH^GUG gene was located to the right of the JLA line, at 11 bp (P. liboensis), 8 bp (P. flexusa, P. malingheensis, P. secundiflora, P. tenuituba, and P. vestita), 4 bp (P. fordii and P. obtusidentata), or 2 bp (P. silaniae).

Comparative analysis of the chloroplast genomes

In this study, the chloroplast genomes of nine species were aligned and visualized using Primulina flexusa as the reference sequence (Fig. 7). The results showed that the nucleotide differences in the LSC and SSC regions were greater than those in the IR regions, and the noncoding regions were more differentiated than the coding regions. In the coding regions, the exons of all genes are relatively conserved except ycf1. Among the noncoding regions, the regions with large nucleotide differences were mainly the spacer region and introns of 5000–15,000 bp, 25,000–35,000 bp, 40,000–50,000 bp, 70,000–80,000 bp, and 110,000–120,000 bp.

To quantify the nucleotide polymorphisms in the nine chloroplast genomes in more detail and to identify the mutation hotspot regions among them, we calculated the nucleotide polymorphisms (Pi) in a window of 800 bp in length from the aligned matrices of the nine chloroplast genomes. The results showed that the Pi values ranged from 0 to 0.02267, with the lowest Pi value of 0.01583 in the window with the top 5% of the Pi values. A total of 11 mutational hotspot regions that could serve as potential molecular markers were identified (Pi > 0.01583): trnH^GUG-psbA, trnK^UUU-rps16, rps16-trnQ^UUG, petN-psbM, trnE^UUC-trnT^GGU-psbD, clpP1-psbB, ndhF-rpl32-trnL^UAG, ndhD-psaC-ndhE, and 3 regions in ycf1 (0–338 bp, 1900–3892 bp, and 4292–4955 bp) (Fig. 8). All 11 mutation hotspot regions were located in the LSC and SSC regions, and the Pi values of the IR regions were lower and more conserved.

Phylogenetic analysis

The topology of the phylogenetic tree constructed using maximum likelihood (ML) and Bayesian inference (BI) based on the 38 chloroplast genomes in this study is consistent, and the two are shown together (Fig. 9). Thirty-six Primulina species were clustered in the phylogenetic tree as a fully supported clade. In the study of Kong et al.¹³, Primulina was classified into Clades A, B, C, and D. The phylogenetic tree constructed in the present study contained three major clades that received full support, roughly corresponding to Clades B, C, and D as delineated by Kong et al. The Kong et al. study differs from the present study in that P. secundiflora was located in Clade D and clustered with P. vestita in a branch that received little support (BS = 78; PP < 0.50) in the present study, whereas in Kong et al., P. secundiflora was located in Clade B and formed a fully supported branch with a branch comprising P. maguanensis and P. grandibracteata.

Four of the nine Primulina species analysed in this study are located in Clade B, one is in Clade C, and four are in Clade D, and the genetic distances among them further support this result (Fig. 10). Among the three new taxa of Primulina reported in Guizhou Province, China, in recent years, P. silaniae and P. malingheensis are located in Clade B, and P. flexusa is located in Clabe D. In the available chloroplast genome data, P. silaniae is located at the base of Clade B. Two points are worth noting: first, P. liboensis (PP954905) in Clade C did not cluster with P. liboensis (MF177039) but instead formed a fully supported clade with P. sclerophylla; second, P. tenuituba (PP954910) in Clade D did not cluster with P. tenuituba (MW245830) but instead formed a highly supported (BS = 99; PP = 0.94) clade with P. flexusa.

Divergence time estimation

Estimated divergence times for these species are based on 36 Primulina chloroplast genomes and two Petrocodon chloroplast genomes. Primulina and Petrocodon diverged about 14.57 million years ago (Mya), with a 95% highest posterior density (HPD) of 11.06–18.76 Mya (Fig. 11). The three major branches of Primulina begin to diverge in the Miocene, with most of the species of clade B and clade C formed in the Miocene and Pliocene, and most of the species of clade D formed in the Pleistocene. Species of Primulina in Guizhou Province diverged from 6.54 to 0.52 Mya, with most of them formed in the Pleistocene and diverging rapidly in a short period of time (divergence time: 2.48 to 0.52 Mya).

Discussion

In this study, the chloroplast genomes of the nine Primulina species were relatively conserved in size, structure, gene type, and content, and all nine chloroplast genomes, ranging from 152,869 to 153,364 bp, displayed a quadripartite structure, which aligns with the general range of chloroplast genome sizes of 115–165 kp for photosynthesizing plants ³⁸. Each chloroplast genome was annotated with 87 PCGs, 37 tRNAs, and 8 rRNAs, of which 23 genes possessed introns, in agreement with the findings of Hsieh et al. ²⁶. Notably, Feng et al. annotated the chloroplast genomes of three Primulina species [P. eburnea (MF177038), P. huaijiensis (MF472012), and P. linearifolia (MF472013)] ²⁵. In contrast to the findings of the present study, Feng et al. found that the rps19 and trnG^UCC genes were absent, and only one exon was present in the rpl16, petB, and petD genes according to their annotation results for the three chloroplast genomes. Annotating the chloroplast gene sequences of these three species using the reference genomes of the present studyrevealed that they included the rps19 and trnG^UCC genes, and that the rpl16, petB and petD genes had two exons and one intron. Differences in the reference genome annotation information are likely responsible for the differences in the annotation results.

SSRs are the most commonly used and effective molecular markers³⁴. The SSRs identified in the nine chloroplast genomes in this study were dominated by mononucleotide, dinucleotide, trinucleotide, and tetranucleotide repeats. The repetitive units were mainly composed of either A or T, with A/T repeats and AT/AT repeats predominating, consistent with the major SSR repetitive unit types identified in the chloroplast genomes of other species in Gesneriaceae^26,39. In contrast to the chloroplast genomic SSRs identified by previous research on Primulina^26,39, SSRs of all six length types were detected in only P. silaniae. This may be related to the particular systematic position of P. silaniae, which is at the base of Clade B in the available Primulina chloroplast genomes (Fig. 9)¹³, and which may be a transitional taxon between Clades A and B. At present, there is a lack of data on the chloroplast genomes of the taxa in Clade A, as delineated by Kong et al. In the future, the chloroplast genomes of taxa in this clade could be added to study the similarities and differences of SSRs among different clades. The SSR loci identified in this study were mainly distributed in the LSC region, but there was no pattern in the percentage of the length in the LSC, SSC, and IR regions, and there was no preference for the LSC region. The distribution of SSRs in the LSC, SSC, and IR regions is likely to be random.

Codon usage bias analysis revealed that the frequencies of the 64 codons in the nine chloroplast genomes were essentially similar, with 30 highly preferred codons, 2 without biased codons, and 32 fewer preferred codons detected. Highly preferred codons ended in A/U, except for UUG (encoding leucine), while less preferred codons mostly ended in C/G, corroborating the lower GC₃ observation (Table 1). These results are similar to those of previous studies^36,39,40, supporting a high degree of consistency in chloroplast genome codon usage bias in Gesneriaceae, as suggested by Hsieh et al.²⁶, and further supporting the hypothesis that codons in higher plants tend to have A/T(U) endings^41,42,43.

Expansion, contraction, and loss of the IR regions cause changes in the size of the chloroplast genome, leading to gene duplication, gene deletion, and the generation of pseudogenes^44,45,46,47. The SC/IR boundaries of the nine chloroplast genomes in this study were relatively conserved, and no obvious expansion or contraction of the IR regions was detected, which is generally consistent with the results of SC/IR boundary analyses in other species of Gesneriaceae³⁹. Although the SC/IR boundaries of the nine chloroplast genomes were relatively conserved, they remain distinct in some aspects and are broadly consistent with the phylogenetic relationships. The SC/IR boundaries of Primulina flexusa, P. secundiflora, P. tenuituba, and P. vestita were nearly identical and distinguishable from those of the other five species, perhaps because of the close affinity of these four species.

Due to the simple structure, moderate evolutionary rate, and conserved sequence of the chloroplast genome, many genes, introns, and spacer regions have been exploited to construct phylogenetic relationships at different taxonomic levels^48,49,50. In studies of Primulina, trunk branches of phylogenetic trees constructed using common molecular markers did not receive high support¹³, and the development of additional molecular markers is therefore imperative. In this study, the alignment of the nine chloroplast genomes showed that the IR regions were relatively conserved, and the regions with large nucleotide differences were the ycf1 gene and the noncoding regions of the LSC and SSC regions. After further quantification of nucleotide polymorphisms, eleven mutational hotspot regions were identified: trnH^GUG-psbA, trnK^UUU-rps16, rps16-trnQ^UUG, petN-psbM, trnE^UUC-trnT^GGU-psbD, clpP1-psbB, ndhF-rpl32-trnL^UAG, ndhD-psaC-ndhE, and three regions of ycf1. The extent of the three highly mutated regions of the ycf1 gene was localized, of which the 0–338 bp and 4292–4955 bp regions were in general agreement with the findings of Hsieh et al.²⁶. Both regions were < 800 bp in size, which is convenient for amplification and sequencing and is conducive to the development of molecular markers for Primulina.

The phylogenetic relationships of the 36 Primulina species in this study are largely consistent with those of Kong et al.¹³, and the three main clades correspond roughly to Clades B, C, and D, as delineated by Kong et al.; however, there are still some differences. Primulina secundiflora was located in Clade D in this study, with the P. secundiflora material collected from its type locality, Kweichow, Ching-Che, Hwa-Chiao (now Huaqiao village, Baihuahu town, Guanshanhu District, Guizhou Province, China)⁵¹. Kong et al. reported that P. secundiflora was located in clade B and formed a fully supported branch with branches comprising P. maguanensis and P. grandibracteata. However, the small corolla and small linear bracts of P. secundiflora differ markedly from the larger corolla and large, suborbicular, or broadly ovate to narrowly ovate bracts of P. maguanensis and P. grandibracteata^52,53. During the compilation of Plants of Gesneriaceae in China and Gesneriaceae of South China, researchers identified a plant collected in Xingyi city, Guizhou Province, China, as P. secundiflora^54,55, which was published by our research team as a new taxon and named P. malingheensis²⁰ because of its larger corolla and larger and ovate bracts, which are clearly different from those of P. secundiflora. Primulina malingheensis is morphologically more similar to P. maguanensis and P. grandibracteata, and in this study, it is located in Clade B¹³. Kong et al. likely misidentified P. malingheensis as P. secundiflora because of the records of Plants of Gesneriaceae in China and Gesneriaceae of South China. There are many species in Primulina, and the classification is confusing. Li¹⁴ divided the species in the genus that are phylogenetically close and have similar morphological characteristics into several complexes, but the definition of species within the complexes is still highly problematic. According to Li¹⁴, P. secundiflora, P. tenuituba, P. vestita, and P. flexusa in this study should belong to the P. secundiflora complex, while P. liboensis and P. sclerophylla should belong to the P. sclerophylla complex. Within the P. secundiflora complex, P. tenuituba is widely distributed in Guizhou Province, Hunan Province, and Hubei Province, China, and there are some differences in the morphological characteristics of P. tenuituba from different localities, with those from Shennongjia Forestry District, Hubei Province being more similar to P. vestita and slightly different from P. tenuituba from Huayuan County, Hunan province. In this study, P. tenuituba was represented by only two populations from Guizhou Province (Guiyang city and Tongren city) and did not form a monophyletic lineage. The interspecific relationships of the P. secundiflora complex need to be investigated by adding additional samples of P. tenuituba from Hunan province and Hubei province. The P. sclerophylla complex is widely distributed in the Guangxi Zhuang Autonomous Region and southwestern Guizhou Province, China, and contains 20 species, the vast majority of which are distinguished from each other using continuously variable traits¹⁴. In this study, the P. sclerophylla complex involved only two P. liboensis populations from Pingtang County and southeastern Guizhou Province and one P. sclerophylla population from Hechi city, Guangxi Zhuang Autonomous Region, and the two P. liboensis populations did not form a monophyletic lineage. The interspecific relationships within the P. sclerophylla complex need to be investigated by adding additional samples from the type locality of the species within the complex.

In this study, the divergence time of Primulina was 14.57 Mya (with a 95% HPD of 11.06–18.76 Mya) (Fig. 11), which is much smaller than that of previous studies ^56,57. We speculate that the reason for this discrepancy may be that the studies of Woo et al.⁵⁶ and Perret et al.⁵⁷ did not include close relatives of Primulina in their analyses. In their study, the sister taxa of Primulina are species of Streptocarpus, which is more distantly related to Primulina, and much data on other genera is missing between the two genera⁸, leading to the speculation of an older divergence time of Primulina. In addition, we hypothesize that because the species of Clade A, which diverged earlier, was missing from the present study, a younger crown group age of Primulina (13.06 Mya, with a 95% HPD of 9.90–16.37 Mya) was obtained than in the previous study (14.14 Mya, with a 95% HPD of 10.79–19.09 Mya) ¹³. In subsequent studies, the chloroplast genome of the species of Clade A could be added to estimate the divergence time to verify whether the speculation is correct.

Conclusion

In this study, the complete chloroplast genomes of nine Primulina species were assembled, of which the chloroplast genomes of seven species were reported for the first time. A total of 375 SSRs and 375 INEs were identified in nine chloroplast genomes, providing additional molecular data for Primulina that can be used to analyze genetic diversity. The usage frequency of codons was broadly similar across the nine chloroplast genomes, with highly preferred codons of the same 30 types, which is similar to the results of a previous study. The SC/IR boundaries were relatively conserved, and no obvious expansion or contraction of the IR regions was detected. In addition, a total of 11 mutational hotspot regions that could serve as potential molecular markers were identified: trnH^GUG-psbA, trnK^UUU-rps16, rps16-trnQ^UUG, petN-psbM, trnE^UUC-trnT^GGU-psbD, clpP1-psbB, ndhF-rpl32-trnL^UAG, ndhD-psaC-ndhE, and three regions of ycf1 (0 bp–338 bp, 1,900 bp–3,892 bp, and 4,292 bp–4,955 bp). The two mutational hotspot regions near the 3′ and 5′ ends of the ycf1 gene were of appropriate size and are recommended as molecular markers for phylogenetic studies of Primulina. The phylogenetic tree based on complete chloroplast genomes strongly supports Primulina as a monophyletic taxon, but the interspecific relationships of the P. secundiflora complex and the P. sclerophylla complex still need to be investigated by adding more samples from more provenances. The results of molecular clock analyses indicate that the three major branches of Primulina begin to diverge in the Miocene, and the number of species proliferated in the Pliocene and Pleistocene. Most of the species of Primulina in Guizhou Province were formed in the Pleistocene and rapidly diverged within a short period of time.

Materials and methods

Plant materials and DNA extraction

The materials of the nine species were collected from Guizhou Province, China (Table 2). Mature leaves free of pests and diseases were collected in the field, washed, cleaned, sealed, and stored in a refrigerator at − 40 °C. The collection of plant materials was collected following international, national, and local legislation. Dr. Xinxiang Bai identified these plants, and the voucher specimens were stored in the Tree Herbarium of Guizhou University College of Forestry (GZAC). Total DNA was extracted from each of the nine samples using the GTAB method. DNA quality and purity were tested using 1% agarose gel electrophoresis and a fluorescence quantification instrument (Thermo Qubit 4.0).

Table 2 Sampling information of the nine Primulina species.

Full size table

Chloroplast genome sequencing, assembly and annotation

Sangon Biotech Co., Ltd. (Shanghai, China) used the Hieff NGS® MaxUp II DNA Library Prep Kit for Illumina® (Yeasen Biotechnology Co., Ltd., Shanghai, China) to prepare DNA libraries based on DNA samples of acceptable purity. The concentrations of the DNA libraries were determined using a fluorescence quantification instrument (Thermo Qubit 4.0). DNA libraries were sequenced using the Illumina MiSeq System (Illumina) to obtain the raw data. The raw data were cropped and filtered using FASTQ 0.36⁵⁸ software to obtain clean data. Comparison of clean data with published sequences of Primulina in the NCBI database. De novo genome assembly was performed from the clean data using SOAPdenovo 2⁵⁹ and NOVOPlasty⁶⁰. Primulina liboensis (GenBank accession number: NC_036101) was used as the reference genome, and the assembled chloroplast genome was annotated using the GeSeq online tool (https://chlorobox.mpimp-golm.mpg.de/geseq.html)⁶¹. CDS start and stop codons in the annotation file were manually checked and adjusted using Geneious 9.0.2⁶². The type, length, and anticodon name of the tRNA genes in the annotation file were manually checked and adjusted using tRNAscan SE (http://lowelab.ucsc.edu/tRNAscan-SE/)⁶³. Annotation information was visualized using OGDRAW v.1.3.163⁶⁴.

Repeat sequence analysis

SSRs were identified using Krait⁶⁵ with the following parameter settings: mononucleotide ≥ 10, dinucleotide ≥ 5, trinucleotide ≥ 4, tetranucleotide ≥ 3, pentanucleotide ≥ 3, and hexanucleotide ≥ 3. INEs were identified using the REPuter online tool (https://bibiserv.cebitec.uni-bielefeld.de/reputer)⁶⁶, with the minimum repeat size set to 30 bp and a sequence identity of 90% (Hamming distance of three).