Background & Summary

The Malvaceae family encompasses both economically vital crops and ornamental species. Economically, it includes globally significant crops such as cotton (Gossypium spp.) for natural fiber production and cacao (Theobroma cacao) for chocolate manufacturing1,2,3. Ornamentally, the genus Hibiscus (e.g., H. syriacus, H. mutabilis) has been extensively studied for its horticultural value. As a member of the Malvaceae, hollyhock (A. rosea) holds unique dual significance. Historically cultivated in China for over two millennia, it was documented in ancient pharmacopeias like Compendium of Materia Medica for its edible and medicinal properties. Modern studies have identified bioactive compounds in hollyhock, including flavonoids and polysaccharides with demonstrated anti-inflammatory and antimicrobial activities4,5,6,7. Beyond its traditional uses, hollyhock exhibits exceptional floral diversity, with petal colors ranging from white to black and intricate variegation patterns, making it an ideal system for studying pigmentation genetics (Fig. 1).

Fig. 1
figure 1

The diverse color and petal types of A. rosea.

Prior genomic efforts within the Malvaceae family have predominantly focused on economically pivotal species. The genus Gossypium (cotton) boasts multiple high-quality genomes, including the diploid G. raimondii and tetraploid G. hirsutum, which revolutionized fiber development studies8,9. Similarly, the cacao genome and its re-sequencing efforts have accelerated research on flavonoid biosynthesis10,11. For ornamental taxa, chromosome-level assemblies of H. syriacus12 and H. mutabilis13 revealed genetic bases for floral color transition and woody growth.

Notably, A. rosea genomic exploration remains limited. Previous studies have conducted transcriptome sequencing on single-petaled and double-petaled red-flowered A. rosea, identifying several genes potentially associated with double-petaled flower formation14,15. Additionally, Wang et al.16. performed genome sequencing of A. rosea (genome data yet to be published) and resequenced 32 germplasm resources to investigate its genetic diversity. This contrasts with related genera like Ceiba pentandra (kapok tree) and Bombax ceiba, whose genomes enabled comparative analyses of fiber evolution17. The lack of a hollyhock nuclear genome hinders exploration of its hallmark traits: petal pigmentation diversity and mucilage biosynthesis pathways.

Despite these advances in related species, A. rosea genomic exploration remains limited. To address this, we present the first chromosome-level genome assembly of A. rosea. Combining PacBio HiFi (52x) sequencing, Hi-C scaffolding, and RNA-seq annotation, we generated a 1.01 Gbp genome with scaffold N50 of 52.57 Mbp, anchoring 92.6% (1.01 Gbp) onto 21 chromosomes. We annotated 51,436 protein-coding genes, providing a critical resource for elucidating the genetic basis of hollyhock’s ornamental and medicinal traits.

Methods

Sampling

The plant seed material for genome sequencing originated from wild Hollyhock seeds preserved in The Germplasm Bank of Wild Species in Southwest China, collected from Songpan, Aba, Sichuan, with the collection number SCU-10-427. Seeds were sown in plug trays during March and maintained in the nursery of Chengdu Botanical Garden (30°46′N, 104°07′E) under controlled conditions: temperature 15–20 °C, natural photoperiod, and peat-based substrate (pH 5.5–6.0). At one month post-sowing, seedlings developed approximately 3-4 true leaves, from which leaf samples were collected for preliminary genomic analysis.

Seedlings were then transplanted into 14-cm pots under adjusted conditions: temperature 20–25 °C (other parameters unchanged). By four months post-sowing, plants entered a rapid vegetative growth phase (25–35 °C), and young leaves were harvested for genome sequencing.

At 14 months post-sowing (mature flowering stage, 20–30 °C), multiple tissues (leaves, stems, fruits, flowers, and roots) from the same individual were collected for transcriptome sequencing.

Chromosome fluorescence by in situ hybridization

Take the root of A. rosea, which has active meristematic tissue, and use nitrous oxide to induce cell mitosis. After obtaining mid-phase cells, prepare chromosome specimens and perform fluorescence in situ hybridization (FISH) using fluorescent probes based on the conserved repetitive sequences of telomeres, 5SrDNA and 18SrDNA. After DAPI staining, observe chromosomes under a fluorescence microscope to determine the karyotypic features of the species. Through chromosome squashing of root tips and FISH, it was revealed that A. rosea has a total of 42 chromosomes, with lengths ranging from 0.8 to 2.0 μm. The majority of these chromosomes were found to be metacentric or submetacentric. Using 5S rDNA and 18S rDNA repeated sequence probes for FISH analysis, we identified 8 chromosomes with strong 5S rDNA (red) hybridization signals and 6 chromosomes with strong 18S rDNA (green) hybridization signals, confirming the diploid nature of the sample with a karyotype of 2n = 2x = 42 (Fig. 2).

Fig. 2
figure 2

Chromosomal Fluorescence In Situ Hybridization (FISH) Images of A. rosea, AB: Telomere Fluorescence In Situ Hybridization (FISH) results, showing in situ hybridization of telomeric repeat sequences (green). CD: Chromosomal rDNA Fluorescence In Situ Hybridization (FISH) results, displaying 5S rDNA in red (indicated by arrows) and 18S rDNA in green (indicated by asterisks). Scale bar: 5 μm.

Genome size estimation

To estimate the genome size of A. rosea, 100X short read data (PE150; DNBSEQ) performed K-mer analysis, and SOAPnuke (v 1.5.6)18 was used to filter the original sequencing data, and the reads with low quality, adaptor contamination and PCR duplication were filtered out. The remaining clean reads were reused for subsequent analysis. The 17,21 kmer frequency is quickly counted using the Jellyfish (v2.1.4)19, then GenomeScope (v1.0)20 is used to fit the 17,21 kmer spectrum and estimate the genome’s characteristics. The genomic survey, employing a negative binomial distribution model, estimated the species’ genome size to be around 1.2 Gbp with a heterozygosity rate of 0.41% and 76.3% repeat content, indicative of a low heterozygous, high repeat genome (Fig. 3).

Fig. 3
figure 3

17,21-kmer Distribution Curves. Note: The x-axis represents the k-mer depth (coverage), while the y-axis denotes the frequency of k-mers at that particular depth.

Genome sequencing

DNA was extracted using the long - fragment magnetic bead method. The procedure was as follows: Take about 0.3–3 g of tissue sample, grind it into a flour - like consistency in liquid nitrogen, and transfer it into a 5 mL centrifuge tube. Add 3 mL of Tissue Lysis Buffer 1, mix by shaking, and incubate at 65 °C for 30 minutes. After incubation, centrifuge at low temperature for 10 minutes, transfer the supernatant to a new corresponding centrifuge tube, add 1 mL of Precipitation Buffer 2, incubate at low temperature for 5 minutes, centrifuge at low temperature for 10 minutes, and transfer the supernatant to a 2 mL centrifuge tube. Add 900 µL of Lysis Supernatant to each tube, then add 900 µL of DNA Binding Buffer and 50 µL of Magnetic Bead Buffer. Gently invert and mix several times, centrifuge briefly for 2 seconds, let stand at room temperature for 5 minutes to allow binding. Place the 2 mL centrifuge tubes on a magnetic rack until the magnetic beads are adsorbed and the liquid becomes clear. Discard the supernatant, wash the magnetic bead precipitate twice with 75% ethanol, air - dry at room temperature, and add 80–300 µL of pre - warmed Elution Buffer at 50 °C. Gently tap the centrifuge tube to resuspend the magnetic beads, let stand at room temperature for 5 minutes, place on the magnetic rack until the liquid is clear, transfer the supernatant to a new 1.5 mL centrifuge tube, and store it in a refrigerator at 4 °C for nucleic acid quality control. After being disrupted by Megaruptor (approximately 15–20 K), the DNA underwent damage repair and end repair before being ligated with known adaptors. Following enzymatic digestion, selection, and final transformation into a dumbbell-shaped HiFi library, it was subjected to quality checks. Post-qualification, the sample underwent PacBio Sequel II sequencing, subsequent data analysis, and interpretation of results.

Transcriptome sequencing

Transcriptome sequencing is used to assist in genome annotation. Total RNA was extracted from flowers, leaves, roots, stems, and fruits of the same hollyhock plant using the CTAB method21. mRNA was then enriched from the total RNA to construct strand-specific transcriptome libraries. These libraries were sequenced using the high-throughput DNBSEQ platform, followed by bioinformatics analysis.

Genome assembly and Hi-C analysis

The PacBio Sequel II platform generated 62.48 Gbp of HiFi long-read data (52 × coverage; 33.117 Gbp and 29.370 Gbp from two cells, respectively), which were assembled with Hifiasm (v0.19.6)22. The resulting contigs spanned 1.09 Gbp with an N50 of 36.61 Mbp. The genome assembly quality was assessed using BUSCO (v5.1.2)23, with the completeness evaluation library being eudicots_odb10. BUSCO assessment showed that the contig genome completeness was 99.5% (Table 1, Fig. 4).

Table 1 Statistics of the evaluation results of genome BUSCO.
Fig. 4
figure 4

The genomic features and Hi-C map of A.rosea. (a) genomic features of A. rosea (From outside to inside are genomic collinearity, GC content, gene number, repeat sequence features (transposon, DNA_class.stat, Copia, gypsy) and chromosomes), (b) Hi-C map of the A. rosea, Color intensity indicates the frequency of Hi-C interaction links from low (yellow) to high (red).

Hi-C libraries were prepared using DpnII restriction enzyme and sequenced on the DNBSEQ platform, producing 291.85 Gbp of 150 bp paired-end reads. To anchor contigs onto chromosomes, the Hi-C clean data were mapped to the assembled contigs using BWA (v0.7.12)24, and then erroneous mappings (MAPQ = 0) and duplicates were filtered by the Juicer25 pipeline (v1.5.6) to obtain the interaction matrix. Following, approximately 1.01 Gbp reads were used to anchor the contigs into chromosomes with 3D-DNA26 pipeline (v180,922). And 3D-DNA pipeline was used to remove select short contigs using default parameters. The Hi-C contact maps were then reviewed with Juicebox27 Assembly Tools (v2.15.07).

The chromosome-level genome assembly of A. rosea was generated by integrating PacBio HiFi long reads (62.48 Gbp, 52 × coverage) and Hi-C data (291.85 Gbp). The final assembly spans 1.01 Gbp across 21 scaffolds (scaffold N50 = 52.57 Mbp), with a contig N50 of 36.61 Mbp and a BUSCO completeness of 99.6% (eudicots_odb10) (Table 1).

Repeat sequence annotation

The methods for repeat sequence annotation are categorized into two types: homology-based alignment and de novo prediction. Homology-based alignment relies on the RepBase28, RepeatMasker (v4.1.2)29 is used to identify and classify sequences similar to known repeat sequences. For de novo prediction, a novel repeat sequence library is first constructed using software such as RepeatModeler (v2.0.3)30 and LTRharvest (v1.6.2)31, and then predictions are made using RepeatMasker. In addition, TRF (4.10.0)32 is utilized to locate tandem repeat sequences within the genome. A total of 565.84 Mbp (56% of the genome) of repetitive sequences were identified (Table 2), with transposable elements (TEs) predominating, among which LTR elements accounted for 48.44% of the genome (Table 3).

Table 2 Statistical results of repeat repeats.
Table 3 Classification of TEs.

Gene structure prediction

Gene structure prediction integrates three approaches: homology-based prediction, de novo prediction, and transcriptome-assisted prediction. Homology-based prediction utilizes annotation information from six closely related species: H. cannabinus, G. raimondii, G. hirsutum, Corchorus. capsularis, C. olitorius, and T. cacao. Firstly, GeMoMa (v1.9)33 is employed for homology prediction. Based on the results of homology prediction, genes with complete structures are selected for training the de novo prediction software, Augustus (v3.0.3)34 and SNAP (v11/29/2013)35. RNA-seq data is aligned using HISAT (v2.0.4)36 and assembled into transcripts with StringTie (v1.2.2)37. Finally, EvidenceModeler (v1.1.1)38 is used to consolidate all data, resulting in the final gene set. The completeness and proportion of fully predicted genes are assessed by comparing the gene set against the embryophyta_odb10 database using BUSCO. Ultimately, 51,436 genes were annotated. Among these predicted genes, the average gene length and coding sequence (CDS) length were 2739.92 bp and 1242.54 bp, respectively (Table 4). The average number of exons per gene was 5.49, with average exon and intron lengths of 226.23 bp and 333.31 bp, respectively. The BUSCO completeness assessment of the gene set revealed coverage of 1,588 conserved proteins, with a completeness of 98.4%, including 853 (52.9%) single-copy genes and 735 (45.5%) multi-copy genes (Table 1).

Table 4 Gene structure prediction statistics.

Functional annotation of genes

Functional annotation of genes is performed using InterProScan (v5.28-67.0)39 to search secondary structural ___domain databases and obtain gene function information. The comparison databases used include SwissProt40, TrEMBL, KEGG41, InterPro, NR, KOG, and GO42. The functional annotation statistics of the predicted genes in seven databases (Nr, Swissprot, KEGG, KOG, TrEMBL, Interpro, GO) are shown in the Table 5. The Venn diagram illustrating the number of gene annotations across five databases (Nr, Interpro, KEGG, Swissprot, and KOG) is shown in the Fig. 5b, with 33,086 genes annotated in all five databases. The statistical results of KOG functional annotation categorization indicated that the five most represented gene functions were ‘General function prediction only’, ‘Signal transduction mechanisms’, ‘Transcription’, ‘Function unknown’, and ‘Posttranslational modification, protein turnover, chaperones’ (Fig. S1). The KEGG pathway annotation results showed that the five most involved metabolic pathways were global and overview maps, carbohydrate metabolism, translation, folding, sorting, and degradation, and signal transduction - membrane transport (Fig. S2). The GO secondary node annotation classification statistics indicated that the most represented categories in terms of gene number were binding, cellular processes, catalytic activity, metabolic processes, and cellular component (Fig. S3).

Table 5 Genome function annotation result.
Fig. 5
figure 5

Gene structure prediction and gene function annotation. (a) Statistical plots of the prediction results of the gene structure, (b) Gene functional annotation results for number statistics in five databases: NR, InterPro, KEGG, SwissProt and KOG.

Data Records

The genome sequence data used for genome assembly and annotation have been deposited in the Genome Sequence Archive (GSA) of the National Genomics Data Center (NGDC) under accession number CRA02432643. Both the raw reads and chromosome assembly have been submitted to NCBI with the accession number of SRP59090344 and GCA_050717195.145. Annotation files can be accessed on Figshare46.

Technical Validation

We assessed the genome integrity using BUSCO, using the comparison library eudicots_odb10, the assembly integrity of contig genome was 99.5%, and the assembly integrity of scaffold genome was 99.6%, and the number and proportion of fully predicted genes were evaluated. The results showed that the gene set fully covered 1,588 conserved proteins (98.4% integrity), of which 863 (52.9%) were single copies and 735 (45.5%) were multiple copies. The three evaluation results showed that the genome assembly quality of hollyhock was high.