Abstract
Oxalis articulata is a widely distributed perennial herb known for its adaptability to diverse environmental conditions. Here, we report the first haplotype-resolved, chromosome-scale genome assembly for O. articulata (Oxalidaceae), generated through PacBio HiFi long reads and Hi-C technology. The resulting assembly comprises two haplotypes with sizes of 377.04 Mb (contig N50 = 31.13 Mb) and 342.70 Mb (contig N50 = 47.46 Mb), respectively. We predicted 36,063 protein-coding genes (94.33% functionally annotated) and 38,292 protein-coding genes (93.80% functionally annotated) in the two haplotypes, respectively. This high-quality genome assembly lays a solid groundwork for future initiatives in ecological and biochemical research for O. articulata.
Background & Summary
The genus Oxalis (Oxalidaceae) comprises over 500 species of herbaceous plants1,2,3, many of which are renowned for their ornamental value, ecological adaptability, and medicinal properties4,5. Among these, Oxalis articulata Savigny (Fig. 1a–c), commonly known as pink-sorrel or pink wood sorrel, is a perennial species native to South America but now naturalized globally in tropical and subtropical regions6,7,8. This species exhibits prolific clonal propagation via hard, segmented woody rhizomes. This reproductive strategy enables rapid colonization and confers adaptive resilience to heterogeneous environments, which are key drivers of its ecological dominance in both native and invaded ranges9,10,11. These rhizomes facilitate vigorous spread and make the plant difficult to eradicate once established12. In addition to its horticultural and ecological impact, O. articulata has attracted interest in its bioactive compounds. Studies on this species have demonstrated significant antioxidant activity in its extracts, and it has also shown direct antimicrobial effects13. However, its aggressive invasiveness in regions such as East Asia and Australia poses significant challenges to local biodiversity9,14. A comprehensive understanding of its genomic architecture could elucidate the genetic basis of both its adaptive advantages and ecological impacts.
Despite its ecological and biochemical significance, genomic resources for O. articulata remain scarce. To date, only partial plastid genome sequences are available for the genus Oxalis15,16,17, limiting insights into its adaptive evolution, invasion mechanisms, and metabolite biosynthesis. High-quality chromosome-level genomes have revolutionized research in other non-model plants, enabling the identification of key genes involved in stress responses, invasion-related trait evolution, and species divergence18,19. By contrast, the absence of a high-quality genome for O. articulata hinders comparative analyses within Oxalidaceae and obscures the genomic basis of its ecological success.
In this study, we present the first haplotype-resolved, chromosome-scale genome assembly of O. articulata (Fig. 2), constructed through an integrative approach combining PacBio HiFi sequencing and Hi-C technology. The initial assembly, generated with hifiasm20, produced highly contiguous allelic sequences, which were anchored into 14 pseudochromosomes using YaHS21 (Fig. 3), consistent with the reported chromosome number of the species22,23. Haplotype phasing was performed with SubPhaser24, leveraging repetitive k-mer signatures to resolve two distinct haplotypes (Fig. 4). As the first published genome within the genus Oxalis, this haplotype-resolved, chromosome-level assembly provides a foundational resource for comparative evolutionary studies, ecological research, and functional genomics, advancing genomic investigations in this under-characterized plant lineage.
The landscape of genome assembly and annotation of O. articulata. Tracks from innermost to outermost correspond to (a) synteny information; (b) gene density; (c) GC content; (d) repeat density; (e) chromosomes of O. articulata. In the heatmap, blue represents a high percentage, while white indicates a low percentage.
Methods
Plant materials and genomic sequencing
All samples were collected from naturally growing O. articulata plants on the campus of Nanjing Forestry University (32°4′44″N, 118°48′31″E). Fresh young leaves were collected from selected individuals for genome sequencing. For transcriptome RNA extraction, samples of young leaves, current-year branches, and flowers were also gathered. To ensure preservation, all samples were rapidly flash-frozen in liquid nitrogen immediately after collection and stored at −80 °C.
For PacBio long-read sequencing, genomic DNA was extracted using a modified Cetyltrimethylammonium Bromide (CTAB) method25. The fragmented DNA was purified with AMPure PB magnetic beads, and a HiFi library was constructed using the SMRTbell prep kit 3.0 (PacBio, USA). Sequencing was carried out on the PacBio Revio platform (PacBio, USA), yielding 44.50 Gb of Circular Consensus Sequencing (CCS) reads, corresponding to approximately 119.30 × coverage (Table 1).
The Hi-C library preparation involved fixing fresh O. articulata young leaves in 2% formaldehyde26. The fixed tissues were homogenized and centrifuged to isolate nuclei. The cross-linked chromatin was digested with DpnII, labeled with biotin, and ligated with T4 DNA ligase. After de-crosslinking, the DNA was purified and fragmented into 300–500 bp segments. Fragments containing interaction sites were enriched using streptavidin magnetic beads to construct the Hi-C library. High-throughput sequencing on the DNBSEQ-T7 platform (BGI, China) produced 16.36 Gb of Hi-C reads, equating to approximately 43.87 × coverage (Table 1).
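The depth figures reported for the HiFi and Hi-C libraries follow from dividing the sequenced bases by the haploid genome size. The sketch below reproduces that arithmetic; the 373 Mb denominator is an assumption chosen for illustration (the text does not state which genome size was used, and the GenomeScope estimate reported later is 374 Mb), and the function name is hypothetical.

```python
def coverage(bases_gb: float, genome_mb: float) -> float:
    """Sequencing depth = total sequenced bases / haploid genome size."""
    return bases_gb * 1000.0 / genome_mb


# Assumed haploid genome size in Mb (illustrative; not stated in the text).
GENOME_MB = 373.0

hifi_depth = coverage(44.50, GENOME_MB)  # HiFi yield reported in the text
hic_depth = coverage(16.36, GENOME_MB)   # Hi-C yield reported in the text

print(f"HiFi depth: ~{hifi_depth:.2f}x")
print(f"Hi-C depth: ~{hic_depth:.2f}x")
```

Under this assumed genome size, both values land close to the reported 119.30× and 43.87×; a slightly different denominator would shift them proportionally.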
For transcriptome sequencing, RNA was extracted from plant tissues using the DP411 and DP762-T1C kits (TIANGEN, China). mRNA was subsequently purified using the Dynabeads mRNA Purification Kit (Invitrogen, USA). RNA libraries were prepared and sequenced on the DNBSEQ-T7 platform (BGI, China), resulting in 19.38 Gb of RNAseq reads, which were utilized for the genome annotation (Table 1).
Genome size estimation
The genome size was estimated by counting 25-mers in the HiFi reads using KMC v3.2.4 (ref. 27). A k-mer frequency histogram was generated with KMC tools, and genome size and heterozygosity were estimated using GenomeScope 2.0 (ref. 28). The estimated genome size is 374 Mb, with a heterozygosity rate of 3.2% (Fig. 1d).
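The principle behind k-mer-based size estimation can be illustrated with a naive calculation: the total number of non-error k-mers divided by the depth of the main histogram peak. This is a simplified sketch on a toy histogram, not GenomeScope's mixture-model fit; the function name, the error cutoff `min_depth`, and the toy counts are all assumptions for illustration.

```python
def estimate_genome_size(hist, min_depth=5):
    """Naive genome size estimate from a k-mer histogram.

    hist maps k-mer multiplicity -> number of distinct k-mers seen that often.
    Bins below min_depth are treated as sequencing errors and ignored.
    """
    solid = {depth: count for depth, count in hist.items() if depth >= min_depth}
    if not solid:
        raise ValueError("no histogram bins at or above min_depth")
    # Depth at which the most distinct k-mers occur (the main peak).
    peak_depth = max(solid, key=solid.get)
    # Total k-mer occurrences contributed by the retained bins.
    total_kmers = sum(depth * count for depth, count in solid.items())
    return total_kmers / peak_depth


# Toy histogram: an error tail at low depth and one peak near depth 30.
toy_hist = {1: 1_000_000, 2: 50_000, 29: 100, 30: 1_000, 31: 100}
toy_size = estimate_genome_size(toy_hist)
print(toy_size)
```

In a highly heterozygous genome the tallest peak can be the heterozygous one, in which case this naive estimate would be off by roughly a factor of two; that is one reason a model-based tool is preferable for real data.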
Genome assembly
PacBio HiFi reads and Hi-C short reads were combined in hifiasm v0.24.0 (ref. 20), employing the Hi-C integrated assembly mode to generate contigs for two distinct haplotypes. Subsequently, the Hi-C reads were aligned to the merged contigs for scaffolding using YaHS v1.2a.1 (ref. 21). Manual corrections were applied using Juicebox v1.11.08 (ref. 29) to rectify misassemblies and optimize contig orientations, thereby ensuring the integrity of the overall chromosome structure. The final assembly is 802.17 Mb, comprising 1,610 contigs with an N50 of 44.47 Mb (Table 2). Among them, 29 contigs, representing approximately 89.72% of the total assembly size, were anchored to 14 chromosomes, which were further phased into two haplotypes (Table 2). The remaining 1,581 unanchored contigs have a total length of approximately 82.43 Mb. As part of the scaffolding process, 200 ‘N’ bases were inserted between adjacent contigs to represent gaps of unknown size.
The corrected chromosomes were then phased using SubPhaser v1.2.6 with parameters set to -k 150 -q 2 -f 2.0, which were chosen based on recommendations from the SubPhaser documentation and benchmarking results24. These settings resulted in the final identification of Haplotype A and Haplotype B (Fig. 4). Each haploid genome comprises seven chromosomes. Haplotype A contained 20 contigs with a total size of 377.04 Mb, and Haplotype B contained 9 contigs with a total size of 342.70 Mb. The N50 lengths for Haplotype A and Haplotype B were 31.13 Mb and 47.46 Mb, respectively (Table 2). Additionally, telomeres in the assembled genomes were identified using quarTeT v1.2.5 (ref. 30). In Haplotype B, telomeres were detected on both ends of all seven chromosomes, whereas in Haplotype A, six chromosomes had telomeres on both ends and one chromosome had a telomere on only one end (Fig. 5).
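The contiguity and anchoring figures above can be checked with a few lines of arithmetic. The sketch below shows the standard N50 definition on a toy set of contig lengths (the toy values are hypothetical) and recomputes the anchored fraction from the haplotype sizes and total assembly size reported in this paper.

```python
def n50(lengths):
    """Return the N50: the length L such that contigs of length >= L
    together cover at least half of the total assembly length."""
    if not lengths:
        raise ValueError("no contig lengths given")
    half = sum(lengths) / 2
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running >= half:
            return length


# Toy contig lengths (hypothetical), in arbitrary units.
toy_contigs = [80, 70, 50, 40, 30, 20]
toy_n50 = n50(toy_contigs)

# Values reported in the text, in Mb.
anchored_mb = 377.04 + 342.70   # Haplotype A + Haplotype B
total_mb = 802.17               # full assembly
anchored_pct = 100 * anchored_mb / total_mb
unanchored_mb = total_mb - anchored_mb

print(toy_n50, round(anchored_pct, 2), round(unanchored_mb, 2))
```

The recomputed anchored percentage and unanchored length agree with the 89.72% and 82.43 Mb stated in the text.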
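To illustrate what a telomere check at chromosome ends involves, the sketch below scans the terminal windows of a sequence for tandem copies of a telomeric repeat. It is not quarTeT's algorithm: the Arabidopsis-type motif TTTAGGG, the minimum copy number, the window size, and the function names are all assumptions made for illustration, and the repeat unit of O. articulata is not stated in the text.

```python
import re

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Reverse complement of an uppercase DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def telomere_status(chrom_seq, motif="TTTAGGG", min_copies=4, window=10_000):
    """Return (left_has_telomere, right_has_telomere) for one chromosome.

    On the forward strand, the left end is expected to carry the reverse
    complement of the motif and the right end the motif itself.
    """
    seq = chrom_seq.upper()
    w = min(window, len(seq) // 2)  # keep the two windows from overlapping
    if w == 0:
        return (False, False)
    left_pat = re.compile(f"(?:{reverse_complement(motif)}){{{min_copies},}}")
    right_pat = re.compile(f"(?:{motif}){{{min_copies},}}")
    left = left_pat.search(seq[:w]) is not None
    right = right_pat.search(seq[-w:]) is not None
    return (left, right)


# Toy chromosomes (hypothetical sequences).
both_ends = "CCCTAAA" * 10 + "ACGT" * 100 + "TTTAGGG" * 10
one_end = "CCCTAAA" * 10 + "ACGT" * 100

print(telomere_status(both_ends))
print(telomere_status(one_end))
```

A chromosome with repeats at both ends corresponds to the situation described for all seven Haplotype B chromosomes, while the one-ended case corresponds to the single Haplotype A chromosome with only one detected telomere.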
Structural variation between the two haploid genomes of O. articulata. Syntenic relationships and structural variations between Haplotype A (top) and Haplotype B (bottom) chromosomes are shown. All annotations, including structural variation types and chromosome features, are indicated in the accompanying legend.
Repeat annotation
The de novo identification of transposable elements (TEs) was performed using the EDTA (Extensive de novo TE Annotator) pipeline v2.1.0 (ref. 31), with the parameters --sensitive 1 --anno 1. A total of 241.87 Mb (64.15%) of the assembled sequences were annotated as TEs in Haplotype A, including Long Terminal Repeat (LTR, 55.56%), Terminal Inverted Repeat (TIR, 3.06%) and Helitron (2.96%) elements. In Haplotype B, a total of 206.08 Mb (60.13%) of the assembled sequences were annotated as TEs, including LTR (53.15%), TIR (3.79%) and Helitron (3.51%) elements (Table 3).
Gene structure prediction and functional annotations
Gene structure prediction was carried out using a combination of ab initio and transcriptome-based approaches. RNA-Seq data were processed to obtain transcripts using the HISAT2-StringTie pipeline32,33, with transcript structure validation conducted via the PASA pipeline v2.5.2 (ref. 34). Additionally, Helixer35 was employed for de novo gene annotation, and the predicted gene structures were further validated through PASA to ensure accuracy. Combining the above steps, 36,063 and 38,292 protein-coding genes were annotated in Haplotype A and Haplotype B, respectively (Table 4).
The functional annotation of protein-coding genes was accomplished through a three-step approach. First, gene sequences were aligned to the eggNOG v5.0.2 database using eggNOG-mapper v2.1.12 (ref. 36), which annotated 94.0% of the genes, and the KEGG Automatic Annotation Server (https://www.genome.jp/tools/kaas/) was used to associate 60.02% of the genes with KEGG pathways. Second, BLAST 2.14.1+ (ref. 37) was used to compare gene sequences against three major protein databases: Swiss-Prot (69.02%), TrEMBL (93.08%), and Nr (92.70%). Finally, Gene Ontology (GO) terms (http://geneontology.org/) were assigned through Blast2GO v2.5 (ref. 38), annotating 50.02% of the genes. In total, 94.06% of the genes were annotated in at least one database (Table 5).
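The "at least one database" figure is a set union over the per-database hit lists, not a sum of the individual rates. A minimal sketch on toy data follows; the gene identifiers, hit sets, and function name are hypothetical and serve only to show the calculation.

```python
def annotation_rates(all_genes, hits_by_db):
    """Per-database annotation rate (%) plus the rate for genes
    annotated in at least one database (key "any")."""
    total = len(all_genes)
    rates = {
        db: 100 * len(hits & all_genes) / total
        for db, hits in hits_by_db.items()
    }
    # Union of all hit sets, restricted to the known gene set.
    annotated = set().union(*hits_by_db.values()) & all_genes
    rates["any"] = 100 * len(annotated) / total
    return rates


# Toy gene set g1..g10 with hypothetical hits per database.
genes = {f"g{i}" for i in range(1, 11)}
hits = {
    "eggNOG": {f"g{i}" for i in range(1, 9)},             # g1..g8
    "Swiss-Prot": {f"g{i}" for i in range(1, 7)} | {"g9"},  # g1..g6, g9
    "GO": {f"g{i}" for i in range(1, 6)},                 # g1..g5
}
toy_rates = annotation_rates(genes, hits)
print(toy_rates)
```

Because the hit sets overlap heavily, the union rate is only slightly higher than the best single database, which mirrors the pattern in the reported figures.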
Annotation of non-coding RNAs
Non-coding RNAs were identified using Infernal v1.1.5 (ref. 39) by querying against the Rfam v14.10 database with default parameters. In Haplotype A, the analysis identified 8,279 rRNAs, 621 tRNAs, and 2,245 other ncRNAs. In Haplotype B, 6,212 rRNAs, 600 tRNAs, and 2,342 other ncRNAs were identified (Table 6).
Comparison between haplotype assemblies
SyRI (Synteny and Rearrangement Identifier) v1.6.3 (ref. 40) was employed with default parameters to detect synteny, structural variations (≥50 bp), single-nucleotide polymorphisms (SNPs), and small insertions and deletions (InDels; <50 bp) between the two haplotypes. Plotsr v1.1.1 (ref. 41) was used to generate the synteny plot (Fig. 5). This analysis yielded a total of 1,371,602 SNPs and 160 InDels, comprising 80 insertions and 80 deletions. SyRI also detected 2,019 duplications (DUPs) with a total length of 9.5 Mb, 69 inversions (INVs) with a total length of 35.97 Mb, and 5,454 translocations (TRANSs) with a total length of 20.43 Mb.
Data Records
The raw sequencing data and genome assembly generated in this study have been deposited in the National Genomics Data Center (NGDC), China National Center for Bioinformation, under BioProject accession number PRJCA036235 (ref. 42). The raw sequencing data are available in the Genome Sequence Archive (GSA) under accession number CRA023700 (ref. 43). The assembled genome files are available in both the Genome Warehouse (GWH; accession numbers GWHFOPA00000000.2 (ref. 44) and GWHFOPB00000000.2 (ref. 45)) and NCBI GenBank (accession numbers JBLZXC000000000 (ref. 46) and JBLZXD000000000 (ref. 47)). The assembly and annotation files can also be downloaded from figshare48.
Technical Validation
The quality of the genome assembly was evaluated using several metrics, including Benchmarking Universal Single-Copy Orthologs (BUSCO) v5.4.7 with the embryophyta_odb10 dataset, the LTR Assembly Index (LAI), and consensus quality value (QV)49. In the assembled genome, 99.2% of the core genes were complete, counting both single-copy and duplicated genes (Table 7). For the two haplotypes, the proportions of complete core genes were 98.9% and 99.2%, respectively. The LAI values were 24.15 for Haplotype A, 32.51 for Haplotype B, and 26.92 for the assembled genome. The QV values of the haplotypes and the assembled genome were 78.86, 78.25, and 78.56, respectively. Additionally, the k-mer completeness of the assembled genome reached 98.68% (Table 7). To evaluate the completeness of the genome annotation, BUSCO analysis and OMArk50 analysis with the LUCA.h5 database were performed on the annotated gene set. The proportions of complete core genes were 99.2%, 99.1%, and 99.2% for Haplotype A, Haplotype B, and the assembled genome, respectively (Table 8). In addition, the OMArk evaluation reported high completeness (>98.1%) and no detected contaminants in all gene sets, further supporting the reliability of the annotation (Table 8).
We then evaluated the haplotype-phased assemblies by plotting the collinearity between them using JCVI v1.4.16 (ref. 51). Both haplotypes exhibited overall collinearity, with some structural variations (Fig. 6a). Notably, a large structural variation region (~15 Mb) was identified on Chr5, containing two distinct inversions and one translocation. These rearrangements were also identified in the SyRI analysis (Fig. 5). To further confirm these structural variations, we examined the Hi-C contact matrices of these two chromosomes (Fig. 6b). The strong interaction signals along the diagonal indicated well-preserved genomic organization, confirming the accuracy of the assembly.
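For readers less familiar with consensus QV, it is a Phred-scaled error rate, QV = −10·log10(P_error). The short sketch below converts the QV values reported above into approximate per-base error rates; the function names are hypothetical, and the conversion is the standard Phred definition rather than a re-run of the authors' evaluation.

```python
import math


def qv_to_error_rate(qv):
    """Phred-scaled quality value -> per-base error probability."""
    return 10 ** (-qv / 10)


def error_rate_to_qv(p_error):
    """Per-base error probability -> Phred-scaled quality value."""
    return -10 * math.log10(p_error)


# QV values reported in the text (two haplotypes, then the full assembly).
for qv in (78.86, 78.25, 78.56):
    p = qv_to_error_rate(qv)
    print(f"QV {qv}: ~{p:.2e} errors per base, ~1 error per {1 / p / 1e6:.0f} Mb")
```

At these QV values the implied consensus error rate is on the order of one base in tens of megabases.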
Code availability
All software utilized in this study was executed in accordance with the official documentation. The specific versions and parameters of the software, along with any custom code employed, are detailed in the Methods section. Any procedures not explicitly outlined in the Methods were conducted using default parameters.
References
Lourteig, A. Oxalis L. subgéneros Thamnoxys (Endl.) Reiche emend. Lourt. Bradea 7, 1–199 (1994).
Lourteig, A. Oxalis L. subgéneros Monoxalis (Small) Lourt., Oxalis y Trifidus Lourt. Bradea 7, 201–629 (2000).
Christenhusz, M. J. M., Fay, M. F. & Chase, M. W. Plants of the World: An Illustrated Encyclopedia of Vascular Plants. (University of Chicago Press, 2017).
Veldkamp, J. F. Oxalidaceae. Flora Malesiana - Series 1, Spermatophyta 7, 151–178 (1972).
Vaio, M., Gardner, A., Emshwiller, E. & Guerra, M. Molecular phylogeny and chromosome evolution among the creeping herbaceous Oxalis species of sections Corniculatae and Ripariae (Oxalidaceae). Mol Phylogenet Evol 68, 199–211 (2013).
Ghahremaninejad, F. & Gholamian, F. A new record (Oxalis articulata) from Iran. Iran J Bot 12, 55–56 (2006).
Fiaschi, P. & Pirani, J. R. Review of plant biogeographic studies in Brazil. J Syst Evol 47, 477–496 (2009).
Nesom, G. L. Taxonomic notes on acaulescent Oxalis (Oxalidaceae) in the United States. Phytologia 91, 501–526 (2009).
Byalt, V. V. & Korshunov, M. V. New alien species of flowering plants to the flora of the Arabian Peninsula. Novitates Syst Pl Vasc 118–124 (2020).
Wang, Y.-C., Su, P., Zhang, P.-H. & Zhang, Y.-J. First report of bacterial wilt disease caused by Pantoea agglomerans on the ornamental perennial Oxalis articulata in China. Plant Dis 107, 1216 (2023).
Rosenfeldt, S. & Galati, B. G. The structure of the stigma and the style of Oxalis spp. (Oxalidaceae). J Torrey Bot Soc 136, 33–45 (2009).
Oberlander, K. C., Emshwiller, E., Bellstedt, D. U. & Dreyer, L. L. A model of bulb evolution in the eudicot genus Oxalis (Oxalidaceae). Mol Phylogenet Evol 51, 54–63 (2009).
Draz, A., Kawashty, S., Shamso, E., Hosni, H. & Hussein, S. Chemical profiling of Oxalis species growing wild in Egypt using HRLC/MS Spectrometry. International Journal of Secondary Metabolite 9, 426–439 (2022).
Yan, X. et al. The categorization and analysis on the geographic distribution patterns of Chinese alien invasive plants. Biodiv Sci 22, 667 (2014).
Chen, L. et al. The complete chloroplast genome sequence of Oxalis corniculata (L.). Mitochondrial DNA B Resour 6, 1346–1347 (2021).
Zhou, M. et al. Characterization of the complete chloroplast genome of Oxalis corymbosa DC. (Oxalidaceae), a medicinal plant from Zhejiang Province. Mitochondrial DNA B Resour 6, 1138–1140 (2021).
Lubna, Asaf, S., Jan, R., Khan, A. L. & Lee, I.-J. Complete chloroplast genome characterization of Oxalis corniculata and Its comparison with related species from family Oxalidaceae. Plants 9, 928 (2020).
Wan, F. et al. A chromosome-level genome assembly of Cydia pomonella provides insights into chemical ecology and insecticide resistance. Nat Commun 10, 4237 (2019).
Cavalet-Giorsa, E. et al. Origin and evolution of the bread wheat D genome. Nature 633, 848–855 (2024).
Cheng, H., Concepcion, G. T., Feng, X., Zhang, H. & Li, H. Haplotype-resolved de novo assembly using phased assembly graphs with hifiasm. Nat Methods 18, 170–175 (2021).
Zhou, C., McCarthy, S. A. & Durbin, R. YaHS: yet another Hi-C scaffolding tool. Bioinformatics 39, btac808 (2023).
Moura, A. I. et al. Karyotype inconsistencies in the taxonomy of the genus Oxalis (Oxalidaceae). Iheringia Ser Bot 75, e2020003–e2020003 (2020).
Naranjo, P. C. A., Mola, L. M. & Poggio, L. Estúdios citotaxonomicos y evolutivos en especies herbaceas sudamericanas de Oxalis (Oxalidaceae). I. Boletín de la Sociedad Argentina de Botánica 20, 183–200 (1982).
Jia, K.-H. et al. SubPhaser: a robust allopolyploid subgenome phasing method based on subgenome-specific k-mers. New Phytologist 235, 801–809 (2022).
Doyle, J. A rapid DNA isolation procedure for small quantities of fresh leaf tissue. Phytochem Bull 19, 11–15 (1987).
van Berkum, N. L. et al. Hi-C: A method to study the three-dimensional architecture of genomes. J Vis Exp 1–7 (2010).
Kokot, M., Długosz, M. & Deorowicz, S. KMC 3: counting and manipulating k-mer statistics. Bioinformatics 33, 2759–2761 (2017).
Ranallo-Benavidez, T. R., Jaron, K. S. & Schatz, M. C. GenomeScope 2.0 and Smudgeplot for reference-free profiling of polyploid genomes. Nat Commun 11, 1432 (2020).
Durand, N. C. et al. Juicebox provides a visualization system for Hi-C contact maps with unlimited zoom. Cell Syst 3, 99–101 (2016).
Lin, Y. et al. quarTeT: a telomere-to-telomere toolkit for gap-free genome assembly and centromeric repeat identification. Hortic Res uhad127 (2023).
Ou, S. et al. Benchmarking transposable element annotation methods for creation of a streamlined, comprehensive pipeline. Genome Biol. 20, 275 (2019).
Kim, D., Paggi, J. M., Park, C., Bennett, C. & Salzberg, S. L. Graph-based genome alignment and genotyping with HISAT2 and HISAT-genotype. Nat Biotechnol 37, 907–915 (2019).
Shumate, A., Wong, B., Pertea, G. & Pertea, M. Improved transcriptome assembly using a hybrid of long and short reads with StringTie. PLoS Comput Biol 18, e1009730 (2022).
Haas, B. J. Improving the Arabidopsis genome annotation using maximal transcript alignment assemblies. Nucleic Acids Res 31, 5654–5666 (2003).
Holst, F. et al. Helixer–de novo prediction of primary eukaryotic gene models combining deep learning and a hidden markov model. Preprint at https://doi.org/10.1101/2023.02.06.527280 (2023).
Cantalapiedra, C. P., Hernández-Plaza, A., Letunic, I., Bork, P. & Huerta-Cepas, J. eggNOG-mapper v2: functional annotation, orthology assignments, and domain prediction at the metagenomic scale. Mol Biol Evol 38, 5825–5829 (2021).
Camacho, C. et al. BLAST+: architecture and applications. BMC Bioinformatics 10, 421 (2009).
Conesa, A. et al. Blast2GO: A universal tool for annotation, visualization and analysis in functional genomics research. Bioinformatics 21, 3674–3676 (2005).
Nawrocki, E. P., Kolbe, D. L. & Eddy, S. R. Infernal 1.0: inference of RNA alignments. Bioinformatics 25, 1335–1337 (2009).
Goel, M., Sun, H., Jiao, W.-B. & Schneeberger, K. SyRI: finding genomic rearrangements and local sequence differences from whole-genome assemblies. Genome Biol 20, 277 (2019).
Goel, M. & Schneeberger, K. plotsr: visualizing structural similarities and rearrangements between multiple genomes. Bioinformatics 38, 2922–2926 (2022).
NGDC BioProject. https://ngdc.cncb.ac.cn/bioproject/browse/PRJCA036235 (2025).
NGDC Genome Sequence Archive. https://ngdc.cncb.ac.cn/gsa/browse/CRA023700 (2025).
NGDC Genome Warehouse. https://ngdc.cncb.ac.cn/gwh/Assembly/92289/show (2025).
NGDC Genome Warehouse. https://ngdc.cncb.ac.cn/gwh/Assembly/92290/show (2025).
Yang, W. Oxalis articulata, whole genome shotgun sequencing project. Haplotype A. NCBI GenBank. https://identifiers.org/ncbi/insdc:JBLZXC000000000.1 (2025).
Yang, W. Oxalis articulata, whole genome shotgun sequencing project. Haplotype B. NCBI GenBank. https://identifiers.org/ncbi/insdc:JBLZXD000000000.1 (2025).
Yang, W. Haplotype-resolved chromosomal-level genome assembly and annotation of Oxalis articulata. figshare. https://doi.org/10.6084/m9.figshare.28467281.v4 (2025).
Rhie, A., Walenz, B. P., Koren, S. & Phillippy, A. M. Merqury: reference-free quality, completeness, and phasing assessment for genome assemblies. Genome Biol 21, 245 (2020).
Nevers, Y. et al. Quality assessment of gene repertoire annotations with OMArk. Nat Biotechnol 43, 124–133 (2025).
Tang, H. et al. JCVI: A versatile toolkit for comparative genomics analysis. iMeta 3, e211 (2024).
Acknowledgements
This work was supported equally by grants from the National Natural Science Foundation of China (NSFC) (32422053 and 32301411) and the Natural Science Foundation of Jiangsu Province, China (BK20230394).
Author information
Contributions
Z.W. designed and led the project. C.F. and Z.Z. collected the samples. W.Y., C.J. and C.B. performed data analyses. W.Y. wrote the manuscript. Z.W. and Q.H. revised the manuscript. F.L. and Z.H. provided technical support. W.Y., C.J. and C.B. contributed equally. All authors read and approved the final version of the manuscript.
Ethics declarations
Competing interests
The authors declare no competing interests.
Additional information
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.
About this article
Cite this article
Yang, W., Jiang, C., Bi, C. et al. A haplotype-resolved chromosomal-level genome assembly of Oxalis articulata. Sci Data 12, 856 (2025). https://doi.org/10.1038/s41597-025-05237-9