Background & Summary

The genus Oxalis (Oxalidaceae) comprises over 500 species of herbaceous plants1,2,3, many of which are renowned for their ornamental value, ecological adaptability, and medicinal properties4,5. Among these, Oxalis articulata Savigny (Fig. 1a–c), commonly known as pink-sorrel or pink wood sorrel, is a perennial species native to South America but now naturalized globally in tropical and subtropical regions6,7,8. This species exhibits prolific clonal propagation via hard, segmented woody rhizomes. This reproductive strategy enables rapid colonization and confers adaptive resilience to heterogeneous environments, which are key drivers of its ecological dominance in both native and invaded ranges9,10,11. These rhizomes facilitate vigorous spread and make the plant difficult to eradicate once established12. In addition to its horticultural and ecological impact, O. articulata has attracted interest in its bioactive compounds. Studies on this species have demonstrated significant antioxidant activity in its extracts, and it has also shown direct antimicrobial effects13. However, its aggressive invasiveness in regions such as East Asia and Australia poses significant challenges to local biodiversity9,14. A comprehensive understanding of its genomic architecture could elucidate the genetic basis of both its adaptive advantages and ecological impacts.

Fig. 1
figure 1

Plant morphology and genome size estimation of O. articulata. (a) flower; (b) leaf; (c) rhizome; (d) 25-kmer distribution estimation.

Despite its ecological and biochemical significance, genomic resources for O. articulata remain scarce. To date, only partial plastid genome sequences of the Oxalis genus are available15,16,17, limiting insights into its adaptive evolution, invasion mechanisms, and metabolite biosynthesis. High-quality chromosome-level genomes have revolutionized research in other non-model plants, enabling the identification of key genes involved in stress responses, invasion-related trait evolution, and species divergence18,19. Meanwhile, the absence of high-quality genome for O. articulata hinders comparative analyses within Oxalidaceae and obscures the genomic basis of its ecological success.

In this study, we present the first haplotype-resolved, chromosome-scale genome assembly of O. articulata (Fig. 2), constructed through an integrative approach combining PacBio HiFi sequencing and Hi-C technology. The initial assembly, generated with hifiasm20, produced highly contiguous allele sequences, which were anchored into 14 pseudochromosomes using YaHS21 (Fig. 3), aligning with its reported chromosome number22,23. Haplotype phasing was performed with SubPhaser24, leveraging repetitive k-mer signatures to resolve two distinct haplotypes (Fig. 4). As the first published genome within Oxalis genus, this haplotype-resolved, chromosome-level assembly provides a foundational resource for comparative evolutionary studies, ecological research, and functional genomics, advancing genomic investigations in this under-characterized plant lineage.

Fig. 2
figure 2

The landscape of genome assembly and annotation of O. articulata. Tracks from inside to outer correspond to (a) synteny information; (b) gene density; (c) GC content; (d) repeat density; (e) chromosomes of O. articulata. In the heatmap, blue represents a high percentage, while white indicates a low percentage.

Fig. 3
figure 3

Inter-chromosomal Hi-C heatmap of O. articulata genome. The heatmap displays chromatin interaction frequencies across all chromosomes, derived from Hi-C sequencing data and normalized using the Knight–Ruiz (KR) algorithm. The color scale indicates the intensity of chromatin contacts.

Fig. 4
figure 4

The result of subsequent typing of the chromosomes using Subphaser. The clustered chromosomes were divided into two haplotypes, Haplotype A and Haplotype B.

Methods

Plant materials and genomic sequencing

All samples were collected from naturally growing O. articulata plants on the campus of Nanjing Forestry University (32°4′44″N, 118°48′31″E). Fresh young leaves were collected from selected individuals for genome sequencing. For transcriptome RNA extraction, samples of young leaves, current-year branches, and flowers were also gathered. To ensure preservation, all samples were rapidly flash-frozen in liquid nitrogen immediately after collection and stored at −80 °C.

For PacBio long-read sequencing, genomic DNA was extracted using a modified Cetyltrimethylammonium Bromide (CTAB) method25. The fragmented DNA was purified with AMPure PB magnetic beads, and a HiFi library was constructed using the SMRTbell prep kit 3.0 (PacBio, USA). Sequencing was carried out on the PacBio Revio platform (PacBio, USA), yielding 44.50 Gb of Circular Consensus Sequencing (CCS) reads, corresponding to approximately 119.30 × coverage (Table 1).

Table 1 Statistics of genomic sequencing data.

The Hi-C library preparation involved fixing fresh O. articulata young leaves in 2% formaldehyde26. The fixed tissues were homogenized and centrifuged to isolate nuclei. The cross-linked chromatin was digested with DpnII, labeled with biotin, and ligated with T4 DNA ligase. After de-crosslinking, the DNA was purified and fragmented into 300–500 bp segments. Fragments containing interaction sites were enriched using streptavidin magnetic beads to construct the Hi-C library. High-throughput sequencing on the DNBSEQ-T7 platform (BGI, China) produced 16.36 Gb of Hi-C reads, equating to approximately 43.87 × coverage (Table 1).

For transcriptome sequencing, RNA was extracted from plant tissues using the DP411 and DP762-T1C kits (TIANGEN, China). mRNA was subsequently purified using the Dynabeads mRNA Purification Kit (Invitrogen, USA). RNA libraries were prepared and sequenced on the DNBSEQ-T7 platform (BGI, China), resulting in 19.38 Gb of RNAseq reads, which were utilized for the genome annotation (Table 1).

Genome size estimation

The genome size was estimated by counting k-mer of HiFi data with word size 25 using KMC v3.2.427. A k-mer frequency histogram was also generated with KMCtools. Genome size and heterozygosity were estimated using GenomeScope 2.028. The estimated genome size is 374 Mb with heterozygosity rate of 3.2% (Fig. 1d).

Genome assembly

PacBio HiFi reads and Hi-C short reads were combined in Hifiasm v0.24.020, employing the Hi-C Integrated Assembly mode to generate contigs for two distinct haplotypes. Subsequently, the Hi-C reads were simultaneously compared to the merged contigs for scaffolding using YaHS v1.2a.121. Manual corrections were applied using Juicebox v1.11.0829 to rectify misinsertions and optimize contig orientations, thereby ensuring the overall chromosome structure. The final assembly is 802.17 Mb comprising 1,610 contigs, with an N50 of 44.47 Mb (Table 2). Among them, 29 contigs, representing approximately 89.72% of the total genome size, were anchored to 14 chromosomes, which were further phased into two haplotypes (Table 2). The remaining 1,581 unanchored contigs account for a total length of approximately 82.43 Mb. As part of the scaffolding process, 200 ‘N’ bases were inserted between adjacent contigs to represent unknown regions without estimating actual gap sizes.

Table 2 Summary of the O. articulata genome assembly data.

The corrected chromosomes were then typed using Subphaser v1.2.6 with parameters set to -k 150 -q 2 -f 2.0, which were chosen based on recommendations from the SubPhaser documentation and benchmarking results24. These settings resulted in the final identification of Haplotype A and Haplotype B (Fig. 4). Each individual haploid genome comprises a total of seven chromosomes. Haplotype A contained 20 contigs with a total size of 377.04 Mb and Haplotype B has 9 contigs with a total size of 342.70 Mb. The N50 lengths for Haplotype A and Haplotype B were 31.13 Mb and 47.46 Mb, respectively (Table 2) Additionally, the assembled genomes were characterized for telomeres using quarTeT v1.2.530. In Haplotype B, telomeres were detected on both ends of all seven chromosomes, whereas in Haplotype A, six chromosomes had telomeres on both ends, while one chromosome had a telomere detected on only one end (Fig. 5).

Fig. 5
figure 5

Structural variation between the two haploid genomes of O. articulata. Syntenic relationships and structural variations between Haplotype A (top) and Haplotype B (bottom) chromosomes are shown. All annotations, including structural variation types and chromosome features, are indicated in the accompanying legend.

Repeat annotation

The de novo identification of transposable elements (TEs) was performed using the EDTA (Extensive de novo TE Annotator) pipeline v2.1.031, with the parameters–sensitive 1–anno 1. A total of 241.87 Mb (64.15%) of assembled sequences were annotated as TE in Haplotype A, including Long Terminal Repeat (LTR, 55.56%), Terminal Inverted Repeat (TIR, 3.06%) and Helitron (2.96%). In contrast, a total of 206.08 Mb (60.13%) of the assembled sequences in Haplotype B were annotated as TE, including LTR (53.15%), TIR (3.79%) and Helitron (3.51%) (Table 3).

Table 3 Summary of the O. articulata genome repeat annotation.

Gene structure prediction and functional annotations

Gene structure prediction was carried out using a combination of ab initio and transcriptome-based approaches. RNA-Seq data were processed to obtain transcripts using the Hisat2-StringTie pipeline32,33, with transcript structure validation conducted via the PASA pipeline v2.5.234. Additionally, Helixer35 was employed for de novo gene annotation, and the predicted gene structures were further validated through PASA to ensure accuracy. Combining the above steps, 36,063 and 38,292 protein-coding genes were finally annotated in Haplotype A and Haplotype B, respectively (Table 4).

Table 4 Summary of the O. articulata genome gene structure annotation.

The functional annotation of protein-coding genes was accomplished through a comprehensive three-step approach. Initially, gene sequences were aligned to the eggNOG v5.0.2 database using eggNOG-mapper v2.1.1236, which successfully annotated 94.0% of the genes. Then, we used KEGG Automatic Annotation Server (https://www.genome.jp/tools/kaas/) to associate 60.02% of genes with KEGG pathways. In the second step, BLAST 2.14.1+37 was utilized to compare gene sequences against three major protein databases: Swiss-Prot (69.02%), TrEMBL (93.08%), and Nr (92.70%). Finally, Gene Ontology (GO) terms (http://geneontology.org/) were annotated through Blast2GO v2.538 and 50.02% of the genes were annotated. Totally, 94.06% of the genes were annotated to at least one database in our analysis (Table 5).

Table 5 Summary of the O. articulata genome gene functional annotation.

Annotation of non-coding RNAs

Non-coding RNAs were identified using Infernal v1.1.539 by querying against the Rfam v14.10 database with default parameters. In Haplotype A, the analysis identified 8,279 rRNAs, 621 tRNAs, and 2,245 other ncRNAs. In Haplotype B, 6,212 rRNAs, 600 tRNAs, and 2,342 other ncRNAs were identified (Table 6).

Table 6 Summary of the O. articulata genome non-coding RNA annotation.

Comparison between haplotype assemblies

SyRI (Synteny and Rearrangement Identifier) v1.6.340 was employed to detect synteny and structural variations (≥50 bp) and single nucleotide polymorphisms (SNPs) and small insertions and deletions (InDels) (<50 bp) between the two haplotypes using default parameters. Plotsr v1.1.141 was used for generating synteny plot (Fig. 5). This analysis yielded a total of 1,371,602 single-nucleotide polymorphism (SNP) differences and 160 insertion-deletions (InDels), including 80 insertions and 80 deletions. SyRI detected 2,019 duplications (DUPs) with a total length of 9.5 Mb, 69 inversions (INVs) with a total length of 35.97 Mb, and 5,454 translocations (TRANSs) with a total length of 20.43 Mb.

Data Records

The raw sequencing data and genome assembly generated in this study have been deposited in the National Genomics Data Center (NGDC), China National Center for Bioinformation, under BioProject accession number PRJCA03623542. The raw sequencing data are available in the Genome Sequence Archive (GSA) under accession number CRA02370043. The assembled genome files are available in both the Genome Warehouse (GWH; accession numbers GWHFOPA00000000.244 and GWHFOPB00000000.245) and NCBI GenBank (accession numbers JBLZXC00000000046 and JBLZXD00000000047). The assembly and annotation files also can be downloaded at figshare48.

Technical Validation

The quality of the genome assembly was evaluated using several metrics, including Benchmarking Universal Single-Copy Orthologs (BUSCO) v5.4.7 with the embryophyta_odb10 dataset, the LTR Assembly Index (LAI), and consensus quality value (QV)49. The results indicated that 99.2% of the complete core genes, encompassing both single-copy and duplicated genes, were present in the assembled genome (Table 7). For the two haplotypes, the proportions of complete core genes were 98.9% and 99.2%, respectively. The LAI values were 24.15 for Haplotype A, 32.51 for Haplotype B, and 26.92 for the assembled genome. The QV values of the haplotypes and the assembled genome were 78.86, 78.25, and 78.56, respectively. Additionally, the k-mer completeness of the assembled genome reached 98.68% (Table 7). To evaluate the completeness of the genome annotation, BUSCO analysis and OMArk50 with LUCA.h5 database were performed on the annotated gene set. The results showed that the proportions of complete core genes were 99.2%, 99.1%, and 99.2% for Haplotype A, Haplotype B, and the assembled genome, respectively (Table 8). In addition, OMArk evaluation reported high completeness (>98.1%) and no detected contaminants in all gene sets, further supporting the reliability of the annotation (Table 8).

Table 7 Evaluation of O. articulata genome assembly.
Table 8 BUSCO assessment of predicted genes in the O. articulata genome.

Then, we evaluated the haplotype-phased assemblies by plotting the collinearity between two haplotype-phased assemblies using JCVI v1.4.1651. The results showed that both haplotypes exhibited overall collinearity, with some structural variations (Fig. 6a). Notably, a large structural variation region was identified on Chr5 (~15 Mb in size), containing two distinct inversions and one translocation. These rearrangements were also identified in SyRI analysis (Fig. 5). To further confirm these structural variations, we examined the Hi-C contact matrices of these two chromosomes (Fig. 6b). The strong interaction signals along the diagonal indicated well-preserved genomic organization, confirming the accuracy of the assembly.

Fig. 6
figure 6

Assembly validation. (a) Collinearity between Haplotype A and Haplotype B (b) Hi-C heatmap of chromosome 5 for both haplotypes. The black dashed boxes indicate regions with structural variations.