Background & Summary

Eleocharis dulcis (Cyperaceae), commonly known as Chinese water chestnut, is an aquatic perennial herbaceous plant that grows in shallow waters and is mainly distributed in China, Southeast Asia, the Americas, Europe, and Oceania1,2,3. Eleocharis dulcis is commonly consumed as a vegetable, as its underground corms are rich in starch and have unique tastes. It is also a critical ingredient in traditional Chinese medicine4,5. The cultivated history of Chinese water chestnut in China and India dates back to about two thousand years ago3,6. Now, it is widely cultivated in various counties and regions, including Japan, Australia, the Philippines, Malaysia, Hawaii, and other Pacific islands1.

Similar to other aquatic vegetable (rhizome, tuber, and bulbs), the corm of Chinse water chestnut is an underground stem that serves as a storage organ. The accumulating evidence suggests that the formation of storage organ is regulated by both genetic and environmental factors7,8,9. For example, Cheng et al.10 have identified some up-regulated transcription factors and genes related to organ formation of Chinese water chestnut. However, it remains unclear which genes are likely involved in domestication, and how they regulate corm development. In addition, the lack of an assembled genome has hitherto hindered the studies on the genome structure and evolution of E. dulcis which has a large number of chromosomes (more than 210)11,12. Thus, it is necessary to obtain the whole genome sequences of Chinese water chestnut to deepen our understanding of the mechanisms of the corm development and genome evolution.

In this study, we reported a telomere-to-telomere (T2T) gap-free genome assembly for E. dulcis. De novo assembly using Oxford Nanopore Technology (ONT) ultra-long reads and PacBio high-fidelity (HiFi) long reads generated 521.12 Mb contigs with a contig N50 length of 4.67 Mb. After scaffolding by Hi-C reads, approximately 94.66% (493.24 Mb) of the assembled sequence were anchored onto pseudochromosomes, resulting in 111 gapless chromosomes (Fig. 1A,B; Tables 2, S1). A total of 33,493 protein-coding genes were predicted, of which 32,646 (97.47%) could be functionally annotated, with an average length of 4,211 bp (Table 3). In addition, we also identified 4,032 non-coding genes, including 225 miRNAs, 590 tRNAs, 2,735 rRNAs and 482 snRNAs (Table 3). Phylogenomic and comparative analyses (Figs. 1C, 2) suggested that chromosome breakage and genome duplication in E. dulcis likely took place multiple times during the evolution of its genome, after its divergence from a common ancestor with Rhynchospora breviuscula at ca. 35.6 million years ago (Mya). To understand the process of corm development, we conducted transcriptome sequencing (RNA-seq) at four distinct stages of corm development (Stage1–Stage4, Stage1: initial swelling stage, Stage2: middle swelling stage, Stage3: late swelling stage and Stage4: maturity stage) in both cultivated and wild accessions (Fig. 3A). Among these stages, Stage 2 exhibited the highest number of differentially expressed genes (DEGs, 15,870) (Fig. 3B,C). The GO and KEGG enrichment analyses showed that 45 of these upregulated 2,840 in Cluster 2 of reaction norms of those DEGs were significantly enriched for starch metabolic process (Fig. 3D, Tables S2, S3). The complete genomic and transcriptomic resources represent a major step forward in disentangling the complexities of corm development and genome evolution of E. dulcis; and will also facilitate functional genomics and molecular breeding studies in Chinese water chestnut.

Fig. 1: High-quality telomere-to-telomere (T2T) genome assembly of Eleocharis dulcis and comparative genomic analysis.
figure 1

(A) The Hi-C heatmap of chromosome interactions in E. dulcis genome. (B) The characteristics of the assembled E. dulcis genome. The tracks from outer to inner represent the gene density, repeat sequence density and GC content according to 50kbp windows. (C) A phylogenetic tree was constructed based on 634 single copy orthologues retrieved from 11 species. Arabidopsis thaliana and Vitis vinifera were chosen as outgroups. The red and green numbers on the branches represent the contracted and expanded gene families, respectively. Numbers at each node represent the inferred divergence times between species (millions of years). Red dots represent calibration ages between Rhynchospora breviuscula and Juncus effusus; R. breviuscula and Oryza sativa; R. breviuscula and Asparagus officinalis; A. officinalis and Vitis vinifera; V. vinifera and Arabidopsis thaliana. Calibration ages were obtained from the Timetree database (http://www.timetree.org/).

Fig. 2: Whole genome duplication and collinearity analyses.
figure 2

(A) Distributions of synonymous substitutions (Ks) across paralogs and orthologs for each species. (B) Gene collinearity between Rhynchospora breviuscula and Eleocharis dulcis. The x-axis and y-axis correspond to the pseudochromosomes of R. breviuscula and E. dulcis, respectively.

Fig. 3: Transcriptomic analyses for corms of Eleocharis dulcis.
figure 3

(A) Four developmental stages of corms of cultivated (‘Xuanci’)/wild accessions (S1: initial swelling stage; S2: middle swelling stage; S3: late swelling stage and S4: maturity stage) used for RNA sequencing. (B) Principal components analysis (PCA) of the transcriptomes of all samples. (C) Numbers of differentially expressed genes (DEGs) in cultivated and wild corms. (D) Changes in gene expression in terms of four reaction norms for cultivar (‘Xuanci’) at the four stages (S1 to S4).

Methods

Plant materials and genome sequencing

We obtained the wild sample of E. dulcis from field in Baojiakan Village, Haishu District, Ningbo City, China (121.37° E, 29.76° N) and extracted the genomic DNA from mature leaves using the DNAsecure Plant Kit (Tiangen Biotech, Beijing, China). A hybrid approach, which combined ONT ultra-long sequencing, PacBio HiFi sequencing and a Hi-C chromatin interaction map, was employed to generate the genome assembly. For ONT ultra-long sequencing, as described by ONT Community’s protocol, genomic DNAs with long fragment were selected using Short Read Eliminator XL (Circulomics, MD, USA). After preparing the library using the SQK-LSK109 kit (Oxford Nanopore Technologies; Oxford, UK), Oxford Nanopore sequencing was performed on a PromethION flow cell (ONT, Oxford, UK). For PacBio HiFi sequencing, the genomic DNA was fragmented into 15 kb segments to construct a standard SMRTbell library, according to the manufacturer’s instructions. Then, this library was sequenced on a PacBio Sequel IIe platform (Pacifc Biosciences, Menlo Park, USA). Hi-C library was constructed using NEBNext Ultra II DNA library preparation kit and DpnII enzyme (Ipswich, MA, USA). The Hi-C and Illumina short-insert libraries were sequenced on a DNBSEQ-T7 platform.

Estimation of genomic characteristics

llumina short reads were used to estimate genomic characteristics using a k-mer-based statistical analysis implemented in JELLYFISH v2.2.113 with the following parameters: ‘count -m 19 -C -c 7 -s 1 G -F 2’. Genome heterozygosity and genome size were estimated using GenomeScope v2.0 with default parameters14. Based on 19-mer depth analysis, genome size and heterozygosity were estimated to be 517.78 Mb and 0.77%, respectively (Fig. S1).

Quality assessment, genome assembly and gap filling

For ONT ultra-long data, FILTLONG v0.2.4 (https://github.com/rrwick/Filtlong) and PORECHOP v0.2.4 (https://github.com/rrwick/Porechop) were employed to remove low quality reads and trim primer sequences. The PacBio HiFi original sequencing data (subreads) were filtered by CCS v6.0 (–min-passes 3 –min-snr 2.5 –top-passes 60, https://github.com/PacificBiosciences/ccs). For Hi-C data, FASTP v0.2115 was used to remove the joint sequences and low-quality reads from raw data. Then, clean data were aligned to the reference genome with HICUP v0.816. After quality controlling, 36.98 Gb of ONT ultra-long data, 29.19 Gb of PacBio HiFi data and 58 Gb of Hi-C data (Table 1) were used for de novo genome assembly. The initial assembly of the ONT Ultra-long sequencing data was performed using NextDenovo v2.5 (read_cutoff = 1k, blocksize = 1 g, nextgraph_options = -a 1, https://github.com/Nextomics/NextDenovo). Additionally, we employed HIFIASM v0.16.117 to perform genome assembly with the HiFi sequencing data. To obtain a more comprehensive and accurate genome assembly, we also conducted a hybrid assembly combining HiFi and ONT data using HIFIASM v0.16.1. The mitochondrial and chloroplast sequences were removed from the alignments using MINIMAP2 v2.1718. The complete chloroplast sequence of E. dulcis (NC_047447) and complete mitochondrial sequence of Liriodendron tulipifera (NC_012152) were used as references. A BLAST RefSeq library was used to remove bacterial contamination and contigs with low read counts.

Table 1 Whole genome sequencing data of Eleocharis dulcis.

To achieve a telomere-to-telomere genome assembly, the contig sequences were clustered into various chromosome groups using a bottom-up hierarchical clustering algorithm with ALLHIC v0.9.819. The resulting assembly was further refined through adjustment of chromosome construction using the JUICEBOX tool v2.1220 (Fig. 1A). To fill the gaps in the assembled genome, WINNOWMAP v1.1121 was used for sequence alignment, including the assemblies derived from nextDenovo and Hifiasm, HiFi reads and ONT ultra-long reads. The longest and the most consensus sequences were applied to replace the corresponding gap-tied sequence. Subsequently, error correction was conducted using RACON v1.622, resulting in the attainment of a high-quality gap-free genome assembly. In total, 493.27 Mb reads were successfully anchored to 111 gapless chromosomes, with a contig N50 of 4.7 Mb (Table 2). Telomeres and centromeres were predicted employing quarTeT23 with recommended parameters (Table S1).

Table 2 Summary of genome assembly and evaluation.

Genome annotation

Annotations of transposable elements (TEs) were obtained from homology-based searching in known repeat databases and de novo predictions. As part of homology-based detection, RepeatMasker v4.0.9 (http://www.repeatmasker.org) was used to identify TEs against the Repbase database at the DNA level. De novo TEs were detected by RepeatMasker v4.0.9 based on a de novo repeat library constructed by RepeatModeler v1.0.4. (http://www.repeatmasker.org/RepeatModeler.html), LTR_FINDER v1.0.524, and REPEATSCOUT v1.0.525. The tandem repeat sequences in the genome were identified using TANDEM REPEATS FINDER v4.0.926. A total of 238.28 Mb (48.31%) repetitive sequences were annotated in E. dulcis (Table 3), with long terminal repeat (LTR) accompanied for the largest proportion (24.32%).

Table 3 Genome annotation of repetitive sequences and protein-coding genes.

To predict the gene structure of E. dulcis, three approaches were used: homology-based prediction, ab initio gene prediction and transcriptome-assisted annotation. For the homology-based gene prediction, the protein-coding sequences from Rhynchospora pubera27, R. breviuscula27, R. tenuis27, Brachypodium distachyon28, Carex cristatella29, and Oryza sativa30 were mapped to the assembled genome using TBLASTN v2.2.26 (E-value ≤ 1e−5)31 to obtain high-quality protein structures. The ab initio gene prediction was performed using AUGUSTUS v3.2.332, GENSCAN v1.033, GLIMMERHMM v3.0.434, GENEID v1.4.435 and SNAP v2013.11.2936. Besides, the RNA-seq reads separately derived from different tissues (i.e., root, stem, leaf) were aligned to the genome sequence using TOPHAT v2.0.1137 in transcriptome-assisted prediction. Subsequently, the mapped reads were assembled into longer transcripts using CUFFINKS v2.2.138. The transcripts from all tissues were merged and subjected to TransDecoder in PASA v2.4.139 for protein-coding sequence prediction and quality filtering. Only complete transcripts were retained for further analysis. All genes predicted by the above methods were integrated into a non-redundant gene set using EvidenceModeler (EVM) v1.1.140. The EVM-predicted genes were further corrected with PASA v2.4.139 to predict the untranslated regions and alternative splicings. The resulting protein models were functionally annotated according to the best matches with proteins deposited in GeneOntology (GO), Kyoto Encyclopedia of Genes and Genomes (KEGG), Swiss-Prot, TREMBL and NCBI non-redundant protein database (Nr) using BLASTP (E-value = 1e−5). We identified tRNA genes, rRNA genes and ncRNA genes using TRNASCAN-SE v1.2341, INFERNAL v1.1.242 and RFAM database (https://rfam.org/), respectively. Finally, we employed MCSCANX43 and ‘CIRCLIZE’ R package44 to detect and visualize collinear blocks (Fig. 1B).

Gene family evolution analyses

We employed ORTHOFINDER v2.3.1245 to identify orthologous groups for 11 species, including Arabidopsis thaliana46, Vitis vinifera47, Eleocharis dulcis, Ananas comosus48, Rhynchospora breviuscula27, Phoenix dactylifera49, Juncus effusus27, Asparagus officinalis50, Musa acuminata51, Sorghum bicolor52 and Oryza sativa30. A total of 634 single copy orthologues were obtained and subsequently aligned using MUSCLE v3.853. All these genes were finally concatenated into one sequence as a ‘supergene’. Based on the ‘supergene’, we used RAXML v8.2.154 to construct a maximum likelihood (ML) tree, with the PROTGAMMAWAG model. Additionally, MCMCtree program implemented in PAML v4.955 was employed to estimate divergence times among species. According to the time-calibrated phylogeny, the genus Eleocharis was estimated to diverge from Rhynchospora ca. 20 million years ago (Fig. 1C). Subsequently, according to the species phylogenetic tree and gene family clustering results (Fig. 1C), gene family expansions and contractions analysis was performed using CAFE v3.156. GO enrichment analysis of gene families (P-value < 0.05) was performed using AGRIGO v2.057 with default parameters. The results revealed that 5,463 and 620 gene families had undergone significant expansions and contractions in E. dulcis (Fig. 1C), respectively. These expanded families were mostly enriched for essential functions, e.g. DNA binding, flavonoid biosynthetic process, nucleic acid binding (Fig. 1D).

The Ks method was employed to identify whole-genome duplication (WGD) events. The yn00 module implemented in PAML v4.955 was used to calculate the synonymous mutation frequency (Ks), nonsynonymous mutation frequency (Ka), and the ratio of nonsynonymous to synonymous mutation rates (Ka/Ks) for collinear gene pairs. The analysis of the Ks distribution showed that, except for a common whole-genome duplication (WGD) event shared by all monocots, E. dulcis underwent a recent WGD event which occurred after its divergence from R. breviuscula (Fig. 2A). Additionally, the synteny blocks generated by JCVI v0.9.1358 showed collinearity between a single chromosome of R. breviuscula and multiple chromosomes of E. dulcis and high self-collinearity among chromosomes of E. dulcis. Both suggest multiple occurrences of chromosome breakage and duplication during genomic evolution (Fig. 2B).

Transcriptome analysis

Wild individuals and a traditional cultivar ‘Xuanci’ of Chinese water chestnut were chosen for differential expression analysis of corm development. Our sampling strategy encompassed four distinct stages of corm development, i.e. Stage 1 (S1, initial swelling stage), Stage 2 (S2, middle swelling stage), Stage 3 (S3, late swelling stage), and Stage 4 (S4, maturity stage) (Fig. 3A). The samples of each stage included three biological replicates. Raw reads were generated by the DNBSEQ-T7 platform. Adapters and low-quality reads were filtered using FASTP v0.2115. Clean reads of each tissue were mapped to the reference genome using HISAT2 v2.0.459. The read counting of the annotated genes was performed using STRINGTIE v2.1.660. R package ‘DESeq2’61 was used to analyze differential gene expression. PCA showed that transcriptomes of three biological replicates within each time point group clustered together (Fig. 3B). The DEGs between various tissues were defined according to the following criteria:|log2FC| ≥ 1 and the false discovery rate (FDR) corrected P-value < 0.05.

RNAseq generated a total of 1,345,118,694 reads, with an average of 56,046,612 reads per sample

Among these stages, Stage 4 (maturity stage) exhibited the lowest number of DEGs, with 5,949 down-regulated genes and 5,965 up-regulated genes between cultivated and wild accessions (Fig. 3C). In contrast, Stage 2 (the middle swelling stage) had the highest number of DEGs, with 9,121 down-regulated genes and 6,749 up-regulated genes (Fig. 3C). Fuzzy c-means clustering implemented in the ‘MFUZZ’ R package62 revealed four major expression reaction norms in cultivated accessions (Fig. 3D). Genes in Cluster 1 were down-regulated after S1, while genes in Cluster 2 and Cluster 3 were up-regulated, reaching expression peak at S2 and S3, respectively. In Cluster 4, genes showed bimodal expression patterns (Fig. 3D). GO (Table S2) and KEGG enrichment (Table S3) analyses further demonstrated that the up-regulated genes in Cluster 2 were significantly (P < 0.05) enriched for starch metabolic process.

Data Records

The whole genome sequencing datasets have been stored in the National Genomics Data Center (NGDC). The raw data can be located using the GSA number CRA01314763, the genome sequence and annotation data were under accession number GWHEQVU0000000064, which corresponds to the BioProject accession number PRJCA01861365. Additionally, the RNA-seq data for both cultivated and wild accessions at four different stages can be found under the BioProject accession number PRJCA02068266, with the corresponding GSA number being CRA01313467. The genome sequence data were also deposited in the SRA (Sequence Read Archive) database with accession number SRX22848329SRX2284833368,69,70,71,72 under the BioProject accession number PRJNA1033537. The assembled genome was deposited in the GeneBank in NCBI under accession number GCA_040741935.173. The RNA-seq data were submitted under PRJNA1033533. All expression data were uploaded to the the Gene Expression Omnibus (GEO) public database under Accession No. GSE26413274.

Technical Validation

The evaluation of the contig and chromosome levels of the assemblies was conducted through four methods: N50 for assessing continuity (contig N50 = 4.7 Mb), QV scores calculated by MERQURY75 v1.3 for measuring accuracy (QV = 49.2) and the paired-end (Illumina) reads mapping rate for ensuring consistency with the raw data (Mapping rate = 98.9%, read coverage (20X) = 99.0%). The verification of protein-coding genes was performed using BUSCO76. The BUSCO analysis showed 94.4% of complete BUSCO genes at the “genome mode” and 94.0% at the “protein mode”. The latter comprised 1,036 (64.8%) single-copy BUSCOs and 472 (29.2%) duplicated BUSCOs.