Telomere-to-telomere genome assembly of Eleocharis dulcis and expression profiles during corm development

Chen, Yang; Zhang, Xinyi; Wang, Lingyun; Fang, Mingya; Lu, Ruisen; Ma, Yazhen; Huang, Yan; Chen, Xiaoyang; Sheng, Wei; Shi, Lin; Zheng, Zhaisheng; Qiu, Yingxiong

doi:10.1038/s41597-024-03717-y

Download PDF

Data Descriptor
Open access
Published: 10 August 2024

Telomere-to-telomere genome assembly of Eleocharis dulcis and expression profiles during corm development

Yang Chen^1,2^na1,
Xinyi Zhang²^na1,
Lingyun Wang¹^na1,
Mingya Fang¹,
Ruisen Lu³,
Yazhen Ma²,
Yan Huang²,
Xiaoyang Chen⁴,
Wei Sheng¹,
Lin Shi¹,
Zhaisheng Zheng¹ &
…
Yingxiong Qiu²

Scientific Data volume 11, Article number: 869 (2024) Cite this article

1902 Accesses
Metrics details

Subjects

Abstract

Eleocharis dulcis (Burm. f.) Trin. ex Hensch., commonly known as Chinese water chestnut, is a traditional aquatic vegetable in China, and now is widely cultivated throughout the world because of its high nutritional value and unique tastes. Here, we report the assembly of a 493.24 Mb telomere-to-telomere (T2T) genome of E. dulcis accomplished by integrating ONT ultra-long reads, PacBio long reads and Hi-C data. The reference genome was anchored onto 111 gap-free chromosomes, containing 48.31% repeat elements and 33,493 predicted protein-coding genes. Whole genome duplication (WGD) and inter-genomic synteny analyses indicated that chromosome breakage and genome duplication in E. dulcis possibly occurred multiple times during genome evolution after its divergence from a common ancestor with Rhynchospora breviuscula at ca. 35.6 Mya. A comparative time-course transcriptome analysis of corm development revealed different patterns of gene expression between cultivated and wild accessions with the highest number of differentially expressed genes (DEGs, 15,870) at the middle swelling stage and some of the DEGs were significantly enriched for starch metabolic process.

The Welwitschia genome reveals a unique biology underpinning extreme longevity in deserts

Article Open access 12 July 2021

Transposase-assisted target-site integration for efficient plant genome engineering

Article Open access 26 June 2024

Chromosome-level genome assembly of two cultivated Jujubes

Article Open access 17 October 2024

Background & Summary

Eleocharis dulcis (Cyperaceae), commonly known as Chinese water chestnut, is an aquatic perennial herbaceous plant that grows in shallow waters and is mainly distributed in China, Southeast Asia, the Americas, Europe, and Oceania^1,2,3. Eleocharis dulcis is commonly consumed as a vegetable, as its underground corms are rich in starch and have unique tastes. It is also a critical ingredient in traditional Chinese medicine^4,5. The cultivated history of Chinese water chestnut in China and India dates back to about two thousand years ago^3,6. Now, it is widely cultivated in various counties and regions, including Japan, Australia, the Philippines, Malaysia, Hawaii, and other Pacific islands¹.

Similar to other aquatic vegetable (rhizome, tuber, and bulbs), the corm of Chinse water chestnut is an underground stem that serves as a storage organ. The accumulating evidence suggests that the formation of storage organ is regulated by both genetic and environmental factors^7,8,9. For example, Cheng et al.¹⁰ have identified some up-regulated transcription factors and genes related to organ formation of Chinese water chestnut. However, it remains unclear which genes are likely involved in domestication, and how they regulate corm development. In addition, the lack of an assembled genome has hitherto hindered the studies on the genome structure and evolution of E. dulcis which has a large number of chromosomes (more than 210)^11,12. Thus, it is necessary to obtain the whole genome sequences of Chinese water chestnut to deepen our understanding of the mechanisms of the corm development and genome evolution.

In this study, we reported a telomere-to-telomere (T2T) gap-free genome assembly for E. dulcis. De novo assembly using Oxford Nanopore Technology (ONT) ultra-long reads and PacBio high-fidelity (HiFi) long reads generated 521.12 Mb contigs with a contig N50 length of 4.67 Mb. After scaffolding by Hi-C reads, approximately 94.66% (493.24 Mb) of the assembled sequence were anchored onto pseudochromosomes, resulting in 111 gapless chromosomes (Fig. 1A,B; Tables 2, S1). A total of 33,493 protein-coding genes were predicted, of which 32,646 (97.47%) could be functionally annotated, with an average length of 4,211 bp (Table 3). In addition, we also identified 4,032 non-coding genes, including 225 miRNAs, 590 tRNAs, 2,735 rRNAs and 482 snRNAs (Table 3). Phylogenomic and comparative analyses (Figs. 1C, 2) suggested that chromosome breakage and genome duplication in E. dulcis likely took place multiple times during the evolution of its genome, after its divergence from a common ancestor with Rhynchospora breviuscula at ca. 35.6 million years ago (Mya). To understand the process of corm development, we conducted transcriptome sequencing (RNA-seq) at four distinct stages of corm development (Stage1–Stage4, Stage1: initial swelling stage, Stage2: middle swelling stage, Stage3: late swelling stage and Stage4: maturity stage) in both cultivated and wild accessions (Fig. 3A). Among these stages, Stage 2 exhibited the highest number of differentially expressed genes (DEGs, 15,870) (Fig. 3B,C). The GO and KEGG enrichment analyses showed that 45 of these upregulated 2,840 in Cluster 2 of reaction norms of those DEGs were significantly enriched for starch metabolic process (Fig. 3D, Tables S2, S3). The complete genomic and transcriptomic resources represent a major step forward in disentangling the complexities of corm development and genome evolution of E. dulcis; and will also facilitate functional genomics and molecular breeding studies in Chinese water chestnut.

**Fig. 1: High-quality telomere-to-telomere (T2T) genome assembly of *Eleocharis dulcis* and comparative genomic analysis.**

**Fig. 2: Whole genome duplication and collinearity analyses.**

Fig. 3: Transcriptomic analyses for corms of *Eleocharis dulcis.*

Methods

Plant materials and genome sequencing

We obtained the wild sample of E. dulcis from field in Baojiakan Village, Haishu District, Ningbo City, China (121.37° E, 29.76° N) and extracted the genomic DNA from mature leaves using the DNAsecure Plant Kit (Tiangen Biotech, Beijing, China). A hybrid approach, which combined ONT ultra-long sequencing, PacBio HiFi sequencing and a Hi-C chromatin interaction map, was employed to generate the genome assembly. For ONT ultra-long sequencing, as described by ONT Community’s protocol, genomic DNAs with long fragment were selected using Short Read Eliminator XL (Circulomics, MD, USA). After preparing the library using the SQK-LSK109 kit (Oxford Nanopore Technologies; Oxford, UK), Oxford Nanopore sequencing was performed on a PromethION flow cell (ONT, Oxford, UK). For PacBio HiFi sequencing, the genomic DNA was fragmented into 15 kb segments to construct a standard SMRTbell library, according to the manufacturer’s instructions. Then, this library was sequenced on a PacBio Sequel IIe platform (Pacifc Biosciences, Menlo Park, USA). Hi-C library was constructed using NEBNext Ultra II DNA library preparation kit and DpnII enzyme (Ipswich, MA, USA). The Hi-C and Illumina short-insert libraries were sequenced on a DNBSEQ-T7 platform.

Estimation of genomic characteristics

llumina short reads were used to estimate genomic characteristics using a k-mer-based statistical analysis implemented in JELLYFISH v2.2.1¹³ with the following parameters: ‘count -m 19 -C -c 7 -s 1 G -F 2’. Genome heterozygosity and genome size were estimated using GenomeScope v2.0 with default parameters¹⁴. Based on 19-mer depth analysis, genome size and heterozygosity were estimated to be 517.78 Mb and 0.77%, respectively (Fig. S1).

Quality assessment, genome assembly and gap filling

For ONT ultra-long data, FILTLONG v0.2.4 (https://github.com/rrwick/Filtlong) and PORECHOP v0.2.4 (https://github.com/rrwick/Porechop) were employed to remove low quality reads and trim primer sequences. The PacBio HiFi original sequencing data (subreads) were filtered by CCS v6.0 (–min-passes 3 –min-snr 2.5 –top-passes 60, https://github.com/PacificBiosciences/ccs). For Hi-C data, FASTP v0.21¹⁵ was used to remove the joint sequences and low-quality reads from raw data. Then, clean data were aligned to the reference genome with HICUP v0.8¹⁶. After quality controlling, 36.98 Gb of ONT ultra-long data, 29.19 Gb of PacBio HiFi data and 58 Gb of Hi-C data (Table 1) were used for de novo genome assembly. The initial assembly of the ONT Ultra-long sequencing data was performed using NextDenovo v2.5 (read_cutoff = 1k, blocksize = 1 g, nextgraph_options = -a 1, https://github.com/Nextomics/NextDenovo). Additionally, we employed HIFIASM v0.16.1¹⁷ to perform genome assembly with the HiFi sequencing data. To obtain a more comprehensive and accurate genome assembly, we also conducted a hybrid assembly combining HiFi and ONT data using HIFIASM v0.16.1. The mitochondrial and chloroplast sequences were removed from the alignments using MINIMAP2 v2.17¹⁸. The complete chloroplast sequence of E. dulcis (NC_047447) and complete mitochondrial sequence of Liriodendron tulipifera (NC_012152) were used as references. A BLAST RefSeq library was used to remove bacterial contamination and contigs with low read counts.

Table 1 Whole genome sequencing data of Eleocharis dulcis.

Full size table

To achieve a telomere-to-telomere genome assembly, the contig sequences were clustered into various chromosome groups using a bottom-up hierarchical clustering algorithm with ALLHIC v0.9.8¹⁹. The resulting assembly was further refined through adjustment of chromosome construction using the JUICEBOX tool v2.12²⁰ (Fig. 1A). To fill the gaps in the assembled genome, WINNOWMAP v1.11²¹ was used for sequence alignment, including the assemblies derived from nextDenovo and Hifiasm, HiFi reads and ONT ultra-long reads. The longest and the most consensus sequences were applied to replace the corresponding gap-tied sequence. Subsequently, error correction was conducted using RACON v1.6²², resulting in the attainment of a high-quality gap-free genome assembly. In total, 493.27 Mb reads were successfully anchored to 111 gapless chromosomes, with a contig N50 of 4.7 Mb (Table 2). Telomeres and centromeres were predicted employing quarTeT²³ with recommended parameters (Table S1).

Table 2 Summary of genome assembly and evaluation.

Full size table

Genome annotation

Annotations of transposable elements (TEs) were obtained from homology-based searching in known repeat databases and de novo predictions. As part of homology-based detection, RepeatMasker v4.0.9 (http://www.repeatmasker.org) was used to identify TEs against the Repbase database at the DNA level. De novo TEs were detected by RepeatMasker v4.0.9 based on a de novo repeat library constructed by RepeatModeler v1.0.4. (http://www.repeatmasker.org/RepeatModeler.html), LTR_FINDER v1.0.5²⁴, and REPEATSCOUT v1.0.5²⁵. The tandem repeat sequences in the genome were identified using TANDEM REPEATS FINDER v4.0.9²⁶. A total of 238.28 Mb (48.31%) repetitive sequences were annotated in E. dulcis (Table 3), with long terminal repeat (LTR) accompanied for the largest proportion (24.32%).

Table 3 Genome annotation of repetitive sequences and protein-coding genes.

Full size table

To predict the gene structure of E. dulcis, three approaches were used: homology-based prediction, ab initio gene prediction and transcriptome-assisted annotation. For the homology-based gene prediction, the protein-coding sequences from Rhynchospora pubera²⁷, R. breviuscula²⁷, R. tenuis²⁷, Brachypodium distachyon²⁸, Carex cristatella²⁹, and Oryza sativa³⁰ were mapped to the assembled genome using TBLASTN v2.2.26 (E-value ≤ 1e⁻⁵)³¹ to obtain high-quality protein structures. The ab initio gene prediction was performed using AUGUSTUS v3.2.3³², GENSCAN v1.0³³, GLIMMERHMM v3.0.4³⁴, GENEID v1.4.4³⁵ and SNAP v2013.11.29³⁶. Besides, the RNA-seq reads separately derived from different tissues (i.e., root, stem, leaf) were aligned to the genome sequence using TOPHAT v2.0.11³⁷ in transcriptome-assisted prediction. Subsequently, the mapped reads were assembled into longer transcripts using CUFFINKS v2.2.1³⁸. The transcripts from all tissues were merged and subjected to TransDecoder in PASA v2.4.1³⁹ for protein-coding sequence prediction and quality filtering. Only complete transcripts were retained for further analysis. All genes predicted by the above methods were integrated into a non-redundant gene set using EvidenceModeler (EVM) v1.1.1⁴⁰. The EVM-predicted genes were further corrected with PASA v2.4.1³⁹ to predict the untranslated regions and alternative splicings. The resulting protein models were functionally annotated according to the best matches with proteins deposited in GeneOntology (GO), Kyoto Encyclopedia of Genes and Genomes (KEGG), Swiss-Prot, TREMBL and NCBI non-redundant protein database (Nr) using BLASTP (E-value = 1e⁻⁵). We identified tRNA genes, rRNA genes and ncRNA genes using TRNASCAN-SE v1.23⁴¹, INFERNAL v1.1.2⁴² and RFAM database (https://rfam.org/), respectively. Finally, we employed MCSCANX⁴³ and ‘CIRCLIZE’ R package⁴⁴ to detect and visualize collinear blocks (Fig. 1B).

Gene family evolution analyses

We employed ORTHOFINDER v2.3.12⁴⁵ to identify orthologous groups for 11 species, including Arabidopsis thaliana⁴⁶, Vitis vinifera⁴⁷, Eleocharis dulcis, Ananas comosus⁴⁸, Rhynchospora breviuscula²⁷, Phoenix dactylifera⁴⁹, Juncus effusus²⁷, Asparagus officinalis⁵⁰, Musa acuminata⁵¹, Sorghum bicolor⁵² and Oryza sativa³⁰. A total of 634 single copy orthologues were obtained and subsequently aligned using MUSCLE v3.8⁵³. All these genes were finally concatenated into one sequence as a ‘supergene’. Based on the ‘supergene’, we used RAXML v8.2.1⁵⁴ to construct a maximum likelihood (ML) tree, with the PROTGAMMAWAG model. Additionally, MCMCtree program implemented in PAML v4.9⁵⁵ was employed to estimate divergence times among species. According to the time-calibrated phylogeny, the genus Eleocharis was estimated to diverge from Rhynchospora ca. 20 million years ago (Fig. 1C). Subsequently, according to the species phylogenetic tree and gene family clustering results (Fig. 1C), gene family expansions and contractions analysis was performed using CAFE v3.1⁵⁶. GO enrichment analysis of gene families (P-value < 0.05) was performed using AGRIGO v2.0⁵⁷ with default parameters. The results revealed that 5,463 and 620 gene families had undergone significant expansions and contractions in E. dulcis (Fig. 1C), respectively. These expanded families were mostly enriched for essential functions, e.g. DNA binding, flavonoid biosynthetic process, nucleic acid binding (Fig. 1D).

The Ks method was employed to identify whole-genome duplication (WGD) events. The yn00 module implemented in PAML v4.9⁵⁵ was used to calculate the synonymous mutation frequency (Ks), nonsynonymous mutation frequency (Ka), and the ratio of nonsynonymous to synonymous mutation rates (Ka/Ks) for collinear gene pairs. The analysis of the Ks distribution showed that, except for a common whole-genome duplication (WGD) event shared by all monocots, E. dulcis underwent a recent WGD event which occurred after its divergence from R. breviuscula (Fig. 2A). Additionally, the synteny blocks generated by JCVI v0.9.13⁵⁸ showed collinearity between a single chromosome of R. breviuscula and multiple chromosomes of E. dulcis and high self-collinearity among chromosomes of E. dulcis. Both suggest multiple occurrences of chromosome breakage and duplication during genomic evolution (Fig. 2B).

Transcriptome analysis

Wild individuals and a traditional cultivar ‘Xuanci’ of Chinese water chestnut were chosen for differential expression analysis of corm development. Our sampling strategy encompassed four distinct stages of corm development, i.e. Stage 1 (S1, initial swelling stage), Stage 2 (S2, middle swelling stage), Stage 3 (S3, late swelling stage), and Stage 4 (S4, maturity stage) (Fig. 3A). The samples of each stage included three biological replicates. Raw reads were generated by the DNBSEQ-T7 platform. Adapters and low-quality reads were filtered using FASTP v0.21¹⁵. Clean reads of each tissue were mapped to the reference genome using HISAT2 v2.0.4⁵⁹. The read counting of the annotated genes was performed using STRINGTIE v2.1.6⁶⁰. R package ‘DESeq2’⁶¹ was used to analyze differential gene expression. PCA showed that transcriptomes of three biological replicates within each time point group clustered together (Fig. 3B). The DEGs between various tissues were defined according to the following criteria:|log₂FC| ≥ 1 and the false discovery rate (FDR) corrected P-value < 0.05.

RNAseq generated a total of 1,345,118,694 reads, with an average of 56,046,612 reads per sample

Among these stages, Stage 4 (maturity stage) exhibited the lowest number of DEGs, with 5,949 down-regulated genes and 5,965 up-regulated genes between cultivated and wild accessions (Fig. 3C). In contrast, Stage 2 (the middle swelling stage) had the highest number of DEGs, with 9,121 down-regulated genes and 6,749 up-regulated genes (Fig. 3C). Fuzzy c-means clustering implemented in the ‘MFUZZ’ R package⁶² revealed four major expression reaction norms in cultivated accessions (Fig. 3D). Genes in Cluster 1 were down-regulated after S1, while genes in Cluster 2 and Cluster 3 were up-regulated, reaching expression peak at S2 and S3, respectively. In Cluster 4, genes showed bimodal expression patterns (Fig. 3D). GO (Table S2) and KEGG enrichment (Table S3) analyses further demonstrated that the up-regulated genes in Cluster 2 were significantly (P < 0.05) enriched for starch metabolic process.

Data Records

The whole genome sequencing datasets have been stored in the National Genomics Data Center (NGDC). The raw data can be located using the GSA number CRA013147⁶³, the genome sequence and annotation data were under accession number GWHEQVU00000000⁶⁴, which corresponds to the BioProject accession number PRJCA018613⁶⁵. Additionally, the RNA-seq data for both cultivated and wild accessions at four different stages can be found under the BioProject accession number PRJCA020682⁶⁶, with the corresponding GSA number being CRA013134⁶⁷. The genome sequence data were also deposited in the SRA (Sequence Read Archive) database with accession number SRX22848329–SRX22848333^{68,69,70,71,72} under the BioProject accession number PRJNA1033537. The assembled genome was deposited in the GeneBank in NCBI under accession number GCA_040741935.1⁷³. The RNA-seq data were submitted under PRJNA1033533. All expression data were uploaded to the the Gene Expression Omnibus (GEO) public database under Accession No. GSE264132⁷⁴.

Technical Validation

The evaluation of the contig and chromosome levels of the assemblies was conducted through four methods: N50 for assessing continuity (contig N50 = 4.7 Mb), QV scores calculated by MERQURY⁷⁵ v1.3 for measuring accuracy (QV = 49.2) and the paired-end (Illumina) reads mapping rate for ensuring consistency with the raw data (Mapping rate = 98.9%, read coverage (20X) = 99.0%). The verification of protein-coding genes was performed using BUSCO⁷⁶. The BUSCO analysis showed 94.4% of complete BUSCO genes at the “genome mode” and 94.0% at the “protein mode”. The latter comprised 1,036 (64.8%) single-copy BUSCOs and 472 (29.2%) duplicated BUSCOs.

Code availability

No custom code was developed in this work. Software and pipelines were implemented according to the manuals and protocols of published bioinformatics tools. Software version and parameters were described in Methods.

References

Morton, J. F., Sanchez, C. A. & Snyder, G. H. Chinese water chestnuts in Florida: past, present, and future. Proc. Fla. State. Hort. Soc. 101, 139–144 (1988).
Google Scholar
Wu, Z. Y., P. H. Raven & D. Y. Hong. Cyperaceae. In Flora of China. Vol. 23 188–200. (Science Press, Beijing, and Missouri Botanical Garden Press, St. Louis., 2010).
Li, M. & Midmore, D. J. Estimating the genetic relationships of Chinese water chestnut (Eleocharis dulcis (Burm. f.) Hensch) cultivated in Australia, using random amplified polymorphic DNAs (RAPDs). J. Horticult. Sci. Biotechnol. 74, 224–231 (2015).
Article Google Scholar
Zhan, G. et al. Study on antibacterial properties and major bioactive constituents of Chinese water chestnut (Eleocharis dulcis) peels extracts/fractions. Eur. Food Res. Technol. 238, 789–796 (2014).
Article CAS Google Scholar
Zhang, Y. et al. Eleocharis dulcis corm: phytochemicals, health benefits, processing and food products. J. Sci. Food Agric. 102, 19–40 (2022).
Article PubMed CAS Google Scholar
Chen, Q. Y. Chinese water chestnuts. (Chinese Agricultural Publishing House, Beijing, China; 1995).
Jing, S. et al. Long-distance control of potato storage organ formation by SELF PRUNING 3D and FLOWERING LOCUS T-like 1. Plant Commun. 4, 100547 (2023).
Article PubMed PubMed Central CAS Google Scholar
Navarro, C. et al. Control of flowering and storage organ formation in potato by FLOWERING LOCUS T. Nature. 478, 119–122 (2011).
Article ADS PubMed CAS Google Scholar
Zhang, X. et al. Pangenome of water caltrop reveals structural variations and asymmetric subgenome divergence after allopolyploidization. Horticulture Research. 10, uhad203 (2023).
Article PubMed PubMed Central CAS Google Scholar
Cheng, L. et al. Transcriptome analysis of gene expression during Chinese water chestnut storage organ formation. PLoS One 11, e164223 (2016).
Article Google Scholar
Uchiyama, H., Matoba, H., Aizawa, T., Sumida, H. & Do, M. N. Chromosome counts of some wetland cyperaceous species from the Mekong Delta, Vietnam. Cytologia. 75, 335–339 (2010).
Article Google Scholar
Roalson, E. H. A synopsis of chromosome number variation in the Cyperaceae. Bot. Rev. 74, 209–393 (2008).
Article Google Scholar
Marçais, G. & Kingsford, C. A fast, lock-free approach for efficient parallel counting of occurrences of k-mers. Bioinformatics 27, 764–770 (2011).
Article PubMed PubMed Central Google Scholar
Vurture, G. W. et al. GenomeScope: fast reference-free genome profiling from short reads. Bioinformatics 33, 2202–2204 (2017).
Article PubMed PubMed Central CAS Google Scholar
Chen, S., Zhou, Y., Chen, Y. & Gu, J. fastp: an ultra-fast all-in-one FASTQ preprocessor. Bioinformatics 34, 884–890 (2018).
Article Google Scholar
Wingett, S. et al. HiCUP: pipeline for mapping and processing Hi-C data. F1000Res 4, 1310 (2015).
Article PubMed PubMed Central Google Scholar
Cheng, H., Concepcion, G. T., Feng, X., Zhang, H. & Li, H. Haplotype-resolved de novo assembly using phased assembly graphs with hifiasm. Nat. Methods. 18, 170–175 (2021).
Article PubMed PubMed Central CAS Google Scholar
Li, H. Minimap2: pairwise alignment for nucleotide sequences. Bioinformatics 34, 3094–3100 (2018).
Article PubMed PubMed Central CAS Google Scholar
Zhang, X., Zhang, S., Zhao, Q., Ming, R. & Tang, H. Assembly of allele-aware, chromosomal-scale autopolyploid genomes based on Hi-C data. Nat. Plants. 5, 833–845 (2019).
Article PubMed CAS Google Scholar
Durand, N. C. et al. Juicebox provides a visualization system for Hi-C contact maps with unlimited zoom. Cell Syst 3, 99–101 (2016).
Article PubMed PubMed Central CAS Google Scholar
Jain, C. et al. Weighted minimizer sampling improves long read mapping. Bioinformatics 36, 111–118 (2020).
Article Google Scholar
Vaser, R., Sović, I., Nagarajan, N. & Šikić, M. Fast and accurate de novo genome assembly from long uncorrected reads. Genome Res 27, 737–746 (2017).
Article PubMed PubMed Central CAS Google Scholar
Lin, Y. et al. quarTeT: a telomere-to-telomere toolkit for gap-free genome assembly and centromeric repeat identification. Horticulture Research 10, uhad127 (2023).
Article PubMed PubMed Central Google Scholar
Xu, Z. & Wang, H. LTR_FINDER: an efficient tool for the prediction of full-length LTR retrotransposons. Nucleic Acids Res. 35, 265–268 (2007).
Article Google Scholar
Price, A. L., Jones, N. C. & Pevzner, P. A. De novo identification of repeat families in large genomes. Bioinformatics 21, 351–358 (2005).
Article Google Scholar
Benson, G. Tandem repeats finder: a program to analyze DNA sequences. Nucleic Acids Res. 27, 573–580 (1999).
Article PubMed PubMed Central CAS Google Scholar
Hofstatter, P. G. et al. Repeat-based holocentromeres influence genome architecture and karyotype evolution. Cell 185, 3153–3168 (2022).
Article PubMed CAS Google Scholar
The International Brachypodium Initiative. Genome sequencing and analysis of the model grass Brachypodium distachyon. Nature 463, 763–768 (2010).
Planta, J. et al. Chromosome-scale genome assemblies and annotations for Poales species Carex cristatella, Carex scoparia, Juncus effusus, and Juncus inflexus. G3: Genes, Genomes, Genet. 12, jkac211 (2022).
Article CAS Google Scholar
Ouyang, S. et al. The TIGR Rice Genome Annotation Resource: improvements and new features. Nucleic Acids Res. 35, 883–887 (2007).
Article Google Scholar
Gertz, E. M., Yu, Y., Agarwala, R., Schäffer, A. A. & Altschul, S. F. Composition-based statistics and translated nucleotide searches: improving the TBLASTN module of BLAST. BMC Biol. 4, 1–14 (2006).
Article Google Scholar
Stanke, M., Schoffmann, O., Morgenstern, B. & Waack, S. Gene prediction in eukaryotes with a generalized hidden Markov model that uses hints from external sources. BMC Bioinf. 7, 62 (2006).
Article Google Scholar
Burge, C. & Karlin, S. Prediction of complete gene structures in human genomic DNA. J. Mol. Biol. 268, 78–94 (1997).
Article PubMed CAS Google Scholar
Majoros, W. H., Pertea, M. & Salzberg, S. L. TigrScan and GlimmerHMM: two open source ab initio eukaryotic gene-finders. Bioinformatics 20, 2878–2879 (2004).
Article PubMed CAS Google Scholar
Blanco, E., Parra, G. & Guigo, R. Using geneid to identify genes. Curr Protoc Bioinformatics. Chapter 4, 3–4 (2007).
Google Scholar
Korf, I. Gene finding in novel genomes. BMC Bioinf. 5, 59 (2004).
Article Google Scholar
Trapnell, C., Pachter, L. & Salzberg, S. L. TopHat: discovering splice junctions with RNA-Seq. Bioinformatics 25, 1105–1111 (2009).
Article PubMed PubMed Central CAS Google Scholar
Trapnell, C. et al. Differential gene and transcript expression analysis of RNA-seq experiments with TopHat and Cufflinks. Nat. Protoc. 7, 562–578 (2012).
Article PubMed PubMed Central CAS Google Scholar
Haas, B. J. et al. Improving the Arabidopsis genome annotation using maximal transcript alignment assemblies. Nucleic Acids Res. 31, 5654–5666 (2003).
Article PubMed PubMed Central CAS Google Scholar
Haas, B. J. et al. Automated eukaryotic gene structure annotation using EVidenceModeler and the Program to Assemble Spliced Alignments. Genome Biol. 9, R7 (2008).
Article PubMed PubMed Central Google Scholar
Schattner, P., Brooks, A. N. & Lowe, T. M. The tRNAscan-SE, snoscan and snoGPS web servers for the detection of tRNAs and snoRNAs. Nucleic Acids Res. 33, 686–689 (2005).
Article Google Scholar
Nawrocki, E. P. & Eddy, S. R. Infernal 1.1: 100-fold faster RNA homology searches. Bioinformatics 29, 2933–2935 (2013).
Article PubMed PubMed Central CAS Google Scholar
Wang, Y. et al. MCScanX: a toolkit for detection and evolutionary analysis of gene synteny and collinearity. Nucleic Acids Res. 40, 49 (2012).
Article ADS Google Scholar
Gu, Z., Gu, L., Eils, R., Schlesner, M. & Brors, B. circlize implements and enhances circular visualization in R. Bioinformatics. 19, 2811 (2014).
Article Google Scholar
Emms, D. M. & Kelly, S. OrthoFinder: solving fundamental biases in whole genome comparisons dramatically improves orthogroup inference accuracy. Genome Biol. 16, 157 (2015).
Article PubMed PubMed Central Google Scholar
The Arabidopsis Genome Initiative. Analysis of the genome sequence of the flowering plant Arabidopsis thaliana. Nature 408, 796–815 (2000).
Article ADS Google Scholar
Massonnet, M. et al. The genetic basis of sex determination in grapes. Nat. Commun. 11, 2902 (2020).
Article ADS PubMed PubMed Central CAS Google Scholar
Ming, R. et al. The pineapple genome and the evolution of CAM photosynthesis. Nature Genet. 47, 1435–1442 (2015).
Article PubMed CAS Google Scholar
Al-Dous, E. K. et al. De novo genome sequencing and comparative genomics of date palm (Phoenix dactylifera). Nat. Biotechnol. 29, 521–527 (2011).
Article PubMed CAS Google Scholar
Harkess, A. et al. The asparagus genome sheds light on the origin and evolution of a young Y chromosome. Nat. Commun. 8, 1279 (2017).
Article ADS PubMed PubMed Central Google Scholar
D Hont, A. et al. The banana (Musa acuminata) genome and the evolution of monocotyledonous plants. Nature 488, 213–217 (2012).
Article ADS PubMed CAS Google Scholar
Paterson, A. H. et al. The Sorghum bicolor genome and the diversification of grasses. Nature 457, 551–556 (2009).
Article ADS PubMed CAS Google Scholar
Edgar, R. C. MUSCLE: multiple sequence alignment with high accuracy and high throughput. Nucleic Acids Res. 32, 1792–1797 (2004).
Article PubMed PubMed Central CAS Google Scholar
Stamatakis, A. RAxML version 8: a tool for phylogenetic analysis and post-analysis of large phylogenies. Bioinformatics 30, 1312–1313 (2014).
Article PubMed PubMed Central CAS Google Scholar
Yang, Z. PAML 4: phylogenetic analysis by maximum likelihood. Mol. Biol. Evol. 24, 1586–1591 (2007).
Article PubMed CAS Google Scholar
Han, M. V., Thomas, G. W., Lugo-Martinez, J. & Hahn, M. W. Estimating gene gain and loss rates in the presence of error in genome assembly and annotation using CAFE 3. Mol. Biol. Evol. 30, 1987–1997 (2013).
Article PubMed CAS Google Scholar
Tian, T. et al. agriGO v2.0: a GO analysis toolkit for the agricultural community, 2017 update. Nucleic Acids Res. 122–129 (2017).
Tang, H. et al. Synteny and collinearity in plant genomes. Science. 320, 486–488 (2008).
Article ADS PubMed CAS Google Scholar
Kim, D., Langmead, B. & Salzberg, S. L. HISAT: a fast spliced aligner with low memory requirements. Nat. Methods. 12, 357–360 (2015).
Article PubMed PubMed Central CAS Google Scholar
Pertea, M. et al. StringTie enables improved reconstruction of a transcriptome from RNA-seq reads. Nat. Biotechnol. 33, 290–295 (2015).
Article PubMed PubMed Central CAS Google Scholar
Love, M. I., Huber, W. & Anders, S. Moderated estimation of fold change and dispersion for RNA-seq data with DESeq 2. Genome Biol. 15 (2014).
Kumar, L. & E, F. M. Mfuzz: a software package for soft clustering of microarray data. Bioinformation. 2, 5–7 (2007).
Article PubMed PubMed Central Google Scholar
NGDC Genome Sequence Archive. https://ngdc.cncb.ac.cn/gsa/browse/CRA013147 (2024).
NGDC Genome Warehouse. https://ngdc.cncb.ac.cn/gwh/Assembly/83530/show (2024).
NGDC BioProject. https://ngdc.cncb.ac.cn/bioproject/browse/PRJCA018613 (2024).
NGDC BioProject. https://ngdc.cncb.ac.cn/bioproject/browse/PRJCA020682 (2024).
NGDC Genome Sequence Archive. https://ngdc.cncb.ac.cn/gsa/browse/CRA013134 (2024).
NCBI Sequence Read Archive. https://identifiers.org/ncbi/insdc.sra:SRX22848329 (2024).
NCBI Sequence Read Archive. https://identifiers.org/ncbi/insdc.sra:SRX22848330 (2024).
NCBI Sequence Read Archive. https://identifiers.org/ncbi/insdc.sra:SRX22848331 (2024).
NCBI Sequence Read Archive. https://identifiers.org/ncbi/insdc.sra:SRX22848332 (2024).
NCBI Sequence Read Archive. https://identifiers.org/ncbi/insdc.sra:SRX22848333 (2024).
NCBI GeneBank. https://identifiers.org/ncbi/insdc.gca:GCA_040741935.1 (2024).
NCBI Gene Expression Omnibus. https://identifiers.org/geo/GSE264132 (2024).
Rhie A. et al. Merqury: reference-free quality, completeness, and phasing assessment for genome assemblies. Genome biology, 21 (2020).
Simão, F. A., Waterhouse, R. M., Ioannidis, P., Kriventseva, E. V. & Zdobnov, E. M. BUSCO: assessing genome assembly and annotation completeness with single-copy orthologs. Bioinformatics. 31, 3210–3212 (2015).
Article PubMed Google Scholar

Download references

Acknowledgements

Special thanks to the colleagues in BENAGEN company for their assistance with data analyses. This work was supported by the Program on the Collection, Conservation, and Utilization of Germplasm Resources of Aquatic Vegetables in Zhejiang Province.

Author information

These authors contributed equally: Yang Chen, Xinyi Zhang, Lingyun Wang.

Authors and Affiliations

Provincial Key Laboratory of Characteristic Aquatic Vegetable Breeding and Cultivation, Jinhua Academy of Agricultural Sciences (Zhejiang Institute of Agricultural Machinery), Jinhua, Zhejiang, 321000, China
Yang Chen, Lingyun Wang, Mingya Fang, Wei Sheng, Lin Shi & Zhaisheng Zheng
Wuhan Botanical Garden, Chinese Academy of Sciences, Wuhan, Hubei, 430074, China
Yang Chen, Xinyi Zhang, Yazhen Ma, Yan Huang & Yingxiong Qiu
Institute of Botany, Jiangsu Province and Chinese Academy of Sciences, Nanjing, Jiangsu, 210014, China
Ruisen Lu
Seed Management Station of Zhejiang Province, Hangzhou, Zhejiang, 310020, China
Xiaoyang Chen

Authors

Yang Chen
View author publications
Search author on:PubMed Google Scholar
Xinyi Zhang
View author publications
Search author on:PubMed Google Scholar
Lingyun Wang
View author publications
Search author on:PubMed Google Scholar
Mingya Fang
View author publications
Search author on:PubMed Google Scholar
Ruisen Lu
View author publications
Search author on:PubMed Google Scholar
Yazhen Ma
View author publications
Search author on:PubMed Google Scholar
Yan Huang
View author publications
Search author on:PubMed Google Scholar
Xiaoyang Chen
View author publications
Search author on:PubMed Google Scholar
Wei Sheng
View author publications
Search author on:PubMed Google Scholar
Lin Shi
View author publications
Search author on:PubMed Google Scholar
Zhaisheng Zheng
View author publications
Search author on:PubMed Google Scholar
Yingxiong Qiu
View author publications
Search author on:PubMed Google Scholar

Contributions

Y.Q. and Z.Z. conceived the project. Y.C., X.Z. and L.W. performed experiments and coordinated research activities. Y.C., X.Z., L.W., Y. H., M.F., X. C., W. S., and M.F. collected the samples. Y.C. and X.Z. analyzed data. Y.C. drafted the manuscript. Y.Q., Y.M., X.Z. and R. L. revised and finalized the manuscript. All authors read and approved the manuscript.

Corresponding authors

Correspondence to Zhaisheng Zheng or Yingxiong Qiu.

Ethics declarations

Competing interests

The authors declare no competing interests.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary information

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.

Reprints and permissions

About this article

Cite this article

Chen, Y., Zhang, X., Wang, L. et al. Telomere-to-telomere genome assembly of Eleocharis dulcis and expression profiles during corm development. Sci Data 11, 869 (2024). https://doi.org/10.1038/s41597-024-03717-y

Download citation

Received: 08 January 2024
Accepted: 31 July 2024
Published: 10 August 2024
DOI: https://doi.org/10.1038/s41597-024-03717-y