Background & Summary

Bryophytes, including hornworts, liverworts, and mosses, form the sister group to vascular plants, and exhibit a haploid-dominated lifecycle with unbranched sporophytes remain attaching to and reliant on gametophytes1. Bryophytes excel in enduring some of the most severe environmental conditions, ranging from low light intensity and extreme temperatures to nutrient scarcity and desiccation. As a result, they often act as pioneering species within various ecosystems. Notably, bryophytes are the predominant flora in Antarctica’s terrestrial ecosystems2 and can also thrive in arid deserts3. These adaptabilities may stem from an evolutionarily refined genetic toolkit geared towards stress tolerance4.

Aulacomnium turgidum (Aulacomniaceae, Rhizogoniales), a moss species widespread across the Arctic Circle, including Svalbard, Greenland, and Alaska, is notable for its ability to regenerate after being entombed in ice for 400 years, demonstrating exceptional freeze stress tolerance5. The A. turgidum (Fig. 1a) was collected from Svalbard, a Norwegian archipelago, epitomizes a typical Arctic environment. The annual mean surface air temperatures vary from 2.5 to 5.8 °C in summer and to −11.4 to −9.4 °C in winter6. Despite these extreme conditions, resilient plants, including bryophytes, flourish due to their high tolerance for abiotic stresses7,8.

Fig. 1
figure 1

(a) Aulacomnium turgidum and (b) geographical map showing its sampling site in the Arctic. (c) Polytrichastrum alpinum and (d) geographical map showing its sampling site in the Antarctic.

Polytrichastrum alpinum (Polytrichaceae, Polytrichales) is a common Antarctic moss species, and thrives in moist, rocky environments near glacier moraine peaks, as well as in dry areas2. The P. alpinum sample (Fig. 1c) was collected from King George Island, located at the northern tip of the Antarctic Peninsula, an area characterized by its semi-desert landscape9. The average annual temperature of this area is −2.3 °C, with summer months experiencing temperatures slightly above freezing. The region is subject to frequent strong winds10. The island is home to 64 documented species of moss, which are primarily found in humid, sheltered areas with relatively stable and partially organic soils11,12.

Over the recent decades, research has utilized mosses including A. turgidum and P. alpinum from the Arctic and Antarctic regions to study the impacts of global warming on Earth13,14, the interactions between polar microbial communities and plants15,16, and the molecular mechanisms behind unique survival strategies in extreme conditions17,18. Despite the ecological significance of polar mosses, our understanding of their genomes remains limited19. These mosses, which thrive under the harsh conditions of the Arctic and Antarctic, are critical for understanding resilience and adaptability to extreme environments. However, the scarcity of complete genomic data hinders our ability to fully comprehend the molecular mechanisms that underpin their unique survival strategies. This gap in knowledge underscores the need for enhanced genomic research to better exploit the potential of polar mosses in studying environmental adaptation and climate change resilience.

In this study, we assembled high-quality chromosome-level genomes of A. turgidum and P. alpinum using a combination of Illumina short reads, Nanopore long reads, and high-throughput chromosome conformation capture (Hi-C) data. The genome of A. turgidum was assembled as 277.84 Mb with a contig N50 of 11.92 Mb, while the genome of P. alpinum was 498.33 Mb with a contig N50 of 4.24 Mb (Table 1). A total of 275.60 Mb and 488.51 Mb of the assemblies were anchored to 11 and 8 chromosomes for A. turgidum and P. alpinum, respectively (Table 2). Both species possess a sex chromosome with lower gene density and higher repetitive sequence density than the autosomes. A. turgidum and P. alpinum respectively encodes 25,999 and 28,070 protein-coding genes (Table 3). These two high-quality genomes offer valuable new genomic resources for future research into the genetic foundations and adaptive evolution of plants in Arctic and Antarctic environments.

Table 1 Characteristics of genome assemblies and genome size estimates for Aulacomnium turgidum and Polytrichastrum alpinum.
Table 2 Summary of the assembled chromosomes of A. turgidum and P. alpinum.
Table 3 Genome annotation of A. turgidum and P. alpinum.

Methods

Plant materials and sequencing

Wild gametophytes of A. turgidum were collected from Spitsbergen Island, Svalbard (78°54′41″ N, 11°58′35″ E) (Fig. 1b) on September 18th, 2018, and P. alpinum from King George Island, Antarctica (62°12.041” S, 58°59.698” W) (Fig. 1d) on December 23rd, 2018. Voucher specimens were deposited in the Herbarium of Shenzhen Fairy Lake Botanical Garden in Shenzhen, China (SZG). The entire moss plant, including its leaves, stems, and rhizoids, was used for sequencing. Approximately 5 grams of material were gathered from these samples for genomic and transcriptomic sequencing, respectively. Three distinct types of genome sequencing techniques were performed: the Nanopore long-read sequencing, the Hi-C sequencing, and the Illumina short-read sequencing. Additionally, plant tissues were utilized for transcriptomic sequencing. Transcriptome libraries were constructed using a TruSeq RNA Library Prep Kit v2 (Illumina, CA, USA), with an insert size ranging from 200 to 400 bp, following polyA selection. Sequencing was performed on an Illumina NovaSeq 6000 platform (Illumina, CA, USA), generating 150-bp paired-end reads.

Genome assembly

Illumina short-read sequencing and genome survey

The Illumina short reads were processed using Trimmomatic (v0.39)20 to filter out duplicates, low-quality reads, and adapters. Filtered reads were used for k-mer analyses to estimate genome sizes. Jellyfish (v2.0.0)21 was employed to count the k-mer frequencies with the ‘-m 21’ parameter. GenomeScope (v2.0)22 was used to estimate genome sizes with the settings “-k 21 -p 1” along with other default parameters. These analyses predicted that the genome sizes for A. turgidum and P. alpinum are 272.00 Mb and 449.50 Mb, respectively, with the proportion of repetitive sequences at 29.52% and 40.00%, respectively (Fig. 2).

Fig. 2
figure 2

Genome survey based on k-mer (k = 21) analysis of A. turgidum and P. alpinum. The X-axis represents the k-mer depth, while the Y-axis indicates the k-mer frequency for a given depth.

ONT long-read sequencing and genome assembly

Nanopore libraries were prepared using SQK-LSK108 and sequenced on a Nanopore PromethION sequencer. The long reads obtained were assembled using NextDenovo (v2.5.0)23. The initial assembly was then polished for three iteratives with NextPolish (v1.3.1)24, using both Nanopore long reads and filtered Illumina reads.

Hi-C scaffolding and chromosome-level genome assembly

The Hi-C library construction involved several steps: cross-linking, restriction enzyme digestion (using MboI), end repair, DNA cyclization, and purification25. Paired-end 150-bp reads were generated on an Illumina NovaSeq 6000 platform (Illumina, CA, USA). The raw Hi-C reads were initially processed with Trimmomatic (v0.39)20 using default parameters. Subsequently, Juicer (v1.6)26 was employed to extract valid data. Misassembled contigs were corrected, anchored, ordered, and oriented using the 3D-DNA pipeline (v 180922)27. Juicebox (v1.11.08)28 was then utilized for manual adjustments of the anchored results. For scaffolds not anchored to chromosomes, we performed the procedure outlined by Zhou et al.29 to remove those containing contamination or organelle fragments, and to link consecutive contigs to generate a high-quality genome assembly (Fig. 3). This process ultimately enabled the identification of 11 chromosomes for A. turgidum and 8 chromosomes for P. alpinum, resulting in final chromosome-scale genome assemblies of 277.84 Mb for A. turgidum and 498.33 Mb for P. alpinum (Table 1).

Fig. 3
figure 3

Hi-C interaction heatmap for A. turgidum and P. alpinum. The heatmap illustrates the density of Hi-C interactions among distinct chromosomes, showing scaffolded and independently assembled chromosomes at high resolution. The intensity of red color reflects to the number of Hi-C interactions.

Repeat annotation

For repeat annotation, a customized de novo repeat library was created using a homology-based approach. Programs Piler (v0.4.1)30, LTR_FINDER (v1.0.5)31, RepeatScout (v1.0.5)32, and RepeatModeler (v2.0)33 were used to generate the libraries. The resulting repetitive sequence libraries were combined and used as input for RepeatMasker (v.4.1.1)34. Additionally, Repbase (v21.01)35 served as the database for known repetitive elements, searched using RepeatMasker (v.4.1.1)34. Tandem Repeats Finder (TRF v4.07)36 was employed to predict tandem repeat sequences across the genome. The results indicated that the genomes of A. turgidum and P. alpinum contained 42.65% and 53.37% repetitive elements, respectively. The percentages of long terminal repeat (LTR) retrotransposons were 15.45% for A. turgidum and 25.16% for P. alpinum (Table 3 and Fig. 4).

Fig. 4
figure 4

Circos plots illustrating the chromosomal-level genome assemblies for A. turgidum (left) and P. alpinum (right). The tracks, from outermost to innermost, are: chromosomes, GC content, gene density, TE density, LTR/Gypsy density, LTR/Copia density and color ribbons representing genome-wide syntenic blocks of inter-chromosome and intra-chromosome. The color bar indicates the proportion of GC content.

Gene and functional annotation

To predict protein-coding genes (PCGs) for A. turgidum and P. alpinum, we integrated two types of evidence. We sourced proteome sequences from Physcomitrium patens37, Marchantia polymorpha (https://marchantia.info/), Arabidopsis thaliana (https://www.arabidopsis.org/), and the bryophyte proteins from Swiss-Prot (https://www.uniprot.org/) to provide homology-based protein evidence. Transcriptome evidence involved mapping clean illumina reads to the assembled genomes using HISAT (v2.2.0)38 to produce BAM files. Gene models were then predicted using BRAKER3 pipeline (v3.0.6)39, in conjunction with AUGUSTUS40 and GeneMark41 de novo predictions, based on the soft-masked genome, protein evidence, and transcriptome data. Additionally, transcripts were generated using StringTie (v2.2.1)42 and TransDecoder (v5.5.0) (https://github.com/TransDecoder/TransDecoder). These pieces of evidences were integrated using EVidenceModeler (v2.1.0)43. Subsequently, Trinity (v2.8.4)44 was used for de novo transcript assembly, updating gene models to include untranslated regions (UTRs) and alternative splicing variants via the PASA pipeline (v2.5.3)45. A total of 25,999 and 28,070 protein-coding genes were predicted in A. turgidum and P. alpinum, respectively (Table 3). The completeness of the gene space was evaluated using the BUSCO v.3.1.0, based on the Viridiplantae_odb10 set46. For functional annotation, gene models were aligned against the UniProt (Swiss-Prot and TrEMBL), KEGG, and TAIR databases. Protein domains and gene ontologies were annotated using InterProScan (v5.51–85.0)47. Additionally, the iTAK online tool (v1.6)48 was used to identify transcription factors, predicting 727 transcription factors in A. turgidum and 698 in P. alpinum, (Table 4, Tables S1, S2).

Table 4 Overview of gene functional annotations in Aulacomnium turgidum and Polytrichastrum alpinum.

Non-coding RNA annotation

Transfer RNAs (tRNAs) were identified using tRNAscan-SE (v1.3.1)49 with default parameters. Given the high conservation of ribosomal RNAs (rRNAs), rRNA sequences from closely related species were downloaded from the Ensembl database and used as references in BLAST (v2.2.26)50 search with an e-value threshold of 1e-5. Other non-coding RNAs (ncRNAs), including micro RNAs (miRNAs) and small nuclear RNAs (snRNAs), were identified by searching against the Rfam database (v12.0)51 using Infernal (v1.1.1)52 with default parameters. In total, 1,032 and 3,223 ncRNAs were identified in A. turgidum and P. alpinum, respectively (Table 5).

Table 5 Non-coding RNA classification and genome proportion in Aulacomnium turgidum and Polytrichastrum alpinum.

Genome synteny analysis and Circos diagram construction

The python library jcvi (v1.1.8)53 was employed to identify intra-genomic syntenic blocks and detect inter- and intra-chromosomal synteny (Table S3). The syntenic regions were first identified using the jcvi.compara.catalog ortholog function with a default C-score threshold of 0.7 to filter out low-quality hits. Syntenic depths were then calculated using the jcvi.compara.synteny depth function. To further refine the syntenic blocks, we applied the jcvi.compara.synteny screen function with the parameters–minspan = 30–simple. Visualization of syntenic relationships and genomic feature distributions was carried out using Circos54. The Circos plot (Fig. 4) includes synteny blocks, GC content, gene density, transposable element (TE) density, LTR/Gypsy density, and LTR/Copia density, with all features calculated in 1 Mb windows.

Phylogenetic reconstruction

A total of 14 bryophyte species, including the newly sequenced A. turgidum and P. alpinum, were sampled to identify orthologs using OrthoFinder (v2.3.11)55. Alignment for protein sequences of 61 single-copy orthologs was performed using MAFFT (v7.453)56. The resulting alignments were then concatenated and used as the input data for IQ-TREE2 (v2.0.6)57 to construct a maximum likelihood phylogenetic tree, with the JTT model and 1,000 ultrafast bootstrap replicates. The resulting phylogenetic tree was rooted with Anthoceros angustus, and visualized using the interactive Tree of Life (iTOL)58 (Fig. 5).

Fig. 5
figure 5

A maximum likelihood phylogenetic tree of bryophytes based on 61 single-copy ortholog protein sequences from 14 bryophyte species. The newly sequenced two mosses are highlighted in bold.

Data Records

The raw data from Nanopore, Hi-C, Illumina short-read sequencing used for genome assembly and annotation have been deposited in the Genome Sequence Archive (GSA) of the National Genomics Data Center (NGDC) with the accession number CRA01759659 under the BioProject accession number PRJCA02776060. All the genomic sequencing raw data were also deposited in the China National GeneBank Database (CNGB) Nucleotide Sequence Archive (CNSA) under accession numbers CNP000289561. The final contigs and chromosome assemblies are submitted to NCBI under the accession number GCA_048933245.162, GCA_048933195.163 of A. turgidum and P. alpinum, respectively. The contigs and chromosome-scale genome assemblies have also been made available in the GSA at the NGDC. The specific accession numbers for sequences of A. turgidum and P. alpinum are GWHEUUP00000000.164 and GWHEUUQ00000000.165, respectively. The annotation files are available in figshare66.

Technical Validation

The completeness and accuracy of the genome assemblies for A. turgidum and P. alpinum were assessed using multiple approaches. First, Benchmarking Universal Single-Copy Orthologs (BUSCO)46 analyses with the Viridiplantae_odb10 indicated completeness scores of 98.20% for A. turgidum and 98.00% for P. alpinum. Additionally, QUAST67 was employed to evaluate assembly contiguity metrics, revealing final assembly size of 277.84 Mb for A. turgidum with a L50 value of 5, and 498.33 Mb for P. alpinum with a L50 value of 3. To further assess assembly accuracy, Merqury68 was used, yielding quality values (QV) of 30.14 for A. turgidum and 29.59 for P. alpinum, with assembly completeness rates of 99.42% and 93.06%, respectively. The interaction contact pattern, centered along the principal diagonal in the Hi-C heatmap (Fig. 3), further supports the accuracy of the chromosome -level assemblies.