Background

Pyrosomes represent a category of marine holoplankton, characterized by a structure termed “single body,” which is not an individual organism but rather an aggregation of several to even millions of minuscule entities known as “zooids” (Fig. 1). These cylindrical colonies exhibit a length that can vary significantly, ranging from a few millimeters to several meters, contingent on the specific species and stages of development.

Fig. 1
figure 1

Morphology of pyrosomes. (a) Photomicrograph of colony of Pyrosoma atlanticum. (b) Details of single zooids in colony. (c) Morphological drawing of single zooid of P. atlanticum (after Neumann). Scale bar = 1 cm.

Pyrosomes are classified under the order Pyrosomatida Jones, 18481, class Thaliacea, Subphylum Tunicata Balfour, 18812, Phylum Chordata Haeckel, 18743. Generally, bilaterians divide into two primary categories, protostomes and deuterostomes4. Chordates, encompassing tunicates, cephalochordates and vertebrates, constitute a clade distinguished unambiguously within deuterostomes, with cephalochordates being the earliest to diverge5,6,7. Tunicates comprise over 3,000 recognized species and form sister groups with vertebrates (Fig. 2).

Fig. 2
figure 2

Cladogram of phylogenetic relationships between P. atlanticum and animal groups of interest. X-axis: divergence time, Mya: million years ago, C1: Cambrian, O: Ordovician, S: Silurian, D: Devonian, C2: Carboniferous, P1: Permian, T: Triassic, J: Jurassic, C3: Cretaceous, P2: Paleogene, N: Neogene.

The etymology of pyrosome is from the Ancient Greek pyro (fire) and soma (body), which means the body can emit light (fire), since it has the characteristic of bioluminescence. There were light organs on both sides of the anterior end of the branchial basket at the inner edge of the oral siphon of the zooids (Fig. 2c), and zooids can respond to external stimuli and other light sources, leading to bioluminescence.

Although pyrosomes inhabit various tropical to temperate waters, knowledge of their distribution remains limited due to the uncertainty of their probability in samples of plankton. To date, only a few species have been confirmed, and the biological records of this group are sparse8. There are only 8 pyrosomes species have been confirmed within 1 family (Pyrosomatidae), 2 subfamilies (Pyrosomatinae Lahille, 18889 and Pyrostremmatinae van Soest, 197910) and 3 genera (Pyrosoma Péron, 180411, Pyrosomella van Soest, 197910 and Pyrostremma Garstang, 192912). Research on this group remains limited, with taxonomy primarily based on brief morphological details. Moreover, the taxonomy of pyrosomes continues to be a topic of debate. Pyrosomes serve as important consumers of phytoplankton, exhibiting the highest feeding rate of zooplankton to phytoplankton, and the colonies utilize food intake to promote high biomass turnover and material circulation13,14. Due to the high consumption rate, these organisms generate considerable fecal particles, which contribute notably to the sedimentation of organic matter debris, often referred to as Marine Snow. Furthermore, they provide a vital food source for higher trophic levels within the marine food web. Consequently, pyrosomes are essential for the transport and cycling of organic carbon in marine ecosystems15.

Pyrosoma atlanticum Péron, 180411 is the most extensively researched pyrosome species, which widely distributed in the oceanic regions between latitudes 50° N and 50° S. In this study, we performed an integrated analysis utilizing PacBio Iso-Seq and short-read RNA-seq to generate high-quality and full-length transcript data from P. atlanticum. Based on these full-length transcripts, systematic gene functional annotation was then performed. The full-length transcriptome and gene expression profiles obtained in this study is the first dataset of transcriptome in class Thaliacea, which can be used as a reference transcriptome for further genetic analysis. Based on these full-length transcripts, systematic gene functional annotation was then performed. The resulting full-length transcriptome and gene expression profiles constitute the inaugural transcriptomic dataset for the class Thaliacea, serving as a reference for subsequent genetic analyses. Furthermore, this study offers valuable transcriptomic insights for ongoing research into chordate evolution.

Methods

Sample collection & nucleic acid extraction

The pyrosome colony, P. atlanticum, used in this study, originated from surface water in the Philippines Basin, located in the Western Pacific Ocean, during September 2018 (Fig. 3). Following several sterile seawater rinses to eliminate contamination, the sample underwent dehydration with sterile filter paper, was pre-frozen in liquid nitrogen, and then stored at -80 °C until RNA extraction. Given the small size and high water content of single zooid of pyrosome, a mixed sample of one entire zooid separated from the colony was utilized for total RNA isolation using an RNA extraction kit (Takara). The RNA 6000 nano kit and the Agilent 2100 Bioanalyzer system (Agilent Technologies) were employed to evaluate RNA quality.

Fig. 3
figure 3

Sampling locality of pyrosomes. Black diamond represents sampling locality.

PacBio library preparation and sequencing

Full-length cDNA was synthesized from the purified total RNA using the SMARTer PCR cDNA Synthesis Kit (Clontech, Mountain View), and large-scale PCR was performed to generate double-stranded cDNA templates. Then, the PacBio Template Prep Kit (PacBio) was used to generate the SMRTbell libraries. Subsequently, the Pacific Sequel platform was used for SMRT (Single Molecule Real-Time) sequencing.

Illumina library preparation & sequencing

The poly (A) mRNA was enriched with magnetic beads containing oligo (dT) to process RNA. The first strand of cDNA was synthesized using the mRNA fragment interrupted by interrupting agent as a template, the second strand synthesis reaction system is then performed to synthesize the second strand cDNA. The second strand of cDNA was purified with Ampure XP beads, and the sticky ends were repaired. The “A” base was added to the 3′ end of the cDNA, and PCR amplification was carried out with manufacturer’s instructions. Then the ABI StepOnePlus real-time PCR system was used to inspect the constructed library. The libraries were sequenced on an Illumina HiSeq X 10 platform. The experiments were carried out using three replicates.

Raw data processing

The bioinformatic pipeline from raw data to full-length transcriptome was shown in Fig. 4. Briefly, the raw subreads were analyzed following the Iso-Seq. 3 pipeline (https://github.com/PacificBiosciences/IsoSeq). This pipeline included three key steps: generation of polished Circular Consensus Sequence (CCS) reads using CCS v6.2.0 with a minimum quality score of 0.9 (-min-rq 0.9), classification of full-length (FL) reads based on the presence of both 5’ and 3’ cDNA primers and a poly(A) tail, and clustering of FL reads. Lima v2.1.0 and isoseq. 3 refine were used for primer removal and poly(A) tail excision, respectively. The Iterative Clustering and Error correction (ICE) algorithm was applied to obtain high-quality FL consensus sequences, classified by post-correction accuracy above 99%. A total of 20.61 G data in 1,050,025,012 bp of 433,244 CCSs were generated from the PacBio library by using SMRT cell (Table 1, Fig. 5a, b). Through the detection, 380,278 full-length non-chemiric (FLNC) transcripts were identified (Table 2). The FLNC reads with high similarity were clustered to form a consensus sequence by ICE algorithm, and 21,278 consensus isoforms with a mean length 2,552 bp in combination with non-full-length sequences were generated. The resulting sequences were corrected with the Quiver program, then 21,024 high-quality (HQ, accuracy ratio >99%) isoforms and 250 low-quality (LQ) isoforms were generated (Table 3, Fig. 5c). The LQ isoforms were corrected by Illumina sequencing data with Proovread software16. The reads combined of HQ and corrected LQ isoforms were clustered by CD-HIT17,18. Finally, a total of 12,996 full-length transcripts were obtained by eliminating redundancy (Fig. 5d). The Benchmarking Universal Single-copy Orthologs (BUSCO) with OrthoDB dataset was used as a reference to assess the integrity and accuracy of the transcriptome19. The results indicated that the integrity and accuracy of the transcriptome were reliable (Fig. 6).

Fig. 4
figure 4

Data processing and quantity control of full-length transcriptome. Final full-length sequences were obtained by jointly processing third-generation and second-generation data.

Table 1 Statistics of CCSs.
Fig. 5
figure 5

Results of PacBio SMRT sequencing. (a) Distribution of CCS read lengths. X-axis: read lengths, left y-axis: frequency distribution indicated by blue bars, right y-axis: cumulative frequency distribution curve represented by black line. (b) Distribution of number of CCSs full pass. X-axis: number of passes, y-axis: counts of sequences per respective number of full passes. (c) Distribution of FLNC read lengths. X-axis: read lengths, left y-axis: frequency distribution (blue bars), right y-axis: cumulative frequency distribution curve (black line). (d) Distribution of consensus isoform lengths. X-axis: read lengths, left y-axis: frequency distribution (blue bars), right y-axis: cumulative frequency distribution curve (black line).

Table 2 Statistics of full-length sequences.
Table 3 Cluster of consensus isoforms.
Fig. 6
figure 6

BUSCO integrity assessment of transcriptome. It shows 63.5% complete BUSCOs (blue), subdivided into 49.6% single-copy (lighter blue) and 13.9% duplicated (darker blue). Fragmented BUSCOs accounted for 3.7% (yellow), and 32.8% were missing (red). n: number of single-copy orthologs of homologous species.

Predictions of simple sequence repeat (SSR), coding sequence (CDS), transcription factor (TF) & long non-coding RNA (lncRNA)

The SSRs of the transcriptome were identified with MISA (http://pgrc.ipk-gatersleben.de/misa/). TransDecoder (https://github.com/TransDecoder/TransDecoder/releases) in ‘longOrfs’ mode identified candidate coding regions within transcripts by locating minimum length open reading frames (ORFs), computing log-likelihood scores, and verifying these scores were highest in the first reading frame. Additionally, it allowed verification of putative peptides matching a Protein Family (Pfam)20 ___domain above the noise cutoff score. Furthermore, we utilized AnimalTFDB21 (https://guolab.wchscu.cn/AnimalTFDB4_Document) to predict DNA-binding domains in the protein sequences. The database provides a curated collection of Hidden Markov Models (HMMs) for various transcription factors. We extracted relevant HMM profiles from AnimalTFDB for known DNA-binding domains. These HMMs were used to scan the protein sequences using the HMMER 3.1b2 package. The hmmsearch tool was used to compare each protein sequence against the HMM profiles, with hits considered significant if their E-values were below 1e-4. The identified domains were further validated by cross-referencing with existing annotations in public databases such as UniProt and Pfam. Four computational tools including Coding Potential Calculator (CPC)22, Coding-Non-Coding Index (CNCI)23, Coding Potential Assessment Tool (CPAT)24, and Pfam20 were employed to differentiate non-protein-coding RNA candidates from potential protein-coding RNAs. Transcripts were filtered out based on a minimum length of 50 nucleotides and a default exon count threshold25. Long non-coding RNA (lncRNA) candidates were defined as those exceeding 200 nucleotides in length and containing more than two exons. These candidates were further analyzed using CPC, CNCI, CPAT, and Pfam, which effectively distinguished between protein-coding and non-coding genes.

A total of 6,820 SSR candidates were obtained from the 12,788 transcripts (å 500 bp) (Table 4, Fig. 7a, b). By TransDecoder, a total of 11,561 CDSs containing the start and stop codons which were defined as complete open reading frames (ORFs), were obtained (Fig. 7c). Based on the animalTFDB database, a total of 545 TFs were predicted belonging to 47 TF families (Fig. 7d). Using the CPC, CNCI, CPAT and Pfam databases, a total of 549 lncRNAs were predicted (Fig. 8a).

Table 4 Statistics of SSRs.
Fig. 7
figure 7

Predictions of SSR, CDS, TF & lncRNA. (a) SSRs coverage in whole transcriptome of P. atlanticum. X-axis: SSR types, y-axis: cumulative number of bases occupied by specific SSR types in corresponding transcript. (b) Scattergram of distribution of SSR types. X-axis: SSR types, y-axis: number of SSR repeats, z-axis: sequential number of SSRs located on same transcript. c: compound SSRs (containing at least two perfect SSRs with a distance less than 100 bp), c*: interrupted SSRs (with shared nucleotide(s) between two SSRs), p1: perfect mononucleotide repeats, p2: perfect dinucleotide repeats, p3: perfect trinucleotide repeats, p4: perfect tetranucleotide repeats, p5: perfect pentanucleotide repeats, p6: perfect hexanucleotide repeats. (c) Distribution of predicted CDS lengths. X-axis: lengths of CDSs, y-axis: counts of each CDSs of different lengths. (d) Distribution of top 20 TF types. X-axis: transcription factor families, y-axis: counts of each transcription factor family.

Fig. 8
figure 8

Venn diagrams of LncRNA prediction & gene annotation. (a) LncRNAs predicted by CNCI, CPC, Pfam and CPAT. (b) Gene annotations from Pfam, COG, GO and KEGG. Each section distinguished by different colors, represents number of transcripts identified by each tool and their intersections.

Gene functional annotation

Gene function was annotated based on the following databases: NCBI Non-Redundant Protein Sequence Database (NR)26, Pfam23, Clusters of euKaryotic Ortholog Groups (KOG)27, Clusters of Orthologous Groups (COG)28, evolutionary genealogy of genes: Non-supervised Orthologous Groups (eggNOG)29, A manually annotated and reviewed protein sequence database (Swiss-Prot)30, Kyoto Encyclopedia of Genes and Genomes (KEGG)31 and Gene Ontology (GO)32. The BLAST program was used with an E-value <10−333. As a result, a total of 10,874 transcripts were annotated (Table 5). Moreover, a total of 6,728 transcripts were simultaneously annotated by the Pfam, COG, GO and KEGG databases (Fig. 8b).

Table 5 Statistics of annotated transcripts.

There were 10,544 matched transcripts after aligned all transcripts in the NR database (Fig. 9a). The top six species with the most annotated homologous sequences in the NR database were Ciona intestinalis Linnaeus, 176734 (5,350, 50.69%), Phallusia mammillata Cuvier, 181535 (4,001, 37.91%), Branchiostoma belcheri Gray, 184736 (84, 0.80%), Dendronephthya gigantea Verrill, 186437 (45, 0.43%), Lingula anatina Lamarck, 180138 (30, 0.28%), and Saccoglossus kowalevskii Agassiz, 187339 (30, 0.28%).

Fig. 9
figure 9

Functional annotation of transcripts in pyrosome. (a) Homologous species distribution in NR annotation. Pie chart shows species composition, with each segment representing proportion of sequences derived from different species. (b) COG annotation of transcripts. X-axis: COG functional categories labeled A to X corresponding to description of legends on right, y-axis: counts of transcripts assigned to each functional category. (c) GO annotation of top 12 groups of transcripts for each category. X-axis: GO functional categories, grouped into Biological Process (ochre), Cellular Component (purple), and Molecular Function (green), y-axis: counts of transcripts associated with each GO category. (d) Results of top 11 annotated KEGG pathways. X-axis: KEGG pathways, y-axis: counts of transcripts associated with each KEGG pathway.

A total of 10,176 transcripts were annotated with 10,960 annotations by the COG database, and divided into 24 categories according to COG function class (Fig. 9b). There were 2,386 and 1,633 transcripts belonging to General function prediction only and Function unknown, accounting for 21.77% and 14.90%, respectively. In addition, there were transcripts belonging to other categories, including Posttranslational modification, protein turnover, chaperones (1,171, 10.68%), Intracellular trafficking, secretion, and vesicular transport (601, 5.48%), Transcription (596, 5.44%) and RNA processing and modification (582, 5.31%).

There were 8,063 transcripts were annotated with 1,299,420 GO terms by the GO database, covering categories related to Biological Process, Cellular Component, and Molecular Function (Fig. 9c). In the Biological Process category, the highly represented groups included Cellular process (7,084), and Biological regulation (5,741), with further enrichment in subcategories such as Metabolic process (5,287), Regulation of biological process (5,352), and Organic substance metabolic process (5,115). In the Cellular Component category, the top groups were Intracellular anatomical structure (7,191), and Organelle (6,629), with specific enrichment in subcategories like Cytoplasm (6,296), Intracellular organelle (6,531), and Membrane-bonded organelle (6,089). For Molecular Function, the main groups included Binding (5,447), and Protein binding (4,165), with key subcategories like Catalytic activity (3,269), Organic cyclic compound binding (2,134), and Heterocyclic compound binding (2,097).

In total, there were 7,698 transcripts matched in the KEGG database with 10,893 KO terms, and 5,019 transcripts matched in 22,914 pathways. These transcripts were participated in 6 categories, including Cellular Processes, Organismal System, Metabolism, Genetic Information Processing, Environmental Information Processing and Human Diseases. Overall, the top ten pathways of the matched transcripts were Metabolic pathways (1,174), Biosynthesis of secondary metabolites (388), Pathways in cancer (377), Protein processing in endoplasmic reticulum (269), Phagosome (265), Human papillomavirus infection (237), Microbial metabolism in diverse environments (230), Spliceosome (222), PI3K-Akt signaling pathway (214) and Thermogenesis (209) (Fig. 9d).

Data Records

All raw reads of full-length Iso-seq were deposited in Sequence Read Archive (SRA) of the National Center for Biotechnology Information (NCBI) under accession number SRR2642702440. All raw reads of RNA-seq were deposited in the SRA of NCBI under accession numbers SRR2642702841, SRR2642702942 and SRR2642703043. Moreover, the sequences files produced in bioinformatic pipeline from raw data to full-length transcriptome (CCSs, FLNC reads, consensus isoforms, HQ isoforms, LQ isoforms, transcripts, SSRs, lncRNAs, CDSs, CDSs (mRNA) and TFs), the prediction results of SSRs, CDSs, TFs, lncRNAs, the annotations of genes by eggNOG, COG, GO, KEGG, Pfam, KOG, NR and Swissprot, and the FPKM (Fragments Per Kilobase of transcript per Million mapped reads) values of three replicates generated by Illumine sequencing in the pyrosome were deposited in Figshare44.

Technical Validation

Three single zooids separated from the colony for three replicates were sequenced on an Illumina HiSeq X 10, respectively. The quality assessments of the Illumina clean reads were evaluated with FastQC, and the Q20, Q30 and GC content metrics were obtained, respectively (Table 6). Then, the Pearson correlation analysis of the FPKM between three replicates was carried out (Fig. 10), and the reproducibility was determined.

Table 6 Statistics of evaluation of sequencing data.
Fig. 10
figure 10

Pearson correlation analysis between samples according to FPKM values from RNA-seq. Numbers and color codes in grid indicates the Pearson coefficient of determination (R2) between samples. Scales ranging from 0.97 (pinkish-purple) to 1 (wathet).