Abstract
Pyrosomes represent a group of marine holozooplankton. They do not exist as individual entities but as intricate colonies comprising numerous zooids, well-known for their bioluminescent properties. Pyrosomes inhabit most oceanic waters spanning tropical to temperate regions. They serve as significant consumers of phytoplankton, playing a crucial role in transport of organic carbon and sedimentation of organic matter debris. Nonetheless, the knowledge about this group remains limited, particularly concerning genomic and gene aspects. To further investigate the pyrosome at the molecular level, we used Pyrosoma atlanticum, which is the best known of pyrosome, for investigation in this study. We performed PacBio Iso-Seq and Illumina RNA-seq to generate high-quality and full-length transcript data from P. atlanticum. The systematic gene functional annotation was performed by integrated data. The full-length transcriptome produced in this study represents the inaugural dataset of transcriptome within the class Thaliacea, serving as a reference for future investigations. Additionally, as a stem group for chordates, the information of pyrosomes can provide a valuable research foundation for the evolution of vertebrates.
Similar content being viewed by others
Background
Pyrosomes represent a category of marine holoplankton, characterized by a structure termed “single body,” which is not an individual organism but rather an aggregation of several to even millions of minuscule entities known as “zooids” (Fig. 1). These cylindrical colonies exhibit a length that can vary significantly, ranging from a few millimeters to several meters, contingent on the specific species and stages of development.
Pyrosomes are classified under the order Pyrosomatida Jones, 18481, class Thaliacea, Subphylum Tunicata Balfour, 18812, Phylum Chordata Haeckel, 18743. Generally, bilaterians divide into two primary categories, protostomes and deuterostomes4. Chordates, encompassing tunicates, cephalochordates and vertebrates, constitute a clade distinguished unambiguously within deuterostomes, with cephalochordates being the earliest to diverge5,6,7. Tunicates comprise over 3,000 recognized species and form sister groups with vertebrates (Fig. 2).
The etymology of pyrosome is from the Ancient Greek pyro (fire) and soma (body), which means the body can emit light (fire), since it has the characteristic of bioluminescence. There were light organs on both sides of the anterior end of the branchial basket at the inner edge of the oral siphon of the zooids (Fig. 2c), and zooids can respond to external stimuli and other light sources, leading to bioluminescence.
Although pyrosomes inhabit various tropical to temperate waters, knowledge of their distribution remains limited due to the uncertainty of their probability in samples of plankton. To date, only a few species have been confirmed, and the biological records of this group are sparse8. There are only 8 pyrosomes species have been confirmed within 1 family (Pyrosomatidae), 2 subfamilies (Pyrosomatinae Lahille, 18889 and Pyrostremmatinae van Soest, 197910) and 3 genera (Pyrosoma Péron, 180411, Pyrosomella van Soest, 197910 and Pyrostremma Garstang, 192912). Research on this group remains limited, with taxonomy primarily based on brief morphological details. Moreover, the taxonomy of pyrosomes continues to be a topic of debate. Pyrosomes serve as important consumers of phytoplankton, exhibiting the highest feeding rate of zooplankton to phytoplankton, and the colonies utilize food intake to promote high biomass turnover and material circulation13,14. Due to the high consumption rate, these organisms generate considerable fecal particles, which contribute notably to the sedimentation of organic matter debris, often referred to as Marine Snow. Furthermore, they provide a vital food source for higher trophic levels within the marine food web. Consequently, pyrosomes are essential for the transport and cycling of organic carbon in marine ecosystems15.
Pyrosoma atlanticum Péron, 180411 is the most extensively researched pyrosome species, which widely distributed in the oceanic regions between latitudes 50° N and 50° S. In this study, we performed an integrated analysis utilizing PacBio Iso-Seq and short-read RNA-seq to generate high-quality and full-length transcript data from P. atlanticum. Based on these full-length transcripts, systematic gene functional annotation was then performed. The full-length transcriptome and gene expression profiles obtained in this study is the first dataset of transcriptome in class Thaliacea, which can be used as a reference transcriptome for further genetic analysis. Based on these full-length transcripts, systematic gene functional annotation was then performed. The resulting full-length transcriptome and gene expression profiles constitute the inaugural transcriptomic dataset for the class Thaliacea, serving as a reference for subsequent genetic analyses. Furthermore, this study offers valuable transcriptomic insights for ongoing research into chordate evolution.
Methods
Sample collection & nucleic acid extraction
The pyrosome colony, P. atlanticum, used in this study, originated from surface water in the Philippines Basin, located in the Western Pacific Ocean, during September 2018 (Fig. 3). Following several sterile seawater rinses to eliminate contamination, the sample underwent dehydration with sterile filter paper, was pre-frozen in liquid nitrogen, and then stored at -80 °C until RNA extraction. Given the small size and high water content of single zooid of pyrosome, a mixed sample of one entire zooid separated from the colony was utilized for total RNA isolation using an RNA extraction kit (Takara). The RNA 6000 nano kit and the Agilent 2100 Bioanalyzer system (Agilent Technologies) were employed to evaluate RNA quality.
PacBio library preparation and sequencing
Full-length cDNA was synthesized from the purified total RNA using the SMARTer PCR cDNA Synthesis Kit (Clontech, Mountain View), and large-scale PCR was performed to generate double-stranded cDNA templates. Then, the PacBio Template Prep Kit (PacBio) was used to generate the SMRTbell libraries. Subsequently, the Pacific Sequel platform was used for SMRT (Single Molecule Real-Time) sequencing.
Illumina library preparation & sequencing
The poly (A) mRNA was enriched with magnetic beads containing oligo (dT) to process RNA. The first strand of cDNA was synthesized using the mRNA fragment interrupted by interrupting agent as a template, the second strand synthesis reaction system is then performed to synthesize the second strand cDNA. The second strand of cDNA was purified with Ampure XP beads, and the sticky ends were repaired. The “A” base was added to the 3′ end of the cDNA, and PCR amplification was carried out with manufacturer’s instructions. Then the ABI StepOnePlus real-time PCR system was used to inspect the constructed library. The libraries were sequenced on an Illumina HiSeq X 10 platform. The experiments were carried out using three replicates.
Raw data processing
The bioinformatic pipeline from raw data to full-length transcriptome was shown in Fig. 4. Briefly, the raw subreads were analyzed following the Iso-Seq. 3 pipeline (https://github.com/PacificBiosciences/IsoSeq). This pipeline included three key steps: generation of polished Circular Consensus Sequence (CCS) reads using CCS v6.2.0 with a minimum quality score of 0.9 (-min-rq 0.9), classification of full-length (FL) reads based on the presence of both 5’ and 3’ cDNA primers and a poly(A) tail, and clustering of FL reads. Lima v2.1.0 and isoseq. 3 refine were used for primer removal and poly(A) tail excision, respectively. The Iterative Clustering and Error correction (ICE) algorithm was applied to obtain high-quality FL consensus sequences, classified by post-correction accuracy above 99%. A total of 20.61 G data in 1,050,025,012 bp of 433,244 CCSs were generated from the PacBio library by using SMRT cell (Table 1, Fig. 5a, b). Through the detection, 380,278 full-length non-chemiric (FLNC) transcripts were identified (Table 2). The FLNC reads with high similarity were clustered to form a consensus sequence by ICE algorithm, and 21,278 consensus isoforms with a mean length 2,552 bp in combination with non-full-length sequences were generated. The resulting sequences were corrected with the Quiver program, then 21,024 high-quality (HQ, accuracy ratio >99%) isoforms and 250 low-quality (LQ) isoforms were generated (Table 3, Fig. 5c). The LQ isoforms were corrected by Illumina sequencing data with Proovread software16. The reads combined of HQ and corrected LQ isoforms were clustered by CD-HIT17,18. Finally, a total of 12,996 full-length transcripts were obtained by eliminating redundancy (Fig. 5d). The Benchmarking Universal Single-copy Orthologs (BUSCO) with OrthoDB dataset was used as a reference to assess the integrity and accuracy of the transcriptome19. The results indicated that the integrity and accuracy of the transcriptome were reliable (Fig. 6).
Results of PacBio SMRT sequencing. (a) Distribution of CCS read lengths. X-axis: read lengths, left y-axis: frequency distribution indicated by blue bars, right y-axis: cumulative frequency distribution curve represented by black line. (b) Distribution of number of CCSs full pass. X-axis: number of passes, y-axis: counts of sequences per respective number of full passes. (c) Distribution of FLNC read lengths. X-axis: read lengths, left y-axis: frequency distribution (blue bars), right y-axis: cumulative frequency distribution curve (black line). (d) Distribution of consensus isoform lengths. X-axis: read lengths, left y-axis: frequency distribution (blue bars), right y-axis: cumulative frequency distribution curve (black line).
BUSCO integrity assessment of transcriptome. It shows 63.5% complete BUSCOs (blue), subdivided into 49.6% single-copy (lighter blue) and 13.9% duplicated (darker blue). Fragmented BUSCOs accounted for 3.7% (yellow), and 32.8% were missing (red). n: number of single-copy orthologs of homologous species.
Predictions of simple sequence repeat (SSR), coding sequence (CDS), transcription factor (TF) & long non-coding RNA (lncRNA)
The SSRs of the transcriptome were identified with MISA (http://pgrc.ipk-gatersleben.de/misa/). TransDecoder (https://github.com/TransDecoder/TransDecoder/releases) in ‘longOrfs’ mode identified candidate coding regions within transcripts by locating minimum length open reading frames (ORFs), computing log-likelihood scores, and verifying these scores were highest in the first reading frame. Additionally, it allowed verification of putative peptides matching a Protein Family (Pfam)20 ___domain above the noise cutoff score. Furthermore, we utilized AnimalTFDB21 (https://guolab.wchscu.cn/AnimalTFDB4_Document) to predict DNA-binding domains in the protein sequences. The database provides a curated collection of Hidden Markov Models (HMMs) for various transcription factors. We extracted relevant HMM profiles from AnimalTFDB for known DNA-binding domains. These HMMs were used to scan the protein sequences using the HMMER 3.1b2 package. The hmmsearch tool was used to compare each protein sequence against the HMM profiles, with hits considered significant if their E-values were below 1e-4. The identified domains were further validated by cross-referencing with existing annotations in public databases such as UniProt and Pfam. Four computational tools including Coding Potential Calculator (CPC)22, Coding-Non-Coding Index (CNCI)23, Coding Potential Assessment Tool (CPAT)24, and Pfam20 were employed to differentiate non-protein-coding RNA candidates from potential protein-coding RNAs. Transcripts were filtered out based on a minimum length of 50 nucleotides and a default exon count threshold25. Long non-coding RNA (lncRNA) candidates were defined as those exceeding 200 nucleotides in length and containing more than two exons. These candidates were further analyzed using CPC, CNCI, CPAT, and Pfam, which effectively distinguished between protein-coding and non-coding genes.
A total of 6,820 SSR candidates were obtained from the 12,788 transcripts (å 500 bp) (Table 4, Fig. 7a, b). By TransDecoder, a total of 11,561 CDSs containing the start and stop codons which were defined as complete open reading frames (ORFs), were obtained (Fig. 7c). Based on the animalTFDB database, a total of 545 TFs were predicted belonging to 47 TF families (Fig. 7d). Using the CPC, CNCI, CPAT and Pfam databases, a total of 549 lncRNAs were predicted (Fig. 8a).
Predictions of SSR, CDS, TF & lncRNA. (a) SSRs coverage in whole transcriptome of P. atlanticum. X-axis: SSR types, y-axis: cumulative number of bases occupied by specific SSR types in corresponding transcript. (b) Scattergram of distribution of SSR types. X-axis: SSR types, y-axis: number of SSR repeats, z-axis: sequential number of SSRs located on same transcript. c: compound SSRs (containing at least two perfect SSRs with a distance less than 100 bp), c*: interrupted SSRs (with shared nucleotide(s) between two SSRs), p1: perfect mononucleotide repeats, p2: perfect dinucleotide repeats, p3: perfect trinucleotide repeats, p4: perfect tetranucleotide repeats, p5: perfect pentanucleotide repeats, p6: perfect hexanucleotide repeats. (c) Distribution of predicted CDS lengths. X-axis: lengths of CDSs, y-axis: counts of each CDSs of different lengths. (d) Distribution of top 20 TF types. X-axis: transcription factor families, y-axis: counts of each transcription factor family.
Gene functional annotation
Gene function was annotated based on the following databases: NCBI Non-Redundant Protein Sequence Database (NR)26, Pfam23, Clusters of euKaryotic Ortholog Groups (KOG)27, Clusters of Orthologous Groups (COG)28, evolutionary genealogy of genes: Non-supervised Orthologous Groups (eggNOG)29, A manually annotated and reviewed protein sequence database (Swiss-Prot)30, Kyoto Encyclopedia of Genes and Genomes (KEGG)31 and Gene Ontology (GO)32. The BLAST program was used with an E-value <10−333. As a result, a total of 10,874 transcripts were annotated (Table 5). Moreover, a total of 6,728 transcripts were simultaneously annotated by the Pfam, COG, GO and KEGG databases (Fig. 8b).
There were 10,544 matched transcripts after aligned all transcripts in the NR database (Fig. 9a). The top six species with the most annotated homologous sequences in the NR database were Ciona intestinalis Linnaeus, 176734 (5,350, 50.69%), Phallusia mammillata Cuvier, 181535 (4,001, 37.91%), Branchiostoma belcheri Gray, 184736 (84, 0.80%), Dendronephthya gigantea Verrill, 186437 (45, 0.43%), Lingula anatina Lamarck, 180138 (30, 0.28%), and Saccoglossus kowalevskii Agassiz, 187339 (30, 0.28%).
Functional annotation of transcripts in pyrosome. (a) Homologous species distribution in NR annotation. Pie chart shows species composition, with each segment representing proportion of sequences derived from different species. (b) COG annotation of transcripts. X-axis: COG functional categories labeled A to X corresponding to description of legends on right, y-axis: counts of transcripts assigned to each functional category. (c) GO annotation of top 12 groups of transcripts for each category. X-axis: GO functional categories, grouped into Biological Process (ochre), Cellular Component (purple), and Molecular Function (green), y-axis: counts of transcripts associated with each GO category. (d) Results of top 11 annotated KEGG pathways. X-axis: KEGG pathways, y-axis: counts of transcripts associated with each KEGG pathway.
A total of 10,176 transcripts were annotated with 10,960 annotations by the COG database, and divided into 24 categories according to COG function class (Fig. 9b). There were 2,386 and 1,633 transcripts belonging to General function prediction only and Function unknown, accounting for 21.77% and 14.90%, respectively. In addition, there were transcripts belonging to other categories, including Posttranslational modification, protein turnover, chaperones (1,171, 10.68%), Intracellular trafficking, secretion, and vesicular transport (601, 5.48%), Transcription (596, 5.44%) and RNA processing and modification (582, 5.31%).
There were 8,063 transcripts were annotated with 1,299,420 GO terms by the GO database, covering categories related to Biological Process, Cellular Component, and Molecular Function (Fig. 9c). In the Biological Process category, the highly represented groups included Cellular process (7,084), and Biological regulation (5,741), with further enrichment in subcategories such as Metabolic process (5,287), Regulation of biological process (5,352), and Organic substance metabolic process (5,115). In the Cellular Component category, the top groups were Intracellular anatomical structure (7,191), and Organelle (6,629), with specific enrichment in subcategories like Cytoplasm (6,296), Intracellular organelle (6,531), and Membrane-bonded organelle (6,089). For Molecular Function, the main groups included Binding (5,447), and Protein binding (4,165), with key subcategories like Catalytic activity (3,269), Organic cyclic compound binding (2,134), and Heterocyclic compound binding (2,097).
In total, there were 7,698 transcripts matched in the KEGG database with 10,893 KO terms, and 5,019 transcripts matched in 22,914 pathways. These transcripts were participated in 6 categories, including Cellular Processes, Organismal System, Metabolism, Genetic Information Processing, Environmental Information Processing and Human Diseases. Overall, the top ten pathways of the matched transcripts were Metabolic pathways (1,174), Biosynthesis of secondary metabolites (388), Pathways in cancer (377), Protein processing in endoplasmic reticulum (269), Phagosome (265), Human papillomavirus infection (237), Microbial metabolism in diverse environments (230), Spliceosome (222), PI3K-Akt signaling pathway (214) and Thermogenesis (209) (Fig. 9d).
Data Records
All raw reads of full-length Iso-seq were deposited in Sequence Read Archive (SRA) of the National Center for Biotechnology Information (NCBI) under accession number SRR2642702440. All raw reads of RNA-seq were deposited in the SRA of NCBI under accession numbers SRR2642702841, SRR2642702942 and SRR2642703043. Moreover, the sequences files produced in bioinformatic pipeline from raw data to full-length transcriptome (CCSs, FLNC reads, consensus isoforms, HQ isoforms, LQ isoforms, transcripts, SSRs, lncRNAs, CDSs, CDSs (mRNA) and TFs), the prediction results of SSRs, CDSs, TFs, lncRNAs, the annotations of genes by eggNOG, COG, GO, KEGG, Pfam, KOG, NR and Swissprot, and the FPKM (Fragments Per Kilobase of transcript per Million mapped reads) values of three replicates generated by Illumine sequencing in the pyrosome were deposited in Figshare44.
Technical Validation
Three single zooids separated from the colony for three replicates were sequenced on an Illumina HiSeq X 10, respectively. The quality assessments of the Illumina clean reads were evaluated with FastQC, and the Q20, Q30 and GC content metrics were obtained, respectively (Table 6). Then, the Pearson correlation analysis of the FPKM between three replicates was carried out (Fig. 10), and the reproducibility was determined.
Code availability
Parameters to software tools involved are described below:
CCS: version 6.2.0, parameters: minimum quality score of 0.9 (-min-rq 0.9), classification of full-length (FL) reads based on the presence of both 5’ and 3’ cDNA primers and a poly(A) tail, and clustering of FL reads.
Lima: version 2.1.0.
isoseq. 3 refine: versions 3.4.0, parameters: poly(A) tail excision (-require-polya).
CD-HIT: version 4.8.1, parameters: setting the sequence identity threshold to 99% (-c 0.99).
Busco: version 5.7.1. parameters: setting metazoan as lineage dataset.
BLAST: version 2.2.26, parameters: limits the number of high-scoring pairs (hsps) to two per query-subject pair (-max_hsps 2), only considers alignments with at least 95% sequence similarity (-perc_identity 95), requires each hsp to completely cover 100% of the query sequence (-qcov_hsp_perc 100).
MISA: https://webblast.ipk-gatersleben.de/misa/, parameters: filter transcripts longer than 500 bp.
TransDecoder: https://github.com/TransDecoder/TransDecoder/releases, parameters: identifying candidate coding regions within transcripts by locating minimum length open reading frames (ORFs), computing log-likelihood scores, and verifying these scores were highest in the first reading frame. Additionally, it allowed verification of putative peptides matching a Pfam ___domain above the noise cutoff score.
AnimalTFDB: https://guolab.wchscu.cn/AnimalTFDB4_Document.
HMMER: version 3.1b2, parameters: setting the E-value threshold for reporting ___domain hits (-domE 1e-4).
FastQC: version 0.11.9.
Trimmomatic: version 0.39.
STAR: version 2.7.3a.
Samtools: version 0.1.18.
RSEM: version 1.3.3, parameters: using FPKM as the metric.
Databases for LncRNA: CPC (http://cpc2.gao-lab.org/), CNCI (https://github.com/www-bioinfo-org/CNCI), CPAT (http://rna-cpat.sourceforge.net/), Pfam (http://pfam.xfam.org/).
Databases for Gene Functional Annotation: NR (https://www.ncbi.nlm.nih.gov/), Pfam (http://pfam.xfam.org/), Clusters of Orthologous Groups of proteins (KOG/COG/eggnog, http://eggnogdb.embl.de/), Swiss-Prot (https://www.uniprot.org/), KEGG (https://www.genome.jp/kegg/), and GO (http://geneontology.org/).
If no explanation was provided for the software, there was no custom code used to generate or process the data in the manuscript.
References
Jones, T. R. Tunicata. In: Todd’s Cyclopaedia of Anatomy and Physiology. Sherwood, Gilbert, and Piper 4, 1185-1243 (1848).
Balfour, F. M. A treatise on comparative embryology. London, MacMillan 2, 1-655 i-xxi (1881).
Haeckel, E. Die gastraea theorie, die phylogenetische classification des thierreichs und die homologie der keimblatter. Jenaische Zeitschrift fur Naturwissenschaft 8, 1–55 (1874).
Grobben, K. Die systematische Einteilung des Tierreiches. Ver. Zool. Bot. Ges. Wien 58, 491–511 (1908).
Wada, H. & Satoh, N. Details of the evolutionary history from invertebrates to vertebrates, as deduced from the sequences of 18S rDNA. Proc. Natl Acad. Sci. USA 91, 1801–1804 (1994).
Perseke, M., Golombek, A., Schlegel, M. & Struck, T. H. The impact of mitochondrial genome analyses on the understanding of deuterostome phylogeny. Mol. Phylogenet. Evol. 66, 898–905 (2013).
Jefferies, R. P., Brown, N. A. & Daley, P. E. The early phylogeny of chordates and echinoderms and the origin of chordate left–right asymmetry and bilateral symmetry. Acta Zoologica 77, 101–122 (1996).
Van Soest, R. W. M. A monograph of the order Pyrosomatida (Tunicata, Thaliacea). J. Plankton Res. 3, 603–631 (1981).
Lahille, M. F. Étude systématique des tuniciers. Compt Rendus Assoc. France Adv. Sci. 16, 667–677 (1888).
Van Soest, R. W. M. Revised classification of the order Pyrosomatida (Tunicata, Thaliacea), with the description of a new genus. Steenstrupia 5, 197–217 (1979).
Péron, F. Mémoire sur le nouveau genre Pyrosoma. Ann. Mus. Hist. Natur. Paris 4, 437–446 (1804).
Garstang, W. Tunicates. Encyclopedia Britannica 14, 549–555 (1929).
Perissinotto, R. et al. Grazing by Pyrosoma atlanticum (Tunicata Thaliacea) in the south Indian Ocean. Mar. Ecol. Progr. Ser. 330, 1–11 (2007).
Andersen, V., Sardou, J. & Nival, P. The diel migrations and vertical distributions of zooplankton and micronekton in the Northwestern Mediterranean Sea. 2. Siphonophores, hydromedusae and pyrosomids. J. Plankton Res. 14, 1155–1169 (1992).
Esnal, G. Pyrosomatida. In: D. Boltovskoy (ed.). South Atlantic Zooplankton. Leiden, Backhuys Publishers 1423-1444 (1999).
Hackl, T., Hedrich, R., Schultz, J. & Förster, F. Proovread: large-scale high-accuracy PacBio correction through iterative short read consensus. Bioinformatics 30, 3004–3011 (2014).
Li, W. & Godzik, A. CD-HIT: a fast program for clustering and comparing large sets of protein or nucleotide sequences. Bioinformatics 22, 1658–1659 (2006).
Fu, L., Niu, B., Zhu, Z., Wu, S. & Li, W. CD-HIT: accelerated for clustering the next-generation sequencing data. Bioinformatics 28, 3150–3152 (2012).
Mosè, M., Berkeley, M. R., Seppey, M., Simão, F. A. & Zdobnov, E. M. BUSCO Update: Novel and Streamlined Workflows along with Broader and Deeper Phylogenetic Coverage for Scoring of Eukaryotic, Prokaryotic, and Viral Genomes. Molecular Biology and Evolution 38, 4647–4654 (2021).
Finn, R. D. et al. Pfam: the protein families database. Nucleic Acids Research 42, D222–230 (2014).
Shen, W. et al. AnimalTFDB 4.0: a comprehensive animal transcription factor database updated with variation and expression annotations. Nucleic Acids Research 51(Database issue), D39–45 (2023).
Kong, L. et al. CPC: assess the protein-coding potential of transcripts using sequence features and support vector machine. Nucleic acids research 35(suppl 2), W345–W349 (2007).
Sun, L. et al. Utilizing sequence intrinsic composition to classify protein-coding and long non-coding transcripts. Nucleic acids research 41, e166 (2013).
Wang, L. et al. CPAT: Coding-Potential Assessment Tool using an alignment-free logistic regression model. Nucleic acids research 41, e74–e74 (2013).
Yang, H., Xu, D., Zhuo, Z., Hu, J. & Lu, B. SMRT Sequencing of the Full-Length Transcriptome of the Rhynchophorus ferrugineus (Coleoptera: Curculionidae). PeerJ 8, 9133 (2020).
Deng, Y. et al. Integrated NR Database in Protein Annotation System and Its Localization. Computer Engineering 32, 71–74 (2006).
Koonin, E. V. et al. A comprehensive evolutionary classification of proteins encoded in complete eukaryotic genomes. Genome biology 5, R7 (2004).
Tatusov, R. L., Galperin, M. Y. & Natale, D. A. The COG database: a tool for genome scale analysis of protein functions and evolution. Nucleic Acids Research 28, 33–36 (2000).
Huerta-Cepas, J. et al. eggNOG 5.0: a hierarchical, functionally and phylogenetically annotated orthology resource based on 5090 organisms and 2502 viruses. Nucleic acids research 47, D309–D314 (2019).
Apweiler, R. et al. UniProt: The Universal Protein knowledgebase. Nucleic Acids Research 1, D115–9 (2004).
Dobin, A. et al. STAR: ultrafast universal RNA-seq aligner. Bioinformatics 29, 15–21 (2013).
Ashburner, M. et al. Gene ontology: tool for the unification of biology. Nature Genetics 25, 25–29 (2000).
Altschul, S. F. et al. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic acids research 25, 3389–3402 (1997).
Linnaeus, C. Systema naturae per regna tria naturae: secundum classes, ordines, genera, species, cum characteribus, differentiis, synonymis, locis. Ed. 12. 1., Regnum Animale. 1 & 2. Holmiae (Stockholm), Laurentii Salvii. apud JB Delamolliere 1-1327 (1767).
Cuvier, G. Memoire sur les ascidies et sur leur anatomique. Mem Mus Hist Na. Paris 2, 10–39 (1815).
Gray, J. E. A list of the genera of recent Mollusca, their synonyma and types. R. and JE Taylor (1847).
Verrill, A. E. List of the polyps and corals sent by the Museum of Comparative Zoology to other institutions in exchange, with annotations. Bulletin of the Museum of Comparative Zoology 1, 29–60 (1864).
Lamarck, J. B. Système des animaux sans vertèbres, ou tableau général des classes, des ordres et des genres de ces animaux; Présentant leurs caractères essentiels et leur distribution, d’apres la considération de leurs rapports naturels et de leur organisation, et suivant l’arrangement établi dans les galeries du Muséum d’Histoire Naturelle, parmi leurs dépouilles conservées; Précédé du discours d’ouverture du Cours de Zoologie, donné dans le Muséum National d’Histoire Naturelle l’an 8 de la République. Published by the author and Deterville, Paris viii + 432 pp (1801).
Agassiz, A. The history of Balanoglossus and Tornaria. Memoirs of the American Academy of Arts and Sciences 9, 421–436 (1873).
Sequence Read Archive (SRA) database of National Center for Biotechnology Information https://identifiers.org/ncbi/insdc.sra:SRR26427024 (2023).
Sequence Read Archive (SRA) database of National Center for Biotechnology Information https://identifiers.org/ncbi/insdc.sra:SRR26427028 (2023).
Sequence Read Archive (SRA) database of National Center for Biotechnology Information https://identifiers.org/ncbi/insdc.sra:SRR26427029 (2023).
Sequence Read Archive (SRA) database of National Center for Biotechnology Information https://identifiers.org/ncbi/insdc.sra:SRR26427030 (2023).
Xiang, P. et al. Pyrosome Transcriptome Datasets. Figshare https://doi.org/10.6084/m9.figshare.27242205 (2024).
Acknowledgements
This work was supported by the National Key Research and Development Program of China (Grant Nos. 2022YFC3102401 & 2023YFC3604500), National Natural Science Foundation of China (Grant No. 32271165), Natural Science Foundation of Xiamen, China (3502Z20227162), Key Laboratory of Tropical Marine Ecosystem and Bioresource, Ministry of Natural Resources (2021ZD01) and the Interdisciplinary Project in Ocean Research of Tongji University (Grant No. 2022-2-ZD-02). We thank the Biomarker Technologies Co., Ltd for assisting in sequencing. Figure 2 was created using Ocean Data View.
Author information
Authors and Affiliations
Contributions
P.X., M.L. and X.B. conceived and designed the study. P.X. collected the samples. P.X., B.X. and X.B. performed the experiments, B.X., X.B., M.L.,C.Z. and J.L. analysed the data. All authors discussed results. P.X., M.L., X.B. and J.L. wrote the draft manuscript. All authors reviewed the manuscript. C.Z. and M.L. approved the final version to be published. P.X. and X.B. contributed equally to this work and should be considered as co-first authors.
Corresponding authors
Ethics declarations
Competing interests
The authors declare no competing interests.
Additional information
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.
About this article
Cite this article
Xiang, P., Bai, X., Xing, B. et al. Full-length transcriptome annotation of a pyrosome, Pyrosoma atlanticum (Chordata, Thaliacea). Sci Data 11, 1433 (2024). https://doi.org/10.1038/s41597-024-04251-7
Received:
Accepted:
Published:
DOI: https://doi.org/10.1038/s41597-024-04251-7