Full-length transcriptome annotation of a pyrosome, Pyrosoma atlanticum (Chordata, Thaliacea)

Xiang, Peng; Bai, Xuanxuan; Xing, Bingpeng; Li, Jiangtao; Zhang, Chao; Li, Mingyu

doi:10.1038/s41597-024-04251-7

Data Descriptor
Open access
Published: 24 December 2024

Peng Xiang (ORCID: orcid.org/0000-0002-3193-0775)^1,2,na1,
Xuanxuan Bai^1,3,na1,
Bingpeng Xing^2,
Jiangtao Li^4,
Chao Zhang^1 &
Mingyu Li^3,5

Scientific Data volume 11, Article number: 1433 (2024)

Abstract

Pyrosomes are a group of marine holozooplankton. They exist not as solitary individuals but as intricate colonies comprising numerous zooids, and they are well known for their bioluminescence. Pyrosomes inhabit most oceanic waters from tropical to temperate regions. They are significant consumers of phytoplankton and play a crucial role in the transport of organic carbon and the sedimentation of organic debris. Nonetheless, knowledge of this group remains limited, particularly at the genomic and gene levels. To investigate pyrosomes at the molecular level, we studied Pyrosoma atlanticum, the best-known pyrosome species. We performed PacBio Iso-Seq and Illumina RNA-seq to generate high-quality, full-length transcript data from P. atlanticum, and carried out systematic gene functional annotation on the integrated data. The full-length transcriptome produced in this study is the first transcriptome dataset for the class Thaliacea and serves as a reference for future investigations. Additionally, because pyrosomes represent a stem group of chordates, these data provide a valuable foundation for research on vertebrate evolution.

Background

Pyrosomes represent a category of marine holoplankton, characterized by a structure termed “single body,” which is not an individual organism but rather an aggregation of several to even millions of minuscule entities known as “zooids” (Fig. 1). These cylindrical colonies exhibit a length that can vary significantly, ranging from a few millimeters to several meters, contingent on the specific species and stages of development.

Pyrosomes are classified under the order Pyrosomatida Jones, 1848¹, class Thaliacea, subphylum Tunicata Balfour, 1881², phylum Chordata Haeckel, 1874³. Bilaterians are generally divided into two primary groups, protostomes and deuterostomes⁴. Chordates, comprising tunicates, cephalochordates and vertebrates, form a well-defined clade within deuterostomes, with cephalochordates being the earliest to diverge^5,6,7. Tunicates comprise over 3,000 recognized species and form the sister group of vertebrates (Fig. 2).

The name pyrosome derives from the Ancient Greek pyro (fire) and soma (body), referring to the colony's ability to emit light through bioluminescence. Light organs are located on both sides of the anterior end of the branchial basket, at the inner edge of the oral siphon of each zooid (Fig. 2c). Zooids luminesce in response to external stimuli, including other light sources.

Although pyrosomes inhabit tropical to temperate waters, knowledge of their distribution remains limited because their occurrence in plankton samples is unpredictable. To date, only a few species have been confirmed, and biological records of this group are sparse⁸. Only eight pyrosome species have been confirmed, within one family (Pyrosomatidae), two subfamilies (Pyrosomatinae Lahille, 1888⁹ and Pyrostremmatinae van Soest, 1979¹⁰) and three genera (Pyrosoma Péron, 1804¹¹, Pyrosomella van Soest, 1979¹⁰ and Pyrostremma Garstang, 1929¹²). Research on the group remains limited, with taxonomy based primarily on brief morphological descriptions, and the taxonomy of pyrosomes continues to be debated. Pyrosomes are important consumers of phytoplankton, exhibiting the highest feeding rate on phytoplankton among zooplankton, and the colonies use this food intake to sustain high biomass turnover and material cycling^13,14. Because of this high consumption rate, they produce large quantities of fecal pellets, which contribute notably to the sedimentation of organic debris, often referred to as marine snow. They are also a vital food source for higher trophic levels in the marine food web. Consequently, pyrosomes are essential for the transport and cycling of organic carbon in marine ecosystems¹⁵.

Pyrosoma atlanticum Péron, 1804¹¹ is the most extensively researched pyrosome species and is widely distributed in oceanic regions between latitudes 50° N and 50° S. In this study, we performed an integrated analysis using PacBio Iso-Seq and short-read RNA-seq to generate high-quality, full-length transcript data from P. atlanticum. Based on these full-length transcripts, systematic gene functional annotation was then performed. The resulting full-length transcriptome and gene expression profiles constitute the first transcriptomic dataset for the class Thaliacea and can serve as a reference transcriptome for subsequent genetic analyses. Furthermore, this study offers valuable transcriptomic insights for ongoing research into chordate evolution.

Methods

Sample collection & nucleic acid extraction

The P. atlanticum colony used in this study was collected from surface water in the Philippine Basin, Western Pacific Ocean, in September 2018 (Fig. 3). After several rinses with sterile seawater to remove contaminants, the sample was blotted dry with sterile filter paper, pre-frozen in liquid nitrogen, and stored at −80 °C until RNA extraction. Given the small size and high water content of a single pyrosome zooid, a mixed sample of one entire zooid separated from the colony was used for total RNA isolation with an RNA extraction kit (Takara). RNA quality was evaluated with the RNA 6000 Nano kit on the Agilent 2100 Bioanalyzer system (Agilent Technologies).

PacBio library preparation and sequencing

Full-length cDNA was synthesized from the purified total RNA using the SMARTer PCR cDNA Synthesis Kit (Clontech, Mountain View), and large-scale PCR was performed to generate double-stranded cDNA templates. SMRTbell libraries were then prepared with the PacBio Template Prep Kit (PacBio) and sequenced on the PacBio Sequel platform using SMRT (Single Molecule Real-Time) sequencing.

Illumina library preparation & sequencing

Poly(A) mRNA was enriched from total RNA using oligo(dT) magnetic beads. The mRNA was fragmented with a fragmentation reagent, and the fragments served as templates for first-strand cDNA synthesis, followed by second-strand cDNA synthesis. The double-stranded cDNA was purified with AMPure XP beads, and the sticky ends were repaired. An “A” base was added to the 3′ end of the cDNA, and PCR amplification was carried out according to the manufacturer’s instructions. The constructed libraries were checked on an ABI StepOnePlus real-time PCR system and sequenced on an Illumina HiSeq X Ten platform. Three replicates were sequenced.

Raw data processing

The bioinformatic pipeline from raw data to full-length transcriptome is shown in Fig. 4. Briefly, the raw subreads were analyzed following the Iso-Seq 3 pipeline (https://github.com/PacificBiosciences/IsoSeq). This pipeline included three key steps: generation of polished Circular Consensus Sequence (CCS) reads using CCS v6.2.0 with a minimum read quality of 0.9 (--min-rq 0.9), classification of full-length (FL) reads based on the presence of both 5′ and 3′ cDNA primers and a poly(A) tail, and clustering of FL reads. Lima v2.1.0 and isoseq3 refine were used for primer removal and poly(A) tail excision, respectively. The Iterative Clustering and Error correction (ICE) algorithm was applied to obtain high-quality FL consensus sequences, defined by a post-correction accuracy above 99%.

In total, the SMRT cell yielded 20.61 Gb of data from the PacBio library, from which 433,244 CCSs totaling 1,050,025,012 bp were generated (Table 1, Fig. 5a, b). Of these, 380,278 full-length non-chimeric (FLNC) reads were identified (Table 2). FLNC reads with high similarity were clustered into consensus sequences by the ICE algorithm; in combination with non-full-length sequences, this generated 21,278 consensus isoforms with a mean length of 2,552 bp. The consensus sequences were polished with Quiver, yielding 21,024 high-quality (HQ, accuracy > 99%) isoforms and 250 low-quality (LQ) isoforms (Table 3, Fig. 5c). The LQ isoforms were corrected using the Illumina sequencing data with Proovread¹⁶. The HQ and corrected LQ isoforms were then combined and clustered with CD-HIT^17,18 to remove redundancy, yielding 12,996 full-length transcripts (Fig. 5d). Benchmarking Universal Single-Copy Orthologs (BUSCO) with an OrthoDB dataset was used to assess the integrity and accuracy of the transcriptome¹⁹, and the results indicated that both were reliable (Fig. 6).
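The read counts reported above can be cross-checked with simple arithmetic. The following Python sketch is illustrative only and is not part of the authors' pipeline: the constant and function names are hypothetical, and the numbers are copied from the text.

```python
# Counts copied from the text; all names here are illustrative.
CCS_READS = 433_244
CCS_BASES = 1_050_025_012
FLNC_READS = 380_278
HQ_ISOFORMS = 21_024
LQ_ISOFORMS = 250
FINAL_TRANSCRIPTS = 12_996


def mean_length(total_bases: int, n_reads: int) -> float:
    """Mean read length in bp."""
    return total_bases / n_reads


def percent(part: int, whole: int) -> float:
    """Share of `whole` represented by `part`, as a percentage."""
    return 100.0 * part / whole


if __name__ == "__main__":
    print(f"Mean CCS length: {mean_length(CCS_BASES, CCS_READS):.0f} bp")
    print(f"FLNC share of CCSs: {percent(FLNC_READS, CCS_READS):.1f}%")
    # CD-HIT input is taken here to be the HQ plus corrected LQ isoforms.
    kept = percent(FINAL_TRANSCRIPTS, HQ_ISOFORMS + LQ_ISOFORMS)
    print(f"Isoforms retained after redundancy removal: {kept:.1f}%")
```

Under these figures, the mean CCS length is roughly 2.4 kb, close to the 2,552 bp mean reported for the consensus isoforms.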

Table 1 Statistics of CCSs.

Table 2 Statistics of full-length sequences.

Table 3 Cluster of consensus isoforms.

Predictions of simple sequence repeat (SSR), coding sequence (CDS), transcription factor (TF) & long non-coding RNA (lncRNA)

The SSRs of the transcriptome were identified with MISA (http://pgrc.ipk-gatersleben.de/misa/). TransDecoder (https://github.com/TransDecoder/TransDecoder/releases) in ‘longOrfs’ mode identified candidate coding regions within transcripts by locating open reading frames (ORFs) of a minimum length, computing log-likelihood scores, and verifying that these scores were highest in the first reading frame. It also allowed verification of putative peptides matching a Protein Family (Pfam)²⁰ domain above the noise cutoff score.

To predict transcription factors, we used AnimalTFDB²¹ (https://guolab.wchscu.cn/AnimalTFDB4_Document), which provides a curated collection of Hidden Markov Models (HMMs) for various transcription factors. We extracted the HMM profiles of known DNA-binding domains from AnimalTFDB and used them to scan the protein sequences with the HMMER 3.1b2 package. The hmmsearch tool compared each protein sequence against the HMM profiles, and hits were considered significant if their E-values were below 1e-4. The identified domains were further validated by cross-referencing with existing annotations in public databases such as UniProt and Pfam.

Four computational tools, Coding Potential Calculator (CPC)²², Coding-Non-Coding Index (CNCI)²³, Coding Potential Assessment Tool (CPAT)²⁴, and Pfam²⁰, were employed to distinguish non-protein-coding RNA candidates from potential protein-coding RNAs. Transcripts were filtered based on a minimum length of 50 nucleotides and a default exon count threshold²⁵. Long non-coding RNA (lncRNA) candidates were defined as transcripts longer than 200 nucleotides and containing more than two exons. These candidates were then screened with CPC, CNCI, CPAT, and Pfam to separate protein-coding from non-coding transcripts.
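To make the SSR search concrete, the sketch below finds perfect tandem repeats with regular expressions. It is a simplified stand-in for MISA, not its implementation: the function name is hypothetical, and the minimum repeat counts (10 for mononucleotides, 6 for dinucleotides, 5 for tri- to hexanucleotides) are an assumption, since the thresholds used in this study are not stated.

```python
import re

# Minimum repeat counts per motif length (1-6 bp). Assumed values; the source
# does not state the thresholds that were actually used.
MIN_REPEATS = {1: 10, 2: 6, 3: 5, 4: 5, 5: 5, 6: 5}


def find_ssrs(seq: str, min_repeats: dict[int, int] = MIN_REPEATS) -> list[tuple[int, str, int]]:
    """Return (start, motif, repeat_count) for perfect SSRs in `seq`."""
    seq = seq.upper()
    hits = []
    for motif_len, min_rep in sorted(min_repeats.items()):
        # ([ACGT]{k}) captures one motif; \1{n,} requires n further tandem copies.
        pattern = re.compile(r"([ACGT]{%d})\1{%d,}" % (motif_len, min_rep - 1))
        for match in pattern.finditer(seq):
            motif = match.group(1)
            # Skip motifs that are repeats of a shorter unit (e.g. "AA", "ATAT"),
            # so a run is reported only under its shortest motif.
            if any(motif == motif[:k] * (motif_len // k)
                   for k in range(1, motif_len) if motif_len % k == 0):
                continue
            hits.append((match.start(), motif, len(match.group(0)) // motif_len))
    return hits


if __name__ == "__main__":
    print(find_ssrs("GG" + "AT" * 6 + "CC"))
```

Compound and imperfect repeats, which MISA can also report, are deliberately left out of this sketch.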

A total of 6,820 SSR candidates were obtained from the 12,788 transcripts of ≥ 500 bp (Table 4, Fig. 7a, b). TransDecoder identified a total of 11,561 CDSs containing both start and stop codons, which were defined as complete open reading frames (ORFs) (Fig. 7c). Based on the AnimalTFDB database, a total of 545 TFs belonging to 47 TF families were predicted (Fig. 7d). Using CPC, CNCI, CPAT and Pfam, a total of 549 lncRNAs were predicted (Fig. 8a).
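One common way to combine several coding-potential tools is to keep only transcripts that every tool calls non-coding and then apply the length cutoff. The Python sketch below illustrates that idea under the assumption that the four tools were combined by intersection, which the text does not state explicitly. The function names and transcript IDs are hypothetical, and the exon-count criterion is omitted.

```python
def consensus_noncoding(calls: dict[str, set[str]]) -> set[str]:
    """Transcript IDs called non-coding by every tool in `calls`."""
    tool_sets = list(calls.values())
    if not tool_sets:
        return set()
    return set.intersection(*tool_sets)


def longer_than(ids: set[str], lengths: dict[str, int], min_len: int = 200) -> set[str]:
    """Keep IDs whose transcript length exceeds `min_len` nucleotides."""
    return {t for t in ids if lengths.get(t, 0) > min_len}


if __name__ == "__main__":
    # Toy data: each tool maps to the set of transcripts it calls non-coding.
    calls = {
        "CPC": {"t1", "t2", "t3"},
        "CNCI": {"t1", "t3", "t4"},
        "CPAT": {"t1", "t3"},
        "Pfam": {"t1", "t2", "t3"},  # here: transcripts with no Pfam domain hit
    }
    lengths = {"t1": 1500, "t2": 900, "t3": 180, "t4": 2500}
    print(longer_than(consensus_noncoding(calls), lengths))
```

Requiring agreement among all tools is conservative: it lowers the number of candidates but reduces the chance that a coding transcript is mislabeled as an lncRNA.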

Table 4 Statistics of SSRs.

Gene functional annotation

Gene function was annotated based on the following databases: NCBI Non-Redundant Protein Sequence Database (NR)²⁶, Pfam²⁰, Clusters of euKaryotic Ortholog Groups (KOG)²⁷, Clusters of Orthologous Groups (COG)²⁸, evolutionary genealogy of genes: Non-supervised Orthologous Groups (eggNOG)²⁹, the manually annotated and reviewed protein sequence database Swiss-Prot³⁰, Kyoto Encyclopedia of Genes and Genomes (KEGG)³¹ and Gene Ontology (GO)³². The BLAST program was used with an E-value < 10⁻³ (ref. 33). As a result, a total of 10,874 transcripts were annotated (Table 5). Moreover, a total of 6,728 transcripts were simultaneously annotated by the Pfam, COG, GO and KEGG databases (Fig. 8b).
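Annotation by sequence similarity typically reduces to keeping, for each transcript, the best database hit that passes the E-value cutoff. The Python sketch below shows this for BLAST tabular output (the standard 12-column `-outfmt 6` layout). It is a hypothetical helper, not the authors' code; the cutoff is left as a parameter, and the example hits are invented.

```python
def best_hits(lines, max_evalue: float) -> dict[str, tuple[str, float, float]]:
    """Map each query to its best hit (subject, evalue, bitscore).

    `lines` are rows of BLAST tabular output: qseqid, sseqid, pident, length,
    mismatch, gapopen, qstart, qend, sstart, send, evalue, bitscore.
    Hits with evalue >= max_evalue are discarded; ties keep the first hit seen.
    """
    best = {}
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        fields = line.split("\t")
        query, subject = fields[0], fields[1]
        evalue, bitscore = float(fields[10]), float(fields[11])
        if evalue >= max_evalue:
            continue
        if query not in best or bitscore > best[query][2]:
            best[query] = (subject, evalue, bitscore)
    return best


if __name__ == "__main__":
    # Invented example rows; IDs are placeholders.
    rows = [
        "tx1\tprotA\t80.0\t200\t40\t0\t1\t600\t1\t200\t1e-50\t180.0",
        "tx1\tprotB\t60.0\t150\t60\t0\t1\t450\t1\t150\t1e-20\t90.0",
        "tx2\tprotC\t30.0\t50\t35\t0\t1\t150\t1\t50\t0.5\t25.0",
    ]
    print(best_hits(rows, max_evalue=1e-3))
```

Ranking by bit score rather than E-value avoids ties among very strong hits, whose E-values are often reported as 0.0.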

Table 5 Statistics of annotated transcripts.

After all transcripts were aligned against the NR database, 10,544 transcripts were matched (Fig. 9a). The top six species with the most annotated homologous sequences in the NR database were Ciona intestinalis Linnaeus, 1767³⁴ (5,350, 50.69%), Phallusia mammillata Cuvier, 1815³⁵ (4,001, 37.91%), Branchiostoma belcheri Gray, 1847³⁶ (84, 0.80%), Dendronephthya gigantea Verrill, 1864³⁷ (45, 0.43%), Lingula anatina Lamarck, 1801³⁸ (30, 0.28%), and Saccoglossus kowalevskii Agassiz, 1873³⁹ (30, 0.28%).

A total of 10,176 transcripts were annotated with 10,960 annotations by the COG database, and divided into 24 categories according to COG function class (Fig. 9b). There were 2,386 and 1,633 transcripts belonging to General function prediction only and Function unknown, accounting for 21.77% and 14.90%, respectively. In addition, there were transcripts belonging to other categories, including Posttranslational modification, protein turnover, chaperones (1,171, 10.68%), Intracellular trafficking, secretion, and vesicular transport (601, 5.48%), Transcription (596, 5.44%) and RNA processing and modification (582, 5.31%).
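The COG percentages above are computed relative to the 10,960 annotations rather than the 10,176 annotated transcripts, as the following short check shows. The counts are copied from the text; the variable and function names are illustrative.

```python
TOTAL_COG_ANNOTATIONS = 10_960

# Category counts copied from the paragraph above.
COG_COUNTS = {
    "General function prediction only": 2_386,
    "Function unknown": 1_633,
    "Posttranslational modification, protein turnover, chaperones": 1_171,
    "Intracellular trafficking, secretion, and vesicular transport": 601,
    "Transcription": 596,
    "RNA processing and modification": 582,
}


def percentages(counts: dict[str, int], total: int) -> dict[str, float]:
    """Percentage of `total` for each category, rounded to two decimals."""
    return {name: round(100.0 * n / total, 2) for name, n in counts.items()}


if __name__ == "__main__":
    for name, pct in percentages(COG_COUNTS, TOTAL_COG_ANNOTATIONS).items():
        print(f"{name}: {pct:.2f}%")
```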

A total of 8,063 transcripts were annotated with 1,299,420 GO terms by the GO database, covering the categories Biological Process, Cellular Component, and Molecular Function (Fig. 9c). In the Biological Process category, the highly represented groups included Cellular process (7,084) and Biological regulation (5,741), with further enrichment in subcategories such as Metabolic process (5,287), Regulation of biological process (5,352), and Organic substance metabolic process (5,115). In the Cellular Component category, the top groups were Intracellular anatomical structure (7,191) and Organelle (6,629), with specific enrichment in subcategories such as Cytoplasm (6,296), Intracellular organelle (6,531), and Membrane-bounded organelle (6,089). For Molecular Function, the main groups included Binding (5,447) and Protein binding (4,165), with key subcategories such as Catalytic activity (3,269), Organic cyclic compound binding (2,134), and Heterocyclic compound binding (2,097).

In total, 7,698 transcripts were matched in the KEGG database with 10,893 KO terms, and 5,019 transcripts were matched in 22,914 pathways. These transcripts fell into six categories: Cellular Processes, Organismal Systems, Metabolism, Genetic Information Processing, Environmental Information Processing and Human Diseases. Overall, the top ten pathways of the matched transcripts were Metabolic pathways (1,174), Biosynthesis of secondary metabolites (388), Pathways in cancer (377), Protein processing in endoplasmic reticulum (269), Phagosome (265), Human papillomavirus infection (237), Microbial metabolism in diverse environments (230), Spliceosome (222), PI3K-Akt signaling pathway (214) and Thermogenesis (209) (Fig. 9d).

Data Records

All raw full-length Iso-Seq reads were deposited in the Sequence Read Archive (SRA) of the National Center for Biotechnology Information (NCBI) under accession number SRR26427024⁴⁰. All raw RNA-seq reads were deposited in the SRA of NCBI under accession numbers SRR26427028⁴¹, SRR26427029⁴² and SRR26427030⁴³. Moreover, the following were deposited in Figshare⁴⁴: the sequence files produced along the bioinformatic pipeline from raw data to full-length transcriptome (CCSs, FLNC reads, consensus isoforms, HQ isoforms, LQ isoforms, transcripts, SSRs, lncRNAs, CDSs, CDSs (mRNA) and TFs); the prediction results for SSRs, CDSs, TFs and lncRNAs; the gene annotations from eggNOG, COG, GO, KEGG, Pfam, KOG, NR and Swiss-Prot; and the FPKM (Fragments Per Kilobase of transcript per Million mapped reads) values of the three replicates generated by Illumina sequencing of the pyrosome.
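For readers unfamiliar with the FPKM metric, the sketch below spells out the standard definition: fragments mapped to a transcript, divided by transcript length in kilobases and by total mapped fragments in millions. It is a textbook formula with a hypothetical function name, not the RSEM implementation, which works with estimated counts and effective transcript lengths.

```python
def fpkm(fragments: float, length_bp: float, total_mapped_fragments: float) -> float:
    """Fragments Per Kilobase of transcript per Million mapped fragments."""
    if length_bp <= 0 or total_mapped_fragments <= 0:
        raise ValueError("length and total mapped fragments must be positive")
    # fragments / (length_bp / 1e3) / (total / 1e6) == fragments * 1e9 / (length_bp * total)
    return fragments * 1e9 / (length_bp * total_mapped_fragments)


if __name__ == "__main__":
    # Invented example: 500 fragments on a 2,000 bp transcript, 10 million mapped fragments.
    print(fpkm(500, 2_000, 10_000_000))
```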

Technical Validation

Three single zooids separated from the colony were sequenced as three replicates on an Illumina HiSeq X Ten. The quality of the Illumina clean reads was evaluated with FastQC, and the Q20, Q30 and GC content metrics were obtained (Table 6). Pearson correlation analysis of the FPKM values between the three replicates was then carried out (Fig. 10) to assess reproducibility.
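The reproducibility check can be sketched as a pairwise Pearson correlation over per-transcript FPKM vectors. The Python example below uses only the standard library; the replicate names and FPKM values are invented for illustration, and whether the authors transformed FPKM values (for example with a logarithm) before correlating is not stated.

```python
import math
from itertools import combinations


def pearson(x: list[float], y: list[float]) -> float:
    """Pearson correlation coefficient between two equal-length vectors."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("vectors must have equal length of at least 2")
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant vector")
    return cov / math.sqrt(var_x * var_y)


if __name__ == "__main__":
    # Invented FPKM values for the same five transcripts in three replicates.
    replicates = {
        "rep1": [12.0, 0.5, 230.0, 45.0, 3.2],
        "rep2": [11.1, 0.7, 250.0, 40.0, 2.9],
        "rep3": [13.4, 0.4, 210.0, 48.0, 3.5],
    }
    for a, b in combinations(replicates, 2):
        print(a, b, round(pearson(replicates[a], replicates[b]), 4))
```

Because Pearson correlation on raw FPKM is dominated by a few highly expressed transcripts, a log transform is a common alternative when such outliers are a concern.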

Table 6 Statistics of evaluation of sequencing data.

Code availability

Parameters to software tools involved are described below:

CCS: version 6.2.0, parameters: minimum read quality of 0.9 (--min-rq 0.9), classification of full-length (FL) reads based on the presence of both 5′ and 3′ cDNA primers and a poly(A) tail, and clustering of FL reads.

Lima: version 2.1.0.

isoseq3 refine: version 3.4.0, parameters: poly(A) tail excision (--require-polya).

CD-HIT: version 4.8.1, parameters: setting the sequence identity threshold to 99% (-c 0.99).

BUSCO: version 5.7.1, parameters: metazoan lineage dataset.

BLAST: version 2.2.26, parameters: limits the number of high-scoring pairs (hsps) to two per query-subject pair (-max_hsps 2), only considers alignments with at least 95% sequence similarity (-perc_identity 95), requires each hsp to completely cover 100% of the query sequence (-qcov_hsp_perc 100).

MISA: https://webblast.ipk-gatersleben.de/misa/, parameters: filter transcripts longer than 500 bp.

TransDecoder: https://github.com/TransDecoder/TransDecoder/releases, parameters: identifying candidate coding regions within transcripts by locating minimum length open reading frames (ORFs), computing log-likelihood scores, and verifying these scores were highest in the first reading frame. Additionally, it allowed verification of putative peptides matching a Pfam domain above the noise cutoff score.

AnimalTFDB: https://guolab.wchscu.cn/AnimalTFDB4_Document.

HMMER: version 3.1b2, parameters: setting the E-value threshold for reporting domain hits (--domE 1e-4).

FastQC: version 0.11.9.

Trimmomatic: version 0.39.

STAR: version 2.7.3a.

Samtools: version 0.1.18.

RSEM: version 1.3.3, parameters: using FPKM as the metric.

Databases for LncRNA: CPC (http://cpc2.gao-lab.org/), CNCI (https://github.com/www-bioinfo-org/CNCI), CPAT (http://rna-cpat.sourceforge.net/), Pfam (http://pfam.xfam.org/).

Databases for Gene Functional Annotation: NR (https://www.ncbi.nlm.nih.gov/), Pfam (http://pfam.xfam.org/), Clusters of Orthologous Groups of proteins (KOG/COG/eggNOG, http://eggnogdb.embl.de/), Swiss-Prot (https://www.uniprot.org/), KEGG (https://www.genome.jp/kegg/), and GO (http://geneontology.org/).

If no explanation was provided for the software, there was no custom code used to generate or process the data in the manuscript.

References

1. Jones, T. R. Tunicata. In: Todd’s Cyclopaedia of Anatomy and Physiology. Sherwood, Gilbert, and Piper 4, 1185–1243 (1848).
2. Balfour, F. M. A treatise on comparative embryology. London, MacMillan 2, 1–655 i–xxi (1881).
3. Haeckel, E. Die gastraea theorie, die phylogenetische classification des thierreichs und die homologie der keimblatter. Jenaische Zeitschrift fur Naturwissenschaft 8, 1–55 (1874).
4. Grobben, K. Die systematische Einteilung des Tierreiches. Ver. Zool. Bot. Ges. Wien 58, 491–511 (1908).
5. Wada, H. & Satoh, N. Details of the evolutionary history from invertebrates to vertebrates, as deduced from the sequences of 18S rDNA. Proc. Natl Acad. Sci. USA 91, 1801–1804 (1994).
6. Perseke, M., Golombek, A., Schlegel, M. & Struck, T. H. The impact of mitochondrial genome analyses on the understanding of deuterostome phylogeny. Mol. Phylogenet. Evol. 66, 898–905 (2013).
7. Jefferies, R. P., Brown, N. A. & Daley, P. E. The early phylogeny of chordates and echinoderms and the origin of chordate left–right asymmetry and bilateral symmetry. Acta Zoologica 77, 101–122 (1996).
8. Van Soest, R. W. M. A monograph of the order Pyrosomatida (Tunicata, Thaliacea). J. Plankton Res. 3, 603–631 (1981).
9. Lahille, M. F. Étude systématique des tuniciers. Compt Rendus Assoc. France Adv. Sci. 16, 667–677 (1888).
10. Van Soest, R. W. M. Revised classification of the order Pyrosomatida (Tunicata, Thaliacea), with the description of a new genus. Steenstrupia 5, 197–217 (1979).
11. Péron, F. Mémoire sur le nouveau genre Pyrosoma. Ann. Mus. Hist. Natur. Paris 4, 437–446 (1804).
12. Garstang, W. Tunicates. Encyclopedia Britannica 14, 549–555 (1929).
13. Perissinotto, R. et al. Grazing by Pyrosoma atlanticum (Tunicata Thaliacea) in the south Indian Ocean. Mar. Ecol. Progr. Ser. 330, 1–11 (2007).
14. Andersen, V., Sardou, J. & Nival, P. The diel migrations and vertical distributions of zooplankton and micronekton in the Northwestern Mediterranean Sea. 2. Siphonophores, hydromedusae and pyrosomids. J. Plankton Res. 14, 1155–1169 (1992).
15. Esnal, G. Pyrosomatida. In: D. Boltovskoy (ed.). South Atlantic Zooplankton. Leiden, Backhuys Publishers 1423–1444 (1999).
16. Hackl, T., Hedrich, R., Schultz, J. & Förster, F. Proovread: large-scale high-accuracy PacBio correction through iterative short read consensus. Bioinformatics 30, 3004–3011 (2014).
17. Li, W. & Godzik, A. CD-HIT: a fast program for clustering and comparing large sets of protein or nucleotide sequences. Bioinformatics 22, 1658–1659 (2006).
18. Fu, L., Niu, B., Zhu, Z., Wu, S. & Li, W. CD-HIT: accelerated for clustering the next-generation sequencing data. Bioinformatics 28, 3150–3152 (2012).
19. Mosè, M., Berkeley, M. R., Seppey, M., Simão, F. A. & Zdobnov, E. M. BUSCO Update: Novel and Streamlined Workflows along with Broader and Deeper Phylogenetic Coverage for Scoring of Eukaryotic, Prokaryotic, and Viral Genomes. Molecular Biology and Evolution 38, 4647–4654 (2021).
20. Finn, R. D. et al. Pfam: the protein families database. Nucleic Acids Research 42, D222–230 (2014).
21. Shen, W. et al. AnimalTFDB 4.0: a comprehensive animal transcription factor database updated with variation and expression annotations. Nucleic Acids Research 51(Database issue), D39–45 (2023).
22. Kong, L. et al. CPC: assess the protein-coding potential of transcripts using sequence features and support vector machine. Nucleic Acids Research 35(suppl 2), W345–W349 (2007).
23. Sun, L. et al. Utilizing sequence intrinsic composition to classify protein-coding and long non-coding transcripts. Nucleic Acids Research 41, e166 (2013).
24. Wang, L. et al. CPAT: Coding-Potential Assessment Tool using an alignment-free logistic regression model. Nucleic Acids Research 41, e74–e74 (2013).
25. Yang, H., Xu, D., Zhuo, Z., Hu, J. & Lu, B. SMRT Sequencing of the Full-Length Transcriptome of the Rhynchophorus ferrugineus (Coleoptera: Curculionidae). PeerJ 8, 9133 (2020).
26. Deng, Y. et al. Integrated NR Database in Protein Annotation System and Its Localization. Computer Engineering 32, 71–74 (2006).
27. Koonin, E. V. et al. A comprehensive evolutionary classification of proteins encoded in complete eukaryotic genomes. Genome Biology 5, R7 (2004).
28. Tatusov, R. L., Galperin, M. Y. & Natale, D. A. The COG database: a tool for genome scale analysis of protein functions and evolution. Nucleic Acids Research 28, 33–36 (2000).
29. Huerta-Cepas, J. et al. eggNOG 5.0: a hierarchical, functionally and phylogenetically annotated orthology resource based on 5090 organisms and 2502 viruses. Nucleic Acids Research 47, D309–D314 (2019).
30. Apweiler, R. et al. UniProt: The Universal Protein knowledgebase. Nucleic Acids Research 1, D115–9 (2004).
31. Dobin, A. et al. STAR: ultrafast universal RNA-seq aligner. Bioinformatics 29, 15–21 (2013).
32. Ashburner, M. et al. Gene ontology: tool for the unification of biology. Nature Genetics 25, 25–29 (2000).
33. Altschul, S. F. et al. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Research 25, 3389–3402 (1997).
34. Linnaeus, C. Systema naturae per regna tria naturae: secundum classes, ordines, genera, species, cum characteribus, differentiis, synonymis, locis. Ed. 12. 1., Regnum Animale. 1 & 2. Holmiae (Stockholm), Laurentii Salvii. apud JB Delamolliere 1–1327 (1767).
35. Cuvier, G. Memoire sur les ascidies et sur leur anatomique. Mem Mus Hist Na. Paris 2, 10–39 (1815).
36. Gray, J. E. A list of the genera of recent Mollusca, their synonyma and types. R. and JE Taylor (1847).
37. Verrill, A. E. List of the polyps and corals sent by the Museum of Comparative Zoology to other institutions in exchange, with annotations. Bulletin of the Museum of Comparative Zoology 1, 29–60 (1864).
38. Lamarck, J. B. Système des animaux sans vertèbres, ou tableau général des classes, des ordres et des genres de ces animaux; Présentant leurs caractères essentiels et leur distribution, d’apres la considération de leurs rapports naturels et de leur organisation, et suivant l’arrangement établi dans les galeries du Muséum d’Histoire Naturelle, parmi leurs dépouilles conservées; Précédé du discours d’ouverture du Cours de Zoologie, donné dans le Muséum National d’Histoire Naturelle l’an 8 de la République. Published by the author and Deterville, Paris viii + 432 pp (1801).
39. Agassiz, A. The history of Balanoglossus and Tornaria. Memoirs of the American Academy of Arts and Sciences 9, 421–436 (1873).
40. Sequence Read Archive (SRA) database of National Center for Biotechnology Information https://identifiers.org/ncbi/insdc.sra:SRR26427024 (2023).
41. Sequence Read Archive (SRA) database of National Center for Biotechnology Information https://identifiers.org/ncbi/insdc.sra:SRR26427028 (2023).
42. Sequence Read Archive (SRA) database of National Center for Biotechnology Information https://identifiers.org/ncbi/insdc.sra:SRR26427029 (2023).
43. Sequence Read Archive (SRA) database of National Center for Biotechnology Information https://identifiers.org/ncbi/insdc.sra:SRR26427030 (2023).
44. Xiang, P. et al. Pyrosome Transcriptome Datasets. Figshare https://doi.org/10.6084/m9.figshare.27242205 (2024).

Acknowledgements

This work was supported by the National Key Research and Development Program of China (Grant Nos. 2022YFC3102401 & 2023YFC3604500), National Natural Science Foundation of China (Grant No. 32271165), Natural Science Foundation of Xiamen, China (3502Z20227162), Key Laboratory of Tropical Marine Ecosystem and Bioresource, Ministry of Natural Resources (2021ZD01) and the Interdisciplinary Project in Ocean Research of Tongji University (Grant No. 2022-2-ZD-02). We thank the Biomarker Technologies Co., Ltd for assisting in sequencing. Figure 2 was created using Ocean Data View.

Author information

These authors contributed equally: Peng Xiang, Xuanxuan Bai.

Authors and Affiliations

Fundamental Research Center, Shanghai Yangzhi Rehabilitation Hospital (Shanghai Sunshine Rehabilitation Center), School of Life Sciences and Technology, Tongji University, 1239 Siping Road, Shanghai, 200092, China
Peng Xiang, Xuanxuan Bai & Chao Zhang
Laboratory of Marine Biodiversity, Third Institute of Oceanography, Ministry of Natural Resources, 178 Daxue road, 361005, Xiamen, China
Peng Xiang & Bingpeng Xing
School of Pharmaceutical Sciences, Xiamen University, 4221 Xiangannan road, Xiamen, 361104, China
Xuanxuan Bai & Mingyu Li
State Key Laboratory of Marine Geology, School of Ocean and Earth Science, Tongji University, 1239 Siping Road, Shanghai, 200092, China
Jiangtao Li
Key Laboratory of Tropical Marine Ecosystem and Bioresource, Fourth Institute of Oceanography, Ministry of Natural Resources, Beihai, 536015, China
Mingyu Li

Contributions

P.X., M.L. and X.B. conceived and designed the study. P.X. collected the samples. P.X., B.X. and X.B. performed the experiments. B.X., X.B., M.L., C.Z. and J.L. analysed the data. All authors discussed the results. P.X., M.L., X.B. and J.L. wrote the draft manuscript. All authors reviewed the manuscript. C.Z. and M.L. approved the final version to be published. P.X. and X.B. contributed equally to this work and should be considered co-first authors.

Corresponding authors

Correspondence to Chao Zhang or Mingyu Li.

Ethics declarations

Competing interests

The authors declare no competing interests.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.

About this article

Cite this article

Xiang, P., Bai, X., Xing, B. et al. Full-length transcriptome annotation of a pyrosome, Pyrosoma atlanticum (Chordata, Thaliacea). Sci Data 11, 1433 (2024). https://doi.org/10.1038/s41597-024-04251-7

Received: 12 November 2023
Accepted: 06 December 2024
Published: 24 December 2024
DOI: https://doi.org/10.1038/s41597-024-04251-7