LucaPCycle: Illuminating microbial phosphorus cycling in deep-sea cold seep sediments using protein language models

Zhang, Chuwen; He, Yong; Wang, Jieni; Chen, Tengkai; Baltar, Federico; Hu, Minjie; Liao, Jing; Xiao, Xi; Li, Zhao-Rong; Dong, Xiyang

doi:10.1038/s41467-025-60142-4

Download PDF

Article
Open access
Published: 26 May 2025

LucaPCycle: Illuminating microbial phosphorus cycling in deep-sea cold seep sediments using protein language models

Nature Communications volume 16, Article number: 4862 (2025) Cite this article

3045 Accesses
13 Altmetric
Metrics details

Subjects

Abstract

Phosphorus is essential for life and critically influences marine productivity. Despite geochemical evidence of active phosphorus cycling in deep-sea cold seeps, the microbial processes involved remain poorly understood. Traditional sequence-based searches often fail to detect proteins with remote homology. To address this, we developed a deep learning model, LucaPCycle, integrating raw sequences and contextual embeddings based on the protein language model ESM2-3B. LucaPCycle identified 5241 phosphorus-cycling protein families from global cold seep gene and genome catalogs, substantially enhancing our understanding of their diversity, ecology, and function. Among previously unannotated sequences, we discovered three alkaline phosphatase families that feature unique ___domain organizations and preserved enzymatic capabilities. These results highlight previously overlooked ecological importance of phosphorus cycling within cold seeps, corroborated by data from porewater geochemistry, metatranscriptomics, and metabolomics. We revealed a previously unrecognized diversity of archaea, including Asgardarchaeota, anaerobic methanotrophic archaea and Thermoproteota, which contribute to organic phosphorus mineralization and inorganic phosphorus solubilization through various mechanisms. Additionally, auxiliary metabolic genes of cold seep viruses primarily encode the PhoR-PhoB regulatory system and PhnCDE transporter, potentially enhancing their hosts’ phosphorus utilization. Overall, LucaPCycle are capable of accessing previously ‘hidden’ sequence spaces for microbial phosphorus cycling and can be applied to various ecosystems.

Global and seasonal variation of marine phosphonate metabolism

Article Open access 23 June 2022

Unexpected genetic and microbial diversity for arsenic cycling in deep sea cold seep sediments

Article Open access 29 March 2023

Drivers of the global phosphorus cycle over geological time

Article 19 November 2024

Introduction

Cold seeps, often situated along continental margins, are areas where hydrocarbon-rich fluids escape from subsurface reservoirs to the seafloor^1,2. These unique environments are distinguished by their chemosynthetic microorganisms that utilize methane and other hydrocarbons—such as non-methane alkanes and aromatic hydrocarbons—as sources of carbon and energy to sustain life^3,4,5. A key process in these habitats is the anaerobic oxidation of methane (AOM), which occurs in conjunction with sulfate reduction, mediated by a consortium of anaerobic methanotrophic archaea (ANME) and sulfate-reducing bacteria (SRB)^6,7. In addition to carbon and sulfur, these microbial communities also require phosphorus for essential biological functions such as building cell membranes, synthesizing nucleic acids, and producing energy carriers and various phosphorylated metabolic intermediates^8,9. Despite its importance, studies on phosphorus cycling in deep-sea cold seep sediments are less developed than those on carbon and sulfur^3,10.

Chen et al. (2023) examined a piston core from the Haima cold seeps in the Qiongdongnan Basin, northern South China Sea, revealing high phosphate concentrations in porewater and low ratios of organic phosphorus to total organic carbon (TOC)¹¹. This suggests a phosphorus-rich environment within these deep-sea cold seeps. The study, along with others, provided geochemical evidence that AOM is a key driver of phosphorus cycling within these environments^11,12,13. This process, associated with the generation of hydrogen sulfide, leads to the reductive dissolution of iron oxides, releasing iron-bound phosphorus into the surrounding porewater. Another study indicates that in deep-sea carbonate deposits from cold seeps in the South and East China Seas, calcium-bound phosphorus predominates, co-precipitating with calcium carbonate during AOM, with iron also playing a crucial role in the release of phosphorus from iron oxides¹⁴. These findings underscore the complex interactions between microbial activity and chemical reactions in deep-sea phosphorus cycling, through a combination of microbial and chemical processes. However, the specific contributions of microbially-driven processes to the production of dissolved inorganic phosphorus in these environments remain unclear.

Microbes have evolved various enzymatic mechanisms to convert recalcitrant phosphorus forms into bioavailable phosphorus¹⁵. These phosphorus-cycling microorganisms (PCMs) possess genes related to four main processes: organic phosphorus mineralization, inorganic phosphorus solubilization, phosphorus uptake and transport, and phosphorus-starvation response regulation^16,17,18,19. Organic phosphorus mineralization primarily involves release of hydrolytic enzymes to recycle various organic phosphorus compounds. Acid (e.g., PhoN, PhoC) and alkaline phosphatases (e.g., PhoD, PhoA) cleave phosphate from complex phosphoesters^20,21,22, while phytases (e.g., AppA) hydrolyze phosphate from phytate or myo-inositol hexakis dihydrogen phosphate²³. Phosphonatase (e.g., PhnW, PhnX) and C-P lyase (PhnGHIJKLM) participate in C-P cleavage of phosphonates^24,25. For inorganic phosphorus mobilization, polyphosphate kinase (PPK) catalyzes the formation of polyphosphate (poly-P), and exopolyphosphatase (PPX) subsequently releases phosphate from poly-P chains^26,27. PCMs also secrete organic acids, such as gluconic acid produced by the oxidation of glucose via membrane-bound quinoprotein glucose dehydrogenase (Gcd), to solubilize inorganic phosphorus minerals²⁸. Additionally, PCMs harbor high-affinity (PstABCS) and low-affinity (Pit) transporters for phosphate uptake under varying conditions²⁹, and can adjust phosphorus-sensing and regulatory mechanisms (e.g., PhoB-PhoR system) to optimize phosphorus use under scarcity^19,30.

The annotation of phosphorus-cycling genes has traditionally relied on sequence homology methods such as Blastp or profile hidden Markov models (pHMMs)^30,31. However, these methods are limited to detecting sequences with relatively high similarity, potentially missing divergent sequences with distant evolutionary relationships. To overcome this limitation, researchers are increasingly using protein structure features, which are more conserved over evolutionary timescales and crucial for maintaining functional integrity³². Recent advancements in deep learning for protein bioinformatics, exemplified by tools like AlphaFold, have facilitated insights into these previously obscure ‘functionally dark’ proteins^33,34,35. For example, this approach led to the discovery of a superfamily of translation-targeting toxin-antitoxin systems, TumE–TumA³⁴. Additionally, the field of functional annotation predictions has benefited significantly from the application of natural language processing (NLP) techniques. In NLP, protein language models (PLMs) are developed using extensive labeled and unlabeled datasets to create vectorized embeddings that represent both protein sequences and structures. This innovative method has been applied in various life science research areas, including the identification of viruses, the discovery of antibiotics, and the detection of remote proteins^32,36,37,38.

In this study, we aim to improve the annotation of phosphorus-cycling proteins, focusing on previously unexplored sequence spaces, to deepen our understanding of microbial contribution in phosphorus cycling across global cold seeps. To achieve this, we developed a deep learning predictor named LucaPCycle, specifically designed to target phosphorus-cycling proteins. This model was trained using both the raw protein sequence and the embeddings derived from the PLM ESM2-3B. Our findings indicate that LucaPCycle effectively identifies remote homologies, thus providing a valuable complement to traditional alignment-based methods. This enhancement allows for a more comprehensive understanding of the roles and mechanisms of phosphorus cycling, not limited to the cold seep ecosystems investigated here, but also in other environments.

Results and discussion

PLM-based LucaPCycle outperforms homology-based approaches

LucaPCycle is a dual-channel deep learning model (Fig. 1a). The first channel utilizes the protein language model ESM2 to extract features from sequences at the residue level, leveraging its self-supervised learning capabilities to understand the sequence context³⁹. The second channel involves a Transformer-Encoder⁴⁰ designed to capture the features of the raw sequence. We applied LucaPCycle to build two models, one for binary-classification and the other for 31-classification. The binary-classification model predicts whether a protein sequence has a phosphorus-cycling function. If it does, the 31-classification model then categorizes it into one of 31 specific types of phosphorus-cycling genes (detailed in Supplementary Data 1). LucaPCycle is established based on 214,193 positive samples of 31 phosphorus-cycling proteins and 853,615 negative samples covering a range of functions, including intracellular phosphorus metabolism, nitrogen cycling, sulfur cycling, and other unrelated functions (Supplementary Fig. 1a–b). Detailed information on model building is provided in the Methods section.

**Fig. 1: Overview of LucaPCycle for phosphorus-cycling protein annotation.**

Binary- and 31-classification models in LucaPCycle both perform well on metrics of accuracy, precision, recall, F1-score (The harmonic mean of the precision and recall), AUC (Area under the ROC curve), and PR-AUC (Area under the precision-recall curve), achieving scores above 0.96 when evaluated on validation and test datasets (Fig. 1b). To further demonstrate the sensitivity and specificity of LucaPCycle, we benchmarked it to two homology-based functional annotation tools, utilizing a dataset of 1,521,958 true positive sequences (see Methods). The homologous protein search methods include: (1) Diamond Blastp based on sequence similarity with curated phosphorus-cycling genes in PCycDB database³⁰; and (2) KofamScan³¹, a gene function annotation tool based on KEGG Orthology and hidden Markov model (HMM). Notably, LucaPCycle outperforms the other two methods with the highest precision rate (86.13%, a measure of the accuracy of the positive predictions made by the model) and a comparable recall rate (85.63%, a measure of how well the model can identify all the positive samples; Fig. 1c). In comparisons, KofamScan exhibited a high recall rate (98.96%) but a low precision rate (58.24%), while Diamond Blastp exhibited both the lowest precision rate (51.98%) and recall rate (35.01%).

Remote homology detection: three alkaline phosphatases at cold seeps

Using the non-redundant gene (n = 147,289,169) and genome (n = 1878) catalogs from global cold seeps, phosphorus-cycling genes were functionally annotated using LucaPCycle and two sequence homology-based methods, Diamond Blastp and KofamScan (Supplementary Fig. 2). The results from these three search strategies were cross-checked by protein ___domain analysis, a deep learning model DeepFRI⁴¹ and a machine learning model CLEAN⁴². Subsequently, all validated phosphorus-cycling proteins were merged and clustered using sequence-based clustering algorithm to generate representative sequences for phylogenetic analyses. Singletons, likely indicative of sequencing errors⁴³, were discarded, leading to the identification of 333,493 non-singleton phosphorus-cycling protein families (Supplementary Fig. 2). The majority of protein families consisted of sequences organized in small groups (2 members) and exhibited diverse lengths with the most common range being 100–300 amino acids (Supplementary Fig. 3). Among 333,493 non-singleton phosphorus-cycling protein families, 5241 families (i.e., 29,418 proteins) were exclusively identified by LucaPCycle (Supplementary Fig. 4a). The 5241 protein families cover 31 distinct types of phosphorus-cycling genes, with phnC (n = 1384), ppx (n = 972), phnK (n = 723), and pstB (n = 720) being the ones having the highest number of families (Supplementary Fig. 4b). Diamond Blastp and KofamScan identified more exclusive phosphorus-cycling gene families than LucaPCycle. However, many of these exclusive families are likely false positives, as indicated by their relatively low precision rates (KofamScan: 58.24%; Diamond Blastp: 51.98%; Supplementary Note 1 and Supplementary Data 2). Despite passing ___domain-based validation, the accuracy of these predictions remains uncertain since one protein ___domain may appear in a variety of unrelated proteins, leading to potential false positives. To validate the 5241 phosphorus-cycling protein families identified by LucaPCycle, we queried these families against the full KofamScan HMM database. In total, 1966 protein families matched sequences in the HMM database, corresponding to 25 distinct types of phosphorus-cycling genes (Supplementary Data 3). However, these matches had high E-values (> 1E-5) and low sequence similarity scores (< 25), excluding them from standard KofamScan annotations. These results indicate that the LucaPCycle-specific protein families represent remote homologs, highlighting their distant evolutionary relationships with known sequences. LucaPCycle has an advantage in identifying remote homologous proteins due to ESM2’s embeddings generated by a large model trained on extensive sequence data. This approach effectively captures sequence context and produces deep embedding matrices, allowing distantly related sequences with low similarity to exhibit high similarity in their embeddings. Consequently, protein families identified exclusively by Diamond Blastp and KofamScan were classified as homologous phosphorus-cycling protein families, while those unique to LucaPCycle were categorized as remote phosphorus-cycling protein families. These remote proteins could significantly contribute to the biogeochemical cycling of phosphorus in cold seeps, but have been previously neglected due to limitations of homology-based methods. Notably, the ppx gene with remote homology constituted substantial proportions of the total ppx gene (73.02%) and transcript (76.84%) abundances (Supplementary Fig. 4c–d).

We further explored the remote homology information of alkaline phosphatases (ALPs) recovered by LucaPCycle. This group was chosen due to its prevalence in global ocean water and its clinical significance^20,44,45. ALPs are typically separated into three classical families (PhoA, PhoD, and PhoX)^20,22 and one divergent family (PafA)⁴⁵. Our sequence and structure tree topologies showed that remote ALPs formed three distinct clusters (Fig. 2a, b). We designated these clusters as ALP1, ALP2, and ALP3 in this study. Specifically, sequence and structural phylogenies placed ALP1 between PhoX and divergent PafA, while ALP2 and ALP3 were positioned closer to PafA and PhoA, respectively. Despite sharing a low sequence identity averaging at 14.3%, a large proportion of ALP1 demonstrated structural homology to classic PhoD (TM-score > 0.5; Supplementary Fig. 5a). ALP2 were more resemblance to divergent PafA rather than to other three classic ALPs. All ALP2 displayed a high structural similarity with divergent PafA, with the majority exceeding a TM-score of 0.75 but sequence similarity averaged at 33.6% (Supplementary Fig. 5b). ALP3 shared a sequence identity of 17.6% with PhoA from Escherichia coli (PDB: 1AJA) but exhibited a TM-score of 0.82 in terms of structure (Supplementary Fig. 5c). Thus, ALP1, ALP2, and ALP3 could be remote relatives of PhoD, PafA and PhoA, respectively.

**Fig. 2: Three alkaline phosphatases (ALPs) with remote homology identified by LucaPCycle.**

Protein structures are composed of domains, independently folding units that can be found in multiple structural contexts and functional roles⁴⁶. Extracting these domains enables a more fine-grained description of the features among different proteins. Subsequently, we compared the ___domain variations between ALP1 and the classic PhoD, given their relatively low TM-score (Supplementary Fig. 5a). Both ALP1 and PhoD were partitioned into two domains by sequence- and structure-based ___domain assignments (Fig. 2c). However, their ___domain organizations were completely different. Typically, PhoD features a core architecture consisting of a sandwich of long β-sheets flanked by α-helices of varying lengths⁴⁷ and an upstream N-terminal β-sheet ___domain. In contrast, ALP1 lacked the N-terminal ___domain and was organized with a core architecture followed by an unannotated C-terminal β-sheet ___domain (Fig. 2c). ALP1 also exhibited fewer α-helices in its core architecture compared to PhoD. However, the core architectures of ALP1 and PhoD maintained a similar orientation and were conserved within the ECOD hierarchy at various levels (Fig. 2d and Supplementary Data 4): A-group (same architecture), X-group (possible homology), H-group (homology), T-group (topology), and F-group (sequence similarity). Nevertheless, the C-terminal ___domain of ALP1 and the N-terminal ___domain of PhoD were structural divergence and assigned different ECOD T-group and F-group labels (Fig. 2d and Supplementary Data 4). The emergence of a C-terminal β-sheet ___domain may represent an adaptive strategy for phosphorus-cycling microbes to enhance fitness in extreme deep-sea environments.

Despite evolutionary variations in amino acid sequences and protein structures, the remote ALP1 retained the function of phosphoester hydrolysis. ALP1 preserved metal binding sites in their core architectures, with active center residues exhibiting similar spatial distribution to classic PhoD (Fig. 2e). The molecular docking analysis demonstrated robust and stable interactions between ALP1 and 89 various phosphoester compounds, as indicated by low binding energies (−11.02 ~ −2.39 kcal/mol; Fig. 2f and Supplementary Data 5). This conservation of essential domains (i.e., core architectures) for function maintenance in remote ALPs suggests convergent evolution.

Phosphorus-cycling genes are diverse, active and depth-stratified at cold seeps

The microbes inhabiting cold seeps were found to harbor multiple phosphorus acquisition strategies. In particular, genes encoding the regulatory system (phoR-phoB) were among the most predominant (av. 3780.38 genes per million, GPM; Fig. 3a). These microbes also exhibited efficient phosphate transport systems to acquire available phosphorus with minimal energy investment. This was evidenced by the significantly higher levels of genes encoding for high-affinity phosphate transporter (pstABCS; av. 634.03 GPM) compared to those encoding the low-affinity phosphate transporter (pit; av. 168.63 GPM). The prevalence of phoR-phoB and pstABCS aligned with observations in soil microbes in response to limited bioavailable phosphorus^48,49. Within inorganic phosphorus solubilization, genes encoding glucose dehydrogenase (gcd), polyphosphate kinase (ppk) and exopolyphosphatase (ppx) demonstrated comparable abundance patterns.

**Fig. 3: The biogeographic pattern of phosphorus-cycling genes across global cold seeps.**

Organic phosphorus constitutes about 25% of the total phosphorus buried in marine sediments⁵⁰, where phosphorus atoms are covalently bonded to carbon either directly (P–C, phosphonates) or through a phosphoester linkage (P–O–C, phosphomonoesters, phosphodiesters, phosphotriesters and phytates)⁵¹. Depending on the substrate, microorganisms rely on distinct enzymes to mineralize organic phosphorus and release inorganic phosphate. For phosphoester hydrolysis, genes for nonspecific ALPs (phoA, phoD, or phoX, av. 67.72 GPM) were more prevalent than genes for nonspecific acid phosphatases (APs; aphA, olpA, phoN or phoC, av. 16.23 GPM) and phytases (av. 4.65 GPM; Fig. 3a). The lower abundance of genes encoding phytase can be attributed to the limited availability of its substrate, phytates, sourced primarily from terrestrial runoff ⁵². Additionally, the approximately neutral pH at cold seeps may limit the activity of APs, which has an optimal pH range of 4–6⁵³ (Supplementary Data 6). Regarding phosphonate degradation, the genes encoding substrate-specific phosphonatase (phnW-phnX) and broad-specificity C–P lyase (phnGHIJK) showed comparable abundances (Fig. 3a). Moreover, untargeted metabolomics detected over 104 organic phosphorus compounds in 55 cold seep sediments (Fig. 3b and Supplementary Data 7). These metabolites included a wide diversity of phosphomonoesters (e.g., acetylphosphate), phytates (e.g., 1D-Myo-inositol 1,4,5,6-tetrakisphosphate) and phosphonates (e.g., Hydroxymethylphosphonate). Additionally, our results identified intermediate metabolites of organic phosphorus mineralization, such as alpha-D-ribose 1,2-cyclic phosphate 5-phosphate, generated during phosphonate degradation. This provides potential biological evidence for the recycling of organic phosphorus molecules at cold seeps. However, it cannot conclusively rule out the possibility that some of these molecules originate from external sources.

To determine the biogeographic patterns of phosphorus-cycling proteins, each metagenome was categorized by its sediment depth (i.e., surface: < 1 mbsf; shallow: 1–10 mbsf; deep: > 10 mbsf). The distribution of phosphorus-cycling proteins was stratified by sediment depth. The functional richness (Chao1) decreased with depth and peaked at surface layers (P < 0.001; Fig. 3c). The compositions of phosphorus-cycling proteins also displayed significant variations among surface, shallow, and deep layers (F = 13.5, P = 0.001, R² = 0.15, 999-permutations test; Fig. 3d). Likewise, we observed depth-related pattern in the expression of phosphorus-cycling genes, with a higher transcript abundance detected in surface layers (averaging from 1.38 to 1251.73 TPM; Supplementary Fig. 6). This depth-stratified pattern may be driven by phosphate concentrations, which increase with sediment depth as shown by measurements from 12 sediment cores in the Shenhu area and Qiongdongnan basin (Supplementary Fig. 7 and Supplementary Data 8). Phosphate concentration exerted a negative feedback regulation on the abundance of phosphorus-cycling genes (Fig. 3e), as observed in soil bacteria⁵⁴. Cold seep types (i.e., gas hydrate, methane seep, oil and gas seep, mud/asphalt volcano) also contribute to the patterns of phosphorus-cycling gene abundance and expression (Supplementary Fig. 6 and 8). Significant differences in functional richness (P < 0.001; Supplementary Fig. 8a) and compositions of phosphorus-cycling proteins among cold seep types were observed (F = 5.71, P = 0.002, R² = 0.10, 999-permutations test; Supplementary Fig. 8b). Temporal dynamics of phosphorus-cycling genes were evident across both short-term (over several years) and geological timescales. Variations in phosphorus-cycling gene abundance and composition were detected between samples collected in 2021 and 2023 from the QDN-W07 site, as well as between active and extinct areas of the Haima cold seep (Supplementary Fig. 9). These findings reflect the impact of changing seep activity on phosphorus cycling within cold seep ecosystems.

Subsequently, we assessed the importance of phosphorus cycling in cold seep ecosystems based on the abundance of corresponding genes and transcripts. We chose mcrA and dsrA genes as references due to their key roles in anaerobic methane oxidation coupled with sulfate reduction, a major biological process in cold seeps^55,56. Except for genes encoding APs and phytases, the abundances of other phosphorus-cycling genes associated with 11 distinct metabolic processes (averaging from 54.57 to 3780.38 GPM) exceeded those of mcrA (av. 46.15 GPM) or dsrA genes (av. 21.32 GPM; Fig. 3a). Likewise, the expression levels of phosphorus-cycling genes (averaging from 6.84 to 425.08 TPM) demonstrated a similar pattern when compared to mcrA (av. 251.53 TPM) or dsrA genes (av. 5.66 TPM; Supplementary Fig. 10). To date, microbial transformation of phosphorus is well-documented in ocean surface water where phosphate is often limited^24,44. Here, our results demonstrated that phosphorus cycling driven by microorganisms is clearly important within deep-sea cold seeps in addition to the extensively studied carbon and sulfur metabolisms^55,56.

Archaea are key players for organic phosphorus mineralization and inorganic phosphorus solubilization

The taxonomic analysis was focused on the cold seep genome catalog to ensure a fine and accurate taxonomic assignment of phosphorus-cycling proteins. The results revealed that genes encoding the regulatory system, the high-affinity PstABCS and the low-affinity Pit were widely distributed throughout all major archaeal and bacterial phyla (Supplementary Fig. 11). This broad distribution pattern highlights the foundational role of phosphorus uptake and regulation in supporting cell growth⁹. Aside from the aforementioned basic genetic equipment, our findings substantially expanded the taxonomic distribution of genes related to the phosphate scavenging from inorganic and organic phosphorus source among Archaea ___domain (Fig. 4a). These archaea exhibited high relative abundances in the microbial communities, e.g., with up to 44.1% of the microbial community in sediment samples from the Haakon Mosby mud volcano (Supplementary Fig. 12). To date, genes encoding inorganic polyphosphate (poly-P) metabolism, specifically ppk or ppx, have been identified in the archaeal phyla Halobacteriota and Thermoproteota⁵⁷. However, our PLM-based LucaPCycle alternatively identified ppx in Iainarchaeota (formerly Diapherotrites) from the DPANN superphylum, representing one of the 5241 annotations exclusively detected by LucaPCycle. DPANN are characterized by small genome sizes and minimal metabolic function⁵⁸. The selective maintenance of ppx genes in DPANN archaea may be associated with changing environments, as poly-P confers resistance to environmental stresses⁵⁹. However, mineral phosphorus solubilization in archaea has received less attention with the majority of studies focusing on bacteria^17,18,28. Beyond Halobacteriota²⁸, four additional archaeal phyla—Hydrothermarchaeota, Aenigmatarchaeota, Thermoplasmatota, and Thermoproteota—also harbored the gcd gene, which encodes glucose dehydrogenase (Fig. 4a). This enzyme catalyzes the oxidation of glucose to gluconic acid, facilitating the solubilization of mineral-sorbed inorganic phosphate¹⁸.

**Fig. 4: The overlooked role of archaea in deep-sea phosphorus-cycling.**

Contrary to a recent study reporting that most archaea cannot metabolize phosphonates⁶⁰, this study found that a variety of archaeal phyla possess the genetic capability to do so. Specifically, genes responsible for phosphonate substrate-specific catabolism (phnWX), which target 2-AEP, demonstrated a wide taxonomic distribution across archaea ___domain (Fig. 4a). These genes co-occurred in 12 different archaeal phyla, predominantly Halobacteriota, Thermoproteota, Asgardarchaeota, and Thermoplasmatota, whereas in global seawater, they were mainly found in the bacterial phylum Proteobacteria²⁴. Additionally, the broad-specificity catabolism (C-P lyase) genes (phnGHIJKLM) that hydrolyze a broad range of phosphonates were identified in genomes affiliated with archaea Halobacteriota. Notably, metagenome-assembled genomes (MAGs; FR_S4_sbin5 and HM1_cobin.76) belonging to ecologically important methanotrophs ANME-2c and ANME-3 possessed near complete C-P lyase operon (Fig. 4b, c). This marks a significant discovery, overturning the previous notion that the phnJ gene was exclusive to bacteria in marine ecosystems^24,61,62. Organization of C-P lyase operons exhibited a consecutive string of C-P lyase subunits and alongside the synteny with other phosphorus-cycling genes, such as the phosphonate transporter (phnCDE) and partial phosphate transporter (pstABCS). The neighborhood occurrence of these genes in C-P lyase operons reflects an evolutionary adaptation to enhance the efficiency of phosphorus-related mechanisms, consistent with what has been observed in bacteria²⁴. The subunits of the C-P lyase core complex found in ANME archaea interacted closely (pTM: 0.9; ipTM: 0.9) to form conformational states necessary for phosphonate breakdown (Fig. 4d).

ANME-2c FR_S4_sbin5 also harbor genes for complete MCR complex (mcrABG), along with an archaeal Wood–Ljungdahl pathway for oxidizing methane to CO₂ and HdrABC to recycle the CoM and CoB heterodisulfides (Fig. 4e and Supplementary Fig. 13). This denotes a conventional anaerobic oxidation of methane via MCR mechanism^63,64. The presence of C-P lyase in methanotrophs revealed the potential of methane production from utilizing methylphosphonate. Indeed, the metabolic product of PhnJ (alpha-D-ribose 1,2-cyclic phosphate 5-phosphate) was detected in our metabolome data (Supplementary Data 7), indicating the actual occurrence of methylphosphonate-driven methane formation at cold seeps. All phnJ genes, encoding for the catalytic subunit⁶⁵, from gene- and genome-level were used for phylogenic analyses to reveal the evolution of C-P layse. PhnJ protein sequences associated with Halobacteriota formed a monophyletic group, indicating a shared evolutionary ancestry (Supplementary Fig. 14).

Phosphorus-cycling AMGs in cold seep viruses are dominated by the PhoR-PhoB system and PhnCDE transporter

Apart from bacteria and archaea, we identified 328 phosphorus-cycling auxiliary metabolic genes (AMGs) encoded by 182 viral contigs (vContigs) from 165 metagenomic assemblies (Supplementary Data 9-10). This represents a substantial number of phosphorus-cycling AMGs when compared to the study by Liang et al. (2024)⁶⁶, which recovered only 105 phosphorus-cycling AMGs from 333 soil metagenomes. The recovered vContigs were validated as viruses by containing viral hallmark genes, viral-like genes or having a high virus score (> 0.7) determined by genomad⁶⁷. Genomic arrangements also supported their affiliations to viruses, with phosphorus-cycling AMGs flanked by viral hallmarks or viral-like genes on both sides (Fig. 5a and Supplementary Fig. 15). Using geNomad⁶⁷, up to 94.5% (n = 172) of the 182 vContigs were classified to the Caudoviricetes class, with a few assigned to classes of Megaviricetes (n = 1) and Tokiviricetes (n = 1, Fig. 5b). Due to the limited number of vContigs that could be classified to a specific viral family (Supplementary Data 11), the family-level classifications were performed alternatively using PhaGCN⁶⁸. A total of 58 vContigs could be further assigned to 15 families, with Straboviridae being the most prevalent (n = 19, Supplementary Fig. 16).

**Fig. 5: Phosphorus-cycling viruses at cold seeps.**

The 328 phosphorus-cycling AMGs spanned 20 gene types (Fig. 5c), categorized into four phosphorus metabolic processes: organic phosphorus mineralization (phoA, phoN, phoX, phnW, phnX, phnM, phnL and phnK), phosphorus transportation (phnC, phnD, phnE, pit and pstABCS), inorganic phosphorus solubilization (ppk and ppx) and phosphorus regulation (phoR and phoB). The identification of previously unrecognized AMGs⁶⁶, i.e., phnX, phoX, ppk, phnM, phnL and phnK, expanded the diversity of AMGs involved in phosphorus cycling. Remarkably, phosphorus regulation (phoR, n = 57; phoB, n = 102) was the most commonly encoded function among the cold seep viral AMGs (Fig. 5c). Moreover, one viral genome harbored a complete PhoB-PhoR regulatory system (Fig. 5a), a significant finding since previously only individual components of this system, either phoR or phoB gene, had been identified as AMGs^66,69. These finding allow us to propose that cold seep phages could regulate the transcription of phosphorus-cycling genes completely using their own regulators, rather than relying on the host’s regulatory system⁷⁰. Phosphorus-cycling AMGs were also enriched in ATP-binding subunit (phnC, n = 75; Fig. 5c) of the ABC-type phosphonate transport system (phnCDE)⁷¹. One viral genome was observed to encode other components of this transporter (phnDE, Fig. 5a). To date, only one provirus genome encoding PhnCDE was detected⁷², and the periplasmic high-affinity phosphate-binding gene pstS was the most commonly reported AMG related to phosphorus transport^73,74. The viral phoR, phoB and phnC were expressed at varying levels, with the highest reaching up to 4.80 TPM at Jiaolong cold seep, while the transcripts of other types of AMGs were not detected (Supplementary Data 12). However, the absence of detected expression does not imply that these AMGs are inactive in situ, as it is challenging to recover sufficient RNA from deep-sea samples, and RNA may be lost during deep-sea sampling or extraction. This enrichment of AMGs in phosphorus regulation and transport were consistent with the pattern observed in their microbial hosts, which had the highest abundance of these genes across global cold seeps (Fig. 3a). This uniformity suggests that cold seep viruses and their microbial hosts are likely under the same selective pressure (e.g., phosphate bioavailability) to retain these genes for phosphorus utilization, as seen with other nutrient cycling processes such as nitrogen and sulfur^75,76. Multiple lines of evidence suggest that, in phosphorus-limited oceanic water and various terrestrial ecosystems^66,72,77, viruses augment their hosts’ phosphorus acquisition by encoding AMGs related to phosphatases like phoD and phoN. However, in phosphorus-rich cold seep environments, the dominance of AMGs for the PhoR-PhoB system and PhnCDE transporter implies that phosphorus regulation and transport are more crucial than metabolic processes for phosphate release.

To explore the potential viral impacts on phosphorus cycling at cold seeps, we investigated virus-host linkages, via iPHoP that aggregates six different methods for host predictions along with CRISPR spacer matching⁷⁸. Approximate 17.58% (32/182) of the phosphorus-cycling vContigs identified could be associated with a total of 81 hosts (Supplementary Data 13). These hosts encompassed a wide range of diversity, spanning 20 bacterial classes and 6 archaeal classes (order level for ANME-1; Fig. 5d). While it is established that known hosts for phosphorus-cycling viruses are predominantly bacteria⁶⁶, our results revealed a broader spectrum of hosts in archaeal lineages. Specifically, we identified five distinct Caudoviricetes viruses infected ecologically important archaea ANME-1 (Fig. 5d), functioning as methanotrophs in many methane-rich ecosystems⁷. These viruses potentially modulated the phosphorus metabolism of ANME-1 by containing phoB, phnC, and phnK (Supplementary Data 10 and 13). Of particular significance is the involvement of PhnK, a component of C-P lyase, in mediating methylphosphonate-driven methane formation⁶¹.

Implications of PLM-based approaches for phosphorus cycling in cold seep

Traditional homology-based annotations can only capture sequence space conserved to some degree, leaving a high number of genes unannotated and therefore ‘hidden’ from our sight. Here we show the utility of alignment-free PLMs to improve the functional prediction of phosphorus-cycling proteins within large-scale cold seep metagenomic datasets. The LucaPCycle model enabled the characterization of previously unrecognized alkaline phosphatase families featured by a streamlined core architecture ___domain and a unique C-terminal β-sheet ___domain. The discovery of these enzyme families highlights the hidden diversity of phosphorus-cycling microbial communities and their overlooked ecological functions. The presence of diverse phosphoesters and phosphonates, along with their intermediate metabolites, and the high abundance and expression of phosphorus-cycling genes, all suggest that phosphorus cycling plays a significant yet often underappreciated role in the biogeochemical cycling within cold seep environments. Our study also expanded the taxonomic distribution of phosphorus-cycling genes among the Archaea ___domain. We highlight the unexpected roles of archaea in polyphosphate metabolism, mineral phosphorus solubilization, and both phosphonate substrate-specific and broad-specificity catabolism. Notably, ANME methanotrophs possess the genetic equipment for methylphosphonate-driven methane production via complete C-P lyase operons. Additionally, cold seep viruses possess the genetic capacity to regulate microbial phosphorus-cycling processes primarily through the PhoR-PhoB regulatory system and the PhnCDE transporter. This study underscores the critical need for PLM-based methodologies to uncover the hidden biological functions of these genes, enhancing our understanding of microbially-mediated phosphorus-cycling and its ecological significance in diverse ecosystems. Furthermore, the deep-sea sedimentary ecosystems, particularly beyond ocean water, represents a significant but often overlooked component in global phosphorus cycling.

Methods

Architecture of LucaPCycle

LucaPCycle is a transformer-based dual-channel model (Fig. 1a). The first stage (binary-classification) predicts if a protein sequence has a phosphorus-cycling function. In the second stage (31-classification), it classifies the positive predictions from the first stage into one of 31 specific types. These 31 phosphorus-cycling proteins participate in various phosphorus metabolisms, including phosphoester hydrolysis, phytate hydrolysis, broad-specificity and substrate-specific phosphonate catabolism, polyphosphate metabolism, mineral phosphorus solubilization, phosphonate transport, low- and high-affinity phosphate transport, and two-component systems (Supplementary Data 1). LucaPCycle consists of five modules: Input, Tokenizer, Encoder, Pooler, and Output. Below are descriptions of each module:

(1)
Input: This module receives the amino acid sequence as input.
(2)
Tokenizer: This module performs tokenization of the input amino acid sequence at both character and word levels. Character-level tokenization slices protein sequences into amino acid characters. Word-level tokenization employs the Byte Pair Encoding (BPE) algorithm⁷⁹ from the Subword-NMT tool (https://github.com/rsennrich/subword-nmt) to segment the input sequence into “words”, generating a word-level list.
(3)
Encoder: This module extracts features from the tokenized lists and includes two submodules. One submodule uses the ESM2-3B model³⁹ to extract features from the character-level list, resulting in the representation matrix M1. This submodule captures universal features and internal structural and contextual information from the sequence. The second submodule based on the Transformer-Encoder⁴⁰ to derive features from the word-level list, producing the representation matrix M2.
(4)
Pooler: This module selects essential features and transforms the feature matrix into a vector through down-sampling. Since the Encoder extracts mixed features, especially general ones from ESM2, we use Value-Level Attention Pooling⁸⁰ to assign learnable weights to each feature (Eqs. (1) and (2)).
$${e}_{{ik}}=\sum\limits_{j=1}^{{l}_{i}}{\alpha }_{{jk}}{r}_{{ijk}},\,\sum\limits_{j=1}^{{l}_{i}}{\alpha }_{{jk}}=1$$
(1)
$${e}_{i}=\left[{e}_{i1},{e}_{i2},\ldots,{e}_{{id}}\right]$$
(2)
where, $i$ is the sample index, ${l}_{i}$ is the length of the $i-{th}$ sequence, $d$ is the dimension of the embedding, ${e}_{i}$ is the pooled vector, and $k\in \left\{{{\mathrm{1,2}}},\ldots,d\right\}$, ${r}_{{ijk}}$ is the value of row $j$ and column $k$ of the representation matrix of the $i-{th}$ sequence.
(5)
Output: This module concatenates two pooled vectors into a classification vector. It uses the sigmoid-function for binary-classification to generate the positive probability and the Softmax-function for 31-classification to output probabilities for each class.

Dataset for LucaPCycle

We initially constructed a dataset with 214,193 positive samples of 31 phosphorus-cycling proteins and 3,832,374 negative samples. The positive samples are sourced from the PCycDB that represents a comprehensive and accurate database for analyzing phosphorus-cycling genes and microorganisms³⁰. The number of sequences for each phosphorus-cycling protein detailed in Supplementary Fig. 1a. To maintain a reasonable ratio of positive to negative samples, we used CD-HIT (v4.8.1)⁸¹ to reduce the redundancy of the negative samples with 50% amino acid identity. The rationale for retaining redundancy in the positive sequences is provided in Supplementary Note 2, Supplementary Table 1, Supplementary Figs. 17–18. The clustering process resulted in 853,615 non-redundant negative samples and an approximate positive/negative sample ratio of 1:4 for model building. The negative samples encompass protein sequences involved in intracellular phosphorus metabolism (n = 39,048), nitrogen cycling (n = 9248) and sulfur cycling (n = 62,937). To ensure a diverse set of negative samples, we also incorporated other unrelated functional proteins from KEGG (n = 68,498) and UniProt (n = 673,884), such as protein sequences from viral capsid, DNA replication, RNA transcription, carbon metabolism, signal transduction, and secondary metabolism. This diversity in negative samples aims to enhance the model’s annotation performance and generalization ability. The detailed numbers for each category of negative samples are shown in Supplementary Fig. 1b.

Model building of LucaPCycle

For binary-classification model, all positive samples (n = 214,193) and non-redundant negative samples (n = 853,615, 50% identity) were randomly divided into training, validation, and testing datasets with a ratio of 8:1:1 (Supplementary Fig. 1c). Regarding 31-classification model, the dataset consists of all positive samples without negatives. The sequences of each phosphorus-cycling protein were randomly divided into training, validation, and testing datasets with a ratio of 8:1:1.

The first channel of LucaPCycle applied ESM2-3B (https://github.com/facebookresearch/esm) for contextual representation. The second channel used a 4-layers with 8-heads Transformer-Encoder to characterize the raw sequence. The model training lasted for a maximum of 50 epochs based on the best F1-score in validation dataset for model finalization. We used the AdamW optimizer with betas (0.9, 0.98) and a peak learning rate of 2E-4, incorporating a linear warm-up schedule throughout the learning rate updates. We employed a batch size of 8 with a gradient accumulation step of 8 for the binary-classification model and a batch size of 16 with a gradient accumulation step of 4 for the 31-classification model. The model was trained on 4 Nvidia A100 GPUs.

Model evaluation and benchmarking

We evaluated the binary-classification model and the 31-class classification model using six metrics based on validation and testing datasets, i.e., accuracy, precision (macro), recall (macro), F1-Score (macro), AUC (macro, one-vs-rest), and PR-AUC (macro) (Fig. 1b). To assess the impact of positive sequence redundancy, modeling channels and dataset partition ratios on the performance of the LucaPCycle model, we conducted a comprehensive ablation study. Specifically, the ablation study evaluated different modeling channels and two dataset partition ratios (8:1:1 and 6:2:2) in both binary-classification and 31-class classification. Furthermore, we compared the effects of six levels of positive sequence redundancy clustered using CD-HIT (v4.8.1)⁸¹ (50%, 60%, 70%, 80%, 90%, and 100%) on binary-classification. The detailed results are provided in Supplementary Information (Supplementary Note 3, Supplementary Figs. 19–20). Considering the presence of less frequent classes in the 31-class classification, we conducted a detailed error analysis to thoroughly evaluate the model’s performance on the 31 phosphorus-cycling categories (Supplementary Note 4 and Supplementary Tables 2–3).

To further benchmark LucaPCycle, we initially compared it against Diamond Blastp, KofamScan and PLMSearch³². However, due to significant overlap between the sequences predicted by PLMSearch and KofamScan, PLMSearch was ultimately excluded from the comparison. Detailed results are provided in Supplementary Note 5. The benchmarking dataset was predicted from the non-redundant gene (n = 147,289,169) and genome (n = 1878) catalogs from global cold seeps, comprising a total of 151,187,265 sequences (Supplementary Figs. 21–22). To annotate these unknown sequences across 31 categories of phosphorus-cycling genes, we employed Diamond Blastp, KofamScan, and LucaPCycle. The predictions from all three methods were subsequently subjected to the ___domain-based validation step (DeepFRI, CLEAN, ECOD) described in detail below. Despite passing validation, the accuracy of these predicted results remains uncertain, as the same ___domain may occur in a variety of unrelated proteins. To enhance the reliability of the benchmark and reduce false positives, we defined the true positives as sequences predicted by at least two of the three methods (Diamond Blastp, KofamScan, and LucaPCycle). This process resulted in a total of 1,521,958 true positive sequences used for benchmarking (Supplementary Fig. 23). Since the sequences in the benchmark dataset can also be captured by homology-based searches, they are considered homologous. The sequence similarity between positive sequences predicted by LucaPCycle and the training dataset was evaluated, with detailed results provided in Supplementary Note 6 and Supplementary Fig. 24. Finally, the verified accuracy, precision, and recall were subsequently calculated using Eqs. (3)−(7).

$${True\; Positives}= ({{Pred\; Positives}}_{{LucaPCycle}}\cap {{Pred\; Positives}}_{{Blastp}}) \\ \cup ({{Pred\; Positives}}_{{LucaPCycle}}\cap {{Pred\; Positives}}_{{KofamScan}}) \\ \cup ({{Pred\; Positives}}_{{Blastp}}\cap {{Pred\; Positives}}_{{KofamScan}})$$

(3)

$${Verified\; Accuracy}=\frac{{Verified\; Positives}}{{Predicted\; Positives}}$$

(4)

$${Verified}\,{{\&}} \, {Hit\; Positives}=\left({Predicted\; Positives}\right)\cap \left({True\; Positives}\right)$$

(5)

$${Precision}=\frac{{Verified}\,{{\&}}\,{Hit\; Positives}}{{Predicted\; Positives}}$$

(6)

$${Recall}=\frac{{Verified}\,{{\&}}\,{Hit\; Positives}}{{True\; Positives}}$$

(7)

Metagenomic and metatranscriptomic datasets and processing

The 165 metagenomes and 33 metatranscriptomes analyzed in this study were sourced from sediment samples (sediment depths: 0–68.55 mbsf; water depths 860–3005 m) collected at 16 globally distributed cold seep sites (Supplementary Data 14). These samples encompass five types of cold seeps, namely gas hydrates (n = 39), mud volcanoes (n = 7), asphalt volcanoes (n = 7), oil and gas seeps (n = 15) and methane seeps (n = 96). Metagenomic datasets from QDN-W07 derived from sediment cores sampled in 2021 and 2023, along with those from active and extinct Haima cold seeps, were used to investigate the temporal dynamics of phosphorus-cycling genes. All metagenomic and metatranscriptomic datasets used in this study were sourced from our previous publications described in detail elsewhere^82,83,84.

Non-redundant gene and genome catalogs were constructed as described in our previous publication⁸². In brief, metagenomic raw reads were quality-controlled and assembled into contigs. The protein-coding sequences were predicted from contigs using Prodigal v2.6.3 (parameter: -meta)⁸⁵, and then clustered at 95% amino acid identity using CD-HIT (v4.8.1)⁸¹. This process yielded a non-redundant gene catalog (n = 147,289,169). MAGs were recovered from contigs with the length longer than 1 kb using six binning tools, including MetaBAT2⁸⁶, MaxBin2⁸⁷, CONCOCT⁸⁸, SemiBin⁸⁹, VAMB⁹⁰ and Rosella (https://github.com/rhysnewell/rosella). All produced MAGs were dereplicated at 95% average nucleotide identity using dRep v3.4.0 (parameters: -comp 50 -con 10)⁹¹, resulting in 1878 representative MAGs. Completeness and contamination of MAGs were evaluated using CheckM2 (v1.0.1)⁹². Functional annotation of MAGs was carried out by searching against KEGG, Pfam, MEROPS and dbCAN databases using DRAM (v1.3.5)⁹³.

With regard to 33 metatranscriptomes, raw reads were quality filtered using the Read_qc module within metaWRAP (v1.3.2)⁹⁴. The removal of ribosomal RNAs was conducted using sortmeRNA v2.1 with default parameters⁹⁵.

Identification of phosphorus-cycling proteins based on LucaPCycle and two homology-based methods

We used the trained LucaPCycle to predict 31 phosphorus-cycling proteins from the protein sequences of non-redundant gene catalog and MAGs. Detailed predictions for each protein are shown in Supplementary Fig. 1d. Additionally, the same datasets were also functionally annotated using two conventional methods, blast- and HMM-based approaches. For the blast-based approach, we used Diamond Blastp program (v2.0.15)⁹⁶ with the filtering criteria of identity ≥ 30.0% and hit length ≥ 25 amino acids, as suggested by Zeng et al. (2022)³⁰. For the HMM-based approach, we used pre-computed HMM profiles of KEGG ortholog via KofamScan (v1.3.0)³¹. The significant hits related to 31 phosphorus-cycling genes were selected according to KofamScan’s adaptive score threshold and an E-value threshold of 1E-5.

To ensure more precise and reliable results, all protein sequences identified by LucaPCycle, Diamond Blastp, and KofamScan were merged, followed by validation through three distinct methods: ___domain analysis, DeepFRI v1.0.0 (Deep Functional Residue Identification)⁴¹, and CLEAN v1.0.1 (Contrastive Learning-Enabled Enzyme Annotation)⁴². With regard to ___domain analysis, protein sequences were queried against full ECOD ___domain database (v.20231128)⁹⁷ using Diamond Blastp program (v2.0.15)⁹⁶ with an E-value threshold of 1E-2. ECOD is a hierarchical ___domain database that describes the evolutionary relationships between pairs of protein domains. A sequence was considered as bona fide phosphorus-cycling protein if supported by at least one of these validation processes. The cross-checked results were used for benchmark test.

To reveal phylogenetic diversity, clustering was conducted on the validated protein sequences using CD-HIT v4.8.1 (parameters: 70% sequence identity, 80% alignment coverage)⁸¹ to select representative ones. At this step, all singleton clusters were removed, resulting in 321,789 non-singleton protein families associated with phosphorus-cycling proteins.

Geochemical and metabolomics analyses

The porewater samples from Qiongdongnan cold seep were used for the following geochemical analyses. Total alkalinity (TA) was analyzed onboard by direct titration with an HCl standard solution. The pH of the porewater was monitored using a calibrated pH meter. Phosphate (PO₄³⁻) concentrations were determined photometrically using a UV-Vis spectrophotometer (Hitachi U5100, Hitachi Limited, Japan). Other cation-anion concentrations (SO₄^2-, K⁺, Na⁺, Mg²⁺ and Ca²⁺) were measured using an ion chromatography system (ICS-1100, Thermo, United States) with conductance detector. The total organic carbon (TOC) and total nitrogen (TN) of bulk sediment samples were quantified on an elemental analyzer-isotope ratio mass spectrometer (Vario ISOPOTE Cube-Isoprime, Elementar, Germany).

The untargeted metabolomics analyses were performed on 55 sediment samples collected from the Qiongdongnan and Shenhu cold seeps⁹⁸. For metabolite extraction, approximately 50 mg of each sediment sample was mixed with a solution of methanol, acetonitrile, and water in a 2:2:1 volume ratio. The samples were vortexed for 30 s, ground with steel beads for 10 min at 45 Hz, and ultrasonicated for 10 min in an ice bath. High-resolution LC-MS/MS analysis was performed on a Waters Acquity I-Class PLUS ultra-high performance liquid tandem with a Waters Xevo G2-XS QTOF high-resolution mass spectrometer. Data collection was performed in MSe mode and processed using Progenesis QI software, utilizing the METLIN database and Biomark’s proprietary library for metabolite identification.

Taxonomy assignment of bacteria and archaea

To obtain the taxonomic profiles for microbial communities driving the process of phosphorus cycling, all sequences from non-singleton phosphorus-cycling protein families were taxonomically assigned using the MMseqs2 taxonomy tool v13.45111 (parameter: --tax-lineage 1)⁹⁹. This involved performing six-frame translation searches against the GTDB R207 database and assigning each protein sequence to the lowest common ancestor of the best hits for each frame.

The taxonomic classification of MAGs encoding phosphorus-cycling proteins was conducted using GTDB-Tk toolkit (v2.1.1)¹⁰⁰ with default parameters against the GTDB R207 database. The assignment of MAG FR_S4_sbin harboring both McrA and PhnJ were further confirmed by the visual inspection of taxonomic trees. Briefly, the concatenated multiple sequence alignment of the MAG FR_S4_sbin and reference ANME genomes accessed from NCBI GenBank were produced based on 43 conserved single-copy genes extracted by CheckM (v1.0.12)¹⁰¹. The maximum likelihood phylogenomic tree was generated using RAxML (v8.2.12)¹⁰² with PROTCATLG model and 1000 bootstrap replicates.

Abundance calculations

The gene abundances of non-singleton phosphorus-cycling protein families across 165 metagenomes were quantified using Salmon v.1.9.0 in mapping-based mode (parameters: -validateMappings -meta)¹⁰³. GPM (genes per million) values were used as a proxy for gene abundance. Similarly, transcript abundances of non-singleton phosphorus-cycling protein families were determined by mapping clean reads from 33 metatranscriptomes to the non-redundant gene catalog using Salmon v.1.9.0 with the aforementioned parameters¹⁰³. Transcript abundances were calculated as TPM (transcripts per million). It is worth noting that all sequences from non-singleton phosphorus-cycling protein families were considered in these calculations.

At the genome level, the relative abundance of each MAG was profiled by mapping quality-trimmed reads from the cold seep metagenomes against the MAGs using CoverM in genome mode (https://github.com/wwood/CoverM) (v0.6.1; parameters: -min-read-percent-identity 0.95 -min-read-aligned-percent 0.75 -trim-min 0.10 -trim-max 0.90 -m relative_abundance).

Protein structure alignments, ___domain classification and molecular docking

Three-dimensional structures of identified phosphorus-cycling proteins were predicted using the deep-learning artificial intelligence system, via AlphaFold3 web server (May 2024 to June 2024)¹⁰⁴. The confidence in the accuracy of the predicted structures was measured by the predicted template modeling (pTM) score. All remote ALP structures exhibited pTM score above 0.5 and 76% of the structures achieving very high confidence scores (pTM > 0.9; Supplementary Data 15). A pTM score above 0.5 means the overall predicted fold for the protein might be similar to the true structure. For PhnGHIJ complex, the interface predicted template modeling (ipTM) score was used to measure the accuracy of the predicted relative positions of the subunits within this complex.

The paired structural alignments were carried out using Foldseek (v8.ef4e960)³³ in easy-search mode to align the structures of remote alkaline phosphatases against a library of reference structures drawn from PDB database and manually curated AlphaFold models (Supplementary Data 16). The domains of remote ALPs were analyzed with FoldSeek (v8.ef4e960)³³ by comparing to domains defined in the ECOD ___domain databases (v.20231128)⁹⁷.

Ledock (v1.0; https://www.lephar.com/) was used to predict the binding poses of 89 phosphoester compounds on remote ALP (RMSD: 1.0, number of binding conformations: 20). The binding box of each predicted protein was performed via the DockingPie1 plugin within PyMOL (docking program: Autodock Vina)¹⁰⁵. All of the structures were visualized and exported as images using PyMOL.

Phylogenetic analysis

To facilitate a clear and intuitive phylogenetic tree, we used remote ALPs representative sequences from non-singleton phosphorus-cycling protein families. For sequence-based tree, the sequences of remote ALPs were aligned with corresponding reference sequences using MUSCLE v3.8.1551 (default settings)¹⁰⁶. The alignments were subjected to phylogenetic tree reconstruction using FastTree v2.1.11 with JC + CAT models¹⁰⁷. Structures of remote ALPs derived from AlphaFold 3 modeling, along with reference structures of well-characterized ALPs were subjected to structural tree building using Foldtree (https://github.com/DessimozLab/fold_tree), based on a local structural alphabet. All trees were visualized using iTOL (v6)¹⁰⁸.

Identification of viral genomes-containing phosphorus-cycling genes

The assembled contigs with length longer than 10 kb were screened by geNomad pipeline v1.3.3 (parameter: genomad end-to-end)⁶⁷ for viral prediction. The completeness and contamination of identified viral genomes were determined by CheckV v1.01 (database v1.5)¹⁰⁹. Only viral genome-containing phosphorus-cycling genes were included for downstream analyzes. The taxonomy each of phosphorus-cycling viral genome was assigned by geNomad (v1.3.3)⁶⁷. However, only a few viral genomes could be assigned to specific viral families or genera. Thus, the finer classifications of phosphorus-cycling viral genomes were alternatively performed by PhaGCN v2.0 (cut-off score > 0.5)⁶⁸. Phage genes and hallmark genes were identified by DRAM-v (v1.3.5)⁹³ and geNomad (v1.3.3)⁶⁷. The putative virus-host linkages were predicted using iPHoP (v1.3.3)⁷⁸ and CRISPR spacer matching. For the iPHoP, both default database and custom MAGs from our 165 metagenomes were used to maximize host prediction. For CRISPR spacer matching, the CRISPR arrays were identified using CRISPRCasFinder (v4.2.21)¹¹⁰ from our custom MAGs. Local alignments of extracted spacers with lengths greater than 25 bp against viral genomes were searched using BLASTn. Only BLAST matches with 100% alignment coverage and at most two mismatches were considered as high-confidence matches.

Statistical analysis

All statistical analyses were performed in R (v4.1.0). The normality and variance homogeneity of data were evaluated using Shapiro-Wilk test and Levene’s test, respectively. The two-sided Kruskal-Wallis rank-sum test was employed to compare gene abundance among different phosphorus metabolic processes. Correlation network analyses of environmental factors and phosphorus-cycling gene abundance were based on Spearman’s correlation. Pearson’s correlation and linear regression were performed to assess the relationship between PO₄³⁻ concentrations and sediment depth. The alpha diversity (Chao1 index) of phosphorus-cycling microbial communities was calculated using the ‘vegan’ package. Variations in alpha diversity were evaluated using the two-sided Wilcoxon rank sum test for different depths and timescales, and the two-sided Kruskal-Wallis rank sum test for different types of cold seep. The bata diversity of phosphorus-cycling microbial communities was estimated using the partial least squares discrimination analysis (PLS-DA) with the ‘mixOmics’ package. PERMANOVA (Permutational Multivariate Analysis of Variance; permutations = 999) tests were used to calculate the statistical differences across different depths, timescales and types of cold seeps. Both alpha and beta diversity indices were calculated based on functional profiles.

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.

Data availability

The raw data and the processed dataset used for model building as well as the results from three functional annotation tools (i.e., LucaPCycle, Blastp and KofamScan) can be accessed at Zenodo (https://doi.org/10.5281/zenodo.15334764). The non-redundant gene catalog derived from 165 cold seep metagenomes can be found in Figshare (https://doi.org/10.6084/m9.figshare.22568107). The non-redundant MAGs catalog has been deposited in the NCBI database under BioProject ID PRJNA950938. The metabolomics raw data have been deposited in the OMIX, China National Center for Bioinformation under accession number OMIX010049. Additional data produced in this study including representatives from non-singleton phosphorus-cycling protein families, phylogenetic tree files based on amino acid sequences and protein structures of ALPs along with spectral data are available at Zenodo (https://doi.org/10.5281/zenodo.15334764). All additional data supporting the findings of this study are provided within the article and its Supplementary Information Files. Source data are provided in this paper. Source data are provided with this paper.

Code availability

The original source code for LucaPCycle is available at LucaPCycle GitHub (https://github.com/LucaOne/LucaPCycle) or LucaPCycle Zenodo (https://doi.org/10.5281/zenodo.15334764)¹¹¹. See README on the LucaPCycle Zenodo (https://doi.org/10.5281/zenodo.15334764) for a detailed description of the code.

References

Joye, S. B. The geology and biogeochemistry of hydrocarbon seeps. Annu. Rev. Earth Planet. Sci. 48, 205–231 (2020).
Article ADS CAS Google Scholar
Feng, D. et al. Cold seep systems in the South China Sea: an overview. J. Asian Earth Sci. 168, 3–16 (2018).
Article ADS Google Scholar
Chen, S.-C. et al. Anaerobic oxidation of ethane by archaea from a marine hydrocarbon seep. Nature 568, 108–111 (2019).
Article ADS CAS PubMed Google Scholar
Dong, X. et al. Thermogenic hydrocarbon biodegradation by diverse depth-stratified microbial populations at a Scotian Basin cold seep. Nat. Commun. 11, 5825–5825 (2020).
Article ADS CAS PubMed PubMed Central Google Scholar
Laso-Pérez, R. et al. Anaerobic degradation of non-methane alkanes by “Candidatus Methanoliparia” in hydrocarbon seeps of the Gulf of Mexico. mBio 10, e01814–e01819 (2019).
Article PubMed PubMed Central Google Scholar
Boetius, A. et al. A marine microbial consortium apparently mediating anaerobic oxidation of methane. Nature 407, 623–626 (2000).
Article ADS CAS PubMed Google Scholar
Chadwick, G. L. et al. Comparative genomics reveals electron transfer and syntrophic mechanisms differentiating methanotrophic and methanogenic archaea. PLoS Biol 20, e3001508 (2022).
Article PubMed PubMed Central Google Scholar
Duhamel, S. et al. Phosphorus as an integral component of global marine biogeochemistry. Nat. Geosci. 14, 359–368 (2021).
Article ADS CAS Google Scholar
Walton, C. R. et al. Phosphorus availability on the early Earth and the impacts of life. Nat. Geosci. 16, 399–409 (2023).
Article ADS CAS Google Scholar
Dong, X. et al. Phylogenetically and catabolically diverse diazotrophs reside in deep-sea cold seep sediments. Nat. Commun. 13, 4885 (2022).
Article ADS CAS PubMed PubMed Central Google Scholar
Chen, Q. et al. The formation of authigenic phosphorus minerals in cold-seep sediments from the South China Sea: Implications for carbon cycling below the sulfate-methane transition. Mar. Pet. Geol. 155, 106425 (2023).
Article CAS Google Scholar
Liu, J. et al. Vivianite formation in methane-rich deep-sea sediments from the South China Sea. Biogeosciences 15, 6329–6348 (2018).
Article ADS CAS Google Scholar
Egger, M. et al. Anaerobic oxidation of methane alters sediment records of sulfur, iron and phosphorus in the Black Sea. Biogeosciences 13, 5333–5355 (2016).
Article ADS CAS Google Scholar
Zhou, J. et al. Phosphorus species in deep-sea carbonate deposits: implications for phosphorus cycling in cold seepPhosphorus species in deep-sea carbonate deposits: implications for phosphorus cycling in cold seep environments. Minerals 10, 645 (2020).
Article ADS CAS Google Scholar
Li, H.-P. et al. Roles of phosphate-solubilizing bacteria in mediating soil legacy phosphorus availability. Microbiol. Res. 272, 127375 (2023).
Article CAS PubMed Google Scholar
Dai, Z. et al. Long-term nutrient inputs shift soil microbial functional profiles of phosphorus cycling in diverse agroecosystems. ISME J 14, 757–770 (2020).
Article CAS PubMed Google Scholar
Liang, J.-L. et al. Novel phosphate-solubilizing bacteria enhance soil phosphorus cycling following ecological restoration of land degraded by mining. ISME J 14, 1600–1613 (2020).
Article CAS PubMed PubMed Central Google Scholar
Liu, S. et al. Antimony efflux underpins phosphorus cycling and resistance of phosphate-solubilizing bacteria in mining soils. ISME J. 17, 1278–1289 (2023).
Article CAS PubMed PubMed Central Google Scholar
Lamarche, M. G. et al. The phosphate regulon and bacterial virulence: a regulatory network connecting phosphate homeostasis and pathogenesis. FEMS Microbiol. Rev. 32, 461–473 (2008).
Article CAS PubMed Google Scholar
Srivastava, A. et al. Enzyme promiscuity in natural environments: alkaline phosphatase in the ocean. ISME J. 15, 3375–3383 (2021).
Article CAS PubMed PubMed Central Google Scholar
Meeds, J. A. et al. Phosphorus deficiencies invoke optimal allocation of exoenzymes by ectomycorrhizas. ISME J. 15, 1478–1489 (2021).
Article CAS PubMed PubMed Central Google Scholar
Luo, H. et al. Subcellular localization of marine bacterial alkaline phosphatases. Proc. Natl. Acad. Sci. USA. 106, 21219–21223 (2009).
Article ADS CAS PubMed PubMed Central Google Scholar
Gessler, N. N. et al. Phytases and the prospects for their application (Review). Appl. Biochem. Microbiol. 54, 352–360 (2018).
Article CAS Google Scholar
Lockwood, S. et al. Global and seasonal variation of marine phosphonate metabolism. ISME J. 16, 2198–2212 (2022).
Article CAS PubMed PubMed Central Google Scholar
Murphy, A. R. J. et al. Transporter characterisation reveals aminoethylphosphonate mineralisation as a key step in the marine phosphorus redox cycle. Nat. Commun. 12, 4554 (2021).
Article ADS CAS PubMed PubMed Central Google Scholar
Brown, M. R. W. & Kornberg, A. The long and short of it - polyphosphate, PPK and bacterial survival. Trends Biochem. Sci. 33, 284–290 (2008).
Article CAS PubMed Google Scholar
Chugh, S. et al. Polyphosphate kinase-1 regulates bacterial and host metabolic pathways involved in pathogenesis of Mycobacterium tuberculosis. Proc. Natl. Acad. Sci. USA. 121, e2309664121 (2024).
Article PubMed PubMed Central Google Scholar
Siles, J. A. et al. Distribution of phosphorus cycling genes across land uses and microbial taxonomic groups based on metagenome and genome mining. Soil Biol. Biochem. 174, 108826 (2022).
Article CAS Google Scholar
Botero, L. M., Al-Niemi, T. S. & McDermott, T. R. Characterization of two inducible phosphate transport systems in Rhizobium tropici. Appl. Environ. Microbiol. 66, 15–22 (2000).
Article ADS CAS PubMed PubMed Central Google Scholar
Zeng, J. et al. PCycDB: a comprehensive and accurate database for fast analysis of phosphorus cycling genes. Microbiome 10, 101 (2022).
Article CAS PubMed PubMed Central Google Scholar
Aramaki, T. et al. KofamKOALA: KEGG Ortholog assignment based on profile HMM and adaptive score threshold. Bioinformatics 36, 2251–2252 (2020).
Article CAS PubMed Google Scholar
Liu, W. et al. PLMSearch: protein language model powers accurate and fast sequence search for remote homology. Nat. Commun. 15, 2775 (2024).
Article ADS CAS PubMed PubMed Central Google Scholar
Barrio-Hernandez, I. et al. Clustering predicted structures at the scale of the known protein universe. Nature 622, 637–645 (2023).
Article ADS CAS PubMed PubMed Central Google Scholar
Durairaj, J. et al. Uncovering new families and folds in the natural protein universe. Nature 622, 646–653 (2023).
Article ADS CAS PubMed PubMed Central Google Scholar
Pavlopoulos, G. A. et al. Unraveling the functional dark matter through global metagenomics. Nature 622, 594–602 (2023).
Article ADS CAS PubMed PubMed Central Google Scholar
Hou, X. et al. Using artificial intelligence to document the hidden RNA virosphere. Cell 187, 6929–6942.e6916 (2024).
Article CAS PubMed Google Scholar
Flamholz, Z. N., Biller, S. J. & Kelly, L. Large language models improve annotation of prokaryotic viral proteins. Nat. Microbiol. 9, 537–549 (2024).
Article PubMed PubMed Central Google Scholar
Ma, Y. et al. Identification of antimicrobial peptides from the human gut microbiome using deep learning. Nat. Biotechnol. 40, 921–931 (2022).
Article CAS PubMed Google Scholar
Lin, Z. et al. Evolutionary-scale prediction of atomic-level protein structure with a language model. Science 379, 1123–1130 (2023).
Article ADS MathSciNet CAS PubMed Google Scholar
Vaswani, A. et al. Attention is all you need. arXiv https://doi.org/10.48550/arXiv.1706.03762 (2017).
Gligorijević, V. et al. Structure-based protein function prediction using graph convolutional networks. Nat. Commun. 12, 3168 (2021).
Article ADS PubMed PubMed Central Google Scholar
Yu, T. et al. Enzyme function prediction using contrastive learning. Science 379, 1358–1363 (2023).
Article ADS CAS PubMed Google Scholar
Edgar, R. C. UPARSE: highly accurate OTU sequences from microbial amplicon reads. Nat. Methods 10, 996–998 (2013).
Article CAS PubMed Google Scholar
Torcello-Requena, A. et al. A distinct, high-affinity, alkaline phosphatase facilitates occupation of P-depleted environments by marine picocyanobacteria. Proc. Natl. Acad. Sci. USA 121, e2312892121 (2024).
Article CAS PubMed PubMed Central Google Scholar
Lidbury, I. D. E. A. et al. A widely distributed phosphate-insensitive phosphatase presents a route for rapid organophosphorus remineralization in the biosphere. Proc. Natl. Acad. Sci. USA 119, e2118122119 (2022).
Article CAS PubMed PubMed Central Google Scholar
Lau, A.M. et al. Exploring structural diversity across the protein universe with the encyclopedia of domains. bioRxiv https://doi.org/10.1101/2024.03.18.585509 (2024).
Rodriguez, F. et al. Crystal structure of the Bacillus subtilis phosphodiesterase PhoD reveals an iron and calcium-containing active site. J. Biol. Chem. 289, 30889–30899 (2014).
Article CAS PubMed PubMed Central Google Scholar
Wu, X. et al. Genome-resolved metagenomics reveals distinct phosphorus acquisition strategies between soil microbiomes. mSystems 7, e01107–e01121 (2022).
Article CAS PubMed PubMed Central Google Scholar
Oliverio, A. M. et al. The role of phosphorus limitation in shaping soil bacterial communities and their metabolic capabilities. mBio 11, e01718-20 (2020).
Ruttenberg, K. C. Treatise on Geochemistry 2nd edn, Vol. 9144 (Elsevier, Oxford; 2014).
Karl, D. M. & Björkman, K. M. In Biogeochemistry Of Marine Dissolved Organic Matter (eds. D.A. Hansell & C.A. Carlson) 233−334 (Academic Press, 2015).
Li, J. et al. Phytate as a phosphorus nutrient with impacts on iron stress-related gene expression for phytoplankton: insights from the diatom phaeodactylum tricornutum. Appl. Environ. Microbiol. 88, e0209721 (2022).
Article PubMed Google Scholar
Shukla, S. K., Hariharan, S. & Rao, T. S. Uranium bioremediation by acid phosphatase activity of Staphylococcus aureus biofilms: can a foe turn a friend? J. Hazard. Mater. 384, 121316 (2020).
Article CAS PubMed Google Scholar
Lidbury, I. D. E. A. et al. Comparative genomic, proteomic and exoproteomic analyses of three Pseudomonas strains reveals novel insights into the phosphorus scavenging capabilities of soil bacteria. Environ. Microbiol. 18, 3535–3549 (2016).
Article CAS PubMed PubMed Central Google Scholar
Müller, A. L. et al. Phylogenetic and environmental diversity of DsrAB-type dissimilatory (bi)sulfite reductases. ISME J. 9, 1152–1165 (2015).
Article PubMed Google Scholar
Knittel, K. & Boetius, A. Anaerobic oxidation of methane: progress with an unknown process. Annu. Rev. Microbiol. 63, 311–334 (2009).
Article CAS PubMed Google Scholar
Paula, F. S. et al. The potential for polyphosphate metabolism in Archaea and anaerobic polyphosphate formation in Methanosarcina mazei. Sci. Rep. 9, 17101 (2019).
Article ADS PubMed PubMed Central Google Scholar
Castelle, C. J. et al. Biosynthetic capacity, metabolic variety and unusual biology in the CPR and DPANN radiations. Nat. Rev. Microbiol. 16, 629–645 (2018).
Article CAS PubMed Google Scholar
Zhang, H. et al. Polyphosphate kinase 1, a conserved bacterial enzyme, in a eukaryote, Dictyostelium discoideum, with a role in cytokinesis. Proc. Natl. Acad. Sci. USA. 104, 16486–16491 (2007).
Article ADS CAS PubMed PubMed Central Google Scholar
Boden, J. S. et al. Timing the evolution of phosphorus-cycling enzymes through geological time using phylogenomics. Nat. Commun. 15, 3703 (2024).
Article ADS CAS PubMed PubMed Central Google Scholar
von Arx, J. N. et al. Methylphosphonate-driven methane formation and its link to primary production in the oligotrophic North Atlantic. Nat. Commun. 14, 6529 (2023).
Article ADS Google Scholar
Sosa, O. A. et al. Phosphate-limited ocean regions select for bacterial populations enriched in the carbon-phosphorus lyase pathway for phosphonate degradation. Environ. Microbiol. 21, 2402–2414 (2019).
Article CAS PubMed PubMed Central Google Scholar
Wang, Y. et al. Expanding anaerobic alkane metabolism in the ___domain of Archaea. Nat. Microbiol. 4, 595–602 (2019).
Article CAS PubMed Google Scholar
Dong, X. et al. Metabolic potential of uncultured bacteria and archaea associated with petroleum seepage in deep-sea sediments. Nat. Commun. 10, 1816 (2019).
Article ADS PubMed PubMed Central Google Scholar
Kamat, S. S., Williams, H. J. & Raushel, F. M. Intermediates in the transformation of phosphonates to phosphate by bacteria. Nature 480, 570–573 (2011).
Article ADS CAS PubMed PubMed Central Google Scholar
Liang, J.-L. et al. Hidden diversity and potential ecological function of phosphorus acquisition genes in widespread terrestrial bacteriophages. Nat. Commun. 15, 2827 (2024).
Article ADS CAS PubMed PubMed Central Google Scholar
Camargo, A. P. et al. Identification of mobile genetic elements with geNomad. Nat. Biotechnol. 42, 1303–1312 (2024).
Article CAS PubMed Google Scholar
Jiang, J.-Z. et al. Virus classification for viral genomic fragments using PhaGCN2. Briefings Bioinf 24, bbac505 (2023).
Article Google Scholar
Huang, X. et al. Soil nutrient conditions alter viral lifestyle strategy and potential function in phosphorous and nitrogen metabolisms. Soil Biol. Biochem. 189, 109279 (2024).
Article CAS Google Scholar
Zeng, Q. & Chisholm, SallieW. Marine viruses exploit their host’s two-component regulatory system in response to resource limitation. Curr. Biol. 22, 124–128 (2012).
Article CAS PubMed Google Scholar
Luo, X.-Q. et al. Viral community-wide auxiliary metabolic genes differ by lifestyles, habitats, and hosts. Microbiome 10, 190 (2022).
Article PubMed PubMed Central Google Scholar
Gao, S. et al. Patterns and ecological drivers of viral communities in acid mine drainage sediments across Southern China. Nat. Commun. 13, 2389 (2022).
Article ADS CAS PubMed PubMed Central Google Scholar
Kelly, L. et al. Genetic diversity in cultured and wild marine cyanomyoviruses reveals phosphorus stress as a strong selective agent. ISME J. 7, 1827–1841 (2013).
Article CAS PubMed PubMed Central Google Scholar
Zhao, F. et al. Biochemical and structural characterization of the cyanophage-encoded phosphate-binding protein: implications for enhanced phosphate uptake of infected cyanobacteria. Environ. Microbiol. 24, 3037–3050 (2022).
Article CAS PubMed Google Scholar
Kieft, K. et al. Ecology of inorganic sulfur auxiliary metabolism in widespread bacteriophages. bioRxiv https://doi.org/10.1101/2020.08.24.253096 (2020).
Gazitúa, M. C. et al. Potential virus-mediated nitrogen cycling in oxygen-depleted oceanic waters. ISME J. 15, 981–998 (2020).
Article PubMed PubMed Central Google Scholar
Lunga, P.M. et al. Virus-host interactions reveal potential roles towards phosphorus cycling in South Atlantic Ocean Euphotic Zones. Res. Square https://doi.org/10.21203/rs.3.rs-1021375/v1 (2021).
Roux, S. et al. iPHoP: an integrated machine learning framework to maximize host prediction for metagenome-derived viruses of archaea and bacteria. PLoS Biol. 21, e3002083 (2023).
Article CAS PubMed PubMed Central Google Scholar
Gage, P. A new algorithm for data compression. C. Users J. 12, 23–38 (1994).
Google Scholar
He, Y. et al. KG-MTT-BERT: knowledge graph enhanced BERT for multi-type medical text classification. arXiv https://doi.org/10.48550/arXiv.2210.03970 (2022).
Fu, L. et al. CD-HIT: accelerated for clustering the next-generation sequencing data. Bioinformatics 28, 3150–3152 (2012).
Article CAS PubMed PubMed Central Google Scholar
Han, Y. et al. A comprehensive genomic catalog from global cold seeps. Sci. Data 10, 596 (2023).
Article CAS PubMed PubMed Central Google Scholar
Li, J. et al. Deep sea cold seeps are a sink for mercury and source for methylmercury. Commun. Earth Environ. 5, 324 (2024).
Article Google Scholar
Liang, Q. et al. Microbial succession, community assembly and adaptation over five years in a newly discovered deep-sea cold seep. bioRxiv https://doi.org/10.1101/2024.10.31.619006 (2024).
Hyatt, D. et al. Prodigal: prokaryotic gene recognition and translation initiation site identification. BMC Bioinf 11, 119 (2010).
Article Google Scholar
Kang, D. D. et al. MetaBAT 2: an adaptive binning algorithm for robust and efficient genome reconstruction from metagenome assemblies. PeerJ 7, e7359 (2019).
Article PubMed PubMed Central Google Scholar
Wu, Y. W., Simmons, B. A. & Singer, S. W. MaxBin 2.0: an automated binning algorithm to recover genomes from multiple metagenomic datasets. Bioinformatics 32, 605–607 (2016).
Article CAS PubMed Google Scholar
Alneberg, J. et al. Binning metagenomic contigs by coverage and composition. Nat. Methods 11, 1144–1146 (2014).
Article CAS PubMed Google Scholar
Pan, S. et al. A deep siamese neural network improves metagenome-assembled genomes in microbiome datasets across different environments. Nat. Commun. 13, 2326 (2022).
Article ADS CAS PubMed PubMed Central Google Scholar
Nissen, J. N. et al. Improved metagenome binning and assembly using deep variational autoencoders. Nat. Biotechnol. 39, 555–560 (2021).
Article CAS PubMed Google Scholar
Olm, M. R. et al. dRep: a tool for fast and accurate genomic comparisons that enables improved genome recovery from metagenomes through de-replication. ISME J. 11, 2864 (2017).
Article CAS PubMed PubMed Central Google Scholar
Chklovski, A. et al. CheckM2: a rapid, scalable and accurate tool for assessing microbial genome quality using machine learning. Nat. Methods 20, 1203–1212 (2023).
Article CAS PubMed Google Scholar
Shaffer, M. et al. DRAM for distilling microbial metabolism to automate the curation of microbiome function. Nucleic Acids Res. 48, 8883–8900 (2020).
Article CAS PubMed PubMed Central Google Scholar
Uritskiy, G. V., DiRuggiero, J. & Taylor, J. MetaWRAP—A flexible pipeline for genome-resolved metagenomic data analysis. Microbiome 6, 158 (2018).
Article PubMed PubMed Central Google Scholar
Kopylova, E., Noé, L. & Touzet, H. SortMeRNA: fast and accurate filtering of ribosomal RNAs in metatranscriptomic data. Bioinformatics 28, 3211–3217 (2012).
Article CAS PubMed Google Scholar
Buchfink, B., Reuter, K. & Drost, H.-G. Sensitive protein alignments at tree-of-life scale using DIAMOND. Nat. Methods 18, 366–368 (2021).
Article CAS PubMed PubMed Central Google Scholar
Cheng, H. et al. ECOD: an evolutionary classification of protein domains. PLoS Comput. Biol. 10, e1003926 (2014).
Article PubMed PubMed Central Google Scholar
Han, Y. et al. Phylogenetically and structurally diverse reductive dehalogenases link biogeochemical cycles in deep-sea cold seeps. bioRxiv https://doi.org/10.1101/2024.01.23.576788 (2024).
Mirdita, M. et al. Fast and sensitive taxonomic assignment to metagenomic contigs. Bioinformatics 37, 3029–3031 (2021).
Article CAS PubMed PubMed Central Google Scholar
Chaumeil, P.-A. et al. GTDB-Tk: a toolkit to classify genomes with the genome taxonomy database. Bioinformatics 36, 1925–1927 (2019).
Article PubMed PubMed Central Google Scholar
Parks, D. H. et al. CheckM: assessing the quality of microbial genomes recovered from isolates, single cells, and metagenomes. Genome Res. 25, 1043–1055 (2015).
Article CAS PubMed PubMed Central Google Scholar
Stamatakis, A. RAxML version 8: a tool for phylogenetic analysis and post-analysis of large phylogenies. Bioinformatics 30, 1312–1313 (2014).
Article CAS PubMed PubMed Central Google Scholar
Patro, R. et al. Salmon provides fast and bias-aware quantification of transcript expression. Nat. Methods 14, 417–419 (2017).
Article CAS PubMed PubMed Central Google Scholar
Abramson, J. et al. Accurate structure prediction of biomolecular interactions with AlphaFold 3. Nature 630, 493–500 (2024).
Article ADS CAS PubMed PubMed Central Google Scholar
Rosignoli, S. & Paiardini, A. DockingPie: a consensus docking plugin for PyMOL. Bioinformatics 38, 4233–4234 (2022).
Article CAS PubMed Google Scholar
Edgar, R. C. MUSCLE: multiple sequence alignment with high accuracy and high throughput. Nucleic Acids Res. 32, 1792–1797 (2004).
Article CAS PubMed PubMed Central Google Scholar
Price, M. N., Dehal, P. S. & Arkin, A. P. FastTree: computing large minimum evolution trees with profiles instead of a distance matrix. Mol. Biol. Evol. 26, 1641–1650 (2009).
Article CAS PubMed PubMed Central Google Scholar
Letunic, I. & Bork, P. Interactive tree of life (iTOL) v6: recent updates to the phylogenetic tree display and annotation tool. Nucleic Acids Res. 52, W78–W82 (2024).
Article PubMed PubMed Central Google Scholar
Nayfach, S. et al. CheckV assesses the quality and completeness of metagenome-assembled viral genomes. Nat. Biotechnol. 39, 578–585 (2021).
Article CAS PubMed Google Scholar
Couvin, D. et al. CRISPRCasFinder, an update of CRISRFinder, includes a portable version, enhanced performance and integrates search for Cas proteins. Nucleic Acids Res. 46, W246–W251 (2018).
Article CAS PubMed PubMed Central Google Scholar
Zhang, C. et al. LucaPCycle: Illuminating microbial phosphorus cycling in deep-sea cold seep sediments using protein language models. Zenodo https://doi.org/10.5281/zenodo.15334764 (2025).
Article Google Scholar

Download references

Acknowledgements

The work was supported by the National Science Foundation of China (No. 32300100 and No. 92351304), the Scientific Research Foundation of Third Institute of Oceanography, MNR (No. 2025011, No. 2022025 and No. 2023022), the Natural Science Foundation of Fujian Province (No. 2023J06042), State Key Laboratory of Marine Geology, Tongji University (No. MGK202303), and China Postdoctoral Science Foundation (No. 2022M723709). We thank Jiaxue Peng and Yingchun Han for their assistance in metabolomics and metagenomic analyses, respectively, and Fabai Wu, Yu Hu, Muhe Diao, Maxim Rubin-Blum and Lingwei Ruan for helpful discussions.

Author information

These authors contributed equally: Chuwen Zhang, Yong He.

Authors and Affiliations

Key Laboratory of Marine Genetic Resources, Third Institute of Oceanography, Ministry of Natural Resources, Xiamen, China
Chuwen Zhang, Jieni Wang, Tengkai Chen, Jing Liao & Xiyang Dong
Apsara Lab, Alibaba Cloud Intelligence, Alibaba Group, Hangzhou, China
Yong He & Zhao-Rong Li
Fungal and Biogeochemical Oceanography Group, College of Oceanography and Ecological Science, Shanghai Ocean University, Shanghai, China
Federico Baltar
Fungal and Biogeochemical Oceanography Group, Department of Functional and Evolutionary Ecology, University of Vienna, Vienna, Austria
Federico Baltar
Key Laboratory of Humid Sub-tropical Eco-geographical Process of Ministry of Education, Fujian Normal University, Fuzhou, China
Minjie Hu
School of Geographical Sciences, Fujian Normal University, Fuzhou, China
Minjie Hu
Key Laboratory of Marine Mineral Resources, Ministry of Natural Resources, Guangzhou Marine Geological Survey, China Geological Survey, Guangzhou, China
Xi Xiao
Southern Marine Science and Engineering Guangdong Laboratory (Zhuhai), Zhuhai, China
Xiyang Dong

Authors

Chuwen Zhang
View author publications
Search author on:PubMed Google Scholar
Yong He
View author publications
Search author on:PubMed Google Scholar
Jieni Wang
View author publications
Search author on:PubMed Google Scholar
Tengkai Chen
View author publications
Search author on:PubMed Google Scholar
Federico Baltar
View author publications
Search author on:PubMed Google Scholar
Minjie Hu
View author publications
Search author on:PubMed Google Scholar
Jing Liao
View author publications
Search author on:PubMed Google Scholar
Xi Xiao
View author publications
Search author on:PubMed Google Scholar
Zhao-Rong Li
View author publications
Search author on:PubMed Google Scholar
Xiyang Dong
View author publications
Search author on:PubMed Google Scholar

Contributions

Conceptualization, X.D., and C.Z.; Methodology, C.Z., Y.H., Z.-R.L., J.W., T.C., J.L., and X.X.; Investigation, C.Z., X.D., Y.H., and Z.-R.L.; Computational Resources, Y.H., and Z.-R.L.; Writing—Original Draft, C.Z., Y.H., X.D., and Z.-R.L.; Writing—Review and Editing, X.D., C.Z., Y.H., Z.-R.L., M.H., and F.B.; Funding Acquisition, X.D., and C.Z.; Supervision, X.D., and Z.-R.L.

Corresponding authors

Correspondence to Zhao-Rong Li or Xiyang Dong.

Ethics declarations

Competing interests

The authors declare no competing interests.

Peer review

Peer review information

Nature Communications thanks Wei Zheng, the other, anonymous, reviewer(s) for their contribution to the peer review of this work. A peer review file is available.”

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary information

Supplementary Information

Description of Additional Supplementary Files

Supplementary Data 1-17

Reporting Summary

Transparent Peer Review file

Source data

Source Data

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.

Reprints and permissions

About this article

Cite this article

Zhang, C., He, Y., Wang, J. et al. LucaPCycle: Illuminating microbial phosphorus cycling in deep-sea cold seep sediments using protein language models. Nat Commun 16, 4862 (2025). https://doi.org/10.1038/s41467-025-60142-4

Download citation

Received: 03 August 2024
Accepted: 16 May 2025
Published: 26 May 2025
DOI: https://doi.org/10.1038/s41467-025-60142-4