Introduction

Cold seeps, often situated along continental margins, are areas where hydrocarbon-rich fluids escape from subsurface reservoirs to the seafloor1,2. These unique environments are distinguished by their chemosynthetic microorganisms that utilize methane and other hydrocarbons—such as non-methane alkanes and aromatic hydrocarbons—as sources of carbon and energy to sustain life3,4,5. A key process in these habitats is the anaerobic oxidation of methane (AOM), which occurs in conjunction with sulfate reduction, mediated by a consortium of anaerobic methanotrophic archaea (ANME) and sulfate-reducing bacteria (SRB)6,7. In addition to carbon and sulfur, these microbial communities also require phosphorus for essential biological functions such as building cell membranes, synthesizing nucleic acids, and producing energy carriers and various phosphorylated metabolic intermediates8,9. Despite its importance, studies on phosphorus cycling in deep-sea cold seep sediments are less developed than those on carbon and sulfur3,10.

Chen et al. (2023) examined a piston core from the Haima cold seeps in the Qiongdongnan Basin, northern South China Sea, revealing high phosphate concentrations in porewater and low ratios of organic phosphorus to total organic carbon (TOC)11. This suggests a phosphorus-rich environment within these deep-sea cold seeps. The study, along with others, provided geochemical evidence that AOM is a key driver of phosphorus cycling within these environments11,12,13. This process, associated with the generation of hydrogen sulfide, leads to the reductive dissolution of iron oxides, releasing iron-bound phosphorus into the surrounding porewater. Another study indicates that in deep-sea carbonate deposits from cold seeps in the South and East China Seas, calcium-bound phosphorus predominates, co-precipitating with calcium carbonate during AOM, with iron also playing a crucial role in the release of phosphorus from iron oxides14. These findings underscore the complex interactions between microbial activity and chemical reactions in deep-sea phosphorus cycling, through a combination of microbial and chemical processes. However, the specific contributions of microbially-driven processes to the production of dissolved inorganic phosphorus in these environments remain unclear.

Microbes have evolved various enzymatic mechanisms to convert recalcitrant phosphorus forms into bioavailable phosphorus15. These phosphorus-cycling microorganisms (PCMs) possess genes related to four main processes: organic phosphorus mineralization, inorganic phosphorus solubilization, phosphorus uptake and transport, and phosphorus-starvation response regulation16,17,18,19. Organic phosphorus mineralization primarily involves release of hydrolytic enzymes to recycle various organic phosphorus compounds. Acid (e.g., PhoN, PhoC) and alkaline phosphatases (e.g., PhoD, PhoA) cleave phosphate from complex phosphoesters20,21,22, while phytases (e.g., AppA) hydrolyze phosphate from phytate or myo-inositol hexakis dihydrogen phosphate23. Phosphonatase (e.g., PhnW, PhnX) and C-P lyase (PhnGHIJKLM) participate in C-P cleavage of phosphonates24,25. For inorganic phosphorus mobilization, polyphosphate kinase (PPK) catalyzes the formation of polyphosphate (poly-P), and exopolyphosphatase (PPX) subsequently releases phosphate from poly-P chains26,27. PCMs also secrete organic acids, such as gluconic acid produced by the oxidation of glucose via membrane-bound quinoprotein glucose dehydrogenase (Gcd), to solubilize inorganic phosphorus minerals28. Additionally, PCMs harbor high-affinity (PstABCS) and low-affinity (Pit) transporters for phosphate uptake under varying conditions29, and can adjust phosphorus-sensing and regulatory mechanisms (e.g., PhoB-PhoR system) to optimize phosphorus use under scarcity19,30.

The annotation of phosphorus-cycling genes has traditionally relied on sequence homology methods such as Blastp or profile hidden Markov models (pHMMs)30,31. However, these methods are limited to detecting sequences with relatively high similarity, potentially missing divergent sequences with distant evolutionary relationships. To overcome this limitation, researchers are increasingly using protein structure features, which are more conserved over evolutionary timescales and crucial for maintaining functional integrity32. Recent advancements in deep learning for protein bioinformatics, exemplified by tools like AlphaFold, have facilitated insights into these previously obscure ‘functionally dark’ proteins33,34,35. For example, this approach led to the discovery of a superfamily of translation-targeting toxin-antitoxin systems, TumE–TumA34. Additionally, the field of functional annotation predictions has benefited significantly from the application of natural language processing (NLP) techniques. In NLP, protein language models (PLMs) are developed using extensive labeled and unlabeled datasets to create vectorized embeddings that represent both protein sequences and structures. This innovative method has been applied in various life science research areas, including the identification of viruses, the discovery of antibiotics, and the detection of remote proteins32,36,37,38.

In this study, we aim to improve the annotation of phosphorus-cycling proteins, focusing on previously unexplored sequence spaces, to deepen our understanding of microbial contribution in phosphorus cycling across global cold seeps. To achieve this, we developed a deep learning predictor named LucaPCycle, specifically designed to target phosphorus-cycling proteins. This model was trained using both the raw protein sequence and the embeddings derived from the PLM ESM2-3B. Our findings indicate that LucaPCycle effectively identifies remote homologies, thus providing a valuable complement to traditional alignment-based methods. This enhancement allows for a more comprehensive understanding of the roles and mechanisms of phosphorus cycling, not limited to the cold seep ecosystems investigated here, but also in other environments.

Results and discussion

PLM-based LucaPCycle outperforms homology-based approaches

LucaPCycle is a dual-channel deep learning model (Fig. 1a). The first channel utilizes the protein language model ESM2 to extract features from sequences at the residue level, leveraging its self-supervised learning capabilities to understand the sequence context39. The second channel involves a Transformer-Encoder40 designed to capture the features of the raw sequence. We applied LucaPCycle to build two models, one for binary-classification and the other for 31-classification. The binary-classification model predicts whether a protein sequence has a phosphorus-cycling function. If it does, the 31-classification model then categorizes it into one of 31 specific types of phosphorus-cycling genes (detailed in Supplementary Data 1). LucaPCycle is established based on 214,193 positive samples of 31 phosphorus-cycling proteins and 853,615 negative samples covering a range of functions, including intracellular phosphorus metabolism, nitrogen cycling, sulfur cycling, and other unrelated functions (Supplementary Fig. 1a–b). Detailed information on model building is provided in the Methods section.

Fig. 1: Overview of LucaPCycle for phosphorus-cycling protein annotation.
figure 1

a Architecture of LucaPCycle. LucaPCycle consists of five modules: Input, Tokenizer, Encoder, Pooler, and Output. The Tokenizer module performs character-level and word-level tokenization on input sequences. The Encoder automatically extracts two types of features: the representation matrix of the protein language model and the representation matrix of the raw sequence. The Pooler selects essential features and transforms the feature matrix into a vector. The Output layer concatenates these two pooled vectors for classification. b Performance of binary-classification and 31-classification models within LucaPCycle. Six metrics, including accuracy, precision (macro), recall (macro), F1-score (macro), AUC (macro, one-vs-rest), and PR-AUC (macro) were evaluated based on the validation and testing datasets. c Benchmarking of LucaPCycle with KofamScan and Diamond Blastp. Source data are provided as a Source Data file.

Binary- and 31-classification models in LucaPCycle both perform well on metrics of accuracy, precision, recall, F1-score (The harmonic mean of the precision and recall), AUC (Area under the ROC curve), and PR-AUC (Area under the precision-recall curve), achieving scores above 0.96 when evaluated on validation and test datasets (Fig. 1b). To further demonstrate the sensitivity and specificity of LucaPCycle, we benchmarked it to two homology-based functional annotation tools, utilizing a dataset of 1,521,958 true positive sequences (see Methods). The homologous protein search methods include: (1) Diamond Blastp based on sequence similarity with curated phosphorus-cycling genes in PCycDB database30; and (2) KofamScan31, a gene function annotation tool based on KEGG Orthology and hidden Markov model (HMM). Notably, LucaPCycle outperforms the other two methods with the highest precision rate (86.13%, a measure of the accuracy of the positive predictions made by the model) and a comparable recall rate (85.63%, a measure of how well the model can identify all the positive samples; Fig. 1c). In comparisons, KofamScan exhibited a high recall rate (98.96%) but a low precision rate (58.24%), while Diamond Blastp exhibited both the lowest precision rate (51.98%) and recall rate (35.01%).

Remote homology detection: three alkaline phosphatases at cold seeps

Using the non-redundant gene (n = 147,289,169) and genome (n = 1878) catalogs from global cold seeps, phosphorus-cycling genes were functionally annotated using LucaPCycle and two sequence homology-based methods, Diamond Blastp and KofamScan (Supplementary Fig. 2). The results from these three search strategies were cross-checked by protein ___domain analysis, a deep learning model DeepFRI41 and a machine learning model CLEAN42. Subsequently, all validated phosphorus-cycling proteins were merged and clustered using sequence-based clustering algorithm to generate representative sequences for phylogenetic analyses. Singletons, likely indicative of sequencing errors43, were discarded, leading to the identification of 333,493 non-singleton phosphorus-cycling protein families (Supplementary Fig. 2). The majority of protein families consisted of sequences organized in small groups (2 members) and exhibited diverse lengths with the most common range being 100–300 amino acids (Supplementary Fig. 3). Among 333,493 non-singleton phosphorus-cycling protein families, 5241 families (i.e., 29,418 proteins) were exclusively identified by LucaPCycle (Supplementary Fig. 4a). The 5241 protein families cover 31 distinct types of phosphorus-cycling genes, with phnC (n = 1384), ppx (n = 972), phnK (n = 723), and pstB (n = 720) being the ones having the highest number of families (Supplementary Fig. 4b). Diamond Blastp and KofamScan identified more exclusive phosphorus-cycling gene families than LucaPCycle. However, many of these exclusive families are likely false positives, as indicated by their relatively low precision rates (KofamScan: 58.24%; Diamond Blastp: 51.98%; Supplementary Note 1 and Supplementary Data 2). Despite passing ___domain-based validation, the accuracy of these predictions remains uncertain since one protein ___domain may appear in a variety of unrelated proteins, leading to potential false positives. To validate the 5241 phosphorus-cycling protein families identified by LucaPCycle, we queried these families against the full KofamScan HMM database. In total, 1966 protein families matched sequences in the HMM database, corresponding to 25 distinct types of phosphorus-cycling genes (Supplementary Data 3). However, these matches had high E-values (> 1E-5) and low sequence similarity scores (< 25), excluding them from standard KofamScan annotations. These results indicate that the LucaPCycle-specific protein families represent remote homologs, highlighting their distant evolutionary relationships with known sequences. LucaPCycle has an advantage in identifying remote homologous proteins due to ESM2’s embeddings generated by a large model trained on extensive sequence data. This approach effectively captures sequence context and produces deep embedding matrices, allowing distantly related sequences with low similarity to exhibit high similarity in their embeddings. Consequently, protein families identified exclusively by Diamond Blastp and KofamScan were classified as homologous phosphorus-cycling protein families, while those unique to LucaPCycle were categorized as remote phosphorus-cycling protein families. These remote proteins could significantly contribute to the biogeochemical cycling of phosphorus in cold seeps, but have been previously neglected due to limitations of homology-based methods. Notably, the ppx gene with remote homology constituted substantial proportions of the total ppx gene (73.02%) and transcript (76.84%) abundances (Supplementary Fig. 4c–d).

We further explored the remote homology information of alkaline phosphatases (ALPs) recovered by LucaPCycle. This group was chosen due to its prevalence in global ocean water and its clinical significance20,44,45. ALPs are typically separated into three classical families (PhoA, PhoD, and PhoX)20,22 and one divergent family (PafA)45. Our sequence and structure tree topologies showed that remote ALPs formed three distinct clusters (Fig. 2a, b). We designated these clusters as ALP1, ALP2, and ALP3 in this study. Specifically, sequence and structural phylogenies placed ALP1 between PhoX and divergent PafA, while ALP2 and ALP3 were positioned closer to PafA and PhoA, respectively. Despite sharing a low sequence identity averaging at 14.3%, a large proportion of ALP1 demonstrated structural homology to classic PhoD (TM-score > 0.5; Supplementary Fig. 5a). ALP2 were more resemblance to divergent PafA rather than to other three classic ALPs. All ALP2 displayed a high structural similarity with divergent PafA, with the majority exceeding a TM-score of 0.75 but sequence similarity averaged at 33.6% (Supplementary Fig. 5b). ALP3 shared a sequence identity of 17.6% with PhoA from Escherichia coli (PDB: 1AJA) but exhibited a TM-score of 0.82 in terms of structure (Supplementary Fig. 5c). Thus, ALP1, ALP2, and ALP3 could be remote relatives of PhoD, PafA and PhoA, respectively.

Fig. 2: Three alkaline phosphatases (ALPs) with remote homology identified by LucaPCycle.
figure 2

a Phylogenetic tree of ALPs based on protein sequences. Scale bar indicates the mean number of substitutions per site. Bootstrap values over 70% were shown. The n values denote the count of ALPs sequences within each family. b Structure-based phylogeny of ALPs. All protein 3D structures are predicted by AlphaFold3. c Comparison of domains between ALP1 and classic PhoD identified by sequence (InterproScan) versus structure-based methods (ECOD). d Structural comparison between three predicted remote ALP1 and the experimentally verified PhoD (2YEQ, in gray). e Close-up view of the putative active sites for metal binging. The active site residues of PhoD are shown in gray sticks. The catalytic site metal ions are presented as spheres, with Fe3+ shown in red and Ca2+ shown in cyan. f Molecular docking of ALP1 and phosphomonoesters. Substrates histidinol phosphate is shown in yellow sticks. The pTM scores for the protein structures of ALPs are provided in Supplementary Data 15.

Protein structures are composed of domains, independently folding units that can be found in multiple structural contexts and functional roles46. Extracting these domains enables a more fine-grained description of the features among different proteins. Subsequently, we compared the ___domain variations between ALP1 and the classic PhoD, given their relatively low TM-score (Supplementary Fig. 5a). Both ALP1 and PhoD were partitioned into two domains by sequence- and structure-based ___domain assignments (Fig. 2c). However, their ___domain organizations were completely different. Typically, PhoD features a core architecture consisting of a sandwich of long β-sheets flanked by α-helices of varying lengths47 and an upstream N-terminal β-sheet ___domain. In contrast, ALP1 lacked the N-terminal ___domain and was organized with a core architecture followed by an unannotated C-terminal β-sheet ___domain (Fig. 2c). ALP1 also exhibited fewer α-helices in its core architecture compared to PhoD. However, the core architectures of ALP1 and PhoD maintained a similar orientation and were conserved within the ECOD hierarchy at various levels (Fig. 2d and Supplementary Data 4): A-group (same architecture), X-group (possible homology), H-group (homology), T-group (topology), and F-group (sequence similarity). Nevertheless, the C-terminal ___domain of ALP1 and the N-terminal ___domain of PhoD were structural divergence and assigned different ECOD T-group and F-group labels (Fig. 2d and Supplementary Data 4). The emergence of a C-terminal β-sheet ___domain may represent an adaptive strategy for phosphorus-cycling microbes to enhance fitness in extreme deep-sea environments.

Despite evolutionary variations in amino acid sequences and protein structures, the remote ALP1 retained the function of phosphoester hydrolysis. ALP1 preserved metal binding sites in their core architectures, with active center residues exhibiting similar spatial distribution to classic PhoD (Fig. 2e). The molecular docking analysis demonstrated robust and stable interactions between ALP1 and 89 various phosphoester compounds, as indicated by low binding energies (−11.02 ~ −2.39 kcal/mol; Fig. 2f and Supplementary Data 5). This conservation of essential domains (i.e., core architectures) for function maintenance in remote ALPs suggests convergent evolution.

Phosphorus-cycling genes are diverse, active and depth-stratified at cold seeps

The microbes inhabiting cold seeps were found to harbor multiple phosphorus acquisition strategies. In particular, genes encoding the regulatory system (phoR-phoB) were among the most predominant (av. 3780.38 genes per million, GPM; Fig. 3a). These microbes also exhibited efficient phosphate transport systems to acquire available phosphorus with minimal energy investment. This was evidenced by the significantly higher levels of genes encoding for high-affinity phosphate transporter (pstABCS; av. 634.03 GPM) compared to those encoding the low-affinity phosphate transporter (pit; av. 168.63 GPM). The prevalence of phoR-phoB and pstABCS aligned with observations in soil microbes in response to limited bioavailable phosphorus48,49. Within inorganic phosphorus solubilization, genes encoding glucose dehydrogenase (gcd), polyphosphate kinase (ppk) and exopolyphosphatase (ppx) demonstrated comparable abundance patterns.

Fig. 3: The biogeographic pattern of phosphorus-cycling genes across global cold seeps.
figure 3

a Relative abundances of phosphorus-cycling genes from four distinct metabolic processes, as compared to mcrA and dsrA in sediment metagenomes (n = 163). OP: Organic phosphorus mineralization; IP: Inorganic phosphorus solubilization; TP: Transporters; RG: Regulatory genes; AOM-SR: The anaerobic oxidation of methane (AOM) coupled to sulfate reduction (SR). Genes of alkaline phosphatase (ALP) including phoA, phoD and phoX. Genes of acid phosphatase (AP) including aphA, olpA, phoN and phoC. Gene abundances are indicated as GPM (genes per million). Data are presented as mean ± SD. P-values of differences among different genes were calculated using two-sided Kruskal-Wallis rank-sum tests. Boxplot: center line, median; box limits, upper and lower quartiles; whiskers, 1.5 × interquartile range; points, outliers. b The chemical structures of phosphomonoester, phytate, phosphonates and metabolic intermediates identified in sediments from Qiongdongnan and Shenhu cold seeps. c Comparison of functional richness of phosphorus-cycling genes across diverse cold seep sediment depths. Surface, < 1 mbsf (n = 116); Shallow, 1 to 10 mbsf (n = 33); Deep, > 10 mbsf (n = 13). Asterisks indicate statistically significant differences among groups of surface, shallow, and deep sediments (determined by two-sided Wilcoxon rank sum tests; **: P < 0.01;  ***: P < 0.001). Data are presented as mean ± SD. Boxplot: center line, median; box limits, upper and lower quartiles; whiskers, 1.5 × interquartile range; points, outliers. d The partial least squares discrimination analysis (PLS-DA) plots depict the differences in phosphorus-cycling microbial communities among different sediment depths. Analysis of similarity were examined using a 999-permutation PERMANOVA test. e Spearman correlation between the abundances of phosphorus-cycling genes and various physicochemical variables. The red and blue colors indicate positive and negative correlations, respectively. TA: total alkalinity; TOC: total organic carbon; TN: total nitrogen. Source data are provided as a Source Data file.

Organic phosphorus constitutes about 25% of the total phosphorus buried in marine sediments50, where phosphorus atoms are covalently bonded to carbon either directly (P–C, phosphonates) or through a phosphoester linkage (P–O–C, phosphomonoesters, phosphodiesters, phosphotriesters and phytates)51. Depending on the substrate, microorganisms rely on distinct enzymes to mineralize organic phosphorus and release inorganic phosphate. For phosphoester hydrolysis, genes for nonspecific ALPs (phoA, phoD, or phoX, av. 67.72 GPM) were more prevalent than genes for nonspecific acid phosphatases (APs; aphA, olpA, phoN or phoC, av. 16.23 GPM) and phytases (av. 4.65 GPM; Fig. 3a). The lower abundance of genes encoding phytase can be attributed to the limited availability of its substrate, phytates, sourced primarily from terrestrial runoff 52. Additionally, the approximately neutral pH at cold seeps may limit the activity of APs, which has an optimal pH range of 4–653 (Supplementary Data 6). Regarding phosphonate degradation, the genes encoding substrate-specific phosphonatase (phnW-phnX) and broad-specificity C–P lyase (phnGHIJK) showed comparable abundances (Fig. 3a). Moreover, untargeted metabolomics detected over 104 organic phosphorus compounds in 55 cold seep sediments (Fig. 3b and Supplementary Data 7). These metabolites included a wide diversity of phosphomonoesters (e.g., acetylphosphate), phytates (e.g., 1D-Myo-inositol 1,4,5,6-tetrakisphosphate) and phosphonates (e.g., Hydroxymethylphosphonate). Additionally, our results identified intermediate metabolites of organic phosphorus mineralization, such as alpha-D-ribose 1,2-cyclic phosphate 5-phosphate, generated during phosphonate degradation. This provides potential biological evidence for the recycling of organic phosphorus molecules at cold seeps. However, it cannot conclusively rule out the possibility that some of these molecules originate from external sources.

To determine the biogeographic patterns of phosphorus-cycling proteins, each metagenome was categorized by its sediment depth (i.e., surface: < 1 mbsf; shallow: 1–10 mbsf; deep: > 10 mbsf). The distribution of phosphorus-cycling proteins was stratified by sediment depth. The functional richness (Chao1) decreased with depth and peaked at surface layers (P < 0.001; Fig. 3c). The compositions of phosphorus-cycling proteins also displayed significant variations among surface, shallow, and deep layers (F = 13.5, P = 0.001, R2 = 0.15, 999-permutations test; Fig. 3d). Likewise, we observed depth-related pattern in the expression of phosphorus-cycling genes, with a higher transcript abundance detected in surface layers (averaging from 1.38 to 1251.73 TPM; Supplementary Fig. 6). This depth-stratified pattern may be driven by phosphate concentrations, which increase with sediment depth as shown by measurements from 12 sediment cores in the Shenhu area and Qiongdongnan basin (Supplementary Fig. 7 and Supplementary Data 8). Phosphate concentration exerted a negative feedback regulation on the abundance of phosphorus-cycling genes (Fig. 3e), as observed in soil bacteria54. Cold seep types (i.e., gas hydrate, methane seep, oil and gas seep, mud/asphalt volcano) also contribute to the patterns of phosphorus-cycling gene abundance and expression (Supplementary Fig. 6 and 8). Significant differences in functional richness (P < 0.001; Supplementary Fig. 8a) and compositions of phosphorus-cycling proteins among cold seep types were observed (F = 5.71, P = 0.002, R2 = 0.10, 999-permutations test; Supplementary Fig. 8b). Temporal dynamics of phosphorus-cycling genes were evident across both short-term (over several years) and geological timescales. Variations in phosphorus-cycling gene abundance and composition were detected between samples collected in 2021 and 2023 from the QDN-W07 site, as well as between active and extinct areas of the Haima cold seep (Supplementary Fig. 9). These findings reflect the impact of changing seep activity on phosphorus cycling within cold seep ecosystems.

Subsequently, we assessed the importance of phosphorus cycling in cold seep ecosystems based on the abundance of corresponding genes and transcripts. We chose mcrA and dsrA genes as references due to their key roles in anaerobic methane oxidation coupled with sulfate reduction, a major biological process in cold seeps55,56. Except for genes encoding APs and phytases, the abundances of other phosphorus-cycling genes associated with 11 distinct metabolic processes (averaging from 54.57 to 3780.38 GPM) exceeded those of mcrA (av. 46.15 GPM) or dsrA genes (av. 21.32 GPM; Fig. 3a). Likewise, the expression levels of phosphorus-cycling genes (averaging from 6.84 to 425.08 TPM) demonstrated a similar pattern when compared to mcrA (av. 251.53 TPM) or dsrA genes (av. 5.66 TPM; Supplementary Fig. 10). To date, microbial transformation of phosphorus is well-documented in ocean surface water where phosphate is often limited24,44. Here, our results demonstrated that phosphorus cycling driven by microorganisms is clearly important within deep-sea cold seeps in addition to the extensively studied carbon and sulfur metabolisms55,56.

Archaea are key players for organic phosphorus mineralization and inorganic phosphorus solubilization

The taxonomic analysis was focused on the cold seep genome catalog to ensure a fine and accurate taxonomic assignment of phosphorus-cycling proteins. The results revealed that genes encoding the regulatory system, the high-affinity PstABCS and the low-affinity Pit were widely distributed throughout all major archaeal and bacterial phyla (Supplementary Fig. 11). This broad distribution pattern highlights the foundational role of phosphorus uptake and regulation in supporting cell growth9. Aside from the aforementioned basic genetic equipment, our findings substantially expanded the taxonomic distribution of genes related to the phosphate scavenging from inorganic and organic phosphorus source among Archaea ___domain (Fig. 4a). These archaea exhibited high relative abundances in the microbial communities, e.g., with up to 44.1% of the microbial community in sediment samples from the Haakon Mosby mud volcano (Supplementary Fig. 12). To date, genes encoding inorganic polyphosphate (poly-P) metabolism, specifically ppk or ppx, have been identified in the archaeal phyla Halobacteriota and Thermoproteota57. However, our PLM-based LucaPCycle alternatively identified ppx in Iainarchaeota (formerly Diapherotrites) from the DPANN superphylum, representing one of the 5241 annotations exclusively detected by LucaPCycle. DPANN are characterized by small genome sizes and minimal metabolic function58. The selective maintenance of ppx genes in DPANN archaea may be associated with changing environments, as poly-P confers resistance to environmental stresses59. However, mineral phosphorus solubilization in archaea has received less attention with the majority of studies focusing on bacteria17,18,28. Beyond Halobacteriota28, four additional archaeal phyla—Hydrothermarchaeota, Aenigmatarchaeota, Thermoplasmatota, and Thermoproteota—also harbored the gcd gene, which encodes glucose dehydrogenase (Fig. 4a). This enzyme catalyzes the oxidation of glucose to gluconic acid, facilitating the solubilization of mineral-sorbed inorganic phosphate18.

Fig. 4: The overlooked role of archaea in deep-sea phosphorus-cycling.
figure 4

a Phylogenetic distribution of phosphorus-cycling genes within Archaea ___domain. b Taxonomic placement of PhnJ-containing archaeal genomes. Phylogenomic tree constructed based on alignments of 43 conserved protein sequences from MAG FR_S4_sbin5 and HM1_cobin.76 (indicated by red stars). Bootstrap value ≥ 70 are shown. Scale bar indicates amino acid substitutions per site. c Organization of C-P lyase operons in ANME-2c and ANME-3. d The structure of Phn(GHIJK)2 complex shown as a surface representation. The C-P lyase core complex and the interaction among subunits were predicted using AlphaFold3. e Anaerobic oxidation of methane and methanogenesis in ANME-2c. Gray color indicates the absence of the enzyme. The percentages between brackets indicate the estimated completeness of the corresponding MAGs. MPn: methylphosphonate. The methane metabolism pathway diagram was created using the vector graphics software CorelDRAW 2019. A complete list of metabolic information can be found in Supplementary Data 17.

Contrary to a recent study reporting that most archaea cannot metabolize phosphonates60, this study found that a variety of archaeal phyla possess the genetic capability to do so. Specifically, genes responsible for phosphonate substrate-specific catabolism (phnWX), which target 2-AEP, demonstrated a wide taxonomic distribution across archaea ___domain (Fig. 4a). These genes co-occurred in 12 different archaeal phyla, predominantly Halobacteriota, Thermoproteota, Asgardarchaeota, and Thermoplasmatota, whereas in global seawater, they were mainly found in the bacterial phylum Proteobacteria24. Additionally, the broad-specificity catabolism (C-P lyase) genes (phnGHIJKLM) that hydrolyze a broad range of phosphonates were identified in genomes affiliated with archaea Halobacteriota. Notably, metagenome-assembled genomes (MAGs; FR_S4_sbin5 and HM1_cobin.76) belonging to ecologically important methanotrophs ANME-2c and ANME-3 possessed near complete C-P lyase operon (Fig. 4b, c). This marks a significant discovery, overturning the previous notion that the phnJ gene was exclusive to bacteria in marine ecosystems24,61,62. Organization of C-P lyase operons exhibited a consecutive string of C-P lyase subunits and alongside the synteny with other phosphorus-cycling genes, such as the phosphonate transporter (phnCDE) and partial phosphate transporter (pstABCS). The neighborhood occurrence of these genes in C-P lyase operons reflects an evolutionary adaptation to enhance the efficiency of phosphorus-related mechanisms, consistent with what has been observed in bacteria24. The subunits of the C-P lyase core complex found in ANME archaea interacted closely (pTM: 0.9; ipTM: 0.9) to form conformational states necessary for phosphonate breakdown (Fig. 4d).

ANME-2c FR_S4_sbin5 also harbor genes for complete MCR complex (mcrABG), along with an archaeal Wood–Ljungdahl pathway for oxidizing methane to CO2 and HdrABC to recycle the CoM and CoB heterodisulfides (Fig. 4e and Supplementary Fig. 13). This denotes a conventional anaerobic oxidation of methane via MCR mechanism63,64. The presence of C-P lyase in methanotrophs revealed the potential of methane production from utilizing methylphosphonate. Indeed, the metabolic product of PhnJ (alpha-D-ribose 1,2-cyclic phosphate 5-phosphate) was detected in our metabolome data (Supplementary Data 7), indicating the actual occurrence of methylphosphonate-driven methane formation at cold seeps. All phnJ genes, encoding for the catalytic subunit65, from gene- and genome-level were used for phylogenic analyses to reveal the evolution of C-P layse. PhnJ protein sequences associated with Halobacteriota formed a monophyletic group, indicating a shared evolutionary ancestry (Supplementary Fig. 14).

Phosphorus-cycling AMGs in cold seep viruses are dominated by the PhoR-PhoB system and PhnCDE transporter

Apart from bacteria and archaea, we identified 328 phosphorus-cycling auxiliary metabolic genes (AMGs) encoded by 182 viral contigs (vContigs) from 165 metagenomic assemblies (Supplementary Data 9-10). This represents a substantial number of phosphorus-cycling AMGs when compared to the study by Liang et al. (2024)66, which recovered only 105 phosphorus-cycling AMGs from 333 soil metagenomes. The recovered vContigs were validated as viruses by containing viral hallmark genes, viral-like genes or having a high virus score (> 0.7) determined by genomad67. Genomic arrangements also supported their affiliations to viruses, with phosphorus-cycling AMGs flanked by viral hallmarks or viral-like genes on both sides (Fig. 5a and Supplementary Fig. 15). Using geNomad67, up to 94.5% (n = 172) of the 182 vContigs were classified to the Caudoviricetes class, with a few assigned to classes of Megaviricetes (n = 1) and Tokiviricetes (n = 1, Fig. 5b). Due to the limited number of vContigs that could be classified to a specific viral family (Supplementary Data 11), the family-level classifications were performed alternatively using PhaGCN68. A total of 58 vContigs could be further assigned to 15 families, with Straboviridae being the most prevalent (n = 19, Supplementary Fig. 16).

Fig. 5: Phosphorus-cycling viruses at cold seeps.
figure 5

a Genome organization diagrams of six representative vContigs identified in this study. Predicted open reading frames are colored according to the results of our phosphorus-cycling proteins annotation pipeline, DRAM-v and geNomad. b The class-level classifications of the phosphorus-cycling vContigs according to geNomad. c Numbers of the auxiliary metabolism genes (AMGs) involved in phosphorus cycling. d Interaction networks of phosphorus-cycling vContigs and cold seep microorganisms. The circles and hexagons represent bacterial and archaeal host, respectively. Hosts are shown at the bottom of the panel and colored according to their taxonomy at class level (order level for ANME-1). Source data are provided as a Source Data file.

The 328 phosphorus-cycling AMGs spanned 20 gene types (Fig. 5c), categorized into four phosphorus metabolic processes: organic phosphorus mineralization (phoA, phoN, phoX, phnW, phnX, phnM, phnL and phnK), phosphorus transportation (phnC, phnD, phnE, pit and pstABCS), inorganic phosphorus solubilization (ppk and ppx) and phosphorus regulation (phoR and phoB). The identification of previously unrecognized AMGs66, i.e., phnX, phoX, ppk, phnM, phnL and phnK, expanded the diversity of AMGs involved in phosphorus cycling. Remarkably, phosphorus regulation (phoR, n = 57; phoB, n = 102) was the most commonly encoded function among the cold seep viral AMGs (Fig. 5c). Moreover, one viral genome harbored a complete PhoB-PhoR regulatory system (Fig. 5a), a significant finding since previously only individual components of this system, either phoR or phoB gene, had been identified as AMGs66,69. These finding allow us to propose that cold seep phages could regulate the transcription of phosphorus-cycling genes completely using their own regulators, rather than relying on the host’s regulatory system70. Phosphorus-cycling AMGs were also enriched in ATP-binding subunit (phnC, n = 75; Fig. 5c) of the ABC-type phosphonate transport system (phnCDE)71. One viral genome was observed to encode other components of this transporter (phnDE, Fig. 5a). To date, only one provirus genome encoding PhnCDE was detected72, and the periplasmic high-affinity phosphate-binding gene pstS was the most commonly reported AMG related to phosphorus transport73,74. The viral phoR, phoB and phnC were expressed at varying levels, with the highest reaching up to 4.80 TPM at Jiaolong cold seep, while the transcripts of other types of AMGs were not detected (Supplementary Data 12). However, the absence of detected expression does not imply that these AMGs are inactive in situ, as it is challenging to recover sufficient RNA from deep-sea samples, and RNA may be lost during deep-sea sampling or extraction. This enrichment of AMGs in phosphorus regulation and transport were consistent with the pattern observed in their microbial hosts, which had the highest abundance of these genes across global cold seeps (Fig. 3a). This uniformity suggests that cold seep viruses and their microbial hosts are likely under the same selective pressure (e.g., phosphate bioavailability) to retain these genes for phosphorus utilization, as seen with other nutrient cycling processes such as nitrogen and sulfur75,76. Multiple lines of evidence suggest that, in phosphorus-limited oceanic water and various terrestrial ecosystems66,72,77, viruses augment their hosts’ phosphorus acquisition by encoding AMGs related to phosphatases like phoD and phoN. However, in phosphorus-rich cold seep environments, the dominance of AMGs for the PhoR-PhoB system and PhnCDE transporter implies that phosphorus regulation and transport are more crucial than metabolic processes for phosphate release.

To explore the potential viral impacts on phosphorus cycling at cold seeps, we investigated virus-host linkages, via iPHoP that aggregates six different methods for host predictions along with CRISPR spacer matching78. Approximate 17.58% (32/182) of the phosphorus-cycling vContigs identified could be associated with a total of 81 hosts (Supplementary Data 13). These hosts encompassed a wide range of diversity, spanning 20 bacterial classes and 6 archaeal classes (order level for ANME-1; Fig. 5d). While it is established that known hosts for phosphorus-cycling viruses are predominantly bacteria66, our results revealed a broader spectrum of hosts in archaeal lineages. Specifically, we identified five distinct Caudoviricetes viruses infected ecologically important archaea ANME-1 (Fig. 5d), functioning as methanotrophs in many methane-rich ecosystems7. These viruses potentially modulated the phosphorus metabolism of ANME-1 by containing phoB, phnC, and phnK (Supplementary Data 10 and 13). Of particular significance is the involvement of PhnK, a component of C-P lyase, in mediating methylphosphonate-driven methane formation61.

Implications of PLM-based approaches for phosphorus cycling in cold seep

Traditional homology-based annotations can only capture sequence space conserved to some degree, leaving a high number of genes unannotated and therefore ‘hidden’ from our sight. Here we show the utility of alignment-free PLMs to improve the functional prediction of phosphorus-cycling proteins within large-scale cold seep metagenomic datasets. The LucaPCycle model enabled the characterization of previously unrecognized alkaline phosphatase families featured by a streamlined core architecture ___domain and a unique C-terminal β-sheet ___domain. The discovery of these enzyme families highlights the hidden diversity of phosphorus-cycling microbial communities and their overlooked ecological functions. The presence of diverse phosphoesters and phosphonates, along with their intermediate metabolites, and the high abundance and expression of phosphorus-cycling genes, all suggest that phosphorus cycling plays a significant yet often underappreciated role in the biogeochemical cycling within cold seep environments. Our study also expanded the taxonomic distribution of phosphorus-cycling genes among the Archaea ___domain. We highlight the unexpected roles of archaea in polyphosphate metabolism, mineral phosphorus solubilization, and both phosphonate substrate-specific and broad-specificity catabolism. Notably, ANME methanotrophs possess the genetic equipment for methylphosphonate-driven methane production via complete C-P lyase operons. Additionally, cold seep viruses possess the genetic capacity to regulate microbial phosphorus-cycling processes primarily through the PhoR-PhoB regulatory system and the PhnCDE transporter. This study underscores the critical need for PLM-based methodologies to uncover the hidden biological functions of these genes, enhancing our understanding of microbially-mediated phosphorus-cycling and its ecological significance in diverse ecosystems. Furthermore, the deep-sea sedimentary ecosystems, particularly beyond ocean water, represents a significant but often overlooked component in global phosphorus cycling.

Methods

Architecture of LucaPCycle

LucaPCycle is a transformer-based dual-channel model (Fig. 1a). The first stage (binary-classification) predicts if a protein sequence has a phosphorus-cycling function. In the second stage (31-classification), it classifies the positive predictions from the first stage into one of 31 specific types. These 31 phosphorus-cycling proteins participate in various phosphorus metabolisms, including phosphoester hydrolysis, phytate hydrolysis, broad-specificity and substrate-specific phosphonate catabolism, polyphosphate metabolism, mineral phosphorus solubilization, phosphonate transport, low- and high-affinity phosphate transport, and two-component systems (Supplementary Data 1). LucaPCycle consists of five modules: Input, Tokenizer, Encoder, Pooler, and Output. Below are descriptions of each module:

  1. (1)

    Input: This module receives the amino acid sequence as input.

  2. (2)

    Tokenizer: This module performs tokenization of the input amino acid sequence at both character and word levels. Character-level tokenization slices protein sequences into amino acid characters. Word-level tokenization employs the Byte Pair Encoding (BPE) algorithm79 from the Subword-NMT tool (https://github.com/rsennrich/subword-nmt) to segment the input sequence into “words”, generating a word-level list.

  3. (3)

    Encoder: This module extracts features from the tokenized lists and includes two submodules. One submodule uses the ESM2-3B model39 to extract features from the character-level list, resulting in the representation matrix M1. This submodule captures universal features and internal structural and contextual information from the sequence. The second submodule based on the Transformer-Encoder40 to derive features from the word-level list, producing the representation matrix M2.

  4. (4)

    Pooler: This module selects essential features and transforms the feature matrix into a vector through down-sampling. Since the Encoder extracts mixed features, especially general ones from ESM2, we use Value-Level Attention Pooling80 to assign learnable weights to each feature (Eqs. (1) and (2)).

    $${e}_{{ik}}=\sum\limits_{j=1}^{{l}_{i}}{\alpha }_{{jk}}{r}_{{ijk}},\,\sum\limits_{j=1}^{{l}_{i}}{\alpha }_{{jk}}=1$$
    (1)
    $${e}_{i}=\left[{e}_{i1},{e}_{i2},\ldots,{e}_{{id}}\right]$$
    (2)

    where, \(i\) is the sample index, \({l}_{i}\) is the length of the \(i-{th}\) sequence, \(d\) is the dimension of the embedding, \({e}_{i}\) is the pooled vector, and \(k\in \left\{{{\mathrm{1,2}}},\ldots,d\right\}\), \({r}_{{ijk}}\) is the value of row \(j\) and column \(k\) of the representation matrix of the \(i-{th}\) sequence.

  5. (5)

    Output: This module concatenates two pooled vectors into a classification vector. It uses the sigmoid-function for binary-classification to generate the positive probability and the Softmax-function for 31-classification to output probabilities for each class.

Dataset for LucaPCycle

We initially constructed a dataset with 214,193 positive samples of 31 phosphorus-cycling proteins and 3,832,374 negative samples. The positive samples are sourced from the PCycDB that represents a comprehensive and accurate database for analyzing phosphorus-cycling genes and microorganisms30. The number of sequences for each phosphorus-cycling protein detailed in Supplementary Fig. 1a. To maintain a reasonable ratio of positive to negative samples, we used CD-HIT (v4.8.1)81 to reduce the redundancy of the negative samples with 50% amino acid identity. The rationale for retaining redundancy in the positive sequences is provided in Supplementary Note 2, Supplementary Table 1, Supplementary Figs. 1718. The clustering process resulted in 853,615 non-redundant negative samples and an approximate positive/negative sample ratio of 1:4 for model building. The negative samples encompass protein sequences involved in intracellular phosphorus metabolism (n = 39,048), nitrogen cycling (n = 9248) and sulfur cycling (n = 62,937). To ensure a diverse set of negative samples, we also incorporated other unrelated functional proteins from KEGG (n = 68,498) and UniProt (n = 673,884), such as protein sequences from viral capsid, DNA replication, RNA transcription, carbon metabolism, signal transduction, and secondary metabolism. This diversity in negative samples aims to enhance the model’s annotation performance and generalization ability. The detailed numbers for each category of negative samples are shown in Supplementary Fig. 1b.

Model building of LucaPCycle

For binary-classification model, all positive samples (n = 214,193) and non-redundant negative samples (n = 853,615, 50% identity) were randomly divided into training, validation, and testing datasets with a ratio of 8:1:1 (Supplementary Fig. 1c). Regarding 31-classification model, the dataset consists of all positive samples without negatives. The sequences of each phosphorus-cycling protein were randomly divided into training, validation, and testing datasets with a ratio of 8:1:1.

The first channel of LucaPCycle applied ESM2-3B (https://github.com/facebookresearch/esm) for contextual representation. The second channel used a 4-layers with 8-heads Transformer-Encoder to characterize the raw sequence. The model training lasted for a maximum of 50 epochs based on the best F1-score in validation dataset for model finalization. We used the AdamW optimizer with betas (0.9, 0.98) and a peak learning rate of 2E-4, incorporating a linear warm-up schedule throughout the learning rate updates. We employed a batch size of 8 with a gradient accumulation step of 8 for the binary-classification model and a batch size of 16 with a gradient accumulation step of 4 for the 31-classification model. The model was trained on 4 Nvidia A100 GPUs.

Model evaluation and benchmarking

We evaluated the binary-classification model and the 31-class classification model using six metrics based on validation and testing datasets, i.e., accuracy, precision (macro), recall (macro), F1-Score (macro), AUC (macro, one-vs-rest), and PR-AUC (macro) (Fig. 1b). To assess the impact of positive sequence redundancy, modeling channels and dataset partition ratios on the performance of the LucaPCycle model, we conducted a comprehensive ablation study. Specifically, the ablation study evaluated different modeling channels and two dataset partition ratios (8:1:1 and 6:2:2) in both binary-classification and 31-class classification. Furthermore, we compared the effects of six levels of positive sequence redundancy clustered using CD-HIT (v4.8.1)81 (50%, 60%, 70%, 80%, 90%, and 100%) on binary-classification. The detailed results are provided in Supplementary Information (Supplementary Note 3, Supplementary Figs. 1920). Considering the presence of less frequent classes in the 31-class classification, we conducted a detailed error analysis to thoroughly evaluate the model’s performance on the 31 phosphorus-cycling categories (Supplementary Note 4 and Supplementary Tables 23).

To further benchmark LucaPCycle, we initially compared it against Diamond Blastp, KofamScan and PLMSearch32. However, due to significant overlap between the sequences predicted by PLMSearch and KofamScan, PLMSearch was ultimately excluded from the comparison. Detailed results are provided in Supplementary Note 5. The benchmarking dataset was predicted from the non-redundant gene (n = 147,289,169) and genome (n = 1878) catalogs from global cold seeps, comprising a total of 151,187,265 sequences (Supplementary Figs. 2122). To annotate these unknown sequences across 31 categories of phosphorus-cycling genes, we employed Diamond Blastp, KofamScan, and LucaPCycle. The predictions from all three methods were subsequently subjected to the ___domain-based validation step (DeepFRI, CLEAN, ECOD) described in detail below. Despite passing validation, the accuracy of these predicted results remains uncertain, as the same ___domain may occur in a variety of unrelated proteins. To enhance the reliability of the benchmark and reduce false positives, we defined the true positives as sequences predicted by at least two of the three methods (Diamond Blastp, KofamScan, and LucaPCycle). This process resulted in a total of 1,521,958 true positive sequences used for benchmarking (Supplementary Fig. 23). Since the sequences in the benchmark dataset can also be captured by homology-based searches, they are considered homologous. The sequence similarity between positive sequences predicted by LucaPCycle and the training dataset was evaluated, with detailed results provided in Supplementary Note 6 and Supplementary Fig. 24. Finally, the verified accuracy, precision, and recall were subsequently calculated using Eqs. (3)−(7).

$${True\; Positives}= ({{Pred\; Positives}}_{{LucaPCycle}}\cap {{Pred\; Positives}}_{{Blastp}}) \\ \cup ({{Pred\; Positives}}_{{LucaPCycle}}\cap {{Pred\; Positives}}_{{KofamScan}}) \\ \cup ({{Pred\; Positives}}_{{Blastp}}\cap {{Pred\; Positives}}_{{KofamScan}})$$
(3)
$${Verified\; Accuracy}=\frac{{Verified\; Positives}}{{Predicted\; Positives}}$$
(4)
$${Verified}\,{{\&}} \, {Hit\; Positives}=\left({Predicted\; Positives}\right)\cap \left({True\; Positives}\right)$$
(5)
$${Precision}=\frac{{Verified}\,{{\&}}\,{Hit\; Positives}}{{Predicted\; Positives}}$$
(6)
$${Recall}=\frac{{Verified}\,{{\&}}\,{Hit\; Positives}}{{True\; Positives}}$$
(7)

Metagenomic and metatranscriptomic datasets and processing

The 165 metagenomes and 33 metatranscriptomes analyzed in this study were sourced from sediment samples (sediment depths: 0–68.55 mbsf; water depths 860–3005 m) collected at 16 globally distributed cold seep sites (Supplementary Data 14). These samples encompass five types of cold seeps, namely gas hydrates (n = 39), mud volcanoes (n = 7), asphalt volcanoes (n = 7), oil and gas seeps (n = 15) and methane seeps (n = 96). Metagenomic datasets from QDN-W07 derived from sediment cores sampled in 2021 and 2023, along with those from active and extinct Haima cold seeps, were used to investigate the temporal dynamics of phosphorus-cycling genes. All metagenomic and metatranscriptomic datasets used in this study were sourced from our previous publications described in detail elsewhere82,83,84.

Non-redundant gene and genome catalogs were constructed as described in our previous publication82. In brief, metagenomic raw reads were quality-controlled and assembled into contigs. The protein-coding sequences were predicted from contigs using Prodigal v2.6.3 (parameter: -meta)85, and then clustered at 95% amino acid identity using CD-HIT (v4.8.1)81. This process yielded a non-redundant gene catalog (n = 147,289,169). MAGs were recovered from contigs with the length longer than 1 kb using six binning tools, including MetaBAT286, MaxBin287, CONCOCT88, SemiBin89, VAMB90 and Rosella (https://github.com/rhysnewell/rosella). All produced MAGs were dereplicated at 95% average nucleotide identity using dRep v3.4.0 (parameters: -comp 50 -con 10)91, resulting in 1878 representative MAGs. Completeness and contamination of MAGs were evaluated using CheckM2 (v1.0.1)92. Functional annotation of MAGs was carried out by searching against KEGG, Pfam, MEROPS and dbCAN databases using DRAM (v1.3.5)93.

With regard to 33 metatranscriptomes, raw reads were quality filtered using the Read_qc module within metaWRAP (v1.3.2)94. The removal of ribosomal RNAs was conducted using sortmeRNA v2.1 with default parameters95.

Identification of phosphorus-cycling proteins based on LucaPCycle and two homology-based methods

We used the trained LucaPCycle to predict 31 phosphorus-cycling proteins from the protein sequences of non-redundant gene catalog and MAGs. Detailed predictions for each protein are shown in Supplementary Fig. 1d. Additionally, the same datasets were also functionally annotated using two conventional methods, blast- and HMM-based approaches. For the blast-based approach, we used Diamond Blastp program (v2.0.15)96 with the filtering criteria of identity ≥ 30.0% and hit length ≥ 25 amino acids, as suggested by Zeng et al. (2022)30. For the HMM-based approach, we used pre-computed HMM profiles of KEGG ortholog via KofamScan (v1.3.0)31. The significant hits related to 31 phosphorus-cycling genes were selected according to KofamScan’s adaptive score threshold and an E-value threshold of 1E-5.

To ensure more precise and reliable results, all protein sequences identified by LucaPCycle, Diamond Blastp, and KofamScan were merged, followed by validation through three distinct methods: ___domain analysis, DeepFRI v1.0.0 (Deep Functional Residue Identification)41, and CLEAN v1.0.1 (Contrastive Learning-Enabled Enzyme Annotation)42. With regard to ___domain analysis, protein sequences were queried against full ECOD ___domain database (v.20231128)97 using Diamond Blastp program (v2.0.15)96 with an E-value threshold of 1E-2. ECOD is a hierarchical ___domain database that describes the evolutionary relationships between pairs of protein domains. A sequence was considered as bona fide phosphorus-cycling protein if supported by at least one of these validation processes. The cross-checked results were used for benchmark test.

To reveal phylogenetic diversity, clustering was conducted on the validated protein sequences using CD-HIT v4.8.1 (parameters: 70% sequence identity, 80% alignment coverage)81 to select representative ones. At this step, all singleton clusters were removed, resulting in 321,789 non-singleton protein families associated with phosphorus-cycling proteins.

Geochemical and metabolomics analyses

The porewater samples from Qiongdongnan cold seep were used for the following geochemical analyses. Total alkalinity (TA) was analyzed onboard by direct titration with an HCl standard solution. The pH of the porewater was monitored using a calibrated pH meter. Phosphate (PO43−) concentrations were determined photometrically using a UV-Vis spectrophotometer (Hitachi U5100, Hitachi Limited, Japan). Other cation-anion concentrations (SO42-, K+, Na+, Mg2+ and Ca2+) were measured using an ion chromatography system (ICS-1100, Thermo, United States) with conductance detector. The total organic carbon (TOC) and total nitrogen (TN) of bulk sediment samples were quantified on an elemental analyzer-isotope ratio mass spectrometer (Vario ISOPOTE Cube-Isoprime, Elementar, Germany).

The untargeted metabolomics analyses were performed on 55 sediment samples collected from the Qiongdongnan and Shenhu cold seeps98. For metabolite extraction, approximately 50 mg of each sediment sample was mixed with a solution of methanol, acetonitrile, and water in a 2:2:1 volume ratio. The samples were vortexed for 30 s, ground with steel beads for 10 min at 45 Hz, and ultrasonicated for 10 min in an ice bath. High-resolution LC-MS/MS analysis was performed on a Waters Acquity I-Class PLUS ultra-high performance liquid tandem with a Waters Xevo G2-XS QTOF high-resolution mass spectrometer. Data collection was performed in MSe mode and processed using Progenesis QI software, utilizing the METLIN database and Biomark’s proprietary library for metabolite identification.

Taxonomy assignment of bacteria and archaea

To obtain the taxonomic profiles for microbial communities driving the process of phosphorus cycling, all sequences from non-singleton phosphorus-cycling protein families were taxonomically assigned using the MMseqs2 taxonomy tool v13.45111 (parameter: --tax-lineage 1)99. This involved performing six-frame translation searches against the GTDB R207 database and assigning each protein sequence to the lowest common ancestor of the best hits for each frame.

The taxonomic classification of MAGs encoding phosphorus-cycling proteins was conducted using GTDB-Tk toolkit (v2.1.1)100 with default parameters against the GTDB R207 database. The assignment of MAG FR_S4_sbin harboring both McrA and PhnJ were further confirmed by the visual inspection of taxonomic trees. Briefly, the concatenated multiple sequence alignment of the MAG FR_S4_sbin and reference ANME genomes accessed from NCBI GenBank were produced based on 43 conserved single-copy genes extracted by CheckM (v1.0.12)101. The maximum likelihood phylogenomic tree was generated using RAxML (v8.2.12)102 with PROTCATLG model and 1000 bootstrap replicates.

Abundance calculations

The gene abundances of non-singleton phosphorus-cycling protein families across 165 metagenomes were quantified using Salmon v.1.9.0 in mapping-based mode (parameters: -validateMappings -meta)103. GPM (genes per million) values were used as a proxy for gene abundance. Similarly, transcript abundances of non-singleton phosphorus-cycling protein families were determined by mapping clean reads from 33 metatranscriptomes to the non-redundant gene catalog using Salmon v.1.9.0 with the aforementioned parameters103. Transcript abundances were calculated as TPM (transcripts per million). It is worth noting that all sequences from non-singleton phosphorus-cycling protein families were considered in these calculations.

At the genome level, the relative abundance of each MAG was profiled by mapping quality-trimmed reads from the cold seep metagenomes against the MAGs using CoverM in genome mode (https://github.com/wwood/CoverM) (v0.6.1; parameters: -min-read-percent-identity 0.95 -min-read-aligned-percent 0.75 -trim-min 0.10 -trim-max 0.90 -m relative_abundance).

Protein structure alignments, ___domain classification and molecular docking

Three-dimensional structures of identified phosphorus-cycling proteins were predicted using the deep-learning artificial intelligence system, via AlphaFold3 web server (May 2024 to June 2024)104. The confidence in the accuracy of the predicted structures was measured by the predicted template modeling (pTM) score. All remote ALP structures exhibited pTM score above 0.5 and 76% of the structures achieving very high confidence scores (pTM > 0.9; Supplementary Data 15). A pTM score above 0.5 means the overall predicted fold for the protein might be similar to the true structure. For PhnGHIJ complex, the interface predicted template modeling (ipTM) score was used to measure the accuracy of the predicted relative positions of the subunits within this complex.

The paired structural alignments were carried out using Foldseek (v8.ef4e960)33 in easy-search mode to align the structures of remote alkaline phosphatases against a library of reference structures drawn from PDB database and manually curated AlphaFold models (Supplementary Data 16). The domains of remote ALPs were analyzed with FoldSeek (v8.ef4e960)33 by comparing to domains defined in the ECOD ___domain databases (v.20231128)97.

Ledock (v1.0; https://www.lephar.com/) was used to predict the binding poses of 89 phosphoester compounds on remote ALP (RMSD: 1.0, number of binding conformations: 20). The binding box of each predicted protein was performed via the DockingPie1 plugin within PyMOL (docking program: Autodock Vina)105. All of the structures were visualized and exported as images using PyMOL.

Phylogenetic analysis

To facilitate a clear and intuitive phylogenetic tree, we used remote ALPs representative sequences from non-singleton phosphorus-cycling protein families. For sequence-based tree, the sequences of remote ALPs were aligned with corresponding reference sequences using MUSCLE v3.8.1551 (default settings)106. The alignments were subjected to phylogenetic tree reconstruction using FastTree v2.1.11 with JC + CAT models107. Structures of remote ALPs derived from AlphaFold 3 modeling, along with reference structures of well-characterized ALPs were subjected to structural tree building using Foldtree (https://github.com/DessimozLab/fold_tree), based on a local structural alphabet. All trees were visualized using iTOL (v6)108.

Identification of viral genomes-containing phosphorus-cycling genes

The assembled contigs with length longer than 10 kb were screened by geNomad pipeline v1.3.3 (parameter: genomad end-to-end)67 for viral prediction. The completeness and contamination of identified viral genomes were determined by CheckV v1.01 (database v1.5)109. Only viral genome-containing phosphorus-cycling genes were included for downstream analyzes. The taxonomy each of phosphorus-cycling viral genome was assigned by geNomad (v1.3.3)67. However, only a few viral genomes could be assigned to specific viral families or genera. Thus, the finer classifications of phosphorus-cycling viral genomes were alternatively performed by PhaGCN v2.0 (cut-off score > 0.5)68. Phage genes and hallmark genes were identified by DRAM-v (v1.3.5)93 and geNomad (v1.3.3)67. The putative virus-host linkages were predicted using iPHoP (v1.3.3)78 and CRISPR spacer matching. For the iPHoP, both default database and custom MAGs from our 165 metagenomes were used to maximize host prediction. For CRISPR spacer matching, the CRISPR arrays were identified using CRISPRCasFinder (v4.2.21)110 from our custom MAGs. Local alignments of extracted spacers with lengths greater than 25 bp against viral genomes were searched using BLASTn. Only BLAST matches with 100% alignment coverage and at most two mismatches were considered as high-confidence matches.

Statistical analysis

All statistical analyses were performed in R (v4.1.0). The normality and variance homogeneity of data were evaluated using Shapiro-Wilk test and Levene’s test, respectively. The two-sided Kruskal-Wallis rank-sum test was employed to compare gene abundance among different phosphorus metabolic processes. Correlation network analyses of environmental factors and phosphorus-cycling gene abundance were based on Spearman’s correlation. Pearson’s correlation and linear regression were performed to assess the relationship between PO43− concentrations and sediment depth. The alpha diversity (Chao1 index) of phosphorus-cycling microbial communities was calculated using the ‘vegan’ package. Variations in alpha diversity were evaluated using the two-sided Wilcoxon rank sum test for different depths and timescales, and the two-sided Kruskal-Wallis rank sum test for different types of cold seep. The bata diversity of phosphorus-cycling microbial communities was estimated using the partial least squares discrimination analysis (PLS-DA) with the ‘mixOmics’ package. PERMANOVA (Permutational Multivariate Analysis of Variance; permutations = 999) tests were used to calculate the statistical differences across different depths, timescales and types of cold seeps. Both alpha and beta diversity indices were calculated based on functional profiles.

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.