Fig. 1: Overview of LucaPCycle for phosphorus-cycling protein annotation.

a Architecture of LucaPCycle. LucaPCycle consists of five modules: Input, Tokenizer, Encoder, Pooler, and Output. The Tokenizer module performs character-level and word-level tokenization of the input sequences. The Encoder module automatically extracts two types of features: the representation matrix from the protein language model and the representation matrix of the raw sequence. The Pooler module selects essential features and transforms each feature matrix into a vector. The Output module concatenates the two pooled vectors for classification. b Performance of the binary and 31-class classification models within LucaPCycle. Six metrics were evaluated on the validation and testing datasets: accuracy, precision (macro), recall (macro), F1-score (macro), AUC (macro, one-vs-rest), and PR-AUC (macro). c Benchmarking of LucaPCycle against KofamScan and Diamond Blastp. Source data are provided as a Source Data file.
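
The data flow in panel a (two representation matrices, pooled separately, then concatenated for classification) can be illustrated with a minimal sketch. This is not the LucaPCycle implementation: the caption does not specify the pooling operation, so masked mean pooling is used here as a stand-in assumption, and the function names (`mean_pool`, `classify`), the dimensions, and the random weights are all hypothetical. Only the standard library is used.

```python
import math
import random


def mean_pool(matrix, mask):
    """Collapse a (length x dim) matrix into a dim-sized vector.

    Assumption: masked mean pooling stands in for the Pooler module;
    the actual pooling used by LucaPCycle is not given in the caption.
    Rows whose mask entry is 0 (padding) are ignored.
    """
    dim = len(matrix[0])
    kept = [row for row, keep in zip(matrix, mask) if keep]
    if not kept:
        return [0.0] * dim
    return [sum(row[j] for row in kept) / len(kept) for j in range(dim)]


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def classify(plm_matrix, seq_matrix, mask, weights, bias):
    """Pool both channels, concatenate the vectors, apply a linear layer."""
    fused = mean_pool(plm_matrix, mask) + mean_pool(seq_matrix, mask)  # list concatenation
    logits = [
        sum(w * x for w, x in zip(row, fused)) + b
        for row, b in zip(weights, bias)
    ]
    return softmax(logits)


# Hypothetical sizes: 12 positions (last 2 are padding), 8-dim language-model
# features, 4-dim raw-sequence features, 31 output classes.
rng = random.Random(0)
length, plm_dim, seq_dim, n_classes = 12, 8, 4, 31
plm_matrix = [[rng.gauss(0, 1) for _ in range(plm_dim)] for _ in range(length)]
seq_matrix = [[rng.gauss(0, 1) for _ in range(seq_dim)] for _ in range(length)]
mask = [1] * 10 + [0] * 2
weights = [[rng.gauss(0, 1) for _ in range(plm_dim + seq_dim)] for _ in range(n_classes)]
bias = [0.0] * n_classes

probs = classify(plm_matrix, seq_matrix, mask, weights, bias)
predicted_class = max(range(n_classes), key=lambda k: probs[k])
```

Pooling each channel separately before concatenation keeps the fused vector at a fixed size (here 8 + 4 = 12) regardless of sequence length, which is what allows a single linear output layer to serve proteins of any length.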