Abstract
Structure-based machine learning algorithms have been utilized to predict the properties of protein-protein interaction (PPI) complexes, such as binding affinity, which is critical for understanding biological mechanisms and disease treatments. While most existing algorithms represent PPI complex graph structures at the atom scale or residue scale, these representations can be computationally expensive or may not sufficiently integrate finer, chemically plausible interaction details for improving predictions. Here, we introduce MCGLPPI, a geometric representation learning framework that combines graph neural networks (GNNs) with MARTINI molecular coarse-grained (CG) models to predict PPI overall properties accurately and efficiently. Extensive experiments on three types of downstream PPI property prediction tasks demonstrate that at the CG scale, MCGLPPI achieves competitive performance compared with its atom- and residue-scale counterparts, but with only a third of the computational resource consumption. Furthermore, CG-scale pre-training on protein ___domain-___domain interaction structures enhances its predictive capabilities for PPI tasks. MCGLPPI offers an effective and efficient solution for PPI overall property predictions, serving as a promising tool for the large-scale analysis of biomolecular interactions.
Introduction
Protein-protein interactions (PPIs) play a pivotal role in regulating diverse cellular processes, including signal transduction, immune response, and metabolic regulation1,2. Gaining insights into PPIs aids in understanding protein functions and identifying potential drug targets3,4,5,6. While traditional experimental techniques for studying PPIs, such as yeast two-hybrid screening7, co-immunoprecipitation8, pull-down assays9, and fluorescence resonance energy transfer (FRET)10, are effective, they often require extensive labor and substantial financial investment. To address these challenges, advancements in computational tools and artificial intelligence (AI) algorithms have transformed the study of PPIs11. These in-silico strategies leverage expansive datasets to predict PPIs, enabling interaction site prediction12, interaction type classification9, and binding affinity prediction13.
The three-dimensional (3D) structures of proteins are fundamental to their biological functions14,15,16. To gain a nuanced understanding of the biological significance and detailed mechanisms underlying PPIs, decoding the geometry of protein complexes has become essential1. Among various computational methods, graph neural networks (GNNs)13,17 stand out with their proficiency in handling the 3D structures of proteins. By integrating the spatial information and topological data inherent to protein complexes, GNNs provide a robust framework for illuminating the multifaceted nature of protein interactions18,19. For instance, Jing et al.20 proposed a GNN framework, GVP-GNN, which preserves rotation equivariance under rigid-body motions of proteins when capturing geometric representations of protein-protein complexes. Zhang et al.21 designed a line-graph-augmented message passing scheme to inject the relative positional information between two interactive edges for different PPI prediction tasks, such as protein-protein interface identification.
Notably, in GNN-based methods, proteins are represented as graph structures, with nodes corresponding to either heavy atoms (i.e., the atom-scale model) or amino acids (i.e., the residue-scale model)21,22. However, each approach has its own trade-offs. Atom-scale models, while detailed, demand extensive computational resources to manage thousands of nodes, limiting their application to large PPI systems. On the other hand, residue-scale models are more computationally tractable but may overlook critical binding details that influence specificity and affinity. To address these limitations, multi-scale information can be integrated into the node features and edge connections. However, such integration requires intricate information exchange across scales while maintaining model consistency and physical relevance, which can complicate the design process. Additionally, in both atom- and residue-scale models, edges typically represent interactions defined by sequence-separation thresholds or geometric distance cutoffs, aiming to capture the complex relationships between protein structures and functions. Nevertheless, using such criteria to define connections may misrepresent actual chemical bonds, potentially affecting predictive accuracy.
A potential solution to these issues is to adopt coarse-grained (CG) modeling, a well-established framework in protein molecular dynamics (MD) simulation designed to strike a balance between maintaining essential molecular details and enhancing computational efficiency. CG-scale representation simplifies groups of atoms into single sites, such as amino acid side chains or specific chemical groups. The MARTINI model23,24, a widely recognized CG-scale model in protein MD simulation, represents an average of four heavy atoms and their associated hydrogens with a single CG bead. It classifies beads into several main physical types, including polar (P), nonpolar (N), apolar (C), and charged (Q), with subtypes based on hydrogen bonding capabilities or polarity. In addition to the various bead types, the model includes numerous chemically plausible interaction parameters, both bonded (bonds, angles, and dihedrals) and nonbonded, to directly and accurately reflect the partitioning free energy of amino acid sidechains24,25. Through this strategy, the MARTINI model retains essential molecular interaction features while significantly reducing computational demands. It has been successfully applied and evaluated in many PPI-related studies, including the dimerization of amino acid side chains26, interactions involving membrane proteins such as glycophorin A27, G protein-coupled receptor (GPCR) rhodopsin28,29, and the Epidermal Growth Factor Receptor complex30, as well as soluble protein complex interactions29 such as insulin, Ras-Raf, and Barnase-Barstar. Additionally, the MARTINI force field has been integrated into the HADDOCK framework31, enhancing its capability to predict the 3D structures of protein interactions.
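To make the roughly four-to-one mapping concrete, the sketch below shows an illustrative MARTINI2-style bead assignment for leucine (the atom grouping follows the published mapping in spirit, but treat the exact names and bead labels as an assumption rather than the authoritative topology):

```python
# Illustrative MARTINI2-style coarse-graining of leucine (LEU):
# each CG bead groups roughly four heavy atoms (hydrogens are implicit).
LEU_MAPPING = {
    "BB": ["N", "CA", "C", "O"],        # backbone bead
    "SC1": ["CB", "CG", "CD1", "CD2"],  # single apolar side-chain bead
}

n_heavy_atoms = sum(len(atoms) for atoms in LEU_MAPPING.values())
n_beads = len(LEU_MAPPING)
print(f"{n_heavy_atoms} heavy atoms -> {n_beads} CG beads")  # 8 heavy atoms -> 2 CG beads
```

The same idea extends residue by residue: a full protein of thousands of heavy atoms collapses to a few hundred beads, which is the source of the computational savings discussed above.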
Although the CG scale offers improved efficiency, its simulations still consume more resources than PPI predictions using AI techniques. Previous efforts to integrate CG-scale models with machine learning (ML) or deep learning (DL) methods have primarily focused on optimizing force field potential parameters, predicting peptide self-assembly shapes, and converting CG-scale models back to atomistic structures32,33. However, a comprehensive approach that combines AI and CG modeling to predict PPI properties remains an under-explored area.
In this study, we present MCGLPPI, a lightweight geometric representation learning framework that combines GNNs with the MARTINI CG-scale models to predict the overall properties of PPI complexes. Designed to optimize computational efficiency without compromising prediction accuracy, MCGLPPI employs a specially designed CG-scale complex graph, which maps each CG bead of protein complexes to a node and utilizes chemically plausible MARTINI force field bond parameters as edges for efficient structural characterization. Additionally, we introduce a GNN-based CG geometric-aware encoder to extract high-quality representations from the devised graph.
Our extensive validation demonstrates that MCGLPPI achieves competitive performance on multiple curated overall property prediction benchmarks for PPI structures, including binding affinity-relevant prediction and interaction type classification tasks. When compared to atom-scale and residue-scale counterparts, MCGLPPI significantly improves computational efficiency, reducing graphics processing unit (GPU) usage and total running time by more than threefold without compromising accuracy. Moreover, proteins are intricate molecular machines that typically consist of multiple domains. Domain-___domain interactions (DDIs) are critical subsets of PPIs, where the interaction typically occurs between domains rather than between the entire proteins34,35,36. We demonstrate that CG-scale pre-training based on DDI patterns effectively enhances the model’s ability to predict PPI binding affinities. Overall, MCGLPPI emerges as a general, accurate, and efficient method for predicting PPI properties, offering a pathway to sophisticated analysis of biomolecular interactions.
Results
Overview of the proposed CG-scale complex geometric learning framework
We integrate biomolecular CG structures, force field parameters, and geometric-aware GNNs within the MCGLPPI framework for efficient prediction of overall properties of protein-protein complexes, which consists of three major components: (1) CG-scale complex graph generation, (2) CG-scale geometric representation learning, and (3) DDI-based CG-scale graph encoder pre-training. A comprehensive overview of the framework and its components is provided in Fig. 1.
a, CG-scale protein graph generation. The atomistic structure of a protein-protein complex is transformed into its coarse-grained (CG)-scale structure and force field parameters using the MARTINI engine (MARTINI22 and MARTINI3 are supported in this study), with these parameters encompassing bead types and bonded interactions, which further include bonds, angles, and dihedrals. Based on these structural details and parameters, we create a CG-scale protein graph where beads are represented as nodes. Bonds between these beads are represented as edges in the graph, and information on angles and dihedrals is encoded as node features. b, CG-scale geometric representation learning. The specifically designed CG-scale protein complex graph, containing comprehensive information on MARTINI beads and bonds, is first cropped to identify its core interaction region. The geometric representation of this cropped graph is extracted by the corresponding CG graph encoder and then fed into a protein prediction network to predict the overall property of the complex. c, DDI-based CG-scale graph encoder pre-training. The graph encoder can be better initialized by CG-scale pre-training techniques applied to our carefully screened ___domain-___domain interaction (DDI) dataset. After pre-training, the graph encoder with updated model parameters can be fine-tuned to generate geometric representations with potentially greater capability for downstream prediction tasks. CG: coarse-grained. PPI: protein-protein interaction.
Force field parameter and CG-scale complex graph generation
Structure-based prediction of PPI complex properties typically demands high-quality learning of protein geometric graph representations. The numbers of graph nodes and edges significantly affect the computational cost. At the same time, it is crucial to ensure that the graph structure is chemically plausible, as this is essential for accurately depicting the properties of protein complexes.
Building on this, we introduce the CG-scale MARTINI parameterization, which aims to strike an efficient balance between chemically plausible interaction characterization and computational cost. This process commences by transforming an atomistic PPI structure into a CG-scale structure and a comprehensive set of CG-scale force field parameters tailored for the MARTINI model (the extensively used MARTINI2223,37 and the latest MARTINI324 are examined; a crucial difference is that MARTINI3 introduces richer bead types and larger bead numbers to slightly increase the bead resolution, and their detailed version characteristics are in Supplementary Note 1). This simplification reduces the high-resolution atomic model into a computationally lighter form by grouping multiple atoms into fewer representative beads. The resulting parameters describe how these beads interact with each other chemically and physically from different perspectives (Fig. 2).
Each residue is represented by one backbone bead (\(B\)) and zero to five side-chain beads (\(S\)), depending on the residue type (left) and the MARTINI version. The bonded interaction parameters within and between amino acids are shown (to effectively depict their local interaction structures), including bonds (types and lengths), angles (formed by triples of beads), and dihedrals (formed by quadruplets of beads). The right panel provides an in-depth view of the constructed CG-scale complex graph based on these efficient parameters, showcasing both the edge connections and the node features. It should be noted that in the MARTINI2 CG model, ALA has no side chain bead, whereas in MARTINI3 it has one; TRP includes four side chain beads in MARTINI2 and five in MARTINI3. CG: coarse-grained.
After integrating the structural data with the force field parameters, a multi-relational graph corresponding to the protein complex is constructed (Figs. 1a and 2). Within this graph, each bead, representing a group of heavy atoms, becomes a node. The bonds between backbone beads (\(B\)), or between sidechain beads (\(S\)) and either sidechain or backbone beads, defined by their type and length, are translated into edges that connect these nodes. It is worth noting that these nodes and edges are concise (i.e., the total numbers required to depict a protein complex are relatively low; the corresponding statistics and further analysis of their effect on reducing computational overhead are provided in Supplementary Table 1), ensuring efficient protein modeling while maintaining chemical accuracy.
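As a rough illustration of the bookkeeping described above, the following sketch assembles beads and typed bonds into a small multi-relational graph (the class, bead-type strings, and relation labels are hypothetical simplifications, not the framework’s actual data structures):

```python
from dataclasses import dataclass, field

@dataclass
class CGGraph:
    """Minimal multi-relational CG complex graph (illustrative only)."""
    node_types: list = field(default_factory=list)  # MARTINI bead type per node, e.g. "P5"
    coords: list = field(default_factory=list)      # (x, y, z) per bead
    edges: list = field(default_factory=list)       # (src, dst, relation) triples

    def add_bead(self, bead_type, xyz):
        self.node_types.append(bead_type)
        self.coords.append(xyz)
        return len(self.node_types) - 1             # node index

    def add_bond(self, i, j, relation):
        # relation distinguishes e.g. helix vs. coil backbone bonds,
        # and intra- vs. inter-residue connections
        self.edges.append((i, j, relation))
        self.edges.append((j, i, relation))         # undirected -> two directed edges

# Toy two-residue fragment: two backbone beads joined by a coil-type bond
g = CGGraph()
b0 = g.add_bead("P5", (0.0, 0.0, 0.0))
b1 = g.add_bead("P5", (3.5, 0.0, 0.0))
g.add_bond(b0, b1, "backbone_coil")
print(len(g.node_types), len(g.edges))  # 2 nodes, 2 directed edges
```

In the real framework, angle and dihedral parameters would additionally be attached as node features, as described in the surrounding text.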
Within the MARTINI framework, the protein’s secondary structure plays a pivotal role in determining the bead types and associated bond, angle, and dihedral parameters for each residue. For instance, specific bond types such as constraint bonds \({d}_{{B}_{i}{B}_{i+1}}(H)\) or long harmonic bonds \({d}_{{B}_{i}{B}_{i+3}}(E)\) and \({d}_{{B}_{i}{B}_{i+4}}(E)\) are used for regions designated as helices \((H)\) or extended strands \((E)\), while other backbone bond parameters \({d}_{{B}_{i}{B}_{i+1}}({CTS})\) are adopted for irregular secondary structures such as coils, turns, and bends. In our CG-scale complex graph, the edge types also reflect these distinctions, facilitating the accurate description of secondary structural features within the protein complex. Furthermore, two distinct edge types, \({d}_{{intra}}\) and \({d}_{{inter}}\), are introduced for the differentiation of bead nodes originating from the same or different amino acid residues, providing valuable hierarchical geometric information regarding the spatial arrangement relationships and interactions within and between the residues. Additionally, other crucial force field parameters, such as bead types, bond angles, and dihedrals, are encoded as node features within the graph (as illustrated in Fig. 2). These features are essential for capturing the spatial orientation and potential movements of the protein segments.
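This secondary-structure-dependent edge typing can be pictured as a small lookup, sketched below with heavily simplified conditions (the function, labels, and sequence-separation checks are illustrative; MARTINI’s actual assignment rules are considerably richer):

```python
def backbone_edge_type(ss_i, ss_j, seq_sep):
    """Illustrative mapping from secondary-structure labels (H: helix,
    E: extended strand, C/T/S: coil/turn/bend) and backbone sequence
    separation to a CG edge type; simplified relative to MARTINI."""
    if ss_i == "H" and ss_j == "H" and seq_sep == 1:
        return "d_BB(H)"    # constraint bond within a helix
    if ss_i == "E" and ss_j == "E" and seq_sep in (3, 4):
        return "d_BB(E)"    # long harmonic bond across an extended strand
    if seq_sep == 1:
        return "d_BB(CTS)"  # default backbone bond for coils/turns/bends
    return None             # no backbone edge for other bead pairs

print(backbone_edge_type("H", "H", 1))  # d_BB(H)
print(backbone_edge_type("C", "T", 1))  # d_BB(CTS)
```

The \(d_{intra}\)/\(d_{inter}\) distinction mentioned above would be a second, orthogonal relation label, assigned by whether the two beads belong to the same residue.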
Furthermore, when MARTINI generates the force field parameters of bond lengths, angles, and dihedrals, it provides the bead composition (i.e., which beads form each bond, angle, and dihedral) and reference values for these bond lengths and angles. These values are not derived from the corresponding real conformations; instead, they are statistical values computed over samples in the Protein Data Bank (PDB)25,38 database. To make these values specific to each individual structure for accurate CG-graph construction, we re-calibrate them based on the actual coordinates and assign them as the appropriate features. Please refer to The construction of CG-scale protein complex graph and its cropping function section in the “Methods” section for further details on defining the aforementioned multi-relational graph.
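Re-calibrating these parameters against the deposited coordinates amounts to standard vector geometry; a minimal pure-Python sketch is below (the function names are ours, not the framework’s):

```python
import math

def _sub(a, b): return tuple(x - y for x, y in zip(a, b))
def _dot(a, b): return sum(x * y for x, y in zip(a, b))
def _cross(a, b):
    return (a[1]*b[2] - a[2]*b[1], a[2]*b[0] - a[0]*b[2], a[0]*b[1] - a[1]*b[0])
def _norm(a): return math.sqrt(_dot(a, a))

def bond_length(a, b):
    """Distance between two bead coordinates."""
    return _norm(_sub(b, a))

def bond_angle(a, b, c):
    """Angle (degrees) at bead b formed by the triple a-b-c."""
    u, v = _sub(a, b), _sub(c, b)
    cosang = _dot(u, v) / (_norm(u) * _norm(v))
    return math.degrees(math.acos(max(-1.0, min(1.0, cosang))))

def dihedral(a, b, c, d):
    """Signed dihedral (degrees) for the bead quadruplet a-b-c-d."""
    b0, b1, b2 = _sub(b, a), _sub(c, b), _sub(d, c)
    n1, n2 = _cross(b0, b1), _cross(b1, b2)
    m1 = _cross(n1, tuple(x / _norm(b1) for x in b1))
    return math.degrees(math.atan2(_dot(m1, n2), _dot(n1, n2)))

print(bond_angle((1, 0, 0), (0, 0, 0), (0, 1, 0)))  # 90.0
```

Each recomputed value then replaces the PDB-averaged MARTINI default as the feature attached to the corresponding edge or node.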
CG-scale geometric representation learning
To reduce computational overhead while preserving the integrity of the data for different PPI structures, we implemented a residue backbone distance-based, dual-strategy approach to graph cropping on the CG-scale complex graphs derived earlier (Fig. 1b). The first strategy, core region cropping, focuses on extracting the interaction interface between two proteins (or defined interaction parts of structures beyond dimers), ensuring the focus stays on the most critical region of the interaction and likely enhancing model prediction accuracy and relevance. The second strategy involves an adjacent region cropping scheme that captures peripheral but potentially significant structural information, such as essential spatially correlated motifs surrounding the core interface. Through these strategies, we can produce a graph that balances detailed structural information retention with computational feasibility, regardless of the interaction pattern. The specific cropping details are in The construction of CG-scale protein complex graph and its cropping function section of the “Methods” section.
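The dual-strategy cropping could be approximated as follows, keeping residues whose backbone beads fall within a core cutoff of the partner chain plus a wider adjacent shell (the cutoff values and function are hypothetical illustrations, not the paper’s exact scheme):

```python
import math

def crop_interface(chain_a, chain_b, core_cutoff=10.0, shell_cutoff=14.0):
    """Toy dual-strategy crop. chain_a/chain_b are lists of backbone-bead
    (x, y, z) coordinates, one per residue; returns the kept residue
    indices for each chain. Cutoffs (in angstroms) are illustrative."""
    def keep(chain, other):
        # core region: residues close to any residue of the partner chain
        core = {i for i, p in enumerate(chain)
                if any(math.dist(p, q) <= core_cutoff for q in other)}
        # adjacent region: a surrounding shell just outside the core
        shell = {i for i, p in enumerate(chain)
                 if i not in core
                 and any(math.dist(p, q) <= shell_cutoff for q in other)}
        return sorted(core | shell)
    return keep(chain_a, chain_b), keep(chain_b, chain_a)

a = [(0.0, 0.0, 0.0), (0.0, 0.0, 30.0)]  # second residue far from chain B
b = [(5.0, 0.0, 0.0)]
print(crop_interface(a, b))  # ([0], [0])
```

The subgraph induced by the kept residues is what the encoder actually sees, which is why cropping directly bounds memory and runtime.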
We then apply the cropping method to each complex sample in our curated downstream datasets, which cover two binding affinity-related regression tasks and one interface type classification task. These tasks span a range of complexes, from the formation of simple dimers to the binding of the T cell receptor (TCR) to an antigenic peptide presented by the major histocompatibility complex (pMHC)39,40.
We subsequently utilize a multi-relational heterogeneous GNN-based CG graph encoder21, which can efficiently encode the complex relationships between graph nodes and edges (detailed in The CG-scale representation learning for complex overall property prediction section of “Methods” section) within the cropped graph for generating its high-quality geometric representation. This representation is then forwarded to the task-specific prediction network, enabling us to obtain accurate predictions of the corresponding complex overall properties.
DDI-based CG-scale graph encoder pre-training
Domains are fundamental structural units within proteins that are often responsible for specific functions. They play a critical role in mediating interactions with other proteins34,35,36, whether within a single multifaceted protein (intra-protein interactions) or between two distinct proteins (inter-protein interactions). Despite the limited availability of detailed yet labelled 3D structural data for PPIs, the wealth of DDI structural information provides a valuable opportunity for enhancing computational models through pre-training. To this end, we use the Three-Dimensional Interacting Domains (3DID) database34 to construct a dataset tailored for pre-training our CG-scale graph encoder. The detailed curation process is described in The detailed curation process for the 3DID pre-training dataset section of “Methods” section.
We employ a denoising-based, self-supervised pre-training approach, adapted from the work by Zhang et al.22, to instruct our CG graph encoder on the intricate patterns of DDI structures and sequences. This method involves introducing perturbations to each CG graph in the pre-training DDI dataset and then forcing the encoder to reconstruct the original graph information, thereby imprinting the fundamental characteristics of ___domain interactions (see Fig. 1c and The DDI-based CG graph encoder pre-training technique section in “Methods” for more details). Following this pre-training phase, the encoder, now enriched with knowledge from the DDI dataset, undergoes fine-tuning on downstream PPI prediction tasks. Through this fine-tuning process, the encoder applies the principles of ___domain interactions learned during pre-training to downstream PPI scenarios, potentially enhancing its ability to make predictions.
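A minimal sketch of the corruption step in such denoising pre-training is shown below: Gaussian noise is added to bead coordinates, and the (noisy graph, injected noise) pair becomes the encoder’s training signal (the single-step formulation and noise scale are simplifications of the diffusion-based scheme; names are ours):

```python
import random

def perturb(coords, sigma=0.5, seed=0):
    """Corrupt CG bead coordinates with isotropic Gaussian noise; returns
    (noisy_coords, noise). A denoising encoder is then trained to recover
    the injected noise (equivalently, the clean graph) from the corrupted
    input. sigma is an illustrative scale, not the paper's noise schedule."""
    rng = random.Random(seed)
    noisy, noise = [], []
    for x, y, z in coords:
        n = (rng.gauss(0, sigma), rng.gauss(0, sigma), rng.gauss(0, sigma))
        noisy.append((x + n[0], y + n[1], z + n[2]))
        noise.append(n)
    return noisy, noise

clean = [(0.0, 0.0, 0.0), (3.5, 0.0, 0.0)]
noisy, noise = perturb(clean, sigma=0.3)
# Subtracting the injected noise exactly restores the clean coordinates,
# which is the reconstruction target of the self-supervised objective.
```

Because the target is fully determined by the corruption, no labels are needed, which is what lets the large unlabeled 3DID repository be used for pre-training.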
MCGLPPI saves computational cost while keeping competitive performance
To validate the performance and computational cost of the proposed MCGLPPI framework on PPI complex overall property predictions, we first curated three datasets: (1) the strict protein-protein dimer subset of the PDBbind dataset41 (PDBbind-strict-dimer dataset), (2) the ATLAS dataset39, and (3) the MANY/DC dataset42,43. The former two datasets were used to evaluate the model’s regression capabilities (protein-protein binding affinity predictions), while the MANY/DC dataset was used to assess the overall classification performance (protein complex interface classifications). Pearson’s correlation coefficient (\({R}_{{\rm{P}}}\)), root mean square error (RMSE), and mean absolute error (MAE) were utilized to assess the quality of the regressions. The area under the receiver operating characteristic curve (AUROC) and the area under the precision-recall curve (AUPR) were used to assess classification capability. To ensure a fair comparison, we performed the same aforementioned complex graph cropping function for each sample in every dataset (across different scales) to identify the sample’s core interaction regions. Additionally, we (1) compared the performance of MCGLPPI supported by the MARTINI22 (denoted as MCGLPPI-M2) and MARTINI3 (denoted as MCGLPPI-M3) models, and (2) further extended MCGLPPI to handle protein-protein binding affinity change regression, which requires pairwise complex structures as input (Supplementary Table 5).
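For reference, the regression metrics named above can be computed as follows (a plain-Python sketch; in practice library implementations such as scipy.stats.pearsonr would normally be used):

```python
import math

def pearson_r(y_true, y_pred):
    """Pearson's correlation coefficient R_P between labels and predictions."""
    n = len(y_true)
    mt, mp = sum(y_true) / n, sum(y_pred) / n
    cov = sum((t - mt) * (p - mp) for t, p in zip(y_true, y_pred))
    st = math.sqrt(sum((t - mt) ** 2 for t in y_true))
    sp = math.sqrt(sum((p - mp) ** 2 for p in y_pred))
    return cov / (st * sp)

def rmse(y_true, y_pred):
    """Root mean square error."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

def mae(y_true, y_pred):
    """Mean absolute error."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

print(round(pearson_r([1, 2, 3], [2, 4, 6]), 6))  # 1.0 (perfect linear correlation)
```

Note that \({R}_{{\rm{P}}}\) rewards linear agreement regardless of scale, while RMSE and MAE penalize absolute deviation, which is why the paper reports all three together.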
The binding affinity prediction of the formation of strict dimers
We successfully extracted protein-protein complexes exhibiting strict dimer structures from the PDBbind dataset41. Following sample correction and label unification (i.e., converting the binding affinity labels of all relevant samples to \(\triangle {\rm{G}}\))44, we obtained 1270 dimer samples with binding affinity labels \(\triangle {\rm{G}}\), referred to as the PDBbind-strict-dimer dataset (the detailed curation process can be found in Supplementary Note 2). The standard tenfold cross-validation (CV) strategy was used to evaluate the model: the samples were uniformly split into 10 folds, and in each iteration one fold was selected as the test set while the remaining folds formed the training set. To ensure a fair comparison of model performance across different scales, we compared the atom- and residue-scale versions of our employed protein graph encoder, GearNet-Edge21, using their default model settings (i.e., settings related to protein graph construction and geometric encoder hyper-parameters). Besides, we considered an atom-scale state-of-the-art geometric encoder, GVP-GNN20, specifically designed for modeling 3D macromolecular structures, particularly protein-protein complexes. Detailed information on the default hyper-parameters of all methods can be found in Supplementary Note 3.
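The tenfold split can be sketched in a few lines (the seed and fold-assignment scheme are illustrative defaults, not necessarily those used in the paper):

```python
import random

def kfold_indices(n_samples, k=10, seed=42):
    """Yield (train, test) index lists for standard k-fold cross-validation.
    Samples are shuffled once, then dealt round-robin into k folds."""
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [j for f in folds[:i] + folds[i + 1:] for j in f]
        yield train, test

# Every sample appears in exactly one test fold across the k iterations
splits = list(kfold_indices(1270, k=10))
print(len(splits), len(splits[0][1]))  # 10 folds, 127 test samples each
```

Metrics are then averaged (or pooled) over the ten test folds, which is how the tables in this section were produced.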
Furthermore, to comprehensively quantify the cost of these approaches under limited lightweight computational resources, we utilized a single NVIDIA A100 GPU (40 GB) to run the comparative experiments. For each approach, with the epoch number fixed at 150, we started with a batch size of 8 and gradually increased it by a factor of 2 until the GPU ran out of memory (OOM), recording the corresponding evaluation metrics, memory usage, and total time cost across the aforementioned tenfold CV.
For the atom- and residue-scale GearNet-Edge, 915 of the 1270 samples were successfully identified. To ensure a fair comparison, this 915-sample subset of the PDBbind-strict-dimer dataset was first used for the comparative experiment. Table 1 presents the corresponding results. Under the current experimental conditions, the key findings were as follows: (1) MCGLPPI outperformed its atom- and residue-scale counterparts. (2) Under the same batch size, MCGLPPI reduced GPU consumption by approximately 5\(\times\) and 3\(\times\), and total elapsed time by approximately 3\(\times\) in both cases, compared to the atom- and residue-scale models, respectively, while maintaining competitive performance. These findings demonstrated the effectiveness and feasibility of introducing the MARTINI-based CG-scale representation to achieve a better balance between performance and computational cost. Additionally, MARTINI3-based MCGLPPI performed moderately better under the best batch size (64) than its MARTINI22-based counterpart, though with slightly increased computational overhead due to the expanded bead types and numbers.
The experimental results of MCGLPPI-M2 and MCGLPPI-M3 on the complete PDBbind-strict-dimer dataset were 0.590/0.583 (\({R}_{{\rm{P}}}\)), 2.071/2.067 (RMSE), 1.602/1.566 (MAE), 11,560/13,311 (GPU (MB)), and 14,312/17,446 (Time (s)). Besides, to validate the robustness of the proposed MCGLPPI, we further investigated (1) the impact of hyper-parameters, such as hidden feature dimensions, on overall performance (Supplementary Note 4), (2) a more challenging training and test scenario with further structural homology reduction (Supplementary Table 3), and (3) model stability using AlphaFold-generated structures (Supplementary Table 4). These investigations consistently supported the conclusions outlined above.
The effectiveness of MCGLPPI on more complex PPI patterns
To further investigate the effectiveness of MCGLPPI in handling more complex PPI structures beyond standard dimers, the ATLAS dataset39, which contains the TCR-pMHC structures formed in the cell-mediated immunity processes along with their corresponding binding affinity values, was considered. After removing invalid samples, correcting samples, and unifying labels, we obtained 531 different structures with the \(\triangle {\rm{G}}\) labels. Please note that we utilized the structures that were optimized using the fixed backbone design option of Rosetta45, which were reported to achieve high structural accuracy39.
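Label unification of this kind typically converts measured dissociation constants into free energies via \(\triangle {\rm{G}}=RT\ln {K}_{{\rm{d}}}\); a small sketch under common unit conventions is below (the temperature and unit choices are our assumptions, not necessarily the paper’s):

```python
import math

R_KCAL = 1.987e-3  # gas constant in kcal/(mol*K)

def kd_to_dg(kd_molar, temp_k=298.0):
    """Convert a dissociation constant K_d (in molar) into a binding free
    energy dG = R*T*ln(K_d), in kcal/mol; 298 K is an assumed default.
    Tighter binders (smaller K_d) give more negative dG."""
    return R_KCAL * temp_k * math.log(kd_molar)

print(round(kd_to_dg(1e-9), 2))  # a 1 nM binder gives roughly -12.27 kcal/mol
```

Converting all affinity labels (Kd, Ki, IC50-derived, etc.) onto a single \(\triangle {\rm{G}}\) scale is what allows heterogeneous measurements to be regressed with one model.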
We performed the aforementioned standard tenfold cross-validation using the same experimental settings as the previous section and documented the corresponding evaluation results. Furthermore, the comprehensive comparison experiments were carried out based on 451 of the 531 curated samples that could be effectively processed by GearNet-Edge at both the atom- and residue-scale.
Table 2 shows the predictive performance and computational cost derived from the tenfold cross-validation performed on the 451-sample ATLAS subset. Additionally, we reported the best-performing results of MCGLPPI-M2 and MCGLPPI-M3 on the complete curated ATLAS dataset: 0.809/0.823 (\({R}_{{\rm{P}}}\)), 1.116/1.053 (RMSE), 0.837/0.803 (MAE), 13,615/16,108 (GPU (MB)), and 6982/7915 (Time (s)). Notably, when dealing with more complex protein-protein structures beyond standard dimers, the proposed MCGLPPI maintained the competitive performance and exhibited a relatively lower computational cost compared with its atom- and residue-scale counterparts, which further validated the effectiveness of the devised CG-scale protein complex geometric model and corresponding cropping function. An additional investigation into the necessity of the cropping function is conducted in Influence of graph cropping on overall model efficiency section.
The prediction results for protein-protein interface classification
In addition to the aforementioned two regression tasks, an overall interface classification task for protein–protein complexes was incorporated to further examine the generalizability of MCGLPPI. Specifically, the MANY42 and DC43 datasets were utilized, containing 5739 and 161 dimers, respectively. These dimers are categorized into two overall types: dimers with biological interfaces and dimers with crystal interfaces46. Based on this classification, the model was trained to distinguish between the two interface types, which was formulated as a binary complex graph classification task. Following the previous data splitting convention1,18, 80% of the MANY samples, 20% of the MANY samples (maintaining the balance between positive and negative samples during splitting), and the complete DC dataset were used as the training, optional validation, and test sets for model evaluation, respectively.
The experimental settings from the previous two sections were retained (except that the unified epoch number was changed from 150 to 30). Additionally, we compared our approach with two existing approaches, DeepRank-GNN18 and EGGNet1, which had already been tested on the complete MANY/DC dataset. However, it should be noted that the numbers of samples that could be effectively processed by the atom- and residue-scale GearNet-Edge on the MANY and DC datasets were 5535 and 151, respectively. Moreover, the node feature construction in existing approaches like DeepRank-GNN relies on a time-consuming external amino acid sequence alignment search, making it difficult to fairly compare computational cost. Therefore, we only compared their predictive performance on the complete MANY/DC dataset and conducted detailed computational cost comparison experiments for the atom- and residue-scale GearNet-Edge models on the 5535/151-sample subset (following the aforementioned data splitting mode).
The results of the computational cost comparison experiments are shown in Table 3. Compared to its atom- and residue-scale counterparts, MCGLPPI achieved lower computational cost while surpassing their predictive capability. Specifically, MCGLPPI-M2 and MCGLPPI-M3 exhibited strong performance when evaluated with a batch size of 64, achieving AUROC values of 0.890 and 0.882, and AUPR values of 0.871 and 0.881, respectively. Overall, both models outperformed the atom-scale and residue-scale models across different batch sizes. This improvement could be attributed to the integration of protein thermodynamics and specific secondary structure information through the MARTINI force field, injected into the bonds (edges) of the CG complex graph, which provides extra discriminative capability compared with the atom- and residue-scale counterparts. A further investigation into the importance of different CG graph edges is presented in the Performance of the geometries considered in CG-scale complex graphs section.
Furthermore, the performance of MCGLPPI supported by MARTINI22 (AUROC: 0.895, AUPR: 0.892) also surpassed both DeepRank-GNN (AUROC: 0.865, AUPR: 0.871) and EGGNet (AUROC: 0.869, AUPR: 0.863) on the complete MANY/DC dataset (the results of DeepRank-GNN and EGGNet were retrieved from refs. 1,18). This observation further supported the effectiveness and generalizability of the MCGLPPI framework for predicting the overall properties of PPI complexes.
The investigation of CG-scale pre-training techniques on different tasks
To explore the feasibility of our hypothesis that pre-training on CG-scale informative DDI complexes could benefit the downstream property predictions of CG complexes, especially in scenarios with limited labeled samples, we constructed a dataset from the 3DID database34, which comprises 41,663 representative DDI structures, serving as our pre-training repository. Furthermore, we implemented a CG-scale diffusion denoising-based self-supervised pre-training technique based on ref. 22, which enables us to capture and learn the general DDI patterns and knowledge.
Specifically, we first determined the optimal model settings for MCGLPPI on each downstream task by training from scratch. Using the same selected settings (for each task), we fine-tuned the pre-trained CG graph encoder on each respective downstream task (with the same epoch numbers as training from scratch). We then compared the performance difference between training from scratch and pre-training followed by fine-tuning. Moreover, to further explore the potential of DDI dataset-based pre-training in enhancing PPI prediction models, we extended our performance comparison to the atom and residue scales (i.e., pre-training those models on the 3DID pre-training set using the corresponding original-scale pre-training settings22). It should be noted that the atom- and residue-scale models (i.e., GearNet-Edge) could only recognize 33,144 out of the 41,663 DDI samples. Therefore, we selected these 33,144 samples as the common pre-training set and transformed them into their respective atom-, residue-, and CG-scale graphs to compare the performance across different scales (before and after pre-training) (Fig. 3a).
a, Performance influence brought by the DDI diffusion denoising-based pre-training on the three downstream datasets. “M2-scale”, “M3-scale”, “Atom-scale”, “Res-scale” represent the MARTINI22 CG-scale pre-training on the 3DID 33144-set, MARTINI3 CG-scale pre-training on the 33144-set, atom-scale pre-training on the 33144-set, and residue-scale pre-training on the 33144-set, respectively; and the blue and red bars display the results obtained after and before imposing the pre-training, respectively. b, Performance influence of removing different geometric characterization components from the devised MCGLPPI-M2 complex graph. The respective experiments on the removal of different components (based on PDBbind) are indicated on top of each sub-figure, and the yellow and red bars give the results acquired before and after removing the corresponding components, respectively. c, Performance influence related to the complex cropping function. Based on the maximum reachable batch size of MCGLPPI(-M2) without the complex cropping function (i.e., 32), the performance comparison results of MCGLPPI before (blue bars) and after (red bars) disabling the cropping function on the ATLAS dataset, including the \({R}_{{\rm{P}}}\), GPU consumption, and total tenfold CV running time, are provided. CG: coarse-grained. \(\uparrow\) and \(\downarrow\) represent the higher/lower the metric value, the better the evaluation performance. Source data are provided as a Source Data file.
In the PPI binding affinity prediction tasks on the PDBbind and ATLAS datasets, taking MCGLPPI-M2 as an example, pre-training improved the \({R}_{{\rm{P}}}\) from 0.597 to 0.606 and from 0.825 to 0.830, respectively, indicating that pre-training can effectively enhance model performance. However, for the interface type classification task on the MANY/DC dataset, the performance actually decreased after pre-training, with AUPR dropping from 0.880 to 0.866 (additional results evaluated using other metrics and those tested on the complete 3DID dataset for MCGLPPI are reported in Supplementary Table 2). Meanwhile, a consistent performance change trend was observed for MCGLPPI-M3, GearNet-Atom, and GearNet-Res.
We attribute this to the following reasons. For the binary classification task that aims to distinguish biological interfaces from crystal artefacts (representing non-biological interactions), training-from-scratch with the current graph settings might already be sufficient for the protein learning model to capture the subtle geometric structural differences between these interface types. Furthermore, pre-training based on DDI complexes extracted from actual (biological) protein-protein interactions should be more beneficial to tasks that predict the properties of complexes formed through real PPI processes (e.g., binding affinity prediction) than to distinguishing crystallographic interfaces resulting from non-biological contacts between repetitive crystal units18.
In addition, we can conclude that, when using the same number of pre-training samples, our CG-scale approaches exhibited overall better performance across the PDBbind, ATLAS, and MANY/DC datasets. Overall, pre-training on DDIs is effective in enhancing PPI predictions, with the CG-scale graph encoder combined with CG DDI pre-training being the most effective approach.
Performance of the geometries considered in CG-scale complex graphs
To assess the impact of various geometric representations within CG complex graphs, based on the relatively simple and classical MARTINI22-based MCGLPPI, we conducted a series of ablation studies. These experiments were designed to gradually remove specific graph components and analyze their individual contributions to the overall predictive performance of the system.
Initially, we focused on edges based on chemically plausible interactions as defined by the MARTINI(22) force field. These interactions included various types of bonds between beads, such as \({d}_{{B}_{i}{B}_{i+1}}({CTS})\), \({d}_{{B}_{i}{B}_{i+1}}(H)\), \({d}_{{B}_{i}{B}_{i+3}}(E)\), \({d}_{{B}_{i}{B}_{i+4}}(E)\), and \({d}_{S}\) (detailed in The construction of CG-scale protein complex graph and its cropping function section of “Methods” section, and the following analyzed graph components can be referred from the same section). We used the subset of 915 protein dimers from the PDBbind-strict-dimer dataset for our analysis. Upon selective removal of all MARTINI bond-based edges from the CG graphs, we observed a measurable decline in performance metrics. The \({R}_{{\rm{P}}}\) decreased from 0.597 to 0.569 (Fig. 3b), confirming the importance of these edges in accurately characterizing protein interactions.
Next, after removing MARTINI bond-based edges, we investigated the effectiveness of the proposed bead-residue geometric hierarchical composition-aware edges \({d}_{{intra}}\) and \({d}_{{inter}}\). We replaced these two types of edges with the standard radius-based edges that do not differentiate the compositional relationships between chemically-plausible bead nodes and their corresponding residues (using the same edge cutoff). This modification resulted in a further decrease in predictive accuracy, with \({R}_{{\rm{P}}}\) dropping from 0.569 to 0.557 (Fig. 3b). This emphasized the significant impact that composition-aware edges have on the model.
We also evaluated scenarios within a complete CG graph where, if both residue composition-aware edges and MARTINI bond-based edges are present between the same pair of end nodes, the MARTINI bond-based edges are disregarded. Under these conditions, the predictive performance experienced a decline from an \({R}_{{\rm{P}}}\) of 0.597 to 0.573 (Fig. 3b). This result emphasized the importance of explicitly incorporating MARTINI bond-based edges along with the composition-aware edges in our modeling framework.
Furthermore, other groups of MARTINI force field parameters, such as bead types, angles (\({\theta }_{{B}_{i}{B}_{i+1}{B}_{i+2}}\), \({\theta }_{{B}_{i+1}{B}_{i}{S}_{i,1}}\), \({\theta }_{{B}_{i}{S}_{i,1}{S}_{i,2}}\)), and dihedrals (\({\varPsi }_{{B}_{i}{B}_{i+1}{B}_{i+2}{B}_{i+3}}\)), were encoded as node or edge features within the graph. To assess their specific importance, we invalidated the bead type feature and the angular features, which include (bond) angles and dihedrals, from the CG-scale graph. As shown in Fig. 3b, when the bead type in every node and edge feature was set to be identical, or when angular information was further omitted (on top of the former), the \({R}_{{\rm{P}}}\) dropped from 0.597 to 0.583 and 0.567, respectively.
Moreover, the complete invalidation of the above two groups of force field parameters plus all MARTINI-based edges (i.e., only the standard radius-based edges were retained) led to a further decline in \({R}_{{\rm{P}}}\) to 0.521 (a 12.7% difference). These findings indicated that our framework relies not only on the aforementioned chemically plausible edges, but also on the chemical and physical information provided by the MARTINI bead types together with the angular information from angles and dihedrals. This combined information enables MCGLPPI to make more informed predictions about the properties of protein interactions.
Influence of graph cropping on overall model efficiency
To provide a comprehensive analysis of the influence of graph cropping on the efficiency of the MCGLPPI framework, we conducted additional experiments using the ATLAS dataset, which contains more complex TCR-pMHC structures as a benchmark (based on MARTINI22-version MCGLPPI). We ran our model on the subset of 451 protein complex structures from the ATLAS dataset while maintaining all experimental settings consistent with our previous MCGLPPI experiments, except for the graph cropping function, which was disabled. As shown in Fig. 3c, the maximum batch size that could be processed by MCGLPPI under a single NVIDIA A100 GPU 40GB decreased significantly from 128 to 32 when the graph cropping function was turned off. With a batch size of 32, the \({R}_{{\rm{P}}}\), GPU memory consumption, and total runtime post-cropping were 0.825, 6796 MB, and 5992 s, respectively, compared to 0.809, 19,706 MB, and 13,794 s before cropping. These results confirmed the critical role of graph cropping in improving computational efficiency and predictive performance in the framework.
Discussion
In this study, we presented MCGLPPI, an efficient framework that enhances structure-based overall property predictions for protein-protein complexes by utilizing the MARTINI force field for lightweight protein modeling. At the mesoscopic CG-scale, our proposed CG protein graph model uses concise yet chemically plausible beads and bonds to accurately represent the conformational characteristics of protein-protein complexes. This results in lower computational overhead, achieving a better balance between predictive performance and cost compared with its atom- and residue-scale counterparts. Furthermore, modern protein graph designs and the corresponding GNN protein encoders such as GearNet-Edge rely heavily on edge construction, in which edges are usually fully built from multiple pre-defined geometric distance and sequential thresholds to capture comprehensive spatial relationships between particle nodes21,22,47, while the number of edges significantly influences the neighboring message aggregation17 speed of the corresponding GNNs. Our devised CG graph instead incorporates chemically plausible MARTINI-based edges wiring designated bead node pairs according to specific interaction definitions, reducing the reliance on indiscriminately connecting every node pair within multiple pre-defined thresholds, which ultimately decreases the processing overhead of the current framework (a more detailed explanation can be found in Supplementary Table 1).
On top of this, MCGLPPI was designed to accommodate the geometric parameters of both the classical MARTINI22 and the more recent MARTINI3 force field for flexible extension and comparison. Under the current experimental settings, the two versions achieve better performance on different datasets, while DDI-based pre-training narrows the difference between them. In addition, MARTINI22-based MCGLPPI enjoys a slightly faster processing speed owing to its fewer bead types and smaller number of beads (see the more detailed cross-version performance analysis in Supplementary Note 5).
In addition, our extensive ablation studies highlighted the significance of both edge and node features derived from the MARTINI force field for accurate PPI predictions using the MCGLPPI framework. Notably, these features, like chemical bond-based edges and physical-plausible bead type node features, are crucial in capturing the essential properties that govern protein interactions (The potential extension analysis of using the force field parameters in other scales can be found in Supplementary Note 6). Through our proposed CG-scale learning framework, we also demonstrated the effectiveness of DDI-based pre-training in improving binding affinity predictions of PPIs.
While MCGLPPI has shown promising overall performance, there are areas for further improvement or investigation. For instance, MCGLPPI may not fully capture the complexity of protein-protein systems, and other CG-scale protein modeling systems deserve further exploration. Besides, the MCGLPPI model, built on a geometric GNN framework, learns from confident 3D structures of complexes to predict related overall properties. Although, with simple modifications, we further demonstrated its versatility by straightforwardly extending it to binding affinity change (i.e., \(\triangle \triangle {\rm{G}}\)) calculation for pairwise wildtype-mutant complexes on an independent multiple-point amino acid mutation dataset (detailed in Supplementary Table 5), it currently lacks the capability to (directly) utilize the more broadly available and readily accessible PPI sequence data as the initial input for predicting PPI attributes.
Based on this, we plan to incorporate more cost-efficient geometric information to more comprehensively characterize CG complex structures, e.g., considering the Euler angles to describe the relative rotation between CG particles, and support more CG modeling systems to capture protein-protein thermodynamic quantities and underlying chemical mechanisms from different perspectives. Furthermore, there is a potential to further improve model performance on more PPI tasks by integrating sequence co-evolutionary information as a feature component48. Additionally, combining our CG-scale framework (with further adaptive modifications) more intensively with tools that can predict confident PPI structures based on sequences, such as AlphaFold349, AlphaFold-Multimer50, FoldDock51, or the MARTINI force field-integrated HADDOCK31, will open avenues for predicting PPI properties, such as determining whether two proteins interact and better understanding the effects of mutations on these interactions.
Methods
The detailed curation process for the 3DID pre-training dataset
The latest 3DID database34 provides 15,983 DDI structure templates, with each template containing one or more samples of resolved 3D structural data. We removed any DDI templates from the 3DID dataset that were identical to those present in our downstream benchmark datasets. Following this stringent exclusion process, we obtained a pre-training dataset which provides 41,663 DDI structure samples in total.
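The exclusion step can be sketched as follows; the template identifiers and the dictionary layout here are hypothetical illustrations for clarity, not the actual 3DID schema:

```python
# Hedged sketch of curating the pre-training set: drop every DDI template
# whose identifier also appears in a downstream benchmark, then count the
# remaining structure samples. Identifiers below are invented examples.

def filter_pretraining_templates(ddi_templates, benchmark_template_ids):
    """ddi_templates: dict mapping template id -> list of structure sample ids."""
    excluded = set(benchmark_template_ids)
    kept = {tid: samples for tid, samples in ddi_templates.items()
            if tid not in excluded}
    n_samples = sum(len(samples) for samples in kept.values())
    return kept, n_samples

templates = {"PF00001-PF00002": ["s1", "s2"],
             "PF00003-PF00004": ["s3"],
             "PF00005-PF00006": ["s4", "s5", "s6"]}
kept, n = filter_pretraining_templates(templates, {"PF00003-PF00004"})
```

In the actual curation, the same logic is applied at the template level so that every sample of an overlapping template is removed at once.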
MARTINI-based geometric parameter generation
Each complex structure was processed using the pdbfixer tool (https://github.com/openmm/pdbfixer) to complete missing side-chain information and convert non-natural amino acids to their natural counterparts (the same processing was also performed for the atom- and residue-scale models). The Python-based MARTINI script martinize.py (https://cgmartini.nl/docs/downloads/tools/proteins-and-bilayers.html, version 2.4)23 was used to generate the MARTINI22-based CG structure and force field parameters for each protein complex, while the Martinize2 and Vermouth programs52 were used to produce the MARTINI3-based CG structures and force field parameters of the corresponding complexes. The CG structure encompasses the bead coordinate-related information, while the force field parameters include both nonbonded parameters, such as bead types, and bonded parameters, including bonds, angles, dihedrals, and the bead connectivity instructing the bead composition for these bonds and angles (Fig. 2).
MARTINI-based bonds information
Within the MARTINI (applicable for both MARTINI22 and MARTINI3) force field representation, there are two principal types of bonds: backbone bonds and sidechain bonds. As illustrated in Figs. 1a and 2, a backbone bond \({d}_{B}\) is formed between two neighboring backbone beads (\({B}_{i}\)). Sidechain bonds \({d}_{S}\) occur either between a backbone bead and a sidechain bead (\({S}_{i}\)) within the same amino acid or between sidechain beads. Furthermore, the MARTINI force field differentiates backbone bond types based on the protein’s secondary structure. Consequently, \({d}_{B}\) can be subdivided into:
1. Constraint bonds: formed between two adjacent amino acids that are part of a helical structure (\(H\)), denoted as \({d}_{{B}_{i}{B}_{i+1}}(H)\).
2. Long harmonic backbone bonds: for three consecutive amino acids forming extended elements (\(E\)), these bonds connect the backbone beads of residues \(i\) and \(i+3\), denoted as \({d}_{{B}_{i}{B}_{i+3}}(E)\).
3. Long harmonic backbone bonds: for four contiguous amino acids in extended elements (\(E\)), the bonds connect the backbone beads of residues \(i\) and \(i+4\), denoted as \({d}_{{B}_{i}{B}_{i+4}}(E)\).
4. Other harmonic backbone bonds: parameters between two adjacent amino acids in irregular secondary structures such as coils, turns, and bends, denoted as \({d}_{{B}_{i}{B}_{i+1}}({CTS})\).
In total, there are five types of bonds in both MARTINI22 and MARTINI3 force fields: \({d}_{{B}_{i}{B}_{i+1}}({CTS})\), \({d}_{{B}_{i}{B}_{i+1}}(H),\) \({d}_{{B}_{i}{B}_{i+3}}(E)\), \({d}_{{B}_{i}{B}_{i+4}}(E)\), and \({d}_{S}\). These bond types were chosen as the primary edge types in our MCGLPPI framework.
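As an illustration of how these five bond types could be enumerated as edge types, the following sketch uses simplified selection rules and shorthand labels of our own (not the exact MCGLPPI implementation):

```python
# Hedged sketch: assigning one of the five MARTINI bond-based edge types from
# whether the bond involves a sidechain bead, the secondary structure of the
# connected residues, and their sequence separation. The labels mirror the
# notation in the text; the rules shown are a simplification for illustration.

BOND_EDGE_TYPES = ["d_BB_CTS", "d_BB_H", "d_BB3_E", "d_BB4_E", "d_S"]

def bond_edge_type(is_sidechain_bond, sec_struct, seq_sep):
    if is_sidechain_bond:
        return "d_S"
    if sec_struct == "H" and seq_sep == 1:
        return "d_BB_H"            # constraint bond within helices
    if sec_struct == "E" and seq_sep == 3:
        return "d_BB3_E"           # long harmonic bond, residues i and i+3
    if sec_struct == "E" and seq_sep == 4:
        return "d_BB4_E"           # long harmonic bond, residues i and i+4
    return "d_BB_CTS"              # coils, turns, and bends

def one_hot(edge_type):
    return [1 if t == edge_type else 0 for t in BOND_EDGE_TYPES]
```

Each such edge then carries its one-hot type as part of the edge features described later.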
MARTINI-based angles and dihedrals information
The bonded parameters also include angle and dihedral parameters (Fig. 2). Specifically, three types of angle parameters and one dihedral type are considered:
1. \({\theta }_{{B}_{i}{B}_{i+1}{B}_{i+2}}\): the angle between three consecutive backbone beads.
2. \({\theta }_{{B}_{i+1}{B}_{i}{S}_{i,1}}\): the angle formed between a backbone bead, its neighboring backbone bead, and the first sidechain bead of the amino acid.
3. \({\theta }_{{B}_{i}{S}_{i,1}{S}_{i,2}}\): the angle between a backbone bead and two consecutive sidechain beads of the same amino acid.
4. \({\varPsi }_{{B}_{i}{B}_{i+1}{B}_{i+2}{B}_{i+3}}\): the dihedral angle between four consecutive backbone beads. Note that dihedral angles were imposed only when all four interacting beads had the helical secondary structure (\(H\)) in the MARTINI force field.
Notably, the bond lengths and the angle and dihedral values were re-calibrated from the given coordinates of the corresponding endpoint beads, rather than directly adopted from the statistical values in the PDB database, since this calibration provides accurate geometric interactive information for the CG-scale protein complex graph model.
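This re-calibration is standard vector geometry over bead coordinates; a minimal sketch using only the standard library (the helper names are ours, not from the MCGLPPI codebase):

```python
import math

# Sketch of re-calibrating angle and dihedral values directly from bead
# coordinates, as described above, instead of using tabulated statistics.

def _sub(a, b): return (a[0] - b[0], a[1] - b[1], a[2] - b[2])
def _dot(a, b): return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]
def _cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])
def _norm(a): return math.sqrt(_dot(a, a))

def bead_angle(p1, p2, p3):
    """Angle (degrees) at bead p2 formed by beads p1-p2-p3."""
    u, v = _sub(p1, p2), _sub(p3, p2)
    return math.degrees(math.acos(_dot(u, v) / (_norm(u) * _norm(v))))

def bead_dihedral(p1, p2, p3, p4):
    """Dihedral angle (degrees) over four consecutive backbone beads."""
    b1, b2, b3 = _sub(p2, p1), _sub(p3, p2), _sub(p4, p3)
    n1, n2 = _cross(b1, b2), _cross(b2, b3)
    m1 = _cross(n1, tuple(c / _norm(b2) for c in b2))
    return math.degrees(math.atan2(_dot(m1, n2), _dot(n1, n2)))
```

For example, four coplanar backbone beads in a cis arrangement give a dihedral of 0 degrees, and a trans arrangement gives 180 degrees.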
The construction of CG-scale protein complex graph and its cropping function
The challenge in building an effective CG protein complex graph is to fully preserve the introduced MARTINI parameters in a graph structure accurately and efficiently, while keeping the flexibility to inject other useful knowledge on top of these parameters. To overcome this challenge, we first modelled a given protein-protein complex as a multi-relational contact graph \({\mathscr{G}}=\left({\mathscr{V}},{\mathscr{E}},{\mathscr{R}}\right)\). \({\mathscr{V}}\) represents the set of graph nodes \(i\), i.e., all MARTINI beads produced for the complex, and the position of each bead node \(i\) is determined by its equipped 3D coordinate. \({\mathscr{E}}\) and \({\mathscr{R}}\) are the set of edges between bead nodes and the set of edge types \(r\), respectively. Based on this, we denoted an edge from node \(j\) to node \(i\) with type \(r\) as \((i,j,r)\).
To give precise descriptions of inter-bead geometric positional relationships while integrating concise chemically plausible MARTINI bonds, the graph structure is built as follows. First, an edge is wired between any two bead nodes whose Euclidean distance is smaller than 5\({\text{\AA}}\). Compared with the commonly used radius edges in atom- and residue-scale models, MCGLPPI further distinguishes the edge type based on whether the two end bead nodes are from the same residue: the edge is categorized as an intra-residue contact edge \({d}_{{intra}}\) if the two bead nodes belong to the same residue, and otherwise as an inter-residue contact edge \({d}_{{inter}}\), thereby injecting the hierarchical composition information between chemically plausible beads and their surrounding residues within a complex. Besides, all bond types described in the MARTINI-based bonds information section are incorporated into the graph edge structure.
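The radius-based contact edge construction described above can be sketched as follows; the bead records here are hypothetical (coordinate, residue index) tuples rather than the framework's actual data structures:

```python
import math

# Minimal sketch: connect any bead pair closer than 5 angstroms and label the
# edge by whether the two beads belong to the same residue (d_intra) or to
# different residues (d_inter), as described in the text.

CUTOFF = 5.0  # angstroms

def build_contact_edges(beads):
    """beads: list of ((x, y, z), residue_index) tuples."""
    edges = []
    for i, (pi, ri) in enumerate(beads):
        for j, (pj, rj) in enumerate(beads):
            if i >= j:
                continue  # each undirected pair once
            if math.dist(pi, pj) < CUTOFF:
                etype = "d_intra" if ri == rj else "d_inter"
                edges.append((i, j, etype))
    return edges

beads = [((0.0, 0.0, 0.0), 0), ((3.0, 0.0, 0.0), 0),
         ((4.0, 3.0, 0.0), 1), ((20.0, 0.0, 0.0), 2)]
edges = build_contact_edges(beads)
```

In the full graph, the MARTINI bond-based edge types are added on top of these contact edges.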
To summarize, there are seven types of edges in total, describing the protein complex geometry from different perspectives, including precise geometric contact relationships, actual secondary structure supports, and chemically bonded interactions. Although multiple types of edges are included, thanks to the conciseness of MARTINI bonds, the average node degree in the constructed graph remains relatively small (see Supplementary Table 1 for a further comparison with the atom- and residue-scale counterparts), which contributes to a relatively low representation learning overhead. Next, the other MARTINI parameters were allocated to these defined nodes and edges as their features, as follows (the definition of the one-hot representation generation is in Supplementary Note 7).
Bead node features \({{\bf{f}}}_{i}\)
- The MARTINI22 or MARTINI3 bead type, given as a one-hot representation
- Sine-cosine encoded backbone angles (\([\sin ({\theta }_{{BBB}}),\cos ({\theta }_{{BBB}})]\))
- Sine-cosine encoded backbone-side chain angles (\([\sin ({\theta }_{{BBS}}),\cos ({\theta }_{{BBS}})]\))
- Sine-cosine encoded side chain angles (\([\sin ({\theta }_{{BSS}}),\cos ({\theta }_{{BSS}})]\))
- Sine-cosine encoded backbone dihedrals (\([\sin ({\varPsi }_{{BBBB}}),\cos ({\varPsi }_{{BBBB}})]\))
Graph edge features \({{\bf{f}}}_{(i,j,r)}\)
- The one-hot MARTINI22 or MARTINI3 bead type of the source node
- The one-hot MARTINI22 or MARTINI3 bead type of the target node
- The edge type, given as a one-hot representation
- The absolute positional difference between the source and target nodes in the MARTINI22 or MARTINI3 bead sequence, given as a one-hot representation
- The calibrated bond length
The above individual node and edge features were concatenated into the final features (\({{\bf{f}}}_{i}:\,{\bf{R}}\in 1\times 25\) (MARTINI22) or \(1\times 31\) (MARTINI3), \({{\bf{f}}}_{(i,j,r)}:\,{\bf{R}}\in 1\times 53\) (MARTINI22) or \(1\times 65\) (MARTINI3)). Besides, since the angular parameters generated from MARTINI are sparse (see Fig. 2; not every bead is involved in the calculation of every type of angle), an extra rule (Supplementary Note 8) was provided to assign sparse angular node features to specific beads and avoid potential conflicts.
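As a sketch of this concatenation for a node feature, the following uses a small hypothetical bead vocabulary (the true MARTINI22/MARTINI3 type sets and the resulting dimensions are those stated above):

```python
import math

# Illustrative concatenation of the node features listed above: a one-hot
# bead type followed by sine-cosine encodings of the three angle types and
# the backbone dihedral. BEAD_VOCAB is a tiny invented subset, not the full
# MARTINI type set.

BEAD_VOCAB = ["P5", "P4", "Qa", "C1", "SC4"]  # hypothetical subset

def sincos(angle_deg):
    rad = math.radians(angle_deg)
    return [math.sin(rad), math.cos(rad)]

def node_feature(bead_type, theta_bbb, theta_bbs, theta_bss, psi_bbbb):
    one_hot = [1.0 if t == bead_type else 0.0 for t in BEAD_VOCAB]
    return (one_hot + sincos(theta_bbb) + sincos(theta_bbs)
            + sincos(theta_bss) + sincos(psi_bbbb))

f = node_feature("Qa", 120.0, 100.0, 95.0, -60.0)
```

With this five-type vocabulary the vector has 5 + 4 × 2 = 13 entries; the real feature widths follow from the full bead vocabularies of each force field version.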
After the definition of the CG-scale complex graph, the corresponding graph cropping function was designed to identify its core interaction regions, further reducing the computational cost and potentially increasing the predictive accuracy. Specifically, we first determined the interaction parts of the protein-protein complexes with different interaction patterns. The curated complexes in the PDBbind and MANY/DC datasets are standard dimers, so we treated each protein chain as one part, representing a complex as two interaction parts. For the structures in ATLAS, which usually contain four or five chains, the peptide and MHC chains were treated as the first part, while the second part contained the remaining TCR chain structures.
Next, a distance matrix \({{\bf{M}}}^{{\rm{dis}}}\) based on the specified two interaction parts was created to guide the generation of the cropped complex graph \({{\mathscr{G}}}^{{\prime} }\). For each residue, MARTINI will only assign one backbone bead (\(B\)) with a 3D coordinate to represent its backbone atoms and their overall position, and thus the coordinate of \(B\) was used as the position of the residue (analogous to using alpha carbon (Cα) as the residue position in residue-level protein graph constructions21). Based on this, \({{\bf{M}}}^{{\rm{dis}}}\) with the size of \({L}_{{AA}1}\times {L}_{{AA}2}\) can be calculated, in which \({L}_{{AA}1}\) and \({L}_{{AA}2}\) are the amino acid (AA) sequence length (given by above backbone beads \(B\)) of the interaction parts 1 and 2, respectively. In \({{\bf{M}}}^{{\rm{dis}}}\), every element \({{\bf{M}}}_{i,j}^{{\rm{dis}}}\) was the pairwise Euclidean distance between corresponding residues from separate interaction parts (given by \({B}_{i}\) and \({B}_{j}\)). Then any pair of residues having the distance smaller than 8.5\({\text{\AA}}\) were retained as the core region (i.e., the initial strategy). Other AAs that had \(B-B\) distance to any core region AAs smaller than 10\({\text{\AA}}\) were also retained (i.e., the second strategy).
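A standard-library sketch of this two-stage cropping rule, assuming each residue is represented by its backbone bead coordinate (the function and variable names are ours, and stage 2 here extends against the core residues of both parts):

```python
import math

# Hedged sketch of the cropping described above: residue pairs from the two
# interaction parts whose backbone-bead (B-B) distance is below 8.5 A form
# the core region; any further residue within 10 A of a core residue is also
# retained.

CORE_CUTOFF, EXTEND_CUTOFF = 8.5, 10.0  # angstroms

def crop_residues(part1_B, part2_B):
    """part1_B, part2_B: per-residue backbone-bead coordinates of each part."""
    core1, core2 = set(), set()
    for i, bi in enumerate(part1_B):          # stage 1: inter-part core pairs
        for j, bj in enumerate(part2_B):
            if math.dist(bi, bj) < CORE_CUTOFF:
                core1.add(i)
                core2.add(j)
    core_coords = ([part1_B[i] for i in core1] + [part2_B[j] for j in core2])
    def extend(coords, core):                  # stage 2: neighbours of the core
        kept = set(core)
        for k, bk in enumerate(coords):
            if k not in kept and any(math.dist(bk, c) < EXTEND_CUTOFF
                                     for c in core_coords):
                kept.add(k)
        return kept
    return extend(part1_B, core1), extend(part2_B, core2)

part1 = [(0.0, 0.0, 0.0), (9.0, 0.0, 0.0), (15.0, 0.0, 0.0), (40.0, 0.0, 0.0)]
part2 = [(5.0, 0.0, 0.0), (60.0, 0.0, 0.0)]
kept1, kept2 = crop_residues(part1, part2)
```

In this toy example, the third residue of part 1 is retained only by the second-stage 10 Å extension, while the distant residues of both parts are cropped away.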
After that, all bead nodes and edges within the retained region were kept as the cropped complex graph \({{\mathscr{G}}}^{{\prime} }\) (the angular node features were further re-calibrated if any end bead nodes involved in the angular information calculations were removed by this cropping). Furthermore, for a fair comparison, the same cropping function was applied to the protein graphs of the atom-scale and residue-scale models, with the only difference being that \(C\alpha\) replaced \(B\) to indicate the overall position of each residue.
The CG-scale representation learning for complex overall property prediction
After acquiring the cropped protein complex graph \({{\mathscr{G}}}^{{\prime} }\), a representative multi-relational heterogeneous GNN-based protein encoder, GearNet-Edge21, was incorporated into the framework for predicting the overall property of protein complexes. Specifically, based on a line graph-enhanced edge message passing mechanism53 that models inter-edge positional relationships, additional structural information can be injected into the node representations for more effective protein geometric interaction modeling (the corresponding equations are provided in Supplementary Note 9). We adapted it to work at the CG-scale to generate the overall geometric representation of the input CG cropped graph \({{\mathscr{G}}}^{{\prime} }\), and the generated representation was further processed by a three-layer task-specific multi-layer perceptron (MLP) to produce the final property prediction results.
The DDI-based CG graph encoder pre-training technique
The protein ___domain-___domain complex parameterized by MARTINI still preserves the fundamental conformation and chain sequence, but its basic particles are substituted from the original atoms to CG beads. Intuitively, performing self-supervised noise-adding-denoising pre-training techniques, which have already been demonstrated to be effective for learning the geometric regularities of proteins at the atom- or residue-scale54, could also benefit the learning of general knowledge from CG DDI complexes (for downstream property predictions).
Therefore, based on the atom-scale work22, a CG-scale complex pre-training technique was developed, which adds noise of varying magnitudes to the 3D coordinates and sequences of MARTINI-based CG bead nodes based on diffusion mechanisms55 for learning CG-complex geometric regularities. The equations and complete details are provided in Supplementary Note 10. After pre-training, the trained CG graph encoder is fine-tuned on the specified downstream task to produce effective structural representations for the corresponding input CG complex graphs.
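As a generic, simplified illustration of the coordinate-noising component (a linear beta schedule is assumed here purely for demonstration; the actual noise schedule and the sequence-noising follow ref. 22 and Supplementary Note 10):

```python
import math
import random

# Illustrative variance-preserving diffusion step on CG bead coordinates:
# at step t, the clean coordinates are scaled by sqrt(alpha_bar(t)) and
# blended with Gaussian noise scaled by sqrt(1 - alpha_bar(t)). The linear
# beta schedule below is an assumption for demonstration only.

def alpha_bar(t, T, beta_max=0.02):
    """Cumulative product of (1 - beta_s) under a linear beta schedule."""
    prod = 1.0
    for s in range(1, t + 1):
        prod *= 1.0 - beta_max * s / T
    return prod

def noise_coordinates(coords, t, T, rng):
    """coords: list of (x, y, z) bead coordinates; returns noised copies."""
    a = alpha_bar(t, T)
    return [tuple(math.sqrt(a) * c + math.sqrt(1.0 - a) * rng.gauss(0.0, 1.0)
                  for c in xyz)
            for xyz in coords]

rng = random.Random(0)
noisy = noise_coordinates([(1.0, 2.0, 3.0), (4.0, 5.0, 6.0)], t=50, T=100, rng=rng)
```

During pre-training, the encoder is then asked to recover the clean structure (denoise), which is the self-supervised signal described above.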
For the implementation of the CG-scale protein complex graph construction, representation learning, and pre-training processes, PyTorch56 and TorchDrug57 with a default random seed of 0 were employed, and Adam58 with an initial learning rate of 0.0001 was adopted as the optimizer for model training (the same environment settings were also used for the other involved models working at the atom- and residue-scale). Besides, all experiments were deployed on a single NVIDIA A100 40GB GPU. A complete summary of the tools used to implement MCGLPPI is given in Supplementary Note 11.
Reporting summary
Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.
Data availability
All datasets analyzed are freely available through the original sources: (1) 3DID: https://3did.irbbarcelona.org/, (2) PDBbind: http://www.pdbbind.org.cn/, (3) ATLAS: https://pubmed.ncbi.nlm.nih.gov/28160322/, (4) MANY and DC: https://www.eppic-web.org/ewui/#downloads, and (5) AB-bind: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4815335/. To ease access, we have re-packaged all data at https://github.com/arantir123/MCGLPPI. When using these data, please cite and consult the authors of the original datasets. Source data are provided with this paper.
Code availability
The source code of MCGLPPI (Version 1.0) can be downloaded from https://github.com/arantir123/MCGLPPI.
References
Wang, Z. et al. EGGNet, a generalizable geometric deep learning framework for protein complex pose scoring. ACS Omega 9, 7471–7479 (2024).
Yue, Y. et al. MpbPPI: a multi-task pre-training-based equivariant approach for the prediction of the effect of amino acid mutations on protein–protein interactions. Brief. Bioinform. 24, bbad310 (2023).
Chen, B. et al. Identifying protein complexes and functional modules—from static PPI networks to dynamic PPI networks. Brief. Bioinform. 15, 177–194 (2014).
Yue, Y. et al. Improving therapeutic synergy score predictions with adverse effects using multi-task heterogeneous network learning. Brief. Bioinform. 24, bbac564 (2023).
Liu, S. et al. Nonnatural protein–protein interaction-pair design by key residues grafting. Proc. Natl Acad. Sci. USA 104, 5330–5335 (2007).
Wang, M., Cang, Z. & Wei, G. W. A topology-based network tree for the prediction of protein–protein binding affinity changes following mutation. Nat. Mach. Intell. 2, 116–123 (2020).
Koegl, M. & Uetz, P. Improving yeast two-hybrid screening systems. Brief. Funct. Genomic Proteomic 6, 302–312 (2007).
Lin, J. S. & Lai, E. M. Protein–protein interactions: co-immunoprecipitation. Methods Mol Biol. 1615, 211–219 (2017).
Louche, A., Salcedo, S. P. & Bigot, S. Protein–protein interactions: pull-down assays. Bact. Protein Secret. Syst.: Methods Protoc. 1615, 247–255 (2017).
Hussain, S. A. An introduction to fluorescence resonance energy transfer (FRET). Preprint at https://arxiv.org/abs/0908.1815 (2009).
Peng, X. et al. Characterizing the interaction conformation between T-cell receptors and epitopes with deep learning. Nat. Mach. Intell. 5, 395–407 (2023).
Zhou, H. X. & Qin, S. Interaction-site prediction for protein complexes: a critical assessment. Bioinformatics 23, 2203–2209 (2007).
Wang, R. et al. The PDBbind database: methodologies and updates. J. Med. Chem. 48, 4111–4119 (2005).
Alberts, B. Molecular Biology Of The Cell 4th edn (Garland Science, New York, NY, USA, 2002).
Whisstock, J. C. & Lesk, A. M. Prediction of protein function from protein sequence and structure. Q. Rev. Biophys. 36, 307–340 (2003).
Zhou, B. et al. Protein engineering with lightweight graph denoising neural networks. J. Chem. Inf. Model. 64, 3650–3661 (2024).
Kipf, T. N. & Welling M. Semi-Supervised Classification with Graph Convolutional Networks. International Conference on Learning Representations, OpenReview.net, Online (2016).
Réau, M. et al. DeepRank-GNN: a graph neural network framework to learn patterns in protein–protein interfaces. Bioinformatics 39, btac759 (2023).
Townshend, R. J. L. et al. ATOM3D: Tasks on Molecules in Three Dimensions. Thirty-fifth Conference on Neural Information Processing Systems Datasets and Benchmarks Track (Round 1) (Curran Associates, Inc., Red Hook, NY, USA, 2021).
Jing, B. et al. Equivariant graph neural networks for 3d macromolecular structure. Preprint at https://arxiv.org/abs/2106.03843 (2021).
Zhang, Z. et al. Protein Representation Learning by Geometric Structure Pretraining. The Eleventh International Conference on Learning Representations, OpenReview.net, Online (2022).
Zhang, Z. et al. Pre-training protein encoder via siamese sequence-structure diffusion trajectory prediction. In Proc. 37th International Conference on Neural Information Processing Systems, 43496–43524 (Curran Associates, Inc., Red Hook, NY, USA, 2023).
De Jong, D. H. et al. Improved parameters for the martini coarse-grained protein force field. J. Chem. Theory Comput. 9, 687–697 (2013).
Souza, P. C. T. et al. Martini 3: a general purpose force field for coarse-grained molecular dynamics. Nat. Methods 18, 382–388 (2021).
Monticelli, L. et al. The MARTINI coarse-grained force field: extension to proteins. J. Chem. Theory Comput. 4, 819–834 (2008).
de Jong, D. H., Periole, X. & Marrink, S. J. Dimerization of amino acid side chains: lessons from the comparison of different force fields. J. Chem. Theory Comput. 8, 1003–1014 (2012).
Sengupta, D. & Marrink, S. J. Lipid-mediated interactions tune the association of glycophorin A helix and its disruptive mutants in membranes. Phys. Chem. Chem. Phys. 12, 12987–12996 (2010).
Periole, X. et al. Structural determinants of the supramolecular organization of G protein-coupled receptors in bilayers. J. Am. Chem. Soc. 134, 10959–10965 (2012).
Lamprakis, C. et al. Evaluating the efficiency of the Martini force field to study protein dimerization in aqueous and membrane environments. J. Chem. Theory Comput. 17, 3088–3102 (2021).
Lelimousin, M., Limongelli, V. & Sansom, M. S. P. Conformational changes in the epidermal growth factor receptor: Role of the transmembrane ___domain investigated by coarse-grained metadynamics free energy calculations. J. Am. Chem. Soc. 138, 10611–10622 (2016).
Roel-Touris, J. et al. Less is more: coarse-grained integrative modeling of large biomolecular assemblies with HADDOCK. J. Chem. Theory Comput. 15, 6358–6367 (2019).
Arts, M. et al. Two for one: Diffusion models and force fields for coarse-grained molecular dynamics. J. Chem. Theory Comput. 19, 6151–6159 (2023).
Wang, W. Generative coarse-graining. APS March Meeting Abstracts 2022, N49-010 (2022).
Mosca, R. et al. 3did: a catalog of ___domain-based interactions of known three-dimensional structure. Nucleic Acids Res. 42, D374–D379 (2014).
Alborzi, S. Z. et al. PPIDomainMiner: Inferring ___domain-___domain interactions from multiple sources of protein-protein interactions. PLoS Comput. Biol. 17, e1008844 (2021).
Yellaboina, S. et al. DOMINE: a comprehensive collection of known and predicted ___domain-___domain interactions. Nucleic Acids Res. 39, D730–D735 (2011).
Marrink, S. J. et al. The MARTINI force field: coarse grained model for biomolecular simulations. J. Phys. Chem. B 111, 7812–7824 (2007).
Burley, S. K. et al. Protein Data Bank (PDB): the single global macromolecular structure archive. Protein Crystallogr. Methods Protoc. 1607, 627–641 (2017).
Borrman, T. et al. ATLAS: a database linking binding affinities with structures for wild-type and mutant TCR-pMHC complexes. Proteins Struct. Funct. Bioinform. 85, 908–916 (2017).
Rudolph, M. G. & Wilson, I. A. The specificity of TCR/pMHC interaction. Curr. Opin. Immunol. 14, 52–65 (2002).
Wang, R. et al. The PDBbind database: collection of binding affinities for protein–ligand complexes with known three-dimensional structures. J. Med. Chem. 47, 2977–2980 (2004).
Baskaran, K. et al. A PDB-wide, evolution-based assessment of protein-protein interfaces. BMC Struct. Biol. 14, 1–11 (2014).
Duarte, J. M. et al. Protein interface classification by evolutionary analysis. BMC Bioinform. 13, 1–16 (2012).
Wan, S. et al. Ensemble simulations and experimental free energy distributions: evaluation and characterization of isoxazole amides as SMYD3 inhibitors. J. Chem. Inf. Model. 62, 2561–2570 (2022).
Das, R. & Baker, D. Macromolecular modeling with rosetta. Annu. Rev. Biochem. 77, 363–382 (2008).
Xu, Q. & Dunbrack, R. L. The protein common interface database (ProtCID)—a comprehensive database of interactions of homologous proteins in multiple crystal forms. Nucleic Acids Res. 39, D761–D770 (2010).
Zeng, Y. et al. Identifying B-cell epitopes using AlphaFold2 predicted structures and pretrained language model. Bioinformatics 39, btad187 (2023).
Singh, R. et al. Topsy-Turvy: integrating a global view into sequence-based PPI prediction. Bioinformatics 38, i264–i272 (2022).
Abramson, J. et al. Accurate structure prediction of biomolecular interactions with AlphaFold 3. Nature 630, 493–500 (2024).
Evans, R. et al. Protein complex prediction with AlphaFold-Multimer. Preprint at https://www.biorxiv.org/content/10.1101/2021.10.04.463034v2.abstract (2021).
Bryant, P., Pozzati, G. & Elofsson, A. Improved prediction of protein-protein interactions using AlphaFold2. Nat. Commun. 13, 1265 (2022).
Kroon, P. C. et al. Martinize2 and vermouth: unified framework for topology generation. Elife 12, RP90627 (2023).
Harary, F. & Norman, R. Z. Some properties of line digraphs. Rendiconti del Circolo Matematico di Palermo 9, 161–168 (1960).
Liu, X. et al. Deep geometric representations for modeling effects of mutations on protein-protein binding affinity. PLoS Comput. Biol. 17, e1009284 (2021).
Ho, J., Jain, A. & Abbeel, P. Denoising diffusion probabilistic models. Adv. Neural Inf. Process. Syst. 33, 6840–6851 (2020).
Paszke, A. et al. Pytorch: an imperative style, high-performance deep learning library. Adv. Neural Inf. Process. Syst. 32, 8024–8035 (Curran Associates, Inc., Red Hook, NY, USA, 2019).
Zhu, Z. et al. Torchdrug: a powerful and flexible machine learning platform for drug discovery. Preprint at https://arxiv.org/abs/2202.08320 (2022).
Kingma, D. P. & Ba, J. Adam: a method for stochastic optimization. Preprint at https://arxiv.org/abs/1412.6980 (2014).
Acknowledgements
This work was supported by National Key Research and Development Program of China (2022YFF1202104), National Natural Science Foundation of China (62471310), and National Natural Science Foundation of China (32341002, 32030035). This work was also supported by the Computer Science Ramsay Fund at the University of Birmingham (to Y.Y.). We are thankful for the financial support from Macao Polytechnic University Foundation (RP/FCA 07/2022 to S.L.). We also thank Dr. D. McDonald, Dr. M. Heinzinger, Ms. C. Marquet, and Dr. B. Rost for their helpful suggestions on our task studies. The computations described in this research were performed using the Baskerville Tier 2 HPC service. Baskerville was funded by the EPSRC and UKRI (EP/T022221/1 and EP/W032244/1) and is operated by Advanced Research Computing at the University of Birmingham.
Author information
Authors and Affiliations
Contributions
Y.Y. and S.L. conceived the research project. L.W., T.H., Z.Z., and S.H. supervised the research project. Y.Y. designed the computational pipeline. Y.Y. implemented the MCGLPPI framework and performed the model training and prediction validation tasks. S.L. curated all involved protein complex samples and conducted experiments for MARTINI force field-based geometric parameter generation. Y.Y., S.L., Y.C., L.W., T.H., Z.Z., and S.H. wrote the manuscript.
Corresponding authors
Ethics declarations
Competing interests
The authors declare no competing interests.
Peer review
Peer review information
Nature Communications thanks Arne Elofsson, Antoine Taly, and the other, anonymous, reviewer(s) for their contribution to the peer review of this work. A peer review file is available.
Additional information
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Supplementary information
Source data
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
Yue, Y., Li, S., Cheng, Y. et al. Integration of molecular coarse-grained model into geometric representation learning framework for protein-protein complex property prediction. Nat Commun 15, 9629 (2024). https://doi.org/10.1038/s41467-024-53583-w