Predicting noncoding RNA and disease associations using multigraph contrastive learning

Sun, Si-Lin; Jiang, Yue-Yi; Yang, Jun-Ping; Xiu, Yu-Han; Bilal, Anas; Long, Hai-Xia

doi:10.1038/s41598-024-81862-5

Article
Open access
Published: 02 January 2025

Scientific Reports volume 15, Article number: 230 (2025)

Abstract

MiRNAs and lncRNAs are two essential noncoding RNAs. Predicting associations between noncoding RNAs and diseases can significantly improve the accuracy of early diagnosis. With the continuous breakthroughs in artificial intelligence, researchers increasingly use deep learning methods to predict associations. Nevertheless, most existing methods face two major issues: low prediction accuracy and the limitation of only being able to predict a single type of noncoding RNA-disease association. To address these challenges, this paper proposes a method called K-Means and multigraph Contrastive Learning for predicting associations among miRNAs, lncRNAs, and diseases (K-MGCMLD). The K-MGCMLD model is divided into four main steps. The first step is the construction of a heterogeneous graph. The second step involves downsampling using the K-means clustering algorithm to balance the positive and negative samples. The third step is to use an encoder with a Graph Convolutional Network (GCN) architecture to extract embedding vectors. Multigraph contrastive learning, including both local and global graph contrastive learning, is used to help the embedding vectors better capture the latent topological features of the graph. The fourth step involves feature reconstruction using the balanced positive and negative samples and the embedding vectors, which are fed into an XGBoost classifier for multi-association classification prediction. Experimental results show AUC values of 0.9542 for miRNA-disease associations, 0.9603 for lncRNA-disease associations, and 0.9687 for lncRNA-miRNA associations. Additionally, case analyses using K-MGCMLD validated the associations of all the top 30 miRNAs predicted to be associated with lung cancer and Alzheimer’s disease.

Introduction

Although noncoding RNAs¹ do not encode proteins, they have many essential biological functions, especially in disease regulation. When the human body is dysregulated, noncoding RNAs can lead to various diseases, such as tumors, neurological disorders, cardiovascular diseases, and developmental abnormalities². Noncoding RNAs are closely associated with diseases. miRNAs and lncRNAs are two important components of noncoding RNAs^3,4.

miRNAs are a class of endogenous noncoding RNAs with a length of approximately 22 nucleotides (nt)⁵. Because of technical limitations, researchers long overlooked the roles of miRNAs. It was not until the 1990s that Lee et al. discovered a small noncoding RNA of 22 nt, known as lin-4, in Caenorhabditis elegans⁶. Reinhart et al. further discovered that lin-4 and let-7 can bind to the 3’ untranslated region (3’ UTR) of target genes to suppress or reduce their expression levels, thereby regulating the developmental timing of C. elegans⁷. As research on miRNAs has deepened, more studies have demonstrated that abnormal expression of miRNAs is closely related to the onset and progression of various diseases⁸. For example, abnormal expression of the miR-29 family (miR-29a, miR-29b-1, miR-29b-2, and miR-29c) has been closely associated with osteoarthritis, osteoporosis, cardiorenal, and immune diseases⁹. Moreover, abnormal expression of miRNAs is also closely linked to cancer. Zhu et al. found that hsa-miR-21 promotes cancer cell proliferation and metastasis by inhibiting the expression of various tumor suppressor genes, especially in breast cancer¹⁰. Yanaihara et al. experimentally demonstrated that hsa-miR-155 is closely associated with lung cancer¹¹.

LncRNAs are generally ≥ 200 nucleotides (nt) in length. LncRNAs play important roles in cell biology, including the regulation of gene expression¹², chromatin structure and function¹³, and tumorigenesis and tumor progression¹⁴. For example, lncRNA HOTAIR is closely associated with lung cancer, promoting proliferation, survival, metastasis, and drug resistance in lung cancer cells¹⁵. Chang et al. found that lncRNA MaTAR25 plays a significant role in the proliferation and migration of breast tumor cells¹⁶. Through knockout (KO) of linc-RoR in MCF-7 cells, Peng et al. discovered that linc-RoR can promote breast cancer cell growth and activation¹⁷. Jafari et al. identified three dysregulated lncRNAs (ESRG, LINC00518, and PWRN1) in clinical samples from colorectal cancer patients¹⁸. Chakravarty et al. found that NEAT1 is a key regulator of prostate cancer¹⁹.

Therefore, predicting the associations between noncoding RNAs and diseases is profoundly important. These RNAs can serve as potential biomarkers, with their expression changes reflecting disease states and progression. Knowing these noncoding RNA-disease associations in advance can provide clinicians with crucial support for diagnostic and therapeutic decision-making²⁰. In-depth studies of the roles of noncoding RNAs in diseases can reveal new mechanisms of disease onset and biological processes. Researchers can develop personalized medical strategies by analyzing the expression characteristics of noncoding RNAs in patients, which allows them to provide effective treatment options early. Consequently, researchers are conducting extensive and in-depth studies on the associations between miRNAs and diseases, as well as between lncRNAs and diseases. The methods used mainly fall into three categories: the first uses biological experimental techniques to infer the functions of noncoding RNAs, the second uses machine learning approaches, and the third uses deep learning techniques.

In the early stages of noncoding RNA research, researchers primarily explored the associations between noncoding RNAs and diseases using experimental biological techniques. Chen et al. developed a qRT-PCR technique²¹, significantly improving detection sensitivity by specifically amplifying miRNA, allowing for accurate quantification of miRNA expression levels in various biological samples. However, this method still has issues, such as high cost. Lu et al. used microarray analysis technology to systematically map miRNAs’ expression profiles across different cancer types for the first time, providing new molecular markers for cancer classification and diagnosis²². Rinn et al. studied the function of lncRNAs in the HOX gene locus through gene knockout experiments. This study explored the role of lncRNAs in gene expression regulation²³. However, gene knockout experiments are time-consuming, labour-intensive, and have a low success rate, making large-scale research challenging.

With the development of big data and biological technologies, machine learning has demonstrated strong potential in processing large datasets effectively, especially in predicting miRNA-disease and lncRNA-disease associations. Xu et al. used Support Vector Machines (SVM) to predict associations between miRNAs and tumors by constructing a miRNA target dysregulation network²⁴. William et al. used random forests to predict associations between miRNAs and cancer²⁵. Xuan et al. used a weighted K-nearest neighbors (KNN) algorithm to predict miRNAs associated with human diseases²⁶. Chen et al. developed LRLSLDA, based on the Laplacian Regularized Least Squares framework, to identify potential lncRNAs related to diseases²⁷. The LRLSLDA model achieved an AUC value of 0.776 in leave-one-out cross-validation, laying the foundation for subsequent research on lncRNA-disease association prediction. Although machine learning can handle large datasets to some extent, it has difficulty capturing useful information effectively and therefore struggles to predict miRNA-disease and lncRNA-disease associations efficiently and accurately.

Computer science, particularly artificial intelligence, has rapidly advanced in recent years. As an important branch of artificial intelligence, deep learning has introduced new approaches for predicting miRNA-disease and lncRNA-disease associations. Deep learning can learn complex nonlinear relationships and data representations through multi-layer neural network structures. Researchers have used deep learning to represent the features of noncoding RNAs and diseases, capturing useful information to achieve more accurate predictions. Liu et al. proposed the SMALF model, which utilizes a stacked autoencoder to integrate latent features of miRNAs and diseases, extracting feature vectors for miRNA-disease associations, and uses XGBoost to predict unknown miRNA-disease associations²⁸. Ji et al. introduced the SVAEMDA model, which employs a variational autoencoder to predict miRNA-disease associations²⁹. Xuan et al. developed the CNNLDA model, which uses a dual convolutional neural network with an attention mechanism to predict lncRNA-disease associations³⁰. Guo et al. proposed the LDASR computational method, which constructs feature vectors for lncRNA-disease pairs by integrating Gaussian interaction profile kernel similarity for lncRNAs, disease semantic similarity, and Gaussian interaction profile kernel similarity, and then uses an autoencoder to reduce feature dimensionality for predicting lncRNA-disease associations³¹.

More and more researchers have been integrating neural networks with graph structures to capture useful information better and extract embedding vectors, aiming to achieve better predictive performance. Zhang et al. proposed the AGAEMD model, which uses a node-level Attention Graph Auto-Encoder to integrate the graph attention mechanism into the autoencoder for predicting unknown miRNA-disease associations³². Chen et al. combined a graph autoencoder and a self-attention mechanism to predict the associations between miRNAs and diseases³³. Li et al. fused multiple sources of information and used graph attention networks to predict miRNA-disease associations³⁴. Jin et al. employed a graph attention mechanism and multiple adaptive modalities to predict associations between miRNAs and diseases³⁵. Liao et al. introduced GCNA-MDA, which primarily uses a GCN to capture the topological information of the disease network and uses an autoencoder to extract features of miRNA-disease associations³⁶. Lan et al. used Principal Component Analysis (PCA) to reduce noise in the raw data, employed a Graph Attention Network (GAT) to extract useful information from lncRNAs and diseases, and utilized a Multi-Layer Perceptron (MLP) to infer lncRNA-disease associations³⁷. Shi et al. predicted lncRNA-disease associations by constructing a heterogeneous graph neural network³⁸. Li et al. used a node adaptive graph transformer with structural encoding for predicting lncRNA-disease associations³⁹. Wang et al. proposed a method based on Graph Attention Networks (GAT) to identify associations between lncRNAs and diseases⁴⁰. Zhao et al. constructed lncRNA gene disease-related heterogeneous structures to predict lncRNA-disease associations⁴¹. Li et al. used a graph autoencoder to predict the associations between circRNAs and diseases⁴². Sheng et al. used graph contrastive learning to predict associations among miRNAs, lncRNAs, and diseases⁴³.

Although researchers have proposed many methods for noncoding RNA prediction, several significant challenges still exist. First, the accuracy remains insufficient, making it difficult to accurately predict the associations between noncoding RNAs and diseases. Second, miRNAs and lncRNAs have close associations with diseases, but most experimental methods cannot fully integrate, extract, and utilize the information among miRNAs, lncRNAs, and diseases. Third, most current models treat miRNA-disease and lncRNA-disease associations separately, making it impossible to predict multiple associations simultaneously.

To address these challenges, this study proposes K-MGCMLD, a GCN-enhanced multigraph contrastive learning model that integrates the information among miRNAs, lncRNAs, and diseases. This model enables multi-association predictions, including miRNA-disease, lncRNA-disease, and lncRNA-miRNA associations, all within a single framework. The main steps of K-MGCMLD are as follows:

The first step is the construction of a heterogeneous graph. This study constructs a lncRNA-miRNA-disease heterogeneous graph by integrating similarity and association information among miRNAs, lncRNAs, and diseases. The process then involves subjecting the heterogeneous graph to data augmentation and corruption.

The second step involves downsampling using the K-means clustering algorithm to balance the positive and negative samples, allowing the model to learn more effectively.

The third step is to use an encoder with a GCN architecture to extract embedding vectors. The embedding vectors can better capture the graph’s latent topological features through multigraph contrastive learning—utilizing both local and global graph contrastive learning.

In the fourth step, this study reconstructs features using the balanced positive and negative samples and the embedding vectors, which are then fed into an XGBoost classifier for classification prediction.
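As an illustration of the GCN encoder named in the third step, a single graph-convolution layer can be sketched as follows. The sketch assumes the widely used propagation rule ReLU(D^{-1/2}(A + I)D^{-1/2} H W); this section does not state the exact encoder configuration of K-MGCMLD, so the function names, the toy graph, and the weight matrix W are hypothetical. Plain Python lists are used only to keep the example dependency-free.

```python
import math

def matmul(A, B):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def normalize_adjacency(A):
    """Return D^{-1/2} (A + I) D^{-1/2}, the symmetric normalization with self-loops."""
    n = len(A)
    A_hat = [[A[i][j] + (1.0 if i == j else 0.0) for j in range(n)] for i in range(n)]
    deg = [sum(row) for row in A_hat]
    return [[A_hat[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(n)] for i in range(n)]

def gcn_layer(A_norm, H, W):
    """One propagation step: ReLU(A_norm @ H @ W)."""
    return [[max(0.0, v) for v in row] for row in matmul(matmul(A_norm, H), W)]

# Toy path graph 0 - 1 - 2 (hypothetical data, not from the paper).
A = [[0.0, 1.0, 0.0], [1.0, 0.0, 1.0], [0.0, 1.0, 0.0]]
H = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]  # one-hot input features
W = [[1.0, 0.0], [0.0, 1.0], [1.0, -1.0]]  # arbitrary weights; learned in practice

A_norm = normalize_adjacency(A)
Z = gcn_layer(A_norm, H, W)  # 3 nodes, 2-dimensional embeddings
```

In a trained model W would be learned through the contrastive objective, and several such layers could be stacked.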

Methods

The core idea of the proposed K-MGCMLD is to integrate information among miRNAs, lncRNAs, and diseases to predict their associations. The K-MGCMLD model is composed of four parts: A, B, C, and D. In Fig. 1-A, the model integrates similarity and association information between miRNAs, lncRNAs, and diseases to construct a lncRNA-miRNA-disease (MLD) heterogeneous graph. The heterogeneous graph structure more intuitively represents the associations among multiple entities, facilitating the processing of complex relationships among different entities. By leveraging information from various nodes and edges, the model can better learn and capture more critical feature information, enhancing the model’s understanding of data details and achieving more refined feature representations. The MLD heterogeneous graph is then subjected to data augmentation and corruption, generating MLD-A and MLD-C. This step allows for more effective contrastive learning in subsequent stages, improving the model’s generalization capability and robustness, thereby enhancing the quality of feature learning.
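The construction of the MLD graph and its two derived views can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the block layout of the adjacency matrix, the use of random edge dropping for augmentation (MLD-A), and the use of feature-row shuffling for corruption (MLD-C) are all assumptions, as are the function names and the toy matrices.

```python
import random

def transpose(M):
    return [list(col) for col in zip(*M)]

def block_adjacency(S_l, S_m, S_d, A_lm, A_ld, A_md):
    """Assemble one symmetric matrix over lncRNA, miRNA and disease nodes.

    Diagonal blocks hold similarities, off-diagonal blocks hold known associations.
    """
    rows = []
    for a, b, c in zip(S_l, A_lm, A_ld):
        rows.append(list(a) + list(b) + list(c))
    for a, b, c in zip(transpose(A_lm), S_m, A_md):
        rows.append(list(a) + list(b) + list(c))
    for a, b, c in zip(transpose(A_ld), transpose(A_md), S_d):
        rows.append(list(a) + list(b) + list(c))
    return rows

def drop_edges(adj, p, rng):
    """Assumed augmentation: remove each off-diagonal edge with probability p."""
    out = [row[:] for row in adj]
    for i in range(len(out)):
        for j in range(i + 1, len(out)):
            if out[i][j] != 0 and rng.random() < p:
                out[i][j] = out[j][i] = 0
    return out

def shuffle_features(X, rng):
    """Assumed corruption: permute feature rows so they no longer match the topology."""
    perm = list(range(len(X)))
    rng.shuffle(perm)
    return [X[i][:] for i in perm]

# Hypothetical toy data: 2 lncRNAs, 2 miRNAs, 1 disease.
S_l = [[1.0, 0.3], [0.3, 1.0]]
S_m = [[1.0, 0.5], [0.5, 1.0]]
S_d = [[1.0]]
A_lm = [[1, 0], [0, 1]]
A_ld = [[1], [0]]
A_md = [[0], [1]]

rng = random.Random(0)
mld = block_adjacency(S_l, S_m, S_d, A_lm, A_ld, A_md)
mld_a = drop_edges(mld, p=0.5, rng=rng)   # augmented view
x_c = shuffle_features(mld, rng)          # corrupted features (adjacency rows used as features)
```

Using the adjacency rows as node features is itself an assumption made to keep the example short; any node feature matrix with one row per node could be shuffled in the same way.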

In Fig. 1-B, the multigraph contrastive learning component applies a self-supervised learning approach to process the graph structure and its features. This self-supervised learning method does not require a large amount of labeled data. It relies only on the intrinsic structure of the data to generate training signals, significantly reducing the need for labeled data. By utilizing multigraph contrastive learning, the model can fully exploit unlabeled graph data for training, improving its representation capabilities and performance. By contrasting different views, the model can capture multi-level information within the graph and extract more enriched and meaningful embedding vectors from the heterogeneous graph. This method effectively captures multi-scale information in the graph, including local node-level information and global graph-level structure, enhancing the model’s understanding of graph-structured data.
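The two contrastive signals described above can be illustrated with a small sketch. The exact objectives used by K-MGCMLD are not given in this section, so the code assumes two common choices: a node-level InfoNCE loss between two views for the local term, and a node-versus-graph-summary binary cross-entropy (in the style of Deep Graph Infomax, with a plain dot product in place of a learned bilinear discriminator) for the global term. All names, the temperature tau, and the toy embeddings are hypothetical.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def cosine(u, v):
    return dot(u, v) / (math.sqrt(dot(u, u)) * math.sqrt(dot(v, v)) + 1e-12)

def local_loss(H, H_aug, tau=0.5):
    """InfoNCE: node i in one view should be most similar to node i in the other view."""
    total = 0.0
    for i, h in enumerate(H):
        sims = [math.exp(cosine(h, g) / tau) for g in H_aug]
        total += -math.log(sims[i] / sum(sims))
    return total / len(H)

def readout(H):
    """Graph-level summary: sigmoid of the mean node embedding."""
    return [sigmoid(sum(col) / len(H)) for col in zip(*H)]

def global_loss(H, H_corrupt):
    """Real nodes should agree with the graph summary; corrupted nodes should not."""
    s = readout(H)
    eps = 1e-12
    pos = [-math.log(sigmoid(dot(h, s)) + eps) for h in H]
    neg = [-math.log(1.0 - sigmoid(dot(h, s)) + eps) for h in H_corrupt]
    return (sum(pos) + sum(neg)) / (len(pos) + len(neg))

# Hypothetical 2-dimensional embeddings of three nodes.
H = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
H_misaligned = [H[1], H[2], H[0]]            # same rows, wrong node order
H_negated = [[-x for x in h] for h in H]     # clearly unlike the real graph

aligned = local_loss(H, H)
misaligned = local_loss(H, H_misaligned)
easy = global_loss(H, H_negated)
hard = global_loss(H, H)
```

In this toy case the local loss is lower when the two views agree node by node, and the global loss is lower when the corrupted embeddings are easy to tell apart from the real ones. A combined training objective would typically be a weighted sum of the two terms; the weighting is not specified here.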

Figure 1-C presents the unsupervised feature extraction process using an autoencoder. The process retains the features of all positive samples and applies K-means clustering to the negative sample features. From these clusters, the process uniformly extracts an equal number of negative samples from different clusters to match the number of positive samples. This approach helps balance the positive and negative samples, reducing bias and preventing the model from favoring the majority class, which ultimately helps to enhance the model’s performance.
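The balancing procedure just described can be sketched as follows: cluster the negative samples, then draw from the clusters in turn until the number of negatives matches the number of positives. This is a minimal sketch assuming plain Lloyd's K-means with random initialization and a round-robin draw; the number of clusters, the seed, and all function names are hypothetical, and a library implementation of K-means would normally replace the hand-written one.

```python
import random

def sq_dist(u, v):
    return sum((a - b) ** 2 for a, b in zip(u, v))

def kmeans(points, k, rng, iters=20):
    """Plain Lloyd's algorithm; returns one cluster label per point."""
    centers = [list(p) for p in rng.sample(points, k)]
    labels = [0] * len(points)
    for _ in range(iters):
        labels = [min(range(k), key=lambda c: sq_dist(p, centers[c])) for p in points]
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the previous center if a cluster becomes empty
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels

def balanced_negatives(neg_features, n_pos, k, seed=0):
    """Select n_pos negative indices, spread as evenly as possible over k clusters."""
    rng = random.Random(seed)
    labels = kmeans(neg_features, k, rng)
    clusters = [[i for i, lab in enumerate(labels) if lab == c] for c in range(k)]
    for members in clusters:
        rng.shuffle(members)
    chosen = []
    # Round-robin over clusters so that a small cluster cannot block the quota.
    while len(chosen) < n_pos and any(clusters):
        for members in clusters:
            if members and len(chosen) < n_pos:
                chosen.append(members.pop())
    return sorted(chosen)

# Hypothetical negative-sample features forming three loose groups.
neg = [[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [0.1, 0.1],
       [5.0, 5.0], [5.1, 5.0], [5.0, 5.1], [5.1, 5.1],
       [9.0, 0.0], [9.1, 0.0], [9.0, 0.1], [9.1, 0.1]]

picked = balanced_negatives(neg, n_pos=6, k=3, seed=0)  # indices into neg
everything = balanced_negatives(neg, n_pos=100, k=3, seed=0)
```

All positive samples are kept unchanged; only the returned negative indices are used downstream. When fewer negatives exist than positives, the function simply returns every negative.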

Figure 1-D involves feature reconstruction and multi-association prediction. K-MGCMLD is capable of performing both single-association prediction and multi-association prediction. Using multigraph contrastive learning, we extract low-dimensional embedding feature vectors denoted as Z. We reconstruct these features for miRNA-disease, lncRNA-disease, and lncRNA-miRNA associations using the sample indices balanced by the K-means clustering. Finally, we feed the reconstructed features into multiple classifiers (including MLP, XGBoost, AdaBoost, Logistic Regression, KNN, and Decision Tree) to predict miRNA-disease associations, lncRNA-disease associations, and lncRNA-miRNA interactions.
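A minimal sketch of this last stage is given below. It assumes that a pair feature is the concatenation of the two node embeddings taken from Z, which is one plausible reading of "feature reconstruction" rather than a detail confirmed in this section. Logistic regression, one of the classifiers listed above, is written out by hand as a dependency-free stand-in; it is not the XGBoost classifier itself. The embeddings, pairs, and labels are hypothetical toy values.

```python
import math

def pair_features(Z, pairs):
    """Assumed reconstruction: concatenate the embeddings of both endpoints of each pair."""
    return [list(Z[i]) + list(Z[j]) for i, j in pairs]

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def score(w, b, x):
    return sum(wk * xk for wk, xk in zip(w, x)) + b

def train_logreg(X, y, lr=0.5, epochs=500):
    """Batch gradient descent on the mean logistic loss."""
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * d, 0.0
        for x, t in zip(X, y):
            err = sigmoid(score(w, b, x)) - t
            grad_w = [g + err * xk for g, xk in zip(grad_w, x)]
            grad_b += err
        w = [wk - lr * g / n for wk, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b

def predict_proba(w, b, X):
    return [sigmoid(score(w, b, x)) for x in X]

# Hypothetical embeddings: nodes 0-1 are RNAs, nodes 2-3 are diseases.
Z = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0], [0.0, 1.0]]
pairs = [(0, 2), (0, 3), (1, 2), (1, 3)]
labels = [1, 0, 0, 0]  # toy labels: only the pair (0, 2) is a known association

X = pair_features(Z, pairs)
w, b = train_logreg(X, labels)
probs = predict_proba(w, b, X)  # predicted association probability per pair
```

In the pipeline described above, the pairs would come from the K-means-balanced index set, and the same pair features could be passed to any of the listed classifiers in place of this stand-in.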

Table 1 Multigraph contrastive learning algorithm.
