Abstract
Clinical and biomedical applications of genomic technologies are extensive and offer opportunities to improve healthcare across medical specialties. Genome disorder prediction is an important problem in biomedical research. Genome disorders cause multivariate diseases such as cancer, dementia, diabetes, and Leigh syndrome. Existing machine learning and deep learning methods have been introduced to forecast genome disorders, but their prediction outcomes were not sufficient. To address this issue, we propose a new method called Quadratic Feynman Polynomial Interpolated and Vector Nearest Centroid-based (QFPI-VNC) genome disorder prediction, which aims to predict genome disorders accurately with improved sensitivity and specificity. First, we take medical data about children from a public genomes dataset and apply Linear Quadratic and Feynman Kac Genome filtering to obtain computationally efficient filtered results. Next, the filtered results are fed to Concordance Correlated Polynomial Interpolation to extract genome-wide data accurately. Finally, the extracted features are fused and fed to the Support Vector and Nearest Centroid model for genome disorder prediction. Experiments on the genome dataset indicate that the proposed method is promising and compares acceptably with state-of-the-art methods in terms of convergence speed, recognition rate, sensitivity, and specificity. The results suggest that the QFPI-VNC method achieves the best performance, with a genome disease detection rate higher by 14%, accuracy higher by 11%, sensitivity higher by 14%, specificity higher by 12%, and convergence speed lower by 29% compared to state-of-the-art methods.
Introduction
Clinicians in all specialties can use genomic technologies to diagnose patients who have a high probability of carrying genetic abnormalities that result in disease. Researchers are employing these technologies to identify new genes that cause genetic disease at an astounding rate. Genomic technologies are also increasingly used to understand the contribution of both rare and common genetic factors to the development of common diseases such as diabetes and high blood pressure.
Over the past few years, genomic sequencing technology has improved considerably, which has reduced the cost of genetic testing and made it more accessible to individuals. As a result, several laboratories have begun to offer genetic testing even to healthy persons who want to know whether they are at risk of developing a genetic disease based on their ancestry, such as a genetic predisposition to cancer or heart disease. The purpose of testing healthy individuals is to help them understand which health risks may affect them, so that increased screening can detect a disorder at the earliest stage.
Genomic disorders are diseases that result from structural variations in the human genome, such as the loss or gain of chromosomal or DNA material. Genomic data is a kind of genetic data that describes the structure and function of an organism’s genome. A genetic disorder is a health problem caused by one or more abnormalities in a person’s genetic material. These abnormalities can have a number of different causes, such as mutations, chromosomal abnormalities, and environmental factors. Four different types of genetic disorders are considered and shown in Fig. 1.
The genome disorder prediction model is designed to predict genome disorders efficiently and to process a large volume of patients’ genome disorder data with multi-class prediction. Genomic prediction has also been proposed as a tool for predicting the phenotype of single-cross hybrids that were not tested in field trials, an approach that saves time and cost compared to traditional methods and makes it a powerful tool in plant breeding. By building a prediction model from a training set with markers and phenotypes, genomic estimated breeding values are used as predictions of breeding values in a target set with only genotype data.
Healthcare 4.0 requires systems that can connect directly to, and interoperate with, big data. A Blast Local Assignment Search Tools (BLAST)1 approach was proposed for smart devices in the health sector to analyze genome sequences for individual patients. An enhanced pattern-matching mechanism built on Hadoop was also designed. BLAST was employed to identify biological sequence information and measure statistical significance in blocks via a mapper, whose outputs were then combined by a reducer. This mapper and reducer design speeds up the entire process, improving execution time and accuracy significantly. Despite the improvement in time and accuracy, the true positive rate was not addressed.
A method called Driver-Oriented Genomics Analysis (DrOGA) was proposed in2 to improve precision, recall, and accuracy by employing Explainable Artificial Intelligence. A new feature engineering pipeline was also designed to represent DNA alternatives via 70 feature vectors acquiring contemporary maintenance, practical and ensemble scores, permitting precision oncology therapies to be chosen based on data acquired via personal analysis. Despite the improvement in accuracy, the time factor was not addressed.
Biological and medical advancements have produced immense volumes of both biological and physiological data, for example from medical image processing, genome sequencing, and electroencephalography. Learning from these data makes it easier to understand human health and to predict disease. A survey of IoT-based remote healthcare monitoring was presented in3.
An overview of deep learning techniques was given in4, along with a detailed discussion of the classification and prediction of protein structure. Transferring data between locations takes a large amount of time, which causes high latency and energy concerns. To handle these issues, a CNN-based edge computing prediction model was presented in5. Edge computing ensured fast availability of resources and short response times via local edge servers. Although the response time improved, the accuracy factor was not addressed.
The IoT is advancing into many walks of life, including healthcare, where its adoption has been much slower. Medical IoT integrates medical devices and people and depends heavily on wireless communication to enable the exchange of healthcare data and the remote monitoring of patients, thereby providing a higher quality of life for patients. Medical IoT in healthcare has grown rapidly but is still not foolproof. A meta-analysis and systematic review of biomedical imaging and applications of deep neural networks was presented in6. Another holistic literature review of the medical Internet of Things was presented in7.
The gap between the physical and cyber worlds is now bridged by the IoT, which permits close observation of user attention. The two-way relationship between users and items calls for a justifiable and efficient recommendation system that captures users’ inclinations and behavioral patterns more effectively. Recommendation systems aim to generate a set of comprehensive recommendations for a specific user based on that user’s behavior patterns and preferences.
Four well-known clustering algorithms were applied in8 in the IoT context, like recommending drugs and medicines to patients and so on. Yet another recommendation system employing fuzzy ontology by means of Type 2 was presented in9. This type of combination resulted in the improvement of prediction accuracy to a greater extent. However, the convergence speed or the rate at which the recommendation was made remained a major issue to be addressed.
Nevertheless, much clinical data remains hidden in narrative clinical records. Biomedical natural language processing (NLP) mechanisms are therefore needed to unlock the full potential of electronic health record data by transforming the narrative into structured data. In this way, biomedical NLP applications can be used to make clinical decisions, address medical issues, and significantly delay the development of a disease. In10, the application of electronic health record data to clinical research on chronic diseases was reviewed, and the perspectives and applications of biomedical NLP techniques were put forward. Another recommendation mechanism based on mining was presented in11 with a detailed analytical study.
A DNA sequencing method employing machine learning and deep learning techniques was elaborated in12. Researchers are now also exploring the potential of IoT systems in other fields, among them recommender systems. Over the past few years, their attention has shifted towards using recommender systems to enhance IoT user selection. Hence, the objective in13 was to study the prevailing mechanisms and issues and to investigate probable solutions. Another deep learning technique was applied in14 to predict drug response in cancer, where the use of principal component analysis improved accuracy to a greater extent.
Motivation and research questions
Genome disorder prediction is an essential issue in biomedical research. With the growth of technology, genetic data has been extended to cover the entire genome. The ability to recognize the genes responsible for certain ailments simplifies patient diagnosis and offers insight into the functional network of connections and mutations. Potential genetic illnesses are detected during disease gene recognition. Machine learning and deep learning methods were developed to forecast genome disorders, but the prediction outcomes of these approaches were uncertain because of their low accuracy and long processing time on genome sequence data. In addition, the true positive rate, sensitivity, specificity, and recognition rate were not measured in prediction models for biomedical applications.
This paper describes genomic disorder prediction in biomedical applications and addresses the following research questions:
1. What are the challenges of genomic disorder prediction?
2. What are the factors affecting genomic disorders?
3. What are the methods used for genomic disorder prediction?
4. What is the machine learning approach for predicting?
5. How are genetic diseases detected?
Novelty and contribution
Conventional methods are limited by low accuracy, a lack of attention to precision and recall, a low genome disorder detection rate, and a long convergence time. To overcome these issues, a genome disorder prediction method for biomedical applications, called Quadratic Feynman Polynomial Interpolated and Vector Nearest Centroid-based (QFPI-VNC), is proposed. The major contributions of this work are listed below.
- To improve sensitivity and specificity, the Quadratic Feynman Polynomial Interpolated and Vector Nearest Centroid-based (QFPI-VNC) genome disorder prediction method is designed.
- The QFPI-VNC method uses the Linear Quadratic and Feynman Kac Genome filtering algorithm to obtain computationally efficient filtered results by measuring the system’s state and the system’s moments. Hence, it reduces the convergence speed.
- The QFPI-VNC method applies Concordance Correlated Polynomial Interpolation-based Genome-wide Data Extraction to extract relevant features for genome disorder prediction via the Concordance Correlation Coefficient and the Lagrange optimization function. With this, the genetic disorder recognition rate is improved.
- Support Vector and Nearest Centroid-based Genome Disorder prediction is employed in the QFPI-VNC method to differentiate between two types of genome disorders, mitochondrial inheritance disorder and multi-factorial inheritance disorder. In this way, accurate genome disorder prediction is achieved.
- Finally, the performance of the proposed QFPI-VNC-based genome disorder detection method is compared with state-of-the-art methods.
The rest of the paper is organized as follows. Section "Related works" reviews related work on biomedical applications. Section "Methodology" describes the Quadratic Feynman Polynomial Interpolated and Vector Nearest Centroid-based (QFPI-VNC) genome disorder prediction method. After that, Section "Experimental results" presents the experimental results, followed by a detailed comparative study between the proposed QFPI-VNC method and other state-of-the-art methods with the aid of tables and graphical representations. Finally, Section "Conclusion" concludes the paper.
Related works
The IoT is a pivotal technological innovation as far as networking is concerned. It has opened up boundless possibilities, affected daily life, and brought about a transformation in healthcare and biomedical frameworks. Reliable and precise IoT-based healthcare has been considered a demanding task over the past few years. The need to provide better healthcare to the public by minimizing cost, enhancing accuracy, and compensating for the shortage of medical staff are significant concerns. A linear quadratic regression model for IoT-based healthcare monitoring systems was presented in15 to reduce the mean square error and improve accuracy. The sensor overhead was minimized, and sensor energy was saved. However, the genome disease detection rate was not addressed.
A holistic survey of applications of genomics using machine learning techniques was presented in16. Understanding the genomes of different species, in particular the analysis of more than 3 billion base pairs, is a pivotal objective of genomic studies. Genomics takes a panoramic perspective that covers all the genes within an organism. Deep learning techniques for human genomics were reviewed in17, including a detailed account of when to apply them and which technique to apply. The application of deep learning tools in genomics was discussed, and deep learning algorithms were employed to handle high-throughput data. In18, a data collection mechanism for biomedical applications using machine learning algorithms in the field of sports was presented in detail. An artificial intelligence-based body sensor network framework (AIBSNF) was utilized to integrate wearable biosensors to gather multivariate, low-noise, high-fidelity data. Wearable sensor technology and real-time location systems (RTLS) were investigated, and a large number of wearable sensor methods were analyzed for big data in healthcare. Continuous sensor monitoring was also performed in19, employing machine learning for data processing, and the designed method improved classification accuracy.
Several patients with genetic syndromes possess distinctive features that have important value for clinical diagnosis. Deep learning can be used to diagnose genetic diseases by examining such features of patients. A method called BioFace, employing a deep learning technique, was presented in20 to identify multiple genetic diseases. Squeeze-and-Excitation (SE) blocks were used to increase the weight of effective features in the network, and a cross-loss training method was employed to achieve higher accuracy. However, the diagnosis of genetic diseases was not addressed.
Another method, the advanced genome disorder prediction method (AGDPM), was proposed in21 employing the AlexNet neural network. With this design, high mortality rates were said to be controlled efficiently, but accurate prediction results were not obtained. To address this issue, the classification of Alzheimer’s disease was performed in22 using deep learning techniques. Moreover, the Boruta algorithm was employed to obtain the principal components, thereby yielding accurate disease identification. Abnormal growth and gene activity in cells may result in cancer, and cancer-related genes are said to grow when a mutation occurs. Hence, cancer identification is a critical and demanding issue for researchers.
In23, robust features were obtained by means of a position-based incidence matrix. Following this, an incidence vector based on absolute position was employed for dimensionality conversion. Finally, learning techniques were applied to train the model with improved accuracy. The proposed model predicted from the primary structure whether or not a gene is a cancer driver gene. A review of precision and genomic medicine was presented in24 to improve patient healthcare. Another genetic-based prediction method was proposed in25 to focus on genomic disorder prediction with a high rate of accuracy.
Support vector machine (SVM) and K-nearest neighbor (KNN) machine learning techniques were developed in26 for forecasting disease, but the quality of the genetic sequence data was not improved. A novel feature engineering approach was introduced in27 to extract features with higher prediction performance. An IoMT-based machine learning model was analyzed in28 to enhance prediction outcomes with higher accuracy. A machine learning model was developed in29 to find gene biomarkers. However, the dataset size was not reduced.
The robust software MTGIpick was proposed in30 to identify genomic islands (GIs) from the host by using regions whose pattern bias shows multiscale difference levels. As noted in31, genomic islands are associated with microbial adaptations, and some methods perform an overall test to identify genomic islands based on their local features. However, regions of different scales display different genomic features. Inhibitors of mammalian G1 cyclin-dependent kinases were discussed in32. A method in which MeDIP-seq assists SMRT-seq for quality control in 6mA identification was introduced in33 to identify 6mA events without whole genome amplification (WGA) sequencing. A disease-based mutation database of human papillomavirus was designed in34; it provides convenient browsing and search functions as well as mutation and domain combination analysis, and also includes a tool to predict the HPV genotype. A feature selection method was designed in35 to enhance the efficiency of protein structural class prediction. A summary of the related work is given in Table 1.
Motivated by the above-mentioned techniques, in this work a method called, Quadratic Feynman Polynomial Interpolated and Vector Nearest Centroid-based (QFPI-VNC) genome disorder prediction method is proposed. An elaborate description of the QFPI-VNC method is provided in the following sections.
Methodology
Genomic medicine is a relatively new medical specialty that uses genetic information about an individual for diagnostic and therapeutic purposes, together with the associated health outcomes and policy implications. Early detection of genetic disorders is highly advantageous to the biomedical sector with regard to prescribing medicines for treatment. In this study, we propose a Quadratic Feynman Polynomial Interpolated and Vector Nearest Centroid-based (QFPI-VNC) genome disorder prediction method for the early detection of genomic abnormalities. Figure 2 illustrates the structure of the QFPI-VNC method.
As Fig. 2 shows, genome samples are first taken as input from the genome dataset and sent to the filtering stage, where the Linear Quadratic and Feynman Kac Genome filtering model is applied to clean the outliers. After the outliers are cleaned, the optimal features are selected using the Concordance Correlated Polynomial Interpolation-based Genome-wide Data Extraction model to predict genome disorder. The extracted feature data is then split into training and testing patterns. The training patterns are sent to the QFPI-VNC method, and at the end of training they are stored in a database for easy access. Finally, the testing patterns use the filtered data and the trained model for genome disorder prediction with Support Vector and Nearest Centroid-based Genome Disorder prediction. In this way, genome disorder outcomes such as mitochondrial inheritance disorder and multi-factorial inheritance disorder are predicted with high accuracy.
Dataset description
The proposed method uses the genomes dataset from https://www.kaggle.com/datasets/aryarishabh/of-genomes-and-genetics-hackerearth-ml-challenge for predicting genome disorders. The dataset is divided into training (70%) and testing (30%) portions for evaluating performance, which matches the 22,083 training and 9465 testing instances reported below. The genomes dataset folder consists of the following files:
1. Train.csv
2. Test.csv
3. Sample.csv
In the above three files, the training sheet consists of 45 features with overall instances of 22,083. In a similar manner, the testing sheet consists of 43 features with an overall instance of 9465. Finally, the sample sheet consists of 3 features with five types of genetic disorder and disorder subclass respectively. With the above feature set, an input matrix is formulated as given below.
In Eq. (1), the input matrix ‘\(\:IM\)’ consists of ‘\(\:m\)’ samples and ‘\(\:n\)’ features.
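The displayed form of Eq. (1) is not reproduced in this text. A minimal sketch of what such an input matrix usually looks like is given here, assuming that row \(i\) holds sample \(i\) and column \(j\) holds feature \(j\); the entry symbol \(x_{ij}\) is an illustrative assumption and does not come from the source.

\[
IM =
\begin{bmatrix}
x_{11} & x_{12} & \dots & x_{1n} \\
x_{21} & x_{22} & \dots & x_{2n} \\
\vdots & \vdots & \ddots & \vdots \\
x_{m1} & x_{m2} & \dots & x_{mn}
\end{bmatrix}
\]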
Linear Quadratic and Feynman Kac Genome filtering model
Quality control and filtering of biomedical data is frequently the preliminary step in processing genome sequencing data of disorder subclasses. It not only helps in examining the quality of the genome sequence data but also helps in acquiring high-quality data for biomedical applications. In this work, a Binate Filtrate Model (BFM) is employed to clean the outliers and acquire high-quality data. Here, Linear Quadratic Estimation is first used to evaluate the system’s state, whereas Feynman Kac Estimation measures the system’s moments. Figure 3 illustrates the Linear Quadratic and Feynman Kac Genome filtering model.
As illustrated in Fig. 3, with the input matrix values obtained from the genomes dataset, the system’s state (i.e., the sequence) is evaluated using a recursive model which assumes that the present state (i.e., the present sequence) of a system (i.e., genome sequence) depends on the state at the prior time step (i.e., the prior sequence). This is formulated as given below.
In Eq. (2), the linear quadratic estimate for the input matrix ‘\(\:{IM}_{l}\)’ is obtained by applying the sequence transition ‘\(\:{ST}_{l}\)’ to the prior state ‘\(\:l-1\)’ and adding the process noise ‘\(\:{\epsilon}_{l}\)’. Then, at time ‘\(\:l\)’, an observation ‘\(\:{Obs}_{l}\)’ of the current state ‘\(\:{IM}_{l}\)’ is formulated as given below.
The moments of the probability density of the target gene sequence instance cannot, in general, be computed analytically; therefore, Feynman Kac Estimation is employed in our work to approximate them. The idea is to generate random samples called particles. Each particle is then assigned a weight that reflects the discrepancy between the actual and estimated results. With the time horizon fixed at ‘\(\:l\)’ for a genome sequence of observations ‘\(\:{Obs}_{1},\:{Obs}_{2},\:\dots\:,{Obs}_{l}\)’, the Feynman Kac Estimation of the bounded sequence function moments is mathematically formulated as given below.
In Eq. (4), the Feynman Kac Estimation from sequence ‘\(\:1\)’ to sequence ‘\(\:l\)’ is generated for any bounded genome sequence function ‘\(\:FK\)’ on the set of current-state input matrices ‘\(\:{IM}_{l}\)’. Finally, for the indicator ‘\(\:N\)’ with indicator function ‘\(\:{IF}_{N}\)’, the expected conditional distribution used to clean the outliers and obtain the system moments is generated as given below.
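Linear quadratic estimation is the technique commonly known as the Kalman filter, so the state and observation recursion described here can be illustrated with a scalar filter. The following is a minimal sketch only, not the authors' implementation: the function name `kalman_1d`, the transition coefficient `a`, and the noise variances `q` and `r` are illustrative assumptions, and a real genome feature matrix would need the vector form.

```python
def kalman_1d(observations, a=1.0, q=1e-3, r=0.5, x0=0.0, p0=1.0):
    """Scalar Kalman filter; returns the filtered state estimate at each step.

    a  : assumed state transition coefficient (state_l = a * state_{l-1} + noise)
    q  : assumed process-noise variance
    r  : assumed observation-noise variance
    x0 : initial state estimate, p0 : initial estimate variance
    """
    x, p = x0, p0
    estimates = []
    for z in observations:
        # Predict: propagate the prior state through the transition.
        x_pred = a * x
        p_pred = a * p * a + q
        # Update: blend prediction and observation using the Kalman gain.
        gain = p_pred / (p_pred + r)
        x = x_pred + gain * (z - x_pred)
        p = (1.0 - gain) * p_pred
        estimates.append(x)
    return estimates
```

For example, `kalman_1d([1.1, 0.9, 1.05, 5.0, 0.95, 1.0], x0=1.0)` pulls the isolated value 5.0 back toward the surrounding values, which is the smoothing behaviour that an outlier-cleaning step relies on.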
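Feynman-Kac formulations are the basis of particle filters, in which weighted random particles approximate the moments of a distribution that cannot be computed in closed form. The sketch below is a generic bootstrap particle filter under stated assumptions (a random-walk state and Gaussian observation noise); the function name `particle_moments` and the parameters `q`, `r`, and `n_particles` are hypothetical and are not taken from the source.

```python
import math
import random


def particle_moments(observations, n_particles=500, q=0.05, r=0.5, seed=0):
    """Bootstrap particle filter for an assumed random-walk state.

    Returns a list of (mean, variance) pairs, one per observation,
    computed from the weighted particle cloud.
    """
    if not observations:
        return []
    rng = random.Random(seed)
    particles = [rng.gauss(observations[0], 1.0) for _ in range(n_particles)]
    moments = []
    for z in observations:
        # Propagate each particle through the assumed random-walk transition.
        particles = [p + rng.gauss(0.0, math.sqrt(q)) for p in particles]
        # Weight each particle by the Gaussian likelihood of the observation.
        weights = [math.exp(-0.5 * (z - p) ** 2 / r) for p in particles]
        total = sum(weights)
        if total == 0.0:
            # All likelihoods underflowed: fall back to uniform weights.
            weights = [1.0 / n_particles] * n_particles
        else:
            weights = [w / total for w in weights]
        # First and second central moments of the weighted cloud.
        mean = sum(w * p for w, p in zip(weights, particles))
        var = sum(w * (p - mean) ** 2 for w, p in zip(weights, particles))
        moments.append((mean, var))
        # Resample so that low-weight particles are dropped.
        particles = rng.choices(particles, weights=weights, k=n_particles)
    return moments
```

An observation that lies far from the estimated mean relative to the estimated variance could then be flagged as an outlier; the threshold for doing so is a design choice that the source does not specify.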
Finally, the corrected labels or the filtered dataset are formulated for further processing. The pseudo-code representation of the Linear Quadratic and Feynman Kac Genome filtering model is given below.
As given in the Linear Quadratic and Feynman Kac Genome filtering algorithm, the overall functionality is split into two processes, which is why it is referred to as the Binate Filtrate Model (BFM). First, with the genomes dataset as input, the system’s state (the genome state for each patient) is obtained using Linear Quadratic Estimation. This estimation infers unknown sequences from a series of genome sequences, which tends to be more precise than an estimate based on a single genome sequence of a patient, and therefore helps to yield accurate results. Next, the system’s moments (the genome moments for each patient) are obtained by means of Feynman Kac Estimation, which in turn removes the outliers and therefore improves the convergence speed significantly.
Concordance correlated polynomial interpolation based genome-wide data extraction model
Extracting medical data from clinical systems to analyze genetic disorders, associate unique functional variants, estimate disorder subclass heritability using log features, and explore associations between metabolite levels and genomic disorders with multi-molecular patterns is, for the most part, still done manually and is time-consuming. The advantages that machine learning models offer for investigating large and intricate biomedical information hold tremendous promise for accelerating developments in genetic medicine.
To comprehend early genetic disorders and reduce time-consuming process, a Concordance Correlated Polynomial Interpolation-based Genome-wide Data Extraction model is designed in our work. The Concordance Correlation Coefficient in our work evaluates the consensus between two filtered results. This is mathematically formulated as given below.
In Eq. (6), ‘\(\:{\mu\:}_{a}\)’ and ‘\(\:{\mu\:}_{b}\)’ denote the means of the single-sample filtered results, whereas ‘\(\:{\sigma\:}_{a}^{2}\)’ and ‘\(\:{\sigma\:}_{b}^{2}\)’ represent their variances. In a similar manner, the consensus between ‘\(\:N\)’ filtered results is determined as given below.
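For reference, the standard (Lin's) concordance correlation coefficient between two samples is \(2\sigma_{ab} / (\sigma_a^2 + \sigma_b^2 + (\mu_a - \mu_b)^2)\), where \(\sigma_{ab}\) is the covariance. A minimal sketch of that standard definition follows; the function name `concordance_correlation` is illustrative, population (divide by n) moments are an assumption, and the source may use a variant of this formula.

```python
def concordance_correlation(a, b):
    """Lin's concordance correlation coefficient of two equal-length samples."""
    if len(a) != len(b) or len(a) < 2:
        raise ValueError("need two samples of equal length >= 2")
    n = len(a)
    mu_a = sum(a) / n
    mu_b = sum(b) / n
    # Population variances and covariance (divide by n, an assumption here).
    var_a = sum((x - mu_a) ** 2 for x in a) / n
    var_b = sum((y - mu_b) ** 2 for y in b) / n
    cov_ab = sum((x - mu_a) * (y - mu_b) for x, y in zip(a, b)) / n
    denom = var_a + var_b + (mu_a - mu_b) ** 2
    if denom == 0.0:
        # Both samples are the same constant: treated here as perfect agreement.
        return 1.0
    return 2.0 * cov_ab / denom
```

Unlike the Pearson correlation, this coefficient penalizes a shift in mean: `concordance_correlation([1, 2, 3], [2, 3, 4])` is 4/7 even though the two samples are perfectly linearly related.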
From the above Eqs. (7), (8), and (9), mean, variance, and covariance of the genome sequence consensus between ‘\(\:N\)’ filtered results are obtained. Finally, to the obtained results, an enhanced interpolation model that identifies a polynomial function to obtain top-rank features to predict genome disorder is formulated. The enhanced interpolation model that identifies a polynomial function is done using the Lagrange optimization function. Lagrange optimization function is mathematically formulated as given below.
From the results obtained via Eqs. (10) and (11), the resultant extracted features ‘\(\:{FE}_{j}\)’ identify the local maxima and minima of a function subject to consensus constraints. Table 2 lists the features extracted (i.e., 24 features) using the Concordance Correlated Polynomial Interpolation-based Genome-wide Data Extraction model.
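Eqs. (10) and (11) are not reproduced in this text, so their exact form is unknown. One plausible reading of the polynomial interpolation step is classical Lagrange interpolation, in which a polynomial is fitted exactly through a set of nodes. The sketch below illustrates only that classical technique; the function name `lagrange_interpolate` is hypothetical, and how the authors couple the interpolation with the consensus constraints is not shown.

```python
def lagrange_interpolate(xs, ys, x):
    """Evaluate the Lagrange interpolating polynomial through (xs, ys) at x."""
    if len(xs) != len(ys) or len(set(xs)) != len(xs):
        raise ValueError("xs must be distinct and match ys in length")
    total = 0.0
    for j, (xj, yj) in enumerate(zip(xs, ys)):
        # Basis polynomial l_j(x): equals 1 at xj and 0 at every other node.
        basis = 1.0
        for m, xm in enumerate(xs):
            if m != j:
                basis *= (x - xm) / (xj - xm)
        total += yj * basis
    return total
```

With nodes taken from y = x^2 + x + 1, for instance `lagrange_interpolate([0, 1, 2], [1, 3, 7], 3)`, the polynomial is recovered exactly and evaluates to 13.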
The pseudo code representation of Concordance Correlated Polynomial Interpolation-based Genome-wide Data Extraction is given below.
As given in the above algorithm, to ensure precise genome-wide data analysis with minimum convergence speed, the filtered results from the genomes dataset are first acquired as input. Following this, highly correlated coefficients between two genome sample data are obtained using the Concordance Correlation Coefficient model. This yields the precise and accurate features required for further processing. Next, the Lagrange function is applied to the highly correlated features to identify the local maxima and minima subject to equality constraints (i.e., based on the genetic disorder and disorder subclass).
Support Vector Machine (SVM) and nearest centroid-based genome disorder prediction
The evolution of precision medicine in medical care has redirected the traditional symptom-driven treatment procedure by permitting early disease risk prediction via refined diagnostics. It is necessary to examine comprehensive patient data against a wide range of factors in order to distinguish between sick and relatively healthy people and thereby select the most pertinent approach to precision medicine. Precision and genomic medicine integrated with machine learning has the potential to enhance patient healthcare. Moreover, patients with less frequent therapeutic responses are using genomic medicine technologies. In this work, a genome disorder prediction model using a Support Vector Machine (SVM) and Nearest Centroid is presented. Figure 4 shows the structure of Support Vector Machine and Nearest Centroid-based Genome Disorder prediction.
As illustrated in Fig. 4, with the filtered results and extracted features provided as input, the genome disorder prediction model employs Nearest Centroid as a distance measure to evaluate the support vector hyperplanes. The support vector machine processes the extracted features ‘\(\:FE\)’ of the filtered results ‘\(\:FR\)’ (i.e., ‘\(\:{FE}_{u}\left[{FR}_{y}\right]\)’) on a dataset that contains medical information, and then produces a separating hyperplane that can differentiate between positive (i.e., mitochondrial inheritance disorder) and negative (i.e., multi-factorial inheritance disorder) samples.
In formulation (12), the extracted features of the filtered results are stored in the form of a vector, with which genome disorder prediction is performed via the centroid function. The training data consists of points of the form ‘\(\:\left\{\left({x}_{1},{y}_{1}\right),\:\left({x}_{2},{y}_{2}\right),\:\dots\:,\left({x}_{n},{y}_{n}\right)\right\}\)’, where each ‘\(\:{y}_{i}\)’ is either ‘\(\:-1\)’ or ‘\(\:+1\)’ and represents the class to which the data point ‘\(\:{x}_{i}\)’ belongs. With the training data being linearly separable, two parallel hyperplanes are selected that separate the two classes, positive (i.e., mitochondrial inheritance disorder) and negative (i.e., multi-factorial inheritance disorder). To improve the genome disorder prediction outcomes, the proposed method employs the nearest centroid function to arrive at the distance. These hyperplanes are mathematically stated as given below.
The distance between the two hyperplanes given in (13) and (14) is selected using the nearest centroid function, which assigns to each sample the label of the class whose centroid is closest to the sample. The nearest centroid function used to determine the distance between the two hyperplanes is mathematically formulated as given below.
In Eq. (15), ‘\(\:{Sym}_{i}\)’ refers to the set of symptoms of the samples associated with class ‘\(\:l\in\:Y\)’. With this, the genome disorder prediction is made in an accurate and precise manner. Finally, the predicted results are obtained as given below.
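Eqs. (13) and (14) are not reproduced in this text. For a linearly separable training set, the textbook hard-margin SVM writes the two parallel hyperplanes as follows, where the weight vector \(\mathbf{w}\) and offset \(b\) are standard symbols assumed here rather than taken from the source, and the authors' notation may differ.

\[
\mathbf{w}^{\top}\mathbf{x} - b = +1, \qquad \mathbf{w}^{\top}\mathbf{x} - b = -1
\]

In this standard form the distance between the two hyperplanes is \(2 / \lVert \mathbf{w} \rVert\), and every training point satisfies \(y_i(\mathbf{w}^{\top}\mathbf{x}_i - b) \ge 1\).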
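The nearest centroid rule described here (assign a sample the label of the class whose centroid is closest) can be sketched in a few lines. This is a generic illustration under stated assumptions: the function names `fit_centroids` and `predict_nearest_centroid` are hypothetical, Euclidean distance is assumed, and the way the authors combine the rule with the SVM hyperplanes is not reproduced.

```python
import math


def fit_centroids(samples, labels):
    """Return {label: centroid}, where each centroid is the per-class mean vector."""
    sums, counts = {}, {}
    for x, y in zip(samples, labels):
        if y not in sums:
            sums[y] = [0.0] * len(x)
            counts[y] = 0
        sums[y] = [s + v for s, v in zip(sums[y], x)]
        counts[y] += 1
    return {y: [s / counts[y] for s in sums[y]] for y in sums}


def predict_nearest_centroid(centroids, x):
    """Assign x the label of the class whose centroid is closest (Euclidean)."""
    return min(centroids, key=lambda y: math.dist(centroids[y], x))
```

With labels +1 and -1 standing for the two disorder classes, `fit_centroids` is called once on the training patterns and `predict_nearest_centroid` is called for each testing pattern.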
The pseudo code representation of Support Vector and Nearest Centroid-based Genome Disorder prediction is given below.
As given in the above algorithm, with the filtered results and extracted features acquired as input, a distance factor is introduced into the conventional Support Vector Machine with the objective of increasing the correct prediction of abnormal genome disorders and reducing the incorrect prediction of normal genome disorders. The distance factor is computed using the Nearest Centroid function, which, with the aid of per-class centroids, improves the detection rate significantly.
Experimental results
This section describes in detail the results of the simulations used to validate the proposed genome disorder prediction method, Quadratic Feynman Polynomial Interpolated and Vector Nearest Centroid-based (QFPI-VNC) genome disorder prediction. The experiment uses the genomes dataset, which consists of test files, training files, and sample submission files, each with a distinct number of instances. The proposed QFPI-VNC method was implemented in Python. The simulation results of the proposed QFPI-VNC method and of the state-of-the-art methods Blast Local Assignment Search Tools (BLAST) [1], Driver Oriented Genomic Analysis (DrOGA) [2] and Advance genome disorder prediction model (AGDPM) [21] are detailed below in terms of several prediction parameters: convergence speed, detection rate, sensitivity, specificity, accuracy and F1-score. For a fair comparison, the genomes dataset is applied to the proposed QFPI-VNC method and to BLAST [1], DrOGA [2] and AGDPM [21], and the results are averaged over 10 different simulation runs.
Case scenario 1: performance analysis of convergence speed
The first parameter in the analysis of genome disorder detection for biomedical applications is the rate of convergence, or convergence speed. The faster the convergence, the earlier a genome disorder is detected, and treatment can be given accordingly to avoid uncertainty. The convergence speed in our work is mathematically formulated as given below.
In Eq. (17), the convergence speed ‘\(\:CS\)’ is measured based on the medical information of the samples with genetic disorders ‘\(\:{S}_{i}\)’ and the actual time consumed in the prediction process ‘\(\:Time\left(PR\right)\)’. It is measured in milliseconds. Table 3 provides a numeric summary of the performance measures used to determine the convergence speed for the four methods, QFPI-VNC, BLAST [1], DrOGA [2] and AGDPM [21]. The comparison shows that the proposed QFPI-VNC performs better in terms of convergence speed than [1], [2] and [21].
Figure 5 illustrates the convergence speed of the four methods, QFPI-VNC, BLAST [1], DrOGA [2] and AGDPM [21], comparing the proposed QFPI-VNC method with the state-of-the-art methods [1], [2] and [21]. In the figure, the x-axis represents the sample instances involved in the genome disorder detection process and the y-axis represents the convergence speed consumed. The convergence speed is found to be directly proportional to the number of samples involved. In other words, increasing the sample size increases the amount of training data and therefore the filtered results, and hence a surge in convergence speed is observed. One of the objectives of the study is to minimize the convergence speed, which was considered the major disadvantage of [1]. Compared to the existing methods [1], [2] and [21], the convergence speed of the proposed QFPI-VNC method is reduced by differentiating between the system’s state and the system’s moments via the Binate Filtrate Model (BFM). With this BFM model, Linear Quadratic Estimation is used to measure the system’s state and Feynman Kac Estimation is used to evaluate the system’s moments, which in turn yields a small number of characteristics or filtered results. As can be inferred from the figure, the proposed QFPI-VNC method reduces the convergence speed by 18% compared to [1], 29% compared to [2] and 40% compared to [21].
Case scenario 2: performance analysis of genome disorder detection rate
The second significant parameter in genome disorder detection is the detection rate, because improper detection could even cause mortality. Hence, utmost care should be taken while analyzing genetic disorders. The genome disorder detection rate in our work is mathematically formulated as given below.
In Eq. (18), the genome disorder detection rate ‘\(\:GDDR\)’ is measured by taking into consideration the true positive rate ‘\(\:TP\)’ (i.e., correctly predicted abnormal genome disorder), the true negative rate ‘\(\:TN\)’ (i.e., correctly predicted normal genome disorder) and the samples involved in the simulation procedure ‘\(\:{S}_{i}\)’. Table 4 lists the numeric summary of the performance measures used to determine the genome disorder detection rate for the four methods, QFPI-VNC, BLAST [1], DrOGA [2] and AGDPM [21]. From the comparison, it can be inferred that the proposed QFPI-VNC performs better in terms of genome disorder detection rate than [1], [2] and AGDPM [21].
In Fig. 6, the genome disorder detection rate of the proposed method is compared with the other state-of-the-art methods [1], [2] and [21]. According to the study, the proposed QFPI-VNC has the highest detection rate. The proposed model achieves the maximum genome disorder detection rate largely through feature extraction via the Concordance Correlated Polynomial Interpolation-based Genome-wide Data Extraction functionalities. The mapper and reducer applied in [1] achieve only a minimal detection rate in comparison with the QFPI-VNC method. Explainable Artificial Intelligence in [2] operates effectively for a given iteration; nevertheless, as the sample size increases, its accuracy decreases. Simulations performed with 2000 samples gave true positive counts of 1930, 1920, and 1905 and true negative counts of 40, 35, and 30 for QFPI-VNC and the compared methods [1], [2] and [21], respectively. With this, the overall genome disorder detection rates for the three reported cases were 98.5%, 97.75%, and 96.75%. The reason behind the improvement is that only highly correlated features were obtained by applying the Lagrange function between two genome sample data. As a result, the genome disease detection rate of the QFPI-VNC method improved by 8% compared to [1], 14% compared to [2] and 21% compared to [21].
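The detection rates quoted above can be reproduced with a short calculation. The sketch assumes that Eq. (18) has the form GDDR = (TP + TN) / S × 100, which is consistent with the three reported percentages; the function name `detection_rate` is introduced here for illustration only.

```python
def detection_rate(tp, tn, n_samples):
    """Percentage of samples classified correctly: 100 * (TP + TN) / S."""
    if n_samples <= 0:
        raise ValueError("n_samples must be positive")
    return 100.0 * (tp + tn) / n_samples

# (TP, TN) pairs quoted in the text for simulations with 2000 samples.
reported = [(1930, 40), (1920, 35), (1905, 30)]
rates = [detection_rate(tp, tn, 2000) for tp, tn in reported]
print(rates)  # [98.5, 97.75, 96.75]
```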
Case scenario 3: performance analysis of sensitivity and specificity
Finally, this section analyzes the sensitivity and specificity rates of the genome disorder detection process. The formulas for sensitivity and specificity are given below.
In Eq. (19), the sensitivity rate ‘\(\:Sen\)’ is measured based on the true positive rate ‘\(\:TP\)’ (i.e., correctly predicted abnormal genome disorder) and the false negative rate ‘\(\:FN\)’ (i.e., abnormal genome disorder predicted incorrectly). On the other hand, the specificity rate ‘\(\:Spe\)’ in Eq. (20) is measured by taking into consideration the true negative rate ‘\(\:TN\)’ (i.e., correctly predicted normal genome disorder) and the false positive rate ‘\(\:FP\)’ (i.e., normal genome disorder predicted as abnormal). Table 5 tabulates the sensitivity and specificity of genome disorder detection for the four methods, QFPI-VNC, BLAST [1], DrOGA [2] and AGDPM [21]. From the comparison, it is found that the proposed QFPI-VNC performs better in terms of sensitivity and specificity than [1], [2] and [21].
Fig. 7(a) and 7(b) show the sensitivity and specificity of the proposed QFPI-VNC and the existing methods [1], [2] and [21]. From the figure, both the sensitivity and the specificity rates of the QFPI-VNC method were found to be better than those of [1], [2] and [21]. In other words, with simulations performed for 2000 samples and the actual true positives being 1970, 1960 samples were correctly detected using QFPI-VNC, whereas only 1952 were correctly detected using [1] and 1935 using [2]. As a result, the overall sensitivity rates of the three methods were observed to be 98%, 97.6%, and 96.75%, respectively.
Similarly, with the actual true negatives being 1940, 1930 samples were correctly predicted as normal genome disorder using the QFPI-VNC method, 1920 samples using [1] and 1905 samples using [2]. With this, the specificity rate was found to be 96.5% using the QFPI-VNC method, 96% using [1] and 95.25% using [2], respectively. The improvement in both sensitivity and specificity rates is due to the application of the Support Vector and Nearest Centroid-based Genome Disorder prediction algorithm. In this algorithm, the nearest centroid function is applied to identify the distance between the two hyperplanes in the support vector. Also, with the nearest centroid or mean, each data point is classified into the disorder type whose mean is closest to it, which improves the sensitivity rate of the QFPI-VNC method by 5%, 18% and 20% compared to [1], [2] and [21]. Similarly, the specificity rate of the QFPI-VNC method improved by 8%, 12% and 15% compared to [1], [2] and [21], respectively.
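As a small illustration of the two measures, the sketch below assumes that Eqs. (19) and (20) take the standard forms Sen = TP / (TP + FN) and Spe = TN / (TN + FP), which match the descriptions of the terms given above. The confusion-matrix counts are hypothetical values chosen for readability and are not taken from the experiments.

```python
def sensitivity(tp, fn):
    """Fraction of truly abnormal samples that are detected: TP / (TP + FN)."""
    if tp + fn == 0:
        raise ValueError("no positive samples")
    return tp / (tp + fn)

def specificity(tn, fp):
    """Fraction of truly normal samples that are recognised: TN / (TN + FP)."""
    if tn + fp == 0:
        raise ValueError("no negative samples")
    return tn / (tn + fp)

# Hypothetical counts for illustration only.
tp, fn, tn, fp = 90, 10, 80, 20
print(sensitivity(tp, fn))  # 0.9
print(specificity(tn, fp))  # 0.8
```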
Case scenario 4: performance analysis of Accuracy
Accuracy measures the proportion of samples whose genome disorder status is correctly identified. It is formulated as,
In Eq. (21), ‘\(\:Acc\)’ denotes the accuracy, ‘\(\:TP\)’ and ‘\(\:TN\)’ are the true positives and true negatives, and ‘\(\:FP\)’ and ‘\(\:FN\)’ are the false positives and false negatives. Accuracy is measured as a percentage (%).
Fig. 8 and Table 6 show the accuracy of the proposed QFPI-VNC and the three existing methods [1], [2] and [21]. The accuracy of the QFPI-VNC method was found to be better than that of [1], [2] and [21], improving by 7%, 12% and 14%, respectively.
Case scenario 5: performance analysis of F1-score
The ‘\(\:F1\text{-}score\)’ is the harmonic mean of the ‘\(\:Precision\)’ and ‘\(\:Recall\)’ scores, evaluated as given below.
Fig. 9 and Table 7 show the F1-score of the proposed QFPI-VNC and the three existing methods [1], [2] and [21]. The F1-score of the QFPI-VNC method was found to be better than that of [1], [2] and [21], improving by 4%, 3% and 2%, respectively.
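The harmonic-mean definition of the F1-score can be sketched as follows, assuming the usual definitions Precision = TP / (TP + FP) and Recall = TP / (TP + FN). The counts are hypothetical and serve only to show the calculation.

```python
def f1_score(tp, fp, fn):
    """Harmonic mean of precision and recall computed from raw counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Hypothetical counts: precision = 0.9, recall = 0.75.
f1 = f1_score(tp=90, fp=10, fn=30)
print(round(f1, 4))  # 0.8182
```

Because the harmonic mean is dominated by the smaller of the two values, the result (about 0.82) sits closer to the recall of 0.75 than the arithmetic mean of 0.825 would.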
Statistical test/analysis
Statistical analysis is the procedure of gathering and examining large volumes of data to find trends and gain valuable insights. In our work, the statistical test for genome disorder prediction is the McNemar test (Table 8), a non-parametric test for paired nominal data. To evaluate the McNemar test, the genome sequencing data are placed into a 2 × 2 contingency table.
In Table 8, cells b and c are employed to estimate the McNemar test statistic, and it is as given below.
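For reference, the sketch below uses the standard uncorrected form of the McNemar statistic, χ² = (b − c)² / (b + c), where b and c are the two discordant cells of the 2 × 2 table; whether the paper applies a continuity correction is not stated, so this form is an assumption. The p-value uses the chi-square distribution with one degree of freedom, and the cell counts are hypothetical.

```python
import math

def mcnemar_statistic(b, c):
    """Uncorrected McNemar chi-square statistic from the discordant cells b and c."""
    if b + c == 0:
        raise ValueError("b + c must be positive")
    return (b - c) ** 2 / (b + c)

def chi2_sf_1df(x):
    """Survival function of the chi-square distribution with 1 degree of freedom."""
    return math.erfc(math.sqrt(x / 2.0))

# Hypothetical discordant counts for two classifiers evaluated on the same samples.
b, c = 10, 30
stat = mcnemar_statistic(b, c)
p_value = chi2_sf_1df(stat)
print(stat)  # 10.0
print(p_value < 0.05)  # True
```

A small p-value indicates that the two classifiers disagree asymmetrically, i.e. one of them is wrong on noticeably more of the discordant samples than the other.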
Table 9 tabulates the outcomes of the McNemar test for genome disorder detection using the four methods, QFPI-VNC, BLAST [1], DrOGA [2] and AGDPM [21]. From the comparison, it is found that the proposed QFPI-VNC performs better in terms of the McNemar test than [1], [2] and [21].
Figure 10 shows the McNemar test (M-test) results for the proposed QFPI-VNC, BLAST [1], DrOGA [2] and AGDPM [21]. The statistical test is conducted on sample sizes between 2000 and 20,000. The improvement is due to initially obtaining highly correlated features with the aid of the Lagrange function between two genome sample data. The M-test result of QFPI-VNC is better than that of the existing methods by 14%, 21% and 9%, respectively.
Conclusion
In this study, a Quadratic Feynman Polynomial Interpolated and Vector Nearest Centroid-based (QFPI-VNC) method is proposed for predicting genome disorders in humans. Undesirable outliers are eliminated in the filtering stage with the Binate Filtrate Model to acquire high-quality data, and the feature input is then down-sampled to reduce the convergence speed of the subsequent processing. To precisely capture the optimal genome-wide features of the filtered results, a new Concordance Correlated Polynomial Interpolation-based Genome-wide Data Extraction algorithm is developed. Here, features corresponding to the local maxima and minima of a function subject to consensus constraints are extracted with a polynomial interpolation function. Additionally, the enhanced interpolation model helps to extract the pertinent features and also eliminates misdiagnosis. Finally, the pertinent features are fused and classified with the Support Vector and Nearest Centroid-based Genome Disorder prediction model. The genome dataset was used for the experimental assessment, and the results were compared with contemporary state-of-the-art methods. On the whole, the proposed model performs better in terms of sensitivity, specificity, convergence speed, and genome disorder detection rate. The limitation of the proposed model is that it does not consider a preprocessing method for eradicating noise and imputing missing data. In the future, the proposed method will be extended with a preprocessing method to handle missing data. To obtain more accurate prediction results, this research can also be expanded to include more genetic disorders and more than one prediction model.
Data availability
The datasets are publicly available. The datasets generated and/or analysed during the current work are available at: https://www.kaggle.com/datasets/aryarishabh/of-genomes-and-genetics-hackerearth-ml-challenge.
References
Lilhore, E. M. O. U. K. et al. Evaluation of IoT-Enabled hybrid model for genome sequence analysis of patients in healthcare 4.0, Measurement: Sensors, Elsevier, Jan 2023, Volume 36, Pages 1–7. (Blast Local Assignment Search Tools [BLAST]).
Bastico, M., Fernandez-Garcia, A., Belmonte-Hernandez, A. & Mayoral, S. U. DrOGA: An Artificial Intelligence Solution for Driver-Status Prediction of Genomics Mutations in Precision Cancer Medicine, IEEE Access, Apr Volume 10, Pages 37378–37391. (2023). (Driver Oriented Genomic Analysis [DrOGA]).
Mazin Alshamrani. IoT and artificial intelligence implementations for remote healthcare monitoring systems: A survey, Journal of King Saud University – Computer and Information Sciences, Elsevier, Jun 2021, Volume 34, Issue 8, Pages 4687–4701.
Chensi Cao, F. et al. Deep Learning and Its Applications in Biomedicine, Genomics Proteomics Bioinformatics, Elsevier, Mar Volume 16, Issue 1, Pages 17–32. (2018).
Piyush Gupta, A. V. et al. Prediction of Health Monitoring with deep Learning Using edge Computing, Volume 25, Pages 1–8 (Sensors, Elsevier, Jan 2023).
Yogesh, H., Bhosale & Sridhar Patnaik, K. Bio-medical imaging (X-ray, CT, ultrasound, ECG), genome sequences applications of deep neural network and machine learning in Diagnosis, Detection, Classification, and Segmentation of COVID-19: A Meta-analysis & Systematic Review, Multimedia Tools and Applications, Springer, Feb Volume 82, Pages 39157–39210. (2023).
Aledhari, M. et al. Biomedical IoT: Enabling Technologies, Architectural Elements, Challenges, and Future Directions, Mar Volume 10, Pages 31306–31339. (2022).
Kashef, R. Enhancing the Role of Large-Scale Recommendation Systems in the IoT Context, IEEE Access, Sep Volume 8, Pages 178248–178257. (2020).
Farman Ali, P. et al. Type-2 Fuzzy Ontology-aided Recommendation Systems for IoT-based Healthcare, Computer Communications, Pages 138–155 (Elsevier, Oct 2017).
Essam H. Houssein, R. E. Mohamed & A. A. Ali. Machine Learning Techniques for Biomedical Natural Language Processing: A Comprehensive Review, IEEE Access, Oct Volume 9, Pages 140628–140653. (2021).
Sahar Ajmal, M., Awais, Khaldoon, S., Khurshid, M. S. & Abdelrahman, A. Data mining-based recommendation system using social networks: an analytical study, PeerJ, Feb Volume 9. (2023).
Dileep, V. V. S. & Gummadi, N. R. R. DNA sequencing using machine learning and deep learning algorithms. Int. J. Innovative Technol. Exploring Eng. (IJITEE). ISSN (Online), 2278–3075 (September 2022).
Richa Sharma, Shalli Rani & Stephen Jeswinde Nuagh. RecIoT: A Deep Insight into IoT-Based Smart Recommender Systems, Wireless Communications and Mobile Computing, Elsevier, Jun 2022, Pages 1–15. (2022).
Hanan Ahmed, S., Hamad, H. A., Shedeed, S. & Hussein. IEEE Access, Sep Volume 10, Pages 106050–106058. (2022).
Divya Upadhyay, P., Garg, S. M., Aldossary, J., Shafi & Kumar, S. A Linear Quadratic Regression-Based Synchronised Health Monitoring System (SHMS) for IoT Applications, Electronics, Oct 2023, Volume 12, Issue 2, Pages 1–16.
Huang, K., Xiao, C., Glass, L. M. & Critchlow, C. W. Greg Gibson, and Jimeng Sun, Machine Learning Applications for Therapeutic Tasks with Genomics data, Pages 1–10 (Patterns, Cell, Oct 2021).
Wardah, S., Alharbi & Rashid, M. A Review of deep Learning Applications in Human Genomics Using next–generation sequencing data, Pages 1–20 (Human Genomics, 2022).
Ashwin, A., Phatak, Franz Georg Wieland, K., Vempala, F., Volkmar & Memmert, D. Artificial Intelligence Based Body Sensor Network Framework, Narrative Review: proposing an end–to–end Framework using Wearable Sensors, Real–Time Location Systems and Artificial Intelligence/Machine Learning Algorithms for Data Collection, Data Mining and Knowledge Discovery in Sports and Healthcare, Sports Medicine, Springer, Nov Volume 7, Pages 1–15. (2021).
Simone Aiassa, P. M. et al. Smart Portable Pen for continuous monitoring of Anaesthetics in human serum With Machine Learning. IEEE Trans. Biomed. Circuits Syst. 15 (Issue 2), 294–302 (Apr 2021).
Jianfeng Wang, B. et al. Haitao Lv, Lin Hei, Multiple Genetic Syndromes Recognition Based on a Deep Learning Framework and Cross-Loss Training, IEEE Engineering in Medicine and Biology Society Section, Nov Pages 1–9. (2022).
Atta-Ur-Rahman, M. U. et al. Advance Genome Disorder Prediction Model Empowered With Deep Learning, Jul 10, Pages 70317–70328. (2022).
Alatrany, A. S., Khan, W., Hussain, A. & Al-Jumeily, D. Wide and deep learning based approaches for classification of Alzheimer’s disease using genome-wide association studies. PLOS ONE 18 (Issue 5), 1–21 (May 2023).
Yasir Ali, M. et al. Amel Ksibi, IDriveGenes: Cancer Driver Genes Prediction Using Machine Learning, Mar Pages 1–1. (2023).
Quazi, S. Artificial Intelligence and Machine Learning in Precision and genomic Medicine, Medical Oncology, Volume 39, Issue 120, Pages 1–18 (Springer, Jun 2022).
Steven J. Schrodi, Shubhabrata Mukherjee, Y. S., Gerard Tromp, J. J. S., Callear, A. P. & Zhan Ye, T. C. C. Murray H. Brilliant, Paul K. Crane, Diane T. Smelser, Robert C. Elston and Daniel E. Weeks, Genetic-based prediction of disease traits: prediction is very difficult, especially about the future. Front. Genet. 5 (Issue 162), 1–19 (Jun 2014).
Taher M, Ghazal, H. A., Hamadi, M. U., Nasir, Atta-ur-Rahman, M. & Gollapalli Muhammad Zubair, Muhammad Adnan Khan, and Chan Yeob Yeun, Supervised Machine Learning Empowered Multifactorial Genetic Inheritance Disorder Prediction, Computational Intelligence and Neuroscience, May Volume 2022, Pages 1–10. (2022).
Raza, A. & Rustam, F. Hafeez Ur Rehman Siddiqui, Isabel de la Torre Diez, Begoña Garcia-Zapirain, Ernesto Lee and Imran Ashraf, Predicting Genetic Disorder and Types of Disorder Using Chain Classifier Approach, Genes, MDPI, Dec 2022, Volume 14, Issue 1, Pages 1–71.
Atta-ur Rahman, M. U., Nasir, M., Gollapalli, S. A., Alsaif, A. S. & Almadhor Shahid Mehmood, Muhammad Adnan Khan, and Amir Mosavi, IoMT-Based Mitochondrial and Multifactorial Genetic Inheritance Disorder Prediction Using Machine Learning, Computational Intelligence and Neuroscience, Hindawi, Jun Pages 1–8. (2022).
Ashraf Abou Tabl, Abedalrhman Alkhateeb, ElMaragh, W., Rueda, L. & Ngom, A. A Machine Learning Approach for identifying Gene biomarkers guiding treatment of breast Cancer. Front. Genet. 10, 1–13 (May 2019).
Qi Dai, C. et al. MTGIpick allows robust identification of genomic islands from a single genome. Brief. Bioinform. 1 (19, Issue 3), 361–373 (2018 May).
Kong, R. et al. 2SigFinder: the combined use of small-scale and large-scale statistical testing for genomic island detection from a single genome. BMC Bioinform. 21 (Issue 159), 1–15 (April 2020).
Charles, J., Sherr, I. & Roberts, J. M. May, Inhibitors of mammalian G1 cyclin-dependent kinases, genes dev, 9, Issue 10, Pages 1149–1163 (2024).
Siqian Yang, Y., Wang, Y., Chen & Dai, Q. March, MASQC: Next generation sequencing assists third generation sequencing for Quality Control in N6-Methyladenine DNA identification, 11, Pages 1–10. (2020).
Zhenyu Yang, W. et al. HPVMD-C: a disease-based mutation database of human papillomavirus in China, Database, March Volume 2022, Pages 1–8 (2022).
Wang, Y., Xu, Y., Yang, Z., Liu, X. & Dai, Q. Using Recursive Feature Selection with Random Forest to Improve Protein Structural Class Prediction for Low-Similarity Sequences, Computational and Mathematical Methods in Medicine, May Volume 2021, Pages 1–9. (2021).
Acknowledgements
This research was financially supported by Princess Nourah bint Abdulrahman University Researchers Supporting Project number (PNURSP2024R393), Princess Nourah bint Abdulrahman University, Riyadh, Saudi Arabia. The authors extend their appreciation to the Deanship of Research and Graduate Studies at King Khalid University for funding this work through Large Research Project under grant number RGP2/549/45.
Funding
This research was financially supported by Princess Nourah bint Abdulrahman University Researchers Supporting Project number (PNURSP2024R393), Princess Nourah bint Abdulrahman University, Riyadh, Saudi Arabia. The authors extend their appreciation to the Deanship of Research and Graduate Studies at King Khalid University for funding this work through Large Research Project under grant number RGP2/549/45.
Author information
Contributions
All authors contributed equally to the conceptualization, formal analysis, investigation, methodology, and writing and editing of the original draft. All authors have read and agreed to the published version of the manuscript.
Ethics declarations
Competing interests
The authors declare no competing interests.
Additional information
Publisher’s note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.
About this article
Cite this article
Singh, S., Shukla, G., Agrawal, R. et al. Enhancing genomic disorder prediction through Feynman Concordance and Interpolated Nearest Centroid techniques. Sci Rep 14, 27653 (2024). https://doi.org/10.1038/s41598-024-72923-w
DOI: https://doi.org/10.1038/s41598-024-72923-w