Enhancing genomic disorder prediction through Feynman Concordance and Interpolated Nearest Centroid techniques

Singh, Sofia; Shukla, Garima; Agrawal, Rahul; Dhule, Chetan; Allabun, Sarah; Alqahtani, Mohammed S.; Othman, Manal; Abbas, Mohamed; Soufiene, Ben Othman

doi:10.1038/s41598-024-72923-w

Article
Open access
Published: 12 November 2024

Sofia Singh¹,
Garima Shukla²,
Rahul Agrawal³,
Chetan Dhule³,
Sarah Allabun⁴,
Mohammed S. Alqahtani⁵,⁶,
Manal Othman⁴,
Mohamed Abbas⁷ &
…
Ben Othman Soufiene⁸

Scientific Reports volume 14, Article number: 27653 (2024)

Abstract

Clinical biomedical applications of genomic technologies are extensive and offer opportunities to enhance healthcare across medical specialties. Genome disorder prediction is an important problem in biomedical research. Genome disorders cause multivariate diseases such as cancer, dementia, diabetes, and Leigh syndrome. Existing machine learning and deep learning methods have been introduced to forecast genome disorders, but their prediction outcomes have not been sufficient. To address this issue, we propose a new method called Quadratic Feynman Polynomial Interpolated and Vector Nearest Centroid-based (QFPI-VNC) prediction for accurately predicting genome disorders with improved sensitivity and specificity. First, we took medical data about children from a public genomes dataset and applied Linear Quadratic and Feynman Kac Genome filtering to obtain computationally efficient filtered results. Next, the filtered results are fed to Concordance Correlated Polynomial Interpolation to extract genome-wide data accurately. Finally, the extracted features are fused and fed to the Support Vector and Nearest Centroid model for genome disorder prediction. Experiments on the genome dataset confirm that the proposed method is promising and performs acceptably relative to state-of-the-art methods in terms of convergence speed, recognition rate, sensitivity, and specificity. The results suggest that, compared with state-of-the-art methods, the QFPI-VNC method improves the genome disease detection rate by 14%, accuracy by 11%, sensitivity by 14%, and specificity by 12%, and lowers convergence speed by 29%.

Introduction

Genomic technologies can be used by clinicians in all specialties to diagnose patients who have a high probability of genetic abnormalities that result in disease. Researchers are using these technologies to identify new genes that cause genetic disease at an astounding rate. Genomic technologies are also increasingly being used to understand the contribution of both rare and common genetic variants to the development of common diseases such as diabetes and high blood pressure.

Over the past few years, genomic sequencing technology has improved markedly, which has reduced the cost of genetic testing and made it more accessible to individuals. As a result, several laboratories have begun to offer genetic testing even to healthy people who want to know whether they are at risk of developing a genetic disease based on their ancestry, such as a genetic predisposition to cancer or heart disease. The purpose of testing healthy people is to help them understand which health risks may affect them so that increased screening can detect a disorder at the earliest stage.

Genomic disorders are diseases that result from structural variations in the human genome, such as the loss or gain of chromosomal or DNA material. Genomic data is a kind of genetic data that describes the structure and function of an organism’s genome. A genetic disorder is a health problem caused by one or more abnormalities in a person’s genetic material. These abnormalities can be caused by a number of different factors such as mutations, chromosomal abnormalities, and environmental factors. Four different types of genetic disorders are considered, as shown in Fig. 1.

The genome disorder prediction model is designed to predict genome disorders efficiently and can process large volumes of patients’ genome disorder data with multi-class prediction. Genomic prediction has also been proposed as a tool for predicting the phenotype of single-cross hybrids that were not tested in field trials, which saves time and cost compared with traditional methods and makes it a powerful tool in plant breeding. By building a prediction model from a training set with markers and phenotypes, genomic estimated breeding values are used as predictions of breeding values in a target set with only genotype data.

Healthcare 4.0 requires systems that can connect directly to, and interoperate with, big data. A Basic Local Alignment Search Tool (BLAST)-based approach¹ was proposed for smart devices in the health sector to analyze the genome sequences of individual patients. An enhanced pattern-matching mechanism based on Hadoop concepts was also designed. BLAST was further used to identify biological sequence information and measure statistical significance in blocks via mappers, whose outputs were finally combined in a reducer. Using mappers and reducers speeds up the whole process, significantly improving execution time and accuracy. Despite these improvements, the true positive rate was not addressed.

A method called Driver-Oriented Genomics Analysis (DrOGA) was proposed in² with the purpose of improving the precision, recall, and accuracy. These three metrics were arrived at by employing Explainable Artificial Intelligence. Moreover, a new features engineering pipeline was also designed to constitute DNA alternatives via 70 feature vectors acquiring contemporary maintenance, practical and ensemble scores permitting precision oncology-chosen therapies based on data acquired via personal analysis. Despite improvement in accuracy, the time factor was not focused.

Biological and medical advancements have been providing immense data volumes involving both biological and physiological data, to name a few being, medical image processing, genome sequencing, electroencephalography, and so on. Learning from these data eases the comprehension process concerning human health and disease prediction. A survey of IoT-based remote healthcare monitoring was investigated in³.

An overview of deep learning techniques was discussed in⁴ along with the classification and prediction of protein structure in detail. Data transfer between places involved a huge amount of time, therefore causing high latency and concerns related to energy. To handle these types of issues, a CNN-based edge computing prediction model was presented in⁵. The evolution of edge computing in turn ensured swift availability of resources and response time via local edge servers. Though improvement was observed in response time, the accuracy factor was not focused.

The IoT is advancing into numerous walks of life, but its adoption in healthcare has been much slower. Medical IoT integrates medical devices and people and depends heavily on wireless communication to enable the exchange of healthcare data and remote patient monitoring, thereby improving patients’ quality of life. Medical IoT in healthcare has grown rapidly but is still not foolproof. A meta-analysis and systematic review of biomedical imaging and applications of deep neural networks was presented in⁶. Another comprehensive literature review of the medical Internet of Things was presented in⁷.

The interval between the physical and cyber world is now filled employing the IoT to permit strong scrutiny of the user’s attentiveness. The bilateral relationship between users and items would necessitate a justifiable and efficient recommendation system to persuade users’ inclinations and behavioral patterns in a better manner. The aid of the recommendation systems aims at generating a set of extensive recommendations for a specific user with behavior patterns and preferences.

Four well-known clustering algorithms were applied in⁸ in the IoT context, like recommending drugs and medicines to patients and so on. Yet another recommendation system employing fuzzy ontology by means of Type 2 was presented in⁹. This type of combination resulted in the improvement of prediction accuracy to a greater extent. However, the convergence speed or the rate at which the recommendation was made remained a major issue to be addressed.

Nevertheless, much clinical data remains hidden in narrative clinical records. Hence, biomedical natural language processing (NLP) mechanisms are needed to unlock the full potential of electronic health record data by transforming narrative text into structured data. In this manner, biomedical NLP applications can be used in making clinical decisions, addressing medical issues, and significantly delaying the development of a disease. In¹⁰, the application of electronic health record data to clinical research on chronic diseases was reviewed, and the perspectives and applications of biomedical NLP techniques were put forward. Another recommendation mechanism based on mining was presented in¹¹ together with a detailed analytical study.

A DNA sequencing method employing machine learning and deep learning techniques was elaborated in¹². Researchers are also identifying the potential of IoT systems in other areas, recommender systems among them. Over the past few years, researchers’ attention has shifted towards using recommender systems to enhance IoT user selection. Hence, the objective in¹³ was to study the prevailing mechanisms and issues and to investigate probable solutions. Another deep learning technique was applied in¹⁴ to predict drug response in cancer; employing principal component analysis further improved accuracy.

Motivation and research questions

Genome disorder prediction is an essential issue in biomedical research. With the growth of technology, genetic data has been enhanced to cover the entire genome. The ability to recognize the genes responsible for certain ailments simplifies patient diagnosis and offers insight into the operational network of connections and mutations. Potential genetic illness is detected during disease gene recognition. Machine and deep learning methods have been developed to forecast genome disorders, but their prediction outcomes were unreliable because of low accuracy and long processing time on genome sequence data. In addition, the true positive rate, sensitivity, specificity, and recognition rate were not measured in prediction models for biomedical applications.

This paper describes genomic disorder prediction in biomedical applications and addresses the following research questions:

1. What are the challenges of genomic disorder prediction?
2. What are the factors affecting genomic disorders?
3. What are the methods used for genomic disorder prediction?
4. What is the machine learning approach for predicting?
5. How are genetic diseases detected?

Novelty and contribution

The limitations of conventional methods are low accuracy, no consideration of precision and recall, a low genome disorder detection rate, and higher convergence speed. To overcome these issues, a genome disorder prediction method for biomedical applications, the Quadratic Feynman Polynomial Interpolated and Vector Nearest Centroid-based (QFPI-VNC) method, is proposed. The major contributions of this work are listed below.

To improve sensitivity and specificity, Quadratic Feynman Polynomial Interpolated and Vector Nearest Centroid-based (QFPI-VNC) genome disorder prediction is designed.
QFPI-VNC method uses Linear Quadratic, and Feynman Kac Genome filtering algorithm for filtering to recognize computationally efficient filtered results for measuring system’s state and system’s moments. Hence it reduces the convergence speed.
QFPI-VNC method applies Concordance Correlated Polynomial Interpolation-based Genome-wide Data Extraction for extracting relevant features for genome disorder prediction via Concordance Correlation Coefficient and Lagrange optimization function. With this, the genetic disorder recognition rate is improved.
Support Vector and Nearest Centroid-based Genome Disorder prediction is employed in the QFPI-VNC method to differentiate between two types of genomes, mitochondrial inheritance disorder, and multi-factorial inheritance disorder. In this way, accurate genome disorder prediction is achieved.
Finally, the performance of the proposed QFPI-VNC-based genome disorder detection method is compared with the state-of-the-art methods.

The rest of the paper is organized as given below. Section "Related works" provides the related works on biomedical applications. Section "Methodology" displays a brief description of the Quadratic Feynman Polynomial Interpolated and Vector Nearest Centroid-based (QFPI-VNC) genome disorder prediction method. After that, "Experimental results" presents the experimental results, and Sect. "Conclusion" introduces a detailed comparative study between the proposed QFPI-VNC method and the other state-of-the-art methods with the aid of a table and graphical representation. Finally, Sect. 6 concludes the paper.

Related works

The IoT is a pivotal technological innovation in networking. It has opened up boundless possibilities, affected daily life, and brought about a revolution in healthcare and biomedical frameworks. Reliable and precise IoT-based healthcare has been regarded as a demanding task over the past few years. The need to provide better healthcare to the public by minimizing cost, enhancing accuracy, and compensating for the shortage of medical staff is a significant concern. A linear quadratic regression model for IoT-based healthcare monitoring systems was presented in¹⁵ to reduce the mean square error and improve accuracy. The sensor overhead was minimized and sensor energy was saved, but the genome disease detection rate was not addressed.

A holistic survey of applications of machine learning techniques to genomes was presented in¹⁶. Comprehending the genomes of different species, in particular analyzing more than 3 billion base pairs, is a pivotal objective of genomic studies. Genomics takes a panoramic perspective that covers all the genes within an organism. Deep learning techniques for human genomics were reviewed in¹⁷, with a detailed account of which technique to apply and when. The review also discussed deep learning tools in genomics and the benefit of deep learning algorithms for high-throughput data. In¹⁸, a data collection mechanism for biomedical applications in sports using machine learning algorithms was presented in detail. An artificial intelligence-based body sensor network framework (AIBSNF) was used to integrate wearable biosensors to gather multivariate, low-noise, high-fidelity data. Wearable sensor technology and real-time location systems (RTLS) were investigated, and a large number of wearable sensor methods were analyzed for big data in healthcare. Continuous monitoring of sensors using machine learning for data processing was also presented in¹⁹, and the designed method improved classification accuracy.

Many patients with genetic syndromes have distinctive features that are of important value for clinical diagnosis. Deep learning can be used to diagnose genetic diseases by examining patients’ features. A method called BioFace, employing a deep learning technique, was presented in²⁰ to identify multiple genetic diseases. Squeeze-and-Excitation (SE) blocks were used to increase the weight of useful features in the network, and a cross-loss training method achieved higher accuracy. However, the diagnosis of genetic diseases was not addressed.

Another advanced genome disorder prediction method (AGDPM) employing the AlexNet neural network was proposed in²¹. This design was reported to help control high mortality rates efficiently, but accurate prediction results were not obtained. To address this issue, classification of Alzheimer’s disease using deep learning techniques was performed in²². The Boruta algorithm was employed to obtain the principal components, yielding accurate disease identification. Abnormal growth and gene activity in cells may result in cancer, and cancer-related genes are said to grow when mutations occur. Hence, cancer identification is a critical and demanding issue for researchers.

In²³, robust features were obtained by means of a position-based incidence matrix. Following this, an incidence vector based on absolute positive was employed for dimensionality conversion. Finally, learning techniques were applied to train the model with improved accuracy. The proposed model predicted, from primary structure, whether or not genes were cancer driver genes. A review of precision and genomic medicine aimed at improving patient healthcare was presented in²⁴. Another genetic-based prediction method was proposed in²⁵ for genomic disorder prediction with a high rate of accuracy.

Support vector machine (SVM) and K-nearest neighbor (KNN) machine learning techniques were developed in²⁶ for forecasting disease. But the genetic sequence data quality was not improved. A novel feature engineering approach was introduced in²⁷ to extract features with higher prediction performance. IoMT-based machine learning model was analyzed in²⁸ to enhance prediction outcomes with higher accuracy. A machine learning model was developed in²⁹ to find gene biomarkers. However, the dataset size was not lower.

The robust software MTGIpick was proposed in³⁰ to use regions with pattern bias showing multiscale difference levels to identify genomic islands (GIs) from the host. In³¹, genomic islands associated with microbial adaptations were studied; some methods perform an overall test to identify genomic islands based on their local features. However, regions of different scales display different genomic features. Inhibitors of mammalian G1 cyclin-dependent kinases were discussed in³². A method in which MeDIP-seq assists SMRT-seq for quality control in 6mA identification was introduced in³³ to identify 6mA events without whole genome amplification (WGA) sequencing. A disease-based mutation database of human papillomavirus was designed in³⁴; it not only provides convenient browsing and search functions and mutation and domain combination analysis but also includes a tool to predict HPV genotype. A feature selection method was designed in³⁵ to enhance the efficiency of protein structural class prediction. A summary of related work is given in Table 1.

Table 1 Summary of genome disorder prediction related literature.

Motivated by the above-mentioned techniques, in this work a method called, Quadratic Feynman Polynomial Interpolated and Vector Nearest Centroid-based (QFPI-VNC) genome disorder prediction method is proposed. An elaborate description of the QFPI-VNC method is provided in the following sections.

Methodology

Genomic medicine is a relatively new medical specialty that concentrates on using genetic information about an individual in therapy, for diagnostic purposes, and on the associated health outcomes and policy implications. Early detection of genetic disorders is highly advantageous to the biomedical sector with regard to prescribing medicines for treatment. In this study, we propose a Quadratic Feynman Polynomial Interpolated and Vector Nearest Centroid-based (QFPI-VNC) genome disorder prediction method for the early detection of genomic abnormalities. Figure 2 illustrates the structure of the QFPI-VNC method.

Initially, genome samples are taken as input from the genome dataset and sent to the filtering stage, where the Linear Quadratic and Feynman Kac Genome filtering model is applied to clean outliers. After the outliers are cleaned, the optimal features are selected using the Concordance Correlated Polynomial Interpolation-based Genome-wide Data Extraction model. The extracted feature data is then split into training and testing patterns. The training patterns are passed to the QFPI-VNC method, and once training ends they are stored in a database for easy access. Finally, the testing patterns are evaluated against the filtered data and the trained model using Support Vector and Nearest Centroid-based Genome Disorder prediction. In this way, genome disorder outcomes such as mitochondrial inheritance disorder and multi-factorial inheritance disorder are predicted with high accuracy.

Dataset description

The proposed method uses a genomes dataset taken from https://www.kaggle.com/datasets/aryarishabh/of-genomes-and-genetics-hackerearth-ml-challenge to predict genome disorders. The dataset is divided into training (70%) and testing (30%) sets for performance evaluation, in line with the instance counts of the training and testing sheets given below. The genomes dataset folder consists of the following files:

1. Train.csv
2. Test.csv
3. Sample.csv

Of the above three files, the training sheet consists of 45 features and 22,083 instances. Similarly, the testing sheet consists of 43 features and 9,465 instances. Finally, the sample sheet consists of 3 features with five types of genetic disorder and disorder subclass. With the above feature set, an input matrix is formulated as given below.

$$\:IM=\left[\begin{array}{cccc}{S}_{1}{F}_{1}&\:{S}_{1}{F}_{2}&\:\dots\:&\:{S}_{1}{F}_{n}\\\:{S}_{2}{F}_{1}&\:{S}_{2}{F}_{2}&\:\dots\:&\:{S}_{2}{F}_{n}\\\:\dots\:&\:\dots\:&\:\dots\:&\:\dots\:\\\:{S}_{m}{F}_{1}&\:{S}_{m}{F}_{2}&\:\dots\:&\:{S}_{m}{F}_{n}\end{array}\right]$$

(1)

In Eq. (1), the input matrix ‘$IM$’ consists of ‘$m$’ samples (rows) and ‘$n$’ features (columns).

Linear Quadratic and Feynman Kac Genome filtering model

Quality control and filtering of biomedical data is frequently the preliminary step in processing genome sequencing data of a disorder subclass. It not only helps in examining the quality of genome sequence data but also helps in acquiring high-quality data for biomedical applications. In this work, a Binate Filtrate Model (BFM) is employed to clean outliers and acquire high-quality data. Here, Linear Quadratic Estimation is first used to evaluate the system’s state, whereas Feynman Kac Estimation measures the system’s moments. Figure 3 illustrates the Linear Quadratic and Feynman Kac Genome filtering model.

As illustrated in Fig. 3, with the input matrix values obtained from the genomes dataset, the system’s state (i.e., the sequence) is evaluated using a recursive model which hypothesizes that the present state (i.e., the present sequence) of a system (i.e., genome sequence) depends on the state at the prior time step (i.e., the prior sequence). This is formulated as given below.

$$\:{IM}_{l}={ST}_{l}{IM}_{l-1}+\epsilon_{l}$$

(2)

In Eq. (2), the linear quadratic estimate of the input matrix ‘${IM}_{l}$’ at step ‘$l$’ is obtained from the prior state ‘${IM}_{l-1}$’ through the sequence transition ‘${ST}_{l}$’ and the process noise ‘$\epsilon_{l}$’. Then, at time ‘$l$’, an observation ‘${Obs}_{l}$’ of the prevailing state ‘${IM}_{l}$’ is formulated as given below.

$$\:{Obs}_{l}={IM}_{l}+{O\epsilon}_{l}$$

(3)
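
Equations (2) and (3) have the form of a linear state-space model, for which linear quadratic estimation (the Kalman filter) is the standard recursive estimator. The following is a minimal scalar sketch of that recursion, not the authors' implementation. The transition value `st`, the noise variances `q` and `r`, the initial values `x0` and `p0`, and the synthetic observations are all illustrative assumptions that do not come from the paper.

```python
import numpy as np

def kalman_filter_1d(observations, st=1.0, q=1e-3, r=0.1, x0=0.0, p0=1.0):
    """Scalar Kalman filter for IM_l = st * IM_{l-1} + eps_l and Obs_l = IM_l + o_eps_l.

    st, q (process-noise variance), r (observation-noise variance), x0 and p0
    are illustrative assumptions, not values taken from the paper.
    """
    x, p = x0, p0
    estimates = []
    for obs in observations:
        # Predict: propagate the previous state estimate and its variance.
        x_pred = st * x
        p_pred = st * p * st + q
        # Update: blend prediction and observation using the Kalman gain.
        gain = p_pred / (p_pred + r)
        x = x_pred + gain * (obs - x_pred)
        p = (1.0 - gain) * p_pred
        estimates.append(x)
    return np.array(estimates)

# Synthetic example: a constant hidden state observed with Gaussian noise.
rng = np.random.default_rng(0)
noisy_obs = 2.0 + rng.normal(0.0, 0.3, size=200)
filtered = kalman_filter_1d(noisy_obs)
```

After the initial transient, the filtered sequence fluctuates much less than the raw observations, which is the sense in which the state estimate relies on the whole series rather than on a single measurement.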

At any time, the probability density moments of the target gene sequence instance cannot be computed analytically; therefore, Feynman Kac Estimation is employed in our work to approximate them. The idea is to generate random samples called particles. Next, each particle is assigned a weight that reflects the discrepancy between the actual and estimated results. With the time horizon fixed at ‘$l$’ for a genome sequence of observations ‘${Obs}_{1},{Obs}_{2},\dots,{Obs}_{l}$’, the Feynman Kac Estimation of the bounded sequence function moments is mathematically formulated as given below.

$$FK\left({IM}_{l}\right)=\frac{\int FK\left({IM}_{1},{IM}_{2},\dots,{IM}_{l}\right)\left\{\prod_{k=1}^{l}Prob\left({Obs}_{k}|{IM}_{k}\right)\right\}Prob\left({IM}_{1},{IM}_{2},\dots,{IM}_{l}\right)\,d{IM}_{1}\,d{IM}_{2}\dots d{IM}_{l}}{\int \left\{\prod_{k=1}^{l}Prob\left({Obs}_{k}|{IM}_{k}\right)\right\}Prob\left({IM}_{1},{IM}_{2},\dots,{IM}_{l}\right)\,d{IM}_{1}\,d{IM}_{2}\dots d{IM}_{l}}$$

(4)

In Eq. (4), for any bounded genome sequence function ‘$FK$’ on the set of prevailing state input matrices ‘${IM}_{l}$’, the Feynman Kac Estimation from sequence ‘$1$’ to sequence ‘$l$’ is generated. Finally, for the index ‘$N$’, with the indicator function ‘${IF}_{N}$’ used to clean the outliers, the expected conditional distribution that yields the system moments is generated as given below.

$$\:FR=E\left(FK\left({IM}_{1},{IM}_{2},\dots\:{IM}_{N}\right)\right)=\frac{E\left({IM}_{1},{IM}_{2},\dots\:{IM}_{N}\right)\prod\:_{l=1}^{N}{IF}_{N}\left({IM}_{N}\right)}{E\left(\prod\:_{l=1}^{N}{IF}_{N}\left({IM}_{N}\right)\right)}$$

(5)

Finally, the corrected labels or the filtered dataset are formulated for further processing. The pseudo-code representation of the Linear Quadratic and Feynman Kac Genome filtering model is given below.

As given in the above Linear Quadratic and Feynman Kac Genome filtering algorithm, the overall functionality is split into two processes and is therefore referred to as the Binate Filtrate Model (BFM). First, with the genomes dataset as input, the system’s state (the genome state) for each patient is obtained using Linear Quadratic Estimation. Because this estimation infers unknown sequences from a series of genome sequences, its estimates tend to be more precise than those based on a single genome sequence of a patient, which helps yield accurate results. Next, the system’s moments (the genome moments) for each patient are obtained by means of Feynman Kac Estimation, which in turn removes the outliers and thereby improves the convergence speed.
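
The particle-and-weight idea described for the Feynman Kac step can be illustrated with a bootstrap particle filter. This is only a sketch under stated assumptions: the random-walk transition, the Gaussian observation likelihood, the multinomial resampling scheme, and all parameter values are hypothetical choices made for illustration, since the paper does not specify them.

```python
import numpy as np

def bootstrap_particle_filter(observations, n_particles=500,
                              process_std=0.05, obs_std=0.3, seed=0):
    """Approximate the first moment of the hidden state at each step.

    The random-walk transition, Gaussian likelihood and parameter values
    are illustrative assumptions, not taken from the paper.
    """
    rng = np.random.default_rng(seed)
    particles = rng.normal(observations[0], obs_std, size=n_particles)
    means = []
    for obs in observations:
        # Propagate each particle through the assumed random-walk transition.
        particles = particles + rng.normal(0.0, process_std, size=n_particles)
        # Weight each particle by the Gaussian likelihood of the observation.
        log_w = -0.5 * ((obs - particles) / obs_std) ** 2
        weights = np.exp(log_w - log_w.max())
        weights /= weights.sum()
        # The weighted mean approximates the first moment of the state.
        means.append(np.sum(weights * particles))
        # Resample so that particles concentrate where the weights are large.
        idx = rng.choice(n_particles, size=n_particles, p=weights)
        particles = particles[idx]
    return np.array(means)

# Synthetic example: a constant hidden state observed with Gaussian noise.
rng = np.random.default_rng(1)
obs = 1.5 + rng.normal(0.0, 0.3, size=100)
pf_means = bootstrap_particle_filter(obs)
```

Particles that disagree strongly with an observation receive negligible weight and tend to disappear at resampling, which is one way such a scheme can suppress outlying values.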

Concordance correlated polynomial interpolation based genome-wide data extraction model

Extracting medical data from clinical systems for analyzing genetic disorders, associating unique functional variants, estimating disorder subclass heritability using log features, and exploring associations between metabolite levels and genomic disorders with multi-molecular patterns is still predominantly manual and time-consuming. Machine learning models for investigating large and intricate biomedical information offer tremendous prospects for accelerating developments in genetic medicine.

To comprehend early genetic disorders and reduce time-consuming process, a Concordance Correlated Polynomial Interpolation-based Genome-wide Data Extraction model is designed in our work. The Concordance Correlation Coefficient in our work evaluates the consensus between two filtered results. This is mathematically formulated as given below.

$$\:{\rho\:}_{CC}=\frac{2\rho\:{\sigma\:}_{a}{\sigma\:}_{b}}{{\sigma\:}_{a}^{2}+{\sigma\:}_{b}^{2}+{\left({\mu\:}_{a}-{\mu\:}_{b}\right)}^{2}};where\:a,b\:\in\:FR$$

(6)

In Eq. (6), ‘$\rho$’ is the Pearson correlation coefficient between ‘$a$’ and ‘$b$’, ‘${\mu}_{a}$’ and ‘${\mu}_{b}$’ denote the means of the single sample filtered results, and ‘${\sigma}_{a}^{2}$’ and ‘${\sigma}_{b}^{2}$’ represent the variances of the single sample filtered results in deployment. In a similar manner, the consensus between ‘$N$’ filtered results is determined as given below.

$$\:{\rho\:}_{CC}\left(N\right)=\frac{2{S}_{ab}}{{S}_{a}^{2}+{S}_{b}^{2}+{\left({a}^{{\prime\:}}-{b}^{{\prime\:}}\right)}^{2}}$$

(7)

$${a}^{\prime}=\frac{1}{N}\sum_{n=1}^{N}{a}_{n};\quad {b}^{\prime}=\frac{1}{M}\sum_{m=1}^{M}{b}_{m};\quad {S}_{a}^{2}=\frac{1}{N}\sum{\left({a}_{n}-{a}^{\prime}\right)}^{2};\quad {S}_{b}^{2}=\frac{1}{M}\sum{\left({b}_{m}-{b}^{\prime}\right)}^{2}$$

(8)

$${S}_{ab}=\frac{1}{N}\sum_{n=1}^{N}\left({a}_{n}-{a}^{\prime}\right)\left({b}_{n}-{b}^{\prime}\right)$$

(9)
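
As a worked illustration of Eqs. (7) to (9), the sample concordance correlation coefficient can be computed as sketched below. The sketch assumes paired samples of equal length (that is, ‘$N=M$’) and uses the population (1/N) forms of the variance and covariance; the function name and the example vectors are hypothetical.

```python
import numpy as np

def concordance_correlation(a, b):
    """Sample concordance correlation coefficient following Eqs. (7)-(9).

    Assumes a and b are paired samples of equal length (N = M).
    """
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    if a.shape != b.shape:
        raise ValueError("a and b must be paired samples of equal length")
    mean_a, mean_b = a.mean(), b.mean()
    var_a, var_b = a.var(), b.var()                  # population (1/N) variances
    cov_ab = np.mean((a - mean_a) * (b - mean_b))    # population covariance
    return 2.0 * cov_ab / (var_a + var_b + (mean_a - mean_b) ** 2)

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
ccc_identical = concordance_correlation(x, x)        # 1.0: perfect agreement
ccc_shifted = concordance_correlation(x, x + 1.0)    # 0.8: same shape, shifted mean
```

Unlike the Pearson correlation, which equals 1 for both pairs, the concordance coefficient drops to 0.8 for the shifted copy because the squared mean difference enters the denominator.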

From the above Eqs. (7), (8), and (9), mean, variance, and covariance of the genome sequence consensus between ‘$\:N$’ filtered results are obtained. Finally, to the obtained results, an enhanced interpolation model that identifies a polynomial function to obtain top-rank features to predict genome disorder is formulated. The enhanced interpolation model that identifies a polynomial function is done using the Lagrange optimization function. Lagrange optimization function is mathematically formulated as given below.

$$\:{L}_{mn}\left(a\right)=\frac{{\rho\:}_{CC}\left(N\right)-{\rho\:}_{CC}\left({N}_{i}\right)}{{\rho\:}_{CC}\left({N}_{j}\right)-{\rho\:}_{CC}\left({N}_{i}\right)}$$

(10)

$$\:Prob\:\left({\rho\:}_{CC}\right)=\sum\:_{j=1}^{R}{FE}_{j}\left({L}_{mn}\left(a\right)\right)$$

(11)

From the results obtained via Eqs. (10) and (11), the extracted features ‘${FE}_{j}$’ identify the local maxima and minima of a function subject to consensus constraints. Table 2 lists the 24 features extracted using the Concordance Correlated Polynomial Interpolation-based Genome-wide Data Extraction model.

Table 2 Feature extracted results.

The pseudo code representation of Concordance Correlated Polynomial Interpolation-based Genome-wide Data Extraction is given below.

As given in the above algorithm, to ensure precise genome-wide data analysis with minimum convergence speed, the filtered results from the genomes dataset are first acquired as input. Following this, highly correlated coefficients between two genome sample data are obtained using the Concordance Correlation Coefficient. In this way, the precise and accurate features required for further processing are obtained. Next, the Lagrange function is applied to the highly correlated features to identify the local maxima and minima subject to equality constraints (i.e., based on the genetic disorder and disorder subclass).
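
The ratio in Eq. (10) has the shape of a single factor of a Lagrange basis polynomial. For reference, a minimal sketch of classical Lagrange polynomial interpolation is given below. How the paper maps concordance values onto interpolation nodes is not specified, so the nodes and values here are generic and purely illustrative.

```python
def lagrange_basis(x, nodes, j):
    """Value at x of the j-th Lagrange basis polynomial over the given nodes."""
    result = 1.0
    for i, node in enumerate(nodes):
        if i != j:
            # Each factor has the same form as the ratio in Eq. (10).
            result *= (x - node) / (nodes[j] - node)
    return result

def lagrange_interpolate(x, nodes, values):
    """Evaluate the interpolating polynomial through (nodes, values) at x."""
    return sum(v * lagrange_basis(x, nodes, j) for j, v in enumerate(values))

# Illustrative nodes and values: samples of 3x^2 - x + 1 at x = 0, 1, 2.
nodes = [0.0, 1.0, 2.0]
values = [1.0, 3.0, 11.0]
interp_at_1_5 = lagrange_interpolate(1.5, nodes, values)  # 6.25
```

With three nodes the interpolant is the unique quadratic through the samples, so it reproduces the generating polynomial exactly at any point.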

Support Vector Machine (SVM) and nearest centroid-based genome disorder prediction

The evolution of precision medicine in medical care has changed the traditional symptom-driven treatment procedure by permitting early disease risk prediction via refined diagnostics. It is necessary to examine comprehensive patient data against a wide range of factors to distinguish between sick and relatively healthy people in order to identify the most pertinent approach to precision medicine. Precision and genomic medicine integrated with machine learning has the potential to enhance patient healthcare. Moreover, patients with less common therapeutic responses are using genomic medicine technologies. In this work, genome disorder prediction using a Support Vector Machine (SVM) and Nearest Centroid model is presented. Figure 4 shows the structure of Support Vector Machine and Nearest Centroid-based Genome Disorder prediction.

As illustrated in the above figure, with the filtered results and extracted features provided as input, the genome disorder prediction model is designed in such a way employing Nearest Centroidas a distance measure to evaluate the support vector hyperplanes. Support vector machine strives to process the feature extracted ‘$\:FE$’ filtered results ‘$\:FR$’ (i.e., ‘$\:{FE}_{u}\left[{FR}_{y}\right]$’) onto a dataset that contains medical information before the production of a perfect interim hyperplanethat can differentiate between positive (i.e., mitochondrial inheritance disorder) and negative (i.e., multi-factorial inheritance disorder) samples.

$$\:{x}_{i}=\left[\begin{array}{cccc}{FE}_{1}{FR}_{1}&\:{FE}_{1}{FR}_{2}&\:\dots\:&\:{FE}_{1}{FR}_{v}\\\:{FE}_{2}{FR}_{1}&\:{FE}_{2}{FR}_{2}&\:\dots\:&\:{FE}_{2}{FR}_{v}\\\:\dots\:&\:\dots\:&\:\dots\:&\:\dots\:\\\:{FE}_{u}{FR}_{1}&\:{FE}_{u}{FR}_{2}&\:\dots\:&\:{FE}_{u}{FR}_{v}\end{array}\right]$$

(12)

From Eq. (12), the feature-extracted filtered results are stored in the form of a vector, from which the genome disorder is predicted via the centroid function. Then, the training data of ‘$\:u,v$’ points take the form ‘$\:\left\{\left({x}_{1},{y}_{1}\right),\:\left({x}_{2},{y}_{2}\right),\:\dots\:,\left({x}_{n},{y}_{n}\right)\right\}$’, where each ‘$\:{y}_{i}$’ is either ‘$\:-1$’ or ‘$\:+1$’, indicating the class to which the data point ‘$\:{x}_{i}$’ belongs. With the training data being linearly separable, two parallel hyperplanes are selected that separate the two classes, positive (i.e., mitochondrial inheritance disorder) and negative (i.e., multi-factorial inheritance disorder). To improve the genome disorder prediction outcomes, the proposed method employs the nearest centroid function to arrive at the distance. These hyperplanes are mathematically stated as given below.

$$\:{W}^{T}x-b=+1\:\left(\text{i.e., mitochondrial inheritance disorder}\right)$$

(13)

$$\:{W}^{T}x-b=-1\:\left(\text{i.e., multi-factorial inheritance disorder}\right)$$

(14)

The distance between the two hyperplanes given in (13) and (14) is determined using the nearest centroid function, which assigns to each sample the label of the class whose centroid is closest to that sample. The nearest centroid used to determine the distance between the two hyperplanes is mathematically formulated as given below.

$$\:{\mu\:}_{l}=\frac{1}{\left|{Sym}_{l}\right|}\sum\:_{i\in\:{Sym}_{l}}{x}_{i}$$

(15)

In Eq. (15), ‘$\:{Sym}_{l}$’ refers to the set of symptoms of samples associated with class ‘$\:l\in\:Y$’. With this, the genome disorder prediction is made in an accurate and precise manner. Finally, the predicted results are obtained as given below.

$$\:PR=\left[\frac{1}{n}\sum\:_{i=1}^{n}\max\left(0,\:1-{y}_{i}\left({W}^{T}{x}_{i}-b\right)\right)\right]+{\mu\:}_{l}$$

(16)
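The bracketed term in Eq. (16) is the standard average hinge loss of a linear SVM. The sketch below evaluates only that term on toy data; the weights, offset, and samples are hypothetical, and how the centroid term is combined with the loss is left as stated in Eq. (16).

```python
def decision_value(w, x, b):
    """Signed score W^T x - b, as used in Eqs. (13) and (14)."""
    return sum(wi * xi for wi, xi in zip(w, x)) - b


def mean_hinge_loss(w, b, samples, labels):
    """Average of max(0, 1 - y_i * (W^T x_i - b)) over all samples."""
    losses = [
        max(0.0, 1.0 - y * decision_value(w, x, b))
        for x, y in zip(samples, labels)
    ]
    return sum(losses) / len(losses)


# Hypothetical weights and toy data, chosen to keep the arithmetic easy to follow.
w, b = [1.0, 0.0], 0.0
samples = [[2.0, 0.0], [0.5, 0.0], [-3.0, 0.0]]
labels = [+1, +1, -1]
loss = mean_hinge_loss(w, b, samples, labels)
```

Only the second sample lies inside the margin (its score is 0.5), so it is the only one that contributes a non-zero loss.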

The pseudo code representation of Support Vector and Nearest Centroid-based Genome Disorder prediction is given below.

As given in the above algorithm, with the filtered results and extracted features acquired as input, a distance factor is introduced into the conventional Support Vector Machine with the objective of improving the correct prediction of abnormal genome disorder and reducing the incorrect prediction of normal genome disorder. The distance factor is computed using the Nearest Centroid function, which, with the aid of the per-class centroid, improves the detection rate significantly.
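The per-class centroid of Eq. (15) and the nearest-centroid assignment can be sketched as follows. This is a minimal illustration, assuming Euclidean distance (the text does not name the metric) and hypothetical two-feature toy data; it is not the authors' implementation.

```python
from math import dist


def class_centroids(samples, labels):
    """Return {label: centroid}, each centroid being the per-feature mean of its class."""
    grouped = {}
    for x, label in zip(samples, labels):
        grouped.setdefault(label, []).append(x)
    return {
        label: [sum(column) / len(rows) for column in zip(*rows)]
        for label, rows in grouped.items()
    }


def nearest_centroid_label(x, centroids):
    """Assign x the label of the closest centroid (Euclidean distance, an assumption)."""
    return min(centroids, key=lambda label: dist(x, centroids[label]))


# Hypothetical toy data: +1 and -1 stand for the two disorder classes.
samples = [[1.0, 1.0], [2.0, 1.0], [8.0, 9.0], [9.0, 9.0]]
labels = [+1, +1, -1, -1]
centroids = class_centroids(samples, labels)
predicted = nearest_centroid_label([2.0, 2.0], centroids)
```

The query point `[2.0, 2.0]` is assigned to class `+1` because it lies much closer to that class's centroid than to the other.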

Experimental results

The results of the simulations carried out to validate the proposed genome disorder prediction method, called Quadratic Feynman Polynomial Interpolated and Vector Nearest Centroid-based (QFPI-VNC) genome disorder prediction, are described in detail in this section. The experiment uses the genomes dataset. The dataset consists of test files, training files, and sample submission files, each with a distinct number of instances. The proposed QFPI-VNC method has been implemented in Python. The simulation results from the proposed QFPI-VNC method and the state-of-the-art methods, Blast Local Assignment Search Tools (BLAST)¹, Driver Oriented Genomic Analysis (DrOGA)² and Advance Genome Disorder Prediction Model (AGDPM)²¹, are detailed below in terms of several prediction parameters: convergence speed, detection rate, sensitivity, specificity, accuracy, and F1-score. To conduct a fair comparison, the genomes dataset is applied to the proposed QFPI-VNC method and to BLAST¹, DrOGA² and AGDPM²¹, with results averaged over 10 different simulation runs.

Case scenario 1: performance analysis of convergence speed

The first parameter in the analysis of genome disorder detection for biomedical applications is the rate of convergence, or convergence speed. The faster the convergence, the earlier the genome disorder is detected, and treatment can accordingly be given to avoid uncertainty. The convergence speed in our work is mathematically formulated as given below.

$$\:CS=\sum\:_{i=1}^{m}{S}_{i}*\text{Time}\left(PR\right)$$

(17)

In Eq. (17), the convergence speed ‘$\:CS$’ is measured based on the medical information of individuals who have genetic disorders ‘$\:{S}_{i}$’ and the actual time consumed in the prediction process ‘$\:Time\left(PR\right)$’. It is measured in milliseconds (ms). Table 3 provides a numeric summary of the performance measures used to determine the convergence speed for the four methods, QFPI-VNC, BLAST¹, DrOGA² and AGDPM²¹. From the comparison, it can be seen that the proposed QFPI-VNC performs better in terms of convergence speed than BLAST¹, DrOGA² and AGDPM²¹.

Table 3 Comparison of convergence speed using QFPI-VNC, BLAST¹, DrOGA² and AGDPM²¹.
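One possible reading of Eq. (17) is sketched below, assuming that each ‘S_i’ is the number of samples in the i-th batch and that ‘Time(PR)’ is a fixed per-sample prediction time in milliseconds. Both the interpretation and the numbers are assumptions made for illustration only.

```python
def convergence_speed(sample_counts, time_per_sample_ms):
    """Sum of S_i * Time(PR) over all batches, in milliseconds (assumed reading of Eq. 17)."""
    return sum(count * time_per_sample_ms for count in sample_counts)


# Hypothetical batches of 1000 and 2000 samples at 0.5 ms per prediction.
total_ms = convergence_speed([1000, 2000], 0.5)
```

Under this reading the measure grows linearly with the number of samples, which matches the proportional trend described for the convergence speed results.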

Figure 5 illustrates the convergence speed of the four methods, QFPI-VNC, BLAST¹, DrOGA² and AGDPM²¹, comparing the proposed QFPI-VNC method with the state-of-the-art methods¹,² and ²¹. In the figure, the x-axis represents the sample instances involved in the genome disorder detection process and the y-axis represents the convergence speed. The convergence speed is found to be directly proportional to the number of samples involved. In other words, increasing the sample size increases the amount of training data and therefore the number of filtered results, which in turn raises the convergence speed. One of the objectives of the study is to minimize the convergence speed, which was considered the major disadvantage of¹. Compared to the existing methods¹,² and ²¹, the convergence speed of the proposed QFPI-VNC method is reduced by differentiating between the system’s state and the system’s moments via the Binate Filtrate Model (BFM). In this BFM model, Linear Quadratic Estimation is used to measure the system’s state, and Feynman Kac Estimation is used to evaluate the system’s moments, which in turn yields a small number of characteristics, or filtered results. As can be inferred from the figure, the proposed QFPI-VNC method reduces the convergence speed by 18% compared to¹, 29% compared to² and 40% compared to²¹.

Case scenario 2: performance analysis of genome disorder detection rate

The second significant parameter in genome disorder detection is the detection rate. This is because improper detection could even lead to mortality. Hence, utmost care should be taken while analyzing genetic disorders. The genome disorder detection rate in our work is mathematically formulated as given below.

$$\:GDDR=\sum\:_{i=1}^{m}\frac{TP+TN}{{S}_{i}}*100$$

(18)

In Eq. (18), the genome disorder detection rate ‘$\:GDDR$’ is measured taking into consideration the true positives ‘$\:TP$’ (i.e., correctly predicted abnormal genome disorder), the true negatives ‘$\:TN$’ (i.e., correctly predicted normal genome disorder) and the samples involved in the simulation procedure ‘$\:{S}_{i}$’. Table 4 lists the numeric summary of the performance measures used to determine the genome disorder detection rate for the four methods, QFPI-VNC, BLAST¹, DrOGA² and AGDPM²¹. From the comparison, it can be inferred that the proposed QFPI-VNC performs better in terms of genome disorder detection rate than BLAST¹, DrOGA² and AGDPM²¹.

Table 4 Comparison of genome disorder detection rate using QFPI-VNC, BLAST¹, DrOGA² and AGDPM²¹.

In Fig. 6, the genome disorder detection rate of the proposed method is compared with the state-of-the-art methods¹,² and ²¹. According to the study, the proposed QFPI-VNC achieves the highest detection rate, which is attributed to feature extraction via the Concordance Correlated Polynomial Interpolation-based Genome-wide Data Extraction model. The mapper and reducer applied in¹ achieve a lower detection rate than the QFPI-VNC method. Explainable Artificial Intelligence in² operates effectively for a given iteration; nevertheless, its accuracy decreases as the sample size increases. In simulations performed with 2000 samples, the true positives were 1930, 1920, and 1905, whereas the true negatives were 40, 35, and 30, using QFPI-VNC and the methods in¹,² and ²¹, respectively. With this, the overall genome disorder detection rate using the three methods was observed to be 98.5%, 97.75%, and 96.75%. The improvement arises because only highly correlated features between two genome sample data were obtained by utilizing the Lagrange function. As a result, the genome disorder detection rate using the QFPI-VNC method was improved by 8% compared to¹, 14% compared to² and 21% compared to²¹.
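The detection rates quoted above can be reproduced from the reported counts with the summand of Eq. (18). The short sketch below does this arithmetic; the function name `detection_rate` is hypothetical, while the counts are those given in the text for 2000 samples.

```python
def detection_rate(true_positives, true_negatives, total_samples):
    """Percentage of samples classified correctly, per the summand of Eq. (18)."""
    return 100 * (true_positives + true_negatives) / total_samples


# (TP, TN) pairs reported in the text for 2000 samples, in the order given.
reported_counts = [(1930, 40), (1920, 35), (1905, 30)]
rates = [detection_rate(tp, tn, 2000) for tp, tn in reported_counts]
```

The resulting values are 98.5, 97.75 and 96.75, which agree with the percentages stated in the text.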

Case scenario 3: performance analysis of sensitivity and specificity

In this section, the sensitivity and specificity of the genome disorder detection process are analyzed. The formulas for sensitivity and specificity are given below.

$$\:Sen=\frac{TP}{TP+FN}*100$$

(19)

$$\:Spe=\frac{TN}{TN+FP}*100$$

(20)

In Eq. (19), the sensitivity ‘$\:Sen$’ is measured based on the true positives ‘$\:TP$’ (i.e., correctly predicted abnormal genome disorder) and the false negatives ‘$\:FN$’ (i.e., abnormal genome disorder predicted incorrectly). The specificity ‘$\:Spe$’ in Eq. (20) is measured by taking into consideration the true negatives ‘$\:TN$’ (i.e., correctly predicted normal genome disorder) and the false positives ‘$\:FP$’ (i.e., normal genome disorder predicted as abnormal). Table 5 provides the sensitivity and specificity results for genome disorder detection using the four methods, QFPI-VNC, BLAST¹, DrOGA² and AGDPM²¹. From the comparison, it is found that the proposed QFPI-VNC achieves better sensitivity and specificity than BLAST¹, DrOGA² and AGDPM²¹.

Table 5 Comparison of sensitivity and specificity using QFPI-VNC, BLAST¹, DrOGA² and AGDPM²¹.

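Eqs. (19) and (20) translate directly into code. The sketch below uses hypothetical confusion-matrix counts, not values from the paper, to show how the two rates are obtained.

```python
def sensitivity(tp, fn):
    """Eq. (19): share of actual positives that are detected, as a percentage."""
    return 100 * tp / (tp + fn)


def specificity(tn, fp):
    """Eq. (20): share of actual negatives that are correctly rejected, as a percentage."""
    return 100 * tn / (tn + fp)


# Hypothetical counts for illustration only.
tp, fn, tn, fp = 90, 10, 80, 20
sen = sensitivity(tp, fn)   # 90.0
spe = specificity(tn, fp)   # 80.0
```

Note that each rate is normalized by its own class total (actual positives for sensitivity, actual negatives for specificity), not by the overall sample count.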
Fig. 7(a) and 7(b) show the sensitivity and specificity of the proposed QFPI-VNC and the three existing methods¹,² and ²¹. From the figure, both the sensitivity and the specificity of the QFPI-VNC method were found to be better than those of¹,² and ²¹. In other words, with simulations performed for 2000 samples and the actual positives being 1970, 1960 samples were correctly detected using QFPI-VNC, whereas only 1952 were correctly detected using¹ and 1935 using². As a result, the overall sensitivity using the three methods was observed to be 98%, 97.6%, and 96.75%, respectively.

Similarly, with the actual negatives being 1940, 1930 samples were correctly predicted as normal genome disorder using the QFPI-VNC method, 1920 samples using¹ and 1905 samples using². With this, the specificity was found to be 96.5% using the QFPI-VNC method, 96% using¹ and 95.25% using², respectively. The improvement in both sensitivity and specificity is attributed to the Support Vector and Nearest Centroid-based Genome Disorder prediction algorithm. In this algorithm, the nearest centroid function is applied to identify the distance between the two hyperplanes in the support vector machine. Also, with the nearest centroid, or mean, each data point is classified into the disorder type whose mean is closest to it, which improves the sensitivity of the QFPI-VNC method by 5%, 18% and 20% compared to¹,² and ²¹. Similarly, the specificity of the QFPI-VNC method was improved by 8%, 12% and 15% compared to¹,² and ²¹, respectively.

Case scenario 4: performance analysis of Accuracy

Accuracy is measured as the proportion of samples whose genome disorder status is correctly predicted. It is formulated as,

$$\:Acc=\:(TP+TN)/(TP+TN+FP+FN)$$

(20)

In the above equation, ‘$\:Acc$’ denotes the accuracy, ‘$\:TP$’ and ‘$\:TN$’ denote the true positives and true negatives, and ‘$\:FP$’ and ‘$\:FN$’ denote the false positives and false negatives. Accuracy is measured as a percentage (%).

Figure 8 and Table 6 present the accuracy of the proposed QFPI-VNC and the three existing methods¹,² and ²¹. From the figure, the accuracy of the QFPI-VNC method was found to be better than that of¹,² and ²¹. The accuracy of the QFPI-VNC method was improved by 7%, 12% and 14% compared to¹,² and ²¹, respectively.
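The accuracy formula above can be sketched as a small function. The counts are hypothetical, and the factor of 100 is added here only because the text states that accuracy is reported as a percentage.

```python
def accuracy(tp, tn, fp, fn):
    """(TP + TN) / (TP + TN + FP + FN), scaled to a percentage."""
    return 100 * (tp + tn) / (tp + tn + fp + fn)


# Hypothetical counts for illustration only.
acc = accuracy(tp=90, tn=80, fp=20, fn=10)   # 85.0
```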

Table 6 Comparison of accuracy using QFPI-VNC, BLAST¹, DrOGA² and AGDPM²¹.

Case scenario 5: performance analysis of F1-score

The ‘$\:F1\text{-}score$’ is the harmonic mean of the ‘$\:Precision$’ and ‘$\:Recall$’ scores, evaluated as given below.

$$\:F1\text{-}score=2*\left(Precision*Recall\right)/\left(Precision+Recall\right)$$

Figure 9 and Table 7 present the F1-score of the proposed QFPI-VNC and the three existing methods¹,² and ²¹. From the figure, the F1-score of the QFPI-VNC method was found to be better than that of¹,² and ²¹. The F1-score of the QFPI-VNC method was improved by 4%, 3% and 2% compared to¹,² and ²¹, respectively.
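The F1-score formula above can be sketched in code as follows, using the standard definitions of precision and recall, which the text does not spell out. The counts are hypothetical.

```python
def precision(tp, fp):
    """Share of predicted positives that are truly positive."""
    return tp / (tp + fp) if (tp + fp) else 0.0


def recall(tp, fn):
    """Share of actual positives that are detected."""
    return tp / (tp + fn) if (tp + fn) else 0.0


def f1_score(tp, fp, fn):
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    p, r = precision(tp, fp), recall(tp, fn)
    if p + r == 0:
        return 0.0
    return 2 * (p * r) / (p + r)


# Hypothetical counts for illustration only.
f1 = f1_score(tp=90, fp=20, fn=10)   # about 0.857
```

Because the harmonic mean is dominated by the smaller of its two inputs, a model cannot reach a high F1-score by maximizing only precision or only recall.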

Table 7 Comparison of F1-score using QFPI-VNC, BLAST¹, DrOGA² and AGDPM²¹.

Statistical test/analysis

Statistical analysis is the procedure of gathering and examining large volumes of data to find trends and gain valuable insights. In our work, the statistical test for genome disorder prediction is the McNemar test (Table 8), a non-parametric test for paired nominal data. To evaluate the McNemar test, the genome sequencing data are placed into a 2 × 2 contingency table.

Table 8 Tabulation for McNemar test for the proposed QFPI-VNC method.

In Table 8, cells b and c are used to compute the McNemar test statistic, as given below.

$$\:{\chi\:}^{2}=\frac{{\left(b-c\right)}^{2}}{b+c}$$

(21)
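The McNemar statistic above depends only on the two discordant cells of the contingency table. The sketch below computes it and, as an addition not described in the paper, the corresponding p-value from a chi-square distribution with one degree of freedom; the cell counts are hypothetical.

```python
from math import erfc, sqrt


def mcnemar_statistic(b, c):
    """(b - c)^2 / (b + c) for the discordant cells b and c (no continuity correction)."""
    if b + c == 0:
        raise ValueError("no discordant pairs; the statistic is undefined")
    return (b - c) ** 2 / (b + c)


def chi2_p_value_1dof(statistic):
    """Upper-tail probability of a chi-square variable with one degree of freedom."""
    return erfc(sqrt(statistic / 2))


# Hypothetical discordant counts for illustration only.
chi2 = mcnemar_statistic(b=15, c=5)   # 5.0
p_value = chi2_p_value_1dof(chi2)     # about 0.025
```

With these counts the p-value falls below 0.05, so the two paired classifiers would be judged to differ at the conventional 5% level.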

Table 9 presents the outcomes of the McNemar test for genome disorder detection using the four methods, QFPI-VNC, BLAST¹, DrOGA² and AGDPM²¹. From the comparison, it is found that the proposed QFPI-VNC achieves better McNemar test results than BLAST¹, DrOGA² and AGDPM²¹.

Table 9 Comparison of McNemar test using QFPI-VNC, BLAST¹, DrOGA² and AGDPM²¹.

Figure 10 shows the McNemar test (M-test) results for the proposed QFPI-VNC, BLAST¹, DrOGA² and AGDPM²¹. The statistical test is conducted on sample sizes between 2000 and 20,000. The improvement is attributed to first obtaining highly correlated features between two genome sample data with the aid of the Lagrange function. The M-test result of QFPI-VNC is better than those of the existing methods by 14%, 21% and 9%, respectively.

Conclusion

In this study, a Quadratic Feynman Polynomial Interpolated and Vector Nearest Centroid-based (QFPI-VNC) method is proposed for predicting genome disorders in humans. Undesirable outliers are eliminated in the filtering stage with the Binate Filtrate Model to acquire high-quality data, and the feature input is then downsampled to reduce the convergence speed of subsequent processing. To precisely capture the optimal genome-wide features of the filtered results, a new Concordance Correlated Polynomial Interpolation-based Genome-wide Data Extraction algorithm is developed. Here, features corresponding to the local maxima and minima of a function subject to consensus constraints are extracted with a polynomial interpolation function. Additionally, the enhanced interpolation model helps to extract the pertinent features and to avoid misdiagnosis. Finally, the pertinent features are fused and classified with the Support Vector and Nearest Centroid-based Genome Disorder prediction model. The genome dataset was utilized for the experimental assessment, and the results were compared with contemporary state-of-the-art methods. Overall, the proposed model performs better in terms of sensitivity, specificity, convergence speed, and genome disorder detection rate. A limitation of the proposed model is that it does not include a preprocessing method for removing noise and imputing missing data. In the future, the proposed method will be extended with a preprocessing method to handle missing data. To obtain more accurate prediction results, this research can also be expanded to include more genetic disorders and more than one prediction model.

Data availability

The datasets are publicly available. The datasets generated and/or analysed during the current work are available at: https://www.kaggle.com/datasets/aryarishabh/of-genomes-and-genetics-hackerearth-ml-challenge.

References

Lilhore, E. M. O. U. K. et al. Evaluation of IoT-Enabled hybrid model for genome sequence analysis of patients in healthcare 4.0, Measurement: Sensors, Elsevier, Jan 2023, Volume 36, Pages 1–7. (Blast Local Assignment Search Tools [BLAST]).
Bastico, M., Fernandez-Garcia, A., Belmonte-Hernandez, A. & Mayoral, S. U. DrOGA: An Artificial Intelligence Solution for Driver-Status Prediction of Genomics Mutations in Precision Cancer Medicine, IEEE Access, Apr 10, 37378–37391. (2023). (Driver Oriented Genomic Analysis [DrOGA]).
Mazin Alshamrani. IoT and artificial intelligence implementations for remote healthcare monitoring systems: A survey, Journal of King Saud University – Computer and Information Sciences, Elsevier, Jun 2021, Volume 34, Issue 8, Pages 4687–4701.
Chensi Cao, F. et al. Deep Learning and Its Applications in Biomedicine, Genomics Proteomics Bioinformatics, Elsevier, Mar Volume 16, Issue 1, Pages 17–32. (2018).
Piyush Gupta, A. V. et al. Prediction of Health Monitoring with deep Learning Using edge Computing, Volume 25, Pages 1–8 (Sensors, Elsevier, Jan 2023).
Yogesh, H., Bhosale & Sridhar Patnaik, K. Feb, Bio-medical imaging (X-ray, CT, ultrasound, ECG), genome sequences applications of deep neural network and machine learning in Diagnosis, Detection, Classification, and Segmentation of COVID-19: A Meta-analysis & Systematic Review, Multimedia Tools and Applications, Springer, Volume 82, 39157–39210. (2023).
Aledhari, M. et al. Biomedical IoT: Enabling Technologies, Architectural Elements, Challenges, and Future Directions, Mar Volume 10, Pages 31306–31339. (2022).
Kashef, R. Enhancing the Role of Large-Scale Recommendation Systems in the IoT Context, IEEE Access, Sep 8, Pages 178248–178257. (2020).
Farman Ali, P. et al. Type-2 Fuzzy Ontology-aided Recommendation Systems for IoT-based Healthcare, Computer Communications, Pages 138–155 (Elsevier, Oct 2017).
Essam, H., Houssein, R. E., Mohamed, A. A. & Ali. Machine Learning Techniques for Biomedical Natural Language Processing: A Comprehensive Review, IEEE Access, Oct 9, Pages 140628–140653. (2021).
Sahar Ajmal, M., Awais, Khaldoon, S., Khurshid, M. S., Abdelrahman, A. & Peer, J. Data mining-based recommendation system using social networks_an analytical study, Feb 9. (2023).
Dileep, V. V. S. & Gummadi, N. R. R. DNA sequencing using machine learning and deep learning algorithms. Int. J. Innovative Technol. Exploring Eng. (IJITEE). ISSN (Online), 2278–3075 (September 2022).
Richa Sharma, Shalli Rani & Stephen Jeswinde Nuagh. RecIoT: A Deep Insight into IoT-Based Smart Recommender Systems, Wireless Communications and Mobile Computing, Elsevier, Jun 2022, Volume Pages 1–15. (2022).
Hanan Ahmed, S., Hamad, H. A., Shedeed, S., Hussein, I. E. E. E. & Access Sep Volume 10, Pages 106050–106058. (2022).
Divya Upadhyay, P., Garg, S. M., Aldossary, J., Shafi & Kumar, S. A Linear Quadratic Regression-Based Synchronised Health Monitoring System (SHMS) for IoT Applications, Electronics, Oct 2023, Volume 12, Issue 2, Pages 1–16.
Huang, K., Xiao, C., Glass, L. M., Critchlow, C. W., Greg Gibson & Jimeng Sun. Machine Learning Applications for Therapeutic Tasks with Genomics data, Pages 1–10 (Patterns, Cell, Oct 2021).
Wardah, S., Alharbi & Rashid, M. A Review of deep Learning Applications in Human Genomics Using next–generation sequencing data, Pages 1–20 (Human Genomics, 2022).
Ashwin, A., Phatak, FranzGeorg Wieland, K., Vempala, F., Volkmar & Memmert, D. Nov, Artificial Intelligence Based Body SensorNetwork Framework—Narrative Review:proposing an end–to–end Framework usingWearable Sensors, Real–Time LocationSystems and Artificial Intelligence/MachineLearning Algorithms for Data Collection, DataMining and Knowledge Discovery in Sportsand Healthcare, Sports Medicine, Springer, Volume 7, 1–15. (2021).
Simone Aiassa, P. M. et al. Smart Portable Pen for continuous monitoring of Anaesthetics in human serum With Machine Learning. IEEE Trans. Biomed. Circuits Syst. 15 (Issue 2), 294–302 (Apr 2021).
Jianfeng Wang, B. et al. HaitaoLv, Lin Hei, Multiple Genetic Syndromes Recognition Based on a Deep Learning Framework and Cross-Loss Training, IEEE Engineering in Medicine and Biology Society Section, Nov Pages 1–9. (2022).
Atta-Ur-Rahman, M. U. et al. Advance Genome Disorder Prediction Model Empowered With Deep Learning, Jul 10, Pages 70317–70328. (2022).
Alatrany, A. S., Khan, W., Hussain, A. & Al-Jumeily, D. Wide and deep learning based approaches for classification of Alzheimer’s disease using genome-wide association studies. PLOS ONE 18 (Issue 5), 1–21 (May 2023).
Yasir Ali, M. et al. Amel Ksibi, IDriveGenes: Cancer Driver Genes Prediction Using Machine Learning, Mar Pages 1–1. (2023).
Quazi, S. Artificial Intelligence and Machine Learning in Precision and genomic Medicine, Medical Oncology, Volume 39, Issue 120, Pages 1–18 (Springer, Jun 2022).
Steven J. Schrodi, Shubhabrata Mukherjee, Y. S., Gerard Tromp, J. J. S., Callear, A. P. & Zhan Ye, T. C. C. Murray H. Brilliant, Paul K. Crane, Diane T. Smelser, Robert C. Elston and Daniel E. Weeks, Genetic-based prediction of disease traits: prediction is very difficult, especially about the future. Front. Genet. Jun. 5 (Issue 162), 1–19 (2014).
TaherM, Ghazal, H. A., Hamadi, M. U., Nasir, Atta-ur-Rahman, M. & Gollapalli Muhammad Zubair, Muhammad Adnan Khan, and Chan Yeob Yeun, Supervised Machine Learning Empowered Multifactorial Genetic Inheritance Disorder Prediction, Computational Intelligence and Neuroscience, May Volume 2022, Pages 1–10. (2022).
Raza, A. & Rustam, F. Hafeez Ur Rehman Siddiqui, Isabel de la Torre Diez, Begoña Garcia-Zapirain, Ernesto Lee and Imran Ashraf, Predicting Genetic Disorder and Types of Disorder Using Chain Classifier Approach, Genes, MDPI, Dec 2022, Volume 14, Issue 1, Pages 1–71.
Atta-ur Rahman, M. U., Nasir, M., Gollapalli, S. A., Alsaif, A. S. & Almadhor Shahid Mehmood, Muhammad Adnan Khan, and Amir Mosavi, IoMT-Based Mitochondrial and Multifactorial Genetic Inheritance Disorder Prediction Using Machine Learning, Computational Intelligence and Neuroscience, Hindawi, Jun Pages 1–8. (2022).
Ashraf Abou Tabl, Abedalrhman Alkhateeb, ElMaragh, W., Rueda, L. & Ngom, A. A Machine Learning Approach for identifying Gene biomarkers guiding treatment of breast Cancer. Front. Genet. 10, 1–13 (May 2019).
Qi Dai, C. et al. MTGIpick allows robust identification of genomic islands from a single genome. Brief. Bioinform. 1 (19, Issue 3), 361–373 (2018 May).
Kong, R. et al. 2SigFinder: the combined use of small-scale and large-scale statistical testing for genomic island detection from a single genome. BMC Bioinform. 21 (Issue 159), 1–15 (April 2020).
Charles, J., Sherr, I. & Roberts, J. M. May, Inhibitors of mammalian G1 cyclin-dependent kinases, genes dev, 9, Issue 10, Pages 1149–1163 (2024).
Siqian Yang, Y., Wang, Y., Chen & Dai, Q. March, MASQC: Next generation sequencing assists third generation sequencing for Quality Control in N6-Methyladenine DNA identification, 11, Pages 1–10. (2020).
Zhenyu Yang, W. et al. HPVMD-C: a disease-based mutation database of human papillomavirus in China, Database, March Volume 2022, Pages 1–8 (2022).
Wang, Y., Xu, Y., Yang, Z., Liu, X. & Dai, Q. Using Recursive Feature Selection with Random Forest to Improve Protein Structural Class Prediction for Low-Similarity Sequences, Computational and Mathematical Methods in Medicine, May Volume 2021, Pages 1–9. (2021).

Acknowledgements

This research was financially supported by Princess Nourah bint Abdulrahman University Researchers Supporting Project number (PNURSP2024R393), Princess Nourah bint Abdulrahman University, Riyadh, Saudi Arabia. The authors extend their appreciation to the Deanship of Research and Graduate Studies at King Khalid University for funding this work through Large Research Project under grant number RGP2/549/45.

Funding

This research was financially supported by Princess Nourah bint Abdulrahman University Researchers Supporting Project number (PNURSP2024R393), Princess Nourah bint Abdulrahman University, Riyadh, Saudi Arabia. The authors extend their appreciation to the Deanship of Research and Graduate Studies at King Khalid University for funding this work through Large Research Project under grant number RGP2/549/45.

Author information

Authors and Affiliations

Department of AI, ASET, Amity University, Noida, UP, India
Sofia Singh
Department of CSE, Amity University, Mumbai, India
Garima Shukla
Department of Data Science, IoT, Cybersecurity (DIC), G H Raisoni College of Engineering Nagpur, Nagpur, Maharashtra, India
Rahul Agrawal & Chetan Dhule
Department of Medical Education, College of Medicine, Princess Nourah bint Abdulrahman University, P.O.Box 84428, Riyadh, 11671, Saudi Arabia
Sarah Allabun & Manal Othman
Radiological Sciences Department, College of Applied Medical Sciences, King Khalid University, Abha, 61421, Saudi Arabia
Mohammed S. Alqahtani
BioImaging Unit, Space Research Centre, University of Leicester, Michael Atiyah Building, Leicester, LE1 7RH, UK
Mohammed S. Alqahtani
Electrical Engineering Department, College of Engineering, King Khalid University, Abha, 61421, Saudi Arabia
Mohamed Abbas
PRINCE Laboratory Research, ISITcom, Hammam Sousse, University of Sousse, Sousse, Tunisia
Ben Othman Soufiene

Contributions

All authors contributed equally to the conceptualization, formal analysis, investigation, methodology, and writing and editing of the original draft. All authors have read and agreed to the published version of the manuscript.

Corresponding author

Correspondence to Ben Othman Soufiene.

Ethics declarations

Competing interests

The authors declare no competing interests.

Additional information

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.

About this article

Cite this article

Singh, S., Shukla, G., Agrawal, R. et al. Enhancing genomic disorder prediction through Feynman Concordance and Interpolated Nearest Centroid techniques. Sci Rep 14, 27653 (2024). https://doi.org/10.1038/s41598-024-72923-w

Received: 26 May 2024
Accepted: 11 September 2024
Published: 12 November 2024
DOI: https://doi.org/10.1038/s41598-024-72923-w
