Introduction

Obstructive sleep apnea

Obstructive sleep apnea (OSA) is characterized by recurrent closure of the upper airway during sleep, resulting in decreased oxyhemoglobin saturation and fragmented sleep patterns. Patients typically exhibit symptoms such as snoring, witnessed apnea, morning headache, and daytime sleepiness1,2. OSA leads to depression, cognitive impairment, and increased risk of cardiovascular, metabolic, and other diseases3,4,5. Globally, it is estimated that over 425 million adults suffer from moderate-to-severe OSA, with the prevalence of moderate-to-severe OSA ranging from 6 to 17% in the general adult population and up to 49% among the elderly6,7.

Overnight polysomnography (PSG) is the gold standard for the diagnosis of OSA, which records multiple data channels including electroencephalogram (EEG), electrocardiogram (ECG), and electromyogram (EMG)8,9. PSG is complex and costly, which may make it difficult to scale up for large numbers of potential patients. However, undiagnosed OSA can lead to severe medical and economic consequences10. A previous study showed that moderate-to-severe OSA is highly associated with adverse health events such as hypertension, depression, and all-cause mortality11.

Obstructive sleep respiratory events impair physical restoration during nocturnal sleep. Individuals with OSA may experience excessive daytime sleepiness, compromised vigilance, and a diminished capacity to carry out daily tasks. This state may ultimately lead to an elevated risk of complications, traffic accidents, and workplace accidents, thereby placing a considerable financial burden on both the healthcare system and the economy. Therefore, the development of a simple tool to identify moderate-to-severe OSA is imperative.

Machine learning

Machine learning (ML), the process of using observed data to build statistical models to predict outcomes, holds great promise in medical research12 and has been applied in predicting OSA. ML algorithms can continually refine their predictions based on new data, thereby enhancing accuracy and predictive efficacy over time.

Related works

Features derived from electrophysiological data, such as heart rate variability, oxygen saturation and ECG-derived respiration (EDR) have been used for OSA prediction. For ML models, previous studies have applied professional questionnaires, sociodemographic information, clinical symptoms, physical examination and laboratory indicators to predict OSA for screening and further grading of suspected OSA patients.

BMI, age, and sex were the earliest and most commonly applied clinical factors, followed by neck circumference, snoring, the Epworth Sleepiness Scale, witnessed apneas, waist circumference, etc.1,13,14,15. Heart rate has also been correlated with the severity of OSA16. Age-related factors such as increased fat deposition in the parapharyngeal region, elongation of the soft palate, and changes in peripharyngeal body structure contribute to an increased prevalence of OSA17. Differences in body fat distribution, differences in upper airway anatomy, and pathophysiological differences such as hormonal status lead to greater susceptibility to OSA in men than in women18,19. Obesity exacerbates OSA by narrowing the airway lumen and increasing upper airway collapsibility19. Sympathetic nerve activity appears to be increased during sleep in OSA patients20,21, leading to increased heart rates.

Frequently used machine learning approaches include logistic regression, linear regression, support vector machines, neural networks, decision trees, etc. For instance, Kim et al. applied logistic regression (LR), support vector machine (SVM), random forest (RF), and XGBoost (XGB) using features such as hypertension, waist circumference, subnasale-to-stomion length, and the Fatigue Severity Scale total score to develop predictive models for OSA22. Other researchers have chosen age, gender, BMI, head circumference, neck circumference, dry throat, and witnessed apnea as features to develop prediction models23,24. These studies demonstrate the feasibility of machine learning approaches in OSA screening and provide several clinical tools for early diagnosis and management of OSA. However, the predictive features selected by different researchers vary widely, and the majority of the predictive models have not been translated into clinical applications. Most studies lack external validation, and their diagnostic accuracy remains insufficient. Challenges including the complexity of selecting and acquiring predictive features, the limited availability of PSG data, imbalanced datasets, and the interpretability of models have constrained the development of ML models for OSA prediction. Therefore, a simple, effective, and low-cost OSA prediction model and application platform are still needed to meet clinical needs.

The objective of the study

This study aims to provide a model for screening patients with moderate-to-severe OSA using only a small number of easily available, low-cost, and generalisable variables. The objective is to improve awareness, treatment, and control rates of OSA in the population. Accordingly, we used six machine learning methods to predict moderate-to-severe OSA: LR, artificial neural network (ANN), naïve Bayes (NB), SVM, RF, and decision tree (DT). All models were based on age, sex, BMI, and mean heart rate during sleep. The aim was to derive the best-fitting model for predicting moderate-to-severe OSA and to apply it to the cloud-based mobile sleep medicine management platform at the Sleep Medicine Center of the Second Affiliated Hospital of Fujian Medical University for practical clinical use.

Materials and methods

Data set

This was a retrospective study using the clinical records from patients who underwent PSG at the Sleep Medicine Center of the Second Affiliated Hospital of Fujian Medical University from January 2021 to December 2022.

Participants were included if they (1) were older than 18 years of age and (2) had undergone overnight PSG testing. Participants were excluded if they (1) had received any treatment related to OSA, (2) had undergone continuous positive airway pressure (CPAP) treatment, (3) had a total PSG recording time of less than 4 h, or (4) had a nighttime sleep duration of less than 3 h.

Data collection was facilitated through our self-developed cloud-based mobile sleep medicine management platform, which enabled real-time data collection, including basic information and physical examination data on admission, PSG test data, follow-up and scientific research questionnaire data, etc. The platform is regularly upgraded and maintained, and is connected to the hospital intranet without external access, thus ensuring access stability, data compatibility and security.

In addition, we strengthened privacy protection in three ways. (1) Data desensitization: user-submitted data were used solely for model inference and were not retained. (2) Differential privacy: noise was added during model training so that the influence of any single record on the output stays below a fixed threshold. A third party therefore cannot infer from a change in the output whether a single record has been changed, added, or deleted. This is considered the highest level of security among perturbation-based privacy protection methods. (3) Privacy protection agreement within the system: before using our model to make predictions, users are presented with an informed consent form that makes clear what data are being collected and for what purpose.

Polysomnography

All participants underwent an overnight PSG (Compumedics Grael, Compumedics SOMTEPSG, SOMNO screen plus and NOX T3). Sleep signals were recorded by qualified technicians.

The apnea–hypopnea index (AHI) was computed for each participant based on the total number of apnea and hypopnea events divided by the total sleep time. All sleep apnea and respiratory events were scored following the updated 2007 American Academy of Sleep Medicine (AASM) scoring rules25. Apnea was defined as a ≥ 90% reduction in airflow lasting ≥ 10 s. Hypopnea was defined as a ≥ 30% reduction in airflow lasting ≥ 10 s and a ≥ 4% reduction in arterial oxygen saturation.

OSA severity classification relied on AHI cutoffs: ≥ 5 for the presence of OSA, ≥ 15 for moderate-to-severe OSA, and ≥ 30 for severe OSA. We labeled the dataset accordingly and divided the participants into a none-to-mild OSA group (AHI < 15) and a moderate-to-severe OSA group (AHI ≥ 15) to develop the OSA risk prediction model.
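
The AHI definition and cutoffs above can be written out as a short sketch. This is an illustration rather than the authors' code: the function names are hypothetical, and total sleep time is assumed to be given in hours so that the AHI is expressed in events per hour.

```python
def apnea_hypopnea_index(n_apneas, n_hypopneas, total_sleep_time_h):
    """AHI = (apneas + hypopneas) / total sleep time in hours (assumed unit)."""
    if total_sleep_time_h <= 0:
        raise ValueError("total sleep time must be positive")
    return (n_apneas + n_hypopneas) / total_sleep_time_h


def osa_severity(ahi):
    """Map an AHI value to a severity band using the cutoffs 5, 15, and 30."""
    if ahi >= 30:
        return "severe"
    if ahi >= 15:
        return "moderate"
    if ahi >= 5:
        return "mild"
    return "none"


def binary_label(ahi):
    """Binary target used for modelling: 1 = moderate-to-severe (AHI >= 15)."""
    return 1 if ahi >= 15 else 0


# Hypothetical example: 60 apneas and 45 hypopneas over 7 h of sleep.
example_ahi = apnea_hypopnea_index(60, 45, 7.0)
print(example_ahi, osa_severity(example_ahi), binary_label(example_ahi))
```

A participant with an AHI just under 15 falls in the none-to-mild group, so the binary label depends on the cutoff being applied as "greater than or equal to".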

Ethical review

The Second Affiliated Hospital of Fujian Medical University, the Sleep Medicine Key Laboratory of the University in Fujian Province, and the Sleep Disorder Medicine Center of the Second Affiliated Hospital of Fujian Medical University review committee approved the protocol of this study (2023-133). In addition, all participants and/or their legal guardians consented to the use of their medical data in this study. This study was carried out in accordance with the Declaration of Helsinki.

Statistical analysis

Statistical analysis of the clinical records was performed using R version 4.2.1. Continuous data are presented as means and standard deviations, whereas frequencies and percentages were calculated for categorical variables. To explore the effects of age, gender, BMI, and mean heart rate during sleep on moderate-to-severe OSA, univariate analysis and multivariate analysis were performed. A Pearson's chi-squared test was conducted to determine whether the variables were significantly associated with moderate-to-severe OSA, and multivariate analysis was used to determine independent predictors of moderate-to-severe OSA. A two-tailed P-value less than 0.05 indicated a statistically significant difference. Multiple imputation was used to fill in missing height and weight data for a total of 58 patients. Multiple imputation is a repeated-simulation method: it generates several complete datasets from a dataset containing missing values, filling in the missing values in each by Monte Carlo simulation.

Development and validation of the prediction models

Six models were built using LR, ANN, NB, SVM, RF, and DT. These are mainstream supervised learning algorithms that can handle both numerical and categorical features and have an established research base in OSA prediction, which facilitates comparison with other studies. Users are also relatively familiar with these six models, which favors adoption and subsequent customization and upgrading of the models. All models were trained using age, gender, BMI, and mean heart rate during sleep to predict moderate-to-severe OSA (Fig. 1). The included data were randomly divided into a training-test set (80%) and an independent validation set (20%). To avoid overfitting and improve the model, we used ten-fold cross-validation in the training-test set. In this process, the training-test set was divided into ten parts; each part in turn served as the validation fold while the other nine served as the training folds. The hyperparameters were held fixed across the ten folds, and the average training loss and average validation loss of the ten resulting models were used to evaluate that hyperparameter setting. Once a satisfactory setting was obtained, a model was trained with it on the entire training-test set. Finally, the model with the best prediction performance was considered the best model and applied to the independent validation set.
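
The split-then-cross-validate procedure described above can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the function names, the seed value, and the placeholder `fit` and `loss` callables are assumptions introduced here.

```python
import random


def split_indices(n, holdout_fraction=0.2, seed=0):
    """Randomly split n record indices into a training-test set and a held-out validation set."""
    rng = random.Random(seed)  # seed is an arbitrary, hypothetical choice
    indices = list(range(n))
    rng.shuffle(indices)
    n_holdout = round(n * holdout_fraction)
    return indices[n_holdout:], indices[:n_holdout]


def k_folds(indices, k=10):
    """Yield (train, held_out) index lists; every index is held out exactly once."""
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        train = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train, folds[i]


def cross_validated_loss(indices, fit, loss, k=10):
    """Average held-out loss over k folds for one fixed hyperparameter setting.

    fit(train_indices) returns a model; loss(model, held_out_indices) returns a number.
    Both callables are placeholders supplied by the caller.
    """
    losses = [loss(fit(train), held_out) for train, held_out in k_folds(indices, k)]
    return sum(losses) / len(losses)


# 571 records, as in the cohort; the held-out 20% is never used during cross-validation.
development, validation = split_indices(571)
print(len(development), len(validation))
```

Each candidate hyperparameter setting would be scored with `cross_validated_loss` on the development indices only, and the chosen setting would then be refit on all development records before touching the validation set.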

Fig. 1

Flowchart of data collection and development of the ANN model. (A) Flowchart showing the data collection and division. (B) Artificial neural network model structure diagram.

Hyperparameters were tuned automatically using RapidMiner Studio, an open-source Java-based platform for implementing predictive models and data mining algorithms. The hyperparameters of the developed ANN model were set as: hidden_layers = 1; training_cycles = 1000; learning_rate = 0.01; momentum = 0.6; decay = false; shuffle = true; normalize = true; error_epsilon = 1.0E-4; use_local_random_seed = true; local_random_seed = 1992. Hyperparameters for the other models are available in the source code, which can be freely downloaded from GitHub26, as can the data sets used in this study. Supplementary Table S1 of Supplementary Material 1 also provides the hyperparameters of all models.

We evaluated model performance by calculating the area under the receiver operating characteristic curve (AUROC) for the six models on the independent validation set. In addition, we computed accuracy, recall, specificity, precision, and F1-score.
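
For readers less familiar with these metrics, the sketch below shows how they follow from a confusion matrix and from the ranking of predicted scores. It is an illustration with made-up labels and scores, not the authors' R code; the function names are hypothetical, and both classes are assumed to be present so that no denominator is zero.

```python
def classification_metrics(y_true, y_pred):
    """Accuracy, recall, specificity, precision, and F1 from binary labels (1 = positive)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    recall = tp / (tp + fn)
    precision = tp / (tp + fp)
    return {
        "accuracy": (tp + tn) / len(y_true),
        "recall": recall,
        "specificity": tn / (tn + fp),
        "precision": precision,
        "f1": 2 * precision * recall / (precision + recall),
    }


def auroc(y_true, scores):
    """AUROC as the probability that a positive case outranks a negative one (ties count 0.5)."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Made-up example data, thresholded at 0.5 to obtain class predictions.
labels = [1, 1, 1, 0, 0, 0, 1, 0]
scores = [0.9, 0.8, 0.4, 0.6, 0.3, 0.2, 0.7, 0.1]
predictions = [1 if s >= 0.5 else 0 for s in scores]
print(classification_metrics(labels, predictions))
print(auroc(labels, scores))
```

AUROC depends only on how the scores rank the cases, whereas the other five metrics depend on the probability cutoff used to turn scores into class predictions.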

In this study, calibration curve analysis was performed to assess consistency in terms of the slope, intercept, and Brier score of the calibration curve. The calibration curve groups the continuous predicted probabilities into bins to show whether the predicted probability of the model is close to the observed probability. The Brier score is a combined measure of model discrimination and calibration and takes a value in the range of 0–1; the closer the score is to 0, the better the model. Decision curve analysis was performed by quantifying the net clinical benefits at different threshold probabilities, and clinical impact curve analysis was performed to determine the clinical usefulness of the prediction model by quantifying the cost-to-benefit ratio at different threshold probabilities.
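
The two quantities behind these analyses can be sketched briefly. The Brier score is the mean squared difference between predicted probability and observed outcome; net benefit is shown here in its standard decision-curve form, true positives minus false positives weighted by the odds of the threshold probability, per person. The source does not spell out these formulas, so the code below is an assumption about the conventional definitions, with hypothetical function names and made-up data.

```python
def brier_score(y_true, y_prob):
    """Mean squared difference between predicted probability and outcome (0 or 1)."""
    return sum((p - y) ** 2 for y, p in zip(y_true, y_prob)) / len(y_true)


def net_benefit(y_true, y_prob, threshold):
    """Net benefit of acting on everyone whose predicted risk is >= threshold."""
    n = len(y_true)
    tp = sum(1 for y, p in zip(y_true, y_prob) if p >= threshold and y == 1)
    fp = sum(1 for y, p in zip(y_true, y_prob) if p >= threshold and y == 0)
    return tp / n - (fp / n) * threshold / (1 - threshold)


def net_benefit_treat_all(y_true, threshold):
    """Reference strategy: act on everyone regardless of predicted risk."""
    prevalence = sum(y_true) / len(y_true)
    return prevalence - (1 - prevalence) * threshold / (1 - threshold)


# Made-up outcomes and predicted risks; the "treat none" strategy has net benefit 0.
outcomes = [1, 0, 1, 1, 0]
risks = [0.9, 0.2, 0.6, 0.4, 0.7]
print(brier_score(outcomes, risks))
print(net_benefit(outcomes, risks, 0.5), net_benefit_treat_all(outcomes, 0.5))
```

For orientation, a model that assigned every person the same risk equal to the prevalence would score prevalence × (1 − prevalence); at the 66.4% prevalence of the full cohort this is about 0.22, although the prevalence in the validation subset may differ.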

R version 4.2.1 was used for all machine learning model development and validation.

Ethics approval and consent to participate

The Second Affiliated Hospital of Fujian Medical University, the Sleep Medicine Key Laboratory of the University in Fujian Province, and the Sleep Disorder Medicine Center of the Second Affiliated Hospital of Fujian Medical University review committee approved the protocol of this study (2023-133). As this study is a retrospective secondary analysis of existing clinical databases, all subjects involved in the study provided informed consent. In addition, this study was carried out in accordance with the Declaration of Helsinki.

Results

Demographic and clinical characteristics of the participants

A total of 610 participants underwent PSG at the sleep center during the study, 39 of whom were excluded from further analysis as they were less than 18 years old (n = 33) or had a total sleep time less than 3 h (n = 6) (Fig. 1). Ultimately, 571 participants (115 women and 456 men) were included in our study. Among them, 379 participants (66.4%) were diagnosed with moderate-to-severe OSA (AHI ≥ 15), while 192 patients (33.6%) had none-to-mild OSA (AHI < 15).

Relationship between participant characteristics and OSA

Men were found to have a higher rate of moderate-to-severe OSA than women (Table 1). Patients with moderate-to-severe OSA were characterized by older age, higher BMI, lower mean oxygen saturation, and higher mean heart rate during sleep compared with none-to-mild OSA patients (Table 1).

Table 1 Demographic and clinical characteristics of the participants.

Multivariate analysis was applied to investigate factors independently predicting moderate-to-severe OSA after adjusting for age, sex, BMI, and mean heart rate during sleep. The results revealed significant associations between advanced age, male gender, high BMI, high mean heart rate during sleep and an increased risk of moderate-to-severe OSA. Details are shown in Table 2.

Table 2 Unadjusted and adjusted odds ratios identifying independent predictors of OSA.

Model performance

As previously outlined, age, gender, BMI, and mean heart rate during sleep served as input parameters to build multiple machine learning models for predicting moderate-to-severe OSA. The AUROCs of the LR, ANN, NB, SVM, RF, and DT models in the training-test set were 79.0%, 79.1%, 71.0%, 78.8%, 77.3%, and 67.8%, respectively. The AUROCs with the standard deviation of each fold are presented in Supplementary Table S2 of Supplementary Material 1. After training, the LR, ANN, NB, SVM, RF, and DT models were applied to the independent validation set. Table 3 summarizes the performance of the six models in the independent validation set. The ANN model exhibited the highest AUROC, with an AUROC, accuracy, recall, specificity, precision, and F1-score of 80.4%, 69.9%, 86.5%, 61.5%, 53.2%, and 65.9%, respectively. This was followed by LR, NB, SVM, RF, and DT, with AUROCs of 80.2%, 79.7%, 79.2%, 78.4%, and 70.4%, respectively. The p-values for the differences in AUROC between models were all less than 0.05, indicating statistical significance. Figure 2A,B show the receiver operating characteristic (ROC) curves and AUROCs for the six models applied to the training-test set and independent validation set, respectively.

Table 3 The performance of the six OSA prediction models based on machine learning.
Fig. 2

Comparison of receiver operating characteristic (ROC) curves among the six machine learning models for OSA prediction. (A) ROC curves of training-test set; (B) ROC curves of independent validation set.

The calibration curves (Fig. 3) show that the predictions agreed well with the observations, and the LR, ANN, NB, SVM, RF, and DT models gave Brier scores of 0.165, 0.165, 0.182, 0.197, 0.175, and 0.206, respectively. The details are shown in Table 4.

Fig. 3

Calibration curves testing the stability of the six prediction models. (A) LR; (B) ANN; (C) NB; (D) SVM; (E) RF; (F) DT. The logistic calibration curve is shown in solid blue, and the statistics are displayed in the top left corner of each graph.

Table 4 Results of calibration tests of the six machine learning models for predicting OSA.

To determine the clinical usefulness of the models, we plotted decision curves and clinical impact curves for the six prediction models. The clinical decision curves (Fig. 4) showed that the threshold probabilities at which clinical decisions based on the LR, ANN, NB, SVM, RF, and DT prediction models achieved a greater net benefit than the "no treatment" or "all treatment" scenarios were 0.92, 0.90, 0.67, 0.65, 0.78, and 0.65, respectively.

Fig. 4

Decision curve analysis of the six prediction models. (A) LR; (B) ANN; (C) NB; (D) SVM; (E) RF; (F) DT.

The clinical impact curve (Fig. 5) analysis illustrated the clinical effectiveness of the six predictive models. At threshold probabilities greater than 70%, 65%, 75%, 75%, 60%, and 70%, respectively, the populations identified as high risk of moderate-to-severe OSA by the LR, ANN, NB, SVM, RF, and DT models significantly overlapped with those actually experiencing moderate-to-severe OSA, indicating their clinical efficacy.

Fig. 5

Clinical impact curve analysis of the six prediction models. (A) LR; (B) ANN; (C) NB; (D) SVM; (E) RF; (F) DT.

Combining these performance metrics, the ANN model was identified as the most suitable for this study due to its superior nonlinear modeling, adaptive learning, generalization capability, and flexibility. Figure 6 shows the feature importance of the ANN model.

Fig. 6

The feature importance of the ANN model.

Comparison with previous research

We compared the effectiveness of the generated ANN model with the results of previous studies. A systematic search of PubMed was conducted to identify OSA prediction models developed since 2017. Four relevant studies22,23,24,27 were selected for comparative analysis alongside our ANN model. Our model demonstrates better predictive performance, with higher recall and specificity, than most of these previous models (Table 5). The features used in each study are shown in Supplementary Table S3 of Supplementary Material 1.

Table 5 Effectiveness of the ANN model in our study compared to that of models from previous research.

Application of the model

The risk of moderate-to-severe OSA could be estimated by inputting the patient's age, gender, BMI, and mean heart rate during sleep into the model. Further testing and treatment are recommended if the risk is ≥ 50%, and lifestyle improvement and regular follow-up are recommended if the risk is < 50%.
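
The decision rule above amounts to a single cutoff on the predicted risk. The sketch below illustrates it together with the usual BMI calculation from weight and height; the function names are hypothetical, and the predicted risk is assumed to come from the trained model as a probability between 0 and 1.

```python
def body_mass_index(weight_kg, height_m):
    """BMI in kg/m^2, one of the four model inputs."""
    return weight_kg / height_m ** 2


def screening_recommendation(predicted_risk):
    """Translate a predicted risk of moderate-to-severe OSA into the stated recommendation."""
    if not 0.0 <= predicted_risk <= 1.0:
        raise ValueError("predicted risk must be a probability between 0 and 1")
    if predicted_risk >= 0.5:
        return "further testing and treatment"
    return "lifestyle improvement and regular follow-up"


# Hypothetical example values; the risk would be produced by the ANN model.
print(body_mass_index(81.0, 1.80))
print(screening_recommendation(0.62))
```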

Deployed as a mobile online preliminary screening tool, the model operates within a cloud-based mobile sleep medicine management platform. Through multi-port access, the platform supports data collection in several scenarios, including screening at home, in primary hospitals, and in sleep medicine centers. Details are shown in Fig. 6. In addition, we established a website for patients and clinicians to use the ANN model (https://slumbot.angelong.cn/)28 (Fig. 7).

Fig. 7

Scenarios for the patient. BMI body mass index, IoT Internet of Things, AI artificial intelligence.

Discussion

Principal findings

OSA represents a significant global healthcare burden, yet remains largely undiagnosed and untreated due to limited awareness and high diagnostic costs10. In this study, we developed six risk prediction models for moderate-to-severe OSA using four parameters: age, gender, BMI, and mean heart rate during sleep. The models employing LR, ANN, NB, SVM, RF, and DT exhibited good performance in predicting moderate-to-severe OSA in the independent validation set, with AUROCs of 80.2%, 80.4%, 79.7%, 79.2%, 78.4%, and 70.4%, respectively. Calibration curve analysis supported the models' reliability, and decision curve analysis and clinical impact curve analysis highlighted their practical clinical utility. Our models offer potential for early clinical diagnosis of adult moderate-to-severe OSA and may mitigate its adverse consequences.

Model selection was primarily based on AUROC and recall. Higher AUROC values indicate better prediction ability. Higher recall helps screen for moderate-to-severe OSA patients more effectively, which is crucial for early detection and intervention. LR, ANN, NB, SVM, RF, and DT are all models widely used in clinical medicine. The ANN model, which performed notably well with high AUROC, recall, and specificity and a low Brier score, likely benefits from its ability to capture nonlinear relationships within data structures. An ANN is a computational model based on the function of biological neural networks that can be used for modeling nonlinear statistics and revealing complex relationships between input and output data, and it is well suited to solving problems such as clinical prediction and image recognition29,30.

A large dataset of adult samples from a hospital was used in our study for accurate comparisons between the none-to-mild OSA group and the moderate-to-severe OSA group. To evaluate whether the included sample size was sufficient to draw reliable conclusions, a post hoc sample size calculation31 was conducted using an interactive tool available online (https://riskcalc.org/samplesize/)32. In the final model with 4 predictors, using the C-statistic in conjunction with the expected incidence to approximate the Cox–Snell R-squared, a minimum sample of 343 participants and a minimum of 22.8 events per predictor parameter were required. Thus, the real sample of 571 participants in this study may provide adequate power to ensure the reliability of our results.

While previous research has established associations between OSA and variables such as age, sex, BMI, and mean heart rate during sleep, few studies have integrated these four factors into OSA risk prediction models. Studies have shown that machine learning can enhance the precision of prediction models and promote awareness of potentially latent conditions33,34,35. Many models have used machine learning to predict OSA. However, most of them incorporate parameters that are too complex to obtain in daily life, such as head measurement images36, sleep questionnaires23,36, breathing sounds during sleep37, and anthropometric features including neck circumference, waist circumference, and head circumference23,36,38. Moreover, few prediction models have been put into real clinical application.

Clinical implications

In clinical practice, our prediction model stratifies patients into none-to-mild OSA and moderate-to-severe OSA. Patients with moderate-to-severe OSA show more serious clinical symptoms such as drowsiness, headache, and snoring, which have a greater impact on daily life and are more strongly associated with co-morbidities such as cardiovascular and cerebrovascular diseases, so early screening is more clinically significant in these patients. The modeling parameters included in our study are simple and easy to obtain, and the generated decision curves and clinical impact curves indicate the high clinical usefulness of the model. Moreover, the model developed in this study is deployed on a cloud-based mobile sleep medicine management platform for clinical application, and it has shown distinct advantages in several clinical application scenarios. For residents screening at home, it can reduce the time spent on hospital appointments and queues, save costs, and increase acceptance and cooperation, because initial screening is conducted in the familiar home environment. For primary hospital screening, the utilization rate of sleep monitoring equipment can be improved, and medical resources and data can be exchanged with tertiary hospitals, so as to provide more sleep medical services to patients. For sleep center screening, the economic burden on patients can be reduced; the use of IoT makes the consultation process simpler; the initial screening results are presented in real time; and a personalized treatment or referral plan can be formulated with the advice of sleep physicians. The OSA prediction model provides an effective tool for moderate-to-severe OSA prediction or referral under limited clinical conditions. In addition, this model can be used in public education about OSA to improve people's understanding of OSA and its hazards.

Limitations and future work

While our model demonstrates good performance, there are several potential limitations that merit attention and future investigation.

First, our study used only a single-center dataset exclusively comprising individuals of an Asian ethnicity. Therefore, we will further validate the developed model using an external dataset, exploring its ability to be generalized to other institutions or ethnic groups.

Second, due to the COVID-19 epidemic, fewer patients underwent full-night polysomnography during our study period, resulting in a limited sample size. Moving forward, continuous training and optimization of the model will be conducted using additional data from the Sleep Medicine Center. Meanwhile, transfer learning techniques will be explored to enhance training efficiency and broaden the model's applicability while mitigating overfitting.

Third, the participants in this study all underwent PSG at the sleep medicine center and were generally considered likely to have OSA or other sleep disorders. They had a 66.4% prevalence of moderate-to-severe OSA, which is higher than the prevalence reported in the general public6,7,39. This may limit the model's representativeness of the broader community.

Additionally, our retrospective study did not exclude individuals at high risk for undiagnosed OSA-related conditions such as cardiovascular disease, metabolic disease, and neurological disease. Such comorbidities may confound the model's predictions, suggesting a potential bias towards individuals at higher risk for OSA-related complications. Taken together, these points indicate that our model may be better suited to detecting OSA in high-risk populations, and collinearity among the predictors cannot be ruled out.

Fourth, the performance of our model falls within a mid-range spectrum in comparison with previous studies, which may be related to differences in the sample demographics and outcome criteria. For example, the relatively young age of our study cohort implies a lower prevalence of moderate-to-severe OSA compared to older populations40, posing challenges for accurate prediction models. Future efforts will focus on diversifying patient data across different age groups to achieve a balanced dataset, ensuring robust prediction performance across diverse population segments.

Fifth, in aiming for broad applicability from home settings to primary care facilities and specialized sleep centers, our model incorporated only four readily available predictors. This selection may compromise the model's specificity in favor of simplicity. In future work, multidimensional indicators including physical examination data and laboratory indicators will be incorporated to construct a more accurate risk stratification model suitable for medical settings.

Sixth, traditional questionnaire-based assessments, requiring users to input data sequentially, may diminish user engagement. Future studies may explore AI-driven medical chatbots to enhance user interactivity. The Internet of Things and wearable technology can also be applied to collect and analyze OSA-related data in real time.

Finally, as the number of individuals screened for OSA continues to grow, there will be increasing demands for the stability, accuracy, real-time performance, and scalability of the algorithms used. To address the computational challenges posed by large-scale datasets, it may be necessary to adopt parallel computing and distributed processing techniques. Additionally, optimizing model complexity and hyperparameter tuning will be ongoing priorities to align with the computational resources available in medical institutions.

Conclusions

In conclusion, our study has highlighted the underappreciated impact and disease burden associated with moderate-to-severe obstructive sleep apnea (OSA). We addressed this gap by developing six streamlined risk prediction models utilizing logistic regression (LR), artificial neural network (ANN), naïve Bayes (NB), support vector machine (SVM), random forest (RF), and decision tree (DT).

Among these models, the artificial neural network (ANN) model demonstrated significant potential for practical implementation within a cloud-based mobile sleep medicine management platform. Particularly noteworthy is the identification and integration of four key predictors—age, gender, BMI, and mean heart rate during sleep—into the modeling process. These predictors have shown substantial capability in predicting the severity of OSA with notable accuracy.

Our findings present a convenient and effective tool for the early detection of OSA by leveraging readily available data, and demonstrate the practical application of an OSA screening model in clinical scenarios. Future research could further explore combinations of predictors that are even more sensitive and specific, thus enhancing overall model performance.