Introduction

Rapidly increasing healthcare spending has become a significant challenge globally1, primarily driven by high-need, high-cost (HNHC) patients, who are typically defined as those accounting for the top 5% of annual healthcare expenditures2. Research has shown that the top 5% of HNHC patients accounted for over 50% of the total healthcare expenditures in the United Kingdom (54%, 2022)3, the United States (51.2%, 2021)4, Canada (60%), Spain (56%), and Germany (50%, 2019)5. The substantial financial burden imposed by HNHC patients underscores the urgent need for targeted interventions to optimize healthcare resource allocation and reduce costs6. Furthermore, studies demonstrate that HNHC patients and their families face a significant disease-related economic burden while experiencing poorer health outcomes, suggesting inefficiencies in healthcare service delivery and resource allocation6,7,8.

To address these challenges, many countries have prioritized HNHC patients in healthcare reform efforts by exploring various service delivery models, such as Adapting Stepped Care, Intensive Primary/Outpatient Care Programs, Interdisciplinary Transitional Care Programs, etc.9,10,11,12,13,14,15. However, evaluations of these initiatives have yielded mixed results. While some programs have shown limited effectiveness in reducing emergency and inpatient services for specific patient subgroups, their long-term impact on cost control remains limited9,10,12,13,15. Previous research emphasizes the limited effectiveness of targeting existing HNHC patients, highlighting the necessity for healthcare systems to proactively predict and identify potential high-risk HNHC patients before high needs and costs arise, enabling preventive interventions and personalized services to control healthcare expenses and improve patient outcomes10,12,16,17.

Therefore, a prediction model for future HNHC patients has gained significant attention from policymakers and payers, as targeted interventions for this group are expected to be more effective. International studies have predominantly focused on predicting HNHC patients in adult populations, utilizing data from countries such as the United States, South Korea, and Japan, with no known prediction models for HNHC children18,19,20. Previous studies have commonly utilized healthcare insurance claims or hospitalization data, employing machine-learning-based methods such as random forest18,19,20,21,22,23,24,25, artificial neural network19,20,22,26,27, k-nearest neighbors24, naive Bayes21,28,29, and XGBoost30 to construct models, offering methodological insights for this study. In recent years, while studies have further subdivided the HNHC population and developed predictive models for sustained HNHC patients21,26, there remains a critical gap in predictive models specifically tailored for pediatric HNHC patients globally.

In China, research has concentrated on single diseases such as heart disease and chronic obstructive pulmonary disease, developing models to predict HNHC patients with these specific conditions; these studies lack comprehensive analyses across diverse disease categories22,28,31. Evidence on HNHC patients across all disease categories in China is therefore limited, and there is a particular paucity of evidence on comprehensive prediction models for pediatric HNHC patients. Significant differences in characteristics, prevalent diseases, healthcare utilization, and expenditures between children and adults further underscore the necessity of developing a specific, valid, and reliable machine-learning-based model tailored to pediatric HNHC patients32.

This study aims to develop a machine-learning-based prediction model for HNHC pediatric patients using a 7-year retrospective cohort dataset from China. The proposed model is intended to support decision-makers and payers in formulating effective strategies for early interventions, thereby mitigating the rapid escalation of healthcare costs.

Methods

Data resource

This retrospective cohort study covered the period from January 1, 2017, to December 31, 2023, using data from two sources in Shanghai, China. The first database originates from the administrative data of the Urban and Rural Resident Basic Medical Insurance (URBMI), a fundamental component of China’s social medical insurance system providing coverage for children. The second database derives from the administrative records of the Children’s Hospitalization Mutual Fund (CHMF), a non-profit, public welfare medical security mutual aid fund in Shanghai. Established in 1996, URBMI and CHMF cover almost all children in Shanghai, aiming to share the risk of hospitalization and major outpatient medical expenses for children.

The data used in this study are divided into three parts. The first part comprises seven years of data (2017–2023) comprehensively describing the 20 most common diseases among pediatric HNHC patients, based on the International Classification of Diseases Tenth Revision (ICD-10) with codes grouped at the three-digit level. The second part consists of training and internal validation datasets from 2021 to 2022 for machine learning. The third part is an external validation dataset from 2022–2023, used to assess the robustness of the trained models.

Study population

The study included all children aged 18 and under hospitalized from January 1, 2017, to December 31, 2023, excluding patients outside the specified age range, records with missing cost data, or incomplete ICD-10 codes. Furthermore, measures were taken to identify and manage cost outliers through trimming procedures to uphold data integrity. The ethics of this study were approved by the Shanghai Health Development Research Center (approval number 2024001) and all the methods were performed in accordance with the Declaration of Helsinki.

Candidate predictors

The candidate predictors entered into the models comprise four categories, totaling 12 variables. The first category includes patient characteristics: age (categorized into five groups), gender, household registration status, health insurance type, and whether the patient has one of the Top 20 HNHC Diseases. The second category involves characteristics related to healthcare service utilization, including whether surgery was performed during hospitalization, length of hospital stay, and annual number of hospitalizations. The third category pertains to healthcare expenditure characteristics. Because previous research indicates a strong correlation between HNHC status in the previous year and the subsequent year20, annual expenditure was computed to determine whether a patient’s costs ranked within the top 5%, recorded as previous HNHC status. The fourth category encompasses hospital characteristics such as hospital level, hospital type, and hospital region.

Outcomes

The outcome measured in this study was becoming an HNHC patient in the subsequent year, defined as individuals who rank within the top 5% of annual hospitalization costs. For the training dataset spanning 2021–2022, the predictive model assessed whether patients would become HNHC in 2022. For the external validation dataset spanning 2022–2023, the predictive model assessed whether patients would become HNHC in 2023.
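The outcome definition can be illustrated with a minimal sketch. The function name `label_hnhc`, the rounding-up rule for the cut-off, the tie-breaking by patient id, and the toy costs are all assumptions made for illustration; the source states only that HNHC patients are those ranked within the top 5% of annual hospitalization costs.

```python
import math

def label_hnhc(annual_costs, top_fraction=0.05):
    """Flag the highest-cost patients of one year.

    annual_costs: dict mapping a patient id to total annual hospitalization cost.
    Returns a dict mapping each patient id to 1 (HNHC) or 0 (non-HNHC).
    """
    # Number of patients in the top group; rounding up is an assumption.
    n_top = math.ceil(len(annual_costs) * top_fraction)
    # Highest cost first; ties are broken by id for reproducibility (an assumption).
    ranked = sorted(annual_costs, key=lambda pid: (-annual_costs[pid], pid))
    top = set(ranked[:n_top])
    return {pid: int(pid in top) for pid in annual_costs}

# Toy example with 20 hypothetical patients: exactly one falls in the top 5%.
costs = {f"p{i:02d}": 1000.0 * (i + 1) for i in range(20)}
labels = label_hnhc(costs)
print(sum(labels.values()), labels["p19"])
```

In a two-year design such as the one described here, a function of this kind would be applied once to the earlier year (giving the previous HNHC status predictor) and once to the later year (giving the outcome).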

Statistical analysis

Machine-learning-based models

This study employed five machine-learning-based models to predict HNHC pediatric patients:

(1) Logistic regression (used as the reference model)30,33: A statistical model that predicts the probability of a binary outcome, \(P(y=1|X)=\frac{1}{1+{e}^{-({\beta }_{0}+{\beta }_{1}{X}_{1}+\dots +{\beta }_{n}{X}_{n})}}\), based on a linear relationship between the predictor variables and the log-odds, where \({X}_{1},{X}_{2},\dots ,{X}_{n}\) are predictor variables, \({\beta }_{0}\) is the intercept, and \({\beta }_{1},{\beta }_{2},\dots ,{\beta }_{n}\) are model coefficients.

(2) k-nearest neighbors (KNN)34: A non-parametric method that predicts the class of a query point by finding the majority class among its k nearest neighbors. The Euclidean distance is commonly used to determine nearest neighbors, defined as \(d({X}_{i},{X}_{j})=\sqrt{\sum_{m=1}^{n}{({X}_{im}-{X}_{jm})}^{2}}\), where \({X}_{i}\) and \({X}_{j}\) represent two points in the feature space.

(3) Random forest (RF)35: An ensemble method that constructs multiple decision trees during training and outputs the mode (classification) or average prediction (regression) of the individual trees. The prediction function for a random forest can be written as \(\widehat{f}\left(X\right)=\frac{1}{T}\sum_{t=1}^{T}{f}_{t}(X)\), where \(T\) is the number of trees and \({f}_{t}(X)\) is the prediction of the \(t\)-th tree.

(4) Multi-layer Perceptron (MLP)36: A type of artificial neural network composed of multiple layers of nodes, in which each layer is fully connected to the next and training uses backpropagation. The output of layer \(l\) is computed as \({z}^{(l)}=\sigma ({W}^{\left(l\right)}{x}^{\left(l-1\right)}+{b}^{(l)})\), where \({W}^{\left(l\right)}\) is the weight matrix, \({b}^{(l)}\) is the bias, \(\sigma (\cdot )\) is the activation function, and \({x}^{\left(l-1\right)}\) is the input from the previous layer.

(5) Naive Bayes37: A probabilistic classifier based on Bayes’ theorem with the “naive” assumption of independence between features, making it efficient and effective for large datasets with categorical features. The posterior probability is calculated as \(P\left(C|X\right)=\frac{P\left(X|C\right)P\left(C\right)}{P\left(X\right)}\), where \(C\) is the class label, \(X\) is the feature set, and \(P\left(X|C\right)\) is computed under the assumption of independence among features for computational efficiency.
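The formulas above can be made concrete with a small sketch in plain Python. It is not the implementation used in this study (which relied on a machine-learning library); the function names, the sigmoid activation, and the toy inputs are assumptions chosen only to mirror the equations for logistic regression, KNN, random forest averaging, and a single MLP layer.

```python
import math
from collections import Counter

def logistic_probability(x, beta0, betas):
    """P(y = 1 | X) = 1 / (1 + exp(-(beta0 + sum_i beta_i * X_i)))."""
    z = beta0 + sum(b * xi for b, xi in zip(betas, x))
    return 1.0 / (1.0 + math.exp(-z))

def euclidean_distance(xi, xj):
    """d(X_i, X_j) = sqrt(sum_m (X_im - X_jm)^2)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(xi, xj)))

def knn_predict(query, points, labels, k=3):
    """Majority class among the k training points closest to the query."""
    order = sorted(range(len(points)),
                   key=lambda i: euclidean_distance(query, points[i]))
    return Counter(labels[i] for i in order[:k]).most_common(1)[0][0]

def forest_average(tree_predictions):
    """f_hat(X) = (1 / T) * sum_t f_t(X) over the T per-tree predictions."""
    return sum(tree_predictions) / len(tree_predictions)

def dense_layer(weights, x, bias):
    """z = sigma(W x + b) for one layer; sigma is assumed to be the sigmoid."""
    return [1.0 / (1.0 + math.exp(-(sum(w * xi for w, xi in zip(row, x)) + b)))
            for row, b in zip(weights, bias)]

# Toy inputs (made up for illustration).
print(logistic_probability([0.0, 0.0], 0.0, [1.0, 1.0]))
print(knn_predict([0, 0], [[0, 1], [1, 0], [5, 5], [6, 6]], [0, 0, 1, 1], k=3))
print(forest_average([1, 0, 1, 1]))
print(dense_layer([[0.0, 0.0]], [1.0, 2.0], [0.0]))
```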

Model development

This study used 70% of the data from 2021–2022 as a training set, employing the predictive factors of 2021 patients as model input parameters to construct models for predicting HNHC. Following initial model development, the remaining 30% of the 2021–2022 data served as a validation set to evaluate model performance. This process refined the training outcomes of the predictive model based on the actual HNHC cases in 2022, facilitating the selection of the optimal model. To address the class imbalance between the HNHC and non-HNHC groups within the dataset, internal class balancing was implemented to avoid biased model performance, particularly towards the majority (non-HNHC) class38. The Synthetic Minority Over-sampling Technique (SMOTE) ensures that the classifier is exposed to a more equitable distribution of samples, enhancing its ability to generalize across both majority and minority classes39.
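The core idea of SMOTE can be sketched as follows: a synthetic minority sample is placed at a random position on the line segment between a minority sample and one of its nearest minority-class neighbors. This is a simplified illustration, not the routine used in the study; the function name `smote_sketch`, the default `k`, the seed, and the toy points are assumptions.

```python
import math
import random

def smote_sketch(minority, n_new, k=5, seed=0):
    """Create n_new synthetic minority samples by interpolation.

    minority: list of feature vectors (lists of floats) from the minority class;
    at least two samples are required.
    """
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest minority-class neighbours of `base`, excluding `base` itself.
        others = [p for p in minority if p is not base]
        others.sort(key=lambda p: math.dist(base, p))
        neighbour = rng.choice(others[:k])
        gap = rng.random()  # position along the segment base -> neighbour
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbour)])
    return synthetic

# Toy minority class: the four corners of the unit square.
minority = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
synthetic = smote_sketch(minority, n_new=6, k=2)
print(len(synthetic))
```

Oversampling of this kind is normally applied to the training portion only, so that the validation data keep the original class distribution.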

A grid search strategy was employed with k-fold cross-validation to optimize hyperparameters for the RF and KNN models25,40. The MLP model utilizes multiple layers to effectively model nonlinear relationships between predictors and outcomes38,41. The final models were also developed by manual tuning of the hyperparameters (e.g., the number of layers). In k-fold cross-validation, the dataset is divided into k equal-sized subsets (folds) and the models are trained k times, each time using k−1 folds for training and the remaining fold for testing; this ensures that each data point is used for both training and testing, providing a more robust estimate of model performance40. After a preliminary evaluation of k = 5, k = 10, and k = 20, k = 10 was selected because it often provides a good balance between bias and variance.
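The fold construction described above can be sketched in a few lines. The generator name `k_fold_indices` is hypothetical, and the sketch neither shuffles nor stratifies the data, which is a simplifying assumption; whether the study shuffled or stratified its folds is not stated.

```python
def k_fold_indices(n_samples, k=10):
    """Yield (train_indices, test_indices) for each of the k folds."""
    indices = list(range(n_samples))
    # Folds differ in size by at most one when n_samples is not divisible by k.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        yield train_idx, test_idx
        start += size

# Every sample is used for testing exactly once and for training k - 1 times.
folds = list(k_fold_indices(25, k=10))
print(len(folds), [len(test) for _, test in folds])
```

In a grid search, each candidate hyperparameter setting would be scored by averaging a metric over these k folds, and the setting with the best average would be retained.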

Model assessment

The study assessed each model’s predictive performance using the following five metrics. (1) ROC-AUC: The receiver operating characteristic (ROC) curve is a graphical plot that illustrates the trade-off between the true positive rate (sensitivity) and the false positive rate (1 − specificity) across various threshold settings, providing a comprehensive assessment of a model’s discriminatory ability; AUC represents the area under this curve. (2) Accuracy: The proportion of correct predictions (both true positives, TP, and true negatives, TN) made by the model among all predictions, reflecting overall correctness, as shown in Eq. (1). (3) Sensitivity: Also known as the true positive rate, sensitivity measures the proportion of actual positives (HNHC patients) correctly identified by the model, indicating its ability to detect positive cases and avoid false negatives (FN), as shown in Eq. (2). (4) Specificity: Also known as the true negative rate, specificity measures the proportion of actual negatives correctly identified by the model among all negative instances, highlighting its ability to avoid false positives (FP), as shown in Eq. (3). (5) F1 score: The harmonic mean of precision (the ability of the model not to label a negative sample as positive) and recall (the ability of the model to find all positive samples, identical to sensitivity), which is useful when the class distribution is uneven, as shown in Eq. (4). The calculation formulas for the above metrics are as follows:

$$Accuracy=\frac{TP+TN}{TP+TN+FP+FN} \tag{1}$$
$$Sensitivity=\frac{TP}{TP+FN} \tag{2}$$
$$Specificity=\frac{TN}{TN+FP} \tag{3}$$
$$F1\ score=2\times \frac{\frac{TP}{TP+FP}\times \frac{TP}{TP+FN}}{\frac{TP}{TP+FP}+\frac{TP}{TP+FN}} \tag{4}$$
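Equations (1) to (4) translate directly into code. The sketch below uses a hypothetical function name and made-up confusion-matrix counts (not results from this study), and it does not guard against empty classes, which would cause a division by zero.

```python
def classification_metrics(tp, tn, fp, fn):
    """Accuracy, sensitivity, specificity, and F1 score from confusion counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)          # Eq. (1)
    sensitivity = tp / (tp + fn)                        # Eq. (2), also recall
    specificity = tn / (tn + fp)                        # Eq. (3)
    precision = tp / (tp + fp)
    f1 = 2 * precision * sensitivity / (precision + sensitivity)  # Eq. (4)
    return {"accuracy": accuracy, "sensitivity": sensitivity,
            "specificity": specificity, "f1": f1}

# Made-up counts for an imbalanced sample: 100 positives among 1,000 cases.
metrics = classification_metrics(tp=40, tn=890, fp=10, fn=60)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

With these toy counts the accuracy is 0.93 even though only 40% of the positive cases are detected, which is why sensitivity and the F1 score are reported alongside accuracy when classes are imbalanced.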

Considering that the cohort in this study comprised 91,882 children, which is far larger than the number of predictors (12), the risk of overfitting was deemed low. The events per variable (EPV) for all models exceeded 80, well above the EPV < 20 threshold typically associated with overfitting risk42.

Robustness analysis

The external validation dataset (2022–2023) was used to assess the robustness of the trained models: the prediction models were validated using predictors from the 2022 data and the outcome from the 2023 data. For internal validation, additional experiments were conducted with train-test ratios other than the 70:30 split of the main models29, namely 80:20 and 90:10. The same five metrics were used to assess each model’s predictive performance.

In the robustness analysis section, additional metrics were also incorporated across various train-test ratios to provide a more comprehensive evaluation of model performance, including: (1) the Mean Squared Error (MSE) quantifies the average squared difference between predicted and actual values, emphasizing larger errors, while the Root Mean Squared Error (RMSE) is the square root of MSE, providing an error metric in the same units as the target variable, facilitating more intuitive interpretation of model performance; (2) the Positive Likelihood Ratio (PLR) and Negative Likelihood Ratio (NLR) are statistical measures used to evaluate the diagnostic performance of a test, providing an assessment of how well a test can distinguish between the presence and absence of a condition; (3) the Diagnostic Odds Ratio (DOR) is a single indicator of test performance, defined as the ratio of the odds of a positive test result in individuals with the condition to the odds of a positive result in those without the condition; a higher DOR signifies better discriminatory power of the model; and (4) the Matthews Correlation Coefficient (MCC) is a measure of the quality of binary classifications, taking into account true and false positives and negatives; it returns a value between -1 and 1, where 1 indicates perfect predictions, 0 indicates random predictions, and -1 indicates total disagreement between prediction and observation.
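The additional metrics can be computed from the same confusion-matrix counts, as the sketch below shows. The function names and the toy inputs are assumptions; the formulas are the standard definitions (PLR = sensitivity / (1 − specificity), NLR = (1 − sensitivity) / specificity, DOR = PLR / NLR), and no guard is included for zero denominators, which arise for example when specificity equals 1.

```python
import math

def diagnostic_metrics(tp, tn, fp, fn):
    """PLR, NLR, DOR, and MCC from confusion-matrix counts."""
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    false_positive_rate = fp / (fp + tn)   # equals 1 - specificity
    false_negative_rate = fn / (tp + fn)   # equals 1 - sensitivity
    plr = sensitivity / false_positive_rate
    nlr = false_negative_rate / specificity
    dor = plr / nlr
    mcc = (tp * tn - fp * fn) / math.sqrt(
        (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {"PLR": plr, "NLR": nlr, "DOR": dor, "MCC": mcc}

def mse_rmse(y_true, y_pred):
    """Mean squared error and its square root."""
    mse = sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)
    return mse, math.sqrt(mse)

# Made-up counts and predictions, for illustration only.
diag = diagnostic_metrics(tp=40, tn=890, fp=10, fn=60)
mse, rmse = mse_rmse([1, 0, 0, 1], [0.5, 0.5, 0.0, 1.0])
print(diag, mse, rmse)
```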

These models were developed and optimized with scikit-learn in Python version 3.0.

Results

Characteristics of the study sample

The study included a total of 91,882 hospitalized children in 2021. These children were categorized into HNHC (n = 4,595) and non-HNHC (n = 87,287) groups based on whether their annual medical costs ranked in the top 5%, and their predictor variables were described and compared accordingly (Table 1).

Table 1 Predictor variables of the study populations in 2021 (n = 91,882).

Compared to non-HNHC children, HNHC children were more likely to be under 1 year old or over 13 years old, female, Shanghai residents, and dual-insured (URBMI + CHMF). Analysis of diseases among HNHC children from 2017 to 2023 indicated that the top 20 diseases can be categorized into five groups: (1) common neonatal diseases (e.g., preterm birth, neonatal respiratory distress syndrome), (2) traumas and limb deformities (e.g., fractures, dislocations, sprains), (3) respiratory diseases (e.g., pneumonia), (4) malignant neoplasms, and (5) cardiovascular, cerebrovascular, and rare diseases (e.g., congenital heart malformations, hemophilia, Crohn’s disease) (Table 2). Among HNHC children, 49.14% had one of the top 20 HNHC diseases, a significantly higher rate than that of non-HNHC children. Regarding healthcare service utilization, 91.64% of HNHC children underwent surgery (versus 74.94% of non-HNHC children), and HNHC children had a longer average hospital stay (34.15 days) and more frequent annual hospitalizations (2.86 times). In terms of hospital characteristics, HNHC children were more likely to be hospitalized in tertiary hospitals located in the city center, either in general or specialized hospitals.

Table 2 The top 20 most prevalent diseases among HNHC patients during 2017–2023.

Between 2021 and 2023, total hospitalization spending and HNHC patient spending fluctuated slightly. In 2021, total annual hospitalization spending among pediatric patients was 190.75 million dollars, of which 67.50 million dollars was spent on the top 5% (HNHC) patients. Over the three years, the proportion of total spending attributed to HNHC patients remained stable, ranging between 35% and 36% (Fig. 1).

Fig. 1 Distribution of annual hospitalization spending among children 2021–2023.

Prediction of HNHC patients

The prediction abilities of the reference model and the four machine-learning-based prediction models for HNHC patients are presented in Table 3, including ROC-AUC, accuracy, sensitivity, specificity, and F1-score for each model. No predictor included in the reference model had a high variance inflation factor (VIF > 10), indicating that collinearity is not a concern for the prediction models. In line with the metric used in the majority of previous studies19,20,22,26,28,31,43, this study used ROC-AUC as the criterion for selecting the optimal model, as demonstrated in Table 3 and Fig. 2.

Table 3 Prediction ability of models for HNHC patients (Train-test Ratio of 70:30).

Fig. 2 Receiver-operating characteristics (ROC) curves.

All four machine-learning-based models had higher discriminative ability than the logistic regression model, which was the reference model (ROC-AUC = 0.827, F1-score = 0.127). The MLP model had the highest ROC-AUC (0.872), followed by the random forest model (0.869), the KNN model (0.836), and the Naive Bayes model (0.828). The Naive Bayes model had the highest sensitivity (0.567), and the specificity of the random forest, MLP, and KNN models was 1.000. To assess the risk of overfitting, the events per variable (EPV) was calculated for the prediction models; the EPV exceeded 80 for all models (EPV < 20 is indicative of potential overfitting), which further supported the robustness and generalizability of these models.

The ROC curves for the prediction models are shown in Fig. 2, where the x-axis represents the false positive rate (1 − specificity) and the y-axis represents sensitivity; the corresponding values of the area under the ROC curve for each model are presented in Table 3. Compared to the red line representing the logistic regression model, the other machine learning models demonstrate higher predictive power. The Multi-layer Perceptron model exhibits the highest predictive performance, as indicated by the green line with the largest area under the ROC curve.
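The area under the ROC curve has an equivalent rank-based reading: it is the probability that a randomly chosen positive case receives a higher predicted score than a randomly chosen negative case, with ties counted as one half. The sketch below computes it that way; the function name and the toy scores are assumptions, and the pairwise loop is intended for illustration rather than for datasets of the size used here.

```python
def roc_auc(y_true, scores):
    """ROC-AUC as the share of (positive, negative) pairs ranked correctly."""
    positives = [s for y, s in zip(y_true, scores) if y == 1]
    negatives = [s for y, s in zip(y_true, scores) if y == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0      # positive case ranked above the negative case
            elif p == n:
                wins += 0.5      # ties count as half
    return wins / (len(positives) * len(negatives))

# Toy labels and predicted probabilities (made up).
auc = roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
print(auc)
```

Because this quantity depends only on how cases are ranked, it is unaffected by the choice of classification threshold, unlike accuracy, sensitivity, and specificity.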

The Losses or Error Graph for the prediction models is shown in Fig. 3, where the y-axis represents the error rates of the corresponding models. The highest error rate was observed for the Naive Bayes model, while the MLP maintained the lowest error rate, lower than those of RF, Logistic Regression, and KNN. The MLP model thus achieved high predictive performance while maintaining the lowest error rate during training, indicating superior model performance.

Fig. 3 The losses or error graph among different prediction models.

Predictors importance

Based on the MLP model, SHAP (SHapley Additive exPlanations) plots were used to explain the model’s predictions by quantifying the contribution of each feature (Fig. 4). The summary SHAP plot aggregates the importance of the 12 predictors: each feature is assigned a SHAP value relative to a baseline prediction (often the dataset’s mean), and the magnitude of the SHAP value indicates the extent to which the feature influences the model’s predicted output. In the MLP model, the factors that most strongly influenced the predictions were length of stay, number of hospitalizations, previous HNHC status, age, and presence of Top 20 HNHC diseases.
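The quantity underlying a SHAP value is the Shapley value, which averages a feature's marginal contribution over all subsets of the other features. The sketch below computes exact Shapley values by brute force for a toy model. It is not the procedure of the SHAP software, which relies on approximations; the function name, the convention of replacing absent features with baseline values, and the toy model are assumptions.

```python
from itertools import combinations
from math import factorial

def shapley_values(predict, x, baseline):
    """Exact Shapley values of each feature for one prediction.

    predict: function taking a feature list and returning a number.
    x: the instance to explain; baseline: reference feature values.
    """
    n = len(x)

    def value(subset):
        # Features outside the subset are set to their baseline value (an assumption).
        z = [x[i] if i in subset else baseline[i] for i in range(n)]
        return predict(z)

    phi = []
    for i in range(n):
        others = [j for j in range(n) if j != i]
        total = 0.0
        for size in range(n):
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                s = set(subset)
                total += weight * (value(s | {i}) - value(s))
        phi.append(total)
    return phi

# Toy linear model: the Shapley values recover each term's contribution.
toy_model = lambda z: 2 * z[0] + 3 * z[1] - z[2]
phi = shapley_values(toy_model, [1.0, 1.0, 1.0], [0.0, 0.0, 0.0])
print(phi)
```

The values sum to the difference between the prediction for the instance and the prediction for the baseline, which is the property that allows a summary plot to rank features by the average magnitude of their contributions.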

Fig. 4 Importance of each predictor in the MLP model.

Robustness analyses

The results of the robustness analyses are presented in Table 4. Using the five models, the study predicted HNHC status for 2023 from the relevant characteristics of 77,774 children hospitalized in 2022 and compared the predictions with actual outcomes to assess the discriminative ability of the models. Additional experiments using different train-test ratios (80:20 and 90:10) were also conducted to assess the robustness of the models.

Table 4 Robustness analyses of prediction ability for HNHC patients.

In the external validation dataset, the predictive performance of all models showed a decrease compared to the main results. However, the MLP model remained the best-performing model (ROC-AUC of 0.843, accuracy of 0.986, sensitivity of 0.122, specificity of 0.999, F1-score of 0.203). Additionally, the MLP model’s performance was significantly higher than that of the reference logistic regression model (ROC-AUC of 0.808). The external validation results indicate the robustness of the optimal model identified in the main analysis. The model trained using MLP demonstrates effective prediction of HNHC occurrence among hospitalized children.

After adjusting the train-test ratio of the internal validation data to 80:20 and 90:10, the predictive performance of all models remained robust and consistent with the main results. The MLP model continued to exhibit the best fit (ROC-AUC: 0.872 at the 80:20 ratio; 0.883 at the 90:10 ratio), significantly outperforming logistic regression (ROC-AUC: 0.826 at the 80:20 ratio; 0.807 at the 90:10 ratio). There was also no substantial change in the ranking of the models’ prediction ability at the 80:20 and 90:10 ratios, which remained the same as in the main result at the 70:30 ratio (MLP > RF > KNN > Naive Bayes > logistic regression). The alteration of the train-test ratio for the internal validation data had minimal impact on the primary results.

In the robustness analysis, additional metrics including RMSE, MSE, PLR, NLR, DOR, and MCC were incorporated across different train-test ratios (Table 5). Results indicate that both RMSE and MSE remain within acceptable ranges; aside from the Naïve Bayes model, the performance of the other models is consistent across all ratios. Moreover, favorable PLR and DOR values are observed—particularly under the primary 70:30 split, where high PLR values are recorded and the MLP model maintains optimal predictive performance.

Table 5 Robustness analyses of model performance metrics comparison across different train-test ratios.

Discussion

Using retrospective cohort data of hospitalized children in Shanghai, China, this study found that hospitalization costs for HNHC pediatric patients accounted for over 35% of total spending, consistent with findings from studies in other countries3,44. Machine learning models outperformed the logistic regression model in predicting HNHC patients for the following year, with the MLP model showing the highest predictive efficacy. The results of external validation from 2022–2023 data and the internal validation using different train-test ratios (80:20 and 90:10) corroborated the main findings, affirming MLP as the optimal model and demonstrating robustness. Taken together, this study provides foundational evidence for the status of HNHC children in China and fills a gap in the international literature on predictive modeling for HNHC children.

The study found significant differences between the HNHC and non-HNHC groups in key predictive factors based on the results describing both groups, consistent with previous studies32,44. HNHC children were more likely to be under 1 year old or over 13, female, and residents of Shanghai with dual insurance (URBMI + CHMF). They also showed higher rates of severe diseases and more frequent use of healthcare services, including surgeries, longer hospital stays, and more frequent annual hospitalizations. Additionally, HNHC children were more often hospitalized in tertiary hospitals in urban areas. At the same time, the top 20 most common diseases among HNHC children were identified, distinguishing the causes of HNHC in children from those in adults from a disease perspective18,20. The top 20 diseases identified in the study can be categorized into five groups, including common neonatal diseases, traumas and limb deformities, respiratory diseases, malignant neoplasms, and cardiovascular, cerebrovascular, and rare diseases. This categorization provides valuable insights for pediatric resource allocation in clinical settings and serves as a strong predictive factor for subsequent model analysis.

Numerous studies have sought to identify prospective HNHC adult patients using traditional methods such as logistic regression, which has been shown to have lower predictive power than machine-learning-based models18,19,20,21,22,43. In predictive models for HNHC patients, other studies have identified the following best-performing methods and their predictive efficacy (often measured by AUC): XGBoost models43,31 (AUC: 0.790 and 0.801, respectively), a Smooth Bayesian network model28 (AUC: 0.840), neural networks20 (AUC: 0.840), an RF model19 (AUC: 0.883), and LightGBM43 (AUC: 0.900). For different populations, datasets, and model constructions, the optimal type of model may vary. However, the AUC metric remains a reliable measure for assessing the predictive efficacy of different models within the same study or for cross-study comparisons19,20,22,26,28,31,43. Generally, an AUC greater than 0.8 indicates good predictive performance of the model40, and in this study, the MLP model achieved an AUC of 0.872, which is similar to the predictive performance observed in previous adult HNHC models. Additionally, when selecting machine learning models for large-scale datasets, understanding time complexity and computational efficiency is crucial, as these measure how fast or slow an algorithm will perform for a given input size45. Models such as logistic regression, Naive Bayes, and KNN exhibit linear time complexity with respect to the number of samples and features, making them computationally efficient for both training and prediction46,47. In contrast, models like RF and MLP often offer enhanced predictive performance but may require more computational resources due to their complexity and parameter tuning; whether the additional computational cost is justified depends on the application’s requirements48.

In the analysis of predictor importance, incorporating previous-year healthcare service utilization (length of stay and number of hospitalizations) and HNHC status was found to significantly enhance the predictive power of the model. Prior studies have shown strong correlations between healthcare costs and utilization in the previous year and those in the subsequent year8,20,49, which supports the notion that adding these factors substantially improves predictive performance. Based on the retrospective cohort spanning 2017 to 2023, the three-digit ICD-10 codes of HNHC patients were ranked by prevalence to identify the top 20 diseases. This process pinpointed the pediatric conditions posing the highest risk of HNHC and demonstrated substantial differences between pediatric and adult HNHC conditions18,19. Because the presence of a top 20 HNHC disease was entered into the forecasting model as a predictor and ranked among the most important predictors, these findings enable early patient classification and suggest that clinical practice should focus on these at-risk populations to optimize services and prevent the occurrence of high healthcare costs.

Moreover, few studies validate their constructed models using external datasets, and even prior studies have primarily conducted sensitivity analyses focused on defining HNHC populations, essentially using a single set of training data for analysis20. This study innovatively employed external data for 77,774 children from 2022–2023 to validate the robustness of the models, which confirmed the MLP model as optimal, with sustained high efficacy (ROC-AUC: 0.843), and remained consistent with the main result. However, the external validation results demonstrated a slight decline in performance across all models compared to the main results. This decrease might be attributed to several factors, including differences in patient demographics, clinical practices, and data collection methods between the training and external datasets, as well as temporal variability in healthcare practices and patient characteristics50. Also, a degree of overfitting to the internal data may have contributed to the reduced generalizability38. To improve robustness across diverse datasets, future studies should incorporate more heterogeneous external data from multiple institutions or regions, employ domain adaptation techniques to better align the training and external data distributions, and explore ensemble methods to reduce variance21. Based on the MLP model in this study, these strategies provide a pathway for refining predictive models and enhancing their applicability in real-world settings to analyze HNHC cases among hospitalized children in China.

The contributions of this study are reflected in two key aspects. From an academic perspective, this is the first study, to the best of current knowledge, to conduct HNHC prediction among pediatric populations in China, comparing multiple machine learning methods that outperformed traditional statistical models. This study found significant differences in key predictive factors, such as disease types, between children and adults in HNHC predictions, suggesting the need for specific factors when predicting HNHC pediatric patients18,19,20. The use of internal and external validation enhances the robustness of the results and provides a reference for future research to further improve predictive accuracy using methods such as MLP. The use of machine learning models for HNHC prediction also raises ethical concerns relating to data privacy and the potential for discrimination, and it is essential to address these issues by improving data transparency and adopting a feasible approach for data collection in the future.

From a policy and practical application perspective, this study provides a predictive model for the early identification of HNHC pediatric patients, which can help optimize and improve healthcare service models that have been attempted but yielded limited success9,10,12,13,15. In the future, the integration of the trained MLP model into clinical workflows can enable real-time identification and risk stratification of HNHC patients, facilitating early intervention and resource optimization. The model could be integrated with the hospital’s electronic medical record (EMR) system and the claims database as a decision-support tool, and refined through clinician feedback and continuous monitoring of its real-world utility. By pinpointing the optimal intervention window and initiating services early, the model contributes to preventing high healthcare costs.

This study has several limitations. Firstly, due to barriers to accessing medical data in China, it is difficult to link outpatient and inpatient data, and additional social, physiological, and biochemical information could not be obtained, which is a common data limitation in Chinese research. While this study endeavored to collect all factors that may in theory influence the prediction of HNHC and incorporated them into the predictive models to provide robust results within the available data scope, future research should investigate more predictors to improve predictive models of HNHC pediatric patients. Secondly, because machine-learning-based models rely on training data, the generalizability of this study’s results to HNHC children in other countries, to other populations (e.g., older adults), to different settings, or to other data sources should be treated with caution. This study used health insurance claims data for all hospitalized children in Shanghai, a major city in China, to reduce sampling bias and leverage the advantages of big data as far as possible. Future studies could further validate these findings and incorporate additional data sources to refine the model. Finally, due to the impact of COVID-19 lockdown policies, there may be some fluctuations in the 2021–2022 data. To address this issue, additional external validation was conducted using a dataset from 2022–2023, and the ranking of the efficacy of the models remained robust, indicating the feasibility of using the MLP model for prediction.

Conclusion

In summary, this study revealed that the top 5% of HNHC pediatric patients accounted for over 35% of total hospitalization costs. Using a large retrospective cohort of 91,882 children for training and 77,774 for external validation to develop the machine learning models, the MLP indicated the best predictive performance. Key predictors of HNHC pediatric patients were identified, including length of stay, number of hospitalizations, previous HNHC status, age, and presence of Top 20 HNHC diseases. These findings underscore the critical importance of the integration of predictive models into clinical practice and policymaking to realize early identification of HNHC pediatric populations for optimizing healthcare resource allocation, facilitating targeted interventions, and controlling healthcare costs. In future studies, real-world longitudinal data from multiple regions, additional predictive factors related to sociodemographic characteristics and health behaviors, and more ensemble methods of comprehensive machine learning algorithms should be employed to enhance the predictive performance and generalizability of the models.