Abstract
Artificial intelligence (AI) decision support systems in pediatric healthcare operate in a complex application context. Because an AI decision support system (AI-DSS) can be costly, it is crucial, once deployed, to measure its performance, interpret its success, and monitor and update it to sustain that success. Therefore, a set of evaluation indicators was developed specifically for AI-DSS in pediatric healthcare, enabling continuous and systematic performance monitoring. The study unfolded in two stages. The first stage established the evaluation indicator set through a literature review, a focus group interview, and expert consultation using the Delphi method. In the second stage, weight analysis was conducted: subjective weights were calculated from expert opinions through the analytic hierarchy process, objective weights were determined using the entropy weight method, and the two were then synthesized into combined weights. In the two rounds of expert consultation, the authority coefficients were 0.834 and 0.846, and Kendall's coordination coefficient was 0.135 in Round 1 and 0.312 in Round 2. The final evaluation indicator set comprises three first-class indicators, fifteen second-class indicators, and forty-seven third-class indicators. Indicator I-1 (Organizational performance) carries the highest weight, followed by Indicator I-2 (Societal performance) and Indicator I-3 (User experience performance) in the objective and combined weights. Conversely, 'Societal performance' holds the most weight among the subjective weights, followed by 'Organizational performance' and 'User experience performance'. In this study, a comprehensive and specialized set of evaluation indicators for AI-DSS in the pediatric outpatient clinic was established and then implemented. Continuous evaluation still requires long-term data collection to optimize the weight proportions of the established indicators.
Introduction
In April 2018, the US Food and Drug Administration approved the first artificial intelligence (AI) device to detect eye disease in adult patients with diabetes1. The emergence of AI-based technology has garnered significant attention across various domains, particularly in advancing clinical support systems, thereby reshaping medical care2. These advancements encompass a spectrum of AI applications, ranging from the development of preventive strategies3,4 and image-based diagnosis5,6,7 to improvements in nursing and therapy workflows8,9. In China, where pediatric outpatient clinics grapple with perpetual overcrowding, AI-DSS has emerged as a valuable tool, particularly evident during the COVID-19 epidemic10, aiding in expediting disease diagnosis and treatment and mitigating patient wait times11,12. However, the predominant focus on technical prowess during AI development often sidelines considerations of seamless integration into real-world workflows and the practical value of these innovations. Such oversights may only surface during clinical evaluations, as training datasets are meticulously curated to eliminate imperfect samples13. Moreover, AI-DSS in clinical settings can prove both costly and occasionally disruptive, highlighting the imperative to dynamically evaluate their efficacy post-adoption.
The rigorous evaluation of information systems' quality and impact was first proposed in the 1980s, encompassing every stage of the system lifecycle, from design and development to selection and utilization. Early publications focused on techniques for assessing effectiveness in controlled laboratory settings. Later papers delved into real-world testing in clinical environments, analyzing the impact of such systems on the structure, process, and outcomes of healthcare delivery14,15,16,17,18. With the proliferation of AI-DSS in clinical practice, more recent papers have aimed to assess their clinical impacts8,19,20,21,22.
Currently, various frameworks and guidelines exist for grouping indicators to evaluate AI system performance. The DeLone and McLean (D&M) Information System (IS) Success Model is a well-known framework for measuring the variables in IS research in healthcare settings23. It was proposed to conceptualize and operationalize IS success24. Modified and extended, it now comprises interrelated components of system quality, information quality, service quality, intention to use, user satisfaction, and net benefits25,26,27,28,29. Alternatively, indicators can be grouped by the appraised object: technology, task, user, organization, and environment30,31. The Guideline for Good Evaluation Practice in Health Informatics (GEP-HI) was developed to plan and conduct scientifically robust evaluation studies in healthcare32. Additionally, other guidelines were specifically designed to aid in reporting studies incorporating interventions with AI components33,34. These frameworks and guidelines primarily concentrate on reporting and regulatory aspects and offer limited guidance on assessing the application of AI systems. A later study proposed the Translational Evaluation of Healthcare AI (TEHAI) framework to guide the evaluation of AI systems integrated into clinical settings and provide a scoring matrix35. Yet TEHAI lacks consideration of pertinent stakeholders and detailed information within each component.
Remarkably, only a few of the indicators reported in published articles for evaluating AI systems in specialized healthcare clinics are guided by the existing models or frameworks above20,36; instead, the indicators were often self-constructed. Some focused on clinician impacts, some on patient impacts, and others on economic impacts, revealing a lack of systematic and comprehensive options37. AI-DSS in pediatric healthcare has a more complex application background, encompassing diverse clinical scenarios, treatment possibilities, and a wide age range from 0 to 18 years. The evaluation process needs to evolve from a one-time activity into a continuous process, ensuring effective AI-DSS utilization, considering all stakeholders, and providing a comprehensive set of indicators for practical evaluation. This study aims to develop a set of evaluation indicators tailored specifically for AI-DSS in pediatric healthcare, enabling continuous and systematic performance monitoring. It will be established through expert consensus and then integrated into the hospital information system of the Children's Hospital of Fudan University for a pilot implementation. We hypothesize that this incorporated evaluation indicator set will realize dynamic monitoring and significantly enhance AI governance within pediatric healthcare settings.
Methods
Study design
The study was conducted in two stages: (1) Generating the draft of the evaluation indicators via literature reviews and focus group interviews, followed by executing the Delphi study to establish the final indicator set; (2) Conducting weight analysis for the finalized indicators. The study was approved by the Research Ethics Board of Children's Hospital of Fudan University (No. 2022307A), and all methods were carried out in accordance with relevant guidelines and regulations. Informed consent was obtained from all participants.
Procedures, participants, and data collection
Literature review
A systematic literature review of studies on AI decision-making systems in healthcare was conducted first. We searched Ovid Medline, Ovid Embase, and the Cochrane Library for eligible English-language studies, and China National Knowledge Infrastructure (CNKI), Wanfang Data, and SinoMed for eligible Chinese-language studies published from 1990 to 2022. The following keywords or medical terms were used: artificial intelligence, machine learning, clinical decision support, evaluation, metrics, indicator, index, framework, implementation, application, performance, and success. Considering that AI systems deployed in healthcare sectors are generally based on hospital information systems, we also searched literature related to evaluation indicators for information systems. The inclusion criteria were: (1) language: English or Chinese; (2) application scenario: healthcare sectors; (3) content: application evaluation for AI systems or hospital information systems. The exclusion criteria included conference abstracts and articles lacking accessible full text.
Constructing the draft evaluation indicator set
After the literature review, a focus group interview was conducted to gather further input on evaluation indicators for AI-DSS in healthcare sectors. Eleven participants, actively involved in AI-DSS development research or its utilization, were invited to attend a virtual interview session via the Tengxun app. They hailed from four tertiary children's hospitals and one university in Shanghai Municipality, Anhui Province, Jiangsu Province, and Shandong Province. Among them were four physicians, two nurses, two radiologists, one pharmacist, and two AI professors. We encouraged them to share their insights on the following topics: (1) What is your interpretation of a successful AI-DSS in pediatric outpatient clinics? (2) What criteria do you consider essential for evaluating AI-DSS in pediatric outpatient clinics?
Eventually, relevant content from the literature and interviews was extracted, deduplicated, and synthesized into indicators by two independent members of the research team. These indicators were subsequently categorized based on the Society-Management-Application-Result-Technology (SMART) model. This model, rooted in recent theoretical and empirical AI implementation studies, was developed through expert consensus within a panel of national AI experts convened by the Shanghai AI Laboratory. The expert panel agreed that evaluating an AI system should encompass three primary dimensions: societal performance (referred to as society), organizational performance (referred to as management), and user experience performance (referred to as application). Within each dimension, various components of technology utility should be assessed (referred to as technology), and the outcomes of these components should be comprehensively analyzed and documented (referred to as result). Significant emphasis was placed on tailoring the evaluation indicator set to suit a specialized application scenario. Further details can be found at https://www.shlab.org.cn/. Consequently, the '1–3–5 evaluation indicator set' was developed for the subsequent Delphi study, where '1' signifies the pediatric outpatient clinic, '3' represents the three evaluation dimensions, and '5' denotes the five components of technology utility.
Delphi study
The Delphi methodology was employed to develop the 1–3–5 evaluation indicator set. Using purposive sampling, experts were recruited from members of the Chinese Medical Information and Big Data Association (CHMIA) Pediatric Committee. The CHMIA is a national-level association under the supervision of the National Health Commission of the People's Republic of China, dedicated to advancing healthcare informatics and big data analytics and to fostering collaboration, innovation, and policy advocacy in the fields of health information technology and data science. The expert inclusion criteria were as follows: (a) experience in either using or designing AI systems in pediatric healthcare sectors; (b) more than 10 years of working experience; and (c) high interest in and willingness to participate in this study.
The Delphi study questionnaire was structured into four sections: (I) introduction of the study and instructions for completing the questionnaire; (II) demographic information of the expert, including age, gender, training experience, profession, and academic title; (III) the draft 1–3–5 evaluation indicator set, wherein the experts were asked to rate the importance of each category and indicator on a 5-point Likert scale ranging from 'Not at all important (1 point)' to 'Extremely important (5 points)'; additionally, experts could propose additions, deletions, or modifications; and (IV) the experts' familiarity with the study and their judgment basis for the indicators. A total of two rounds of Delphi consultation were conducted. Experts were asked to complete the questionnaires and return them within 2 weeks. In the first round, paper questionnaires were distributed to the expert panel during the CHMIA annual committee meeting on March 4, 2023, and mail services were arranged for each expert to facilitate questionnaire returns. All Round 1 questionnaires were received by March 10, 2023. Round 2 took place from March 13 to March 26, 2023, with all questionnaires distributed and returned via email.
After collecting data from each round, the study team screened out indicators with a mean importance score ≤ 3.5, a coefficient of variation ≥ 0.25, or a full-mark rate ≤ 20%38, while also analyzing experts' opinions. The results were then shared with the experts after each round.
Determining the index weights
Analytic hierarchy process
The importance score rated for each indicator by each expert was compiled from the final consultation round. A judgment matrix was then established for pairwise comparisons, with importance ratios determined using Saaty's 9-point scale39. The weight vector was computed using yaahp 10.1 software (https://www.metadecsn.com/yaahp/).
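For reference, the textbook AHP computation that such software implements can be summarized as follows (the precise algorithmic details of yaahp are an assumption here): given an n × n judgment matrix \(A = (a_{ij})\), with \(a_{ij}\) the importance ratio of indicator i to indicator j and \(a_{ji} = 1/a_{ij}\), the subjective weight vector is the normalized principal eigenvector, and consistency is checked via

$$A\mathbf{w} = \lambda_{\max}\mathbf{w}, \qquad CI = \frac{\lambda_{\max} - n}{n - 1}, \qquad CR = \frac{CI}{RI},$$

where RI is the random consistency index for a matrix of order n; a judgment matrix is conventionally accepted when \(CR < 0.10\).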
Entropy weight method
The data sources for collecting the information needed to calculate the evaluation indicators were identified, relevant data fields from the hospital information system were mapped, and data extraction mechanisms were developed. Data were then summarized quarterly at the Children's Hospital of Fudan University, specifically on March 31, 2023, June 30, 2023, and September 28, 2023. With m quarters and n indicators, the values were normalized to obtain the data matrix, where \({X}_{ij}\) is the original value of indicator j in quarter i.
As to forward indicators:
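A standard min-max normalization for forward (benefit-type) indicators, assumed here to match the form used in the study, is

$$x_{ij} = \frac{X_{ij} - \min_{i} X_{ij}}{\max_{i} X_{ij} - \min_{i} X_{ij}}.$$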
As to inverted indicators:
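The corresponding standard form assumed for inverted (cost-type) indicators is

$$x_{ij} = \frac{\max_{i} X_{ij} - X_{ij}}{\max_{i} X_{ij} - \min_{i} X_{ij}}.$$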
The entropy value of indicator j was calculated as follows:
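Assuming the standard entropy weight formulation (with the convention \(p_{ij}\ln p_{ij} = 0\) when \(p_{ij} = 0\)):

$$e_j = -\frac{1}{\ln m}\sum_{i=1}^{m} p_{ij}\ln p_{ij}, \qquad p_{ij} = \frac{x_{ij}}{\sum_{i=1}^{m} x_{ij}}.$$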
The difference coefficient of indicator j was calculated as follows:
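In the standard formulation this is

$$d_j = 1 - e_j.$$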
The weight of indicator j was calculated as follows:
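In the standard formulation, the objective weight is obtained by normalizing the difference coefficients:

$$\beta_j = \frac{d_j}{\sum_{k=1}^{n} d_k}.$$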
Combination weight method
Assume that the weight vector calculated by the analytic hierarchy process is α = (α1, α2, ..., αn) and that the weight vector calculated by the entropy weight method is β = (β1, β2, ..., βn). The combined weights were then calculated using the geometric average method, with the following formula40:
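The geometric average combination conventionally takes the form (assumed here to be the formula referenced above):

$$w_j = \frac{\sqrt{\alpha_j \beta_j}}{\sum_{k=1}^{n}\sqrt{\alpha_k \beta_k}}, \qquad j = 1, 2, \ldots, n.$$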
The entropy weight method and the weight combination were both implemented in Python; the detailed Python code can be found in Supplement A.
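As an illustration of these steps, the following is a minimal Python sketch of the entropy weight calculation and the geometric-average combination. The function names, the handling of zero-range columns, and the example values are our own assumptions; the authors' implementation is the one provided in Supplement A.

```python
import numpy as np

def entropy_weights(X, is_benefit):
    """Objective weights via the entropy weight method.

    X: (m quarters x n indicators) array of raw indicator values.
    is_benefit: length-n boolean array; True for forward (benefit-type)
    indicators, False for inverted (cost-type) indicators.
    Assumes every indicator varies across quarters (no constant columns).
    """
    X = np.asarray(X, dtype=float)
    mins, maxs = X.min(axis=0), X.max(axis=0)
    rng = np.where(maxs > mins, maxs - mins, 1.0)        # guard against zero range
    norm = np.where(is_benefit, (X - mins) / rng, (maxs - X) / rng)
    p = norm / norm.sum(axis=0, keepdims=True)           # proportion per quarter
    with np.errstate(divide="ignore", invalid="ignore"):
        plogp = np.where(p > 0, p * np.log(p), 0.0)      # convention: 0 * ln(0) = 0
    e = -plogp.sum(axis=0) / np.log(X.shape[0])          # entropy of each indicator
    d = 1.0 - e                                          # difference coefficients
    return d / d.sum()                                   # normalized objective weights

def combine_weights(alpha, beta):
    """Combined weights from subjective (AHP) and objective (entropy) weights,
    using the geometric average method."""
    w = np.sqrt(np.asarray(alpha, dtype=float) * np.asarray(beta, dtype=float))
    return w / w.sum()

# Illustrative dummy data only: 3 quarters x 4 indicators.
X = np.array([[0.82, 120, 4.1, 0.30],
              [0.85, 110, 4.3, 0.28],
              [0.88, 105, 4.2, 0.25]])
is_benefit = np.array([True, False, True, False])        # e.g., waiting time is cost-type
beta = entropy_weights(X, is_benefit)
alpha = np.array([0.4, 0.3, 0.2, 0.1])                   # hypothetical AHP weights
print(combine_weights(alpha, beta))
```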
Statistical analysis
SPSS 25.0 statistical software (https://www.ibm.com/support/pages/downloading-ibm-spss-statistics-25), Python 3.11, and yaahp 10.1 (https://www.metadecsn.com/yaahp/) were used to analyze the data. Measurement data were described as mean ± standard deviation, and count data as frequency and percentage. The experts' enthusiasm was quantified through the effective recovery rate of the questionnaire, while the expert authority coefficient was determined as the mean of the judgment coefficient and the familiarity coefficient. The degree of coordination among expert opinions was assessed using Kendall's coordination coefficient. A minimum recovery rate of 80 percent is indicative of a valid result. An expert authority coefficient exceeding 0.70 suggests high expertise among the experts, lending credibility to the inquiry results. Kendall's coordination coefficient should first be statistically significant (P < 0.05); its values range from 0 (no agreement) to 1 (complete agreement), where higher values denote stronger agreement and values below 0.2 suggest poor agreement. A combination of the subjective weighting method (the analytic hierarchy process, AHP) and the objective weighting method (the entropy weight method, EWM) was used to calculate the weights of the 1–3–5 evaluation indicator set and determine the impact of the assessment indicators, using yaahp 10.1 and Python 3.11. Details can be found in the supplementary material.
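For reference, the standard definition of Kendall's coordination coefficient W for m experts ranking n indicators (the correction for tied ranks, which Likert-type ratings typically require, is omitted here) is

$$W = \frac{12\sum_{j=1}^{n}\left(R_j - \frac{m(n+1)}{2}\right)^{2}}{m^{2}\left(n^{3} - n\right)},$$

where \(R_j\) is the sum of the ranks assigned to indicator j across experts.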
Results
Sample characteristics
Fifty-two invitees agreed to participate in the Delphi consultations, with a 100.0% response rate for the Round 1 questionnaires and 92.0% for Round 2. Three experts dropped out due to non-response within two weeks. All returned questionnaires were completed. The participating experts came from eight municipalities and provinces in China: Shanghai, Tianjin, Jiangsu Province, Henan Province, Shandong Province, Hubei Province, Anhui Province, and Zhejiang Province. Demographic characteristics of the experts are shown in Table 1.
Expert authority coefficient and the degree of opinion coordination
In the two rounds of expert consultation, the authority coefficients were 0.834 and 0.846, respectively, both meeting the criterion of an authority coefficient > 0.7. Kendall's coordination coefficient was 0.135 in Round 1 and 0.312 in Round 2, and both Kendall's tests were statistically significant (p < 0.001).
1–3–5 Evaluation indicator set and corresponding weights
The preliminary indicator set included three first-class indicators and fifteen second-class indicators, constructed using the SMART model, along with fifty-six third-class indicators derived from the literature review and the focus group interview, following deliberation among the research team. In Round 1, no changes were made to the first- and second-class indicators. However, five third-class indicators, namely 'coverage rate of disease spectrum', 'number of jobs displaced by AI-DSS', 'public comprehension of AI-DSS', 'daily average patient visits per clinical doctor', and 'daily average outpatient and emergency room visits per hospital', were directly eliminated based on the exclusion criteria. In comparing 'time expenditure for AI-DSS operation training' with 'training frequency for AI-DSS utilization', the latter was deemed a more reasonable indicator for reflecting the accessibility of user experience performance. Consequently, 'time expenditure for system operation training' was eliminated and 'training frequency for AI-DSS utilization' was added based on experts' opinions and research team discussion. Similarly, in comparing 'number of granted invention patents', 'number of granted utility model patents', 'number of granted software copyrights', 'number of medical device certificates obtained', and 'quantities of embedded algorithms' with 'extent of process reengineering for improved healthcare services via AI-DSS', the latter was considered more representative of innovation in societal performance. Thus, 'extent of process reengineering for improved healthcare services via AI-DSS' was selected to replace the former five indicators following experts' opinions and research team discussion. Furthermore, seven third-class indicators underwent wording revisions. In Round 2, only three third-class indicators underwent wording revisions, and there were no further deletions from or additions to the indicator set.
Ultimately, a 1–3–5 evaluation indicator set comprising three first-class, fifteen second-class, and forty-seven third-class indicators was established. As depicted in Table 2, the seventeen third-class indicators highlighted in bold font are derived from data collected through self-reports of AI-DSS users, leadership, and the general public, using an electronic 5-point Likert scale. The remaining third-class indicators are calculated from data obtained directly from the hospital information system.
As detailed in Table 3, all the indicators included in Round 2 exhibited mean importance scores exceeding 3.5, coefficients of variation below 0.25, and full-mark rates exceeding 20%. In terms of weight, under the subjective weighting method societal performance carried the highest weight (0.539), with organizational performance (0.164) and user experience performance (0.297) weighted lower. Under the objective weighting method, however, organizational performance held the highest weight (0.690), followed by societal performance (0.302), with user experience performance (0.008) having the smallest weight. Similarly, in the combined weighting method, organizational performance retained the highest weight (0.543), followed by societal performance (0.391) and user experience performance (0.067). After rounding to three decimal places, many values of 0.000 emerged in the objective weights column, indicating minimal quarterly variation in both the data collected from self-reports of AI-DSS users and the data derived directly from the hospital information system.
Discussion
The number of publications describing the applications of AI systems in healthcare settings has snowballed over the past decade; the majority solely report on their effects41, and the actual impact when deploying AI in clinical settings remains largely unknown42. At the same time, evidence supporting clinical effectiveness, comparative effectiveness, cost-effectiveness, or other formal health technology assessment of AI in a clinical healthcare setting appears to be limited43. A recent systematic review highlighted a key reason for the lack of adoption of AI systems: the absence of demonstrated benefits. The risk lies in adopting AI systems with insufficient evidence, potentially leading to the incorporation of non-valuable or inadequately supported systems44.
In response to these challenges, researchers have developed multiple models and frameworks aimed at elucidating the elements contributing to successful AI systems. These model- and framework-guided tools play an essential role in evaluating AI systems45. However, these models have limitations due to their specific focus: some center on assessing AI algorithms in controlled laboratory settings46, while others aim to offer holistic metrics for evaluating user experience in clinical settings36,37. AI systems evaluation should not only start early in the development process but also be continuous and comprehensive, considering the ethical implications introduced by AI. This comprehensive approach is often lacking in current evaluation frameworks.
In this study, a comprehensive set of evaluation indicators, specifically tailored for AI-DSS implementation in pediatric outpatient clinics, was developed using the Delphi method. The draft was based on an established model specific to AI system performance evaluation and on information extracted from relevant published studies. The 49 consulting experts came from 8 provinces and cities across China, and all were experienced in implementing AI systems in pediatrics. The set of evaluation indicators has a strong foundation in the pediatric outpatient clinic and emphasizes measurable results for the technology application across three dimensions. Compared with TEHAI35, this set is tailored explicitly for pediatric outpatient clinics and offers a more extensive range of detailed indicators to interpret the success of implementing AI systems. Other reported indicators for evaluating specialized healthcare clinics were often self-constructed and did not align with models or frameworks8,9,20,37.
Regarding weight analysis, previous studies have primarily relied on experts' opinions (subjective weights) obtained through the AHP method47,48. This study, however, employed real-world data captured directly from the hospital information system to derive objective weights through the entropy weight method, and then synthesized the subjective and objective weights into combined weights. Incorporating the evaluation indicator set into the hospital information system helps realize active surveillance of the effect of AI-DSS in the pediatric outpatient clinic. The study revealed that the combined weights align much more closely with the objective weights. Indicator I-1 (Organizational performance) carries the highest weight, followed by Indicator I-2 (Societal performance) and Indicator I-3 (User experience performance), in both the objective and combined weights. Conversely, among the subjective weights, 'Societal performance' holds the most weight, followed by 'Organizational performance' and 'User experience performance'. The reason for this discrepancy may stem from the entropy weight method used to calculate the objective weights: according to its algorithm, an indicator whose collected data change only slightly during the observation period receives an extremely low weight. Indicators III-1 ~ III-4, III-6 ~ III-8, III-19, III-23 ~ III-34, III-36 ~ III-37, III-39 ~ III-42, and III-44 ~ III-47 exhibited this behavior. Their values changed little over the observation period, offering minimal contribution to reflecting the dynamic performance of the AI system and suggesting a need for prolonged observation or more frequent data collection.
Limitations of the study
Firstly, the Delphi method, being a structured and iterative forecasting technique, depends on the input of a panel of experts. The second round of consultation, conducted via email, lacked real-time discussion, potentially limiting the depth of analysis and the exploration of alternative viewpoints. Secondly, the subjective weight analysis relies heavily on the expertise and judgment of the participants; experts' inherent biases, insufficient information, or professional interests can introduce errors or inaccuracies into the forecasting process, potentially affecting the validity of the results. Thirdly, the entropy weight method requires long-term observation or more frequent data collection; data collected quarterly, at only four time points, may be subject to bias. Additionally, further exploration of the entropy weight method could involve assessing the impact of different data collection frequencies on the validity of results. Conducting comparative studies with varied observation periods, such as monthly or biannual collection, could provide insights into the optimal data collection frequency for minimizing bias and enhancing the robustness of the analysis.
Conclusion
Monitoring and evaluating AI systems in pediatric outpatient clinics necessitates a fitting set of evaluation indicators. The 1–3–5 evaluation indicator set, owing to its comprehensive nature and successful integration into hospital information systems, enables the automatic acquisition of real-world data. This allows for the continuous evaluation and monitoring of AI system performance in pediatric outpatient clinics. Future efforts should focus on enhancing long-term data collection for the indicators to optimize their weight proportions.
Data availability
Correspondence and requests for materials should be addressed to X.B.Z.
References
U.S. Food and Drug Administration. FDA permits marketing of artificial intelligence-based device to detect certain diabetes-related eye problems: FDA news release. https://www.fda.gov/newsevents/newsroom/pressannouncements/ucm604357.htm. Published April 11, 2018. Accessed 17 April 2024.
He, J. et al. The practical implementation of artificial intelligence technologies in medicine. Nat. Med. 25, 30–36. https://doi.org/10.1038/s41591-018-0307-0 (2019).
Sugano, K., Moss, S. F. & Kuipers, E. J. Gastric intestinal metaplasia: Real culprit or innocent bystander as a precancerous condition for gastric cancer?. Gastroenterology 165, 1352-1366.e1351. https://doi.org/10.1053/j.gastro.2023.08.028 (2023).
Vishwanathaiah, S., Fageeh, H. N., Khanagar, S. B. & Maganur, P. C. Artificial intelligence its uses and application in pediatric dentistry: A review. Biomedicines 11, 788. https://doi.org/10.3390/biomedicines11030788 (2023).
Liang, H. et al. Evaluation and accurate diagnoses of pediatric diseases using artificial intelligence. Nat. Med. 25, 433–438. https://doi.org/10.1038/s41591-018-0335-9 (2019).
Dillman, J. R., Somasundaram, E., Brady, S. L. & He, L. Current and emerging artificial intelligence applications for pediatric abdominal imaging. Pediatr. Radiol. 52, 2139–2148. https://doi.org/10.1007/s00247-021-05057-0 (2022).
Balkenende, L., Teuwen, J. & Mann, R. M. Application of deep learning in breast cancer imaging. Semin. Nucl. Med. 52, 584–596. https://doi.org/10.1053/j.semnuclmed.2022.02.003 (2022).
Radici, L. et al. Implementation of a commercial deep learning-based auto segmentation software in radiotherapy: Evaluation of effectiveness and impact on workflow. Life (Basel) 12, 2088. https://doi.org/10.3390/life12122088 (2022).
Seibert, K. et al. Application scenarios for artificial intelligence in nursing care: Rapid review. J. Med. Internet Res. 23, e26522. https://doi.org/10.2196/26522 (2021).
Gu, Y. et al. Effective multidimensional approach for practical management of the emergency department in a COVID-19 designated children’s hospital in east China during the Omicron pandemic: A cross-sectional study. Transl. Pediatr. 12, 113–124. https://doi.org/10.21037/tp-22-314 (2023).
Li, W. H. et al. Artificial intelligence promotes shared decision-making through recommending tests to febrile pediatric outpatients. World J. Emerg. Med. 14, 106–111. https://doi.org/10.5847/wjem.j.1920-8642.2023.033 (2023).
Li, X. et al. Artificial intelligence-assisted reduction in patients’ waiting time for outpatient process: A retrospective cohort study. BMC Health Serv. Res. 21, 237. https://doi.org/10.1186/s12913-021-06248-z (2021).
Nsoesie, E. O. Evaluating artificial intelligence applications in clinical settings. JAMA Netw. Open 1, e182658. https://doi.org/10.1001/jamanetworkopen.2018.2658 (2018).
Yu, V. L. et al. Evaluating the performance of a computer-based consultant. Comput. Programs Biomed. 9, 95–102. https://doi.org/10.1016/0010-468x(79)90022-9 (1979).
Wyatt, J. & Spiegelhalter, D. Evaluating medical expert systems: What to test and how?. Med. Inform. (Lond.) 15, 205–217. https://doi.org/10.3109/14639239009025268 (1990).
Nykänen, P., Chowdhury, S. & Wigertz, O. Evaluation of decision support systems in medicine. Comput. Methods Programs Biomed. 34, 229–238. https://doi.org/10.1016/0169-2607(91)90047-w (1991).
Clarke, K. et al. A methodology for evaluation of knowledge-based systems in medicine. Artif. Intell. Med. 6, 107–121. https://doi.org/10.1016/0933-3657(94)90040-x (1994).
van Gennip, E. M., Talmon, J. L. & Bakker, A. R. ATIM, accompanying measure on the assessment of information technology in medicine. Comput. Methods Programs Biomed. 45, 5–8. https://doi.org/10.1016/0169-2607(94)90005-1 (1994).
Kanagasingam, Y. et al. Evaluation of artificial intelligence-based grading of diabetic retinopathy in primary care. JAMA Netw. Open 1, e182665. https://doi.org/10.1001/jamanetworkopen.2018.2665 (2018).
Ciecierski-Holmes, T., Singh, R., Axt, M., Brenner, S. & Barteit, S. Artificial intelligence for strengthening healthcare systems in low- and middle-income countries: a systematic scoping review. NPJ Digit. Med. 5, 162. https://doi.org/10.1038/s41746-022-00700-y (2022).
Scott, I., Carter, S. & Coiera, E. Clinician checklist for assessing suitability of machine learning applications in healthcare. BMJ Health Care Inform. 28, e100251. https://doi.org/10.1136/bmjhci-2020-100251 (2021).
Cabitza, F. & Campagner, A. The need to separate the wheat from the chaff in medical informatics: Introducing a comprehensive checklist for the (self)-assessment of medical AI studies. Int. J. Med. Inform. 153, 104510. https://doi.org/10.1016/j.ijmedinf.2021.104510 (2021).
DeLone, W. H. & McLean, E. R. The DeLone and McLean model of information systems success: A ten-year update. J. Manag. Inform. Syst. 19, 9–30. https://doi.org/10.1080/07421222.2003.11045748 (2003).
DeLone, W. H. & McLean, E. R. Information systems success: The quest for the dependent variable. Inf. Syst. Res. 3, 60–95. https://doi.org/10.1287/isre.3.1.60 (1992).
Nguyen, L., Bellucci, E. & Nguyen, L. T. Electronic health records implementation: An evaluation of information system impact and contingency factors. Int. J. Med. Inform. 83, 779–796. https://doi.org/10.1016/j.ijmedinf.2014.06.011 (2014).
Shim, M. & Jo, H. S. What quality factors matter in enhancing the perceived benefits of online health information sites? Application of the updated DeLone and McLean information systems success model. Int. J. Med. Inform. 137, 104093. https://doi.org/10.1016/j.ijmedinf.2020.104093 (2020).
Bossen, C., Jensen, L. G. & Udsen, F. W. Evaluation of a comprehensive EHR based on the DeLone and McLean model for IS success: Approach, results, and success factors. Int. J. Med. Inform. 82, 940–953. https://doi.org/10.1016/j.ijmedinf.2013.05.010 (2013).
Cho, K. W. et al. Performance evaluation of public hospital information systems by the information system success model. Healthc. Inform. Res. 21, 43–48. https://doi.org/10.4258/hir.2015.21.1.43 (2015).
Yang, M. H. et al. A comparison of two cross-sectional studies on successful model of introducing nursing information system in a regional teaching hospital in Taiwan. Comput. Inform. Nurs. 40, 571–579. https://doi.org/10.1097/cin.0000000000000818 (2022).
Salahuddin, L. & Ismail, Z. Classification of antecedents towards safety use of health information technology: A systematic review. Int. J. Med. Inform. 84, 877–891. https://doi.org/10.1016/j.ijmedinf.2015.07.004 (2015).
Tubaishat, A. Evaluation of electronic health record implementation in hospitals. Comput. Inform. Nurs. 35, 364–372. https://doi.org/10.1097/cin.0000000000000328 (2017).
Nykänen, P. et al. Guideline for good evaluation practice in health informatics (GEP-HI). Int. J. Med. Inform. 80, 815–827. https://doi.org/10.1016/j.ijmedinf.2011.08.004 (2011).
Liu, X., Cruz Rivera, S., Moher, D., Calvert, M. J. & Denniston, A. K. Reporting guidelines for clinical trial reports for interventions involving artificial intelligence: The CONSORT-AI extension. Lancet Digit. Health 2, e537–e548. https://doi.org/10.1016/s2589-7500(20)30218-1 (2020).
Rivera, S. C., Liu, X., Chan, A. W., Denniston, A. K. & Calvert, M. J. Guidelines for clinical trial protocols for interventions involving artificial intelligence: The SPIRIT-AI extension. Bmj 370, m3210. https://doi.org/10.1136/bmj.m3210 (2020).
Reddy, S. et al. Evaluation framework to guide implementation of AI systems into healthcare settings. BMJ Health Care Inform. 28, e100444. https://doi.org/10.1136/bmjhci-2021-100444 (2021).
Parasa, S. et al. Framework and metrics for the clinical use and implementation of artificial intelligence algorithms into endoscopy practice: Recommendations from the American Society for Gastrointestinal Endoscopy Artificial Intelligence Task Force. Gastrointest. Endosc. 97, 815-824.e811. https://doi.org/10.1016/j.gie.2022.10.016 (2023).
Yin, J., Ngiam, K. Y. & Teo, H. H. Role of artificial intelligence applications in real-life clinical practice: Systematic review. J. Med. Internet Res. 23, e25759. https://doi.org/10.2196/25759 (2021).
Diamond, I. R. et al. Defining consensus: A systematic review recommends methodologic criteria for reporting of Delphi studies. J. Clin. Epidemiol. 67, 401–409. https://doi.org/10.1016/j.jclinepi.2013.12.002 (2014).
Saaty, T. L. & Bennett, J. P. A theory of analytical hierarchies applied to political candidacy. Behav. Sci. 22, 237–245. https://doi.org/10.1002/bs.3830220402 (1977).
Le-hong, K., Lin-rong, X. & Bao-chen, L. 2006 International Conference on Computational Intelligence and Security. 963–967.
Gore, J. C. Artificial intelligence in medical imaging. Magn. Reson. Imaging 68, A1-a4. https://doi.org/10.1016/j.mri.2019.12.006 (2020).
Zhou, Q., Chen, Z. H., Cao, Y. H. & Peng, S. Clinical impact and quality of randomized controlled trials involving interventions evaluating artificial intelligence prediction tools: a systematic review. NPJ Digit. Med. 4, 154. https://doi.org/10.1038/s41746-021-00524-2 (2021).
Wiegand T, L. N., Pujari S, et al. White paper for the ITU/WHO Focus Group on artificial intelligence for health. ITU. https://www.itu.int/go/fgai4h. Accessed 24 May 2021.
Voets, M. M., Veltman, J., Slump, C. H., Siesling, S. & Koffijberg, H. Systematic review of health economic evaluations focused on artificial intelligence in healthcare: The tortoise and the cheetah. Value Health 25, 340–349. https://doi.org/10.1016/j.jval.2021.11.1362 (2022).
Urbach, N., Smolnik, S. & Riempp, G. The state of research on information systems success. Bus. Inf. Syst. Eng. 1, 315–325. https://doi.org/10.1007/s12599-009-0059-y (2009).
Larson, D. B. et al. Regulatory frameworks for development and evaluation of artificial intelligence-based diagnostic imaging algorithms: Summary and recommendations. J. Am. Coll. Radiol. 18, 413–424. https://doi.org/10.1016/j.jacr.2020.09.060 (2021).
Ni, X. et al. Development of an evaluation indicator system for the rational use of proton pump inhibitors in pediatric intensive care units: An application of Delphi method. Medicine (Baltimore) 100, e26327. https://doi.org/10.1097/md.0000000000026327 (2021).
Wei, J. et al. Construction on teaching quality evaluation indicator system of multi-disciplinary team (MDT) clinical nursing practice in China: A Delphi study. Nurse Educ. Pract. 64, 103452. https://doi.org/10.1016/j.nepr.2022.103452 (2022).
Acknowledgements
We wish to thank all healthcare professionals who participated in the data collection in the process of subject weight analysis. We would like to thank Rui Wang for her help in polishing our paper.
Funding
This work was funded by Science and Technology Commission of Shanghai Municipality (21511104502); National Key R&D Program of China (2021ZD0113504); 2021 Artificial Intelligence Technology Support Special Directional Project (The second batch) (21002411800), Shanghai Municipal Hospital Pediatric Specialist Alliance (SHDC22021305-A, SHDC22021305-B).
Author information
Authors and Affiliations
Contributions
Y.W.W., W.J.F., X.B.Z., Y.J.Z., and R. F. contributed to conception, design, manuscript drafting; H. X. and Y.G. contributed to provision of study materials; X.L.G., C.J. Y, L.S. and W.H. contributed to collection and assembly of data; D.Y.W., W.B.W., J.W. F, J.Y.W. contributed to data analysis and interpretation. All authors reviewed the manuscript, approved the final version, and agreed on all aspects of the work.
Corresponding authors
Ethics declarations
Competing interests
The authors declare no competing interests.
Additional information
Publisher's note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Supplementary Information
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
Wang, Y., Fu, W., Zhang, Y. et al. Constructing and implementing a performance evaluation indicator set for artificial intelligence decision support systems in pediatric outpatient clinics: an observational study. Sci Rep 14, 14482 (2024). https://doi.org/10.1038/s41598-024-64893-w