Introduction

In April 2018, the US Food and Drug Administration approved the first artificial intelligence (AI) device to detect eye disease in adult patients with diabetes1. The emergence of AI-based technology has garnered significant attention across various domains, particularly in advancing clinical decision support systems, thereby reshaping medical care2. These advancements span a spectrum of AI applications, ranging from the development of preventive strategies3,4 and image-based diagnosis5,6,7 to the improvement of nursing and therapy workflows8,9. In China, where pediatric outpatient clinics grapple with perpetual overcrowding, AI-based decision support systems (AI-DSS) have emerged as a valuable tool, particularly evident during the COVID-19 epidemic10, expediting disease diagnosis and treatment and mitigating patient wait times11,12. However, the predominant focus on technical prowess during AI development often sidelines considerations of seamless integration into real-world workflows and the practical value of these innovations. Such oversights may only surface during clinical evaluations, as training datasets are meticulously curated to eliminate imperfect samples13. Moreover, AI-DSS in clinical settings can prove both costly and occasionally disruptive, highlighting the imperative to dynamically evaluate their efficacy post-adoption.

The rigorous evaluation of information systems' quality and impact was first proposed in the 1980s, encompassing every stage of the system lifecycle, from design and development to selection and utilization. Early publications focused on techniques for assessing effectiveness in controlled laboratory settings. Subsequent papers delved into real-world testing in clinical environments, analyzing the impact on the structure, process, and outcome of healthcare delivery14,15,16,17,18. With the proliferation of AI-DSS in clinical practice, more recent papers have aimed to assess their clinical impact8,19,20,21,22.

Currently, various frameworks and guidelines exist for grouping indicators to evaluate AI system performance. The DeLone and McLean (D&M) Information System (IS) Success Model is a well-known framework for measuring the variables in IS research in healthcare settings23. It was proposed to conceptualize and operationalize IS success24. Modified and extended, it now comprises the interrelated components of system quality, information quality, service quality, intention to use, user satisfaction, and net benefits25,26,27,28,29. Alternatively, indicators can be characterized by the appraised object as technology, task, user, organization, and environment30,31. The Guideline for Good Evaluation Practice in Health Informatics (GEP-HI) was developed to plan and conduct scientifically robust evaluation studies in healthcare32. Additionally, other guidelines were specifically designed to aid in reporting studies incorporating interventions with AI components33,34. These frameworks and guidelines primarily concentrate on reporting and regulatory aspects and offer limited guidance on assessing the application of AI systems. A later study proposed the Translational Evaluation of Healthcare AI (TEHAI) framework to guide the evaluation of AI systems integrated into clinical settings and to provide a scoring matrix35. Yet TEHAI lacks consideration of pertinent stakeholders and detailed information within each component.

Remarkably, only a few of the indicators reported in published articles for evaluating AI systems in specialized healthcare clinics are guided by the existing models or frameworks above20,36; instead, the indicators were often self-developed. Some focused on clinician impacts, some on patient impacts, and others on economic impacts, revealing a lack of systematic and comprehensive options37. AI-DSS in pediatric healthcare face a more complex application background, encompassing diverse clinical scenarios, treatment possibilities, and a wide age range from 0 to 18 years. The evaluation process needs to evolve from a one-time activity into a continuous process, ensuring effective AI-DSS utilization, considering all stakeholders, and providing a comprehensive set of indicators for practical evaluation. This study aims to develop a set of evaluation indicators tailored specifically for AI-DSS in pediatric healthcare, enabling continuous and systematic performance monitoring. The set was established through expert consensus and then integrated into the hospital information system of the Children's Hospital of Fudan University for a pilot implementation. We hypothesize that this incorporated evaluation indicator set will realize dynamic monitoring and significantly enhance AI governance within pediatric healthcare settings.

Methods

Study design

The study was conducted in two stages: (1) Generating the draft of the evaluation indicators via literature reviews and focus group interviews, followed by executing the Delphi study to establish the final indicator set; (2) Conducting weight analysis for the finalized indicators. The study was approved by the Research Ethics Board of Children's Hospital of Fudan University (No. 2022307A), and all methods were carried out in accordance with relevant guidelines and regulations. Informed consent was obtained from all participants.

Procedures, participants, and data collection

Literature review

A systematic literature review of studies on AI decision-making systems in healthcare was conducted first. We searched Ovid Medline, Ovid Embase, and the Cochrane Library for eligible studies written in English, and China National Knowledge Infrastructure (CNKI), Wanfang Data, and SinoMed for eligible studies written in Chinese, published from 1990 to 2022. The following keywords or medical terms were used: artificial intelligence, machine learning, clinical decision support, evaluation, metrics, indicator, index, framework, implementation, application, performance, and success. Because AI systems deployed in healthcare sectors are generally built on hospital information systems, we also searched the literature on evaluation indicators for information systems. The inclusion criteria were: (1) language: English or Chinese; (2) application scenario: healthcare sectors; (3) content: application evaluation of AI systems or hospital information systems. The exclusion criteria were conference abstracts and articles lacking accessible full text.

Constructing the draft evaluation indicator set

After the literature review, a focus group interview was conducted to gather further information on evaluation indicators for AI-DSS in healthcare sectors. Eleven participants, all actively involved in AI-DSS development research or its use, were invited to a virtual interview session held through the Tengxun app. They came from four tertiary children's hospitals and one university in Shanghai Municipality, Anhui Province, Jiangsu Province, and Shandong Province. Among them were four physicians, two nurses, two radiologists, one pharmacist, and two AI professors. We encouraged them to share their insights on the following topics: (1) What is your interpretation of a successful AI-DSS in pediatric outpatient clinics? (2) What criteria do you consider essential for evaluating AI-DSS in pediatric outpatient clinics?

Eventually, relevant content from the literature and interviews was extracted, deduplicated, and synthesized into indicators by two independent members of the research team. These indicators were subsequently categorized according to the Society-Management-Application-Result-Technology (SMART) model. This model, rooted in recent theoretical and empirical AI implementation studies, was developed through expert consensus within a panel of national AI experts convened by the Shanghai AI Laboratory. The expert panel agreed that evaluating an AI system should encompass three primary dimensions: societal performance (referred to as society), organizational performance (referred to as management), and user experience performance (referred to as application). Within each dimension, various components of technology utility should be assessed (referred to as technology), and the outcomes of these components should be comprehensively analyzed and documented (referred to as result). Significant emphasis was placed on tailoring the evaluation indicator set to a specialized application scenario. Further details can be found at https://www.shlab.org.cn/. Consequently, the '1–3–5 evaluation indicator set' was developed for the subsequent Delphi study, where '1' signifies the pediatric outpatient clinic, '3' represents the three evaluation dimensions, and '5' denotes the five components of technology utility.

Delphi study

The Delphi methodology was employed to develop the 1–3–5 evaluation indicator set. Using purposive sampling, experts were recruited from members of the Chinese Medical Information and Big Data Association (CHMIA) Pediatric Committee. The CHMIA is a national-level association under the supervision of the National Health Commission of the People's Republic of China, dedicated to advancing healthcare informatics and big data analytics and to fostering collaboration, innovation, and policy advocacy in the fields of health information technology and data science. The expert inclusion criteria were as follows: (a) experience in either using or designing AI systems in pediatric healthcare sectors; (b) more than 10 years of working experience; and (c) high interest in and willingness to participate in this study.

The Delphi study questionnaire was structured into four sections: (I) introduction to the study and instructions for completing the questionnaire; (II) demographic information about the expert, including age, gender, training experience, profession, and academic title; (III) the draft 1–3–5 evaluation indicator set, in which the experts were asked to rate the importance of each category and indicator on a 5-point Likert scale ranging from 'Not at all important (1 point)' to 'Extremely important (5 points)'; experts could also propose additions, deletions, or modifications; (IV) experts' familiarity with the study and the basis of their judgment on the indicators. A total of two rounds of Delphi consultation were conducted. Experts were asked to complete the questionnaires and return them within 2 weeks. In the first round, paper questionnaires were distributed to the expert panel during the CHMIA annual committee meeting on March 4, 2023, and mail services were arranged for each expert to facilitate questionnaire returns. All Round 1 questionnaires were received by March 10, 2023. Round 2 took place from March 13 to March 26, 2023, with all questionnaires distributed and returned via email.

After collecting data from each round, the study team screened out indicators with a mean importance score of ≤ 3.5, a coefficient of variation of ≥ 0.25, or a full mark rate of ≤ 20%38, while also analyzing experts' opinions. The results were then shared with the experts after each round.
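The screening statistics are simple to compute from the raw ratings; the sketch below is a minimal illustration, assuming the expert ratings sit in a pandas DataFrame with one row per expert and one column per indicator (the column names and scores are hypothetical).

```python
import pandas as pd

def screening_statistics(ratings: pd.DataFrame) -> pd.DataFrame:
    """Per-indicator Delphi screening statistics from 5-point importance ratings.

    `ratings` holds one row per expert and one column per indicator (scores 1-5).
    A common retention rule keeps indicators with mean > 3.5, coefficient of
    variation < 0.25, and full mark rate > 20%.
    """
    mean = ratings.mean()
    cv = ratings.std(ddof=1) / mean          # coefficient of variation
    full_mark_rate = (ratings == 5).mean()   # proportion of experts giving 5 points
    return pd.DataFrame({"mean": mean, "cv": cv, "full_mark_rate": full_mark_rate})

# Hypothetical ratings from five experts for three indicators.
example = pd.DataFrame({
    "indicator_a": [5, 5, 4, 5, 5],
    "indicator_b": [3, 2, 4, 3, 3],
    "indicator_c": [5, 4, 5, 5, 4],
})
print(screening_statistics(example))
```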

Determining the index weights

Analytic hierarchy process

The importance scores for each indicator, as rated by each expert in the final consultation round, were compiled. Pairwise comparisons were then conducted, with importance ratios determined using Saaty's 9-point scale39, to establish a judgment matrix. The weight vector was then computed using yaahp 10.1 software (https://www.metadecsn.com/yaahp/).
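The priority-vector calculation behind such AHP software can be sketched with NumPy; the example below uses a hypothetical 3 × 3 judgment matrix for the three first-class indicators, derives the weights from the principal eigenvector, and reports the consistency ratio (CR < 0.1 is the usual acceptability threshold).

```python
import numpy as np

# Saaty's random consistency index (RI) for matrices of order 1-9.
RANDOM_INDEX = {1: 0.0, 2: 0.0, 3: 0.58, 4: 0.90, 5: 1.12,
                6: 1.24, 7: 1.32, 8: 1.41, 9: 1.45}

def ahp_weights(judgment: np.ndarray) -> tuple[np.ndarray, float]:
    """Return the AHP priority vector and consistency ratio of a judgment matrix."""
    n = judgment.shape[0]
    eigenvalues, eigenvectors = np.linalg.eig(judgment)
    max_index = np.argmax(eigenvalues.real)
    lambda_max = eigenvalues.real[max_index]
    weights = np.abs(eigenvectors[:, max_index].real)
    weights = weights / weights.sum()              # normalize to sum to 1
    ci = (lambda_max - n) / (n - 1)                # consistency index
    cr = ci / RANDOM_INDEX[n] if RANDOM_INDEX[n] else 0.0
    return weights, cr

# Hypothetical pairwise comparisons of the three first-class indicators.
judgment = np.array([
    [1.0, 3.0, 5.0],
    [1/3, 1.0, 3.0],
    [1/5, 1/3, 1.0],
])
weights, cr = ahp_weights(judgment)
print(weights, cr)
```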

Entropy weight method

The data sources for collecting the information needed to calculate the evaluation indicators were identified, relevant data fields from the hospital information system were mapped, and data extraction mechanisms were developed. Data were then summarized each quarter, specifically on March 31, 2023, June 30, 2023, and September 28, 2023, at the Children's Hospital of Fudan University, yielding m quarters and n indicators. Each indicator was normalized to obtain the data matrix, where \({X}_{ij}\) is the original value of indicator j in quarter i and \({X}_{ij}^{\prime}\) is the corresponding normalized value.

For forward indicators, with the maximum and minimum taken over the m quarters of each indicator:

$${X}_{ij}^{\prime}=\frac{{X}_{ij}-\text{min}({X}_{ij})}{\text{max}({X}_{ij})-\text{min}({X}_{ij})}$$

For inverted indicators:

$${X}_{ij}^{\prime}=\frac{\text{max}({X}_{ij})-{X}_{ij}}{\text{max}({X}_{ij})-\text{min}({X}_{ij})}$$
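A minimal sketch of this normalization, assuming the quarterly observations form an m × n NumPy array (rows are quarters, columns are indicators) and that the analyst supplies the column indices of the inverted indicators; mapping constant columns to zero is an implementation choice, not something specified in the text.

```python
import numpy as np

def normalize(matrix: np.ndarray, inverted_columns: set[int]) -> np.ndarray:
    """Min-max normalize each indicator column; inverted indicators are reversed."""
    normalized = np.zeros_like(matrix, dtype=float)
    for j in range(matrix.shape[1]):
        column = matrix[:, j].astype(float)
        span = column.max() - column.min()
        if span == 0:                      # constant indicator: no variation to scale
            normalized[:, j] = 0.0
            continue
        if j in inverted_columns:          # smaller raw values are better
            normalized[:, j] = (column.max() - column) / span
        else:                              # forward indicator: larger raw values are better
            normalized[:, j] = (column - column.min()) / span
    return normalized
```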

The proportion of quarter i under indicator j was computed as \({P}_{ij}={X}_{ij}^{\prime}/\sum_{i=1}^{m}{X}_{ij}^{\prime}\). The entropy value of indicator j was then calculated as follows:

$${\text{En}}_{j}=-k\sum_{i=1}^{m}{P}_{ij}\text{ln}({P}_{ij}),\quad j=1,2,\ldots,n,\qquad k=\frac{1}{\text{ln}(m)}$$

The difference coefficient of indicator j was calculated as follows:

$${D}_{j}=1-{\text{En}}_{j}$$

The weight of indicator j was calculated as follows:

$${W}_{j}=\frac{{D}_{j}}{\sum_{j=1}^{n}{D}_{j}}$$
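Following the formulas above, the entropy weights can be computed in a few lines; the sketch below assumes a normalized m × n matrix such as the one produced in the previous step, and the small constant added before taking proportions is an implementation choice to avoid log(0).

```python
import numpy as np

def entropy_weights(normalized: np.ndarray, eps: float = 1e-12) -> np.ndarray:
    """Entropy weight method on a normalized m x n matrix (quarters x indicators)."""
    m, n = normalized.shape
    # Proportion of quarter i under indicator j (eps avoids log(0) for zero entries).
    p = (normalized + eps) / (normalized + eps).sum(axis=0)
    k = 1.0 / np.log(m)
    entropy = -k * (p * np.log(p)).sum(axis=0)     # En_j, one value per indicator
    difference = 1.0 - entropy                     # D_j
    return difference / difference.sum()           # W_j

# Example with three quarters and two hypothetical (already normalized) indicators.
example = np.array([[0.0, 1.0],
                    [0.5, 0.8],
                    [1.0, 0.0]])
print(entropy_weights(example))
```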

Combination weight method

Let the weight vector calculated by the analytic hierarchy process be α = (α1, α2, ..., αn) and the weight vector calculated by the entropy weight method be β = (β1, β2, ..., βn). Combined weights were calculated using the geometric average method, with the following formula40:

$${W}_{j}=\frac{\sqrt{{\alpha }_{j}{\beta }_{j}}}{\sum_{j=1}^{n}\sqrt{{\alpha }_{j}{\beta }_{j}}}$$

The entropy weight method and the combination method were both implemented in Python; detailed Python code can be found in Supplement A.
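The detailed code lives in Supplement A; as a minimal sketch of the geometric-average combination above, alpha and beta below stand for the subjective (AHP) and objective (entropy) weight vectors, and the example values are hypothetical.

```python
import numpy as np

def combined_weights(alpha: np.ndarray, beta: np.ndarray) -> np.ndarray:
    """Geometric-average combination of subjective (AHP) and objective (entropy) weights."""
    product = np.sqrt(alpha * beta)      # elementwise sqrt(alpha_j * beta_j)
    return product / product.sum()       # renormalize so the combined weights sum to 1

# Hypothetical weight vectors for three indicators.
alpha = np.array([0.5, 0.3, 0.2])        # subjective (AHP) weights
beta = np.array([0.2, 0.7, 0.1])         # objective (entropy) weights
print(combined_weights(alpha, beta))
```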

Statistical analysis

SPSS 25.0 statistical software (https://www.ibm.com/support/pages/downloading-ibm-spss-statistics-25), Python 3.11, and yaahp 10.1 (https://www.metadecsn.com/yaahp/) were used to analyze the data. Measurement data were described as mean ± standard deviation, and count data as frequencies and percentages. The experts' enthusiasm was quantified through the effective recovery rate of the questionnaires, while the expert authority coefficient was determined as the mean of the judgment coefficient and the familiarity coefficient. The degree of coordination among expert opinions was assessed using Kendall's coordination coefficient. A minimum recovery rate of 80 percent is indicative of a valid result. An expert authority coefficient exceeding 0.70 suggests high expertise among the experts, lending credibility to the inquiry results. Kendall's coordination coefficient should first be statistically significant (P < 0.05); its values range between 0 (no agreement) and 1 (complete agreement), where higher values denote stronger agreement and those below 0.2 suggest poor agreement. A combination of the subjective weighting method (the analytic hierarchy process, AHP) and the objective weighting method (the entropy weight method, EWM) was used to weight the 1–3–5 evaluation indicator set and determine the impact of the assessment indicators, using yaahp 10.1 and Python 3.11. Details can be found in the supplementary material.
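The coordination coefficient itself is a short computation; the sketch below is a minimal illustration of the standard Kendall's W formula with the correction for tied ranks, assuming a ratings matrix with one row per expert and one column per indicator (the example scores are hypothetical, and this reproduces the statistic rather than the study's exact workflow).

```python
import numpy as np
from scipy.stats import rankdata

def kendalls_w(ratings: np.ndarray) -> float:
    """Kendall's coefficient of concordance W, with the correction for tied ranks.

    `ratings` has one row per expert (rater) and one column per indicator (item).
    """
    k, n = ratings.shape
    # Rank the indicators within each expert; ties receive average ranks.
    ranks = np.apply_along_axis(rankdata, 1, ratings)
    rank_sums = ranks.sum(axis=0)
    s = ((rank_sums - rank_sums.mean()) ** 2).sum()
    # Tie correction: sum over raters of sum(t^3 - t) for each group of t tied ranks.
    tie_term = 0.0
    for row in ranks:
        _, counts = np.unique(row, return_counts=True)
        tie_term += ((counts ** 3) - counts).sum()
    return 12.0 * s / (k ** 2 * (n ** 3 - n) - k * tie_term)

# Hypothetical ratings: four experts scoring five indicators on the 5-point scale.
ratings = np.array([
    [5, 4, 3, 5, 2],
    [5, 5, 3, 4, 2],
    [4, 4, 2, 5, 3],
    [5, 4, 3, 4, 2],
])
print(kendalls_w(ratings))
```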

Results

Sample characteristics

Fifty-two invitees agreed to participate in the Delphi consultations, with a 100.0% response rate for the Round 1 questionnaires and 92.0% for Round 2. Three experts dropped out owing to non-response within two weeks. All returned questionnaires were complete. The participating experts came from eight municipalities and provinces in China: Shanghai, Tianjin, Jiangsu Province, Henan Province, Shandong Province, Hubei Province, Anhui Province, and Zhejiang Province. Demographic characteristics of the experts are shown in Table 1.

Table 1 Demographic characteristics of participants in Delphi rounds.

Expert authority coefficient and the degree of opinion coordination

In the two rounds of expert consultation, the authority coefficients were 0.834 and 0.846, respectively, meeting the criterion of an expert authority coefficient > 0.7. Kendall's coordination coefficient was 0.135 in Round 1 and 0.312 in Round 2, and both were statistically significant (p < 0.001).

1–3–5 Evaluation indicator set and corresponding weights

The preliminary indicator set comprised three first-class and fifteen second-class indicators, constructed using the SMART model, along with fifty-six third-class indicators derived from the literature review and interviews following deliberation among the research team. In Round 1, no changes were made to the first- and second-class indicators. However, five third-class indicators, ‘coverage rate of disease spectrum’, ‘number of jobs displaced by AI-DSS’, ‘public comprehension of AI-DSS’, ‘daily average patient visits per clinical doctor’, and ‘daily average outpatient and emergency room visits per hospital’, were eliminated directly based on the exclusion criteria. In comparing ‘time expenditure for AI-DSS operation training’ with ‘training frequency for AI-DSS utilization’, the latter was deemed a more reasonable indicator of the accessibility of user experience performance. Consequently, ‘time expenditure for AI-DSS operation training’ was eliminated and ‘training frequency for AI-DSS utilization’ was added, based on experts’ opinions and research team discussion. Similarly, in comparing ‘number of granted invention patents’, ‘number of granted utility model patents’, ‘number of granted software copyrights’, ‘number of medical device certificates obtained’, and ‘quantities of embedded algorithms’ with ‘extent of process reengineering for improved healthcare services via AI-DSS’, the latter was considered more representative of innovation in societal performance. Thus, ‘extent of process reengineering for improved healthcare services via AI-DSS’ replaced the former five indicators, following experts’ opinions and research team discussion. Furthermore, seven third-class indicators underwent wording revisions. In Round 2, only three third-class indicators underwent wording revisions, and there were no further deletions from or additions to the indicator set.

Ultimately, the 1–3–5 evaluation indicator set, comprising three first-class, fifteen second-class, and forty-seven third-class indicators, was established. As depicted in Table 2, the seventeen third-class indicators highlighted in bold are derived from data collected through self-reports of AI-DSS users, leadership, and the general public using an electronic 5-point Likert scale. The remaining third-class indicators are calculated from data obtained directly from the hospital information system.

Table 2 1–3–5 Evaluation Indicator set.

As detailed in Table 3, all the indicators retained in Round 2 had mean importance scores exceeding 3.5, coefficients of variation below 0.25, and full mark rates exceeding 20%. Under the subjective weighting method, societal performance carried the highest weight (0.539), with organizational performance weighted 0.164 and user experience performance 0.297. Under the objective weighting method, however, organizational performance held the highest weight (0.690), followed by societal performance (0.302), with user experience performance (0.008) having the smallest weight. Similarly, under the combined weighting method, organizational performance retained the highest weight (0.543), followed by societal performance (0.391), with user experience performance (0.067) having the smallest weight. After rounding to three decimal places, many values of 0.000 emerged in the objective weights column, indicating minimal quarter-to-quarter variation both in the data collected from self-reports of AI-DSS users and in the data derived directly from the hospital information system.

Table 3 Weights for indicators.

Discussion

The number of publications describing the applications of AI systems in healthcare settings has snowballed over the past decade; the majority solely report on their effects41, and the actual impact when deploying AI in clinical settings remains largely unknown42. At the same time, evidence supporting clinical effectiveness, comparative effectiveness, cost-effectiveness, or other formal health technology assessment of AI in a clinical healthcare setting appears to be limited43. A recent systematic review highlighted a key reason for the lack of adoption of AI systems: the absence of demonstrated benefits. The risk lies in adopting AI systems with insufficient evidence, potentially leading to the incorporation of non-valuable or inadequately supported systems44.

In response to these challenges, researchers have developed multiple models and frameworks aimed at elucidating the elements that contribute to successful AI systems. These model- and framework-guided tools play an essential role in evaluating AI systems45. However, these models have limitations owing to their specific focus. Some center on assessing AI algorithms in controlled laboratory settings46, while others aim to offer holistic metrics for evaluating user experience in clinical settings36,37. AI system evaluation should not only start early in the development process but also be continuous and comprehensive, considering the ethical implications introduced by AI. This comprehensive approach is often lacking in current evaluation frameworks.

In this study, a comprehensive set of evaluation indicators, specifically tailored for AI-DSS implementation in pediatric outpatient clinics, was developed using the Delphi method. The draft was based on the established model specific to AI system performance evaluation and on information extracted from relevant published studies. The 49 consulting experts came from 8 provinces and municipalities across China, and all were experienced in implementing AI systems in pediatrics. The set of evaluation indicators is firmly grounded in the pediatric outpatient clinic and emphasizes measurable results for the technology application across three dimensions. Compared with TEHAI35, this set is tailored explicitly to pediatric outpatient clinics and offers a more extensive range of detailed indicators with which to interpret the success of implementing AI systems. Other reported indicators for evaluating specialized healthcare clinics were often self-developed and did not align with models or frameworks8,9,20,37.

Regarding the weight analysis, previous studies have primarily relied on experts' opinions (subjective weights), utilizing the AHP method47,48. In this study, however, real-world data captured directly from the hospital information system were used to derive objective weights through the entropy weight method. The subjective and objective weights were then synthesized to form the combined weights. Incorporating the evaluation indicator set into the hospital information system helps realize active surveillance of the effect of AI-DSS in the pediatric outpatient clinic. The study revealed that the combined weights align much more closely with the objective weights. Indicator I-1 (Organizational performance) carries the highest weight, followed by Indicator I-2 (Societal performance) and Indicator I-3 (User experience performance), in both the objective and combined weights. Conversely, among the subjective weights, 'Societal performance' holds the most weight, followed by 'Organizational performance' and 'User experience performance'. This discrepancy may stem from the entropy weight method used to calculate the objective weights: by its algorithm, if the collected data change only slightly during observation, an indicator may receive an extremely low weight. Indicators III-1 ~ III-4, III-6 ~ III-8, III-19, III-23 ~ III-34, III-36 ~ III-37, III-39 ~ III-42, and III-44 ~ III-47 exhibited this behavior. They changed little over the observation period and thus contributed minimally to reflecting the dynamic performance of the AI system, suggesting a need for prolonged observation or more frequent data collection.

Limitations of the study

Firstly, the Delphi method, being a structured and iterative forecasting technique, depends on the input of a panel of experts. The second round of consultation, conducted via email, lacked real-time discussion, potentially limiting the depth of analysis and the exploration of alternative viewpoints. Secondly, the subjective weight analysis relies heavily on the expertise and judgment of the participants. Experts' inherent biases, insufficient information, or professional interests can introduce errors or inaccuracies into the process, potentially affecting the validity of the results. Thirdly, the entropy weight method requires long-term observation or more frequent data collection; the data in this study were collected quarterly, only four times, which may result in bias. Further exploration of the entropy weight method could involve assessing the impact of different data collection frequencies on the validity of the results. Comparative studies with varied observation periods, such as monthly or biannual collection, could provide insight into the optimal data collection frequency for minimizing bias and enhancing the robustness of the analysis.

Conclusion

Monitoring and evaluating AI systems in pediatric outpatient clinics necessitates a fitting set of evaluation indicators. The 1–3–5 evaluation indicator set, owing to its comprehensive nature and successful integration into hospital information systems, enables the automatic acquisition of real-world data. This allows for the continuous evaluation and monitoring of AI system performance in pediatric outpatient clinics. Future efforts should focus on enhancing long-term data collection for the indicators to optimize their weight proportions.