Introduction

In April 2018, the US Food and Drug Administration approved the first artificial intelligence (AI) device to detect eye disease in adult patients with diabetes1. The emergence of AI-based technology has garnered significant attention across various domains, particularly in advancing clinical decision support systems, thereby reshaping medical care2. These advancements span a spectrum of AI applications, ranging from the development of preventive strategies3,4 and image-based diagnosis5,6,7 to the improvement of nursing and therapy workflows8,9. In China, where pediatric outpatient clinics grapple with perpetual overcrowding, AI-based decision support systems (AI-DSS) have emerged as a valuable tool, particularly evident during the COVID-19 epidemic10, expediting disease diagnosis and treatment and mitigating patient wait times11,12. However, the predominant focus on technical prowess during AI development often sidelines considerations of seamless integration into real-world workflows and the practical value of these innovations. Such oversights may only surface during clinical evaluations, as training datasets are meticulously curated to eliminate imperfect samples13. Moreover, AI-DSS in clinical settings can prove both costly and occasionally disruptive, highlighting the imperative to dynamically evaluate their efficacy post-adoption.

The rigorous evaluation of information systems' quality and impact was first proposed in the 1980s, encompassing every stage of the system lifecycle, from design and development to selection and utilization. Early publications focused on techniques for assessing effectiveness in controlled laboratory settings. Subsequent papers delved into real-world testing in clinical environments, analyzing the impact on the structure, process, and outcome of healthcare delivery14,15,16,17,18. With the proliferation of AI-DSS in clinical practice, more recent papers have aimed to assess their clinical impact8,19,20,21,22.

Currently, various frameworks and guidelines exist for grouping indicators to evaluate AI system performance. The DeLone and McLean (D&M) Information System (IS) Success Model is a well-known framework for measuring the variables in IS research in healthcare settings23. It was proposed to conceptualize and operationalize IS success24. Modified and extended, it now comprises the interrelated components of system quality, information quality, service quality, intention to use, user satisfaction, and net benefits25,26,27,28,29. Alternatively, indicators can be characterized by the appraised object as technology, task, user, organization, and environment30,31. The Guideline for Good Evaluation Practice in Health Informatics (GEP-HI) was developed to plan and conduct scientifically robust evaluation studies in healthcare32. Additionally, other guidelines were specifically designed to aid in reporting studies incorporating interventions with AI components33,34. These frameworks and guidelines primarily concentrate on reporting and regulatory aspects and offer limited guidance on assessing the application of AI systems. A later study proposed the Translational Evaluation of Healthcare AI (TEHAI) framework to guide the evaluation of AI systems integrated into clinical settings and to provide a scoring matrix35. Yet TEHAI lacks consideration of pertinent stakeholders and detailed information within each component.

Remarkably, only a few of the indicators reported in published articles for evaluating AI systems in specialized healthcare clinics are guided by the existing models or frameworks above20,36; instead, the indicators were often self-developed. Some focused on clinician impacts, some on patient impacts, and others on economic impacts, revealing a lack of systematic and comprehensive options37. AI-DSS in pediatric healthcare face a more complex application background, encompassing diverse clinical scenarios, treatment possibilities, and a wide age range from 0 to 18 years. The evaluation process needs to evolve from a one-time activity into a continuous process, ensuring effective AI-DSS utilization, considering all stakeholders, and providing a comprehensive set of indicators for practical evaluation. This study aims to develop a set of evaluation indicators tailored specifically for AI-DSS in pediatric healthcare, enabling continuous and systematic performance monitoring. The set was established through expert consensus and then integrated into the hospital information system of the Children's Hospital of Fudan University for a pilot implementation. We hypothesize that this incorporated evaluation indicator set will realize dynamic monitoring and significantly enhance AI governance within pediatric healthcare settings.

Methods

Study design

The study was conducted in two stages: (1) Generating the draft of the evaluation indicators via literature reviews and focus group interviews, followed by executing the Delphi study to establish the final indicator set; (2) Conducting weight analysis for the finalized indicators. The study was approved by the Research Ethics Board of Children's Hospital of Fudan University (No. 2022307A), and all methods were carried out in accordance with relevant guidelines and regulations. Informed consent was obtained from all participants.

Procedures, participants, and data collection

Literature review

A systematic literature review of studies on AI decision-making systems in healthcare was conducted first. We searched Ovid Medline, Ovid Embase, and the Cochrane Library for eligible studies written in English, and China National Knowledge Infrastructure (CNKI), Wanfang Data, and SinoMed for eligible studies written in Chinese, published from 1990 to 2022. The following keywords or medical terms were used: artificial intelligence, machine learning, clinical decision support, evaluation, metrics, indicator, index, framework, implementation, application, performance, and success. Because AI systems deployed in healthcare sectors are generally built on hospital information systems, we also searched the literature on evaluation indicators for information systems. The inclusion criteria were: (1) language: English or Chinese; (2) application scenario: healthcare sectors; (3) content: application evaluation of AI systems or hospital information systems. The exclusion criteria were conference abstracts and articles lacking accessible full text.

Constructing the draft evaluation indicator set

After the literature review, a focus group interview was conducted to gather further information on evaluation indicators for AI-DSS in healthcare sectors. Eleven participants, all actively involved in AI-DSS development research or its use, were invited to a virtual interview session held through the Tengxun app. They came from four tertiary children's hospitals and one university in Shanghai Municipality, Anhui Province, Jiangsu Province, and Shandong Province. Among them were four physicians, two nurses, two radiologists, one pharmacist, and two AI professors. We encouraged them to share their insights on the following topics: (1) What is your interpretation of a successful AI-DSS in pediatric outpatient clinics? (2) What criteria do you consider essential for evaluating AI-DSS in pediatric outpatient clinics?

Eventually, relevant content from the literature and interviews was extracted, deduplicated, and synthesized into indicators by two independent members of the research team. These indicators were subsequently categorized according to the Society-Management-Application-Result-Technology (SMART) model. This model, rooted in recent theoretical and empirical AI implementation studies, was developed through expert consensus within a panel of national AI experts convened by the Shanghai AI Laboratory. The expert panel agreed that evaluating an AI system should encompass three primary dimensions: societal performance (referred to as society), organizational performance (referred to as management), and user experience performance (referred to as application). Within each dimension, various components of technology utility should be assessed (referred to as technology), and the outcomes of these components should be comprehensively analyzed and documented (referred to as result). Significant emphasis was placed on tailoring the evaluation indicator set to a specialized application scenario. Further details can be found at https://www.shlab.org.cn/. Consequently, the '1–3–5 evaluation indicator set' was developed for the subsequent Delphi study, where '1' signifies the pediatric outpatient clinic, '3' represents the three evaluation dimensions, and '5' denotes the five components of technology utility.

Delphi study

The Delphi methodology was employed to develop the 1–3–5 evaluation indicator set. Using purposive sampling, experts were recruited from members of the Chinese Medical Information and Big Data Association (CHMIA) Pediatric Committee. The CHMIA is a national-level association under the supervision of the National Health Commission of the People's Republic of China, dedicated to advancing healthcare informatics and big data analytics and to fostering collaboration, innovation, and policy advocacy in the fields of health information technology and data science. The expert inclusion criteria were as follows: (a) experience in either using or designing AI systems in pediatric healthcare sectors; (b) more than 10 years of working experience; and (c) high interest in and willingness to participate in this study.

The Delphi study questionnaire was structured into four sections: (I) introduction to the study and instructions for completing the questionnaire; (II) demographic information about the expert, including age, gender, training experience, profession, and academic title; (III) the draft 1–3–5 evaluation indicator set, in which the experts were asked to rate the importance of each category and indicator on a 5-point Likert scale ranging from 'Not at all important (1 point)' to 'Extremely important (5 points)'; experts could also propose additions, deletions, or modifications; (IV) experts' familiarity with the study and the basis of their judgment on the indicators. A total of two rounds of Delphi consultation were conducted. Experts were asked to complete the questionnaires and return them within 2 weeks. In the first round, paper questionnaires were distributed to the expert panel during the CHMIA annual committee meeting on March 4, 2023, and mail services were arranged for each expert to facilitate questionnaire returns. All Round 1 questionnaires were received by March 10, 2023. Round 2 took place from March 13 to March 26, 2023, with all questionnaires distributed and returned via email.

After collecting data from each round, the study team screened out indicators with a mean importance score of ≤ 3.5, a coefficient of variation of ≥ 0.25, or a full mark rate of ≤ 20%38, while also analyzing experts' opinions. The results were then shared with the experts after each round.
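The screening statistics are simple to compute from the raw ratings; the sketch below is a minimal illustration, assuming the expert ratings sit in a pandas DataFrame with one row per expert and one column per indicator (the column names and scores are hypothetical).

```python
import pandas as pd

def screening_statistics(ratings: pd.DataFrame) -> pd.DataFrame:
    """Per-indicator Delphi screening statistics from 5-point importance ratings.

    `ratings` holds one row per expert and one column per indicator (scores 1-5).
    A common retention rule keeps indicators with mean > 3.5, coefficient of
    variation < 0.25, and full mark rate > 20%.
    """
    mean = ratings.mean()
    cv = ratings.std(ddof=1) / mean          # coefficient of variation
    full_mark_rate = (ratings == 5).mean()   # proportion of experts giving 5 points
    return pd.DataFrame({"mean": mean, "cv": cv, "full_mark_rate": full_mark_rate})

# Hypothetical ratings from five experts for three indicators.
example = pd.DataFrame({
    "indicator_a": [5, 5, 4, 5, 5],
    "indicator_b": [3, 2, 4, 3, 3],
    "indicator_c": [5, 4, 5, 5, 4],
})
print(screening_statistics(example))
```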

Determining the index weights

Analytic hierarchy process

The importance scores for each indicator, as rated by each expert in the final consultation round, were compiled. Pairwise comparisons were then conducted, with importance ratios determined using Saaty's 9-point scale39, to establish a judgment matrix. The weight vector was then computed using yaahp 10.1 software (https://www.metadecsn.com/yaahp/).
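The priority-vector calculation behind such AHP software can be sketched with NumPy; the example below uses a hypothetical 3 × 3 judgment matrix for the three first-class indicators, derives the weights from the principal eigenvector, and reports the consistency ratio (CR < 0.1 is the usual acceptability threshold).

```python
import numpy as np

# Saaty's random consistency index (RI) for matrices of order 1-9.
RANDOM_INDEX = {1: 0.0, 2: 0.0, 3: 0.58, 4: 0.90, 5: 1.12,
                6: 1.24, 7: 1.32, 8: 1.41, 9: 1.45}

def ahp_weights(judgment: np.ndarray) -> tuple[np.ndarray, float]:
    """Return the AHP priority vector and consistency ratio of a judgment matrix."""
    n = judgment.shape[0]
    eigenvalues, eigenvectors = np.linalg.eig(judgment)
    max_index = np.argmax(eigenvalues.real)
    lambda_max = eigenvalues.real[max_index]
    weights = np.abs(eigenvectors[:, max_index].real)
    weights = weights / weights.sum()              # normalize to sum to 1
    ci = (lambda_max - n) / (n - 1)                # consistency index
    cr = ci / RANDOM_INDEX[n] if RANDOM_INDEX[n] else 0.0
    return weights, cr

# Hypothetical pairwise comparisons of the three first-class indicators.
judgment = np.array([
    [1.0, 3.0, 5.0],
    [1/3, 1.0, 3.0],
    [1/5, 1/3, 1.0],
])
weights, cr = ahp_weights(judgment)
print(weights, cr)
```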

Entropy weight method

The data sources for collecting the information needed to calculate the evaluation indicators were identified, relevant data fields from the hospital information system were mapped, and data extraction mechanisms were developed. Data were then summarized each quarter, specifically on March 31, 2023, June 30, 2023, and September 28, 2023, at the Children's Hospital of Fudan University, yielding m quarters and n indicators. Each indicator was normalized to obtain the data matrix, where \({X}_{ij}\) is the original value of indicator j in quarter i and \({X}_{ij}^{\prime}\) is the corresponding normalized value.

For forward indicators, with the maximum and minimum taken over the m quarters of each indicator:

$${X}_{ij}^{\prime}=\frac{{X}_{ij}-\text{min}({X}_{ij})}{\text{max}({X}_{ij})-\text{min}({X}_{ij})}$$

For inverted indicators:

$${X}_{ij}^{\prime}=\frac{\text{max}({X}_{ij})-{X}_{ij}}{\text{max}({X}_{ij})-\text{min}({X}_{ij})}$$
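A minimal sketch of this normalization, assuming the quarterly observations form an m × n NumPy array (rows are quarters, columns are indicators) and that the analyst supplies the column indices of the inverted indicators; mapping constant columns to zero is an implementation choice, not something specified in the text.

```python
import numpy as np

def normalize(matrix: np.ndarray, inverted_columns: set[int]) -> np.ndarray:
    """Min-max normalize each indicator column; inverted indicators are reversed."""
    normalized = np.zeros_like(matrix, dtype=float)
    for j in range(matrix.shape[1]):
        column = matrix[:, j].astype(float)
        span = column.max() - column.min()
        if span == 0:                      # constant indicator: no variation to scale
            normalized[:, j] = 0.0
            continue
        if j in inverted_columns:          # smaller raw values are better
            normalized[:, j] = (column.max() - column) / span
        else:                              # forward indicator: larger raw values are better
            normalized[:, j] = (column - column.min()) / span
    return normalized
```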

The proportion of quarter i under indicator j was computed as \({P}_{ij}={X}_{ij}^{\prime}/\sum_{i=1}^{m}{X}_{ij}^{\prime}\). The entropy value of indicator j was then calculated as follows:

$${\text{En}}_{j}=-k\sum_{i=1}^{m}{P}_{ij}\text{ln}({P}_{ij}),\quad j=1,2,\ldots,n,\qquad k=\frac{1}{\text{ln}(m)}$$

The difference coefficient of indicator j was calculated as follows:

$${D}_{j}=1-{\text{En}}_{j}$$

The weight of indicator j was calculated as follows:

$${W}_{j}=\frac{{D}_{j}}{\sum_{j=1}^{n}{D}_{j}}$$
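Following the formulas above, the entropy weights can be computed in a few lines; the sketch below assumes a normalized m × n matrix such as the one produced in the previous step, and the small constant added before taking proportions is an implementation choice to avoid log(0).

```python
import numpy as np

def entropy_weights(normalized: np.ndarray, eps: float = 1e-12) -> np.ndarray:
    """Entropy weight method on a normalized m x n matrix (quarters x indicators)."""
    m, n = normalized.shape
    # Proportion of quarter i under indicator j (eps avoids log(0) for zero entries).
    p = (normalized + eps) / (normalized + eps).sum(axis=0)
    k = 1.0 / np.log(m)
    entropy = -k * (p * np.log(p)).sum(axis=0)     # En_j, one value per indicator
    difference = 1.0 - entropy                     # D_j
    return difference / difference.sum()           # W_j

# Example with three quarters and two hypothetical (already normalized) indicators.
example = np.array([[0.0, 1.0],
                    [0.5, 0.8],
                    [1.0, 0.0]])
print(entropy_weights(example))
```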

Combination weight method

Let the weight vector calculated by the analytic hierarchy process be α = (α1, α2, ..., αn) and the weight vector calculated by the entropy weight method be β = (β1, β2, ..., βn). Combined weights were calculated using the geometric average method, with the following formula40:

$${W}_{j}=\frac{\sqrt{{\alpha }_{j}{\beta }_{j}}}{\sum_{j=1}^{n}\sqrt{{\alpha }_{j}{\beta }_{j}}}$$

The entropy weight method and the combination method were both implemented in Python; detailed Python code can be found in Supplement A.
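The detailed code lives in Supplement A; as a minimal sketch of the geometric-average combination above, alpha and beta below stand for the subjective (AHP) and objective (entropy) weight vectors, and the example values are hypothetical.

```python
import numpy as np

def combined_weights(alpha: np.ndarray, beta: np.ndarray) -> np.ndarray:
    """Geometric-average combination of subjective (AHP) and objective (entropy) weights."""
    product = np.sqrt(alpha * beta)      # elementwise sqrt(alpha_j * beta_j)
    return product / product.sum()       # renormalize so the combined weights sum to 1

# Hypothetical weight vectors for three indicators.
alpha = np.array([0.5, 0.3, 0.2])        # subjective (AHP) weights
beta = np.array([0.2, 0.7, 0.1])         # objective (entropy) weights
print(combined_weights(alpha, beta))
```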

Statistical analysis

SPSS 25.0 statistical software (https://www.ibm.com/support/pages/downloading-ibm-spss-statistics-25), Python 3.11, and yaahp 10.1 (https://www.metadecsn.com/yaahp/) were used to analyze the data. Measurement data were described as mean ± standard deviation, and count data as frequencies and percentages. The experts' enthusiasm was quantified through the effective recovery rate of the questionnaires, while the expert authority coefficient was determined as the mean of the judgment coefficient and the familiarity coefficient. The degree of coordination among expert opinions was assessed using Kendall's coordination coefficient. A minimum recovery rate of 80 percent is indicative of a valid result. An expert authority coefficient exceeding 0.70 suggests high expertise among the experts, lending credibility to the inquiry results. Kendall's coordination coefficient should first be statistically significant (P < 0.05); its values range between 0 (no agreement) and 1 (complete agreement), where higher values denote stronger agreement and those below 0.2 suggest poor agreement. A combination of the subjective weighting method (the analytic hierarchy process, AHP) and the objective weighting method (the entropy weight method, EWM) was used to weight the 1–3–5 evaluation indicator set and determine the impact of the assessment indicators, using yaahp 10.1 and Python 3.11. Details can be found in the supplementary material.
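The coordination coefficient itself is a short computation; the sketch below is a minimal illustration of the standard Kendall's W formula with the correction for tied ranks, assuming a ratings matrix with one row per expert and one column per indicator (the example scores are hypothetical, and this reproduces the statistic rather than the study's exact workflow).

```python
import numpy as np
from scipy.stats import rankdata

def kendalls_w(ratings: np.ndarray) -> float:
    """Kendall's coefficient of concordance W, with the correction for tied ranks.

    `ratings` has one row per expert (rater) and one column per indicator (item).
    """
    k, n = ratings.shape
    # Rank the indicators within each expert; ties receive average ranks.
    ranks = np.apply_along_axis(rankdata, 1, ratings)
    rank_sums = ranks.sum(axis=0)
    s = ((rank_sums - rank_sums.mean()) ** 2).sum()
    # Tie correction: sum over raters of sum(t^3 - t) for each group of t tied ranks.
    tie_term = 0.0
    for row in ranks:
        _, counts = np.unique(row, return_counts=True)
        tie_term += ((counts ** 3) - counts).sum()
    return 12.0 * s / (k ** 2 * (n ** 3 - n) - k * tie_term)

# Hypothetical ratings: four experts scoring five indicators on the 5-point scale.
ratings = np.array([
    [5, 4, 3, 5, 2],
    [5, 5, 3, 4, 2],
    [4, 4, 2, 5, 3],
    [5, 4, 3, 4, 2],
])
print(kendalls_w(ratings))
```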

Results

Sample characteristics

Fifty-two invitees agreed to participate in the Delphi consultations, with a 100.0% response rate for the Round 1 questionnaires and 92.0% for Round 2. Three experts dropped out owing to non-response within two weeks. All returned questionnaires were complete. The participating experts came from eight municipalities and provinces in China: Shanghai, Tianjin, Jiangsu Province, Henan Province, Shandong Province, Hubei Province, Anhui Province, and Zhejiang Province. Demographic characteristics of the experts are shown in Table 1.

Table 1 Demographic characteristics of participants in Delphi rounds.

Expert authority coefficient and the degree of opinion coordination

In the two rounds of expert consultation, the authority coefficients were 0.834 and 0.846, respectively, meeting the criterion of an expert authority coefficient > 0.7. Kendall's coordination coefficient was 0.135 in Round 1 and 0.312 in Round 2, and both were statistically significant (p < 0.001).

1–3–5 Evaluation indicator set and corresponding weights

The preliminary indicator set comprised three first-class and fifteen second-class indicators, constructed using the SMART model, along with fifty-six third-class indicators derived from the literature review and interviews following deliberation among the research team. In Round 1, no changes were made to the first- and second-class indicators. However, five third-class indicators, ‘coverage rate of disease spectrum’, ‘number of jobs displaced by AI-DSS’, ‘public comprehension of AI-DSS’, ‘daily average patient visits per clinical doctor’, and ‘daily average outpatient and emergency room visits per hospital’, were eliminated directly based on the exclusion criteria. In comparing ‘time expenditure for AI-DSS operation training’ with ‘training frequency for AI-DSS utilization’, the latter was deemed a more reasonable indicator of the accessibility of user experience performance. Consequently, ‘time expenditure for AI-DSS operation training’ was eliminated and ‘training frequency for AI-DSS utilization’ was added, based on experts’ opinions and research team discussion. Similarly, in comparing ‘number of granted invention patents’, ‘number of granted utility model patents’, ‘number of granted software copyrights’, ‘number of medical device certificates obtained’, and ‘quantities of embedded algorithms’ with ‘extent of process reengineering for improved healthcare services via AI-DSS’, the latter was considered more representative of innovation in societal performance. Thus, ‘extent of process reengineering for improved healthcare services via AI-DSS’ replaced the former five indicators, following experts’ opinions and research team discussion. Furthermore, seven third-class indicators underwent wording revisions. In Round 2, only three third-class indicators underwent wording revisions, and there were no further deletions from or additions to the indicator set.

Ultimately, the 1–3–5 evaluation indicator set, comprising three first-class, fifteen second-class, and forty-seven third-class indicators, was established. As depicted in Table 2, the seventeen third-class indicators highlighted in bold are derived from data collected through self-reports of AI-DSS users, leadership, and the general public using an electronic 5-point Likert scale. The remaining third-class indicators are calculated from data obtained directly from the hospital information system.

Table 2 1–3–5 Evaluation Indicator set.

As detailed in Table 3, all the indicators retained in Round 2 had mean importance scores exceeding 3.5, coefficients of variation below 0.25, and full mark rates exceeding 20%. Under the subjective weighting method, societal performance carried the highest weight (0.539), with organizational performance weighted 0.164 and user experience performance 0.297. Under the objective weighting method, however, organizational performance held the highest weight (0.690), followed by societal performance (0.302), with user experience performance (0.008) having the smallest weight. Similarly, under the combined weighting method, organizational performance retained the highest weight (0.543), followed by societal performance (0.391), with user experience performance (0.067) having the smallest weight. After rounding to three decimal places, many values of 0.000 emerged in the objective weights column, indicating minimal quarter-to-quarter variation both in the data collected from self-reports of AI-DSS users and in the data derived directly from the hospital information system.

Table 3 Weights for indicators.

Discussion

The number of publications describing the applications of AI systems in healthcare settings has snowballed over the past decade; the majority solely report on their effects41, and the actual impact when deploying AI in clinical settings remains largely unknown42. At the same time, evidence supporting clinical effectiveness, comparative effectiveness, cost-effectiveness, or other formal health technology assessment of AI in a clinical healthcare setting appears to be limited43. A recent systematic review highlighted a key reason for the lack of adoption of AI systems: the absence of demonstrated benefits. The risk lies in adopting AI systems with insufficient evidence, potentially leading to the incorporation of non-valuable or inadequately supported systems44.

In response to these challenges, researchers have developed multiple models and frameworks aimed at elucidating the elements that contribute to successful AI systems. These model- and framework-guided tools play an essential role in evaluating AI systems45. However, these models have limitations owing to their specific focus. Some center on assessing AI algorithms in controlled laboratory settings46, while others aim to offer holistic metrics for evaluating user experience in clinical settings36,37. AI system evaluation should not only start early in the development process but also be continuous and comprehensive, considering the ethical implications introduced by AI. This comprehensive approach is often lacking in current evaluation frameworks.

In this study, a comprehensive set of evaluation indicators, specifically tailored for AI-DSS implementation in pediatric outpatient clinics, was developed using the Delphi method. The draft was based on the established model specific to AI system performance evaluation and on information extracted from relevant published studies. The 49 consulting experts came from 8 provinces and municipalities across China, and all were experienced in implementing AI systems in pediatrics. The set of evaluation indicators is firmly grounded in the pediatric outpatient clinic and emphasizes measurable results for the technology application across three dimensions. Compared with TEHAI35, this set is tailored explicitly to pediatric outpatient clinics and offers a more extensive range of detailed indicators with which to interpret the success of implementing AI systems. Other reported indicators for evaluating specialized healthcare clinics were often self-developed and did not align with models or frameworks8,9,20,37.

Regarding the weight analysis, previous studies have primarily relied on experts' opinions (subjective weights), utilizing the AHP method47,48. In this study, however, real-world data captured directly from the hospital information system were used to derive objective weights through the entropy weight method. The subjective and objective weights were then synthesized to form the combined weights. Incorporating the evaluation indicator set into the hospital information system helps realize active surveillance of the effect of AI-DSS in the pediatric outpatient clinic. The study revealed that the combined weights align much more closely with the objective weights. Indicator I-1 (Organizational performance) carries the highest weight, followed by Indicator I-2 (Societal performance) and Indicator I-3 (User experience performance), in both the objective and combined weights. Conversely, among the subjective weights, 'Societal performance' holds the most weight, followed by 'Organizational performance' and 'User experience performance'. This discrepancy may stem from the entropy weight method used to calculate the objective weights: by its algorithm, if the collected data change only slightly during observation, an indicator may receive an extremely low weight. Indicators III-1 ~ III-4, III-6 ~ III-8, III-19, III-23 ~ III-34, III-36 ~ III-37, III-39 ~ III-42, and III-44 ~ III-47 exhibited this behavior. They changed little over the observation period and thus contributed minimally to reflecting the dynamic performance of the AI system, suggesting a need for prolonged observation or more frequent data collection.

Limitations of the study

Firstly, the Delphi method, being a structured and iterative forecasting technique, depends on the input of a panel of experts. The second round of consultation, conducted via email, lacked real-time discussion, potentially limiting the depth of analysis and the exploration of alternative viewpoints. Secondly, the subjective weight analysis relies heavily on the expertise and judgment of the participants. Experts' inherent biases, insufficient information, or professional interests can introduce errors or inaccuracies into the process, potentially affecting the validity of the results. Thirdly, the entropy weight method requires long-term observation or more frequent data collection; the data in this study were collected quarterly, only four times, which may result in bias. Further exploration of the entropy weight method could involve assessing the impact of different data collection frequencies on the validity of the results. Comparative studies with varied observation periods, such as monthly or biannual collection, could provide insight into the optimal data collection frequency for minimizing bias and enhancing the robustness of the analysis.

Conclusion

Monitoring and evaluating AI systems in pediatric outpatient clinics necessitates a fitting set of evaluation indicators. The 1–3–5 evaluation indicator set, owing to its comprehensive nature and successful integration into hospital information systems, enables the automatic acquisition of real-world data. This allows for the continuous evaluation and monitoring of AI system performance in pediatric outpatient clinics. Future efforts should focus on enhancing long-term data collection for the indicators to optimize their weight proportions.