Abstract
Reception is an essential process for patients seeking medical care and a critical component influencing the healthcare experience. However, current communication systems rely mainly on human effort, which is both labor- and knowledge-intensive. A promising alternative is to leverage the capabilities of large language models (LLMs) to assist communication at medical center reception sites. Here we curated a unique dataset comprising 35,418 cases of real-world conversation audio corpus between outpatients and receptionist nurses from 10 reception sites across two medical centers, to develop a site-specific prompt engineering chatbot (SSPEC). The SSPEC efficiently resolved patient queries, with a higher proportion of queries addressed in fewer rounds of queries and responses (Q&Rs; 68.0% ≤2 rounds) compared with nurse-led sessions (50.5% ≤2 rounds) (P = 0.009) across administrative, triaging and primary care concerns. We then established a nurse–SSPEC collaboration model to oversee the uncertainties encountered during real-world deployment. In a single-center randomized controlled trial involving 2,164 participants, the primary endpoint indicated that the nurse–SSPEC collaboration model received higher satisfaction feedback from patients (3.91 ± 0.90 versus 3.39 ± 1.15 in the nurse group, P < 0.001). Key secondary outcomes indicated a reduced rate of repeated Q&Rs (3.2% versus 14.4% in the nurse group, P < 0.001), reduced negative emotions during visits (2.4% versus 7.8% in the nurse group, P < 0.001) and enhanced response quality in terms of integrity (4.37 ± 0.95 versus 3.42 ± 1.22 in the nurse group, P < 0.001), empathy (4.14 ± 0.98 versus 3.27 ± 1.22 in the nurse group, P < 0.001) and readability (3.86 ± 0.95 versus 3.71 ± 1.07 in the nurse group, P = 0.006). Overall, our study supports the feasibility of integrating LLMs into the daily hospital workflow and introduces a paradigm for improving communication that benefits both patients and nurses.
Chinese Clinical Trial Registry identifier: ChiCTR2300077245.
Data availability
The data supporting the findings of this trial are available within the paper and its Supplementary Information files. All requests for further data sharing will be reviewed by the Ethics Review Committee of Southern University of Science and Technology Yantian Hospital, Shenzhen, China, and Renmin Hospital of Wuhan University, Wuhan, China, to verify whether the request is subject to any intellectual property or confidentiality obligations. Requests for access to de-identified individual-level data from this trial can be submitted via email to E.L. ([email protected]) with detailed proposals for approval and will be responded to within 60 d. Each request complying with the terms of use of the data indicated in the consent form will be granted. A signed data access agreement with the collaborator is required before accessing shared data. The raw conversation data are not publicly available due to privacy restrictions.
Code availability
The source code can be accessed via the following link: https://github.com/ZigengHuang/SSPEC. The SSPEC was developed with OpenAI version 0.28.1 (https://github.com/openai/openai-python), RAGAS version 0.0.18 (https://github.com/explodinggradients/ragas) and LangChain version 0.0.333 (https://github.com/langchain-ai/langchain).
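As a hedged illustration of how a site-specific prompt might be assembled for the openai 0.28.x-style `ChatCompletion` API, consider the sketch below. The function name, site name and knowledge snippets are hypothetical examples for exposition, not taken from the SSPEC repository.

```python
# Hypothetical sketch of site-specific prompt assembly in the style of
# openai==0.28.x (openai.ChatCompletion.create). All names here are
# illustrative, not the authors' actual code.

def build_site_messages(site_name, site_knowledge, patient_query):
    """Assemble a ChatCompletion `messages` list that grounds the model
    in curated, site-specific reception knowledge."""
    system_prompt = (
        f"You are a receptionist assistant at the {site_name} reception site. "
        "Answer only from the curated site knowledge below; if the answer is "
        "not covered, say you are uncertain so a nurse can take over.\n\n"
        "Site knowledge:\n" + "\n".join(f"- {fact}" for fact in site_knowledge)
    )
    return [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": patient_query},
    ]

messages = build_site_messages(
    "outpatient registration",
    ["Registration opens at 7:30.", "Bring your ID card and insurance card."],
    "What time can I register?",
)
# Under openai==0.28.x, this list could then be passed to
# openai.ChatCompletion.create(model=..., messages=messages).
```

Keeping prompt assembly separate from the API call makes the site-specific knowledge easy to swap per reception site.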
Acknowledgements
E.L. is supported by the National Natural Science Foundation of China (Excellent Youth Scholars Program, 32300483 and 82090011) and the Chinese Academy of Medical Sciences Innovation Fund (2023-I2M-3-010). We thank Z. Wang and the Long laboratory members for valuable comments. We also thank the Bioinformatics Center of the Institute of Basic Medical Sciences for computing support.
Author information
Authors and Affiliations
Contributions
E.L., P.W. and Q.C. contributed to the conception and design of the study. P.W., Z.H., W.T., Y.N., D.P., S.D. and J.C. contributed to the data acquisition, curation and analysis. Y.Z. and H.D. provided technical assistance. All authors contributed to the drafting and revising of the paper.
Corresponding authors
Ethics declarations
Competing interests
The authors declare no competing interests.
Peer review
Peer review information
Nature Medicine thanks Jordan Alpert, Max Rollwage and the other, anonymous, reviewer(s) for their contribution to the peer review of this work. Primary Handling Editor: Lorenzo Righetto, in collaboration with the Nature Medicine team.
Additional information
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Extended data
Extended Data Fig. 1 Study design overview.
(A) Conversational audio was collected from 10 reception sites across two medical centers. (B) The audio data was transformed into text format with meticulous manual editing, encompassing a spectrum of patients’ queries including administrative, triaging, and primary-care concerns. The cases were used as the training set for knowledge curation at each site. Fine-tuning and prompt strategies were applied for developing the site-specific prompt engineering chatbot (SSPEC). (C) The cases independent from the training set were reserved as the validation set for the comparison testing in terms of factuality, integrity, safety, empathy, readability, and satisfaction. (D) An alert system was implemented to alert the uncertainty in SSPEC responses. Subsequently, a collaboration model between receptionist nurses and SSPEC was established. This model was then tested in a randomized controlled trial to ascertain its practicality in the outpatient reception setting.
Extended Data Fig. 2 Context setting of patient-nurse conversation.
The study involved outpatients, patients about to be admitted to the hospital, and individuals seeking help. Patients who had queries for the nurse and agreed to have their conversations audio-recorded were recruited upon arrival at the hospital entrance. After recruitment, participants were directed to the reception sites for an in-person visit with the nurse. Prior to the commencement of audio recording, informed consent was obtained from both the nurses and the patients.
Extended Data Fig. 3 Nurse-SSPEC collaboration model involving the alert system in mitigating uncertainty.
Upon patient arrival at the reception site, their queries are audio-recorded and automatically transcribed into text. To address uncertain or potentially harmful responses generated by SSPEC, an alert system has been implemented. This system triggers an alert to the nurses if any ‘signals of uncertainty’ are detected through key-phrase matching, independent LLM evaluation, or automatic evaluation. The alert prompts immediate nurse review or modification of the response. Furthermore, a dedicated team reviews all patient-SSPEC conversations to continually refine the prompting.
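A minimal sketch of the key-phrase matching layer of such an alert system is shown below; the phrase list and the flagging rule are hypothetical examples, not the deployed SSPEC rules.

```python
# Hypothetical sketch of key-phrase matching for an uncertainty alert
# system; the phrase list is illustrative, not the deployed SSPEC rules.

UNCERTAINTY_PHRASES = (
    "i am not sure",
    "i don't know",
    "cannot confirm",
    "please consult",
    "uncertain",
)

def needs_nurse_review(response: str) -> bool:
    """Flag a chatbot response for nurse review if it contains any
    signal-of-uncertainty phrase (case-insensitive substring match)."""
    text = response.lower()
    return any(phrase in text for phrase in UNCERTAINTY_PHRASES)

# Example usage: the first response would trigger a nurse alert,
# the second would be released to the patient directly.
flagged = needs_nurse_review("I am not sure, please consult your doctor.")
released = needs_nurse_review("Registration is at window 3 on the first floor.")
```

In practice such a matcher would be only the first layer, complemented by the independent LLM evaluation and automatic evaluation described above.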
Extended Data Fig. 4 The workflow in randomized controlled trial.
Patient participants were randomly assigned to either the nurse-SSPEC group or the nurse group. Patients in the nurse-SSPEC group primarily interacted with SSPEC via audio, with nurses alerted for review if any uncertain responses were detected. Patients in the nurse group were directed to nurses and interacted with them in person. Satisfaction was measured immediately after the encounter.
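The routing logic of this workflow can be sketched as follows; the 1:1 random assignment and the function names are simplified, hypothetical stand-ins for the trial's actual randomization procedure.

```python
# Hypothetical sketch of the trial's routing logic: assign each participant
# to a group, then route nurse-SSPEC interactions through the uncertainty
# check before deciding whether a nurse is alerted. Simplified illustration,
# not the study's randomization code.
import random

def assign_group(rng: random.Random) -> str:
    """Simple 1:1 random allocation between the two trial arms."""
    return rng.choice(["nurse-SSPEC", "nurse"])

def route(group: str, sspec_response_uncertain: bool) -> str:
    """Decide who handles the encounter for a given arm and alert state."""
    if group == "nurse":
        return "in-person nurse"
    return "nurse review" if sspec_response_uncertain else "SSPEC reply"

rng = random.Random(0)  # seeded for reproducibility of this example
groups = [assign_group(rng) for _ in range(10)]
```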
Supplementary information
Supplementary Information
Supplementary Tables 1–25, study protocol and statistical plan.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Wan, P., Huang, Z., Tang, W. et al. Outpatient reception via collaboration between nurses and a large language model: a randomized controlled trial. Nat Med 30, 2878–2885 (2024). https://doi.org/10.1038/s41591-024-03148-7