Abstract
While large language models (LLMs) have shown promise in diagnostic reasoning, their impact on management reasoning, which involves balancing treatment decisions and testing strategies while managing risk, is unknown. This prospective, randomized, controlled trial assessed whether LLM assistance improves physician performance on open-ended management reasoning tasks compared to conventional resources. From November 2023 to April 2024, 92 practicing physicians were randomized to use either GPT-4 plus conventional resources or conventional resources alone to answer five expert-developed clinical vignettes in a simulated setting. All cases were based on real, de-identified patient encounters, with information revealed sequentially to mirror the nature of clinical environments. The primary outcome was the difference in total score between groups on expert-developed scoring rubrics. Secondary outcomes included ___domain-specific scores and time spent per case. Physicians using the LLM scored significantly higher compared to those using conventional resources (mean difference = 6.5%, 95% confidence interval (CI) = 2.7 to 10.2, P < 0.001). LLM users spent more time per case (mean difference = 119.3 s, 95% CI = 17.4 to 221.2, P = 0.02). There was no significant difference between LLM-augmented physicians and LLM alone (−0.9%, 95% CI = −9.0 to 7.2, P = 0.8). LLM assistance can improve physician management reasoning in complex clinical vignettes compared to conventional resources and should be validated in real clinical practice. ClinicalTrials.gov registration: NCT06208423.
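The primary outcome above is reported as a between-group difference in mean total score with a 95% confidence interval and P value. Purely as an illustrative sketch (not the trial's prespecified analysis, which may instead use a model that accounts for repeated cases per physician), the snippet below shows how such an estimate could be computed from per-physician total scores using a Welch two-sample comparison; the group sizes and score distributions are simulated assumptions, not data from the study.

```python
# Illustrative sketch only: Welch two-sample comparison of per-physician total
# scores (percent of rubric points earned), with a 95% CI for the difference in
# means. Group sizes and score values below are simulated, not study data.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
llm_group = rng.normal(loc=85, scale=8, size=46)           # hypothetical: GPT-4 + conventional resources
conventional_group = rng.normal(loc=78, scale=8, size=46)  # hypothetical: conventional resources alone

n1, n2 = len(llm_group), len(conventional_group)
m1, m2 = llm_group.mean(), conventional_group.mean()
v1, v2 = llm_group.var(ddof=1), conventional_group.var(ddof=1)

diff = m1 - m2                          # difference in mean total score (percentage points)
se = np.sqrt(v1 / n1 + v2 / n2)         # standard error of the difference (unequal variances)

# Welch-Satterthwaite degrees of freedom for the unequal-variance comparison
df = (v1 / n1 + v2 / n2) ** 2 / ((v1 / n1) ** 2 / (n1 - 1) + (v2 / n2) ** 2 / (n2 - 1))
t_crit = stats.t.ppf(0.975, df)
ci_low, ci_high = diff - t_crit * se, diff + t_crit * se
p_value = stats.ttest_ind(llm_group, conventional_group, equal_var=False).pvalue

print(f"Mean difference: {diff:.1f} percentage points")
print(f"95% CI: ({ci_low:.1f}, {ci_high:.1f}); P = {p_value:.3f}")
```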
Data availability
Example case vignettes, questions and grading are included in the manuscript. All the raw scores produced by study participants are available via Figshare at https://doi.org/10.6084/m9.figshare.27886788 (ref. 37). Source data are provided with this paper.
Code availability
No custom code or software development was required for the current research.
Change history
17 February 2025
A Correction to this paper has been published: https://doi.org/10.1038/s41591-025-03586-x
References
Kanjee, Z., Crowe, B. & Rodman, A. Accuracy of a generative artificial intelligence model in a complex diagnostic challenge. JAMA 330, 78–80 (2023).
Cabral, S. et al. Clinical reasoning of a generative artificial intelligence model compared with physicians. JAMA Intern. Med. 184, 581–583 (2024).
Tu, T. et al. Towards conversational diagnostic AI. Preprint at https://arxiv.org/abs/2401.05654 (2024).
McDuff, D. et al. Towards accurate differential diagnosis with large language models. Preprint at https://arxiv.org/abs/2312.00164 (2023).
Goh, E. et al. Large language model influence on diagnostic reasoning: a randomized clinical trial. JAMA Netw. Open 7, e2440969 (2024).
Zaboli, A., Brigo, F., Sibilio, S., Mian, M. & Turcato, G. Human intelligence versus Chat-GPT: who performs better in correctly classifying patients in triage? Am. J. Emerg. Med. 79, 44–47 (2024).
Truhn, D. et al. A pilot study on the efficacy of GPT-4 in providing orthopedic treatment recommendations from MRI reports. Sci. Rep. 13, 20159 (2023).
Cook, D. A., Sherbino, J. & Durning, S. J. Management reasoning beyond the diagnosis. JAMA 319, 2267–2268 (2018).
Ledley, R. S. & Lusted, L. B. Reasoning foundations of medical diagnosis: symbolic logic, probability, and value theory aid our understanding of how physicians reason. Science 130, 9–21 (1959).
Bordage, G. Prototypes and semantic qualifiers: from past to present. Med. Educ. 41, 1117–1121 (2007).
Bowen, J. L. Educational strategies to promote clinical diagnostic reasoning. N. Engl. J. Med. 355, 2217–2225 (2006).
Cook, D. A., Stephenson, C. R., Gruppen, L. D. & Durning, S. J. Management reasoning: empirical determination of key features and a conceptual model. Acad. Med. 98, 80–87 (2023).
Mercuri, M. et al. When guidelines don’t guide: the effect of patient context on management decisions based on clinical practice guidelines. Acad. Med. 90, 191–196 (2015).
Schmidt, H. G., Norman, G. R., Mamede, S. & Magzoub, M. The influence of context on diagnostic reasoning: a narrative synthesis of experimental findings. J. Eval. Clin. Pract. 30, 1091–1101 (2024).
Parsons, A. S., Wijesekera, T. P. & Rencic, J. J. The management script: a practical tool for teaching management reasoning. Acad. Med. 95, 1179–1185 (2020).
Reverberi, C. et al. Experimental evidence of effective human–AI collaboration in medical decision-making. Sci. Rep. 12, 14952 (2022).
Kempt, H. & Nagel, S. K. Responsibility, second opinions and peer-disagreement: ethical and epistemological challenges of using AI in clinical diagnostic contexts. J. Med. Ethics 48, 222–229 (2022).
Restrepo, D., Rodman, A. & Abdulnour, R.-E. Conversations on reasoning: large language models in diagnosis. J. Hosp. Med. 19, 731–735 (2024).
Friedman, C. P. et al. Enhancement of clinicians’ diagnostic reasoning by computer-based consultation: a multisite study of 2 systems. JAMA 282, 1851–1856 (1999); erratum 285, 2979 (2001).
Miller, R. A., Pople, H. E. Jr & Myers, J. D. Internist-1, an experimental computer-based diagnostic consultant for general internal medicine. N. Engl. J. Med. 307, 468–476 (1982).
Chen, Y. et al. SoulChat: improving LLMs’ empathy, listening, and comfort abilities through fine-tuning with multi-turn empathy conversations. Preprint at https://arxiv.org/abs/2311.00273 (2023).
Ayers, J. W. et al. Comparing physician and artificial intelligence chatbot responses to patient questions posted to a public social media forum. JAMA Intern. Med. 183, 589–596 (2023).
Tai-Seale, M. et al. AI-generated draft replies integrated into health records and physicians’ electronic communication. JAMA Netw. Open 7, e246565 (2024).
Chen, S. et al. The effect of using a large language model to respond to patient messages. Lancet Digit. Health 6, e379–e381 (2024).
Pfeffer, M. A., Shah, N. H., Sharp, C. & Lindmark, C. Nigam Shah and partners roll out beta version of Stanford medicine SHC and SoM Secure GPT. Stanford Medicine https://dbds.stanford.edu/2024/nigam-shaw-and-partners-roll-out-beta-version-of-stanford-medicine-shc-and-som-secure-gpt/ (2024).
Nori, H. et al. Can generalist foundation models outcompete special-purpose tuning? Case study in medicine. Preprint at https://arxiv.org/abs/2311.16452 (2023).
Core IM. American College of Physicians www.acponline.org/cme-moc/internal-medicine-cme/internal-medicine-podcasts/core-im (2024).
Pell, G., Fuller, R., Homer, M. & Roberts, T. How to measure the quality of the OSCE: a review of metrics—AMEE guide no. 49. Med. Teach. 32, 802–811 (2010).
Khan, K. Z., Ramachandran, S., Gaunt, K. & Pushkar, P. The Objective Structured Clinical Examination (OSCE): AMEE Guide No. 81. Part I: an historical and theoretical perspective. Med. Teach. 35, e1437–e1446 (2013).
Cook, D. A., Durning, S. J., Stephenson, C. R., Gruppen, L. D. & Lineberry, M. Assessment of management reasoning: design considerations drawn from analysis of simulated outpatient encounters. Med. Teach. 1–15, https://doi.org/10.1080/0142159X.2024.2337251 (2024).
Singaraju, R. C., Durning, S. J., Battista, A. & Konopasky, A. Exploring procedure-based management reasoning: a case of tension pneumothorax. Diagnosis 9, 437–445 (2022).
Jones, J. & Hunter, D. Consensus methods for medical and health services research. BMJ 311, 376–380 (1995).
Meskó, B. Prompt engineering as an important emerging skill for medical professionals: tutorial. J. Med. Internet Res. 25, e50638 (2023).
Van Veen, D. et al. Adapted large language models can outperform medical experts in clinical text summarization. Nat. Med. 30, 1134–1142 (2024).
Gallo, R. J., Savage, T. & Chen, J. H. Affiliation bias in peer review of abstracts. JAMA 331, 1234–1235 (2024).
Gallo, R. J. et al. Establishing best practices in large language model research: an application to repeat prompting. J. Am. Med. Inform. Assoc. 32, 386–390 (2025).
Goh, E. et al. GPT-4 assistance for improvement of physician performance on patient care tasks: a randomized controlled trial. Figshare https://doi.org/10.6084/m9.figshare.27886788 (2025).
Acknowledgements
We thank M. Chua, M. Maddali, H. Magon and M. Schwede for providing feedback and participating in our pilot study.
Author information
Authors and Affiliations
Contributions
E.G. and R.J.G. participated in study design, acquired and interpreted the data, prepared the manuscript and revised it critically. E.S. and H.K. participated in study design, and acquired and interpreted the data. Y.W., J.A.F., J.C., Z.K., K.P.L., A.S.P., D.Y. and A.P.J.O. participated in study design and interpreted the data. A.M. and N.A. acquired the funding and provided administrative support. E.H. participated in study design and provided critical revision of the manuscript. J.H. and J.H.C. participated in study design, analyzed and interpreted the data, carried out critical revision of the manuscript, supervised the study, acquired the funding and provided administrative support. A.R. participated in study design, analyzed and interpreted the data, carried out critical revision of the manuscript and supervised the study.
Corresponding author
Ethics declarations
Competing interests
E.G., J.H., E.S., J.C., Z.K., A.P.J.O., A.R. and J.H.C. disclose funding from the Gordon and Betty Moore Foundation (grant no. 12409). R.J.G. is supported by a VA Advanced Fellowship in Medical Informatics. Z.K. discloses royalties from Wolters Kluwer for books edited (unrelated to this study), former paid advisory membership for Wolters Kluwer on medical education products (unrelated to this study) and honoraria from Oakstone Publishing for CME delivered (unrelated to this study). A.S.P. discloses a paid advisory role for New England Journal of Medicine Group and National Board of Medical Examiners for medical education products (unrelated to this study). A.P.J.O. receives funding from 3M for research related to rural health workforce shortages, and consulting fees for work related to a clinical reasoning application from the New England Journal of Medicine. A.M. reports uncompensated and compensated relationships with care.coach, Emsana Health, Embold Health, ezPT, FN Advisors, Intermountain Healthcare, JRSL, The Leapfrog Group, the Peterson Center on Healthcare, Prealize Health and PBGH. J.H.C. reports cofounding Reaction Explorer, which develops and licenses organic chemistry education software, as well as paid consulting fees from Sutton Pierce, Younker Hyde Macfarlane and Sykes McAllister as a medical expert witness. He receives funding from the National Institutes of Health (NIH)/National Institute of Allergy and Infectious Diseases (1R01AI17812101), NIH/National Institute on Drug Abuse Clinical Trials Network (UG1DA015815—CTN-0136), Stanford Artificial Intelligence in Medicine and Imaging—Human-Centered Artificial Intelligence Partnership Grant, the NIH-NCATS-Clinical & Translational Science Award (UM1TR004921), Stanford Bio-X Interdisciplinary Initiatives Seed Grants Program (IIP) [R12], NIH/Center for Undiagnosed Diseases at Stanford (U01 NS134358) and the American Heart Association—Strategically Focused Research Network—Diversity in Clinical Trials. J.H. discloses a paid advisory role for Cognita Imaging. The other authors declare no competing interests. The funders had no role in study design, data collection and analysis, decision to publish or preparation of the manuscript.
Peer review
Peer review information
Nature Medicine thanks Eric Oermann and the other, anonymous, reviewer(s) for their contribution to the peer review of this work. Primary Handling Editor: Lorenzo Righetto, in collaboration with the Nature Medicine team.
Additional information
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Extended data
Extended Data Fig. 1 Correlation between Time Spent in Seconds and Total Score.
This figure demonstrates a sample medical management case with multi-part assessment questions, scoring rubric and example responses. The case presents a 72-year-old post-cholecystectomy patient with new-onset atrial fibrillation. The rubric (23 points total) evaluates clinical decision-making across key areas: initial workup, anticoagulation decisions, and outpatient monitoring strategy. Sample high-scoring (21/23) and low-scoring (8/23) responses illustrate varying depths of clinical reasoning and management decisions.
Supplementary information
Supplementary Information
Supplementary Items 1–6.
Source data
Source Data Fig. 1
Anonymized raw scores from participants.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Goh, E., Gallo, R.J., Strong, E. et al. GPT-4 assistance for improvement of physician performance on patient care tasks: a randomized controlled trial. Nat Med 31, 1233–1238 (2025). https://doi.org/10.1038/s41591-024-03456-y
DOI: https://doi.org/10.1038/s41591-024-03456-y
This article is cited by
- Making large language models into reliable physician assistants. Nature Medicine (2025)
- Physician clinical decision modification and bias assessment in a randomized controlled trial of AI assistance. Communications Medicine (2025)
- A multimodal vision foundation model for clinical dermatology. Nature Medicine (2025)
- AI in psoriatic disease ("PsAI"): Current insights and future directions. Clinical Rheumatology (2025)