GPT-4 assistance for improvement of physician performance on patient care tasks: a randomized controlled trial

A Publisher Correction to this article was published on 17 February 2025

Abstract

While large language models (LLMs) have shown promise in diagnostic reasoning, their impact on management reasoning, which involves balancing treatment decisions and testing strategies while managing risk, is unknown. This prospective, randomized, controlled trial assessed whether LLM assistance improves physician performance on open-ended management reasoning tasks compared to conventional resources. From November 2023 to April 2024, 92 practicing physicians were randomized to use either GPT-4 plus conventional resources or conventional resources alone to answer five expert-developed clinical vignettes in a simulated setting. All cases were based on real, de-identified patient encounters, with information revealed sequentially to mirror the nature of clinical environments. The primary outcome was the difference in total score between groups on expert-developed scoring rubrics. Secondary outcomes included ___domain-specific scores and time spent per case. Physicians using the LLM scored significantly higher compared to those using conventional resources (mean difference = 6.5%, 95% confidence interval (CI) = 2.7 to 10.2, P < 0.001). LLM users spent more time per case (mean difference = 119.3 s, 95% CI = 17.4 to 221.2, P = 0.02). There was no significant difference between LLM-augmented physicians and LLM alone (−0.9%, 95% CI = −9.0 to 7.2, P = 0.8). LLM assistance can improve physician management reasoning in complex clinical vignettes compared to conventional resources and should be validated in real clinical practice. ClinicalTrials.gov registration: NCT06208423.
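For readers who want to reproduce this kind of comparison on their own data, the sketch below shows one way to estimate a between-group mean difference with a 95% confidence interval in Python. It is a minimal illustration using hypothetical scores and a simple Welch two-sample comparison; it is not the authors' analysis code, and the trial's actual models (which must account for each physician answering multiple cases) may differ.

import numpy as np
from scipy import stats

# Hypothetical per-physician total scores, standardized to 0-100.
# These numbers are illustrative only and are not the trial data.
rng = np.random.default_rng(0)
llm_group = rng.normal(loc=82, scale=8, size=46)           # GPT-4 plus conventional resources
conventional_group = rng.normal(loc=76, scale=8, size=46)  # conventional resources only

# Between-group mean difference.
mean_difference = llm_group.mean() - conventional_group.mean()

# Welch's t-test (unequal variances) and the 95% CI for the difference in means
# (confidence_interval requires SciPy >= 1.10).
result = stats.ttest_ind(llm_group, conventional_group, equal_var=False)
ci_low, ci_high = result.confidence_interval(confidence_level=0.95)

print(f"Mean difference: {mean_difference:.1f} points "
      f"(95% CI {ci_low:.1f} to {ci_high:.1f}), P = {result.pvalue:.3f}")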


Fig. 1: Study flow diagram.
Fig. 2: Comparison of the primary outcome for physicians with LLM and with conventional resources only (total score standardized to 0–100).
Fig. 3: Comparison of the primary outcome according to GPT-4 alone versus physician with GPT-4 and with conventional resources only (total score standardized to 0–100).
Fig. 4: Comparison of the time spent per case by physicians using GPT-4 and physicians using conventional resources only.

Data availability

Example case vignettes, questions and grading are included in the manuscript. All the raw scores produced by study participants are available via Figshare at https://doi.org/10.6084/m9.figshare.27886788 (ref. 37). Source data are provided with this paper.

Code availability

No custom code or software development was required for the current research.

References

  1. Kanjee, Z., Crowe, B. & Rodman, A. Accuracy of a generative artificial intelligence model in a complex diagnostic challenge. JAMA 330, 78–80 (2023).

  2. Cabral, S. et al. Clinical reasoning of a generative artificial intelligence model compared with physicians. JAMA Intern. Med. 184, 581–583 (2024).

  3. Tu, T. et al. Towards conversational diagnostic AI. Preprint at https://arxiv.org/abs/2401.05654 (2024).

  4. McDuff, D. et al. Towards accurate differential diagnosis with large language models. Preprint at https://arxiv.org/abs/2312.00164 (2023).

  5. Goh, E. et al. Large language model influence on diagnostic reasoning: a randomized clinical trial. JAMA Netw. Open 7, e2440969 (2024).

  6. Zaboli, A., Brigo, F., Sibilio, S., Mian, M. & Turcato, G. Human intelligence versus Chat-GPT: who performs better in correctly classifying patients in triage? Am. J. Emerg. Med. 79, 44–47 (2024).

  7. Truhn, D. et al. A pilot study on the efficacy of GPT-4 in providing orthopedic treatment recommendations from MRI reports. Sci. Rep. 13, 20159 (2023).

  8. Cook, D. A., Sherbino, J. & Durning, S. J. Management reasoning beyond the diagnosis. JAMA 319, 2267–2268 (2018).

  9. Ledley, R. S. & Lusted, L. B. Reasoning foundations of medical diagnosis: symbolic logic, probability, and value theory aid our understanding of how physicians reason. Science 130, 9–21 (1959).

  10. Bordage, G. Prototypes and semantic qualifiers: from past to present. Med. Educ. 41, 1117–1121 (2007).

  11. Bowen, J. L. Educational strategies to promote clinical diagnostic reasoning. N. Engl. J. Med. 355, 2217–2225 (2006).

  12. Cook, D. A., Stephenson, C. R., Gruppen, L. D. & Durning, S. J. Management reasoning: empirical determination of key features and a conceptual model. Acad. Med. 98, 80–87 (2023).

  13. Mercuri, M. et al. When guidelines don’t guide: the effect of patient context on management decisions based on clinical practice guidelines. Acad. Med. 90, 191–196 (2015).

  14. Schmidt, H. G., Norman, G. R., Mamede, S. & Magzoub, M. The influence of context on diagnostic reasoning: a narrative synthesis of experimental findings. J. Eval. Clin. Pract. 30, 1091–1101 (2024).

  15. Parsons, A. S., Wijesekera, T. P. & Rencic, J. J. The management script: a practical tool for teaching management reasoning. Acad. Med. 95, 1179–1185 (2020).

  16. Reverberi, C. et al. Experimental evidence of effective human–AI collaboration in medical decision-making. Sci. Rep. 12, 14952 (2022).

  17. Kempt, H. & Nagel, S. K. Responsibility, second opinions and peer-disagreement: ethical and epistemological challenges of using AI in clinical diagnostic contexts. J. Med. Ethics 48, 222–229 (2022).

  18. Restrepo, D., Rodman, A. & Abdulnour, R.-E. Conversations on reasoning: large language models in diagnosis. J. Hosp. Med. 19, 731–735 (2024).

  19. Friedman, C. P. et al. Enhancement of clinicians’ diagnostic reasoning by computer-based consultation: a multisite study of 2 systems. JAMA 282, 1851–1856 (1999); erratum 285, 2979 (2001).

  20. Miller, R. A., Pople, H. E. Jr & Myers, J. D. Internist-1, an experimental computer-based diagnostic consultant for general internal medicine. N. Engl. J. Med. 307, 468–476 (1982).

  21. Chen, Y. et al. SoulChat: improving LLMs’ empathy, listening, and comfort abilities through fine-tuning with multi-turn empathy conversations. Preprint at https://arxiv.org/abs/2311.00273 (2023).

  22. Ayers, J. W. et al. Comparing physician and artificial intelligence chatbot responses to patient questions posted to a public social media forum. JAMA Intern. Med. 183, 589–596 (2023).

  23. Tai-Seale, M. et al. AI-generated draft replies integrated into health records and physicians’ electronic communication. JAMA Netw. Open 7, e246565 (2024).

  24. Chen, S. et al. The effect of using a large language model to respond to patient messages. Lancet Digit. Health 6, e379–e381 (2024).

  25. Pfeffer, M. A., Shah, N. H., Sharp, C. & Lindmark, C. Nigam Shah and partners roll out beta version of Stanford medicine SHC and SoM Secure GPT. Stanford Medicine https://dbds.stanford.edu/2024/nigam-shaw-and-partners-roll-out-beta-version-of-stanford-medicine-shc-and-som-secure-gpt/ (2024).

  26. Nori, H. et al. Can generalist foundation models outcompete special-purpose tuning? Case study in medicine. Preprint at https://arxiv.org/abs/2311.16452 (2023).

  27. Core IM. American College of Physicians www.acponline.org/cme-moc/internal-medicine-cme/internal-medicine-podcasts/core-im (2024).

  28. Pell, G., Fuller, R., Homer, M. & Roberts, T. How to measure the quality of the OSCE: a review of metrics—AMEE guide no. 49. Med. Teach. 32, 802–811 (2010).

  29. Khan, K. Z., Ramachandran, S., Gaunt, K. & Pushkar, P. The Objective Structured Clinical Examination (OSCE): AMEE Guide No. 81. Part I: an historical and theoretical perspective. Med. Teach. 35, e1437–e1446 (2013).

  30. Cook, D. A., Durning, S. J., Stephenson, C. R., Gruppen, L. D. & Lineberry, M. Assessment of management reasoning: design considerations drawn from analysis of simulated outpatient encounters. Med. Teach. https://doi.org/10.1080/0142159X.2024.2337251 (2024).

  31. Singaraju, R. C., Durning, S. J., Battista, A. & Konopasky, A. Exploring procedure-based management reasoning: a case of tension pneumothorax. Diagnosis 9, 437–445 (2022).

  32. Jones, J. & Hunter, D. Consensus methods for medical and health services research. BMJ 311, 376–380 (1995).

  33. Meskó, B. Prompt engineering as an important emerging skill for medical professionals: tutorial. J. Med. Internet Res. 25, e50638 (2023).

  34. Van Veen, D. et al. Adapted large language models can outperform medical experts in clinical text summarization. Nat. Med. 30, 1134–1142 (2024).

  35. Gallo, R. J., Savage, T. & Chen, J. H. Affiliation bias in peer review of abstracts. JAMA 331, 1234–1235 (2024).

  36. Gallo, R. J. et al. Establishing best practices in large language model research: an application to repeat prompting. J. Am. Med. Inform. Assoc. 32, 386–390 (2025).

  37. Goh, E. et al. GPT-4 assistance for improvement of physician performance on patient care tasks: a randomized controlled trial. Figshare https://doi.org/10.6084/m9.figshare.27886788 (2025).

Acknowledgements

We thank M. Chua, M. Maddali, H. Magon and M. Schwede for providing feedback and participating in our pilot study.

Author information

Contributions

E.G. and R.J.G. participated in study design, acquired and interpreted the data, prepared the manuscript and revised it critically. E.S. and H.K. participated in study design, and acquired and interpreted the data. Y.W., J.A.F., J.C., Z.K., K.P.L., A.S.P., D.Y. and A.P.J.O. participated in study design and interpreted the data. A.M. and N.A. acquired the funding and provided administrative support. E.H. participated in study design and provided critical revision of the manuscript. J.H. and J.H.C. participated in study design, analyzed and interpreted the data, carried out critical revision of the manuscript, supervised the study, acquired the funding and provided administrative support. A.R. participated in study design, analyzed and interpreted the data, carried out critical revision of the manuscript and supervised the study.

Corresponding author

Correspondence to Jonathan H. Chen.

Ethics declarations

Competing interests

E.G., J.H., E.S., J.C., Z.K., A.P.J.O., A.R. and J.H.C. disclose funding from the Gordon and Betty Moore Foundation (grant no. 12409). R.J.G. is supported by a VA Advanced Fellowship in Medical Informatics. Z.K. discloses royalties from Wolters Kluwer for books edited (unrelated to this study), former paid advisory membership for Wolters Kluwer on medical education products (unrelated to this study) and honoraria from Oakstone Publishing for CME delivered (unrelated to this study). A.S.P. discloses a paid advisory role for New England Journal of Medicine Group and National Board of Medical Examiners for medical education products (unrelated to this study). A.P.J.O. receives funding from 3M for research related to rural health workforce shortages, and consulting fees for work related to a clinical reasoning application from the New England Journal of Medicine. A.M. reports uncompensated and compensated relationships with care.coach, Emsana Health, Embold Health, ezPT, FN Advisors, Intermountain Healthcare, JRSL, The Leapfrog Group, the Peterson Center on Healthcare, Prealize Health and PBGH. J.H.C. reports cofounding Reaction Explorer, which develops and licenses organic chemistry education software, as well as paid consulting fees from Sutton Pierce, Younker Hyde Macfarlane and Sykes McAllister as a medical expert witness. He receives funding from the National Institutes of Health (NIH)/National Institute of Allergy and Infectious Diseases (1R01AI17812101), NIH/National Institute on Drug Abuse Clinical Trials Network (UG1DA015815—CTN-0136), Stanford Artificial Intelligence in Medicine and Imaging—Human-Centered Artificial Intelligence Partnership Grant, the NIH-NCATS-Clinical & Translational Science Award (UM1TR004921), Stanford Bio-X Interdisciplinary Initiatives Seed Grants Program (IIP) [R12], NIH/Center for Undiagnosed Diseases at Stanford (U01 NS134358) and the American Heart Association—Strategically Focused Research Network—Diversity in Clinical Trials. J.H. discloses a paid advisory role for Cognita Imaging. The other authors declare no competing interests. The funders had no role in study design, data collection and analysis, decision to publish or preparation of the manuscript.

Peer review

Peer review information

Nature Medicine thanks Eric Oermann and the other, anonymous, reviewer(s) for their contribution to the peer review of this work. Primary Handling Editor: Lorenzo Righetto, in collaboration with the Nature Medicine team.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Extended data

Extended Data Fig. 1 Correlation between Time Spent in Seconds and Total Score.

This figure demonstrates a sample medical management case with multi-part assessment questions, scoring rubric and example responses. The case presents a 72-year-old post-cholecystectomy patient with new-onset atrial fibrillation. The rubric (23 points total) evaluates clinical decision-making across key areas: initial workup, anticoagulation decisions, and outpatient monitoring strategy. Sample high-scoring (21/23) and low-scoring (8/23) responses illustrate varying depths of clinical reasoning and management decisions.
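As a small worked example of how raw rubric points map onto the 0–100 standardized scale used for the primary outcome (the paper's Methods define the exact procedure; this sketch assumes a simple linear rescaling), the snippet below converts the example scores from the 23-point rubric:

# Assumed linear rescaling of rubric points to a 0-100 scale; illustrative only.
def standardize(points_earned: float, points_possible: float) -> float:
    """Rescale a raw rubric score to a 0-100 scale."""
    return 100.0 * points_earned / points_possible

print(standardize(21, 23))  # high-scoring example response: ~91.3
print(standardize(8, 23))   # low-scoring example response: ~34.8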

Extended Data Table 1 Post-hoc Analysis Adjusted for Time Spent in Each Case
Extended Data Table 2 Post-hoc Analysis for the Associations between the Primary and Secondary Outcomes Overall

Source data

Source Data Fig. 1

Anonymized raw scores from participants.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Goh, E., Gallo, R.J., Strong, E. et al. GPT-4 assistance for improvement of physician performance on patient care tasks: a randomized controlled trial. Nat Med 31, 1233–1238 (2025). https://doi.org/10.1038/s41591-024-03456-y

