Extended Data Fig. 4: Distribution of clinical LLMs' accuracy in FRQs across the medical specialties.
From: An evaluation framework for clinical use of large language models in patient interaction tasks

Distribution of clinical LLMs' accuracy in FRQs across the 12 medical specialties for (a) GPT-4, (b) GPT-3.5, (c) Mistral-v2-7b, and (d) LLaMA-2-7b. Trends for the four experimental settings (vignette, multi-turn conversation, single-turn conversation, and summarized conversation) are consistent with the combined accuracy for all 12 specialties: Dermatology, Hematology and Oncology, Neurology, Gastroenterology, Pediatrics and Neonatology, Cardiology, Infectious Disease, Obstetrics and Gynecology, Urology and Nephrology, Endocrinology, Rheumatology, and Others. Error bars represent 95% confidence intervals, and numbers represent the mean accuracy.
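
The caption does not state how the 95% confidence intervals were obtained. As an illustration only, the sketch below shows one common way to compute a mean accuracy with such an interval from per-question correctness flags, using a normal-approximation (Wald) interval. The function name `mean_accuracy_ci`, the 0/1 input encoding, and the choice of interval are assumptions made for this example, not the authors' stated method.

```python
import math


def mean_accuracy_ci(outcomes, z=1.96):
    """Return (mean, lower, upper) for a sequence of 0/1 correctness flags.

    Hypothetical helper: uses a normal-approximation (Wald) interval,
    which is an assumption and may differ from the method used for the figure.
    """
    n = len(outcomes)
    if n == 0:
        raise ValueError("need at least one outcome")
    mean = sum(outcomes) / n
    # Half-width of the interval: z * standard error of a proportion.
    half_width = z * math.sqrt(mean * (1 - mean) / n)
    # Clamp to [0, 1] because accuracy is a proportion.
    return mean, max(0.0, mean - half_width), min(1.0, mean + half_width)


# Illustrative data only: 80 correct answers out of 100 questions.
example_outcomes = [1] * 80 + [0] * 20
mean, lower, upper = mean_accuracy_ci(example_outcomes)
print(f"mean accuracy = {mean:.3f}, 95% CI = [{lower:.3f}, {upper:.3f}]")
```

For small per-specialty sample sizes or accuracies near 0 or 1, a Wilson or bootstrap interval is usually a safer choice than the Wald interval shown here.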