Fig. 1: Performance of ChatGPT on the MRCOG Part One examination.
From: Exploring the capabilities of ChatGPT in women’s health: obstetrics and gynaecology

Performance varied significantly across the four knowledge domains (χ² = 9.85; p = 0.02). The highest accuracy was observed in the “Illness” ___domain at 80.0% (95% confidence interval [CI]: 73.3–85.7), and the lowest in “Measurement and Manipulation” at 65.7% (95% CI: 58.8–72.7). Within each ___domain, ChatGPT’s accuracy did not differ substantially across individual subjects (___domain-specific p-values: Cell Function, p = 0.08; Human Structure, p = 0.07; Illness, p = 0.49; Measurement and Manipulation, p = 0.11). The highest-scoring subjects per ___domain were Biochemistry (79.8% [95% CI: 71.4–88.1], Cell Function), Embryology (80.4% [95% CI: 70.0–90.8], Human Structure), Clinical Management (83.3% [95% CI: 68.4–98.2], Illness), and Pharmacology (75.4% [95% CI: 64.3–86.6], Measurement and Manipulation). The lowest-scoring subjects were Physiology (65.3% [95% CI: 56.1–74.6], Cell Function), Anatomy (63.2% [95% CI: 54.0–72.4], Human Structure), Immunology (70.0% [95% CI: 53.6–86.4], Illness), and Biophysics (51.4% [95% CI: 35.2–67.5], Measurement and Manipulation).
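For readers who want to reproduce this kind of analysis, the sketch below shows how a chi-square test of independence across domains and normal-approximation (Wald) 95% confidence intervals for per-___domain accuracy could be computed. The per-___domain correct/incorrect counts are hypothetical placeholders, not the study’s data, and the paper’s exact CI method may differ (e.g. Wilson intervals).

```python
# Minimal sketch, assuming hypothetical per-___domain counts (NOT the study's data).
# Demonstrates a chi-square test across domains and Wald 95% CIs for accuracy.
import numpy as np
from scipy.stats import chi2_contingency, norm

domains = ["Cell Function", "Human Structure", "Illness", "Measurement and Manipulation"]

# Rows: domains; columns: [correct, incorrect] -- placeholder counts for illustration.
counts = np.array([
    [130, 50],
    [125, 55],
    [144, 36],
    [115, 60],
])

# Chi-square test of whether accuracy differs across the four domains.
chi2, p, dof, _ = chi2_contingency(counts)
print(f"chi2 = {chi2:.2f}, dof = {dof}, p = {p:.3f}")

# Wald (normal-approximation) 95% CI for each ___domain's accuracy.
z = norm.ppf(0.975)
for name, (correct, wrong) in zip(domains, counts):
    n = correct + wrong
    acc = correct / n
    se = np.sqrt(acc * (1 - acc) / n)
    print(f"{name}: {acc:.1%} (95% CI {acc - z * se:.1%}-{acc + z * se:.1%})")
```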