Fig. 4: Comparative entropy distribution for correct and incorrect responses by ChatGPT. | npj Women's Health


From: Exploring the capabilities of ChatGPT in women’s health: obstetrics and gynaecology


This box-and-whisker plot shows the entropy of ChatGPT's responses, stratified by whether ChatGPT's answer was correct, for questions with correct exam answers. The y-axis represents entropy, a measure of uncertainty; higher values indicate greater uncertainty. The blue box represents responses where ChatGPT's answers were correct, with a median entropy of 1.46 (IQR: 0.44–1.77). The red box represents responses where ChatGPT's answers were incorrect, with an identical median entropy of 1.46 (IQR: 0.67–1.77; p < 0.001). Although the difference between the two distributions is statistically significant, the identical medians indicate that ChatGPT's typical level of uncertainty is much the same whether its answer is correct or incorrect, which calls into question the accuracy of the model's self-assessment.
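As a rough illustration of how an entropy value of this kind can be obtained, the sketch below computes Shannon entropy over a set of repeated answers to one question. The caption does not state how entropy was calculated, so the repeated-sampling setup, the base-2 logarithm, and the function name `shannon_entropy` are assumptions made for this example only, not the authors' method.

```python
import math
from collections import Counter


def shannon_entropy(answers):
    """Shannon entropy (in bits) of the empirical distribution of answers.

    Hypothetical helper: assumes each element is the option chosen on one
    repeated run of the same question.
    """
    if not answers:
        raise ValueError("answers must not be empty")
    total = len(answers)
    entropy = 0.0
    for count in Counter(answers).values():
        p = count / total
        entropy -= p * math.log2(p)
    return entropy


# Same option chosen on every run: no uncertainty.
consistent = ["A"] * 8
# Choices spread evenly over four options: maximal uncertainty for four options.
evenly_split = ["A", "A", "B", "B", "C", "C", "D", "D"]

print(shannon_entropy(consistent))
print(shannon_entropy(evenly_split))
```

Under these assumptions, a question answered identically on every run scores zero, and the score rises as the answers spread across more options, which matches the caption's reading of higher entropy as greater uncertainty.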
