Fig. 1: Performance of human (purple), GPT-4 (dark blue), GPT-3.5 (light blue) and LLaMA2-70B (green) on the battery of theory of mind tests. | Nature Human Behaviour

Fig. 1: Performance of human (purple), GPT-4 (dark blue), GPT-3.5 (light blue) and LLaMA2-70B (green) on the battery of theory of mind tests.

From: Testing theory of mind in large language models and humans

Fig. 1

a, Original test items for each test showing the distribution of test scores for individual sessions and participants. Coloured dots show the average response score across all test items for each individual test session (LLMs) or participant (humans). Black dots indicate the median for each condition. P values were computed from Holm-corrected Wilcoxon two-way tests comparing LLM scores (n = 15 LLM observations) against human scores (irony, N = 50 human participants; faux pas, N = 51 human participants; hinting, N = 48 human participants; strange stories, N = 50 human participants). Tests are ordered in descending order of human performance. b, Interquartile ranges of the average scores on the original published items (dark colours) and novel items (pale colours) across each test (for LLMs, n = 15 LLM observations; for humans, false belief, N = 49 human participants; faux pas, N = 51 human participants; hinting, N = 48 human participants; strange stories, N = 50 human participants). Empty diamonds indicate the median scores, and filled circles indicate the upper and lower bounds of the interquartile range. P values shown are from Holm-corrected Wilcoxon two-way tests comparing performance on original items against the novel items generated as controls for this study.

Source data

Back to article page