Extended Data Table 3 Inter-rater agreement for medical expert annotations


  1. Inter-rater agreement for medical expert annotations assessing the clinical LLM and the patient-AI agent. Each cell reports the number of evaluations with inter-rater agreement out of the total number of evaluations (agreement/total) for each model (GPT-4, GPT-3.5, Mistral-v2-7b and LLaMA-2-7b) and each question (Q1: Did the clinical LLM stop asking questions when only a single most likely diagnosis was possible? Q2: Did the clinical LLM elicit the relevant medical history from the vignette? Q3: Did the patient-AI agent use medical terminology in its responses?) (Methods).
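
The cell values can be read as simple agreement tallies. As a rough illustration only (the record fields, rater labels and example data below are hypothetical and not taken from the paper), a minimal Python sketch that counts agreement/total per model and question:

```python
from collections import defaultdict

def agreement_counts(annotations):
    """Count, per (model, question), the evaluations where both raters agree.

    `annotations` is a list of records such as
    {"model": "GPT-4", "question": "Q1", "rater_a": "yes", "rater_b": "yes"};
    these field names are illustrative assumptions, not from the paper.
    Returns {(model, question): (n_agree, n_total)}.
    """
    counts = defaultdict(lambda: [0, 0])
    for rec in annotations:
        key = (rec["model"], rec["question"])
        counts[key][1] += 1  # total evaluations for this cell
        if rec["rater_a"] == rec["rater_b"]:
            counts[key][0] += 1  # evaluations with inter-rater agreement
    return {k: tuple(v) for k, v in counts.items()}

# Example: one hypothetical cell, printed in the table's "agreement/total" form.
example = [
    {"model": "GPT-4", "question": "Q1", "rater_a": "yes", "rater_b": "yes"},
    {"model": "GPT-4", "question": "Q1", "rater_a": "no",  "rater_b": "yes"},
]
n_agree, n_total = agreement_counts(example)[("GPT-4", "Q1")]
print(f"{n_agree}/{n_total}")  # -> 1/2
```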