Fig. 3: Performance analysis of GPT-4V with kNN ICL on the PatchCamelyon and MHIST datasets.
From: In-context learning enables multimodal large language models to classify cancer pathology images

This figure is divided into two sections, with Panels A and B each showing results for PatchCamelyon (left) and for the MHIST dataset (right). In A, line graphs illustrate the average performance of GPT-4V with kNN-based in-context learning relative to several specialist image classifiers and histopathology foundation models: We first compare GPT-4V with ResNet-18, ResNet-50, and two Vision Transformers (ViT-Tiny and ViT-Small), where the number of ICL samples for GPT-4V equals the number of training samples for the image classifiers (1, top left). We additionally compare against the same vision classifiers trained on the full respective datasets (2, bottom left), and against two histopathology foundation models, Phikon (3, top right) and UNI (4, bottom right). For the foundation models, GPT-4V is compared both against a linear layer trained on top of the frozen pre-trained model (for one, three, five, and ten epochs) and against kNN classification on the model embeddings. Note that in these cases the baseline models are trained on the full datasets; the term '# Samples' then denotes the number of few-shot ICL samples for GPT-4V only. The y-axis displays the average accuracy across all labels, derived from 100,000 bootstrapping steps. All relevant metrics (accuracy with lower and upper confidence bounds) are summarized in Supplementary Tables 1–3. Panel B presents a series of heatmaps showing the absolute and relative performance per label in the zero-, three-, five-, and ten-shot kNN-based sampling scenarios, each with a sample size of n = 60. Lastly, the spider plot in Panel C shows that 10-shot GPT-4V outperforms two ResNet models and two Vision Transformers on both datasets when compared under matched conditions, i.e., with equal numbers of ICL samples and training samples. Source data are provided as a Source Data file.
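For intuition, the kNN-based ICL sampling described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes reference and test images have already been embedded by some feature extractor, and all function and variable names are hypothetical.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def select_icl_examples(test_embedding, reference_embeddings, reference_labels, k=10):
    """Return indices and labels of the k reference images closest to the
    test image in embedding space.

    The retrieved (image, label) pairs would then be placed into the GPT-4V
    prompt as few-shot examples before the test image is presented.
    """
    # Fit a neighbor index on the labeled reference embeddings.
    nn = NearestNeighbors(n_neighbors=k, metric="cosine")
    nn.fit(reference_embeddings)
    # Query with the single test embedding; kneighbors expects a 2D array.
    _, indices = nn.kneighbors(test_embedding.reshape(1, -1))
    indices = indices[0]
    return indices, reference_labels[indices]

# Hypothetical usage with random stand-in embeddings:
rng = np.random.default_rng(0)
ref_emb = rng.normal(size=(500, 384))    # 500 labeled reference tiles
ref_lab = rng.integers(0, 2, size=500)   # binary labels (e.g., tumor / no tumor)
idx, labels = select_icl_examples(rng.normal(size=384), ref_emb, ref_lab, k=10)
```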
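The linear-probe baseline for Phikon and UNI can likewise be approximated by training a single linear layer on frozen, precomputed embeddings for a fixed number of epochs. The PyTorch sketch below uses full-batch optimization for brevity; the hyperparameters and dimensions are illustrative assumptions, not the paper's configuration.

```python
import torch
from torch import nn

def train_linear_probe(embeddings, labels, num_classes, epochs=10, lr=1e-3):
    """Train a single linear layer on frozen, precomputed embeddings.

    `embeddings` is an (N, D) float tensor of features from a frozen backbone
    (e.g., Phikon or UNI); the backbone itself is never updated.
    """
    probe = nn.Linear(embeddings.shape[1], num_classes)
    optimizer = torch.optim.Adam(probe.parameters(), lr=lr)
    loss_fn = nn.CrossEntropyLoss()
    for _ in range(epochs):
        optimizer.zero_grad()
        loss = loss_fn(probe(embeddings), labels)  # full-batch for brevity
        loss.backward()
        optimizer.step()
    return probe

# Hypothetical usage with random stand-in features:
feats = torch.randn(1000, 768)            # 1000 tiles, 768-dim embeddings
targets = torch.randint(0, 2, (1000,))    # binary labels
probe = train_linear_probe(feats, targets, num_classes=2, epochs=10)
```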
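The bootstrapped accuracies and confidence bounds reported on the y-axis can be computed as in the sketch below, which resamples the test-set predictions with replacement 100,000 times and takes percentile bounds. The percentile method and the 95% level are assumptions, as the caption does not specify them.

```python
import numpy as np

def bootstrap_accuracy(y_true, y_pred, n_boot=100_000, alpha=0.05, seed=0):
    """Estimate mean accuracy and a percentile confidence interval by
    resampling the test-set predictions with replacement n_boot times."""
    rng = np.random.default_rng(seed)
    n = len(y_true)
    accuracies = np.empty(n_boot)
    for b in range(n_boot):
        idx = rng.integers(0, n, size=n)  # draw n indices with replacement
        accuracies[b] = np.mean(y_true[idx] == y_pred[idx])
    lower, upper = np.quantile(accuracies, [alpha / 2, 1 - alpha / 2])
    return accuracies.mean(), lower, upper

# Hypothetical usage:
y_true = np.array([0, 1, 1, 0, 1] * 12)   # n = 60, matching Panel B's sample size
y_pred = np.array([0, 1, 0, 0, 1] * 12)
mean_acc, lo, hi = bootstrap_accuracy(y_true, y_pred)
```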