Abstract
Medical image classification requires labeled, task-specific datasets that are used to train deep learning networks de novo, or to fine-tune foundation models. However, this process is computationally and technically demanding. In language processing, in-context learning provides an alternative, where models learn from within prompts, bypassing the need for parameter updates. Yet, in-context learning remains underexplored in medical image analysis. Here, we systematically evaluate the model Generative Pretrained Transformer 4 with Vision capabilities (GPT-4V) on cancer image processing with in-context learning on three cancer histopathology tasks of high importance: classification of tissue subtypes in colorectal cancer, colon polyp subtyping and breast tumor detection in lymph node sections. Our results show that in-context learning is sufficient to match or even outperform specialized neural networks trained for particular tasks, while only requiring a minimal number of samples. In summary, this study demonstrates that large vision language models trained on non-___domain-specific data can be applied out of the box to solve medical image-processing tasks in histopathology. This democratizes access to generalist AI models for medical experts without a technical background, especially in areas where annotated data is scarce.
Introduction
Artificial intelligence (AI) is about to transform healthcare. While its potential is immense, it also presents unique challenges in medicine, arising from the field’s inherent complexity and the critical need for accuracy and reliability1. In recent years, AI applications have been developed that focus on specific areas, such as computer vision models in radiology2 and pathology3, or skin cancer detection4 in oncology.
Histopathology plays a central role in diagnosing diseases, notably cancer, and has consistently been at the forefront of computational advancements in medicine5. Recent developments have enabled the detection of cancer subtypes6 and biomarkers like genetic alterations7, which can potentially stratify and improve patient care directly from routine hematoxylin and eosin (H&E) stained microscopic images7. The current gold standard for computational pathology is training vision foundation models8 based on a vast and diverse dataset of images that can easily be customized for clinically relevant applications9,10. However, these foundation models need a substantial volume of ___domain-specific images during training and are restricted to vision applications only. Moreover, before being applied to a medical task, these models require an additional re-training stage (fine-tuning) that is in itself computationally demanding11 and requires additional annotated training data. This last step needs to be repeated for every potential application, which limits researchers’ ability to develop these models at scale.
In-context learning (ICL)—a concept borrowed from the field of natural language processing (NLP)—could provide a possible solution to this problem. The ability of large language models (LLMs) to learn from a few handcrafted examples provided alongside the prompt holds great potential and has been shown to improve model performance12. A practical implementation in a medical setting might involve presenting the LLM with a detailed clinical scenario, such as a complex oncology case, accompanied by several comparable instances that illustrate different strategies for solving a particular challenge. This approach is called few-shot prompting. Numerous methodologies have been developed utilizing in-context learning. Their foundational principles are explained in detail in the ‘Supplementary Methods: In-Context Learning’ section.
In the medical field, one model has recently been built upon the aforementioned paradigms: MedPrompt13, which is based on the GPT-4 architecture. Central to this method is a k-nearest neighbor (kNN) search, which helps identify the most relevant few-shot examples for a specific clinical input. This process involves comparing text embeddings (numeric representations of text) of candidate examples with the embedding of the input in question and then selecting the samples with the closest alignment. We highlight further implementation details of this approach, as it has partial overlap with the methods developed in our study, in the ‘Supplementary Methods: Related Work—Enhancing LLM strategies’ section.
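In essence, such example selection boils down to ranking candidate examples by embedding similarity to the query. The following minimal sketch illustrates this principle in Python; it is not MedPrompt's actual implementation, and all function and variable names are illustrative.

```python
import numpy as np

def select_few_shot_examples(query_emb, example_embs, examples, k=5):
    """Return the k candidate examples whose embeddings are most similar
    (cosine similarity) to the query embedding."""
    q = query_emb / np.linalg.norm(query_emb)
    e = example_embs / np.linalg.norm(example_embs, axis=1, keepdims=True)
    sims = e @ q                          # cosine similarity to every candidate
    top = np.argsort(sims)[::-1][:k]      # indices of the k most similar candidates
    return [examples[i] for i in top]
```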
However, a major shortcoming of this approach is its restriction to text-based tasks. Medicine is a highly multimodal discipline, where a comprehensive understanding of a patient’s symptoms or diagnoses requires information from diverse data sources such as radiographic and microscopic imaging, clinical reports, laboratory values, and electronic health records14. Only recently has the AI community entered the field of vision language models (VLMs), exemplified by the release of GPT-4V15, the announcement of Google DeepMind’s Gemini16 family, and open-source variants like LLaVA17.
Building on the trend of large vision language foundation models, we hypothesize that the principles applied for in-context learning of text-based models can be equally effective when extended to multimodal scenarios, such as medical imaging. In the non-medical setting, robust evidence for in-context learning with images has already been established18. Especially in the medical field, where generating annotated ground-truth data presents a critical challenge, the potential performance improvements from this approach could be immensely beneficial. This is also relevant for underrepresented medical cases, such as rare tumor types, which are insufficiently represented in traditional deep-learning training pipelines. Moreover, the concurrent integration of textual, theoretical knowledge and visual information could pave the way toward a more holistic understanding of multidimensional medical data.
In this study, we present results of benchmarking the efficacy of in-context learning with GPT-4V against dedicated image classifiers across three histopathology benchmarking datasets. Notably, we demonstrate that the performance of GPT-4V in tissue classification can be improved through in-context learning and is on par with specialist computer vision models. This advancement casts doubt on the necessity of developing task-specific deep learning models in the future and democratizes access to generalist AI models to accelerate medical research.
Results
In-context learning with medical images improves classification accuracy for histopathology
In this study, we hypothesize that few-shot prompting can improve the performance of foundation vision models. This effect has been demonstrated for text-only tasks but remains unclear for biomedical images12,18. We provide a high-level overview of our evaluation datasets (Fig. 1A) and the overall experimental concept in Fig. 1B. We first evaluate this hypothesis on a binary classification task between tumor (TUM) and non-tumorous normal mucosa (NORM) tissue tiles from the CRC100K dataset19. As shown in Fig. 2A, GPT-4V only marginally surpasses the expectation of random guessing when used in a zero-shot setting, attaining an accuracy of 61.7% (CI: 0.5–0.733). In-context learning changes this situation: We see a consistent improvement in classification accuracy with increasing numbers of few-shot samples, with an accuracy of 66.7% in the three-shot sampling setting (CI: 0.55–0.783), 78.3% for five-shot sampling (CI: 0.667–0.883) and an accuracy of 90% when showing 10 images of each class to the model (CI: 0.817–0.967). In our subsequent ablation study (Fig. 2B), we compare random versus kNN sampling across the MHIST20 and PatchCamelyon21 (PCAM) datasets. From a zero-shot baseline that again barely performs better than random guessing (MHIST accuracy 56.7%, CI: 0.433–0.683; PCAM accuracy 60%, CI: 0.467–0.717), we see that in both datasets, random image sampling can improve classification accuracy. These results can further be improved by selecting the sampled images based on their similarity to the target image (kNN sampling), which results in the best-achieved accuracy of 83.4% and 88.3% for detecting sessile serrated adenoma over hyperplastic polyps (MHIST, CI: 0.733–0.917) and lymph-node metastases from breast cancer versus tumor-free lymphatic tissue (PCAM, CI: 0.8–0.95) in a ten-shot setting.
This figure presents a systematic overview of the three histopathology benchmarking datasets, detailing the number of samples incorporated in our study (Panel A). A selection of random test images was drawn from each of these datasets for evaluation using three distinct methodologies: zero-shot classification (Method 1), random few-shot sampling (Method 2), and kNN-based selection (Method 3). For the latter, feature extraction was performed using the Phikon ViT-B 40 M Pancancer model (*). Cosine similarity was used as the comparison metric between the target image and its closest k neighbors in embedding space. As a benchmark against GPT-4V ICL, we trained four image classifiers (indicated by +, namely ResNet-18, ResNet-50, ViT-Tiny, and ViT-Small) via transfer learning from ImageNet for each target image (Panel B). For an in-depth understanding of these methods, please refer to Algorithm 1 and the Experimental Design section. * The BACK (background) label was excluded from the analysis.
Panel A shows that classification accuracy on a simple task detecting tumor (TUM) versus non-tumor (NORM) tiles from the CRC100K dataset can drastically be improved by leveraging ICL through randomly sampled, few-shot image samples. Additionally, we compare random and kNN-based image sampling on two datasets and show that kNN-based image sampling improves model performance in classifying images from both MHIST (left) and PatchCamelyon (right), especially when scaling the number of few-shot samples (Panel B). Note that samples have been slightly shifted on the x-axis for visibility. The y-axis denotes the mean accuracy with lower and upper 2.5% confidence intervals (CIs) from 100,000 bootstrap iterations for both panels, respectively. Source data are provided as a Source Data file.
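The accuracies and confidence intervals reported throughout the figures follow a percentile bootstrap over the per-image results. A minimal sketch of such a computation is shown below, assuming a list of per-sample 0/1 correctness values; this is an illustration rather than the authors' exact analysis code.

```python
import numpy as np

def bootstrap_accuracy(correct, n_boot=100_000, alpha=0.05, seed=0):
    """Mean accuracy with a percentile bootstrap CI from per-sample 0/1 correctness."""
    rng = np.random.default_rng(seed)
    correct = np.asarray(correct)
    # Resample the per-image correctness indicators with replacement and
    # recompute the accuracy for every resample.
    boots = rng.choice(correct, size=(n_boot, correct.size), replace=True).mean(axis=1)
    lower, upper = np.quantile(boots, [alpha / 2, 1 - alpha / 2])
    return correct.mean(), lower, upper

# e.g., for the 60 per-image correctness flags of one test set this returns
# (mean accuracy, lower 2.5% bound, upper 97.5% bound).
```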
In summary, these results demonstrate that in-context learning can improve the performance of foundation vision models in classifying histopathology images. Moreover, we show that kNN sampling can further enhance accuracy over random sampling, especially when increasing the number of images that are shown to the model. Corresponding metrics can be found in Tables 1 and 2, with the best-performing method highlighted in bold.
Vision-language models can achieve performance on par with retrained vision classifiers
Next, we compare few-shot sampling with the previous status quo7 in image classification, which involves retraining models from ImageNet weights. As an initial comparison, we train one distinct model for each target image shown to GPT-4V, with the identical images used for in-context learning as the training set. This approach reveals that in-context learning is sufficiently robust to achieve results that are on par with, or even surpass, specialized narrow image classifiers under the same conditions. Specifically, the ten-shot in-context learning GPT-4V approach not only matches but exceeds the performance of all other models (Fig. 3A), leading to a classification accuracy of 83.3% for MHIST (CI: 0.733–0.917) and 88.3% for PatchCamelyon (CI: 0.8–0.95), outperforming the second-best model, ViT-Tiny, by 3.3% and 6.6%, respectively. Notably, in the case of PatchCamelyon, even three- and five-shot prompting was sufficient to outperform all other models in this setting. We show a detailed comparison of each model’s evaluation metrics in Supplementary Table 1. Further, we extend this comparison to include two scenarios: Firstly, we show that GPT-4V’s performance through in-context learning can partially match that of previously mentioned vision models, even when those models have been trained on the complete training datasets, such as CRC100K, which includes tens of thousands of tiles (Supplementary Table 2). Secondly, we compare the efficacy of in-context learning with the current gold standards in histopathology image classification—specifically using the Phikon and UNI models as examples. All evaluation metrics, including specifications on training parameters, are shown in Supplementary Tables 3 and 4. The results indicate that in-context learning significantly narrows the performance gap between GPT-4V and these models. For instance, in-context learning reduced the disparity from 36.6% (zero-shot GPT-4V) to a 10% difference (ten-shot GPT-4V) relative to the performance achieved by kNN classification with the Phikon feature extractor on the MHIST dataset. We also discovered that GPT-4V demonstrated remarkable zero-shot capabilities for some of the targets: For PatchCamelyon, it correctly identified all tumor tiles, albeit with a high false positive rate of 80%. In the MHIST dataset, it correctly recognized 83% of sessile serrated adenomas (SSA) but only 30% of hyperplastic polyps (Fig. 3B). Considerable improvements could be observed with few-shot prompting. In the case of PatchCamelyon, the model’s ability to identify normal lymph node tissue progressively increased with the number of example images, from an accuracy of 67% for three-shot and 77% for five-shot to 80% for ten-shot image prompting. Similarly, for MHIST, the correct identification of hyperplastic polyps could be increased from 30% (zero-shot) to close to 90% (ten-shot). Notably, these enhancements did not compromise the model’s performance in detecting tumors in the PatchCamelyon dataset or SSAs in the MHIST dataset (Fig. 3C). These findings show that in-context learning with microscopic images can achieve an accuracy on par with fine-tuning specialized image classification models.
This figure is divided into two sections, with Panel A and B focusing on PatchCamelyon (to the left) and the MHIST dataset (right subpanel) respectively. In A, line graphs illustrate the average performance of GPT-4V when used with kNN-based in-context learning relative to several specialist image classification and histopathology foundation models: We first compare GPT-4V with ResNet-18, ResNet-50 and two Vision Transformers (ViT-Tiny and ViT-Small) where the number of ICL samples for GPT-4V equals the number of training samples for the image classification models (1, top left). Additionally, we compare the same vision classifiers, trained on the full respective datasets (2, bottom left), and the performance of two histopathology foundation models, Phikon (3, top right) and UNI (4, bottom right). For the latter, we compare GPT-4V against training a linear layer on top of the pre-trained foundation model (for one, three, five, and ten epochs) and kNN classification. Note that in these cases, the models are trained on the full datasets, and the term ’# Samples’ is used to denote the number of few-shot ICL samples for GPT-4V only. The Y-axis displays the average accuracy across all labels, derived from 100,000 bootstrapping steps. All relevant metrics (accuracy, lower and upper confidence intervals) are summarized in Supplementary Tables 1–3. Panel B presents a series of heatmaps, highlighting the absolute and relative performance per label in zero-, three-, five-, and ten-shot kNN-based sampling scenarios, each with a sample size of n = 60. Lastly, the spider plot in Panel C highlights the superiority of 10-shot GPT-4V in classification performance for both datasets when compared under equitable conditions to two ResNet-style models and two vision transformers. Source data are provided as a Source Data file.
In-context learning reduces the performance gap between generalist and histopathology foundation models
In a subsequent evaluation, we tested GPT-4V on the CRC100K dataset, which is more challenging as it consists of a more diverse set of labels. As we increased the number of few-shot image samples, GPT-4V showed considerable improvements in performance. However, it did not achieve the levels observed on other datasets such as PatchCamelyon or MHIST (Fig. 4A). Despite this, there was a significant narrowing of the performance gap between GPT-4V and models fully trained on all data, such as ResNet-18, ResNet-50, ViT-Tiny, and ViT-Small, as well as compared to the downstream performance of the Phikon and UNI models. Initially, the performance deficit of GPT-4V in zero-shot classification relative to kNN stood at 61.7% and 62.5% for Phikon and UNI, respectively. This gap was reduced to 18.3% and 19.2% when using five-shot in-context learning compared to the best scores achieved by Phikon and UNI in any of our settings. While our study does not claim to maximize potential performance across all models, it highlights that ICL can bring us closer to the performance levels of models extensively pretrained on these tasks. Also, the GPT-4V model natively excelled in identifying tumor and muscle tissue, achieving recall scores of 80% and 100%, respectively. However, it failed completely to recognize debris (DEB), adipose tissue (ADI), lymphocytes (LYM), mucus (MUC), and tumor-associated stroma (STR). Three instances are particularly noteworthy: lymphocytes were consistently misclassified as tumor tissue, debris was incorrectly categorized as tumor in 93% of cases, and stroma was misclassified as muscle tissue in 87% of instances. The addition of few-shot examples led to a substantial improvement. The best results are achieved with five-shot kNN sampling, where the model receives a total of 40 sample images. This leads to enhanced accuracy across all labels (Fig. 4B). A clear trend of continuous performance gains is evident as the number of few-shot samples is increased, demonstrating consistent improvements at each stage of the process (from zero- to one-, one- to three-, and three- to five-shot prompting) for almost all labels (LYM, MUC, NORM, STR), with the exception of debris (Fig. 4C). Details on confidence intervals are summarized in Supplementary Table 1. In summary, our findings underline the potential of few-shot image learning in GPT-4V, even in a multiclass classification setting.
The line graphs (Panel A) show the comparative average performance of GPT-4V with kNN-based in-context learning against the four image classification models (1) when trained on the same number of images as used as example images for in-context learning with GPT-4V. Additionally, we show how in-context learning can reduce the performance gap between GPT-4V and the respective image classifiers when trained on the entire datasets respectively (2) as well as in comparison to the state-of-the-art foundation models Phikon and UNI. # Samples refers to the count of few-shot ICL samples for GPT-4V and training samples for the other models in 1, while for all other settings, the models are trained on the entire training data. The y-axis represents the mean accuracy across all labels, computed using 100,000 bootstrapping iterations. Detailed average accuracy values, including confidence intervals, are summarized in Supplementary Table 1. Panel B features confusion matrices for GPT-4V in both zero-, and five-shot kNN-based sampling scenarios (n = 120 samples). The spider plot showcases the average classification accuracy per label per number of kNN-sampled shots, revealing a general trend towards increased classification accuracy across most labels with scaling of the number of few-shot image samples (Panel C). Source data are provided as a Source Data file.
Image in-context learning improves text-based reasoning
Vision-language models enable multimodal understanding. To more accurately evaluate the impact of few-shot image sampling on textual reasoning within VLMs, we further investigated the output of GPT-4V and created text embeddings using Ada-002. Next, we utilized t-distributed Stochastic Neighbor Embedding (t-SNE) to analyze the semantic space of the model’s reasoning. Our results demonstrated distinct clusters of embeddings when compared to the model’s final answer (Fig. 5A, top), underscoring a potential correlation between text and image data. However, in the zero-shot scenario, the comparison of text embeddings to ground truth labels revealed that the model’s intrinsic reasoning correlated only poorly with the correct categorization of the images (labels). This suggests a limitation in the model’s ability to independently navigate to the correct label based solely on its learned representations. In contrast, the application of few-shot learning techniques improved the separation of text embeddings corresponding to different answers and labels. This enhancement is evident from the formation of a greater number of distinct clusters and more accurate alignment of data points with their respective ground truth categories, as shown in Fig. 5A. Moreover, the implementation of few-shot learning was associated with increased silhouette scores, indicating closer proximity of data points to their correct labels as the number of example images provided to the model increased. Collectively, these findings suggest that employing few-shot learning techniques can enhance the model’s capacity to analyze and interpret test images more accurately, thereby refining its decision-making process.
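As a rough illustration of this analysis, the sketch below embeds the model's free-text reasoning ('thoughts') with Ada-002 and computes a silhouette score against the ground-truth labels. The client setup, helper name, and the choice to score the raw embeddings are assumptions; in the paper, silhouette scores are reported per t-SNE projection (Fig. 5A).

```python
import numpy as np
from openai import OpenAI
from sklearn.metrics import silhouette_score

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def embed_thoughts(thoughts):
    """Embed GPT-4V's free-text reasoning with the Ada-002 embedding model."""
    resp = client.embeddings.create(model="text-embedding-ada-002", input=thoughts)
    return np.array([item.embedding for item in resp.data])

# thoughts: list of reasoning strings, labels: ground-truth class per test image
# embs = embed_thoughts(thoughts)
# score = silhouette_score(embs, labels)   # higher = tighter label-wise clustering
```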
Panel A depicts the workflow, starting from GPT-4V’s initial prediction and its reasoning process (‘thoughts’), to the generation of text feature embeddings with Ada 002. The panel of t-SNEs demonstrates the evolution from a zero-shot framework on the far left, advancing through one-, three-, and five-shot kNN sampling to the right. All data is obtained from the CRC100K dataset. In the t-SNE plots, color coding distinguishes between the model’s final classifications (‘Answers’, top) and the ground truth (’Labels’, bottom). The introduction of few-shot image sampling noticeably refines the model’s textual reasoning, as evidenced by the formation of more distinct clusters in alignment with the model’s own responses (top) and the underlying ground truth (bottom). S denotes silhouette scores, which are calculated for each t-SNE. Complementary to these visualizations, Supplementary Fig. 2 features word clouds that further illustrate the alignment of the model’s vocabulary with clinical diagnoses, highlighting key terms such as “lymph node” for normal tissue and “metastatic / breast cancer” for malignancies, thereby enhancing the interpretability of the model’s diagnostic reasoning process. In Panel B, we present two exemplary scenarios to demonstrate the potential superiority of integrated vision-language models over stand-alone image classification models. On the left, an image is displayed where the original annotation identified the sample as stroma (STR), yet GPT-4V categorizes it as tumor (TUM). The rationale provided by the model appears plausible, notably pointing out several abnormally shaped nuclei, visible, for instance, in the lower right corner. This sample indeed appears to represent a borderline case. When comparing the top 500 closest patch embeddings to the reference image, a dominant fraction is classified as tumor (67%), with a lesser proportion being labeled as stroma (32%) and a negligible percentage (<1%) as lymphocytes or regular colon epithelium. The exploration of GPT-4V’s interpretive process can help identify and understand such complex edge cases that go beyond what is possible with conventional image classifiers alone. Right: Chicken-wire patterns are described in the histology of liposarcoma, which arises from adipocyte precursor cells. This description stems from its resemblance to chicken wire fences (shown to the right). GPT-4V effectively leverages this knowledge from another context to describe the morphology of the adipocytes shown in this image. This way of performing ‘transfer learning’ could have strong implications in teaching. * The image name in the CRC100K cohort is STR-TCGA-VEMARASN. + The image name in the CRC100K cohort is ADI-TCGA-QFVSMHDD.
To showcase the benefits multimodality might have in histopathology, we present two illustrative cases from our study. Figure 5B (left) depicts a scenario where GPT-4V falsely classifies an image as a tumor, while the underlying ground truth was considered to be stroma.
However, GPT-4V’s detailed reasoning, identifying morphological signs indicative of cancer, reveals the presence of tumor cells characterized by irregularly shaped nuclei. Analyzing the 500 closest image embeddings in feature space shows a similar trend, with two-thirds of image embeddings being categorized as tumors. Another case, shown in Fig. 5B (right), demonstrates GPT-4V’s proficiency in transferring knowledge from different domains to draw the right conclusions. According to existing literature22, the term “chicken wire pattern” is established within the ___domain of pathology, yet only regarding the appearance of adipose tissue in liposarcomas and other malignancies. However, it is not frequently used to describe the architecture of normal, healthy adipose tissue. The capability of GPT-4V to transfer its understanding of the physical appearance of chicken wire to the shape of adipose tissue in histopathology demonstrates the ability for transfer learning and holds potential in areas like AI explainability and teaching. Overall, these data indicate that vision language models possess substantial potential for medical image classification in histopathology, utilizing only a few sample images. This capability may provide inherent advantages over traditional image classifiers, due to their multimodal architecture.
Discussion
Foundation models have demonstrated substantial promise in medical image processing. Zhou et al. trained such a system using 1.6 million retinal images and illustrated that they could then fine-tune it with fewer annotated images to assist clinicians in identifying a range of ocular diseases23. Yet, the vast amount of data that is required and the necessity to develop one specific fine-tuned version for each clinical task currently constrain the training of these models at scale, limiting their utility to researchers with extensive knowledge in computer science and access to the required hardware. Furthermore, the applicability of these models has been confined to the visual ___domain only. Nonetheless, learning is a multimodal process. For example, in pathology, practitioners and students assimilate their knowledge by extracting visual patterns from images and synthesizing them with corresponding textual annotations. In sum, the ideal scenario would be AI systems that seamlessly combine multimodal information in a data-efficient manner while having the flexibility to adapt their behavior to any given task on demand, without the need for traditional retraining.
In this study, we demonstrate a proof of concept illustrating that achieving these properties is possible with in-context learning on vision language models, exemplified by GPT-4V: We show that this method is not only effective for classifying medical microscopy images but also achieves performance comparable to conventional image classification models, and that in-context learning provides a data- and resource-efficient way to drastically narrow the performance gap between generalist foundation models and histopathology foundation models like Phikon and UNI, which are trained on a large corpus of microscopic images. We show that five to ten sample images per label are enough for GPT-4V to achieve classification accuracy scores close to the current gold standard models. These results are encouraging, especially considering that other current state-of-the-art pathology foundation models like Paige’s Virchow24 report performance metrics in the same range as our method, with reported accuracy scores of 82.7% compared to 83.3% for GPT-4V on the MHIST dataset and 92.7% versus 88.3% for GPT-4V on PatchCamelyon. For MHIST, we must note here that we excluded images without full inter-rater agreement, which most likely makes our use case easier than the one used by Vorontsov et al.24. We acknowledge the lack of public access to the training corpus of GPT-4V, which raises the possibility that the model may have been trained on our test sets. Nevertheless, the performance observed in a zero-shot scenario only marginally surpasses random guessing, making it less likely that these data were used for training. We use this zero-shot baseline as a comparison to investigate the benefit of in-context learning. With our approach, we lay the foundation for a general-purpose framework that advances state-of-the-art prompting techniques for images. Additionally, our results indicate that deliberately selecting few-shot examples that are semantically similar to the test image can substantially improve the performance of the model. A notable aspect is the integration of text with vision, which can aid explainability by making a model’s reasoning processes understandable. This addresses a critical limitation of conventional image classifiers, as textual feedback provides humans with a more comprehensible form of interpretability than visual tools such as Grad-CAM25. This aspect is crucial for reliable AI systems in medical applications26.
Some limitations of our work are that experiments were restricted to a small sample size due to the preview status of the GPT-4V API, which currently only permits a limited number of requests. Another limitation in this regard is that we did not include ensembling methods, which would require multiple model iterations over the same task as performed by MedPrompt or Med-PaLM 2; the latter approach requires a total of 44 model calls for a single task27. Moreover, it is worth noting that the performance of in-context learning with images sometimes yields suboptimal results, particularly in classes like debris, mucus, and stroma within the CRC100K dataset. This observation is in line with findings by Huang et al.28. While these outcomes have been acknowledged, we leave an in-depth investigation into the underlying reasons and the development of potential solutions as subjects for future research. Moreover, further exploration of prompt engineering techniques beyond Chain-of-Thought, like Tree-of-Thought29, could be used to optimize the models beyond the current results. Finally, the current reliance of our approach on generating and retrieving image embeddings through a specialized vision model (i.e., Phikon) is a drawback to the philosophy of an all-encompassing foundation VLM, which ideally would not require a specialized vision encoder to generate features. Although we have not tested this at scale due to the rate limits of the API, we speculate that the next generation of foundation models will be able to autonomously manage, embed, and retrieve sample data on demand for few-shot learning. Following the current paradigm of AI scaling laws30,31, it can be estimated that we have not yet reached a plateau in the performance benefits from even more powerful foundation models in the future. Furthermore, our experiments have not indicated any saturation point in model efficacy when increasing the number of k-shot examples; however, further increasing the number of sample images per task was not feasible because it would exceed the model’s context window. Again, this suggests the potential for continued enhancements when further scaling our approach and raises the question of whether researchers need to develop their own specialized deep learning models for each task, particularly when a singular model may suffice in the foreseeable future. In summary, we aim to further scale our work to overcome these limitations and extend it to other domains like radiology imaging. Nevertheless, we believe that in-context learning with images holds great potential for improving the performance of vision language models on biomedical image classification tasks and beyond.
Methods
Ethics statement
This study does not include confidential information. All research procedures were conducted exclusively on publicly accessible, anonymized patient data and in accordance with the Declaration of Helsinki, maintaining all relevant ethical standards. The overall analysis was approved by the Ethics Commission of the Medical Faculty of the Technical University Dresden (BO-EK-444102022).
Datasets
Our benchmarking experiments are conducted on the following, open-source histopathology image datasets:
- CRC-VAL-HE-7K19 is the evaluation set associated with the NCT-CRC-HE-100K dataset, consisting of 7180 image patches extracted from hematoxylin & eosin (H&E) stained formalin-fixed and paraffin-embedded (FFPE) sections from 50 individuals with colorectal cancer. Samples were collected at the NCT Biobank (National Center for Tumor Diseases, Heidelberg, Germany) and the UMM pathology archive (University Medical Center Mannheim, Mannheim, Germany) and digitized at 224 × 224 pixels (px) at a resolution of 0.5 microns per pixel (MPP). Throughout this manuscript, we will refer to this dataset as CRC100K. Following previous studies9,32, the background (BACK) class was excluded from our analysis.
- PatchCamelyon (PCam)21 contains 327,680 H&E stained histologic image patches at 96 × 96 px (0.243 MPP) from human sentinel lymph node sections obtained from the Camelyon16 Challenge, originally split into a training and validation set. Samples are annotated with a binary label to denote the presence or absence of metastatic breast cancer tissue at a balance close to 50/50.
- MHIST20 is a dataset of 3152 H&E-stained FFPE sections from colorectal polyps, collected at the Dartmouth–Hitchcock Medical Center (DHMC), and addresses the challenging problem of discriminating sessile serrated adenoma (SSA) from hyperplastic polyps (HP)33. Images are 224 × 224 px and labeled as either HP or SSA by the majority vote of seven pathologists, resulting in a 3:7 split.
For GPT-4V inference testing, we randomly generated test datasets containing 60 samples for MHIST and PatchCamelyon and 120 samples for CRC100K, with an equal number of samples for each of the available labels. For simplicity, we restricted the test images from the MHIST dataset to those achieving unanimous expert consensus for the presence of SSA or HP, respectively. All images that were used for inference testing are visualized in Supplementary Fig. 1.
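For illustration, such balanced test-set construction can be sketched as follows; the function name and the mapping from labels to file paths are hypothetical.

```python
import random

def balanced_test_set(paths_by_label, n_total, seed=42):
    """Draw an equal number of test tiles per label, e.g., 30 + 30 for MHIST/PCam
    or 15 per class for the eight retained CRC100K labels (120 in total)."""
    per_label = n_total // len(paths_by_label)
    rng = random.Random(seed)
    return {label: rng.sample(paths, per_label)
            for label, paths in paths_by_label.items()}
```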
GPT-4V model specifications
All experiments in this study were performed using the GPT-4V model via the chat completions endpoint of the official OpenAI Python API between November 15 and December 03, 2023. The official model name in the OpenAI API is gpt-4-vision-preview. For simplicity, we will use the term GPT-4V in all subsequent references to this model throughout our manuscript. The temperature was set to 0.1 based on initial experiments, and no other model hyperparameters were modified. For further implementation details, we refer to our official GitHub repository.
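A minimal sketch of such a call to the chat completions endpoint is shown below. The prompt strings are placeholders (the actual prompts are given in Supplementary Tables 5–7), and the helper names and the max_tokens value are our own assumptions rather than the authors' code.

```python
import base64
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def encode_image(path):
    """Base64-encode an image tile for the chat completions endpoint."""
    with open(path, "rb") as f:
        return base64.b64encode(f.read()).decode("utf-8")

def query_gpt4v(system_prompt, user_text, image_path):
    """Send one text-plus-image request to gpt-4-vision-preview at temperature 0.1."""
    response = client.chat.completions.create(
        model="gpt-4-vision-preview",
        temperature=0.1,
        max_tokens=1024,
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": [
                {"type": "text", "text": user_text},
                {"type": "image_url", "image_url": {
                    "url": f"data:image/png;base64,{encode_image(image_path)}"}},
            ]},
        ],
    )
    return response.choices[0].message.content
```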
Text embeddings were created using OpenAI’s default embedding model Ada 002, without further modifications.
Prompting and random few-shot image in-context learning
In the following, we present a brief overview of the implementation of the final prompts used in GPT-4V. For an in-depth explanation of both the system prompt (instructions dictating the expected model behavior) and user prompt (input commands or queries to the model), please refer to Supplementary Tables 5–7. There is currently no standardized blueprint for the development of effective model prompts; rather, this is an iterative, dynamic process driven by trial and error. Our prompting strategies were developed on a selection of ten random image tiles per label from each dataset. Following current best practices, we utilized the system prompt to establish the setting (context) of the model and to guide its expected behavior. In our initial trials with GPT-4V, we encountered several limitations due to the model’s strict policy alignment, which frequently led it to refuse to handle medical data. To address these issues, we modified our approach by presenting test cases as hypothetical scenarios (‘None of your answers are applied in a real-world scenario or have influences on real patients.’) and additionally included a selection of desired and undesired response pairs in the system prompt. To simplify the analysis of the results, we also configured GPT-4V to generate answers in JavaScript Object Notation (JSON) format. This included a structured template containing a field for logical reasoning (‘thoughts’), the final ‘answer’, as well as a certainty ‘score’.
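Because the model is instructed to return a JSON object with 'thoughts', 'answer', and 'score' fields, its replies can be parsed along the following lines. This is a sketch under the assumption that replies occasionally arrive wrapped in a markdown code fence; it is not the authors' parsing code.

```python
import json

REQUIRED_FIELDS = {"thoughts", "answer", "score"}

def parse_reply(raw):
    """Parse GPT-4V's JSON-formatted reply and verify the expected fields."""
    # Strip an optional markdown code fence around the JSON payload.
    text = raw.strip().removeprefix("```json").removesuffix("```").strip()
    reply = json.loads(text)
    missing = REQUIRED_FIELDS - reply.keys()
    if missing:
        raise ValueError(f"missing fields: {missing}")
    return reply

# Example:
# parse_reply('{"thoughts": "...", "answer": "TUM", "score": 0.9}')
```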
Regarding the user prompt, we differentiate between the zero- and few-shot settings. In the zero-shot scenario, we started by enumerating all possible label options, followed by guiding the model to adopt step-wise reasoning akin to Chain-of-Thought (CoT) prompting29. This was followed by a compilation of dataset-specific considerations: For instance, in the CRC100K dataset, we observed that the model would almost always choose to classify an image tile as a tumor whenever detecting malignant cells, despite simultaneously recognizing that the majority of cells were lymphocytes. To counteract these dataset-specific pitfalls, we included concise guidelines at this step (Supplementary Tables 5–7). Finally, GPT-4V was asked to thoroughly examine the appended patient image and provide its answer as described above.
In the few-shot sampling prompts, we presented a sequence of k example images (where k equals 1, 3, 5, or 10), each followed by its corresponding label, in a repeated pattern: Specifically, we presented a single image corresponding to each label y, cycling through the entire set of labels k times. This means that for every test image, we present the model with a total of (k * y) + 1 images, where y denotes the number of possible labels. This setting is the same for the kNN-based sample selection, which we further highlight below and in Box 1. Each image was prefaced with the phrase ‘The following image contains {y}:’ for any possible label y. GPT-4V was then instructed to closely compare and extract meaningful knowledge from the images for subsequent comparison with the target image. Beyond this, the structure of the prompt remained consistent with the zero-shot template. Moreover, to the best of our knowledge, we followed all known best practices and prompting tricks (e.g., ‘Take a deep breath’)34. To mitigate the risk of overfitting to the samples used during the refinement of the system and user prompts, we ensured that these samples were not included in the generation of inference test data. More specifically, we performed an initial investigation of zero-shot and random few-shot performance using an initial dataset comprising 30 random samples, collected exclusively from the CRC100K dataset, each containing either tumor or normal colon epithelium. This initial dataset served a dual purpose: developing effective prompts and providing early insight into model responses. For the following evaluation phase, we collected a new subset of 30 samples. This way, we prevented sample leakage from our prompt creation dataset into our final evaluation test set. This was critical to prevent overfitting that could arise from sample-specific biases we might have included in the prompt. However, these samples were allowed to be part of either random or (as described in the next section) kNN-based sample selections. This process was repeated for every dataset.
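A sketch of how such a few-shot user message can be assembled for the chat completions endpoint is shown below. It reuses the hypothetical encode_image helper from the sketch above, and the closing instruction text is illustrative rather than the exact prompt wording.

```python
def build_few_shot_content(examples_per_label, k, target_image_path):
    """Interleave k labeled example images per label (k * number-of-labels images),
    then append the unlabeled target tile as the final image."""
    content = []
    for i in range(k):                                  # cycle through all labels k times
        for label, image_paths in examples_per_label.items():   # needs >= k paths per label
            content.append({"type": "text",
                            "text": f"The following image contains {label}:"})
            content.append({"type": "image_url", "image_url": {
                "url": f"data:image/png;base64,{encode_image(image_paths[i])}"}})
    content.append({"type": "text",
                    "text": "Now classify the following patient image:"})
    content.append({"type": "image_url", "image_url": {
        "url": f"data:image/png;base64,{encode_image(target_image_path)}"}})
    return content   # used as the 'content' field of a single user message
```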
kNN-based few-shot image sampling
The entire workflow is shown in detail as pseudocode in Box 1. Image feature vectors were created for each of the above-described datasets using the teacher backbone of the ‘Phikon’ Vision Transformer (ViT-B 40 M Pancancer)9, leading to a one-dimensional vector of length 768 for each image tile. During GPT-4V inference, for each test image x, the k closest images of each possible target label y were sampled for kNN-based in-context learning by measuring the cosine similarity in feature space. To prevent the model from learning patient-intrinsic morphologic tissue features as confounders to the desired label, we removed tile embeddings from the same patient if this information was available. Nevertheless, the main goal of our study is the comparison between GPT-4V in-context learning and training of specialized image classifiers. As outlined later, the comparisons remain valid in cases where overlap between the test image and tiles from related patients might occur, because the in-context learning samples and the training samples match exactly. The example images were included in the prompt such that the most similar images for each label were shown to the model first.
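A condensed sketch of this selection step, complementing the pseudocode in Box 1, is given below; array names and the patient-exclusion handling are illustrative, assuming precomputed Phikon tile features.

```python
import numpy as np

def knn_examples_per_label(target_emb, tile_embs, tile_labels,
                           patient_ids, target_patient, k):
    """For each label, return indices of the k tiles most similar (cosine) to the
    target tile in Phikon feature space, ordered most similar first."""
    q = target_emb / np.linalg.norm(target_emb)
    e = tile_embs / np.linalg.norm(tile_embs, axis=1, keepdims=True)
    sims = e @ q                                       # cosine similarity to the target
    if target_patient is not None:
        # Exclude tiles from the same patient to avoid patient-level confounders.
        sims = np.where(patient_ids == target_patient, -np.inf, sims)
    selected = {}
    for label in np.unique(tile_labels):
        idx = np.where(tile_labels == label)[0]
        ranked = idx[np.argsort(sims[idx])[::-1]]      # most similar first
        selected[label] = ranked[:k]
    return selected
```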
Tile-level classification benchmarks
In this study, we first compared few-shot image in-context learning of GPT-4V with the performance of specialized computer vision models by training a classification layer atop four distinct models: ResNet-18, ResNet-50, ViT-Tiny, and ViT-Small. Each model was initialized with ImageNet-pretrained weights, as is standard procedure7,35. Considering the relatively small test sample sizes in the experiments involving GPT-4V, we initially ensured a balanced comparison by training a newly initialized model on the identical set of kNN-sampled and normalized images for each test image across all datasets, leading to a total of 3600 trained models (number of models × number of datasets × number of shot settings excluding zero-shot × number of samples per dataset). Every training run was performed for ten epochs, employing the Adam optimizer with a learning rate of 0.001, and using cross-entropy as the loss function. Next, we trained these models on the full training dataset using adaptive momentum optimization and learning rate scheduling for MHIST, PCAM, and CRC100K individually, with 10% of the data as a validation set and checkpointing on validation accuracy. For comparison against histopathology foundation models, we either placed a linear layer on top of the feature embeddings obtained from the Phikon or UNI model and trained this layer for 1, 3, 5, and 10 epochs using the same hyperparameters as described above, or compared the test sample features with the features extracted from the training set by measuring cosine distance in representation space (nearest-neighbor classification). We provide all training hyperparameters in Supplementary Table 4. Due to the balanced target label distribution, unweighted accuracy scores are reported for each of the models.
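A minimal sketch of one such training run, matching the stated settings (ImageNet weights, Adam at a learning rate of 0.001, cross-entropy, ten epochs), is shown below. The use of timm and the batch size are our assumptions; the full training code is available in the linked repository.

```python
import timm
import torch
from torch import nn
from torch.utils.data import DataLoader

def train_classifier(train_dataset, num_classes, arch="resnet18",
                     epochs=10, lr=1e-3, device="cuda"):
    """Fine-tune an ImageNet-pretrained backbone (e.g., resnet18, resnet50,
    vit_tiny_patch16_224, vit_small_patch16_224) on the sampled tiles."""
    model = timm.create_model(arch, pretrained=True, num_classes=num_classes).to(device)
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    criterion = nn.CrossEntropyLoss()
    loader = DataLoader(train_dataset, batch_size=8, shuffle=True)
    model.train()
    for _ in range(epochs):
        for images, targets in loader:
            images, targets = images.to(device), targets.to(device)
            optimizer.zero_grad()
            loss = criterion(model(images), targets)
            loss.backward()
            optimizer.step()
    return model
```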
Data analysis and visualization
For the generation of the t-SNE plots, the data were initially reduced to 200 principal components using principal component analysis (PCA). The perplexity parameter was set to 30, and the process was initialized with a random seed of 42. All data visualizations were created using the Matplotlib and Seaborn packages.
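A sketch of this projection with the stated parameters (PCA to 200 components, perplexity 30, seed 42) is given below; note that PCA to 200 components requires at least 200 input samples.

```python
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

def project_to_2d(text_embeddings):
    """Reduce Ada-002 text embeddings to 2D for plotting: PCA first, then t-SNE."""
    reduced = PCA(n_components=200, random_state=42).fit_transform(text_embeddings)
    return TSNE(n_components=2, perplexity=30, random_state=42).fit_transform(reduced)
```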
Reporting summary
Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.
Data availability
All datasets used in this study are publicly available and can be downloaded from https://huggingface.co/datasets/DykeF/NCTCRCHE100K (CRC100K), https://github.com/basveeling/pcam (PatchCamelyon) and https://bmirds.github.io/MHIST/ (MHIST). Source data are provided in this paper.
Code availability
We provide all materials and code to reproduce and extend the analyses that were performed in this study upon publication under: https://github.com/Dyke-F/GPT-4V-In-Context-Learning.
References
Gilbert, S., Harvey, H., Melvin, T., Vollebregt, E. & Wicks, P. Large language model AI chatbots require approval as medical devices. Nat. Med. 29, 2396–2398 (2023).
Vanguri, R. S. et al. Multimodal integration of radiology, pathology and genomics for prediction of response to PD-(L)1 blockade in patients with non-small cell lung cancer. Nat. Cancer 3, 1151–1164 (2022).
Shmatko, A., Ghaffari Laleh, N., Gerstung, M. & Kather, J. N. Artificial intelligence in histopathology: enhancing cancer research and clinical oncology. Nat. Cancer 3, 1026–1038 (2022).
Esteva, A. et al. Dermatologist-level classification of skin cancer with deep neural networks. Nature 542, 115–118 (2017).
Coudray, N. et al. Classification and mutation prediction from non-small cell lung cancer histopathology images using deep learning. Nat. Med. 24, 1559–1567 (2018).
Campanella, G. et al. Clinical-grade computational pathology using weakly supervised deep learning on whole slide images. Nat. Med. 25, 1301–1309 (2019).
Kather, J. N. et al. Deep learning can predict microsatellite instability directly from histology in gastrointestinal cancer. Nat. Med. 25, 1054–1056 (2019).
El Nahhas, O. S. M. et al. From whole-slide image to biomarker prediction: a protocol for end-to-end deep learning in computational pathology. arXiv [cs.CV] https://arxiv.org/pdf/2312.10944 (2023).
Filiot, A. et al. Scaling self-supervised learning for histopathology with masked image modeling. bioRxiv https://doi.org/10.1101/2023.07.21.23292757 (2023).
Chen, R. J. et al. A general-purpose self-supervised model for computational pathology. ArXiv https://arxiv.org/pdf/2308.15474 (2023).
Rajbhandari, S., Rasley, J., Ruwase, O. & He, Y. ZeRO: memory optimizations toward training trillion parameter models. arXiv [cs.LG] https://arxiv.org/pdf/1910.02054 (2019).
Brown, T. B. et al. Language models are few-shot learners. arXiv [cs.CL] https://arxiv.org/pdf/2005.14165 (2020).
Nori, H. et al. Can generalist foundation models outcompete special-purpose tuning? Case study in medicine. arXiv [cs.CL] https://arxiv.org/pdf/2311.16452 (2023).
Rösler, W. et al. An overview and a roadmap for artificial intelligence in hematology and oncology. J. Cancer Res. Clin. Oncol. 149, 7997–8006 (2023).
Yang, Z., Li, L., Lin, K., Wang, J. & Lin, C. C. The dawn of LMMs: preliminary explorations with GPT-4V(ision). arXiv https://arxiv.org/pdf/2309.17421 (2023).
Gemini Team et al. Gemini: a family of highly capable multimodal models. arXiv [cs.CL] https://arxiv.org/pdf/2312.11805 (2023).
Liu, H., Li, C., Wu, Q. & Lee, Y. J. Visual instruction tuning. arXiv [cs.CV] https://arxiv.org/pdf/2304.08485 (2023).
Alayrac, J.-B. et al. Flamingo: a visual language model for few-shot learning. arXiv [cs.CV] 23716–23736 https://arxiv.org/pdf/2204.14198 (2022).
Kather, J. N. et al. Predicting survival from colorectal cancer histology slides using deep learning: a retrospective multicenter study. PLoS Med. 16, e1002730 (2019).
Wei, J. et al. A Petri dish for histopathology image analysis. in Artificial Intelligence in Medicine 11–24 (Springer International Publishing, 2021).
Ehteshami Bejnordi, B. et al. Diagnostic assessment of deep learning algorithms for detection of lymph node metastases in women with breast cancer. J. Am. Med. Assoc. 318, 2199–2210 (2017).
Vimal, M. & Nishanthi, A. Food eponyms in pathology. J. Clin. Diagn. Res. 11, EE01 (2017).
Zhou, Y. et al. A foundation model for generalizable disease detection from retinal images. Nature 622, 156–163 (2023).
Vorontsov, E. et al. Virchow: a million-slide digital pathology foundation model. arXiv [eess.IV] https://arxiv.org/pdf/2309.07778 (2023).
Selvaraju, R. R. et al. Grad-CAM: visual explanations from deep networks via gradient-based localization. Int. J. Comput. Vis. 128, 336–359 (2020).
Truhn, D., Reis-Filho, J. S. & Kather, J. N. Large language models should be used as scientific reasoning engines, not knowledge databases. Nat. Med. 29, 2983–2984 (2023).
Singhal, K. et al. Towards expert-level medical question answering with large language models. arXiv [cs.CL] https://arxiv.org/pdf/2305.09617 (2023).
Huang, Z., Bianchi, F., Yuksekgonul, M., Montine, T. J. & Zou, J. A visual-language foundation model for pathology image analysis using medical Twitter. Nat. Med. 29, 2307–2316 (2023).
Yao, S. et al. Tree of thoughts: deliberate problem solving with large language models. arXiv [cs.CL] https://arxiv.org/pdf/2305.10601 (2023).
Hoffmann, J. et al. Training compute-optimal large language models. arXiv [cs.CL] (2022).
Henighan, T. et al. Scaling laws for autoregressive generative modeling. arXiv [cs.LG] (2020).
Wang, X. et al. TransPath: Transformer-Based Self-supervised Learning for Histopathological Image Classification. in Medical Image Computing and Computer Assisted Intervention—MICCAI 2021 186–195 (Springer International Publishing, 2021).
Wong, N. A. C. S., Hunt, L. P., Novelli, M. R., Shepherd, N. A. & Warren, B. F. Observer agreement in the diagnosis of serrated polyps of the large bowel. Histopathology 55, 63–66 (2009).
Yang, C. et al. Large language models as optimizers. arXiv [cs.LG]. https://arxiv.org/pdf/2309.03409 (2023).
Lu, M. Y. et al. Data-efficient and weakly supervised computational pathology on whole-slide images. Nat. Biomed. Eng. 5, 555–570 (2021).
Acknowledgements
J.N.K. is supported by the German Federal Ministry of Health (DEEP LIVER, ZMVI1-2520DAT111; SWAG, 01KD2215B), the Max-Eder-Program of the German Cancer Aid (grant #70113864), the German Federal Ministry of Education and Research (PEARL, 01KD2104C; CAMINO, 01EO2101; SWAG, 01KD2215A; TRANSFORM LIVER, 031L0312A; TANGERINE, 01KT2302 through ERA-NET Transcan), the German Academic Exchange Service (SECAI, 57616814), the German Federal Joint Committee (Transplant.KI, 01VSF21048) the European Union’s Horizon Europe and innovation program (ODELIA, 101057091; GENIAL, 101096312) and the National Institute for Health and Care Research (NIHR, NIHR213331) Leeds Biomedical Research Center. DT is funded by the German Federal Ministry of Education and Research (TRANSFORM LIVER, 031L0312A), the European Union’s Horizon Europe and innovation program (ODELIA, 101057091), and the German Federal Ministry of Health (SWAG, 01KD2215B). G.W. is supported by Lothian NHS. The views expressed are those of the author(s) and not necessarily those of the NHS, the NIHR or the Department of Health and Social Care.
Funding
Open Access funding enabled and organized by Projekt DEAL.
Author information
Contributions
D.F. designed and performed the experiments, evaluated and interpreted the results, and wrote the initial draft of the paper. G.W. provided scientific support for running the experiments and contributed to writing the paper. I.W., M.L., S.S., N.G.L., O.S.M.E.N., G.M.-F. contributed to writing the paper. DJ supervised the study. D.T. and J.N.K. designed and supervised the experiments and wrote the paper.
Ethics declarations
Competing interests
The authors declare the following competing interests. O.S.M.E.N. holds shares in StratifAI GmbH. J.N.K. declares consulting services for Owkin, France; DoMore Diagnostics, Norway; Panakeia, UK, and Scailyte, Basel, Switzerland; furthermore J.N.K. holds shares in Kather Consulting, Dresden, Germany; and StratifAI GmbH, Dresden, Germany, and has received honoraria for lectures and advisory board participation by AstraZeneca, Bayer, Eisai, MSD, BMS, Roche, Pfizer and Fresenius. D.T. received honoraria for lectures by Bayer and holds shares in StratifAI GmbH, Germany. The remaining authors declare no competing interests.
Peer review
Peer review information
Nature Communications thanks the anonymous reviewer(s) for their contribution to the peer review of this work. A peer review file is available.
Additional information
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
Ferber, D., Wölflein, G., Wiest, I.C. et al. In-context learning enables multimodal large language models to classify cancer pathology images. Nat Commun 15, 10104 (2024). https://doi.org/10.1038/s41467-024-51465-9