Abstract
Medical image classification requires labeled, task-specific datasets that are used to train deep learning networks de novo, or to fine-tune foundation models. However, this process is computationally and technically demanding. In language processing, in-context learning provides an alternative, where models learn from within prompts, bypassing the need for parameter updates. Yet, in-context learning remains underexplored in medical image analysis. Here, we systematically evaluate the model Generative Pretrained Transformer 4 with Vision capabilities (GPT-4V) on cancer image processing with in-context learning on three cancer histopathology tasks of high importance: classification of tissue subtypes in colorectal cancer, colon polyp subtyping and breast tumor detection in lymph node sections. Our results show that in-context learning is sufficient to match or even outperform specialized neural networks trained for particular tasks, while only requiring a minimal number of samples. In summary, this study demonstrates that large vision language models trained on non-___domain-specific data can be applied out of the box to solve medical image-processing tasks in histopathology. This democratizes access to generalist AI models for medical experts without a technical background, especially in areas where annotated data is scarce.
Introduction
Artificial intelligence (AI) is about to transform healthcare. While its potential is immense, it also presents unique challenges in medicine, arising from the field’s inherent complexity and the critical need for accuracy and reliability1. In recent years, AI applications have been developed that focus on specific areas, such as computer vision models in radiology2 and pathology3, or skin cancer detection4 in oncology.
Histopathology plays a central role in diagnosing diseases, notably cancer, and has consistently been at the forefront of computational advancements in medicine5. Recent developments have enabled the detection of cancer subtypes6 and biomarkers like genetic alterations7, which can potentially stratify and improve patient care directly from routine hematoxylin and eosin (H&E) stained microscopic images7. The current gold standard for computational pathology is training vision foundation models8 based on a vast and diverse dataset of images that can easily be customized for clinically relevant applications9,10. However, these foundation models need a substantial volume of ___domain-specific images during training and are restricted to vision applications only. Moreover, before being applied to a medical task, these models require an additional re-training stage (fine-tuning) that is in itself computationally demanding11 and requires additional annotated training data. This last step needs to be repeated for every potential application, which limits researchers’ ability to develop these models at scale.
In-context learning (ICL)—a concept borrowed from the field of natural language processing (NLP)—could provide a possible solution to this problem. The ability of large language models (LLMs) to learn from a few handcrafted examples provided alongside the prompt holds great potential and has been shown to improve model performance12. A practical implementation in a medical setting might involve presenting the LLM with a detailed clinical scenario, such as a complex oncology case, accompanied by several comparable instances that illustrate different strategies for solving a particular challenge. This approach is called few-shot prompting. Numerous methodologies have been developed utilizing in-context learning. Their foundational principles are explained in detail in the ‘Supplementary Methods: In-Context Learning’ section.
In the medical field, one model has recently been built upon the aforementioned paradigms: MedPrompt13, which is based on the GPT-4 architecture. Central to this method is a k-nearest neighbor (kNN) search, which helps identify the most relevant few-shot examples for a specific clinical input. This process involves comparing text embeddings (numeric representations of text) of candidate examples with the embedding of the input in question and then selecting the samples with the closest alignment. We highlight further implementation details of this approach, as it has partial overlap with the methods developed in our study, in the ‘Supplementary Methods: Related Work—Enhancing LLM strategies’ section.
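In essence, such example selection boils down to ranking candidate examples by embedding similarity to the query. The following minimal sketch illustrates this principle in Python; it is not MedPrompt's actual implementation, and all function and variable names are illustrative.

```python
import numpy as np

def select_few_shot_examples(query_emb, example_embs, examples, k=5):
    """Return the k candidate examples whose embeddings are most similar
    (cosine similarity) to the query embedding."""
    q = query_emb / np.linalg.norm(query_emb)
    e = example_embs / np.linalg.norm(example_embs, axis=1, keepdims=True)
    sims = e @ q                          # cosine similarity to every candidate
    top = np.argsort(sims)[::-1][:k]      # indices of the k most similar candidates
    return [examples[i] for i in top]
```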
However, a major shortcoming of this approach is its restriction to text-based tasks. Medicine is a highly multimodal discipline, where a comprehensive understanding of a patient’s symptoms or diagnoses requires information from diverse data sources such as radiographic and microscopic imaging, clinical reports, laboratory values, and electronic health records14. Only recently has the AI community entered the field of vision language models (VLMs), exemplified by the release of GPT-4V15, the announcement of Google DeepMind’s Gemini16 family, and open-source variants like LLaVA17.
Building on the trend of large vision language foundation models, we hypothesize that the principles applied for in-context learning of text-based models can be equally effective when extended to multimodal scenarios, such as medical imaging. In the non-medical setting, robust evidence for in-context learning with images has already been established18. Especially in the medical field, where generating annotated ground-truth data presents a critical challenge, the potential performance improvements from this approach could be immensely beneficial. This is also relevant for underrepresented medical cases, such as rare tumor types, which are insufficiently represented in traditional deep-learning training pipelines. Moreover, the concurrent integration of textual, theoretical knowledge and visual information could pave the way toward a more holistic understanding of multidimensional medical data.
In this study, we present results of benchmarking the efficacy of in-context learning with GPT-4V against dedicated image classifiers across three histopathology benchmarking datasets. Notably, we demonstrate that the performance of GPT-4V in tissue classification can be improved through in-context learning and is on par with specialist computer vision models. This advancement casts doubt on the necessity of developing task-specific deep learning models in the future and democratizes access to generalist AI models to accelerate medical research.
Results
In-context learning with medical images improves classification accuracy for histopathology
In this study, we hypothesize that few-shot prompting can improve the performance of foundation vision models. This effect has been demonstrated for text-only tasks but remains unclear for biomedical images12,18. We provide a high-level overview of our evaluation datasets (Fig. 1A) and the overall experimental concept in Fig. 1B. We first evaluate this hypothesis on a binary classification task between tumor (TUM) and non-tumorous normal mucosa (NORM) tissue tiles from the CRC100K dataset19. As shown in Fig. 2A, GPT-4V only marginally surpasses the expectation of random guessing when used in a zero-shot setting, attaining an accuracy of 61.7% (CI: 0.5–0.733). In-context learning changes this situation: We see a consistent improvement in classification accuracy with increasing numbers of few-shot samples, with an accuracy of 66.7% in the three-shot sampling setting (CI: 0.55–0.783), 78.3% for five-shot sampling (CI: 0.667–0.883) and an accuracy of 90% when showing 10 images of each class to the model (CI: 0.817–0.967). In our subsequent ablation study (Fig. 2B), we compare random versus kNN sampling across the MHIST20 and PatchCamelyon21 (PCAM) datasets. From a zero-shot baseline that again barely performs better than random guessing (MHIST accuracy 56.7%, CI: 0.433–0.683; PCAM accuracy 60%, CI: 0.467–0.717), we see that in both datasets, random image sampling can improve classification accuracy. These results can further be improved by selecting the sampled images based on their similarity to the target image (kNN sampling), which results in the best-achieved accuracy of 83.4% and 88.3% for detecting sessile serrated adenoma over hyperplastic polyps (MHIST, CI: 0.733–0.917) and lymph-node metastases from breast cancer versus tumor-free lymphatic tissue (PCAM, CI: 0.8–0.95) in a ten-shot setting.
This figure presents a systematic overview of the three histopathology benchmarking datasets, detailing the number of samples incorporated in our study (Panel A). A selection of random test images was drawn from each of these datasets for evaluation using three distinct methodologies: zero-shot classification (Method 1), random few-shot sampling (Method 2), and kNN-based selection (Method 3). For the latter, feature extraction was performed using the Phikon ViT-B 40 M Pancancer model (*). Cosine similarity was used as the comparison metric between the target image and its closest k neighbors in embedding space. As a benchmark against GPT-4V ICL, we trained four image classifiers (indicated by +, namely ResNet-18, ResNet-50, ViT-Tiny, and ViT-Small) via transfer learning from ImageNet for each target image (Panel B). For an in-depth understanding of these methods, please refer to Algorithm 1 and the Experimental Design section. * The BACK (background) label was excluded from the analysis.
Panel A shows that classification accuracy on a simple task detecting tumor (TUM) versus non-tumor (NORM) tiles from the CRC100K dataset can drastically be improved by leveraging ICL through randomly sampled, few-shot image samples. Additionally, we compare random and kNN-based image sampling on two datasets and show that kNN-based image sampling improves model performance in classifying images from both MHIST (left) and PatchCamelyon (right), especially when scaling the number of few-shot samples (Panel B). Note that samples have been slightly shifted on the x-axis for visibility. The y-axis denotes the mean accuracy with lower and upper 2.5% confidence intervals (CIs) from 100,000 bootstrap iterations for both panels, respectively. Source data are provided as a Source Data file.
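The accuracies and confidence intervals reported throughout the figures follow a percentile bootstrap over the per-image results. A minimal sketch of such a computation is shown below, assuming a list of per-sample 0/1 correctness values; this is an illustration rather than the authors' exact analysis code.

```python
import numpy as np

def bootstrap_accuracy(correct, n_boot=100_000, alpha=0.05, seed=0):
    """Mean accuracy with a percentile bootstrap CI from per-sample 0/1 correctness."""
    rng = np.random.default_rng(seed)
    correct = np.asarray(correct)
    # Resample the per-image correctness indicators with replacement and
    # recompute the accuracy for every resample.
    boots = rng.choice(correct, size=(n_boot, correct.size), replace=True).mean(axis=1)
    lower, upper = np.quantile(boots, [alpha / 2, 1 - alpha / 2])
    return correct.mean(), lower, upper

# e.g., for the 60 per-image correctness flags of one test set this returns
# (mean accuracy, lower 2.5% bound, upper 97.5% bound).
```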
In summary, these results demonstrate that in-context learning can improve the performance of foundation vision models in classifying histopathology images. Moreover, we show that kNN sampling can further enhance accuracy over random sampling, especially when increasing the number of images that are shown to the model. Corresponding metrics can be found in Tables 1 and 2, with the best-performing method highlighted in bold.
Vision-language models can achieve performance on par with retrained vision classifiers
Next, we compare few-shot sampling with the previous status quo7 in image classification, which involves retraining models from ImageNet weights. As an initial comparison, we train one distinct model for each target image shown to GPT-4V, with the identical images used for in-context learning as the training set. This approach reveals that in-context learning is sufficiently robust to achieve results that are on par with, or even surpass, specialized narrow image classifiers under the same conditions. Specifically, the ten-shot in-context learning GPT-4V approach not only matches but exceeds the performance of all other models (Fig. 3A), leading to a classification accuracy of 83.3% for MHIST (CI: 0.733–0.917) and 88.3% for PatchCamelyon (CI: 0.8–0.95), outperforming the second-best model, ViT-Tiny, by 3.3% and 6.6%, respectively. Notably, in the case of PatchCamelyon, even three- and five-shot prompting was sufficient to outperform all other models in this setting. We show a detailed comparison of each model’s evaluation metrics in Supplementary Table 1. Further, we extend this comparison to include two scenarios: Firstly, we show that GPT-4V’s performance through in-context learning can partially match that of previously mentioned vision models, even when those models have been trained on the complete training datasets, such as CRC100K, which includes tens of thousands of tiles (Supplementary Table 2). Secondly, we compare the efficacy of in-context learning with the current gold standards in histopathology image classification—specifically using the Phikon and UNI models as examples. All evaluation metrics, including specifications on training parameters, are shown in Supplementary Tables 3 and 4. The results indicate that in-context learning significantly narrows the performance gap between GPT-4V and these models. For instance, in-context learning reduced the disparity from 36.6% (zero-shot GPT-4V) to a 10% difference (ten-shot GPT-4V) relative to the performance achieved by kNN classification with the Phikon feature extractor on the MHIST dataset. We also discovered that GPT-4V demonstrated remarkable zero-shot capabilities for some of the targets: For PatchCamelyon, it correctly identified all tumor tiles, albeit with a high false positive rate of 80%. In the MHIST dataset, it correctly recognized 83% of sessile serrated adenomas (SSA) but only 30% of hyperplastic polyps (Fig. 3B). Considerable improvements could be observed with few-shot prompting. In the case of PatchCamelyon, the model’s ability to identify normal lymph node tissue progressively increased with the number of example images, from an accuracy of 67% for three-shot and 77% for five-shot to 80% for ten-shot image prompting. Similarly, for MHIST, the correct identification of hyperplastic polyps could be increased from 30% (zero-shot) to close to 90% (ten-shot). Notably, these enhancements did not compromise the model’s performance in detecting tumors in the PatchCamelyon dataset or SSAs in the MHIST dataset (Fig. 3C). These findings show that in-context learning with microscopic images can achieve an accuracy on par with fine-tuning specialized image classification models.
This figure is divided into two sections, with Panel A and B focusing on PatchCamelyon (to the left) and the MHIST dataset (right subpanel) respectively. In A, line graphs illustrate the average performance of GPT-4V when used with kNN-based in-context learning relative to several specialist image classification and histopathology foundation models: We first compare GPT-4V with ResNet-18, ResNet-50 and two Vision Transformers (ViT-Tiny and ViT-Small) where the number of ICL samples for GPT-4V equals the number of training samples for the image classification models (1, top left). Additionally, we compare the same vision classifiers, trained on the full respective datasets (2, bottom left), and the performance of two histopathology foundation models, Phikon (3, top right) and UNI (4, bottom right). For the latter, we compare GPT-4V against training a linear layer on top of the pre-trained foundation model (for one, three, five, and ten epochs) and kNN classification. Note that in these cases, the models are trained on the full datasets, and the term ’# Samples’ is used to denote the number of few-shot ICL samples for GPT-4V only. The Y-axis displays the average accuracy across all labels, derived from 100,000 bootstrapping steps. All relevant metrics (accuracy, lower and upper confidence intervals) are summarized in Supplementary Tables 1–3. Panel B presents a series of heatmaps, highlighting the absolute and relative performance per label in zero-, three-, five-, and ten-shot kNN-based sampling scenarios, each with a sample size of n = 60. Lastly, the spider plot in Panel C highlights the superiority of 10-shot GPT-4V in classification performance for both datasets when compared under equitable conditions to two ResNet-style models and two vision transformers. Source data are provided as a Source Data file.
In-context learning reduces the performance gap between generalist and histopathology foundation models
In a subsequent evaluation, we tested GPT-4V on the CRC100K dataset, which is more challenging as it consists of a more diverse set of labels. As we increased the number of few-shot image samples, GPT-4V showed considerable improvements in performance. However, it did not achieve the levels observed on other datasets such as PatchCamelyon or MHIST (Fig. 4A). Despite this, there was a significant narrowing of the performance gap between GPT-4V and models fully trained on all data, such as ResNet-18, ResNet-50, ViT-Tiny, and ViT-Small, as well as compared to the downstream performance of the Phikon and UNI models. Initially, the performance deficit of GPT-4V in zero-shot classification relative to kNN stood at 61.7% and 62.5% for Phikon and UNI, respectively. This gap was reduced to 18.3% and 19.2% when using five-shot in-context learning compared to the best scores achieved by Phikon and UNI in any of our settings. While our study does not claim to maximize potential performance across all models, it highlights that ICL can bring us closer to the performance levels of models extensively pretrained on these tasks. Also, the GPT-4V model natively excelled in identifying tumor and muscle tissue, achieving recall scores of 80% and 100%, respectively. However, it failed completely to recognize debris (DEB), adipose tissue (ADI), lymphocytes (LYM), mucus (MUC), and tumor-associated stroma (STR). Three instances are particularly noteworthy: lymphocytes were consistently misclassified as tumor tissue, debris was incorrectly categorized as tumor in 93% of cases, and stroma was misclassified as muscle tissue in 87% of instances. The addition of few-shot examples led to a substantial improvement. The best results are achieved with five-shot kNN sampling, where the model receives a total of 40 sample images. This leads to enhanced accuracy across all labels (Fig. 4B). A clear trend of continuous performance gains is evident as the number of few-shot samples is increased, demonstrating consistent improvements at each stage of the process (from zero- to one-, one- to three-, and three- to five-shot prompting) for almost all labels (LYM, MUC, NORM, STR), with the exception of debris (Fig. 4C). Details on confidence intervals are summarized in Supplementary Table 1. In summary, our findings underline the potential of few-shot image learning in GPT-4V, even in a multiclass classification setting.
The line graphs (Panel A) show the comparative average performance of GPT-4V with kNN-based in-context learning against the four image classification models (1) when trained on the same number of images as used as example images for in-context learning with GPT-4V. Additionally, we show how in-context learning can reduce the performance gap between GPT-4V and the respective image classifiers when trained on the entire datasets respectively (2) as well as in comparison to the state-of-the-art foundation models Phikon and UNI. # Samples refers to the count of few-shot ICL samples for GPT-4V and training samples for the other models in 1, while for all other settings, the models are trained on the entire training data. The y-axis represents the mean accuracy across all labels, computed using 100,000 bootstrapping iterations. Detailed average accuracy values, including confidence intervals, are summarized in Supplementary Table 1. Panel B features confusion matrices for GPT-4V in both zero-, and five-shot kNN-based sampling scenarios (n = 120 samples). The spider plot showcases the average classification accuracy per label per number of kNN-sampled shots, revealing a general trend towards increased classification accuracy across most labels with scaling of the number of few-shot image samples (Panel C). Source data are provided as a Source Data file.
Image in-context learning improves text-based reasoning
Vision-language models enable multimodal understanding. To more accurately evaluate the impact of few-shot image sampling on textual reasoning within VLMs, we further investigated the output of GPT-4V and created text embeddings using Ada-002. Next, we utilized t-distributed Stochastic Neighbor Embedding (t-SNE) to analyze the semantic space of the model’s reasoning. Our results demonstrated distinct clusters of embeddings when compared to the model’s final answer (Fig. 5A, top), underscoring a potential correlation between text and image data. However, in the zero-shot scenario, the comparison of text embeddings to ground truth labels revealed that the model’s intrinsic reasoning correlated only poorly with the correct categorization of the images (labels). This suggests a limitation in the model’s ability to independently navigate to the correct label based solely on its learned representations. In contrast, the application of few-shot learning techniques improved the separation of text embeddings corresponding to different answers and labels. This enhancement is evident from the formation of a greater number of distinct clusters and more accurate alignment of data points with their respective ground truth categories, as shown in Fig. 5A. Moreover, the implementation of few-shot learning was associated with increased silhouette scores, indicating closer proximity of data points to their correct labels as the number of example images provided to the model increased. Collectively, these findings suggest that employing few-shot learning techniques can enhance the model’s capacity to analyze and interpret test images more accurately, thereby refining its decision-making process.
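As a rough illustration of this analysis, the sketch below embeds the model's free-text reasoning ('thoughts') with Ada-002 and computes a silhouette score against the ground-truth labels. The client setup, helper name, and the choice to score the raw embeddings are assumptions; in the paper, silhouette scores are reported per t-SNE projection (Fig. 5A).

```python
import numpy as np
from openai import OpenAI
from sklearn.metrics import silhouette_score

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def embed_thoughts(thoughts):
    """Embed GPT-4V's free-text reasoning with the Ada-002 embedding model."""
    resp = client.embeddings.create(model="text-embedding-ada-002", input=thoughts)
    return np.array([item.embedding for item in resp.data])

# thoughts: list of reasoning strings, labels: ground-truth class per test image
# embs = embed_thoughts(thoughts)
# score = silhouette_score(embs, labels)   # higher = tighter label-wise clustering
```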
Panel A depicts the workflow, starting from GPT-4V’s initial prediction and its reasoning process (‘thoughts’), to the generation of text feature embeddings with Ada 002. The panel of t-SNEs demonstrates the evolution from a zero-shot framework on the far left, advancing through one-, three-, and five-shot kNN sampling to the right. All data is obtained from the CRC100K dataset. In the t-SNE plots, color coding distinguishes between the model’s final classifications (‘Answers’, top) and the ground truth (’Labels’, bottom). The introduction of few-shot image sampling noticeably refines the model’s textual reasoning, as evidenced by the formation of more distinct clusters in alignment with the model’s own responses (top) and the underlying ground truth (bottom). S denotes silhouette scores, which are calculated for each t-SNE. Complementary to these visualizations, Supplementary Fig. 2 features word clouds that further illustrate the alignment of the model’s vocabulary with clinical diagnoses, highlighting key terms such as “lymph node” for normal tissue and “metastatic / breast cancer” for malignancies, thereby enhancing the interpretability of the model’s diagnostic reasoning process. In Panel B, we present two exemplary scenarios to demonstrate the potential superiority of integrated vision-language models over stand-alone image classification models. On the left, an image is displayed where the original annotation identified the sample as stroma (STR), yet GPT-4V categorizes it as tumor (TUM). The rationale provided by the model appears plausible, notably pointing out several abnormally shaped nuclei, visible, for instance, in the lower right corner. This sample indeed appears to represent a borderline case. When comparing the top 500 closest patch embeddings to the reference image, a dominant fraction is classified as tumor (67%), with a lesser proportion being labeled as stroma (32%) and a negligible percentage (<1%) as lymphocytes or regular colon epithelium. The exploration of GPT-4V’s interpretive process can help identify and understand such complex edge cases that go beyond what is possible with conventional image classifiers alone. Right: Chicken-wire patterns are described in the histology of liposarcoma, which arises from adipocyte precursor cells. This description stems from its resemblance to chicken wire fences (shown to the right). GPT-4V effectively leverages this knowledge from another context to describe the morphology of the adipocytes shown in this image. This way of performing ‘transfer learning’ could have strong implications in teaching. * The image name in the CRC100K cohort is STR-TCGA-VEMARASN. + The image name in the CRC100K cohort is ADI-TCGA-QFVSMHDD.
To showcase the benefits multimodality might have in histopathology, we present two illustrative cases from our study. Figure 5B (left) depicts a scenario where GPT-4V falsely classifies an image as a tumor, while the underlying ground truth was considered to be stroma.
However, GPT-4V’s detailed reasoning, identifying morphological signs indicative of cancer, reveals the presence of tumor cells characterized by irregularly shaped nuclei. Analyzing the 500 closest image embeddings in feature space shows a similar trend, with two-thirds of image embeddings being categorized as tumors. Another case, shown in Fig. 5B (right), demonstrates GPT-4V’s proficiency in transferring knowledge from different domains to draw the right conclusions. According to existing literature22, the term “chicken wire pattern” is established within the ___domain of pathology, yet only regarding the appearance of adipose tissue in liposarcomas and other malignancies. However, it is not frequently used to describe the architecture of normal, healthy adipose tissue. The capability of GPT-4V to transfer its understanding of the physical appearance of chicken wire to the shape of adipose tissue in histopathology demonstrates the ability for transfer learning and holds potential in areas like AI explainability and teaching. Overall, these data indicate that vision language models possess substantial potential for medical image classification in histopathology, utilizing only a few sample images. This capability may provide inherent advantages over traditional image classifiers, due to their multimodal architecture.
Discussion
Foundation models have demonstrated substantial promise in medical image processing. Zhou et al. trained such a system using 1.6 million retinal images and illustrated that they could then fine-tune it with fewer annotated images to assist clinicians in identifying a range of ocular diseases23. Yet, the vast amount of data that is required and the necessity to develop one specific fine-tuned version for each clinical task currently constrain the training of these models at scale, limiting their utility to researchers with extensive knowledge in computer science and access to the required hardware. Furthermore, the applicability of these models has been confined to the visual ___domain only. Nonetheless, learning is a multimodal process. For example, in pathology, practitioners and students assimilate their knowledge by extracting visual patterns from images and synthesizing them with corresponding textual annotations. In sum, the ideal scenario would be AI systems that seamlessly combine multimodal information in a data-efficient manner while having the flexibility to adapt their behavior to any given task on demand, without the need for traditional retraining.
In this study, we demonstrate a proof of concept illustrating that achieving these properties is possible with in-context learning on vision language models, exemplified by GPT-4V: We show that this method is not only effective for classifying medical microscopy images but also achieves performance comparable to conventional image classification models, and that in-context learning provides a data- and resource-efficient way to drastically narrow the performance gap between generalist foundation models and histopathology foundation models like Phikon and UNI, which are trained on a large corpus of microscopic images. We show that five to ten sample images per label are enough for GPT-4V to achieve classification accuracy scores close to the current gold standard models. These results are encouraging, especially considering that other current state-of-the-art pathology foundation models like Paige’s Virchow24 report performance metrics in the same range as our method, with reported accuracy scores of 82.7% compared to 83.3% for GPT-4V on the MHIST dataset and 92.7% versus 88.3% for GPT-4V on PatchCamelyon. For MHIST, we must note here that we excluded images without full inter-rater agreement, which most likely makes our use case easier than the one used by Vorontsov et al.24. We acknowledge the lack of public access to the training corpus of GPT-4V, which raises the possibility that the model may have been trained on our test sets. Nevertheless, the performance observed in a zero-shot scenario only marginally surpasses random guessing, making it less likely that these data were used for training. We use this zero-shot baseline as a comparison to investigate the benefit of in-context learning. With our approach, we lay the foundation for a general-purpose framework that advances state-of-the-art prompting techniques for images. Additionally, our results indicate that deliberately selecting few-shot examples that are semantically similar to the test image can substantially improve the performance of the model. A notable aspect is the integration of text with vision, which can aid explainability by making a model’s reasoning processes understandable. This addresses a critical limitation of conventional image classifiers, as textual feedback provides humans with a more comprehensible form of interpretability than visual tools such as Grad-CAM25. This aspect is crucial for reliable AI systems in medical applications26.
Some limitations of our work are that experiments were restricted to a small sample size due to the preview status of the GPT-4V API, which currently only permits a limited number of requests. Another limitation in this regard is that we did not include ensembling methods, which would require multiple model iterations over the same task as performed by MedPrompt or Med-PaLM 2; the latter approach requires a total of 44 model calls for a single task27. Moreover, it is worth noting that the performance of in-context learning with images sometimes yields suboptimal results, particularly in classes like debris, mucus, and stroma within the CRC100K dataset. This observation is in line with findings by Huang et al.28. While these outcomes have been acknowledged, we leave an in-depth investigation into the underlying reasons and the development of potential solutions as subjects for future research. Moreover, further exploration of prompt engineering techniques beyond Chain-of-Thought, like Tree-of-Thought29, could be used to optimize the models beyond the current results. Finally, the current reliance of our approach on generating and retrieving image embeddings through a specialized vision model (i.e., Phikon) is a drawback to the philosophy of an all-encompassing foundation VLM, which ideally would not require a specialized vision encoder to generate features. Although we have not tested this at scale due to the rate limits of the API, we speculate that the next generation of foundation models will be able to autonomously manage, embed, and retrieve sample data on demand for few-shot learning. Following the current paradigm of AI scaling laws30,31, it can be estimated that we have not yet reached a plateau in the performance benefits from even more powerful foundation models in the future. Furthermore, our experiments have not indicated any saturation point in model efficacy when increasing the number of k-shot examples; however, further increasing the number of sample images per task was not feasible because it would exceed the model’s context window. Again, this suggests the potential for continued enhancements when further scaling our approach and raises the question of whether researchers need to develop their own specialized deep learning models for each task, particularly when a singular model may suffice in the foreseeable future. In summary, we aim to further scale our work to overcome these limitations and extend it to other domains like radiology imaging. Nevertheless, we believe that in-context learning with images holds great potential for improving the performance of vision language models on biomedical image classification tasks and beyond.
Methods
Ethics statement
This study does not include confidential information. All research procedures were conducted exclusively on publicly accessible, anonymized patient data and in accordance with the Declaration of Helsinki, maintaining all relevant ethical standards. The overall analysis was approved by the Ethics Commission of the Medical Faculty of the Technical University Dresden (BO-EK-444102022).
Datasets
Our benchmarking experiments are conducted on the following, open-source histopathology image datasets:
- CRC-VAL-HE-7K19 is the evaluation set associated with the NCT-CRC-HE-100K dataset, consisting of 7180 image patches extracted from hematoxylin & eosin (H&E) stained formalin-fixed and paraffin-embedded (FFPE) sections from 50 individuals with colorectal cancer. Samples were collected at the NCT Biobank (National Center for Tumor Diseases, Heidelberg, Germany) and the UMM pathology archive (University Medical Center Mannheim, Mannheim, Germany) and digitized at 224 × 224 pixels (px) at a resolution of 0.5 microns per pixel (MPP). Throughout this manuscript, we will refer to this dataset as CRC100K. Following previous studies9,32, the background (BACK) class was excluded from our analysis.
- PatchCamelyon (PCam)21 contains 327,680 H&E stained histologic image patches at 96 × 96 px (0.243 MPP) from human sentinel lymph node sections obtained from the Camelyon16 Challenge, originally split into a training and validation set. Samples are annotated with a binary label to denote the presence or absence of metastatic breast cancer tissue at a balance close to 50/50.
- MHIST20 is a dataset of 3152 H&E-stained FFPE sections from colorectal polyps, collected at the Dartmouth–Hitchcock Medical Center (DHMC), and addresses the challenging problem of discriminating sessile serrated adenoma (SSA) from hyperplastic polyps (HP)33. Images are 224 × 224 px and labeled as either HP or SSA by the majority vote of seven pathologists, resulting in a 3:7 split.
For GPT-4V inference testing, we randomly generated test datasets containing 60 samples for MHIST and PatchCamelyon and 120 samples for CRC100K, with an equal number of samples for each of the available labels. For simplicity, we restricted the test images from the MHIST dataset to those achieving unanimous expert consensus for the presence of SSA or HP, respectively. All images that were used for inference testing are visualized in Supplementary Fig. 1.
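For illustration, such balanced test-set construction can be sketched as follows; the function name and the mapping from labels to file paths are hypothetical.

```python
import random

def balanced_test_set(paths_by_label, n_total, seed=42):
    """Draw an equal number of test tiles per label, e.g., 30 + 30 for MHIST/PCam
    or 15 per class for the eight retained CRC100K labels (120 in total)."""
    per_label = n_total // len(paths_by_label)
    rng = random.Random(seed)
    return {label: rng.sample(paths, per_label)
            for label, paths in paths_by_label.items()}
```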
GPT-4V model specifications
All experiments in this study were performed using the GPT-4V model via the chat completions endpoint of the official OpenAI Python API between November 15 and December 03, 2023. The official model name in the OpenAI API is gpt-4-vision-preview. For simplicity, we will use the term GPT-4V in all subsequent references to this model throughout our manuscript. The temperature was set to 0.1 based on initial experiments, and no other model hyperparameters were modified. For further implementation details, we refer to our official GitHub repository.
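A minimal sketch of such a call to the chat completions endpoint is shown below. The prompt strings are placeholders (the actual prompts are given in Supplementary Tables 5–7), and the helper names and the max_tokens value are our own assumptions rather than the authors' code.

```python
import base64
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def encode_image(path):
    """Base64-encode an image tile for the chat completions endpoint."""
    with open(path, "rb") as f:
        return base64.b64encode(f.read()).decode("utf-8")

def query_gpt4v(system_prompt, user_text, image_path):
    """Send one text-plus-image request to gpt-4-vision-preview at temperature 0.1."""
    response = client.chat.completions.create(
        model="gpt-4-vision-preview",
        temperature=0.1,
        max_tokens=1024,
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": [
                {"type": "text", "text": user_text},
                {"type": "image_url", "image_url": {
                    "url": f"data:image/png;base64,{encode_image(image_path)}"}},
            ]},
        ],
    )
    return response.choices[0].message.content
```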
Text embeddings were created using OpenAI’s default embedding model Ada 002, without further modifications.
Prompting and random few-shot image in-context learning
In the following, we present a brief overview of the implementation of the final prompts used in GPT-4V. For an in-depth explanation of both the system prompt (instructions dictating the expected model behavior) and user prompt (input commands or queries to the model), please refer to Supplementary Tables 5–7. There is currently no standardized blueprint for the development of effective model prompts; rather, this is an iterative, dynamic process driven by trial and error. Our prompting strategies were developed on a selection of ten random image tiles per label from each dataset. Following current best practices, we utilized the system prompt to establish the setting (context) of the model and to guide its expected behavior. In our initial trials with GPT-4V, we encountered several limitations due to the model’s strict policy alignment, which frequently led it to refuse to handle medical data. To address these issues, we modified our approach by presenting test cases as hypothetical scenarios (‘None of your answers are applied in a real-world scenario or have influences on real patients.’) and additionally included a selection of desired and undesired response pairs in the system prompt. To simplify the analysis of the results, we also configured GPT-4V to generate answers in JavaScript Object Notation (JSON) format. This included a structured template containing a field for logical reasoning (‘thoughts’), the final ‘answer’, as well as a certainty ‘score’.
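Because the model is instructed to return a JSON object with 'thoughts', 'answer', and 'score' fields, its replies can be parsed along the following lines. This is a sketch under the assumption that replies occasionally arrive wrapped in a markdown code fence; it is not the authors' parsing code.

```python
import json

REQUIRED_FIELDS = {"thoughts", "answer", "score"}

def parse_reply(raw):
    """Parse GPT-4V's JSON-formatted reply and verify the expected fields."""
    # Strip an optional markdown code fence around the JSON payload.
    text = raw.strip().removeprefix("```json").removesuffix("```").strip()
    reply = json.loads(text)
    missing = REQUIRED_FIELDS - reply.keys()
    if missing:
        raise ValueError(f"missing fields: {missing}")
    return reply

# Example:
# parse_reply('{"thoughts": "...", "answer": "TUM", "score": 0.9}')
```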
Regarding the user prompt, we differentiate between the zero- and few-shot settings. In the zero-shot scenario, we started by enumerating all possible label options, followed by guiding the model to adopt step-wise reasoning akin to Chain-of-Thought (CoT) prompting29. This was followed by a compilation of dataset-specific considerations: For instance, in the CRC100K dataset, we observed that the model would almost always choose to classify an image tile as a tumor whenever detecting malignant cells, despite simultaneously recognizing that the majority of cells were lymphocytes. To counteract these dataset-specific pitfalls, we included concise guidelines at this step (Supplementary Tables 5–7). Finally, GPT-4V was asked to thoroughly examine the appended patient image and provide its answer as described above.
In the few-shot sampling prompts, we presented a sequence of k example images (where k equals 1, 3, 5, or 10), each followed by its corresponding label, in a repeated pattern: Specifically, we presented a single image corresponding to each label y, cycling through the entire set of labels k times. This means that for every test image, we present the model with a total of (k * y) + 1 images, where y denotes the number of possible labels. This setting is the same for the kNN-based sample selection, which we further highlight below and in Box 1. Each image was prefaced with the phrase ‘The following image contains {y}:’ for any possible label y. GPT-4V was then instructed to closely compare and extract meaningful knowledge from the images for subsequent comparison with the target image. Beyond this, the structure of the prompt remained consistent with the zero-shot template. Moreover, to the best of our knowledge, we followed all known best practices and prompting tricks (e.g., ‘Take a deep breath’)34. To mitigate the risk of overfitting to the samples used during the refinement of the system and user prompts, we ensured that these samples were not included in the generation of inference test data. More specifically, we performed an initial investigation of zero-shot and random few-shot performance using an initial dataset comprising 30 random samples, collected exclusively from the CRC100K dataset, each containing either tumor or normal colon epithelium. This initial dataset served a dual purpose: developing effective prompts and providing early insight into model responses. For the following evaluation phase, we collected a new subset of 30 samples. This way, we prevented sample leakage from our prompt creation dataset into our final evaluation test set. This was critical to prevent overfitting that could arise from sample-specific biases we might have included in the prompt. However, these samples were allowed to be part of either random or (as described in the next section) kNN-based sample selections. This process was repeated for every dataset.
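A sketch of how such a few-shot user message can be assembled for the chat completions endpoint is shown below. It reuses the hypothetical encode_image helper from the sketch above, and the closing instruction text is illustrative rather than the exact prompt wording.

```python
def build_few_shot_content(examples_per_label, k, target_image_path):
    """Interleave k labeled example images per label (k * number-of-labels images),
    then append the unlabeled target tile as the final image."""
    content = []
    for i in range(k):                                  # cycle through all labels k times
        for label, image_paths in examples_per_label.items():   # needs >= k paths per label
            content.append({"type": "text",
                            "text": f"The following image contains {label}:"})
            content.append({"type": "image_url", "image_url": {
                "url": f"data:image/png;base64,{encode_image(image_paths[i])}"}})
    content.append({"type": "text",
                    "text": "Now classify the following patient image:"})
    content.append({"type": "image_url", "image_url": {
        "url": f"data:image/png;base64,{encode_image(target_image_path)}"}})
    return content   # used as the 'content' field of a single user message
```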
kNN-based few-shot image sampling
The entire workflow is shown in detail as pseudocode in Box 1. Image feature vectors were created for each of the above-described datasets using the teacher backbone of the ‘Phikon’ Vision Transformer (ViT-B 40 M Pancancer)9, leading to a one-dimensional vector of length 768 for each image tile. During GPT-4V inference, for each test image x, the k closest images of each possible target label y were sampled for kNN-based in-context learning by measuring the cosine similarity in feature space. To prevent the model from learning patient-intrinsic morphologic tissue features as confounders to the desired label, we removed tile embeddings from the same patient if this information was available. Nevertheless, the main goal of our study is the comparison between GPT-4V in-context learning and training of specialized image classifiers. As outlined later, the comparisons remain valid in cases where overlap between the test image and tiles from related patients might occur, because the in-context learning samples and the training samples match exactly. The example images were included in the prompt such that the most similar images for each label were shown to the model first.
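A condensed sketch of this selection step, complementing the pseudocode in Box 1, is given below; array names and the patient-exclusion handling are illustrative, assuming precomputed Phikon tile features.

```python
import numpy as np

def knn_examples_per_label(target_emb, tile_embs, tile_labels,
                           patient_ids, target_patient, k):
    """For each label, return indices of the k tiles most similar (cosine) to the
    target tile in Phikon feature space, ordered most similar first."""
    q = target_emb / np.linalg.norm(target_emb)
    e = tile_embs / np.linalg.norm(tile_embs, axis=1, keepdims=True)
    sims = e @ q                                       # cosine similarity to the target
    if target_patient is not None:
        # Exclude tiles from the same patient to avoid patient-level confounders.
        sims = np.where(patient_ids == target_patient, -np.inf, sims)
    selected = {}
    for label in np.unique(tile_labels):
        idx = np.where(tile_labels == label)[0]
        ranked = idx[np.argsort(sims[idx])[::-1]]      # most similar first
        selected[label] = ranked[:k]
    return selected
```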
Tile-level classification benchmarks
In this study, we first compared few-shot image in-context learning of GPT-4V with the performance of specialized computer vision models by training a classification layer atop four distinct models: ResNet-18, ResNet-50, ViT-Tiny, and ViT-Small. Each model was initialized with ImageNet-pretrained weights, as is standard procedure7,35. Considering the relatively small test sample sizes in the experiments involving GPT-4V, we initially ensured a balanced comparison by training a newly initialized model on the identical set of kNN-sampled and normalized images for each test image across all datasets, leading to a total of 3600 trained models (number of models × number of datasets × number of shot settings excluding zero-shot × number of samples per dataset). Every training run was performed for ten epochs, employing the Adam optimizer with a learning rate of 0.001, and using cross-entropy as the loss function. Next, we trained these models on the full training dataset using adaptive momentum optimization and learning rate scheduling for MHIST, PCAM, and CRC100K individually, with 10% of the data as a validation set and checkpointing on validation accuracy. For comparison against histopathology foundation models, we either placed a linear layer on top of the feature embeddings obtained from the Phikon or UNI model and trained this layer for 1, 3, 5, and 10 epochs using the same hyperparameters as described above, or compared the test sample features with the features extracted from the training set by measuring cosine distance in representation space (nearest-neighbor classification). We provide all training hyperparameters in Supplementary Table 4. Due to the balanced target label distribution, unweighted accuracy scores are reported for each of the models.
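A minimal sketch of one such training run, matching the stated settings (ImageNet weights, Adam at a learning rate of 0.001, cross-entropy, ten epochs), is shown below. The use of timm and the batch size are our assumptions; the full training code is available in the linked repository.

```python
import timm
import torch
from torch import nn
from torch.utils.data import DataLoader

def train_classifier(train_dataset, num_classes, arch="resnet18",
                     epochs=10, lr=1e-3, device="cuda"):
    """Fine-tune an ImageNet-pretrained backbone (e.g., resnet18, resnet50,
    vit_tiny_patch16_224, vit_small_patch16_224) on the sampled tiles."""
    model = timm.create_model(arch, pretrained=True, num_classes=num_classes).to(device)
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    criterion = nn.CrossEntropyLoss()
    loader = DataLoader(train_dataset, batch_size=8, shuffle=True)
    model.train()
    for _ in range(epochs):
        for images, targets in loader:
            images, targets = images.to(device), targets.to(device)
            optimizer.zero_grad()
            loss = criterion(model(images), targets)
            loss.backward()
            optimizer.step()
    return model
```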
Data analysis and visualization
For the generation of the t-SNE plots, the data were initially reduced to 200 principal components using principal component analysis (PCA). The perplexity parameter was set to 30, and the process was initialized with a random seed of 42. All data visualizations were created using the Matplotlib and Seaborn packages.
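A sketch of this projection with the stated parameters (PCA to 200 components, perplexity 30, seed 42) is given below; note that PCA to 200 components requires at least 200 input samples.

```python
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

def project_to_2d(text_embeddings):
    """Reduce Ada-002 text embeddings to 2D for plotting: PCA first, then t-SNE."""
    reduced = PCA(n_components=200, random_state=42).fit_transform(text_embeddings)
    return TSNE(n_components=2, perplexity=30, random_state=42).fit_transform(reduced)
```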
Reporting summary
Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.
Data availability
All datasets used in this study are publicly available and can be downloaded from https://huggingface.co/datasets/DykeF/NCTCRCHE100K (CRC100K), https://github.com/basveeling/pcam (PatchCamelyon) and https://bmirds.github.io/MHIST/ (MHIST). Source data are provided in this paper.
Code availability
We provide all materials and code to reproduce and extend the analyses that were performed in this study upon publication under: https://github.com/Dyke-F/GPT-4V-In-Context-Learning.
References
Gilbert, S., Harvey, H., Melvin, T., Vollebregt, E. & Wicks, P. Large language model AI chatbots require approval as medical devices. Nat. Med. 29, 2396–2398 (2023).
Vanguri, R. S. et al. Multimodal integration of radiology, pathology and genomics for prediction of response to PD-(L)1 blockade in patients with non-small cell lung cancer. Nat. Cancer 3, 1151–1164 (2022).
Shmatko, A., Ghaffari Laleh, N., Gerstung, M. & Kather, J. N. Artificial intelligence in histopathology: enhancing cancer research and clinical oncology. Nat. Cancer 3, 1026–1038 (2022).
Esteva, A. et al. Dermatologist-level classification of skin cancer with deep neural networks. Nature 542, 115–118 (2017).
Coudray, N. et al. Classification and mutation prediction from non-small cell lung cancer histopathology images using deep learning. Nat. Med. 24, 1559–1567 (2018).
Campanella, G. et al. Clinical-grade computational pathology using weakly supervised deep learning on whole slide images. Nat. Med. 25, 1301–1309 (2019).
Kather, J. N. et al. Deep learning can predict microsatellite instability directly from histology in gastrointestinal cancer. Nat. Med. 25, 1054–1056 (2019).
El Nahhas, O. S. M. et al. From whole-slide image to biomarker prediction: a protocol for end-to-end deep learning in computational pathology. arXiv [cs.CV] https://arxiv.org/pdf/2312.10944 (2023).
Filiot, A. et al. Scaling self-supervised learning for histopathology with masked image modeling. bioRxiv https://doi.org/10.1101/2023.07.21.23292757 (2023).
Chen, R. J. et al. A general-purpose self-supervised model for computational pathology. ArXiv https://arxiv.org/pdf/2308.15474 (2023).
Rajbhandari, S., Rasley, J., Ruwase, O. & He, Y. ZeRO: memory optimizations toward training trillion parameter models. arXiv [cs.LG] https://arxiv.org/pdf/1910.02054 (2019).
Brown, T. B. et al. Language models are few-shot learners. arXiv [cs.CL] https://arxiv.org/pdf/2005.14165 (2020).
Nori, H. et al. Can generalist foundation models outcompete special-purpose tuning? Case study in medicine. arXiv [cs.CL] https://arxiv.org/pdf/2311.16452 (2023).
Rösler, W. et al. An overview and a roadmap for artificial intelligence in hematology and oncology. J. Cancer Res. Clin. Oncol. 149, 7997–8006 (2023).
Yang, Z., Li, L., Lin, K., Wang, J. & Lin, C. C. The dawn of LMMs: preliminary explorations with GPT-4V(ision). arXiv https://arxiv.org/pdf/2309.17421 (2023).
Gemini Team et al. Gemini: a family of highly capable multimodal models. arXiv [cs.CL] https://arxiv.org/pdf/2312.11805 (2023).
Liu, H., Li, C., Wu, Q. & Lee, Y. J. Visual instruction tuning. arXiv [cs.CV] https://arxiv.org/pdf/2304.08485 (2023).
Alayrac, J.-B. et al. Flamingo: a visual language model for few-shot learning. arXiv [cs.CV] 23716–23736 https://arxiv.org/pdf/2204.14198 (2022).
Kather, J. N. et al. Predicting survival from colorectal cancer histology slides using deep learning: a retrospective multicenter study. PLoS Med. 16, e1002730 (2019).
Wei, J. et al. A Petri dish for histopathology image analysis. in Artificial Intelligence in Medicine 11–24 (Springer International Publishing, 2021).
Ehteshami Bejnordi, B. et al. Diagnostic assessment of deep learning algorithms for detection of lymph node metastases in women with breast cancer. J. Am. Med. Assoc. 318, 2199–2210 (2017).
Vimal, M. & Nishanthi, A. Food eponyms in pathology. J. Clin. Diagn. Res. 11, EE01 (2017).
Zhou, Y. et al. A foundation model for generalizable disease detection from retinal images. Nature 622, 156–163 (2023).
Vorontsov, E. et al. Virchow: a million-slide digital pathology foundation model. arXiv [eess.IV] https://arxiv.org/pdf/2309.07778 (2023).
Selvaraju, R. R. et al. Grad-CAM: visual explanations from deep networks via gradient-based localization. Int. J. Comput. Vis. 128, 336–359 (2020).
Truhn, D., Reis-Filho, J. S. & Kather, J. N. Large language models should be used as scientific reasoning engines, not knowledge databases. Nat. Med. 29, 2983–2984 (2023).
Singhal, K. et al. Towards expert-level medical question answering with large language models. arXiv [cs.CL] https://arxiv.org/pdf/2305.09617 (2023).
Huang, Z., Bianchi, F., Yuksekgonul, M., Montine, T. J. & Zou, J. A visual-language foundation model for pathology image analysis using medical Twitter. Nat. Med. 29, 2307–2316 (2023).
Yao, S. et al. Tree of thoughts: deliberate problem solving with large language models. arXiv [cs.CL] https://arxiv.org/pdf/2305.10601 (2023).
Hoffmann, J. et al. Training compute-optimal large language models. arXiv [cs.CL] (2022).
Henighan, T. et al. Scaling laws for autoregressive generative modeling. arXiv [cs.LG] (2020).
Wang, X. et al. TransPath: Transformer-Based Self-supervised Learning for Histopathological Image Classification. in Medical Image Computing and Computer Assisted Intervention—MICCAI 2021 186–195 (Springer International Publishing, 2021).
Wong, N. A. C. S., Hunt, L. P., Novelli, M. R., Shepherd, N. A. & Warren, B. F. Observer agreement in the diagnosis of serrated polyps of the large bowel. Histopathology 55, 63–66 (2009).
Yang, C. et al. Large language models as optimizers. arXiv [cs.LG]. https://arxiv.org/pdf/2309.03409 (2023).
Lu, M. Y. et al. Data-efficient and weakly supervised computational pathology on whole-slide images. Nat. Biomed. Eng. 5, 555–570 (2021).
Acknowledgements
J.N.K. is supported by the German Federal Ministry of Health (DEEP LIVER, ZMVI1-2520DAT111; SWAG, 01KD2215B), the Max-Eder-Program of the German Cancer Aid (grant #70113864), the German Federal Ministry of Education and Research (PEARL, 01KD2104C; CAMINO, 01EO2101; SWAG, 01KD2215A; TRANSFORM LIVER, 031L0312A; TANGERINE, 01KT2302 through ERA-NET Transcan), the German Academic Exchange Service (SECAI, 57616814), the German Federal Joint Committee (Transplant.KI, 01VSF21048) the European Union’s Horizon Europe and innovation program (ODELIA, 101057091; GENIAL, 101096312) and the National Institute for Health and Care Research (NIHR, NIHR213331) Leeds Biomedical Research Center. DT is funded by the German Federal Ministry of Education and Research (TRANSFORM LIVER, 031L0312A), the European Union’s Horizon Europe and innovation program (ODELIA, 101057091), and the German Federal Ministry of Health (SWAG, 01KD2215B). G.W. is supported by Lothian NHS. The views expressed are those of the author(s) and not necessarily those of the NHS, the NIHR or the Department of Health and Social Care.
Funding
Open Access funding enabled and organized by Projekt DEAL.
Author information
Contributions
D.F. designed and performed the experiments, evaluated and interpreted the results, and wrote the initial draft of the paper. G.W. provided scientific support for running the experiments and contributed to writing the paper. I.W., M.L., S.S., N.G.L., O.S.M.E.N., G.M.-F. contributed to writing the paper. DJ supervised the study. D.T. and J.N.K. designed and supervised the experiments and wrote the paper.
Ethics declarations
Competing interests
The authors declare the following competing interests. O.S.M.E.N. holds shares in StratifAI GmbH. J.N.K. declares consulting services for Owkin, France; DoMore Diagnostics, Norway; Panakeia, UK, and Scailyte, Basel, Switzerland; furthermore J.N.K. holds shares in Kather Consulting, Dresden, Germany; and StratifAI GmbH, Dresden, Germany, and has received honoraria for lectures and advisory board participation by AstraZeneca, Bayer, Eisai, MSD, BMS, Roche, Pfizer and Fresenius. D.T. received honoraria for lectures by Bayer and holds shares in StratifAI GmbH, Germany. The remaining authors declare no competing interests.
Peer review
Peer review information
Nature Communications thanks the anonymous reviewer(s) for their contribution to the peer review of this work. A peer review file is available.
Additional information
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
Ferber, D., Wölflein, G., Wiest, I.C. et al. In-context learning enables multimodal large language models to classify cancer pathology images. Nat Commun 15, 10104 (2024). https://doi.org/10.1038/s41467-024-51465-9