Fig. 5: Few-shot sampling improves text-based reasoning.
From: In-context learning enables multimodal large language models to classify cancer pathology images

Panel A depicts the workflow, from GPT-4V's initial prediction and its reasoning process ('thoughts') to the generation of text feature embeddings with Ada 002. The series of t-SNE plots shows the progression from a zero-shot setting on the far left through one-, three-, and five-shot kNN sampling towards the right. All data are obtained from the CRC100K dataset. In the t-SNE plots, color coding distinguishes between the model's final classifications ('Answers', top) and the ground truth ('Labels', bottom). The introduction of few-shot image sampling noticeably refines the model's textual reasoning, as evidenced by the formation of more distinct clusters aligned with both the model's own responses (top) and the underlying ground truth (bottom). S denotes the silhouette score, calculated for each t-SNE. Complementing these visualizations, Supplementary Fig. 2 features word clouds that further illustrate the alignment of the model's vocabulary with clinical diagnoses, highlighting key terms such as "lymph node" for normal tissue and "metastatic / breast cancer" for malignancies, thereby enhancing the interpretability of the model's diagnostic reasoning.

Panel B presents two exemplary scenarios that demonstrate the potential superiority of integrated vision-language models over stand-alone image classification models. Left: the original annotation identifies the sample as stroma (STR), yet GPT-4V categorizes it as tumor (TUM). The rationale provided by the model appears plausible, notably pointing out several abnormally shaped nuclei, visible, for instance, in the lower right corner; the sample indeed appears to represent a borderline case. Among the top 500 patch embeddings closest to the reference image, the dominant fraction is labeled as tumor (67%), a smaller proportion as stroma (32%), and a negligible share (<1%) as lymphocytes or regular colon epithelium. Exploring GPT-4V's interpretive process can thus help identify and understand complex edge cases, going beyond what is possible with conventional image classifiers alone. Right: chicken-wire patterns are a described feature of the histology of liposarcoma, which arises from adipocyte precursor cells; the term stems from the pattern's resemblance to chicken-wire fencing (shown on the right). GPT-4V effectively leverages this knowledge from another context to describe the morphology of the adipocytes shown in this image. This way of performing 'transfer learning' could have strong implications for teaching.

* The image name in the CRC100K cohort is STR-TCGA-VEMARASN.
+ The image name in the CRC100K cohort is ADI-TCGA-QFVSMHDD.
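As a rough illustration of the Panel A pipeline, the sketch below embeds GPT-4V's free-text 'thoughts' with Ada 002, projects the embeddings with t-SNE, and computes a silhouette score (S) per projection. This is a minimal sketch, assuming the OpenAI Python client (v1+) and scikit-learn; the `thoughts` and `labels` lists are illustrative placeholders, and the paper's exact preprocessing and t-SNE/silhouette settings may differ.

```python
# Minimal sketch of the Panel A pipeline: embed GPT-4V 'thoughts',
# project with t-SNE, and score cluster separation with a silhouette.
# Assumes the OpenAI Python client (>=1.0) and scikit-learn.
import numpy as np
from openai import OpenAI
from sklearn.manifold import TSNE
from sklearn.metrics import silhouette_score

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def embed_texts(texts: list[str]) -> np.ndarray:
    """Embed free-text reasoning with the Ada 002 model."""
    resp = client.embeddings.create(model="text-embedding-ada-002", input=texts)
    return np.array([item.embedding for item in resp.data])

# Placeholders: one GPT-4V reasoning string and one ground-truth tissue
# class per CRC100K image; replace with the real model outputs.
thoughts = [
    "Irregular glands lined by crowded, atypical epithelial cells.",
    "Sheets of pleomorphic cells with enlarged hyperchromatic nuclei.",
    "Bland spindle cells within a collagenous, fibrous background.",
    "Loose fibrous tissue without significant nuclear atypia.",
]
labels = ["TUM", "TUM", "STR", "STR"]

embeddings = embed_texts(thoughts)
# perplexity must stay below the sample count; tune it for real data.
coords = TSNE(n_components=2, perplexity=2, random_state=0).fit_transform(embeddings)

# S in Fig. 5: silhouette score of the 2-D projection, grouped by label.
s = silhouette_score(coords, labels)
print(f"silhouette score S = {s:.2f}")
```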
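The kNN few-shot sampling in Panel A and the top-500 neighbor tally in Panel B (left) both reduce to nearest-neighbor lookups in an embedding space. The sketch below, assuming patch feature vectors have already been extracted with some image encoder, retrieves the k most similar labeled patches; with a small k these can serve as few-shot in-context examples, and with k = 500 they yield the label fractions reported for the reference image. All names here are illustrative placeholders, not the paper's code.

```python
# Minimal sketch: cosine-similarity kNN over precomputed patch embeddings.
# Used two ways in Fig. 5: selecting few-shot in-context examples (small k)
# and tallying labels among the 500 nearest patches (Panel B, left).
from collections import Counter
import numpy as np

def top_k_neighbors(query: np.ndarray, bank: np.ndarray, k: int) -> np.ndarray:
    """Indices of the k bank patches most similar to the query (cosine)."""
    q = query / np.linalg.norm(query)
    b = bank / np.linalg.norm(bank, axis=1, keepdims=True)
    sims = b @ q
    return np.argsort(-sims)[:k]

rng = np.random.default_rng(0)
# Placeholders: a bank of patch embeddings with tissue labels and one
# reference-image embedding; replace with real extracted features.
bank = rng.normal(size=(10_000, 512))
bank_labels = rng.choice(["TUM", "STR", "LYM", "NORM"], size=10_000)
reference = rng.normal(size=512)

# Few-shot sampling: the k nearest labeled patches become in-context examples.
few_shot_idx = top_k_neighbors(reference, bank, k=5)

# Panel B, left: label fractions among the 500 nearest patches.
nn_idx = top_k_neighbors(reference, bank, k=500)
fractions = {c: n / 500 for c, n in Counter(bank_labels[nn_idx]).items()}
print(fractions)
```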