Introduction

Automatic disease diagnosis is pivotal in clinical practice, leveraging clinical data to generate potential diagnoses with minimal human input1. It enhances diagnostic accuracy, supports clinical decision-making, and addresses healthcare disparities by providing high-quality diagnostic services2. Additionally, it boosts efficiency, especially for clinicians managing aging populations with multiple comorbidities3,4,5. For example, DXplain6 analyzes patient data to generate diagnoses with justifications. Online services also promote early diagnosis and large-scale screening for diseases like mental health disorders, raising awareness and mitigating risks4,7,8,9,10.

Advances in artificial intelligence (AI) have driven two waves of automated diagnostic systems11,12,13,14. Early approaches utilized machine learning techniques like support vector machines and decision trees15,16. With larger datasets and computational power, deep learning (DL) models, such as convolutional, recurrent, and generative adversarial networks, became predominant1,2,17,18,19,20. However, these models require extensive labeled data and are task-specific, limiting their flexibility1,19,21. The rise of generative large language models (LLMs), like GPT22 and LLaMA23, pre-trained on extensive corpora, has demonstrated significant potential in various clinical applications, such as question answering24,25 and information retrieval26,27. These models are increasingly applied to diagnostics. For example, PathChat28, a vision-language LLM fine-tuned with comprehensive instructions, set new benchmarks in pathology. Similarly, Kim et al.29 reported that GPT-4 outperformed mental health professionals in diagnosing obsessive-compulsive disorder, underscoring its potential in mental health diagnostics.

Despite growing interest, several key questions remain unresolved: Which diseases and medical data have been explored for LLM-based diagnostics (Q1)? What LLM techniques are most effective for diagnostic tasks (see Box 1), and how should they be selected (Q2)? What evaluation methods best assess performance across various diagnostic tasks (Q3)? Many reviews have explored the use of LLMs in medicine30,31,32,33,34,35,36,37, but they typically provide broad overviews of diverse clinical applications rather than focusing specifically on disease diagnosis. For instance, Pressman et al.38 surveyed various clinical applications of LLMs, e.g., pre-consultation, treatment, and patient education. These reviews tend to overlook the nuanced development of LLMs for diagnostic tasks and do not analyze the distinct merits and challenges in this area, revealing a critical research gap. Some reviews39,40 have focused on specific specialties, such as digestive or infectious diseases, but failed to offer a comprehensive perspective that spans multiple specialties, data types, LLM techniques, and diagnostic tasks to fully address the critical questions at hand.

This review addresses the gap by offering a comprehensive examination of LLMs in disease diagnosis through in-depth analyses. First, we systematically investigated a wide range of disease types, corresponding clinical specialties, medical data, data modalities, LLM techniques, and evaluation methods utilized in existing diagnostic studies. Second, we critically evaluated the strengths and limitations of prevalent LLM techniques and evaluation strategies, providing recommendations for data preparation, technique selection, and evaluation approaches tailored to different contexts. Additionally, we identified the shortcomings of current studies and outlined future challenges and directions. To the best of our knowledge, this is the first review dedicated exclusively to LLM-based disease diagnosis, presenting a holistic perspective and a blueprint for future research in this ___domain.

Results

Overview of the scope

This section outlines the scope of our review and key findings. Figure 1 provides an overview of disease types, clinical specialties, data types, and modalities (Q1), and introduces the applied LLM techniques (Q2) and evaluation methods (Q3), addressing the key questions. Our analysis spans 19 clinical specialties and over 15 types of clinical data in diagnostic tasks, covering modalities such as text, image, video, audio, time series, and multimodal data. We categorized existing works based on LLM techniques, which fall into four categories: prompting, retrieval-augmented generation (RAG), fine-tuning, and pre-training, with the latter three further subdivided. Table 1 summarizes the taxonomy of mainstream LLM techniques. Figure 2 illustrates the associations between clinical specialties, modalities of utilized data, and LLM techniques in the included papers. Additionally, Fig. 3 presents a meta-analysis, covering publication trends, widely-used LLMs for training and inference, and statistics on data sources, evaluation methods, data privacy, and data sizes. Collectively, these analyses comprehensively depict the development of LLM-based disease diagnosis.

Fig. 1: Overview of the investigated scope.
figure 1

The figure illustrates disease types and the associated clinical specialties, clinical data types, modalities of the utilized data, the applied LLM techniques, and evaluation methods. Only a subset of clinical specialties, representative diseases, and LLM techniques is shown.

Table 1 Overview of LLM techniques for diagnostic tasks
Fig. 2
figure 2

Summary of the association between clinical specialties (left), data modalities (middle), and LLM techniques (right) across the included studies on disease diagnosis.

Fig. 3: Metadata of information from LLM-based diagnostic studies in the scoping review.
figure 3

a Quarterly breakdown of LLM-based diagnostic studies. Since the information for 2024-Q3 is incomplete, our statistics only cover up to 2024-Q2. b The top 5 widely-used LLMs for inference and training. c Breakdown of the data source by regions. d Breakdown of evaluation methods (note that some papers utilized multiple evaluation methods). e Breakdown of the employed datasets by privacy status. f Distribution of data size used for LLM techniques. The red line indicates the median value, while the box limits represent the interquartile range (IQR) from the first to third quartiles. Notably, pre-trained diagnostic models were often followed by other LLM techniques (e.g., fine-tuning), yet this figure only includes studies that primarily used fine-tuning or RAG. Statistics for prompting methods are not included because: (i) hard prompts generally utilize zero or very few demonstration samples, and (ii) although soft prompts require more training data, the number of relevant studies is insufficient for meaningful distribution analysis.

Study characteristics

As shown in Fig. 2, the included studies span all 19 clinical specialties, with some specialties, such as pulmonology and neurology, receiving particular attention. While most studies leveraged the text modality, multi-modal data, such as text-image41 and text-tabular data42, were also widely adopted for diagnostic tasks. Another observation is that various LLM techniques have been applied to diagnostic tasks, and all have been used with multi-modal data (Table 1). Additionally, we observed an increasing number of LLM-based diagnostic studies worldwide, reflecting the field’s growing significance (Fig. 3a). Among these studies, the GPT22 and LLaMA23 families dominate inference tasks, while LLaMA and ChatGLM43 are commonly adopted for model training (Fig. 3b). Figure 3c shows that most datasets originate from North America (50.6%) and Asia (33.9%), and 50.4% of the studies used public datasets (Fig. 3e). Evaluation methods vary: 66.8% rely on automated evaluation, 28.1% on human assessment, and 5.1% on LLM-based evaluation (Fig. 3d). Figure 3f reveals that the included studies employed large datasets (e.g., 5 × 10⁵ samples) for pre-training diagnostic models, surpassing those primarily using fine-tuning or RAG. This aligns with another observation that over half of the pre-training studies used data from multiple specialties.

Prompt-based disease diagnosis

A customized prompt typically includes four components: instruction (task specification), context (scenario or ___domain), input data (data to process), and output indicators (desired style or role). In this review, over 60% (N = 278) of studies employed prompt-based techniques, categorized as hard prompts and soft prompts. Hard prompts are static, interpretable, and written in natural language. The most common methods included zero-shot (N = 194), Chain-of-Thought (CoT) (N = 37), and few-shot prompting (N = 35). Among them, CoT prompting excels at breaking input clinical cues into manageable reasoning steps to reach a coherent diagnostic decision. In differential diagnosis tasks, for example, CoT reasoning allows the LLM to sequentially analyze medical images, radiology reports, and clinical history, generating intermediate outputs that lead to a holistic decision, with an accuracy of 64%44. Self-consistency prompting was used in a few studies (N = 4). For instance, a study combined self-consistency with CoT prompting to improve depression prediction by synthesizing diverse data sources through multiple reasoning paths. This hybrid approach reduced the mean absolute error by nearly 50% compared to standard CoT methods45.
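
To make these prompt components concrete, the following minimal Python sketch (not drawn from any included study) assembles a hard prompt from the four components and optionally appends a chain-of-thought cue; the clinical vignette and wording are illustrative assumptions.

```python
# Minimal sketch: assembling a hard prompt from the four components described
# above, with an optional chain-of-thought cue. The case text is hypothetical.

def build_prompt(instruction: str, context: str, input_data: str,
                 output_indicator: str, use_cot: bool = False) -> str:
    """Concatenate instruction, context, input data, and output indicator."""
    cot_hint = "\nLet's reason step by step before giving the final answer." if use_cot else ""
    return (f"Instruction: {instruction}\n"
            f"Context: {context}\n"
            f"Input: {input_data}\n"
            f"Output format: {output_indicator}{cot_hint}")

prompt = build_prompt(
    instruction="List the three most likely diagnoses with brief justifications.",
    context="You are an experienced internal-medicine physician.",
    input_data=("58-year-old male with chest pain radiating to the left arm, "
                "diaphoresis, and ST-segment elevation on ECG."),
    output_indicator="A ranked list of diagnoses, most likely first.",
    use_cot=True,
)
print(prompt)  # The assembled prompt is then sent to the LLM of choice.
```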

In contrast, soft prompts (N = 6) are continuous vector embeddings trained to adapt the behavior of LLMs for specific tasks46. These prompts effectively integrate external knowledge, such as medical concept embeddings and clinical profiles, making them well-suited for complex diagnostic tasks requiring nuanced analysis. This knowledge-enhanced approach achieved F1 scores exceeding 0.94 for diagnosing common diseases like hypertension and coronary artery disease and demonstrated superiority in rare disease diagnosis47.
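
As a rough illustration of how soft prompts differ from hard prompts, the PyTorch sketch below prepends trainable continuous embeddings to a frozen LLM’s input embeddings; the dimensions and initialization are assumptions made only for illustration.

```python
# Minimal prompt-tuning sketch: trainable soft-prompt vectors prepended to the
# frozen LLM's input embeddings. Dimensions below are illustrative.
import torch
import torch.nn as nn

class SoftPrompt(nn.Module):
    def __init__(self, n_prompt_tokens: int = 20, embed_dim: int = 768):
        super().__init__()
        self.prompt = nn.Parameter(torch.randn(n_prompt_tokens, embed_dim) * 0.02)

    def forward(self, input_embeds: torch.Tensor) -> torch.Tensor:
        # input_embeds: (batch, seq_len, embed_dim) from the frozen embedding layer
        prompt = self.prompt.unsqueeze(0).expand(input_embeds.size(0), -1, -1)
        return torch.cat([prompt, input_embeds], dim=1)

soft_prompt = SoftPrompt()
dummy_embeds = torch.randn(2, 128, 768)   # stand-in for embedded clinical notes
extended = soft_prompt(dummy_embeds)      # (2, 148, 768), passed to the frozen LLM
# During training, only soft_prompt.parameters() are updated.
```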

Most prompt-based studies (N = 221) focused on unimodal data, predominantly text (N = 171). Clinical text sources like clinical notes48, imaging reports49,50,51, and case reports52,53 were commonly used. These studies often prompted LLMs with clinical notes or case reports to predict potential diagnoses54,55,56,57. A smaller subset (N = 19) applied prompt engineering to medical image data, analyzing CT scans58, X-rays59,60, MRI scans58,61, and pathological images62,63 to detect abnormalities and provide evidence for differential diagnoses62,64,65,66.

With the advancement of multimodal LLMs, 57 studies explored their application in disease diagnosis through prompt engineering. Visual-language models (VLMs) like GPT-4V, LLaVA, and Flamingo (N = 37) integrated medical images (e.g., radiology scans) with textual descriptions (e.g., clinical notes)67,68,69. For example, incorporating ophthalmologist feedback and contextual details with eye movement images significantly improved GPT-4V’s diagnostic accuracy for amblyopia64.

Beyond image-text data, more advanced multimodal LLMs (e.g., GPT-4o and Gemini-1.5 Pro) have also integrated other data types to support disease diagnosis in complex clinical scenarios. Audio and video data have been used to diagnose neurological and neurodegenerative disorders, such as autism70,71 and dementia59,72. Time-series data, such as ECG signals and wearable sensor outputs, were used to support arrhythmia detection73,74. With the integration of tabular data such as user demographics75,76 and lab test results47,77, the applications have been extended to depression and anxiety screening45. Omics data has been integrated to aid in identifying rare genetic disorders78 and diagnosing Alzheimer’s disease76. Some studies further enhanced diagnostic capabilities by integrating medical concept graphs to provide a richer context for conditions such as neurological disorders59.

Retrieval-augmented LLMs for diagnosis

To enhance the accuracy and credibility of diagnoses, alleviate hallucination issues, and update LLMs’ stored medical knowledge without re-training, recent studies79,80,81 have incorporated external medical knowledge into diagnostic tasks. In the included papers, the external knowledge primarily comes from corpora64,79,82,83,84,85,86,87,88, databases74,80,89,90,91,92,93, and knowledge graphs81,94. Based on the data modality, these RAG-based studies can be roughly categorized into text-based, text-image-based, and time-series-based augmentations.

In text-based RAG, most studies80,82,84,85,91,92,93 utilized basic retrieval methods where external knowledge was encoded as vector representations using sentence transformers, such as OpenAI’s text-embedding-ada-002. Queries were similarly encoded, and relevant knowledge was retrieved based on vector similarities. The retrieved data was then input into LLMs with specific prompts to produce diagnostic outcomes. In contrast, Li et al.88 developed guideline-based GPT agents for retrieving and summarizing content related to diagnosing traumatic brain injury. They found that these guideline-based GPT-4 agents significantly outperformed the off-the-shelf GPT-4 in terms of accuracy, explainability, and empathy evaluation. Similarly, Thompson et al.79 employed regular expressions to extract relevant knowledge for diagnosing pulmonary hypertension, achieving about a 20% improvement compared to structured methods. Additionally, Wen et al.81 integrated knowledge graph retrieval with LLMs to enable diagnostic inference by combining implicit and external knowledge, achieving an F1 score of 0.79.
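
The basic retrieval pipeline described above can be sketched as follows; the encoder name, knowledge snippets, and query are assumptions made for illustration rather than materials from the cited studies.

```python
# Illustrative text-based RAG sketch: encode an external corpus and a query,
# retrieve the most similar passages, and compose a diagnostic prompt.
import numpy as np
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("all-MiniLM-L6-v2")   # assumed sentence encoder
corpus = [
    "Pulmonary hypertension is defined as a mean pulmonary arterial pressure above 20 mmHg.",
    "Typical findings include exertional dyspnea, fatigue, and right-heart strain on echocardiography.",
    "Initial work-up includes echocardiography, followed by right-heart catheterization for confirmation.",
]
corpus_emb = encoder.encode(corpus, normalize_embeddings=True)

query = "Patient with progressive exertional dyspnea and elevated right ventricular pressure."
query_emb = encoder.encode([query], normalize_embeddings=True)[0]

scores = corpus_emb @ query_emb                    # cosine similarity (normalized vectors)
top_idx = np.argsort(scores)[::-1][:2]             # keep the two most relevant passages
retrieved = "\n".join(corpus[i] for i in top_idx)

prompt = (f"Reference knowledge:\n{retrieved}\n\n"
          f"Case: {query}\nProvide the most likely diagnosis with justification.")
# `prompt` is then passed to the LLM to generate the diagnostic output.
```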

In text-image data processing, a common approach87,91 involved extracting image features and text features and aligning them within a shared semantic space. For instance, Ferber et al.91 used GPT-4V to extract crucial image data for oncology diagnostics, achieving a 94% completeness rate and an 89.2% helpfulness rate. Similarly, Ranjit et al.87 utilized multimodal models to compute image-text similarities for chest X-ray analysis, leading to a 5% absolute improvement in the BERTScore metric. Notably, one study fine-tuned LLMs with retrieved documents to enhance X-ray diagnostics86, attaining an average accuracy of 0.86 across three datasets.

For time-series RAG, most studies focused on the electrocardiogram (ECG) analysis74,83. For example, Yu et al.83 transformed fundamental ECG conditions into enhanced text descriptions by utilizing relevant information for ECG analysis, resulting in an average AUC of 0.96 across two arrhythmia detection datasets. Additionally, Chen et al.95 integrated retrieved disease records with ECG data to facilitate the diagnosis of hypertension and myocardial infarction.

Fine-tuning LLMs for diagnosis

Fine-tuning an LLM typically encompasses two pivotal stages: supervised fine-tuning (SFT) and reinforcement learning from human feedback (RLHF). SFT trains models on task-specific instruction-response pairs, enabling them to interpret instructions and generate outputs across diverse modalities. This phase establishes a foundational understanding, ensuring the model processes inputs effectively. RLHF further refines the model by aligning its behavior with human preferences. Using reinforcement learning, the model is optimized to produce responses that are helpful, truthful, and aligned with societal and ethical standards96.

In medical applications, SFT enhances in-context learning, reasoning, planning, and role-playing capabilities, improving diagnostic performance. This process integrates inputs from various data modalities into the LLM’s word embedding space. For example, following the LLaVA approach97, visual data is converted into token embeddings using an image encoder and projector, then fed into the LLM for end-to-end training. In this review, 49 studies focused on SFT using medical texts, such as clinical notes98, medical dialogs99,100,101, or reports102,103,104. Additionally, 43 studies combined medical texts with images, including X-rays102,105,106,107, MRIs104,107,108, or pathology images109,110,111. A few studies explored disease detection from medical videos102,112, where video frames were sampled and converted into visual token embeddings. Generally, effective SFT requires collecting high-quality, diverse responses to task-specific instructions to ensure comprehensive training.
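
The sketch below illustrates how a text-only SFT example might be formatted, with the loss masked on prompt tokens; the tokenizer (a small open model standing in for a medical base LLM), template, and clinical case are assumptions rather than the setup of any particular study.

```python
# Minimal SFT data-formatting sketch: an instruction-response pair built from a
# hypothetical clinical note, with prompt tokens excluded from the loss.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")   # stand-in for a medical base LLM

note = "72-year-old female with fever, productive cough, and right lower lobe consolidation."
diagnosis = "Community-acquired pneumonia."

prompt = ("### Instruction:\nProvide the most likely diagnosis.\n\n"
          f"### Input:\n{note}\n\n### Response:\n")
full_text = prompt + diagnosis + tokenizer.eos_token

enc = tokenizer(full_text, return_tensors="pt")
labels = enc["input_ids"].clone()
prompt_len = len(tokenizer(prompt)["input_ids"])
labels[:, :prompt_len] = -100    # ignore prompt tokens when computing the LM loss
# (input_ids, attention_mask, labels) then constitute one SFT training example.
```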

RLHF methods are categorized as online or offline. Online RLHF, integral to ChatGPT’s success113, involves training a reward model on datasets of prompts and human preferences and using reinforcement learning algorithms like Proximal Policy Optimization (PPO)114 to optimize the LLM. Studies have shown its potential in improving medical LLMs for diagnostic tasks115,116,117. For instance, Zhang et al.117 aligned their model with physician characteristics, achieving strong performance in disease diagnosis and etiological analysis; the diagnostic performance of their model, HuatuoGPT, surpassed GPT-3.5 in over 60% of cases on the Meddialog dataset118. However, online RLHF’s effectiveness depends heavily on the reward model’s quality, which may suffer from over-optimization119 and data distribution shifts120. Additionally, reinforcement learning often faces instability and control challenges121. Offline RLHF, such as Direct Preference Optimization (DPO)122, frames RLHF as optimizing a classification loss, bypassing the need for a reward model. This approach is more stable and computationally efficient, proving valuable for aligning medical LLMs123,124. Yang et al.124 reported significant performance drops on pediatric benchmarks when the offline RLHF phase was omitted. A high-quality dataset of prompts and human preferences is essential for online RLHF reward model calibration125 or the convergence of offline methods like DPO126, whether sourced from experts113 or advanced AI models127.
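
For reference, the DPO objective can be written as a simple classification-style loss over preference pairs. The sketch below assumes the summed log-probabilities of the chosen and rejected responses have already been computed under the policy and the frozen reference model; the numeric values are placeholders.

```python
# Minimal DPO loss sketch: preference optimization without an explicit reward
# model. Input log-probabilities and the example values are illustrative.
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta: float = 0.1):
    chosen_ratio = policy_chosen_logps - ref_chosen_logps
    rejected_ratio = policy_rejected_logps - ref_rejected_logps
    logits = beta * (chosen_ratio - rejected_ratio)
    return -F.logsigmoid(logits).mean()

loss = dpo_loss(torch.tensor([-12.3]), torch.tensor([-15.8]),
                torch.tensor([-13.0]), torch.tensor([-14.9]))
print(loss.item())
```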

Since full training of LLMs is challenging due to high GPU demands, parameter-efficient fine-tuning (PEFT) reduces the number of tunable parameters. The most common PEFT method, Low-Rank Adaptation (LoRA)128, introduces trainable rank decomposition matrices into each layer without altering the model architecture or adding inference latency. In this review, all PEFT-based studies (N = 7) used LoRA to reduce training costs98,104,124.
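
A typical LoRA setup with the Hugging Face peft library is sketched below; the base model name, target modules, and hyperparameters are illustrative choices rather than those used in the reviewed studies.

```python
# Hedged LoRA sketch: wrapping an assumed base LLM with low-rank adapters so
# that only a small fraction of parameters is trained.
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

base = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")  # assumed base model
config = LoraConfig(
    r=8,                                  # rank of the trainable decomposition matrices
    lora_alpha=16,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # attention projections to adapt
    task_type="CAUSAL_LM",
)
model = get_peft_model(base, config)
model.print_trainable_parameters()        # typically well under 1% of all parameters
```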

Pre-training LLMs for diagnosis

Pre-training medical LLMs involves training on large-scale, unlabeled medical corpora to develop a comprehensive understanding of the structure, semantics, and context of medical language. Unlike fine-tuning, pre-training enables the acquisition of extensive medical knowledge, enhancing generalization to unseen cases and improving robustness across diverse diagnostic tasks. In this review, five studies pre-trained LLMs on text-only corpora from different sources129,130,131,132, such as clinical notes, medical QA texts, dialogs, and Wikipedia. Moreover, eight studies injected medical visual knowledge into multimodal LLMs via pretraining109,133,134,135,136,137. For instance, Chen et al.137 employed an off-the-shelf multimodal LLM to reformat image-text pairs from PubMed into VQA data points for training their diagnostic model. To improve the quality of the image encoder, common pretraining tasks include reconstructing images at the tile or slide level109 and aligning similar images or image-text pairs133.
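
For completeness, a single causal language-modeling step, the core operation of text-only pre-training, is sketched below; the small open model and the clinical sentence are stand-ins for the large medical corpora and LLMs used in practice.

```python
# Minimal causal language-modeling step: next-token prediction on unlabeled
# medical text. The model and text are placeholders for illustration.
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

text = ("The patient presented with polyuria, polydipsia, and an HbA1c of 9.2%, "
        "consistent with poorly controlled type 2 diabetes mellitus.")
batch = tokenizer(text, return_tensors="pt")
outputs = model(**batch, labels=batch["input_ids"])  # shifted next-token loss
outputs.loss.backward()                              # one optimizer step would follow
```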

Performance evaluation

Evaluation methods for diagnostic tasks generally fall into three categories (Table 2): automated evaluation138, human evaluation138, and LLM evaluation139, each with distinct advantages and limitations (Fig. 4).

Table 2 Overview of evaluation metrics for diagnostic tasks
Fig. 4
figure 4

Summary of the evaluation approaches for diagnostic tasks.

In this review, most studies (N = 266) relied on automated evaluation, which is efficient, scalable, and well-suited for large datasets. These metrics can be grouped into three types. (1) Classification-based metrics, such as accuracy, precision, and recall, are commonly used for disease diagnosis. For instance, Liu et al.133 evaluated COVID-19 diagnostic performance using AUC, accuracy, and F1 score. (2) Differential diagnosis metrics, including top-k precision, assess ranked diagnosis lists. Tu et al.140 employed top-k accuracy to evaluate the correctness of differential diagnosis predictions. (3) Regression-based metrics, such as mean squared error (MSE)141, quantify deviations between predicted and actual values142. Despite their efficiency, automated metrics rely on ground-truth diagnoses143, which may be unavailable, and cannot assess contextual qualities, such as the readability of diagnostic explanations or their clinical utility144. They also struggle with complex tasks, such as evaluating the medical correctness of diagnostic reasoning145.
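
As a concrete illustration of the differential-diagnosis metrics in (2), top-k accuracy can be computed as in the following minimal sketch; the ranked lists and ground truths are made-up examples.

```python
# Illustrative top-k accuracy for ranked differential-diagnosis lists.
def top_k_accuracy(ranked_predictions, ground_truths, k: int = 3) -> float:
    """Fraction of cases whose true diagnosis appears in the model's top-k list."""
    hits = sum(truth in preds[:k] for preds, truth in zip(ranked_predictions, ground_truths))
    return hits / len(ground_truths)

preds = [["myocardial infarction", "unstable angina", "pericarditis"],
         ["migraine", "tension headache", "cluster headache"]]
truths = ["unstable angina", "subarachnoid hemorrhage"]
print(top_k_accuracy(preds, truths, k=3))   # 0.5
```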

Human evaluation (N = 112), conducted by medical experts24,138, does not require ground-truth labels and integrates expert judgment, making it suitable for complex, nuanced assessments. However, it is costly, time-consuming, and prone to subjectivity, limiting its feasibility for large-scale evaluation. Recent studies have explored using LLM evaluation (N = 20), a.k.a. LLM-as-Judges139, to replace human experts in evaluation and combine the interpretative depth of LLM judgment with the efficiency of automated evaluation. While ground-truth accessibility is not strictly necessary99,116, its inclusion improves reliability143. Popular LLMs used for this purpose include GPT-3.5, GPT-4, and LLaMA-3. However, this approach remains constrained by LLM limitations, including susceptibility to hallucinations99 and difficulties in handling complex diagnostic reasoning146. In summary, each evaluation approach has distinct advantages and limitations, with the choice dependent on the specific requirements of the task. Figure 4 guides the selection of suitable evaluation approaches for different scenarios.

Discussion

This section analyzes key findings from the included studies, discusses the suitability of mainstream LLM techniques for varying resource constraints and data preparation, and outlines challenges and future research directions.

The rapid rise of LLM-based diagnosis studies (Fig. 3a) might partially be attributed to the increased availability of public datasets147 and advanced off-the-shelf LLMs57. Besides, the top five LLMs used for training and inference differ significantly (Fig. 3b), reflecting the interplay between effectiveness and accessibility. Generally, closed-source LLMs, with their vast parameters and superior performance143, are favored for LLM inference, while open-source LLMs are essential for developing ___domain-specific models due to their adaptability148. These factors underscore the dual influence of effectiveness and accessibility on diagnostic applications. Additionally, the regional analysis of datasets (Fig. 3c) reveals that 84.5% of datasets originate from North America and Asia, potentially introducing racial biases in this research ___domain149.

Most studies employed prompting for disease diagnosis (Fig. 2), leveraging its advantages, such as minimal data requirements, ease of use, and low computational demands150. Meanwhile, LLMs’ extensive medical knowledge allowed them to perform competitively across diverse diagnostic tasks when effectively applied24,143. For example, a study fed two data samples into GPT-4 for depression detection151, and its performance significantly exceeded that of traditional DL-based models. In summary, prompting LLMs facilitates the development of effective diagnostic systems with minimal effort, contrasting with conventional DL-based approaches that require extensive supervised training on large datasets2,17.

We then compare the advantages and limitations of mainstream LLM techniques to indicate their suitability for varying resource constraints, along with a discussion of data preparation. Generally, the choice of LLM technique for diagnostic systems depends on the quality and quantity of available data. Prompt engineering is particularly effective in few-data scenarios (e.g., zero or three cases with ground-truth diagnoses), requiring minimal setup24,152. RAG relies on a high-quality external knowledge base, such as databases80 or corpora82, to retrieve accurate information during inference. Fine-tuning requires well-annotated datasets with sufficient labeled diagnostic cases133. Pre-training, by contrast, utilizes diverse corpora, including unstructured text (e.g., clinical notes, literature) and structured data (e.g., lab results), to establish a robust knowledge foundation via unsupervised language modeling42,153. Although fine-tuning and pre-training facilitate high performance and reliability133, they demand significant resources, including advanced hardware and extensive biomedical data (see Fig. 3f), which are costly and often hard to obtain24. In practice, not all diagnostic scenarios require expert-level accuracy. Applications such as large-scale screenings154, mobile health risk alerts155, or public health education30 prioritize cost-effectiveness and scalability. Overall, balancing accuracy with resource constraints depends on the specific use case.

Despite advances in LLM-based methods for disease diagnosis, this scoping review highlighted several barriers to their clinical utility (Fig. 5). One limitation lies in information gathering. Most studies implicitly assume that the available patient information is sufficient for diagnosis, an assumption that often fails156, especially in initial consultations or with complex diseases, increasing the risk of misdiagnosis157. In practice, clinical information gathering is iterative, starting with initial data (e.g., subjective symptoms), refining diagnoses, and conducting further tests or screenings158. This process relies heavily on experienced clinicians140. To reduce this dependence, recent studies have explored multi-round diagnostic dialogs to collect relevant information159,160. For example, AMIE140 uses LLMs for clinical history-taking and diagnostic dialog, while Sun et al.160 utilized reinforcement learning to formulate disease screening questions. Future efforts could further embed awareness of information incompleteness into models or develop techniques for automatic diagnostic queries161. Another limitation arises from the reliance on single data modalities, whereas clinicians typically synthesize information from multiple modalities for accurate diagnosis44. Additionally, real-world health systems often operate in isolated data silos, with patient information distributed across institutions26. Addressing these issues will require efforts to collect and integrate multi-modal data and establish unified health systems that facilitate seamless data sharing across institutions162.

Fig. 5
figure 5

Summary of the limitations and future directions for LLM-based disease diagnosis.

Barriers also exist in the information integration process. Some studies utilized clinical vignettes for diagnostic tasks without fulfilling the SOAP standard163. While adhering to clinical guidelines is crucial142, few studies have incorporated this factor into diagnostic systems164. For example, Kresevic et al.82 sought to enhance clinical decision support systems by accurately explaining guidelines for chronic hepatitis C management. Besides, the integration and interpretation of lab test results offer significant value in healthcare165. For example, Bhasuran et al.166 reported that incorporating lab data enhanced the diagnostic accuracy of GPT-4 by up to 30%. A future direction is the effective integration of lab test results into LLM-based diagnostic systems.

Exploring clinician-patient-diagnostic system interactions offers a promising research direction167. Diagnostic systems should assist clinicians by providing supplementary information to improve accuracy and efficiency58,168, while incorporating expert feedback for continuous refinement. A user-friendly interface is essential for effective human-machine interaction, enabling clinicians to input data and engage in discussions with the system. Human language interaction further enhances usability by allowing natural conversation with LLM-based diagnostic tools168, reducing cognitive load. Additionally, LLM-aided explanations improve transparency by providing rationales for suggested diagnoses145, fostering trust, and facilitating informed decision-making among clinicians and patients.

Most studies focused on diagnostic accuracy but overlooked ethical considerations, such as explainability, trustworthiness, privacy protection, and fairness169. Providing diagnostic predictions alone is insufficient in clinical scenarios, as the black-box nature of LLMs often undermines trust99. Designing diagnostic models with built-in explainability is therefore desirable145. For example, Dual-Inf is a prompt-based framework that offers potential diagnoses while explaining its reasoning143. Besides, since LLMs suffer from hallucinations, enhancing users’ trust in LLM-based diagnostic models is worth exploring170. Potential solutions include using fact-checking tools to verify the output’s factuality171. Regarding privacy, adherence to regulations like HIPAA and GDPR, including de-identifying sensitive data, is essential26,172. For example, SkinGPT-4, a dermatology diagnostic system, was designed for local deployment to ensure privacy protection173. Fairness is another concern, as patients should not face discrimination based on gender, age, or race169, but research on fairness in LLM-based diagnostics remains scarce174.

In the context of modeling, building superior models for accurate and reliable diagnosis remains an open challenge. While pre-training on extensive medical datasets benefits diagnostic reasoning175, medical LLMs generally lag behind general-___domain counterparts in parameter scale148,176, underscoring the potential of developing large-scale generalist models for disease diagnosis. Besides, LLMs are prone to catastrophic forgetting177, where previously acquired knowledge or skills are lost when learning new information. Addressing this issue would facilitate the development of generalist diagnostic models but requires robust continual learning capabilities178. One alternative approach for accurate diagnosis involves coordinating multiple specialized models, simulating interdisciplinary clinical discussions to tackle complex cases179. For example, Med-MoE180, a mixture-of-experts framework leveraging medical texts and images, achieved an accuracy of 91.4% in medical image classification. Additionally, hallucinations in LLMs undermine diagnostic reliability170, necessitating solutions such as knowledge editing181, external knowledge retrieval82, and novel model architectures or pre-training strategies175. Another promising avenue is longitudinal data modeling, as clinicians routinely analyze EHRs spanning multiple years to inform decision-making182,183. Besides, modeling temporal data helps with early diagnosis56,184 to improve patient outcomes. For example, early detection of lung adenocarcinoma might increase the 5-year survival rate to 52%185. However, challenges like irregular sampling intervals and missing data persist186, necessitating advanced methodologies to effectively capture temporal dependencies25.

Another challenge in developing diagnostic models is benchmark availability147. In this review, 49.6% of the included studies relied on private datasets, which were often inaccessible due to privacy concerns82. Additionally, the scarcity of annotated data limits progress, as well-annotated datasets with ground-truth diagnoses enable automated evaluation, reducing reliance on human assessment143. Hence, constructing and releasing annotated benchmark datasets would greatly support the research community147. Regarding performance evaluation, some studies used either small-scale data57 or unrealistic data, such as snippets from college books145 and LLM-generated clinical notes147, for disease diagnosis, whereas only large-scale real-world data can truly validate diagnostic capabilities182. Besides, the lack of unified qualitative metrics is another issue. For example, the evaluation of diagnostic explanations varies across studies143,187, covering necessity187, consistency108, and completeness143. Unifying qualitative metrics would foster fair comparisons. Additionally, many included studies failed to compare with conventional diagnostic models, while recent studies reported that traditional models, e.g., Transformer188, might outperform LLM-based counterparts in clinical prediction189. Therefore, future studies should compare against traditional baselines for comprehensive evaluation.

Regarding the deployment of diagnostic systems, several challenges warrant further investigation, including model stability, generalizability, and efficiency. Current studies have highlighted that LLMs often struggle with diagnosis stability182, fail to generalize well across data from different institutions190, and encounter efficiency limitations191. For instance, even minor variations in instructions, such as changing “final diagnosis” to “primary diagnosis”, can reduce accuracy by 10.6% for cholecystitis diagnosis182. Addressing these limitations will advance the reliability and applicability of diagnostic models. Another promising avenue is deploying diagnostic algorithms on edge devices192. Such systems could enable the real-time collection of health data, such as ECG rhythms19, to support continuous health monitoring95. However, regulatory barriers, including the stringent approval standards imposed by agencies such as the U.S. Food and Drug Administration (FDA) and the European Union’s Medical Device Regulation (MDR)193, remain a significant obstacle to clinical adoption. Overcoming these challenges will be vital to ensure the safe and effective integration of LLM-based diagnostics into clinical practice.

In conclusion, our study provided a comprehensive review of LLM-based methods for disease diagnosis. Our contributions were multifaceted. First, we summarized the disease types, the associated clinical specialties, clinical data, the employed LLM techniques, and evaluation methods within this research ___domain. Second, we compared the advantages and limitations of mainstream LLM techniques and evaluation methods, offering recommendations for developing diagnostic systems based on varying user demands. Third, we identified intriguing phenomena from the current studies and provided insights into their underlying causes. Lastly, we analyzed the current challenges and outlined the future directions of this research field. In summary, our review presented an in-depth analysis of LLM-based disease diagnosis, outlined its blueprint, inspired future research, and helped streamline efforts in developing diagnostic systems.

Methods

Search strategy and selection criteria

This scoping review followed the PRISMA guidelines, as shown in Fig. 6. We conducted a literature search for relevant articles published between January 1, 2019, and July 18, 2024, across seven electronic databases: PubMed, CINAHL, Scopus, Web of Science, Google Scholar, ACM Digital Library, and IEEE Xplore. Search terms were selected based on expert consensus (see Supplementary Data 1).

Fig. 6: PRISMA flowchart of study records.
figure 6

PRISMA flowchart showing the study selection process.

A two-stage screening process was conducted, focusing on LLMs for human disease diagnosis. The first stage involved title and abstract screening by two independent reviewers, excluding papers based on the following criteria: (a) articles unrelated to LLMs or foundation models, and (b) articles irrelevant to the health ___domain. The second stage was full-text screening, emphasizing language models for diagnosis-related tasks (Supplementary Data 2), excluding non-English articles, review papers, editorials, and studies not explicitly focused on disease diagnosis. The scope included studies that predicted disease probabilities (e.g., the probability of depression) as well as studies in which the foundation models involved a text modality (e.g., vision-language models) while utilizing non-text data (e.g., medical images) as input. Our review excluded foundation models without a text modality, such as vision foundation models, because the scope highlighted “language” models. Following related works194, we further excluded studies purely built on non-generative language models, like BERT188 and RoBERTa195, since generative capability is a critical characteristic of LLMs for developing diagnostic systems in the era of generative AI30,31. Final eligibility was determined by at least two independent reviewers, with disagreements resolved by consensus or a third reviewer.

Data extraction

Information from the articles was categorized into four groups: (1) Basic information: title, publication venue, publication date (year and month), and region of correspondence. (2) Data-related information: data sources (continents), dataset type, modality (e.g., text, image, video, text-image), clinical specialty, disease name, data availability (private or public), and data size. (3) Model-related information: base LLM type, parameter size, and technique type. (4) Evaluation: evaluation scheme (e.g., automated or human) and evaluation metrics (e.g., accuracy, precision). See Supplementary Table 1 for the data extraction form.

Data synthesis

We synthesized insights from the data extraction to highlight key themes in LLM-based disease diagnosis. First, we presented the review scope, covering disease-associated clinical specialties, clinical data, data modalities, and LLM techniques. We also analyzed meta-information, including development trends, the most widely used LLMs, and data source distribution. Next, we summarized various LLM-based techniques and evaluation strategies, discussing their strengths and weaknesses and offering targeted recommendations. We categorized modeling approaches into four areas (prompt-based methods, RAG, fine-tuning, and pre-training), with detailed subtypes. Additionally, we examined challenges in current research and outlined potential future directions. In summary, our synthesis covered data, LLM techniques, performance evaluation, and application scenarios, in line with established reporting standards.