Introduction

Research with real-world data typically relies on human-labeled data for training and validation. Though effective, human annotation can be costly, time-consuming, and prone to errors. Recent research suggests that the few-shot capabilities of generative large language models (LLMs) can be used to annotate text data with reduced time and cost burden1,2,3,4. These capabilities of generative LLMs can be applied to information extraction from patient clinical notes. Traditional methods for information extraction include rule-based approaches, which can be limited by low recall due to user-defined rules and variability of medical texts, and supervised machine learning models, which can be limited by a lack of labeled training data5,6,7. The zero- and few-shot capabilities of LLMs can enable more flexible and scalable information extraction from clinical notes without the need for extensive manual annotation.

While promising, state-of-the-art LLMs (such as GPT-48) are challenging to deploy in a scalable way in healthcare systems. Many of these models (including those from OpenAI, Anthropic, and Google) are proprietary and come with restrictive license terms. Concerns about patient privacy and the lack of transparency of these proprietary models also lead to some hesitancy in their adoption by healthcare institutions9. Additionally, these models can be extremely large and require substantial computational resources (e.g., Llama 405B), limiting their deployment within typical health system IT settings10. So far, many of the successful deployments have been through partnerships in which industry partners subsidize costs or provide in-kind contributions of computing and engineering. This limits the number and type of institutions that are able to participate and the use cases to which generative AI can be applied. Additionally, setting up these partnerships can require additional administrative lift (e.g., legal negotiation and information security evaluation) compared to performing analyses in existing environments, whether institution-hosted or existing private cloud deployments11. Even where solutions have been widely available, such as partnerships for draft inbox responses12, the ability to achieve similar performance with smaller models will make customizing models to a specific institution, as well as serving inference requests at scale, substantially cheaper and less cumbersome.

Challenges in generative AI around scalability necessitate cost-effective and privacy-conscious solutions, which could be addressed through the development of open-source LLMs that can be integrated into existing healthcare system infrastructure. Open-source LLMs historically did not perform as well as their proprietary counterparts13, but recent progress has led to very competitive models across most evaluation metrics14. Recent efforts have been made to evaluate the capacity of locally deployable LLMs to extract clinical information with low hardware requirements15.

Synthetic data generation, distillation, and instruction tuning offer an opportunity to close the gap between open-source and proprietary models. Larger models can generate synthetic data that can be used to fine-tune a smaller model for a given task, with the idea that the smaller model could mirror the performance of the larger model for that task. This process, called distillation, has been shown to improve the performance of these models16,17. It allows researchers to develop models with the potential for wider adoption by reducing computational cost without sacrificing performance. Knowledge and data distillation approaches have been used for medical applications, including health event prediction18, medical dataset sharing19, and medical image analysis20,21. It can be challenging to access annotated data for medical applications due to privacy concerns, regulatory constraints, and the time- and resource-intensive process of manual annotation. Synthetic data generated by LLMs can serve as a potential alternative and have successfully been used as training data for knowledge distillation approaches22,23,24,25. Data augmentation approaches with synthetic data can be used to enhance model performance by producing high-quality, diverse training examples from which the LLM can learn26. For example, fine-tuning on synthetic data generated by GPT-4 improved zero-shot performance of Llama models27.

While synthetic data distillation approaches are actively evolving, there have been few approaches specifically focused on clinical information extraction from unstructured clinical notes. The ability to extract clinical information at scale from unstructured clinical notes could enhance patient phenotyping, which is important for research and clinical applications. Current phenotyping approaches often rely on structured data such as ICD codes, which are used for billing purposes and may not reflect the nuances of the patient’s condition. This can limit analytical precision and potentially introduce biases when studying research outcomes. Unstructured clinical notes, which contain information including medical, social, and family history that may not be captured by structured data, could offer more granular and reliable insight into patient history, particularly in heterogeneous populations where there can be large differences in disease manifestation and progression28,29. LLMs can perform zero-shot information extraction from notes to improve phenotyping accuracy over the use of ICD codes, without the need for extensive manual annotation30.

Another application for these methods is clinical trial recruitment, which requires a comprehensive evaluation of both clinical trial eligibility criteria and patient medical histories in order to appropriately match patients who meet trial requirements31,32,33,34. Synthetic data distillation is particularly useful in this setting, where little labeled data (i.e., paired patient-criterion matching annotations) is available. A recent study developed an LLM framework that used GPT-4 to predict patient eligibility on a criterion-level basis with explanations and achieved near expert-level performance35. Recent work comparing proprietary and open-source models suggested that distillation, along with fine-tuning, can improve the performance of open-source LLMs for patient trial matching, approaching that of GPT-436. As opposed to Nievas et al.36, we used an open-source model to generate synthetic data, generated our data with MIMIC-III notes, and fine-tuned with QLoRA37. The fine-tuned models were evaluated against both the data used to create the synthetic question-answer pairs (MIMIC-III) and external data. Additionally, it is critical to use open-source models, even as a teacher: deploying a model fine-tuned on GPT-4 outputs is likely against OpenAI’s terms of service38, as this would be deemed competing with OpenAI. As a whole, these developments show promise for the capacity of LLMs to aid in clinical information extraction for patient-trial matching.

Related work has also explored context distillation39 and the inclusion of intermediate reasoning steps and rationales40,41. For example, Huang et al.40 conducted ablation studies to show the effect of fine-tuning on reasoning for self-improvement. Hsieh et al.41 extracted chain-of-thought rationales and labels, which they used to fine-tune smaller T5 models. It can be informative to consider the impact of including model-generated rationales as well as other subsets of synthetic data. Examining model performance in answering single-order questions (e.g., what was the patient’s highest creatinine value) compared to questions requiring multiple steps (e.g., does this patient fit this trial’s eligibility criteria?) could provide additional insights.

In this work, we demonstrate the ability to perform synthetic data distillation for scalable clinical note annotation, using a large open-source model to generate realistic questions based on patient clinical records, which are then used to train a smaller model that performs inference. Additionally, we perform an ablation study to understand which types of synthetic data yield optimal performance and the tradeoff between model size and performance. We conduct comprehensive evaluations against multiple datasets. This is critical because we observe it is substantially easier to achieve strong performance against synthetic data with manual review as opposed to fully human-generated evaluations. Alongside the work, we release source code providing a framework for clinically meaningful information extraction and synthetic data generation (https://github.com/bbj-lab/clinical-synthetic-data-distil), as well as an annotation tool designed to speed up manual review, particularly when LLM-predicted annotations are already available (https://github.com/bbj-lab/annotation-ui). We are also releasing two newly manually annotated datasets to PhysioNet, which will be available via the same data use agreement as MIMIC-III/IV: (1) Annotated Synthetic Trial Criteria Questions: 1000 synthetic questions generated by the 70B model and subsequently human-reviewed, and (2) Apixaban Trial Criteria Questions: 2300 questions based on trial criteria from the ARISTOTLE apixaban clinical trial42,43.

Results

The process of knowledge distillation by generating synthetic question and answer pairs using a large model (Llama 3.1 70B-Instruct) to teach a smaller model (e.g., Llama 3.1 8B-Instruct, Llama 3.2 3B-Instruct, Llama 3.2 1B-Instruct) is described in Fig. 1. We then evaluated the distillation process on three distinct tasks: (1) a set of 1000 synthetic trial criteria questions that we manually reviewed (Table 2), (2) real-world data from the i2b2 n2c2 clinical trial cohort challenge44 (Fig. 2, Table 3), and (3) a set of 2300 questions derived from the MIMIC real-world dataset to emulate the eligibility criteria of the ARISTOTLE apixaban clinical trial42,43 (Table 4).

Fig. 1: Synthetic distillation training workflow.

MIMIC-III records, outlined in green, are provided to the 70B-parameter Llama-3.1 model, which in turn generates the elements outlined in blue. After post-processing, the elements outlined in purple are provided to the 8B-parameter Llama-3.1 model (or 3B- or 1B-parameter models) for fine-tuning the “All” version of the model.

Fig. 2: Comparison of model performance for the i2b2 (n2c2) Clinical Trial Eligibility Challenge.

Evaluation includes the Training Set (a, c) because these data were not included during any of the pre-processing, hyperparameter selection, or fine-tuning of the models. All evaluations are zero-shot, but performance on the Training set (a, c) is separated from the Test set (b, d) for clarity. (70B and 8B are Llama-3.1; 3B and 1B are Llama-3.2).

The knowledge distillation process worked by passing a discharge summary to Llama 3.1 70B-Instruct along with prompt instructions (Supplementary Table 1) to create questions meeting specific criteria (e.g., yes/no questions, numeric questions, or questions that cannot be answered based on the content of the note). In addition to questions, the model was tasked with providing the section of the discharge summary in which the answer could be found (e.g., Pertinent Results), the source (the exact text that allowed the model to answer the question), and an explanation of why the answer was correct based on the source and the rest of the note. The model was also tasked with estimating the difficulty of the question it created (Supplementary Table 2).

Next, these questions were filtered depending on which model was being fine-tuned (Table 1). For example, 8B-All includes all of the generated synthetic question and answer pairs (as do 3B-All and 1B-All), 8B-H-25k includes only the 25,000 questions within each category that the 70B-Instruct model ranked hardest, 8B-NB-Only includes the 25,000 hardest numeric and boolean (yes/no) questions, and 8B-No-S includes the 25,000 hardest questions of each type but is not fine-tuned on any of the supporting information (namely the explanation, the section the model believed the answer was in when generating the question, or the source, i.e., the exact text that allowed the model to answer the question). Next, QLoRA fine-tuning (detailed in Methods) was performed for each of these data subsets, resulting in six fine-tuned models (8B-All, 8B-H-25k, 8B-NB-Only, 8B-No-S, 3B-All, and 1B-All) in addition to the four instruct models open-sourced by Meta (70B-Instruct, 8B-Instruct, 3B-Instruct, and 1B-Instruct) (Table 1).

Table 1 Comparison of the different models that were compared throughout the clinical information extraction tasks

Each model was evaluated on three tasks: (i) annotated synthetic trial criteria questions, (ii) the i2b2 Clinical Trial Eligibility Criteria Cohort Selection shared task from the 2018 National NLP Clinical Challenges, and (iii) apixaban trial criteria. We report performance metrics including Balanced Accuracy, which averages sensitivity and specificity and is therefore suitable for imbalanced datasets, and the Micro-F1 score. Micro-F1 was the primary metric used to judge the i2b2 challenge, which permits direct comparison between our results and challenge entries (for the test set).
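Both metrics can be computed directly with scikit-learn; the snippet below is a minimal illustration using made-up labels rather than our evaluation data.

```python
# Minimal sketch of the two reported metrics using scikit-learn
# (illustrative labels only, not the actual evaluation data).
from sklearn.metrics import balanced_accuracy_score, f1_score

y_true = ["yes", "no", "yes", "na", "no", "yes"]
y_pred = ["yes", "no", "na", "na", "yes", "yes"]

# Balanced accuracy: mean of per-class recall, robust to class imbalance.
bal_acc = balanced_accuracy_score(y_true, y_pred)

# Micro-F1: F1 computed over pooled counts across classes;
# this was the primary metric of the original i2b2 challenge.
micro_f1 = f1_score(y_true, y_pred, average="micro")

print(f"Balanced accuracy: {bal_acc:.2f}, Micro-F1: {micro_f1:.2f}")
```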

Synthetic Data Evaluation

We evaluated model performance on a manually annotated subset of 1000 generated examples from the hold-out test set described in the Methods datasets subsection (Table 2). The 8B-All model achieves the best overall accuracy (89.30%), outperforming even the 70B-Instruct model used for creating the synthetic data (76.20%). This was especially visible in the “NA” categories, where there appears to be a strong impact of training models explicitly on questions that cannot be answered based on the context (note) provided. Within each category, 8B-All and 8B-H-25k improved over 8B-Instruct, reflecting the impact of fine-tuning. 8B-H-25k also outperformed 70B-Instruct overall, suggesting that while the model benefits from fine-tuning on the full dataset, a relatively small dataset of the hardest 25k examples per question type can still provide an appreciable benefit. Unsurprisingly, the 8B-NB-Only model, which was not fine-tuned on any “NA” data, struggles in both of the NA columns, but it performs very well on questions of numeric and boolean type and is actually the top performer for numeric questions. When comparing the 8B-All, 3B-All, and 1B-All models, we find a general tradeoff between model size and performance, with the notable exception of the ability of the 3B-All and 1B-All models to identify questions that they could not answer (the NA-type questions).

Table 2 Model Accuracy on a subset of manually annotated Synthetic Labels (70B)

i2b2 clinical trial eligibility challenge evaluation

We next evaluated the performance of all base and fine-tuned models on the i2b2 2018 Clinical Trial Eligibility Challenge (Fig. 2). Because we did not train on or otherwise use these data in our fine-tuning process, we were able to assess the performance of models across both the train and test sets from the original i2b2 challenge.

We evaluated two values each of two sampling parameters, temperature and top_p (see Methods). We hypothesized that sampling strategies (i.e., higher temperature) might help force the model to provide an answer that aligned well with its explanation. However, we observed that temperature did not have a large impact, and a temperature of 0 slightly outperformed higher temperatures (Supplementary Table 3). The 70B-Instruct model performed best on both train and test data. The two fine-tuned models that included all question types and supporting information (8B-All and 8B-H-25k) outperformed the base 8B-Instruct model. The fine-tuned models that either did not include all types (8B-NB-Only) or did not include supporting information (8B-No-S) performed worse than the base 8B-Instruct model. When comparing the 8B-All, 3B-All, and 1B-All models, we find that performance decreases as model size decreases. This held on both the training and test folds, for both balanced accuracy and micro-F1 score.

An interesting trend we observed throughout this work was the need to decompose criteria, and thus the prompts provided to the models, into questions that required only single-order answers. This was illustrated when comparing the performance of both the base and fine-tuned models at either (a) directly answering a prompt question for a given criterion (i.e., a direct boolean “yes” or “no”) or (b) extracting the numeric value relevant to the criterion and then performing post-processing to arrive at a boolean “yes” or “no” answer (Table 3). Within the i2b2 n2c2 challenge, two questions asked whether labs were abnormal (serum creatinine and hemoglobin levels). Across all models, numeric extraction followed by post-processing achieved higher performance than asking the model to answer the question directly.

Table 3 Comparison between directly answering clinical trial criteria about laboratory value ranges vs. extracting a number and applying rules-based post processing to determine whether to answer “yes” or “no” (i.e., ask the model to return a number, if that number is above a range answer yes, otherwise answer no)
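The rules-based post-processing in Table 3 amounts to extracting a number and comparing it against a reference limit; a minimal sketch is below, with the threshold and handling of missing values chosen for illustration rather than taken from the study.

```python
# Minimal sketch of rules-based post-processing applied to a numeric
# extraction (threshold is illustrative, not the study's exact criterion).
from typing import Optional

def creatinine_abnormal(extracted_value: Optional[float],
                        upper_limit: float = 1.5) -> str:
    """Convert an extracted serum creatinine value into a yes/no answer."""
    if extracted_value is None:
        return "na"          # the note did not contain a value
    return "yes" if extracted_value > upper_limit else "no"

print(creatinine_abnormal(2.1))   # -> "yes"
print(creatinine_abnormal(0.9))   # -> "no"
print(creatinine_abnormal(None))  # -> "na"
```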

Trial Criteria Evaluation

As the third evaluation task, we compared the performance of the base and fine-tuned models using manual annotations for 23 questions resembling eligibility criteria from the apixaban clinical trial, applied to a random sample of 100 patient notes from MIMIC-IV (Table 4). The fine-tuned 8B-All model achieved high performance, exceeding Balanced Accuracy and Micro-F1 of 0.8 across all criteria assessed, with an overall average Balanced Accuracy of 0.93 and Micro-F1 of 0.94. This fine-tuned model outperformed the 8B-Instruct (Balanced Accuracy = 0.84, Micro-F1 = 0.86) and even the 70B-Instruct model (Balanced Accuracy = 0.89, Micro-F1 = 0.92). The model fine-tuned on the 25,000 most difficult questions, 8B-H-25k, achieved similarly high performance across criteria (average Balanced Accuracy = 0.95, Micro-F1 = 0.94), suggesting that either fewer total questions may be needed for fine-tuning, or that more difficult questions offer greater value in fine-tuning. Average performance (both balanced accuracy and micro-F1) decreased monotonically with model size across the 8B-All, 3B-All, and 1B-All models. For each model size, there was a sizeable performance improvement from fine-tuning (comparing the Instruct and All versions at each size).

Table 4 Performance on clinical trial eligibility criteria for apixaban

There were some criteria where the base 8B-Instruct model had relatively lower performance, including extraction of aspartate aminotransferase (AST) (Balanced Accuracy = 0.54, Micro-F1 = 0.54), blood glucose (Balanced Accuracy = 0.25, Micro-F1 = 0.25), and left ventricular ejection fraction (Balanced Accuracy = 0.72, Micro-F1 = 0.72). The larger 70B-Instruct model dramatically improved performance for these criteria, exceeding Balanced Accuracy and Micro-F1 of 0.84. The fine-tuned models 8B-All and 8B-H-25k performed comparably to the 70B model, and in some cases outperformed it. For the AST criterion, all three models achieved Balanced Accuracy and Micro-F1 scores of 0.94 and above. For blood glucose, the fine-tuned models 8B-All (Balanced Accuracy = 0.98, Micro-F1 = 0.98) and 8B-H-25k (Balanced Accuracy = 0.94, Micro-F1 = 0.94) achieved higher performance than the 70B-Instruct model (Balanced Accuracy = 0.84, Micro-F1 = 0.84). For identification of hemorrhagic tendencies, the model fine-tuned on the 25k most difficult questions (Balanced Accuracy = 0.96, Micro-F1 = 0.92) matched the performance of both the 8B-All (Balanced Accuracy = 0.96, Micro-F1 = 0.92) and 70B-Instruct (Balanced Accuracy = 0.96, Micro-F1 = 0.92) models.

For some criteria, the 70B-Instruct model did not perform as well as the 8B models, including the base 8B-Instruct model. This was the case when detecting the presence of atrial fibrillation (8B-Instruct: Balanced Accuracy = 0.98, Micro-F1 = 0.97; 70B-Instruct: Balanced Accuracy = 0.65, Micro-F1 = 0.84) and whether there was planned/past ablation for atrial fibrillation (8B-Instruct: Balanced Accuracy = 0.89, Micro-F1 = 0.98; 70B-Instruct: Balanced Accuracy = 0.65, Micro-F1 = 0.94). There were also some criteria, including creatinine and platelets, where the models did not perform as well as on other criteria: no model exceeded 0.85 for either balanced accuracy or micro-F1. Of the manually annotated notes, 60% did not have a numeric value for platelet count available in the note, while only 3% did not have a serum creatinine value available (Supplementary Table 4). This rate may be at least in part due to the fact that the de-identification process for MIMIC-III seemed to accidentally redact some platelet values. During the manual annotation process we did not observe this occurring with other laboratory values.

Resource requirements

Data distillation allowed the models to be run with vastly reduced resource requirements compared to the 70B-Instruct model. All model evaluation was done on the Center for Research Informatics’ “Randi” cluster at the University of Chicago. The cluster’s GPU nodes each contain 8 Nvidia A100 GPUs and two 16-core 3.0-GHz AMD Milan processors. We monitored seconds/example, tokens in/second, and tokens out/second for both the 8B-parameter and 70B-parameter architectures and report these in Fig. 3. These differences could translate into meaningful cost savings. For example, performing a study of the apixaban criteria (23 questions) for 10,000 patients to identify a cohort on the least expensive cloud provider would be $3132 less expensive with the 8B model than with the 70B model (see Supplementary Table 5 for a comparison of current rates among the main providers). In this example, running the 8B-parameter model would cost less than $1000 (0.535 sec./ex. * 230k ex. * 1/3600 hr./sec. * $27.2/hr. = $929), while the 70B-parameter model would cost over $4000 (2.34 sec./ex. * 230k ex. * 1/3600 hr./sec. * $27.2/hr. = $4066).
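As a check on this arithmetic, the cost estimate can be reproduced in a few lines; the $27.2/hr node rate and per-example timings are the figures quoted above, and the helper function itself is purely illustrative.

```python
# Reproduces the cost arithmetic above for screening 10,000 patients against
# 23 criteria (230k criterion-level questions). Hourly node rate and
# per-example timings are the figures quoted in the text.
def inference_cost(sec_per_example: float, n_examples: int,
                   usd_per_hour: float = 27.2) -> float:
    hours = sec_per_example * n_examples / 3600
    return hours * usd_per_hour

n_examples = 23 * 10_000
print(f"8B model:  ${inference_cost(0.535, n_examples):,.0f}")  # ~$930
print(f"70B model: ${inference_cost(2.34, n_examples):,.0f}")   # ~$4,066
```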

Fig. 3: Comparison of inference speed across model sizes and evaluation tasks.

a illustrates the average number of seconds needed to process an example for each dataset and model, b shows the average number of tokens read or ingested per second, and c depicts the average number of tokens generated per second. When comparing the center and right panels, note that token generation tends to be more time-consuming than token ingestion.

Discussion

In this study, we present an approach to improve the scalability of open-source LLMs for clinical information extraction using synthetic data distillation. We used the larger Llama-3.1-70B-Instruct to generate synthetic data, consisting of question-answer pairs with supporting information and difficulty scores. These were used to fine-tune smaller models: Llama-3.1-8B-Instruct, Llama-3.2-3B-Instruct, and Llama-3.2-1B-Instruct. We found a general tradeoff between the size and performance of the fine-tuned models. We also explored the impact of fine-tuning on different amounts and subsets of synthetic data (one model fine-tuned with all data, one fine-tuned with only the hardest 25k questions, one fine-tuned without questions where the note does not contain the answer (NA), and one fine-tuned without any supporting information). We observe that the inclusion of NA questions and supporting information was critical to the high performance of fine-tuned models, especially when applied to fully human-generated evaluations as opposed to synthetic data with human review. When evaluating the accuracy of these models on manually annotated synthetic data, we found that the model fine-tuned on all synthetic data (8B-All) achieved a high overall accuracy that exceeded that of a larger base model (70B-Instruct). These fine-tuned models also performed well across different clinical tasks, including the i2b2 Clinical Trial Eligibility Challenge and a dataset designed to resemble real eligibility criteria from the apixaban clinical trial. The fine-tuned models can achieve performance comparable to, and in some cases exceeding, that of the larger model that served as the teacher. Even when fine-tuning is performed using only a subset of the hardest questions in the synthetic dataset, performance still improves over the base models, suggesting that targeted fine-tuning with less data can still be beneficial. Finally, we release several artifacts we believe will be beneficial to researchers further developing approaches for clinical information extraction: (a) source code, comprising both the framework for synthetic data generation and model fine-tuning for clinical information extraction and the annotation tool that allowed for faster manual review of LLM pre-annotated notes, and (b) two manually annotated datasets (Annotated Synthetic Trial Criteria Questions and Apixaban Trial Criteria Questions) that will allow researchers to evaluate future methods for clinical information extraction.

The use of LLMs to extract information from clinical notes has already demonstrated the potential to improve upon traditional approaches relying on rule-based methods or extensive manual annotation. While proprietary models, such as GPT-3 and GPT-4, have shown strong performance for this purpose, their deployment in healthcare settings can be limited by computational costs and licensing barriers35. Our findings align with recent research suggesting that fine-tuning open-source models with synthetic data can improve their performance across clinical information extraction tasks, bringing it closer to that of proprietary models36. By generating synthetic data, this approach also reduces reliance on manually labeled data. Our findings also align with those of concurrent work exploring the utility of synthetic data distillation using Llama-3.1-405B-Instruct as a teacher model24. The use of larger teacher models such as the 405B-Instruct poses a challenge in terms of computational requirements, particularly in academic or resource-constrained settings. We find that the smaller 70B-Instruct can successfully be used as a teacher model for distillation. The use of smaller, open-source models that perform well opens the door for broader adoption of LLMs in healthcare settings in a cost-effective and privacy-compliant manner. The reduced computational requirements of these smaller models can make them more accessible to hospitals with limited healthcare IT infrastructure. Another advantage of this approach is its adaptability, as smaller models can be better tailored to the specific needs of individual hospitals. As resource-efficient LLMs continue to evolve, this approach enables rapid (e.g., <12 hours on 8 GPUs) finetuning45.

Our work focuses specifically on clinical trial eligibility criteria as an evaluation task because these criteria are well-defined, but our proposed framework is more broadly adaptable and could be applied to other areas including cohort identification, patient phenotyping, or feature extraction. Many observational and retrospective studies have specific inclusion criteria that currently require some form of manual chart review. A solution that works for clinical trials would also apply in these settings. The goal of this work is to bring us closer to minimizing the need for manual chart review and annotation. The finetuned models we developed in this study would not be able to finalize candidate selection on their own but could be used to screen a large number of candidates, narrowing an initial set down to a smaller pool of patients much more likely to qualify. Manual review would only need to be performed on the smaller pool, allowing medical professionals to avoid having to look at a majority of the records. Recent work has explored information extraction tasks with minimal human oversight or annotations46, but a significant amount of work remains to be done before models could be accurate enough to perform the entire screening process without human oversight.

By enabling scalable information extraction from unstructured notes, this approach presents a promising opportunity for retrospective research through its potential impact on enhancing patient phenotyping. This is particularly important when studying complex and heterogeneous patient populations, where phenotyping approaches relying solely on structured data, such as ICD codes, fall short. Better phenotyping can result in improved quality and relevance of retrospective studies.

While this has exciting potential, we also note some of the limitations and challenges identified through manual review of the synthetic data generated by Llama-3.1-70B-Instruct that may begin to elucidate failure modes for these models. In general, the model struggled with ranges when forming numeric questions. In multiple instances, a range (e.g., 60-70%) would be collapsed to one of its limits (60% or 70%) in a numerical answer. In at least one instance, the model had difficulty comparing a range and a given value outside that range (e.g., concluding that >70% precludes 50%). This contrasts with the model’s generally consistent ability to locate the highest or lowest value in a sequence of measurements (e.g., finding the highest blood pressure recorded in a note containing multiple readings). Numeric ranges of values are a known area of difficulty for model reasoning47. Hager et al. explicitly provided example lab results along with reference ranges for those labs and asked multiple LLMs to determine if the result fell below, within, or above the range; they concluded that “all LLMs performed very poorly”48. To avoid having the model reason about ranges of values, questions could be reworded to ask the model to return the patient’s measurement for a given metric, and then evaluate if that measurement falls within a certain reference range as a separate step (as in Table 3). For ranges of values that appear within notes, separate questions could be used to determine the maximum and minimum estimated values.

The model also sometimes struggled with redacted data and contextual understanding. In one instance, the model identified numbers in a redaction tag as the answer to a question. This tag would have contained the correct answer prior to redaction. The tag itself, “[**3-22**]”, contained numbers, and this may have contributed to the model’s confusion. Non-numerical redaction tokens may help to alleviate this sort of issue. Additionally, token representation should be taken into account when deciding how to include redactions, as the existing format generates several extra tokens per redaction requiring greater context size. In another example, the model successfully identified the inappropriately partially-redacted “[churgg [**Doctor Last Name **] disease” as Churg-Strauss disease. In another case, the model correctly identified a patient’s hemoglobin value but then incorrectly concluded that it fell below the normal range. This conclusion would have been correct had the patient been male; however, the patient was female, and the reference range is lower for females. In another case, the model asked if a female over 70 was “a candidate for future pregnancy?” Interestingly, the model was also able to identify and parse a fishbone diagram within a note, correctly answering questions about lab values contained within the diagram.

The model also sometimes lacked creativity when generating questions with unspecified answers. To generate questions that could not be answered using the contents of a note, the model seemed to commonly inquire about BMI (height and weight measurements are recorded separately from these notes and so are often not contained in the text) and the results from a 6-minute walking test (6MWT). In the full test set containing 42,498 instances, we found 997 questions related to BMI (98.7% of which resolved n/a) and 666 questions related to a 6-minute walk test (all of which resolved n/a). The model would also ask about measurements from a patient prior to them seeking medical attention, which are typically unavailable in these notes. Additionally, the models would sometimes struggle with repetitive generation. In the test set, we found 1,676 (3.94%) questions containing “creatinine”. Admittedly, our prompt for numeric type questions included an example “What was the patient’s highest creatinine measurement recorded in the note?” However, a majority (1151) of these questions were of na-numeric, boolean, or na-boolean type, and none of those prompts mention creatinine. Future work could potentially introduce a mechanism to deprioritize questions that are highly repetitive or likely to yield non-informative answers. With the development of models that support longer context windows, it is possible to keep track of previously generated question-answer pairs and use this to avoid redundancy during question generation. Retrieval-augmented generation (RAG), which combines LLMs with knowledge from external sources49,50, could also be incorporated into the question generation process. For example, it could be used to retrieve data (including lab values, medications, and diagnoses) and use this information to generate question-answer pairs more relevant to the recorded information. This could reduce the risk of generating unanswerable questions like those about BMI when height and weight are missing. RAG could also use information from other external sources including notes and question-answer pairs from other similar patients or clinical knowledge bases to help guide toward more diverse and contextually appropriate question generation.

We found that carefully worded prompts could help to avoid some of the incorrect model outputs described in the previous section. By rewording questions, we could deter the model from drawing inferences and obtain less ambiguous question-answer pairs. For questions that asked if a patient had a history of X, where X was not mentioned in the note, the model would sometimes conclude that a patient did not have a history of X because X was not mentioned in the note, and other times conclude that the question could not be definitively answered from the contents of the note. This ambiguity could be resolved by modifying the question to ask if a patient’s history of X could be found in the note. This is especially critical because it allows us to use a combination of clinical expertise and post-processing to knowingly make assumptions where appropriate about whether X would have been in the note if they had it, as opposed to the model making this assumption for us without our knowledge. We observed in multiple evaluations that performance is substantially higher when asking the model to answer single-order questions (e.g., what was the patient’s highest creatinine value?) as opposed to questions which require multiple steps (e.g., does this patient fit this trial’s eligibility criteria?). In future work, we will determine if chain-of-thought prompting51 or multi-step inference52 can circumvent the need for manual postprocessing.

Developing resource-efficient LLMs to extract relevant information from clinical notes is a rapidly advancing discipline with many open questions. For example, there may be better ways to make the distillation process more data-efficient. In this work, we showed how fine-tuning on only a fraction of the synthetic dataset (e.g., 8B-H-25k) still appreciably enhances the base 8B-Instruct model. Different criteria for selecting a subset of the fine-tuning data may better maintain performance while decreasing data requirements53,54. Ordering the fine-tuning set by increasing difficulty and interleaving question types may also help55.

Future work could consider whether a metric besides micro-F1 could better characterize good performance. We used micro-F1 in part because it benchmarked the original i2b2 challenge. However, some researchers view patient-clinical trial matching as a ranking problem and consequently report metrics like normalized discounted cumulative gain at k and precision at k35,36. We could also consider the optimal way to handle ambiguity in notes. Unlike tabular or structured data that typically complies with a strict format, notes often include estimates and conjectures, especially when discussing medical history. There are often question marks next to past diagnoses and values reported. Another potentially interesting extension of this work could look into how data from multiple notes could be combined. Many people have a medical history spanning decades. For selection criteria involving disease progression or patient history, multiple notes may be required to obtain a complete answer. Combining records in a time-aware manner remains an open problem.

In this study, we use synthetic data for its capacity to enhance datasets for distillation. We also note additional considerations that come with synthetic data use. While synthetic data is often used as a more privacy-protecting alternative to real data, it is important to consider how synthetic datasets are generated and their regulatory compliance56. For example, the European Union’s General Data Protection Regulation (GDPR) requires that generated data cannot be used to re-identify any individuals57. An additional consideration for synthetic data use is the potential for IP contamination58. By using open-source models in this study, we minimize the risk of IP contamination compared to alternatives using proprietary models.

It is also important to note the potential issues of representativeness and biases when using synthetic data. Representational biases introduced through the synthetic generation process have the potential to be exacerbated if the synthetic data does not accurately represent the patient population59. We use real-world data from the i2b2 n2c2 2018 challenge44, which was derived from Partners HealthCare (now Mass General Brigham), and from MIMIC, which was derived from Beth Israel Deaconess Medical Center. Evaluation in real-world data may avoid some of the potential bias exacerbation concern from synthetic data, but these datasets are both from health systems in Boston and may lack a population with sufficient representation to ensure generalization. Unfortunately, the barriers to releasing clinical notes in public or gated systems limit the datasets researchers have access to. In future studies, we aim to perform evaluation in additional real-world datasets from diverse health systems to work toward better generalizability and portability. We also aim to explore additional bias mitigation strategies that can intervene at various stages of the LLM workflow60. For example, prior knowledge distillation work has used data filtering and reweighting to produce more equitable teacher outputs61,62,63. The teacher’s predicted token probabilities can be reweighted before being passed to the student model to reduce the inheritance or amplification of bias in the student model. Data augmentation techniques, such as data balancing64,65,66, selective replacement67,68, or interpolation69,70 can also mitigate bias through the addition of training examples that may otherwise be underrepresented. RAG has also been explored for its potential to address biases in generative AI for health care through retrieving more inclusive or population-specific information (for example, gender-based reference ranges) to help generate more representative outputs71.

Synthetic data distillation and fine-tuning of smaller, open-source LLMs that can be locally deployed within existing healthcare IT infrastructures can serve as a scalable alternative to more resource-intensive, proprietary models for clinical information extraction. The ability for scalable extraction of information from unstructured clinical notes allows for broader adoption in diverse healthcare system settings, with the potential to strengthen retrospective research by enabling more precise and accurate phenotyping. This work contributes to efforts to support the effective and practical integration of LLMs in healthcare settings, with the ultimate goal of supporting medical research to improve patient outcomes.

Methods

In this section, we describe our knowledge distillation process which uses a large model, Llama-3.1-70B-Instruct, to generate training examples for the smaller model, Llama-3.1-8B-Instruct (or Llama-3.2-3B-Instruct or Llama-3.2-1B-Instruct; Fig. 1). We chose the Llama family of models over other open-source alternatives due to both their benchmarked performance metrics72 and the extent to which they have been integrated into software frameworks for finetuning and inference. Additionally, the QLoRA finetuning method37 that we describe in subsection 4 was originally tested with the Llama family of models.

Synthetic data generation

For each patient record, we used Llama-3.1-70B-Instruct72 to generate multiple patient note-specific questions, similar to clinical trial eligibility criteria, of a given type (Supplementary Table 1). We prompted the model to supply its answers in JSON format. Each JSON object includes the following: (1) the question; (2) the question type; (3) the answer; (4) the section of the note containing the answer (e.g., Past Medical History, Plan, etc.); (5) the verbatim source of the answer from the clinical note; (6) a difficulty level for the question on a scale of 1-10; and (7) an explanation justifying the answer choice, including how the source helped to answer the question.

We included the following question types: “boolean” (answered “Yes” or “No”), “numeric”, “na-boolean”, and “na-numeric”, where the “na” types corresponded to questions that could not be answered from the information in the note but seemed applicable to the patient and resembled clinical trial eligibility criteria. For “na” type questions, we stipulated that the section be “Not Found” and the source “Not in Note.” The purpose of the “na” types, as well as the supporting data, was to teach the model not to provide seemingly confident answers (i.e., hallucinations) when there is insufficient evidence in the note to draw a conclusion. We provide example questions of each type (Supplementary Table 6) as well as a specific example supplied in the prompt to demonstrate the language used to generate each question type (Supplementary Table 1).
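To make the schema concrete, a single generated record might look like the following; the key names and values here are invented for illustration and are not drawn from MIMIC or from the exact prompt schema.

```python
# Invented example of one synthetic question-answer record in the JSON
# structure described above (key names and values are illustrative only).
example_record = {
    "question": "What was the patient's highest creatinine measurement recorded in the note?",
    "type": "numeric",
    "answer": 2.1,
    "section": "Pertinent Results",
    "source": "Creatinine trended from 1.4 on admission to a peak of 2.1.",
    "difficulty": 4,
    "explanation": "The Pertinent Results section lists serial creatinine values; "
                   "the highest reported value is 2.1.",
}
```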

We generated 212,132 boolean question and answer (Q&A) pairings, 209,637 numeric Q&A pairings, 106,288 “na-boolean” Q&A pairings, and 106,245 “na-numeric” Q&A pairings. The number of questions arose from running the synthetic data generation process on 10,000 discharge summaries, where the model was asked to generate 20 boolean questions (10 with yes as the answer and 10 with no), 20 numeric questions, and 10 of each “na” category. The model tended to provide slightly more than the requested number of questions per note. The number of questions of each type per difficulty score assigned by Llama-3.1-70B are described in the supplement (Supplementary Table 2).

Data programming

For each question type, we select the 25,000 most difficult questions according to the LLM-estimated difficulty rating and randomly split them into a training and test set at a 90–10% ratio. We perform post-processing to extract our requests from the JSON response and handle malformed JSON outputs. The datasets are randomly shuffled prior to fine-tuning.
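A minimal sketch of this step is shown below, assuming the generated questions sit in a pandas DataFrame with illustrative column names ("type", "difficulty"); the exact implementation in the released code may differ.

```python
# Sketch of the data-programming step: keep the 25,000 hardest questions per
# question type, split 90/10 into train/test, and shuffle the training set.
# Column names are assumptions for illustration.
import pandas as pd

def select_and_split(df: pd.DataFrame, n_hardest: int = 25_000, seed: int = 0):
    hardest = pd.concat(
        grp.nlargest(n_hardest, "difficulty") for _, grp in df.groupby("type")
    )
    test = hardest.sample(frac=0.10, random_state=seed)
    train = hardest.drop(test.index).sample(frac=1.0, random_state=seed)  # shuffle
    return train, test
```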

Limited human review

To ensure data quality for the fine-tuning process, we manually reviewed a random sample containing 1000 questions generated by Llama-3.1-70B-Instruct. For this purpose, we developed an open-source tool that facilitates record review from within a web browser (https://github.com/bbj-lab/annotation-ui). Users with minimal technical experience can check patient records against the generated question-answer pairs and refine answers if needed. Statistics about the number of questions that required refinement are available in the supplement (Supplementary Table 1). Manual review allowed us to both profile the accuracy of the synthetic data generation process and to better understand common failure modes.

QLoRA fine-tuning

After data programming and limited human review, we used the refined synthetic dataset to perform supervised fine-tuning on an instance of Llama-3.1-8B72. Specifically, we fine-tuned with QLoRA37, a quantized version of Low-Rank Adaptation (LoRA)73. LoRA fine-tunes the attention weights in a pre-trained transformer with a low-rank update (a d×k matrix BA, where B is d×r, A is r×k, and r ≪ min{d, k}) that significantly reduces the number of trainable parameters and does not add to inference latency. QLoRA operates on a quantized transformer, i.e., one that uses 4-bit rather than 16-bit parameters, to further reduce memory requirements, and uses paged optimizers that manage the exchange of memory between GPU and CPU.
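For orientation, a QLoRA setup of this kind can be sketched with the Hugging Face transformers and peft libraries; the model identifier, rank, and other hyperparameters below are placeholders rather than the settings used in this study.

```python
# Rough sketch of 4-bit QLoRA fine-tuning setup with transformers + peft.
# Hyperparameters, target modules, and the model id are placeholders.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

model_id = "meta-llama/Llama-3.1-8B-Instruct"  # assumed Hugging Face identifier

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                      # quantized base weights
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModelForCausalLM.from_pretrained(model_id, quantization_config=bnb_config)
tokenizer = AutoTokenizer.from_pretrained(model_id)

lora_config = LoraConfig(
    r=16,                                   # low-rank dimension r << min(d, k)
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # attention weights
    task_type="CAUSAL_LM",
)

model = get_peft_model(model, lora_config)  # only adapter weights are trainable
model.print_trainable_parameters()
```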

Inference - Sampling hyperparameter selection

During generation, we tested different values of temperature and top_p (specifically temperatures of 0 and 1 and top_p of 0.5 and 0.95). Temperature controls the randomness of sampling, with higher temperatures corresponding to more novelty in generated output. However, increasing temperature may also make text less coherent and hallucinations more likely. Consequently, higher values of temperature are often used for creative tasks, while lower values are used for dialoguing about matters of fact. Chang et al.74 hypothesized that lower values of temperature may be better suited to question-answering with attribution. However, Renze and Guven’s recent work75 indicates that LLM problem-solving performance does not significantly vary for temperature values between 0 and 1. The top_p parameter controls nucleus sampling76, with higher values corresponding to a more permissive threshold for filtering.

Setting temperature = 0 and top_p = 1 results in a nearly deterministic, greedy sampling strategy that aims to select the most likely token given the current context. Setting temperature = 1 and top_p = 0.5 restricts tokens to a likely subset of the token set, but otherwise samples according to the predicted odds. We limited this parameter evaluation to the i2b2 n2c2 challenge and report the full results for all parameter settings (Supplementary Table 3). Because we did not see a benefit from increasing the temperature, we fixed temperature = 0 and top_p = 1 for all other evaluations.
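Expressed as generation parameters, the two configurations compared here look roughly as follows; this is an engine-agnostic sketch, and the serving stack and exact parameter names used in the study may differ.

```python
# The two decoding configurations compared in this work, written as
# Hugging Face-style generation kwargs (sketch only).

# Near-deterministic greedy decoding (temperature = 0, top_p = 1):
greedy_kwargs = {"do_sample": False}  # temperature/top_p are effectively ignored

# Nucleus sampling restricted to a likely token subset (temperature = 1, top_p = 0.5):
nucleus_kwargs = {"do_sample": True, "temperature": 1.0, "top_p": 0.5}

# Example usage (model and tokenized inputs assumed to exist):
# outputs = model.generate(**inputs, max_new_tokens=256, **nucleus_kwargs)
```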

Versions of Fine-tuned Models

The following models resulted from fine-tuning Llama-3.1-8B-Instruct (released by Meta) as the base model (ablation study on fine-tuning data selection; Table 1):

a. All (Labeled 8B-All). Fine-tuning was performed with the complete dataset, using question, answer, question type, section, source, and explanation as described in the methods.

b. Hardest (Labeled 8B-H-25k). To determine the performance impact of reducing training set size, we selected the 25,000 questions the model determined had the highest difficulty in Step #1 (Synthetic Data Generation) for each question type. This subset of the original training data was then used to fine-tune the model.

c. Hardest Boolean and Numeric (Labeled 8B-NB-Only). To determine the impact that n/a questions have on model fine-tuning, we selected the most difficult 25,000 questions for only the boolean and numeric types (dropping “na-boolean” and “na-numeric”) from the original dataset for fine-tuning.

d. No Support (Labeled 8B-No-S). To determine the usefulness of including textual references and an explanation of the correct answer, we dropped the section, source, and explanation from the original training set and fine-tuned the model with this data.

We also fine-tuned smaller versions of Llama-3.2 released by Meta:

a. All (Labeled 3B-All). Finetuning was performed using the complete dataset (as with 8B-All), except with Llama-3.2-3B-Instruct as the base model.

b. All (Labeled 1B-All). Finetuning was performed using the complete dataset (as with 8B-All), except with Llama-3.2-1B-Instruct as the base model.

Model evaluation

We took the models finetuned on synthetic training data and evaluated them on three separate datasets, including one synthetic dataset and two real-world datasets, as follows:

First, we evaluate methods on a held-out set of 42,498 synthetic examples generated in an identical manner to the dataset used for fine-tuning. The breakdown of examples by type was as follows: 10,722 (25.2%) boolean, 10,666 (25.1%) numeric, 10,664 (25.1%) na-boolean, and 10,446 (24.6%) na-numeric. From this set, we drew a random sample containing 1000 examples and manually annotated it as described in the “Limited Human Review” subsection of our methods, correcting questions, answers, and explanations when necessary. We calculated the accuracy for these questions, as given in the results section. We provided a summary of this dataset (Supplementary Table 1) and have released a copy of it on PhysioNet. Results for this dataset allow us to evaluate the extent to which the fine-tuning objective was successfully optimized. The next two subsections describe tests on real-world data.

Next, we evaluate methods on the clinical trial eligibility criteria cohort selection shared task from the i2b2 2018 National NLP Clinical Challenges (n2c2)44. Track 1 contains 288 de-identified longitudinal medical records for patients with diabetes, many of whom are at risk for heart disease. The records are manually annotated according to 13 selection criteria adapted from real clinical trials and split into a 202-patient training set and an 86-patient test set. We calculated balanced accuracy and micro-F1 score on both the training and test datasets corresponding to the original challenge. At the time of the challenge, the top-performing team adopted a rule-based method to obtain a micro-F1 score of 0.91 on the test set. Other teams achieved similar results (F1 > 0.9) with hybrid approaches; for example, cTAKES77 was used by 3 of the top 5 teams to extract knowledge from the text. Because we only use this dataset to test zero-shot extraction and do not train on it, we are able to evaluate model performance on both the training and test sets to have a larger sample size.

We also evaluate methods on clinical trial eligibility criteria resembling those of the 2011 ARISTOTLE clinical trial comparing apixaban to warfarin42. We developed 23 human-generated boolean and numeric questions assessing these criteria (Supplementary Table 4). Using these questions, we manually annotated notes for 2300 total question-answer pairs within MIMIC-IV78,79. Notes from MIMIC-IV were taken from after 2012 to ensure no overlap with any of the notes from MIMIC-III, which were used to generate synthetic data. We evaluated the models on these question-answer pairs and calculated both balanced accuracy and micro-F1 score. We are releasing the dataset and manual annotations to PhysioNet and will make them available under the same data use terms as MIMIC-III/IV.