Abstract
New Large Language Model (LLM)-based approaches to medical Question Answering show unprecedented improvements in the fluency, grammaticality, and other qualities of the generated answers. However, these systems occasionally produce coherent, topically relevant, and plausible answers that are not based on facts and may be misleading or even harmful. New types of datasets are needed to evaluate the truthfulness of generated answers and to develop reliable approaches for detecting answers that are not supported by evidence. The MedAESQA (Medical Attributable and Evidence Supported Question Answering) dataset presented in this work is designed for developing, fine-tuning, and evaluating language generation models for their ability to attribute or support the stated facts by linking the statements to the relevant passages of reliable sources. The dataset comprises 40 naturally occurring, aggregated, deidentified questions. Each question has an expert-curated answer and 30 LLM-generated answers in which each statement is linked to a scientific abstract that supports it. The dataset provides manual judgments on the accuracy of the statements and the relevance of the cited scientific papers.
Background & Summary
The unprecedented improvements in the quality of the answers to medical questions generated by AI models are enabled by datasets composed of question-answer pairs, such as MultiMedQA1. These traditional datasets were generated on the assumption that having pairs of questions and ideal answers, along with sets of relevant documents such as PubMed abstracts, is sufficient to train and test answer generation systems2. This assumption held for answers that were traditionally extracted from the relevant documents. With the advent of Large Language Models (LLMs) capable of generating answers using solely their internal representations of the training data, in so-called zero-shot settings, it became clear that coherent, grammatically perfect, and topically relevant answers are not necessarily factual. Studies on the evaluation of LLMs’ abilities to support (ground) generated statements with verifiable evidence from reliable sources have shown that the models may provide harmful answers1, perform significantly worse on lay-user generated questions, and often fail to reference relevant sources3. This can pose a risk to public health4. Unsupported statements are, therefore, a major barrier to using LLMs in any applications that may affect health. Methods for grounding generated statements in reliable sources, along with practical evaluation approaches, are needed to overcome this barrier. To support these goals, we have developed MedAESQA (Medical Attributable and Evidence Supported Question Answering), a publicly available dataset of naturally occurring health-related questions asked by the general population, paired with sets of human- and AI-generated answers. Each answer statement in the dataset is required to be supported by evidence, and the evidence and the documents containing the evidence are judged for accuracy and support. The dataset is designed to be used for developing, fine-tuning, and evaluating language generation models in several approaches that address the model’s ability to attribute or support the stated facts by linking the statements to the relevant passages of reliable sources. These approaches, Retrieval Augmented Generation (RAG)5 and retrofit attribution6, provide sources to the models to guide answer generation or to find support and post-edit the generated output, respectively. Additionally, approaches may interleave retrieval and generation tasks7. The non-medical Question-Answering datasets that were used to support attribution include the Natural Questions dataset8. The questions in this dataset consist of real anonymized aggregated queries seeking factual information using the Google search engine. The answers consist of a Wikipedia page, a bounding box on this page (effectively, a summary of the page), called the long answer, and a short answer, such as one or more named entities mentioned in the Wikipedia article, yes/no, or NULL if the page does not answer the question.
In the medical ___domain, some datasets can be adapted to train models to support attribution. For example, the BioASQ data9 contains factoid, yes/no, list, and summary questions formulated by biomedical experts. The questions are linked to sets of biomedical terms (concepts) related to the question and a set of research articles that are sufficient to answer the question. Text snippets containing one or more sentences that answer the question fully or partially are marked in the articles by the experts. The dataset is primarily focused on drug-target-disease relations for medical investigations. The MEDIQA-AnS dataset10 contains consumer-health questions, the full text from reliable web pages, extracted passages from the full text, and manually created summaries. In general, an attribution verification dataset must contain a question and at least one answer in which each statement of a fact required to answer the question is annotated and linked to a corresponding statement in an evidence source that supports or contradicts the fact stated in the answer. While the above datasets may be retrofitted to adhere to this format, to the best of our knowledge, we present the first medical question answering dataset specifically generated to test attribution to identified sources when assessing the output of natural language generation models. The distinct characteristics of the MedAESQA dataset are as follows: 1) the questions are naturally occurring popular questions submitted by the public to the National Library of Medicine; 2) the questions are annotated with the main concept of interest and with the user’s intent, e.g., to learn a fact or to support a clinical decision; 3) each question has a manually generated answer in which each sentence is linked to a PubMed abstract; 4) each question has 30 answers automatically generated by large language models. Each statement in the automatically generated answers is manually judged as required, borderline, unnecessary, or inappropriate. Each PubMed abstract provided by the models to support a specific statement is also manually judged as supporting the statement, contradicting it, being merely topically relevant (neutral), or not relevant to the answer. Finally, in each document that supports or contradicts the answer statements, a specific passage of text is annotated as evidence that supports the judgment. Figure 1 provides the workflow of the dataset creation and an example of a data entry.
Fig. 1 The schematic workflow of the MedAESQA dataset creation. The dataset creation starts with annotating the question frame ➊ and the question metadata, i.e., the topic and narrative ➋. Thereafter, a medical information expert interacts with PubMed ➌ to retrieve the relevant PubMed articles and generates the reference answer ➍ with appropriate references for each assertion in the reference answer. In the next step, two assessors formulate the nuggets ➎ from the reference answers. The question and metadata are given to the LLMs ➏; the LLMs (optionally) interact with the PubMed collection and retrieve the relevant documents ➐, which they use to generate answers with appropriate references. Once the answers are generated ➑, human assessors read the answers ➒, verify the references, and provide the multi-axis assessments ➓. The dotted arrow shows the optional interaction. Information provided by the human assessors is shown with green arrows, while machine-generated answers are shown with blue arrows.
Methods
Question Formulation
The MedAESQA questions are developed using information requests submitted by self-identified non-clinicians to the MedlinePlus11 service provided by the National Library of Medicine. We chose the forty most popular, previously unseen questions asked by MedlinePlus users. Each question also includes a topic and narrative coded by a medical informatics expert to support the efficient retrieval of relevant documents. The topic signifies the key subject (focus) of the question, whereas the narrative provides context and background information for the question. Additionally, we provide an expert-coded structured representation of the information in the question, which we call a question frame12. The frame representation resembles a predicate-argument structure in which a predicate is connected to its arguments and their semantic roles, such as THEME and AGENT13. A question frame includes a trigger for the question type, one or more THEME arguments, and optional semantic roles, all tied to their text mentions. In a question frame, the question focus is the main theme of the question, which describes the key entities of the question, and the question type signifies the aspect of interest about the question focus (the expected answer type). To characterize the disorders, user intents, body systems and anatomical structures, and clinical specialties encompassed by the collection, we also label each question with its ‘Subject Matter’, i.e., the broad medical area, such as genetics or clinical drugs; its ‘Body System’; and its ‘Specialty’, i.e., the clinical expert best suited to answer the question, such as cardiology or endocrinology. These labels show that the MedAESQA dataset contains questions that look for a variety of answers: treatment, effect, etiology, etc. In a further fine-grained analysis of the Body System labels, we found that the questions cover almost all body systems, ranging from ‘immune’ to ‘vision’. We also analyzed the Clinical Specialty of the medical specialist who would typically address or manage the issue and observed that MedAESQA questions cover an array of diverse specialties. We provide the distribution of the question frame elements (Task, Answer type, and Subject Matter) and the list of all Body Systems and Specialties of MedAESQA questions in Tables 1 and 2, respectively.
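To make the frame representation concrete, a minimal, hypothetical example is sketched below in Python; the question, field names, and values are illustrative only and do not reproduce the dataset's actual schema, which is documented in the OSF README.

```python
# Hypothetical sketch of a question frame; field names and values are
# illustrative and may differ from the actual MedAESQA schema.
question_frame_example = {
    "question": "What are the most common treatments for plantar fasciitis?",
    "type": "TREATMENT",            # expected answer type, triggered by "treatments"
    "focus": "plantar fasciitis",   # main theme / key entity of the question
    "arguments": [
        {"role": "THEME", "mention": "plantar fasciitis"},
    ],
}
```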
Expert-curated Answers
Following the work of Attal et al.14, a medical information expert utilizes the question focus and answer type to query PubMed15 and retrieve articles that could potentially provide the answer to the question. In the next step of answer formulation, the expert reads the titles and abstracts of relevant articles and formulates an answer sentence by considering one or more abstracts. For each answer sentence, the expert also includes the appropriate PMIDs to provide evidence for the assertions stated in the answer sentence. By following the aforementioned strategy, an answer (with multiple sentences) is formulated in such a way that it remains complete, accurate, coherent, and evidence-supported with appropriate PubMed identifiers (PMIDs) for each assertion.
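The expert searches PubMed interactively; purely as an illustration of how a comparable retrieval step could be scripted, the sketch below queries the public NCBI E-utilities esearch endpoint with a query built from a hypothetical question focus and answer type. This is not the curation procedure itself, only a minimal programmatic analogue.

```python
import requests

# Minimal sketch of a programmatic PubMed search via NCBI E-utilities.
# The expert curation described above used the PubMed web interface;
# this only illustrates how a similar query could be issued in code.
ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"

def search_pubmed(focus: str, answer_type: str, retmax: int = 20) -> list[str]:
    """Return PMIDs for a query built from the question focus and answer type."""
    params = {
        "db": "pubmed",
        "term": f"{focus} AND {answer_type}",  # illustrative query formulation
        "retmode": "json",
        "retmax": retmax,
    }
    response = requests.get(ESEARCH_URL, params=params, timeout=30)
    response.raise_for_status()
    return response.json()["esearchresult"]["idlist"]

# Example call (hypothetical focus and answer type):
# pmids = search_pubmed("plantar fasciitis", "treatment")
```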
Expert-curated Nuggets
Additionally, we provide manually generated information nuggets for factual evaluation. An information nugget can be used by an assessor to make a binary decision as to whether the fact represented by said nugget is contained in a response16. An assessor may determine nuggets to be required for an answer and may match nuggets to the sentences that contain them. This allows for a finer level of granularity in the evaluation and the assessment of an atomic fact rather than a sentence as a whole. Nuggets were generated from the 40 expert-curated answers in the MedAESQA dataset, where exactly one nugget was generated for every fact contained in an answer. We used a Predicate(subject, object) form to capture the information nuggets. Each medical concept in a nugget is associated with a Concept Unique Identifier (CUI) from the Unified Medical Language System (UMLS)17. These CUIs were identified by manually searching the UMLS Metathesaurus Browser for the closest match. Some facts required a more complex nugget structure, including, but not limited to, “if, then” clauses and comparisons. An attempt was made to normalize language across answers with common predicates and formatting (e.g., Treat(treatment, condition) or Prevent(method of prevention, condition)), while retaining information from the original sentence as much as possible. Each nugget was reviewed by at least two reviewers.
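As a hedged illustration of the nugget format (the values below are invented for exposition and are not entries from the dataset), a nugget might be represented as follows; the real CUIs and any additional fields should be taken from the released JSON.

```python
# Invented example of a Predicate(subject, object) nugget; placeholders are used
# instead of real UMLS CUIs, which in the dataset were assigned manually.
example_nugget = {
    "nugget": "Treat(physical therapy, plantar fasciitis)",
    "predicate": "Treat",
    "subject": {"text": "physical therapy", "cui": "<CUI placeholder>"},
    "object": {"text": "plantar fasciitis", "cui": "<CUI placeholder>"},
}
```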
Machine-generated Answers
To generate answers with appropriate references, we organized a community evaluation18 at the 2024 Text Retrieval Conference (TREC). The participants proposed their approaches for generating answers. Analyzing the participants’ approaches, we outline a framework that takes an input question along with additional metadata (topic and narrative) and provides the generated answer as output.
The detailed steps are listed below; a minimal code sketch of the overall pipeline follows the list.
1. Query Formulation and Expansion: Given a topic, question, and narrative, a query is formulated to search the PubMed collection for relevant documents. The query can be built from the topic, the question, the narrative, or any combination of this metadata. To improve retrieval, query expansion can also be applied, in which the query is expanded or transformed with additional terms or phrases that are semantically related or contextually relevant.
2. Document Retrieval: To retrieve the relevant documents, the 2024 annual baseline snapshot of MEDLINE/PubMed, which extends approximately through the end of 2023, was used. We provided a pre-processed set of 20,727,695 PMIDs representing the abstracts in this snapshot. The approaches primarily used lexical retrieval (BM2519) over an index built on the titles and abstracts of the PubMed collection to retrieve the top-k relevant documents for each question. Some approaches also extracted relevant snippets from the documents and treated the snippets as the relevant passages for the next stage of the framework.
3. Document Reranking: Reranking is an important step to further improve the ordering of the documents/snippets retrieved in the first stage. The goal is to reorder the retrieved documents so that the most relevant and highest-quality results appear at the top of the list. Multiple rerankers were used: pointwise (monoT520, TAS-B21, ANCE22), pairwise (duoT5), and listwise (RankGPT23) approaches.
4. Answer Generation: The reranked documents, along with the corresponding question, were used to generate an answer. Various open (Mistral-7B24, Llama3.125) and closed-source (gemini-1.5-flash-00126, gpt-4o-mini, GPT-4o27) LLMs were used to generate the answers. The LLMs were additionally instructed to cite the appropriate PMIDs of the ranked documents when stating facts in the answer.
5. Post-hoc Citations (Optional): In this optional alternative step, an answer is generated first without referencing any documents; in a post-hoc stage, each sentence is then required to cite supporting documents. LLMs were employed to select appropriate citations for each sentence from the reranked list of documents.
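A minimal sketch of this retrieve-rerank-generate pipeline is shown below. It assumes a BM25 index built with the rank_bm25 package over the abstract collection, a generic reranker exposing a score(query, text) method, and a generic LLM client exposing a complete(prompt) method; these interfaces are illustrative and do not correspond to any particular participant's system.

```python
from rank_bm25 import BM25Okapi  # one possible BM25 implementation; participants used various IR toolkits

def build_query(topic: str, question: str, narrative: str) -> str:
    """Step 1: naive query formulation by concatenating the available metadata."""
    return f"{topic} {question} {narrative}"

def retrieve(query: str, bm25: BM25Okapi, pmids: list[str], k: int = 50) -> list[str]:
    """Step 2: lexical (BM25) retrieval of the top-k PMIDs from a title+abstract index."""
    scores = bm25.get_scores(query.lower().split())
    ranked = sorted(zip(pmids, scores), key=lambda pair: pair[1], reverse=True)
    return [pmid for pmid, _ in ranked[:k]]

def rerank(query: str, candidates: list[tuple[str, str]], reranker) -> list[tuple[str, str]]:
    """Step 3: rerank (pmid, abstract) candidates with a stronger relevance model,
    e.g., a cross-encoder; `reranker.score` is a hypothetical interface."""
    scored = [(doc, reranker.score(query, doc[1])) for doc in candidates]
    return [doc for doc, _ in sorted(scored, key=lambda pair: pair[1], reverse=True)]

def generate_answer(question: str, top_docs: list[tuple[str, str]], llm) -> str:
    """Step 4: prompt an LLM to answer, citing PMIDs in square brackets;
    `llm.complete` is a hypothetical client call."""
    context = "\n".join(f"[{pmid}] {abstract}" for pmid, abstract in top_docs)
    prompt = (
        "Answer the question using only the abstracts below. "
        "After each sentence, cite the supporting PMIDs in square brackets.\n\n"
        f"Abstracts:\n{context}\n\nQuestion: {question}\nAnswer:"
    )
    return llm.complete(prompt)
```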
Human Judgment on Answers
We evaluated two different aspects of the answers: (a) reference attribution and (b) the quality and factuality of the answers. The former aims to judge the support that the referenced documents provide for an assertion generated by the machine, while the latter focuses on evaluating the answer to a question asked by a clinician in order to address a health-related question raised by their patient. We envisioned that clinicians would review each answer and subsequently explain it in plain language. The evaluation was conducted by clinicians employed by Centaur Labs28 in a crowdsourcing manner. Each final judgment is the majority vote of at least three annotators. We also computed an inter-annotator agreement score, defined as the proportion of annotators who assigned the majority label. The inter-annotator agreement scores for the different annotation tasks in the creation of the MedAESQA dataset are provided in Table 3.
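Under the definition above, the per-item agreement is simply the share of annotators voting for the majority label; a minimal sketch (ties resolved arbitrarily) is:

```python
from collections import Counter

def majority_agreement(labels: list[str]) -> float:
    """Per-item agreement as defined above: the fraction of annotators
    who assigned the majority label (ties resolved arbitrarily here)."""
    counts = Counter(labels)
    majority_count = counts.most_common(1)[0][1]
    return majority_count / len(labels)

# Example: three annotators, two agree -> agreement of 2/3.
assert abs(majority_agreement(["Supports", "Supports", "Neutral"]) - 2 / 3) < 1e-9
```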
We follow a two-step judgment on machine-generated answers:
Step 1: Evaluating Answer Alignment with Questions and Answer Quality and Completeness: We begin the evaluation by assessing whether the machine-generated text, as a whole, directly answers the question. We then examine the relevance of each assertion in the answer sentences to the question. Toward this, we categorize each assertion in the generated answer using one of the following four labels:
- Required: The assertion is necessary for the completeness of the generated answer.
- Unnecessary: The assertion does not need to be included in the generated answer. An assertion can be categorized as unnecessary for multiple reasons, including that (a) it provides general information on the topic, or (b) it recommends seeing a doctor when the task states that the patient has already contacted the provider or that the provider is asking the question.
- Borderline: An assertion is marked borderline if it is relevant, possibly even “good to know,” but not required to be part of the answer. For example, if the question is about the most commonly used treatments, information about treatments in the early stages of clinical trials is not necessary.
- Inappropriate: An assertion that may harm the patient if followed is marked as inappropriate. For example, if, according to the answer, physical therapy reduces the pain level, but the patient experiences more pain due to hip mobilization, the patient may start doubting that they are receiving adequate treatment.
We have provided examples of the borderline and unnecessary answer sentences for some healthcare questions in Table 4.
Table 4 Sample examples showing the assertion relevance categories for the borderline and unnecessary labels.
Step 2: Evaluating Answer Alignment with Evidence Support: In the second step, we evaluated the referenced document(s) for each generated answer sentence to determine the relationship, if any, between the document and the generated assertion. Each cited document was labeled with one of four possible relationships to the answer sentence: ‘Supports’, ‘Contradicts’, ‘Neutral’, or ‘Not Relevant’. Additionally, the experts provided a passage from the referenced document to support their assessment of the evidence relation.
- Supports: The relation between the referenced document and the answer sentence is marked as Supports if at least one sentence in the referenced document supports or agrees with the assertion made in the answer sentence, e.g., “opioids were the mainstay of perioperative pain control”, and no other sentence in the document contradicts the statement.
- Contradicts: The relation is marked as Contradicts if at least one sentence in the referenced document disagrees with the assertion or states its opposite, e.g., “Increasing pain levels after the first week postoperatively, for 3 days, are most likely to be caused by the change to more extensive mobilization and physiotherapy in the rehabilitation unit.” (The answer in this case stated that the pain decreases steadily after the surgery.)
- Neutral: The referenced document is marked Neutral if it is topically relevant but lacks any information to validate or invalidate the assertion made in the answer sentence.
- Not Relevant: The referenced document is marked Not Relevant if it has no topical relationship to the answer sentence.
Data Records
We have archived the MedAESQA29 data records with the Open Science Framework (OSF), available at https://doi.org/10.17605/OSF.IO/ydbzq. The OSF link contains a directory called MedAESQA, which contains the entire dataset in a JSON file that lists the data items. The accompanying README file contains detailed information about each field in the dataset, including sample code to process the data. Each item in the JSON file contains the relevant key (metadata name) and value (metadata information) pairs, which are question_id, question, question_frame, expert_curated_answer, expert_curated_nuggets, and machine_generated_answers. The detailed statistics of the MedAESQA dataset are shown in Table 5. Figure 2 shows a JSON tree visualizing a data sample.
Fig. 2 An example of a JSON tree for one of the data objects in the MedAESQA dataset. The illustration shows answers for only three machine-generation approaches, M1, M2, and M3; however, the dataset has answers from 30 machine-generation approaches (M1 to M30), along with the citation assessment for each answer sentence in the generated answers. The answers and nuggets are truncated in the example.
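A minimal loading sketch is shown below. The top-level keys follow the field names described above, while the file name and the nested layout of machine_generated_answers are assumptions; the README in the OSF repository documents the authoritative schema.

```python
import json

# Minimal sketch for loading MedAESQA. Top-level keys follow the field names
# described in the text; the file name and nested structure iterated over
# below are assumptions -- consult the OSF README for the authoritative schema.
with open("MedAESQA/medaesqa.json", encoding="utf-8") as f:  # hypothetical file name
    records = json.load(f)

for record in records:
    qid = record["question_id"]
    question = record["question"]
    frame = record["question_frame"]
    expert_answer = record["expert_curated_answer"]
    nuggets = record["expert_curated_nuggets"]
    machine_answers = record["machine_generated_answers"]
    # Per-answer sentence relevance labels and citation assessments are nested
    # inside machine_answers (structure assumed; see Fig. 2 and the README).
```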
Technical Validation
To generate multiple answers and validate the MedAESQA dataset, we organized a community evaluation18 at the 2024 Text Retrieval Conference (TREC), in which participants were provided with the questions and the PubMed collection and asked to generate answers in which appropriate PMIDs supported each assertion (equated to sentences in this evaluation). We acknowledge that in this edition of the TREC evaluation, which resulted in the MedAESQA dataset, we limited ourselves to the titles and abstracts of the PubMed documents, which may limit how well the assertions can be supported in real-world use. In the future, we plan to use the full-text PubMed documents to assess the validity of the assertions. Participants were asked to generate answer sentences, each supported by up to three attributions (cited references), with a maximum of 30 documents allowed per answer. Documents had to be cited in the answers using PMIDs enclosed in square brackets, as illustrated in Figure 1. For each question, we received thirty answers generated using different approaches. All the answer-generation approaches are described in the Appendix. The different strategies used by the participants yield diverse answers. Furthermore, each answer as a whole and the assertions with their corresponding citations were manually evaluated by the experts, which establishes the level of correctness of the machine-generated answers. We next describe the metrics we used to validate the answer quality, the associations between citations and assertions, and the relevance of the cited documents.
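As a small illustration of the required citation format, the sketch below extracts bracketed PMIDs from a generated answer sentence; the example sentence is invented, and systems may format multiple PMIDs per bracket differently.

```python
import re

# Extract PMIDs cited in square brackets, per the answer format described above.
# The example sentence is invented; the pattern assumes one PMID per bracket.
CITATION_PATTERN = re.compile(r"\[(\d{1,8})\]")

sentence = "Physical therapy is a commonly recommended first-line treatment [12345678] [23456789]."
pmids = CITATION_PATTERN.findall(sentence)
print(pmids)  # ['12345678', '23456789']
```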
Benchmarking Metrics
We conducted a comprehensive evaluation of machine-generated answers across multiple levels and dimensions; a minimal code sketch of several of these metrics follows the list below.
Answer Quality: We evaluate the quality of the generated answers from multiple perspectives: accuracy, precision, recall, and redundancy. The details of the metrics are as follows:
- Answer Accuracy evaluates the accuracy of the generated answer using human-provided judgments. It measures how many of the answers to the total of 40 questions were deemed acceptable (judged as answering the question at least partially) for each answer-generation approach.
$$\text{Accuracy}=\frac{\text{Number of Acceptable Answers}}{\text{Total Number of Questions}} \quad (1)$$
- Answer Completeness (Recall) evaluates the extent to which a given answer covers the facts (aspects) deemed required by the assessors. The required aspects are aggregated across all system-generated answers. To identify answer aspects, we cluster the answer sentences using sentence embeddings generated by the SentenceTransformer model (sentence-transformers/all-mpnet-base-v2) and the SimCSE model (princeton-nlp/sup-simcse-roberta-large). We set up multiple evaluation levels for computing recall. In the strict evaluation, only sentences judged required and supported by evidence were considered for grouping. In the lenient evaluation, all sentences judged required were considered. In the relaxed evaluation, borderline sentences were considered in addition to the required sentences. The number of aspects for the automated grouping is set to 10 using K-means clustering.
$$\text{Completeness}=\frac{\text{Number of Distinct Clusters Containing Sentences from the Answer}}{\text{Number of Clusters}} \quad (2)$$
- Answer Precision assesses the proportion of assertions in the answer that were judged required or acceptable.
$$\text{Precision}=\frac{\text{Number of Generated Required Sentences}}{\text{Total Number of Generated Sentences}} \quad (3)$$
- Redundancy Score quantifies unnecessary answer sentences and penalizes a system for generating them, measuring the proportion of unnecessary answer sentences among all generated answer sentences.
$$\text{Redundancy Score}=\frac{\text{Number of Generated Unnecessary Sentences}}{\text{Total Number of Generated Sentences}} \quad (4)$$
- Irrelevancy Score quantifies inappropriate or potentially harmful answer sentences and penalizes a system for generating them. The score measures the potential of the generated answer to mislead the reader.
$$\text{Irrelevancy Score}=\frac{\text{Number of Generated Inappropriate Sentences}}{\text{Total Number of Generated Sentences}} \quad (5)$$
Citation Quality: A system-generated answer statement may be supported or contradicted by the documents provided as references. It is also possible that answer sentences include no references, or include references that are only topically relevant or irrelevant. The following metrics are designed to assess the quality of these references:
- Citation Coverage measures how well the required and borderline generated answer sentences are backed by appropriate (judged as Supports) citations.
$$\text{Citation Coverage}=\frac{\text{Number of Generated Answer Sentences with One or More Supportive Citations}}{\text{Total Number of Generated Answer Sentences}} \quad (6)$$
- Citation Support Rate assesses the proportion of the citations provided by a system that were judged by the experts as supporting the corresponding statement in the generated answer.
$$\text{Citation Support Rate}=\frac{\text{Number of Supporting Citations}}{\text{Total Number of Citations}} \quad (7)$$
- Citation Contradiction Rate assesses the proportion of the citations provided by a system that were judged by the experts as contradicting the corresponding statement in the generated answer. In a fact-verification task, this measure can indicate how effectively a system identifies contradictory evidence.
$$\text{Citation Contradiction Rate}=\frac{\text{Number of Contradicting Citations}}{\text{Total Number of Citations}} \quad (8)$$
Document Relevancy: By pooling all documents judged relevant to a given topic, we compute standard recall and precision. The set of relevant documents includes documents judged as supporting, contradicting, or neutral.
$$\text{Recall}=\frac{\text{Number of Relevant Retrieved Documents}}{\text{Number of All Relevant Documents}} \quad (9)$$
$$\text{Precision}=\frac{\text{Number of Relevant Retrieved Documents}}{\text{Number of References Provided}} \quad (10)$$
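A minimal sketch of how several of these metrics can be computed from the released judgments is shown below. The ratio metrics follow Eqs. (3)-(5) and (7)-(8) directly; the completeness sketch clusters pooled required sentences with the named SentenceTransformer model and K-means, but the exact pooling and cluster-assignment details are assumptions, and the evaluation code in the GitHub repository is authoritative.

```python
from collections import Counter
from sentence_transformers import SentenceTransformer
from sklearn.cluster import KMeans

def ratio_metrics(sentence_labels: list[str], citation_labels: list[str]) -> dict[str, float]:
    """Ratio metrics (Eqs. 3-5, 7-8) from per-sentence and per-citation judgments.
    Label strings mirror the annotation scheme described above."""
    n_sent, n_cit = len(sentence_labels), len(citation_labels)
    sent_counts, cit_counts = Counter(sentence_labels), Counter(citation_labels)
    return {
        "precision": sent_counts["Required"] / n_sent if n_sent else 0.0,
        "redundancy": sent_counts["Unnecessary"] / n_sent if n_sent else 0.0,
        "irrelevancy": sent_counts["Inappropriate"] / n_sent if n_sent else 0.0,
        "citation_support_rate": cit_counts["Supports"] / n_cit if n_cit else 0.0,
        "citation_contradiction_rate": cit_counts["Contradicts"] / n_cit if n_cit else 0.0,
    }

def completeness(system_sentences: list[str], pooled_required_sentences: list[str],
                 n_clusters: int = 10) -> float:
    """Clustering-based completeness (Eq. 2): cluster the pooled required sentences
    from all systems, then count how many clusters the evaluated system covers."""
    encoder = SentenceTransformer("sentence-transformers/all-mpnet-base-v2")
    kmeans = KMeans(n_clusters=n_clusters, random_state=0, n_init=10)
    kmeans.fit(encoder.encode(pooled_required_sentences))
    if not system_sentences:
        return 0.0
    covered_clusters = set(kmeans.predict(encoder.encode(system_sentences)))
    return len(covered_clusters) / n_clusters
```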
MedAESQA Dataset Analysis
We performed a detailed analysis of the MedAESQA dataset by calculating dataset statistics at various levels of granularity. We found that 1,108 of the 1,200 machine-generated answers were deemed acceptable by the experts. We also analyzed the answer sentence relevance assessed by the experts and found that 3,958 of the 5,162 generated answer sentences were judged required. Similarly, in our analysis of the evidence relations, we found that 5,489 of the 8,111 references supported the assertions made in the generated answers. A detailed analysis of the different categories of answer sentence relevance and evidence relation is presented in Fig. 3.
The MedAESQA dataset comprises 40 questions along with their expert-curated and machine-generated answers. Each question is associated with one expert-curated answer and 30 machine-generated answers, resulting in a total of 1,240 answers. For each machine-generated answer, a sentence-level assessment was conducted to evaluate answer accuracy, sentence relevance, and evidence relation. The dataset includes a total of 5,162 answer sentences. In the expert-curated answers, we identified 316 references, with a minimum of 3 and a maximum of 10 references per answer. The machine-generated answers, on the other hand, yielded a minimum of 0 and a maximum of 31 references per answer. The MedAESQA dataset also contains a total of 7,651 human-curated evidence excerpts from the referenced documents that support the assessed evidence relations. A detailed analysis of the MedAESQA dataset is provided in Table 5.
Benchmarking Evaluation
We evaluated the performance of the methods used to generate answers in the MedAESQA dataset on various evaluation metrics. The detailed results are presented in Tables 6, 7 and 8. On the answer accuracy metric, 26 of the 30 methods achieved more than 92% accuracy, with a maximum of 100% (11 methods) and a minimum of 92.5% (2 methods). More than one-third of the methods (11 of 30) achieved perfect accuracy, which indicates the acceptable quality of the generated answers. For answer precision, method M17 achieved the best performance with a precision score of 90.23. The precision score was in the range of 73.54 to 85.54 for 17 of the 30 methods and in the range of 85.54 to 90.23 for 3 of the 30 methods. The redundancy scores were high (>15%) for only five methods. Method M4 recorded the lowest redundancy score of 4.04% with a precision of 79.08. Harmfulness is a key metric that aims to assess the tendency of a system to generate harmful sentences. We found that 19 of the 30 methods achieved a perfect harmfulness score of 0%. The remaining methods also perform well and appear cautious in generating answers, as the highest harmfulness score we recorded was only 1.88. Nevertheless, we believe the harmful sentences are a good source for training and testing approaches for identifying and mitigating harmful answers.
To measure answer completeness (recall), we followed a clustering approach in which we clustered all the generated answers for a given question together to assess the distinct aspects covered in the machine-generated answers. We utilized two different answer sentence representation approaches: Sentence Transformer and SimCSE. The comparative results under three different settings (S+R, R, and R+B) are presented in Table 7. S+R, which considers only supported and required facts, is the strictest evaluation, and R+B, which includes all required and borderline statements regardless of support, is the most lenient evaluation. We found that method M24 (Sentence Transformer) obtained the best recall scores of 40.25 and 42.75 on the S+R and R+B settings, respectively. A similar trend is also observed when using SimCSE as the sentence representation. Some methods recorded recall scores of 0 under the S+R setting because those methods did not generate citations for their answers.
While analyzing the citation quality of the different approaches to machine-generated answers, we observed that the majority of the approaches (18) achieved citation coverage in the range of 62 to 93. Method M1 recorded the highest citation coverage of 91.92, and M28 recorded the lowest citation coverage of 6.38 (excluding M25 and M27, as these methods did not generate citations along with their answers). For the citation support rate (CSR), method M26 obtained the highest score of 77.88. Citation contradiction rate (CCR) is another key metric for evaluating citation quality, as it assesses how often a system cites a document that contradicts an assertion. We found that some of the systems with the best CSR scores, M6 and M26, recorded CCR values of 3.86 and 1.82, which signifies that these answer generation systems were good at citing appropriate documents. We also analyzed the document recall and precision scores and observed that most systems yield low recall; however, the precision scores were moderate. Method M24 achieved the highest document recall of 23.77 with a precision of 74.98.
Expert-curated Answers vs. Machine-generated Answers
We also analyzed how close the machine-generated answers are to the expert-curated answers. Toward this, we analyzed answer-level and citation-level similarities between the two modes of answer curation. For the answer-level similarity, we computed the similarity between the expert-curated answer and each machine-generated answer and report the BLEU, BLEU-4, ROUGE-2, and ROUGE-L metrics. Since an answer can be stated in different ways, n-gram similarities may not always be the best choice for measuring similarity; therefore, we also report the semantic similarity (BERTScore) between the expert-curated answer and each machine-generated answer. The detailed results are shown in Table 9. On the n-gram similarity-based metrics, we found that method M11 achieved the highest BLEU and ROUGE scores (0.0117 and 0.1845) and a comparable BERTScore (0.8514). Method M20 records scores competitive with M11 on both the n-gram-based evaluation (BLEU: 0.117, ROUGE: 0.1825) and the semantic similarity (BERTScore: 0.8517) metrics. For the citation-level similarity, we created a list of the documents cited in the answer to a given question for the expert citations and the machine-generated citations. We considered the expert citation list as the ground truth and computed the true positives (generated citations that are also present in the expert citations), false positives (generated citations that are not present in the expert citations), and false negatives (expert citations that are not present in the generated citations). With these, we computed the citation precision, recall, and F-score and report the performance in Table 9. We found that method M1 achieved the highest citation F-score of 13.63; however, methods M11 and M20 (best on answer-level similarity) also recorded competitive F-scores of 13.00 and 12.46, respectively.
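A minimal sketch of the citation-level precision, recall, and F-score computation described above, treating the expert citation list as ground truth:

```python
def citation_prf(expert_pmids: set[str], generated_pmids: set[str]) -> tuple[float, float, float]:
    """Citation-level precision, recall, and F-score, with the expert
    citation list treated as ground truth, as described above."""
    tp = len(expert_pmids & generated_pmids)   # generated citations also cited by the expert
    fp = len(generated_pmids - expert_pmids)   # generated citations not cited by the expert
    fn = len(expert_pmids - generated_pmids)   # expert citations missed by the system
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f_score = (2 * precision * recall / (precision + recall)) if (precision + recall) else 0.0
    return precision, recall, f_score
```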
Usage Notes
We have provided detailed instructions in the README file of the Open Science Framework repository (https://osf.io/ydbzq) describing how to process the MedAESQA dataset. The source code to evaluate system performance can be found in the GitHub repository (https://github.com/deepaknlp/MedAESQA).
Code availability
The code to process the MedAESQA dataset and evaluate systems can be found on GitHub (https://github.com/deepaknlp/MedAESQA).
References
Singhal, K. et al. Large language models encode clinical knowledge. Nature 620, 172–180 (2023).
Demner-Fushman, D. & Lin, J. Answering clinical questions with knowledge-based and statistical techniques. Computational Linguistics 33, 63–103 (2007).
Bašaragin, B. et al. How do you know that? Teaching generative language models to reference answers to biomedical questions. In Demner-Fushman, D., Ananiadou, S., Miwa, M., Roberts, K. & Tsujii, J. (eds.) Proceedings of the 23rd Workshop on Biomedical Natural Language Processing, 536–547, https://doi.org/10.18653/v1/2024.bionlp-1.44 (Association for Computational Linguistics, Bangkok, Thailand, 2024).
Biden, J. R. Executive order on the safe, secure, and trustworthy development and use of artificial intelligence (2023).
Lewis, P. et al. Retrieval-augmented generation for knowledge-intensive NLP tasks. Advances in Neural Information Processing Systems 33, 9459–9474 (2020).
Gao, L. et al. RARR: Researching and revising what language models say, using language models. In The 61st Annual Meeting Of The Association For Computational Linguistics (2023).
Jain, P., Baldini Soares, L. & Kwiatkowski, T. From RAG to riches: Retrieval interlaced with sequence generation. In Al-Onaizan, Y., Bansal, M. & Chen, Y.-N. (eds.) Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, 8887–8904, https://doi.org/10.18653/v1/2024.emnlp-main.502 (Association for Computational Linguistics, Miami, Florida, USA, 2024).
Kwiatkowski, T. et al. Natural questions: A benchmark for question answering research. Transactions of the Association for Computational Linguistics 7, 452–466, https://doi.org/10.1162/tacl_a_00276 (2019).
Krithara, A., Nentidis, A., Bougiatiotis, K. & Paliouras, G. BioASQ-QA: A manually curated corpus for biomedical question answering. Scientific Data 10, 170 (2023).
Savery, M., Abacha, A. B., Gayen, S. & Demner-Fushman, D. Question-driven summarization of answers to consumer health questions. Scientific Data 7, 322 (2020).
MedlinePlus [Internet]. National Library of Medicine (US) (1998 -). Last accessed 7 April 2025, Available from: https://medlineplus.gov.
Kilicoglu, H. et al. Semantic annotation of consumer health questions. BMC bioinformatics 19, 1–28 (2018).
Gildea, D. & Jurafsky, D. Automatic labeling of semantic roles. Computational linguistics 28, 245–288 (2002).
Attal, K., Ondov, B. & Demner-Fushman, D. A dataset for plain language adaptation of biomedical abstracts. Scientific Data 10, 8 (2023).
PubMed [Internet]. National Library of Medicine (US) (1996 -). Last accessed 7 April 2025, Available from: https://pubmed.ncbi.nlm.nih.gov.
Voorhees, E. M. & Buckland, L. Overview of the TREC 2003 question answering track. In TREC, vol. 2003, 54–68 (2003).
Lindberg, D. A., Humphreys, B. L. & McCray, A. T. The Unified Medical Language System. Yearbook of Medical Informatics 2, 41–51 (1993).
Gupta, D., Demner-Fushman, D., Hersh, W., Bedrick, S. & Roberts, K. Overview of TREC 2024 biomedical generative retrieval (BioGen) track. In The Thirty-Third Text REtrieval Conference Proceedings (TREC 2024) (2024). Last accessed 7 April 2025, https://trec.nist.gov/pubs/trec33/papers/Overview_biogen.pdf.
Robertson, S. & Zaragoza, H. et al. The probabilistic relevance framework: BM25 and beyond. Foundations and Trends in Information Retrieval 3, 333–389 (2009).
Nogueira, R., Jiang, Z., Pradeep, R. & Lin, J. Document ranking with a pretrained sequence-to-sequence model. In Findings of the Association for Computational Linguistics: EMNLP 2020, 708–718 (2020).
Hofstätter, S., Lin, S.-C., Yang, J.-H., Lin, J. & Hanbury, A. Efficiently teaching an effective dense retriever with balanced topic aware sampling. In Proceedings of the 44th International ACM SIGIR Conference on Research and Development in Information Retrieval, 113–122 (2021).
Xiong, L. et al. Approximate nearest neighbor negative contrastive learning for dense text retrieval. In International Conference on Learning Representations (2020).
Sun, W. et al. Is ChatGPT good at search? Investigating large language models as re-ranking agents. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, 14918–14937 (2023).
Jiang, A. Q. et al. Mistral 7B [Internet] (2023). Last accessed 7 April 2025, https://mistral.ai/news/announcing-mistral-7b.
Dubey, A. et al. The llama 3 herd of models [Internet] (2024). Last accessed 7 April 2025, https://ai.meta.com/research/publications/the-llama-3-herd-of-models/.
Gemini [Internet]. Google (2025). Last accessed 7 April 2025, https://blog.google/technology/ai/google-gemini-ai/.
Achiam, J. et al. GPT-4 technical report [Internet] (2023). Last accessed 7 April 2025, https://cdn.openai.com/papers/gpt-4.pdf.
Centaur Labs [Internet]. Accurate and scalable health data labeling and model evaluation (2025). Last accessed 7 April 2025, https://centaur.ai.
Gupta, D., Bartels, D. & Demner-Fushman, D. MedAESQA, https://doi.org/10.17605/OSF.IO/YDBZQ (2025).
Acknowledgements
This research was supported by the Intramural Research Program of the National Library of Medicine at the NIH. This work utilized the computational resources of the NIH HPC Biowulf cluster (https://hpc.nih.gov). Creation of this dataset would not have been possible without the LLM-generated answers provided by the teams that participated in the Text REtrieval Conference (TREC) hosted by the National Institute of Standards and Technology (NIST): The Information Engineering Lab (University of Queensland, CSIRO), Australia; IIUoT, University of Tsukuba, Japan; UR-IW, University of Regensburg, Webis, Friedrich-Schiller-Universität Jena and Leipzig University; H2oloo, University of Waterloo. The authors thank Srishti Kapur of Centaur Labs for expertly managing the manual evaluation process.
Funding
Open access funding provided by the National Institutes of Health.
Author information
Authors and Affiliations
Contributions
D.G. contributed to project design, implemented data processing code and pipelines, conducted baseline and machine learning experiments, contributed to collection creation, and wrote and edited the manuscript. D.B. contributed to the collection creation and the writing and editing of the manuscript. D.D.-F. conceived of the project, led and contributed to data generation, contributed to writing and editing the manuscript, and otherwise provided feedback on all aspects of the study.
Corresponding author
Ethics declarations
Competing interests
Dina Demner-Fushman is currently an Editorial Board Member for Scientific Data. She has not been involved with any of the editorial aspects or decisions for this publication.
Additional information
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Supplementary information
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
Gupta, D., Bartels, D. & Demner-Fushman, D. A Dataset of Medical Questions Paired with Automatically Generated Answers and Evidence-supported References. Sci Data 12, 1035 (2025). https://doi.org/10.1038/s41597-025-05233-z