Abstract
As the volume of medical literature grows at an accelerating pace, efficient tools are needed to synthesize evidence for clinical practice and research, and interest in leveraging large language models (LLMs) to generate clinical reviews has surged. However, significant concerns remain about the reliability of integrating LLMs into the clinical review process. This study presents a systematic comparison between LLM-generated and human-authored clinical reviews, revealing that while AI can produce reviews quickly, its output typically contains fewer references, less comprehensive insights, and weaker logical consistency, and its citations are less authentic and accurate. Additionally, a higher proportion of its references come from lower-tier journals. Moreover, the study uncovers concerning weaknesses in current systems for detecting AI-generated content, suggesting a need for more advanced checking systems and a stronger ethical framework to ensure academic transparency. Addressing these challenges is vital for the responsible integration of LLMs into clinical research.
Introduction
The landscape of medical research is expanding at an unprecedented rate, characterized by a deluge of new findings and clinical trials published daily1. Keeping abreast of this ever-expanding body of knowledge is a daunting task for healthcare professionals and researchers alike. In this context, clinical reviews assume paramount importance, as they synthesize evidence from a vast array of studies to inform clinical practice and guide future research directions2. However, the manual process of review generation is labor-intensive and potentially unsustainable given the current pace of scientific discovery. To address this challenge, there is growing interest in leveraging the capabilities of large language models (LLMs) to automate the clinical review process3.
These models, built on machine learning and natural language processing, such as OpenAI’s ChatGPT-3.5, sparked strong reactions upon their introduction4. Capable of answering questions in natural language and providing high-quality responses, LLMs quickly gained market traction thanks to the unprecedented experiences they offer5. As LLMs have been explored further, increasingly innovative applications have emerged. In medical consultation today, from a patient’s perspective, LLMs can read medical records and imaging reports to provide advice and diagnostic rationale comparable to that of professional physicians6. From a doctor’s perspective, LLMs can rapidly summarize medical information, organizing decades of a single patient’s records into a PDF that highlights the crucial information and thereby significantly reducing the workload for doctors7. From a hospital management perspective, researchers have demonstrated that LLMs can simulate the operation of an entire hospital, pointing to improvements in operational efficiency8.
LLMs hold tremendous potential for automating clinical reviews, as these models have demonstrated exceptional proficiency in understanding and generating text closely aligned with human writing9. Their ability to rapidly assimilate vast amounts of medical literature and produce structured, insightful reviews presents a promising avenue for managing the overwhelming influx of new information in the medical field10. However, the growing momentum to integrate LLMs into the clinical review process has been met with several controversies and concerns. First, the reliability of the generated clinical reviews is a significant concern. While these models are adept at processing and synthesizing vast amounts of data, they are not infallible11. The quality of an LLM-generated review depends heavily on the quality and diversity of the training data it has been exposed to12. If the training data are biased or incomplete, the resulting review may contain inaccuracies or overlook critical information, leading to quality degradation. Second, academic tasks such as clinical reviews demand accurate citation of references and in-depth argumentation to support the authors’ viewpoints; however, current research has identified false citations and fabricated references in the article generation process of LLMs13.
As LLM technology continues to advance, specialized LLM-based review generation platforms have emerged to address these issues14. These platforms draw on their own literature databases to supply content to the LLM before output and employ web search15, semantic analysis16, and machine learning techniques to provide reliable reference materials, thereby reducing hallucinations and serving academic purposes. They have gained significant recognition for their exceptionally high citation accuracy and insightful commentary14. However, although certain aspects of the review generation process, such as search query construction17 and information extraction18, have previously been evaluated, direct evaluation of the clinical reviews generated by these platforms has not been undertaken, owing to the large workload, the breadth of evaluation metrics required, and the need for collaboration among researchers from multiple directions. Meanwhile, ethical considerations in publishing also warrant attention19. There is a risk that these powerful tools could be exploited by those with less-than-noble intentions. For example, some individuals or groups may use LLMs for bulk paper publishing, a possibility supported by the recent AI prompt publishing incident20. However, to date, research on whether traditional checking systems and the recently emerged AIGC (Artificial Intelligence Generated Content) detection tests are effective in detecting and intercepting these generated clinical reviews is lacking.
In this study, we aimed to systematically assess the gaps between clinical reviews generated by existing platforms and those written by humans and to test whether existing checking systems and AIGC tests are effective in intercepting the generated manuscripts. We hope that this study will serve as a valuable reference for medical researchers and policy makers.
Results
Baseline characteristics of the generated clinical reviews
A total of 2439 clinical reviews were generated. After manually excluding reviews with significant discrepancies in the number of paragraphs or characters, as well as those with zero references, 2169 articles were included in the analysis.
These covered the circulatory system (n = 365), digestive system (n = 309), endocrine system (n = 196), immune system (n = 323), nervous system (n = 256), reproductive system (n = 102), respiratory system (n = 271), urinary system (n = 97), and other comprehensive types (n = 250).
Overall overview
Regarding the consistency of expert evaluation in various subjective indicators, the Single Measures results range from 0.858 to 0.932, showing high consistency among experts (see details in Supplementary Table 6).
In terms of basic quality, AI-generated clinical reviews have significantly fewer paragraphs and references and are less comprehensive, authentic, and accurate than those written by humans. Although the authenticity of references is slightly lacking, the overall difference from human-written reviews is not substantial. Across the subjective indicators, AI’s performance is far inferior to that of humans. Regarding the distribution of references, the proportion of references from the past five years is relatively high in AI articles, whereas the proportion of high-impact-factor or high-CiteScore articles is lower. The citation rate of references in AI reviews shows no difference from that of humans. Lastly, in terms of risks associated with academic publishing, AI demonstrates a relatively low plagiarism detection rate and a highly variable AIGC detection rate.
The analysis results of the three indicators that have statistical significance (p < 0.001) are presented in Fig. 1 (see details in Supplementary Table 3).
The boxplot illustrates the data distribution: the box represents the interquartile range (IQR) from the first quartile (Q1) to the third quartile (Q3), with the line inside indicating the median and a square symbol marking the mean. The whiskers extend up to 1.5 times the IQR, and any points beyond this range are marked as outliers. In terms of objective metrics, AI demonstrates lower paragraph count, number of references, comprehensiveness, authenticity, and accuracy compared to humans. On subjective metrics, AI performs worse than humans across all levels. However, there is no significant difference between the two in terms of the cumulative and the average citation count of references, while the references exhibit different distribution patterns.
Basic quality of the article
Compared to human-written clinical reviews, AI clinical reviews exhibit fewer paragraphs (AI: 13.000 [7.000, 83.000], Human: 36.000 [29.000, 48.000]), lower numbers of references (AI: 20.000 [8.000, 78.000], Human: 87.000 [71.000, 115.000]), lower comprehensiveness of references (%) (AI: 0.367 [0.055, 2.041], Human: 2.113 [0.723, 4.285]), and lower authenticity (AI: 100.000 [70.550, 100.000], Human: 100.000 [100.000, 100.000]) and accuracy of references (%) (AI: 100.000 [73.550, 100.000], Human: 100.000 [100.000, 100.000]) (see Fig. 1). Additionally, on subjective indicators, AI clinical reviews demonstrate lower language quality (AI: 80.000 [70.000, 82.000], Human: 100.000 [100.000, 100.000]), lower depth of reference evaluation (AI: 75.000 [65.000, 78.000], Human: 100.000 [100.000, 100.000]), lower logical ability (AI: 78.000 [70.000, 80.000], Human: 100.000 [100.000, 100.000]), lower innovative ability (AI: 70.000 [60.000, 73.000], Human: 100.000 [90.000, 100.000]), and lower overall quality (AI: 78.000 [70.000, 80.000], Human: 100.000 [99.000, 100.000]).
Distribution of references
Compared to human-written clinical reviews, the proportion of references from the past five years is relatively high in AI reviews (AI: 46.700 [37.800, 67.100], Human: 36.905 [25.000, 54.054]). Meanwhile, regarding the JCR quartiles, there is a significant difference in the proportion of Q1 references in AI clinical reviews (%) (AI: 34.300 [25.600, 44.898], Human: 60.355 [47.959, 70.370]). Furthermore, in terms of impact factor, AI reviews contain a lower proportion of high-impact-factor references (%) (impact factor 0-3: AI: 28.571 [18.800, 37.100], Human: 7.368 [5.769, 13.776]; impact factor 3-5: AI: 16.100 [9.500, 23.333], Human: 15.278 [9.524, 22.321]; impact factor 5-10: AI: 14.286 [7.700, 20.800], Human: 15.686 [8.850, 22.500]; impact factor ≥10: AI: 12.300 [6.400, 18.750], Human: 30.233 [19.355, 45.833]). Similarly, references with a high CiteScore also account for a lower proportion in AI reviews (%) (CiteScore ≥10: AI: 33.100 [20.000, 46.800], Human: 52.778 [38.776, 64.045]).
Quality of references
Compared with human-written clinical reviews, there is no significant difference in the cumulative citations of all references (p = 0.004) or the average number of citations per reference (p = 0.211).
Academic publishing risk
The results of the plagiarism checks are displayed in Fig. 2. Specifically, AI has a low plagiarism detection rate (%) of 28.000 [16.000, 45.000]. Figure 3 provides an overview of the performance of eight AIGC detection platforms on AI-generated and human-written reviews. Specifically, human-written reviews exhibited a lower AI detection rate than AI-generated reviews. However, the AI detection rate for AI-generated reviews fluctuated over a wide range (minimum to maximum AIGC detection rate: 8-100) (see details in Supplementary Table 7).
The boxplot illustrates the data distribution: the box represents the IQR from Q1 to Q3, with the line inside indicating the median and a square symbol marking the mean. The whiskers extend up to 1.5 times the IQR, and any points beyond this range are marked as outliers. On the left side of the boxplot, a scatterplot displays the distribution of the data points. Among all submitted articles, AI-generated reviews exhibited both high detection rates and high variability in those rates.
Subgroup analysis
Subgroup analyses were conducted within AI-generated reviews across several factors: different journals as sources for control articles, diverse clinical domains, distinct methods of generation, and different platforms/models of generation (see details in Supplementary Table 4).
For different journals, in terms of the basic quality of articles, The Lancet demonstrates a higher character count (12,803.000 [4716.000, 67,414.000]) and number of paragraphs (12.000 [6.000, 83.000]), while NEJM shows better authenticity of references (100.000 [80.530, 100.000]) and BMJ exhibits higher reference accuracy (100.000 [82.120, 100.000]). The overall quality of articles from The Lancet is the highest (78.000 [70.000, 80.000]). Regarding the distribution of references, NEJM has a higher proportion of references from the past five years (56.000 [38.200, 73.400]) and a greater proportion of references in Q4 journals (6.400 [1.900, 11.538]). There is no statistically significant difference in the quality of references.
For different clinical domains, AI exhibits a certain bias. In terms of basic quality, AI-generated reviews on the nervous system (70.000 [65.000, 75.000]), respiratory system (70.000 [65.000, 75.000]), and urinary system (70.000 [65.000, 75.000]) demonstrate a relatively high level of innovation. Articles related to the digestive system generated by AI demonstrate higher authenticity (Authenticity (References) (%): 100.000 [93.460, 100.000]). Additionally, the accuracy of references is highest in clinical reviews of both the digestive system and other comprehensive types (Accuracy (References) (%): Digestive System: 100.000 [95.000, 100.000], Other: 100.000 [95.000, 100.000]).
For different generation methods, compared with the objective method, the outline method significantly increases the word count (64,691.000 [37,172.000, 82,797.000]), paragraph count (82.000 [25.000, 116.000]), and number of references (77.000 [36.000, 124.000]). Meanwhile, all subjective scores within the basic quality dimension improved (overall quality: objective: 78.000 [50.000, 80.000]; outline: 80.000 [72.000, 80.000]). Regarding the distribution of references, the outline method led to a decrease in the proportion of articles published in the past year (20.900 [9.600, 31.200]). The proportion of articles with an impact factor ≥10 increased (12.300 [7.300, 18.100]), while the proportion of articles with a CiteScore ≥10 decreased (31.600 [19.672, 45.300]).
For AI platforms and AI models, overall, in terms of basic article quality, LLMs demonstrate higher numbers of characters (45,857.000 [5174.000, 74,638.000]), paragraphs (19.000 [9.000, 99.000]), and references (23.000 [8.000, 99.000]) and outperform generative platforms across all five subjective metrics. However, their performance is inferior to generative platforms in terms of reference authenticity (95.000 [47.800, 100.000]) and accuracy (95.000 [51.220, 100.000]). Regarding the distribution of references, LLMs have a higher proportion of references from the past five years (58.400 [40.700, 73.000]), but a lower proportion of Q1 references (32.900 [25.300, 41.300]) and references with an impact factor ≥10 (11.500 [6.500, 17.000]) compared to generative platforms. In terms of reference quality, references generated by LLMs exhibit a higher average number of citations per reference and higher cumulative citations of all references. Specifically, for each platform, o1-mini has the highest word count in terms of basic article quality (72,985.000 [5153.000, 106,627.000]), while Claude-3.5-Sonnet has the highest number of paragraphs (147.000 [17.000, 182.000]). o1-preview has the highest number of references (39.000 [16.000, 252.000]), and Claude-3.5-Haiku demonstrates the lowest reference authenticity (%) (31.120 [16.890, 45.330]) and accuracy (%) (38.050 [21.340, 50.360]). o1-preview achieves the highest overall quality score (80.000 [80.000, 82.000]). In terms of reference distribution, AskYourPdf-2 has the highest concentration of references from the past decade (%) (100.000 [85.714, 100.000]), while GPT-4o has the highest proportion of references from the past year (%) (31.900 [20.000, 40.400]). Template.net has the highest proportion of Q1 journal references (44.872 [32.258, 51.163]), and AskYourPdf-1 has the highest proportion of references with an impact factor ≥10 (34.483 [20.000, 55.556]) and the highest proportion of references with a CiteScore ≥10 (57.143 [42.308, 70.000]). Regarding reference quality, AskYourPdf-1 has the highest cumulative citation count across all references (8255.000 [2021.000, 20,401.000]) and the highest average citation count (428.000 [208.000, 823.000]).
Discussion
To date, this study is the first to systematically compare human-authored and AI-generated clinical reviews while also pioneering the publication risk assessment of AI-generated clinical reviews. On the one hand, the overall results indicate that, compared to human-written reviews, AI-generated clinical reviews exhibit significant deficiencies in most basic article quality metrics, including the number of paragraphs, the quantity of references, and the authenticity, comprehensiveness, and accuracy of the references. They also fall short on the five major subjective criteria encompassing language quality, depth of reference evaluation, logical capability, innovation degree, and overall quality. In addition, although clinical reviews generated by AI exhibit different patterns in citation distribution, particularly in the proportion of publications in high-impact journals, there is no statistical difference between AI and human experts in the average number of citations per reference or the cumulative citations of all references, indicating that the quality of the references cited by AI does not significantly differ from that of humans. Lastly, in terms of publication risk assessment, existing publication detection systems face significant challenges: the AI-generated literature exhibits low plagiarism detection rates and highly variable AIGC detection rates, suggesting that current inspection systems may not effectively prevent its infiltration into human-written clinical reviews. On the other hand, results from the subgroup analysis indicate that AI exhibits a noticeable bias, with significant differences across journals. The Lancet demonstrates better overall quality and metrics related to article structure, such as word count and number of paragraphs, while the other selected journals show variations in the authenticity, accuracy, and distribution of references. At the same time, on specific topics, general topics achieve significantly better authenticity and accuracy of references than topics focused on a particular human system. In terms of model configuration, more extensively trained models did not show weakened reference accuracy due to hallucination. Seamless-2, based on GPT-4.0, demonstrates better performance in terms of word count and improved article comprehensiveness compared to Seamless-1; its overall quality is consistent with Seamless-1, and the logicality of its summaries is markedly enhanced. The performance of platforms relying solely on native GPT generation is noticeably inferior to that of the adapted generation platforms. Specifically, AskYourPdf-2 accesses the GPT store and uses the native GPT together with the APIs provided by AskYourPdf to respond. Although it saw some improvement in comprehensiveness, it lagged behind AskYourPdf-1 in depth of reference evaluation, logical capability, innovation degree, and overall quality.
The results of this research can be interpreted from the following perspectives. First, Hewitt and colleagues have previously described the limitation of LLMs in producing longer texts. Specifically, the scarcity of long-output examples in supervised fine-tuning (SFT) partly constrains the length of the results21. For instance, the comment sections of ubiquitous social media provide many short question-and-answer dialogs as sources for model training. In contrast, while many papers and novels are available online as potential long responses, they lack clearly defined prompts that would allow them to serve as long-output training examples. Meanwhile, such potential long-output content is often protected by copyright, preventing its arbitrary inclusion in training datasets and making long-output examples even scarcer22. Moreover, since today’s advanced models are often closed source and deployed on commercial servers, providers inevitably limit output length and complexity to minimize computational costs23. The direct consequence of reduced text length is a decrease in the number of paragraphs and references.
According to the platforms’ introductions, to avoid hallucination problems in LLMs, the aforementioned review generation platforms do not directly fine-tune the original LLMs for output. For example, Seamless first uses machine learning programs to match literature within the platform’s reserved literature database and then lets the LLM produce the final output based on this content combined with the user input. While this largely resolves past inconsistencies between LLM outputs and cited references, it places higher demands on the quality of the company’s own reserved literature database. From a business perspective, these generation platforms typically prioritize purchasing rights from publishers with lower copyright fees, as their journals have lower visibility and the purchase costs are modest. In contrast, acquiring copyrights from larger academic publishers usually involves significant costs, so despite their journals having higher impact factors or CiteScores, they may not be prioritized due to cost constraints and are thus not included in these reserved article databases. Therefore, a plausible explanation for the output is that these generation platforms cite lower-impact and lower-quality articles more frequently than humans because such articles constitute a much larger proportion of their knowledge base.
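The retrieval-then-generate pattern described above can be sketched as follows. This is a minimal illustration under our own assumptions, not the platforms' actual implementation: the reserved literature database, the TF-IDF matcher, and the prompt format are stand-ins for proprietary components.

```python
# Minimal sketch of the retrieval-then-generate pattern described above.
# The reserved literature database, the TF-IDF matcher, and the prompt
# format are illustrative stand-ins for the platforms' proprietary components.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical reserved literature database: one entry per licensed article.
reserve_db = [
    {"title": "Reference A", "abstract": "Text of abstract A ..."},
    {"title": "Reference B", "abstract": "Text of abstract B ..."},
]

def retrieve(topic: str, k: int = 5) -> list[dict]:
    """Match the user topic against the reserved database before generation."""
    corpus = [entry["abstract"] for entry in reserve_db]
    vectorizer = TfidfVectorizer(stop_words="english")
    matrix = vectorizer.fit_transform(corpus + [topic])
    scores = cosine_similarity(matrix[len(corpus)], matrix[:len(corpus)]).ravel()
    ranked = sorted(zip(scores, reserve_db), key=lambda pair: pair[0], reverse=True)
    return [entry for _, entry in ranked[:k]]

def build_prompt(topic: str, references: list[dict]) -> str:
    """Ground the LLM on retrieved references to limit hallucinated citations."""
    context = "\n".join(f"- {r['title']}: {r['abstract']}" for r in references)
    return (f"Write a clinical review on: {topic}\n"
            f"Cite only the following references:\n{context}")

prompt = build_prompt("management of heart failure",
                      retrieve("management of heart failure"))
# `prompt` would then be passed to the platform's LLM for the final output.
```

In such a design, the generated text can only cite what the retriever surfaces, which is why the composition of the reserved database shapes the citation profile of the output.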
In terms of AI-generated content detection rates, take GPTZero, a widely known detector to which we have access, as an example: its founder Edward Tian has explained that existing AI-generated content detection relies on two key indicators, “perplexity” and “burstiness”24. Perplexity can be understood as predictability: when a detector can accurately predict the next word or sentence in a text, the text’s perplexity is lower, making it more likely to be identified as AI-generated. Burstiness refers to variation in sentence length and complexity: AI-generated sentences tend to have uniform length and structure, while human writing is more dynamic and free-form, which is why tutorials on “reducing AI detection” often suggest adding punctuation and splitting long sentences into short ones. However, while these methods might be effective against LLMs that produce direct, integrated output, they may fall short for current academic generation platforms in which multiple LLMs collaborate across several processing stages. Not only do such detection results show significant volatility, but the emergence of AIGC detection products, much like earlier plagiarism detection rates, has also spurred the development of tools aimed at reducing AIGC detection rates. In our initial research, among the eight AIGC detection platforms, SurferSEO achieved the best performance, with the highest AIGC detection rate reaching 100.000 [98.000, 100.000]. However, among the platforms we studied, Merlin offers both AIGC detection and AIGC-detection-reduction services, claiming to “bypass all AIGC detectors on the market.” We conducted a supplementary experiment in which articles with an AIGC detection rate exceeding 75% were processed with this tool to reduce the AIGC detection rate (see details in Supplementary Table 7). The results, shown in Fig. 4, indicate that after using this tool, the AIGC detection rates across all platforms decreased, with reductions ranging from 21% to 82%. For most articles, the AIGC detection rate dropped below 50% (the threshold at which all platforms classify content as AI-generated). This raises concerns about the reliability of current academic detection systems. Regarding plagiarism detection rates, past detection has relied heavily on the similarity of sentence meaning and structure against duplication databases25. LLMs can generate different sentence structures under the same topic simply by varying prompts, and this flexibility makes it difficult for traditional detection systems to identify LLM-generated content using conventional methods. Furthermore, in more complex generation environments, the collaboration of multiple LLMs and the integration of LLMs with other technologies further increase the difficulty of detection. Together, these factors make traditional detection methods increasingly inadequate for modern AI-generated content.
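GPTZero's exact scoring is proprietary, but the two signals can be approximated. The sketch below assumes access to a small open causal language model (GPT-2) through the transformers library and computes a rough perplexity together with a simple burstiness proxy (variation in sentence length); it illustrates the idea only and does not reimplement any commercial detector.

```python
# Rough approximations of "perplexity" and "burstiness"; illustrative only,
# not a reimplementation of GPTZero or any other commercial detector.
import math
import re
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

def perplexity(text: str) -> float:
    """Lower perplexity = more predictable text, more likely flagged as AI."""
    ids = tokenizer(text, return_tensors="pt", truncation=True, max_length=1024).input_ids
    with torch.no_grad():
        loss = model(ids, labels=ids).loss  # mean cross-entropy per token
    return math.exp(loss.item())

def burstiness(text: str) -> float:
    """Coefficient of variation of sentence lengths; human prose tends to vary more."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    lengths = [len(s.split()) for s in sentences]
    if len(lengths) < 2:
        return 0.0
    mean = sum(lengths) / len(lengths)
    var = sum((l - mean) ** 2 for l in lengths) / (len(lengths) - 1)
    return (var ** 0.5) / mean

sample = "Large language models generate fluent text. Detection remains hard."
print(perplexity(sample), burstiness(sample))
```

Multi-stage platform pipelines blur both signals, which is one plausible reason for the volatile detection rates observed in this study.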
A series of recent LLMs, whether open-source or closed-source, have undeniably brought the character count, paragraph count, and reference count of generated reviews closer to human-written literature, while significantly improving language quality, depth of reference evaluation, logical capability, innovation degree, and overall quality. However, because of hallucination, the authenticity of references generated by these models fluctuates greatly, with numerous mismatches between reference titles, authors, and publication years. Compared to these models, literature generated by platforms operating within standardized frameworks, while addressing the issue of false references, performs poorly on the five subjective metrics. Thus, future development urgently requires an effective balance between the two approaches to maximize the benefits of generated content. On the other hand, models designed for general commercial users, such as GPT-4o-mini, and small-scale open-source models targeted at small businesses, such as Llama-3.1-8B, do not exhibit significant performance differences compared to their upgraded counterparts. Therefore, for tasks like clinical review generation, current users or small businesses do not need to invest more in deploying advanced models to achieve better results. However, it is important to note that specialized open-source models, such as Palmyra-Med, which focuses on healthcare, and Galactica, which specializes in scientific research, are constrained by token length and the specific task types they target. As such, they are not suitable for this particular task and, at least at this stage, should not be considered a good choice for these types of tasks.
The configuration of prompts and parameters has a significant impact on traditional AI tasks. In the preliminary consistency exploration prior to the study, different prompts had a substantial influence on the quality of review generation (see details in Supplementary Note 3). Providing detailed guidance on the approximate content of each paragraph in the review yielded better results. Regarding parameter configuration, updating the base model to newer versions, such as transitioning from Seamless-1’s GPT-3.5 to Seamless-2’s GPT-4.0, delivered predictably better results. For model parameters, some exploration of temperature values was conducted. For proprietary models, multiple sets of temperature and top_p values were tested, ranging from 0 to 1 in 0.2 intervals, to optimize generation. However, this did not resolve issues of suboptimal outputs or the inability to understand task intent. For general-purpose models, simple post-hoc experiments were conducted on both open-source and closed-source models (see details in Supplementary Note 5). The results generated by the two types of models under different temperature and top_p values did not show statistically significant differences. Therefore, default configurations can be considered for experiments in this type of task.
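For illustration, the temperature and top_p sweep described above could be scripted as in the sketch below. This is a hedged sketch assuming the OpenAI Python SDK (v1+); the model name, topic, and prompt are placeholders rather than the study's actual configuration, and in the study each output was judged by the expert panel rather than by an automated metric.

```python
# Sketch of the temperature / top_p sweep described above. The model name,
# prompt, and topic are placeholders; in the study, outputs were scored by
# the expert panel rather than by an automatic metric.
from itertools import product
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment
grid = [round(x * 0.2, 1) for x in range(6)]  # 0.0, 0.2, ..., 1.0

def generate(topic: str, temperature: float, top_p: float) -> str:
    response = client.chat.completions.create(
        model="gpt-4o",  # placeholder model name
        messages=[{"role": "user", "content": f"Write a clinical review on: {topic}"}],
        temperature=temperature,
        top_p=top_p,
    )
    return response.choices[0].message.content

results = {}
for t, p in product(grid, grid):
    text = generate("management of asthma in adults", t, p)
    results[(t, p)] = text  # each output would then go to expert scoring
```

Because the sweep showed no statistically significant differences, default sampling settings appear sufficient for this task type.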
This study aims to provide a reference model for the future application of LLMs to clinical reviews across various subfields. However, given the highly diverse and complex nature of research directions in the medical field, the methodology of this study is constrained by workload and the breadth of available expertise; it does not encompass all medical fields or make targeted adjustments to the evaluation metrics for specific research areas. That is a process requiring the joint efforts of international researchers from various fields. Applying the conclusions of this study to other domains therefore requires supplementation and interpretation in light of their specific characteristics, particularly in areas that rely heavily on a small number of authoritative references. Additionally, analyzing and statistically accounting for instances where authoritative references are cited only indirectly in generated literature presents certain challenges. When applying the conclusions of this study or using AI clinical reviews in one’s own research field, in addition to referring to the conclusions already drawn here, we recommend the following measures to further evaluate LLM-generated reviews in a specific research area (see details in Fig. 5): First, conduct bibliometric analyses of literature within the specific ___domain, for example using tools such as CiteSpace to identify key publications at critical time points. Second, after the LLM generates a review, perform an overlap analysis between the cited references and the key publications to ensure sufficient citation of authoritative sources. Third, to address indirect citation of authoritative references, introduce manual screening to ensure direct citation of key publications. Finally, compare the citation quality and overall quality of LLM-generated reviews with those of manually written reviews to evaluate the practical application of LLM reviews in the ___domain and promote their implementation. Through these measures, the applicability and reliability of LLMs for generating reviews within a specific medical field can be further enhanced, providing more effective reference methods for generating reviews in specialized subfields.
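The overlap analysis recommended in the second and third steps reduces to a set comparison once the ___domain's key publications (e.g., identified with CiteSpace) and the references cited by the LLM are available as identifier lists. The sketch below assumes DOIs as identifiers; the normalization and field names are illustrative assumptions rather than a prescribed tool.

```python
# Sketch of the overlap analysis between LLM-cited references and ___domain key
# publications; the DOIs and field names are illustrative assumptions.
def normalize_doi(doi: str) -> str:
    return doi.strip().lower().removeprefix("https://doi.org/")

def citation_coverage(llm_refs: list[str], key_pubs: list[str]) -> dict:
    """Fraction of ___domain key publications directly cited by the generated review."""
    cited = {normalize_doi(d) for d in llm_refs}
    keys = {normalize_doi(d) for d in key_pubs}
    overlap = cited & keys
    return {
        "n_key_publications": len(keys),
        "n_directly_cited": len(overlap),
        "coverage": len(overlap) / len(keys) if keys else 0.0,
        "missing": sorted(keys - cited),  # candidates for manual screening
    }

# Example with made-up DOIs:
report = citation_coverage(
    llm_refs=["10.1000/abc123", "https://doi.org/10.1000/xyz789"],
    key_pubs=["10.1000/abc123", "10.1000/key456"],
)
print(report["coverage"])  # 0.5
```

The "missing" list feeds the third step: a reviewer can check whether those key publications are cited indirectly or are genuinely absent.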
The articles generated from the themes of the four journals exhibit significant differences. Taking overall quality as an example, The Lancet demonstrates better overall article quality. Specifically, this higher quality is reflected in several subjective indicators, including language quality (80.000 [60.000, 82.000]), depth of reference evaluation (74.000 [60.000, 78.000]), logical coherence (78.000 [70.000, 81.000]), and degree of innovation (70.000 [60.000, 75.000]). Further analysis of these differences reveals that The Lancet tends to focus more on topics exploring disease mechanisms, future outlooks, or innovations in its selected themes. In comparison, JAMA emphasizes clinical practice and diagnostic treatment in its extracted themes, BMJ focuses on summarizing the latest treatments, and NEJM emphasizes the latest research and details related to diseases. After removing these specific themes for reanalysis (deleted themes include: “The future of cystic fibrosis therapy: from disease mechanisms to novel,” “Dilated cardiomyopathy: etiology, mechanisms, and current and future approaches to treatment,” and “Innovation in infection prevention and control—revisiting Pasteur’s vision”), the overall quality no longer showed statistically significant differences (p = 0.105). This reflects biases in the platform and the fine-tuning of proprietary models during the training process of LLMs, potentially due to the inclusion of more training data that discusses disease mechanisms, future outlooks, or innovation-related topics. This phenomenon is also evident in the differences in the distribution of references mentioned in the study and the subsequent variations observed across different clinical fields or human systems.
The outline method and the objective method exhibit certain differences. Since the outline method involves multiple rounds of generation, its increases in word count, number of paragraphs, and number of citations are easy to understand. From the perspective of the subjective quality dimensions, articles generated using the outline method generally exhibit higher overall quality. This may be because outlines help better organize the structure of the article, making the content more logically clear and coherent and thereby improving aspects such as language quality and depth of citation evaluation. Under the outline method, topic coverage tends to be addressed at a more detailed level. For example, when discussing oral cancer, the objective method presents a prompt that introduces treatments for oral cancer, whereas the outline method delves into more specific aspects, such as surgical treatments for oral cancer. This reflects the importance of more detailed prompts in the generation process. However, it is important to note that this approach may also lead to the observed bias in citation distribution due to insufficient coverage in the training data. Therefore, to some extent, the outline method, as a temporary tool to address differences in word count compared with manually written articles, is not suitable for long-term use.
Although these review generation platforms and the current detection systems both have clear limitations, our research provides insights for researchers, AI startups, and publishers. Firstly, for medical researchers, AI tools offer a means of quickly grasping the content of their field. Our study supports previous recommendations that AI tools can serve as a reliable resource to facilitate research and learning. In the past, there were concerns about hallucinations in LLMs causing false citations, which was also found in previous studies using pure LLMs26. However, this issue has been resolved in several platforms we tested. For instance, AI-generated reviews can provide accurate references, as demonstrated by the AI platform Seamless, which achieves near 100% citation accuracy. Additionally, the proportion of references from the past year in its clinical reviews is significantly higher than in those written by humans. Secondly, for AI startups, a key consideration is how to adapt their models to various clinical research fields. In our study, the references in AI clinical reviews had lower impact factors and CiteScores, although their average and cumulative citation counts did not differ significantly from those of reviews written by humans. However, foundational and significant articles typically appear in well-known journals. Therefore, startups need to consider negotiating the purchase of specific high-quality top-journal articles to balance the cost of journal subscriptions, rather than acquiring an entire publisher’s copyright usage, alleviating financial burdens while building a higher-quality literature repository. Meanwhile, our research found significant differences in the models’ ability to review specific human systems and journal topics, which might be due to a lack of fine-tuning or to significant biases in the datasets used for fine-tuning. Startups could involve frontline physicians from diverse clinical backgrounds in cross-disciplinary efforts to develop high-quality proprietary datasets of reviews for specific human systems for fine-tuning27. Furthermore, AI companies need to address the problem of generating overly short articles, which can severely impact article quality and cause key references to be overlooked. Although using an outline approach to generate longer outputs partially mitigates this limitation, it is not an effective and enduring solution. Hence, the solutions proposed in existing studies to overcome character limitations in outputs warrant consideration and further exploration28,29,30. In the past, publishers attempted to detect or check for duplication in LLM articles in a manner similar to GPTZero. However, we must acknowledge that LLMs are advancing rapidly, altering sentence structures, and becoming more unpredictable in numerous ways. Sadasivan et al. assert that as language models become more complex and adept at mimicking human text, even the best detectors’ effectiveness will significantly diminish31. This is evident from OpenAI’s decision to withdraw its own detector32; publishers should therefore recognize that we cannot evaluate papers the same way we used to.
Our research has several strengths. First, collaboration across multiple interdisciplinary fields, such as computer science and medicine, enabled us to comprehensively evaluate different clinical reviews across four major dimensions. In the past, manually gathering citation distributions and paragraph statistics for individual articles was laborious and time-consuming because of the lack of ready-to-use batch extraction tools, and evaluating clinical reviews from different directions is also very challenging. Second, the multidimensional evaluation incorporates subjective indicators as well as dimensions such as reference distribution, reference quality, and academic publishing risk, which were generally lacking in previous studies. Finally, the selection of top-quality articles from the four most recognized medical journals ensures the robustness of the comparison results and the representativeness of the benchmark. Despite these efforts, our research still has some limitations. On the one hand, we selected representative, internationally accessible review generation platforms within a feasible workload, which may not cover region-specific generation platforms promoted through advertisements or future updated LLMs; the interpretation of results should therefore be approached with caution. Meanwhile, because there is no universally accepted standard for clinical reviews of average quality, our study did not explore the gap between AI-generated clinical reviews and human-authored reviews of average quality. On the other hand, since experts with different research backgrounds in the same field may come from many countries, the participation of Chinese and British experts might not fully cover all directions, especially topics in less prominent areas. Caution is therefore needed when applying the results of this study to less prominent topics outside the scope of the research.
Future research should consider both the iterative development of LLM-generated reviews and that of detection systems. Firstly, regarding review generation by LLMs, although the current resolution of reference reliability is encouraging, the overall quality of the articles still poses significant problems for fully serving clinical purposes. A noticeable problem with the current models of major platforms is that review generation is overly adapted to the commercial deployment of LLMs: despite the introduction of human knowledge bases and other auxiliary technologies, the ultimate goal remains a one-time unified output by the LLM. However, this generation method does not align with traditional review writing practices. Traditionally, after selecting a topic, a search strategy is constructed, involving database searches and explicit inclusion and exclusion criteria; articles are then excluded based on titles and abstracts, and the remaining articles are read and summarized33. This traditional research model is entirely worth learning from and emulating. It is possible to have LLMs serve several segments of this human-centered process rather than altering traditional processes to fit the LLMs. We randomly selected several topics related to the human systems involved in this study for a simple post-hoc exploratory study. The improvement in results was significant: the reliability of the cited references was ensured while the overall quality of the review was greatly enhanced, with no statistical differences in paragraph and word count compared to the original text (see details in Supplementary Note 4). On the other hand, because of the rapid iteration of models in the current LLM field, the performance of different models on the same task varies. Although this article incorporates the latest models available at the time for exploratory research, future studies should not only focus on this aspect but also explore more efficient generation frameworks that enable swift integration of the latest models, rather than repeatedly conducting similar experiments. Additionally, regarding detection systems, a new detection framework urgently needs to be established. In the past, detection systems were placed in opposition to LLMs, akin to the captcha wars used to block web crawlers34. More and more academic publishers and professional detection agencies are launching their own “AI checking rates”35. We seem to have fallen into the misconception that academic papers must be completed by humans, or mostly by humans, to be recognized. This overly profit-driven mindset overlooks the fact that every article, regardless of its author, should be valued if its conclusions are reliable and contribute to life. LLMs offer a new perspective here: they can help complete review articles and potentially assess the value of a review in advancing the field or its educational value for future researchers. From a non-profit perspective, future work should leverage LLMs to establish detection systems based on a paper’s contribution to its specific field and to everyday life, rather than focusing excessively on whether it was completed by AI. This should become a focal point of future research.
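As one illustration of letting an LLM serve a single segment of the traditional workflow, the sketch below shows title-and-abstract screening against predefined inclusion and exclusion criteria. The criteria, prompt wording, model name, and OpenAI SDK call are assumptions for illustration; this is not the post-hoc pipeline evaluated in Supplementary Note 4.

```python
# Minimal sketch of letting an LLM handle one segment of the traditional
# workflow (title/abstract screening). The criteria, prompt, and model name
# are illustrative assumptions, not the pipeline tested in Supplementary Note 4.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment
criteria = (
    "Include: randomized or observational studies on adult heart failure management. "
    "Exclude: case reports, animal studies, non-English articles."
)

def screen(title: str, abstract: str) -> bool:
    """Ask the model for an include/exclude decision; humans still verify."""
    prompt = (
        f"Inclusion/exclusion criteria:\n{criteria}\n\n"
        f"Title: {title}\nAbstract: {abstract}\n\n"
        "Answer with exactly one word: INCLUDE or EXCLUDE."
    )
    reply = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model name
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return reply.choices[0].message.content.strip().upper().startswith("INCLUDE")
```

In this arrangement the LLM accelerates one well-bounded step while human reviewers retain control of the search strategy, final inclusion decisions, and synthesis.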
Methods
This comparison study is part of a collaborative project involving North Sichuan Medical College, the University of Electronic Science and Technology, and the University of Glasgow. The study was reviewed by the Institutional Review Board of North Sichuan Medical College; because it does not involve any human subjects or tissue materials, no additional ethical approval was required. Prior to the start of the study, nine experts with backgrounds in different human systems (see details in Supplementary Table 1) formed the expert evaluation panel. Each expert had extensive clinical research experience in their respective field, with a median of 12.5 years, ranging from 2 to 18 years. In addition, a linguist was responsible for the follow-up language-related discussions, and two computer science PhDs were responsible for producing the software used in the study. The overall design of the study is illustrated in Fig. 6.
Generation sources
We narrowed down the scope and ultimately determined the source journals for topic generation through the following three stages.
First, we attempted to define a basic classification of journals into low-quality, medium-quality, and high-quality. Unfortunately, there is no unified standard today for distinguishing low-quality from medium-quality journals. However, the field of medicine has numerous widely recognized top-tier journals that publish high-quality articles, including Nature, Science, The New England Journal of Medicine (NEJM), The Lancet, and npj Digital Medicine, and most articles in these journals are generally considered high quality by the academic community. Therefore, we decided to make our selection from these top-tier journals.
Second, given the numerous recognized top-tier journals, comprehensively evaluating each journal is impractical. Thus, we selected the top 10 journals ranked by their 2023 impact factor (https://jcr.clarivate.com/jcr/) for our study. To ensure a comprehensive selection of topics, we chose clinical comprehensive journals.
Third, to enhance the reading experience, we preferred journals with a long-standing history.
Finally, NEJM, The Lancet, the British Medical Journal (BMJ), and the Journal of the American Medical Association (JAMA)36, all of which were founded in the 19th century and are clinical comprehensive journals, were selected to be showcased in the main text. Moreover, to ensure the comprehensiveness of the results, the same research was conducted on the other top-10-ranked clinical comprehensive journals, including Nature Reviews Disease Primers and its related journal Nature Medicine (see details in Supplementary Table 5).
Generation topics
The expert panel selected clinical reviews from the top four medical journals (TFMJ). The selection criteria were as follows: 1) The reviews were clinical reviews. 2) At least one member of the expert panel had previously been involved in research similar to the topic under review. 3) The publication date was within the last 5 years.
The quality assessment was conducted on the selected clinical reviews (see details in Supplementary Table 3). Clinical reviews were ultimately included in this study when the average score for each individual quality dimension, namely language quality, reference evaluation depth, logical capability, and innovation degree, was greater than 90 points.
A total of 62 clinical reviews were ultimately included (see details in Supplementary Table 2).
Generation platforms and models
We generated relevant reviews based on review generation platforms and the current mainstream models.
On one hand, the selection of review generation platforms followed the general logic of user usage: 1) choosing a search engine, 2) entering search queries, and 3) conducting searches to select usable platforms. First, on January 10, 2024, we visited StatCounter37, a global website for search engine market share statistics, to determine commonly used search engines and selected the top five: Google38, Bing39, Yandex40, Yahoo!41, and Baidu42 (the sixth-ranked share labeled as “Other” does not represent a single engine and was not selected). Subsequently, we designed search queries referencing PubMed’s43 MeSH terms and Embase’s44 Emtree terms related to LLMs. The search queries included terms such as “clinical literature review generator,” “clinical literature review generation,” and “clinical review large language model.” Finally, to avoid potential selection bias caused by commercial promotion, the screening scope was appropriately expanded. Specifically, using the search queries, we reviewed every item on the first five pages of search results for the international versions of these five search engines. The selection criteria were any tool-based platforms capable of generating medical reviews, with no language restrictions. To ensure the timeliness of the research, we conducted an updated second round search on January 7, 2025. A total of 4059 entries were ultimately retrieved across all platforms, with 9-15 entries per page (see Table 1 for details). After cross-screening and validation by three researchers, seven currently available review generation platforms were selected: Seamless45, Paper Digest46, AskYourPdf47, Easy-Peasy.AI48, HyperWrite49, LitReview50, and Template.net51. It is important to note that currently, no dedicated platforms for generating clinical reviews have been identified; they are all general review generation platforms. In terms of their respective characteristics, Seamless offers GPT-3.5 and GPT-4.0 as the foundation for generation52. AskYourPdf provides generation services through its website or the OpenAI application store53. Easy-Peasy.AI offers GPT-4.0 as the foundation for generation. These platforms represent nearly all internationally accessible and usable review generation platforms. We categorized these platforms into nine corresponding generation sources: Seamless-1 (Seamless using GPT-3.5), Seamless-2 (Seamless using GPT-4.0), AskYourPdf-1 (AskYourPdf using the official website), AskYourPdf-2 (AskYourPdf using the OpenAI application store), and Paper Digest from the first round of selection, as well as Easy-Peasy.AI (Easy-Peasy.AI using GPT-4.0), HyperWrite, LitReview, and Template.net from the second round. Among them, LitReview and Template.net are free platforms, while the others are paid or partially paid. Additionally, although some platforms label their model version numbers, these platforms have not updated their model versions as of now.
On the other hand, to address the issue of outdated ___domain relevance caused by older platform models, mainstream LLMs were included for review generation. First, we visited the top-ranking lists of Hugging Face54 and OpenCompass55, selecting the top 10 open-source and closed-source models as candidates, with the scope extending to December 2024. Then, the selection of relevant LLMs was considered from the following aspects: 1) Closed-source commercial models: To accommodate the usage habits of both general and subscription-based users, we selected models with different user orientations from the same company, such as GPT-4o for subscription users and GPT-4o-mini for general users. Additionally, to avoid selection bias toward a single company, we chose models from several mainstream companies, including Anthropic, OpenAI, and Google. 2) Open-source models: To cater to enterprises with varying hardware capabilities, we selected models with different user orientations from the same company, such as Llama-3.1-8B for general enterprises and Llama-3.1-405B for enterprises with better hardware. Additionally, we selected the specialized medical open-source model Palmyra-Med-70B as a professional candidate. 3) Expert consensus-oriented models: Based on recommendations from review experts, we adopted Bloom and Galactica as supplementary models for specialized use cases. The final selected models included GPT-4o, Claude-3.5-Sonnet, Claude-3.5-Haiku, Gemini-1.5-Pro, GPT-4o-mini, Llama-3.1-405B, Llama-3.1-8B, o1-mini, o1-preview, Palmyra-Med-70B, Qwen2.5-72B, Bloom, and Galactica-120B. Their respective version numbers and usage documentation are provided in Table 2.
All models and platforms were invoked using default parameters, except for the maximum token count, which was set to the maximum to avoid truncation. The influence of parameter adjustments is discussed in the supplementary research displayed in the article discussion section.
Generation prompts
Regarding prompts, since Seamless, Paper Digest, Easy-Peasy.AI, HyperWrite, LitReview, Template.net, and AskYourPdf-1 do not accept custom prompts and only accept direct review topics or purposes, a specialized prompt was designed for AskYourPdf-2, GPT-4o, Claude-3.5-Sonnet, Claude-3.5-Haiku, Gemini-1.5-Pro, GPT-4o-mini, Llama-3.1-405B, Llama-3.1-8B, o1-mini, o1-preview, Palmyra-Med-70B, Qwen2.5-72B, Bloom, and Galactica. Before the study began, ZNL, DRW, JBX, YSP, and YXR constructed the prompts. ZNL is a certified Prompt Engineer at Datawhale (https://github.com/datawhalechina), with certification number DWPE011528. DRW holds positions at multiple computing centers, while JBX, YSP, and YXR are attending physicians in the expert group. Specifically, we constructed prompts with a focus on logic, stability, and optimal performance release.
ZNL and DRW identified an initial set of prompts and selected five articles from the topics to be generated for prompt testing. The principles for constructing prompts have been explained elsewhere56. Each prompt was input three times to produce repeated outputs. The average of the expert evaluations formed the score for each output, and the consistency of the three repeated outputs was tested. From this set of prompts, the one with the best overall properties was selected and optimized multiple times, repeating the above process until the score difference between the clinical review generated by the optimized prompt and the original review was less than 5 points on several quantitative subjective indicators. Our definitions of logicality, stability, and optimal performance release are as follows: 1) Regarding logicality, we focus on the logical rationality of applying this technology to actual clinical environments in the future. Specifically, from the perspective of prompt development, further elaborating on the objectives or deeply explaining the current focus of topics related to these objectives might enable LLMs to perform better. However, from a clinical logic standpoint, individuals conducting clinical reviews may not always have scholars proficient in the field available to guide or assist them. Therefore, to better align with future clinical scenarios, the extracted objectives were not given further guidance. 2) Concerning stability, given the inherent uncertainty in LLM outputs, it is critically important to ensure that different outputs yield results that are identical or vary minimally (within a five-point margin), so as to guarantee stable output quality in a working environment. 3) With respect to optimal performance release, varying prompts have been shown to significantly affect performance on medical tasks57. We aimed to select a prompt that maximally enhances performance through adjustment of the prompt itself. Although testing every conceivable prompt is impossible, we referenced Nature’s research on LLMs58. Because these generative platforms will be updated in the future, the prompts we tested may not transfer entirely to new models. Additionally, in clinical settings, clinical scholars are unlikely to immediately find the best prompt; rather, they seek a prompt with relatively good performance. Considering these factors, we developed and tested prompts under conditions closely aligned with real-world scenarios. Examples of iterative results for prompts on a single model or platform are shown in Table 3. More details can be found in Supplementary Note 3.
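The iteration loop described above can be summarized structurally as in the sketch below. This is only an illustration of the control flow: generate, expert_score, and refine_prompt are stubs standing in for the platform call, the panel's manual scoring, and the manual prompt revision, none of which were automated in the study.

```python
# Structural sketch of the prompt-iteration loop. The three stub functions
# stand in for the platform/LLM call, the expert panel's manual scoring, and
# the manual prompt revision used in the study.
import random
import statistics

def generate(prompt: str, topic: str) -> str:       # stub: platform/LLM call
    return f"{prompt} -> review on {topic}"

def expert_score(review: str) -> float:             # stub: manual panel scoring
    return random.uniform(70, 100)

def refine_prompt(prompt: str) -> str:              # stub: manual prompt revision
    return prompt + " (revised)"

def iterate_prompt(prompt, topics, human_scores, margin=5.0, max_rounds=10):
    for _ in range(max_rounds):
        gaps, stable = [], True
        for topic in topics:
            scores = [expert_score(generate(prompt, topic)) for _ in range(3)]
            if max(scores) - min(scores) > margin:   # stability check on repeats
                stable = False
                break
            gaps.append(abs(statistics.mean(scores) - human_scores[topic]))
        if stable and gaps and all(g < margin for g in gaps):
            return prompt                            # within 5 points of human reviews
        prompt = refine_prompt(prompt)
    return prompt
```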
Generation methods
To mitigate bias from significantly shorter articles, clinical review generation employed two methods: 1) the objective method, in which well-trained research assistants manually extracted the research objectives of each study and input them into the platform to generate the final text; and 2) the outline method, in which well-trained assistants manually extracted the objectives of each study and input them into the predetermined best framework generation model (see details in Supplementary Note 1), Gemini Pro59, using the prompt “Here are the research objectives for the clinical review I am writing, please help me develop an outline.” Each outline title was then input into the platform separately, and the outputs were finally concatenated to form the test article. Meanwhile, to comply with the usage licenses of the LLM companies, OpenAI’s and Google’s services were operated on UK servers by DRW of the University of Glasgow using a private secure service within Citrix Workspace.
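A structural sketch of the outline method is shown below. The outline_model and review_platform functions are stubs standing in for Gemini Pro and the generation platform under test; the outline text and its parsing are illustrative assumptions.

```python
# Structural sketch of the outline method. outline_model() and review_platform()
# are stubs standing in for Gemini Pro and the generation platform under test.
import re

def outline_model(prompt: str) -> str:              # stub: Gemini Pro call
    return "1. Background\n2. Current treatments\n3. Future directions"

def review_platform(section_title: str) -> str:     # stub: platform under test
    return f"Generated text for: {section_title}"

def outline_method(objective: str) -> str:
    prompt = ("Here are the research objectives for the clinical review I am "
              f"writing, please help me develop an outline.\n{objective}")
    outline = outline_model(prompt)
    titles = [t.strip() for t in re.split(r"\n\d+\.\s*", "\n" + outline) if t.strip()]
    # each outline title is fed to the platform separately, then concatenated
    sections = [review_platform(title) for title in titles]
    return "\n\n".join(sections)

print(outline_method("Summarize treatments for oral cancer"))
```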
Blind setting
The computer science PhDs developed a specialized evaluation system. After training on the hierarchical grading criteria, each expert received a system account and was randomly assigned a number of clinical reviews within their respective research direction. During the evaluation process, the experts were not informed of the source of these articles.
Multilevel evaluation measures
Before the start of the study, an exploration of indicator selection was conducted (see details in Supplementary Note 2). Our aim was to provide a relatively comprehensive evaluation of a review article using both subjective and objective indicators. The subjective indicators were determined through literature searches. Specifically, we selected the top 30 models and their corresponding company names from the OpenCompass model leaderboard, as well as relevant MeSH terms from PubMed and Emtree terms from Embase. Based on these, we constructed search strategies and retrieved articles related to LLM evaluation published in the two databases since the release of OpenAI’s ChatGPT-3.5. From these, we identified two existing short evaluation articles60,61 as the foundational references for our evaluation indicators. On this basis, we preliminarily developed a subjective indicator system consisting of five major components, each scored on a 100-point scale. For the objective indicators, we conducted an initial exploration of existing scales. After excluding some scales that were clearly incompatible with the article type, we invited linguists to design the objective indicators. These indicators also incorporated factors such as the impact factor and journal ranking, which have historically influenced article evaluation systems. Finally, certain attention was given to the risks associated with academic publishing. Since most publishers have already disclosed their plagiarism detection platform, iThenticate62, this platform was directly used for plagiarism rate detection. However, for the recently emerging AIGC detection rate, there is currently no universally recognized and publicly available benchmark system among publishers. Instead, online AIGC detection platforms were used as substitutes. The process of selecting AIGC detection platforms was similar to the process of choosing platforms for generating clinical reviews (see details in Supplementary Note 6) and will not be elaborated here. Ultimately, eight platforms—Scribbr63, Typeset.io60, GPTZero61, Grammarly64, SurferSEO65, Decopy.ai66, AIHumanize67, and GetMerlin68—were selected for AIGC detection. To control detection costs, 1-2 reviews demonstrating the best performance for each human system were sampled and submitted for testing from each generation platform or model.
The final evaluation criteria are as follows: 1) basic quality of the article: word count, paragraph count, number of references, comprehensiveness of references, authenticity of references, accuracy of references, language quality, depth of reference evaluation, logical ability, level of innovation, and overall quality; 2) distribution of references: the percentage of references published within 1 year, 3 years, 5 years, and 10 years of the article’s publication/generation among the total references; the percentage of references in JCR quartiles Q1, Q2, Q3, and Q4 among the total references; the percentage of references with impact factors of 0–3, 3–5, 5–10, and ≥10 among the total references; and the percentage of references with CiteScores of 0–3, 3–5, 5–10, and ≥10 among the total references; 3) quality of references: the cumulative citations of all references and the average number of citations per reference; and 4) academic publishing risk: plagiarism checks62 and AIGC checks. The definitions of each specific indicator are shown in Table 4.
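For illustration, the reference-distribution and reference-quality indicators can be computed from a table of reference metadata along the following lines (field names such as years_before_article, jcr_quartile, impact_factor, citescore, and citations are hypothetical; this is not the authors’ code):

```python
# Illustrative computation of the reference-distribution indicators (hypothetical field names).
def pct(refs: list[dict], predicate) -> float:
    """Percentage of references satisfying a predicate, out of all references."""
    return 100.0 * sum(predicate(r) for r in refs) / len(refs) if refs else 0.0

def band(value: float, bands=((0, 3), (3, 5), (5, 10))) -> str:
    """Map a metric (impact factor or CiteScore) to the bands 0-3, 3-5, 5-10, >=10."""
    for lo, hi in bands:
        if lo <= value < hi:
            return f"{lo}-{hi}"
    return ">=10"

def distribution(refs: list[dict]) -> dict[str, float]:
    out = {}
    for window in (1, 3, 5, 10):  # recency relative to the article's publication/generation date
        out[f"within_{window}y"] = pct(refs, lambda r, w=window: r["years_before_article"] <= w)
    for q in ("Q1", "Q2", "Q3", "Q4"):  # JCR quartiles
        out[f"jcr_{q}"] = pct(refs, lambda r, q=q: r["jcr_quartile"] == q)
    for label in ("0-3", "3-5", "5-10", ">=10"):  # impact-factor and CiteScore bands
        out[f"if_{label}"] = pct(refs, lambda r, b=label: band(r["impact_factor"]) == b)
        out[f"citescore_{label}"] = pct(refs, lambda r, b=label: band(r["citescore"]) == b)
    return out

def reference_quality(refs: list[dict]) -> dict[str, float]:
    """Cumulative citations of all references and the mean citations per reference."""
    total = sum(r["citations"] for r in refs)
    return {"cumulative_citations": total,
            "mean_citations_per_reference": total / len(refs) if refs else 0.0}
```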
Comparability control in reviews
To ensure the comparability of the final results, the study controlled for comparability in the following ways. First, experts with the same or similar research backgrounds were selected for evaluation: subjective indicators within the same article, including overall quality, language quality, depth of reference evaluation, logical reasoning, and degree of innovation, were assessed by experts in the expert group with the same or similar research experience. Second, each article was cross-assessed by at least two experts whose research backgrounds matched that of the manually written literature, and the average of their evaluations was taken as the final result for these items. Additionally, drawing on two previous studies69,70, we expanded their baseline sample size threefold and broadened the scope from a single topic to all human body systems to avoid heterogeneity caused by a single topic. Finally, general reviews with relatively fixed formats were selected for generation, avoiding biases introduced by methodological differences in specialized reviews such as meta-analyses.
Statistical analysis
Data analysis was conducted using SPSS (version 29.2.1.0 (171)). The overall analysis used the Mann‒Whitney U test. For subgroup analyses, skewed data were compared using the Kruskal‒Wallis H test, whereas normally distributed data were analyzed using analysis of variance (ANOVA). For multiple hypothesis testing, p values were adjusted with the Bonferroni correction. A p value less than 0.001 was considered statistically significant. In addition, the intraclass correlation coefficient (ICC) was used for consistency testing; to ensure the generalizability of the results, a two-way random-effects model was chosen. Because the evaluation of the same article could involve two or more experts, when the number of evaluators exceeded two, pairwise combinations of the experts’ evaluations were formed for consistency testing. A single-measure ICC greater than 0.8 was considered indicative of high consistency. Finally, in the presentation of the results, the overall analysis includes all results from both generation methods across all platforms and models, collectively labeled “Ai”.
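The original analysis was run in SPSS; as a rough, illustrative Python equivalent (assuming scipy, statsmodels, and pingouin, and a long-format table with hypothetical columns group, platform, score, article, and rater), the main tests could be reproduced along these lines:

```python
# Rough Python equivalent of the SPSS analysis (illustrative only; column names are hypothetical).
import pandas as pd
import pingouin as pg
from scipy import stats
from statsmodels.stats.multitest import multipletests

df = pd.read_csv("scores_long.csv")  # columns: group ('Human'/'Ai'), platform, score, article, rater

# Overall comparison: Mann-Whitney U test between human-authored and AI-generated reviews.
human = df.loc[df["group"] == "Human", "score"]
ai = df.loc[df["group"] == "Ai", "score"]
u_stat, p_overall = stats.mannwhitneyu(human, ai, alternative="two-sided")

# Subgroup comparison across more than two platforms/models: Kruskal-Wallis H test for skewed data.
subgroups = [g["score"].to_numpy() for _, g in df.groupby("platform")]
h_stat, p_kw = stats.kruskal(*subgroups)

# Bonferroni adjustment across the family of hypothesis tests (alpha = 0.001 in the study).
p_adjusted = multipletests([p_overall, p_kw], alpha=0.001, method="bonferroni")[1]

# Inter-rater consistency: two-way random-effects, single-measure ICC (ICC2 in pingouin).
icc = pg.intraclass_corr(data=df, targets="article", raters="rater",
                         ratings="score", nan_policy="omit")
icc2_single = icc.loc[icc["Type"] == "ICC2", "ICC"].item()  # >0.8 taken as high consistency
```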
Data availability
The reviews used in this article can be downloaded from Google Drive (https://drive.google.com/drive/folders/1hEinwiqcNHvIhi_NejiUf11-PLdpwWNL?usp=drive_link). Please note that before downloading, you should confirm that you have the necessary permissions to download and access articles from BMJ, JAMA, NEJM, Lancet, and Springer. We also provide step-by-step screenshots of the manual detection process for the AIGC detection rates (https://drive.google.com/drive/folders/1wZkH-srMsgYU-ehaE6rIqnAXOaw74j_l?usp=drive_link). Please note that because some researchers are based in China, the default browser may have translated webpages into Chinese during the screenshot process; this does not affect the original data. We are happy to provide language assistance; if you require it, please contact Zining Luo ([email protected]).
Code availability
All the code involved in this study was written in VB.NET (.NET Framework 4.8), and the original GUI was created in Chinese. It comprises two client applications and one server application deployed on a Windows Server system. The VB.NET code can be requested from the corresponding author, Xie Jiebin ([email protected]). To facilitate reproduction of the conclusions in this article, a Python version of the reproduction library and workflow is provided. Links to the platforms used in the article can be found in the URLs of the corresponding references, and the prompts involved are listed in the tables. The libraries used to invoke each model are: GPT series (https://github.com/openai/openai-cookbook), Claude series (https://github.com/anthropics/anthropic-cookbook), Gemini series (https://github.com/google-gemini/cookbook), Bloom (https://huggingface.co/bigscience/bloom), Qwen2.5-72B (https://huggingface.co/Qwen/Qwen2.5-72B-Instruct), Galactica-120B (https://huggingface.co/facebook/galactica-120b), Llama series (https://docs.llama-api.com/quickstart), and the Polaris framework (https://drive.google.com/drive/folders/1G5nxkV0LzJdCu7WEHycwg8fzc8QbsFWa?usp=sharing). Please note that we only accept requests for research purposes; commercial requests are currently not allowed. We do not provide paid API keys, but we are happy to offer technical and linguistic assistance. If you require related support, feel free to contact [email protected] and specify your requirements when making the request.
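For orientation only, a minimal example of invoking one of the listed model families (the GPT series) via the openai Python client is shown below; the model name and prompt are placeholders and do not reproduce the study’s generation pipeline.

```python
# Minimal illustration of invoking one of the listed model families (GPT series) via the
# openai Python client; the model name and prompt are placeholders, not the study's prompts.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Draft an outline for a clinical review on <topic>."}],
)
print(response.choices[0].message.content)
```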
References
Rita, G-M, Luca, S., Benjamin, M. S., Philipp, B. & Dmitry, K. The landscape of biomedical research. bioRxiv (2024).
Literature Review and Synthesis Implications on Healthcare Research, Practice, Policy, and Public Messaging. (Springer Publishing Company, New York, NY, 2022).
Thirunavukarasu, A. J. et al. Large language models in medicine. Nat. Med. 29, 1930–1940 (2023).
The New York Times. How ChatGPT Kicked Off an A.I. Arms Race. (https://www.nytimes.com/2023/02/03/technology/chatgpt-openai-artificial-intelligence.html) (2023).
Large Language Model Market Size, Share & Trends Analysis Report By Application (Customer Service, Content Generation), By Deployment, By Industry Vertical, By Region, And Segment Forecasts, 2024 - 2030. (https://www.grandviewresearch.com/industry-analysis/large-language-model-llm-market-report) (2024).
Zhiyao, R., Yibing, Z., Baosheng, Y., Liang, D. & Dacheng, T. Healthcare Copilot: Eliciting the Power of General LLMs for Medical Consultation. arXiv.org (2024).
Kathryn, G. B., Nicole, G. I., Ashley, G. O., Julien, O. T. & Andrew, G. R. Use of generative AI to identify helmet status among patients with micromobility-related injuries from unstructured clinical notes. JAMA Netw. Open 7, e2425981 (2024).
Junkai, L. et al. Agent Hospital: A Simulacrum of Hospital with Evolvable Medical Agents. arXiv.org (2024).
Dergaa, I., Chamari, K., Zmijewski, P. & Ben Saad, H. From human writing to artificial intelligence generated text: examining the prospects and potential threats of ChatGPT in academic writing. Biol. Sport 40, 615–622 (2023).
Tang, L. et al. Evaluating large language models on medical evidence summarization. NPJ Digit. Med. 6, 158 (2023).
Mugaanyi, J., Cai, L., Cheng, S., Lu, C. & Huang, J. Evaluation of large language model performance and reliability for citations and references in scholarly writing: cross-disciplinary study. J. Med. Internet Res. 26, e52935 (2024).
Soni, A., Arora, C., Kaushik, R. & Upadhyay, V. Evaluating the impact of data quality on machine learning model performance. J. Nonlinear Anal. Optim. 14, 13–18 (2023).
Walters, W. H. & Wilder, E. I. Fabrication and errors in the bibliographic citations generated by ChatGPT. Sci. Rep. 13, 14045 (2023).
Shubham, A., Issam, H. L., Laurent, C. & Christopher, P. LitLLM: A Toolkit for Scientific Literature Review. arXiv.org (2024).
Haoyi, X. et al. When Search Engine Services meet Large Language Models: Visions and Challenges. arXiv.org (2024).
Yousif, M. J. Systematic review of semantic analysis methods. Appl. Comput. J. 286-300 (2023).
Shuai, W., Harrisen, S., Bevan, K. & Guido, Z. Can ChatGPT Write a Good Boolean Query for Systematic Review Literature Search? arXiv.org (2023).
John, D. et al. Structured information extraction from scientific text with large language models. Nat. Commun. 15, 1418 (2024).
Koller, D. et al. Why we support and encourage the use of large language models in NEJM AI submissions. NEJM AI 1 (2024).
Zhang, M., Wu, L., Yang, T., Zhu, B. & Liu, Y. Retracted: the three-dimensional porous mesh structure of Cu-based metal-organic-framework - Aramid cellulose separator enhances the electrochemical performance of lithium metal anode batteries. Surf. Interfaces 46, 104081 (2024).
John, H., Christopher, D. M. & Percy, L. Truncation sampling as language model desmoothing. arXiv.org (2022).
Hughes, J. Size matters (or should) in copyright law. Soc. Sci. Res. Netw. 74, 575 (2005).
Sudhi, S. & Young, M. L. Challenges with developing and deploying AI models and applications in industrial systems. Discov. Artif. Intell. 4, 55 (2024).
Tian, E. Perplexity, burstiness, and statistical AI detection. (https://gptzero.me/news/perplexity-and-burstiness-what-is-it/) (2023).
Meo, S. A. & Talha, M. Turnitin: is it a text matching or plagiarism detection tool? Saudi J. Anaesth. 13, S48–S51 (2019).
Junyi, L. et al. The dawn after the dark: an empirical study on factuality hallucination in large language models. arXiv.org (2024).
Saini, T. How does a dataset affects performance of AI Model? (2023).
Jack, W. R., Anna, P., Siddhant, M. J. & Timothy, P. L. Compressive transformers for long-range sequence modelling. arXiv.org (2019).
Saurav, P. et al. The what, why, and how of context length extension techniques in large language models -- a detailed survey. arXiv.org (2024).
Burtsev, M. The working limitations of large language models (https://readwise.io/reader/shared/01hh2cwcjt53r4er2vpzn7sy92/) (2023).
Vinu, S. S., Aounon, K., Sriram, B., Wenxiao, W. & Soheil, F. Can AI-Generated Text be Reliably Detected? arXiv.org (2023).
Forlini, E. D. OpenAI quietly shuts down AI text-detection tool over inaccuracies. (https://www.pcmag.com/news/openai-quietly-shuts-down-ai-text-detection-tool-over-inaccuracies) (2023).
Iddagoda, M. T. & Flicker, L. Clinical systematic reviews–a brief overview. BMC Med. Res. Methodol. 23, 226 (2023).
Bentley, P. How AI finally won its war on CAPTCHA images? (https://www.sciencefocus.com/future-technology/ai-vs-captcha) (2024).
Staiman, A. Publishers, Don’t Use AI Detection Tools! (https://scholarlykitchen.sspnet.org/2023/09/14/publishers-dont-use-ai-detection-tools/) (2024).
Jinlin, W. et al. Chinese contribution to NEJM, Lancet, JAMA, and BMJ from 2011 to 2020: a 10-year bibliometric study. Ann. Transl. Med. 10, 505 (2021).
Statcounter. Search Engine Market Share Worldwide - Dec 2023 - Dec 2024 (https://gs.statcounter.com/search-engine-market-share) (2025).
Google. Google search engine. (https://www.google.com/) (2025).
Bing. Microsoft Bing (https://www.bing.com/) (2025).
Yandex search engine (https://yandex.com/) (2025).
Yahoo. Yahoo search engine (https://www.yahoo.com/) (2025).
Baidu. Baidu search engine (https://www.baidu.com/) (2025).
National Library of Medicine - National Center for Biotechnology Information (https://pubmed.ncbi.nlm.nih.gov/) (2025).
Elsevier. Welcome to Embase (https://www.embase.com/emtree) (2025).
Draft your Literature Review 100x faster with AI (https://seaml.es/science.html) (2024).
The platform to follow, search, review & rewrite scientific literature with no hallucinations (https://www.paperdigest.org/) (2024).
AI Literature Review Writer Tool (https://askyourpdf.com/tools/literature-review-writer) (2024).
Easy-Peasy.AI-Literature Review Generator (https://easy-peasy.ai/presets/literature-review-generator) (2025).
HyperWrite. AI Literature Review Generator (https://www.hyperwriteai.com/aitools/ai-literature-review-generator) (2025).
Litreview. Academic Review Generator (https://litreview.slideai.net/academic-review-generator) (2025).
Template.net-Super Charge with AI (https://www.template.net/ai-literature-review-generator) (2025).
Introducing ChatGPT (https://openai.com/index/chatgpt/) (2024).
Introducing the GPT Store (https://openai.com/blog/introducing-the-gpt-store/) (2024).
Hugging Face is way more fun with friends and colleagues! (https://huggingface.co/collections) (2025).
OpenCompass. CompassRank is dedicated to exploring the most advanced language and visual models, offering a comprehensive, objective, and neutral evaluation reference for the industry and research. (https://rank.opencompass.org.cn/home) (2025).
Liu, P. et al. Pre-train, prompt, and predict: a systematic survey of prompting methods in natural language processing. ACM Comput. Surv. 55, 1–35 (2023).
Yan, S. et al. Prompt engineering on leveraging large language models in generating response to InBasket messages. J. Am. Med. Inform. Assn. 31, 2263–2270 (2024).
Lexin, Z. et al. Larger and more instructable language models become less reliable. Nature 634, 61–68 (2024).
Gemini Models (https://deepmind.google/technologies/gemini/) (2024).
SCISPACE. Academic AI Detector. Catch GPT-4, ChatGPT, Jasper, and any AI in scholarly content (https://typeset.io/ai-detector) (2025).
GPTZero. More than an AI detector. Preserve what’s human. (https://gptzero.me/) (2025).
Check for similarity with the tool trusted by the world’s leading publishers, researchers, and scholars. (https://www.ithenticate.com/) (2024).
Scribbr. Free AI Detector. Identify AI-generated content, including ChatGPT and Copilot, with Scribbr’s free AI detector. (https://www.scribbr.com/ai-detector/) (2025).
Grammarly. AI Detector by Grammarly. Navigate responsible AI use with our AI checker, trained to identify AI-generated text. A clear score shows how much of your work appears to be written with AI so you can submit it with peace of mind. (https://www.grammarly.com/ai-detector) (2025).
Surferseo. Free AI Detector (https://surferseo.com/ai-content-detector/) (2025).
Decopy.ai. Free AI Content Detector (https://decopy.ai/ai-content-detector/) (2025).
Aihumanize. Detect content from AI writing tools like ChatGPT (https://aihumanize.com/ai-detector/) (2025).
Getmerlin. Free AI Detector & AI Checker Tool With AI Humanizer (https://www.getmerlin.in/ai-detection) (2025).
Christopher, L. W. et al. Addition of dexamethasone to prolong peripheral nerve blocks: a ChatGPT-created narrative review. Reg. Anesth. Pain Med. 49, 777–781 (2023).
Choueka, D., Tabakin, A. L. & Shalom, D. F. ChatGPT in urogynecology research: novel or not? Urogynecology 30, 962–967 (2024).
Acknowledgements
We would like to express our heartfelt gratitude to Mr. Guangzhi Luo, who contributed significantly to the overall design of this research in its early stages and who, sadly, passed away due to illness on the evening of September 27th at 9:15 PM. We are deeply thankful for the guidance he provided us as a mentor. During the revision of the article, Z.N.L. is deeply grateful to XXY for her companionship during his emotional low points. May I ask, would you be my girlfriend? We also gratefully acknowledge funding from the Nanchong City-University Cooperation Project (Grant No. 22XQT0309), the Doctoral Research Start-up Fund of North Sichuan Medical College (Grant No. CBY22-QDA15), the Affiliated Hospital of North Sichuan Medical College (Grant No. 2022LC005), and the Key Cultivation Project of North Sichuan Medical College (Grant No. CBY23-ZDA10).
Author information
Authors and Affiliations
Contributions
All authors conceived and designed the study. Z.N.L. and D.R.W. developed all the software and algorithms involved in the research. Y.Q., X.Y.X., X.Y.L., M.Y.X., A.J.K., and Z.N.L. were involved in the collection and preparation of all datasets during the fine-tuning process. Z.N.L., J.B.X., Y.X.R., Y.L., Y.S.P., D.C.L., X.F.D., X.X., S.J.X., and Z.L.L. participated in the evaluation of expert indicators in the formal study. Z.N.L. and Z.Y.L. analyzed and validated all the indicators on the mobile phone. A.M.H., Z.N.L., J.B.X., and Y.X.R. had access to all the data in the study and drafted the manuscript.
Corresponding authors
Ethics declarations
Competing interests
The authors declare no competing interests.
Additional information
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Supplementary information
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.
About this article
Cite this article
Luo, Z., Qiao, Y., Xu, X. et al. Cross sectional pilot study on clinical review generation using large language models. npj Digit. Med. 8, 170 (2025). https://doi.org/10.1038/s41746-025-01535-z