Abstract
The exponential growth of scientific articles has presented challenges in information organization and extraction. Automation is urgently needed to streamline literature reviews and enhance insight extraction. We explore the potential of Large Language Models (LLMs), including OpenAI’s GPT-4.0, MistralAI’s Mixtral 8 × 7B, 01AI’s Yi, and InternLM’s InternLM2, for extracting key-insights from scientific articles. We have developed an LLM-based, article-level key-insight extraction system, which we call ArticleLLM. After evaluating the LLMs against manual benchmarks, we enhance their performance through fine-tuning. We further propose a multi-actor LLM approach that merges the strengths of multiple fine-tuned LLMs to improve overall key-insight extraction performance. This work demonstrates not only the feasibility of LLMs for key-insight extraction, but also the effectiveness of cooperation among multiple fine-tuned LLMs, enabling efficient academic literature surveys and knowledge discovery.
*Correspondence: [email protected]; Tel.: +82-10-3254-9260.
Introduction
The exponential growth of scientific articles has made it increasingly difficult to organize, acquire, and synthesize academic information. According to the investigation by Bornmann et al.1, the overall growth rate of scientific articles stands at 4.1% per year, with output doubling every 17 years. This rapid increase has intensified information overload, hindered the discovery of new insights, and contributed to the potential spread of false information. Although most scientific articles are published in a structured layout intended to speed up the understanding of their content, the core content of these articles remains unstructured text. This means that literature review is still a time-consuming task that requires manual involvement2. Therefore, it is necessary to assist the literature review process by automatically extracting key information from scientific articles.
The automation of key information extraction can be categorized into two classes: metadata extraction and key-insight extraction3. Metadata extraction encompasses retrieving fundamental attributes from scientific articles, including title, author names, publication year, publishing entity, abstract, and other pertinent foundational details. Researchers and digital repositories use metadata to determine the relevance of specific articles to their fields of interest or to facilitate search and filtering tasks. Key-insight extraction, in contrast, summarizes the content of the article, such as the problem to solve, the methodology used, evaluation methods, results, limitations, and future work. Automatically retrieving these insights provides researchers with clear and concise views of research articles and increases the efficiency of literature reviews.
Compared with metadata extraction, key-insight extraction is more challenging. The scholarly ___domain has reported a high degree of accuracy in metadata extraction4. However, no comparably effective solution exists for key-insight extraction, because existing approaches cover only parts of the article. Previous studies relied on machine learning technology to extract information at the phrase5 or sentence level6. Those approaches are limited in that such models struggle to capture complex context and semantics at the phrase or sentence level, leading to poor performance in capturing insights. Therefore, a key-insight extraction system is needed that operates at the section or article level rather than at the phrase or sentence level. Moreover, the scarcity of annotated training data, the variations among different domains, and the ongoing evolution of research paradigms impede the effectiveness of those models. To our knowledge, there is no effective solution for the automatic extraction of key-insights from scientific articles.
Large Language Models (LLMs) such as GPT-4.0 offer a way to address this challenge. LLMs represent a pioneering advancement in the field of natural language processing, characterized by colossal neural network architectures comprising billions to trillions of parameters. These LLMs have emerged as a forefront technology in contemporary artificial intelligence research and application, with transformative capabilities in text generation, comprehension, and processing. Bubeck et al.7 reported that GPT-4 has reached near-human performance on a variety of natural language tasks. This opens new opportunities to deeply understand article contents and extract key-insights from them. With their powerful contextual understanding and generation capabilities, LLMs may be able to better capture the details of article contents, enabling more accurate extraction of key-insights. However, no studies have evaluated the ability of LLMs in key-insight extraction.
This study aims to develop and evaluate an LLM-based system for extracting key-insights from scientific articles. We explore the effectiveness of various state-of-the-art LLMs and enhance their performance through fine-tuning with high-quality datasets. Additionally, we advance their capabilities by constructing a multi-actor system to further improve performance. Specifically, we first employ OpenAI’s GPT-4.0, MistralAI’s Mixtral 8 × 7B, 01AI’s Yi, and InternLM’s InternLM2 as candidate LLMs for extracting key-insights from scientific articles. As the next step, we evaluate the performance of each LLM on key-insight extraction tasks through manual evaluation. We find that the performance of GPT-4.0 is close to human level. However, it is too expensive to use GPT-4.0 for key-insight extraction on a large scale. Therefore, we use the output of GPT-4 as labels for fine-tuning other open-source LLMs so as to improve their performance. Because the performance of the fine-tuned LLMs still leaves room for improvement, we present a multi-actor method that merges all the key-insights extracted by multiple fine-tuned open-source LLMs, demonstrating a further advancement in the quality of key-insights. As a result, it is shown that we can extract key-insights at the article level using only multiple fine-tuned open-source LLMs.
Related works
Key-insight extraction
Key-insight extraction refers to identifying valuable research information contained in scientific articles. Current research extracts key-insights based on sentences6 or phrases5, treating specific sentences or phrases as key-insights. Sentence-level key-insight extraction can be viewed as a classification task, which assigns sentences to specific classes. Phrase-level key-insight extraction is more concerned with extracting phrases or fragments from the text. Key-insight extraction has mainly been based on machine learning technology, such as Bayesian classifiers8, Conditional Random Fields9, Support Vector Machines10, and Deep Neural Networks11. Most of those studies extract key-insights from article abstracts. However, those methods have two limitations:
1. The abstract does not necessarily represent all key-insights of the article; for example, the limitations of the research and future work may not appear in the abstract.
2. The extracted information may not be sufficient to capture the true meaning of the article. In many cases, the key-insights of an article require a broader synthesis of context than a textual summary provides.
The extensive understanding of various topics exhibited by LLMs, exemplified by GPT-4, has garnered significant attention across the scientific community. Comprehensive evaluations have been conducted to assess GPT-4’s performance across a multitude of natural language processing tasks12,13,14,15,16,17,18,19. The results indicate that GPT-4’s performance varies across different tasks. For instance, in such tasks as information retrieval17, information extraction12,18, and text summarization7, GPT-4 demonstrates superior performance to traditional models. This could be attributed to GPT-4’s training data encompassing diverse ___domain knowledge, enabling it to retrieve relevant information effectively from a wide range of languages and document types. However, in the task of relation extraction, GPT-4’s performance falls short of benchmark models. Han et al.14 reported that this discrepancy might be attributed to GPT-4’s limited understanding of subject-object relationships within relation extraction tasks.
In contrast to traditional methods that focus on paragraph-level and sentence-level key-insight extraction, an advantage of LLMs is their capacity to perform full-text-level key-insight extraction. This task can be considered a synthesis of text summarization and information extraction. Currently, the use of LLMs for key-insight extraction tasks lacks a systematic evaluation. Existing large datasets intended to evaluate LLMs typically assess their performance in areas such as information extraction20,21, text summarization22, and QA23. However, the task of key-insight extraction demands that models have the capability to extract and synthesize information from multiple perspectives, so the performance of LLMs in information extraction, text summarization, and QA cannot be directly equated to their effectiveness in key-insight extraction. It is therefore necessary to systematically evaluate the performance of LLMs on key-insight extraction tasks.
Fine-tuning of LLMs
The key behind achieving high performance of LLMs lies in the two main stages of the training process: (1) initial pre-training on massive text corpora, endowing the LLM with an expansive grasp of linguistic knowledge and structure; (2) fine-tuning on specific tasks to adapt to distinct domains and applications. This dual-training paradigm equips LLMs with unparalleled adaptability, rendering them instrumental in the modern landscape of natural language processing. Since initial pre-training of LLMs requires a large amount of hardware support, scholars tend to fine-tune the pre-trained LLMs to adapt to tasks in different fields24.
Supervised fine-tuning refers to the process of taking a pre-trained model and adapting it to a new task or dataset by making small adjustments to its parameters25. Compared with fine-tuning small-scale models, fine-tuning techniques for LLMs are more complex because of scalability and hardware performance issues. Fine-tuning methods for LLMs are often called PEFT (Parameter-Efficient Fine-Tuning). PEFT methods aim to tweak only a small fraction of the LLM’s parameters, thereby mitigating computational and memory costs. This approach allows for the efficient adaptation of LLMs to specific tasks without the need for extensive resources, making high-quality LLM personalization more accessible.
LoRA (Low-Rank Adaptation)26 is a PEFT technique that optimizes LLMs like GPT-3 by introducing trainable rank-decomposition matrices into their architecture, significantly reducing the number of adjusted parameters while maintaining or improving LLM performance. As a result, LoRA distinguishes itself by providing an optimal balance between LLM efficiency and performance enhancement, enabling more nuanced and targeted adjustments to LLMs without the substantial increase in trainable parameters typically associated with such refinements.
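To make the idea concrete, the sketch below (in NumPy, not the authors’ code) shows how a frozen weight matrix is augmented with a low-rank update and how the trainable parameter count shrinks; the layer size and rank are illustrative values.

```python
import numpy as np

# Minimal sketch of the LoRA idea: instead of updating a full weight matrix W (d x k),
# only a low-rank product B @ A of rank r is trained, while W stays frozen.
d, k, r = 4096, 4096, 16            # hypothetical layer size and LoRA rank
W = np.random.randn(d, k) * 0.02    # frozen pre-trained weights
A = np.random.randn(r, k) * 0.02    # trainable (r x k)
B = np.zeros((d, r))                # trainable (d x r), zero-initialized

x = np.random.randn(k)
h = W @ x + B @ (A @ x)             # forward pass: frozen path plus low-rank adaptation

# Trainable parameters drop from d*k to r*(d+k).
print(d * k, r * (d + k))           # 16,777,216 vs. 131,072
```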
Multi-actor approaches in LLM
The introduction of multi-actors in LLM systems for improving the performance of LLMs leverages the concept of ensemble learning by coordinating the efforts of multiple AI actors, each embodying an instance of an LLM such as GPT-4, to act in concert. Through such harmonious interaction, the actors combine their varied knowledge and contextual understanding, addressing intricate challenges with a level of efficiency and inventiveness that exceeds the scope of any single LLM. This transition from individual to collective AI endeavors reflects the notion that the aggregate output of an ensemble of actors far exceeds what they could achieve independently.
Extensive research has been conducted to enhance the performance of large language models (LLMs) for specialized tasks using multi-actor approaches, yet there is no consensus on the optimal way for these actors to collaborate. For instance, employing the Dawid-Skene model to iteratively optimize weights for each actor proves highly effective for tasks where labels are definite27. However, the multi-actor nature of such systems significantly increases computational demand28. To mitigate this, a novel routing architecture for multi-actor LLMs based on a reward model has been introduced, which presumes each sub-actor’s proficiency in specific tasks29. This method improves both efficiency and accuracy by allocating particular tasks to designated expert actors, similar in philosophy to the popular MoE (Mixture of Experts)30 model but with greater scalability because it avoids the hard-coding of expert models inherent in the MoE architecture.
In the sphere of practical implementation, the research by Hong et al.31 serves as a compelling illustration of the extraordinary capabilities of multi-actor systems. They have adeptly merged human-inspired Standard Operating Procedures (SOPs) with role specialization within an advanced meta-programming architecture, demonstrating how structured cooperation can enhance the performance of LLMs to unprecedented levels. Further, Liang et al.32 have advanced this area by creating a Multi-Agent Debate (MAD) framework, tailor-made to navigate the complexities of intricate reasoning challenges that confront LLMs. This framework provides a systematic arena for agents to participate in deliberative debates, thereby boosting the collective intellectual capacity of the ensemble of agents, showcasing the profound influence that well-orchestrated ensemble strategies can have in transcending the limitations of existing AI paradigms.
ArticleLLM
We propose a scientific-article key-insight extraction system, called ArticleLLM, built on a multi-actor ensemble of fine-tuned open-source LLMs. The key-insights we want to extract are the following: aim of study, motivation of study, problem to solve, method used for solution, evaluation metrics, findings, contributions, limitations, and future work.
Open-source LLMs used for fine-tuning
Compared to commercially available proprietary LLMs, fine-tuned and locally deployed open-source LLMs hold advantages in terms of data security, scalability, and cost-effectiveness33. Fine-tuning is a pivotal process that enables the LLM to perform better on specific tasks by adjusting the parameters of a pre-trained LLM to fit particular datasets or application contexts34. Fine-tuning is more resource- and time-efficient than training from scratch, as it leverages the general knowledge acquired by the pre-trained LLM.
In this study, we use MistralAI’s Mixtral, 01AI’s Yi, and InternLM’s InternLM2, all supported by Hugging Face’s Transformers framework35, as candidate open-source LLMs for fine-tuning, as shown in Table 1. These LLMs all rank high on the AlpacaEval leaderboard36 and are considered to have high potential for key-insight extraction. Since these LLMs are compatible with the Transformers framework, they can be fine-tuned with the same set of methods.
Fine-tuning algorithm
We utilize the instruction fine-tuning37 method to refine the LLMs. This method involves targeted fine-tuning of the foundational language model on specific instructions, thereby enhancing the LLM’s capability to comprehend and respond to user commands or inquiries. We fine-tune the models using Instruction-Response pairs, where the instructions serve as input data and the responses are treated as labels. During the fine-tuning phase, the optimization algorithm modifies the LLM’s parameters by calculating the loss function between the predicted outputs and the actual labels.
To efficiently fine-tune LLMs with minimal performance loss, we employ a 4-bit precision loading approach using the GPTQ algorithm38. GPTQ significantly reduces the LLMs’ memory and computational requirements by compressing the weight representation from 32 bits down to just 4 bits. This reduction not only minimizes the LLM’s size but also speeds up its operation.
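As a concrete illustration, the following sketch loads a 4-bit GPTQ checkpoint with Hugging Face Transformers; the repository name is a placeholder and the snippet assumes a GPTQ-capable backend (e.g., AutoGPTQ/Optimum) is installed, so it may differ from the authors’ exact setup.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder repository of a 4-bit GPTQ-quantized model (not necessarily the one used here).
model_id = "TheBloke/Mixtral-8x7B-Instruct-v0.1-GPTQ"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",   # spread layers across the available GPUs
)
```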
During the fine-tuning stage, we apply LoRA26 optimization to adjust the pre-trained LLM for specific tasks efficiently. LoRA operates by selectively modifying a subset of the LLM’s weights through low-rank matrices, maintaining the original structure while enabling swift fine-tuning. This method allows for precise adjustment without retraining the full network. We adopt AdamW39 for optimization, which adapts learning rates based on gradient moments, ensuring a balance between task-specific performance enhancement and generalizability. We use the default cross-entropy loss function because it effectively handles the multi-class label prediction problem and ensures the accuracy of the probability distribution of the model output. We set the number of training epochs to 1 to avoid overfitting. The parameters governing the LoRA adaptation and the overall fine-tuning process are carefully selected, as shown in Table 2, to balance refining the LLM’s performance on specific tasks against maintaining its generalized capabilities.
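A minimal sketch of this configuration with the PEFT library is shown below; the rank, alpha, dropout, learning rate, and batch size are illustrative placeholders rather than the values listed in Table 2.

```python
from peft import LoraConfig, get_peft_model
from transformers import TrainingArguments

lora_config = LoraConfig(
    r=16,                                  # rank of the decomposition matrices (assumed value)
    lora_alpha=32,                         # scaling factor (assumed value)
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],   # attention projections commonly adapted by LoRA
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)  # wrap the 4-bit base model loaded above
model.print_trainable_parameters()

training_args = TrainingArguments(
    output_dir="articlellm-lora",
    num_train_epochs=1,              # single epoch, as in the paper, to avoid overfitting
    per_device_train_batch_size=1,   # assumed; long article inputs leave little memory headroom
    learning_rate=2e-4,              # assumed value
    optim="adamw_torch",             # AdamW optimizer
)
```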
Data set used and preparation for fine-tuning
The dataset used in this study comes from the PDF files of public arXiv40 articles containing the keyword ‘healthcare’. The academic consensus suggests that the optimal dataset size for fine-tuning LLMs lies between 1,000 and 10,000 dialogue entries41,42, most of which consist of dialogue sentences of a few hundred words. However, given that the article data involved in key-insight extraction tasks are often dozens of times longer than these dialogue entries, training time costs increase substantially. Therefore, we choose to use 1,000 articles as our dataset. Among them, following the usual practice, we use 700 articles as the training set.
Researchers have confirmed that long articles can reduce the performance of LLMs43. To fully exploit the performance of an LLM, the length of the input text must be reduced. Moreover, a long article requires a large amount of GPU memory during the fine-tuning phase. Therefore, we selected short articles of fewer than 7,000 words to form the dataset.
Converting the PDF format of an article into text that can be processed by LLMs is a prerequisite for executing the subsequent algorithms. We use Tika-python44 to convert PDF files into string data. Tika-python returns the string data contained in the entire PDF file. Since the string data may contain useless characters and symbols, we also use Python’s pySBD45 library to filter them out.
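The sketch below illustrates this conversion and cleaning step, assuming tika-python can start its local Tika server; the fragment-length filter is an illustrative heuristic, not the authors’ exact cleaning rule.

```python
from tika import parser
import pysbd

parsed = parser.from_file("article.pdf")   # returns a dict with a "content" field
raw_text = parsed["content"] or ""

seg = pysbd.Segmenter(language="en", clean=True)
sentences = seg.segment(raw_text)

# Drop residual fragments (page numbers, stray symbols) before feeding the text to the LLM.
clean_text = " ".join(s.strip() for s in sentences if len(s.split()) > 3)
```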
To fine-tune the open-source LLMs in Table 1, the dataset is organized into the following structure: Instruction and Response. The Instruction is an order given to the LLM, telling it what information to extract and in what format to return it. Table 3 shows the instruction used for extracting key-insights from the article text. The Response is the result expected to be returned by the LLM. In this supervised fine-tuning stage, the response represents the key-insights of the article. We use the key-insights generated by GPT-4 as the labels for the responses.
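One training record might therefore look like the sketch below; the field names and the shortened instruction text are assumptions for illustration (the full instruction appears only in Table 3), and `clean_text` and `gpt4_key_insights` stand for the cleaned article text and the GPT-4 output from earlier steps.

```python
import json

record = {
    # Shortened stand-in for the Table 3 instruction.
    "instruction": "Extract the following key-insights from the article: aim, motivation, "
                   "problem to solve, method, evaluation metrics, findings, contributions, "
                   "limitations, and future work.",
    "input": clean_text,            # article text prepared in the previous step
    "response": gpt4_key_insights,  # key-insights generated by GPT-4, used as the label
}

with open("train.jsonl", "a", encoding="utf-8") as f:
    f.write(json.dumps(record, ensure_ascii=False) + "\n")
```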
Multi-actor of fine-tuned LLMs
The multi-actor ensemble of fine-tuned LLMs operates on the independently obtained responses of each fine-tuned LLM: Mixtral FT, Yi FT, and InternLM2 FT. Each LLM is tasked with extracting key-insights according to the structured instruction shown in Table 3. Following this initial extraction phase, the outputs of the LLMs are subjected to a synthesis process. For this step, we harness one of the LLMs as the centerpiece for integrating the diverse perspectives and insights generated by the other LLMs.
Specifically, we used InternLM2 to summarize the information from the output of each fine-tuned LLM using the instruction shown in Table 4. To ensure that InternLM2 is optimally tailored to integrate and summarize the outputs of the fine-tuned LLMs, we fine-tuned InternLM2 on a dataset of 1,000 entries randomly selected from the key-insight outputs of Mixtral FT, Yi FT, InternLM2 FT, and GPT-4. Among them, the key-insight sentences extracted by Mixtral FT, Yi FT, and InternLM2 FT are used as input, and the key-insight sentences extracted by GPT-4 are used as labels.
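The overall two-stage pipeline can be sketched as follows; each actor is abstracted as a text-generation callable, and the two prompt strings are shortened stand-ins for the instructions in Tables 3 and 4.

```python
from typing import Callable, List

EXTRACT_INSTRUCTION = "Extract the key-insights from the following article:\n"         # Table 3 stand-in
MERGE_INSTRUCTION = "Merge the following key-insight drafts into a single summary:\n"  # Table 4 stand-in

def multi_actor_extract(article_text: str,
                        actors: List[Callable[[str], str]],
                        synthesizer: Callable[[str], str]) -> str:
    # Stage 1: Mixtral FT, Yi FT, and InternLM2 FT each extract key-insights independently.
    drafts = [actor(EXTRACT_INSTRUCTION + article_text) for actor in actors]

    # Stage 2: the fine-tuned InternLM2 synthesizer merges the three drafts into one result.
    merged_prompt = MERGE_INSTRUCTION + "\n\n".join(drafts)
    return synthesizer(merged_prompt)
```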
Performance evaluation
Evaluation metrics
We use the following three metrics to evaluate the performance of ArticleLLM on the key-insights extraction task:
- Manual evaluation: Manual evaluation is considered the gold standard for key-insight extraction tasks3. However, its high cost limits its application to large datasets. We use manual evaluation to measure the performance of the LLMs on a small article dataset. Because the key-insight extraction task involves explicit goals and objectives, we use a relevance score to assess how accurately the LLM extracts key-insights. The relevance score ranges from 0 to 1, where 0 indicates ‘completely irrelevant’ and 1, ‘completely relevant’. Specifically, human researchers assess whether the key-insights extracted by the LLM comprehensively capture all critical information points identified in manually annotated references, evaluating the presence or absence of essential details or elements. To reduce subjective errors, two researchers independently assessed the results and re-evaluated the contradictory parts.
- GPT-4 score: Through carefully designed instructions, as shown in Table 5, GPT-4 scores the semantic similarity of the extracted key-insights. The score ranges from 0 to 100, where 0 indicates ‘completely dissimilar’ and 100, ‘completely similar’. The key-insights generated by an open-source LLM are evaluated against the key-insights generated by GPT-4. Because GPT-4 is considered to capture the deep semantic similarity between texts better than the BLEU score, the GPT-4 score has been widely used in LLM evaluation46,47.
- Vector similarity: Vector similarity quantifies the semantic similarity between two sentences using the cosine similarity between their vector representations48. In this study, vector representations of each key-insight are produced using the Sentence Transformer model all-MiniLM-L6-v249. The similarity scores are normalized to a scale ranging from 0 to 100, where 0 indicates ‘completely dissimilar’ and 100 denotes ‘completely similar’.
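A minimal sketch of the vector-similarity computation with the sentence-transformers library is given below; the mapping of cosine similarity onto a 0 to 100 scale is an assumption, since the paper does not state its exact normalization.

```python
from sentence_transformers import SentenceTransformer, util

encoder = SentenceTransformer("all-MiniLM-L6-v2")   # the sentence encoder named in the paper

def vector_similarity(extracted: str, reference: str) -> float:
    emb = encoder.encode([extracted, reference], convert_to_tensor=True)
    cos = util.cos_sim(emb[0], emb[1]).item()        # cosine similarity in [-1, 1]
    return max(0.0, cos) * 100.0                     # assumed normalization: clip negatives, scale to 0-100

print(vector_similarity("The study aims to detect sepsis early.",
                        "This work targets early detection of sepsis."))
```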
Evaluation results
For the evaluation tests, we used a Linux server with 4 Nvidia TITAN RTX GPUs. The operating system is CentOS 7. The CPU is an Intel(R) Xeon(R) Silver 4114 CPU @ 2.20 GHz with 40 cores. We used a total of 300 articles as the test dataset. In terms of software, we employ Python’s Transformers library35 as the foundational framework for the fine-tuning and inference processes, and the PEFT library50 to perform parameter-efficient fine-tuning.
Manual performance comparison of LLMs before fine-tuning
The manual evaluation results for the key-insight extraction performance of the LLMs before fine-tuning are shown in Table 6, using a total of 34 articles as the test dataset. GPT-4 demonstrates superior efficacy across all key-insights, achieving an average score of 0.97. It is followed by InternLM2, which exhibits commendable performance with an average score of 0.80, showcasing its potential for extracting key-insights from scientific literature. Conversely, Yi and Mixtral lag slightly behind, with average scores of 0.65 and 0.60, respectively. Notably, GPT-4 achieves perfect scores in understanding the aim, motivation, evaluation metrics, and contribution of scientific articles, underscoring its advanced capability in discerning intricate academic content.
Performance comparison of fine-tuned LLMs
The performance comparison results are presented in Fig. 1, and the numerical values of the results are shown in Table 7. With respect to the GPT-4 score after fine-tuning, both Yi FT and Mixtral FT exhibited slight improvements, with their average scores increasing from 68.5 to 71.4 and from 49.4 to 51.6, respectively. InternLM2 and its fine-tuned counterpart, InternLM2 FT, stood out for their exceptional performance, with InternLM2 FT achieving the highest average score of 77.8. Unlike GPT-4 scores, the vector similarity scores show less variation, suggesting this metric evaluates a different aspect. Specifically, after fine-tuning, Yi FT and Mixtral FT showed marginal improvements in their scores, shifting from 64.2 to 65.8 and from 63.5 to 63.9 respectively, showcasing slight but noticeable progress. InternLM2 and its fine-tuned version, InternLM2 FT, delivered the highest scores among the individual models at 68.8, indicating a stronger semantic alignment.
On the other hand, the underperformance of all open-source LLMs in extracting “Evaluation metrics” and “Limitations” can be principally attributed to two specific challenges. First, the task of extracting “Evaluation metrics” often involves identifying multiple distinct indicators such as accuracy, precision, and recall, which may be intricately described within the text. Open-source LLMs may struggle with this multifaceted extraction, leading to the omission of certain indicators that are critical for a comprehensive evaluation. Second, the extraction of “Limitations” presents its unique set of difficulties, as not all authors explicitly mention limitations within their articles. This situation forces open-source LLMs to attempt the synthesis of plausible yet not entirely accurate limitations, which can deviate significantly from the article’s intended messages. These challenges highlight the nuanced understanding and contextual interpretation required for accurately extracting such sophisticated elements from scientific texts, thereby suggesting the need for LLMs to be specifically fine-tuned or trained with a focus on recognizing and handling the diverse and complex nature of “Evaluation metrics” and “Limitations” in scientific literature.
The multi-actor approach consistently outperforms the other models in both metrics, though the margin is narrower in vector similarity, suggesting its overall superior linguistic and cognitive capabilities. Specifically, the multi-actor approach effectively leverages the combined strengths of three fine-tuned LLMs (InternLM2 FT, Yi FT, and Mixtral FT). This collaborative strategy significantly enhances its ability to extract key-insights from scientific articles, as illustrated in Fig. 1, which summarizes performance across nine essential categories. This underscores the synergistic effect of leveraging multiple fine-tuned LLMs, which collectively enhance the precision and breadth of extracted key-insights, surpassing the capabilities of individual LLMs. Moreover, the lower score for “Limitations” compared to other categories suggests that, while the multi-actor approach significantly improves performance, this category remains an area requiring further refinement. Collectively, these results solidify the position of multi-actor LLMs as a powerful tool for comprehensively understanding and analyzing the complexities of scientific literature, outperforming singular fine-tuned LLMs in both depth and accuracy of extracted information.
Statistical significance of results
Given that the observed performances are close to each other, we need to determine whether the observed differences are statistically significant. Considering that our data have natural boundaries which may lead to a skewed distribution, we implemented the Wilcoxon Signed-Rank Test51 to compare the median differences between LLMs. This non-parametric method is more appropriate given the potential non-normality of the data caused by the boundaries. Our null hypothesis states that there is no significant difference in GPT-4 score or vector similarity, whereas the alternative hypothesis anticipates a noticeable difference. Supplementary Table 1 presents the results of the Wilcoxon Signed-Rank Test for all LLMs. For the GPT-4 scores, all fine-tuned models exhibit significant differences in certain key-insights (p < 0.05). InternLM2 FT shows significance in aim, methods, question addressed, evaluation metrics, findings, limitations, and future work; Yi FT in aim, question addressed, and findings; Mixtral FT in aim, motivation, question addressed, and contribution. In particular, the multi-actor approach shows significance in all key-insights.
However, the results of vector similarity differ from those of GPT-4, with no significant statistical differences observed across all key-insights for Mixtral FT and InternLM2 FT. Yi FT shows significance only in aim, methods, and findings. This may be due to the fact that vector similarity is less sensitive to semantic variations. Figure 1b illustrates this point by demonstrating that the variance of vector similarity between different models is lower, resulting in closer curves. Nevertheless, the multi-actor approach shows significance in all key-insights except aim. This reveals that the multi-actor approach significantly improves the key-insight extraction performance. Overall, the results of statistical tests highlight the significant improvements in key-insight extraction made by the multi-actor approach, thereby validating its effectiveness.
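For reference, the paired comparison described above can be reproduced with SciPy as in the sketch below; the score arrays are illustrative values, not data from this study.

```python
from scipy.stats import wilcoxon

# Hypothetical per-article GPT-4 scores for a base model and its fine-tuned counterpart.
base_scores = [62, 70, 55, 68, 74, 60, 66, 71]
ft_scores   = [70, 75, 58, 73, 80, 65, 69, 77]

stat, p_value = wilcoxon(ft_scores, base_scores)   # paired, non-parametric signed-rank test
print(f"W={stat:.1f}, p={p_value:.4f}")            # p < 0.05 -> reject the null hypothesis
```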
Repeated testing of GPT-4 score
Considering the potential variability in GPT-4 score outputs, where identical instructions may yield varying results52, we conducted repeated experiments. As illustrated in Fig. 2, we performed 30 replicates of testing on the GPT-4 scores across all key-insights. The mean of GPT-4 scores for 30 repeated tests is 88.4 ± 0.63. This result shows that while there is some variation in GPT-4 scores across repeated tests, this variation has minimal impact on the overall outcome.
Discussion
The findings from this study underscore the effectiveness of fine-tuned LLMs for the extraction of key-insights from scientific articles. Notably, the fine-tuned version of InternLM2, InternLM2 FT, emerged as a standout performer, showing significant performance enhancements after fine-tuning and thus directly evidencing the benefits of fine-tuning. This enhancement in performance, particularly in terms of understanding and articulating critical scientific information, affirms the pivotal role of fine-tuning in optimizing LLMs for specialized academic tasks.
The multi-actor ensemble of LLMs showcases its superiority in handling complex scientific texts by integrating the strengths of the three individually fine-tuned LLMs (Mixtral FT, Yi FT, and InternLM2 FT), each bringing different nuanced insights to the output. This multi-faceted approach ensures a more comprehensive analysis than what any single LLM could achieve. The synthesis process, led by InternLM2 FT, exemplifies a sophisticated method of consolidating diverse perspectives into a coherent, analytically rich summary. This strategy significantly elevates the benchmark for automated text analysis, offering unparalleled precision and depth in extracting key-insights from dense academic literature, and highlights its immense potential for advancing literature review and analysis. This development aligns with recent studies emphasizing the cooperation of multiple LLMs in improving the performance of LLM applications across various fields31,32. It is a kind of collective intelligence, where diverse sources of information lead to more robust and reliable conclusions, a concept widely supported in the interdisciplinary research community.
Distinct from existing multi-actor approaches27,28,29, our approach for extracting key-insights from scientific articles merges unique viewpoints from disparate models. Consequently, such customary solutions as weight updating27 and expert routing29 prove ineffective for this task. Our approach addresses this by fine-tuning a specialized actor tasked with integrating and summarizing information from the other actors. While this method may not achieve the execution efficiency of expert-routing techniques, it significantly enhances the accuracy of key-insight extraction, ensuring maximal fidelity to the source article.
The results of this study also suggest an insight: the inherent abilities of the original LLMs matter more than the enhancements brought about by fine-tuning. Despite the tangible improvements observed after fine-tuning, the baseline performance of an LLM such as InternLM2 sets a high standard in the ___domain of key-insight extraction. This suggests that the foundational architecture and pre-training of these LLMs may be more pivotal in determining their effectiveness in complex academic tasks, overshadowing the incremental gains achieved through fine-tuning. Specifically, the fact that GPT-4 and InternLM2 demonstrated superior efficacy across various dimensions before fine-tuning highlights the critical importance of the LLM’s original design and training corpus in grasping intricate academic content. Therefore, while fine-tuning serves as a valuable tool for LLM optimization, the selection of the base LLM is crucial for success in such specialized applications as key-insight extraction.
An advantage of ArticleLLM is its capability for local deployment, which facilitates the construction of a private large-scale literature management system. This feature is particularly crucial for applications that emphasize data confidentiality and security. Another benefit of ArticleLLM is its cost-effectiveness. Extracting key-insights from an article of approximately 10,000 words using GPT-4 costs about $0.15, an expense that can prove substantial when building an extensive private literature management system. In contrast, a locally deployed ArticleLLM eliminates those concerns. In addition, the multi-actor approach proposed in this study requires about 10 minutes to extract key-insights from an article under our hardware setup. However, the server used in this research does not employ NVLink technology but rather relies on PCIe (Peripheral Component Interconnect Express) for data transmission, indicating significant room for improvement in system performance efficiency53.
Compared with previous research3,5,8,10 on key-insight extraction, the advantage of ArticleLLM is that it can process the entire article to obtain more reasonable results, demonstrating the feasibility of LLMs for extracting key-insights from articles. ArticleLLM could revolutionize the way academic databases curate and present information, allowing for a more nuanced and efficient retrieval process. Researchers could benefit from contextually relevant summaries, thereby significantly enhancing their research productivity and knowledge discovery.
There are some limitations worth mentioning. Although the multi-actor system based on multiple fine-tuned LLMs has achieved excellent performance, the restriction to articles under 7,000 words due to GPU memory limits and the focus on healthcare-related articles from arXiv may affect the generalizability of our findings. Additionally, the execution efficiency of the system remains a concern because multiple LLMs must run in collaboration. The extraction of key-insights such as “limitations”, which may not exist in the article or may appear in an ambiguous form, requires further research. Compared to the arXiv papers used in this study, more rigorously peer-reviewed papers may avoid this problem. Finally, due to hardware limitations, this study employed 4-bit quantization to fine-tune the LLMs, inevitably leading to some performance loss.
Conclusion
In this paper, it is shown that we can extract key-insights from scientific articles using only open-source LLMs. Our findings affirm the critical role of fine-tuning in enhancing the proficiency of LLMs for extracting key-insights, with InternLM2 FT showcasing remarkable improvements after fine-tuning. In addition, the multi-actor ensemble of LLMs, incorporating diverse perspectives from individually fine-tuned LLMs, significantly broadens the scope and accuracy of the resulting key-insights. However, the inherent architecture and pre-training of LLMs seem to be more influential than fine-tuning in determining their efficacy for extracting key-insights. Looking forward, ArticleLLM presents a promising avenue for enhancing academic research productivity, though such challenges as the execution efficiency of ensemble systems warrant further exploration to optimize the utilization of LLMs in scholarly applications.
Data availability
The datasets generated during and/or analysed during the current study are available from the corresponding author on reasonable request.
References
Bornmann, L., Haunschild, R. & Mutz, R. Growth rates of modern science: a latent piecewise growth curve approach to model publication numbers from established and new literature databases. Humanit. Soc. Sci. Commun. 8, 224 (2021).
Borah, R., Brown, A. W., Capers, P. L. & Kaiser, K. A. Analysis of the time and workers needed to conduct systematic reviews of medical interventions using data from the PROSPERO registry. BMJ Open. 7, e012545 (2017).
Nasar, Z., Jaffry, S. W. & Malik, M. K. Information extraction from scientific articles: a survey. Scientometrics 117, 1931–1990 (2018).
Boukhers, Z. & Bouabdallah, A. Vision and natural language for metadata extraction from scientific PDF documents. In Proceedings of the 22nd ACM/IEEE Joint Conference on Digital Libraries 1–5. https://doi.org/10.1145/3529372.3533295 (ACM, 2022).
Tateisi, Y., Ohta, T., Miyao, Y., Pyysalo, S. & Aizawa, A. Typed entity and relation annotation on computer science papers. In Proceedings of the 10th International Conference on Language Resources and Evaluation. 3836–3843 (2016).
Kovačević, A., Konjović, Z., Milosavljević, B. & Nenadic, G. Mining methodologies from NLP publications: a case study in automatic terminology recognition. Comput. Speech Lang. 26, 105–126 (2012).
Bubeck, S. et al. Sparks of Artificial General Intelligence: Early experiments with GPT-4. Preprint at http://arxiv.org/abs/2303.12712 (2023).
Lakhanpal, S., Gupta, A. K. & Agrawal, R. Towards Extracting Domains from Research Publications. In Midwest Artificial Intelligence and Cognitive Science Conference. https://api.semanticscholar.org/CorpusID:5220521 (2015).
Hirohata, K., Okazaki, N., Ananiadou, S. & Ishizuka, M. Identifying Sections in Scientific Abstracts using Conditional Random Fields. In Proceedings of the Third International Joint Conference on Natural Language Processing: Volume-I. https://aclanthology.org/I08-1050/ (2008).
Ronzano, F. & Saggion, H. Dr. Inventor Framework: Extracting Structured Information from Scientific Publications. In International Conference on Discovery Science (eds. Japkowicz, N. & Matwin, S.) vol. 9356, 209–220. https://doi.org/10.1007/978-3-319-24282-8 (Springer International Publishing, 2015).
He, H. et al. An Insight Extraction System on BioMedical Literature with Deep Neural Networks. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing 2691–2701. https://doi.org/10.18653/v1/D17-1285 (Association for Computational Linguistics, Stroudsburg, PA, USA, 2017).
Polak, M. P. & Morgan, D. Extracting Accurate Materials Data from Research Papers with Conversational Language Models and Prompt Engineering. Preprint at http://arxiv.org/abs/2303.05352 (2023).
Huang, J. & Tan, M. The role of ChatGPT in scientific communication: writing better scientific review articles. Am. J. Cancer Res. 13, 1148–1154 (2023).
Han, R. et al. Is Information Extraction Solved by ChatGPT? An Analysis of Performance, Evaluation Criteria, Robustness and Errors. Preprint at http://arxiv.org/abs/2305.14450 (2023).
Yuan, C., Xie, Q. & Ananiadou, S. Zero-shot Temporal Relation Extraction with ChatGPT. Preprint at http://arxiv.org/abs/2304.05454 (2023).
Wang, S., Scells, H., Koopman, B. & Zuccon, G. Can ChatGPT Write a Good Boolean Query for Systematic Review Literature Search? Preprint at http://arxiv.org/abs/2302.03495 (2023).
Zhang, J., Chen, Y., Niu, N., Wang, Y. & Liu, C. Empirical Evaluation of ChatGPT on Requirements Information Retrieval Under Zero-Shot Setting. Preprint at http://arxiv.org/abs/2304.12562 (2023).
Li, B. et al. Evaluating ChatGPT’s Information Extraction Capabilities: An Assessment of Performance, Explainability, Calibration, and Faithfulness. Preprint at http://arxiv.org/abs/2304.11633 (2023).
Sun, W. et al. Is ChatGPT Good at Search? Investigating Large Language Models as Re-Ranking Agent. Preprint at http://arxiv.org/abs/2304.09542 (2023).
Jahan, I., Laskar, M. T. R., Peng, C. & Huang, J. Evaluation of ChatGPT on Biomedical Tasks: A Zero-Shot Comparison with Fine-Tuned Generative Transformers. Preprint at http://arxiv.org/abs/2306.04504 (2023).
Kempf, S., Krug, M. & Puppe, F. KIETA: Key-insight extraction from scientific tables. Appl. Intell. 53, 9513–9530 (2023).
Laban, P. et al. SummEdits: Measuring LLM Ability at Factual Reasoning Through The Lens of Summarization. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing 9662–9676. https://aclanthology.org/2023.emnlp-main.600/ (Association for Computational Linguistics, 2023).
Cai, H. et al. SciAssess: Benchmarking LLM Proficiency in Scientific Literature Analysis. Preprint at http://arxiv.org/abs/2403.01976 (2024).
Touvron, H. et al. Llama 2: Open Foundation and Fine-Tuned Chat Models. Preprint at http://arxiv.org/abs/2307.09288 (2023).
Howard, J. & Ruder, S. Universal Language Model Fine-tuning for Text Classification. Preprint at http://arxiv.org/abs/1801.06146 (2018).
Hu, E. et al. LoRA: Low-Rank Adaptation of Large Language Models. Preprint at https://arxiv.org/abs/2106.09685 (2021).
Fang, C. et al. LLM-Ensemble: Optimal Large Language Model Ensemble Method for E-commerce Product Attribute Value Extraction. Preprint at http://arxiv.org/abs/2403.00863 (2024).
Jiang, D., Ren, X. & Lin, B. Y. LLM-Blender: Ensembling Large Language Models with Pairwise Ranking and Generative Fusion. Preprint at http://arxiv.org/abs/2306.02561 (2023).
Lu, K. et al. Routing to the Expert: Efficient Reward-guided Ensemble of Large Language Models. Preprint at http://arxiv.org/abs/2311.08692 (2023).
Jiang, A. Q. et al. Mixtral of Experts. Preprint at http://arxiv.org/abs/2401.04088 (2024).
Hong, S. et al. MetaGPT: Meta Programming for A Multi-Agent Collaborative Framework. Preprint at http://arxiv.org/abs/2308.00352 (2023).
Liang, T. et al. Encouraging Divergent Thinking in Large Language Models through Multi-Agent Debate. Preprint at http://arxiv.org/abs/2305.19118 (2023).
Chen, H. et al. ChatGPT’s One-year Anniversary: Are Open-Source Large Language Models Catching up? Preprint at http://arxiv.org/abs/2311.16989 (2023).
Huang, H., Qu, Y., Liu, J., Yang, M. & Zhao, T. An Empirical Study of LLM-as-a-Judge for LLM Evaluation: Fine-tuned Judge Models are Task-specific Classifiers. Preprint at http://arxiv.org/abs/2403.02839 (2024).
Wolf, T. et al. HuggingFace’s Transformers: State-of-the-art Natural Language Processing. Preprint at http://arxiv.org/abs/1910.03771 (2019).
Li, X. et al. AlpacaEval: An Automatic Evaluator of Instruction-following Models. https://github.com/tatsu-lab/alpaca_eval (2023).
Chung, H. W. et al. Scaling Instruction-Finetuned Language Models. Preprint at http://arxiv.org/abs/2210.11416 (2022).
Frantar, E., Ashkboos, S., Hoefler, T. & Alistarh, D. GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers. Preprint at https://arxiv.org/abs/2210.17323 (2022).
Loshchilov, I. & Hutter, F. Decoupled Weight Decay Regularization. Preprint at https://arxiv.org/abs/1711.05101 (2019).
Ginsparg, P. arXiv. https://arxiv.org/ (1991).
Bakker, M. A. et al. Fine-tuning language models to find agreement among humans with diverse preferences. Preprint at https://arxiv.org/abs/2211.15006 (2022).
Hu, Z. et al. LLM-Adapters: An Adapter Family for Parameter-Efficient Fine-Tuning of Large Language Models. Preprint at http://arxiv.org/abs/2304.01933 (2023).
Liu, N. F. et al. Lost in the Middle: How Language Models Use Long Contexts. Preprint at http://arxiv.org/abs/2307.03172 (2023).
Mattmann, C. A. et al. Tika-Python. https://github.com/chrismattmann/tika-python (2014).
Sadvilkar, N. & Neumann, M. PySBD: Pragmatic Sentence Boundary Disambiguation. Preprint at http://arxiv.org/abs/2010.09657 (2020).
Zheng, L. et al. Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena. Preprint at http://arxiv.org/abs/2306.05685 (2023).
Gao, M., Hu, X., Ruan, J., Pu, X. & Wan, X. LLM-based NLG Evaluation: Current Status and Challenges. Preprint at http://arxiv.org/abs/2402.01383 (2024).
Farouk, M. Measuring Sentences Similarity: A Survey. Preprint at http://arxiv.org/abs/1910.03940 (2019).
Reimers, N. & Gurevych, I. Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks. Preprint at http://arxiv.org/abs/1908.10084 (2019).
Mangrulkar, S. et al. PEFT: State-of-the-art Parameter-Efficient Fine-Tuning methods. https://github.com/huggingface/peft (2022).
Wilcoxon, F. Individual comparisons by ranking methods. Biometrics Bull. 1, 80 (1945).
Kocoń, J. et al. ChatGPT: Jack of all trades, master of none. Inf. Fusion 99, 101861 (2023).
Li, A. et al. Evaluating Modern PCIe, NVLink, NV-SLI, NVSwitch and GPUDirect. IEEE Trans. Parallel Distrib. Syst. 31, 94–110 (2020).
Funding
This work was supported by the Dong-A University research fund.
Author information
Authors and Affiliations
Contributions
Z.S. wrote the main manuscript text. G.-Y.H. and B.-K.P. contributed to the manuscript by performing modifications and proofreading. S.H. and X.Z. provided analysis and discussion on the results. All authors reviewed the manuscript.
Corresponding author
Ethics declarations
Competing interests
The authors declare no competing interests.
Additional information
Publisher’s note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Electronic supplementary material
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.
About this article
Cite this article
Song, Z., Hwang, GY., Zhang, X. et al. A scientific-article key-insight extraction system based on multi-actor of fine-tuned open-source large language models. Sci Rep 15, 1608 (2025). https://doi.org/10.1038/s41598-025-85715-7