
Introduction

The exponential growth of scientific articles has made organizing, acquiring, and synthesizing academic information increasingly complex. According to the investigation by Bornmann et al.1, the overall growth rate of scientific articles is 4.1%, with the volume doubling every 17 years. This rapid increase has accelerated information overload, hindered the discovery of new insights, and contributed to the potential spread of false information. Although most scientific articles are published in a structured format intended to ease comprehension, their core content remains unstructured text. Literature review is therefore still a time-consuming task that requires manual effort2. It is thus necessary to support the literature review process by automatically extracting key information from scientific articles.

The automation of key information extraction can be categorized into two classes: metadata extraction and key-insight extraction3. Metadata extraction retrieves fundamental attributes of a scientific article, including the title, author names, publication year, publisher, abstract, and other foundational details. Researchers and digital repositories use metadata to determine the relevance of specific articles to their fields of interest or to facilitate search and filtering tasks. Key-insight extraction, in contrast, summarizes the content of the article, such as the problem to solve, the methodology used, evaluation methods, results, limitations, and future work. Automatically retrieving these insights gives researchers clear and concise views of research articles and increases the efficiency of literature reviews.

Compared with metadata extraction, key-insight extraction is more challenging. High accuracy has been reported for metadata extraction in the scholarly ___domain4, but no comparably effective solution exists for key-insight extraction, because the relevant information is scattered across different parts of the article. Previous studies relied on machine learning techniques that extract information at the phrase5 or sentence level6. These approaches are limited because such models struggle to capture complex context and semantics at the phrase or sentence level, leading to poor performance in capturing insights. A key-insight extraction system is therefore needed that operates at the section or article level rather than at the phrase or sentence level. Moreover, the scarcity of annotated training data, variation across domains, and the ongoing evolution of research paradigms impede the effectiveness of these models. To our knowledge, there is no effective solution for the automatic extraction of key-insights from scientific articles.

Large Language Models (LLMs) such as GPT-4 offer a possible solution to this challenge. LLMs represent a pioneering advancement in natural language processing, characterized by colossal neural network architectures comprising billions to trillions of parameters. They have emerged as a forefront technology in contemporary artificial intelligence research and application, with transformative capabilities in text generation, comprehension, and processing. Bubeck et al.7 reported that GPT-4 has reached near-human performance on a variety of natural language tasks. This opens new opportunities to deeply understand article contents and extract key-insights from them. With their powerful contextual understanding and generation capabilities, LLMs may better capture the details of article contents, enabling more accurate extraction of key-insights. However, no studies have evaluated the ability of LLMs in key-insight extraction.

This study aims to develop and evaluate an LLM-based system for extracting key-insights from scientific articles. We explore the effectiveness of various state-of-the-art LLMs and enhance their performance through fine-tuning with high-quality datasets. We further advance their capabilities by constructing a multi-actor system. Specifically, we first employ OpenAI’s GPT-4, MistralAI’s Mixtral 8×7B, 01AI’s Yi, and InternLM’s InternLM2 as candidate LLMs for extracting key-insights from scientific articles. Next, we evaluate the performance of each LLM on key-insight extraction through manual evaluation. We find that the performance of GPT-4 approaches human level; however, GPT-4 is too expensive to use for key-insight extraction at scale. We therefore use the output of GPT-4 as labels for fine-tuning open-source LLMs to improve their performance. Because the individually fine-tuned LLMs still fall short of the desired performance, we present a multi-actor method that merges the key-insights extracted by the multiple fine-tuned open-source LLMs, yielding a clear improvement in the quality of the key-insights. As a result, we show that key-insights can be extracted at the article level using only multiple fine-tuned open-source LLMs.

Related works

Key-insight extraction

Key-insight extraction refers to identifying valuable research information contained in scientific articles. Current research extracts key-insights at the sentence6 or phrase5 level, treating specific sentences or phrases as key-insights. Sentence-level key-insight extraction can be viewed as a classification task that assigns sentences to specific classes, while phrase-level key-insight extraction is concerned with extracting phrases or fragments from the text. Key-insight extraction has mainly relied on machine learning techniques such as Bayesian classifiers8, Conditional Random Fields9, Support Vector Machines10, and Deep Neural Networks11. Most of these studies extract key-insights from article abstracts. However, these methods have two limitations:

  1. The abstract does not necessarily fully represent the key-insights of the article; for example, the limitations of the research and future work may not appear in the abstract.

  2. The extracted information may not be sufficient to capture the true meaning of the article. In many cases, the key-insights of an article require a broader synthesis of context beyond the textual summary.

The extensive understanding of various topics exhibited by LLMs, exemplified by GPT-4, has garnered significant attention across the scientific community. Comprehensive evaluations have assessed GPT-4’s performance on a multitude of natural language processing tasks12,13,14,15,16,17,18,19. The results indicate that GPT-4’s performance varies across tasks. For instance, in information retrieval17, information extraction12,18, and text summarization7, GPT-4 outperforms traditional models. This may be attributed to GPT-4’s training data encompassing diverse ___domain knowledge, enabling it to retrieve relevant information effectively from a wide range of languages and document types. In relation extraction, however, GPT-4 falls short of benchmark models. Han et al.14 attributed this discrepancy to GPT-4’s limited understanding of subject-object relationships in relation extraction tasks.

In contrast to traditional methods that focus on paragraph-level and sentence-level key-insight extraction, an advantage of LLMs is their capacity to perform full-text key-insight extraction. This task can be considered a synthesis of text summarization and information extraction. The use of LLMs for key-insight extraction, however, currently lacks systematic evaluation. Existing large benchmark datasets typically assess LLM performance in areas such as information extraction20,21, text summarization22, and question answering (QA)23. Because key-insight extraction demands that models extract and synthesize information from multiple perspectives, performance on information extraction, text summarization, and QA cannot be directly equated with effectiveness in key-insight extraction. It is therefore necessary to systematically evaluate the performance of LLMs on key-insight extraction tasks.

Fine-tuning of LLMs

The key to the high performance of LLMs lies in the two main stages of the training process: (1) initial pre-training on massive text corpora, endowing the LLM with an expansive grasp of linguistic knowledge and structure; and (2) fine-tuning on specific tasks to adapt to distinct domains and applications. This dual-training paradigm gives LLMs unparalleled adaptability, rendering them instrumental in the modern landscape of natural language processing. Since initial pre-training requires massive computational resources, scholars tend to fine-tune pre-trained LLMs to adapt them to tasks in different fields24.

Supervised fine-tuning is the process of taking a pre-trained model and adapting it to a new task or dataset by making small adjustments to its parameters25. Compared with fine-tuning small-scale models, fine-tuning LLMs is more complex because of scalability and hardware constraints. Fine-tuning methods designed for LLMs therefore typically follow the PEFT (Parameter-Efficient Fine-Tuning) paradigm, which tweaks only a small fraction of the LLM’s parameters, thereby mitigating computational and memory costs. This allows efficient adaptation of LLMs to specific tasks without extensive resources, making high-quality LLM customization more accessible.

LoRA (Low-Rank Adaptation)26 is a PEFT technique that adapts LLMs such as GPT-3 by introducing trainable rank-decomposition matrices into their architecture, significantly reducing the number of adjusted parameters while maintaining or improving performance. LoRA thus offers a favorable balance between efficiency and performance: it enables nuanced, targeted adjustments to LLMs without the substantial increase in trainable parameters typically associated with such refinements.
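As a concrete illustration, the following minimal PyTorch sketch shows the core LoRA idea: the pre-trained weight matrix is frozen and only two low-rank factors are trained. The rank and scaling values are illustrative rather than the settings used in this study (see Table 2).

```python
# A minimal conceptual LoRA layer in PyTorch: the pre-trained weight W stays
# frozen and only the low-rank factors A and B are trained. The rank r and
# scaling alpha shown here are illustrative, not this study's settings.
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, r: int = 16, alpha: int = 32):
        super().__init__()
        self.base = base
        for p in self.base.parameters():          # freeze the pre-trained weights
            p.requires_grad_(False)
        self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, r))  # starts as a no-op
        self.scale = alpha / r

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # W x + (alpha/r) * B A x ; gradients flow only through A and B
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)
```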

Multi-actor approaches in LLM

Introducing multiple actors into an LLM system to improve performance leverages the concept of ensemble learning: multiple AI actors, each an instance of an LLM such as GPT-4, are coordinated to act in concert. Through such interaction, the actors combine their varied knowledge and contextual understanding, addressing intricate challenges with an efficiency and inventiveness that exceed the scope of any single LLM. This transition from individual to collective AI reflects the notion that the aggregate output of an ensemble of actors can far exceed what each could achieve independently.

Extensive research has been conducted to enhance the performance of LLMs on specialized tasks using multi-actor approaches, yet there is no consensus on the optimal way for the actors to collaborate. For instance, employing the Dawid-Skene model to iteratively optimize weights for each actor proves highly effective for tasks with definite labels27. However, the multi-actor nature of such systems significantly increases computational demand28. To mitigate this, a routing architecture for multi-actor LLMs based on a reward model has been introduced, which presumes each sub-actor’s proficiency in specific tasks29. This method improves both efficiency and accuracy by allocating particular tasks to designated expert actors, similar in philosophy to the popular Mixture of Experts (MoE)30 model but with greater scalability because it avoids the hard-coding of expert models inherent in MoE architectures.

In the sphere of practical implementation, the research by Hong et al.31 serves as a compelling illustration of the extraordinary capabilities of multi-actor systems. They have adeptly merged human-inspired Standard Operating Procedures (SOPs) with role specialization within an advanced meta-programming architecture, demonstrating how structured cooperation can enhance the performance of LLMs to unprecedented levels. Further, Liang et al.32 have advanced this area by creating a Multi-Agent Debate (MAD) framework, tailor-made to navigate the complexities of intricate reasoning challenges that confront LLMs. This framework provides a systematic arena for agents to participate in deliberative debates, thereby boosting the collective intellectual capacity of the ensemble of agents, showcasing the profound influence that well-orchestrated ensemble strategies can have in transcending the limitations of existing AI paradigms.

ArticleLLM

We propose a scientific-article key-insight extraction system, called ArticleLLM, built as a multi-actor ensemble of multiple fine-tuned open-source LLMs. The key-insights we extract are the following: aim of study, motivation of study, problem to solve, method used for solution, evaluation metrics, findings, contributions, limitations, and future work.
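For illustration only, the nine key-insight categories could be represented by a simple container such as the following; the field names mirror the list above, while the actual output format requested from the LLMs is defined by the instruction in Table 3.

```python
# A hypothetical container for the nine key-insight fields listed above. The
# field names simply mirror those categories; the output format actually
# requested from the LLMs is defined by the instruction in Table 3.
from dataclasses import dataclass

@dataclass
class KeyInsights:
    aim: str
    motivation: str
    problem: str
    method: str
    evaluation_metrics: str
    findings: str
    contributions: str
    limitations: str
    future_work: str
```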

Open-source LLMs used for fine-tuning

Compared to commercially available proprietary LLMs, fine-tuned and locally deployed open-source LLMs hold advantages in terms of data security, scalability, and cost-effectiveness33. Fine-tuning is a pivotal process that enables the LLM to perform better on specific tasks by adjusting the parameters of a pre-trained LLM to fit particular datasets or application contexts34. Fine-tuning is more resource- and time-efficient than training from scratch, as it leverages the general knowledge acquired by the pre-trained LLM.

In this study, we use MistralAI’s Mixtral, 01AI’s Yi, and InternLM’s InternLM2, all available through the Hugging Face Transformers framework35, as candidate open-source LLMs for fine-tuning, as shown in Table 1. These LLMs rank highly on the AlpacaEval leaderboard36 and are considered to have high potential for key-insight extraction. Because they are compatible with the Transformers framework, they can all be fine-tuned with the same set of methods.

Table 1 Open-source LLMs used for fine-tuning.

Fine-tuning algorithm

We utilize the instruction fine-tuning37 method to refine the LLMs. This method performs targeted fine-tuning on top of the foundational language model using specific instructions, thereby enhancing the LLM’s capability to comprehend and respond to user commands or inquiries. We fine-tune each model using Instruction-Response pairs, where the Instructions serve as input data and the Responses are treated as labels. During the fine-tuning phase, the optimization algorithm updates the LLM’s parameters by computing the loss between the predicted outputs and the labels.
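As an illustration of this setup, the following sketch shows one common way to turn an Instruction-Response pair into training tokens, masking the instruction so that the loss is computed only on the response; the masking convention and the maximum length are assumptions rather than details reported in this study.

```python
# Sketch: turning one Instruction-Response pair into training tokens. Masking
# the instruction tokens with -100 so that the cross-entropy loss is computed
# only on the response is a common convention; whether the authors mask the
# instruction is our assumption, and max_len is a placeholder.
def build_example(tokenizer, instruction: str, response: str, max_len: int = 8192):
    prompt_ids = tokenizer(instruction, add_special_tokens=False)["input_ids"]
    answer_ids = tokenizer(response, add_special_tokens=False)["input_ids"]
    input_ids = (prompt_ids + answer_ids)[:max_len]
    labels = ([-100] * len(prompt_ids) + answer_ids)[:max_len]  # loss only on the response
    return {"input_ids": input_ids, "labels": labels}
```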

To efficiently fine-tune LLMs with minimal performance loss, we load the models in 4-bit precision using the GPTQ algorithm38. GPTQ significantly reduces the LLMs’ memory and computational requirements by compressing the weight representation from 32 bits to just 4 bits. This reduction not only shrinks the LLM’s size but also improves its operational speed by representing the weights with a small discrete set of values.
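For reference, loading a pre-quantized GPTQ checkpoint with the Transformers library can look like the following sketch; the checkpoint name is a placeholder standing in for the models listed in Table 1.

```python
# Sketch: loading a model in 4-bit GPTQ form with the Transformers library.
# The checkpoint name is a placeholder; the actual models are those listed in
# Table 1, assumed to be available as pre-quantized GPTQ checkpoints.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "example-org/InternLM2-20B-GPTQ"        # hypothetical GPTQ checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",          # shard layers across the available GPUs
    trust_remote_code=True,     # some checkpoints ship custom modeling code
)
```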

During the fine-tuning stage, we apply LoRA26 to adapt the pre-trained LLMs to the task efficiently. LoRA selectively modifies a subset of the LLM’s weights through low-rank matrices, preserving the original structure while enabling rapid fine-tuning without retraining the full network. We adopt AdamW39 for optimization, which adapts learning rates based on gradient moments and balances task-specific performance gains with generalizability. We use the default cross-entropy loss because it effectively handles the multi-class prediction problem and ensures the accuracy of the probability distribution of the model output. We set the number of training epochs to 1 to avoid overfitting. The parameters governing the LoRA adaptation and the overall fine-tuning process, selected to balance refining the LLM’s task-specific performance against maintaining its general capabilities, are shown in Table 2 (see the sketch after the table).

Table 2 Fine-tuning parameters.
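The following sketch, continuing from the loading example above, illustrates how LoRA adapters can be attached with the PEFT library and trained for one epoch with AdamW; all hyperparameter values shown are placeholders, with the values actually used given in Table 2.

```python
# Sketch: attaching LoRA adapters to the quantized model loaded above and
# running one epoch of supervised fine-tuning. All hyperparameter values are
# placeholders; the values actually used are listed in Table 2, and
# train_dataset is assumed to hold records like those built in the earlier
# masking sketch.
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training
from transformers import DataCollatorForSeq2Seq, Trainer, TrainingArguments

model = prepare_model_for_kbit_training(model)      # make the 4-bit model adapter-trainable
lora_config = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05,         # placeholder LoRA settings
    target_modules=["q_proj", "v_proj"],            # assumed attention projections
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)          # only the low-rank adapters are trainable

args = TrainingArguments(
    output_dir="articlellm-ft",
    num_train_epochs=1,                             # single epoch, as described above
    per_device_train_batch_size=1,
    learning_rate=2e-4,                             # placeholder
    optim="adamw_torch",                            # AdamW optimizer
)
trainer = Trainer(
    model=model,
    args=args,
    train_dataset=train_dataset,                    # assumed prepared beforehand
    data_collator=DataCollatorForSeq2Seq(tokenizer, padding=True),  # pads inputs, -100-pads labels
)
trainer.train()
```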

Data set used and preparation for fine-tuning

The dataset used in this study consists of PDF files of public arXiv40 articles matching the keyword “healthcare”. Academic consensus suggests that the optimal dataset size for fine-tuning LLMs lies between 1,000 and 10,000 dialogue entries41,42, most of which consist of dialogue turns of a few hundred words. Because the article texts involved in key-insight extraction are often dozens of times longer than such dialogue entries, training time increases substantially. We therefore use 1,000 articles as our dataset and, following common practice, assign 700 of them to the training set.

Researchers have confirmed that long inputs can reduce the performance of LLMs43. To fully exploit an LLM’s performance, the length of the input text must be limited. Moreover, long articles require a large amount of GPU memory during the fine-tuning phase. We therefore selected short articles of fewer than 7,000 words to form the dataset.

Converting an article from PDF into text that LLMs can process is a prerequisite for the subsequent steps. We use Tika-python44 to convert PDF files into string data; Tika-python returns the text content of the entire PDF file. Because this string data may contain useless characters and symbols, we use Python’s pySBD45 library to filter them out.
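A minimal version of this preprocessing step, using the two libraries named above, might look as follows; re-joining the cleaned sentences into a single string is our assumption about how the text is passed to the next stage.

```python
# Sketch of the preprocessing step: extract raw text from a PDF with
# Tika-python and clean/segment it with pySBD. Re-joining the sentences into
# one string is our assumption about how the cleaned text is passed on.
from tika import parser
import pysbd

parsed = parser.from_file("article.pdf")                 # needs Java/Tika available
raw_text = parsed.get("content") or ""

segmenter = pysbd.Segmenter(language="en", clean=True)   # clean=True drops stray characters
sentences = segmenter.segment(raw_text)
article_text = " ".join(s.strip() for s in sentences)
```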

To fine-tune the open-source LLMs in Table 1, the dataset is organized into an Instruction and a Response. The Instruction is the order given to the LLM, telling it what information to extract and in what format to return it; Table 3 shows the instruction used for extracting key-insights from the article text. The Response is the result the LLM is expected to return; in this supervised fine-tuning stage, it contains the key-insights of the article. We use the key-insights generated by GPT-4 as the labels for the responses (see the sketch after Table 3).

Table 3 Instruction for extracting key-insights.
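For illustration, one fine-tuning record could be assembled as in the following sketch, with the GPT-4 output serving as the label; the instruction string and the source of the article-label pairs are placeholders.

```python
# Sketch: assembling one Instruction-Response record for fine-tuning, with the
# GPT-4 output serving as the label. INSTRUCTION stands for the Table 3 prompt
# and training_pairs for the (article text, GPT-4 key-insights) pairs; both are
# placeholders.
import json

INSTRUCTION = "..."   # the key-insight extraction instruction shown in Table 3

def make_record(article_text: str, gpt4_key_insights: str) -> dict:
    return {
        "instruction": f"{INSTRUCTION}\n\n{article_text}",
        "response": gpt4_key_insights,
    }

with open("train.jsonl", "w", encoding="utf-8") as f:
    for article_text, gpt4_key_insights in training_pairs:   # assumed iterable of pairs
        f.write(json.dumps(make_record(article_text, gpt4_key_insights)) + "\n")
```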

Multi-actor of fine-tuned LLMs

The multi-actor stage operates on the independently obtained responses of each fine-tuned LLM: Mixtral FT, Yi FT, and InternLM2 FT. Each LLM extracts key-insights according to the structured instruction shown in Table 3. Following this initial extraction phase, the outputs of the LLMs are subjected to a synthesis process, in which one of the LLMs serves as the centerpiece for integrating the diverse perspectives and insights generated by the others.

Specifically, we use InternLM2 to summarize the information in the outputs of the fine-tuned LLMs using the instruction shown in Table 4 (see the sketch after the table). To ensure that InternLM2 is optimally tailored to integrate and summarize these outputs, we fine-tuned it on a dataset of 1,000 entries randomly selected from the key-insight outputs of Mixtral FT, Yi FT, InternLM2 FT, and GPT-4. The key-insight sentences extracted by Mixtral FT, Yi FT, and InternLM2 FT serve as input, and the key-insight sentences extracted by GPT-4 serve as labels.

Table 4 Instruction for summarizing the key-insights from each fine-tuned LLM.
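The overall flow can be summarized by the following sketch, in which each fine-tuned actor extracts its own key-insights and the fine-tuned InternLM2 merges them using the Table 4 instruction; the generate_text helper is a placeholder for the underlying inference call, not an interface defined in this study.

```python
# Sketch of the multi-actor flow: each fine-tuned LLM produces its own
# key-insights, and the fine-tuned InternLM2 merges them with the Table 4
# instruction. generate_text is a placeholder for whatever inference wrapper
# drives each model.
EXTRACT_INSTRUCTION = "..."   # Table 3 instruction
MERGE_INSTRUCTION = "..."     # Table 4 instruction

def extract_key_insights(article_text, actors, generate_text, merger="InternLM2 FT"):
    # actors: mapping of name -> fine-tuned model (Mixtral FT, Yi FT, InternLM2 FT)
    drafts = {
        name: generate_text(model, f"{EXTRACT_INSTRUCTION}\n\n{article_text}")
        for name, model in actors.items()
    }
    merge_prompt = MERGE_INSTRUCTION + "\n\n" + "\n\n".join(
        f"[{name}]\n{draft}" for name, draft in drafts.items()
    )
    # the merging actor is the fine-tuned InternLM2 described above
    return generate_text(actors[merger], merge_prompt)
```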

Performance evaluation

Evaluation metrics

We use the following three metrics to evaluate the performance of ArticleLLM on the key-insights extraction task:

  • Manual evaluation Manual evaluation is considered the gold standard for key-insight extraction tasks3. However, its high cost limits its application to large datasets, so we use it to measure LLM performance on a small article dataset. Because the key-insight extraction task involves explicit goals and objectives, we use a relevance score to assess how accurately the LLM extracts key-insights. The relevance score ranges from 0 to 1, where 0 indicates ‘completely irrelevant’ and 1 ‘completely relevant’. Specifically, human researchers assess whether the key-insights extracted by the LLM comprehensively capture all critical information points identified in manually annotated references, checking for the presence or absence of essential details. To reduce subjective error, two researchers independently assessed the results and re-evaluated the parts on which they disagreed.

  • GPT-4 score Using the carefully designed instruction shown in Table 5, GPT-4 scores the semantic similarity of the extracted key-insights. The score ranges from 0 to 100, where 0 indicates ‘completely dissimilar’ and 100 ‘completely similar’. The key-insights generated by an open-source LLM are evaluated against the key-insights generated by GPT-4. Because GPT-4 is considered to capture deep semantic similarity between texts better than the BLEU score, the GPT-4 score has been widely used in LLM evaluation46,47.

  • Vector similarity Vector similarity quantifies the semantic similarity between two sentences as the cosine similarity between their vector representations48. In this study, vector representations of each key-insight are produced with the Sentence Transformer model all-MiniLM-L6-v249. The similarity scores are normalized to a scale from 0 to 100, where 0 indicates ‘completely dissimilar’ and 100 ‘completely similar’ (see the sketch after Table 5).

Table 5 Instruction for semantic similarity.
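For reference, the vector-similarity metric can be computed as in the following sketch using the all-MiniLM-L6-v2 model named above; clipping negative cosine values before rescaling to 0-100 is our assumption about the normalization.

```python
# Sketch: the vector-similarity metric, using the all-MiniLM-L6-v2 Sentence
# Transformer named in the text. Clipping negative cosine values before
# rescaling to 0-100 is our assumption about the normalization.
from sentence_transformers import SentenceTransformer, util

encoder = SentenceTransformer("all-MiniLM-L6-v2")

def vector_similarity(extracted: str, reference: str) -> float:
    embeddings = encoder.encode([extracted, reference], convert_to_tensor=True)
    cosine = util.cos_sim(embeddings[0], embeddings[1]).item()
    return max(0.0, cosine) * 100.0   # scale to 0-100
```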

Evaluation results

For the evaluation tests, we used a Linux server with four Nvidia TITAN RTX GPUs running CentOS 7; the CPU is an Intel(R) Xeon(R) Silver 4114 @ 2.20 GHz with 40 cores. We used a total of 300 articles as the test dataset. In terms of software, we employ Python’s Transformers library35 as the foundational framework for fine-tuning and inference, and the PEFT library50 for parameter-efficient fine-tuning.

Manual performance comparison of LLMs before fine-tuning

The manual evaluation results for the key-insight extraction performance of the LLMs before fine-tuning, based on a test set of 34 articles, are shown in Table 6. GPT-4 demonstrates superior efficacy across all key-insights, achieving an average score of 0.97. It is followed by InternLM2, which exhibits commendable performance with an average score of 0.80, showcasing its potential for extracting key-insights from scientific literature. Yi and Mixtral lag behind, with average scores of 0.65 and 0.60, respectively. Notably, GPT-4 achieves perfect scores in identifying the aim, motivation, evaluation metrics, and contribution of scientific articles, underscoring its advanced capability in discerning intricate academic content.

Table 6 Manual performance evaluation for LLMs before fine-tuning.

Performance comparison of fine-tuned LLMs

The performance comparison results are presented in Fig. 1, and the numerical values are given in Table 7. With respect to the GPT-4 score, fine-tuning brought slight improvements to both Yi and Mixtral, with their average scores increasing from 68.5 to 71.4 and from 49.4 to 51.6, respectively. InternLM2 and its fine-tuned counterpart, InternLM2 FT, stood out, with InternLM2 FT achieving the highest average score of 77.8. Unlike the GPT-4 scores, the vector similarity scores vary less, suggesting that this metric captures a different aspect of quality. Specifically, after fine-tuning, Yi FT and Mixtral FT showed marginal improvements, moving from 64.2 to 65.8 and from 63.5 to 63.9, respectively, while InternLM2 and InternLM2 FT delivered the highest scores among the individual models at 68.8, indicating stronger semantic alignment.

On the other hand, the underperformance of all open-source LLMs in extracting “Evaluation metrics” and “Limitations” can be principally attributed to two specific challenges. First, the task of extracting “Evaluation metrics” often involves identifying multiple distinct indicators such as accuracy, precision, and recall, which may be intricately described within the text. Open-source LLMs may struggle with this multifaceted extraction, leading to the omission of certain indicators that are critical for a comprehensive evaluation. Second, the extraction of “Limitations” presents its unique set of difficulties, as not all authors explicitly mention limitations within their articles. This situation forces open-source LLMs to attempt the synthesis of plausible yet not entirely accurate limitations, which can deviate significantly from the article’s intended messages. These challenges highlight the nuanced understanding and contextual interpretation required for accurately extracting such sophisticated elements from scientific texts, thereby suggesting the need for LLMs to be specifically fine-tuned or trained with a focus on recognizing and handling the diverse and complex nature of “Evaluation metrics” and “Limitations” in scientific literature.

The multi-actor approach consistently outperforms the other models in both metrics, though the margin is narrower for vector similarity, suggesting superior overall linguistic and cognitive capabilities. Specifically, the multi-actor approach effectively leverages the combined strengths of the three fine-tuned LLMs (InternLM2 FT, Yi FT, and Mixtral FT). This collaborative strategy significantly enhances its ability to extract key-insights from scientific articles, as illustrated in Fig. 1, which summarizes performance across the nine essential categories. This underscores the synergistic effect of leveraging multiple fine-tuned LLMs, which collectively enhance the precision and breadth of the extracted key-insights beyond the capabilities of individual LLMs. Moreover, the lower score for “Limitations” compared to other categories indicates that, while the multi-actor approach significantly improves performance, this remains an area requiring further refinement. Collectively, these results establish multi-actor LLMs as a powerful tool for comprehensively understanding and analyzing the complexities of scientific literature, outperforming singular fine-tuned LLMs in both the depth and accuracy of the extracted information.

Fig. 1

(a) LLM performance evaluated by the GPT-4 score. (b) LLM performance evaluated by vector similarity.

Table 7 The numerical results of LLM performance evaluated by GPT-4.

Statistical significance of results

Given that the observed performances are close to each other, we examine whether the observed differences are statistically significant. Because our data have natural boundaries that may lead to a skewed distribution, we applied the Wilcoxon Signed-Rank Test51 to compare median differences between LLMs; this non-parametric method is appropriate given the potential non-normality of the data. The null hypothesis states that there is no significant difference in GPT-4 score or vector similarity, whereas the alternative hypothesis anticipates a noticeable difference. Supplementary Table 1 presents the results of the Wilcoxon Signed-Rank Test for all LLMs. For GPT-4 scores, all fine-tuned models exhibit significant differences for certain key-insights (p < 0.05): InternLM2 FT in aim, methods, question addressed, evaluation metrics, findings, limitations, and future work; Yi FT in aim, question addressed, and findings; and Mixtral FT in aim, motivation, question addressed, and contribution. In particular, the multi-actor approach shows significance for all key-insights.

However, the vector similarity results differ from those of the GPT-4 score, with no statistically significant differences observed for any key-insight for Mixtral FT and InternLM2 FT; Yi FT shows significance only in aim, methods, and findings. This may be because vector similarity is less sensitive to semantic variations; Fig. 1b illustrates this point, showing that the variance of vector similarity between different models is lower, resulting in closer curves. Nevertheless, the multi-actor approach shows significance for all key-insights except aim. Overall, the statistical tests highlight the significant improvements in key-insight extraction made by the multi-actor approach, thereby validating its effectiveness.
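For reference, each per-category comparison can be carried out with a paired Wilcoxon signed-rank test as in the following sketch, where the two score lists hold per-article scores of the two systems being compared.

```python
# Sketch: a paired Wilcoxon signed-rank test on per-article scores of two
# systems for one key-insight category. The score lists are supplied by the
# caller (e.g., per-article GPT-4 scores or vector similarities on the same
# test articles).
from scipy.stats import wilcoxon

def significance(scores_a, scores_b, alpha=0.05):
    """Return the test statistic, p-value, and whether p < alpha."""
    stat, p_value = wilcoxon(scores_a, scores_b)
    return stat, p_value, p_value < alpha
```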

Repeated testing of GPT-4 score

Considering the potential variability in GPT-4 score outputs, where identical instructions may yield varying results52, we conducted repeated experiments. As illustrated in Fig. 2, we repeated the GPT-4 scoring 30 times across all key-insights. The mean GPT-4 score over the 30 repeated tests is 88.4 ± 0.63. This result shows that, while there is some variation in GPT-4 scores across repeated tests, it has minimal impact on the overall outcome.

Fig. 2

Repeated testing of the GPT-4 score.

Discussion

The findings of this study underscore the effectiveness of fine-tuned LLMs for extracting key-insights from scientific articles. Notably, the fine-tuned version of InternLM2, InternLM2 FT, emerged as a standout performer, showing significant performance gains after fine-tuning and thus directly evidencing its benefits. This improvement, particularly in understanding and articulating critical scientific information, affirms the pivotal role of fine-tuning in optimizing LLMs for specialized academic tasks.

The multi-actor ensemble of LLMs showcases its superiority in handling complex scientific texts by integrating the strengths of the three individually fine-tuned LLMs, Mixtral FT, Yi FT, and InternLM2 FT, each of which brings different nuanced insights to the output. This multi-faceted approach ensures a more comprehensive analysis than any single LLM could achieve. The synthesis process, led by InternLM2 FT, exemplifies a sophisticated method of consolidating diverse perspectives into a coherent, analytically rich summary. This strategy raises the benchmark for automated text analysis, offering precision and depth in extracting key-insights from dense academic literature, and highlights its potential for advancing literature review and analysis. This development aligns with recent studies emphasizing the cooperation of multiple LLMs to improve the performance of LLM applications across various fields31,32. It is a form of collective intelligence, where diverse sources of information lead to more robust and reliable conclusions, a concept widely supported in the interdisciplinary research community.

Distinct from existing multi-actor approaches27,28,29, our approach for extracting key-insights from scientific articles merges the unique viewpoints of disparate models. Consequently, customary solutions such as weight updating27 and expert routing29 prove ineffective for this task. Our approach addresses this by fine-tuning a specialized actor tasked with integrating and summarizing the information produced by the other actors. While this method may not achieve the execution efficiency of expert routing techniques, it significantly enhances the accuracy of key-insight extraction, ensuring maximal fidelity to the source article.

The results of this study also suggest an insight: the inherent abilities of the original LLMs matter more than the enhancements brought about by fine-tuning. Despite the tangible improvements observed after fine-tuning, the baseline performance of an LLM such as InternLM2 sets the standard for excellence in key-insight extraction. This suggests that the foundational architecture and pre-training of these LLMs may be more pivotal in determining their effectiveness in complex academic tasks, overshadowing the incremental gains achieved through fine-tuning. Specifically, the fact that GPT-4 and InternLM2 demonstrated superior efficacy across various dimensions before fine-tuning highlights the critical importance of an LLM’s original design and training corpus in grasping intricate academic content. Therefore, while fine-tuning serves as a valuable tool for LLM optimization, the selection of the base LLM is crucial for success in specialized applications such as key-insight extraction.

An advantage of ArticleLLM is its capability for local deployment, which facilitates building a private large-scale literature management system, a feature particularly important for applications that emphasize data confidentiality and security. Another benefit is cost-effectiveness: extracting key-insights from an article of approximately 10,000 words with GPT-4 costs about $0.15, an expense that becomes substantial when building an extensive private literature management system, whereas a locally deployed ArticleLLM eliminates this concern. Note that the multi-actor approach proposed in this study requires about 10 minutes to extract key-insights from an article under our hardware setup. However, the server used in this research relies on PCIe (Peripheral Component Interconnect Express) rather than NVLink for data transmission, indicating significant room for improvement in system efficiency53.

Compared with previous research3,5,8,10 on key-insight extraction, the advantage of ArticleLLM is that it can process the entire article to obtain more reasonable results, demonstrating the feasibility of LLMs for extracting key-insights from articles. ArticleLLM could revolutionize the way academic databases curate and present information, allowing for a more nuanced and efficient retrieval process. Researchers could benefit from contextually relevant summaries, thereby significantly enhancing their research productivity and knowledge discovery.

There are some limitations worth mentioning. Although the multi-actor system based on multiple fine-tuned LLMs has achieved excellent performance, the restriction to articles under 7,000 words due to GPU memory limits and the focus on healthcare-related articles from arXiv may affect the generalizability of our findings. Additionally, the execution efficiency of the system is still a concern due to the collaboration of multiple LLMs involved. The extraction of key-insights such as “limitations” that may not exist in the article or appear in an ambiguous form requires further research. Compared to the arXiv papers used in this study, more rigorous peer-reviewed papers may avoid this problem. On the other hand, due to the hardware limitations, this study employed 4-bit quantization to fine-tune the LLMs, inevitably leading to a performance loss.

Conclusion

In this paper, we show that key-insights can be extracted from scientific articles using only open-source LLMs. Our findings affirm the critical role of fine-tuning in enhancing the proficiency of LLMs for extracting key-insights, with InternLM2 FT showing remarkable improvements after fine-tuning. In addition, the multi-actor ensemble of LLMs, which incorporates diverse perspectives from individually fine-tuned LLMs, significantly broadens the scope and accuracy of the resulting key-insights. However, the inherent architecture and pre-training of LLMs appear more influential than fine-tuning in determining their efficacy for extracting key-insights. Looking forward, ArticleLLM presents a promising avenue for enhancing academic research productivity, though challenges such as the execution efficiency of ensemble systems warrant further exploration to optimize the use of LLMs in scholarly applications.