
PaperQA Dev Journal


Abstract

This document records the process of integrating the PaperQA solution into RAGView. It covers the advantages of the PaperQA approach, an overview of the integration workflow, and practical usage experiences. The document can serve as a practical guide and a pitfall-avoidance reference for users who intend to integrate this solution.


1. PaperQA: A High-Precision RAG System for Scientific Literature

PaperQA is a high-precision Retrieval-Augmented Generation (RAG) system specifically designed for scientific literature scenarios [1]. Its latest version, PaperQA2, introduces an Agentic workflow, which completes literature search, evidence collection, and evidence aggregation through multiple iterative steps, and ultimately generates answers with precise citations. According to official benchmark results [2], PaperQA2 has demonstrated superhuman-level performance in tasks such as scientific question answering, literature summarization, and contradiction detection.

Unlike traditional RAG systems that generate answers directly from retrieved text fragments, PaperQA inserts a structured Reranking and Contextual Summarization (RCS) step between retrieval and generation to mitigate fragmented or out-of-context reasoning. Specifically, for each retrieved document chunk, PaperQA generates a query-aware contextual summary that extracts the evidence directly relevant to the user’s question. These summaries are then scored and filtered by large language models to remove irrelevant or noisy information. The final answer is generated solely from the filtered evidence, and each sentence in the answer is explicitly annotated with its corresponding literature source.
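The summarize → score/filter → generate flow described above can be sketched in a few lines. This is a minimal illustration, not PaperQA's actual code: the `summarize`, `score`, and `generate` callables are hypothetical stand-ins for the LLM calls PaperQA makes internally.

```python
# Minimal sketch of the contextual-summary / rerank / generate flow.
# `summarize`, `score`, and `generate` are hypothetical stand-ins for
# LLM calls; PaperQA drives the real versions with its summary_llm.
from dataclasses import dataclass

@dataclass
class Evidence:
    source: str   # citation key of the originating paper
    summary: str  # query-aware contextual summary of one chunk
    score: int    # LLM-assigned relevance score

def rcs_answer(question, chunks, summarize, score, generate,
               cutoff=5, top_k=10):
    """Summarize each chunk w.r.t. the question, filter by score, generate."""
    evidence = []
    for source, text in chunks:
        summary = summarize(question, text)  # contextual summary per chunk
        evidence.append(Evidence(source, summary, score(question, summary)))
    # keep only sufficiently relevant evidence, best-scored first
    kept = sorted((e for e in evidence if e.score >= cutoff),
                  key=lambda e: e.score, reverse=True)[:top_k]
    return generate(question, kept)  # the answer cites only kept sources
```

Because generation only ever sees the filtered, per-source summaries, every sentence in the answer can be traced back to a concrete citation.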

Compared with traditional RAG approaches that rely primarily on embedding similarity for retrieval, this method significantly improves evidence traceability and effectively reduces the risk of hallucinated content.

In addition, conventional RAG systems typically follow a linear workflow: once retrieval is completed, the system proceeds to answer generation regardless of the quality of the retrieved results, lacking any dynamic evaluation of retrieval adequacy. As an Agentic RAG system, PaperQA introduces an intelligent agent with autonomous decision-making capabilities, decomposing complex scientific tasks into dynamically adjustable steps, including literature search, evidence collection, information evaluation, and answer generation.

When the system detects insufficient evidence or conflicting information, it does not forcibly generate low-quality answers; instead, it can proactively trigger additional searches or conduct deeper reading of specific documents, and further verify evidence using the RCS algorithm. This “search-while-thinking” mechanism enables PaperQA to perform self-correction and contextual completion to a certain extent, giving it a clear advantage over traditional RAG systems in scenarios involving cross-document reasoning and high-precision scientific verification.
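The "search-while-thinking" loop amounts to gathering evidence, self-checking its sufficiency, and searching again until the agent is satisfied or a step budget runs out. Below is a hedged sketch of that control flow only; all callables are hypothetical stand-ins for PaperQA's tool-calling agent, not its real interfaces.

```python
# Sketch of the agentic loop: keep searching and gathering evidence
# until the agent judges the evidence sufficient or the step budget
# is exhausted. All callables are hypothetical stand-ins.
def agentic_answer(question, search, gather, sufficient, generate,
                   max_steps=4):
    evidence = []
    for step in range(max_steps):
        docs = search(question, step)       # agent may reformulate each round
        evidence += gather(question, docs)  # contextual summaries for new docs
        if sufficient(question, evidence):  # self-check: enough evidence?
            break                           # stop searching, answer now
    return generate(question, evidence)
```

The key difference from a linear RAG pipeline is the `sufficient` check: generation is deferred until the agent decides the evidence supports a well-grounded answer.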


2. Integration Overview

The integration of PaperQA described in this work is based on the official v2025.12.17 release [3]. To meet the evaluation and engineering requirements of RAGView, the following adaptations and modifications were carried out.

2.1 Adding Embedding Token Usage Statistics

The original PaperQA implementation only tracks token usage for large language models (LLMs). Considering that RAGView is a benchmarking platform for RAG solutions, the cost incurred during the embedding stage is also a critical factor for users when evaluating the cost-effectiveness of a RAG system. Therefore, we extended the system to support token usage statistics for embedding models.

The implementation reuses PaperQA's existing LLM token accounting mechanism: usage information returned by the underlying model-invocation libraries, litellm and fhlmi, is uniformly captured and aggregated, enabling finer-grained measurement of overall system cost.
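The accounting itself is simple once the client returns an OpenAI-compatible `usage` block, which both litellm and fhlmi surface for embedding calls. The sketch below shows the aggregation idea; `EmbeddingUsageTracker` is our name for illustration, not a class in PaperQA.

```python
# Sketch of embedding-token accounting, assuming the underlying client
# (litellm / fhlmi) returns an OpenAI-style "usage" block per response.
# EmbeddingUsageTracker is an illustrative name, not part of PaperQA.
from collections import defaultdict

class EmbeddingUsageTracker:
    def __init__(self):
        self.prompt_tokens = defaultdict(int)  # tokens per embedding model

    def record(self, model: str, response: dict) -> None:
        usage = response.get("usage", {})      # OpenAI-compatible usage block
        self.prompt_tokens[model] += usage.get("prompt_tokens", 0)

    def total(self) -> int:
        return sum(self.prompt_tokens.values())
```

Keeping per-model counters makes it easy to report embedding cost separately from LLM cost when RAGView compares solutions.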

2.2 API Encapsulation and Containerized Deployment

Since PaperQA does not provide an official Web API implementation, we encapsulated its core functionalities into a set of RESTful services based on the FastAPI framework, enabling standardized invocation and integration within RAGView.

The project implements four core APIs:

  • File Upload API (/api/upload)
    Accepts user-uploaded documents and stores them in a specified namespace directory.

  • Index Construction API (/api/index)
    Internally calls PaperQA’s get_directory_index function, triggering document parsing and embedding generation in build=True mode to construct a local index.

  • Query API (/api/query)
    Invokes agent_query to execute the complete Agentic RAG workflow, and additionally calls get_directory_index in build=False mode to trace back and extract the original document chunks cited in the generated answer.

  • Data Cleanup API (/api/clear)
    Removes the physical storage associated with a specified namespace, completing the data lifecycle management.

After completing API development, the codebase was packaged as a Python wheel for easier deployment and reuse. As PaperQA does not officially provide a Docker image, we further encapsulated the service into a standard Docker image to satisfy production requirements for portability and deployment consistency.

Through these efforts, the core capabilities of PaperQA were successfully abstracted into a set of independently deployable and standardized Web APIs. Combined with containerized packaging, this work lays a solid foundation for stable integration and future extensibility within RAGView.


3. Practical Usage Impressions

After completing the integration of PaperQA and conducting testing, the most immediate impression is that PaperQA flexibly invokes multiple LLMs with different roles depending on task complexity. Its overall workflow involves three types of LLMs:

  • General LLM (llm)
    Serves as the primary model in the pipeline, responsible for inferring citation information, extracting titles, DOIs, and authors, and generating the final answer.

  • Summary LLM (summary_llm)
    Dedicated to contextual summarization, generating query-relevant summaries for each retrieved document chunk during the gather evidence stage.

  • Agent LLM (agent_llm)
    Orchestrates the workflow across literature search, evidence collection, and answer generation, deciding which tools to invoke and when.

In a complete PaperQA workflow, the agent_llm determines the search and reasoning strategy, the summary_llm produces high-quality evidence summaries, and the llm generates the final answer based on the aggregated evidence. In addition, PaperQA augments the user’s original query into multiple enhanced queries, collects evidence in parallel based on these augmented queries, and finally synthesizes all gathered evidence to produce the answer.
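The query-augmentation step described above can be sketched as a small concurrent fan-out. This is an illustration of the control flow only: `augment` stands in for the agent_llm's query expansion, `gather` for the summary_llm-driven evidence collection, and `generate` for the final-answer llm.

```python
# Sketch of query augmentation with parallel evidence gathering.
# `augment` (agent_llm), `gather` (summary_llm), and `generate` (llm)
# are hypothetical stand-ins for PaperQA's three model roles.
import asyncio

async def answer_with_augmentation(question, augment, gather, generate):
    queries = [question] + augment(question)  # original + enhanced queries
    # collect evidence for all queries concurrently, preserving order
    batches = await asyncio.gather(*(gather(q) for q in queries))
    evidence = [e for batch in batches for e in batch]  # merge all evidence
    return generate(question, evidence)  # synthesize the final answer
```

Running the augmented queries concurrently keeps wall-clock latency close to a single retrieval round even though several evidence-gathering passes are issued.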

Overall, as an Agentic RAG system, PaperQA demonstrates clear advantages over traditional RAG approaches in terms of rigor and answer verifiability. However, its reliance on multiple LLM invocations also implies higher inference costs, which may impose an additional burden for cost-sensitive users.


4. Future Optimization Directions

As a benchmarking platform that integrates multiple mainstream RAG solutions, RAGView uniformly divides each RAG workflow into an index stage and a query stage, enabling separate evaluation of cost consumption at different phases. This design provides important guidance for users when selecting suitable RAG solutions under cost-sensitive constraints.

However, PaperQA’s Agentic design inherently emphasizes end-to-end autonomous decision-making. Forcibly extracting and front-loading its document indexing stage, while beneficial for fair comparison with other RAG approaches, may partially weaken the agent’s autonomy within the workflow and introduce tension with its original design philosophy.

Therefore, identifying a better balance—one that adheres to PaperQA’s Agentic design principles while still enabling fair and comparable evaluation within RAGView—remains an important direction for future optimization.


References

[1] https://github.com/future-house/paper-qa
[2] Skarlinski, M., Cox, S., Laurent, J. M., et al. Language Agents Achieve Superhuman Synthesis of Scientific Knowledge. arXiv:2409.13740, 2024.
[3] https://github.com/Future-House/paper-qa/releases/tag/v2025.12.17