RagView Milestone Plan
Key Features
- Test Set Auto-Generation: Split each document in the document set with a naive chunking method and use an LLM to generate Q&A pairs from every chunk, producing the test-set data (a minimal sketch follows this list).
- Custom RAG Integration: Provide an SDK/API for developers to integrate their own RAG solutions into RagView, enabling comparison between their solutions and the open-source solutions listed below (an illustrative adapter sketch follows this list).
- Evaluation Task Optimization: Support setting up and comparing multiple configurations (different hyperparameters) of the same RAG solution.
- Evaluation Report Generation: Support automatic generation of PDF reports from evaluation results.
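A minimal sketch of how the test-set auto-generation step could look, assuming a fixed-size character chunker and a hypothetical `call_llm(prompt) -> str` helper that wraps whichever LLM backend is configured; the actual RagView pipeline may differ:

```python
import json
from typing import Callable

def naive_chunk(text: str, chunk_size: int = 800, overlap: int = 100) -> list[str]:
    """Split a document into fixed-size character chunks with a small overlap."""
    chunks, start = [], 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        start += chunk_size - overlap
    return chunks

QA_PROMPT = (
    "You are building an evaluation set for a RAG system.\n"
    "Given the passage below, write one question that can be answered "
    "from the passage alone, and its answer.\n"
    'Reply as JSON: {{"question": "...", "answer": "..."}}\n\n'
    "Passage:\n{chunk}"
)

def generate_test_set(documents: list[str], call_llm: Callable[[str], str]) -> list[dict]:
    """Produce (question, answer, source_chunk) triples for every chunk.

    call_llm is a hypothetical wrapper around the configured LLM backend.
    """
    test_set = []
    for doc in documents:
        for chunk in naive_chunk(doc):
            raw = call_llm(QA_PROMPT.format(chunk=chunk))
            try:
                qa = json.loads(raw)
            except json.JSONDecodeError:
                continue  # skip chunks where the model did not return valid JSON
            test_set.append({"question": qa["question"],
                             "answer": qa["answer"],
                             "source_chunk": chunk})
    return test_set
```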
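The integration SDK is still being designed; one plausible shape is an adapter interface that a custom RAG solution implements so RagView can drive ingestion and querying uniformly. All names below (`RagAdapter`, `RagAnswer`) are illustrative, not the final API:

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass, field

@dataclass
class RagAnswer:
    """What RagView needs back from a pipeline to score a single query."""
    answer: str                                          # generated answer text
    contexts: list[str] = field(default_factory=list)    # retrieved passages used
    latency_ms: float = 0.0                              # wall-clock time for the query
    tokens_used: int = 0                                 # token consumption, if available

class RagAdapter(ABC):
    """Illustrative interface a custom RAG solution would implement for RagView."""

    @abstractmethod
    def ingest(self, documents: list[str]) -> None:
        """Index the evaluation document set into the custom pipeline."""

    @abstractmethod
    def query(self, question: str) -> RagAnswer:
        """Answer one test-set question and report the retrieved contexts."""
```

An evaluation run would then iterate the generated test set, call `query()` for each question, and hand the returned answers and contexts to the metric layer.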
Usability Enhancements
- Email Notifications: Since evaluations are asynchronous and may take minutes to tens of minutes, add email notifications to inform users when evaluation results are ready.
- Result Charting: Generate bar charts, pie charts, radar charts, etc., from metric scores to facilitate visual comparison (a radar-chart sketch follows this list).
- Hardware Resource Profiling: Collect statistics on hardware resource usage for each evaluation pipeline, helping developers assess production feasibility (a profiling sketch follows this list).
- Optional Metrics: Make evaluation metrics optional (no longer mandatory), allowing users to select only the metrics they are interested in.
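For the result-charting item, a minimal matplotlib sketch of a radar chart comparing pipelines over a shared set of metrics; it assumes scores are already normalized to [0, 1], and the example numbers are dummy values, not real evaluation results:

```python
import numpy as np
import matplotlib.pyplot as plt

def radar_chart(scores: dict[str, dict[str, float]], out_path: str = "radar.png") -> None:
    """Plot one polygon per RAG solution over a shared set of metric axes."""
    metrics = list(next(iter(scores.values())).keys())
    angles = np.linspace(0, 2 * np.pi, len(metrics), endpoint=False).tolist()
    angles += angles[:1]  # repeat the first angle to close the polygon

    fig, ax = plt.subplots(subplot_kw={"polar": True})
    for name, metric_scores in scores.items():
        values = [metric_scores[m] for m in metrics]
        values += values[:1]
        ax.plot(angles, values, label=name)
        ax.fill(angles, values, alpha=0.15)
    ax.set_xticks(angles[:-1])
    ax.set_xticklabels(metrics)
    ax.set_ylim(0, 1)
    ax.legend(loc="lower right")
    fig.savefig(out_path, bbox_inches="tight")

if __name__ == "__main__":
    # Dummy illustrative numbers, not real evaluation results.
    radar_chart({
        "Pipeline A": {"Recall@5": 0.82, "Faithfulness": 0.77, "Latency (norm.)": 0.60},
        "Pipeline B": {"Recall@5": 0.74, "Faithfulness": 0.81, "Latency (norm.)": 0.85},
    })
```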
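For the hardware resource profiling item, a minimal sketch that samples CPU and resident memory of the current process with `psutil` while an evaluation pipeline runs; the sampling strategy and reported fields are assumptions, not the final design:

```python
import threading
import time

import psutil

def profile_resources(run_pipeline, interval: float = 1.0) -> dict:
    """Run an evaluation pipeline while sampling CPU and RSS memory of this process."""
    proc = psutil.Process()
    samples: list[tuple[float, int]] = []   # (cpu_percent, rss_bytes)
    stop = threading.Event()

    def sampler() -> None:
        proc.cpu_percent(None)              # prime the CPU counter
        while not stop.is_set():
            samples.append((proc.cpu_percent(None), proc.memory_info().rss))
            time.sleep(interval)

    thread = threading.Thread(target=sampler, daemon=True)
    start = time.time()
    thread.start()
    try:
        run_pipeline()                      # the evaluation task being profiled
    finally:
        stop.set()
        thread.join()

    cpu = [c for c, _ in samples] or [0.0]
    rss = [m for _, m in samples] or [0]
    return {
        "wall_time_s": time.time() - start,
        "avg_cpu_percent": sum(cpu) / len(cpu),
        "peak_rss_mb": max(rss) / 1e6,
    }
```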
More RAG Solutions
Legend:
✅ = Integrated | 🚧 = In Progress | ⏳ = Pending Integration
| No. | Name | GitHub Link | Features | Status |
|---|---|---|---|---|
| 0 | Langflow | langflow-ai/langflow | Build, scale, and deploy RAG and multi-agent AI apps; in RagView it is used to build the naive RAG baseline. | ✅ |
| 1 | R2R | SciPhi-AI/R2R | SoTA production-grade RAG system with Agentic RAG architecture and RESTful API support. | ✅ |
| 2 | KAG | OpenSPG/KAG | Retrieval framework combining OpenSPG engine and LLM, using logical forms for guided reasoning; overcomes traditional vector similarity limitations; supports domain-specific QA. | ⏳ |
| 3 | GraphRAG | microsoft/graphrag | Modular graph-based retrieval RAG system from Microsoft. | 🚧 |
| 4 | LightRAG | HKUDS/LightRAG | "Simple and Fast Retrieval-Augmented Generation," designed for simplicity and speed. | 🚧 |
| 5 | dsRAG | D-Star-AI/dsRAG | High-performance retrieval engine for unstructured data, suitable for complex queries and dense text. | 🚧 |
| 6 | paper-qa | Future-House/paper-qa | Scientific literature QA system with citation support and high accuracy. | ⏳ |
| 7 | cognee | topoteretes/cognee | Lightweight memory management for AI agents ("Memory for AI Agents in 5 lines of code"). | ⏳ |
| 8 | trustgraph | trustgraph-ai/trustgraph | Next-generation AI product creation platform with context engineering and LLM orchestration; supports API and private deployment. | ⏳ |
| 9 | graphiti | getzep/graphiti | Real-time knowledge graph builder for AI agents, supporting enterprise-grade applications. | ⏳ |
| 10 | DocsGPT | arc53/DocsGPT | Private AI platform supporting Agent building, deep research, document analysis, multi-model support, and API integration. | ✅ |
| 11 | youtu-graphrag | youtugraph/youtu-graphrag | Graph-based RAG framework from Tencent Youtu Lab, focusing on knowledge graph construction and reasoning for domain-specific applications. | ⏳ |
| 12 | Kiln | Kiln-AI/Kiln | Desktop app for zero-code fine-tuning, evals, synthetic data, and built-in RAG tools. | ⏳ |
| 13 | Quivr | QuivrHQ/quivr | Opinionated, fast, and efficient RAG framework that lets you focus on your product. | ⏳ |
More RAG Evaluation Metrics
We will gradually add effectiveness and efficiency metrics for RAG evaluation, including the following (reference implementations of the top-k retrieval metrics are sketched after the table):
| Metric Type | Metric Name | Description |
|---|---|---|
| Effectiveness / Quality Metrics | Recall@k | Proportion of queries where the correct answer appears in the top k retrieved documents |
| | Precision@k | Proportion of relevant documents among the top k retrieved documents |
| | MRR (Mean Reciprocal Rank) | Average reciprocal rank of the first relevant document |
| | nDCG (Normalized Discounted Cumulative Gain) | Ranking relevance metric that considers the importance of document order |
| | Answer Accuracy / F1 | Match between generated answers and reference answers (Exact Match or F1) |
| | ROUGE / BLEU / METEOR | Text overlap / language quality metrics |
| | BERTScore / MoverScore | Semantic-based answer matching metrics |
| | Context Precision | Proportion of retrieved documents that actually contribute to the answer |
| | Context Recall | Proportion of reference answer information covered by retrieved documents |
| | Context F1 | Combined score of Context Precision and Context Recall |
| | Answer-Context Alignment | Whether the answer strictly derives from the retrieved context |
| | Overall Score | Composite metric, usually a weighted combination of answer quality and context utilization |
| Efficiency / Cost Metrics | Latency | Time required from input to answer generation |
| | Token Consumption | Number of tokens consumed during answer generation |
| | Memory Usage | Memory or GPU usage during model execution |
| | API Cost / Compute Cost | Estimated cost of calling the model or retrieval API |
| | Throughput | Number of requests the system can handle per unit time |
| | Scalability | System performance change when data volume or user requests increase |
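As a concrete reference for the top-k retrieval metrics above, a minimal sketch using standard binary-relevance definitions (with a single gold chunk per query, the mean per-query recall below equals the hit rate described in the table); RagView's actual implementations may differ, e.g. in tie handling or graded relevance for nDCG:

```python
import math

def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of relevant documents that appear in the top-k results."""
    if not relevant:
        return 0.0
    return len(set(retrieved[:k]) & relevant) / len(relevant)

def precision_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of the top-k results that are relevant."""
    if k == 0:
        return 0.0
    return len(set(retrieved[:k]) & relevant) / k

def mrr(all_retrieved: list[list[str]], all_relevant: list[set[str]]) -> float:
    """Mean reciprocal rank of the first relevant document over all queries."""
    if not all_retrieved:
        return 0.0
    total = 0.0
    for retrieved, relevant in zip(all_retrieved, all_relevant):
        for rank, doc in enumerate(retrieved, start=1):
            if doc in relevant:
                total += 1.0 / rank
                break
    return total / len(all_retrieved)

def ndcg_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Binary-relevance nDCG@k: DCG of the ranking divided by the ideal DCG."""
    dcg = sum(1.0 / math.log2(i + 2)               # i is 0-based, so rank = i + 1
              for i, doc in enumerate(retrieved[:k]) if doc in relevant)
    ideal = sum(1.0 / math.log2(i + 2) for i in range(min(k, len(relevant))))
    return dcg / ideal if ideal > 0 else 0.0
```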