RAG Server (RAGTY)
Multi-tenant Retrieval-Augmented Generation platform with vision-based document parsing, hybrid vector search, and AI-powered chat with citations.
What is RAG?
Retrieval-Augmented Generation (RAG) is a technique that enhances Large Language Models (LLMs) by giving them access to your private data at query time — without retraining the model.
The Problem RAG Solves
LLMs like GPT-4 or Claude are powerful but have critical limitations:
Knowledge cutoff — they don't know about your internal documents, policies, or recent data
Hallucination — without grounding, they invent plausible-sounding but wrong answers
No access control — public LLMs can't respect your organization's data permissions
No citations — you can't verify where an answer came from
How RAG Works

Instead of relying solely on the LLM's training data, RAG retrieves relevant passages from your knowledge base and augments the LLM's prompt with them before generating a response.
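The retrieve-then-augment flow described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not RAGTY's implementation: the word-overlap scoring stands in for a real embedding search, the sample passages are invented, and `KNOWLEDGE_BASE`, `tokenize`, `retrieve`, and `build_prompt` are hypothetical names.

```python
import re

# Hypothetical in-memory knowledge base with invented sample passages.
# A real deployment would query a vector store instead.
KNOWLEDGE_BASE = [
    {"source": "hr-handbook.pdf, p. 12", "text": "Employees receive 30 vacation days per year."},
    {"source": "payments.md", "text": "Configure the payment gateway with the merchant key."},
    {"source": "q3-planning.md", "text": "Q3 planning agreed to ship the search feature first."},
]


def tokenize(text: str) -> set[str]:
    """Lowercase the text and return its set of alphanumeric words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve(question: str, top_k: int = 2) -> list[dict]:
    """Rank passages by word overlap with the question (a stand-in for vector search)."""
    query_words = tokenize(question)
    scored = [(len(query_words & tokenize(p["text"])), p) for p in KNOWLEDGE_BASE]
    scored = [item for item in scored if item[0] > 0]  # drop passages with no overlap
    scored.sort(key=lambda item: item[0], reverse=True)
    return [passage for _, passage in scored[:top_k]]


def build_prompt(question: str, passages: list[dict]) -> str:
    """Augment the question with numbered, source-tagged context passages."""
    context = "\n".join(
        f"[{i}] ({p['source']}) {p['text']}" for i, p in enumerate(passages, start=1)
    )
    return (
        "Answer using only the context below and cite sources by number.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}"
    )


question = "How many vacation days do employees get?"
prompt = build_prompt(question, retrieve(question))
print(prompt)
```

The resulting `prompt` string is what would be sent to the LLM. Because each passage carries its source, the model can cite where an answer came from.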
Real-World Examples
| Scenario | Without RAG | With RAG |
|---|---|---|
| "What is our vacation policy?" | LLM guesses generic policies | Retrieves HR handbook → exact policy with page reference |
| "How do I configure the payment gateway?" | Outdated or wrong instructions | Retrieves latest internal docs → correct config steps |
| "What did we agree in the Q3 planning?" | "I don't have access to that" | Retrieves meeting notes → summary with citations |
| "Show me compliance requirements for DACH" | Generic EU regulation info | Retrieves your compliance docs → specific requirements |
Why a Dedicated RAG Server?
Building RAG properly requires solving many hard problems:
Document parsing — PDFs with tables, scanned pages, complex layouts need vision-based parsing (not just text extraction)
Chunking — documents must be split into meaningful pieces that preserve context
Hybrid search — combining semantic search (understands meaning) with keyword search (finds exact terms like product IDs)
Multi-tenancy — each team/customer sees only their own documents
Scalability — handling thousands of documents and concurrent users
Observability — knowing why a particular answer was generated
The RAGTY server addresses each of these in a single production-ready platform.
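Of the problems listed above, hybrid search is the easiest to show in isolation. The sketch below merges a semantic ranking and a keyword ranking with reciprocal rank fusion (RRF), one common fusion technique. Whether RAGTY uses RRF is an assumption, not something stated here; the function name, document IDs, and the constant `k = 60` are illustrative.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Merge several ranked lists of document IDs into one list.

    Each document scores 1 / (k + rank) for every list it appears in, so
    documents ranked highly by more than one retriever rise to the top.
    """
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=lambda doc_id: scores[doc_id], reverse=True)


# Hypothetical results for a query such as "error in SKU-4711 checkout".
semantic_hits = ["doc-checkout-guide", "doc-sku-4711", "doc-refunds"]  # matched by meaning
keyword_hits = ["doc-sku-4711", "doc-price-list"]  # matched the exact product ID

fused = reciprocal_rank_fusion([semantic_hits, keyword_hits])
print(fused)
```

Here `doc-sku-4711` ends up first because both retrievers returned it, even though the semantic search alone ranked it second.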
Architecture

Tech Stack
| Component | Technology |
|---|---|
| Backend | Python 3.12, FastAPI 0.115, Pydantic 2.11 |
| Frontend | Next.js 15, React 19, TypeScript 5.8 |
| Vector Database | Qdrant or pgvector (dense cosine similarity) |
| Cache/Queue | Redis 7 (ARQ task queue) |
| Object Storage | MinIO/S3 |
| Relational DB | PostgreSQL 16 |
| Document Parsing | DeepDoc (vision-based), MarkItDown |
| Embeddings | OpenAI, FastEmbed, Ollama |
| LLM | LiteLLM proxy (OpenAI, Anthropic, local models) |
| Observability | Langfuse tracing, Ragas quality metrics |
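The vector database row mentions dense cosine similarity. As a rough illustration of what that score measures, here is a pure-Python sketch. In practice Qdrant or pgvector computes this internally, and the three-dimensional vectors below are toy values rather than real embeddings.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Return the cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention chosen for this sketch: zero vectors match nothing
    return dot / (norm_a * norm_b)


# Toy "embeddings": same direction scores near 1, orthogonal scores 0.
query = [1.0, 0.0, 1.0]
same_direction = [2.0, 0.0, 2.0]
orthogonal = [0.0, 1.0, 0.0]

print(cosine_similarity(query, same_direction))
print(cosine_similarity(query, orthogonal))
```

Because the score depends only on direction, not magnitude, two chunks of different lengths can still rank as close matches.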
Prerequisites
uv (Python package manager)
Quick Start
| Service | URL |
|---|---|
| Frontend | http://localhost:3000 |
| Backend | http://localhost:8000 |
| MCP | http://localhost:8000/mcp/ |
| Qdrant | http://localhost:6333 (optional) |
| Redis | localhost:6379 |
| PostgreSQL | localhost:5432 |
| MinIO | http://localhost:9000 |
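Once the stack is running, a quick way to confirm that the services in the table are reachable is to probe their TCP ports. The script below is a convenience sketch and not part of RAGTY. It only checks that each port accepts a connection, not that the service behind it is healthy. The hosts and ports come from the table; `SERVICES` and `is_port_open` are hypothetical names.

```python
import socket

# Host/port pairs copied from the service table above.
# MCP is served by the backend on port 8000, so it has no separate entry.
SERVICES = {
    "Frontend": ("localhost", 3000),
    "Backend": ("localhost", 8000),
    "Qdrant (optional)": ("localhost", 6333),
    "Redis": ("localhost", 6379),
    "PostgreSQL": ("localhost", 5432),
    "MinIO": ("localhost", 9000),
}


def is_port_open(host: str, port: int, timeout: float = 0.5) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


if __name__ == "__main__":
    for name, (host, port) in SERVICES.items():
        status = "up" if is_port_open(host, port) else "down"
        print(f"{name:<18} {host}:{port:<5} {status}")
```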
Local Development
Backend
Frontend
Project Structure
Topics
Multi-Tenancy — Roles, permissions, tenant isolation
MCP Server — AI agent integration, tools, client configuration
API Keys — Personal and dataset-scoped key management
Document Ingestion — Parsing, chunking, embedding pipeline
Search & Chat — Hybrid retrieval, reranking, streaming chat
Configuration — Environment variables, security, deployment