RAG Server (RAGTY)

Multi-tenant Retrieval-Augmented Generation platform with vision-based document parsing, hybrid vector search, and AI-powered chat with citations.

What is RAG?

Retrieval-Augmented Generation (RAG) is a technique that enhances Large Language Models (LLMs) by giving them access to your private data at query time — without retraining the model.

The Problem RAG Solves

LLMs like GPT-4 or Claude are powerful but have critical limitations:

  1. Knowledge cutoff — they don't know about your internal documents, policies, or recent data

  2. Hallucination — without grounding, they invent plausible-sounding but wrong answers

  3. No access control — public LLMs can't respect your organization's data permissions

  4. No citations — you can't verify where an answer came from

How RAG Works

Instead of relying solely on the LLM's training data, RAG retrieves relevant passages from your knowledge base and augments the LLM's prompt with them before generating a response.
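The retrieve-then-augment flow described above can be sketched in a few lines. This is a minimal illustration, not RAGTY's implementation: the in-memory `KNOWLEDGE_BASE`, its sample passages, and the `retrieve` and `build_prompt` helpers are all hypothetical, and plain word overlap stands in for real vector search.

```python
import re

# Hypothetical toy knowledge base; a real deployment would query a vector store.
KNOWLEDGE_BASE = [
    {"source": "hr-handbook.pdf, p. 12",
     "text": "Employees receive 30 vacation days per year."},
    {"source": "payments.md",
     "text": "Configure the payment gateway through the gateway settings file."},
    {"source": "q3-planning.md",
     "text": "In Q3 planning the team agreed to prioritise the search rewrite."},
]


def tokenize(text: str) -> set[str]:
    """Lowercase the text and return its set of alphanumeric words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve(question: str, top_k: int = 2) -> list[dict]:
    """Return up to top_k passages that share at least one word with the question."""
    question_words = tokenize(question)
    scored = [(len(question_words & tokenize(p["text"])), p) for p in KNOWLEDGE_BASE]
    scored = [pair for pair in scored if pair[0] > 0]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [passage for _, passage in scored[:top_k]]


def build_prompt(question: str, passages: list[dict]) -> str:
    """Augment the question with numbered sources so the answer can cite them."""
    context = "\n".join(
        f"[{i}] ({p['source']}) {p['text']}" for i, p in enumerate(passages, start=1)
    )
    return (
        "Answer using only the numbered sources below and cite them like [1].\n\n"
        f"Sources:\n{context}\n\n"
        f"Question: {question}"
    )
```

Calling `build_prompt(question, retrieve(question))` yields the text that would be sent to the LLM. Numbering the sources in the prompt is what lets the generated answer carry citations that can be checked.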

Real-World Examples

| Scenario | Without RAG | With RAG |
|---|---|---|
| "What is our vacation policy?" | LLM guesses generic policies | Retrieves HR handbook → exact policy with page reference |
| "How do I configure the payment gateway?" | Outdated or wrong instructions | Retrieves latest internal docs → correct config steps |
| "What did we agree in the Q3 planning?" | "I don't have access to that" | Retrieves meeting notes → summary with citations |
| "Show me compliance requirements for DACH" | Generic EU regulation info | Retrieves your compliance docs → specific requirements |

Why a Dedicated RAG Server?

Building RAG properly requires solving many hard problems:

  • Document parsing — PDFs with tables, scanned pages, complex layouts need vision-based parsing (not just text extraction)

  • Chunking — documents must be split into meaningful pieces that preserve context

  • Hybrid search — combining semantic search (understands meaning) with keyword search (finds exact terms like product IDs)

  • Multi-tenancy — each team/customer sees only their own documents

  • Scalability — handling thousands of documents and concurrent users

  • Observability — knowing why a particular answer was generated
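As an illustration of the chunking problem, here is a minimal sketch of one simple strategy: a sliding window over words with some overlap, so that text cut at a boundary still appears with its neighbours in the next chunk. The `chunk_text` function and its default sizes are assumptions for illustration. The source does not say which chunking strategy RAGTY uses.

```python
def chunk_text(text: str, chunk_size: int = 200, overlap: int = 40) -> list[str]:
    """Split text into chunks of up to chunk_size words, sharing overlap words.

    Hypothetical helper: sizes are counted in whitespace-separated words.
    """
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("require chunk_size > 0 and 0 <= overlap < chunk_size")

    words = text.split()
    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # the window already reached the end of the text
    return chunks
```

Word-window chunking ignores headings, tables, and sentence boundaries, which is why layout-aware parsing matters for documents with complex structure.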

The RAGTY server solves all of these as a production-ready platform.
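To make the hybrid search item concrete, the sketch below merges a semantic ranking and a keyword ranking with reciprocal rank fusion (RRF), one common way to combine result lists. It is an assumption that a fusion step of this kind is used. The source does not state how RAGTY combines the two searches, and the function name and the constant `k = 60` are illustrative.

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse several ranked lists of document IDs into one ranking.

    Each document scores 1 / (k + rank) per list it appears in (rank starts
    at 1). Documents ranked highly by several lists rise to the top.
    """
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Sort by descending score; break ties by ID so the output is deterministic.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Hypothetical result lists from the two retrievers.
semantic_hits = ["doc-a", "doc-b", "doc-c"]
keyword_hits = ["doc-c", "doc-a", "doc-d"]
fused = reciprocal_rank_fusion([semantic_hits, keyword_hits])
```

RRF uses only rank positions, so it does not need the semantic and keyword scores to be on a comparable scale.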

Architecture

[Figure: RAG Server Architecture]

Tech Stack

| Component | Technology |
|---|---|
| Backend | Python 3.12, FastAPI 0.115, Pydantic 2.11 |
| Frontend | Next.js 15, React 19, TypeScript 5.8 |
| Vector Database | Qdrant or pgvector (dense cosine similarity) |
| Cache/Queue | Redis 7 (ARQ task queue) |
| Object Storage | MinIO/S3 |
| Relational DB | PostgreSQL 16 |
| Document Parsing | DeepDoc (vision-based), MarkItDown |
| Embeddings | OpenAI, FastEmbed, Ollama |
| LLM | LiteLLM proxy (OpenAI, Anthropic, local models) |
| Observability | Langfuse tracing, Ragas quality metrics |
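The tech stack lists dense cosine similarity for the vector database. As a rough sketch of what that metric computes, the pure-Python example below scores a query embedding against stored embeddings and returns the closest ones. The function names and the tiny two-dimensional vectors are hypothetical. Qdrant and pgvector perform this search with their own indexes rather than a linear scan.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def nearest(query: list[float], vectors: dict[str, list[float]], k: int = 3):
    """Return the k (doc_id, similarity) pairs most similar to the query."""
    scored = [(doc_id, cosine_similarity(query, vec)) for doc_id, vec in vectors.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]


# Hypothetical toy embeddings; real embeddings have hundreds of dimensions.
stored = {"doc-a": [1.0, 0.0], "doc-b": [0.0, 1.0], "doc-c": [1.0, 1.0]}
results = nearest([1.0, 0.0], stored, k=2)
```

Cosine similarity compares direction only, so two embeddings that point the same way score 1.0 regardless of their magnitudes.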

Prerequisites

Quick Start

docker compose up --build

| Service | URL |
|---|---|
| Frontend | http://localhost:3000 |
| Backend | http://localhost:8000 |
| MCP | http://localhost:8000/mcp/ |
| Qdrant | http://localhost:6333 (optional) |
| Redis | localhost:6379 |
| PostgreSQL | localhost:5432 |
| MinIO | http://localhost:9000 |
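After `docker compose up --build`, it can help to confirm that each service is accepting connections. The script below is a small sketch that attempts a TCP connection to the hosts and ports listed above. The `SERVICES` mapping and helper names are illustrative and not part of the project, and an open port only shows that something is listening, not that the service is healthy.

```python
import socket

# Hosts and ports taken from the service list above; Qdrant is optional.
SERVICES = {
    "Frontend": ("localhost", 3000),
    "Backend": ("localhost", 8000),
    "Qdrant": ("localhost", 6333),
    "Redis": ("localhost", 6379),
    "PostgreSQL": ("localhost", 5432),
    "MinIO": ("localhost", 9000),
}


def is_port_open(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def check_services(services: dict[str, tuple[str, int]]) -> dict[str, bool]:
    """Map each service name to whether its port accepts connections."""
    return {name: is_port_open(host, port) for name, (host, port) in services.items()}
```

Calling `check_services(SERVICES)` returns a dictionary such as `{"Frontend": True, ...}` that can be printed or used in a startup script.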

Local Development

Backend

cd backend
uv sync
uv run uvicorn app.main:app --reload

Frontend

cd frontend
npm install
npm run dev

Project Structure

m8ty-rag/
├── backend/
│   ├── app/
│   │   ├── main.py          # FastAPI composition root
│   │   ├── auth/            # Authentication bounded context
│   │   ├── chat/            # Dialog/chat bounded context
│   │   ├── embedding/       # Embedding bounded context
│   │   ├── ingestion/       # Dataset/document ingestion
│   │   ├── search/          # Hybrid search bounded context
│   │   ├── mcp/             # MCP server (tools for AI agents)
│   │   ├── tenant/          # Multi-tenancy bounded context
│   │   └── shared/          # Config, dependencies, security
│   ├── pyproject.toml
│   └── Dockerfile
├── frontend/
│   ├── src/app/             # Next.js App Router pages
│   ├── src/components/      # Reusable UI components
│   ├── src/lib/             # API client, config
│   └── Dockerfile
└── docker-compose.yml

12 June 2026