Document Ingestion
The ingestion pipeline processes uploaded documents into searchable vector chunks stored in the vector backend (Qdrant or pgvector, depending on configuration).
Supported Formats
The RAG server builds on the RAGFlow/DeepDoc parsing engine. The engine itself supports the file types below; audio, video, and email are not yet enabled in m8ty-rag (see Not Yet Supported (Roadmap)):
File Type | Formats | Parser |
|---|---|---|
PDF | PDF | DeepDoc (vision: OCR, TSR, layout detection) |
Word | DOCX | Word parser (structure-preserving JSON) |
PowerPoint | PPTX, PPT | Slide-by-slide extraction |
Spreadsheet | XLSX, XLS, CSV | HTML output preserving table layout |
Image | PNG, JPG, JPEG, GIF, TIF | OCR or VLM model |
Email | EML | Field-based extraction (subject, body) |
Text & Markup | TXT, MD, MDX, HTML, JSON | Tag removal, clean text output |
Audio | MP3, WAV | ASR transcription to text |
Video | MP4, AVI, MKV | Audio extraction + ASR |
PDF Parser Options
Parser | Description |
|---|---|
DeepDoc (default) | Vision model performing OCR, Table Structure Recognition (TSR), and Document Layout Recognition (DLR). Best for complex layouts. |
Naive | Skips OCR/TSR/DLR — fast, for plain-text PDFs only |
MinerU (experimental) | External open-source PDF-to-machine-readable converter |
Docling (experimental) | External document processing tool for gen AI |
Third-party VLM | Any configured vision-language model |
Parser Routing
In the m8ty-rag implementation, the parser engine is configured per dataset:
Engine | Handled Formats | Notes |
|---|---|---|
DeepDoc (DEEPDOC) | PDF (vision pipeline); PNG, JPG, JPEG, GIF, TIF, TIFF, BMP, WebP (OCR) | Falls back to MarkItDown for other formats |
MarkItDown | DOCX, XLSX, PPTX, HTML, XML, JSON, CSV, MD, TXT | Microsoft MarkItDown library |
When DEEPDOC is selected, files are routed by type:
PDF files → full vision pipeline (layout detection + TSR + OCR + bounding boxes)
Image files → local OCR (text detection + recognition, no internet required)
Other files → automatic fallback to MarkItDown
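The routing rules above can be sketched as a small lookup. This is a minimal illustration, not the m8ty-rag implementation: the function name `route_parser`, the returned labels, and the behaviour for formats outside the table (raising an error) are assumptions. Only the extension lists and the DEEPDOC fallback rule come from the routing table.

```python
from pathlib import Path

# Extension sets copied from the routing table; the set names are illustrative.
DEEPDOC_PDF = {".pdf"}
DEEPDOC_IMAGES = {".png", ".jpg", ".jpeg", ".gif", ".tif", ".tiff", ".bmp", ".webp"}
MARKITDOWN_FORMATS = {
    ".docx", ".xlsx", ".pptx", ".html", ".xml", ".json", ".csv", ".md", ".txt",
}


def route_parser(filename: str, dataset_engine: str) -> str:
    """Return a label for the parsing path a file would take (hypothetical helper)."""
    ext = Path(filename).suffix.lower()
    if dataset_engine == "DEEPDOC":
        if ext in DEEPDOC_PDF:
            return "deepdoc-vision"  # layout detection + TSR + OCR + bounding boxes
        if ext in DEEPDOC_IMAGES:
            return "deepdoc-ocr"  # local text detection + recognition
        # Any other format falls through to MarkItDown.
    if ext in MARKITDOWN_FORMATS:
        return "markitdown"
    # Assumption: formats outside both lists are rejected.
    raise ValueError(f"unsupported format: {ext or filename}")
```

For example, `route_parser("notes.docx", "DEEPDOC")` takes the MarkItDown fallback, while `route_parser("scan.tiff", "DEEPDOC")` goes to local OCR.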
Format Details
PDF (DeepDoc Vision)
Feature | Description |
|---|---|
Layout detection | Identifies text blocks, tables, figures, headers, footers |
Table Structure Recognition | Extracts rows, columns, cells, merged cells |
OCR | Handles scanned pages and embedded images |
Bounding boxes | Stores exact (x, y, width, height) for citation overlays |
Multi-column | Detects and reorders multi-column layouts |
Page-level metadata | page_num, page_width, page_height per chunk |
Best for: contracts, reports, academic papers, scanned documents, forms with tables.
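To illustrate how the stored bounding boxes and page-level metadata could drive a citation overlay, here is a hedged sketch that scales a box from page coordinates to viewer pixels. The `ChunkBox` class and `to_viewer_rect` function are hypothetical, and the sketch assumes the box uses the same unit and origin as page_width and page_height.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ChunkBox:
    """Per-chunk position metadata, mirroring the fields listed above."""
    page_num: int
    page_width: float
    page_height: float
    x: float
    y: float
    width: float
    height: float


def to_viewer_rect(box: ChunkBox, viewer_width: float, viewer_height: float):
    """Scale a page-space box to the pixel size of the rendered page."""
    scale_x = viewer_width / box.page_width
    scale_y = viewer_height / box.page_height
    return (
        box.x * scale_x,
        box.y * scale_y,
        box.width * scale_x,
        box.height * scale_y,
    )
```

Keeping page_width and page_height on every chunk means the overlay can be drawn at any zoom level without re-reading the source PDF.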
Images (DeepDoc OCR)
Feature | Description |
|---|---|
Formats | PNG, JPG, JPEG, GIF, TIF, TIFF, BMP, WebP |
Text detection | ONNX model ( |
Text recognition | ONNX model ( |
Reading order | Top-to-bottom, left-to-right sorting |
Confidence filtering | Low-confidence results (< 0.5) are dropped |
Processing | 100% local — no internet, no API costs |
Best for: screenshots, photos of documents, whiteboard captures, infographics with text.
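The confidence filtering and reading-order steps can be sketched as follows. This is an assumption-laden illustration rather than the DeepDoc code: `OcrBox`, `order_ocr_results`, and the `line_tolerance` parameter are hypothetical, and only the 0.5 threshold and the top-to-bottom, left-to-right ordering come from the table above.

```python
from dataclasses import dataclass


@dataclass
class OcrBox:
    text: str
    x: float  # left edge of the detected text box
    y: float  # top edge of the detected text box
    confidence: float


def order_ocr_results(boxes, min_confidence=0.5, line_tolerance=10.0):
    """Drop low-confidence boxes, then return text in reading order."""
    kept = [b for b in boxes if b.confidence >= min_confidence]
    kept.sort(key=lambda b: (b.y, b.x))

    # Group boxes whose top edges are close together into the same visual line.
    lines = []
    for box in kept:
        if lines and abs(box.y - lines[-1][0].y) <= line_tolerance:
            lines[-1].append(box)
        else:
            lines.append([box])

    # Within each line, read left to right.
    return [b.text for line in lines for b in sorted(line, key=lambda b: b.x)]
```

Grouping by a vertical tolerance keeps words on a slightly skewed line together instead of interleaving them with the line below.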
Word Documents (MarkItDown)
Feature | Description |
|---|---|
Formats | DOCX |
Extraction | Paragraphs, headings, tables, lists |
Output | Clean Markdown preserving document structure |
Spreadsheets (MarkItDown)
Feature | Description |
|---|---|
Formats | XLSX, CSV |
Extraction | Converts to Markdown tables |
Output | Cell content with row/column structure preserved |
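As a rough picture of what "converts to Markdown tables" means for a CSV file, the sketch below uses only the Python standard library. It is not MarkItDown's implementation; the function name `csv_to_markdown` and the choice to treat the first row as the header are assumptions.

```python
import csv
import io


def csv_to_markdown(csv_text: str) -> str:
    """Render CSV text as a Markdown table, treating the first row as the header."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    if not rows:
        return ""
    width = max(len(row) for row in rows)

    def fmt(row):
        # Escape pipes so cell content cannot break the table, and pad short rows.
        cells = [cell.strip().replace("|", "\\|") for cell in row]
        cells += [""] * (width - len(cells))
        return "| " + " | ".join(cells) + " |"

    header, *body = rows
    lines = [fmt(header), "|" + "---|" * width]
    lines.extend(fmt(row) for row in body)
    return "\n".join(lines)
```

Padding short rows keeps the row/column structure intact even when the source file has ragged lines.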
Presentations (MarkItDown)
Feature | Description |
|---|---|
Formats | PPTX |
Extraction | Slide-by-slide title + body + notes |
Output | Markdown with slide separators |
Web & Markup (MarkItDown)
Feature | Description |
|---|---|
Formats | HTML, XML, JSON, MD, TXT |
HTML | Strips tags, preserves text structure |
JSON | Converts to readable text representation |
Markdown/TXT | Direct chunking (no conversion needed) |
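The HTML row ("strips tags, preserves text structure") can be illustrated with the standard library's `html.parser`. This is a simplified stand-in, not what MarkItDown does internally; the class name, the list of block-level tags, and the decision to skip script and style content are assumptions.

```python
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    BLOCK_TAGS = {"p", "div", "br", "li", "tr", "h1", "h2", "h3", "h4", "h5", "h6"}
    SKIP_TAGS = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP_TAGS:
            self._skip_depth += 1
        elif tag in self.BLOCK_TAGS:
            self.parts.append("\n")  # block boundaries become line breaks

    def handle_endtag(self, tag):
        if tag in self.SKIP_TAGS and self._skip_depth:
            self._skip_depth -= 1
        elif tag in self.BLOCK_TAGS:
            self.parts.append("\n")

    def handle_data(self, data):
        if not self._skip_depth:
            self.parts.append(data)


def html_to_text(html: str) -> str:
    """Strip tags while keeping one line of text per block element."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    lines = (" ".join(line.split()) for line in "".join(parser.parts).splitlines())
    return "\n".join(line for line in lines if line)
```

Turning block boundaries into line breaks is what keeps headings and paragraphs separate for the chunker that runs afterwards.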
Not Yet Supported (Roadmap)
Format | Planned Parser | Status |
|---|---|---|
Audio (MP3, WAV) | ASR transcription | Requires Whisper/ASR model |
Video (MP4, AVI) | Audio extraction + ASR | Requires ffmpeg + ASR |
Email (EML) | Field extraction | Planned |
Scanned TIFF (multi-page) | DeepDoc page-by-page | Planned |
Pipeline

Parser Engines
DeepDoc (Vision-Based)
Used for PDFs with complex layouts. Performs:
Layout detection — identifies text blocks, tables, figures, headers
Table Structure Recognition (TSR) — extracts table cells and relationships
OCR — processes scanned/image-based pages
Bounding box extraction — stores exact position coordinates for citation overlays
The GPU worker is limited to concurrency=1 to prevent VRAM out-of-memory errors. Set DEEPDOC_URL to offload inference to a remote server (TensorRT/Triton).
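The effect of a concurrency limit of 1 can be shown with a plain `asyncio.Semaphore`. This is only a sketch of the idea: `run_deepdoc` and the sleep standing in for inference are hypothetical, and in an ARQ deployment the same limit would more likely be set through the worker's `max_jobs` setting.

```python
import asyncio

GPU_CONCURRENCY = 1  # one DeepDoc job per GPU at a time


async def run_deepdoc(document_id, gpu_slot, log):
    """Hypothetical job: only one coroutine may hold the GPU slot at once."""
    async with gpu_slot:
        log.append(("start", document_id))
        await asyncio.sleep(0.01)  # placeholder for vision-model inference
        log.append(("end", document_id))


async def main():
    gpu_slot = asyncio.Semaphore(GPU_CONCURRENCY)
    log = []
    await asyncio.gather(
        *(run_deepdoc(doc, gpu_slot, log) for doc in ["doc-1", "doc-2", "doc-3"])
    )
    return log
```

Running `asyncio.run(main())` yields a log in which every job ends before the next one starts, so two models never compete for VRAM.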
MarkItDown
Used for structured digital documents (DOCX, PPTX, Markdown). Fast, no GPU required.
Embedding Providers
Provider | Config | Models |
|---|---|---|
OpenAI | API key via Settings → Providers | text-embedding-3-small, text-embedding-3-large |
FastEmbed | No config needed | BAAI/bge-small-en-v1.5 (local ONNX) |
Ollama | | Any Ollama embedding model |
The embedding model is configured per dataset. Switching models triggers re-embedding of affected documents.
Chunking Configuration
Parameter | Description | Default |
|---|---|---|
Chunk size | Max tokens per chunk | 512 |
Chunk overlap | Overlapping tokens between chunks | 10 |
Chunk method | Splitting strategy | |
Delimiter | Custom split delimiter | |
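To make the interaction between chunk size and chunk overlap concrete, here is a minimal sliding-window sketch using the defaults from the table (512 and 10). The function `chunk_tokens` is hypothetical and works on an already tokenized list; which tokenizer the pipeline uses is not stated here, so that part is left out.

```python
def chunk_tokens(tokens, chunk_size=512, overlap=10):
    """Split a token list into windows of chunk_size that share `overlap` tokens."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be in [0, chunk_size)")

    step = chunk_size - overlap  # each new chunk starts this many tokens later
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # the last window already reached the end of the document
    return chunks
```

With the defaults, a 1,200-token document produces three chunks starting at tokens 0, 502, and 1004, and each chunk repeats the last 10 tokens of the previous one.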
Document Versioning
Each document has a parse_generation counter:
Re-parsing increments the generation
Old chunks are filtered out at query time: only chunks with parse_generation >= current are returned
Background cleanup removes stale chunks
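The generation filter can be sketched in a few lines. The `Chunk` class and both helper functions are hypothetical, and in practice the condition would more likely be pushed down as a filter in the vector backend query; the comparison itself follows the rule stated above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Chunk:
    document_id: str
    parse_generation: int
    text: str


def visible_chunks(chunks, current_generation):
    """Keep only chunks whose generation is at least the document's current one."""
    return [
        c for c in chunks
        if c.parse_generation >= current_generation.get(c.document_id, 0)
    ]


def stale_chunks(chunks, current_generation):
    """Chunks a background cleanup job could delete."""
    return [
        c for c in chunks
        if c.parse_generation < current_generation.get(c.document_id, 0)
    ]
```

Filtering at query time means a re-parse takes effect as soon as the counter is incremented, without waiting for the cleanup job to finish.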
Task Queue
Document processing runs asynchronously via ARQ (Redis-backed):
GPU workers — concurrency=1 per GPU (DeepDoc inference)
CPU workers — higher concurrency (MarkItDown, embeddings)
Task payloads contain only references (document_id, artifact_uri) — never binary data
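The references-only rule for task payloads can be sketched as a small builder that refuses anything but string references. The function `build_parse_payload` and the example URI are hypothetical; with ARQ, the resulting values would typically be passed as arguments to `enqueue_job`.

```python
import json


def build_parse_payload(document_id, artifact_uri):
    """Build a queue payload that carries references only, never file contents."""
    payload = {"document_id": document_id, "artifact_uri": artifact_uri}
    for key, value in payload.items():
        # Reject bytes and other non-string values so binary data
        # cannot be pushed through Redis by accident.
        if not isinstance(value, str):
            raise TypeError(f"{key} must be a string reference, got {type(value).__name__}")
    return payload


# Example with a made-up URI; the payload stays small and JSON-serializable.
example = json.dumps(build_parse_payload("doc-1", "s3://bucket/doc-1.pdf"))
```

Small payloads keep the Redis queue light, and the worker fetches the actual file from storage using artifact_uri when the job runs.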