Document Ingestion

The ingestion pipeline processes uploaded documents into searchable vector chunks stored in the vector backend (Qdrant or pgvector, depending on configuration).

Supported Formats

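The RAG server builds on the RAGFlow/DeepDoc parsing engine, which supports the file types below. The formats enabled in the m8ty-rag implementation are listed under Parser Routing; audio, video, and email appear under Not Yet Supported.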

File Type	Formats	Parser
PDF	PDF	DeepDoc (vision: OCR, TSR, layout detection)
Word	DOCX	Word parser (structure-preserving JSON)
PowerPoint	PPTX, PPT	Slide-by-slide extraction
Spreadsheet	XLSX, XLS, CSV	HTML output preserving table layout
Image	PNG, JPG, JPEG, GIF, TIF	OCR or VLM model
Email	EML	Field-based extraction (subject, body)
Text & Markup	TXT, MD, MDX, HTML, JSON	Tag removal, clean text output
Audio	MP3, WAV	ASR transcription to text
Video	MP4, AVI, MKV	Audio extraction + ASR

PDF Parser Options

Parser	Description
DeepDoc (default)	Vision model performing OCR, Table Structure Recognition (TSR), and Document Layout Recognition (DLR). Best for complex layouts.
Naive	Skips OCR/TSR/DLR — fast, for plain-text PDFs only
MinerU (experimental)	External open-source PDF-to-machine-readable converter
Docling (experimental)	External document processing tool for gen AI
Third-party VLM	Any configured vision-language model

Parser Routing

In the m8ty-rag implementation, the parser engine is configured per dataset:

Engine	Handled Formats	Notes
`DEEPDOC`	PDF (vision pipeline), PNG, JPG, JPEG, GIF, TIF, TIFF, BMP, WebP (OCR)	Falls back to MarkItDown for other formats
`MARKITDOWN`	DOCX, XLSX, PPTX, HTML, XML, JSON, CSV, MD, TXT	Microsoft MarkItDown library

When `DEEPDOC` is selected, files are routed by type:

PDF files → full vision pipeline (layout detection + TSR + OCR + bounding boxes)
Image files → local OCR (text detection + recognition, no internet required)
Other files → automatic fallback to MarkItDown
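
The routing rules above can be sketched as a small lookup. This is an illustration only: the function name `route_parser` and the returned pipeline labels are hypothetical, and rejecting PDFs under `MARKITDOWN` is an assumption based on the formats listed in the table, not confirmed behavior.

```python
from pathlib import Path

# Extension sets copied from the routing table above.
DEEPDOC_IMAGE_EXTS = {".png", ".jpg", ".jpeg", ".gif", ".tif", ".tiff", ".bmp", ".webp"}
MARKITDOWN_EXTS = {".docx", ".xlsx", ".pptx", ".html", ".xml", ".json", ".csv", ".md", ".txt"}


def route_parser(filename: str, engine: str) -> str:
    """Return a (hypothetical) pipeline label for a file under a dataset's engine."""
    ext = Path(filename).suffix.lower()
    if engine == "DEEPDOC":
        if ext == ".pdf":
            return "deepdoc_vision"  # layout detection + TSR + OCR + bounding boxes
        if ext in DEEPDOC_IMAGE_EXTS:
            return "deepdoc_ocr"  # local text detection + recognition
    # MARKITDOWN datasets, or the DEEPDOC fallback for other formats.
    if ext in MARKITDOWN_EXTS:
        return "markitdown"
    # Assumption: anything outside both sets is rejected.
    raise ValueError(f"unsupported format for {engine}: {filename}")
```

For example, `route_parser("report.pdf", "DEEPDOC")` selects the vision pipeline, while `route_parser("notes.docx", "DEEPDOC")` falls back to MarkItDown.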

Format Details

PDF (DeepDoc Vision)

Feature	Description
Layout detection	Identifies text blocks, tables, figures, headers, footers
Table Structure Recognition	Extracts rows, columns, cells, merged cells
OCR	Handles scanned pages and embedded images
Bounding boxes	Stores exact (x, y, width, height) for citation overlays
Multi-column	Detects and reorders multi-column layouts
Page-level metadata	page_num, page_width, page_height per chunk

Best for: contracts, reports, academic papers, scanned documents, forms with tables.
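
As a rough sketch of how the bounding-box and page-level fields from the table above might be used for a citation overlay, the record below converts a box to page-relative fractions so a viewer can scale it to any zoom level. The class name `ChunkPosition` and the method `as_fractions` are hypothetical; only the field names `page_num`, `page_width`, `page_height` and the (x, y, width, height) layout come from this page.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ChunkPosition:
    """Hypothetical container for the position metadata stored with a chunk."""
    page_num: int
    page_width: float
    page_height: float
    x: float
    y: float
    width: float
    height: float

    def as_fractions(self) -> tuple:
        """Return (x, y, width, height) as fractions of the page size."""
        return (
            self.x / self.page_width,
            self.y / self.page_height,
            self.width / self.page_width,
            self.height / self.page_height,
        )
```

A viewer rendering the page at any resolution can multiply these fractions by the rendered page size to draw the highlight.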

Images (DeepDoc OCR)

Feature	Description
Formats	PNG, JPG, JPEG, GIF, TIF, TIFF, BMP, WebP
Text detection	ONNX model (`det.onnx`) locates text regions
Text recognition	ONNX model (`rec.onnx`) reads characters
Reading order	Top-to-bottom, left-to-right sorting
Confidence filtering	Low-confidence results (< 0.5) are dropped
Processing	100% local — no internet, no API costs

Best for: screenshots, photos of documents, whiteboard captures, infographics with text.
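
The confidence filtering and reading-order steps in the table above can be sketched as a post-processing pass over recognized lines. This is a minimal illustration, not the DeepDoc code: `OcrLine`, `clean_ocr_lines`, and the `row_tolerance` bucketing are assumptions introduced here; only the 0.5 threshold and the top-to-bottom, left-to-right order come from this page.

```python
from dataclasses import dataclass


@dataclass
class OcrLine:
    """Hypothetical OCR result: recognized text plus its box origin in pixels."""
    text: str
    confidence: float
    x: float  # left edge of the text box
    y: float  # top edge of the text box


def clean_ocr_lines(lines, min_confidence=0.5, row_tolerance=10.0):
    """Drop low-confidence lines, then sort top-to-bottom and left-to-right."""
    kept = [ln for ln in lines if ln.confidence >= min_confidence]
    # Bucket y into rows (assumed tolerance) so boxes on the same visual
    # line are ordered by x rather than by small vertical jitter.
    kept.sort(key=lambda ln: (int(ln.y // row_tolerance), ln.x))
    return kept
```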

Word Documents (MarkItDown)

Feature	Description
Formats	DOCX
Extraction	Paragraphs, headings, tables, lists
Output	Clean Markdown preserving document structure

Spreadsheets (MarkItDown)

Feature	Description
Formats	XLSX, CSV
Extraction	Converts to Markdown tables
Output	Cell content with row/column structure preserved

Presentations (MarkItDown)

Feature	Description
Formats	PPTX
Extraction	Slide-by-slide title + body + notes
Output	Markdown with slide separators

Web & Markup (MarkItDown)

Feature	Description
Formats	HTML, XML, JSON, MD, TXT
HTML	Strips tags, preserves text structure
JSON	Converts to readable text representation
Markdown/TXT	Direct chunking (no conversion needed)

Not Yet Supported (Roadmap)

Format	Planned Parser	Status
Audio (MP3, WAV)	ASR transcription	Requires Whisper/ASR model
Video (MP4, AVI)	Audio extraction + ASR	Requires ffmpeg + ASR
Email (EML)	Field extraction	Planned
Scanned TIFF (multi-page)	DeepDoc page-by-page	Planned

Pipeline

Parser Engines

DeepDoc (Vision-Based)

Used for PDFs with complex layouts. Performs:

Layout detection — identifies text blocks, tables, figures, headers
Table Structure Recognition (TSR) — extracts table cells and relationships
OCR — processes scanned/image-based pages
Bounding box extraction — stores exact position coordinates for citation overlays

The GPU worker is limited to concurrency=1 to avoid running out of VRAM. Set `DEEPDOC_URL` to offload inference to a remote server (TensorRT/Triton).

MarkItDown

Used for structured digital documents such as DOCX, XLSX, PPTX, HTML, and Markdown. It is fast and needs no GPU.

Embedding Providers

Provider	Config	Models
OpenAI	API key via Settings → Providers	text-embedding-3-small, text-embedding-3-large
FastEmbed	No config needed	BAAI/bge-small-en-v1.5 (local ONNX)
Ollama	`OLLAMA_URL`	Any Ollama embedding model

The embedding model is configured per dataset. Switching models triggers re-embedding of affected documents.
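
Re-embedding is needed because vectors from different models are not comparable; for example, BAAI/bge-small-en-v1.5 produces 384-dimensional vectors while text-embedding-3-small defaults to 1536. A minimal sketch of selecting the affected documents follows. The names `DocumentRecord`, `embedded_with`, and `documents_to_reembed` are hypothetical and do not describe the server's actual schema.

```python
from dataclasses import dataclass


@dataclass
class DocumentRecord:
    """Hypothetical record noting which model produced a document's vectors."""
    document_id: str
    embedded_with: str


def documents_to_reembed(docs, dataset_model):
    """Return IDs of documents whose vectors came from a different model."""
    return [d.document_id for d in docs if d.embedded_with != dataset_model]
```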

Chunking Configuration

Parameter	Description	Default
Chunk size	Max tokens per chunk	512
Chunk overlap	Overlapping tokens between chunks	10
Chunk method	Splitting strategy	`naive`
Delimiter	Custom split delimiter	`\n`
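
To make the interaction of these parameters concrete, here is a sketch of a delimiter-aware chunker with overlap. It is not the server's `naive` method: whitespace-separated words stand in for tokenizer tokens, and the function name `chunk_text` is hypothetical.

```python
def chunk_text(text, chunk_size=512, overlap=10, delimiter="\n"):
    """Pack delimiter-separated segments into chunks of at most chunk_size tokens.

    Assumption: tokens are whitespace-separated words, a stand-in for a real
    tokenizer. Each new chunk starts with the last `overlap` tokens of the
    previous one.
    """
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")
    chunks = []
    current = []
    fresh = 0  # tokens in `current` that are not carried-over overlap

    def close_chunk():
        nonlocal current, fresh
        chunks.append(" ".join(current))
        current = current[-overlap:] if overlap else []
        fresh = 0

    for segment in text.split(delimiter):
        tokens = segment.split()
        if not tokens:
            continue
        # Prefer to break at a delimiter when the next segment will not fit.
        if fresh and len(current) + len(tokens) > chunk_size:
            close_chunk()
        for tok in tokens:
            if len(current) == chunk_size:  # segment longer than one chunk
                close_chunk()
            current.append(tok)
            fresh += 1
    if fresh:
        chunks.append(" ".join(current))
    return chunks
```

With `chunk_size=4` and `overlap=1`, the text `"a b c\nd e f\ng h"` yields `["a b c", "c d e f", "f g h"]`: breaks fall on delimiters where possible, and each chunk repeats the final token of the previous one.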

Document Versioning

Each document has a parse_generation counter:

Re-parsing increments the generation
Queries keep only chunks with parse_generation >= the document's current generation, so chunks from older parses are filtered out at query time
Background cleanup removes stale chunks
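
The generation filter can be illustrated in plain Python. In practice the same condition would presumably be pushed down to the vector backend as a payload or column filter; the `Chunk` class and both function names here are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Chunk:
    """Hypothetical stored chunk carrying the generation that produced it."""
    document_id: str
    parse_generation: int
    text: str


def live_chunks(chunks, current_generation):
    """Keep chunks from the current parse of each document.

    `current_generation` maps document_id -> current parse_generation.
    """
    return [c for c in chunks
            if c.parse_generation >= current_generation[c.document_id]]


def stale_chunks(chunks, current_generation):
    """Chunks left over from older parses; candidates for background cleanup."""
    return [c for c in chunks
            if c.parse_generation < current_generation[c.document_id]]
```

Filtering at query time means a re-parse takes effect as soon as the counter is incremented, without waiting for the cleanup job to delete old chunks.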

Task Queue

Document processing runs asynchronously via ARQ (Redis-backed):

GPU workers — concurrency=1 per GPU (DeepDoc inference)
CPU workers — higher concurrency (MarkItDown, embeddings)
Task payloads contain only references (document_id, artifact_uri) — never binary data
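
A possible shape for this setup is sketched below. ARQ reads worker configuration from plain classes with attributes such as `functions`, `max_jobs`, and `queue_name`, so the sketch needs no imports from ARQ itself. The queue names, the CPU concurrency of 8, the task function `parse_document`, and the helper `build_payload` are all assumptions for illustration, not the server's actual configuration.

```python
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class ParseTask:
    """Reference-only payload: where the file lives, never the file itself."""
    document_id: str
    artifact_uri: str


def build_payload(document_id, artifact_uri):
    """Build the keyword arguments for a task, rejecting non-string values."""
    for value in (document_id, artifact_uri):
        if not isinstance(value, str):
            raise TypeError("task payloads carry references only, not binary data")
    return asdict(ParseTask(document_id, artifact_uri))


async def parse_document(ctx, document_id: str, artifact_uri: str) -> str:
    # Placeholder body: a real task would fetch the artifact and run the parser.
    return f"parsed {document_id}"


class GpuWorkerSettings:
    functions = [parse_document]
    queue_name = "arq:gpu"  # hypothetical queue name
    max_jobs = 1            # one DeepDoc job at a time per GPU


class CpuWorkerSettings:
    functions = [parse_document]
    queue_name = "arq:cpu"  # hypothetical queue name
    max_jobs = 8            # assumed value; the page only says "higher"
```

Under these assumptions, a producer would enqueue with something like `await redis.enqueue_job("parse_document", **build_payload(doc_id, uri), _queue_name="arq:gpu")`, and each worker would be started with the `arq` CLI pointed at its settings class. Keeping payloads to references keeps Redis messages small and lets workers stream large files directly from storage.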

14 June 2026