Document Ingestion

The ingestion pipeline processes uploaded documents into searchable vector chunks stored in the vector backend (Qdrant or pgvector, depending on configuration).

Supported Formats

The RAG server builds on the RAGFlow/DeepDoc parsing engine, which supports the file types below. Formats listed under Not Yet Supported (Roadmap) are not yet enabled in the m8ty-rag implementation.

| File Type | Formats | Parser |
|---|---|---|
| PDF | PDF | DeepDoc (vision: OCR, TSR, layout detection) |
| Word | DOCX | Word parser (structure-preserving JSON) |
| PowerPoint | PPTX, PPT | Slide-by-slide extraction |
| Spreadsheet | XLSX, XLS, CSV | HTML output preserving table layout |
| Image | PNG, JPG, JPEG, GIF, TIF | OCR or VLM model |
| Email | EML | Field-based extraction (subject, body) |
| Text & Markup | TXT, MD, MDX, HTML, JSON | Tag removal, clean text output |
| Audio | MP3, WAV | ASR transcription to text |
| Video | MP4, AVI, MKV | Audio extraction + ASR |

PDF Parser Options

| Parser | Description |
|---|---|
| DeepDoc (default) | Vision model performing OCR, Table Structure Recognition (TSR), and Document Layout Recognition (DLR). Best for complex layouts. |
| Naive | Skips OCR/TSR/DLR. Fast; for plain-text PDFs only. |
| MinerU (experimental) | External open-source PDF-to-machine-readable converter |
| Docling (experimental) | External document processing tool for gen AI |
| Third-party VLM | Any configured vision-language model |

Parser Routing

In the m8ty-rag implementation, the parser engine is configured per dataset:

| Engine | Handled Formats | Notes |
|---|---|---|
| DEEPDOC | PDF (vision pipeline), PNG, JPG, JPEG, GIF, TIF, TIFF, BMP, WebP (OCR) | Falls back to MarkItDown for other formats |
| MARKITDOWN | DOCX, XLSX, PPTX, HTML, XML, JSON, CSV, MD, TXT | Microsoft MarkItDown library |

When DEEPDOC is selected, files are routed as follows:

  1. PDF files → full vision pipeline (layout detection + TSR + OCR + bounding boxes)

  2. Image files → local OCR (text detection + recognition, no internet required)

  3. Other files → automatic fallback to MarkItDown
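The routing rules above can be sketched as a small lookup. This is an illustrative sketch only, not the m8ty-rag code: the function name `route_parser`, the returned labels, and the choice to raise on unlisted extensions are all assumptions. The extension sets are copied from the routing table.

```python
from pathlib import Path

# Extension sets copied from the routing table; everything else here is hypothetical.
DEEPDOC_PDF = {".pdf"}
DEEPDOC_IMAGES = {".png", ".jpg", ".jpeg", ".gif", ".tif", ".tiff", ".bmp", ".webp"}
MARKITDOWN_FORMATS = {".docx", ".xlsx", ".pptx", ".html", ".xml", ".json", ".csv", ".md", ".txt"}


def route_parser(filename: str, dataset_engine: str) -> str:
    """Return a label for the parsing path a file would take."""
    ext = Path(filename).suffix.lower()
    if dataset_engine == "DEEPDOC":
        if ext in DEEPDOC_PDF:
            return "deepdoc-vision"  # layout detection + TSR + OCR + bounding boxes
        if ext in DEEPDOC_IMAGES:
            return "deepdoc-ocr"  # local text detection + recognition
        # Any other file falls through to the MarkItDown check below.
    if ext in MARKITDOWN_FORMATS:
        return "markitdown"
    # Assumption: formats outside both lists are rejected rather than guessed at.
    raise ValueError(f"unsupported file type: {ext or filename}")
```

For example, `route_parser("notes.docx", "DEEPDOC")` takes the fallback path and returns `"markitdown"`, while `route_parser("scan.tiff", "DEEPDOC")` returns `"deepdoc-ocr"`.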

Format Details

PDF (DeepDoc Vision)

| Feature | Description |
|---|---|
| Layout detection | Identifies text blocks, tables, figures, headers, footers |
| Table Structure Recognition | Extracts rows, columns, cells, merged cells |
| OCR | Handles scanned pages and embedded images |
| Bounding boxes | Stores exact (x, y, width, height) for citation overlays |
| Multi-column | Detects and reorders multi-column layouts |
| Page-level metadata | page_num, page_width, page_height per chunk |

Best for: contracts, reports, academic papers, scanned documents, forms with tables.
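To illustrate how the stored bounding box and page-level metadata could drive a citation overlay, here is a minimal sketch. It assumes the box is expressed in the same units as page_width and page_height with a top-left origin, which this page does not state; `ChunkBox` and `to_overlay_rect` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ChunkBox:
    """Per-chunk position metadata, using the field names listed above."""
    page_num: int
    page_width: float
    page_height: float
    x: float
    y: float
    width: float
    height: float


def to_overlay_rect(
    box: ChunkBox, rendered_width: float, rendered_height: float
) -> tuple[float, float, float, float]:
    """Scale a stored box to the size at which the page is rendered on screen."""
    sx = rendered_width / box.page_width
    sy = rendered_height / box.page_height
    return (box.x * sx, box.y * sy, box.width * sx, box.height * sy)
```

Because the page dimensions are stored with each chunk, a viewer can scale the highlight to any zoom level without reopening the source PDF.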

Images (DeepDoc OCR)

| Feature | Description |
|---|---|
| Formats | PNG, JPG, JPEG, GIF, TIF, TIFF, BMP, WebP |
| Text detection | ONNX model (det.onnx) locates text regions |
| Text recognition | ONNX model (rec.onnx) reads characters |
| Reading order | Top-to-bottom, left-to-right sorting |
| Confidence filtering | Low-confidence results (< 0.5) are dropped |
| Processing | 100% local: no internet, no API costs |

Best for: screenshots, photos of documents, whiteboard captures, infographics with text.
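The confidence filter and reading-order sort described above can be sketched as follows. This is a simplified stand-in, not the DeepDoc implementation: `OcrLine`, `order_lines`, and the `row_tolerance` parameter are assumptions, and only the 0.5 threshold comes from this page.

```python
from dataclasses import dataclass

MIN_CONFIDENCE = 0.5  # threshold stated in the table above


@dataclass
class OcrLine:
    text: str
    x: float  # left edge of the detected region
    y: float  # top edge of the detected region
    confidence: float


def order_lines(lines: list[OcrLine], row_tolerance: float = 10.0) -> list[OcrLine]:
    """Drop low-confidence results, then sort top-to-bottom and left-to-right."""
    kept = sorted(
        (line for line in lines if line.confidence >= MIN_CONFIDENCE),
        key=lambda line: (line.y, line.x),
    )
    # Group lines whose tops are within row_tolerance of the row's first line,
    # so small vertical jitter does not break left-to-right order within a row.
    rows: list[list[OcrLine]] = []
    for line in kept:
        if rows and abs(line.y - rows[-1][0].y) <= row_tolerance:
            rows[-1].append(line)
        else:
            rows.append([line])
    return [line for row in rows for line in sorted(row, key=lambda item: item.x)]
```

The row grouping is a design choice in this sketch: a plain sort on (y, x) would place a region that sits two pixels higher ahead of everything to its left.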

Word Documents (MarkItDown)

| Feature | Description |
|---|---|
| Formats | DOCX |
| Extraction | Paragraphs, headings, tables, lists |
| Output | Clean Markdown preserving document structure |

Spreadsheets (MarkItDown)

| Feature | Description |
|---|---|
| Formats | XLSX, CSV |
| Extraction | Converts to Markdown tables |
| Output | Cell content with row/column structure preserved |

Presentations (MarkItDown)

| Feature | Description |
|---|---|
| Formats | PPTX |
| Extraction | Slide-by-slide title + body + notes |
| Output | Markdown with slide separators |

Web & Markup (MarkItDown)

| Feature | Description |
|---|---|
| Formats | HTML, XML, JSON, MD, TXT |
| HTML | Strips tags, preserves text structure |
| JSON | Converts to readable text representation |
| Markdown/TXT | Direct chunking (no conversion needed) |

Not Yet Supported (Roadmap)

| Format | Planned Parser | Status |
|---|---|---|
| Audio (MP3, WAV) | ASR transcription | Requires Whisper/ASR model |
| Video (MP4, AVI) | Audio extraction + ASR | Requires ffmpeg + ASR |
| Email (EML) | Field extraction | Planned |
| Scanned TIFF (multi-page) | DeepDoc page-by-page | Planned |

Pipeline

[Figure: RAG Document Processing Pipeline]

Parser Engines

DeepDoc (Vision-Based)

Used for PDFs with complex layouts. Performs:

  1. Layout detection — identifies text blocks, tables, figures, headers

  2. Table Structure Recognition (TSR) — extracts table cells and relationships

  3. OCR — processes scanned/image-based pages

  4. Bounding box extraction — stores exact position coordinates for citation overlays

The GPU worker is limited to concurrency=1 so that parallel inference jobs cannot exhaust VRAM. To offload inference to a remote server (TensorRT/Triton), set DEEPDOC_URL.
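The effect of a concurrency limit of 1 can be illustrated with a semaphore. This is a generic asyncio sketch rather than the worker's actual mechanism, which this page does not describe; `run_deepdoc` and the sleep that stands in for inference are hypothetical.

```python
import asyncio


async def run_deepdoc(doc_id: int, gate: asyncio.Semaphore, state: dict) -> int:
    """Pretend to run GPU inference for one document while holding the gate."""
    async with gate:
        state["active"] += 1
        state["peak"] = max(state["peak"], state["active"])
        await asyncio.sleep(0.01)  # stand-in for the real inference call
        state["active"] -= 1
    return doc_id


async def main() -> tuple[list[int], int]:
    gate = asyncio.Semaphore(1)  # concurrency=1: one job on the GPU at a time
    state = {"active": 0, "peak": 0}
    results = await asyncio.gather(*(run_deepdoc(i, gate, state) for i in range(4)))
    return list(results), state["peak"]


results, peak = asyncio.run(main())
```

All four jobs complete, but `peak` never exceeds 1, which is the property that keeps memory use bounded by a single job.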

MarkItDown

Used for structured digital documents such as DOCX, PPTX, and Markdown. Fast; no GPU required.

Embedding Providers

| Provider | Config | Models |
|---|---|---|
| OpenAI | API key via Settings → Providers | text-embedding-3-small, text-embedding-3-large |
| FastEmbed | No config needed | BAAI/bge-small-en-v1.5 (local ONNX) |
| Ollama | OLLAMA_URL | Any Ollama embedding model |

The embedding model is configured per dataset. Switching models triggers re-embedding of affected documents.
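Re-embedding is needed because vectors produced by different models live in different vector spaces and cannot be compared with one another. A minimal sketch of finding the affected documents, assuming each document records the model that produced its vectors (the `embedded_with` mapping and the function name are hypothetical):

```python
def documents_to_reembed(dataset_model: str, embedded_with: dict[str, str]) -> list[str]:
    """Return ids of documents whose stored vectors came from a different model."""
    return sorted(
        doc_id for doc_id, model in embedded_with.items() if model != dataset_model
    )


# Hypothetical dataset that is being switched to text-embedding-3-small.
embedded_with = {
    "doc-1": "BAAI/bge-small-en-v1.5",
    "doc-2": "text-embedding-3-small",
    "doc-3": "BAAI/bge-small-en-v1.5",
}
pending = documents_to_reembed("text-embedding-3-small", embedded_with)
```

Here `pending` contains `doc-1` and `doc-3`; `doc-2` already matches the new model and is left alone.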

Chunking Configuration

| Parameter | Description | Default |
|---|---|---|
| Chunk size | Max tokens per chunk | 512 |
| Chunk overlap | Overlapping tokens between chunks | 10 |
| Chunk method | Splitting strategy | naive |
| Delimiter | Custom split delimiter | \n |
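How these parameters interact can be sketched with a simplified splitter. It is not the actual naive chunk method: it counts whitespace-separated words as tokens (a real tokenizer would count differently), joins tokens with spaces so the original delimiter is not preserved, and `chunk_text` is a hypothetical name. The defaults mirror the values in the table.

```python
def chunk_text(
    text: str, chunk_size: int = 512, overlap: int = 10, delimiter: str = "\n"
) -> list[str]:
    """Pack delimiter-separated segments into chunks of at most chunk_size tokens."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")
    chunks: list[str] = []
    current: list[str] = []
    fresh = 0  # tokens in `current` that are not carried over from the previous chunk

    def flush() -> None:
        nonlocal current, fresh
        chunks.append(" ".join(current))
        # Start the next chunk with the tail of this one to create the overlap.
        current = current[-overlap:] if overlap else []
        fresh = 0

    for segment in text.split(delimiter):
        tokens = segment.split()
        # Prefer to break at a delimiter: close the chunk if the segment will not fit.
        if fresh and len(current) + len(tokens) > chunk_size:
            flush()
        for token in tokens:
            if len(current) == chunk_size:  # oversized segment: hard split
                flush()
            current.append(token)
            fresh += 1
    if fresh:
        chunks.append(" ".join(current))
    return chunks
```

With `chunk_size=4` and `overlap=1`, the text `"a b c\nd e f\ng h"` yields `["a b c", "c d e f", "f g h"]`: each chunk after the first begins with the last token of the previous one.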

Document Versioning

Each document has a parse_generation counter:

  • Re-parsing increments the generation

  • Queries keep only chunks whose parse_generation >= the document's current generation, so chunks from older parses are filtered out at query time

  • Background cleanup removes stale chunks
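The generation filter can be sketched in a few lines. The field name parse_generation comes from this page; the `Chunk` type, the function names, and the decision to treat documents missing from the map as deleted are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Chunk:
    document_id: str
    parse_generation: int
    text: str


def live_chunks(chunks: list[Chunk], current: dict[str, int]) -> list[Chunk]:
    """Query-time filter: keep chunks at or above the document's current generation."""
    return [
        c for c in chunks
        if c.document_id in current and c.parse_generation >= current[c.document_id]
    ]


def stale_chunks(chunks: list[Chunk], current: dict[str, int]) -> list[Chunk]:
    """Cleanup candidates: everything the query-time filter would exclude."""
    live = set(live_chunks(chunks, current))
    return [c for c in chunks if c not in live]
```

Filtering at query time means a re-parse takes effect as soon as the counter is incremented, while the deletion of old chunks can happen later in the background.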

Task Queue

Document processing runs asynchronously via ARQ (Redis-backed):

  • GPU workers — concurrency=1 per GPU (DeepDoc inference)

  • CPU workers — higher concurrency (MarkItDown, embeddings)

  • Task payloads contain only references (document_id, artifact_uri) — never binary data
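A reference-only payload might be built like this. The field names document_id and artifact_uri come from the list above; `build_payload`, the type check, and the example URI are hypothetical, and the sketch deliberately stops short of the queue itself.

```python
import json


def build_payload(document_id: str, artifact_uri: str) -> str:
    """Serialize a task payload that carries references only, never file bytes."""
    for name, value in (("document_id", document_id), ("artifact_uri", artifact_uri)):
        if not isinstance(value, str):
            # bytes, file objects, etc. are rejected so binary data never reaches Redis
            raise TypeError(f"{name} must be a string reference, got {type(value).__name__}")
    return json.dumps({"document_id": document_id, "artifact_uri": artifact_uri})


# Hypothetical identifiers; the worker would fetch the file from artifact_uri itself.
payload = build_payload("doc-42", "s3://example-bucket/doc-42.pdf")
```

Keeping payloads this small keeps Redis memory use predictable, and a retried job simply re-reads the artifact from storage. With ARQ, values like these would typically be passed as arguments to `enqueue_job`.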

14 June 2026