Help Instance Help

LLM & Model Providers

The RAG server supports a wide range of LLM, embedding, and reranking providers. All LLM calls are routed through a LiteLLM-compatible proxy — no direct vendor SDKs in application code.

How It Works

User Request → RAG Backend → LiteLLM Proxy → Provider API

Providers are configured in Settings → Providers with an API key and optional base URL. The server uses OpenAI-compatible API format for all providers.

Chat (LLM) Providers

Provider

Models

Auth

Notes

OpenAI

gpt-4o, gpt-4o-mini, gpt-4-turbo, o1, o1-mini

API key

Default provider

Anthropic

claude-sonnet-4, claude-opus-4, claude-3-5-haiku

API key

Google Gemini

gemini-2.0-flash, gemini-2.0-pro, gemini-1.5-flash, gemini-1.5-pro

API key

DeepSeek

deepseek-chat, deepseek-reasoner

API key

Reasoning model support

Mistral AI

mistral-large, mistral-medium, mistral-small, codestral

API key

Groq

llama-3.3-70b-versatile, llama-3.1-8b-instant, mixtral-8x7b

API key

Ultra-fast inference

Azure OpenAI

gpt-4o, gpt-4o-mini

API key + base URL

Enterprise deployments

AWS Bedrock

claude-3-5-sonnet, claude-3-haiku, titan-embed

API key + base URL

NVIDIA NIM

llama-3.1-405b, llama-3.1-70b

API key

OpenRouter

gpt-4o, claude-3.5-sonnet, gemini-2.0-flash, llama-3.1-405b

API key

Multi-provider routing

Together AI

llama-3.1-70b, mixtral-8x7b

API key

Perplexity

sonar-large-128k-online, sonar-small-128k-online

API key

Online search-augmented

Fireworks AI

llama-3.1-70b, mixtral-8x7b

API key

Replicate

llama-3-70b, mixtral-8x7b

API key

Moonshot (Kimi)

moonshot-v1-128k, v1-32k, v1-8k

API key

Long context (Chinese)

ZhipuAI (GLM)

glm-4-plus, glm-4-flash

API key

Chinese LLM

MiniMax

abab6.5s-chat, abab5.5-chat

API key

SiliconFlow

DeepSeek-V2.5, Qwen2.5-72B

API key

StepFun

step-2-16k, step-1-128k

API key

Baidu (ERNIE)

ernie-4.0-8k, ernie-3.5-8k

API key

Upstage

solar-pro-preview

API key

Local LLM Providers (Self-Hosted)

Provider

Auth

Setup

Ollama

None (base URL only)

OLLAMA_URL=http://localhost:11434

LM Studio

None (base URL only)

Point to LM Studio server

LocalAI

None (base URL only)

OpenAI-compatible local server

Xinference

None (base URL only)

Supports chat, embedding, and rerank

OpenAI-Compatible

API key + base URL

Any server with OpenAI API format

Embedding Providers

Provider

Models

Dimensions

Notes

OpenAI

text-embedding-3-small, text-embedding-3-large

1536 / 3072

FastEmbed (Local)

bge-small-en-v1.5, bge-base-en-v1.5, bge-large-en-v1.5

384 / 768 / 1024

No API key, runs on CPU (ONNX)

Ollama

nomic-embed-text, mxbai-embed-large

768 / 1024

Local

Mistral AI

mistral-embed

1024

Google Gemini

text-embedding-004

768

Jina AI

jina-embeddings-v3

1024

Cohere

embed-english-v3.0, embed-multilingual-v3.0

1024

HuggingFace

all-MiniLM-L6-v2, all-mpnet-base-v2

384 / 768

Together AI

m2-bert-80M-32k-retrieval

768

Fireworks AI

nomic-embed-text-v1.5

768

Voyage AI

voyage-3, voyage-3-lite

1024 / 512

NVIDIA NIM

nv-embedqa-e5-v5

1024

ZhipuAI

embedding-3

2048

SiliconFlow

bge-large-zh-v1.5

1024

Baidu

bge-large-zh

1024

Upstage

solar-embedding-1-large-query

4096

Azure OpenAI

text-embedding-3-small, text-embedding-3-large

1536 / 3072

AWS Bedrock

titan-embed-text-v2

1024

Reranking Providers

Provider

Models

Notes

Jina AI

jina-reranker-v2-base-multilingual

Cohere

rerank-english-v3.0, rerank-multilingual-v3.0

Voyage AI

rerank-2

Xinference

Any compatible model

Self-hosted

Configuration

Providers are managed through the web UI at Settings → Providers:

  1. Select a provider from the catalog

  2. Enter API key and base URL (if required)

  3. Add specific models you want to use

  4. Set a default model for each capability (LLM, Embedding, Rerank)

Custom Models

For OpenAI-Compatible, Ollama, LM Studio, LocalAI, and Xinference providers, you can add custom models not in the preset list. Specify:

  • Model name (as the server identifies it)

  • Model type (chat, embedding, or rerank)

  • Max tokens

  • Vision support (yes/no)

  • Tool call support (yes/no)

Thinking / Reasoning Mode

Models that support extended reasoning (DeepSeek-Reasoner, QwQ, etc.) can be used with enable_thinking: true. The server streams reasoning tokens separately from the final answer, allowing the frontend to display the thought process.

12 June 2026