LLM & Model Providers
The RAG server supports a wide range of LLM, embedding, and reranking providers. All LLM calls are routed through a LiteLLM-compatible proxy — no direct vendor SDKs in application code.
How It Works
Providers are configured in Settings → Providers with an API key and optional base URL. The server uses OpenAI-compatible API format for all providers.
Chat (LLM) Providers
Provider | Models | Auth | Notes |
|---|---|---|---|
OpenAI | gpt-4o, gpt-4o-mini, gpt-4-turbo, o1, o1-mini | API key | Default provider |
Anthropic | claude-sonnet-4, claude-opus-4, claude-3-5-haiku | API key | |
Google Gemini | gemini-2.0-flash, gemini-2.0-pro, gemini-1.5-flash, gemini-1.5-pro | API key | |
DeepSeek | deepseek-chat, deepseek-reasoner | API key | Reasoning model support |
Mistral AI | mistral-large, mistral-medium, mistral-small, codestral | API key | |
Groq | llama-3.3-70b-versatile, llama-3.1-8b-instant, mixtral-8x7b | API key | Ultra-fast inference |
Azure OpenAI | gpt-4o, gpt-4o-mini | API key + base URL | Enterprise deployments |
AWS Bedrock | claude-3-5-sonnet, claude-3-haiku, titan-embed | API key + base URL | |
NVIDIA NIM | llama-3.1-405b, llama-3.1-70b | API key | |
OpenRouter | gpt-4o, claude-3.5-sonnet, gemini-2.0-flash, llama-3.1-405b | API key | Multi-provider routing |
Together AI | llama-3.1-70b, mixtral-8x7b | API key | |
Perplexity | sonar-large-128k-online, sonar-small-128k-online | API key | Online search-augmented |
Fireworks AI | llama-3.1-70b, mixtral-8x7b | API key | |
Replicate | llama-3-70b, mixtral-8x7b | API key | |
Moonshot (Kimi) | moonshot-v1-128k, v1-32k, v1-8k | API key | Long context (Chinese) |
ZhipuAI (GLM) | glm-4-plus, glm-4-flash | API key | Chinese LLM |
MiniMax | abab6.5s-chat, abab5.5-chat | API key | |
SiliconFlow | DeepSeek-V2.5, Qwen2.5-72B | API key | |
StepFun | step-2-16k, step-1-128k | API key | |
Baidu (ERNIE) | ernie-4.0-8k, ernie-3.5-8k | API key | |
Upstage | solar-pro-preview | API key |
Local LLM Providers (Self-Hosted)
Provider | Auth | Setup |
|---|---|---|
Ollama | None (base URL only) |
|
LM Studio | None (base URL only) | Point to LM Studio server |
LocalAI | None (base URL only) | OpenAI-compatible local server |
Xinference | None (base URL only) | Supports chat, embedding, and rerank |
OpenAI-Compatible | API key + base URL | Any server with OpenAI API format |
Embedding Providers
Provider | Models | Dimensions | Notes |
|---|---|---|---|
OpenAI | text-embedding-3-small, text-embedding-3-large | 1536 / 3072 | |
FastEmbed (Local) | bge-small-en-v1.5, bge-base-en-v1.5, bge-large-en-v1.5 | 384 / 768 / 1024 | No API key, runs on CPU (ONNX) |
Ollama | nomic-embed-text, mxbai-embed-large | 768 / 1024 | Local |
Mistral AI | mistral-embed | 1024 | |
Google Gemini | text-embedding-004 | 768 | |
Jina AI | jina-embeddings-v3 | 1024 | |
Cohere | embed-english-v3.0, embed-multilingual-v3.0 | 1024 | |
HuggingFace | all-MiniLM-L6-v2, all-mpnet-base-v2 | 384 / 768 | |
Together AI | m2-bert-80M-32k-retrieval | 768 | |
Fireworks AI | nomic-embed-text-v1.5 | 768 | |
Voyage AI | voyage-3, voyage-3-lite | 1024 / 512 | |
NVIDIA NIM | nv-embedqa-e5-v5 | 1024 | |
ZhipuAI | embedding-3 | 2048 | |
SiliconFlow | bge-large-zh-v1.5 | 1024 | |
Baidu | bge-large-zh | 1024 | |
Upstage | solar-embedding-1-large-query | 4096 | |
Azure OpenAI | text-embedding-3-small, text-embedding-3-large | 1536 / 3072 | |
AWS Bedrock | titan-embed-text-v2 | 1024 |
Reranking Providers
Provider | Models | Notes |
|---|---|---|
Jina AI | jina-reranker-v2-base-multilingual | |
Cohere | rerank-english-v3.0, rerank-multilingual-v3.0 | |
Voyage AI | rerank-2 | |
Xinference | Any compatible model | Self-hosted |
Configuration
Providers are managed through the web UI at Settings → Providers:
Select a provider from the catalog
Enter API key and base URL (if required)
Add specific models you want to use
Set a default model for each capability (LLM, Embedding, Rerank)
Custom Models
For OpenAI-Compatible, Ollama, LM Studio, LocalAI, and Xinference providers, you can add custom models not in the preset list. Specify:
Model name (as the server identifies it)
Model type (chat, embedding, or rerank)
Max tokens
Vision support (yes/no)
Tool call support (yes/no)
Thinking / Reasoning Mode
Models that support extended reasoning (DeepSeek-Reasoner, QwQ, etc.) can be used with enable_thinking: true. The server streams reasoning tokens separately from the final answer, allowing the frontend to display the thought process.