AI Chat Architecture Overview

The m8ty server provides a streaming AI chat endpoint that acts as a Backend-for-Frontend (BFF) orchestrator. A client sends the conversation history and, optionally, a system prompt; the server handles LLM interaction, tool selection, and tool execution internally.

Architecture Overview

[Figure: AI Chat Architecture]

Detailed Request Flow

The following diagram shows the complete sequence of a single AI chat request, including tool call handling:

[Figure: AI Chat Sequence Diagram]

Flow Steps

  1. Validation: The BFF validates the request. The messages array must be non-empty and within the configured limit (30 by default), every content value must be non-blank, and the last message must have the user role.

  2. System Prompt Resolution: The prompt is taken from the first available source: the client-provided systemPrompt, then the environment variable M8TY_AI_CHAT_SYSTEM_PROMPT, then the built-in default.

  3. Tool Selection: The latest user message is embedded and compared against the MCP RAG tool catalog via semantic search. Only the top-K tools that score above the minimum relevance score are selected.

  4. LLM Request: The BFF constructs a full OpenAI-compatible request: system message + conversation history + selected tool definitions.

  5. Streaming Response: The LLM streams back either content deltas (forwarded to the client as SSE events) or tool call requests.

  6. Tool Execution: When the LLM requests a tool call, the BFF executes it via MCP, using the client's Bearer token for authentication.

  7. Conversation Loop: Tool results are appended to the conversation and sent back to the LLM for the next iteration. The maximum number of iterations is configurable via M8TY_AI_CHAT_MAX_TOOL_CALL_ITERATIONS (default: 3).

  8. Final Answer: Once the LLM produces a text response (no more tool calls), it is streamed to the client, followed by a done event.
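
Step 4 can be illustrated with a minimal sketch of how such a request body might be assembled. The function name, the `stream` flag, and the `list_accounts` tool are assumptions made for illustration; only the overall shape (system message, then history, then tool definitions) comes from the description above.

```python
from typing import Any


def build_llm_request(
    system_prompt: str,
    history: list[dict[str, str]],
    tools: list[dict[str, Any]],
    model: str = "default",
) -> dict[str, Any]:
    """Assemble an OpenAI-style chat completions payload (illustrative only)."""
    # The system message always comes first, followed by the client history.
    messages = [{"role": "system", "content": system_prompt}, *history]
    payload: dict[str, Any] = {"model": model, "messages": messages, "stream": True}
    # Only attach tool definitions when tool selection returned something.
    if tools:
        payload["tools"] = tools
    return payload


# Hypothetical tool definition in the OpenAI function-calling format.
example_tool = {
    "type": "function",
    "function": {
        "name": "list_accounts",
        "description": "List the accounts of the current user.",
        "parameters": {"type": "object", "properties": {}},
    },
}

payload = build_llm_request(
    "You are a banking assistant.",
    [{"role": "user", "content": "What accounts do I have?"}],
    [example_tool],
)
```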

Endpoint

    POST /api/v1/ai/chat
    Authorization: Bearer <user-token>
    Content-Type: application/json

Response: text/event-stream
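
As a rough client-side sketch, the request could be prepared with Python's standard library as shown here. The host name, the token value, and the `Accept` header are assumptions; only the path, method, and the two documented headers come from this page.

```python
import json
import urllib.request


def build_chat_request(base_url: str, token: str, body: dict) -> urllib.request.Request:
    """Prepare (but do not send) a POST request for the chat endpoint."""
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/api/v1/ai/chat",
        data=json.dumps(body).encode("utf-8"),
        method="POST",
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
            # Assumption: asking for an event stream explicitly is harmless.
            "Accept": "text/event-stream",
        },
    )


# "m8ty.example.com" and "abc" are placeholders, not real values.
request = build_chat_request(
    "https://m8ty.example.com/",
    "abc",
    {"messages": [{"role": "user", "content": "What accounts do I have?"}]},
)
```

Passing the prepared request to `urllib.request.urlopen` returns a response object that can be read line by line, which suits an event stream.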

Request Format

| Field | Type | Required | Description |
|---|---|---|---|
| messages | Array | Yes | Chronological user/assistant conversation history. Min 1 item. Last message must be user. |
| systemPrompt | String | No | Custom system instruction (max 8000 chars). If omitted, the server default is used. |

Each message object:

| Field | Type | Description |
|---|---|---|
| role | String | user or assistant |
| content | String | Message text (non-blank) |

Response Format (SSE)

The response is a stream of server-sent events. Each event has a type and JSON payload:

| Event Type | Description |
|---|---|
| content | Assistant text delta; append it to the response |
| done | Stream complete |
| error | An error occurred; treat it as terminal |

Examples

Simple question (no history)

    {
      "messages": [
        {"role": "user", "content": "What accounts do I have?"}
      ]
    }

Follow-up with conversation history

    {
      "messages": [
        {"role": "user", "content": "What accounts do I have?"},
        {"role": "assistant", "content": "You have 3 accounts: checking (€2,450), savings (€15,200), and a securities depot."},
        {"role": "user", "content": "Which one had the latest transaction?"}
      ]
    }

With custom system prompt

    {
      "systemPrompt": "You are a retirement planning advisor. Answer in German. Always use tools for portfolio data.",
      "messages": [
        {"role": "user", "content": "Wie entwickelt sich mein Depot?"}
      ]
    }

SSE response example

    event: content
    data: {"type":"content","content":"Your checking account had the latest transaction: "}

    event: content
    data: {"type":"content","content":"a SEPA debit of €89.99 to Netflix on June 10."}

    event: done
    data: {"type":"done","finishReason":"stop"}
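
A client might consume such a stream along the following lines. This is a minimal sketch that assumes standard SSE framing (one `event:` line, one or more `data:` lines, then a blank line); `parse_sse` and `collect_answer` are illustrative names, not part of any m8ty SDK.

```python
import json
from typing import Iterable, Iterator


def parse_sse(lines: Iterable[str]) -> Iterator[tuple[str, dict]]:
    """Yield (event_type, payload) pairs from an iterable of SSE text lines."""
    event_type, data_lines = None, []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if not line:
            # A blank line terminates the current event.
            if data_lines:
                yield event_type or "message", json.loads("\n".join(data_lines))
            event_type, data_lines = None, []
        elif line.startswith(":"):
            continue  # SSE comment line, ignored
        elif line.startswith("event:"):
            event_type = line[len("event:"):].strip()
        elif line.startswith("data:"):
            value = line[len("data:"):]
            data_lines.append(value[1:] if value.startswith(" ") else value)
    if data_lines:  # stream ended without a trailing blank line
        yield event_type or "message", json.loads("\n".join(data_lines))


def collect_answer(lines: Iterable[str]) -> str:
    """Concatenate content deltas until done; raise on an error event."""
    parts = []
    for event_type, payload in parse_sse(lines):
        if event_type == "content":
            parts.append(payload["content"])
        elif event_type == "done":
            break
        elif event_type == "error":
            raise RuntimeError(payload.get("error", "unknown error"))
    return "".join(parts)
```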

System Prompt

The system prompt defines assistant behavior. It is resolved in the following order of priority:

  1. Client sends systemPrompt in request — used for that specific request

  2. Environment variable M8TY_AI_CHAT_SYSTEM_PROMPT — used as the server default

  3. Built-in default — banking assistant instruction (see below)
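
The priority order above could be sketched as follows. The placeholder default text and the choice to treat a blank value as absent are assumptions; the actual built-in prompt and the server's handling of blank values are not shown here.

```python
import os
from typing import Mapping, Optional

# Placeholder only: the real built-in prompt is not reproduced in this sketch.
BUILT_IN_DEFAULT_PROMPT = "You are a banking assistant."


def resolve_system_prompt(
    client_prompt: Optional[str],
    env: Optional[Mapping[str, str]] = None,
) -> str:
    """Return the first available prompt: client, then environment, then default."""
    env = os.environ if env is None else env
    # Assumption: a blank value is treated the same as a missing one.
    if client_prompt and client_prompt.strip():
        return client_prompt
    env_prompt = env.get("M8TY_AI_CHAT_SYSTEM_PROMPT")
    if env_prompt and env_prompt.strip():
        return env_prompt
    return BUILT_IN_DEFAULT_PROMPT
```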

Built-in default system prompt

Conversation History

The server is stateless — it does not persist conversations. The client owns the history:

  1. Client stores full conversation locally (e.g. secure storage, database)

  2. On every new user message, client sends the entire relevant history

  3. Server processes it, returns the assistant response via SSE

  4. Client appends the new assistant response to local history

  5. Next request includes all prior messages again

This follows the standard pattern used by OpenAI and other chat APIs.
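
A client-side history holder might look like the sketch below. The class and method names are hypothetical, and persistence (secure storage, database) is left out for brevity.

```python
from typing import Optional


class Conversation:
    """Keeps the full history on the client and builds each request body."""

    def __init__(self, system_prompt: Optional[str] = None) -> None:
        self.messages: list[dict[str, str]] = []
        self.system_prompt = system_prompt

    def request_body(self, user_text: str) -> dict:
        # Append the new user message, then send the whole history.
        self.messages.append({"role": "user", "content": user_text})
        body: dict = {"messages": list(self.messages)}
        if self.system_prompt is not None:
            body["systemPrompt"] = self.system_prompt
        return body

    def record_answer(self, assistant_text: str) -> None:
        # Called once the SSE stream has finished with a done event.
        self.messages.append({"role": "assistant", "content": assistant_text})
```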

History limits

| Constraint | Default | Env Variable |
|---|---|---|
| Max messages per request | 30 | M8TY_AI_CHAT_MAX_MESSAGE_COUNT |
| Max content per message | 32,000 chars | M8TY_AI_CHAT_MAX_CONTENT_LENGTH |

Clients should trim the oldest message pairs when approaching the limit.
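
One possible way to trim is sketched below. It assumes the history alternates user and assistant messages and ends with a user message, so dropping two messages at a time from the front keeps the final user message in place. The function name is illustrative.

```python
def trim_history(messages: list[dict], max_messages: int = 30) -> list[dict]:
    """Drop the oldest user/assistant pairs until the history fits the limit."""
    if max_messages < 1:
        raise ValueError("max_messages must be at least 1")
    trimmed = list(messages)
    while len(trimmed) > max_messages:
        # Remove a full pair when possible, but never empty the list.
        drop = 2 if len(trimmed) > 2 else 1
        del trimmed[:drop]
    return trimmed
```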

Configuration

All settings are configurable via environment variables:

| Variable | Description | Default |
|---|---|---|
| M8TY_AI_CHAT_LLM_BASE_URL | LLM backend base URL | http://localhost:8000 |
| M8TY_AI_CHAT_LLM_PATH | Chat completions endpoint path | /v1/chat/completions |
| M8TY_AI_CHAT_LLM_API_KEY | API key for the LLM backend | |
| M8TY_AI_CHAT_LLM_AUTHORIZATION_HEADER | Custom authorization header value | |
| M8TY_AI_CHAT_LLM_MODEL | Model identifier | default |
| M8TY_AI_CHAT_LLM_REQUEST_TIMEOUT | Request timeout | 120s |
| M8TY_AI_CHAT_LLM_DISABLE_THINKING | Disable thinking/reasoning mode | false |
| M8TY_AI_CHAT_MAX_MESSAGE_COUNT | Max messages per request | 30 |
| M8TY_AI_CHAT_MAX_CONTENT_LENGTH | Max content length per message | 32000 |
| M8TY_AI_CHAT_MAX_TOOL_CALL_ITERATIONS | Max tool call loop iterations | 3 |
| M8TY_AI_CHAT_TOOL_SELECTION_TOP_K | Top-K tools selected per request | 8 |
| M8TY_AI_CHAT_TOOL_SELECTION_MIN_SCORE | Minimum relevance score for tool selection | 0.25 |
| M8TY_AI_CHAT_SYSTEM_PROMPT | Default system prompt | Built-in banking instruction |
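
To illustrate how the numeric settings and their defaults fit together, here is a small sketch that reads them from an environment mapping. The `ChatLimits` class and `load_limits` function are hypothetical and say nothing about how the server itself loads configuration.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class ChatLimits:
    max_message_count: int = 30
    max_content_length: int = 32000
    max_tool_call_iterations: int = 3
    tool_selection_top_k: int = 8
    tool_selection_min_score: float = 0.25


def load_limits(env: Optional[Mapping[str, str]] = None) -> ChatLimits:
    """Read the documented variables, falling back to the documented defaults."""
    env = os.environ if env is None else env
    defaults = ChatLimits()
    return ChatLimits(
        max_message_count=int(env.get("M8TY_AI_CHAT_MAX_MESSAGE_COUNT", defaults.max_message_count)),
        max_content_length=int(env.get("M8TY_AI_CHAT_MAX_CONTENT_LENGTH", defaults.max_content_length)),
        max_tool_call_iterations=int(env.get("M8TY_AI_CHAT_MAX_TOOL_CALL_ITERATIONS", defaults.max_tool_call_iterations)),
        tool_selection_top_k=int(env.get("M8TY_AI_CHAT_TOOL_SELECTION_TOP_K", defaults.tool_selection_top_k)),
        tool_selection_min_score=float(env.get("M8TY_AI_CHAT_TOOL_SELECTION_MIN_SCORE", defaults.tool_selection_min_score)),
    )
```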

Tool Selection

The BFF automatically selects relevant MCP tools for each request:

  1. Extracts the latest user message as a semantic query

  2. Uses the MCP RAG embedding search to find matching tools (top-K, min score)

  3. Sends only the selected tools to the LLM — not the full catalog

  4. When the LLM calls a tool, the BFF executes it using the client's Bearer token

  5. Tool results are fed back to the LLM until a final text answer is produced

The client does not need to know about tools — all orchestration is server-side.
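
The top-K and minimum-score filtering can be sketched with plain cosine similarity over toy vectors. Whether the MCP RAG search actually uses cosine similarity, and whether the threshold is inclusive, is an assumption here; the tool names and vectors are made up for illustration.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 for zero vectors)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def select_tools(
    query_vec: list[float],
    catalog: dict[str, list[float]],
    top_k: int = 8,
    min_score: float = 0.25,
) -> list[str]:
    """Return up to top_k tool names whose score reaches min_score, best first."""
    scored = [(cosine(query_vec, vec), name) for name, vec in catalog.items()]
    kept = [pair for pair in scored if pair[0] >= min_score]  # inclusive: assumption
    kept.sort(key=lambda pair: pair[0], reverse=True)
    return [name for _, name in kept[:top_k]]


# Hypothetical catalog with two-dimensional toy embeddings.
catalog = {
    "list_accounts": [1.0, 0.0],
    "get_transactions": [0.8, 0.6],
    "get_weather": [0.0, 1.0],
}
selected = select_tools([1.0, 0.0], catalog)
```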

Tool call iteration limit

If the LLM keeps calling tools without producing a final answer, the loop is capped at M8TY_AI_CHAT_MAX_TOOL_CALL_ITERATIONS (default: 3). After that, an error event is emitted.
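
The capped loop could look roughly like this non-streaming sketch. `call_llm` and `execute_tool` are hypothetical callables standing in for the LLM backend and the MCP client, and counting one iteration per LLM call is an assumption about how the limit is applied.

```python
from typing import Callable, Iterator


def run_tool_loop(
    call_llm: Callable[[list[dict]], dict],
    execute_tool: Callable[[str, str], str],
    messages: list[dict],
    max_iterations: int = 3,
) -> Iterator[dict]:
    """Yield client-facing events; stop with an error once the cap is hit."""
    conversation = list(messages)
    for _ in range(max_iterations):
        reply = call_llm(conversation)
        tool_calls = reply.get("tool_calls") or []
        if not tool_calls:
            # Plain text answer: forward it and finish.
            yield {"type": "content", "content": reply["content"]}
            yield {"type": "done", "finishReason": "stop"}
            return
        # Record the assistant's tool request, then each tool result.
        conversation.append({"role": "assistant", "content": None, "tool_calls": tool_calls})
        for call in tool_calls:
            result = execute_tool(call["function"]["name"], call["function"]["arguments"])
            conversation.append({"role": "tool", "tool_call_id": call["id"], "content": result})
    yield {"type": "error", "error": "Maximum tool-call iterations reached"}
```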

Security

  • Authentication: Every request must carry a valid Bearer token

  • Token forwarding: The client's Bearer token is forwarded to MCP servers for tool execution — ensuring tool calls respect the user's permissions

  • No secrets in messages: The API rejects system/tool/developer roles from clients. Only user and assistant are accepted.

  • Content isolation: Tool call internals (arguments, raw results) are not exposed to the client

Validation Rules

  • messages must not be empty

  • messages must not exceed 30 items (configurable)

  • Each content must not be blank

  • Each content must not exceed 32,000 characters (configurable)

  • Last message must have role user

  • Only user and assistant roles are accepted
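
The rules above can be expressed as a small checker. This is a sketch with the documented defaults; the function name and error strings are illustrative, each message is assumed to be a JSON object, and the systemPrompt length check reflects the 8000-character limit from the request format.

```python
ALLOWED_ROLES = {"user", "assistant"}


def validate_request(
    body: dict,
    max_messages: int = 30,
    max_content: int = 32000,
    max_system_prompt: int = 8000,
) -> list[str]:
    """Return a list of validation errors; an empty list means the body is valid."""
    messages = body.get("messages")
    if not isinstance(messages, list) or not messages:
        return ["messages must not be empty"]
    errors = []
    if len(messages) > max_messages:
        errors.append(f"messages must not exceed {max_messages} items")
    for index, message in enumerate(messages):
        role, content = message.get("role"), message.get("content")
        if role not in ALLOWED_ROLES:
            errors.append(f"messages[{index}]: role must be user or assistant")
        if not isinstance(content, str) or not content.strip():
            errors.append(f"messages[{index}]: content must not be blank")
        elif len(content) > max_content:
            errors.append(f"messages[{index}]: content exceeds {max_content} characters")
    if messages[-1].get("role") != "user":
        errors.append("last message must have role user")
    system_prompt = body.get("systemPrompt")
    if isinstance(system_prompt, str) and len(system_prompt) > max_system_prompt:
        errors.append(f"systemPrompt exceeds {max_system_prompt} characters")
    return errors
```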

Error Responses

| Status | Condition |
|---|---|
| 400 | Validation failure (empty messages, content too long, wrong role) |
| 401 | Missing or invalid Bearer token |
| 500 | Internal server error |

Errors within the stream are delivered as SSE error events:

    event: error
    data: {"type":"error","error":"Maximum tool-call iterations reached"}
14 June 2026