AI Chat

The m8ty server provides a streaming AI chat endpoint that acts as a Backend-for-Frontend (BFF) orchestrator. A client sends the conversation history and, optionally, a system prompt; the server handles LLM interaction, tool selection, and tool execution internally.

Architecture Overview

Detailed Request Flow

[Sequence diagram: the complete flow of a single AI chat request, including tool call handling]

Flow Steps

Validation: The BFF checks that messages is non-empty and within the limit (30 by default), that no content is blank, and that the last message has the user role.
System Prompt Resolution: Priority is client-provided systemPrompt → environment variable M8TY_AI_CHAT_SYSTEM_PROMPT → built-in default.
Tool Selection: The latest user message is embedded and compared against the MCP RAG tool catalog via semantic search. Only the top-K matching tools that meet the minimum score are selected.
LLM Request: The BFF constructs a full OpenAI-compatible request consisting of the system message, the conversation history, and the selected tool definitions.
Streaming Response: The LLM streams back either content deltas (forwarded to the client as SSE content events) or tool call requests.
Tool Execution: When the LLM requests a tool call, the BFF executes it via MCP, authenticating with the client's Bearer token.
Conversation Loop: Tool results are appended to the conversation and sent back to the LLM for the next iteration, up to M8TY_AI_CHAT_MAX_TOOL_CALL_ITERATIONS (default: 3).
Final Answer: Once the LLM produces a text response with no further tool calls, it is streamed to the client, followed by a done event.

Endpoint

POST /api/v1/ai/chat
Authorization: Bearer <user-token>
Content-Type: application/json

Response: text/event-stream

Request Format

Field	Type	Required	Description
`messages`	Array	Yes	Chronological user/assistant conversation history. Min 1 item. Last message must be `user`.
`systemPrompt`	String	No	Custom system instruction (max 8000 chars). If omitted, the server default is used.

Each message object:

Field	Type	Description
`role`	String	`user` or `assistant`
`content`	String	Message text (non-blank)
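As an illustration of the request shape, the following Python sketch assembles the HTTP request with the standard library only. The helper name `build_chat_request` and the host `https://m8ty.example.com` are hypothetical; only the path, the two headers, and the body fields are taken from this page.

```python
import json
import urllib.request
from typing import Dict, List, Optional


def build_chat_request(
    base_url: str,
    token: str,
    messages: List[Dict[str, str]],
    system_prompt: Optional[str] = None,
) -> urllib.request.Request:
    """Build (but do not send) a POST request for /api/v1/ai/chat."""
    body = {"messages": messages}
    if system_prompt is not None:
        # systemPrompt is optional; omit it to use the server default.
        body["systemPrompt"] = system_prompt
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/api/v1/ai/chat",
        data=json.dumps(body).encode("utf-8"),
        method="POST",
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
    )


# Hypothetical host and token, for illustration only.
request = build_chat_request(
    "https://m8ty.example.com",
    "user-token",
    [{"role": "user", "content": "What accounts do I have?"}],
)
```

Passing the resulting object to `urllib.request.urlopen` would send it; the response body is then an event stream rather than a single JSON document.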

Response Format (SSE)

The response is a stream of server-sent events. Each event has a type and a JSON payload:

Event Type	Description
`content`	Assistant text delta — append to the response
`done`	Stream complete
`error`	Error occurred — treat as terminal

Examples

Simple question (no history)

{
  "messages": [
    {"role": "user", "content": "What accounts do I have?"}
  ]
}

Follow-up with conversation history

{
  "messages": [
    {"role": "user", "content": "What accounts do I have?"},
    {"role": "assistant", "content": "You have 3 accounts: checking (€2,450), savings (€15,200), and a securities depot."},
    {"role": "user", "content": "Which one had the latest transaction?"}
  ]
}

With custom system prompt

{
  "systemPrompt": "You are a retirement planning advisor. Answer in German. Always use tools for portfolio data.",
  "messages": [
    {"role": "user", "content": "Wie entwickelt sich mein Depot?"}
  ]
}

SSE response example

event: content
data: {"type":"content","content":"Your checking account had the latest transaction: "}

event: content
data: {"type":"content","content":"a SEPA debit of €89.99 to Netflix on June 10."}

event: done
data: {"type":"done","finishReason":"stop"}
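A client could consume a stream like the one above roughly as follows. This is a minimal sketch, not the reference client: the function names are hypothetical, the parser handles only the `event:` and `data:` fields used on this page, and it expects the stream as an iterable of text lines.

```python
import json
from typing import Iterable, Iterator, List, Tuple


def _field_value(line: str, name: str) -> str:
    """Return the value of an SSE field line, dropping one optional leading space."""
    value = line[len(name) + 1:]
    return value[1:] if value.startswith(" ") else value


def iter_sse_events(lines: Iterable[str]) -> Iterator[Tuple[str, dict]]:
    """Yield (event type, JSON payload) pairs; a blank line ends each event."""
    event_type, data_lines = "message", []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line == "":
            if data_lines:
                yield event_type, json.loads("\n".join(data_lines))
            event_type, data_lines = "message", []
        elif line.startswith("event:"):
            event_type = _field_value(line, "event")
        elif line.startswith("data:"):
            data_lines.append(_field_value(line, "data"))


def collect_answer(lines: Iterable[str]) -> str:
    """Concatenate content deltas until done; raise on an error event."""
    parts: List[str] = []
    for event_type, payload in iter_sse_events(lines):
        if event_type == "content":
            parts.append(payload["content"])
        elif event_type == "done":
            break
        elif event_type == "error":
            # Error events are terminal, so stop reading immediately.
            raise RuntimeError(payload.get("error", "unknown error"))
    return "".join(parts)
```

In a real UI the deltas would normally be rendered as they arrive rather than collected into one string; the joining here only keeps the sketch short.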

System Prompt

The system prompt defines assistant behavior. It is resolved in this priority:

Client sends systemPrompt in request — used for that specific request
Environment variable M8TY_AI_CHAT_SYSTEM_PROMPT — used as the server default
Built-in default — banking assistant instruction (see below)
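The priority order above can be sketched in a few lines of Python. This is an illustration only: `resolve_system_prompt` is a hypothetical name, `BUILT_IN_DEFAULT_PROMPT` is a placeholder for the built-in text, and treating an empty string as "not provided" is an assumption this page does not confirm.

```python
import os
from typing import Mapping, Optional

# Placeholder for the built-in banking assistant instruction.
BUILT_IN_DEFAULT_PROMPT = "<built-in banking assistant instruction>"


def resolve_system_prompt(
    client_prompt: Optional[str],
    env: Mapping[str, str] = os.environ,
) -> str:
    """Pick the system prompt: request value, then env variable, then built-in."""
    if client_prompt:  # assumption: empty string counts as absent
        return client_prompt
    env_prompt = env.get("M8TY_AI_CHAT_SYSTEM_PROMPT")
    if env_prompt:
        return env_prompt
    return BUILT_IN_DEFAULT_PROMPT
```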

Built-in default system prompt

You are a banking assistant inside a mobile banking app. The user is already authenticated in the app. Assume requests are about banking, payments, cards, accounts, investments, documents, or app capabilities unless clearly unrelated.

Answer in the user's language. Keep answers concise, factual, and risk-aware.

For account balances, transactions, spending, payments, cards, securities, documents, personal data, or other user-specific banking data, use available functions before answering. Never guess, estimate, or invent balances, transactions, payment status, holdings, or personal data. If required data is not returned by a function, say that it is unavailable.

For requests about what you can do, available features, or supported actions, use the available capability/tool discovery functions instead of answering from memory.

For money-moving, approval, cancellation, deletion, blocking, or other sensitive state-changing actions, use the app's normal confirmation and authorization flow. Clearly state when an action has only been prepared versus actually submitted or completed.

Conversation History

The server is stateless — it does not persist conversations. The client owns the history:

The client stores the full conversation locally (e.g., in secure storage or a database).
On every new user message, the client sends the entire relevant history.
The server processes it and returns the assistant response via SSE.
The client appends the new assistant response to its local history.
The next request includes all prior messages again.

This follows the standard pattern used by OpenAI and other chat APIs.

History limits

Constraint	Default	Env Variable
Max messages per request	30	`M8TY_AI_CHAT_MAX_MESSAGE_COUNT`
Max content per message	32,000 chars	`M8TY_AI_CHAT_MAX_CONTENT_LENGTH`

Clients should drop the oldest user/assistant message pairs as the history approaches the message limit.
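One possible client-side trimming strategy is sketched below in Python. The function name `trim_history` is hypothetical, and the sketch assumes the history alternates user and assistant messages and ends with a user message; it only enforces the message count, not the per-message content length.

```python
from typing import Dict, List


def trim_history(
    messages: List[Dict[str, str]],
    max_messages: int = 30,
) -> List[Dict[str, str]]:
    """Drop the oldest user/assistant pairs until the history fits the limit."""
    if max_messages < 1:
        raise ValueError("max_messages must be at least 1")
    trimmed = list(messages)  # leave the caller's list untouched
    while len(trimmed) > max_messages and len(trimmed) > 2:
        del trimmed[:2]  # remove the oldest pair
    # Fallback for very small limits: keep only the most recent messages.
    return trimmed[-max_messages:]
```

Removing whole pairs from the front keeps the remaining history starting with a user message and leaves the newest user message in last position, which the server requires.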

Configuration

All settings are configurable via environment variables:

Variable	Description	Default
`M8TY_AI_CHAT_LLM_BASE_URL`	LLM backend base URL	`http://localhost:8000`
`M8TY_AI_CHAT_LLM_PATH`	Chat completions endpoint path	`/v1/chat/completions`
`M8TY_AI_CHAT_LLM_API_KEY`	API key for the LLM backend	—
`M8TY_AI_CHAT_LLM_AUTHORIZATION_HEADER`	Custom authorization header value	—
`M8TY_AI_CHAT_LLM_MODEL`	Model identifier	`default`
`M8TY_AI_CHAT_LLM_REQUEST_TIMEOUT`	Request timeout	`120s`
`M8TY_AI_CHAT_LLM_DISABLE_THINKING`	Disable thinking/reasoning mode	`false`
`M8TY_AI_CHAT_MAX_MESSAGE_COUNT`	Max messages per request	`30`
`M8TY_AI_CHAT_MAX_CONTENT_LENGTH`	Max content length per message	`32000`
`M8TY_AI_CHAT_MAX_TOOL_CALL_ITERATIONS`	Max tool call loop iterations	`3`
`M8TY_AI_CHAT_TOOL_SELECTION_TOP_K`	Top-K tools selected per request	`8`
`M8TY_AI_CHAT_TOOL_SELECTION_MIN_SCORE`	Minimum relevance score for tool selection	`0.25`
`M8TY_AI_CHAT_SYSTEM_PROMPT`	Default system prompt	Built-in banking instruction
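For readers who want to mirror these defaults in tests or tooling, the numeric settings could be loaded as in the following Python sketch. The `ChatSettings` class is hypothetical and is not the server's actual configuration code; only the variable names and default values come from the table above.

```python
import os
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class ChatSettings:
    """Numeric chat settings with the defaults listed in the table."""

    max_message_count: int = 30
    max_content_length: int = 32000
    max_tool_call_iterations: int = 3
    tool_selection_top_k: int = 8
    tool_selection_min_score: float = 0.25

    @classmethod
    def from_env(cls, env: Mapping[str, str] = os.environ) -> "ChatSettings":
        """Read each setting from the environment, falling back to its default."""
        return cls(
            max_message_count=int(env.get("M8TY_AI_CHAT_MAX_MESSAGE_COUNT", "30")),
            max_content_length=int(env.get("M8TY_AI_CHAT_MAX_CONTENT_LENGTH", "32000")),
            max_tool_call_iterations=int(
                env.get("M8TY_AI_CHAT_MAX_TOOL_CALL_ITERATIONS", "3")
            ),
            tool_selection_top_k=int(env.get("M8TY_AI_CHAT_TOOL_SELECTION_TOP_K", "8")),
            tool_selection_min_score=float(
                env.get("M8TY_AI_CHAT_TOOL_SELECTION_MIN_SCORE", "0.25")
            ),
        )
```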

Tool Selection

The BFF automatically selects relevant MCP tools for each request:

Extracts the latest user message as a semantic query
Uses the MCP RAG embedding search to find matching tools (top-K, min score)
Sends only the selected tools to the LLM — not the full catalog
When the LLM calls a tool, the BFF executes it using the client's Bearer token
Tool results are fed back to the LLM until a final text answer is produced

The client does not need to know about tools — all orchestration is server-side.
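The top-K and minimum-score filtering can be illustrated with a small Python sketch. It rests on several assumptions: this page does not state the similarity metric, so cosine similarity is used as a stand-in; the catalog is a plain in-memory mapping from tool name to embedding vector; a score equal to the minimum is kept; and the tool names in the usage lines are made up.

```python
import math
from typing import Dict, List, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def select_tools(
    query_vector: Sequence[float],
    catalog: Dict[str, Sequence[float]],
    top_k: int = 8,
    min_score: float = 0.25,
) -> List[str]:
    """Return up to top_k tool names scoring at least min_score, best first."""
    scored = [
        (cosine_similarity(query_vector, vector), name)
        for name, vector in catalog.items()
    ]
    kept = [pair for pair in scored if pair[0] >= min_score]
    kept.sort(key=lambda pair: pair[0], reverse=True)
    return [name for _, name in kept[:top_k]]


# Toy two-dimensional embeddings with hypothetical tool names.
catalog = {
    "get_accounts": [1.0, 0.0],
    "get_cards": [0.6, 0.8],
    "get_documents": [0.0, 1.0],
}
selected = select_tools([1.0, 0.0], catalog)
```

With these toy vectors, `get_documents` scores 0 against the query and is filtered out by the minimum score, so only two tool definitions would be sent to the LLM.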

Tool call iteration limit

If the LLM keeps calling tools without producing a final answer, the loop stops after M8TY_AI_CHAT_MAX_TOOL_CALL_ITERATIONS iterations (default: 3) and an error event is emitted.
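The capped loop can be pictured with the following simplified Python sketch. It is not the server implementation: `call_llm` and `execute_tool` are hypothetical callables standing in for the LLM backend and MCP, the reply dictionaries use an invented shape, streaming is reduced to a single content event, and counting one LLM call per iteration is an assumption about how the limit is applied.

```python
from typing import Any, Callable, Dict, Iterator, List

Message = Dict[str, Any]


def run_tool_loop(
    messages: List[Message],
    call_llm: Callable[[List[Message]], Message],
    execute_tool: Callable[[str, Dict[str, Any]], str],
    max_iterations: int = 3,
) -> Iterator[Message]:
    """Yield client-facing events; stop with an error once the cap is hit."""
    conversation = list(messages)
    for _ in range(max_iterations):
        reply = call_llm(conversation)
        tool_calls = reply.get("tool_calls") or []
        if not tool_calls:
            # No tool calls left: this is the final answer.
            yield {"type": "content", "content": reply["content"]}
            yield {"type": "done", "finishReason": "stop"}
            return
        conversation.append({"role": "assistant", "tool_calls": tool_calls})
        for call in tool_calls:
            result = execute_tool(call["name"], call["arguments"])
            # Tool results stay server-side; they are never yielded to the client.
            conversation.append(
                {"role": "tool", "name": call["name"], "content": result}
            )
    yield {"type": "error", "error": "Maximum tool-call iterations reached"}
```

Only `content`, `done`, and `error` events leave the loop, which matches the statement that tool call internals are not exposed to the client.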

Security

Authentication: Every request must carry a valid Bearer token
Token forwarding: The client's Bearer token is forwarded to MCP servers for tool execution — ensuring tool calls respect the user's permissions
No secrets in messages: The API rejects system/tool/developer roles from clients. Only user and assistant are accepted.
Content isolation: Tool call internals (arguments, raw results) are not exposed to the client

Validation Rules

messages must not be empty
messages must not exceed 30 items (configurable)
Each content must not be blank
Each content must not exceed 32,000 characters (configurable)
Last message must have role user
Only user and assistant roles are accepted
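Clients may want to run the same checks locally before sending a request. The Python sketch below mirrors the rules listed above plus the 8000-character `systemPrompt` limit from the request format; the function name and the error message wording are hypothetical, and the server's actual messages may differ.

```python
from typing import Any, List, Mapping

ALLOWED_ROLES = ("user", "assistant")


def validate_chat_request(
    body: Mapping[str, Any],
    max_messages: int = 30,
    max_content_length: int = 32000,
    max_system_prompt_length: int = 8000,
) -> List[str]:
    """Return a list of rule violations; an empty list means the body is valid."""
    messages = body.get("messages")
    if not isinstance(messages, list) or not messages:
        return ["messages must not be empty"]
    errors: List[str] = []
    if len(messages) > max_messages:
        errors.append(f"messages must not exceed {max_messages} items")
    for index, message in enumerate(messages):
        role = message.get("role")
        content = message.get("content")
        if role not in ALLOWED_ROLES:
            errors.append(f"messages[{index}].role must be user or assistant")
        if not isinstance(content, str) or not content.strip():
            errors.append(f"messages[{index}].content must not be blank")
        elif len(content) > max_content_length:
            errors.append(
                f"messages[{index}].content exceeds {max_content_length} characters"
            )
    if messages[-1].get("role") != "user":
        errors.append("last message must have role user")
    prompt = body.get("systemPrompt")
    if prompt is not None and len(prompt) > max_system_prompt_length:
        errors.append(f"systemPrompt exceeds {max_system_prompt_length} characters")
    return errors
```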

Error Responses

Status	Condition
400	Validation failure (empty messages, content too long, wrong role)
401	Missing or invalid Bearer token
500	Internal server error

Errors within the stream are delivered as SSE error events:

event: error
data: {"type":"error","error":"Maximum tool-call iterations reached"}

14 June 2026