# Agents

An `Agent` is an LLM persona pinned to a specific `Llm`, with a system prompt,
a description that surfaces in MCP `tools/list`, optional attachment to a
`Project`, and LiteLLM-style sampling defaults. Conversations are persisted
as `ChatThread` + `ChatMessage` rows so REPL sessions resume across runs.

Two surfaces use an agent:

1. **Direct chat** via `mcpctl chat <name>` (interactive REPL or one-shot
   `-m "msg"`). Streams over SSE; tool calls and tool results print to
   stderr in dim brackets. Slash-commands `/set`, `/system`, `/tools`,
   `/clear`, `/save`, `/quit` adjust runtime behavior.

2. **Virtual MCP server** registered into every project session by
   mcplocal's agents plugin. The agent shows up as `agent-<name>` with
   one tool `chat`, whose description is the agent's own description.
   Other Claude sessions / MCP clients see the agent as just another
   tool in `tools/list` and can consult it.

## Data model

Three Prisma models are added to `src/db/prisma/schema.prisma`:

- **`Agent`**: `name` (unique), `description`, `systemPrompt`, `llmId`
  (FK Restrict: an Llm in active use cannot be deleted), `projectId`
  (FK SetNull: agents survive project deletion), `proxyModelName`
  (optional informational override), `defaultParams` (Json,
  LiteLLM-style), `extras` (Json, reserved for future LoRA / tool
  allowlists), `ownerId`, `version`, timestamps.

- **`ChatThread`**: `agentId`, `ownerId`, `title`, `lastTurnAt`,
  timestamps. Cascade delete on agent.

- **`ChatMessage`**: `threadId`, `turnIndex` (monotonic per thread,
  enforced by `@@unique([threadId, turnIndex])`), `role`
  (`'system' | 'user' | 'assistant' | 'tool'`), `content`, `toolCalls`
  (Json, the assistant turn's `[{id,name,arguments}]`), `toolCallId`
  (which call a tool turn answers), `status`
  (`'pending' | 'complete' | 'error'`), `createdAt`. Cascade delete
  on thread.

`status` stays `pending` while the orchestrator runs an in-flight assistant
or tool turn, then flips to `complete` once the round settles. On any
exception in the chat loop, every `pending` row in the thread is flipped to
`error` so the trail stays auditable.

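The status lifecycle described above can be sketched in memory as follows. This is an illustration only, not mcpd's code: the `Row` type, `settleRound`, `runRound`, and the in-memory array are hypothetical, and the signature given to `markPendingAsError` is an assumption (only its name appears in this document).

```typescript
// Hypothetical in-memory stand-in for ChatMessage rows; the real rows live in the database.
type Status = "pending" | "complete" | "error";

interface Row {
  threadId: string;
  turnIndex: number;
  status: Status;
}

// Flip every pending row of one thread to `complete` once a round settles.
function settleRound(rows: Row[], threadId: string): void {
  for (const row of rows) {
    if (row.threadId === threadId && row.status === "pending") {
      row.status = "complete";
    }
  }
}

// Flip every pending row of one thread to `error` (assumed signature).
// Returns how many rows were flipped.
function markPendingAsError(rows: Row[], threadId: string): number {
  let flipped = 0;
  for (const row of rows) {
    if (row.threadId === threadId && row.status === "pending") {
      row.status = "error";
      flipped += 1;
    }
  }
  return flipped;
}

// Run one round of work; on any exception, mark in-flight rows as errored and rethrow.
function runRound(rows: Row[], threadId: string, work: () => void): void {
  try {
    work();
    settleRound(rows, threadId);
  } catch (err) {
    markPendingAsError(rows, threadId);
    throw err;
  }
}
```

Rows belonging to other threads are left untouched in both paths, which matches the per-thread wording above.
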
## Chat parameters (LiteLLM-style passthrough)

Per-call resolution order (first match wins): request body → `agent.defaultParams` → adapter default.
Setting a key to `null` in the request explicitly clears a default.

| Key | Type | Notes |
|---|---|---|
| `temperature` | number | 0..2 |
| `top_p` | number | 0..1 |
| `top_k` | integer | Anthropic-only; OpenAI ignores |
| `max_tokens` | integer | adapter clamps to provider max |
| `stop` | string \| string[] | up to 4 sequences |
| `presence_penalty` | number | OpenAI |
| `frequency_penalty` | number | OpenAI |
| `seed` | integer | reproducibility (provider-dependent) |
| `response_format` | object | `text` \| `json_object` \| `json_schema` |
| `tool_choice` | enum/object | `auto`\|`none`\|`required`\|`{type:'function',function:{name}}` |
| `tools_allowlist` | string[] | restricts which project MCP tools the agent can call this turn |
| `systemOverride` | string | replaces `agent.systemPrompt` for this call |
| `systemAppend` | string | appended to the system block (after project Prompts) |
| `messages` | array | full message history override; if set, `message` / `threadId` history is ignored |
| `extra` | object | provider-specific knobs (Anthropic `metadata.user_id`, vLLM `repetition_penalty`); adapters cherry-pick |

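The resolution order above can be illustrated with a small merge function. This is a sketch under stated assumptions, not the adapter code: `resolveParams` and its types are hypothetical, and it assumes that "clears a default" means the key is dropped from the resolved set entirely, so neither the agent default nor the adapter default is sent.

```typescript
// Hypothetical types; only the three-layer order and the null rule come from the text above.
type ParamValue = number | string | string[] | object;
type Layer = Record<string, ParamValue>;
type RequestLayer = Record<string, ParamValue | null>;

function resolveParams(
  request: RequestLayer,
  agentDefaults: Layer,
  adapterDefaults: Layer,
): Layer {
  // Lowest precedence first: agent defaults overwrite adapter defaults.
  const resolved: Layer = { ...adapterDefaults, ...agentDefaults };

  for (const key of Object.keys(request)) {
    const value = request[key];
    if (value === null) {
      // Assumption: an explicit null removes the key altogether.
      delete resolved[key];
    } else {
      // Request values have the highest precedence.
      resolved[key] = value;
    }
  }
  return resolved;
}
```

For example, a request of `{ temperature: 0.7, top_p: null }` against agent defaults `{ temperature: 0.2, top_p: 0.9 }` would resolve to a `temperature` of 0.7 and no `top_p` key under this reading.
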
## HTTP API (mcpd)

```
GET    /api/v1/agents                  list                          (RBAC: view:agents)
GET    /api/v1/agents/:idOrName        describe                      (view:agents)
POST   /api/v1/agents                  create                        (create:agents)
PUT    /api/v1/agents/:idOrName        update                        (edit:agents)
DELETE /api/v1/agents/:idOrName        delete                        (delete:agents)
POST   /api/v1/agents/:name/chat       chat (non-streaming or SSE)   (run:agents:<name>)
POST   /api/v1/agents/:name/threads    create thread                 (run:agents:<name>)
GET    /api/v1/agents/:name/threads    list threads                  (run:agents:<name>)
GET    /api/v1/threads/:id/messages    replay history                (view:agents)
GET    /api/v1/projects/:p/agents      project-scoped list           (view:projects:<p>)
```

The chat endpoint reuses the SSE pattern from `llm-infer.ts` exactly: same
headers (`text/event-stream`, `X-Accel-Buffering: no`), same `data: …\n\n`
framing, same `[DONE]` terminator. SSE chunk types:

- `{type:'text', delta}`: assistant text increments
- `{type:'tool_call', toolName, args}`: model decided to call a tool
- `{type:'tool_result', toolName, ok}`: tool dispatch outcome
- `{type:'final', threadId, turnIndex}`: terminal turn
- `{type:'error', message}`: fatal error in the loop

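A client might model these chunks as a discriminated union and decode the `data: …\n\n` framing roughly as below. This is a minimal sketch, not the `mcpctl` client: `ChatChunk`, `parseSseBody`, and `collectText` are hypothetical names, the field types are assumptions, and the parser assumes a fully buffered body with single-line `data:` frames (a real client would decode incrementally).

```typescript
// Chunk shapes follow the list above; the field types are assumptions.
type ChatChunk =
  | { type: "text"; delta: string }
  | { type: "tool_call"; toolName: string; args: unknown }
  | { type: "tool_result"; toolName: string; ok: boolean }
  | { type: "final"; threadId: string; turnIndex: number }
  | { type: "error"; message: string };

// Parse a buffered SSE body into chunks, stopping at the [DONE] terminator.
function parseSseBody(body: string): ChatChunk[] {
  const chunks: ChatChunk[] = [];
  for (const frame of body.split("\n\n")) {
    const line = frame.trim();
    if (line.indexOf("data:") !== 0) continue; // skip blanks and non-data lines
    const payload = line.slice("data:".length).trim();
    if (payload === "[DONE]") break;
    chunks.push(JSON.parse(payload) as ChatChunk);
  }
  return chunks;
}

// Concatenate the text deltas into the assistant's reply.
function collectText(chunks: ChatChunk[]): string {
  let text = "";
  for (const chunk of chunks) {
    if (chunk.type === "text") text += chunk.delta;
  }
  return text;
}
```

Switching on `chunk.type` lets the compiler narrow each variant, so a renderer can print `tool_call` and `tool_result` chunks separately from the text stream.
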
## Tool-use loop

When the agent's project has MCP servers attached, mcpd's `ChatService` lists
each server's tools (via `mcp-proxy.service.ts`, the same path real MCP traffic
uses) and presents them to the model namespaced as `<server>__<tool>`. On a
`tool_calls` response the loop dispatches each call back through the same
proxy, persists the assistant + tool turns linked by `toolCallId`, and loops
(cap = 12 iterations) until the model returns terminal text.

Persistence is **non-transactional across the loop** because tool calls can
take minutes; long-held DB transactions would starve other writers.

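The shape of that loop can be sketched as follows. This is not `ChatService` itself: `runToolLoop`, `splitToolName`, and the two callbacks are hypothetical stand-ins for the LLM adapter and the MCP proxy, persistence is omitted, the calls are synchronous for brevity (the real ones are network-bound), and two behaviors are assumptions: names are split at the first `__`, and hitting the cap throws.

```typescript
const MAX_ITERATIONS = 12; // cap stated above

interface ToolCall {
  id: string;
  name: string; // namespaced as <server>__<tool>
  arguments: string;
}

type ModelReply = { text: string } | { toolCalls: ToolCall[] };

// Split "<server>__<tool>" at the first double underscore (assumption).
function splitToolName(namespaced: string): { server: string; tool: string } {
  const at = namespaced.indexOf("__");
  if (at <= 0) throw new Error(`not a namespaced tool: ${namespaced}`);
  return { server: namespaced.slice(0, at), tool: namespaced.slice(at + 2) };
}

function runToolLoop(
  callModel: (toolResults: string[]) => ModelReply,
  dispatch: (server: string, tool: string, args: string) => string,
): string {
  let toolResults: string[] = [];
  for (let i = 0; i < MAX_ITERATIONS; i++) {
    const reply = callModel(toolResults);
    if ("text" in reply) return reply.text; // terminal text ends the loop

    // Dispatch every requested call, then feed the results to the next iteration.
    toolResults = [];
    for (const call of reply.toolCalls) {
      const { server, tool } = splitToolName(call.name);
      toolResults.push(dispatch(server, tool, call.arguments));
    }
  }
  // Assumption: exceeding the cap is treated as an error.
  throw new Error(`tool loop exceeded ${MAX_ITERATIONS} iterations`);
}
```
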
## RBAC

Agents are their own resource (`agents`), independent of project bindings.
Recommended permissions:

- `view:agents`: list / describe
- `create:agents` / `edit:agents` / `delete:agents`: CRUD
- `run:agents:<name>`: drive a chat turn or manage its threads

Project-attached agents do not implicitly inherit project RBAC. If a project
member should be able to chat with the project's agents, grant them
`run:agents:<each-name>` (or wildcard `run:agents`) explicitly.

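How the wildcard relates to per-agent grants can be illustrated with a small matcher. This is a hypothetical sketch, not mcpd's RBAC engine: `isGranted` and `canChat` are invented names, and the prefix-at-`:` matching rule is an assumption about how the `run:agents` wildcard is evaluated.

```typescript
// Assumed rule: a grant matches when it equals the required permission,
// or is a prefix of it ending at a ":" boundary
// (so `run:agents` covers `run:agents:<name>`).
function isGranted(grants: string[], required: string): boolean {
  return grants.some(
    (grant) => grant === required || required.indexOf(grant + ":") === 0,
  );
}

// Chatting with an agent requires run:agents:<name>.
function canChat(grants: string[], agentName: string): boolean {
  return isGranted(grants, `run:agents:${agentName}`);
}
```

Under this rule a user holding only `view:projects:mcpctl-dev` cannot chat with an agent attached to that project, which mirrors the point above that project RBAC is not inherited.
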
## YAML round-trip

`mcpctl get agent foo -o yaml | mcpctl apply -f -` is a no-op. The `apply` schema
also accepts shorthand:

```yaml
apiVersion: mcpctl.io/v1
kind: agent
metadata: { name: deployer }
spec:
  description: "I help you deploy code"
  llm: qwen3-thinking      # shorthand for `{ name: qwen3-thinking }`
  project: mcpctl-dev      # shorthand for `{ name: mcpctl-dev }`
  systemPrompt: |
    You are a deployment assistant for mcpctl. Always check fulldeploy.sh
    and the k8s context before suggesting actions.
  defaultParams:
    temperature: 0.2
    max_tokens: 4096
    top_p: 0.9
    stop: ["</deploy>"]
```

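The shorthand for `llm` and `project` amounts to normalizing a bare string into a `{ name }` reference. A minimal sketch, assuming nothing about the real `apply` schema beyond the two forms shown in the comments above (`Ref` and `normalizeRef` are hypothetical names):

```typescript
// A reference is either the shorthand string or the full { name } object.
type Ref = string | { name: string };

// Expand the shorthand so downstream code only sees the object form.
function normalizeRef(ref: Ref): { name: string } {
  return typeof ref === "string" ? { name: ref } : ref;
}
```

Normalizing on input is one way a round-trip can stay a no-op: whichever form the user writes, the stored value has a single canonical shape.
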
## Wiring against your in-cluster qwen3-thinking

The `kubernetes-deployment` repo provisions LiteLLM in the `nvidia-nim`
namespace (`http://litellm.nvidia-nim.svc.cluster.local:4000/v1` in-cluster,
`https://llm.ad.itaz.eu/v1` external) and a virtual key reserved for mcpctl
in the Pulumi secret `secrets:litellmMcpctlGatewayToken`. To pull it once:

```bash
cd /path/to/kubernetes-deployment
LITELLM_TOKEN=$(pulumi config get --stack homelab secrets:litellmMcpctlGatewayToken)

# fallback if Pulumi isn't authed locally:
# LITELLM_TOKEN=$(kubectl --context worker0-k8s0 -n nvidia-nim get secret litellm-secrets \
#   -o jsonpath='{.data.LITELLM_MCPCTL_GATEWAY_TOKEN}' | base64 -d)

cd /path/to/mcpctl
mcpctl create secret litellm-key --data "API_KEY=${LITELLM_TOKEN}"
mcpctl create llm qwen3-thinking \
  --type openai --model qwen3-thinking \
  --url http://litellm.nvidia-nim.svc.cluster.local:4000/v1 \
  --api-key-ref litellm-key/API_KEY \
  --description "Qwen3-30B-A3B-Thinking-FP8 via in-cluster vLLM behind LiteLLM"
mcpctl create agent reviewer \
  --llm qwen3-thinking \
  --description "I review what you're shipping; ask after each major change." \
  --default-temperature 0.2 --default-max-tokens 4096
mcpctl chat reviewer
```

## Troubleshooting

- **Namespace collision** in mcplocal: if a project has an upstream MCP
  server literally named `agent-<x>`, the agents plugin detects the
  collision in `onSessionCreate`, skips that agent's registration, and
  emits a `ctx.log.warn` line. Treat the `agent-` prefix as reserved and
  avoid it in real server names.

- **Llm-in-use blocks delete**: `Agent.llm` is `onDelete: Restrict`. Detach
  every agent (or delete them) before deleting the underlying Llm.

- **Stale `pending` rows**: a crash mid-loop leaves `pending` ChatMessage
  rows. The next request recovers: `markPendingAsError` flips them on the
  next failure path, and `loadHistory` filters out `error` rows when
  rebuilding context for the next turn.

- **`proxyModelName` is informational only** for agents. The agent's own
  internal tool loop runs server-side in mcpd and bypasses mcplocal's
  proxymodel pipeline entirely. Don't try to plumb it.

- **Anthropic + tools**: the Anthropic adapter currently drops `tool` role
  messages and doesn't translate OpenAI `tool_calls` to Anthropic
  `tool_use` / `tool_result` blocks. Use an OpenAI-compatible provider
  (LiteLLM, vLLM, OpenAI) for agents that need tool calling until that
  translation lands.