docs/agents.md

# Agents

An `Agent` is an LLM persona pinned to a specific `Llm`, with a system prompt,
a description that surfaces in MCP `tools/list`, optional attachment to a
`Project`, and LiteLLM-style sampling defaults. Conversations are persisted
as `ChatThread` + `ChatMessage` rows so REPL sessions resume across runs.

Two surfaces use an agent:

1. **Direct chat** via `mcpctl chat <name>` (interactive REPL or one-shot
   `-m "msg"`). Streams over SSE; tool calls and tool results print to
   stderr in dim brackets. Slash-commands `/set`, `/system`, `/tools`,
   `/clear`, `/save`, `/quit` adjust runtime behavior.

2. **Virtual MCP server** registered into every project session by
   mcplocal's agents plugin. The agent shows up as `agent-<name>` with
   one tool `chat`, whose description is the agent's own description.
   Other Claude sessions / MCP clients see the agent as just another
   tool in `tools/list` and can consult it.

## Data model

Three Prisma models are defined in `src/db/prisma/schema.prisma`:

- **`Agent`** — `name` (unique), `description`, `systemPrompt`, `llmId`
  (FK Restrict — an Llm in active use cannot be deleted), `projectId`
  (FK SetNull — agents survive project deletion), `proxyModelName`
  (optional informational override), `defaultParams` (Json,
  LiteLLM-style), `extras` (Json, reserved for future LoRA / tool
  allowlists), `ownerId`, version, timestamps.

- **`ChatThread`** — `agentId`, `ownerId`, `title`, `lastTurnAt`,
  timestamps. Cascade delete on agent.

- **`ChatMessage`** — `threadId`, `turnIndex` (monotonic per thread,
  enforced by `@@unique([threadId, turnIndex])`), `role`
  (`'system' | 'user' | 'assistant' | 'tool'`), `content`, `toolCalls`
  (Json — assistant turn's `[{id,name,arguments}]`), `toolCallId`
  (which call a tool turn answers), `status`
  (`'pending' | 'complete' | 'error'`), `createdAt`. Cascade delete
  on thread.

`status` stays `pending` while the orchestrator runs an in-flight assistant
or tool turn, then flips to `complete` once the round settles. On any
exception in the chat loop, every `pending` row in the thread is flipped to
`error` so the trail stays auditable.
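
A minimal TypeScript sketch of this status lifecycle, using an in-memory array in place of the real Prisma rows. `MessageRow`, `settlePending`, and `runRound` are hypothetical names chosen for illustration; they are not taken from the codebase, and the real orchestrator may structure this differently.

```typescript
// Hypothetical stand-in for ChatMessage rows; only the fields needed to
// show the status lifecycle are modelled.
type Status = "pending" | "complete" | "error";

interface MessageRow {
  threadId: string;
  turnIndex: number;
  role: "system" | "user" | "assistant" | "tool";
  status: Status;
}

// Flip every pending row of one thread to `next`. Returns new rows and
// leaves the input array untouched.
function settlePending(
  rows: readonly MessageRow[],
  threadId: string,
  next: "complete" | "error",
): MessageRow[] {
  return rows.map((row) =>
    row.threadId === threadId && row.status === "pending"
      ? { ...row, status: next }
      : row,
  );
}

// Run one round of work. On success the thread's pending rows become
// complete; on any exception they become error and the error is reported.
function runRound(
  rows: readonly MessageRow[],
  threadId: string,
  work: () => void,
): { rows: MessageRow[]; error?: unknown } {
  try {
    work();
    return { rows: settlePending(rows, threadId, "complete") };
  } catch (error) {
    return { rows: settlePending(rows, threadId, "error"), error };
  }
}
```

Rows that belong to other threads, and rows that are already `complete`, are left alone in both branches.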

## Chat parameters (LiteLLM-style passthrough)

Per-call resolution: request body → `agent.defaultParams` → adapter default.
Setting a key to `null` in the request explicitly clears a default.

| Key | Type | Notes |
|---|---|---|
| `temperature` | number | 0..2 |
| `top_p` | number | 0..1 |
| `top_k` | integer | Anthropic-only; OpenAI ignores |
| `max_tokens` | integer | adapter clamps to provider max |
| `stop` | string \| string[] | up to 4 sequences |
| `presence_penalty` | number | OpenAI |
| `frequency_penalty` | number | OpenAI |
| `seed` | integer | reproducibility (provider-dependent) |
| `response_format` | object | `text` \| `json_object` \| `json_schema` |
| `tool_choice` | enum/object | `auto`\|`none`\|`required`\|`{type:'function',function:{name}}` |
| `tools_allowlist` | string[] | restricts which project MCP tools the agent can call this turn |
| `systemOverride` | string | replaces `agent.systemPrompt` for this call |
| `systemAppend` | string | concatenated to system block (after project Prompts) |
| `messages` | array | full message history override; if set, `message`/threadId history is ignored |
| `extra` | object | provider-specific knobs (Anthropic `metadata.user_id`, vLLM `repetition_penalty`) — adapters cherry-pick |
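
The resolution order can be sketched in TypeScript as a layered merge. `resolveParams` is a hypothetical helper, not the adapter's actual code. The sketch assumes that an explicit `null` removes the agent default and lets the adapter default (if any) apply again; the document does not say which fallback applies after a clear.

```typescript
type Params = Record<string, unknown>;

// Layer the three sources: adapter defaults first, agent defaults on top,
// then the request body. A null in the request clears the agent default.
function resolveParams(
  request: Params,
  agentDefaults: Params,
  adapterDefaults: Params,
): Params {
  const resolved: Params = { ...adapterDefaults, ...agentDefaults };
  for (const [key, value] of Object.entries(request)) {
    if (value === null) {
      // Assumption: after clearing, the adapter default applies again;
      // if there is none, the key is omitted entirely.
      if (key in adapterDefaults) {
        resolved[key] = adapterDefaults[key];
      } else {
        delete resolved[key];
      }
    } else if (value !== undefined) {
      resolved[key] = value;
    }
  }
  return resolved;
}
```

For example, a request of `{ temperature: 0.7, stop: null }` against agent defaults of `{ temperature: 0.2, stop: ["</deploy>"] }` would resolve to a temperature of 0.7 with no `stop` key, under the assumptions above.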

## HTTP API (mcpd)

```
GET    /api/v1/agents                  list (RBAC: view:agents)
GET    /api/v1/agents/:idOrName        describe (view:agents)
POST   /api/v1/agents                  create (create:agents)
PUT    /api/v1/agents/:idOrName        update (edit:agents)
DELETE /api/v1/agents/:idOrName        delete (delete:agents)
POST   /api/v1/agents/:name/chat       chat — non-streaming or SSE (run:agents:<name>)
POST   /api/v1/agents/:name/threads    create thread (run:agents:<name>)
GET    /api/v1/agents/:name/threads    list threads (run:agents:<name>)
GET    /api/v1/threads/:id/messages    replay history (view:agents)
GET    /api/v1/projects/:p/agents      project-scoped list (view:projects:<p>)
```

The chat endpoint reuses the SSE pattern from `llm-infer.ts` exactly: same
headers (`text/event-stream`, `X-Accel-Buffering: no`), same `data: …\n\n`
framing, same `[DONE]` terminator. SSE chunk types:

- `{type:'text', delta}` — assistant text increments
- `{type:'tool_call', toolName, args}` — model decided to call a tool
- `{type:'tool_result', toolName, ok}` — tool dispatch outcome
- `{type:'final', threadId, turnIndex}` — terminal turn
- `{type:'error', message}` — fatal error in the loop
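
A client can model these chunks as a discriminated union and split the stream on the blank-line frame boundary. The sketch below is a hypothetical parser, not the `mcpctl` client. It assumes each `data:` payload is a single-line JSON object with exactly the fields listed above; the document does not spell out the wire encoding.

```typescript
type ChatChunk =
  | { type: "text"; delta: string }
  | { type: "tool_call"; toolName: string; args: unknown }
  | { type: "tool_result"; toolName: string; ok: boolean }
  | { type: "final"; threadId: string; turnIndex: number }
  | { type: "error"; message: string };

interface ParseResult {
  chunks: ChatChunk[];
  done: boolean; // true once the [DONE] terminator was seen
  rest: string;  // incomplete trailing frame, to prepend to the next read
}

// Parse whatever has been received so far. Frames end with a blank line,
// so the last element of the split is always an unfinished frame (or "").
function parseSseBuffer(buffer: string): ParseResult {
  const frames = buffer.split("\n\n");
  const rest = frames.pop() ?? "";
  const chunks: ChatChunk[] = [];
  let done = false;
  for (const frame of frames) {
    if (done) break; // ignore anything after the terminator
    for (const line of frame.split("\n")) {
      if (!line.startsWith("data:")) continue;
      const payload = line.slice("data:".length).trim();
      if (payload === "[DONE]") {
        done = true;
        break;
      }
      // No schema validation here; a real client should check the shape.
      chunks.push(JSON.parse(payload) as ChatChunk);
    }
  }
  return { chunks, done, rest };
}
```

Callers would append each network read to `rest` from the previous call and switch on `chunk.type` to route text to stdout and tool events to stderr.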

## Tool-use loop

When the agent's project has MCP servers attached, mcpd's `ChatService` lists
each server's tools (via `mcp-proxy.service.ts` — same path real MCP traffic
uses) and presents them to the model namespaced as `<server>__<tool>`. On a
`tool_calls` response the loop dispatches each call back through the same
proxy, persists the assistant + tool turns linked by `toolCallId`, and loops
(cap = 12 iterations) until the model returns terminal text.

Persistence is **non-transactional across the loop** because tool calls can
take minutes; long-held DB transactions would starve other writers.
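
The loop described above might look roughly like the following TypeScript sketch. `runToolLoop`, `splitToolName`, and the callback signatures are hypothetical and do not mirror `ChatService`. Two assumptions are made where the document is silent: names are split at the first `__`, and hitting the iteration cap throws.

```typescript
interface ToolCall { id: string; name: string; arguments: string }

type ModelReply =
  | { kind: "text"; content: string }
  | { kind: "tool_calls"; calls: ToolCall[] };

interface Turn {
  role: "assistant" | "tool";
  content: string;
  toolCalls?: ToolCall[];
  toolCallId?: string;
}

const MAX_ITERATIONS = 12;

// Split "<server>__<tool>" at the first double underscore (assumption).
function splitToolName(namespaced: string): { server: string; tool: string } {
  const idx = namespaced.indexOf("__");
  if (idx < 0) throw new Error(`not a namespaced tool name: ${namespaced}`);
  return { server: namespaced.slice(0, idx), tool: namespaced.slice(idx + 2) };
}

async function runToolLoop(
  callModel: (history: Turn[]) => Promise<ModelReply>,
  dispatch: (server: string, tool: string, args: string) => Promise<string>,
  persist: (turn: Turn) => Promise<void>,
): Promise<string> {
  const history: Turn[] = [];
  for (let i = 0; i < MAX_ITERATIONS; i++) {
    const reply = await callModel(history);
    if (reply.kind === "text") {
      await persist({ role: "assistant", content: reply.content });
      return reply.content; // terminal text ends the loop
    }
    // Each turn is persisted on its own; no transaction spans the loop.
    const assistant: Turn = { role: "assistant", content: "", toolCalls: reply.calls };
    history.push(assistant);
    await persist(assistant);
    for (const call of reply.calls) {
      const { server, tool } = splitToolName(call.name);
      const result = await dispatch(server, tool, call.arguments);
      const toolTurn: Turn = { role: "tool", content: result, toolCallId: call.id };
      history.push(toolTurn);
      await persist(toolTurn);
    }
  }
  // Assumption: exceeding the cap is treated as an error.
  throw new Error(`tool loop exceeded ${MAX_ITERATIONS} iterations`);
}
```

Because every `persist` call commits independently, a slow tool call holds no database lock while it runs.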

## RBAC

Agents are their own resource (`agents`), independent of project bindings.
Recommended permissions:

- `view:agents` — list / describe
- `create:agents` / `edit:agents` / `delete:agents` — CRUD
- `run:agents:<name>` — drive a chat turn or manage its threads

Project-attached agents do not implicitly inherit project RBAC. If a project
member should be able to chat with the project's agents, grant them
`run:agents:<each-name>` (or wildcard `run:agents`) explicitly.
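
As an illustration of how a named grant and its wildcard relate, here is a small TypeScript matcher. `isAllowed` is a hypothetical function, and the matching rule (a grant without a name segment covers every name for that verb and resource) is an assumption drawn from the wildcard note above, not mcpd's actual RBAC engine.

```typescript
// Permissions have the form "<verb>:<resource>" or "<verb>:<resource>:<name>".
function isAllowed(grants: readonly string[], required: string): boolean {
  const [verb, resource, name] = required.split(":");
  return grants.some((grant) => {
    const [gVerb, gResource, gName] = grant.split(":");
    if (gVerb !== verb || gResource !== resource) return false;
    // A grant with no name segment acts as a wildcard (assumption).
    return gName === undefined || gName === name;
  });
}
```

Under this rule a project-level grant such as `view:projects:mcpctl-dev` never satisfies `run:agents:<name>`, which matches the point that project RBAC is not inherited.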

## YAML round-trip

`mcpctl get agent foo -o yaml | mcpctl apply -f -` is a no-op. The `apply`
schema also accepts shorthand:

```yaml
apiVersion: mcpctl.io/v1
kind: agent
metadata: { name: deployer }
spec:
  description: "I help you deploy code"
  llm: qwen3-thinking          # shorthand for `{ name: qwen3-thinking }`
  project: mcpctl-dev          # shorthand for `{ name: mcpctl-dev }`
  systemPrompt: |
    You are a deployment assistant for mcpctl. Always check fulldeploy.sh
    and the k8s context before suggesting actions.
  defaultParams:
    temperature: 0.2
    max_tokens: 4096
    top_p: 0.9
    stop: ["</deploy>"]
```
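
One way to support the shorthand is to expand it into the long form before validation. The TypeScript below is a sketch with hypothetical names (`NameRef`, `normalizeAgentSpec`); the real `apply` schema may handle this differently.

```typescript
interface NameRef { name: string }

interface AgentSpecInput {
  llm: string | NameRef;
  project?: string | NameRef;
  [key: string]: unknown; // description, systemPrompt, defaultParams, ...
}

// Expand "foo" into { name: "foo" }; leave the long form untouched.
function toRef(value: string | NameRef): NameRef {
  return typeof value === "string" ? { name: value } : value;
}

// Return a copy of the spec with `llm` and `project` in long form.
function normalizeAgentSpec(spec: AgentSpecInput): Record<string, unknown> {
  const out: Record<string, unknown> = { ...spec, llm: toRef(spec.llm) };
  if (spec.project !== undefined) {
    out["project"] = toRef(spec.project);
  }
  return out;
}
```

Both spellings normalize to the same object, which is what lets a shorthand manifest and its long-form export compare as equal.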

## Wiring against your in-cluster qwen3-thinking

The `kubernetes-deployment` repo provisions LiteLLM in the `nvidia-nim`
namespace (`http://litellm.nvidia-nim.svc.cluster.local:4000/v1` in-cluster,
`https://llm.ad.itaz.eu/v1` external) and a virtual key reserved for mcpctl
in the Pulumi secret `secrets:litellmMcpctlGatewayToken`. To pull it once:

```bash
cd /path/to/kubernetes-deployment
LITELLM_TOKEN=$(pulumi config get --stack homelab secrets:litellmMcpctlGatewayToken)

# fallback if Pulumi isn't authed locally:
# LITELLM_TOKEN=$(kubectl --context worker0-k8s0 -n nvidia-nim get secret litellm-secrets \
#   -o jsonpath='{.data.LITELLM_MCPCTL_GATEWAY_TOKEN}' | base64 -d)

cd /path/to/mcpctl
mcpctl create secret litellm-key --data "API_KEY=${LITELLM_TOKEN}"
mcpctl create llm qwen3-thinking \
    --type openai --model qwen3-thinking \
    --url http://litellm.nvidia-nim.svc.cluster.local:4000/v1 \
    --api-key-ref litellm-key/API_KEY \
    --description "Qwen3-30B-A3B-Thinking-FP8 via in-cluster vLLM behind LiteLLM"
mcpctl create agent reviewer \
    --llm qwen3-thinking \
    --description "I review what you're shipping; ask after each major change." \
    --default-temperature 0.2 --default-max-tokens 4096
mcpctl chat reviewer
```

## Troubleshooting

- **Namespace collision** in mcplocal: if a project has an upstream MCP
  server literally named `agent-<x>`, the agents plugin detects the
  collision in `onSessionCreate`, skips that agent's registration, and
  emits a `ctx.log.warn` line. Treat the `agent-` prefix as reserved and
  avoid it in real server names.

- **Llm-in-use blocks delete**: `Agent.llm` is `onDelete: Restrict`. Detach
  every agent (or delete them) before deleting the underlying Llm.

- **Stale `pending` rows**: a crash mid-loop leaves `pending` ChatMessage
  rows. They are cleaned up lazily: `markPendingAsError` flips them to
  `error` the next time the thread hits a failure path, and `loadHistory`
  filters out `error` rows when rebuilding context for the next turn.

- **`proxyModelName` is informational only** for agents. The agent's own
  internal tool loop runs server-side in mcpd and bypasses mcplocal's
  proxymodel pipeline entirely. Don't try to plumb it.

- **Anthropic + tools**: the Anthropic adapter currently drops `tool` role
  messages and doesn't translate OpenAI `tool_calls` to Anthropic
  `tool_use` / `tool_result` blocks. Use an OpenAI-compatible provider
  (LiteLLM, vLLM, OpenAI) for agents that need tool calling until that
  translation lands.
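
The namespace-collision rule from the first bullet can be sketched in TypeScript as follows. `planAgentServers` is a hypothetical helper that only computes which virtual servers would be registered or skipped; the real plugin does this inside `onSessionCreate` and logs a warning for each skip.

```typescript
interface RegistrationPlan {
  registered: string[]; // virtual server names that are safe to register
  skipped: string[];    // names that collide with a real upstream server
}

function planAgentServers(
  agentNames: readonly string[],
  upstreamServers: readonly string[],
): RegistrationPlan {
  const taken = new Set(upstreamServers);
  const plan: RegistrationPlan = { registered: [], skipped: [] };
  for (const name of agentNames) {
    const virtualName = `agent-${name}`;
    if (taken.has(virtualName)) {
      plan.skipped.push(virtualName); // the real server keeps the name
    } else {
      plan.registered.push(virtualName);
    }
  }
  return plan;
}
```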

feat(agents): smoke tests + README + docs (Stage 6, final)

Closes the agents feature.

Smoke tests (run via `pnpm test:smoke` against a live mcpd at $MCPD_URL,
default https://mcpctl.ad.itaz.eu):

* tests/smoke/agent.smoke.test.ts (full CRUD round-trip): create secret +
  Llm + agent with sampling defaults; `get agents` surfaces it;
  `get agent foo -o yaml | apply -f` round-trips identically; create + list
  a thread via the HTTP API; agent delete leaves Llm + secret intact
  (Restrict + SetNull as designed). Self-skips with a warning when /healthz
  is unreachable.
* tests/smoke/agent-chat.smoke.test.ts: gated on MCPCTL_SMOKE_LLM_URL +
  MCPCTL_SMOKE_LLM_KEY. Provisions secret + Llm + agent against a real
  upstream, runs `mcpctl chat -m … --no-stream` (asserts a reply lands),
  then runs the streaming default (asserts text on stdout + `(thread: …)`
  on stderr).

The fast path for verifying the in-cluster qwen3-thinking deployment:

    MCPCTL_SMOKE_LLM_URL=http://litellm.nvidia-nim.svc.cluster.local:4000/v1 \
    MCPCTL_SMOKE_LLM_MODEL=qwen3-thinking \
    MCPCTL_SMOKE_LLM_KEY=$(pulumi config get --stack homelab \
        secrets:litellmMcpctlGatewayToken) \
    pnpm test:smoke

Docs:

* README.md: new "Agents" section under Resources with the qwen3-thinking
  quickstart and links to docs/agents.md and docs/chat.md. Adds llm + agent
  rows to the resources table.
* docs/agents.md (new): full reference covering data model, chat-parameter
  table, HTTP API, RBAC mapping, tool-use loop semantics, yaml round-trip
  shorthand, the kubernetes-deployment wiring recipe, and a troubleshooting
  section (namespace collision, llm-in-use, pending-row recovery,
  Anthropic-tool limitation).
* docs/chat.md (new): user-facing `mcpctl chat` walkthrough covering modes,
  per-call flags, slash-commands, threads, and a troubleshooting section.
* CLAUDE.md: adds a "Resource types" cheatsheet with one-line pointers to
  each, including the new `agent` row that links to the docs.

All suites still green: mcpd 759/759, mcplocal 715/715, cli 430/430.
Smoke tests typecheck and self-skip when no live mcpd is reachable.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-25 17:08:37 +01:00
