michal/mcpctl

Michal 7320b50dac

CI/CD / lint (pull_request) Successful in 55s

CI/CD / test (pull_request) Successful in 1m12s

CI/CD / typecheck (pull_request) Successful in 2m46s

CI/CD / smoke (pull_request) Failing after 1m44s

CI/CD / build (pull_request) Failing after 7m0s

CI/CD / publish (pull_request) Has been skipped

feat(cli+docs+smoke): inference-task CLI + GC ticker + smoke + docs (v5 Stage 4)

CLI surface for the durable queue:

- `mcpctl get tasks` — table view (ID, STATUS, POOL, LLM, MODEL,
  STREAM, AGE, WORKER). Aliases `task`, `tasks`, `inference-task`,
  `inference-tasks` all normalize to the canonical plural so URL
  construction works uniformly. RESOURCE_ALIASES + completions
  generator updated.
- `mcpctl chat-llm <name> --async -m <msg>` — enqueue and exit. stdout
  is just the task id (pipeable into `xargs mcpctl get task`); stderr
  carries human-readable status. REPL mode is rejected for --async
  (fire-and-forget doesn't make sense without -m).

GC ticker in mcpd: 5-min interval. Pending tasks past 1 h queue
timeout flip to error with a clear message; terminal tasks past 7 d
retention get deleted. Both queries are index-backed.

Crash fix uncovered by the smoke: when the async route doesn't await
ref.done, a later cancel/error rejected the in-flight Promise as
unhandled and crashed mcpd. The route now attaches a no-op `.catch`
so the legacy `done` semantic still works for sync callers (chat,
direct infer) without taking out the process for async ones. The
EnqueueInferOptions also gained an explicit `ownerId` field so the
async API can stamp the authenticated user on the row instead of
inheriting 'system' from the constructor's resolveOwner — without
this, every GET/DELETE from the original caller would 404 due to
foreign-owner mismatch.

Smoke (tests/smoke/inference-task.smoke.test.ts):

  1. POST /inference-tasks while no worker bound → row=pending.
  2. Bring a registrar online → bindSession drain claims and
     dispatches → worker complete()s → row=completed → GET returns
     the assistant body.
  3. Stop worker, enqueue, DELETE → row=cancelled, persisted.

docs/inference-tasks.md (new): full data model, lifecycle diagram,
async API reference, CLI examples, RBAC table, GC defaults, and the
v5 limitations / v6 roadmap. Cross-linked from virtual-llms.md and
agents.md.

Tests + smoke: mcpd 893/893, mcplocal 723/723, cli 437/437, full
smoke 146/146 (was 144, +2 new task smoke). Live mcpd verified via
manual curl: enqueue → cancel → re-fetch — no crash, owner scoping
returns 404 on foreign ids, GC ticker logs at info when it sweeps.

v5 complete: durable queue (Stage 1) + VirtualLlmService rewire
(Stage 2) + async API & RBAC (Stage 3) + CLI/GC/smoke/docs (Stage 4).

2026-04-28 15:25:09 +01:00

Agents

An Agent is an LLM persona pinned to a specific Llm, with a system prompt, a description that surfaces in MCP tools/list, optional attachment to a Project, and LiteLLM-style sampling defaults. Conversations are persisted as ChatThread + ChatMessage rows so REPL sessions resume across runs.

Two surfaces use an agent:

- Direct chat via `mcpctl chat <name>` (interactive REPL or one-shot `-m "msg"`). Streams over SSE; tool calls and tool results print to stderr in dim brackets. Slash-commands `/set`, `/system`, `/tools`, `/clear`, `/save`, `/quit` adjust runtime behavior.
- Virtual MCP server registered into every project session by mcplocal's agents plugin. The agent shows up as `agent-<name>` with one tool, `chat`, whose description is the agent's own description. Other Claude sessions / MCP clients see the agent as just another tool in `tools/list` and can consult it.

Data model

Three Prisma models added to src/db/prisma/schema.prisma:

- `Agent`: `name` (unique), `description`, `systemPrompt`, `llmId` (FK Restrict: an Llm in active use cannot be deleted), `projectId` (FK SetNull: agents survive project deletion), `proxyModelName` (optional informational override), `defaultParams` (Json, LiteLLM-style), `extras` (Json, reserved for future LoRA / tool allowlists), `ownerId`, `version`, timestamps.
- `ChatThread`: `agentId`, `ownerId`, `title`, `lastTurnAt`, timestamps. Cascade delete on agent.
- `ChatMessage`: `threadId`, `turnIndex` (monotonic per thread, enforced by `@@unique([threadId, turnIndex])`), `role` (`'system' | 'user' | 'assistant' | 'tool'`), `content`, `toolCalls` (Json: the assistant turn's `[{id,name,arguments}]`), `toolCallId` (which call a tool turn answers), `status` (`'pending' | 'complete' | 'error'`), `createdAt`. Cascade delete on thread.

status stays pending while the orchestrator runs an in-flight assistant or tool turn, then flips to complete once the round settles. On any exception in the chat loop, every pending row in the thread is flipped to error so the trail stays auditable.
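The status lifecycle above can be sketched with an in-memory stand-in for the `ChatMessage` rows. This is a minimal illustration, not the orchestrator's code: the `Row` shape, `settle`, and `runRound` are hypothetical names, and the real loop is asynchronous and writes through Prisma.

```typescript
// Hypothetical in-memory stand-in for ChatMessage rows (the real rows live in the database).
type Status = "pending" | "complete" | "error";
interface Row {
  threadId: string;
  turnIndex: number;
  status: Status;
}

// Flip every pending row of one thread to the given terminal status.
function settle(rows: Row[], threadId: string, to: "complete" | "error"): void {
  for (const row of rows) {
    if (row.threadId === threadId && row.status === "pending") row.status = to;
  }
}

// Run one round: rows stay pending while `work` runs, then settle.
// Synchronous here for brevity; the real chat loop awaits model and tool calls.
function runRound(rows: Row[], threadId: string, work: () => void): void {
  try {
    work();
    settle(rows, threadId, "complete");
  } catch (err) {
    settle(rows, threadId, "error"); // keep the trail auditable
    throw err;
  }
}
```

Rows belonging to other threads are left untouched in both the success and the failure path.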

Chat parameters (LiteLLM-style passthrough)

Per-call resolution: request body → agent.defaultParams → adapter default. Setting a key to null in the request explicitly clears a default.

| Key | Type | Notes |
| --- | --- | --- |
| `temperature` | number | 0..2 |
| `top_p` | number | 0..1 |
| `top_k` | integer | Anthropic-only; OpenAI ignores |
| `max_tokens` | integer | adapter clamps to provider max |
| `stop` | string \| string[] | up to 4 sequences |
| `presence_penalty` | number | OpenAI |
| `frequency_penalty` | number | OpenAI |
| `seed` | integer | reproducibility (provider-dependent) |
| `response_format` | object | `text` \| `json_object` \| `json_schema` |
| `tool_choice` | enum/object | `auto`\|`none`\|`required`\|`{type:'function',function:{name}}` |
| `tools_allowlist` | string[] | restricts which project MCP tools the agent can call this turn |
| `systemOverride` | string | replaces `agent.systemPrompt` for this call |
| `systemAppend` | string | concatenated to system block (after project Prompts) |
| `messages` | array | full message history override; if set, `message`/threadId history is ignored |
| `extra` | object | provider-specific knobs (Anthropic `metadata.user_id`, vLLM `repetition_penalty`); adapters cherry-pick |
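The per-call resolution order (request body, then `agent.defaultParams`, then adapter default) can be sketched as a small merge function. `resolveParams` is a hypothetical name, and one detail is an assumption: a `null` in the request is treated here as dropping the agent default so that the adapter default, if any, applies again. The actual behavior of a cleared key may differ.

```typescript
type Params = Record<string, unknown>;

// Precedence, lowest to highest: adapter default < agent.defaultParams < request body.
// Assumption: a null in the request removes the agent default for that key,
// so the adapter default (if any) applies again.
function resolveParams(request: Params, agentDefaults: Params, adapterDefaults: Params): Params {
  const resolved: Params = { ...adapterDefaults };
  for (const [key, value] of Object.entries(agentDefaults)) {
    if (request[key] !== null) resolved[key] = value; // skip defaults the request cleared
  }
  for (const [key, value] of Object.entries(request)) {
    if (value !== null && value !== undefined) resolved[key] = value;
  }
  return resolved;
}

// Example: the request overrides temperature and clears top_p.
const resolved = resolveParams(
  { temperature: 0.7, top_p: null },
  { temperature: 0.2, top_p: 0.9, max_tokens: 4096 },
  { max_tokens: 1024 },
);
```

In this example `resolved` carries `temperature` 0.7 from the request and `max_tokens` 4096 from the agent default, and has no `top_p` key because no adapter default exists for it.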

HTTP API (mcpd)

GET    /api/v1/agents                  list (RBAC: view:agents)
GET    /api/v1/agents/:idOrName        describe (view:agents)
POST   /api/v1/agents                  create (create:agents)
PUT    /api/v1/agents/:idOrName        update (edit:agents)
DELETE /api/v1/agents/:idOrName        delete (delete:agents)
POST   /api/v1/agents/:name/chat       chat — non-streaming or SSE (run:agents:<name>)
POST   /api/v1/agents/:name/threads    create thread (run:agents:<name>)
GET    /api/v1/agents/:name/threads    list threads (run:agents:<name>)
GET    /api/v1/threads/:id/messages    replay history (view:agents)
GET    /api/v1/projects/:p/agents      project-scoped list (view:projects:<p>)

The chat endpoint reuses the SSE pattern from llm-infer.ts exactly: same headers (text/event-stream, X-Accel-Buffering: no), same data: …\n\n framing, same [DONE] terminator. SSE chunk types:

- `{type:'text', delta}`: assistant text increments
- `{type:'tool_call', toolName, args}`: model decided to call a tool
- `{type:'tool_result', toolName, ok}`: tool dispatch outcome
- `{type:'final', threadId, turnIndex}`: terminal turn
- `{type:'error', message}`: fatal error in the loop
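A client-side reading of this framing might look like the sketch below. It assumes each `data:` payload is a JSON object with the fields listed above (the wire encoding is not spelled out here) and that the whole response body is already buffered; a streaming client would additionally have to handle frames split across network reads. `ChatChunk`, `parseSseBody`, and `collectText` are illustrative names.

```typescript
// Chunk shapes taken from the list above; field types are assumptions.
type ChatChunk =
  | { type: "text"; delta: string }
  | { type: "tool_call"; toolName: string; args: unknown }
  | { type: "tool_result"; toolName: string; ok: boolean }
  | { type: "final"; threadId: string; turnIndex: number }
  | { type: "error"; message: string };

// Parse a fully buffered SSE body; stops at the [DONE] terminator.
function parseSseBody(body: string): ChatChunk[] {
  const chunks: ChatChunk[] = [];
  for (const frame of body.split("\n\n")) {
    const line = frame.trim();
    if (!line.startsWith("data:")) continue; // ignore blank or non-data frames
    const payload = line.slice("data:".length).trim();
    if (payload === "[DONE]") break;
    chunks.push(JSON.parse(payload) as ChatChunk);
  }
  return chunks;
}

// Concatenate the assistant text increments.
function collectText(chunks: ChatChunk[]): string {
  return chunks.map((c) => (c.type === "text" ? c.delta : "")).join("");
}
```

The discriminated union on `type` lets the compiler check that each branch only reads the fields that chunk type carries.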

Tool-use loop

When the agent's project has MCP servers attached, mcpd's ChatService lists each server's tools (via mcp-proxy.service.ts — same path real MCP traffic uses) and presents them to the model namespaced as <server>__<tool>. On a tool_calls response the loop dispatches each call back through the same proxy, persists the assistant + tool turns linked by toolCallId, and loops (cap = 12 iterations) until the model returns terminal text.

Persistence is non-transactional across the loop because tool calls can take minutes; long-held DB transactions would starve other writers.
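The loop described above can be sketched as follows. This is a simplified illustration, not `ChatService` itself: `runToolLoop`, `splitToolName`, and the `callModel` / `dispatch` callbacks are hypothetical, persistence is reduced to an in-memory array, splitting the namespaced name at the first `__` is an assumption, and so is throwing when the 12-iteration cap is reached.

```typescript
interface ToolCall { id: string; name: string; arguments: string }
interface ModelReply { text?: string; toolCalls?: ToolCall[] }
interface Turn { role: "assistant" | "tool"; content: string; toolCallId?: string }

const MAX_ITERATIONS = 12; // cap stated in the text above

// Split "<server>__<tool>" at the first double underscore (assumed convention).
function splitToolName(namespaced: string): { server: string; tool: string } {
  const at = namespaced.indexOf("__");
  if (at < 0) throw new Error(`not a namespaced tool: ${namespaced}`);
  return { server: namespaced.slice(0, at), tool: namespaced.slice(at + 2) };
}

async function runToolLoop(
  callModel: (history: Turn[]) => Promise<ModelReply>,
  dispatch: (server: string, tool: string, args: string) => Promise<string>,
): Promise<Turn[]> {
  const history: Turn[] = [];
  for (let i = 0; i < MAX_ITERATIONS; i++) {
    const reply = await callModel(history);
    history.push({ role: "assistant", content: reply.text ?? "" });
    if (!reply.toolCalls || reply.toolCalls.length === 0) return history; // terminal text
    for (const call of reply.toolCalls) {
      const { server, tool } = splitToolName(call.name);
      const result = await dispatch(server, tool, call.arguments);
      // Each tool turn is linked to the call it answers via toolCallId.
      history.push({ role: "tool", content: result, toolCallId: call.id });
    }
  }
  // Assumption: hitting the cap is treated as an error.
  throw new Error(`tool loop exceeded ${MAX_ITERATIONS} iterations`);
}
```

Each turn is appended as soon as it exists rather than at the end of the loop, which mirrors the non-transactional persistence described above.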

RBAC

Agents are their own resource (agents), independent of project bindings. Recommended:

- `view:agents`: list / describe
- `create:agents` / `edit:agents` / `delete:agents`: CRUD
- `run:agents:<name>`: drive a chat turn or manage its threads

Project-attached agents do not implicitly inherit project RBAC. If a project member should be able to chat with the project's agents, grant them run:agents:<each-name> (or wildcard run:agents) explicitly.
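As an illustration of how these grants compose, the sketch below checks a permission against a flat list of grant strings. `isAllowed` and the string-list representation are assumptions made for the example; mcpd's actual RBAC evaluation is not shown here.

```typescript
// Grants are strings such as "view:agents" or "run:agents:reviewer".
// A grant without a name segment ("run:agents") acts as the wildcard for that resource.
function isAllowed(grants: string[], action: string, resource: string, name?: string): boolean {
  const wildcard = `${action}:${resource}`;
  const scoped = name === undefined ? wildcard : `${wildcard}:${name}`;
  return grants.includes(wildcard) || grants.includes(scoped);
}

// A project grant alone does not allow chatting with the project's agents.
const projectMember = ["view:projects:mcpctl-dev", "view:agents"];
const canChat = isAllowed(projectMember, "run", "agents", "deployer"); // false

// An explicit per-agent grant (or the wildcard) does.
const canChatAfterGrant = isAllowed([...projectMember, "run:agents:deployer"], "run", "agents", "deployer"); // true
```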

YAML round-trip

`mcpctl get agent foo -o yaml | mcpctl apply -f -` is a no-op. The apply schema also accepts shorthand:

apiVersion: mcpctl.io/v1
kind: agent
metadata: { name: deployer }
spec:
  description: "I help you deploy code"
  llm: qwen3-thinking          # shorthand for `{ name: qwen3-thinking }`
  project: mcpctl-dev          # shorthand for `{ name: mcpctl-dev }`
  systemPrompt: |
    You are a deployment assistant for mcpctl. Always check fulldeploy.sh
    and the k8s context before suggesting actions.
  defaultParams:
    temperature: 0.2
    max_tokens: 4096
    top_p: 0.9
    stop: ["</deploy>"]

Wiring against your in-cluster qwen3-thinking

The kubernetes-deployment repo provisions LiteLLM in the nvidia-nim namespace (http://litellm.nvidia-nim.svc.cluster.local:4000/v1 in-cluster, https://llm.ad.itaz.eu/v1 external) and a virtual key reserved for mcpctl in the Pulumi secret secrets:litellmMcpctlGatewayToken. To pull the key once and wire it up:

cd /path/to/kubernetes-deployment
LITELLM_TOKEN=$(pulumi config get --stack homelab secrets:litellmMcpctlGatewayToken)

# fallback if Pulumi isn't authed locally:
# LITELLM_TOKEN=$(kubectl --context worker0-k8s0 -n nvidia-nim get secret litellm-secrets \
#   -o jsonpath='{.data.LITELLM_MCPCTL_GATEWAY_TOKEN}' | base64 -d)

cd /path/to/mcpctl
mcpctl create secret litellm-key --data "API_KEY=${LITELLM_TOKEN}"
mcpctl create llm qwen3-thinking \
    --type openai --model qwen3-thinking \
    --url http://litellm.nvidia-nim.svc.cluster.local:4000/v1 \
    --api-key-ref litellm-key/API_KEY \
    --description "Qwen3-30B-A3B-Thinking-FP8 via in-cluster vLLM behind LiteLLM"
mcpctl create agent reviewer \
    --llm qwen3-thinking \
    --description "I review what you're shipping; ask after each major change." \
    --default-temperature 0.2 --default-max-tokens 4096
mcpctl chat reviewer

Troubleshooting

- Namespace collision in mcplocal: if a project has an upstream MCP server literally named `agent-<x>`, the agents plugin detects the collision in `onSessionCreate`, skips that agent's registration, and emits a `ctx.log.warn` line. Treat the `agent-` prefix as reserved and avoid it in real server names.
- Llm-in-use blocks delete: `Agent.llm` is `onDelete: Restrict`. Detach every agent (or delete them) before deleting the underlying Llm.
- Stale pending rows: a crash mid-loop leaves pending `ChatMessage` rows. The next request recovers: `markPendingAsError` flips them on the next failure path, and `loadHistory` filters out error rows when rebuilding context for the next turn.
- `proxyModelName` is informational only for agents. The agent's own internal tool loop runs server-side in mcpd and bypasses mcplocal's proxymodel pipeline entirely. Don't try to plumb it.
- Anthropic + tools: the Anthropic adapter currently drops `tool` role messages and doesn't translate OpenAI `tool_calls` to Anthropic `tool_use` / `tool_result` blocks. Use an OpenAI-compatible provider (LiteLLM, vLLM, OpenAI) for agents that need tool calling until that translation lands.
