mcpctl/docs/inference-tasks.md
Michal 7320b50dac
feat(cli+docs+smoke): inference-task CLI + GC ticker + smoke + docs (v5 Stage 4)
CLI surface for the durable queue:

- `mcpctl get tasks` — table view (ID, STATUS, POOL, LLM, MODEL,
  STREAM, AGE, WORKER). Aliases `task`, `tasks`, `inference-task`,
  `inference-tasks` all normalize to the canonical plural so URL
  construction works uniformly. RESOURCE_ALIASES + completions
  generator updated.
- `mcpctl chat-llm <name> --async -m <msg>` — enqueue and exit. stdout
  is just the task id (pipeable into `xargs mcpctl get task`); stderr
  carries human-readable status. REPL mode is rejected for --async
  (fire-and-forget doesn't make sense without -m).

GC ticker in mcpd: 5-min interval. Pending tasks past 1 h queue
timeout flip to error with a clear message; terminal tasks past 7 d
retention get deleted. Both queries are index-backed.

Crash fix uncovered by the smoke: when the async route doesn't await
ref.done, a later cancel/error rejected the in-flight Promise as
unhandled and crashed mcpd. The route now attaches a no-op `.catch`
so the legacy `done` semantic still works for sync callers (chat,
direct infer) without taking out the process for async ones. The
EnqueueInferOptions also gained an explicit `ownerId` field so the
async API can stamp the authenticated user on the row instead of
inheriting 'system' from the constructor's resolveOwner — without
this, every GET/DELETE from the original caller would 404 due to
foreign-owner mismatch.
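
The fix described above can be sketched as follows. The names (`InferRef`, `enqueueAsync`) are illustrative, not the actual mcpd internals; only the no-op `.catch` mechanism is from the commit:

```typescript
// Minimal sketch of the unhandled-rejection fix. The async route returns the
// task id without awaiting ref.done, so a later cancel/error would otherwise
// reject an unobserved Promise and take down the process.
interface InferRef {
  id: string;
  done: Promise<unknown>; // settles when the task reaches a terminal state
}

function enqueueAsync(ref: InferRef): string {
  // Sync callers still `await ref.done` and observe the rejection normally;
  // attaching a no-op handler here only prevents the rejection from surfacing
  // as "unhandled" when nobody awaits it in the async path.
  ref.done.catch(() => {});
  return ref.id;
}
```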

Smoke (tests/smoke/inference-task.smoke.test.ts):

  1. POST /inference-tasks while no worker bound → row=pending.
  2. Bring a registrar online → bindSession drain claims and
     dispatches → worker complete()s → row=completed → GET returns
     the assistant body.
  3. Stop worker, enqueue, DELETE → row=cancelled, persisted.

docs/inference-tasks.md (new): full data model, lifecycle diagram,
async API reference, CLI examples, RBAC table, GC defaults, and the
v5 limitations / v6 roadmap. Cross-linked from virtual-llms.md and
agents.md.

Tests + smoke: mcpd 893/893, mcplocal 723/723, cli 437/437, full
smoke 146/146 (was 144, +2 new task smoke). Live mcpd verified via
manual curl: enqueue → cancel → re-fetch — no crash, owner scoping
returns 404 on foreign ids, GC ticker logs at info when it sweeps.

v5 complete: durable queue (Stage 1) + VirtualLlmService rewire
(Stage 2) + async API & RBAC (Stage 3) + CLI/GC/smoke/docs (Stage 4).
2026-04-28 15:25:09 +01:00


Inference Tasks (v5)

A durable inference task queue. Every inference call lands as a row in the database; mcpd's previous in-memory request map is gone. mcpd restart no longer drops in-flight work, worker disconnects no longer fail callers, and pools that have no live workers can still accept work that drains when one shows up.

The same persistence layer sits underneath two API surfaces:

  • Sync (existing): POST /api/v1/llms/<name>/infer and POST /api/v1/agents/<name>/chat. Caller's HTTP request stays open until the task finishes, exactly like pre-v5. Internally it's now a DB row with a 10-min wait. Used by mcpctl chat / mcpctl chat-llm.
  • Async (new — this doc): POST /api/v1/inference-tasks. Caller gets a task id immediately and polls / streams / cancels separately. Useful for fire-and-forget scripts, queued work for offline pools, and operator visibility into in-flight jobs.

Lifecycle

   pending ──on dispatch──► claimed ──first chunk──► running ──┐
      ▲                        │                        │      │
      └────────────────────────┴────────────────────────┘      │ terminal
          on revert (worker disconnect)                        │ result
      │                                                        ▼
      │ DELETE (any non-terminal state)               completed | error
      ▼
  cancelled
Status      Meaning
pending     In queue, no worker has it yet (or claim was reverted)
claimed     A worker has it; dispatch frame went down the SSE channel
running     First chunk came back; streaming is in flight
completed   Worker POSTed terminal result
error       Permanent failure (auth, bad request, queue timeout)
cancelled   Caller cancelled via DELETE

A worker disconnecting mid-task reverts claimed/running rows back to pending (the row's claimedBy is cleared); a sibling worker on the same pool drains it. Tasks survive mcpd restart — the row is the source of truth, so a fresh mcpd reads the queue and either matches a worker or waits.
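
The revert semantics can be sketched in-memory (field names follow the doc's `claimedBy`; the rest of the row shape is assumed — in mcpd this is a DB update, not an array map):

```typescript
// Illustrative revert-on-disconnect: claimed/running rows held by the dropped
// session go back to pending with claimedBy cleared, so a sibling worker on
// the same pool can drain them.
type TaskRow = {
  id: string;
  status: "pending" | "claimed" | "running" | "completed" | "error" | "cancelled";
  claimedBy: string | null;
};

function revertOnDisconnect(rows: TaskRow[], sessionId: string): TaskRow[] {
  return rows.map((row) =>
    (row.status === "claimed" || row.status === "running") && row.claimedBy === sessionId
      ? { ...row, status: "pending", claimedBy: null } // back in the queue
      : row,                                           // untouched
  );
}
```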

Async API

POST   /api/v1/inference-tasks              → 201 + { id, status, poolName, llmName, streaming, createdAt }
GET    /api/v1/inference-tasks?...          → list (owner-scoped by default; ?status, ?poolName, ?agentId, ?limit)
GET    /api/v1/inference-tasks/<id>         → poll one row
GET    /api/v1/inference-tasks/<id>/stream  → SSE: `event: chunk` + `event: terminal`
DELETE /api/v1/inference-tasks/<id>         → cancel pending/claimed/running; no-op on terminal

Foreign-owner ids return 404 (not 403) on every endpoint — keeps id-enumeration from leaking task existence across users.

Enqueue

curl -X POST https://mcpd/api/v1/inference-tasks \
  -H 'Authorization: Bearer ...' \
  -H 'Content-Type: application/json' \
  -d '{
    "llmName": "vllm-local-qwen3",
    "request": {
      "messages": [{"role":"user","content":"summarize the changelog"}],
      "max_tokens": 500
    },
    "streaming": false
  }'

# 201 Created
# {
#   "id": "cmoip...",
#   "status": "pending",
#   "poolName": "vllm-local-qwen3",
#   "llmName": "vllm-local-qwen3",
#   "streaming": false,
#   "createdAt": "2026-04-28T..."
# }

Rejects with 400 for public Llms — the existing POST /api/v1/llms/<name>/infer is the right tool there. Rejects with 404 for missing Llm names.
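
The decision order can be sketched as below. The `kind` discriminator is an assumption for illustration — the doc only specifies the observable status codes:

```typescript
// Enqueue-time checks: unknown Llm name → 404; public Llm → 400 (use
// POST /api/v1/llms/<name>/infer directly); otherwise the row is created
// as pending and 201 is returned.
type LlmRecord = { name: string; kind: "virtual" | "public" };

function enqueueStatus(llm: LlmRecord | undefined): 201 | 400 | 404 {
  if (!llm) return 404;                  // missing Llm name
  if (llm.kind === "public") return 400; // no SSE worker, queue doesn't apply
  return 201;                            // row lands as pending
}
```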

Poll

curl https://mcpd/api/v1/inference-tasks/cmoip... \
  -H 'Authorization: Bearer ...'

# 200
# {
#   "id": "cmoip...",
#   "status": "completed",
#   "responseBody": { "choices": [{"message":{"role":"assistant","content":"..."}}] },
#   "claimedBy": "<providerSessionId>",
#   "createdAt": "...", "claimedAt": "...", "completedAt": "..."
# }

Stream live updates

curl -N https://mcpd/api/v1/inference-tasks/cmoip.../stream \
  -H 'Authorization: Bearer ...'

# event: chunk
# data: {"data":"hello "}
#
# event: chunk
# data: {"data":"world","done":true}
#
# event: terminal
# data: { ...full row... }

Already-terminal tasks emit one terminal event and close. Caller disconnect cleans up server-side.

Cancel

curl -X DELETE https://mcpd/api/v1/inference-tasks/cmoip... \
  -H 'Authorization: Bearer ...'

Returns 200 + the cancelled row. Cancelling a completed / error / already-cancelled task is a no-op (returns the current row). The worker that picked the task up may still execute the underlying inference call (no remote-cancel protocol yet); the result is just discarded server-side.
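
The DELETE semantics reduce to a small state check (row shape assumed for illustration):

```typescript
// Cancel semantics: non-terminal rows flip to cancelled; terminal rows
// (completed / error / already-cancelled) come back unchanged — a no-op.
type Status = "pending" | "claimed" | "running" | "completed" | "error" | "cancelled";

function cancelTask(row: { id: string; status: Status }): { id: string; status: Status } {
  const terminal: Status[] = ["completed", "error", "cancelled"];
  if (terminal.includes(row.status)) return row; // no-op: return current row
  return { ...row, status: "cancelled" };
}
```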

CLI

mcpctl get tasks                # table view: ID, STATUS, POOL, LLM, MODEL, STREAM, AGE, WORKER
mcpctl get task <id>            # describe one
mcpctl get task <id> -o yaml    # full row, yaml format

# Enqueue async via chat-llm:
mcpctl chat-llm vllm-local-qwen3 --async -m "summarize the changelog"
# stdout: cmoip... (just the id, pipeable)
# stderr: (task cmoip... enqueued for pool 'vllm-local-qwen3', status=pending)

# Wire to other commands:
ID=$(mcpctl chat-llm vllm-local-qwen3 --async -m "long task")
sleep 30
mcpctl get task $ID

RBAC

A new tasks resource. Default action mapping:

Method   Action
GET      view
POST     create
DELETE   delete

Aliases: task, inference-task, inference-tasks all normalize to tasks so mcpctl create rbac-binding --resource task --role view ... works regardless of which form the operator types.

Owner scoping is enforced at the route layer ON TOP of the RBAC hook:

  • A caller with only the implicit role sees and can cancel their own tasks.
  • view:tasks resource-wide is required to see / list / cancel another user's tasks.
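
The scoping above can be sketched as a route-layer check (names assumed; in mcpd this runs after the RBAC hook has already authorized the action):

```typescript
// Owner scoping on top of RBAC: a foreign-owner id returns 404, not 403,
// so probing ids never reveals whether another user's task exists.
// Resource-wide view:tasks bypasses the scoping.
type Caller = { userId: string; resourceWide: boolean };
type TaskRowLite = { id: string; ownerId: string };

function scopeTask(
  row: TaskRowLite | undefined,
  caller: Caller,
): { status: 200 | 404; body?: TaskRowLite } {
  if (!row) return { status: 404 };                      // genuinely missing
  if (row.ownerId !== caller.userId && !caller.resourceWide)
    return { status: 404 };                              // hide, don't 403
  return { status: 200, body: row };
}
```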

GC

mcpd runs a 5-min ticker that:

  • Flips pending tasks older than the queue timeout (default 1 h) to error with a clear message ("Task expired in pending state after ...s"). Any caller still polling sees a clean failure.
  • Deletes terminal tasks (completed/error/cancelled) past the retention window (default 7 d). The WHERE status IN (...) AND completedAt < cutoff predicate is index-backed, so the sweep stays cheap even with millions of rows.

Both timeouts are constants in main.ts; making them per-deployment configurable is a v6 concern.
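
The sweep logic, with the documented defaults, can be sketched in-memory (row shape and helper name are assumptions — the real implementation is two index-backed queries, not an array pass):

```typescript
// One GC tick: expire stale pending rows, drop terminal rows past retention.
const QUEUE_TIMEOUT_MS = 60 * 60 * 1000;      // pending → error after 1 h
const RETENTION_MS = 7 * 24 * 60 * 60 * 1000; // delete terminal rows after 7 d

type GcRow = {
  status: "pending" | "claimed" | "running" | "completed" | "error" | "cancelled";
  createdAt: number;
  completedAt?: number;
  errorMessage?: string;
};

function sweep(rows: GcRow[], now: number): GcRow[] {
  return rows
    .filter(
      (r) =>
        !( // terminal and past retention → deleted
          (r.status === "completed" || r.status === "error" || r.status === "cancelled") &&
          r.completedAt !== undefined &&
          now - r.completedAt > RETENTION_MS
        ),
    )
    .map((r) =>
      r.status === "pending" && now - r.createdAt > QUEUE_TIMEOUT_MS
        ? { // pending past queue timeout → error with a clear message
            ...r,
            status: "error" as const,
            errorMessage: `Task expired in pending state after ${Math.round((now - r.createdAt) / 1000)}s`,
          }
        : r,
    );
}
```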

Limitations / Out-of-scope for v5

  • Streaming chunks are not persisted. The DB stores only the final assembled body (or null for streaming-only consumers). A caller who connects to /stream AFTER the task completed gets one terminal event with no chunk replay. Persisting deltas would be a per-chunk INSERT — too expensive at SSE granularity.
  • Public-Llm queueing is not exposed. Public Llms are HTTP adapters with no SSE worker, so the queue concept doesn't apply. Direct /llms/<name>/infer against a public Llm stays the right call.
  • Multi-instance mcpd: the in-process EventEmitter that wakes blocked HTTP handlers fires only on the mcpd that holds the row's blocked handler. With multiple mcpd replicas behind a load balancer the wakeup wouldn't cross instances. v6 swap: pg LISTEN/NOTIFY, no schema change.
  • Worker capacity is one task per session at a time (today). A per-session capacity hint + scheduler-level concurrency control is a v6+ concern.
  • Remote cancel protocol: cancelling a running task marks the row as cancelled but the worker keeps executing. The result POST is discarded server-side. Real cancellation (an SSE message that tells the worker to abort its in-flight call) is a v6 concern.

See also

  • virtual-llms.md — the Llm-side primitives (kind, lifecycle, pools) that the queue routes around.
  • agents.md — how the chat path uses the same queue underneath.