CLI surface for the durable queue:
- `mcpctl get tasks` — table view (ID, STATUS, POOL, LLM, MODEL,
STREAM, AGE, WORKER). Aliases `task`, `tasks`, `inference-task`,
`inference-tasks` all normalize to the canonical plural so URL
construction works uniformly. RESOURCE_ALIASES + completions
generator updated.
- `mcpctl chat-llm <name> --async -m <msg>` — enqueue and exit. stdout
is just the task id (pipeable into `xargs mcpctl get task`); stderr
carries human-readable status. REPL mode is rejected for --async
(fire-and-forget doesn't make sense without -m).
GC ticker in mcpd: 5-min interval. Pending tasks past 1 h queue
timeout flip to error with a clear message; terminal tasks past 7 d
retention get deleted. Both queries are index-backed.
Crash fix uncovered by the smoke: when the async route doesn't await
ref.done, a later cancel/error rejected the in-flight Promise as
unhandled and crashed mcpd. The route now attaches a no-op `.catch`
so the legacy `done` semantic still works for sync callers (chat,
direct infer) without taking out the process for async ones. The
EnqueueInferOptions also gained an explicit `ownerId` field so the
async API can stamp the authenticated user on the row instead of
inheriting 'system' from the constructor's resolveOwner — without
this, every GET/DELETE from the original caller would 404 due to
foreign-owner mismatch.
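The no-op `.catch` fix can be sketched as follows; `TaskRef` and `enqueueAsync` are illustrative names, not mcpd's actual types:

```typescript
// Sketch of the crash fix (names are hypothetical, not mcpd's real API).
interface TaskRef {
  id: string;
  done: Promise<string>; // resolves with the final body, rejects on cancel/error
}

function enqueueAsync(ref: TaskRef): string {
  // The async route never awaits ref.done. Without a handler, a later
  // cancel/error would reject the promise as unhandled and take down the
  // process. A no-op .catch makes the rejection harmless here while sync
  // callers can still `await ref.done` and observe the rejection normally.
  ref.done.catch(() => {});
  return ref.id;
}
```

A promise gains a handler per `.then`/`.catch` chain, so attaching an extra no-op `.catch` does not swallow the rejection for callers who hold their own reference to `done`.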
Smoke (tests/smoke/inference-task.smoke.test.ts):
1. POST /inference-tasks while no worker bound → row=pending.
2. Bring a registrar online → bindSession drain claims and
dispatches → worker complete()s → row=completed → GET returns
the assistant body.
3. Stop worker, enqueue, DELETE → row=cancelled, persisted.
docs/inference-tasks.md (new): full data model, lifecycle diagram,
async API reference, CLI examples, RBAC table, GC defaults, and the
v5 limitations / v6 roadmap. Cross-linked from virtual-llms.md and
agents.md.
Tests + smoke: mcpd 893/893, mcplocal 723/723, cli 437/437, full
smoke 146/146 (was 144, +2 new task smoke). Live mcpd verified via
manual curl: enqueue → cancel → re-fetch — no crash, owner scoping
returns 404 on foreign ids, GC ticker logs at info when it sweeps.
v5 complete: durable queue (Stage 1) + VirtualLlmService rewire
(Stage 2) + async API & RBAC (Stage 3) + CLI/GC/smoke/docs (Stage 4).
Inference Tasks (v5)
A durable inference task queue. Every inference call lands as a row in the database; mcpd's previous in-memory request map is gone. mcpd restart no longer drops in-flight work, worker disconnects no longer fail callers, and pools that have no live workers can still accept work that drains when one shows up.
The queue is the same persistence layer underneath two API surfaces:
- Sync (existing): `POST /api/v1/llms/<name>/infer` and `POST /api/v1/agents/<name>/chat`. The caller's HTTP request stays open until the task finishes, exactly like pre-v5. Internally it's now a DB row with a 10-min wait. Used by `mcpctl chat` / `mcpctl chat-llm`.
- Async (new — this doc): `POST /api/v1/inference-tasks`. The caller gets a task id immediately and polls / streams / cancels separately. Useful for fire-and-forget scripts, queued work for offline pools, and operator visibility into in-flight jobs.
Lifecycle
              on dispatch              first chunk
  pending ───────────────► claimed ───────────────► running ──► completed | error
     ▲                        │                        │
     └────────────────────────┴────────────────────────┘
            on revert (worker disconnect)

  DELETE (pending / claimed / running) ──► cancelled
| Status | Meaning |
|---|---|
| `pending` | In queue, no worker has it yet (or claim was reverted) |
| `claimed` | A worker has it; dispatch frame went down the SSE channel |
| `running` | First chunk came back; streaming is in flight |
| `completed` | Worker POSTed terminal result |
| `error` | Permanent failure (auth, bad request, queue timeout) |
| `cancelled` | Caller cancelled via DELETE |
A worker disconnecting mid-task reverts claimed/running rows back
to pending (the row's claimedBy is cleared); a sibling worker on
the same pool drains it. Tasks survive mcpd restart — the row is the
source of truth, so a fresh mcpd reads the queue and either matches a
worker or waits.
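The revert rule above can be sketched as a pure function over rows (the row shape here is illustrative, not the actual schema):

```typescript
// Sketch of revert-on-disconnect: claimed/running rows owned by the dropped
// session go back to pending with claimedBy cleared; everything else is untouched.
type TaskStatus = "pending" | "claimed" | "running" | "completed" | "error" | "cancelled";

interface TaskRow {
  id: string;
  status: TaskStatus;
  claimedBy: string | null;
}

function revertClaims(rows: TaskRow[], sessionId: string): TaskRow[] {
  return rows.map((row) =>
    row.claimedBy === sessionId &&
    (row.status === "claimed" || row.status === "running")
      ? { ...row, status: "pending", claimedBy: null }
      : row
  );
}
```

Because the reverted row is indistinguishable from a freshly enqueued one, the normal drain path on a sibling worker picks it up with no special casing.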
Async API
POST /api/v1/inference-tasks → 201 + { id, status, poolName, llmName, streaming, createdAt }
GET /api/v1/inference-tasks?... → list (owner-scoped by default; ?status, ?poolName, ?agentId, ?limit)
GET /api/v1/inference-tasks/<id> → poll one row
GET /api/v1/inference-tasks/<id>/stream → SSE: `event: chunk` + `event: terminal`
DELETE /api/v1/inference-tasks/<id> → cancel pending/claimed/running; no-op on terminal
Foreign-owner ids return 404 (not 403) on every endpoint — keeps id-enumeration from leaking task existence across users.
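The scoping rule can be sketched as a small route-layer check (function name, row shape, and the permission flag are illustrative, not mcpd's actual code):

```typescript
// Sketch of the 404-over-403 owner check: missing and foreign-owned rows
// are indistinguishable to the caller, so probing ids leaks nothing.
interface Task {
  id: string;
  ownerId: string;
}

function fetchTaskForCaller(
  task: Task | undefined,
  callerId: string,
  hasResourceWideView: boolean // e.g. a view:tasks binding
): { status: 200; task: Task } | { status: 404 } {
  if (!task) return { status: 404 };
  if (task.ownerId !== callerId && !hasResourceWideView) return { status: 404 };
  return { status: 200, task };
}
```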
Enqueue
curl -X POST https://mcpd/api/v1/inference-tasks \
-H 'Authorization: Bearer ...' \
-H 'Content-Type: application/json' \
-d '{
"llmName": "vllm-local-qwen3",
"request": {
"messages": [{"role":"user","content":"summarize the changelog"}],
"max_tokens": 500
},
"streaming": false
}'
# 201 Created
# {
# "id": "cmoip...",
# "status": "pending",
# "poolName": "vllm-local-qwen3",
# "llmName": "vllm-local-qwen3",
# "streaming": false,
# "createdAt": "2026-04-28T..."
# }
Rejects with 400 for public Llms — the existing
POST /api/v1/llms/<name>/infer is the right tool there. Rejects
with 404 for missing Llm names.
Poll
curl https://mcpd/api/v1/inference-tasks/cmoip... \
-H 'Authorization: Bearer ...'
# 200
# {
# "id": "cmoip...",
# "status": "completed",
# "responseBody": { "choices": [{"message":{"role":"assistant","content":"..."}}] },
# "claimedBy": "<providerSessionId>",
# "createdAt": "...", "claimedAt": "...", "completedAt": "..."
# }
Stream live updates
curl -N https://mcpd/api/v1/inference-tasks/cmoip.../stream \
-H 'Authorization: Bearer ...'
# event: chunk
# data: {"data":"hello "}
#
# event: chunk
# data: {"data":"world","done":true}
#
# event: terminal
# data: { ...full row... }
Already-terminal tasks emit one terminal event and close. Caller
disconnect cleans up server-side.
Cancel
curl -X DELETE https://mcpd/api/v1/inference-tasks/cmoip... \
-H 'Authorization: Bearer ...'
Returns 200 + the cancelled row. Cancelling a completed /
error / already-cancelled task is a no-op (returns the current
row). The worker that picked the task up may still execute the
underlying inference call (no remote-cancel protocol yet); the
result is just discarded server-side.
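The DELETE semantics reduce to a one-step state check, sketched here with an illustrative row shape:

```typescript
// Sketch of cancel semantics: non-terminal rows flip to cancelled,
// terminal rows are returned unchanged (the documented no-op).
type Status = "pending" | "claimed" | "running" | "completed" | "error" | "cancelled";

interface Row {
  id: string;
  status: Status;
}

function cancelTask(row: Row): Row {
  const terminal: Status[] = ["completed", "error", "cancelled"];
  if (terminal.includes(row.status)) return row; // no-op: return current row
  return { ...row, status: "cancelled" };
}
```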
CLI
mcpctl get tasks # table view: ID, STATUS, POOL, LLM, MODEL, STREAM, AGE, WORKER
mcpctl get task <id> # describe one
mcpctl get task <id> -o yaml # full row, yaml format
# Enqueue async via chat-llm:
mcpctl chat-llm vllm-local-qwen3 --async -m "summarize the changelog"
# stdout: cmoip... (just the id, pipeable)
# stderr: (task cmoip... enqueued for pool 'vllm-local-qwen3', status=pending)
# Wire to other commands:
ID=$(mcpctl chat-llm vllm-local-qwen3 --async -m "long task")
sleep 30
mcpctl get task $ID
RBAC
A new `tasks` resource. Default action mapping:
| Method | Action |
|---|---|
| GET | `view` |
| POST | `create` |
| DELETE | `delete` |
Aliases: `task`, `inference-task`, `inference-tasks` all normalize to
`tasks`, so `mcpctl create rbac-binding --resource task --role view ...`
works regardless of which form the operator types.
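The alias normalization amounts to a lookup table; this sketch mirrors the documented behavior, though the real `RESOURCE_ALIASES` map contains entries for other resources too:

```typescript
// Sketch of resource-alias normalization (task entries only; illustrative).
const RESOURCE_ALIASES: Record<string, string> = {
  task: "tasks",
  "inference-task": "tasks",
  "inference-tasks": "tasks",
};

function normalizeResource(name: string): string {
  // Unknown names pass through unchanged so canonical forms keep working.
  return RESOURCE_ALIASES[name] ?? name;
}
```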
Owner scoping is enforced at the route layer ON TOP of the RBAC hook:
- A caller with only the implicit role sees and can cancel their own tasks.
- `view:tasks` resource-wide is required to see / list / cancel another user's tasks.
GC
mcpd runs a 5-min ticker that:
- Flips pending tasks older than the queue timeout (default 1 h) to `error` with a clear message ("Task expired in pending state after ...s"). Any caller still polling sees a clean failure.
- Deletes terminal tasks (`completed`/`error`/`cancelled`) past the retention window (default 7 d). Indexes back the `WHERE status IN (...) AND completedAt < cutoff` predicate so the sweep is cheap even with millions of rows.
Both timeouts are constants in main.ts; making them per-deployment
configurable is a v6 concern.
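The two sweep predicates can be sketched as pure functions; the constants mirror the documented defaults (in the real code they live in main.ts), and the row shape is illustrative:

```typescript
// Sketch of the GC ticker's two predicates (defaults: 1 h queue timeout, 7 d retention).
const QUEUE_TIMEOUT_MS = 60 * 60 * 1000;        // 1 h
const RETENTION_MS = 7 * 24 * 60 * 60 * 1000;   // 7 d

interface GcRow {
  status: string;
  createdAt: number;          // epoch ms
  completedAt: number | null; // epoch ms, null until terminal
}

// pending past the queue timeout → flip to error with a clear message
function shouldExpire(row: GcRow, now: number): boolean {
  return row.status === "pending" && now - row.createdAt > QUEUE_TIMEOUT_MS;
}

// terminal past the retention window → delete
function shouldDelete(row: GcRow, now: number): boolean {
  const terminal = ["completed", "error", "cancelled"];
  return (
    terminal.includes(row.status) &&
    row.completedAt !== null &&
    now - row.completedAt > RETENTION_MS
  );
}
```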
Limitations / Out-of-scope for v5
- Streaming chunks are not persisted. The DB stores only the final assembled body (or null for streaming-only consumers). A caller who connects to `/stream` AFTER the task completed gets one `terminal` event with no chunk replay. Persisting deltas would be a per-chunk INSERT — too expensive at SSE granularity.
- Public-Llm queueing is not exposed. Public Llms are HTTP adapters with no SSE worker, so the queue concept doesn't apply. Direct `/llms/<name>/infer` against a public Llm stays the right call.
- Multi-instance mcpd: the in-process `EventEmitter` that wakes blocked HTTP handlers fires only on the mcpd that holds the row's blocked handler. With multiple mcpd replicas behind a load balancer the wakeup wouldn't cross instances. v6 swap: pg `LISTEN/NOTIFY`, no schema change.
- Worker capacity is one task per session at a time (today). A per-session capacity hint + scheduler-level concurrency control is a v6+ concern.
- Remote cancel protocol: cancelling a `running` task marks the row as cancelled but the worker keeps executing. The result POST is discarded server-side. Real cancellation (an SSE message that tells the worker to abort its in-flight call) is a v6 concern.
See also
- virtual-llms.md — the Llm-side primitives (kind, lifecycle, pools) that the queue routes around.
- agents.md — how the chat path uses the same queue underneath.