# Inference Tasks (v5)

A **durable inference task queue**. Every inference call lands as a row in the database; mcpd's previous in-memory request map is gone. An mcpd restart no longer drops in-flight work, worker disconnects no longer fail callers, and pools that have no live workers can still accept work that drains when one shows up.

The queue is the same persistence layer underneath two API surfaces:

- **Sync** (existing): `POST /api/v1/llms/<name>/infer` and `POST /api/v1/agents/<name>/chat`. The caller's HTTP request stays open until the task finishes, exactly as pre-v5. Internally it is now a DB row with a 10-minute wait. Used by `mcpctl chat` / `mcpctl chat-llm`.
- **Async** (new — this doc): `POST /api/v1/inference-tasks`. The caller gets a task id immediately and polls / streams / cancels separately. Useful for fire-and-forget scripts, queued work for offline pools, and operator visibility into in-flight jobs.

## Lifecycle

```
 pending ──on dispatch──► claimed ──first chunk──► running
    ▲                        │                        │
    └────────────────────────┴────────────────────────┘
               on revert (worker disconnect)

 pending / claimed / running ──DELETE──► cancelled
 claimed / running ──terminal result──► completed | error
```

| Status      | Meaning |
|-------------|---------|
| `pending`   | In queue; no worker has it yet (or the claim was reverted) |
| `claimed`   | A worker has it; the dispatch frame went down the SSE channel |
| `running`   | First chunk came back; streaming is in flight |
| `completed` | Worker POSTed the terminal result |
| `error`     | Permanent failure (auth, bad request, queue timeout) |
| `cancelled` | Caller cancelled via `DELETE` |

A worker disconnecting mid-task reverts `claimed`/`running` rows back to `pending` (the row's `claimedBy` is cleared); a sibling worker on the same pool drains it. Tasks survive an mcpd restart — the row is the source of truth, so a fresh mcpd reads the queue and either matches a worker or waits.

## Async API

```
POST   /api/v1/inference-tasks
    → 201 + { id, status, poolName, llmName, streaming, createdAt }
GET    /api/v1/inference-tasks?...
    → list (owner-scoped by default; ?status, ?poolName, ?agentId, ?limit)
GET    /api/v1/inference-tasks/<id>
    → poll one row
GET    /api/v1/inference-tasks/<id>/stream
    → SSE: `event: chunk` + `event: terminal`
DELETE /api/v1/inference-tasks/<id>
    → cancel pending/claimed/running; no-op on terminal
```

Foreign-owner ids return **404** (not 403) on every endpoint — this keeps id enumeration from leaking task existence across users.

### Enqueue

```sh
curl -X POST https://mcpd/api/v1/inference-tasks \
  -H 'Authorization: Bearer ...' \
  -H 'Content-Type: application/json' \
  -d '{
        "llmName": "vllm-local-qwen3",
        "request": {
          "messages": [{"role":"user","content":"summarize the changelog"}],
          "max_tokens": 500
        },
        "streaming": false
      }'

# 201 Created
# {
#   "id": "cmoip...",
#   "status": "pending",
#   "poolName": "vllm-local-qwen3",
#   "llmName": "vllm-local-qwen3",
#   "streaming": false,
#   "createdAt": "2026-04-28T..."
# }
```

Rejects with **400** for public Llms — the existing `POST /api/v1/llms/<name>/infer` is the right tool there. Rejects with **404** for missing Llm names.

### Poll

```sh
curl https://mcpd/api/v1/inference-tasks/cmoip... \
  -H 'Authorization: Bearer ...'

# 200
# {
#   "id": "cmoip...",
#   "status": "completed",
#   "responseBody": { "choices": [{"message":{"role":"assistant","content":"..."}}] },
#   "claimedBy": "",
#   "createdAt": "...", "claimedAt": "...", "completedAt": "..."
# }
```

### Stream live updates

```sh
curl -N https://mcpd/api/v1/inference-tasks/cmoip.../stream \
  -H 'Authorization: Bearer ...'

# event: chunk
# data: {"data":"hello "}
#
# event: chunk
# data: {"data":"world","done":true}
#
# event: terminal
# data: { ...full row... }
```

Already-terminal tasks emit one `terminal` event and close. A caller disconnect cleans up server-side.

### Cancel

```sh
curl -X DELETE https://mcpd/api/v1/inference-tasks/cmoip... \
  -H 'Authorization: Bearer ...'
```

Returns 200 plus the cancelled row. Cancelling a `completed` / `error` / already-`cancelled` task is a no-op (returns the current row).
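For script consumers, the enqueue / poll endpoints above compose into a small wait helper. A minimal TypeScript sketch — the `baseUrl` and `token` parameters and the `waitForTask` name are illustrative, not part of the API; the terminal statuses follow the lifecycle table:

```typescript
// Statuses after which a task row will no longer change.
const TERMINAL = new Set(["completed", "error", "cancelled"]);

export function isTerminal(status: string): boolean {
  return TERMINAL.has(status);
}

// Poll GET /api/v1/inference-tasks/<id> until the row is terminal.
// baseUrl / token / intervalMs are assumptions; adjust to your deployment.
export async function waitForTask(
  baseUrl: string,
  token: string,
  id: string,
  intervalMs = 2000,
): Promise<unknown> {
  for (;;) {
    const res = await fetch(`${baseUrl}/api/v1/inference-tasks/${id}`, {
      headers: { Authorization: `Bearer ${token}` },
    });
    // Remember: a foreign-owner id surfaces here as a plain 404.
    if (!res.ok) throw new Error(`poll failed: ${res.status}`);
    const task = (await res.json()) as { status: string };
    if (isTerminal(task.status)) return task;
    await new Promise((r) => setTimeout(r, intervalMs));
  }
}
```

A `waitForTask` caller sees `cancelled` as a normal terminal row, matching the DELETE semantics above.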
The worker that picked the task up may still execute the underlying inference call (there is no remote-cancel protocol yet); the result is simply discarded server-side.

## CLI

```sh
mcpctl get tasks              # table view: ID, STATUS, POOL, LLM, MODEL, STREAM, AGE, WORKER
mcpctl get task <id>          # describe one
mcpctl get task <id> -o yaml  # full row, yaml format

# Enqueue async via chat-llm:
mcpctl chat-llm vllm-local-qwen3 --async -m "summarize the changelog"
# stdout: cmoip...   (just the id, pipeable)
# stderr: (task cmoip... enqueued for pool 'vllm-local-qwen3', status=pending)

# Wire to other commands:
ID=$(mcpctl chat-llm vllm-local-qwen3 --async -m "long task")
sleep 30
mcpctl get task $ID
```

## RBAC

A new `tasks` resource. Default action mapping:

| Method   | Action   |
|----------|----------|
| `GET`    | `view`   |
| `POST`   | `create` |
| `DELETE` | `delete` |

Aliases: `task`, `inference-task`, and `inference-tasks` all normalize to `tasks`, so `mcpctl create rbac-binding --resource task --role view ...` works regardless of which form the operator types.

Owner scoping is enforced at the route layer ON TOP of the RBAC hook:

- A caller with only the implicit role sees and can cancel their own tasks.
- Resource-wide `view:tasks` is required to see / list / cancel another user's tasks.

## GC

mcpd runs a 5-minute ticker that:

- Flips **pending** tasks older than the queue timeout (default 1 h) to `error` with a clear message ("Task expired in pending state after ...s"). Any caller still polling sees a clean failure.
- Deletes **terminal** tasks (`completed`/`error`/`cancelled`) past the retention window (default 7 d).

An index backs the `WHERE status IN (...) AND completedAt < cutoff` predicate, so the sweep stays cheap even with millions of rows. Both timeouts are constants in `main.ts`; making them per-deployment configurable is a v6 concern.

## Limitations / Out-of-scope for v5

- **Streaming chunks are not persisted.** The DB stores only the final assembled body (or null for streaming-only consumers).
  A caller who connects to `/stream` AFTER the task completed gets one `terminal` event with no chunk replay. Persisting deltas would mean a per-chunk INSERT — too expensive at SSE granularity.
- **Public-Llm queueing** is not exposed. Public Llms are HTTP adapters with no SSE worker, so the queue concept doesn't apply. Direct `/llms/<name>/infer` against a public Llm stays the right call.
- **Multi-instance mcpd**: the in-process `EventEmitter` that wakes blocked HTTP handlers fires only on the mcpd instance that holds the row's blocked handler. With multiple mcpd replicas behind a load balancer, the wakeup would not cross instances. The v6 swap is pg `LISTEN/NOTIFY`, with no schema change.
- **Worker capacity** is one task per session at a time (today). A per-session capacity hint plus scheduler-level concurrency control is a v6+ concern.
- **Remote cancel protocol**: cancelling a `running` task marks the row as cancelled, but the worker keeps executing; its result POST is discarded server-side. Real cancellation (an SSE message that tells the worker to abort its in-flight call) is a v6 concern.

## See also

- [virtual-llms.md](./virtual-llms.md) — the Llm-side primitives (kind, lifecycle, pools) that the queue routes around.
- [agents.md](./agents.md) — how the chat path uses the same queue underneath.
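The GC ticker described earlier reduces to one decision per row each sweep. A hedged TypeScript sketch — the field names follow the poll response shape, while `gcDecision` and the timeout constants are illustrative stand-ins for the real ones in `main.ts`:

```typescript
type TaskStatus =
  | "pending" | "claimed" | "running"
  | "completed" | "error" | "cancelled";

interface TaskRow {
  status: TaskStatus;
  createdAt: number;    // epoch ms
  completedAt?: number; // epoch ms, set once the row is terminal
}

// Illustrative defaults matching the documented 1 h / 7 d windows.
const QUEUE_TIMEOUT_MS = 60 * 60 * 1000;
const RETENTION_MS = 7 * 24 * 60 * 60 * 1000;

type GcAction = "none" | "expire" | "delete";

// One sweep decision per row, mirroring the 5-minute ticker:
// pending past the queue timeout -> flip to error ("expire");
// terminal past the retention window -> delete.
export function gcDecision(row: TaskRow, now: number): GcAction {
  if (row.status === "pending" && now - row.createdAt > QUEUE_TIMEOUT_MS) {
    return "expire";
  }
  const terminal =
    row.status === "completed" ||
    row.status === "error" ||
    row.status === "cancelled";
  if (terminal && row.completedAt !== undefined && now - row.completedAt > RETENTION_MS) {
    return "delete";
  }
  return "none";
}
```

In the real sweep both branches are single SQL statements over the indexed `status`/`completedAt` columns rather than per-row calls; the function above only makes the per-row rule explicit.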