feat(cli+docs+smoke): inference-task CLI + GC ticker + smoke + docs (v5 Stage 4)
Some checks failed
CI/CD / lint (pull_request) Successful in 55s
CI/CD / test (pull_request) Successful in 1m12s
CI/CD / typecheck (pull_request) Successful in 2m46s
CI/CD / smoke (pull_request) Failing after 1m44s
CI/CD / build (pull_request) Failing after 7m0s
CI/CD / publish (pull_request) Has been skipped
CLI surface for the durable queue:
- `mcpctl get tasks` — table view (ID, STATUS, POOL, LLM, MODEL,
STREAM, AGE, WORKER). Aliases `task`, `tasks`, `inference-task`,
`inference-tasks` all normalize to the canonical plural so URL
construction works uniformly. RESOURCE_ALIASES + completions
generator updated.
- `mcpctl chat-llm <name> --async -m <msg>` — enqueue and exit. stdout
is just the task id (pipeable into `xargs mcpctl get task`); stderr
carries human-readable status. REPL mode is rejected for --async
(fire-and-forget doesn't make sense without -m).
GC ticker in mcpd: 5-min interval. Pending tasks past 1 h queue
timeout flip to error with a clear message; terminal tasks past 7 d
retention get deleted. Both queries are index-backed.
Crash fix uncovered by the smoke: when the async route doesn't await
`ref.done`, a later cancel/error rejected the in-flight Promise as
unhandled and crashed mcpd. The route now attaches a no-op `.catch`
so the legacy `done` semantic still works for sync callers (chat,
direct infer) without taking down the process for async ones.
`EnqueueInferOptions` also gained an explicit `ownerId` field so the
async API can stamp the authenticated user on the row instead of
inheriting 'system' from the constructor's `resolveOwner` — without
this, every GET/DELETE from the original caller would 404 due to
foreign-owner mismatch.
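The fix follows the standard pattern for keeping a shared, rejection-capable promise from triggering `unhandledRejection`. A minimal sketch of the idea (`detachDone` is an illustrative name, not the actual mcpd code):

```typescript
// A task's `done` promise may later reject (cancel/error). Sync callers
// await it and handle the rejection themselves; async callers never await
// it, so a later rejection would otherwise surface as an
// unhandledRejection and crash the process.
function detachDone<T>(done: Promise<T>): void {
  // No-op catch: registers a handler so the rejection counts as handled.
  // Callers that still await `done` directly see the rejection as before.
  done.catch(() => {});
}
```

The key point is that `.catch` only adds a handler; it does not swallow the rejection for other consumers that also await the original promise.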
Smoke (tests/smoke/inference-task.smoke.test.ts):
1. POST /inference-tasks while no worker bound → row=pending.
2. Bring a registrar online → bindSession drain claims and
dispatches → worker complete()s → row=completed → GET returns
the assistant body.
3. Stop worker, enqueue, DELETE → row=cancelled, persisted.
docs/inference-tasks.md (new): full data model, lifecycle diagram,
async API reference, CLI examples, RBAC table, GC defaults, and the
v5 limitations / v6 roadmap. Cross-linked from virtual-llms.md and
agents.md.
Tests + smoke: mcpd 893/893, mcplocal 723/723, cli 437/437, full
smoke 146/146 (was 144, +2 new task smoke). Live mcpd verified via
manual curl: enqueue → cancel → re-fetch — no crash, owner scoping
returns 404 on foreign ids, GC ticker logs at info when it sweeps.
v5 complete: durable queue (Stage 1) + VirtualLlmService rewire
(Stage 2) + async API & RBAC (Stage 3) + CLI/GC/smoke/docs (Stage 4).
docs/inference-tasks.md (new file, 218 lines):
# Inference Tasks (v5)

A **durable inference task queue**. Every inference call lands as a row
in the database; mcpd's previous in-memory request map is gone. mcpd
restart no longer drops in-flight work, worker disconnects no longer
fail callers, and pools that have no live workers can still accept
work that drains when one shows up.

The queue is the same persistence layer underneath two API surfaces:

- **Sync** (existing): `POST /api/v1/llms/<name>/infer` and
  `POST /api/v1/agents/<name>/chat`. Caller's HTTP request stays open
  until the task finishes, exactly like pre-v5. Internally it's now a
  DB row with a 10-min wait. Used by `mcpctl chat` / `mcpctl chat-llm`.
- **Async** (new — this doc): `POST /api/v1/inference-tasks`. Caller
  gets a task id immediately and polls / streams / cancels separately.
  Useful for fire-and-forget scripts, queued work for offline pools,
  and operator visibility into in-flight jobs.
## Lifecycle

```
           on dispatch            first chunk
pending ───────────────► claimed ─────────────► running ──► completed | error
   ▲                        │                      │
   └────────────────────────┴──────────────────────┘
          on revert (worker disconnect)

pending | claimed | running ── DELETE ──► cancelled
```

| Status      | Meaning |
|-------------|---------|
| `pending`   | In queue, no worker has it yet (or claim was reverted) |
| `claimed`   | A worker has it; dispatch frame went down the SSE channel |
| `running`   | First chunk came back; streaming is in flight |
| `completed` | Worker POSTed terminal result |
| `error`     | Permanent failure (auth, bad request, queue timeout) |
| `cancelled` | Caller cancelled via `DELETE` |

A worker disconnecting mid-task reverts `claimed`/`running` rows back
to `pending` (the row's `claimedBy` is cleared); a sibling worker on
the same pool drains it. Tasks survive mcpd restart — the row is the
source of truth, so a fresh mcpd reads the queue and either matches a
worker or waits.
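The legal transitions above can be sketched as a guard function (a minimal illustration; `canTransition` and the transition table are assumptions for clarity, not mcpd's actual code):

```typescript
type TaskStatus =
  | "pending" | "claimed" | "running"
  | "completed" | "error" | "cancelled";

// Hypothetical transition table mirroring the lifecycle diagram:
// dispatch, first chunk, revert, terminal results, DELETE, GC timeout.
const NEXT: Record<TaskStatus, TaskStatus[]> = {
  pending:   ["claimed", "error", "cancelled"],   // dispatch | queue timeout | DELETE
  claimed:   ["running", "pending", "cancelled"], // first chunk | revert | DELETE
  running:   ["completed", "error", "pending", "cancelled"],
  completed: [],                                  // terminal: no way out
  error:     [],
  cancelled: [],
};

function canTransition(from: TaskStatus, to: TaskStatus): boolean {
  return NEXT[from].includes(to);
}
```

Terminal statuses have no outgoing edges, which is why `DELETE` on a terminal row is a no-op rather than a transition.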
## Async API

```
POST   /api/v1/inference-tasks             → 201 + { id, status, poolName, llmName, streaming, createdAt }
GET    /api/v1/inference-tasks?...         → list (owner-scoped by default; ?status, ?poolName, ?agentId, ?limit)
GET    /api/v1/inference-tasks/<id>        → poll one row
GET    /api/v1/inference-tasks/<id>/stream → SSE: `event: chunk` + `event: terminal`
DELETE /api/v1/inference-tasks/<id>        → cancel pending/claimed/running; no-op on terminal
```

Foreign-owner ids return **404** (not 403) on every endpoint — keeps
id-enumeration from leaking task existence across users.
### Enqueue

```sh
curl -X POST https://mcpd/api/v1/inference-tasks \
  -H 'Authorization: Bearer ...' \
  -H 'Content-Type: application/json' \
  -d '{
    "llmName": "vllm-local-qwen3",
    "request": {
      "messages": [{"role":"user","content":"summarize the changelog"}],
      "max_tokens": 500
    },
    "streaming": false
  }'

# 201 Created
# {
#   "id": "cmoip...",
#   "status": "pending",
#   "poolName": "vllm-local-qwen3",
#   "llmName": "vllm-local-qwen3",
#   "streaming": false,
#   "createdAt": "2026-04-28T..."
# }
```

Rejects with **400** for public Llms — the existing
`POST /api/v1/llms/<name>/infer` is the right tool there. Rejects
with **404** for missing Llm names.
### Poll

```sh
curl https://mcpd/api/v1/inference-tasks/cmoip... \
  -H 'Authorization: Bearer ...'

# 200
# {
#   "id": "cmoip...",
#   "status": "completed",
#   "responseBody": { "choices": [{"message":{"role":"assistant","content":"..."}}] },
#   "claimedBy": "<providerSessionId>",
#   "createdAt": "...", "claimedAt": "...", "completedAt": "..."
# }
```
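A script that enqueues asynchronously typically polls until the row goes terminal. A minimal sketch of such a client-side loop (the `getTask` fetcher and the back-off values are illustrative assumptions, not part of mcpd):

```typescript
type Task = { id: string; status: string };

const TERMINAL = new Set(["completed", "error", "cancelled"]);

// Poll the injected `getTask` fetcher until the task reaches a terminal
// status, or give up after `maxAttempts` polls.
async function waitForTask(
  getTask: (id: string) => Promise<Task>,
  id: string,
  { intervalMs = 2000, maxAttempts = 150 } = {},
): Promise<Task> {
  for (let i = 0; i < maxAttempts; i++) {
    const task = await getTask(id);
    if (TERMINAL.has(task.status)) return task;
    await new Promise((r) => setTimeout(r, intervalMs));
  }
  throw new Error(`task ${id} not terminal after ${maxAttempts} polls`);
}
```

Injecting the fetcher keeps the loop testable; in practice it would wrap a `GET /api/v1/inference-tasks/<id>` call.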
### Stream live updates

```sh
curl -N https://mcpd/api/v1/inference-tasks/cmoip.../stream \
  -H 'Authorization: Bearer ...'

# event: chunk
# data: {"data":"hello "}
#
# event: chunk
# data: {"data":"world","done":true}
#
# event: terminal
# data: { ...full row... }
```

Already-terminal tasks emit one `terminal` event and close. Caller
disconnect cleans up server-side.
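Consuming the stream means splitting the SSE wire format into `event`/`data` pairs. A minimal parser sketch (assumes well-formed frames separated by blank lines; this is client-side glue, not mcpd code):

```typescript
type SseEvent = { event: string; data: string };

// Split raw SSE text into events. Frames are separated by blank lines;
// each frame carries `event:` and one or more `data:` fields.
function parseSse(raw: string): SseEvent[] {
  return raw
    .split(/\n\n+/)
    .map((frame) => frame.trim())
    .filter(Boolean)
    .map((frame) => {
      let event = "message"; // SSE default event name
      const data: string[] = [];
      for (const line of frame.split("\n")) {
        if (line.startsWith("event:")) event = line.slice(6).trim();
        else if (line.startsWith("data:")) data.push(line.slice(5).trim());
      }
      return { event, data: data.join("\n") };
    });
}
```

A consumer would append chunks on `event: chunk` and stop on `event: terminal`, whose `data` carries the full row.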
### Cancel

```sh
curl -X DELETE https://mcpd/api/v1/inference-tasks/cmoip... \
  -H 'Authorization: Bearer ...'
```

Returns 200 + the cancelled row. Cancelling a `completed` /
`error` / already-`cancelled` task is a no-op (returns the current
row). The worker that picked the task up may still execute the
underlying inference call (no remote-cancel protocol yet); the
result is just discarded server-side.
## CLI

```sh
mcpctl get tasks              # table view: ID, STATUS, POOL, LLM, MODEL, STREAM, AGE, WORKER
mcpctl get task <id>          # describe one
mcpctl get task <id> -o yaml  # full row, yaml format

# Enqueue async via chat-llm:
mcpctl chat-llm vllm-local-qwen3 --async -m "summarize the changelog"
# stdout: cmoip...   (just the id, pipeable)
# stderr: (task cmoip... enqueued for pool 'vllm-local-qwen3', status=pending)

# Wire to other commands:
ID=$(mcpctl chat-llm vllm-local-qwen3 --async -m "long task")
sleep 30
mcpctl get task $ID
```
## RBAC

A new `tasks` resource. Default action mapping:

| Method   | Action   |
|----------|----------|
| `GET`    | `view`   |
| `POST`   | `create` |
| `DELETE` | `delete` |

Aliases: `task`, `inference-task`, `inference-tasks` all normalize to
`tasks` so `mcpctl create rbac-binding --resource task --role view ...`
works regardless of which form the operator types.

Owner scoping is enforced at the route layer on top of the RBAC hook:

- A caller with only the implicit role sees and can cancel their own tasks.
- Resource-wide `view:tasks` is required to see / list / cancel
  another user's tasks.
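The route-layer scoping reduces to one decision per request. A simplified sketch (the row shape and `hasResourceWideView` flag are assumptions for illustration, not mcpd's actual code):

```typescript
type TaskRow = { id: string; ownerId: string };

// Owner scoping on top of the RBAC hook. Foreign-owner rows come back as
// 404 rather than 403, so id enumeration can't probe task existence.
function resolveTaskAccess(
  row: TaskRow | undefined,
  callerId: string,
  hasResourceWideView: boolean,
): { status: 200 | 404; row?: TaskRow } {
  if (!row) return { status: 404 };               // genuinely missing
  if (row.ownerId !== callerId && !hasResourceWideView) {
    return { status: 404 };                       // hide foreign tasks entirely
  }
  return { status: 200, row };
}
```

Collapsing "missing" and "foreign" into the same 404 is the whole point: the two cases are indistinguishable to the caller.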
## GC

mcpd runs a 5-min ticker that:

- Flips **pending** tasks older than the queue timeout (default 1 h)
  to `error` with a clear message ("Task expired in pending state
  after ...s"). Any caller still polling sees a clean failure.
- Deletes **terminal** tasks (`completed`/`error`/`cancelled`) past
  the retention window (default 7 d). Indexes back the
  `WHERE status IN (...) AND completedAt < cutoff` predicate so the
  sweep is cheap even with millions of rows.

Both timeouts are constants in `main.ts`; making them per-deployment
configurable is a v6 concern.
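The two sweeps reduce to a pair of predicates over the task table. An in-memory sketch of the selection logic (the row shape here is illustrative; the real sweep runs the indexed SQL described above):

```typescript
type Row = { id: string; status: string; createdAt: number; completedAt?: number };

const QUEUE_TIMEOUT_MS = 60 * 60 * 1000;       // default 1 h
const RETENTION_MS = 7 * 24 * 60 * 60 * 1000;  // default 7 d
const TERMINAL = new Set(["completed", "error", "cancelled"]);

// Sweep 1: pending rows past the queue timeout get flipped to `error`.
function expiredPending(rows: Row[], now: number): Row[] {
  return rows.filter(
    (r) => r.status === "pending" && now - r.createdAt > QUEUE_TIMEOUT_MS,
  );
}

// Sweep 2: terminal rows past the retention window get deleted.
function retiredTerminal(rows: Row[], now: number): Row[] {
  return rows.filter(
    (r) =>
      TERMINAL.has(r.status) &&
      r.completedAt !== undefined &&
      now - r.completedAt > RETENTION_MS,
  );
}
```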
## Limitations / Out-of-scope for v5

- **Streaming chunks are not persisted.** The DB stores only the
  final assembled body (or null for streaming-only consumers). A
  caller who connects to `/stream` AFTER the task completed gets one
  `terminal` event with no chunk replay. Persisting deltas would be
  a per-chunk INSERT — too expensive at SSE granularity.
- **Public-Llm queueing** is not exposed. Public Llms are HTTP
  adapters with no SSE worker, so the queue concept doesn't apply.
  Direct `/llms/<name>/infer` against a public Llm stays the right
  call.
- **Multi-instance mcpd**: the in-process `EventEmitter` that wakes
  blocked HTTP handlers fires only on the mcpd that holds the row's
  blocked handler. With multiple mcpd replicas behind a load balancer
  the wakeup wouldn't cross instances. v6 swap: pg `LISTEN/NOTIFY`,
  no schema change.
- **Worker capacity** is one task per session at a time (today). A
  per-session capacity hint + scheduler-level concurrency control
  is a v6+ concern.
- **Remote cancel protocol**: cancelling a `running` task marks the
  row as cancelled but the worker keeps executing. The result POST
  is discarded server-side. Real cancellation (an SSE message that
  tells the worker to abort its in-flight call) is a v6 concern.
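The multi-instance limitation comes from the wakeup mechanism itself. A minimal sketch of the in-process pattern (names are illustrative): a blocked sync handler subscribes on an `EventEmitter` keyed by task id, and the completion path emits on that same process-local emitter, which is exactly why a second replica never hears it.

```typescript
import { EventEmitter } from "node:events";

// Process-local wakeup bus: only handlers in THIS mcpd process hear it.
const taskEvents = new EventEmitter();

// A blocked sync handler waits until its task id is signalled (or times out).
function waitForTerminal(taskId: string, timeoutMs: number): Promise<string> {
  return new Promise((resolve, reject) => {
    const onDone = (status: string) => {
      clearTimeout(timer);
      resolve(status);
    };
    const timer = setTimeout(() => {
      taskEvents.removeListener(taskId, onDone);
      reject(new Error("queue timeout"));
    }, timeoutMs);
    taskEvents.once(taskId, onDone);
  });
}

// The completion path signals the waiter — in-process only. A pg
// LISTEN/NOTIFY replacement would emit through the database instead.
function signalTerminal(taskId: string, status: string): void {
  taskEvents.emit(taskId, status);
}
```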
## See also

- [virtual-llms.md](./virtual-llms.md) — the Llm-side primitives
  (kind, lifecycle, pools) that the queue routes around.
- [agents.md](./agents.md) — how the chat path uses the same queue
  underneath.