feat(cli+docs+smoke): inference-task CLI + GC ticker + smoke + docs (v5 Stage 4)
Some checks failed
CI/CD / lint (pull_request) Successful in 55s
CI/CD / test (pull_request) Successful in 1m12s
CI/CD / typecheck (pull_request) Successful in 2m46s
CI/CD / smoke (pull_request) Failing after 1m44s
CI/CD / build (pull_request) Failing after 7m0s
CI/CD / publish (pull_request) Has been skipped
CLI surface for the durable queue:
- `mcpctl get tasks` — table view (ID, STATUS, POOL, LLM, MODEL,
STREAM, AGE, WORKER). Aliases `task`, `tasks`, `inference-task`,
`inference-tasks` all normalize to the canonical plural so URL
construction works uniformly. RESOURCE_ALIASES + completions
generator updated.
- `mcpctl chat-llm <name> --async -m <msg>` — enqueue and exit. stdout
is just the task id (pipeable into `xargs mcpctl get task`); stderr
carries human-readable status. REPL mode is rejected for --async
(fire-and-forget doesn't make sense without -m).
GC ticker in mcpd: 5-min interval. Pending tasks past 1 h queue
timeout flip to error with a clear message; terminal tasks past 7 d
retention get deleted. Both queries are index-backed.
Crash fix uncovered by the smoke: when the async route doesn't await
`ref.done`, a later cancel/error rejected the in-flight Promise as
unhandled and crashed mcpd. The route now attaches a no-op `.catch`
so the legacy `done` semantic still works for sync callers (chat,
direct infer) without taking down the process for async ones.
`EnqueueInferOptions` also gained an explicit `ownerId` field so the
async API can stamp the authenticated user on the row instead of
inheriting 'system' from the constructor's `resolveOwner` — without
this, every GET/DELETE from the original caller would 404 due to
foreign-owner mismatch.
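The fix follows the standard pattern for keeping a shared, rejection-capable promise from triggering `unhandledRejection`. A minimal sketch of the idea (`detachDone` is an illustrative name, not the actual mcpd code):

```typescript
// A task's `done` promise may later reject (cancel/error). Sync callers
// await it and handle the rejection themselves; async callers never await
// it, so a later rejection would otherwise surface as an
// unhandledRejection and crash the process.
function detachDone<T>(done: Promise<T>): void {
  // No-op catch: registers a handler so the rejection counts as handled.
  // Callers that still await `done` directly see the rejection as before.
  done.catch(() => {});
}
```

The key point is that `.catch` only adds a handler; it does not swallow the rejection for other consumers that also await the original promise.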
Smoke (tests/smoke/inference-task.smoke.test.ts):
1. POST /inference-tasks while no worker bound → row=pending.
2. Bring a registrar online → bindSession drain claims and
dispatches → worker complete()s → row=completed → GET returns
the assistant body.
3. Stop worker, enqueue, DELETE → row=cancelled, persisted.
docs/inference-tasks.md (new): full data model, lifecycle diagram,
async API reference, CLI examples, RBAC table, GC defaults, and the
v5 limitations / v6 roadmap. Cross-linked from virtual-llms.md and
agents.md.
Tests + smoke: mcpd 893/893, mcplocal 723/723, cli 437/437, full
smoke 146/146 (was 144, +2 new task smoke). Live mcpd verified via
manual curl: enqueue → cancel → re-fetch — no crash, owner scoping
returns 404 on foreign ids, GC ticker logs at info when it sweeps.
v5 complete: durable queue (Stage 1) + VirtualLlmService rewire
(Stage 2) + async API & RBAC (Stage 3) + CLI/GC/smoke/docs (Stage 4).
docs/inference-tasks.md (new file, 218 lines):
# Inference Tasks (v5)

A **durable inference task queue**. Every inference call lands as a row
in the database; mcpd's previous in-memory request map is gone. mcpd
restart no longer drops in-flight work, worker disconnects no longer
fail callers, and pools that have no live workers can still accept
work that drains when one shows up.

The queue is the same persistence layer underneath two API surfaces:

- **Sync** (existing): `POST /api/v1/llms/<name>/infer` and
  `POST /api/v1/agents/<name>/chat`. Caller's HTTP request stays open
  until the task finishes, exactly like pre-v5. Internally it's now a
  DB row with a 10-min wait. Used by `mcpctl chat` / `mcpctl chat-llm`.
- **Async** (new — this doc): `POST /api/v1/inference-tasks`. Caller
  gets a task id immediately and polls / streams / cancels separately.
  Useful for fire-and-forget scripts, queued work for offline pools,
  and operator visibility into in-flight jobs.
## Lifecycle

```
           on dispatch            first chunk
pending ───────────────► claimed ─────────────► running ──► completed | error
   ▲                        │                      │
   └────────────────────────┴──────────────────────┘
          on revert (worker disconnect)

pending | claimed | running ── DELETE ──► cancelled
```

| Status      | Meaning |
|-------------|---------|
| `pending`   | In queue, no worker has it yet (or claim was reverted) |
| `claimed`   | A worker has it; dispatch frame went down the SSE channel |
| `running`   | First chunk came back; streaming is in flight |
| `completed` | Worker POSTed terminal result |
| `error`     | Permanent failure (auth, bad request, queue timeout) |
| `cancelled` | Caller cancelled via `DELETE` |

A worker disconnecting mid-task reverts `claimed`/`running` rows back
to `pending` (the row's `claimedBy` is cleared); a sibling worker on
the same pool drains it. Tasks survive mcpd restart — the row is the
source of truth, so a fresh mcpd reads the queue and either matches a
worker or waits.
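The legal transitions above can be sketched as a guard function (a minimal illustration; `canTransition` and the transition table are assumptions for clarity, not mcpd's actual code):

```typescript
type TaskStatus =
  | "pending" | "claimed" | "running"
  | "completed" | "error" | "cancelled";

// Hypothetical transition table mirroring the lifecycle diagram:
// dispatch, first chunk, revert, terminal results, DELETE, GC timeout.
const NEXT: Record<TaskStatus, TaskStatus[]> = {
  pending:   ["claimed", "error", "cancelled"],   // dispatch | queue timeout | DELETE
  claimed:   ["running", "pending", "cancelled"], // first chunk | revert | DELETE
  running:   ["completed", "error", "pending", "cancelled"],
  completed: [],                                  // terminal: no way out
  error:     [],
  cancelled: [],
};

function canTransition(from: TaskStatus, to: TaskStatus): boolean {
  return NEXT[from].includes(to);
}
```

Terminal statuses have no outgoing edges, which is why `DELETE` on a terminal row is a no-op rather than a transition.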
## Async API

```
POST   /api/v1/inference-tasks             → 201 + { id, status, poolName, llmName, streaming, createdAt }
GET    /api/v1/inference-tasks?...         → list (owner-scoped by default; ?status, ?poolName, ?agentId, ?limit)
GET    /api/v1/inference-tasks/<id>        → poll one row
GET    /api/v1/inference-tasks/<id>/stream → SSE: `event: chunk` + `event: terminal`
DELETE /api/v1/inference-tasks/<id>        → cancel pending/claimed/running; no-op on terminal
```

Foreign-owner ids return **404** (not 403) on every endpoint — keeps
id-enumeration from leaking task existence across users.
### Enqueue

```sh
curl -X POST https://mcpd/api/v1/inference-tasks \
  -H 'Authorization: Bearer ...' \
  -H 'Content-Type: application/json' \
  -d '{
    "llmName": "vllm-local-qwen3",
    "request": {
      "messages": [{"role":"user","content":"summarize the changelog"}],
      "max_tokens": 500
    },
    "streaming": false
  }'

# 201 Created
# {
#   "id": "cmoip...",
#   "status": "pending",
#   "poolName": "vllm-local-qwen3",
#   "llmName": "vllm-local-qwen3",
#   "streaming": false,
#   "createdAt": "2026-04-28T..."
# }
```

Rejects with **400** for public Llms — the existing
`POST /api/v1/llms/<name>/infer` is the right tool there. Rejects
with **404** for missing Llm names.
### Poll

```sh
curl https://mcpd/api/v1/inference-tasks/cmoip... \
  -H 'Authorization: Bearer ...'

# 200
# {
#   "id": "cmoip...",
#   "status": "completed",
#   "responseBody": { "choices": [{"message":{"role":"assistant","content":"..."}}] },
#   "claimedBy": "<providerSessionId>",
#   "createdAt": "...", "claimedAt": "...", "completedAt": "..."
# }
```
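A script that enqueues asynchronously typically polls until the row goes terminal. A minimal sketch of such a client-side loop (the `getTask` fetcher and the back-off values are illustrative assumptions, not part of mcpd):

```typescript
type Task = { id: string; status: string };

const TERMINAL = new Set(["completed", "error", "cancelled"]);

// Poll the injected `getTask` fetcher until the task reaches a terminal
// status, or give up after `maxAttempts` polls.
async function waitForTask(
  getTask: (id: string) => Promise<Task>,
  id: string,
  { intervalMs = 2000, maxAttempts = 150 } = {},
): Promise<Task> {
  for (let i = 0; i < maxAttempts; i++) {
    const task = await getTask(id);
    if (TERMINAL.has(task.status)) return task;
    await new Promise((r) => setTimeout(r, intervalMs));
  }
  throw new Error(`task ${id} not terminal after ${maxAttempts} polls`);
}
```

Injecting the fetcher keeps the loop testable; in practice it would wrap a `GET /api/v1/inference-tasks/<id>` call.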
### Stream live updates

```sh
curl -N https://mcpd/api/v1/inference-tasks/cmoip.../stream \
  -H 'Authorization: Bearer ...'

# event: chunk
# data: {"data":"hello "}
#
# event: chunk
# data: {"data":"world","done":true}
#
# event: terminal
# data: { ...full row... }
```

Already-terminal tasks emit one `terminal` event and close. Caller
disconnect cleans up server-side.
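Consuming the stream means splitting the SSE wire format into `event`/`data` pairs. A minimal parser sketch (assumes well-formed frames separated by blank lines; this is client-side glue, not mcpd code):

```typescript
type SseEvent = { event: string; data: string };

// Split raw SSE text into events. Frames are separated by blank lines;
// each frame carries `event:` and one or more `data:` fields.
function parseSse(raw: string): SseEvent[] {
  return raw
    .split(/\n\n+/)
    .map((frame) => frame.trim())
    .filter(Boolean)
    .map((frame) => {
      let event = "message"; // SSE default event name
      const data: string[] = [];
      for (const line of frame.split("\n")) {
        if (line.startsWith("event:")) event = line.slice(6).trim();
        else if (line.startsWith("data:")) data.push(line.slice(5).trim());
      }
      return { event, data: data.join("\n") };
    });
}
```

A consumer would append chunks on `event: chunk` and stop on `event: terminal`, whose `data` carries the full row.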
### Cancel

```sh
curl -X DELETE https://mcpd/api/v1/inference-tasks/cmoip... \
  -H 'Authorization: Bearer ...'
```

Returns 200 + the cancelled row. Cancelling a `completed` /
`error` / already-`cancelled` task is a no-op (returns the current
row). The worker that picked the task up may still execute the
underlying inference call (no remote-cancel protocol yet); the
result is just discarded server-side.
## CLI

```sh
mcpctl get tasks              # table view: ID, STATUS, POOL, LLM, MODEL, STREAM, AGE, WORKER
mcpctl get task <id>          # describe one
mcpctl get task <id> -o yaml  # full row, yaml format

# Enqueue async via chat-llm:
mcpctl chat-llm vllm-local-qwen3 --async -m "summarize the changelog"
# stdout: cmoip...   (just the id, pipeable)
# stderr: (task cmoip... enqueued for pool 'vllm-local-qwen3', status=pending)

# Wire to other commands:
ID=$(mcpctl chat-llm vllm-local-qwen3 --async -m "long task")
sleep 30
mcpctl get task $ID
```
## RBAC

A new `tasks` resource. Default action mapping:

| Method   | Action   |
|----------|----------|
| `GET`    | `view`   |
| `POST`   | `create` |
| `DELETE` | `delete` |

Aliases: `task`, `inference-task`, `inference-tasks` all normalize to
`tasks` so `mcpctl create rbac-binding --resource task --role view ...`
works regardless of which form the operator types.

Owner scoping is enforced at the route layer on top of the RBAC hook:

- A caller with only the implicit role sees and can cancel their own tasks.
- Resource-wide `view:tasks` is required to see / list / cancel
  another user's tasks.
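The route-layer scoping reduces to one decision per request. A simplified sketch (the row shape and `hasResourceWideView` flag are assumptions for illustration, not mcpd's actual code):

```typescript
type TaskRow = { id: string; ownerId: string };

// Owner scoping on top of the RBAC hook. Foreign-owner rows come back as
// 404 rather than 403, so id enumeration can't probe task existence.
function resolveTaskAccess(
  row: TaskRow | undefined,
  callerId: string,
  hasResourceWideView: boolean,
): { status: 200 | 404; row?: TaskRow } {
  if (!row) return { status: 404 };               // genuinely missing
  if (row.ownerId !== callerId && !hasResourceWideView) {
    return { status: 404 };                       // hide foreign tasks entirely
  }
  return { status: 200, row };
}
```

Collapsing "missing" and "foreign" into the same 404 is the whole point: the two cases are indistinguishable to the caller.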
## GC

mcpd runs a 5-min ticker that:

- Flips **pending** tasks older than the queue timeout (default 1 h)
  to `error` with a clear message ("Task expired in pending state
  after ...s"). Any caller still polling sees a clean failure.
- Deletes **terminal** tasks (`completed`/`error`/`cancelled`) past
  the retention window (default 7 d). Indexes back the
  `WHERE status IN (...) AND completedAt < cutoff` predicate so the
  sweep is cheap even with millions of rows.

Both timeouts are constants in `main.ts`; making them per-deployment
configurable is a v6 concern.
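The two sweeps reduce to a pair of predicates over the task table. An in-memory sketch of the selection logic (the row shape here is illustrative; the real sweep runs the indexed SQL described above):

```typescript
type Row = { id: string; status: string; createdAt: number; completedAt?: number };

const QUEUE_TIMEOUT_MS = 60 * 60 * 1000;       // default 1 h
const RETENTION_MS = 7 * 24 * 60 * 60 * 1000;  // default 7 d
const TERMINAL = new Set(["completed", "error", "cancelled"]);

// Sweep 1: pending rows past the queue timeout get flipped to `error`.
function expiredPending(rows: Row[], now: number): Row[] {
  return rows.filter(
    (r) => r.status === "pending" && now - r.createdAt > QUEUE_TIMEOUT_MS,
  );
}

// Sweep 2: terminal rows past the retention window get deleted.
function retiredTerminal(rows: Row[], now: number): Row[] {
  return rows.filter(
    (r) =>
      TERMINAL.has(r.status) &&
      r.completedAt !== undefined &&
      now - r.completedAt > RETENTION_MS,
  );
}
```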
## Limitations / Out-of-scope for v5

- **Streaming chunks are not persisted.** The DB stores only the
  final assembled body (or null for streaming-only consumers). A
  caller who connects to `/stream` AFTER the task completed gets one
  `terminal` event with no chunk replay. Persisting deltas would be
  a per-chunk INSERT — too expensive at SSE granularity.
- **Public-Llm queueing** is not exposed. Public Llms are HTTP
  adapters with no SSE worker, so the queue concept doesn't apply.
  Direct `/llms/<name>/infer` against a public Llm stays the right
  call.
- **Multi-instance mcpd**: the in-process `EventEmitter` that wakes
  blocked HTTP handlers fires only on the mcpd that holds the row's
  blocked handler. With multiple mcpd replicas behind a load balancer
  the wakeup wouldn't cross instances. v6 swap: pg `LISTEN/NOTIFY`,
  no schema change.
- **Worker capacity** is one task per session at a time (today). A
  per-session capacity hint + scheduler-level concurrency control
  is a v6+ concern.
- **Remote cancel protocol**: cancelling a `running` task marks the
  row as cancelled but the worker keeps executing. The result POST
  is discarded server-side. Real cancellation (an SSE message that
  tells the worker to abort its in-flight call) is a v6 concern.
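The multi-instance limitation comes from the wakeup mechanism itself. A minimal sketch of the in-process pattern (names are illustrative): a blocked sync handler subscribes on an `EventEmitter` keyed by task id, and the completion path emits on that same process-local emitter, which is exactly why a second replica never hears it.

```typescript
import { EventEmitter } from "node:events";

// Process-local wakeup bus: only handlers in THIS mcpd process hear it.
const taskEvents = new EventEmitter();

// A blocked sync handler waits until its task id is signalled (or times out).
function waitForTerminal(taskId: string, timeoutMs: number): Promise<string> {
  return new Promise((resolve, reject) => {
    const onDone = (status: string) => {
      clearTimeout(timer);
      resolve(status);
    };
    const timer = setTimeout(() => {
      taskEvents.removeListener(taskId, onDone);
      reject(new Error("queue timeout"));
    }, timeoutMs);
    taskEvents.once(taskId, onDone);
  });
}

// The completion path signals the waiter — in-process only. A pg
// LISTEN/NOTIFY replacement would emit through the database instead.
function signalTerminal(taskId: string, status: string): void {
  taskEvents.emit(taskId, status);
}
```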
## See also

- [virtual-llms.md](./virtual-llms.md) — the Llm-side primitives
  (kind, lifecycle, pools) that the queue routes around.
- [agents.md](./agents.md) — how the chat path uses the same queue
  underneath.