docs/inference-tasks.md

# Inference Tasks (v5)

A **durable inference task queue**. Every inference call lands as a row
in the database; mcpd's previous in-memory request map is gone. mcpd
restart no longer drops in-flight work, worker disconnects no longer
fail callers, and pools that have no live workers can still accept
work that drains when one shows up.

The queue is the same persistence layer underneath two API surfaces:

- **Sync** (existing): `POST /api/v1/llms/<name>/infer` and
  `POST /api/v1/agents/<name>/chat`. Caller's HTTP request stays open
  until the task finishes, exactly like pre-v5. Internally it's now a
  DB row with a 10-min wait. Used by `mcpctl chat` / `mcpctl chat-llm`.
- **Async** (new — this doc): `POST /api/v1/inference-tasks`. Caller
  gets a task id immediately and polls / streams / cancels separately.
  Useful for fire-and-forget scripts, queued work for offline pools,
  and operator visibility into in-flight jobs.

## Lifecycle

```
                ┌──────► claimed ──first chunk──► running ──┐
   pending ──┐  │             │                              │
             ├──┘ on dispatch  │ on revert (worker disconnect)│
             │                  └─────────► pending           │
             ▼                                                ▼
       cancelled ◄───────────── DELETE ──────────────► completed | error
```

| Status      | Meaning |
|-------------|---------|
| `pending`   | In queue, no worker has it yet (or claim was reverted) |
| `claimed`   | A worker has it; dispatch frame went down the SSE channel |
| `running`   | First chunk came back; streaming is in flight |
| `completed` | Worker POSTed terminal result |
| `error`     | Permanent failure (auth, bad request, queue timeout) |
| `cancelled` | Caller cancelled via `DELETE` |

A worker disconnecting mid-task reverts `claimed`/`running` rows back
to `pending` (the row's `claimedBy` is cleared); a sibling worker on
the same pool drains it. Tasks survive mcpd restart — the row is the
source of truth, so a fresh mcpd reads the queue and either matches a
worker or waits.

## Async API

```
POST   /api/v1/inference-tasks                  → 201 + { id, status, poolName, llmName, streaming, createdAt }
GET    /api/v1/inference-tasks?...               → list (owner-scoped by default; ?status, ?poolName, ?agentId, ?limit)
GET    /api/v1/inference-tasks/<id>             → poll one row
GET    /api/v1/inference-tasks/<id>/stream      → SSE: `event: chunk` + `event: terminal`
DELETE /api/v1/inference-tasks/<id>             → cancel pending/claimed/running; no-op on terminal
```

Foreign-owner ids return **404** (not 403) on every endpoint — keeps
id-enumeration from leaking task existence across users.

### Enqueue

```sh
curl -X POST https://mcpd/api/v1/inference-tasks \
  -H 'Authorization: Bearer ...' \
  -H 'Content-Type: application/json' \
  -d '{
    "llmName": "vllm-local-qwen3",
    "request": {
      "messages": [{"role":"user","content":"summarize the changelog"}],
      "max_tokens": 500
    },
    "streaming": false
  }'

# 201 Created
# {
#   "id": "cmoip...",
#   "status": "pending",
#   "poolName": "vllm-local-qwen3",
#   "llmName": "vllm-local-qwen3",
#   "streaming": false,
#   "createdAt": "2026-04-28T..."
# }
```

Rejects with **400** for public Llms — the existing
`POST /api/v1/llms/<name>/infer` is the right tool there. Rejects
with **404** for missing Llm names.

### Poll

```sh
curl https://mcpd/api/v1/inference-tasks/cmoip... \
  -H 'Authorization: Bearer ...'

# 200
# {
#   "id": "cmoip...",
#   "status": "completed",
#   "responseBody": { "choices": [{"message":{"role":"assistant","content":"..."}}] },
#   "claimedBy": "<providerSessionId>",
#   "createdAt": "...", "claimedAt": "...", "completedAt": "..."
# }
```

### Stream live updates

```sh
curl -N https://mcpd/api/v1/inference-tasks/cmoip.../stream \
  -H 'Authorization: Bearer ...'

# event: chunk
# data: {"data":"hello "}
#
# event: chunk
# data: {"data":"world","done":true}
#
# event: terminal
# data: { ...full row... }
```

Already-terminal tasks emit one `terminal` event and close. Caller
disconnect cleans up server-side.

### Cancel

```sh
curl -X DELETE https://mcpd/api/v1/inference-tasks/cmoip... \
  -H 'Authorization: Bearer ...'
```

Returns 200 + the cancelled row. Cancelling a `completed` /
`error` / already-`cancelled` task is a no-op (returns the current
row). The worker that picked the task up may still execute the
underlying inference call (no remote-cancel protocol yet); the
result is just discarded server-side.

## CLI

```sh
mcpctl get tasks                # table view: ID, STATUS, POOL, LLM, MODEL, STREAM, AGE, WORKER
mcpctl get task <id>            # describe one
mcpctl get task <id> -o yaml    # full row, yaml format

# Enqueue async via chat-llm:
mcpctl chat-llm vllm-local-qwen3 --async -m "summarize the changelog"
# stdout: cmoip... (just the id, pipeable)
# stderr: (task cmoip... enqueued for pool 'vllm-local-qwen3', status=pending)

# Wire to other commands:
ID=$(mcpctl chat-llm vllm-local-qwen3 --async -m "long task")
sleep 30
mcpctl get task $ID
```

## RBAC

A new `tasks` resource. Default action mapping:

| Method | Action |
|--------|--------|
| `GET`  | `view` |
| `POST` | `create` |
| `DELETE` | `delete` |

Aliases: `task`, `inference-task`, `inference-tasks` all normalize to
`tasks` so `mcpctl create rbac-binding --resource task --role view ...`
works regardless of which form the operator types.

Owner scoping is enforced at the route layer ON TOP of the RBAC hook:
- A caller with only the implicit role sees and can cancel their own tasks.
- `view:tasks` resource-wide is required to see / list / cancel
  another user's tasks.

## GC

mcpd runs a 5-min ticker that:

- Flips **pending** tasks older than the queue timeout (default 1 h)
  to `error` with a clear message ("Task expired in pending state
  after ...s"). Any caller still polling sees a clean failure.
- Deletes **terminal** tasks (`completed`/`error`/`cancelled`) past
  the retention window (default 7 d). Indexes back the
  `WHERE status IN (...) AND completedAt < cutoff` predicate so the
  sweep is cheap even with millions of rows.

Both timeouts are constants in `main.ts`; making them per-deployment
configurable is a v6 concern.

## Limitations / Out-of-scope for v5

- **Streaming chunks are not persisted.** The DB stores only the
  final assembled body (or null for streaming-only consumers). A
  caller who connects to `/stream` AFTER the task completed gets one
  `terminal` event with no chunk replay. Persisting deltas would be
  a per-chunk INSERT — too expensive at SSE granularity.
- **Public-Llm queueing** is not exposed. Public Llms are HTTP
  adapters with no SSE worker, so the queue concept doesn't apply.
  Direct `/llms/<name>/infer` against a public Llm stays the right
  call.
- **Multi-instance mcpd**: the in-process `EventEmitter` that wakes
  blocked HTTP handlers fires only on the mcpd that holds the row's
  blocked handler. With multiple mcpd replicas behind a load balancer
  the wakeup wouldn't cross instances. v6 swap: pg `LISTEN/NOTIFY`,
  no schema change.
- **Worker capacity** is one task per session at a time (today). A
  per-session capacity hint + scheduler-level concurrency control
  is a v6+ concern.
- **Remote cancel protocol**: cancelling a `running` task marks the
  row as cancelled but the worker keeps executing. The result POST
  is discarded server-side. Real cancellation (an SSE message that
  tells the worker to abort its in-flight call) is a v6 concern.

## See also

- [virtual-llms.md](./virtual-llms.md) — the Llm-side primitives
  (kind, lifecycle, pools) that the queue routes around.
- [agents.md](./agents.md) — how the chat path uses the same queue
  underneath.
feat(cli+docs+smoke): inference-task CLI + GC ticker + smoke + docs (v5 Stage 4) CLI surface for the durable queue: - `mcpctl get tasks` — table view (ID, STATUS, POOL, LLM, MODEL, STREAM, AGE, WORKER). Aliases `task`, `tasks`, `inference-task`, `inference-tasks` all normalize to the canonical plural so URL construction works uniformly. RESOURCE_ALIASES + completions generator updated. - `mcpctl chat-llm <name> --async -m <msg>` — enqueue and exit. stdout is just the task id (pipeable into `xargs mcpctl get task`); stderr carries human-readable status. REPL mode is rejected for --async (fire-and-forget doesn't make sense without -m). GC ticker in mcpd: 5-min interval. Pending tasks past 1 h queue timeout flip to error with a clear message; terminal tasks past 7 d retention get deleted. Both queries are index-backed. Crash fix uncovered by the smoke: when the async route doesn't await ref.done, a later cancel/error rejected the in-flight Promise as unhandled and crashed mcpd. The route now attaches a no-op `.catch` so the legacy `done` semantic still works for sync callers (chat, direct infer) without taking out the process for async ones. The EnqueueInferOptions also gained an explicit `ownerId` field so the async API can stamp the authenticated user on the row instead of inheriting 'system' from the constructor's resolveOwner — without this, every GET/DELETE from the original caller would 404 due to foreign-owner mismatch. Smoke (tests/smoke/inference-task.smoke.test.ts): 1. POST /inference-tasks while no worker bound → row=pending. 2. Bring a registrar online → bindSession drain claims and dispatches → worker complete()s → row=completed → GET returns the assistant body. 3. Stop worker, enqueue, DELETE → row=cancelled, persisted. docs/inference-tasks.md (new): full data model, lifecycle diagram, async API reference, CLI examples, RBAC table, GC defaults, and the v5 limitations / v6 roadmap. Cross-linked from virtual-llms.md and agents.md. Tests + smoke: mcpd 893/893, mcplocal 723/723, cli 437/437, full smoke 146/146 (was 144, +2 new task smoke). Live mcpd verified via manual curl: enqueue → cancel → re-fetch — no crash, owner scoping returns 404 on foreign ids, GC ticker logs at info when it sweeps. v5 complete: durable queue (Stage 1) + VirtualLlmService rewire (Stage 2) + async API & RBAC (Stage 3) + CLI/GC/smoke/docs (Stage 4). 2026-04-28 15:25:09 +01:00			`# Inference Tasks (v5)`

			`A durable inference task queue. Every inference call lands as a row`
			`in the database; mcpd's previous in-memory request map is gone. mcpd`
			`restart no longer drops in-flight work, worker disconnects no longer`
			`fail callers, and pools that have no live workers can still accept`
			`work that drains when one shows up.`

			`The queue is the same persistence layer underneath two API surfaces:`

			- Sync (existing): `POST /api/v1/llms/<name>/infer` and
			`POST /api/v1/agents/<name>/chat`. Caller's HTTP request stays open
			`until the task finishes, exactly like pre-v5. Internally it's now a`
			DB row with a 10-min wait. Used by `mcpctl chat` / `mcpctl chat-llm`.
			- Async (new — this doc): `POST /api/v1/inference-tasks`. Caller
			`gets a task id immediately and polls / streams / cancels separately.`
			`Useful for fire-and-forget scripts, queued work for offline pools,`
			`and operator visibility into in-flight jobs.`

			`## Lifecycle`

			```
			`┌──────► claimed ──first chunk──► running ──┐`
			`pending ──┐ │ │ │`
			`├──┘ on dispatch │ on revert (worker disconnect)│`
			`│ └─────────► pending │`
			`▼ ▼`
			`cancelled ◄───────────── DELETE ──────────────► completed \| error`
			```

			`\| Status \| Meaning \|`
			`\|-------------\|---------\|`
			\| `pending` \| In queue, no worker has it yet (or claim was reverted) \|
			\| `claimed` \| A worker has it; dispatch frame went down the SSE channel \|
			\| `running` \| First chunk came back; streaming is in flight \|
			\| `completed` \| Worker POSTed terminal result \|
			\| `error` \| Permanent failure (auth, bad request, queue timeout) \|
			\| `cancelled` \| Caller cancelled via `DELETE` \|

			A worker disconnecting mid-task reverts `claimed`/`running` rows back
			to `pending` (the row's `claimedBy` is cleared); a sibling worker on
			`the same pool drains it. Tasks survive mcpd restart — the row is the`
			`source of truth, so a fresh mcpd reads the queue and either matches a`
			`worker or waits.`

			`## Async API`

			```
			`POST /api/v1/inference-tasks → 201 + { id, status, poolName, llmName, streaming, createdAt }`
			`GET /api/v1/inference-tasks?... → list (owner-scoped by default; ?status, ?poolName, ?agentId, ?limit)`
			`GET /api/v1/inference-tasks/<id> → poll one row`
			GET /api/v1/inference-tasks/<id>/stream → SSE: `event: chunk` + `event: terminal`
			`DELETE /api/v1/inference-tasks/<id> → cancel pending/claimed/running; no-op on terminal`
			```

			`Foreign-owner ids return 404 (not 403) on every endpoint — keeps`
			`id-enumeration from leaking task existence across users.`

			`### Enqueue`

			```sh
			`curl -X POST https://mcpd/api/v1/inference-tasks \`
			`-H 'Authorization: Bearer ...' \`
			`-H 'Content-Type: application/json' \`
			`-d '{`
			`"llmName": "vllm-local-qwen3",`
			`"request": {`
			`"messages": [{"role":"user","content":"summarize the changelog"}],`
			`"max_tokens": 500`
			`},`
			`"streaming": false`
			`}'`

			`# 201 Created`
			`# {`
			`# "id": "cmoip...",`
			`# "status": "pending",`
			`# "poolName": "vllm-local-qwen3",`
			`# "llmName": "vllm-local-qwen3",`
			`# "streaming": false,`
			`# "createdAt": "2026-04-28T..."`
			`# }`
			```

			`Rejects with 400 for public Llms — the existing`
			`POST /api/v1/llms/<name>/infer` is the right tool there. Rejects
			`with 404 for missing Llm names.`

			`### Poll`

			```sh
			`curl https://mcpd/api/v1/inference-tasks/cmoip... \`
			`-H 'Authorization: Bearer ...'`

			`# 200`
			`# {`
			`# "id": "cmoip...",`
			`# "status": "completed",`
			`# "responseBody": { "choices": [{"message":{"role":"assistant","content":"..."}}] },`
			`# "claimedBy": "<providerSessionId>",`
			`# "createdAt": "...", "claimedAt": "...", "completedAt": "..."`
			`# }`
			```

			`### Stream live updates`

			```sh
			`curl -N https://mcpd/api/v1/inference-tasks/cmoip.../stream \`
			`-H 'Authorization: Bearer ...'`

			`# event: chunk`
			`# data: {"data":"hello "}`
			`#`
			`# event: chunk`
			`# data: {"data":"world","done":true}`
			`#`
			`# event: terminal`
			`# data: { ...full row... }`
			```

			Already-terminal tasks emit one `terminal` event and close. Caller
			`disconnect cleans up server-side.`

			`### Cancel`

			```sh
			`curl -X DELETE https://mcpd/api/v1/inference-tasks/cmoip... \`
			`-H 'Authorization: Bearer ...'`
			```

			Returns 200 + the cancelled row. Cancelling a `completed` /
			`error` / already-`cancelled` task is a no-op (returns the current
			`row). The worker that picked the task up may still execute the`
			`underlying inference call (no remote-cancel protocol yet); the`
			`result is just discarded server-side.`

			`## CLI`

			```sh
			`mcpctl get tasks # table view: ID, STATUS, POOL, LLM, MODEL, STREAM, AGE, WORKER`
			`mcpctl get task <id> # describe one`
			`mcpctl get task <id> -o yaml # full row, yaml format`

			`# Enqueue async via chat-llm:`
			`mcpctl chat-llm vllm-local-qwen3 --async -m "summarize the changelog"`
			`# stdout: cmoip... (just the id, pipeable)`
			`# stderr: (task cmoip... enqueued for pool 'vllm-local-qwen3', status=pending)`

			`# Wire to other commands:`
			`ID=$(mcpctl chat-llm vllm-local-qwen3 --async -m "long task")`
			`sleep 30`
			`mcpctl get task $ID`
			```

			`## RBAC`

			A new `tasks` resource. Default action mapping:

			`\| Method \| Action \|`
			`\|--------\|--------\|`
			\| `GET` \| `view` \|
			\| `POST` \| `create` \|
			\| `DELETE` \| `delete` \|

			Aliases: `task`, `inference-task`, `inference-tasks` all normalize to
			`tasks` so `mcpctl create rbac-binding --resource task --role view ...`
			`works regardless of which form the operator types.`

			`Owner scoping is enforced at the route layer ON TOP of the RBAC hook:`
			`- A caller with only the implicit role sees and can cancel their own tasks.`
			- `view:tasks` resource-wide is required to see / list / cancel
			`another user's tasks.`

			`## GC`

			`mcpd runs a 5-min ticker that:`

			`- Flips pending tasks older than the queue timeout (default 1 h)`
			to `error` with a clear message ("Task expired in pending state
			`after ...s"). Any caller still polling sees a clean failure.`
			- Deletes terminal tasks (`completed`/`error`/`cancelled`) past
			`the retention window (default 7 d). Indexes back the`
			`WHERE status IN (...) AND completedAt < cutoff` predicate so the
			`sweep is cheap even with millions of rows.`

			Both timeouts are constants in `main.ts`; making them per-deployment
			`configurable is a v6 concern.`

			`## Limitations / Out-of-scope for v5`

			`- Streaming chunks are not persisted. The DB stores only the`
			`final assembled body (or null for streaming-only consumers). A`
			caller who connects to `/stream` AFTER the task completed gets one
			`terminal` event with no chunk replay. Persisting deltas would be
			`a per-chunk INSERT — too expensive at SSE granularity.`
			`- Public-Llm queueing is not exposed. Public Llms are HTTP`
			`adapters with no SSE worker, so the queue concept doesn't apply.`
			Direct `/llms/<name>/infer` against a public Llm stays the right
			`call.`
			- Multi-instance mcpd: the in-process `EventEmitter` that wakes
			`blocked HTTP handlers fires only on the mcpd that holds the row's`
			`blocked handler. With multiple mcpd replicas behind a load balancer`
			the wakeup wouldn't cross instances. v6 swap: pg `LISTEN/NOTIFY`,
			`no schema change.`
			`- Worker capacity is one task per session at a time (today). A`
			`per-session capacity hint + scheduler-level concurrency control`
			`is a v6+ concern.`
			- Remote cancel protocol: cancelling a `running` task marks the
			`row as cancelled but the worker keeps executing. The result POST`
			`is discarded server-side. Real cancellation (an SSE message that`
			`tells the worker to abort its in-flight call) is a v6 concern.`

			`## See also`

			`- [virtual-llms.md](./virtual-llms.md) — the Llm-side primitives`
			`(kind, lifecycle, pools) that the queue routes around.`
			`- [agents.md](./agents.md) — how the chat path uses the same queue`
			`underneath.`