# Inference Tasks (v5)

A **durable inference task queue**. Every inference call lands as a row in the database; mcpd's previous in-memory request map is gone. An mcpd restart no longer drops in-flight work, worker disconnects no longer fail callers, and pools that have no live workers can still accept work that drains when one shows up.

The queue is the same persistence layer underneath two API surfaces:

- **Sync** (existing): `POST /api/v1/llms/<name>/infer` and `POST /api/v1/agents/<name>/chat`. The caller's HTTP request stays open until the task finishes, exactly as pre-v5. Internally it is now a DB row with a 10-minute wait. Used by `mcpctl chat` / `mcpctl chat-llm`.
- **Async** (new — this doc): `POST /api/v1/inference-tasks`. The caller gets a task id immediately and polls / streams / cancels separately. Useful for fire-and-forget scripts, queued work for offline pools, and operator visibility into in-flight jobs.

## Lifecycle

```
 pending ──on dispatch──► claimed ──first chunk──► running
    ▲                        │                        │
    └────────────────────────┴────────────────────────┘
               on revert (worker disconnect)

 pending / claimed / running ──DELETE──► cancelled
 claimed / running ──terminal result──► completed | error
```

| Status      | Meaning |
|-------------|---------|
| `pending`   | In queue; no worker has it yet (or the claim was reverted) |
| `claimed`   | A worker has it; the dispatch frame went down the SSE channel |
| `running`   | First chunk came back; streaming is in flight |
| `completed` | Worker POSTed the terminal result |
| `error`     | Permanent failure (auth, bad request, queue timeout) |
| `cancelled` | Caller cancelled via `DELETE` |

A worker disconnecting mid-task reverts `claimed`/`running` rows back to `pending` (the row's `claimedBy` is cleared); a sibling worker on the same pool drains it. Tasks survive an mcpd restart — the row is the source of truth, so a fresh mcpd reads the queue and either matches a worker or waits.

## Async API

```
POST   /api/v1/inference-tasks
    → 201 + { id, status, poolName, llmName, streaming, createdAt }
GET    /api/v1/inference-tasks?...
    → list (owner-scoped by default; ?status, ?poolName, ?agentId, ?limit)
GET    /api/v1/inference-tasks/<id>
    → poll one row
GET    /api/v1/inference-tasks/<id>/stream
    → SSE: `event: chunk` + `event: terminal`
DELETE /api/v1/inference-tasks/<id>
    → cancel pending/claimed/running; no-op on terminal
```

Foreign-owner ids return **404** (not 403) on every endpoint — this keeps id enumeration from leaking task existence across users.

### Enqueue

```sh
curl -X POST https://mcpd/api/v1/inference-tasks \
  -H 'Authorization: Bearer ...' \
  -H 'Content-Type: application/json' \
  -d '{
        "llmName": "vllm-local-qwen3",
        "request": {
          "messages": [{"role":"user","content":"summarize the changelog"}],
          "max_tokens": 500
        },
        "streaming": false
      }'

# 201 Created
# {
#   "id": "cmoip...",
#   "status": "pending",
#   "poolName": "vllm-local-qwen3",
#   "llmName": "vllm-local-qwen3",
#   "streaming": false,
#   "createdAt": "2026-04-28T..."
# }
```

Rejects with **400** for public Llms — the existing `POST /api/v1/llms/<name>/infer` is the right tool there. Rejects with **404** for missing Llm names.

### Poll

```sh
curl https://mcpd/api/v1/inference-tasks/cmoip... \
  -H 'Authorization: Bearer ...'

# 200
# {
#   "id": "cmoip...",
#   "status": "completed",
#   "responseBody": { "choices": [{"message":{"role":"assistant","content":"..."}}] },
#   "claimedBy": "",
#   "createdAt": "...", "claimedAt": "...", "completedAt": "..."
# }
```

### Stream live updates

```sh
curl -N https://mcpd/api/v1/inference-tasks/cmoip.../stream \
  -H 'Authorization: Bearer ...'

# event: chunk
# data: {"data":"hello "}
#
# event: chunk
# data: {"data":"world","done":true}
#
# event: terminal
# data: { ...full row... }
```

Already-terminal tasks emit one `terminal` event and close. A caller disconnect cleans up server-side.

### Cancel

```sh
curl -X DELETE https://mcpd/api/v1/inference-tasks/cmoip... \
  -H 'Authorization: Bearer ...'
```

Returns 200 plus the cancelled row. Cancelling a `completed` / `error` / already-`cancelled` task is a no-op (returns the current row).
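For script consumers, the enqueue / poll endpoints above compose into a small wait helper. A minimal TypeScript sketch — the `baseUrl` and `token` parameters and the `waitForTask` name are illustrative, not part of the API; the terminal statuses follow the lifecycle table:

```typescript
// Statuses after which a task row will no longer change.
const TERMINAL = new Set(["completed", "error", "cancelled"]);

export function isTerminal(status: string): boolean {
  return TERMINAL.has(status);
}

// Poll GET /api/v1/inference-tasks/<id> until the row is terminal.
// baseUrl / token / intervalMs are assumptions; adjust to your deployment.
export async function waitForTask(
  baseUrl: string,
  token: string,
  id: string,
  intervalMs = 2000,
): Promise<unknown> {
  for (;;) {
    const res = await fetch(`${baseUrl}/api/v1/inference-tasks/${id}`, {
      headers: { Authorization: `Bearer ${token}` },
    });
    // Remember: a foreign-owner id surfaces here as a plain 404.
    if (!res.ok) throw new Error(`poll failed: ${res.status}`);
    const task = (await res.json()) as { status: string };
    if (isTerminal(task.status)) return task;
    await new Promise((r) => setTimeout(r, intervalMs));
  }
}
```

A `waitForTask` caller sees `cancelled` as a normal terminal row, matching the DELETE semantics above.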
The worker that picked the task up may still execute the underlying inference call (there is no remote-cancel protocol yet); the result is simply discarded server-side.

## CLI

```sh
mcpctl get tasks              # table view: ID, STATUS, POOL, LLM, MODEL, STREAM, AGE, WORKER
mcpctl get task <id>          # describe one
mcpctl get task <id> -o yaml  # full row, yaml format

# Enqueue async via chat-llm:
mcpctl chat-llm vllm-local-qwen3 --async -m "summarize the changelog"
# stdout: cmoip...   (just the id, pipeable)
# stderr: (task cmoip... enqueued for pool 'vllm-local-qwen3', status=pending)

# Wire to other commands:
ID=$(mcpctl chat-llm vllm-local-qwen3 --async -m "long task")
sleep 30
mcpctl get task $ID
```

## RBAC

A new `tasks` resource. Default action mapping:

| Method   | Action   |
|----------|----------|
| `GET`    | `view`   |
| `POST`   | `create` |
| `DELETE` | `delete` |

Aliases: `task`, `inference-task`, and `inference-tasks` all normalize to `tasks`, so `mcpctl create rbac-binding --resource task --role view ...` works regardless of which form the operator types.

Owner scoping is enforced at the route layer ON TOP of the RBAC hook:

- A caller with only the implicit role sees and can cancel their own tasks.
- Resource-wide `view:tasks` is required to see / list / cancel another user's tasks.

## GC

mcpd runs a 5-minute ticker that:

- Flips **pending** tasks older than the queue timeout (default 1 h) to `error` with a clear message ("Task expired in pending state after ...s"). Any caller still polling sees a clean failure.
- Deletes **terminal** tasks (`completed`/`error`/`cancelled`) past the retention window (default 7 d).

An index backs the `WHERE status IN (...) AND completedAt < cutoff` predicate, so the sweep stays cheap even with millions of rows. Both timeouts are constants in `main.ts`; making them per-deployment configurable is a v6 concern.

## Limitations / Out-of-scope for v5

- **Streaming chunks are not persisted.** The DB stores only the final assembled body (or null for streaming-only consumers).
  A caller who connects to `/stream` AFTER the task completed gets one `terminal` event with no chunk replay. Persisting deltas would mean a per-chunk INSERT — too expensive at SSE granularity.
- **Public-Llm queueing** is not exposed. Public Llms are HTTP adapters with no SSE worker, so the queue concept doesn't apply. Direct `/llms/<name>/infer` against a public Llm stays the right call.
- **Multi-instance mcpd**: the in-process `EventEmitter` that wakes blocked HTTP handlers fires only on the mcpd instance that holds the row's blocked handler. With multiple mcpd replicas behind a load balancer, the wakeup would not cross instances. The v6 swap is pg `LISTEN/NOTIFY`, with no schema change.
- **Worker capacity** is one task per session at a time (today). A per-session capacity hint plus scheduler-level concurrency control is a v6+ concern.
- **Remote cancel protocol**: cancelling a `running` task marks the row as cancelled, but the worker keeps executing; its result POST is discarded server-side. Real cancellation (an SSE message that tells the worker to abort its in-flight call) is a v6 concern.

## See also

- [virtual-llms.md](./virtual-llms.md) — the Llm-side primitives (kind, lifecycle, pools) that the queue routes around.
- [agents.md](./agents.md) — how the chat path uses the same queue underneath.
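The GC ticker described earlier reduces to one decision per row each sweep. A hedged TypeScript sketch — the field names follow the poll response shape, while `gcDecision` and the timeout constants are illustrative stand-ins for the real ones in `main.ts`:

```typescript
type TaskStatus =
  | "pending" | "claimed" | "running"
  | "completed" | "error" | "cancelled";

interface TaskRow {
  status: TaskStatus;
  createdAt: number;    // epoch ms
  completedAt?: number; // epoch ms, set once the row is terminal
}

// Illustrative defaults matching the documented 1 h / 7 d windows.
const QUEUE_TIMEOUT_MS = 60 * 60 * 1000;
const RETENTION_MS = 7 * 24 * 60 * 60 * 1000;

type GcAction = "none" | "expire" | "delete";

// One sweep decision per row, mirroring the 5-minute ticker:
// pending past the queue timeout -> flip to error ("expire");
// terminal past the retention window -> delete.
export function gcDecision(row: TaskRow, now: number): GcAction {
  if (row.status === "pending" && now - row.createdAt > QUEUE_TIMEOUT_MS) {
    return "expire";
  }
  const terminal =
    row.status === "completed" ||
    row.status === "error" ||
    row.status === "cancelled";
  if (terminal && row.completedAt !== undefined && now - row.completedAt > RETENTION_MS) {
    return "delete";
  }
  return "none";
}
```

In the real sweep both branches are single SQL statements over the indexed `status`/`completedAt` columns rather than per-row calls; the function above only makes the per-row rule explicit.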