# Inference Tasks (v5)

A **durable inference task queue**. Every inference call lands as a row
in the database; mcpd's previous in-memory request map is gone. An mcpd
restart no longer drops in-flight work, worker disconnects no longer
fail callers, and pools that have no live workers can still accept
work that drains when one shows up.

The queue is the same persistence layer underneath two API surfaces:

- **Sync** (existing): `POST /api/v1/llms/<name>/infer` and
  `POST /api/v1/agents/<name>/chat`. The caller's HTTP request stays open
  until the task finishes, exactly like pre-v5. Internally it's now a
  DB row with a 10-min wait. Used by `mcpctl chat` / `mcpctl chat-llm`.
- **Async** (new, covered in this doc): `POST /api/v1/inference-tasks`.
  The caller gets a task id immediately and polls / streams / cancels
  separately. Useful for fire-and-forget scripts, queued work for
  offline pools, and operator visibility into in-flight jobs.

## Lifecycle

```
               ┌──────► claimed ──first chunk──► running ──────┐
  pending ──┐  │              │                                │
            ├──┘ on dispatch  │ on revert (worker disconnect)  │
            │                 └─────────► pending              │
            ▼                                                  ▼
        cancelled ◄───────────── DELETE ──────────────► completed | error
```

| Status      | Meaning |
|-------------|---------|
| `pending`   | In queue, no worker has it yet (or claim was reverted) |
| `claimed`   | A worker has it; dispatch frame went down the SSE channel |
| `running`   | First chunk came back; streaming is in flight |
| `completed` | Worker POSTed terminal result |
| `error`     | Permanent failure (auth, bad request, queue timeout) |
| `cancelled` | Caller cancelled via `DELETE` |

A worker disconnecting mid-task reverts `claimed`/`running` rows back
to `pending` (the row's `claimedBy` is cleared); a sibling worker on
the same pool drains it. Tasks survive an mcpd restart: the row is the
source of truth, so a fresh mcpd reads the queue and either matches a
worker or waits.

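The status table and the revert rule can be read as a small state machine. The following is a minimal sketch, not mcpd's actual code: the type and function names are hypothetical, and the `claimed` to `completed` edge (a non-streaming result that arrives without an intermediate chunk) is an assumption rather than something this document states.

```typescript
// Hypothetical sketch: only the six status names and `claimedBy` come from this doc.
type TaskStatus =
  | "pending" | "claimed" | "running"
  | "completed" | "error" | "cancelled";

interface TaskRow {
  id: string;
  status: TaskStatus;
  claimedBy: string | null; // provider session id of the worker holding the row
}

// Transitions read off the lifecycle diagram. `claimed -> completed` is an
// assumption (non-streaming result with no intermediate chunk).
const TRANSITIONS: Record<TaskStatus, TaskStatus[]> = {
  pending: ["claimed", "cancelled", "error"],
  claimed: ["running", "pending", "completed", "error", "cancelled"],
  running: ["pending", "completed", "error", "cancelled"],
  completed: [],
  error: [],
  cancelled: [],
};

function canTransition(from: TaskStatus, to: TaskStatus): boolean {
  return TRANSITIONS[from].indexOf(to) !== -1;
}

// On worker disconnect, every non-terminal row held by that session goes
// back to `pending` with `claimedBy` cleared, so a sibling worker can drain it.
function revertOnDisconnect(rows: TaskRow[], sessionId: string): TaskRow[] {
  return rows.map((row): TaskRow => {
    const held =
      row.claimedBy === sessionId &&
      (row.status === "claimed" || row.status === "running");
    return held ? { id: row.id, status: "pending", claimedBy: null } : row;
  });
}
```

Terminal rows are left untouched by the revert, which matches the table: once a row is `completed`, `error`, or `cancelled`, it has no outgoing transitions.
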
## Async API

```
POST   /api/v1/inference-tasks              → 201 + { id, status, poolName, llmName, streaming, createdAt }
GET    /api/v1/inference-tasks?...          → list (owner-scoped by default; ?status, ?poolName, ?agentId, ?limit)
GET    /api/v1/inference-tasks/<id>         → poll one row
GET    /api/v1/inference-tasks/<id>/stream  → SSE: `event: chunk` + `event: terminal`
DELETE /api/v1/inference-tasks/<id>         → cancel pending/claimed/running; no-op on terminal
```

Foreign-owner ids return **404** (not 403) on every endpoint, which
keeps id enumeration from leaking task existence across users.

### Enqueue

```sh
curl -X POST https://mcpd/api/v1/inference-tasks \
  -H 'Authorization: Bearer ...' \
  -H 'Content-Type: application/json' \
  -d '{
    "llmName": "vllm-local-qwen3",
    "request": {
      "messages": [{"role":"user","content":"summarize the changelog"}],
      "max_tokens": 500
    },
    "streaming": false
  }'

# 201 Created
# {
#   "id": "cmoip...",
#   "status": "pending",
#   "poolName": "vllm-local-qwen3",
#   "llmName": "vllm-local-qwen3",
#   "streaming": false,
#   "createdAt": "2026-04-28T..."
# }
```

Rejects with **400** for public Llms; the existing
`POST /api/v1/llms/<name>/infer` is the right tool there. Rejects
with **404** for missing Llm names.

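The two rejection rules can be sketched as a single decision function. This is an illustration under assumptions: the `LlmEntry` shape, the `kind` values `"public"` and `"virtual"`, and the function name are hypothetical and not taken from mcpd.

```typescript
// Hypothetical registry entry; the `kind` values are assumptions.
interface LlmEntry {
  name: string;
  kind: "public" | "virtual";
}

interface EnqueueDecision {
  status: 201 | 400 | 404;
  reason: string;
}

// Decide the HTTP status for POST /api/v1/inference-tasks from the Llm name alone.
function decideEnqueue(registry: LlmEntry[], llmName: string): EnqueueDecision {
  let found: LlmEntry | undefined;
  for (const entry of registry) {
    if (entry.name === llmName) found = entry;
  }
  if (!found) {
    return { status: 404, reason: "unknown Llm name" };
  }
  if (found.kind === "public") {
    // Public Llms have no SSE worker, so there is nothing to queue for.
    return { status: 400, reason: "use POST /api/v1/llms/<name>/infer instead" };
  }
  return { status: 201, reason: "task row created as pending" };
}
```
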
### Poll

```sh
curl https://mcpd/api/v1/inference-tasks/cmoip... \
  -H 'Authorization: Bearer ...'

# 200
# {
#   "id": "cmoip...",
#   "status": "completed",
#   "responseBody": { "choices": [{"message":{"role":"assistant","content":"..."}}] },
#   "claimedBy": "<providerSessionId>",
#   "createdAt": "...", "claimedAt": "...", "completedAt": "..."
# }
```

### Stream live updates

```sh
curl -N https://mcpd/api/v1/inference-tasks/cmoip.../stream \
  -H 'Authorization: Bearer ...'

# event: chunk
# data: {"data":"hello "}
#
# event: chunk
# data: {"data":"world","done":true}
#
# event: terminal
# data: { ...full row... }
```

Already-terminal tasks emit one `terminal` event and close. A caller
disconnect is cleaned up server-side.

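A client consuming `/stream` needs to split the body into frames and treat `chunk` and `terminal` events differently. The sketch below parses an already-received SSE body; the function names are hypothetical, and the payload fields (`data` on chunks, `status` on the terminal row) are taken from the example output above. It ignores SSE fields other than `event` and `data`.

```typescript
interface SseFrame {
  event: string;
  data: string;
}

// Split an SSE body into frames. Frames are separated by a blank line and
// each line inside a frame has the form `field: value`.
function parseSse(body: string): SseFrame[] {
  const frames: SseFrame[] = [];
  for (const block of body.split(/\r?\n\r?\n/)) {
    let event = "message"; // SSE default when no `event:` line is present
    const data: string[] = [];
    for (const line of block.split(/\r?\n/)) {
      const colon = line.indexOf(":");
      // Skip blank lines, comments (leading ':'), and lines with no colon.
      if (colon <= 0) continue;
      const field = line.slice(0, colon);
      let value = line.slice(colon + 1);
      if (value.charAt(0) === " ") value = value.slice(1); // one optional space
      if (field === "event") event = value;
      else if (field === "data") data.push(value);
    }
    if (data.length > 0) frames.push({ event, data: data.join("\n") });
  }
  return frames;
}

interface StreamResult {
  text: string;                          // concatenated chunk payloads
  terminal: { status?: string } | null;  // the full row, once received
}

// Fold the frames into the streamed text plus the terminal row.
function collectStream(body: string): StreamResult {
  let text = "";
  let terminal: { status?: string } | null = null;
  for (const frame of parseSse(body)) {
    const payload = JSON.parse(frame.data);
    if (frame.event === "chunk") text += String(payload.data);
    else if (frame.event === "terminal") terminal = payload;
  }
  return { text, terminal };
}
```

For an already-terminal task the body contains a single `terminal` frame, so `text` stays empty and the result has to be read from the row itself.
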
### Cancel

```sh
curl -X DELETE https://mcpd/api/v1/inference-tasks/cmoip... \
  -H 'Authorization: Bearer ...'
```

Returns 200 + the cancelled row. Cancelling a `completed` /
`error` / already-`cancelled` task is a no-op (returns the current
row). The worker that picked the task up may still execute the
underlying inference call (no remote-cancel protocol yet); the
result is just discarded server-side.

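The cancel semantics come down to two rules: `DELETE` is a no-op on terminal rows, and a worker result that arrives for a terminal row is dropped. A minimal sketch with hypothetical names (`CancelRow`, `cancelTask`, `applyWorkerResult`), assuming the row carries a `responseBody` field as in the poll example:

```typescript
type CancelStatus =
  | "pending" | "claimed" | "running"
  | "completed" | "error" | "cancelled";

interface CancelRow {
  id: string;
  status: CancelStatus;
  responseBody: unknown;
}

function isTerminal(status: CancelStatus): boolean {
  return status === "completed" || status === "error" || status === "cancelled";
}

// DELETE: cancel anything still in flight, return terminal rows unchanged.
function cancelTask(row: CancelRow): CancelRow {
  if (isTerminal(row.status)) return row;
  return { id: row.id, status: "cancelled", responseBody: null };
}

// Worker result POST: a row that is already terminal (including one that was
// cancelled while the worker kept running) keeps its state; the late body is dropped.
function applyWorkerResult(row: CancelRow, body: unknown): CancelRow {
  if (isTerminal(row.status)) return row;
  return { id: row.id, status: "completed", responseBody: body };
}
```
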
## CLI

```sh
mcpctl get tasks               # table view: ID, STATUS, POOL, LLM, MODEL, STREAM, AGE, WORKER
mcpctl get task <id>           # describe one
mcpctl get task <id> -o yaml   # full row, yaml format

# Enqueue async via chat-llm:
mcpctl chat-llm vllm-local-qwen3 --async -m "summarize the changelog"
# stdout: cmoip...   (just the id, pipeable)
# stderr: (task cmoip... enqueued for pool 'vllm-local-qwen3', status=pending)

# Wire to other commands:
ID=$(mcpctl chat-llm vllm-local-qwen3 --async -m "long task")
sleep 30
mcpctl get task "$ID"
```

## RBAC

A new `tasks` resource. Default action mapping:

| Method   | Action   |
|----------|----------|
| `GET`    | `view`   |
| `POST`   | `create` |
| `DELETE` | `delete` |

Aliases: `task`, `inference-task`, `inference-tasks` all normalize to
`tasks` so `mcpctl create rbac-binding --resource task --role view ...`
works regardless of which form the operator types.

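The action mapping and the alias normalization are both simple lookups. The sketch below shows one way to express them; the function and constant names are hypothetical, and matching is exact (whether mcpd also accepts other casings is not stated here).

```typescript
// Every accepted spelling of the resource, per the alias list above.
const TASK_RESOURCE_ALIASES: string[] = [
  "task",
  "tasks",
  "inference-task",
  "inference-tasks",
];

// Map any alias to the canonical `tasks`; leave other resources alone.
function normalizeResource(name: string): string {
  return TASK_RESOURCE_ALIASES.indexOf(name) !== -1 ? "tasks" : name;
}

// Default HTTP method to RBAC action mapping from the table above.
function actionForMethod(method: string): string | null {
  switch (method) {
    case "GET":
      return "view";
    case "POST":
      return "create";
    case "DELETE":
      return "delete";
    default:
      return null; // no default mapping documented for other methods
  }
}
```
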
Owner scoping is enforced at the route layer ON TOP of the RBAC hook:

- A caller with only the implicit role sees and can cancel their own tasks.
- `view:tasks` resource-wide is required to see / list / cancel
  another user's tasks.

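Combined with the 404-on-foreign-id rule from the Async API section, owner scoping can be sketched as follows. The `Caller` and `OwnedTask` shapes, the `ownerId` field, and the function names are hypothetical; `resourceWideView` stands in for "holds `view:tasks` resource-wide".

```typescript
interface Caller {
  userId: string;
  resourceWideView: boolean; // true when the caller holds view:tasks resource-wide
}

interface OwnedTask {
  id: string;
  ownerId: string;
}

// Status for a single-task lookup. A foreign task looks exactly like a
// missing one (404, never 403), so task ids cannot be probed across users.
function lookupStatus(caller: Caller, task: OwnedTask | undefined): 200 | 404 {
  if (!task) return 404;
  if (task.ownerId === caller.userId) return 200;
  return caller.resourceWideView ? 200 : 404;
}

// List endpoint: owner-scoped unless the caller has the resource-wide grant.
function visibleTasks(caller: Caller, tasks: OwnedTask[]): OwnedTask[] {
  if (caller.resourceWideView) return tasks;
  return tasks.filter((task) => task.ownerId === caller.userId);
}
```
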
## GC

mcpd runs a 5-min ticker that:

- Flips **pending** tasks older than the queue timeout (default 1 h)
  to `error` with a clear message ("Task expired in pending state
  after ...s"). Any caller still polling sees a clean failure.
- Deletes **terminal** tasks (`completed`/`error`/`cancelled`) past
  the retention window (default 7 d). Indexes back the
  `WHERE status IN (...) AND completedAt < cutoff` predicate so the
  sweep is cheap even with millions of rows.

||
|
|
Both timeouts are constants in `main.ts`; making them per-deployment
|
||
|
|
configurable is a v6 concern.
|
||
|
|
|
||
|
|
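The two sweep rules can be sketched as a pure planning step over in-memory rows. This is an illustration only: the constant and function names are hypothetical, timestamps are epoch milliseconds, and measuring pending age from `createdAt` is an assumption. A real sweep would run as SQL against the indexed predicate quoted above.

```typescript
const QUEUE_TIMEOUT_MS = 60 * 60 * 1000;       // default queue timeout: 1 h
const RETENTION_MS = 7 * 24 * 60 * 60 * 1000;  // default retention window: 7 d

interface GcRow {
  id: string;
  status: string;
  createdAt: number;          // epoch ms
  completedAt: number | null; // epoch ms, null while not terminal
}

interface SweepPlan {
  expire: string[]; // pending rows to flip to `error`
  remove: string[]; // terminal rows to delete
}

// Decide what one ticker pass would do at time `now`.
function planSweep(rows: GcRow[], now: number): SweepPlan {
  const plan: SweepPlan = { expire: [], remove: [] };
  const terminal = ["completed", "error", "cancelled"];
  const cutoff = now - RETENTION_MS;
  for (const row of rows) {
    if (row.status === "pending") {
      // Assumption: pending age is measured from createdAt.
      if (now - row.createdAt > QUEUE_TIMEOUT_MS) plan.expire.push(row.id);
    } else if (terminal.indexOf(row.status) !== -1) {
      if (row.completedAt !== null && row.completedAt < cutoff) {
        plan.remove.push(row.id);
      }
    }
    // claimed/running rows are never touched by the sweep
  }
  return plan;
}
```
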
## Limitations / Out-of-scope for v5

- **Streaming chunks are not persisted.** The DB stores only the
  final assembled body (or null for streaming-only consumers). A
  caller who connects to `/stream` AFTER the task completed gets one
  `terminal` event with no chunk replay. Persisting deltas would mean
  a per-chunk INSERT, which is too expensive at SSE granularity.
- **Public-Llm queueing** is not exposed. Public Llms are HTTP
  adapters with no SSE worker, so the queue concept doesn't apply.
  Direct `/llms/<name>/infer` against a public Llm stays the right
  call.
- **Multi-instance mcpd**: the in-process `EventEmitter` that wakes
  blocked HTTP handlers fires only on the mcpd that holds the row's
  blocked handler. With multiple mcpd replicas behind a load balancer
  the wakeup wouldn't cross instances. v6 swap: pg `LISTEN/NOTIFY`,
  no schema change.
- **Worker capacity** is one task per session at a time (today). A
  per-session capacity hint + scheduler-level concurrency control
  is a v6+ concern.
- **Remote cancel protocol**: cancelling a `running` task marks the
  row as cancelled but the worker keeps executing. The result POST
  is discarded server-side. Real cancellation (an SSE message that
  tells the worker to abort its in-flight call) is a v6 concern.

## See also

- [virtual-llms.md](./virtual-llms.md): the Llm-side primitives
  (kind, lifecycle, pools) that the queue routes around.
- [agents.md](./agents.md): how the chat path uses the same queue
  underneath.