feat: virtual LLMs v1 (registration skeleton) #63

Merged
michal merged 6 commits from feat/virtual-llm-v1 into main 2026-04-27 13:38:51 +00:00

Summary

v1 of the virtual-LLM feature. A user's local provider (e.g. `vllm-local`) can publish itself into mcpd's `Llm` registry as a `kind=virtual` row. Inference is relayed through the publishing mcplocal's SSE control channel — mcpd never holds the local URL or API key. When the publisher disappears, the row goes `inactive` after 90 s; after 4 h of inactivity it's auto-deleted.

This is the registration skeleton. Wake-on-demand (v2), virtual agents (v3), LB pool by model (v4), and task queue (v5) come as their own PRs — see `docs/virtual-llms.md` for the staged roadmap.

Stages

  • v1 Stage 1 — schema (`1acd8b5`): `Llm.kind` discriminator + lifecycle fields + migration. 7 db tests.
  • v1 Stage 2 — services (`2215922`): `VirtualLlmService` (register/heartbeat/disconnect/infer-task/gcSweep) + repo lifecycle queries. 18 mcpd tests.
  • v1 Stage 3 — routes + GC (`192a383`): four `/api/v1/llms/_provider-*` endpoints + kind=virtual branch in `/api/v1/llms/:name/infer` + 60-s GC ticker in main.ts. 14 new tests (11 routes + 3 infer).
  • v1 Stage 4 — mcplocal registrar (`97174f4`): SSE client that publishes opted-in providers, heartbeats, pumps inference tasks back through `provider.complete()`, and reconnects with sticky `providerSessionId`. 5 mcplocal tests.
  • v1 Stage 5 — CLI (`7e6b0ca`): new `mcpctl chat-llm <name>` (stateless, works for both public and virtual) + `KIND` + `STATUS` columns on `mcpctl get llm` + completions.
  • v1 Stage 6 — smoke + docs (`866f6ab`): live-cluster smoke test exercising register → infer relay → 503-on-disconnect, plus `docs/virtual-llms.md` + cross-links from `docs/agents.md` and the README.

How to use it (after merge + deploy)

```fish
# In ~/.mcpctl/config.json, opt the provider in:
# { "name": "vllm-local", "type": "openai", "model": "...", "publish": true }

systemctl --user restart mcplocal

mcpctl get llm
# NAME             KIND     STATUS  TYPE    MODEL                          TIER  ID
# qwen3-thinking   public   active  openai  qwen3-thinking                 fast  ...
# vllm-local       virtual  active  openai  Qwen/Qwen2.5-7B-Instruct-AWQ   fast  ...

mcpctl chat-llm vllm-local
> hello?
```

Test plan

  • Unit (mcpd): 833/833 (was 801 + 14 routes/services + 18 virtual-llm-service)
  • Unit (mcplocal): +5 registrar tests
  • Unit (cli): 437/437 (was 430 + 7 from regenerated completions golden)
  • Unit (db): +7 schema tests for the new lifecycle fields
  • Workspace: 2043/2043 across 152 files
  • Typecheck clean across mcpd / mcplocal / cli / web / db
  • Smoke (live cluster): `virtual-llm.smoke.test.ts` exercises register → infer relay → 503-on-disconnect against the deployed mcpd. Will run automatically in `bash fulldeploy.sh`.

🤖 Generated with Claude Code

michal added 6 commits 2026-04-27 13:29:19 +00:00
First step of the virtual-LLM feature. A virtual Llm row is one that
gets *registered by an mcplocal client* rather than created via
`mcpctl create llm`. Its inference is relayed back through an SSE
control channel to the publishing session (mcpd routes added in
Stage 3). The lifecycle fields below let mcpd reap stale rows when
the publisher goes away.

Schema additions:
- enum LlmKind (public | virtual). Default public.
- enum LlmStatus (active | inactive | hibernating). Default active.
  hibernating is reserved for v2 wake-on-demand.
- Llm.kind, providerSessionId, lastHeartbeatAt, status, inactiveSince.
- @@index([kind, status]) for the GC sweep.
- @@index([providerSessionId]) for the reconnect lookup.

All existing rows backfill with kind=public/status=active so v1 is
purely additive — public LLMs ignore the lifecycle columns entirely.
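
In TypeScript terms, the columns surface through the generated client roughly like this (a sketch reconstructed from this commit message; nullability is an assumption):

```ts
// Stage 1 additions as seen through the generated client (sketch).
type LlmKind = 'public' | 'virtual';
type LlmStatus = 'active' | 'inactive' | 'hibernating'; // hibernating reserved for v2

interface Llm {
  // ...existing columns...
  kind: LlmKind;                    // default 'public'
  status: LlmStatus;                // default 'active'
  providerSessionId: string | null; // sticky reconnect key
  lastHeartbeatAt: Date | null;     // bumped on heartbeat
  inactiveSince: Date | null;       // set on the active→inactive flip
}
// Plus @@index([kind, status]) for the GC sweep and
// @@index([providerSessionId]) for the reconnect lookup.
```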

7 new prisma-level assertions in tests/llm-virtual-schema.test.ts
cover: defaults, persisting kind=virtual + lifecycle together, the
active→inactive flip, hibernating value, enum rejection, the
(kind,status) GC index, the providerSessionId reconnect index.

mcpd suite still 801/801 (regenerated client) and typecheck clean.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The state machine for kind=virtual Llm rows. Wires the schema added
in Stage 1 into something that can register, heartbeat, time out,
and relay inference tasks. The HTTP routes (Stage 3) plug into this.

Repository (extends ILlmRepository):
- create/update accept kind/providerSessionId/lastHeartbeatAt/status/
  inactiveSince/type so VirtualLlmService can drive the lifecycle.
- findBySessionId(sessionId) — the reconnect lookup.
- findStaleVirtuals(cutoff) — heartbeat-stale rows for the GC sweep.
- findExpiredInactives(cutoff) — 4h-expired rows for deletion.

VirtualLlmService:
- register(): sticky-id-aware upsert. New names insert as kind=virtual/
  status=active. Existing virtual rows from the same session reactivate
  in place; existing inactive virtuals from a foreign session can be
  adopted (sticky reconnect). Refuses to overwrite a public row or a
  foreign session's still-active virtual.
- heartbeat(): bumps lastHeartbeatAt for every row owned by the
  session; revives inactive rows.
- bindSession()/unbindSession(): in-memory map of sessionId → SSE
  handle. Disconnect immediately flips owned rows to inactive AND
  rejects any in-flight tasks for that session.
- enqueueInferTask(): pushes an `infer` task frame to the SSE handle,
  returns a PendingTaskRef whose `done` resolves when the publisher
  POSTs the result back. Streaming variant exposes onChunk(cb).
- completeTask/pushTaskChunk/failTask: route-side hooks called from
  the result POST handler (lands in Stage 3).
- gcSweep(): flips heartbeat-stale active virtuals to inactive (90s
  cutoff), deletes inactives past 4h. Idempotent.

Lifecycle constants live in this file (HEARTBEAT_TIMEOUT_MS=90s,
INACTIVE_RETENTION_MS=4h) so future stages can tune in one place.
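
Put together, the surface sketches out like this in TypeScript (method names follow this commit message; parameter and return types are my assumptions, not the real signatures):

```ts
export const HEARTBEAT_TIMEOUT_MS = 90_000;            // active → inactive cutoff
export const INACTIVE_RETENTION_MS = 4 * 60 * 60_000;  // inactive → deleted cutoff

type TaskChunk = { data: string; done?: boolean };
type SseHandle = { send(event: string, data: string): void }; // stand-in for the real handle

interface PendingTaskRef {
  taskId: string;
  done: Promise<{ status: number; body: unknown }>; // resolves on the result POST
  onChunk(cb: (chunk: TaskChunk) => void): void;    // streaming variant
}

interface VirtualLlmService {
  register(sessionId: string | null, llms: unknown[]): Promise<{ providerSessionId: string }>;
  heartbeat(sessionId: string): Promise<void>;      // bump + revive owned rows
  bindSession(sessionId: string, sse: SseHandle): void;
  unbindSession(sessionId: string): void;           // flip owned rows inactive, reject in-flight
  enqueueInferTask(llmName: string, body: unknown): Promise<PendingTaskRef>;
  completeTask(taskId: string, status: number, body: unknown): void;
  pushTaskChunk(taskId: string, chunk: TaskChunk): void;
  failTask(taskId: string, message: string): void;
  gcSweep(now?: Date): Promise<void>;               // idempotent
}
```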

18 new mocked-repo tests cover: register variants (insert, sticky
reconnect, refuse public-overwrite, refuse foreign-session, adopt
inactive-foreign), heartbeat-revive, unbind cascade, enqueue happy
path + 503 paths (no session, inactive, public-Llm), complete/fail/
streaming chunk fan-out, GC sweep flip + delete + idempotence.

mcpd suite: 819/819 (was 801, +18). Typecheck clean.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
End-to-end backend wiring. After this stage, an mcplocal client can
register a provider, hold the SSE channel open, heartbeat, and have
its inference requests fanned through the relay — all without
touching the agent layer or the public-LLM path.

Routes (new file: routes/virtual-llms.ts):
  POST /api/v1/llms/_provider-register    → returns { providerSessionId, llms[] }
  GET  /api/v1/llms/_provider-stream      → SSE channel keyed by
                                            x-mcpctl-provider-session header.
                                            Emits `event: hello` on open,
                                            `event: task` on inference fan-out,
                                            `: ping` every 20 s for proxies.
  POST /api/v1/llms/_provider-heartbeat   → bumps lastHeartbeatAt
  POST /api/v1/llms/_provider-task/:id/result
                                          → mcplocal pushes result back;
                                            body shape is one of:
                                              { error: 'msg' }
                                              { chunk: { data, done? } }
                                              { status, body }
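
As a TS union, the three result-body shapes (the type name `TaskResultBody` is mine, not from the code):

```ts
type TaskResultBody =
  | { error: string }                            // publisher-side failure
  | { chunk: { data: string; done?: boolean } }  // streaming delta or [DONE]
  | { status: number; body: unknown };           // non-streaming final result
```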

LlmService:
- LlmView gains kind/status/lastHeartbeatAt/inactiveSince so route
  handlers + the upcoming `mcpctl get llm` columns can branch on
  kind without re-fetching the row.

llm-infer.ts:
- Detects llm.kind === 'virtual' and delegates to
  VirtualLlmService.enqueueInferTask. Streaming + non-streaming both
  supported; on 503 (publisher offline) the existing audit hook still
  fires with the right status code.
- Adds optional `virtualLlms: VirtualLlmService` to LlmInferDeps;
  absence in test fixtures returns a 500 with a clear "server
  misconfiguration" message rather than silently falling through to
  the public path against an empty URL.
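
Sketched, the branch looks roughly like this (`reply` and `audit` stand in for the real route plumbing; `VirtualLlmService` as outlined in the Stage 2 message above):

```ts
async function relayVirtualInfer(
  llm: { kind: string; name: string },
  body: unknown,
  deps: { virtualLlms?: VirtualLlmService },
  reply: { code(n: number): { send(b: unknown): unknown } },
  audit: (status: number) => void,
) {
  if (!deps.virtualLlms) {
    // missing dep in a fixture: clear 500, never fall through to the public path
    return reply.code(500).send({ error: 'server misconfiguration: virtual-LLM relay unavailable' });
  }
  try {
    const task = await deps.virtualLlms.enqueueInferTask(llm.name, body);
    const result = await task.done;   // resolves when the publisher POSTs the result back
    audit(result.status);
    return reply.code(result.status).send(result.body);
  } catch {
    audit(503);                       // audit hook fires with the right status code
    return reply.code(503).send({ error: 'publisher offline' });
  }
}
```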

main.ts:
- Constructs VirtualLlmService(llmRepo).
- Passes it to registerLlmInferRoutes.
- Calls registerVirtualLlmRoutes(app, virtualLlmService).
- 60-s GC ticker started after app.listen; clears on graceful
  shutdown alongside the existing reconcile timer.
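
The ticker wiring, sketched (`log` and `shutdownHooks` are assumptions about main.ts's existing structure):

```ts
const virtualLlms = new VirtualLlmService(llmRepo);
// after app.listen: 60-s GC ticker
const gcTimer = setInterval(() => {
  virtualLlms.gcSweep().catch((err) => log.error({ err }, 'virtual-llm gc sweep failed'));
}, 60_000);
// cleared in the existing graceful-shutdown path, alongside the reconcile timer
shutdownHooks.push(() => clearInterval(gcTimer));
```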

Tests: 11 new virtual-LLM route assertions (validation paths,
service plumbing for register/heartbeat/task-result) + 3 new
infer-route assertions (kind=virtual non-streaming relay, 503 path,
500 when virtualLlms dep missing). mcpd suite: 833/833 (was 819,
+14). Typecheck clean.

The full SSE handshake is exercised by the smoke test in Stage 6;
under app.inject the keep-alive blocks until close so unit-level
SSE testing isn't worth the complexity here.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The mcplocal counterpart to mcpd's VirtualLlmService. After this stage,
flipping `publish: true` on a provider in ~/.mcpctl/config.json makes
the provider show up in mcpctl get llm with kind=virtual the next time
mcplocal restarts; running an inference against it relays through this
client back to the local LlmProvider.

Config:
- LlmProviderFileEntry gains optional `publish: boolean` (default false,
  so existing setups don't change).

Registrar (new file: providers/registrar.ts):
- start(): if any provider is opted-in, POSTs to
  /api/v1/llms/_provider-register with the publishable set, persists
  the returned providerSessionId to ~/.mcpctl/provider-session for
  sticky reconnects, then opens the SSE control channel and starts a
  30-s heartbeat ticker.
- SSE listener parses event/data lines from text/event-stream frames.
  task frames trigger handleInferTask: convert OpenAI body to
  CompletionOptions, call provider.complete(), POST the result back as
  either { status, body } (non-streaming) or two chunk POSTs
  (streaming: one delta + a [DONE] marker).
- Disconnect → exponential backoff reconnect from 5 s up to 60 s. On
  successful reconnect the persisted sessionId revives the same Llm
  rows in mcpd (mcpd flips them back to active on heartbeat).
- stop() destroys the SSE socket and clears the timer; it's invoked
  cleanly from main.ts's existing shutdown handler.
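
A minimal sketch of that event/data frame parsing (buffer handling and names are mine, not the real registrar code):

```ts
// Parse text/event-stream frames into (event, data) pairs.
// Frames end on a blank line; ": ping" comment lines are dropped.
function onSseChunk(
  buf: { s: string },
  chunk: string,
  emit: (event: string, data: string) => void,
) {
  buf.s += chunk;
  let idx: number;
  while ((idx = buf.s.indexOf('\n\n')) !== -1) {
    const frame = buf.s.slice(0, idx);
    buf.s = buf.s.slice(idx + 2);
    let event = 'message';
    const data: string[] = [];
    for (const line of frame.split('\n')) {
      if (line.startsWith(':')) continue;                       // keep-alive ping
      if (line.startsWith('event:')) event = line.slice(6).trim();
      else if (line.startsWith('data:')) data.push(line.slice(5).trimStart());
    }
    if (data.length) emit(event, data.join('\n'));
  }
}
```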

Wired into mcplocal main.ts via maybeStartVirtualLlmRegistrar:
- Filters opted-in providers, looks up their LlmProvider instances in
  the registry.
- Reads ~/.mcpctl/credentials for mcpdUrl + bearer; absence is a
  best-effort skip (logs a warning, returns null) — never a boot
  blocker.

v1 caveat documented in the file header: LlmProvider returns a
finalized CompletionResult, not a token stream, so streaming requests
get a single delta chunk + [DONE]. Real per-token streaming is a v2
concern.

Tests: 5 new in tests/registrar.test.ts using a tiny in-process HTTP
server. Cover: no-op when nothing opted-in, register POST + sticky
sessionId persistence, sticky reconnect from disk, heartbeat ticker
fires at the configured interval, register HTTP error surfaces.

Workspace suite: 2043/2043 across 152 files (was 2006/149, +5
new tests + the new file gets discovered).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Closes the loop on user-facing surface:

  $ mcpctl get llm
  NAME             KIND     STATUS    TYPE     MODEL                       TIER  KEY  ID
  qwen3-thinking   public   active    openai   qwen3-thinking              fast  ...  ...
  vllm-local       virtual  active    openai   Qwen/Qwen2.5-7B-Instruct    fast  -    ...

  $ mcpctl chat-llm vllm-local
  ────────────────────────────────────────
  LLM: vllm-local  openai → Qwen/Qwen2.5-7B-Instruct-AWQ
  Kind: virtual    Status: active
  ────────────────────────────────────────
  > hello?
  Hi! …

New: chat-llm command (commands/chat-llm.ts)
- Stateless chat with any mcpd-registered LLM. No threads, no tools,
  no project prompts. POSTs to /api/v1/llms/<name>/infer; mcpd's
  kind=virtual branch handles relay-through-mcplocal transparently,
  so the same CLI command works for both public and virtual LLMs.
- Reuses installStatusBar / formatStats / recordDelta / styleStats /
  PhaseStats from chat.ts (now exported) so the bottom-row tokens-per-
  second ticker behaves identically to mcpctl chat.
- Flags: --message (one-shot), --system, --temperature, --max-tokens,
  --no-stream. Streaming uses OpenAI chat.completion.chunk SSE.
- REPL mode keeps a per-session history array so multi-turn flows
  feel natural; each turn is an independent inference call.
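
Roughly what one turn boils down to (helper and parameter names here are assumptions; the OpenAI-shaped body and the flags follow this commit message):

```ts
type Msg = { role: 'system' | 'user' | 'assistant'; content: string };

async function inferTurn(
  mcpdUrl: string, token: string, name: string,
  messages: Msg[],                     // replayed client-side; the server stays stateless
  opts: { stream: boolean; temperature?: number; maxTokens?: number },
) {
  return fetch(`${mcpdUrl}/api/v1/llms/${encodeURIComponent(name)}/infer`, {
    method: 'POST',
    headers: { authorization: `Bearer ${token}`, 'content-type': 'application/json' },
    body: JSON.stringify({
      messages,
      stream: opts.stream,             // --no-stream turns this off
      temperature: opts.temperature,   // --temperature
      max_tokens: opts.maxTokens,      // --max-tokens
    }),
  });
}
```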

Updated: get.ts
- LlmRow gains optional kind/status fields.
- llmColumns layout: NAME, KIND, STATUS, TYPE, MODEL, TIER, KEY, ID.
  Defaults gracefully when older mcpd responses don't return them.

Updated: chat.ts
- Re-exports the helpers chat-llm.ts needs (PhaseStats, newPhase,
  recordDelta, formatStats, styleStats, styleThinking, STDERR_IS_TTY,
  StatusBar, installStatusBar). No behavior change.

Completions: chat-llm picks up the standard option enumeration
automatically; bash gets a special-case for first-arg LLM-name
completion via _mcpctl_resource_names "llms".

CLI suite: 437/437 (was 430, +7 from auto-discovered test cases in
the regenerated completions golden). Workspace: 2043/2043 across
152 files.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
feat: virtual-LLM smoke test + docs (v1 Stage 6)
Final stage of v1.

Smoke (mcplocal/tests/smoke/virtual-llm.smoke.test.ts):
- Spins an in-process LlmProvider that returns canned content.
- Runs the registrar against the live mcpd in fulldeploy.
- Asserts: row appears with kind=virtual / status=active, infer
  through /api/v1/llms/<name>/infer comes back through the SSE
  relay with the provider's content + finish_reason, and a 503
  appears immediately after registrar.stop() (publisher offline).
- Timeout and cleanup paths are idempotent, so re-runs against the same
  cluster don't litter rows. The 90-s heartbeat-stale flip and 4-h
  GC are unit-tested — too slow for smoke.
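
The assertion flow, in outline (fixture and helper names here are hypothetical):

```ts
// register → infer relay → 503-on-disconnect
const row = (await listLlms()).find((l) => l.name === SMOKE_LLM_NAME);
expect(row).toMatchObject({ kind: 'virtual', status: 'active' });

const ok = await infer(SMOKE_LLM_NAME, { messages: [{ role: 'user', content: 'ping' }] });
expect(ok.status).toBe(200);
expect(ok.body.choices[0].message.content).toBe(CANNED_CONTENT);
expect(ok.body.choices[0].finish_reason).toBe('stop');

await registrar.stop();                  // publisher goes offline
const offline = await infer(SMOKE_LLM_NAME, { messages: [] });
expect(offline.status).toBe(503);        // immediate, no 90-s wait
```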

Docs:
- New docs/virtual-llms.md: when to use this vs creating a regular
  Llm row, how to opt-in via publish: true, the lifecycle table,
  the inference-relay sequence, the v1 streaming caveat, the v2-v5
  roadmap, and the full /api/v1/llms/_provider-* surface.
- agents.md cross-links virtual-llms.md alongside personalities/chat.
- README's Agents section gains a "Virtual LLMs" subsection.

Workspace suite: 2043/2043 (smoke files run separately). v1 closes.

Stage roadmap (each its own future PR):
  v2 wake-on-demand · v3 virtual agents · v4 LB pool · v5 task queue

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
michal merged commit 65b6b265d9 into main 2026-04-27 13:38:51 +00:00

Reference: michal/mcpctl#63