feat: virtual LLMs v2 (wake-on-demand) #65
Summary
v2 of the virtual-LLM feature. A provider with a hibernating backend (a vLLM that suspends when idle, an Ollama that exits when nothing is connected, …) declares a wake recipe in mcplocal's local config. When that provider's `isAvailable()` returns false at boot, the row is published as `status=hibernating`. The next inference request triggers a `wake` SSE task; mcplocal runs the recipe, polls until the backend comes up, then mcpd flips the row to `active` and relays the inference.
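The publish-time decision is small; a minimal sketch, assuming an entry shape with an optional `wake` block and a `registerWithMcpd` helper standing in for the real registration call (only `isAvailable()` and `initialStatus` come from this PR, the rest is illustrative):

```ts
// Sketch only: the real registrar code may differ. `registerWithMcpd` and the
// exact entry shape are assumptions for illustration.
async function publishProvider(
  entry: { name: string; wake?: unknown },
  provider: { isAvailable(): Promise<boolean> },
) {
  const up = await provider.isAvailable();
  // Only providers that declare a wake recipe may start out hibernating;
  // legacy v1 entries (no `wake` block) always publish as active.
  const initialStatus = entry.wake && !up ? "hibernating" : "active";
  await registerWithMcpd({ name: entry.name, initialStatus });
}

declare function registerWithMcpd(input: {
  name: string;
  initialStatus: "active" | "hibernating";
}): Promise<void>;
```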
Two recipe types: `http` (mcplocal sends a configured HTTP request) and `command` (mcplocal spawns a configured child process).
Concurrent infers for the same hibernating Llm share one wake task; the `wakeInFlight` map dedupes by Llm name.
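The dedup is essentially promise memoization keyed by Llm name. A rough sketch, where `wakeInFlight`, `ensureAwake`, and `runWake` are named in this PR but the bodies are placeholders:

```ts
// Every concurrent infer for the same hibernating Llm awaits the same
// in-flight wake promise instead of pushing its own wake task.
const wakeInFlight = new Map<string, Promise<void>>();

function ensureAwake(llmName: string): Promise<void> {
  let pending = wakeInFlight.get(llmName);
  if (!pending) {
    pending = runWake(llmName).finally(() => wakeInFlight.delete(llmName));
    wakeInFlight.set(llmName, pending);
  }
  return pending;
}

// Placeholder: the real runWake pushes a wake task over SSE and awaits the
// publisher's result POST (see the stage notes below).
declare function runWake(llmName: string): Promise<void>;
```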
Stages
Test plan
🤖 Generated with Claude Code
First half of v2: mcplocal can now declare a hibernating backend and respond to a `wake` task by running a configured recipe. v2 Stage 2 will wire mcpd to dispatch the wake task before relaying inference.

Config (LlmProviderFileEntry):

- New `wake` block on a published provider:

      wake:
        type: http            # or: command
        url: ...              # http only
        method: POST          # http only, default POST
        headers: {...}        # http only
        body: ...             # http only
        command: ...          # command only
        args: [...]           # command only
        maxWaitSeconds: 60    # how long to poll isAvailable() after wake fires

Registrar (mcplocal):

- At publish time, providers with a wake recipe whose isAvailable() returns false report initialStatus=hibernating to mcpd. Without a wake recipe (legacy v1), or when the backend is already up, status stays active.
- handleWakeTask: runs the recipe (HTTP request or child-process spawn), then polls isAvailable() up to maxWaitSeconds, sending a heartbeat each loop so mcpd's GC sweep doesn't time the provider out mid-boot. Reports `{ ok, ms }` on success or `{ error }` on timeout/recipe failure via the existing `_provider-task/:id/result`.
- Replaces the v1 stub that rejected wake tasks with "not implemented".

mcpd VirtualLlmService:

- RegisterProviderInput gains an optional initialStatus ('active' | 'hibernating'). The register/upsert path uses it for both new and reconnecting rows. Defaults to 'active' so v1 publishers still work unchanged.
- The provider-register route's coercer accepts the new field.

Tests: 3 new in registrar.test.ts cover initialStatus selection (hibernating when wake is configured and the backend is unavailable, active otherwise, active when no wake is configured even if the backend is unavailable). 8/8 registrar tests, 833/833 mcpd unchanged.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
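A hedged sketch of what the handleWakeTask loop described above might look like, assuming Node 18+ `fetch` and `node:child_process`; `sendHeartbeat` and `reportResult` are placeholder names for mcplocal's heartbeat and `_provider-task/:id/result` plumbing, while the recipe fields, per-loop heartbeat, and `{ ok, ms }` / `{ error }` payloads come from the stage notes:

```ts
import { spawn } from "node:child_process";

type WakeRecipe =
  | { type: "http"; url: string; method?: string; headers?: Record<string, string>; body?: string; maxWaitSeconds?: number }
  | { type: "command"; command: string; args?: string[]; maxWaitSeconds?: number };

// Sketch only: helper names are assumptions, not the real mcplocal API.
async function handleWakeTask(
  taskId: string,
  recipe: WakeRecipe,
  provider: { isAvailable(): Promise<boolean> },
) {
  const started = Date.now();
  try {
    // 1. Fire the recipe: an HTTP request or a detached child process, per config.
    if (recipe.type === "http") {
      await fetch(recipe.url, { method: recipe.method ?? "POST", headers: recipe.headers, body: recipe.body });
    } else {
      spawn(recipe.command, recipe.args ?? [], { stdio: "ignore", detached: true }).unref();
    }
    // 2. Poll isAvailable() until the backend is up, heartbeating each loop
    //    so mcpd's GC sweep doesn't reap the provider mid-boot.
    const deadline = started + (recipe.maxWaitSeconds ?? 60) * 1000;
    while (Date.now() < deadline) {
      await sendHeartbeat();
      if (await provider.isAvailable()) {
        return reportResult(taskId, { ok: true, ms: Date.now() - started });
      }
      await new Promise((r) => setTimeout(r, 1000));
    }
    await reportResult(taskId, { error: "wake timed out" });
  } catch (err) {
    await reportResult(taskId, { error: String(err) });
  }
}

declare function sendHeartbeat(): Promise<void>;
declare function reportResult(taskId: string, result: object): Promise<void>;
```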
Second half of v2: mcpd now dispatches a `wake` task on the SSE control channel when an inference request hits a row whose status=hibernating, waits for the publisher to confirm readiness, then proceeds with the infer task. Concurrent infers for the same hibernating Llm share a single wake task; the `wakeInFlight` map dedupes by Llm name.

State machine in enqueueInferTask:

- active → push the infer task immediately (existing path).
- inactive → 503, publisher offline (existing path).
- hibernating → ensureAwake() → push the infer task (new in v2).

ensureAwake/runWake (private):

- Allocates a fresh taskId on the existing PendingTask plumbing.
- Pushes `{ kind: "wake", taskId, llmName }` on the SSE handle.
- Awaits the publisher's result POST. On 2xx, flips the row to active and bumps lastHeartbeatAt, so all queued and future infers hit the active path. On non-2xx or service.failTask, the row stays hibernating (the next request retries).

Tests: 4 new in virtual-llm-service.test.ts cover the happy path (wake → infer in order), concurrent dedup (3 parallel infers, 1 wake task), wake failure surfacing to all queued infers while leaving the row hibernating, and inactive ≠ hibernating (still rejects with 503, no wake attempt). 22/22 service tests, 2050/2050 workspace.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Closes v2 (wake-on-demand). Same shape as v1's stage 6: smoke exercises the live-cluster path, docs lose the "v2 reserved" caveat and gain a full wake-recipe section.

Smoke (virtual-llm.smoke.test.ts):

- New "wake-on-demand" describe block runs alongside the v1 tests.
- Spins up a tiny in-process HTTP "wake controller"; the published provider's isAvailable() returns false until the wake POST flips the bool.
- Asserts:
  1. The provider publishes as kind=virtual / status=hibernating.
  2. The first inference triggers the wake recipe, the recipe POSTs to the controller, the provider becomes available, mcpd relays the inference, and the row settles to active.
- Cleans up the row and the wake server in afterAll.

Docs (docs/virtual-llms.md):

- The lifecycle table updates the `hibernating` description from "reserved for v2" to the actual v2 semantics.
- New "Wake-on-demand (v2)" section: configuration shapes for both recipe types (http + command), the wake-then-infer flow diagram, concurrent-infer dedup, failure semantics.
- The roadmap drops v2; v3-v5 are still listed.

Workspace: 2050/2050 (smoke runs separately; the new SSE-based wake test runs only against a live cluster, not under `pnpm test:run`).

v2 closes. v3 = virtual agents, v4 = LB pool by model, v5 = queue.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
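For flavour, the smoke test's in-process wake controller can be little more than a boolean behind `node:http`. A sketch under the assumption that the published provider's `isAvailable()` reads the same flag; variable names are illustrative, not the test's actual identifiers:

```ts
import { createServer } from "node:http";

// Sketch of the smoke test's "wake controller": the provider stays
// unavailable until something POSTs to this server, mimicking a backend
// that boots on demand.
let backendUp = false;

const wakeController = createServer((req, res) => {
  if (req.method === "POST") backendUp = true; // the wake recipe's POST lands here
  res.end();
});
wakeController.listen(0); // ephemeral port; the test would feed it into the wake recipe's url

// The published provider's availability check reads the same flag, so the
// row publishes as hibernating and settles to active after the wake POST.
const provider = {
  isAvailable: async () => backendUp,
};
```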