feat(docs+smoke): LB pool live smoke + virtual-llms.md pool semantics (v4 Stage 3)

Smoke (tests/smoke/llm-pool.smoke.test.ts): two in-process registrars
publish virtual Llms with distinct names but a shared poolName, then:

  1. /api/v1/llms/<name>/members surfaces both with the correct
     effective pool key, size, activeCount, and per-member kind/status.
  2. Chat through an agent pinned to one pool member dispatches across
     the pool — verified by running 12 calls and asserting at least
     one response from each backend (the random-shuffle selection
     would have to hit only-A or only-B in 12 fair coin flips, ~1/2048).
     A sketch of this check follows below.
  3. Failover: stop one publisher, the surviving member still serves
     chat. /members shows the stopped row as inactive immediately
     (unbindSession runs synchronously on SSE close).
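
A rough sketch of the item-2 distribution check (hypothetical helper names,
not the actual smoke harness):

```ts
// Hypothetical sketch of the distribution check (not the real smoke test).
// chatViaAgent stands in for a helper that sends one chat turn through the
// pooled agent and reports which backend produced the reply.
import { strict as assert } from "node:assert";

declare function chatViaAgent(agent: string, prompt: string): Promise<string>;

async function assertBothBackendsServe(): Promise<void> {
  const seen = new Set<string>();
  for (let i = 0; i < 12; i++) {
    seen.add(await chatViaAgent("pool-agent", "ping"));
  }
  // With a fair shuffle over 2 members, landing on only one backend across
  // 12 calls has probability 2 * (1/2)^12 = 1/2048.
  assert.equal(seen.size, 2, `expected both backends, saw: ${[...seen].join(", ")}`);
}
```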

docs/virtual-llms.md gets a full "LB pools (v4)" section with the
two-field schema model, dispatcher selection + failover semantics,
public + virtual declaration examples, list/describe rendering, the
"pin to specific instance" escape hatch, and an API surface entry
for /members. docs/agents.md cross-link extended.

Tests: full smoke 144/144 (was 141, +3 for the new pool smoke).
Stages 1-3 ship the complete v4 — public and virtual Llms can both
join pools, agents transparently load-balance across them, yaml
round-trip preserves poolName, and the existing single-Llm world
keeps working byte-identically when poolName is null.
Michal
2026-04-27 23:22:15 +01:00
parent e21f96080d
commit 137711fdf6
3 changed files with 432 additions and 5 deletions


@@ -278,10 +278,148 @@ when the agent's pinned Llm is virtual.
same agent name collide on the second register with HTTP 409. Per-publisher
namespacing is a v4+ concern — same constraint as virtual Llms in v1.
## LB pools (v4)
Two or more `Llm` rows that share a `poolName` stack into one
load-balanced pool. Agents pin to a single Llm by id; the chat
dispatcher transparently widens to "all healthy Llms with the same
effective pool key" at request time and picks one. There is no new
`LlmPool` resource — `poolName` is just an optional column on `Llm`,
so RBAC, listing, yaml round-trip, and apply all work the same way
they did pre-v4.
### Pool semantics
| Field | Behavior |
|---|---|
| `Llm.name` | Globally unique (unchanged). The apply key. |
| `Llm.poolName` | Optional. When set, declares membership. When NULL, falls back to `name` ("solo Llm, pool of 1"). |
Effective pool key = `poolName ?? name`. The dispatcher's lookup is:
```sql
SELECT * FROM Llm
WHERE poolName = $1 OR (poolName IS NULL AND name = $1)
```
So a solo Llm whose `name` happens to equal an explicit `poolName`
joins that pool — by design, an existing single-row Llm can be
promoted to "pool seed" without a rename or migration.
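The same rule in application terms, as a minimal TypeScript sketch (the row
shape and helper names are illustrative; the real dispatcher runs the SQL
above):
```ts
// Illustrative only: how the effective pool key and its membership resolve.
interface LlmRow {
  name: string;
  poolName: string | null;
}

// The key an agent's pinned Llm resolves to at dispatch time.
const effectivePoolKey = (llm: LlmRow): string => llm.poolName ?? llm.name;

// Every row that belongs to that key: explicit members, plus the solo row
// whose name equals the key (the "pool seed" promotion described above).
const poolMembers = (all: LlmRow[], key: string): LlmRow[] =>
  all.filter((l) => l.poolName === key || (l.poolName === null && l.name === key));
```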
### Selection + failover
- **Selection**: random shuffle of all members whose `status` is
`active` (or `hibernating` — VirtualLlmService handles wake on
dispatch). `inactive` members are skipped.
- **Failover** (non-streaming): if dispatch throws on the first
candidate (transport failure, virtual publisher disconnect), the
dispatcher iterates the rest of the shuffled list until one
succeeds or the list is exhausted. Auth/4xx responses are NOT
retried — siblings with the same key/model would fail identically (see the sketch below).
- **Failover** (streaming): only covers "couldn't establish stream"
failures (transport error before any chunk yielded). Once any
output has been streamed, we're committed to that backend.
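A rough TypeScript sketch of the non-streaming path described above; the
`Member` shape and `dispatchTo` helper are placeholders, not the actual
dispatcher API:
```ts
// Illustrative failover loop: shuffle healthy members, try each in turn,
// and give up immediately on 4xx-style errors since every sibling shares
// the same key/model and would fail the same way.
interface Member {
  name: string;
  status: "active" | "hibernating" | "inactive";
}

declare function dispatchTo(member: Member, request: unknown): Promise<unknown>;

function shuffle<T>(items: T[]): T[] {
  const out = [...items];
  for (let i = out.length - 1; i > 0; i--) {
    const j = Math.floor(Math.random() * (i + 1));
    [out[i], out[j]] = [out[j], out[i]];
  }
  return out;
}

async function dispatchWithFailover(members: Member[], request: unknown): Promise<unknown> {
  // inactive members are skipped; hibernating ones are assumed to wake on dispatch.
  const candidates = shuffle(members.filter((m) => m.status !== "inactive"));
  let lastError: unknown;
  for (const member of candidates) {
    try {
      return await dispatchTo(member, request);
    } catch (err) {
      // Assumed error shape: a 4xx status means the request itself is bad
      // (auth, validation), so retrying another member would fail identically.
      const status = (err as { status?: number }).status;
      if (status !== undefined && status >= 400 && status < 500) throw err;
      lastError = err; // transport failure or publisher disconnect: try the next member
    }
  }
  throw lastError ?? new Error("no healthy pool members");
}
```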
### Declaring a pool
#### Public Llms
```bash
mcpctl create llm prod-qwen-1 --type openai --model qwen3-thinking \
--url https://prod-1.example.com --pool-name qwen-pool \
--api-key-ref qwen-key/API_KEY
mcpctl create llm prod-qwen-2 --type openai --model qwen3-thinking \
--url https://prod-2.example.com --pool-name qwen-pool \
--api-key-ref qwen-key/API_KEY
```
Or via apply (yaml round-trip preserves `poolName`):
```yaml
---
kind: llm
name: prod-qwen-1
type: openai
model: qwen3-thinking
url: https://prod-1.example.com
poolName: qwen-pool
apiKeyRef: { name: qwen-key, key: API_KEY }
---
kind: llm
name: prod-qwen-2
type: openai
model: qwen3-thinking
url: https://prod-2.example.com
poolName: qwen-pool
apiKeyRef: { name: qwen-key, key: API_KEY }
```
#### Virtual Llms (mcplocal-published)
```jsonc
// ~/.mcpctl/config.json
{
  "llm": {
    "providers": [
      {
        "name": "vllm-alice-qwen3",              // unique per publisher
        "type": "vllm-managed",
        "model": "Qwen/Qwen2.5-7B-Instruct-AWQ",
        "venvPath": "~/vllm_env",
        "publish": true,
        "poolName": "user-vllm-qwen3-thinking"   // shared pool key
      }
    ]
  }
}
```
Each user's mcplocal picks a unique `name` (e.g. include the hostname
to guarantee no collisions) but shares the `poolName`. Agents pinned
to any single member — or to `qwen3-thinking` (the public LiteLLM
endpoint, also given `poolName: user-vllm-qwen3-thinking` if mixing
public + virtual is desired) — see one logical pool that auto-grows
as more workers come online.
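For instance, a publisher could derive its unique provider `name` from the
hostname when generating its config (a hypothetical snippet; mcplocal does
not do this for you):
```ts
import { hostname } from "node:os";

// Hypothetical: every machine gets a collision-free provider name while all
// of them share the same poolName, so the pool aggregates across publishers.
const providerName = `vllm-${hostname()}-qwen3`;
const poolName = "user-vllm-qwen3-thinking";
```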
### Listing + describe
The `mcpctl get llm` table has a `POOL` column right after `NAME`.
Solo rows render as `-`; pool members show their explicit pool key:
```
NAME            POOL       KIND    STATUS  TYPE    MODEL           ID
qwen3-thinking  -          public  active  openai  qwen3-thinking  cmo...
prod-qwen-1     qwen-pool  public  active  openai  qwen3-thinking  cmo...
prod-qwen-2     qwen-pool  public  active  openai  qwen3-thinking  cmo...
```
`mcpctl describe llm <name>` adds a `Pool:` block at the top when the
row is in an explicit pool OR when its implicit pool has size > 1:
```
Pool:
  Pool name: qwen-pool
  Members:   2 (2 active)
  - prod-qwen-1 [public/active]  ← this row
  - prod-qwen-2 [public/active]
```
`GET /api/v1/llms/<name>/members` is the API surface — returns full
`LlmView`s for every member plus aggregate `size` / `activeCount` so
operator tooling doesn't need a second roundtrip.
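A sketch of how operator tooling might call `/members` (the response fields
follow the API surface listing later in this doc; the client code itself and
the member subset shown are illustrative):
```ts
// Illustrative client for GET /api/v1/llms/<name>/members. Each member is a
// full LlmView; only a few fields are typed here for brevity.
interface PoolMembersResponse {
  poolName: string;                // effective pool key (poolName ?? name)
  explicitPoolName: string | null; // null for a solo "pool of 1"
  size: number;
  activeCount: number;
  members: Array<{ name: string; kind: string; status: string }>;
}

async function getPoolMembers(baseUrl: string, llmName: string): Promise<PoolMembersResponse> {
  const res = await fetch(`${baseUrl}/api/v1/llms/${encodeURIComponent(llmName)}/members`);
  if (!res.ok) throw new Error(`members lookup failed: HTTP ${res.status}`);
  return (await res.json()) as PoolMembersResponse;
}
```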
### Pinning to a specific instance
To pin an agent to one specific instance (e.g. for debugging,
RBAC-scoped routing, or "this agent must hit this model with this
key"), give that instance a unique name and leave its `poolName`
unset. The agent's pool is then size 1 and dispatch is deterministic.
Pool membership is opt-in via `poolName` — the default behavior is
single-Llm.
## Roadmap (later stages)
- **v4 — LB pool by model**: agents can target a model name instead of
a specific Llm; mcpd picks the healthiest pool member per request.
- **v5 — Task queue**: persisted requests for hibernating/saturated
pools. Workers pull tasks of their model when they come online.
@@ -301,8 +439,12 @@ POST /api/v1/llms/_provider-task/:id/result
{ chunk: { data, done? } }
{ status, body }
GET /api/v1/llms → list (includes kind, status, lastHeartbeatAt, inactiveSince)
POST /api/v1/llms/<virtual>/infer → routes through the SSE relay
GET /api/v1/llms → list (includes kind, status, lastHeartbeatAt, inactiveSince, poolName)
GET /api/v1/llms/<name> → single Llm row (also accepts a CUID id)
GET /api/v1/llms/<name>/members → v4: pool members for the effective pool key:
{ poolName, explicitPoolName, size, activeCount, members[] }
POST /api/v1/llms/<virtual>/infer → routes through the SSE relay (v4: dispatcher
also expands by poolName when set)
DELETE /api/v1/llms/<virtual> → delete unconditionally (also runs GC's job)
GET /api/v1/agents → list (v3: includes kind, status, lastHeartbeatAt, inactiveSince)
```