Virtual LLMs

A virtual LLM is an Llm row in mcpd that's registered by an mcplocal client rather than created by hand with mcpctl create llm. Inference for a virtual LLM is relayed back through the publishing mcplocal's SSE control channel — mcpd never needs to know the local URL or hold its API key.

When the publishing mcplocal goes away (or the user shuts down their laptop) the row decays: active → inactive after 90 s without a heartbeat, then deleted after 4 h of inactivity. A reconnecting mcplocal adopts the same row using a sticky providerSessionId it persisted at first publish.

When to use this

  • Local model on a developer laptop that you want everyone on the team to be able to chat with via mcpctl chat-llm <name>. The model doesn't need to be reachable from mcpd's k8s pods — only the user's mcplocal does (which is already the case because mcplocal pulls projects from mcpd over HTTPS).
  • Hibernating models that wake on demand (v2 — see "Roadmap").
  • Pool of identical models distributed across user laptops, eligible for load balancing (v4).

If your model is reachable from mcpd's k8s pods over LAN/VPN, you don't need a virtual LLM — just mcpctl create llm <name> --type openai --url … and you're done.

Publishing a local provider

mcplocal's local config (~/.mcpctl/config.json) gains a publish: true opt-in per provider:

{
  "llm": {
    "providers": [
      {
        "name": "vllm-local",
        "type": "openai",
        "model": "Qwen/Qwen2.5-7B-Instruct-AWQ",
        "url": "http://127.0.0.1:8000/v1",
        "tier": "fast",
        "publish": true
      }
    ]
  }
}

Restart mcplocal:

systemctl --user restart mcplocal

The registrar:

  1. Reads ~/.mcpctl/credentials for mcpdUrl + bearer token.
  2. POSTs to /api/v1/llms/_provider-register with the publishable set.
  3. Persists the returned providerSessionId to ~/.mcpctl/provider-session so the next restart adopts the same mcpd row.
  4. Opens the SSE channel at /api/v1/llms/_provider-stream.
  5. Heartbeats every 30 s.
  6. Listens for event: task frames and runs them against the local LlmProvider.

If ~/.mcpctl/credentials doesn't exist (e.g. you haven't run mcpctl auth login), the registrar logs a warning and skips — publishing is a best-effort feature, not a boot blocker.
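The sequence above can be sketched as follows. The endpoint paths are the ones listed in this doc, but the function shape, the register body, and sending the prior session id for adoption are assumptions, not the real registrar:

```typescript
import { readFile, writeFile } from "node:fs/promises";

async function startRegistrar(homeDir: string): Promise<void> {
  // 1. Credentials are optional — missing creds means "skip publish",
  //    never "block boot".
  let creds: { mcpdUrl: string; token: string };
  try {
    creds = JSON.parse(await readFile(`${homeDir}/.mcpctl/credentials`, "utf8"));
  } catch {
    console.warn("no ~/.mcpctl/credentials; skipping provider publish");
    return;
  }

  // 2. Register the publishable set, adopting a prior session if one
  //    was persisted (assumed: the id rides in the register body).
  const prevSession = await readFile(
    `${homeDir}/.mcpctl/provider-session`, "utf8",
  ).catch(() => null);
  const res = await fetch(`${creds.mcpdUrl}/api/v1/llms/_provider-register`, {
    method: "POST",
    headers: {
      authorization: `Bearer ${creds.token}`,
      "content-type": "application/json",
    },
    body: JSON.stringify({ providers: [], providerSessionId: prevSession }),
  });
  const { providerSessionId } = await res.json();

  // 3. Persist the session id so the next restart adopts the same row.
  await writeFile(`${homeDir}/.mcpctl/provider-session`, providerSessionId);

  // 4–6. Open the SSE channel, heartbeat every 30 s, run task frames —
  // elided here; see "Inference relay" below.
}
```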

Verifying

$ mcpctl get llm
NAME             KIND     STATUS    TYPE     MODEL                          TIER  KEY                                 ID
qwen3-thinking   public   active    openai   qwen3-thinking                 fast  secret://litellm-key/API_KEY        cmofx8y7u…
vllm-local       virtual  active    openai   Qwen/Qwen2.5-7B-Instruct-AWQ   fast  -                                   cmoxz12ab…

$ mcpctl chat-llm vllm-local
─────────────────────────────────────────────────────────
LLM: vllm-local  openai → Qwen/Qwen2.5-7B-Instruct-AWQ
Kind: virtual    Status: active
─────────────────────────────────────────────────────────
> hello?
Hi! …

You can also chat with public LLMs the same way:

$ mcpctl chat-llm qwen3-thinking

The CLI doesn't care about kind — mcpd's /api/v1/llms/<name>/infer route branches on it server-side.

Lifecycle in detail

State         What it means
active        Heartbeat received within the last 90 s and the SSE channel is open.
inactive      Either the SSE closed or the heartbeat watchdog tripped. Inference returns 503.
hibernating   Publisher is online but the backend is asleep; the next inference triggers a wake task before relaying.

Two timers on mcpd run the GC sweep:

  • 90 s without a heartbeat → flip active → inactive.
  • 4 h in inactive → delete the row entirely.

A reconnecting mcplocal with the same providerSessionId revives every inactive row it owns; it only orphans rows that fell past the 4-h cutoff.
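A minimal sketch of that two-timer sweep, assuming illustrative field names (the real mcpd columns may differ):

```typescript
interface VirtualLlmRow {
  name: string;
  status: "active" | "inactive" | "hibernating";
  lastHeartbeatAt: number;      // epoch ms of the last heartbeat
  inactiveSince: number | null; // epoch ms when the row went inactive
}

const HEARTBEAT_STALE_MS = 90_000;              // 90 s → inactive
const DELETE_AFTER_MS = 4 * 60 * 60 * 1000;     // 4 h inactive → delete

// One sweep pass: flip stale-heartbeat rows to inactive, then drop
// rows that have sat inactive past the 4-hour cutoff.
function gcSweep(rows: VirtualLlmRow[], now: number): VirtualLlmRow[] {
  return rows
    .map((r) =>
      r.status === "active" && now - r.lastHeartbeatAt > HEARTBEAT_STALE_MS
        ? { ...r, status: "inactive" as const, inactiveSince: now }
        : r,
    )
    .filter(
      (r) =>
        !(
          r.status === "inactive" &&
          r.inactiveSince !== null &&
          now - r.inactiveSince > DELETE_AFTER_MS
        ),
    );
}
```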

Inference relay

When mcpd receives POST /api/v1/llms/<virtual>/infer:

  1. Look up the row, see kind=virtual + status=active.
  2. Find the open SSE session for that providerSessionId. Missing session → 503.
  3. Push a { kind: "infer", taskId, llmName, request, streaming } task frame onto the SSE.
  4. mcplocal pulls, calls LlmProvider.complete(...), and POSTs the result back to /api/v1/llms/_provider-task/<taskId>/result:
    • non-streaming: { status: 200, body: <chat.completion> }
    • streaming: per-chunk { chunk: { data, done? } }
    • failure: { error: "..." }
  5. mcpd forwards the result/chunks out to the original caller.

v1 caveat — streaming granularity: LlmProvider.complete() returns a finalized CompletionResult, not a token stream. Streaming requests therefore arrive at the caller as a single delta + [DONE]. Real per-token streaming is a v2 concern.
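The relay bookkeeping on the mcpd side can be sketched like this — TaskRelay and its method names are illustrative; only the frame shape and the result body come from this doc:

```typescript
type TaskFrame = {
  kind: "infer";
  taskId: string;
  llmName: string;
  request: unknown;
  streaming: boolean;
};

class TaskRelay {
  private pending = new Map<
    string,
    { resolve: (body: unknown) => void; reject: (err: Error) => void }
  >();
  private nextId = 0;

  constructor(private push: (frame: TaskFrame) => void) {}

  // POST /api/v1/llms/<virtual>/infer lands here: allocate a taskId,
  // park the caller's promise, push the frame onto the SSE.
  infer(llmName: string, request: unknown, streaming = false): Promise<unknown> {
    const taskId = String(this.nextId++);
    return new Promise((resolve, reject) => {
      this.pending.set(taskId, { resolve, reject });
      this.push({ kind: "infer", taskId, llmName, request, streaming });
    });
  }

  // POST /api/v1/llms/_provider-task/<taskId>/result lands here:
  // settle the parked promise with the body or the error.
  onResult(taskId: string, result: { status?: number; body?: unknown; error?: string }): void {
    const entry = this.pending.get(taskId);
    if (!entry) return; // unknown/expired task: drop
    this.pending.delete(taskId);
    if (result.error) entry.reject(new Error(result.error));
    else entry.resolve(result.body);
  }
}
```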

Wake-on-demand (v2)

A provider whose backend hibernates (a vLLM instance that suspends when idle, an Ollama daemon that exits when nothing's connected, …) can declare a wake recipe in mcplocal config. When that provider's isAvailable() returns false at registrar startup, the row is published as status=hibernating. The next inference request that hits the row triggers the recipe and waits for the backend to come up before relaying.

Two recipe types:

// HTTP — POST to a "wake controller" that starts the backend out of band.
{
  "name": "vllm-local",
  "type": "openai",
  "model": "...",
  "publish": true,
  "wake": {
    "type": "http",
    "url": "http://10.0.0.50:9090/wake/vllm",
    "method": "POST",
    "headers": { "Authorization": "Bearer ..." },
    "maxWaitSeconds": 60
  }
}
// command — spawn a local process (systemd, wakeonlan, custom script).
{
  "name": "vllm-local",
  "type": "openai",
  "model": "...",
  "publish": true,
  "wake": {
    "type": "command",
    "command": "/usr/local/bin/start-vllm",
    "args": ["--profile", "qwen3"],
    "maxWaitSeconds": 120
  }
}

How a request flows when the row is hibernating:

client → mcpd POST /api/v1/llms/<name>/infer
         mcpd: status === hibernating → push wake task on SSE
         mcplocal: receive wake task → run recipe → poll isAvailable()
                   → heartbeat each tick → POST { ok: true } back
         mcpd: flip row → active, push the original infer task
         mcplocal: run inference → POST result back
mcpd → client (forwards the inference result)

Concurrent infers for the same hibernating Llm share a single wake task — only the first request triggers the recipe; later ones await the same in-flight wake promise. After the wake settles, every queued infer dispatches in order.

If the recipe fails (HTTP non-2xx, command exits non-zero, or the provider doesn't come up within maxWaitSeconds), every queued infer is rejected with a clear error and the row stays hibernating — the next request gets a fresh wake attempt.
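The single-in-flight-wake behavior is essentially a coalescing gate. A sketch under assumed names (WakeGate and runRecipe are not the real API):

```typescript
class WakeGate {
  private inflight: Promise<void> | null = null;

  constructor(private runRecipe: () => Promise<void>) {}

  // Every queued infer awaits the same promise; only the first caller
  // actually triggers the recipe. A failed wake clears the slot so the
  // next request gets a fresh attempt, matching the doc's semantics.
  wake(): Promise<void> {
    if (!this.inflight) {
      this.inflight = this.runRecipe().catch((err) => {
        this.inflight = null;
        throw err;
      });
    }
    return this.inflight;
  }
}
```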

Virtual agents (v3)

Virtual agents extend the same publishing model to agents — named LLM personas with their own system prompt and sampling defaults. mcplocal declares them in its config alongside its providers, and the existing _provider-register endpoint atomically publishes both Llms and Agents in one round-trip. They show up under mcpctl get agent next to manually-created public agents and become chat-able via mcpctl chat <agent> — no special command.

Declaring a virtual agent in mcplocal config

// ~/.mcpctl/config.json
{
  "llm": {
    "providers": [
      { "name": "vllm-local", "type": "vllm", "model": "Qwen/Qwen2.5-7B-Instruct-AWQ", "publish": true }
    ]
  },
  "agents": [
    {
      "name": "local-coder",
      "llm": "vllm-local",
      "description": "Local coding assistant on the workstation GPU",
      "systemPrompt": "You are a senior engineer. Be terse.",
      "defaultParams": { "temperature": 0.2 }
    }
  ]
}

llm references a published provider's name from the same config. Agents pinned to a name that isn't being published are still forwarded to mcpd — the server validates llmName and 404s with a clear message if it's genuinely missing, which lets you point at a public Llm if you want.

Lifecycle

Same shape as virtual Llms — 30 s heartbeat from mcplocal, 90 s heartbeat-stale → status flips to inactive, 4 h inactive → row deleted by mcpd's GC sweep. Heartbeats cover both Llms and Agents owned by the session.

The GC orders agent deletes before their pinned virtual Llm so the Agent.llmId onDelete: Restrict FK doesn't block the sweep.

Listing

$ mcpctl get agents
NAME          KIND     STATUS    LLM             PROJECT             DESCRIPTION
local-coder   virtual  active    vllm-local      -                   Local coding assistant on…
reviewer      public   active    qwen3-thinking  mcpctl-development  I review what you're shipping…

The KIND and STATUS columns are the v3 additions. Round-tripping through mcpctl get agent X -o yaml | mcpctl apply -f - strips those runtime fields cleanly so a virtual agent can be re-declared as a public one (or vice versa) without manual editing.

Chatting

$ mcpctl chat local-coder
> hello?
… streams through mcpd → SSE → mcplocal's vllm-local provider …

Same command as for public agents. Works because chat.service has a kind=virtual branch that hands off to VirtualLlmService.enqueueInferTask when the agent's pinned Llm is virtual.

Cluster-wide name uniqueness

Agent.name is unique cluster-wide. Two mcplocals trying to publish the same agent name collide on the second register with HTTP 409. Per-publisher namespacing is a v4+ concern — same constraint as virtual Llms in v1.

LB pools (v4)

Two or more Llm rows that share a poolName stack into one load-balanced pool. Agents pin to a single Llm by id; the chat dispatcher transparently widens to "all healthy Llms with the same effective pool key" at request time and picks one. There is no new LlmPool resource — poolName is just an optional column on Llm, so RBAC, listing, yaml round-trip, and apply all work the same way they did pre-v4.

Pool semantics

Field          Behavior
Llm.name       Globally unique (unchanged). The apply key.
Llm.poolName   Optional. When set, declares membership. When NULL, falls back to name ("solo Llm, pool of 1").

Effective pool key = poolName ?? name. The dispatcher's lookup is:

SELECT * FROM Llm
WHERE poolName = $1 OR (poolName IS NULL AND name = $1)

So a solo Llm whose name happens to equal an explicit poolName joins that pool — by design, an existing single-row Llm can be promoted to "pool seed" without a rename or migration.
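The same lookup, sketched in-process — LlmRow here is a minimal illustrative shape, not the real Prisma model:

```typescript
interface LlmRow {
  name: string;
  poolName: string | null;
}

// Effective pool key = poolName ?? name: a NULL poolName makes the
// row a pool of one keyed by its own name.
function effectivePoolKey(row: LlmRow): string {
  return row.poolName ?? row.name;
}

// In-memory equivalent of the SQL above: explicit members plus any
// solo row whose name equals the key.
function poolMembers(rows: LlmRow[], key: string): LlmRow[] {
  return rows.filter(
    (r) => r.poolName === key || (r.poolName === null && r.name === key),
  );
}
```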

Selection + failover

  • Selection: random shuffle of all members whose status is active (or hibernating — VirtualLlmService handles wake on dispatch). inactive members are skipped.
  • Failover (non-streaming): if dispatch throws on the first candidate (transport failure, virtual publisher disconnect), the dispatcher iterates the rest of the shuffled list until one succeeds or the list is exhausted. Auth/4xx responses are NOT retried — siblings with the same key/model would fail identically.
  • Failover (streaming): only covers "couldn't establish stream" failures (transport error before any chunk yielded). Once any output has been streamed, we're committed to that backend.
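The non-streaming path above can be sketched as follows. Member and dispatchOnce are illustrative names, and the real dispatcher also short-circuits on auth/4xx responses, which this sketch elides:

```typescript
interface Member {
  name: string;
  status: "active" | "inactive" | "hibernating";
}

// Fisher–Yates shuffle: uniform random member order per request.
function shuffle<T>(xs: T[]): T[] {
  const out = [...xs];
  for (let i = out.length - 1; i > 0; i--) {
    const j = Math.floor(Math.random() * (i + 1));
    [out[i], out[j]] = [out[j], out[i]];
  }
  return out;
}

// Walk the shuffled non-inactive members until one dispatch succeeds;
// rethrow only when every candidate has failed.
async function dispatchWithFailover<R>(
  members: Member[],
  dispatchOnce: (m: Member) => Promise<R>,
): Promise<R> {
  const healthy = shuffle(members.filter((m) => m.status !== "inactive"));
  let lastErr: unknown = new Error("no healthy pool members");
  for (const m of healthy) {
    try {
      return await dispatchOnce(m);
    } catch (err) {
      lastErr = err; // transport-level failure: try the next member
    }
  }
  throw lastErr;
}
```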

Declaring a pool

Public Llms

mcpctl create llm prod-qwen-1 --type openai --model qwen3-thinking \
  --url https://prod-1.example.com --pool-name qwen-pool \
  --api-key-ref qwen-key/API_KEY

mcpctl create llm prod-qwen-2 --type openai --model qwen3-thinking \
  --url https://prod-2.example.com --pool-name qwen-pool \
  --api-key-ref qwen-key/API_KEY

Or via apply (yaml round-trip preserves poolName):

---
kind: llm
name: prod-qwen-1
type: openai
model: qwen3-thinking
url: https://prod-1.example.com
poolName: qwen-pool
apiKeyRef: { name: qwen-key, key: API_KEY }
---
kind: llm
name: prod-qwen-2
type: openai
model: qwen3-thinking
url: https://prod-2.example.com
poolName: qwen-pool
apiKeyRef: { name: qwen-key, key: API_KEY }

Virtual Llms (mcplocal-published)

// ~/.mcpctl/config.json
{
  "llm": {
    "providers": [
      {
        "name": "vllm-alice-qwen3",       // unique per publisher
        "type": "vllm-managed",
        "model": "Qwen/Qwen2.5-7B-Instruct-AWQ",
        "venvPath": "~/vllm_env",
        "publish": true,
        "poolName": "user-vllm-qwen3-thinking"   // shared pool key
      }
    ]
  }
}

Each user's mcplocal picks a unique name (e.g. include the hostname to guarantee no collisions) but shares the poolName. Agents pinned to any single member — or to qwen3-thinking (the public LiteLLM endpoint, also given poolName: user-vllm-qwen3-thinking if mixing public + virtual is desired) — see one logical pool that auto-grows as more workers come online.
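One illustrative way a config generator could derive that unique per-host name — nothing here is part of mcplocal, it's just a sketch of the naming convention:

```typescript
import { hostname } from "node:os";

// e.g. base "vllm-qwen3" on host "alice-laptop" → "vllm-qwen3-alice-laptop";
// the hostname is lowercased and sanitized to name-safe characters.
function perHostProviderName(base: string): string {
  const host = hostname().toLowerCase().replace(/[^a-z0-9-]/g, "-");
  return `${base}-${host}`;
}
```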

Listing + describe

The mcpctl get llm table has a POOL column right after NAME. Solo rows render as -; pool members show their explicit pool key:

NAME              POOL          KIND     STATUS    TYPE    MODEL                           ID
qwen3-thinking    -             public   active    openai  qwen3-thinking                  cmo...
prod-qwen-1       qwen-pool     public   active    openai  qwen3-thinking                  cmo...
prod-qwen-2       qwen-pool     public   active    openai  qwen3-thinking                  cmo...

mcpctl describe llm <name> adds a Pool: block at the top when the row is in an explicit pool OR when its implicit pool has size > 1:

Pool:
  Pool name:    qwen-pool
  Members:      2 (2 active)
    - prod-qwen-1  [public/active]  ← this row
    - prod-qwen-2  [public/active]

GET /api/v1/llms/<name>/members is the API surface — returns full LlmViews for every member plus aggregate size / activeCount so operator tooling doesn't need a second roundtrip.
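A hypothetical operator-tooling call against that surface — the path and response fields are from this doc, but the helper itself is illustrative:

```typescript
// Fetch pool membership for an Llm in one roundtrip. Response shape
// per this doc: { poolName, explicitPoolName, size, activeCount, members[] }.
async function getPoolMembers(baseUrl: string, token: string, name: string) {
  const res = await fetch(
    `${baseUrl}/api/v1/llms/${encodeURIComponent(name)}/members`,
    { headers: { authorization: `Bearer ${token}` } },
  );
  if (!res.ok) throw new Error(`members lookup failed: ${res.status}`);
  return res.json() as Promise<{
    poolName: string;
    explicitPoolName: string | null;
    size: number;
    activeCount: number;
    members: unknown[];
  }>;
}
```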

Pinning to a specific instance

To pin an agent to one specific instance (e.g. for debugging, RBAC-scoped routing, or "this agent must hit this model with this key"), give that instance a unique name and leave its poolName unset. The agent's pool is then size 1 and dispatch is deterministic. Pool membership is opt-in via poolName — the default behavior is single-Llm.

Roadmap (later stages)

  • v5 — Task queue: persisted requests for hibernating/saturated pools. Workers pull tasks of their model when they come online.

API surface (v1)

POST  /api/v1/llms/_provider-register      → returns { providerSessionId, llms[], agents[] }
                                              v3: body accepts an optional `agents[]` array
                                              alongside `providers[]`. Atomic publish; older
                                              clients (providers-only) keep working.
GET   /api/v1/llms/_provider-stream        → SSE channel; require x-mcpctl-provider-session header
POST  /api/v1/llms/_provider-heartbeat     → { providerSessionId } — bumps both Llms and Agents
                                              owned by the session
POST  /api/v1/llms/_provider-task/:id/result
                                           → one of:
                                             { error: "msg" }
                                             { chunk: { data, done? } }
                                             { status, body }

GET   /api/v1/llms                         → list (includes kind, status, lastHeartbeatAt, inactiveSince, poolName)
GET   /api/v1/llms/<name>                  → single Llm row (also accepts a CUID id)
GET   /api/v1/llms/<name>/members          → v4: pool members for the effective pool key:
                                              { poolName, explicitPoolName, size, activeCount, members[] }
POST  /api/v1/llms/<virtual>/infer         → routes through the SSE relay (v4: dispatcher
                                              also expands by poolName when set)
DELETE /api/v1/llms/<virtual>              → delete unconditionally (also runs GC's job)
GET   /api/v1/agents                       → list (v3: includes kind, status, lastHeartbeatAt, inactiveSince)

RBAC piggybacks on view/edit/create:llms — no new resource. Publishing a virtual LLM is morally a create:llms operation.

See also

  • agents.md — what an Agent is and how it pins to an LLM.
  • chat.md — mcpctl chat <agent> (full agent flow).
  • mcpctl chat-llm <name> (this doc) — the stateless counterpart for raw LLM chat.