Virtual LLMs

A virtual LLM is an Llm row in mcpd that's registered by an mcplocal client rather than created by hand with mcpctl create llm. Inference for a virtual LLM is relayed back through the publishing mcplocal's SSE control channel — mcpd never needs to know the local URL or hold its API key.

When the publishing mcplocal goes away (or the user shuts down their laptop) the row decays: active → inactive after 90 s without a heartbeat, then deleted after 4 h of inactivity. A reconnecting mcplocal adopts the same row using a sticky providerSessionId it persisted at first publish.

When to use this

  • Local model on a developer laptop that you want everyone on the team to be able to chat with via mcpctl chat-llm <name>. The model doesn't need to be reachable from mcpd's k8s pods — only the user's mcplocal does (which is already the case because mcplocal pulls projects from mcpd over HTTPS).
  • Hibernating models that wake on demand (v2 — see "Wake-on-demand" below).
  • Pool of identical models distributed across user laptops, eligible for load balancing (v4 — see "LB pools" below).

If your model is reachable from mcpd's k8s pods over LAN/VPN, you don't need a virtual LLM — just mcpctl create llm <name> --type openai --url … and you're done.

Publishing a local provider

mcplocal's local config (~/.mcpctl/config.json) gains a publish: true opt-in per provider:

{
  "llm": {
    "providers": [
      {
        "name": "vllm-local",
        "type": "openai",
        "model": "Qwen/Qwen2.5-7B-Instruct-AWQ",
        "url": "http://127.0.0.1:8000/v1",
        "tier": "fast",
        "publish": true
      }
    ]
  }
}

Restart mcplocal:

systemctl --user restart mcplocal

The registrar:

  1. Reads ~/.mcpctl/credentials for mcpdUrl + bearer token.
  2. POSTs to /api/v1/llms/_provider-register with the publishable set.
  3. Persists the returned providerSessionId to ~/.mcpctl/provider-session so the next restart adopts the same mcpd row.
  4. Opens the SSE channel at /api/v1/llms/_provider-stream.
  5. Heartbeats every 30 s.
  6. Listens for event: task frames and runs them against the local LlmProvider.

If ~/.mcpctl/credentials doesn't exist (e.g. you haven't run mcpctl auth login), the registrar logs a warning and skips — publishing is a best-effort feature, not a boot blocker.
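
For orientation, the same startup flow condensed into a TypeScript sketch. The endpoints and file paths are the ones listed above; openSseChannel is a hypothetical stand-in for mcplocal's real SSE plumbing:

// Condensed registrar startup (a sketch, not mcplocal's actual code).
import { readFile, writeFile } from 'node:fs/promises';

function openSseChannel(mcpdUrl: string, providerSessionId: string) {
  // stub: subscribe to /api/v1/llms/_provider-stream and handle `event: task` frames
}

async function startRegistrar(providers: unknown[]) {
  let creds: { mcpdUrl: string; token: string };
  try {
    creds = JSON.parse(await readFile(`${process.env.HOME}/.mcpctl/credentials`, 'utf8'));
  } catch {
    console.warn('no ~/.mcpctl/credentials, skipping publish'); // best-effort, not a boot blocker
    return;
  }
  const res = await fetch(`${creds.mcpdUrl}/api/v1/llms/_provider-register`, {
    method: 'POST',
    headers: { authorization: `Bearer ${creds.token}`, 'content-type': 'application/json' },
    body: JSON.stringify({ providers }),
  });
  const { providerSessionId } = await res.json();
  // persist so the next restart adopts the same mcpd row
  await writeFile(`${process.env.HOME}/.mcpctl/provider-session`, providerSessionId);
  openSseChannel(creds.mcpdUrl, providerSessionId);
  setInterval(() => fetch(`${creds.mcpdUrl}/api/v1/llms/_provider-heartbeat`, {
    method: 'POST',
    headers: { authorization: `Bearer ${creds.token}`, 'content-type': 'application/json' },
    body: JSON.stringify({ providerSessionId }),
  }), 30_000); // heartbeat every 30 s
}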

Verifying

$ mcpctl get llm
NAME             KIND     STATUS    TYPE     MODEL                          TIER  KEY                                 ID
qwen3-thinking   public   active    openai   qwen3-thinking                 fast  secret://litellm-key/API_KEY        cmofx8y7u…
vllm-local       virtual  active    openai   Qwen/Qwen2.5-7B-Instruct-AWQ   fast  -                                   cmoxz12ab…

$ mcpctl chat-llm vllm-local
─────────────────────────────────────────────────────────
LLM: vllm-local  openai → Qwen/Qwen2.5-7B-Instruct-AWQ
Kind: virtual    Status: active
─────────────────────────────────────────────────────────
> hello?
Hi! …

You can also chat with public LLMs the same way:

$ mcpctl chat-llm qwen3-thinking

The CLI doesn't care about kind — mcpd's /api/v1/llms/<name>/infer route branches on it server-side.
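
A sketch of that branch, with assumed shapes — only enqueueInferTask is a name that appears later in this doc, the rest is illustrative:

// Rough shape of the server-side kind branch (sketch).
type Llm = { kind: 'public' | 'virtual'; name: string };
declare const virtualLlmService: { enqueueInferTask(llm: Llm, req: unknown): Promise<unknown> };
declare function directHttpInfer(llm: Llm, req: unknown): Promise<unknown>;

async function infer(llm: Llm, req: unknown) {
  return llm.kind === 'virtual'
    ? virtualLlmService.enqueueInferTask(llm, req) // relay over the publisher's SSE channel
    : directHttpInfer(llm, req);                   // public row: call its configured url directly
}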

Lifecycle in detail

State         What it means
active        Heartbeat received within the last 90 s and the SSE channel is open.
inactive      Either the SSE closed or the heartbeat watchdog tripped. Inference returns 503.
hibernating   Publisher is online but the backend is asleep; the next inference triggers a wake task before relaying.

Two timers on mcpd run the GC sweep:

  • 90 s without a heartbeat → flip active → inactive.
  • 4 h in inactive → delete the row entirely.

A reconnecting mcplocal with the same providerSessionId revives every inactive row it owns; it only orphans rows that fell past the 4-h cutoff.
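
A hypothetical version of the sweep, assuming a prisma-style client and the lastHeartbeatAt / inactiveSince columns the list API exposes; mcpd's actual GC may differ:

// Sketch of the two GC thresholds (assumed client and column names).
const STALE_MS  = 90_000;           // active → inactive
const DELETE_MS = 4 * 60 * 60_000;  // inactive → deleted

async function gcSweep(db: any, now = Date.now()) {
  await db.llm.updateMany({
    where: { kind: 'virtual', status: 'active', lastHeartbeatAt: { lt: new Date(now - STALE_MS) } },
    data: { status: 'inactive', inactiveSince: new Date(now) },
  });
  await db.llm.deleteMany({
    where: { kind: 'virtual', status: 'inactive', inactiveSince: { lt: new Date(now - DELETE_MS) } },
  });
}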

Inference relay

When mcpd receives POST /api/v1/llms/<virtual>/infer:

  1. Look up the row, see kind=virtual + status=active.
  2. Find the open SSE session for that providerSessionId. Missing session → 503.
  3. Push a { kind: "infer", taskId, llmName, request, streaming } task frame onto the SSE.
  4. mcplocal pulls, calls LlmProvider.complete(...), and POSTs the result back to /api/v1/llms/_provider-task/<taskId>/result:
    • non-streaming: { status: 200, body: <chat.completion> }
    • streaming: per-chunk { chunk: { data, done? } }
    • failure: { error: "..." }
  5. mcpd forwards the result/chunks out to the original caller.

v1 caveat — streaming granularity: LlmProvider.complete() returns a finalized CompletionResult, not a token stream. Streaming requests therefore arrive at the caller as a single delta + [DONE]. Real per-token streaming is a v2 concern.
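
The payload shapes implied by those steps, as TypeScript types — field names beyond the ones quoted above are assumptions:

// Task frame pushed by mcpd onto the SSE (step 3).
type InferTaskFrame = {
  kind: 'infer';
  taskId: string;
  llmName: string;
  request: unknown;      // the caller's original request body
  streaming: boolean;
};

// POSTed back by mcplocal to /api/v1/llms/_provider-task/<taskId>/result (step 4).
type TaskResult =
  | { status: number; body: unknown }           // non-streaming completion
  | { chunk: { data: string; done?: boolean } } // one streamed chunk
  | { error: string };                          // failure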

Wake-on-demand (v2)

A provider whose backend hibernates (a vLLM instance that suspends when idle, an Ollama daemon that exits when nothing's connected, …) can declare a wake recipe in mcplocal config. When that provider's isAvailable() returns false at registrar startup, the row is published as status=hibernating. The next inference request that hits the row triggers the recipe and waits for the backend to come up before relaying.

Two recipe types:

// HTTP — POST to a "wake controller" that starts the backend out of band.
{
  "name": "vllm-local",
  "type": "openai",
  "model": "...",
  "publish": true,
  "wake": {
    "type": "http",
    "url": "http://10.0.0.50:9090/wake/vllm",
    "method": "POST",
    "headers": { "Authorization": "Bearer ..." },
    "maxWaitSeconds": 60
  }
}

// command — spawn a local process (systemd, wakeonlan, custom script).
{
  "name": "vllm-local",
  "type": "openai",
  "model": "...",
  "publish": true,
  "wake": {
    "type": "command",
    "command": "/usr/local/bin/start-vllm",
    "args": ["--profile", "qwen3"],
    "maxWaitSeconds": 120
  }
}

How a request flows when the row is hibernating:

client → mcpd POST /api/v1/llms/<name>/infer
         mcpd: status === hibernating → push wake task on SSE
         mcplocal: receive wake task → run recipe → poll isAvailable()
                   → heartbeat each tick → POST { ok: true } back
         mcpd: flip row → active, push the original infer task
         mcplocal: run inference → POST result back
mcpd → client (forwards the inference result)

Concurrent infers for the same hibernating Llm share a single wake task — only the first request triggers the recipe; later ones await the same in-flight wake promise. After the wake settles, every queued infer dispatches in order.

If the recipe fails (HTTP non-2xx, command exits non-zero, or the provider doesn't come up within maxWaitSeconds), every queued infer is rejected with a clear error and the row stays hibernating — the next request gets a fresh wake attempt.
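
One plausible implementation of that single-flight behavior (a sketch, not mcplocal's actual code):

// Share one in-flight wake per Llm; concurrent infers await the same promise.
const wakes = new Map<string, Promise<void>>();

function wakeOnce(llmName: string, runRecipe: () => Promise<void>): Promise<void> {
  let inflight = wakes.get(llmName);
  if (!inflight) {
    // clear on settle so a failed wake lets the next request retry fresh
    inflight = runRecipe().finally(() => wakes.delete(llmName));
    wakes.set(llmName, inflight);
  }
  return inflight; // rejection propagates to every queued infer
}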

Virtual agents (v3)

Virtual agents extend the same publishing model to agents — named LLM personas with their own system prompt and sampling defaults. mcplocal declares them in its config alongside its providers, and the existing _provider-register endpoint atomically publishes both Llms and Agents in one round-trip. They show up under mcpctl get agent next to manually-created public agents and become chat-able via mcpctl chat <agent> — no special command.

Declaring a virtual agent in mcplocal config

// ~/.mcpctl/config.json
{
  "llm": {
    "providers": [
      { "name": "vllm-local", "type": "vllm", "model": "Qwen/Qwen2.5-7B-Instruct-AWQ", "publish": true }
    ]
  },
  "agents": [
    {
      "name": "local-coder",
      "llm": "vllm-local",
      "description": "Local coding assistant on the workstation GPU",
      "systemPrompt": "You are a senior engineer. Be terse.",
      "defaultParams": { "temperature": 0.2 }
    }
  ]
}

llm references a published provider's name from the same config. Agents pinned to a name that isn't being published are still forwarded to mcpd — the server validates llmName and 404s with a clear message if it's genuinely missing, which lets you point at a public Llm if you want.

Lifecycle

Same shape as virtual Llms — 30 s heartbeat from mcplocal, 90 s heartbeat-stale → status flips to inactive, 4 h inactive → row deleted by mcpd's GC sweep. Heartbeats cover both Llms and Agents owned by the session.

The GC orders agent deletes before their pinned virtual Llm so the Agent.llmId onDelete: Restrict FK doesn't block the sweep.

Listing

$ mcpctl get agents
NAME          KIND     STATUS    LLM             PROJECT             DESCRIPTION
local-coder   virtual  active    vllm-local      -                   Local coding assistant on…
reviewer      public   active    qwen3-thinking  mcpctl-development  I review what you're shipping…

The KIND and STATUS columns are the v3 additions. Round-tripping through mcpctl get agent X -o yaml | mcpctl apply -f - strips those runtime fields cleanly so a virtual agent can be re-declared as a public one (or vice versa) without manual editing.

Chatting

$ mcpctl chat local-coder
> hello?
… streams through mcpd → SSE → mcplocal's vllm-local provider …

Same command as for public agents. Works because chat.service has a kind=virtual branch that hands off to VirtualLlmService.enqueueInferTask when the agent's pinned Llm is virtual.

Cluster-wide name uniqueness

Agent.name is unique cluster-wide. Two mcplocals trying to publish the same agent name collide on the second register with HTTP 409. Per-publisher namespacing is a v4+ concern — same constraint as virtual Llms in v1.

LB pools (v4)

Two or more Llm rows that share a poolName stack into one load-balanced pool. Agents pin to a single Llm by id; the chat dispatcher transparently widens to "all healthy Llms with the same effective pool key" at request time and picks one. There is no new LlmPool resource — poolName is just an optional column on Llm, so RBAC, listing, yaml round-trip, and apply all work the same way they did pre-v4.

Pool semantics

Field          Behavior
Llm.name       Globally unique (unchanged). The apply key.
Llm.poolName   Optional. When set, declares membership. When NULL, falls back to name ("solo Llm, pool of 1").

Effective pool key = poolName ?? name. The dispatcher's lookup is:

SELECT * FROM Llm
WHERE poolName = $1 OR (poolName IS NULL AND name = $1)

So a solo Llm whose name happens to equal an explicit poolName joins that pool — by design, an existing single-row Llm can be promoted to "pool seed" without a rename or migration.

Selection + failover

  • Selection: random shuffle of all members whose status is active (or hibernating — VirtualLlmService handles wake on dispatch). inactive members are skipped.
  • Failover (non-streaming): if dispatch throws on the first candidate (transport failure, virtual publisher disconnect), the dispatcher iterates the rest of the shuffled list until one succeeds or the list is exhausted. Auth/4xx responses are NOT retried — siblings with the same key/model would fail identically.
  • Failover (streaming): only covers "couldn't establish stream" failures (transport error before any chunk yielded). Once any output has been streamed, we're committed to that backend.
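
A sketch of the non-streaming loop under those rules — the shapes and the shuffle are illustrative, not the dispatcher's actual code:

// Selection + failover over pool members (sketch).
type Member = { name: string; status: 'active' | 'inactive' | 'hibernating' };

async function dispatch(members: Member[], run: (m: Member) => Promise<unknown>) {
  const candidates = members
    .filter(m => m.status !== 'inactive')  // active or hibernating (wake handled downstream)
    .sort(() => Math.random() - 0.5);      // cheap shuffle, for illustration only
  let lastErr: unknown;
  for (const m of candidates) {
    try {
      return await run(m);
    } catch (err: any) {
      if (err?.status >= 400 && err.status < 500) throw err; // 4xx: siblings would fail identically
      lastErr = err;                                         // transport failure: try the next member
    }
  }
  throw lastErr ?? new Error('pool has no dispatchable members');
}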

Declaring a pool

Public Llms

mcpctl create llm prod-qwen-1 --type openai --model qwen3-thinking \
  --url https://prod-1.example.com --pool-name qwen-pool \
  --api-key-ref qwen-key/API_KEY

mcpctl create llm prod-qwen-2 --type openai --model qwen3-thinking \
  --url https://prod-2.example.com --pool-name qwen-pool \
  --api-key-ref qwen-key/API_KEY

Or via apply (yaml round-trip preserves poolName):

---
kind: llm
name: prod-qwen-1
type: openai
model: qwen3-thinking
url: https://prod-1.example.com
poolName: qwen-pool
apiKeyRef: { name: qwen-key, key: API_KEY }
---
kind: llm
name: prod-qwen-2
type: openai
model: qwen3-thinking
url: https://prod-2.example.com
poolName: qwen-pool
apiKeyRef: { name: qwen-key, key: API_KEY }

Virtual Llms (mcplocal-published)

// ~/.mcpctl/config.json
{
  "llm": {
    "providers": [
      {
        "name": "vllm-alice-qwen3",       // unique per publisher
        "type": "vllm-managed",
        "model": "Qwen/Qwen2.5-7B-Instruct-AWQ",
        "venvPath": "~/vllm_env",
        "publish": true,
        "poolName": "user-vllm-qwen3-thinking"   // shared pool key
      }
    ]
  }
}

Each user's mcplocal picks a unique name (e.g. include the hostname to guarantee no collisions) but shares the poolName. Agents pinned to any single member — or to qwen3-thinking (the public LiteLLM endpoint, also given poolName: user-vllm-qwen3-thinking if mixing public + virtual is desired) — see one logical pool that auto-grows as more workers come online.
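
One way to derive that unique per-publisher name (a hypothetical helper, not something mcplocal requires):

import os from 'node:os';
const name = `vllm-${os.hostname()}-qwen3`;   // unique per machine
const poolName = 'user-vllm-qwen3-thinking';  // shared pool key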

Listing + describe

The mcpctl get llm table has a POOL column right after NAME. Solo rows render as -; pool members show their explicit pool key:

NAME              POOL          KIND     STATUS    TYPE    MODEL                           ID
qwen3-thinking    -             public   active    openai  qwen3-thinking                  cmo...
prod-qwen-1       qwen-pool     public   active    openai  qwen3-thinking                  cmo...
prod-qwen-2       qwen-pool     public   active    openai  qwen3-thinking                  cmo...

mcpctl describe llm <name> adds a Pool: block at the top when the row is in an explicit pool OR when its implicit pool has size > 1:

Pool:
  Pool name:    qwen-pool
  Members:      2 (2 active)
    - prod-qwen-1  [public/active]  ← this row
    - prod-qwen-2  [public/active]

GET /api/v1/llms/<name>/members is the API surface — returns full LlmViews for every member plus aggregate size / activeCount so operator tooling doesn't need a second roundtrip.
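
The response shape as a TypeScript sketch, per the API surface section below — LlmView details are elided and the nullability is an assumption:

type LlmView = Record<string, unknown>;  // full per-row view, fields elided here

type MembersResponse = {
  poolName: string;                 // effective pool key (poolName ?? name)
  explicitPoolName: string | null;  // the row's own poolName column, if set
  size: number;
  activeCount: number;
  members: LlmView[];               // full views, so no second roundtrip is needed
};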

Pinning to a specific instance

To pin an agent to one specific instance (e.g. for debugging, RBAC-scoped routing, or "this agent must hit this model with this key"), give that instance a unique name and leave its poolName unset. The agent's pool is then size 1 and dispatch is deterministic. Pool membership is opt-in via poolName — the default behavior is single-Llm.

Durable inference task queue (v5)

Every infer call (sync /llms/<name>/infer, agent chat, async POST /inference-tasks) now lands as a row in InferenceTask. mcpd's in-memory request map is gone — the DB is the source of truth. Workers (mcplocal SSE sessions) drain queued rows when they bind, so tasks queued while a pool was empty drain when the first worker shows up. mcpd restart no longer drops in-flight work; worker disconnect mid-task reverts the row to pending instead of failing the caller.

See inference-tasks.md for the full data model, async API, lifecycle, RBAC, and CLI surface.

Visibility scope (v7)

Virtual Llms and Agents now carry an explicit visibility field that decides who can see the row in listings.

Visibility   Meaning
public       Visible to anyone with view:llms / view:agents. Default for hand-created Llms.
private      Only the owner plus principals with a name-scoped grant can see it. Default for virtual Llms and Agents on first publish.

The owner is whichever user authenticated the publishing POST /api/v1/llms/_provider-register (or mcpctl create llm). For mcplocal that's whichever ~/.mcpctl/credentials token is on disk. Legacy rows from before v7 default to visibility=public, ownerId=NULL, so the upgrade is a no-op for everything that already exists.

Who skips the filter?

Two principals see every row regardless of visibility:

  1. The row owner (ownerId === request.userId).
  2. Anyone with a cross-resource admin grant — RBAC binding { resource: '*' }. Operationally this is the SRE / cluster admin.

A plain view:llms resource grant is not the same as admin: it's an RBAC wildcard for name-scoping (you can name any Llm), but the visibility filter still applies on top. This is the v7 split that prevents a user with view:llms from enumerating every developer's private virtual Llm.
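
The same rule as a predicate — a sketch, where the grant and row shapes are assumptions:

// Who sees a row in listings (sketch of the v7 filter).
type Grant = { action: string; resource: string; name?: string };
type Row = { visibility: 'public' | 'private'; ownerId: string | null; name: string };

function canSee(row: Row, userId: string, grants: Grant[]): boolean {
  if (row.visibility === 'public') return true;
  if (row.ownerId === userId) return true;                // owner always sees own rows
  if (grants.some(g => g.resource === '*')) return true;  // cross-resource admin skips the filter
  return grants.some(g => g.action === 'view:llms' && g.name === row.name); // name-scoped exception
}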

Granting a single-row exception

When alice wants bob to see her private virtual Llm alice-vllm-local without making it public, she binds:

mcpctl create rbac bob view:llms --name alice-vllm-local

Same shape as any other name-scoped binding. Removing the binding flips bob back to "row not found".

Publishing as private from mcplocal

mcplocal defaults to private for every published provider and agent. Override per-row in ~/.mcpctl/config.json:

{
  "llm": {
    "providers": [
      { "name": "vllm-local",  "type": "vllm", "model": "...", "publish": true,
        "visibility": "private" },                                 // default; explicit for clarity
      { "name": "shared-qwen", "type": "vllm", "model": "...", "publish": true,
        "visibility": "public" }                                   // every team member can chat with it
    ]
  },
  "agents": [
    { "name": "local-coder", "llm": "vllm-local",
      "visibility": "private" }                                    // private agents pinned to private Llms
  ]
}

On a sticky reconnect (providerSessionId matches an existing row) the visibility is only updated when the publisher explicitly sends it — leaving the field off keeps whatever the row already has, including any value an admin set out-of-band.
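
That merge rule fits in one line (names hypothetical):

function mergeVisibility(payload: { visibility?: 'public' | 'private' },
                         existing: { visibility: 'public' | 'private' }) {
  return payload.visibility ?? existing.visibility; // absent field keeps the row's current value
}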

Hand-created Llms

mcpctl create llm defaults to public (matches pre-v7 behavior). Pass --visibility private to opt in:

mcpctl create llm my-key --type openai --model gpt-4o \
  --api-key-ref my-secret/key --visibility private

The same --visibility flag is on mcpctl create agent.

CLI surface

mcpctl get llm and mcpctl get agent show a VISIBILITY column. YAML round-trips cleanly: mcpctl get llm X -o yaml | mcpctl apply -f - preserves visibility, and ownerId is stripped from the apply doc because it's server-side state (the apply re-stamps the ownerId of the authenticated caller, not the original creator).

Roadmap (later stages)

(LB pool by name landed in v4; durable task queue landed in v5; visibility scope landed in v7.)

  • v8 — multi-instance mcpd via pg LISTEN/NOTIFY (replaces the per-instance EventEmitter wakeup), per-session worker capacity, remote cancel protocol over the SSE channel.

API surface (v1)

POST  /api/v1/llms/_provider-register      → returns { providerSessionId, llms[], agents[] }
                                              v3: body accepts an optional `agents[]` array
                                              alongside `providers[]`. Atomic publish; older
                                              clients (providers-only) keep working.
GET   /api/v1/llms/_provider-stream        → SSE channel; requires the x-mcpctl-provider-session header
POST  /api/v1/llms/_provider-heartbeat     → { providerSessionId } — bumps both Llms and Agents
                                              owned by the session
POST  /api/v1/llms/_provider-task/:id/result
                                           → one of:
                                             { error: "msg" }
                                             { chunk: { data, done? } }
                                             { status, body }

GET   /api/v1/llms                         → list (includes kind, status, lastHeartbeatAt, inactiveSince, poolName)
GET   /api/v1/llms/<name>                  → single Llm row (also accepts a CUID id)
GET   /api/v1/llms/<name>/members          → v4: pool members for the effective pool key:
                                              { poolName, explicitPoolName, size, activeCount, members[] }
POST  /api/v1/llms/<virtual>/infer         → routes through the SSE relay (v4: dispatcher
                                              also expands by poolName when set)
DELETE /api/v1/llms/<virtual>              → delete unconditionally (also runs GC's job)
GET   /api/v1/agents                       → list (v3: includes kind, status, lastHeartbeatAt, inactiveSince)

RBAC piggybacks on view/edit/create:llms — no new resource. Publishing a virtual LLM is morally a create:llms operation.

See also

  • agents.md — what an Agent is and how it pins to an LLM.
  • chat.md — mcpctl chat <agent> (full agent flow).
  • The CLI: mcpctl chat-llm <name> (this doc) is the stateless counterpart for raw LLM chat.