Virtual LLMs
A virtual LLM is an Llm row in mcpd that's registered by an mcplocal
client rather than created by hand with mcpctl create llm. Inference for
a virtual LLM is relayed back through the publishing mcplocal's SSE control
channel — mcpd never needs to know the local URL or hold its API key.
When the publishing mcplocal goes away (or the user shuts down their
laptop) the row decays: active → inactive after 90 s without a
heartbeat, then deleted after 4 h of inactivity. A reconnecting mcplocal
adopts the same row using a sticky providerSessionId it persisted at
first publish.
When to use this
- Local model on a developer laptop that you want everyone on the team to be able to chat with via mcpctl chat-llm <name>. The model doesn't need to be reachable from mcpd's k8s pods — only the user's mcplocal does (which is already the case because mcplocal pulls projects from mcpd over HTTPS).
- Hibernating models that wake on demand (v2 — see "Roadmap").
- Pool of identical models distributed across user laptops, eligible for load balancing (v4).
If your model is reachable from mcpd's k8s pods over LAN/VPN, you don't
need a virtual LLM — just mcpctl create llm <name> --type openai --url …
and you're done.
Publishing a local provider
mcplocal's local config (~/.mcpctl/config.json) gains a publish: true
opt-in per provider:
{
"llm": {
"providers": [
{
"name": "vllm-local",
"type": "openai",
"model": "Qwen/Qwen2.5-7B-Instruct-AWQ",
"url": "http://127.0.0.1:8000/v1",
"tier": "fast",
"publish": true
}
]
}
}
Restart mcplocal:
systemctl --user restart mcplocal
The registrar:
- Reads ~/.mcpctl/credentials for mcpdUrl + bearer token.
- POSTs to /api/v1/llms/_provider-register with the publishable set.
- Persists the returned providerSessionId to ~/.mcpctl/provider-session so the next restart adopts the same mcpd row.
- Opens the SSE channel at /api/v1/llms/_provider-stream.
- Heartbeats every 30 s.
- Listens for event: task frames and runs them against the local LlmProvider.
If ~/.mcpctl/credentials doesn't exist (e.g. you haven't run
mcpctl auth login), the registrar logs a warning and skips —
publishing is a best-effort feature, not a boot blocker.
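A minimal sketch of that startup sequence, assuming Node 18+ fetch and hypothetical helper types (this is not the actual registrar source):

```ts
import { readFile, writeFile } from 'node:fs/promises';
import { homedir } from 'node:os';
import { join } from 'node:path';

// Hypothetical shapes; the real mcplocal registrar types may differ.
interface Credentials { mcpdUrl: string; token: string }
interface PublishableProvider { name: string; type: string; model: string; url?: string; tier?: string }

async function registerProviders(providers: PublishableProvider[]): Promise<void> {
  let creds: Credentials;
  try {
    creds = JSON.parse(await readFile(join(homedir(), '.mcpctl', 'credentials'), 'utf8'));
  } catch {
    console.warn('no ~/.mcpctl/credentials; skipping virtual LLM publish'); // best-effort, not a boot blocker
    return;
  }

  // POST the publishable set; mcpd answers with a sticky providerSessionId.
  const res = await fetch(`${creds.mcpdUrl}/api/v1/llms/_provider-register`, {
    method: 'POST',
    headers: { authorization: `Bearer ${creds.token}`, 'content-type': 'application/json' },
    body: JSON.stringify({ providers }),
  });
  const { providerSessionId } = (await res.json()) as { providerSessionId: string };

  // Persist the session id so the next restart adopts the same mcpd row.
  await writeFile(join(homedir(), '.mcpctl', 'provider-session'), providerSessionId, 'utf8');

  // Heartbeat every 30 s; the SSE channel (not shown) listens for `event: task` frames.
  setInterval(() => {
    void fetch(`${creds.mcpdUrl}/api/v1/llms/_provider-heartbeat`, {
      method: 'POST',
      headers: { authorization: `Bearer ${creds.token}`, 'content-type': 'application/json' },
      body: JSON.stringify({ providerSessionId }),
    });
  }, 30_000);
}
```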
Verifying
$ mcpctl get llm
NAME KIND STATUS TYPE MODEL TIER KEY ID
qwen3-thinking public active openai qwen3-thinking fast secret://litellm-key/API_KEY cmofx8y7u…
vllm-local virtual active openai Qwen/Qwen2.5-7B-Instruct-AWQ fast - cmoxz12ab…
$ mcpctl chat-llm vllm-local
─────────────────────────────────────────────────────────
LLM: vllm-local openai → Qwen/Qwen2.5-7B-Instruct-AWQ
Kind: virtual Status: active
─────────────────────────────────────────────────────────
> hello?
Hi! …
You can also chat with public LLMs the same way:
$ mcpctl chat-llm qwen3-thinking
The CLI doesn't care about kind — mcpd's /api/v1/llms/<name>/infer
route branches on it server-side.
Lifecycle in detail
| State | What it means |
|---|---|
| active | Heartbeat received within the last 90 s and the SSE channel is open. |
| inactive | Either the SSE closed or the heartbeat watchdog tripped. Inference returns 503. |
| hibernating | Publisher is online but the backend is asleep; the next inference triggers a wake task before relaying. |
Two timers on mcpd run the GC sweep:
- 90 s without a heartbeat → flip active → inactive.
- 4 h in inactive → delete the row entirely.
A reconnecting mcplocal with the same providerSessionId revives every
inactive row it owns; it only orphans rows that fell past the 4-h cutoff.
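In sketch form, the sweep amounts to the following; the store interface and method names are assumptions, not mcpd's actual data layer:

```ts
// Hypothetical row/store shapes, for illustration only.
interface VirtualLlmRow { name: string; status: 'active' | 'inactive'; lastHeartbeatAt: Date; inactiveSince: Date | null }
interface LlmStore {
  listVirtual(): Promise<VirtualLlmRow[]>;
  markInactive(name: string, at: Date): Promise<void>;
  delete(name: string): Promise<void>;
}

const HEARTBEAT_STALE_MS = 90_000;       // active -> inactive
const INACTIVE_DELETE_MS = 4 * 3600_000; // inactive -> deleted

async function gcSweep(store: LlmStore, now = new Date()): Promise<void> {
  for (const row of await store.listVirtual()) {
    if (row.status === 'active' && now.getTime() - row.lastHeartbeatAt.getTime() > HEARTBEAT_STALE_MS) {
      await store.markInactive(row.name, now); // inference starts returning 503
    } else if (row.status === 'inactive' && row.inactiveSince &&
               now.getTime() - row.inactiveSince.getTime() > INACTIVE_DELETE_MS) {
      await store.delete(row.name); // past the 4 h cutoff; a reconnect can no longer adopt it
    }
  }
}
```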
Inference relay
When mcpd receives POST /api/v1/llms/<virtual>/infer:
- Look up the row, see kind=virtual + status=active.
- Find the open SSE session for that providerSessionId. Missing session → 503.
- Push a { kind: "infer", taskId, llmName, request, streaming } task frame onto the SSE.
- mcplocal pulls, calls LlmProvider.complete(...), and POSTs the result back to /api/v1/llms/_provider-task/<taskId>/result:
  - non-streaming: { status: 200, body: <chat.completion> }
  - streaming: per-chunk { chunk: { data, done? } }
  - failure: { error: "..." }
- mcpd forwards the result/chunks out to the original caller.
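The frame and result shapes above, written out as TypeScript types. Field names come straight from the list; the type names themselves are invented for illustration:

```ts
// Task frame pushed over the SSE channel (event: task).
interface InferTaskFrame {
  kind: 'infer';
  taskId: string;
  llmName: string;
  request: unknown;   // the original chat request, relayed as-is
  streaming: boolean;
}

// Body POSTed back to /api/v1/llms/_provider-task/<taskId>/result, exactly one variant per POST.
type ProviderTaskResult =
  | { status: number; body: unknown }            // non-streaming, e.g. { status: 200, body: <chat.completion> }
  | { chunk: { data: unknown; done?: boolean } } // streaming: one POST per chunk
  | { error: string };                           // failure
```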
v1 caveat — streaming granularity: LlmProvider.complete() returns
a finalized CompletionResult, not a token stream. Streaming requests
therefore arrive at the caller as a single delta + [DONE]. Real
per-token streaming is a v2 concern.
Wake-on-demand (v2)
A provider whose backend hibernates (a vLLM instance that suspends
when idle, an Ollama daemon that exits when nothing's connected, …)
can declare a wake recipe in mcplocal config. When that provider's
isAvailable() returns false at registrar startup, the row is
published as status=hibernating. The next inference request that
hits the row triggers the recipe and waits for the backend to come up
before relaying.
Two recipe types:
// HTTP — POST to a "wake controller" that starts the backend out of band.
{
"name": "vllm-local",
"type": "openai",
"model": "...",
"publish": true,
"wake": {
"type": "http",
"url": "http://10.0.0.50:9090/wake/vllm",
"method": "POST",
"headers": { "Authorization": "Bearer ..." },
"maxWaitSeconds": 60
}
}
// command — spawn a local process (systemd, wakeonlan, custom script).
{
"name": "vllm-local",
"type": "openai",
"model": "...",
"publish": true,
"wake": {
"type": "command",
"command": "/usr/local/bin/start-vllm",
"args": ["--profile", "qwen3"],
"maxWaitSeconds": 120
}
}
How a request flows when the row is hibernating:
client → mcpd POST /api/v1/llms/<name>/infer
mcpd: status === hibernating → push wake task on SSE
mcplocal: receive wake task → run recipe → poll isAvailable()
→ heartbeat each tick → POST { ok: true } back
mcpd: flip row → active, push the original infer task
mcplocal: run inference → POST result back
mcpd → client (forwards the inference result)
Concurrent infers for the same hibernating Llm share a single wake task — only the first request triggers the recipe; later ones await the same in-flight wake promise. After the wake settles, every queued infer dispatches in order.
If the recipe fails (HTTP non-2xx, command exits non-zero, or the
provider doesn't come up within maxWaitSeconds), every queued infer
is rejected with a clear error and the row stays hibernating —
the next request gets a fresh wake attempt.
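A sketch of that single-flight wake, with runWakeRecipe standing in for whatever actually pushes the wake task over SSE and awaits the { ok: true } reply:

```ts
// One in-flight wake per Llm name; concurrent infers share it.
const wakesInFlight = new Map<string, Promise<void>>();

async function ensureAwake(
  llmName: string,
  runWakeRecipe: () => Promise<void>,
): Promise<void> {
  let wake = wakesInFlight.get(llmName);
  if (!wake) {
    wake = runWakeRecipe().finally(() => wakesInFlight.delete(llmName));
    wakesInFlight.set(llmName, wake);
  }
  // Every queued infer awaits the same promise; if the recipe fails they all
  // reject and the entry is cleared, so the next request triggers a fresh wake.
  return wake;
}
```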
Virtual agents (v3)
Virtual agents extend the same publishing model to agents — named
LLM personas with their own system prompt and sampling defaults. mcplocal
declares them in its config alongside its providers, and the existing
_provider-register endpoint atomically publishes both Llms and Agents
in one round-trip. They show up under mcpctl get agent next to
manually-created public agents and become chat-able via
mcpctl chat <agent> — no special command.
Declaring a virtual agent in mcplocal config
// ~/.mcpctl/config.json
{
"llm": {
"providers": [
{ "name": "vllm-local", "type": "vllm", "model": "Qwen/Qwen2.5-7B-Instruct-AWQ", "publish": true }
]
},
"agents": [
{
"name": "local-coder",
"llm": "vllm-local",
"description": "Local coding assistant on the workstation GPU",
"systemPrompt": "You are a senior engineer. Be terse.",
"defaultParams": { "temperature": 0.2 }
}
]
}
llm references a published provider's name from the same config. Agents
pinned to a name that isn't being published are still forwarded to mcpd —
the server validates llmName and 404s with a clear message if it's
genuinely missing, which lets you point at a public Llm if you want.
Lifecycle
Same shape as virtual Llms — 30 s heartbeat from mcplocal, 90 s
heartbeat-stale → status flips to inactive, 4 h inactive → row deleted
by mcpd's GC sweep. Heartbeats cover both Llms and Agents owned by the
session.
The GC orders agent deletes before their pinned virtual Llm so the
Agent.llmId onDelete: Restrict FK doesn't block the sweep.
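Roughly, against a Prisma-like client sketched inline (not mcpd's actual schema client):

```ts
// Minimal Prisma-like surface, for illustration only.
interface PrismaLikeClient {
  $transaction(ops: Promise<unknown>[]): Promise<unknown>;
  agent: { deleteMany(args: { where: object }): Promise<unknown> };
  llm: { deleteMany(args: { where: object }): Promise<unknown> };
}

// Delete expired virtual agents first, then their pinned virtual Llms,
// so the Agent.llmId onDelete: Restrict FK never blocks the sweep.
async function deleteExpired(db: PrismaLikeClient, expiredLlmIds: string[]): Promise<void> {
  await db.$transaction([
    db.agent.deleteMany({ where: { kind: 'virtual', llmId: { in: expiredLlmIds } } }),
    db.llm.deleteMany({ where: { id: { in: expiredLlmIds } } }),
  ]);
}
```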
Listing
$ mcpctl get agents
NAME KIND STATUS LLM PROJECT DESCRIPTION
local-coder virtual active vllm-local - Local coding assistant on…
reviewer public active qwen3-thinking mcpctl-development I review what you're shipping…
The KIND and STATUS columns are the v3 additions. Round-tripping
through mcpctl get agent X -o yaml | mcpctl apply -f - strips those
runtime fields cleanly so a virtual agent can be re-declared as a public
one (or vice versa) without manual editing.
Chatting
$ mcpctl chat local-coder
> hello?
… streams through mcpd → SSE → mcplocal's vllm-local provider …
Same command as for public agents. Works because chat.service has a
kind=virtual branch that hands off to VirtualLlmService.enqueueInferTask
when the agent's pinned Llm is virtual.
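In sketch form; the signatures are assumptions, and only the kind=virtual to enqueueInferTask handoff is the documented behavior:

```ts
interface AgentRow { name: string; llm: { name: string; kind: 'public' | 'virtual'; status: string } }

async function dispatchChat(agent: AgentRow, request: unknown, deps: {
  enqueueInferTask(llmName: string, request: unknown): Promise<unknown>; // VirtualLlmService path (SSE relay)
  callDirect(llmName: string, request: unknown): Promise<unknown>;       // ordinary HTTP provider path
}): Promise<unknown> {
  return agent.llm.kind === 'virtual'
    ? deps.enqueueInferTask(agent.llm.name, request) // relays through the publishing mcplocal
    : deps.callDirect(agent.llm.name, request);
}
```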
Cluster-wide name uniqueness
Agent.name is unique cluster-wide. Two mcplocals trying to publish the
same agent name collide on the second register with HTTP 409. Per-publisher
namespacing is a v4+ concern — same constraint as virtual Llms in v1.
LB pools (v4)
Two or more Llm rows that share a poolName stack into one
load-balanced pool. Agents pin to a single Llm by id; the chat
dispatcher transparently widens to "all healthy Llms with the same
effective pool key" at request time and picks one. There is no new
LlmPool resource — poolName is just an optional column on Llm,
so RBAC, listing, yaml round-trip, and apply all work the same way
they did pre-v4.
Pool semantics
| Field | Behavior |
|---|---|
| Llm.name | Globally unique (unchanged). The apply key. |
| Llm.poolName | Optional. When set, declares membership. When NULL, falls back to name ("solo Llm, pool of 1"). |
Effective pool key = poolName ?? name. The dispatcher's lookup is:
SELECT * FROM Llm
WHERE poolName = $1 OR (poolName IS NULL AND name = $1)
So a solo Llm whose name happens to equal an explicit poolName
joins that pool — by design, an existing single-row Llm can be
promoted to "pool seed" without a rename or migration.
Selection + failover
- Selection: random shuffle of all members whose status is active (or hibernating — VirtualLlmService handles wake on dispatch). inactive members are skipped.
- Failover (non-streaming): if dispatch throws on the first candidate (transport failure, virtual publisher disconnect), the dispatcher iterates the rest of the shuffled list until one succeeds or the list is exhausted. Auth/4xx responses are NOT retried — siblings with the same key/model would fail identically.
- Failover (streaming): only covers "couldn't establish stream" failures (transport error before any chunk yielded). Once any output has been streamed, we're committed to that backend.
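A sketch of the non-streaming path under those rules; the shuffle helper and the error-classification convention are assumptions:

```ts
interface PoolMember { name: string; status: 'active' | 'inactive' | 'hibernating' }

function shuffle<T>(xs: T[]): T[] {
  return xs.map(x => [Math.random(), x] as const).sort((a, b) => a[0] - b[0]).map(([, x]) => x);
}

async function dispatchNonStreaming(
  members: PoolMember[],
  tryOne: (m: PoolMember) => Promise<unknown>, // direct call or virtual relay (wake handled inside)
): Promise<unknown> {
  const candidates = shuffle(members.filter(m => m.status !== 'inactive'));
  let lastErr: unknown;
  for (const m of candidates) {
    try {
      return await tryOne(m);
    } catch (err: any) {
      lastErr = err;
      // Auth/4xx means siblings with the same key/model would fail identically: don't retry.
      if (typeof err?.status === 'number' && err.status >= 400 && err.status < 500) throw err;
    }
  }
  throw lastErr ?? new Error('no active pool members');
}
```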
Declaring a pool
Public Llms
mcpctl create llm prod-qwen-1 --type openai --model qwen3-thinking \
--url https://prod-1.example.com --pool-name qwen-pool \
--api-key-ref qwen-key/API_KEY
mcpctl create llm prod-qwen-2 --type openai --model qwen3-thinking \
--url https://prod-2.example.com --pool-name qwen-pool \
--api-key-ref qwen-key/API_KEY
Or via apply (yaml round-trip preserves poolName):
---
kind: llm
name: prod-qwen-1
type: openai
model: qwen3-thinking
url: https://prod-1.example.com
poolName: qwen-pool
apiKeyRef: { name: qwen-key, key: API_KEY }
---
kind: llm
name: prod-qwen-2
type: openai
model: qwen3-thinking
url: https://prod-2.example.com
poolName: qwen-pool
apiKeyRef: { name: qwen-key, key: API_KEY }
Virtual Llms (mcplocal-published)
// ~/.mcpctl/config.json
{
"llm": {
"providers": [
{
"name": "vllm-alice-qwen3", // unique per publisher
"type": "vllm-managed",
"model": "Qwen/Qwen2.5-7B-Instruct-AWQ",
"venvPath": "~/vllm_env",
"publish": true,
"poolName": "user-vllm-qwen3-thinking" // shared pool key
}
]
}
}
Each user's mcplocal picks a unique name (e.g. include the hostname
to guarantee no collisions) but shares the poolName. Agents pinned
to any single member — or to qwen3-thinking (the public LiteLLM
endpoint, also given poolName: user-vllm-qwen3-thinking if mixing
public + virtual is desired) — see one logical pool that auto-grows
as more workers come online.
Listing + describe
The mcpctl get llm table has a POOL column right after NAME.
Solo rows render as -; pool members show their explicit pool key:
NAME POOL KIND STATUS TYPE MODEL ID
qwen3-thinking - public active openai qwen3-thinking cmo...
prod-qwen-1 qwen-pool public active openai qwen3-thinking cmo...
prod-qwen-2 qwen-pool public active openai qwen3-thinking cmo...
mcpctl describe llm <name> adds a Pool: block at the top when the
row is in an explicit pool OR when its implicit pool has size > 1:
Pool:
Pool name: qwen-pool
Members: 2 (2 active)
- prod-qwen-1 [public/active] ← this row
- prod-qwen-2 [public/active]
GET /api/v1/llms/<name>/members is the API surface — returns full
LlmViews for every member plus aggregate size / activeCount so
operator tooling doesn't need a second roundtrip.
Pinning to a specific instance
To pin an agent to one specific instance (e.g. for debugging,
RBAC-scoped routing, or "this agent must hit this model with this
key"), give that instance a unique name and leave its poolName
unset. The agent's pool is then size 1 and dispatch is deterministic.
Pool membership is opt-in via poolName — the default behavior is
single-Llm.
Durable inference task queue (v5)
Every infer call (sync /llms/<name>/infer, agent chat, async
POST /inference-tasks) now lands as a row in InferenceTask. mcpd's
in-memory request map is gone — the DB is the source of truth.
Workers (mcplocal SSE sessions) drain queued rows when they bind, so
tasks queued while a pool was empty drain when the first worker shows
up. mcpd restart no longer drops in-flight work; worker disconnect
mid-task reverts the row to pending instead of failing the caller.
See inference-tasks.md for the full data model, async API, lifecycle, RBAC, and CLI surface.
Visibility scope (v7)
Virtual Llms and Agents now carry an explicit visibility field that decides who can see the row in listings.
| Visibility | Meaning |
|---|---|
| public | Visible to anyone with view:llms / view:agents. Default for hand-created Llms. |
| private | Only the owner plus principals with a name-scoped grant can see it. Default for virtual Llms and Agents on first publish. |
The owner is whichever user authenticated the publishing
POST /api/v1/llms/_provider-register (or mcpctl create llm). For
mcplocal that's whichever ~/.mcpctl/credentials token is on disk.
Legacy rows from before v7 default to visibility=public, ownerId=NULL,
so the upgrade is a no-op for everything that already exists.
Who skips the filter?
Two principals see every row regardless of visibility:
- The row owner (ownerId === request.userId).
- Anyone with a cross-resource admin grant — RBAC binding { resource: '*' }. Operationally this is the SRE / cluster admin.
A plain view:llms resource grant is not the same as admin: it's an RBAC wildcard for name-scoping (you can name any Llm), but the visibility filter still applies on top. This is the v7 split that prevents a user with view:llms from enumerating every developer's private virtual Llm.
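As a predicate, the filter comes down to roughly this (the grant-lookup helpers are placeholders, not mcpd's real RBAC API):

```ts
interface LlmRowV7 { name: string; visibility: 'public' | 'private'; ownerId: string | null }
interface Principal {
  userId: string;
  hasWildcardAdmin: boolean;                // RBAC binding { resource: '*' }
  hasNameScopedView(name: string): boolean; // e.g. view:llms --name <row>
}

function canSee(row: LlmRowV7, who: Principal): boolean {
  if (who.hasWildcardAdmin) return true;                                // cluster admin skips the filter
  if (row.ownerId !== null && row.ownerId === who.userId) return true;  // owner skips the filter
  if (row.visibility === 'public') return true;                         // plain view:llms is enough here (covers legacy rows)
  return who.hasNameScopedView(row.name);                               // single-row exception
}
```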
Granting a single-row exception
When alice wants bob to see her private virtual Llm alice-vllm-local
without making it public, she binds:
mcpctl create rbac bob view:llms --name alice-vllm-local
Same shape as any other name-scoped binding. Removing the binding flips bob back to "row not found".
Publishing as private from mcplocal
mcplocal defaults to private for every published provider and agent.
Override per-row in ~/.mcpctl/config.json:
{
"llm": {
"providers": [
{ "name": "vllm-local", "type": "vllm", "model": "...", "publish": true,
"visibility": "private" }, // default; explicit for clarity
{ "name": "shared-qwen", "type": "vllm", "model": "...", "publish": true,
"visibility": "public" } // every team member can chat with it
]
},
"agents": [
{ "name": "local-coder", "llm": "vllm-local",
"visibility": "private" } // private agents pinned to private Llms
]
}
On a sticky reconnect (providerSessionId matches an existing row), visibility is only updated when the publisher explicitly sends it — leaving the field off keeps whatever the row already has, including anything an admin set out-of-band.
Hand-created Llms
mcpctl create llm defaults to public (matches pre-v7 behavior).
Pass --visibility private to opt in:
mcpctl create llm my-key --type openai --model gpt-4o \
--api-key-ref my-secret/key --visibility private
The same --visibility flag is on mcpctl create agent.
CLI surface
mcpctl get llm and mcpctl get agent show a VISIBILITY column.
YAML round-trips cleanly: mcpctl get llm X -o yaml | mcpctl apply -f -
preserves visibility, and ownerId is stripped from the apply doc
because it's server-side state (the apply re-stamps the ownerId of the
authenticated caller, not the original creator).
Roadmap (later stages)
(LB pool by name landed in v4; durable task queue landed in v5; visibility scope landed in v7.)
- v8 — multi-instance mcpd via pg LISTEN/NOTIFY (replaces the per-instance EventEmitter wakeup), per-session worker capacity, remote cancel protocol over the SSE channel.
API surface (v1)
POST /api/v1/llms/_provider-register → returns { providerSessionId, llms[], agents[] }
v3: body accepts an optional `agents[]` array
alongside `providers[]`. Atomic publish; older
clients (providers-only) keep working.
GET /api/v1/llms/_provider-stream → SSE channel; require x-mcpctl-provider-session header
POST /api/v1/llms/_provider-heartbeat → { providerSessionId } — bumps both Llms and Agents
owned by the session
POST /api/v1/llms/_provider-task/:id/result
→ one of:
{ error: "msg" }
{ chunk: { data, done? } }
{ status, body }
GET /api/v1/llms → list (includes kind, status, lastHeartbeatAt, inactiveSince, poolName)
GET /api/v1/llms/<name> → single Llm row (also accepts a CUID id)
GET /api/v1/llms/<name>/members → v4: pool members for the effective pool key:
{ poolName, explicitPoolName, size, activeCount, members[] }
POST /api/v1/llms/<virtual>/infer → routes through the SSE relay (v4: dispatcher
also expands by poolName when set)
DELETE /api/v1/llms/<virtual> → delete unconditionally (also runs GC's job)
GET /api/v1/agents → list (v3: includes kind, status, lastHeartbeatAt, inactiveSince)
RBAC piggybacks on view/edit/create:llms — no new resource. Publishing
a virtual LLM is morally a create:llms operation.