mcpctl

Author	SHA1	Message	Date
Michal	1f0be8a5c1	fix(agents): close gaps from /gstack-review P1 — thread reads now enforce ownership ======================================== chat.service.ts / routes/agent-chat.ts GET /api/v1/threads/:id/messages was previously RBAC-mapped to view:agents (no resourceName scope) with the route comment promising "service-level owner check enforces fine-grained access" — but the service didn't actually check. Any caller with view:agents could read another user's thread by guessing/learning the threadId. CUIDs are hard to brute-force but they leak: SSE `final` chunks, agents-plugin `_meta.threadId`, and several response bodies surface them. Now ChatService.listMessages(threadId, ownerId) loads the thread, returns 404 (not 403, to avoid id-enumeration via differential status codes) if ownerId doesn't match. Regression test in chat-service.test.ts covers Alice/Bob isolation + nonexistent-thread same-shape 404. P2 — AgentChatRequestSchema strict mode ======================================== validation/agent.schema.ts `.merge()` does NOT inherit `.strict()` from AgentChatParamsSchema. Typo'd fields (e.g. `temprature`) silently fell through and the agent silently used the default — debuggable only by reading the LLM call payload. Re-applied `.strict()` on the merged schema. P2 — per-agent maxIterations override + clamp ============================================== chat.service.ts Loop cap was a hard-coded module constant (12), wrong for both research-style agents (need higher) and cheap-probe agents (could opt lower). Now reads `agent.extras.maxIterations`, clamps 1..50, falls back to 12 default. The clamp is the soft-DoS guard: a hostile agent definition with `maxIterations:1000000` can't burn unbounded LLM calls per request. Both chat() and chatStream() use ctx.maxIterations now. Regression test covers low-cap override (rejects with `exceeded 2`) and hostile-value clamp (rejects with `exceeded 50`). 
P3 — SSE write to closed socket ================================ routes/agent-chat.ts When the upstream adapter throws after some chunks were already written AND the client disconnected, the catch block tried to flush more chunks to a closed socket. Without an `on('error')` handler Node emits unhandled error events; once Pino is wired to alerts this'd page on every disconnect-mid-stream. writeSseChunk now checks `reply.raw.destroyed || writableEnded` before write. P3 — BACKEND_TOKEN_DEAD preserves original stack ================================================= services/secret-backend-rotator.service.ts When wrapping mintRoleToken/lookupSelf failures as BACKEND_TOKEN_DEAD, the new Error() discarded the original throw — hard to tell whether the inner failure was a network blip vs an OpenBao API mismatch vs DNS. Now uses `new Error(msg, { cause: err })` so the inner stack survives. P3 — .gitignore .claude/scheduled_tasks.lock ============================================= This persisted state file was leaking into every `git status`. Tests ===== mcpd 761/761 (+2 regression tests). mcplocal 715/715. cli 430/430. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-25 23:53:19 +01:00
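A minimal sketch of the per-agent cap described in the P2 item above: read `agent.extras.maxIterations`, clamp it to 1..50, and fall back to 12. The helper name `resolveMaxIterations` and the treatment of non-numeric values are assumptions for illustration; they are not taken from chat.service.ts.

```typescript
// Hypothetical helper; only the numbers 1, 50 and 12 come from the commit message.
const DEFAULT_MAX_ITERATIONS = 12;
const MIN_ITERATIONS = 1;
const MAX_ITERATIONS = 50;

function resolveMaxIterations(
  extras: Record<string, unknown> | null | undefined,
): number {
  const raw = extras ? extras["maxIterations"] : undefined;
  // Assumption: anything that is not a finite number falls back to the default.
  if (typeof raw !== "number" || !isFinite(raw)) {
    return DEFAULT_MAX_ITERATIONS;
  }
  // The clamp is the soft-DoS guard: a huge value is cut down to the ceiling.
  return Math.min(MAX_ITERATIONS, Math.max(MIN_ITERATIONS, Math.trunc(raw)));
}
```

With this shape, a definition carrying `maxIterations: 1000000` resolves to 50 and one carrying `maxIterations: 2` resolves to 2, which matches the two regression cases the commit mentions.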
Michal	2e266e318a	fix(mcplocal): lower default token introspection TTL in serve.ts too Followup to `e51b924`. The middleware default in token-auth.ts is 5s, but serve.ts wraps the construction with its own env-fallback default of 30000ms — so when MCPLOCAL_TOKEN_POSITIVE_TTL_MS isn't set in the environment, serve.ts always wins and revoked tokens still propagate slowly. Lowered serve.ts to 5s for symmetry; operators wanting a longer window can set the env var explicitly. Caught by mcptoken.smoke continuing to fail after the previous redeploy: verified the token-auth.js shipped with `?? 5_000`, but the wrapper was overriding it. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-25 18:41:22 +01:00
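The failure mode above (a wrapper default silently overriding the middleware default) is easier to see with the resolution written out. This is a sketch only: `resolvePositiveTtlMs` is a hypothetical name, the environment is passed in as a plain object so the snippet stays self-contained, and the fallback for unparsable values is an assumption.

```typescript
// Hypothetical helper: env var wins when set, otherwise the 5s default applies.
const DEFAULT_POSITIVE_TTL_MS = 5_000;

function resolvePositiveTtlMs(env: Record<string, string | undefined>): number {
  const raw = env["MCPLOCAL_TOKEN_POSITIVE_TTL_MS"];
  if (raw === undefined || raw.trim() === "") {
    return DEFAULT_POSITIVE_TTL_MS;
  }
  const parsed = Number(raw);
  // Assumption: invalid or negative values fall back to the default.
  return isFinite(parsed) && parsed >= 0 ? parsed : DEFAULT_POSITIVE_TTL_MS;
}
```

Keeping a single constant for the default, used by both the wrapper and the middleware, would avoid the two values drifting apart again.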
Michal	e51b92473f	fix(smoke,rotator,auth): repair smoke env + close failure modes that caused 27 post-deploy smoke failures This commit lands the durable side of the post-deploy investigation: genuine bugs that let the upstream OpenBao re-init silently break every secret write for 4 days, plus test-code bugs that masked the same breakage in the smoke output. mcpd — fail loud on dead OpenBao tokens ======================================= secret-backend-rotator.service.ts When `mintRoleToken` or `lookupSelf` returns 403/401, classify it as BACKEND_TOKEN_DEAD (likely cause: upstream OpenBao re-init invalidated every pre-existing token), wrap the thrown error with explicit remediation (mint via root + `mcpctl create secret <name> --data <key>=<token> --force`), persist the same message to tokenMeta.lastRotationError, and emit a structured `level:fatal` console.error so it shows up in `kubectl logs deploy/mcpd` with grep-friendly `kind:BACKEND_TOKEN_DEAD`. Adds a `healthCheck(backendId)` method that runs lookup-self without minting — so the boot-time loop can detect the dead-token state immediately, not 24 hours later. secret-backend-rotator-loop.ts Boot-time health check: in `start()`, for every rotatable backend, call `rotator.healthCheck(b.id)` and on failure log a structured fatal entry. This converts the prior silent failure mode (24h wait until scheduled rotation reveals the dead token, with secret writes failing under it the entire time) into "mcpd boots, immediately sees the dead token, alerts loudly". Existing isOverdue path is unchanged. mcpd — Prisma userId crash on /me ================================= routes/auth.ts GET /api/v1/auth/me used `request.userId!` which lied: an authenticated McpToken bearer satisfies the auth middleware but has no associated User row, so userId stayed undefined and `findUnique({ id: undefined })` threw PrismaClientValidationError. 
Now returns 401 with a clear "service-account/token-bound principal cannot be queried via /me" message instead of bubbling a 500. mcplocal — token revocation propagation ======================================= http/token-auth.ts Lowered default introspection positiveTtl from 30s → 5s. mcpd's introspection endpoint is a single DB lookup; the cache only protects against burst restart storms, not steady-state load. The 30s window let revoked tokens keep working for the full window after revocation (caught by mcptoken.smoke's negative-cache assertion). Aligns with the existing 5s negativeTtl and the test's `wait 7s after revoke` expectation. smoke tests — read URL the same way the CLI does ================================================ mcp-client.ts Adds `loadMcpdAuth()`: URL from `~/.mcpctl/config.json`, token from `~/.mcpctl/credentials`. Critically, the URL does NOT come from credentials. credentials.mcpdUrl carries a stale field for legacy reasons and goes out of sync (left over from old `mcpctl login --mcpd-url localhost:3xxx` invocations) — tests reading it ended up hitting whatever URL the user last logged into rather than the URL the CLI is actually using right now. audit/security/system-prompts smoke now use loadMcpdAuth(), eliminating ~10 cascade failures. Also: switch httpRequest to https.request when scheme is https (matching audit/security/system-prompts/mcp-client/agent helpers). Bumps default callTool timeout from 30s → 60s; many tools that fetch external resources routinely run 10-30s. agent.smoke.test.ts - readToken read from `credentials.json`; the file is `credentials` (no extension). Caused 401 on POST /threads. - `mcpctl get <resource> <name> -o json` returns an array, not a bare object. Round-trip yaml test now indexes [0] before reading description. 
secretbackend.smoke.test.ts Two genuine assertion-drift fixes (env was right, test was stale): - "lists at least one secretbackend": stop hard-coding the default backend type as 'plaintext'; the invariant is "exactly one default exists". The seeded plaintext is the bootstrap default but operators routinely promote a remote backend (openbao etc.) once it's healthy. - "refuses to delete the seeded default": widen the regex from /default|in use|cannot delete/ to also accept "referenced" — the exact wording has shifted to "is still referenced by N secret(s); migrate them first". audit.test.ts / system-prompts.test.ts / security.test.ts Switch http.request → https.request when URL is https (each had its own copy of the helper). Drop the now-orphan loadMcpdCredentials in favour of loadMcpdAuth from mcp-client.ts. Tests ===== mcpd 759/759, mcplocal 715/715 unit suites still green. Smoke (live): Run 1 (pre-commit, post bao-token rotation): 27 → 12 failures. Run 2 (after fixes-batch, pre-redeploy): 12 → 2 failures. The remaining 2 (mcptoken cache TTL, proxy-pipeline timeout) are what the durable code changes here address; verify after the next redeploy. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-25 18:35:13 +01:00
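The dead-token classification in this commit can be sketched as a small pure function. Only the `BACKEND_TOKEN_DEAD` kind, the 401/403 trigger, and the remediation command come from the commit message; the function name, the `BACKEND_ERROR` fallback kind, and the exact message wording are assumptions.

```typescript
// Hypothetical classification helper; not the rotator's actual code.
type BackendFailureKind = "BACKEND_TOKEN_DEAD" | "BACKEND_ERROR";

interface ClassifiedFailure {
  kind: BackendFailureKind;
  message: string;
}

function classifyBackendFailure(
  status: number,
  backendName: string,
): ClassifiedFailure {
  // 401/403 from mint or lookup-self means the stored token is no longer valid.
  if (status === 401 || status === 403) {
    return {
      kind: "BACKEND_TOKEN_DEAD",
      message:
        `backend "${backendName}" rejected its token (HTTP ${status}); ` +
        "mint a new token via root, then run " +
        "mcpctl create secret <name> --data <key>=<token> --force",
    };
  }
  // Assumed fallback kind for everything else (network, 5xx, and so on).
  return {
    kind: "BACKEND_ERROR",
    message: `backend "${backendName}" failed with HTTP ${status}`,
  };
}
```

The same `message` string could then be written to `tokenMeta.lastRotationError` and to the structured log entry, so both surfaces show identical remediation text.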
Michal	8b56f09f25	feat(agents): smoke tests + README + docs (Stage 6, final) Closes the agents feature. Smoke tests (run via `pnpm test:smoke` against a live mcpd at $MCPD_URL, default https://mcpctl.ad.itaz.eu): * tests/smoke/agent.smoke.test.ts — full CRUD round-trip: create secret + Llm + agent with sampling defaults; `get agents` surfaces it; `get agent foo -o yaml | apply -f` round-trips identically; create + list a thread via the HTTP API; agent delete leaves Llm + secret intact (Restrict + SetNull as designed). Self-skips with a warning when /healthz is unreachable. * tests/smoke/agent-chat.smoke.test.ts — gated on MCPCTL_SMOKE_LLM_URL + MCPCTL_SMOKE_LLM_KEY. Provisions secret + Llm + agent against a real upstream, runs `mcpctl chat -m … --no-stream` (asserts a reply lands), then runs the streaming default (asserts text on stdout + `(thread: …)` on stderr). The fast path for verifying the in-cluster qwen3-thinking deployment: MCPCTL_SMOKE_LLM_URL=http://litellm.nvidia-nim.svc.cluster.local:4000/v1 \ MCPCTL_SMOKE_LLM_MODEL=qwen3-thinking \ MCPCTL_SMOKE_LLM_KEY=$(pulumi config get --stack homelab \ secrets:litellmMcpctlGatewayToken) \ pnpm test:smoke Docs: * README.md — new "Agents" section under Resources with the qwen3-thinking quickstart and links to docs/agents.md and docs/chat.md. Adds llm + agent rows to the resources table. * docs/agents.md (new) — full reference: data model, chat-parameter table, HTTP API, RBAC mapping, tool-use loop semantics, yaml round-trip shorthand, the kubernetes-deployment wiring recipe, and a troubleshooting section (namespace collision, llm-in-use, pending-row recovery, Anthropic-tool limitation). * docs/chat.md (new) — user-facing `mcpctl chat` walkthrough: modes, per-call flags, slash-commands, threads, and a troubleshooting section. * CLAUDE.md — adds a "Resource types" cheatsheet with one-line pointers to each, including the new `agent` row that links to the docs. 
All suites still green: mcpd 759/759, mcplocal 715/715, cli 430/430. Smoke tests typecheck and self-skip when no live mcpd is reachable. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-25 17:08:37 +01:00
Michal	727e7d628c	feat(agents): mcpctl chat REPL + agent CRUD + completions (Stage 5) This is the moment the user can actually talk to an agent end-to-end: mcpctl create llm qwen3-thinking --type openai --model qwen3-thinking \ --url http://litellm.nvidia-nim.svc.cluster.local:4000/v1 \ --api-key-ref litellm-key/API_KEY mcpctl create agent reviewer --llm qwen3-thinking --project mcpctl-dev \ --description "I review security design — ask me after each major change." mcpctl chat reviewer Pieces: * src/cli/src/commands/chat.ts (new) — REPL + one-shot. Streams the SSE endpoint and prints text deltas to stdout as they arrive; tool_call / tool_result events go to stderr in dim-style brackets so the chat output stays clean. LiteLLM-style flags (--temperature / --top-p / --top-k / --max-tokens / --seed / --stop / --allow-tool / --extra) layer over agent.defaultParams. In-REPL slash-commands: /set KEY VAL, /system <text>, /tools (list project's MCP servers), /clear (new thread), /save (PATCH agent.defaultParams = current overrides), /quit. * src/cli/src/commands/create.ts — `create agent` mirroring the llm pattern. Every yaml-applyable field has a corresponding flag (memory rule); --default-temperature / --default-top-p / --default-top-k / --default-max-tokens / --default-seed / --default-stop / --default-extra / --default-params-file all populate agent.defaultParams. * src/cli/src/commands/apply.ts — AgentSpecSchema accepts both `llm: qwen3-thinking` shorthand and `llm: { name: ... }` long form; runs after llms in the apply order so apiKey/llm references resolve. Round-trips with `get agent foo -o yaml | apply -f -` (memory rule). * src/cli/src/commands/get.ts — agentColumns (NAME, LLM, PROJECT, DESCRIPTION, ID); RESOURCE_KIND mapping for yaml export. * src/cli/src/commands/shared.ts — `agent`/`agents`/`thread`/`threads` added to RESOURCE_ALIASES. 
* src/cli/src/index.ts — wires createChatCommand into the program; passes the resolved baseUrl + token so chat can stream SSE without going through ApiClient (which only does buffered request/response). * completions/mcpctl.{fish,bash} regenerated. scripts/generate-completions.ts knows about agents (canonical + aliases) and emits a special-case `chat)` block that completes the first arg with `mcpctl get agents` names. tests/completions.test.ts: +9 new assertions covering agents in the resource list, chat in the commands list, --llm flag for create agent, agent-name completion for chat, etc. CLI suite: 430/430 (was 421). Completions --check is clean. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-25 17:02:38 +01:00
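The two accepted spellings of the `llm` field (`llm: qwen3-thinking` and `llm: { name: ... }`) can be folded into one shape before the rest of apply runs. A minimal sketch, assuming a hypothetical `normalizeLlmRef` helper; the real AgentSpecSchema is not shown in the log and may do this inside the schema itself.

```typescript
// Hypothetical normaliser for the shorthand and long forms of `llm`.
type LlmRef = string | { name: string };

function normalizeLlmRef(ref: LlmRef): { name: string } {
  // Shorthand is a bare name; long form already carries `name`.
  return typeof ref === "string" ? { name: ref } : { name: ref.name };
}
```

Normalising once at the edge means downstream code only ever handles the long form, which keeps the yaml round-trip stable regardless of which spelling the user wrote.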
Michal	285be11dd5	feat(agents): mcplocal agents plugin + composePlugins helper (Stage 4) When a Claude (or any other MCP client) connects to a project's mcplocal endpoint, every Agent attached to that project now appears in the session's tools/list as a virtual MCP server named `agent-<agentName>` with one tool `chat`. Calling that tool POSTs to the Stage 3 chat endpoint and returns the assistant's reply as MCP content. The tool's description is the agent's own description, so connecting clients see prose like "I review security design — ask me after each major change." This is what makes one agent reachable from another's MCP session. Plumbing: * src/mcplocal/src/proxymodel/plugins/agents.ts (new) — the plugin. onSessionCreate fetches /api/v1/projects/:p/agents via mcpd, then registers a VirtualServer per agent. The chat tool's inputSchema mirrors the LiteLLM-style override surface (temperature, top_p, top_k, max_tokens, stop, seed, tools_allowlist, extra) plus threadId for follow-ups. Namespace collision with an existing upstream MCP server named `agent-<x>` is detected and skipped with a `ctx.log.warn` line — better to surface the conflict than to silently shadow real tool entries in the virtualTools map. * src/mcplocal/src/proxymodel/plugins/compose.ts (new) — generic N-plugin composition helper. Lifecycle hooks fan out in order; transform hooks (onToolsList, onResourcesList, onPromptsList, onToolCallAfter) pipeline; intercept hooks (onToolCallBefore, onResourceRead, onPromptGet, onInitialize) short-circuit on the first non-null. Generalizes what createDefaultPlugin does for two fixed parents. * src/mcplocal/src/http/project-mcp-endpoint.ts — every project session now uses composePlugins([defaultPlugin, agentsPlugin]) so agents show up no matter which proxymodel the project is on. * Plugin context: added getFromMcpd(path) alongside postToMcpd. The existing postToMcpd was hard-coded to POST; the agents plugin needs GET to discover. 
Wired through plugin.ts → plugin-context.ts → router.ts. Tests: plugin-agents.test.ts (8) — registers per agent, falls back to a generic description, skips on namespace collision, no-ops with zero agents, logs and continues on mcpd error, chat handler POSTs correct body and returns content array, isError surfacing on mcpd error, onSessionDestroy unregisters everything. plugin-compose.test.ts (6) — single-plugin pass-through, empty rejection, lifecycle ordering, intercept short-circuit, list pipeline, no-op composition stays minimal. mcplocal suite: 715/715. mcpd suite still 759/759. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-25 16:51:44 +01:00
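The composition rules described for compose.ts (transform hooks pipeline, intercept hooks short-circuit on the first non-null, empty input rejected) can be sketched as follows. This is a simplified, synchronous model with only two hooks; the `Plugin` and `Tool` types here are illustrative stand-ins, and the real hooks are presumably asynchronous and richer.

```typescript
// Simplified stand-ins for the real plugin types (assumptions).
interface Tool {
  name: string;
}

interface Plugin {
  name: string;
  // Transform hook: receives the list so far and returns the next list.
  onToolsList?: (tools: Tool[]) => Tool[];
  // Intercept hook: returns a result to short-circuit, or null to pass.
  onToolCallBefore?: (toolName: string) => string | null;
}

function composePlugins(plugins: Plugin[]): Plugin {
  if (plugins.length === 0) {
    throw new Error("composePlugins requires at least one plugin");
  }
  return {
    name: plugins.map((p) => p.name).join("+"),
    // Each plugin sees the output of the previous one, in order.
    onToolsList: (tools) =>
      plugins.reduce(
        (acc, p) => (p.onToolsList ? p.onToolsList(acc) : acc),
        tools,
      ),
    // The first plugin that returns non-null wins; later ones are not asked.
    onToolCallBefore: (toolName) => {
      for (const p of plugins) {
        const result = p.onToolCallBefore ? p.onToolCallBefore(toolName) : null;
        if (result !== null) return result;
      }
      return null;
    },
  };
}
```

Ordering matters in both directions: the first plugin in the array transforms the list first, and it also gets the first chance to intercept a call.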
Michal	03ae4e15f7	feat(agents): mcpd routes + RBAC + tool dispatcher (Stage 3) Wires the Stage 2 services into HTTP. New routes: GET /api/v1/agents — list GET /api/v1/agents/:idOrName — describe POST /api/v1/agents — create PUT /api/v1/agents/:idOrName — update DELETE /api/v1/agents/:idOrName — delete GET /api/v1/projects/:p/agents — project-scoped list (mcplocal disco) POST /api/v1/agents/:name/chat — chat (non-streaming or SSE stream) POST /api/v1/agents/:name/threads — create thread explicitly GET /api/v1/agents/:name/threads — list threads GET /api/v1/threads/:id/messages — replay history The chat endpoint reuses the SSE pattern from llm-infer.ts (same headers incl. X-Accel-Buffering:no, same `data: …\n\n` framing, same `[DONE]` terminator). Each ChatService chunk is one frame. Non-streaming returns {threadId, assistant, turnIndex} as JSON. RBAC mapping in main.ts:mapUrlToPermission: - /agents/:name/{chat,threads} → run:agents:<name> - /threads/:id/ → view:agents (service-level owner check handles fine-grained access since the URL doesn't carry the agent name) - /agents and /agents/:idOrName → default {GET:view, POST:create, PUT:edit, DELETE:delete} on resource 'agents'. 'agents' added to nameResolvers so RBAC's CUID→name lookup works. ChatToolDispatcherImpl bridges ChatService to McpProxyService: it lists a project's MCP servers, fans out tools/list calls to each, namespaces tool names as `<server>__<tool>`, and routes tools/call back to the right serverId on dispatch. tools/list errors on a single server are logged and that server's tools are dropped from the turn's tool surface — one bad server doesn't poison the whole list. Tests: agent-routes.test.ts (15) — full HTTP CRUD round-trip, 404/409 paths, project-scoped list, non-streaming + SSE chat, thread create/list, /threads/:id/messages replay, body-required 400. 
chat-tool-dispatcher.test.ts (7) — empty list when no project / no servers, namespacing + inputSchema forwarding, partial-failure skipping with audit log, callTool dispatch shape, missing-server rejection, JSON-RPC error surfacing. All 22 new green; mcpd suite now 759/759 (was 737). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-25 16:45:15 +01:00
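The `<server>__<tool>` wire format used by the dispatcher can be illustrated with a pair of helpers. The function names are hypothetical, and splitting on the first `__` assumes server names never contain a double underscore; the log does not say how the real dispatcher resolves that ambiguity.

```typescript
// Hypothetical helpers for the <server>__<tool> naming scheme.
const SEPARATOR = "__";

function namespaceTool(server: string, tool: string): string {
  return `${server}${SEPARATOR}${tool}`;
}

function splitNamespacedTool(name: string): { server: string; tool: string } {
  // Assumption: the first separator marks the end of the server name.
  const idx = name.indexOf(SEPARATOR);
  if (idx <= 0 || idx + SEPARATOR.length >= name.length) {
    throw new Error(`not a namespaced tool name: ${name}`);
  }
  return {
    server: name.slice(0, idx),
    tool: name.slice(idx + SEPARATOR.length),
  };
}
```

Under that assumption a tool whose own name contains `__` still round-trips, because everything after the first separator is treated as the tool name.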
Michal	eda8e79712	feat(agents): mcpd repos + Agent/Chat services with tool-use loop (Stage 2) Layers the persistence-side logic on top of the Stage 1 schema. AgentService mirrors LlmService's CRUD shape with name-resolved llm/project references and yaml round-trip support; ChatService is the orchestrator that drives one chat turn end-to-end: build the merged system block (agent.systemPrompt + project Prompts ordered by priority desc + per-call systemAppend), persist the user turn, run the adapter, dispatch any tool_calls through an injected ChatToolDispatcher, persist tool turns linked back via toolCallId, and loop until the model returns terminal text. Per-call params resolve LiteLLM-style: request body → agent.defaultParams → adapter default. The escape hatch `extra` is forwarded as-is so each adapter can cherry-pick provider-specific knobs (Anthropic metadata, vLLM repetition_penalty, etc.) without code changes here. Persistence is non-transactional across the loop because tool calls can take minutes; long-held DB transactions would starve other writers. Instead each in-flight assistant turn is written `pending` and flipped to `complete` only after its tool results land. On failure or max-iter overrun, every `pending` row in the thread is flipped to `error` so the trail is auditable. Tools are namespaced on the wire as `<server>__<tool>`, unmarshalled at dispatch time; `tools_allowlist` filters before the model sees the list. Tests: agent-service.test.ts (7) — CRUD with name-resolved llm/project, conflict on duplicate, llm switch, project detach, listByProject filtering, upsertByName branch coverage. chat-service.test.ts (9) — plain text turn, full text→tool→text loop with toolCallId linkage, max-iter cap leaves zero pending, adapter-throws leaves zero pending, body→defaultParams merge, `extra` passthrough, project-Prompt priority ordering in the system block, tool-without-project rejection, tools_allowlist filtering. 
All 16 green; full mcpd suite still 737/737. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-25 16:38:38 +01:00
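The precedence "request body → agent.defaultParams → adapter default" amounts to a layered merge where later layers win. A sketch under assumptions: the `ChatParams` field list is abbreviated, `resolveParams` is a hypothetical name, and `extra` is treated as an opaque value that is replaced whole rather than deep-merged, which the log does not confirm.

```typescript
// Abbreviated parameter shape (assumption); the real schema has more fields.
type ChatParams = {
  temperature?: number;
  top_p?: number;
  max_tokens?: number;
  extra?: Record<string, unknown>;
};

function resolveParams(
  requestBody: ChatParams,
  agentDefaults: ChatParams,
  adapterDefaults: ChatParams,
): ChatParams {
  const resolved: Record<string, unknown> = {};
  // Lowest priority first; each later layer overwrites defined keys.
  const layers: ChatParams[] = [adapterDefaults, agentDefaults, requestBody];
  for (const layer of layers) {
    for (const key of Object.keys(layer) as (keyof ChatParams)[]) {
      const value = layer[key];
      // Explicit undefined does not erase a lower-priority value.
      if (value !== undefined) resolved[key] = value;
    }
  }
  return resolved as ChatParams;
}
```

For example, a request carrying only `temperature` keeps the agent's `max_tokens` and the adapter's `top_p`, while its own `temperature` overrides both lower layers.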
Michal	3726a65f53	feat(agents): add Agent + ChatThread + ChatMessage schema (Stage 1) Introduces the persistence layer for the upcoming Agent feature: an LLM persona pinned to a specific Llm, optionally attached to a Project, with persisted chat threads/messages so conversations survive REPL exits. Constraint shape: - Agent.llm uses ON DELETE RESTRICT — deleting an Llm in active use fails. - Agent.project uses ON DELETE SET NULL — agents survive project deletion. - ChatThread → ChatMessage cascade so deleting an agent purges its history. - ChatMessage @@unique([threadId, turnIndex]) gives append ordering even under racing writers (services retry on collision). LiteLLM-style per-call overrides will live in Agent.defaultParams (Json); the loose extras Json field is reserved for future LoRA/tool-allowlist work. Pinned vitest fileParallelism=false in @mcpctl/db: all suites share the same Postgres, and adding a second suite exposed FK contention between a clearAllTables in one file and a create in another. Per-test isolation still comes from beforeEach. Tests: 8/8 green in src/db/tests/agent-schema.test.ts (defaults, name uniqueness, llm-in-use Restrict, project-delete SetNull, agent-delete cascade, duplicate (threadId, turnIndex) blocked, tool-call payload round-trip, lastTurnAt DESC ordering). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-25 16:29:55 +01:00
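The "services retry on collision" note about `@@unique([threadId, turnIndex])` can be modelled without a database. In this sketch `MessageStore` is an in-memory stand-in for the ChatMessage table, `tryInsert` returning `false` plays the role of a unique-constraint violation, and both the zero-based index and the retry limit of 5 are assumptions.

```typescript
// In-memory stand-in for the table and its unique (threadId, turnIndex) pair.
class MessageStore {
  private rows: Record<string, string> = {};
  private lastIndex: Record<string, number> = {};

  lastTurnIndex(threadId: string): number {
    const last = this.lastIndex[threadId];
    return last === undefined ? -1 : last;
  }

  // Returns false when the pair already exists (the constraint would reject it).
  tryInsert(threadId: string, turnIndex: number, content: string): boolean {
    const key = `${threadId}:${turnIndex}`;
    if (key in this.rows) return false;
    this.rows[key] = content;
    this.lastIndex[threadId] = Math.max(this.lastTurnIndex(threadId), turnIndex);
    return true;
  }
}

function appendMessage(
  store: MessageStore,
  threadId: string,
  content: string,
  maxAttempts = 5,
): number {
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    // Re-read the tail on every attempt so a lost race picks the next slot.
    const next = store.lastTurnIndex(threadId) + 1;
    if (store.tryInsert(threadId, next, content)) return next;
  }
  throw new Error(`could not append to ${threadId} after ${maxAttempts} attempts`);
}
```

The constraint is what guarantees ordering; the retry loop only decides which writer gets which slot when two of them read the same tail.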
Michal	6ac79de8a4	feat(secrets): one-shot startup backfill for keyNames on existing rows Lazy backfill in SecretService.getById covers per-row retries, but list views still show 'KEYS: -' until each row is described. New backfillSecretKeyNames bootstrap runs once at startup, finds Secrets where keyNames=[] AND data={} (i.e. backend-stored, pre-existing rows), calls resolveData to learn the keys, persists. Sequential to be kind to the upstream backend on cold start. Idempotent + non-fatal.	2026-04-24 01:01:40 +01:00
Michal	9a808877b5	feat(secrets): track key names so list/describe work for backend-stored secrets Post-migration, every Secret on a non-plaintext backend had an empty `data` column (values live in the backend; only externalRef on the row). The CLI's `get secrets` showed `KEYS: -` and `describe secret` showed `(empty)` for all 9 migrated secrets — useless without --show-values. Fix: dedicated `keyNames Json` column on Secret that stores the sorted key list independently from the values. Populated on every write path, lazily backfilled on first read for pre-existing rows that pre-date the column. Schema default `[]` keeps prisma db push self-healing on rolling upgrades. - src/db/prisma/schema.prisma: add Secret.keyNames Json @default("[]") - src/mcpd/src/repositories/secret.repository.ts: pipe keyNames through create + update - src/mcpd/src/services/secret.service.ts: - create/update populate keyNames = sorted Object.keys(data) - getById lazy-backfills empty keyNames (cheap: derives from data for plaintext, single backend read for openbao) - src/mcpd/src/services/secret-migrate.service.ts: migrate writes keyNames alongside the new backendId so freshly-migrated rows are populated without a follow-up read - src/cli/src/commands/get.ts: KEYS column reads keyNames first, falls back to Object.keys(data) for older rows - src/cli/src/commands/describe.ts: shows the Data section keys whenever keyNames OR data has entries (so backend-stored secrets render their key list); --show-values still resolves through the backend After deploy, the 9 already-migrated secrets backfill their keyNames on the next describe-by-id, with no operator action needed. 
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-24 00:57:06 +01:00
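The write-path and display-path rules in this commit (store the sorted key list, read `keyNames` first, fall back to `Object.keys(data)`) are small enough to sketch directly. The `SecretRow` shape, both function names, and the `", "` separator are assumptions; only the `-` placeholder and the sort-then-fallback order come from the commit message.

```typescript
// Illustrative row shape (assumption); the real model has more columns.
interface SecretRow {
  keyNames: string[];
  data: Record<string, string>;
}

// Write path: the key list is stored sorted, independent of the values.
function deriveKeyNames(data: Record<string, string>): string[] {
  return Object.keys(data).sort();
}

// Display path: prefer keyNames, fall back to data for older rows.
function displayKeys(row: SecretRow): string {
  const keys = row.keyNames.length > 0 ? row.keyNames : deriveKeyNames(row.data);
  return keys.length > 0 ? keys.join(", ") : "-";
}
```

A migrated row with empty `data` but populated `keyNames` therefore renders its keys, while a row with neither still renders `-`.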
Michal	b1bccee50d	test(describe): mock the ?reveal=true path on --show-values Follow-up to `faccbb5`: the describe-secret test for --show-values used the old fetchResource shape, so it broke after the route now goes through client.get directly with ?reveal=true.	2026-04-24 00:49:22 +01:00
Michal	faccbb58e7	fix(secrets): describe --show-values resolves through the backend driver Post-migration, every Secret on a non-plaintext backend has empty `Secret.data` (the actual value lives in the backend; only externalRef is on the row). `describe secret --show-values` was reading the raw row, so the user saw "Data: (empty)" for every migrated secret. - Route GET /api/v1/secrets/:id accepts ?reveal=true; when set, resolves the value via SecretService.resolveData() so the response carries the actual data dispatched through the right driver. - CLI --show-values flips that query param. Without --show-values the route returns the raw row exactly as before (no leak risk). Caught running the wizard end-to-end on the live cluster after the ClusterMesh fix on the kubernetes-deployment side made bao reachable. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-24 00:46:54 +01:00
Michal	bf312850b5	fix(openbao): include response body in error messages Debugging the wizard migration flow, every OpenBao error was just `HTTP 403` with no context. The response body often carries the actual reason (missing capability, specific path, namespace mismatch), so surfacing it makes operator debugging a one-step task. Added a shared bodyText() helper that trims huge HTML error pages to 400 chars. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-20 21:01:03 +01:00
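A rough idea of what a `bodyText()`-style trim might look like. Only the helper name and the 400-character limit come from the commit message; taking an already-read string (rather than a response object), trimming whitespace first, and appending `...` as a truncation marker are all assumptions.

```typescript
// Sketch of a body-trimming helper; signature is an assumption.
const MAX_BODY_CHARS = 400;

function bodyText(body: string, limit: number = MAX_BODY_CHARS): string {
  const trimmed = body.trim();
  // Short bodies pass through; long ones (e.g. HTML error pages) are cut.
  return trimmed.length <= limit ? trimmed : `${trimmed.slice(0, limit)}...`;
}
```

An error message built as `HTTP 403: ` followed by `bodyText(body)` would then carry the backend's stated reason without dumping a full HTML page into the logs.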
Michal	72e49f719f	fix(mcpd): skip bootstrap tokens on migrate + back-fill ops on existing admins Two production issues caught running the wizard end-to-end: 1. `mcpctl migrate secrets --from default --to bao` listed `bao-creds` as a candidate — the very token that lets mcpd reach bao. Moving it would brick the auth chain (destination backend needs its own bootstrap token to read its own bootstrap token). Fix: SecretMigrateService now calls backends.list() and filters out any Secret whose name matches ANY SecretBackend's `config.tokenSecretRef.name`. dryRun mirrors the same filter so candidates match reality. `--names` explicitly bypasses the filter for operators who really mean it. 2. Initial rotation in the wizard 403'd because the global RBAC hook demands the `rotate-secretbackend` operation, which wasn't in bootstrap-admin — migrateAdminRole only added ops when processing a legacy `role: admin` entry, so already-migrated admin rows missed every new op added after their initial migration. Fix: migrateAdminRole now also runs a back-fill pass on rows that look admin-equivalent (have both `edit:` and `run:`), appending any missing op from ADMIN_OPS. Writes only when something actually changed, so restarts stay quiet. Same path also retroactively grants `migrate-secrets` which had the same problem yesterday. Tests: 4 new migrate-service cases (bootstrap filter on/off, dryRun parity, --names bypass). Full suite 1889/1889. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-20 20:56:00 +01:00
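The bootstrap-token filter from item 1 can be sketched as a pure function over the two lists. The type shapes and the name `migrationCandidates` are illustrative assumptions; the rule itself (drop any Secret named by any backend's `config.tokenSecretRef.name`, unless names were given explicitly) is from the commit message.

```typescript
// Minimal shapes for illustration (assumptions).
interface SecretBackend {
  name: string;
  config: { tokenSecretRef?: { name: string } };
}

interface Secret {
  name: string;
}

function migrationCandidates(
  secrets: Secret[],
  backends: SecretBackend[],
  explicitNames?: string[],
): Secret[] {
  // --names bypasses the filter: the operator asked for these by name.
  if (explicitNames && explicitNames.length > 0) {
    return secrets.filter((s) => explicitNames.indexOf(s.name) !== -1);
  }
  // Collect every secret that some backend needs in order to authenticate.
  const bootstrapNames: string[] = [];
  for (const b of backends) {
    const ref = b.config.tokenSecretRef;
    if (ref) bootstrapNames.push(ref.name);
  }
  return secrets.filter((s) => bootstrapNames.indexOf(s.name) === -1);
}
```

Using the same function for both the dry run and the real migration is one way to keep the candidate lists identical, which is the parity the commit's tests check.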
Michal	1c5301289c	refactor(wizard): rename --admin-token → --setup-token Any token with policy-write + auth/token admin works; root is a convenient default but a scoped service account is fine too. The previous naming misrepresented the permission floor as root-only. - flag: --admin-token → --setup-token - wizard field: adminToken → setupToken - prompt label: "OpenBao admin / root token" → "OpenBao setup token (needs policy write + auth/token admin perms; root is fine)" - file doc + one comment reworded - tests updated for the new label - regression test (token-absent-from-stdout) kept unchanged Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-20 17:27:09 +01:00
Michal	dd4246878d	feat(openbao): wizard-provisioning + daily token rotation One-command setup replaces the 6-step manual flow — `mcpctl create secretbackend bao --type openbao --wizard` takes the OpenBao admin token once, provisions a narrow policy + token role, mints the first periodic token, stores it on mcpd, verifies end-to-end, and prints the migration command. The admin token is NEVER persisted. The stored credential auto-rotates daily: mcpd mints a successor via the token role (self-rotation capability is part of the policy it was issued with), verifies the successor, writes it over the backing Secret, then revokes the predecessor by accessor. TTL 720h means a week of rotation failures still leaves 20+ days of runway. Shared: - New `@mcpctl/shared/vault` — pure HTTP wrappers (verifyHealth, ensureKvV2, writePolicy, ensureTokenRole, mintRoleToken, revokeAccessor, lookupSelf, testWriteReadDelete) and policy HCL builder. mcpd: - `tokenMeta Json @default("{}")` on SecretBackend. Self-healing schema migration — empty default lets `prisma db push` add the column cleanly. - SecretBackendRotator.rotateOne: mint → verify → persist → revoke-old → update tokenMeta. Failures surface via `lastRotationError` on the row; the old token keeps working. - SecretBackendRotatorLoop: on startup rotates overdue backends, schedules per-backend timers with ±10min jitter. Stops cleanly on shutdown. - New `POST /api/v1/secretbackends/:id/rotate` (operation `rotate-secretbackend` — added to bootstrap-admin's auto-migrated ops alongside migrate-secrets, which was previously missing too). 
CLI: - `--wizard` on `create secretbackend` delegates to the interactive flow. All prompts can be pre-answered via flags (--url, --admin-token, --mount, --path-prefix, --policy-name, --token-role, --no-promote-default) for CI. - `mcpctl rotate secretbackend <name>` — convenience verb; hits the new rotate endpoint. - `describe secretbackend` renders a Token health section (healthy / STALE / WARNING / ERROR) with generated/renewal/expiry timestamps and last rotation error. Only shown when tokenMeta.rotatable is true — the existing k8s-auth + static-token backends don't surface it. Tests: 15 vault-client unit tests (shared), 8 rotator unit tests (mcpd), 3 wizard flow tests (cli, including a regression test that the admin token never appears in stdout). Full suite 1885/1885 (+32). Completions regenerated for the new flags. Out of scope (explicit): kubernetes-auth wizard, Vault Enterprise namespaces in the wizard path, rotation for non-wizard static-token backends. See plan file for details. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-20 17:20:37 +01:00
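The rotation order described in the commit above (mint, verify, persist, revoke the predecessor) can be sketched as follows. This is a minimal TypeScript illustration under stated assumptions, not the actual `SecretBackendRotator.rotateOne`: `RotationDeps`, `rotateOnce`, and the result shape are hypothetical names, and treating the final revoke as best-effort is an assumption of this sketch.

```typescript
// Hypothetical sketch: names and shapes are illustrative, not mcpd's real API.
interface MintedToken {
  token: string;
  accessor: string;
}

interface RotationDeps {
  mint(): Promise<MintedToken>;            // request a successor token
  verify(token: string): Promise<void>;    // throws if the successor is unusable
  persist(token: string): Promise<void>;   // overwrite the stored credential
  revoke(accessor: string): Promise<void>; // revoke the predecessor by accessor
}

interface RotationResult {
  accessor: string;           // accessor of the token that is live afterwards
  rotated: boolean;
  lastRotationError?: string; // mirrors the field name mentioned in the commit
}

function messageOf(err: unknown): string {
  return err instanceof Error ? err.message : String(err);
}

async function rotateOnce(deps: RotationDeps, currentAccessor: string): Promise<RotationResult> {
  try {
    const next = await deps.mint();
    await deps.verify(next.token);
    await deps.persist(next.token);
    try {
      await deps.revoke(currentAccessor);
    } catch (err) {
      // Assumption: a failed revoke is recorded but does not undo the rotation.
      return { accessor: next.accessor, rotated: true, lastRotationError: messageOf(err) };
    }
    return { accessor: next.accessor, rotated: true };
  } catch (err) {
    // Nothing was swapped before the failure, so the old token keeps working.
    return { accessor: currentAccessor, rotated: false, lastRotationError: messageOf(err) };
  }
}
```

Because the predecessor is revoked only after the successor has been verified and stored, a failure at any earlier step leaves the existing credential untouched.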
Michal	515206685b	feat(openbao): kubernetes ServiceAccount auth — no static token in DB Why: requiring a static OpenBao root token to live (even once-bootstrap) on the plaintext backend is the weakest link in the chain. With the bao-side Kubernetes auth method enabled, mcpd's pod can authenticate using its own projected SA token, exchange it for a short-lived Vault client token, and keep the database free of any vault credentials at all. Driver changes (src/mcpd/src/services/secret-backends/openbao.ts): - New `OpenBaoConfig.auth = 'token' \| 'kubernetes'`. Defaults to 'token' so existing rows keep working. Both shapes share url + mount + pathPrefix + namespace; auth-specific fields are mutually exclusive in the config schema. - Kubernetes auth flow: read JWT from /var/run/secrets/.../token, POST to /v1/auth/<authMount>/login {role, jwt}, cache the returned client_token for `lease_duration - 60s` (grace window), then re-login. - One-shot 403-retry: if a request comes back 403 (revoked / clock skew), purge cache and retry the original request once with a fresh login. - Reads + writes go through the same getToken() path so token-auth is unchanged for existing deployments. CLI (src/cli/src/commands/create.ts): - `mcpctl create secretbackend bao --type openbao --auth kubernetes \ --url https://bao.example:8200 --role mcpctl` - Optional `--auth-mount` (default 'kubernetes') + `--sa-token-path` (default the standard projected-token path) for non-default deployments. - Token-auth path unchanged: `--auth token --token-secret SECRET/KEY` (or omit `--auth` since 'token' is the default).
Validation (factory.ts) gates on the auth strategy: each path enforces its own required fields and produces a clear error if misconfigured. Tests: 6 new k8s-auth unit cases (login wire shape, lease-based caching, custom authMount, 403-on-login, missing-role rejection, missing-tokenSecretRef rejection). Full suite 1859/1859. Completions regenerated for the new flags. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-19 23:23:05 +01:00
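The lease-based caching and the one-shot 403 retry from the commit above can be illustrated with a small sketch. Only the 60-second grace window and the single retry come from the commit message; `LeaseTokenCache`, `withOneRetryOn403`, the `LoginResult` shape, and the injected clock are hypothetical names chosen for illustration, and the real driver in `openbao.ts` may be structured differently.

```typescript
// Hypothetical sketch: not the actual OpenBao driver code.
interface LoginResult {
  clientToken: string;
  leaseDurationSec: number;
}

class LeaseTokenCache {
  private token: string | null = null;
  private expiresAtMs = 0;

  constructor(
    private readonly login: () => Promise<LoginResult>,
    private readonly now: () => number = () => Date.now(),
    private readonly graceSec: number = 60,
  ) {}

  // Returns the cached token until (lease_duration - grace) has elapsed, then logs in again.
  async getToken(): Promise<string> {
    if (this.token !== null && this.now() < this.expiresAtMs) {
      return this.token;
    }
    const res = await this.login();
    this.token = res.clientToken;
    this.expiresAtMs = this.now() + Math.max(0, res.leaseDurationSec - this.graceSec) * 1000;
    return this.token;
  }

  purge(): void {
    this.token = null;
    this.expiresAtMs = 0;
  }
}

// Sends a request once; on 403, purges the cache and retries exactly once with a fresh login.
async function withOneRetryOn403(
  cache: LeaseTokenCache,
  send: (token: string) => Promise<{ status: number }>,
): Promise<{ status: number }> {
  const first = await send(await cache.getToken());
  if (first.status !== 403) {
    return first;
  }
  cache.purge();
  return send(await cache.getToken());
}
```

A second 403 is returned to the caller as-is, so a genuinely forbidden request cannot loop.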
Michal	a21220b6f6	fix(deploy): self-healing pre-migrate bootstrap for SecretBackend rollout Why: clusters upgrading from the pre-SecretBackend schema crash-loop on the first rollout. `prisma db push` applies the Phase 0 migration as three sequential steps — add Secret.backendId column (default ''), create SecretBackend table, add FK — and the FK fails because empty-string values reference no row in the empty SecretBackend table. This happened on the live cluster today; I fixed it by hand with psql. This PR makes the fix automatic so a fresh cluster or anyone replaying the migration doesn't hit the same trap. - New `src/db/src/scripts/pre-migrate-bootstrap.ts` — idempotent node script. Checks if SecretBackend table exists; if so, ensures a default row exists (insert on conflict noop), then backfills any Secret.backendId = '' to point at it. Uses Prisma raw queries so it runs against a partially-migrated schema. - `deploy/entrypoint.sh` now catches a failed first push, runs the bootstrap, and retries. Fresh installs and fully-migrated clusters take the happy path (one push, no bootstrap needed). Pre-Phase-0 upgrades take the healing path (push fails → bootstrap seeds → retry succeeds). - The bootstrap is deliberately non-fatal — even on unexpected errors it logs and exits 0 so the retry still runs. If that retry also fails, the push error surfaces normally and the pod crash-loops visibly rather than silently starting in a half-migrated state. Verified the idempotent path logically: on the already-bootstrapped cluster (1 backend row, 0 empty-backendId Secrets), the script's UPDATE matches zero rows and the INSERT hits ON CONFLICT DO NOTHING — pure no-op.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-19 22:59:07 +01:00
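The push, bootstrap, retry flow from the commit above can be modelled in a few lines. The real logic lives in shell (`deploy/entrypoint.sh`); this TypeScript version is only a sketch of the control flow, and `pushWithHealing`, `Step`, and the `log` parameter are hypothetical names.

```typescript
// Hypothetical model of the entrypoint control flow; the actual script is shell.
type Step = () => void; // a step signals failure by throwing

function pushWithHealing(
  push: Step,
  bootstrap: Step,
  log: (msg: string) => void = () => {},
): void {
  try {
    push();
    return; // happy path: fresh installs and fully migrated clusters stop here
  } catch (err) {
    log("first push failed, running bootstrap: " + String(err));
  }
  try {
    bootstrap();
  } catch (err) {
    // Deliberately non-fatal: log and fall through so the retry still runs.
    log("bootstrap failed (ignored): " + String(err));
  }
  // If the retry throws, the error propagates and the failure stays visible.
  push();
}
```

Keeping the bootstrap non-fatal means the only error that can stop startup is the push error itself, which is the one an operator needs to see.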
Michal	d5236171cc	fix(smoke): use json output for llm apiKeyRef assertion The table KEY column truncates at ~34 chars so `secret://<name>/<key>` wasn't appearing verbatim in stdout — the assertion was correct but brittle against presentation choices. Switched to `-o json` where the ref round-trips as a structured object, which is what actually matters. Caught by the live-cluster smoke run right after Phase 0-4 rolled out. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-19 22:55:39 +01:00
Michal	860033d3de	fix(db): make Secret.backendId default to empty string for rollout migration Why: `prisma db push` refused to add the required `backendId` column on clusters with pre-existing Secret rows — it can't assign NOT NULL without a default, and the cluster DB had 9 live rows. The mcpd pod crash-looped during the Phase 0 rollout because of this. Empty-string default lets the schema apply cleanly; `bootstrapSecretBackends` (which runs on every startup) then rewrites those empty values to the seeded `default` plaintext backend's id. New writes via SecretService always carry a real FK immediately, so the empty-string state only exists during the one-shot migration window. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-19 22:45:08 +01:00
Michal	58788bc120	test(smoke): end-to-end coverage for SecretBackend, Llm, infer proxy, project-llm-ref Covers the Phase 0-4 CLI contract against live mcpd. Matches the existing mcptoken.smoke pattern: skip gracefully on unreachable /healthz, cleanup fixtures in afterAll, use --direct to bypass mcplocal for admin operations. - secretbackend.smoke.test.ts · seeded plaintext default exists + isDefault · create/describe/delete round-trip · refuses to delete the default backend (409 shape) · get -o yaml output starts with `kind: secretbackend` (apply-compatible) - llm.smoke.test.ts · create secret + llm with --api-key-ref, verify describe hides the raw value but surfaces secret://name/key · yaml round-trip: get -o yaml > file → amend → apply -f → describe shows change · deleting the llm leaves the underlying Secret intact (onDelete: SetNull) - llm-infer.smoke.test.ts · 404 for unknown name, 400 for missing messages · 5xx when upstream url is unreachable (proxy returns a structured error) · opt-in happy-path gated on LLM_INFER_SMOKE_REAL=1 + LLM_INFER_SMOKE_LLM=<name> so CI doesn't need a real provider key - project-llm-ref.smoke.test.ts · describe project with --llm <registered> — no warning · describe project with --llm <nonexistent> — shows "warning: …registry default" · describe project with --llm none — explicit disable, no warning These require PRs #51-55 to be merged and fulldeploy.sh run before they'll find the new endpoints on live mcpd. Until then they skip or fail with "Not Found". Unit tests for the same code paths (1853 total) continue to pass against mocks. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-19 22:09:41 +01:00
Michal	de854b1944	feat(project): Project.llmProvider semantically names an Llm resource Why: Phases 0-3 built the server-managed Llm registry; this phase pivots the existing Project.llmProvider column from "local provider hint" to "named Llm reference" so operators can pick a centralised Llm per project. No schema change — the column stays a free-form string for backward compat. - `mcpctl create project --llm <name>` (+ `--llm-model <override>`) sets llmProvider/llmModel to a centralised Llm reference, or 'none' to disable. - `mcpctl describe project` fetches the Llm catalogue alongside prompts and flags values that don't resolve with a visible warning. 'none' is treated as an explicit disable, not an orphan. - `apply -f` doc comments updated; --llm-provider still accepted but now documented as naming an Llm resource. - New `resolveProjectLlmReference(mcpdClient, name)` helper in mcplocal's discovery: returns `registered`/`disabled`/`unregistered`/`unreachable`. The HTTP-mode proxy-model pipeline will consume this when it pivots to mcpd's /api/v1/llms/:name/infer proxy. - project-mcp-endpoint.ts cache-namespace path gets a comment explaining the new resolution order — behavior unchanged, just clarified. Tests: 6 resolver unit tests + 3 new describe-warning cases. Full suite 1853/1853 (+9 from Phase 3's 1844). TypeScript clean; completions regenerated for the new create-project flags. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-19 18:28:46 +01:00
Michal	4d8ee23d0e	feat(mcplocal): RBAC-bounded vllm-managed failover + name-based llm lookup Why: when mcpd's inference proxy is unreachable, clients with a local vllm-managed provider should be able to substitute — but only if they still have view permission on the centralized Llm. Otherwise revoking an Llm wouldn't actually stop a misbehaving client. Infrastructure (the agent + mcplocal HTTP-mode wire-up will land separately when those clients pivot to mcpd's proxy): - LlmProviderFileEntry gains optional `failoverFor: <central llm name>`. The entry is otherwise the same local provider it always was; the new field just declares which central Llm it can substitute for. - ProviderRegistry tracks a failover map (registerFailover / getFailoverFor / listFailovers). Unregister removes any failover entry pointing at the removed provider so we don't end up with dangling references. - New FailoverRouter wraps a primary inference call. On primary failure: if a local provider is registered for the Llm, HEAD-probe `mcpd /api/v1/llms/:name` with the caller's bearer to verify view permission, then either invoke the local provider (allowed) or re-throw the primary error (403, 401, network unreachable, anything else — all fail-closed). - Server: GET /api/v1/llms/:idOrName accepts both CUID and human name. Lets FailoverRouter probe by name without a separate id-resolution call. HEAD derives automatically from GET in Fastify, which runs the same RBAC hook and drops the body — exactly what the probe needs. Tests: 11 failover unit tests (registry map, decision flow, fail-closed for forbidden + unreachable, checkAuth status mapping) + 4 new route tests (name lookup, HEAD existing/missing). Full suite 1844/1844 (+14 from Phase 2's 1830). TypeScript clean across mcpd + mcplocal. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-19 13:05:43 +01:00
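The fail-closed decision flow in the commit above can be sketched as a single wrapper. This is an illustration under stated assumptions rather than the real `FailoverRouter`: `inferWithFailover` and the `probeViewPermission` callback (assumed to resolve to the HTTP status of the HEAD probe) are hypothetical names.

```typescript
// Hypothetical sketch of the fail-closed failover decision; not the real FailoverRouter.
async function inferWithFailover<T>(
  primary: () => Promise<T>,
  local: (() => Promise<T>) | undefined,
  probeViewPermission: () => Promise<number>, // assumed to resolve to an HTTP status code
): Promise<T> {
  try {
    return await primary();
  } catch (primaryErr) {
    // No local provider registered for this Llm: nothing to fail over to.
    if (local === undefined) {
      throw primaryErr;
    }
    let allowed = false;
    try {
      const status = await probeViewPermission();
      allowed = status >= 200 && status < 300;
    } catch {
      // Probe itself failed (for example, network unreachable): fail closed.
      allowed = false;
    }
    if (!allowed) {
      throw primaryErr; // 401, 403, unreachable, anything else
    }
    return local();
  }
}
```

Re-throwing the primary error, instead of a probe-specific one, keeps the caller's error handling identical whether or not a local provider exists.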
Michal	23f53a0798	feat(mcpd): inference proxy — POST /api/v1/llms/:name/infer Why: the point of the Llm resource (Phase 1) is that credentials never leave the server. This lands the proxy: clients POST OpenAI chat/completions to mcpd, mcpd attaches the provider API key server-side, and the response streams back as OpenAI-format SSE. Design: - Wire format client-side is always OpenAI chat/completions — every existing SDK speaks it. Adapters translate on the provider side. - `openai \| vllm \| deepseek \| ollama` → pure passthrough (they already speak OpenAI). `anthropic` → translator to/from Anthropic Messages API (system-string extraction, content-block flattening, SSE event remap). - Plain fetch; no @anthropic-ai/sdk dep. Consistent with the OpenBao driver shape and keeps the proxy layer thin. - `gemini-cli` intentionally rejected — subprocess providers need extra lifecycle plumbing; deferred to a follow-up. - Streaming: adapters yield `StreamingChunk`s; the route frames them as `data: <json>\n\n` + terminal `data: [DONE]\n\n` so any OpenAI client works unchanged. RBAC: - New URL special-case in mapUrlToPermission: `POST /api/v1/llms/:name/infer` → `run:llms:<name>` (not the default create:llms). Users need an explicit `{role: 'run', resource: 'llms', [name: X]}` binding to call infer. - Possession of `edit:llms` does NOT imply `run` — keeps catalogue management separate from spend. Audit: route emits an `llm_inference_call` event per request (llm name, model, user/tokenSha, streaming, duration, status). main.ts wires it to the structured logger for now; hook is in place for a richer audit sink later. Tests: - 11 adapter tests (passthrough POST shape + default URLs + no-auth ollama + SSE forwarding; anthropic translate request/response + non-2xx wrap + SSE event translation; registry dispatch + caching + unsupported-provider). 
- 7 route tests (404, 400, non-streaming dispatch + audit, apiKey failure, null apiKeyRef path, streaming SSE output, 502 on adapter error). - Full suite 1830/1830 (+18 from Phase 1's 1812). TypeScript clean. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-18 22:43:55 +01:00
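The SSE framing described in the commit above (`data: <json>\n\n` per chunk, then a terminal `data: [DONE]\n\n`) is small enough to show directly. This is a sketch, assuming chunks are plain JSON-serialisable objects; `frameAsSse` is a hypothetical helper name, not the route's actual function.

```typescript
// Hypothetical helper: frames already-translated chunks as OpenAI-style SSE events.
function frameAsSse(chunks: unknown[]): string[] {
  const frames = chunks.map((chunk) => "data: " + JSON.stringify(chunk) + "\n\n");
  frames.push("data: [DONE]\n\n"); // terminal sentinel expected by OpenAI-compatible clients
  return frames;
}

// Usage: each returned string would be written to the response stream in order.
const exampleFrames = frameAsSse([
  { choices: [{ delta: { content: "Hel" } }] },
  { choices: [{ delta: { content: "lo" } }] },
]);
```

Because every adapter emits the same chunk shape, only this framing step needs to know about the wire format.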
Michal	6ff90a8228	feat(mcpd): Llm resource — CRUD + CLI + apply Why: every client that wants an LLM (the agent, HTTP-mode mcplocal, Claude Code's STDIO mcplocal) today has to know the provider URL + key, and each user's ~/.mcpctl/config.json carries them. Centralising the catalogue on the server is the prerequisite for Phase 2 (mcpd proxies inference so credentials never leave the cluster). This phase adds the `Llm` resource and its CRUD surface — no proxy yet, no client pivot yet. Just enough to register what you have. Schema: - New `Llm` model: name/type/model/url/tier/description + {apiKeySecretId, apiKeySecretKey} FK pair. Reverse `llms` relation on Secret. - Provider types: anthropic \| openai \| deepseek \| vllm \| ollama \| gemini-cli. - Tiers: fast \| heavy. mcpd: - LlmRepository + LlmService + Zod validation schema + /api/v1/llms routes. - API surface exposes `apiKeyRef: {name, key}` — the service translates to/ from the FK pair so clients never deal in cuids. - `resolveApiKey(llmName)` reads through SecretService (which itself dispatches to the right SecretBackend). That's the hook Phase 2's inference proxy uses. - RBAC: added `'llms'` to RBAC_RESOURCES + resource alias. Standard view/create/edit/delete semantics. - Wired into main.ts (repo, service, routes). CLI: - `mcpctl create llm <name> --type X --model Y --tier fast\|heavy --api-key-ref SECRET/KEY [--url ...] [--extra k=v ...]` - `mcpctl get\|describe\|delete llm` — standard resource verbs. - `mcpctl apply -f` with `kind: llm` (single- or multi-doc yaml/json). Applied after secrets, before servers — apiKeyRef resolves an existing Secret. - Shell completions regenerated. Tests: 11 service unit tests + 9 route tests (happy path, 404s, 409, validation). Full suite 1812/1812 (+20 from the 1792 Phase 0 baseline). TypeScript clean. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-18 21:28:43 +01:00
Michal	029c3d5f34	feat(mcpd): pluggable SecretBackend abstraction + OpenBao driver + migrate Why: API keys live in Postgres as plaintext JSON. A DB read exposes every credential in the system. Before centralising more secrets (LLM keys, etc.) we want to be able to point at an external KV store and drop DB access to sensitive rows. New model: - `SecretBackend` resource (CRUD + isDefault invariant) owns how a secret is stored. `Secret` gains `backendId` FK and `externalRef`. Reads/writes dispatch through a driver. - `plaintext` driver (near-noop, uses existing Secret.data column) is seeded as the `default` row at startup. Acts as trust root / bootstrap. - `openbao` driver (also HashiCorp Vault KV v2 compatible) talks plain HTTP, no SDK dependency. Auth via static token pulled from a plaintext-backed `Secret` through the injected SecretRefResolver. Caches resolved token. - `SecretMigrateService` moves secrets one-at-a-time: read → write dest → flip row → best-effort source delete. Interrupted runs are idempotent (skips secrets already on destination). CLI surface: - `mcpctl create\|get\|describe\|delete secretbackend` + `--default` on create. - `mcpctl migrate secrets --from X --to Y [--names a,b] [--keep-source] [--dry-run]` - `apply -f` round-trips secretbackends (yaml/json multi-doc + grouped). - RBAC: `secretbackends` resource + `run:migrate-secrets` operation. - Fish + bash completions regenerated. docs/secret-backends.md covers the OpenBao policy, chicken-and-egg auth flow, and the migration semantics.
Broke the circular dep (OpenBao needs SecretService to resolve its own token, SecretService needs SecretBackendService) with a deferred-resolver bridge in mcpd startup. 11 new driver unit tests; existing env-resolver/secret-route/backup tests updated for the new service signatures. Full suite: 1792/1792. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-18 19:29:55 +01:00
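The per-secret migration steps named in the commit above (read, write to destination, flip the row, best-effort source delete) and their idempotence can be sketched like this. It is a simplified model, not `SecretMigrateService`: `SecretRow`, `SecretDriver`, `migrateSecrets`, and the in-memory row mutation are hypothetical stand-ins for the real database and drivers.

```typescript
// Hypothetical sketch of the migrate loop; the real service works against Prisma rows.
interface SecretRow {
  name: string;
  backend: string; // name of the backend currently holding the value
}

interface SecretDriver {
  read(name: string): Promise<string>;
  write(name: string, value: string): Promise<void>;
  remove(name: string): Promise<void>;
}

interface MigrateOptions {
  from: string;
  to: string;
  keepSource?: boolean;
}

async function migrateSecrets(
  rows: SecretRow[],
  source: SecretDriver,
  dest: SecretDriver,
  opts: MigrateOptions,
): Promise<{ moved: string[]; skipped: string[] }> {
  const moved: string[] = [];
  const skipped: string[] = [];
  for (const row of rows) {
    // Rows not on the source backend (including ones already migrated) are skipped,
    // which is what makes a re-run after an interruption safe.
    if (row.backend !== opts.from) {
      skipped.push(row.name);
      continue;
    }
    const value = await source.read(row.name);
    await dest.write(row.name, value);
    row.backend = opts.to; // flip the row only after the destination write succeeded
    if (!opts.keepSource) {
      try {
        await source.remove(row.name);
      } catch {
        // Best-effort: a leftover source copy does not fail the migration.
      }
    }
    moved.push(row.name);
  }
  return { moved, skipped };
}
```

Flipping the row after the destination write, and deleting the source last, means an interruption at any point leaves each secret readable from the backend its row points at.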
Michal	6946250090	Revert "feat(mcplocal): per-McpToken gate-ungate cache so service tokens survive proxies" This reverts commit `39df459bb1`.	2026-04-18 18:16:18 +01:00
Michal	39df459bb1	feat(mcplocal): per-McpToken gate-ungate cache so service tokens survive proxies Fixes the LiteLLM loop: LiteLLM's /mcp/ proxy doesn't propagate the mcp-session-id header, so every tool call from qwen3 landed on a fresh upstream session, which always started gated, so the only visible tool was begin_session — forever. The session-id gate works fine for Claude Code (stdio, long-lived), but breaks through session-stripping proxies. Identity that DOES survive: the McpToken (always in the Authorization header). So now the gate keys its ungate state on both: - sessionId → per-session (unchanged; Claude Code path) - tokenSha → per-token (NEW; service-token path) Flow for an McpToken caller: 1. first begin_session succeeds → session ungated + tokenSha cached 2. next request lands on a new mcp-session-id (proxy stripped it) 3. SessionGate.createSession sees tokenSha, finds active token entry, starts the new session ungated with the prior tags + retrievedPrompts 4. tools/list on the fresh session returns the full upstream set — no more begin_session loop Plumbing: - AuditCollector.getSessionMcpTokenSha(sessionId) exposes the already-tracked principal. - PluginSessionContext gets getMcpTokenSha() so plugins can read the token identity without knowing about the collector. - SessionGate gains (tokenSha?: string) on createSession/ungate, plus isTokenUngated and revokeToken. TTL defaults to 1hr; tunable via MCPLOCAL_TOKEN_UNGATE_TTL_MS env var. - Gate plugin passes ctx.getMcpTokenSha() at every ungate call site (begin_session, gated-intercept, intercept-fallback).
Tests: 7 new cases in session-gate.test.ts covering cross-session persistence, token isolation, STDIO-path unchanged, TTL expiry, revokeToken, and the empty-string edge case. 21/21 pass; 690/690 in mcplocal overall. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-18 17:34:28 +01:00
Michal	75fe0533c1	fix(mcplocal): propagate caller's bearer to prompt-index and LLM-config calls The proxy-path fix (`5d10728`) covered upstream tools/call routing via McpdUpstream, but getOrCreateRouter in project-mcp-endpoint.ts had TWO more mcpd-bound call sites that silently fell back to the pod's empty default token: 1. fetchProjectLlmConfig(mcpdClient, projectName) 2. router.setPromptConfig(mcpdClient.withHeaders({...})) → which is what gate.ts begin_session uses via ctx.fetchPromptIndex() to hit /api/v1/projects/:name/prompts/visible Symptom: in the k8s mcplocal pod, LiteLLM would initialize + tools/list fine (showing begin_session), but tools/call begin_session returned `{isError: true, content: "McpError: Authentication failed: invalid or expired token"}`. Reproduced against the live cluster by driving LiteLLM's /mcp/ endpoint with qwen3-thinking's exact payload. Fix: build `requestClient = mcpdClient.withToken(authToken)` once at the top of getOrCreateRouter and thread it through fetchProjectLlmConfig and setPromptConfig. withHeaders still adds X-Service-Account for mcpd-side audit tagging, but the bearer now carries the caller's McpToken identity (resolves as McpToken:<sha> on mcpd). Verified: unit tests pass (mock needed withToken/withTimeout stubs). Next step: rebuild image + roll pod + retest LiteLLM→mcp flow. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-18 04:44:27 +01:00
Michal	5d1072889f	fix(mcplocal): thread client bearer into per-upstream McpdClient Symptom: HTTP-mode mcplocal accepted the incoming mcpctl_pat_ bearer, but every /api/v1/mcp/proxy call to mcpd for upstream discovery came back with "Authentication failed: invalid or expired token" — because those proxy calls were using the pod's DEFAULT McpdClient token, which in a container with no ~/.mcpctl/credentials is the empty string. The discovery GET was correct (explicit authOverride in forward()), but syncUpstreams() then created McpdUpstream instances bound to the original mcpdClient — so every tools/list to each upstream went out with `Authorization: Bearer ` (empty) and mcpd's auth hook rejected it. Fix: add McpdClient.withToken(token) and have refreshProjectUpstreams swap to `mcpdClient.withToken(authToken)` before handing the client to syncUpstreams. This keeps the "pod has no identity" design: the token used for downstream /api/v1/mcp/proxy calls is the caller's McpToken, same as the one used for the initial discovery GET and for introspect. Tested: project-discovery.test.ts + mcpd-upstream.test.ts pass. Next: rebuild + roll the mcplocal image and retry LiteLLM probe. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-18 03:06:55 +01:00
Michal	dfc53cd15e	fix(mcpd): per-route /api/v1/mcp/proxy auth missed McpToken dispatch Symptom: LiteLLM → mcplocal → mcpd proxy calls for project-scoped MCP tool discovery all 401'd with "Authentication failed: invalid or expired token", even though the same mcpctl_pat_ bearer works against /api/v1/mcptokens/introspect and /api/v1/projects/:name/servers. Result: the new k8s mcplocal pod could accept the bearer and respond to /projects/:name/mcp (initialize was 200), but every downstream upstream discovery call through /api/v1/mcp/proxy failed. Root cause: registerMcpProxyRoutes installs its own route-scoped createAuthMiddleware with the `authDeps` parameter it receives. In main.ts that was being constructed with only `findSession` — missing the `findMcpToken` that the GLOBAL auth hook already had. So a mcpctl_pat_ bearer got all the way to the proxy route and then was handed to an old-shape middleware that knew nothing about the prefix. Fix: extract authDeps (findSession + findMcpToken) to a named const and reuse it for both the global hook and the proxy route. Comment at the declaration site warns future additions to keep the two paths in sync — they have to agree or McpToken bearers silently break on whichever one drifts. Verified against the live cluster: LiteLLM's discoverTools path no longer 401s; mcplocal logs now show successful upstream proxy calls. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-18 00:23:44 +01:00
Michal	1887d90821	docs: scrub MCPLOCAL_MCPD_TOKEN — pod has no persistent mcpd identity The earlier plan recommended an MCPLOCAL_MCPD_TOKEN env var so the pod would have a ServiceAccount session into mcpd. It's unnecessary: the pod forwards every inbound client bearer (mcpctl_pat_...) verbatim to mcpd for all downstream calls — both introspect and project discovery. mcpd's auth middleware dispatches on the prefix and resolves the McpToken principal directly. No pod secret, no rotation story. Updates: - serve.ts header: explicit "identity model" section calling this out so future readers don't restore the env var thinking it's missing. - docs/mcptoken-implementation.md: drop the "mount MCPLOCAL_MCPD_TOKEN" Pulumi guidance and the "dedicated ServiceAccount" follow-up item; state the correct image URL (internal 10.0.0.194 registry) and the gated-vs-ungated rule for LLM config mounts. No runtime code changes — serve.ts never actually required the token; this just fixes the documentation and the header comment. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-17 23:54:46 +01:00
Michal	3061a5f6ae	test+feat: token-auth unit coverage + env-tunable introspection TTLs Verifies the HTTP-mode revocation lag ≤ 5s two ways: 1. Unit (tests/http/token-auth.test.ts, 8 cases): Fastify preHandler with injected fetch stub exercises the positive/negative cache directly — first call returns ok:true, we flip the stub to revoked:true, wait past the short positive TTL, next call gets 401 with "revoked". Plus: non-Bearer 401, non-mcpctl_pat_ 401, wrong- project 403, mcpd-unreachable 401, happy-path caching (1 fetch for N requests within TTL), ok:false from mcpd 401. 2. End-to-end (smoke, run manually): added MCPLOCAL_TOKEN_POSITIVE_TTL_MS and MCPLOCAL_TOKEN_NEGATIVE_TTL_MS env vars to serve.ts so the smoke can shrink the 30s positive default for testing. Confirmed: with positive TTL = 2s, the mcptoken.smoke.test.ts revocation case passes against a local serve.js pointed at prod mcpd. Operators get the same knobs in production — default behavior unchanged (30s positive, 5s negative). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-17 23:25:06 +01:00
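The positive/negative TTL behaviour from the commit above can be illustrated with a small cache. The 30s and 5s defaults come from the commit message; `IntrospectionCache`, its method names, and passing the clock value explicitly are hypothetical choices for this sketch, and the real preHandler in `token-auth.ts` may be organised differently.

```typescript
// Hypothetical sketch of a two-TTL introspection cache; not the actual token-auth.ts code.
interface CacheEntry {
  ok: boolean;
  expiresAtMs: number;
}

class IntrospectionCache {
  private readonly entries = new Map<string, CacheEntry>();

  constructor(
    private readonly positiveTtlMs: number = 30000, // valid tokens are trusted longer
    private readonly negativeTtlMs: number = 5000,  // rejections are re-checked sooner
  ) {}

  // Returns the cached verdict, or undefined when mcpd must be asked again.
  get(token: string, nowMs: number): boolean | undefined {
    const entry = this.entries.get(token);
    if (entry === undefined) {
      return undefined;
    }
    if (nowMs >= entry.expiresAtMs) {
      this.entries.delete(token);
      return undefined;
    }
    return entry.ok;
  }

  set(token: string, ok: boolean, nowMs: number): void {
    const ttl = ok ? this.positiveTtlMs : this.negativeTtlMs;
    this.entries.set(token, { ok, expiresAtMs: nowMs + ttl });
  }
}
```

With a layout like this, a revoked token keeps working only until its positive entry expires, which is why shrinking the positive TTL shortens the observable revocation lag.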
Michal	913678e400	fix(smoke): mcptoken — runtime gatewayUp gate + scope revocation case to HTTP-mode Two bugs found while trying to point MCPGW_URL=http://localhost:3200 (the systemd mcplocal) so we could get real smoke coverage before the Pulumi stack for mcp.ad.itaz.eu lands: 1. describe.skipIf(!gatewayUp) was evaluated at parse time, before beforeAll ran, so gatewayUp was always false and the whole suite skipped. Switched to the vllm-managed.test.ts pattern: runtime `if (!gatewayUp) return` at the start of each it(). 2. The revocation 401 assertion only makes sense against the containerized serve.ts entry, which has a 5s negative introspection cache. Against systemd mcplocal the whole project router is cached for minutes, so a deleted token with a warm session still succeeds. Added IS_HTTP_MODE detection (hostname not localhost/127/0.0.0.0, or MCPGW_IS_HTTP_MODE=true) and skip the assertion otherwise — still revoking the token so cleanup runs identically. Run against systemd mcplocal locally: MCPGW_URL=http://localhost:3200 pnpm --filter @mcpctl/mcplocal \\ exec vitest run --config vitest.smoke.config.ts mcptoken → 6/6 pass (revocation case explicitly deferred). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-17 23:20:36 +01:00
Michal	f68e123821	fix(cli): https support in status + api-client; add demo-mcp-call.py - status.ts + api-client.ts now dispatch on URL scheme so an https mcpd URL no longer crashes with "Protocol https: not supported". Caught by fulldeploy smoke runs — status.ts had `import http` only and was synchronously throwing against https://mcpctl.ad.itaz.eu. Each http.get call is wrapped so future scheme-mismatch errors also degrade to "unreachable" instead of a stack trace. - .dockerignore no longer excludes src/mcplocal/ (the new Dockerfile.mcplocal needs those files). - scripts/demo-mcp-call.py: standalone, stdlib-only Python demo that makes an MCP request (initialize + tools/list, optional tools/call) using an mcpctl_pat_ bearer. Counterpart to `mcpctl test mcp` for showing external (e.g. vLLM) clients how the bearer flow works. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-17 22:34:00 +01:00
Michal	2127b41d9f	feat: HTTP-mode mcplocal container + mcpctl test mcp + token-auth preHandler Delivers the final piece of the mcptoken stack: a containerized, network-accessible mcplocal that serves Streamable-HTTP MCP to off-host clients (the vLLM use case), authenticated by project-scoped McpTokens. New binary (same package, new entry): - src/mcplocal/src/serve.ts — HTTP-only entry. Reads MCPLOCAL_MCPD_URL, MCPLOCAL_MCPD_TOKEN, MCPLOCAL_HTTP_HOST/PORT, MCPLOCAL_CACHE_DIR from env. No StdioProxyServer, no --upstream. - src/mcplocal/src/http/token-auth.ts — Fastify preHandler that validates mcpctl_pat_ bearers via mcpd's /api/v1/mcptokens/introspect. 30s positive / 5s negative TTL. Rejects wrong-project with 403. Shared HTTP MCP client: - src/shared/src/mcp-http/ — reusable McpHttpSession with initialize, listTools, callTool, close. Handles http+https, SSE, id correlation, distinct McpProtocolError / McpTransportError. Plus mcpHealthCheck and deriveBaseUrl helpers. New CLI verb `mcpctl test mcp <url>`: - Flags: --token (also $MCPCTL_TOKEN), --tool, --args (JSON), --expect-tools, --timeout, -o text\|json, --no-health. - Exit codes: 0 PASS, 1 TRANSPORT/AUTH FAIL, 2 CONTRACT FAIL. Container + deploy: - deploy/Dockerfile.mcplocal (Node 20 alpine, multi-stage, pnpm workspace, CMD node src/mcplocal/dist/serve.js, VOLUME /var/lib/mcplocal/cache, HEALTHCHECK on :3200/healthz). - scripts/build-mcplocal.sh mirrors build-mcpd.sh. - fulldeploy.sh is now a 4-step pipeline that also builds + rolls out mcplocal (gated on `kubectl get deployment/mcplocal` so the script stays green before the Pulumi stack lands). Audit + cache: - project-mcp-endpoint.ts passes MCPLOCAL_CACHE_DIR into FileCache at both construction sites and, when request.mcpToken is present, calls collector.setSessionMcpToken(id, ...) so audit events carry the tokenName/tokenSha. 
Tests: - 9 unit cases on `mcpctl test mcp` (happy path, health miss, expect-tools hit/miss, transport throw, tool isError, json report, $MCPCTL_TOKEN env fallback, invalid --args). - Smoke test src/mcplocal/tests/smoke/mcptoken.smoke.test.ts — gated on healthz($MCPGW_URL), skipped cleanly when unreachable. Covers happy path, wrong-project 403, --expect-tools contract failure, and revocation 401 within the negative-cache window. 1773/1773 workspace tests pass. Pulumi resources (Deployment, Service, Ingress, PVC, Secret, NetworkPolicy) still need to land in ../kubernetes-deployment before the smoke gate flips on. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-17 01:21:42 +01:00
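The "30s positive / 5s negative TTL" behaviour of the token-auth preHandler in 2127b41d9f could look roughly like the cache below. This is a sketch under assumptions: the class name, the `Introspection` shape, and the injectable clock are all hypothetical, and the real token-auth.ts may be structured differently.

```typescript
// Hypothetical shape of an introspection result.
type Introspection = { active: boolean; projectId?: string };

class IntrospectionCache {
  private readonly entries = new Map<string, { value: Introspection; expiresAt: number }>();
  private readonly positiveTtlMs: number;
  private readonly negativeTtlMs: number;
  private readonly now: () => number;

  constructor(positiveTtlMs = 30_000, negativeTtlMs = 5_000, now: () => number = Date.now) {
    this.positiveTtlMs = positiveTtlMs;
    this.negativeTtlMs = negativeTtlMs;
    this.now = now;
  }

  // Returns the cached result, or undefined when absent or expired.
  get(tokenHash: string): Introspection | undefined {
    const entry = this.entries.get(tokenHash);
    if (!entry) return undefined;
    if (this.now() >= entry.expiresAt) {
      this.entries.delete(tokenHash);
      return undefined;
    }
    return entry.value;
  }

  // Valid tokens are kept longer than rejected ones, so a revocation
  // is noticed within the shorter negative window after re-validation.
  set(tokenHash: string, value: Introspection): void {
    const ttl = value.active ? this.positiveTtlMs : this.negativeTtlMs;
    this.entries.set(tokenHash, { value, expiresAt: this.now() + ttl });
  }
}
```

The short negative TTL is what makes the smoke test's "revocation 401 within the negative-cache window" case observable against the containerized entry.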
Michal	a151b2e756	feat: mcpctl mcptoken verbs + mcpd auth dispatch + audit plumbing Adds the end-to-end CLI surface for McpTokens and the mcpd auth dispatch that recognizes them. mcpd auth middleware: - Dispatch on the `mcpctl_pat_` bearer prefix. McpToken bearers resolve through a new `findMcpToken(hash)` dep, populating `request.mcpToken` and `request.userId = ownerId`. Everything else follows the existing session path. - Returns 401 for revoked / expired / unknown tokens. - Global RBAC hook now threads `mcpTokenSha` into `canAccess` / `canRunOperation` / `getAllowedScope`, and enforces a hard project-scope check: a McpToken principal can only hit `/api/v1/projects/<its-project>/...`. CLI verbs: - `mcpctl create mcptoken <name> -p <proj> [--rbac empty\|clone] [--bind role:view,resource:servers] [--ttl 30d\|never\|ISO] [--description ...] [--force]` — returns the raw token once. - `mcpctl get mcptokens [-p <proj>]` — table with NAME/PROJECT/PREFIX/CREATED/LAST USED/EXPIRES/STATUS. - `mcpctl get mcptoken <name> -p <proj>` and `mcpctl describe mcptoken <name> -p <proj>` — describe surfaces the auto-created RBAC bindings. - `mcpctl delete mcptoken <name> -p <proj>`. - `apply -f` support with `kind: mcptoken`. Tokens are immutable, so apply creates if missing and skips if the name is already active. Audit plumbing: - `AuditEvent` / collector now carry optional `tokenName` / `tokenSha`. `setSessionMcpToken` sits alongside `setSessionUserName`; both feed a per-session principal map used at emit time. - `AuditEventService` query accepts `tokenName` / `tokenSha` filters. - Console `AuditEvent` type carries the new fields so a follow-up can add a TOKEN column. Completions regenerated. 1764/1764 tests pass workspace-wide. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-17 01:12:43 +01:00
Michal	efcfeeab65	feat(cli)!: migrate `create rbac` bindings to --roleBindings kv syntax BREAKING: `mcpctl create rbac` no longer accepts `--binding` or `--operation`. Use `--roleBindings` instead with key:value pairs: # resource binding --roleBindings role:view,resource:servers --roleBindings role:view,resource:servers,name:my-ha # operation binding (role:run is implied by action:) --roleBindings action:logs The on-disk YAML shape (`roleBindings: [{role, resource, name?}]` or `{role:'run', action}`) is unchanged, so Git backups and existing `apply -f` files continue to work. Only the command-line input format changes. The parser is extracted to src/cli/src/commands/rbac-bindings.ts so the upcoming `mcpctl create mcptoken --bind <kv>` verb can reuse it. Completions, tests, and the new parser unit test all pass (406/406). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-17 01:03:57 +01:00
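The `--roleBindings` key:value syntax from efcfeeab65 can be illustrated with a small parser. This is not the code in rbac-bindings.ts; `parseRoleBinding` and its error messages are assumptions made for the example. It only encodes what the commit message states: resource bindings take `role`, `resource`, and optional `name`, and `action:` implies `role: 'run'`.

```typescript
type RoleBinding =
  | { role: string; resource: string; name?: string }
  | { role: "run"; action: string };

// Hypothetical parser for one --roleBindings value such as
// "role:view,resource:servers,name:my-ha" or "action:logs".
function parseRoleBinding(input: string): RoleBinding {
  const kv = new Map<string, string>();
  for (const pair of input.split(",")) {
    const idx = pair.indexOf(":");
    if (idx <= 0 || idx === pair.length - 1) {
      throw new Error(`Invalid key:value pair: "${pair}"`);
    }
    kv.set(pair.slice(0, idx).trim(), pair.slice(idx + 1).trim());
  }

  // Operation binding: role:run is implied by action:.
  const action = kv.get("action");
  if (action !== undefined) return { role: "run", action };

  // Resource binding: role and resource are required, name is optional.
  const role = kv.get("role");
  const resource = kv.get("resource");
  if (!role || !resource) {
    throw new Error(`Binding needs role and resource, or action: "${input}"`);
  }
  const name = kv.get("name");
  return name !== undefined ? { role, resource, name } : { role, resource };
}
```

The output objects match the on-disk YAML shape quoted in the commit, which is why existing `apply -f` files are unaffected by the CLI change.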
Michal	2ddb493bb0	feat(mcpd): McpToken schema + CRUD routes + introspection Adds a new McpToken Prisma model (project-scoped, SHA-256 hashed at rest, optional expiry, revocable) plus backing repository, service, and REST routes. Tokens are a first-class RBAC subject: new 'McpToken' kind is added to the subject enum and the service auto-creates an RbacDefinition with subject McpToken:<sha> when bindings are provided. Creator-permission ceiling: the service rejects any requested binding the creator cannot already satisfy themselves (re-uses rbacService.canAccess / canRunOperation). rbacMode=clone snapshots the creator's full permissions into the token. Routes: POST /api/v1/mcptokens create (returns raw token once) GET /api/v1/mcptokens list (filter by project) GET /api/v1/mcptokens/:id describe (no secret in response) POST /api/v1/mcptokens/:id/revoke soft-delete + remove RbacDef DELETE /api/v1/mcptokens/:id hard-delete GET /api/v1/mcptokens/introspect validate raw bearer (used by mcplocal) Extends AuditEvent with optional tokenName/tokenSha fields (indexed) so token-driven activity can be filtered later. Adds token helpers in @mcpctl/shared: TOKEN_PREFIX='mcpctl_pat_', generateToken, hashToken, isMcpToken, timingSafeEqualHex. Follow-up PRs add the auth-hook dispatch on the prefix, the CLI verbs, and the HTTP-mode mcplocal that calls /introspect. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-17 01:00:04 +01:00
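The token helpers named in 2ddb493bb0 (`TOKEN_PREFIX`, `generateToken`, `hashToken`, `isMcpToken`, `timingSafeEqualHex`) might be implemented along these lines. Only the names and the prefix value come from the commit message; the bodies are assumptions, including the 32 random bytes and the base64url encoding.

```typescript
import { createHash, randomBytes, timingSafeEqual } from "node:crypto";

const TOKEN_PREFIX = "mcpctl_pat_";

// Assumption: 32 random bytes, base64url-encoded, after the prefix.
function generateToken(): string {
  return TOKEN_PREFIX + randomBytes(32).toString("base64url");
}

// SHA-256 hex digest; only this value would be stored at rest.
function hashToken(raw: string): string {
  return createHash("sha256").update(raw).digest("hex");
}

// Cheap prefix check used to route a bearer to the token path.
function isMcpToken(bearer: string): boolean {
  return bearer.startsWith(TOKEN_PREFIX);
}

// Constant-time comparison of two hex digests.
// crypto.timingSafeEqual throws on length mismatch, so check lengths first.
function timingSafeEqualHex(a: string, b: string): boolean {
  const bufA = Buffer.from(a, "hex");
  const bufB = Buffer.from(b, "hex");
  return bufA.length === bufB.length && timingSafeEqual(bufA, bufB);
}
```

Storing only the hash means the raw token can be shown once at creation and never recovered afterwards, consistent with "returns raw token once" in the route list.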
Michal	3149ea3ae7	fix: MCP proxy resilience — discovery cache, default liveness probes Adds a per-server tools/list cache in McpRouter (positive + negative TTL) so a slow or dead upstream only stalls the first discovery call, not every subsequent client request. Invalidated on upstream add/remove. Health probes now apply a default liveness spec (tools/list via the real production path) to any RUNNING instance without an explicit healthCheck, so synthetic and real failures converge on the same signal. Includes supporting updates in mcpd-client, discovery, upstream/mcpd, seeder, and fulldeploy/release scripts. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-17 00:48:57 +01:00
Michal	9ff2dcc3d9	fix: actually wire STDIO attach for docker-image MCP servers Commit `1bd5087` added attachInteractive to the orchestrator interface but never hooked it up in mcp-proxy-service — sendViaPersistentAttach was promised in the commit message but missing from the diff. Servers with a distroless image whose entrypoint IS the MCP server (gitea-mcp) ended up needing a bogus `command: [node, dist/index.js]` workaround that silently failed on every exec, leaving clients with empty tool lists. Changes: - PersistentStdioClient: take a StdioMode discriminated union. Exec mode runs a command via execInteractive; attach mode talks to PID 1 via attachInteractive. - mcp-proxy-service: dispatch by config — command → exec; packageName → exec via runtime runner; dockerImage-only → attach. Error serialization no longer drops non-Error objects as "[object Object]". - templates/gitea.yaml: remove the command workaround; the image CMD runs as PID 1 and mcpd attaches. - Add unit tests covering both modes and the unsupported-orchestrator paths. Also required (separate repo): mcpd's k8s Role needed pods/attach added alongside pods/exec; updated in kubernetes-deployment/…/mcpctl/server.ts and kubectl-patched on the live cluster. Verified end-to-end against mcpctl.ad.itaz.eu: - gitea (attach): 49 tools listed, real tools/call round-trip. - aws-docs (exec via packageName): 4 tools, no regression. - docmost (exec via command): 11 tools, no regression. - mcpd suite: 634/634 passing.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-12 22:26:26 +01:00
Michal	857f8c72ae	fix: MCP proxy resilience — timeouts, parallel discovery, error propagation - McpdClient: add 30s AbortSignal timeout to all fetch calls (was infinite) - CLI bridge: return JSON-RPC error on stdout when HTTP fails (was silent) - Router: parallel tool/resource discovery via Promise.allSettled (was sequential — one slow server blocked all) - vllm-managed: 60s error cooldown prevents retry-on-every-call when vLLM is broken - Tests: McpdClient timeout suite (9), parallel discovery, vllm cooldown, bridge error response Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-10 18:28:03 +01:00
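The two resilience ideas in 857f8c72ae, a per-call timeout and parallel discovery via `Promise.allSettled`, can be combined in a short sketch. The `Upstream` type and `discoverTools` function are hypothetical; in the repository the timeout lives in McpdClient and the fan-out in the router, so this merges them purely for illustration.

```typescript
// Hypothetical upstream: lists its tool names, honouring an abort signal.
type Upstream = { name: string; listTools: (signal: AbortSignal) => Promise<string[]> };

type Discovery = { tools: Record<string, string[]>; errors: Record<string, string> };

// Query every upstream in parallel; one slow or failing server
// no longer blocks or hides the results of the others.
async function discoverTools(upstreams: Upstream[], timeoutMs = 30_000): Promise<Discovery> {
  const settled = await Promise.allSettled(
    upstreams.map((u) => u.listTools(AbortSignal.timeout(timeoutMs))),
  );

  const result: Discovery = { tools: {}, errors: {} };
  settled.forEach((outcome, i) => {
    const name = upstreams[i].name;
    if (outcome.status === "fulfilled") {
      result.tools[name] = outcome.value;
    } else {
      // Keep the failure reason so it can be propagated rather than swallowed.
      result.errors[name] = outcome.reason instanceof Error ? outcome.reason.message : String(outcome.reason);
    }
  });
  return result;
}
```

Unlike `Promise.all`, `Promise.allSettled` never rejects, so the caller always gets the tools from healthy upstreams alongside a per-server error map.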
Michal	383be66286	feat: add backup + server type smoke tests New smoke test file: backup-and-servers.test.ts - Backup completeness: prompts, templates, runtime, command, containerPort, replicas - SSE server proxy (my-home-assistant): 84 tools - Docker-image STDIO proxy (docmost): 11 tools - Package STDIO proxy (aws-docs): 4 tools - Instance status accuracy: RUNNING instances must respond to proxy These tests would have caught every migration bug: - Missing runtime (python servers on node runner) - Missing command (HA SSE in STDIO mode) - Missing containerPort (SSE on wrong port) - Backup data loss (prompts, templates, server fields) Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-10 00:05:54 +01:00
Michal	016f8abe68	fix: accurate instance status — STARTING until pod is actually running Instance status now reflects actual container state: - startOne() sets STARTING (not RUNNING) after container creation - syncStatus() promotes STARTING→RUNNING when pod is ready - syncStatus() demotes RUNNING→STARTING if pod restarts (CrashLoop) - External servers still get RUNNING immediately (no container) Previously, CrashLooping pods showed as RUNNING in mcpctl get instances. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-09 23:45:10 +01:00
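The status transitions listed in 016f8abe68 amount to a small state machine. The sketch below is an assumption-laden illustration: `initialStatus` and `nextStatus` are hypothetical names, and collapsing the pod observation into a single `podReady` boolean is a simplification of whatever syncStatus() actually inspects.

```typescript
type InstanceStatus = "STARTING" | "RUNNING" | "ERROR";

// External servers have no container, so they are RUNNING immediately;
// managed ones start as STARTING until the pod is observed ready.
function initialStatus(isExternal: boolean): InstanceStatus {
  return isExternal ? "RUNNING" : "STARTING";
}

// One sync pass: promote when the pod becomes ready, demote when a
// previously running pod is no longer ready (for example, CrashLoop).
function nextStatus(current: InstanceStatus, podReady: boolean): InstanceStatus {
  if (current === "STARTING" && podReady) return "RUNNING";
  if (current === "RUNNING" && !podReady) return "STARTING";
  return current;
}
```

With this shape a crash-looping pod oscillates between STARTING and RUNNING only while it is genuinely ready, instead of being reported as RUNNING throughout.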
Michal	1bd5087052	fix: add prompts/templates to backup + STDIO attach for docker-image servers Two bugs fixed: 1. Backup completeness: JSON backup API now includes prompts and templates. Previously these were silently dropped during backup/restore, causing data loss on migration. 2. STDIO proxy for docker-image servers: servers with dockerImage but no packageName/command (like docmost) now use k8s Attach to connect to the container's PID 1 stdin/stdout instead of exec. This fixes "has no packageName or command" errors. Changes: - backup-service.ts: add BackupPrompt/BackupTemplate types, export them - restore-service.ts: restore prompts (with project FK) and templates - mcp-proxy-service.ts: sendViaPersistentAttach for docker-image STDIO - orchestrator.ts: add attachInteractive to McpOrchestrator interface - kubernetes-orchestrator.ts: implement attachInteractive via k8s Attach - k8s-client-official.ts: expose Attach client Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-09 23:37:16 +01:00
Michal	d293df738a	feat: automatic reconciliation loop for MCP server instances mcpd now runs a periodic reconcileAll() every 30s that: - Detects crashed/missing containers (syncStatus) - Cleans up ERROR instances - Creates replacement pods to match desired replica count This replaces the old syncStatus-only timer. Servers migrated from another deployment or recovering from node failures will automatically get their instances recreated. 6 new tests for reconcileAll covering: missing instances, skip replicas=0, already-at-count, ERROR cleanup, multi-server, error isolation. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-08 19:00:19 +01:00
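One pass of the reconciliation described in d293df738a can be expressed as a pure planning step. This is a hedged sketch, not the repository's reconcileAll(): `planReconcile`, `Instance`, and `Plan` are hypothetical, and the real loop also talks to the orchestrator and isolates per-server errors.

```typescript
type Instance = { id: string; status: "STARTING" | "RUNNING" | "ERROR" };
type Plan = { remove: string[]; create: number };

// Decide what one reconcile pass should do for a single server:
// clean up ERROR instances, then create enough replacements to
// reach the desired replica count.
function planReconcile(desiredReplicas: number, instances: Instance[]): Plan {
  const remove = instances.filter((i) => i.status === "ERROR").map((i) => i.id);
  const surviving = instances.length - remove.length;
  // Never negative: replicas=0 or an already-satisfied count creates nothing.
  const create = Math.max(0, desiredReplicas - surviving);
  return { remove, create };
}
```

Running such a plan on a 30s timer is what lets servers migrated from another deployment, which start with zero instances, converge on their replica count without manual intervention.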
Michal	14be2fa18e	feat: nodeSelector for MCP server pods + restore fix - Add MCPD_NODE_SELECTOR env var support in manifest generator for mixed-arch clusters (e.g. arm64+amd64) - Fix backup restore: resolve system user ID instead of hardcoded 'system' string Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-08 13:04:34 +01:00
Michal	3663963a32	fix: resolve system user ID in backup restore for projects The restore service hardcoded ownerId as the literal string 'system' instead of looking up the actual system user ID. This caused FK constraint violations when restoring projects to a fresh database. Now resolves the system user by email, falling back to the first available user. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-08 02:04:32 +01:00
Michal	5e45960a18	feat: add Kubernetes orchestrator for MCP server pod management mcpd can now deploy MCP server instances as Kubernetes pods instead of Docker containers. Set MCPD_ORCHESTRATOR=kubernetes to enable. - Add @kubernetes/client-node with thin wrapper (context enforcement via MCPD_K8S_CONTEXT to prevent multi-cluster mishaps) - Rewrite KubernetesOrchestrator: pod CRUD, pod IP extraction, exec via SPDY (one-shot + interactive), log streaming - Manifest generator: stdin:true for STDIO servers, args (not command) to preserve runner image entrypoint, security hardening - Orchestrator selection in main.ts via MCPD_ORCHESTRATOR env var - 25 unit tests for k8s orchestrator, all 624 tests pass Tested end-to-end on local k3s: - mcpd deployed via Pulumi, creates pods in mcpctl-servers namespace - NetworkPolicy verified: only mcpd can reach MCP server pods - Python runner (uvx) successfully runs aws-documentation-mcp-server Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-08 01:55:13 +01:00


159 Commits