Fixes three connected issues where instances came up and were reported as
healthy even though their secret backend was unreachable. The motivating
case: gitea-mcp-server starts when mcpd can't read the
gitea-creds secret from OpenBao, runs with an empty
GITEA_ACCESS_TOKEN, replies fine to tools/list (so liveness passes),
but every authed call fails with "token is required" — and
`mcpctl get instances` cheerfully reports the instance as healthy.
## What changed
### 1. Env resolution failures are now fatal for the start attempt
`src/mcpd/src/services/instance.service.ts`
The previous behaviour swallowed `resolveServerEnv` failures and let
the container start anyway with whatever env survived ("non-fatal —
container may still work if env vars are optional"). That's the bug:
the gitea container started with no token, ran for weeks, and was
reported healthy.
The catch now calls `markInstanceError(instance, "secret resolution
failed: <reason>")` and returns. Optional/missing env vars should be
modelled as `value: ""` entries on the server, not as silent
secret-resolution failures.
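A minimal sketch of the fatal-on-env-failure behaviour described above. The names (`attemptStart`, `resolveServerEnv`, `markInstanceError`) come from the PR text, but the signatures and shapes here are assumptions, not the actual service code:

```typescript
// Hypothetical shapes — the real Instance/service types differ.
type Instance = { id: string; status: string; metadata: Record<string, unknown> };

async function attemptStart(
  instance: Instance,
  resolveServerEnv: () => Promise<Record<string, string>>,
  startContainer: (env: Record<string, string>) => Promise<void>,
  markInstanceError: (instance: Instance, reason: string) => void,
): Promise<void> {
  let env: Record<string, string>;
  try {
    env = await resolveServerEnv();
  } catch (err) {
    // Previously swallowed; now fatal for this start attempt.
    markInstanceError(instance, `secret resolution failed: ${(err as Error).message}`);
    return;
  }
  await startContainer(env);
}
```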
### 2. ERROR instances retry with backoff, not blind churn
Adds Kubernetes-style escalation: 30 s × 5 attempts, then 5 min
pauses thereafter. Retry state lives on `McpInstance.metadata` (no
schema migration) — `attemptCount`, `lastAttemptAt`, `nextRetryAt`,
`error`.
The reconciler no longer tears down ERROR instances and creates
fresh replacements (which would reset attemptCount and effectively
loop at 30 s forever). Instead:
- ERROR rows whose `nextRetryAt` is in the future are LEFT ALONE
and counted against the replica budget — preventing tight create-
fail-create churn while a previous attempt is in its backoff window.
- ERROR rows whose `nextRetryAt` has elapsed are retried IN-PLACE
via a new `retryInstance` method, which preserves attemptCount on
the same row so the schedule actually escalates.
The work has been factored into `startOne` (creates + initial attempt)
+ `attemptStart` (env + container) + `retryInstance` (re-attempt the
same row) + `markInstanceError` (write retry metadata).
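The escalation schedule above (30 s for the first 5 attempts, 5 min thereafter) can be sketched as a pure function over the metadata fields the PR stores; the constants follow the PR text, the function name is illustrative:

```typescript
// Kubernetes-style escalation, per the schedule described in the PR text.
const INITIAL_BACKOFF_MS = 30_000;       // 30 s
const ESCALATED_BACKOFF_MS = 5 * 60_000; // 5 min
const ESCALATION_THRESHOLD = 5;          // attempts before escalating

function nextRetryAt(attemptCount: number, lastAttemptAt: Date): Date {
  const delay =
    attemptCount < ESCALATION_THRESHOLD ? INITIAL_BACKOFF_MS : ESCALATED_BACKOFF_MS;
  return new Date(lastAttemptAt.getTime() + delay);
}
```

Because the retry happens in-place on the same row, `attemptCount` keeps growing past the threshold and the 5 min window sticks.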
### 3. STDIO readiness probe goes through mcpProxyService
`src/mcpd/src/services/health-probe.service.ts`
The legacy `probeStdio` (a `docker exec node -e '... spawn(packageName)
...'` invocation) only worked for packageName-based servers. Image-
based STDIO servers like gitea-mcp-server fell through with "No
packageName or command for STDIO server" and were reported unhealthy
for the WRONG reason — they have no packageName because they are an
image, not because anything's wrong.
New `probeReadinessViaProxy`: sends `tools/call` through the running
container via `mcpProxyService.execute`. Same code path as
production traffic, so probe failures match real failures. Picks up:
- JSON-RPC errors (e.g. "token is required" when env is empty).
- Tool-level errors expressed as `result.isError: true`.
- Connection failures wrapped as exceptions.
- Hard timeouts via the deadline race.
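The failure classification above can be sketched as follows; the `ProxyResult` shape is an assumption modelled on JSON-RPC responses, not the actual `mcpProxyService` return type:

```typescript
// Hypothetical result shape: JSON-RPC envelope plus MCP tool-level error flag.
interface ProxyResult {
  error?: { code: number; message: string }; // JSON-RPC error
  result?: { isError?: boolean };            // tool-level error
}

function probeOutcome(res: ProxyResult): "healthy" | "unhealthy" {
  if (res.error) return "unhealthy";           // e.g. "token is required"
  if (res.result?.isError) return "unhealthy"; // tool reported its own failure
  return "healthy";
}
```

Connection failures and deadline timeouts surface as exceptions around this call, so they map to unhealthy as well.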
After this PR, configuring `gitea` with
`healthCheck: { tool: get_me, intervalSeconds: 60 }` makes
`mcpctl get instances` report it as `unhealthy` whenever the auth
token is missing or wrong — which is honest.
The dead `probeStdio` (~120 LOC) is removed; HTTP/SSE bespoke probe
paths are kept for now (they work and the diff stays minimal).
## Tests
`src/mcpd/tests/instance-service.test.ts`:
- Replaces "cleans up ERROR instances and creates replacements" with
"retries ERROR instances in-place when their backoff has elapsed".
- Adds "leaves ERROR instances alone while their nextRetryAt is in
the future" and "escalates the backoff: attemptCount + nextRetryAt
persist on retry failures".
`src/mcpd/tests/services/health-probe.test.ts`:
- Swaps STDIO probe mocks from `orchestrator.execInContainer` →
`mcpProxyService.execute`.
- Adds "marks unhealthy when proxy returns a JSON-RPC error
(e.g. broken-secret auth failure)" — explicitly the gitea case.
- Adds "marks unhealthy when proxy returns a tool-level error in
result.isError" — covers servers that report tool failures as
isError instead of as JSON-RPC errors.
- Renames "handles exec timeout" → "handles probe timeout" and
exercises the deadline race rather than an exec rejection.
Full suite: 162 test files / 2161 tests green (+4 new).
## Manual verification step (post-deploy)
```bash
mcpctl edit server gitea
# → add healthCheck:
# tool: get_me
# intervalSeconds: 60
# timeoutSeconds: 10
# failureThreshold: 3
```
If OpenBao is still down: gitea instance enters ERROR with
attemptCount + nextRetryAt visible in `mcpctl describe instance`.
Otherwise: gitea env resolves at next start, probe passes, instance
is honestly healthy.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Phase 2 of the Skills + Revisions + Proposals work. Stands up the generic
revision/proposal layer and wires Prompt into it. Skills will plug into the
same infrastructure in PR-3 with no further service changes required.
This PR is intentionally additive: PromptRequest table and routes are
unchanged. The /api/v1/proposals API runs side-by-side with the legacy
/api/v1/promptrequests API. The PromptRequest cutover (rename + backfill +
mcplocal rewire) is deferred to a later PR so this one stays reviewable.
## What's added
### Repositories (src/mcpd/src/repositories/)
- resource-revision.repository.ts — append-only revision log keyed by
(resourceType, resourceId). Soft FK; no relations declared. Supports
history listing, semver lookup, and contentHash cross-resource search.
- resource-proposal.repository.ts — generic propose queue. Status lifecycle
pending → approved | rejected. Mirrors Prompt's `?? ''` workaround for
nullable-FK compound lookups.
### Services (src/mcpd/src/services/)
- resource-revision.service.ts — record() inserts a revision with a stable
sha256 contentHash computed from canonicalised JSON (key-sorted at every
level so reordered objects produce the same hash). Caller passes a
pre-computed semver; service does NOT decide bump policy.
- resource-proposal.service.ts — propose / approve / reject / list, with a
per-resourceType handler registry. PromptService registers the 'prompt'
handler at construction; the SkillService will register 'skill' in PR-3.
approve() runs in a Prisma $transaction so the resource update + revision
insert + proposal status flip are atomic.
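The canonicalised-JSON `contentHash` described under `resource-revision.service.ts` can be sketched like this (a sketch only, assuming the real service hashes the key-sorted JSON serialisation; helper names are illustrative):

```typescript
import { createHash } from "node:crypto";

// Sort keys at every level so reordered objects serialise identically.
function canonicalise(value: unknown): unknown {
  if (Array.isArray(value)) return value.map(canonicalise);
  if (value !== null && typeof value === "object") {
    return Object.fromEntries(
      Object.keys(value as Record<string, unknown>)
        .sort()
        .map((k) => [k, canonicalise((value as Record<string, unknown>)[k])] as [string, unknown]),
    );
  }
  return value;
}

function contentHash(content: unknown): string {
  return createHash("sha256")
    .update(JSON.stringify(canonicalise(content)))
    .digest("hex");
}
```

The stable hash is what makes the cross-resource `contentHash` search meaningful: two resources with the same content collide on it regardless of key order.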
### Pure utility (src/mcpd/src/utils/semver.ts)
- bumpSemver(current, kind) for major / minor / patch
- compareSemver(a, b) — numeric, not lex (10 > 9)
- isValidSemver(s)
- Invalid input falls back to '0.1.0' rather than throwing — keeps the
audit-write path from blowing up the prompt update if a row's semver
ever drifts out of MAJOR.MINOR.PATCH shape.
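A sketch of the three helpers above, following the behaviour the PR describes (numeric compare, `'0.1.0'` fallback on invalid input); the exact fallback semantics of `bumpSemver` on invalid input are an assumption here:

```typescript
const SEMVER_RE = /^(\d+)\.(\d+)\.(\d+)$/;

function isValidSemver(s: string): boolean {
  return SEMVER_RE.test(s);
}

// Numeric, not lexicographic: "0.10.0" > "0.9.0".
function compareSemver(a: string, b: string): number {
  const pa = (isValidSemver(a) ? a : "0.1.0").split(".").map(Number);
  const pb = (isValidSemver(b) ? b : "0.1.0").split(".").map(Number);
  for (let i = 0; i < 3; i++) {
    if (pa[i] !== pb[i]) return pa[i] - pb[i];
  }
  return 0;
}

function bumpSemver(current: string, kind: "major" | "minor" | "patch"): string {
  if (!isValidSemver(current)) return "0.1.0"; // fallback rather than throw (assumption)
  const [maj, min, pat] = current.split(".").map(Number);
  if (kind === "major") return `${maj + 1}.0.0`;
  if (kind === "minor") return `${maj}.${min + 1}.0`;
  return `${maj}.${min}.${pat + 1}`;
}
```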
### Routes (src/mcpd/src/routes/)
- revisions.ts — GET /api/v1/revisions?resourceType=&resourceId=,
GET /api/v1/revisions/:id, GET /api/v1/revisions/:id/diff?against=<id|live>
(unified-format diff via the `diff` package), and POST
/api/v1/prompts/:id/restore-revision { revisionId, note? }.
- proposals.ts — GET / POST /api/v1/proposals,
GET /api/v1/proposals/:id, PUT for body updates, POST .../approve and
POST .../reject, plus DELETE.
## What's changed
- PromptService.create / update now record a ResourceRevision when the
revision service is wired. Update auto-bumps patch on content change;
authors can override via `--bump major|minor|patch` or `--semver X.Y.Z`
on the CLI (forwarded into the PUT body). Best-effort: revision write
failures are swallowed so the prompt save still succeeds (revision is
audit, not source of truth).
- PromptService.setProposalService registers a 'prompt' approval handler
with the proposal service. Approval runs in a Prisma transaction:
upsert prompt → record revision → update currentRevisionId → flip
proposal status. semver bumps to 0.1.0 on first approval, patch
thereafter.
- New CLI flags on `mcpctl edit prompt`: --bump, --semver, --note. They're
prompt-only (validated client-side); other resources reject them.
- Aliases in shared.ts: `proposal`/`prop` → proposals,
`revision`/`rev` → revisions.
- diff dependency added to mcpd.
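The atomic approve() flow described under Services can be sketched with the transaction runner injected, which makes the ordering visible without Prisma; every name and shape here is an assumption, not the actual `resource-proposal.service.ts` code:

```typescript
// Hypothetical minimal proposal shape.
interface Proposal { id: string; resourceId: string; body: string; status: string }

async function approveProposal(
  proposal: Proposal,
  runInTransaction: (work: () => Promise<void>) => Promise<void>, // stands in for $transaction
  ops: {
    upsertResource: (id: string, body: string) => Promise<void>;
    recordRevision: (id: string, body: string) => Promise<string>;
    flipStatus: (proposalId: string, status: "approved") => Promise<void>;
  },
): Promise<void> {
  await runInTransaction(async () => {
    await ops.upsertResource(proposal.resourceId, proposal.body);
    await ops.recordRevision(proposal.resourceId, proposal.body);
    await ops.flipStatus(proposal.id, "approved");
  });
}
```

If any step rejects, the runner rolls the whole batch back, so a proposal can never read approved while its revision insert failed.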
## Tests
- src/mcpd/tests/utils/semver.test.ts — covers bump/compare/validate
including numeric (not lex) semver compare and invalid-input fallback.
- prompt-service.test.ts updated: makePrompt fixture now sets semver +
agentId + currentRevisionId; updatePrompt assertion expects the
auto-bumped patch in the same update call.
- prompt-routes.test.ts updated symmetrically.
## RBAC
`proposals` and `revisions` URL segments map to the existing `prompts`
permission for now. PR-7 may split if a "reviewer" role becomes useful.
## Verification
Full suite: 158 test files / 2127 tests green.
`pnpm build` clean across all 6 workspace packages.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Wires the schema landed in Stage 1 into the service layer. No HTTP
routes yet — Stage 3 will register `/api/v1/...` endpoints and update
chat.service to read agent-direct + personality prompts when building
the system block.
Repositories:
- PersonalityRepository: CRUD + listPrompts/attach/detach bindings.
- PromptRepository: findByAgent + findByNameAndAgent; create/update
accept the new agentId column. findGlobal now also filters
agentId=null so agent-direct prompts don't leak into global lists.
- AgentRepository: defaultPersonalityId on create + connect/disconnect
in update.
Services:
- PersonalityService: CRUD scoped per agent, plus attach/detach with
scope enforcement — a prompt may bind only if it's agent-direct on
the same agent, in the agent's project, or global. Foreign-project
/ foreign-agent attachments are rejected with 400.
- PromptService: createPrompt / upsertByName accept agentId and
resolve `agent: <name>`, with XOR-with-project guard. Adds
listPromptsForAgent.
- AgentService: defaultPersonality (by name on the agent's own
personality set) round-trips through update + AgentView.
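The attach scope rule above reduces to a small predicate; the field names (`agentId`, `projectId`) are assumptions matching the schema columns this stage describes:

```typescript
// Hypothetical scope shapes for the attach check.
interface PromptScope { agentId: string | null; projectId: string | null }
interface AgentScope { id: string; projectId: string }

// A prompt may bind only if it is agent-direct on the same agent,
// in the agent's project, or global. Anything else is rejected (400).
function canAttach(prompt: PromptScope, agent: AgentScope): boolean {
  if (prompt.agentId !== null) return prompt.agentId === agent.id;            // agent-direct
  if (prompt.projectId !== null) return prompt.projectId === agent.projectId; // project-scoped
  return true;                                                                // global
}
```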
Validation:
- prompt.schema.ts: refine() rejects projectId+agentId together.
- personality.schema.ts: new Create/Update/AttachPrompt schemas.
- agent.schema.ts: defaultPersonality { name } | null on update.
Tests: 12 PersonalityService + 7 PromptService agent-scope tests
covering happy paths, XOR/scope enforcement, double-attach guard,
detach-not-bound. mcpd suite: 796/796 (was 777). Typecheck clean.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds a per-server tools/list cache in McpRouter (positive + negative TTL)
so a slow or dead upstream only stalls the first discovery call, not every
subsequent client request. Invalidated on upstream add/remove.
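A sketch of such a cache with separate positive and negative TTLs; the class name, TTL values, and entry shape are illustrative assumptions, not the McpRouter internals:

```typescript
interface CacheEntry<T> {
  value: T | null;   // null marks a negative (failed-discovery) entry
  expiresAt: number; // epoch ms
}

class ToolsListCache<T> {
  private entries = new Map<string, CacheEntry<T>>();

  constructor(
    private positiveTtlMs = 60_000,
    private negativeTtlMs = 5_000, // retry dead upstreams sooner
  ) {}

  get(server: string, now = Date.now()): CacheEntry<T> | undefined {
    const entry = this.entries.get(server);
    if (!entry || entry.expiresAt <= now) return undefined;
    return entry;
  }

  set(server: string, value: T | null, now = Date.now()): void {
    const ttl = value === null ? this.negativeTtlMs : this.positiveTtlMs;
    this.entries.set(server, { value, expiresAt: now + ttl });
  }

  invalidate(server: string): void {
    this.entries.delete(server); // called on upstream add/remove
  }
}
```

Caching failures (the negative TTL) is what stops a dead upstream from being re-probed on every client request, while the shorter window lets it recover quickly.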
Health probes now apply a default liveness spec (tools/list via the real
production path) to any RUNNING instance without an explicit healthCheck,
so synthetic and real failures converge on the same signal.
Includes supporting updates in mcpd-client, discovery, upstream/mcpd,
seeder, and fulldeploy/release scripts.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
proxyMode "direct" was a security hole (leaked secrets as plaintext env
vars in .mcp.json) and bypassed all mcplocal features (gating, audit,
RBAC, content pipeline, namespacing). Removed from schema, API, CLI,
and all tests. Old configs with proxyMode are accepted but silently
stripped via Zod .transform() for backward compatibility.
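The strip-on-parse behaviour is equivalent to the following sketch (written without Zod so it is self-contained; the real code does this inside a Zod `.transform()`, and the config shape here is a hypothetical stand-in):

```typescript
// Hypothetical legacy config shape: old files may still carry proxyMode.
interface LegacyServerConfig {
  name: string;
  proxyMode?: "direct" | "proxy"; // removed feature; accepted but discarded
}

// Drop the legacy key so downstream code never sees it.
function stripProxyMode(config: LegacyServerConfig): { name: string } {
  const { proxyMode: _ignored, ...rest } = config;
  return rest;
}
```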
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Audit Console Phase 1: tool_call_trace emission from mcplocal router,
session_bind/rbac_decision event kinds, GET /audit/sessions endpoint,
full Ink TUI with session sidebar, event timeline, and detail view
(mcpctl console --audit).
System prompts: move 6 hardcoded LLM prompts to mcpctl-system project
with extensible ResourceRuleRegistry validation framework, template
variable enforcement ({{maxTokens}}, {{pageCount}}), and delete-resets-
to-default behavior. All consumers fetch via SystemPromptFetcher with
hardcoded fallbacks.
CLI: -p shorthand for --project across get/create/delete/config commands,
console auto-scroll improvements, shell completions regenerated.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Add warmup() to LlmProvider interface for eager subprocess startup
- ManagedVllmProvider.warmup() starts vLLM in background on project load
- ProviderRegistry.warmupAll() triggers all managed providers
- NamedProvider proxies warmup() to inner provider
- paginate stage generates LLM-powered descriptive page titles when
available, cached by content hash, falls back to generic "Page N"
- project-mcp-endpoint calls warmupAll() on router creation so vLLM
is loading while the session initializes
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Implements the full gated session flow and prompt intelligence system:
- Prisma schema: add gated, priority, summary, chapters, linkTarget fields
- Session gate: state machine (gated → begin_session → ungated) with LLM-powered
tool selection based on prompt index
- Tag matcher: intelligent prompt-to-tool matching with project/server/action tags
- LLM selector: tiered provider selection (fast for gating, heavy for complex tasks)
- Link resolver: cross-project MCP resource references (project/server:uri format)
- Prompt summary service: LLM-generated summaries and chapter extraction
- System project bootstrap: ensures default project exists on startup
- Structural link health checks: enrichWithLinkStatus on prompt GET endpoints
- CLI: create prompt --priority/--link, create project --gated/--no-gated,
describe project shows prompts section, get prompts shows PRI/LINK/STATUS
- Apply/edit: priority, linkTarget, gated fields supported
- Shell completions: fish updated with new flags
- 1,253 tests passing across all packages
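The gate transition in the first bullet can be sketched as a tiny state machine (names and shapes are assumptions; the real flow also layers LLM-powered tool selection on top):

```typescript
type GateState = "gated" | "ungated";

// A gated session only accepts begin_session; calling it ungates the session.
function handleToolCall(
  state: GateState,
  tool: string,
): { state: GateState; allowed: boolean } {
  if (state === "ungated") return { state, allowed: true };
  if (tool === "begin_session") return { state: "ungated", allowed: true };
  return { state: "gated", allowed: false };
}
```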
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Fix MCP proxy to support SSE and STDIO transports (not just HTTP POST)
- Enrich tool descriptions with server context for LLM clarity
- Add Prompt and PromptRequest resources with two-resource RBAC model
- Add propose_prompt MCP tool for LLM to create pending prompt requests
- Add prompt resources visible in MCP resources/list (approved + session's pending)
- Add project-level prompt/instructions in MCP initialize response
- Add ServiceAccount subject type for RBAC (SA identity from X-Service-Account header)
- Add CLI commands: create prompt, get prompts/promptrequests, approve promptrequest
- Add prompts to apply config schema
- 956 tests passing across all packages
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Implements Kubernetes-style liveness probes that call MCP tools defined
in server healthCheck configs. For STDIO servers, uses docker exec to
spawn a disposable MCP client that sends initialize + tool call. For
HTTP/SSE servers, sends JSON-RPC directly.
- HealthProbeRunner service with configurable interval/threshold/timeout
- execInContainer added to orchestrator interface + Docker implementation
- Instance findById now includes server relation (fixes describe showing IDs)
- Events appended to instance (last 50), healthStatus tracked as
healthy/degraded/unhealthy
- 12 unit tests covering probing, thresholds, intervals, cleanup
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>