This PR fixes three connected issues that let instances start, and be
reported as healthy, while their secret backend was unreachable. The motivating
case: gitea-mcp-server starts when mcpd can't read the
gitea-creds secret from OpenBao, runs with an empty
`GITEA_ACCESS_TOKEN`, replies fine to `tools/list` (so liveness passes),
but every authed call fails with "token is required" — and
`mcpctl get instances` cheerfully reports the instance as healthy.
## What changed
### 1. Env resolution failures are now fatal for the start attempt
`src/mcpd/src/services/instance.service.ts`
The previous behaviour swallowed `resolveServerEnv` failures and let
the container start anyway with whatever env survived ("non-fatal —
container may still work if env vars are optional"). That's the bug:
the gitea container started with no token, ran for weeks, and was
reported healthy.
The catch now calls `markInstanceError(instance, "secret resolution
failed: <reason>")` and returns. Optional/missing env vars should be
modelled as `value: ""` entries on the server, not as silent
secret-resolution failures.
### 2. ERROR instances retry with backoff, not blind churn
Adds Kubernetes-style escalation: 30 s × 5 attempts, then 5 min
pauses thereafter. Retry state lives on `McpInstance.metadata` (no
schema migration) — `attemptCount`, `lastAttemptAt`, `nextRetryAt`,
`error`.
The reconciler no longer tears down ERROR instances and creates
fresh replacements (which would reset attemptCount and effectively
loop at 30 s forever). Instead:
- ERROR rows whose `nextRetryAt` is in the future are LEFT ALONE and
counted against the replica budget — preventing tight create-fail-create
churn while a previous attempt is in its backoff window.
- ERROR rows whose `nextRetryAt` has elapsed are retried IN-PLACE
via a new `retryInstance` method, which preserves attemptCount on
the same row so the schedule actually escalates.
The work has been factored into `startOne` (creates + initial attempt)
+ `attemptStart` (env + container) + `retryInstance` (re-attempt the
same row) + `markInstanceError` (write retry metadata).
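A sketch of how `markInstanceError` could derive the retry metadata
described above. The constants match the prose (30 s, 5 attempts,
5 min); the field names match the listed metadata keys; the exact
attempt boundary at which the interval escalates is illustrative.

```typescript
const RETRY_INTERVAL_MS = 30_000;        // 30 s between early attempts
const ESCALATED_INTERVAL_MS = 300_000;   // 5 min pauses thereafter
const FAST_ATTEMPTS = 5;

// Stored on McpInstance.metadata, so no schema migration is needed.
interface RetryMetadata {
  attemptCount: number;
  lastAttemptAt: string; // ISO timestamp
  nextRetryAt: string;   // ISO timestamp
  error: string;
}

function nextRetryMetadata(
  prev: RetryMetadata | undefined,
  error: string,
  now: Date,
): RetryMetadata {
  // Retrying in-place preserves prev.attemptCount, so the schedule
  // actually escalates instead of resetting on every replacement row.
  const attemptCount = (prev?.attemptCount ?? 0) + 1;
  const delay = attemptCount < FAST_ATTEMPTS ? RETRY_INTERVAL_MS : ESCALATED_INTERVAL_MS;
  return {
    attemptCount,
    lastAttemptAt: now.toISOString(),
    nextRetryAt: new Date(now.getTime() + delay).toISOString(),
    error,
  };
}
```

The reconciler then only needs to compare `nextRetryAt` against the
clock: in the future means leave the row alone, elapsed means call
`retryInstance` on the same row.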
### 3. STDIO readiness probe goes through mcpProxyService
`src/mcpd/src/services/health-probe.service.ts`
The legacy `probeStdio` (a `docker exec node -e '... spawn(packageName)
...'` invocation) only worked for packageName-based servers.
Image-based STDIO servers like gitea-mcp-server fell through with "No
packageName or command for STDIO server" and were reported unhealthy
for the WRONG reason — they have no packageName because they are an
image, not because anything's wrong.
New `probeReadinessViaProxy`: sends `tools/call` through the running
container via `mcpProxyService.execute`. Same code path as
production traffic, so probe failures match real failures. Picks up:
- JSON-RPC errors (e.g. "token is required" when env is empty).
- Tool-level errors expressed as `result.isError: true`.
- Connection failures wrapped as exceptions.
- Hard timeouts via the deadline race.
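A sketch of that four-way classification, using a simplified stand-in
for `mcpProxyService.execute` (the real result and service types
differ):

```typescript
// Simplified shape of a proxied tools/call response.
interface ProxyResult {
  error?: { code: number; message: string }; // JSON-RPC error object
  result?: { isError?: boolean };            // tool-level failure flag
}

async function probeReadinessViaProxy(
  execute: (tool: string) => Promise<ProxyResult>,
  tool: string,
  timeoutMs: number,
): Promise<{ healthy: boolean; reason?: string }> {
  // Hard deadline raced against the proxied call.
  const deadline = new Promise<never>((_, reject) =>
    setTimeout(() => reject(new Error(`probe timed out after ${timeoutMs} ms`)), timeoutMs),
  );
  try {
    const res = await Promise.race([execute(tool), deadline]);
    if (res.error) {
      // e.g. "token is required" when the env resolved empty
      return { healthy: false, reason: `JSON-RPC error: ${res.error.message}` };
    }
    if (res.result?.isError) {
      return { healthy: false, reason: "tool returned isError: true" };
    }
    return { healthy: true };
  } catch (err) {
    // Connection failures and the deadline race both land here.
    return { healthy: false, reason: (err as Error).message };
  }
}
```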
After this PR, configuring `gitea` with
`healthCheck: { tool: get_me, intervalSeconds: 60 }` makes
`mcpctl get instances` report it as `unhealthy` whenever the auth
token is missing or wrong — which is honest.
The dead `probeStdio` (~120 LOC) is removed; the bespoke HTTP/SSE probe
paths are kept for now (they work and the diff stays minimal).
## Tests
`src/mcpd/tests/instance-service.test.ts`:
- Replaces "cleans up ERROR instances and creates replacements" with
"retries ERROR instances in-place when their backoff has elapsed".
- Adds "leaves ERROR instances alone while their nextRetryAt is in
the future" and "escalates the backoff: attemptCount + nextRetryAt
persist on retry failures".
`src/mcpd/tests/services/health-probe.test.ts`:
- Swaps STDIO probe mocks from `orchestrator.execInContainer` →
`mcpProxyService.execute`.
- Adds "marks unhealthy when proxy returns a JSON-RPC error
(e.g. broken-secret auth failure)" — explicitly the gitea case.
- Adds "marks unhealthy when proxy returns a tool-level error in
result.isError" — covers servers that report tool failures as
isError instead of as JSON-RPC errors.
- Renames "handles exec timeout" → "handles probe timeout" and
exercises the deadline race rather than an exec rejection.
Full suite: 162 test files / 2161 tests green (+4 new).
## Manual verification step (post-deploy)
```bash
mcpctl edit server gitea
# → add healthCheck:
# tool: get_me
# intervalSeconds: 60
# timeoutSeconds: 10
# failureThreshold: 3
```
If OpenBao is still down: gitea instance enters ERROR with
attemptCount + nextRetryAt visible in `mcpctl describe instance`.
Otherwise: gitea env resolves at next start, probe passes, instance
is honestly healthy.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>