Commit Graph

3 Commits

Author SHA1 Message Date
Michal
e6cd73543a fix(mcpd): fail-loud on env resolution + retry/backoff + readiness via proxy
Fixes three connected issues where instances came up and were reported
as healthy even though their secret backend was unreachable. The motivating
case: gitea-mcp-server starts when mcpd can't read the
gitea-creds secret from OpenBao, runs with an empty
GITEA_ACCESS_TOKEN, replies fine to tools/list (so liveness passes),
but every authed call fails with "token is required" — and
`mcpctl get instances` cheerfully reports the instance as healthy.

## What changed

### 1. Env resolution failures are now fatal for the start attempt

`src/mcpd/src/services/instance.service.ts`

The previous behaviour swallowed `resolveServerEnv` failures and let
the container start anyway with whatever env survived ("non-fatal —
container may still work if env vars are optional"). That's the bug:
the gitea container started with no token, ran for weeks, and was
reported healthy.

The catch now calls `markInstanceError(instance, "secret resolution
failed: <reason>")` and returns. Optional/missing env vars should be
modelled as `value: ""` entries on the server, not as silent
secret-resolution failures.
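
A minimal sketch of the new fail-loud shape (`resolveServerEnv`,
`markInstanceError`, and `attemptStart` are the names from this commit;
the types and the `startContainer` helper are illustrative assumptions):

```typescript
// Sketch only, not the actual diff. Types and startContainer are assumed.
type McpServer = { env: Array<{ key: string; value?: string; secretRef?: string }> };
type McpInstance = { id: string; server: McpServer };

declare function resolveServerEnv(server: McpServer): Promise<Record<string, string>>;
declare function markInstanceError(instance: McpInstance, reason: string): Promise<void>;
declare function startContainer(instance: McpInstance, env: Record<string, string>): Promise<void>;

async function attemptStart(instance: McpInstance): Promise<void> {
  let env: Record<string, string>;
  try {
    env = await resolveServerEnv(instance.server);
  } catch (err) {
    // Old behaviour: warn and start anyway with whatever env survived.
    // New behaviour: the start attempt fails loudly and is recorded.
    const reason = err instanceof Error ? err.message : String(err);
    await markInstanceError(instance, `secret resolution failed: ${reason}`);
    return;
  }
  await startContainer(instance, env); // reached only with a fully resolved env
}
```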

### 2. ERROR instances retry with backoff, not blind churn

Adds Kubernetes-style escalation: 30 s between each of the first 5
attempts, then 5 min pauses. Retry state lives on `McpInstance.metadata` (no
schema migration) — `attemptCount`, `lastAttemptAt`, `nextRetryAt`,
`error`.
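
The escalation math, as a sketch (only the 30 s / 5 attempts / 5 min
schedule and the metadata keys are from this commit; the constant and
function names are assumed):

```typescript
// Sketch of the schedule described above; names are assumed.
const BASE_RETRY_MS = 30_000;       // 30 s between the first attempts
const ESCALATION_AFTER = 5;         // after 5 failed attempts...
const ESCALATED_RETRY_MS = 300_000; // ...pause 5 min between attempts

function computeNextRetryAt(attemptCount: number, now: Date = new Date()): Date {
  const delayMs = attemptCount < ESCALATION_AFTER ? BASE_RETRY_MS : ESCALATED_RETRY_MS;
  return new Date(now.getTime() + delayMs);
}

// Persisted on McpInstance.metadata (no schema migration):
//   { attemptCount, lastAttemptAt, nextRetryAt, error }
```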

The reconciler no longer tears down ERROR instances and creates
fresh replacements (which would reset attemptCount and effectively
loop at 30 s forever). Instead:

- ERROR rows whose `nextRetryAt` is in the future are LEFT ALONE
  and counted against the replica budget — preventing tight create-
  fail-create churn while a previous attempt is in its backoff window.
- ERROR rows whose `nextRetryAt` has elapsed are retried IN-PLACE
  via a new `retryInstance` method, which preserves attemptCount on
  the same row so the schedule actually escalates.

The work has been factored into `startOne` (creates + initial attempt)
+ `attemptStart` (env + container) + `retryInstance` (re-attempt the
same row) + `markInstanceError` (write retry metadata).
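
Roughly, the reconcile step for ERROR rows now looks like this (a
sketch: `retryInstance` is the method from this commit, the row shape
and everything else is assumed):

```typescript
// Sketch only: retryInstance is the new method; the row shape is assumed.
type ErrorRow = { metadata: { attemptCount: number; nextRetryAt?: string } };

declare function retryInstance(row: ErrorRow): Promise<void>;

async function reconcileErrorRows(rows: ErrorRow[], now = new Date()): Promise<number> {
  let heldReplicas = 0; // rows in backoff still count against the replica budget
  for (const row of rows) {
    const next = row.metadata.nextRetryAt ? new Date(row.metadata.nextRetryAt) : undefined;
    if (next && next > now) {
      heldReplicas += 1; // backoff window still open: leave the row alone
      continue;
    }
    await retryInstance(row); // retry in-place so attemptCount keeps escalating
  }
  return heldReplicas;
}
```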

### 3. STDIO readiness probe goes through mcpProxyService

`src/mcpd/src/services/health-probe.service.ts`

The legacy `probeStdio` (a `docker exec node -e '... spawn(packageName)
...'` invocation) only worked for packageName-based servers. Image-
based STDIO servers like gitea-mcp-server fell through with "No
packageName or command for STDIO server" and were reported unhealthy
for the WRONG reason — they have no packageName because they are an
image, not because anything's wrong.

New `probeReadinessViaProxy`: sends `tools/call` through the running
container via `mcpProxyService.execute`. Same code path as production
traffic, so probe failures match real failures (sketched after the
list). Picks up:

- JSON-RPC errors (e.g. "token is required" when env is empty).
- Tool-level errors expressed as `result.isError: true`.
- Connection failures wrapped as exceptions.
- Hard timeouts via the deadline race.
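
A rough sketch of the probe (only `mcpProxyService.execute` and the
failure classes listed above are from the commit; the response shape
and signatures are assumptions):

```typescript
// Sketch of probeReadinessViaProxy; mcpProxyService.execute is the real
// entry point, the response shape and helper names are assumed.
declare const mcpProxyService: {
  execute(
    instanceId: string,
    method: string,
    params: unknown,
  ): Promise<{ error?: { message: string }; result?: { isError?: boolean } }>;
};

async function probeReadinessViaProxy(
  instanceId: string,
  tool: string,
  timeoutMs: number,
): Promise<{ healthy: boolean; reason?: string }> {
  const deadline = new Promise<never>((_, reject) =>
    setTimeout(() => reject(new Error("probe timeout")), timeoutMs),
  );
  try {
    const res = await Promise.race([
      mcpProxyService.execute(instanceId, "tools/call", { name: tool, arguments: {} }),
      deadline,
    ]);
    if (res.error) return { healthy: false, reason: res.error.message };      // JSON-RPC error
    if (res.result?.isError) return { healthy: false, reason: "tool error" }; // result.isError
    return { healthy: true };
  } catch (err) {
    // Connection failures and the deadline race both land here.
    return { healthy: false, reason: err instanceof Error ? err.message : String(err) };
  }
}
```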

After this PR, configuring `gitea` with
`healthCheck: { tool: get_me, intervalSeconds: 60 }` makes
`mcpctl get instances` report it as `unhealthy` whenever the auth
token is missing or wrong — which is honest.

The dead `probeStdio` (~120 LOC) is removed; the bespoke HTTP/SSE probe
paths are kept for now (they work and the diff stays minimal).

## Tests

`src/mcpd/tests/instance-service.test.ts`:
- Replaces "cleans up ERROR instances and creates replacements" with
  "retries ERROR instances in-place when their backoff has elapsed".
- Adds "leaves ERROR instances alone while their nextRetryAt is in
  the future" and "escalates the backoff: attemptCount + nextRetryAt
  persist on retry failures".

`src/mcpd/tests/services/health-probe.test.ts`:
- Swaps STDIO probe mocks from `orchestrator.execInContainer` →
  `mcpProxyService.execute`.
- Adds "marks unhealthy when proxy returns a JSON-RPC error
  (e.g. broken-secret auth failure)" — explicitly the gitea case.
- Adds "marks unhealthy when proxy returns a tool-level error in
  result.isError" — covers servers that report tool failures as
  isError instead of as JSON-RPC errors.
- Renames "handles exec timeout" → "handles probe timeout" and
  exercises the deadline race rather than an exec rejection.

Full suite: 162 test files / 2161 tests green (+4 new).

## Manual verification step (post-deploy)

```bash
mcpctl edit server gitea
# → add healthCheck:
#     tool: get_me
#     intervalSeconds: 60
#     timeoutSeconds: 10
#     failureThreshold: 3
```

If OpenBao is still down, the gitea instance enters ERROR with
`attemptCount` and `nextRetryAt` visible in `mcpctl describe instance`.
Otherwise, the gitea env resolves at the next start, the probe passes,
and the instance is honestly healthy.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-07 18:55:23 +01:00
Michal
3149ea3ae7 fix: MCP proxy resilience — discovery cache, default liveness probes
Adds a per-server tools/list cache in McpRouter (positive + negative TTL)
so a slow or dead upstream only stalls the first discovery call, not every
subsequent client request. Invalidated on upstream add/remove.
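
A sketch of the cache shape (per-server key, separate positive/negative
TTLs, invalidated on add/remove; the names and TTL values here are
assumptions, not taken from the diff):

```typescript
// Sketch of a per-server tools/list cache with positive + negative TTL.
type CacheEntry = { tools?: unknown[]; error?: string; expiresAt: number };

const POSITIVE_TTL_MS = 60_000; // a successful discovery stays fresh for 60 s
const NEGATIVE_TTL_MS = 10_000; // a failed upstream is only remembered for 10 s

class ToolsListCache {
  private entries = new Map<string, CacheEntry>();

  get(serverId: string): CacheEntry | undefined {
    const entry = this.entries.get(serverId);
    if (!entry || entry.expiresAt < Date.now()) return undefined;
    return entry;
  }

  setSuccess(serverId: string, tools: unknown[]): void {
    this.entries.set(serverId, { tools, expiresAt: Date.now() + POSITIVE_TTL_MS });
  }

  setFailure(serverId: string, error: string): void {
    this.entries.set(serverId, { error, expiresAt: Date.now() + NEGATIVE_TTL_MS });
  }

  // Dropped whenever an upstream is added or removed.
  invalidate(serverId: string): void {
    this.entries.delete(serverId);
  }
}
```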

Health probes now apply a default liveness spec (tools/list via the real
production path) to any RUNNING instance without an explicit healthCheck,
so synthetic and real failures converge on the same signal.

Includes supporting updates in mcpd-client, discovery, upstream/mcpd,
seeder, and fulldeploy/release scripts.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-17 00:48:57 +01:00
Michal
738bfafd46 feat: MCP health probe runner — periodic tool-call probes for instances
Implements Kubernetes-style liveness probes that call MCP tools defined
in server healthCheck configs. For STDIO servers, uses docker exec to
spawn a disposable MCP client that sends initialize + tool call. For
HTTP/SSE servers, sends JSON-RPC directly.

- HealthProbeRunner service with configurable interval/threshold/timeout
- execInContainer added to orchestrator interface + Docker implementation
- Instance findById now includes server relation (fixes describe showing IDs)
- Events appended to instance (last 50), healthStatus tracked as
  healthy/degraded/unhealthy
- 12 unit tests covering probing, thresholds, intervals, cleanup
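
The failure-count-to-status mapping, as a sketch (only the three status
names are from this commit; the Kubernetes-style threshold semantics
are an assumption):

```typescript
// Sketch of the consecutive-failure → healthStatus mapping; assumed semantics.
type HealthStatus = "healthy" | "degraded" | "unhealthy";

function statusFor(consecutiveFailures: number, failureThreshold: number): HealthStatus {
  if (consecutiveFailures === 0) return "healthy";
  if (consecutiveFailures < failureThreshold) return "degraded";
  return "unhealthy"; // threshold reached: mirror Kubernetes liveness behaviour
}
```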

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-23 00:38:48 +00:00