fix(mcpd): fail-loud on env resolution + retry/backoff + readiness via proxy

Fixes three connected issues in how instances came up and got reported
as healthy when their secret backend was unreachable. The motivating
case: gitea-mcp-server starts when mcpd can't read the
gitea-creds secret from OpenBao, runs with an empty
GITEA_ACCESS_TOKEN, replies fine to tools/list (so liveness passes),
but every authed call fails with "token is required" — and
`mcpctl get instances` cheerfully reports the instance as healthy.

## What changed

### 1. Env resolution failures are now fatal for the start attempt

`src/mcpd/src/services/instance.service.ts`

The previous behaviour swallowed `resolveServerEnv` failures and let
the container start anyway with whatever env survived ("non-fatal —
container may still work if env vars are optional"). That's the bug:
the gitea container started with no token, ran for weeks, and was
reported healthy.

The catch now calls `markInstanceError(instance, "secret resolution
failed: <reason>")` and returns. Optional/missing env vars should be
modelled as `value: ""` entries on the server, not as silent
secret-resolution failures.
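
For context, the catch now behaves roughly like this (a minimal sketch:
only `resolveServerEnv`, `markInstanceError`, and the error text are
taken from this change; the surrounding method shape is assumed):

```typescript
// Sketch of the fail-loud path in instance.service.ts (illustrative, not the
// literal diff). startContainer is a stand-in for whatever actually launches
// the container once env resolution succeeds.
private async attemptStart(instance: McpInstance, server: McpServer): Promise<void> {
  let env: Record<string, string>;
  try {
    env = await this.resolveServerEnv(server);
  } catch (err) {
    // Previously: warn and start anyway ("non-fatal — container may still
    // work if env vars are optional"). Now: fail the attempt loudly.
    const reason = err instanceof Error ? err.message : String(err);
    await this.markInstanceError(instance, `secret resolution failed: ${reason}`);
    return;
  }
  await this.startContainer(instance, server, env);
}
```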

### 2. ERROR instances retry with backoff, not blind churn

Adds Kubernetes-style escalation: 30 s × 5 attempts, then 5 min
pauses thereafter. Retry state lives on `McpInstance.metadata` (no
schema migration) — `attemptCount`, `lastAttemptAt`, `nextRetryAt`,
`error`.
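
As a sketch of that schedule (the helper and constants below are
illustrative; only the 30 s / 5 attempts / 5 min values and the metadata
field names come from this change):

```typescript
// Illustrative backoff schedule: 30 s between the first 5 attempts, 5-minute
// pauses thereafter. Exact boundary handling in the real code may differ.
const INITIAL_RETRY_MS = 30_000;
const ESCALATED_RETRY_MS = 5 * 60_000;
const ESCALATE_AFTER_ATTEMPTS = 5;

function computeNextRetryAt(attemptCount: number, now = new Date()): Date {
  const delayMs =
    attemptCount <= ESCALATE_AFTER_ATTEMPTS ? INITIAL_RETRY_MS : ESCALATED_RETRY_MS;
  return new Date(now.getTime() + delayMs);
}

// markInstanceError then persists something like:
//   metadata: { attemptCount, lastAttemptAt, nextRetryAt: computeNextRetryAt(attemptCount), error }
```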

The reconciler no longer tears down ERROR instances and creates
fresh replacements (which would reset attemptCount and effectively
loop at 30 s forever). Instead:

- ERROR rows whose `nextRetryAt` is in the future are LEFT ALONE
  and counted against the replica budget — preventing tight create-
  fail-create churn while a previous attempt is in its backoff window.
- ERROR rows whose `nextRetryAt` has elapsed are retried IN-PLACE
  via a new `retryInstance` method, which preserves attemptCount on
  the same row so the schedule actually escalates.

The work has been factored into `startOne` (creates + initial attempt),
`attemptStart` (env + container), `retryInstance` (re-attempts the same
row), and `markInstanceError` (writes the retry metadata).
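
Putting those pieces together, the reconciler's handling of ERROR rows
looks roughly like this (a sketch; only `retryInstance` and the metadata
fields are from this change, the loop shape and types are assumed):

```typescript
// Illustrative reconciler pass over ERROR rows (not the real code).
async function reconcileErrorRows(svc: InstanceService, errorRows: McpInstance[], now: Date) {
  for (const instance of errorRows) {
    const nextRetryAt = instance.metadata?.nextRetryAt;
    if (nextRetryAt && new Date(nextRetryAt) > now) {
      // Still inside its backoff window: leave it alone. It still counts
      // against the replica budget, so no replacement gets created.
      continue;
    }
    // Backoff elapsed: retry IN-PLACE on the same row so attemptCount
    // survives and the schedule actually escalates.
    await svc.retryInstance(instance);
  }
}
```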

### 3. STDIO readiness probe goes through mcpProxyService

`src/mcpd/src/services/health-probe.service.ts`

The legacy `probeStdio` (a `docker exec node -e '... spawn(packageName)
...'` invocation) only worked for packageName-based servers. Image-
based STDIO servers like gitea-mcp-server fell through with "No
packageName or command for STDIO server" and were reported unhealthy
for the WRONG reason — they have no packageName because they are an
image, not because anything's wrong.

New `probeReadinessViaProxy`: sends a `tools/call` through the live
container via `mcpProxyService.execute`, i.e. the same code path as
production traffic, so probe failures match real failures (see the
sketch after the list below). It picks up:

- JSON-RPC errors (e.g. "token is required" when env is empty).
- Tool-level errors expressed as `result.isError: true`.
- Connection failures wrapped as exceptions.
- Hard timeouts via the deadline race.
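
A condensed sketch of that probe, assuming a `ProbeResult` shape, a
default timeout, and an inline deadline race (only the
`mcpProxyService.execute` call shape is confirmed by the tests below):

```typescript
// Illustrative only; assumes mcpProxyService is in scope and returns a parsed
// JSON-RPC envelope. ProbeResult and the default timeout are assumptions.
interface ProbeResult {
  healthy: boolean;
  message: string;
}

async function probeReadinessViaProxy(
  serverId: string,
  tool: string,
  args: Record<string, unknown> = {},
  timeoutMs = 10_000,
): Promise<ProbeResult> {
  try {
    const response = await Promise.race([
      mcpProxyService.execute({
        serverId,
        method: 'tools/call',
        params: { name: tool, arguments: args },
      }),
      new Promise<never>((_, reject) =>
        setTimeout(() => reject(new Error(`probe timed out after ${timeoutMs}ms`)), timeoutMs),
      ),
    ]);
    if (response.error) {
      // JSON-RPC error, e.g. "token is required" when the env is empty.
      return { healthy: false, message: response.error.message };
    }
    if (response.result?.isError) {
      // Tool-level error reported via result.isError.
      return { healthy: false, message: response.result.content?.[0]?.text ?? 'tool reported isError' };
    }
    return { healthy: true, message: 'ok' };
  } catch (err) {
    // Connection failures and the deadline race both land here.
    return { healthy: false, message: err instanceof Error ? err.message : String(err) };
  }
}
```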

After this PR, configuring `gitea` with
`healthCheck: { tool: get_me, intervalSeconds: 60 }` makes
`mcpctl get instances` report it as `unhealthy` whenever the auth
token is missing or wrong — which is honest.

The dead `probeStdio` (~120 LOC) is removed; HTTP/SSE bespoke probe
paths are kept for now (they work and the diff stays minimal).

## Tests

`src/mcpd/tests/instance-service.test.ts`:
- Replaces "cleans up ERROR instances and creates replacements" with
  "retries ERROR instances in-place when their backoff has elapsed".
- Adds "leaves ERROR instances alone while their nextRetryAt is in
  the future" and "escalates the backoff: attemptCount + nextRetryAt
  persist on retry failures".

`src/mcpd/tests/services/health-probe.test.ts`:
- Swaps STDIO probe mocks from `orchestrator.execInContainer` →
  `mcpProxyService.execute`.
- Adds "marks unhealthy when proxy returns a JSON-RPC error
  (e.g. broken-secret auth failure)" — explicitly the gitea case.
- Adds "marks unhealthy when proxy returns a tool-level error in
  result.isError" — covers servers that report tool failures as
  isError instead of as JSON-RPC errors.
- Renames "handles exec timeout" → "handles probe timeout" and
  exercises the deadline race rather than an exec rejection.

Full suite: 162 test files / 2161 tests green (+4 new).

## Manual verification step (post-deploy)

```bash
mcpctl edit server gitea
# → add healthCheck:
#     tool: get_me
#     intervalSeconds: 60
#     timeoutSeconds: 10
#     failureThreshold: 3
```

If OpenBao is still down: gitea instance enters ERROR with
attemptCount + nextRetryAt visible in `mcpctl describe instance`.
Otherwise: gitea env resolves at next start, probe passes, instance
is honestly healthy.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
```diff
@@ -192,25 +192,28 @@ describe('HealthProbeRunner', () => {
     expect(serverRepo.findById).not.toHaveBeenCalled();
   });
-  it('probes STDIO instance with exec and marks healthy on success', async () => {
+  it('probes STDIO instance via mcpProxyService and marks healthy on success', async () => {
     const instance = makeInstance();
     const server = makeServer();
     vi.mocked(instanceRepo.findAll).mockResolvedValue([instance]);
     vi.mocked(serverRepo.findById).mockResolvedValue(server);
-    vi.mocked(orchestrator.execInContainer).mockResolvedValue({
-      exitCode: 0,
-      stdout: 'OK',
-      stderr: '',
+    vi.mocked(mcpProxyService.execute).mockResolvedValue({
+      jsonrpc: '2.0', id: 1,
+      result: { content: [{ type: 'text', text: 'ok' }] },
     });
     await runner.tick();
-    expect(orchestrator.execInContainer).toHaveBeenCalledWith(
-      'container-abc',
-      expect.arrayContaining(['node', '-e']),
-      expect.objectContaining({ timeoutMs: 10000 }),
-    );
+    // STDIO readiness now goes through the proxy (the live container),
+    // not via docker-exec into a synthetic spawn — see comment on
+    // probeReadinessViaProxy for why.
+    expect(orchestrator.execInContainer).not.toHaveBeenCalled();
+    expect(mcpProxyService.execute).toHaveBeenCalledWith({
+      serverId: 'srv-1',
+      method: 'tools/call',
+      params: { name: 'list_datasources', arguments: {} },
+    });
     expect(instanceRepo.updateStatus).toHaveBeenCalledWith(
       'inst-1',
@@ -225,6 +228,57 @@
     );
   });
+  it('marks unhealthy when proxy returns a JSON-RPC error (e.g. broken-secret auth failure)', async () => {
+    const instance = makeInstance();
+    const server = makeServer({
+      healthCheck: { tool: 'get_me', intervalSeconds: 0, failureThreshold: 1 } as McpServer['healthCheck'],
+    });
+    vi.mocked(instanceRepo.findAll).mockResolvedValue([instance]);
+    vi.mocked(serverRepo.findById).mockResolvedValue(server);
+    vi.mocked(mcpProxyService.execute).mockResolvedValue({
+      jsonrpc: '2.0', id: 1,
+      error: { code: -32603, message: 'token is required' },
+    });
+    await runner.tick();
+    expect(instanceRepo.updateStatus).toHaveBeenCalledWith(
+      'inst-1',
+      'RUNNING',
+      expect.objectContaining({
+        healthStatus: 'unhealthy',
+        events: expect.arrayContaining([
+          expect.objectContaining({ type: 'Warning', message: expect.stringContaining('token is required') }),
+        ]),
+      }),
+    );
+  });
+  it('marks unhealthy when proxy returns a tool-level error in result.isError', async () => {
+    const instance = makeInstance();
+    const server = makeServer({
+      healthCheck: { tool: 'get_me', intervalSeconds: 0, failureThreshold: 1 } as McpServer['healthCheck'],
+    });
+    vi.mocked(instanceRepo.findAll).mockResolvedValue([instance]);
+    vi.mocked(serverRepo.findById).mockResolvedValue(server);
+    vi.mocked(mcpProxyService.execute).mockResolvedValue({
+      jsonrpc: '2.0', id: 1,
+      result: { isError: true, content: [{ type: 'text', text: 'auth failed: token is required' }] },
+    });
+    await runner.tick();
+    const events = vi.mocked(instanceRepo.updateStatus).mock.calls[0]?.[2]?.events as Array<{ message: string }> | undefined;
+    expect(events?.[events.length - 1]?.message).toContain('auth failed');
+    expect(instanceRepo.updateStatus).toHaveBeenCalledWith(
+      'inst-1',
+      'RUNNING',
+      expect.objectContaining({ healthStatus: 'unhealthy' }),
+    );
+  });
   it('marks unhealthy after failureThreshold consecutive failures', async () => {
     const instance = makeInstance();
     const healthCheck: HealthCheckSpec = {
@@ -237,10 +291,9 @@
     vi.mocked(instanceRepo.findAll).mockResolvedValue([instance]);
     vi.mocked(serverRepo.findById).mockResolvedValue(server);
-    vi.mocked(orchestrator.execInContainer).mockResolvedValue({
-      exitCode: 1,
-      stdout: 'ERROR:connection refused',
-      stderr: '',
+    vi.mocked(mcpProxyService.execute).mockResolvedValue({
+      jsonrpc: '2.0', id: 1,
+      error: { code: -32603, message: 'connection refused' },
     });
     // First failure → degraded
@@ -274,15 +327,15 @@
     vi.mocked(serverRepo.findById).mockResolvedValue(server);
     // Two failures
-    vi.mocked(orchestrator.execInContainer).mockResolvedValue({
-      exitCode: 1, stdout: 'ERROR:fail', stderr: '',
+    vi.mocked(mcpProxyService.execute).mockResolvedValue({
+      jsonrpc: '2.0', id: 1, error: { code: -32603, message: 'fail' },
     });
     await runner.tick();
     await runner.tick();
     // Then success — should reset to healthy
-    vi.mocked(orchestrator.execInContainer).mockResolvedValue({
-      exitCode: 0, stdout: 'OK', stderr: '',
+    vi.mocked(mcpProxyService.execute).mockResolvedValue({
+      jsonrpc: '2.0', id: 1, result: {},
     });
     await runner.tick();
@@ -290,13 +343,16 @@
     expect(lastCall?.[2]).toEqual(expect.objectContaining({ healthStatus: 'healthy' }));
   });
-  it('handles exec timeout as failure', async () => {
+  it('handles probe timeout as failure', async () => {
     const instance = makeInstance();
-    const server = makeServer();
+    const server = makeServer({
+      healthCheck: { tool: 'list_datasources', intervalSeconds: 0, timeoutSeconds: 0.05, failureThreshold: 3 } as unknown as McpServer['healthCheck'],
+    });
     vi.mocked(instanceRepo.findAll).mockResolvedValue([instance]);
     vi.mocked(serverRepo.findById).mockResolvedValue(server);
-    vi.mocked(orchestrator.execInContainer).mockRejectedValue(new Error('Exec timed out after 10000ms'));
+    // Hang forever — the probe's internal deadline should fire instead.
+    vi.mocked(mcpProxyService.execute).mockImplementation(() => new Promise(() => { /* never resolves */ }));
     await runner.tick();
@@ -323,8 +379,8 @@
     vi.mocked(instanceRepo.findAll).mockResolvedValue([instance]);
     vi.mocked(serverRepo.findById).mockResolvedValue(server);
-    vi.mocked(orchestrator.execInContainer).mockResolvedValue({
-      exitCode: 0, stdout: 'OK', stderr: '',
+    vi.mocked(mcpProxyService.execute).mockResolvedValue({
+      jsonrpc: '2.0', id: 1, result: {},
     });
     await runner.tick();
@@ -343,17 +399,17 @@
     vi.mocked(instanceRepo.findAll).mockResolvedValue([instance]);
     vi.mocked(serverRepo.findById).mockResolvedValue(server);
-    vi.mocked(orchestrator.execInContainer).mockResolvedValue({
-      exitCode: 0, stdout: 'OK', stderr: '',
+    vi.mocked(mcpProxyService.execute).mockResolvedValue({
+      jsonrpc: '2.0', id: 1, result: {},
    });
     // First tick: should probe
     await runner.tick();
-    expect(orchestrator.execInContainer).toHaveBeenCalledTimes(1);
+    expect(mcpProxyService.execute).toHaveBeenCalledTimes(1);
     // Second tick immediately: should skip (300s interval not elapsed)
     await runner.tick();
-    expect(orchestrator.execInContainer).toHaveBeenCalledTimes(1);
+    expect(mcpProxyService.execute).toHaveBeenCalledTimes(1);
   });
   it('cleans up probe states for removed instances', async () => {
@@ -364,9 +420,12 @@
     vi.mocked(instanceRepo.findAll).mockResolvedValue([instance]);
     vi.mocked(serverRepo.findById).mockResolvedValue(server);
+    vi.mocked(mcpProxyService.execute).mockResolvedValue({
+      jsonrpc: '2.0', id: 1, result: {},
+    });
     await runner.tick();
-    expect(orchestrator.execInContainer).toHaveBeenCalledTimes(1);
+    expect(mcpProxyService.execute).toHaveBeenCalledTimes(1);
     // Instance removed
     vi.mocked(instanceRepo.findAll).mockResolvedValue([]);
@@ -375,7 +434,7 @@
     // Re-add same instance — should probe again (state was cleaned)
     vi.mocked(instanceRepo.findAll).mockResolvedValue([instance]);
     await runner.tick();
-    expect(orchestrator.execInContainer).toHaveBeenCalledTimes(2);
+    expect(mcpProxyService.execute).toHaveBeenCalledTimes(2);
   });
   it('skips STDIO instances without containerId', async () => {
@@ -397,8 +456,8 @@
       arguments: {},
     };
-    vi.mocked(orchestrator.execInContainer).mockResolvedValue({
-      exitCode: 0, stdout: 'OK', stderr: '',
+    vi.mocked(mcpProxyService.execute).mockResolvedValue({
+      jsonrpc: '2.0', id: 1, result: {},
     });
     const result = await runner.probeInstance(instance, server, healthCheck);
@@ -407,15 +466,14 @@
     expect(result.message).toBe('ok');
   });
-  it('handles STDIO exec failure with error message', async () => {
+  it('surfaces upstream JSON-RPC error message verbatim', async () => {
     const instance = makeInstance();
    const server = makeServer();
     const healthCheck: HealthCheckSpec = { tool: 'list_datasources', arguments: {} };
-    vi.mocked(orchestrator.execInContainer).mockResolvedValue({
-      exitCode: 1,
-      stdout: 'ERROR:ECONNREFUSED 10.0.0.1:3000',
-      stderr: '',
+    vi.mocked(mcpProxyService.execute).mockResolvedValue({
+      jsonrpc: '2.0', id: 1,
+      error: { code: -32603, message: 'ECONNREFUSED 10.0.0.1:3000' },
     });
     const result = await runner.probeInstance(instance, server, healthCheck);
```