2 Commits

Author SHA1 Message Date
Michal
e51b92473f fix(smoke,rotator,auth): repair smoke env + close failure modes that
caused 27 post-deploy smoke failures

This commit lands the durable side of the post-deploy investigation:
genuine bugs that let the upstream OpenBao re-init silently break every
secret write for 4 days, plus test-code bugs that masked the same
breakage in the smoke output.

mcpd — fail loud on dead OpenBao tokens
=======================================
secret-backend-rotator.service.ts
  When `mintRoleToken` or `lookupSelf` returns 403/401, classify it as
  BACKEND_TOKEN_DEAD (likely cause: upstream OpenBao re-init invalidated
  every pre-existing token), wrap the thrown error with explicit
  remediation (mint via root + `mcpctl create secret <name> --data
  <key>=<token> --force`), persist the same message to
  tokenMeta.lastRotationError, and emit a structured `level:fatal`
  console.error so it shows up in `kubectl logs deploy/mcpd` with a
  grep-friendly `kind:BACKEND_TOKEN_DEAD`. Adds a `healthCheck(backendId)`
  method that runs lookup-self without minting — so the boot-time loop
  can detect the dead-token state immediately, not 24 hours later.

secret-backend-rotator-loop.ts
  Boot-time health check: in `start()`, for every rotatable backend, call
  `rotator.healthCheck(b.id)` and on failure log a structured fatal entry.
  This converts the prior silent failure mode (24h wait until scheduled
  rotation reveals the dead token, with secret writes failing under it
  the entire time) into "mcpd boots, immediately sees the dead token,
  alerts loudly". Existing isOverdue path is unchanged.
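
The boot-time check could look roughly like the sketch below. The
`Rotator` and `Backend` interfaces, the `bootHealthCheck` name, and the
log fields are assumptions for illustration; the commit only states that
`start()` calls `rotator.healthCheck(b.id)` per rotatable backend and
logs a structured fatal entry on failure.

```typescript
// Hypothetical minimal interfaces standing in for the real rotator types.
interface Rotator {
  healthCheck(backendId: string): Promise<void>;
}

interface Backend {
  id: string;
}

// Runs one health check per backend and returns the ids that failed.
// A failure is logged but never rethrown, so boot continues either way.
async function bootHealthCheck(
  rotator: Rotator,
  backends: Backend[],
  log: (line: string) => void = (line) => console.error(line),
): Promise<string[]> {
  const failed: string[] = [];
  for (const b of backends) {
    try {
      await rotator.healthCheck(b.id);
    } catch (err) {
      failed.push(b.id);
      log(
        JSON.stringify({
          level: "fatal",
          backendId: b.id,
          error: err instanceof Error ? err.message : String(err),
        }),
      );
    }
  }
  return failed;
}
```

Catching per backend keeps one dead token from hiding the state of the
others.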

mcpd — Prisma userId crash on /me
=================================
routes/auth.ts
  GET /api/v1/auth/me used `request.userId!` which lied: an authenticated
  McpToken bearer satisfies the auth middleware but has no associated
  User row, so userId stayed undefined and
  `findUnique({ where: { id: undefined } })` threw
  PrismaClientValidationError. Now returns 401 with a clear
  "service-account/token-bound principal cannot be queried via /me"
  message instead of bubbling a 500.
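
A framework-agnostic sketch of the guard, assuming a request object with
an optional `userId`. `AuthenticatedRequest`, `MeLookup`, and
`resolveMeUserId` are hypothetical names; the real handler lives in
routes/auth.ts and is not reproduced here.

```typescript
interface AuthenticatedRequest {
  // Unset for McpToken bearers, which have no associated User row.
  userId?: string;
}

type MeLookup =
  | { ok: true; userId: string }
  | { ok: false; status: 401; error: string };

// Decide up front whether /me can be answered, instead of asserting
// `request.userId!` and letting the database client reject `undefined`.
function resolveMeUserId(request: AuthenticatedRequest): MeLookup {
  if (request.userId === undefined) {
    return {
      ok: false,
      status: 401,
      error: "service-account/token-bound principal cannot be queried via /me",
    };
  }
  return { ok: true, userId: request.userId };
}
```

Returning a discriminated union forces the caller to handle the
token-only case before any lookup runs.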

mcplocal — token revocation propagation
=======================================
http/token-auth.ts
  Lowered default introspection positiveTtl from 30s to 5s. mcpd's
  introspection endpoint is a single DB lookup; the cache only protects
  against burst restart storms, not steady-state load. The 30s TTL let
  revoked tokens keep working for up to 30s after revocation (caught by
  mcptoken.smoke's negative-cache assertion). The new value aligns with
  the existing 5s negativeTtl and the test's `wait 7s after revoke`
  expectation.
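
To illustrate why the TTL matters, here is a small cache sketch with
separate positive and negative TTLs and an injectable clock. The
`IntrospectionCache` class is hypothetical and is not the code in
http/token-auth.ts; only the 30s, 5s, and 7s figures come from this
commit.

```typescript
interface CacheEntry {
  active: boolean;
  expiresAt: number;
}

class IntrospectionCache {
  private readonly entries = new Map<string, CacheEntry>();
  private readonly positiveTtlMs: number;
  private readonly negativeTtlMs: number;
  private readonly now: () => number;

  constructor(
    positiveTtlMs: number = 5000,
    negativeTtlMs: number = 5000,
    now: () => number = () => Date.now(),
  ) {
    this.positiveTtlMs = positiveTtlMs;
    this.negativeTtlMs = negativeTtlMs;
    this.now = now;
  }

  // Store an introspection result; active tokens use the positive TTL.
  set(token: string, active: boolean): void {
    const ttl = active ? this.positiveTtlMs : this.negativeTtlMs;
    this.entries.set(token, { active, expiresAt: this.now() + ttl });
  }

  // Returns the cached result, or undefined when absent or expired.
  get(token: string): boolean | undefined {
    const entry = this.entries.get(token);
    if (entry === undefined) return undefined;
    if (this.now() >= entry.expiresAt) {
      this.entries.delete(token);
      return undefined;
    }
    return entry.active;
  }
}
```

With a 30s positive TTL, a token revoked right after being cached still
reads as active 7s later; with 5s the entry has expired by then and the
next request has to re-introspect against mcpd.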

smoke tests — read URL the same way the CLI does
================================================
mcp-client.ts
  Adds `loadMcpdAuth()`: URL from `~/.mcpctl/config.json`, token from
  `~/.mcpctl/credentials`. Critically, the URL does NOT come from
  credentials. credentials.mcpdUrl is a legacy field that goes stale
  (left over from old `mcpctl login --mcpd-url localhost:3xxx`
  invocations), so tests reading it ended up hitting whatever URL the
  user last logged into rather than the URL the CLI is actually using
  right now. The audit/security/system-prompts smoke tests now use
  loadMcpdAuth(), eliminating ~10 cascade failures.
  Also: switch httpRequest to https.request when scheme is https
  (matching audit/security/system-prompts/mcp-client/agent helpers).
  Bumps default callTool timeout from 30s to 60s; many tools that fetch
  external resources routinely run 10-30s.
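
A sketch of what a `loadMcpdAuth()`-style helper might look like, under
stated assumptions: both files are JSON, the config field is named
`mcpdUrl`, and the credentials field is named `token`. None of those
field names are confirmed by this commit, and the `home` and `readText`
parameters exist only to make the sketch testable.

```typescript
import { readFileSync } from "node:fs";
import { homedir } from "node:os";
import { join } from "node:path";

interface McpdAuth {
  url: string;
  token: string;
  transport: "http" | "https";
}

type ReadText = (path: string) => string;

function loadMcpdAuth(
  home: string = homedir(),
  readText: ReadText = (path) => readFileSync(path, "utf8"),
): McpdAuth {
  const dir = join(home, ".mcpctl");
  // URL comes from config.json only; any mcpdUrl in credentials is ignored.
  const config = JSON.parse(readText(join(dir, "config.json"))) as { mcpdUrl?: string };
  // The credentials file has no extension.
  const credentials = JSON.parse(readText(join(dir, "credentials"))) as { token?: string };
  if (!config.mcpdUrl) throw new Error("mcpdUrl missing from ~/.mcpctl/config.json");
  if (!credentials.token) throw new Error("token missing from ~/.mcpctl/credentials");
  // Callers pick http.request or https.request based on the scheme.
  const transport = new URL(config.mcpdUrl).protocol === "https:" ? "https" : "http";
  return { url: config.mcpdUrl, token: credentials.token, transport };
}
```

Returning the transport alongside the URL keeps the scheme decision in
one place rather than in each test helper.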

agent.smoke.test.ts
  - readToken read `credentials.json`, but the file is `credentials`
    (no extension). Caused 401 on POST /threads.
  - `mcpctl get <resource> <name> -o json` returns an array, not a bare
    object. Round-trip yaml test now indexes [0] before reading
    description.
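
The array-indexing fix can be sketched as below. `NamedResource` and
`firstResource` are hypothetical names, and the resource shape is an
assumption; the commit only states that `-o json` output is an array
and that the test now reads element [0].

```typescript
interface NamedResource {
  name: string;
  description?: string;
}

// Parse `mcpctl get <resource> <name> -o json` output, which is an array
// even when a single resource is requested, and return its first element.
function firstResource(stdout: string): NamedResource {
  const parsed: unknown = JSON.parse(stdout);
  if (!Array.isArray(parsed) || parsed.length === 0) {
    throw new Error("expected a non-empty JSON array from `mcpctl get ... -o json`");
  }
  return parsed[0] as NamedResource;
}
```

Failing loudly on a bare object makes a future change in output shape
show up as a clear test error rather than an undefined description.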

secretbackend.smoke.test.ts
  Two genuine assertion-drift fixes (env was right, test was stale):
  - "lists at least one secretbackend": stop hard-coding the default
    backend type as 'plaintext'; the invariant is "exactly one default
    exists". The seeded plaintext is the bootstrap default but operators
    routinely promote a remote backend (openbao etc.) once it's healthy.
  - "refuses to delete the seeded default": widen the regex from
    /default|in use|cannot delete/ to also accept "referenced" — the
    exact wording has shifted to "is still referenced by N secret(s);
    migrate them first".
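
Both relaxed assertions can be expressed as small predicates. This is a
sketch: the `SecretBackend` shape and its `isDefault` field are
assumptions, while the regex alternatives and the sample error wording
are taken from the description above.

```typescript
interface SecretBackend {
  name: string;
  type: string;
  isDefault: boolean; // assumed field name
}

// Invariant under test: exactly one default backend, of any type.
function hasExactlyOneDefault(backends: SecretBackend[]): boolean {
  return backends.filter((b) => b.isDefault).length === 1;
}

// Widened pattern: the original three alternatives plus "referenced".
const DELETE_REFUSAL = /default|in use|cannot delete|referenced/;

function isDeleteRefusal(message: string): boolean {
  return DELETE_REFUSAL.test(message);
}
```

Neither predicate depends on the default being `plaintext`, so the
checks keep passing after an operator promotes a remote backend.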

audit.test.ts / system-prompts.test.ts / security.test.ts
  Switch http.request → https.request when URL is https (each had its
  own copy of the helper). Drop the now-orphan loadMcpdCredentials in
  favour of loadMcpdAuth from mcp-client.ts.

Tests
=====
mcpd 759/759, mcplocal 715/715 unit suites still green. Smoke (live):
  Run 1 (pre-commit, post bao-token rotation):  27 → 12 failures.
  Run 2 (after fixes-batch, pre-redeploy):      12 →  2 failures.
The remaining 2 (mcptoken cache TTL, proxy-pipeline timeout) are what
the durable code changes here address; verify after the next redeploy.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-25 18:35:13 +01:00
Michal
03827f11e4 feat: eager vLLM warmup and smart page titles in paginate stage
- Add warmup() to LlmProvider interface for eager subprocess startup
- ManagedVllmProvider.warmup() starts vLLM in background on project load
- ProviderRegistry.warmupAll() triggers all managed providers
- NamedProvider proxies warmup() to inner provider
- paginate stage generates descriptive page titles via the LLM when a
  provider is available, caches them by content hash, and falls back to
  generic "Page N" otherwise
- project-mcp-endpoint calls warmupAll() on router creation so vLLM
  is loading while the session initializes
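
The warmup plumbing listed above might be shaped like this sketch. The
class and method names mirror the bullets, but the bodies, the optional
`warmup?()` signature, and the `register()` method are assumptions; the
real implementations are not shown in this commit.

```typescript
interface LlmProvider {
  readonly name: string;
  // Assumed optional: providers with no subprocess have nothing to warm up.
  warmup?(): void;
}

// Wrapper that gives a provider a name and forwards warmup() to it.
class NamedProvider implements LlmProvider {
  readonly name: string;
  private readonly inner: LlmProvider;

  constructor(name: string, inner: LlmProvider) {
    this.name = name;
    this.inner = inner;
  }

  warmup(): void {
    this.inner.warmup?.();
  }
}

class ProviderRegistry {
  private readonly providers: LlmProvider[] = [];

  register(provider: LlmProvider): void {
    this.providers.push(provider);
  }

  // Fire-and-forget: each warmup() is expected to return immediately
  // and do its real work (for example starting vLLM) in the background.
  warmupAll(): void {
    for (const provider of this.providers) {
      provider.warmup?.();
    }
  }
}
```

Calling `warmupAll()` when the router is created lets the slow
subprocess start overlap with session initialization.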

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-03 19:07:39 +00:00