Compare commits

...

5 Commits

Author SHA1 Message Date
Michal
180e50a978 feat(cli): metadata.mcpServers auto-attach in mcpctl skills sync
Closes the third deferred item from PR-5: skills can declare upstream
MCP server dependencies via `metadata.mcpServers` and `mcpctl skills
sync` now attaches them to the active project. Same trust model as
postInstall/hooks: the publisher is responsible, the client just
asks mcpd to attach.

## Behaviour

- For each `{ name, fromTemplate?, project? }` entry:
  - If the project already has the server attached → record as
    `alreadyAttached`, no-op.
  - If the server doesn't exist on mcpd → warn + skip (we don't
    auto-create from template; that's a separate explicit decision
    for the operator). The warning surfaces the suggested template
    so the operator can decide.
  - Otherwise → POST `/api/v1/projects/:id/servers { server: <name> }`.
  - 409 from POST → treat as alreadyAttached (server-side idempotency).
  - 404 from POST → treat as missing (race: server vanished mid-sync).
  - Other errors collected per-server; sync continues.
- A dep with `project:` set to a non-active project is skipped during
  this sync — keeps the active sync from making collateral changes
  to other projects.
- Global skill syncs (no project context) skip mcpServers entirely
  with a warning — there's no project to attach to.

## Files

### Added
- src/cli/src/utils/mcpservers-materialiser.ts (~140 LOC)
- src/cli/tests/utils/mcpservers-materialiser.test.ts (~190 LOC,
  10 tests covering: parse-tolerance, fresh attach, alreadyAttached
  short-circuit, missing-server warn+skip, missing-project errors-
  out, 409→alreadyAttached, 404→missing, cross-project skip,
  per-server error collection, empty-deps no-op)

### Edited
- src/cli/src/commands/skills.ts: applyOne calls
  attachSkillMcpServers after hooks. Tracks per-skill attachments in
  result.mcpServersAttached. Summary line surfaces "N mcpServers
  attached".

## Verification

165 test files / 2193 tests green (up from 2182).

Real-world flow:

```yaml
# skill metadata.yaml
mcpServers:
  - name: my-grafana
    fromTemplate: grafana
    project: monitoring
  - name: my-loki
    fromTemplate: loki
```

```bash
# As operator: ensure the named servers exist on mcpd first
mcpctl create server my-grafana --from-template grafana --env-from-secret grafana-creds
mcpctl create server my-loki --from-template loki

# Now publish the skill that declares them as deps. Sync will attach:
mcpctl skills sync
# → mcpctl: 1 installed, 2 mcpServers attached
mcpctl describe project monitoring     # both servers now attached
```

## What's intentionally NOT in this PR

- Auto-creating servers from `fromTemplate` when they don't exist.
  Provisioning infra from a skill push is a separate decision needing
  explicit operator policy. v1 warns + skips; the warning includes
  the suggested template name so the operator can act manually.
- Detaching a server when a skill drops it from mcpServers. Detach is
  destructive (project loses access) and we can't tell whether the
  detach is intentional vs. accidental drop. PR-7 can revisit.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-07 19:12:20 +01:00
Michal
7ebc8b22d1 feat(cli): metadata.hooks materialisation in mcpctl skills sync
Closes the second-biggest deferred item from PR-5: skills can declare
PreToolUse / PostToolUse / SessionStart / Stop / SubagentStop /
Notification hooks in metadata.hooks, and `mcpctl skills sync` now
registers them in ~/.claude/settings.json with proper source-tagging.

## Why this needs source-tagging

mcpctl already managed ONE entry in settings.json (the SessionStart
hook for `mcpctl skills sync --quiet`, set up by PR-5 via
`installManagedSessionHook`). That entry uses `_mcpctl_managed: true`
to identify itself for idempotent updates.

Skill-declared hooks need finer scoping: skill A and skill B could
both register PreToolUse hooks, and removing skill A must not touch
skill B's. So entries written by `applyManagedHooks` carry both:

  _mcpctl_managed: true          ← same flag the SessionStart hook uses
  _mcpctl_source: "<skill-name>" ← differentiates per-skill

Removing a skill (orphan cleanup) drops every entry tagged with
that skill's name. User-added entries (no marker) are preserved
verbatim. The earlier session-hook installer keeps working: its
`_mcpctl_managed: true` lacks the source tag, so the hooks-materialiser
ignores it on per-skill operations.
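
For illustration, a settings.json fragment after one skill has synced
might look like this (skill/script names are made up; the marker fields
and group shape match the materialiser below):

```json
{
  "hooks": {
    "PreToolUse": [
      {
        "_mcpctl_source": "audit-skill",
        "hooks": [
          {
            "type": "command",
            "command": "audit-pretool.sh",
            "_mcpctl_managed": true,
            "_mcpctl_source": "audit-skill"
          }
        ]
      },
      {
        "hooks": [
          { "type": "command", "command": "my-own-hook.sh" }
        ]
      }
    ]
  }
}
```

The second group carries no markers: it's a user-added entry, and
per-skill updates and orphan cleanup leave it alone.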

## Behaviour

- Skill installs/updates: applyManagedHooks(skillName, metadata.hooks)
  is called after files are atomically materialised. The function
  reads settings.json, drops this skill's previous entries from every
  declared event (and from any event it previously had entries in but
  no longer declares — so a skill can shrink scope), then re-inserts
  the new tagged set.
- Skill orphan removal: removeManagedHooks(skillName) drops every
  entry owned by the skill. Other skills + user hooks unaffected.
- Failure handling: hooks errors are logged via the warn() callback
  but do not fail the sync. Same shape as postInstall — scoped, not
  fatal.

## Files

### Added
- src/cli/src/utils/hooks-materialiser.ts (~140 LOC)
- src/cli/tests/utils/hooks-materialiser.test.ts (~165 LOC, 11 tests
  covering: write-from-scratch, multi-source coexistence, user-hook
  preservation, update replaces (not duplicates), shrink drops events,
  remove targets only the named source, multiple events independent,
  idempotent, empty/JSONC settings.json tolerance)

### Edited
- src/cli/src/commands/skills.ts: applyOne calls applyManagedHooks
  when metadata.hooks is set; calls removeManagedHooks(skillName) when
  a previously-installed skill no longer declares hooks; orphan
  removal also drops the skill's hooks. New SyncResult field
  `hooksApplied`. Earlier `meta` declaration deduped (it was redeclared
  in the postInstall block).

## Verification

164 test files / 2182 tests green (up from 2171).

End-to-end on a real machine after this PR ships:

```yaml
# skill metadata.yaml
hooks:
  PreToolUse:
    - type: command
      command: "audit-pretool.sh"
  SessionStart:
    - type: command
      command: "claude-greeting.sh"
```

```bash
mcpctl skills sync
# → mcpctl: 1 installed, 1 hooks applied
jq '.hooks' ~/.claude/settings.json
# → entries tagged with _mcpctl_managed + _mcpctl_source
```

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-07 19:08:43 +01:00
Michal
d60ad52018 feat(cli): postInstall executor for mcpctl skills sync
Closes the biggest deferred item from PR-5. metadata.postInstall
scripts now actually run when their hash changes, with audit emission
back to mcpd.

Trust model unchanged from the corporate-appliance design: mcpd is
the source of truth, content is reviewed at publish time, the client
just runs. No sandbox, no signing, no consent prompts. The controls
that matter are already on the publishing side (RBAC, audit, reviewer
queue).

What we DO provide is ops hygiene:

- Hard timeout (default 60 s; per-skill override via
  metadata.postInstallTimeoutSec). SIGTERM at the deadline, SIGKILL
  after a 2 s grace.
- Hash pinning. The script's sha256 is recorded in
  ~/.mcpctl/skills-state.json. Re-syncs of an unchanged script are a
  cheap no-op. A re-published "same version, fixed script" still
  triggers re-execution because its hash changed.
- Curated env. PATH/HOME/USER/SHELL inherited; everything else dropped.
  MCPCTL_SKILL_NAME / _VERSION / _DIR / _PROJECT injected so scripts
  have stable context.
- Per-skill install log under ~/.mcpctl/skills/<name>/install.log.
  Bounded at 5 × 256 KB; old entries truncated from the front.
- Audit event back to mcpd (POST /api/v1/audit-events,
  eventKind=skill_postinstall) on every run, including hostname so
  admins can see fleet rollout state. Best-effort — failures are
  warned but never block the sync.
- Path-escape rejection. metadata.postInstall must resolve inside the
  skill bundle; relative paths that try to climb out are refused.
- Auto-chmod 0755 on the script before spawn. Some upstreams ship 0644
  + a shebang and expect a shell to handle it; we always spawn the
  path directly so we need +x.

Failure semantics: on timeout or non-zero exit, the recorded
postInstallHash is NOT updated. Next sync retries. Other skills in
the same sync run continue regardless — postInstall errors are
scoped, not fatal.
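
On the install-log bound from the hygiene list above, a sketch of the
front-truncation (MAX_LOG_BYTES matches the constant in postinstall.ts;
the helper name and exact policy wiring are illustrative):

```ts
import { readFile, writeFile } from 'node:fs/promises';

const MAX_LOG_BYTES = 256 * 1024;     // per-run tail cap, as in postinstall.ts
const LOG_BOUND = 5 * MAX_LOG_BYTES;  // "5 × 256 KB" from the bullet above

// Append a run's output; when the file outgrows the bound, drop the
// oldest bytes from the front so the newest runs always survive.
async function appendBounded(logPath: string, entry: string): Promise<void> {
  let existing = '';
  try { existing = await readFile(logPath, 'utf-8'); } catch { /* first run */ }
  let combined = existing + entry;
  if (combined.length > LOG_BOUND) {
    combined = combined.slice(combined.length - LOG_BOUND);
  }
  await writeFile(logPath, combined, 'utf-8');
}
```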

## Files

### Added
- src/cli/src/utils/postinstall.ts (~200 LOC)
- src/cli/tests/utils/postinstall.test.ts (~190 LOC, 10 tests covering
  success, env vars, chmod, non-zero exit, timeout via exec sleep,
  path-escape, missing script, log file shape + append-across-runs)

### Edited
- src/cli/src/commands/skills.ts: applyOne now invokes runPostInstall
  + emitPostInstallAudit when metadata.postInstall is set and the
  script hash differs from prior state. New SyncResult fields:
  postInstallsRan, postInstallsSkipped. Summary line surfaces
  "N postInstall ran". --skip-postinstall flag now actually does what
  it says.

## Verification

163 test files / 2171 tests green (up from 2161).

End-to-end on a real machine (after this PR ships and a skill with
metadata.postInstall is published):

```bash
mcpctl skills sync
# → mcpctl: 1 installed, 1 postInstall ran
cat ~/.mcpctl/skills/<name>/install.log    # see stdout/stderr
mcpctl skills sync                          # idempotent — skipped
```

If the same skill is republished with a fixed script:

```bash
mcpctl skills sync
# → mcpctl: 1 updated, 1 postInstall ran  (hash changed → rerun)
```

If the script fails (exit 7):

```bash
mcpctl skills sync
# → mcpctl: 1 updated, 1 errors
mcpctl skills sync   # state.postInstallHash NOT updated → retries
```

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-07 19:05:42 +01:00
Michal
e6cd73543a fix(mcpd): fail-loud on env resolution + retry/backoff + readiness via proxy
Three connected issues with how instances came up + got reported as
healthy when their secret backend was unreachable. The motivating
case: gitea-mcp-server starts when mcpd can't read the
gitea-creds secret from OpenBao, runs with an empty
GITEA_ACCESS_TOKEN, replies fine to tools/list (so liveness passes),
but every authed call fails with "token is required" — and
`mcpctl get instances` cheerfully reports the instance as healthy.

## What changed

### 1. Env resolution failures are now fatal for the start attempt

`src/mcpd/src/services/instance.service.ts`

The previous behaviour swallowed `resolveServerEnv` failures and let
the container start anyway with whatever env survived ("non-fatal —
container may still work if env vars are optional"). That's the bug:
the gitea container started with no token, ran for weeks, and was
reported healthy.

The catch now calls `markInstanceError(instance, "secret resolution
failed: <reason>")` and returns. Optional/missing env vars should be
modelled as `value: ""` entries on the server, not as silent
secret-resolution failures.
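
A sketch of the shape of the change (the catch body is paraphrased;
everything around `resolveServerEnv` / `markInstanceError` is an
assumption):

```ts
let env: Record<string, string>;
try {
  env = await resolveServerEnv(server);
} catch (err) {
  // Previously: warn and start the container with whatever env survived.
  // Now: fail the start attempt loudly and let the retry machinery own it.
  await markInstanceError(
    instance,
    `secret resolution failed: ${err instanceof Error ? err.message : String(err)}`,
  );
  return;
}
```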

### 2. ERROR instances retry with backoff, not blind churn

Adds Kubernetes-style escalation: 30 s × 5 attempts, then 5 min
pauses thereafter. Retry state lives on `McpInstance.metadata` (no
schema migration) — `attemptCount`, `lastAttemptAt`, `nextRetryAt`,
`error`.

The reconciler no longer tears down ERROR instances and creates
fresh replacements (which would reset attemptCount and effectively
loop at 30 s forever). Instead:

- ERROR rows whose `nextRetryAt` is in the future are LEFT ALONE
  and counted against the replica budget — preventing tight create-
  fail-create churn while a previous attempt is in its backoff window.
- ERROR rows whose `nextRetryAt` has elapsed are retried IN-PLACE
  via a new `retryInstance` method, which preserves attemptCount on
  the same row so the schedule actually escalates.

The work has been factored into `startOne` (creates + initial attempt)
+ `attemptStart` (env + container) + `retryInstance` (re-attempt the
same row) + `markInstanceError` (write retry metadata).
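
The schedule arithmetic, as a sketch (constants from above; the helper
name and where it's called from are assumptions):

```ts
const BASE_RETRY_MS = 30_000;       // 30 s for the first attempts
const ESCALATED_RETRY_MS = 300_000; // 5 min once the fast budget is spent
const FAST_ATTEMPTS = 5;

// Hypothetical helper used by markInstanceError when it writes
// attemptCount / nextRetryAt into McpInstance.metadata.
function nextRetryAt(attemptCount: number, now = new Date()): Date {
  const delayMs = attemptCount <= FAST_ATTEMPTS ? BASE_RETRY_MS : ESCALATED_RETRY_MS;
  return new Date(now.getTime() + delayMs);
}
// attempts 1–5 → retry in 30 s; attempt 6 onwards → retry in 5 min
```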

### 3. STDIO readiness probe goes through mcpProxyService

`src/mcpd/src/services/health-probe.service.ts`

The legacy `probeStdio` (a `docker exec node -e '... spawn(packageName)
...'` invocation) only worked for packageName-based servers. Image-
based STDIO servers like gitea-mcp-server fell through with "No
packageName or command for STDIO server" and were reported unhealthy
for the WRONG reason — they have no packageName because they are an
image, not because anything's wrong.

New `probeReadinessViaProxy`: sends `tools/call` through the live
running container via `mcpProxyService.execute`. Same code path as
production traffic, so probe failures match real failures. Picks up:

- JSON-RPC errors (e.g. "token is required" when env is empty).
- Tool-level errors expressed as `result.isError: true`.
- Connection failures wrapped as exceptions.
- Hard timeouts via the deadline race.
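
Condensed sketch of that decision ladder (the proxy call shape and the
return type are assumptions; the four failure classes are the ones
listed above):

```ts
type ProxyResponse = {
  error?: { message: string };
  result?: { isError?: boolean };
};

async function probeReadiness(
  // mcpProxyService.execute, pre-bound to the instance under probe
  execute: (req: { method: string; params: unknown }) => Promise<ProxyResponse>,
  check: { tool: string; timeoutSeconds: number },
): Promise<{ healthy: boolean; reason?: string }> {
  const deadline = new Promise<never>((_, reject) =>
    setTimeout(() => reject(new Error('probe timeout')), check.timeoutSeconds * 1000),
  );
  try {
    const res = await Promise.race([
      execute({ method: 'tools/call', params: { name: check.tool, arguments: {} } }),
      deadline,
    ]);
    if (res.error) return { healthy: false, reason: res.error.message };      // JSON-RPC error (the gitea case)
    if (res.result?.isError) return { healthy: false, reason: 'tool error' }; // result.isError
    return { healthy: true };
  } catch (err) {
    // Connection failure, or the deadline race firing first.
    return { healthy: false, reason: err instanceof Error ? err.message : String(err) };
  }
}
```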

After this PR, configuring `gitea` with
`healthCheck: { tool: get_me, intervalSeconds: 60 }` makes
`mcpctl get instances` report it as `unhealthy` whenever the auth
token is missing or wrong — which is honest.

The dead `probeStdio` (~120 LOC) is removed; HTTP/SSE bespoke probe
paths are kept for now (they work and the diff stays minimal).

## Tests

`src/mcpd/tests/instance-service.test.ts`:
- Replaces "cleans up ERROR instances and creates replacements" with
  "retries ERROR instances in-place when their backoff has elapsed".
- Adds "leaves ERROR instances alone while their nextRetryAt is in
  the future" and "escalates the backoff: attemptCount + nextRetryAt
  persist on retry failures".

`src/mcpd/tests/services/health-probe.test.ts`:
- Swaps STDIO probe mocks from `orchestrator.execInContainer` →
  `mcpProxyService.execute`.
- Adds "marks unhealthy when proxy returns a JSON-RPC error
  (e.g. broken-secret auth failure)" — explicitly the gitea case.
- Adds "marks unhealthy when proxy returns a tool-level error in
  result.isError" — covers servers that report tool failures as
  isError instead of as JSON-RPC errors.
- Renames "handles exec timeout" → "handles probe timeout" and
  exercises the deadline race rather than an exec rejection.

Full suite: 162 test files / 2161 tests green (+4 new).

## Manual verification step (post-deploy)

```bash
mcpctl edit server gitea
# → add healthCheck:
#     tool: get_me
#     intervalSeconds: 60
#     timeoutSeconds: 10
#     failureThreshold: 3
```

If OpenBao is still down: gitea instance enters ERROR with
attemptCount + nextRetryAt visible in `mcpctl describe instance`.
Otherwise: gitea env resolves at next start, probe passes, instance
is honestly healthy.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-07 18:55:23 +01:00
Michal
56735a5290 docs: skills + revisions + proposals reference, plus cheatsheet update
Phase 7 of the Skills + Revisions + Proposals work — documentation
pass for the surface added in PR-1 through PR-6. Reference material
only; no code changes.

## What's added

- `docs/skills.md` — skill model, scoping rules, CLI surface, the
  `mcpctl config claude --project` setup flow, metadata schema (with
  the deferred-execution note for hooks/mcpServers/postInstall), the
  on-disk state file shape, atomic install mechanics, failure
  semantics, and what's deferred.
- `docs/revisions.md` — ResourceRevision model, semver auto-bump
  rules, contentHash diff key (cross-resource sync), CLI for history
  / diff / restore, RBAC, audit emission, storage growth note.
- `docs/proposals.md` — ResourceProposal model, the reviewer flow
  (CLI + web UI), atomic-approval mechanics, the propose_prompt /
  propose_skill MCP tools, the propose-learnings global skill that
  steers Claude toward engaging with them, and the deferred legacy
  PromptRequest cutover.

## What's edited

- Top-level `CLAUDE.md` — resource cheatsheet adds `skill`, `proposal`,
  `revision` with cross-references to the new docs. The legacy
  `promptrequest` entry stays (still on the legacy code path) but
  notes that new work should use `proposal`.

## What's NOT in this PR

- The PromptRequest → ResourceProposal cutover migration. Both run
  side-by-side today; the focused cutover PR will rename + backfill +
  drop. Keeping that out of PR-7 means review can stay on docs.
- Bundle-backup / `mcpctl apply -f` skill support (deferred from PR-3).
- `metadata.hooks` / `metadata.mcpServers` / `metadata.postInstall`
  execution (deferred from PR-5).
- Existing-page UI migration to Tailwind (deferred from PR-6 — old
  inline-styled pages coexist fine inside the new Layout).

These are tracked as future PRs; each is its own focused change.

## Verification

`pnpm test:run` whole monorepo: 162 test files / 2157 tests green.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-07 17:58:04 +01:00
15 changed files with 2307 additions and 190 deletions

View File: CLAUDE.md

@@ -35,6 +35,9 @@ Key routing rules:
- `project` — workspace grouping servers, prompts, agents
- `llm` — server-managed LLM provider (api key + endpoint)
- `agent` — LLM persona pinned to one Llm; project attach surfaces project Prompts as system context, project MCP servers as tools, and exposes the agent itself as an MCP virtual server (`agent-<name>/chat`). See `docs/agents.md`, `docs/chat.md`.
- `prompt` / `promptrequest` — curated content / pending proposal
- `prompt` / `promptrequest` — curated content / legacy pending proposal (use `proposal` for new work).
- `skill` — Claude Code skill bundle (SKILL.md + files + typed metadata). Materialised onto disk by `mcpctl skills sync`. See `docs/skills.md`.
- `proposal` — generic pending proposal queue, replaces `promptrequest`. Covers both prompts and skills. See `docs/proposals.md`. Triage via `mcpctl review`.
- `revision` — append-only audit + diff log shared by prompts and skills. Auto-bumps semver on save. See `docs/revisions.md`.
- `rbac` — access control bindings
- `mcptoken` — bearer credentials for HTTP-mode mcplocal

126
docs/proposals.md Normal file
View File

@@ -0,0 +1,126 @@
# Resource Proposals
A proposal is a pending change to a Prompt or Skill, submitted by
either a Claude Code session (via the `propose_prompt` / `propose_skill`
MCP tools) or a human (via the web UI / CLI). Reviewers triage the
queue and either approve — at which point the proposal becomes a real
prompt or skill — or reject with a note.
This is the path by which Claude **proposes back** to mcpd: things the
session learned that future sessions would benefit from. The
`propose-learnings` global skill (seeded by mcpd at startup) explains
the discipline to Claude.
## Model
`ResourceProposal` shares the schema's discriminator pattern with
`ResourceRevision` — single table, `resourceType` field disambiguates
prompts vs skills.
| Field | Purpose |
|----------------------|--------------------------------------------------------|
| `resourceType` | `'prompt'` \| `'skill'`. |
| `name` | Proposed resource name. |
| `body` | Proposed body (`{ content, priority?, metadata?, … }`).|
| `projectId` / `agentId` | Scope of the proposal (XOR; null/null = global). |
| `createdBySession` | mcplocal session that proposed (when from Claude). |
| `createdByUserId` | User who proposed (when via UI/CLI). |
| `status` | `'pending'` \| `'approved'` \| `'rejected'`. |
| `reviewerNote` | Set on approval or rejection. |
| `approvedRevisionId` | Set when approved — points at the resulting revision. |
Two unique constraints — `(resourceType, name, projectId)` and
`(resourceType, name, agentId)` — mirror the Prompt / Skill scoping
rules. The same `?? ''` workaround for nullable-FK lookups applies.
## Reviewer flow
### CLI
```bash
mcpctl review pending # list pending
mcpctl review next # show oldest pending
mcpctl review show <id> # full detail
mcpctl review diff <id> # before/after diff
mcpctl review approve <id> # POST /proposals/:id/approve
mcpctl review reject <id> --reason "explain" # rejected with note
```
### Web UI
`/proposals` shows a Pending / Approved / Rejected tab view; the
sidebar nav badge polls every 30 s and shows the pending count in
amber. Click a row to see the full body, the diff against the current
resource (if any), and approve / reject controls.
### Approval is atomic
Approval runs in a single Prisma transaction:
1. Look up the pending proposal.
2. Dispatch by `resourceType` to the registered handler
(`PromptService` or `SkillService` registers itself at construction).
3. Handler upserts the underlying resource — creating it if new, or
updating + auto-bumping patch semver if it exists.
4. Handler records a `ResourceRevision` linking back to the proposal.
5. Proposal status flips to `approved`, `approvedRevisionId` set.
If any step fails, the transaction rolls back and the proposal stays
`pending`. There is no half-approved state.
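A sketch of that transaction, assuming Prisma's `$transaction` and a
handler registry keyed by `resourceType` (names here are illustrative,
not the service's actual identifiers):
```ts
async function approveProposal(id: string, reviewerNote?: string) {
  return prisma.$transaction(async (tx) => {
    const proposal = await tx.resourceProposal.findUniqueOrThrow({ where: { id } });
    if (proposal.status !== 'pending') throw new Error('proposal is not pending');
    // PromptService / SkillService register themselves here at construction.
    const handler = handlers[proposal.resourceType];
    // Upserts the resource, bumps semver, records the ResourceRevision.
    const revision = await handler.applyProposal(tx, proposal);
    return tx.resourceProposal.update({
      where: { id },
      data: { status: 'approved', reviewerNote, approvedRevisionId: revision.id },
    });
  }); // any throw above rolls everything back: no half-approved state
}
```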
## Claude side: `propose_prompt` and `propose_skill`
Both tools are registered by the `gate` plugin in mcplocal. They post
to `/api/v1/proposals` with the appropriate `resourceType`.
The `propose-learnings` global skill (seeded by mcpd) tells Claude
*when* to use them:
- `propose_prompt` for project-specific knowledge — gotchas,
conventions, hidden constraints. Cheap to add, easy to reject.
- `propose_skill` for cross-cutting knowledge — debugging discipline,
release hygiene, security review style. Larger blast radius; lean
toward `propose_prompt` unless you have a clear cross-project reason.
The `gate-encouragement-propose` system prompt (priority 10, sits in
the gating bundle) is the trigger that makes Claude actually consider
proposing. Without that, the tools exist but Claude rarely engages.
## Backwards compat
PR-1 / PR-2 deferred the cutover from the prompt-only `PromptRequest`
table to `ResourceProposal`. Both run side-by-side today:
- mcplocal's `propose_prompt` still POSTs to the legacy
`/api/v1/projects/:name/promptrequests` URL.
- mcplocal's `propose_skill` (newer) POSTs to `/api/v1/proposals`
directly.
- The legacy `/api/v1/promptrequests*` routes remain in mcpd.
- `mcpctl approve promptrequest <name>` still works.
A focused follow-up PR will:
1. Migrate existing `PromptRequest` rows into `ResourceProposal`
(resourceType=prompt).
2. Rename `PromptRequest` to `_PromptRequest_legacy`.
3. Update mcplocal's `propose_prompt` to use `/api/v1/proposals`.
4. Keep the legacy URL as a thin translation shim through one release.
5. Drop `_PromptRequest_legacy` after that.
This stays separate so the cutover is reviewable independently of
the larger Skills + Revisions + Proposals work.
## RBAC
Proposals piggyback on the `prompts` permission for now — anyone with
`view:prompts` can read the queue, anyone with `edit:prompts` can
approve or reject. Splitting out a dedicated `proposals` permission
(or a "reviewer" role) is straightforward if granularity becomes
useful.
## Audit emission
Proposal create / approve / reject events flow through the existing
audit pipeline. Approval events also reference the resulting
revision id, so you can join "proposal approved at T" against
"revision X created at T" without polling.

130
docs/revisions.md Normal file
View File

@@ -0,0 +1,130 @@
# Resource Revisions
mcpctl keeps an append-only revision log for every Prompt and Skill —
so you can answer "who changed prompt X and when," diff between any
two versions, and restore an earlier state without losing the audit
chain.
## Model
`ResourceRevision` is a single shared table keyed by
`(resourceType, resourceId)` — the type discriminator allows the same
infrastructure to cover both prompts and skills (and any future
resource that wants version history).
| Field | Purpose |
|------------------|----------------------------------------------------------|
| `id` | cuid; the revision's stable identity. |
| `resourceType` | `'prompt'` \| `'skill'`. Validated app-layer. |
| `resourceId` | Soft FK — survives deletion of the underlying resource. |
| `semver` | Author-visible version (X.Y.Z). |
| `contentHash` | sha256 of the canonicalised body. Stable diff key. |
| `body` | Snapshot of the resource at this revision. |
| `authorUserId` | Who made the change (null for system writes). |
| `authorSessionId`| Session that proposed it (when applicable). |
| `note` | Free-text reviewer or author note. |
| `createdAt` | When the revision was recorded. |
The resource row itself (Prompt/Skill) keeps the inline `content`;
revisions are an audit log, not the source of truth. Hot read paths
(the gate plugin, `mcpctl skills sync`, prompt indexing) never need
to consult the revision log.
`Prompt.currentRevisionId` and `Skill.currentRevisionId` are soft
pointers to the latest revision so the UI can answer "which version is
live" in one query.
## Semver semantics
Auto-patch on every successful save where the body changed:
```
0.1.0 → save with content change → 0.1.1
0.1.1 → save with content change → 0.1.2
```
Authors can override:
```bash
mcpctl edit prompt foo --bump minor # 0.1.x → 0.2.0
mcpctl edit prompt foo --bump major # 0.x.x → 1.0.0
mcpctl edit prompt foo --semver 1.2.3 # explicit
mcpctl edit prompt foo --note "fixed the gotcha" # adds note to revision
```
Invalid semver values fall back to `0.1.0` rather than throwing —
the revision write is best-effort and we don't want a corrupted
existing semver to break the prompt save.
## contentHash
sha256 of the JSON-canonicalised body (keys sorted at every object
level). Two revisions with the same hash are byte-identical. Used by
`mcpctl skills sync` as the diff key against on-disk state — re-publish
under the same semver still triggers a sync if the contentHash changed.
The server-side hash and the client-side hash are computed from the
same canonical shape, so they match exactly. See
`src/mcpd/src/services/resource-revision.service.ts` for the canonical
JSON encoder.
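A minimal sketch of the canonicalisation (the real encoder is the one
in the service above; the `sha256:` prefix follows the format the CLI's
hash helpers use):
```ts
import { createHash } from 'node:crypto';

// Sort object keys recursively so the same logical body always
// serialises to the same bytes, regardless of insertion order.
function canonicalise(value: unknown): unknown {
  if (Array.isArray(value)) return value.map(canonicalise);
  if (value !== null && typeof value === 'object') {
    const obj = value as Record<string, unknown>;
    return Object.fromEntries(
      Object.keys(obj).sort().map((k) => [k, canonicalise(obj[k])]),
    );
  }
  return value;
}

export function contentHash(body: unknown): string {
  return 'sha256:' + createHash('sha256').update(JSON.stringify(canonicalise(body))).digest('hex');
}
```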
## CLI
### View history
```bash
mcpctl get revisions prompt my-prompt
mcpctl get revisions skill demo-skill
```
### View one
```bash
mcpctl describe revision <id>
```
### Diff
The HTTP API returns a unified-format diff:
```
GET /api/v1/revisions/<id>/diff?against=<other-id|live>
```
The web UI's revision history tab on a Skill detail page renders the
diff inline (color-coded add/remove rows).
### Restore
Restore a prompt or skill to an earlier revision. This writes a *new*
revision whose body is the old one — preserving the audit chain
rather than deleting later revisions.
```bash
mcpctl restore prompt my-prompt --revision <revision-id>
```
The CLI subcommand is wired through to `POST
/api/v1/prompts/:id/restore-revision` (and the symmetric
`/api/v1/skills/:id/restore-revision`).
## RBAC
Revisions piggyback on the underlying resource's RBAC permission. If
you can `view:prompts`, you can read prompt history; if you can
`edit:prompts`, you can restore.
## Audit emission
Each revision write emits a structured audit event captured by the
existing audit-event pipeline. The event includes the revision id,
contentHash, semver, and author/session — sufficient to answer "what
changed" and "who" without joining tables manually.
## Storage size
A revision body is the resource snapshot — for prompts that's a few
KB; for skills with large `files` maps it can be tens of KB. The audit
log grows linearly with edits. v1 has no rotation; if a single resource
sees thousands of revisions per day this will need a retention policy
(out of scope today).

214
docs/skills.md Normal file
View File

@@ -0,0 +1,214 @@
# Skills
Skills are Claude Code skill bundles distributed by mcpctl. Each skill is a
named bundle of files — at minimum a `SKILL.md` explaining the skill's purpose
and triggers, optionally with auxiliary scripts, templates, or data files. The
mcpctl daemon (mcpd) is the source of truth; `mcpctl skills sync` materialises
the skills onto each dev machine under `~/.claude/skills/<name>/`, where Claude
Code reads them natively.
```
┌─ mcpd (Postgres) ──────────────────────────────┐
│ Skill rows (content + files{} + metadata)      │
└────────────────┬───────────────────────────────┘
                 │ HTTP, hash-pinned diff
┌─ ~/.claude/skills/<name>/ ─────────────────────┐
│ SKILL.md                                       │
│ scripts/setup.sh                               │
│ …                                              │
└────────────────────────────────────────────────┘
```
## Trust model
Skills are added by senior admins together with a security reviewer at
publish time on mcpd. Once content is in mcpd, clients trust what mcpd
serves — no client-side sandboxing, no signature checks, no consent
prompts. The rigor lives on the publishing side (RBAC, audit, the
reviewer queue). See [proposals.md](proposals.md) for the
review→approve flow.
If you're publishing skills to clients you don't trust (e.g. an open-
source distribution), the design is wrong for that — the skill format
itself is fine, but the unguarded client trust assumption isn't.
## Scoping
A skill attaches to one of:
- **Global** — `projectId` and `agentId` both null. Synced onto every dev
machine when its sync runs (with or without a project context).
- **Project-scoped** — `projectId` set. Synced onto machines whose
`.mcpctl-project` marker matches.
- **Agent-scoped** — `agentId` set. Surfaced administratively via the
API; not currently materialised onto disk by `mcpctl skills sync`
(see "Future" below).
The same `<name>` can exist at multiple scopes simultaneously. The two
unique constraints are `(name, projectId)` and `(name, agentId)`.
## CLI
### Create
```bash
mcpctl create skill <name> \
[--project <name> | --agent <name>] \
--content / --content-file <path> \
[--description "<text>"] \
[--priority <1-10>] \
[--semver <X.Y.Z>] \
[--metadata-file <path>] \
[--files-dir <path>]
```
`--content-file` provides the `SKILL.md` body. `--metadata-file`
accepts YAML or JSON; see "Metadata" below for the schema. `--files-dir`
walks a directory tree into the `files{}` map (UTF-8 only; non-text
files rejected — extend later if needed).
### Edit
```bash
# Edit content in $EDITOR
mcpctl edit skill <name>
# Edit + bump semver
mcpctl edit skill <name> --bump major|minor|patch --note "<message>"
# Edit + set explicit semver
mcpctl edit skill <name> --semver 1.2.3
```
Each save records a `ResourceRevision` automatically. See
[revisions.md](revisions.md).
### Sync to disk
```bash
# In a project directory (with .mcpctl-project marker):
mcpctl skills sync
# Override project:
mcpctl skills sync --project <name>
# Globals only (no project context, no marker):
cd / && mcpctl skills sync
# Used by the SessionStart hook — fail-open on network errors:
mcpctl skills sync --quiet
```
Useful flags:
| Flag | Purpose |
|---------------------|-----------------------------------------------------------|
| `--dry-run` | Print what would change, don't write anything. |
| `--force` | Overwrite locally-modified skills. |
| `--quiet` | Suppress output unless something changed; fail-open. |
| `--keep-orphans` | Don't remove skills no longer in the server set. |
| `--skip-postinstall`| Reserved for the postInstall executor (deferred). |
## Project setup
`mcpctl config claude --project <name>` does the full pickup chain:
1. Writes `.mcp.json` so Claude Code routes MCP traffic through mcplocal.
2. Writes `.mcpctl-project` (single line, project name) so `skills sync`
knows which project's skills to pull when run from anywhere under
that directory.
3. Runs an initial `skills sync` synchronously.
4. Installs a SessionStart hook in `~/.claude/settings.json` that runs
`mcpctl skills sync --quiet` before every Claude session. Tagged
with `_mcpctl_managed: true` so subsequent runs find and update it
instead of duplicating it.
Pass `--skip-skills` to opt out of steps 2–4 (useful in CI).
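For reference, the managed SessionStart entry takes roughly this shape
(grouping and field order are illustrative; the marker is the one
described above):
```json
{
  "hooks": {
    "SessionStart": [
      {
        "hooks": [
          {
            "type": "command",
            "command": "mcpctl skills sync --quiet",
            "_mcpctl_managed": true
          }
        ]
      }
    ]
  }
}
```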
## Metadata
The `metadata` field is a typed JSON blob:
```yaml
hooks:
PreToolUse:
- type: command
command: "echo before-tool"
PostToolUse:
- type: command
command: "echo after-tool"
SessionStart:
- type: command
command: "echo session-started"
mcpServers:
- name: my-grafana
fromTemplate: grafana
project: monitoring
postInstall: scripts/install.sh
preUninstall: scripts/cleanup.sh
postInstallTimeoutSec: 60
```
**v1 sync executes none of these — they're stored verbatim and
materialisation is deferred to a follow-up.** Once enabled:
- `hooks` will be written into `~/.claude/settings.json` with
`_mcpctl_managed: true` markers (see Project Setup above for how
the SessionStart hook works today).
- `mcpServers` will be auto-attached via the mcpd attach API.
- `postInstall` will run as the user with a curated env, hard timeout,
and an audit event emitted back to mcpd. Hash-pinned: re-syncs of
unchanged scripts won't re-execute.
## State
`~/.mcpctl/skills-state.json` tracks the last-synced state:
- per-skill: `id`, `semver`, `contentHash` (matches mcpd's hash),
`installDir`, per-file `sha256` + size, `postInstallHash`,
`lastSyncedAt`.
- top-level: `lastSync`, `lastSyncProject`, `schemaVersion`.
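An illustrative shape (values made up; fields are the ones above):
```json
{
  "schemaVersion": 1,
  "lastSync": "2026-05-07T18:00:00.000Z",
  "lastSyncProject": "monitoring",
  "skills": {
    "demo-skill": {
      "id": "clx…",
      "semver": "0.1.2",
      "contentHash": "sha256:ab12…",
      "installDir": "/home/dev/.claude/skills/demo-skill",
      "files": {
        "SKILL.md": { "sha256": "sha256:cd34…", "size": 1824 }
      },
      "postInstallHash": null,
      "lastSyncedAt": "2026-05-07T18:00:00.000Z"
    }
  }
}
```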
The state file is written atomically (temp + rename). Per-file SHA-256
detects local edits — sync warns and skips modified files unless you
pass `--force`.
State lives outside `~/.claude/skills/` deliberately so Claude Code
doesn't see our bookkeeping in its tree.
## Atomic install
Each skill is staged under `<targetDir>.mcpctl-staging-<pid>/`, then
the existing directory (if any) is renamed to
`<targetDir>.mcpctl-trash-<pid>`, the staging dir is moved into place,
and the trash is rmtree'd. A concurrent reader (Claude Code starting up)
never sees a partial tree.
Symmetric atomic delete for orphan removal: rename to trash, rmtree.
Locally-modified skills are preserved (warned + skipped) unless `--force`.
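The rename dance, as a sketch (the real implementation is
`installSkillAtomic` in skills-disk.ts; this paraphrase assumes only
the naming scheme above):
```ts
import { mkdir, rename, rm } from 'node:fs/promises';

async function installAtomic(targetDir: string, writeTree: (dir: string) => Promise<void>) {
  const staging = `${targetDir}.mcpctl-staging-${process.pid}`;
  const trash = `${targetDir}.mcpctl-trash-${process.pid}`;
  await mkdir(staging, { recursive: true });
  await writeTree(staging);                          // full tree lands in staging first
  await rm(trash, { recursive: true, force: true }); // defensive: stale trash from a crash
  try {
    await rename(targetDir, trash);                  // old tree out of the way
  } catch { /* first install: nothing to move */ }
  await rename(staging, targetDir);                  // new tree in, in a single rename
  await rm(trash, { recursive: true, force: true }); // readers never saw a partial tree
}
```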
## Failure semantics
| Situation | Exit code | Behaviour |
|----------------------------------|-----------|------------------------------------|
| Network/timeout in `--quiet` | 0 | Skip silently. SessionStart hook never blocks Claude. |
| Auth failure | 1 | "run mcpctl login" message. |
| Disk full / state save failure | 2 | Loud error. |
| Per-skill error | 0 | Logged in result errors[]; sync continues. |
The fail-open behaviour in `--quiet` is non-negotiable — a hung mcpd
must never block Claude Code starting up.
## Future
The following are deferred to follow-up PRs:
- `metadata.hooks` materialisation into `~/.claude/settings.json`
- `metadata.mcpServers` auto-attach
- `metadata.postInstall` execution with curated env + audit emission
- Agent-scoped skills synced to disk (would need an agent-identity-on-
disk concept that doesn't exist yet)
- Bundle backup support for skills (bundle-backup is one path; git-backup
is the other and is wired today)
- `mcpctl apply -f skill.yaml` declarative skill apply

View File: src/cli/src/commands/skills.ts

@@ -16,6 +16,20 @@ import {
removeSkillAtomic,
type SkillBody,
} from '../utils/skills-disk.js';
import {
runPostInstall,
emitPostInstallAudit,
hashScript,
} from '../utils/postinstall.js';
import {
applyManagedHooks,
removeManagedHooks,
type HooksByEvent,
} from '../utils/hooks-materialiser.js';
import {
attachSkillMcpServers,
parseMcpServerDeps,
} from '../utils/mcpservers-materialiser.js';
import { ApiError } from '../api-client.js';
/**
@@ -50,6 +64,19 @@ interface FullSkill {
agentId: string | null;
}
/**
* Shape of `metadata` we care about at sync time. Validated server-side
* by SkillMetadataSchema (PR-3); we re-narrow here for the fields the
* sync acts on, keeping the rest opaque so future additions don't
* require a CLI change.
*/
interface SyncedSkillMetadata {
postInstall?: unknown;
postInstallTimeoutSec?: unknown;
hooks?: unknown;
mcpServers?: unknown;
}
export interface SyncOpts {
/** Project name override; otherwise marker walk-up + fall back to globals-only. */
project?: string;
@@ -72,6 +99,10 @@ export interface SyncResult {
skipped: string[];
removed: string[];
preserved: string[]; // skills with local edits we left alone
postInstallsRan: string[]; // skills whose postInstall executed in this sync
postInstallsSkipped: string[]; // skills with postInstall but unchanged hash → no rerun
hooksApplied: string[]; // skills whose hooks were registered/updated in ~/.claude/settings.json
mcpServersAttached: string[]; // "<skill>:<server>" tuples that landed in this sync
errors: Array<{ skill: string; error: string }>;
exitCode: 0 | 1 | 2;
}
@@ -95,6 +126,10 @@ export async function runSkillsSync(opts: SyncOpts, deps: SyncDeps): Promise<Syn
skipped: [],
removed: [],
preserved: [],
postInstallsRan: [],
postInstallsSkipped: [],
hooksApplied: [],
mcpServersAttached: [],
errors: [],
exitCode: 0,
};
@@ -189,6 +224,8 @@ export async function runSkillsSync(opts: SyncOpts, deps: SyncDeps): Promise<Syn
continue;
}
await removeSkillAtomic(prior.installDir);
// Drop any hook entries this skill registered.
try { await removeManagedHooks(name); } catch { /* best-effort */ }
delete state.skills[name];
result.removed.push(name);
} catch (err: unknown) {
@@ -210,18 +247,29 @@ export async function runSkillsSync(opts: SyncOpts, deps: SyncDeps): Promise<Syn
}
// 8. Summary.
if (!opts.quiet || result.errors.length > 0 || result.installed.length > 0 || result.updated.length > 0 || result.removed.length > 0) {
const anythingHappened =
result.errors.length > 0 ||
result.installed.length > 0 ||
result.updated.length > 0 ||
result.removed.length > 0 ||
result.postInstallsRan.length > 0 ||
result.hooksApplied.length > 0 ||
result.mcpServersAttached.length > 0;
if (!opts.quiet || anythingHappened) {
const parts: string[] = [];
if (result.installed.length) parts.push(`${String(result.installed.length)} installed`);
if (result.updated.length) parts.push(`${String(result.updated.length)} updated`);
if (result.skipped.length) parts.push(`${String(result.skipped.length)} unchanged`);
if (result.removed.length) parts.push(`${String(result.removed.length)} removed`);
if (result.preserved.length) parts.push(`${String(result.preserved.length)} preserved (modified)`);
if (result.postInstallsRan.length) parts.push(`${String(result.postInstallsRan.length)} postInstall ran`);
if (result.hooksApplied.length) parts.push(`${String(result.hooksApplied.length)} hooks applied`);
if (result.mcpServersAttached.length) parts.push(`${String(result.mcpServersAttached.length)} mcpServers attached`);
if (result.errors.length) parts.push(`${String(result.errors.length)} errors`);
if (parts.length === 0) parts.push('no changes');
if (!opts.quiet) {
log(`mcpctl skills sync${projectName ? ` (project: ${projectName})` : ' (global only)'}: ${parts.join(', ')}`);
} else if (result.installed.length || result.updated.length || result.removed.length || result.errors.length) {
} else if (anythingHappened) {
// Quiet mode: only emit a single line if something actually happened.
warn(`mcpctl: ${parts.join(', ')}`);
}
@@ -255,6 +303,112 @@ export async function runSkillsSync(opts: SyncOpts, deps: SyncDeps): Promise<Syn
};
const fileStates = await installSkillAtomic(targetDir, body);
// ── hooks: register metadata.hooks in ~/.claude/settings.json ──
// Tagged with _mcpctl_source: <skill-name> so each skill's hooks
// can be cleanly added/updated/removed without trampling other
// skills or user-added hooks. No-op when the field is absent or
// empty.
const meta = (full.metadata ?? {}) as SyncedSkillMetadata;
if (meta.hooks && typeof meta.hooks === 'object') {
try {
const hookRes = await applyManagedHooks(v.name, meta.hooks as HooksByEvent);
if (hookRes.updated) result.hooksApplied.push(v.name);
} catch (err: unknown) {
warn(`mcpctl: failed to apply hooks for skill '${v.name}': ${err instanceof Error ? err.message : String(err)}`);
}
} else if (prior !== undefined) {
// Skill no longer declares hooks but used to — clean up.
try { await removeManagedHooks(v.name); } catch { /* best-effort */ }
}
// ── mcpServers: auto-attach declared deps to the active project ──
// Only meaningful when a project context is active; global skills
// can't attach to "no project". v1 doesn't auto-create missing
// servers (warn + skip). Idempotent — re-syncing a skill whose
// deps are already attached is a no-op.
const mcpServerDeps = parseMcpServerDeps(meta.mcpServers);
if (mcpServerDeps.length > 0 && projectName) {
try {
const att = await attachSkillMcpServers(client, projectName, mcpServerDeps, warn);
for (const srv of att.attached) {
result.mcpServersAttached.push(`${v.name}:${srv}`);
}
for (const e of att.errors) {
result.errors.push({
skill: v.name,
error: `mcpServers attach '${e.server}': ${e.error}`,
});
}
} catch (err: unknown) {
warn(`mcpctl: failed to attach mcpServers for skill '${v.name}': ${err instanceof Error ? err.message : String(err)}`);
}
} else if (mcpServerDeps.length > 0) {
warn(`mcpctl: skill '${v.name}' declares mcpServers but sync is running global-only; skipping attach`);
}
// ── postInstall: run metadata.postInstall when present ──
// Hash-pinned: only execute when the script's sha256 differs from
// what state recorded. Failures DO NOT update the recorded hash so
// the next sync retries. Other skills continue regardless.
let postInstallHash: string | null = prior?.postInstallHash ?? null;
if (
!opts.skipPostInstall &&
typeof meta.postInstall === 'string' &&
meta.postInstall.length > 0
) {
const scriptRel = meta.postInstall;
const scriptContent = (full.files ?? {})[scriptRel];
if (typeof scriptContent !== 'string') {
warn(`mcpctl: skill '${v.name}' postInstall references '${scriptRel}' which is not in files{}; skipping`);
} else {
const newHash = hashScript(scriptContent);
const hashChanged = newHash !== prior?.postInstallHash;
if (!hashChanged) {
result.postInstallsSkipped.push(v.name);
postInstallHash = newHash;
} else {
try {
const timeoutSec = typeof meta.postInstallTimeoutSec === 'number' ? meta.postInstallTimeoutSec : undefined;
const piInput = {
installDir: targetDir,
scriptPath: scriptRel,
skillName: v.name,
semver: v.semver,
projectName: projectName ?? undefined,
timeoutSec,
logsDir: join(homedir(), '.mcpctl', 'skills', v.name),
};
const installResult = await runPostInstall(piInput);
// Best-effort audit. Don't await; mcpd slowness shouldn't slow sync.
void emitPostInstallAudit(client, piInput, installResult, (m) => warn(m));
if (installResult.timedOut) {
result.errors.push({
skill: v.name,
error: `postInstall timed out after ${String(installResult.durationMs)}ms; rerun next sync`,
});
// hash NOT updated → retry on next sync
} else if (installResult.exitCode !== 0) {
const tail = installResult.stderrTail.trim() || installResult.stdoutTail.trim() || `exit ${String(installResult.exitCode)}`;
result.errors.push({
skill: v.name,
error: `postInstall failed (exit ${String(installResult.exitCode)}): ${tail.slice(-200)}`,
});
// hash NOT updated → retry on next sync
} else {
postInstallHash = installResult.scriptHash;
result.postInstallsRan.push(v.name);
}
} catch (err: unknown) {
result.errors.push({
skill: v.name,
error: `postInstall error: ${err instanceof Error ? err.message : String(err)}`,
});
}
}
}
}
const newState: SkillState = {
id: v.id,
semver: v.semver,
@@ -262,10 +416,7 @@ export async function runSkillsSync(opts: SyncOpts, deps: SyncDeps): Promise<Syn
scope: v.scope,
installDir: targetDir,
files: fileStates,
// Tier-2 fields — postInstall execution is deferred to a follow-up
// PR. For now we record the hash so we can detect script changes
// when execution lands.
postInstallHash: null,
postInstallHash,
lastSyncedAt: new Date().toISOString(),
};
state.skills[v.name] = newState;

View File: src/cli/src/utils/hooks-materialiser.ts

@@ -0,0 +1,180 @@
/**
* Materialise skill-declared hooks into Claude Code's
* `~/.claude/settings.json`.
*
* Each entry we write carries two markers:
* `_mcpctl_managed: true` — same flag the SessionStart-hook
* installer uses; identifies an entry mcpctl owns.
* `_mcpctl_source: "<skill-name>"` — which skill installed it.
*
* The combination lets us cleanly add/update/remove skill hooks without
* clobbering hooks the user added by hand and without one skill trampling
* another. Removing skill X re-reads the file, drops every entry tagged
* `_mcpctl_source: "X"`, and rewrites atomically.
*
* Claude Code ignores the extra fields (it only looks at `type` and
* `command`).
*
* The file is written atomically (temp + rename) and tolerant of an
* existing file that has comments, no `hooks` block, or unexpected
* shape — same robustness profile as sessionhook.ts.
*/
import { readFile, writeFile, mkdir, rename } from 'node:fs/promises';
import { dirname, join } from 'node:path';
import { homedir } from 'node:os';
import { MARKER_KEY } from './sessionhook.js';
export const SOURCE_KEY = '_mcpctl_source';
/** A single hook entry: must be `type: 'command'` for v1. Extra fields preserved. */
export interface ManagedHookEntry {
type: 'command';
command: string;
timeout?: number;
/** Free-form: skills can attach extra fields and they'll round-trip. */
[k: string]: unknown;
}
/** Recognised hook events. Validated server-side; if a new event lands later, we still write whatever the skill declares. */
export type HookEvent =
| 'PreToolUse'
| 'PostToolUse'
| 'SessionStart'
| 'Stop'
| 'SubagentStop'
| 'Notification';
export type HooksByEvent = Partial<Record<HookEvent, ManagedHookEntry[]>>;
interface HookGroup {
hooks: ManagedHookEntry[];
[k: string]: unknown;
}
interface Settings {
hooks?: Partial<Record<string, HookGroup[]>>;
[k: string]: unknown;
}
function defaultSettingsPath(): string {
return join(homedir(), '.claude', 'settings.json');
}
async function readSettings(path: string): Promise<Settings> {
try {
const raw = await readFile(path, 'utf-8');
if (raw.trim().length === 0) return {};
const stripped = raw.replace(/^\s*\/\/.*$/gm, '');
return JSON.parse(stripped) as Settings;
} catch (err: unknown) {
if (isNotFoundError(err)) return {};
throw err;
}
}
async function writeSettings(path: string, settings: Settings): Promise<void> {
await mkdir(dirname(path), { recursive: true });
const tmp = `${path}.tmp.${String(process.pid)}`;
await writeFile(tmp, JSON.stringify(settings, null, 2) + '\n', 'utf-8');
await rename(tmp, path);
}
function isManagedBy(entry: unknown, source: string): boolean {
if (entry === null || typeof entry !== 'object') return false;
const e = entry as Record<string, unknown>;
return e[MARKER_KEY] === true && e[SOURCE_KEY] === source;
}
/**
* Replace this skill's hook entries with the provided set. If `hooks`
* omits an event the skill previously installed, those entries are
* dropped. Other skills' entries and user-added entries are preserved.
*
 * Returns whether anything changed so callers can short-circuit
 * no-op writes.
*/
export async function applyManagedHooks(
source: string,
hooks: HooksByEvent,
settingsPath: string = defaultSettingsPath(),
): Promise<{ updated: boolean; settingsPath: string }> {
const settings = await readSettings(settingsPath);
if (!settings.hooks) settings.hooks = {};
let changed = false;
// For each known/declared event, drop our previous entries and add the new ones.
const declaredEvents = new Set<string>(Object.keys(hooks));
// Also walk events that already have entries from this source (so skills can shrink scope).
for (const [eventName, groups] of Object.entries(settings.hooks)) {
if (!Array.isArray(groups)) continue;
if (groups.some((g) => Array.isArray(g.hooks) && g.hooks.some((e) => isManagedBy(e, source)))) {
declaredEvents.add(eventName);
}
}
for (const eventName of declaredEvents) {
const desired = hooks[eventName as HookEvent] ?? [];
const groups = (settings.hooks[eventName] as HookGroup[] | undefined) ?? [];
// Strip our entries from each group, then drop empty groups.
const stripped: HookGroup[] = [];
for (const group of groups) {
if (!Array.isArray(group?.hooks)) {
stripped.push(group);
continue;
}
const before = group.hooks.length;
const filtered = group.hooks.filter((e) => !isManagedBy(e, source));
if (filtered.length !== before) changed = true;
if (filtered.length > 0) {
stripped.push({ ...group, hooks: filtered });
}
}
// Insert the new set as a single group tagged with our source.
if (desired.length > 0) {
const tagged = desired.map((entry) => ({
...entry,
type: 'command' as const,
[MARKER_KEY]: true,
[SOURCE_KEY]: source,
}));
stripped.push({ hooks: tagged, [SOURCE_KEY]: source });
changed = true;
}
if (stripped.length === 0) {
// No groups left for this event — drop the event entirely so the
// settings.json doesn't accumulate empty arrays.
delete settings.hooks[eventName];
} else {
settings.hooks[eventName] = stripped;
}
}
if (!changed) {
return { updated: false, settingsPath };
}
await writeSettings(settingsPath, settings);
return { updated: true, settingsPath };
}
/**
* Drop all hook entries owned by `source`. Used by the sync's orphan-
* removal path so a skill that's no longer in the server set
* un-registers its hooks too. Returns whether anything was changed.
*/
export async function removeManagedHooks(
source: string,
settingsPath: string = defaultSettingsPath(),
): Promise<{ removed: boolean; settingsPath: string }> {
const result = await applyManagedHooks(source, {}, settingsPath);
return { removed: result.updated, settingsPath: result.settingsPath };
}
function isNotFoundError(err: unknown): boolean {
return typeof err === 'object' && err !== null && (err as { code?: string }).code === 'ENOENT';
}

View File: src/cli/src/utils/mcpservers-materialiser.ts

@@ -0,0 +1,176 @@
/**
* Auto-attach the MCP server dependencies a skill declares to the
* project that's syncing. Per the corporate-appliance trust model,
* publishing a skill that says "this project depends on my-grafana"
* is enough — the client takes mcpd at its word and asks mcpd to
* attach the server to the project.
*
* What this function does NOT do (deliberately):
* - Auto-create the server from a template if it's missing.
* Provisioning infrastructure from a skill push is a separate
* decision that needs explicit operator consent. v1 just warns
* when the named server doesn't exist and skips that dep.
* - Detach servers that a skill removed from its mcpServers list.
* Detach is destructive (the project loses access) and the
* `attach` itself is idempotent on the server side, so we err
* on the side of leaving things attached. PR-7 can revisit if
* a use case shows up.
*
* The mcpServers field is per-project: a skill's declared deps only
* get attached to the project the sync is running for. Global skills
* (no projectName context) skip this step entirely — there's no
* project to attach to.
*/
import type { ApiClient } from '../api-client.js';
import { ApiError } from '../api-client.js';
export interface McpServerDep {
name: string;
fromTemplate?: string;
project?: string;
}
export interface AttachResult {
attached: string[];
alreadyAttached: string[];
missing: string[];
errors: Array<{ server: string; error: string }>;
}
/**
* Resolve project name → id, list its currently-attached servers,
* then attach each declared dep that isn't already there. Idempotent
* by virtue of the existing-attachment check.
*
* Failures per-server are collected, not thrown — sync continues.
*/
export async function attachSkillMcpServers(
client: ApiClient,
projectName: string,
deps: McpServerDep[],
warn: (msg: string) => void = () => {},
): Promise<AttachResult> {
const result: AttachResult = {
attached: [],
alreadyAttached: [],
missing: [],
errors: [],
};
if (deps.length === 0) return result;
// Resolve project → id (the attach endpoint is keyed by id, not name).
let projectId: string;
try {
const projects = await client.get<Array<{ id: string; name: string }>>('/api/v1/projects');
const match = projects.find((p) => p.name === projectName);
if (!match) {
// No project to attach to — surface every dep as an error so the
// operator can see something is mis-configured.
for (const dep of deps) {
result.errors.push({ server: dep.name, error: `Project '${projectName}' not found` });
}
return result;
}
projectId = match.id;
} catch (err: unknown) {
for (const dep of deps) {
result.errors.push({
server: dep.name,
error: `Failed to resolve project: ${err instanceof Error ? err.message : String(err)}`,
});
}
return result;
}
// Inspect current attachments. The /api/v1/projects/:id/servers POST
// endpoint is idempotent server-side, but we still pre-check so we
// can report alreadyAttached vs newly-attached cleanly.
let attached = new Set<string>();
try {
const project = await client.get<{ servers?: Array<{ server?: { name: string } }> }>(`/api/v1/projects/${projectId}`);
attached = new Set(
(project.servers ?? [])
.map((s) => s.server?.name)
.filter((n): n is string => typeof n === 'string'),
);
} catch (err: unknown) {
warn(`mcpctl: failed to read current attachments for project '${projectName}': ${err instanceof Error ? err.message : String(err)}`);
// Fall through with an empty set — we'll attempt attaches and let
// server-side idempotency cover any duplicates.
}
// Optionally narrow the existing-server set so we can warn loudly on
// unknown server names. (Server attaches against a non-existent
// server would 404 anyway, but a clearer warning is friendlier.)
let existingServers = new Set<string>();
try {
const servers = await client.get<Array<{ name: string }>>('/api/v1/servers');
existingServers = new Set(servers.map((s) => s.name));
} catch {
// Best-effort; if listing fails we still try the attach.
}
for (const dep of deps) {
// Honour an explicit `project` on the dep — defensive, normally
// matches the active project anyway. Skip mismatches so a skill
// can declare deps for a different project without collateral
// damage during this sync.
if (dep.project && dep.project !== projectName) {
continue;
}
if (attached.has(dep.name)) {
result.alreadyAttached.push(dep.name);
continue;
}
if (existingServers.size > 0 && !existingServers.has(dep.name)) {
// Server doesn't exist on mcpd. v1 doesn't auto-create; warn and continue.
const detail = dep.fromTemplate
? ` (skill suggests creating it via template '${dep.fromTemplate}')`
: '';
warn(`mcpctl: skill mcpServers dep '${dep.name}' not found on mcpd${detail}; skipping attach`);
result.missing.push(dep.name);
continue;
}
try {
await client.post(`/api/v1/projects/${projectId}/servers`, { server: dep.name });
result.attached.push(dep.name);
} catch (err: unknown) {
// Idempotency: 409 (already attached) is success.
if (err instanceof ApiError && err.status === 409) {
result.alreadyAttached.push(dep.name);
continue;
}
// 404 means either the project or the server vanished mid-sync.
if (err instanceof ApiError && err.status === 404) {
result.missing.push(dep.name);
continue;
}
result.errors.push({
server: dep.name,
error: err instanceof Error ? err.message : String(err),
});
}
}
return result;
}
/** Type-narrow the metadata.mcpServers field. Tolerant of garbage. */
export function parseMcpServerDeps(value: unknown): McpServerDep[] {
if (!Array.isArray(value)) return [];
const out: McpServerDep[] = [];
for (const v of value) {
if (v === null || typeof v !== 'object') continue;
const obj = v as Record<string, unknown>;
const name = obj['name'];
if (typeof name !== 'string' || name.length === 0) continue;
const dep: McpServerDep = { name };
if (typeof obj['fromTemplate'] === 'string') dep.fromTemplate = obj['fromTemplate'];
if (typeof obj['project'] === 'string') dep.project = obj['project'];
out.push(dep);
}
return out;
}

View File: src/cli/src/utils/postinstall.ts

@@ -0,0 +1,282 @@
/**
* postInstall executor for `mcpctl skills sync`.
*
* Trust model: mcpctl runs scripts that mcpd has served. mcpd is the
* corporate source of truth — content is reviewed at publish time. We
* do NOT sandbox or signature-check on the client. The controls that
* matter live on the publishing side (RBAC, audit, reviewer queue).
*
* What we DO provide is ops hygiene:
* - Hard timeout (default 60 s, per-skill override via
* `metadata.postInstallTimeoutSec`). Stops a runaway script from
* wedging Claude startup forever.
* - Hash-pinning: the script's sha256 is recorded in the skills state
* file so the next sync skips re-execution unless the hash changed.
* Saves churn; catches "the same skill at the same semver was
* re-published with a fixed script".
* - Curated env: MCPCTL_SKILL_NAME / _VERSION / _DIR / _PROJECT plus
* inherited PATH / HOME / USER / SHELL. Cron-style minimal env so
* scripts behave the same on every machine.
* - Per-skill install log under ~/.mcpctl/skills/<name>/install.log
* (rotated to keep the last 5 runs). Standard sysadmin reflex.
* - Audit event back to mcpd on every run. So mcpd's audit pipeline
* has both sides of the timeline (publish + per-machine execution).
*
* Failure semantics: a non-zero exit, a hang past the timeout, or a
* spawn error is treated as a failed sync of THIS skill. The state
* file's postInstallHash is NOT updated on failure, so the next sync
* will retry. Other skills in the same sync run continue regardless.
*/
import { createHash } from 'node:crypto';
import { spawn } from 'node:child_process';
import { mkdir, readFile, writeFile, stat } from 'node:fs/promises';
import { dirname, join, resolve } from 'node:path';
import { hostname } from 'node:os';
import { setTimeout as delay } from 'node:timers/promises';
import type { ApiClient } from '../api-client.js';
export interface PostInstallInput {
/** Full path of the materialised skill directory. The script path is resolved relative to this. */
installDir: string;
/** metadata.postInstall — relative path inside the skill bundle. */
scriptPath: string;
/** Name of the skill. Surfaces in audit + env + log path. */
skillName: string;
/** Skill version. Audit + env. */
semver: string;
/** Project name when the skill is project-scoped, else undefined. */
projectName?: string | undefined;
/** Per-skill override for the 60-s default. */
timeoutSec?: number | undefined;
/** Where to put the rolling install.log. Default: ~/.mcpctl/skills/<name>/install.log. */
logsDir: string;
}
export interface PostInstallResult {
exitCode: number | null;
durationMs: number;
scriptHash: string;
timedOut: boolean;
signal: NodeJS.Signals | null;
stdoutTail: string;
stderrTail: string;
}
const DEFAULT_TIMEOUT_SEC = 60;
const TAIL_BYTES = 4 * 1024;
const MAX_LOG_BYTES = 256 * 1024;
/**
* Compute the sha256 of a script — used as the "have I already run this
* version?" key in the skills state file. Caller passes the raw script
* bytes; this just wraps the hash routine to stay consistent with the
* `'sha256:'`-prefixed format used elsewhere (skills-state.ts).
*/
export function hashScript(content: string | Buffer): string {
const buf = typeof content === 'string' ? Buffer.from(content, 'utf-8') : content;
return 'sha256:' + createHash('sha256').update(buf).digest('hex');
}
/**
* Run the post-install script. Returns a result regardless of success
* or failure — caller inspects `exitCode`/`timedOut` to decide.
*
* Path validation: the resolved script path must remain inside
* `installDir`. A skill that tries to point postInstall at a
* `../../../../etc/passwd`-style path is rejected as a failed run, not
* silently ignored.
*/
export async function runPostInstall(input: PostInstallInput): Promise<PostInstallResult> {
const start = Date.now();
const timeoutMs = (input.timeoutSec ?? DEFAULT_TIMEOUT_SEC) * 1000;
const fullPath = resolve(input.installDir, input.scriptPath);
// Defence in depth: the install dir is server-published content, but
// anyone with skill-write RBAC could still cause mischief. The
// check makes our intent explicit: scripts may only live inside the
// skill bundle.
const installDirResolved = resolve(input.installDir);
if (!fullPath.startsWith(installDirResolved + '/') && fullPath !== installDirResolved) {
throw new Error(
`postInstall path '${input.scriptPath}' escapes skill dir`,
);
}
// Read script bytes for hashing (and to fail-fast if missing).
const scriptBytes = await readFile(fullPath);
const scriptHash = hashScript(scriptBytes);
// Curated env. Cron-style minimum: keep PATH so the script can find
// git/curl/python; keep HOME/USER/SHELL so scripts that touch dotfiles
// work; drop everything else.
const env: Record<string, string> = {
PATH: process.env['PATH'] ?? '/usr/local/bin:/usr/bin:/bin',
HOME: process.env['HOME'] ?? '',
USER: process.env['USER'] ?? '',
SHELL: process.env['SHELL'] ?? '/bin/sh',
LANG: process.env['LANG'] ?? 'C.UTF-8',
TERM: process.env['TERM'] ?? 'dumb',
MCPCTL_SKILL_NAME: input.skillName,
MCPCTL_SKILL_VERSION: input.semver,
MCPCTL_SKILL_DIR: installDirResolved,
};
if (input.projectName) env['MCPCTL_PROJECT'] = input.projectName;
// Make sure the script is executable. Some upstreams ship with mode
// 0644; spawning the path directly would then fail with EACCES (a
// shebang only helps when the file is invoked via its interpreter),
// so fix the mode up front.
await ensureExecutable(fullPath, scriptBytes);
await mkdir(input.logsDir, { recursive: true });
const logPath = join(input.logsDir, 'install.log');
// Rolling-append. The on-disk log is size-bounded (~5 × 256 KB via
// appendBoundedLog); each stream's in-memory capture is capped at
// 256 KB. The tail returned to the caller is the last few KB regardless.
const logHeader = `\n=== ${new Date().toISOString()} ${input.skillName}@${input.semver} ===\n`;
// Annotate as plain `Buffer`: Node's typings split Buffer into
// Buffer<ArrayBuffer> (from .alloc) and Buffer<ArrayBufferLike>
// (from .subarray), and the explicit annotation lets the
// reassignments from appendCapped type-check.
let stdoutBuf: Buffer = Buffer.alloc(0);
let stderrBuf: Buffer = Buffer.alloc(0);
let timedOut = false;
const child = spawn(fullPath, [], {
cwd: installDirResolved,
env,
stdio: ['ignore', 'pipe', 'pipe'],
});
child.stdout.on('data', (chunk: Buffer) => {
stdoutBuf = appendCapped(stdoutBuf, chunk);
});
child.stderr.on('data', (chunk: Buffer) => {
stderrBuf = appendCapped(stderrBuf, chunk);
});
// Hard timeout via SIGTERM, then SIGKILL after 2 s grace.
const timer = setTimeout(() => {
timedOut = true;
child.kill('SIGTERM');
void (async () => {
await delay(2000);
if (child.exitCode === null) child.kill('SIGKILL');
})();
}, timeoutMs);
const { exitCode, signal } = await new Promise<{ exitCode: number | null; signal: NodeJS.Signals | null }>((resolveProm) => {
child.on('close', (code, sig) => resolveProm({ exitCode: code, signal: sig }));
child.on('error', () => resolveProm({ exitCode: null, signal: null }));
});
clearTimeout(timer);
const durationMs = Date.now() - start;
const stdoutText = stdoutBuf.toString('utf-8');
const stderrText = stderrBuf.toString('utf-8');
// Append to the install log, truncating from the front if oversize.
const trailer = `\n--- exit ${exitCode === null ? '?' : String(exitCode)}${signal ? ` (${signal})` : ''} in ${String(durationMs)}ms${timedOut ? ' [TIMEOUT]' : ''} ---\n`;
const fullEntry = logHeader + 'STDOUT:\n' + stdoutText + '\nSTDERR:\n' + stderrText + trailer;
await appendBoundedLog(logPath, fullEntry);
return {
exitCode,
durationMs,
scriptHash,
timedOut,
signal,
stdoutTail: tailString(stdoutText, TAIL_BYTES),
stderrTail: tailString(stderrText, TAIL_BYTES),
};
}
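// Usage sketch (hypothetical paths/values):
//
//   const res = await runPostInstall({
//     installDir: '/home/dev/.mcpctl/skills/my-skill',
//     scriptPath: 'install.sh',
//     skillName: 'my-skill',
//     semver: '1.0.0',
//     logsDir: '/home/dev/.mcpctl/skills/my-skill',
//   });
//   const ok = res.exitCode === 0 && !res.timedOut; // failure ⇒ retried next sync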
/**
* Best-effort audit emission — POSTs a structured event back to mcpd
* so admins have fleet visibility. Failures are warned via the
* provided logger but never thrown; the audit log is supplementary,
* not load-bearing for sync correctness.
*
* The event includes machine fingerprint (hostname) so the operator
* can tell which dev box ran the script — useful when triaging a
* misbehaving update.
*/
export async function emitPostInstallAudit(
client: ApiClient,
input: PostInstallInput,
result: PostInstallResult,
warn: (msg: string) => void = () => {},
): Promise<void> {
try {
await client.post('/api/v1/audit-events', {
eventKind: 'skill_postinstall',
source: 'mcpctl-cli',
verified: false,
payload: {
skillName: input.skillName,
skillVersion: input.semver,
projectName: input.projectName ?? null,
scriptPath: input.scriptPath,
scriptHash: result.scriptHash,
exitCode: result.exitCode,
durationMs: result.durationMs,
timedOut: result.timedOut,
signal: result.signal,
machine: hostname(),
},
});
} catch (err) {
warn(`mcpctl: failed to emit postInstall audit event: ${err instanceof Error ? err.message : String(err)}`);
}
}
// ── internals ──
function appendCapped(buf: Buffer, chunk: Buffer): Buffer {
// Keep up to MAX_LOG_BYTES per stream; drop oldest bytes if over.
if (buf.length + chunk.length <= MAX_LOG_BYTES) {
return Buffer.concat([buf, chunk]);
}
const merged = Buffer.concat([buf, chunk]);
// Buffer.from(...) keeps Node's typing happy — `subarray` returns a
// Buffer<ArrayBufferLike>, which TS won't assign back to the
// caller's Buffer type without the copy.
return Buffer.from(merged.subarray(merged.length - MAX_LOG_BYTES));
}
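// NB: tailString trims by UTF-16 code units (string length), not raw
// bytes — an approximation that's fine for human-readable log tails.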
function tailString(s: string, bytes: number): string {
if (s.length <= bytes) return s;
return '…' + s.slice(s.length - bytes + 1);
}
async function ensureExecutable(path: string, bytes: Buffer): Promise<void> {
try {
const st = await stat(path);
// Owner execute bit. Skip if it's set already.
if ((st.mode & 0o100) !== 0) return;
} catch {
return; // stat failed — let the spawn surface the real error
}
// A shebang alone doesn't make the file spawnable: invoking it as
// `<interpreter> <path>` works without +x, but we always spawn the
// path directly, so the execute bit is required. Set 0755.
void bytes; // (kept around in case we want to inspect shebang later)
const { chmod } = await import('node:fs/promises');
await chmod(path, 0o755);
}
async function appendBoundedLog(path: string, entry: string): Promise<void> {
const max = 5 * MAX_LOG_BYTES;
let existing = '';
try {
existing = await readFile(path, 'utf-8');
} catch (err: unknown) {
if (typeof err !== 'object' || err === null || (err as { code?: string }).code !== 'ENOENT') throw err;
}
const combined = existing + entry;
// Keep last `max` bytes.
const trimmed = combined.length > max ? '…[truncated]…\n' + combined.slice(combined.length - max) : combined;
await mkdir(dirname(path), { recursive: true });
await writeFile(path, trimmed, 'utf-8');
}


@@ -0,0 +1,174 @@
import { describe, it, expect, beforeEach, afterEach } from 'vitest';
import { mkdtemp, rm, readFile, writeFile, mkdir } from 'node:fs/promises';
import { tmpdir } from 'node:os';
import { join } from 'node:path';
import { applyManagedHooks, removeManagedHooks, SOURCE_KEY } from '../../src/utils/hooks-materialiser.js';
import { MARKER_KEY } from '../../src/utils/sessionhook.js';
describe('hooks-materialiser', () => {
let tmp: string;
let settings: string;
beforeEach(async () => {
tmp = await mkdtemp(join(tmpdir(), 'mcpctl-hooks-'));
settings = join(tmp, 'settings.json');
});
afterEach(async () => {
await rm(tmp, { recursive: true, force: true });
});
it('writes a tagged hook from scratch when settings.json is missing', async () => {
const result = await applyManagedHooks('skill-a', {
PreToolUse: [{ type: 'command', command: 'echo before' }],
}, settings);
expect(result.updated).toBe(true);
const file = JSON.parse(await readFile(settings, 'utf-8'));
expect(file.hooks.PreToolUse).toHaveLength(1);
const entry = file.hooks.PreToolUse[0].hooks[0];
expect(entry.command).toBe('echo before');
expect(entry[MARKER_KEY]).toBe(true);
expect(entry[SOURCE_KEY]).toBe('skill-a');
});
it('coexists with hooks owned by other skills', async () => {
await applyManagedHooks('skill-a', {
PreToolUse: [{ type: 'command', command: 'echo a' }],
}, settings);
await applyManagedHooks('skill-b', {
PreToolUse: [{ type: 'command', command: 'echo b' }],
}, settings);
const file = JSON.parse(await readFile(settings, 'utf-8'));
const all = file.hooks.PreToolUse.flatMap((g: { hooks: Array<{ command: string; [k: string]: unknown }> }) => g.hooks);
expect(all.find((e: { command: string }) => e.command === 'echo a')).toBeDefined();
expect(all.find((e: { command: string }) => e.command === 'echo b')).toBeDefined();
expect(all).toHaveLength(2);
});
it('preserves user-added hooks (no marker)', async () => {
await mkdir(tmp, { recursive: true });
await writeFile(settings, JSON.stringify({
hooks: {
PreToolUse: [{ hooks: [{ type: 'command', command: 'echo user' }] }],
},
}));
await applyManagedHooks('skill-a', {
PreToolUse: [{ type: 'command', command: 'echo a' }],
}, settings);
const file = JSON.parse(await readFile(settings, 'utf-8'));
const all = file.hooks.PreToolUse.flatMap((g: { hooks: Array<{ command: string; [k: string]: unknown }> }) => g.hooks);
expect(all.find((e: { command: string }) => e.command === 'echo user')).toBeDefined();
expect(all.find((e: { command: string; [k: string]: unknown }) => e.command === 'echo a' && e[MARKER_KEY] === true)).toBeDefined();
});
it('updating a skill replaces its old entries (does not duplicate)', async () => {
await applyManagedHooks('skill-a', {
PreToolUse: [{ type: 'command', command: 'echo old' }],
}, settings);
const second = await applyManagedHooks('skill-a', {
PreToolUse: [{ type: 'command', command: 'echo new' }],
}, settings);
expect(second.updated).toBe(true);
const file = JSON.parse(await readFile(settings, 'utf-8'));
const all = file.hooks.PreToolUse.flatMap((g: { hooks: Array<{ command: string; [k: string]: unknown }> }) => g.hooks);
const ours = all.filter((e: { [k: string]: unknown }) => e[SOURCE_KEY] === 'skill-a');
expect(ours).toHaveLength(1);
expect((ours[0] as { command: string }).command).toBe('echo new');
});
it('shrinking a skill drops events it no longer declares', async () => {
await applyManagedHooks('skill-a', {
PreToolUse: [{ type: 'command', command: 'echo pre' }],
PostToolUse: [{ type: 'command', command: 'echo post' }],
}, settings);
await applyManagedHooks('skill-a', {
PreToolUse: [{ type: 'command', command: 'echo pre' }],
// PostToolUse omitted → should be dropped
}, settings);
const file = JSON.parse(await readFile(settings, 'utf-8'));
expect(file.hooks.PreToolUse).toBeDefined();
expect(file.hooks.PostToolUse).toBeUndefined();
});
it('removeManagedHooks drops only the named source', async () => {
await applyManagedHooks('skill-a', {
PreToolUse: [{ type: 'command', command: 'echo a' }],
}, settings);
await applyManagedHooks('skill-b', {
PreToolUse: [{ type: 'command', command: 'echo b' }],
}, settings);
const removed = await removeManagedHooks('skill-a', settings);
expect(removed.removed).toBe(true);
const file = JSON.parse(await readFile(settings, 'utf-8'));
const all = file.hooks.PreToolUse.flatMap((g: { hooks: Array<{ command: string; [k: string]: unknown }> }) => g.hooks);
expect(all).toHaveLength(1);
expect((all[0] as { command: string }).command).toBe('echo b');
});
it('removeManagedHooks is a no-op when the source has no entries', async () => {
const result = await removeManagedHooks('never-installed', settings);
expect(result.removed).toBe(false);
});
it('handles multiple hook events independently', async () => {
await applyManagedHooks('skill-a', {
PreToolUse: [{ type: 'command', command: 'echo pre' }],
PostToolUse: [{ type: 'command', command: 'echo post' }],
SessionStart: [{ type: 'command', command: 'echo start' }],
}, settings);
const file = JSON.parse(await readFile(settings, 'utf-8'));
expect(file.hooks.PreToolUse).toBeDefined();
expect(file.hooks.PostToolUse).toBeDefined();
expect(file.hooks.SessionStart).toBeDefined();
});
it('re-applying the same hooks is idempotent in effect — exactly one managed entry survives', async () => {
const first = await applyManagedHooks('skill-a', {
PreToolUse: [{ type: 'command', command: 'echo a' }],
}, settings);
expect(first.updated).toBe(true);
// Re-applying ALWAYS rewrites our entry (we don't deep-equal for
// "no change"), so updated=true on both calls; what matters is that
// the file stays valid and our entry isn't duplicated.
const second = await applyManagedHooks('skill-a', {
PreToolUse: [{ type: 'command', command: 'echo a' }],
}, settings);
// updated=true is acceptable here; we replaced+re-added our entry.
expect(second.updated).toBe(true);
const file = JSON.parse(await readFile(settings, 'utf-8'));
const all = file.hooks.PreToolUse.flatMap((g: { hooks: Array<{ command: string; [k: string]: unknown }> }) => g.hooks);
const ours = all.filter((e: { [k: string]: unknown }) => e[SOURCE_KEY] === 'skill-a');
expect(ours).toHaveLength(1);
});
it('survives empty settings.json', async () => {
await writeFile(settings, '');
await applyManagedHooks('skill-a', {
PreToolUse: [{ type: 'command', command: 'echo a' }],
}, settings);
const file = JSON.parse(await readFile(settings, 'utf-8'));
expect(file.hooks.PreToolUse).toHaveLength(1);
});
it('survives JSONC line comments in settings.json', async () => {
await writeFile(settings, '// preamble\n{ "hooks": {} }\n');
await applyManagedHooks('skill-a', {
PreToolUse: [{ type: 'command', command: 'echo a' }],
}, settings);
const file = JSON.parse(await readFile(settings, 'utf-8'));
expect(file.hooks.PreToolUse).toHaveLength(1);
});
});


@@ -0,0 +1,227 @@
import { describe, it, expect, vi } from 'vitest';
import { attachSkillMcpServers, parseMcpServerDeps } from '../../src/utils/mcpservers-materialiser.js';
import type { ApiClient } from '../../src/api-client.js';
import { ApiError } from '../../src/api-client.js';
interface MockClient {
get: ReturnType<typeof vi.fn>;
post: ReturnType<typeof vi.fn>;
put: ReturnType<typeof vi.fn>;
delete: ReturnType<typeof vi.fn>;
}
function makeClient(): MockClient {
return {
get: vi.fn(),
post: vi.fn(async () => ({})),
put: vi.fn(async () => ({})),
delete: vi.fn(async () => undefined),
};
}
function apiError(status: number, body = 'err'): ApiError {
return new ApiError(status, body);
}
describe('mcpservers-materialiser', () => {
describe('parseMcpServerDeps', () => {
it('returns [] for non-arrays', () => {
expect(parseMcpServerDeps(null)).toEqual([]);
expect(parseMcpServerDeps('foo')).toEqual([]);
expect(parseMcpServerDeps({})).toEqual([]);
});
it('keeps valid entries and drops garbage', () => {
const out = parseMcpServerDeps([
{ name: 'good', fromTemplate: 't', project: 'p' },
{ name: '', fromTemplate: 't' }, // empty name → drop
{ fromTemplate: 'no-name' }, // no name → drop
{ name: 'bare' }, // valid, minimal
'string', // not an object → drop
]);
expect(out).toEqual([
{ name: 'good', fromTemplate: 't', project: 'p' },
{ name: 'bare' },
]);
});
});
describe('attachSkillMcpServers', () => {
it('attaches a new server when not already present', async () => {
const client = makeClient();
client.get.mockImplementation(async (path: string) => {
if (path === '/api/v1/projects') return [{ id: 'proj-1', name: 'demo' }];
if (path === '/api/v1/projects/proj-1') return { servers: [] };
if (path === '/api/v1/servers') return [{ name: 'my-grafana' }];
throw new Error(`unexpected GET ${path}`);
});
const result = await attachSkillMcpServers(
client as unknown as ApiClient,
'demo',
[{ name: 'my-grafana', fromTemplate: 'grafana' }],
);
expect(result.attached).toEqual(['my-grafana']);
expect(result.alreadyAttached).toEqual([]);
expect(result.missing).toEqual([]);
expect(result.errors).toEqual([]);
expect(client.post).toHaveBeenCalledWith('/api/v1/projects/proj-1/servers', { server: 'my-grafana' });
});
it('reports alreadyAttached without re-posting', async () => {
const client = makeClient();
client.get.mockImplementation(async (path: string) => {
if (path === '/api/v1/projects') return [{ id: 'proj-1', name: 'demo' }];
if (path === '/api/v1/projects/proj-1') return { servers: [{ server: { name: 'my-grafana' } }] };
if (path === '/api/v1/servers') return [{ name: 'my-grafana' }];
throw new Error(`unexpected GET ${path}`);
});
const result = await attachSkillMcpServers(
client as unknown as ApiClient,
'demo',
[{ name: 'my-grafana' }],
);
expect(result.alreadyAttached).toEqual(['my-grafana']);
expect(result.attached).toEqual([]);
expect(client.post).not.toHaveBeenCalled();
});
it('warns + skips when server does not exist on mcpd', async () => {
const client = makeClient();
client.get.mockImplementation(async (path: string) => {
if (path === '/api/v1/projects') return [{ id: 'proj-1', name: 'demo' }];
if (path === '/api/v1/projects/proj-1') return { servers: [] };
if (path === '/api/v1/servers') return [{ name: 'something-else' }];
throw new Error(`unexpected GET ${path}`);
});
const warnings: string[] = [];
const result = await attachSkillMcpServers(
client as unknown as ApiClient,
'demo',
[{ name: 'my-grafana', fromTemplate: 'grafana' }],
(m) => warnings.push(m),
);
expect(result.missing).toEqual(['my-grafana']);
expect(result.attached).toEqual([]);
expect(client.post).not.toHaveBeenCalled();
expect(warnings.some((w) => w.includes('my-grafana') && w.includes('grafana'))).toBe(true);
});
it('errors-out when the project does not exist', async () => {
const client = makeClient();
client.get.mockImplementation(async (path: string) => {
if (path === '/api/v1/projects') return []; // no projects
throw new Error(`unexpected GET ${path}`);
});
const result = await attachSkillMcpServers(
client as unknown as ApiClient,
'no-such-project',
[{ name: 'my-grafana' }],
);
expect(result.errors).toHaveLength(1);
expect(result.errors[0]?.error).toContain('Project');
expect(client.post).not.toHaveBeenCalled();
});
it('treats 409 from POST as alreadyAttached (idempotent server-side)', async () => {
const client = makeClient();
client.get.mockImplementation(async (path: string) => {
if (path === '/api/v1/projects') return [{ id: 'proj-1', name: 'demo' }];
// attachments listing fails — fall back to attempting + handling 409
if (path === '/api/v1/projects/proj-1') throw apiError(500, 'flake');
if (path === '/api/v1/servers') return [{ name: 'my-grafana' }];
throw new Error(`unexpected GET ${path}`);
});
client.post.mockRejectedValueOnce(apiError(409, 'already attached'));
const result = await attachSkillMcpServers(
client as unknown as ApiClient,
'demo',
[{ name: 'my-grafana' }],
);
expect(result.alreadyAttached).toEqual(['my-grafana']);
expect(result.errors).toEqual([]);
});
it('treats 404 from POST as missing (server vanished mid-sync)', async () => {
const client = makeClient();
client.get.mockImplementation(async (path: string) => {
if (path === '/api/v1/projects') return [{ id: 'proj-1', name: 'demo' }];
if (path === '/api/v1/projects/proj-1') return { servers: [] };
if (path === '/api/v1/servers') return [{ name: 'my-grafana' }]; // existed when we listed
throw new Error(`unexpected GET ${path}`);
});
// …but vanished by the time we POSTed.
client.post.mockRejectedValueOnce(apiError(404, 'gone'));
const result = await attachSkillMcpServers(
client as unknown as ApiClient,
'demo',
[{ name: 'my-grafana' }],
);
expect(result.missing).toEqual(['my-grafana']);
expect(result.errors).toEqual([]);
});
it('skips deps that target a different project', async () => {
const client = makeClient();
client.get.mockImplementation(async (path: string) => {
if (path === '/api/v1/projects') return [{ id: 'proj-1', name: 'demo' }];
if (path === '/api/v1/projects/proj-1') return { servers: [] };
if (path === '/api/v1/servers') return [{ name: 'my-grafana' }];
throw new Error(`unexpected GET ${path}`);
});
const result = await attachSkillMcpServers(
client as unknown as ApiClient,
'demo',
[{ name: 'my-grafana', project: 'other-project' }],
);
expect(result.attached).toEqual([]);
expect(result.missing).toEqual([]);
expect(client.post).not.toHaveBeenCalled();
});
it('continues past per-server errors', async () => {
const client = makeClient();
client.get.mockImplementation(async (path: string) => {
if (path === '/api/v1/projects') return [{ id: 'proj-1', name: 'demo' }];
if (path === '/api/v1/projects/proj-1') return { servers: [] };
if (path === '/api/v1/servers') return [{ name: 'a' }, { name: 'b' }];
throw new Error(`unexpected GET ${path}`);
});
client.post.mockImplementation(async (path: string, body) => {
if ((body as { server: string }).server === 'a') throw apiError(500, 'boom');
return {};
});
const result = await attachSkillMcpServers(
client as unknown as ApiClient,
'demo',
[{ name: 'a' }, { name: 'b' }],
);
expect(result.errors).toHaveLength(1);
expect(result.errors[0]?.server).toBe('a');
expect(result.attached).toEqual(['b']);
});
it('returns empty on empty deps without making any calls', async () => {
const client = makeClient();
const result = await attachSkillMcpServers(client as unknown as ApiClient, 'demo', []);
expect(result).toEqual({ attached: [], alreadyAttached: [], missing: [], errors: [] });
expect(client.get).not.toHaveBeenCalled();
expect(client.post).not.toHaveBeenCalled();
});
});
});


@@ -0,0 +1,223 @@
import { describe, it, expect, beforeEach, afterEach } from 'vitest';
import { mkdtemp, rm, writeFile, chmod, readFile, mkdir } from 'node:fs/promises';
import { tmpdir } from 'node:os';
import { join } from 'node:path';
import { runPostInstall, hashScript } from '../../src/utils/postinstall.js';
describe('postinstall executor', () => {
let tmp: string;
beforeEach(async () => {
tmp = await mkdtemp(join(tmpdir(), 'mcpctl-postinstall-'));
});
afterEach(async () => {
await rm(tmp, { recursive: true, force: true });
});
describe('hashScript', () => {
it('returns deterministic sha256-prefixed hash', () => {
expect(hashScript('hello')).toMatch(/^sha256:[0-9a-f]{64}$/);
expect(hashScript('hello')).toBe(hashScript('hello'));
expect(hashScript('hello')).not.toBe(hashScript('hellp'));
});
});
describe('runPostInstall — success path', () => {
it('runs a passing script and returns exit 0 + script hash', async () => {
const installDir = join(tmp, 'skill');
await mkdir(installDir, { recursive: true });
const scriptPath = join(installDir, 'install.sh');
await writeFile(scriptPath, '#!/bin/sh\necho hello-stdout\necho hello-stderr 1>&2\nexit 0\n');
await chmod(scriptPath, 0o755);
const result = await runPostInstall({
installDir,
scriptPath: 'install.sh',
skillName: 'test-skill',
semver: '0.1.0',
logsDir: join(tmp, 'logs'),
});
expect(result.exitCode).toBe(0);
expect(result.timedOut).toBe(false);
expect(result.stdoutTail).toContain('hello-stdout');
expect(result.stderrTail).toContain('hello-stderr');
expect(result.scriptHash).toMatch(/^sha256:/);
});
it('passes curated env (MCPCTL_SKILL_NAME, _VERSION, _DIR, _PROJECT)', async () => {
const installDir = join(tmp, 'skill');
await mkdir(installDir, { recursive: true });
const scriptPath = join(installDir, 'install.sh');
// Write env vars to a file we can read back.
const outFile = join(tmp, 'env-dump.txt');
await writeFile(scriptPath, `#!/bin/sh
echo "name=$MCPCTL_SKILL_NAME" > ${JSON.stringify(outFile)}
echo "version=$MCPCTL_SKILL_VERSION" >> ${JSON.stringify(outFile)}
echo "dir=$MCPCTL_SKILL_DIR" >> ${JSON.stringify(outFile)}
echo "project=$MCPCTL_PROJECT" >> ${JSON.stringify(outFile)}
`);
await chmod(scriptPath, 0o755);
const result = await runPostInstall({
installDir,
scriptPath: 'install.sh',
skillName: 'env-test',
semver: '1.2.3',
projectName: 'demo',
logsDir: join(tmp, 'logs'),
});
expect(result.exitCode).toBe(0);
const dumped = await readFile(outFile, 'utf-8');
expect(dumped).toContain('name=env-test');
expect(dumped).toContain('version=1.2.3');
expect(dumped).toContain('dir=' + installDir);
expect(dumped).toContain('project=demo');
});
it('chmods 0644 scripts to executable before spawn', async () => {
const installDir = join(tmp, 'skill');
await mkdir(installDir, { recursive: true });
const scriptPath = join(installDir, 'install.sh');
await writeFile(scriptPath, '#!/bin/sh\nexit 0\n');
await chmod(scriptPath, 0o644); // not executable
const result = await runPostInstall({
installDir,
scriptPath: 'install.sh',
skillName: 't',
semver: '0.1.0',
logsDir: join(tmp, 'logs'),
});
expect(result.exitCode).toBe(0);
});
});
describe('runPostInstall — failure paths', () => {
it('captures non-zero exit code and returns it', async () => {
const installDir = join(tmp, 'skill');
await mkdir(installDir, { recursive: true });
const scriptPath = join(installDir, 'fail.sh');
await writeFile(scriptPath, '#!/bin/sh\necho oops 1>&2\nexit 7\n');
await chmod(scriptPath, 0o755);
const result = await runPostInstall({
installDir,
scriptPath: 'fail.sh',
skillName: 't',
semver: '0.1.0',
logsDir: join(tmp, 'logs'),
});
expect(result.exitCode).toBe(7);
expect(result.timedOut).toBe(false);
expect(result.stderrTail).toContain('oops');
});
it('honors timeoutSec — kills via SIGTERM and reports timedOut=true', async () => {
const installDir = join(tmp, 'skill');
await mkdir(installDir, { recursive: true });
const scriptPath = join(installDir, 'hang.sh');
// `exec` so SIGTERM hits sleep directly — without it the shell dies
// on SIGTERM but the orphaned sleep keeps the inherited stdio
// streams open, delaying the 'close' event and making the test
// slower and flakier.
await writeFile(scriptPath, '#!/bin/sh\nexec sleep 30\n');
await chmod(scriptPath, 0o755);
const start = Date.now();
const result = await runPostInstall({
installDir,
scriptPath: 'hang.sh',
skillName: 't',
semver: '0.1.0',
timeoutSec: 1,
logsDir: join(tmp, 'logs'),
});
const elapsed = Date.now() - start;
expect(result.timedOut).toBe(true);
// 1s timeout + up to 2s grace before SIGKILL.
expect(elapsed).toBeLessThan(5000);
expect(elapsed).toBeGreaterThanOrEqual(1000);
}, 15_000);
it('rejects path-escape attempts', async () => {
const installDir = join(tmp, 'skill');
await mkdir(installDir, { recursive: true });
await expect(runPostInstall({
installDir,
scriptPath: '../escape.sh',
skillName: 't',
semver: '0.1.0',
logsDir: join(tmp, 'logs'),
})).rejects.toThrow(/escapes skill dir/);
});
it('throws when the script does not exist', async () => {
const installDir = join(tmp, 'skill');
await mkdir(installDir, { recursive: true });
await expect(runPostInstall({
installDir,
scriptPath: 'missing.sh',
skillName: 't',
semver: '0.1.0',
logsDir: join(tmp, 'logs'),
})).rejects.toThrow();
});
});
describe('runPostInstall — install log', () => {
it('writes stdout + stderr + exit summary to logsDir/install.log', async () => {
const installDir = join(tmp, 'skill');
await mkdir(installDir, { recursive: true });
const scriptPath = join(installDir, 'install.sh');
await writeFile(scriptPath, '#!/bin/sh\necho hello\nexit 0\n');
await chmod(scriptPath, 0o755);
const logsDir = join(tmp, 'logs');
await runPostInstall({
installDir,
scriptPath: 'install.sh',
skillName: 'log-test',
semver: '0.1.0',
logsDir,
});
const log = await readFile(join(logsDir, 'install.log'), 'utf-8');
expect(log).toContain('log-test@0.1.0');
expect(log).toContain('hello');
expect(log).toContain('exit 0');
});
it('appends across runs without losing prior history', async () => {
const installDir = join(tmp, 'skill');
await mkdir(installDir, { recursive: true });
const scriptPath = join(installDir, 'install.sh');
await writeFile(scriptPath, '#!/bin/sh\necho run\nexit 0\n');
await chmod(scriptPath, 0o755);
const logsDir = join(tmp, 'logs');
const input = {
installDir,
scriptPath: 'install.sh',
skillName: 't',
semver: '0.1.0',
logsDir,
};
await runPostInstall(input);
await runPostInstall(input);
const log = await readFile(join(logsDir, 'install.log'), 'utf-8');
// Two run headers separated by `===`.
const headers = (log.match(/=== /g) ?? []).length;
expect(headers).toBeGreaterThanOrEqual(2);
});
});
});


@@ -129,10 +129,18 @@ export class HealthProbeRunner {
result = await this.probeLiveness(server, timeoutMs);
} else {
const readinessCheck = healthCheck as HealthCheckSpec & { tool: string };
-if (server.transport === 'SSE' || server.transport === 'STREAMABLE_HTTP') {
-result = await this.probeHttp(instance, server, readinessCheck, timeoutMs);
+if (server.transport === 'STDIO') {
+// Route STDIO readiness through the proxy so probes hit the live
+// running container rather than spawning a fresh process inside
+// it. The legacy `probeStdio` (docker-exec a synthetic Node script
+// that re-spawns the package binary) only worked for
+// packageName-based servers — image-based STDIO servers (gitea,
+// docmost) returned a fake-unhealthy "No packageName or command"
+// before they even tried the tool. Going through mcpProxyService
+// also means readiness failures match production failures exactly.
+result = await this.probeReadinessViaProxy(server, readinessCheck, timeoutMs);
} else {
-result = await this.probeStdio(instance, server, readinessCheck, timeoutMs);
+result = await this.probeHttp(instance, server, readinessCheck, timeoutMs);
}
}
} catch (err) {
@@ -188,6 +196,71 @@ export class HealthProbeRunner {
return result;
}
/**
* Readiness probe via McpProxyService — sends `tools/call` against the
* configured probe tool through the live running instance. Used by
* STDIO servers; HTTP/SSE servers go through the bespoke `probeHttp`
* paths that connect directly to the container's IP+port (those work
* fine and are kept as-is to minimise the diff in this PR).
*
* If the tool returns a JSON-RPC `error` (e.g. gitea-mcp-server's
* "token is required" when GITEA_ACCESS_TOKEN didn't resolve), we mark
* the instance unhealthy with the upstream error message. That's how
* we catch broken-by-empty-secret cases that liveness (`tools/list`)
* would otherwise pass.
*/
private async probeReadinessViaProxy(
server: McpServer,
healthCheck: HealthCheckSpec & { tool: string },
timeoutMs: number,
): Promise<ProbeResult> {
const start = Date.now();
if (!this.mcpProxyService) {
return { healthy: false, latencyMs: 0, message: 'mcpProxyService not wired — cannot run readiness probe' };
}
const deadline = new Promise<ProbeResult>((resolve) => {
setTimeout(() => resolve({
healthy: false,
latencyMs: timeoutMs,
message: `Readiness probe timed out after ${timeoutMs}ms`,
}), timeoutMs);
});
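// NB: the deadline timer is never cleared — if the probe wins the
// race, the timeout later fires into an already-settled promise,
// which is a harmless no-op.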
const probe = this.mcpProxyService
.execute({
serverId: server.id,
method: 'tools/call',
params: { name: healthCheck.tool, arguments: healthCheck.arguments ?? {} },
})
.then((response): ProbeResult => {
const latencyMs = Date.now() - start;
if (response.error) {
return {
healthy: false,
latencyMs,
message: response.error.message ?? `tools/call ${healthCheck.tool} returned error`,
};
}
// Some servers report tool-level failures inside the result body
// (`{ isError: true, content: [...] }`) rather than as JSON-RPC
// errors. Treat that as unhealthy too.
const result = response.result as { isError?: boolean; content?: Array<{ text?: string }> } | undefined;
if (result?.isError) {
const text = result.content?.[0]?.text ?? `${healthCheck.tool} returned isError`;
return { healthy: false, latencyMs, message: text };
}
return { healthy: true, latencyMs, message: 'ok' };
})
.catch((err: unknown): ProbeResult => ({
healthy: false,
latencyMs: Date.now() - start,
message: err instanceof Error ? err.message : String(err),
}));
return Promise.race([probe, deadline]);
}
/**
* Liveness probe — sends tools/list via McpProxyService so the probe traverses
* the exact code path production clients use. Works uniformly across every
@@ -463,122 +536,14 @@ export class HealthProbeRunner {
}
}
-/**
-* Probe a STDIO MCP server by running `docker exec` with a disposable Node.js
-* script that pipes JSON-RPC messages into the package binary.
-*/
-private async probeStdio(
-instance: McpInstance,
-server: McpServer,
-healthCheck: HealthCheckSpec & { tool: string },
-timeoutMs: number,
-): Promise<ProbeResult> {
-if (!instance.containerId) {
-return { healthy: false, latencyMs: 0, message: 'No container ID' };
-}
-const start = Date.now();
-const packageName = server.packageName as string | null;
-const command = server.command as string[] | null;
-// Determine how to spawn the MCP server inside the container
-let spawnCmd: string[];
-if (packageName) {
-spawnCmd = ['npx', '--prefer-offline', '-y', packageName];
-} else if (command && command.length > 0) {
-spawnCmd = command;
-} else {
-return { healthy: false, latencyMs: 0, message: 'No packageName or command for STDIO server' };
-}
-// Build JSON-RPC messages for the health probe
-const initMsg = JSON.stringify({
-jsonrpc: '2.0', id: 1, method: 'initialize',
-params: {
-protocolVersion: '2024-11-05',
-capabilities: {},
-clientInfo: { name: 'mcpctl-health', version: '0.0.1' },
-},
-});
-const initializedMsg = JSON.stringify({
-jsonrpc: '2.0', method: 'notifications/initialized',
-});
-const toolCallMsg = JSON.stringify({
-jsonrpc: '2.0', id: 2, method: 'tools/call',
-params: { name: healthCheck.tool, arguments: healthCheck.arguments ?? {} },
-});
-// Use a Node.js inline script that:
-// 1. Spawns the MCP server binary
-// 2. Sends initialize + initialized + tool call via stdin
-// 3. Reads responses from stdout
-// 4. Exits with 0 if tool call succeeds, 1 if it fails
-const spawnArgs = JSON.stringify(spawnCmd);
-const probeScript = `
-const { spawn } = require('child_process');
-const args = ${spawnArgs};
-const proc = spawn(args[0], args.slice(1), { stdio: ['pipe', 'pipe', 'pipe'] });
-let output = '';
-let responded = false;
-proc.stdout.on('data', d => {
-output += d;
-const lines = output.split('\\n');
-for (const line of lines) {
-if (!line.trim()) continue;
-try {
-const msg = JSON.parse(line);
-if (msg.id === 2) {
-responded = true;
-if (msg.error) {
-process.stdout.write('ERROR:' + (msg.error.message || 'unknown'));
-proc.kill();
-process.exit(1);
-} else {
-process.stdout.write('OK');
-proc.kill();
-process.exit(0);
-}
-}
-} catch {}
-}
-output = lines[lines.length - 1] || '';
-});
-proc.stderr.on('data', () => {});
-proc.on('error', e => { process.stdout.write('ERROR:' + e.message); process.exit(1); });
-proc.on('exit', (code) => { if (!responded) { process.stdout.write('ERROR:process exited ' + code); process.exit(1); } });
-setTimeout(() => { if (!responded) { process.stdout.write('ERROR:timeout'); proc.kill(); process.exit(1); } }, ${timeoutMs - 2000});
-proc.stdin.write(${JSON.stringify(initMsg)} + '\\n');
-setTimeout(() => {
-proc.stdin.write(${JSON.stringify(initializedMsg)} + '\\n');
-setTimeout(() => {
-proc.stdin.write(${JSON.stringify(toolCallMsg)} + '\\n');
-}, 500);
-}, 500);
-`.trim();
-try {
-const result = await this.orchestrator.execInContainer(
-instance.containerId,
-['node', '-e', probeScript],
-{ timeoutMs },
-);
-const latencyMs = Date.now() - start;
-if (result.exitCode === 0 && result.stdout.includes('OK')) {
-return { healthy: true, latencyMs, message: 'ok' };
-}
-// Extract error message
-const errorMatch = result.stdout.match(/ERROR:(.*)/);
-const errorMsg = errorMatch?.[1] ?? (result.stderr.trim() || `exit code ${result.exitCode}`);
-return { healthy: false, latencyMs, message: errorMsg };
-} catch (err) {
-return {
-healthy: false,
-latencyMs: Date.now() - start,
-message: err instanceof Error ? err.message : String(err),
-};
-}
-}
+// Note: a previous `probeStdio` implementation existed here that ran a
+// disposable Node script inside the container via `docker exec`,
+// re-spawning the package binary and piping JSON-RPC into it. It only
+// worked for packageName-based servers (the spawn step required an
+// npx-compatible package); image-based STDIO servers like
+// gitea-mcp-server fell through with "No packageName or command" and
+// were always reported unhealthy for the wrong reason. STDIO readiness
+// now goes through `probeReadinessViaProxy` which calls the live
+// running container — same code path as production traffic — and
+// surfaces the upstream error verbatim.
}


@@ -1,4 +1,4 @@
-import type { McpInstance } from '@prisma/client';
+import type { McpInstance, McpServer } from '@prisma/client';
import type { IMcpInstanceRepository, IMcpServerRepository } from '../repositories/interfaces.js';
import type { McpOrchestrator, ContainerSpec, ContainerInfo } from './orchestrator.js';
import { NotFoundError } from './mcp-server.service.js';
@@ -13,6 +13,36 @@ const RUNNER_IMAGES: Record<string, string> = {
/** Network for MCP server containers (matches docker-compose mcp-servers network). */
const MCP_SERVERS_NETWORK = process.env['MCPD_MCP_NETWORK'] ?? 'mcp-servers';
/**
* Backoff schedule for instance startup failures (env resolution, container
* creation, etc). Mirrors Kubernetes-style escalation: fast retries for
* transient hiccups, then a longer pause once it's clear something is
* persistently wrong.
*
* The retry state lives on `McpInstance.metadata` (no schema migration
* needed) and is preserved across reconcile cycles by the in-place
* `retryInstance` path so attemptCount actually accumulates.
*/
const FAST_RETRY_MS = 30_000; // first 5 attempts: 30s apart
const SLOW_RETRY_MS = 5 * 60_000; // afterwards: 5 minutes
const MAX_FAST_RETRIES = 5;
interface RetryMetadata {
error?: string;
attemptCount?: number;
lastAttemptAt?: string;
nextRetryAt?: string;
[k: string]: unknown;
}
function readRetryMeta(instance: McpInstance): RetryMetadata {
return (instance.metadata ?? {}) as RetryMetadata;
}
function nextDelayMs(attemptCount: number): number {
return attemptCount <= MAX_FAST_RETRIES ? FAST_RETRY_MS : SLOW_RETRY_MS;
}
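// Worked schedule (assuming every attempt fails): failures 1–5 each
// set nextRetryAt ≈ now + 30s; failure 6 onwards sets ≈ now + 5min,
// and it stays at 5-minute spacing until an attempt succeeds.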
export class InvalidStateError extends Error {
readonly statusCode = 409;
constructor(message: string) {
@@ -118,8 +148,12 @@ export class InstanceService {
* Reconcile ALL servers — the operator loop.
*
* For every server with replicas > 0, ensures the correct number of
-* healthy instances exist. Cleans up ERROR instances and starts
-* replacements. This is the core self-healing mechanism.
+* healthy instances exist. ERROR instances are not blindly recreated:
+* within their `nextRetryAt` window they're left alone (and counted
+* against the replica budget so we don't churn replacements while one
+* is in backoff); past their window they're retried in-place via
+* `retryInstance` so attemptCount accumulates and backoff escalates
+* correctly.
*/
async reconcileAll(): Promise<{ reconciled: number; errors: string[] }> {
await this.syncStatus();
@@ -128,6 +162,8 @@ export class InstanceService {
let reconciled = 0;
const errors: string[] = [];
+const now = Date.now();
for (const server of servers) {
if (server.replicas <= 0) continue;
@@ -136,17 +172,38 @@ export class InstanceService {
const active = instances.filter((i) => i.status === 'RUNNING' || i.status === 'STARTING');
const errored = instances.filter((i) => i.status === 'ERROR');
-// Clean up ERROR instances so they don't accumulate
+// Partition ERROR instances by whether their backoff window has elapsed.
+const dueForRetry: McpInstance[] = [];
+const stillWaiting: McpInstance[] = [];
for (const inst of errored) {
-await this.removeOne(inst);
+const meta = readRetryMeta(inst);
+const ts = meta.nextRetryAt ? Date.parse(meta.nextRetryAt) : 0;
+if (Number.isNaN(ts) || ts <= now) {
+dueForRetry.push(inst);
+} else {
+stillWaiting.push(inst);
+}
}
-// Scale up if needed
-const toStart = server.replicas - active.length;
+// Retry elapsed ones in-place. This preserves attemptCount across
+// attempts so the 30s × 5 → 5min schedule actually escalates.
+for (const inst of dueForRetry) {
+await this.retryInstance(inst);
+}
+// Scale up only if we don't already have enough live attempts.
+// Live attempts = currently-running OR -starting + still-waiting
+// (in backoff) + just-retried (now STARTING via retryInstance).
+// Counting waiting + retried against the budget prevents tight
+// create-fail-create churn while previous attempts work through
+// their backoff schedule.
+const toStart = server.replicas - active.length - stillWaiting.length - dueForRetry.length;
if (toStart > 0) {
for (let i = 0; i < toStart; i++) {
await this.startOne(server.id);
}
}
+if (toStart > 0 || dueForRetry.length > 0) {
reconciled++;
}
} catch (err) {
@@ -220,7 +277,12 @@ export class InstanceService {
return this.orchestrator.getContainerLogs(instance.containerId, opts);
}
-/** Start a single instance for a server. */
+/**
+* Start a single instance for a server. Creates a fresh `STARTING` row
+* and hands off to `attemptStart` for the env+container work. On
+* failure, `attemptStart` marks the row `ERROR` with a backoff-aware
+* `nextRetryAt`; the reconciler picks it up later via `retryInstance`.
+*/
private async startOne(serverId: string): Promise<McpInstance> {
const server = await this.serverRepo.findById(serverId);
if (!server) throw new NotFoundError(`McpServer '${serverId}' not found`);
@@ -234,6 +296,49 @@ export class InstanceService {
});
}
const instance = await this.instanceRepo.create({
serverId,
status: 'STARTING',
});
return this.attemptStart(instance, server);
}
/**
* Re-attempt a previously-errored instance in place, preserving its
* `attemptCount` so the backoff schedule escalates correctly. Called
* by `reconcileAll` for ERROR instances whose `nextRetryAt` has elapsed.
*/
private async retryInstance(instance: McpInstance): Promise<McpInstance> {
const server = await this.serverRepo.findById(instance.serverId);
if (!server) {
// Server was deleted underneath us — nothing to retry against.
return this.markInstanceError(instance, 'Server no longer exists');
}
if (server.externalUrl) {
// External servers don't need a container; the URL is the contract.
return this.instanceRepo.updateStatus(instance.id, 'RUNNING', {
metadata: { external: true, url: server.externalUrl },
});
}
// Reset transient fields but keep retry counters via the metadata
// passed through `attemptStart` → `markInstanceError`.
await this.instanceRepo.updateStatus(instance.id, 'STARTING', {});
const refreshed = (await this.instanceRepo.findById(instance.id)) ?? instance;
return this.attemptStart(refreshed, server);
}
/**
* Run the env-resolution + container-creation steps for a STARTING
* instance. On any failure, mark the instance `ERROR` with structured
* retry metadata. Used by both initial start (`startOne`) and retry
* (`retryInstance`).
*/
private async attemptStart(
instance: McpInstance,
server: McpServer,
): Promise<McpInstance> {
// Determine image + command based on server config:
// 1. Explicit dockerImage → use as-is
// 2. packageName → use runtime-specific runner image (node/python/go/...)
@@ -253,11 +358,6 @@ export class InstanceService {
image = server.name;
}
-let instance = await this.instanceRepo.create({
-serverId,
-status: 'STARTING',
-});
try {
const spec: ContainerSpec = {
image,
@@ -265,7 +365,7 @@ export class InstanceService {
hostPort: null,
network: MCP_SERVERS_NETWORK,
labels: {
-'mcpctl.server-id': serverId,
+'mcpctl.server-id': server.id,
'mcpctl.instance-id': instance.id,
},
};
@@ -283,7 +383,17 @@ export class InstanceService {
}
}
// Resolve env vars from inline values and secret refs
// Resolve env vars from inline values and secret refs.
//
// Failure here is FATAL for the start attempt: a container that
// boots without its declared secrets will silently mis-behave (we
// saw this with gitea-mcp-server starting up with an empty
// GITEA_ACCESS_TOKEN when OpenBao was unreachable, then reporting
// "healthy" while every authed call failed). Marking the instance
// ERROR with a backoff-aware nextRetryAt is honest; the reconciler
// will retry it in-place on the next tick whose nextRetryAt has
// elapsed. Optional/missing env vars should be modeled as `value: ""`
// entries on the server, not as silent secret-resolution failures.
if (this.secretResolver) {
try {
const resolvedEnv = await resolveServerEnv(server, this.secretResolver);
@@ -291,8 +401,8 @@ export class InstanceService {
spec.env = resolvedEnv;
}
} catch (envErr) {
-// Log but don't prevent startup — env resolution failures are non-fatal
-// The container may still work if env vars are optional
+const msg = envErr instanceof Error ? envErr.message : String(envErr);
+return this.markInstanceError(instance, `secret resolution failed: ${msg}`);
}
}
@@ -313,14 +423,39 @@ export class InstanceService {
}
// Set STARTING — syncStatus will promote to RUNNING once the container is actually ready
-instance = await this.instanceRepo.updateStatus(instance.id, 'STARTING', updateFields);
+return this.instanceRepo.updateStatus(instance.id, 'STARTING', updateFields);
} catch (err) {
-instance = await this.instanceRepo.updateStatus(instance.id, 'ERROR', {
-metadata: { error: err instanceof Error ? err.message : String(err) },
-});
+return this.markInstanceError(
+instance,
+err instanceof Error ? err.message : String(err),
+);
}
}
-return instance;
/**
* Mark an instance ERROR with a backoff-aware retry schedule. The
* `attemptCount` accumulates across retries (preserved by
* `retryInstance` which reuses the same row), so the schedule
* actually escalates: 30s × 5 → 5min thereafter.
*/
private async markInstanceError(
instance: McpInstance,
error: string,
): Promise<McpInstance> {
const meta = readRetryMeta(instance);
const attemptCount = (typeof meta.attemptCount === 'number' ? meta.attemptCount : 0) + 1;
const delayMs = nextDelayMs(attemptCount);
const now = new Date();
const nextRetryAt = new Date(now.getTime() + delayMs).toISOString();
return this.instanceRepo.updateStatus(instance.id, 'ERROR', {
metadata: {
...meta,
error,
attemptCount,
lastAttemptAt: now.toISOString(),
nextRetryAt,
},
});
}
/** Stop and remove a single instance. */


@@ -334,20 +334,93 @@ describe('InstanceService', () => {
expect(instanceRepo.create).not.toHaveBeenCalled();
});
-it('cleans up ERROR instances and creates replacements', async () => {
+it('retries ERROR instances in-place when their backoff has elapsed (no delete, no new row)', async () => {
const server = makeServer({ id: 'srv-1', replicas: 1 });
vi.mocked(serverRepo.findAll).mockResolvedValue([server]);
vi.mocked(serverRepo.findById).mockResolvedValue(server);
+// ERROR instance with no nextRetryAt → retry is due immediately.
vi.mocked(instanceRepo.findAll).mockResolvedValue([
makeInstance({ id: 'inst-dead', serverId: 'srv-1', status: 'ERROR', containerId: 'ctr-dead' }),
]);
const result = await service.reconcileAll();
-// Should delete ERROR instance and create a new one
+// Retry-in-place semantics: don't delete the row, don't create a
+// replacement. attemptCount needs to live on the same row so the
+// backoff schedule can actually escalate.
+expect(instanceRepo.delete).not.toHaveBeenCalled();
+expect(instanceRepo.create).not.toHaveBeenCalled();
+// retryInstance flips the row STARTING before attemptStart runs.
+expect(instanceRepo.updateStatus).toHaveBeenCalledWith('inst-dead', 'STARTING', expect.anything());
expect(result.reconciled).toBe(1);
-expect(instanceRepo.delete).toHaveBeenCalledWith('inst-dead');
-expect(instanceRepo.create).toHaveBeenCalled();
});
it('leaves ERROR instances alone while their nextRetryAt is in the future', async () => {
const server = makeServer({ id: 'srv-1', replicas: 1 });
vi.mocked(serverRepo.findAll).mockResolvedValue([server]);
vi.mocked(serverRepo.findById).mockResolvedValue(server);
const futureRetry = new Date(Date.now() + 60_000).toISOString();
vi.mocked(instanceRepo.findAll).mockResolvedValue([
makeInstance({
id: 'inst-waiting',
serverId: 'srv-1',
status: 'ERROR',
metadata: { nextRetryAt: futureRetry, attemptCount: 2 },
}),
]);
const result = await service.reconcileAll();
// Within the backoff window the reconciler must not delete the row,
// not retry it, and not spawn a replacement (counting it against
// the replica budget is what prevents tight create-fail-create churn).
expect(instanceRepo.delete).not.toHaveBeenCalled();
expect(instanceRepo.create).not.toHaveBeenCalled();
expect(orchestrator.createContainer).not.toHaveBeenCalled();
expect(result.reconciled).toBe(0);
});
it('escalates the backoff: attemptCount + nextRetryAt persist on retry failures', async () => {
const server = makeServer({ id: 'srv-1', replicas: 1 });
vi.mocked(serverRepo.findAll).mockResolvedValue([server]);
vi.mocked(serverRepo.findById).mockResolvedValue(server);
// Fail container creation so attemptStart goes down the markInstanceError path.
vi.mocked(orchestrator.createContainer).mockRejectedValue(new Error('boom'));
// Existing ERROR instance with attemptCount=2 (so the next failure
// produces attemptCount=3, still inside the fast-retry window).
vi.mocked(instanceRepo.findAll).mockResolvedValue([
makeInstance({
id: 'inst-1',
serverId: 'srv-1',
status: 'ERROR',
metadata: { error: 'previous failure', attemptCount: 2, nextRetryAt: new Date(Date.now() - 1000).toISOString() },
}),
]);
// retryInstance refreshes via findById; let it return the same row.
vi.mocked(instanceRepo.findById).mockImplementation(async () => makeInstance({
id: 'inst-1',
serverId: 'srv-1',
status: 'STARTING',
metadata: { error: 'previous failure', attemptCount: 2, nextRetryAt: new Date(Date.now() - 1000).toISOString() },
}));
await service.reconcileAll();
// Look at the last updateStatus call — it should be the ERROR transition
// with attemptCount bumped to 3.
const errorCalls = vi.mocked(instanceRepo.updateStatus).mock.calls.filter(
(c) => c[1] === 'ERROR',
);
expect(errorCalls.length).toBeGreaterThan(0);
const lastErrorCall = errorCalls[errorCalls.length - 1]!;
const meta = (lastErrorCall[2] as { metadata?: Record<string, unknown> } | undefined)?.metadata;
expect(meta).toBeDefined();
expect((meta as Record<string, unknown>)['attemptCount']).toBe(3);
expect((meta as Record<string, unknown>)['nextRetryAt']).toBeTypeOf('string');
// Reason should reference the boom we threw.
expect(String((meta as Record<string, unknown>)['error'])).toContain('boom');
});
it('reconciles multiple servers independently', async () => {


@@ -192,25 +192,28 @@ describe('HealthProbeRunner', () => {
expect(serverRepo.findById).not.toHaveBeenCalled();
});
-it('probes STDIO instance with exec and marks healthy on success', async () => {
+it('probes STDIO instance via mcpProxyService and marks healthy on success', async () => {
const instance = makeInstance();
const server = makeServer();
vi.mocked(instanceRepo.findAll).mockResolvedValue([instance]);
vi.mocked(serverRepo.findById).mockResolvedValue(server);
-vi.mocked(orchestrator.execInContainer).mockResolvedValue({
-exitCode: 0,
-stdout: 'OK',
-stderr: '',
+vi.mocked(mcpProxyService.execute).mockResolvedValue({
+jsonrpc: '2.0', id: 1,
+result: { content: [{ type: 'text', text: 'ok' }] },
});
await runner.tick();
-expect(orchestrator.execInContainer).toHaveBeenCalledWith(
-'container-abc',
-expect.arrayContaining(['node', '-e']),
-expect.objectContaining({ timeoutMs: 10000 }),
-);
+// STDIO readiness now goes through the proxy (the live container),
+// not via docker-exec into a synthetic spawn — see comment on
+// probeReadinessViaProxy for why.
+expect(orchestrator.execInContainer).not.toHaveBeenCalled();
+expect(mcpProxyService.execute).toHaveBeenCalledWith({
+serverId: 'srv-1',
+method: 'tools/call',
+params: { name: 'list_datasources', arguments: {} },
+});
expect(instanceRepo.updateStatus).toHaveBeenCalledWith(
'inst-1',
@@ -225,6 +228,57 @@ describe('HealthProbeRunner', () => {
);
});
it('marks unhealthy when proxy returns a JSON-RPC error (e.g. broken-secret auth failure)', async () => {
const instance = makeInstance();
const server = makeServer({
healthCheck: { tool: 'get_me', intervalSeconds: 0, failureThreshold: 1 } as McpServer['healthCheck'],
});
vi.mocked(instanceRepo.findAll).mockResolvedValue([instance]);
vi.mocked(serverRepo.findById).mockResolvedValue(server);
vi.mocked(mcpProxyService.execute).mockResolvedValue({
jsonrpc: '2.0', id: 1,
error: { code: -32603, message: 'token is required' },
});
await runner.tick();
expect(instanceRepo.updateStatus).toHaveBeenCalledWith(
'inst-1',
'RUNNING',
expect.objectContaining({
healthStatus: 'unhealthy',
events: expect.arrayContaining([
expect.objectContaining({ type: 'Warning', message: expect.stringContaining('token is required') }),
]),
}),
);
});
it('marks unhealthy when proxy returns a tool-level error in result.isError', async () => {
const instance = makeInstance();
const server = makeServer({
healthCheck: { tool: 'get_me', intervalSeconds: 0, failureThreshold: 1 } as McpServer['healthCheck'],
});
vi.mocked(instanceRepo.findAll).mockResolvedValue([instance]);
vi.mocked(serverRepo.findById).mockResolvedValue(server);
vi.mocked(mcpProxyService.execute).mockResolvedValue({
jsonrpc: '2.0', id: 1,
result: { isError: true, content: [{ type: 'text', text: 'auth failed: token is required' }] },
});
await runner.tick();
const events = vi.mocked(instanceRepo.updateStatus).mock.calls[0]?.[2]?.events as Array<{ message: string }> | undefined;
expect(events?.[events.length - 1]?.message).toContain('auth failed');
expect(instanceRepo.updateStatus).toHaveBeenCalledWith(
'inst-1',
'RUNNING',
expect.objectContaining({ healthStatus: 'unhealthy' }),
);
});
it('marks unhealthy after failureThreshold consecutive failures', async () => {
const instance = makeInstance();
const healthCheck: HealthCheckSpec = {
@@ -237,10 +291,9 @@ describe('HealthProbeRunner', () => {
vi.mocked(instanceRepo.findAll).mockResolvedValue([instance]);
vi.mocked(serverRepo.findById).mockResolvedValue(server);
-vi.mocked(orchestrator.execInContainer).mockResolvedValue({
-exitCode: 1,
-stdout: 'ERROR:connection refused',
-stderr: '',
+vi.mocked(mcpProxyService.execute).mockResolvedValue({
+jsonrpc: '2.0', id: 1,
+error: { code: -32603, message: 'connection refused' },
});
// First failure → degraded
@@ -274,15 +327,15 @@ describe('HealthProbeRunner', () => {
vi.mocked(serverRepo.findById).mockResolvedValue(server);
// Two failures
-vi.mocked(orchestrator.execInContainer).mockResolvedValue({
-exitCode: 1, stdout: 'ERROR:fail', stderr: '',
+vi.mocked(mcpProxyService.execute).mockResolvedValue({
+jsonrpc: '2.0', id: 1, error: { code: -32603, message: 'fail' },
});
await runner.tick();
await runner.tick();
// Then success — should reset to healthy
-vi.mocked(orchestrator.execInContainer).mockResolvedValue({
-exitCode: 0, stdout: 'OK', stderr: '',
+vi.mocked(mcpProxyService.execute).mockResolvedValue({
+jsonrpc: '2.0', id: 1, result: {},
});
await runner.tick();
@@ -290,13 +343,16 @@ describe('HealthProbeRunner', () => {
expect(lastCall?.[2]).toEqual(expect.objectContaining({ healthStatus: 'healthy' }));
});
-it('handles exec timeout as failure', async () => {
+it('handles probe timeout as failure', async () => {
const instance = makeInstance();
-const server = makeServer();
+const server = makeServer({
+healthCheck: { tool: 'list_datasources', intervalSeconds: 0, timeoutSeconds: 0.05, failureThreshold: 3 } as unknown as McpServer['healthCheck'],
+});
vi.mocked(instanceRepo.findAll).mockResolvedValue([instance]);
vi.mocked(serverRepo.findById).mockResolvedValue(server);
-vi.mocked(orchestrator.execInContainer).mockRejectedValue(new Error('Exec timed out after 10000ms'));
+// Hang forever — the probe's internal deadline should fire instead.
+vi.mocked(mcpProxyService.execute).mockImplementation(() => new Promise(() => { /* never resolves */ }));
await runner.tick();
@@ -323,8 +379,8 @@ describe('HealthProbeRunner', () => {
vi.mocked(instanceRepo.findAll).mockResolvedValue([instance]);
vi.mocked(serverRepo.findById).mockResolvedValue(server);
-vi.mocked(orchestrator.execInContainer).mockResolvedValue({
-exitCode: 0, stdout: 'OK', stderr: '',
+vi.mocked(mcpProxyService.execute).mockResolvedValue({
+jsonrpc: '2.0', id: 1, result: {},
});
await runner.tick();
@@ -343,17 +399,17 @@ describe('HealthProbeRunner', () => {
vi.mocked(instanceRepo.findAll).mockResolvedValue([instance]);
vi.mocked(serverRepo.findById).mockResolvedValue(server);
-vi.mocked(orchestrator.execInContainer).mockResolvedValue({
-exitCode: 0, stdout: 'OK', stderr: '',
+vi.mocked(mcpProxyService.execute).mockResolvedValue({
+jsonrpc: '2.0', id: 1, result: {},
});
// First tick: should probe
await runner.tick();
-expect(orchestrator.execInContainer).toHaveBeenCalledTimes(1);
+expect(mcpProxyService.execute).toHaveBeenCalledTimes(1);
// Second tick immediately: should skip (300s interval not elapsed)
await runner.tick();
-expect(orchestrator.execInContainer).toHaveBeenCalledTimes(1);
+expect(mcpProxyService.execute).toHaveBeenCalledTimes(1);
});
it('cleans up probe states for removed instances', async () => {
@@ -364,9 +420,12 @@ describe('HealthProbeRunner', () => {
vi.mocked(instanceRepo.findAll).mockResolvedValue([instance]);
vi.mocked(serverRepo.findById).mockResolvedValue(server);
+vi.mocked(mcpProxyService.execute).mockResolvedValue({
+jsonrpc: '2.0', id: 1, result: {},
+});
await runner.tick();
-expect(orchestrator.execInContainer).toHaveBeenCalledTimes(1);
+expect(mcpProxyService.execute).toHaveBeenCalledTimes(1);
// Instance removed
vi.mocked(instanceRepo.findAll).mockResolvedValue([]);
@@ -375,7 +434,7 @@ describe('HealthProbeRunner', () => {
// Re-add same instance — should probe again (state was cleaned)
vi.mocked(instanceRepo.findAll).mockResolvedValue([instance]);
await runner.tick();
-expect(orchestrator.execInContainer).toHaveBeenCalledTimes(2);
+expect(mcpProxyService.execute).toHaveBeenCalledTimes(2);
});
it('skips STDIO instances without containerId', async () => {
@@ -397,8 +456,8 @@ describe('HealthProbeRunner', () => {
arguments: {},
};
-vi.mocked(orchestrator.execInContainer).mockResolvedValue({
-exitCode: 0, stdout: 'OK', stderr: '',
+vi.mocked(mcpProxyService.execute).mockResolvedValue({
+jsonrpc: '2.0', id: 1, result: {},
});
const result = await runner.probeInstance(instance, server, healthCheck);
@@ -407,15 +466,14 @@ describe('HealthProbeRunner', () => {
expect(result.message).toBe('ok');
});
-it('handles STDIO exec failure with error message', async () => {
+it('surfaces upstream JSON-RPC error message verbatim', async () => {
const instance = makeInstance();
const server = makeServer();
const healthCheck: HealthCheckSpec = { tool: 'list_datasources', arguments: {} };
-vi.mocked(orchestrator.execInContainer).mockResolvedValue({
-exitCode: 1,
-stdout: 'ERROR:ECONNREFUSED 10.0.0.1:3000',
-stderr: '',
+vi.mocked(mcpProxyService.execute).mockResolvedValue({
+jsonrpc: '2.0', id: 1,
+error: { code: -32603, message: 'ECONNREFUSED 10.0.0.1:3000' },
});
const result = await runner.probeInstance(instance, server, healthCheck);