feat: Kubernetes operator for MCP server management #47

michal · 2026-04-09T22:45:31Z

## Summary

- mcpd as Kubernetes operator: deploys MCP server instances as pods via ServiceAccount RBAC
- KubernetesOrchestrator using @kubernetes/client-node (exec, attach, pod IP, SPDY)
- Automatic reconciliation loop (30s): self-healing, replaces crashed pods
- Two-namespace isolation: mcpctl (control plane) + mcpctl-servers (data plane)
- 10 NetworkPolicies: only mcpd can reach MCP server pods
- Backup completeness: prompts + templates + all server fields (runtime, command, etc.)
- Instance status accuracy: STARTING until pod is actually running
- Restore fix: system user ID resolution for project ownership

## Deployed and tested

- Production k8s cluster (worker0-k8s0) with all 8 servers running
- Data migrated from Portainer via backup/restore
- 630 unit tests pass, smoke tests pass

## Test plan

- [x] Unit tests (630 pass)
- [x] Local k3s e2e (pod creation, network isolation)
- [x] Production deploy + data migration
- [x] Smoke tests against production k8s

🤖 Generated with [Claude Code](https://claude.com/claude-code)

michal added 7 commits 2026-04-09 22:45:32 +00:00

chore: add gstack skill routing rules to CLAUDE.md f409952b0c

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

feat: add Kubernetes orchestrator for MCP server pod management 5e45960a18

mcpd can now deploy MCP server instances as Kubernetes pods instead of
Docker containers. Set MCPD_ORCHESTRATOR=kubernetes to enable.
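
For readers who want to see what the switch could look like, here is a minimal sketch of reading MCPD_ORCHESTRATOR. It is not the code from main.ts: the function name `selectOrchestratorKind`, the `docker` default, and the error on unknown values are all assumptions made for illustration.

```typescript
// Sketch only. The "docker" default and the rejection of unknown values are
// assumptions; the commit states only that MCPD_ORCHESTRATOR=kubernetes
// enables the Kubernetes orchestrator.
type OrchestratorKind = "docker" | "kubernetes";

function selectOrchestratorKind(
  env: Record<string, string | undefined>,
): OrchestratorKind {
  // Normalize so that " Kubernetes " and "kubernetes" behave the same.
  const raw = (env["MCPD_ORCHESTRATOR"] ?? "docker").trim().toLowerCase();
  if (raw === "docker" || raw === "kubernetes") {
    return raw;
  }
  // Failing loudly avoids silently falling back to the wrong backend.
  throw new Error(`Unsupported MCPD_ORCHESTRATOR value: ${raw}`);
}
```

Passing the environment in as a plain object (for example `selectOrchestratorKind(process.env)` in a Node entry point) keeps the selection logic easy to unit test.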

- Add @kubernetes/client-node with thin wrapper (context enforcement
  via MCPD_K8S_CONTEXT to prevent multi-cluster mishaps)
- Rewrite KubernetesOrchestrator: pod CRUD, pod IP extraction,
  exec via SPDY (one-shot + interactive), log streaming
- Manifest generator: stdin:true for STDIO servers, args (not command)
  to preserve runner image entrypoint, security hardening
- Orchestrator selection in main.ts via MCPD_ORCHESTRATOR env var
- 25 unit tests for k8s orchestrator, all 624 tests pass
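
The manifest rules in the list above (stdin for STDIO servers, `args` rather than `command`) can be sketched roughly as follows. This is an illustration, not the generator from this PR: the types, the `mcp-` name prefix, the label key, and the specific `securityContext` fields are assumptions, since the commit says only "security hardening".

```typescript
// Hypothetical input shape; the real server model has more fields.
interface McpServerSpec {
  name: string;
  image: string;
  transport: "STDIO" | "HTTP";
  args: string[];
}

interface ContainerSpec {
  name: string;
  image: string;
  args: string[];
  stdin: boolean;
  securityContext: {
    allowPrivilegeEscalation: boolean;
    runAsNonRoot: boolean;
    capabilities: { drop: string[] };
  };
}

interface PodManifest {
  apiVersion: "v1";
  kind: "Pod";
  metadata: { name: string; namespace: string; labels: Record<string, string> };
  spec: { containers: ContainerSpec[] };
}

function buildPodManifest(
  server: McpServerSpec,
  namespace: string = "mcpctl-servers",
): PodManifest {
  return {
    apiVersion: "v1",
    kind: "Pod",
    metadata: {
      name: `mcp-${server.name}`, // assumed naming scheme
      namespace,
      labels: { "app.kubernetes.io/name": server.name },
    },
    spec: {
      containers: [
        {
          name: "server",
          image: server.image,
          // Only `args` is set. In Kubernetes, `command` overrides the image
          // ENTRYPOINT while `args` overrides CMD, so the runner image's
          // entrypoint stays in control.
          args: server.args,
          // STDIO servers need stdin kept open to receive requests.
          stdin: server.transport === "STDIO",
          // Assumed hardening defaults; adjust to what the images support.
          securityContext: {
            allowPrivilegeEscalation: false,
            runAsNonRoot: true,
            capabilities: { drop: ["ALL"] },
          },
        },
      ],
    },
  };
}
```

The resulting object has the shape of a core/v1 Pod, so it could be handed to a Kubernetes client's pod-creation call.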

Tested end-to-end on local k3s:
- mcpd deployed via Pulumi, creates pods in mcpctl-servers namespace
- NetworkPolicy verified: only mcpd can reach MCP server pods
- Python runner (uvx) successfully runs aws-documentation-mcp-server

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

fix: resolve system user ID in backup restore for projects 3663963a32

The restore service hardcoded ownerId as the literal string 'system'
instead of looking up the actual system user ID. This caused FK
constraint violations when restoring projects to a fresh database.

Now resolves the system user by email, falling back to the first
available user.
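
A rough sketch of that lookup order, using an in-memory user list instead of the real database layer. The `UserRow` shape, the `SYSTEM_EMAIL` value, and the error thrown when no users exist are hypothetical; the commit does not state the system user's address or the empty-database behavior.

```typescript
interface UserRow {
  id: string;
  email: string;
}

// Hypothetical address: the source does not say what the system user's email is.
const SYSTEM_EMAIL = "system@example.invalid";

function resolveSystemUserId(
  users: readonly UserRow[],
  systemEmail: string = SYSTEM_EMAIL,
): string {
  // Prefer the real system user so restored projects get a valid owner FK.
  const system = users.find((u) => u.email === systemEmail);
  // Fall back to the first available user, as the commit describes.
  const owner = system ?? users[0];
  if (owner === undefined) {
    // Assumed behavior: refuse to restore rather than write a dangling ownerId.
    throw new Error("Cannot restore projects: no users exist to own them");
  }
  return owner.id;
}
```

The point of the change is that `ownerId` always refers to a row that exists, which is what the foreign-key constraint requires.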

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

feat: nodeSelector for MCP server pods + restore fix 14be2fa18e

- Add MCPD_NODE_SELECTOR env var support in manifest generator
  for mixed-arch clusters (e.g. arm64+amd64)
- Fix backup restore: resolve system user ID instead of
  hardcoded 'system' string
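
As an illustration of how MCPD_NODE_SELECTOR might be turned into a pod `nodeSelector`, here is a small parser. The value format is not given in this PR; the sketch assumes comma-separated `key=value` pairs, and `parseNodeSelector` is a hypothetical name.

```typescript
// ASSUMPTION: the env var holds comma-separated key=value pairs, e.g.
// "kubernetes.io/arch=amd64". The actual format used by mcpd may differ.
function parseNodeSelector(
  raw: string | undefined,
): Record<string, string> | undefined {
  if (raw === undefined || raw.trim() === "") {
    return undefined; // no selector: pods may schedule on any node
  }
  const selector: Record<string, string> = {};
  for (const pair of raw.split(",")) {
    const idx = pair.indexOf("=");
    const key = idx < 0 ? "" : pair.slice(0, idx).trim();
    if (key === "") {
      throw new Error(`Invalid MCPD_NODE_SELECTOR entry: "${pair}"`);
    }
    selector[key] = pair.slice(idx + 1).trim();
  }
  return selector;
}
```

On a mixed arm64+amd64 cluster, a value such as `kubernetes.io/arch=amd64` (a standard node label) would keep amd64-only runner images off arm64 nodes.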

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

feat: automatic reconciliation loop for MCP server instances d293df738a

mcpd now runs a periodic reconcileAll() every 30s that:
- Detects crashed/missing containers (syncStatus)
- Cleans up ERROR instances
- Creates replacement pods to match desired replica count

This replaces the old syncStatus-only timer. Servers migrated
from another deployment or recovering from node failures will
automatically get their instances recreated.

6 new tests for reconcileAll covering: missing instances, skip
replicas=0, already-at-count, ERROR cleanup, multi-server,
error isolation.
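
The decision part of such a reconcile pass can be sketched as a pure planning function. This is not the `reconcileAll()` from the commit: the row types, `planReconcile`, and the choice to skip `replicas=0` servers before any cleanup are assumptions made to keep the example small.

```typescript
type InstanceStatus = "STARTING" | "RUNNING" | "ERROR";

interface ServerRow {
  id: string;
  replicas: number; // desired instance count
}

interface InstanceRow {
  id: string;
  serverId: string;
  status: InstanceStatus;
}

interface ReconcilePlan {
  removeInstanceIds: string[];
  createForServerIds: string[]; // one entry per pod to create
}

function planReconcile(
  servers: readonly ServerRow[],
  instances: readonly InstanceRow[],
): ReconcilePlan {
  const plan: ReconcilePlan = { removeInstanceIds: [], createForServerIds: [] };
  for (const server of servers) {
    // Assumption: servers scaled to zero are left alone entirely.
    if (server.replicas === 0) continue;
    const own = instances.filter((i) => i.serverId === server.id);
    const errored = own.filter((i) => i.status === "ERROR");
    // ERROR instances are cleaned up and do not count toward the desired total.
    for (const instance of errored) plan.removeInstanceIds.push(instance.id);
    const live = own.length - errored.length;
    const missing = Math.max(0, server.replicas - live);
    for (let n = 0; n < missing; n += 1) plan.createForServerIds.push(server.id);
  }
  return plan;
}
```

In a real loop this would run on a 30-second timer after a status sync. Applying each server's part of the plan inside its own try/catch is one way to get the error isolation the tests mention, so that one failing server does not block the rest.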

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

fix: add prompts/templates to backup + STDIO attach for docker-image servers 1bd5087052

Two bugs fixed:

1. Backup completeness: JSON backup API now includes prompts and
   templates. Previously these were silently dropped during
   backup/restore, causing data loss on migration.

2. STDIO proxy for docker-image servers: servers with dockerImage
   but no packageName/command (like docmost) now use k8s Attach
   to connect to the container's PID 1 stdin/stdout instead of
   exec. This fixes "has no packageName or command" errors.
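
The exec-versus-attach decision in item 2 can be sketched as a small function. The config shape and `chooseStdioTransport` are hypothetical; in particular, whether `command` is stored as a string or an array is not stated here, so an array is assumed.

```typescript
// Hypothetical subset of the server config relevant to STDIO proxying.
interface StdioServerConfig {
  packageName?: string;
  command?: string[];
  dockerImage?: string;
}

type StdioTransport = "exec" | "attach";

function chooseStdioTransport(server: StdioServerConfig): StdioTransport {
  const hasLaunchSpec =
    Boolean(server.packageName) || (server.command?.length ?? 0) > 0;
  if (hasLaunchSpec) {
    // There is something to run, so start it inside the pod via exec.
    return "exec";
  }
  if (server.dockerImage) {
    // The image's own PID 1 is the MCP server: attach to its stdin/stdout.
    return "attach";
  }
  throw new Error("server has no packageName, command, or dockerImage");
}
```

A server like docmost, which is configured with only an image, would take the `attach` branch instead of failing with the "has no packageName or command" error.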

Changes:
- backup-service.ts: add BackupPrompt/BackupTemplate types, export them
- restore-service.ts: restore prompts (with project FK) and templates
- mcp-proxy-service.ts: sendViaPersistentAttach for docker-image STDIO
- orchestrator.ts: add attachInteractive to McpOrchestrator interface
- kubernetes-orchestrator.ts: implement attachInteractive via k8s Attach
- k8s-client-official.ts: expose Attach client

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

fix: accurate instance status — STARTING until pod is actually running 016f8abe68

CI/CD checks (pull_request): typecheck successful in 52s, lint successful in 1m53s, test successful in 1m2s, build successful in 4m0s, smoke successful in 8m38s; publish-rpm and publish-deb skipped.

Instance status now reflects actual container state:
- startOne() sets STARTING (not RUNNING) after container creation
- syncStatus() promotes STARTING→RUNNING when pod is ready
- syncStatus() demotes RUNNING→STARTING if pod restarts (CrashLoop)
- External servers still get RUNNING immediately (no container)

Previously, CrashLooping pods showed as RUNNING in mcpctl get instances.
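
The transitions listed above can be written as a small state function. This is a sketch rather than the code in `syncStatus()`: the `PodObservation` shape and the use of an increasing restart counter to detect restarts are assumptions.

```typescript
type InstanceStatus = "STARTING" | "RUNNING";

// Hypothetical summary of what a status sync reads from the pod.
interface PodObservation {
  ready: boolean;
  restartCount: number;
}

// External servers have no container, so they are RUNNING immediately.
function initialStatus(isExternal: boolean): InstanceStatus {
  return isExternal ? "RUNNING" : "STARTING";
}

function nextStatus(
  current: InstanceStatus,
  pod: PodObservation,
  previousRestartCount: number,
): InstanceStatus {
  // Assumption: a restart is detected as a restart counter that went up.
  const restarted = pod.restartCount > previousRestartCount;
  if (current === "RUNNING" && (restarted || !pod.ready)) {
    return "STARTING"; // demote: the pod restarted or is no longer ready
  }
  if (current === "STARTING" && pod.ready && !restarted) {
    return "RUNNING"; // promote only once the pod is actually ready
  }
  return current;
}
```

With rules like these, a pod that keeps restarting never settles in RUNNING, which matches the behavior the commit describes for CrashLooping pods.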

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

michal merged commit 3f24527c84 into main

2026-04-09 22:46:22 +00:00

michal deleted branch feat/k8s-operator

2026-04-09 22:46:23 +00:00

michal referenced this issue from a commit

2026-04-09 22:46:24 +00:00

Merge pull request 'feat: Kubernetes operator for MCP server management' (#47) from feat/k8s-operator into main

Reference: michal/mcpctl#47