feat: Kubernetes operator for MCP server management #47

Merged
michal merged 7 commits from feat/k8s-operator into main 2026-04-09 22:46:22 +00:00

Summary

  • mcpd as Kubernetes operator: deploys MCP server instances as pods via ServiceAccount RBAC
  • KubernetesOrchestrator using @kubernetes/client-node (exec, attach, pod IP, SPDY)
  • Automatic reconciliation loop (30s) — self-healing, replaces crashed pods
  • Two-namespace isolation: mcpctl (control plane) + mcpctl-servers (data plane)
  • 10 NetworkPolicies — only mcpd can reach MCP server pods
  • Backup completeness: prompts + templates + all server fields (runtime, command, etc.)
  • Instance status accuracy: STARTING until pod is actually running
  • Restore fix: system user ID resolution for project ownership
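
As an illustration of the "only mcpd can reach MCP server pods" rule, here is a minimal sketch of one ingress policy built as a plain object. The namespace names come from the summary above; the function name, the policy name, and the `app: mcpd` pod label are assumptions made for the example and are not taken from this PR.

```typescript
// Hypothetical helper: builds a NetworkPolicy that admits ingress to every pod
// in the data-plane namespace only from mcpd pods in the control-plane namespace.
function buildIngressPolicy(serverNamespace: string, controlNamespace: string) {
  return {
    apiVersion: "networking.k8s.io/v1",
    kind: "NetworkPolicy",
    metadata: { name: "allow-mcpd-ingress", namespace: serverNamespace },
    spec: {
      // An empty podSelector selects all pods in the policy's namespace.
      podSelector: {},
      policyTypes: ["Ingress"],
      ingress: [
        {
          from: [
            {
              // Both selectors in one entry are ANDed: the source must be in
              // the control-plane namespace AND carry the mcpd label.
              namespaceSelector: {
                matchLabels: { "kubernetes.io/metadata.name": controlNamespace },
              },
              // Assumption: mcpd pods are labelled app=mcpd.
              podSelector: { matchLabels: { app: "mcpd" } },
            },
          ],
        },
      ],
    },
  };
}

const policy = buildIngressPolicy("mcpctl-servers", "mcpctl");
```

The real deployment uses 10 policies; this sketch covers only the single ingress rule and says nothing about egress or DNS.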

Deployed and tested

  • Production k8s cluster (worker0-k8s0) with all 8 servers running
  • Data migrated from Portainer via backup/restore
  • 630 unit tests pass, smoke tests pass

Test plan

  • Unit tests (630 pass)
  • Local k3s e2e (pod creation, network isolation)
  • Production deploy + data migration
  • Smoke tests against production k8s

🤖 Generated with Claude Code

michal added 7 commits 2026-04-09 22:45:32 +00:00
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
mcpd can now deploy MCP server instances as Kubernetes pods instead of
Docker containers. Set MCPD_ORCHESTRATOR=kubernetes to enable.

- Add @kubernetes/client-node with thin wrapper (context enforcement
  via MCPD_K8S_CONTEXT to prevent multi-cluster mishaps)
- Rewrite KubernetesOrchestrator: pod CRUD, pod IP extraction,
  exec via SPDY (one-shot + interactive), log streaming
- Manifest generator: stdin:true for STDIO servers, args (not command)
  to preserve runner image entrypoint, security hardening
- Orchestrator selection in main.ts via MCPD_ORCHESTRATOR env var
- 25 unit tests for k8s orchestrator, all 624 tests pass
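
The manifest rules in the list above can be sketched roughly as follows. This is not the generator from this PR: the `ServerSpec` shape, the pod naming scheme, and the specific `securityContext` values are assumptions chosen to illustrate "stdin:true for STDIO servers", "args (not command)", and "security hardening".

```typescript
// Hypothetical input shape for the example.
interface ServerSpec {
  name: string;
  image: string;
  transport: "STDIO" | "HTTP";
  args: string[];
}

function buildPodManifest(server: ServerSpec, namespace: string) {
  return {
    apiVersion: "v1",
    kind: "Pod",
    metadata: { name: `mcp-${server.name}`, namespace },
    spec: {
      containers: [
        {
          name: "server",
          image: server.image,
          // Setting args and leaving command unset keeps the image's
          // ENTRYPOINT, so the runner image still starts its own launcher.
          args: server.args,
          // STDIO servers need an open stdin to receive requests.
          stdin: server.transport === "STDIO",
          // Assumed hardening settings; the PR does not list the exact ones.
          securityContext: {
            allowPrivilegeEscalation: false,
            runAsNonRoot: true,
            capabilities: { drop: ["ALL"] },
          },
        },
      ],
    },
  };
}
```

In a Pod spec, `command` overrides the image entrypoint while `args` only overrides its default arguments, which is why the choice matters for runner images.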

Tested end-to-end on local k3s:
- mcpd deployed via Pulumi, creates pods in mcpctl-servers namespace
- NetworkPolicy verified: only mcpd can reach MCP server pods
- Python runner (uvx) successfully runs aws-documentation-mcp-server

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
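
A minimal sketch of the two environment switches described in the commit above, written as pure functions over an env map so it can be tested without a cluster. The function names, the Docker default, and the shape of the context check are assumptions; only the variable names `MCPD_ORCHESTRATOR` and `MCPD_K8S_CONTEXT` come from the commit message.

```typescript
type Env = Record<string, string | undefined>;
type OrchestratorKind = "docker" | "kubernetes";

// Assumption: any value other than "kubernetes" keeps the Docker orchestrator.
function selectOrchestrator(env: Env): OrchestratorKind {
  return env["MCPD_ORCHESTRATOR"] === "kubernetes" ? "kubernetes" : "docker";
}

// Hypothetical guard: refuse to continue when the active kubeconfig context
// is not the one the operator was told to use.
function assertExpectedContext(env: Env, activeContext: string): void {
  const expected = env["MCPD_K8S_CONTEXT"];
  if (expected !== undefined && expected !== activeContext) {
    throw new Error(
      `refusing to use context "${activeContext}", expected "${expected}"`,
    );
  }
}
```

Failing fast on a context mismatch is one way to get the "prevent multi-cluster mishaps" behaviour; the actual wrapper may enforce it differently.
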
The restore service hardcoded ownerId as the literal string 'system'
instead of looking up the actual system user ID. This caused FK
constraint violations when restoring projects to a fresh database.

Now resolves the system user by email, falling back to the first
available user.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
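
The lookup described in the commit above can be sketched as below. The `User` shape, the function name, and passing the system email as a parameter are assumptions for the example; the real service reads users from the database.

```typescript
interface User {
  id: string;
  email: string;
}

// Resolve the owner for restored projects: prefer the system user (matched by
// email), otherwise fall back to the first available user.
function resolveOwnerId(users: User[], systemEmail: string): string {
  const owner = users.find((u) => u.email === systemEmail) ?? users[0];
  if (owner === undefined) {
    // With no users at all there is no valid foreign key to point at.
    throw new Error("cannot restore projects: no users exist");
  }
  return owner.id;
}
```

Returning a real user ID instead of the literal string 'system' is what keeps the foreign key constraint satisfied on a fresh database.
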
- Add MCPD_NODE_SELECTOR env var support in manifest generator
  for mixed-arch clusters (e.g. arm64+amd64)
- Fix backup restore: resolve system user ID instead of
  hardcoded 'system' string

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
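
A possible way to turn `MCPD_NODE_SELECTOR` into a pod `nodeSelector` map is sketched below. The commit does not state the variable's format, so the comma-separated `key=value` syntax and the function name are assumptions.

```typescript
// Assumed format: "key=value,key2=value2". Malformed pairs are skipped.
function parseNodeSelector(raw: string | undefined): Record<string, string> {
  const selector: Record<string, string> = {};
  if (!raw) return selector;
  for (const pair of raw.split(",")) {
    const idx = pair.indexOf("=");
    if (idx <= 0) continue; // no "=" or empty key
    const key = pair.slice(0, idx).trim();
    const value = pair.slice(idx + 1).trim();
    if (key) selector[key] = value;
  }
  return selector;
}

// On a mixed arm64+amd64 cluster, pin server pods to amd64 nodes.
const selector = parseNodeSelector("kubernetes.io/arch=amd64");
```

The resulting map would go under `spec.nodeSelector` in the generated pod manifest.
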
mcpd now runs a periodic reconcileAll() every 30s that:
- Detects crashed/missing containers (syncStatus)
- Cleans up ERROR instances
- Creates replacement pods to match desired replica count

This replaces the old syncStatus-only timer. Servers migrated
from another deployment or recovering from node failures will
automatically get their instances recreated.

6 new tests for reconcileAll covering: missing instances, skip
replicas=0, already-at-count, ERROR cleanup, multi-server,
error isolation.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
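
The decision step of such a reconciliation pass can be sketched as a pure planning function. This is not the `reconcileAll()` from the commit: the types, the function name, and the choice to skip `replicas=0` servers before any cleanup are assumptions, and the status sync, pod creation, and per-server error isolation are left out.

```typescript
type Status = "STARTING" | "RUNNING" | "ERROR";

interface Server {
  id: string;
  replicas: number; // desired instance count
}

interface Instance {
  id: string;
  serverId: string;
  status: Status;
}

interface Plan {
  remove: string[]; // instance IDs to delete
  create: Record<string, number>; // serverId -> number of pods to add
}

function planReconcile(servers: Server[], instances: Instance[]): Plan {
  const plan: Plan = { remove: [], create: {} };
  for (const server of servers) {
    if (server.replicas === 0) continue; // assumption: skipped entirely
    const own = instances.filter((i) => i.serverId === server.id);
    const broken = own.filter((i) => i.status === "ERROR");
    plan.remove.push(...broken.map((i) => i.id));
    // Only non-ERROR instances count towards the desired replica count.
    const missing = server.replicas - (own.length - broken.length);
    if (missing > 0) plan.create[server.id] = missing;
  }
  return plan;
}
```

A caller would run this on the 30s timer and then apply the plan one server at a time, so a failure for one server does not block the others.
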
Two bugs fixed:

1. Backup completeness: JSON backup API now includes prompts and
   templates. Previously these were silently dropped during
   backup/restore, causing data loss on migration.

2. STDIO proxy for docker-image servers: servers with dockerImage
   but no packageName/command (like docmost) now use k8s Attach
   to connect to the container's PID 1 stdin/stdout instead of
   exec. This fixes "has no packageName or command" errors.

Changes:
- backup-service.ts: add BackupPrompt/BackupTemplate types, export them
- restore-service.ts: restore prompts (with project FK) and templates
- mcp-proxy-service.ts: sendViaPersistentAttach for docker-image STDIO
- orchestrator.ts: add attachInteractive to McpOrchestrator interface
- kubernetes-orchestrator.ts: implement attachInteractive via k8s Attach
- k8s-client-official.ts: expose Attach client

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
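
The exec-versus-attach choice from the second fix can be sketched as a small decision function. The field names `dockerImage`, `packageName`, and `command` come from the commit message; the `ServerConfig` type, the function name, and the error text are assumptions.

```typescript
interface ServerConfig {
  dockerImage?: string;
  packageName?: string;
  command?: string[];
}

type StdioMode = "exec" | "attach";

function chooseStdioMode(server: ServerConfig): StdioMode {
  const hasCommand = server.command !== undefined && server.command.length > 0;
  // A package or explicit command gives mcpd something to exec in the pod.
  if (server.packageName || hasCommand) return "exec";
  // Image-only servers already run their MCP process as PID 1, so attach to
  // that container's stdin/stdout instead of starting a second process.
  if (server.dockerImage) return "attach";
  throw new Error("server has no packageName, command, or dockerImage");
}
```

For example, an image-only server such as docmost would take the attach path, while a uvx-launched package would keep using exec.
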
fix: accurate instance status — STARTING until pod is actually running
All checks were successful
CI/CD / typecheck (pull_request) Successful in 52s
CI/CD / lint (pull_request) Successful in 1m53s
CI/CD / test (pull_request) Successful in 1m2s
CI/CD / build (pull_request) Successful in 4m0s
CI/CD / smoke (pull_request) Successful in 8m38s
CI/CD / publish-rpm (pull_request) Has been skipped
CI/CD / publish-deb (pull_request) Has been skipped
016f8abe68
Instance status now reflects actual container state:
- startOne() sets STARTING (not RUNNING) after container creation
- syncStatus() promotes STARTING→RUNNING when pod is ready
- syncStatus() demotes RUNNING→STARTING if pod restarts (CrashLoop)
- External servers still get RUNNING immediately (no container)

Previously, CrashLooping pods showed as RUNNING in mcpctl get instances.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
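
The status rules listed in the commit above can be sketched as a small state function. The type and function names are assumptions, and detecting a restart by comparing restart counts is one plausible approach, not necessarily what `syncStatus()` does.

```typescript
type InstanceStatus = "STARTING" | "RUNNING" | "ERROR";

interface PodObservation {
  ready: boolean;
  restartCount: number;
}

// External servers have no container to wait for, so they start as RUNNING.
function initialStatus(isExternal: boolean): InstanceStatus {
  return isExternal ? "RUNNING" : "STARTING";
}

function nextStatus(
  current: InstanceStatus,
  pod: PodObservation,
  lastRestartCount: number,
): InstanceStatus {
  // Assumption: a higher restart count than last seen means the pod restarted.
  const restarted = pod.restartCount > lastRestartCount;
  if (current === "RUNNING" && restarted) return "STARTING"; // demote
  if (current === "STARTING" && pod.ready) return "RUNNING"; // promote
  return current;
}
```

Under these rules a crash-looping pod stays in STARTING, because it is demoted on every restart and never remains ready long enough to be promoted and stay there.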
michal merged commit 3f24527c84 into main 2026-04-09 22:46:22 +00:00
michal deleted branch feat/k8s-operator 2026-04-09 22:46:23 +00:00

Reference: michal/mcpctl#47