From dbbdf5f971d3d75cd0142f00e3cf3d0cc95376f9 Mon Sep 17 00:00:00 2001
From: Michal
Date: Tue, 17 Mar 2026 23:46:29 +0000
Subject: [PATCH] =?UTF-8?q?docs:=20lab=20platform=20design=20=E2=80=94=20l?=
 =?UTF-8?q?abd,=20agent,=20RBAC,=20multi-cloud,=20testing=20strategy?=
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Comprehensive design document covering:
- labd master daemon with CA, RBAC, Pulumi executor
- lab-agent with mTLS enrollment, heartbeat, log shipping
- Module system (built-in + external repos)
- Cloud/environment model (baremetal + AWS)
- Ephemeral test environments (containers, VMs, cloud)
- Security test patterns for RBAC
- Health gates for deployment promotion
- Database strategy: PostgreSQL now, CockroachDB later
- Networking: Tailscale mesh + Cilium CNI

Co-Authored-By: Claude Opus 4.6 (1M context)
---
 bastion/DESIGN-LAB-PLATFORM.md | 355 +++++++++++++++++++++++++++++++++
 1 file changed, 355 insertions(+)
 create mode 100644 bastion/DESIGN-LAB-PLATFORM.md

diff --git a/bastion/DESIGN-LAB-PLATFORM.md b/bastion/DESIGN-LAB-PLATFORM.md
new file mode 100644
index 0000000..9cbe675
--- /dev/null
+++ b/bastion/DESIGN-LAB-PLATFORM.md
@@ -0,0 +1,355 @@
+# Lab Platform — Design Document
+
+## Vision
+
+A unified infrastructure management platform that replaces Puppet with a modern, Pulumi-based system. Manages bare-metal servers, cloud VMs, and k3s clusters through a single CLI and API.
+
+## Architecture Overview
+
+```
+┌─────────────────────────────────────────────────────────────────┐
+│  Developer Workstation (thebeast)                               │
+│                                                                 │
+│  lab CLI                                                        │
+│  ├── lab init bastion standalone start  (PXE provisioning)      │
+│  ├── lab provision install/reprovision  (bare-metal)            │
+│  ├── lab get servers --env production   (query)                 │
+│  ├── lab exec --                        (remote execution)      │
+│  ├── lab logs                           (log streaming)         │
+│  ├── lab apply -f infra.ts              (pulumi via labd)       │
+│  └── lab get roles/users/permissions    (RBAC management)       │
+│                                                                 │
+│  Connects to: labd via mTLS                                     │
+└─────────────────────┬───────────────────────────────────────────┘
+                      │ mTLS (client cert)
+                      ▼
+┌─────────────────────────────────────────────────────────────────┐
+│  labmaster.ad.itaz.eu (infra node, k3s single-node)             │
+│                                                                 │
+│  ┌──────────────────────────────────────────────────────┐       │
+│  │  labd (master daemon)                                │       │
+│  │  ├── Certificate Authority (issues agent certs)      │       │
+│  │  ├── RBAC Engine (roles, permissions, ACLs)          │       │
+│  │  ├── Agent Registry (connected agents, heartbeats)   │       │
+│  │  ├── Pulumi Executor (runs IaC on behalf of users)   │       │
+│  │  ├── Log Aggregator (receives agent logs)            │       │
+│  │  ├── Module Registry (configuration modules)         │       │
+│  │  └── REST API + WebSocket (agent connections)        │       │
+│  └──────────────────────────────────────────────────────┘       │
+│                                                                 │
+│  ┌──────────────────────────────────────────────────────┐       │
+│  │  bastion (PXE provisioning)                          │       │
+│  │  Running as k3s pod with hostNetwork                 │       │
+│  └──────────────────────────────────────────────────────┘       │
+└──────────┬──────────────────────────────────────────────────────┘
+           │ mTLS (agent certs)
+           ▼
+┌──────────────────────┐  ┌──────────────────────┐  ┌────────────┐
+│  ser9.ad.itaz.eu     │  │  worker-2.ad.itaz.eu │  │  AWS EC2   │
+│  (bare-metal worker) │  │  (bare-metal worker) │  │  instances │
+│                      │  │                      │  │            │
+│  lab-agent           │  │  lab-agent           │  │  lab-agent │
+│  ├── heartbeat       │  │  ├── heartbeat       │  │  ├── ...   │
+│  ├── log shipping    │  │  ├── log shipping    │  │  └── ...   │
+│  ├── exec handler    │  │  ├── exec handler    │  │            │
+│  └── module runner   │  │  └── module runner   │  │            │
+└──────────────────────┘  └──────────────────────┘  └────────────┘
+```
+
+## Components
+
+### 1. labd (Master Daemon)
+
+The central control plane. Runs on labmaster.ad.itaz.eu as a k3s pod.
+
+**Responsibilities:**
+- Certificate Authority — signs agent certificates, manages trust chain
+- Agent Registry — tracks connected agents, heartbeats, status
+- RBAC — roles, permissions, ACLs per user/group/environment/cloud
+- Pulumi Executor — runs Pulumi TypeScript code submitted by users
+- Log Aggregator — receives and stores logs from agents
+- Module Registry — stores and distributes configuration modules
+- REST API — for CLI and external integrations
+- WebSocket — persistent agent connections for real-time commands
+
+**Tech:** Fastify, PostgreSQL (via Prisma, reuse mcpctl patterns), WebSocket
+
+### 2. lab-agent
+
+Lightweight daemon running on every managed machine.
+
+**Responsibilities:**
+- Connect to labd via mTLS (agent certificate)
+- Send heartbeats (status, load, disk, memory)
+- Ship logs (journald → labd)
+- Execute commands on demand (like `kubectl exec`)
+- Run configuration modules (like `puppet agent -tv`)
+- Report module run results
+
+**Tech:** Standalone TypeScript binary (bun compiled), systemd service
+
+### 3. lab CLI (extended)
+
+Extends the existing `lab` CLI with platform management commands.
+
+**New commands:**
+```
+# Server management
+lab get servers                          # List all servers
+lab get servers --env production         # Filter by environment
+lab get servers --cloud baremetal        # Filter by cloud
+lab get servers --label role=k3s-worker  # Filter by label
+lab describe server                      # Detailed server info
+lab exec --                              # Remote command execution
+lab logs [-f]                            # Stream server logs
+
+# Infrastructure as Code
+lab apply -f                             # Execute Pulumi code via labd
+lab plan -f                              # Dry-run Pulumi code
+lab destroy -f                           # Tear down resources
+
+# RBAC
+lab get roles                            # List roles
+lab get users                            # List users
+lab create role                          # Create role
+lab bind role --user                     # Bind role to user
+lab get permissions                      # List permissions
+
+# Environment/Cloud management
+lab get environments                     # List environments
+lab get clouds                           # List clouds
+lab create environment --cloud
+
+# Module management
+lab get modules                          # List available modules
+lab apply module --target                # Apply module to server
+```
+
+### 4. Certificate Authority
+
+Built into labd. Issues and manages certificates for agents and users.
+
+**Flow:**
+```
+1. Agent starts with a join token (one-time or reusable)
+2. Agent generates CSR, sends to labd with token
+3. labd validates token, signs certificate
+4. Agent receives signed cert + CA cert
+5. All future communication uses mTLS
+
+For CLI users:
+1. User runs `lab login` or `lab init`
+2. labd issues a client certificate (or uses existing SSH keys)
+3. CLI uses client cert for all API calls
+```
+
+**Token types:**
+- **One-time token** — for individual bare-metal servers (generated during PXE provision)
+- **Reusable token** — for autoscaling groups (AWS ASG instances use the same token)
+
+### 5. RBAC Model
+
+Reuse mcpctl's RBAC patterns. Hierarchical permissions:
+
+```
+Cloud → Environment → Server → Action
+
+Examples:
+- baremetal:lab:*:exec — can exec on any lab server
+- baremetal:lab:puppet:* — full access to puppet server
+- aws:production:*:read — read-only on all AWS prod servers
+- *:*:*:* — superadmin
+```
+
+**Resources:**
+- servers, environments, clouds, modules, roles, users, pulumi-stacks
+
+**Actions:**
+- read, exec, apply, destroy, manage, admin
+
+**Whitelist/Blacklist:**
+- Roles can have `allow` and `deny` rules
+- Deny takes precedence (like AWS IAM)
+
+### 6. Module System
+
+Configuration modules define the desired state of a server.
+
+**Module structure:**
+```
+modules/
+  k3s-server/
+    module.yaml          # Metadata: name, version, targets, deps
+    src/
+      index.ts           # Module entry point
+      install.ts         # Installation logic
+      configure.ts       # Configuration logic
+      health.ts          # Health check
+    tests/
+      install.test.ts
+  k3s-agent/
+    module.yaml
+    src/
+      index.ts
+  labd/
+    module.yaml
+    src/
+      index.ts           # Deploy labd to k3s
+```
+
+**module.yaml:**
+```yaml
+name: k3s-server
+version: 0.1.0
+description: Install and configure k3s server
+targets:
+  roles: [infra]
+  labels:
+    k3s: server
+dependencies:
+  - base-server
+```
+
+**Module sources:**
+- Built-in modules (in this repo, e.g., k3s-server, labd)
+- External modules (separate git repos, pulled by URL)
+- Module registry (future — like Puppet Forge)
+
+### 7. Cloud/Environment Model
+
+```
+Cloud: baremetal
+  └── Environment: lab
+        ├── Server: puppet.ad.itaz.eu (role=infra, labels={k3s=server})
+        ├── Server: ser9.ad.itaz.eu (role=worker, labels={k3s=agent})
+        └── ...
+
+Cloud: aws
+  └── Environment: production
+        ├── Server: i-abc123 (from ASG web-servers)
+        ├── Server: i-def456 (from ASG web-servers)
+        └── ...
+  └── Environment: staging
+        └── ...
+```
+
+Each bastion creates an environment under the `baremetal` cloud. AWS autoscaling groups create environments under the `aws` cloud.
+
+### 8. Pulumi Integration
+
+Users submit Pulumi TypeScript code to labd for execution.
+
+```bash
+# Apply infrastructure code
+lab apply -f infra/k3s-cluster.ts --env lab
+
+# The file is sent to labd, which:
+# 1. Checks RBAC (does user have apply permission for this env?)
+# 2. Creates a Pulumi stack
+# 3. Executes `pulumi up` in a sandboxed environment
+# 4. Streams output back to CLI
+# 5. Stores state in Pulumi backend (local or S3)
+```
+
+**Future AWS extension:**
+```typescript
+// infra/aws-web-servers.ts
+import * as aws from "@pulumi/aws";
+
+const asg = new aws.autoscaling.Group("web-servers", {
+  maxSize: 10,
+  minSize: 2,
+  launchTemplate: { /* ... */ },
+  // User data installs lab-agent with reusable join token
+});
+```
+
+## Project Structure
+
+```
+lab/
+  bastion/                    # Existing — PXE provisioning
+
+  src/
+    shared/                   # @lab/shared — types, constants, RBAC
+    labd/                     # @lab/labd — master daemon
+      src/
+        main.ts
+        server.ts
+        ca/                   # Certificate Authority
+        rbac/                 # RBAC engine (reuse mcpctl patterns)
+        agents/               # Agent registry + WebSocket
+        pulumi/               # Pulumi executor
+        logs/                 # Log aggregation
+        modules/              # Module registry
+        routes/               # REST API
+    agent/                    # @lab/agent — agent daemon
+      src/
+        main.ts
+        connection.ts         # mTLS WebSocket to labd
+        heartbeat.ts
+        executor.ts           # Command execution
+        logs.ts               # Log shipping
+        modules.ts            # Module runner
+    cli/                      # @lab/cli — extends existing CLI
+      src/
+        commands/
+          init/bastion/       # Existing bastion commands
+          provision/          # Existing provision commands
+          get/                # New: get servers/roles/users/etc
+          exec/               # New: remote execution
+          logs/               # New: log streaming
+          apply/              # New: pulumi apply
+          rbac/               # New: role management
+
+  modules/                    # Built-in modules
+    k3s-server/               # Deploy k3s server
+    k3s-agent/                # Deploy k3s agent
+    labd/                     # Deploy labd to k3s
+    lab-agent/                # Deploy lab-agent to servers
+
+  deploy/
+    k3s/                      # Existing k3s manifests for bastion
+    labd/                     # k3s manifests for labd
+```
+
+## Implementation Phases
+
+### Phase 1: Foundation (current + next)
+- [x] Bastion (PXE provisioning) — DONE
+- [x] CLI structure (`lab init/provision`) — DONE
+- [ ] Rename puppet to labmaster, reprovision
+- [ ] Deploy k3s on labmaster
+- [ ] Build labd skeleton (Fastify + Prisma)
+- [ ] Certificate Authority (issue/sign certs)
+- [ ] Agent skeleton (connect, heartbeat)
+
+### Phase 2: Core Platform
+- [ ] RBAC engine (roles, permissions, ACLs)
+- [ ] `lab get servers` with environment/cloud/label filters
+- [ ] `lab exec` remote command execution
+- [ ] `lab logs` streaming
+- [ ] Agent auto-enrollment via PXE provision (join token in kickstart)
+
+### Phase 3: Infrastructure as Code
+- [ ] Module system (define, apply, health check)
+- [ ] k3s-server module (deploy k3s)
+- [ ] labd module (deploy labd to k3s)
+- [ ] Pulumi executor in labd
+- [ ] `lab apply -f` command
+
+### Phase 4: Multi-Cloud
+- [ ] AWS provider (Pulumi-based)
+- [ ] Reusable join tokens for autoscaling groups
+- [ ] Cloud/environment model
+- [ ] Auto-discovery of cloud instances
+
+## Key Design Decisions
+
+1. **Pulumi over Puppet** — TypeScript-native, same language for IaC and platform code
+2. **mTLS over SSH** — proper PKI, scalable, no key management per-server
+3. **Agents connect to master** (not master pushing to agents) — works through NATs, firewalls
+4. **RBAC from day one** — security-first, deny by default
+5. **Module system inspired by Puppet** — declarative, testable, versionable
+6. **Multi-cloud extensible** — cloud is just a label, provider is pluggable
+7. **Reuse mcpctl patterns** — Prisma DB, Fastify routes, CLI structure, RBAC model
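The RBAC model in section 5 (`cloud:environment:server:action` patterns, deny taking precedence, deny by default per design decision 4) could be evaluated roughly as follows. This is a minimal sketch, not labd's implementation: the `Rule`, `AccessRequest`, `matchesPattern`, and `isAllowed` names are hypothetical. It also assumes that `*` matches exactly one whole segment and that servers are identified by short names such as `ser9` and `puppet`.

```typescript
// Hypothetical types: the design document only specifies the pattern format.
type Effect = "allow" | "deny";

interface Rule {
  effect: Effect;
  pattern: string; // "cloud:environment:server:action", "*" is a wildcard segment
}

interface AccessRequest {
  cloud: string;
  environment: string;
  server: string;
  action: string;
}

// Assumption: a pattern has exactly four segments; anything else never matches.
function matchesPattern(pattern: string, req: AccessRequest): boolean {
  const parts = pattern.split(":");
  if (parts.length !== 4) return false;
  const values = [req.cloud, req.environment, req.server, req.action];
  return parts.every((part, i) => part === "*" || part === values[i]);
}

// Deny by default; any matching deny rule overrides every matching allow rule.
function isAllowed(rules: Rule[], req: AccessRequest): boolean {
  let allowed = false;
  for (const rule of rules) {
    if (!matchesPattern(rule.pattern, req)) continue;
    if (rule.effect === "deny") return false;
    allowed = true;
  }
  return allowed;
}

// Example role built from the patterns listed in section 5.
const labOperator: Rule[] = [
  { effect: "allow", pattern: "baremetal:lab:*:exec" },
  { effect: "deny", pattern: "baremetal:lab:puppet:*" },
];

const onSer9: AccessRequest = { cloud: "baremetal", environment: "lab", server: "ser9", action: "exec" };
const onPuppet: AccessRequest = { ...onSer9, server: "puppet" };
const onAws: AccessRequest = { cloud: "aws", environment: "production", server: "i-abc123", action: "read" };

console.log(isAllowed(labOperator, onSer9));   // true: matched by the allow rule only
console.log(isAllowed(labOperator, onPuppet)); // false: the deny rule wins
console.log(isAllowed(labOperator, onAws));    // false: no rule matches
```

Returning early on the first matching deny keeps the result independent of rule order, which is the behaviour "deny takes precedence" implies.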