# Lab Platform — Design Document

## Vision

A unified infrastructure management platform that replaces Puppet with a modern, Pulumi-based system. Manages bare-metal servers, cloud VMs, and k3s clusters through a single CLI and API.

## Architecture Overview

```
┌─────────────────────────────────────────────────────────────────┐
│  Developer Workstation (thebeast)                               │
│                                                                 │
│  lab CLI                                                        │
│  ├── lab init bastion standalone start  (PXE provisioning)      │
│  ├── lab provision install/reprovision  (bare-metal)            │
│  ├── lab get servers --env production   (query)                 │
│  ├── lab exec --                        (remote execution)      │
│  ├── lab logs                           (log streaming)         │
│  ├── lab apply -f infra.ts              (pulumi via labd)       │
│  └── lab get roles/users/permissions    (RBAC management)       │
│                                                                 │
│  Connects to: labd via mTLS                                     │
└─────────────────────┬───────────────────────────────────────────┘
                      │ mTLS (client cert)
                      ▼
┌─────────────────────────────────────────────────────────────────┐
│  labmaster.ad.itaz.eu (infra node, k3s single-node)             │
│                                                                 │
│  ┌──────────────────────────────────────────────────────┐       │
│  │  labd (master daemon)                                │       │
│  │  ├── Certificate Authority (issues agent certs)      │       │
│  │  ├── RBAC Engine (roles, permissions, ACLs)          │       │
│  │  ├── Agent Registry (connected agents, heartbeats)   │       │
│  │  ├── Pulumi Executor (runs IaC on behalf of users)   │       │
│  │  ├── Log Aggregator (receives agent logs)            │       │
│  │  ├── Module Registry (configuration modules)         │       │
│  │  └── REST API + WebSocket (agent connections)        │       │
│  └──────────────────────────────────────────────────────┘       │
│                                                                 │
│  ┌──────────────────────────────────────────────────────┐       │
│  │  bastion (PXE provisioning)                          │       │
│  │  Running as k3s pod with hostNetwork                 │       │
│  └──────────────────────────────────────────────────────┘       │
└──────────┬──────────────────────────────────────────────────────┘
           │ mTLS (agent certs)
           ▼
┌──────────────────────┐  ┌──────────────────────┐  ┌────────────┐
│ ser9.ad.itaz.eu      │  │ worker-2.ad.itaz.eu  │  │ AWS EC2    │
│ (bare-metal worker)  │  │ (bare-metal worker)  │  │ instances  │
│                      │  │                      │  │            │
│ lab-agent            │  │ lab-agent            │  │ lab-agent  │
│ ├── heartbeat        │  │ ├── heartbeat        │  │ ├── ...    │
│ ├── log shipping     │  │ ├── log shipping     │  │ └── ...    │
│ ├── exec handler     │  │ ├── exec handler     │  │            │
│ └── module runner    │  │ └── module runner    │  │            │
└──────────────────────┘  └──────────────────────┘  └────────────┘
```

## Components

### 1. labd (Master Daemon)

The central control plane. Runs on labmaster.ad.itaz.eu as a k3s pod.

**Responsibilities:**
- Certificate Authority — signs agent certificates, manages trust chain
- Agent Registry — tracks connected agents, heartbeats, status
- RBAC — roles, permissions, ACLs per user/group/environment/cloud
- Pulumi Executor — runs Pulumi TypeScript code submitted by users
- Log Aggregator — receives and stores logs from agents
- Module Registry — stores and distributes configuration modules
- REST API — for CLI and external integrations
- WebSocket — persistent agent connections for real-time commands

**Tech:** Fastify, PostgreSQL (via Prisma, reuse mcpctl patterns), WebSocket

### 2. lab-agent

Lightweight daemon running on every managed machine.

**Responsibilities:**
- Connect to labd via mTLS (agent certificate)
- Send heartbeats (status, load, disk, memory)
- Ship logs (journald → labd)
- Execute commands on demand (like `kubectl exec`)
- Run configuration modules (like `puppet agent -tv`)
- Report module run results

**Tech:** Standalone TypeScript binary (bun compiled), systemd service

### 3. lab CLI (extended)

Extends the existing `lab` CLI with platform management commands.


**New commands:**

```
# Server management
lab get servers                          # List all servers
lab get servers --env production         # Filter by environment
lab get servers --cloud baremetal        # Filter by cloud
lab get servers --label role=k3s-worker  # Filter by label
lab describe server                      # Detailed server info
lab exec --                              # Remote command execution
lab logs [-f]                            # Stream server logs

# Infrastructure as Code
lab apply -f                             # Execute Pulumi code via labd
lab plan -f                              # Dry-run Pulumi code
lab destroy -f                           # Tear down resources

# RBAC
lab get roles                            # List roles
lab get users                            # List users
lab create role                          # Create role
lab bind role --user                     # Bind role to user
lab get permissions                      # List permissions

# Environment/Cloud management
lab get environments                     # List environments
lab get clouds                           # List clouds
lab create environment --cloud

# Module management
lab get modules                          # List available modules
lab apply module --target                # Apply module to server
```

### 4. Certificate Authority

Built into labd. Issues and manages certificates for agents and users.

**Flow:**

```
1. Agent starts with a join token (one-time or reusable)
2. Agent generates CSR, sends to labd with token
3. labd validates token, signs certificate
4. Agent receives signed cert + CA cert
5. All future communication uses mTLS

For CLI users:
1. User runs `lab login` or `lab init`
2. labd issues a client certificate (or uses existing SSH keys)
3. CLI uses client cert for all API calls
```

**Token types:**
- **One-time token** — for individual bare-metal servers (generated during PXE provision)
- **Reusable token** — for autoscaling groups (AWS ASG instances use the same token)

### 5. RBAC Model

Reuse mcpctl's RBAC patterns.
Hierarchical permissions:

```
Cloud → Environment → Server → Action

Examples:
- baremetal:lab:*:exec      — can exec on any lab server
- baremetal:lab:puppet:*    — full access to puppet server
- aws:production:*:read     — read-only on all AWS prod servers
- *:*:*:*                   — superadmin
```

**Resources:**
- servers, environments, clouds, modules, roles, users, pulumi-stacks

**Actions:**
- read, exec, apply, destroy, manage, admin

**Whitelist/Blacklist:**
- Roles can have `allow` and `deny` rules
- Deny takes precedence (like AWS IAM)

### 6. Module System

Configuration modules define the desired state of a server.

**Module structure:**

```
modules/
  k3s-server/
    module.yaml          # Metadata: name, version, targets, deps
    src/
      index.ts           # Module entry point
      install.ts         # Installation logic
      configure.ts       # Configuration logic
      health.ts          # Health check
    tests/
      install.test.ts
  k3s-agent/
    module.yaml
    src/
      index.ts
  labd/
    module.yaml
    src/
      index.ts           # Deploy labd to k3s
```

**module.yaml:**

```yaml
name: k3s-server
version: 0.1.0
description: Install and configure k3s server
targets:
  roles: [infra]
  labels:
    k3s: server
dependencies:
  - base-server
```

**Module sources:**
- Built-in modules (in this repo, e.g., k3s-server, labd)
- External modules (separate git repos, pulled by URL)
- Module registry (future — like Puppet Forge)

### 7. Cloud/Environment Model

```
Cloud: baremetal
└── Environment: lab
    ├── Server: puppet.ad.itaz.eu (role=infra, labels={k3s=server})
    ├── Server: ser9.ad.itaz.eu (role=worker, labels={k3s=agent})
    └── ...

Cloud: aws
├── Environment: production
│   ├── Server: i-abc123 (from ASG web-servers)
│   ├── Server: i-def456 (from ASG web-servers)
│   └── ...
└── Environment: staging
    └── ...
```

Each bastion creates an environment under the `baremetal` cloud. AWS autoscaling groups create environments under the `aws` cloud.

### 8. Pulumi Integration

Users submit Pulumi TypeScript code to labd for execution.

```bash
# Apply infrastructure code
lab apply -f infra/k3s-cluster.ts --env lab

# The file is sent to labd, which:
# 1. Checks RBAC (does user have apply permission for this env?)
# 2. Creates a Pulumi stack
# 3. Executes `pulumi up` in a sandboxed environment
# 4. Streams output back to CLI
# 5. Stores state in Pulumi backend (local or S3)
```

**Future AWS extension:**

```typescript
// infra/aws-web-servers.ts
import * as aws from "@pulumi/aws";

const asg = new aws.autoscaling.Group("web-servers", {
  maxSize: 10,
  minSize: 2,
  launchTemplate: { /* ... */ },
  // User data installs lab-agent with reusable join token
});
```

## Project Structure

```
lab/
  bastion/                  # Existing — PXE provisioning
  src/
    shared/                 # @lab/shared — types, constants, RBAC
    labd/                   # @lab/labd — master daemon
      src/
        main.ts
        server.ts
        ca/                 # Certificate Authority
        rbac/               # RBAC engine (reuse mcpctl patterns)
        agents/             # Agent registry + WebSocket
        pulumi/             # Pulumi executor
        logs/               # Log aggregation
        modules/            # Module registry
        routes/             # REST API
    agent/                  # @lab/agent — agent daemon
      src/
        main.ts
        connection.ts       # mTLS WebSocket to labd
        heartbeat.ts
        executor.ts         # Command execution
        logs.ts             # Log shipping
        modules.ts          # Module runner
    cli/                    # @lab/cli — extends existing CLI
      src/
        commands/
          init/bastion/     # Existing bastion commands
          provision/        # Existing provision commands
          get/              # New: get servers/roles/users/etc
          exec/             # New: remote execution
          logs/             # New: log streaming
          apply/            # New: pulumi apply
          rbac/             # New: role management
  modules/                  # Built-in modules
    k3s-server/             # Deploy k3s server
    k3s-agent/              # Deploy k3s agent
    labd/                   # Deploy labd to k3s
    lab-agent/              # Deploy lab-agent to servers
  deploy/
    k3s/                    # Existing k3s manifests for bastion
    labd/                   # k3s manifests for labd
```

## Implementation Phases

### Phase 1: Foundation (current + next)
- [x] Bastion (PXE provisioning) — DONE
- [x] CLI structure (`lab init/provision`) — DONE
- [ ] Rename puppet to labmaster, reprovision
- [ ] Deploy k3s on labmaster
- [ ] Build labd skeleton (Fastify + Prisma)
- [ ] Certificate Authority (issue/sign certs)
- [ ] Agent skeleton (connect, heartbeat)

### Phase 2: Core Platform
- [ ] RBAC engine (roles, permissions, ACLs)
- [ ] `lab get servers` with environment/cloud/label filters
- [ ] `lab exec` remote command execution
- [ ] `lab logs` streaming
- [ ] Agent auto-enrollment via PXE provision (join token in kickstart)

### Phase 3: Infrastructure as Code
- [ ] Module system (define, apply, health check)
- [ ] k3s-server module (deploy k3s)
- [ ] labd module (deploy labd to k3s)
- [ ] Pulumi executor in labd
- [ ] `lab apply -f` command

### Phase 4: Multi-Cloud
- [ ] AWS provider (Pulumi-based)
- [ ] Reusable join tokens for autoscaling groups
- [ ] Cloud/environment model
- [ ] Auto-discovery of cloud instances

## Key Design Decisions

1. **Pulumi over Puppet** — TypeScript-native, same language for IaC and platform code
2. **mTLS over SSH** — proper PKI, scalable, no key management per-server
3. **Agents connect to master** (not master pushing to agents) — works through NATs, firewalls
4. **RBAC from day one** — security-first, deny by default
5. **Module system inspired by Puppet** — declarative, testable, versionable
6. **Multi-cloud extensible** — cloud is just a label, provider is pluggable
7. **Reuse mcpctl patterns** — Prisma DB, Fastify routes, CLI structure, RBAC model
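
To illustrate decision 4 together with the RBAC Model section, here is a minimal sketch of how `cloud:environment:server:action` patterns with `allow` and `deny` rules might be evaluated. The `Rule` and `AccessRequest` types and the `matchesPattern` and `isAllowed` functions are hypothetical names chosen for this example, not part of labd or mcpctl. It also assumes that the server segment is matched against a short server name such as `ser9` or `puppet`, and that `*` is the only wildcard form.

```typescript
// Hypothetical types: the design only fixes the "cloud:environment:server:action"
// string form, deny-by-default, and that deny takes precedence over allow.
type Rule = { effect: "allow" | "deny"; pattern: string };

interface AccessRequest {
  cloud: string;
  environment: string;
  server: string;
  action: string;
}

// A pattern segment matches if it is "*" or equals the requested value exactly.
function matchesPattern(pattern: string, req: AccessRequest): boolean {
  const parts = pattern.split(":");
  if (parts.length !== 4) return false; // malformed patterns never match
  const values = [req.cloud, req.environment, req.server, req.action];
  return parts.every((part, i) => part === "*" || part === values[i]);
}

// Deny by default; any matching deny rule overrides every matching allow rule.
function isAllowed(rules: Rule[], req: AccessRequest): boolean {
  let allowed = false;
  for (const rule of rules) {
    if (!matchesPattern(rule.pattern, req)) continue;
    if (rule.effect === "deny") return false;
    allowed = true;
  }
  return allowed;
}

// Example role: may exec on any lab server, but is denied everything on "puppet".
const rules: Rule[] = [
  { effect: "allow", pattern: "baremetal:lab:*:exec" },
  { effect: "deny", pattern: "baremetal:lab:puppet:*" },
];

const execOnSer9: AccessRequest = {
  cloud: "baremetal", environment: "lab", server: "ser9", action: "exec",
};
const execOnPuppet: AccessRequest = {
  cloud: "baremetal", environment: "lab", server: "puppet", action: "exec",
};
const readOnAws: AccessRequest = {
  cloud: "aws", environment: "production", server: "i-abc123", action: "read",
};

console.log(isAllowed(rules, execOnSer9));   // true: matched by the allow rule
console.log(isAllowed(rules, execOnPuppet)); // false: the deny rule wins
console.log(isAllowed(rules, readOnAws));    // false: no rule matches
```

Returning early on the first matching deny keeps the precedence rule independent of rule order, so a role's rules can be stored and merged in any sequence.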