Comprehensive design document covering: - labd master daemon with CA, RBAC, Pulumi executor - lab-agent with mTLS enrollment, heartbeat, log shipping - Module system (built-in + external repos) - Cloud/environment model (baremetal + AWS) - Ephemeral test environments (containers, VMs, cloud) - Security test patterns for RBAC - Health gates for deployment promotion - Database strategy: PostgreSQL now, CockroachDB later - Networking: Tailscale mesh + Cilium CNI Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Lab Platform — Design Document
Vision
A unified infrastructure management platform that replaces Puppet with a modern, Pulumi-based system. It manages bare-metal servers, cloud VMs, and k3s clusters through a single CLI and API.
Architecture Overview
┌─────────────────────────────────────────────────────────────────┐
│  Developer Workstation (thebeast)                               │
│                                                                 │
│  lab CLI                                                        │
│  ├── lab init bastion standalone start  (PXE provisioning)      │
│  ├── lab provision install/reprovision  (bare-metal)            │
│  ├── lab get servers --env production   (query)                 │
│  ├── lab exec <server> -- <command>     (remote execution)      │
│  ├── lab logs <server>                  (log streaming)         │
│  ├── lab apply -f infra.ts              (pulumi via labd)       │
│  └── lab get roles/users/permissions    (RBAC management)       │
│                                                                 │
│  Connects to: labd via mTLS                                     │
└─────────────────────┬───────────────────────────────────────────┘
                      │ mTLS (client cert)
                      ▼
┌─────────────────────────────────────────────────────────────────┐
│  labmaster.ad.itaz.eu (infra node, k3s single-node)             │
│                                                                 │
│  ┌──────────────────────────────────────────────────────┐       │
│  │  labd (master daemon)                                │       │
│  │  ├── Certificate Authority (issues agent certs)      │       │
│  │  ├── RBAC Engine (roles, permissions, ACLs)          │       │
│  │  ├── Agent Registry (connected agents, heartbeats)   │       │
│  │  ├── Pulumi Executor (runs IaC on behalf of users)   │       │
│  │  ├── Log Aggregator (receives agent logs)            │       │
│  │  ├── Module Registry (configuration modules)         │       │
│  │  └── REST API + WebSocket (agent connections)        │       │
│  └──────────────────────────────────────────────────────┘       │
│                                                                 │
│  ┌──────────────────────────────────────────────────────┐       │
│  │  bastion (PXE provisioning)                          │       │
│  │  Running as k3s pod with hostNetwork                 │       │
│  └──────────────────────────────────────────────────────┘       │
└──────────┬──────────────────────────────────────────────────────┘
           │ mTLS (agent certs)
           ▼
┌──────────────────────┐ ┌──────────────────────┐ ┌────────────┐
│ ser9.ad.itaz.eu      │ │ worker-2.ad.itaz.eu  │ │ AWS EC2    │
│ (bare-metal worker)  │ │ (bare-metal worker)  │ │ instances  │
│                      │ │                      │ │            │
│ lab-agent            │ │ lab-agent            │ │ lab-agent  │
│ ├── heartbeat        │ │ ├── heartbeat        │ │ ├── ...    │
│ ├── log shipping     │ │ ├── log shipping     │ │ └── ...    │
│ ├── exec handler     │ │ ├── exec handler     │ │            │
│ └── module runner    │ │ └── module runner    │ │            │
└──────────────────────┘ └──────────────────────┘ └────────────┘
Components
1. labd (Master Daemon)
The central control plane. Runs on labmaster.ad.itaz.eu as a k3s pod.
Responsibilities:
- Certificate Authority — signs agent certificates, manages trust chain
- Agent Registry — tracks connected agents, heartbeats, status
- RBAC — roles, permissions, ACLs per user/group/environment/cloud
- Pulumi Executor — runs Pulumi TypeScript code submitted by users
- Log Aggregator — receives and stores logs from agents
- Module Registry — stores and distributes configuration modules
- REST API — for CLI and external integrations
- WebSocket — persistent agent connections for real-time commands
Tech: Fastify, PostgreSQL (via Prisma, reuse mcpctl patterns), WebSocket
2. lab-agent
Lightweight daemon running on every managed machine.
Responsibilities:
- Connect to labd via mTLS (agent certificate)
- Send heartbeats (status, load, disk, memory)
- Ship logs (journald → labd)
- Execute commands on demand (like `kubectl exec`)
- Run configuration modules (like `puppet agent -tv`)
- Report module run results
Tech: Standalone TypeScript binary (bun compiled), systemd service
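A heartbeat message could look like the minimal sketch below. The document only says that heartbeats carry status, load, disk, and memory; the `Heartbeat` field names, the 90% thresholds, and the "stale after three missed intervals" policy are assumptions made for illustration.

```typescript
// Hypothetical heartbeat payload; field names are illustrative, not a fixed wire format.
interface Heartbeat {
  hostname: string;
  sentAt: number; // Unix epoch milliseconds
  status: "ok" | "degraded";
  load1: number; // 1-minute load average
  diskUsedPct: number; // 0..100
  memUsedPct: number; // 0..100
}

// Assumed thresholds: disk or memory at or above 90% marks the server as degraded.
function buildHeartbeat(
  hostname: string,
  now: number,
  load1: number,
  diskUsedPct: number,
  memUsedPct: number,
): Heartbeat {
  const degraded = diskUsedPct >= 90 || memUsedPct >= 90;
  return { hostname, sentAt: now, status: degraded ? "degraded" : "ok", load1, diskUsedPct, memUsedPct };
}

// Assumed policy: labd treats an agent as stale once more than `maxMissed` intervals have elapsed.
function isStale(last: Heartbeat, now: number, intervalMs: number, maxMissed = 3): boolean {
  return now - last.sentAt > intervalMs * maxMissed;
}
```

With a 10-second interval, for example, `isStale` would report an agent as stale once more than 30 seconds have passed since its last heartbeat.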
3. lab CLI (extended)
Extends the existing lab CLI with platform management commands.
New commands:
# Server management
lab get servers # List all servers
lab get servers --env production # Filter by environment
lab get servers --cloud baremetal # Filter by cloud
lab get servers --label role=k3s-worker # Filter by label
lab describe server <name> # Detailed server info
lab exec <server> -- <command> # Remote command execution
lab logs <server> [-f] # Stream server logs
# Infrastructure as Code
lab apply -f <file.ts> # Execute Pulumi code via labd
lab plan -f <file.ts> # Dry-run Pulumi code
lab destroy -f <file.ts> # Tear down resources
# RBAC
lab get roles # List roles
lab get users # List users
lab create role <name> # Create role
lab bind role <role> --user <user> # Bind role to user
lab get permissions # List permissions
# Environment/Cloud management
lab get environments # List environments
lab get clouds # List clouds
lab create environment <name> --cloud <cloud>
# Module management
lab get modules # List available modules
lab apply module <name> --target <server> # Apply module to server
4. Certificate Authority
Built into labd. Issues and manages certificates for agents and users.
Flow:
1. Agent starts with a join token (one-time or reusable)
2. Agent generates CSR, sends to labd with token
3. labd validates token, signs certificate
4. Agent receives signed cert + CA cert
5. All future communication uses mTLS
For CLI users:
1. User runs `lab login` or `lab init`
2. labd issues a client certificate (or uses existing SSH keys)
3. CLI uses client cert for all API calls
Token types:
- One-time token — for individual bare-metal servers (generated during PXE provision)
- Reusable token — for autoscaling groups (AWS ASG instances use the same token)
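The difference between the two token types can be sketched as follows. This is an in-memory illustration only: `JoinTokenStore`, `issue`, and `redeem` are hypothetical names, and a real labd would presumably persist tokens in its database rather than in a `Map`.

```typescript
type TokenKind = "one-time" | "reusable";

interface JoinToken {
  value: string;
  kind: TokenKind;
  used: boolean;
}

// Hypothetical in-memory store, for illustration only.
class JoinTokenStore {
  private tokens = new Map<string, JoinToken>();

  issue(value: string, kind: TokenKind): void {
    this.tokens.set(value, { value, kind, used: false });
  }

  // Returns true if the CSR that accompanies this token may be signed.
  redeem(value: string): boolean {
    const token = this.tokens.get(value);
    if (!token) return false; // unknown token
    if (token.kind === "one-time") {
      if (token.used) return false; // already consumed by an earlier enrollment
      token.used = true;
    }
    return true; // reusable tokens stay valid for every instance in the group
  }
}
```

A one-time token generated during PXE provisioning succeeds exactly once, while a reusable token shared by an autoscaling group keeps succeeding until it is revoked by some mechanism not shown here.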
5. RBAC Model
Reuse mcpctl's RBAC patterns. Hierarchical permissions:
Cloud → Environment → Server → Action
Examples:
- baremetal:lab:*:exec — can exec on any lab server
- baremetal:lab:puppet:* — full access to puppet server
- aws:production:*:read — read-only on all AWS prod servers
- *:*:*:* — superadmin
Resources:
- servers, environments, clouds, modules, roles, users, pulumi-stacks
Actions:
- read, exec, apply, destroy, manage, admin
Whitelist/Blacklist:
- Roles can have `allow` and `deny` rules
- Deny takes precedence (like AWS IAM)
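The evaluation order described above (deny by default, explicit deny wins) can be sketched in a few lines. The `Rule` shape and the function names are assumptions; only the `cloud:environment:server:action` pattern and the `*` wildcard come from the examples in this section.

```typescript
interface Rule {
  effect: "allow" | "deny";
  pattern: string; // "cloud:environment:server:action"; "*" matches any single segment
}

// Segment-by-segment comparison of a pattern against a concrete target.
function matches(pattern: string, target: string): boolean {
  const p = pattern.split(":");
  const t = target.split(":");
  if (p.length !== 4 || t.length !== 4) return false;
  return p.every((segment, i) => segment === "*" || segment === t[i]);
}

// Deny by default; any matching deny rule overrides every matching allow rule.
function isAllowed(rules: Rule[], target: string): boolean {
  let allowed = false;
  for (const rule of rules) {
    if (!matches(rule.pattern, target)) continue;
    if (rule.effect === "deny") return false;
    allowed = true;
  }
  return allowed;
}
```

For instance, a role that allows `baremetal:lab:*:exec` but denies `baremetal:lab:puppet:*` could exec on `ser9` but not on `puppet`, and would have no access at all to AWS servers.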
6. Module System
Configuration modules define the desired state of a server.
Module structure:
modules/
  k3s-server/
    module.yaml        # Metadata: name, version, targets, deps
    src/
      index.ts         # Module entry point
      install.ts       # Installation logic
      configure.ts     # Configuration logic
      health.ts        # Health check
    tests/
      install.test.ts
  k3s-agent/
    module.yaml
    src/
      index.ts
  labd/
    module.yaml
    src/
      index.ts         # Deploy labd to k3s
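One way the agent's module runner might drive the files listed above is sketched here. The document names `install.ts`, `configure.ts`, and `health.ts` but not what they export, so the `LabModule` contract and the `runModule` helper are hypothetical.

```typescript
// Hypothetical contract for a module's index.ts; the actual exports are not specified.
interface LabModule {
  install(): Promise<void>;
  configure(): Promise<void>;
  health(): Promise<boolean>;
}

// Result reported back to labd after a run (shape assumed for illustration).
interface RunResult {
  module: string;
  healthy: boolean;
  error?: string;
}

// Runs install, then configure, then the health check; any thrown error ends the run.
async function runModule(name: string, mod: LabModule): Promise<RunResult> {
  try {
    await mod.install();
    await mod.configure();
    return { module: name, healthy: await mod.health() };
  } catch (err) {
    const message = err instanceof Error ? err.message : String(err);
    return { module: name, healthy: false, error: message };
  }
}
```

Returning a result object instead of throwing keeps the "report module run results" step simple: the agent can forward the `RunResult` to labd whether the run succeeded or not.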
module.yaml:
name: k3s-server
version: 0.1.0
description: Install and configure k3s server
targets:
  roles: [infra]
  labels:
    k3s: server
dependencies:
  - base-server
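A typed view of this manifest, together with one possible reading of `targets`, is sketched below. The interface mirrors the fields in the example; the matching semantics (role must be listed and every target label must match) are an assumption, since the document does not define how `roles` and `labels` combine.

```typescript
// Mirrors the module.yaml fields from the example; not an authoritative schema.
interface ModuleManifest {
  name: string;
  version: string;
  description: string;
  targets: { roles?: string[]; labels?: Record<string, string> };
  dependencies: string[];
}

// Hypothetical subset of what labd knows about a server.
interface ServerFacts {
  role: string;
  labels: Record<string, string>;
}

// Assumed rule: the server's role must be listed AND every target label must match.
function appliesTo(manifest: ModuleManifest, server: ServerFacts): boolean {
  const roles = manifest.targets.roles;
  const labels = manifest.targets.labels ?? {};
  if (roles && roles.indexOf(server.role) === -1) return false;
  return Object.keys(labels).every((key) => server.labels[key] === labels[key]);
}
```

Under this reading, the `k3s-server` module would apply to a server with `role=infra` and `k3s=server`, but not to a worker labeled `k3s=agent`.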
Module sources:
- Built-in modules (in this repo, e.g., k3s-server, labd)
- External modules (separate git repos, pulled by URL)
- Module registry (future — like Puppet Forge)
7. Cloud/Environment Model
Cloud: baremetal
└── Environment: lab
    ├── Server: puppet.ad.itaz.eu (role=infra, labels={k3s=server})
    ├── Server: ser9.ad.itaz.eu (role=worker, labels={k3s=agent})
    └── ...
Cloud: aws
├── Environment: production
│   ├── Server: i-abc123 (from ASG web-servers)
│   ├── Server: i-def456 (from ASG web-servers)
│   └── ...
└── Environment: staging
    └── ...
Each bastion creates an environment under the baremetal cloud. AWS autoscaling groups create environments under the aws cloud.
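The hierarchy maps naturally onto the filters of `lab get servers`. The following sketch assumes a flat `Server` record and a `ServerFilter` that corresponds to the `--cloud`, `--env`, and `--label` flags; both shapes are hypothetical and not the actual labd schema.

```typescript
// Hypothetical inventory record; each server belongs to one cloud and one environment.
interface Server {
  name: string;
  cloud: string;
  environment: string;
  labels: Record<string, string>;
}

interface ServerFilter {
  cloud?: string;
  env?: string;
  label?: string; // "key=value", as in `--label role=k3s-worker`
}

// Every filter that is set must match; unset filters are ignored.
function filterServers(servers: Server[], filter: ServerFilter): Server[] {
  return servers.filter((s) => {
    if (filter.cloud !== undefined && s.cloud !== filter.cloud) return false;
    if (filter.env !== undefined && s.environment !== filter.env) return false;
    if (filter.label !== undefined) {
      const eq = filter.label.indexOf("=");
      if (eq < 0) return false; // assumed: a malformed label selector matches nothing
      const key = filter.label.slice(0, eq);
      const value = filter.label.slice(eq + 1);
      if (s.labels[key] !== value) return false;
    }
    return true;
  });
}
```

Treating cloud and environment as plain fields keeps the query path identical for bare-metal and AWS servers.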
8. Pulumi Integration
Users submit Pulumi TypeScript code to labd for execution.
# Apply infrastructure code
lab apply -f infra/k3s-cluster.ts --env lab
# The file is sent to labd, which:
# 1. Checks RBAC (does user have apply permission for this env?)
# 2. Creates a Pulumi stack
# 3. Executes `pulumi up` in a sandboxed environment
# 4. Streams output back to CLI
# 5. Stores state in Pulumi backend (local or S3)
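The steps above could be wired together roughly as follows. This is a sketch under stated assumptions: `ApplyRequest`, `ApplyDeps`, `handleApply`, the permission string, and the stack naming scheme are all hypothetical, and the Pulumi run itself is hidden behind an injected function rather than calling any real Pulumi API.

```typescript
// Hypothetical request shape; the real labd API is not specified in this document.
interface ApplyRequest {
  user: string;
  cloud: string;
  env: string;
  source: string; // contents of the submitted .ts file
}

// Injected dependencies so the flow can be exercised without Pulumi or a database.
interface ApplyDeps {
  can(user: string, permission: string): boolean;
  runPulumiUp(stack: string, source: string, onOutput: (line: string) => void): Promise<void>;
}

async function handleApply(
  req: ApplyRequest,
  deps: ApplyDeps,
  onOutput: (line: string) => void,
): Promise<"applied" | "forbidden"> {
  // Step 1: RBAC check before any stack is created.
  if (!deps.can(req.user, `${req.cloud}:${req.env}:*:apply`)) return "forbidden";
  // Steps 2-5 (stack creation, sandboxed run, output streaming, state storage)
  // are delegated to the injected runner.
  const stack = `${req.cloud}-${req.env}`; // assumed naming scheme
  await deps.runPulumiUp(stack, req.source, onOutput);
  return "applied";
}
```

Keeping the RBAC check ahead of stack creation means a rejected request leaves no Pulumi state behind.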
Future AWS extension:
// infra/aws-web-servers.ts
import * as aws from "@pulumi/aws";

const asg = new aws.autoscaling.Group("web-servers", {
  maxSize: 10,
  minSize: 2,
  launchTemplate: { /* ... */ },
  // User data installs lab-agent with reusable join token
});
Project Structure
lab/
bastion/ # Existing — PXE provisioning
src/
shared/ # @lab/shared — types, constants, RBAC
labd/ # @lab/labd — master daemon
src/
main.ts
server.ts
ca/ # Certificate Authority
rbac/ # RBAC engine (reuse mcpctl patterns)
agents/ # Agent registry + WebSocket
pulumi/ # Pulumi executor
logs/ # Log aggregation
modules/ # Module registry
routes/ # REST API
agent/ # @lab/agent — agent daemon
src/
main.ts
connection.ts # mTLS WebSocket to labd
heartbeat.ts
executor.ts # Command execution
logs.ts # Log shipping
modules.ts # Module runner
cli/ # @lab/cli — extends existing CLI
src/
commands/
init/bastion/ # Existing bastion commands
provision/ # Existing provision commands
get/ # New: get servers/roles/users/etc
exec/ # New: remote execution
logs/ # New: log streaming
apply/ # New: pulumi apply
rbac/ # New: role management
modules/ # Built-in modules
k3s-server/ # Deploy k3s server
k3s-agent/ # Deploy k3s agent
labd/ # Deploy labd to k3s
lab-agent/ # Deploy lab-agent to servers
deploy/
k3s/ # Existing k3s manifests for bastion
labd/ # k3s manifests for labd
Implementation Phases
Phase 1: Foundation (current + next)
- Bastion (PXE provisioning) — DONE
- CLI structure (`lab init/provision`) — DONE
- Rename puppet to labmaster, reprovision
- Deploy k3s on labmaster
- Build labd skeleton (Fastify + Prisma)
- Certificate Authority (issue/sign certs)
- Agent skeleton (connect, heartbeat)
Phase 2: Core Platform
- RBAC engine (roles, permissions, ACLs)
- `lab get servers` with environment/cloud/label filters
- `lab exec` remote command execution
- `lab logs` streaming
- Agent auto-enrollment via PXE provision (join token in kickstart)
Phase 3: Infrastructure as Code
- Module system (define, apply, health check)
- k3s-server module (deploy k3s)
- labd module (deploy labd to k3s)
- Pulumi executor in labd
- `lab apply -f` command
Phase 4: Multi-Cloud
- AWS provider (Pulumi-based)
- Reusable join tokens for autoscaling groups
- Cloud/environment model
- Auto-discovery of cloud instances
Key Design Decisions
- Pulumi over Puppet — TypeScript-native, same language for IaC and platform code
- mTLS over SSH — proper PKI, scalable, no key management per-server
- Agents connect to master (not master pushing to agents) — works through NATs, firewalls
- RBAC from day one — security-first, deny by default
- Module system inspired by Puppet — declarative, testable, versionable
- Multi-cloud extensible — cloud is just a label, provider is pluggable
- Reuse mcpctl patterns — Prisma DB, Fastify routes, CLI structure, RBAC model