lab/bastion/DESIGN-LAB-PLATFORM.md
Michal dbbdf5f971 docs: lab platform design — labd, agent, RBAC, multi-cloud, testing strategy
Comprehensive design document covering:
- labd master daemon with CA, RBAC, Pulumi executor
- lab-agent with mTLS enrollment, heartbeat, log shipping
- Module system (built-in + external repos)
- Cloud/environment model (baremetal + AWS)
- Ephemeral test environments (containers, VMs, cloud)
- Security test patterns for RBAC
- Health gates for deployment promotion
- Database strategy: PostgreSQL now, CockroachDB later
- Networking: Tailscale mesh + Cilium CNI

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-17 23:46:29 +00:00

# Lab Platform — Design Document
## Vision
A unified infrastructure management platform that replaces Puppet with a Pulumi-based system. It manages bare-metal servers, cloud VMs, and k3s clusters through a single CLI and API.
## Architecture Overview
```
┌─────────────────────────────────────────────────────────────────┐
│  Developer Workstation (thebeast)                               │
│                                                                 │
│  lab CLI                                                        │
│  ├── lab init bastion standalone start (PXE provisioning)       │
│  ├── lab provision install/reprovision (bare-metal)             │
│  ├── lab get servers --env production  (query)                  │
│  ├── lab exec <server> -- <command>    (remote execution)       │
│  ├── lab logs <server>                 (log streaming)          │
│  ├── lab apply -f infra.ts             (pulumi via labd)        │
│  └── lab get roles/users/permissions   (RBAC management)        │
│                                                                 │
│  Connects to: labd via mTLS                                     │
└─────────────────────┬───────────────────────────────────────────┘
                      │ mTLS (client cert)
┌─────────────────────────────────────────────────────────────────┐
│  labmaster.ad.itaz.eu (infra node, k3s single-node)             │
│                                                                 │
│  ┌──────────────────────────────────────────────────────┐       │
│  │  labd (master daemon)                                │       │
│  │  ├── Certificate Authority (issues agent certs)      │       │
│  │  ├── RBAC Engine (roles, permissions, ACLs)          │       │
│  │  ├── Agent Registry (connected agents, heartbeats)   │       │
│  │  ├── Pulumi Executor (runs IaC on behalf of users)   │       │
│  │  ├── Log Aggregator (receives agent logs)            │       │
│  │  ├── Module Registry (configuration modules)         │       │
│  │  └── REST API + WebSocket (agent connections)        │       │
│  └──────────────────────────────────────────────────────┘       │
│                                                                 │
│  ┌──────────────────────────────────────────────────────┐       │
│  │  bastion (PXE provisioning)                          │       │
│  │  Running as k3s pod with hostNetwork                 │       │
│  └──────────────────────────────────────────────────────┘       │
└──────────┬──────────────────────────────────────────────────────┘
           │ mTLS (agent certs)
┌──────────────────────┐  ┌──────────────────────┐  ┌────────────┐
│  ser9.ad.itaz.eu     │  │  worker-2.ad.itaz.eu │  │  AWS EC2   │
│  (bare-metal worker) │  │  (bare-metal worker) │  │  instances │
│                      │  │                      │  │            │
│  lab-agent           │  │  lab-agent           │  │  lab-agent │
│  ├── heartbeat       │  │  ├── heartbeat       │  │  ├── ...   │
│  ├── log shipping    │  │  ├── log shipping    │  │  └── ...   │
│  ├── exec handler    │  │  ├── exec handler    │  │            │
│  └── module runner   │  │  └── module runner   │  │            │
└──────────────────────┘  └──────────────────────┘  └────────────┘
```
## Components
### 1. labd (Master Daemon)
The central control plane. Runs on labmaster.ad.itaz.eu as a k3s pod.
**Responsibilities:**
- Certificate Authority — signs agent certificates, manages trust chain
- Agent Registry — tracks connected agents, heartbeats, status
- RBAC — roles, permissions, ACLs per user/group/environment/cloud
- Pulumi Executor — runs Pulumi TypeScript code submitted by users
- Log Aggregator — receives and stores logs from agents
- Module Registry — stores and distributes configuration modules
- REST API — for CLI and external integrations
- WebSocket — persistent agent connections for real-time commands
**Tech:** Fastify, PostgreSQL (via Prisma, reuse mcpctl patterns), WebSocket
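To make the Agent Registry responsibility more concrete, here is a minimal in-memory sketch of heartbeat tracking. The `AgentRegistry` class, the `Heartbeat` fields, and the 30-second timeout are assumptions made for illustration; the design does not specify them, and a real registry would persist to PostgreSQL.

```typescript
// Hypothetical sketch: class name, fields, and timeout are assumptions.
type AgentStatus = "online" | "stale";

interface Heartbeat {
  agentId: string; // e.g. the common name from the agent certificate
  load: number; // 1-minute load average
  diskFreeBytes: number;
  memFreeBytes: number;
}

interface AgentRecord {
  lastSeenMs: number;
  last: Heartbeat;
}

class AgentRegistry {
  private readonly agents = new Map<string, AgentRecord>();
  private readonly timeoutMs: number;

  constructor(timeoutMs: number = 30000) {
    this.timeoutMs = timeoutMs;
  }

  // Record a heartbeat. `nowMs` is injectable so the logic can be tested.
  recordHeartbeat(hb: Heartbeat, nowMs: number = Date.now()): void {
    this.agents.set(hb.agentId, { lastSeenMs: nowMs, last: hb });
  }

  // An agent is "stale" once no heartbeat has arrived within the timeout.
  status(agentId: string, nowMs: number = Date.now()): AgentStatus | undefined {
    const record = this.agents.get(agentId);
    if (record === undefined) return undefined;
    return nowMs - record.lastSeenMs <= this.timeoutMs ? "online" : "stale";
  }

  listOnline(nowMs: number = Date.now()): string[] {
    return Array.from(this.agents.keys()).filter(
      (id) => this.status(id, nowMs) === "online",
    );
  }
}
```

A query such as `lab get servers` could then report `status()` next to each server, with unknown agents shown as never enrolled.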
### 2. lab-agent
Lightweight daemon running on every managed machine.
**Responsibilities:**
- Connect to labd via mTLS (agent certificate)
- Send heartbeats (status, load, disk, memory)
- Ship logs (journald → labd)
- Execute commands on demand (like `kubectl exec`)
- Run configuration modules (like `puppet agent -tv`)
- Report module run results
**Tech:** Standalone TypeScript binary (compiled with Bun), systemd service
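The agent's traffic to labd (heartbeats, log lines, exec results) could be modelled as a small set of typed messages. The sketch below is one possible shape: the message names, the field names, and the choice of JSON text frames are all assumptions, not a protocol defined by this document.

```typescript
// Hypothetical wire format: message names and fields are assumptions.
type AgentMessage =
  | { type: "heartbeat"; load: number; diskFreeBytes: number; memFreeBytes: number }
  | { type: "log"; unit: string; line: string }
  | { type: "exec-result"; requestId: string; exitCode: number; output: string };

// Parse and validate one JSON text frame received from an agent.
function parseAgentMessage(raw: string): AgentMessage {
  const data: unknown = JSON.parse(raw);
  if (typeof data !== "object" || data === null) {
    throw new Error("message must be a JSON object");
  }
  const msg = data as Record<string, unknown>;
  const { load, diskFreeBytes, memFreeBytes, unit, line, requestId, exitCode, output } = msg;

  switch (msg.type) {
    case "heartbeat":
      if (
        typeof load !== "number" ||
        typeof diskFreeBytes !== "number" ||
        typeof memFreeBytes !== "number"
      ) {
        throw new Error("invalid heartbeat");
      }
      return { type: "heartbeat", load, diskFreeBytes, memFreeBytes };
    case "log":
      if (typeof unit !== "string" || typeof line !== "string") {
        throw new Error("invalid log message");
      }
      return { type: "log", unit, line };
    case "exec-result":
      if (
        typeof requestId !== "string" ||
        typeof exitCode !== "number" ||
        typeof output !== "string"
      ) {
        throw new Error("invalid exec result");
      }
      return { type: "exec-result", requestId, exitCode, output };
    default:
      throw new Error(`unknown message type: ${String(msg.type)}`);
  }
}
```

Validating at the boundary keeps the rest of labd working with the narrowed `AgentMessage` union instead of untyped JSON.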
### 3. lab CLI (extended)
Extends the existing `lab` CLI with platform management commands.
**New commands:**
```
# Server management
lab get servers # List all servers
lab get servers --env production # Filter by environment
lab get servers --cloud baremetal # Filter by cloud
lab get servers --label role=k3s-worker # Filter by label
lab describe server <name> # Detailed server info
lab exec <server> -- <command> # Remote command execution
lab logs <server> [-f] # Stream server logs
# Infrastructure as Code
lab apply -f <file.ts> # Execute Pulumi code via labd
lab plan -f <file.ts> # Dry-run Pulumi code
lab destroy -f <file.ts> # Tear down resources
# RBAC
lab get roles # List roles
lab get users # List users
lab create role <name> # Create role
lab bind role <role> --user <user> # Bind role to user
lab get permissions # List permissions
# Environment/Cloud management
lab get environments # List environments
lab get clouds # List clouds
lab create environment <name> --cloud <cloud>
# Module management
lab get modules # List available modules
lab apply module <name> --target <server> # Apply module to server
```
### 4. Certificate Authority
Built into labd. Issues and manages certificates for agents and users.
**Flow:**
```
1. Agent starts with a join token (one-time or reusable)
2. Agent generates CSR, sends to labd with token
3. labd validates token, signs certificate
4. Agent receives signed cert + CA cert
5. All future communication uses mTLS
For CLI users:
1. User runs `lab login` or `lab init`
2. labd issues a client certificate (or uses existing SSH keys)
3. CLI uses client cert for all API calls
```
**Token types:**
- **One-time token** — for individual bare-metal servers (generated during PXE provision)
- **Reusable token** — for autoscaling groups (AWS ASG instances use the same token)
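The difference between the two token types comes down to whether a successful enrollment consumes the token. A minimal sketch, assuming a hypothetical in-memory `JoinTokenStore` (a real implementation would likely store hashed random tokens with an expiry in the database):

```typescript
// Hypothetical sketch: the store and its method names are assumptions.
type TokenKind = "one-time" | "reusable";

interface JoinToken {
  kind: TokenKind;
  used: boolean;
}

class JoinTokenStore {
  private readonly tokens = new Map<string, JoinToken>();

  issue(value: string, kind: TokenKind): void {
    this.tokens.set(value, { kind, used: false });
  }

  // Returns true if the token may be used to enroll an agent right now.
  // A one-time token is consumed on first success; a reusable one is not.
  redeem(value: string): boolean {
    const token = this.tokens.get(value);
    if (token === undefined) return false;
    if (token.kind === "one-time") {
      if (token.used) return false;
      token.used = true;
    }
    return true;
  }
}
```

In this sketch labd would call `redeem()` in step 3 of the flow, before signing the CSR.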
### 5. RBAC Model
Reuse mcpctl's RBAC patterns. Hierarchical permissions:
```
Cloud → Environment → Server → Action
Examples:
- baremetal:lab:*:exec — can exec on any lab server
- baremetal:lab:puppet:* — full access to puppet server
- aws:production:*:read — read-only on all AWS prod servers
- *:*:*:* — superadmin
```
**Resources:**
- servers, environments, clouds, modules, roles, users, pulumi-stacks
**Actions:**
- read, exec, apply, destroy, manage, admin
**Whitelist/Blacklist:**
- Roles can have `allow` and `deny` rules
- Deny takes precedence (like AWS IAM)
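The `Cloud:Environment:Server:Action` patterns and the deny-over-allow rule can be sketched as a small evaluation function. The `Role` shape and function names are assumptions for illustration; mcpctl's actual RBAC code may look different.

```typescript
// Hypothetical sketch of the matching rules; names are assumptions.
interface Role {
  allow: string[]; // patterns such as "baremetal:lab:*:exec"
  deny: string[];
}

// A pattern matches when every segment is "*" or equals the request segment.
function matchesPattern(pattern: string, request: string): boolean {
  const p = pattern.split(":");
  const r = request.split(":");
  if (p.length !== 4 || r.length !== 4) return false;
  return p.every((segment, i) => segment === "*" || segment === r[i]);
}

// Deny by default; any matching deny rule overrides every allow rule.
function isAllowed(roles: Role[], request: string): boolean {
  const denied = roles.some((role) =>
    role.deny.some((pattern) => matchesPattern(pattern, request)),
  );
  if (denied) return false;
  return roles.some((role) =>
    role.allow.some((pattern) => matchesPattern(pattern, request)),
  );
}
```

For example, a role with `allow: ["baremetal:lab:*:exec"]` and `deny: ["baremetal:lab:puppet:*"]` permits `baremetal:lab:ser9:exec` but rejects `baremetal:lab:puppet:exec`.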
### 6. Module System
Configuration modules define the desired state of a server.
**Module structure:**
```
modules/
  k3s-server/
    module.yaml        # Metadata: name, version, targets, deps
    src/
      index.ts         # Module entry point
      install.ts       # Installation logic
      configure.ts     # Configuration logic
      health.ts        # Health check
    tests/
      install.test.ts
  k3s-agent/
    module.yaml
    src/
      index.ts
  labd/
    module.yaml
    src/
      index.ts         # Deploy labd to k3s
```
**module.yaml:**
```yaml
name: k3s-server
version: 0.1.0
description: Install and configure k3s server
targets:
  roles: [infra]
  labels:
    k3s: server
dependencies:
  - base-server
```
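How `targets` selects servers is not spelled out here. One plausible reading, sketched below, is that a module applies to a server when the server's role is listed in `roles` and every entry in `labels` matches. Both that AND semantics and the type names are assumptions.

```typescript
// Hypothetical sketch: the matching semantics are an assumption.
interface ModuleTargets {
  roles?: string[];
  labels?: Record<string, string>;
}

interface ServerInfo {
  role: string;
  labels: Record<string, string>;
}

// A missing `roles` or `labels` key places no constraint on the server.
function moduleTargetsServer(targets: ModuleTargets, server: ServerInfo): boolean {
  const roleOk = targets.roles === undefined || targets.roles.includes(server.role);
  const labelsOk = Object.entries(targets.labels ?? {}).every(
    ([key, value]) => server.labels[key] === value,
  );
  return roleOk && labelsOk;
}
```

With the `k3s-server` metadata above, a server with `role=infra` and `k3s=server` would match, while one with `role=worker` and `k3s=agent` would not.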
**Module sources:**
- Built-in modules (in this repo, e.g., k3s-server, labd)
- External modules (separate git repos, pulled by URL)
- Module registry (future — like Puppet Forge)
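Wherever modules come from, the `dependencies` list in `module.yaml` implies they must be applied in dependency order. A minimal sketch using a depth-first topological sort; the `DependencyGraph` type and `applyOrder` function are hypothetical names, and the document does not say how labd resolves dependencies.

```typescript
// Hypothetical sketch: maps each module name to its `dependencies` list.
type DependencyGraph = Record<string, string[]>;

// Returns the modules to apply, dependencies first, ending with `root`.
function applyOrder(graph: DependencyGraph, root: string): string[] {
  const order: string[] = [];
  const done = new Set<string>();
  const inProgress = new Set<string>();

  const visit = (name: string): void => {
    if (done.has(name)) return;
    if (inProgress.has(name)) throw new Error(`dependency cycle at ${name}`);
    const deps = graph[name];
    if (deps === undefined) throw new Error(`unknown module ${name}`);
    inProgress.add(name);
    for (const dep of deps) visit(dep);
    inProgress.delete(name);
    done.add(name);
    order.push(name);
  };

  visit(root);
  return order;
}
```

Given `k3s-server` depending on `base-server`, calling `applyOrder(graph, "k3s-server")` yields `base-server` followed by `k3s-server`; a cycle or a missing module raises an error instead of applying anything.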
### 7. Cloud/Environment Model
```
Cloud: baremetal
  └── Environment: lab
      ├── Server: puppet.ad.itaz.eu (role=infra, labels={k3s=server})
      ├── Server: ser9.ad.itaz.eu (role=worker, labels={k3s=agent})
      └── ...
Cloud: aws
  ├── Environment: production
  │   ├── Server: i-abc123 (from ASG web-servers)
  │   ├── Server: i-def456 (from ASG web-servers)
  │   └── ...
  └── Environment: staging
      └── ...
```
Each bastion creates an environment under the `baremetal` cloud. AWS autoscaling groups create environments under the `aws` cloud.
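The hierarchy above maps naturally onto the `--cloud`, `--env`, and `--label` filters of `lab get servers`. The sketch below shows one way the filtering could work; the `Server` and `ServerFilter` shapes are assumptions, not the platform's actual schema.

```typescript
// Hypothetical sketch: field names are assumptions for illustration.
interface Server {
  name: string;
  cloud: string;
  environment: string;
  labels: Record<string, string>;
}

interface ServerFilter {
  cloud?: string;
  env?: string;
  label?: string; // "key=value", as in --label role=k3s-worker
}

// Every filter that is set must match; unset filters are ignored.
function filterServers(servers: Server[], filter: ServerFilter): Server[] {
  const eq = filter.label ? filter.label.indexOf("=") : -1;
  if (filter.label !== undefined && eq < 1) {
    throw new Error(`label filter must look like key=value: ${filter.label}`);
  }
  return servers.filter((server) => {
    if (filter.cloud !== undefined && server.cloud !== filter.cloud) return false;
    if (filter.env !== undefined && server.environment !== filter.env) return false;
    if (filter.label !== undefined) {
      const key = filter.label.slice(0, eq);
      const value = filter.label.slice(eq + 1);
      if (server.labels[key] !== value) return false;
    }
    return true;
  });
}
```

In practice labd would probably push these conditions into a database query, but the matching rules would stay the same.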
### 8. Pulumi Integration
Users submit Pulumi TypeScript code to labd for execution.
```bash
# Apply infrastructure code
lab apply -f infra/k3s-cluster.ts --env lab
# The file is sent to labd, which:
# 1. Checks RBAC (does user have apply permission for this env?)
# 2. Creates a Pulumi stack
# 3. Executes `pulumi up` in a sandboxed environment
# 4. Streams output back to CLI
# 5. Stores state in Pulumi backend (local or S3)
```
**Future AWS extension:**
```typescript
// infra/aws-web-servers.ts
import * as aws from "@pulumi/aws";

const asg = new aws.autoscaling.Group("web-servers", {
  maxSize: 10,
  minSize: 2,
  launchTemplate: { /* ... */ },
  // User data installs lab-agent with reusable join token
});
```
## Project Structure
```
lab/
bastion/ # Existing — PXE provisioning
src/
shared/ # @lab/shared — types, constants, RBAC
labd/ # @lab/labd — master daemon
src/
main.ts
server.ts
ca/ # Certificate Authority
rbac/ # RBAC engine (reuse mcpctl patterns)
agents/ # Agent registry + WebSocket
pulumi/ # Pulumi executor
logs/ # Log aggregation
modules/ # Module registry
routes/ # REST API
agent/ # @lab/agent — agent daemon
src/
main.ts
connection.ts # mTLS WebSocket to labd
heartbeat.ts
executor.ts # Command execution
logs.ts # Log shipping
modules.ts # Module runner
cli/ # @lab/cli — extends existing CLI
src/
commands/
init/bastion/ # Existing bastion commands
provision/ # Existing provision commands
get/ # New: get servers/roles/users/etc
exec/ # New: remote execution
logs/ # New: log streaming
apply/ # New: pulumi apply
rbac/ # New: role management
modules/ # Built-in modules
k3s-server/ # Deploy k3s server
k3s-agent/ # Deploy k3s agent
labd/ # Deploy labd to k3s
lab-agent/ # Deploy lab-agent to servers
deploy/
k3s/ # Existing k3s manifests for bastion
labd/ # k3s manifests for labd
```
## Implementation Phases
### Phase 1: Foundation (current + next)
- [x] Bastion (PXE provisioning) — DONE
- [x] CLI structure (`lab init/provision`) — DONE
- [ ] Rename puppet to labmaster, reprovision
- [ ] Deploy k3s on labmaster
- [ ] Build labd skeleton (Fastify + Prisma)
- [ ] Certificate Authority (issue/sign certs)
- [ ] Agent skeleton (connect, heartbeat)
### Phase 2: Core Platform
- [ ] RBAC engine (roles, permissions, ACLs)
- [ ] `lab get servers` with environment/cloud/label filters
- [ ] `lab exec` remote command execution
- [ ] `lab logs` streaming
- [ ] Agent auto-enrollment via PXE provision (join token in kickstart)
### Phase 3: Infrastructure as Code
- [ ] Module system (define, apply, health check)
- [ ] k3s-server module (deploy k3s)
- [ ] labd module (deploy labd to k3s)
- [ ] Pulumi executor in labd
- [ ] `lab apply -f` command
### Phase 4: Multi-Cloud
- [ ] AWS provider (Pulumi-based)
- [ ] Reusable join tokens for autoscaling groups
- [ ] Cloud/environment model
- [ ] Auto-discovery of cloud instances
## Key Design Decisions
1. **Pulumi over Puppet** — TypeScript-native, same language for IaC and platform code
2. **mTLS over SSH** — proper PKI, scalable, no key management per-server
3. **Agents connect to master** (not master pushing to agents) — works through NATs, firewalls
4. **RBAC from day one** — security-first, deny by default
5. **Module system inspired by Puppet** — declarative, testable, versionable
6. **Multi-cloud extensible** — cloud is just a label, provider is pluggable
7. **Reuse mcpctl patterns** — Prisma DB, Fastify routes, CLI structure, RBAC model