docs: lab platform design — labd, agent, RBAC, multi-cloud, testing strategy

Comprehensive design document covering:
- labd master daemon with CA, RBAC, Pulumi executor
- lab-agent with mTLS enrollment, heartbeat, log shipping
- Module system (built-in + external repos)
- Cloud/environment model (baremetal + AWS)
- Ephemeral test environments (containers, VMs, cloud)
- Security test patterns for RBAC
- Health gates for deployment promotion
- Database strategy: PostgreSQL now, CockroachDB later
- Networking: Tailscale mesh + Cilium CNI

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
bastion/DESIGN-LAB-PLATFORM.md
@@ -0,0 +1,355 @@
# Lab Platform — Design Document

## Vision

A unified infrastructure management platform that replaces Puppet with a modern, Pulumi-based system. It manages bare-metal servers, cloud VMs, and k3s clusters through a single CLI and API.

## Architecture Overview

```
┌─────────────────────────────────────────────────────────────────┐
│  Developer Workstation (thebeast)                               │
│                                                                 │
│  lab CLI                                                        │
│  ├── lab init bastion standalone start  (PXE provisioning)      │
│  ├── lab provision install/reprovision  (bare-metal)            │
│  ├── lab get servers --env production   (query)                 │
│  ├── lab exec <server> -- <command>     (remote execution)      │
│  ├── lab logs <server>                  (log streaming)         │
│  ├── lab apply -f infra.ts              (pulumi via labd)       │
│  └── lab get roles/users/permissions    (RBAC management)       │
│                                                                 │
│  Connects to: labd via mTLS                                     │
└─────────────────────┬───────────────────────────────────────────┘
                      │ mTLS (client cert)
                      ▼
┌─────────────────────────────────────────────────────────────────┐
│  labmaster.ad.itaz.eu (infra node, k3s single-node)             │
│                                                                 │
│ ┌─────────────────────────────────────────────────────────────┐ │
│ │  labd (master daemon)                                       │ │
│ │  ├── Certificate Authority (issues agent certs)             │ │
│ │  ├── RBAC Engine           (roles, permissions, ACLs)       │ │
│ │  ├── Agent Registry        (connected agents, heartbeats)   │ │
│ │  ├── Pulumi Executor       (runs IaC on behalf of users)    │ │
│ │  ├── Log Aggregator        (receives agent logs)            │ │
│ │  ├── Module Registry       (configuration modules)          │ │
│ │  └── REST API + WebSocket  (agent connections)              │ │
│ └─────────────────────────────────────────────────────────────┘ │
│                                                                 │
│ ┌─────────────────────────────────────────────────────────────┐ │
│ │  bastion (PXE provisioning)                                 │ │
│ │  Running as k3s pod with hostNetwork                        │ │
│ └─────────────────────────────────────────────────────────────┘ │
└──────────┬──────────────────────────────────────────────────────┘
           │ mTLS (agent certs)
           ▼
┌──────────────────────┐ ┌──────────────────────┐ ┌────────────┐
│ ser9.ad.itaz.eu      │ │ worker-2.ad.itaz.eu  │ │ AWS EC2    │
│ (bare-metal worker)  │ │ (bare-metal worker)  │ │ instances  │
│                      │ │                      │ │            │
│ lab-agent            │ │ lab-agent            │ │ lab-agent  │
│ ├── heartbeat        │ │ ├── heartbeat        │ │ ├── ...    │
│ ├── log shipping     │ │ ├── log shipping     │ │ └── ...    │
│ ├── exec handler     │ │ ├── exec handler     │ │            │
│ └── module runner    │ │ └── module runner    │ │            │
└──────────────────────┘ └──────────────────────┘ └────────────┘
```

## Components

### 1. labd (Master Daemon)

The central control plane. Runs on labmaster.ad.itaz.eu as a k3s pod.

**Responsibilities:**
- Certificate Authority — signs agent certificates, manages trust chain
- Agent Registry — tracks connected agents, heartbeats, status
- RBAC — roles, permissions, ACLs per user/group/environment/cloud
- Pulumi Executor — runs Pulumi TypeScript code submitted by users
- Log Aggregator — receives and stores logs from agents
- Module Registry — stores and distributes configuration modules
- REST API — for CLI and external integrations
- WebSocket — persistent agent connections for real-time commands

**Tech:** Fastify, PostgreSQL (via Prisma, reuse mcpctl patterns), WebSocket
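
As an illustration of the Agent Registry responsibility, the following is a minimal in-memory sketch of tracking heartbeats and deriving an online/offline status. The class name `AgentRegistry`, its methods, and the 30-second timeout are assumptions made for this example; the design does not specify them, and a real registry would presumably persist to PostgreSQL.

```typescript
// Hypothetical sketch: names and the default timeout are illustrative only.
type AgentStatus = "online" | "offline";

interface AgentRecord {
  hostname: string;
  lastHeartbeatMs: number; // epoch milliseconds of the most recent heartbeat
}

class AgentRegistry {
  private readonly agents = new Map<string, AgentRecord>();
  private readonly timeoutMs: number;

  constructor(timeoutMs: number = 30000) {
    this.timeoutMs = timeoutMs;
  }

  // Record a heartbeat; the agent is registered on first contact.
  heartbeat(hostname: string, nowMs: number): void {
    this.agents.set(hostname, { hostname, lastHeartbeatMs: nowMs });
  }

  // An agent counts as online while its last heartbeat is within the timeout.
  status(hostname: string, nowMs: number): AgentStatus | undefined {
    const agent = this.agents.get(hostname);
    if (agent === undefined) return undefined;
    return nowMs - agent.lastHeartbeatMs <= this.timeoutMs ? "online" : "offline";
  }

  // Hostnames of all agents currently considered online.
  online(nowMs: number): string[] {
    return Array.from(this.agents.values())
      .filter((a) => nowMs - a.lastHeartbeatMs <= this.timeoutMs)
      .map((a) => a.hostname);
  }
}
```

Passing the current time as an argument, rather than reading the clock inside the class, keeps the staleness logic easy to unit test.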

### 2. lab-agent

Lightweight daemon running on every managed machine.

**Responsibilities:**
- Connect to labd via mTLS (agent certificate)
- Send heartbeats (status, load, disk, memory)
- Ship logs (journald → labd)
- Execute commands on demand (like `kubectl exec`)
- Run configuration modules (like `puppet agent -tv`)
- Report module run results

**Tech:** Standalone TypeScript binary (bun compiled), systemd service
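
The agent responsibilities above imply a small set of message kinds flowing from agent to labd. The sketch below models them as a discriminated union with a validating parser. The message shapes and field names (`diskFreeBytes`, `requestId`, and so on) are hypothetical; the design does not define a wire format.

```typescript
// Hypothetical wire format: the design does not specify message shapes.
type AgentMessage =
  | { type: "heartbeat"; load: number; diskFreeBytes: number; memFreeBytes: number }
  | { type: "log"; unit: string; line: string }
  | { type: "exec-result"; requestId: string; exitCode: number };

// Parse and validate one JSON frame; returns null for anything malformed.
function parseAgentMessage(raw: string): AgentMessage | null {
  let data: unknown;
  try {
    data = JSON.parse(raw);
  } catch {
    return null;
  }
  if (typeof data !== "object" || data === null) return null;

  const { type, load, diskFreeBytes, memFreeBytes, unit, line, requestId, exitCode } =
    data as Record<string, unknown>;

  if (
    type === "heartbeat" &&
    typeof load === "number" &&
    typeof diskFreeBytes === "number" &&
    typeof memFreeBytes === "number"
  ) {
    return { type: "heartbeat", load, diskFreeBytes, memFreeBytes };
  }
  if (type === "log" && typeof unit === "string" && typeof line === "string") {
    return { type: "log", unit, line };
  }
  if (type === "exec-result" && typeof requestId === "string" && typeof exitCode === "number") {
    return { type: "exec-result", requestId, exitCode };
  }
  return null; // unknown message type or missing fields
}
```

Validating at the boundary means the rest of labd can switch on `type` and rely on the compiler to check that every message kind is handled.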

### 3. lab CLI (extended)

Extends the existing `lab` CLI with platform management commands.

**New commands:**
```
# Server management
lab get servers                            # List all servers
lab get servers --env production           # Filter by environment
lab get servers --cloud baremetal          # Filter by cloud
lab get servers --label role=k3s-worker    # Filter by label
lab describe server <name>                 # Detailed server info
lab exec <server> -- <command>             # Remote command execution
lab logs <server> [-f]                     # Stream server logs

# Infrastructure as Code
lab apply -f <file.ts>                     # Execute Pulumi code via labd
lab plan -f <file.ts>                      # Dry-run Pulumi code
lab destroy -f <file.ts>                   # Tear down resources

# RBAC
lab get roles                              # List roles
lab get users                              # List users
lab create role <name>                     # Create role
lab bind role <role> --user <user>         # Bind role to user
lab get permissions                        # List permissions

# Environment/Cloud management
lab get environments                       # List environments
lab get clouds                             # List clouds
lab create environment <name> --cloud <cloud>

# Module management
lab get modules                            # List available modules
lab apply module <name> --target <server>  # Apply module to server
```
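
To make the `--env`, `--cloud`, and `--label` filters on `lab get servers` concrete, here is a rough sketch of how they could be evaluated against a server list. The `Server` and `ServerFilter` types and both function names are assumptions for illustration, and the sketch supports only a single `key=value` label selector.

```typescript
// Hypothetical types: the design does not define the server record shape.
interface Server {
  name: string;
  cloud: string;
  env: string;
  labels: Record<string, string>;
}

interface ServerFilter {
  env?: string;
  cloud?: string;
  label?: string; // a single "key=value" selector, e.g. "role=k3s-worker"
}

// Split "key=value" into its parts; rejects selectors without a key or "=".
function parseLabelSelector(selector: string): [string, string] {
  const idx = selector.indexOf("=");
  if (idx <= 0) throw new Error(`invalid label selector: ${selector}`);
  return [selector.slice(0, idx), selector.slice(idx + 1)];
}

// Every filter that is set must match; unset filters match everything.
function filterServers(servers: Server[], filter: ServerFilter): Server[] {
  const label = filter.label !== undefined ? parseLabelSelector(filter.label) : undefined;
  return servers.filter(
    (s) =>
      (filter.env === undefined || s.env === filter.env) &&
      (filter.cloud === undefined || s.cloud === filter.cloud) &&
      (label === undefined || s.labels[label[0]] === label[1]),
  );
}
```

In practice this filtering would more likely happen in labd as a database query; the in-memory version only shows the intended semantics.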

### 4. Certificate Authority

Built into labd. Issues and manages certificates for agents and users.

**Flow:**
```
1. Agent starts with a join token (one-time or reusable)
2. Agent generates CSR, sends to labd with token
3. labd validates token, signs certificate
4. Agent receives signed cert + CA cert
5. All future communication uses mTLS

For CLI users:
1. User runs `lab login` or `lab init`
2. labd issues a client certificate (or uses existing SSH keys)
3. CLI uses client cert for all API calls
```

**Token types:**
- **One-time token** — for individual bare-metal servers (generated during PXE provision)
- **Reusable token** — for autoscaling groups (AWS ASG instances use the same token)
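
The difference between the two token types comes down to what happens on redemption. The following is a minimal sketch of that check (step 3 of the flow), with a hypothetical `JoinTokenStore` class held in memory. Token generation, expiry, and the actual CSR signing are deliberately left out, and the token strings in any usage are placeholders.

```typescript
// Hypothetical sketch: the class and method names are not part of the design.
type TokenKind = "one-time" | "reusable";

class JoinTokenStore {
  private readonly tokens = new Map<string, TokenKind>();

  // Register a token value; a real CA would generate a random secret itself.
  issue(value: string, kind: TokenKind): void {
    this.tokens.set(value, kind);
  }

  // Returns true if enrollment may proceed. One-time tokens are consumed
  // on first use; reusable tokens stay valid for later enrollments.
  redeem(value: string): boolean {
    const kind = this.tokens.get(value);
    if (kind === undefined) return false;
    if (kind === "one-time") this.tokens.delete(value);
    return true;
  }
}
```

Because a reusable token never expires in this sketch, a production version would likely want an expiry time or a revocation path for tokens baked into autoscaling-group user data.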

### 5. RBAC Model

Reuse mcpctl's RBAC patterns. Hierarchical permissions:

```
Cloud → Environment → Server → Action

Examples:
- baremetal:lab:*:exec — can exec on any lab server
- baremetal:lab:puppet:* — full access to puppet server
- aws:production:*:read — read-only on all AWS prod servers
- *:*:*:* — superadmin
```

**Resources:**
- servers, environments, clouds, modules, roles, users, pulumi-stacks

**Actions:**
- read, exec, apply, destroy, manage, admin

**Whitelist/Blacklist:**
- Roles can have `allow` and `deny` rules
- Deny takes precedence (like AWS IAM)
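
The permission strings and the deny-over-allow rule can be sketched as a small evaluator. This is an assumption-laden illustration, not mcpctl's actual implementation: the `Rule` type and function names are invented here, `*` is treated as matching exactly one whole segment, and a request with no matching rule is denied by default.

```typescript
// Hypothetical evaluator for "cloud:environment:server:action" permissions.
type Effect = "allow" | "deny";

interface Rule {
  effect: Effect;
  pattern: string; // e.g. "baremetal:lab:*:exec"
}

// A pattern matches when every segment is "*" or equals the target segment.
function matchesPattern(pattern: string, target: string): boolean {
  const p = pattern.split(":");
  const t = target.split(":");
  if (p.length !== 4 || t.length !== 4) return false;
  return p.every((segment, i) => segment === "*" || segment === t[i]);
}

// Deny takes precedence over allow; no matching rule means denied.
function isAllowed(rules: Rule[], target: string): boolean {
  const hits = rules.filter((r) => matchesPattern(r.pattern, target));
  if (hits.some((r) => r.effect === "deny")) return false;
  return hits.some((r) => r.effect === "allow");
}
```

For example, a role with `allow baremetal:lab:*:exec` and `deny baremetal:lab:puppet:*` can exec on `ser9` but not on `puppet`, because the deny rule wins wherever both match.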

### 6. Module System

Configuration modules define the desired state of a server.

**Module structure:**
```
modules/
  k3s-server/
    module.yaml           # Metadata: name, version, targets, deps
    src/
      index.ts            # Module entry point
      install.ts          # Installation logic
      configure.ts        # Configuration logic
      health.ts           # Health check
    tests/
      install.test.ts
  k3s-agent/
    module.yaml
    src/
      index.ts
  labd/
    module.yaml
    src/
      index.ts            # Deploy labd to k3s
```
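
The `install.ts`, `configure.ts`, and `health.ts` files suggest a three-step module lifecycle. The sketch below shows one way a module runner could drive it and report the result. The `LabModule` interface, the `runModule` function, and the install, configure, health ordering are all assumptions; the design does not define the module API.

```typescript
// Hypothetical module contract inferred from the file layout above.
interface LabModule {
  name: string;
  install(): Promise<void>;
  configure(): Promise<void>;
  health(): Promise<boolean>;
}

interface ModuleRunResult {
  module: string;
  ok: boolean;
  failedStep?: string; // set only when a step threw or the health check failed
}

// Run the steps in order and stop at the first failure.
async function runModule(mod: LabModule): Promise<ModuleRunResult> {
  const steps: Array<[string, () => Promise<void>]> = [
    ["install", () => mod.install()],
    ["configure", () => mod.configure()],
    [
      "health",
      async () => {
        if (!(await mod.health())) throw new Error("health check failed");
      },
    ],
  ];
  for (const [step, run] of steps) {
    try {
      await run();
    } catch {
      return { module: mod.name, ok: false, failedStep: step };
    }
  }
  return { module: mod.name, ok: true };
}
```

The returned `ModuleRunResult` is the kind of value an agent could send back to labd to satisfy the "report module run results" responsibility.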

**module.yaml:**
```yaml
name: k3s-server
version: 0.1.0
description: Install and configure k3s server
targets:
  roles: [infra]
  labels:
    k3s: server
dependencies:
  - base-server
```
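
The `targets` block decides which servers a module applies to. A possible typed view of the manifest and a matching function might look like the sketch below. The type and function names are hypothetical, and the rule that a server must match one of the roles and all of the labels is an assumption, since the design does not say how `roles` and `labels` combine.

```typescript
// Hypothetical typed form of module.yaml after parsing.
interface ModuleManifest {
  name: string;
  version: string;
  description: string;
  targets: {
    roles: string[];
    labels: Record<string, string>;
  };
  dependencies: string[];
}

interface ServerFacts {
  role: string;
  labels: Record<string, string>;
}

// Assumed semantics: role must be listed AND every target label must match.
function moduleTargetsServer(manifest: ModuleManifest, server: ServerFacts): boolean {
  const roleOk = manifest.targets.roles.includes(server.role);
  const labelsOk = Object.entries(manifest.targets.labels).every(
    ([key, value]) => server.labels[key] === value,
  );
  return roleOk && labelsOk;
}
```

With the manifest above, a server with `role=infra` and `labels={k3s=server}` would match, while a worker labeled `k3s=agent` would not.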

**Module sources:**
- Built-in modules (in this repo, e.g., k3s-server, labd)
- External modules (separate git repos, pulled by URL)
- Module registry (future — like Puppet Forge)

### 7. Cloud/Environment Model

```
Cloud: baremetal
  └── Environment: lab
        ├── Server: puppet.ad.itaz.eu (role=infra, labels={k3s=server})
        ├── Server: ser9.ad.itaz.eu (role=worker, labels={k3s=agent})
        └── ...

Cloud: aws
  ├── Environment: production
  │     ├── Server: i-abc123 (from ASG web-servers)
  │     ├── Server: i-def456 (from ASG web-servers)
  │     └── ...
  └── Environment: staging
        └── ...
```

Each bastion creates an environment under the `baremetal` cloud. AWS autoscaling groups create environments under the `aws` cloud.
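
The hierarchy above can be derived from flat server records, which is how the data would most naturally sit in a database table. The sketch below groups records into a cloud, environment, server tree; the `ServerRecord` shape and the `buildInventory` name are assumptions for illustration.

```typescript
// Hypothetical flat record, e.g. one row per enrolled server.
interface ServerRecord {
  name: string;
  cloud: string;
  environment: string;
}

// cloud name -> environment name -> server names
type Inventory = Map<string, Map<string, string[]>>;

// Group flat records into the Cloud -> Environment -> Server hierarchy.
function buildInventory(servers: ServerRecord[]): Inventory {
  const inventory: Inventory = new Map();
  for (const s of servers) {
    let envs = inventory.get(s.cloud);
    if (envs === undefined) {
      envs = new Map();
      inventory.set(s.cloud, envs);
    }
    let names = envs.get(s.environment);
    if (names === undefined) {
      names = [];
      envs.set(s.environment, names);
    }
    names.push(s.name);
  }
  return inventory;
}
```

Storing cloud and environment as plain fields on each server keeps the model flat, with the tree being only a view built for display.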

### 8. Pulumi Integration

Users submit Pulumi TypeScript code to labd for execution.

```bash
# Apply infrastructure code
lab apply -f infra/k3s-cluster.ts --env lab

# The file is sent to labd, which:
# 1. Checks RBAC (does user have apply permission for this env?)
# 2. Creates a Pulumi stack
# 3. Executes `pulumi up` in a sandboxed environment
# 4. Streams output back to CLI
# 5. Stores state in Pulumi backend (local or S3)
```
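
The server-side steps listed in the comments can be sketched as a single handler with its dependencies injected. Everything named here (`ApplyRequest`, `ApplyDeps`, `handleApply`, and the stack-naming scheme) is hypothetical. The `runPulumiUp` dependency stands in for steps 2 to 5 and is not a real Pulumi API; an actual implementation would need to wire it to Pulumi and a sandbox.

```typescript
// Hypothetical request sent by `lab apply -f <file> --env <env>`.
interface ApplyRequest {
  user: string;
  env: string;
  file: string;   // path as given on the command line
  source: string; // contents of the submitted TypeScript file
}

// Injected dependencies, so the flow can be tested without Pulumi or a database.
interface ApplyDeps {
  isAllowed(user: string, env: string, action: "apply"): boolean;
  // Placeholder for "create stack, run `pulumi up` sandboxed, store state".
  runPulumiUp(stackName: string, source: string, onOutput: (line: string) => void): Promise<void>;
}

async function handleApply(
  req: ApplyRequest,
  deps: ApplyDeps,
  onOutput: (line: string) => void,
): Promise<"ok" | "forbidden"> {
  // Step 1: RBAC check before anything is executed.
  if (!deps.isAllowed(req.user, req.env, "apply")) return "forbidden";

  // Steps 2-5: derive a stack name (naming scheme is an assumption) and run,
  // streaming each output line back to the caller.
  const stackName = `${req.env}-${req.file.replace(/[^a-zA-Z0-9]+/g, "-")}`;
  await deps.runPulumiUp(stackName, req.source, onOutput);
  return "ok";
}
```

Checking RBAC before a stack name is even derived means that user-submitted code is never touched for a forbidden request.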

**Future AWS extension:**
```typescript
// infra/aws-web-servers.ts
import * as aws from "@pulumi/aws";

const asg = new aws.autoscaling.Group("web-servers", {
  maxSize: 10,
  minSize: 2,
  launchTemplate: { /* ... */ },
  // User data installs lab-agent with reusable join token
});
```

## Project Structure

```
lab/
  bastion/                    # Existing — PXE provisioning

  src/
    shared/                   # @lab/shared — types, constants, RBAC
    labd/                     # @lab/labd — master daemon
      src/
        main.ts
        server.ts
        ca/                   # Certificate Authority
        rbac/                 # RBAC engine (reuse mcpctl patterns)
        agents/               # Agent registry + WebSocket
        pulumi/               # Pulumi executor
        logs/                 # Log aggregation
        modules/              # Module registry
        routes/               # REST API
    agent/                    # @lab/agent — agent daemon
      src/
        main.ts
        connection.ts         # mTLS WebSocket to labd
        heartbeat.ts
        executor.ts           # Command execution
        logs.ts               # Log shipping
        modules.ts            # Module runner
    cli/                      # @lab/cli — extends existing CLI
      src/
        commands/
          init/bastion/       # Existing bastion commands
          provision/          # Existing provision commands
          get/                # New: get servers/roles/users/etc
          exec/               # New: remote execution
          logs/               # New: log streaming
          apply/              # New: pulumi apply
          rbac/               # New: role management

  modules/                    # Built-in modules
    k3s-server/               # Deploy k3s server
    k3s-agent/                # Deploy k3s agent
    labd/                     # Deploy labd to k3s
    lab-agent/                # Deploy lab-agent to servers

  deploy/
    k3s/                      # Existing k3s manifests for bastion
    labd/                     # k3s manifests for labd
```

## Implementation Phases

### Phase 1: Foundation (current + next)
- [x] Bastion (PXE provisioning) — DONE
- [x] CLI structure (`lab init/provision`) — DONE
- [ ] Rename puppet to labmaster, reprovision
- [ ] Deploy k3s on labmaster
- [ ] Build labd skeleton (Fastify + Prisma)
- [ ] Certificate Authority (issue/sign certs)
- [ ] Agent skeleton (connect, heartbeat)

### Phase 2: Core Platform
- [ ] RBAC engine (roles, permissions, ACLs)
- [ ] `lab get servers` with environment/cloud/label filters
- [ ] `lab exec` remote command execution
- [ ] `lab logs` streaming
- [ ] Agent auto-enrollment via PXE provision (join token in kickstart)

### Phase 3: Infrastructure as Code
- [ ] Module system (define, apply, health check)
- [ ] k3s-server module (deploy k3s)
- [ ] labd module (deploy labd to k3s)
- [ ] Pulumi executor in labd
- [ ] `lab apply -f` command

### Phase 4: Multi-Cloud
- [ ] AWS provider (Pulumi-based)
- [ ] Reusable join tokens for autoscaling groups
- [ ] Cloud/environment model
- [ ] Auto-discovery of cloud instances

## Key Design Decisions

1. **Pulumi over Puppet** — TypeScript-native, same language for IaC and platform code
2. **mTLS over SSH** — proper PKI, scalable, no key management per-server
3. **Agents connect to master** (not master pushing to agents) — works through NATs, firewalls
4. **RBAC from day one** — security-first, deny by default
5. **Module system inspired by Puppet** — declarative, testable, versionable
6. **Multi-cloud extensible** — cloud is just a label, provider is pluggable
7. **Reuse mcpctl patterns** — Prisma DB, Fastify routes, CLI structure, RBAC model