Michal dbbdf5f971 docs: lab platform design — labd, agent, RBAC, multi-cloud, testing strategy

Comprehensive design document covering:
- labd master daemon with CA, RBAC, Pulumi executor
- lab-agent with mTLS enrollment, heartbeat, log shipping
- Module system (built-in + external repos)
- Cloud/environment model (baremetal + AWS)
- Ephemeral test environments (containers, VMs, cloud)
- Security test patterns for RBAC
- Health gates for deployment promotion
- Database strategy: PostgreSQL now, CockroachDB later
- Networking: Tailscale mesh + Cilium CNI

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

2026-03-17 23:46:29 +00:00

Lab Platform — Design Document

Vision

A unified infrastructure management platform that replaces Puppet with a modern, Pulumi-based system. Manages bare-metal servers, cloud VMs, and k3s clusters through a single CLI and API.

Architecture Overview

┌─────────────────────────────────────────────────────────────────┐
│  Developer Workstation (thebeast)                               │
│                                                                 │
│  lab CLI                                                        │
│  ├── lab init bastion standalone start     (PXE provisioning)   │
│  ├── lab provision install/reprovision     (bare-metal)         │
│  ├── lab get servers --env production      (query)              │
│  ├── lab exec <server> -- <command>        (remote execution)   │
│  ├── lab logs <server>                     (log streaming)      │
│  ├── lab apply -f infra.ts                 (pulumi via labd)    │
│  └── lab get roles/users/permissions       (RBAC management)    │
│                                                                 │
│  Connects to: labd via mTLS                                    │
└─────────────────────┬───────────────────────────────────────────┘
                      │ mTLS (client cert)
                      ▼
┌─────────────────────────────────────────────────────────────────┐
│  labmaster.ad.itaz.eu (infra node, k3s single-node)            │
│                                                                 │
│  ┌──────────────────────────────────────────────────────┐      │
│  │  labd (master daemon)                                 │      │
│  │  ├── Certificate Authority (issues agent certs)       │      │
│  │  ├── RBAC Engine (roles, permissions, ACLs)           │      │
│  │  ├── Agent Registry (connected agents, heartbeats)    │      │
│  │  ├── Pulumi Executor (runs IaC on behalf of users)    │      │
│  │  ├── Log Aggregator (receives agent logs)             │      │
│  │  ├── Module Registry (configuration modules)          │      │
│  │  └── REST API + WebSocket (agent connections)         │      │
│  └──────────────────────────────────────────────────────┘      │
│                                                                 │
│  ┌──────────────────────────────────────────────────────┐      │
│  │  bastion (PXE provisioning)                           │      │
│  │  Running as k3s pod with hostNetwork                  │      │
│  └──────────────────────────────────────────────────────┘      │
└──────────┬──────────────────────────────────────────────────────┘
           │ mTLS (agent certs)
           ▼
┌──────────────────────┐  ┌──────────────────────┐  ┌────────────┐
│  ser9.ad.itaz.eu     │  │  worker-2.ad.itaz.eu │  │  AWS EC2   │
│  (bare-metal worker) │  │  (bare-metal worker) │  │  instances │
│                      │  │                      │  │            │
│  lab-agent           │  │  lab-agent           │  │  lab-agent │
│  ├── heartbeat       │  │  ├── heartbeat       │  │  ├── ...   │
│  ├── log shipping    │  │  ├── log shipping    │  │  └── ...   │
│  ├── exec handler    │  │  ├── exec handler    │  │            │
│  └── module runner   │  │  └── module runner   │  │            │
└──────────────────────┘  └──────────────────────┘  └────────────┘

Components

1. labd (Master Daemon)

The central control plane. Runs on labmaster.ad.itaz.eu as a k3s pod.

Responsibilities:

Certificate Authority — signs agent certificates, manages trust chain
Agent Registry — tracks connected agents, heartbeats, status
RBAC — roles, permissions, ACLs per user/group/environment/cloud
Pulumi Executor — runs Pulumi TypeScript code submitted by users
Log Aggregator — receives and stores logs from agents
Module Registry — stores and distributes configuration modules
REST API — for CLI and external integrations
WebSocket — persistent agent connections for real-time commands

Tech: Fastify, PostgreSQL (via Prisma, reuse mcpctl patterns), WebSocket
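
The Agent Registry responsibility above can be sketched as a small in-memory structure. This is a minimal illustration, not the labd implementation: the names AgentRegistry, recordHeartbeat, and status, and the 30 s / 120 s thresholds, are assumptions that do not appear in the design, and a real registry would presumably persist to PostgreSQL.

```typescript
// Hypothetical sketch: class, method names, and thresholds are assumptions.
type AgentStatus = "online" | "stale" | "offline";

class AgentRegistry {
  // hostname -> epoch milliseconds of the most recent heartbeat
  private readonly lastSeen = new Map<string, number>();
  private readonly staleAfterMs: number;
  private readonly offlineAfterMs: number;

  constructor(staleAfterMs: number = 30000, offlineAfterMs: number = 120000) {
    this.staleAfterMs = staleAfterMs;
    this.offlineAfterMs = offlineAfterMs;
  }

  // Called whenever a heartbeat arrives over an agent connection.
  recordHeartbeat(hostname: string, nowMs: number): void {
    this.lastSeen.set(hostname, nowMs);
  }

  // Derives a status from the age of the last heartbeat.
  // Returns undefined for agents that never enrolled.
  status(hostname: string, nowMs: number): AgentStatus | undefined {
    const seen = this.lastSeen.get(hostname);
    if (seen === undefined) return undefined;
    const age = nowMs - seen;
    if (age >= this.offlineAfterMs) return "offline";
    if (age >= this.staleAfterMs) return "stale";
    return "online";
  }
}
```

Passing the current time in as an argument, rather than reading the clock inside the class, keeps the status logic deterministic and easy to unit test.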

2. lab-agent

Lightweight daemon running on every managed machine.

Responsibilities:

Connect to labd via mTLS (agent certificate)
Send heartbeats (status, load, disk, memory)
Ship logs (journald → labd)
Execute commands on demand (like kubectl exec)
Run configuration modules (like puppet agent -tv)
Report module run results

Tech: Standalone TypeScript binary (compiled with Bun), run as a systemd service
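
One way to model what the agent sends to labd is a discriminated union of message types. The sketch below is illustrative only: every message name and field (load1, diskUsedPct, memUsedPct, requestId, and so on) is an assumption, since the design lists the responsibilities but does not define a wire format.

```typescript
// Hypothetical wire format: all type and field names are assumptions.
interface HeartbeatMessage {
  type: "heartbeat";
  hostname: string;
  load1: number;       // 1-minute load average
  diskUsedPct: number; // 0-100
  memUsedPct: number;  // 0-100
}

interface LogMessage {
  type: "log";
  hostname: string;
  unit: string; // journald unit the line came from
  line: string;
}

interface ExecResultMessage {
  type: "exec-result";
  requestId: string;
  exitCode: number;
}

type AgentMessage = HeartbeatMessage | LogMessage | ExecResultMessage;

// Decodes one JSON frame. Only the "type" tag is checked here;
// field-level validation is deliberately omitted from this sketch.
function decodeAgentMessage(raw: string): AgentMessage {
  const value: unknown = JSON.parse(raw);
  if (typeof value !== "object" || value === null) {
    throw new Error("message must be a JSON object");
  }
  const type = (value as { type?: unknown }).type;
  switch (type) {
    case "heartbeat":
    case "log":
    case "exec-result":
      return value as AgentMessage;
    default:
      throw new Error("unknown message type: " + String(type));
  }
}

// The switch is exhaustive over the union, so the compiler flags
// any message type added later without a matching case.
function summarize(msg: AgentMessage): string {
  switch (msg.type) {
    case "heartbeat":
      return `${msg.hostname} load=${msg.load1} disk=${msg.diskUsedPct}% mem=${msg.memUsedPct}%`;
    case "log":
      return `${msg.hostname} [${msg.unit}] ${msg.line}`;
    case "exec-result":
      return `exec ${msg.requestId} exited ${msg.exitCode}`;
  }
}
```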

3. lab CLI (extended)

Extends the existing lab CLI with platform management commands.

New commands:

# Server management
lab get servers                           # List all servers
lab get servers --env production          # Filter by environment
lab get servers --cloud baremetal         # Filter by cloud
lab get servers --label role=k3s-worker   # Filter by label
lab describe server <name>                # Detailed server info
lab exec <server> -- <command>            # Remote command execution
lab logs <server> [-f]                    # Stream server logs

# Infrastructure as Code
lab apply -f <file.ts>                   # Execute Pulumi code via labd
lab plan -f <file.ts>                    # Dry-run Pulumi code
lab destroy -f <file.ts>                 # Tear down resources

# RBAC
lab get roles                            # List roles
lab get users                            # List users
lab create role <name>                   # Create role
lab bind role <role> --user <user>       # Bind role to user
lab get permissions                      # List permissions

# Environment/Cloud management
lab get environments                     # List environments
lab get clouds                           # List clouds
lab create environment <name> --cloud <cloud>

# Module management
lab get modules                          # List available modules
lab apply module <name> --target <server>  # Apply module to server

4. Certificate Authority

Built into labd. Issues and manages certificates for agents and users.

Flow:

1. Agent starts with a join token (one-time or reusable)
2. Agent generates CSR, sends to labd with token
3. labd validates token, signs certificate
4. Agent receives signed cert + CA cert
5. All future communication uses mTLS

For CLI users:
1. User runs `lab login` or `lab init`
2. labd issues a client certificate (or uses existing SSH keys)
3. CLI uses client cert for all API calls

Token types:

One-time token — for individual bare-metal servers (generated during PXE provision)
Reusable token — for autoscaling groups (AWS ASG instances use the same token)
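
The difference between the two token types can be sketched as follows. This is a simplified illustration under stated assumptions: JoinTokenStore, issue, and redeem are hypothetical names, tokens are passed in as plain strings, and expiry, hashing at rest, and random generation are all left out.

```typescript
// Hypothetical sketch: names are assumptions; no expiry or hashing is modeled.
type TokenKind = "one-time" | "reusable";

class JoinTokenStore {
  private readonly tokens = new Map<string, TokenKind>();

  // Registers a token so that agents can present it with their CSR.
  issue(value: string, kind: TokenKind): void {
    this.tokens.set(value, kind);
  }

  // Returns true if labd may sign the CSR that accompanied this token.
  // A one-time token is consumed on first use; a reusable token stays valid.
  redeem(value: string): boolean {
    const kind = this.tokens.get(value);
    if (kind === undefined) return false;
    if (kind === "one-time") this.tokens.delete(value);
    return true;
  }
}
```

For example, a token written into a kickstart file during PXE provisioning would be issued as "one-time", so a leaked copy is useless once the server has enrolled, while an ASG launch template would carry a "reusable" token.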

5. RBAC Model

Reuses mcpctl's RBAC patterns. Permissions are hierarchical, scoped from cloud down to action:

Cloud → Environment → Server → Action

Examples:
- baremetal:lab:*:exec           — can exec on any lab server
- baremetal:lab:puppet:*         — full access to puppet server
- aws:production:*:read         — read-only on all AWS prod servers
- *:*:*:*                       — superadmin

Resources:

servers, environments, clouds, modules, roles, users, pulumi-stacks

Actions:

read, exec, apply, destroy, manage, admin

Allow/Deny rules:

Roles can carry both allow and deny rules
An explicit deny overrides any matching allow (like AWS IAM)
A request that matches no allow rule is denied by default
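
The permission strings and the deny-takes-precedence rule can be sketched as a small evaluator. This is a minimal illustration, not mcpctl's actual implementation: the Rule shape and the function names are assumptions, and only whole-segment wildcards (*) are handled.

```typescript
// Hypothetical sketch: Rule, matches, and isAllowed are illustrative names.
type Effect = "allow" | "deny";

interface Rule {
  effect: Effect;
  pattern: string; // "cloud:environment:server:action", "*" matches any segment
}

// Compares a pattern to a concrete request, segment by segment.
function matches(pattern: string, request: string): boolean {
  const p = pattern.split(":");
  const r = request.split(":");
  if (p.length !== 4 || r.length !== 4) return false;
  return p.every((segment, i) => segment === "*" || segment === r[i]);
}

// Deny by default: a request passes only if at least one allow rule
// matches and no deny rule matches.
function isAllowed(rules: Rule[], request: string): boolean {
  let allowed = false;
  for (const rule of rules) {
    if (!matches(rule.pattern, request)) continue;
    if (rule.effect === "deny") return false; // deny always wins
    allowed = true;
  }
  return allowed;
}
```

With the rules `allow baremetal:lab:*:exec` and `deny baremetal:lab:puppet:*`, a request for `baremetal:lab:ser9:exec` is allowed, while `baremetal:lab:puppet:exec` is rejected even though the allow rule also matches it.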

6. Module System

Configuration modules define the desired state of a server.

Module structure:

modules/
  k3s-server/
    module.yaml          # Metadata: name, version, targets, deps
    src/
      index.ts           # Module entry point
      install.ts         # Installation logic
      configure.ts       # Configuration logic
      health.ts          # Health check
    tests/
      install.test.ts
  k3s-agent/
    module.yaml
    src/
      index.ts
  labd/
    module.yaml
    src/
      index.ts           # Deploy labd to k3s

module.yaml:

name: k3s-server
version: 0.1.0
description: Install and configure k3s server
targets:
  roles: [infra]
  labels:
    k3s: server
dependencies:
  - base-server
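
A typed view of this manifest, together with a check for whether a module targets a given server, might look like the sketch below. The design does not say how roles and labels combine, so this assumes a server must match one of the listed roles and every listed label; the type and function names are hypothetical.

```typescript
// Hypothetical sketch: the AND semantics between roles and labels is an assumption.
interface ModuleManifest {
  name: string;
  version: string;
  description: string;
  targets: {
    roles?: string[];
    labels?: Record<string, string>;
  };
  dependencies?: string[];
}

interface ServerFacts {
  role: string;
  labels: Record<string, string>;
}

// True when the server has one of the target roles (if any are listed)
// and carries every target label with the expected value.
function moduleTargets(manifest: ModuleManifest, server: ServerFacts): boolean {
  const roles = manifest.targets.roles;
  const labels = manifest.targets.labels || {};
  const roleOk = roles === undefined || roles.some((r) => r === server.role);
  const labelsOk = Object.keys(labels).every((key) => server.labels[key] === labels[key]);
  return roleOk && labelsOk;
}

// The manifest above, expressed as a typed object.
const k3sServer: ModuleManifest = {
  name: "k3s-server",
  version: "0.1.0",
  description: "Install and configure k3s server",
  targets: { roles: ["infra"], labels: { k3s: "server" } },
  dependencies: ["base-server"],
};
```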

Module sources:

Built-in modules (in this repo, e.g., k3s-server, labd)
External modules (separate git repos, pulled by URL)
Module registry (future — like Puppet Forge)
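
Whichever source a module comes from, the dependencies field in module.yaml implies that modules are applied in dependency order. The design does not specify a resolution algorithm; the sketch below assumes a plain depth-first topological sort with cycle detection, and resolveApplyOrder is a hypothetical name.

```typescript
// Hypothetical sketch: depth-first topological sort over module dependencies.
// graph maps a module name to the names listed under its "dependencies".
function resolveApplyOrder(graph: Map<string, string[]>, root: string): string[] {
  const order: string[] = [];
  const state = new Map<string, "visiting" | "done">();

  const visit = (name: string, path: string[]): void => {
    const seen = state.get(name);
    if (seen === "done") return;
    const trail = path.concat(name);
    if (seen === "visiting") {
      throw new Error("dependency cycle: " + trail.join(" -> "));
    }
    const deps = graph.get(name);
    if (deps === undefined) throw new Error("unknown module: " + name);
    state.set(name, "visiting");
    deps.forEach((dep) => visit(dep, trail));
    state.set(name, "done");
    order.push(name); // a module is emitted only after all its dependencies
  };

  visit(root, []);
  return order;
}
```

Given a graph in which k3s-server depends on base-server, resolving k3s-server yields base-server first and k3s-server second.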

7. Cloud/Environment Model

Cloud: baremetal
  └── Environment: lab
       ├── Server: puppet.ad.itaz.eu (role=infra, labels={k3s=server})
       ├── Server: ser9.ad.itaz.eu (role=worker, labels={k3s=agent})
       └── ...

Cloud: aws
  └── Environment: production
       ├── Server: i-abc123 (from ASG web-servers)
       ├── Server: i-def456 (from ASG web-servers)
       └── ...
  └── Environment: staging
       └── ...

Each bastion creates an environment under the baremetal cloud. Instances launched by AWS autoscaling groups enroll as servers in an environment under the aws cloud.
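
The hierarchy above maps naturally onto a flat server record that filters such as lab get servers --env, --cloud, and --label can query. The sketch below is one possible shape, not the labd schema: the Server and ServerFilter types and the filterServers function are assumptions.

```typescript
// Hypothetical sketch: type and function names are assumptions.
interface Server {
  name: string;
  cloud: string;
  environment: string;
  labels: Record<string, string>;
}

interface ServerFilter {
  cloud?: string;
  env?: string;
  label?: string; // "key=value", as passed to --label
}

// Returns the servers that satisfy every filter that is set.
function filterServers(servers: Server[], filter: ServerFilter): Server[] {
  return servers.filter((s) => {
    if (filter.cloud !== undefined && s.cloud !== filter.cloud) return false;
    if (filter.env !== undefined && s.environment !== filter.env) return false;
    if (filter.label !== undefined) {
      const idx = filter.label.indexOf("=");
      if (idx < 0) return false; // malformed selector matches nothing
      const key = filter.label.slice(0, idx);
      const value = filter.label.slice(idx + 1);
      if (s.labels[key] !== value) return false;
    }
    return true;
  });
}

// Inventory taken from the tree above.
const inventory: Server[] = [
  { name: "puppet.ad.itaz.eu", cloud: "baremetal", environment: "lab", labels: { k3s: "server" } },
  { name: "ser9.ad.itaz.eu", cloud: "baremetal", environment: "lab", labels: { k3s: "agent" } },
  { name: "i-abc123", cloud: "aws", environment: "production", labels: {} },
];
```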

8. Pulumi Integration

Users submit Pulumi TypeScript code to labd for execution.

# Apply infrastructure code
lab apply -f infra/k3s-cluster.ts --env lab

# The file is sent to labd, which:
# 1. Checks RBAC (does user have apply permission for this env?)
# 2. Creates a Pulumi stack
# 3. Executes `pulumi up` in a sandboxed environment
# 4. Streams output back to CLI
# 5. Stores state in Pulumi backend (local or S3)
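
The server-side steps listed above can be sketched as a single handler with its dependencies injected. This is an outline under stated assumptions rather than the labd executor: ApplyDeps, handleApply, and the env-based stack naming are hypothetical, and runPulumiUp stands in for whatever mechanism actually runs `pulumi up` in the sandbox and persists state.

```typescript
// Hypothetical sketch: all names are assumptions; Pulumi itself is stubbed out.
interface ApplyRequest {
  user: string;
  env: string;
  program: string; // contents of the submitted .ts file
}

interface ApplyDeps {
  // Step 1: RBAC check for the "apply" action on the target environment.
  isAllowed(user: string, env: string, action: "apply"): boolean;
  // Steps 2, 3, and 5: create or select the stack, run the update in a
  // sandbox, and store state. Output lines are passed to onOutput as produced.
  runPulumiUp(
    stackName: string,
    program: string,
    onOutput: (line: string) => void,
  ): Promise<void>;
}

async function handleApply(
  req: ApplyRequest,
  deps: ApplyDeps,
  onOutput: (line: string) => void, // step 4: stream back to the CLI
): Promise<"ok" | "forbidden"> {
  if (!deps.isAllowed(req.user, req.env, "apply")) return "forbidden";
  const stackName = "env-" + req.env; // assumed naming scheme
  await deps.runPulumiUp(stackName, req.program, onOutput);
  return "ok";
}
```

Checking RBAC before any stack is created means a rejected request leaves no Pulumi state behind.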

Future AWS extension:

// infra/aws-web-servers.ts
import * as aws from "@pulumi/aws";

const asg = new aws.autoscaling.Group("web-servers", {
  maxSize: 10,
  minSize: 2,
  launchTemplate: { /* ... */ },
  // User data installs lab-agent with reusable join token
});

Project Structure

lab/
  bastion/                    # Existing — PXE provisioning

  src/
    shared/                   # @lab/shared — types, constants, RBAC
    labd/                     # @lab/labd — master daemon
      src/
        main.ts
        server.ts
        ca/                   # Certificate Authority
        rbac/                 # RBAC engine (reuse mcpctl patterns)
        agents/               # Agent registry + WebSocket
        pulumi/               # Pulumi executor
        logs/                 # Log aggregation
        modules/              # Module registry
        routes/               # REST API
    agent/                    # @lab/agent — agent daemon
      src/
        main.ts
        connection.ts         # mTLS WebSocket to labd
        heartbeat.ts
        executor.ts           # Command execution
        logs.ts               # Log shipping
        modules.ts            # Module runner
    cli/                      # @lab/cli — extends existing CLI
      src/
        commands/
          init/bastion/       # Existing bastion commands
          provision/          # Existing provision commands
          get/                # New: get servers/roles/users/etc
          exec/               # New: remote execution
          logs/               # New: log streaming
          apply/              # New: pulumi apply
          rbac/               # New: role management

  modules/                    # Built-in modules
    k3s-server/               # Deploy k3s server
    k3s-agent/                # Deploy k3s agent
    labd/                     # Deploy labd to k3s
    lab-agent/                # Deploy lab-agent to servers

  deploy/
    k3s/                      # Existing k3s manifests for bastion
    labd/                     # k3s manifests for labd

Implementation Phases

Phase 1: Foundation (current + next)

Bastion (PXE provisioning) — DONE
CLI structure (lab init/provision) — DONE
Rename puppet to labmaster, reprovision
Deploy k3s on labmaster
Build labd skeleton (Fastify + Prisma)
Certificate Authority (issue/sign certs)
Agent skeleton (connect, heartbeat)

Phase 2: Core Platform

RBAC engine (roles, permissions, ACLs)
lab get servers with environment/cloud/label filters
lab exec remote command execution
lab logs streaming
Agent auto-enrollment via PXE provision (join token in kickstart)

Phase 3: Infrastructure as Code

Module system (define, apply, health check)
k3s-server module (deploy k3s)
labd module (deploy labd to k3s)
Pulumi executor in labd
lab apply -f command

Phase 4: Multi-Cloud

AWS provider (Pulumi-based)
Reusable join tokens for autoscaling groups
Cloud/environment model
Auto-discovery of cloud instances

Key Design Decisions

Pulumi over Puppet — TypeScript-native, same language for IaC and platform code
mTLS over SSH — proper PKI, scalable, no key management per-server
Agents connect to master (not master pushing to agents) — works through NATs, firewalls
RBAC from day one — security-first, deny by default
Module system inspired by Puppet — declarative, testable, versionable
Multi-cloud extensible — cloud is just a label, provider is pluggable
Reuse mcpctl patterns — Prisma DB, Fastify routes, CLI structure, RBAC model
