2026-03-29 00:50:05 +00:00
1 changed files with 355 additions and 0 deletions
--- a/bastion/DESIGN-LAB-PLATFORM.md
+++ b/bastion/DESIGN-LAB-PLATFORM.md
@@ -0,0 +1,355 @@
 # Lab Platform — Design Document
 ## Vision
 A unified infrastructure management platform that replaces Puppet with a modern, Pulumi-based system. It manages bare-metal servers, cloud VMs, and k3s clusters through a single CLI and API.
 ## Architecture Overview
 ```
 ┌─────────────────────────────────────────────────────────────────┐
 │  Developer Workstation (thebeast)                               │
 │                                                                 │
 │  lab CLI                                                        │
 │  ├── lab init bastion standalone start     (PXE provisioning)   │
 │  ├── lab provision install/reprovision     (bare-metal)         │
 │  ├── lab get servers --env production      (query)              │
 │  ├── lab exec <server> -- <command>        (remote execution)   │
 │  ├── lab logs <server>                     (log streaming)      │
 │  ├── lab apply -f infra.ts                 (pulumi via labd)    │
 │  └── lab get roles/users/permissions       (RBAC management)    │
 │                                                                 │
 │  Connects to: labd via mTLS                                    │
 └─────────────────────┬───────────────────────────────────────────┘
                      │ mTLS (client cert)
                      ▼
 ┌─────────────────────────────────────────────────────────────────┐
 │  labmaster.ad.itaz.eu (infra node, k3s single-node)            │
 │                                                                 │
 │  ┌──────────────────────────────────────────────────────┐      │
 │  │  labd (master daemon)                                 │      │
 │  │  ├── Certificate Authority (issues agent certs)       │      │
 │  │  ├── RBAC Engine (roles, permissions, ACLs)           │      │
 │  │  ├── Agent Registry (connected agents, heartbeats)    │      │
 │  │  ├── Pulumi Executor (runs IaC on behalf of users)    │      │
 │  │  ├── Log Aggregator (receives agent logs)             │      │
 │  │  ├── Module Registry (configuration modules)          │      │
 │  │  └── REST API + WebSocket (agent connections)         │      │
 │  └──────────────────────────────────────────────────────┘      │
 │                                                                 │
 │  ┌──────────────────────────────────────────────────────┐      │
 │  │  bastion (PXE provisioning)                           │      │
 │  │  Running as k3s pod with hostNetwork                  │      │
 │  └──────────────────────────────────────────────────────┘      │
 └──────────┬──────────────────────────────────────────────────────┘
           │ mTLS (agent certs)
           ▼
 ┌──────────────────────┐  ┌──────────────────────┐  ┌────────────┐
 │  ser9.ad.itaz.eu     │  │  worker-2.ad.itaz.eu │  │  AWS EC2   │
 │  (bare-metal worker) │  │  (bare-metal worker) │  │  instances │
 │                      │  │                      │  │            │
 │  lab-agent           │  │  lab-agent           │  │  lab-agent │
 │  ├── heartbeat       │  │  ├── heartbeat       │  │  ├── ...   │
 │  ├── log shipping    │  │  ├── log shipping    │  │  └── ...   │
 │  ├── exec handler    │  │  ├── exec handler    │  │            │
 │  └── module runner   │  │  └── module runner   │  │            │
 └──────────────────────┘  └──────────────────────┘  └────────────┘
 ```
 ## Components
 ### 1. labd (Master Daemon)
 The central control plane. Runs on labmaster.ad.itaz.eu as a k3s pod.
 **Responsibilities:**
 - Certificate Authority — signs agent certificates, manages trust chain
 - Agent Registry — tracks connected agents, heartbeats, status
 - RBAC — roles, permissions, ACLs per user/group/environment/cloud
 - Pulumi Executor — runs Pulumi TypeScript code submitted by users
 - Log Aggregator — receives and stores logs from agents
 - Module Registry — stores and distributes configuration modules
 - REST API — for CLI and external integrations
 - WebSocket — persistent agent connections for real-time commands
 **Tech:** Fastify, PostgreSQL (via Prisma, reusing mcpctl patterns), WebSocket
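The Agent Registry's heartbeat tracking could look roughly like the following in-memory sketch. The `AgentRegistry` class, its method names, and the 30-second timeout are assumptions made for illustration; the design does not fix any of them, and a real registry would persist to PostgreSQL.

```typescript
// Hypothetical sketch: class name, method names, and timeout are assumptions.
type AgentStatus = "online" | "offline";

interface AgentRecord {
  hostname: string;
  lastHeartbeatMs: number; // epoch milliseconds of the most recent heartbeat
}

class AgentRegistry {
  private readonly agents = new Map<string, AgentRecord>();
  private readonly timeoutMs: number;

  constructor(timeoutMs: number = 30000) {
    this.timeoutMs = timeoutMs;
  }

  // Record a heartbeat; registers the agent on first contact.
  heartbeat(hostname: string, nowMs: number): void {
    this.agents.set(hostname, { hostname, lastHeartbeatMs: nowMs });
  }

  // An agent is online while its last heartbeat is within the timeout window.
  status(hostname: string, nowMs: number): AgentStatus | undefined {
    const record = this.agents.get(hostname);
    if (record === undefined) return undefined; // never seen
    return nowMs - record.lastHeartbeatMs <= this.timeoutMs ? "online" : "offline";
  }

  // Hostnames whose heartbeats have lapsed.
  offlineAgents(nowMs: number): string[] {
    return Array.from(this.agents.keys()).filter(
      (hostname) => this.status(hostname, nowMs) === "offline",
    );
  }
}
```

Passing the clock in as `nowMs` keeps the liveness logic deterministic and easy to test.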
 ### 2. lab-agent
 Lightweight daemon running on every managed machine.
 **Responsibilities:**
 - Connect to labd via mTLS (agent certificate)
 - Send heartbeats (status, load, disk, memory)
 - Ship logs (journald → labd)
 - Execute commands on demand (like `kubectl exec`)
 - Run configuration modules (like `puppet agent -tv`)
 - Report module run results
 **Tech:** Standalone TypeScript binary (compiled with Bun), systemd service
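The agent responsibilities above imply a small set of message types flowing from agent to labd. A possible shape, assuming JSON frames over the WebSocket, is sketched below. Every field name and the `parseAgentMessage` and `summarize` helpers are hypothetical; the design does not specify a wire format.

```typescript
// Hypothetical wire format: message shapes and field names are assumptions.
interface HeartbeatMessage {
  type: "heartbeat";
  status: "ok" | "degraded";
  load: number; // 1-minute load average
  diskFreeBytes: number;
  memFreeBytes: number;
}
interface LogMessage { type: "log"; unit: string; line: string; }
interface ExecResultMessage { type: "exec-result"; requestId: string; exitCode: number; }
interface ModuleResultMessage { type: "module-result"; module: string; success: boolean; }

type AgentMessage = HeartbeatMessage | LogMessage | ExecResultMessage | ModuleResultMessage;

const KNOWN_TYPES: string[] = ["heartbeat", "log", "exec-result", "module-result"];

// Parse one JSON frame. Returns null for malformed JSON or an unknown `type`.
// Only the `type` tag is validated here; a real parser would check every field.
function parseAgentMessage(raw: string): AgentMessage | null {
  let value: unknown;
  try {
    value = JSON.parse(raw);
  } catch {
    return null;
  }
  if (typeof value !== "object" || value === null) return null;
  const type = (value as { type?: unknown }).type;
  if (typeof type !== "string" || KNOWN_TYPES.indexOf(type) === -1) return null;
  return value as AgentMessage;
}

// Dispatch on the discriminant; the compiler checks that every case is handled.
function summarize(msg: AgentMessage): string {
  switch (msg.type) {
    case "heartbeat": return `heartbeat status=${msg.status} load=${msg.load}`;
    case "log": return `log ${msg.unit}: ${msg.line}`;
    case "exec-result": return `exec ${msg.requestId} exited ${msg.exitCode}`;
    case "module-result": return `module ${msg.module} ${msg.success ? "succeeded" : "failed"}`;
  }
}
```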
 ### 3. lab CLI (extended)
 Extends the existing `lab` CLI with platform management commands.
 **New commands:**
 ```
 # Server management
 lab get servers                           # List all servers
 lab get servers --env production          # Filter by environment
 lab get servers --cloud baremetal         # Filter by cloud
 lab get servers --label role=k3s-worker   # Filter by label
 lab describe server <name>               # Detailed server info
 lab exec <server> -- <command>           # Remote command execution
 lab logs <server> [-f]                   # Stream server logs
 # Infrastructure as Code
 lab apply -f <file.ts>                   # Execute Pulumi code via labd
 lab plan -f <file.ts>                    # Dry-run Pulumi code
 lab destroy -f <file.ts>                 # Tear down resources
 # RBAC
 lab get roles                            # List roles
 lab get users                            # List users
 lab create role <name>                   # Create role
 lab bind role <role> --user <user>       # Bind role to user
 lab get permissions                      # List permissions
 # Environment/Cloud management
 lab get environments                     # List environments
 lab get clouds                           # List clouds
 lab create environment <name> --cloud <cloud>
 # Module management
 lab get modules                          # List available modules
 lab apply module <name> --target <server>  # Apply module to server
 ```
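The `--env`, `--cloud`, and `--label` filters on `lab get servers` could be applied as in the sketch below. The `Server` and `ServerFilter` shapes and both function names are assumptions for illustration, and the sketch assumes all given filters must match (logical AND), which the design does not state.

```typescript
// Hypothetical data shapes; the design does not define them.
interface Server {
  name: string;
  cloud: string;
  env: string;
  labels: Record<string, string>;
}

interface ServerFilter {
  env?: string;
  cloud?: string;
  labels?: Record<string, string>; // built from --label key=value flags
}

// Parse one "key=value" selector; returns null if it is malformed.
function parseLabel(selector: string): [string, string] | null {
  const idx = selector.indexOf("=");
  if (idx <= 0) return null; // no "=" or empty key
  return [selector.slice(0, idx), selector.slice(idx + 1)];
}

// Keep servers that satisfy every filter that was provided.
function filterServers(servers: Server[], filter: ServerFilter): Server[] {
  const wanted = filter.labels ?? {};
  return servers.filter((s) =>
    (filter.env === undefined || s.env === filter.env) &&
    (filter.cloud === undefined || s.cloud === filter.cloud) &&
    Object.keys(wanted).every((key) => s.labels[key] === wanted[key]));
}
```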
 ### 4. Certificate Authority
 Built into labd. Issues and manages certificates for agents and users.
 **Flow:**
 ```
 For agents:
 1. Agent starts with a join token (one-time or reusable)
 2. Agent generates a CSR and sends it to labd with the token
 3. labd validates the token and signs the certificate
 4. Agent receives the signed cert + CA cert
 5. All future communication uses mTLS
 For CLI users:
 1. User runs `lab login` or `lab init`
 2. labd issues a client certificate (or uses existing SSH keys)
 3. CLI uses client cert for all API calls
 ```
 **Token types:**
 - **One-time token** — for individual bare-metal servers (generated during PXE provision)
 - **Reusable token** — for autoscaling groups (AWS ASG instances use the same token)
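The difference between the two token types shows up at the point where labd validates a token before signing a CSR. The sketch below models only that validation step; `JoinTokenStore` and its methods are hypothetical names, and a real store would also need random token generation, expiry, and persistence, none of which are covered here.

```typescript
// Hypothetical sketch of token validation; not the actual labd CA code.
type TokenKind = "one-time" | "reusable";

class JoinTokenStore {
  private readonly tokens = new Map<string, TokenKind>();

  add(value: string, kind: TokenKind): void {
    this.tokens.set(value, kind);
  }

  // Returns true if the token authorizes signing a CSR.
  // A one-time token is deleted on first use; a reusable token stays valid.
  redeem(value: string): boolean {
    const kind = this.tokens.get(value);
    if (kind === undefined) return false; // unknown or already consumed
    if (kind === "one-time") this.tokens.delete(value);
    return true;
  }
}
```

A bare-metal server enrolled during PXE provisioning would redeem its token exactly once, while every instance in an autoscaling group can redeem the same reusable token.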
 ### 5. RBAC Model
 Reuses mcpctl's RBAC patterns. Permissions are hierarchical, written as `cloud:environment:server:action` with `*` as a wildcard:
 ```
 Cloud → Environment → Server → Action
 Examples:
 - baremetal:lab:*:exec           — can exec on any lab server
 - baremetal:lab:puppet:*         — full access to puppet server
 - aws:production:*:read          — read-only on all AWS prod servers
 - *:*:*:*                        — superadmin
 ```
 **Resources:**
 - servers, environments, clouds, modules, roles, users, pulumi-stacks
 **Actions:**
 - read, exec, apply, destroy, manage, admin
 **Allow/Deny rules:**
 - Roles can have `allow` and `deny` rules
 - An explicit deny takes precedence over any allow (as in AWS IAM); with no matching allow, access is denied by default
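Evaluation of these rules can be sketched as follows, assuming permissions are four colon-separated segments matched position by position, with `*` matching any single segment. The `Rule` type and both function names are hypothetical; mcpctl's actual RBAC code may differ.

```typescript
// Hypothetical sketch of rule evaluation; names are assumptions.
type Effect = "allow" | "deny";

interface Rule {
  effect: Effect;
  pattern: string; // cloud:environment:server:action, "*" is a wildcard
}

// A pattern matches when every segment is "*" or equals the request segment.
function matches(pattern: string, request: string): boolean {
  const p = pattern.split(":");
  const r = request.split(":");
  if (p.length !== 4 || r.length !== 4) return false;
  return p.every((segment, i) => segment === "*" || segment === r[i]);
}

// Deny by default; any matching deny rule overrides every allow rule.
function isAllowed(rules: Rule[], request: string): boolean {
  let allowed = false;
  for (const rule of rules) {
    if (!matches(rule.pattern, request)) continue;
    if (rule.effect === "deny") return false;
    allowed = true;
  }
  return allowed;
}
```

With `allow baremetal:lab:*:exec` plus `deny baremetal:lab:puppet:exec`, a user can exec on every lab server except `puppet`.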
 ### 6. Module System
 Configuration modules define the desired state of a server.
 **Module structure:**
 ```
 modules/
  k3s-server/
    module.yaml          # Metadata: name, version, targets, deps
    src/
      index.ts           # Module entry point
      install.ts         # Installation logic
      configure.ts       # Configuration logic
      health.ts          # Health check
    tests/
      install.test.ts
  k3s-agent/
    module.yaml
    src/
      index.ts
  labd/
    module.yaml
    src/
      index.ts           # Deploy labd to k3s
 ```
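The `install.ts`, `configure.ts`, and `health.ts` files suggest a three-step contract that a module's `index.ts` might export. One possible shape is sketched below. The `LabModule` interface, `ModuleContext`, and `runModule` are all assumptions; the steps are synchronous here to keep the sketch short, whereas a real runner would likely be async.

```typescript
// Hypothetical module contract; the design does not define this interface.
interface ModuleContext {
  hostname: string;
  labels: Record<string, string>;
}

interface LabModule {
  name: string;
  install(ctx: ModuleContext): void;
  configure(ctx: ModuleContext): void;
  health(ctx: ModuleContext): boolean;
}

type Step = "install" | "configure" | "health";

interface ModuleRunResult {
  module: string;
  success: boolean;
  failedStep?: Step;
  error?: string;
}

// Run install, then configure, then the health check; stop at the first failure
// and report which step failed so the agent can send the result to labd.
function runModule(mod: LabModule, ctx: ModuleContext): ModuleRunResult {
  let step: Step = "install";
  try {
    mod.install(ctx);
    step = "configure";
    mod.configure(ctx);
    step = "health";
    if (!mod.health(ctx)) {
      return { module: mod.name, success: false, failedStep: "health", error: "health check failed" };
    }
    return { module: mod.name, success: true };
  } catch (err) {
    const message = err instanceof Error ? err.message : String(err);
    return { module: mod.name, success: false, failedStep: step, error: message };
  }
}
```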
 **module.yaml:**
 ```yaml
 name: k3s-server
 version: 0.1.0
 description: Install and configure k3s server
 targets:
   roles: [infra]
   labels:
     k3s: server
 dependencies:
   - base-server
 ```
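How the `targets` block selects servers is not spelled out. One plausible reading, sketched below, is that a server is targeted when its role appears in `roles` and every entry in `labels` matches. That AND semantics, along with the `ModuleTargets` and `ServerFacts` names, is an assumption.

```typescript
// Hypothetical target matching; the AND semantics is an assumption.
interface ModuleTargets {
  roles?: string[];
  labels?: Record<string, string>;
}

interface ServerFacts {
  role: string;
  labels: Record<string, string>;
}

// A missing `roles` or `labels` constraint matches every server.
function moduleTargetsServer(targets: ModuleTargets, server: ServerFacts): boolean {
  const roleOk = targets.roles === undefined || targets.roles.indexOf(server.role) !== -1;
  const labels = targets.labels ?? {};
  const labelsOk = Object.keys(labels).every((key) => server.labels[key] === labels[key]);
  return roleOk && labelsOk;
}
```

Under this reading, the `k3s-server` module above would apply to a server with `role=infra` and `k3s=server`, but not to a worker labelled `k3s=agent`.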
 **Module sources:**
 - Built-in modules (in this repo, e.g., k3s-server, labd)
 - External modules (separate git repos, pulled by URL)
 - Module registry (future — like Puppet Forge)
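Whatever the source, the `dependencies` field in `module.yaml` implies that modules must be applied in dependency order. A depth-first topological sort is one common way to compute that order. The sketch below assumes dependencies are plain module names with no version constraints; `ModuleMeta` and `resolveOrder` are hypothetical names.

```typescript
// Hypothetical dependency resolution; version constraints are not handled.
interface ModuleMeta {
  name: string;
  dependencies: string[];
}

// Returns module names so that every module appears after its dependencies.
// Throws on unknown modules and on dependency cycles.
function resolveOrder(modules: ModuleMeta[]): string[] {
  const byName = new Map<string, ModuleMeta>();
  for (const mod of modules) byName.set(mod.name, mod);

  const order: string[] = [];
  const state = new Map<string, "visiting" | "done">();

  function visit(name: string, path: string[]): void {
    const current = state.get(name);
    if (current === "done") return;
    if (current === "visiting") {
      throw new Error(`dependency cycle: ${path.concat(name).join(" -> ")}`);
    }
    const mod = byName.get(name);
    if (mod === undefined) throw new Error(`unknown module: ${name}`);
    state.set(name, "visiting");
    for (const dep of mod.dependencies) visit(dep, path.concat(name));
    state.set(name, "done");
    order.push(name); // pushed only after all dependencies
  }

  for (const mod of modules) visit(mod.name, []);
  return order;
}
```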
 ### 7. Cloud/Environment Model
 ```
 Cloud: baremetal
  └── Environment: lab
       ├── Server: puppet.ad.itaz.eu (role=infra, labels={k3s=server})
       ├── Server: ser9.ad.itaz.eu (role=worker, labels={k3s=agent})
       └── ...
 Cloud: aws
  ├── Environment: production
  │    ├── Server: i-abc123 (from ASG web-servers)
  │    ├── Server: i-def456 (from ASG web-servers)
  │    └── ...
  └── Environment: staging
       └── ...
 ```
 Each bastion creates an environment under the `baremetal` cloud. AWS environments live under the `aws` cloud, and instances launched by autoscaling groups register as servers within them.
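The hierarchy can be derived from a flat list of server records, which is roughly what the Agent Registry would hold. The sketch below groups servers by cloud and then by environment; `ServerRecord`, `Inventory`, and both functions are assumed names, not part of the design.

```typescript
// Hypothetical inventory model; names are assumptions.
interface ServerRecord {
  name: string;
  cloud: string;
  environment: string;
}

// cloud name -> environment name -> server names
type Inventory = Map<string, Map<string, string[]>>;

function buildInventory(servers: ServerRecord[]): Inventory {
  const inventory: Inventory = new Map<string, Map<string, string[]>>();
  for (const server of servers) {
    let environments = inventory.get(server.cloud);
    if (environments === undefined) {
      environments = new Map<string, string[]>();
      inventory.set(server.cloud, environments);
    }
    const names = environments.get(server.environment) ?? [];
    names.push(server.name);
    environments.set(server.environment, names);
  }
  return inventory;
}

// Look up the servers in one environment; empty if cloud or environment is unknown.
function serversIn(inventory: Inventory, cloud: string, environment: string): string[] {
  return inventory.get(cloud)?.get(environment) ?? [];
}
```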
 ### 8. Pulumi Integration
 Users submit Pulumi TypeScript code to labd for execution.
 ```bash
 # Apply infrastructure code
 lab apply -f infra/k3s-cluster.ts --env lab
 # The file is sent to labd, which:
 # 1. Checks RBAC (does user have apply permission for this env?)
 # 2. Creates a Pulumi stack
 # 3. Executes `pulumi up` in a sandboxed environment
 # 4. Streams output back to CLI
 # 5. Stores state in Pulumi backend (local or S3)
 ```
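The server-side steps listed above can be sketched as a small orchestration function with its collaborators injected. Everything here is hypothetical: `ApplyDeps`, `handleApply`, and the `<env>-<file>` stack naming are assumptions, and the calls are synchronous only to keep the sketch short. State storage (step 5) has no separate call because the Pulumi backend persists state as part of `pulumi up`.

```typescript
// Hypothetical orchestration of `lab apply`; all names are assumptions.
interface ApplyRequest {
  user: string;
  env: string;
  file: string;   // path as given to `lab apply -f`
  source: string; // Pulumi TypeScript program text
}

interface ApplyDeps {
  canApply(user: string, env: string): boolean; // step 1: RBAC check
  createStack(name: string): void;              // step 2
  // steps 3-4: run `pulumi up` in a sandbox, forwarding each output line
  runUp(stack: string, source: string, onOutput: (line: string) => void): boolean;
}

type ApplyResult = { ok: true; stack: string } | { ok: false; reason: string };

// Assumed naming scheme: "<env>-<file basename without .ts>".
function stackNameFor(env: string, file: string): string {
  const base = file.split("/").pop() ?? file;
  return `${env}-${base.replace(/\.ts$/, "")}`;
}

function handleApply(
  req: ApplyRequest,
  deps: ApplyDeps,
  onOutput: (line: string) => void,
): ApplyResult {
  if (!deps.canApply(req.user, req.env)) {
    return { ok: false, reason: "permission denied" };
  }
  const stack = stackNameFor(req.env, req.file);
  deps.createStack(stack);
  if (!deps.runUp(stack, req.source, onOutput)) {
    return { ok: false, reason: "pulumi up failed" };
  }
  return { ok: true, stack };
}
```

Checking RBAC before any stack is created means a denied request leaves no trace in the Pulumi backend.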
 **Future AWS extension:**
 ```typescript
 // infra/aws-web-servers.ts
 import * as aws from "@pulumi/aws";
 const asg = new aws.autoscaling.Group("web-servers", {
   maxSize: 10,
   minSize: 2,
   launchTemplate: { /* ... */ },
   // User data installs lab-agent with reusable join token
 });
 ```
 ## Project Structure
 ```
 lab/
  bastion/                    # Existing — PXE provisioning
  src/
    shared/                   # @lab/shared — types, constants, RBAC
    labd/                     # @lab/labd — master daemon
      src/
        main.ts
        server.ts
        ca/                   # Certificate Authority
        rbac/                 # RBAC engine (reuse mcpctl patterns)
        agents/               # Agent registry + WebSocket
        pulumi/               # Pulumi executor
        logs/                 # Log aggregation
        modules/              # Module registry
        routes/               # REST API
    agent/                    # @lab/agent — agent daemon
      src/
        main.ts
        connection.ts         # mTLS WebSocket to labd
        heartbeat.ts
        executor.ts           # Command execution
        logs.ts               # Log shipping
        modules.ts            # Module runner
    cli/                      # @lab/cli — extends existing CLI
      src/
        commands/
          init/bastion/       # Existing bastion commands
          provision/          # Existing provision commands
          get/                # New: get servers/roles/users/etc
          exec/               # New: remote execution
          logs/               # New: log streaming
          apply/              # New: pulumi apply
          rbac/               # New: role management
  modules/                    # Built-in modules
    k3s-server/               # Deploy k3s server
    k3s-agent/                # Deploy k3s agent
    labd/                     # Deploy labd to k3s
    lab-agent/                # Deploy lab-agent to servers
  deploy/
    k3s/                      # Existing k3s manifests for bastion
    labd/                     # k3s manifests for labd
 ```
 ## Implementation Phases
 ### Phase 1: Foundation (current + next)
 - [x] Bastion (PXE provisioning) — DONE
 - [x] CLI structure (`lab init/provision`) — DONE
 - [ ] Rename puppet to labmaster, reprovision
 - [ ] Deploy k3s on labmaster
 - [ ] Build labd skeleton (Fastify + Prisma)
 - [ ] Certificate Authority (issue/sign certs)
 - [ ] Agent skeleton (connect, heartbeat)
 ### Phase 2: Core Platform
 - [ ] RBAC engine (roles, permissions, ACLs)
 - [ ] `lab get servers` with environment/cloud/label filters
 - [ ] `lab exec` remote command execution
 - [ ] `lab logs` streaming
 - [ ] Agent auto-enrollment via PXE provision (join token in kickstart)
 ### Phase 3: Infrastructure as Code
 - [ ] Module system (define, apply, health check)
 - [ ] k3s-server module (deploy k3s)
 - [ ] labd module (deploy labd to k3s)
 - [ ] Pulumi executor in labd
 - [ ] `lab apply -f` command
 ### Phase 4: Multi-Cloud
 - [ ] AWS provider (Pulumi-based)
 - [ ] Reusable join tokens for autoscaling groups
 - [ ] Cloud/environment model
 - [ ] Auto-discovery of cloud instances
 ## Key Design Decisions
 1. **Pulumi over Puppet** — TypeScript-native, same language for IaC and platform code
 2. **mTLS over SSH** — one CA-backed PKI for all machines, scalable, no per-server SSH key management
 3. **Agents connect to master** (not master pushing to agents) — works through NATs, firewalls
 4. **RBAC from day one** — security-first, deny by default
 5. **Module system inspired by Puppet** — declarative, testable, versionable
 6. **Multi-cloud extensible** — cloud is just a label, provider is pluggable
 7. **Reuse mcpctl patterns** — Prisma DB, Fastify routes, CLI structure, RBAC model