lab/bastion/DESIGN-LAB-PLATFORM.md

# Lab Platform — Design Document

## Vision

A unified infrastructure management platform that replaces Puppet with a Pulumi-based system. It manages bare-metal servers, cloud VMs, and k3s clusters through a single CLI and API.

## Architecture Overview

```
┌─────────────────────────────────────────────────────────────────┐
│  Developer Workstation (thebeast)                               │
│                                                                 │
│  lab CLI                                                        │
│  ├── lab init bastion standalone start     (PXE provisioning)   │
│  ├── lab provision install/reprovision     (bare-metal)         │
│  ├── lab get servers --env production      (query)              │
│  ├── lab exec <server> -- <command>        (remote execution)   │
│  ├── lab logs <server>                     (log streaming)      │
│  ├── lab apply -f infra.ts                 (pulumi via labd)    │
│  └── lab get roles/users/permissions       (RBAC management)    │
│                                                                 │
│  Connects to: labd via mTLS                                    │
└─────────────────────┬───────────────────────────────────────────┘
                      │ mTLS (client cert)
                      ▼
┌─────────────────────────────────────────────────────────────────┐
│  labmaster.ad.itaz.eu (infra node, k3s single-node)            │
│                                                                 │
│  ┌──────────────────────────────────────────────────────┐      │
│  │  labd (master daemon)                                 │      │
│  │  ├── Certificate Authority (issues agent certs)       │      │
│  │  ├── RBAC Engine (roles, permissions, ACLs)           │      │
│  │  ├── Agent Registry (connected agents, heartbeats)    │      │
│  │  ├── Pulumi Executor (runs IaC on behalf of users)    │      │
│  │  ├── Log Aggregator (receives agent logs)             │      │
│  │  ├── Module Registry (configuration modules)          │      │
│  │  └── REST API + WebSocket (agent connections)         │      │
│  └──────────────────────────────────────────────────────┘      │
│                                                                 │
│  ┌──────────────────────────────────────────────────────┐      │
│  │  bastion (PXE provisioning)                           │      │
│  │  Running as k3s pod with hostNetwork                  │      │
│  └──────────────────────────────────────────────────────┘      │
└──────────┬──────────────────────────────────────────────────────┘
           │ mTLS (agent certs)
           ▼
┌──────────────────────┐  ┌──────────────────────┐  ┌────────────┐
│  ser9.ad.itaz.eu     │  │  worker-2.ad.itaz.eu │  │  AWS EC2   │
│  (bare-metal worker) │  │  (bare-metal worker) │  │  instances │
│                      │  │                      │  │            │
│  lab-agent           │  │  lab-agent           │  │  lab-agent │
│  ├── heartbeat       │  │  ├── heartbeat       │  │  ├── ...   │
│  ├── log shipping    │  │  ├── log shipping    │  │  └── ...   │
│  ├── exec handler    │  │  ├── exec handler    │  │            │
│  └── module runner   │  │  └── module runner   │  │            │
└──────────────────────┘  └──────────────────────┘  └────────────┘
```

## Components

### 1. labd (Master Daemon)

The central control plane. Runs on labmaster.ad.itaz.eu as a k3s pod.

**Responsibilities:**
- Certificate Authority — signs agent certificates, manages trust chain
- Agent Registry — tracks connected agents, heartbeats, status
- RBAC — roles, permissions, ACLs per user/group/environment/cloud
- Pulumi Executor — runs Pulumi TypeScript code submitted by users
- Log Aggregator — receives and stores logs from agents
- Module Registry — stores and distributes configuration modules
- REST API — for CLI and external integrations
- WebSocket — persistent agent connections for real-time commands

**Tech:** Fastify, PostgreSQL (via Prisma, reuse mcpctl patterns), WebSocket
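
The Agent Registry responsibility can be illustrated with a minimal in-memory sketch. The class name, field names, and the `staleAfterMs` threshold are assumptions made for illustration; the design does not fix them, and a real registry would presumably persist to PostgreSQL.

```typescript
// Hypothetical shapes: the design does not specify field names.
type AgentStatus = "online" | "stale";

interface AgentRecord {
  hostname: string;
  lastSeenMs: number;
}

class AgentRegistry {
  private readonly agents = new Map<string, AgentRecord>();
  private readonly staleAfterMs: number;

  // staleAfterMs is an assumed policy knob, e.g. a few missed heartbeats.
  constructor(staleAfterMs: number) {
    this.staleAfterMs = staleAfterMs;
  }

  // Called whenever a heartbeat arrives on an agent's connection.
  recordHeartbeat(hostname: string, nowMs: number = Date.now()): void {
    this.agents.set(hostname, { hostname, lastSeenMs: nowMs });
  }

  // An agent is "stale" once no heartbeat arrived within the window.
  // Returns undefined for agents that never connected.
  statusOf(hostname: string, nowMs: number = Date.now()): AgentStatus | undefined {
    const record = this.agents.get(hostname);
    if (!record) return undefined;
    return nowMs - record.lastSeenMs > this.staleAfterMs ? "stale" : "online";
  }

  // Snapshot of all known agents, as `lab get servers` might need it.
  list(nowMs: number = Date.now()): Array<{ hostname: string; status: AgentStatus }> {
    return Array.from(this.agents.values()).map((a) => ({
      hostname: a.hostname,
      status: this.statusOf(a.hostname, nowMs) ?? "stale",
    }));
  }
}
```

Passing `nowMs` explicitly keeps the staleness logic testable without waiting on a real clock.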

### 2. lab-agent

Lightweight daemon running on every managed machine.

**Responsibilities:**
- Connect to labd via mTLS (agent certificate)
- Send heartbeats (status, load, disk, memory)
- Ship logs (journald → labd)
- Execute commands on demand (like `kubectl exec`)
- Run configuration modules (like `puppet agent -tv`)
- Report module run results

**Tech:** Standalone binary compiled from TypeScript with Bun (`bun build --compile`), run as a systemd service
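
As a rough sketch of the heartbeat responsibility, the agent could assemble its payload from the standard `node:os` module. The `Heartbeat` shape and `collectHeartbeat` name are assumptions; disk usage is left out here because how it is collected depends on the runtime.

```typescript
import * as os from "node:os";

// Hypothetical heartbeat shape; field names are illustrative only.
interface Heartbeat {
  hostname: string;
  timestamp: string; // ISO 8601
  load1: number; // 1-minute load average
  memTotalBytes: number;
  memFreeBytes: number;
  uptimeSeconds: number;
}

// Gathers a point-in-time snapshot of the local machine.
function collectHeartbeat(now: Date = new Date()): Heartbeat {
  return {
    hostname: os.hostname(),
    timestamp: now.toISOString(),
    load1: os.loadavg()[0] ?? 0,
    memTotalBytes: os.totalmem(),
    memFreeBytes: os.freemem(),
    uptimeSeconds: Math.floor(os.uptime()),
  };
}
```

The agent would serialize this object and send it over its mTLS connection at a fixed interval.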

### 3. lab CLI (extended)

Extends the existing `lab` CLI with platform management commands.

**New commands:**
```
# Server management
lab get servers                           # List all servers
lab get servers --env production          # Filter by environment
lab get servers --cloud baremetal         # Filter by cloud
lab get servers --label role=k3s-worker   # Filter by label
lab describe server <name>               # Detailed server info
lab exec <server> -- <command>           # Remote command execution
lab logs <server> [-f]                   # Stream server logs

# Infrastructure as Code
lab apply -f <file.ts>                   # Execute Pulumi code via labd
lab plan -f <file.ts>                    # Dry-run Pulumi code
lab destroy -f <file.ts>                 # Tear down resources

# RBAC
lab get roles                            # List roles
lab get users                            # List users
lab create role <name>                   # Create role
lab bind role <role> --user <user>       # Bind role to user
lab get permissions                      # List permissions

# Environment/Cloud management
lab get environments                     # List environments
lab get clouds                           # List clouds
lab create environment <name> --cloud <cloud>

# Module management
lab get modules                          # List available modules
lab apply module <name> --target <server>  # Apply module to server
```

### 4. Certificate Authority

Built into labd. Issues and manages certificates for agents and users.

**Flow:**
```
1. Agent starts with a join token (one-time or reusable)
2. Agent generates CSR, sends to labd with token
3. labd validates token, signs certificate
4. Agent receives signed cert + CA cert
5. All future communication uses mTLS

For CLI users:
1. User runs `lab login` or `lab init`
2. labd issues a client certificate (or uses existing SSH keys)
3. CLI uses client cert for all API calls
```

**Token types:**
- **One-time token** — for individual bare-metal servers (generated during PXE provision)
- **Reusable token** — for autoscaling groups (AWS ASG instances use the same token)
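
The difference between the two token types shows up in step 3 of the flow (token validation). The sketch below is one possible shape, assuming an in-memory store and hex-encoded random tokens; `JoinTokenStore`, `issue`, and `redeem` are hypothetical names, and CSR signing itself is out of scope here.

```typescript
import { createHash, randomBytes } from "node:crypto";

type TokenKind = "one-time" | "reusable";

interface StoredToken {
  kind: TokenKind;
  used: boolean;
}

class JoinTokenStore {
  // Keyed by the SHA-256 digest so the plaintext token is never stored.
  private readonly tokens = new Map<string, StoredToken>();

  private static digest(token: string): string {
    return createHash("sha256").update(token).digest("hex");
  }

  // Creates a new token; the plaintext is returned exactly once.
  issue(kind: TokenKind): string {
    const token = randomBytes(32).toString("hex");
    this.tokens.set(JoinTokenStore.digest(token), { kind, used: false });
    return token;
  }

  // Returns true if the CSR accompanying this token may be signed.
  // One-time tokens are consumed on first use; reusable tokens are not.
  redeem(token: string): boolean {
    const entry = this.tokens.get(JoinTokenStore.digest(token));
    if (!entry) return false;
    if (entry.kind === "one-time") {
      if (entry.used) return false;
      entry.used = true;
    }
    return true;
  }
}
```

A one-time token embedded in a kickstart file stops working after the first enrollment, while an autoscaling group can keep presenting the same reusable token.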

### 5. RBAC Model

Reuses mcpctl's RBAC patterns. Permissions are hierarchical, written as `cloud:environment:server:action` with `*` as a wildcard for any segment:

```
Cloud → Environment → Server → Action

Examples:
- baremetal:lab:*:exec           — can exec on any lab server
- baremetal:lab:puppet:*         — full access to puppet server
- aws:production:*:read         — read-only on all AWS prod servers
- *:*:*:*                       — superadmin
```

**Resources:**
- servers, environments, clouds, modules, roles, users, pulumi-stacks

**Actions:**
- read, exec, apply, destroy, manage, admin

**Allow/Deny rules:**
- Roles can have `allow` and `deny` rules
- An explicit deny overrides any matching allow (as in AWS IAM)
- With no matching allow rule, access is denied by default
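
A minimal sketch of how such rules might be evaluated, assuming each rule carries a four-segment `cloud:environment:server:action` pattern in which `*` matches any single segment. The `Rule` shape and function names are illustrative, not taken from mcpctl.

```typescript
type Effect = "allow" | "deny";

interface Rule {
  effect: Effect;
  pattern: string; // "cloud:environment:server:action"
}

// Segment-wise comparison; "*" in the pattern matches any value.
function matchesPattern(pattern: string, request: string): boolean {
  const p = pattern.split(":");
  const r = request.split(":");
  if (p.length !== 4 || r.length !== 4) return false;
  return p.every((segment, i) => segment === "*" || segment === r[i]);
}

// Deny by default; any matching deny wins over any matching allow.
function isAllowed(rules: Rule[], request: string): boolean {
  let allowed = false;
  for (const rule of rules) {
    if (!matchesPattern(rule.pattern, request)) continue;
    if (rule.effect === "deny") return false;
    allowed = true;
  }
  return allowed;
}
```

With rules `allow baremetal:lab:*:exec` and `deny baremetal:lab:puppet:*`, a request for `baremetal:lab:ser9:exec` is permitted, while `baremetal:lab:puppet:exec` is rejected because the deny rule also matches.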

### 6. Module System

Configuration modules define the desired state of a server.

**Module structure:**
```
modules/
  k3s-server/
    module.yaml          # Metadata: name, version, targets, deps
    src/
      index.ts           # Module entry point
      install.ts         # Installation logic
      configure.ts       # Configuration logic
      health.ts          # Health check
    tests/
      install.test.ts
  k3s-agent/
    module.yaml
    src/
      index.ts
  labd/
    module.yaml
    src/
      index.ts           # Deploy labd to k3s
```
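
One way the files in `src/` could fit together is a small contract that `index.ts` exports and the agent's module runner drives. Everything in this sketch (`LabModule`, `ModuleContext`, `runModule`, and the install, configure, health ordering) is an assumption made for illustration; the design does not yet define the interface.

```typescript
interface ModuleContext {
  hostname: string;
}

interface HealthResult {
  healthy: boolean;
  detail: string;
}

// Hypothetical contract a module's index.ts would export.
interface LabModule {
  name: string;
  install(ctx: ModuleContext): Promise<void>;
  configure(ctx: ModuleContext): Promise<void>;
  health(ctx: ModuleContext): Promise<HealthResult>;
}

type Step = "install" | "configure" | "health";

interface RunReport {
  module: string;
  ok: boolean;
  failedStep: Step | null;
  detail: string;
}

// Runs the steps in order and stops at the first failure,
// producing the result the agent would report back to labd.
async function runModule(mod: LabModule, ctx: ModuleContext): Promise<RunReport> {
  let step: Step = "install";
  try {
    await mod.install(ctx);
    step = "configure";
    await mod.configure(ctx);
    step = "health";
    const result = await mod.health(ctx);
    return {
      module: mod.name,
      ok: result.healthy,
      failedStep: result.healthy ? null : "health",
      detail: result.detail,
    };
  } catch (err) {
    const detail = err instanceof Error ? err.message : String(err);
    return { module: mod.name, ok: false, failedStep: step, detail };
  }
}
```

Recording which step failed gives the "report module run results" responsibility something concrete to send.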

**module.yaml:**
```yaml
name: k3s-server
version: 0.1.0
description: Install and configure k3s server
targets:
  roles: [infra]
  labels:
    k3s: server
dependencies:
  - base-server
```
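
To make the `targets` and `dependencies` fields concrete, here is a sketch of how they might be interpreted. It assumes a server must match one of the listed roles and every listed label, and that dependencies are applied before the module that needs them; neither rule is stated in the design, and the function names are hypothetical.

```typescript
// Mirrors the module.yaml fields shown above.
interface ModuleMeta {
  name: string;
  version: string;
  description: string;
  targets: { roles?: string[]; labels?: Record<string, string> };
  dependencies: string[];
}

interface ServerFacts {
  role: string;
  labels: Record<string, string>;
}

// Assumption: match any one role AND every label.
function moduleApplies(meta: ModuleMeta, server: ServerFacts): boolean {
  const roles = meta.targets.roles;
  if (roles && roles.indexOf(server.role) === -1) return false;
  const wanted = meta.targets.labels ?? {};
  return Object.keys(wanted).every((key) => server.labels[key] === wanted[key]);
}

// Depth-first ordering: dependencies come before their dependents.
function resolveOrder(catalog: Map<string, ModuleMeta>, root: string): string[] {
  const order: string[] = [];
  const visiting = new Set<string>();
  const done = new Set<string>();
  const visit = (name: string): void => {
    if (done.has(name)) return;
    if (visiting.has(name)) throw new Error(`dependency cycle at ${name}`);
    const meta = catalog.get(name);
    if (!meta) throw new Error(`unknown module ${name}`);
    visiting.add(name);
    for (const dep of meta.dependencies) visit(dep);
    visiting.delete(name);
    done.add(name);
    order.push(name);
  };
  visit(root);
  return order;
}
```

For the example above, resolving `k3s-server` would yield `base-server` first and `k3s-server` second.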

**Module sources:**
- Built-in modules (in this repo, e.g., k3s-server, labd)
- External modules (separate git repos, pulled by URL)
- Module registry (future — like Puppet Forge)

### 7. Cloud/Environment Model

```
Cloud: baremetal
  └── Environment: lab
       ├── Server: puppet.ad.itaz.eu (role=infra, labels={k3s=server})
       ├── Server: ser9.ad.itaz.eu (role=worker, labels={k3s=agent})
       └── ...

Cloud: aws
  └── Environment: production
       ├── Server: i-abc123 (from ASG web-servers)
       ├── Server: i-def456 (from ASG web-servers)
       └── ...
  └── Environment: staging
       └── ...
```

Each bastion creates an environment under the `baremetal` cloud. Instances launched by AWS autoscaling groups register as servers in environments under the `aws` cloud.
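
The hierarchy maps naturally onto a flat server record plus a filter, which is roughly what `lab get servers --env ... --cloud ... --label ...` would need. The `Server` and `ServerFilter` shapes below are assumptions for illustration.

```typescript
interface Server {
  name: string;
  cloud: string;
  environment: string;
  labels: Record<string, string>;
}

// Every field is optional; an omitted field matches all servers.
interface ServerFilter {
  cloud?: string;
  env?: string;
  labels?: Record<string, string>;
}

function filterServers(servers: Server[], filter: ServerFilter): Server[] {
  const wanted = filter.labels ?? {};
  return servers.filter(
    (s) =>
      (filter.cloud === undefined || s.cloud === filter.cloud) &&
      (filter.env === undefined || s.environment === filter.env) &&
      Object.keys(wanted).every((key) => s.labels[key] === wanted[key]),
  );
}
```

Calling `filterServers(all, { cloud: "baremetal", labels: { k3s: "agent" } })` would return only the bare-metal k3s agents.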

### 8. Pulumi Integration

Users submit Pulumi TypeScript code to labd for execution.

```bash
# Apply infrastructure code
lab apply -f infra/k3s-cluster.ts --env lab

# The file is sent to labd, which:
# 1. Checks RBAC (does user have apply permission for this env?)
# 2. Creates a Pulumi stack
# 3. Executes `pulumi up` in a sandboxed environment
# 4. Streams output back to CLI
# 5. Stores state in Pulumi backend (local or S3)
```
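
On the labd side, the five steps could be organized as below. This is a sketch with the RBAC check and the Pulumi run injected as dependencies so that it stays self-contained; `ApplyDeps`, `handleApply`, and the one-stack-per-environment naming are assumptions. A real `pulumiUp` might be built on the Pulumi Automation API.

```typescript
// All names here are hypothetical; the design only fixes the five steps.
interface ApplyRequest {
  user: string;
  env: string;
  source: string; // contents of the submitted .ts file
}

interface ApplyDeps {
  // Step 1: RBAC decision for the "apply" action.
  canApply(user: string, env: string): boolean;
  // Steps 3 and 5: run `pulumi up` in a sandbox; the backend persists state.
  pulumiUp(stack: string, source: string, onOutput: (line: string) => void): Promise<void>;
}

type ApplyResult = { ok: true; stack: string } | { ok: false; reason: string };

async function handleApply(
  req: ApplyRequest,
  deps: ApplyDeps,
  onOutput: (line: string) => void,
): Promise<ApplyResult> {
  // Step 1: refuse early if the user lacks permission.
  if (!deps.canApply(req.user, req.env)) {
    return { ok: false, reason: `user ${req.user} may not apply to ${req.env}` };
  }
  // Step 2: assumed naming scheme, one stack per environment.
  const stack = `lab-${req.env}`;
  try {
    // Steps 3 and 4: execute and stream output through onOutput.
    await deps.pulumiUp(stack, req.source, onOutput);
    return { ok: true, stack };
  } catch (err) {
    return { ok: false, reason: err instanceof Error ? err.message : String(err) };
  }
}
```

Injecting `pulumiUp` also makes it straightforward to reuse the same path for `lab plan` with a preview-only implementation.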

**Future AWS extension:**
```typescript
// infra/aws-web-servers.ts
import * as aws from "@pulumi/aws";

const asg = new aws.autoscaling.Group("web-servers", {
  maxSize: 10,
  minSize: 2,
  launchTemplate: { /* ... */ },
  // User data installs lab-agent with reusable join token
});
```

## Project Structure

```
lab/
  bastion/                    # Existing — PXE provisioning

  src/
    shared/                   # @lab/shared — types, constants, RBAC
    labd/                     # @lab/labd — master daemon
      src/
        main.ts
        server.ts
        ca/                   # Certificate Authority
        rbac/                 # RBAC engine (reuse mcpctl patterns)
        agents/               # Agent registry + WebSocket
        pulumi/               # Pulumi executor
        logs/                 # Log aggregation
        modules/              # Module registry
        routes/               # REST API
    agent/                    # @lab/agent — agent daemon
      src/
        main.ts
        connection.ts         # mTLS WebSocket to labd
        heartbeat.ts
        executor.ts           # Command execution
        logs.ts               # Log shipping
        modules.ts            # Module runner
    cli/                      # @lab/cli — extends existing CLI
      src/
        commands/
          init/bastion/       # Existing bastion commands
          provision/          # Existing provision commands
          get/                # New: get servers/roles/users/etc
          exec/               # New: remote execution
          logs/               # New: log streaming
          apply/              # New: pulumi apply
          rbac/               # New: role management

  modules/                    # Built-in modules
    k3s-server/               # Deploy k3s server
    k3s-agent/                # Deploy k3s agent
    labd/                     # Deploy labd to k3s
    lab-agent/                # Deploy lab-agent to servers

  deploy/
    k3s/                      # Existing k3s manifests for bastion
    labd/                     # k3s manifests for labd
```

## Implementation Phases

### Phase 1: Foundation (current + next)
- [x] Bastion (PXE provisioning) — DONE
- [x] CLI structure (`lab init/provision`) — DONE
- [ ] Rename puppet to labmaster, reprovision
- [ ] Deploy k3s on labmaster
- [ ] Build labd skeleton (Fastify + Prisma)
- [ ] Certificate Authority (issue/sign certs)
- [ ] Agent skeleton (connect, heartbeat)

### Phase 2: Core Platform
- [ ] RBAC engine (roles, permissions, ACLs)
- [ ] `lab get servers` with environment/cloud/label filters
- [ ] `lab exec` remote command execution
- [ ] `lab logs` streaming
- [ ] Agent auto-enrollment via PXE provision (join token in kickstart)

### Phase 3: Infrastructure as Code
- [ ] Module system (define, apply, health check)
- [ ] k3s-server module (deploy k3s)
- [ ] labd module (deploy labd to k3s)
- [ ] Pulumi executor in labd
- [ ] `lab apply -f` command

### Phase 4: Multi-Cloud
- [ ] AWS provider (Pulumi-based)
- [ ] Reusable join tokens for autoscaling groups
- [ ] Cloud/environment model
- [ ] Auto-discovery of cloud instances

## Key Design Decisions

1. **Pulumi over Puppet** — TypeScript-native, same language for IaC and platform code
2. **mTLS over SSH** — proper PKI, scalable, no key management per-server
3. **Agents connect to master** (not master pushing to agents) — works through NATs, firewalls
4. **RBAC from day one** — security-first, deny by default
5. **Module system inspired by Puppet** — declarative, testable, versionable
6. **Multi-cloud extensible** — cloud is just a label, provider is pluggable
7. **Reuse mcpctl patterns** — Prisma DB, Fastify routes, CLI structure, RBAC model
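
Decision 3 implies that the agent, not labd, is responsible for re-establishing a dropped connection. A common way to do that without overwhelming the master after a restart is exponential backoff with jitter; the sketch below assumes a 1 s base and a 60 s cap, neither of which comes from this design.

```typescript
// Returns how long to wait before reconnect attempt number `attempt`
// (0-based). The ceiling doubles per attempt up to maxMs, and the
// actual delay is drawn uniformly below that ceiling ("full jitter").
function reconnectDelayMs(
  attempt: number,
  baseMs: number = 1000, // assumed default
  maxMs: number = 60000, // assumed default
  random: () => number = Math.random,
): number {
  const ceiling = Math.min(maxMs, baseMs * Math.pow(2, attempt));
  return Math.floor(random() * ceiling);
}
```

The `random` parameter exists only so the function can be tested deterministically.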