fix: PXE boot debugging — bisect root cause, syslog logging, serial console #3
355
bastion/DESIGN-LAB-PLATFORM.md
Normal file
@@ -0,0 +1,355 @@
# Lab Platform — Design Document

## Vision

A unified infrastructure management platform that replaces Puppet with a modern, Pulumi-based system. Manages bare-metal servers, cloud VMs, and k3s clusters through a single CLI and API.

## Architecture Overview

```
┌─────────────────────────────────────────────────────────────────┐
│  Developer Workstation (thebeast)                               │
│                                                                 │
│  lab CLI                                                        │
│  ├── lab init bastion standalone start  (PXE provisioning)      │
│  ├── lab provision install/reprovision  (bare-metal)            │
│  ├── lab get servers --env production   (query)                 │
│  ├── lab exec <server> -- <command>     (remote execution)      │
│  ├── lab logs <server>                  (log streaming)         │
│  ├── lab apply -f infra.ts              (pulumi via labd)       │
│  └── lab get roles/users/permissions    (RBAC management)       │
│                                                                 │
│  Connects to: labd via mTLS                                     │
└─────────────────────┬───────────────────────────────────────────┘
                      │ mTLS (client cert)
                      ▼
┌─────────────────────────────────────────────────────────────────┐
│  labmaster.ad.itaz.eu (infra node, k3s single-node)             │
│                                                                 │
│  ┌──────────────────────────────────────────────────────┐       │
│  │  labd (master daemon)                                │       │
│  │  ├── Certificate Authority (issues agent certs)      │       │
│  │  ├── RBAC Engine     (roles, permissions, ACLs)      │       │
│  │  ├── Agent Registry  (connected agents, heartbeats)  │       │
│  │  ├── Pulumi Executor (runs IaC on behalf of users)   │       │
│  │  ├── Log Aggregator  (receives agent logs)           │       │
│  │  ├── Module Registry (configuration modules)         │       │
│  │  └── REST API + WebSocket (agent connections)        │       │
│  └──────────────────────────────────────────────────────┘       │
│                                                                 │
│  ┌──────────────────────────────────────────────────────┐       │
│  │  bastion (PXE provisioning)                          │       │
│  │  Running as k3s pod with hostNetwork                 │       │
│  └──────────────────────────────────────────────────────┘       │
└──────────┬──────────────────────────────────────────────────────┘
           │ mTLS (agent certs)
           ▼
┌──────────────────────┐  ┌──────────────────────┐  ┌────────────┐
│ ser9.ad.itaz.eu      │  │ worker-2.ad.itaz.eu  │  │ AWS EC2    │
│ (bare-metal worker)  │  │ (bare-metal worker)  │  │ instances  │
│                      │  │                      │  │            │
│ lab-agent            │  │ lab-agent            │  │ lab-agent  │
│ ├── heartbeat        │  │ ├── heartbeat        │  │ ├── ...    │
│ ├── log shipping     │  │ ├── log shipping     │  │ └── ...    │
│ ├── exec handler     │  │ ├── exec handler     │  │            │
│ └── module runner    │  │ └── module runner    │  │            │
└──────────────────────┘  └──────────────────────┘  └────────────┘
```

## Components

### 1. labd (Master Daemon)

The central control plane. Runs on labmaster.ad.itaz.eu as a k3s pod.

**Responsibilities:**
- Certificate Authority — signs agent certificates, manages trust chain
- Agent Registry — tracks connected agents, heartbeats, status
- RBAC — roles, permissions, ACLs per user/group/environment/cloud
- Pulumi Executor — runs Pulumi TypeScript code submitted by users
- Log Aggregator — receives and stores logs from agents
- Module Registry — stores and distributes configuration modules
- REST API — for CLI and external integrations
- WebSocket — persistent agent connections for real-time commands

**Tech:** Fastify, PostgreSQL (via Prisma, reuse mcpctl patterns), WebSocket

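To make the Agent Registry responsibility concrete, here is a minimal sketch of how heartbeats might be turned into an online/offline status. Everything in it is an assumption for illustration: the `AgentRegistry` class, the `Heartbeat` field names, and the 30-second timeout do not come from this design, and a real registry would persist to PostgreSQL rather than an in-memory map.

```typescript
// Hypothetical sketch: names, fields, and the timeout are illustrative assumptions.
type AgentStatus = "online" | "offline";

interface Heartbeat {
  hostname: string;
  load: number;        // 1-minute load average
  diskFreePct: number; // free disk space, percent
  memFreePct: number;  // free memory, percent
}

interface AgentRecord {
  lastSeenMs: number;
  lastHeartbeat: Heartbeat;
}

class AgentRegistry {
  private readonly agents = new Map<string, AgentRecord>();
  private readonly timeoutMs: number;
  private readonly now: () => number;

  // The clock is injectable so the timeout logic can be tested deterministically.
  constructor(timeoutMs: number, now: () => number = () => Date.now()) {
    this.timeoutMs = timeoutMs;
    this.now = now;
  }

  // Record a heartbeat, creating the agent entry on first contact.
  record(hb: Heartbeat): void {
    this.agents.set(hb.hostname, { lastSeenMs: this.now(), lastHeartbeat: hb });
  }

  // An agent counts as online if it reported within the timeout window.
  status(hostname: string): AgentStatus | undefined {
    const rec = this.agents.get(hostname);
    if (rec === undefined) return undefined; // never enrolled
    return this.now() - rec.lastSeenMs <= this.timeoutMs ? "online" : "offline";
  }
}

// Usage with a fake clock (milliseconds).
let clock = 0;
const registry = new AgentRegistry(30_000, () => clock);
registry.record({ hostname: "ser9.ad.itaz.eu", load: 0.4, diskFreePct: 71, memFreePct: 52 });
clock = 10_000;
console.log(registry.status("ser9.ad.itaz.eu")); // "online"
clock = 45_000;
console.log(registry.status("ser9.ad.itaz.eu")); // "offline"
```

Deriving status from the last-seen timestamp at query time avoids a background sweeper, at the cost of not emitting an event at the moment an agent goes quiet.
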
### 2. lab-agent

Lightweight daemon running on every managed machine.

**Responsibilities:**
- Connect to labd via mTLS (agent certificate)
- Send heartbeats (status, load, disk, memory)
- Ship logs (journald → labd)
- Execute commands on demand (like `kubectl exec`)
- Run configuration modules (like `puppet agent -tv`)
- Report module run results

**Tech:** Standalone TypeScript binary (bun compiled), systemd service

### 3. lab CLI (extended)

Extends the existing `lab` CLI with platform management commands.

**New commands:**
```
# Server management
lab get servers                               # List all servers
lab get servers --env production              # Filter by environment
lab get servers --cloud baremetal             # Filter by cloud
lab get servers --label role=k3s-worker       # Filter by label
lab describe server <name>                    # Detailed server info
lab exec <server> -- <command>                # Remote command execution
lab logs <server> [-f]                        # Stream server logs

# Infrastructure as Code
lab apply -f <file.ts>                        # Execute Pulumi code via labd
lab plan -f <file.ts>                         # Dry-run Pulumi code
lab destroy -f <file.ts>                      # Tear down resources

# RBAC
lab get roles                                 # List roles
lab get users                                 # List users
lab create role <name>                        # Create role
lab bind role <role> --user <user>            # Bind role to user
lab get permissions                           # List permissions

# Environment/Cloud management
lab get environments                          # List environments
lab get clouds                                # List clouds
lab create environment <name> --cloud <cloud>

# Module management
lab get modules                               # List available modules
lab apply module <name> --target <server>     # Apply module to server
```

### 4. Certificate Authority

Built into labd. Issues and manages certificates for agents and users.

**Flow:**
```
1. Agent starts with a join token (one-time or reusable)
2. Agent generates CSR, sends to labd with token
3. labd validates token, signs certificate
4. Agent receives signed cert + CA cert
5. All future communication uses mTLS

For CLI users:
1. User runs `lab login` or `lab init`
2. labd issues a client certificate (or uses existing SSH keys)
3. CLI uses client cert for all API calls
```

**Token types:**
- **One-time token** — for individual bare-metal servers (generated during PXE provision)
- **Reusable token** — for autoscaling groups (AWS ASG instances use the same token)

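The difference between the two token types shows up at step 3 of the flow, when labd validates the token before signing. The sketch below is one possible shape for that check, not the planned implementation: `JoinTokenStore`, `issue`, and `redeem` are hypothetical names, the token strings are placeholders (real tokens would be random secrets), and CSR signing is left out entirely.

```typescript
// Hypothetical sketch: class and method names are illustrative assumptions.
type TokenKind = "one-time" | "reusable";

interface JoinToken {
  kind: TokenKind;
  used: boolean;
}

class JoinTokenStore {
  private readonly tokens = new Map<string, JoinToken>();

  // Register a token value; in practice the value would be a random secret.
  issue(value: string, kind: TokenKind): void {
    this.tokens.set(value, { kind, used: false });
  }

  // Returns true if the token may enroll an agent right now.
  // One-time tokens are consumed on first success; reusable tokens are not.
  redeem(value: string): boolean {
    const token = this.tokens.get(value);
    if (token === undefined) return false; // unknown token
    if (token.kind === "one-time") {
      if (token.used) return false;        // already consumed
      token.used = true;
    }
    return true;
  }
}

// Usage: a PXE-provisioned server versus an autoscaling group.
const store = new JoinTokenStore();
store.issue("pxe-ser9-placeholder", "one-time");
store.issue("asg-web-placeholder", "reusable");

console.log(store.redeem("pxe-ser9-placeholder")); // true
console.log(store.redeem("pxe-ser9-placeholder")); // false (already consumed)
console.log(store.redeem("asg-web-placeholder"));  // true
console.log(store.redeem("asg-web-placeholder"));  // true (reusable)
```
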
### 5. RBAC Model

Reuse mcpctl's RBAC patterns. Hierarchical permissions:

```
Cloud → Environment → Server → Action

Examples:
- baremetal:lab:*:exec     — can exec on any lab server
- baremetal:lab:puppet:*   — full access to puppet server
- aws:production:*:read    — read-only on all AWS prod servers
- *:*:*:*                  — superadmin
```

**Resources:**
- servers, environments, clouds, modules, roles, users, pulumi-stacks

**Actions:**
- read, exec, apply, destroy, manage, admin

**Whitelist/Blacklist:**
- Roles can have `allow` and `deny` rules
- Deny takes precedence (like AWS IAM)

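The permission strings and the allow/deny rules above can be combined into a small evaluator. This is a sketch under assumptions rather than the mcpctl logic: `RbacRule`, `matches`, and `isAllowed` are hypothetical names, and wildcard matching is assumed to be per segment, with `*` matching exactly one of the four segments.

```typescript
// Hypothetical sketch: rule shape and function names are illustrative assumptions.
type Effect = "allow" | "deny";

interface RbacRule {
  effect: Effect;
  pattern: string; // "cloud:environment:server:action"; "*" matches any one segment
}

// Segment-wise comparison of a pattern against a concrete target.
function matches(pattern: string, target: string): boolean {
  const p = pattern.split(":");
  const t = target.split(":");
  if (p.length !== 4 || t.length !== 4) return false;
  return p.every((segment, i) => segment === "*" || segment === t[i]);
}

// Deny takes precedence; with no matching allow rule the request is denied.
function isAllowed(rules: RbacRule[], target: string): boolean {
  let allowed = false;
  for (const rule of rules) {
    if (!matches(rule.pattern, target)) continue;
    if (rule.effect === "deny") return false;
    allowed = true;
  }
  return allowed;
}

// Usage: exec anywhere in the lab environment, except on the puppet server.
const rules: RbacRule[] = [
  { effect: "allow", pattern: "baremetal:lab:*:exec" },
  { effect: "deny", pattern: "baremetal:lab:puppet:*" },
];

console.log(isAllowed(rules, "baremetal:lab:ser9:exec"));      // true
console.log(isAllowed(rules, "baremetal:lab:puppet:exec"));    // false (explicit deny)
console.log(isAllowed(rules, "aws:production:i-abc123:read")); // false (no matching allow)
```

Returning early on the first matching deny keeps the evaluation independent of rule order, which is what "deny takes precedence" requires.
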
### 6. Module System

Configuration modules define the desired state of a server.

**Module structure:**
```
modules/
  k3s-server/
    module.yaml          # Metadata: name, version, targets, deps
    src/
      index.ts           # Module entry point
      install.ts         # Installation logic
      configure.ts       # Configuration logic
      health.ts          # Health check
    tests/
      install.test.ts
  k3s-agent/
    module.yaml
    src/
      index.ts
  labd/
    module.yaml
    src/
      index.ts           # Deploy labd to k3s
```

**module.yaml:**
```yaml
name: k3s-server
version: 0.1.0
description: Install and configure k3s server
targets:
  roles: [infra]
  labels:
    k3s: server
dependencies:
  - base-server
```

**Module sources:**
- Built-in modules (in this repo, e.g., k3s-server, labd)
- External modules (separate git repos, pulled by URL)
- Module registry (future — like Puppet Forge)

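Because `module.yaml` declares `dependencies`, something has to decide the order in which modules are applied to a server. The design does not say how; the sketch below assumes a depth-first topological sort with cycle detection. `ModuleManifest` mirrors only the manifest fields needed here, `resolveOrder` is a hypothetical helper, and the `base-server` manifest is made up for the example.

```typescript
// Hypothetical sketch: the resolution strategy is an assumption, not part of the design.
interface ModuleManifest {
  name: string;
  version: string;
  dependencies: string[];
}

// Returns the modules to apply, dependencies first, ending with the target.
function resolveOrder(manifests: ModuleManifest[], target: string): string[] {
  const byName = new Map<string, ModuleManifest>();
  for (const m of manifests) byName.set(m.name, m);

  const order: string[] = [];
  const done = new Set<string>();
  const inProgress = new Set<string>(); // modules on the current DFS path

  const visit = (name: string): void => {
    if (done.has(name)) return;
    if (inProgress.has(name)) throw new Error(`dependency cycle at ${name}`);
    const manifest = byName.get(name);
    if (manifest === undefined) throw new Error(`unknown module: ${name}`);
    inProgress.add(name);
    for (const dep of manifest.dependencies) visit(dep);
    inProgress.delete(name);
    done.add(name);
    order.push(name); // emitted only after all of its dependencies
  };

  visit(target);
  return order;
}

// Usage: k3s-server depends on base-server, as in the module.yaml above.
const manifests: ModuleManifest[] = [
  { name: "k3s-server", version: "0.1.0", dependencies: ["base-server"] },
  { name: "base-server", version: "0.1.0", dependencies: [] },
];

console.log(resolveOrder(manifests, "k3s-server")); // [ 'base-server', 'k3s-server' ]
```
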
### 7. Cloud/Environment Model

```
Cloud: baremetal
  └── Environment: lab
        ├── Server: puppet.ad.itaz.eu (role=infra, labels={k3s=server})
        ├── Server: ser9.ad.itaz.eu (role=worker, labels={k3s=agent})
        └── ...

Cloud: aws
  ├── Environment: production
  │     ├── Server: i-abc123 (from ASG web-servers)
  │     ├── Server: i-def456 (from ASG web-servers)
  │     └── ...
  └── Environment: staging
        └── ...
```

Each bastion creates an environment under the `baremetal` cloud. AWS autoscaling groups create environments under the `aws` cloud.

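The hierarchy above is also what `lab get servers` filters on. As a sketch of how the `--cloud`, `--env`, and `--label` flags might map onto this model, the example below filters an in-memory inventory. The `ManagedServer` and `ServerFilter` shapes and the `filterServers` helper are assumptions; the sample data is copied from the tree above, and the ASG server is given empty labels because the design does not list any for it.

```typescript
// Hypothetical sketch: type shapes and the helper name are illustrative assumptions.
interface ManagedServer {
  name: string;
  cloud: string;
  environment: string;
  role: string;
  labels: Record<string, string>;
}

interface ServerFilter {
  cloud?: string;
  env?: string;
  label?: string; // "key=value", as passed to --label
}

// Keep servers that satisfy every filter that is set.
function filterServers(servers: ManagedServer[], filter: ServerFilter): ManagedServer[] {
  return servers.filter((s) => {
    if (filter.cloud !== undefined && s.cloud !== filter.cloud) return false;
    if (filter.env !== undefined && s.environment !== filter.env) return false;
    if (filter.label !== undefined) {
      const eq = filter.label.indexOf("=");
      if (eq < 0) return false; // malformed label selector matches nothing
      const key = filter.label.slice(0, eq);
      const value = filter.label.slice(eq + 1);
      if (s.labels[key] !== value) return false;
    }
    return true;
  });
}

// Usage with the servers from the tree above.
const servers: ManagedServer[] = [
  { name: "puppet.ad.itaz.eu", cloud: "baremetal", environment: "lab", role: "infra", labels: { k3s: "server" } },
  { name: "ser9.ad.itaz.eu", cloud: "baremetal", environment: "lab", role: "worker", labels: { k3s: "agent" } },
  { name: "i-abc123", cloud: "aws", environment: "production", role: "worker", labels: {} },
];

console.log(filterServers(servers, { cloud: "baremetal", label: "k3s=agent" }).map((s) => s.name));
// [ 'ser9.ad.itaz.eu' ]
console.log(filterServers(servers, { env: "production" }).map((s) => s.name));
// [ 'i-abc123' ]
```
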
### 8. Pulumi Integration

Users submit Pulumi TypeScript code to labd for execution.

```bash
# Apply infrastructure code
lab apply -f infra/k3s-cluster.ts --env lab

# The file is sent to labd, which:
# 1. Checks RBAC (does user have apply permission for this env?)
# 2. Creates a Pulumi stack
# 3. Executes `pulumi up` in a sandboxed environment
# 4. Streams output back to CLI
# 5. Stores state in Pulumi backend (local or S3)
```

**Future AWS extension:**
```typescript
// infra/aws-web-servers.ts
import * as aws from "@pulumi/aws";

const asg = new aws.autoscaling.Group("web-servers", {
  maxSize: 10,
  minSize: 2,
  launchTemplate: { /* ... */ },
  // User data installs lab-agent with reusable join token
});
```

## Project Structure

```
lab/
  bastion/                    # Existing — PXE provisioning

  src/
    shared/                   # @lab/shared — types, constants, RBAC
    labd/                     # @lab/labd — master daemon
      src/
        main.ts
        server.ts
        ca/                   # Certificate Authority
        rbac/                 # RBAC engine (reuse mcpctl patterns)
        agents/               # Agent registry + WebSocket
        pulumi/               # Pulumi executor
        logs/                 # Log aggregation
        modules/              # Module registry
        routes/               # REST API
    agent/                    # @lab/agent — agent daemon
      src/
        main.ts
        connection.ts         # mTLS WebSocket to labd
        heartbeat.ts
        executor.ts           # Command execution
        logs.ts               # Log shipping
        modules.ts            # Module runner
    cli/                      # @lab/cli — extends existing CLI
      src/
        commands/
          init/bastion/       # Existing bastion commands
          provision/          # Existing provision commands
          get/                # New: get servers/roles/users/etc
          exec/               # New: remote execution
          logs/               # New: log streaming
          apply/              # New: pulumi apply
          rbac/               # New: role management

  modules/                    # Built-in modules
    k3s-server/               # Deploy k3s server
    k3s-agent/                # Deploy k3s agent
    labd/                     # Deploy labd to k3s
    lab-agent/                # Deploy lab-agent to servers

  deploy/
    k3s/                      # Existing k3s manifests for bastion
    labd/                     # k3s manifests for labd
```

## Implementation Phases

### Phase 1: Foundation (current + next)
- [x] Bastion (PXE provisioning) — DONE
- [x] CLI structure (`lab init/provision`) — DONE
- [ ] Rename puppet to labmaster, reprovision
- [ ] Deploy k3s on labmaster
- [ ] Build labd skeleton (Fastify + Prisma)
- [ ] Certificate Authority (issue/sign certs)
- [ ] Agent skeleton (connect, heartbeat)

### Phase 2: Core Platform
- [ ] RBAC engine (roles, permissions, ACLs)
- [ ] `lab get servers` with environment/cloud/label filters
- [ ] `lab exec` remote command execution
- [ ] `lab logs` streaming
- [ ] Agent auto-enrollment via PXE provision (join token in kickstart)

### Phase 3: Infrastructure as Code
- [ ] Module system (define, apply, health check)
- [ ] k3s-server module (deploy k3s)
- [ ] labd module (deploy labd to k3s)
- [ ] Pulumi executor in labd
- [ ] `lab apply -f` command

### Phase 4: Multi-Cloud
- [ ] AWS provider (Pulumi-based)
- [ ] Reusable join tokens for autoscaling groups
- [ ] Cloud/environment model
- [ ] Auto-discovery of cloud instances

## Key Design Decisions

1. **Pulumi over Puppet** — TypeScript-native, same language for IaC and platform code
2. **mTLS over SSH** — proper PKI, scalable, no key management per-server
3. **Agents connect to master** (not master pushing to agents) — works through NATs, firewalls
4. **RBAC from day one** — security-first, deny by default
5. **Module system inspired by Puppet** — declarative, testable, versionable
6. **Multi-cloud extensible** — cloud is just a label, provider is pluggable
7. **Reuse mcpctl patterns** — Prisma DB, Fastify routes, CLI structure, RBAC model