docs: comprehensive architecture document

Covers all components (bastion, labd, labctl, agent, modules),
data flow, machine lifecycle, disk layout, kickstart features,
deployment, testing, security, known issues, and planned work.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Michal
2026-03-30 17:31:29 +01:00
parent a0f6161533
commit d7a25066bd


@@ -0,0 +1,431 @@
# Lab Platform Architecture
## Overview
A bare-metal and hybrid cloud infrastructure platform for automated machine provisioning, Kubernetes cluster management, and fleet operations. The platform discovers hardware via PXE boot, installs operating systems unattended, deploys k3s clusters, and provides centralized management through a CLI and API.
**Components:**
- **bastion** -- PXE boot server (DHCP/TFTP/HTTP) for machine discovery and OS installation
- **labd** -- Master daemon for multi-bastion aggregation, persistent state, agent management
- **labctl** -- CLI tool for operators (kubectl-style interface)
- **lab-agent** -- Daemon on provisioned servers for remote execution and monitoring
- **modules** -- Declarative configuration system (k3s, labcontroller)
---
## Architecture
```
                 labctl (CLI)
                      |
              labd (master daemon)
               /      |      \
        bastion1   bastion2   ...       (PXE provisioning)
         /    \       |
[machines]   [machines]                 (bare metal)
     |            |
 lab-agent    lab-agent                 (remote exec)
```
### Communication Patterns
| Path | Protocol | Auth |
|------|----------|------|
| labctl -> labd | HTTP/HTTPS | mTLS cert (future: token) |
| bastion -> labd | WebSocket | Join token enrollment |
| lab-agent -> labd | WebSocket | mTLS certificate |
| machine -> bastion | HTTP | None (local network) |
| Anaconda -> bastion | HTTP + UDP syslog | None (install-time) |
| labctl -> bastion | HTTP | None (standalone mode) |
### Standalone vs Centralized
The bastion can operate in two modes:
1. **Standalone** -- single bastion, state in local JSON file, CLI talks directly to bastion HTTP API
2. **Centralized** -- bastion registers with labd via WebSocket, state aggregated in CockroachDB, CLI talks to labd which routes commands to the correct bastion
---
## Machine Lifecycle
```
            PXE boot
                |
       +--------v--------+
       |   DISCOVERED    |  Hardware inventory collected
       +--------+--------+
                |
     labctl provision install
                |
       +--------v--------+
       |  INSTALL_QUEUE  |  Waiting for next PXE boot
       +--------+--------+
                |
        PXE boot (Anaconda)
                |
       +--------v--------+
       |   INSTALLING    |  Progress: partitioning -> packages -> post-install
       +--------+--------+
                |
       +--------v--------+
       |    INSTALLED    |  OS ready, SSH accessible
       +--------+--------+
                |
      labctl app k3s install
                |
       +--------v--------+
       |   K3S RUNNING   |  Kubernetes node operational
       +--------+--------+
                |
    labctl provision reprovision
                |
      (back to INSTALL_QUEUE)
```
Side paths:
- **DEBUG** -- `labctl provision debug` boots Anaconda rescue mode for diagnostics
- **FORGET** -- `labctl provision forget` removes machine from all state
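The main path of the lifecycle diagram can be read as a small state machine. The following is a minimal sketch of that reading, not the platform's actual code: `MachineState`, `LifecycleEvent`, and `nextState` are hypothetical names, the `install-complete` event is an assumed label for the unlabeled INSTALLING -> INSTALLED arrow, and only the transitions drawn in the diagram are modeled (the DEBUG and FORGET side paths are left out).

```typescript
// Hypothetical names; the states mirror the boxes in the lifecycle diagram.
type MachineState =
  | "DISCOVERED"
  | "INSTALL_QUEUE"
  | "INSTALLING"
  | "INSTALLED"
  | "K3S_RUNNING";

type LifecycleEvent =
  | "provision-install" // labctl provision install
  | "pxe-boot"          // next PXE boot into Anaconda
  | "install-complete"  // assumed label: Anaconda finished
  | "k3s-install"       // labctl app k3s install
  | "reprovision";      // labctl provision reprovision

// Only the arrows drawn in the diagram are listed here.
const transitions: Record<MachineState, Partial<Record<LifecycleEvent, MachineState>>> = {
  DISCOVERED: { "provision-install": "INSTALL_QUEUE" },
  INSTALL_QUEUE: { "pxe-boot": "INSTALLING" },
  INSTALLING: { "install-complete": "INSTALLED" },
  INSTALLED: { "k3s-install": "K3S_RUNNING" },
  K3S_RUNNING: { reprovision: "INSTALL_QUEUE" },
};

// Returns the next state, or undefined when the diagram has no such arrow.
function nextState(state: MachineState, event: LifecycleEvent): MachineState | undefined {
  return transitions[state][event];
}
```

Keeping the arrows in one table makes it easy to reject commands that do not apply to a machine's current state.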
---
## Packages
### Monorepo Structure
TypeScript ESM monorepo with pnpm workspaces. Six packages:
| Package | Role | Key Tech |
|---------|------|----------|
| `@lab/shared` | Types, protocol, constants | - |
| `@lab/bastion` | PXE server | Fastify, dnsmasq |
| `@lab/cli` | CLI binary | Commander.js |
| `@lab/labd` | Master daemon | Fastify, Prisma, CockroachDB |
| `@lab/agent` | Server agent | WebSocket |
| `@lab/modules` | Config modules | SSH, k8s-client |
### @lab/shared
Core type system shared by all packages.
**State Model:**
```typescript
interface BastionState {
discovered: Record<MAC, HardwareInfo>
install_queue: Record<MAC, InstallConfig>
installed: Record<MAC, InstalledInfo>
debug: Record<MAC, DebugConfig>
}
```
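As an illustration of how a machine could be added to one of these buckets without mutating the existing state, here is a minimal sketch. The field shapes are placeholders (the source names the types but does not show their fields), the `InstallConfig` fields are guessed from the `command-install` message shown under Command Routing, and `queueInstall` is a hypothetical helper.

```typescript
// Placeholder shapes: the source names these types but does not show their fields.
type MAC = string;
type HardwareInfo = Record<string, unknown>;
type InstalledInfo = Record<string, unknown>;
type DebugConfig = Record<string, unknown>;

// Assumption: fields guessed from the command-install message under Command Routing.
interface InstallConfig {
  hostname: string;
  disk: string;
  role: string;
}

interface BastionState {
  discovered: Record<MAC, HardwareInfo>;
  install_queue: Record<MAC, InstallConfig>;
  installed: Record<MAC, InstalledInfo>;
  debug: Record<MAC, DebugConfig>;
}

// Hypothetical helper: returns a new state object and leaves the input untouched.
function queueInstall(state: BastionState, mac: MAC, config: InstallConfig): BastionState {
  return {
    ...state,
    install_queue: { ...state.install_queue, [mac]: config },
  };
}
```

Buckets that are not touched are shared by reference between the old and new state, so the copy stays cheap.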
**Roles:**
- `vanilla` -- OS only, no k3s, no cluster services
- `worker` -- k3s agent + Longhorn storage (joins existing cluster)
- `infra` -- k3s server + etcd (control plane node)
- `labcontroller` -- infra + bastion + labd + CockroachDB (self-sufficient)
**OS Support:**
- `fedora-43` -- Anaconda kickstart installer
- `ubuntu-26.04` -- cloud-init autoinstall
**Protocol:** Discriminated union message types for WebSocket communication between agents, bastions, and labd. Type guards and parsers for runtime validation.
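A discriminated union with a runtime parser might look like the sketch below. The two message types and their fields are taken from the Command Routing example in this document; the names `Message`, `isRecord`, and `parseMessage` are hypothetical, and the real protocol very likely has more message types and stricter validation.

```typescript
// Message shapes taken from the Command Routing example; names are hypothetical.
interface CommandInstall {
  type: "command-install";
  mac: string;
  hostname: string;
  disk: string;
  role: string;
}

interface CommandResponse {
  type: "command-response";
  status: string;
}

type Message = CommandInstall | CommandResponse;

// Type guard: narrows unknown JSON to a plain object.
function isRecord(value: unknown): value is Record<string, unknown> {
  return typeof value === "object" && value !== null;
}

// Parses a raw WebSocket frame; returns undefined for anything malformed.
function parseMessage(raw: string): Message | undefined {
  let data: unknown;
  try {
    data = JSON.parse(raw);
  } catch {
    return undefined;
  }
  if (!isRecord(data)) return undefined;

  const { type, mac, hostname, disk, role, status } = data;
  if (type === "command-install") {
    if (
      typeof mac === "string" &&
      typeof hostname === "string" &&
      typeof disk === "string" &&
      typeof role === "string"
    ) {
      return { type, mac, hostname, disk, role };
    }
    return undefined;
  }
  if (type === "command-response" && typeof status === "string") {
    return { type, status };
  }
  return undefined;
}
```

Callers can then `switch` on `message.type` and let the compiler narrow the fields for each case.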
### @lab/bastion
PXE boot server that handles the physical provisioning lifecycle.
**Services:**
- `StateManager` -- JSON file persistence with immutable update pattern
- `SyslogListener` -- UDP syslog receiver (port 5514) for Anaconda install logs
- `InstallLogBuffer` -- In-memory ring buffer + disk persistence per machine
- `BastionConnection` -- WebSocket client to labd for centralized mode
- dnsmasq management (spawn, config generation, proxy/full DHCP)
- Network auto-detection (interface, IP, subnet, gateway)
- ISO builder (xorriso + mtools for non-PXE machines)
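The in-memory half of `InstallLogBuffer` is described as a ring buffer. A minimal sketch of that idea follows, assuming a fixed line capacity per machine; `LineRingBuffer` is a hypothetical class name, and disk persistence is deliberately left out.

```typescript
// Hypothetical sketch of a fixed-capacity ring buffer for log lines.
class LineRingBuffer {
  private readonly lines: string[];
  private readonly capacity: number;
  private next = 0;  // slot the next line will be written to
  private count = 0; // number of valid lines currently stored

  constructor(capacity: number) {
    if (!Number.isInteger(capacity) || capacity <= 0) {
      throw new RangeError("capacity must be a positive integer");
    }
    this.capacity = capacity;
    this.lines = new Array<string>(capacity);
  }

  // Appends a line, overwriting the oldest one once the buffer is full.
  push(line: string): void {
    this.lines[this.next] = line;
    this.next = (this.next + 1) % this.capacity;
    this.count = Math.min(this.count + 1, this.capacity);
  }

  // Returns the stored lines from oldest to newest.
  toArray(): string[] {
    const out: string[] = [];
    const first = (this.next - this.count + this.capacity) % this.capacity;
    for (let i = 0; i < this.count; i++) {
      out.push(this.lines[(first + i) % this.capacity] as string);
    }
    return out;
  }
}
```

A bounded buffer keeps memory use flat even when an install produces a very long log.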
**HTTP Routes:**
| Endpoint | Purpose |
|----------|---------|
| `GET /dispatch?mac=` | Dynamic iPXE script (discover/install/debug/local-boot) |
| `GET /ks?mac=` | Per-machine Anaconda kickstart |
| `GET /debug.ks` | Rescue mode kickstart |
| `GET /debug-setup.sh` | nc listener setup script for rescue shell |
| `GET /discover.ks` | Hardware discovery kickstart |
| `POST /api/discover` | Hardware inventory report |
| `POST /api/install` | Queue machine for install |
| `POST /api/progress` | Install progress callback |
| `POST /api/log` | Raw log line ingestion |
| `POST /api/debug` | Queue debug/rescue mode |
| `GET /api/machines` | List all machines |
| `GET /api/logs/:mac` | Install logs + progress |
| `GET /api/logs/:mac/follow` | SSE stream of progress events |
| `DELETE /api/machines/:mac` | Forget machine |
**Templates:**
- `boot.ipxe.ts` -- iPXE scripts for each boot mode (discover, install, debug, pxe-boot-debug, local-boot)
- `install.ks.ts` -- Full Fedora kickstart with LVM, SSH, k3s prereqs, progress callbacks, SysRq keys
- `debug.ks.ts` -- Minimal rescue kickstart (SSH via inst.sshd)
- `ubuntu-autoinstall.ts` -- cloud-init for Ubuntu
- `dnsmasq.conf.ts` -- DHCP/TFTP configuration
**Boot Dispatch Logic:**
```
1. debug[mac]? -> renderDebugIpxe (auto-clear after serving)
2. install_queue[mac]? -> renderInstallIpxe
3. installed[mac]? -> renderLocalBootIpxe (exit to disk)
4. unknown -> renderDiscoverIpxe
```
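The precedence above can be sketched as a pure function. This is an illustrative reading of the four rules, not the bastion's actual handler: `BootMode`, `DispatchState`, and `dispatch` are hypothetical names, and the function returns a mode label rather than calling the `render*Ipxe` templates.

```typescript
// Hypothetical names; the four modes follow the dispatch rules listed above.
type BootMode = "debug" | "install" | "local-boot" | "discover";

interface DispatchState {
  debug: Record<string, unknown>;
  install_queue: Record<string, unknown>;
  installed: Record<string, unknown>;
}

const has = (bucket: Record<string, unknown>, mac: string): boolean =>
  Object.prototype.hasOwnProperty.call(bucket, mac);

// Returns the boot mode for a MAC plus the state to keep afterwards.
function dispatch(state: DispatchState, mac: string): { mode: BootMode; state: DispatchState } {
  if (has(state.debug, mac)) {
    // Debug is one-shot: clear the entry so the next boot falls through.
    const debug = { ...state.debug };
    delete debug[mac];
    return { mode: "debug", state: { ...state, debug } };
  }
  if (has(state.install_queue, mac)) return { mode: "install", state };
  if (has(state.installed, mac)) return { mode: "local-boot", state };
  return { mode: "discover", state };
}
```

Because debug is checked first and cleared on serve, a machine that is both queued for install and flagged for debug boots rescue mode once and then installs on the following boot.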
### @lab/labd
Central management daemon. Aggregates multiple bastions, stores persistent state in CockroachDB, relays commands, manages agent fleet.
**Database (Prisma + CockroachDB):**
- `Server` -- hostname, MAC, IP, role, status, cloud, environment, labels
- `Bastion` -- hostname, network, serverIp, lastHeartbeat
- `Agent` -- certificate, enrollment, heartbeat
- `Cluster` -- name, cloud, environment, kubeconfig (encrypted)
- `User` / `Role` / `Permission` -- RBAC (action:cloud:env:server matrix)
- `JoinToken` -- one-time/reusable enrollment tokens
- `AuditLog` -- action, resource, result, timestamp
**Key Services:**
- `BastionRegistry` -- in-memory registry of connected bastions, state aggregation, MAC-to-bastion routing
- `AgentRegistry` -- connected agents, heartbeat tracking
- `MessageRouter` -- command relay between CLI/agents and bastions
**Command Routing:**
```
CLI: labctl provision install <mac> <hostname>
-> POST /api/machines/install
-> labd finds bastion that knows this MAC
-> WebSocket: {type: "command-install", mac, hostname, disk, role}
-> bastion updates install_queue
-> WebSocket: {type: "command-response", status: "ok"}
-> HTTP response to CLI
```
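The "labd finds bastion that knows this MAC" step could be sketched as below. This is an assumption-heavy illustration: `ConnectedBastion`, `findBastionForMac`, and `routeInstall` are hypothetical names, MACs are assumed to be stored lowercase, and `send` stands in for whatever WebSocket wrapper labd really uses.

```typescript
// Hypothetical view of a bastion connected to labd.
interface ConnectedBastion {
  hostname: string;
  macs: ReadonlySet<string>;     // assumption: lowercase MACs this bastion has seen
  send(message: string): void;   // stand-in for the real WebSocket send
}

interface InstallRequest {
  mac: string;
  hostname: string;
  disk: string;
  role: string;
}

// Returns the first connected bastion that knows the MAC, if any.
function findBastionForMac(
  bastions: readonly ConnectedBastion[],
  mac: string,
): ConnectedBastion | undefined {
  const wanted = mac.toLowerCase();
  return bastions.find((bastion) => bastion.macs.has(wanted));
}

// Relays the install command; returns false when no bastion knows the MAC.
function routeInstall(bastions: readonly ConnectedBastion[], request: InstallRequest): boolean {
  const target = findBastionForMac(bastions, request.mac);
  if (target === undefined) return false;
  target.send(JSON.stringify({ type: "command-install", ...request }));
  return true;
}
```

Returning `false` for an unknown MAC lets the HTTP layer answer the CLI with an error instead of waiting for a response that never arrives.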
### @lab/cli (labctl)
Operator CLI. Commander.js binary, distributed as RPM/DEB or standalone bun-compiled executable.
**Command Groups:**
```
labctl init bastion standalone start|stop|status
labctl provision list|install|reprovision|forget|debug|logs|makeiso
labctl app k3s install|health|list
labctl config list|get|set|path
labctl login
labctl doctor
labctl roles
```
**Key Features:**
- Target resolution: hostname, MAC, or IP -> machine lookup
- SSH reboot into PXE for reprovision/debug (efibootmgr --bootnext)
- Follow mode: `labctl provision logs <target> -f` (5s polling)
- Shell completions: bash, fish
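Target resolution first has to decide what kind of identifier the operator typed. A minimal sketch of that classification step, assuming IPv4 only and colon- or dash-separated MACs; `TargetKind` and `classifyTarget` are hypothetical names, and the subsequent machine lookup is not shown.

```typescript
type TargetKind = "mac" | "ip" | "hostname";

// Classifies a CLI target string. Assumption: IPv4 only; IPv6 is not handled.
function classifyTarget(input: string): TargetKind {
  const value = input.trim();

  // Six hex octets separated by ":" or "-".
  if (/^[0-9a-f]{2}(?:[:-][0-9a-f]{2}){5}$/i.test(value)) return "mac";

  // Four decimal octets, each in the range 0..255.
  const octets = value.split(".");
  const isIpv4 =
    octets.length === 4 &&
    octets.every((octet) => /^\d{1,3}$/.test(octet) && Number(octet) <= 255);
  if (isIpv4) return "ip";

  // Everything else is treated as a hostname.
  return "hostname";
}
```

Treating anything unrecognized as a hostname keeps the function total; a failed lookup then reports the unknown name.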
### @lab/modules
Declarative configuration modules with three-phase lifecycle: install -> configure -> health.
**k3s Module:**
- 5 operation groups: host-prep, networking, k3s-server, k3s-agent, hardening
- 15+ individual operations: kernel modules, sysctl, firewall, Cilium CNI, SELinux, audit policy, pod security, cert checks
- Health checks: service running, node ready, API health, pod status, Cilium status, secrets encryption
- SSH execution backend with progress callbacks
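The three-phase lifecycle with progress callbacks could be modeled roughly as follows. The interface is an assumption for illustration only: `LabModule`, `HealthResult`, and `applyModule` are hypothetical names, and the real modules run their phases over SSH rather than in-process.

```typescript
// Hypothetical shapes for a module with an install -> configure -> health lifecycle.
interface HealthResult {
  healthy: boolean;
  details: string[];
}

interface LabModule {
  name: string;
  install(): Promise<void>;
  configure(): Promise<void>;
  health(): Promise<HealthResult>;
}

type Phase = "install" | "configure" | "health";

// Runs the phases in order and reports each one before it starts.
// A rejected phase aborts the run, so later phases never execute.
async function applyModule(
  mod: LabModule,
  onProgress: (phase: Phase) => void,
): Promise<HealthResult> {
  onProgress("install");
  await mod.install();
  onProgress("configure");
  await mod.configure();
  onProgress("health");
  return mod.health();
}
```

Ending on the health phase means a successful run always carries evidence that the module actually works, not just that the commands exited cleanly.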
### @lab/agent
Daemon on provisioned servers. WebSocket to labd for:
- Heartbeat (hostname, uptime, CPU/mem usage)
- Command execution (with stdout/stderr streaming)
- Log streaming (journalctl relay)
- mTLS certificate enrollment and rotation
---
## Disk Layout
### LVM Partitioning (labvg)
All roles share a common LVM layout. The kickstart `%pre` auto-detects the install disk (NVMe preferred, then SATA, skipping USB/removable).
| Volume | Size | FS | Reprovision |
|--------|------|-----|-------------|
| `/boot/efi` | 600 MB | vfat | Reused |
| `/boot` | 3 GB | ext4 | Reused |
| `swap` | 27 GB | swap | Recreated |
| `/` (root) | 33 GB | xfs | Recreated |
| `/var` | 100 GB | xfs | Recreated |
| `/var/log` | 10 GB | xfs | Recreated |
| `/home` | 10 GB | xfs | **Preserved** |
| `/srv` | 20 GB | xfs | **Preserved** |
| `/var/lib/longhorn` | remaining | xfs | **Preserved** (worker) |
| `/var/lib/rancher` | 20 GB | xfs | **Preserved** (infra) |
| `/tmp` | 4 GB | tmpfs | - |
Reprovision detection: if `labvg` VG exists, reuse EFI/boot partitions and preserve data volumes.
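The reprovision column of the table can be expressed as data, which makes the per-role outcome easy to check. This is a sketch only: `Volume`, `layout`, and `reprovisionPlan` are hypothetical names, the role annotations are read as "this volume applies to that role only", and whether `labcontroller` inherits the `infra` volume is not stated in the table, so it is not assumed here. `/tmp` is omitted because it is a tmpfs.

```typescript
type Role = "vanilla" | "worker" | "infra" | "labcontroller";
type Action = "reuse" | "recreate" | "preserve";

interface Volume {
  mount: string;
  action: Action;   // what happens to the volume on reprovision
  onlyFor?: Role;   // role annotation from the table, if any
}

// Transcribed from the disk layout table.
const layout: readonly Volume[] = [
  { mount: "/boot/efi", action: "reuse" },
  { mount: "/boot", action: "reuse" },
  { mount: "swap", action: "recreate" },
  { mount: "/", action: "recreate" },
  { mount: "/var", action: "recreate" },
  { mount: "/var/log", action: "recreate" },
  { mount: "/home", action: "preserve" },
  { mount: "/srv", action: "preserve" },
  { mount: "/var/lib/longhorn", action: "preserve", onlyFor: "worker" },
  { mount: "/var/lib/rancher", action: "preserve", onlyFor: "infra" },
];

// Groups the mount points that apply to a role by reprovision action.
function reprovisionPlan(role: Role): Record<Action, string[]> {
  const plan: Record<Action, string[]> = { reuse: [], recreate: [], preserve: [] };
  for (const volume of layout) {
    if (volume.onlyFor !== undefined && volume.onlyFor !== role) continue;
    plan[volume.action].push(volume.mount);
  }
  return plan;
}
```

For example, `reprovisionPlan("worker").preserve` lists `/home`, `/srv`, and `/var/lib/longhorn`.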
---
## Kickstart Features
The Fedora kickstart template (`install.ks.ts`) includes:
- **Dynamic disk detection** -- `%pre` probes NVMe/SATA/virtio, skips USB/removable, supports both fresh install and reprovision
- **Progress callbacks** -- `curl -sf` POSTs to `/api/progress` at each stage (partitioning, post-install substeps, complete)
- **Anaconda syslog forwarding** -- `logging --host --port` streams real-time install logs to bastion
- **SSH hardening** -- key-only auth, root login via pubkey only, admin user with passwordless sudo
- **Network-first boot order** -- `efibootmgr` reorders boot entries so PXE is always first (bastion controls every reboot)
- **SysRq magic keys** -- `kernel.sysrq=1` for emergency reboot via KVM keyboard
- **Role-specific setup:**
  - `vanilla`: chronyd only
  - `worker`/`infra`: kernel modules (br_netfilter, overlay), sysctl (ip_forward, inotify), firewalld disabled, k3s binary installed
  - `infra`: k3s server binary pre-installed
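The dynamic disk detection from the first bullet runs as shell in `%pre`, but its selection rule can be modeled in a few lines. The sketch below is an illustration under assumptions: `Disk` and `pickInstallDisk` are hypothetical names, and placing virtio after SATA is a guess, since the document only states that NVMe is preferred over SATA and that USB/removable devices are skipped.

```typescript
type Transport = "nvme" | "sata" | "virtio" | "usb";

interface Disk {
  name: string;       // e.g. "nvme0n1" or "sda"
  transport: Transport;
  removable: boolean;
}

// Assumption: virtio ranks last; the source only fixes NVMe before SATA.
const preference: readonly Transport[] = ["nvme", "sata", "virtio"];

// Picks the install disk, or undefined when no usable disk is present.
function pickInstallDisk(disks: readonly Disk[]): Disk | undefined {
  const usable = disks.filter((disk) => !disk.removable && disk.transport !== "usb");
  for (const transport of preference) {
    const match = usable.find((disk) => disk.transport === transport);
    if (match !== undefined) return match;
  }
  return undefined;
}
```

Returning `undefined` rather than a default device avoids silently installing onto the wrong disk.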
**What is NOT in the kickstart:**
- `console=ttyS0` -- causes 30s-per-step boot timeout on hardware without physical serial UART (discovered 2026-03-30, see docs/pxe-boot-debugging-2026-03-30.md)
- Background log streamer (`tail -f`) -- prevents Anaconda from syncing the filesystem, so `%post` writes do not persist
---
## Deployment
### Container Images
**bastion** (`Dockerfile.bastion`):
- Base: Fedora 43 (needs dnsmasq, iPXE)
- Multi-stage: Alpine build -> Fedora runtime
- iPXE rebuilt from source (SNP driver for EFI)
- hostNetwork in k8s (DHCP needs raw sockets)
- Capabilities: NET_ADMIN, NET_RAW
**labd** (`Dockerfile.labd`):
- Base: Alpine (minimal)
- Multi-stage build with Prisma client generation
- Runs as non-root `node` user
### Kubernetes (k3s)
```
Namespace: lab-infra
  Deployment: bastion (hostNetwork, PVC for /data, host SSH keys)
  ConfigMap: bastion-config (env vars)
  Secret: bastion-join-token
  PVC: bastion-state (local-path)

Namespace: lab-system
  Deployment: labd
  Service: labd (NodePort 30100)
  StatefulSet: cockroachdb-0
```
### CLI Distribution
Built with `nfpm` as RPM/DEB. Includes:
- `/usr/bin/labctl` (bun-compiled standalone binary)
- `/usr/share/bash-completion/completions/labctl`
- `/usr/share/fish/vendor_completions.d/labctl.fish`
Config: `~/.labctl/config.yaml` with `labdUrl`, output format, default cloud/environment.
---
## Build & Release
```bash
# Development
pnpm install && pnpm build # Compile all packages
pnpm test:run # Unit tests (vitest)
npx tsc --noEmit # Type check
# Deploy
bash scripts/deploy.sh all # Build containers + RPM, push, restart pods
bash scripts/deploy.sh bastion # Just bastion
bash scripts/deploy.sh labd # Just labd
bash scripts/deploy.sh labctl # Just CLI (local RPM install)
# Container builds
bash scripts/build-bastion.sh --platforms linux/amd64 --push latest
bash scripts/build-labd.sh --platforms linux/amd64 --push latest
bash scripts/build-rpm.sh # RPM + DEB packages
# Integration tests (require libvirt, sudo)
sudo tests/integration/run-pxe-test.sh
```
Registry: `mysources.co.uk` (Gitea at 10.0.0.194:3012)
---
## Testing
### Unit Tests
- Kickstart rendering (ksvalidator syntax check, partition layout, role-specific sections)
- State management (load, save, update, debug field)
- Dispatch routing (correct iPXE script for each machine state)
- Syslog listener (UDP receive, IP->MAC resolution, RFC 3164 parsing)
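The RFC 3164 parsing that the syslog listener tests exercise can be sketched as below. This is not the listener's real parser: `SyslogMessage` and `parseRfc3164` are hypothetical names, the regular expression accepts only the classic `<PRI>Mmm dd hh:mm:ss host message` shape, and the IP->MAC resolution step is not shown.

```typescript
interface SyslogMessage {
  facility: number;  // PRI divided by 8
  severity: number;  // PRI modulo 8
  timestamp: string; // e.g. "Mar 30 17:31:29" (no year in RFC 3164)
  hostname: string;
  message: string;   // tag and content, left unsplit
}

// <PRI>Mmm dd hh:mm:ss HOSTNAME MESSAGE; single-digit days are space-padded.
const RFC3164 = /^<(\d{1,3})>([A-Z][a-z]{2} [ \d]\d \d{2}:\d{2}:\d{2}) (\S+) (.*)$/;

// Parses one syslog line; returns undefined when it does not match the format.
function parseRfc3164(line: string): SyslogMessage | undefined {
  const match = RFC3164.exec(line);
  if (match === null) return undefined;

  const [, priText = "", timestamp = "", hostname = "", message = ""] = match;
  const pri = Number(priText);
  if (pri > 191) return undefined; // 23 facilities * 8 + 7 is the maximum PRI

  return {
    facility: pri >> 3,
    severity: pri & 7,
    timestamp,
    hostname,
    message,
  };
}
```

Real senders deviate from the RFC fairly often, so a production parser would likely fall back to storing the raw line when the pattern does not match.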
### Integration Tests (libvirt VMs)
- **pxe-provision.test.ts** -- Full end-to-end: create VM -> PXE discovery -> queue install -> Anaconda install -> SSH verification -> systemd health -> SELinux enforcing -> boot order check
- **iso-provision.test.ts** -- ISO boot for non-PXE machines
- **k3s-single-node.test.ts** -- Post-provision k3s installation and health
- VM screenshot capture during boot for debugging
---
## Security
- **mTLS** for agent-labd communication (certificate enrollment via join tokens)
- **SSH key-only auth** on provisioned machines (no password auth)
- **SELinux enforcing** verified in integration tests
- **RBAC** (planned): action:cloud:environment:server permission matrix
- **Audit logging** (planned): every mutation tracked in CockroachDB
- **Network-first boot order** prevents machines from booting without bastion approval
- **SysRq keys** enabled for emergency reboot without SSH access
---
## Known Issues & Lessons Learned
### Serial Console Boot Delay (2026-03-30)
`console=ttyS0,115200n8` in kernel cmdline causes 30-second timeout at every systemd boot phase on hardware without a physical serial UART. Root cause: systemd blocks writing to non-existent UART. Fix: removed from kickstart entirely.
### Anaconda %post Log Streamer
Background `tail -f` in kickstart `%post` prevents Anaconda from syncing the filesystem. All file writes in %post appear to succeed but are lost on reboot. Fix: removed background log streamer, replaced with Anaconda's built-in `logging --host --port` syslog forwarding.
### Disk Auto-Detection
Hardcoded `/dev/sda` default broke NVMe-only machines. Fix: default to empty string (auto-detect) which triggers the `%pre` disk probe logic.
### Anaconda Rescue Mode Limitations
`%pre` and `%post` sections do not execute in `inst.rescue` mode. SSH in rescue mode is provided by Anaconda's `inst.sshd` kernel parameter plus the `sshpw` kickstart directive. The nc listener is not started automatically; set it up manually with `curl bastion:8080/debug-setup.sh | bash`.
---
## Planned Work (Taskmaster)
13 tasks in queue, all pending:
1. **#72** Expand Prisma schema with resource relationships (Network, ServerNic, ServerDisk, ClusterMember)
2. **#73** State persistence service (bastion state -> CockroachDB)
3. **#74** State loading from labd on bastion startup
4. **#75** Fix bastion --dir env var default
5. **#76** Resource type registry with aliases (kubectl-style)
6. **#77** `labctl get <resource>` command
7. **#78** `labctl describe <resource>` command
8. **#79** `labctl create/delete` commands
9. **#80** Refactor provision commands to kubectl-style
10. **#81** Server and resource API endpoints in labd
11. **#82** RBAC permission checks in CLI
12. **#83** Audit logging for resource operations
13. **#84** Update CLI entry point and help text
Additional items not in taskmaster:
- Ubuntu autoinstall disk auto-detect (still defaults to /dev/sda)
- Verify `inst.sshd` works end-to-end in rescue mode
- k3s cluster join vs new cluster distinction in `labctl app k3s install`
- arm64 container build (iPXE cross-compilation broken)