PRD: Resource Tracking & kubectl-style CLI
Problem
The lab platform currently has fragmented state management:
- Bastion keeps machine state in an ephemeral JSON file (/tmp/lab-bastion/state.json) that is lost on pod restart
- labd receives state syncs from bastions but only stores them in memory; the Server table in CockroachDB is never written to
- There is no system to track relationships between resources (servers belong to clusters, clusters run on servers, networks connect servers)
- The CLI (labctl) uses an inconsistent verb-noun structure (labctl provision list, labctl app k3s install) instead of a uniform resource-oriented pattern
- RBAC permissions reference resources (server, cloud, environment) but there is no resource registry to validate against
Vision
A unified resource tracking system where all infrastructure objects (servers, clusters, networks, bastions, VMs) are persisted in CockroachDB via labd, with relationships between them, and managed through a kubectl-style CLI. This replaces the ephemeral JSON state and becomes the single source of truth for the platform.
Current State
Database (CockroachDB via Prisma)
Existing models that are scaffolded but mostly unused:
- Server: hostname, mac, cloud, environment, role, labels, ip, status (0 rows)
- Agent: mTLS certificate enrollment per server (0 rows)
- Bastion: PXE server registration (1 row, labmaster)
- Cluster: k8s cluster metadata (0 rows)
- User, Role, Permission, UserRole: RBAC framework (seeded with 3 roles, 6 permissions)
- JoinToken: agent/bastion enrollment tokens
- AuditLog: action audit trail
Bastion State (ephemeral JSON)
Three categories tracked per-bastion:
- discovered: machines found via PXE with hardware info (CPU, RAM, disks, NICs, arch)
- install_queue: machines queued for OS install with progress tracking
- installed: machines with OS installed (hostname, role, IP, OS)
CLI Structure (current)
labctl init bastion standalone [start|stop|status]
labctl provision [list|install|reprovision|forget|logs]
labctl app [k3s|labcontroller]
labctl config [list|get|set]
labctl roles
labctl doctor
labctl login
labctl logs
Requirements
1. Persist Bastion State to Database
When labd receives bastion-state-sync messages, it must upsert machines into the Server table:
- Discovered machines → create/update Server with status "discovered", store HardwareInfo as JSON labels
- Queued machines → update Server status to "provisioning"
- Installed machines → update Server with hostname, IP, role, OS, status "installed"
- Track which bastion owns which server (add bastionId to Server model)
- Track hardware info: arch, cpu_model, cpu_cores, memory_gb, disks, nics
The bastion's local JSON state becomes a cache; labd's database is the source of truth. On startup, the bastion should load its state from labd if available.
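
The mapping above can be sketched as a pure function that turns one bastion-state-sync payload into one upsert record per MAC address. This is only an illustration: the BastionState, HardwareInfo, and ServerUpsert shapes and the toServerUpserts name are hypothetical (the real message types live in @lab/shared/protocol and may differ), and the rule that a later category overrides an earlier one for the same MAC is an assumption, not something this PRD specifies.

```typescript
// Hypothetical shapes for illustration; the real types in @lab/shared/protocol may differ.
interface HardwareInfo {
  arch: string;
  cpu_model: string;
  cpu_cores: number;
  memory_gb: number;
}

interface BastionState {
  discovered: { mac: string; hardware: HardwareInfo }[];
  install_queue: { mac: string }[];
  installed: { mac: string; hostname: string; ip: string; role: string; os: string }[];
}

type ServerStatus = "discovered" | "provisioning" | "installed";

interface ServerUpsert {
  mac: string;
  bastionId: string;
  status: ServerStatus;
  hostname?: string;
  ip?: string;
  role?: string;
  os?: string;
  hardware?: HardwareInfo;
}

// Collapse the three bastion categories into one record per MAC.
// Assumption: categories are applied in the order discovered, install_queue,
// installed, so a later category overrides the status set by an earlier one.
function toServerUpserts(state: BastionState, bastionId: string): ServerUpsert[] {
  const byMac = new Map<string, ServerUpsert>();

  const merge = (mac: string, patch: Omit<ServerUpsert, "mac" | "bastionId">): void => {
    const key = mac.toLowerCase(); // normalise so the MAC can act as the upsert key
    const prev: ServerUpsert = byMac.get(key) ?? { mac: key, bastionId, status: patch.status };
    byMac.set(key, { ...prev, ...patch }); // fields absent from the patch are kept
  };

  for (const m of state.discovered) {
    merge(m.mac, { status: "discovered", hardware: m.hardware });
  }
  for (const m of state.install_queue) {
    merge(m.mac, { status: "provisioning" });
  }
  for (const m of state.installed) {
    merge(m.mac, { status: "installed", hostname: m.hostname, ip: m.ip, role: m.role, os: m.os });
  }
  return Array.from(byMac.values());
}
```

Each returned record could then be handed to a database upsert keyed on the MAC address. Keeping the mapping free of database calls makes it easy to unit test without CockroachDB.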
2. Resource Model Expansion
Add new models to the Prisma schema for tracking infrastructure:
Network — L2/L3 network segments
- name, cidr, vlan, gateway, domain, dhcpEnabled
- Servers have NICs on networks
ServerNic — NIC-to-network mapping
- serverId, networkId, mac, ip, name, state (UP/DOWN)
- Derived from HardwareInfo during discovery
ServerDisk — Disk inventory per server
- serverId, name, sizeGb, model
- Derived from HardwareInfo during discovery
ClusterMember — Server-to-cluster membership
- clusterId, serverId, role (control-plane, worker)
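
As a rough sketch of how ServerNic and ServerDisk rows might be derived from HardwareInfo during discovery, the functions below map discovered NICs and disks to rows and attach each NIC to the first known network whose CIDR contains its IPv4 address. The DiscoveredNic and DiscoveredDisk shapes, the helper names, and the CIDR-based matching rule are all assumptions made for illustration; the PRD only lists the target fields.

```typescript
// Hypothetical discovery shapes; only the row field names come from the PRD.
interface DiscoveredNic { name: string; mac: string; ip?: string; state: "UP" | "DOWN" }
interface DiscoveredDisk { name: string; size_gb: number; model?: string }
interface NetworkRef { id: string; cidr: string }

interface ServerNicRow {
  serverId: string;
  networkId: string | null;
  mac: string;
  ip: string | null;
  name: string;
  state: "UP" | "DOWN";
}
interface ServerDiskRow { serverId: string; name: string; sizeGb: number; model: string | null }

// Convert dotted IPv4 to a number using arithmetic (avoids signed 32-bit shifts).
function ipv4ToInt(ip: string): number {
  return ip.split(".").reduce((acc, octet) => acc * 256 + Number(octet), 0);
}

// True when the IPv4 address falls inside the CIDR block.
function inCidr(ip: string, cidr: string): boolean {
  const [base = "", bits = "32"] = cidr.split("/");
  const blockSize = 2 ** (32 - Number(bits));
  return Math.floor(ipv4ToInt(ip) / blockSize) === Math.floor(ipv4ToInt(base) / blockSize);
}

// Assumption: a NIC belongs to the first network whose CIDR contains its IP.
function toNicRows(serverId: string, nics: DiscoveredNic[], networks: NetworkRef[]): ServerNicRow[] {
  return nics.map((nic) => {
    const ip = nic.ip;
    const match = ip === undefined ? undefined : networks.find((n) => inCidr(ip, n.cidr));
    return {
      serverId,
      networkId: match ? match.id : null,
      mac: nic.mac.toLowerCase(),
      ip: ip ?? null,
      name: nic.name,
      state: nic.state,
    };
  });
}

function toDiskRows(serverId: string, disks: DiscoveredDisk[]): ServerDiskRow[] {
  return disks.map((d) => ({ serverId, name: d.name, sizeGb: d.size_gb, model: d.model ?? null }));
}
```

NICs without an address keep networkId as null, so they still appear in the inventory and can be linked to a network later.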
3. kubectl-style CLI Redesign
Restructure labctl to follow the mcpctl / kubectl pattern:
# Core CRUD verbs that work on any resource
labctl get <resource> [name] # List or get specific resource
labctl describe <resource> <name> # Detailed view with relationships
labctl create <resource> [flags] # Create a resource
labctl delete <resource> <name> # Delete a resource
labctl edit <resource> <name> # Edit in $EDITOR
labctl apply -f <file> # Declarative apply from YAML
# Resource types (with aliases)
servers (server, srv)
clusters (cluster)
networks (network, net)
bastions (bastion)
roles (role)
users (user)
tokens (token)
audit (audit)
# Output formats
-o table (default), -o json, -o yaml, -o wide
# Examples
labctl get servers # List all servers
labctl get servers -o wide # With extra columns (disks, NICs)
labctl get server labmaster # Get specific server
labctl describe server labmaster # Full details + relationships
labctl get servers --role worker # Filter by role
labctl get servers --status discovered # Filter by status
labctl get clusters # List clusters
labctl describe cluster lab-k3s # Cluster members, health
labctl get networks # List networks
labctl create network --name lab --cidr 192.168.8.0/24 --gateway 192.168.8.1
# Provisioning becomes actions on server resources
labctl provision <server> --os fedora-43 --role worker # Queue install
labctl reprovision <server> # Reinstall
labctl forget <server> # Remove from tracking
# App management stays as-is but simplified
labctl app install k3s <server>
labctl app health k3s [server]
# Admin
labctl bastion start [--foreground] # Start local bastion
labctl bastion status # Bastion health
labctl login # Auth
labctl doctor # Diagnostics
4. Resource Aliases & Resolution
Follow mcpctl's pattern from shared.ts:
- Accept singular, plural, and short aliases: server, servers, srv all resolve to the same resource
- Accept name or ID: labctl get server labmaster or labctl get server <uuid>
- Accept MAC address for servers: labctl get server 38:05:25:33:e2:e4
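
A minimal sketch of the resolution rules above, assuming a static alias table built from the resource list in section 3. The names RESOURCE_ALIASES, resolveResource, and classifyIdentifier are hypothetical and are not taken from mcpctl's shared.ts, whose actual implementation may look quite different.

```typescript
// Canonical (plural) resource name -> accepted aliases, taken from section 3.
const RESOURCE_ALIASES: Record<string, string[]> = {
  servers: ["server", "srv"],
  clusters: ["cluster"],
  networks: ["network", "net"],
  bastions: ["bastion"],
  roles: ["role"],
  users: ["user"],
  tokens: ["token"],
  audit: [],
};

// Resolve any accepted spelling to the canonical resource name, or throw.
function resolveResource(input: string): string {
  const wanted = input.toLowerCase();
  for (const [canonical, aliases] of Object.entries(RESOURCE_ALIASES)) {
    if (wanted === canonical || aliases.includes(wanted)) return canonical;
  }
  throw new Error(`unknown resource type: ${input}`);
}

type IdentifierKind = "mac" | "uuid" | "name";

const MAC_RE = /^([0-9a-f]{2}:){5}[0-9a-f]{2}$/i;
const UUID_RE = /^[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}$/i;

// Decide which column to look a server up by. Anything that is neither a
// colon-separated MAC nor a UUID is treated as a name.
function classifyIdentifier(value: string): IdentifierKind {
  if (MAC_RE.test(value)) return "mac";
  if (UUID_RE.test(value)) return "uuid";
  return "name";
}
```

With this split, labctl get srv 38:05:25:33:e2:e4 would resolve to the servers resource and a lookup by MAC, while labctl get server labmaster would fall through to a lookup by name.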
5. RBAC Integration
The existing Permission model uses action:cloud:environment:server patterns. Wire this into the resource system:
- CLI commands check permissions before executing
- labctl get respects read permissions (only show resources the user can see)
- labctl provision requires apply permission on the target server
- labctl delete requires destroy permission
- Audit all resource operations to the AuditLog table
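
One possible shape for the permission check is sketched below. It assumes that each granted permission is a four-segment action:cloud:environment:server string and that a "*" segment acts as a wildcard. The wildcard rule, the function names, and the sample cloud and environment values are assumptions for illustration; the PRD only states the pattern format and the command-to-action pairs for get, provision, and delete.

```typescript
// Command -> required action, as listed in the requirements above.
const COMMAND_ACTIONS: Record<string, string> = {
  get: "read",
  provision: "apply",
  delete: "destroy",
};

// Compare one granted pattern with one concrete request, segment by segment.
// Assumption: "*" in a granted pattern matches any value in that position.
function matchesPermission(pattern: string, request: string): boolean {
  const granted = pattern.split(":");
  const wanted = request.split(":");
  if (granted.length !== 4 || wanted.length !== 4) return false;
  return granted.every((segment, i) => segment === "*" || segment === wanted[i]);
}

// A request is allowed when at least one granted pattern matches it.
function isAllowed(grantedPatterns: string[], request: string): boolean {
  return grantedPatterns.some((pattern) => matchesPermission(pattern, request));
}

// Build the concrete request string for a CLI command against one server.
function requestFor(command: string, cloud: string, environment: string, server: string): string {
  const action = COMMAND_ACTIONS[command];
  if (action === undefined) throw new Error(`no action mapped for command: ${command}`);
  return [action, cloud, environment, server].join(":");
}
```

For labctl get, the same check could be used as a filter over the result set so that only servers the user may read are listed.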
6. Bastion State Directory Fix
Fix the bug where the CLI's --dir default (/tmp/lab-bastion) overrides the BASTION_DIR=/data environment variable. The CLI option should use the env var as its default:
.option("--dir <dir>", "Bastion data directory", process.env["BASTION_DIR"] ?? "/tmp/lab-bastion")
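
The intended precedence (explicit --dir, then BASTION_DIR, then the /tmp fallback) can also be written as a small pure function, which makes the rule easy to unit test. This is a sketch only: resolveBastionDir is a hypothetical helper, it assumes the option is declared without a built-in default so that an omitted flag arrives as undefined, and unlike the ?? form above it treats an empty BASTION_DIR as unset.

```typescript
const DEFAULT_BASTION_DIR = "/tmp/lab-bastion";

// Precedence: explicit CLI flag > BASTION_DIR environment variable > default.
// Assumption: empty strings count as "not provided".
function resolveBastionDir(
  cliDir: string | undefined,
  env: Record<string, string | undefined>,
): string {
  if (cliDir) return cliDir;
  const fromEnv = env["BASTION_DIR"];
  return fromEnv ? fromEnv : DEFAULT_BASTION_DIR;
}
```

In the pod described above, resolveBastionDir(undefined, { BASTION_DIR: "/data" }) yields "/data", so the state file no longer lands in /tmp.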
Technical Constraints
- Database: CockroachDB with Prisma ORM (already deployed)
- API: Fastify + WebSocket (labd)
- CLI: Commander.js (labctl)
- Auth: mTLS certificates (planned), join tokens (implemented)
- Monorepo: pnpm workspace with @lab/shared, @lab/bastion, @lab/cli, @lab/labd
- The bastion-to-labd WebSocket protocol is defined in @lab/shared/protocol
Success Criteria
- labctl get servers shows all machines (discovered, provisioning, installed) from the database
- Server state survives bastion and labd pod restarts
- labctl describe server <name> shows hardware info, network, cluster membership
- Resources have tracked relationships (server→cluster, server→network, bastion→server)
- RBAC permissions are enforced on CLI operations
- All resource mutations are audit-logged
- CLI follows consistent kubectl-style verb resource [name] [flags] pattern