lab/bastion/.taskmaster/docs/resource-tracking.md
Michal 46b017d77e
2026-03-26 22:26:33 +00:00


PRD: Resource Tracking & kubectl-style CLI

Problem

The lab platform currently has fragmented state management:

  • Bastion keeps machine state in an ephemeral JSON file (/tmp/lab-bastion/state.json) that is lost on pod restart
  • labd receives state syncs from bastions but only stores them in memory — the Server table in CockroachDB is never written to
  • There is no system to track relationships between resources (servers belong to clusters, clusters run on servers, networks connect servers)
  • The CLI (labctl) uses an inconsistent verb-noun structure (labctl provision list, labctl app k3s install) instead of a uniform resource-oriented pattern
  • RBAC permissions reference resources (server, cloud, environment), but there is no resource registry to validate them against

Vision

A unified resource tracking system where all infrastructure objects (servers, clusters, networks, bastions, VMs) are persisted in CockroachDB via labd, with relationships between them, and managed through a kubectl-style CLI. This replaces the ephemeral JSON state and becomes the single source of truth for the platform.

Current State

Database (CockroachDB via Prisma)

Existing models that are scaffolded but mostly unused:

  • Server — hostname, mac, cloud, environment, role, labels, ip, status (0 rows)
  • Agent — mTLS certificate enrollment per server (0 rows)
  • Bastion — PXE server registration (1 row, labmaster)
  • Cluster — k8s cluster metadata (0 rows)
  • User, Role, Permission, UserRole — RBAC framework (seeded with 3 roles, 6 permissions)
  • JoinToken — agent/bastion enrollment tokens
  • AuditLog — action audit trail

Bastion State (ephemeral JSON)

Three categories are tracked per bastion:

  • discovered — machines found via PXE with hardware info (CPU, RAM, disks, NICs, arch)
  • install_queue — machines queued for OS install with progress tracking
  • installed — machines with OS installed (hostname, role, IP, OS)

CLI Structure (current)

labctl init bastion standalone [start|stop|status]
labctl provision [list|install|reprovision|forget|logs]
labctl app [k3s|labcontroller]
labctl config [list|get|set]
labctl roles
labctl doctor
labctl login
labctl logs

Requirements

1. Persist Bastion State to Database

When labd receives bastion-state-sync messages, it must upsert machines into the Server table:

  • Discovered machines → create/update Server with status "discovered", store HardwareInfo as JSON in the labels field
  • Queued machines → update Server status to "provisioning"
  • Installed machines → update Server with hostname, IP, role, OS, status "installed"
  • Track which bastion owns which server (add bastionId to Server model)
  • Track hardware info: arch, cpu_model, cpu_cores, memory_gb, disks, nics

The bastion's local JSON state becomes a cache; labd's database is the source of truth. On startup, the bastion should load its state from labd if labd is reachable.
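The mapping above can be sketched as a pure function that turns one state snapshot into per-MAC upsert records, which labd could then hand to its database layer. This is a minimal sketch, not the actual protocol: the BastionState, ServerUpsert and HardwareInfo shapes and the toServerUpserts name are hypothetical, and the rule that a later category overrides an earlier one for the same MAC is an assumption.

```typescript
// Hypothetical shapes: the real bastion-state-sync payload is defined in
// @lab/shared/protocol and may differ from this sketch.
type ServerStatus = "discovered" | "provisioning" | "installed";

interface HardwareInfo {
  arch: string;
  cpu_model: string;
  cpu_cores: number;
  memory_gb: number;
}

interface BastionState {
  discovered: { mac: string; hardware: HardwareInfo }[];
  install_queue: { mac: string }[];
  installed: { mac: string; hostname: string; ip: string; role: string; os: string }[];
}

interface ServerUpsert {
  mac: string;
  bastionId: string;
  status: ServerStatus;
  hostname?: string;
  ip?: string;
  role?: string;
  os?: string;
  hardware?: HardwareInfo;
}

// Build one upsert record per MAC. Categories are applied in order, so a
// machine listed as both discovered and queued ends up "provisioning"
// (assumed precedence: installed > provisioning > discovered).
function toServerUpserts(bastionId: string, state: BastionState): ServerUpsert[] {
  const byMac: Record<string, ServerUpsert> = {};
  const merge = (mac: string, patch: Partial<ServerUpsert>): void => {
    const key = mac.toLowerCase(); // normalise case so duplicates collapse
    const prev: ServerUpsert = byMac[key] ?? { mac: key, bastionId, status: "discovered" };
    byMac[key] = { ...prev, ...patch, mac: key, bastionId };
  };
  for (const m of state.discovered) {
    merge(m.mac, { status: "discovered", hardware: m.hardware });
  }
  for (const m of state.install_queue) {
    merge(m.mac, { status: "provisioning" });
  }
  for (const m of state.installed) {
    merge(m.mac, { status: "installed", hostname: m.hostname, ip: m.ip, role: m.role, os: m.os });
  }
  return Object.keys(byMac).map((key) => byMac[key]);
}
```

Keeping the mapping free of database calls means it can be unit-tested without CockroachDB; each returned record would then become one upsert keyed on the MAC address.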

2. Resource Model Expansion

Add new models to the Prisma schema for tracking infrastructure:

Network — L2/L3 network segments

  • name, cidr, vlan, gateway, domain, dhcpEnabled
  • Servers have NICs on networks

ServerNic — NIC-to-network mapping

  • serverId, networkId, mac, ip, name, state (UP/DOWN)
  • Derived from HardwareInfo during discovery

ServerDisk — Disk inventory per server

  • serverId, name, sizeGb, model
  • Derived from HardwareInfo during discovery

ClusterMember — Server-to-cluster membership

  • clusterId, serverId, role (control-plane, worker)
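As an illustration of how ServerNic and ServerDisk rows might be derived from HardwareInfo during discovery, the sketch below mirrors the field lists above as TypeScript interfaces and attaches each NIC to the first network whose CIDR contains its IPv4 address. The DiscoveredHardware shape, the deriveInventory name and the CIDR-matching rule are assumptions for illustration; the real models would live in the Prisma schema.

```typescript
// Field names follow the model bullets above; ids are plain strings here.
interface Network { id: string; name: string; cidr: string }
interface ServerNic {
  serverId: string;
  networkId: string | null;
  mac: string;
  ip: string | null;
  name: string;
  state: "UP" | "DOWN";
}
interface ServerDisk { serverId: string; name: string; sizeGb: number; model: string }

// Hypothetical discovery payload; the real HardwareInfo may be shaped differently.
interface DiscoveredHardware {
  nics: { name: string; mac: string; ip?: string; state: "UP" | "DOWN" }[];
  disks: { name: string; size_gb: number; model: string }[];
}

// Convert a dotted IPv4 address to an unsigned 32-bit number.
function ipv4ToInt(ip: string): number {
  return ip.split(".").reduce((acc, octet) => acc * 256 + Number(octet), 0);
}

// True when the IPv4 address falls inside the CIDR block.
function cidrContains(cidr: string, ip: string): boolean {
  const [base, prefixText] = cidr.split("/");
  const prefix = Number(prefixText);
  const mask = prefix === 0 ? 0 : (~0 << (32 - prefix)) >>> 0;
  return ((ipv4ToInt(ip) & mask) >>> 0) === ((ipv4ToInt(base) & mask) >>> 0);
}

function deriveInventory(
  serverId: string,
  hw: DiscoveredHardware,
  networks: Network[],
): { nics: ServerNic[]; disks: ServerDisk[] } {
  const nics = hw.nics.map((nic): ServerNic => {
    let networkId: string | null = null;
    if (nic.ip !== undefined) {
      // Assumption: a NIC belongs to the first network containing its IP.
      for (const net of networks) {
        if (cidrContains(net.cidr, nic.ip)) {
          networkId = net.id;
          break;
        }
      }
    }
    return {
      serverId,
      networkId,
      mac: nic.mac.toLowerCase(),
      ip: nic.ip ?? null,
      name: nic.name,
      state: nic.state,
    };
  });
  const disks = hw.disks.map((disk): ServerDisk => ({
    serverId,
    name: disk.name,
    sizeGb: disk.size_gb,
    model: disk.model,
  }));
  return { nics, disks };
}
```

A NIC without an address keeps networkId set to null, so it can be linked to a network later without losing the inventory row.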

3. kubectl-style CLI Redesign

Restructure labctl to follow the mcpctl / kubectl pattern:

# Core CRUD verbs that work on any resource
labctl get <resource> [name]          # List or get specific resource
labctl describe <resource> <name>     # Detailed view with relationships
labctl create <resource> [flags]      # Create a resource
labctl delete <resource> <name>       # Delete a resource
labctl edit <resource> <name>         # Edit in $EDITOR
labctl apply -f <file>                # Declarative apply from YAML

# Resource types (with aliases)
servers (server, srv)
clusters (cluster)
networks (network, net)
bastions (bastion)
roles (role)
users (user)
tokens (token)
audit (audit)

# Output formats
-o table (default), -o json, -o yaml, -o wide

# Examples
labctl get servers                     # List all servers
labctl get servers -o wide             # With extra columns (disks, NICs)
labctl get server labmaster            # Get specific server
labctl describe server labmaster       # Full details + relationships
labctl get servers --role worker       # Filter by role
labctl get servers --status discovered # Filter by status
labctl get clusters                    # List clusters
labctl describe cluster lab-k3s        # Cluster members, health
labctl get networks                    # List networks
labctl create network --name lab --cidr 192.168.8.0/24 --gateway 192.168.8.1

# Provisioning becomes actions on server resources
labctl provision <server> --os fedora-43 --role worker   # Queue install
labctl reprovision <server>                              # Reinstall
labctl forget <server>                                   # Remove from tracking

# App management keeps the app subcommand, with a simplified verb-first syntax
labctl app install k3s <server>
labctl app health k3s [server]

# Admin
labctl bastion start [--foreground]    # Start local bastion
labctl bastion status                  # Bastion health
labctl login                           # Auth
labctl doctor                          # Diagnostics
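
To make the output formats concrete, here is a rough sketch of how -o table, -o wide and -o json could share one renderer for server rows. The ServerRow shape, the column choices and the renderServers name are hypothetical; -o yaml is left out because it would need a YAML library.

```typescript
type OutputFormat = "table" | "wide" | "json";

// Hypothetical row shape for `labctl get servers`.
interface ServerRow {
  name: string;
  status: string;
  role: string;
  ip: string;
  disks: number;
  nics: number;
}

const TABLE_COLUMNS: (keyof ServerRow)[] = ["name", "status", "role", "ip"];
// "wide" adds the extra columns mentioned in the examples (disks, NICs).
const WIDE_COLUMNS: (keyof ServerRow)[] = TABLE_COLUMNS.concat(["disks", "nics"]);

// Right-pad with spaces up to the given width.
function pad(text: string, width: number): string {
  let out = text;
  while (out.length < width) out += " ";
  return out;
}

function renderServers(rows: ServerRow[], format: OutputFormat): string {
  if (format === "json") return JSON.stringify(rows, null, 2);
  const columns = format === "wide" ? WIDE_COLUMNS : TABLE_COLUMNS;
  // Each column is as wide as its longest cell or its header.
  const widths = columns.map((col) =>
    rows.reduce((w, row) => Math.max(w, String(row[col]).length), col.length),
  );
  const formatLine = (cells: string[]): string =>
    cells.map((cell, i) => pad(cell, widths[i])).join("  ").replace(/\s+$/, "");
  const header = formatLine(columns.map((col) => col.toUpperCase()));
  const body = rows.map((row) => formatLine(columns.map((col) => String(row[col]))));
  return [header].concat(body).join("\n");
}
```

Since the format only selects columns and a serializer, each resource type would need to supply just its row shape and column lists.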

4. Resource Aliases & Resolution

Follow mcpctl's pattern from shared.ts:

  • Accept singular, plural, and short aliases: server, servers, srv all resolve to the same resource
  • Accept name or ID: labctl get server labmaster or labctl get server <uuid>
  • Accept MAC address for servers: labctl get server 38:05:25:33:e2:e4
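
A minimal sketch of the resolution rules above, using the alias table from section 3. It is not mcpctl's actual shared.ts code; resolveResource, classifyIdentifier and the two regular expressions are illustrative names and assumptions.

```typescript
// Canonical (plural) resource names mapped to their accepted aliases,
// taken from the resource type list in section 3.
const RESOURCE_ALIASES: Record<string, string[]> = {
  servers: ["server", "srv"],
  clusters: ["cluster"],
  networks: ["network", "net"],
  bastions: ["bastion"],
  roles: ["role"],
  users: ["user"],
  tokens: ["token"],
  audit: [],
};

// Map any accepted spelling to the canonical resource name.
function resolveResource(input: string): string {
  const wanted = input.toLowerCase();
  for (const canonical of Object.keys(RESOURCE_ALIASES)) {
    if (canonical === wanted || RESOURCE_ALIASES[canonical].indexOf(wanted) !== -1) {
      return canonical;
    }
  }
  throw new Error(`unknown resource type: ${input}`);
}

type IdentifierKind = "mac" | "uuid" | "name";

const MAC_RE = /^[0-9a-f]{2}(:[0-9a-f]{2}){5}$/i;
const UUID_RE = /^[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}$/i;

// Decide which lookup to run for the [name] argument.
function classifyIdentifier(value: string): IdentifierKind {
  if (MAC_RE.test(value)) return "mac";
  if (UUID_RE.test(value)) return "uuid";
  return "name";
}
```

For example, resolveResource("srv") yields "servers", and classifyIdentifier("38:05:25:33:e2:e4") yields "mac", so labctl get server 38:05:25:33:e2:e4 could be routed to a lookup by MAC address.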

5. RBAC Integration

The existing Permission model uses action:cloud:environment:server patterns. Wire this into the resource system:

  • CLI commands check permissions before executing
  • labctl get respects read permissions (only show resources the user can see)
  • labctl provision requires apply permission on the target server
  • labctl delete requires destroy permission
  • Audit all resource operations to the AuditLog table
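
The permission checks above could look roughly like the following. The action:cloud:environment:server pattern comes from the existing Permission model; treating "*" as a wildcard for a segment, and the isAllowed and visibleServers helpers, are assumptions made for this sketch.

```typescript
type Action = "read" | "apply" | "destroy";

// Verb-to-action mapping taken from the bullets above.
const REQUIRED_ACTION: Record<string, Action> = {
  get: "read",
  provision: "apply",
  delete: "destroy",
};

// Hypothetical minimal view of a server for permission checks.
interface ServerRef {
  cloud: string;
  environment: string;
  name: string;
}

// A grant matches when every segment equals the wanted value or is "*"
// (wildcard semantics are an assumption, not confirmed by the schema).
function isAllowed(grants: string[], action: Action, target: ServerRef): boolean {
  const wanted = [action, target.cloud, target.environment, target.name];
  return grants.some((grant) => {
    const parts = grant.split(":");
    return (
      parts.length === wanted.length &&
      parts.every((part, i) => part === "*" || part === wanted[i])
    );
  });
}

// `labctl get` only returns the servers the caller may read.
function visibleServers(grants: string[], servers: ServerRef[]): ServerRef[] {
  return servers.filter((server) => isAllowed(grants, "read", server));
}
```

With the grants read:*:*:* and apply:lab:dev:*, a user could list every server but provision only those in the lab cloud's dev environment, and could not delete any.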

6. Bastion State Directory Fix

Fix the bug where the CLI's --dir default (/tmp/lab-bastion) overrides the BASTION_DIR=/data environment variable. The CLI option should use the env var as its default:

.option("--dir <dir>", "Bastion data directory", process.env["BASTION_DIR"] ?? "/tmp/lab-bastion")
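
The intended precedence (explicit --dir flag, then BASTION_DIR, then the built-in default) can also be written as a small helper that is easy to unit-test. This is a sketch only: resolveBastionDir is a hypothetical name, and the environment is passed in as a plain object so the function does not depend on process.env.

```typescript
const DEFAULT_BASTION_DIR = "/tmp/lab-bastion";

// Precedence: explicit CLI flag > BASTION_DIR env var > built-in default.
// Like the one-line fix above, `??` only falls through on undefined, so an
// empty BASTION_DIR string would still be used as-is.
function resolveBastionDir(
  env: Record<string, string | undefined>,
  cliDir?: string,
): string {
  return cliDir ?? env["BASTION_DIR"] ?? DEFAULT_BASTION_DIR;
}
```

In the pod case described above, resolveBastionDir({ BASTION_DIR: "/data" }) returns "/data", while a user-supplied --dir still takes priority over the environment.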

Technical Constraints

  • Database: CockroachDB with Prisma ORM (already deployed)
  • API: Fastify + WebSocket (labd)
  • CLI: Commander.js (labctl)
  • Auth: mTLS certificates (planned), join tokens (implemented)
  • Monorepo: pnpm workspace with @lab/shared, @lab/bastion, @lab/cli, @lab/labd
  • The bastion-to-labd WebSocket protocol is defined in @lab/shared/protocol

Success Criteria

  1. labctl get servers shows all machines (discovered, provisioning, installed) from the database
  2. Server state survives bastion and labd pod restarts
  3. labctl describe server <name> shows hardware info, networks, and cluster membership
  4. Resources have tracked relationships (server→cluster, server→network, bastion→server)
  5. RBAC permissions are enforced on CLI operations
  6. All resource mutations are audit-logged
  7. The CLI follows a consistent kubectl-style verb resource [name] [flags] pattern