lab/bastion/.taskmaster/docs/resource-tracking.md
Michal 46b017d77e
feat: install logging, error trapping, PXE/ISO integration tests
Kickstart installs on real hardware failed silently — no error reporting,
only 3 progress callbacks, zero log streaming. This overhaul makes every
install fully observable.

Kickstart improvements:
- Error trapping in %pre and %post (trap ERR sends failure details to bastion)
- 12+ granular progress stages (was 3): SSH, hostname, k3s prep, EFI boot, metadata
- Background log streamer: tails %post output and batch-sends to /api/log
- bastion_log() function for explicit log lines from kickstart scripts

Bastion API:
- POST /api/log — receives raw log lines from kickstart (single or batch)
- InstallLogBuffer — per-MAC ring buffer (2000 lines) + file persistence
- GET /api/logs/:mac — now returns log_lines + log_total alongside stages
- SSE /api/logs/:mac/follow — uses named events (event: stage vs event: log)
- Progress events forwarded to labd via bastion-progress WebSocket message
- Post-provision k3s logs routed through progressBus (was console-only)

dnsmasq fixes found during VM testing:
- HTTP Boot filename: ipxe-real.efi → ipxe.efi (leftover from old 2-stage approach)
- pxe-service directives: only in proxy mode (breaks OVMF PXE in full mode)
- PXEClient vendor class echo for UEFI firmware compatibility

Integration tests:
- PXE boot test: blank UEFI VM → dnsmasq → HTTP Boot → iPXE → bastion → install
- ISO boot test: blank VM boots from bastion-generated ISO → same flow
- Shared helpers: pxe-network (no DHCP, nftables fix), pxe-vm (UEFI + ISO boot)
- test-provision.sh: runs both PXE + ISO tests with prerequisite checks
- 250GB sparse QCOW2 disk (LVM layout needs ~204GB)

201 unit tests passing (11 new).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-26 22:26:33 +00:00

# PRD: Resource Tracking & kubectl-style CLI
## Problem
The lab platform currently has fragmented state management:
- Bastion keeps machine state in an ephemeral JSON file (`/tmp/lab-bastion/state.json`) that is lost on pod restart
- labd receives state syncs from bastions but only stores them in memory — the `Server` table in CockroachDB is never written to
- There is no system to track relationships between resources (servers belong to clusters, networks connect servers)
- The CLI (`labctl`) uses an inconsistent verb-noun structure (`labctl provision list`, `labctl app k3s install`) instead of a uniform resource-oriented pattern
- RBAC permissions reference resources (server, cloud, environment) but there is no resource registry to validate against
## Vision
A unified resource tracking system where all infrastructure objects (servers, clusters, networks, bastions, VMs) are persisted in CockroachDB via labd, with relationships between them, and managed through a kubectl-style CLI. The database replaces the ephemeral JSON state as the single source of truth for the platform.
## Current State
### Database (CockroachDB via Prisma)
Existing models that are scaffolded but mostly unused:
- `Server` — hostname, mac, cloud, environment, role, labels, ip, status (0 rows)
- `Agent` — mTLS certificate enrollment per server (0 rows)
- `Bastion` — PXE server registration (1 row, labmaster)
- `Cluster` — k8s cluster metadata (0 rows)
- `User`, `Role`, `Permission`, `UserRole` — RBAC framework (seeded with 3 roles, 6 permissions)
- `JoinToken` — agent/bastion enrollment tokens
- `AuditLog` — action audit trail
### Bastion State (ephemeral JSON)
Three categories tracked per-bastion:
- `discovered` — machines found via PXE with hardware info (CPU, RAM, disks, NICs, arch)
- `install_queue` — machines queued for OS install with progress tracking
- `installed` — machines with OS installed (hostname, role, IP, OS)
### CLI Structure (current)
```
labctl init bastion standalone [start|stop|status]
labctl provision [list|install|reprovision|forget|logs]
labctl app [k3s|labcontroller]
labctl config [list|get|set]
labctl roles
labctl doctor
labctl login
labctl logs
```
## Requirements
### 1. Persist Bastion State to Database
When labd receives `bastion-state-sync` messages, it must upsert machines into the `Server` table:
- Discovered machines → create/update Server with status "discovered", store HardwareInfo as JSON labels
- Queued machines → update Server status to "provisioning"
- Installed machines → update Server with hostname, IP, role, OS, status "installed"
- Track which bastion owns which server (add `bastionId` to Server model)
- Track hardware info: arch, cpu_model, cpu_cores, memory_gb, disks, nics
The bastion's local JSON state becomes a cache; labd's database is the source of truth. On startup, the bastion should load its state from labd if available.
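The mapping from the three bastion categories to `Server` rows can be sketched as a pure function that runs before any database write. This is a minimal sketch under assumptions: the `BastionStateSync` and `ServerUpsert` shapes are hypothetical (the real payload is defined in `@lab/shared/protocol`), and the rule that a later category wins over an earlier one is one possible choice, not something the current code specifies.

```typescript
// Hypothetical shapes: the real sync payload lives in @lab/shared/protocol
// and may differ from what is assumed here.
type ServerStatus = "discovered" | "provisioning" | "installed";

interface HardwareSummary {
  arch: string;
  cpu_model: string;
  cpu_cores: number;
  memory_gb: number;
}

interface BastionStateSync {
  discovered: { mac: string; hardware: HardwareSummary }[];
  install_queue: { mac: string }[];
  installed: { mac: string; hostname: string; ip: string; role: string; os: string }[];
}

interface ServerUpsert {
  mac: string;
  bastionId: string;
  status: ServerStatus;
  hardware?: HardwareSummary;
  hostname?: string;
  ip?: string;
  role?: string;
  os?: string;
}

// Fold the three categories into one record per MAC address.
// Categories are applied in order, so a machine listed as both
// discovered and installed ends up with status "installed".
function toServerUpserts(sync: BastionStateSync, bastionId: string): ServerUpsert[] {
  const byMac = new Map<string, ServerUpsert>();
  const touch = (mac: string): ServerUpsert => {
    const key = mac.toLowerCase(); // normalise so AA:BB and aa:bb collapse
    let rec = byMac.get(key);
    if (!rec) {
      rec = { mac: key, bastionId, status: "discovered" };
      byMac.set(key, rec);
    }
    return rec;
  };

  for (const m of sync.discovered) {
    touch(m.mac).hardware = m.hardware;
  }
  for (const m of sync.install_queue) {
    touch(m.mac).status = "provisioning";
  }
  for (const m of sync.installed) {
    const rec = touch(m.mac);
    rec.status = "installed";
    rec.hostname = m.hostname;
    rec.ip = m.ip;
    rec.role = m.role;
    rec.os = m.os;
  }
  return Array.from(byMac.values());
}
```

Each returned record could then be passed to a database upsert keyed on the MAC address. Keeping the fold separate from the write makes it testable without a database.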
### 2. Resource Model Expansion
Add new models to the Prisma schema for tracking infrastructure:
**Network** — L2/L3 network segments
- name, cidr, vlan, gateway, domain, dhcpEnabled
- Servers have NICs on networks
**ServerNic** — NIC-to-network mapping
- serverId, networkId, mac, ip, name, state (UP/DOWN)
- Derived from HardwareInfo during discovery
**ServerDisk** — Disk inventory per server
- serverId, name, sizeGb, model
- Derived from HardwareInfo during discovery
**ClusterMember** — Server-to-cluster membership
- clusterId, serverId, role (control-plane, worker)
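Deriving `ServerNic` and `ServerDisk` rows from discovery data could look like the sketch below. The `DiscoveredHardware` field names (`size_gb`, `state`, and so on) are assumptions, as is the choice to link a NIC to a network by checking whether its IPv4 address falls inside the network's CIDR; a NIC with no matching network gets `networkId: null`.

```typescript
// Hypothetical input shape; the real HardwareInfo fields may be named differently.
interface DiscoveredHardware {
  disks: { name: string; size_gb: number; model: string }[];
  nics: { name: string; mac: string; ip?: string; state: "UP" | "DOWN" }[];
}

interface NetworkRef { id: string; cidr: string }

interface ServerDisk { serverId: string; name: string; sizeGb: number; model: string }

interface ServerNic {
  serverId: string;
  networkId: string | null;
  mac: string;
  ip: string | null;
  name: string;
  state: "UP" | "DOWN";
}

// Convert dotted-quad IPv4 to a non-negative integer without bitwise ops,
// which would overflow into negative numbers for addresses above 127.x.x.x.
function ipv4ToInt(ip: string): number {
  return ip.split(".").reduce((acc, octet) => acc * 256 + Number(octet), 0);
}

// True when `ip` lies inside `cidr` (IPv4 only in this sketch).
function inCidr(ip: string, cidr: string): boolean {
  const slash = cidr.indexOf("/");
  const base = cidr.slice(0, slash);
  const prefix = Number(cidr.slice(slash + 1));
  const blockSize = Math.pow(2, 32 - prefix);
  return Math.floor(ipv4ToInt(ip) / blockSize) === Math.floor(ipv4ToInt(base) / blockSize);
}

function deriveInventory(
  serverId: string,
  hw: DiscoveredHardware,
  networks: NetworkRef[],
): { disks: ServerDisk[]; nics: ServerNic[] } {
  const disks = hw.disks.map((d) => ({
    serverId,
    name: d.name,
    sizeGb: d.size_gb,
    model: d.model,
  }));
  const nics = hw.nics.map((n) => {
    const ip = n.ip;
    const match = ip === undefined ? undefined : networks.filter((net) => inCidr(ip, net.cidr))[0];
    return {
      serverId,
      networkId: match ? match.id : null,
      mac: n.mac.toLowerCase(),
      ip: ip === undefined ? null : ip,
      name: n.name,
      state: n.state,
    };
  });
  return { disks, nics };
}
```

Overlapping CIDRs are not handled here: the first matching network wins, which a real implementation would probably replace with a longest-prefix match.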
### 3. kubectl-style CLI Redesign
Restructure labctl to follow the `mcpctl` / `kubectl` pattern:
```
# Core CRUD verbs that work on any resource
labctl get <resource> [name] # List or get specific resource
labctl describe <resource> <name> # Detailed view with relationships
labctl create <resource> [flags] # Create a resource
labctl delete <resource> <name> # Delete a resource
labctl edit <resource> <name> # Edit in $EDITOR
labctl apply -f <file> # Declarative apply from YAML
# Resource types (with aliases)
servers (server, srv)
clusters (cluster)
networks (network, net)
bastions (bastion)
roles (role)
users (user)
tokens (token)
audit (audit)
# Output formats
-o table (default), -o json, -o yaml, -o wide
# Examples
labctl get servers # List all servers
labctl get servers -o wide # With extra columns (disks, NICs)
labctl get server labmaster # Get specific server
labctl describe server labmaster # Full details + relationships
labctl get servers --role worker # Filter by role
labctl get servers --status discovered # Filter by status
labctl get clusters # List clusters
labctl describe cluster lab-k3s # Cluster members, health
labctl get networks # List networks
labctl create network --name lab --cidr 192.168.8.0/24 --gateway 192.168.8.1
# Provisioning becomes actions on server resources
labctl provision <server> --os fedora-43 --role worker # Queue install
labctl reprovision <server> # Reinstall
labctl forget <server> # Remove from tracking
# App management keeps the app group, but the verb now comes before the app name
labctl app install k3s <server>
labctl app health k3s [server]
# Admin
labctl bastion start [--foreground] # Start local bastion
labctl bastion status # Bastion health
labctl login # Auth
labctl doctor # Diagnostics
```
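The difference between `-o table`, `-o wide`, and `-o json` can be illustrated with a small renderer. This is a sketch, not the planned implementation: the `ServerRow` fields and the choice of which columns count as "wide" are assumptions, and `-o yaml` is left out because it would need a YAML library.

```typescript
type OutputFormat = "table" | "wide" | "json";

// Hypothetical row shape for `labctl get servers`.
interface ServerRow {
  name: string;
  status: string;
  role: string;
  ip: string;
  disks: number;
  nics: number;
}

const BASE_COLUMNS: (keyof ServerRow)[] = ["name", "status", "role", "ip"];
const WIDE_COLUMNS: (keyof ServerRow)[] = ["name", "status", "role", "ip", "disks", "nics"];

// Right-pad a cell with spaces up to the column width.
function pad(cell: string, width: number): string {
  return cell + " ".repeat(Math.max(0, width - cell.length));
}

function render(rows: ServerRow[], format: OutputFormat): string {
  if (format === "json") {
    return JSON.stringify(rows, null, 2);
  }
  const cols = format === "wide" ? WIDE_COLUMNS : BASE_COLUMNS;
  // First line is the upper-cased header, followed by one line per row.
  const lines: string[][] = [cols.map((c) => c.toUpperCase())];
  for (const row of rows) {
    lines.push(cols.map((c) => String(row[c])));
  }
  // Each column is as wide as its longest cell.
  const widths = cols.map((_, i) => Math.max(...lines.map((line) => line[i].length)));
  return lines
    .map((line) => line.map((cell, i) => pad(cell, widths[i])).join("  ").replace(/\s+$/, ""))
    .join("\n");
}
```

Because every resource type would go through the same renderer, adding a resource only means declaring its base and wide columns.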
### 4. Resource Aliases & Resolution
Follow mcpctl's pattern from `shared.ts`:
- Accept singular, plural, and short aliases: `server`, `servers`, `srv` all resolve to the same resource
- Accept name or ID: `labctl get server labmaster` or `labctl get server <uuid>`
- Accept MAC address for servers: `labctl get server 38:05:25:33:e2:e4`
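A minimal sketch of the resolution rules above, using the alias table from the CLI design. It is not a copy of mcpctl's `shared.ts`: the function names, the case-insensitive matching, and the regular expressions used to tell a MAC address from a UUID are assumptions made for illustration.

```typescript
// Canonical (plural) resource name -> accepted aliases, taken from the CLI design.
const RESOURCE_ALIASES: Record<string, string[]> = {
  servers: ["server", "srv"],
  clusters: ["cluster"],
  networks: ["network", "net"],
  bastions: ["bastion"],
  roles: ["role"],
  users: ["user"],
  tokens: ["token"],
  audit: [],
};

// Map user input such as "srv" or "Server" to the canonical resource name.
// Returns undefined for unknown resources so the caller can print an error.
function resolveResource(input: string): string | undefined {
  const needle = input.toLowerCase();
  for (const canonical of Object.keys(RESOURCE_ALIASES)) {
    if (canonical === needle || RESOURCE_ALIASES[canonical].indexOf(needle) !== -1) {
      return canonical;
    }
  }
  return undefined;
}

const MAC_RE = /^[0-9a-f]{2}(:[0-9a-f]{2}){5}$/i;
const UUID_RE = /^[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}$/i;

// Decide which column a server reference should be looked up by.
function classifyServerRef(ref: string): "mac" | "uuid" | "name" {
  if (MAC_RE.test(ref)) return "mac";
  if (UUID_RE.test(ref)) return "uuid";
  return "name";
}
```

With this split, `labctl get srv 38:05:25:33:e2:e4` would resolve to the `servers` resource and a lookup by MAC, while `labctl get server labmaster` would fall through to a lookup by name.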
### 5. RBAC Integration
The existing Permission model uses `action:cloud:environment:server` patterns. Wire this into the resource system:
- CLI commands check permissions before executing
- `labctl get` respects read permissions (only show resources the user can see)
- `labctl provision` requires `apply` permission on the target server
- `labctl delete` requires `destroy` permission
- Audit all resource operations to the AuditLog table
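One way the `action:cloud:environment:server` pattern could be evaluated is segment by segment, as sketched below. The `*` wildcard, the `read` action name, and the `ServerTarget` shape are assumptions; the PRD only names the `apply` and `destroy` actions and does not say whether the existing Permission model supports wildcards.

```typescript
// Hypothetical target descriptor for a server resource.
interface ServerTarget {
  cloud: string;
  environment: string;
  server: string;
}

// Compare one permission pattern against the requested action and target.
// A "*" segment matches any value (assumed wildcard syntax).
function matchesPermission(pattern: string, action: string, target: ServerTarget): boolean {
  const want = [action, target.cloud, target.environment, target.server];
  const have = pattern.split(":");
  if (have.length !== want.length) {
    return false; // malformed patterns never grant access
  }
  return have.every((segment, i) => segment === "*" || segment === want[i]);
}

// A user is allowed if any of their patterns matches.
function isAllowed(patterns: string[], action: string, target: ServerTarget): boolean {
  return patterns.some((p) => matchesPermission(p, action, target));
}

// `labctl get` would filter its result set down to readable servers.
function visibleServers<T extends ServerTarget>(patterns: string[], servers: T[]): T[] {
  return servers.filter((s) => isAllowed(patterns, "read", s));
}
```

Filtering in `visibleServers` keeps unreadable resources out of list output entirely instead of failing the whole command, which is the behavior the `labctl get` bullet describes.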
### 6. Bastion State Directory Fix
Fix the bug where the CLI's `--dir` default (`/tmp/lab-bastion`) overrides the `BASTION_DIR=/data` environment variable. The CLI option should use the env var as its default:
```typescript
.option("--dir <dir>", "Bastion data directory", process.env["BASTION_DIR"] ?? "/tmp/lab-bastion")
```
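The intended precedence (explicit flag, then `BASTION_DIR`, then the `/tmp` fallback) can also be written as a standalone helper, which is easier to unit-test than a Commander default. `resolveBastionDir` is a hypothetical name, and treating an empty string as unset is a design choice of this sketch; the `??` operator in the one-liner above only falls back on `null` or `undefined`.

```typescript
const DEFAULT_BASTION_DIR = "/tmp/lab-bastion";

// Resolve the bastion data directory.
// Precedence: explicit --dir flag > BASTION_DIR env var > built-in default.
// The environment is passed in as a plain object so tests need not touch process.env.
function resolveBastionDir(
  flag: string | undefined,
  env: Record<string, string | undefined>,
): string {
  if (flag !== undefined && flag !== "") {
    return flag;
  }
  const fromEnv = env["BASTION_DIR"];
  if (fromEnv !== undefined && fromEnv !== "") {
    return fromEnv;
  }
  return DEFAULT_BASTION_DIR;
}
```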
## Technical Constraints
- Database: CockroachDB with Prisma ORM (already deployed)
- API: Fastify + WebSocket (labd)
- CLI: Commander.js (labctl)
- Auth: mTLS certificates (planned), join tokens (implemented)
- Monorepo: pnpm workspace with @lab/shared, @lab/bastion, @lab/cli, @lab/labd
- The bastion-to-labd WebSocket protocol is defined in @lab/shared/protocol
## Success Criteria
1. `labctl get servers` shows all machines (discovered, provisioning, installed) from the database
2. Server state survives bastion and labd pod restarts
3. `labctl describe server <name>` shows hardware info, network, cluster membership
4. Resources have tracked relationships (server→cluster, server→network, bastion→server)
5. RBAC permissions are enforced on CLI operations
6. All resource mutations are audit-logged
7. CLI follows consistent kubectl-style `verb resource [name] [flags]` pattern