# PRD: Resource Tracking & kubectl-style CLI

## Problem

The lab platform currently has fragmented state management:
- Bastion keeps machine state in an ephemeral JSON file (`/tmp/lab-bastion/state.json`) that is lost on pod restart
- labd receives state syncs from bastions but only stores them in memory — the `Server` table in CockroachDB is never written to
- There is no system to track relationships between resources (servers belong to clusters, clusters run on servers, networks connect servers)
- The CLI (`labctl`) uses an inconsistent verb-noun structure (`labctl provision list`, `labctl app k3s install`) instead of a uniform resource-oriented pattern
- RBAC permissions reference resources (server, cloud, environment) but there is no resource registry to validate against

## Vision

A unified resource tracking system where all infrastructure objects (servers, clusters, networks, bastions, VMs) are persisted in CockroachDB via labd, with relationships between them, and managed through a kubectl-style CLI. This replaces the ephemeral JSON state and becomes the single source of truth for the platform.

## Current State

### Database (CockroachDB via Prisma)
Existing models that are scaffolded but mostly unused:
- `Server` — hostname, mac, cloud, environment, role, labels, ip, status (0 rows)
- `Agent` — mTLS certificate enrollment per server (0 rows)
- `Bastion` — PXE server registration (1 row, labmaster)
- `Cluster` — k8s cluster metadata (0 rows)
- `User`, `Role`, `Permission`, `UserRole` — RBAC framework (seeded with 3 roles, 6 permissions)
- `JoinToken` — agent/bastion enrollment tokens
- `AuditLog` — action audit trail

### Bastion State (ephemeral JSON)
Three categories tracked per-bastion:
- `discovered` — machines found via PXE with hardware info (CPU, RAM, disks, NICs, arch)
- `install_queue` — machines queued for OS install with progress tracking
- `installed` — machines with OS installed (hostname, role, IP, OS)

### CLI Structure (current)
```
labctl init bastion standalone [start|stop|status]
labctl provision [list|install|reprovision|forget|logs]
labctl app [k3s|labcontroller]
labctl config [list|get|set]
labctl roles
labctl doctor
labctl login
labctl logs
```

## Requirements

### 1. Persist Bastion State to Database

When labd receives `bastion-state-sync` messages, it must upsert machines into the `Server` table:
- Discovered machines → create/update Server with status "discovered", store HardwareInfo as JSON labels
- Queued machines → update Server status to "provisioning"
- Installed machines → update Server with hostname, IP, role, OS, status "installed"
- Track which bastion owns which server (add `bastionId` to Server model)
- Track hardware info: arch, cpu_model, cpu_cores, memory_gb, disks, nics

The bastion's local JSON state becomes a cache; labd's database is the source of truth. On startup, the bastion should load its state from labd if available.
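
The status mapping in this requirement can be sketched as a pure function that turns one `bastion-state-sync` payload into per-MAC upsert records. This is a minimal sketch, not labd's implementation: the `BastionStateSync`, `HardwareInfo`, and `ServerUpsert` shapes, the `hardware` field name, and the rule that a later lifecycle category overrides an earlier one for the same MAC are all assumptions. The real message types are defined in `@lab/shared/protocol`.

```typescript
// Hypothetical payload shapes; the real ones live in @lab/shared/protocol.
interface HardwareInfo {
  arch: string;
  cpu_model: string;
  cpu_cores: number;
  memory_gb: number;
  disks: unknown[];
  nics: unknown[];
}

interface BastionStateSync {
  bastionId: string;
  discovered: { mac: string; hardware: HardwareInfo }[];
  install_queue: { mac: string }[];
  installed: { mac: string; hostname: string; ip: string; role: string; os: string }[];
}

type ServerStatus = "discovered" | "provisioning" | "installed";

// One record per machine, ready to be upserted into the Server table.
interface ServerUpsert {
  mac: string;
  bastionId: string;
  status: ServerStatus;
  hostname?: string;
  ip?: string;
  role?: string;
  os?: string;
  hardware?: HardwareInfo;
}

function toServerUpserts(sync: BastionStateSync): ServerUpsert[] {
  const byMac = new Map<string, ServerUpsert>();

  // Merge a patch into the record for this MAC, keeping earlier fields
  // (for example hardware info) unless the patch overrides them.
  const merge = (mac: string, patch: Partial<ServerUpsert> & { status: ServerStatus }): void => {
    const key = mac.toLowerCase();
    const prev = byMac.get(key);
    byMac.set(key, { ...prev, ...patch, mac: key, bastionId: sync.bastionId });
  };

  // Assumption: categories are applied in lifecycle order, so the latest wins.
  for (const m of sync.discovered) merge(m.mac, { status: "discovered", hardware: m.hardware });
  for (const m of sync.install_queue) merge(m.mac, { status: "provisioning" });
  for (const m of sync.installed) {
    merge(m.mac, { status: "installed", hostname: m.hostname, ip: m.ip, role: m.role, os: m.os });
  }
  return Array.from(byMac.values());
}
```

Each record could then be handed to a Prisma `upsert` keyed on a unique `mac` column (an assumption about the schema). Keeping the mapping free of database calls makes it straightforward to unit test.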

### 2. Resource Model Expansion

Add new models to the Prisma schema for tracking infrastructure:

**Network** — L2/L3 network segments
- name, cidr, vlan, gateway, domain, dhcpEnabled
- Servers have NICs on networks

**ServerNic** — NIC-to-network mapping
- serverId, networkId, mac, ip, name, state (UP/DOWN)
- Derived from HardwareInfo during discovery

**ServerDisk** — Disk inventory per server
- serverId, name, sizeGb, model
- Derived from HardwareInfo during discovery

**ClusterMember** — Server-to-cluster membership
- clusterId, serverId, role (control-plane, worker)
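
One way `ServerNic` rows might be derived from discovery data is sketched below: each discovered NIC is matched to the `Network` whose `cidr` contains its IP. The `DiscoveredNic` shape, the nullable `networkId`, and the first-match rule are assumptions for illustration, since the requirement only lists the model fields. The sketch handles IPv4 only and assumes well-formed input.

```typescript
// Hypothetical row shapes mirroring the field lists above.
interface Network { id: string; name: string; cidr: string }
interface DiscoveredNic { name: string; mac: string; ip?: string; state: "UP" | "DOWN" }
interface ServerNic {
  serverId: string;
  networkId: string | null; // null when no known network contains the IP (assumption)
  mac: string;
  ip: string | null;
  name: string;
  state: "UP" | "DOWN";
}

// Convert a dotted-quad IPv4 address to an unsigned 32-bit integer.
function ipv4ToInt(ip: string): number {
  return ip.split(".").reduce((acc, octet) => acc * 256 + Number(octet), 0);
}

// True when the IPv4 address falls inside the CIDR block, e.g. "192.168.8.0/24".
function cidrContains(cidr: string, ip: string): boolean {
  const slash = cidr.indexOf("/");
  const base = cidr.slice(0, slash);
  const bits = Number(cidr.slice(slash + 1));
  // A shift by 32 wraps to 0 in JavaScript, so /0 is handled explicitly.
  const mask = bits === 0 ? 0 : (0xffffffff << (32 - bits)) >>> 0;
  return ((ipv4ToInt(ip) & mask) >>> 0) === ((ipv4ToInt(base) & mask) >>> 0);
}

// Build ServerNic rows for one server, linking each NIC to the first matching network.
function deriveServerNics(serverId: string, nics: DiscoveredNic[], networks: Network[]): ServerNic[] {
  return nics.map((nic) => {
    const ip = nic.ip ?? null;
    const network = ip === null ? undefined : networks.find((n) => cidrContains(n.cidr, ip));
    return {
      serverId,
      networkId: network ? network.id : null,
      mac: nic.mac.toLowerCase(),
      ip,
      name: nic.name,
      state: nic.state,
    };
  });
}
```

Leaving `networkId` empty for unmatched NICs keeps discovery from failing when a network has not been created yet; whether the real schema allows that is an open design choice.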

### 3. kubectl-style CLI Redesign

Restructure labctl to follow the `mcpctl` / `kubectl` pattern:

```
# Core CRUD verbs that work on any resource
labctl get <resource> [name]          # List or get specific resource
labctl describe <resource> <name>     # Detailed view with relationships
labctl create <resource> [flags]      # Create a resource
labctl delete <resource> <name>       # Delete a resource
labctl edit <resource> <name>         # Edit in $EDITOR
labctl apply -f <file>                # Declarative apply from YAML

# Resource types (with aliases)
servers (server, srv)
clusters (cluster)
networks (network, net)
bastions (bastion)
roles (role)
users (user)
tokens (token)
audit (audit)

# Output formats
-o table (default), -o json, -o yaml, -o wide

# Examples
labctl get servers                        # List all servers
labctl get servers -o wide                # With extra columns (disks, NICs)
labctl get server labmaster               # Get specific server
labctl describe server labmaster          # Full details + relationships
labctl get servers --role worker          # Filter by role
labctl get servers --status discovered    # Filter by status
labctl get clusters                       # List clusters
labctl describe cluster lab-k3s           # Cluster members, health
labctl get networks                       # List networks
labctl create network --name lab --cidr 192.168.8.0/24 --gateway 192.168.8.1

# Provisioning becomes actions on server resources
labctl provision <server> --os fedora-43 --role worker    # Queue install
labctl reprovision <server>                               # Reinstall
labctl forget <server>                                    # Remove from tracking

# App management stays as-is but simplified
labctl app install k3s <server>
labctl app health k3s [server]

# Admin
labctl bastion start [--foreground]    # Start local bastion
labctl bastion status                  # Bastion health
labctl login                           # Auth
labctl doctor                          # Diagnostics
```

### 4. Resource Aliases & Resolution

Follow mcpctl's pattern from `shared.ts`:
- Accept singular, plural, and short aliases: `server`, `servers`, `srv` all resolve to the same resource
- Accept name or ID: `labctl get server labmaster` or `labctl get server <uuid>`
- Accept MAC address for servers: `labctl get server 38:05:25:33:e2:e4`
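
The resolution rules above can be sketched in two small functions: one maps any alias to a canonical resource type, the other decides whether a positional argument is a MAC address, a UUID, or a name. This is an illustrative sketch and not mcpctl's `shared.ts`; the names `resolveResource` and `classifyIdentifier`, and the order in which the formats are tried, are assumptions.

```typescript
type ResourceKind =
  | "servers" | "clusters" | "networks" | "bastions"
  | "roles" | "users" | "tokens" | "audit";

// Alias table taken from the resource list in the CLI redesign section.
const ALIASES = new Map<string, ResourceKind>([
  ["servers", "servers"], ["server", "servers"], ["srv", "servers"],
  ["clusters", "clusters"], ["cluster", "clusters"],
  ["networks", "networks"], ["network", "networks"], ["net", "networks"],
  ["bastions", "bastions"], ["bastion", "bastions"],
  ["roles", "roles"], ["role", "roles"],
  ["users", "users"], ["user", "users"],
  ["tokens", "tokens"], ["token", "tokens"],
  ["audit", "audit"],
]);

// Resolve singular, plural, or short alias to the canonical resource type.
function resolveResource(input: string): ResourceKind {
  const kind = ALIASES.get(input.toLowerCase());
  if (kind === undefined) {
    throw new Error(`unknown resource type: ${input}`);
  }
  return kind;
}

interface Identifier {
  by: "mac" | "id" | "name";
  value: string;
}

const MAC_RE = /^(?:[0-9a-f]{2}:){5}[0-9a-f]{2}$/i;
const UUID_RE = /^[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}$/i;

// Decide how to look a resource up: MAC (servers only), UUID, or plain name.
function classifyIdentifier(kind: ResourceKind, arg: string): Identifier {
  if (kind === "servers" && MAC_RE.test(arg)) {
    return { by: "mac", value: arg.toLowerCase() };
  }
  if (UUID_RE.test(arg)) {
    return { by: "id", value: arg.toLowerCase() };
  }
  return { by: "name", value: arg };
}
```

For example, `classifyIdentifier(resolveResource("srv"), "38:05:25:33:e2:e4")` selects a MAC lookup, while the same string passed for a `network` falls through to a name lookup.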

### 5. RBAC Integration

The existing Permission model uses `action:cloud:environment:server` patterns. Wire this into the resource system:
- CLI commands check permissions before executing
- `labctl get` respects read permissions (only show resources the user can see)
- `labctl provision` requires `apply` permission on the target server
- `labctl delete` requires `destroy` permission
- Audit all resource operations to the AuditLog table
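
A permission check against the `action:cloud:environment:server` pattern might look like the sketch below. The `*` wildcard, the `read` action name, and the `ServerRef` shape are assumptions; the document names only the `apply` and `destroy` actions and does not specify wildcard semantics.

```typescript
// Hypothetical minimal view of a server for permission checks.
interface ServerRef {
  name: string;
  cloud: string;
  environment: string;
}

// True when a single pattern grants the action on the server.
// Assumption: "*" in any segment matches every value for that segment.
function permits(pattern: string, action: string, server: ServerRef): boolean {
  const parts = pattern.split(":");
  if (parts.length !== 4) return false; // malformed patterns grant nothing
  const wanted = [action, server.cloud, server.environment, server.name];
  return parts.every((part, i) => part === "*" || part === wanted[i]);
}

// True when any of the user's granted patterns permits the action.
function isAllowed(granted: string[], action: string, server: ServerRef): boolean {
  return granted.some((pattern) => permits(pattern, action, server));
}

// Filter a listing down to what the user may read, as `labctl get` would need.
function visibleServers(granted: string[], servers: ServerRef[]): ServerRef[] {
  return servers.filter((server) => isAllowed(granted, "read", server));
}
```

With this shape, `labctl provision` would call `isAllowed(granted, "apply", target)` before queuing an install, and `labctl delete` would check `"destroy"` the same way.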

### 6. Bastion State Directory Fix

Fix the bug where the CLI's `--dir` default (`/tmp/lab-bastion`) overrides the `BASTION_DIR=/data` environment variable. The CLI option should use the env var as its default:
```typescript
.option("--dir <dir>", "Bastion data directory", process.env["BASTION_DIR"] ?? "/tmp/lab-bastion")
```

## Technical Constraints

- Database: CockroachDB with Prisma ORM (already deployed)
- API: Fastify + WebSocket (labd)
- CLI: Commander.js (labctl)
- Auth: mTLS certificates (planned), join tokens (implemented)
- Monorepo: pnpm workspace with @lab/shared, @lab/bastion, @lab/cli, @lab/labd
- The bastion-to-labd WebSocket protocol is defined in @lab/shared/protocol

## Success Criteria

1. `labctl get servers` shows all machines (discovered, provisioning, installed) from the database
2. Server state survives bastion and labd pod restarts
3. `labctl describe server <name>` shows hardware info, network, cluster membership
4. Resources have tracked relationships (server→cluster, server→network, bastion→server)
5. RBAC permissions are enforced on CLI operations
6. All resource mutations are audit-logged
7. CLI follows consistent kubectl-style `verb resource [name] [flags]` pattern