lab/bastion/.taskmaster/docs/resource-tracking.md

# PRD: Resource Tracking & kubectl-style CLI

## Problem

The lab platform currently has fragmented state management:
- Bastion keeps machine state in an ephemeral JSON file (`/tmp/lab-bastion/state.json`) that is lost on pod restart
- labd receives state syncs from bastions but only stores them in memory — the `Server` table in CockroachDB is never written to
- Nothing tracks relationships between resources (servers belong to clusters, networks connect servers, bastions own servers)
- The CLI (`labctl`) mixes command shapes (`labctl provision list`, `labctl app k3s install`) instead of following one uniform resource-oriented pattern
- RBAC permissions reference resources (server, cloud, environment), but there is no resource registry to validate them against

## Vision

A unified resource tracking system where all infrastructure objects (servers, clusters, networks, bastions, VMs) are persisted in CockroachDB via labd, with relationships between them, and managed through a kubectl-style CLI. This replaces the ephemeral JSON state and becomes the single source of truth for the platform.

## Current State

### Database (CockroachDB via Prisma)
Existing models that are scaffolded but mostly unused:
- `Server` — hostname, mac, cloud, environment, role, labels, ip, status (0 rows)
- `Agent` — mTLS certificate enrollment per server (0 rows)
- `Bastion` — PXE server registration (1 row, labmaster)
- `Cluster` — k8s cluster metadata (0 rows)
- `User`, `Role`, `Permission`, `UserRole` — RBAC framework (seeded with 3 roles, 6 permissions)
- `JoinToken` — agent/bastion enrollment tokens
- `AuditLog` — action audit trail

### Bastion State (ephemeral JSON)
Three categories tracked per-bastion:
- `discovered` — machines found via PXE with hardware info (CPU, RAM, disks, NICs, arch)
- `install_queue` — machines queued for OS install with progress tracking
- `installed` — machines with OS installed (hostname, role, IP, OS)

### CLI Structure (current)
```
labctl init bastion standalone [start|stop|status]
labctl provision [list|install|reprovision|forget|logs]
labctl app [k3s|labcontroller]
labctl config [list|get|set]
labctl roles
labctl doctor
labctl login
labctl logs
```

## Requirements

### 1. Persist Bastion State to Database

When labd receives `bastion-state-sync` messages, it must upsert machines into the `Server` table:
- Discovered machines → create/update Server with status "discovered", store HardwareInfo as JSON in `labels`
- Queued machines → update Server status to "provisioning"
- Installed machines → update Server with hostname, IP, role, OS, status "installed"
- Track which bastion owns which server (add `bastionId` to Server model)
- Track hardware info: arch, cpu_model, cpu_cores, memory_gb, disks, nics

The bastion's local JSON state becomes a cache; labd's database is the source of truth. On startup, the bastion should load its state from labd if labd is reachable, and fall back to the local cache otherwise.
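
The mapping above can be sketched as a pure function that turns one sync payload into a list of rows to upsert. This is a minimal sketch, not the actual labd handler: the type names, the field names, the use of the MAC address as the upsert key, and the rule that later categories win (installed over provisioning over discovered) are all assumptions for illustration. The real payload types live in `@lab/shared/protocol`.

```typescript
// Hypothetical shapes; the real ones are defined in @lab/shared/protocol.
type ServerStatus = "discovered" | "provisioning" | "installed";

interface HardwareInfo {
  arch: string;
  cpu_model: string;
  cpu_cores: number;
  memory_gb: number;
}

interface BastionState {
  discovered: { mac: string; hardware: HardwareInfo }[];
  install_queue: { mac: string }[];
  installed: { mac: string; hostname: string; ip: string; role: string; os: string }[];
}

interface ServerUpsert {
  mac: string; // assumed to be the unique key used for the upsert
  bastionId: string;
  status: ServerStatus;
  hostname?: string;
  ip?: string;
  role?: string;
  os?: string;
  hardware?: HardwareInfo;
}

// Collapse the three bastion categories into one row per MAC address.
// Assumption: a machine seen in several categories takes the status of the
// latest stage, while keeping hardware info captured at discovery.
function toServerUpserts(bastionId: string, state: BastionState): ServerUpsert[] {
  const byMac = new Map<string, ServerUpsert>();
  const merge = (mac: string, patch: Partial<ServerUpsert> & { status: ServerStatus }): void => {
    const key = mac.toLowerCase();
    byMac.set(key, { ...byMac.get(key), ...patch, mac: key, bastionId });
  };

  for (const m of state.discovered) {
    merge(m.mac, { status: "discovered", hardware: m.hardware });
  }
  for (const m of state.install_queue) {
    merge(m.mac, { status: "provisioning" });
  }
  for (const m of state.installed) {
    merge(m.mac, { status: "installed", hostname: m.hostname, ip: m.ip, role: m.role, os: m.os });
  }
  return Array.from(byMac.values());
}
```

Each returned row could then be passed to a Prisma `upsert` call (`where` on the unique key, with `create` and `update` payloads), assuming `mac` is declared unique on the `Server` model.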

### 2. Resource Model Expansion

Add new models to the Prisma schema for tracking infrastructure:

**Network** — L2/L3 network segments
- name, cidr, vlan, gateway, domain, dhcpEnabled
- Servers have NICs on networks

**ServerNic** — NIC-to-network mapping
- serverId, networkId, mac, ip, name, state (UP/DOWN)
- Derived from HardwareInfo during discovery

**ServerDisk** — Disk inventory per server
- serverId, name, sizeGb, model
- Derived from HardwareInfo during discovery

**ClusterMember** — Server-to-cluster membership
- clusterId, serverId, role (control-plane, worker)
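
To illustrate how `ServerNic` and `ServerDisk` rows might be derived from `HardwareInfo`, the sketch below maps discovered NICs and disks to row objects and links each NIC to a network when its IP falls inside that network's CIDR. The shape of `HardwareInfo`, the nullable `networkId`, and the IPv4-only CIDR match are assumptions; the PRD only fixes the field names listed above.

```typescript
// Hypothetical shapes; only the field names come from the model lists above.
type LinkState = "UP" | "DOWN";

interface HardwareInfo {
  nics: { name: string; mac: string; ip?: string; state: LinkState }[];
  disks: { name: string; sizeGb: number; model: string }[];
}

interface Network { id: string; cidr: string }

interface ServerNic {
  serverId: string;
  networkId: string | null; // assumption: null when no known network matches
  mac: string;
  ip: string | null;
  name: string;
  state: LinkState;
}

interface ServerDisk { serverId: string; name: string; sizeGb: number; model: string }

// Convert dotted-quad IPv4 to a plain number (no bitwise ops, so no sign issues).
function ipv4ToNumber(ip: string): number {
  return ip.split(".").reduce((acc, octet) => acc * 256 + Number(octet), 0);
}

// True when `ip` lies inside `cidr` (IPv4 only in this sketch).
function cidrContains(cidr: string, ip: string): boolean {
  const [base = "", prefix = "32"] = cidr.split("/");
  const blockSize = Math.pow(2, 32 - Number(prefix));
  return Math.floor(ipv4ToNumber(base) / blockSize) === Math.floor(ipv4ToNumber(ip) / blockSize);
}

function findNetwork(networks: Network[], ip: string): Network | undefined {
  return networks.find((n) => cidrContains(n.cidr, ip));
}

function deriveInventory(
  serverId: string,
  hw: HardwareInfo,
  networks: Network[],
): { nics: ServerNic[]; disks: ServerDisk[] } {
  const nics: ServerNic[] = hw.nics.map((nic) => {
    const network = nic.ip ? findNetwork(networks, nic.ip) : undefined;
    return {
      serverId,
      networkId: network ? network.id : null,
      mac: nic.mac.toLowerCase(),
      ip: nic.ip ?? null,
      name: nic.name,
      state: nic.state,
    };
  });
  const disks: ServerDisk[] = hw.disks.map((d) => ({
    serverId,
    name: d.name,
    sizeGb: d.sizeGb,
    model: d.model,
  }));
  return { nics, disks };
}
```

Whether unmatched NICs should be stored with a null `networkId` or trigger automatic network creation is left open here.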

### 3. kubectl-style CLI Redesign

Restructure labctl to follow the `mcpctl` / `kubectl` pattern:

```
# Core CRUD verbs that work on any resource
labctl get <resource> [name]          # List or get specific resource
labctl describe <resource> <name>     # Detailed view with relationships
labctl create <resource> [flags]      # Create a resource
labctl delete <resource> <name>       # Delete a resource
labctl edit <resource> <name>         # Edit in $EDITOR
labctl apply -f <file>                # Declarative apply from YAML

# Resource types (with aliases)
servers (server, srv)
clusters (cluster)
networks (network, net)
bastions (bastion)
roles (role)
users (user)
tokens (token)
audit (audit)

# Output formats
-o table (default), -o json, -o yaml, -o wide

# Examples
labctl get servers                     # List all servers
labctl get servers -o wide             # With extra columns (disks, NICs)
labctl get server labmaster            # Get specific server
labctl describe server labmaster       # Full details + relationships
labctl get servers --role worker       # Filter by role
labctl get servers --status discovered # Filter by status
labctl get clusters                    # List clusters
labctl describe cluster lab-k3s        # Cluster members, health
labctl get networks                    # List networks
labctl create network --name lab --cidr 192.168.8.0/24 --gateway 192.168.8.1

# Provisioning becomes actions on server resources
labctl provision <server> --os fedora-43 --role worker   # Queue install
labctl reprovision <server>                              # Reinstall
labctl forget <server>                                   # Remove from tracking

# App management keeps the `app` group, with the verb before the app name
labctl app install k3s <server>
labctl app health k3s [server]

# Admin
labctl bastion start [--foreground]    # Start local bastion
labctl bastion status                  # Bastion health
labctl login                           # Auth
labctl doctor                          # Diagnostics
```

### 4. Resource Aliases & Resolution

Follow mcpctl's pattern from `shared.ts`:
- Accept singular, plural, and short aliases: `server`, `servers`, `srv` all resolve to the same resource
- Accept name or ID: `labctl get server labmaster` or `labctl get server <uuid>`
- Accept MAC address for servers: `labctl get server 38:05:25:33:e2:e4`
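
A minimal sketch of the resolution rules above follows. It is not taken from mcpctl's `shared.ts`; the function names (`resolveResource`, `classifySelector`) and the selector shape are hypothetical, and the alias table simply mirrors the resource list in section 3.

```typescript
// Canonical (plural) resource names mapped to their accepted aliases.
const RESOURCES: Record<string, readonly string[]> = {
  servers: ["server", "srv"],
  clusters: ["cluster"],
  networks: ["network", "net"],
  bastions: ["bastion"],
  roles: ["role"],
  users: ["user"],
  tokens: ["token"],
  audit: [],
};

// Build a flat lookup once: every alias and every canonical name points
// at the canonical name.
const ALIAS_TO_RESOURCE = new Map<string, string>();
for (const canonical of Object.keys(RESOURCES)) {
  ALIAS_TO_RESOURCE.set(canonical, canonical);
  for (const alias of RESOURCES[canonical] ?? []) {
    ALIAS_TO_RESOURCE.set(alias, canonical);
  }
}

// Hypothetical helper: resolve user input to a canonical resource name.
function resolveResource(input: string): string {
  const canonical = ALIAS_TO_RESOURCE.get(input.toLowerCase());
  if (!canonical) {
    throw new Error(`unknown resource type: ${input}`);
  }
  return canonical;
}

type Selector = { kind: "id" | "mac" | "name"; value: string };

const UUID_RE = /^[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}$/i;
const MAC_RE = /^([0-9a-f]{2}:){5}[0-9a-f]{2}$/i;

// Hypothetical helper: decide whether the positional argument is a UUID,
// a MAC address, or a plain name. MACs and UUIDs are normalised to lowercase.
function classifySelector(raw: string): Selector {
  if (UUID_RE.test(raw)) return { kind: "id", value: raw.toLowerCase() };
  if (MAC_RE.test(raw)) return { kind: "mac", value: raw.toLowerCase() };
  return { kind: "name", value: raw };
}
```

With this, `labctl get srv 38:05:25:33:e2:e4` would resolve to the `servers` resource with a MAC selector, while `labctl get server labmaster` would resolve to the same resource with a name selector. MAC selectors would only be meaningful for servers, which the caller still has to enforce.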

### 5. RBAC Integration

The existing Permission model uses `action:cloud:environment:server` patterns. Wire this into the resource system:
- CLI commands check permissions before executing
- `labctl get` respects read permissions (only show resources the user can see)
- `labctl provision` requires `apply` permission on the target server
- `labctl delete` requires `destroy` permission
- Audit all resource operations to the AuditLog table
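
One way the checks above could work is a segment-by-segment match of each granted `action:cloud:environment:server` pattern against the request. The sketch below assumes `*` acts as a wildcard for a segment and that the read action is named `read`; neither is confirmed by the existing Permission model, and all type and function names are hypothetical.

```typescript
// Hypothetical request shape mirroring the four pattern segments.
interface AccessRequest {
  action: string; // e.g. "read", "apply", "destroy"
  cloud: string;
  environment: string;
  server: string;
}

// Match one pattern against one request. Assumption: "*" matches any value
// in its segment; malformed patterns never match.
function matchesPermission(pattern: string, req: AccessRequest): boolean {
  const parts = pattern.split(":");
  if (parts.length !== 4) return false;
  const wanted = [req.action, req.cloud, req.environment, req.server];
  return parts.every((part, i) => part === "*" || part === wanted[i]);
}

// A request is allowed when any granted pattern matches it.
function isAllowed(granted: string[], req: AccessRequest): boolean {
  return granted.some((pattern) => matchesPermission(pattern, req));
}

interface ServerRef { hostname: string; cloud: string; environment: string }

// For `labctl get`: keep only the servers the user may read.
function filterReadable(granted: string[], servers: ServerRef[]): ServerRef[] {
  return servers.filter((s) =>
    isAllowed(granted, {
      action: "read",
      cloud: s.cloud,
      environment: s.environment,
      server: s.hostname,
    }),
  );
}
```

In this sketch a user holding `read:*:lab:*` and `apply:*:lab:node1` could list every server in the `lab` environment and provision only `node1`. Filtering and enforcement would need to run in labd as well, since a CLI-side check alone can be bypassed.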

### 6. Bastion State Directory Fix

Fix the bug where the CLI's `--dir` default (`/tmp/lab-bastion`) overrides the `BASTION_DIR=/data` environment variable. The CLI option should use the env var as its default:
```typescript
.option("--dir <dir>", "Bastion data directory", process.env["BASTION_DIR"] ?? "/tmp/lab-bastion")
```
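
One edge case in the one-liner: `??` falls back only on `undefined` or `null`, so an empty `BASTION_DIR=""` would produce an empty directory path. If that matters, the precedence could be made explicit in a small helper. The sketch below is an assumption-laden illustration; `resolveBastionDir` and `DEFAULT_BASTION_DIR` are hypothetical names, not existing code.

```typescript
const DEFAULT_BASTION_DIR = "/tmp/lab-bastion";

// Precedence: explicit --dir flag, then a non-empty BASTION_DIR, then the
// built-in default. `env` is passed in so the function stays easy to test.
function resolveBastionDir(
  flagValue: string | undefined,
  env: Record<string, string | undefined>,
): string {
  if (flagValue !== undefined && flagValue.trim() !== "") {
    return flagValue;
  }
  const fromEnv = env["BASTION_DIR"];
  if (fromEnv !== undefined && fromEnv.trim() !== "") {
    return fromEnv;
  }
  return DEFAULT_BASTION_DIR;
}
```

Using this would mean declaring `--dir` without a default and calling `resolveBastionDir(opts.dir, process.env)` in the command action.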

## Technical Constraints

- Database: CockroachDB with Prisma ORM (already deployed)
- API: Fastify + WebSocket (labd)
- CLI: Commander.js (labctl)
- Auth: mTLS certificates (planned), join tokens (implemented)
- Monorepo: pnpm workspace with @lab/shared, @lab/bastion, @lab/cli, @lab/labd
- The bastion-to-labd WebSocket protocol is defined in @lab/shared/protocol

## Success Criteria

1. `labctl get servers` shows all machines (discovered, provisioning, installed) from the database
2. Server state survives bastion and labd pod restarts
3. `labctl describe server <name>` shows hardware info, network, cluster membership
4. Resources have tracked relationships (server→cluster, server→network, bastion→server)
5. RBAC permissions are enforced on CLI operations
6. All resource mutations are audit-logged
7. CLI follows consistent kubectl-style `verb resource [name] [flags]` pattern