docs: comprehensive PRD for taskmaster — labctl platform

Full product requirements covering: architecture, CLI commands, partition layout, modules, testing strategy, cloud model, app model, implementation phases, tech stack, and lessons from mcpctl.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-18 00:23:24 +00:00
parent 44f1ebb843
commit ffc4a782d2
1 changed file with 452 additions and 0 deletions
--- a/.taskmaster/docs/prd.md
+++ b/.taskmaster/docs/prd.md
@@ -0,0 +1,452 @@
 # labctl — Infrastructure Management Platform
 ## Product Requirements Document
 ## 1. Overview
 labctl is a unified infrastructure management platform for bare-metal servers, Kubernetes clusters, and cloud resources. It replaces Puppet with a modern, TypeScript-native system using Pulumi for infrastructure as code.
 ### 1.1 Core Principles
 - **Single CLI** (`labctl`) for all infrastructure operations
 - **mTLS everywhere** — built-in Certificate Authority, no SSH key management
 - **RBAC from day one** — deny by default, audit everything
 - **Multi-cloud** — bare metal now, AWS later, extensible to any cloud
 - **Test infrastructure like code** — ephemeral environments, smoke tests, security tests
 - **Pulumi over Helm** — TypeScript charts, typed, testable, no YAML templating
 ### 1.2 Current State (completed)
 - PXE bastion for bare-metal provisioning (discover, install, reprovision)
 - CLI with subcommands: `labctl init bastion`, `labctl provision`
 - LVM partitioning with reprovision data preservation (/home, /srv, /var/lib/longhorn, /var/lib/rancher)
 - Worker role (k3s agent + Longhorn) and infra role (k3s server + etcd)
 - 32 unit tests, VM smoke tests verified on real hardware
 - Multi-arch builds (x86_64 + arm64), RPM/DEB packaging, Gitea CI/CD
 - labd scaffold with CockroachDB Prisma schema (Server, Agent, User, Role, Permission, AuditLog, JoinToken, Cluster, PulumiRun)
 ### 1.3 Hardware
 - labmaster (puppet.ad.itaz.eu / 78:55:36:08:35:14): MinisForum SER9, AMD Ryzen 7 255, 16 cores, 27GB RAM, 1TB NVMe, infra role
 - Future: additional bare-metal worker nodes, AWS EC2 instances
 ## 2. Architecture
 ### 2.1 Components
 ```
 labctl CLI → labd (master) → lab-agent (on every server)
                ↓
          CockroachDB
 ```
 **labctl** — CLI binary installed on developer workstations. Compiled with bun to standalone binary. Distributed as RPM/DEB/binary.
 **labd** — Master daemon running as k8s Deployment on labmaster's k3s cluster. Stateless (all state in CockroachDB). Multiple instances behind k8s Service for HA. Manages: CA, RBAC, agent registry, Pulumi executor, kubectl proxy, app deployments, log relay.
 **lab-agent** — Lightweight daemon on every managed machine. Connects to labd via mTLS WebSocket. Handles: heartbeat, command execution, log streaming, module application. Compiled to standalone binary with bun. Installed via systemd service.
 **CockroachDB** — Distributed SQL database. PostgreSQL wire-compatible (Prisma works unchanged). Single node to start, multi-node for HA. Stores: server state, RBAC, audit logs, certificates, kubeconfigs (encrypted), Pulumi state.
 **Bastion** — PXE provisioning server. Runs as k8s pod with hostNetwork (needs DHCP/TFTP). Managed by labd as an "app". Multiple bastions for multiple sites.
 ### 2.2 Network Architecture
 **Cilium** as k8s CNI (replacing default flannel):
 - eBPF-based pod networking
 - Built-in WireGuard encryption between nodes
 - Network policies (ties into RBAC)
 - Hubble for observability
 - Future: Cluster Mesh for multi-site transparent networking
 No Tailscale dependency — Cilium handles node-to-node encryption. Agents connect to labd over standard TCP/TLS.
 ### 2.3 Authentication
 **mTLS with built-in Certificate Authority:**
 1. labd generates root CA on first start (stored encrypted in CockroachDB)
 2. Agents enroll with join token → receive signed certificate
 3. CLI users authenticate with client certificates (SSH key-based authentication is used only for the initial login that obtains a certificate)
 4. All communication authenticated via mutual TLS
 5. Certificate rotation and revocation supported
 **Join tokens:**
 - One-time tokens: for individual bare-metal servers (generated during PXE provision, embedded in kickstart)
 - Reusable tokens: for autoscaling groups (AWS ASG instances share a token)
 - Tokens can be revoked, have optional expiry
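The token rules above (one-time vs reusable, revocation, optional expiry) can be sketched as a single check. This is a minimal illustration, not labd's implementation: the `JoinToken` fields and the `checkJoinToken` name are assumptions and do not reflect the actual Prisma model.

```typescript
// Hypothetical token shape; field names are assumptions for illustration.
interface JoinToken {
  id: string;
  kind: "one-time" | "reusable";
  revoked: boolean;
  expiresAt: Date | null; // null means the token has no expiry
  uses: number; // successful enrollments so far
}

type TokenCheck = { ok: true } | { ok: false; reason: string };

// Decide whether the token may be used for one more enrollment at time `now`.
function checkJoinToken(token: JoinToken, now: Date): TokenCheck {
  if (token.revoked) return { ok: false, reason: "revoked" };
  if (token.expiresAt !== null && token.expiresAt.getTime() <= now.getTime()) {
    return { ok: false, reason: "expired" };
  }
  if (token.kind === "one-time" && token.uses > 0) {
    return { ok: false, reason: "already used" };
  }
  return { ok: true };
}
```

In a real service the check and the increment of `uses` would need to happen in one database transaction, otherwise two concurrent enrollments could both pass with the same one-time token.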
 ### 2.4 RBAC Model
 Inspired by mcpctl's RBAC (`src/mcpd/src/services/`, `middleware/auth`). Permissions are hierarchical strings with four segments; `*` matches any value in a segment:
 ```
 action:cloud:environment:server
 Examples:
  read:*:*:*                    — read everything
  exec:baremetal:lab:*          — exec on any lab bare-metal server
  kubectl:*:*:*                 — kubectl proxy on any cluster
  *:baremetal:lab:puppet        — full access to puppet server only
  manage:*:*:*                  — manage apps, clusters, tokens
  admin:*:*:*                   — full admin (create users, roles)
 ```
 **Resources:** servers, environments, clouds, modules, roles, users, clusters, apps, pulumi-stacks
 **Actions:** read, exec, apply, destroy, manage, admin, kubectl
 **Deny rules:** explicit deny overrides any allow (like AWS IAM)
 Prisma models: Role, Permission (allow/deny), UserRole binding.
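A minimal sketch of how such permission strings could be evaluated, assuming per-segment `*` wildcards, deny by default, and deny-over-allow as described above. The `Rule` shape and the function names are hypothetical, and any implied hierarchy between actions (for example whether `admin` includes `read`) is not modeled here.

```typescript
// Illustrative only: rule shape and names are assumptions.
type Effect = "allow" | "deny";

interface Rule {
  effect: Effect;
  pattern: string; // action:cloud:environment:server, "*" matches any segment
}

// True when every pattern segment is "*" or equals the request segment.
function matches(pattern: string, request: string): boolean {
  const p = pattern.split(":");
  const r = request.split(":");
  if (p.length !== 4 || r.length !== 4) return false;
  return p.every((seg, i) => seg === "*" || seg === r[i]);
}

// Deny by default; an explicit deny overrides any allow.
function isAllowed(rules: Rule[], request: string): boolean {
  const hits = rules.filter((rule) => matches(rule.pattern, request));
  if (hits.some((rule) => rule.effect === "deny")) return false;
  return hits.some((rule) => rule.effect === "allow");
}

const rules: Rule[] = [
  { effect: "allow", pattern: "exec:baremetal:lab:*" },
  { effect: "deny", pattern: "*:baremetal:lab:puppet" },
];

console.log(isAllowed(rules, "exec:baremetal:lab:ser9")); // true
console.log(isAllowed(rules, "exec:baremetal:lab:puppet")); // false (deny wins)
console.log(isAllowed(rules, "read:baremetal:lab:ser9")); // false (no matching allow)
```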
 ### 2.5 Database
 **CockroachDB** chosen over PostgreSQL and Cassandra:
 - PostgreSQL wire-compatible — Prisma works, mcpctl patterns reusable
 - Multi-master replication — any node accepts reads AND writes
 - Strong consistency (not eventual like Cassandra)
 - Survives node failures: with a replication factor of 3 the cluster tolerates 1 failed node; tolerating 2 requires 5 nodes and a replication factor of 5
 - Auto-rebalancing when adding nodes
 - Start single-node, scale to multi-node with zero code changes (just add nodes)
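The failure-tolerance figures follow from majority quorum: a range stays available while more than half of its replicas are reachable. A small worked sketch, assuming the replication factor is raised to match the node count (CockroachDB's default replication factor is 3, so a 5-node cluster only tolerates 2 simultaneous failures once the factor is set to 5). The function name is hypothetical.

```typescript
// Simultaneous replica failures a majority quorum can tolerate.
function tolerableFailures(replicas: number): number {
  if (!Number.isInteger(replicas) || replicas < 1) {
    throw new RangeError("replicas must be a positive integer");
  }
  return Math.floor((replicas - 1) / 2);
}

console.log(tolerableFailures(1)); // 0
console.log(tolerableFailures(3)); // 1
console.log(tolerableFailures(5)); // 2
```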
 **Schema (already scaffolded in Prisma):**
 - Server — managed machines (hostname, mac, cloud, env, role, labels, status)
 - Agent — connected agents (cert, enrollment, last seen)
 - User — platform users (username, cert fingerprint)
 - Role — RBAC roles with permissions
 - Permission — allow/deny rules (action:cloud:env:server)
 - UserRole — user-to-role bindings
 - JoinToken — enrollment tokens (one-time, reusable, revocable)
 - AuditLog — every action logged (user, session, action, resource, result, duration)
 - PulumiRun — infrastructure-as-code execution records
 - Cluster — managed k8s clusters (kubeconfig encrypted)
 ## 3. CLI Command Reference
 ### 3.1 Bastion (PXE Provisioning) — IMPLEMENTED
 ```bash
 sudo labctl init bastion standalone start [--foreground] [--port 8080]
 sudo labctl init bastion standalone stop
 labctl init bastion standalone status
 ```
 ### 3.2 Provisioning — IMPLEMENTED
 ```bash
 labctl provision list
 labctl provision install <mac> <hostname> --role worker|infra
 labctl provision reprovision <mac> <hostname> --role worker|infra
 labctl provision forget <mac>
 ```
 ### 3.3 Server Management — TO BUILD
 ```bash
 labctl get servers [--env NAME] [--cloud NAME] [--label KEY=VALUE]
 labctl describe server/<name>
 ```
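Several commands in this reference address resources as `kind/name` (for example `server/<name>`, `app/<name>`, or bare `audit`). A sketch of how such an argument might be parsed follows; the kind list is taken from the commands in this document, while `parseResourceRef` and the `ResourceRef` shape are assumptions.

```typescript
// Kinds that appear in this command reference; the parser itself is hypothetical.
const KINDS = ["server", "app", "pulumi", "bastion", "agent", "audit"] as const;
type Kind = (typeof KINDS)[number];

interface ResourceRef {
  kind: Kind;
  name: string | null; // null when only the kind is given, e.g. "audit"
}

function parseResourceRef(arg: string): ResourceRef {
  const slash = arg.indexOf("/");
  const kind = slash === -1 ? arg : arg.slice(0, slash);
  const name = slash === -1 ? null : arg.slice(slash + 1);
  if (!(KINDS as readonly string[]).includes(kind)) {
    throw new Error(`unknown resource kind: ${kind}`);
  }
  if (name !== null && name.length === 0) {
    throw new Error(`missing name in "${arg}"`);
  }
  return { kind: kind as Kind, name };
}

console.log(parseResourceRef("server/labmaster")); // { kind: 'server', name: 'labmaster' }
```

Splitting on the first slash only keeps names that themselves contain a slash intact.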
 ### 3.4 Remote Execution — TO BUILD
 ```bash
 labctl exec server/<name> -- <command>
 labctl exec server/<name> -it -- bash          # interactive TTY
 labctl exec server/<name> --timeout 30s -- cmd
 ```
 ### 3.5 Kubernetes Proxy — TO BUILD
 ```bash
 labctl kubectl --cluster <name> <kubectl-args>
 labctl clusters add <name> --kubeconfig <path>
 labctl clusters list
 labctl clusters remove <name>
 ```
 ### 3.6 Logs — TO BUILD
 ```bash
 # Server logs (journalctl passthrough, no DB in hot path)
 labctl logs server/<name>                     # all journal
 labctl logs server/<name> -f                  # follow (live WebSocket relay)
 labctl logs server/<name> -n 100              # last 100 lines
 labctl logs server/<name> -u k3s              # specific unit
 labctl logs server/<name> -u sshd --since "1h ago"
 labctl logs server/<name> -k                  # kernel
 labctl logs server/<name> -p err              # errors only
 labctl logs server/<name> --file /var/log/nginx/error.log
 # App logs (k8s pod logs)
 labctl logs app/<name> [-f] [--container NAME]
 # Pulumi execution logs
 labctl logs pulumi/<run-id> [-f]
 # Bastion logs
 labctl logs bastion/<env> [--mac MAC]
 # Agent daemon logs
 labctl logs agent/<server>
 # Audit logs (from CockroachDB)
 labctl logs audit [--user NAME] [--action ACTION] [--since TIME]
 labctl logs audit/<user-date-sessionid>       # specific session
 ```
 Log architecture: the agent runs `journalctl` (or `tail` for `--file`) with the user-provided flags and streams stdout over its WebSocket to labd, which relays the stream to the CLI. No database sits in the hot path. Future: Grafana Loki integration for cold storage.
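Because the flags come from the user, one reasonable approach is to map them onto a fixed set of `journalctl` options and pass them as an argument vector, never through a shell. A minimal sketch under that assumption; `LogOptions` and `journalctlArgs` are hypothetical names, and only standard `journalctl` flags are emitted.

```typescript
// Hypothetical option bag mirroring the CLI flags listed above.
interface LogOptions {
  follow?: boolean; // -f
  lines?: number; // -n
  unit?: string; // -u
  since?: string; // --since
  kernel?: boolean; // -k
  priority?: string; // -p
}

// Build an argv array for journalctl; each value stays a separate element,
// so nothing is ever interpreted by a shell.
function journalctlArgs(opts: LogOptions): string[] {
  const args: string[] = ["--no-pager"];
  if (opts.follow) args.push("-f");
  if (opts.lines !== undefined) {
    if (!Number.isInteger(opts.lines) || opts.lines < 0) {
      throw new RangeError("lines must be a non-negative integer");
    }
    args.push("-n", String(opts.lines));
  }
  if (opts.unit) args.push("-u", opts.unit);
  if (opts.since) args.push("--since", opts.since);
  if (opts.kernel) args.push("-k");
  if (opts.priority) args.push("-p", opts.priority);
  return args;
}

console.log(journalctlArgs({ unit: "sshd", since: "1h ago" }));
// [ '--no-pager', '-u', 'sshd', '--since', '1h ago' ]
```

The resulting array would be handed to a process-spawning API that takes arguments separately from the command name.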
 ### 3.7 Apps (Pulumi Charts, replacing Helm) — TO BUILD
 ```bash
 labctl apps list
 labctl apps install <name> [--set key=value] [-f values.yaml]
 labctl apps status <name>
 labctl apps upgrade <name>
 labctl apps history <name>
 labctl apps rollback <name> <version>
 labctl apps uninstall <name>
 ```
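One way `--set key=value` overrides could be layered over chart defaults is sketched below. It assumes Helm-style dotted key paths, which this document does not specify; `applySet` and `resolveValues` are hypothetical names. Values from the command line arrive as strings, so any type coercion would be a separate step.

```typescript
type Values = { [key: string]: string | number | boolean | Values };

// Keys that must never be used as object paths.
const FORBIDDEN = new Set(["__proto__", "constructor", "prototype"]);

// Apply one "a.b.c=value" override and return a new object; `base` is not mutated.
function applySet(base: Values, assignment: string): Values {
  const eq = assignment.indexOf("=");
  if (eq <= 0) throw new Error(`expected key=value, got "${assignment}"`);
  const keys = assignment.slice(0, eq).split(".");
  if (keys.some((k) => k === "" || FORBIDDEN.has(k))) {
    throw new Error(`invalid key path in "${assignment}"`);
  }
  const last = keys.pop() as string; // split() always returns at least one element
  const out: Values = { ...base };
  let node = out;
  for (const key of keys) {
    const child = node[key];
    // Copy nested objects on the way down so the defaults stay untouched.
    const next: Values = typeof child === "object" ? { ...child } : {};
    node[key] = next;
    node = next;
  }
  node[last] = assignment.slice(eq + 1);
  return out;
}

// Later assignments win over earlier ones and over the defaults.
function resolveValues(defaults: Values, sets: string[]): Values {
  return sets.reduce(applySet, defaults);
}

const defaults: Values = { replicas: 1, image: { repo: "labd", tag: "1.0" } };
console.log(JSON.stringify(resolveValues(defaults, ["image.tag=2.0"])));
// {"replicas":1,"image":{"repo":"labd","tag":"2.0"}}
```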
 ### 3.8 Infrastructure as Code — TO BUILD
 ```bash
 labctl apply -f <file.ts> --env <env>
 labctl plan -f <file.ts> --env <env>
 labctl destroy -f <file.ts> --env <env>
 ```
 ### 3.9 RBAC — TO BUILD
 ```bash
 labctl get roles
 labctl get users
 labctl create role <name> --allow "action:cloud:env:server"
 labctl create role <name> --deny "destroy:*:*:*"
 labctl bind role <role> --user <user>
 labctl unbind role <role> --user <user>
 labctl get permissions
 ```
 ### 3.10 Environments and Clouds — TO BUILD
 ```bash
 labctl get environments
 labctl get clouds
 labctl create environment <name> --cloud <cloud>
 ```
 ## 4. Partition Layout
 ### Worker Role
 ```
 /boot/efi       600MB  EFI
 /boot           3GB    ext4
 ── LVM VG: labvg ──
  swap          27GB
  /             33GB   xfs
  /var          100GB  xfs
  /var/log      10GB   xfs
  /home         10GB   xfs         ← preserved on reprovision
  /srv          20GB   xfs         ← preserved on reprovision
  /var/lib/longhorn  rest  xfs     ← preserved (Longhorn PVC storage)
  /tmp          tmpfs 4GB
 ```
 ### Infra Role
 ```
 /boot/efi       600MB  EFI
 /boot           3GB    ext4
 ── LVM VG: labvg ──
  swap          27GB
  /             33GB   xfs
  /var          100GB  xfs
  /var/log      10GB   xfs
  /home         10GB   xfs         ← preserved on reprovision
  /srv          20GB   xfs         ← preserved on reprovision
  /var/lib/rancher  20GB  xfs      ← preserved (k3s etcd data)
  /tmp          tmpfs 4GB
 ```
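The two layouts can also be written as data, which makes it easy to check how much of the volume group is claimed by fixed-size volumes and which mounts survive a reprovision. The sizes below are copied from the tables above; the `Volume` type and helper names are assumptions. `/boot`, `/boot/efi`, and the tmpfs `/tmp` are left out because they do not consume space in `labvg`.

```typescript
interface Volume {
  mount: string;
  sizeGB: number | "rest"; // "rest" takes whatever remains in the volume group
  preserved: boolean; // kept on reprovision
}

// Logical volumes shared by both roles.
const common: Volume[] = [
  { mount: "swap", sizeGB: 27, preserved: false },
  { mount: "/", sizeGB: 33, preserved: false },
  { mount: "/var", sizeGB: 100, preserved: false },
  { mount: "/var/log", sizeGB: 10, preserved: false },
  { mount: "/home", sizeGB: 10, preserved: true },
  { mount: "/srv", sizeGB: 20, preserved: true },
];

const layouts: Record<"worker" | "infra", Volume[]> = {
  worker: [...common, { mount: "/var/lib/longhorn", sizeGB: "rest", preserved: true }],
  infra: [...common, { mount: "/var/lib/rancher", sizeGB: 20, preserved: true }],
};

// Sum of all fixed-size logical volumes, ignoring "rest".
function fixedTotalGB(vols: Volume[]): number {
  return vols.reduce((sum, v) => sum + (v.sizeGB === "rest" ? 0 : v.sizeGB), 0);
}

function preservedMounts(vols: Volume[]): string[] {
  return vols.filter((v) => v.preserved).map((v) => v.mount);
}

console.log(fixedTotalGB(layouts.worker)); // 200
console.log(fixedTotalGB(layouts.infra)); // 220
console.log(preservedMounts(layouts.infra)); // [ '/home', '/srv', '/var/lib/rancher' ]
```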
 ## 5. Module System
 Configuration modules define desired state. Three tiers:
 1. **Core modules** (this repo, `modules/`): k3s-server, k3s-agent, labd, lab-agent, bastion
 2. **Official modules** (separate repos): monitoring, cilium, DNS
 3. **Custom modules** (user repos): pulled by git URL
 Module structure:
 ```
 module.yaml          # name, version, targets (roles/labels), deps
 src/index.ts         # entry point
 src/install.ts       # installation logic
 src/configure.ts     # configuration logic
 src/health.ts        # health check
 tests/               # vitest tests (mandatory)
 ```
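A sketch of how the `module.yaml` fields (name, version, targets, deps) might be used: matching a module against a server's role and labels, and ordering modules so dependencies are applied first. The manifest shape, the function names, and the example dependency chain are all assumptions for illustration.

```typescript
// Hypothetical in-memory form of module.yaml.
interface ModuleManifest {
  name: string;
  version: string;
  targets: { roles?: string[]; labels?: Record<string, string> };
  deps: string[];
}

interface ServerInfo {
  role: string;
  labels: Record<string, string>;
}

// A module applies when the role matches (if roles are listed)
// and every target label is present with the same value.
function appliesTo(mod: ModuleManifest, server: ServerInfo): boolean {
  const roleOk = !mod.targets.roles || mod.targets.roles.includes(server.role);
  const labelsOk = Object.entries(mod.targets.labels ?? {}).every(
    ([key, value]) => server.labels[key] === value,
  );
  return roleOk && labelsOk;
}

// Depth-first topological sort: dependencies come before dependents.
function installOrder(mods: ModuleManifest[]): string[] {
  const byName = new Map(mods.map((m) => [m.name, m] as const));
  const state = new Map<string, "visiting" | "done">();
  const order: string[] = [];
  const visit = (name: string): void => {
    if (state.get(name) === "done") return;
    if (state.get(name) === "visiting") throw new Error(`dependency cycle at ${name}`);
    const mod = byName.get(name);
    if (!mod) throw new Error(`unknown module: ${name}`);
    state.set(name, "visiting");
    mod.deps.forEach((dep) => visit(dep));
    state.set(name, "done");
    order.push(name);
  };
  mods.forEach((m) => visit(m.name));
  return order;
}

// Example dependency chain (illustrative, not the real module graph).
const mods: ModuleManifest[] = [
  { name: "labd", version: "0.1.0", targets: { roles: ["infra"] }, deps: ["cilium"] },
  { name: "cilium", version: "0.1.0", targets: {}, deps: ["k3s-server"] },
  { name: "k3s-server", version: "0.1.0", targets: { roles: ["infra"], labels: { k3s: "server" } }, deps: [] },
];

console.log(installOrder(mods)); // [ 'k3s-server', 'cilium', 'labd' ]
```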
 ## 6. Testing Strategy
 ### 6.1 Testing Pyramid
 ```
 Unit Tests        → pure logic, milliseconds, every commit
 Smoke Tests       → containers (podman-compose), minutes, every commit
 Integration Tests → VMs (libvirt), 10-15 min, PRs
 E2E Tests         → real hardware/cloud, 20-30 min, pre-release
 ```
 ### 6.2 Smoke Test Stack (podman-compose)
 ```yaml
 services:
  cockroachdb:
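     image: cockroachdb/cockroach:latest-v24.3
     command: start-single-node --insecure  # assumption: dev/test only; the image needs an explicit start command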
  labd:
    build: .
    depends_on: [cockroachdb]
  agent-1:
    build: ./agent
    depends_on: [labd]
  agent-2:
    build: ./agent
    depends_on: [labd]
 ```
 Tests: agent enrollment, certificate issuance, heartbeat, exec, logs, RBAC deny/allow.
 ### 6.3 Security Tests (RBAC)
 - Deny exec without permission
 - Deny cross-environment access
 - Deny rules override allow rules
 - Cannot escalate own permissions
 - The audit log records all denied attempts
 - Certificate-based auth cannot be spoofed
 - One-time join tokens cannot be reused
 - Expired tokens are rejected
 ### 6.4 Ephemeral Test Environments
 ```bash
 labctl test smoke                                    # podman-compose
 labctl test integration                              # libvirt VMs
 labctl env create pr-123 --cloud containers          # CI ephemeral
 labctl env create pr-123 --cloud aws                 # cloud ephemeral (future)
 ```
 ### 6.5 Health Gates for Deployment
 Before promoting to production, ALL must pass:
 - labd API responds
 - Expected number of agents connected
 - k3s nodes Ready
 - Certificates have more than 30 days of validity remaining
 - RBAC smoke test passes
 - No error logs in last 5 minutes
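The gates above could be evaluated from a single health snapshot, returning the names of any failing gates so a deploy tool can report why promotion was blocked. This is a sketch only: `HealthSnapshot`, its fields, and `failedGates` are hypothetical, and "expected number of agents" is read here as an exact match.

```typescript
// Hypothetical snapshot collected before promotion.
interface HealthSnapshot {
  apiResponds: boolean;
  agentsConnected: number;
  agentsExpected: number;
  nodesReady: boolean;
  earliestCertExpiry: Date; // soonest expiry among the checked certificates
  rbacSmokePassed: boolean;
  errorLogsLast5Min: number;
}

const DAY_MS = 24 * 60 * 60 * 1000;

// Returns the names of failing gates; an empty array means promotion may proceed.
function failedGates(s: HealthSnapshot, now: Date): string[] {
  const gates: Array<[string, boolean]> = [
    ["labd API responds", s.apiResponds],
    ["expected agents connected", s.agentsConnected === s.agentsExpected],
    ["k3s nodes Ready", s.nodesReady],
    ["certificates valid > 30 days", s.earliestCertExpiry.getTime() - now.getTime() > 30 * DAY_MS],
    ["RBAC smoke test passes", s.rbacSmokePassed],
    ["no error logs in last 5 minutes", s.errorLogsLast5Min === 0],
  ];
  return gates.filter(([, ok]) => !ok).map(([name]) => name);
}
```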
 ## 7. Cloud/Environment Model
 ```
 Cloud: baremetal
  └── Environment: lab
       ├── Server: labmaster.ad.itaz.eu (infra, labels={k3s=server})
       └── Server: ser9.ad.itaz.eu (worker, labels={k3s=agent})
 Cloud: aws (future)
  └── Environment: production
       ├── Server: i-abc123 (from ASG web-servers)
       └── Server: i-def456 (from ASG web-servers)
 ```
 Each bastion creates an environment under the `baremetal` cloud. AWS autoscaling groups create environments under the `aws` cloud.
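The cloud, environment, and server hierarchy maps naturally onto a flat list of servers that carry their cloud and environment, which is also what filters such as `--env`, `--cloud`, and `--label KEY=VALUE` would operate on. A minimal sketch with the servers from the tree above; the type and function names are assumptions, and the AWS entries have no role or labels because the document does not give any.

```typescript
interface ManagedServer {
  name: string;
  cloud: string;
  env: string;
  role?: string;
  labels: Record<string, string>;
}

interface ServerFilter {
  cloud?: string;
  env?: string;
  label?: string; // "KEY=VALUE"
}

function parseLabel(label: string): { key: string; value: string } {
  const eq = label.indexOf("=");
  if (eq <= 0) throw new Error(`expected KEY=VALUE, got "${label}"`);
  return { key: label.slice(0, eq), value: label.slice(eq + 1) };
}

// Every filter that is set must match; unset filters match everything.
function filterServers(servers: ManagedServer[], f: ServerFilter): ManagedServer[] {
  const label = f.label === undefined ? undefined : parseLabel(f.label);
  return servers.filter(
    (s) =>
      (f.cloud === undefined || s.cloud === f.cloud) &&
      (f.env === undefined || s.env === f.env) &&
      (label === undefined || s.labels[label.key] === label.value),
  );
}

const servers: ManagedServer[] = [
  { name: "labmaster.ad.itaz.eu", cloud: "baremetal", env: "lab", role: "infra", labels: { k3s: "server" } },
  { name: "ser9.ad.itaz.eu", cloud: "baremetal", env: "lab", role: "worker", labels: { k3s: "agent" } },
  { name: "i-abc123", cloud: "aws", env: "production", labels: {} },
  { name: "i-def456", cloud: "aws", env: "production", labels: {} },
];

console.log(filterServers(servers, { env: "lab", label: "k3s=agent" }).map((s) => s.name));
// [ 'ser9.ad.itaz.eu' ]
```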
 ## 8. App Model (Pulumi Charts)
 Each app is a Pulumi TypeScript program:
 ```
 app.yaml             # name, version, inputs schema, required permissions
 src/index.ts         # Pulumi program
 values.yaml          # defaults
 tests/               # vitest tests
 ```
 First apps to build:
 - bastion — PXE provisioning (wrap existing code)
 - labd — master daemon (self-deployment)
 - cockroachdb — database
 - cilium — CNI
 ## 9. Implementation Phases
 ### Phase 1: Foundation (PARTIALLY DONE)
 - [x] PXE bastion (discover, install, reprovision)
 - [x] CLI structure (labctl init/provision)
 - [x] labd scaffold (Fastify + CockroachDB/Prisma schema)
 - [x] Multi-arch builds, packaging, CI/CD
 - [ ] Certificate Authority in labd
 - [ ] lab-agent skeleton (connect, heartbeat, enrollment)
 - [ ] Agent enrollment via join tokens
 - [ ] RBAC engine
 - [ ] labctl exec (remote execution)
 - [ ] labctl logs (resource-scoped streaming)
 - [ ] labctl get servers (with filters)
 - [ ] Smoke test stack (podman-compose)
 ### Phase 2: Deployment
 - [ ] Reprovision labmaster as labmaster.ad.itaz.eu
 - [ ] Deploy k3s with Cilium CNI
 - [ ] Deploy CockroachDB on k3s
 - [ ] Deploy labd on k3s
 - [ ] Deploy bastion as managed app
 - [ ] Auto-enroll agents during PXE provision
 ### Phase 3: Infrastructure as Code
 - [ ] Module system
 - [ ] Pulumi charts (replacing Helm)
 - [ ] labctl apps install/upgrade/rollback
 - [ ] labctl apply -f (Pulumi execution)
 - [ ] kubectl proxy (audited)
 - [ ] Kubeconfig store (encrypted)
 ### Phase 4: Multi-Cloud
 - [ ] AWS provider (Pulumi)
 - [ ] Reusable join tokens for ASGs
 - [ ] Cilium Cluster Mesh
 - [ ] Ephemeral test environments
 - [ ] Grafana Loki for cold logs
 ## 10. Technology Stack
 | Component | Technology | Notes |
 |-----------|-----------|-------|
 | Language | TypeScript (ESM) | Same for CLI, daemon, agents, IaC |
 | CLI | Commander.js | Matches mcpctl patterns |
 | HTTP Server | Fastify + WebSocket | labd and bastion |
 | Database | CockroachDB | PostgreSQL compatible, Prisma ORM |
 | ORM | Prisma | Reuse mcpctl patterns |
 | IaC | Pulumi (TypeScript) | Replaces Helm and Puppet |
 | k8s CNI | Cilium | eBPF, WireGuard, network policies |
 | Auth | mTLS (built-in CA) | Certificate-based, no SSH keys |
 | Packaging | nfpm (RPM/DEB) | bun compile for standalone binary |
 | Containers | Podman + podman-compose | No Docker dependency |
 | CI/CD | Gitea Actions | Self-hosted on mysources.co.uk |
 | Testing | Vitest | Unit + smoke + integration |
 | Registry | Gitea packages | RPM, DEB, container images |
 ## 11. Lessons from mcpctl
 The mcpctl project (../mcpctl/) established patterns reused here:
 **Project structure:** pnpm monorepo with workspace packages (shared, cli, daemon). Each package has own package.json, tsconfig.json, vitest.config.ts.
 **CLI patterns:** Commander.js with factory functions (`createXxxCommand`). Global options (mcpctl's `--project` maps to `--env`/`--cloud` here). Resource CRUD (get, describe, delete, create, apply).
 **Server patterns:** Fastify with route registration functions. Services layer with repository pattern. Middleware for auth. Health endpoints.
 **Database:** Prisma ORM with PostgreSQL (now CockroachDB, wire-compatible). Migration-first schema. Seed data for initial setup.
 **RBAC:** Role-based with permission strings. Middleware checks on every request. Audit logging in middleware.
 **Testing:** Vitest with separate configs for unit vs smoke. Smoke tests with real database and services. Security tests for RBAC.
 **CI/CD:** Gitea Actions with lint→typecheck→test→build→publish pipeline. nfpm for RPM/DEB. Bun compile for standalone binaries. Podman for container images.
 **Deployment:** Docker/Podman compose for dev stack. Portainer API for production deploy (we'll use k3s instead). systemd for local daemons.
 **Completions:** Generated from Commander tree. Bash + Fish. --write and --check modes. Included in packages.
 **Key learnings applied:**
 - Start with proper monorepo structure (not flat scripts)
 - Type safety across packages via workspace references
 - Test-driven (unit tests before features)
 - CI from the start (not retrofitted)
 - RBAC and audit from the start (not bolted on)
 - Database-first design (schema defines the domain)
 ## 12. Gitea Registry
 **Registry:** mysources.co.uk (self-hosted Gitea at 10.0.0.194)
 **Token:** stored at ~/.gitea-token, env var PACKAGES_TOKEN
 **Packages:** RPM and DEB published to Gitea packages API
 **Container images:** pushed to Gitea container registry
 **API pattern:** Same as mcpctl publish scripts (check existing, delete, re-upload, link to repo)