docs: comprehensive PRD for taskmaster — labctl platform

Full product requirements covering: architecture, CLI commands,
partition layout, modules, testing strategy, cloud model, app model,
implementation phases, tech stack, and lessons from mcpctl.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Michal
2026-03-18 00:23:24 +00:00
parent 44f1ebb843
commit ffc4a782d2

.taskmaster/docs/prd.md (new file, 452 lines)

@@ -0,0 +1,452 @@
# labctl — Infrastructure Management Platform
## Product Requirements Document
## 1. Overview
labctl is a unified infrastructure management platform for bare-metal servers, Kubernetes clusters, and cloud resources. It replaces Puppet with a modern, TypeScript-native system using Pulumi for infrastructure as code.
### 1.1 Core Principles
- **Single CLI** (`labctl`) for all infrastructure operations
- **mTLS everywhere** — built-in Certificate Authority, no SSH key management
- **RBAC from day one** — deny by default, audit everything
- **Multi-cloud** — bare metal now, AWS later, extensible to any cloud
- **Test infrastructure like code** — ephemeral environments, smoke tests, security tests
- **Pulumi over Helm** — TypeScript charts, typed, testable, no YAML templating
### 1.2 Current State (completed)
- PXE bastion for bare-metal provisioning (discover, install, reprovision)
- CLI with subcommands: `labctl init bastion`, `labctl provision`
- LVM partitioning with reprovision data preservation (/home, /srv, /var/lib/longhorn, /var/lib/rancher)
- Worker role (k3s agent + Longhorn) and infra role (k3s server + etcd)
- 32 unit tests, VM smoke tests verified on real hardware
- Multi-arch builds (x86_64 + arm64), RPM/DEB packaging, Gitea CI/CD
- labd scaffold with CockroachDB Prisma schema (Server, Agent, User, Role, Permission, AuditLog, JoinToken, Cluster, PulumiRun)
### 1.3 Hardware
- labmaster (puppet.ad.itaz.eu / 78:55:36:08:35:14): MinisForum SER9, AMD Ryzen 7 255, 8 cores / 16 threads, 27GB RAM, 1TB NVMe, infra role
- Future: additional bare-metal worker nodes, AWS EC2 instances
## 2. Architecture
### 2.1 Components
```
labctl CLI → labd (master) → lab-agent (on every server)
                   ↓
              CockroachDB
```
**labctl** — CLI binary installed on developer workstations. Compiled with bun to standalone binary. Distributed as RPM/DEB/binary.
**labd** — Master daemon running as k8s Deployment on labmaster's k3s cluster. Stateless (all state in CockroachDB). Multiple instances behind k8s Service for HA. Manages: CA, RBAC, agent registry, Pulumi executor, kubectl proxy, app deployments, log relay.
**lab-agent** — Lightweight daemon on every managed machine. Connects to labd via mTLS WebSocket. Handles: heartbeat, command execution, log streaming, module application. Compiled to standalone binary with bun. Installed via systemd service.
**CockroachDB** — Distributed SQL database. PostgreSQL wire-compatible (Prisma works unchanged). Single node to start, multi-node for HA. Stores: server state, RBAC, audit logs, certificates, kubeconfigs (encrypted), Pulumi state.
**Bastion** — PXE provisioning server. Runs as k8s pod with hostNetwork (needs DHCP/TFTP). Managed by labd as an "app". Multiple bastions for multiple sites.
### 2.2 Network Architecture
**Cilium** as k8s CNI (replacing default flannel):
- eBPF-based pod networking
- Built-in WireGuard encryption between nodes
- Network policies (ties into RBAC)
- Hubble for observability
- Future: Cluster Mesh for multi-site transparent networking
No Tailscale dependency — Cilium handles node-to-node encryption. Agents connect to labd over standard TCP/TLS.
### 2.3 Authentication
**mTLS with built-in Certificate Authority:**
1. labd generates root CA on first start (stored encrypted in CockroachDB)
2. Agents enroll with join token → receive signed certificate
3. CLI users authenticate with client certificates (or SSH key-based initial auth)
4. All communication authenticated via mutual TLS
5. Certificate rotation and revocation supported
**Join tokens:**
- One-time tokens: for individual bare-metal servers (generated during PXE provision, embedded in kickstart)
- Reusable tokens: for autoscaling groups (AWS ASG instances share a token)
- Tokens can be revoked, have optional expiry
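The token rules above can be sketched as a small validation function. This is a minimal illustration, not the labd implementation: the `JoinTokenRecord` shape and its field names are assumptions and do not come from the Prisma schema.

```typescript
// Hypothetical record shape; field names are assumptions for illustration.
interface JoinTokenRecord {
  kind: "one-time" | "reusable";
  revoked: boolean;
  expiresAt: Date | null; // null means the token has no expiry
  useCount: number;
}

type TokenCheck = { ok: true } | { ok: false; reason: string };

// Applies the three rules from the list above: revocation, optional expiry,
// and single use for one-time tokens.
function checkJoinToken(token: JoinTokenRecord, now: Date): TokenCheck {
  if (token.revoked) return { ok: false, reason: "revoked" };
  if (token.expiresAt !== null && token.expiresAt.getTime() <= now.getTime()) {
    return { ok: false, reason: "expired" };
  }
  if (token.kind === "one-time" && token.useCount > 0) {
    return { ok: false, reason: "already used" };
  }
  return { ok: true };
}

// Returns the updated record on success, or null when enrollment is refused.
function redeemJoinToken(token: JoinTokenRecord, now: Date): JoinTokenRecord | null {
  if (!checkJoinToken(token, now).ok) return null;
  return { ...token, useCount: token.useCount + 1 };
}
```

In a real daemon the check and the counter update would likely need to happen in one database transaction, so that two agents cannot redeem the same one-time token concurrently.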
### 2.4 RBAC Model
Inspired by mcpctl's RBAC (src/mcpd/src/services/, middleware/auth). Hierarchical permissions:
```
action:cloud:environment:server
Examples:
read:*:*:* — read everything
exec:baremetal:lab:* — exec on any lab bare-metal server
kubectl:*:*:* — kubectl proxy on any cluster
*:baremetal:lab:puppet — full access to puppet server only
manage:*:*:* — manage apps, clusters, tokens
admin:*:*:* — full admin (create users, roles)
```
**Resources:** servers, environments, clouds, modules, roles, users, clusters, apps, pulumi-stacks
**Actions:** read, exec, apply, destroy, manage, admin, kubectl
**Deny rules:** explicit deny overrides any allow (like AWS IAM)
Prisma models: Role, Permission (allow/deny), UserRole binding.
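A minimal sketch of how the four-segment permission strings could be evaluated, assuming deny by default and explicit deny overriding allow as described above. The type and function names are hypothetical, and the sketch treats every action as a literal string (it does not assume that `admin` implies other actions).

```typescript
// Hypothetical types: the PRD only fixes the format action:cloud:environment:server.
type Effect = "allow" | "deny";

interface PermissionRule {
  effect: Effect;
  pattern: string; // e.g. "exec:baremetal:lab:*"
}

// True when every pattern segment is "*" or equals the matching request segment.
function matchesPattern(pattern: string, request: string): boolean {
  const p = pattern.split(":");
  const r = request.split(":");
  if (p.length !== 4 || r.length !== 4) return false;
  return p.every((segment, i) => segment === "*" || segment === r[i]);
}

// Deny by default; any matching deny rule wins over any matching allow rule.
function isAllowed(rules: PermissionRule[], request: string): boolean {
  const matching = rules.filter((rule) => matchesPattern(rule.pattern, request));
  if (matching.some((rule) => rule.effect === "deny")) return false;
  return matching.some((rule) => rule.effect === "allow");
}

// Example role: may exec on lab bare metal and read everything,
// but is explicitly locked out of the puppet server.
const operatorRules: PermissionRule[] = [
  { effect: "allow", pattern: "exec:baremetal:lab:*" },
  { effect: "allow", pattern: "read:*:*:*" },
  { effect: "deny", pattern: "*:baremetal:lab:puppet" },
];
```

With these rules, `isAllowed(operatorRules, "exec:baremetal:lab:ser9")` is true, while the same action against `puppet` is refused by the deny rule and `destroy` is refused because no rule allows it.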
### 2.5 Database
**CockroachDB** chosen over PostgreSQL and Cassandra:
- PostgreSQL wire-compatible — Prisma works, mcpctl patterns reusable
- Multi-master replication — any node accepts reads AND writes
- Strong consistency (not eventual like Cassandra)
- Survives node failures: 3 nodes tolerate 1 failure; 5 nodes tolerate 2 once the replication factor is raised from the default of 3 to 5
- Auto-rebalancing when adding nodes
- Start single-node, scale to multi-node with zero code changes (just add nodes)
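The failure-tolerance figures follow from majority-quorum arithmetic. CockroachDB replicates each range with Raft, so the relevant number is the replica count per range (the replication factor, 3 by default) and not the raw node count. A small worked sketch, with hypothetical function names:

```typescript
// A Raft group stays available while a majority of its replicas is alive.
function quorumSize(replicas: number): number {
  return Math.floor(replicas / 2) + 1;
}

// Number of replicas that can be lost while a majority still survives.
function toleratedFailures(replicas: number): number {
  return replicas - quorumSize(replicas);
}
```

For 3 replicas the quorum is 2, so 1 failure is tolerated; for 5 replicas the quorum is 3, so 2 failures are tolerated. A single-node deployment tolerates none, and an even count such as 4 tolerates no more than 3 does.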
**Schema (already scaffolded in Prisma):**
- Server — managed machines (hostname, mac, cloud, env, role, labels, status)
- Agent — connected agents (cert, enrollment, last seen)
- User — platform users (username, cert fingerprint)
- Role — RBAC roles with permissions
- Permission — allow/deny rules (action:cloud:env:server)
- UserRole — user-to-role bindings
- JoinToken — enrollment tokens (one-time, reusable, revocable)
- AuditLog — every action logged (user, session, action, resource, result, duration)
- PulumiRun — infrastructure-as-code execution records
- Cluster — managed k8s clusters (kubeconfig encrypted)
## 3. CLI Command Reference
### 3.1 Bastion (PXE Provisioning) — IMPLEMENTED
```bash
sudo labctl init bastion standalone start [--foreground] [--port 8080]
sudo labctl init bastion standalone stop
labctl init bastion standalone status
```
### 3.2 Provisioning — IMPLEMENTED
```bash
labctl provision list
labctl provision install <mac> <hostname> --role worker|infra
labctl provision reprovision <mac> <hostname> --role worker|infra
labctl provision forget <mac>
```
### 3.3 Server Management — TO BUILD
```bash
labctl get servers [--env NAME] [--cloud NAME] [--label KEY=VALUE]
labctl describe server/<name>
```
### 3.4 Remote Execution — TO BUILD
```bash
labctl exec server/<name> -- <command>
labctl exec server/<name> -it -- bash # interactive TTY
labctl exec server/<name> --timeout 30s -- cmd
```
### 3.5 Kubernetes Proxy — TO BUILD
```bash
labctl kubectl --cluster <name> <kubectl-args>
labctl clusters add <name> --kubeconfig <path>
labctl clusters list
labctl clusters remove <name>
```
### 3.6 Logs — TO BUILD
```bash
# Server logs (journalctl passthrough, no DB in hot path)
labctl logs server/<name> # all journal
labctl logs server/<name> -f # follow (live WebSocket relay)
labctl logs server/<name> -n 100 # last 100 lines
labctl logs server/<name> -u k3s # specific unit
labctl logs server/<name> -u sshd --since "1h ago"
labctl logs server/<name> -k # kernel
labctl logs server/<name> -p err # errors only
labctl logs server/<name> --file /var/log/nginx/error.log
# App logs (k8s pod logs)
labctl logs app/<name> [-f] [--container NAME]
# Pulumi execution logs
labctl logs pulumi/<run-id> [-f]
# Bastion logs
labctl logs bastion/<env> [--mac MAC]
# Agent daemon logs
labctl logs agent/<server>
# Audit logs (from CockroachDB)
labctl logs audit [--user NAME] [--action ACTION] [--since TIME]
labctl logs audit/<user-date-sessionid> # specific session
```
Log architecture: agent runs journalctl/tail with user-provided flags, streams stdout over WebSocket to labd, labd relays to CLI. No database in the hot path. Future: Grafana Loki integration for cold storage.
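One way the agent could translate the user-provided flags into a `journalctl` invocation is sketched below. The `ServerLogOptions` shape and the function name are assumptions; the emitted options are the long forms of the standard journalctl flags shown in the command list. The sketch builds an argv array and never a shell string, so flag values cannot inject extra commands.

```typescript
// Hypothetical option shape mirroring the `labctl logs server/<name>` flags.
interface ServerLogOptions {
  follow?: boolean;  // -f
  lines?: number;    // -n
  unit?: string;     // -u
  since?: string;    // --since
  kernel?: boolean;  // -k
  priority?: string; // -p
}

// Returns the argument vector to pass to journalctl (for example through a
// process-spawn API that takes the arguments as an array).
function buildJournalctlArgs(opts: ServerLogOptions): string[] {
  const args: string[] = ["--no-pager"];
  if (opts.follow) args.push("--follow");
  if (opts.lines !== undefined) {
    if (!Number.isInteger(opts.lines) || opts.lines < 0) {
      throw new Error(`invalid line count: ${opts.lines}`);
    }
    args.push(`--lines=${opts.lines}`);
  }
  // The --option=value form keeps a value that starts with "-" from being
  // parsed as a separate option.
  if (opts.unit !== undefined) args.push(`--unit=${opts.unit}`);
  if (opts.since !== undefined) args.push(`--since=${opts.since}`);
  if (opts.kernel) args.push("--dmesg");
  if (opts.priority !== undefined) args.push(`--priority=${opts.priority}`);
  return args;
}
```

For `labctl logs server/<name> -u sshd --since "1h ago"` this would yield `--no-pager --unit=sshd --since=1h ago` as three separate arguments, whose stdout the agent then streams over the WebSocket.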
### 3.7 Apps (Pulumi Charts, replacing Helm) — TO BUILD
```bash
labctl apps list
labctl apps install <name> [--set key=value] [-f values.yaml]
labctl apps status <name>
labctl apps upgrade <name>
labctl apps history <name>
labctl apps rollback <name> <version>
labctl apps uninstall <name>
```
### 3.8 Infrastructure as Code — TO BUILD
```bash
labctl apply -f <file.ts> --env <env>
labctl plan -f <file.ts> --env <env>
labctl destroy -f <file.ts> --env <env>
```
### 3.9 RBAC — TO BUILD
```bash
labctl get roles
labctl get users
labctl create role <name> --allow "action:cloud:env:server"
labctl create role <name> --deny "destroy:*:*:*"
labctl bind role <role> --user <user>
labctl unbind role <role> --user <user>
labctl get permissions
```
### 3.10 Environments and Clouds — TO BUILD
```bash
labctl get environments
labctl get clouds
labctl create environment <name> --cloud <cloud>
```
## 4. Partition Layout
### Worker Role
```
/boot/efi           600MB   EFI
/boot               3GB     ext4
── LVM VG: labvg ──
swap                27GB
/                   33GB    xfs
/var                100GB   xfs
/var/log            10GB    xfs
/home               10GB    xfs     ← preserved on reprovision
/srv                20GB    xfs     ← preserved on reprovision
/var/lib/longhorn   rest    xfs     ← preserved (Longhorn PVC storage)
/tmp                4GB     tmpfs
```
### Infra Role
```
/boot/efi           600MB   EFI
/boot               3GB     ext4
── LVM VG: labvg ──
swap                27GB
/                   33GB    xfs
/var                100GB   xfs
/var/log            10GB    xfs
/home               10GB    xfs     ← preserved on reprovision
/srv                20GB    xfs     ← preserved on reprovision
/var/lib/rancher    20GB    xfs     ← preserved (k3s etcd data)
/tmp                4GB     tmpfs
```
## 5. Module System
Configuration modules define desired state. Three tiers:
1. **Core modules** (this repo, `modules/`): k3s-server, k3s-agent, labd, lab-agent, bastion
2. **Official modules** (separate repos): monitoring, cilium, DNS
3. **Custom modules** (user repos): pulled by git URL
Module structure:
```
module.yaml # name, version, targets (roles/labels), deps
src/index.ts # entry point
src/install.ts # installation logic
src/configure.ts # configuration logic
src/health.ts # health check
tests/ # vitest tests (mandatory)
```
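The PRD names the module files but not their exports, so the following is only a sketch of one contract they could share. The `LabModule` interface, `applyModule`, and the ordering install, configure, health are all assumptions made for illustration.

```typescript
// Hypothetical contract; the PRD does not define these signatures.
interface HealthResult {
  healthy: boolean;
  detail: string;
}

interface LabModule {
  moduleName: string;
  install(): Promise<void>;
  configure(): Promise<void>;
  health(): Promise<HealthResult>;
}

// Runs install, then configure, then the health check, and returns its result.
// Any step that throws aborts the sequence and rejects the returned promise.
async function applyModule(mod: LabModule): Promise<HealthResult> {
  await mod.install();
  await mod.configure();
  return mod.health();
}

// In-memory fake used to exercise the sequence without touching a machine.
const appliedSteps: string[] = [];
const fakeModule: LabModule = {
  moduleName: "fake-k3s-agent",
  install: async () => { appliedSteps.push("install"); },
  configure: async () => { appliedSteps.push("configure"); },
  health: async () => {
    appliedSteps.push("health");
    return { healthy: true, detail: "fake module is always healthy" };
  },
};
```

A fake like `fakeModule` is also a convenient shape for the mandatory vitest tests, since the sequencing logic can be checked without a real server.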
## 6. Testing Strategy
### 6.1 Testing Pyramid
```
Unit Tests         → pure logic, milliseconds, every commit
Smoke Tests        → containers (podman-compose), minutes, every commit
Integration Tests  → VMs (libvirt), 10-15 min, PRs
E2E Tests          → real hardware/cloud, 20-30 min, pre-release
```
### 6.2 Smoke Test Stack (podman-compose)
```yaml
services:
cockroachdb:
image: cockroachdb/cockroach:latest-v24.3
labd:
build: .
depends_on: [cockroachdb]
agent-1:
build: ./agent
depends_on: [labd]
agent-2:
build: ./agent
depends_on: [labd]
```
Tests: agent enrollment, certificate issuance, heartbeat, exec, logs, RBAC deny/allow.
### 6.3 Security Tests (RBAC)
- Deny exec without permission
- Deny cross-environment access
- Deny rules override allow rules
- Cannot escalate own permissions
- Audit logs all denied attempts
- Certificate-based auth cannot be spoofed
- Join tokens cannot be reused (one-time)
- Expired tokens rejected
### 6.4 Ephemeral Test Environments
```bash
labctl test smoke # podman-compose
labctl test integration # libvirt VMs
labctl env create pr-123 --cloud containers # CI ephemeral
labctl env create pr-123 --cloud aws # cloud ephemeral (future)
```
### 6.5 Health Gates for Deployment
Before promoting to production, ALL must pass:
- labd API responds
- Expected number of agents connected
- k3s nodes Ready
- Certificates valid, with more than 30 days until expiry
- RBAC smoke test passes
- No error logs in last 5 minutes
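The "ALL must pass" rule can be sketched as a list of gate results reduced with a logical AND. The names below are hypothetical, and the certificate gate assumes the 30-day figure means remaining lifetime before expiry.

```typescript
interface GateResult {
  gate: string;
  passed: boolean;
  detail: string;
}

const MS_PER_DAY = 24 * 60 * 60 * 1000;

// Passes only when the certificate has more than `minDays` whole days left.
function certificateGate(notAfter: Date, now: Date, minDays = 30): GateResult {
  const daysLeft = Math.floor((notAfter.getTime() - now.getTime()) / MS_PER_DAY);
  return {
    gate: "certificates",
    passed: daysLeft > minDays,
    detail: `${daysLeft} days until expiry`,
  };
}

// Passes only when at least the expected number of agents is connected.
function agentGate(connected: number, expected: number): GateResult {
  return {
    gate: "agents",
    passed: connected >= expected,
    detail: `${connected}/${expected} agents connected`,
  };
}

// Promotion requires every gate to pass; an empty result list never promotes.
function canPromote(results: GateResult[]): boolean {
  return results.length > 0 && results.every((r) => r.passed);
}
```

Keeping each gate as a plain function that returns a `GateResult` makes it straightforward to report which gate blocked a promotion instead of only a yes or no.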
## 7. Cloud/Environment Model
```
Cloud: baremetal
└── Environment: lab
    ├── Server: labmaster.ad.itaz.eu (infra, labels={k3s=server})
    └── Server: ser9.ad.itaz.eu (worker, labels={k3s=agent})
Cloud: aws (future)
└── Environment: production
    ├── Server: i-abc123 (from ASG web-servers)
    └── Server: i-def456 (from ASG web-servers)
```
Each bastion creates an environment under the baremetal cloud. AWS autoscaling groups create environments under the aws cloud.
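The hierarchy above maps naturally onto a flat server record plus a filter, which is roughly what `labctl get servers [--env NAME] [--cloud NAME] [--label KEY=VALUE]` would need. This is a sketch only: the `ManagedServer` fields are assumptions loosely based on the Server model, and the `asg` label is invented for the example.

```typescript
// Hypothetical in-memory view of a server; not the Prisma model itself.
interface ManagedServer {
  hostname: string;
  cloud: string;
  environment: string;
  role?: string;
  labels: Record<string, string>;
}

interface ServerFilter {
  cloud?: string;
  env?: string;
  label?: string; // "KEY=VALUE"
}

// Splits "KEY=VALUE" at the first "=" and rejects input without a key.
function parseLabel(label: string): [string, string] {
  const eq = label.indexOf("=");
  if (eq <= 0) throw new Error(`label filter must be KEY=VALUE, got "${label}"`);
  return [label.slice(0, eq), label.slice(eq + 1)];
}

// Every filter that is set must match; unset filters match everything.
function filterServers(servers: ManagedServer[], filter: ServerFilter): ManagedServer[] {
  const wanted = filter.label === undefined ? undefined : parseLabel(filter.label);
  return servers.filter((s) => {
    if (filter.cloud !== undefined && s.cloud !== filter.cloud) return false;
    if (filter.env !== undefined && s.environment !== filter.env) return false;
    if (wanted !== undefined && s.labels[wanted[0]] !== wanted[1]) return false;
    return true;
  });
}

// The example inventory from the tree above.
const inventory: ManagedServer[] = [
  { hostname: "labmaster.ad.itaz.eu", cloud: "baremetal", environment: "lab", role: "infra", labels: { k3s: "server" } },
  { hostname: "ser9.ad.itaz.eu", cloud: "baremetal", environment: "lab", role: "worker", labels: { k3s: "agent" } },
  { hostname: "i-abc123", cloud: "aws", environment: "production", labels: { asg: "web-servers" } },
  { hostname: "i-def456", cloud: "aws", environment: "production", labels: { asg: "web-servers" } },
];
```

In labd the same filter would more likely be expressed as a database query; the in-memory version only illustrates the matching rules.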
## 8. App Model (Pulumi Charts)
Each app is a Pulumi TypeScript program:
```
app.yaml # name, version, inputs schema, required permissions
src/index.ts # Pulumi program
values.yaml # defaults
tests/ # vitest tests
```
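`labctl apps install` accepts `--set key=value` and `-f values.yaml` on top of the chart's own `values.yaml` defaults, so the inputs have to be merged somehow. The sketch below assumes the Helm-style precedence (defaults, then the `-f` file, then `--set`) and flat string values; the PRD does not specify either, and the function names are hypothetical.

```typescript
// Flat string values only; nested keys are out of scope for this sketch.
type AppValues = Record<string, string>;

// Parses repeated --set flags, splitting each at the first "=".
function parseSetFlags(flags: string[]): AppValues {
  const parsed: AppValues = {};
  for (const flag of flags) {
    const eq = flag.indexOf("=");
    if (eq <= 0) throw new Error(`--set expects key=value, got "${flag}"`);
    parsed[flag.slice(0, eq)] = flag.slice(eq + 1);
  }
  return parsed;
}

// Later sources win: chart defaults < values file < --set flags (assumed order).
function resolveValues(defaults: AppValues, fileValues: AppValues, setFlags: string[]): AppValues {
  return { ...defaults, ...fileValues, ...parseSetFlags(setFlags) };
}
```

The resolved object would then be validated against the inputs schema declared in `app.yaml` before being handed to the Pulumi program.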
First apps to build:
- bastion — PXE provisioning (wrap existing code)
- labd — master daemon (self-deployment)
- cockroachdb — database
- cilium — CNI
## 9. Implementation Phases
### Phase 1: Foundation (PARTIALLY DONE)
- [x] PXE bastion (discover, install, reprovision)
- [x] CLI structure (labctl init/provision)
- [x] labd scaffold (Fastify + CockroachDB/Prisma schema)
- [x] Multi-arch builds, packaging, CI/CD
- [ ] Certificate Authority in labd
- [ ] lab-agent skeleton (connect, heartbeat, enrollment)
- [ ] Agent enrollment via join tokens
- [ ] RBAC engine
- [ ] labctl exec (remote execution)
- [ ] labctl logs (resource-scoped streaming)
- [ ] labctl get servers (with filters)
- [ ] Smoke test stack (podman-compose)
### Phase 2: Deployment
- [ ] Reprovision labmaster as labmaster.ad.itaz.eu
- [ ] Deploy k3s with Cilium CNI
- [ ] Deploy CockroachDB on k3s
- [ ] Deploy labd on k3s
- [ ] Deploy bastion as managed app
- [ ] Auto-enroll agents during PXE provision
### Phase 3: Infrastructure as Code
- [ ] Module system
- [ ] Pulumi charts (replacing Helm)
- [ ] labctl apps install/upgrade/rollback
- [ ] labctl apply -f (Pulumi execution)
- [ ] kubectl proxy (audited)
- [ ] Kubeconfig store (encrypted)
### Phase 4: Multi-Cloud
- [ ] AWS provider (Pulumi)
- [ ] Reusable join tokens for ASGs
- [ ] Cilium Cluster Mesh
- [ ] Ephemeral test environments
- [ ] Grafana Loki for cold logs
## 10. Technology Stack
| Component | Technology | Notes |
|-----------|-----------|-------|
| Language | TypeScript (ESM) | Same for CLI, daemon, agents, IaC |
| CLI | Commander.js | Matches mcpctl patterns |
| HTTP Server | Fastify + WebSocket | labd and bastion |
| Database | CockroachDB | PostgreSQL compatible, Prisma ORM |
| ORM | Prisma | Reuse mcpctl patterns |
| IaC | Pulumi (TypeScript) | Replaces Helm and Puppet |
| k8s CNI | Cilium | eBPF, WireGuard, network policies |
| Auth | mTLS (built-in CA) | Certificate-based, no SSH keys |
| Packaging | nfpm (RPM/DEB) | bun compile for standalone binary |
| Containers | Podman + podman-compose | No Docker dependency |
| CI/CD | Gitea Actions | Self-hosted on mysources.co.uk |
| Testing | Vitest | Unit + smoke + integration |
| Registry | Gitea packages | RPM, DEB, container images |
## 11. Lessons from mcpctl
The mcpctl project (../mcpctl/) established patterns reused here:
**Project structure:** pnpm monorepo with workspace packages (shared, cli, daemon). Each package has own package.json, tsconfig.json, vitest.config.ts.
**CLI patterns:** Commander.js with factory functions (createXxxCommand). Global options (--project → --env/--cloud). Resource CRUD (get, describe, delete, create, apply).
**Server patterns:** Fastify with route registration functions. Services layer with repository pattern. Middleware for auth. Health endpoints.
**Database:** Prisma ORM with PostgreSQL (now CockroachDB, wire-compatible). Migration-first schema. Seed data for initial setup.
**RBAC:** Role-based with permission strings. Middleware checks on every request. Audit logging in middleware.
**Testing:** Vitest with separate configs for unit vs smoke. Smoke tests with real database and services. Security tests for RBAC.
**CI/CD:** Gitea Actions with lint→typecheck→test→build→publish pipeline. nfpm for RPM/DEB. Bun compile for standalone binaries. Podman for container images.
**Deployment:** Docker/Podman compose for dev stack. Portainer API for production deploy (we'll use k3s instead). systemd for local daemons.
**Completions:** Generated from Commander tree. Bash + Fish. --write and --check modes. Included in packages.
**Key learnings applied:**
- Start with proper monorepo structure (not flat scripts)
- Type safety across packages via workspace references
- Test-driven (unit tests before features)
- CI from the start (not retrofitted)
- RBAC and audit from the start (not bolted on)
- Database-first design (schema defines the domain)
## 12. Gitea Registry
**Registry:** mysources.co.uk (self-hosted Gitea at 10.0.0.194)
**Token:** stored at ~/.gitea-token, env var PACKAGES_TOKEN
**Packages:** RPM and DEB published to Gitea packages API
**Container images:** pushed to Gitea container registry
**API pattern:** Same as mcpctl publish scripts (check existing, delete, re-upload, link to repo)