diff --git a/.taskmaster/docs/prd.md b/.taskmaster/docs/prd.md
new file mode 100644
index 0000000..0110694
--- /dev/null
+++ b/.taskmaster/docs/prd.md
@@ -0,0 +1,452 @@
+# labctl — Infrastructure Management Platform
+
+## Product Requirements Document
+
+## 1. Overview
+
+labctl is a unified infrastructure management platform for bare-metal servers, Kubernetes clusters, and cloud resources. It replaces Puppet with a modern, TypeScript-native system using Pulumi for infrastructure as code.
+
+### 1.1 Core Principles
+- **Single CLI** (`labctl`) for all infrastructure operations
+- **mTLS everywhere** — built-in Certificate Authority, no SSH key management
+- **RBAC from day one** — deny by default, audit everything
+- **Multi-cloud** — bare metal now, AWS later, extensible to any cloud
+- **Test infrastructure like code** — ephemeral environments, smoke tests, security tests
+- **Pulumi over Helm** — TypeScript charts, typed, testable, no YAML templating
+
+### 1.2 Current State (completed)
+- PXE bastion for bare-metal provisioning (discover, install, reprovision)
+- CLI with subcommands: `labctl init bastion`, `labctl provision`
+- LVM partitioning with reprovision data preservation (/home, /srv, /var/lib/longhorn, /var/lib/rancher)
+- Worker role (k3s agent + Longhorn) and infra role (k3s server + etcd)
+- 32 unit tests, VM smoke tests verified on real hardware
+- Multi-arch builds (x86_64 + arm64), RPM/DEB packaging, Gitea CI/CD
+- labd scaffold with CockroachDB Prisma schema (Server, Agent, User, Role, Permission, AuditLog, JoinToken, Cluster, PulumiRun)
+
+### 1.3 Hardware
+- labmaster (puppet.ad.itaz.eu / 78:55:36:08:35:14): MinisForum SER9, AMD Ryzen 7 255, 16 cores, 27GB RAM, 1TB NVMe, infra role
+- Future: additional bare-metal worker nodes, AWS EC2 instances
+
+## 2. Architecture
+
+### 2.1 Components
+
+```
+labctl CLI → labd (master) → lab-agent (on every server)
+                  ↓
+             CockroachDB
+```
+
+**labctl** — CLI binary installed on developer workstations. Compiled with bun to standalone binary. Distributed as RPM/DEB/binary.
+
+**labd** — Master daemon running as k8s Deployment on labmaster's k3s cluster. Stateless (all state in CockroachDB). Multiple instances behind k8s Service for HA. Manages: CA, RBAC, agent registry, Pulumi executor, kubectl proxy, app deployments, log relay.
+
+**lab-agent** — Lightweight daemon on every managed machine. Connects to labd via mTLS WebSocket. Handles: heartbeat, command execution, log streaming, module application. Compiled to standalone binary with bun. Installed via systemd service.
+
+**CockroachDB** — Distributed SQL database. PostgreSQL wire-compatible (Prisma works unchanged). Single node to start, multi-node for HA. Stores: server state, RBAC, audit logs, certificates, kubeconfigs (encrypted), Pulumi state.
+
+**Bastion** — PXE provisioning server. Runs as k8s pod with hostNetwork (needs DHCP/TFTP). Managed by labd as an "app". Multiple bastions for multiple sites.
+
+### 2.2 Network Architecture
+
+**Cilium** as k8s CNI (replacing default flannel):
+- eBPF-based pod networking
+- Built-in WireGuard encryption between nodes
+- Network policies (ties into RBAC)
+- Hubble for observability
+- Future: Cluster Mesh for multi-site transparent networking
+
+No Tailscale dependency — Cilium handles node-to-node encryption. Agents connect to labd over standard TCP/TLS.
+
+### 2.3 Authentication
+
+**mTLS with built-in Certificate Authority:**
+1. labd generates root CA on first start (stored encrypted in CockroachDB)
+2. Agents enroll with join token → receive signed certificate
+3. CLI users authenticate with client certificates (or SSH key-based initial auth)
+4. All communication authenticated via mutual TLS
+5. Certificate rotation and revocation supported
+
+**Join tokens:**
+- One-time tokens: for individual bare-metal servers (generated during PXE provision, embedded in kickstart)
+- Reusable tokens: for autoscaling groups (AWS ASG instances share a token)
+- Tokens can be revoked, have optional expiry
+
+### 2.4 RBAC Model
+
+Inspired by mcpctl's RBAC (src/mcpd/src/services/, middleware/auth). Hierarchical permissions:
+
+```
+action:cloud:environment:server
+
+Examples:
+  read:*:*:*              — read everything
+  exec:baremetal:lab:*    — exec on any lab bare-metal server
+  kubectl:*:*:*           — kubectl proxy on any cluster
+  *:baremetal:lab:puppet  — full access to puppet server only
+  manage:*:*:*            — manage apps, clusters, tokens
+  admin:*:*:*             — full admin (create users, roles)
+```
+
+**Resources:** servers, environments, clouds, modules, roles, users, clusters, apps, pulumi-stacks
+**Actions:** read, exec, apply, destroy, manage, admin, kubectl
+**Deny rules:** explicit deny overrides any allow (like AWS IAM)
+
+Prisma models: Role, Permission (allow/deny), UserRole binding.
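
The permission strings and the deny-overrides-allow rule above can be sketched as a small matcher. This is a minimal sketch under stated assumptions, not labd's actual engine: the `Permission` shape and the `matches` / `isAllowed` helpers are hypothetical names, `*` is assumed to match exactly one segment, and actions are treated as independent (no implied hierarchy such as `admin` granting `exec`).

```typescript
// Hypothetical types and helpers; the PRD only fixes the string format
// "action:cloud:environment:server" and the deny-overrides-allow rule.
type Effect = "allow" | "deny";

interface Permission {
  effect: Effect;
  pattern: string; // e.g. "exec:baremetal:lab:*"
}

// True when every segment of the pattern is "*" or equals the request segment.
function matches(pattern: string, request: string): boolean {
  const p = pattern.split(":");
  const r = request.split(":");
  if (p.length !== 4 || r.length !== 4) return false;
  return p.every((segment, i) => segment === "*" || segment === r[i]);
}

// Deny by default: a request passes only if at least one allow rule matches
// and no deny rule matches.
function isAllowed(permissions: Permission[], request: string): boolean {
  const hits = permissions.filter((perm) => matches(perm.pattern, request));
  if (hits.some((perm) => perm.effect === "deny")) return false;
  return hits.some((perm) => perm.effect === "allow");
}

const examplePermissions: Permission[] = [
  { effect: "allow", pattern: "exec:baremetal:lab:*" },
  { effect: "allow", pattern: "*:baremetal:lab:puppet" },
  { effect: "deny", pattern: "destroy:*:*:*" },
];

console.log(isAllowed(examplePermissions, "exec:baremetal:lab:puppet"));
console.log(isAllowed(examplePermissions, "destroy:baremetal:lab:puppet"));
```

In this sketch the second call is rejected even though `*:baremetal:lab:puppet` matches, because the explicit deny wins; a request with no matching rule at all is also rejected.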
+
+### 2.5 Database
+
+**CockroachDB** chosen over PostgreSQL and Cassandra:
+- PostgreSQL wire-compatible — Prisma works, mcpctl patterns reusable
+- Multi-master replication — any node accepts reads AND writes
+- Strong consistency (not eventual like Cassandra)
+- Survives node failures (3 nodes = 1 failure, 5 nodes = 2)
+- Auto-rebalancing when adding nodes
+- Start single-node, scale to multi-node with zero code changes (just add nodes)
+
+**Schema (already scaffolded in Prisma):**
+- Server — managed machines (hostname, mac, cloud, env, role, labels, status)
+- Agent — connected agents (cert, enrollment, last seen)
+- User — platform users (username, cert fingerprint)
+- Role — RBAC roles with permissions
+- Permission — allow/deny rules (action:cloud:env:server)
+- UserRole — user-to-role bindings
+- JoinToken — enrollment tokens (one-time, reusable, revocable)
+- AuditLog — every action logged (user, session, action, resource, result, duration)
+- PulumiRun — infrastructure-as-code execution records
+- Cluster — managed k8s clusters (kubeconfig encrypted)
+
+## 3. CLI Command Reference
+
+### 3.1 Bastion (PXE Provisioning) — IMPLEMENTED
+```bash
+sudo labctl init bastion standalone start [--foreground] [--port 8080]
+sudo labctl init bastion standalone stop
+labctl init bastion standalone status
+```
+
+### 3.2 Provisioning — IMPLEMENTED
+```bash
+labctl provision list
+labctl provision install --role worker|infra
+labctl provision reprovision --role worker|infra
+labctl provision forget
+```
+
+### 3.3 Server Management — TO BUILD
+```bash
+labctl get servers [--env NAME] [--cloud NAME] [--label KEY=VALUE]
+labctl describe server/
+```
+
+### 3.4 Remote Execution — TO BUILD
+```bash
+labctl exec server/ --
+labctl exec server/ -it -- bash          # interactive TTY
+labctl exec server/ --timeout 30s -- cmd
+```
+
+### 3.5 Kubernetes Proxy — TO BUILD
+```bash
+labctl kubectl --cluster
+labctl clusters add --kubeconfig
+labctl clusters list
+labctl clusters remove
+```
+
+### 3.6 Logs — TO BUILD
+```bash
+# Server logs (journalctl passthrough, no DB in hot path)
+labctl logs server/                      # all journal
+labctl logs server/ -f                   # follow (live WebSocket relay)
+labctl logs server/ -n 100               # last 100 lines
+labctl logs server/ -u k3s               # specific unit
+labctl logs server/ -u sshd --since "1h ago"
+labctl logs server/ -k                   # kernel
+labctl logs server/ -p err               # errors only
+labctl logs server/ --file /var/log/nginx/error.log
+
+# App logs (k8s pod logs)
+labctl logs app/ [-f] [--container NAME]
+
+# Pulumi execution logs
+labctl logs pulumi/ [-f]
+
+# Bastion logs
+labctl logs bastion/ [--mac MAC]
+
+# Agent daemon logs
+labctl logs agent/
+
+# Audit logs (from CockroachDB)
+labctl logs audit [--user NAME] [--action ACTION] [--since TIME]
+labctl logs audit/                       # specific session
+```
+
+Log architecture: agent runs journalctl/tail with user-provided flags, streams stdout over WebSocket to labd, labd relays to CLI. No database in the hot path. Future: Grafana Loki integration for cold storage.
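
As a rough illustration of the agent side of this flow, the sketch below maps the server-log flags from section 3.6 onto a journalctl argument vector. The `LogQuery` type and `buildJournalctlArgs` function are hypothetical; only the journalctl options themselves (`-f`, `-n`, `-u`, `--since`, `-k`, `-p`, `--no-pager`) are real. It assumes the agent spawns journalctl with an argument array rather than a shell string, so user-provided values are never interpreted by a shell.

```typescript
// Hypothetical request shape sent from labd to the agent for server logs.
interface LogQuery {
  follow?: boolean;   // -f
  lines?: number;     // -n
  unit?: string;      // -u
  since?: string;     // --since
  kernel?: boolean;   // -k
  priority?: string;  // -p
}

// Builds an argv array for journalctl. Each value is passed as its own
// argument, so quoting and shell metacharacters are not a concern.
function buildJournalctlArgs(query: LogQuery): string[] {
  const args: string[] = ["--no-pager"];
  if (query.follow) args.push("-f");
  if (query.lines !== undefined) {
    if (!Number.isInteger(query.lines) || query.lines < 0) {
      throw new Error("lines must be a non-negative integer");
    }
    args.push("-n", String(query.lines));
  }
  if (query.unit) args.push("-u", query.unit);
  if (query.since) args.push("--since", query.since);
  if (query.kernel) args.push("-k");
  if (query.priority) args.push("-p", query.priority);
  return args;
}

// Corresponds to: labctl logs server/ -u sshd --since "1h ago"
console.log(buildJournalctlArgs({ unit: "sshd", since: "1h ago" }));
```

The agent would then pass the resulting array to a process-spawning API and forward stdout chunks over the WebSocket as they arrive; that part is left out here because the transport details are not specified in this document.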
+
+### 3.7 Apps (Pulumi Charts, replacing Helm) — TO BUILD
+```bash
+labctl apps list
+labctl apps install [--set key=value] [-f values.yaml]
+labctl apps status
+labctl apps upgrade
+labctl apps history
+labctl apps rollback
+labctl apps uninstall
+```
+
+### 3.8 Infrastructure as Code — TO BUILD
+```bash
+labctl apply -f --env
+labctl plan -f --env
+labctl destroy -f --env
+```
+
+### 3.9 RBAC — TO BUILD
+```bash
+labctl get roles
+labctl get users
+labctl create role --allow "action:cloud:env:server"
+labctl create role --deny "destroy:*:*:*"
+labctl bind role --user
+labctl unbind role --user
+labctl get permissions
+```
+
+### 3.10 Environments and Clouds — TO BUILD
+```bash
+labctl get environments
+labctl get clouds
+labctl create environment --cloud
+```
+
+## 4. Partition Layout
+
+### Worker Role
+```
+/boot/efi            600MB  EFI
+/boot                3GB    ext4
+── LVM VG: labvg ──
+  swap               27GB
+  /                  33GB   xfs
+  /var               100GB  xfs
+  /var/log           10GB   xfs
+  /home              10GB   xfs  ← preserved on reprovision
+  /srv               20GB   xfs  ← preserved on reprovision
+  /var/lib/longhorn  rest   xfs  ← preserved (Longhorn PVC storage)
+  /tmp               tmpfs  4GB
+```
+
+### Infra Role
+```
+/boot/efi            600MB  EFI
+/boot                3GB    ext4
+── LVM VG: labvg ──
+  swap               27GB
+  /                  33GB   xfs
+  /var               100GB  xfs
+  /var/log           10GB   xfs
+  /home              10GB   xfs  ← preserved on reprovision
+  /srv               20GB   xfs  ← preserved on reprovision
+  /var/lib/rancher   20GB   xfs  ← preserved (k3s etcd data)
+  /tmp               tmpfs  4GB
+```
+
+## 5. Module System
+
+Configuration modules define desired state. Three tiers:
+1. **Core modules** (this repo, `modules/`): k3s-server, k3s-agent, labd, lab-agent, bastion
+2. **Official modules** (separate repos): monitoring, cilium, DNS
+3. **Custom modules** (user repos): pulled by git URL
+
+Module structure:
+```
+module.yaml        # name, version, targets (roles/labels), deps
+src/index.ts       # entry point
+src/install.ts     # installation logic
+src/configure.ts   # configuration logic
+src/health.ts      # health check
+tests/             # vitest tests (mandatory)
+```
+
+## 6. Testing Strategy
+
+### 6.1 Testing Pyramid
+```
+Unit Tests        → pure logic, milliseconds, every commit
+Smoke Tests       → containers (podman-compose), minutes, every commit
+Integration Tests → VMs (libvirt), 10-15 min, PRs
+E2E Tests         → real hardware/cloud, 20-30 min, pre-release
+```
+
+### 6.2 Smoke Test Stack (podman-compose)
+```yaml
+services:
+  cockroachdb:
+    image: cockroachdb/cockroach:latest-v24.3
+  labd:
+    build: .
+    depends_on: [cockroachdb]
+  agent-1:
+    build: ./agent
+    depends_on: [labd]
+  agent-2:
+    build: ./agent
+    depends_on: [labd]
+```
+Tests: agent enrollment, certificate issuance, heartbeat, exec, logs, RBAC deny/allow.
+
+### 6.3 Security Tests (RBAC)
+- Deny exec without permission
+- Deny cross-environment access
+- Deny rules override allow rules
+- Cannot escalate own permissions
+- Audit logs all denied attempts
+- Certificate-based auth cannot be spoofed
+- Join tokens cannot be reused (one-time)
+- Expired tokens rejected
+
+### 6.4 Ephemeral Test Environments
+```bash
+labctl test smoke                            # podman-compose
+labctl test integration                      # libvirt VMs
+labctl env create pr-123 --cloud containers  # CI ephemeral
+labctl env create pr-123 --cloud aws         # cloud ephemeral (future)
+```
+
+### 6.5 Health Gates for Deployment
+Before promoting to production, ALL must pass:
+- labd API responds
+- Expected number of agents connected
+- k3s nodes Ready
+- Certificates valid (>30 days)
+- RBAC smoke test passes
+- No error logs in last 5 minutes
+
+## 7. Cloud/Environment Model
+
+```
+Cloud: baremetal
+  └── Environment: lab
+        ├── Server: labmaster.ad.itaz.eu (infra, labels={k3s=server})
+        └── Server: ser9.ad.itaz.eu (worker, labels={k3s=agent})
+
+Cloud: aws (future)
+  └── Environment: production
+        ├── Server: i-abc123 (from ASG web-servers)
+        └── Server: i-def456 (from ASG web-servers)
+```
+
+Each bastion creates an environment under baremetal cloud. AWS autoscaling groups create environments under aws cloud.
+
+## 8. App Model (Pulumi Charts)
+
+Each app is a Pulumi TypeScript program:
+```
+app.yaml        # name, version, inputs schema, required permissions
+src/index.ts    # Pulumi program
+values.yaml     # defaults
+tests/          # vitest tests
+```
+
+First apps to build:
+- bastion — PXE provisioning (wrap existing code)
+- labd — master daemon (self-deployment)
+- cockroachdb — database
+- cilium — CNI
+
+## 9. Implementation Phases
+
+### Phase 1: Foundation (PARTIALLY DONE)
+- [x] PXE bastion (discover, install, reprovision)
+- [x] CLI structure (labctl init/provision)
+- [x] labd scaffold (Fastify + CockroachDB/Prisma schema)
+- [x] Multi-arch builds, packaging, CI/CD
+- [ ] Certificate Authority in labd
+- [ ] lab-agent skeleton (connect, heartbeat, enrollment)
+- [ ] Agent enrollment via join tokens
+- [ ] RBAC engine
+- [ ] labctl exec (remote execution)
+- [ ] labctl logs (resource-scoped streaming)
+- [ ] labctl get servers (with filters)
+- [ ] Smoke test stack (podman-compose)
+
+### Phase 2: Deployment
+- [ ] Reprovision labmaster as labmaster.ad.itaz.eu
+- [ ] Deploy k3s with Cilium CNI
+- [ ] Deploy CockroachDB on k3s
+- [ ] Deploy labd on k3s
+- [ ] Deploy bastion as managed app
+- [ ] Auto-enroll agents during PXE provision
+
+### Phase 3: Infrastructure as Code
+- [ ] Module system
+- [ ] Pulumi charts (replacing Helm)
+- [ ] labctl apps install/upgrade/rollback
+- [ ] labctl apply -f (Pulumi execution)
+- [ ] kubectl proxy (audited)
+- [ ] Kubeconfig store (encrypted)
+
+### Phase 4: Multi-Cloud
+- [ ] AWS provider (Pulumi)
+- [ ] Reusable join tokens for ASGs
+- [ ] Cilium Cluster Mesh
+- [ ] Ephemeral test environments
+- [ ] Grafana Loki for cold logs
+
+## 10. Technology Stack
+
+| Component | Technology | Notes |
+|-----------|-----------|-------|
+| Language | TypeScript (ESM) | Same for CLI, daemon, agents, IaC |
+| CLI | Commander.js | Matches mcpctl patterns |
+| HTTP Server | Fastify + WebSocket | labd and bastion |
+| Database | CockroachDB | PostgreSQL compatible, Prisma ORM |
+| ORM | Prisma | Reuse mcpctl patterns |
+| IaC | Pulumi (TypeScript) | Replaces Helm and Puppet |
+| k8s CNI | Cilium | eBPF, WireGuard, network policies |
+| Auth | mTLS (built-in CA) | Certificate-based, no SSH keys |
+| Packaging | nfpm (RPM/DEB) | bun compile for standalone binary |
+| Containers | Podman + podman-compose | No Docker dependency |
+| CI/CD | Gitea Actions | Self-hosted on mysources.co.uk |
+| Testing | Vitest | Unit + smoke + integration |
+| Registry | Gitea packages | RPM, DEB, container images |
+
+## 11. Lessons from mcpctl
+
+The mcpctl project (../mcpctl/) established patterns reused here:
+
+**Project structure:** pnpm monorepo with workspace packages (shared, cli, daemon). Each package has own package.json, tsconfig.json, vitest.config.ts.
+
+**CLI patterns:** Commander.js with factory functions (createXxxCommand). Global options (--project → --env/--cloud). Resource CRUD (get, describe, delete, create, apply).
+
+**Server patterns:** Fastify with route registration functions. Services layer with repository pattern. Middleware for auth. Health endpoints.
+
+**Database:** Prisma ORM with PostgreSQL (now CockroachDB, wire-compatible). Migration-first schema. Seed data for initial setup.
+
+**RBAC:** Role-based with permission strings. Middleware checks on every request. Audit logging in middleware.
+
+**Testing:** Vitest with separate configs for unit vs smoke. Smoke tests with real database and services. Security tests for RBAC.
+
+**CI/CD:** Gitea Actions with lint→typecheck→test→build→publish pipeline. nfpm for RPM/DEB. Bun compile for standalone binaries. Podman for container images.
+
+**Deployment:** Docker/Podman compose for dev stack. Portainer API for production deploy (we'll use k3s instead). systemd for local daemons.
+
+**Completions:** Generated from Commander tree. Bash + Fish. --write and --check modes. Included in packages.
+
+**Key learnings applied:**
+- Start with proper monorepo structure (not flat scripts)
+- Type safety across packages via workspace references
+- Test-driven (unit tests before features)
+- CI from the start (not retrofitted)
+- RBAC and audit from the start (not bolted on)
+- Database-first design (schema defines the domain)
+
+## 12. Gitea Registry
+
+**Registry:** mysources.co.uk (self-hosted Gitea at 10.0.0.194)
+**Token:** stored at ~/.gitea-token, env var PACKAGES_TOKEN
+**Packages:** RPM and DEB published to Gitea packages API
+**Container images:** pushed to Gitea container registry
+**API pattern:** Same as mcpctl publish scripts (check existing, delete, re-upload, link to repo)