fix: PXE boot debugging — bisect root cause, syslog logging, serial console #3
452
.taskmaster/docs/prd.md
Normal file
@@ -0,0 +1,452 @@
# labctl: Infrastructure Management Platform

## Product Requirements Document

## 1. Overview

labctl is a unified infrastructure management platform for bare-metal servers, Kubernetes clusters, and cloud resources. It replaces Puppet with a modern, TypeScript-native system that uses Pulumi for infrastructure as code.

### 1.1 Core Principles
- **Single CLI** (`labctl`) for all infrastructure operations
- **mTLS everywhere**: built-in Certificate Authority, no SSH key management
- **RBAC from day one**: deny by default, audit everything
- **Multi-cloud**: bare metal now, AWS later, extensible to any cloud
- **Test infrastructure like code**: ephemeral environments, smoke tests, security tests
- **Pulumi over Helm**: TypeScript charts, typed, testable, no YAML templating

### 1.2 Current State (completed)
- PXE bastion for bare-metal provisioning (discover, install, reprovision)
- CLI with subcommands: `labctl init bastion`, `labctl provision`
- LVM partitioning that preserves data on reprovision (/home, /srv, /var/lib/longhorn, /var/lib/rancher)
- Worker role (k3s agent + Longhorn) and infra role (k3s server + etcd)
- 32 unit tests; VM smoke tests verified on real hardware
- Multi-arch builds (x86_64 + arm64), RPM/DEB packaging, Gitea CI/CD
- labd scaffold with CockroachDB Prisma schema (Server, Agent, User, Role, Permission, AuditLog, JoinToken, Cluster, PulumiRun)

### 1.3 Hardware
- labmaster (puppet.ad.itaz.eu / 78:55:36:08:35:14): MinisForum SER9, AMD Ryzen 7 255, 16 cores, 27GB RAM, 1TB NVMe, infra role
- Future: additional bare-metal worker nodes, AWS EC2 instances

## 2. Architecture

### 2.1 Components

```
labctl CLI → labd (master) → lab-agent (on every server)
                  ↓
             CockroachDB
```

**labctl**: CLI binary installed on developer workstations. Compiled with bun into a standalone binary. Distributed as RPM, DEB, or plain binary.

**labd**: Master daemon running as a k8s Deployment on labmaster's k3s cluster. Stateless (all state lives in CockroachDB). Multiple instances sit behind a k8s Service for HA. Manages: CA, RBAC, agent registry, Pulumi executor, kubectl proxy, app deployments, log relay.

**lab-agent**: Lightweight daemon on every managed machine. Connects to labd via mTLS WebSocket. Handles: heartbeat, command execution, log streaming, module application. Compiled into a standalone binary with bun. Installed as a systemd service.

**CockroachDB**: Distributed SQL database. PostgreSQL wire-compatible (Prisma works unchanged). Single node to start, multi-node for HA. Stores: server state, RBAC, audit logs, certificates, kubeconfigs (encrypted), Pulumi state.

**Bastion**: PXE provisioning server. Runs as a k8s pod with hostNetwork (needs DHCP/TFTP). Managed by labd as an "app". Multiple bastions serve multiple sites.

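
As one illustration of the heartbeat duty listed for lab-agent, labd could flag agents whose last heartbeat is too old. This is a minimal sketch only: `AgentHeartbeat`, `findStaleAgents`, and the 30-second threshold are assumptions made for the example, not names or values defined by this PRD.

```typescript
// Hypothetical record of what labd remembers per agent; field names are assumptions.
interface AgentHeartbeat {
  hostname: string;
  lastSeenMs: number; // Unix epoch milliseconds of the most recent heartbeat
}

// Return the hostnames whose last heartbeat is older than maxAgeMs.
function findStaleAgents(
  agents: AgentHeartbeat[],
  nowMs: number,
  maxAgeMs: number = 30000, // assumed threshold
): string[] {
  return agents
    .filter((agent) => nowMs - agent.lastSeenMs > maxAgeMs)
    .map((agent) => agent.hostname);
}

const exampleAgents: AgentHeartbeat[] = [
  { hostname: "labmaster", lastSeenMs: 1000000 }, // 5 s ago: healthy
  { hostname: "ser9", lastSeenMs: 940000 },       // 65 s ago: stale
];
const staleAgents = findStaleAgents(exampleAgents, 1005000);
console.log(staleAgents);
```

Passing `nowMs` in explicitly, rather than calling `Date.now()` inside the function, keeps the check deterministic and easy to unit test.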
### 2.2 Network Architecture

**Cilium** as k8s CNI (replacing the default flannel):
- eBPF-based pod networking
- Built-in WireGuard encryption between nodes
- Network policies (ties into RBAC)
- Hubble for observability
- Future: Cluster Mesh for multi-site transparent networking

No Tailscale dependency: Cilium handles node-to-node encryption. Agents connect to labd over standard TCP/TLS.

### 2.3 Authentication

**mTLS with built-in Certificate Authority:**
1. labd generates a root CA on first start (stored encrypted in CockroachDB)
2. Agents enroll with a join token and receive a signed certificate
3. CLI users authenticate with client certificates (or SSH key-based initial auth)
4. All communication is authenticated via mutual TLS
5. Certificate rotation and revocation are supported

**Join tokens:**
- One-time tokens: for individual bare-metal servers (generated during PXE provision, embedded in kickstart)
- Reusable tokens: for autoscaling groups (AWS ASG instances share a token)
- Tokens can be revoked and can carry an optional expiry

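
The join-token rules above (one-time versus reusable, revocable, optional expiry) can be sketched as a small validation function. The `JoinTokenRecord` fields and both function names are hypothetical; the real Prisma `JoinToken` model may be shaped differently.

```typescript
// Hypothetical in-memory view of a JoinToken row; field names are assumptions.
interface JoinTokenRecord {
  value: string;
  reusable: boolean;    // false = one-time (bare metal), true = shared (e.g. an ASG)
  revoked: boolean;
  expiresAtMs?: number; // optional expiry, Unix epoch milliseconds
  usedCount: number;
}

type TokenCheck = { ok: true } | { ok: false; reason: string };

// Decide whether a token may be used for enrollment right now.
function checkJoinToken(token: JoinTokenRecord, nowMs: number): TokenCheck {
  if (token.revoked) return { ok: false, reason: "revoked" };
  if (token.expiresAtMs !== undefined && nowMs >= token.expiresAtMs) {
    return { ok: false, reason: "expired" };
  }
  if (!token.reusable && token.usedCount > 0) {
    return { ok: false, reason: "already used" };
  }
  return { ok: true };
}

// Check the token and, on success, record the enrollment.
function redeemJoinToken(token: JoinTokenRecord, nowMs: number): TokenCheck {
  const result = checkJoinToken(token, nowMs);
  if (result.ok) token.usedCount += 1;
  return result;
}

const oneTimeToken: JoinTokenRecord = {
  value: "example-token", reusable: false, revoked: false, usedCount: 0,
};
const firstUse = redeemJoinToken(oneTimeToken, 0);
const secondUse = redeemJoinToken(oneTimeToken, 0);
console.log(firstUse.ok, secondUse.ok); // first succeeds, second is rejected
```

In a real daemon the check and the counter update would need to happen in one database transaction, otherwise two agents could redeem the same one-time token concurrently.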
### 2.4 RBAC Model

Inspired by mcpctl's RBAC (src/mcpd/src/services/, middleware/auth). Hierarchical permissions:

```
action:cloud:environment:server

Examples:
read:*:*:*                 - read everything
exec:baremetal:lab:*       - exec on any lab bare-metal server
kubectl:*:*:*              - kubectl proxy on any cluster
*:baremetal:lab:puppet     - full access to puppet server only
manage:*:*:*               - manage apps, clusters, tokens
admin:*:*:*                - full admin (create users, roles)
```

- **Resources:** servers, environments, clouds, modules, roles, users, clusters, apps, pulumi-stacks
- **Actions:** read, exec, apply, destroy, manage, admin, kubectl
- **Deny rules:** explicit deny overrides any allow (like AWS IAM)

Prisma models: Role, Permission (allow/deny), UserRole binding.

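
The evaluation rules above (four-segment patterns, `*` wildcards, deny by default, explicit deny wins) can be sketched as follows. `PermissionRule`, `matchesPattern`, and `isAllowed` are illustrative names, not the actual labd or mcpctl API.

```typescript
type RuleEffect = "allow" | "deny";

// One rule, e.g. { effect: "allow", pattern: "exec:baremetal:lab:*" }.
interface PermissionRule {
  effect: RuleEffect;
  pattern: string; // action:cloud:environment:server; "*" matches any segment
}

// True when every pattern segment is "*" or equals the matching request segment.
function matchesPattern(pattern: string, request: string): boolean {
  const patternParts = pattern.split(":");
  const requestParts = request.split(":");
  if (patternParts.length !== 4 || requestParts.length !== 4) return false;
  return patternParts.every((part, i) => part === "*" || part === requestParts[i]);
}

// Deny by default; an explicit deny overrides any allow.
function isAllowed(rules: PermissionRule[], request: string): boolean {
  const matching = rules.filter((rule) => matchesPattern(rule.pattern, request));
  if (matching.some((rule) => rule.effect === "deny")) return false;
  return matching.some((rule) => rule.effect === "allow");
}

const operatorRules: PermissionRule[] = [
  { effect: "allow", pattern: "exec:baremetal:lab:*" },
  { effect: "deny", pattern: "*:baremetal:lab:puppet" },
];

console.log(isAllowed(operatorRules, "exec:baremetal:lab:ser9"));   // true
console.log(isAllowed(operatorRules, "exec:baremetal:lab:puppet")); // false: deny wins
console.log(isAllowed(operatorRules, "read:baremetal:lab:ser9"));   // false: no allow matches
```

Evaluating deny rules first means the order in which rules are stored never changes the outcome, which keeps the security tests in section 6.3 simple to state.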
### 2.5 Database

**CockroachDB** chosen over PostgreSQL and Cassandra:
- PostgreSQL wire-compatible: Prisma works, mcpctl patterns are reusable
- Multi-master replication: any node accepts reads AND writes
- Strong consistency (not eventual like Cassandra)
- Survives node failures: 3 replicas tolerate 1 failure, 5 replicas tolerate 2 (the second figure requires raising the replication factor from its default of 3 to 5)
- Auto-rebalancing when adding nodes
- Start single-node, scale to multi-node with zero code changes (just add nodes)

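
The failure-tolerance figures above follow from majority quorum: CockroachDB replicates each range with Raft, so a range stays available while more than half of its replicas are reachable. A small sketch of that arithmetic (`tolerableFailures` is a name invented for this example):

```typescript
// Failures a majority-quorum group with `replicas` copies can survive.
function tolerableFailures(replicas: number): number {
  if (Math.floor(replicas) !== replicas || replicas < 1) {
    throw new RangeError("replicas must be a positive integer");
  }
  return Math.floor((replicas - 1) / 2);
}

for (const replicas of [1, 3, 5]) {
  console.log(replicas + " replicas -> survives " + tolerableFailures(replicas) + " failure(s)");
}
```

Note that the input is the replication factor, not the node count: adding nodes without raising the replication factor spreads load but does not increase the number of simultaneous failures a range survives.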
**Schema (already scaffolded in Prisma):**
- Server: managed machines (hostname, mac, cloud, env, role, labels, status)
- Agent: connected agents (cert, enrollment, last seen)
- User: platform users (username, cert fingerprint)
- Role: RBAC roles with permissions
- Permission: allow/deny rules (action:cloud:env:server)
- UserRole: user-to-role bindings
- JoinToken: enrollment tokens (one-time, reusable, revocable)
- AuditLog: every action logged (user, session, action, resource, result, duration)
- PulumiRun: infrastructure-as-code execution records
- Cluster: managed k8s clusters (kubeconfig encrypted)

## 3. CLI Command Reference

### 3.1 Bastion (PXE Provisioning): IMPLEMENTED
```bash
sudo labctl init bastion standalone start [--foreground] [--port 8080]
sudo labctl init bastion standalone stop
labctl init bastion standalone status
```

### 3.2 Provisioning: IMPLEMENTED
```bash
labctl provision list
labctl provision install <mac> <hostname> --role worker|infra
labctl provision reprovision <mac> <hostname> --role worker|infra
labctl provision forget <mac>
```

### 3.3 Server Management: TO BUILD
```bash
labctl get servers [--env NAME] [--cloud NAME] [--label KEY=VALUE]
labctl describe server/<name>
```

### 3.4 Remote Execution: TO BUILD
```bash
labctl exec server/<name> -- <command>
labctl exec server/<name> -it -- bash          # interactive TTY
labctl exec server/<name> --timeout 30s -- cmd
```

### 3.5 Kubernetes Proxy: TO BUILD
```bash
labctl kubectl --cluster <name> <kubectl-args>
labctl clusters add <name> --kubeconfig <path>
labctl clusters list
labctl clusters remove <name>
```

### 3.6 Logs: TO BUILD
```bash
# Server logs (journalctl passthrough, no DB in hot path)
labctl logs server/<name>                     # all journal
labctl logs server/<name> -f                  # follow (live WebSocket relay)
labctl logs server/<name> -n 100              # last 100 lines
labctl logs server/<name> -u k3s              # specific unit
labctl logs server/<name> -u sshd --since "1h ago"
labctl logs server/<name> -k                  # kernel
labctl logs server/<name> -p err              # errors only
labctl logs server/<name> --file /var/log/nginx/error.log

# App logs (k8s pod logs)
labctl logs app/<name> [-f] [--container NAME]

# Pulumi execution logs
labctl logs pulumi/<run-id> [-f]

# Bastion logs
labctl logs bastion/<env> [--mac MAC]

# Agent daemon logs
labctl logs agent/<server>

# Audit logs (from CockroachDB)
labctl logs audit [--user NAME] [--action ACTION] [--since TIME]
labctl logs audit/<user-date-sessionid>       # specific session
```

Log architecture: the agent runs `journalctl` or `tail` with the user-provided flags and streams stdout over WebSocket to labd, which relays it to the CLI. No database sits in the hot path. Future: Grafana Loki integration for cold storage.

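
Because the agent passes user-provided flags to `journalctl`, it is worth building an argument vector from validated fields rather than concatenating a shell string. The sketch below covers the server-log flags listed above; `JournalQuery` and `buildJournalctlArgs` are hypothetical names, and the unit-name pattern is an assumed, deliberately conservative whitelist.

```typescript
// Hypothetical, already-parsed form of the flags from `labctl logs server/<name>`.
interface JournalQuery {
  follow?: boolean;  // -f
  lines?: number;    // -n
  unit?: string;     // -u
  since?: string;    // --since
  kernel?: boolean;  // -k
  priority?: string; // -p
}

// Priority names accepted by journalctl -p.
const JOURNAL_PRIORITIES = ["emerg", "alert", "crit", "err", "warning", "notice", "info", "debug"];

// Build an argv array (never a shell string) so user input cannot inject commands.
function buildJournalctlArgs(query: JournalQuery): string[] {
  const args: string[] = ["--no-pager"];
  if (query.follow) args.push("-f");
  if (query.lines !== undefined) {
    if (Math.floor(query.lines) !== query.lines || query.lines < 0) {
      throw new Error("invalid line count");
    }
    args.push("-n", String(query.lines));
  }
  if (query.unit !== undefined) {
    // Assumed whitelist of characters for unit names.
    if (!/^[A-Za-z0-9@_.:-]+$/.test(query.unit)) throw new Error("invalid unit name");
    args.push("-u", query.unit);
  }
  if (query.since !== undefined) args.push("--since", query.since);
  if (query.kernel) args.push("-k");
  if (query.priority !== undefined) {
    if (JOURNAL_PRIORITIES.indexOf(query.priority) === -1) throw new Error("invalid priority");
    args.push("-p", query.priority);
  }
  return args;
}

const followK3s = buildJournalctlArgs({ unit: "k3s", lines: 100, follow: true });
console.log(followK3s.join(" "));
```

The resulting array is meant to be handed to a process-spawning API that takes arguments as a list, so values such as `--since "1h ago"` reach `journalctl` as a single argument without any shell quoting.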
### 3.7 Apps (Pulumi Charts, replacing Helm): TO BUILD
```bash
labctl apps list
labctl apps install <name> [--set key=value] [-f values.yaml]
labctl apps status <name>
labctl apps upgrade <name>
labctl apps history <name>
labctl apps rollback <name> <version>
labctl apps uninstall <name>
```

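
One way `--set key=value` could be layered over the defaults from `values.yaml` is sketched below. The dotted-path syntax (`image.tag=v2`) is an assumption borrowed from Helm's convention, and `AppValues` and `applySet` are names invented for this example.

```typescript
// Nested string values, as they might look after loading values.yaml.
type AppValues = { [key: string]: string | AppValues };

// Keys that must never be written, to avoid prototype pollution.
const FORBIDDEN_KEYS = ["__proto__", "constructor", "prototype"];

// Apply one `--set a.b=value` override and return a new object; the input is not mutated.
function applySet(values: AppValues, assignment: string): AppValues {
  const eq = assignment.indexOf("=");
  if (eq <= 0) throw new Error('expected key=value, got "' + assignment + '"');
  const parents = assignment.slice(0, eq).split(".");
  const value = assignment.slice(eq + 1);
  const leaf = parents.pop(); // `parents` now holds only the intermediate keys
  if (leaf === undefined || leaf === "") throw new Error("empty key");

  const result: AppValues = { ...values };
  let cursor = result;
  for (const key of parents) {
    if (FORBIDDEN_KEYS.indexOf(key) !== -1) throw new Error("forbidden key: " + key);
    const child = cursor[key];
    // Copy existing nested objects; replace scalars or missing keys with a fresh object.
    const next: AppValues = typeof child === "object" ? { ...child } : {};
    cursor[key] = next;
    cursor = next;
  }
  if (FORBIDDEN_KEYS.indexOf(leaf) !== -1) throw new Error("forbidden key: " + leaf);
  cursor[leaf] = value;
  return result;
}

const defaultValues: AppValues = { replicas: "1", image: { tag: "latest" } };
const mergedValues = applySet(defaultValues, "image.tag=v2");
console.log(JSON.stringify(mergedValues));
```

Returning a copy instead of editing the defaults in place makes it straightforward to record both the defaults and the effective values for `labctl apps history`.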
### 3.8 Infrastructure as Code: TO BUILD
```bash
labctl apply -f <file.ts> --env <env>
labctl plan -f <file.ts> --env <env>
labctl destroy -f <file.ts> --env <env>
```

### 3.9 RBAC: TO BUILD
```bash
labctl get roles
labctl get users
labctl create role <name> --allow "action:cloud:env:server"
labctl create role <name> --deny "destroy:*:*:*"
labctl bind role <role> --user <user>
labctl unbind role <role> --user <user>
labctl get permissions
```

### 3.10 Environments and Clouds: TO BUILD
```bash
labctl get environments
labctl get clouds
labctl create environment <name> --cloud <cloud>
```

## 4. Partition Layout

### Worker Role
```
/boot/efi           600MB   EFI
/boot               3GB     ext4
── LVM VG: labvg ──
swap                27GB
/                   33GB    xfs
/var                100GB   xfs
/var/log            10GB    xfs
/home               10GB    xfs     ← preserved on reprovision
/srv                20GB    xfs     ← preserved on reprovision
/var/lib/longhorn   rest    xfs     ← preserved (Longhorn PVC storage)
/tmp                4GB     tmpfs
```

### Infra Role
```
/boot/efi           600MB   EFI
/boot               3GB     ext4
── LVM VG: labvg ──
swap                27GB
/                   33GB    xfs
/var                100GB   xfs
/var/log            10GB    xfs
/home               10GB    xfs     ← preserved on reprovision
/srv                20GB    xfs     ← preserved on reprovision
/var/lib/rancher    20GB    xfs     ← preserved (k3s etcd data)
/tmp                4GB     tmpfs
```

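
To get a feel for how much space the `rest` volume receives in the worker layout, the fixed sizes can be summed and subtracted from the disk size. This is rough arithmetic only: it assumes a 1000 GB disk (the PRD states 1TB only for labmaster, not for workers) and ignores GB/GiB differences and LVM metadata overhead. The names `WORKER_FIXED_GB` and `remainingGb` are made up for the example.

```typescript
// Fixed-size, disk-backed entries from the worker layout, in GB (/tmp is tmpfs, so it is excluded).
const WORKER_FIXED_GB: { [mount: string]: number } = {
  "/boot/efi": 0.6,
  "/boot": 3,
  "swap": 27,
  "/": 33,
  "/var": 100,
  "/var/log": 10,
  "/home": 10,
  "/srv": 20,
};

// Space left for the "rest" volume (/var/lib/longhorn) on a disk of diskGb.
function remainingGb(diskGb: number, fixed: { [mount: string]: number }): number {
  let used = 0;
  for (const mount of Object.keys(fixed)) {
    used += fixed[mount];
  }
  const rest = diskGb - used;
  if (rest <= 0) throw new RangeError("disk too small: fixed volumes need " + used + " GB");
  return rest;
}

const longhornGb = remainingGb(1000, WORKER_FIXED_GB); // assumed 1000 GB disk
console.log("approx. " + longhornGb.toFixed(1) + " GB left for /var/lib/longhorn");
```

The fixed volumes add up to 203.6 GB, so on a disk of that assumed size roughly four fifths of the capacity ends up as Longhorn storage.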
## 5. Module System

Configuration modules define desired state. Three tiers:
1. **Core modules** (this repo, `modules/`): k3s-server, k3s-agent, labd, lab-agent, bastion
2. **Official modules** (separate repos): monitoring, cilium, DNS
3. **Custom modules** (user repos): pulled by git URL

Module structure:
```
module.yaml        # name, version, targets (roles/labels), deps
src/index.ts       # entry point
src/install.ts     # installation logic
src/configure.ts   # configuration logic
src/health.ts      # health check
tests/             # vitest tests (mandatory)
```

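
Since `module.yaml` declares `deps`, modules have to be applied in dependency order. A depth-first topological sort is one straightforward way to derive that order; the sketch below is an illustration, with `ModuleSpec`, `installOrder`, and the example dependencies between core modules all being assumptions.

```typescript
// Hypothetical subset of a parsed module.yaml.
interface ModuleSpec {
  id: string;
  deps: string[];
}

// Return module ids so that every module appears after all of its dependencies.
function installOrder(modules: ModuleSpec[]): string[] {
  const byId: { [id: string]: ModuleSpec } = {};
  for (const spec of modules) byId[spec.id] = spec;

  const state: { [id: string]: "visiting" | "done" } = {};
  const order: string[] = [];

  function visit(id: string, trail: string[]): void {
    if (state[id] === "done") return;
    if (state[id] === "visiting") {
      throw new Error("dependency cycle: " + trail.concat(id).join(" -> "));
    }
    const spec = byId[id];
    if (spec === undefined) throw new Error("unknown module: " + id);
    state[id] = "visiting";
    for (const dep of spec.deps) visit(dep, trail.concat(id));
    state[id] = "done";
    order.push(id);
  }

  for (const spec of modules) visit(spec.id, []);
  return order;
}

// Example dependencies are invented for illustration.
const coreModules: ModuleSpec[] = [
  { id: "bastion", deps: ["labd"] },
  { id: "labd", deps: ["k3s-server"] },
  { id: "k3s-server", deps: [] },
];
const coreOrder = installOrder(coreModules);
console.log(coreOrder.join(" -> "));
```

Failing loudly on cycles and unknown dependencies at planning time is cheaper than discovering them halfway through applying modules on a server.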
## 6. Testing Strategy

### 6.1 Testing Pyramid
```
Unit Tests         → pure logic, milliseconds, every commit
Smoke Tests        → containers (podman-compose), minutes, every commit
Integration Tests  → VMs (libvirt), 10-15 min, PRs
E2E Tests          → real hardware/cloud, 20-30 min, pre-release
```

### 6.2 Smoke Test Stack (podman-compose)
```yaml
services:
  cockroachdb:
    image: cockroachdb/cockroach:latest-v24.3
    # Assumed dev-mode invocation; the image needs a start command to run.
    command: start-single-node --insecure
  labd:
    build: .
    depends_on: [cockroachdb]
  agent-1:
    build: ./agent
    depends_on: [labd]
  agent-2:
    build: ./agent
    depends_on: [labd]
```
Tests: agent enrollment, certificate issuance, heartbeat, exec, logs, RBAC deny/allow.

### 6.3 Security Tests (RBAC)
- Deny exec without permission
- Deny cross-environment access
- Deny rules override allow rules
- Users cannot escalate their own permissions
- Audit log records all denied attempts
- Certificate-based auth cannot be spoofed
- One-time join tokens cannot be reused
- Expired tokens are rejected

### 6.4 Ephemeral Test Environments
```bash
labctl test smoke                             # podman-compose
labctl test integration                       # libvirt VMs
labctl env create pr-123 --cloud containers   # CI ephemeral
labctl env create pr-123 --cloud aws          # cloud ephemeral (future)
```

### 6.5 Health Gates for Deployment
Before promoting to production, ALL must pass:
- labd API responds
- Expected number of agents connected
- k3s nodes Ready
- Certificates valid for more than 30 days
- RBAC smoke test passes
- No error logs in the last 5 minutes

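
The gates above combine with a simple "all must pass" rule, which could be expressed as follows. The `ClusterSnapshot` fields are hypothetical inputs that some collector would have to fill in, and reading "expected number of agents" as an exact match is an interpretation, not something the PRD specifies.

```typescript
// Hypothetical measurements gathered before a promotion; field names are assumptions.
interface ClusterSnapshot {
  apiResponds: boolean;
  agentsConnected: number;
  agentsExpected: number;
  nodesReady: number;
  nodesTotal: number;
  certDaysRemaining: number[]; // one entry per certificate
  rbacSmokePassed: boolean;
  errorLogsLast5Min: number;
}

interface GateResult {
  gate: string;
  passed: boolean;
}

// Evaluate every gate so a failed promotion can report all problems at once.
function evaluateGates(s: ClusterSnapshot): GateResult[] {
  return [
    { gate: "labd API responds", passed: s.apiResponds },
    { gate: "expected agents connected", passed: s.agentsConnected === s.agentsExpected },
    { gate: "k3s nodes Ready", passed: s.nodesTotal > 0 && s.nodesReady === s.nodesTotal },
    { gate: "certificates valid > 30 days", passed: s.certDaysRemaining.every((d) => d > 30) },
    { gate: "RBAC smoke test passes", passed: s.rbacSmokePassed },
    { gate: "no error logs in last 5 minutes", passed: s.errorLogsLast5Min === 0 },
  ];
}

// Promotion is allowed only when this list is empty.
function failedGates(s: ClusterSnapshot): string[] {
  return evaluateGates(s).filter((r) => !r.passed).map((r) => r.gate);
}

const snapshot: ClusterSnapshot = {
  apiResponds: true,
  agentsConnected: 2,
  agentsExpected: 2,
  nodesReady: 2,
  nodesTotal: 2,
  certDaysRemaining: [200, 12], // one certificate is close to expiry
  rbacSmokePassed: true,
  errorLogsLast5Min: 0,
};
const blockers = failedGates(snapshot);
console.log(blockers.length === 0 ? "promote" : "blocked by: " + blockers.join(", "));
```

Collecting every failed gate, instead of stopping at the first, gives the operator a complete picture in a single run.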
## 7. Cloud/Environment Model

```
Cloud: baremetal
└── Environment: lab
    ├── Server: labmaster.ad.itaz.eu (infra, labels={k3s=server})
    └── Server: ser9.ad.itaz.eu (worker, labels={k3s=agent})

Cloud: aws (future)
└── Environment: production
    ├── Server: i-abc123 (from ASG web-servers)
    └── Server: i-def456 (from ASG web-servers)
```

Each bastion creates an environment under the baremetal cloud. AWS autoscaling groups create environments under the aws cloud.

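
The hierarchy above maps naturally onto the filters of `labctl get servers [--env NAME] [--cloud NAME] [--label KEY=VALUE]`. A minimal sketch of that filtering, using the example servers from the tree; `ManagedServer`, `ServerFilter`, and `filterServers` are illustrative names rather than the real Prisma model or CLI code.

```typescript
// Hypothetical flattened view of a Server row.
interface ManagedServer {
  hostname: string;
  cloud: string;
  env: string;
  role?: string;
  labels: { [key: string]: string };
}

// Mirrors the CLI flags; `label` is a single "KEY=VALUE" string.
interface ServerFilter {
  cloud?: string;
  env?: string;
  label?: string;
}

function parseLabel(label: string): { key: string; value: string } {
  const eq = label.indexOf("=");
  if (eq <= 0) throw new Error("label must be KEY=VALUE");
  return { key: label.slice(0, eq), value: label.slice(eq + 1) };
}

// Every filter that is set must match; unset filters match everything.
function filterServers(servers: ManagedServer[], filter: ServerFilter): ManagedServer[] {
  const wanted = filter.label === undefined ? undefined : parseLabel(filter.label);
  return servers.filter((server) =>
    (filter.cloud === undefined || server.cloud === filter.cloud) &&
    (filter.env === undefined || server.env === filter.env) &&
    (wanted === undefined || server.labels[wanted.key] === wanted.value));
}

const inventory: ManagedServer[] = [
  { hostname: "labmaster.ad.itaz.eu", cloud: "baremetal", env: "lab", role: "infra", labels: { k3s: "server" } },
  { hostname: "ser9.ad.itaz.eu", cloud: "baremetal", env: "lab", role: "worker", labels: { k3s: "agent" } },
  { hostname: "i-abc123", cloud: "aws", env: "production", labels: {} },
];

const k3sAgents = filterServers(inventory, { cloud: "baremetal", label: "k3s=agent" });
console.log(k3sAgents.map((server) => server.hostname));
```

In labd the same predicate would more likely be pushed down into a database query; the in-memory version is shown only to make the matching semantics explicit.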
## 8. App Model (Pulumi Charts)

Each app is a Pulumi TypeScript program:
```
app.yaml        # name, version, inputs schema, required permissions
src/index.ts    # Pulumi program
values.yaml     # defaults
tests/          # vitest tests
```

First apps to build:
- bastion: PXE provisioning (wrap existing code)
- labd: master daemon (self-deployment)
- cockroachdb: database
- cilium: CNI

## 9. Implementation Phases

### Phase 1: Foundation (PARTIALLY DONE)
- [x] PXE bastion (discover, install, reprovision)
- [x] CLI structure (labctl init/provision)
- [x] labd scaffold (Fastify + CockroachDB/Prisma schema)
- [x] Multi-arch builds, packaging, CI/CD
- [ ] Certificate Authority in labd
- [ ] lab-agent skeleton (connect, heartbeat, enrollment)
- [ ] Agent enrollment via join tokens
- [ ] RBAC engine
- [ ] labctl exec (remote execution)
- [ ] labctl logs (resource-scoped streaming)
- [ ] labctl get servers (with filters)
- [ ] Smoke test stack (podman-compose)

### Phase 2: Deployment
- [ ] Reprovision labmaster as labmaster.ad.itaz.eu
- [ ] Deploy k3s with Cilium CNI
- [ ] Deploy CockroachDB on k3s
- [ ] Deploy labd on k3s
- [ ] Deploy bastion as managed app
- [ ] Auto-enroll agents during PXE provision

### Phase 3: Infrastructure as Code
- [ ] Module system
- [ ] Pulumi charts (replacing Helm)
- [ ] labctl apps install/upgrade/rollback
- [ ] labctl apply -f (Pulumi execution)
- [ ] kubectl proxy (audited)
- [ ] Kubeconfig store (encrypted)

### Phase 4: Multi-Cloud
- [ ] AWS provider (Pulumi)
- [ ] Reusable join tokens for ASGs
- [ ] Cilium Cluster Mesh
- [ ] Ephemeral test environments
- [ ] Grafana Loki for cold logs

## 10. Technology Stack

| Component | Technology | Notes |
|-----------|-----------|-------|
| Language | TypeScript (ESM) | Same for CLI, daemon, agents, IaC |
| CLI | Commander.js | Matches mcpctl patterns |
| HTTP Server | Fastify + WebSocket | labd and bastion |
| Database | CockroachDB | PostgreSQL compatible, Prisma ORM |
| ORM | Prisma | Reuse mcpctl patterns |
| IaC | Pulumi (TypeScript) | Replaces Helm and Puppet |
| k8s CNI | Cilium | eBPF, WireGuard, network policies |
| Auth | mTLS (built-in CA) | Certificate-based, no SSH keys |
| Packaging | nfpm (RPM/DEB) | bun compile for standalone binary |
| Containers | Podman + podman-compose | No Docker dependency |
| CI/CD | Gitea Actions | Self-hosted on mysources.co.uk |
| Testing | Vitest | Unit + smoke + integration |
| Registry | Gitea packages | RPM, DEB, container images |

## 11. Lessons from mcpctl

The mcpctl project (../mcpctl/) established patterns reused here:

**Project structure:** pnpm monorepo with workspace packages (shared, cli, daemon). Each package has its own package.json, tsconfig.json, and vitest.config.ts.

**CLI patterns:** Commander.js with factory functions (createXxxCommand). Global options (--project → --env/--cloud). Resource CRUD (get, describe, delete, create, apply).

**Server patterns:** Fastify with route registration functions. Services layer with repository pattern. Middleware for auth. Health endpoints.

**Database:** Prisma ORM with PostgreSQL (now CockroachDB, wire-compatible). Migration-first schema. Seed data for initial setup.

**RBAC:** Role-based with permission strings. Middleware checks on every request. Audit logging in middleware.

**Testing:** Vitest with separate configs for unit vs smoke. Smoke tests with real database and services. Security tests for RBAC.

**CI/CD:** Gitea Actions with lint→typecheck→test→build→publish pipeline. nfpm for RPM/DEB. Bun compile for standalone binaries. Podman for container images.

**Deployment:** Docker/Podman compose for dev stack. Portainer API for production deploy (we'll use k3s instead). systemd for local daemons.

**Completions:** Generated from the Commander tree. Bash + Fish. --write and --check modes. Included in packages.

**Key learnings applied:**
- Start with a proper monorepo structure (not flat scripts)
- Type safety across packages via workspace references
- Test-driven (unit tests before features)
- CI from the start (not retrofitted)
- RBAC and audit from the start (not bolted on)
- Database-first design (schema defines the domain)

## 12. Gitea Registry

- **Registry:** mysources.co.uk (self-hosted Gitea at 10.0.0.194)
- **Token:** stored at ~/.gitea-token, env var PACKAGES_TOKEN
- **Packages:** RPM and DEB published to Gitea packages API
- **Container images:** pushed to Gitea container registry
- **API pattern:** Same as mcpctl publish scripts (check existing, delete, re-upload, link to repo)