labctl — Infrastructure Management Platform
Product Requirements Document
1. Overview
labctl is a unified infrastructure management platform for bare-metal servers, Kubernetes clusters, and cloud resources. It replaces Puppet with a modern, TypeScript-native system using Pulumi for infrastructure as code.
1.1 Core Principles
- Single CLI (labctl) for all infrastructure operations
- mTLS everywhere: built-in Certificate Authority, no SSH key management
- RBAC from day one — deny by default, audit everything
- Multi-cloud — bare metal now, AWS later, extensible to any cloud
- Test infrastructure like code — ephemeral environments, smoke tests, security tests
- Pulumi over Helm — TypeScript charts, typed, testable, no YAML templating
1.2 Current State (completed)
- PXE bastion for bare-metal provisioning (discover, install, reprovision)
- CLI with subcommands: labctl init bastion, labctl provision
- LVM partitioning with reprovision data preservation (/home, /srv, /var/lib/longhorn, /var/lib/rancher)
- Worker role (k3s agent + Longhorn) and infra role (k3s server + etcd)
- 32 unit tests, VM smoke tests verified on real hardware
- Multi-arch builds (x86_64 + arm64), RPM/DEB packaging, Gitea CI/CD
- labd scaffold with CockroachDB Prisma schema (Server, Agent, User, Role, Permission, AuditLog, JoinToken, Cluster, PulumiRun)
1.3 Hardware
- labmaster (puppet.ad.itaz.eu / 78:55:36:08:35:14): MinisForum SER9, AMD Ryzen 7 255, 16 cores, 27GB RAM, 1TB NVMe, infra role
- Future: additional bare-metal worker nodes, AWS EC2 instances
2. Architecture
2.1 Components
labctl CLI → labd (master) → lab-agent (on every server)
               ↓
          CockroachDB
labctl — CLI binary installed on developer workstations. Compiled with bun to standalone binary. Distributed as RPM/DEB/binary.
labd — Master daemon running as k8s Deployment on labmaster's k3s cluster. Stateless (all state in CockroachDB). Multiple instances behind k8s Service for HA. Manages: CA, RBAC, agent registry, Pulumi executor, kubectl proxy, app deployments, log relay.
lab-agent — Lightweight daemon on every managed machine. Connects to labd via mTLS WebSocket. Handles: heartbeat, command execution, log streaming, module application. Compiled to standalone binary with bun. Installed via systemd service.
CockroachDB — Distributed SQL database. PostgreSQL wire-compatible (Prisma works unchanged). Single node to start, multi-node for HA. Stores: server state, RBAC, audit logs, certificates, kubeconfigs (encrypted), Pulumi state.
Bastion — PXE provisioning server. Runs as k8s pod with hostNetwork (needs DHCP/TFTP). Managed by labd as an "app". Multiple bastions for multiple sites.
2.2 Network Architecture
Cilium as k8s CNI (replacing default flannel):
- eBPF-based pod networking
- Built-in WireGuard encryption between nodes
- Network policies (ties into RBAC)
- Hubble for observability
- Future: Cluster Mesh for multi-site transparent networking
No Tailscale dependency — Cilium handles node-to-node encryption. Agents connect to labd over standard TCP/TLS.
2.3 Authentication
mTLS with built-in Certificate Authority:
- labd generates root CA on first start (stored encrypted in CockroachDB)
- Agents enroll with join token → receive signed certificate
- CLI users authenticate with client certificates (or SSH key-based initial auth)
- All communication authenticated via mutual TLS
- Certificate rotation and revocation supported
Join tokens:
- One-time tokens: for individual bare-metal servers (generated during PXE provision, embedded in kickstart)
- Reusable tokens: for autoscaling groups (AWS ASG instances share a token)
- Tokens can be revoked, have optional expiry
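The token rules above can be sketched as a single validation step run at enrollment time. This is a minimal illustration, not the labd implementation: the `JoinToken` shape, its field names, and `checkJoinToken` are hypothetical and do not mirror the Prisma schema.

```typescript
// Hypothetical token shape; field names are assumptions, not the Prisma model.
type JoinToken = {
  id: string;
  kind: "one-time" | "reusable";
  revoked: boolean;
  expiresAt?: Date; // optional expiry
  useCount: number; // number of agents already enrolled with this token
};

type TokenCheck = { ok: true } | { ok: false; reason: string };

// Decide whether a token may be used for one more enrollment at time `now`.
function checkJoinToken(token: JoinToken, now: Date): TokenCheck {
  if (token.revoked) return { ok: false, reason: "revoked" };
  if (token.expiresAt !== undefined && token.expiresAt.getTime() <= now.getTime()) {
    return { ok: false, reason: "expired" };
  }
  // One-time tokens are spent after the first successful enrollment.
  if (token.kind === "one-time" && token.useCount > 0) {
    return { ok: false, reason: "already used" };
  }
  return { ok: true };
}
```

In a real service the check and the `useCount` increment would likely need to happen in one database transaction, so that two agents cannot race on the same one-time token.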
2.4 RBAC Model
Inspired by mcpctl's RBAC (src/mcpd/src/services/, middleware/auth). Hierarchical permissions:
action:cloud:environment:server
Examples:
read:*:*:* — read everything
exec:baremetal:lab:* — exec on any lab bare-metal server
kubectl:*:*:* — kubectl proxy on any cluster
*:baremetal:lab:puppet — full access to puppet server only
manage:*:*:* — manage apps, clusters, tokens
admin:*:*:* — full admin (create users, roles)
- Resources: servers, environments, clouds, modules, roles, users, clusters, apps, pulumi-stacks
- Actions: read, exec, apply, destroy, manage, admin, kubectl
- Deny rules: explicit deny overrides any allow (like AWS IAM)
Prisma models: Role, Permission (allow/deny), UserRole binding.
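One way to read the evaluation order described above (deny by default, explicit deny beats any allow) is the sketch below. The `Rule` type and the `matches` and `isAllowed` functions are hypothetical names. The sketch also assumes actions are flat strings, so `admin` does not imply other actions unless a rule uses `*`.

```typescript
type Effect = "allow" | "deny";
// pattern has the form action:cloud:environment:server
type Rule = { effect: Effect; pattern: string };

// True when every pattern segment is "*" or equals the matching request segment.
function matches(pattern: string, request: string): boolean {
  const p = pattern.split(":");
  const r = request.split(":");
  if (p.length !== 4 || r.length !== 4) return false;
  return p.every((seg, i) => seg === "*" || seg === r[i]);
}

function isAllowed(rules: Rule[], request: string): boolean {
  const hits = rules.filter((rule) => matches(rule.pattern, request));
  // An explicit deny overrides any allow.
  if (hits.some((rule) => rule.effect === "deny")) return false;
  // With no matching allow rule, the request is denied by default.
  return hits.some((rule) => rule.effect === "allow");
}
```

For example, a role that allows `exec:baremetal:lab:*` and denies `exec:baremetal:lab:puppet` can run commands on every lab server except puppet.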
2.5 Database
CockroachDB chosen over PostgreSQL and Cassandra:
- PostgreSQL wire-compatible — Prisma works, mcpctl patterns reusable
- Multi-master replication — any node accepts reads AND writes
- Strong consistency (not eventual like Cassandra)
- Survives node failures (3 nodes = 1 failure, 5 nodes = 2)
- Auto-rebalancing when adding nodes
- Start single-node, scale to multi-node with zero code changes (just add nodes)
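The failure-tolerance figures above follow from majority quorum: data stays available while more than half of its replicas are reachable. A small sketch of that arithmetic follows; `toleratedFailures` is a hypothetical helper name. It counts replicas rather than nodes, since CockroachDB's default replication factor is 3 and surviving two failures on a five-node cluster assumes the factor is raised to 5.

```typescript
// Number of replica failures a majority-quorum group can survive.
function toleratedFailures(replicas: number): number {
  if (!Number.isInteger(replicas) || replicas < 1) {
    throw new RangeError("replicas must be a positive integer");
  }
  // A majority needs floor(n / 2) + 1 replicas, so floor((n - 1) / 2) may fail.
  return Math.floor((replicas - 1) / 2);
}
```

A single-node start tolerates no failures, which is worth keeping in mind for Phase 2 before more nodes are added.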
Schema (already scaffolded in Prisma):
- Server — managed machines (hostname, mac, cloud, env, role, labels, status)
- Agent — connected agents (cert, enrollment, last seen)
- User — platform users (username, cert fingerprint)
- Role — RBAC roles with permissions
- Permission — allow/deny rules (action:cloud:env:server)
- UserRole — user-to-role bindings
- JoinToken — enrollment tokens (one-time, reusable, revocable)
- AuditLog — every action logged (user, session, action, resource, result, duration)
- PulumiRun — infrastructure-as-code execution records
- Cluster — managed k8s clusters (kubeconfig encrypted)
3. CLI Command Reference
3.1 Bastion (PXE Provisioning) — IMPLEMENTED
sudo labctl init bastion standalone start [--foreground] [--port 8080]
sudo labctl init bastion standalone stop
labctl init bastion standalone status
3.2 Provisioning — IMPLEMENTED
labctl provision list
labctl provision install <mac> <hostname> --role worker|infra
labctl provision reprovision <mac> <hostname> --role worker|infra
labctl provision forget <mac>
3.3 Server Management — TO BUILD
labctl get servers [--env NAME] [--cloud NAME] [--label KEY=VALUE]
labctl describe server/<name>
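The `--env`, `--cloud`, and `--label` filters could be applied as a simple conjunction over server records. The sketch below assumes a `Server` shape loosely based on the fields listed in the schema section; the type, the `ServerFilter` options object, and `filterServers` are hypothetical.

```typescript
// Hypothetical shape loosely mirroring the Server fields (hostname, cloud, env, role, labels).
type Server = {
  hostname: string;
  cloud: string;
  env: string;
  role: "worker" | "infra";
  labels: Record<string, string>;
};

// label is a single "KEY=VALUE" selector, as in --label KEY=VALUE.
type ServerFilter = { env?: string; cloud?: string; label?: string };

// Keep only servers that satisfy every filter that was supplied.
function filterServers(servers: Server[], filter: ServerFilter): Server[] {
  return servers.filter((s) => {
    if (filter.env !== undefined && s.env !== filter.env) return false;
    if (filter.cloud !== undefined && s.cloud !== filter.cloud) return false;
    if (filter.label !== undefined) {
      const eq = filter.label.indexOf("=");
      if (eq < 0) return false; // a malformed selector matches nothing
      const key = filter.label.slice(0, eq);
      const value = filter.label.slice(eq + 1);
      if (s.labels[key] !== value) return false;
    }
    return true;
  });
}
```

In practice the same conditions would more likely be pushed into the database query; the in-memory version only shows the intended semantics.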
3.4 Remote Execution — TO BUILD
labctl exec server/<name> -- <command>
labctl exec server/<name> -it -- bash # interactive TTY
labctl exec server/<name> --timeout 30s -- cmd
3.5 Kubernetes Proxy — TO BUILD
labctl kubectl --cluster <name> <kubectl-args>
labctl clusters add <name> --kubeconfig <path>
labctl clusters list
labctl clusters remove <name>
3.6 Logs — TO BUILD
# Server logs (journalctl passthrough, no DB in hot path)
labctl logs server/<name> # all journal
labctl logs server/<name> -f # follow (live WebSocket relay)
labctl logs server/<name> -n 100 # last 100 lines
labctl logs server/<name> -u k3s # specific unit
labctl logs server/<name> -u sshd --since "1h ago"
labctl logs server/<name> -k # kernel
labctl logs server/<name> -p err # errors only
labctl logs server/<name> --file /var/log/nginx/error.log
# App logs (k8s pod logs)
labctl logs app/<name> [-f] [--container NAME]
# Pulumi execution logs
labctl logs pulumi/<run-id> [-f]
# Bastion logs
labctl logs bastion/<env> [--mac MAC]
# Agent daemon logs
labctl logs agent/<server>
# Audit logs (from CockroachDB)
labctl logs audit [--user NAME] [--action ACTION] [--since TIME]
labctl logs audit/<user-date-sessionid> # specific session
Log architecture: agent runs journalctl/tail with user-provided flags, streams stdout over WebSocket to labd, labd relays to CLI. No database in the hot path. Future: Grafana Loki integration for cold storage.
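On the agent side, the user-provided flags have to be turned into a journalctl invocation. A minimal sketch of that mapping is shown here; `LogOptions` and `journalctlArgs` are hypothetical names, and only flags that appear in the command list above are covered.

```typescript
// Hypothetical options object parsed from the labctl logs flags.
type LogOptions = {
  follow?: boolean; // -f
  lines?: number; // -n
  unit?: string; // -u
  since?: string; // --since
  kernel?: boolean; // -k
  priority?: string; // -p
};

// Build an argv array for journalctl; nothing is passed through a shell.
function journalctlArgs(opts: LogOptions): string[] {
  const args: string[] = ["--no-pager"];
  if (opts.follow) args.push("-f");
  if (opts.lines !== undefined) args.push("-n", String(opts.lines));
  if (opts.unit) args.push("-u", opts.unit);
  if (opts.since) args.push("--since", opts.since);
  if (opts.kernel) args.push("-k");
  if (opts.priority) args.push("-p", opts.priority);
  return args;
}
```

Building an argument array instead of a command string means a value such as `--since "1h ago"` is passed as one argument, and user input is never interpreted by a shell.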
3.7 Apps (Pulumi Charts, replacing Helm) — TO BUILD
labctl apps list
labctl apps install <name> [--set key=value] [-f values.yaml]
labctl apps status <name>
labctl apps upgrade <name>
labctl apps history <name>
labctl apps rollback <name> <version>
labctl apps uninstall <name>
3.8 Infrastructure as Code — TO BUILD
labctl apply -f <file.ts> --env <env>
labctl plan -f <file.ts> --env <env>
labctl destroy -f <file.ts> --env <env>
3.9 RBAC — TO BUILD
labctl get roles
labctl get users
labctl create role <name> --allow "action:cloud:env:server"
labctl create role <name> --deny "destroy:*:*:*"
labctl bind role <role> --user <user>
labctl unbind role <role> --user <user>
labctl get permissions
3.10 Environments and Clouds — TO BUILD
labctl get environments
labctl get clouds
labctl create environment <name> --cloud <cloud>
4. Partition Layout
Worker Role
/boot/efi 600MB EFI
/boot 3GB ext4
── LVM VG: labvg ──
swap 27GB
/ 33GB xfs
/var 100GB xfs
/var/log 10GB xfs
/home 10GB xfs ← preserved on reprovision
/srv 20GB xfs ← preserved on reprovision
/var/lib/longhorn rest xfs ← preserved (Longhorn PVC storage)
/tmp tmpfs 4GB
Infra Role
/boot/efi 600MB EFI
/boot 3GB ext4
── LVM VG: labvg ──
swap 27GB
/ 33GB xfs
/var 100GB xfs
/var/log 10GB xfs
/home 10GB xfs ← preserved on reprovision
/srv 20GB xfs ← preserved on reprovision
/var/lib/rancher 20GB xfs ← preserved (k3s etcd data)
/tmp tmpfs 4GB
5. Module System
Configuration modules define desired state. Three tiers:
- Core modules (this repo, modules/): k3s-server, k3s-agent, labd, lab-agent, bastion
- Official modules (separate repos): monitoring, cilium, DNS
- Custom modules (user repos): pulled by git URL
Module structure:
module.yaml # name, version, targets (roles/labels), deps
src/index.ts # entry point
src/install.ts # installation logic
src/configure.ts # configuration logic
src/health.ts # health check
tests/ # vitest tests (mandatory)
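Because module.yaml declares deps, modules have to be applied in dependency order. One possible approach is a depth-first topological sort, sketched below. `ModuleMeta` models only the name and deps fields, `installOrder` is a hypothetical function, and the dependency edges in the usage note are invented for illustration.

```typescript
// Hypothetical subset of module.yaml: only name and deps are modeled.
type ModuleMeta = { name: string; deps: string[] };

// Return module names so that every module appears after its dependencies.
function installOrder(modules: ModuleMeta[]): string[] {
  const byName = new Map<string, ModuleMeta>();
  for (const m of modules) byName.set(m.name, m);

  const order: string[] = [];
  const state = new Map<string, "visiting" | "done">();

  const visit = (name: string, path: string[]): void => {
    if (state.get(name) === "done") return;
    if (state.get(name) === "visiting") {
      throw new Error(`dependency cycle: ${[...path, name].join(" -> ")}`);
    }
    const mod = byName.get(name);
    if (mod === undefined) throw new Error(`unknown module: ${name}`);
    state.set(name, "visiting");
    for (const dep of mod.deps) visit(dep, [...path, name]);
    state.set(name, "done");
    order.push(name); // pushed only after all of its dependencies
  };

  for (const m of modules) visit(m.name, []);
  return order;
}
```

If, say, cilium depended on k3s-server and k3s-server on lab-agent, the function would return lab-agent first and cilium last, and a cycle or a missing dependency would fail before anything is installed.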
6. Testing Strategy
6.1 Testing Pyramid
Unit Tests → pure logic, milliseconds, every commit
Smoke Tests → containers (podman-compose), minutes, every commit
Integration Tests → VMs (libvirt), 10-15 min, PRs
E2E Tests → real hardware/cloud, 20-30 min, pre-release
6.2 Smoke Test Stack (podman-compose)
services:
  cockroachdb:
    image: cockroachdb/cockroach:latest-v24.3
  labd:
    build: .
    depends_on: [cockroachdb]
  agent-1:
    build: ./agent
    depends_on: [labd]
  agent-2:
    build: ./agent
    depends_on: [labd]
Tests: agent enrollment, certificate issuance, heartbeat, exec, logs, RBAC deny/allow.
6.3 Security Tests (RBAC)
- Deny exec without permission
- Deny cross-environment access
- Deny rules override allow rules
- Cannot escalate own permissions
- Audit logs all denied attempts
- Certificate-based auth cannot be spoofed
- Join tokens cannot be reused (one-time)
- Expired tokens rejected
6.4 Ephemeral Test Environments
labctl test smoke # podman-compose
labctl test integration # libvirt VMs
labctl env create pr-123 --cloud containers # CI ephemeral
labctl env create pr-123 --cloud aws # cloud ephemeral (future)
6.5 Health Gates for Deployment
Before promoting to production, ALL must pass:
- labd API responds
- Expected number of agents connected
- k3s nodes Ready
- Certificates valid (>30 days)
- RBAC smoke test passes
- No error logs in last 5 minutes
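The all-must-pass rule can be expressed as a pure function over a snapshot of collected health data. The sketch below is illustrative only: `HealthSnapshot`, its field names, and `evaluateGates` are hypothetical, and how each value is gathered is out of scope.

```typescript
// Hypothetical snapshot of the values the gates inspect.
type HealthSnapshot = {
  apiResponds: boolean;
  agentsConnected: number;
  agentsExpected: number;
  nodesReady: boolean;
  certDaysRemaining: number;
  rbacSmokePassed: boolean;
  errorLogsLast5Min: number;
};

// Promote only when every gate passes; report the names of failed gates.
function evaluateGates(s: HealthSnapshot): { promote: boolean; failed: string[] } {
  const gates: [string, boolean][] = [
    ["labd API responds", s.apiResponds],
    // Assumption: "expected number" is read as at least the expected count.
    ["expected agents connected", s.agentsConnected >= s.agentsExpected],
    ["k3s nodes Ready", s.nodesReady],
    ["certificates valid (>30 days)", s.certDaysRemaining > 30],
    ["RBAC smoke test passes", s.rbacSmokePassed],
    ["no error logs in last 5 minutes", s.errorLogsLast5Min === 0],
  ];
  const failed = gates.filter(([, ok]) => !ok).map(([name]) => name);
  return { promote: failed.length === 0, failed };
}
```

Returning the list of failed gates rather than a bare boolean lets the CLI explain why a promotion was blocked.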
7. Cloud/Environment Model
Cloud: baremetal
└── Environment: lab
    ├── Server: labmaster.ad.itaz.eu (infra, labels={k3s=server})
    └── Server: ser9.ad.itaz.eu (worker, labels={k3s=agent})
Cloud: aws (future)
└── Environment: production
    ├── Server: i-abc123 (from ASG web-servers)
    └── Server: i-def456 (from ASG web-servers)
Each bastion creates an environment under the baremetal cloud. AWS Auto Scaling groups create environments under the aws cloud.
8. App Model (Pulumi Charts)
Each app is a Pulumi TypeScript program:
app.yaml # name, version, inputs schema, required permissions
src/index.ts # Pulumi program
values.yaml # defaults
tests/ # vitest tests
First apps to build:
- bastion — PXE provisioning (wrap existing code)
- labd — master daemon (self-deployment)
- cockroachdb — database
- cilium — CNI
9. Implementation Phases
Phase 1: Foundation (PARTIALLY DONE)
- [x] PXE bastion (discover, install, reprovision)
- [x] CLI structure (labctl init/provision)
- [x] labd scaffold (Fastify + CockroachDB/Prisma schema)
- [x] Multi-arch builds, packaging, CI/CD
- [ ] Certificate Authority in labd
- [ ] lab-agent skeleton (connect, heartbeat, enrollment)
- [ ] Agent enrollment via join tokens
- [ ] RBAC engine
- [ ] labctl exec (remote execution)
- [ ] labctl logs (resource-scoped streaming)
- [ ] labctl get servers (with filters)
- [ ] Smoke test stack (podman-compose)
Phase 2: Deployment
- Reprovision labmaster as labmaster.ad.itaz.eu
- Deploy k3s with Cilium CNI
- Deploy CockroachDB on k3s
- Deploy labd on k3s
- Deploy bastion as managed app
- Auto-enroll agents during PXE provision
Phase 3: Infrastructure as Code
- Module system
- Pulumi charts (replacing Helm)
- labctl apps install/upgrade/rollback
- labctl apply -f (Pulumi execution)
- kubectl proxy (audited)
- Kubeconfig store (encrypted)
Phase 4: Multi-Cloud
- AWS provider (Pulumi)
- Reusable join tokens for ASGs
- Cilium Cluster Mesh
- Ephemeral test environments
- Grafana Loki for cold logs
10. Technology Stack
| Component | Technology | Notes |
|---|---|---|
| Language | TypeScript (ESM) | Same for CLI, daemon, agents, IaC |
| CLI | Commander.js | Matches mcpctl patterns |
| HTTP Server | Fastify + WebSocket | labd and bastion |
| Database | CockroachDB | PostgreSQL compatible, Prisma ORM |
| ORM | Prisma | Reuse mcpctl patterns |
| IaC | Pulumi (TypeScript) | Replaces Helm and Puppet |
| k8s CNI | Cilium | eBPF, WireGuard, network policies |
| Auth | mTLS (built-in CA) | Certificate-based, no SSH keys |
| Packaging | nfpm (RPM/DEB) | bun compile for standalone binary |
| Containers | Podman + podman-compose | No Docker dependency |
| CI/CD | Gitea Actions | Self-hosted on mysources.co.uk |
| Testing | Vitest | Unit + smoke + integration |
| Registry | Gitea packages | RPM, DEB, container images |
11. Lessons from mcpctl
The mcpctl project (../mcpctl/) established patterns reused here:
Project structure: pnpm monorepo with workspace packages (shared, cli, daemon). Each package has own package.json, tsconfig.json, vitest.config.ts.
CLI patterns: Commander.js with factory functions (createXxxCommand). Global options (--project → --env/--cloud). Resource CRUD (get, describe, delete, create, apply).
Server patterns: Fastify with route registration functions. Services layer with repository pattern. Middleware for auth. Health endpoints.
Database: Prisma ORM with PostgreSQL (now CockroachDB, wire-compatible). Migration-first schema. Seed data for initial setup.
RBAC: Role-based with permission strings. Middleware checks on every request. Audit logging in middleware.
Testing: Vitest with separate configs for unit vs smoke. Smoke tests with real database and services. Security tests for RBAC.
CI/CD: Gitea Actions with lint→typecheck→test→build→publish pipeline. nfpm for RPM/DEB. Bun compile for standalone binaries. Podman for container images.
Deployment: Docker/Podman compose for dev stack. Portainer API for production deploy (we'll use k3s instead). systemd for local daemons.
Completions: Generated from Commander tree. Bash + Fish. --write and --check modes. Included in packages.
Key learnings applied:
- Start with proper monorepo structure (not flat scripts)
- Type safety across packages via workspace references
- Test-driven (unit tests before features)
- CI from the start (not retrofitted)
- RBAC and audit from the start (not bolted on)
- Database-first design (schema defines the domain)
12. Gitea Registry
- Registry: mysources.co.uk (self-hosted Gitea at 10.0.0.194)
- Token: stored at ~/.gitea-token, env var PACKAGES_TOKEN
- Packages: RPM and DEB published to Gitea packages API
- Container images: pushed to Gitea container registry
- API pattern: same as mcpctl publish scripts (check existing, delete, re-upload, link to repo)