lab/.taskmaster/docs/prd.md
Michal ffc4a782d2 docs: comprehensive PRD for taskmaster — labctl platform
Full product requirements covering: architecture, CLI commands,
partition layout, modules, testing strategy, cloud model, app model,
implementation phases, tech stack, and lessons from mcpctl.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-18 00:23:24 +00:00


labctl — Infrastructure Management Platform

Product Requirements Document

1. Overview

labctl is a unified infrastructure management platform for bare-metal servers, Kubernetes clusters, and cloud resources. It replaces Puppet with a modern, TypeScript-native system using Pulumi for infrastructure as code.

1.1 Core Principles

  • Single CLI (labctl) for all infrastructure operations
  • mTLS everywhere — built-in Certificate Authority, no SSH key management
  • RBAC from day one — deny by default, audit everything
  • Multi-cloud — bare metal now, AWS later, extensible to any cloud
  • Test infrastructure like code — ephemeral environments, smoke tests, security tests
  • Pulumi over Helm — TypeScript charts, typed, testable, no YAML templating

1.2 Current State (completed)

  • PXE bastion for bare-metal provisioning (discover, install, reprovision)
  • CLI with subcommands: labctl init bastion, labctl provision
  • LVM partitioning with reprovision data preservation (/home, /srv, /var/lib/longhorn, /var/lib/rancher)
  • Worker role (k3s agent + Longhorn) and infra role (k3s server + etcd)
  • 32 unit tests, VM smoke tests verified on real hardware
  • Multi-arch builds (x86_64 + arm64), RPM/DEB packaging, Gitea CI/CD
  • labd scaffold with CockroachDB Prisma schema (Server, Agent, User, Role, Permission, AuditLog, JoinToken, Cluster, PulumiRun)

1.3 Hardware

  • labmaster (puppet.ad.itaz.eu / 78:55:36:08:35:14): MinisForum SER9, AMD Ryzen 7 255, 8 cores / 16 threads, 27GB RAM, 1TB NVMe, infra role
  • Future: additional bare-metal worker nodes, AWS EC2 instances

2. Architecture

2.1 Components

labctl CLI → labd (master) → lab-agent (on every server)
                ↓
          CockroachDB

labctl — CLI binary installed on developer workstations. Compiled with bun to standalone binary. Distributed as RPM/DEB/binary.

labd — Master daemon running as k8s Deployment on labmaster's k3s cluster. Stateless (all state in CockroachDB). Multiple instances behind k8s Service for HA. Manages: CA, RBAC, agent registry, Pulumi executor, kubectl proxy, app deployments, log relay.

lab-agent — Lightweight daemon on every managed machine. Connects to labd via mTLS WebSocket. Handles: heartbeat, command execution, log streaming, module application. Compiled to standalone binary with bun. Installed via systemd service.

CockroachDB — Distributed SQL database. PostgreSQL wire-compatible (Prisma works unchanged). Single node to start, multi-node for HA. Stores: server state, RBAC, audit logs, certificates, kubeconfigs (encrypted), Pulumi state.

Bastion — PXE provisioning server. Runs as k8s pod with hostNetwork (needs DHCP/TFTP). Managed by labd as an "app". Multiple bastions for multiple sites.

2.2 Network Architecture

Cilium as k8s CNI (replacing default flannel):

  • eBPF-based pod networking
  • Built-in WireGuard encryption between nodes
  • Network policies (ties into RBAC)
  • Hubble for observability
  • Future: Cluster Mesh for multi-site transparent networking

No Tailscale dependency — Cilium handles node-to-node encryption. Agents connect to labd over standard TCP/TLS.

2.3 Authentication

mTLS with built-in Certificate Authority:

  1. labd generates root CA on first start (stored encrypted in CockroachDB)
  2. Agents enroll with join token → receive signed certificate
  3. CLI users authenticate with client certificates (or SSH key-based initial auth)
  4. All communication authenticated via mutual TLS
  5. Certificate rotation and revocation supported

Join tokens:

  • One-time tokens: for individual bare-metal servers (generated during PXE provision, embedded in kickstart)
  • Reusable tokens: for autoscaling groups (AWS ASG instances share a token)
  • Tokens can be revoked, have optional expiry
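
The token rules above (one-time vs reusable, revocation, optional expiry) can be sketched as a single redemption check. This is a minimal illustration, not the labd implementation: the `JoinTokenRecord` shape, its field names, and `redeemToken` are hypothetical and do not reflect the actual Prisma JoinToken model.

```typescript
// Hypothetical record shape; field names are assumptions for illustration only.
interface JoinTokenRecord {
  value: string;
  reusable: boolean; // false = one-time token (PXE), true = shared token (ASG)
  usedCount: number;
  revoked: boolean;
  expiresAt?: Date; // optional expiry
}

type RedeemResult =
  | { ok: true }
  | { ok: false; reason: "revoked" | "expired" | "already-used" };

// Checks a token and, if it is acceptable, records one more use.
function redeemToken(token: JoinTokenRecord, now: Date): RedeemResult {
  if (token.revoked) return { ok: false, reason: "revoked" };
  if (token.expiresAt !== undefined && token.expiresAt.getTime() <= now.getTime()) {
    return { ok: false, reason: "expired" };
  }
  if (!token.reusable && token.usedCount > 0) {
    return { ok: false, reason: "already-used" };
  }
  token.usedCount += 1;
  return { ok: true };
}

const pxeToken: JoinTokenRecord = { value: "example", reusable: false, usedCount: 0, revoked: false };
const enrolledAt = new Date("2026-03-18T00:00:00Z");
console.log(redeemToken(pxeToken, enrolledAt)); // { ok: true }
console.log(redeemToken(pxeToken, enrolledAt)); // { ok: false, reason: 'already-used' }
```

In a real daemon the check and the counter update would need to happen in one database transaction, otherwise two agents could redeem the same one-time token concurrently.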

2.4 RBAC Model

Inspired by mcpctl's RBAC (src/mcpd/src/services/, middleware/auth). Hierarchical permissions:

action:cloud:environment:server

Examples:
  read:*:*:*                    — read everything
  exec:baremetal:lab:*          — exec on any lab bare-metal server
  kubectl:*:*:*                 — kubectl proxy on any cluster
  *:baremetal:lab:puppet        — full access to puppet server only
  manage:*:*:*                  — manage apps, clusters, tokens
  admin:*:*:*                   — full admin (create users, roles)

  • Resources: servers, environments, clouds, modules, roles, users, clusters, apps, pulumi-stacks
  • Actions: read, exec, apply, destroy, manage, admin, kubectl
  • Deny rules: explicit deny overrides any allow (like AWS IAM)

Prisma models: Role, Permission (allow/deny), UserRole binding.
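
The evaluation order described in this section (deny by default, wildcard segments, explicit deny overriding allow) could look roughly like the sketch below. The `PermissionRule` type and the function names are hypothetical; only the `action:cloud:environment:server` string format comes from this document.

```typescript
// Hypothetical types; only the four-segment string format is taken from the PRD.
type Effect = "allow" | "deny";

interface PermissionRule {
  effect: Effect;
  pattern: string; // e.g. "exec:baremetal:lab:*"
}

// True when every pattern segment is "*" or equals the matching request segment.
function matchesPattern(pattern: string, request: string): boolean {
  const p = pattern.split(":");
  const r = request.split(":");
  if (p.length !== 4 || r.length !== 4) return false; // malformed strings never match
  return p.every((segment, i) => segment === "*" || segment === r[i]);
}

// Deny by default; any matching deny rule wins over any matching allow rule.
function isAllowed(rules: PermissionRule[], request: string): boolean {
  const matching = rules.filter((rule) => matchesPattern(rule.pattern, request));
  if (matching.some((rule) => rule.effect === "deny")) return false;
  return matching.some((rule) => rule.effect === "allow");
}

const rules: PermissionRule[] = [
  { effect: "allow", pattern: "exec:baremetal:lab:*" },
  { effect: "deny", pattern: "destroy:*:*:*" },
  { effect: "allow", pattern: "*:baremetal:lab:puppet" },
];

console.log(isAllowed(rules, "exec:baremetal:lab:ser9")); // true
console.log(isAllowed(rules, "destroy:baremetal:lab:puppet")); // false (deny overrides allow)
console.log(isAllowed(rules, "read:aws:production:i-abc123")); // false (no rule matches)
```

Rejecting malformed patterns outright keeps a typo in a role definition from silently widening access.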

2.5 Database

CockroachDB chosen over PostgreSQL and Cassandra:

  • PostgreSQL wire-compatible — Prisma works, mcpctl patterns reusable
  • Multi-master replication — any node accepts reads AND writes
  • Strong consistency (not eventual like Cassandra)
  • Survives node failures (3 nodes = 1 failure; 5 nodes = 2 once the replication factor is raised from the default 3 to 5)
  • Auto-rebalancing when adding nodes
  • Start single-node, scale to multi-node with zero code changes (just add nodes)
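
The failure-tolerance figures above follow from majority quorum: a replicated range stays available as long as more than half of its replicas survive. A small worked sketch, with the function name `toleratedFailures` chosen here purely for illustration:

```typescript
// Number of simultaneous replica losses a majority-quorum group can survive.
// Tolerance depends on the replication factor, not on the raw node count.
function toleratedFailures(replicationFactor: number): number {
  if (!Number.isInteger(replicationFactor) || replicationFactor < 1) {
    throw new RangeError("replicationFactor must be a positive integer");
  }
  return Math.floor((replicationFactor - 1) / 2);
}

console.log(toleratedFailures(1)); // 0 (single node, no redundancy)
console.log(toleratedFailures(3)); // 1
console.log(toleratedFailures(5)); // 2
```

An even replication factor buys nothing extra: four replicas still tolerate only one failure, which is why odd sizes are the usual choice.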

Schema (already scaffolded in Prisma):

  • Server — managed machines (hostname, mac, cloud, env, role, labels, status)
  • Agent — connected agents (cert, enrollment, last seen)
  • User — platform users (username, cert fingerprint)
  • Role — RBAC roles with permissions
  • Permission — allow/deny rules (action:cloud:env:server)
  • UserRole — user-to-role bindings
  • JoinToken — enrollment tokens (one-time, reusable, revocable)
  • AuditLog — every action logged (user, session, action, resource, result, duration)
  • PulumiRun — infrastructure-as-code execution records
  • Cluster — managed k8s clusters (kubeconfig encrypted)

3. CLI Command Reference

3.1 Bastion (PXE Provisioning) — IMPLEMENTED

sudo labctl init bastion standalone start [--foreground] [--port 8080]
sudo labctl init bastion standalone stop
labctl init bastion standalone status

3.2 Provisioning — IMPLEMENTED

labctl provision list
labctl provision install <mac> <hostname> --role worker|infra
labctl provision reprovision <mac> <hostname> --role worker|infra
labctl provision forget <mac>

3.3 Server Management — TO BUILD

labctl get servers [--env NAME] [--cloud NAME] [--label KEY=VALUE]
labctl describe server/<name>

3.4 Remote Execution — TO BUILD

labctl exec server/<name> -- <command>
labctl exec server/<name> -it -- bash          # interactive TTY
labctl exec server/<name> --timeout 30s -- cmd

3.5 Kubernetes Proxy — TO BUILD

labctl kubectl --cluster <name> <kubectl-args>
labctl clusters add <name> --kubeconfig <path>
labctl clusters list
labctl clusters remove <name>

3.6 Logs — TO BUILD

# Server logs (journalctl passthrough, no DB in hot path)
labctl logs server/<name>                     # all journal
labctl logs server/<name> -f                  # follow (live WebSocket relay)
labctl logs server/<name> -n 100              # last 100 lines
labctl logs server/<name> -u k3s              # specific unit
labctl logs server/<name> -u sshd --since "1h ago"
labctl logs server/<name> -k                  # kernel
labctl logs server/<name> -p err              # errors only
labctl logs server/<name> --file /var/log/nginx/error.log

# App logs (k8s pod logs)
labctl logs app/<name> [-f] [--container NAME]

# Pulumi execution logs
labctl logs pulumi/<run-id> [-f]

# Bastion logs
labctl logs bastion/<env> [--mac MAC]

# Agent daemon logs
labctl logs agent/<server>

# Audit logs (from CockroachDB)
labctl logs audit [--user NAME] [--action ACTION] [--since TIME]
labctl logs audit/<user-date-sessionid>       # specific session

Log architecture: agent runs journalctl/tail with user-provided flags, streams stdout over WebSocket to labd, labd relays to CLI. No database in the hot path. Future: Grafana Loki integration for cold storage.
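
One way the agent might translate the `labctl logs server/<name>` flags into a journalctl invocation is sketched below. The `LogOptions` shape and `buildJournalctlArgs` are hypothetical names; the journalctl flags themselves (`-f`, `-n`, `-u`, `--since`, `-k`, `-p`, `--no-pager`) are standard. The sketch builds an argument array from an allow-list of known options instead of forwarding raw user input, on the assumption that the agent spawns journalctl without a shell.

```typescript
// Hypothetical option shape mirroring the labctl logs flags listed above.
interface LogOptions {
  follow?: boolean; // -f
  lines?: number; // -n
  unit?: string; // -u
  since?: string; // --since
  kernel?: boolean; // -k
  priority?: string; // -p
}

// Builds an argv array for journalctl; values are passed as separate
// arguments, so they are never interpreted by a shell.
function buildJournalctlArgs(opts: LogOptions): string[] {
  const args: string[] = ["--no-pager"];
  if (opts.follow) args.push("-f");
  if (opts.lines !== undefined) {
    if (!Number.isInteger(opts.lines) || opts.lines < 0) {
      throw new RangeError("lines must be a non-negative integer");
    }
    args.push("-n", String(opts.lines));
  }
  if (opts.unit !== undefined) args.push("-u", opts.unit);
  if (opts.since !== undefined) args.push("--since", opts.since);
  if (opts.kernel) args.push("-k");
  if (opts.priority !== undefined) args.push("-p", opts.priority);
  return args;
}

console.log(buildJournalctlArgs({ unit: "sshd", since: "1h ago" }));
// [ '--no-pager', '-u', 'sshd', '--since', '1h ago' ]
```

The resulting array would be handed to a process-spawning call on the agent, with stdout piped into the WebSocket stream.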

3.7 Apps (Pulumi Charts, replacing Helm) — TO BUILD

labctl apps list
labctl apps install <name> [--set key=value] [-f values.yaml]
labctl apps status <name>
labctl apps upgrade <name>
labctl apps history <name>
labctl apps rollback <name> <version>
labctl apps uninstall <name>

3.8 Infrastructure as Code — TO BUILD

labctl apply -f <file.ts> --env <env>
labctl plan -f <file.ts> --env <env>
labctl destroy -f <file.ts> --env <env>

3.9 RBAC — TO BUILD

labctl get roles
labctl get users
labctl create role <name> --allow "action:cloud:env:server"
labctl create role <name> --deny "destroy:*:*:*"
labctl bind role <role> --user <user>
labctl unbind role <role> --user <user>
labctl get permissions

3.10 Environments and Clouds — TO BUILD

labctl get environments
labctl get clouds
labctl create environment <name> --cloud <cloud>

4. Partition Layout

Worker Role

/boot/efi       600MB  EFI
/boot           3GB    ext4
── LVM VG: labvg ──
  swap          27GB
  /             33GB   xfs
  /var          100GB  xfs
  /var/log      10GB   xfs
  /home         10GB   xfs         ← preserved on reprovision
  /srv          20GB   xfs         ← preserved on reprovision
  /var/lib/longhorn  rest  xfs     ← preserved (Longhorn PVC storage)
  /tmp          tmpfs 4GB

Infra Role

/boot/efi       600MB  EFI
/boot           3GB    ext4
── LVM VG: labvg ──
  swap          27GB
  /             33GB   xfs
  /var          100GB  xfs
  /var/log      10GB   xfs
  /home         10GB   xfs         ← preserved on reprovision
  /srv          20GB   xfs         ← preserved on reprovision
  /var/lib/rancher  20GB  xfs      ← preserved (k3s etcd data)
  /tmp          tmpfs 4GB
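
The two layouts differ only in their last logical volume, so they could be modelled as shared volumes plus a per-role addition. The following is a sketch under that assumption; the type and constant names are hypothetical, while mount points, sizes, and preservation flags are copied from the tables above.

```typescript
type Role = "worker" | "infra";

interface LogicalVolume {
  mount: string;
  size: string; // as written in the layout, e.g. "100GB" or "rest"
  preserved: boolean; // kept on reprovision
}

// Volumes shared by both roles inside VG labvg.
const commonVolumes: LogicalVolume[] = [
  { mount: "swap", size: "27GB", preserved: false },
  { mount: "/", size: "33GB", preserved: false },
  { mount: "/var", size: "100GB", preserved: false },
  { mount: "/var/log", size: "10GB", preserved: false },
  { mount: "/home", size: "10GB", preserved: true },
  { mount: "/srv", size: "20GB", preserved: true },
];

// Role-specific volumes.
const roleVolumes: Record<Role, LogicalVolume[]> = {
  worker: [{ mount: "/var/lib/longhorn", size: "rest", preserved: true }],
  infra: [{ mount: "/var/lib/rancher", size: "20GB", preserved: true }],
};

// Mount points that a reprovision must leave untouched for the given role.
function preservedMounts(role: Role): string[] {
  return [...commonVolumes, ...roleVolumes[role]]
    .filter((volume) => volume.preserved)
    .map((volume) => volume.mount);
}

console.log(preservedMounts("worker")); // [ '/home', '/srv', '/var/lib/longhorn' ]
console.log(preservedMounts("infra")); // [ '/home', '/srv', '/var/lib/rancher' ]
```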

5. Module System

Configuration modules define desired state. Three tiers:

  1. Core modules (this repo, modules/): k3s-server, k3s-agent, labd, lab-agent, bastion
  2. Official modules (separate repos): monitoring, cilium, DNS
  3. Custom modules (user repos): pulled by git URL

Module structure:

module.yaml          # name, version, targets (roles/labels), deps
src/index.ts         # entry point
src/install.ts       # installation logic
src/configure.ts     # configuration logic
src/health.ts        # health check
tests/               # vitest tests (mandatory)
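
Because module.yaml declares deps, something has to turn those declarations into an apply order. A minimal sketch using a depth-first topological sort follows; the `ModuleManifest` shape, the version strings, and the dependency edges in the example are all invented for illustration and are not taken from any real module.yaml.

```typescript
// Hypothetical subset of module.yaml fields.
interface ModuleManifest {
  name: string;
  version: string;
  deps: string[];
}

// Returns module names ordered so that every module comes after its deps.
// Throws on unknown dependencies and on dependency cycles.
function resolveApplyOrder(manifests: ModuleManifest[]): string[] {
  const byName = new Map<string, ModuleManifest>();
  for (const manifest of manifests) byName.set(manifest.name, manifest);

  const order: string[] = [];
  const state = new Map<string, "visiting" | "done">();

  const visit = (name: string, path: string[]): void => {
    const manifest = byName.get(name);
    if (manifest === undefined) throw new Error(`unknown module: ${name}`);
    if (state.get(name) === "done") return;
    if (state.get(name) === "visiting") {
      throw new Error(`dependency cycle: ${[...path, name].join(" -> ")}`);
    }
    state.set(name, "visiting");
    for (const dep of manifest.deps) visit(dep, [...path, name]);
    state.set(name, "done");
    order.push(name);
  };

  for (const manifest of manifests) visit(manifest.name, []);
  return order;
}

// Example edges are assumptions, not the real dependency graph.
const applyOrder = resolveApplyOrder([
  { name: "lab-agent", version: "0.1.0", deps: [] },
  { name: "k3s-agent", version: "0.1.0", deps: ["lab-agent"] },
  { name: "cilium", version: "0.1.0", deps: ["k3s-server"] },
  { name: "k3s-server", version: "0.1.0", deps: ["lab-agent"] },
]);
console.log(applyOrder); // [ 'lab-agent', 'k3s-agent', 'k3s-server', 'cilium' ]
```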

6. Testing Strategy

6.1 Testing Pyramid

Unit Tests        → pure logic, milliseconds, every commit
Smoke Tests       → containers (podman-compose), minutes, every commit
Integration Tests → VMs (libvirt), 10-15 min, PRs
E2E Tests         → real hardware/cloud, 20-30 min, pre-release

6.2 Smoke Test Stack (podman-compose)

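services:
  cockroachdb:
    image: cockroachdb/cockroach:latest-v24.3
    # The image needs an explicit start command; insecure single node is for smoke tests only
    command: start-single-node --insecure
  labd:
    build: .
    # depends_on orders startup only; labd should still retry its DB connection
    depends_on: [cockroachdb]
  agent-1:
    build: ./agent
    depends_on: [labd]
  agent-2:
    build: ./agent
    depends_on: [labd]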

Tests: agent enrollment, certificate issuance, heartbeat, exec, logs, RBAC deny/allow.

6.3 Security Tests (RBAC)

  • Deny exec without permission
  • Deny cross-environment access
  • Deny rules override allow rules
  • Cannot escalate own permissions
  • Audit logs all denied attempts
  • Certificate-based auth cannot be spoofed
  • Join tokens cannot be reused (one-time)
  • Expired tokens rejected

6.4 Ephemeral Test Environments

labctl test smoke                                    # podman-compose
labctl test integration                              # libvirt VMs
labctl env create pr-123 --cloud containers          # CI ephemeral
labctl env create pr-123 --cloud aws                 # cloud ephemeral (future)

6.5 Health Gates for Deployment

Before promoting to production, ALL must pass:

  • labd API responds
  • Expected number of agents connected
  • k3s nodes Ready
  • Certificates valid (>30 days)
  • RBAC smoke test passes
  • No error logs in last 5 minutes
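
The gate list above amounts to a conjunction of checks, and reporting which gates failed is more useful than a bare boolean. A sketch follows, assuming the raw measurements have already been collected into a hypothetical `HealthSnapshot`; how each value is gathered is out of scope here, and treating "expected number of agents" as an exact match is an assumption.

```typescript
// Hypothetical snapshot of the measurements behind each gate.
interface HealthSnapshot {
  apiResponds: boolean;
  agentsConnected: number;
  agentsExpected: number;
  nodesReady: boolean;
  certDaysRemaining: number;
  rbacSmokePassed: boolean;
  errorLogsLast5Min: number;
}

// Returns the names of all gates that failed; an empty array means promote.
function failedGates(s: HealthSnapshot): string[] {
  const gates: Array<[string, boolean]> = [
    ["labd API responds", s.apiResponds],
    // Assumption: "expected number" means an exact match.
    ["expected number of agents connected", s.agentsConnected === s.agentsExpected],
    ["k3s nodes Ready", s.nodesReady],
    ["certificates valid (>30 days)", s.certDaysRemaining > 30],
    ["RBAC smoke test passes", s.rbacSmokePassed],
    ["no error logs in last 5 minutes", s.errorLogsLast5Min === 0],
  ];
  return gates.filter(([, passed]) => !passed).map(([name]) => name);
}

function canPromote(s: HealthSnapshot): boolean {
  return failedGates(s).length === 0;
}

const snapshot: HealthSnapshot = {
  apiResponds: true,
  agentsConnected: 2,
  agentsExpected: 2,
  nodesReady: true,
  certDaysRemaining: 30,
  rbacSmokePassed: true,
  errorLogsLast5Min: 0,
};
console.log(failedGates(snapshot)); // [ 'certificates valid (>30 days)' ]
console.log(canPromote(snapshot)); // false
```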

7. Cloud/Environment Model

Cloud: baremetal
  └── Environment: lab
       ├── Server: labmaster.ad.itaz.eu (infra, labels={k3s=server})
       └── Server: ser9.ad.itaz.eu (worker, labels={k3s=agent})

Cloud: aws (future)
  └── Environment: production
       ├── Server: i-abc123 (from ASG web-servers)
       └── Server: i-def456 (from ASG web-servers)

Each bastion creates an environment under baremetal cloud. AWS autoscaling groups create environments under aws cloud.
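
The cloud → environment → server hierarchy maps naturally onto the filters of `labctl get servers [--env NAME] [--cloud NAME] [--label KEY=VALUE]`. A minimal sketch follows, with hypothetical `ServerRecord` and `ServerFilter` types; the inventory entries reuse the hostnames and labels from the tree above, and the optional `role` on cloud instances is an assumption.

```typescript
interface ServerRecord {
  hostname: string;
  cloud: string;
  env: string;
  role?: string; // assumption: ASG instances may have no bare-metal role
  labels: Record<string, string>;
}

interface ServerFilter {
  cloud?: string;
  env?: string;
  label?: string; // "KEY=VALUE"
}

// Splits "KEY=VALUE" at the first "="; the value may itself contain "=".
function parseLabel(label: string): [string, string] {
  const idx = label.indexOf("=");
  if (idx <= 0) throw new Error(`invalid label filter, expected KEY=VALUE: ${label}`);
  return [label.slice(0, idx), label.slice(idx + 1)];
}

// All given filters must match; omitted filters match everything.
function filterServers(servers: ServerRecord[], filter: ServerFilter): ServerRecord[] {
  const label = filter.label === undefined ? undefined : parseLabel(filter.label);
  return servers.filter(
    (s) =>
      (filter.cloud === undefined || s.cloud === filter.cloud) &&
      (filter.env === undefined || s.env === filter.env) &&
      (label === undefined || s.labels[label[0]] === label[1]),
  );
}

const inventory: ServerRecord[] = [
  { hostname: "labmaster.ad.itaz.eu", cloud: "baremetal", env: "lab", role: "infra", labels: { k3s: "server" } },
  { hostname: "ser9.ad.itaz.eu", cloud: "baremetal", env: "lab", role: "worker", labels: { k3s: "agent" } },
  { hostname: "i-abc123", cloud: "aws", env: "production", labels: {} },
];

console.log(filterServers(inventory, { env: "lab", label: "k3s=agent" }).map((s) => s.hostname));
// [ 'ser9.ad.itaz.eu' ]
```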

8. App Model (Pulumi Charts)

Each app is a Pulumi TypeScript program:

app.yaml             # name, version, inputs schema, required permissions
src/index.ts         # Pulumi program
values.yaml          # defaults
tests/               # vitest tests
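
An app's inputs come from three places: the defaults in values.yaml, an optional `-f values.yaml` override file, and `--set key=value` flags. The PRD does not state a precedence, so the sketch below assumes the Helm-like order (defaults, then file, then `--set`) and flat string keys; `resolveValues` and `parseSetFlag` are hypothetical names.

```typescript
type Values = Record<string, string>;

// Splits "key=value" at the first "="; the value may itself contain "=".
function parseSetFlag(flag: string): [string, string] {
  const idx = flag.indexOf("=");
  if (idx <= 0) throw new Error(`invalid --set value, expected key=value: ${flag}`);
  return [flag.slice(0, idx), flag.slice(idx + 1)];
}

// Assumed precedence, lowest to highest: defaults, values file, --set flags.
function resolveValues(defaults: Values, fileValues: Values, setFlags: string[]): Values {
  const overrides: Values = {};
  for (const flag of setFlags) {
    const [key, value] = parseSetFlag(flag);
    overrides[key] = value;
  }
  return { ...defaults, ...fileValues, ...overrides };
}

const resolved = resolveValues(
  { replicas: "1", image: "labd:latest" },
  { replicas: "2" },
  ["replicas=3", "logLevel=debug"],
);
console.log(resolved); // { replicas: '3', image: 'labd:latest', logLevel: 'debug' }
```

The resolved map would then be validated against the inputs schema in app.yaml before being passed to the Pulumi program.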

First apps to build:

  • bastion — PXE provisioning (wrap existing code)
  • labd — master daemon (self-deployment)
  • cockroachdb — database
  • cilium — CNI

9. Implementation Phases

Phase 1: Foundation (PARTIALLY DONE)

  • PXE bastion (discover, install, reprovision) (done)
  • CLI structure (labctl init/provision) (done)
  • labd scaffold (Fastify + CockroachDB/Prisma schema) (done)
  • Multi-arch builds, packaging, CI/CD (done)
  • Certificate Authority in labd
  • lab-agent skeleton (connect, heartbeat, enrollment)
  • Agent enrollment via join tokens
  • RBAC engine
  • labctl exec (remote execution)
  • labctl logs (resource-scoped streaming)
  • labctl get servers (with filters)
  • Smoke test stack (podman-compose)

Phase 2: Deployment

  • Reprovision labmaster as labmaster.ad.itaz.eu
  • Deploy k3s with Cilium CNI
  • Deploy CockroachDB on k3s
  • Deploy labd on k3s
  • Deploy bastion as managed app
  • Auto-enroll agents during PXE provision

Phase 3: Infrastructure as Code

  • Module system
  • Pulumi charts (replacing Helm)
  • labctl apps install/upgrade/rollback
  • labctl apply -f (Pulumi execution)
  • kubectl proxy (audited)
  • Kubeconfig store (encrypted)

Phase 4: Multi-Cloud

  • AWS provider (Pulumi)
  • Reusable join tokens for ASGs
  • Cilium Cluster Mesh
  • Ephemeral test environments
  • Grafana Loki for cold logs

10. Technology Stack

| Component | Technology | Notes |
|---|---|---|
| Language | TypeScript (ESM) | Same for CLI, daemon, agents, IaC |
| CLI | Commander.js | Matches mcpctl patterns |
| HTTP Server | Fastify + WebSocket | labd and bastion |
| Database | CockroachDB | PostgreSQL compatible, Prisma ORM |
| ORM | Prisma | Reuse mcpctl patterns |
| IaC | Pulumi (TypeScript) | Replaces Helm and Puppet |
| k8s CNI | Cilium | eBPF, WireGuard, network policies |
| Auth | mTLS (built-in CA) | Certificate-based, no SSH keys |
| Packaging | nfpm (RPM/DEB) | bun compile for standalone binary |
| Containers | Podman + podman-compose | No Docker dependency |
| CI/CD | Gitea Actions | Self-hosted on mysources.co.uk |
| Testing | Vitest | Unit + smoke + integration |
| Registry | Gitea packages | RPM, DEB, container images |

11. Lessons from mcpctl

The mcpctl project (../mcpctl/) established patterns reused here:

Project structure: pnpm monorepo with workspace packages (shared, cli, daemon). Each package has own package.json, tsconfig.json, vitest.config.ts.

CLI patterns: Commander.js with factory functions (createXxxCommand). Global options (--project → --env/--cloud). Resource CRUD (get, describe, delete, create, apply).

Server patterns: Fastify with route registration functions. Services layer with repository pattern. Middleware for auth. Health endpoints.

Database: Prisma ORM with PostgreSQL (now CockroachDB, wire-compatible). Migration-first schema. Seed data for initial setup.

RBAC: Role-based with permission strings. Middleware checks on every request. Audit logging in middleware.

Testing: Vitest with separate configs for unit vs smoke. Smoke tests with real database and services. Security tests for RBAC.

CI/CD: Gitea Actions with lint→typecheck→test→build→publish pipeline. nfpm for RPM/DEB. Bun compile for standalone binaries. Podman for container images.

Deployment: Docker/Podman compose for dev stack. Portainer API for production deploy (we'll use k3s instead). systemd for local daemons.

Completions: Generated from Commander tree. Bash + Fish. --write and --check modes. Included in packages.

Key learnings applied:

  • Start with proper monorepo structure (not flat scripts)
  • Type safety across packages via workspace references
  • Test-driven (unit tests before features)
  • CI from the start (not retrofitted)
  • RBAC and audit from the start (not bolted on)
  • Database-first design (schema defines the domain)

12. Gitea Registry

  • Registry: mysources.co.uk (self-hosted Gitea at 10.0.0.194)
  • Token: stored at ~/.gitea-token, env var PACKAGES_TOKEN
  • Packages: RPM and DEB published to Gitea packages API
  • Container images: pushed to Gitea container registry
  • API pattern: Same as mcpctl publish scripts (check existing, delete, re-upload, link to repo)