michal/lab

Michal ffc4a782d2 docs: comprehensive PRD for taskmaster — labctl platform

Full product requirements covering: architecture, CLI commands,
partition layout, modules, testing strategy, cloud model, app model,
implementation phases, tech stack, and lessons from mcpctl.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

2026-03-18 00:23:24 +00:00

labctl — Infrastructure Management Platform

Product Requirements Document

1. Overview

labctl is a unified infrastructure management platform for bare-metal servers, Kubernetes clusters, and cloud resources. It replaces Puppet with a modern, TypeScript-native system using Pulumi for infrastructure as code.

1.1 Core Principles

Single CLI (labctl) for all infrastructure operations
mTLS everywhere — built-in Certificate Authority, no SSH key management
RBAC from day one — deny by default, audit everything
Multi-cloud — bare metal now, AWS later, extensible to any cloud
Test infrastructure like code — ephemeral environments, smoke tests, security tests
Pulumi over Helm — TypeScript charts, typed, testable, no YAML templating

1.2 Current State (completed)

PXE bastion for bare-metal provisioning (discover, install, reprovision)
CLI with subcommands: labctl init bastion, labctl provision
LVM partitioning with reprovision data preservation (/home, /srv, /var/lib/longhorn, /var/lib/rancher)
Worker role (k3s agent + Longhorn) and infra role (k3s server + etcd)
32 unit tests, VM smoke tests verified on real hardware
Multi-arch builds (x86_64 + arm64), RPM/DEB packaging, Gitea CI/CD
labd scaffold with CockroachDB Prisma schema (Server, Agent, User, Role, Permission, AuditLog, JoinToken, Cluster, PulumiRun)

1.3 Hardware

labmaster (puppet.ad.itaz.eu / 78:55:36:08:35:14): MinisForum SER9, AMD Ryzen 7 255, 16 cores, 27GB RAM, 1TB NVMe, infra role
Future: additional bare-metal worker nodes, AWS EC2 instances

2. Architecture

2.1 Components

labctl CLI → labd (master) → lab-agent (on every server)
                ↓
          CockroachDB

labctl — CLI binary installed on developer workstations. Compiled with bun to standalone binary. Distributed as RPM/DEB/binary.

labd — Master daemon running as k8s Deployment on labmaster's k3s cluster. Stateless (all state in CockroachDB). Multiple instances behind k8s Service for HA. Manages: CA, RBAC, agent registry, Pulumi executor, kubectl proxy, app deployments, log relay.

lab-agent — Lightweight daemon on every managed machine. Connects to labd via mTLS WebSocket. Handles: heartbeat, command execution, log streaming, module application. Compiled to standalone binary with bun. Installed via systemd service.

CockroachDB — Distributed SQL database. PostgreSQL wire-compatible (Prisma works unchanged). Single node to start, multi-node for HA. Stores: server state, RBAC, audit logs, certificates, kubeconfigs (encrypted), Pulumi state.

Bastion — PXE provisioning server. Runs as k8s pod with hostNetwork (needs DHCP/TFTP). Managed by labd as an "app". Multiple bastions for multiple sites.

2.2 Network Architecture

Cilium as k8s CNI (replacing default flannel):

eBPF-based pod networking
Built-in WireGuard encryption between nodes
Network policies (ties into RBAC)
Hubble for observability
Future: Cluster Mesh for multi-site transparent networking

No Tailscale dependency — Cilium handles node-to-node encryption. Agents connect to labd over standard TCP/TLS.

2.3 Authentication

mTLS with built-in Certificate Authority:

labd generates root CA on first start (stored encrypted in CockroachDB)
Agents enroll with join token → receive signed certificate
CLI users authenticate with client certificates (or SSH key-based initial auth)
All communication authenticated via mutual TLS
Certificate rotation and revocation supported

Join tokens:

One-time tokens: for individual bare-metal servers (generated during PXE provision, embedded in kickstart)
Reusable tokens: for autoscaling groups (AWS ASG instances share a token)
Tokens can be revoked, have optional expiry
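
The token rules above can be sketched as a single pure validation step. This is a minimal illustration, not the labd implementation: the `JoinToken` fields, the `Redemption` type, and `redeemJoinToken` are hypothetical names, and the real Prisma model may differ.

```typescript
// Hypothetical shape; the scaffolded JoinToken Prisma model may use other fields.
interface JoinToken {
  id: string;
  kind: "one-time" | "reusable";
  revoked: boolean;
  expiresAt?: Date;      // optional expiry
  redeemedCount: number; // number of agents enrolled with this token so far
}

type Redemption =
  | { status: "accepted"; token: JoinToken }
  | { status: "rejected"; reason: "revoked" | "expired" | "already-used" };

// Decide whether an enrollment attempt may proceed. Returns the updated token
// on success so the caller can persist the new redemption count.
function redeemJoinToken(token: JoinToken, now: Date): Redemption {
  if (token.revoked) {
    return { status: "rejected", reason: "revoked" };
  }
  if (token.expiresAt !== undefined && token.expiresAt.getTime() <= now.getTime()) {
    return { status: "rejected", reason: "expired" };
  }
  if (token.kind === "one-time" && token.redeemedCount > 0) {
    return { status: "rejected", reason: "already-used" };
  }
  return {
    status: "accepted",
    token: { ...token, redeemedCount: token.redeemedCount + 1 },
  };
}
```

In a real deployment the check and the counter update would need to run in one database transaction, otherwise two agents could race on the same one-time token.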

2.4 RBAC Model

Inspired by mcpctl's RBAC (src/mcpd/src/services/, middleware/auth). Hierarchical permissions:

action:cloud:environment:server

Examples:
  read:*:*:*                    — read everything
  exec:baremetal:lab:*          — exec on any lab bare-metal server
  kubectl:*:*:*                 — kubectl proxy on any cluster
  *:baremetal:lab:puppet        — full access to puppet server only
  manage:*:*:*                  — manage apps, clusters, tokens
  admin:*:*:*                   — full admin (create users, roles)

Resources: servers, environments, clouds, modules, roles, users, clusters, apps, pulumi-stacks
Actions: read, exec, apply, destroy, manage, admin, kubectl
Deny rules: explicit deny overrides any allow (like AWS IAM)

Prisma models: Role, Permission (allow/deny), UserRole binding.
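
A minimal sketch of how the `action:cloud:environment:server` strings could be evaluated, assuming segment-wise matching where `*` matches any value. The `Permission` shape and the function names are illustrative, not taken from mcpctl or labd. The sketch also does not model any hierarchy between actions (for example whether `admin` implies `exec`), since the PRD leaves that open.

```typescript
type Effect = "allow" | "deny";

// Illustrative shape; the real Permission Prisma model may differ.
interface Permission {
  effect: Effect;
  pattern: string; // "action:cloud:environment:server", "*" matches any segment
}

// Compare a pattern against a concrete request, segment by segment.
function matches(pattern: string, request: string): boolean {
  const p = pattern.split(":");
  const r = request.split(":");
  if (p.length !== 4 || r.length !== 4) return false;
  return p.every((segment, i) => segment === "*" || segment === r[i]);
}

// Deny by default; an explicit deny overrides any matching allow.
function isAllowed(permissions: Permission[], request: string): boolean {
  const hits = permissions.filter((perm) => matches(perm.pattern, request));
  if (hits.some((perm) => perm.effect === "deny")) return false;
  return hits.some((perm) => perm.effect === "allow");
}
```

With the rules `allow exec:baremetal:lab:*` and `deny exec:baremetal:lab:puppet`, a request for `exec:baremetal:lab:ser9` is allowed, `exec:baremetal:lab:puppet` is denied, and `read:baremetal:lab:ser9` is denied because nothing allows it.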

2.5 Database

CockroachDB chosen over PostgreSQL and Cassandra:

PostgreSQL wire-compatible — Prisma works, mcpctl patterns reusable
Multi-master replication — any node accepts reads AND writes
Strong consistency (not eventual like Cassandra)
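Survives node failures (3 replicas tolerate 1 failure, 5 replicas tolerate 2; the default replication factor is 3, so it must be raised to 5 to get the second figure)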
Auto-rebalancing when adding nodes
Start single-node, scale to multi-node with zero code changes (just add nodes)
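
The failure-tolerance figures in the list above follow from majority quorum: data stays available while more than half of its replicas are alive. A small sketch of that arithmetic, with `tolerableFailures` as an illustrative helper name:

```typescript
// Number of replicas that can be lost while a majority still survives.
// Assumes simple majority quorum over `replicas` copies of the data.
function tolerableFailures(replicas: number): number {
  if (!Number.isInteger(replicas) || replicas < 1) {
    throw new RangeError("replicas must be a positive integer");
  }
  return Math.floor((replicas - 1) / 2);
}
```

A single node tolerates no failures, which is the accepted trade-off for the initial single-node setup.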

Schema (already scaffolded in Prisma):

Server — managed machines (hostname, mac, cloud, env, role, labels, status)
Agent — connected agents (cert, enrollment, last seen)
User — platform users (username, cert fingerprint)
Role — RBAC roles with permissions
Permission — allow/deny rules (action:cloud:env:server)
UserRole — user-to-role bindings
JoinToken — enrollment tokens (one-time, reusable, revocable)
AuditLog — every action logged (user, session, action, resource, result, duration)
PulumiRun — infrastructure-as-code execution records
Cluster — managed k8s clusters (kubeconfig encrypted)

3. CLI Command Reference

3.1 Bastion (PXE Provisioning) — IMPLEMENTED

sudo labctl init bastion standalone start [--foreground] [--port 8080]
sudo labctl init bastion standalone stop
labctl init bastion standalone status

3.2 Provisioning — IMPLEMENTED

labctl provision list
labctl provision install <mac> <hostname> --role worker|infra
labctl provision reprovision <mac> <hostname> --role worker|infra
labctl provision forget <mac>

3.3 Server Management — TO BUILD

labctl get servers [--env NAME] [--cloud NAME] [--label KEY=VALUE]
labctl describe server/<name>

3.4 Remote Execution — TO BUILD

labctl exec server/<name> -- <command>
labctl exec server/<name> -it -- bash          # interactive TTY
labctl exec server/<name> --timeout 30s -- cmd
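
The `--timeout 30s` form implies a duration parser in the CLI. A minimal sketch, assuming the units `ms`, `s`, `m`, and `h` are accepted (the PRD only shows `s`); `parseTimeout` is a hypothetical helper name:

```typescript
// Supported unit suffixes and their length in milliseconds (assumed set).
const UNIT_MS = { ms: 1, s: 1_000, m: 60_000, h: 3_600_000 } as const;
type Unit = keyof typeof UNIT_MS;

// Convert a string such as "30s" into milliseconds; reject anything else.
function parseTimeout(value: string): number {
  const match = /^(\d+)(ms|s|m|h)$/.exec(value.trim());
  if (match === null) {
    throw new Error(`invalid timeout "${value}", expected e.g. 30s or 5m`);
  }
  return Number(match[1]) * UNIT_MS[match[2] as Unit];
}
```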

3.5 Kubernetes Proxy — TO BUILD

labctl kubectl --cluster <name> <kubectl-args>
labctl clusters add <name> --kubeconfig <path>
labctl clusters list
labctl clusters remove <name>

3.6 Logs — TO BUILD

# Server logs (journalctl passthrough, no DB in hot path)
labctl logs server/<name>                     # all journal
labctl logs server/<name> -f                  # follow (live WebSocket relay)
labctl logs server/<name> -n 100              # last 100 lines
labctl logs server/<name> -u k3s              # specific unit
labctl logs server/<name> -u sshd --since "1h ago"
labctl logs server/<name> -k                  # kernel
labctl logs server/<name> -p err              # errors only
labctl logs server/<name> --file /var/log/nginx/error.log

# App logs (k8s pod logs)
labctl logs app/<name> [-f] [--container NAME]

# Pulumi execution logs
labctl logs pulumi/<run-id> [-f]

# Bastion logs
labctl logs bastion/<env> [--mac MAC]

# Agent daemon logs
labctl logs agent/<server>

# Audit logs (from CockroachDB)
labctl logs audit [--user NAME] [--action ACTION] [--since TIME]
labctl logs audit/<user-date-sessionid>       # specific session

Log architecture: agent runs journalctl/tail with user-provided flags, streams stdout over WebSocket to labd, labd relays to CLI. No database in the hot path. Future: Grafana Loki integration for cold storage.
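
Because the agent runs journalctl with user-provided flags, one option is to translate a typed request into an argument vector and never build a shell string. The sketch below assumes that approach; `LogRequest` and `journalctlArgs` are hypothetical names, while the long options used (`--follow`, `--lines`, `--unit`, `--since`, `--dmesg`, `--priority`, `--no-pager`) are standard journalctl options corresponding to the short flags shown above.

```typescript
// Hypothetical request shape sent from labd to the agent.
interface LogRequest {
  follow?: boolean;  // -f
  lines?: number;    // -n
  unit?: string;     // -u
  since?: string;    // --since
  kernel?: boolean;  // -k
  priority?: string; // -p
}

const PRIORITIES = new Set([
  "emerg", "alert", "crit", "err", "warning", "notice", "info", "debug",
]);

// Build an argv array for spawning journalctl directly (no shell involved).
function journalctlArgs(req: LogRequest): string[] {
  const args = ["--no-pager"];
  if (req.follow) args.push("--follow");
  if (req.lines !== undefined) {
    if (!Number.isInteger(req.lines) || req.lines < 0) {
      throw new Error("lines must be a non-negative integer");
    }
    args.push(`--lines=${req.lines}`);
  }
  if (req.unit !== undefined) {
    if (!/^[A-Za-z0-9@_.:-]+$/.test(req.unit)) {
      throw new Error(`invalid unit name: ${req.unit}`);
    }
    args.push(`--unit=${req.unit}`);
  }
  if (req.since !== undefined) args.push(`--since=${req.since}`);
  if (req.kernel) args.push("--dmesg");
  if (req.priority !== undefined) {
    if (!PRIORITIES.has(req.priority)) {
      throw new Error(`invalid priority: ${req.priority}`);
    }
    args.push(`--priority=${req.priority}`);
  }
  return args;
}
```

Each value travels as a single argv element in `--option=value` form, so a value such as `1h ago` needs no quoting and cannot be interpreted as an extra flag.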

3.7 Apps (Pulumi Charts, replacing Helm) — TO BUILD

labctl apps list
labctl apps install <name> [--set key=value] [-f values.yaml]
labctl apps status <name>
labctl apps upgrade <name>
labctl apps history <name>
labctl apps rollback <name> <version>
labctl apps uninstall <name>
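
A sketch of how `--set key=value` overrides might be layered on top of the defaults from `values.yaml`. Dotted keys for nested values are an assumption borrowed from Helm's convention, and `Values` and `applySet` are illustrative names. All values are kept as strings here; type coercion against the app's inputs schema is left out.

```typescript
type Values = { [key: string]: string | Values };

// Keys that must never be used as path segments on plain objects.
const FORBIDDEN = new Set(["__proto__", "constructor", "prototype"]);

function clone(values: Values): Values {
  const out: Values = {};
  for (const [key, value] of Object.entries(values)) {
    out[key] = typeof value === "string" ? value : clone(value);
  }
  return out;
}

// Apply "a.b=c" style assignments on top of defaults without mutating them.
function applySet(defaults: Values, assignments: string[]): Values {
  const result = clone(defaults);
  for (const assignment of assignments) {
    const eq = assignment.indexOf("=");
    if (eq <= 0) throw new Error(`expected key=value, got "${assignment}"`);
    const keys = assignment.slice(0, eq).split(".");
    if (keys.some((k) => k === "" || FORBIDDEN.has(k))) {
      throw new Error(`invalid key in "${assignment}"`);
    }
    const leaf = keys.pop() as string; // split() always yields at least one element
    let node: Values = result;
    for (const key of keys) {
      const child = node[key];
      if (child === undefined || typeof child === "string") {
        const fresh: Values = {};
        node[key] = fresh; // create or replace an intermediate level
        node = fresh;
      } else {
        node = child;
      }
    }
    node[leaf] = assignment.slice(eq + 1);
  }
  return result;
}
```

For example, `applySet({ replicas: "1", image: { tag: "v1" } }, ["image.tag=v2"])` leaves `replicas` untouched and changes only `image.tag`.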

3.8 Infrastructure as Code — TO BUILD

labctl apply -f <file.ts> --env <env>
labctl plan -f <file.ts> --env <env>
labctl destroy -f <file.ts> --env <env>

3.9 RBAC — TO BUILD

labctl get roles
labctl get users
labctl create role <name> --allow "action:cloud:env:server"
labctl create role <name> --deny "destroy:*:*:*"
labctl bind role <role> --user <user>
labctl unbind role <role> --user <user>
labctl get permissions

3.10 Environments and Clouds — TO BUILD

labctl get environments
labctl get clouds
labctl create environment <name> --cloud <cloud>

4. Partition Layout

Worker Role

/boot/efi       600MB  EFI
/boot           3GB    ext4
── LVM VG: labvg ──
  swap          27GB
  /             33GB   xfs
  /var          100GB  xfs
  /var/log      10GB   xfs
  /home         10GB   xfs         ← preserved on reprovision
  /srv          20GB   xfs         ← preserved on reprovision
  /var/lib/longhorn  rest  xfs     ← preserved (Longhorn PVC storage)
  /tmp          4GB    tmpfs

Infra Role

/boot/efi       600MB  EFI
/boot           3GB    ext4
── LVM VG: labvg ──
  swap          27GB
  /             33GB   xfs
  /var          100GB  xfs
  /var/log      10GB   xfs
  /home         10GB   xfs         ← preserved on reprovision
  /srv          20GB   xfs         ← preserved on reprovision
  /var/lib/rancher  20GB  xfs      ← preserved (k3s etcd data)
  /tmp          4GB    tmpfs

5. Module System

Configuration modules define desired state. Three tiers:

Core modules (this repo, modules/): k3s-server, k3s-agent, labd, lab-agent, bastion
Official modules (separate repos): monitoring, cilium, DNS
Custom modules (user repos): pulled by git URL

Module structure:

module.yaml          # name, version, targets (roles/labels), deps
src/index.ts         # entry point
src/install.ts       # installation logic
src/configure.ts     # configuration logic
src/health.ts        # health check
tests/               # vitest tests (mandatory)
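
Since `module.yaml` declares deps, modules have to be applied in dependency order. The sketch below shows one way to derive that order with a depth-first topological sort. `ModuleManifest` mirrors the fields named above but is an assumed shape, `installOrder` is a hypothetical helper, and the dependencies in the usage note are made up for illustration.

```typescript
// Assumed subset of module.yaml relevant to ordering.
interface ModuleManifest {
  name: string;
  version: string;
  deps: string[];
}

// Return module names so that every module appears after its dependencies.
// Throws on unknown dependencies and on dependency cycles.
function installOrder(manifests: ModuleManifest[]): string[] {
  const byName = new Map<string, ModuleManifest>();
  for (const manifest of manifests) byName.set(manifest.name, manifest);

  const order: string[] = [];
  const state = new Map<string, "visiting" | "done">();

  const visit = (name: string, trail: string[]): void => {
    const seen = state.get(name);
    if (seen === "done") return;
    if (seen === "visiting") {
      throw new Error(`dependency cycle: ${[...trail, name].join(" -> ")}`);
    }
    const manifest = byName.get(name);
    if (manifest === undefined) throw new Error(`unknown module: ${name}`);
    state.set(name, "visiting");
    for (const dep of manifest.deps) visit(dep, [...trail, name]);
    state.set(name, "done");
    order.push(name);
  };

  for (const manifest of manifests) visit(manifest.name, []);
  return order;
}
```

If bastion depended on labd and labd on k3s-server, the result would be k3s-server, labd, bastion regardless of the input order.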

6. Testing Strategy

6.1 Testing Pyramid

Unit Tests        → pure logic, milliseconds, every commit
Smoke Tests       → containers (podman-compose), minutes, every commit
Integration Tests → VMs (libvirt), 10-15 min, PRs
E2E Tests         → real hardware/cloud, 20-30 min, pre-release

6.2 Smoke Test Stack (podman-compose)

services:
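  cockroachdb:
    image: cockroachdb/cockroach:latest-v24.3
    # assumption: throwaway single-node, insecure mode for smoke tests only
    command: start-single-node --insecure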
  labd:
    build: .
    depends_on: [cockroachdb]
  agent-1:
    build: ./agent
    depends_on: [labd]
  agent-2:
    build: ./agent
    depends_on: [labd]

Tests: agent enrollment, certificate issuance, heartbeat, exec, logs, RBAC deny/allow.

6.3 Security Tests (RBAC)

Deny exec without permission
Deny cross-environment access
Deny rules override allow rules
Cannot escalate own permissions
Audit logs all denied attempts
Certificate-based auth cannot be spoofed
Join tokens cannot be reused (one-time)
Expired tokens rejected

6.4 Ephemeral Test Environments

labctl test smoke                                    # podman-compose
labctl test integration                              # libvirt VMs
labctl env create pr-123 --cloud containers          # CI ephemeral
labctl env create pr-123 --cloud aws                 # cloud ephemeral (future)

6.5 Health Gates for Deployment

Before promoting to production, ALL must pass:

labd API responds
Expected number of agents connected
k3s nodes Ready
Certificates valid (>30 days)
RBAC smoke test passes
No error logs in last 5 minutes
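
The gates above can be modelled as a list of named results that must all pass before promotion. This sketch covers two of them plus the aggregation. `GateResult`, `certificateGate`, `agentsGate`, and `canPromote` are hypothetical names, and treating "expected number of agents" as "at least the expected number" is an assumption.

```typescript
interface GateResult {
  name: string;
  passed: boolean;
  detail: string;
}

const DAY_MS = 24 * 60 * 60 * 1000;

// Passes only when the certificate has more than `minDays` whole days left.
function certificateGate(notAfter: Date, now: Date, minDays = 30): GateResult {
  const daysLeft = Math.floor((notAfter.getTime() - now.getTime()) / DAY_MS);
  return {
    name: "certificates-valid",
    passed: daysLeft > minDays,
    detail: `${daysLeft} days remaining`,
  };
}

// Assumption: extra agents are acceptable, missing agents are not.
function agentsGate(connected: number, expected: number): GateResult {
  return {
    name: "agents-connected",
    passed: connected >= expected,
    detail: `${connected}/${expected} agents connected`,
  };
}

// Promotion requires at least one gate to have run and none to have failed.
function canPromote(results: GateResult[]): { promote: boolean; failed: string[] } {
  const failed = results.filter((r) => !r.passed).map((r) => r.name);
  return { promote: results.length > 0 && failed.length === 0, failed };
}
```

Returning the names of the failed gates, rather than a bare boolean, lets the CLI report exactly which check blocked a promotion.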

7. Cloud/Environment Model

Cloud: baremetal
  └── Environment: lab
       ├── Server: labmaster.ad.itaz.eu (infra, labels={k3s=server})
       └── Server: ser9.ad.itaz.eu (worker, labels={k3s=agent})

Cloud: aws (future)
  └── Environment: production
       ├── Server: i-abc123 (from ASG web-servers)
       └── Server: i-def456 (from ASG web-servers)

Each bastion creates an environment under baremetal cloud. AWS autoscaling groups create environments under aws cloud.
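
The hierarchy maps naturally onto a flat server record carrying its cloud and environment, which is also what `labctl get servers --env --cloud --label` filters on. A minimal sketch, assuming the field names below (the scaffolded Server model lists hostname, mac, cloud, env, role, labels, status); `ServerFilter` and `filterServers` are illustrative names:

```typescript
interface Server {
  hostname: string;
  cloud: string;
  environment: string;
  role: string;
  labels: Record<string, string>;
}

// Mirrors the CLI flags; `label` is a single "KEY=VALUE" string.
interface ServerFilter {
  cloud?: string;
  env?: string;
  label?: string;
}

function parseLabel(label: string): { key: string; value: string } {
  const eq = label.indexOf("=");
  if (eq <= 0) throw new Error(`expected KEY=VALUE, got "${label}"`);
  return { key: label.slice(0, eq), value: label.slice(eq + 1) };
}

// Every filter that is set must match; unset filters match everything.
function filterServers(servers: Server[], filter: ServerFilter): Server[] {
  const label = filter.label === undefined ? undefined : parseLabel(filter.label);
  return servers.filter(
    (s) =>
      (filter.cloud === undefined || s.cloud === filter.cloud) &&
      (filter.env === undefined || s.environment === filter.env) &&
      (label === undefined || s.labels[label.key] === label.value),
  );
}
```

Applied to the tree above, the filter `{ cloud: "baremetal", label: "k3s=agent" }` would return only ser9.ad.itaz.eu.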

8. App Model (Pulumi Charts)

Each app is a Pulumi TypeScript program:

app.yaml             # name, version, inputs schema, required permissions
src/index.ts         # Pulumi program
values.yaml          # defaults
tests/               # vitest tests

First apps to build:

bastion — PXE provisioning (wrap existing code)
labd — master daemon (self-deployment)
cockroachdb — database
cilium — CNI

9. Implementation Phases

Phase 1: Foundation (PARTIALLY DONE)

[done] PXE bastion (discover, install, reprovision)
[done] CLI structure (labctl init/provision)
[done] labd scaffold (Fastify + CockroachDB/Prisma schema)
[done] Multi-arch builds, packaging, CI/CD
Certificate Authority in labd
lab-agent skeleton (connect, heartbeat, enrollment)
Agent enrollment via join tokens
RBAC engine
labctl exec (remote execution)
labctl logs (resource-scoped streaming)
labctl get servers (with filters)
Smoke test stack (podman-compose)

Phase 2: Deployment

Reprovision labmaster as labmaster.ad.itaz.eu
Deploy k3s with Cilium CNI
Deploy CockroachDB on k3s
Deploy labd on k3s
Deploy bastion as managed app
Auto-enroll agents during PXE provision

Phase 3: Infrastructure as Code

Module system
Pulumi charts (replacing Helm)
labctl apps install/upgrade/rollback
labctl apply -f (Pulumi execution)
kubectl proxy (audited)
Kubeconfig store (encrypted)

Phase 4: Multi-Cloud

AWS provider (Pulumi)
Reusable join tokens for ASGs
Cilium Cluster Mesh
Ephemeral test environments
Grafana Loki for cold logs

10. Technology Stack

Component	Technology	Notes
Language	TypeScript (ESM)	Same for CLI, daemon, agents, IaC
CLI	Commander.js	Matches mcpctl patterns
HTTP Server	Fastify + WebSocket	labd and bastion
Database	CockroachDB	PostgreSQL compatible, Prisma ORM
ORM	Prisma	Reuse mcpctl patterns
IaC	Pulumi (TypeScript)	Replaces Helm and Puppet
k8s CNI	Cilium	eBPF, WireGuard, network policies
Auth	mTLS (built-in CA)	Certificate-based, no SSH keys
Packaging	nfpm (RPM/DEB)	bun compile for standalone binary
Containers	Podman + podman-compose	No Docker dependency
CI/CD	Gitea Actions	Self-hosted on mysources.co.uk
Testing	Vitest	Unit + smoke + integration
Registry	Gitea packages	RPM, DEB, container images

11. Lessons from mcpctl

The mcpctl project (../mcpctl/) established patterns reused here:

Project structure: pnpm monorepo with workspace packages (shared, cli, daemon). Each package has own package.json, tsconfig.json, vitest.config.ts.

CLI patterns: Commander.js with factory functions (createXxxCommand). Global options (--project → --env/--cloud). Resource CRUD (get, describe, delete, create, apply).

Server patterns: Fastify with route registration functions. Services layer with repository pattern. Middleware for auth. Health endpoints.

Database: Prisma ORM with PostgreSQL (now CockroachDB, wire-compatible; Prisma ships a dedicated cockroachdb datasource provider for it). Migration-first schema. Seed data for initial setup.

RBAC: Role-based with permission strings. Middleware checks on every request. Audit logging in middleware.

Testing: Vitest with separate configs for unit vs smoke. Smoke tests with real database and services. Security tests for RBAC.

CI/CD: Gitea Actions with lint→typecheck→test→build→publish pipeline. nfpm for RPM/DEB. Bun compile for standalone binaries. Podman for container images.

Deployment: Docker/Podman compose for dev stack. Portainer API for production deploy (we'll use k3s instead). systemd for local daemons.

Completions: Generated from Commander tree. Bash + Fish. --write and --check modes. Included in packages.

Key learnings applied:

Start with proper monorepo structure (not flat scripts)
Type safety across packages via workspace references
Test-driven (unit tests before features)
CI from the start (not retrofitted)
RBAC and audit from the start (not bolted on)
Database-first design (schema defines the domain)

12. Gitea Registry

Registry: mysources.co.uk (self-hosted Gitea at 10.0.0.194)
Token: stored at ~/.gitea-token, env var PACKAGES_TOKEN
Packages: RPM and DEB published to Gitea packages API
Container images: pushed to Gitea container registry
API pattern: Same as mcpctl publish scripts (check existing, delete, re-upload, link to repo)
