# labctl — Infrastructure Management Platform

## Product Requirements Document

## 1. Overview

labctl is a unified infrastructure management platform for bare-metal servers, Kubernetes clusters, and cloud resources. It replaces Puppet with a modern, TypeScript-native system using Pulumi for infrastructure as code.

### 1.1 Core Principles

- **Single CLI** (`labctl`) for all infrastructure operations
- **mTLS everywhere** — built-in Certificate Authority, no SSH key management
- **RBAC from day one** — deny by default, audit everything
- **Multi-cloud** — bare metal now, AWS later, extensible to any cloud
- **Test infrastructure like code** — ephemeral environments, smoke tests, security tests
- **Pulumi over Helm** — TypeScript charts, typed, testable, no YAML templating

### 1.2 Current State (completed)

- PXE bastion for bare-metal provisioning (discover, install, reprovision)
- CLI with subcommands: `labctl init bastion`, `labctl provision`
- LVM partitioning with reprovision data preservation (/home, /srv, /var/lib/longhorn, /var/lib/rancher)
- Worker role (k3s agent + Longhorn) and infra role (k3s server + etcd)
- 32 unit tests, VM smoke tests verified on real hardware
- Multi-arch builds (x86_64 + arm64), RPM/DEB packaging, Gitea CI/CD
- labd scaffold with CockroachDB Prisma schema (Server, Agent, User, Role, Permission, AuditLog, JoinToken, Cluster, PulumiRun)

### 1.3 Hardware

- labmaster (puppet.ad.itaz.eu / 78:55:36:08:35:14): MinisForum SER9, AMD Ryzen 7 255, 16 cores, 27GB RAM, 1TB NVMe, infra role
- Future: additional bare-metal worker nodes, AWS EC2 instances

## 2. Architecture

### 2.1 Components

```
labctl CLI → labd (master) → lab-agent (on every server)
                 ↓
            CockroachDB
```

**labctl** — CLI binary installed on developer workstations. Compiled with bun to standalone binary. Distributed as RPM/DEB/binary.

**labd** — Master daemon running as k8s Deployment on labmaster's k3s cluster. Stateless (all state in CockroachDB).
Multiple instances behind k8s Service for HA. Manages: CA, RBAC, agent registry, Pulumi executor, kubectl proxy, app deployments, log relay.

**lab-agent** — Lightweight daemon on every managed machine. Connects to labd via mTLS WebSocket. Handles: heartbeat, command execution, log streaming, module application. Compiled to standalone binary with bun. Installed via systemd service.

**CockroachDB** — Distributed SQL database. PostgreSQL wire-compatible (Prisma works unchanged). Single node to start, multi-node for HA. Stores: server state, RBAC, audit logs, certificates, kubeconfigs (encrypted), Pulumi state.

**Bastion** — PXE provisioning server. Runs as k8s pod with hostNetwork (needs DHCP/TFTP). Managed by labd as an "app". Multiple bastions for multiple sites.

### 2.2 Network Architecture

**Cilium** as k8s CNI (replacing default flannel):

- eBPF-based pod networking
- Built-in WireGuard encryption between nodes
- Network policies (ties into RBAC)
- Hubble for observability
- Future: Cluster Mesh for multi-site transparent networking

No Tailscale dependency — Cilium handles node-to-node encryption. Agents connect to labd over standard TCP/TLS.

### 2.3 Authentication

**mTLS with built-in Certificate Authority:**

1. labd generates root CA on first start (stored encrypted in CockroachDB)
2. Agents enroll with join token → receive signed certificate
3. CLI users authenticate with client certificates (or SSH key-based initial auth)
4. All communication authenticated via mutual TLS
5. Certificate rotation and revocation supported

**Join tokens:**

- One-time tokens: for individual bare-metal servers (generated during PXE provision, embedded in kickstart)
- Reusable tokens: for autoscaling groups (AWS ASG instances share a token)
- Tokens can be revoked, have optional expiry

### 2.4 RBAC Model

Inspired by mcpctl's RBAC (src/mcpd/src/services/, middleware/auth).
Hierarchical permissions:

```
action:cloud:environment:server

Examples:
  read:*:*:*              — read everything
  exec:baremetal:lab:*    — exec on any lab bare-metal server
  kubectl:*:*:*           — kubectl proxy on any cluster
  *:baremetal:lab:puppet  — full access to puppet server only
  manage:*:*:*            — manage apps, clusters, tokens
  admin:*:*:*             — full admin (create users, roles)
```

**Resources:** servers, environments, clouds, modules, roles, users, clusters, apps, pulumi-stacks

**Actions:** read, exec, apply, destroy, manage, admin, kubectl

**Deny rules:** explicit deny overrides any allow (like AWS IAM)

Prisma models: Role, Permission (allow/deny), UserRole binding.

### 2.5 Database

**CockroachDB** chosen over PostgreSQL and Cassandra:

- PostgreSQL wire-compatible — Prisma works, mcpctl patterns reusable
- Multi-master replication — any node accepts reads AND writes
- Strong consistency (not eventual like Cassandra)
- Survives node failures (3 nodes = 1 failure, 5 nodes = 2)
- Auto-rebalancing when adding nodes
- Start single-node, scale to multi-node with zero code changes (just add nodes)

**Schema (already scaffolded in Prisma):**

- Server — managed machines (hostname, mac, cloud, env, role, labels, status)
- Agent — connected agents (cert, enrollment, last seen)
- User — platform users (username, cert fingerprint)
- Role — RBAC roles with permissions
- Permission — allow/deny rules (action:cloud:env:server)
- UserRole — user-to-role bindings
- JoinToken — enrollment tokens (one-time, reusable, revocable)
- AuditLog — every action logged (user, session, action, resource, result, duration)
- PulumiRun — infrastructure-as-code execution records
- Cluster — managed k8s clusters (kubeconfig encrypted)

## 3. CLI Command Reference

### 3.1 Bastion (PXE Provisioning) — IMPLEMENTED

```bash
sudo labctl init bastion standalone start [--foreground] [--port 8080]
sudo labctl init bastion standalone stop
labctl init bastion standalone status
```

### 3.2 Provisioning — IMPLEMENTED

```bash
labctl provision list
labctl provision install --role worker|infra
labctl provision reprovision --role worker|infra
labctl provision forget
```

### 3.3 Server Management — TO BUILD

```bash
labctl get servers [--env NAME] [--cloud NAME] [--label KEY=VALUE]
labctl describe server/
```

### 3.4 Remote Execution — TO BUILD

```bash
labctl exec server/ --
labctl exec server/ -it -- bash          # interactive TTY
labctl exec server/ --timeout 30s -- cmd
```

### 3.5 Kubernetes Proxy — TO BUILD

```bash
labctl kubectl --cluster
labctl clusters add --kubeconfig
labctl clusters list
labctl clusters remove
```

### 3.6 Logs — TO BUILD

```bash
# Server logs (journalctl passthrough, no DB in hot path)
labctl logs server/                      # all journal
labctl logs server/ -f                   # follow (live WebSocket relay)
labctl logs server/ -n 100               # last 100 lines
labctl logs server/ -u k3s               # specific unit
labctl logs server/ -u sshd --since "1h ago"
labctl logs server/ -k                   # kernel
labctl logs server/ -p err               # errors only
labctl logs server/ --file /var/log/nginx/error.log

# App logs (k8s pod logs)
labctl logs app/ [-f] [--container NAME]

# Pulumi execution logs
labctl logs pulumi/ [-f]

# Bastion logs
labctl logs bastion/ [--mac MAC]

# Agent daemon logs
labctl logs agent/

# Audit logs (from CockroachDB)
labctl logs audit [--user NAME] [--action ACTION] [--since TIME]
labctl logs audit/                       # specific session
```

Log architecture: agent runs journalctl/tail with user-provided flags, streams stdout over WebSocket to labd, labd relays to CLI. No database in the hot path. Future: Grafana Loki integration for cold storage.
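
As a rough illustration of the agent-side step in the log architecture, the sketch below maps the server-log flags listed in 3.6 onto an argument vector for `journalctl` or `tail`. The `LogRequest` and `LogCommand` types and the `buildLogCommand` function are hypothetical names chosen for this example, not part of labctl. The underlying `journalctl` options (`--no-pager`, `-f`, `-n`, `-u`, `-k`, `-p`, `--since`) and `tail` options (`-n`, `-f`) are the standard ones.

```typescript
// Hypothetical request shape: one optional field per CLI flag from section 3.6.
interface LogRequest {
  follow?: boolean;  // -f
  lines?: number;    // -n
  unit?: string;     // -u
  kernel?: boolean;  // -k
  priority?: string; // -p
  since?: string;    // --since
  file?: string;     // --file
}

// Hypothetical result shape: the binary to run plus its argument vector.
interface LogCommand {
  bin: "journalctl" | "tail";
  args: string[];
}

// Returns an argv array rather than a shell string, so user-supplied
// values are passed as literal arguments and never parsed by a shell.
function buildLogCommand(req: LogRequest): LogCommand {
  if (req.lines !== undefined && (!Number.isInteger(req.lines) || req.lines < 0)) {
    throw new Error("lines must be a non-negative integer");
  }

  // --file switches from the journal to tailing a plain log file.
  if (req.file !== undefined) {
    const args: string[] = [];
    if (req.lines !== undefined) args.push("-n", String(req.lines));
    if (req.follow) args.push("-f");
    args.push("--", req.file); // "--" keeps a path starting with "-" from being read as a flag
    return { bin: "tail", args };
  }

  const args: string[] = ["--no-pager"];
  if (req.follow) args.push("-f");
  if (req.lines !== undefined) args.push("-n", String(req.lines));
  if (req.unit !== undefined) args.push("-u", req.unit);
  if (req.kernel) args.push("-k");
  if (req.priority !== undefined) args.push("-p", req.priority);
  if (req.since !== undefined) args.push("--since", req.since);
  return { bin: "journalctl", args };
}

// Example: the equivalent of `labctl logs server/ -u sshd --since "1h ago"`.
const cmd = buildLogCommand({ unit: "sshd", since: "1h ago" });
console.log(cmd.bin, cmd.args);
```

An agent could hand the result to `spawn(cmd.bin, cmd.args)` from `node:child_process` and forward each stdout chunk over its WebSocket to labd. That wiring is left out here because it depends on the agent's connection handling.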
### 3.7 Apps (Pulumi Charts, replacing Helm) — TO BUILD

```bash
labctl apps list
labctl apps install [--set key=value] [-f values.yaml]
labctl apps status
labctl apps upgrade
labctl apps history
labctl apps rollback
labctl apps uninstall
```

### 3.8 Infrastructure as Code — TO BUILD

```bash
labctl apply -f --env
labctl plan -f --env
labctl destroy -f --env
```

### 3.9 RBAC — TO BUILD

```bash
labctl get roles
labctl get users
labctl create role --allow "action:cloud:env:server"
labctl create role --deny "destroy:*:*:*"
labctl bind role --user
labctl unbind role --user
labctl get permissions
```

### 3.10 Environments and Clouds — TO BUILD

```bash
labctl get environments
labctl get clouds
labctl create environment --cloud
```

## 4. Partition Layout

### Worker Role

```
/boot/efi          600MB  EFI
/boot              3GB    ext4
── LVM VG: labvg ──
swap               27GB
/                  33GB   xfs
/var               100GB  xfs
/var/log           10GB   xfs
/home              10GB   xfs   ← preserved on reprovision
/srv               20GB   xfs   ← preserved on reprovision
/var/lib/longhorn  rest   xfs   ← preserved (Longhorn PVC storage)
/tmp               tmpfs  4GB
```

### Infra Role

```
/boot/efi          600MB  EFI
/boot              3GB    ext4
── LVM VG: labvg ──
swap               27GB
/                  33GB   xfs
/var               100GB  xfs
/var/log           10GB   xfs
/home              10GB   xfs   ← preserved on reprovision
/srv               20GB   xfs   ← preserved on reprovision
/var/lib/rancher   20GB   xfs   ← preserved (k3s etcd data)
/tmp               tmpfs  4GB
```

## 5. Module System

Configuration modules define desired state. Three tiers:

1. **Core modules** (this repo, `modules/`): k3s-server, k3s-agent, labd, lab-agent, bastion
2. **Official modules** (separate repos): monitoring, cilium, DNS
3. **Custom modules** (user repos): pulled by git URL

Module structure:

```
module.yaml       # name, version, targets (roles/labels), deps
src/index.ts      # entry point
src/install.ts    # installation logic
src/configure.ts  # configuration logic
src/health.ts     # health check
tests/            # vitest tests (mandatory)
```

## 6. Testing Strategy

### 6.1 Testing Pyramid

```
Unit Tests        → pure logic, milliseconds, every commit
Smoke Tests       → containers (podman-compose), minutes, every commit
Integration Tests → VMs (libvirt), 10-15 min, PRs
E2E Tests         → real hardware/cloud, 20-30 min, pre-release
```

### 6.2 Smoke Test Stack (podman-compose)

```yaml
services:
  cockroachdb:
    image: cockroachdb/cockroach:latest-v24.3
  labd:
    build: .
    depends_on: [cockroachdb]
  agent-1:
    build: ./agent
    depends_on: [labd]
  agent-2:
    build: ./agent
    depends_on: [labd]
```

Tests: agent enrollment, certificate issuance, heartbeat, exec, logs, RBAC deny/allow.

### 6.3 Security Tests (RBAC)

- Deny exec without permission
- Deny cross-environment access
- Deny rules override allow rules
- Cannot escalate own permissions
- Audit logs all denied attempts
- Certificate-based auth cannot be spoofed
- Join tokens cannot be reused (one-time)
- Expired tokens rejected

### 6.4 Ephemeral Test Environments

```bash
labctl test smoke                            # podman-compose
labctl test integration                      # libvirt VMs
labctl env create pr-123 --cloud containers  # CI ephemeral
labctl env create pr-123 --cloud aws         # cloud ephemeral (future)
```

### 6.5 Health Gates for Deployment

Before promoting to production, ALL must pass:

- labd API responds
- Expected number of agents connected
- k3s nodes Ready
- Certificates valid (>30 days)
- RBAC smoke test passes
- No error logs in last 5 minutes

## 7. Cloud/Environment Model

```
Cloud: baremetal
└── Environment: lab
    ├── Server: labmaster.ad.itaz.eu (infra, labels={k3s=server})
    └── Server: ser9.ad.itaz.eu (worker, labels={k3s=agent})

Cloud: aws (future)
└── Environment: production
    ├── Server: i-abc123 (from ASG web-servers)
    └── Server: i-def456 (from ASG web-servers)
```

Each bastion creates an environment under baremetal cloud. AWS autoscaling groups create environments under aws cloud.

## 8. App Model (Pulumi Charts)

Each app is a Pulumi TypeScript program:

```
app.yaml      # name, version, inputs schema, required permissions
src/index.ts  # Pulumi program
values.yaml   # defaults
tests/        # vitest tests
```

First apps to build:

- bastion — PXE provisioning (wrap existing code)
- labd — master daemon (self-deployment)
- cockroachdb — database
- cilium — CNI

## 9. Implementation Phases

### Phase 1: Foundation (PARTIALLY DONE)

- [x] PXE bastion (discover, install, reprovision)
- [x] CLI structure (labctl init/provision)
- [x] labd scaffold (Fastify + CockroachDB/Prisma schema)
- [x] Multi-arch builds, packaging, CI/CD
- [ ] Certificate Authority in labd
- [ ] lab-agent skeleton (connect, heartbeat, enrollment)
- [ ] Agent enrollment via join tokens
- [ ] RBAC engine
- [ ] labctl exec (remote execution)
- [ ] labctl logs (resource-scoped streaming)
- [ ] labctl get servers (with filters)
- [ ] Smoke test stack (podman-compose)

### Phase 2: Deployment

- [ ] Reprovision labmaster as labmaster.ad.itaz.eu
- [ ] Deploy k3s with Cilium CNI
- [ ] Deploy CockroachDB on k3s
- [ ] Deploy labd on k3s
- [ ] Deploy bastion as managed app
- [ ] Auto-enroll agents during PXE provision

### Phase 3: Infrastructure as Code

- [ ] Module system
- [ ] Pulumi charts (replacing Helm)
- [ ] labctl apps install/upgrade/rollback
- [ ] labctl apply -f (Pulumi execution)
- [ ] kubectl proxy (audited)
- [ ] Kubeconfig store (encrypted)

### Phase 4: Multi-Cloud

- [ ] AWS provider (Pulumi)
- [ ] Reusable join tokens for ASGs
- [ ] Cilium Cluster Mesh
- [ ] Ephemeral test environments
- [ ] Grafana Loki for cold logs

## 10. Technology Stack

| Component | Technology | Notes |
|-----------|-----------|-------|
| Language | TypeScript (ESM) | Same for CLI, daemon, agents, IaC |
| CLI | Commander.js | Matches mcpctl patterns |
| HTTP Server | Fastify + WebSocket | labd and bastion |
| Database | CockroachDB | PostgreSQL compatible, Prisma ORM |
| ORM | Prisma | Reuse mcpctl patterns |
| IaC | Pulumi (TypeScript) | Replaces Helm and Puppet |
| k8s CNI | Cilium | eBPF, WireGuard, network policies |
| Auth | mTLS (built-in CA) | Certificate-based, no SSH keys |
| Packaging | nfpm (RPM/DEB) | bun compile for standalone binary |
| Containers | Podman + podman-compose | No Docker dependency |
| CI/CD | Gitea Actions | Self-hosted on mysources.co.uk |
| Testing | Vitest | Unit + smoke + integration |
| Registry | Gitea packages | RPM, DEB, container images |

## 11. Lessons from mcpctl

The mcpctl project (../mcpctl/) established patterns reused here:

**Project structure:** pnpm monorepo with workspace packages (shared, cli, daemon). Each package has own package.json, tsconfig.json, vitest.config.ts.

**CLI patterns:** Commander.js with factory functions (createXxxCommand). Global options (--project → --env/--cloud). Resource CRUD (get, describe, delete, create, apply).

**Server patterns:** Fastify with route registration functions. Services layer with repository pattern. Middleware for auth. Health endpoints.

**Database:** Prisma ORM with PostgreSQL (now CockroachDB, wire-compatible). Migration-first schema. Seed data for initial setup.

**RBAC:** Role-based with permission strings. Middleware checks on every request. Audit logging in middleware.

**Testing:** Vitest with separate configs for unit vs smoke. Smoke tests with real database and services. Security tests for RBAC.

**CI/CD:** Gitea Actions with lint→typecheck→test→build→publish pipeline. nfpm for RPM/DEB. Bun compile for standalone binaries. Podman for container images.

**Deployment:** Docker/Podman compose for dev stack.
Portainer API for production deploy (we'll use k3s instead). systemd for local daemons.

**Completions:** Generated from Commander tree. Bash + Fish. --write and --check modes. Included in packages.

**Key learnings applied:**

- Start with proper monorepo structure (not flat scripts)
- Type safety across packages via workspace references
- Test-driven (unit tests before features)
- CI from the start (not retrofitted)
- RBAC and audit from the start (not bolted on)
- Database-first design (schema defines the domain)

## 12. Gitea Registry

**Registry:** mysources.co.uk (self-hosted Gitea at 10.0.0.194)

**Token:** stored at ~/.gitea-token, env var PACKAGES_TOKEN

**Packages:** RPM and DEB published to Gitea packages API

**Container images:** pushed to Gitea container registry

**API pattern:** Same as mcpctl publish scripts (check existing, delete, re-upload, link to repo)