Full product requirements covering: architecture, CLI commands, partition layout, modules, testing strategy, cloud model, app model, implementation phases, tech stack, and lessons from mcpctl. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
# labctl — Infrastructure Management Platform

## Product Requirements Document

## 1. Overview

labctl is a unified infrastructure management platform for bare-metal servers, Kubernetes clusters, and cloud resources. It replaces Puppet with a modern, TypeScript-native system using Pulumi for infrastructure as code.

### 1.1 Core Principles
- **Single CLI** (`labctl`) for all infrastructure operations
- **mTLS everywhere** — built-in Certificate Authority, no SSH key management
- **RBAC from day one** — deny by default, audit everything
- **Multi-cloud** — bare metal now, AWS later, extensible to any cloud
- **Test infrastructure like code** — ephemeral environments, smoke tests, security tests
- **Pulumi over Helm** — TypeScript charts, typed, testable, no YAML templating

### 1.2 Current State (completed)
- PXE bastion for bare-metal provisioning (discover, install, reprovision)
- CLI with subcommands: `labctl init bastion`, `labctl provision`
- LVM partitioning with reprovision data preservation (/home, /srv, /var/lib/longhorn, /var/lib/rancher)
- Worker role (k3s agent + Longhorn) and infra role (k3s server + etcd)
- 32 unit tests, VM smoke tests verified on real hardware
- Multi-arch builds (x86_64 + arm64), RPM/DEB packaging, Gitea CI/CD
- labd scaffold with CockroachDB Prisma schema (Server, Agent, User, Role, Permission, AuditLog, JoinToken, Cluster, PulumiRun)

### 1.3 Hardware
- labmaster (puppet.ad.itaz.eu / 78:55:36:08:35:14): MinisForum SER9, AMD Ryzen 7 255, 16 cores, 27GB RAM, 1TB NVMe, infra role
- Future: additional bare-metal worker nodes, AWS EC2 instances

## 2. Architecture

### 2.1 Components

```
labctl CLI → labd (master) → lab-agent (on every server)
               ↓
          CockroachDB
```

**labctl** — CLI binary installed on developer workstations. Compiled with bun to standalone binary. Distributed as RPM/DEB/binary.

**labd** — Master daemon running as k8s Deployment on labmaster's k3s cluster. Stateless (all state in CockroachDB). Multiple instances behind k8s Service for HA. Manages: CA, RBAC, agent registry, Pulumi executor, kubectl proxy, app deployments, log relay.

**lab-agent** — Lightweight daemon on every managed machine. Connects to labd via mTLS WebSocket. Handles: heartbeat, command execution, log streaming, module application. Compiled to standalone binary with bun. Installed via systemd service.

**CockroachDB** — Distributed SQL database. PostgreSQL wire-compatible (Prisma works unchanged). Single node to start, multi-node for HA. Stores: server state, RBAC, audit logs, certificates, kubeconfigs (encrypted), Pulumi state.

**Bastion** — PXE provisioning server. Runs as k8s pod with hostNetwork (needs DHCP/TFTP). Managed by labd as an "app". Multiple bastions for multiple sites.

### 2.2 Network Architecture

**Cilium** as k8s CNI (replacing default flannel):
- eBPF-based pod networking
- Built-in WireGuard encryption between nodes
- Network policies (ties into RBAC)
- Hubble for observability
- Future: Cluster Mesh for multi-site transparent networking

No Tailscale dependency — Cilium handles node-to-node encryption. Agents connect to labd over standard TCP/TLS.

### 2.3 Authentication

**mTLS with built-in Certificate Authority:**
1. labd generates root CA on first start (stored encrypted in CockroachDB)
2. Agents enroll with join token → receive signed certificate
3. CLI users authenticate with client certificates (or SSH key-based initial auth)
4. All communication authenticated via mutual TLS
5. Certificate rotation and revocation supported

**Join tokens:**
- One-time tokens: for individual bare-metal servers (generated during PXE provision, embedded in kickstart)
- Reusable tokens: for autoscaling groups (AWS ASG instances share a token)
- Tokens can be revoked, have optional expiry

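The join-token rules above (one-time vs. reusable, revocable, optional expiry) could be checked at enrollment roughly as follows. This is a minimal sketch: the `JoinToken` shape, its field names, and both helper functions are hypothetical and not taken from the real Prisma model.

```typescript
// Hypothetical token shape; field names are illustrative only.
type JoinToken = {
  id: string;
  kind: "one-time" | "reusable";
  revoked: boolean;
  expiresAt?: Date; // optional expiry
  usedAt?: Date;    // set once a one-time token has been consumed
};

type TokenCheck =
  | { ok: true }
  | { ok: false; reason: "revoked" | "expired" | "already-used" };

// Decide whether a token may be used for enrollment at time `now`.
function checkJoinToken(token: JoinToken, now: Date): TokenCheck {
  if (token.revoked) return { ok: false, reason: "revoked" };
  if (token.expiresAt !== undefined && token.expiresAt.getTime() <= now.getTime()) {
    return { ok: false, reason: "expired" };
  }
  if (token.kind === "one-time" && token.usedAt !== undefined) {
    return { ok: false, reason: "already-used" };
  }
  return { ok: true };
}

// Mark a token as consumed. Reusable tokens are returned unchanged.
function consumeJoinToken(token: JoinToken, now: Date): JoinToken {
  return token.kind === "one-time" ? { ...token, usedAt: now } : token;
}
```

In a real labd the check and the consume step would presumably run inside one database transaction, so that two agents cannot race on the same one-time token.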
### 2.4 RBAC Model

Inspired by mcpctl's RBAC (src/mcpd/src/services/, middleware/auth). Hierarchical permissions:

```
action:cloud:environment:server

Examples:
  read:*:*:*               — read everything
  exec:baremetal:lab:*     — exec on any lab bare-metal server
  kubectl:*:*:*            — kubectl proxy on any cluster
  *:baremetal:lab:puppet   — full access to puppet server only
  manage:*:*:*             — manage apps, clusters, tokens
  admin:*:*:*              — full admin (create users, roles)
```

**Resources:** servers, environments, clouds, modules, roles, users, clusters, apps, pulumi-stacks
**Actions:** read, exec, apply, destroy, manage, admin, kubectl
**Deny rules:** explicit deny overrides any allow (like AWS IAM)

Prisma models: Role, Permission (allow/deny), UserRole binding.

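The evaluation order implied above (deny by default, wildcard segments, explicit deny beats any allow) can be sketched as follows. The `Rule` type and both function names are hypothetical. The sketch also assumes each action is matched literally, since the document does not say whether `admin` implies other actions.

```typescript
type Effect = "allow" | "deny";

// Hypothetical rule shape; `pattern` is "action:cloud:environment:server".
interface Rule {
  effect: Effect;
  pattern: string;
}

// True when every pattern segment is "*" or equals the request segment.
function matchesPattern(pattern: string, request: string): boolean {
  const p = pattern.split(":");
  const r = request.split(":");
  if (p.length !== 4 || r.length !== 4) return false;
  return p.every((seg, i) => seg === "*" || seg === r[i]);
}

// Deny by default; an explicit deny overrides any matching allow.
function isAllowed(rules: Rule[], request: string): boolean {
  let allowed = false;
  for (const rule of rules) {
    if (!matchesPattern(rule.pattern, request)) continue;
    if (rule.effect === "deny") return false;
    allowed = true;
  }
  return allowed;
}
```

For example, a user holding `*:baremetal:lab:puppet` and a deny on `destroy:*:*:*` can exec on puppet but cannot destroy it, and gets no access to anything outside the lab environment.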
### 2.5 Database

**CockroachDB** chosen over PostgreSQL and Cassandra:
- PostgreSQL wire-compatible — Prisma works, mcpctl patterns reusable
- Multi-master replication — any node accepts reads AND writes
- Strong consistency (not eventual like Cassandra)
- Survives node failures (3 nodes = 1 failure, 5 nodes = 2)
- Auto-rebalancing when adding nodes
- Start single-node, scale to multi-node with zero code changes (just add nodes)

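The failure-tolerance figures in the list above follow from majority quorum: a group of n replicas stays available while more than half of them are reachable. A small sketch of that arithmetic, assuming the replication factor equals the node count (the function name is illustrative):

```typescript
// Number of simultaneous replica failures a majority quorum can survive.
function toleratedFailures(replicas: number): number {
  if (!Number.isInteger(replicas) || replicas < 1) {
    throw new RangeError("replicas must be a positive integer");
  }
  return Math.floor((replicas - 1) / 2);
}
```

This gives 0 for a single node, 1 for three nodes, and 2 for five. An even node count adds no tolerance over the odd count below it (four nodes still survive only one failure).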
**Schema (already scaffolded in Prisma):**
- Server — managed machines (hostname, mac, cloud, env, role, labels, status)
- Agent — connected agents (cert, enrollment, last seen)
- User — platform users (username, cert fingerprint)
- Role — RBAC roles with permissions
- Permission — allow/deny rules (action:cloud:env:server)
- UserRole — user-to-role bindings
- JoinToken — enrollment tokens (one-time, reusable, revocable)
- AuditLog — every action logged (user, session, action, resource, result, duration)
- PulumiRun — infrastructure-as-code execution records
- Cluster — managed k8s clusters (kubeconfig encrypted)

## 3. CLI Command Reference

### 3.1 Bastion (PXE Provisioning) — IMPLEMENTED
```bash
sudo labctl init bastion standalone start [--foreground] [--port 8080]
sudo labctl init bastion standalone stop
labctl init bastion standalone status
```

### 3.2 Provisioning — IMPLEMENTED
```bash
labctl provision list
labctl provision install <mac> <hostname> --role worker|infra
labctl provision reprovision <mac> <hostname> --role worker|infra
labctl provision forget <mac>
```

### 3.3 Server Management — TO BUILD
```bash
labctl get servers [--env NAME] [--cloud NAME] [--label KEY=VALUE]
labctl describe server/<name>
```

### 3.4 Remote Execution — TO BUILD
```bash
labctl exec server/<name> -- <command>
labctl exec server/<name> -it -- bash        # interactive TTY
labctl exec server/<name> --timeout 30s -- cmd
```

### 3.5 Kubernetes Proxy — TO BUILD
```bash
labctl kubectl --cluster <name> <kubectl-args>
labctl clusters add <name> --kubeconfig <path>
labctl clusters list
labctl clusters remove <name>
```

### 3.6 Logs — TO BUILD
```bash
# Server logs (journalctl passthrough, no DB in hot path)
labctl logs server/<name>                    # all journal
labctl logs server/<name> -f                 # follow (live WebSocket relay)
labctl logs server/<name> -n 100             # last 100 lines
labctl logs server/<name> -u k3s             # specific unit
labctl logs server/<name> -u sshd --since "1h ago"
labctl logs server/<name> -k                 # kernel
labctl logs server/<name> -p err             # errors only
labctl logs server/<name> --file /var/log/nginx/error.log

# App logs (k8s pod logs)
labctl logs app/<name> [-f] [--container NAME]

# Pulumi execution logs
labctl logs pulumi/<run-id> [-f]

# Bastion logs
labctl logs bastion/<env> [--mac MAC]

# Agent daemon logs
labctl logs agent/<server>

# Audit logs (from CockroachDB)
labctl logs audit [--user NAME] [--action ACTION] [--since TIME]
labctl logs audit/<user-date-sessionid>      # specific session
```

Log architecture: agent runs journalctl/tail with user-provided flags, streams stdout over WebSocket to labd, labd relays to CLI. No database in the hot path. Future: Grafana Loki integration for cold storage.

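Because the agent runs journalctl with user-provided flags, one plausible approach is to translate a small, validated request object into an argv array and spawn the process without a shell. The sketch below covers only the server-log flags listed above. The `LogRequest` type, the function name, and the conservative unit-name pattern are assumptions, while the journalctl options themselves (`--no-pager`, `-f`, `-n`, `-u`, `--since`, `-k`, `-p`) are standard.

```typescript
// Hypothetical request shape mirroring the `labctl logs server/<name>` flags.
interface LogRequest {
  follow?: boolean;   // -f
  lines?: number;     // -n
  unit?: string;      // -u
  since?: string;     // --since
  kernel?: boolean;   // -k
  priority?: string;  // -p
}

const PRIORITIES = ["emerg", "alert", "crit", "err", "warning", "notice", "info", "debug"];

// Build an argv array for journalctl; values are never interpolated into a shell string.
function journalctlArgs(req: LogRequest): string[] {
  const args = ["--no-pager"];
  if (req.follow) args.push("-f");
  if (req.lines !== undefined) {
    if (!Number.isInteger(req.lines) || req.lines < 0) throw new Error("lines must be a non-negative integer");
    args.push("-n", String(req.lines));
  }
  if (req.unit !== undefined) {
    // Assumption: a conservative allowlist of characters for unit names.
    if (!/^[A-Za-z0-9@_.:-]+$/.test(req.unit)) throw new Error("invalid unit name");
    args.push("-u", req.unit);
  }
  if (req.since !== undefined) args.push("--since", req.since);
  if (req.kernel) args.push("-k");
  if (req.priority !== undefined) {
    if (PRIORITIES.indexOf(req.priority) === -1) throw new Error("invalid priority");
    args.push("-p", req.priority);
  }
  return args;
}
```

With this shape, `labctl logs server/<name> -u sshd --since "1h ago"` would map to `{ unit: "sshd", since: "1h ago" }`, and the agent would pass the resulting array straight to its process-spawn call.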
### 3.7 Apps (Pulumi Charts, replacing Helm) — TO BUILD
```bash
labctl apps list
labctl apps install <name> [--set key=value] [-f values.yaml]
labctl apps status <name>
labctl apps upgrade <name>
labctl apps history <name>
labctl apps rollback <name> <version>
labctl apps uninstall <name>
```

### 3.8 Infrastructure as Code — TO BUILD
```bash
labctl apply -f <file.ts> --env <env>
labctl plan -f <file.ts> --env <env>
labctl destroy -f <file.ts> --env <env>
```

### 3.9 RBAC — TO BUILD
```bash
labctl get roles
labctl get users
labctl create role <name> --allow "action:cloud:env:server"
labctl create role <name> --deny "destroy:*:*:*"
labctl bind role <role> --user <user>
labctl unbind role <role> --user <user>
labctl get permissions
```

### 3.10 Environments and Clouds — TO BUILD
```bash
labctl get environments
labctl get clouds
labctl create environment <name> --cloud <cloud>
```

## 4. Partition Layout

### Worker Role
```
/boot/efi           600MB   EFI
/boot               3GB     ext4
── LVM VG: labvg ──
swap                27GB
/                   33GB    xfs
/var                100GB   xfs
/var/log            10GB    xfs
/home               10GB    xfs    ← preserved on reprovision
/srv                20GB    xfs    ← preserved on reprovision
/var/lib/longhorn   rest    xfs    ← preserved (Longhorn PVC storage)
/tmp                tmpfs   4GB
```

### Infra Role
```
/boot/efi           600MB   EFI
/boot               3GB     ext4
── LVM VG: labvg ──
swap                27GB
/                   33GB    xfs
/var                100GB   xfs
/var/log            10GB    xfs
/home               10GB    xfs    ← preserved on reprovision
/srv                20GB    xfs    ← preserved on reprovision
/var/lib/rancher    20GB    xfs    ← preserved (k3s etcd data)
/tmp                tmpfs   4GB
```

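Since `/var/lib/longhorn` takes "rest" on the worker role, its size depends on the disk. The sketch below adds up the fixed worker allocations from the layout above and subtracts them from a disk size. It assumes decimal units (1 GB = 1000 MB) and ignores partition-table and LVM metadata overhead; the names are illustrative, and `/tmp` is left out because it is tmpfs.

```typescript
const MB_PER_GB = 1000; // assumption: decimal units

// Fixed worker-role allocations in MB, taken from the layout above.
const WORKER_FIXED_MB: number[] = [
  600,             // /boot/efi
  3 * MB_PER_GB,   // /boot
  27 * MB_PER_GB,  // swap
  33 * MB_PER_GB,  // /
  100 * MB_PER_GB, // /var
  10 * MB_PER_GB,  // /var/log
  10 * MB_PER_GB,  // /home
  20 * MB_PER_GB,  // /srv
];

// Space left for /var/lib/longhorn on a disk of the given size.
function longhornMB(diskMB: number): number {
  const fixed = WORKER_FIXED_MB.reduce((sum, mb) => sum + mb, 0);
  if (fixed > diskMB) throw new RangeError("disk too small for the fixed layout");
  return diskMB - fixed;
}
```

The fixed allocations total 203,600 MB, so under these assumptions a hypothetical 1 TB worker disk would leave roughly 796 GB for Longhorn.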
## 5. Module System

Configuration modules define desired state. Three tiers:
1. **Core modules** (this repo, `modules/`): k3s-server, k3s-agent, labd, lab-agent, bastion
2. **Official modules** (separate repos): monitoring, cilium, DNS
3. **Custom modules** (user repos): pulled by git URL

Module structure:
```
module.yaml          # name, version, targets (roles/labels), deps
src/index.ts         # entry point
src/install.ts       # installation logic
src/configure.ts     # configuration logic
src/health.ts        # health check
tests/               # vitest tests (mandatory)
```

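Because `module.yaml` declares `deps`, the agent needs some apply order in which every module comes after its dependencies. One way to get that is a depth-first topological sort with cycle detection, sketched below. The `ModuleManifest` shape is a guess at what the parsed `module.yaml` might look like, and the dependency edges in the usage note are invented for illustration.

```typescript
// Hypothetical parsed form of module.yaml.
interface ModuleManifest {
  name: string;
  version: string;
  targets: { roles?: string[]; labels?: Record<string, string> };
  deps: string[];
}

// Return module names ordered so that dependencies come first.
function applyOrder(mods: ModuleManifest[]): string[] {
  const byName = new Map<string, ModuleManifest>();
  for (const m of mods) byName.set(m.name, m);

  const done = new Set<string>();
  const visiting = new Set<string>();
  const order: string[] = [];

  const visit = (name: string): void => {
    if (done.has(name)) return;
    if (visiting.has(name)) throw new Error(`dependency cycle at ${name}`);
    const mod = byName.get(name);
    if (mod === undefined) throw new Error(`unknown module ${name}`);
    visiting.add(name);
    for (const dep of mod.deps) visit(dep);
    visiting.delete(name);
    done.add(name);
    order.push(name);
  };

  for (const m of mods) visit(m.name);
  return order;
}
```

If, say, monitoring depended on cilium and cilium on k3s-server, the result would be k3s-server, cilium, monitoring regardless of the input order.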
## 6. Testing Strategy

### 6.1 Testing Pyramid
```
Unit Tests          → pure logic, milliseconds, every commit
Smoke Tests         → containers (podman-compose), minutes, every commit
Integration Tests   → VMs (libvirt), 10-15 min, PRs
E2E Tests           → real hardware/cloud, 20-30 min, pre-release
```

### 6.2 Smoke Test Stack (podman-compose)
```yaml
services:
  cockroachdb:
    image: cockroachdb/cockroach:latest-v24.3
    # Assumption: a single insecure node is sufficient for smoke tests;
    # the image needs an explicit start command to run as a server.
    command: start-single-node --insecure
  labd:
    build: .
    depends_on: [cockroachdb]
  agent-1:
    build: ./agent
    depends_on: [labd]
  agent-2:
    build: ./agent
    depends_on: [labd]
```

Tests: agent enrollment, certificate issuance, heartbeat, exec, logs, RBAC deny/allow.

### 6.3 Security Tests (RBAC)
- Deny exec without permission
- Deny cross-environment access
- Deny rules override allow rules
- Cannot escalate own permissions
- Audit logs all denied attempts
- Certificate-based auth cannot be spoofed
- Join tokens cannot be reused (one-time)
- Expired tokens rejected

### 6.4 Ephemeral Test Environments
```bash
labctl test smoke                              # podman-compose
labctl test integration                        # libvirt VMs
labctl env create pr-123 --cloud containers    # CI ephemeral
labctl env create pr-123 --cloud aws           # cloud ephemeral (future)
```

### 6.5 Health Gates for Deployment
Before promoting to production, ALL must pass:
- labd API responds
- Expected number of agents connected
- k3s nodes Ready
- Certificates valid (>30 days)
- RBAC smoke test passes
- No error logs in last 5 minutes

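The gate list above amounts to a conjunction of checks, and it helps to report which gates failed rather than a single boolean. A minimal sketch, assuming the individual probes have already been collected into a snapshot object; the `HealthSnapshot` fields and function names are hypothetical.

```typescript
// Hypothetical snapshot of the probes behind each gate.
interface HealthSnapshot {
  apiResponds: boolean;
  agentsConnected: number;
  agentsExpected: number;
  allNodesReady: boolean;
  minCertDaysRemaining: number; // shortest remaining validity across certificates
  rbacSmokePassed: boolean;
  errorLogsLast5Min: number;
}

// Return a description of every failed gate; an empty array means all gates pass.
function failedGates(s: HealthSnapshot): string[] {
  const failures: string[] = [];
  if (!s.apiResponds) failures.push("labd API does not respond");
  if (s.agentsConnected < s.agentsExpected) {
    failures.push(`agents connected: ${s.agentsConnected}/${s.agentsExpected}`);
  }
  if (!s.allNodesReady) failures.push("k3s nodes not Ready");
  if (s.minCertDaysRemaining <= 30) failures.push("certificate validity is 30 days or less");
  if (!s.rbacSmokePassed) failures.push("RBAC smoke test failed");
  if (s.errorLogsLast5Min > 0) failures.push("error logs in last 5 minutes");
  return failures;
}

// Promotion is allowed only when no gate fails.
function canPromote(s: HealthSnapshot): boolean {
  return failedGates(s).length === 0;
}
```

Returning the list of failures lets the CLI print exactly why a promotion was blocked.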
## 7. Cloud/Environment Model

```
Cloud: baremetal
└── Environment: lab
    ├── Server: labmaster.ad.itaz.eu (infra, labels={k3s=server})
    └── Server: ser9.ad.itaz.eu (worker, labels={k3s=agent})

Cloud: aws (future)
└── Environment: production
    ├── Server: i-abc123 (from ASG web-servers)
    └── Server: i-def456 (from ASG web-servers)
```

Each bastion creates an environment under baremetal cloud. AWS autoscaling groups create environments under aws cloud.

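The cloud → environment → server hierarchy is what the `--cloud`, `--env`, and `--label KEY=VALUE` filters of `labctl get servers` select on. A sketch of that filtering over an in-memory list follows; the `ServerRecord` and `ServerFilter` types and the function names are illustrative, and a real implementation would more likely push the filter into the database query.

```typescript
// Hypothetical in-memory view of a Server row.
interface ServerRecord {
  hostname: string;
  cloud: string;
  environment: string;
  role: string;
  labels: Record<string, string>;
}

// Mirrors `labctl get servers [--env NAME] [--cloud NAME] [--label KEY=VALUE]`.
interface ServerFilter {
  cloud?: string;
  env?: string;
  label?: string; // "KEY=VALUE"
}

// Split "KEY=VALUE" at the first "="; the key must be non-empty.
function parseLabel(label: string): [string, string] {
  const i = label.indexOf("=");
  if (i <= 0) throw new Error("label must be KEY=VALUE");
  return [label.slice(0, i), label.slice(i + 1)];
}

// Keep servers that satisfy every filter that was supplied.
function filterServers(servers: ServerRecord[], f: ServerFilter): ServerRecord[] {
  const pair = f.label === undefined ? undefined : parseLabel(f.label);
  return servers.filter(
    (s) =>
      (f.cloud === undefined || s.cloud === f.cloud) &&
      (f.env === undefined || s.environment === f.env) &&
      (pair === undefined || s.labels[pair[0]] === pair[1]),
  );
}
```

Against the tree above, `--cloud baremetal --label k3s=agent` would select only ser9.ad.itaz.eu.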
## 8. App Model (Pulumi Charts)

Each app is a Pulumi TypeScript program:
```
app.yaml             # name, version, inputs schema, required permissions
src/index.ts         # Pulumi program
values.yaml          # defaults
tests/               # vitest tests
```

First apps to build:
- bastion — PXE provisioning (wrap existing code)
- labd — master daemon (self-deployment)
- cockroachdb — database
- cilium — CNI

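An app's inputs can come from three places: the `values.yaml` defaults shipped with the app, a file passed with `-f`, and `--set key=value` flags. The document does not define precedence, so the sketch below assumes the Helm-like order (defaults, then file, then `--set`) and handles only flat string keys. All names are hypothetical.

```typescript
type Values = Record<string, string>;

// Parse ["key=value", ...] into an object; later pairs win over earlier ones.
function parseSetFlags(pairs: string[]): Values {
  const out: Values = {};
  for (const pair of pairs) {
    const i = pair.indexOf("=");
    if (i <= 0) throw new Error(`--set expects key=value, got "${pair}"`);
    out[pair.slice(0, i)] = pair.slice(i + 1);
  }
  return out;
}

// Assumed precedence: app defaults < -f file values < --set flags.
function resolveValues(defaults: Values, fileValues: Values, setFlags: string[]): Values {
  return { ...defaults, ...fileValues, ...parseSetFlags(setFlags) };
}
```

The resolved object would then be validated against the inputs schema from `app.yaml` before it is handed to the Pulumi program.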
## 9. Implementation Phases

### Phase 1: Foundation (PARTIALLY DONE)
- [x] PXE bastion (discover, install, reprovision)
- [x] CLI structure (labctl init/provision)
- [x] labd scaffold (Fastify + CockroachDB/Prisma schema)
- [x] Multi-arch builds, packaging, CI/CD
- [ ] Certificate Authority in labd
- [ ] lab-agent skeleton (connect, heartbeat, enrollment)
- [ ] Agent enrollment via join tokens
- [ ] RBAC engine
- [ ] labctl exec (remote execution)
- [ ] labctl logs (resource-scoped streaming)
- [ ] labctl get servers (with filters)
- [ ] Smoke test stack (podman-compose)

### Phase 2: Deployment
- [ ] Reprovision labmaster as labmaster.ad.itaz.eu
- [ ] Deploy k3s with Cilium CNI
- [ ] Deploy CockroachDB on k3s
- [ ] Deploy labd on k3s
- [ ] Deploy bastion as managed app
- [ ] Auto-enroll agents during PXE provision

### Phase 3: Infrastructure as Code
- [ ] Module system
- [ ] Pulumi charts (replacing Helm)
- [ ] labctl apps install/upgrade/rollback
- [ ] labctl apply -f (Pulumi execution)
- [ ] kubectl proxy (audited)
- [ ] Kubeconfig store (encrypted)

### Phase 4: Multi-Cloud
- [ ] AWS provider (Pulumi)
- [ ] Reusable join tokens for ASGs
- [ ] Cilium Cluster Mesh
- [ ] Ephemeral test environments
- [ ] Grafana Loki for cold logs

## 10. Technology Stack

| Component | Technology | Notes |
|-----------|-----------|-------|
| Language | TypeScript (ESM) | Same for CLI, daemon, agents, IaC |
| CLI | Commander.js | Matches mcpctl patterns |
| HTTP Server | Fastify + WebSocket | labd and bastion |
| Database | CockroachDB | PostgreSQL compatible, Prisma ORM |
| ORM | Prisma | Reuse mcpctl patterns |
| IaC | Pulumi (TypeScript) | Replaces Helm and Puppet |
| k8s CNI | Cilium | eBPF, WireGuard, network policies |
| Auth | mTLS (built-in CA) | Certificate-based, no SSH keys |
| Packaging | nfpm (RPM/DEB) | bun compile for standalone binary |
| Containers | Podman + podman-compose | No Docker dependency |
| CI/CD | Gitea Actions | Self-hosted on mysources.co.uk |
| Testing | Vitest | Unit + smoke + integration |
| Registry | Gitea packages | RPM, DEB, container images |

## 11. Lessons from mcpctl

The mcpctl project (../mcpctl/) established patterns reused here:

**Project structure:** pnpm monorepo with workspace packages (shared, cli, daemon). Each package has own package.json, tsconfig.json, vitest.config.ts.

**CLI patterns:** Commander.js with factory functions (createXxxCommand). Global options (--project → --env/--cloud). Resource CRUD (get, describe, delete, create, apply).

**Server patterns:** Fastify with route registration functions. Services layer with repository pattern. Middleware for auth. Health endpoints.

**Database:** Prisma ORM with PostgreSQL (now CockroachDB, wire-compatible). Migration-first schema. Seed data for initial setup.

**RBAC:** Role-based with permission strings. Middleware checks on every request. Audit logging in middleware.

**Testing:** Vitest with separate configs for unit vs smoke. Smoke tests with real database and services. Security tests for RBAC.

**CI/CD:** Gitea Actions with lint→typecheck→test→build→publish pipeline. nfpm for RPM/DEB. Bun compile for standalone binaries. Podman for container images.

**Deployment:** Docker/Podman compose for dev stack. Portainer API for production deploy (we'll use k3s instead). systemd for local daemons.

**Completions:** Generated from Commander tree. Bash + Fish. --write and --check modes. Included in packages.

**Key learnings applied:**
- Start with proper monorepo structure (not flat scripts)
- Type safety across packages via workspace references
- Test-driven (unit tests before features)
- CI from the start (not retrofitted)
- RBAC and audit from the start (not bolted on)
- Database-first design (schema defines the domain)

## 12. Gitea Registry

**Registry:** mysources.co.uk (self-hosted Gitea at 10.0.0.194)
**Token:** stored at ~/.gitea-token, env var PACKAGES_TOKEN
**Packages:** RPM and DEB published to Gitea packages API
**Container images:** pushed to Gitea container registry
**API pattern:** Same as mcpctl publish scripts (check existing, delete, re-upload, link to repo)