# labctl — Infrastructure Management Platform

## Product Requirements Document

## 1. Overview

labctl is a unified infrastructure management platform for bare-metal servers, Kubernetes clusters, and cloud resources. It replaces Puppet with a modern, TypeScript-native system using Pulumi for infrastructure as code.

### 1.1 Core Principles
- **Single CLI** (`labctl`) for all infrastructure operations
- **mTLS everywhere** — built-in Certificate Authority, no SSH key management
- **RBAC from day one** — deny by default, audit everything
- **Multi-cloud** — bare metal now, AWS later, extensible to any cloud
- **Test infrastructure like code** — ephemeral environments, smoke tests, security tests
- **Pulumi over Helm** — TypeScript charts, typed, testable, no YAML templating

### 1.2 Current State (completed)
- PXE bastion for bare-metal provisioning (discover, install, reprovision)
- CLI with subcommands: `labctl init bastion`, `labctl provision`
- LVM partitioning with reprovision data preservation (/home, /srv, /var/lib/longhorn, /var/lib/rancher)
- Worker role (k3s agent + Longhorn) and infra role (k3s server + etcd)
- 32 unit tests, VM smoke tests verified on real hardware
- Multi-arch builds (x86_64 + arm64), RPM/DEB packaging, Gitea CI/CD
- labd scaffold with CockroachDB Prisma schema (Server, Agent, User, Role, Permission, AuditLog, JoinToken, Cluster, PulumiRun)

### 1.3 Hardware
- labmaster (puppet.ad.itaz.eu / 78:55:36:08:35:14): MinisForum SER9, AMD Ryzen 7 255, 16 cores, 27GB RAM, 1TB NVMe, infra role
- Future: additional bare-metal worker nodes, AWS EC2 instances

## 2. Architecture

### 2.1 Components

```
labctl CLI → labd (master) → lab-agent (on every server)
                ↓
          CockroachDB
```

**labctl** — CLI binary installed on developer workstations. Compiled with bun to standalone binary. Distributed as RPM/DEB/binary.

**labd** — Master daemon running as k8s Deployment on labmaster's k3s cluster. Stateless (all state in CockroachDB). Multiple instances behind k8s Service for HA. Manages: CA, RBAC, agent registry, Pulumi executor, kubectl proxy, app deployments, log relay.

**lab-agent** — Lightweight daemon on every managed machine. Connects to labd via mTLS WebSocket. Handles: heartbeat, command execution, log streaming, module application. Compiled to standalone binary with bun. Installed via systemd service.

**CockroachDB** — Distributed SQL database. PostgreSQL wire-compatible (Prisma works unchanged). Single node to start, multi-node for HA. Stores: server state, RBAC, audit logs, certificates, kubeconfigs (encrypted), Pulumi state.

**Bastion** — PXE provisioning server. Runs as k8s pod with hostNetwork (needs DHCP/TFTP). Managed by labd as an "app". Multiple bastions for multiple sites.
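
As a rough illustration of the lab-agent duties listed above (heartbeat, command execution, log streaming), the sketch below models agent-to-labd WebSocket frames as a discriminated union and validates incoming JSON before use. The message names and fields (`heartbeat`, `log`, `exec-result`, `commandId`, `sentAt`) are hypothetical; this document does not define a wire format.

```typescript
// Hypothetical agent-to-labd message shapes; the real wire format is not specified here.
type AgentMessage =
  | { type: "heartbeat"; server: string; sentAt: string }
  | { type: "log"; server: string; line: string }
  | { type: "exec-result"; server: string; commandId: string; exitCode: number };

// Parse and validate one WebSocket text frame. Returns null for anything malformed,
// so the caller never acts on unvalidated agent input.
function parseAgentMessage(raw: string): AgentMessage | null {
  let data: unknown;
  try {
    data = JSON.parse(raw);
  } catch {
    return null;
  }
  if (typeof data !== "object" || data === null) return null;
  const fields = data as Record<string, unknown>;
  const server = fields["server"];
  if (typeof server !== "string") return null;

  switch (fields["type"]) {
    case "heartbeat": {
      const sentAt = fields["sentAt"];
      return typeof sentAt === "string" ? { type: "heartbeat", server, sentAt } : null;
    }
    case "log": {
      const line = fields["line"];
      return typeof line === "string" ? { type: "log", server, line } : null;
    }
    case "exec-result": {
      const commandId = fields["commandId"];
      const exitCode = fields["exitCode"];
      return typeof commandId === "string" && typeof exitCode === "number"
        ? { type: "exec-result", server, commandId, exitCode }
        : null;
    }
    default:
      return null;
  }
}
```

A union of this kind lets labd dispatch on `type` with exhaustive checks at compile time, which fits the "same language for CLI, daemon, agents" goal.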

### 2.2 Network Architecture

**Cilium** as k8s CNI (replacing default flannel):
- eBPF-based pod networking
- Built-in WireGuard encryption between nodes
- Network policies (ties into RBAC)
- Hubble for observability
- Future: Cluster Mesh for multi-site transparent networking

No Tailscale dependency — Cilium handles node-to-node encryption. Agents connect to labd over standard TCP/TLS.

### 2.3 Authentication

**mTLS with built-in Certificate Authority:**
1. labd generates root CA on first start (stored encrypted in CockroachDB)
2. Agents enroll with join token → receive signed certificate
3. CLI users authenticate with client certificates (or SSH key-based initial auth)
4. All communication authenticated via mutual TLS
5. Certificate rotation and revocation supported

**Join tokens:**
- One-time tokens: for individual bare-metal servers (generated during PXE provision, embedded in kickstart)
- Reusable tokens: for autoscaling groups (AWS ASG instances share a token)
- Tokens can be revoked, have optional expiry
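
The token rules above (one-time versus reusable, revocation, optional expiry) can be sketched as a single validation function. The field names (`kind`, `revoked`, `expiresAt`, `useCount`) are assumptions for illustration and are not taken from the Prisma `JoinToken` model.

```typescript
// Hypothetical token shape; field names are illustrative, not the real schema.
type JoinTokenKind = "one-time" | "reusable";

interface JoinToken {
  id: string;
  kind: JoinTokenKind;
  revoked: boolean;
  expiresAt: Date | null; // null means the token never expires
  useCount: number; // number of agents already enrolled with this token
}

type TokenCheck =
  | { ok: true }
  | { ok: false; reason: "revoked" | "expired" | "already-used" };

// Decide whether a token may be used for enrollment at time `now`.
function checkJoinToken(token: JoinToken, now: Date): TokenCheck {
  if (token.revoked) return { ok: false, reason: "revoked" };
  if (token.expiresAt !== null && token.expiresAt.getTime() <= now.getTime()) {
    return { ok: false, reason: "expired" };
  }
  // Only one-time tokens are limited to a single enrollment.
  if (token.kind === "one-time" && token.useCount > 0) {
    return { ok: false, reason: "already-used" };
  }
  return { ok: true };
}
```

In a real enrollment path the check and the `useCount` increment would need to happen in one database transaction; otherwise two agents racing on the same one-time token could both pass.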

### 2.4 RBAC Model

Inspired by mcpctl's RBAC (`src/mcpd/src/services/`, `middleware/auth`). Hierarchical permissions:

```
action:cloud:environment:server

Examples:
  read:*:*:*                    — read everything
  exec:baremetal:lab:*          — exec on any lab bare-metal server
  kubectl:*:*:*                 — kubectl proxy on any cluster
  *:baremetal:lab:puppet        — full access to puppet server only
  manage:*:*:*                  — manage apps, clusters, tokens
  admin:*:*:*                   — full admin (create users, roles)
```

- **Resources:** servers, environments, clouds, modules, roles, users, clusters, apps, pulumi-stacks
- **Actions:** read, exec, apply, destroy, manage, admin, kubectl
- **Deny rules:** an explicit deny overrides any allow (like AWS IAM)

Prisma models: Role, Permission (allow/deny), UserRole binding.
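
A minimal sketch of how the `action:cloud:environment:server` strings could be evaluated, assuming `*` matches exactly one segment, deny-by-default, and explicit deny overriding allow. The type and function names are hypothetical, and the sketch treats every action literally (it does not assume that `admin` or `manage` imply other actions).

```typescript
type Effect = "allow" | "deny";

interface Permission {
  effect: Effect;
  pattern: string; // "action:cloud:environment:server"; "*" matches any one segment
}

interface AccessRequest {
  action: string;
  cloud: string;
  environment: string;
  server: string;
}

// True when every segment of the pattern is "*" or equals the request's segment.
function matchesPattern(pattern: string, req: AccessRequest): boolean {
  const parts = pattern.split(":");
  if (parts.length !== 4) return false; // malformed patterns never match
  const actual = [req.action, req.cloud, req.environment, req.server];
  return parts.every((part, i) => part === "*" || part === actual[i]);
}

// Deny by default; any matching deny rule wins over any number of allows.
function isAllowed(permissions: Permission[], req: AccessRequest): boolean {
  let allowed = false;
  for (const perm of permissions) {
    if (!matchesPattern(perm.pattern, req)) continue;
    if (perm.effect === "deny") return false;
    allowed = true;
  }
  return allowed;
}
```

For example, with an allow on `exec:baremetal:lab:*` and a deny on `exec:baremetal:lab:puppet`, exec is permitted on every lab server except `puppet`, and any action other than `exec` is rejected because nothing allows it.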

### 2.5 Database

**CockroachDB** chosen over PostgreSQL and Cassandra:
- PostgreSQL wire-compatible — Prisma works, mcpctl patterns reusable
- Multi-master replication — any node accepts reads AND writes
- Strong consistency (not eventual like Cassandra)
- Survives node failures (3 nodes tolerate 1 failure; 5 nodes tolerate 2 once the replication factor is raised from the default of 3 to 5)
- Auto-rebalancing when adding nodes
- Start single-node, scale to multi-node with zero code changes (just add nodes)
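
The failure-tolerance figures above follow from majority quorum: data replicated n times stays available while a majority of its replicas is reachable, so it tolerates floor((n - 1) / 2) simultaneous failures. A minimal sketch of that arithmetic (the function name is illustrative):

```typescript
// Number of replica failures a majority-quorum group of `replicas` members can survive.
function tolerableFailures(replicas: number): number {
  if (!Number.isInteger(replicas) || replicas < 1) {
    throw new RangeError("replicas must be a positive integer");
  }
  return Math.floor((replicas - 1) / 2);
}

// tolerableFailures(1) === 0  (single node: no fault tolerance)
// tolerableFailures(3) === 1
// tolerableFailures(5) === 2
```

Even replica counts add no tolerance over the odd count below them (4 replicas still tolerate only 1 failure), which is why odd cluster sizes are the usual choice.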

**Schema (already scaffolded in Prisma):**
- Server — managed machines (hostname, mac, cloud, env, role, labels, status)
- Agent — connected agents (cert, enrollment, last seen)
- User — platform users (username, cert fingerprint)
- Role — RBAC roles with permissions
- Permission — allow/deny rules (action:cloud:env:server)
- UserRole — user-to-role bindings
- JoinToken — enrollment tokens (one-time, reusable, revocable)
- AuditLog — every action logged (user, session, action, resource, result, duration)
- PulumiRun — infrastructure-as-code execution records
- Cluster — managed k8s clusters (kubeconfig encrypted)
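
To illustrate the `AuditLog` fields listed above (user, session, action, resource, result, duration), here is a sketch of a wrapper that records one entry per action, including failures. The `AuditEntry` shape, the `result` values, and the `audited` helper are assumptions for illustration; the real Prisma model may differ, and a production version would need to handle async actions.

```typescript
// Hypothetical in-memory shape of one audit record.
interface AuditMeta {
  user: string;
  session: string;
  action: string;
  resource: string;
}

interface AuditEntry extends AuditMeta {
  result: "ok" | "error";
  durationMs: number;
}

// Run `fn`, then hand exactly one audit entry to `sink`, whether `fn` returns or throws.
// `now` is injectable so that durations are deterministic in tests.
function audited<T>(
  meta: AuditMeta,
  sink: (entry: AuditEntry) => void,
  fn: () => T,
  now: () => number = Date.now,
): T {
  const start = now();
  try {
    const value = fn();
    sink({ ...meta, result: "ok", durationMs: now() - start });
    return value;
  } catch (err) {
    sink({ ...meta, result: "error", durationMs: now() - start });
    throw err; // the caller still sees the original failure
  }
}
```

Writing the entry in both branches is what makes "every action logged" hold for failed actions too.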

## 3. CLI Command Reference

### 3.1 Bastion (PXE Provisioning) — IMPLEMENTED
```bash
sudo labctl init bastion standalone start [--foreground] [--port 8080]
sudo labctl init bastion standalone stop
labctl init bastion standalone status
```

### 3.2 Provisioning — IMPLEMENTED
```bash
labctl provision list
labctl provision install <mac> <hostname> --role worker|infra
labctl provision reprovision <mac> <hostname> --role worker|infra
labctl provision forget <mac>
```

### 3.3 Server Management — TO BUILD
```bash
labctl get servers [--env NAME] [--cloud NAME] [--label KEY=VALUE]
labctl describe server/<name>
```

### 3.4 Remote Execution — TO BUILD
```bash
labctl exec server/<name> -- <command>
labctl exec server/<name> -it -- bash          # interactive TTY
labctl exec server/<name> --timeout 30s -- cmd
```

### 3.5 Kubernetes Proxy — TO BUILD
```bash
labctl kubectl --cluster <name> <kubectl-args>
labctl clusters add <name> --kubeconfig <path>
labctl clusters list
labctl clusters remove <name>
```

### 3.6 Logs — TO BUILD
```bash
# Server logs (journalctl passthrough, no DB in hot path)
labctl logs server/<name>                     # all journal
labctl logs server/<name> -f                  # follow (live WebSocket relay)
labctl logs server/<name> -n 100              # last 100 lines
labctl logs server/<name> -u k3s              # specific unit
labctl logs server/<name> -u sshd --since "1h ago"
labctl logs server/<name> -k                  # kernel
labctl logs server/<name> -p err              # errors only
labctl logs server/<name> --file /var/log/nginx/error.log

# App logs (k8s pod logs)
labctl logs app/<name> [-f] [--container NAME]

# Pulumi execution logs
labctl logs pulumi/<run-id> [-f]

# Bastion logs
labctl logs bastion/<env> [--mac MAC]

# Agent daemon logs
labctl logs agent/<server>

# Audit logs (from CockroachDB)
labctl logs audit [--user NAME] [--action ACTION] [--since TIME]
labctl logs audit/<user-date-sessionid>       # specific session
```

Log architecture: the agent runs `journalctl` or `tail` with the user-provided flags and streams stdout over WebSocket to labd, which relays it to the CLI. No database sits in the hot path. Future: Grafana Loki integration for cold storage.
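
Because the agent passes user-provided flags to `journalctl`, it is worth validating them before building the command line. The sketch below maps the flags from the examples above (`-f`, `-n`, `-u`, `--since`, `-k`, `-p`) to an argument array. The `JournalOptions` type, the function name, and the unit-name pattern are assumptions; only the `journalctl` flags themselves are real.

```typescript
// Hypothetical options object mirroring the `labctl logs server/<name>` flags.
interface JournalOptions {
  follow?: boolean; // -f
  lines?: number; // -n
  unit?: string; // -u
  since?: string; // --since
  kernel?: boolean; // -k
  priority?: string; // -p
}

// syslog priority names accepted by `journalctl -p`.
const PRIORITIES = ["emerg", "alert", "crit", "err", "warning", "notice", "info", "debug"];

// Build an argv array for journalctl, rejecting values that fail validation.
function buildJournalctlArgs(opts: JournalOptions): string[] {
  const args: string[] = ["--no-pager"];
  if (opts.follow) args.push("-f");
  if (opts.lines !== undefined) {
    if (!Number.isInteger(opts.lines) || opts.lines < 0) {
      throw new RangeError("lines must be a non-negative integer");
    }
    args.push("-n", String(opts.lines));
  }
  if (opts.unit !== undefined) {
    // Conservative allowlist for unit names (an assumption, stricter than systemd itself).
    if (!/^[A-Za-z0-9@._:-]+$/.test(opts.unit)) throw new Error("invalid unit name");
    args.push("-u", opts.unit);
  }
  if (opts.since !== undefined) args.push("--since", opts.since);
  if (opts.kernel) args.push("-k");
  if (opts.priority !== undefined) {
    if (!PRIORITIES.includes(opts.priority)) throw new Error("invalid priority");
    args.push("-p", opts.priority);
  }
  return args;
}
```

Returning an array rather than a string means the agent can spawn `journalctl` without a shell, so a value such as `--since "1h ago"` stays a single argument and cannot inject extra commands.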

### 3.7 Apps (Pulumi Charts, replacing Helm) — TO BUILD
```bash
labctl apps list
labctl apps install <name> [--set key=value] [-f values.yaml]
labctl apps status <name>
labctl apps upgrade <name>
labctl apps history <name>
labctl apps rollback <name> <version>
labctl apps uninstall <name>
```

### 3.8 Infrastructure as Code — TO BUILD
```bash
labctl apply -f <file.ts> --env <env>
labctl plan -f <file.ts> --env <env>
labctl destroy -f <file.ts> --env <env>
```

### 3.9 RBAC — TO BUILD
```bash
labctl get roles
labctl get users
labctl create role <name> --allow "action:cloud:env:server"
labctl create role <name> --deny "destroy:*:*:*"
labctl bind role <role> --user <user>
labctl unbind role <role> --user <user>
labctl get permissions
```

### 3.10 Environments and Clouds — TO BUILD
```bash
labctl get environments
labctl get clouds
labctl create environment <name> --cloud <cloud>
```

## 4. Partition Layout

### Worker Role
```
/boot/efi       600MB  EFI
/boot           3GB    ext4
── LVM VG: labvg ──
  swap          27GB
  /             33GB   xfs
  /var          100GB  xfs
  /var/log      10GB   xfs
  /home         10GB   xfs         ← preserved on reprovision
  /srv          20GB   xfs         ← preserved on reprovision
  /var/lib/longhorn  rest  xfs     ← preserved (Longhorn PVC storage)
  /tmp          tmpfs 4GB
```

### Infra Role
```
/boot/efi       600MB  EFI
/boot           3GB    ext4
── LVM VG: labvg ──
  swap          27GB
  /             33GB   xfs
  /var          100GB  xfs
  /var/log      10GB   xfs
  /home         10GB   xfs         ← preserved on reprovision
  /srv          20GB   xfs         ← preserved on reprovision
  /var/lib/rancher  20GB  xfs      ← preserved (k3s etcd data)
  /tmp          tmpfs 4GB
```

## 5. Module System

Configuration modules define desired state. Three tiers:
1. **Core modules** (this repo, `modules/`): k3s-server, k3s-agent, labd, lab-agent, bastion
2. **Official modules** (separate repos): monitoring, cilium, DNS
3. **Custom modules** (user repos): pulled by git URL

Module structure:
```
module.yaml          # name, version, targets (roles/labels), deps
src/index.ts         # entry point
src/install.ts       # installation logic
src/configure.ts     # configuration logic
src/health.ts        # health check
tests/               # vitest tests (mandatory)
```
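
One possible reading of this layout, sketched as TypeScript types: `module.yaml` supplies the metadata and targets, while `install.ts`, `configure.ts`, and `health.ts` supply the three lifecycle functions. All names below (`LabModule`, `ModuleContext`, `appliesTo`) are hypothetical, as is the rule that every listed label must match; the actual module contract is not defined in this document.

```typescript
// Facts about the machine a module is being applied to (illustrative).
interface ModuleContext {
  hostname: string;
  role: string; // e.g. "worker" or "infra"
  labels: Record<string, string>;
}

interface HealthResult {
  healthy: boolean;
  detail: string;
}

// Targets as they might appear in module.yaml: roles and/or labels.
interface ModuleTargets {
  roles?: string[];
  labels?: Record<string, string>;
}

interface LabModule {
  name: string;
  version: string;
  targets: ModuleTargets;
  install(ctx: ModuleContext): Promise<void>; // src/install.ts
  configure(ctx: ModuleContext): Promise<void>; // src/configure.ts
  health(ctx: ModuleContext): Promise<HealthResult>; // src/health.ts
}

// A module applies when the role is listed (if roles are given)
// and every target label is present with the same value (if labels are given).
function appliesTo(targets: ModuleTargets, ctx: ModuleContext): boolean {
  if (targets.roles !== undefined && !targets.roles.includes(ctx.role)) return false;
  if (targets.labels !== undefined) {
    for (const key of Object.keys(targets.labels)) {
      if (ctx.labels[key] !== targets.labels[key]) return false;
    }
  }
  return true;
}
```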

## 6. Testing Strategy

### 6.1 Testing Pyramid
```
Unit Tests        → pure logic, milliseconds, every commit
Smoke Tests       → containers (podman-compose), minutes, every commit
Integration Tests → VMs (libvirt), 10-15 min, PRs
E2E Tests         → real hardware/cloud, 20-30 min, pre-release
```

### 6.2 Smoke Test Stack (podman-compose)
```yaml
services:
  cockroachdb:
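    image: cockroachdb/cockroach:latest-v24.3
    # Single-node, insecure mode: for throwaway smoke-test stacks only
    command: start-single-node --insecure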
  labd:
    build: .
    depends_on: [cockroachdb]
  agent-1:
    build: ./agent
    depends_on: [labd]
  agent-2:
    build: ./agent
    depends_on: [labd]
```
Tests: agent enrollment, certificate issuance, heartbeat, exec, logs, RBAC deny/allow.

### 6.3 Security Tests (RBAC)
- Deny exec without permission
- Deny cross-environment access
- Deny rules override allow rules
- Cannot escalate own permissions
- Audit logs all denied attempts
- Certificate-based auth cannot be spoofed
- Join tokens cannot be reused (one-time)
- Expired tokens rejected

### 6.4 Ephemeral Test Environments
```bash
labctl test smoke                                    # podman-compose
labctl test integration                              # libvirt VMs
labctl env create pr-123 --cloud containers          # CI ephemeral
labctl env create pr-123 --cloud aws                 # cloud ephemeral (future)
```

### 6.5 Health Gates for Deployment
Before promoting to production, ALL must pass:
- labd API responds
- Expected number of agents connected
- k3s nodes Ready
- Certificates valid, with more than 30 days until expiry
- RBAC smoke test passes
- No error logs in last 5 minutes
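
The gates above could be evaluated from a single snapshot of cluster state, as in the sketch below. The `ClusterSnapshot` fields are hypothetical inputs that labd would have to collect, and treating "expected number of agents" as "at least the expected number" is an assumption.

```typescript
// Hypothetical snapshot of the state each gate depends on.
interface ClusterSnapshot {
  apiResponding: boolean;
  agentsConnected: number;
  agentsExpected: number;
  nodesReady: number;
  nodesTotal: number;
  minCertDaysRemaining: number; // shortest remaining lifetime across all certificates
  rbacSmokePassed: boolean;
  errorLogsLast5Min: number;
}

interface GateResult {
  gate: string;
  passed: boolean;
}

// Evaluate every gate so a failed promotion can report all problems at once.
function evaluateGates(s: ClusterSnapshot): GateResult[] {
  return [
    { gate: "labd API responds", passed: s.apiResponding },
    { gate: "expected agents connected", passed: s.agentsConnected >= s.agentsExpected },
    { gate: "k3s nodes Ready", passed: s.nodesTotal > 0 && s.nodesReady === s.nodesTotal },
    { gate: "certificates valid (>30 days)", passed: s.minCertDaysRemaining > 30 },
    { gate: "RBAC smoke test passes", passed: s.rbacSmokePassed },
    { gate: "no error logs in last 5 minutes", passed: s.errorLogsLast5Min === 0 },
  ];
}

// Promotion is allowed only when ALL gates pass.
function canPromote(s: ClusterSnapshot): boolean {
  return evaluateGates(s).every((g) => g.passed);
}
```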

## 7. Cloud/Environment Model

```
Cloud: baremetal
  └── Environment: lab
       ├── Server: labmaster.ad.itaz.eu (infra, labels={k3s=server})
       └── Server: ser9.ad.itaz.eu (worker, labels={k3s=agent})

Cloud: aws (future)
  └── Environment: production
       ├── Server: i-abc123 (from ASG web-servers)
       └── Server: i-def456 (from ASG web-servers)
```

Each bastion creates an environment under the `baremetal` cloud. AWS autoscaling groups create environments under the `aws` cloud.
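
The hierarchy above also suggests how the `--env`, `--cloud`, and `--label KEY=VALUE` filters of `labctl get servers` might be evaluated. The sketch below uses a flat list of servers tagged with their cloud and environment; the type and function names are hypothetical.

```typescript
// One managed server, flattened from the cloud -> environment -> server tree.
interface ManagedServer {
  name: string;
  cloud: string;
  environment: string;
  role: string;
  labels: Record<string, string>;
}

interface ServerFilter {
  env?: string;
  cloud?: string;
  label?: string; // "KEY=VALUE"
}

// Split "KEY=VALUE" at the first "="; the key must be non-empty.
function parseLabel(label: string): { key: string; value: string } {
  const eq = label.indexOf("=");
  if (eq <= 0) throw new Error(`expected KEY=VALUE, got "${label}"`);
  return { key: label.slice(0, eq), value: label.slice(eq + 1) };
}

// Keep the servers that satisfy every filter that was supplied.
function filterServers(servers: ManagedServer[], filter: ServerFilter): ManagedServer[] {
  const label = filter.label === undefined ? undefined : parseLabel(filter.label);
  return servers.filter((s) => {
    if (filter.env !== undefined && s.environment !== filter.env) return false;
    if (filter.cloud !== undefined && s.cloud !== filter.cloud) return false;
    if (label !== undefined && s.labels[label.key] !== label.value) return false;
    return true;
  });
}

// The bare-metal servers from the tree above.
const inventory: ManagedServer[] = [
  { name: "labmaster.ad.itaz.eu", cloud: "baremetal", environment: "lab", role: "infra", labels: { k3s: "server" } },
  { name: "ser9.ad.itaz.eu", cloud: "baremetal", environment: "lab", role: "worker", labels: { k3s: "agent" } },
];
```

With this inventory, `filterServers(inventory, { label: "k3s=agent" })` returns only `ser9.ad.itaz.eu`, and `{ cloud: "aws" }` returns an empty list.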

## 8. App Model (Pulumi Charts)

Each app is a Pulumi TypeScript program:
```
app.yaml             # name, version, inputs schema, required permissions
src/index.ts         # Pulumi program
values.yaml          # defaults
tests/               # vitest tests
```
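
As a sketch of how an app's inputs might be resolved before the Pulumi program runs, the function below merges the `values.yaml` defaults with a user-supplied values file and `--set key=value` flags. The precedence (defaults, then file, then `--set`) mirrors Helm's behaviour and is an assumption here, as are the flat string-only values and the function names.

```typescript
// Flat key/value inputs; nested values are out of scope for this sketch.
type Values = Record<string, string>;

// Split one "--set key=value" argument at the first "=".
function parseSetFlag(flag: string): [string, string] {
  const eq = flag.indexOf("=");
  if (eq <= 0) throw new Error(`expected key=value, got "${flag}"`);
  return [flag.slice(0, eq), flag.slice(eq + 1)];
}

// Later sources override earlier ones: defaults < file values < --set flags.
function resolveValues(defaults: Values, fileValues: Values, setFlags: string[]): Values {
  const merged: Values = { ...defaults, ...fileValues };
  for (const flag of setFlags) {
    const [key, value] = parseSetFlag(flag);
    merged[key] = value;
  }
  return merged;
}
```

For example, `resolveValues({ replicas: "1", image: "labd" }, { replicas: "2" }, ["replicas=3"])` yields `{ replicas: "3", image: "labd" }`.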

First apps to build:
- bastion — PXE provisioning (wrap existing code)
- labd — master daemon (self-deployment)
- cockroachdb — database
- cilium — CNI

## 9. Implementation Phases

### Phase 1: Foundation (PARTIALLY DONE)
- [x] PXE bastion (discover, install, reprovision)
- [x] CLI structure (labctl init/provision)
- [x] labd scaffold (Fastify + CockroachDB/Prisma schema)
- [x] Multi-arch builds, packaging, CI/CD
- [ ] Certificate Authority in labd
- [ ] lab-agent skeleton (connect, heartbeat, enrollment)
- [ ] Agent enrollment via join tokens
- [ ] RBAC engine
- [ ] labctl exec (remote execution)
- [ ] labctl logs (resource-scoped streaming)
- [ ] labctl get servers (with filters)
- [ ] Smoke test stack (podman-compose)

### Phase 2: Deployment
- [ ] Reprovision labmaster as labmaster.ad.itaz.eu
- [ ] Deploy k3s with Cilium CNI
- [ ] Deploy CockroachDB on k3s
- [ ] Deploy labd on k3s
- [ ] Deploy bastion as managed app
- [ ] Auto-enroll agents during PXE provision

### Phase 3: Infrastructure as Code
- [ ] Module system
- [ ] Pulumi charts (replacing Helm)
- [ ] labctl apps install/upgrade/rollback
- [ ] labctl apply -f (Pulumi execution)
- [ ] kubectl proxy (audited)
- [ ] Kubeconfig store (encrypted)

### Phase 4: Multi-Cloud
- [ ] AWS provider (Pulumi)
- [ ] Reusable join tokens for ASGs
- [ ] Cilium Cluster Mesh
- [ ] Ephemeral test environments
- [ ] Grafana Loki for cold logs

## 10. Technology Stack

| Component | Technology | Notes |
|-----------|-----------|-------|
| Language | TypeScript (ESM) | Same for CLI, daemon, agents, IaC |
| CLI | Commander.js | Matches mcpctl patterns |
| HTTP Server | Fastify + WebSocket | labd and bastion |
| Database | CockroachDB | PostgreSQL compatible, Prisma ORM |
| ORM | Prisma | Reuse mcpctl patterns |
| IaC | Pulumi (TypeScript) | Replaces Helm and Puppet |
| k8s CNI | Cilium | eBPF, WireGuard, network policies |
| Auth | mTLS (built-in CA) | Certificate-based, no SSH keys |
| Packaging | nfpm (RPM/DEB) | bun compile for standalone binary |
| Containers | Podman + podman-compose | No Docker dependency |
| CI/CD | Gitea Actions | Self-hosted on mysources.co.uk |
| Testing | Vitest | Unit + smoke + integration |
| Registry | Gitea packages | RPM, DEB, container images |

## 11. Lessons from mcpctl

The mcpctl project (../mcpctl/) established patterns reused here:

**Project structure:** pnpm monorepo with workspace packages (shared, cli, daemon). Each package has own package.json, tsconfig.json, vitest.config.ts.

**CLI patterns:** Commander.js with factory functions (createXxxCommand). Global options (--project → --env/--cloud). Resource CRUD (get, describe, delete, create, apply).

**Server patterns:** Fastify with route registration functions. Services layer with repository pattern. Middleware for auth. Health endpoints.

**Database:** Prisma ORM with PostgreSQL (now CockroachDB, wire-compatible). Migration-first schema. Seed data for initial setup.

**RBAC:** Role-based with permission strings. Middleware checks on every request. Audit logging in middleware.

**Testing:** Vitest with separate configs for unit vs smoke. Smoke tests with real database and services. Security tests for RBAC.

**CI/CD:** Gitea Actions with lint→typecheck→test→build→publish pipeline. nfpm for RPM/DEB. Bun compile for standalone binaries. Podman for container images.

**Deployment:** Docker/Podman compose for dev stack. Portainer API for production deploy (we'll use k3s instead). systemd for local daemons.

**Completions:** Generated from Commander tree. Bash + Fish. --write and --check modes. Included in packages.

**Key learnings applied:**
- Start with proper monorepo structure (not flat scripts)
- Type safety across packages via workspace references
- Test-driven (unit tests before features)
- CI from the start (not retrofitted)
- RBAC and audit from the start (not bolted on)
- Database-first design (schema defines the domain)

## 12. Gitea Registry

- **Registry:** mysources.co.uk (self-hosted Gitea at 10.0.0.194)
- **Token:** stored at `~/.gitea-token`, env var `PACKAGES_TOKEN`
- **Packages:** RPM and DEB published to Gitea packages API
- **Container images:** pushed to Gitea container registry
- **API pattern:** same as mcpctl publish scripts (check existing, delete, re-upload, link to repo)