commit ac695f506ff7dfa71b07b9c5c95fc114ea4c5919
Author: Michal Rydlikowski <michal@itaz.eu>
Date:   Sun Mar 15 23:50:43 2026 +0000

    first commit

diff --git a/architecture.md b/architecture.md
new file mode 100644
index 0000000..0788c2e
--- /dev/null
+++ b/architecture.md
@@ -0,0 +1,246 @@
+# Architecture Decisions
+
+## Core Principles
+
+1. Build for homelab first, design for AWS/multi-cloud from the start
+2. Labels as the universal abstraction — config attaches to labels, not machines
+3. Code is the policy — declarations grant access, no separate policy management
+4. Availability over consistency — stale data is acceptable, no data is not
+5. No single point of failure — everything works offline with local cache
+6. Don't reinvent the wheel — wrap existing tools, build the glue and UX
+7. One engine everywhere — CLI, server, and init all use the same code path
+
+## The Tool: "lab"
+
+Unified infrastructure lifecycle platform. Full spec in `lab-tool-spec.md`.
+
+### Component Dependency Map
+
+```
+┌─────────────────────────────────────────────────────────────────────┐
+│                        LAB PLATFORM                                  │
+│                                                                      │
+│  ┌─────────────────────────────────────────────────────────────┐    │
+│  │                    CORE (no external deps)                   │    │
+│  │  ┌──────────┐ ┌──────────┐ ┌──────────┐ ┌───────────────┐  │    │
+│  │  │ Label    │ │ Group    │ │ Targeting│ │ Render Engine │  │    │
+│  │  │ Engine   │ │ Engine   │ │ Engine   │ │ (CLI tables,  │  │    │
+│  │  │          │ │          │ │          │ │  TUI, diff)   │  │    │
+│  │  └──────────┘ └──────────┘ └──────────┘ └───────────────┘  │    │
+│  │  ┌──────────────┐ ┌──────────────┐ ┌──────────────────┐    │    │
+│  │  │ Profile      │ │ State Store  │ │ Plugin Registry  │    │    │
+│  │  │ Engine       │ │ (SQLite +    │ │                  │    │    │
+│  │  │ (t-shirt     │ │  Litestream) │ │                  │    │    │
+│  │  │  sizes)      │ │              │ │                  │    │    │
+│  │  └──────────────┘ └──────────────┘ └──────────────────┘    │    │
+│  └─────────────────────────────────────────────────────────────┘    │
+│       ▲ depends on core                                              │
+│  ┌────┴────────────────────────────────────────────────────────┐    │
+│  │              LIFECYCLE (depends on: core + providers)        │    │
+│  │  ┌──────────────┐ ┌──────────────┐ ┌──────────────────┐    │    │
+│  │  │ Lifecycle    │ │ Artifact     │ │ K8s Deployer     │    │    │
+│  │  │ Manager      │ │ Builder      │ │                  │    │    │
+│  │  │ (plan/apply/ │ │ (puppet →    │ │                  │    │    │
+│  │  │  destroy)    │ │  container)  │ │                  │    │    │
+│  │  └──────────────┘ └──────────────┘ └──────────────────┘    │    │
+│  └─────────────────────────────────────────────────────────────┘    │
+│       ▲ depends on lifecycle                                         │
+│  ┌────┴────────────────────────────────────────────────────────┐    │
+│  │              IDENTITY & SECRETS (depends on: lifecycle)      │    │
+│  │  ┌──────────────┐ ┌──────────────┐ ┌──────────────────┐    │    │
+│  │  │ Identity     │ │ Secret Store │ │ Token Issuer     │    │    │
+│  │  │ Manager      │ │ (privileged  │ │ (one-time join   │    │    │
+│  │  │ (enroll,     │ │  label, local│ │  tokens)         │    │    │
+│  │  │  DNS, certs, │ │  cache, git  │ │                  │    │    │
+│  │  │  SSH keys)   │ │  backup)     │ │                  │    │    │
+│  │  └──────────────┘ └──────────────┘ └──────────────────┘    │    │
+│  └─────────────────────────────────────────────────────────────┘    │
+│       ▲ depends on identity                                          │
+│  ┌────┴────────────────────────────────────────────────────────┐    │
+│  │              OBSERVABILITY (depends on: core + identity)    │    │
+│  │  ┌──────────────┐ ┌──────────────┐ ┌──────────────────┐    │    │
+│  │  │ Health       │ │ Alert        │ │ Audit Log        │    │    │
+│  │  │ Aggregator   │ │ Generator    │ │                  │    │    │
+│  │  │              │ │ (auto + user │ │                  │    │    │
+│  │  │              │ │  defined)    │ │                  │    │    │
+│  │  └──────────────┘ └──────────────┘ └──────────────────┘    │    │
+│  └─────────────────────────────────────────────────────────────┘    │
+│                                                                      │
+│  ┌─────────────────────────────────────────────────────────────┐    │
+│  │              INTERFACES (depends on: everything above)      │    │
+│  │  ┌──────────┐ ┌──────────┐ ┌───────────┐ ┌──────────────┐  │    │
+│  │  │ gRPC/REST│ │ CLI      │ │ TUI       │ │ Web UI       │  │    │
+│  │  │ API      │ │ (cobra)  │ │(bubbletea)│ │ (future)     │  │    │
+│  │  └──────────┘ └──────────┘ └───────────┘ └──────────────┘  │    │
+│  └─────────────────────────────────────────────────────────────┘    │
+└─────────────────────────────────────────────────────────────────────┘
+
+PROVIDER PLUGINS (external, loaded at runtime):
+  ┌────────────┐ ┌────────────┐ ┌──────────────┐ ┌────────────┐
+  │provider-aws│ │provider-   │ │provider-     │ │provider-k8s│
+  │ (Pulumi)   │ │xcpng (XO)  │ │baremetal     │ │ (Pulumi)   │
+  └────────────┘ └────────────┘ │(Tinkerbell)  │ └────────────┘
+                                └──────────────┘
+HEALTH PLUGINS:                 IDENTITY PLUGINS:
+  ┌────────────┐ ┌──────────┐   ┌───────────┐ ┌─────────────┐
+  │health-     │ │health-   │   │id-openvox │ │id-dns       │
+  │prometheus  │ │naemon    │   │           │ │             │
+  └────────────┘ └──────────┘   └───────────┘ └─────────────┘
+  ┌────────────┐                ┌───────────┐ ┌─────────────┐
+  │health-     │                │id-ssh-ca  │ │id-secret    │
+  │cloudwatch  │                │           │ │             │
+  └────────────┘                └───────────┘ └─────────────┘
+```
+
+### Build Order (what depends on what)
+
+```
+Phase 1: CORE (can be built and tested independently)
+  ├── Label Engine
+  ├── Group Engine (depends on: labels)
+  ├── Targeting Engine (depends on: labels, groups)
+  ├── Profile Engine (t-shirt sizes)
+  ├── Render Engine
+  ├── State Store (SQLite + Litestream)
+  ├── Plugin Registry
+  ├── CLI framework (cobra)
+  └── gRPC/REST API skeleton
+
+Phase 2: PROVIDERS (can be built in parallel, each independent)
+  ├── provider-ssh (simplest, needed for onboarding existing machines)
+  ├── provider-baremetal (PXE boot — embedded DHCP/TFTP/HTTP server)
+  ├── provider-portainer (deploy via Portainer API)
+  ├── provider-k8s (needed for k8s deployments)
+  ├── provider-aws (Pulumi AWS)
+  └── provider-xcpng (Pulumi XO / XO REST API)
+
+Phase 3: LIFECYCLE (depends on: core + at least one provider)
+  ├── Lifecycle Manager (plan/apply/destroy)
+  ├── Onboarding (lab onboard — SSH detect + PXE boot + auto-enroll)
+  ├── Hardware detection (suggest labels from detected CPU/GPU/RAM/disk)
+  ├── Local mode (lab init --local, engine on user device)
+  ├── Self-deploy (lab init — deploy to remote target)
+  ├── Self-migration (lab server migrate)
+  └── Artifact Builder (puppet → container)
+
+Phase 4: IDENTITY (depends on: lifecycle)
+  ├── Token Issuer (one-time join tokens)
+  ├── OpenVox Enroller (cert signing, node classification)
+  ├── DNS Manager (auto-registration, IP mobility)
+  ├── SSH CA integration
+  └── Secret Store (privileged label, local cache, git backup)
+
+Phase 5: OBSERVABILITY (depends on: core + identity)
+  ├── Health Aggregator (Prometheus, Naemon, CloudWatch plugins)
+  ├── Alert Generator (auto + user-defined, targeting engine)
+  ├── Four-pillar status (sync + puppet + health + identity)
+  └── Audit log
+
+Phase 6: UX POLISH
+  ├── TUI (bubbletea, k9s-style, cross-linked navigation)
+  ├── lab show / lab targets (visibility commands)
+  ├── lab render (multi-provider comparison)
+  └── Web UI (future)
+```
+
+### Key Concepts
+
+| Concept | Description |
+|---------|-------------|
+| **Labels** | Universal abstraction. Config (puppet classes, alerts, secrets, sizes) attaches to labels |
+| **Groups** | Composable, nested, with exclusions. Target by label, group, server, environment |
+| **Targeting** | Unified query syntax used everywhere: alerts, secrets, puppet, queries |
+| **Four Pillars** | Every resource shows: Sync + Puppet + Health + Identity |
+| **Profiles** | T-shirt sizing with per-provider mappings, user-owned |
+| **Secret Store** | Privileged label holding all secrets, machines get only entitled subset |
+| **Code = Policy** | `lab::secret()` in puppet code = usage AND access declaration |
+| **Artifact Builder** | Same puppet modules → VM config OR container image |
+| **Self-deploy** | Lab deploys itself using same engine as everything else |
+| **Visibility** | Two-way: server→everything applied, label→all servers affected |
+
+## Infrastructure Stack
+
+| Layer | Homelab | AWS Equivalent | Status |
+|-------|---------|----------------|--------|
+| Orchestration | k3s | EKS | Decided |
+| IaC engine | Pulumi | Pulumi | Decided |
+| GitOps | ArgoCD | ArgoCD | Decided |
+| Monitoring (k8s) | Prometheus + Grafana | Prometheus + Grafana | Decided |
+| Monitoring (infra) | Naemon | N/A (bare metal only) | Decided |
+| Secrets backend | TBD | TBD | Needs investigation |
+| DNS | PowerDNS + ExternalDNS | Route53 + ExternalDNS | Decided — see `dns-research.md` |
+| TLS / CA | TBD | TBD | Needs investigation |
+| SSH CA | TBD | TBD | Needs investigation |
+| Storage | Longhorn | EBS CSI | Decided |
+| Config mgmt | OpenVox | OpenVox | Decided |
+| Bare metal boot | Tinkerbell / iPXE | N/A | Needs investigation |
+| State store | SQLite + Litestream | SQLite + Litestream | Leading candidate |
+| Container build | Buildah / Docker | Buildah / Docker | Needs investigation |
+
+## Decisions Made
+
+| Decision | Choice | Why | Alternatives Considered |
+|----------|--------|-----|------------------------|
+| IaC engine | Pulumi | Real languages, plan/preview, component packages, XCP-ng provider exists | Terraform (no abstraction), Crossplane (no plan) |
+| Config mgmt | OpenVox | Puppet fork, Apache 2.0, existing modules, active community | Puppet (Perforce EULA, 25-node limit) |
+| Multi-cloud abstraction | Custom (Lab) | Nothing exists that does labels + plan + bare metal + XCP-ng | Crossplane (no plan), Terraform (re-implement per cloud) |
+| Kubernetes | k3s | Puppet-friendly, multi-arch, lightweight, same K8s API as EKS | OpenShift (fights puppet), Talos (no SSH/puppet), MicroK8s (snap-based) |
+| Target OS list | Ubuntu, Debian, Fedora, AlmaLinux, XCP-ng, VyOS | Multi-arch, each with different install automation | See `os-install-research.md` |
+| State store | NOT etcd | etcd fails outright rather than serving stale data; we need availability > consistency | Leading: SQLite + Litestream |
+| Secret access model | Code = policy | Declarations in code/labels auto-grant access, no manual Vault policies | Manual Vault policy management |
+| Secret distribution | Privileged store + local cache | Prevents secret sprawl, machines only get entitled secrets | Peer-to-peer sync (leaks secrets sideways) |
+| Resilience model | Offline-capable | Local cache keeps everything running, git backup for DR | Central server dependency (FreeIPA burned us) |
+| Bootstrap | Self-deploying | lab init uses same engine as lab apply, no special codepath | Separate init provider interface |
+
+## Evaluated and Rejected
+
+| Tool | Why Rejected | Details |
+|------|-------------|---------|
+| **Crossplane** | No plan/preview — dealbreaker for enterprise | `crossplane-evaluation.md` |
+| **Foreman** | Obsolete, poor UX (judged from first-hand use) | Memory: `feedback_foreman.md` |
+| **Terraform/OpenTofu** | No multi-platform abstraction | Re-implement per cloud at thousands of nodes |
+| **MAAS** | Bare metal only | No cloud VMs, no Puppet integration |
+| **OpenShift** | Fights external config mgmt, heavy, limited ARM | See `kubernetes-flavors.md` |
+| **Talos** | Immutable OS, no SSH, no puppet | Incompatible with our approach |
+| **MicroK8s** | Snap-based | Puppet managing snaps is awkward |
+| **HashiCorp Vault** | Not impressed; central-server mindset conflicts with the offline-capable model | Will evaluate alternatives (OpenBao, Infisical, etc.) |
+| **etcd** | Consistency over availability | Crashes rather than serving stale data |
+| **FreeIPA** | Unstable | Good features (DNS, SSH, CA, secrets) but unreliable |
+
+## Investigation Queue
+
+Things we've identified but haven't evaluated yet, in rough priority order:
+
+| # | Topic | Context | Options to Investigate |
+|---|-------|---------|----------------------|
+| 1 | Secret backend | Distributed, offline-capable, policy-filtered | OpenBao, Infisical, Conjur, SOPS+age, custom encrypted SQLite |
+| 2 | ~~DNS auto-registration~~ | ~~Every managed resource auto-registered~~ | **DECIDED: PowerDNS + ExternalDNS** — see `dns-research.md` |
+| 3 | SSH CA | CA-signed host keys, short-lived user certs | Vault SSH engine, OpenVox CA, step-ca, Teleport, Boundary |
+| 4 | TLS / Internal CA | Machine certs, auto-renewal | OpenVox CA, Vault PKI, step-ca, cert-manager |
+| 5 | Bare metal provisioning | Universal PXE agent + rootfs deploy (NOT native installers) | Wrap Tinkerbell vs build own agent — see `os-install-research.md` |
+| 6 | State store | Embedded, auto-backup, auto-recover | SQLite+Litestream, bbolt, Badger |
+| 7 | Container build | Puppet modules → OCI images | Buildah, Docker, Kaniko |
+| 8 | Local cache encryption | Machine-specific key for secret cache | TPM 2.0, kernel keyring, LUKS-bound, secure enclave |
+| 9 | Alert rendering | Generate monitoring configs from lab alerts | Prometheus rules, Naemon configs, CloudWatch |
+| 10 | Input format | How users define resources and labels | YAML (Compose-like), Pkl, KCL, CUE, TypeScript |
+| 11 | Auth (CLI to server) | Secure CLI-to-lab-server communication | mTLS, OIDC, Vault tokens |
+| 12 | XCP-ng Pulumi provider | May need Upjet wrapper or direct API | Existing Terraform provider via Upjet, Pulumi XO provider |
+| 13 | Multi-tenancy | Team scoping for labels/resources | Namespaces, RBAC, org hierarchy |
+| 14 | Image production pipeline | Build rootfs tarballs per OS per arch | mkosi, debootstrap, dnf --installroot, Packer |
+| 15 | Tinkerbell evaluation | Hands-on: does wrapping it work, or build our own agent? | HookOS + actions vs custom LinuxKit agent |
+| 16 | XCP-ng rootfs extraction | How to produce deployable XCP-ng rootfs (not native installer) | Extract from ISO, capture installed system |
+| 17 | VyOS rootfs extraction | How to produce deployable VyOS rootfs | VyOS build system, published images, Docker mode |
+| 18 | Multi-arch PXE | Different boot chains for x86 BIOS, x86 UEFI, ARM UEFI | Per-arch agent OS builds, iPXE configs |
+
+## Project Files
+
+| File | Contents |
+|------|----------|
+| `lab-tool-spec.md` | Full platform specification (CLI examples, plugin interfaces, secrets, identity, bootstrap) |
+| `architecture.md` | This file — decisions, dependencies, investigation queue |
+| `hardware.md` | Homelab hardware inventory and node roles |
+| `crossplane-evaluation.md` | Crossplane evaluation and rejection rationale |
+| `config-format-research.md` | YAML alternatives research (Pkl, KCL, CUE, CDK8s, etc.) |
+| `os-install-research.md` | OS install automation, rootfs production, image pipeline, deployment matrix |
+| `kubernetes-flavors.md` | k3s chosen, OpenShift/Talos/MicroK8s rejected with rationale |
+| `dns-research.md` | PowerDNS + ExternalDNS chosen, domain claims, health-checked DNS |
diff --git a/bastion.sh b/bastion.sh
new file mode 100755
index 0000000..f7e8a9a
--- /dev/null
+++ b/bastion.sh
@@ -0,0 +1,337 @@
+#!/usr/bin/env bash
+# ─────────────────────────────────────────────────────────────────────
+# Lab PXE Bastion — ephemeral PXE server for bare-metal provisioning
+#
+# Turns this machine into a temporary PXE boot server.  Target machines
+# on the same network can PXE boot and get Fedora installed automatically.
+#
+# Usage:
+#   sudo bash bastion.sh                        # interactive, auto-detect everything
+#   sudo TARGET_HOSTNAME=puppet SSH_PUBKEY=~/.ssh/id_ed25519.pub bash bastion.sh
+#
+# Requirements: Fedora/RHEL host with dnsmasq, python3, curl
+# ─────────────────────────────────────────────────────────────────────
+set -euo pipefail
+
+# ──── Defaults (override via environment) ──────────────────────────
+FEDORA_VERSION="${FEDORA_VERSION:-41}"
+ARCH="${ARCH:-x86_64}"
+HTTP_PORT="${HTTP_PORT:-8080}"
+TARGET_HOSTNAME="${TARGET_HOSTNAME:-lab-node}"
+TARGET_DISK="${TARGET_DISK:-}"          # empty = ALL disks are wiped, anaconda picks the target
+SSH_PUBKEY="${SSH_PUBKEY:-}"            # path to .pub file, auto-detected
+TIMEZONE="${TIMEZONE:-Europe/London}"
+LOCALE="${LOCALE:-en_GB.UTF-8}"
+BASTION_DIR="${BASTION_DIR:-/tmp/lab-bastion}"
+
+# ──── Colors ───────────────────────────────────────────────────────
+RED='\033[0;31m'; GREEN='\033[0;32m'; YELLOW='\033[1;33m'
+CYAN='\033[0;36m'; BOLD='\033[1m'; NC='\033[0m'
+
+log()  { echo -e "${GREEN}[bastion]${NC} $*"; }
+warn() { echo -e "${YELLOW}[bastion]${NC} $*"; }
+err()  { echo -e "${RED}[bastion]${NC} $*" >&2; }
+die()  { err "$@"; exit 1; }
+
+# ──── Preflight ────────────────────────────────────────────────────
+[[ $EUID -eq 0 ]] || die "Must run as root (need DHCP/TFTP ports). Use: sudo bash bastion.sh"
+
+command -v python3 >/dev/null || die "python3 not found"
+command -v curl    >/dev/null || die "curl not found"
+
+# Install dnsmasq if missing
+if ! command -v dnsmasq >/dev/null; then
+    log "Installing dnsmasq..."
+    if command -v dnf >/dev/null; then
+        dnf install -y dnsmasq
+    elif command -v apt-get >/dev/null; then
+        apt-get install -y dnsmasq
+    else
+        die "Cannot install dnsmasq — install it manually"
+    fi
+fi
+
+# ──── Auto-detect network ─────────────────────────────────────────
+IFACE="${IFACE:-$(ip route | awk '/default/ {print $5; exit}')}"
+SERVER_IP="$(ip -4 addr show "$IFACE" | awk '/inet / {split($2,a,"/"); print a[1]; exit}')"
+NETWORK="$(echo "$SERVER_IP" | awk -F. '{print $1"."$2"."$3".0"}')"
+
+[[ -n "$SERVER_IP" ]] || die "Cannot detect IP on interface $IFACE"
+log "Interface: ${BOLD}$IFACE${NC}  IP: ${BOLD}$SERVER_IP${NC}  Network: ${BOLD}$NETWORK${NC}"
+
+# ──── Auto-detect SSH pubkey ───────────────────────────────────────
+if [[ -z "$SSH_PUBKEY" ]]; then
+    # When run via sudo, check the real user's home
+    REAL_HOME="${HOME}"
+    if [[ -n "${SUDO_USER:-}" ]]; then
+        REAL_HOME="$(getent passwd "$SUDO_USER" | cut -d: -f6)"
+    fi
+    for keyfile in "$REAL_HOME/.ssh/id_ed25519.pub" "$REAL_HOME/.ssh/id_rsa.pub" "$REAL_HOME/.ssh/id_ecdsa.pub"; do
+        if [[ -f "$keyfile" ]]; then
+            SSH_PUBKEY="$keyfile"
+            break
+        fi
+    done
+fi
+
+if [[ -n "$SSH_PUBKEY" && -f "$SSH_PUBKEY" ]]; then
+    SSH_KEY_CONTENT="$(cat "$SSH_PUBKEY")"
+    log "SSH key: ${BOLD}$SSH_PUBKEY${NC}"
+else
+    warn "No SSH public key found. Root password will be set to 'changeme' (console login only; SSH password auth is disabled)."
+    warn "Set SSH_PUBKEY=/path/to/key.pub to use key-based auth instead."
+    SSH_KEY_CONTENT=""
+fi
+
+# ──── Prepare directories ─────────────────────────────────────────
+TFTPDIR="$BASTION_DIR/tftp"
+HTTPDIR="$BASTION_DIR/http"
+mkdir -p "$TFTPDIR" "$HTTPDIR"
+
+# ──── Cleanup handler ─────────────────────────────────────────────
+DNSMASQ_PID=""
+HTTP_PID=""
+FW_OPENED=false
+
+cleanup() {
+    echo ""
+    log "Shutting down..."
+    [[ -n "$DNSMASQ_PID" ]] && kill "$DNSMASQ_PID" 2>/dev/null && log "Stopped dnsmasq"
+    [[ -n "$HTTP_PID" ]]    && kill "$HTTP_PID"    2>/dev/null && log "Stopped HTTP server"
+
+    if $FW_OPENED && command -v firewall-cmd >/dev/null; then
+        log "Removing firewall rules..."
+        firewall-cmd --quiet --remove-service=dhcp     2>/dev/null || true
+        firewall-cmd --quiet --remove-service=tftp     2>/dev/null || true
+        firewall-cmd --quiet --remove-port=${HTTP_PORT}/tcp 2>/dev/null || true
+        firewall-cmd --quiet --remove-port=4011/udp    2>/dev/null || true
+    fi
+    ${RESTART_DNSMASQ:-false} && systemctl start dnsmasq 2>/dev/null && log "Restarted system dnsmasq"
+    log "Done. Bastion artifacts remain in $BASTION_DIR"
+    log "Re-run this script to reprovision. Remove with: rm -rf $BASTION_DIR"
+}
+trap cleanup EXIT; trap 'exit 130' INT TERM   # signals exit once; the EXIT trap then runs cleanup a single time
+
+# ──── Download artifacts (cached) ─────────────────────────────────
+download() {
+    local url="$1" dest="$2" label="$3"
+    if [[ -f "$dest" ]]; then
+        log "  ${label} — cached"
+        return
+    fi
+    log "  ${label} — downloading..."
+    curl -# -fL -o "$dest" "$url" || { rm -f "$dest"; die "Failed to download $label from $url"; }
+}
+
+FEDORA_MIRROR="https://download.fedoraproject.org/pub/fedora/linux/releases/${FEDORA_VERSION}/Everything/${ARCH}/os"
+
+log "Fetching boot artifacts (Fedora ${FEDORA_VERSION} ${ARCH})..."
+download "https://boot.ipxe.org/undionly.kpxe"   "$TFTPDIR/undionly.kpxe"   "iPXE BIOS"
+download "https://boot.ipxe.org/ipxe.efi"        "$TFTPDIR/ipxe.efi"        "iPXE UEFI"
+download "${FEDORA_MIRROR}/images/pxeboot/vmlinuz"    "$HTTPDIR/vmlinuz"     "Fedora kernel"
+download "${FEDORA_MIRROR}/images/pxeboot/initrd.img" "$HTTPDIR/initrd.img"  "Fedora initrd"
+
+# ──── Generate kickstart ──────────────────────────────────────────
+log "Generating kickstart for ${BOLD}${TARGET_HOSTNAME}${NC}..."
+
+# Disk config
+if [[ -n "$TARGET_DISK" ]]; then
+    DISK_CMDS="ignoredisk --only-use=${TARGET_DISK}
+clearpart --all --initlabel --drives=${TARGET_DISK}
+autopart --type=plain"
+else
+    DISK_CMDS="clearpart --all --initlabel
+autopart --type=plain"
+fi
+
+# Auth config
+if [[ -n "$SSH_KEY_CONTENT" ]]; then
+    AUTH_CMDS="rootpw --lock
+sshkey --username=root \"${SSH_KEY_CONTENT}\""
+else
+    AUTH_CMDS='rootpw --plaintext changeme'
+fi
+
+cat > "$HTTPDIR/ks.cfg" << KICKSTART
+# Lab Bastion — Fedora ${FEDORA_VERSION} kickstart
+# Generated: $(date -Iseconds)
+# Target: ${TARGET_HOSTNAME}
+
+# Install mode
+text
+reboot
+
+# Locale
+lang ${LOCALE}
+keyboard uk
+timezone ${TIMEZONE} --utc
+
+# Network
+network --bootproto=dhcp --activate --hostname=${TARGET_HOSTNAME}
+
+# Auth
+${AUTH_CMDS}
+
+# Disk
+${DISK_CMDS}
+
+# Bootloader
+bootloader --append="console=tty0 console=ttyS0,115200n8"
+
+# Install source
+url --mirrorlist=https://mirrors.fedoraproject.org/mirrorlist?repo=fedora-\$releasever&arch=\$basearch
+
+# Packages — minimal server + essentials
+%packages
+@core
+@server-product
+openssh-server
+vim-enhanced
+tmux
+git
+curl
+python3
+dnf-plugins-core
+%end
+
+# Post-install
+%post --log=/root/bastion-post-install.log
+#!/bin/bash
+set -x
+
+# Ensure SSH is enabled
+systemctl enable --now sshd
+
+# Allow root SSH with key (password auth disabled)
+sed -i 's/^#\?PermitRootLogin.*/PermitRootLogin prohibit-password/' /etc/ssh/sshd_config
+sed -i 's/^#\?PasswordAuthentication.*/PasswordAuthentication no/' /etc/ssh/sshd_config
+
+# Set hostname (hostnamectl needs a running systemd, which the installer chroot lacks)
+echo "${TARGET_HOSTNAME}" > /etc/hostname
+
+# Leave a breadcrumb
+echo "Provisioned by lab-bastion on \$(date -Iseconds)" > /etc/lab-provisioned
+
+# Placeholder: puppet enrollment will go here later
+# puppet is not installed yet — this IS the puppet server
+echo "# Lab bootstrap node — puppet server setup pending" > /root/README
+
+%end
+KICKSTART
+
+log "Kickstart written to ${HTTPDIR}/ks.cfg"
+
+# ──── Generate iPXE boot script ───────────────────────────────────
+cat > "$HTTPDIR/boot.ipxe" << IPXE
+#!ipxe
+
+echo
+echo =======================================
+echo   Lab PXE Bastion — Fedora ${FEDORA_VERSION}
+echo   Target: ${TARGET_HOSTNAME}
+echo =======================================
+echo
+
+kernel http://${SERVER_IP}:${HTTP_PORT}/vmlinuz inst.ks=http://${SERVER_IP}:${HTTP_PORT}/ks.cfg inst.repo=${FEDORA_MIRROR} inst.text
+initrd http://${SERVER_IP}:${HTTP_PORT}/initrd.img
+boot
+IPXE
+
+# ──── Generate dnsmasq config ─────────────────────────────────────
+cat > "$BASTION_DIR/dnsmasq.conf" << DNSMASQ
+# Lab PXE Bastion — dnsmasq config
+# ProxyDHCP mode: adds PXE options without replacing existing DHCP
+
+# Disable DNS (we only want DHCP/TFTP)
+port=0
+
+# Listen on the right interface
+interface=${IFACE}
+bind-interfaces
+
+# ProxyDHCP — works alongside existing DHCP (UniFi etc)
+dhcp-range=${NETWORK},proxy
+
+# TFTP for initial PXE boot
+enable-tftp
+tftp-root=${TFTPDIR}
+
+# Detect client architecture
+dhcp-match=set:bios,option:client-arch,0
+dhcp-match=set:efi64,option:client-arch,7
+dhcp-match=set:efi64,option:client-arch,9
+
+# Detect iPXE clients (already chainloaded)
+dhcp-userclass=set:ipxe,iPXE
+
+# First PXE boot → serve iPXE binary via TFTP
+dhcp-boot=tag:bios,tag:!ipxe,undionly.kpxe
+dhcp-boot=tag:efi64,tag:!ipxe,ipxe.efi
+
+# iPXE clients → chain to boot script via HTTP
+dhcp-boot=tag:ipxe,http://${SERVER_IP}:${HTTP_PORT}/boot.ipxe
+
+# Verbose logging (see what's happening)
+log-dhcp
+DNSMASQ
+
+# ──── Open firewall ───────────────────────────────────────────────
+if command -v firewall-cmd >/dev/null && firewall-cmd --state >/dev/null 2>&1; then
+    log "Opening firewall ports (DHCP, TFTP, HTTP:${HTTP_PORT})..."
+    firewall-cmd --quiet --add-service=dhcp
+    firewall-cmd --quiet --add-service=tftp
+    firewall-cmd --quiet --add-port=${HTTP_PORT}/tcp
+    # ProxyDHCP uses port 4011
+    firewall-cmd --quiet --add-port=4011/udp 2>/dev/null || true
+    FW_OPENED=true
+fi
+
+# ──── Stop conflicting services ───────────────────────────────────
+# dnsmasq might be running as a system service
+if systemctl is-active --quiet dnsmasq 2>/dev/null; then
+    warn "System dnsmasq is running — stopping it temporarily"
+    systemctl stop dnsmasq
+    RESTART_DNSMASQ=true
+fi
+
+# ──── Start HTTP server ───────────────────────────────────────────
+log "Starting HTTP server on :${HTTP_PORT}..."
+python3 -m http.server "$HTTP_PORT" --bind 0.0.0.0 --directory "$HTTPDIR" >/dev/null 2>&1 &
+HTTP_PID=$!
+sleep 0.5
+
+if ! kill -0 "$HTTP_PID" 2>/dev/null; then
+    die "HTTP server failed to start — is port ${HTTP_PORT} in use?"
+fi
+
+# ──── Start dnsmasq (proxyDHCP + TFTP) ────────────────────────────
+log "Starting PXE server (proxyDHCP on ${IFACE})..."
+echo ""
+echo -e "${CYAN}${BOLD}════════════════════════════════════════════════════════${NC}"
+echo -e "${CYAN}${BOLD}  PXE Bastion ready!${NC}"
+echo -e "${CYAN}${BOLD}════════════════════════════════════════════════════════${NC}"
+echo ""
+echo -e "  Network:    ${BOLD}${NETWORK}/24${NC} via ${BOLD}${IFACE}${NC}"
+echo -e "  HTTP:       ${BOLD}http://${SERVER_IP}:${HTTP_PORT}/${NC}"
+echo -e "  OS:         ${BOLD}Fedora ${FEDORA_VERSION} (${ARCH})${NC}"
+echo -e "  Hostname:   ${BOLD}${TARGET_HOSTNAME}${NC}"
+echo -e "  Kickstart:  ${BOLD}http://${SERVER_IP}:${HTTP_PORT}/ks.cfg${NC}"
+echo ""
+echo -e "  ${YELLOW}Now PXE-boot the target machine.${NC}"
+echo -e "  ${YELLOW}Set boot order to Network/PXE in BIOS, or use one-time boot menu.${NC}"
+echo ""
+echo -e "  Press ${BOLD}Ctrl-C${NC} to stop the bastion."
+echo ""
+echo -e "${CYAN}──── dnsmasq log (watch for DHCP/PXE requests) ────${NC}"
+echo ""
+
+# Run dnsmasq in foreground so logs stream to terminal
+dnsmasq --no-daemon --conf-file="$BASTION_DIR/dnsmasq.conf" &
+DNSMASQ_PID=$!
+
+# Wait for dnsmasq — if it exits, something went wrong
+wait "$DNSMASQ_PID" || {
+    err "dnsmasq exited unexpectedly. Check if another DHCP/TFTP service is running."
+    err "Try: ss -ulnp | grep -E ':(67|69|4011) '"
+    exit 1
+}
diff --git a/config-format-research.md b/config-format-research.md
new file mode 100644
index 0000000..66b7fb4
--- /dev/null
+++ b/config-format-research.md
@@ -0,0 +1,121 @@
+# Configuration Format Research
+
+## Decision: PENDING — exploring alternatives to raw Kubernetes YAML
+
+## The Problem
+
+Kubernetes YAML is verbose, repetitive, lacks type safety, and forces users to specify
+every layer of concern (intent, team defaults, org standards, k8s boilerplate) in one file.
+Helm "solves" this with Go templating, which produces unreadable template spaghetti.
+
+Docker Compose is the gold standard for UX — 6 lines vs 35 for the same deployment.
+The problem was never YAML itself; it was being forced to write too much of it.
+
+## Core Design Principle
+
+Users should only define what they care about. Everything else should be inherited from
+expert-defined defaults. YAML (or JSON) can exist underneath as:
+- Easy, non-binary backup format
+- Live editing capability
+- Debugging / inspection output
+
+## Layered Architecture
+
+```
+Layer 1: User intent       "I want an api service running myapp"        ← USER WRITES THIS
+Layer 2: Team defaults     "Our services get health checks, limits"     ← Team lead defines
+Layer 3: Org standards     "All pods need security context, labels"     ← Platform team defines
+Layer 4: Output            Full YAML/JSON for kubectl, backup, debug    ← GENERATED
+```
+
+Docker Compose feels good because it's only Layer 1 — Docker handles the rest.
+Kubernetes forces all 4 layers into one file.
+
+## Evaluated Alternatives
+
+### Tier 1 — Strong Contenders
+
+**Pkl (Apple)**
+- Best syntax for "amend a template" via `amends` keyword
+- Strong static typing, clean readable syntax
+- Lowest ceremony for simple cases
+- Risk: Apple may abandon it; JVM-based implementation (native CLI binaries are available)
+- K8s support: `pkl-k8s` package exists
+
+**KCL (CNCF Sandbox)**
+- Python-like syntax, lowest learning curve of typed options
+- Schema defaults, validation, constraints built in
+- CNCF backing gives legitimacy
+- Risk: primarily driven by Ant Group (Alibaba)
+
+**CUE**
+- Most principled — constraint-based unification, not inheritance
+- Used by Timoni (Helm replacement), KubeVela, Dagger
+- Defaults marked with `*`, types and values on same spectrum
+- Risk: steep learning curve, novel paradigm
+- Most mature K8s ecosystem of the three
+
+### Tier 2 — Viable But Weaker Fit
+
+**CDK8s+ (TypeScript)**
+- Full IDE support, strongest type safety
+- cdk8s+ has intent-driven APIs ("I want a web service" → generates Deployment+Service)
+- Risk: brings software engineering complexity into config, AWS-centric
+- Good if team is TypeScript-native
+
+**Jsonnet (via Tanka)**
+- Proven at scale (Grafana uses it across hundreds of services)
+- Object mixins via `+` operator for composition
+- Risk: weak type safety, no compile-time validation of field names
+
+### Tier 3 — Not Recommended
+
+- **Dhall** — strongest type safety but Haskell-like syntax, small/stale community
+- **Nickel** — elegant contracts system but tiny K8s ecosystem
+- **Starlark** — no type safety, no schema system, just a scripting layer
+- **HCL** — great for infra provisioning, wrong fit for k8s manifests
+
+### Dead Projects
+- **Winglang** — shut down April 2025
+- **Klotho** — archived, pivoted to InfraCopilot
+- **Acorn** — pivoted to AI agents (Obot)
+
+## Compose-Like Input Format (Preferred Direction)
+
+The user prefers Docker Compose brevity. The tool we build could use a Compose-inspired
+input format at Layer 1, generating full k8s manifests + provider-specific resources underneath:
+
+```yaml
+# What the user writes
+services:
+  api:
+    image: myapp:latest
+    size: medium
+    ports: [8080]
+    env:
+      DB_HOST: postgres
+
+# System generates: full k8s Deployment, Service, NetworkPolicy,
+# resource limits, security context, health checks, etc.
+```
+
+YAML is fine for Layer 1 if it's short enough. The problem was never the format —
+it was the verbosity. Compose proves short YAML works.
+
+## Open Questions
+
+1. Should Layer 1 input be YAML (Compose-like), or a typed language (Pkl/KCL/CUE)?
+2. How do team defaults (Layer 2) and org standards (Layer 3) get defined and distributed?
+3. Should the render view show the generated YAML diff when changing Layer 1 input?
+4. How does this integrate with the Pulumi multi-cloud abstraction layer?
+5. Could the input format support both k8s workloads AND infrastructure resources
+   (VMs, networks, storage) in the same spec?
+
+## GUI/TUI Space — Underserved Opportunity
+
+No tool has achieved significant adoption for visually *defining* infrastructure.
+Existing tools (K9s, Lens, Rancher) are for monitoring/management, not authoring.
+
+The ideal: platform engineers define schemas with constraints/defaults,
+developers interact with a form/wizard showing only fields they need,
+validated config generated underneath. Nobody has built this well yet.
diff --git a/crossplane-evaluation.md b/crossplane-evaluation.md
new file mode 100644
index 0000000..5b033d7
--- /dev/null
+++ b/crossplane-evaluation.md
@@ -0,0 +1,106 @@
+# Crossplane Evaluation
+
+## Decision: NOT ADOPTING
+
+Crossplane will not be used in this stack. The lack of a plan/preview mechanism is a dealbreaker
+for enterprise adoption and safe infrastructure management.
+
+---
+
+## Why We Evaluated It
+
+The core problem: Terraform/OpenTofu requires re-implementing the same infrastructure concepts
+per platform (AWS, XCP-ng, bare metal). At thousands of nodes across multiple platforms, this is
+a massive maintenance burden. Crossplane's XRD/Composition model promised a unified API:
+
+```
+XRD: "VirtualMachine" (universal API)
+  ├── Composition: AWS      → EC2 instance
+  ├── Composition: XCP-ng   → XO VM
+  └── Composition: bare metal → MAAS / Ansible
+```
+
+One API, multiple backends — teams request a "VirtualMachine" and the right composition handles it.
+
+## Strengths
+
+- **CNCF Graduated** (Nov 2025, v2.2) — Apache 2.0 license, top-tier maturity
+- **Continuous drift detection** — automatically reverts manual changes, unlike Terraform's on-demand plan/apply
+- **No state file management** — no remote backends, locking issues, or state corruption
+- **Kubernetes-native** — works with ArgoCD, Flux, kubectl, RBAC out of the box
+- **XRDs/Compositions** — genuine multi-platform abstraction layer, solves the "re-implement per cloud" problem
+- **Eventual consistency** — resources with complex dependencies don't get stuck like Terraform's dependency graph
+- **Enterprise adoption** — Deutsche Kreditbank, Elastic, Nike, Apple, NASA, Grafana Labs, 60+ orgs
+- **Deutsche Kreditbank** replaced Terraform; deployments went from weeks to under one hour
+
+## Dealbreaker: No Plan/Preview
+
+The single biggest issue. Terraform's `terraform plan` lets operators see exactly what will change
+before applying. Crossplane applies changes immediately upon resource creation/modification.
+
+- Discussed in the community for 2+ years with no resolution
+- A Kubernetes-native solution would be a `Plan` CRD that shows proposed changes before approval
+- ArgoCD's `argocd app sync --dry-run` is a partial workaround but only shows k8s resource diffs, not what the
+  cloud provider will actually do underneath
+- **For regulated environments and SRE teams at scale, change preview is non-negotiable**
+
+Possible reasons it hasn't been implemented:
+- The continuous reconciliation architecture may make point-in-time snapshots fundamentally hard
+- Upbound (commercial entity) may be reserving it for their paid platform
+- Or simply not prioritised
+
+## Other Significant Concerns
+
+### CRD Bloat
+- `provider-aws` installs 900+ CRDs — can make API server unresponsive for up to an hour (GitHub #2649)
+- Exceeds Kubernetes' recommended ~500 CRD limit
+- Mitigated by "Provider Families" (install per-service sub-providers) but requires careful planning
+
+### Debugging Difficulty
+- Errors propagate through layers: Claim → XR → Composition → Managed Resource → Provider → Cloud API
+- Multiple sources report debugging compositions is painful
+- Pipeline Inspector (alpha in v2.2) is being introduced but not production-ready
+
+### Chicken-and-Egg Problem
+- Crossplane runs inside Kubernetes — cannot provision the cluster it runs on
+- Requires a "management cluster" bootstrapped by other means (Terraform, Puppet, etc.)
+- If the management cluster dies, no drift detection or reconciliation runs
+- Recovery: applying YAMLs to a new cluster works if deterministic resource names are used,
+  otherwise risks creating duplicate cloud resources
+
+### Cluster Loss / Immutability Concerns
+- State lives in etcd, not a versionable state file
+- No independent audit trail or easy way to diff historical states
+- On new cluster: resources with explicit external names get adopted; auto-named resources get duplicated
+- Need etcd backups as insurance, and deterministic naming everywhere
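+
+Deterministic naming in practice means pinning the `crossplane.io/external-name` annotation,
+which is the identifier Crossplane uses to find the external resource. A sketch, assuming the
+Upbound AWS S3 provider; the API group/version depends on the provider release, and the
+resource and bucket names are placeholders:
+
+```yaml
+apiVersion: s3.aws.upbound.io/v1beta1   # assumed; check the CRDs of the installed provider
+kind: Bucket
+metadata:
+  name: lab-artifacts                   # placeholder
+  annotations:
+    # With this pinned, a rebuilt cluster should adopt the existing bucket
+    # rather than create a second one under a generated name.
+    crossplane.io/external-name: lab-artifacts-prod
+spec:
+  forProvider:
+    region: eu-west-1
+  providerConfigRef:
+    name: default
+```
+
+For resources whose identifier is assigned by the cloud (instance IDs, for example), the
+annotation has to carry that ID, so it must be recorded somewhere outside etcd.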
+
+### Performance at Scale
+- ~2000 composites took 6+ minutes to reconcile on k3d (GitHub #2256)
+- Reconciliation interval not easily configurable globally (GitHub #5934)
+
+### YAML Limitations
+- No native loops, conditionals, or programming constructs
+- Complex compositions require changes in multiple locations
+
+## XCP-ng Provider Gap
+
+- No Crossplane provider for XCP-ng exists today
+- A mature Terraform provider (`terraform-provider-xenorchestra`) exists, maintained by Vates
+- Could be wrapped via Upjet to auto-generate a Crossplane provider — but nobody has done it
+- Would be a greenfield open-source project
+
+## Real Issues Reported
+
+- API server unresponsiveness with too many CRDs (GitHub #2649)
+- CRD scaling issues beyond ~500 CRDs (GitHub #2895)
+- GCP SQL resources randomly marked for deletion — dangerous for production databases
+- Reconciliation rate limiting at scale (GitHub #2256)
+
+## Conclusion
+
+Crossplane solves a real problem (multi-platform abstraction) that we need, but the lack of
+plan/preview makes it unsuitable for enterprise-scale production infrastructure management.
+The operational concerns (CRD bloat, debugging, cluster dependency) add further risk.
+
+We need to find an alternative approach to the multi-platform abstraction problem that Crossplane
+solves, while retaining plan/preview capabilities.
diff --git a/dns-research.md b/dns-research.md
new file mode 100644
index 0000000..5c07ac0
--- /dev/null
+++ b/dns-research.md
@@ -0,0 +1,143 @@
+# DNS Solution Research
+
+## Decision: PowerDNS Authoritative + ExternalDNS
+
+### Why PowerDNS
+
+| Feature | PowerDNS | CoreDNS | BIND9 | Technitium |
+|---------|----------|---------|-------|------------|
+| REST API | Full | No (needs etcd) | No (nsupdate) | Yes |
+| Database backend | PostgreSQL/MySQL/SQLite | etcd | Zone files | Custom |
+| Health-aware DNS | Lua records (ifportup, ifurlup) | No | No | No |
+| ExternalDNS provider | Yes | Yes (via etcd) | Yes (RFC 2136) | No |
+| DNSSEC | Yes | Limited | Best | Yes |
+| Split DNS | dnsdist routing | Corefile blocks | Views (best) | APP records |
+| Maturity | ISP-grade | K8s-focused | Oldest | Newer |
+
+PowerDNS wins on: REST API (critical for Lab), health-check-aware Lua records,
+database backend for HA, and ExternalDNS integration.
+
+### Architecture
+
+```
+                    Lab Server
+                    (control plane)
+                        │
+                        │ PowerDNS REST API
+                        ▼
+                ┌───────────────┐
+                │  PowerDNS     │
+                │  Authoritative│──── PostgreSQL/SQLite backend
+                │  Server       │
+                └───────┬───────┘
+                        │
+            ┌───────────┼───────────┐
+            │           │           │
+            ▼           ▼           ▼
+      Internal DNS   ExternalDNS   dnsdist
+      .lab.internal  (k8s syncs    (split DNS
+                      Services/    routing)
+                      Ingress)
+```
+
+### How Lab Uses DNS
+
+#### Auto-registration on onboard
+When `lab onboard` completes, Lab calls PowerDNS API:
+- A record: `<server>.lab.internal → <ip>`
+- PTR record: `<reverse-ip>.in-addr.arpa → <server>.lab.internal`
+- Both created/updated atomically
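+
+A sketch of what the two updates could look like against the PowerDNS HTTP API. The bodies are
+shown as YAML for readability (the API takes the equivalent JSON); the server name, IP, TTL and
+reverse zone layout are assumptions:
+
+```yaml
+# PATCH /api/v1/servers/localhost/zones/lab.internal.
+rrsets:
+  - name: mail-1.lab.internal.
+    type: A
+    ttl: 300
+    changetype: REPLACE            # creates the RRset or overwrites it
+    records:
+      - content: 10.20.30.40
+        disabled: false
+---
+# PATCH /api/v1/servers/localhost/zones/30.20.10.in-addr.arpa.
+rrsets:
+  - name: 40.30.20.10.in-addr.arpa.
+    type: PTR
+    ttl: 300
+    changetype: REPLACE
+    records:
+      - content: mail-1.lab.internal.
+        disabled: false
+```
+
+The forward and reverse names live in different zones, so this is two requests. As far as the
+API goes, each PATCH is applied per zone, so atomicity across both would have to be handled by
+Lab (retry or roll back the first change if the second fails).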
+
+#### Domain claims via labels
+Labels can claim shared domain names:
+```yaml
+labels:
+  mailserver:
+    dns:
+      records:
+        - type: A
+          name: "{{server.name}}.lab.internal"
+      claims:
+        - name: mail.example.com
+          type: A
+          health_check: { port: 25 }
+```
+All servers with label `mailserver` contribute to `mail.example.com` round-robin.
+PowerDNS Lua records remove unhealthy servers automatically.
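+
+One way the `mail.example.com` claim could be rendered is a PowerDNS `LUA` record using
+`ifportup`. A sketch, with the body shown as YAML (the API takes the equivalent JSON); the IPs
+and TTL are placeholders, and it assumes `enable-lua-records` is switched on in `pdns.conf`:
+
+```yaml
+# PATCH /api/v1/servers/localhost/zones/example.com.
+rrsets:
+  - name: mail.example.com.
+    type: LUA
+    ttl: 60
+    changetype: REPLACE
+    records:
+      # ifportup(port, {addresses}) answers only from addresses that accept
+      # TCP connections on that port.
+      - content: 'A "ifportup(25, {''10.20.30.40'', ''10.20.30.41''})"'
+        disabled: false
+```
+
+Lab would rewrite the address list whenever a server gains or loses the `mailserver` label.
+Whether one or all healthy addresses are returned depends on the `selector` option of
+`ifportup`, which needs checking against the PowerDNS docs before relying on round-robin.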
+
+#### IP mobility
+Lab agent on machine reports IP change → Lab server updates PowerDNS API →
+A record, PTR, and all claimed domains updated.
+
+#### K8s integration
+ExternalDNS runs in k8s, syncs Service/Ingress records to same PowerDNS instance.
+Same DNS server serves both bare metal and k8s records.
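+
+A sketch of the ExternalDNS container settings for the PowerDNS provider. The endpoint, Secret
+name, domain filter and owner id are assumptions:
+
+```yaml
+# Fragment of the ExternalDNS Deployment pod spec
+containers:
+  - name: external-dns
+    image: registry.k8s.io/external-dns/external-dns   # pin a released tag
+    args:
+      - --source=service
+      - --source=ingress
+      - --provider=pdns
+      - --pdns-server=http://pdns.lab.internal:8081     # assumed API endpoint
+      - --pdns-api-key=$(PDNS_API_KEY)
+      - --domain-filter=lab.internal                    # assumed zone
+      - --registry=txt
+      - --txt-owner-id=homelab-k3s                      # assumed owner id
+    env:
+      - name: PDNS_API_KEY
+        valueFrom:
+          secretKeyRef: { name: pdns-api, key: api-key }   # assumed Secret
+```
+
+The TXT registry matters here: ExternalDNS only touches records it owns, so it should leave the
+A/PTR records that Lab writes directly to the same instance alone.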
+
+#### Groups claiming domains
+Groups can claim domains for all member servers:
+```yaml
+groups:
+  production-web:
+    match:
+      labels: [web-frontend]
+      environment: prod
+    dns:
+      claims:
+        - name: www.example.com
+          type: A
+          health_check: { url: "https://{{server.ip}}/healthz" }
+```
+
+### DNS Plugin Interface
+
+```go
+type DNSPlugin interface {
+    Name() string
+
+    // Record management
+    CreateRecord(zone, name, recordType string, targets []string, ttl int) error
+    UpdateRecord(zone, name, recordType string, targets []string, ttl int) error
+    DeleteRecord(zone, name, recordType string) error
+    ListRecords(zone string) ([]Record, error)
+
+    // Health-checked records
+    CreateHealthCheckedRecord(zone, name string, targets []string, check HealthCheck) error
+
+    // Zone management
+    CreateZone(name string, kind string) error
+    DeleteZone(name string) error
+}
+```
+
+Built-in:
+- `dns-powerdns` — PowerDNS REST API (primary)
+- `dns-route53` — AWS Route53 (for cloud deployments)
+- `dns-rfc2136` — RFC 2136 dynamic updates (BIND/Knot fallback)
+
+### Split DNS Setup
+
+Internal zones (`.lab.internal`) served by PowerDNS authoritatively.
+External queries are forwarded upstream (8.8.8.8, ISP DNS) by the resolver layer, not by the authoritative server.
+
+Options:
+- **dnsdist** (PowerDNS ecosystem) routes by source subnet
+- **CoreDNS as resolver** — serves internal from PowerDNS, forwards external
+- **BIND views** — if we need view-based split on same zone (unlikely)
+
+### Evaluated and Not Chosen
+
+| Tool | Why Not |
+|------|---------|
+| CoreDNS | No REST API, needs etcd intermediary, k8s-focused |
+| BIND9 | No REST API, nsupdate is cumbersome for automation |
+| Technitium | No ExternalDNS provider, newer/smaller community |
+| dnsmasq | Not suitable — caching forwarder, no API, ~1000 client limit |
+| Knot DNS | No REST API, better as secondary/downstream |
+
+### DNS-as-Code (Optional Layer)
+
+For static DNS infrastructure (SOA, NS, MX, base zone config):
+- **octoDNS** (GitHub) or **DNSControl** (Stack Exchange)
+- GitOps workflow: PR → review → merge → sync to PowerDNS
+- Dynamic records (server A records, claims) managed by Lab directly via API
+- Static records managed via DNS-as-code in Git
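+
+If octoDNS is chosen, the sync target could be configured roughly like this. A sketch based on
+the `octodns-powerdns` provider; the hostname, directory and zone are assumptions:
+
+```yaml
+# octoDNS config (sketch)
+providers:
+  config:
+    class: octodns.provider.yaml.YamlProvider
+    directory: ./zones             # one YAML file per zone, changed via PR
+  powerdns:
+    class: octodns_powerdns.PowerDnsProvider
+    host: pdns.lab.internal        # assumed hostname
+    api_key: env/POWERDNS_API_KEY
+zones:
+  example.com.:
+    sources: [config]
+    targets: [powerdns]
+```
+
+One thing to verify before adopting this: octoDNS treats the source as the full desired state of
+a zone, so records Lab creates dynamically in the same zone may be planned for deletion unless
+static and dynamic records are split into separate zones or filtered.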
diff --git a/hardware.md b/hardware.md
new file mode 100644
index 0000000..8c7fcab
--- /dev/null
+++ b/hardware.md
@@ -0,0 +1,37 @@
+# Homelab Hardware Inventory
+
+## Compute Nodes
+
+| Node | CPU Arch | RAM | Role | Cost |
+|------|----------|-----|------|------|
+| Beelink SER9 MAX | x86_64 | 64GB | k3s worker, ROCm GPU, Longhorn storage | ~£869 |
+| Beelink SER9 Pro | x86_64 | 32GB | Bootstrap: Puppet, DNS, UniFi, Vault, Naemon | ~£300 |
+| Minisforum MS-R1 | ARM (aarch64) | 64GB | k3s node | ~£500-640 |
+| Nvidia DGX Spark | ARM (Grace) | 128GB | CUDA/AI inference | ~£3,700 |
+| Mac Studio M1 Max | ARM (aarch64) | 32GB | k3s server #1 (etcd) | ~£775 |
+
+## Networking
+
+| Device | Specs | Cost |
+|--------|-------|------|
+| USW-Flex-XG x2 | 8x 10GbE ports total (4 per switch) | £458 |
+
+## Summary
+
+- **Total RAM:** 320GB
+- **Architectures:** x86_64, aarch64 (Apple Silicon + ARM + Grace)
+- **GPU compute:** ROCm (SER9 MAX), CUDA (DGX Spark)
+- **Estimated total:** ~£6,600-6,740
+
+## Node Roles
+
+### Bootstrap Node (Beelink SER9 Pro) — Outside k3s
+- Puppet (bare metal config management)
+- DNS (CoreDNS or PowerDNS)
+- UniFi controller
+- Vault (secrets management)
+- Naemon (bare metal, network, black-box endpoint monitoring)
+
+### k3s Cluster
+- **Server (control plane + etcd):** Mac Studio M1 Max
+- **Workers:** Beelink SER9 MAX, Minisforum MS-R1, DGX Spark
diff --git a/kubernetes-flavors.md b/kubernetes-flavors.md
new file mode 100644
index 0000000..b81e2c4
--- /dev/null
+++ b/kubernetes-flavors.md
@@ -0,0 +1,120 @@
+# Kubernetes Flavor Decision
+
+## Decision: k3s (confirmed)
+
+k3s is the best fit for Lab. OpenShift and most other flavors conflict with
+the puppet-managed, multi-arch, lightweight approach.
+
+## Evaluation
+
+| Flavor | Puppet-Friendly | ARM | Multi-arch | Enterprise | License | Verdict |
+|--------|:-:|:-:|:-:|:-:|---------|---------|
+| **k3s** | ✓ binary + config files | ✓ | ✓ | Rancher/SUSE | Apache 2.0 | **CHOSEN** |
+| **k0s** | ✓ single binary, config-driven | ✓ | ✓ | Mirantis | Apache 2.0 | Good alternative |
+| **kubeadm** | ✓ well-understood bootstrap | ✓ | ✓ | Upstream K8s | Apache 2.0 | Viable but heavier |
+| **RKE2** | ✓ config files | ✓ | ✓ | Rancher/SUSE | Apache 2.0 | Heavier k3s |
+| **OpenShift** | ✗ operator-driven, fights puppet | ✗ limited | ✗ limited | Red Hat | Proprietary | REJECTED |
+| **MicroK8s** | ⚠ snap-based, puppet+snaps awkward | ✓ | ✓ | Canonical | Apache 2.0 | Not great |
+| **Talos** | ✗ immutable OS, no SSH, no puppet | ✓ | ✓ | Sidero Labs | MPL 2.0 | Incompatible |
+
+## Why NOT OpenShift — Deep Analysis
+
+### OpenShift Does Overlap With Lab
+
+OpenShift is the closest existing thing to what Lab does. The overlap is real:
+
+| Capability | OpenShift | Lab |
+|-----------|-----------|-----|
+| Manages nodes end-to-end | Yes (RHCOS + MCO) | Yes (OpenVox + labels) |
+| Immutable infrastructure | Yes (rpm-ostree, operator-driven) | Yes (puppet convergence) |
+| Fights config drift | Yes (operators reconcile) | Yes (puppet + sync pillar) |
+| Built-in monitoring | Yes (Prometheus + Alertmanager bundled) | Yes (health aggregator) |
+| Built-in secrets | Yes (etcd-encrypted secrets) | Yes (secret store + local cache) |
+| Certificate management | Yes (internal CA, auto-rotation) | Yes (identity layer) |
+| Node lifecycle | Yes (MachineSet, MachinePools) | Yes (onboard, labels, providers) |
+| Self-managing | Yes (operators update themselves) | Yes (lab manages itself) |
+
+### Why OpenShift Still Doesn't Fit
+
+**1. Single OS** — OpenShift control plane = RHCOS only. Can't run on Apple Silicon,
+   Asahi Linux, or any non-RHCOS system. Lab needs Ubuntu, Debian, Fedora, AlmaLinux,
+   XCP-ng, VyOS across x86 and ARM.
+
+**2. K8s only** — OpenShift manages k8s nodes. Lab manages everything: k8s nodes,
+   standalone VMs, bare metal hypervisors, network appliances, physical servers that
+   will never run k8s. Not everything is a container.
+
+**3. Single cluster scope** — OpenShift manages one cluster. Lab manages homelab k3s +
+   enterprise AWS EKS + XCP-ng hypervisors + bare metal + OVH vRack. Cross-provider,
+   cross-cluster.
+
+**4. Fights puppet** — OpenShift has ~30+ operators that each own a piece of the system.
+   If puppet changes kubelet config, the Machine Config Operator detects "drift" and
+   reverts it. Two reconciliation loops fighting each other, possibly rebooting nodes
+   in a loop. You're supposed to change everything via CRDs, not external tools.
+
+**5. No XCP-ng/hypervisor management** — Can't provision VMs on XCP-ng, manage Xen
+   hosts, or understand hypervisors that aren't VMware/OpenStack.
+
+**6. Throws away puppet modules** — Company has existing puppet modules. OpenShift's
+   model is operators, not puppet. Complete rewrite of config management.
+
+**7. Heavyweight** — Minimum 6 nodes, 88GB RAM just for the platform. k3s uses 512MB.
+   Our entire homelab is 5 nodes, 320GB RAM.
+
+**8. ARM limited** — RHCOS on Apple Silicon doesn't exist. ARM support is limited to
+   AWS Graviton and some server ARM platforms.
+
+### The Scope Difference
+
+```
+OpenShift:  "I am your platform. Everything runs in me. I control the OS."
+            Scope: Kubernetes cluster + its nodes
+
+Lab:        "I manage your infrastructure. K8s is one thing I deploy."
+            Scope: Everything — VMs, bare metal, hypervisors, k8s,
+                   network gear, containers, across any provider
+```
+
+Lab is closer to what OpenShift + Satellite + RHCOS + ACM (Advanced Cluster Management)
+do **together** — but unified, lighter, open source, and not locked to Red Hat's ecosystem.
+
+## Why k3s
+
+- **Puppet-friendly** — it's just a binary and config files in `/etc/rancher/k3s/`
+- **Ultra-light** — runs on Mac Studio, ARM boxes, small VMs
+- **Multi-arch** — native x86 and ARM
+- **Same K8s API** as EKS/GKE — portable to cloud
+- **Single binary** — trivial to manage with puppet
+- **Proven** — CNCF certified, widely used in edge/IoT/homelab
+
+## k3s via Puppet (OpenVox)
+
+```puppet
+# Label: k8s-server → puppet class
+class kubernetes::server {
+  class { 'k3s::server':
+    token        => lab::secret('k8s/cluster-token'),
+    cluster_init => true,
+    tls_san      => [$facts['networking']['fqdn'], 'k8s.lab.internal'],
+  }
+}
+
+# Label: k8s-worker → puppet class
+class kubernetes::worker {
+  class { 'k3s::worker':
+    server_url => 'https://k8s.lab.internal:6443',
+    token      => lab::secret('k8s/cluster-token'),
+  }
+}
+```
+
+Same puppet classes work on bare metal, XCP-ng VM, EC2 instance, any architecture.
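+
+On the node, those classes boil down to a small k3s config file. A sketch, assuming the Puppet
+module writes `/etc/rancher/k3s/config.yaml` rather than passing CLI flags; the server FQDN is
+a placeholder and the token value comes from the secret lookup:
+
+```yaml
+# /etc/rancher/k3s/config.yaml on the first server (keys mirror the k3s CLI flags)
+cluster-init: true
+token: "<value of k8s/cluster-token>"
+tls-san:
+  - k3s-server-1.lab.internal      # assumed FQDN of this node
+  - k8s.lab.internal
+---
+# /etc/rancher/k3s/config.yaml on a worker (k3s agent)
+server: https://k8s.lab.internal:6443
+token: "<value of k8s/cluster-token>"
+```
+
+Because the whole node configuration is one file plus one binary, a Puppet `file` resource and a
+service restart are enough to converge it.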
+
+## k0s as Backup Option
+
+If k3s ever becomes problematic, k0s is the closest alternative:
+- Also single binary, config-driven, multi-arch
+- `k0sctl` adds cluster management (bootstrap, upgrade, reset)
+- Mirantis backing (Lens, Docker EE)
+- Worth monitoring but no reason to switch from k3s today
diff --git a/lab-tool-spec.md b/lab-tool-spec.md
new file mode 100644
index 0000000..cd4c3db
--- /dev/null
+++ b/lab-tool-spec.md
@@ -0,0 +1,1537 @@
+# Lab — Unified Infrastructure Lifecycle Platform
+
+## What It Is
+
+A tool that abstracts infrastructure lifecycle across clouds, hypervisors, bare metal,
+and Kubernetes — using labels as the universal abstraction and existing tools under the hood.
+
+**Not reinventing the wheel.** Uses Pulumi, OpenVox, Tinkerbell, Prometheus, Naemon,
+existing Puppet modules, cloud APIs — but provides a unified interface over all of them.
+
+## Architecture
+
+```
+┌────────────────────────────────────────────────────────────┐
+│                    lab-server (control plane)               │
+│                                                             │
+│  ┌──────────┐  ┌──────────┐  ┌──────────┐  ┌────────────┐ │
+│  │ Provider  │  │ Label    │  │ Lifecycle│  │ Artifact   │ │
+│  │ Registry  │  │ Engine   │  │ Manager  │  │ Builder    │ │
+│  └──────────┘  └──────────┘  └──────────┘  └────────────┘ │
+│  ┌──────────┐  ┌──────────┐  ┌──────────┐  ┌────────────┐ │
+│  │ OpenVox  │  │ Health   │  │ K8s      │  │ Render     │ │
+│  │ Enroller │  │Aggregator│  │ Deployer │  │ Engine     │ │
+│  └──────────┘  └──────────┘  └──────────┘  └────────────┘ │
+│  ┌──────────┐  ┌──────────┐  ┌──────────┐  ┌────────────┐ │
+│  │ Identity │  │ DNS      │  │ Secret   │  │ Token      │ │
+│  │ Manager  │  │ Manager  │  │ Manager  │  │ Issuer     │ │
+│  └──────────┘  └──────────┘  └──────────┘  └────────────┘ │
+│                                                             │
+│  API (gRPC + REST)                                          │
+└──────────────┬─────────────────────────────────────────────┘
+               │
+    ┌──────────┴──────────┐
+    │                     │
+┌───┴───┐           ┌────┴────┐
+│ lab   │           │ lab-tui │
+│ (CLI) │           │ (k9s)   │
+└───────┘           └─────────┘
+```
+
+### Control Plane (lab-server)
+
+Runs as a service (on bootstrap node, or in k8s). Hosts:
+
+- **Provider Registry** — pluggable providers (AWS, XCP-ng, bare metal, GCP, etc.)
+- **Label Engine** — resolves labels → puppet classes, sizes, ports, config
+- **Lifecycle Manager** — orchestrates provision → enroll → configure → observe
+- **Artifact Builder** — puppet classes → container images
+- **OpenVox Enroller** — secure cert signing, node classification, environment assignment
+- **Health Aggregator** — queries Prometheus, Naemon, cloud health APIs
+- **K8s Deployer** — manages workloads on k3s/EKS clusters
+- **Render Engine** — side-by-side provider comparison, cost estimates, drift detection
+- **Identity Manager** — tracks enrollment state, certs, Vault auth, SSH keys per resource
+- **DNS Manager** — auto-registers/updates DNS for every managed resource
+- **Secret Manager** — controls which resources can access which secrets (per-label policies)
+- **Token Issuer** — generates one-time join tokens at provision time (no hardcoded secrets)
+
+### CLI (lab)
+
+kubectl-like interface for browsing and managing resources:
+
+```
+$ lab get servers
+NAME          PROVIDER   LABELS                SIZE     SYNC     PUPPET    HEALTH    IDENTITY
+api-1         aws        app,prod,eu-west      medium   ✓ sync   ✓ ok      ✓ ok      ✓ enrolled
+api-2         aws        app,prod,eu-west      medium   ✓ sync   ✓ ok      ✓ ok      ✓ enrolled
+mail-1        xcpng      mailserver,prod       medium   ✓ sync   ✓ ok      ✓ ok      ✓ enrolled
+db-1          baremetal  postgres,prod         large    ⚠ drift  ✓ ok      ✓ ok      ✓ enrolled
+worker-3      aws        k8s-worker,staging    large    ✓ sync   ✗ failed  ⚠ 2 alrt  ✓ enrolled
+gateway-1     baremetal  k8s-server,prod       small    ✓ sync   ✓ ok      ✓ ok      ⚠ cert exp
+
+$ lab get servers --label mailserver
+NAME          PROVIDER   SIZE     SYNC     PUPPET    HEALTH    IDENTITY
+mail-1        xcpng      medium   ✓ sync   ✓ ok      ✓ ok      ✓ enrolled
+mail-2        aws        medium   ✓ sync   ✓ ok      ✓ ok      ✓ enrolled
+
+$ lab describe server db-1
+Name:       db-1
+Provider:   baremetal
+Labels:     [postgres, prod, eu-west]
+Size:       large (8 cores, 32GB, 500GB NVMe)
+Status:     DRIFT DETECTED
+  Expected: size=large, disk=500GB
+  Actual:   size=large, disk=500GB, extra_mount=/data (unmanaged)
+Puppet:
+  Environment: production
+  Role:         postgres
+  Classes:      [postgresql::server, backup::pgbackrest, node_exporter]
+  Last run:     2026-03-15 14:22:03 (success)
+  Next run:     2026-03-15 14:52:03
+Health:
+  Prometheus:   ✓ all targets up
+  Naemon:       ✓ all checks passing
+  Alerts:       none active
+
+$ lab get labels
+LABEL          PUPPET CLASSES                        SERVERS   CONTAINERS
+mailserver     postfix, dovecot, spamassassin        2         1
+k8s-worker     kubernetes::worker, containerd        12        0
+postgres       postgresql::server, pgbackrest        3         1
+app            nginx, app::deploy                    4         2
+
+$ lab get containers
+NAME              IMAGE                              LABEL        K8S CLUSTER   STATUS
+mailserver        ghcr.io/org/mailserver:2026.03.15  mailserver   homelab       running
+postgres          ghcr.io/org/postgres:2026.03.14    postgres     homelab       running
+app               ghcr.io/org/app:2026.03.15         app          production    running
+
+$ lab diff server db-1
+  size: large
+  disk: 500GB
++ extra_mount: /data    ← unmanaged, not in spec
+
+$ lab sync server db-1          # reconcile drift
+$ lab plan server new-mail-3 --label mailserver --provider aws   # preview
+$ lab apply server new-mail-3   # create it
+
+$ lab build --label mailserver  # puppet modules → container image
+Building mailserver from puppet classes:
+  ✓ postfix
+  ✓ dovecot
+  ✓ spamassassin
+  ✓ fail2ban
+→ ghcr.io/org/mailserver:2026.03.15
+
+$ lab render --label mailserver --all-providers
+┌──────────────┬──────────────┬──────────┬────────────┐
+│              │ AWS          │ XCP-ng   │ Bare Metal │
+├──────────────┼──────────────┼──────────┼────────────┤
+│ Compute      │ t3.large     │ 4c/8GB   │ IPMI boot  │
+│ Puppet       │ postfix,...  │ postfix,.│ postfix,...│
+│ Est. Cost    │ ~$62/mo      │ —        │ —          │
+└──────────────┴──────────────┴──────────┴────────────┘
+```
+
+### TUI (lab-tui)
+
+k9s-style interactive terminal UI:
+- Real-time server list with sync/puppet/health status
+- Drill into any server for details
+- Watch puppet runs live
+- Filter by labels, providers, health status
+- Trigger actions (sync, plan, apply, build)
+
+## Core Concepts
+
+### Labels — The Universal Abstraction
+
+Everything is a thing with labels. Configuration attaches to labels, not machines.
+
+```yaml
+labels:
+  mailserver:
+    puppet_classes:
+      - postfix
+      - dovecot
+      - spamassassin
+      - fail2ban
+    ports: [25, 587, 993]
+    size: medium
+    alerts:
+      - smtp_connect           # auto-generated: is SMTP responding?
+      - imap_connect           # auto-generated: is IMAP responding?
+      - mail_queue_length      # auto-generated: is mail queue healthy?
+    secrets:
+      - mail/tls-cert
+      - mail/dkim-key
+
+  k8s-worker:
+    puppet_classes:
+      - kubernetes::worker
+      - containerd
+      - node_exporter
+    size: large
+    alerts:
+      - kubelet_healthy
+      - node_ready
+    secrets:
+      - k8s/join-token
+```
+
+### Groups — Nested Targeting with Exclusions
+
+Groups compose labels, other groups, and individual servers into reusable targets.
+Groups can nest (subgroups). Exclusions allow fine-grained control.
+
+```yaml
+groups:
+  # Simple group: all production servers
+  production:
+    match:
+      environment: prod
+
+  # Group by label combination
+  production-mail:
+    match:
+      labels: [mailserver]
+      environment: prod
+
+  # Nested group with subgroups
+  eu-infrastructure:
+    groups:
+      - eu-west-compute
+      - eu-west-storage
+      - eu-west-network
+    exclude:
+      servers: [test-box-1]           # exclude specific server
+      labels: [experimental]           # exclude servers with this label
+
+  eu-west-compute:
+    match:
+      labels: [k8s-worker, k8s-server]
+      region: eu-west
+    exclude:
+      servers: [legacy-node-3]
+
+  # Group targeting everything except a subgroup
+  all-except-staging:
+    match:
+      environment: [prod, dev]
+    exclude:
+      environment: staging
+
+  # Custom group by explicit membership
+  database-tier:
+    servers: [db-1, db-2, db-3]
+    groups: [replica-set-eu]
+```
+
+### Alerts — Auto-Generated and User-Defined
+
+Alerts attach to labels, groups, servers, or environments — same targeting as everything else.
+
+#### Auto-Generated Alerts
+
+When Lab provisions a resource, it generates baseline alerts based on:
+- **Label**: mailserver label → SMTP/IMAP checks
+- **Puppet classes**: `postgresql::server` → postgres process, replication lag
+- **Ports**: if port 443 is declared → HTTPS health check
+- **Size**: resource limits → CPU/memory threshold alerts
+- **Identity**: cert expiry alerts auto-generated for all enrolled machines
+
+#### User-Defined Alerts
+
+Users can add custom alerts targeting any scope:
+
+```yaml
+alerts:
+  # Target by label
+  - name: mail_queue_critical
+    target:
+      labels: [mailserver]
+    condition: mail_queue_length > 1000
+    severity: critical
+    for: 5m
+
+  # Target by group
+  - name: disk_space_low
+    target:
+      groups: [production]
+    condition: disk_usage_percent > 85
+    severity: warning
+
+  # Target by environment
+  - name: high_cpu
+    target:
+      environment: prod
+    condition: cpu_usage_percent > 90
+    for: 10m
+    severity: warning
+
+  # Target specific servers
+  - name: gpu_temperature
+    target:
+      servers: [dgx-spark, beelink-ser9-max]
+    condition: gpu_temp_celsius > 80
+    severity: critical
+
+  # Target by label but exclude some
+  - name: memory_pressure
+    target:
+      labels: [k8s-worker]
+      exclude:
+        servers: [batch-worker-1]   # this one is expected to run hot
+    condition: memory_usage_percent > 90
+    severity: warning
+```
+
+Alerts are rendered to the underlying monitoring system (Prometheus rules, Naemon checks,
+CloudWatch alarms) — we don't build an alerting engine, we generate configs for existing ones.
+Which monitoring backend to use for each alert type: **needs investigation**.
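+
+For the Prometheus backend, the `disk_space_low` alert above might render to a rule like this.
+A sketch, assuming node_exporter filesystem metrics; `lab_group` is a hypothetical target label
+that Lab would have to attach at scrape time:
+
+```yaml
+# rules/production.yaml (sketch)
+groups:
+  - name: lab-production
+    rules:
+      - alert: disk_space_low
+        # Percentage used = 100 * (1 - available / size), per mounted filesystem
+        expr: |
+          100 * (1 - node_filesystem_avail_bytes{lab_group="production", fstype!~"tmpfs|overlay"}
+                   / node_filesystem_size_bytes{lab_group="production", fstype!~"tmpfs|overlay"}) > 85
+        labels:
+          severity: warning
+        annotations:
+          summary: "Disk usage above 85% on {{ $labels.instance }} ({{ $labels.mountpoint }})"
+```
+
+A server can belong to several groups, so a single `lab_group` label is a simplification; Lab
+could instead expand the group into an explicit instance matcher when rendering.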
+
+### Targeting — Unified Query System
+
+The same targeting syntax works everywhere: alerts, puppet classes, secrets, and queries.
+Target by label, group, server name, environment, region, or any combination with exclusions.
+
+```
+# CLI targeting syntax
+$ lab get servers --label k8s-worker
+$ lab get servers --group production
+$ lab get servers --environment staging
+$ lab get servers --label k8s-worker --environment prod --exclude worker-3
+
+# What's applied WHERE (server → everything)
+$ lab show server worker-5
+```
+
+### Visibility — Show What's Applied Where
+
+Two directions of querying: "what does this server get?" and "where does this thing apply?"
+
+#### Server View: Everything applied to a server
+
+```
+$ lab show server worker-5
+
+Server: worker-5 (aws, eu-west-1)
+Labels:   [k8s-worker, production, eu-west]
+Groups:   [production, eu-west-compute, eu-infrastructure]
+Environment: prod
+
+Puppet Classes (6):
+  FROM LABEL k8s-worker:
+    ├── kubernetes::worker
+    ├── containerd
+    └── node_exporter
+  FROM LABEL production:
+    ├── base::hardening
+    └── base::monitoring
+  FROM LABEL eu-west:
+    └── base::ntp_eu
+
+Alerts (8):
+  FROM LABEL k8s-worker:
+    ├── kubelet_healthy
+    └── node_ready
+  FROM GROUP production:
+    ├── disk_space_low
+    └── high_cpu
+  AUTO-GENERATED:
+    ├── cpu_threshold (from size: large)
+    ├── memory_threshold (from size: large)
+    ├── cert_expiry (from identity)
+    └── puppet_run_failed (from enrollment)
+
+Secrets (2):
+  FROM LABEL k8s-worker:
+    ├── k8s/join-token (read)
+    └── tls/node-cert (dynamic)
+
+Excluded From:
+  └── alert "memory_pressure" (explicitly excluded)
+```
+
+#### Label/Group View: Where does this apply?
+
+```
+$ lab show label mailserver
+
+Label: mailserver
+Applied to: 2 servers
+
+Servers:
+  ├── mail-1 (xcpng, prod)    ✓ sync  ✓ puppet  ✓ health  ✓ identity
+  └── mail-2 (aws, prod)      ✓ sync  ✓ puppet  ✓ health  ✓ identity
+
+Provides:
+  Puppet Classes: postfix, dovecot, spamassassin, fail2ban
+  Alerts: smtp_connect, imap_connect, mail_queue_length
+  Secrets: mail/tls-cert, mail/dkim-key
+  Ports: 25, 587, 993
+  Size: medium
+
+$ lab show group eu-infrastructure
+
+Group: eu-infrastructure
+Contains: 3 subgroups, 47 servers (2 excluded)
+
+Subgroups:
+  ├── eu-west-compute    (28 servers)
+  ├── eu-west-storage    (12 servers)
+  └── eu-west-network    (9 servers)
+
+Excluded:
+  ├── test-box-1 (by name)
+  └── 1 server with label "experimental"
+
+Alerts targeting this group:
+  ├── disk_space_low (warning)
+  └── network_latency_high (critical)
+```
+
+#### Alert View: Where does this alert fire?
+
+```
+$ lab show alert disk_space_low
+
+Alert: disk_space_low
+Severity: warning
+Condition: disk_usage_percent > 85
+Target: group "production"
+Excludes: none
+
+Applies to 63 servers:
+  ├── api-1 (aws)        currently: 42%  ✓
+  ├── api-2 (aws)        currently: 38%  ✓
+  ├── mail-1 (xcpng)     currently: 71%  ✓
+  ├── db-1 (baremetal)   currently: 83%  ⚠ approaching
+  └── ... (59 more)
+
+Rendered to:
+  ├── Prometheus: rule "disk_space_low" in rules/production.yaml
+  └── Naemon: service check on 4 bare-metal hosts
+```
+
+#### Reverse Query: What targets this server?
+
+```
+$ lab targets server db-1
+
+Everything targeting db-1:
+  Labels:     [postgres, production, eu-west]
+  Groups:     [production, database-tier, eu-infrastructure, eu-west-storage]
+  Environment: prod
+
+  Alerts (11):
+    ├── postgres_replication_lag    (from label: postgres)
+    ├── postgres_connections        (from label: postgres)
+    ├── disk_space_low              (from group: production)
+    ├── high_cpu                    (from group: production)
+    ├── storage_iops                (from group: eu-west-storage)
+    ├── cert_expiry                 (auto-generated)
+    └── ... (5 more)
+
+  Puppet Classes (9):
+    ├── postgresql::server          (from label: postgres)
+    ├── backup::pgbackrest          (from label: postgres)
+    └── ... (7 more)
+
+  Secrets (4):
+    ├── postgres/master-password    (from label: postgres)
+    └── ... (3 more)
+```
+
+### TUI Visualization (lab-tui)
+
+The k9s-style TUI should support navigating these relationships interactively:
+
+```
+┌─ lab-tui ──────────────────────────────────────────────────────────┐
+│ View: Servers > worker-5                                    [?]Help│
+├────────────────────────────────────────────────────────────────────┤
+│                                                                    │
+│  ┌─ Server: worker-5 ──────────────────────────────────────────┐  │
+│  │ Provider: aws          Size: large         Env: prod        │  │
+│  │ Sync: ✓   Puppet: ✓   Health: ✓   Identity: ✓              │  │
+│  └─────────────────────────────────────────────────────────────┘  │
+│                                                                    │
+│  [L]abels    [A]lerts    [P]uppet    [S]ecrets    [G]roups        │
+│                                                                    │
+│  Labels ──────────────────── Alerts ──────────────────────────     │
+│  ► k8s-worker                ● kubelet_healthy        ✓ OK        │
+│  ► production                ● node_ready             ✓ OK        │
+│  ► eu-west                   ● disk_space_low         ✓ 42%       │
+│                              ● high_cpu               ✓ 12%       │
+│  Groups ──────────────────   ● cert_expiry            ✓ 347d      │
+│  ► production                                                      │
+│    ► eu-infrastructure       Puppet Classes ──────────────────     │
+│      ► eu-west-compute       ● kubernetes::worker     ✓ applied   │
+│                              ● containerd             ✓ applied   │
+│  Secrets ─────────────────   ● node_exporter          ✓ applied   │
+│  ● k8s/join-token    (read)  ● base::hardening        ✓ applied   │
+│  ● tls/node-cert     (dyn)   ● base::monitoring       ✓ applied   │
+│                                                                    │
+│ [Enter] drill down  [Esc] back  [/] search  [Tab] switch pane    │
+└────────────────────────────────────────────────────────────────────┘
+```
+
+Navigation:
+- From server → drill into label → see all other servers with that label
+- From alert → see all servers it applies to, current values
+- From group → see subgroups, expand tree, see members
+- From label → see puppet classes, alerts, secrets it provides
+- Everything is cross-linked — follow any relationship in either direction
+
+### Deployment Targets
+
+Same label → multiple targets:
+
+| Target | What happens |
+|--------|-------------|
+| VM (any cloud) | Provision VM → enroll OpenVox → apply classes live |
+| Bare metal | PXE boot → enroll OpenVox → apply classes live |
+| Container | Build image with classes baked in → push to registry |
+| ASG | Launch template with OpenVox enrollment → auto-apply |
+| K8s pod | Deploy container artifact to cluster |
+
+### Four-Pillar Status
+
+Every resource shows four things:
+
+1. **Sync** — is the actual infrastructure state matching the declared spec?
+   (instance type, security groups, disks, network — via Pulumi state)
+2. **Puppet** — did OpenVox successfully apply all classes?
+   (last run status, any failures, catalog compilation errors)
+3. **Health** — are monitoring checks passing?
+   (aggregates from Prometheus alerts, Naemon checks, cloud health APIs)
+4. **Identity** — is the resource fully enrolled?
+   (DNS registered, certs valid, Vault authenticated, SSH host key signed)
+
+### Provider Plugin System
+
+Extensible provider model — each provider implements an interface:
+
+```go
+type Provider interface {
+    Name() string
+
+    // Lifecycle
+    Plan(spec ResourceSpec) (*PlanResult, error)
+    Apply(spec ResourceSpec) (*Resource, error)
+    Destroy(id string) error
+
+    // State
+    Get(id string) (*Resource, error)
+    List(filters Filters) ([]*Resource, error)
+    Diff(spec ResourceSpec) (*DiffResult, error)
+
+    // Introspection (like DA's type-writer)
+    DiscoverResources() ([]*Resource, error)
+    AvailableSizes() ([]Size, error)
+    AvailableImages() ([]Image, error)
+}
+```
+
+Built-in providers:
+- `provider-aws` — wraps Pulumi AWS
+- `provider-xcpng` — wraps Pulumi XO / Xen Orchestra API
+- `provider-baremetal` — wraps Tinkerbell / iPXE + IPMI/Redfish
+- `provider-k8s` — wraps Pulumi Kubernetes
+
+Community can add: GCP, Azure, Hetzner, Proxmox, etc.
+
+### Health Aggregator Plugin System
+
+```go
+type HealthSource interface {
+    Name() string
+    CheckHealth(resource *Resource) (*HealthResult, error)
+}
+```
+
+Built-in sources:
+- `health-prometheus` — queries Prometheus alerting rules targeting the resource
+- `health-naemon` — queries Naemon host/service checks
+- `health-cloudwatch` — queries AWS CloudWatch alarms
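+
+One way the Health pillar could combine several sources is "worst status wins". The sketch
+below reuses the `HealthSource` interface from above; the `Status` levels, the fields of
+`Resource` and `HealthResult`, and the rule that an unreachable source counts as `Unknown`
+(ranked between `OK` and `Warning`) are assumptions made for illustration.
+
+```go
+package main
+
+import (
+    "errors"
+    "fmt"
+)
+
+// Status is ordered by severity so that the larger value is the worse one.
+type Status int
+
+const (
+    OK Status = iota
+    Unknown
+    Warning
+    Critical
+)
+
+// Resource and HealthResult are reduced to the fields this sketch needs.
+type Resource struct{ Name string }
+
+type HealthResult struct {
+    Status Status
+    Detail string
+}
+
+type HealthSource interface {
+    Name() string
+    CheckHealth(resource *Resource) (*HealthResult, error)
+}
+
+var errUnreachable = errors.New("source unreachable")
+
+// Aggregate returns the worst status reported by any source.
+// A source that fails to answer contributes Unknown instead of aborting the check.
+func Aggregate(r *Resource, sources []HealthSource) Status {
+    worst := OK
+    for _, src := range sources {
+        st := Unknown
+        if res, err := src.CheckHealth(r); err == nil && res != nil {
+            st = res.Status
+        }
+        if st > worst {
+            worst = st
+        }
+    }
+    return worst
+}
+
+// fakeSource is a stand-in for health-prometheus, health-naemon, and so on.
+type fakeSource struct {
+    name   string
+    status Status
+    err    error
+}
+
+func (f fakeSource) Name() string { return f.name }
+
+func (f fakeSource) CheckHealth(*Resource) (*HealthResult, error) {
+    if f.err != nil {
+        return nil, f.err
+    }
+    return &HealthResult{Status: f.status}, nil
+}
+
+func main() {
+    sources := []HealthSource{
+        fakeSource{name: "health-prometheus", status: Warning},
+        fakeSource{name: "health-naemon", status: OK},
+        fakeSource{name: "health-cloudwatch", err: errUnreachable},
+    }
+    fmt.Println(Aggregate(&Resource{Name: "db-1"}, sources) == Warning)
+}
+```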
+
+### Profiles — T-Shirt Sizing
+
+User-owned mappings:
+
+```yaml
+sizes:
+  medium:
+    abstract: { cores: 4, memory: 8GB }
+    providers:
+      aws: { instance_type: t3.large }
+      xcpng: { cores: 4, memory: 8192MB }
+      baremetal: { min_cores: 4, min_memory: 8GB, maas_tag: medium }
+```
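+
+Resolving a t-shirt size for a given provider is then a two-level lookup. A minimal sketch,
+assuming the YAML above has already been decoded into the hypothetical `SizeProfile` type
+with string values:
+
+```go
+package main
+
+import "fmt"
+
+// SizeProfile mirrors one entry under "sizes:" (hypothetical decoded form).
+type SizeProfile struct {
+    Abstract  map[string]string
+    Providers map[string]map[string]string
+}
+
+// Resolve returns the provider-specific parameters for a size,
+// or an error when the size or the provider mapping is missing.
+func Resolve(sizes map[string]SizeProfile, size, provider string) (map[string]string, error) {
+    profile, ok := sizes[size]
+    if !ok {
+        return nil, fmt.Errorf("unknown size %q", size)
+    }
+    params, ok := profile.Providers[provider]
+    if !ok {
+        return nil, fmt.Errorf("size %q has no mapping for provider %q", size, provider)
+    }
+    return params, nil
+}
+
+func main() {
+    sizes := map[string]SizeProfile{
+        "medium": {
+            Abstract: map[string]string{"cores": "4", "memory": "8GB"},
+            Providers: map[string]map[string]string{
+                "aws":   {"instance_type": "t3.large"},
+                "xcpng": {"cores": "4", "memory": "8192MB"},
+            },
+        },
+    }
+    params, err := Resolve(sizes, "medium", "aws")
+    fmt.Println(params, err)
+}
+```
+
+Returning an error for a missing mapping, rather than falling back to the abstract values,
+keeps an incomplete profile visible at plan time; that choice is an assumption, not
+something the spec decides.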
+
+### Artifact Builder
+
+Puppet modules → container images:
+
+```
+label "mailserver"
+  → puppet classes [postfix, dovecot, spamassassin]
+  → Dockerfile generated:
+      FROM ubuntu:24.04
+      RUN apt-get update && apt-get install -y puppet-agent
+      COPY modules/ /etc/puppetlabs/code/modules/
+      RUN puppet apply -e 'include postfix, dovecot, spamassassin'
+      # Clean up puppet, leave only configured services
+  → Image pushed to registry
+  → Available as k8s deployment or standalone container
+```
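+
+The Dockerfile generation step could look roughly like the sketch below. The function name
+`Dockerfile` is hypothetical. It emits `puppet apply -e 'include ...'` because `puppet apply`
+takes a manifest or inline code rather than a list of classes, and it validates class names
+before interpolating them into a shell command.
+
+```go
+package main
+
+import (
+    "fmt"
+    "regexp"
+    "strings"
+)
+
+// validClass matches Puppet class names such as "postfix" or "backup::pgbackrest".
+var validClass = regexp.MustCompile(`^[a-z][a-z0-9_]*(::[a-z][a-z0-9_]*)*$`)
+
+// Dockerfile renders a build recipe that bakes the given puppet classes into an image.
+func Dockerfile(baseImage string, classes []string) (string, error) {
+    if len(classes) == 0 {
+        return "", fmt.Errorf("no puppet classes given")
+    }
+    for _, c := range classes {
+        if !validClass.MatchString(c) {
+            return "", fmt.Errorf("invalid puppet class name %q", c)
+        }
+    }
+    var b strings.Builder
+    fmt.Fprintf(&b, "FROM %s\n", baseImage)
+    b.WriteString("RUN apt-get update && apt-get install -y puppet-agent\n")
+    b.WriteString("COPY modules/ /etc/puppetlabs/code/modules/\n")
+    fmt.Fprintf(&b, "RUN puppet apply -e 'include %s'\n", strings.Join(classes, ", "))
+    return b.String(), nil
+}
+
+func main() {
+    out, err := Dockerfile("ubuntu:24.04", []string{"postfix", "dovecot", "spamassassin"})
+    if err != nil {
+        fmt.Println("error:", err)
+        return
+    }
+    fmt.Print(out)
+}
+```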
+
+## Tech Stack
+
+| Component | Technology | Why |
+|-----------|-----------|-----|
+| Server | Go | Performance, single binary, Pulumi SDK, gRPC native |
+| CLI | Go (cobra) | Same binary, kubectl-style |
+| TUI | Go (bubbletea) | Same binary, k9s-style |
+| API | gRPC + REST (grpc-gateway) | Type-safe, fast, REST fallback |
+| IaC engine | Pulumi (Go SDK) | Multi-provider, plan/preview, component packages |
+| Config mgmt | OpenVox | Puppet modules, ENC, cert management |
+| Bare metal | Tinkerbell or custom iPXE | PXE boot, IPMI/Redfish |
+| Container build | Buildah or Docker | OCI images from puppet classes |
+| State store | TBD — NOT etcd (see State Storage section) | Resource state, label definitions |
+| K8s integration | client-go | Direct k8s API for deployments |
+
+## Under The Hood — What We DON'T Build
+
+- Cloud APIs → Pulumi providers handle this
+- Puppet language/runtime → OpenVox handles this
+- Container runtime → containerd/Docker handles this
+- Monitoring → Prometheus/Naemon handle this
+- K8s orchestration → k3s/EKS handles this
+- PXE/DHCP/TFTP → Tinkerbell handles this
+- Certificate management → OpenVox CA handles this
+
+**We build the glue, the abstraction, the UX, and the lifecycle orchestration.**
+
+## Kubernetes Management
+
+Lab also controls what runs on k8s clusters:
+
+```
+$ lab get deployments
+NAME          CLUSTER     LABEL        REPLICAS   IMAGE                    STATUS
+mailserver    homelab     mailserver   2/2        org/mailserver:03.15     ✓ running
+api           production  app          4/4        org/app:03.15            ✓ running
+postgres      homelab     postgres     1/1        org/postgres:03.14       ✓ running
+
+$ lab deploy --label app --cluster production --replicas 4
+$ lab scale --label app --cluster production --replicas 6
+```
+
+Deployments reference labels: the same label that defines puppet classes also defines
+the container image, ports, health checks, and k8s resources.
+
+## Bootstrap, Onboarding, and Self-Deployment
+
+### Core Idea: Your Device Is The First Coordinator
+
+You don't need a server to start. Your laptop/workstation runs the full lab engine
+locally. You onboard servers from it — including bare metal PXE boot. When ready,
+you migrate the coordinator role to one of the servers you've onboarded.
+
+```
+┌────────────┐     ┌────────────┐     ┌────────────┐     ┌────────────┐
+│  Phase 0   │     │  Phase 1   │     │  Phase 2   │     │  Phase 3   │
+│            │     │            │     │            │     │            │
+│ lab init   │────►│ Onboard    │────►│ Move lab   │────►│ Onboard    │
+│ --local    │     │ servers    │     │ to a real  │     │ remaining  │
+│            │     │ from your  │     │ server     │     │ from the   │
+│ Your device│     │ laptop     │     │            │     │ server     │
+│ = lab      │     │            │     │            │     │            │
+└────────────┘     └────────────┘     └────────────┘     └────────────┘
+```
+
+### Architecture: CLI = Embedded Server
+
+The CLI binary contains the full lab-server engine. The difference between modes
+is where state lives and whether the engine runs persistently.
+
+```
+┌──────────────────────────────────────┐
+│            lab (single binary)        │
+│                                       │
+│  ┌─────────────────────────────────┐ │
+│  │         Core Engine              │ │
+│  │  (providers, labels, render,     │ │
+│  │   lifecycle, identity, secrets,  │ │
+│  │   PXE server, everything)        │ │
+│  └─────────────────────────────────┘ │
+│                                       │
+│  Modes:                               │
+│  ├── $ lab init --local → local mode  │
+│  │     State: ~/.lab/state.db         │
+│  │     PXE/DHCP: served from laptop   │
+│  │     Full engine, no remote server  │
+│  │                                    │
+│  ├── $ lab server       → daemon mode │
+│  │     State: /var/lib/lab/state.db   │
+│  │     PXE/DHCP: served from this box │
+│  │     Persistent API on port 7443    │
+│  │                                    │
+│  └── $ lab <command>    → client mode │
+│        Talks to remote lab-server     │
+│        (or local engine if no server) │
+└──────────────────────────────────────┘
+```
+
+### Onboarding Flow
+
+`lab onboard` is the command to bring a new machine under management. It handles
+two scenarios: machines with an OS already installed, and bare metal that needs
+network boot + OS installation.
+
+#### Scenario A: Machine has OS (SSH onboard)
+
+For machines that already have an OS (like DGX Spark with Ubuntu, or Mac Studio):
+
+```
+$ lab onboard dgx-spark --provider ssh --host 192.168.1.50 --user admin
+
+Step 1: Render
+  ┌──────────────┬────────────────────────┐
+  │ Name         │ dgx-spark              │
+  │ Provider     │ ssh (existing machine) │
+  │ Host         │ 192.168.1.50           │
+  │ OS           │ Ubuntu (detected)      │
+  │ Arch         │ aarch64 (Grace)        │
+  │ RAM          │ 128GB                  │
+  │ GPU          │ CUDA (detected)        │
+  └──────────────┴────────────────────────┘
+
+  Onboarding will:
+  + Install lab agent
+  + Generate one-time enrollment token
+  + Register in DNS: dgx-spark.lab.internal
+  + Sign OpenVox certificate
+  + Assign labels (interactive or --labels flag)
+
+  Proceed? [y/N]: y
+
+Step 2: Detect & assign labels
+  Detected hardware:
+    GPU: NVIDIA GB10 Grace Blackwell → suggesting label: cuda
+    RAM: 128GB → suggesting label: ai-inference
+    Arch: aarch64 → suggesting label: arm
+
+  Assign labels [cuda,ai-inference,arm]: cuda,ai-inference,dgx-spark
+
+Step 3: Apply (same engine as lab apply)
+  → SSH into 192.168.1.50
+  → Install lab agent binary
+  → Generate one-time token
+  → Lab agent enrolls:
+    → OpenVox cert signed, classified in environment "production"
+    → DNS A record: dgx-spark.lab.internal → 192.168.1.50
+    → Identity established
+  → Apply puppet classes from labels:
+    → cuda: nvidia-drivers, cuda-toolkit
+    → ai-inference: inference-runtime
+  → Machine fully managed
+
+$ lab get servers
+NAME          PROVIDER  LABELS                       SYNC  PUPPET  HEALTH  IDENTITY
+dgx-spark     ssh       cuda,ai-inference,dgx-spark  ✓     ✓ ok    ✓       ✓ enrolled
+```
+
+#### Scenario B: Bare metal (PXE network boot)
+
+For machines with no OS. Lab (on your laptop or server) becomes a PXE server
+on the local network, serves the OS installer, and onboards after installation:
+
+```
+$ lab onboard beelink-max --provider baremetal \
+    --mac AA:BB:CC:DD:EE:FF \
+    --image ubuntu-24.04 \
+    --labels k8s-worker,rocm,longhorn
+
+Step 1: Render
+  ┌──────────────┬──────────────────────────┐
+  │ Name         │ beelink-max              │
+  │ Provider     │ baremetal (PXE boot)     │
+  │ MAC          │ AA:BB:CC:DD:EE:FF        │
+  │ Image        │ ubuntu-24.04             │
+  │ Labels       │ k8s-worker,rocm,longhorn │
+  │ PXE server   │ this device (laptop)     │
+  └──────────────┴──────────────────────────┘
+
+  Onboarding will:
+  + Start PXE/DHCP/TFTP on local network interface
+  + Wait for machine with MAC AA:BB:CC:DD:EE:FF to boot
+  + Serve unattended Ubuntu 24.04 installer
+  + After install: auto-enroll with one-time token baked into installer
+  + Assign labels, apply puppet classes
+
+  ⚠ PXE requires: network interface on same L2 segment as target machine
+  ⚠ DHCP: will respond ONLY to MAC AA:BB:CC:DD:EE:FF (safe for existing networks)
+
+  Proceed? [y/N]: y
+
+Step 2: PXE boot phase
+  → Starting PXE server on en0 (192.168.1.x)
+  → DHCP offer scoped to MAC AA:BB:CC:DD:EE:FF only
+  → Waiting for network boot request...
+
+  ⏳ Power on the Beelink SER9 MAX and set it to boot from network (PXE)
+
+  → Boot request received from AA:BB:CC:DD:EE:FF
+  → Serving iPXE → kernel + initrd → autoinstall config
+  → OS installation in progress...
+  → Installation complete, machine rebooting
+
+Step 3: Post-install enrollment (same as SSH onboard from here)
+  → Machine boots with installed OS
+  → Lab agent runs on first boot (installed during OS setup)
+  → Uses one-time token (baked into autoinstall config) to enroll:
+    → OpenVox cert signed
+    → DNS: beelink-max.lab.internal → 192.168.1.100
+    → Identity established
+  → Apply puppet classes from labels:
+    → k8s-worker: kubernetes::worker, containerd
+    → rocm: rocm-drivers
+    → longhorn: longhorn::node
+  → Machine fully managed
+
+$ lab get servers
+NAME          PROVIDER    LABELS                      SYNC  PUPPET  HEALTH  IDENTITY
+dgx-spark     ssh         cuda,ai-inference,dgx-spark ✓     ✓ ok    ✓       ✓ enrolled
+beelink-max   baremetal   k8s-worker,rocm,longhorn    ✓     ✓ ok    ✓       ✓ enrolled
+```
+
+#### Scenario C: Onboard with IPMI/Redfish (remote power control)
+
+For bare metal where you have IPMI/BMC access, Lab can power on the machine
+and set PXE boot remotely, so onboarding is fully hands-free:
+
+```
+$ lab onboard beelink-max --provider baremetal \
+    --mac AA:BB:CC:DD:EE:FF \
+    --ipmi 192.168.1.200 --ipmi-user admin \
+    --image ubuntu-24.04 \
+    --labels k8s-worker,rocm,longhorn
+
+  → IPMI: setting next boot to PXE
+  → IPMI: powering on machine
+  → PXE server waiting for boot request...
+  → (fully automated from here)
+```
+
+### Homelab Bootstrap Walkthrough
+
+The complete flow for setting up the homelab from zero:
+
+```
+# Phase 0: Local mode on your laptop
+$ lab init --local
+  ✓ Lab engine running locally
+  ✓ State: ~/.lab/state.db
+  ✓ Ready to onboard servers
+
+# Phase 1: Onboard servers that already have an OS
+$ lab onboard dgx-spark --provider ssh --host 192.168.1.50
+  → Labels: [cuda, ai-inference, dgx-spark]
+
+$ lab onboard mac-studio --provider ssh --host 192.168.1.51
+  → Labels: [k8s-server, etcd, arm]
+
+# Phase 2: Onboard bare metal (PXE from your laptop)
+$ lab onboard beelink-ser9-pro --provider baremetal --mac XX:XX:XX:XX:XX:01 \
+    --image ubuntu-24.04 --labels bootstrap,lab-server
+  → PXE boot from laptop → install OS → enroll
+  → This will become the permanent lab-server host
+
+# Phase 3: Move lab-server to a real server
+$ lab server migrate --target ssh --host beelink-ser9-pro
+  → Lab-server deployed on Beelink SER9 Pro
+  → State migrated from ~/.lab/state.db
+  → PXE/DHCP now served from Beelink, not your laptop
+  → CLI config updated: lab talks to beelink-ser9-pro:7443
+
+# Phase 4: Onboard remaining servers (PXE from beelink-ser9-pro now)
+$ lab onboard beelink-ser9-max --provider baremetal --mac XX:XX:XX:XX:XX:02 \
+    --image ubuntu-24.04 --labels k8s-worker,rocm,longhorn
+  → PXE served by beelink-ser9-pro (not your laptop anymore)
+
+$ lab onboard minisforum-ms-r1 --provider baremetal --mac XX:XX:XX:XX:XX:03 \
+    --image ubuntu-24.04 --labels k8s-worker,arm
+
+# Phase 5: Set up k8s
+$ lab apply cluster homelab --servers mac-studio,beelink-ser9-max,minisforum-ms-r1
+  → mac-studio becomes k3s server (etcd)
+  → beelink-ser9-max joins as worker
+  → minisforum-ms-r1 joins as worker
+  → All via puppet classes from labels
+
+# Phase 6: Optionally move lab-server into k8s
+$ lab server migrate --target kubernetes --cluster homelab
+  → Lab-server now runs as k8s pod
+  → Still manages everything including the cluster it runs on
+
+# Final state:
+$ lab get servers
+NAME              PROVIDER    LABELS                       SYNC  PUPPET  HEALTH  IDENTITY
+dgx-spark         ssh         cuda,ai-inference,dgx-spark  ✓     ✓ ok    ✓       ✓ enrolled
+mac-studio        ssh         k8s-server,etcd,arm          ✓     ✓ ok    ✓       ✓ enrolled
+beelink-ser9-pro  baremetal   bootstrap                    ✓     ✓ ok    ✓       ✓ enrolled
+beelink-ser9-max  baremetal   k8s-worker,rocm,longhorn     ✓     ✓ ok    ✓       ✓ enrolled
+minisforum-ms-r1  baremetal   k8s-worker,arm               ✓     ✓ ok    ✓       ✓ enrolled
+lab-server        kubernetes  lab,control-plane            ✓     ✓ ok    ✓       ✓ enrolled
+```
+
+### Enterprise Application: XCP-ng Bare Metal Deploy
+
+The same onboarding flow works for deploying XCP-ng to enterprise bare metal:
+
+```
+$ lab onboard xen-host-42 --provider baremetal \
+    --mac AA:BB:CC:DD:EE:FF \
+    --ipmi 10.0.0.142 --ipmi-user admin \
+    --image xcpng-8.3 \
+    --labels xen-host,production,eu-west
+
+  → IPMI: power on, PXE boot
+  → Install XCP-ng 8.3 (unattended)
+  → Enroll, apply puppet classes:
+    → xen-host: xcpng::host, xcpng::networking, xcpng::storage
+  → Host registered in Xen Orchestra pool
+  → Ready to provision VMs on it
+
+# Now create VMs on the XCP-ng host we just onboarded:
+$ lab apply server app-12 --provider xcpng --labels app,production
+  → VM created on xen-host-42 via Xen Orchestra API
+  → OS installed, enrolled, puppet applied
+  → Same flow as AWS EC2, just different provider
+```
+
+### PXE Server Capabilities
+
+When running in local or server mode, Lab includes an embedded PXE server:
+
+- **DHCP**: scoped to specific MACs only (safe for existing networks with DHCP)
+- **TFTP**: serves iPXE bootloader
+- **HTTP**: serves kernel, initrd, autoinstall configs
+- **Autoinstall generation**: creates unattended install configs per-machine with:
+  - Lab agent pre-installed
+  - One-time enrollment token baked in
+  - Network config for the target environment
+  - Disk layout per label/profile
+- **Supported images**: Ubuntu, Debian, RHEL/Rocky, XCP-ng (extensible)
+
+PXE serving moves with lab-server — if you migrate lab to a new host,
+PXE is served from there. If lab is on your laptop, PXE is on your laptop.
+Same engine, same binary.
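+
+The "scoped to specific MACs only" rule is what keeps the embedded DHCP server safe on a
+network that already has one. A minimal sketch of that check, using only the standard
+library; the `Allowlist` type and its method names are hypothetical, and a real DHCP server
+would call this before building any offer:
+
+```go
+package main
+
+import (
+    "fmt"
+    "net"
+)
+
+// Allowlist holds the MAC addresses Lab is currently onboarding, in canonical form.
+type Allowlist map[string]struct{}
+
+// NewAllowlist parses and normalizes the given MAC addresses.
+func NewAllowlist(macs ...string) (Allowlist, error) {
+    a := Allowlist{}
+    for _, m := range macs {
+        hw, err := net.ParseMAC(m)
+        if err != nil {
+            return nil, fmt.Errorf("invalid MAC %q: %w", m, err)
+        }
+        a[hw.String()] = struct{}{} // String() yields lower-case, colon-separated form
+    }
+    return a, nil
+}
+
+// ShouldAnswer reports whether a DHCP request from clientMAC may be answered.
+// Unparseable or unknown addresses are ignored, never answered.
+func (a Allowlist) ShouldAnswer(clientMAC string) bool {
+    hw, err := net.ParseMAC(clientMAC)
+    if err != nil {
+        return false
+    }
+    _, ok := a[hw.String()]
+    return ok
+}
+
+func main() {
+    allow, err := NewAllowlist("AA:BB:CC:DD:EE:FF")
+    if err != nil {
+        fmt.Println("error:", err)
+        return
+    }
+    fmt.Println(allow.ShouldAnswer("aa:bb:cc:dd:ee:ff"), allow.ShouldAnswer("AA:BB:CC:DD:EE:00"))
+}
+```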
+
+### Hardware Detection During Onboard
+
+When onboarding via SSH (existing OS), Lab detects hardware and suggests labels:
+
+```
+$ lab onboard new-server --provider ssh --host 10.0.0.50
+
+Detected hardware:
+  CPU:    AMD EPYC 7763 (x86_64, 64 cores)     → suggest: compute
+  RAM:    256 GB                                 → suggest: high-memory
+  GPU:    NVIDIA A100 80GB                       → suggest: cuda, ai-training
+  Disk:   2x NVMe 1.92TB, 4x SSD 3.84TB        → suggest: storage
+  NIC:    2x 25GbE, 1x 1GbE IPMI               → suggest: high-bandwidth
+
+  Suggested labels: [compute, high-memory, cuda, ai-training, storage, high-bandwidth]
+  Assign labels [accept/edit]: _
+```
+
+For PXE onboard, hardware detection happens after OS installation, and labels
+can be auto-confirmed or require interactive approval.
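+
+Label suggestion can be as simple as a list of rules over detected facts. The sketch below
+is illustrative only: the `Hardware` fields, the 256 GB threshold, and the rule set are
+assumptions, and the result is a suggestion that the user still confirms or edits.
+
+```go
+package main
+
+import (
+    "fmt"
+    "strings"
+)
+
+// Hardware carries the few detected facts this sketch needs (hypothetical shape).
+type Hardware struct {
+    Arch      string // e.g. "x86_64", "aarch64"
+    RAMGB     int
+    GPUVendor string // e.g. "NVIDIA", "AMD", or empty
+}
+
+// SuggestLabels maps detected hardware to suggested labels.
+func SuggestLabels(h Hardware) []string {
+    var out []string
+    if strings.EqualFold(h.GPUVendor, "nvidia") {
+        out = append(out, "cuda")
+    }
+    if h.RAMGB >= 256 { // threshold chosen for illustration
+        out = append(out, "high-memory")
+    }
+    if h.Arch == "aarch64" {
+        out = append(out, "arm")
+    }
+    return out
+}
+
+func main() {
+    fmt.Println(SuggestLabels(Hardware{Arch: "x86_64", RAMGB: 256, GPUVendor: "NVIDIA"}))
+}
+```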
+
+### No Server? CLI Runs Locally
+
+If no remote server is configured, every `lab` command runs the engine locally.
+This means you can use Lab in permanent local mode for simple setups:
+
+```
+$ lab get servers              # no remote server configured
+  ⓘ Running locally (~/.lab/state.db)
+  Tip: run `lab server migrate --target <target>` to deploy a persistent server
+
+NAME          PROVIDER   LABELS     SYNC     PUPPET    HEALTH    IDENTITY
+...
+```
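+
+The client-or-local decision can be sketched as a single function. The `Config` and `Target`
+types are hypothetical; the only rule taken from the text is that an empty server setting
+means the engine runs locally against the local state file.
+
+```go
+package main
+
+import "fmt"
+
+// Config is a hypothetical view of the CLI configuration.
+type Config struct {
+    ServerURL string // empty means no remote server is configured
+}
+
+// Target says where a command will run and against which state.
+type Target struct {
+    Mode     string // "client" or "local"
+    Location string // server URL in client mode, state file path in local mode
+}
+
+// SelectTarget picks client mode when a server is configured, local mode otherwise.
+func SelectTarget(c Config, localStatePath string) Target {
+    if c.ServerURL != "" {
+        return Target{Mode: "client", Location: c.ServerURL}
+    }
+    return Target{Mode: "local", Location: localStatePath}
+}
+
+func main() {
+    fmt.Println(SelectTarget(Config{}, "~/.lab/state.db"))
+    fmt.Println(SelectTarget(Config{ServerURL: "beelink-ser9-pro:7443"}, "~/.lab/state.db"))
+}
+```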
+
+### Self-Migration
+
+Migration uses the same plan/apply as everything else:
+
+```
+$ lab server migrate --target ssh --host beelink-ser9-pro
+
+Step 1: Plan
+  ~ migrate lab-server from local (~/.lab) to ssh://beelink-ser9-pro
+  + deploy lab-server container on beelink-ser9-pro
+  + copy state.db to remote host
+  + start PXE/DHCP services on remote host
+  + stop local PXE/DHCP services
+  + update CLI config to new endpoint
+
+Step 2: Apply
+  → Deploy lab-server on beelink-ser9-pro
+  → Copy state to remote
+  → Verify remote is healthy
+  → Switch CLI config
+  → Stop local engine
+
+$ lab server migrate --target kubernetes --cluster homelab
+
+Step 1: Plan
+  ~ migrate lab-server from ssh://beelink-ser9-pro to kubernetes://homelab
+  + k8s Deployment lab-server (1 replica)
+  + k8s Service lab-server (port 7443)
+  + PersistentVolumeClaim lab-server-state (10Gi)
+  + migrate state.db to PVC
+  + PXE services: move to k8s hostNetwork pod or keep on bootstrap node
+
+  ⚠ Note: PXE/DHCP requires L2 network access. If k8s node is on the same
+    L2 segment, use hostNetwork. Otherwise, keep PXE on the bootstrap node
+    and only migrate the API/state to k8s.
+
+Step 2: Apply
+  → Deploy to k8s
+  → Migrate state
+  → Verify healthy
+  → Update CLI config
+  → Tear down old deployment
+```
+
+### Key Design Principles
+
+1. **One engine everywhere** — CLI, local mode, server mode, and init all share the same code
+2. **Your device is the first coordinator** — no chicken-and-egg, start from nothing
+3. **Onboard uses the same pipeline as apply** — render, plan, apply, enroll
+4. **PXE is embedded** — no external PXE/DHCP server needed, Lab serves it
+5. **Hardware detection suggests labels** — but the user confirms
+6. **Migration is just plan/apply for lab-server** — same engine, no special case
+7. **Enterprise and homelab are the same flow** — onboard XCP-ng bare metal = onboard homelab Beelink
+
+## Identity and Trust Layer
+
+Inspired by what FreeIPA did well (auto-DNS, centralized SSH, server-scoped secrets,
+internal CA, IP mobility) without what it did badly (instability, hardcoded join secrets).
+
+Lab controls the full lifecycle — it knows when a machine is born — so it can solve
+the enrollment problem properly: generate a one-time join token at provision time,
+inject it via cloud-init or iPXE userdata. No hardcoded secrets in images.
+
+### Provision-to-Enrolled Flow
+
+```
+$ lab apply server new-worker-5 --label k8s-worker --provider aws
+
+1. PROVISION   → Pulumi creates EC2 instance
+2. IDENTITY    → Lab generates one-time join token (short-lived, single-use)
+                → Token injected via cloud-init (or iPXE userdata for bare metal)
+                → Token is NOT in the image — generated per-instance at provision time
+3. ENROLL      → Machine boots, uses token to:
+                  → Register with OpenVox (cert signed, node classified)
+                  → Register in DNS (A record + PTR)
+                  → Authenticate with Vault (get identity + policies per label)
+                  → Get SSH CA-signed host key (no more TOFU)
+4. CONFIGURE   → OpenVox applies classes
+                → Machine pulls secrets it's allowed to access from Vault
+                → e.g. k8s join token retrieved from Vault, node joins cluster
+5. ENROLLED    → Lab marks resource identity as ✓ enrolled
+```
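+
+The one-time join token in step 2 needs three properties: unpredictable, short-lived, and
+single-use. The sketch below shows one way to get them with the standard library. The
+`TokenStore` type is hypothetical, it keeps tokens in memory only, and whether Lab issues
+tokens itself or delegates to Vault is still open (see the table below).
+
+```go
+package main
+
+import (
+    "crypto/rand"
+    "encoding/hex"
+    "fmt"
+    "sync"
+    "time"
+)
+
+// TokenStore tracks outstanding join tokens and their expiry times.
+// A production version would persist hashes of tokens rather than raw values.
+type TokenStore struct {
+    mu     sync.Mutex
+    tokens map[string]time.Time
+    now    func() time.Time
+}
+
+func NewTokenStore() *TokenStore {
+    return &TokenStore{tokens: map[string]time.Time{}, now: time.Now}
+}
+
+// Issue creates a random 256-bit token that is valid for ttl.
+func (s *TokenStore) Issue(ttl time.Duration) (string, error) {
+    buf := make([]byte, 32)
+    if _, err := rand.Read(buf); err != nil {
+        return "", err
+    }
+    tok := hex.EncodeToString(buf)
+    s.mu.Lock()
+    defer s.mu.Unlock()
+    s.tokens[tok] = s.now().Add(ttl)
+    return tok, nil
+}
+
+// Consume reports whether tok is valid, and removes it either way,
+// so a token can never be accepted twice.
+func (s *TokenStore) Consume(tok string) bool {
+    s.mu.Lock()
+    defer s.mu.Unlock()
+    expiry, ok := s.tokens[tok]
+    if !ok {
+        return false
+    }
+    delete(s.tokens, tok)
+    return s.now().Before(expiry)
+}
+
+func main() {
+    store := NewTokenStore()
+    tok, err := store.Issue(15 * time.Minute)
+    if err != nil {
+        fmt.Println("error:", err)
+        return
+    }
+    fmt.Println(store.Consume(tok), store.Consume(tok))
+}
+```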
+
+### What Each Machine Gets on Enrollment
+
+| Capability | What happens | Tool underneath (TBD — needs investigation) |
+|-----------|-------------|----------------------------------------------|
+| DNS auto-registration | A + PTR records created/updated automatically | CoreDNS API? ExternalDNS? PowerDNS? needs investigation |
+| IP mobility | Machine restarts with new IP → DNS updated automatically | Lab agent on machine reports changes? DHCP hook? needs investigation |
+| Server certificate | TLS cert issued for the machine, auto-renewed | OpenVox CA? Vault PKI secrets engine? cert-manager? needs investigation |
+| SSH host key signing | Host key signed by CA, clients trust CA not individual keys | Vault SSH secrets engine? OpenVox CA? step-ca? needs investigation |
+| SSH user access | Users get short-lived SSH certs, centrally managed | Vault SSH + OIDC? Teleport? Boundary? needs investigation |
+| Secret access (RBAC) | Machine authenticates with Vault, gets label-scoped policy | Vault AppRole? Vault cert auth? needs investigation |
+| K8s join tokens | Retrieved from Vault by entitled machines, used to join cluster | Vault KV + policy per label? needs investigation |
+| OpenVox enrollment | Cert signed, environment + role + classes assigned | OpenVox CA + ENC — this one we know |
+| One-time join tokens | Generated per-instance at provision, single-use, short-lived | Lab itself generates these — or delegate to Vault? needs investigation |
+
+**Important: We don't need to build any of these from scratch.** Each row is a capability
+that likely has an existing tool we can wrap. Just like we use Pulumi for cloud APIs and
+OpenVox for config management, we'll find the right tool for each identity concern.
+Each row still needs investigation; we'll evaluate the options together, one by one.
+
+### CLI: Identity Information
+
+```
+$ lab get servers
+NAME       PROVIDER   LABELS       SYNC  PUPPET  HEALTH  IDENTITY
+worker-5   aws        k8s-worker   ✓     ✓ ok    ✓       ✓ enrolled
+worker-6   xcpng      k8s-worker   ✓     ✓ ok    ✓       ✓ enrolled
+worker-7   baremetal  k8s-worker   ✓     ✗ fail  ⚠       ⚠ cert expiring
+new-box    aws        k8s-worker   ✓     …       …       ⏳ enrolling
+
+$ lab describe server worker-5
+...
+Identity:
+  DNS:          worker-5.lab.internal (A: 10.0.1.45, PTR: ✓)
+  OpenVox:      ✓ cert signed (expires 2027-03-15)
+  Vault:        ✓ authenticated (policy: k8s-worker)
+  SSH Host Key: ✓ CA-signed (fingerprint: SHA256:abc...)
+  Secrets:      k8s/join-token, tls/node-cert (2 accessible)
+  Enrolled:     2026-03-15 14:22:03 (one-time token, consumed)
+  Last Check-in: 2026-03-15 15:01:12 (38 seconds ago)
+
+$ lab get secrets --label k8s-worker
+SECRET              TYPE     ACCESSIBLE BY           LAST ROTATED
+k8s/join-token      dynamic  k8s-worker (12 srv)     2026-03-15
+tls/cluster-ca      static   k8s-worker, k8s-server  2026-01-01
+monitoring/api-key  static   k8s-worker, monitoring  2026-02-28
+
+$ lab identity renew worker-5    # force cert/key renewal
+$ lab identity revoke worker-5   # revoke all creds, remove from DNS, unenroll
+```
+
+### Secrets — Code Is The Policy
+
+**Design principle:** If your code/config declares "I use secret X", that IS the access
+grant. No one goes to a separate UI to edit policies. Default is locked — if not
+mentioned, no access. If mentioned, access is automatic.
+
+**The declaration IS the policy:**
+
+```yaml
+labels:
+  mailserver:
+    puppet_classes:
+      - postfix
+      - dovecot
+    secrets:
+      - mail/tls-cert
+      - mail/dkim-key
+      - mail/relay-credentials
+    ports: [25, 587, 993]
+```
+
+When Lab applies label `mailserver` to a server, it automatically:
+1. Grants that server access to `mail/tls-cert`, `mail/dkim-key`, `mail/relay-credentials`
+2. Denies access to every secret that none of the server's labels declare
+3. Needs no separate policy file, no Vault admin, and no ticket to the security team
+
+When a puppet class references a secret:
+
+```puppet
+# modules/postfix/manifests/init.pp
+class postfix {
+  $relay_creds = lab::secret('mail/relay-credentials')
+
+  file { '/etc/postfix/sasl_passwd':
+    content => $relay_creds,
+    mode    => '0600',
+  }
+}
+```
+
+The `lab::secret()` call is both the usage AND the declaration that this class
+needs this secret. Lab scans puppet classes, discovers secret references,
+and auto-generates the access policy. If `postfix` class is applied to a server
+via a label, that server gets access to `mail/relay-credentials`. Remove the
+class → access revoked.
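+
+Discovering `lab::secret()` references could start as a plain text scan. The sketch below
+uses a regular expression over manifest source; it is an approximation (it would also match
+calls inside comments and cannot see computed names), and a real implementation would more
+likely hook into the Puppet parser. The function name `SecretRefs` is hypothetical.
+
+```go
+package main
+
+import (
+    "fmt"
+    "regexp"
+)
+
+// secretRef matches lab::secret('path') and lab::secret("path") with a literal argument.
+var secretRef = regexp.MustCompile(`lab::secret\(\s*['"]([^'"]+)['"]\s*\)`)
+
+// SecretRefs returns the distinct secret paths referenced in a manifest,
+// in order of first appearance.
+func SecretRefs(manifest string) []string {
+    seen := map[string]bool{}
+    var out []string
+    for _, m := range secretRef.FindAllStringSubmatch(manifest, -1) {
+        if path := m[1]; !seen[path] {
+            seen[path] = true
+            out = append(out, path)
+        }
+    }
+    return out
+}
+
+func main() {
+    manifest := `
+class postfix {
+  $relay_creds = lab::secret('mail/relay-credentials')
+  $cert        = lab::secret("mail/tls-cert")
+}`
+    fmt.Println(SecretRefs(manifest))
+}
+```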
+
+**Secrets must be equally easy to access from anywhere:**
+
+| Runtime | How you get a secret | Same underneath |
+|---------|---------------------|-----------------|
+| Puppet code | `lab::secret('mail/tls-cert')` | Lab agent on machine fetches from secret backend |
+| App on VM | `LAB_SECRET_MAIL_TLS_CERT` env var, or `/run/secrets/mail/tls-cert` file | Lab agent provides via env or tmpfs mount |
+| App in Kubernetes | Same env var or volume mount | Lab k8s operator syncs to K8s Secret object |
+| App in Docker (standalone) | `--env-file` or bind mount from lab agent | Lab agent on host provides |
+| Script / cron job | `lab secret get mail/tls-cert` CLI call | Lab CLI authenticated via machine identity |
+| cloud-init / bootstrap | Injected at provision time via one-time token | Lab server provides during enrollment |
+
+**One way to consume secrets, regardless of where you run.** The lab agent (or k8s
+operator, or CLI) handles authentication and fetching transparently. The app just
+reads an env var or file.
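+
+The table implies a fixed mapping from a secret path to its env var and file names. A
+possible mapping, inferred from the `mail/tls-cert` row, is sketched below; the function
+names are hypothetical. Note that folding both `/` and `-` into `_` means two paths such as
+`mail/tls-cert` and `mail/tls_cert` would collide, so path naming would need a rule
+against that.
+
+```go
+package main
+
+import (
+    "fmt"
+    "strings"
+)
+
+// EnvName converts a secret path to the env var name an app would read.
+func EnvName(path string) string {
+    replacer := strings.NewReplacer("/", "_", "-", "_")
+    return "LAB_SECRET_" + strings.ToUpper(replacer.Replace(path))
+}
+
+// FilePath converts a secret path to the file an app would read instead.
+func FilePath(path string) string {
+    return "/run/secrets/" + path
+}
+
+func main() {
+    fmt.Println(EnvName("mail/tls-cert"))
+    fmt.Println(FilePath("mail/tls-cert"))
+}
+```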
+
+#### How Access Flows
+
+```
+                Label "mailserver"
+                declares secrets:
+                  - mail/tls-cert
+                  - mail/dkim-key
+                        │
+                        ▼
+            ┌───────────────────────┐
+            │  Lab compiles policy  │
+            │                       │
+            │  server mail-1:       │
+            │    CAN access:        │
+            │      mail/tls-cert    │
+            │      mail/dkim-key    │
+            │    CANNOT access:     │
+            │      k8s/*            │
+            │      postgres/*       │
+            │      (everything else)│
+            └───────────┬───────────┘
+                        │
+                        ▼
+            ┌───────────────────────┐
+            │  Secret backend       │
+            │  (TBD — needs         │
+            │   investigation)      │
+            │                       │
+            │  Enforces policy at   │
+            │  backend level, not   │
+            │  just in Lab          │
+            └───────────────────────┘
+```
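+
+The "Lab compiles policy" box can be read as: take the union of secrets declared by the
+server's labels, and deny everything else by default. A minimal sketch with hypothetical
+`Label` and `Policy` types (it covers label declarations only, not references discovered in
+puppet classes):
+
+```go
+package main
+
+import "fmt"
+
+// Label holds the secrets a label declares (other label fields omitted).
+type Label struct {
+    Secrets []string
+}
+
+// Policy is the set of secrets one server may read. Anything absent is denied.
+type Policy map[string]bool
+
+// CompilePolicy builds a server's policy from the labels applied to it.
+// Unknown label names contribute nothing.
+func CompilePolicy(labels map[string]Label, serverLabels []string) Policy {
+    p := Policy{}
+    for _, name := range serverLabels {
+        for _, secret := range labels[name].Secrets {
+            p[secret] = true
+        }
+    }
+    return p
+}
+
+// Allows reports whether the policy grants access; the zero value means deny.
+func (p Policy) Allows(secret string) bool {
+    return p[secret]
+}
+
+func main() {
+    labels := map[string]Label{
+        "mailserver": {Secrets: []string{"mail/tls-cert", "mail/dkim-key"}},
+        "k8s-worker": {Secrets: []string{"k8s/join-token"}},
+    }
+    policy := CompilePolicy(labels, []string{"mailserver"})
+    fmt.Println(policy.Allows("mail/tls-cert"), policy.Allows("k8s/join-token"))
+}
+```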
+
+#### Secret Sources
+
+Secrets themselves can come from multiple places:
+
+```yaml
+secrets:
+  mail/tls-cert:
+    type: dynamic                 # generated/rotated automatically
+    generator: acme               # cert-manager / Let's Encrypt
+    rotate_every: 90d
+
+  mail/dkim-key:
+    type: static                  # manually set, stored encrypted
+    set_by: admin                 # who last set it
+
+  mail/relay-credentials:
+    type: static
+    set_by: admin
+
+  k8s/join-token:
+    type: dynamic
+    generator: kubernetes         # fetched from k8s API
+    rotate_every: 24h
+
+  tls/node-cert:
+    type: dynamic
+    generator: ca                 # issued per-machine from internal CA
+    per_machine: true             # each machine gets its own
+```
+
+#### CLI for Secrets
+
+```
+$ lab get secrets
+SECRET                   TYPE      USED BY              LAST ROTATED
+mail/tls-cert            dynamic   mailserver (2 srv)   2026-03-14
+mail/dkim-key            static    mailserver (2 srv)   2026-01-15
+mail/relay-credentials   static    mailserver (2 srv)   2026-02-01
+k8s/join-token           dynamic   k8s-worker (12 srv)  2026-03-15
+tls/node-cert            dynamic   * (all enrolled)     per-machine
+
+$ lab secret set mail/relay-credentials
+  Enter value: ****
+  ✓ Updated. Accessible by: mailserver (2 servers)
+  ✓ Servers will pick up new value within 60s
+
+$ lab show secret mail/relay-credentials
+Secret: mail/relay-credentials
+Type: static
+Last set: 2026-03-15 by admin
+
+Accessible by (derived from code):
+  Label "mailserver" → puppet class "postfix" → lab::secret('mail/relay-credentials')
+    ├── mail-1 (xcpng)    last fetched: 12m ago
+    └── mail-2 (aws)      last fetched: 12m ago
+
+  No other references found in any applied code.
+
+$ lab secret audit
+  ✓ All secrets are referenced by at least one applied class/label
+  ⚠ Secret "old/api-key" is defined but not referenced by any code — orphaned?
+  ⚠ Secret "db/password" referenced by class "app::database" but never set — empty!
+```
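+
+The two warnings printed by `lab secret audit` come down to a set comparison between
+defined secrets and secrets referenced by applied code. A minimal sketch, assuming both
+sets are already available (`Audit` and `AuditResult` are hypothetical names):
+
+```go
+package main
+
+import (
+	"fmt"
+	"sort"
+)
+
+// AuditResult holds the two warning classes shown by `lab secret audit`.
+type AuditResult struct {
+	Orphaned []string // defined but not referenced by any applied code
+	Unset    []string // referenced by applied code but never set
+}
+
+// Audit compares defined secrets (path -> whether a value has been set)
+// with the paths referenced by applied code.
+func Audit(defined map[string]bool, referenced []string) AuditResult {
+	refs := make(map[string]bool, len(referenced))
+	for _, path := range referenced {
+		refs[path] = true
+	}
+	var res AuditResult
+	for path := range defined {
+		if !refs[path] {
+			res.Orphaned = append(res.Orphaned, path)
+		}
+	}
+	for path := range refs {
+		if !defined[path] { // missing entirely, or defined without a value
+			res.Unset = append(res.Unset, path)
+		}
+	}
+	sort.Strings(res.Orphaned)
+	sort.Strings(res.Unset)
+	return res
+}
+
+func main() {
+	defined := map[string]bool{
+		"mail/dkim-key": true,
+		"old/api-key":   true,
+		"db/password":   false, // declared, but no value set yet
+	}
+	res := Audit(defined, []string{"mail/dkim-key", "db/password"})
+	fmt.Println("orphaned:", res.Orphaned)
+	fmt.Println("unset:   ", res.Unset)
+}
+```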
+
+#### Secret Architecture — Distributed, Offline-Capable
+
+**Critical requirement:** Nothing breaks if the central secret server (or any server)
+is unreachable. Everything continues to work, including creating new pods, deployments,
+and puppet runs, using the local encrypted cache. This is not an edge case; it is a core design goal.
+
+**This means secrets are NOT a central server you query.** They're a distributed,
+synced, encrypted dataset with offline capability.
+
+```
+┌─────────────────────────────────────────────────────────────┐
+│                    Secret Distribution Model                 │
+│                                                              │
+│   NOT this (central server):        THIS (distributed sync): │
+│                                                              │
+│       ┌─────────┐                  ┌──────┐  ┌──────┐       │
+│       │ Vault   │                  │ Node │◄─►│ Node │       │
+│       └────┬────┘                  └──┬───┘  └──┬───┘       │
+│       ┌────┼────┐                     │    ▲    │            │
+│       │    │    │                     ▼    │    ▼            │
+│      ┌┴┐  ┌┴┐  ┌┴┐               ┌──────┐  ┌──────┐       │
+│      │N│  │N│  │N│               │ Node │◄─►│ Node │       │
+│      └─┘  └─┘  └─┘               └──┬───┘  └──────┘       │
+│   (all dead if vault               │                        │
+│    is unreachable)                  ▼                        │
+│                               ┌──────────┐                  │
+│                               │ Git repo │ (encrypted       │
+│                               │ (backup) │  backup of       │
+│                               └──────────┘  last resort)    │
+└─────────────────────────────────────────────────────────────┘
+```
+
+#### How It Works
+
+**Layer 1: Local Encrypted Cache (on every machine)**
+- Every machine that has access to secrets stores them locally, encrypted at rest
+- Encrypted with machine-specific key (derived from machine identity/TPM/secure enclave)
+- Puppet runs, app starts, pod deployments — all read from local cache
+- If cache is fresh → use it, no network call needed
+- Cache has TTL per secret, but stale cache is better than no secret
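+
+The lookup order implied by Layer 1 (fresh cache, then the store, then stale cache as a
+fallback) might look like the sketch below. It is an in-memory stand-in only: encryption
+at rest is omitted, timestamps are plain Unix seconds, and `Cache`, `Resolve`, and the
+error values are hypothetical names.
+
+```go
+package main
+
+import (
+	"errors"
+	"fmt"
+	"time"
+)
+
+// Illustrative error values (not part of any existing API).
+var (
+	ErrStoreUnreachable = errors.New("secret-store unreachable")
+	ErrNoSecret         = errors.New("secret not cached and secret-store unreachable")
+)
+
+type entry struct {
+	value     []byte
+	fetchedAt int64 // Unix seconds
+	ttl       int64 // seconds; 0 means no expiry
+}
+
+// Cache is an in-memory stand-in for the encrypted on-disk cache.
+type Cache struct {
+	entries map[string]entry
+}
+
+func NewCache() *Cache { return &Cache{entries: make(map[string]entry)} }
+
+// Put stores a value fetched at time now (Unix seconds) with the given TTL.
+func (c *Cache) Put(path string, value []byte, now, ttl int64) {
+	c.entries[path] = entry{value: value, fetchedAt: now, ttl: ttl}
+}
+
+// Resolve returns the secret and whether the returned value is stale.
+func (c *Cache) Resolve(path string, now int64, fetch func(string) ([]byte, error)) ([]byte, bool, error) {
+	e, cached := c.entries[path]
+	if cached && (e.ttl == 0 || now-e.fetchedAt < e.ttl) {
+		return e.value, false, nil // fresh: no network call needed
+	}
+	if value, err := fetch(path); err == nil {
+		// Refreshed from the store; a previously uncached path gets ttl 0 in this sketch.
+		c.Put(path, value, now, e.ttl)
+		return value, false, nil
+	}
+	if cached {
+		return e.value, true, nil // stale cache is better than no secret
+	}
+	return nil, false, ErrNoSecret
+}
+
+func main() {
+	down := func(string) ([]byte, error) { return nil, ErrStoreUnreachable }
+	c := NewCache()
+	now := time.Now().Unix()
+	c.Put("mail/dkim-key", []byte("k1"), now-7200, 3600) // expired an hour ago
+
+	value, stale, err := c.Resolve("mail/dkim-key", now, down)
+	fmt.Println(string(value), stale, err)
+
+	_, _, err = c.Resolve("k8s/join-token", now, down)
+	fmt.Println(err)
+}
+```
+
+With the store unreachable, the expired entry is still served and flagged as stale; only a
+secret that was never cached produces an error.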
+
+**Layer 2: Secret Store (privileged nodes that hold all secrets)**
+- One or more nodes with the `secret-store` label hold the COMPLETE encrypted dataset
+- This is NOT a special server type — it's a label, applied to pods, VMs, or bare metal
+- Should have at least 2 replicas for HA
+- Machines fetch ONLY the secrets their labels entitle them to from the store
+- The store enforces policy — a machine with label `mailserver` gets `mail/*`, nothing else
+- Machines NEVER sync with each other directly — they only talk to the store
+- This prevents secret sprawl (no machine accumulates secrets it shouldn't have)
+
+**Layer 3: Git Encrypted Backup (last resort recovery)**
+- All secrets (encrypted with a master key) are backed up to a Git repo
+- If a machine has an empty cache AND no secret-store replica is reachable → restore the store from the Git backup
+- SOPS/age style encryption — secrets encrypted, metadata (paths, policies) in plaintext
+- Git gives versioning, audit trail, and disaster recovery for free
+- The Git repo alone is useless without the decryption key
+
+**Layer 4: Lab-server (coordinator, NOT single point of failure)**
+- Lab-server is the preferred interface to set/rotate secrets (via CLI/API)
+- Lab-server does NOT need to be the secret-store (but can be, via label)
+- If lab-server is down, machines keep running from local cache
+- No new secrets can be distributed while secret-store is down
+- But nothing breaks — existing workloads continue uninterrupted
+- When secret-store comes back, machines sync and catch up
+
+**Separation of concerns:**
+- `lab-server` = coordination, API, lifecycle management
+- `secret-store` label = holds all secrets, serves policy-filtered requests
+- These CAN be the same node (apply both labels) or separate nodes
+- For homelab: same node is fine. For enterprise: separate for isolation
+
+#### Recovery Scenarios
+
+```
+Scenario 1: Lab-server down, secret-store up
+  → All machines continue working from local cache
+  → Machines can still fetch/refresh secrets from secret-store
+  → No new resources can be provisioned (lab-server manages lifecycle)
+  → But existing workloads are unaffected
+
+Scenario 2: Secret-store down, lab-server up
+  → All machines continue working from local cache
+  → Lab-server can still manage lifecycle (provision, plan, apply)
+  → No new secrets can be distributed
+  → No secret rotations until store is back
+  → Lab-server shows: ⚠ secret-store unreachable
+
+Scenario 3: Both down
+  → All machines continue working from local cache
+  → Nothing new can happen, but nothing breaks
+  → Recovery priority: restore secret-store first (from Git backup)
+
+Scenario 4: Machine reboots, cache intact
+  → Reads from local encrypted cache immediately
+  → Refreshes from secret-store in background to catch up
+  → No dependency on lab-server for startup
+
+Scenario 5: Machine rebuilt, cache empty
+  → Machine has its identity (from enrollment) but no secrets
+  → Fetches entitled secrets from secret-store (policy-filtered)
+  → If secret-store unreachable → cannot start (needs secrets)
+  → Operator can restore secret-store from Git backup to unblock
+
+Scenario 6: Total disaster, only Git backup survives
+  → Deploy new node, apply `secret-store` label
+  → Restore encrypted secrets from Git backup
+  → Deploy lab-server (lab init)
+  → New machines enroll and receive their entitled secrets
+  → System fully recovered
+
+Scenario 7: New pod in k8s, secret-store unreachable
+  → K8s node has local secret cache for its entitled secrets
+  → Lab k8s operator serves pod secrets from node's local cache
+  → Pod starts with cached secrets
+  → No interruption to deployments
+```
+
+#### CLI for Secret Distribution
+
+```
+$ lab secret status
+SECRET DISTRIBUTION STATUS:
+  Local cache:     ✓ 8 secrets cached (of 8 entitled), encrypted, fresh (< 5m old)
+  Secret store:    ✓ connected (2 replicas: store-1, store-2)
+  Lab-server:      ✓ connected
+  Git backup:      ✓ last push 2026-03-15 14:30:00 (47 total secrets)
+
+$ lab secret status --store
+SECRET STORE:
+  Replicas:        2/2 healthy
+    store-1        k8s pod    ✓ synced   47 secrets (all)
+    store-2        vm/xcpng   ✓ synced   47 secrets (all)
+  Git backup:      ✓ synced   2026-03-15 14:30:00
+  Total secrets:   47
+  Entitled consumers:
+    k8s-worker (12 machines)  → 3 secrets each
+    mailserver (2 machines)   → 5 secrets each
+    postgres (3 machines)     → 4 secrets each
+    lab-server (1 machine)    → 2 secrets
+
+$ lab secret cache
+LOCAL CACHE:
+SECRET                   CACHED     TTL        STATUS
+mail/tls-cert            ✓          89d left   fresh
+mail/dkim-key            ✓          no expiry  fresh
+k8s/join-token           ✓          23h left   fresh
+tls/node-cert            ✓          346d left  fresh
+
+$ lab secret recover --from git
+  → Fetching encrypted backup from git@github.com:org/lab-secrets.git
+  → Decrypting with master key...
+  → Restored 23 secrets
+  → Syncing with available secret-store replicas...
+```
+
+#### Local Cache Security
+
+The local cache must be stored securely. The exact mechanism needs investigation:
+- Encrypted at rest with a machine-specific key
+- Key derived from TPM 2.0, a secure enclave, or a LUKS-bound key (open question)
+- Memory-mapped, not swappable (mlock)
+- Accessible only by lab agent (file permissions + MAC/SELinux)
+- Wiped on machine decommission (`lab identity revoke`)
+- Possibly use kernel keyring on Linux — needs investigation
+
+#### Secret Backend — NOT Decided
+
+The underlying secret storage/sync mechanism is pluggable:
+
+```go
+type SecretBackend interface {
+    Name() string
+
+    // CRUD
+    Get(path string, identity *MachineIdentity) ([]byte, error)
+    Set(path string, value []byte) error
+    Delete(path string) error
+    List(prefix string) ([]string, error)
+
+    // Policy (auto-generated from code/labels)
+    GrantAccess(path string, identity *MachineIdentity) error
+    RevokeAccess(path string, identity *MachineIdentity) error
+
+    // Dynamic
+    Generate(path string, generator GeneratorConfig) ([]byte, error)
+    Rotate(path string) error
+
+    // Distribution
+    SyncWith(peer PeerInfo) error
+    CacheLocally(secrets []Secret) error
+    RestoreFromBackup(source BackupSource) error
+}
+```
+
+Possible approaches (each needs investigation):
+- **SOPS + age + Git** — simplest, encrypted files in Git, but no peer sync
+- **OpenBao** — Vault fork, has replication, but still central-server mindset
+- **Sealed Secrets / External Secrets Operator** — k8s-native, but not universal
+- **Infisical** — developer-friendly, but SaaS-oriented
+- **Custom: encrypted SQLite + peer sync** — simple, we control the sync protocol
+- **etcd with encryption** — distributed by nature, but might be overkill
+- **CockroachDB** — distributed SQL, encrypted, survives node failures
+- **Consul** — distributed KV with gossip, HashiCorp though
+- **Lab's own sync protocol** — gossip-based, encrypted, purpose-built
+
+The right answer might be a combination:
+- SOPS/age for encryption format (proven, auditable)
+- Custom gossip sync for distribution (lightweight)
+- Git for backup (free versioning and DR)
+- Or wrap an existing distributed KV that already handles sync
+
+**This is the most complex subsystem in Lab and needs careful investigation.**
+
+### Identity Plugin System
+
+Same extensible pattern as providers and health sources:
+
+```go
+type IdentityPlugin interface {
+    Name() string
+
+    // Enrollment
+    Enroll(resource *Resource, token string) (*Identity, error)
+    Revoke(resource *Resource) error
+
+    // Status
+    Status(resource *Resource) (*IdentityStatus, error)
+
+    // Renewal
+    Renew(resource *Resource) error
+}
+```
+
+This allows swapping identity backends without changing the rest of Lab.
+We might start with Vault + OpenVox CA and later add/replace components.
+
+## State Storage — Design Principles
+
+**NOT etcd.** etcd prioritizes consistency over availability: without quorum it stops
+serving writes rather than risk inconsistent data. For Lab, availability wins:
+
+- Losing a few events is better than total outage
+- Should auto-backup and auto-restore on corruption
+- Should degrade gracefully, never crash and refuse to start
+- Stale data is acceptable, no data is not
+
+Requirements:
+- Stores: resource state, label definitions, group membership, alert configs, audit log
+- Must survive lab-server restart
+- Must be migratable (lab-server can move between hosts)
+- Should auto-backup (to Git, S3, or local snapshots)
+- Should auto-recover from corruption without operator intervention
+- Embedded (no external dependency) preferred for simplicity
+
+Candidates (need investigation):
+- **SQLite** — embedded, simple, proven, WAL mode for concurrent reads, easy to backup (copy file)
+- **bbolt/BoltDB** — embedded KV, used by etcd ironically, simpler than etcd itself
+- **Badger** — embedded KV in Go, LSM-tree, good performance
+- **DuckDB** — embedded analytical DB, might be overkill
+- **PostgreSQL** — if we need multi-server state, but adds external dependency
+- **Litestream** — SQLite + continuous replication to S3/GCS/Azure (interesting combo)
+
+**SQLite + Litestream** is the current leading candidate:
+- SQLite for simplicity and embeddability
+- Litestream for continuous backup to S3/GCS/local without stopping the database
+- Auto-restore: if DB is missing, Litestream restores from latest backup
+- Single file, easy to migrate when lab-server moves
+- But needs investigation to confirm it handles our scale
+
+## Open Questions
+
+1. Name: "lab" is simple but generic. Alternatives?
+2. GitOps integration — should label/profile changes go through Git, or direct API?
+3. Multi-tenancy — how to scope labels/resources per team?
+4. Auth — mTLS between CLI and server? OIDC? Vault-issued tokens?
+5. Input format — TypeScript (DA-style), YAML (Compose-style), or both?
+6. Should `lab init` deploy lab-server as a container (portable) or native binary (simpler)?
diff --git a/os-install-research.md b/os-install-research.md
new file mode 100644
index 0000000..19a4798
--- /dev/null
+++ b/os-install-research.md
@@ -0,0 +1,356 @@
+# OS Installation Research
+
+## Target Operating Systems
+
+All must support unattended network installation and automated OpenVox enrollment.
+All must work across multiple CPU architectures where the OS supports it.
+
+| OS | Install System | Answer Format | Architectures | PXE Difficulty |
+|-----|---------------|--------------|---------------|---------------|
+| Ubuntu 24.04 | autoinstall (cloud-init) | YAML | x86_64, aarch64, RISC-V | Easy |
+| Debian 12 | preseed | preseed.cfg | x86_64, aarch64, many others | Medium |
+| Fedora 41+ | Anaconda/kickstart | .ks file | x86_64, aarch64 | Easy |
+| AlmaLinux 9 | Anaconda/kickstart | .ks file | x86_64, aarch64 | Easy |
+| XCP-ng 8.3 | Custom Python TUI | XML answer file | x86_64 only | HARD |
+| VyOS 1.4 | Custom installer | config.boot | x86_64, aarch64 | Medium |
+
+## XCP-ng Network Install — Known Hard
+
+### Why it's difficult
+- iPXE UEFI is fundamentally broken (open bug, multiboot module corruption)
+- Serial/headless install hangs after detecting storage — no fix
+- No VNC installer mode (unlike RHEL/Debian)
+- TFTP agonizingly slow for large install.img
+- Custom Python TUI designed for VGA console, not automation
+- No major provisioning tool has first-class XCP-ng support
+
+### What works
+- **BIOS PXE** more reliable than UEFI
+- **IPMI virtual media** with remastered ISO is most reliable
+- Answer file XML with `<post-install-script>` and `<script stage="filesystem-populated">`
+- Post-install puppet enrollment via `/etc/firstboot.d/` scripts
+- XCP-ng enables SSH by default after install
+
+### Answer file format (XML, custom to XenServer/XCP-ng)
+```xml
+<?xml version="1.0"?>
+<installation mode="fresh" srtype="ext">
+    <primary-disk>sda</primary-disk>
+    <keymap>us</keymap>
+    <root-password type="hash">$6$...</root-password>
+    <source type="url">http://server/xcp-ng/</source>
+    <admin-interface name="eth0" proto="dhcp" />
+    <hostname>xcphost01</hostname>
+    <timezone>Europe/London</timezone>
+    <ntp-server>pool.ntp.org</ntp-server>
+    <network-backend>openvswitch</network-backend>
+    <post-install-script type="url">http://server/scripts/post-install.sh</post-install-script>
+    <script stage="filesystem-populated" type="url">http://server/scripts/fs-setup.sh</script>
+</installation>
+```
+
+### Post-install puppet enrollment
+The `filesystem-populated` stage script drops a firstboot script:
+```bash
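+#!/bin/bash
+# Runs at the filesystem-populated stage; $1 is the mount point of the new root.
+set -euo pipefail
+MOUNT="$1"
+mkdir -p "$MOUNT/etc/firstboot.d"
+cat > "$MOUNT/etc/firstboot.d/99-lab-enroll" << 'SCRIPT'
+#!/bin/bash
+set -euo pipefail
+# Install puppet agent (XCP-ng is CentOS-based, yum works).
+# Assumes a repo that provides puppet-agent is already configured.
+yum install -y puppet-agent
+# Configure and start (puppet-agent installs its binaries under /opt/puppetlabs/bin)
+/opt/puppetlabs/bin/puppet config set server puppet.lab.internal --section main
+systemctl enable --now puppet
+SCRIPT
+chmod +x "$MOUNT/etc/firstboot.d/99-lab-enroll"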
+```
+
+## Lab Install Profile Abstraction
+
+Lab needs an `InstallerPlugin` interface so the same `lab onboard` command works
+for all OS types. Each plugin handles answer file generation, PXE chain setup,
+and post-install enrollment for its OS type.
+
+```go
+type InstallerPlugin interface {
+    Name() string
+    SupportedArchitectures() []string
+
+    // Generate the answer/config file for unattended install
+    GenerateAnswerFile(config InstallConfig) ([]byte, error)
+
+    // Set up PXE boot artifacts (kernel, initrd, bootloader configs)
+    PreparePXE(config PXEConfig) error
+
+    // Generate post-install enrollment script
+    GenerateEnrollmentScript(token string, labels []string) ([]byte, error)
+}
+```
+
+Built-in installer plugins:
+- `installer-autoinstall` — Ubuntu (cloud-init based autoinstall YAML)
+- `installer-kickstart` — Fedora, AlmaLinux, RHEL (kickstart .ks files)
+- `installer-preseed` — Debian (preseed.cfg)
+- `installer-xcpng` — XCP-ng (custom XML + firstboot.d scripts)
+- `installer-vyos` — VyOS (config.boot)
+
+## Auto-Onboard Rules
+
+Automatic onboarding based on detected hardware characteristics:
+
+```yaml
+auto-onboard:
+  rules:
+    - name: large-compute-to-xcpng
+      conditions:
+        cores: ">= 40"
+        memory: ">= 500GB"
+        provider: ovh
+      action:
+        image: xcpng-8.3
+        labels: [xen-host, production]
+
+    - name: arm-to-ubuntu
+      conditions:
+        arch: aarch64
+      action:
+        image: ubuntu-24.04
+        labels: [arm, k8s-worker]
+```
+
+Must support:
+- Preview: show which existing servers match/don't match rules
+- Dry-run: show what would happen for pending servers
+- Apply: actually onboard matching servers
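+
+A rough sketch of how rule matching and the preview mode could work. It assumes the YAML
+conditions have already been parsed into typed fields and that the first matching rule
+wins; neither is decided above, and `Rule`, `Server`, and `Preview` are hypothetical names.
+
+```go
+package main
+
+import "fmt"
+
+// Server holds the detected hardware characteristics of one machine.
+type Server struct {
+	Name     string
+	Cores    int
+	MemoryGB int
+	Arch     string
+	Provider string
+}
+
+// Rule is a parsed auto-onboard rule. Zero values mean "no constraint".
+type Rule struct {
+	Name        string
+	MinCores    int
+	MinMemoryGB int
+	Arch        string
+	Provider    string
+	Image       string
+	Labels      []string
+}
+
+// Matches reports whether every condition set on the rule holds for s.
+func (r Rule) Matches(s Server) bool {
+	if s.Cores < r.MinCores || s.MemoryGB < r.MinMemoryGB {
+		return false
+	}
+	if r.Arch != "" && r.Arch != s.Arch {
+		return false
+	}
+	if r.Provider != "" && r.Provider != s.Provider {
+		return false
+	}
+	return true
+}
+
+// Preview maps each server to the first matching rule name ("" if none).
+// It has no side effects, so it can back both preview and dry-run.
+func Preview(rules []Rule, servers []Server) map[string]string {
+	out := make(map[string]string, len(servers))
+	for _, s := range servers {
+		out[s.Name] = ""
+		for _, r := range rules {
+			if r.Matches(s) {
+				out[s.Name] = r.Name
+				break
+			}
+		}
+	}
+	return out
+}
+
+func main() {
+	rules := []Rule{
+		{Name: "large-compute-to-xcpng", MinCores: 40, MinMemoryGB: 500, Provider: "ovh",
+			Image: "xcpng-8.3", Labels: []string{"xen-host", "production"}},
+		{Name: "arm-to-ubuntu", Arch: "aarch64",
+			Image: "ubuntu-24.04", Labels: []string{"arm", "k8s-worker"}},
+	}
+	servers := []Server{
+		{Name: "ovh-big-1", Cores: 64, MemoryGB: 512, Arch: "x86_64", Provider: "ovh"},
+		{Name: "arm-box-1", Cores: 20, MemoryGB: 128, Arch: "aarch64", Provider: "ssh"},
+		{Name: "mini-1", Cores: 16, MemoryGB: 64, Arch: "x86_64", Provider: "pxe"},
+	}
+	for name, rule := range Preview(rules, servers) {
+		fmt.Printf("%s -> %q\n", name, rule)
+	}
+}
+```
+
+Apply would then act only on servers whose preview entry is non-empty, which keeps the
+three modes on the same code path.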
+
+## Deployment Approach: Universal PXE Agent + Rootfs Images
+
+### Decision: NOT using native installers
+
+Instead of dealing with 5 different installer formats (autoinstall, kickstart, preseed,
+XCP-ng XML, VyOS config), Lab uses a universal approach:
+
+1. PXE boot ONE agent OS (same for all target distros)
+2. Agent contacts Lab server, gets instructions
+3. Agent partitions disk, deploys rootfs tarball, injects config, reboots
+4. Target OS boots with lab-agent, enrolls with OpenVox
+
+This avoids the nightmare of maintaining 5 installer plugins × 3 architectures.
+
+### Tool Evaluation
+
+| Tool | What It Does | For Lab? |
+|------|-------------|----------|
+| **Tinkerbell (CNCF)** | PXE → HookOS agent → workflow actions (partition, deploy, inject) | **Best candidate to wrap** |
+| **LinuxKit** | Build minimal agent OS (used by Tinkerbell's HookOS) | Build our PXE agent |
+| **mkosi** | Build rootfs tarballs for any distro (Fedora, Ubuntu, Debian, etc.) | **Image production** |
+| **iPXE** | Universal PXE bootloader with scripting | PXE foundation |
+| **Pixiecore** | Simple Go PXE server with per-MAC API mode | PXE building block |
+| **bootc** | Bootable OCI containers → install to disk (RHEL-family) | Image format option |
+| **cloud-init** | First-boot config injection | Post-deploy config |
+| **Packer** | Build VM/machine images | Golden image building |
+| **MAAS/Curtin** | Production-grade, same pattern, but Ubuntu-centric + heavy | Too opinionated |
+| **Warewulf** | Stateless/diskless boot from container images | Wrong model (RAM-only) |
+| **Kairos** | Immutable k8s-focused OS from containers | Too opinionated |
+| **FOG/Clonezilla** | Block-level disk cloning | Too rigid |
+| **FAI** | Debian-centric installer framework | Too narrow |
+| **Razor (Puppet)** | Dead (archived 2019) | Dead |
+| **netboot.xyz** | PXE boot menu into native installers | Opposite of what we want |
+
+### Tinkerbell — Closest Match
+
+Tinkerbell already implements this pattern:
+- **HookOS**: minimal agent OS built with LinuxKit, boots via PXE, multi-arch (x86 + ARM)
+- **Tink Worker**: runs inside HookOS, contacts server via gRPC, executes workflows
+- **Workflow Actions**:
+  - `rootio` — partition disks, create filesystems
+  - `archive2disk` — stream compressed rootfs tarball to mounted filesystem
+  - `image2disk` — write raw disk image (dd-style)
+  - `oci2disk` — pull OCI container image, write to disk
+  - `writefile` — write individual files (puppet certs, config, enrollment token)
+  - `cexec` — chroot and run commands (install bootloader, etc.)
+  - `kexec` — kexec into new kernel (avoids reboot)
+
+**Tinkerbell's limitation:** requires Kubernetes to run (Tink Server is k8s-native).
+Options:
+- Run on bootstrap node's k3s (works but adds k3s dependency before we have k3s)
+- Extract just HookOS + actions, replace Tink Server with Lab's own API
+- Use Tinkerbell after initial bootstrap
+
+### Option A: Wrap Tinkerbell
+Use Tinkerbell's HookOS and actions, Lab translates `lab onboard` into Tinkerbell
+workflows. Proven, multi-arch, battle-tested by Equinix Metal.
+
+### Option B: Build our own lightweight agent
+If Tinkerbell's k8s dependency is too heavy:
+- Build agent OS with LinuxKit (like HookOS but simpler)
+- Small Go binary as the agent: contacts lab-server, gets instructions, partitions,
+  deploys rootfs, injects files, installs bootloader, reboots
+- Embedded in Lab binary — no k8s dependency
+- Essentially "Tinkerbell actions without Tinkerbell's workflow engine"
+
+### Decision: TBD — needs hands-on evaluation of Tinkerbell
+
+### VyOS Inspiration
+
+VyOS proves this pattern works:
+- Image-based install (rootfs deployed to partition)
+- Also runs as Docker container (same config system)
+- Same concept as Lab: one definition → VM image, bare metal, or container
+
+### Image Production Pipeline
+
+Lab needs to produce rootfs tarballs for each OS × architecture:
+
+```
+$ lab image build ubuntu-24.04 --arch x86_64,aarch64
+  → Uses mkosi or debootstrap to build rootfs
+  → Injects lab-agent, cloud-init datasource
+  → Produces: ubuntu-24.04-x86_64.tar.gz, ubuntu-24.04-aarch64.tar.gz
+
+$ lab image build xcpng-8.3 --arch x86_64
+  → Extract/capture rootfs from XCP-ng installer/installed system
+  → Produces: xcpng-8.3-x86_64.tar.gz
+
+$ lab image list
+IMAGE              ARCH              SIZE      BUILT
+ubuntu-24.04       x86_64, aarch64   850MB     2026-03-15
+debian-12          x86_64, aarch64   620MB     2026-03-14
+fedora-41          x86_64, aarch64   920MB     2026-03-14
+almalinux-9        x86_64, aarch64   780MB     2026-03-13
+xcpng-8.3          x86_64            1.2GB     2026-03-10
+vyos-1.4           x86_64, aarch64   450MB     2026-03-12
+```
+
+Image build tools per OS:
+- Ubuntu/Debian: debootstrap or mkosi
+- Fedora/AlmaLinux: dnf --installroot or mkosi
+- XCP-ng: install in QEMU + Packer, capture rootfs (only viable method)
+- VyOS: extract squashfs from ISO (`unsquashfs /mnt/live/filesystem.squashfs`)
+- Asahi Linux: NOT BUILDABLE — SSH onboard only, OS already installed by user
+
+## XCP-ng Rootfs Production — Detailed
+
+### Why package-based build doesn't work
+- `install.img` is the installer ramdisk, NOT the target system
+- The installer (`host-installer/backend.py`) does post-install XAPI setup that
+  can't be replicated with just yum --installroot
+- Nobody has successfully built XCP-ng from packages alone
+- `create-install-image` scripts only produce ISOs
+
+### Viable approach: Packer + QEMU capture
+```
+1. Boot XCP-ng ISO in QEMU with answerfile (unattended)
+2. Installer runs normally, does all XAPI/Xen setup
+3. Mount resulting disk image
+4. Tar up root partition
+5. Generalize: remove SSH keys, XAPI state.db, hostname, UUIDs, persistent net rules
+6. Output: xcpng-8.3-x86_64.tar.gz
+```
+
+### XCP-ng partition layout (PXE agent must recreate this)
+```
+sda1: 18GB  ext3  /           (dom0 root)
+sda2: 18GB  ext3  (backup)    (upgrade slot)
+sda3: rest  LVM   (SR)        (VM storage repository)
+sda4: 512MB vfat  /boot/efi   (UEFI ESP)
+sda5: 4GB   ext3  /var/log
+sda6: 1GB   swap
+```
+
+## Asahi Linux — Special Case
+
+### Why it can't follow the standard path
+- No PXE boot — Apple Silicon only boots from internal NVMe or USB (iBoot)
+- Firmware partition — m1n1 must be in Apple's APFS container, coexists with macOS
+- Device tree — generated per-chip at install time
+- GPU drivers — Asahi's reverse-engineered drivers are kernel-specific
+- Boot chain: iBoot → m1n1 → U-Boot → GRUB → Linux (completely non-standard)
+
+### How Lab handles it
+- SSH onboard only: `lab onboard mac-studio --provider ssh --host <ip>`
+- Asahi is already installed (user did this manually or via Asahi installer)
+- Lab manages the userspace (Fedora-based) via puppet normally
+- Kernel updates from Asahi repos, managed by puppet/dnf
+- m1n1/U-Boot/firmware layer is untouched by Lab
+
+### Lesson
+Not everything is PXE-bootable. Lab needs two onboard paths:
+- **PXE onboard**: bare metal with no OS (Beelinks, OVH servers, XCP-ng hosts)
+- **SSH onboard**: OS already installed (Mac Studio, DGX Spark, cloud VMs)
+
+## Image Deployment Matrix
+
+```
+                    PXE Deploy    SSH Onboard    Container    VM Image
+Ubuntu 24.04        ✓ rootfs      ✓              ✓            ✓ qcow2
+Debian 12           ✓ rootfs      ✓              ✓            ✓ qcow2
+Fedora 41           ✓ rootfs      ✓              ✓            ✓ qcow2
+AlmaLinux 9         ✓ rootfs      ✓              ✓            ✓ qcow2
+XCP-ng 8.3          ✓ rootfs      ✓ (existing)   ✗            ✗
+VyOS 1.4            ✓ rootfs      ✓ (existing)   ✓ docker     ✓ qcow2
+Asahi Linux         ✗ impossible  ✓ (only way)   ✗            ✗
+```
+
+## Automated Image Pipeline
+
+Images must be rebuilt regularly to include security updates and new lab-agent versions.
+
+### Pipeline Configuration
+```yaml
+image-pipelines:
+  ubuntu-24.04:
+    method: debootstrap
+    schedule: weekly
+    architectures: [x86_64, aarch64]
+    outputs: [rootfs-tarball, container-base, qcow2]
+    retention: 4 builds
+
+  xcpng-8.3:
+    method: packer-qemu          # install in QEMU, capture
+    schedule: monthly
+    architectures: [x86_64]
+    outputs: [rootfs-tarball]
+    retention: 3 builds
+
+  vyos-1.4:
+    method: squashfs-extract     # extract from ISO
+    schedule: monthly
+    architectures: [x86_64, aarch64]
+    outputs: [rootfs-tarball, container-base]
+    retention: 3 builds
+```
+
+### Build runs on Lab itself (dogfooding)
+- x86 images build on x86 machines (Beelink SER9 MAX)
+- ARM images build on ARM machines (DGX Spark, Minisforum)
+- XCP-ng builds on any x86 with QEMU/KVM
+- Lab picks the right builder based on architecture
+
+### Upgrade flow
+- New image built → Lab knows which servers run old version
+- `lab image diff` shows package changes
+- `lab image promote` makes new image the default for new deploys
+- Existing servers: puppet manages package updates (not re-imaged unless requested)
+
+### Connection to Puppet → Container Artifact Builder
+
+Same pipeline, different output targets:
+
+```
+Label "mailserver" + base image "ubuntu-24.04":
+  → rootfs + puppet classes = bare metal image (tar.gz for PXE deploy)
+  → rootfs + puppet classes = container image (OCI for k8s/docker)
+  → rootfs + puppet classes = VM image (qcow2/vmdk for XCP-ng/AWS)
+
+One label, one set of puppet modules, three deployment formats.
+```
+
+## Multi-Architecture Considerations
+
+- PXE boot chain differs between x86 (BIOS/UEFI) and ARM (UEFI only)
+- Need separate kernel/initrd per architecture for the agent OS
+- Rootfs tarballs are architecture-specific
+- Some OS images don't exist for all architectures (XCP-ng = x86 only)
+- Lab must track architecture per image and refuse mismatches
+- Tinkerbell's HookOS already builds for x86_64 and aarch64
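+
+The "track architecture per image and refuse mismatches" rule, together with builder
+selection by architecture, can be sketched as below. `Image`, `Builder`, `CheckDeploy`,
+and `PickBuilder` are hypothetical names, and the host names are placeholders.
+
+```go
+package main
+
+import "fmt"
+
+// Image records which architectures an image has been built for.
+type Image struct {
+	Name  string
+	Archs []string
+}
+
+// Builder is a machine that can build images for its native architecture.
+type Builder struct {
+	Host string
+	Arch string
+}
+
+// Supports reports whether the image has a build for arch.
+func (img Image) Supports(arch string) bool {
+	for _, a := range img.Archs {
+		if a == arch {
+			return true
+		}
+	}
+	return false
+}
+
+// CheckDeploy refuses to deploy an image onto a machine it was not built for.
+func CheckDeploy(img Image, machineArch string) error {
+	if !img.Supports(machineArch) {
+		return fmt.Errorf("image %s has no %s build (available: %v)", img.Name, machineArch, img.Archs)
+	}
+	return nil
+}
+
+// PickBuilder returns the first builder whose native architecture matches arch.
+func PickBuilder(builders []Builder, arch string) (Builder, bool) {
+	for _, b := range builders {
+		if b.Arch == arch {
+			return b, true
+		}
+	}
+	return Builder{}, false
+}
+
+func main() {
+	xcpng := Image{Name: "xcpng-8.3", Archs: []string{"x86_64"}}
+	fmt.Println(CheckDeploy(xcpng, "x86_64"))
+	fmt.Println(CheckDeploy(xcpng, "aarch64"))
+
+	builders := []Builder{{Host: "builder-x86", Arch: "x86_64"}, {Host: "builder-arm", Arch: "aarch64"}}
+	b, ok := PickBuilder(builders, "aarch64")
+	fmt.Println(b.Host, ok)
+}
+```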