first commit

2026-03-15 23:50:43 +00:00
commit ac695f506f
9 changed files with 3003 additions and 0 deletions
--- a/architecture.md
+++ b/architecture.md
@@ -0,0 +1,246 @@
+# Architecture Decisions
+
+## Core Principles
+
+1. Build for homelab first, design for AWS/multi-cloud from the start
+2. Labels as the universal abstraction — config attaches to labels, not machines
+3. Code is the policy — declarations grant access, no separate policy management
+4. Availability over consistency — stale data is acceptable, no data is not
+5. No single point of failure — everything works offline with local cache
+6. Don't reinvent the wheel — wrap existing tools, build the glue and UX
+7. One engine everywhere — CLI, server, and init all use the same code path
+
+## The Tool: "lab"
+
+Unified infrastructure lifecycle platform. Full spec in `lab-tool-spec.md`.
+
+### Component Dependency Map
+
+```
+┌─────────────────────────────────────────────────────────────────────┐
+│                        LAB PLATFORM                                  │
+│                                                                      │
+│  ┌─────────────────────────────────────────────────────────────┐    │
+│  │                    CORE (no external deps)                   │    │
+│  │  ┌──────────┐ ┌──────────┐ ┌──────────┐ ┌───────────────┐  │    │
+│  │  │ Label    │ │ Group    │ │ Targeting│ │ Render Engine │  │    │
+│  │  │ Engine   │ │ Engine   │ │ Engine   │ │ (CLI tables,  │  │    │
+│  │  │          │ │          │ │          │ │  TUI, diff)   │  │    │
+│  │  └──────────┘ └──────────┘ └──────────┘ └───────────────┘  │    │
+│  │  ┌──────────────┐ ┌──────────────┐ ┌──────────────────┐    │    │
+│  │  │ Profile      │ │ State Store  │ │ Plugin Registry  │    │    │
+│  │  │ Engine       │ │ (SQLite +    │ │                  │    │    │
+│  │  │ (t-shirt     │ │  Litestream) │ │                  │    │    │
+│  │  │  sizes)      │ │              │ │                  │    │    │
+│  │  └──────────────┘ └──────────────┘ └──────────────────┘    │    │
+│  └─────────────────────────────────────────────────────────────┘    │
+│       ▲ depends on core                                              │
+│  ┌────┴────────────────────────────────────────────────────────┐    │
+│  │              LIFECYCLE (depends on: core + providers)        │    │
+│  │  ┌──────────────┐ ┌──────────────┐ ┌──────────────────┐    │    │
+│  │  │ Lifecycle    │ │ Artifact     │ │ K8s Deployer     │    │    │
+│  │  │ Manager      │ │ Builder      │ │                  │    │    │
+│  │  │ (plan/apply/ │ │ (puppet →    │ │                  │    │    │
+│  │  │  destroy)    │ │  container)  │ │                  │    │    │
+│  │  └──────────────┘ └──────────────┘ └──────────────────┘    │    │
+│  └─────────────────────────────────────────────────────────────┘    │
+│       ▲ depends on lifecycle                                         │
+│  ┌────┴────────────────────────────────────────────────────────┐    │
+│  │              IDENTITY & SECRETS (depends on: lifecycle)      │    │
+│  │  ┌──────────────┐ ┌──────────────┐ ┌──────────────────┐    │    │
+│  │  │ Identity     │ │ Secret Store │ │ Token Issuer     │    │    │
+│  │  │ Manager      │ │ (privileged  │ │ (one-time join   │    │    │
+│  │  │ (enroll,     │ │  label, local│ │  tokens)         │    │    │
+│  │  │  DNS, certs, │ │  cache, git  │ │                  │    │    │
+│  │  │  SSH keys)   │ │  backup)     │ │                  │    │    │
+│  │  └──────────────┘ └──────────────┘ └──────────────────┘    │    │
+│  └─────────────────────────────────────────────────────────────┘    │
+│       ▲ depends on identity                                          │
+│  ┌────┴────────────────────────────────────────────────────────┐    │
+│  │              OBSERVABILITY (depends on: core + identity)    │    │
+│  │  ┌──────────────┐ ┌──────────────┐ ┌──────────────────┐    │    │
+│  │  │ Health       │ │ Alert        │ │ Audit Log        │    │    │
+│  │  │ Aggregator   │ │ Generator    │ │                  │    │    │
+│  │  │              │ │ (auto + user │ │                  │    │    │
+│  │  │              │ │  defined)    │ │                  │    │    │
+│  │  └──────────────┘ └──────────────┘ └──────────────────┘    │    │
+│  └─────────────────────────────────────────────────────────────┘    │
+│                                                                      │
+│  ┌─────────────────────────────────────────────────────────────┐    │
+│  │              INTERFACES (depends on: everything above)      │    │
+│  │  ┌──────────┐ ┌──────────┐ ┌──────────┐ ┌──────────────┐  │    │
+│  │  │ gRPC/REST│ │ CLI      │ │ TUI      │ │ Web UI       │  │    │
+│  │  │ API      │ │ (cobra)  │ │(bubbletea)│ │ (future)     │  │    │
+│  │  └──────────┘ └──────────┘ └──────────┘ └──────────────┘  │    │
+│  └─────────────────────────────────────────────────────────────┘    │
+└─────────────────────────────────────────────────────────────────────┘
+
+PROVIDER PLUGINS (external, loaded at runtime):
+  ┌────────────┐ ┌────────────┐ ┌──────────────┐ ┌────────────┐
+  │provider-aws│ │provider-   │ │provider-     │ │provider-k8s│
+  │ (Pulumi)   │ │xcpng (XO)  │ │baremetal     │ │ (Pulumi)   │
+  └────────────┘ └────────────┘ │(Tinkerbell)  │ └────────────┘
+                                └──────────────┘
+HEALTH PLUGINS:                 IDENTITY PLUGINS:
+  ┌────────────┐ ┌──────────┐   ┌───────────┐ ┌─────────────┐
+  │health-     │ │health-   │   │id-openvox │ │id-dns       │
+  │prometheus  │ │naemon    │   │           │ │             │
+  └────────────┘ └──────────┘   └───────────┘ └─────────────┘
+  ┌────────────┐                ┌───────────┐ ┌─────────────┐
+  │health-     │                │id-ssh-ca  │ │id-secret    │
+  │cloudwatch  │                │           │ │             │
+  └────────────┘                └───────────┘ └─────────────┘
+```
+
+### Build Order (what depends on what)
+
+```
+Phase 1: CORE (can be built and tested independently)
+  ├── Label Engine
+  ├── Group Engine (depends on: labels)
+  ├── Targeting Engine (depends on: labels, groups)
+  ├── Profile Engine (t-shirt sizes)
+  ├── Render Engine
+  ├── State Store (SQLite + Litestream)
+  ├── Plugin Registry
+  ├── CLI framework (cobra)
+  └── gRPC/REST API skeleton
+
+Phase 2: PROVIDERS (can be built in parallel, each independent)
+  ├── provider-ssh (simplest, needed for onboarding existing machines)
+  ├── provider-baremetal (PXE boot — embedded DHCP/TFTP/HTTP server)
+  ├── provider-portainer (deploy via Portainer API)
+  ├── provider-k8s (needed for k8s deployments)
+  ├── provider-aws (Pulumi AWS)
+  └── provider-xcpng (Pulumi XO / XO REST API)
+
+Phase 3: LIFECYCLE (depends on: core + at least one provider)
+  ├── Lifecycle Manager (plan/apply/destroy)
+  ├── Onboarding (lab onboard — SSH detect + PXE boot + auto-enroll)
+  ├── Hardware detection (suggest labels from detected CPU/GPU/RAM/disk)
+  ├── Local mode (lab init --local, engine on user device)
+  ├── Self-deploy (lab init — deploy to remote target)
+  ├── Self-migration (lab server migrate)
+  └── Artifact Builder (puppet → container)
+
+Phase 4: IDENTITY (depends on: lifecycle)
+  ├── Token Issuer (one-time join tokens)
+  ├── OpenVox Enroller (cert signing, node classification)
+  ├── DNS Manager (auto-registration, IP mobility)
+  ├── SSH CA integration
+  └── Secret Store (privileged label, local cache, git backup)
+
+Phase 5: OBSERVABILITY (depends on: core + identity)
+  ├── Health Aggregator (Prometheus, Naemon, CloudWatch plugins)
+  ├── Alert Generator (auto + user-defined, targeting engine)
+  ├── Four-pillar status (sync + puppet + health + identity)
+  └── Audit log
+
+Phase 6: UX POLISH
+  ├── TUI (bubbletea, k9s-style, cross-linked navigation)
+  ├── lab show / lab targets (visibility commands)
+  ├── lab render (multi-provider comparison)
+  └── Web UI (future)
+```
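
The phase dependencies above form a small directed graph, so a valid build order can be derived mechanically. The following is a minimal sketch using `tsort` from coreutils. The edge list is an assumption read off the phase headings (in particular, that providers need the core Plugin Registry and that UX polish follows observability), and `phase_order` is a hypothetical helper name, not part of `lab`.

```shell
#!/bin/sh
# phase_order: print the build phases in dependency order.
# Each input pair "A B" means "A must be built before B".
# The edges are assumptions derived from the "depends on" notes above.
phase_order() {
    tsort <<'EOF'
core providers
core lifecycle
providers lifecycle
lifecycle identity
core observability
identity observability
observability ux
EOF
}

phase_order
# Prints, one per line: core providers lifecycle identity observability ux
```

Because these edges happen to form a single chain, the order is unique. If two phases were truly independent, `tsort` would be free to emit them in either order.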
+
+### Key Concepts
+
+| Concept | Description |
+|---------|-------------|
+| **Labels** | Universal abstraction. Config (puppet classes, alerts, secrets, sizes) attaches to labels |
+| **Groups** | Composable, nested, with exclusions. Target by label, group, server, environment |
+| **Targeting** | Unified query syntax used everywhere: alerts, secrets, puppet, queries |
+| **Four Pillars** | Every resource shows: Sync + Puppet + Health + Identity |
+| **Profiles** | T-shirt sizing with per-provider mappings, user-owned |
+| **Secret Store** | Privileged label holding all secrets, machines get only entitled subset |
+| **Code = Policy** | `lab::secret()` in puppet code = usage AND access declaration |
+| **Artifact Builder** | Same puppet modules → VM config OR container image |
+| **Self-deploy** | Lab deploys itself using same engine as everything else |
+| **Visibility** | Two-way: server→everything applied, label→all servers affected |
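
To make the label and targeting rows concrete, here is a minimal sketch of label-based selection with one exclusion. The inventory, the server names, and the `targets` helper are all hypothetical; the real targeting syntax is defined in `lab-tool-spec.md` and also covers nested groups, which this sketch does not attempt.

```shell
#!/bin/sh
# Hypothetical inventory: one "server label label ..." record per line.
inventory() {
    cat <<'EOF'
web1 web prod
web2 web staging
db1 db prod
EOF
}

# targets INCLUDE [EXCLUDE]
# Print every server that carries label INCLUDE and does not carry EXCLUDE.
targets() {
    inventory | awk -v inc="$1" -v exc="${2:-}" '
    {
        has_inc = 0; has_exc = 0
        for (i = 2; i <= NF; i++) {
            if ($i == inc) has_inc = 1
            if (exc != "" && $i == exc) has_exc = 1
        }
        if (has_inc && !has_exc) print $1
    }' | LC_ALL=C sort
}

targets prod          # prints db1 and web1
targets web staging   # prints web1 (web2 is excluded by "staging")
```

The same lookup run in the other direction (all labels for one server) would give the server-to-config half of the two-way visibility described in the table.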
+
+## Infrastructure Stack
+
+| Layer | Homelab | AWS Equivalent | Status |
+|-------|---------|----------------|--------|
+| Orchestration | k3s | EKS | Decided |
+| IaC engine | Pulumi | Pulumi | Decided |
+| GitOps | ArgoCD | ArgoCD | Decided |
+| Monitoring (k8s) | Prometheus + Grafana | Prometheus + Grafana | Decided |
+| Monitoring (infra) | Naemon | N/A (bare metal only) | Decided |
+| Secrets backend | TBD | TBD | Needs investigation |
+| DNS | PowerDNS + ExternalDNS | Route53 + ExternalDNS | Decided — see `dns-research.md` |
+| TLS / CA | TBD | TBD | Needs investigation |
+| SSH CA | TBD | TBD | Needs investigation |
+| Storage | Longhorn | EBS CSI | Decided |
+| Config mgmt | OpenVox | OpenVox | Decided |
+| Bare metal boot | Tinkerbell / iPXE | N/A | Needs investigation |
+| State store | SQLite + Litestream | SQLite + Litestream | Leading candidate |
+| Container build | Buildah / Docker | Buildah / Docker | Needs investigation |
+
+## Decisions Made
+
+| Decision | Choice | Why | Alternatives Considered |
+|----------|--------|-----|------------------------|
+| IaC engine | Pulumi | Real languages, plan/preview, component packages, XCP-ng provider exists | Terraform (no abstraction), Crossplane (no plan) |
+| Config mgmt | OpenVox | Puppet fork, Apache 2.0, existing modules, active community | Puppet (Perforce EULA, 25-node limit) |
+| Multi-cloud abstraction | Custom (Lab) | Nothing exists that does labels + plan + bare metal + XCP-ng | Crossplane (no plan), Terraform (re-implement per cloud) |
+| Kubernetes | k3s | Puppet-friendly, multi-arch, lightweight, same K8s API as EKS | OpenShift (fights puppet), Talos (no SSH/puppet), MicroK8s (snap-based) |
+| Target OS list | Ubuntu, Debian, Fedora, AlmaLinux, XCP-ng, VyOS | Multi-arch, each with different install automation | See `os-install-research.md` |
+| State store | NOT etcd | etcd crashes rather than serve stale data; we want availability > consistency | Leading: SQLite + Litestream |
+| Secret access model | Code = policy | Declarations in code/labels auto-grant access, no manual Vault policies | Manual Vault policy management |
+| Secret distribution | Privileged store + local cache | Prevents secret sprawl, machines only get entitled secrets | Peer-to-peer sync (leaks secrets sideways) |
+| Resilience model | Offline-capable | Local cache keeps everything running, git backup for DR | Central server dependency (FreeIPA burned us) |
+| Bootstrap | Self-deploying | lab init uses same engine as lab apply, no special codepath | Separate init provider interface |
+
+## Evaluated and Rejected
+
+| Tool | Why Rejected | Details |
+|------|-------------|---------|
+| **Crossplane** | No plan/preview — dealbreaker for enterprise | `crossplane-evaluation.md` |
+| **Foreman** | Obsolete, poor UX (based on first-hand use) | Memory: `feedback_foreman.md` |
+| **Terraform/OpenTofu** | No multi-platform abstraction | Re-implement per cloud at thousands of nodes |
+| **MAAS** | Bare metal only | No cloud VMs, no Puppet integration |
+| **OpenShift** | Fights external config mgmt, heavy, limited ARM | See `kubernetes-flavors.md` |
+| **Talos** | Immutable OS, no SSH, no puppet | Incompatible with our approach |
+| **MicroK8s** | Snap-based | Puppet managing snaps is awkward |
+| **HashiCorp Vault** | Not impressed, central-server mindset | Will evaluate alternatives (OpenBao, Infisical, etc.) |
+| **etcd** | Consistency over availability | Crashes rather than serving stale data |
+| **FreeIPA** | Unstable | Good features (DNS, SSH, CA, secrets) but unreliable |
+
+## Investigation Queue
+
+Things we've identified but haven't evaluated yet, in rough priority order:
+
+| # | Topic | Context | Options to Investigate |
+|---|-------|---------|----------------------|
+| 1 | Secret backend | Distributed, offline-capable, policy-filtered | OpenBao, Infisical, Conjur, SOPS+age, custom encrypted SQLite |
+| 2 | ~~DNS auto-registration~~ | ~~Every managed resource auto-registered~~ | **DECIDED: PowerDNS + ExternalDNS** — see `dns-research.md` |
+| 3 | SSH CA | CA-signed host keys, short-lived user certs | Vault SSH engine, OpenVox CA, step-ca, Teleport, Boundary |
+| 4 | TLS / Internal CA | Machine certs, auto-renewal | OpenVox CA, Vault PKI, step-ca, cert-manager |
+| 5 | Bare metal provisioning | Universal PXE agent + rootfs deploy (NOT native installers) | Wrap Tinkerbell vs build own agent — see `os-install-research.md` |
+| 6 | State store | Embedded, auto-backup, auto-recover | SQLite+Litestream, bbolt, Badger |
+| 7 | Container build | Puppet modules → OCI images | Buildah, Docker, Kaniko |
+| 8 | Local cache encryption | Machine-specific key for secret cache | TPM 2.0, kernel keyring, LUKS-bound, secure enclave |
+| 9 | Alert rendering | Generate monitoring configs from lab alerts | Prometheus rules, Naemon configs, CloudWatch |
+| 10 | Input format | How users define resources and labels | YAML (Compose-like), Pkl, KCL, CUE, TypeScript |
+| 11 | Auth (CLI to server) | Secure CLI-to-lab-server communication | mTLS, OIDC, Vault tokens |
+| 12 | XCP-ng Pulumi provider | May need Upjet wrapper or direct API | Existing Terraform provider via Upjet, Pulumi XO provider |
+| 13 | Multi-tenancy | Team scoping for labels/resources | Namespaces, RBAC, org hierarchy |
+| 14 | Image production pipeline | Build rootfs tarballs per OS per arch | mkosi, debootstrap, dnf --installroot, Packer |
+| 15 | Tinkerbell evaluation | Hands-on: does wrapping it work, or build our own agent? | HookOS + actions vs custom LinuxKit agent |
+| 16 | XCP-ng rootfs extraction | How to produce deployable XCP-ng rootfs (not native installer) | Extract from ISO, capture installed system |
+| 17 | VyOS rootfs extraction | How to produce deployable VyOS rootfs | VyOS build system, published images, Docker mode |
+| 18 | Multi-arch PXE | Different boot chains for x86 BIOS, x86 UEFI, ARM UEFI | Per-arch agent OS builds, iPXE configs |
+
+## Project Files
+
+| File | Contents |
+|------|----------|
+| `lab-tool-spec.md` | Full platform specification (CLI examples, plugin interfaces, secrets, identity, bootstrap) |
+| `architecture.md` | This file — decisions, dependencies, investigation queue |
+| `hardware.md` | Homelab hardware inventory and node roles |
+| `crossplane-evaluation.md` | Crossplane evaluation and rejection rationale |
+| `config-format-research.md` | YAML alternatives research (Pkl, KCL, CUE, CDK8s, etc.) |
+| `os-install-research.md` | OS install automation, rootfs production, image pipeline, deployment matrix |
+| `kubernetes-flavors.md` | k3s chosen, OpenShift/Talos/MicroK8s rejected with rationale |
+| `dns-research.md` | PowerDNS + ExternalDNS chosen, domain claims, health-checked DNS |
--- a/bastion.sh
+++ b/bastion.sh
@@ -0,0 +1,337 @@
+#!/usr/bin/env bash
+# ─────────────────────────────────────────────────────────────────────
+# Lab PXE Bastion — ephemeral PXE server for bare-metal provisioning
+#
+# Turns this machine into a temporary PXE boot server.  Target machines
+# on the same network can PXE boot and get Fedora installed automatically.
+#
+# Usage:
+#   sudo bash bastion.sh                        # interactive, auto-detect everything
+#   sudo TARGET_HOSTNAME=puppet SSH_PUBKEY=~/.ssh/id_ed25519.pub bash bastion.sh
+#
+# Requirements: Fedora/RHEL host with dnsmasq, python3, curl
+# ─────────────────────────────────────────────────────────────────────
+set -euo pipefail
+
+# ──── Defaults (override via environment) ──────────────────────────
+FEDORA_VERSION="${FEDORA_VERSION:-41}"
+ARCH="${ARCH:-x86_64}"
+HTTP_PORT="${HTTP_PORT:-8080}"
+TARGET_HOSTNAME="${TARGET_HOSTNAME:-lab-node}"
+TARGET_DISK="${TARGET_DISK:-}"          # empty = anaconda auto-picks
+SSH_PUBKEY="${SSH_PUBKEY:-}"            # path to .pub file, auto-detected
+TIMEZONE="${TIMEZONE:-Europe/London}"
+LOCALE="${LOCALE:-en_GB.UTF-8}"
+BASTION_DIR="${BASTION_DIR:-/tmp/lab-bastion}"
+
+# ──── Colors ───────────────────────────────────────────────────────
+RED='\033[0;31m'; GREEN='\033[0;32m'; YELLOW='\033[1;33m'
+CYAN='\033[0;36m'; BOLD='\033[1m'; NC='\033[0m'
+
+log()  { echo -e "${GREEN}[bastion]${NC} $*"; }
+warn() { echo -e "${YELLOW}[bastion]${NC} $*"; }
+err()  { echo -e "${RED}[bastion]${NC} $*" >&2; }
+die()  { err "$@"; exit 1; }
+
+# ──── Preflight ────────────────────────────────────────────────────
+[[ $EUID -eq 0 ]] || die "Must run as root (need DHCP/TFTP ports). Use: sudo bash bastion.sh"
+
+command -v python3 >/dev/null || die "python3 not found"
+command -v curl    >/dev/null || die "curl not found"
+
+# Install dnsmasq if missing
+if ! command -v dnsmasq >/dev/null; then
+    log "Installing dnsmasq..."
+    if command -v dnf >/dev/null; then
+        dnf install -y dnsmasq
+    elif command -v apt-get >/dev/null; then
+        apt-get install -y dnsmasq
+    else
+        die "Cannot install dnsmasq — install it manually"
+    fi
+fi
+
+# ──── Auto-detect network ─────────────────────────────────────────
+IFACE="${IFACE:-$(ip route | awk '/default/ {print $5; exit}')}"
+SERVER_IP="$(ip -4 addr show "$IFACE" | awk '/inet / {split($2,a,"/"); print a[1]; exit}')"
+NETWORK="$(echo "$SERVER_IP" | awk -F. '{print $1"."$2"."$3".0"}')"  # assumes a /24 subnet
+
+[[ -n "$SERVER_IP" ]] || die "Cannot detect IP on interface $IFACE"
+log "Interface: ${BOLD}$IFACE${NC}  IP: ${BOLD}$SERVER_IP${NC}  Network: ${BOLD}$NETWORK${NC}"
+
+# ──── Auto-detect SSH pubkey ───────────────────────────────────────
+if [[ -z "$SSH_PUBKEY" ]]; then
+    # When run via sudo, check the real user's home
+    REAL_HOME="${HOME}"
+    if [[ -n "${SUDO_USER:-}" ]]; then
+        REAL_HOME="$(getent passwd "$SUDO_USER" | cut -d: -f6)"
+    fi
+    for keyfile in "$REAL_HOME/.ssh/id_ed25519.pub" "$REAL_HOME/.ssh/id_rsa.pub" "$REAL_HOME/.ssh/id_ecdsa.pub"; do
+        if [[ -f "$keyfile" ]]; then
+            SSH_PUBKEY="$keyfile"
+            break
+        fi
+    done
+fi
+
+if [[ -n "$SSH_PUBKEY" && -f "$SSH_PUBKEY" ]]; then
+    SSH_KEY_CONTENT="$(cat "$SSH_PUBKEY")"
+    log "SSH key: ${BOLD}$SSH_PUBKEY${NC}"
+else
+    warn "No SSH public key found. Root password will be set to 'changeme'."
+    warn "Set SSH_PUBKEY=/path/to/key.pub to use key-based auth instead."
+    SSH_KEY_CONTENT=""
+fi
+
+# ──── Prepare directories ─────────────────────────────────────────
+TFTPDIR="$BASTION_DIR/tftp"
+HTTPDIR="$BASTION_DIR/http"
+mkdir -p "$TFTPDIR" "$HTTPDIR"
+
+# ──── Cleanup handler ─────────────────────────────────────────────
+DNSMASQ_PID=""
+HTTP_PID=""
+FW_OPENED=false
+
+cleanup() {
+    echo ""
+    log "Shutting down..."
+    [[ -n "$DNSMASQ_PID" ]] && kill "$DNSMASQ_PID" 2>/dev/null && log "Stopped dnsmasq"
+    [[ -n "$HTTP_PID" ]]    && kill "$HTTP_PID"    2>/dev/null && log "Stopped HTTP server"
+
+    if $FW_OPENED && command -v firewall-cmd >/dev/null; then
+        log "Removing firewall rules..."
+        firewall-cmd --quiet --remove-service=dhcp     2>/dev/null || true
+        firewall-cmd --quiet --remove-service=tftp     2>/dev/null || true
+        firewall-cmd --quiet --remove-port=${HTTP_PORT}/tcp 2>/dev/null || true
+        firewall-cmd --quiet --remove-port=4011/udp    2>/dev/null || true
+    fi
+
+    # Bring the system dnsmasq back if we stopped it earlier
+    if ${RESTART_DNSMASQ:-false}; then
+        log "Restarting system dnsmasq..."
+        systemctl start dnsmasq 2>/dev/null || warn "Could not restart system dnsmasq"
+    fi
+
+    log "Done. Bastion artifacts remain in $BASTION_DIR"
+    log "Re-run this script to reprovision. Remove with: rm -rf $BASTION_DIR"
+}
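+# Run cleanup exactly once, on exit; signals just trigger a normal exit
+trap cleanup EXIT
+trap 'exit 130' INT
+trap 'exit 143' TERM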
+
+# ──── Download artifacts (cached) ─────────────────────────────────
+download() {
+    local url="$1" dest="$2" label="$3"
+    if [[ -f "$dest" ]]; then
+        log "  ${label} — cached"
+        return
+    fi
+    log "  ${label} — downloading..."
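+    # --fail turns HTTP errors into failures; the temp file ensures a partial
+    # or failed transfer is never mistaken for a cached artifact on the next run
+    curl -# -f -L -o "${dest}.part" "$url" || { rm -f "${dest}.part"; die "Failed to download $label from $url"; }
+    mv "${dest}.part" "$dest"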
+}
+
+FEDORA_MIRROR="https://download.fedoraproject.org/pub/fedora/linux/releases/${FEDORA_VERSION}/Everything/${ARCH}/os"
+
+log "Fetching boot artifacts (Fedora ${FEDORA_VERSION} ${ARCH})..."
+download "https://boot.ipxe.org/undionly.kpxe"   "$TFTPDIR/undionly.kpxe"   "iPXE BIOS"
+download "https://boot.ipxe.org/ipxe.efi"        "$TFTPDIR/ipxe.efi"        "iPXE UEFI"
+download "${FEDORA_MIRROR}/images/pxeboot/vmlinuz"    "$HTTPDIR/vmlinuz"     "Fedora kernel"
+download "${FEDORA_MIRROR}/images/pxeboot/initrd.img" "$HTTPDIR/initrd.img"  "Fedora initrd"
+
+# ──── Generate kickstart ──────────────────────────────────────────
+log "Generating kickstart for ${BOLD}${TARGET_HOSTNAME}${NC}..."
+
+# Disk config
+if [[ -n "$TARGET_DISK" ]]; then
+    DISK_CMDS="ignoredisk --only-use=${TARGET_DISK}
+clearpart --all --initlabel --drives=${TARGET_DISK}
+autopart --type=plain"
+else
+    DISK_CMDS="clearpart --all --initlabel
+autopart --type=plain"
+fi
+
+# Auth config
+if [[ -n "$SSH_KEY_CONTENT" ]]; then
+    AUTH_CMDS="rootpw --lock
+sshkey --username=root \"${SSH_KEY_CONTENT}\""
+else
+    AUTH_CMDS='rootpw --plaintext changeme'
+fi
+
+cat > "$HTTPDIR/ks.cfg" << KICKSTART
+# Lab Bastion — Fedora ${FEDORA_VERSION} kickstart
+# Generated: $(date -Iseconds)
+# Target: ${TARGET_HOSTNAME}
+
+# Install mode
+text
+reboot
+
+# Locale
+lang ${LOCALE}
+keyboard uk
+timezone ${TIMEZONE} --utc
+
+# Network
+network --bootproto=dhcp --activate --hostname=${TARGET_HOSTNAME}
+
+# Auth
+${AUTH_CMDS}
+
+# Disk
+${DISK_CMDS}
+
+# Bootloader
+bootloader --append="console=tty0 console=ttyS0,115200n8"
+
+# Install source
+url --mirrorlist=https://mirrors.fedoraproject.org/mirrorlist?repo=fedora-\$releasever&arch=\$basearch
+
+# Packages — minimal server + essentials
+%packages
+@core
+@server-product
+openssh-server
+vim-enhanced
+tmux
+git
+curl
+python3
+dnf-plugins-core
+%end
+
+# Post-install
+%post --log=/root/bastion-post-install.log
+#!/bin/bash
+set -x
+
+# Ensure SSH is enabled
+systemctl enable --now sshd
+
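+# Key present: allow root SSH with key only (password auth disabled).
+# No key: keep password login so the 'changeme' fallback is usable over SSH.
+# (The test below is expanded when the kickstart is generated.)
+if [ -n "${SSH_KEY_CONTENT:+yes}" ]; then
+    sed -i 's/^#\?PermitRootLogin.*/PermitRootLogin prohibit-password/' /etc/ssh/sshd_config
+    sed -i 's/^#\?PasswordAuthentication.*/PasswordAuthentication no/' /etc/ssh/sshd_config
+else
+    sed -i 's/^#\?PermitRootLogin.*/PermitRootLogin yes/' /etc/ssh/sshd_config
+fi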
+
+# Set hostname
+hostnamectl set-hostname ${TARGET_HOSTNAME}
+
+# Leave a breadcrumb
+echo "Provisioned by lab-bastion on \$(date -Iseconds)" > /etc/lab-provisioned
+
+# Placeholder: puppet enrollment will go here later
+# puppet is not installed yet — this IS the puppet server
+echo "# Lab bootstrap node — puppet server setup pending" > /root/README
+
+%end
+KICKSTART
+
+log "Kickstart written to ${HTTPDIR}/ks.cfg"
+
+# ──── Generate iPXE boot script ───────────────────────────────────
+cat > "$HTTPDIR/boot.ipxe" << IPXE
+#!ipxe
+
+echo
+echo =======================================
+echo   Lab PXE Bastion — Fedora ${FEDORA_VERSION}
+echo   Target: ${TARGET_HOSTNAME}
+echo =======================================
+echo
+
+kernel http://${SERVER_IP}:${HTTP_PORT}/vmlinuz inst.ks=http://${SERVER_IP}:${HTTP_PORT}/ks.cfg inst.repo=${FEDORA_MIRROR} inst.text
+initrd http://${SERVER_IP}:${HTTP_PORT}/initrd.img
+boot
+IPXE
+
+# ──── Generate dnsmasq config ─────────────────────────────────────
+cat > "$BASTION_DIR/dnsmasq.conf" << DNSMASQ
+# Lab PXE Bastion — dnsmasq config
+# ProxyDHCP mode: adds PXE options without replacing existing DHCP
+
+# Disable DNS (we only want DHCP/TFTP)
+port=0
+
+# Listen on the right interface
+interface=${IFACE}
+bind-interfaces
+
+# ProxyDHCP — works alongside existing DHCP (UniFi etc)
+dhcp-range=${NETWORK},proxy
+
+# TFTP for initial PXE boot
+enable-tftp
+tftp-root=${TFTPDIR}
+
+# Detect client architecture
+dhcp-match=set:bios,option:client-arch,0
+dhcp-match=set:efi64,option:client-arch,7
+dhcp-match=set:efi64,option:client-arch,9
+
+# Detect iPXE clients (already chainloaded)
+dhcp-userclass=set:ipxe,iPXE
+
+# First PXE boot → serve iPXE binary via TFTP
+dhcp-boot=tag:bios,tag:!ipxe,undionly.kpxe
+dhcp-boot=tag:efi64,tag:!ipxe,ipxe.efi
+
+# iPXE clients → chain to boot script via HTTP
+dhcp-boot=tag:ipxe,http://${SERVER_IP}:${HTTP_PORT}/boot.ipxe
+
+# Verbose logging (see what's happening)
+log-dhcp
+DNSMASQ
+
+# ──── Open firewall ───────────────────────────────────────────────
+if command -v firewall-cmd >/dev/null && firewall-cmd --state >/dev/null 2>&1; then
+    log "Opening firewall ports (DHCP, TFTP, HTTP:${HTTP_PORT})..."
+    firewall-cmd --quiet --add-service=dhcp
+    firewall-cmd --quiet --add-service=tftp
+    firewall-cmd --quiet --add-port=${HTTP_PORT}/tcp
+    # ProxyDHCP uses port 4011
+    firewall-cmd --quiet --add-port=4011/udp 2>/dev/null || true
+    FW_OPENED=true
+fi
+
+# ──── Stop conflicting services ───────────────────────────────────
+# dnsmasq might be running as a system service
+if systemctl is-active --quiet dnsmasq 2>/dev/null; then
+    warn "System dnsmasq is running — stopping it temporarily"
+    systemctl stop dnsmasq
+    RESTART_DNSMASQ=true
+fi
+
+# ──── Start HTTP server ───────────────────────────────────────────
+log "Starting HTTP server on :${HTTP_PORT}..."
+(cd "$HTTPDIR" && python3 -m http.server "$HTTP_PORT" --bind 0.0.0.0 >/dev/null 2>&1) &
+HTTP_PID=$!
+sleep 0.5
+
+if ! kill -0 "$HTTP_PID" 2>/dev/null; then
+    die "HTTP server failed to start — is port ${HTTP_PORT} in use?"
+fi
+
+# ──── Start dnsmasq (proxyDHCP + TFTP) ────────────────────────────
+log "Starting PXE server (proxyDHCP on ${IFACE})..."
+echo ""
+echo -e "${CYAN}${BOLD}════════════════════════════════════════════════════════${NC}"
+echo -e "${CYAN}${BOLD}  PXE Bastion ready!${NC}"
+echo -e "${CYAN}${BOLD}════════════════════════════════════════════════════════${NC}"
+echo ""
+echo -e "  Network:    ${BOLD}${NETWORK}/24${NC} via ${BOLD}${IFACE}${NC}"
+echo -e "  HTTP:       ${BOLD}http://${SERVER_IP}:${HTTP_PORT}/${NC}"
+echo -e "  OS:         ${BOLD}Fedora ${FEDORA_VERSION} (${ARCH})${NC}"
+echo -e "  Hostname:   ${BOLD}${TARGET_HOSTNAME}${NC}"
+echo -e "  Kickstart:  ${BOLD}http://${SERVER_IP}:${HTTP_PORT}/ks.cfg${NC}"
+echo ""
+echo -e "  ${YELLOW}Now PXE-boot the target machine.${NC}"
+echo -e "  ${YELLOW}Set boot order to Network/PXE in BIOS, or use one-time boot menu.${NC}"
+echo ""
+echo -e "  Press ${BOLD}Ctrl-C${NC} to stop the bastion."
+echo ""
+echo -e "${CYAN}──── dnsmasq log (watch for DHCP/PXE requests) ────${NC}"
+echo ""
+
+# Run dnsmasq with --no-daemon so its logs stream to this terminal
+dnsmasq --no-daemon --conf-file="$BASTION_DIR/dnsmasq.conf" &
+DNSMASQ_PID=$!
+
+# Wait for dnsmasq — if it exits, something went wrong
+wait "$DNSMASQ_PID" || {
+    err "dnsmasq exited unexpectedly. Check if another DHCP/TFTP service is running."
+    err "Try: ss -ulnp | grep -E ':(67|69|4011) '"
+    exit 1
+}
--- a/config-format-research.md
+++ b/config-format-research.md
@@ -0,0 +1,121 @@
+# Configuration Format Research
+
+## Decision: PENDING — exploring alternatives to raw Kubernetes YAML
+
+## The Problem
+
+Kubernetes YAML is verbose, repetitive, lacks type safety, and forces users to specify
+every layer of concern (intent, team defaults, org standards, k8s boilerplate) in one file.
+Helm "solves" this with Go templating, which produces unreadable template spaghetti.
+
+Docker Compose is the gold standard for UX — 6 lines vs 35 for the same deployment.
+The problem was never YAML itself; it was being forced to write too much of it.
+
+## Core Design Principle
+
+Users should only define what they care about. Everything else should be inherited from
+expert-defined defaults. YAML (or JSON) can exist underneath as:
+- Easy, non-binary backup format
+- Live editing capability
+- Debugging / inspection output
+
+## Layered Architecture
+
+```
+Layer 1: User intent       "I want an api service running myapp"        ← USER WRITES THIS
+Layer 2: Team defaults     "Our services get health checks, limits"     ← Team lead defines
+Layer 3: Org standards     "All pods need security context, labels"     ← Platform team defines
+Layer 4: Output            Full YAML/JSON for kubectl, backup, debug    ← GENERATED
+```
+
+Docker Compose feels good because it's only Layer 1 — Docker handles the rest.
+Kubernetes forces all 4 layers into one file.
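
The layering can be illustrated with a small merge in which each layer overrides the one below it. This is only a sketch under stated assumptions: it uses flat `key=value` files rather than any format chosen here, it assumes user intent wins over team defaults, which win over org standards (org standards could instead be enforced as non-overridable), and `merge_layers` and the file names are hypothetical.

```shell
#!/bin/sh
# merge_layers FILE...
# Merge flat key=value files; later files override earlier ones.
# Prints the merged result sorted by key.
merge_layers() {
    awk '
        /^[ \t]*(#|$)/      { next }   # skip comments and blank lines
        index($0, "=") == 0 { next }   # skip lines without key=value
        {
            key = $0; sub(/=.*/, "", key)   # text before the first "="
            val = $0; sub(/^[^=]*=/, "", val)
            merged[key] = val
        }
        END { for (key in merged) print key "=" merged[key] }
    ' "$@" | LC_ALL=C sort
}

workdir="$(mktemp -d)"
trap 'rm -rf "$workdir"' EXIT

printf 'security_context=restricted\nreplicas=1\n' > "$workdir/org.conf"   # Layer 3
printf 'health_check=/healthz\nreplicas=2\n'      > "$workdir/team.conf"  # Layer 2
printf 'image=myapp:latest\nsize=medium\n'        > "$workdir/user.conf"  # Layer 1

# Layer 4: the generated output
merge_layers "$workdir/org.conf" "$workdir/team.conf" "$workdir/user.conf"
# health_check=/healthz
# image=myapp:latest
# replicas=2
# security_context=restricted
# size=medium
```

The user file contains only two keys, yet the output carries five. That gap is the point of the layering: Layer 1 stays short while Layers 2 and 3 supply the rest.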
+
+## Evaluated Alternatives
+
+### Tier 1 — Strong Contenders
+
+**Pkl (Apple)**
+- Best syntax for "amend a template" via `amends` keyword
+- Strong static typing, clean readable syntax
+- Lowest ceremony for simple cases
+- Risk: Apple may abandon it; JVM-based implementation (though native CLI executables are published)
+- K8s support: `pkl-k8s` package exists
+
+**KCL (CNCF Sandbox)**
+- Python-like syntax, lowest learning curve of typed options
+- Schema defaults, validation, constraints built in
+- CNCF backing gives legitimacy
+- Risk: primarily driven by Ant Group (Alibaba)
+
+**CUE**
+- Most principled — constraint-based unification, not inheritance
+- Used by Timoni (Helm replacement), KubeVela, Dagger
+- Defaults marked with `*`, types and values on same spectrum
+- Risk: steep learning curve, novel paradigm
+- Most mature K8s ecosystem of the three
+
+### Tier 2 — Viable But Weaker Fit
+
+**CDK8s+ (TypeScript)**
+- Full IDE support, strongest type safety
+- cdk8s+ has intent-driven APIs ("I want a web service" → generates Deployment+Service)
+- Risk: brings software engineering complexity into config, AWS-centric
+- Good if team is TypeScript-native
+
+**Jsonnet (via Tanka)**
+- Proven at scale (Grafana uses it across hundreds of services)
+- Object mixins via `+` operator for composition
+- Risk: weak type safety, no compile-time validation of field names
+
+### Tier 3 — Not Recommended
+
+- **Dhall** — strongest type safety but Haskell-like syntax, small/stale community
+- **Nickel** — elegant contracts system but tiny K8s ecosystem
+- **Starlark** — no type safety, no schema system, just a scripting layer
+- **HCL** — great for infra provisioning, wrong fit for k8s manifests
+
+### Dead Projects
+- **Winglang** — shut down April 2025
+- **Klotho** — archived, pivoted to InfraCopilot
+- **Acorn** — pivoted to AI agents (Obot)
+
+## Compose-Like Input Format (Preferred Direction)
+
+The user prefers Docker Compose brevity. The tool we build could use a Compose-inspired
+input format at Layer 1, generating full k8s manifests + provider-specific resources underneath:
+
+```yaml
+# What the user writes
+services:
+  api:
+    image: myapp:latest
+    size: medium
+    ports: [8080]
+    env:
+      DB_HOST: postgres
+
+# System generates: full k8s Deployment, Service, NetworkPolicy,
+# resource limits, security context, health checks, etc.
+```
+
+YAML is fine for Layer 1 if it's short enough. The problem was never the format —
+it was the verbosity. Compose proves short YAML works.
+
+## Open Questions
+
+1. Should Layer 1 input be YAML (Compose-like), or a typed language (Pkl/KCL/CUE)?
+2. How do team defaults (Layer 2) and org standards (Layer 3) get defined and distributed?
+3. Should the render view show the generated YAML diff when changing Layer 1 input?
+4. How does this integrate with the Pulumi multi-cloud abstraction layer?
+5. Could the input format support both k8s workloads AND infrastructure resources
+   (VMs, networks, storage) in the same spec?
+
+## GUI/TUI Space — Underserved Opportunity
+
+No tool has achieved significant adoption for visually *defining* infrastructure.
+Existing tools (K9s, Lens, Rancher) are for monitoring/management, not authoring.
+
+The ideal: platform engineers define schemas with constraints/defaults,
+developers interact with a form/wizard showing only fields they need,
+validated config generated underneath. Nobody has built this well yet.
--- a/crossplane-evaluation.md
+++ b/crossplane-evaluation.md
@@ -0,0 +1,106 @@
+# Crossplane Evaluation
+
+## Decision: NOT ADOPTING
+
+Crossplane will not be used in this stack. The lack of a plan/preview mechanism is a dealbreaker
+for enterprise adoption and safe infrastructure management.
+
+---
+
+## Why We Evaluated It
+
+The core problem: Terraform/OpenTofu requires re-implementing the same infrastructure concepts
+per platform (AWS, XCP-ng, bare metal). At thousands of nodes across multiple platforms, this is
+a massive maintenance burden. Crossplane's XRD/Composition model promised a unified API:
+
+```
+XRD: "VirtualMachine" (universal API)
+  ├── Composition: AWS      → EC2 instance
+  ├── Composition: XCP-ng   → XO VM
+  └── Composition: bare metal → MAAS / Ansible
+```
+
+One API, multiple backends — teams request a "VirtualMachine" and the right composition handles it.
+
+## Strengths
+
+- **CNCF Graduated** (Nov 2025, v2.2) — Apache 2.0 license, top-tier maturity
+- **Continuous drift detection** — automatically reverts manual changes, unlike Terraform's on-demand plan/apply
+- **No state file management** — no remote backends, locking issues, or state corruption
+- **Kubernetes-native** — works with ArgoCD, Flux, kubectl, RBAC out of the box
+- **XRDs/Compositions** — genuine multi-platform abstraction layer, solves the "re-implement per cloud" problem
+- **Eventual consistency** — resources with complex dependencies don't get stuck like Terraform's dependency graph
+- **Enterprise adoption** — Deutsche Kreditbank, Elastic, Nike, Apple, NASA, Grafana Labs, 60+ orgs
+- **Deutsche Kreditbank** replaced Terraform; deployments went from weeks to under one hour
+
+## Dealbreaker: No Plan/Preview
+
+The single biggest issue. Terraform's `terraform plan` lets operators see exactly what will change
+before applying. Crossplane applies changes immediately upon resource creation/modification.
+
+- Discussed in the community for 2+ years with no resolution
+- A Kubernetes-native solution would be a `Plan` CRD that shows proposed changes before approval
+- ArgoCD's `argocd app sync --dry-run` is a partial workaround, but it only shows k8s resource diffs, not what the
+  cloud provider will actually do underneath
+- **For regulated environments and SRE teams at scale, change preview is non-negotiable**
+
+Possible reasons it hasn't been implemented:
+- The continuous reconciliation architecture may make point-in-time snapshots fundamentally hard
+- Upbound (commercial entity) may be reserving it for their paid platform
+- Or simply not prioritised
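+
+To make the missing capability concrete, here is a minimal sketch of what a plan step
+computes: a comparison of desired and observed state that is shown before anything is
+applied. The `Plan` function and the flat name-to-spec-hash state model are assumptions
+for illustration; this is not a Crossplane or Terraform API.
+
+```go
+package main
+
+import (
+    "fmt"
+    "sort"
+)
+
+// Change describes one proposed action on a named resource.
+type Change struct {
+    Action string // "create", "update", or "delete"
+    Name   string
+}
+
+// Plan compares desired and observed state (resource name -> spec hash)
+// and returns the proposed changes sorted by name. It changes nothing.
+func Plan(desired, observed map[string]string) []Change {
+    var changes []Change
+    for name, spec := range desired {
+        old, exists := observed[name]
+        switch {
+        case !exists:
+            changes = append(changes, Change{"create", name})
+        case old != spec:
+            changes = append(changes, Change{"update", name})
+        }
+    }
+    for name := range observed {
+        if _, wanted := desired[name]; !wanted {
+            changes = append(changes, Change{"delete", name})
+        }
+    }
+    sort.Slice(changes, func(i, j int) bool { return changes[i].Name < changes[j].Name })
+    return changes
+}
+
+func main() {
+    desired := map[string]string{"vm-a": "v2", "vm-b": "v1"}
+    observed := map[string]string{"vm-a": "v1", "vm-c": "v1"}
+    for _, c := range Plan(desired, observed) {
+        fmt.Printf("%-6s %s\n", c.Action, c.Name)
+    }
+}
+```
+
+The hard part for a continuously reconciling system is not this comparison but taking
+a stable snapshot of `observed` and holding the apply until someone approves the result.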
+
+## Other Significant Concerns
+
+### CRD Bloat
+- `provider-aws` installs 900+ CRDs — can make API server unresponsive for up to an hour (GitHub #2649)
+- Exceeds Kubernetes' recommended ~500 CRD limit
+- Mitigated by "Provider Families" (install per-service sub-providers) but requires careful planning
+
+### Debugging Difficulty
+- Errors propagate through layers: Claim → XR → Composition → Managed Resource → Provider → Cloud API
+- Multiple sources report debugging compositions is painful
+- Pipeline Inspector (alpha in v2.2) is being introduced but not production-ready
+
+### Chicken-and-Egg Problem
+- Crossplane runs inside Kubernetes — cannot provision the cluster it runs on
+- Requires a "management cluster" bootstrapped by other means (Terraform, Puppet, etc.)
+- If the management cluster dies, no drift detection or reconciliation runs
+- Recovery: applying YAMLs to a new cluster works if deterministic resource names are used,
+  otherwise risks creating duplicate cloud resources
+
+### Cluster Loss / Immutability Concerns
+- State lives in etcd, not a versionable state file
+- No independent audit trail or easy way to diff historical states
+- On new cluster: resources with explicit external names get adopted; auto-named resources get duplicated
+- Need etcd backups as insurance, and deterministic naming everywhere
+
+### Performance at Scale
+- ~2000 composites took 6+ minutes to reconcile on k3d (GitHub #2256)
+- Reconciliation interval not easily configurable globally (GitHub #5934)
+
+### YAML Limitations
+- No native loops, conditionals, or programming constructs
+- Complex compositions require changes in multiple locations
+
+## XCP-ng Provider Gap
+
+- No Crossplane provider for XCP-ng exists today
+- A mature Terraform provider (`terraform-provider-xenorchestra`) exists, maintained by Vates
+- Could be wrapped via Upjet to auto-generate a Crossplane provider — but nobody has done it
+- Would be a greenfield open-source project
+
+## Real Issues Reported
+
+- API server unresponsiveness with too many CRDs (GitHub #2649)
+- CRD scaling issues beyond ~500 CRDs (GitHub #2895)
+- GCP SQL resources randomly marked for deletion — dangerous for production databases
+- Reconciliation rate limiting at scale (GitHub #2256)
+
+## Conclusion
+
+Crossplane solves a real problem (multi-platform abstraction) that we need, but the lack of
+plan/preview makes it unsuitable for enterprise-scale production infrastructure management.
+The operational concerns (CRD bloat, debugging, cluster dependency) add further risk.
+
+We need to find an alternative approach to the multi-platform abstraction problem that Crossplane
+solves, while retaining plan/preview capabilities.
--- a/dns-research.md
+++ b/dns-research.md
@@ -0,0 +1,143 @@
+# DNS Solution Research
+
+## Decision: PowerDNS Authoritative + ExternalDNS
+
+### Why PowerDNS
+
+| Feature | PowerDNS | CoreDNS | BIND9 | Technitium |
+|---------|----------|---------|-------|------------|
+| REST API | Full | No (needs etcd) | No (nsupdate) | Yes |
+| Database backend | PostgreSQL/MySQL/SQLite | etcd | Zone files | Custom |
+| Health-aware DNS | Lua records (ifportup, ifurlup) | No | No | No |
+| ExternalDNS provider | Yes | Yes (via etcd) | Yes (RFC 2136) | No |
+| DNSSEC | Yes | Limited | Best | Yes |
+| Split DNS | dnsdist routing | Corefile blocks | Views (best) | APP records |
+| Maturity | ISP-grade | K8s-focused | Oldest | Newer |
+
+PowerDNS wins on: REST API (critical for Lab), health-check-aware Lua records,
+database backend for HA, and ExternalDNS integration.
+
+### Architecture
+
+```
+                    Lab Server
+                    (control plane)
+                        │
+                        │ PowerDNS REST API
+                        ▼
+                ┌───────────────┐
+                │  PowerDNS     │
+                │  Authoritative│──── PostgreSQL/SQLite backend
+                │  Server       │
+                └───────┬───────┘
+                        │
+            ┌───────────┼───────────┐
+            │           │           │
+            ▼           ▼           ▼
+      Internal DNS   ExternalDNS   dnsdist
+      .lab.internal  (k8s syncs    (split DNS
+                      Services/    routing)
+                      Ingress)
+```
+
+### How Lab Uses DNS
+
+#### Auto-registration on onboard
+When `lab onboard` completes, Lab calls PowerDNS API:
+- A record: `<server>.lab.internal → <ip>`
+- PTR record: `<reverse-ip>.in-addr.arpa → <server>.lab.internal`
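+- Both are created or updated together (forward and reverse zones are separate, so expect one API call per zone rather than a single atomic change)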
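+
+A minimal sketch of deriving the two record names for one onboarded server. The
+`RecordPair` type and `BuildRecords` function are hypothetical names, IPv4 only; the
+call that would send these to PowerDNS is left out.
+
+```go
+package main
+
+import (
+    "fmt"
+    "net/netip"
+)
+
+// RecordPair holds the forward and reverse record names for one server.
+type RecordPair struct {
+    AName   string // e.g. web01.lab.internal
+    ATarget string // the IPv4 address the A record points to
+    PTRName string // e.g. 5.1.0.10.in-addr.arpa, pointing back at AName
+}
+
+// BuildRecords derives the A and PTR record names for an IPv4 address.
+func BuildRecords(server, ip string) (RecordPair, error) {
+    addr, err := netip.ParseAddr(ip)
+    if err != nil {
+        return RecordPair{}, err
+    }
+    if !addr.Is4() {
+        return RecordPair{}, fmt.Errorf("only IPv4 is handled in this sketch: %s", ip)
+    }
+    o := addr.As4()
+    // The PTR name is the four octets in reverse order under in-addr.arpa.
+    ptr := fmt.Sprintf("%d.%d.%d.%d.in-addr.arpa", o[3], o[2], o[1], o[0])
+    return RecordPair{
+        AName:   server + ".lab.internal",
+        ATarget: addr.String(),
+        PTRName: ptr,
+    }, nil
+}
+
+func main() {
+    pair, err := BuildRecords("web01", "10.0.1.5")
+    if err != nil {
+        fmt.Println("error:", err)
+        return
+    }
+    fmt.Printf("A   %s -> %s\n", pair.AName, pair.ATarget)
+    fmt.Printf("PTR %s -> %s\n", pair.PTRName, pair.AName)
+}
+```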
+
+#### Domain claims via labels
+Labels can claim shared domain names:
+```yaml
+labels:
+  mailserver:
+    dns:
+      records:
+        - type: A
+          name: "{{server.name}}.lab.internal"
+      claims:
+        - name: mail.example.com
+          type: A
+          health_check: { port: 25 }
+```
+All servers with label `mailserver` contribute to `mail.example.com` round-robin.
+PowerDNS Lua records remove unhealthy servers automatically.
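+
+A sketch of how a claim with `health_check: { port: 25 }` could be rendered as the
+content of a PowerDNS LUA record using `ifportup`. The `LuaIfPortUp` helper is a
+hypothetical name, and how Lab submits the record is not shown. LUA records also have
+to be enabled on the PowerDNS server before they are served.
+
+```go
+package main
+
+import (
+    "fmt"
+    "sort"
+    "strings"
+)
+
+// LuaIfPortUp renders LUA record content that answers only with the
+// addresses whose TCP port currently accepts connections.
+func LuaIfPortUp(port int, ips []string) string {
+    sorted := append([]string(nil), ips...) // copy so the caller's slice is untouched
+    sort.Strings(sorted)                    // stable content avoids needless record updates
+    quoted := make([]string, len(sorted))
+    for i, ip := range sorted {
+        quoted[i] = "'" + ip + "'"
+    }
+    return fmt.Sprintf(`A "ifportup(%d, {%s})"`, port, strings.Join(quoted, ", "))
+}
+
+func main() {
+    // Addresses of every server currently carrying the "mailserver" label.
+    members := []string{"10.0.0.12", "10.0.0.11"}
+    fmt.Println("mail.example.com LUA", LuaIfPortUp(25, members))
+}
+```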
+
+#### IP mobility
+Lab agent on machine reports IP change → Lab server updates PowerDNS API →
+A record, PTR, and all claimed domains updated.
+
+#### K8s integration
+ExternalDNS runs in k8s, syncs Service/Ingress records to same PowerDNS instance.
+Same DNS server serves both bare metal and k8s records.
+
+#### Groups claiming domains
+Groups can claim domains for all member servers:
+```yaml
+groups:
+  production-web:
+    match:
+      labels: [web-frontend]
+      environment: prod
+    dns:
+      claims:
+        - name: www.example.com
+          type: A
+          health_check: { url: "https://{{server.ip}}/healthz" }
+```
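+
+A minimal sketch of resolving group membership from the `match` block. It assumes that
+every listed label must be present and that an empty environment matches any server;
+the spec does not state these semantics, and the type names are hypothetical.
+
+```go
+package main
+
+import "fmt"
+
+// Server is the subset of server state needed for matching.
+type Server struct {
+    Name        string
+    Labels      []string
+    Environment string
+}
+
+// GroupMatch mirrors the `match` block of a group definition.
+type GroupMatch struct {
+    Labels      []string // assumed AND semantics: all must be present
+    Environment string   // empty means any environment
+}
+
+// Matches reports whether a server satisfies every condition of the group.
+func (m GroupMatch) Matches(s Server) bool {
+    if m.Environment != "" && m.Environment != s.Environment {
+        return false
+    }
+    have := make(map[string]bool, len(s.Labels))
+    for _, l := range s.Labels {
+        have[l] = true
+    }
+    for _, want := range m.Labels {
+        if !have[want] {
+            return false
+        }
+    }
+    return true
+}
+
+// Members returns the names of all servers that belong to the group.
+func Members(m GroupMatch, servers []Server) []string {
+    var names []string
+    for _, s := range servers {
+        if m.Matches(s) {
+            names = append(names, s.Name)
+        }
+    }
+    return names
+}
+
+func main() {
+    group := GroupMatch{Labels: []string{"web-frontend"}, Environment: "prod"}
+    servers := []Server{
+        {Name: "web01", Labels: []string{"web-frontend"}, Environment: "prod"},
+        {Name: "web02", Labels: []string{"web-frontend"}, Environment: "staging"},
+    }
+    fmt.Println("www.example.com members:", Members(group, servers))
+}
+```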
+
+### DNS Plugin Interface
+
+```go
+type DNSPlugin interface {
+    Name() string
+
+    // Record management
+    CreateRecord(zone, name, recordType string, targets []string, ttl int) error
+    UpdateRecord(zone, name, recordType string, targets []string, ttl int) error
+    DeleteRecord(zone, name, recordType string) error
+    ListRecords(zone string) ([]Record, error)
+
+    // Health-checked records
+    CreateHealthCheckedRecord(zone, name string, targets []string, check HealthCheck) error
+
+    // Zone management
+    CreateZone(name string, kind string) error
+    DeleteZone(name string) error
+}
+```
+
+Built-in:
+- `dns-powerdns` — PowerDNS REST API (primary)
+- `dns-route53` — AWS Route53 (for cloud deployments)
+- `dns-rfc2136` — RFC 2136 dynamic updates (BIND/Knot fallback)
+
+### Split DNS Setup
+
+Internal zones (`.lab.internal`) served by PowerDNS authoritatively.
+External queries forwarded upstream (8.8.8.8, ISP DNS).
+
+Options:
+- **dnsdist** (PowerDNS ecosystem) routes by source subnet
+- **CoreDNS as resolver** — serves internal from PowerDNS, forwards external
+- **BIND views** — if we need view-based split on same zone (unlikely)
+
+### Evaluated and Not Chosen
+
+| Tool | Why Not |
+|------|---------|
+| CoreDNS | No REST API, needs etcd intermediary, k8s-focused |
+| BIND9 | No REST API, nsupdate is cumbersome for automation |
+| Technitium | No ExternalDNS provider, newer/smaller community |
+| dnsmasq | Not suitable — caching forwarder, no API, ~1000 client limit |
+| Knot DNS | No REST API, better as secondary/downstream |
+
+### DNS-as-Code (Optional Layer)
+
+For static DNS infrastructure (SOA, NS, MX, base zone config):
+- **octoDNS** (GitHub) or **DNSControl** (Stack Exchange)
+- GitOps workflow: PR → review → merge → sync to PowerDNS
+- Dynamic records (server A records, claims) managed by Lab directly via API
+- Static records managed via DNS-as-code in Git
--- a/hardware.md
+++ b/hardware.md
@@ -0,0 +1,37 @@
+# Homelab Hardware Inventory
+
+## Compute Nodes
+
+| Node | CPU Arch | RAM | Role | Cost |
+|------|----------|-----|------|------|
+| Beelink SER9 MAX | x86_64 | 64GB | k3s worker, ROCm GPU, Longhorn storage | ~£869 |
+| Beelink SER9 Pro | x86_64 | 32GB | Bootstrap: Puppet, DNS, UniFi, Vault, Naemon | ~£300 |
+| Minisforum MS-R1 | ARM (aarch64) | 64GB | k3s node | ~£500-640 |
+| Nvidia DGX Spark | ARM (Grace) | 128GB | CUDA/AI inference | ~£3,700 |
+| Mac Studio M1 Max | ARM (aarch64) | 32GB | k3s server #1 (etcd) | ~£775 |
+
+## Networking
+
+| Device | Specs | Cost |
+|--------|-------|------|
+| USW-Flex-XG x2 | 8x 10GbE ports total (4 per switch) | £458 |
+
+## Summary
+
+- **Total RAM:** 320GB
+- **Architectures:** x86_64, aarch64 (Apple Silicon + ARM + Grace)
+- **GPU compute:** ROCm (SER9 MAX), CUDA (DGX Spark)
+- **Estimated total:** ~£6,600-6,740
+
+## Node Roles
+
+### Bootstrap Node (Beelink SER9 Pro) — Outside k3s
+- Puppet (bare metal config management)
+- DNS (CoreDNS or PowerDNS)
+- UniFi controller
+- Vault (secrets management)
+- Naemon (bare metal, network, black-box endpoint monitoring)
+
+### k3s Cluster
+- **Server (control plane + etcd):** Mac Studio M1 Max
+- **Workers:** Beelink SER9 MAX, Minisforum MS-R1, DGX Spark
--- a/kubernetes-flavors.md
+++ b/kubernetes-flavors.md
@@ -0,0 +1,120 @@
+# Kubernetes Flavor Decision
+
+## Decision: k3s (confirmed)
+
+k3s is the best fit for Lab. OpenShift and most other flavors conflict with
+the puppet-managed, multi-arch, lightweight approach.
+
+## Evaluation
+
+| Flavor | Puppet-Friendly | ARM | Multi-arch | Enterprise | License | Verdict |
+|--------|:-:|:-:|:-:|:-:|---------|---------|
+| **k3s** | ✓ binary + config files | ✓ | ✓ | Rancher/SUSE | Apache 2.0 | **CHOSEN** |
+| **k0s** | ✓ single binary, config-driven | ✓ | ✓ | Mirantis | Apache 2.0 | Good alternative |
+| **kubeadm** | ✓ well-understood bootstrap | ✓ | ✓ | Upstream K8s | Apache 2.0 | Viable but heavier |
+| **RKE2** | ✓ config files | ✓ | ✓ | Rancher/SUSE | Apache 2.0 | Heavier k3s |
+| **OpenShift** | ✗ operator-driven, fights puppet | ✗ limited | ✗ limited | Red Hat | Proprietary | REJECTED |
+| **MicroK8s** | ⚠ snap-based, puppet+snaps awkward | ✓ | ✓ | Canonical | Apache 2.0 | Not great |
+| **Talos** | ✗ immutable OS, no SSH, no puppet | ✓ | ✓ | Sidero Labs | MPL 2.0 | Incompatible |
+
+## Why NOT OpenShift — Deep Analysis
+
+### OpenShift Does Overlap With Lab
+
+OpenShift is the closest existing thing to what Lab does. The overlap is real:
+
+| Capability | OpenShift | Lab |
+|-----------|-----------|-----|
+| Manages nodes end-to-end | Yes (RHCOS + MCO) | Yes (OpenVox + labels) |
+| Immutable infrastructure | Yes (rpm-ostree, operator-driven) | Yes (puppet convergence) |
+| Fights config drift | Yes (operators reconcile) | Yes (puppet + sync pillar) |
+| Built-in monitoring | Yes (Prometheus + Alertmanager bundled) | Yes (health aggregator) |
+| Built-in secrets | Yes (etcd-encrypted secrets) | Yes (secret store + local cache) |
+| Certificate management | Yes (internal CA, auto-rotation) | Yes (identity layer) |
+| Node lifecycle | Yes (MachineSet, MachinePools) | Yes (onboard, labels, providers) |
+| Self-managing | Yes (operators update themselves) | Yes (lab manages itself) |
+
+### Why OpenShift Still Doesn't Fit
+
+**1. Single OS** — OpenShift control plane = RHCOS only. Can't run on Apple Silicon,
+   Asahi Linux, or any non-RHCOS system. Lab needs Ubuntu, Debian, Fedora, AlmaLinux,
+   XCP-ng, VyOS across x86 and ARM.
+
+**2. K8s only** — OpenShift manages k8s nodes. Lab manages everything: k8s nodes,
+   standalone VMs, bare metal hypervisors, network appliances, physical servers that
+   will never run k8s. Not everything is a container.
+
+**3. Single cluster scope** — OpenShift manages one cluster. Lab manages homelab k3s +
+   enterprise AWS EKS + XCP-ng hypervisors + bare metal + OVH vRack. Cross-provider,
+   cross-cluster.
+
+**4. Fights puppet** — OpenShift has ~30+ operators that each own a piece of the system.
+   If puppet changes kubelet config, the Machine Config Operator detects "drift" and
+   reverts it. Two reconciliation loops fighting each other, possibly rebooting nodes
+   in a loop. You're supposed to change everything via CRDs, not external tools.
+
+**5. No XCP-ng/hypervisor management** — Can't provision VMs on XCP-ng, manage Xen
+   hosts, or understand hypervisors that aren't VMware/OpenStack.
+
+**6. Throws away puppet modules** — Company has existing puppet modules. OpenShift's
+   model is operators, not puppet. Complete rewrite of config management.
+
+**7. Heavyweight** — Minimum 6 nodes, 88GB RAM just for the platform. k3s uses 512MB.
+   Our entire homelab is 5 nodes, 320GB RAM.
+
+**8. ARM limited** — RHCOS on Apple Silicon doesn't exist. ARM support is limited to
+   AWS Graviton and some server ARM platforms.
+
+### The Scope Difference
+
+```
+OpenShift:  "I am your platform. Everything runs in me. I control the OS."
+            Scope: Kubernetes cluster + its nodes
+
+Lab:        "I manage your infrastructure. K8s is one thing I deploy."
+            Scope: Everything — VMs, bare metal, hypervisors, k8s,
+                   network gear, containers, across any provider
+```
+
+Lab is closer to what OpenShift + Satellite + RHCOS + ACM (Advanced Cluster Management)
+do **together** — but unified, lighter, open source, and not locked to Red Hat's ecosystem.
+
+## Why k3s
+
+- **Puppet-friendly** — it's just a binary and config files in `/etc/rancher/k3s/`
+- **Ultra-light** — runs on Mac Studio, ARM boxes, small VMs
+- **Multi-arch** — native x86 and ARM
+- **Same K8s API** as EKS/GKE — portable to cloud
+- **Single binary** — trivial to manage with puppet
+- **Proven** — CNCF certified, widely used in edge/IoT/homelab
+
+## k3s via Puppet (OpenVox)
+
+```puppet
+# Label: k8s-server → puppet class
+class kubernetes::server {
+  class { 'k3s::server':
+    token        => lab::secret('k8s/cluster-token'),
+    cluster_init => true,
+    tls_san      => [$facts['fqdn'], 'k8s.lab.internal'],
+  }
+}
+
+# Label: k8s-worker → puppet class
+class kubernetes::worker {
+  class { 'k3s::worker':
+    server_url => 'https://k8s.lab.internal:6443',
+    token      => lab::secret('k8s/cluster-token'),
+  }
+}
+```
+
+Same puppet classes work on bare metal, XCP-ng VM, EC2 instance, any architecture.
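+
+A sketch of the label-to-class lookup implied by the comments above, for example as the
+core of an external node classifier. The `labelClasses` table and `ClassesFor` function
+are hypothetical; only the two mappings shown in the Puppet snippet come from this document.
+
+```go
+package main
+
+import (
+    "fmt"
+    "sort"
+)
+
+// labelClasses maps Lab labels to Puppet classes (only these two are documented).
+var labelClasses = map[string]string{
+    "k8s-server": "kubernetes::server",
+    "k8s-worker": "kubernetes::worker",
+}
+
+// ClassesFor returns the Puppet classes for a server's labels, plus any
+// labels that have no class so the caller can decide how to handle them.
+func ClassesFor(labels []string) (classes, unknown []string) {
+    for _, l := range labels {
+        if c, ok := labelClasses[l]; ok {
+            classes = append(classes, c)
+        } else {
+            unknown = append(unknown, l)
+        }
+    }
+    sort.Strings(classes)
+    sort.Strings(unknown)
+    return classes, unknown
+}
+
+func main() {
+    classes, unknown := ClassesFor([]string{"k8s-worker", "gpu"})
+    fmt.Println("classes:", classes)
+    fmt.Println("labels without a class:", unknown)
+}
+```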
+
+## k0s as Backup Option
+
+If k3s ever becomes problematic, k0s is the closest alternative:
+- Also single binary, config-driven, multi-arch
+- `k0sctl` adds cluster management (bootstrap, upgrade, reset)
+- Mirantis backing (Lens, Docker EE)
+- Worth monitoring but no reason to switch from k3s today
--- a/lab-tool-spec.md
+++ b/lab-tool-spec.md
--- a/os-install-research.md
+++ b/os-install-research.md
@@ -0,0 +1,356 @@
+# OS Installation Research
+
+## Target Operating Systems
+
+All must support unattended network installation and automated OpenVox enrollment.
+All must work across multiple CPU architectures where the OS supports it.
+
+| OS | Install System | Answer Format | Architectures | PXE Difficulty |
+|-----|---------------|--------------|---------------|---------------|
+| Ubuntu 24.04 | autoinstall (cloud-init) | YAML | x86_64, aarch64, RISC-V | Easy |
+| Debian 12 | preseed | preseed.cfg | x86_64, aarch64, many others | Medium |
+| Fedora 41+ | Anaconda/kickstart | .ks file | x86_64, aarch64 | Easy |
+| AlmaLinux 9 | Anaconda/kickstart | .ks file | x86_64, aarch64 | Easy |
+| XCP-ng 8.3 | Custom Python TUI | XML answer file | x86_64 only | HARD |
+| VyOS 1.4 | Custom installer | config.boot | x86_64, aarch64 | Medium |
+
+## XCP-ng Network Install — Known Hard
+
+### Why it's difficult
+- iPXE UEFI is fundamentally broken (open bug, multiboot module corruption)
+- Serial/headless install hangs after detecting storage — no fix
+- No VNC installer mode (unlike RHEL/Debian)
+- TFTP agonizingly slow for large install.img
+- Custom Python TUI designed for VGA console, not automation
+- No major provisioning tool has first-class XCP-ng support
+
+### What works
+- **BIOS PXE** more reliable than UEFI
+- **IPMI virtual media** with remastered ISO is most reliable
+- Answer file XML with `<post-install-script>` and `<script stage="filesystem-populated">`
+- Post-install puppet enrollment via `/etc/firstboot.d/` scripts
+- XCP-ng enables SSH by default after install
+
+### Answer file format (XML, custom to XenServer/XCP-ng)
+```xml
+<?xml version="1.0"?>
+<installation mode="fresh" srtype="ext">
+    <primary-disk>sda</primary-disk>
+    <keymap>us</keymap>
+    <root-password type="hash">$6$...</root-password>
+    <source type="url">http://server/xcp-ng/</source>
+    <admin-interface name="eth0" proto="dhcp" />
+    <hostname>xcphost01</hostname>
+    <timezone>Europe/London</timezone>
+    <ntp-server>pool.ntp.org</ntp-server>
+    <network-backend>openvswitch</network-backend>
+    <post-install-script type="url">http://server/scripts/post-install.sh</post-install-script>
+    <script stage="filesystem-populated" type="url">http://server/scripts/fs-setup.sh</script>
+</installation>
+```
+
+### Post-install puppet enrollment
+The `filesystem-populated` stage script drops a firstboot script:
+```bash
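+#!/bin/bash
+set -euo pipefail
+# $1 is taken to be the mount point of the freshly installed root filesystem
+MOUNT="$1"
+mkdir -p "$MOUNT/etc/firstboot.d"
+cat > "$MOUNT/etc/firstboot.d/99-lab-enroll" << 'SCRIPT'
+#!/bin/bash
+# Install puppet agent (XCP-ng is CentOS-based, yum works)
+yum install -y puppet-agent
+# Configure and start; use the full path because /opt/puppetlabs/bin
+# is not on PATH until a new login shell
+/opt/puppetlabs/bin/puppet config set server puppet.lab.internal
+systemctl enable --now puppet
+SCRIPT
+chmod +x "$MOUNT/etc/firstboot.d/99-lab-enroll"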
+```
+
+## Lab Install Profile Abstraction
+
+Lab needs an `InstallerPlugin` interface so the same `lab onboard` command works
+for all OS types. Each plugin handles answer file generation, PXE chain setup,
+and post-install enrollment for its OS type.
+
+```go
+type InstallerPlugin interface {
+    Name() string
+    SupportedArchitectures() []string
+
+    // Generate the answer/config file for unattended install
+    GenerateAnswerFile(config InstallConfig) ([]byte, error)
+
+    // Set up PXE boot artifacts (kernel, initrd, bootloader configs)
+    PreparePXE(config PXEConfig) error
+
+    // Generate post-install enrollment script
+    GenerateEnrollmentScript(token string, labels []string) ([]byte, error)
+}
+```
+
+Built-in installer plugins:
+- `installer-autoinstall` — Ubuntu (cloud-init based autoinstall YAML)
+- `installer-kickstart` — Fedora, AlmaLinux, RHEL (kickstart .ks files)
+- `installer-preseed` — Debian (preseed.cfg)
+- `installer-xcpng` — XCP-ng (custom XML + firstboot.d scripts)
+- `installer-vyos` — VyOS (config.boot)
+
+## Auto-Onboard Rules
+
+Automatic onboarding based on detected hardware characteristics:
+
+```yaml
+auto-onboard:
+  rules:
+    - name: large-compute-to-xcpng
+      conditions:
+        cores: ">= 40"
+        memory: ">= 500GB"
+        provider: ovh
+      action:
+        image: xcpng-8.3
+        labels: [xen-host, production]
+
+    - name: arm-to-ubuntu
+      conditions:
+        arch: aarch64
+      action:
+        image: ubuntu-24.04
+        labels: [arm, k8s-worker]
+```
+
+Must support:
+- Preview: show which existing servers match/don't match rules
+- Dry-run: show what would happen for pending servers
+- Apply: actually onboard matching servers
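+
+A minimal sketch of the preview step: evaluate the rules against detected hardware and
+report which image would be chosen, without changing anything. The types, the
+first-match-wins order, and the restriction to `>=` conditions are assumptions.
+
+```go
+package main
+
+import (
+    "fmt"
+    "strconv"
+    "strings"
+)
+
+// Hardware is what the agent is assumed to report about a machine.
+type Hardware struct {
+    Name     string
+    Cores    int
+    MemoryGB int
+    Arch     string
+    Provider string
+}
+
+// Rule mirrors one auto-onboard rule; empty fields mean "no constraint".
+type Rule struct {
+    Name     string
+    Cores    string // e.g. ">= 40"
+    Memory   string // e.g. ">= 500GB"
+    Arch     string
+    Provider string
+    Image    string
+}
+
+// atLeast evaluates a ">= N" expression with an optional unit suffix.
+func atLeast(expr, unit string, value int) (bool, error) {
+    if expr == "" {
+        return true, nil
+    }
+    if !strings.HasPrefix(expr, ">=") {
+        return false, fmt.Errorf("unsupported condition %q", expr)
+    }
+    num := strings.TrimSuffix(strings.TrimSpace(strings.TrimPrefix(expr, ">=")), unit)
+    threshold, err := strconv.Atoi(num)
+    if err != nil {
+        return false, fmt.Errorf("bad number in %q: %w", expr, err)
+    }
+    return value >= threshold, nil
+}
+
+// Matches reports whether the hardware satisfies every condition of the rule.
+func (r Rule) Matches(h Hardware) (bool, error) {
+    okCores, err := atLeast(r.Cores, "", h.Cores)
+    if err != nil {
+        return false, err
+    }
+    okMem, err := atLeast(r.Memory, "GB", h.MemoryGB)
+    if err != nil {
+        return false, err
+    }
+    okArch := r.Arch == "" || r.Arch == h.Arch
+    okProvider := r.Provider == "" || r.Provider == h.Provider
+    return okCores && okMem && okArch && okProvider, nil
+}
+
+// Preview returns the image of the first matching rule, or "" if none match.
+func Preview(rules []Rule, h Hardware) (string, error) {
+    for _, r := range rules {
+        ok, err := r.Matches(h)
+        if err != nil {
+            return "", fmt.Errorf("rule %s: %w", r.Name, err)
+        }
+        if ok {
+            return r.Image, nil
+        }
+    }
+    return "", nil
+}
+
+func main() {
+    rules := []Rule{
+        {Name: "large-compute-to-xcpng", Cores: ">= 40", Memory: ">= 500GB", Provider: "ovh", Image: "xcpng-8.3"},
+        {Name: "arm-to-ubuntu", Arch: "aarch64", Image: "ubuntu-24.04"},
+    }
+    h := Hardware{Name: "ovh-01", Cores: 64, MemoryGB: 512, Arch: "x86_64", Provider: "ovh"}
+    image, err := Preview(rules, h)
+    fmt.Println(h.Name, "->", image, err)
+}
+```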
+
+## Deployment Approach: Universal PXE Agent + Rootfs Images
+
+### Decision: NOT using native installers
+
+Instead of dealing with 5 different installer formats (autoinstall, kickstart, preseed,
+XCP-ng XML, VyOS config) across 6 target operating systems, Lab uses a universal approach:
+
+1. PXE boot ONE agent OS (same for all target distros)
+2. Agent contacts Lab server, gets instructions
+3. Agent partitions disk, deploys rootfs tarball, injects config, reboots
+4. Target OS boots with lab-agent, enrolls with OpenVox
+
+This avoids the nightmare of maintaining 5 installer plugins × 3 architectures.
+
+### Tool Evaluation
+
+| Tool | What It Does | For Lab? |
+|------|-------------|----------|
+| **Tinkerbell (CNCF)** | PXE → HookOS agent → workflow actions (partition, deploy, inject) | **Best candidate to wrap** |
+| **LinuxKit** | Build minimal agent OS (used by Tinkerbell's HookOS) | Build our PXE agent |
+| **mkosi** | Build rootfs tarballs for any distro (Fedora, Ubuntu, Debian, etc.) | **Image production** |
+| **iPXE** | Universal PXE bootloader with scripting | PXE foundation |
+| **Pixiecore** | Simple Go PXE server with per-MAC API mode | PXE building block |
+| **bootc** | Bootable OCI containers → install to disk (RHEL-family) | Image format option |
+| **cloud-init** | First-boot config injection | Post-deploy config |
+| **Packer** | Build VM/machine images | Golden image building |
+| **MAAS/Curtin** | Production-grade, same pattern, but Ubuntu-centric + heavy | Too opinionated |
+| **Warewulf** | Stateless/diskless boot from container images | Wrong model (RAM-only) |
+| **Kairos** | Immutable k8s-focused OS from containers | Too opinionated |
+| **FOG/Clonezilla** | Block-level disk cloning | Too rigid |
+| **FAI** | Debian-centric installer framework | Too narrow |
+| **Razor (Puppet)** | Puppet's bare-metal provisioning tool (archived 2019) | Dead |
+| **netboot.xyz** | PXE boot menu into native installers | Opposite of what we want |
+
+### Tinkerbell — Closest Match
+
+Tinkerbell already implements this pattern:
+- **HookOS**: minimal agent OS built with LinuxKit, boots via PXE, multi-arch (x86 + ARM)
+- **Tink Worker**: runs inside HookOS, contacts server via gRPC, executes workflows
+- **Workflow Actions**:
+  - `rootio` — partition disks, create filesystems
+  - `archive2disk` — stream compressed rootfs tarball to mounted filesystem
+  - `image2disk` — write raw disk image (dd-style)
+  - `oci2disk` — pull OCI container image, write to disk
+  - `writefile` — write individual files (puppet certs, config, enrollment token)
+  - `cexec` — chroot and run commands (install bootloader, etc.)
+  - `kexec` — kexec into new kernel (avoids reboot)
+
+**Tinkerbell's limitation:** requires Kubernetes to run (Tink Server is k8s-native).
+Options:
+- Run on bootstrap node's k3s (works but adds k3s dependency before we have k3s)
+- Extract just HookOS + actions, replace Tink Server with Lab's own API
+- Use Tinkerbell after initial bootstrap
+
+### Option A: Wrap Tinkerbell
+Use Tinkerbell's HookOS and actions, Lab translates `lab onboard` into Tinkerbell
+workflows. Proven, multi-arch, battle-tested by Equinix Metal.
+
+### Option B: Build our own lightweight agent
+If Tinkerbell's k8s dependency is too heavy:
+- Build agent OS with LinuxKit (like HookOS but simpler)
+- Small Go binary as the agent: contacts lab-server, gets instructions, partitions,
+  deploys rootfs, injects files, installs bootloader, reboots
+- Embedded in Lab binary — no k8s dependency
+- Essentially "Tinkerbell actions without Tinkerbell's workflow engine"
+
+### Decision: TBD — needs hands-on evaluation of Tinkerbell
+
+### VyOS Inspiration
+
+VyOS proves this pattern works:
+- Image-based install (rootfs deployed to partition)
+- Also runs as Docker container (same config system)
+- Same concept as Lab: one definition → VM image, bare metal, or container
+
+### Image Production Pipeline
+
+Lab needs to produce rootfs tarballs for each OS × architecture:
+
+```
+$ lab image build ubuntu-24.04 --arch x86_64,aarch64
+  → Uses mkosi or debootstrap to build rootfs
+  → Injects lab-agent, cloud-init datasource
+  → Produces: ubuntu-24.04-x86_64.tar.gz, ubuntu-24.04-aarch64.tar.gz
+
+$ lab image build xcpng-8.3 --arch x86_64
+  → Extract/capture rootfs from XCP-ng installer/installed system
+  → Produces: xcpng-8.3-x86_64.tar.gz
+
+$ lab image list
+IMAGE              ARCH              SIZE      BUILT
+ubuntu-24.04       x86_64, aarch64   850MB     2026-03-15
+debian-12          x86_64, aarch64   620MB     2026-03-14
+fedora-41          x86_64, aarch64   920MB     2026-03-14
+almalinux-9        x86_64, aarch64   780MB     2026-03-13
+xcpng-8.3          x86_64            1.2GB     2026-03-10
+vyos-1.4           x86_64, aarch64   450MB     2026-03-12
+```
+
+Image build tools per OS:
+- Ubuntu/Debian: debootstrap or mkosi
+- Fedora/AlmaLinux: dnf --installroot or mkosi
+- XCP-ng: install in QEMU + Packer, capture rootfs (only viable method)
+- VyOS: extract squashfs from ISO (`unsquashfs /mnt/live/filesystem.squashfs`)
+- Asahi Linux: NOT BUILDABLE — SSH onboard only, OS already installed by user
+
+## XCP-ng Rootfs Production — Detailed
+
+### Why package-based build doesn't work
+- `install.img` is the installer ramdisk, NOT the target system
+- The installer (`host-installer/backend.py`) does post-install XAPI setup that
+  can't be replicated with just yum --installroot
+- Nobody has successfully built XCP-ng from packages alone
+- `create-install-image` scripts only produce ISOs
+
+### Viable approach: Packer + QEMU capture
+```
+1. Boot XCP-ng ISO in QEMU with answerfile (unattended)
+2. Installer runs normally, does all XAPI/Xen setup
+3. Mount resulting disk image
+4. Tar up root partition
+5. Generalize: remove SSH keys, XAPI state.db, hostname, UUIDs, persistent net rules
+6. Output: xcpng-8.3-x86_64.tar.gz
+```
+
+### XCP-ng partition layout (PXE agent must recreate this)
+```
+sda1: 18GB  ext3  /           (dom0 root)
+sda2: 18GB  ext3  (backup)    (upgrade slot)
+sda3: rest  LVM   (SR)        (VM storage repository)
+sda4: 512MB vfat  /boot/efi   (UEFI ESP)
+sda5: 4GB   ext3  /var/log
+sda6: 1GB   swap
+```
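+
+A sketch of how the PXE agent could represent this layout and size the storage
+repository partition from whatever disk it finds. The types are hypothetical, 1GB is
+treated as 1024MB, and alignment and partition-table overhead are ignored.
+
+```go
+package main
+
+import "fmt"
+
+// Partition describes one entry of a layout; SizeMB 0 means "rest of disk".
+type Partition struct {
+    Device string
+    SizeMB int
+    FS     string
+    Mount  string
+}
+
+// xcpngLayout mirrors the layout listed above.
+var xcpngLayout = []Partition{
+    {"sda1", 18 * 1024, "ext3", "/"},
+    {"sda2", 18 * 1024, "ext3", "(backup)"},
+    {"sda3", 0, "LVM", "(SR)"},
+    {"sda4", 512, "vfat", "/boot/efi"},
+    {"sda5", 4 * 1024, "ext3", "/var/log"},
+    {"sda6", 1024, "swap", ""},
+}
+
+// Resolve returns a copy of the layout with the single "rest" partition
+// sized to fill the space left on a disk of diskMB.
+func Resolve(layout []Partition, diskMB int) ([]Partition, error) {
+    fixed, restIdx := 0, -1
+    for i, p := range layout {
+        if p.SizeMB == 0 {
+            if restIdx != -1 {
+                return nil, fmt.Errorf("more than one rest partition")
+            }
+            restIdx = i
+            continue
+        }
+        fixed += p.SizeMB
+    }
+    remaining := diskMB - fixed
+    if remaining < 0 || (restIdx != -1 && remaining == 0) {
+        return nil, fmt.Errorf("disk too small: fixed partitions need %d MB, disk has %d MB", fixed, diskMB)
+    }
+    out := append([]Partition(nil), layout...)
+    if restIdx != -1 {
+        out[restIdx].SizeMB = remaining
+    }
+    return out, nil
+}
+
+func main() {
+    parts, err := Resolve(xcpngLayout, 500*1024)
+    if err != nil {
+        fmt.Println("error:", err)
+        return
+    }
+    for _, p := range parts {
+        fmt.Printf("%s %7d MB %-4s %s\n", p.Device, p.SizeMB, p.FS, p.Mount)
+    }
+}
+```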
+
+## Asahi Linux — Special Case
+
+### Why it can't follow the standard path
+- No PXE boot — Apple Silicon only boots from internal NVMe or USB (iBoot)
+- Firmware partition — m1n1 must be in Apple's APFS container, coexists with macOS
+- Device tree — generated per-chip at install time
+- GPU drivers — Asahi's reverse-engineered drivers are kernel-specific
+- Boot chain: iBoot → m1n1 → U-Boot/GRUB → Linux (completely non-standard)
+
+### How Lab handles it
+- SSH onboard only: `lab onboard mac-studio --provider ssh --host <ip>`
+- Asahi is already installed (user did this manually or via Asahi installer)
+- Lab manages the userspace (Fedora-based) via puppet normally
+- Kernel updates from Asahi repos, managed by puppet/dnf
+- m1n1/U-Boot/firmware layer is untouched by Lab
+
+### Lesson
+Not everything is PXE-bootable. Lab needs two onboard paths:
+- **PXE onboard**: bare metal with no OS (Beelinks, OVH servers, XCP-ng hosts)
+- **SSH onboard**: OS already installed (Mac Studio, DGX Spark, cloud VMs)
+
+## Image Deployment Matrix
+
+```
+                    PXE Deploy    SSH Onboard    Container    VM Image
+Ubuntu 24.04        ✓ rootfs      ✓              ✓            ✓ qcow2
+Debian 12           ✓ rootfs      ✓              ✓            ✓ qcow2
+Fedora 41           ✓ rootfs      ✓              ✓            ✓ qcow2
+AlmaLinux 9         ✓ rootfs      ✓              ✓            ✓ qcow2
+XCP-ng 8.3          ✓ rootfs      ✓ (existing)   ✗            ✗
+VyOS 1.4            ✓ rootfs      ✓ (existing)   ✓ docker     ✓ qcow2
+Asahi Linux         ✗ impossible  ✓ (only way)   ✗            ✗
+```
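+
+The matrix can be encoded as data so that `lab onboard` picks a path, or refuses, from
+one lookup. This is a sketch: the map keys, the `Support` type, and the `OnboardPath`
+function are hypothetical names.
+
+```go
+package main
+
+import "fmt"
+
+// Support records which deployment targets exist for an OS.
+type Support struct {
+    PXE, SSH, Container, VMImage bool
+}
+
+// matrix mirrors the table above.
+var matrix = map[string]Support{
+    "ubuntu-24.04": {PXE: true, SSH: true, Container: true, VMImage: true},
+    "debian-12":    {PXE: true, SSH: true, Container: true, VMImage: true},
+    "fedora-41":    {PXE: true, SSH: true, Container: true, VMImage: true},
+    "almalinux-9":  {PXE: true, SSH: true, Container: true, VMImage: true},
+    "xcpng-8.3":    {PXE: true, SSH: true},
+    "vyos-1.4":     {PXE: true, SSH: true, Container: true, VMImage: true},
+    "asahi-linux":  {SSH: true},
+}
+
+// OnboardPath returns "ssh" for machines that already run the OS and
+// "pxe" for empty machines, or an error when that path does not exist.
+func OnboardPath(osName string, osInstalled bool) (string, error) {
+    s, ok := matrix[osName]
+    if !ok {
+        return "", fmt.Errorf("unknown OS %q", osName)
+    }
+    if osInstalled {
+        if s.SSH {
+            return "ssh", nil
+        }
+        return "", fmt.Errorf("%s has no SSH onboard path", osName)
+    }
+    if s.PXE {
+        return "pxe", nil
+    }
+    return "", fmt.Errorf("%s cannot be PXE deployed; install it first, then use SSH onboard", osName)
+}
+
+func main() {
+    for _, osName := range []string{"ubuntu-24.04", "asahi-linux"} {
+        path, err := OnboardPath(osName, false)
+        fmt.Printf("%s (no OS installed): path=%q err=%v\n", osName, path, err)
+    }
+}
+```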
+
+## Automated Image Pipeline
+
+Images must be rebuilt regularly to include security updates and new lab-agent versions.
+
+### Pipeline Configuration
+```yaml
+image-pipelines:
+  ubuntu-24.04:
+    method: debootstrap
+    schedule: weekly
+    architectures: [x86_64, aarch64]
+    outputs: [rootfs-tarball, container-base, qcow2]
+    retention: 4 builds
+
+  xcpng-8.3:
+    method: packer-qemu          # install in QEMU, capture
+    schedule: monthly
+    architectures: [x86_64]
+    outputs: [rootfs-tarball]
+    retention: 3 builds
+
+  vyos-1.4:
+    method: squashfs-extract     # extract from ISO
+    schedule: monthly
+    architectures: [x86_64, aarch64]
+    outputs: [rootfs-tarball, container-base]
+    retention: 3 builds
+```
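+
+A sketch of how the `retention` setting could be applied after each build: keep the N
+newest builds and return the rest for deletion. The `Build` type is hypothetical, and
+build dates are assumed to be ISO strings so they sort lexicographically.
+
+```go
+package main
+
+import (
+    "fmt"
+    "sort"
+)
+
+// Build identifies one image build; Built is an ISO date such as 2026-03-15.
+type Build struct {
+    ID    string
+    Built string
+}
+
+// Prune splits builds into those to keep (newest first) and those to remove.
+func Prune(builds []Build, retention int) (keep, remove []Build) {
+    sorted := append([]Build(nil), builds...) // do not reorder the caller's slice
+    sort.Slice(sorted, func(i, j int) bool { return sorted[i].Built > sorted[j].Built })
+    if retention < 0 {
+        retention = 0
+    }
+    if len(sorted) <= retention {
+        return sorted, nil
+    }
+    return sorted[:retention], sorted[retention:]
+}
+
+func main() {
+    builds := []Build{
+        {"b1", "2026-02-22"}, {"b2", "2026-03-01"}, {"b3", "2026-03-08"},
+        {"b4", "2026-03-15"}, {"b5", "2026-03-22"},
+    }
+    keep, remove := Prune(builds, 4) // ubuntu-24.04 uses "retention: 4 builds"
+    fmt.Println("keep:", keep)
+    fmt.Println("remove:", remove)
+}
+```
+
+A real pipeline would probably also refuse to remove the build that is currently
+promoted, whatever its age; that check is left out here.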
+
+### Build runs on Lab itself (dogfooding)
+- x86 images build on x86 machines (Beelink SER9 MAX)
+- ARM images build on ARM machines (DGX Spark, Minisforum)
+- XCP-ng builds on any x86 with QEMU/KVM
+- Lab picks the right builder based on architecture
+
+### Upgrade flow
+- New image built → Lab knows which servers run old version
+- `lab image diff` shows package changes
+- `lab image promote` makes new image the default for new deploys
+- Existing servers: puppet manages package updates (not re-imaged unless requested)
+
+### Connection to Puppet → Container Artifact Builder
+
+Same pipeline, different output targets:
+
+```
+Label "mailserver" + base image "ubuntu-24.04":
+  → rootfs + puppet classes = bare metal image (tar.gz for PXE deploy)
+  → rootfs + puppet classes = container image (OCI for k8s/docker)
+  → rootfs + puppet classes = VM image (qcow2/vmdk for XCP-ng/AWS)
+
+One label, one set of puppet modules, three deployment formats.
+```
+
+## Multi-Architecture Considerations
+
+- PXE boot chain differs between x86 (BIOS/UEFI) and ARM (UEFI only)
+- Need separate kernel/initrd per architecture for the agent OS
+- Rootfs tarballs are architecture-specific
+- Some OS images don't exist for all architectures (XCP-ng = x86 only)
+- Lab must track architecture per image and refuse mismatches
+- Tinkerbell's HookOS already builds for x86_64 and aarch64
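+
+A minimal sketch of the "refuse mismatches" check. The `Image` type and `CheckDeploy`
+function are hypothetical; the alias table covers the common `amd64`/`arm64` spellings
+so that a machine reporting `arm64` still matches an `aarch64` image.
+
+```go
+package main
+
+import "fmt"
+
+// aliases maps alternative architecture names to the ones used in this document.
+var aliases = map[string]string{
+    "amd64": "x86_64",
+    "arm64": "aarch64",
+}
+
+// normalize returns the canonical name for an architecture string.
+func normalize(arch string) string {
+    if canonical, ok := aliases[arch]; ok {
+        return canonical
+    }
+    return arch
+}
+
+// Image records which architectures an image has been built for.
+type Image struct {
+    Name  string
+    Archs []string
+}
+
+// CheckDeploy returns an error unless the image was built for the target architecture.
+func CheckDeploy(img Image, targetArch string) error {
+    want := normalize(targetArch)
+    for _, a := range img.Archs {
+        if normalize(a) == want {
+            return nil
+        }
+    }
+    return fmt.Errorf("image %s is not built for %s (available: %v)", img.Name, want, img.Archs)
+}
+
+func main() {
+    xcpng := Image{Name: "xcpng-8.3", Archs: []string{"x86_64"}}
+    ubuntu := Image{Name: "ubuntu-24.04", Archs: []string{"x86_64", "aarch64"}}
+    fmt.Println(CheckDeploy(ubuntu, "arm64"))
+    fmt.Println(CheckDeploy(xcpng, "aarch64"))
+}
+```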