# Lab — Unified Infrastructure Lifecycle Platform

## What It Is

A tool that abstracts infrastructure lifecycle across clouds, hypervisors, bare metal, and Kubernetes — using labels as the universal abstraction and existing tools under the hood.

**Not reinventing the wheel.** Uses Pulumi, OpenVox, Tinkerbell, Prometheus, Naemon, existing Puppet modules, cloud APIs — but provides a unified interface over all of them.

## Architecture

```
┌───────────────────────────────────────────────────────────────┐
│                  lab-server (control plane)                   │
│                                                               │
│  ┌────────────┐ ┌────────────┐ ┌────────────┐ ┌────────────┐  │
│  │ Provider   │ │ Label      │ │ Lifecycle  │ │ Artifact   │  │
│  │ Registry   │ │ Engine     │ │ Manager    │ │ Builder    │  │
│  └────────────┘ └────────────┘ └────────────┘ └────────────┘  │
│  ┌────────────┐ ┌────────────┐ ┌────────────┐ ┌────────────┐  │
│  │ OpenVox    │ │ Health     │ │ K8s        │ │ Render     │  │
│  │ Enroller   │ │ Aggregator │ │ Deployer   │ │ Engine     │  │
│  └────────────┘ └────────────┘ └────────────┘ └────────────┘  │
│  ┌────────────┐ ┌────────────┐ ┌────────────┐ ┌────────────┐  │
│  │ Identity   │ │ DNS        │ │ Secret     │ │ Token      │  │
│  │ Manager    │ │ Manager    │ │ Manager    │ │ Issuer     │  │
│  └────────────┘ └────────────┘ └────────────┘ └────────────┘  │
│                                                               │
│                       API (gRPC + REST)                       │
└──────────────┬────────────────────────────────────────────────┘
               │
    ┌──────────┴──────────┐
    │                     │
┌───┴───┐            ┌────┴────┐
│  lab  │            │ lab-tui │
│ (CLI) │            │  (k9s)  │
└───────┘            └─────────┘
```

### Control Plane (lab-server)

Runs as a service (on bootstrap node, or in k8s). Hosts:

- **Provider Registry** — pluggable providers (AWS, XCP-ng, bare metal, GCP, etc.)
- **Label Engine** — resolves labels → puppet classes, sizes, ports, config
- **Lifecycle Manager** — orchestrates provision → enroll → configure → observe
- **Artifact Builder** — puppet classes → container images
- **OpenVox Enroller** — secure cert signing, node classification, environment assignment
- **Health Aggregator** — queries Prometheus, Naemon, cloud health APIs
- **K8s Deployer** — manages workloads on k3s/EKS clusters
- **Render Engine** — side-by-side provider comparison, cost estimates, drift detection
- **Identity Manager** — tracks enrollment state, certs, Vault auth, SSH keys per resource
- **DNS Manager** — auto-registers/updates DNS for every managed resource
- **Secret Manager** — controls which resources can access which secrets (per-label policies)
- **Token Issuer** — generates one-time join tokens at provision time (no hardcoded secrets)

### CLI (lab)

kubectl-like interface for browsing and managing resources:

```
$ lab get servers
NAME       PROVIDER   LABELS              SIZE    SYNC     PUPPET    HEALTH    IDENTITY
api-1      aws        app,prod,eu-west    medium  ✓ sync   ✓ ok      ✓ ok      ✓ enrolled
api-2      aws        app,prod,eu-west    medium  ✓ sync   ✓ ok      ✓ ok      ✓ enrolled
mail-1     xcpng      mailserver,prod     medium  ✓ sync   ✓ ok      ✓ ok      ✓ enrolled
db-1       baremetal  postgres,prod       large   ⚠ drift  ✓ ok      ✓ ok      ✓ enrolled
worker-3   aws        k8s-worker,staging  large   ✓ sync   ✗ failed  ⚠ 2 alrt  ✓ enrolled
gateway-1  baremetal  k8s-server,prod     small   ✓ sync   ✓ ok      ✓ ok      ⚠ cert exp

$ lab get servers --label mailserver
NAME    PROVIDER  SIZE    SYNC    PUPPET  HEALTH  IDENTITY
mail-1  xcpng     medium  ✓ sync  ✓ ok    ✓ ok    ✓ enrolled
mail-2  aws       medium  ✓ sync  ✓ ok    ✓ ok    ✓ enrolled

$ lab describe server db-1
Name:     db-1
Provider: baremetal
Labels:   [postgres, prod, eu-west]
Size:     large (8 cores, 32GB, 500GB NVMe)

Status: DRIFT DETECTED
  Expected: size=large, disk=500GB
  Actual:   size=large, disk=500GB, extra_mount=/data (unmanaged)

Puppet:
  Environment: production
  Role:        postgres
  Classes:     [postgresql::server, backup::pgbackrest, node_exporter]
  Last run:    2026-03-15 14:22:03 (success)
  Next run:    2026-03-15 14:52:03

Health:
  Prometheus: ✓ all targets up
  Naemon:     ✓ all checks passing
  Alerts:     none active

$ lab get labels
LABEL       PUPPET CLASSES                   SERVERS  CONTAINERS
mailserver  postfix, dovecot, spamassassin   2        1
k8s-worker  kubernetes::worker, containerd   12       0
postgres    postgresql::server, pgbackrest   3        1
app         nginx, app::deploy               4        2

$ lab get containers
NAME        IMAGE                              LABEL       K8S CLUSTER  STATUS
mailserver  ghcr.io/org/mailserver:2026.03.15  mailserver  homelab      running
postgres    ghcr.io/org/postgres:2026.03.14    postgres    homelab      running
app         ghcr.io/org/app:2026.03.15         app         production   running

$ lab diff server db-1
  size: large
  disk: 500GB
+ extra_mount: /data    ← unmanaged, not in spec

$ lab sync server db-1                                           # reconcile drift
$ lab plan server new-mail-3 --label mailserver --provider aws   # preview
$ lab apply server new-mail-3                                    # create it

$ lab build --label mailserver                                   # puppet modules → container image
Building mailserver from puppet classes:
  ✓ postfix
  ✓ dovecot
  ✓ spamassassin
  ✓ fail2ban
  → ghcr.io/org/mailserver:2026.03.15

$ lab render --label mailserver --all-providers
┌──────────────┬──────────────┬──────────┬────────────┐
│              │ AWS          │ XCP-ng   │ Bare Metal │
├──────────────┼──────────────┼──────────┼────────────┤
│ Compute      │ t3.large     │ 4c/8GB   │ IPMI boot  │
│ Puppet       │ postfix,...  │ postfix,.│ postfix,...│
│ Est. Cost    │ ~$62/mo      │ —        │ —          │
└──────────────┴──────────────┴──────────┴────────────┘
```

### TUI (lab-tui)

k9s-style interactive terminal UI:

- Real-time server list with sync/puppet/health status
- Drill into any server for details
- Watch puppet runs live
- Filter by labels, providers, health status
- Trigger actions (sync, plan, apply, build)

## Core Concepts

### Labels — The Universal Abstraction

Everything is a thing with labels. Configuration attaches to labels, not machines.

```yaml
labels:
  mailserver:
    puppet_classes:
      - postfix
      - dovecot
      - spamassassin
      - fail2ban
    ports: [25, 587, 993]
    size: medium
    alerts:
      - smtp_connect        # auto-generated: is SMTP responding?
      - imap_connect        # auto-generated: is IMAP responding?
      - mail_queue_length   # auto-generated: is mail queue healthy?
    secrets:
      - mail/tls-cert
      - mail/dkim-key

  k8s-worker:
    puppet_classes:
      - kubernetes::worker
      - containerd
      - node_exporter
    size: large
    alerts:
      - kubelet_healthy
      - node_ready
    secrets:
      - k8s/join-token
```

### Groups — Nested Targeting with Exclusions

Groups compose labels, other groups, and individual servers into reusable targets. Groups can nest (subgroups). Exclusions allow fine-grained control.

```yaml
groups:
  # Simple group: all production servers
  production:
    match:
      environment: prod

  # Group by label combination
  production-mail:
    match:
      labels: [mailserver]
      environment: prod

  # Nested group with subgroups
  eu-infrastructure:
    groups:
      - eu-west-compute
      - eu-west-storage
      - eu-west-network
    exclude:
      servers: [test-box-1]     # exclude specific server
      labels: [experimental]    # exclude servers with this label

  eu-west-compute:
    match:
      labels: [k8s-worker, k8s-server]
      region: eu-west
    exclude:
      servers: [legacy-node-3]

  # Group targeting everything except a subgroup
  all-except-staging:
    match:
      environment: [prod, dev]
    exclude:
      environment: staging

  # Custom group by explicit membership
  database-tier:
    servers: [db-1, db-2, db-3]
    groups: [replica-set-eu]
```

### Alerts — Auto-Generated and User-Defined

Alerts attach to labels, groups, servers, or environments — same targeting as everything else.
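Because alerts reuse group targeting, it helps to pin down how a group expands into servers. The following is a minimal sketch, not Lab's implementation: the `Server` and `Group` types, the "any of these labels" match rule, and the choice to apply a group's exclusions after its subgroups are expanded are all assumptions made for illustration. Matching on `environment` and `region` is omitted for brevity.

```go
package main

import (
	"fmt"
	"sort"
)

// Server and Group are hypothetical types; fields mirror the YAML keys above.
type Server struct {
	Name   string
	Labels []string
}

type Group struct {
	MatchLabels    []string // assumed: a server needs at least one of these
	Servers        []string // explicit members
	Groups         []string // nested subgroups
	ExcludeServers []string
	ExcludeLabels  []string
}

func hasAny(have, want []string) bool {
	for _, w := range want {
		for _, h := range have {
			if h == w {
				return true
			}
		}
	}
	return false
}

// resolve returns the member set of one group. The visiting map guards
// against cycles along the current nesting path.
func resolve(name string, groups map[string]Group, byName map[string]Server, visiting map[string]bool) map[string]bool {
	members := map[string]bool{}
	g, ok := groups[name]
	if !ok || visiting[name] {
		return members
	}
	visiting[name] = true
	defer delete(visiting, name)

	for _, s := range byName {
		if hasAny(s.Labels, g.MatchLabels) {
			members[s.Name] = true
		}
	}
	for _, n := range g.Servers {
		if _, known := byName[n]; known {
			members[n] = true
		}
	}
	for _, sub := range g.Groups {
		for n := range resolve(sub, groups, byName, visiting) {
			members[n] = true
		}
	}
	// Exclusions run last, so they also filter members inherited from subgroups.
	for _, n := range g.ExcludeServers {
		delete(members, n)
	}
	for n := range members {
		if hasAny(byName[n].Labels, g.ExcludeLabels) {
			delete(members, n)
		}
	}
	return members
}

// Resolve expands a group into a sorted list of server names.
func Resolve(name string, groups map[string]Group, servers []Server) []string {
	byName := map[string]Server{}
	for _, s := range servers {
		byName[s.Name] = s
	}
	out := []string{}
	for n := range resolve(name, groups, byName, map[string]bool{}) {
		out = append(out, n)
	}
	sort.Strings(out)
	return out
}

func main() {
	servers := []Server{
		{Name: "worker-1", Labels: []string{"k8s-worker"}},
		{Name: "legacy-node-3", Labels: []string{"k8s-worker"}},
		{Name: "test-box-1", Labels: []string{"k8s-server"}},
		{Name: "exp-1", Labels: []string{"k8s-worker", "experimental"}},
	}
	groups := map[string]Group{
		"eu-west-compute": {
			MatchLabels:    []string{"k8s-worker", "k8s-server"},
			ExcludeServers: []string{"legacy-node-3"},
		},
		"eu-infrastructure": {
			Groups:         []string{"eu-west-compute"},
			ExcludeServers: []string{"test-box-1"},
			ExcludeLabels:  []string{"experimental"},
		},
	}
	fmt.Println(Resolve("eu-infrastructure", groups, servers)) // [worker-1]
}
```

Applying exclusions last is one possible reading of the `exclude:` blocks; an alternative would be to let each subgroup's exclusions win only within that subgroup, which this sketch also honors because each nested call filters its own set before returning.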
#### Auto-Generated Alerts

When Lab provisions a resource, it generates baseline alerts based on:

- **Label**: mailserver label → SMTP/IMAP checks
- **Puppet classes**: `postgresql::server` → postgres process, replication lag
- **Ports**: if port 443 is declared → HTTPS health check
- **Size**: resource limits → CPU/memory threshold alerts
- **Identity**: cert expiry alerts auto-generated for all enrolled machines

#### User-Defined Alerts

Users can add custom alerts targeting any scope:

```yaml
alerts:
  # Target by label
  - name: mail_queue_critical
    target:
      labels: [mailserver]
    condition: mail_queue_length > 1000
    severity: critical
    for: 5m

  # Target by group
  - name: disk_space_low
    target:
      groups: [production]
    condition: disk_usage_percent > 85
    severity: warning

  # Target by environment
  - name: high_cpu
    target:
      environment: prod
    condition: cpu_usage_percent > 90
    for: 10m
    severity: warning

  # Target specific servers
  - name: gpu_temperature
    target:
      servers: [dgx-spark, beelink-ser9-max]
    condition: gpu_temp_celsius > 80
    severity: critical

  # Target by label but exclude some
  - name: memory_pressure
    target:
      labels: [k8s-worker]
      exclude:
        servers: [batch-worker-1]   # this one is expected to run hot
    condition: memory_usage_percent > 90
    severity: warning
```

Alerts are rendered to the underlying monitoring system (Prometheus rules, Naemon checks, CloudWatch alarms) — we don't build an alerting engine, we generate configs for existing ones. Which monitoring backend to use for each alert type: **needs investigation**.

### Targeting — Unified Query System

The same targeting syntax works everywhere: alerts, puppet classes, secrets, and queries. Target by label, group, server name, environment, region, or any combination with exclusions.
```
# CLI targeting syntax
$ lab get servers --label k8s-worker
$ lab get servers --group production
$ lab get servers --environment staging
$ lab get servers --label k8s-worker --environment prod --exclude worker-3

# What's applied WHERE (server → everything)
$ lab show server worker-5
```

### Visibility — Show What's Applied Where

Two directions of querying: "what does this server get?" and "where does this thing apply?"

#### Server View: Everything applied to a server

```
$ lab show server worker-5

Server: worker-5 (aws, eu-west-1)
Labels: [k8s-worker, production, eu-west]
Groups: [production, eu-west-compute, eu-infrastructure]
Environment: prod

Puppet Classes (6):
  FROM LABEL k8s-worker:
    ├── kubernetes::worker
    ├── containerd
    └── node_exporter
  FROM LABEL production:
    ├── base::hardening
    └── base::monitoring
  FROM LABEL eu-west:
    └── base::ntp_eu

Alerts (8):
  FROM LABEL k8s-worker:
    ├── kubelet_healthy
    └── node_ready
  FROM GROUP production:
    ├── disk_space_low
    └── high_cpu
  AUTO-GENERATED:
    ├── cpu_threshold (from size: large)
    ├── memory_threshold (from size: large)
    ├── cert_expiry (from identity)
    └── puppet_run_failed (from enrollment)

Secrets (2):
  FROM LABEL k8s-worker:
    ├── k8s/join-token (read)
    └── tls/node-cert (dynamic)

Excluded From:
  └── alert "memory_pressure" (explicitly excluded)
```

#### Label/Group View: Where does this apply?
```
$ lab show label mailserver

Label: mailserver
Applied to: 2 servers

Servers:
  ├── mail-1 (xcpng, prod)  ✓ sync ✓ puppet ✓ health ✓ identity
  └── mail-2 (aws, prod)    ✓ sync ✓ puppet ✓ health ✓ identity

Provides:
  Puppet Classes: postfix, dovecot, spamassassin, fail2ban
  Alerts:         smtp_connect, imap_connect, mail_queue_length
  Secrets:        mail/tls-cert, mail/dkim-key
  Ports:          25, 587, 993
  Size:           medium

$ lab show group eu-infrastructure

Group: eu-infrastructure
Contains: 3 subgroups, 47 servers (2 excluded)

Subgroups:
  ├── eu-west-compute (28 servers)
  ├── eu-west-storage (12 servers)
  └── eu-west-network (9 servers)

Excluded:
  ├── test-box-1 (by name)
  └── 1 server with label "experimental"

Alerts targeting this group:
  ├── disk_space_low (warning)
  └── network_latency_high (critical)
```

#### Alert View: Where does this alert fire?

```
$ lab show alert disk_space_low

Alert: disk_space_low
Severity: warning
Condition: disk_usage_percent > 85
Target: group "production"
Excludes: none

Applies to 63 servers:
  ├── api-1 (aws)        currently: 42% ✓
  ├── api-2 (aws)        currently: 38% ✓
  ├── mail-1 (xcpng)     currently: 71% ✓
  ├── db-1 (baremetal)   currently: 83% ⚠ approaching
  └── ... (59 more)

Rendered to:
  ├── Prometheus: rule "disk_space_low" in rules/production.yaml
  └── Naemon: service check on 4 bare-metal hosts
```

#### Reverse Query: What targets this server?

```
$ lab targets server db-1

Everything targeting db-1:

Labels: [postgres, production, eu-west]
Groups: [production, database-tier, eu-infrastructure, eu-west-storage]
Environment: prod

Alerts (11):
  ├── postgres_replication_lag (from label: postgres)
  ├── postgres_connections (from label: postgres)
  ├── disk_space_low (from group: production)
  ├── high_cpu (from group: production)
  ├── storage_iops (from group: eu-west-storage)
  ├── cert_expiry (auto-generated)
  └── ... (5 more)

Puppet Classes (9):
  ├── postgresql::server (from label: postgres)
  ├── backup::pgbackrest (from label: postgres)
  └── ... (7 more)

Secrets (4):
  ├── postgres/master-password (from label: postgres)
  └── ... (3 more)
```

### TUI Visualization (lab-tui)

The k9s-style TUI should support navigating these relationships interactively:

```
┌─ lab-tui ──────────────────────────────────────────────────────────┐
│ View: Servers > worker-5                                    [?]Help│
├────────────────────────────────────────────────────────────────────┤
│                                                                    │
│  ┌─ Server: worker-5 ───────────────────────────────────────────┐  │
│  │ Provider: aws    Size: large    Env: prod                    │  │
│  │ Sync: ✓   Puppet: ✓   Health: ✓   Identity: ✓                │  │
│  └──────────────────────────────────────────────────────────────┘  │
│                                                                    │
│  [L]abels  [A]lerts  [P]uppet  [S]ecrets  [G]roups                 │
│                                                                    │
│  Labels ─────────────────────  Alerts ───────────────────────────  │
│  ► k8s-worker                  ● kubelet_healthy       ✓ OK        │
│  ► production                  ● node_ready            ✓ OK        │
│  ► eu-west                     ● disk_space_low        ✓ 42%       │
│                                ● high_cpu              ✓ 12%       │
│  Groups ─────────────────────  ● cert_expiry           ✓ 347d      │
│  ► production                                                      │
│  ► eu-infrastructure           Puppet Classes ───────────────────  │
│  ► eu-west-compute             ● kubernetes::worker    ✓ applied   │
│                                ● containerd            ✓ applied   │
│  Secrets ────────────────────  ● node_exporter         ✓ applied   │
│  ● k8s/join-token (read)       ● base::hardening       ✓ applied   │
│  ● tls/node-cert (dyn)         ● base::monitoring      ✓ applied   │
│                                                                    │
│  [Enter] drill down  [Esc] back  [/] search  [Tab] switch pane     │
└────────────────────────────────────────────────────────────────────┘
```

Navigation:

- From server → drill into label → see all other servers with that label
- From alert → see all servers it applies to, current values
- From group → see subgroups, expand tree, see members
- From label → see puppet classes, alerts, secrets it provides
- Everything is cross-linked — follow any relationship in either direction

### Deployment Targets

Same label → multiple targets:

| Target | What happens |
|--------|-------------|
| VM (any cloud) | Provision VM → enroll OpenVox → apply classes live |
| Bare metal | PXE boot → enroll OpenVox → apply classes live |
| Container | Build image with classes baked in → push to registry |
| ASG | Launch template with OpenVox enrollment → auto-apply |
| K8s pod | Deploy container artifact to cluster |

### Four-Pillar Status

Every resource shows four things:

1. **Sync** — is the actual infrastructure state matching the declared spec? (instance type, security groups, disks, network — via Pulumi state)
2. **Puppet** — did OpenVox successfully apply all classes? (last run status, any failures, catalog compilation errors)
3. **Health** — are monitoring checks passing? (aggregates from Prometheus alerts, Naemon checks, cloud health APIs)
4. **Identity** — is the resource fully enrolled? (DNS registered, certs valid, Vault authenticated, SSH host key signed)

### Provider Plugin System

Extensible provider model — each provider implements an interface:

```go
type Provider interface {
	Name() string

	// Lifecycle
	Plan(spec ResourceSpec) (*PlanResult, error)
	Apply(spec ResourceSpec) (*Resource, error)
	Destroy(id string) error

	// State
	Get(id string) (*Resource, error)
	List(filters Filters) ([]*Resource, error)
	Diff(spec ResourceSpec) (*DiffResult, error)

	// Introspection (like DA's type-writer)
	DiscoverResources() ([]*Resource, error)
	AvailableSizes() ([]Size, error)
	AvailableImages() ([]Image, error)
}
```

Built-in providers:

- `provider-aws` — wraps Pulumi AWS
- `provider-xcpng` — wraps Pulumi XO / Xen Orchestra API
- `provider-baremetal` — wraps Tinkerbell / iPXE + IPMI/Redfish
- `provider-k8s` — wraps Pulumi Kubernetes

Community can add: GCP, Azure, Hetzner, Proxmox, etc.
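To make the plugin contract concrete, here is a minimal sketch of a provider written against a trimmed subset of the interface above (`Name`, `Plan`, `Apply`, `Destroy`, `Get`). Everything beyond those method names is an assumption: the document does not define `ResourceSpec`, `Resource`, or `PlanResult`, so the fields below are placeholders, and `memProvider` is a hypothetical in-memory provider of the kind one might use for tests or dry runs.

```go
package main

import (
	"errors"
	"fmt"
)

// Placeholder types: the fields are assumptions, not Lab's real schema.
type ResourceSpec struct {
	Name string
	Size string
}

type Resource struct {
	ID   string
	Spec ResourceSpec
}

type PlanResult struct {
	Create bool // true when the resource does not exist yet
}

// Provider is a trimmed subset of the full interface (lifecycle plus Get).
type Provider interface {
	Name() string
	Plan(spec ResourceSpec) (*PlanResult, error)
	Apply(spec ResourceSpec) (*Resource, error)
	Destroy(id string) error
	Get(id string) (*Resource, error)
}

var ErrNotFound = errors.New("resource not found")

// memProvider is a hypothetical provider that keeps resources in a map.
type memProvider struct {
	resources map[string]*Resource
}

func newMemProvider() *memProvider {
	return &memProvider{resources: map[string]*Resource{}}
}

func (p *memProvider) Name() string { return "memory" }

// Plan reports whether Apply would create a new resource.
func (p *memProvider) Plan(spec ResourceSpec) (*PlanResult, error) {
	_, exists := p.resources[spec.Name]
	return &PlanResult{Create: !exists}, nil
}

// Apply creates or overwrites the resource, using its name as the ID.
func (p *memProvider) Apply(spec ResourceSpec) (*Resource, error) {
	r := &Resource{ID: spec.Name, Spec: spec}
	p.resources[r.ID] = r
	return r, nil
}

func (p *memProvider) Destroy(id string) error {
	if _, ok := p.resources[id]; !ok {
		return ErrNotFound
	}
	delete(p.resources, id)
	return nil
}

func (p *memProvider) Get(id string) (*Resource, error) {
	r, ok := p.resources[id]
	if !ok {
		return nil, ErrNotFound
	}
	return r, nil
}

// Compile-time check that memProvider satisfies the interface.
var _ Provider = (*memProvider)(nil)

func main() {
	var p Provider = newMemProvider()
	spec := ResourceSpec{Name: "new-mail-3", Size: "medium"}

	plan, _ := p.Plan(spec)
	fmt.Println("create before apply:", plan.Create) // true

	if _, err := p.Apply(spec); err != nil {
		panic(err)
	}
	plan, _ = p.Plan(spec)
	fmt.Println("create after apply:", plan.Create) // false
}
```

A real provider would delegate `Plan` and `Apply` to Pulumi or another backend; the in-memory version only shows the shape of the contract and the plan-then-apply ordering.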
### Health Aggregator Plugin System

```go
type HealthSource interface {
	Name() string
	CheckHealth(resource *Resource) (*HealthResult, error)
}
```

Built-in sources:

- `health-prometheus` — queries Prometheus alerting rules targeting the resource
- `health-naemon` — queries Naemon host/service checks
- `health-cloudwatch` — queries AWS CloudWatch alarms

### Profiles — T-Shirt Sizing

User-owned mappings:

```yaml
sizes:
  medium:
    abstract: { cores: 4, memory: 8GB }
    providers:
      aws: { instance_type: t3.large }
      xcpng: { cores: 4, memory: 8192MB }
      baremetal: { min_cores: 4, min_memory: 8GB, maas_tag: medium }
```

### Artifact Builder

Puppet modules → container images:

```
label "mailserver"
  → puppet classes [postfix, dovecot, spamassassin]
  → Dockerfile generated:
      FROM ubuntu:24.04
      RUN apt-get update && apt-get install -y puppet-agent
      COPY modules/ /etc/puppetlabs/code/modules/
      RUN puppet apply -e 'include postfix, dovecot, spamassassin'
      # Clean up puppet, leave only configured services
  → Image pushed to registry
  → Available as k8s deployment or standalone container
```

## Tech Stack

| Component | Technology | Why |
|-----------|-----------|-----|
| Server | Go | Performance, single binary, Pulumi SDK, gRPC native |
| CLI | Go (cobra) | Same binary, kubectl-style |
| TUI | Go (bubbletea) | Same binary, k9s-style |
| API | gRPC + REST (grpc-gateway) | Type-safe, fast, REST fallback |
| IaC engine | Pulumi (Go SDK) | Multi-provider, plan/preview, component packages |
| Config mgmt | OpenVox | Puppet modules, ENC, cert management |
| Bare metal | Tinkerbell or custom iPXE | PXE boot, IPMI/Redfish |
| Container build | Buildah or Docker | OCI images from puppet classes |
| State store | TBD — NOT etcd (see State Storage section) | Resource state, label definitions |
| K8s integration | client-go | Direct k8s API for deployments |

## Under The Hood — What We DON'T Build

- Cloud APIs → Pulumi providers handle this
- Puppet language/runtime → OpenVox handles this
- Container runtime → containerd/Docker handles this
- Monitoring → Prometheus/Naemon handle this
- K8s orchestration → k3s/EKS handles this
- PXE/DHCP/TFTP → Tinkerbell handles this
- Certificate management → OpenVox CA handles this

**We build the glue, the abstraction, the UX, and the lifecycle orchestration.**

## Kubernetes Management

Lab also controls what runs on k8s clusters:

```
$ lab get deployments
NAME        CLUSTER     LABEL       REPLICAS  IMAGE                 STATUS
mailserver  homelab     mailserver  2/2       org/mailserver:03.15  ✓ running
api         production  app         4/4       org/app:03.15         ✓ running
postgres    homelab     postgres    1/1       org/postgres:03.14    ✓ running

$ lab deploy --label app --cluster production --replicas 4
$ lab scale --label app --cluster production --replicas 6
```

Deployments reference labels — same label that defines puppet classes also defines the container image, ports, health checks, and k8s resources.

## Bootstrap, Onboarding, and Self-Deployment

### Core Idea: Your Device Is The First Coordinator

You don't need a server to start. Your laptop/workstation runs the full lab engine locally. You onboard servers from it — including bare metal PXE boot. When ready, you migrate the coordinator role to one of the servers you've onboarded.

```
┌────────────┐     ┌────────────┐     ┌────────────┐     ┌────────────┐
│ Phase 0    │     │ Phase 1    │     │ Phase 2    │     │ Phase 3    │
│            │     │            │     │            │     │            │
│ lab init   │────►│ Onboard    │────►│ Move lab   │────►│ Onboard    │
│ --local    │     │ servers    │     │ to a real  │     │ remaining  │
│            │     │ from your  │     │ server     │     │ from the   │
│ Your device│     │ laptop     │     │            │     │ server     │
│ = lab      │     │            │     │            │     │            │
└────────────┘     └────────────┘     └────────────┘     └────────────┘
```

### Architecture: CLI = Embedded Server

The CLI binary contains the full lab-server engine. The difference between modes is where state lives and whether the engine runs persistently.
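One way to picture the mode split is as a small decision function that runs before any command is dispatched. This is only a sketch under stated assumptions: the `Config` and `Runtime` types and the `SelectRuntime` function are hypothetical names, and the rule "bare `lab server` means daemon, a configured endpoint means client, otherwise local" is inferred from the description rather than taken from an existing implementation.

```go
package main

import "fmt"

type Mode string

const (
	ModeLocal  Mode = "local"
	ModeDaemon Mode = "daemon"
	ModeClient Mode = "client"
)

// Config is a hypothetical view of the CLI configuration file.
type Config struct {
	RemoteEndpoint string // empty when no remote lab-server is configured
}

// Runtime describes where the engine runs and where state lives.
type Runtime struct {
	Mode      Mode
	StatePath string // empty in client mode: state lives on the remote server
	Endpoint  string // set only in client mode
}

// isLocalInit reports whether the command is `lab init --local`.
func isLocalInit(args []string) bool {
	if len(args) == 0 || args[0] != "init" {
		return false
	}
	for _, a := range args[1:] {
		if a == "--local" {
			return true
		}
	}
	return false
}

// SelectRuntime picks a mode from the command line and the configuration.
func SelectRuntime(args []string, cfg Config) Runtime {
	switch {
	case len(args) == 1 && args[0] == "server":
		// Persistent engine on this host.
		return Runtime{Mode: ModeDaemon, StatePath: "/var/lib/lab/state.db"}
	case cfg.RemoteEndpoint != "" && !isLocalInit(args):
		// Thin client: forward the command to the remote engine.
		return Runtime{Mode: ModeClient, Endpoint: cfg.RemoteEndpoint}
	default:
		// No server configured (or explicit local init): run the engine in-process.
		return Runtime{Mode: ModeLocal, StatePath: "~/.lab/state.db"}
	}
}

func main() {
	fmt.Println(SelectRuntime([]string{"get", "servers"}, Config{}).Mode)
	fmt.Println(SelectRuntime([]string{"server"}, Config{}).Mode)
	fmt.Println(SelectRuntime([]string{"get", "servers"}, Config{RemoteEndpoint: "beelink-ser9-pro:7443"}).Mode)
}
```

Keeping the decision in one pure function means the same engine code can be exercised in all three modes without starting a daemon.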
```
┌──────────────────────────────────────┐
│ lab (single binary)                  │
│                                      │
│ ┌─────────────────────────────────┐  │
│ │ Core Engine                     │  │
│ │ (providers, labels, render,     │  │
│ │ lifecycle, identity, secrets,   │  │
│ │ PXE server, everything)         │  │
│ └─────────────────────────────────┘  │
│                                      │
│ Modes:                               │
│ ├── $ lab init --local → local mode  │
│ │     State: ~/.lab/state.db         │
│ │     PXE/DHCP: served from laptop   │
│ │     Full engine, no remote server  │
│ │                                    │
│ ├── $ lab server → daemon mode       │
│ │     State: /var/lib/lab/state.db   │
│ │     PXE/DHCP: served from this box │
│ │     Persistent API on port 7443    │
│ │                                    │
│ └── $ lab → client mode              │
│       Talks to remote lab-server     │
│       (or local engine if no server) │
└──────────────────────────────────────┘
```

### Onboarding Flow

`lab onboard` is the command to bring a new machine under management. It handles two scenarios: machines with an OS already installed, and bare metal that needs network boot + OS installation.

#### Scenario A: Machine has OS (SSH onboard)

For machines that already have an OS (like DGX Spark with Ubuntu, or Mac Studio):

```
$ lab onboard dgx-spark --provider ssh --host 192.168.1.50 --user admin

Step 1: Render
┌──────────────┬────────────────────────┐
│ Name         │ dgx-spark              │
│ Provider     │ ssh (existing machine) │
│ Host         │ 192.168.1.50           │
│ OS           │ Ubuntu (detected)      │
│ Arch         │ aarch64 (Grace)        │
│ RAM          │ 128GB                  │
│ GPU          │ CUDA (detected)        │
└──────────────┴────────────────────────┘

Onboarding will:
  + Install lab agent
  + Generate one-time enrollment token
  + Register in DNS: dgx-spark.lab.internal
  + Sign OpenVox certificate
  + Assign labels (interactive or --labels flag)

Proceed? [y/N]: y

Step 2: Detect & assign labels
  Detected hardware:
    GPU:  NVIDIA GB10 Grace Blackwell → suggesting label: cuda
    RAM:  128GB → suggesting label: ai-inference
    Arch: aarch64 → suggesting label: arm
  Assign labels [cuda,ai-inference,arm]: cuda,ai-inference,dgx-spark

Step 3: Apply (same engine as lab apply)
  → SSH into 192.168.1.50
  → Install lab agent binary
  → Generate one-time token
  → Lab agent enrolls:
      → OpenVox cert signed, classified in environment "production"
      → DNS A record: dgx-spark.lab.internal → 192.168.1.50
      → Identity established
  → Apply puppet classes from labels:
      → cuda: nvidia-drivers, cuda-toolkit
      → ai-inference: inference-runtime
  → Machine fully managed

$ lab get servers
NAME       PROVIDER  LABELS                       SYNC  PUPPET  HEALTH  IDENTITY
dgx-spark  ssh       cuda,ai-inference,dgx-spark  ✓     ✓ ok    ✓       ✓ enrolled
```

#### Scenario B: Bare metal (PXE network boot)

For machines with no OS. Lab (on your laptop or server) becomes a PXE server on the local network, serves the OS installer, and onboards after installation:

```
$ lab onboard beelink-max --provider baremetal \
    --mac AA:BB:CC:DD:EE:FF \
    --image ubuntu-24.04 \
    --labels k8s-worker,rocm,longhorn

Step 1: Render
┌──────────────┬──────────────────────────┐
│ Name         │ beelink-max              │
│ Provider     │ baremetal (PXE boot)     │
│ MAC          │ AA:BB:CC:DD:EE:FF        │
│ Image        │ ubuntu-24.04             │
│ Labels       │ k8s-worker,rocm,longhorn │
│ PXE server   │ this device (laptop)     │
└──────────────┴──────────────────────────┘

Onboarding will:
  + Start PXE/DHCP/TFTP on local network interface
  + Wait for machine with MAC AA:BB:CC:DD:EE:FF to boot
  + Serve unattended Ubuntu 24.04 installer
  + After install: auto-enroll with one-time token baked into installer
  + Assign labels, apply puppet classes

⚠ PXE requires: network interface on same L2 segment as target machine
⚠ DHCP: will respond ONLY to MAC AA:BB:CC:DD:EE:FF (safe for existing networks)

Proceed? [y/N]: y

Step 2: PXE boot phase
  → Starting PXE server on en0 (192.168.1.x)
  → DHCP offer scoped to MAC AA:BB:CC:DD:EE:FF only
  → Waiting for network boot request...

  ⏳ Power on the Beelink SER9 MAX and set it to boot from network (PXE)

  → Boot request received from AA:BB:CC:DD:EE:FF
  → Serving iPXE → kernel + initrd → autoinstall config
  → OS installation in progress...
  → Installation complete, machine rebooting

Step 3: Post-install enrollment (same as SSH onboard from here)
  → Machine boots with installed OS
  → Lab agent runs on first boot (installed during OS setup)
  → Uses one-time token (baked into autoinstall config) to enroll:
      → OpenVox cert signed
      → DNS: beelink-max.lab.internal → 192.168.1.100
      → Identity established
  → Apply puppet classes from labels:
      → k8s-worker: kubernetes::worker, containerd
      → rocm: rocm-drivers
      → longhorn: longhorn::node
  → Machine fully managed

$ lab get servers
NAME         PROVIDER   LABELS                    SYNC  PUPPET  HEALTH  IDENTITY
dgx-spark    ssh        cuda,ai-inference         ✓     ✓ ok    ✓       ✓ enrolled
beelink-max  baremetal  k8s-worker,rocm,longhorn  ✓     ✓ ok    ✓       ✓ enrolled
```

#### Scenario C: Onboard with IPMI/Redfish (remote power control)

For bare metal where you have IPMI/BMC access — Lab can power on the machine and set PXE boot remotely, fully hands-free:

```
$ lab onboard beelink-max --provider baremetal \
    --mac AA:BB:CC:DD:EE:FF \
    --ipmi 192.168.1.200 --ipmi-user admin \
    --image ubuntu-24.04 \
    --labels k8s-worker,rocm,longhorn

  → IPMI: setting next boot to PXE
  → IPMI: powering on machine
  → PXE server waiting for boot request...
  → (fully automated from here)
```

### Homelab Bootstrap Walkthrough

The complete flow for setting up the homelab from zero:

```
# Phase 0: Local mode on your laptop
$ lab init --local
  ✓ Lab engine running locally
  ✓ State: ~/.lab/state.db
  ✓ Ready to onboard servers

# Phase 1: Onboard servers that already have an OS
$ lab onboard dgx-spark --provider ssh --host 192.168.1.50
  → Labels: [cuda, ai-inference, dgx-spark]
$ lab onboard mac-studio --provider ssh --host 192.168.1.51
  → Labels: [k8s-server, etcd, arm]

# Phase 2: Onboard bare metal (PXE from your laptop)
$ lab onboard beelink-ser9-pro --provider baremetal --mac XX:XX:XX:XX:XX:01 \
    --image ubuntu-24.04 --labels bootstrap,lab-server
  → PXE boot from laptop → install OS → enroll
  → This will become the permanent lab-server host

# Phase 3: Move lab-server to a real server
$ lab server migrate --target ssh --host beelink-ser9-pro
  → Lab-server deployed on Beelink SER9 Pro
  → State migrated from ~/.lab/state.db
  → PXE/DHCP now served from Beelink, not your laptop
  → CLI config updated: lab talks to beelink-ser9-pro:7443

# Phase 4: Onboard remaining servers (PXE from beelink-ser9-pro now)
$ lab onboard beelink-ser9-max --provider baremetal --mac XX:XX:XX:XX:XX:02 \
    --image ubuntu-24.04 --labels k8s-worker,rocm,longhorn
  → PXE served by beelink-ser9-pro (not your laptop anymore)
$ lab onboard minisforum-ms-r1 --provider baremetal --mac XX:XX:XX:XX:XX:03 \
    --image ubuntu-24.04 --labels k8s-worker,arm

# Phase 5: Set up k8s
$ lab apply cluster homelab --servers mac-studio,beelink-ser9-max,minisforum-ms-r1
  → mac-studio becomes k3s server (etcd)
  → beelink-ser9-max joins as worker
  → minisforum-ms-r1 joins as worker
  → All via puppet classes from labels

# Phase 6: Optionally move lab-server into k8s
$ lab server migrate --target kubernetes --cluster homelab
  → Lab-server now runs as k8s pod
  → Still manages everything including the cluster it runs on

# Final state:
$ lab get servers
NAME              PROVIDER    LABELS                    SYNC  PUPPET  HEALTH  IDENTITY
dgx-spark         ssh         cuda,ai-inference         ✓     ✓ ok    ✓       ✓ enrolled
mac-studio        ssh         k8s-server,etcd,arm       ✓     ✓ ok    ✓       ✓ enrolled
beelink-ser9-pro  baremetal   bootstrap                 ✓     ✓ ok    ✓       ✓ enrolled
beelink-ser9-max  baremetal   k8s-worker,rocm,longhorn  ✓     ✓ ok    ✓       ✓ enrolled
minisforum-ms-r1  baremetal   k8s-worker,arm            ✓     ✓ ok    ✓       ✓ enrolled
lab-server        kubernetes  lab,control-plane         ✓     ✓ ok    ✓       ✓ enrolled
```

### Enterprise Application: XCP-ng Bare Metal Deploy

Same onboarding flow works for deploying XCP-ng to enterprise bare metal:

```
$ lab onboard xen-host-42 --provider baremetal \
    --mac AA:BB:CC:DD:EE:FF \
    --ipmi 10.0.0.142 --ipmi-user admin \
    --image xcpng-8.3 \
    --labels xen-host,production,eu-west

  → IPMI: power on, PXE boot
  → Install XCP-ng 8.3 (unattended)
  → Enroll, apply puppet classes:
      → xen-host: xcpng::host, xcpng::networking, xcpng::storage
  → Host registered in Xen Orchestra pool
  → Ready to provision VMs on it

# Now create VMs on the XCP-ng host we just onboarded:
$ lab apply server app-12 --provider xcpng --labels app,production
  → VM created on xen-host-42 via Xen Orchestra API
  → OS installed, enrolled, puppet applied
  → Same flow as AWS EC2, just different provider
```

### PXE Server Capabilities

When running in local or server mode, Lab includes an embedded PXE server:

- **DHCP**: scoped to specific MACs only (safe for existing networks with DHCP)
- **TFTP**: serves iPXE bootloader
- **HTTP**: serves kernel, initrd, autoinstall configs
- **Autoinstall generation**: creates unattended install configs per-machine with:
  - Lab agent pre-installed
  - One-time enrollment token baked in
  - Network config for the target environment
  - Disk layout per label/profile
- **Supported images**: Ubuntu, Debian, RHEL/Rocky, XCP-ng (extensible)

PXE serving moves with lab-server — if you migrate lab to a new host, PXE is served from there. If lab is on your laptop, PXE is on your laptop. Same engine, same binary.
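The "scoped to specific MACs only" rule is what makes the embedded DHCP responder safe next to an existing DHCP server, so it is worth sketching. The code below is an illustration under assumptions, not the PXE server itself: `BootTarget` and `Allowlist` are hypothetical names, and only the allowlist decision is shown (no DHCP packets are parsed or sent). It relies on the standard library's `net.ParseMAC` to normalize addresses.

```go
package main

import (
	"fmt"
	"net"
)

// BootTarget is a hypothetical record of what to serve to one machine.
type BootTarget struct {
	Name  string
	Image string
}

// Allowlist maps normalized MAC addresses to boot targets. Requests from
// any other MAC are ignored, leaving them to the network's own DHCP server.
type Allowlist struct {
	targets map[string]BootTarget
}

func NewAllowlist() *Allowlist {
	return &Allowlist{targets: map[string]BootTarget{}}
}

// normalize parses a MAC in colon or dash notation and returns the
// canonical lowercase, colon-separated form.
func normalize(mac string) (string, error) {
	hw, err := net.ParseMAC(mac)
	if err != nil {
		return "", fmt.Errorf("invalid MAC %q: %w", mac, err)
	}
	return hw.String(), nil
}

// Add registers a machine that is allowed to network-boot.
func (a *Allowlist) Add(mac string, t BootTarget) error {
	key, err := normalize(mac)
	if err != nil {
		return err
	}
	a.targets[key] = t
	return nil
}

// Lookup reports whether a request from mac should get an offer,
// and if so which target it maps to.
func (a *Allowlist) Lookup(mac string) (BootTarget, bool) {
	key, err := normalize(mac)
	if err != nil {
		return BootTarget{}, false
	}
	t, ok := a.targets[key]
	return t, ok
}

func main() {
	allow := NewAllowlist()
	if err := allow.Add("AA:BB:CC:DD:EE:FF", BootTarget{Name: "beelink-max", Image: "ubuntu-24.04"}); err != nil {
		panic(err)
	}

	// Same machine, different notation: still matches.
	if t, ok := allow.Lookup("aa-bb-cc-dd-ee-ff"); ok {
		fmt.Println("offer to", t.Name, "with image", t.Image)
	}

	// Unknown machine: stay silent.
	if _, ok := allow.Lookup("11:22:33:44:55:66"); !ok {
		fmt.Println("ignore unknown MAC")
	}
}
```

Normalizing on both insert and lookup avoids mismatches between the `--mac` flag as typed by the user and the address as it appears in a boot request.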
### Hardware Detection During Onboard When onboarding via SSH (existing OS), Lab detects hardware and suggests labels: ``` $ lab onboard new-server --provider ssh --host 10.0.0.50 Detected hardware: CPU: AMD EPYC 7763 (x86_64, 64 cores) → suggest: compute RAM: 256 GB → suggest: high-memory GPU: NVIDIA A100 80GB → suggest: cuda, ai-training Disk: 2x NVMe 1.92TB, 4x SSD 3.84TB → suggest: storage NIC: 2x 25GbE, 1x 1GbE IPMI → suggest: high-bandwidth Suggested labels: [compute, high-memory, cuda, ai-training, storage, high-bandwidth] Assign labels [accept/edit]: _ ``` For PXE onboard, hardware detection happens after OS installation, and labels can be auto-confirmed or require interactive approval. ### No Server? CLI Runs Locally If no remote server is configured, every `lab` command runs the engine locally. This means you can use Lab in permanent local mode for simple setups: ``` $ lab get servers # no remote server configured ⓘ Running locally (~/.lab/state.db) Tip: run `lab server migrate --target ` to deploy a persistent server NAME PROVIDER LABELS SYNC PUPPET HEALTH IDENTITY ... 
```

### Self-Migration

Migration uses the same plan/apply as everything else:

```
$ lab server migrate --target ssh --host beelink-ser9-pro

Step 1: Plan
  ~ migrate lab-server from local (~/.lab) to ssh://beelink-ser9-pro
    + deploy lab-server container on beelink-ser9-pro
    + copy state.db to remote host
    + start PXE/DHCP services on remote host
    + stop local PXE/DHCP services
    + update CLI config to new endpoint

Step 2: Apply
  → Deploy lab-server on beelink-ser9-pro
  → Copy state to remote
  → Verify remote is healthy
  → Switch CLI config
  → Stop local engine

$ lab server migrate --target kubernetes --cluster homelab

Step 1: Plan
  ~ migrate lab-server from ssh://beelink-ser9-pro to kubernetes://homelab
    + k8s Deployment lab-server (1 replica)
    + k8s Service lab-server (port 7443)
    + PersistentVolumeClaim lab-server-state (10Gi)
    + migrate state.db to PVC
    + PXE services: move to k8s hostNetwork pod or keep on bootstrap node

  ⚠ Note: PXE/DHCP requires L2 network access. If k8s node is on the same
    L2 segment, use hostNetwork. Otherwise, keep PXE on the bootstrap node
    and only migrate the API/state to k8s.

Step 2: Apply
  → Deploy to k8s
  → Migrate state
  → Verify healthy
  → Update CLI config
  → Tear down old deployment
```

### Key Design Principles

1. **One engine everywhere** — CLI, local mode, server mode, and init all share the same code
2. **Your device is the first coordinator** — no chicken-and-egg, start from nothing
3. **Onboard uses the same pipeline as apply** — render, plan, apply, enroll
4. **PXE is embedded** — no external PXE/DHCP server needed, Lab serves it
5. **Hardware detection suggests labels** — but the user confirms
6. **Migration is just plan/apply for lab-server** — same engine, no special case
7. **Enterprise and homelab are the same flow** — onboard XCP-ng bare metal = onboard homelab Beelink

## Identity and Trust Layer

Inspired by what FreeIPA did well (auto-DNS, centralized SSH, server-scoped secrets, internal CA, IP mobility) without what it did badly (instability, hardcoded join secrets).

Lab controls the full lifecycle — it knows when a machine is born — so it can solve the enrollment problem properly: generate a one-time join token at provision time, inject it via cloud-init or iPXE userdata. No hardcoded secrets in images.

### Provision-to-Enrolled Flow

```
$ lab apply server new-worker-5 --label k8s-worker --provider aws

1. PROVISION → Pulumi creates EC2 instance
2. IDENTITY  → Lab generates one-time join token (short-lived, single-use)
             → Token injected via cloud-init (or iPXE userdata for bare metal)
             → Token is NOT in the image — generated per-instance at provision time
3. ENROLL    → Machine boots, uses token to:
               → Register with OpenVox (cert signed, node classified)
               → Register in DNS (A record + PTR)
               → Authenticate with Vault (get identity + policies per label)
               → Get SSH CA-signed host key (no more TOFU)
4. CONFIGURE → OpenVox applies classes
             → Machine pulls secrets it's allowed to access from Vault
             → e.g. k8s join token retrieved from Vault, node joins cluster
5. ENROLLED  → Lab marks resource identity as ✓ enrolled
```

### What Each Machine Gets on Enrollment

| Capability | What happens | Tool underneath (TBD — needs investigation) |
|-----------|-------------|----------------------------------------------|
| DNS auto-registration | A + PTR records created/updated automatically | CoreDNS API? ExternalDNS? PowerDNS? needs investigation |
| IP mobility | Machine restarts with new IP → DNS updated automatically | Lab agent on machine reports changes? DHCP hook? needs investigation |
| Server certificate | TLS cert issued for the machine, auto-renewed | OpenVox CA? Vault PKI secrets engine? cert-manager? needs investigation |
| SSH host key signing | Host key signed by CA, clients trust CA not individual keys | Vault SSH secrets engine? OpenVox CA? step-ca? needs investigation |
| SSH user access | Users get short-lived SSH certs, centrally managed | Vault SSH + OIDC? Teleport? Boundary? needs investigation |
| Secret access (RBAC) | Machine authenticates with Vault, gets label-scoped policy | Vault AppRole? Vault cert auth? needs investigation |
| K8s join tokens | Retrieved from Vault by entitled machines, used to join cluster | Vault KV + policy per label? needs investigation |
| OpenVox enrollment | Cert signed, environment + role + classes assigned | OpenVox CA + ENC — this one we know |
| One-time join tokens | Generated per-instance at provision, single-use, short-lived | Lab itself generates these — or delegate to Vault? needs investigation |

**Important: We don't need to build any of these from scratch.** Each row is a capability that likely has an existing tool we can wrap. Just like we use Pulumi for cloud APIs and OpenVox for config management, we'll find the right tool for each identity concern. Each position requires investigation — we'll evaluate options together, one by one.

### CLI: Identity Information

```
$ lab get servers
NAME       PROVIDER   LABELS      SYNC  PUPPET  HEALTH  IDENTITY
worker-5   aws        k8s-worker  ✓     ✓ ok    ✓       ✓ enrolled
worker-6   xcpng      k8s-worker  ✓     ✓ ok    ✓       ✓ enrolled
worker-7   baremetal  k8s-worker  ✓     ✗ fail  ⚠       ⚠ cert expiring
new-box    aws        k8s-worker  ✓     …       …       ⏳ enrolling

$ lab describe server worker-5
...
Identity:
  DNS:           worker-5.lab.internal (A: 10.0.1.45, PTR: ✓)
  OpenVox:       ✓ cert signed (expires 2027-03-15)
  Vault:         ✓ authenticated (policy: k8s-worker)
  SSH Host Key:  ✓ CA-signed (fingerprint: SHA256:abc...)
  Secrets:       k8s/join-token, tls/node-cert (2 accessible)
  Enrolled:      2026-03-15 14:22:03 (one-time token, consumed)
  Last Check-in: 2026-03-15 15:01:12 (38 seconds ago)

$ lab get secrets --label k8s-worker
SECRET              TYPE     ACCESSIBLE BY            LAST ROTATED
k8s/join-token      dynamic  k8s-worker (12 srv)      2026-03-15
tls/cluster-ca      static   k8s-worker, k8s-server   2026-01-01
monitoring/api-key  static   k8s-worker, monitoring   2026-02-28

$ lab identity renew worker-5     # force cert/key renewal
$ lab identity revoke worker-5    # revoke all creds, remove from DNS, unenroll
```

### Secrets — Code Is The Policy

**Design principle:** If your code/config declares "I use secret X", that IS the access grant. No one goes to a separate UI to edit policies. Default is locked — if not mentioned, no access. If mentioned, access is automatic.

**The declaration IS the policy:**

```yaml
labels:
  mailserver:
    puppet_classes:
      - postfix
      - dovecot
    secrets:
      - mail/tls-cert
      - mail/dkim-key
      - mail/relay-credentials
    ports: [25, 587, 993]
```

When Lab applies label `mailserver` to a server, it automatically:

1. Grants that server access to `mail/tls-cert`, `mail/dkim-key`, `mail/relay-credentials`
2. Denies access to everything else
3. Requires no separate policy file, no Vault admin, and no ticket to the security team

When a puppet class references a secret:

```puppet
# modules/postfix/manifests/init.pp
class postfix {
  $relay_creds = lab::secret('mail/relay-credentials')

  file { '/etc/postfix/sasl_passwd':
    content => $relay_creds,
    mode    => '0600',
  }
}
```

The `lab::secret()` call is both the usage AND the declaration that this class needs this secret. Lab scans puppet classes, discovers secret references, and auto-generates the access policy. If the `postfix` class is applied to a server via a label, that server gets access to `mail/relay-credentials`. Remove the class → access revoked.
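The derivation described above (label declarations plus scanned `lab::secret()` references, default deny) can be sketched in Go. This is a minimal illustration under stated assumptions, not Lab's implementation: the `Label`, `ClassSecrets`, `Policy`, and `Compile` names are hypothetical, and the class-to-secret map is filled in by hand where a real scanner would parse manifests.

```go
package main

import (
	"fmt"
	"sort"
)

// Label mirrors the YAML label definition. The field names are assumptions
// made for this sketch, not Lab's real schema.
type Label struct {
	PuppetClasses []string
	Secrets       []string
}

// ClassSecrets maps a puppet class to the secrets it references through
// lab::secret() calls. Hypothetical: supplied by hand here instead of scanned.
type ClassSecrets map[string][]string

// Policy is the compiled allow-list for one server. Any path that is not
// present is denied.
type Policy map[string]bool

// Compile builds the allow-list for a server from the labels applied to it.
// Unknown label names are skipped, so they grant nothing.
func Compile(applied []string, labels map[string]Label, scanned ClassSecrets) Policy {
	p := Policy{}
	for _, name := range applied {
		label, ok := labels[name]
		if !ok {
			continue
		}
		// Secrets declared directly on the label.
		for _, s := range label.Secrets {
			p[s] = true
		}
		// Secrets referenced by the label's puppet classes.
		for _, class := range label.PuppetClasses {
			for _, s := range scanned[class] {
				p[s] = true
			}
		}
	}
	return p
}

// Allows reports whether the policy grants access to path (default deny).
func (p Policy) Allows(path string) bool { return p[path] }

// Paths returns the granted secret paths in sorted order.
func (p Policy) Paths() []string {
	out := make([]string, 0, len(p))
	for s := range p {
		out = append(out, s)
	}
	sort.Strings(out)
	return out
}

func main() {
	labels := map[string]Label{
		"mailserver": {
			PuppetClasses: []string{"postfix", "dovecot"},
			Secrets:       []string{"mail/tls-cert", "mail/dkim-key"},
		},
	}
	scanned := ClassSecrets{"postfix": {"mail/relay-credentials"}}

	policy := Compile([]string{"mailserver"}, labels, scanned)
	fmt.Println("granted:", policy.Paths())
	fmt.Println("k8s/join-token allowed:", policy.Allows("k8s/join-token"))
}
```

Because the policy is recomputed from what is currently applied, removing a label or class and recompiling drops the grant with no separate revocation step. Pushing the result to a backend is left out of this sketch.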
**Secrets must be equally easy to access from anywhere:**

| Runtime | How you get a secret | Same underneath |
|---------|---------------------|-----------------|
| Puppet code | `lab::secret('mail/tls-cert')` | Lab agent on machine fetches from secret backend |
| App on VM | `LAB_SECRET_MAIL_TLS_CERT` env var, or `/run/secrets/mail/tls-cert` file | Lab agent provides via env or tmpfs mount |
| App in Kubernetes | Same env var or volume mount | Lab k8s operator syncs to K8s Secret object |
| App in Docker (standalone) | `--env-file` or bind mount from lab agent | Lab agent on host provides |
| Script / cron job | `lab secret get mail/tls-cert` CLI call | Lab CLI authenticated via machine identity |
| cloud-init / bootstrap | Injected at provision time via one-time token | Lab server provides during enrollment |

**One way to consume secrets, regardless of where you run.** The lab agent (or k8s operator, or CLI) handles authentication and fetching transparently. The app just reads an env var or file.
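From the application's side, the table above boils down to "check the env var, else read the file". The Go sketch below shows one way an app might do that. The naming rule in `EnvName` (uppercase, `/` and `-` become `_`, `LAB_SECRET_` prefix) is inferred from the single `mail/tls-cert` example and is an assumption, as are the `EnvName` and `ReadSecret` helper names.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"strings"
)

// EnvName converts a secret path into an environment variable name.
// Assumption: the rule is inferred from the one documented example,
// mail/tls-cert -> LAB_SECRET_MAIL_TLS_CERT.
func EnvName(path string) string {
	r := strings.NewReplacer("/", "_", "-", "_")
	return "LAB_SECRET_" + strings.ToUpper(r.Replace(path))
}

// ReadSecret returns the secret from the environment if the variable is set,
// otherwise from the file <dir>/<path>. On a managed machine dir would be
// /run/secrets, per the table.
func ReadSecret(dir, path string) ([]byte, error) {
	if v, ok := os.LookupEnv(EnvName(path)); ok {
		return []byte(v), nil
	}
	return os.ReadFile(filepath.Join(dir, filepath.FromSlash(path)))
}

func main() {
	fmt.Println(EnvName("mail/tls-cert"))

	// On a machine without a lab agent this reports an error rather than
	// returning an empty secret.
	if _, err := ReadSecret("/run/secrets", "mail/tls-cert"); err != nil {
		fmt.Println("secret not available:", err)
	}
}
```

The app never talks to the secret backend itself, so the same code runs unchanged on a VM, in Docker, or in Kubernetes.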
#### How Access Flows

```
Label "mailserver" declares secrets:
  - mail/tls-cert
  - mail/dkim-key
            │
            ▼
┌───────────────────────┐
│  Lab compiles policy  │
│                       │
│  server mail-1:       │
│    CAN access:        │
│      mail/tls-cert    │
│      mail/dkim-key    │
│    CANNOT access:     │
│      k8s/*            │
│      postgres/*       │
│      (everything else)│
└───────────┬───────────┘
            │
            ▼
┌───────────────────────┐
│  Secret backend       │
│  (TBD — needs         │
│   investigation)      │
│                       │
│  Enforces policy at   │
│  backend level, not   │
│  just in Lab          │
└───────────────────────┘
```

#### Secret Sources

Secrets themselves can come from multiple places:

```yaml
secrets:
  mail/tls-cert:
    type: dynamic           # generated/rotated automatically
    generator: acme         # cert-manager / Let's Encrypt
    rotate_every: 90d

  mail/dkim-key:
    type: static            # manually set, stored encrypted
    set_by: admin           # who last set it

  mail/relay-credentials:
    type: static
    set_by: admin

  k8s/join-token:
    type: dynamic
    generator: kubernetes   # fetched from k8s API
    rotate_every: 24h

  tls/node-cert:
    type: dynamic
    generator: ca           # issued per-machine from internal CA
    per_machine: true       # each machine gets its own
```

#### CLI for Secrets

```
$ lab get secrets
SECRET                  TYPE     USED BY              LAST ROTATED
mail/tls-cert           dynamic  mailserver (2 srv)   2026-03-14
mail/dkim-key           static   mailserver (2 srv)   2026-01-15
mail/relay-credentials  static   mailserver (2 srv)   2026-02-01
k8s/join-token          dynamic  k8s-worker (12 srv)  2026-03-15
tls/node-cert           dynamic  * (all enrolled)     per-machine

$ lab secret set mail/relay-credentials
Enter value: ****
✓ Updated. Accessible by: mailserver (2 servers)
✓ Servers will pick up new value within 60s

$ lab show secret mail/relay-credentials
Secret:   mail/relay-credentials
Type:     static
Last set: 2026-03-15 by admin
Accessible by (derived from code):
  Label "mailserver" → puppet class "postfix" → lab::secret('mail/relay-credentials')
    ├── mail-1 (xcpng)  last fetched: 12m ago
    └── mail-2 (aws)    last fetched: 12m ago
No other references found in any applied code.

$ lab secret audit
✓ All secrets are referenced by at least one applied class/label
⚠ Secret "old/api-key" is defined but not referenced by any code — orphaned?
⚠ Secret "db/password" referenced by class "app::database" but never set — empty!
```

#### Secret Architecture — Distributed, Offline-Capable

**Critical requirement:** Nothing breaks if the central secret server (or any server) is unreachable. Everything continues to work — including making new pods, deployments, puppet runs — using local encrypted cache. This is not an edge case, it's a core design.

**This means secrets are NOT a central server you query.** They're a distributed, synced, encrypted dataset with offline capability.

```
┌─────────────────────────────────────────────────────────────┐
│                  Secret Distribution Model                  │
│                                                             │
│  NOT this (central server):    THIS (distributed sync):     │
│                                                             │
│      ┌─────────┐               ┌──────┐   ┌──────┐          │
│      │  Vault  │               │ Node │◄─►│ Node │          │
│      └────┬────┘               └──┬───┘   └──┬───┘          │
│      ┌────┼────┐                  │  ▲       │              │
│      │    │    │                  ▼  │       ▼              │
│     ┌┴┐  ┌┴┐  ┌┴┐              ┌──────┐   ┌──────┐          │
│     │N│  │N│  │N│              │ Node │◄─►│ Node │          │
│     └─┘  └─┘  └─┘              └──┬───┘   └──────┘          │
│  (all dead if vault               │                         │
│   is unreachable)                 ▼                         │
│                              ┌──────────┐                   │
│                              │ Git repo │  (encrypted       │
│                              │ (backup) │   backup of       │
│                              └──────────┘   last resort)    │
└─────────────────────────────────────────────────────────────┘
```

#### How It Works

**Layer 1: Local Encrypted Cache (on every machine)**

- Every machine that has access to secrets stores them locally, encrypted at rest
- Encrypted with machine-specific key (derived from machine identity/TPM/secure enclave)
- Puppet runs, app starts, pod deployments — all read from local cache
- If cache is fresh → use it, no network call needed
- Cache has TTL per secret, but stale cache is better than no secret

**Layer 2: Secret Store (privileged nodes that hold all secrets)**

- One or more nodes with the `secret-store` label hold the COMPLETE encrypted dataset
- This is NOT a special server type — it's a label, applied to pods, VMs, or bare metal
- Should have at least 2 replicas for HA
- Machines fetch ONLY the secrets their labels entitle them to from the store
- The store enforces policy — a machine with label `mailserver` gets `mail/*`, nothing else
- Machines NEVER sync with each other directly — they only talk to the store
- This prevents secret sprawl (no machine accumulates secrets it shouldn't have)

**Layer 3: Git Encrypted Backup (last resort recovery)**

- All secrets (encrypted with a master key) backed up to a Git repo
- If a machine has empty cache AND no peers available → restore from Git backup
- SOPS/age style encryption — secrets encrypted, metadata (paths, policies) in plaintext
- Git gives versioning, audit trail, and disaster recovery for free
- The Git repo alone is useless without the decryption key

**Layer 4: Lab-server (coordinator, NOT single point of failure)**

- Lab-server is the preferred interface to set/rotate secrets (via CLI/API)
- Lab-server does NOT need to be the secret-store (but can be, via label)
- If lab-server is down, machines keep running from local cache
- No new secrets can be distributed while secret-store is down
- But nothing breaks — existing workloads continue uninterrupted
- When secret-store comes back, machines sync and catch up

**Separation of concerns:**

- `lab-server` = coordination, API, lifecycle management
- `secret-store` label = holds all secrets, serves policy-filtered requests
- These CAN be the same node (apply both labels) or separate nodes
- For homelab: same node is fine.
  For enterprise: separate for isolation

#### Recovery Scenarios

```
Scenario 1: Lab-server down, secret-store up
  → All machines continue working from local cache
  → Machines can still fetch/refresh secrets from secret-store
  → No new resources can be provisioned (lab-server manages lifecycle)
  → But existing workloads are unaffected

Scenario 2: Secret-store down, lab-server up
  → All machines continue working from local cache
  → Lab-server can still manage lifecycle (provision, plan, apply)
  → No new secrets can be distributed
  → No secret rotations until store is back
  → Lab-server shows: ⚠ secret-store unreachable

Scenario 3: Both down
  → All machines continue working from local cache
  → Nothing new can happen, but nothing breaks
  → Recovery priority: restore secret-store first (from Git backup)

Scenario 4: Machine reboots, cache intact
  → Reads from local encrypted cache immediately
  → Refreshes from secret-store in background to catch up
  → No dependency on lab-server for startup

Scenario 5: Machine rebuilt, cache empty
  → Machine has its identity (from enrollment) but no secrets
  → Fetches entitled secrets from secret-store (policy-filtered)
  → If secret-store unreachable → cannot start (needs secrets)
  → Operator can restore secret-store from Git backup to unblock

Scenario 6: Total disaster, only Git backup survives
  → Deploy new node, apply `secret-store` label
  → Restore encrypted secrets from Git backup
  → Deploy lab-server (lab init)
  → New machines enroll and receive their entitled secrets
  → System fully recovered

Scenario 7: New pod in k8s, secret-store unreachable
  → K8s node has local secret cache for its entitled secrets
  → Lab k8s operator serves pod secrets from node's local cache
  → Pod starts with cached secrets
  → No interruption to deployments
```

#### CLI for Secret Distribution

```
$ lab secret status
SECRET DISTRIBUTION STATUS:
  Local cache:   ✓ 8 secrets cached (of 8 entitled), encrypted, fresh (< 5m old)
  Secret store:  ✓ connected (2 replicas: store-1, store-2)
  Lab-server:    ✓ connected
  Git backup:    ✓ last push 2026-03-15 14:30:00 (47 total secrets)

$ lab secret status --store
SECRET STORE:
  Replicas: 2/2 healthy
    store-1  k8s pod   ✓ synced  47 secrets (all)
    store-2  vm/xcpng  ✓ synced  47 secrets (all)
  Git backup: ✓ synced 2026-03-15 14:30:00
  Total secrets: 47
  Entitled consumers:
    k8s-worker (12 machines) → 3 secrets each
    mailserver (2 machines)  → 5 secrets each
    postgres (3 machines)    → 4 secrets each
    lab-server (1 machine)   → 2 secrets

$ lab secret cache
LOCAL CACHE:
  SECRET          CACHED  TTL        STATUS
  mail/tls-cert   ✓       89d left   fresh
  mail/dkim-key   ✓       no expiry  fresh
  k8s/join-token  ✓       23h left   fresh
  tls/node-cert   ✓       346d left  fresh

$ lab secret recover --from git
→ Fetching encrypted backup from git@github.com:org/lab-secrets.git
→ Decrypting with master key...
→ Restored 23 secrets
→ Syncing with available peers...
```

#### Local Cache Security

The local cache must be stored securely — needs investigation:

- Encrypted at rest with machine-specific key
- Key derived from: TPM 2.0? Secure enclave? LUKS-bound?
  needs investigation
- Memory-mapped, not swappable (mlock)
- Accessible only by lab agent (file permissions + MAC/SELinux)
- Wiped on machine decommission (`lab identity revoke`)
- Possibly use kernel keyring on Linux — needs investigation

#### Secret Backend — NOT Decided

The underlying secret storage/sync mechanism is pluggable:

```go
type SecretBackend interface {
	Name() string

	// CRUD
	Get(path string, identity *MachineIdentity) ([]byte, error)
	Set(path string, value []byte) error
	Delete(path string) error
	List(prefix string) ([]string, error)

	// Policy (auto-generated from code/labels)
	GrantAccess(path string, identity *MachineIdentity) error
	RevokeAccess(path string, identity *MachineIdentity) error

	// Dynamic
	Generate(path string, generator GeneratorConfig) ([]byte, error)
	Rotate(path string) error

	// Distribution
	SyncWith(peer PeerInfo) error
	CacheLocally(secrets []Secret) error
	RestoreFromBackup(source BackupSource) error
}
```

Possible approaches (each needs investigation):

- **SOPS + age + Git** — simplest, encrypted files in Git, but no peer sync
- **OpenBao** — Vault fork, has replication, but still central-server mindset
- **Sealed Secrets / External Secrets Operator** — k8s-native, but not universal
- **Infisical** — developer-friendly, but SaaS-oriented
- **Custom: encrypted SQLite + peer sync** — simple, we control the sync protocol
- **etcd with encryption** — distributed by nature, but might be overkill
- **CockroachDB** — distributed SQL, encrypted, survives node failures
- **Consul** — distributed KV with gossip, HashiCorp though
- **Lab's own sync protocol** — gossip-based, encrypted, purpose-built

The right answer might be a combination:

- SOPS/age for encryption format (proven, auditable)
- Custom gossip sync for distribution (lightweight)
- Git for backup (free versioning and DR)
- Or wrap an existing distributed KV that already handles sync

**This is the most complex subsystem in Lab and needs careful investigation.**

### Identity Plugin System

Same extensible pattern as providers and health sources:

```go
type IdentityPlugin interface {
	Name() string

	// Enrollment
	Enroll(resource *Resource, token string) (*Identity, error)
	Revoke(resource *Resource) error

	// Status
	Status(resource *Resource) (*IdentityStatus, error)

	// Renewal
	Renew(resource *Resource) error
}
```

This allows swapping identity backends without changing the rest of Lab. We might start with Vault + OpenVox CA and later add/replace components.

## State Storage — Design Principles

**NOT etcd.** etcd prioritizes consistency over availability — it would rather crash and stay down than serve potentially inconsistent data. For Lab, availability wins:

- Losing a few events is better than total outage
- Should auto-backup and auto-restore on corruption
- Should degrade gracefully, never crash and refuse to start
- Stale data is acceptable, no data is not

Requirements:

- Stores: resource state, label definitions, group membership, alert configs, audit log
- Must survive lab-server restart
- Must be migratable (lab-server can move between hosts)
- Should auto-backup (to Git, S3, or local snapshots)
- Should auto-recover from corruption without operator intervention
- Embedded (no external dependency) preferred for simplicity

Candidates (needs investigation):

- **SQLite** — embedded, simple, proven, WAL mode for concurrent reads, easy to backup (copy file)
- **bbolt/BoltDB** — embedded KV, used by etcd ironically, simpler than etcd itself
- **Badger** — embedded KV in Go, LSM-tree, good performance
- **DuckDB** — embedded analytical DB, might be overkill
- **PostgreSQL** — if we need multi-server state, but adds external dependency
- **Litestream** — SQLite + continuous replication to S3/GCS/Azure (interesting combo)

**SQLite + Litestream** is the current leading candidate:

- SQLite for simplicity and embeddability
- Litestream for continuous backup to S3/GCS/local without stopping the database
- Auto-restore: if DB is missing, Litestream
  restores from latest backup
- Single file, easy to migrate when lab-server moves
- But needs investigation to confirm it handles our scale

## Open Questions

1. Name: "lab" is simple but generic. Alternatives?
2. GitOps integration — should label/profile changes go through Git, or direct API?
3. Multi-tenancy — how to scope labels/resources per team?
4. Auth — mTLS between CLI and server? OIDC? Vault-issued tokens?
5. Input format — TypeScript (DA-style), YAML (Compose-style), or both?
6. Should `lab init` deploy lab-server as a container (portable) or native binary (simpler)?