lab/lab-tool-spec.md
Michal Rydlikowski ac695f506f first commit
2026-03-15 23:50:43 +00:00

Lab — Unified Infrastructure Lifecycle Platform

What It Is

A tool that abstracts infrastructure lifecycle across clouds, hypervisors, bare metal, and Kubernetes — using labels as the universal abstraction and existing tools under the hood.

Lab does not reinvent the wheel. It uses Pulumi, OpenVox, Tinkerbell, Prometheus, Naemon, existing Puppet modules, and cloud APIs, and provides a unified interface over all of them.

Architecture

┌────────────────────────────────────────────────────────────┐
│                    lab-server (control plane)               │
│                                                             │
│  ┌──────────┐  ┌──────────┐  ┌──────────┐  ┌────────────┐ │
│  │ Provider  │  │ Label    │  │ Lifecycle│  │ Artifact   │ │
│  │ Registry  │  │ Engine   │  │ Manager  │  │ Builder    │ │
│  └──────────┘  └──────────┘  └──────────┘  └────────────┘ │
│  ┌──────────┐  ┌──────────┐  ┌──────────┐  ┌────────────┐ │
│  │ OpenVox  │  │ Health   │  │ K8s      │  │ Render     │ │
│  │ Enroller │  │ Aggregator│  │ Deployer │  │ Engine     │ │
│  └──────────┘  └──────────┘  └──────────┘  └────────────┘ │
│  ┌──────────┐  ┌──────────┐  ┌──────────┐  ┌────────────┐ │
│  │ Identity │  │ DNS      │  │ Secret   │  │ Token      │ │
│  │ Manager  │  │ Manager  │  │ Manager  │  │ Issuer     │ │
│  └──────────┘  └──────────┘  └──────────┘  └────────────┘ │
│                                                             │
│  API (gRPC + REST)                                          │
└──────────────┬─────────────────────────────────────────────┘
               │
    ┌──────────┴──────────┐
    │                     │
┌───┴───┐           ┌────┴────┐
│ lab   │           │ lab-tui │
│ (CLI) │           │ (k9s)   │
└───────┘           └─────────┘

Control Plane (lab-server)

Runs as a service, either on the bootstrap node or in k8s. It hosts:

  • Provider Registry — pluggable providers (AWS, XCP-ng, bare metal, GCP, etc.)
  • Label Engine — resolves labels → puppet classes, sizes, ports, config
  • Lifecycle Manager — orchestrates provision → enroll → configure → observe
  • Artifact Builder — puppet classes → container images
  • OpenVox Enroller — secure cert signing, node classification, environment assignment
  • Health Aggregator — queries Prometheus, Naemon, cloud health APIs
  • K8s Deployer — manages workloads on k3s/EKS clusters
  • Render Engine — side-by-side provider comparison, cost estimates, drift detection
  • Identity Manager — tracks enrollment state, certs, Vault auth, SSH keys per resource
  • DNS Manager — auto-registers/updates DNS for every managed resource
  • Secret Manager — controls which resources can access which secrets (per-label policies)
  • Token Issuer — generates one-time join tokens at provision time (no hardcoded secrets)
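
The Token Issuer's one-time join tokens can be illustrated with a minimal sketch. It assumes an in-memory store, a per-resource binding, and a TTL. The `TokenIssuer` type and its `Issue`/`Redeem` methods are hypothetical names, not part of the spec:

```go
package main

import (
	"crypto/rand"
	"encoding/hex"
	"errors"
	"fmt"
	"sync"
	"time"
)

// tokenEntry records which resource a token was issued for and when it expires.
type tokenEntry struct {
	resource string
	expires  time.Time
}

// TokenIssuer is a hypothetical in-memory store of one-time join tokens.
type TokenIssuer struct {
	mu     sync.Mutex
	tokens map[string]tokenEntry
}

func NewTokenIssuer() *TokenIssuer {
	return &TokenIssuer{tokens: make(map[string]tokenEntry)}
}

// Issue creates a random 256-bit token bound to one resource name and a TTL.
func (t *TokenIssuer) Issue(resource string, ttl time.Duration) (string, error) {
	buf := make([]byte, 32)
	if _, err := rand.Read(buf); err != nil {
		return "", err
	}
	tok := hex.EncodeToString(buf)
	t.mu.Lock()
	defer t.mu.Unlock()
	t.tokens[tok] = tokenEntry{resource: resource, expires: time.Now().Add(ttl)}
	return tok, nil
}

// Redeem validates a token and always deletes it, so a second use fails.
func (t *TokenIssuer) Redeem(tok, resource string) error {
	t.mu.Lock()
	defer t.mu.Unlock()
	e, ok := t.tokens[tok]
	if !ok {
		return errors.New("unknown or already used token")
	}
	delete(t.tokens, tok)
	if e.resource != resource {
		return errors.New("token was issued for a different resource")
	}
	if time.Now().After(e.expires) {
		return errors.New("token expired")
	}
	return nil
}

func main() {
	issuer := NewTokenIssuer()
	tok, err := issuer.Issue("dgx-spark", 10*time.Minute)
	if err != nil {
		panic(err)
	}
	fmt.Println("first use:", issuer.Redeem(tok, "dgx-spark"))
	fmt.Println("second use:", issuer.Redeem(tok, "dgx-spark"))
}
```

Deleting the token before checking the resource binding means a failed attempt also burns the token, which is the conservative choice for enrollment secrets.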

CLI (lab)

kubectl-like interface for browsing and managing resources:

$ lab get servers
NAME          PROVIDER   LABELS                SIZE     SYNC     PUPPET    HEALTH    IDENTITY
api-1         aws        app,prod,eu-west      medium   ✓ sync   ✓ ok      ✓ ok      ✓ enrolled
api-2         aws        app,prod,eu-west      medium   ✓ sync   ✓ ok      ✓ ok      ✓ enrolled
mail-1        xcpng      mailserver,prod       medium   ✓ sync   ✓ ok      ✓ ok      ✓ enrolled
db-1          baremetal  postgres,prod         large    ⚠ drift  ✓ ok      ✓ ok      ✓ enrolled
worker-3      aws        k8s-worker,staging    large    ✓ sync   ✗ failed  ⚠ 2 alrt  ✓ enrolled
gateway-1     baremetal  k8s-server,prod       small    ✓ sync   ✓ ok      ✓ ok      ⚠ cert exp

$ lab get servers --label mailserver
NAME          PROVIDER   SIZE     SYNC     PUPPET    HEALTH    IDENTITY
mail-1        xcpng      medium   ✓ sync   ✓ ok      ✓ ok      ✓ enrolled
mail-2        aws        medium   ✓ sync   ✓ ok      ✓ ok      ✓ enrolled

$ lab describe server db-1
Name:       db-1
Provider:   baremetal
Labels:     [postgres, prod, eu-west]
Size:       large (8 cores, 32GB, 500GB NVMe)
Status:     DRIFT DETECTED
  Expected: size=large, disk=500GB
  Actual:   size=large, disk=500GB, extra_mount=/data (unmanaged)
Puppet:
  Environment: production
  Role:         postgres
  Classes:      [postgresql::server, backup::pgbackrest, node_exporter]
  Last run:     2026-03-15 14:22:03 (success)
  Next run:     2026-03-15 14:52:03
Health:
  Prometheus:   ✓ all targets up
  Naemon:       ✓ all checks passing
  Alerts:       none active

$ lab get labels
LABEL          PUPPET CLASSES                        SERVERS   CONTAINERS
mailserver     postfix, dovecot, spamassassin        2         1
k8s-worker     kubernetes::worker, containerd        12        0
postgres       postgresql::server, pgbackrest        3         1
app            nginx, app::deploy                    4         2

$ lab get containers
NAME              IMAGE                              LABEL        K8S CLUSTER   STATUS
mailserver        ghcr.io/org/mailserver:2026.03.15  mailserver   homelab       running
postgres          ghcr.io/org/postgres:2026.03.14    postgres     homelab       running
app               ghcr.io/org/app:2026.03.15         app          production    running

$ lab diff server db-1
  size: large
  disk: 500GB
+ extra_mount: /data    ← unmanaged, not in spec

$ lab sync server db-1          # reconcile drift
$ lab plan server new-mail-3 --label mailserver --provider aws   # preview
$ lab apply server new-mail-3   # create it

$ lab build --label mailserver  # puppet modules → container image
Building mailserver from puppet classes:
  ✓ postfix
  ✓ dovecot
  ✓ spamassassin
  ✓ fail2ban
→ ghcr.io/org/mailserver:2026.03.15

$ lab render --label mailserver --all-providers
┌──────────────┬──────────────┬──────────┬────────────┐
│              │ AWS          │ XCP-ng   │ Bare Metal │
├──────────────┼──────────────┼──────────┼────────────┤
│ Compute      │ t3.large     │ 4c/8GB   │ IPMI boot  │
│ Puppet       │ postfix,...  │ postfix,.│ postfix,...│
│ Est. Cost    │ ~$62/mo      │ —        │ —          │
└──────────────┴──────────────┴──────────┴────────────┘
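
The drift reported by `lab diff server db-1` in the session above can be sketched as a comparison between declared and observed attributes. This is a minimal illustration that assumes both sides are flattened to string maps. The `Diff` function and its `+`/`-`/`~` prefixes are assumptions, not the spec's actual format:

```go
package main

import (
	"fmt"
	"sort"
)

// Diff returns one line per attribute that differs between the declared spec
// and the observed state. Attributes that match are omitted.
func Diff(declared, actual map[string]string) []string {
	keys := map[string]struct{}{}
	for k := range declared {
		keys[k] = struct{}{}
	}
	for k := range actual {
		keys[k] = struct{}{}
	}
	sorted := make([]string, 0, len(keys))
	for k := range keys {
		sorted = append(sorted, k)
	}
	sort.Strings(sorted) // stable output regardless of map iteration order

	var out []string
	for _, k := range sorted {
		want, inSpec := declared[k]
		got, inActual := actual[k]
		switch {
		case inSpec && !inActual:
			out = append(out, fmt.Sprintf("- %s: %s", k, want)) // missing on the machine
		case !inSpec && inActual:
			out = append(out, fmt.Sprintf("+ %s: %s", k, got)) // unmanaged, not in spec
		case want != got:
			out = append(out, fmt.Sprintf("~ %s: %s -> %s", k, want, got))
		}
	}
	return out
}

func main() {
	declared := map[string]string{"size": "large", "disk": "500GB"}
	actual := map[string]string{"size": "large", "disk": "500GB", "extra_mount": "/data"}
	for _, line := range Diff(declared, actual) {
		fmt.Println(line)
	}
}
```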

TUI (lab-tui)

k9s-style interactive terminal UI:

  • Real-time server list with sync/puppet/health status
  • Drill into any server for details
  • Watch puppet runs live
  • Filter by labels, providers, health status
  • Trigger actions (sync, plan, apply, build)

Core Concepts

Labels — The Universal Abstraction

Every managed resource carries labels. Configuration attaches to labels, not machines.

labels:
  mailserver:
    puppet_classes:
      - postfix
      - dovecot
      - spamassassin
      - fail2ban
    ports: [25, 587, 993]
    size: medium
    alerts:
      - smtp_connect           # auto-generated: is SMTP responding?
      - imap_connect           # auto-generated: is IMAP responding?
      - mail_queue_length      # auto-generated: is mail queue healthy?
    secrets:
      - mail/tls-cert
      - mail/dkim-key

  k8s-worker:
    puppet_classes:
      - kubernetes::worker
      - containerd
      - node_exporter
    size: large
    alerts:
      - kubelet_healthy
      - node_ready
    secrets:
      - k8s/join-token
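
How the Label Engine might turn a server's labels into a class list can be sketched as an ordered, de-duplicated union. This is a sketch under assumptions: `LabelDef` and `ResolveClasses` are hypothetical names, and only the `puppet_classes` field is modelled:

```go
package main

import "fmt"

// LabelDef holds the part of a label definition this sketch needs.
type LabelDef struct {
	PuppetClasses []string
}

// ResolveClasses returns the union of puppet classes for the given labels,
// keeping first-seen order and dropping duplicates. Unknown labels add nothing.
func ResolveClasses(defs map[string]LabelDef, labels []string) []string {
	seen := map[string]bool{}
	var out []string
	for _, l := range labels {
		for _, c := range defs[l].PuppetClasses {
			if !seen[c] {
				seen[c] = true
				out = append(out, c)
			}
		}
	}
	return out
}

func main() {
	defs := map[string]LabelDef{
		"k8s-worker": {PuppetClasses: []string{"kubernetes::worker", "containerd", "node_exporter"}},
		"postgres":   {PuppetClasses: []string{"postgresql::server", "backup::pgbackrest", "node_exporter"}},
	}
	// node_exporter is provided by both labels but appears once in the result.
	fmt.Println(ResolveClasses(defs, []string{"k8s-worker", "postgres"}))
}
```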

Groups — Nested Targeting with Exclusions

Groups compose labels, other groups, and individual servers into reusable targets. Groups can nest (subgroups). Exclusions allow fine-grained control.

groups:
  # Simple group: all production servers
  production:
    match:
      environment: prod

  # Group by label combination
  production-mail:
    match:
      labels: [mailserver]
      environment: prod

  # Nested group with subgroups
  eu-infrastructure:
    groups:
      - eu-west-compute
      - eu-west-storage
      - eu-west-network
    exclude:
      servers: [test-box-1]           # exclude specific server
      labels: [experimental]           # exclude servers with this label

  eu-west-compute:
    match:
      labels: [k8s-worker, k8s-server]
      region: eu-west
    exclude:
      servers: [legacy-node-3]

  # Group targeting everything except one environment
  all-except-staging:
    match:
      environment: [prod, dev]
    exclude:
      environment: staging

  # Custom group by explicit membership
  database-tier:
    servers: [db-1, db-2, db-3]
    groups: [replica-set-eu]
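
Group resolution with nesting and exclusions can be sketched as a recursive set computation. The sketch below is an assumption about semantics the spec does not pin down: labels inside one `match` are treated as any-of, different `match` fields are ANDed, exclusions are applied after subgroups are merged, and `region` is omitted for brevity. All type and function names are hypothetical:

```go
package main

import (
	"fmt"
	"sort"
)

type Server struct {
	Name        string
	Labels      []string
	Environment string
}

type Group struct {
	MatchLabels    []string // any-of (assumption)
	MatchEnv       string
	Servers        []string // explicit members
	Groups         []string // nested subgroups
	ExcludeServers []string
	ExcludeLabels  []string
}

func hasAny(have, want []string) bool {
	for _, h := range have {
		for _, w := range want {
			if h == w {
				return true
			}
		}
	}
	return false
}

// Resolve returns the member names of a group. visiting guards against cycles.
func Resolve(name string, groups map[string]Group, inv []Server, visiting map[string]bool) map[string]bool {
	members := map[string]bool{}
	g, ok := groups[name]
	if !ok || visiting[name] {
		return members
	}
	visiting[name] = true
	if len(g.MatchLabels) > 0 || g.MatchEnv != "" {
		for _, s := range inv {
			labelOK := len(g.MatchLabels) == 0 || hasAny(s.Labels, g.MatchLabels)
			envOK := g.MatchEnv == "" || s.Environment == g.MatchEnv
			if labelOK && envOK {
				members[s.Name] = true
			}
		}
	}
	for _, n := range g.Servers {
		members[n] = true
	}
	for _, sub := range g.Groups {
		for n := range Resolve(sub, groups, inv, visiting) {
			members[n] = true
		}
	}
	for _, n := range g.ExcludeServers {
		delete(members, n)
	}
	for _, s := range inv {
		if hasAny(s.Labels, g.ExcludeLabels) {
			delete(members, s.Name)
		}
	}
	delete(visiting, name)
	return members
}

func main() {
	inv := []Server{
		{Name: "worker-1", Labels: []string{"k8s-worker"}, Environment: "prod"},
		{Name: "legacy-node-3", Labels: []string{"k8s-worker"}, Environment: "prod"},
		{Name: "exp-1", Labels: []string{"k8s-server", "experimental"}, Environment: "prod"},
	}
	groups := map[string]Group{
		"eu-west-compute":   {MatchLabels: []string{"k8s-worker", "k8s-server"}, ExcludeServers: []string{"legacy-node-3"}},
		"eu-infrastructure": {Groups: []string{"eu-west-compute"}, ExcludeLabels: []string{"experimental"}},
	}
	var names []string
	for n := range Resolve("eu-infrastructure", groups, inv, map[string]bool{}) {
		names = append(names, n)
	}
	sort.Strings(names)
	fmt.Println(names)
}
```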

Alerts — Auto-Generated and User-Defined

Alerts attach to labels, groups, servers, or environments — same targeting as everything else.

Auto-Generated Alerts

When Lab provisions a resource, it generates baseline alerts based on:

  • Label: mailserver label → SMTP/IMAP checks
  • Puppet classes: postgresql::server → postgres process, replication lag
  • Ports: if port 443 is declared → HTTPS health check
  • Size: resource limits → CPU/memory threshold alerts
  • Identity: cert expiry alerts auto-generated for all enrolled machines

User-Defined Alerts

Users can add custom alerts targeting any scope:

alerts:
  # Target by label
  - name: mail_queue_critical
    target:
      labels: [mailserver]
    condition: mail_queue_length > 1000
    severity: critical
    for: 5m

  # Target by group
  - name: disk_space_low
    target:
      groups: [production]
    condition: disk_usage_percent > 85
    severity: warning

  # Target by environment
  - name: high_cpu
    target:
      environment: prod
    condition: cpu_usage_percent > 90
    for: 10m
    severity: warning

  # Target specific servers
  - name: gpu_temperature
    target:
      servers: [dgx-spark, beelink-ser9-max]
    condition: gpu_temp_celsius > 80
    severity: critical

  # Target by label but exclude some
  - name: memory_pressure
    target:
      labels: [k8s-worker]
      exclude:
        servers: [batch-worker-1]   # this one is expected to run hot
    condition: memory_usage_percent > 90
    severity: warning

Alerts are rendered to the underlying monitoring system (Prometheus rules, Naemon checks, CloudWatch alarms). Lab does not implement an alerting engine; it generates configuration for existing ones. Open question: which monitoring backend to use for each alert type still needs investigation.
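
Rendering to one backend can be sketched for Prometheus, whose rule files use `groups` → `rules` → `alert`/`expr`/`for`/`labels`. This sketch assumes the alert `condition` is already a valid PromQL expression and leaves out how targets become label matchers. The `Alert` type and `RenderPrometheusRules` are hypothetical names:

```go
package main

import (
	"fmt"
	"strings"
)

// Alert mirrors the fields of a user-defined alert that matter for rendering.
type Alert struct {
	Name      string
	Condition string // assumed to be valid PromQL already
	Severity  string
	For       string // optional, e.g. "5m"
}

// RenderPrometheusRules emits a Prometheus rule file with one rule group.
func RenderPrometheusRules(group string, alerts []Alert) string {
	var b strings.Builder
	b.WriteString("groups:\n")
	fmt.Fprintf(&b, "  - name: %s\n    rules:\n", group)
	for _, a := range alerts {
		fmt.Fprintf(&b, "      - alert: %s\n", a.Name)
		fmt.Fprintf(&b, "        expr: %q\n", a.Condition) // quoted to stay a YAML string
		if a.For != "" {
			fmt.Fprintf(&b, "        for: %s\n", a.For)
		}
		fmt.Fprintf(&b, "        labels:\n          severity: %s\n", a.Severity)
	}
	return b.String()
}

func main() {
	fmt.Print(RenderPrometheusRules("mailserver", []Alert{
		{Name: "mail_queue_critical", Condition: "mail_queue_length > 1000", Severity: "critical", For: "5m"},
	}))
}
```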

Targeting — Unified Query System

The same targeting syntax works everywhere: alerts, puppet classes, secrets, and queries. Target by label, group, server name, environment, region, or any combination with exclusions.

# CLI targeting syntax
$ lab get servers --label k8s-worker
$ lab get servers --group production
$ lab get servers --environment staging
$ lab get servers --label k8s-worker --environment prod --exclude worker-3

# What's applied WHERE (server → everything)
$ lab show server worker-5

Visibility — Show What's Applied Where

Two directions of querying: "what does this server get?" and "where does this thing apply?"

Server View: Everything applied to a server

$ lab show server worker-5

Server: worker-5 (aws, eu-west-1)
Labels:   [k8s-worker, production, eu-west]
Groups:   [production, eu-west-compute, eu-infrastructure]
Environment: prod

Puppet Classes (6):
  FROM LABEL k8s-worker:
    ├── kubernetes::worker
    ├── containerd
    └── node_exporter
  FROM LABEL production:
    ├── base::hardening
    └── base::monitoring
  FROM LABEL eu-west:
    └── base::ntp_eu

Alerts (8):
  FROM LABEL k8s-worker:
    ├── kubelet_healthy
    └── node_ready
  FROM GROUP production:
    ├── disk_space_low
    └── high_cpu
  AUTO-GENERATED:
    ├── cpu_threshold (from size: large)
    ├── memory_threshold (from size: large)
    ├── cert_expiry (from identity)
    └── puppet_run_failed (from enrollment)

Secrets (2):
  FROM LABEL k8s-worker:
    ├── k8s/join-token (read)
    └── tls/node-cert (dynamic)

Excluded From:
  └── alert "memory_pressure" (explicitly excluded)

Label/Group View: Where does this apply?

$ lab show label mailserver

Label: mailserver
Applied to: 2 servers

Servers:
  ├── mail-1 (xcpng, prod)    ✓ sync  ✓ puppet  ✓ health  ✓ identity
  └── mail-2 (aws, prod)      ✓ sync  ✓ puppet  ✓ health  ✓ identity

Provides:
  Puppet Classes: postfix, dovecot, spamassassin, fail2ban
  Alerts: smtp_connect, imap_connect, mail_queue_length
  Secrets: mail/tls-cert, mail/dkim-key
  Ports: 25, 587, 993
  Size: medium

$ lab show group eu-infrastructure

Group: eu-infrastructure
Contains: 3 subgroups, 47 servers (2 excluded)

Subgroups:
  ├── eu-west-compute    (28 servers)
  ├── eu-west-storage    (12 servers)
  └── eu-west-network    (9 servers)

Excluded:
  ├── test-box-1 (by name)
  └── 1 server with label "experimental"

Alerts targeting this group:
  ├── disk_space_low (warning)
  └── network_latency_high (critical)

Alert View: Where does this alert fire?

$ lab show alert disk_space_low

Alert: disk_space_low
Severity: warning
Condition: disk_usage_percent > 85
Target: group "production"
Excludes: none

Applies to 63 servers:
  ├── api-1 (aws)        currently: 42%  ✓
  ├── api-2 (aws)        currently: 38%  ✓
  ├── mail-1 (xcpng)     currently: 71%  ✓
  ├── db-1 (baremetal)   currently: 83%  ⚠ approaching
  └── ... (59 more)

Rendered to:
  ├── Prometheus: rule "disk_space_low" in rules/production.yaml
  └── Naemon: service check on 4 bare-metal hosts

Reverse Query: What targets this server?

$ lab targets server db-1

Everything targeting db-1:
  Labels:     [postgres, production, eu-west]
  Groups:     [production, database-tier, eu-infrastructure, eu-west-storage]
  Environment: prod

  Alerts (11):
    ├── postgres_replication_lag    (from label: postgres)
    ├── postgres_connections        (from label: postgres)
    ├── disk_space_low              (from group: production)
    ├── high_cpu                    (from group: production)
    ├── storage_iops                (from group: eu-west-storage)
    ├── cert_expiry                 (auto-generated)
    └── ... (5 more)

  Puppet Classes (9):
    ├── postgresql::server          (from label: postgres)
    ├── backup::pgbackrest          (from label: postgres)
    └── ... (7 more)

  Secrets (4):
    ├── postgres/master-password    (from label: postgres)
    └── ... (3 more)

TUI Visualization (lab-tui)

The k9s-style TUI should support navigating these relationships interactively:

┌─ lab-tui ──────────────────────────────────────────────────────────┐
│ View: Servers > worker-5                                    [?]Help│
├────────────────────────────────────────────────────────────────────┤
│                                                                    │
│  ┌─ Server: worker-5 ──────────────────────────────────────────┐  │
│  │ Provider: aws          Size: large         Env: prod        │  │
│  │ Sync: ✓   Puppet: ✓   Health: ✓   Identity: ✓              │  │
│  └─────────────────────────────────────────────────────────────┘  │
│                                                                    │
│  [L]abels    [A]lerts    [P]uppet    [S]ecrets    [G]roups        │
│                                                                    │
│  Labels ──────────────────── Alerts ──────────────────────────     │
│  ► k8s-worker                ● kubelet_healthy        ✓ OK        │
│  ► production                ● node_ready             ✓ OK        │
│  ► eu-west                   ● disk_space_low         ✓ 42%       │
│                              ● high_cpu               ✓ 12%       │
│  Groups ──────────────────   ● cert_expiry            ✓ 347d      │
│  ► production                                                      │
│    ► eu-infrastructure       Puppet Classes ──────────────────     │
│      ► eu-west-compute       ● kubernetes::worker     ✓ applied   │
│                              ● containerd             ✓ applied   │
│  Secrets ─────────────────   ● node_exporter          ✓ applied   │
│  ● k8s/join-token    (read)  ● base::hardening        ✓ applied   │
│  ● tls/node-cert     (dyn)   ● base::monitoring       ✓ applied   │
│                                                                    │
│ [Enter] drill down  [Esc] back  [/] search  [Tab] switch pane    │
└────────────────────────────────────────────────────────────────────┘

Navigation:

  • From server → drill into label → see all other servers with that label
  • From alert → see all servers it applies to, current values
  • From group → see subgroups, expand tree, see members
  • From label → see puppet classes, alerts, secrets it provides
  • Everything is cross-linked — follow any relationship in either direction

Deployment Targets

Same label → multiple targets:

Target           What happens
VM (any cloud)   Provision VM → enroll OpenVox → apply classes live
Bare metal       PXE boot → enroll OpenVox → apply classes live
Container        Build image with classes baked in → push to registry
ASG              Launch template with OpenVox enrollment → auto-apply
K8s pod          Deploy container artifact to cluster

Four-Pillar Status

Every resource shows four things:

  1. Sync — does the actual infrastructure state match the declared spec? (instance type, security groups, disks, network — via Pulumi state)
  2. Puppet — did OpenVox successfully apply all classes? (last run status, any failures, catalog compilation errors)
  3. Health — are monitoring checks passing? (aggregates from Prometheus alerts, Naemon checks, cloud health APIs)
  4. Identity — is the resource fully enrolled? (DNS registered, certs valid, Vault authenticated, SSH host key signed)

Provider Plugin System

Extensible provider model — each provider implements an interface:

type Provider interface {
    Name() string

    // Lifecycle
    Plan(spec ResourceSpec) (*PlanResult, error)
    Apply(spec ResourceSpec) (*Resource, error)
    Destroy(id string) error

    // State
    Get(id string) (*Resource, error)
    List(filters Filters) ([]*Resource, error)
    Diff(spec ResourceSpec) (*DiffResult, error)

    // Introspection (like DA's type-writer)
    DiscoverResources() ([]*Resource, error)
    AvailableSizes() ([]Size, error)
    AvailableImages() ([]Image, error)
}

Built-in providers:

  • provider-aws — wraps Pulumi AWS
  • provider-xcpng — wraps Pulumi XO / Xen Orchestra API
  • provider-baremetal — wraps Tinkerbell / iPXE + IPMI/Redfish
  • provider-k8s — wraps Pulumi Kubernetes

Community can add: GCP, Azure, Hetzner, Proxmox, etc.
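
One way the Provider Registry could hold pluggable providers is a name-keyed map guarded by a lock. This is only a sketch: the `Provider` interface is trimmed to `Name()` so the example stays self-contained, and `Registry`, `stubProvider`, and the method names are hypothetical:

```go
package main

import (
	"fmt"
	"sort"
	"sync"
)

// Provider is trimmed to Name() for this sketch; the real interface is larger.
type Provider interface {
	Name() string
}

// Registry maps provider names to implementations.
type Registry struct {
	mu        sync.RWMutex
	providers map[string]Provider
}

func NewRegistry() *Registry {
	return &Registry{providers: make(map[string]Provider)}
}

// Register adds a provider and rejects duplicate names.
func (r *Registry) Register(p Provider) error {
	r.mu.Lock()
	defer r.mu.Unlock()
	if _, dup := r.providers[p.Name()]; dup {
		return fmt.Errorf("provider %q already registered", p.Name())
	}
	r.providers[p.Name()] = p
	return nil
}

// Get looks up a provider by the name used in --provider.
func (r *Registry) Get(name string) (Provider, error) {
	r.mu.RLock()
	defer r.mu.RUnlock()
	p, ok := r.providers[name]
	if !ok {
		return nil, fmt.Errorf("unknown provider %q", name)
	}
	return p, nil
}

// Names lists registered providers in sorted order.
func (r *Registry) Names() []string {
	r.mu.RLock()
	defer r.mu.RUnlock()
	names := make([]string, 0, len(r.providers))
	for n := range r.providers {
		names = append(names, n)
	}
	sort.Strings(names)
	return names
}

// stubProvider stands in for a real implementation such as provider-aws.
type stubProvider string

func (s stubProvider) Name() string { return string(s) }

func main() {
	reg := NewRegistry()
	for _, n := range []string{"aws", "xcpng", "baremetal", "k8s"} {
		if err := reg.Register(stubProvider(n)); err != nil {
			panic(err)
		}
	}
	fmt.Println(reg.Names())
}
```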

Health Aggregator Plugin System

type HealthSource interface {
    Name() string
    CheckHealth(resource *Resource) (*HealthResult, error)
}

Built-in sources:

  • health-prometheus — queries Prometheus alerting rules targeting the resource
  • health-naemon — queries Naemon host/service checks
  • health-cloudwatch — queries AWS CloudWatch alarms
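
Combining several sources into the single Health column can be sketched as "worst state wins". The result fields, the three-level `State` scale, and the choice to treat an unreachable source as degraded are all assumptions of this sketch. The `Resource` and `HealthResult` types are reduced to what the example needs:

```go
package main

import "fmt"

type State int

const (
	Healthy State = iota
	Degraded
	Failing
)

type Resource struct{ Name string }

type HealthResult struct {
	State  State
	Detail string
}

type HealthSource interface {
	Name() string
	CheckHealth(resource *Resource) (*HealthResult, error)
}

// Aggregate queries every source and keeps the worst result.
// A source that errors is reported as Degraded rather than hidden (assumption).
func Aggregate(r *Resource, sources []HealthSource) HealthResult {
	worst := HealthResult{State: Healthy, Detail: "all sources ok"}
	for _, s := range sources {
		res, err := s.CheckHealth(r)
		if err != nil {
			res = &HealthResult{State: Degraded, Detail: s.Name() + ": " + err.Error()}
		} else if res == nil {
			continue
		}
		if res.State > worst.State {
			worst = *res
		}
	}
	return worst
}

// staticSource is a stand-in for health-prometheus, health-naemon, etc.
type staticSource struct {
	name   string
	result HealthResult
	err    error
}

func (s staticSource) Name() string { return s.name }

func (s staticSource) CheckHealth(*Resource) (*HealthResult, error) {
	if s.err != nil {
		return nil, s.err
	}
	res := s.result
	return &res, nil
}

func main() {
	sources := []HealthSource{
		staticSource{name: "prometheus", result: HealthResult{State: Failing, Detail: "2 alerts firing"}},
		staticSource{name: "naemon", result: HealthResult{State: Healthy, Detail: "all checks passing"}},
	}
	fmt.Printf("%+v\n", Aggregate(&Resource{Name: "worker-3"}, sources))
}
```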

Profiles — T-Shirt Sizing

User-owned mappings:

sizes:
  medium:
    abstract: { cores: 4, memory: 8GB }
    providers:
      aws: { instance_type: t3.large }
      xcpng: { cores: 4, memory: 8192MB }
      baremetal: { min_cores: 4, min_memory: 8GB, maas_tag: medium }
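
Resolving a T-shirt size for a given provider amounts to a two-level lookup. A minimal sketch, assuming provider parameters are kept as plain string maps. `Size` and `ProviderParams` are hypothetical names:

```go
package main

import "fmt"

// Size holds the abstract shape plus per-provider parameters.
type Size struct {
	Cores     int
	MemoryGB  int
	Providers map[string]map[string]string
}

// ProviderParams returns the provider-specific parameters for a size,
// or an error when either the size or the provider mapping is missing.
func ProviderParams(sizes map[string]Size, size, provider string) (map[string]string, error) {
	s, ok := sizes[size]
	if !ok {
		return nil, fmt.Errorf("unknown size %q", size)
	}
	p, ok := s.Providers[provider]
	if !ok {
		return nil, fmt.Errorf("size %q has no mapping for provider %q", size, provider)
	}
	return p, nil
}

func exampleSizes() map[string]Size {
	return map[string]Size{
		"medium": {
			Cores:    4,
			MemoryGB: 8,
			Providers: map[string]map[string]string{
				"aws":   {"instance_type": "t3.large"},
				"xcpng": {"cores": "4", "memory": "8192MB"},
			},
		},
	}
}

func main() {
	params, err := ProviderParams(exampleSizes(), "medium", "aws")
	if err != nil {
		panic(err)
	}
	fmt.Println(params["instance_type"])
}
```

Returning an error for a missing mapping, rather than falling back to the abstract shape, keeps a label from being silently provisioned on a provider the user never sized.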

Artifact Builder

Puppet modules → container images:

label "mailserver"
  → puppet classes [postfix, dovecot, spamassassin, fail2ban]
  → Dockerfile generated:
      FROM ubuntu:24.04
      RUN apt-get update && apt-get install -y puppet-agent
      COPY modules/ /etc/puppetlabs/code/modules/
      RUN puppet apply --modulepath /etc/puppetlabs/code/modules \
            -e 'include postfix, dovecot, spamassassin, fail2ban'
      # Clean up puppet, leave only configured services
  → Image pushed to registry
  → Available as k8s deployment or standalone container
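
The Dockerfile generation step can be sketched as simple templating over a label's class list. The sketch applies classes with `puppet apply -e 'include ...'`, the usual way to apply named classes from the command line. The base image, package name, and module path are taken from the outline above, and `Dockerfile` is a hypothetical function name:

```go
package main

import (
	"fmt"
	"strings"
)

// Dockerfile renders a build recipe that bakes the given puppet classes
// into an image. Cleanup of the puppet agent after the run is left out.
func Dockerfile(base string, classes []string) string {
	const modulePath = "/etc/puppetlabs/code/modules"
	var b strings.Builder
	fmt.Fprintf(&b, "FROM %s\n", base)
	b.WriteString("RUN apt-get update && apt-get install -y puppet-agent\n")
	fmt.Fprintf(&b, "COPY modules/ %s/\n", modulePath)
	fmt.Fprintf(&b, "RUN puppet apply --modulepath %s -e 'include %s'\n",
		modulePath, strings.Join(classes, ", "))
	return b.String()
}

func main() {
	fmt.Print(Dockerfile("ubuntu:24.04", []string{"postfix", "dovecot", "spamassassin", "fail2ban"}))
}
```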

Tech Stack

Component         Technology                                  Why
Server            Go                                          Performance, single binary, Pulumi SDK, gRPC native
CLI               Go (cobra)                                  Same binary, kubectl-style
TUI               Go (bubbletea)                              Same binary, k9s-style
API               gRPC + REST (grpc-gateway)                  Type-safe, fast, REST fallback
IaC engine        Pulumi (Go SDK)                             Multi-provider, plan/preview, component packages
Config mgmt       OpenVox                                     Puppet modules, ENC, cert management
Bare metal        Tinkerbell or custom iPXE                   PXE boot, IPMI/Redfish
Container build   Buildah or Docker                           OCI images from puppet classes
State store       TBD (NOT etcd; see State Storage section)   Resource state, label definitions
K8s integration   client-go                                   Direct k8s API for deployments

Under The Hood — What We DON'T Build

  • Cloud APIs → Pulumi providers handle this
  • Puppet language/runtime → OpenVox handles this
  • Container runtime → containerd/Docker handles this
  • Monitoring → Prometheus/Naemon handle this
  • K8s orchestration → k3s/EKS handles this
  • PXE/DHCP/TFTP → Tinkerbell handles this
  • Certificate management → OpenVox CA handles this

We build the glue, the abstraction, the UX, and the lifecycle orchestration.

Kubernetes Management

Lab also controls what runs on k8s clusters:

$ lab get deployments
NAME          CLUSTER     LABEL        REPLICAS   IMAGE                    STATUS
mailserver    homelab     mailserver   2/2        org/mailserver:03.15     ✓ running
api           production  app          4/4        org/app:03.15            ✓ running
postgres      homelab     postgres     1/1        org/postgres:03.14       ✓ running

$ lab deploy --label app --cluster production --replicas 4
$ lab scale --label app --cluster production --replicas 6

Deployments reference labels: the same label that defines puppet classes also defines the container image, ports, health checks, and k8s resources.

Bootstrap, Onboarding, and Self-Deployment

Core Idea: Your Device Is The First Coordinator

You don't need a server to start. Your laptop/workstation runs the full lab engine locally. You onboard servers from it — including bare metal PXE boot. When ready, you migrate the coordinator role to one of the servers you've onboarded.

┌────────────┐     ┌────────────┐     ┌────────────┐     ┌────────────┐
│  Phase 0   │     │  Phase 1   │     │  Phase 2   │     │  Phase 3   │
│            │     │            │     │            │     │            │
│ lab init   │────►│ Onboard    │────►│ Move lab   │────►│ Onboard    │
│ --local    │     │ servers    │     │ to a real  │     │ remaining  │
│            │     │ from your  │     │ server     │     │ from the   │
│ Your device│     │ laptop     │     │            │     │ server     │
│ = lab      │     │            │     │            │     │            │
└────────────┘     └────────────┘     └────────────┘     └────────────┘

Architecture: CLI = Embedded Server

The CLI binary contains the full lab-server engine. The difference between modes is where state lives and whether the engine runs persistently.

┌──────────────────────────────────────┐
│            lab (single binary)        │
│                                       │
│  ┌─────────────────────────────────┐ │
│  │         Core Engine              │ │
│  │  (providers, labels, render,     │ │
│  │   lifecycle, identity, secrets,  │ │
│  │   PXE server, everything)        │ │
│  └─────────────────────────────────┘ │
│                                       │
│  Modes:                               │
│  ├── $ lab init --local → local mode  │
│  │     State: ~/.lab/state.db         │
│  │     PXE/DHCP: served from laptop   │
│  │     Full engine, no remote server  │
│  │                                    │
│  ├── $ lab server       → daemon mode │
│  │     State: /var/lib/lab/state.db   │
│  │     PXE/DHCP: served from this box │
│  │     Persistent API on port 7443    │
│  │                                    │
│  └── $ lab <command>    → client mode │
│        Talks to remote lab-server     │
│        (or local engine if no server) │
└──────────────────────────────────────┘
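
The mode decision in the diagram above can be sketched as a small function over the command line and the configured endpoint. This assumes that a bare `lab server` starts the daemon while subcommands such as `lab server migrate` behave like any other client command. `SelectMode` and the `Mode` constants are hypothetical names:

```go
package main

import "fmt"

type Mode string

const (
	ModeLocal  Mode = "local"  // embedded engine, state in ~/.lab/state.db
	ModeDaemon Mode = "daemon" // persistent engine, state in /var/lib/lab/state.db
	ModeClient Mode = "client" // talks to a remote lab-server
)

// SelectMode picks how the single binary should run.
// args are the arguments after "lab"; endpoint is the configured remote
// server address, or "" when none is configured.
func SelectMode(args []string, endpoint string) Mode {
	if len(args) == 1 && args[0] == "server" {
		return ModeDaemon
	}
	if endpoint != "" {
		return ModeClient
	}
	// No remote server configured: run the engine locally.
	return ModeLocal
}

func main() {
	fmt.Println(SelectMode([]string{"server"}, ""))
	fmt.Println(SelectMode([]string{"get", "servers"}, ""))
	fmt.Println(SelectMode([]string{"get", "servers"}, "beelink-ser9-pro:7443"))
}
```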

Onboarding Flow

lab onboard is the command to bring a new machine under management. It covers three scenarios: machines with an OS already installed, bare metal that needs network boot and OS installation, and bare metal with IPMI/Redfish access for remote power control.

Scenario A: Machine has OS (SSH onboard)

For machines that already have an OS (like DGX Spark with Ubuntu, or Mac Studio):

$ lab onboard dgx-spark --provider ssh --host 192.168.1.50 --user admin

Step 1: Render
  ┌──────────────┬────────────────────────┐
  │ Name         │ dgx-spark              │
  │ Provider     │ ssh (existing machine) │
  │ Host         │ 192.168.1.50           │
  │ OS           │ Ubuntu (detected)      │
  │ Arch         │ aarch64 (Grace)        │
  │ RAM          │ 128GB                  │
  │ GPU          │ CUDA (detected)        │
  └──────────────┴────────────────────────┘

  Onboarding will:
  + Install lab agent
  + Generate one-time enrollment token
  + Register in DNS: dgx-spark.lab.internal
  + Sign OpenVox certificate
  + Assign labels (interactive or --labels flag)

  Proceed? [y/N]: y

Step 2: Detect & assign labels
  Detected hardware:
    GPU: NVIDIA GB10 Grace Blackwell → suggesting label: cuda
    RAM: 128GB → suggesting label: ai-inference
    Arch: aarch64 → suggesting label: arm

  Assign labels [cuda,ai-inference,arm]: cuda,ai-inference,dgx-spark

Step 3: Apply (same engine as lab apply)
  → SSH into 192.168.1.50
  → Install lab agent binary
  → Generate one-time token
  → Lab agent enrolls:
    → OpenVox cert signed, classified in environment "production"
    → DNS A record: dgx-spark.lab.internal → 192.168.1.50
    → Identity established
  → Apply puppet classes from labels:
    → cuda: nvidia-drivers, cuda-toolkit
    → ai-inference: inference-runtime
  → Machine fully managed

$ lab get servers
NAME          PROVIDER  LABELS                     SYNC  PUPPET  HEALTH  IDENTITY
dgx-spark     ssh       cuda,ai-inference,dgx-spark ✓     ✓ ok    ✓       ✓ enrolled

Scenario B: Bare metal (PXE network boot)

For machines with no OS. Lab (on your laptop or server) becomes a PXE server on the local network, serves the OS installer, and onboards after installation:

$ lab onboard beelink-max --provider baremetal \
    --mac AA:BB:CC:DD:EE:FF \
    --image ubuntu-24.04 \
    --labels k8s-worker,rocm,longhorn

Step 1: Render
  ┌──────────────┬────────────────────────┐
  │ Name         │ beelink-max            │
  │ Provider     │ baremetal (PXE boot)   │
  │ MAC          │ AA:BB:CC:DD:EE:FF      │
  │ Image        │ ubuntu-24.04           │
  │ Labels       │ k8s-worker,rocm,longhorn│
  │ PXE server   │ this device (laptop)   │
  └──────────────┴────────────────────────┘

  Onboarding will:
  + Start PXE/DHCP/TFTP on local network interface
  + Wait for machine with MAC AA:BB:CC:DD:EE:FF to boot
  + Serve unattended Ubuntu 24.04 installer
  + After install: auto-enroll with one-time token baked into installer
  + Assign labels, apply puppet classes

  ⚠ PXE requires: network interface on same L2 segment as target machine
  ⚠ DHCP: will respond ONLY to MAC AA:BB:CC:DD:EE:FF (safe for existing networks)

  Proceed? [y/N]: y

Step 2: PXE boot phase
  → Starting PXE server on en0 (192.168.1.x)
  → DHCP offer scoped to MAC AA:BB:CC:DD:EE:FF only
  → Waiting for network boot request...

  ⏳ Power on the Beelink SER9 MAX and set it to boot from network (PXE)

  → Boot request received from AA:BB:CC:DD:EE:FF
  → Serving iPXE → kernel + initrd → autoinstall config
  → OS installation in progress...
  → Installation complete, machine rebooting

Step 3: Post-install enrollment (same as SSH onboard from here)
  → Machine boots with installed OS
  → Lab agent runs on first boot (installed during OS setup)
  → Uses one-time token (baked into autoinstall config) to enroll:
    → OpenVox cert signed
    → DNS: beelink-max.lab.internal → 192.168.1.100
    → Identity established
  → Apply puppet classes from labels:
    → k8s-worker: kubernetes::worker, containerd
    → rocm: rocm-drivers
    → longhorn: longhorn::node
  → Machine fully managed

$ lab get servers
NAME          PROVIDER    LABELS                      SYNC  PUPPET  HEALTH  IDENTITY
dgx-spark     ssh         cuda,ai-inference,dgx-spark ✓     ✓ ok    ✓       ✓ enrolled
beelink-max   baremetal   k8s-worker,rocm,longhorn    ✓     ✓ ok    ✓       ✓ enrolled

Scenario C: Onboard with IPMI/Redfish (remote power control)

For bare metal where you have IPMI/BMC access — Lab can power on the machine and set PXE boot remotely, fully hands-free:

$ lab onboard beelink-max --provider baremetal \
    --mac AA:BB:CC:DD:EE:FF \
    --ipmi 192.168.1.200 --ipmi-user admin \
    --image ubuntu-24.04 \
    --labels k8s-worker,rocm,longhorn

  → IPMI: setting next boot to PXE
  → IPMI: powering on machine
  → PXE server waiting for boot request...
  → (fully automated from here)

Homelab Bootstrap Walkthrough

The complete flow for setting up the homelab from zero:

# Phase 0: Local mode on your laptop
$ lab init --local
  ✓ Lab engine running locally
  ✓ State: ~/.lab/state.db
  ✓ Ready to onboard servers

# Phase 1: Onboard servers that already have an OS
$ lab onboard dgx-spark --provider ssh --host 192.168.1.50
  → Labels: [cuda, ai-inference, dgx-spark]

$ lab onboard mac-studio --provider ssh --host 192.168.1.51
  → Labels: [k8s-server, etcd, arm]

# Phase 2: Onboard bare metal (PXE from your laptop)
$ lab onboard beelink-ser9-pro --provider baremetal --mac XX:XX:XX:XX:XX:01 \
    --image ubuntu-24.04 --labels bootstrap,lab-server
  → PXE boot from laptop → install OS → enroll
  → This will become the permanent lab-server host

# Phase 3: Move lab-server to a real server
$ lab server migrate --target ssh --host beelink-ser9-pro
  → Lab-server deployed on Beelink SER9 Pro
  → State migrated from ~/.lab/state.db
  → PXE/DHCP now served from Beelink, not your laptop
  → CLI config updated: lab talks to beelink-ser9-pro:7443

# Phase 4: Onboard remaining servers (PXE from beelink-ser9-pro now)
$ lab onboard beelink-ser9-max --provider baremetal --mac XX:XX:XX:XX:XX:02 \
    --image ubuntu-24.04 --labels k8s-worker,rocm,longhorn
  → PXE served by beelink-ser9-pro (not your laptop anymore)

$ lab onboard minisforum-ms-r1 --provider baremetal --mac XX:XX:XX:XX:XX:03 \
    --image ubuntu-24.04 --labels k8s-worker,arm

# Phase 5: Set up k8s
$ lab apply cluster homelab --servers mac-studio,beelink-ser9-max,minisforum-ms-r1
  → mac-studio becomes k3s server (etcd)
  → beelink-ser9-max joins as worker
  → minisforum-ms-r1 joins as worker
  → All via puppet classes from labels

# Phase 6: Optionally move lab-server into k8s
$ lab server migrate --target kubernetes --cluster homelab
  → Lab-server now runs as k8s pod
  → Still manages everything including the cluster it runs on

# Final state:
$ lab get servers
NAME              PROVIDER    LABELS                       SYNC  PUPPET  HEALTH  IDENTITY
dgx-spark         ssh         cuda,ai-inference,dgx-spark  ✓     ✓ ok    ✓       ✓ enrolled
mac-studio        ssh         k8s-server,etcd,arm          ✓     ✓ ok    ✓       ✓ enrolled
beelink-ser9-pro  baremetal   bootstrap                    ✓     ✓ ok    ✓       ✓ enrolled
beelink-ser9-max  baremetal   k8s-worker,rocm,longhorn     ✓     ✓ ok    ✓       ✓ enrolled
minisforum-ms-r1  baremetal   k8s-worker,arm               ✓     ✓ ok    ✓       ✓ enrolled
lab-server        kubernetes  lab,control-plane             ✓     ✓ ok    ✓       ✓ enrolled

Enterprise Application: XCP-ng Bare Metal Deploy

The same onboarding flow works for deploying XCP-ng to enterprise bare metal:

$ lab onboard xen-host-42 --provider baremetal \
    --mac AA:BB:CC:DD:EE:FF \
    --ipmi 10.0.0.142 --ipmi-user admin \
    --image xcpng-8.3 \
    --labels xen-host,production,eu-west

  → IPMI: power on, PXE boot
  → Install XCP-ng 8.3 (unattended)
  → Enroll, apply puppet classes:
    → xen-host: xcpng::host, xcpng::networking, xcpng::storage
  → Host registered in Xen Orchestra pool
  → Ready to provision VMs on it

# Now create VMs on the XCP-ng host we just onboarded:
$ lab apply server app-12 --provider xcpng --labels app,production
  → VM created on xen-host-42 via Xen Orchestra API
  → OS installed, enrolled, puppet applied
  → Same flow as AWS EC2, just different provider

PXE Server Capabilities

When running in local or server mode, Lab includes an embedded PXE server:

  • DHCP: scoped to specific MACs only (safe for existing networks with DHCP)
  • TFTP: serves iPXE bootloader
  • HTTP: serves kernel, initrd, autoinstall configs
  • Autoinstall generation: creates unattended install configs per-machine with:
    • Lab agent pre-installed
    • One-time enrollment token baked in
    • Network config for the target environment
    • Disk layout per label/profile
  • Supported images: Ubuntu, Debian, RHEL/Rocky, XCP-ng (extensible)

PXE serving moves with lab-server — if you migrate lab to a new host, PXE is served from there. If lab is on your laptop, PXE is on your laptop. Same engine, same binary.
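
The MAC-scoped DHCP behaviour listed above can be sketched as a simple allowlist check. This is a minimal illustration, not Lab's implementation: `MACAllowlist`, `Add`, and `ShouldAnswer` are hypothetical names, and a real DHCP responder would call something like this before replying to a request.

```go
package main

import (
	"fmt"
	"net"
)

// MACAllowlist is a hypothetical set of hardware addresses that Lab is
// currently onboarding. Keys are normalized via net.HardwareAddr.String().
type MACAllowlist map[string]struct{}

// Add parses and registers a MAC address in any format net.ParseMAC accepts.
func (a MACAllowlist) Add(mac string) error {
	hw, err := net.ParseMAC(mac)
	if err != nil {
		return fmt.Errorf("invalid MAC %q: %w", mac, err)
	}
	a[hw.String()] = struct{}{}
	return nil
}

// ShouldAnswer reports whether a DHCP request from mac should get a reply.
// Unparseable or unknown addresses are ignored, so an existing DHCP server
// on the same network keeps serving every other device.
func (a MACAllowlist) ShouldAnswer(mac string) bool {
	hw, err := net.ParseMAC(mac)
	if err != nil {
		return false
	}
	_, ok := a[hw.String()]
	return ok
}

func main() {
	allow := MACAllowlist{}
	if err := allow.Add("AA:BB:CC:DD:EE:FF"); err != nil {
		panic(err)
	}
	fmt.Println(allow.ShouldAnswer("aa:bb:cc:dd:ee:ff")) // the onboarding target
	fmt.Println(allow.ShouldAnswer("11:22:33:44:55:66")) // some other device
}
```

Normalizing through `net.ParseMAC` means the `--mac` flag value and the address seen on the wire compare equal regardless of letter case.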

Hardware Detection During Onboard

When onboarding via SSH (existing OS), Lab detects hardware and suggests labels:

$ lab onboard new-server --provider ssh --host 10.0.0.50

Detected hardware:
  CPU:    AMD EPYC 7763 (x86_64, 64 cores)     → suggest: compute
  RAM:    256 GB                                 → suggest: high-memory
  GPU:    NVIDIA A100 80GB                       → suggest: cuda, ai-training
  Disk:   2x NVMe 1.92TB, 4x SSD 3.84TB        → suggest: storage
  NIC:    2x 25GbE, 1x 1GbE IPMI               → suggest: high-bandwidth

  Suggested labels: [compute, high-memory, cuda, ai-training, storage, high-bandwidth]
  Assign labels [accept/edit]: _

For PXE onboard, hardware detection happens after OS installation, and labels can be auto-confirmed or require interactive approval.
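
A minimal sketch of how detected hardware could map to suggested labels. The `Hardware` struct, the `SuggestLabels` function, and every threshold are assumptions made for illustration; the spec does not define them, and the sketch only suggests `cuda` for NVIDIA GPUs (it does not try to infer `ai-training`).

```go
package main

import (
	"fmt"
	"strings"
)

// Hardware is a hypothetical summary of what detection found on a host.
type Hardware struct {
	Cores    int
	RAMGB    int
	GPUModel string  // empty if no GPU was detected
	DiskTB   float64 // total raw capacity
	MaxNICGb int     // fastest data NIC, in Gbit/s
}

// SuggestLabels maps detected hardware to label suggestions.
// All thresholds below are illustrative assumptions.
func SuggestLabels(h Hardware) []string {
	var labels []string
	if h.Cores >= 32 {
		labels = append(labels, "compute")
	}
	if h.RAMGB >= 128 {
		labels = append(labels, "high-memory")
	}
	if strings.Contains(strings.ToUpper(h.GPUModel), "NVIDIA") {
		labels = append(labels, "cuda")
	}
	if h.DiskTB >= 10 {
		labels = append(labels, "storage")
	}
	if h.MaxNICGb >= 25 {
		labels = append(labels, "high-bandwidth")
	}
	return labels
}

func main() {
	// Roughly the example host from the detection output above.
	host := Hardware{Cores: 64, RAMGB: 256, GPUModel: "NVIDIA A100 80GB", DiskTB: 19.2, MaxNICGb: 25}
	fmt.Println("Suggested labels:", SuggestLabels(host))
}
```

The function only returns suggestions; the accept/edit prompt stays in the CLI, so the user still confirms the final label set.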

No Server? CLI Runs Locally

If no remote server is configured, every lab command runs the engine locally. This means you can use Lab in permanent local mode for simple setups:

$ lab get servers              # no remote server configured
  ⓘ Running locally (~/.lab/state.db)
  Tip: run `lab server migrate --target <target>` to deploy a persistent server

NAME          PROVIDER   LABELS     SYNC     PUPPET    HEALTH    IDENTITY
...

Self-Migration

Migration uses the same plan/apply as everything else:

$ lab server migrate --target ssh --host beelink-ser9-pro

Step 1: Plan
  ~ migrate lab-server from local (~/.lab) to ssh://beelink-ser9-pro
  + deploy lab-server container on beelink-ser9-pro
  + copy state.db to remote host
  + start PXE/DHCP services on remote host
  + stop local PXE/DHCP services
  + update CLI config to new endpoint

Step 2: Apply
  → Deploy lab-server on beelink-ser9-pro
  → Copy state to remote
  → Verify remote is healthy
  → Switch CLI config
  → Stop local engine

$ lab server migrate --target kubernetes --cluster homelab

Step 1: Plan
  ~ migrate lab-server from ssh://beelink-ser9-pro to kubernetes://homelab
  + k8s Deployment lab-server (1 replica)
  + k8s Service lab-server (port 7443)
  + PersistentVolumeClaim lab-server-state (10Gi)
  + migrate state.db to PVC
  + PXE services: move to k8s hostNetwork pod or keep on bootstrap node

  ⚠ Note: PXE/DHCP requires L2 network access. If k8s node is on the same
    L2 segment, use hostNetwork. Otherwise, keep PXE on the bootstrap node
    and only migrate the API/state to k8s.

Step 2: Apply
  → Deploy to k8s
  → Migrate state
  → Verify healthy
  → Update CLI config
  → Tear down old deployment
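
The ordering in the Apply steps matters: the old deployment is only stopped after the new one has been verified. A minimal sketch of that rule, assuming a hypothetical `Step` type and `Apply` function (neither is defined by the spec):

```go
package main

import (
	"errors"
	"fmt"
)

// ErrUnhealthy is a hypothetical error returned by a failed health check.
var ErrUnhealthy = errors.New("remote lab-server failed its health check")

// Step is one hypothetical migration action.
type Step struct {
	Name string
	Run  func() error
}

// Apply runs steps in order and stops at the first failure. It returns the
// names of the steps that completed, so later steps (such as stopping the
// old engine) never run when an earlier step fails.
func Apply(steps []Step) ([]string, error) {
	var done []string
	for _, s := range steps {
		if err := s.Run(); err != nil {
			return done, fmt.Errorf("step %q: %w", s.Name, err)
		}
		done = append(done, s.Name)
	}
	return done, nil
}

func main() {
	ok := func() error { return nil }
	steps := []Step{
		{"deploy lab-server on target", ok},
		{"copy state to remote", ok},
		{"verify remote is healthy", func() error { return ErrUnhealthy }},
		{"switch CLI config", ok},
		{"stop local engine", ok},
	}
	done, err := Apply(steps)
	fmt.Println("completed:", done)
	fmt.Println("error:", err)
}
```

In this run the health check fails, so the CLI config is never switched and the local engine keeps serving.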

Key Design Principles

  1. One engine everywhere — CLI, local mode, server mode, and init all share the same code
  2. Your device is the first coordinator — no chicken-and-egg, start from nothing
  3. Onboard uses the same pipeline as apply — render, plan, apply, enroll
  4. PXE is embedded — no external PXE/DHCP server needed, Lab serves it
  5. Hardware detection suggests labels — but the user confirms
  6. Migration is just plan/apply for lab-server — same engine, no special case
  7. Enterprise and homelab are the same flow — onboard XCP-ng bare metal = onboard homelab Beelink

Identity and Trust Layer

Inspired by what FreeIPA did well (auto-DNS, centralized SSH, server-scoped secrets, internal CA, IP mobility) without what it did badly (instability, hardcoded join secrets).

Lab controls the full lifecycle — it knows when a machine is born — so it can solve the enrollment problem properly: generate a one-time join token at provision time, inject it via cloud-init or iPXE userdata. No hardcoded secrets in images.

Provision-to-Enrolled Flow

$ lab apply server new-worker-5 --labels k8s-worker --provider aws

1. PROVISION   → Pulumi creates EC2 instance
2. IDENTITY    → Lab generates one-time join token (short-lived, single-use)
                → Token injected via cloud-init (or iPXE userdata for bare metal)
                → Token is NOT in the image — generated per-instance at provision time
3. ENROLL      → Machine boots, uses token to:
                  → Register with OpenVox (cert signed, node classified)
                  → Register in DNS (A record + PTR)
                  → Authenticate with Vault (get identity + policies per label)
                  → Get SSH CA-signed host key (no more TOFU)
4. CONFIGURE   → OpenVox applies classes
                → Machine pulls secrets it's allowed to access from Vault
                → e.g. k8s join token retrieved from Vault, node joins cluster
5. ENROLLED    → Lab marks resource identity as ✓ enrolled
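
The short-lived, single-use join token from step 2 can be sketched as follows. This is an in-memory illustration under stated assumptions: `TokenIssuer`, `Issue`, and `Consume` are hypothetical names, and whether Lab generates tokens itself or delegates to another tool is still open in this spec.

```go
package main

import (
	"crypto/rand"
	"encoding/hex"
	"errors"
	"fmt"
	"sync"
	"time"
)

var (
	ErrUnknownToken = errors.New("unknown or already used token")
	ErrExpiredToken = errors.New("token expired")
)

type pending struct {
	resource string
	expires  time.Time
}

// TokenIssuer is a hypothetical in-memory issuer of one-time join tokens.
type TokenIssuer struct {
	mu     sync.Mutex
	ttl    time.Duration
	tokens map[string]pending
}

func NewTokenIssuer(ttl time.Duration) *TokenIssuer {
	return &TokenIssuer{ttl: ttl, tokens: map[string]pending{}}
}

// Issue creates a random 256-bit token bound to one resource name.
func (i *TokenIssuer) Issue(resource string) (string, error) {
	buf := make([]byte, 32)
	if _, err := rand.Read(buf); err != nil {
		return "", err
	}
	tok := hex.EncodeToString(buf)
	i.mu.Lock()
	defer i.mu.Unlock()
	i.tokens[tok] = pending{resource: resource, expires: time.Now().Add(i.ttl)}
	return tok, nil
}

// Consume validates a token and deletes it in the same critical section,
// so a token can be redeemed at most once, even if it turns out expired.
func (i *TokenIssuer) Consume(tok string) (string, error) {
	i.mu.Lock()
	defer i.mu.Unlock()
	p, ok := i.tokens[tok]
	if !ok {
		return "", ErrUnknownToken
	}
	delete(i.tokens, tok)
	if time.Now().After(p.expires) {
		return "", ErrExpiredToken
	}
	return p.resource, nil
}

func main() {
	issuer := NewTokenIssuer(15 * time.Minute) // TTL is an assumed value
	tok, err := issuer.Issue("new-worker-5")
	if err != nil {
		panic(err)
	}
	res, err := issuer.Consume(tok)
	fmt.Println("first use:", res, err)
	_, err = issuer.Consume(tok)
	fmt.Println("second use:", err)
}
```

Binding the token to a resource name lets the enrollment step check that the machine presenting it is the one that was provisioned.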

What Each Machine Gets on Enrollment

| Capability | What happens | Tool underneath (TBD, needs investigation) |
|---|---|---|
| DNS auto-registration | A + PTR records created/updated automatically | CoreDNS API? ExternalDNS? PowerDNS? needs investigation |
| IP mobility | Machine restarts with new IP → DNS updated automatically | Lab agent on machine reports changes? DHCP hook? needs investigation |
| Server certificate | TLS cert issued for the machine, auto-renewed | OpenVox CA? Vault PKI secrets engine? cert-manager? needs investigation |
| SSH host key signing | Host key signed by CA, clients trust CA not individual keys | Vault SSH secrets engine? OpenVox CA? step-ca? needs investigation |
| SSH user access | Users get short-lived SSH certs, centrally managed | Vault SSH + OIDC? Teleport? Boundary? needs investigation |
| Secret access (RBAC) | Machine authenticates with Vault, gets label-scoped policy | Vault AppRole? Vault cert auth? needs investigation |
| K8s join tokens | Retrieved from Vault by entitled machines, used to join cluster | Vault KV + policy per label? needs investigation |
| OpenVox enrollment | Cert signed, environment + role + classes assigned | OpenVox CA + ENC (this one we know) |
| One-time join tokens | Generated per-instance at provision, single-use, short-lived | Lab itself generates these, or delegate to Vault? needs investigation |

Important: We don't need to build any of these from scratch. Each row is a capability that likely has an existing tool we can wrap. Just like we use Pulumi for cloud APIs and OpenVox for config management, we'll find the right tool for each identity concern. Each row still needs investigation; we'll evaluate the options together, one by one.

CLI: Identity Information

$ lab get servers
NAME       PROVIDER   LABELS       SYNC  PUPPET  HEALTH  IDENTITY
worker-5   aws        k8s-worker   ✓     ✓ ok    ✓       ✓ enrolled
worker-6   xcpng      k8s-worker   ✓     ✓ ok    ✓       ✓ enrolled
worker-7   baremetal  k8s-worker   ✓     ✗ fail  ⚠       ⚠ cert expiring
new-box    aws        k8s-worker   ✓     …       …       ⏳ enrolling

$ lab describe server worker-5
...
Identity:
  DNS:          worker-5.lab.internal (A: 10.0.1.45, PTR: ✓)
  OpenVox:      ✓ cert signed (expires 2027-03-15)
  Vault:        ✓ authenticated (policy: k8s-worker)
  SSH Host Key: ✓ CA-signed (fingerprint: SHA256:abc...)
  Secrets:      k8s/join-token, tls/node-cert (2 accessible)
  Enrolled:     2026-03-15 14:22:03 (one-time token, consumed)
  Last Check-in: 2026-03-15 15:01:12 (38 seconds ago)

$ lab get secrets --label k8s-worker
SECRET              TYPE     ACCESSIBLE BY           LAST ROTATED
k8s/join-token      dynamic  k8s-worker (12 srv)     2026-03-15
tls/cluster-ca      static   k8s-worker, k8s-server  2026-01-01
monitoring/api-key  static   k8s-worker, monitoring  2026-02-28

$ lab identity renew worker-5    # force cert/key renewal
$ lab identity revoke worker-5   # revoke all creds, remove from DNS, unenroll

Secrets — Code Is The Policy

Design principle: If your code/config declares "I use secret X", that IS the access grant. No one goes to a separate UI to edit policies. Default is locked — if not mentioned, no access. If mentioned, access is automatic.

The declaration IS the policy:

labels:
  mailserver:
    puppet_classes:
      - postfix
      - dovecot
    secrets:
      - mail/tls-cert
      - mail/dkim-key
      - mail/relay-credentials
    ports: [25, 587, 993]

When Lab applies label mailserver to a server, it automatically:

  1. Grants that server access to mail/tls-cert, mail/dkim-key, mail/relay-credentials
  2. Denies access to everything else
  3. No separate policy file, no Vault admin, no ticket to security team

When a puppet class references a secret:

# modules/postfix/manifests/init.pp
class postfix {
  $relay_creds = lab::secret('mail/relay-credentials')

  file { '/etc/postfix/sasl_passwd':
    content => $relay_creds,
    mode    => '0600',
  }
}

The lab::secret() call is both the usage AND the declaration that this class needs this secret. Lab scans puppet classes, discovers secret references, and auto-generates the access policy. If the postfix class is applied to a server via a label, that server gets access to mail/relay-credentials. Remove the class → access revoked.
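
A minimal sketch of the discovery step, assuming a plain regular-expression scan over manifest text. `FindSecretRefs` is a hypothetical helper; a regex only catches literal string arguments and does not skip comments, so a real implementation would probably need the Puppet parser.

```go
package main

import (
	"fmt"
	"regexp"
	"sort"
)

// secretCall matches lab::secret('path') or lab::secret("path") where the
// argument is a string literal. Variables and interpolation are not handled.
var secretCall = regexp.MustCompile(`lab::secret\(\s*['"]([^'"]+)['"]\s*\)`)

// FindSecretRefs returns the sorted, de-duplicated secret paths that a
// manifest references through lab::secret().
func FindSecretRefs(manifest string) []string {
	seen := map[string]struct{}{}
	for _, m := range secretCall.FindAllStringSubmatch(manifest, -1) {
		seen[m[1]] = struct{}{}
	}
	refs := make([]string, 0, len(seen))
	for p := range seen {
		refs = append(refs, p)
	}
	sort.Strings(refs)
	return refs
}

func main() {
	manifest := `
class postfix {
  $relay_creds = lab::secret('mail/relay-credentials')

  file { '/etc/postfix/sasl_passwd':
    content => $relay_creds,
    mode    => '0600',
  }
}`
	fmt.Println(FindSecretRefs(manifest))
}
```

The resulting list per class is what would feed the auto-generated access policy for every label that applies the class.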

Secrets must be equally easy to access from anywhere:

| Runtime | How you get a secret | Same underneath |
|---|---|---|
| Puppet code | lab::secret('mail/tls-cert') | Lab agent on machine fetches from secret backend |
| App on VM | LAB_SECRET_MAIL_TLS_CERT env var, or /run/secrets/mail/tls-cert file | Lab agent provides via env or tmpfs mount |
| App in Kubernetes | Same env var or volume mount | Lab k8s operator syncs to K8s Secret object |
| App in Docker (standalone) | --env-file or bind mount from lab agent | Lab agent on host provides |
| Script / cron job | lab secret get mail/tls-cert CLI call | Lab CLI authenticated via machine identity |
| cloud-init / bootstrap | Injected at provision time via one-time token | Lab server provides during enrollment |

One way to consume secrets, regardless of where you run. The lab agent (or k8s operator, or CLI) handles authentication and fetching transparently. The app just reads an env var or file.
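
The table implies a fixed mapping from a secret path to an env var name and a file path. A minimal sketch of that mapping, inferred from the single example (mail/tls-cert → LAB_SECRET_MAIL_TLS_CERT); `EnvName` and `FilePath` are hypothetical helpers and the exact rules are an assumption.

```go
package main

import (
	"fmt"
	"strings"
)

// EnvName converts a secret path such as "mail/tls-cert" into an env var
// name such as "LAB_SECRET_MAIL_TLS_CERT". Every character outside A-Z and
// 0-9 (after upper-casing) becomes an underscore.
func EnvName(path string) string {
	var b strings.Builder
	b.WriteString("LAB_SECRET_")
	for _, r := range strings.ToUpper(path) {
		if (r >= 'A' && r <= 'Z') || (r >= '0' && r <= '9') {
			b.WriteRune(r)
		} else {
			b.WriteByte('_')
		}
	}
	return b.String()
}

// FilePath returns where the agent would expose the secret as a file.
func FilePath(path string) string {
	return "/run/secrets/" + path
}

func main() {
	for _, p := range []string{"mail/tls-cert", "k8s/join-token"} {
		fmt.Println(p, "=>", EnvName(p), FilePath(p))
	}
}
```

One consequence of this mapping is that paths differing only in punctuation (for example mail/tls-cert and mail/tls_cert) collide as env vars, so path naming rules would need to forbid that.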

How Access Flows

                Label "mailserver"
                declares secrets:
                  - mail/tls-cert
                  - mail/dkim-key
                        │
                        ▼
            ┌───────────────────────┐
            │  Lab compiles policy  │
            │                       │
            │  server mail-1:       │
            │    CAN access:        │
            │      mail/tls-cert    │
            │      mail/dkim-key    │
            │    CANNOT access:     │
            │      k8s/*            │
            │      postgres/*       │
            │      (everything else)│
            └───────────┬───────────┘
                        │
                        ▼
            ┌───────────────────────┐
            │  Secret backend       │
            │  (TBD — needs         │
            │   investigation)      │
            │                       │
            │  Enforces policy at   │
            │  backend level, not   │
            │  just in Lab          │
            └───────────────────────┘
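
The compile step in the diagram amounts to a union of declared secrets with default deny. A minimal sketch, assuming hypothetical `LabelDef` and `Policy` types that only model the `secrets:` list of a label:

```go
package main

import "fmt"

// LabelDef is a hypothetical, reduced view of a label definition:
// only the secrets it declares.
type LabelDef struct {
	Secrets []string
}

// Policy is the set of secret paths one server may read.
type Policy map[string]struct{}

// Compile unions the secrets declared by every label applied to a server.
// Labels without a definition contribute nothing.
func Compile(defs map[string]LabelDef, applied []string) Policy {
	p := Policy{}
	for _, name := range applied {
		for _, s := range defs[name].Secrets {
			p[s] = struct{}{}
		}
	}
	return p
}

// Allows reports whether a path was declared. Anything not declared is
// denied, which is the "default is locked" rule.
func (p Policy) Allows(path string) bool {
	_, ok := p[path]
	return ok
}

func main() {
	defs := map[string]LabelDef{
		"mailserver": {Secrets: []string{"mail/tls-cert", "mail/dkim-key"}},
		"k8s-worker": {Secrets: []string{"k8s/join-token"}},
	}
	mail1 := Compile(defs, []string{"mailserver"})
	fmt.Println("mail/tls-cert: ", mail1.Allows("mail/tls-cert"))
	fmt.Println("k8s/join-token:", mail1.Allows("k8s/join-token"))
}
```

Because the policy is derived on every compile, removing a label or a `secrets:` entry revokes access without any separate policy edit.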

Secret Sources

Secrets themselves can come from multiple places:

secrets:
  mail/tls-cert:
    type: dynamic                 # generated/rotated automatically
    generator: acme               # cert-manager / Let's Encrypt
    rotate_every: 90d

  mail/dkim-key:
    type: static                  # manually set, stored encrypted
    set_by: admin                 # who last set it

  mail/relay-credentials:
    type: static
    set_by: admin

  k8s/join-token:
    type: dynamic
    generator: kubernetes         # fetched from k8s API
    rotate_every: 24h

  tls/node-cert:
    type: dynamic
    generator: ca                 # issued per-machine from internal CA
    per_machine: true             # each machine gets its own
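
Values like `rotate_every: 90d` need a small parser, because Go's `time.ParseDuration` has no day unit. A minimal sketch, assuming a day always means 24 hours; `ParseInterval` and `RotationDue` are hypothetical helpers.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
	"time"
)

// ParseInterval parses rotate_every values such as "90d" or "24h".
// A trailing "d" is treated as whole days of 24 hours (an assumption);
// everything else is handed to time.ParseDuration.
func ParseInterval(s string) (time.Duration, error) {
	if strings.HasSuffix(s, "d") {
		n, err := strconv.Atoi(strings.TrimSuffix(s, "d"))
		if err != nil || n <= 0 {
			return 0, fmt.Errorf("invalid interval %q", s)
		}
		return time.Duration(n) * 24 * time.Hour, nil
	}
	d, err := time.ParseDuration(s)
	if err != nil || d <= 0 {
		return 0, fmt.Errorf("invalid interval %q", s)
	}
	return d, nil
}

// RotationDue reports whether a secret last rotated at lastRotated has
// reached its rotation interval at time now.
func RotationDue(lastRotated, now time.Time, every time.Duration) bool {
	return !now.Before(lastRotated.Add(every))
}

func main() {
	every, err := ParseInterval("90d")
	if err != nil {
		panic(err)
	}
	last := time.Date(2026, 3, 14, 0, 0, 0, 0, time.UTC)
	fmt.Println("after 30 days:", RotationDue(last, last.AddDate(0, 0, 30), every))
	fmt.Println("after 90 days:", RotationDue(last, last.AddDate(0, 0, 90), every))
}
```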

CLI for Secrets

$ lab get secrets
SECRET                   TYPE      USED BY              LAST ROTATED
mail/tls-cert            dynamic   mailserver (2 srv)   2026-03-14
mail/dkim-key            static    mailserver (2 srv)   2026-01-15
mail/relay-credentials   static    mailserver (2 srv)   2026-02-01
k8s/join-token           dynamic   k8s-worker (12 srv)  2026-03-15
tls/node-cert            dynamic   * (all enrolled)     per-machine

$ lab secret set mail/relay-credentials
  Enter value: ****
  ✓ Updated. Accessible by: mailserver (2 servers)
  ✓ Servers will pick up new value within 60s

$ lab show secret mail/relay-credentials
Secret: mail/relay-credentials
Type: static
Last set: 2026-03-15 by admin

Accessible by (derived from code):
  Label "mailserver" → puppet class "postfix" → lab::secret('mail/relay-credentials')
    ├── mail-1 (xcpng)    last fetched: 12m ago
    └── mail-2 (aws)      last fetched: 12m ago

  No other references found in any applied code.

$ lab secret audit
  ⚠ Secret "old/api-key" is defined but not referenced by any code (orphaned?)
  ⚠ Secret "db/password" referenced by class "app::database" but never set (empty!)
  ✓ All other secrets are referenced by at least one applied class/label
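
The two warnings in the audit output correspond to two set differences. A minimal sketch, assuming the audit receives the defined secrets (with a flag for whether a value is set) and the paths referenced by applied code; `Audit` and its signature are hypothetical.

```go
package main

import (
	"fmt"
	"sort"
)

// Audit compares the secrets defined in the store with the secret paths
// referenced by applied code. defined maps each path to whether a value
// has been set. It returns:
//   - orphaned: defined but referenced nowhere
//   - unset:    referenced but missing from the store or without a value
func Audit(defined map[string]bool, referenced []string) (orphaned, unset []string) {
	refs := make(map[string]struct{}, len(referenced))
	for _, p := range referenced {
		refs[p] = struct{}{}
	}
	for p := range defined {
		if _, ok := refs[p]; !ok {
			orphaned = append(orphaned, p)
		}
	}
	for p := range refs {
		if !defined[p] {
			unset = append(unset, p)
		}
	}
	sort.Strings(orphaned)
	sort.Strings(unset)
	return orphaned, unset
}

func main() {
	defined := map[string]bool{
		"mail/dkim-key": true,
		"old/api-key":   true,
		"db/password":   false, // declared, but no value was ever set
	}
	referenced := []string{"mail/dkim-key", "db/password"}
	orphaned, unset := Audit(defined, referenced)
	fmt.Println("orphaned:", orphaned)
	fmt.Println("unset:   ", unset)
}
```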

Secret Architecture — Distributed, Offline-Capable

Critical requirement: Nothing breaks if the central secret server (or any server) is unreachable. Everything continues to work from the local encrypted cache, including new pods, deployments, and puppet runs. This is not an edge case; it is a core design requirement.

This means secrets are NOT a central server you query. They're a distributed, synced, encrypted dataset with offline capability.

┌─────────────────────────────────────────────────────────────┐
│                    Secret Distribution Model                 │
│                                                              │
│   NOT this (central server):        THIS (distributed sync): │
│                                                              │
│       ┌─────────┐                  ┌──────┐  ┌──────┐       │
│       │ Vault   │                  │ Node │◄─►│ Node │       │
│       └────┬────┘                  └──┬───┘  └──┬───┘       │
│       ┌────┼────┐                     │    ▲    │            │
│       │    │    │                     ▼    │    ▼            │
│      ┌┴┐  ┌┴┐  ┌┴┐               ┌──────┐  ┌──────┐       │
│      │N│  │N│  │N│               │ Node │◄─►│ Node │       │
│      └─┘  └─┘  └─┘               └──┬───┘  └──────┘       │
│   (all dead if vault               │                        │
│    is unreachable)                  ▼                        │
│                               ┌──────────┐                  │
│                               │ Git repo │ (encrypted       │
│                               │ (backup) │  backup of       │
│                               └──────────┘  last resort)    │
└─────────────────────────────────────────────────────────────┘

How It Works

Layer 1: Local Encrypted Cache (on every machine)

  • Every machine that has access to secrets stores them locally, encrypted at rest
  • Encrypted with machine-specific key (derived from machine identity/TPM/secure enclave)
  • Puppet runs, app starts, pod deployments — all read from local cache
  • If cache is fresh → use it, no network call needed
  • Cache has TTL per secret, but stale cache is better than no secret
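
The "stale cache is better than no secret" rule can be sketched as a lookup that flags staleness instead of failing. This is an in-memory illustration only: `Cache`, `Put`, and `Get` are hypothetical names, and encryption at rest is deliberately left out.

```go
package main

import (
	"fmt"
	"time"
)

type entry struct {
	value   []byte
	fetched time.Time
	ttl     time.Duration // 0 means "no expiry"
}

// Cache is a hypothetical in-memory stand-in for the local secret cache.
type Cache struct {
	entries map[string]entry
}

func NewCache() *Cache { return &Cache{entries: map[string]entry{}} }

// Put stores a value together with the time it was fetched.
func (c *Cache) Put(path string, value []byte, ttl time.Duration) {
	c.entries[path] = entry{value: value, fetched: time.Now(), ttl: ttl}
}

// Get returns the cached value and whether it is past its TTL. A stale
// value is still returned; the caller is expected to use it and try a
// refresh from the secret-store in the background.
func (c *Cache) Get(path string) (value []byte, stale, found bool) {
	e, ok := c.entries[path]
	if !ok {
		return nil, false, false
	}
	stale = e.ttl != 0 && time.Now().After(e.fetched.Add(e.ttl))
	return e.value, stale, true
}

func main() {
	c := NewCache()
	c.Put("k8s/join-token", []byte("fresh-token"), 24*time.Hour)
	c.Put("mail/dkim-key", []byte("key"), 0)               // no expiry
	c.Put("mail/tls-cert", []byte("old-cert"), -time.Hour) // already past TTL

	for _, p := range []string{"k8s/join-token", "mail/dkim-key", "mail/tls-cert", "db/password"} {
		_, stale, found := c.Get(p)
		fmt.Printf("%-16s found=%v stale=%v\n", p, found, stale)
	}
}
```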

Layer 2: Secret Store (privileged nodes that hold all secrets)

  • One or more nodes with the secret-store label hold the COMPLETE encrypted dataset
  • This is NOT a special server type — it's a label, applied to pods, VMs, or bare metal
  • Should have at least 2 replicas for HA
  • Machines fetch ONLY the secrets their labels entitle them to from the store
  • The store enforces policy — a machine with label mailserver gets mail/*, nothing else
  • Machines NEVER sync with each other directly — they only talk to the store
  • This prevents secret sprawl (no machine accumulates secrets it shouldn't have)

Layer 3: Git Encrypted Backup (last resort recovery)

  • All secrets (encrypted with a master key) backed up to a Git repo
  • If a machine has an empty cache AND the secret-store is unreachable → the operator restores the secret-store from the Git backup
  • SOPS/age style encryption — secrets encrypted, metadata (paths, policies) in plaintext
  • Git gives versioning, audit trail, and disaster recovery for free
  • The Git repo alone is useless without the decryption key

Layer 4: Lab-server (coordinator, NOT single point of failure)

  • Lab-server is the preferred interface to set/rotate secrets (via CLI/API)
  • Lab-server does NOT need to be the secret-store (but can be, via label)
  • If lab-server is down, machines keep running from local cache
  • No new secrets can be distributed while secret-store is down
  • But nothing breaks — existing workloads continue uninterrupted
  • When secret-store comes back, machines sync and catch up

Separation of concerns:

  • lab-server = coordination, API, lifecycle management
  • secret-store label = holds all secrets, serves policy-filtered requests
  • These CAN be the same node (apply both labels) or separate nodes
  • For homelab: same node is fine. For enterprise: separate for isolation

Recovery Scenarios

Scenario 1: Lab-server down, secret-store up
  → All machines continue working from local cache
  → Machines can still fetch/refresh secrets from secret-store
  → No new resources can be provisioned (lab-server manages lifecycle)
  → But existing workloads are unaffected

Scenario 2: Secret-store down, lab-server up
  → All machines continue working from local cache
  → Lab-server can still manage lifecycle (provision, plan, apply)
  → No new secrets can be distributed
  → No secret rotations until store is back
  → Lab-server shows: ⚠ secret-store unreachable

Scenario 3: Both down
  → All machines continue working from local cache
  → Nothing new can happen, but nothing breaks
  → Recovery priority: restore secret-store first (from Git backup)

Scenario 4: Machine reboots, cache intact
  → Reads from local encrypted cache immediately
  → Refreshes from secret-store in background to catch up
  → No dependency on lab-server for startup

Scenario 5: Machine rebuilt, cache empty
  → Machine has its identity (from enrollment) but no secrets
  → Fetches entitled secrets from secret-store (policy-filtered)
  → If secret-store unreachable → cannot start (needs secrets)
  → Operator can restore secret-store from Git backup to unblock

Scenario 6: Total disaster, only Git backup survives
  → Deploy new node, apply `secret-store` label
  → Restore encrypted secrets from Git backup
  → Deploy lab-server (lab init)
  → New machines enroll and receive their entitled secrets
  → System fully recovered

Scenario 7: New pod in k8s, secret-store unreachable
  → K8s node has local secret cache for its entitled secrets
  → Lab k8s operator serves pod secrets from node's local cache
  → Pod starts with cached secrets
  → No interruption to deployments

CLI for Secret Distribution

$ lab secret status
SECRET DISTRIBUTION STATUS:
  Local cache:     ✓ 8 secrets cached (of 8 entitled), encrypted, fresh (< 5m old)
  Secret store:    ✓ connected (2 replicas: store-1, store-2)
  Lab-server:      ✓ connected
  Git backup:      ✓ last push 2026-03-15 14:30:00 (47 total secrets)

$ lab secret status --store
SECRET STORE:
  Replicas:        2/2 healthy
    store-1        k8s pod    ✓ synced   47 secrets (all)
    store-2        vm/xcpng   ✓ synced   47 secrets (all)
  Git backup:      ✓ synced   2026-03-15 14:30:00
  Total secrets:   47
  Entitled consumers:
    k8s-worker (12 machines)  → 3 secrets each
    mailserver (2 machines)   → 5 secrets each
    postgres (3 machines)     → 4 secrets each
    lab-server (1 machine)    → 2 secrets

$ lab secret cache
LOCAL CACHE:
SECRET                   CACHED     TTL        STATUS
mail/tls-cert            ✓          89d left   fresh
mail/dkim-key            ✓          no expiry  fresh
k8s/join-token           ✓          23h left   fresh
tls/node-cert            ✓          346d left  fresh

$ lab secret recover --from git
  → Fetching encrypted backup from git@github.com:org/lab-secrets.git
  → Decrypting with master key...
  → Restored 23 secrets
  → Syncing with available secret-store replicas...

Local Cache Security

The local cache must be stored securely — needs investigation:

  • Encrypted at rest with machine-specific key
  • Key derived from: TPM 2.0? Secure enclave? LUKS-bound? needs investigation
  • Memory-mapped, not swappable (mlock)
  • Accessible only by lab agent (file permissions + MAC/SELinux)
  • Wiped on machine decommission (lab identity revoke)
  • Possibly use kernel keyring on Linux — needs investigation

Secret Backend — NOT Decided

The underlying secret storage/sync mechanism is pluggable:

type SecretBackend interface {
    Name() string

    // CRUD
    Get(path string, identity *MachineIdentity) ([]byte, error)
    Set(path string, value []byte) error
    Delete(path string) error
    List(prefix string) ([]string, error)

    // Policy (auto-generated from code/labels)
    GrantAccess(path string, identity *MachineIdentity) error
    RevokeAccess(path string, identity *MachineIdentity) error

    // Dynamic
    Generate(path string, generator GeneratorConfig) ([]byte, error)
    Rotate(path string) error

    // Distribution
    SyncWith(peer PeerInfo) error
    CacheLocally(secrets []Secret) error
    RestoreFromBackup(source BackupSource) error
}

Possible approaches (each needs investigation):

  • SOPS + age + Git — simplest, encrypted files in Git, but no peer sync
  • OpenBao — Vault fork, has replication, but still central-server mindset
  • Sealed Secrets / External Secrets Operator — k8s-native, but not universal
  • Infisical — developer-friendly, but SaaS-oriented
  • Custom: encrypted SQLite + peer sync — simple, we control the sync protocol
  • etcd with encryption — distributed by nature, but might be overkill
  • CockroachDB — distributed SQL, encrypted, survives node failures
  • Consul — distributed KV with gossip, HashiCorp though
  • Lab's own sync protocol — gossip-based, encrypted, purpose-built

The right answer might be a combination:

  • SOPS/age for encryption format (proven, auditable)
  • Custom gossip sync for distribution (lightweight)
  • Git for backup (free versioning and DR)
  • Or wrap an existing distributed KV that already handles sync

This is the most complex subsystem in Lab and needs careful investigation.

Identity Plugin System

Same extensible pattern as providers and health sources:

type IdentityPlugin interface {
    Name() string

    // Enrollment
    Enroll(resource *Resource, token string) (*Identity, error)
    Revoke(resource *Resource) error

    // Status
    Status(resource *Resource) (*IdentityStatus, error)

    // Renewal
    Renew(resource *Resource) error
}

This allows swapping identity backends without changing the rest of Lab. We might start with Vault + OpenVox CA and later add/replace components.
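
A minimal sketch of what swapping backends could look like: a toy in-memory plugin that satisfies the interface, plus a name-keyed registry. `Resource`, `Identity`, and `IdentityStatus` are reduced to placeholder structs here, and `memoryPlugin` and `Registry` are hypothetical names, not part of the spec.

```go
package main

import (
	"errors"
	"fmt"
)

// Placeholder types; the real definitions are not part of this section.
type Resource struct{ Name string }
type Identity struct{ Resource string }
type IdentityStatus struct{ Enrolled bool }

// IdentityPlugin repeats the interface from the spec.
type IdentityPlugin interface {
	Name() string
	Enroll(resource *Resource, token string) (*Identity, error)
	Revoke(resource *Resource) error
	Status(resource *Resource) (*IdentityStatus, error)
	Renew(resource *Resource) error
}

// memoryPlugin is a toy backend that only tracks enrollment in a map.
type memoryPlugin struct{ enrolled map[string]bool }

func NewMemoryPlugin() IdentityPlugin { return &memoryPlugin{enrolled: map[string]bool{}} }

func (m *memoryPlugin) Name() string { return "memory" }

func (m *memoryPlugin) Enroll(r *Resource, token string) (*Identity, error) {
	if token == "" {
		return nil, errors.New("missing join token")
	}
	m.enrolled[r.Name] = true
	return &Identity{Resource: r.Name}, nil
}

func (m *memoryPlugin) Revoke(r *Resource) error {
	delete(m.enrolled, r.Name)
	return nil
}

func (m *memoryPlugin) Status(r *Resource) (*IdentityStatus, error) {
	return &IdentityStatus{Enrolled: m.enrolled[r.Name]}, nil
}

func (m *memoryPlugin) Renew(r *Resource) error {
	if !m.enrolled[r.Name] {
		return errors.New("resource is not enrolled")
	}
	return nil
}

// Registry maps plugin names to implementations, so callers depend only
// on the interface.
type Registry map[string]IdentityPlugin

func (r Registry) Register(p IdentityPlugin) { r[p.Name()] = p }

func main() {
	reg := Registry{}
	reg.Register(NewMemoryPlugin())

	worker := &Resource{Name: "worker-5"}
	plugin := reg["memory"]
	if _, err := plugin.Enroll(worker, "one-time-token"); err != nil {
		panic(err)
	}
	st, _ := plugin.Status(worker)
	fmt.Println(plugin.Name(), "enrolled:", st.Enrolled)
}
```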

State Storage — Design Principles

NOT etcd. etcd prioritizes consistency over availability: when it loses quorum it stops accepting writes rather than risk serving inconsistent data. For Lab, availability wins:

  • Losing a few events is better than total outage
  • Should auto-backup and auto-restore on corruption
  • Should degrade gracefully, never crash and refuse to start
  • Stale data is acceptable, no data is not
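
These rules can be read as a startup decision that never ends in "refuse to start". A minimal sketch under stated assumptions: `StartupAction` and `DecideStartup` are hypothetical, and starting empty when no backup exists is one possible reading of "degrade gracefully" (a real implementation would presumably move a corrupt file aside rather than delete it).

```go
package main

import "fmt"

// StartupAction is what the state layer does when lab-server starts.
type StartupAction int

const (
	UseExisting StartupAction = iota
	RestoreFromBackup
	StartEmpty
)

func (a StartupAction) String() string {
	switch a {
	case UseExisting:
		return "use existing state"
	case RestoreFromBackup:
		return "restore from latest backup"
	default:
		return "start with empty state and warn"
	}
}

// DecideStartup picks an action from three observations. Every branch
// leads to a running server; none of them aborts startup.
func DecideStartup(dbPresent, dbHealthy, backupAvailable bool) StartupAction {
	switch {
	case dbPresent && dbHealthy:
		return UseExisting
	case backupAvailable:
		// Stale data from a backup beats no data.
		return RestoreFromBackup
	default:
		return StartEmpty
	}
}

func main() {
	fmt.Println(DecideStartup(true, true, true))    // healthy database
	fmt.Println(DecideStartup(true, false, true))   // corrupt, backup exists
	fmt.Println(DecideStartup(false, false, false)) // nothing available
}
```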

Requirements:

  • Stores: resource state, label definitions, group membership, alert configs, audit log
  • Must survive lab-server restart
  • Must be migratable (lab-server can move between hosts)
  • Should auto-backup (to Git, S3, or local snapshots)
  • Should auto-recover from corruption without operator intervention
  • Embedded (no external dependency) preferred for simplicity

Candidates (needs investigation):

  • SQLite — embedded, simple, proven, WAL mode for concurrent reads, easy to backup (copy file)
  • bbolt/BoltDB — embedded KV, used by etcd ironically, simpler than etcd itself
  • Badger — embedded KV in Go, LSM-tree, good performance
  • DuckDB — embedded analytical DB, might be overkill
  • PostgreSQL — if we need multi-server state, but adds external dependency
  • Litestream — SQLite + continuous replication to S3/GCS/Azure (interesting combo)

SQLite + Litestream is the current leading candidate:

  • SQLite for simplicity and embeddability
  • Litestream for continuous backup to S3/GCS/local without stopping the database
  • Auto-restore: if DB is missing, Litestream restores from latest backup
  • Single file, easy to migrate when lab-server moves
  • But needs investigation to confirm it handles our scale

Open Questions

  1. Name: "lab" is simple but generic. Alternatives?
  2. GitOps integration — should label/profile changes go through Git, or direct API?
  3. Multi-tenancy — how to scope labels/resources per team?
  4. Auth — mTLS between CLI and server? OIDC? Vault-issued tokens?
  5. Input format — TypeScript (DA-style), YAML (Compose-style), or both?
  6. Should lab init deploy lab-server as a container (portable) or native binary (simpler)?