2026-03-15 23:50:43 +00:00
# Lab — Unified Infrastructure Lifecycle Platform
## What It Is
A tool that abstracts infrastructure lifecycle across clouds, hypervisors, bare metal,
and Kubernetes — using labels as the universal abstraction and existing tools under the hood.
**Not reinventing the wheel.** Uses Pulumi, OpenVox, Tinkerbell, Prometheus, Naemon,
existing Puppet modules, cloud APIs — but provides a unified interface over all of them.
## Architecture
```
┌────────────────────────────────────────────────────────────┐
│ lab-server (control plane) │
│ │
│ ┌──────────┐ ┌──────────┐ ┌──────────┐ ┌────────────┐ │
│ │ Provider │ │ Label │ │ Lifecycle│ │ Artifact │ │
│ │ Registry │ │ Engine │ │ Manager │ │ Builder │ │
│ └──────────┘ └──────────┘ └──────────┘ └────────────┘ │
│ ┌──────────┐ ┌──────────┐ ┌──────────┐ ┌────────────┐ │
│ │ OpenVox │ │ Health │ │ K8s │ │ Render │ │
│ │ Enroller │ │ Aggregator│ │ Deployer │ │ Engine │ │
│ └──────────┘ └──────────┘ └──────────┘ └────────────┘ │
│ ┌──────────┐ ┌──────────┐ ┌──────────┐ ┌────────────┐ │
│ │ Identity │ │ DNS │ │ Secret │ │ Token │ │
│ │ Manager │ │ Manager │ │ Manager │ │ Issuer │ │
│ └──────────┘ └──────────┘ └──────────┘ └────────────┘ │
│ │
│ API (gRPC + REST) │
└──────────────┬─────────────────────────────────────────────┘
┌──────────┴──────────┐
│ │
┌───┴───┐ ┌────┴────┐
│ lab │ │ lab-tui │
│ (CLI) │ │ (k9s) │
└───────┘ └─────────┘
```
### Control Plane (lab-server)
Runs as a service, either on the bootstrap node or in k8s. It hosts:
- **Provider Registry** — pluggable providers (AWS, XCP-ng, bare metal, GCP, etc.)
- **Label Engine** — resolves labels → puppet classes, sizes, ports, config
- **Lifecycle Manager** — orchestrates provision → enroll → configure → observe
- **Artifact Builder** — puppet classes → container images
- **OpenVox Enroller** — secure cert signing, node classification, environment assignment
- **Health Aggregator** — queries Prometheus, Naemon, cloud health APIs
- **K8s Deployer** — manages workloads on k3s/EKS clusters
- **Render Engine** — side-by-side provider comparison, cost estimates, drift detection
- **Identity Manager** — tracks enrollment state, certs, Vault auth, SSH keys per resource
- **DNS Manager** — auto-registers/updates DNS for every managed resource
- **Secret Manager** — controls which resources can access which secrets (per-label policies)
- **Token Issuer** — generates one-time join tokens at provision time (no hardcoded secrets)
### CLI (lab)
kubectl-like interface for browsing and managing resources:
```
$ lab get servers
NAME PROVIDER LABELS SIZE SYNC PUPPET HEALTH IDENTITY
api-1 aws app,prod,eu-west medium ✓ sync ✓ ok ✓ ok ✓ enrolled
api-2 aws app,prod,eu-west medium ✓ sync ✓ ok ✓ ok ✓ enrolled
mail-1 xcpng mailserver,prod medium ✓ sync ✓ ok ✓ ok ✓ enrolled
db-1 baremetal postgres,prod large ⚠ drift ✓ ok ✓ ok ✓ enrolled
worker-3 aws k8s-worker,staging large ✓ sync ✗ failed ⚠ 2 alrt ✓ enrolled
gateway-1 baremetal k8s-server,prod small ✓ sync ✓ ok ✓ ok ⚠ cert exp
$ lab get servers --label mailserver
NAME PROVIDER SIZE SYNC PUPPET HEALTH IDENTITY
mail-1 xcpng medium ✓ sync ✓ ok ✓ ok ✓ enrolled
mail-2 aws medium ✓ sync ✓ ok ✓ ok ✓ enrolled
$ lab describe server db-1
Name: db-1
Provider: baremetal
Labels: [postgres, prod, eu-west]
Size: large (8 cores, 32GB, 500GB NVMe)
Status: DRIFT DETECTED
Expected: size=large, disk=500GB
Actual: size=large, disk=500GB, extra_mount=/data (unmanaged)
Puppet:
Environment: production
Role: postgres
Classes: [postgresql::server, backup::pgbackrest, node_exporter]
Last run: 2026-03-15 14:22:03 (success)
Next run: 2026-03-15 14:52:03
Health:
Prometheus: ✓ all targets up
Naemon: ✓ all checks passing
Alerts: none active
$ lab get labels
LABEL PUPPET CLASSES SERVERS CONTAINERS
mailserver postfix, dovecot, spamassassin 2 1
k8s-worker kubernetes::worker, containerd 12 0
postgres postgresql::server, pgbackrest 3 1
app nginx, app::deploy 4 2
$ lab get containers
NAME IMAGE LABEL K8S CLUSTER STATUS
mailserver ghcr.io/org/mailserver:2026.03.15 mailserver homelab running
postgres ghcr.io/org/postgres:2026.03.14 postgres homelab running
app ghcr.io/org/app:2026.03.15 app production running
$ lab diff server db-1
size: large
disk: 500GB
+ extra_mount: /data ← unmanaged, not in spec
$ lab sync server db-1 # reconcile drift
$ lab plan server new-mail-3 --label mailserver --provider aws # preview
$ lab apply server new-mail-3 # create it
$ lab build --label mailserver # puppet modules → container image
Building mailserver from puppet classes:
✓ postfix
✓ dovecot
✓ spamassassin
✓ fail2ban
→ ghcr.io/org/mailserver:2026.03.15
$ lab render --label mailserver --all-providers
┌──────────────┬──────────────┬──────────┬────────────┐
│ │ AWS │ XCP-ng │ Bare Metal │
├──────────────┼──────────────┼──────────┼────────────┤
│ Compute │ t3.large │ 4c/8GB │ IPMI boot │
│ Puppet │ postfix,... │ postfix,.│ postfix,...│
│ Est. Cost │ ~$62/mo │ — │ — │
└──────────────┴──────────────┴──────────┴────────────┘
```
### TUI (lab-tui)
k9s-style interactive terminal UI:
- Real-time server list with sync/puppet/health status
- Drill into any server for details
- Watch puppet runs live
- Filter by labels, providers, health status
- Trigger actions (sync, plan, apply, build)
## Core Concepts
### Labels — The Universal Abstraction
Every managed resource carries labels. Configuration attaches to labels, not machines.
```yaml
labels:
  mailserver:
    puppet_classes:
      - postfix
      - dovecot
      - spamassassin
      - fail2ban
    ports: [25, 587, 993]
    size: medium
    alerts:
      - smtp_connect       # auto-generated: is SMTP responding?
      - imap_connect       # auto-generated: is IMAP responding?
      - mail_queue_length  # auto-generated: is mail queue healthy?
    secrets:
      - mail/tls-cert
      - mail/dkim-key
  k8s-worker:
    puppet_classes:
      - kubernetes::worker
      - containerd
      - node_exporter
    size: large
    alerts:
      - kubelet_healthy
      - node_ready
    secrets:
      - k8s/join-token
```
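As a rough illustration of what the Label Engine does with these definitions, the sketch below resolves a list of labels into a deduplicated set of Puppet classes and remembers which label contributed each one. The `LabelDef`, `ClassSource`, and `ResolveClasses` names are hypothetical and not part of the spec; the "first label wins" rule for duplicates is an assumption.

```go
package main

import "fmt"

// LabelDef is a hypothetical in-memory form of one entry under `labels:`.
type LabelDef struct {
	PuppetClasses []string
	Ports         []int
	Size          string
}

// ClassSource records which label contributed a Puppet class.
type ClassSource struct {
	Class string
	Label string
}

// ResolveClasses walks the labels in order and returns each Puppet class once,
// attributed to the first label that declared it (an assumed tie-break rule).
// An unknown label is reported as an error rather than silently ignored.
func ResolveClasses(defs map[string]LabelDef, labels []string) ([]ClassSource, error) {
	seen := make(map[string]bool)
	var out []ClassSource
	for _, l := range labels {
		def, ok := defs[l]
		if !ok {
			return nil, fmt.Errorf("unknown label %q", l)
		}
		for _, c := range def.PuppetClasses {
			if seen[c] {
				continue
			}
			seen[c] = true
			out = append(out, ClassSource{Class: c, Label: l})
		}
	}
	return out, nil
}

func main() {
	defs := map[string]LabelDef{
		"mailserver": {
			PuppetClasses: []string{"postfix", "dovecot", "spamassassin", "fail2ban"},
			Ports:         []int{25, 587, 993},
			Size:          "medium",
		},
		"k8s-worker": {
			PuppetClasses: []string{"kubernetes::worker", "containerd", "node_exporter"},
			Size:          "large",
		},
	}
	classes, err := ResolveClasses(defs, []string{"k8s-worker", "mailserver"})
	if err != nil {
		panic(err)
	}
	for _, c := range classes {
		fmt.Printf("%-20s from label %s\n", c.Class, c.Label)
	}
}
```

Keeping the source label next to each class is what would make a "FROM LABEL ..." style breakdown possible in the CLI.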
### Groups — Nested Targeting with Exclusions
Groups compose labels, other groups, and individual servers into reusable targets.
Groups can nest (subgroups). Exclusions allow fine-grained control.
```yaml
groups:
  # Simple group: all production servers
  production:
    match:
      environment: prod
  # Group by label combination
  production-mail:
    match:
      labels: [mailserver]
      environment: prod
  # Nested group with subgroups
  eu-infrastructure:
    groups:
      - eu-west-compute
      - eu-west-storage
      - eu-west-network
    exclude:
      servers: [test-box-1]   # exclude specific server
      labels: [experimental]  # exclude servers with this label
  eu-west-compute:
    match:
      labels: [k8s-worker, k8s-server]
      region: eu-west
    exclude:
      servers: [legacy-node-3]
  # Group targeting everything except a subgroup
  all-except-staging:
    match:
      environment: [prod, dev]
    exclude:
      environment: staging
  # Custom group by explicit membership
  database-tier:
    servers: [db-1, db-2, db-3]
    groups: [replica-set-eu]
```
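One way to evaluate nested groups with exclusions is a recursive walk with a cycle guard, sketched below. This is a simplified model under stated assumptions: `match.labels` is treated as "has any of these labels", only a single environment value is supported, and the `Server`, `Group`, and `Members` names are hypothetical.

```go
package main

import (
	"fmt"
	"sort"
)

type Server struct {
	Name        string
	Labels      []string
	Environment string
}

// Group is a reduced, hypothetical form of one entry under `groups:`.
type Group struct {
	MatchLabels   []string // assumed semantics: server has ANY of these
	MatchEnv      string
	Servers       []string // explicit members
	Groups        []string // nested subgroups
	ExcludeNames  []string
	ExcludeLabels []string
}

func hasAny(have, want []string) bool {
	for _, h := range have {
		for _, w := range want {
			if h == w {
				return true
			}
		}
	}
	return false
}

// Members returns the sorted server names in a group. The visiting map
// guards against groups that (directly or indirectly) contain themselves.
func Members(groups map[string]Group, servers []Server, name string, visiting map[string]bool) []string {
	g, ok := groups[name]
	if !ok || visiting[name] {
		return nil
	}
	visiting[name] = true
	defer delete(visiting, name)

	byName := make(map[string]Server, len(servers))
	set := make(map[string]bool)
	for _, s := range servers {
		byName[s.Name] = s
		if len(g.MatchLabels) == 0 && g.MatchEnv == "" {
			continue // no match block: membership comes only from explicit lists
		}
		labelOK := len(g.MatchLabels) == 0 || hasAny(s.Labels, g.MatchLabels)
		envOK := g.MatchEnv == "" || s.Environment == g.MatchEnv
		if labelOK && envOK {
			set[s.Name] = true
		}
	}
	for _, n := range g.Servers {
		set[n] = true
	}
	for _, sub := range g.Groups {
		for _, n := range Members(groups, servers, sub, visiting) {
			set[n] = true
		}
	}
	// Exclusions are applied last so they also cover inherited members.
	for _, n := range g.ExcludeNames {
		delete(set, n)
	}
	for n := range set {
		if hasAny(byName[n].Labels, g.ExcludeLabels) {
			delete(set, n)
		}
	}
	out := make([]string, 0, len(set))
	for n := range set {
		out = append(out, n)
	}
	sort.Strings(out)
	return out
}

func main() {
	servers := []Server{
		{Name: "worker-1", Labels: []string{"k8s-worker"}, Environment: "prod"},
		{Name: "worker-2", Labels: []string{"k8s-worker", "experimental"}, Environment: "prod"},
		{Name: "test-box-1", Labels: []string{"k8s-worker"}, Environment: "prod"},
	}
	groups := map[string]Group{
		"eu-west-compute": {MatchLabels: []string{"k8s-worker", "k8s-server"}},
		"eu-infrastructure": {
			Groups:        []string{"eu-west-compute"},
			ExcludeNames:  []string{"test-box-1"},
			ExcludeLabels: []string{"experimental"},
		},
	}
	fmt.Println(Members(groups, servers, "eu-infrastructure", map[string]bool{}))
}
```

Applying exclusions after subgroup expansion means a parent group can remove servers that a subgroup would otherwise contribute, which matches the `eu-infrastructure` example.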
### Alerts — Auto-Generated and User-Defined
Alerts attach to labels, groups, servers, or environments — same targeting as everything else.
#### Auto-Generated Alerts
When Lab provisions a resource, it generates baseline alerts based on:
- **Label**: mailserver label → SMTP/IMAP checks
- **Puppet classes**: `postgresql::server` → postgres process, replication lag
- **Ports**: if port 443 is declared → HTTPS health check
- **Size**: resource limits → CPU/memory threshold alerts
- **Identity**: cert expiry alerts auto-generated for all enrolled machines
#### User-Defined Alerts
Users can add custom alerts targeting any scope:
```yaml
alerts:
  # Target by label
  - name: mail_queue_critical
    target:
      labels: [mailserver]
    condition: mail_queue_length > 1000
    severity: critical
    for: 5m
  # Target by group
  - name: disk_space_low
    target:
      groups: [production]
    condition: disk_usage_percent > 85
    severity: warning
  # Target by environment
  - name: high_cpu
    target:
      environment: prod
    condition: cpu_usage_percent > 90
    for: 10m
    severity: warning
  # Target specific servers
  - name: gpu_temperature
    target:
      servers: [dgx-spark, beelink-ser9-max]
    condition: gpu_temp_celsius > 80
    severity: critical
  # Target by label but exclude some
  - name: memory_pressure
    target:
      labels: [k8s-worker]
    exclude:
      servers: [batch-worker-1]  # this one is expected to run hot
    condition: memory_usage_percent > 90
    severity: warning
```
Alerts are rendered to the underlying monitoring system (Prometheus rules, Naemon checks,
CloudWatch alarms) — we don't build an alerting engine, we generate configs for existing ones.
Which monitoring backend to use for each alert type: **needs investigation**.
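To make "generate configs for existing engines" concrete, here is a minimal sketch that renders alert definitions into a Prometheus rule group. It assumes the `condition` string is already valid PromQL and that metrics such as `mail_queue_length` exist under those names, neither of which the spec confirms. The `Alert` type and `RenderPrometheusRules` function are hypothetical; a real implementation would likely use a YAML library instead of string formatting.

```go
package main

import (
	"fmt"
	"strings"
)

// Alert mirrors the fields of a user-defined alert (targeting omitted).
type Alert struct {
	Name      string
	Condition string // assumed to be a valid PromQL expression
	Severity  string
	For       string // optional, e.g. "5m"
}

// RenderPrometheusRules emits one Prometheus rule group as YAML text.
func RenderPrometheusRules(group string, alerts []Alert) string {
	var b strings.Builder
	fmt.Fprintf(&b, "groups:\n- name: %s\n  rules:\n", group)
	for _, a := range alerts {
		// %q double-quotes the expression so characters like '>' are safe in YAML.
		fmt.Fprintf(&b, "  - alert: %s\n    expr: %q\n", a.Name, a.Condition)
		if a.For != "" {
			fmt.Fprintf(&b, "    for: %s\n", a.For)
		}
		fmt.Fprintf(&b, "    labels:\n      severity: %s\n", a.Severity)
	}
	return b.String()
}

func main() {
	fmt.Print(RenderPrometheusRules("production", []Alert{
		{Name: "mail_queue_critical", Condition: "mail_queue_length > 1000", Severity: "critical", For: "5m"},
		{Name: "disk_space_low", Condition: "disk_usage_percent > 85", Severity: "warning"},
	}))
}
```

Restricting a rule to the servers a target resolves to (for example through an instance label matcher) is left out here, since the spec has not yet decided how targets map onto each backend.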
### Targeting — Unified Query System
The same targeting syntax works everywhere: alerts, puppet classes, secrets, and queries.
Target by label, group, server name, environment, region, or any combination with exclusions.
```
# CLI targeting syntax
$ lab get servers --label k8s-worker
$ lab get servers --group production
$ lab get servers --environment staging
$ lab get servers --label k8s-worker --environment prod --exclude worker-3
# What's applied WHERE (server → everything)
$ lab show server worker-5
```
### Visibility — Show What's Applied Where
Two directions of querying: "what does this server get?" and "where does this thing apply?"
#### Server View: Everything applied to a server
```
$ lab show server worker-5
Server: worker-5 (aws, eu-west-1)
Labels: [k8s-worker, production, eu-west]
Groups: [production, eu-west-compute, eu-infrastructure]
Environment: prod
Puppet Classes (6):
FROM LABEL k8s-worker:
├── kubernetes::worker
├── containerd
└── node_exporter
FROM LABEL production:
├── base::hardening
└── base::monitoring
FROM LABEL eu-west:
└── base::ntp_eu
Alerts (8):
FROM LABEL k8s-worker:
├── kubelet_healthy
└── node_ready
FROM GROUP production:
├── disk_space_low
└── high_cpu
AUTO-GENERATED:
├── cpu_threshold (from size: large)
├── memory_threshold (from size: large)
├── cert_expiry (from identity)
└── puppet_run_failed (from enrollment)
Secrets (2):
FROM LABEL k8s-worker:
├── k8s/join-token (read)
└── tls/node-cert (dynamic)
Excluded From:
└── alert "memory_pressure" (explicitly excluded)
```
#### Label/Group View: Where does this apply?
```
$ lab show label mailserver
Label: mailserver
Applied to: 2 servers
Servers:
├── mail-1 (xcpng, prod) ✓ sync ✓ puppet ✓ health ✓ identity
└── mail-2 (aws, prod) ✓ sync ✓ puppet ✓ health ✓ identity
Provides:
Puppet Classes: postfix, dovecot, spamassassin, fail2ban
Alerts: smtp_connect, imap_connect, mail_queue_length
Secrets: mail/tls-cert, mail/dkim-key
Ports: 25, 587, 993
Size: medium
$ lab show group eu-infrastructure
Group: eu-infrastructure
Contains: 3 subgroups, 47 servers (2 excluded)
Subgroups:
├── eu-west-compute (28 servers)
├── eu-west-storage (12 servers)
└── eu-west-network (9 servers)
Excluded:
├── test-box-1 (by name)
└── 1 server with label "experimental"
Alerts targeting this group:
├── disk_space_low (warning)
└── network_latency_high (critical)
```
#### Alert View: Where does this alert fire?
```
$ lab show alert disk_space_low
Alert: disk_space_low
Severity: warning
Condition: disk_usage_percent > 85
Target: group "production"
Excludes: none
Applies to 63 servers:
├── api-1 (aws) currently: 42% ✓
├── api-2 (aws) currently: 38% ✓
├── mail-1 (xcpng) currently: 71% ✓
├── db-1 (baremetal) currently: 83% ⚠ approaching
└── ... (59 more)
Rendered to:
├── Prometheus: rule "disk_space_low" in rules/production.yaml
└── Naemon: service check on 4 bare-metal hosts
```
#### Reverse Query: What targets this server?
```
$ lab targets server db-1
Everything targeting db-1:
Labels: [postgres, production, eu-west]
Groups: [production, database-tier, eu-infrastructure, eu-west-storage]
Environment: prod
Alerts (11):
├── postgres_replication_lag (from label: postgres)
├── postgres_connections (from label: postgres)
├── disk_space_low (from group: production)
├── high_cpu (from group: production)
├── storage_iops (from group: eu-west-storage)
├── cert_expiry (auto-generated)
└── ... (5 more)
Puppet Classes (9):
├── postgresql::server (from label: postgres)
├── backup::pgbackrest (from label: postgres)
└── ... (7 more)
Secrets (4):
├── postgres/master-password (from label: postgres)
└── ... (3 more)
```
### TUI Visualization (lab-tui)
The k9s-style TUI should support navigating these relationships interactively:
```
┌─ lab-tui ──────────────────────────────────────────────────────────┐
│ View: Servers > worker-5 [?]Help│
├────────────────────────────────────────────────────────────────────┤
│ │
│ ┌─ Server: worker-5 ──────────────────────────────────────────┐ │
│ │ Provider: aws Size: large Env: prod │ │
│ │ Sync: ✓ Puppet: ✓ Health: ✓ Identity: ✓ │ │
│ └─────────────────────────────────────────────────────────────┘ │
│ │
│ [L]abels [A]lerts [P]uppet [S]ecrets [G]roups │
│ │
│ Labels ──────────────────── Alerts ────────────────────────── │
│ ► k8s-worker ● kubelet_healthy ✓ OK │
│ ► production ● node_ready ✓ OK │
│ ► eu-west ● disk_space_low ✓ 42% │
│ ● high_cpu ✓ 12% │
│ Groups ────────────────── ● cert_expiry ✓ 347d │
│ ► production │
│ ► eu-infrastructure Puppet Classes ────────────────── │
│ ► eu-west-compute ● kubernetes::worker ✓ applied │
│ ● containerd ✓ applied │
│ Secrets ───────────────── ● node_exporter ✓ applied │
│ ● k8s/join-token (read) ● base::hardening ✓ applied │
│ ● tls/node-cert (dyn) ● base::monitoring ✓ applied │
│ │
│ [Enter] drill down [Esc] back [/] search [Tab] switch pane │
└────────────────────────────────────────────────────────────────────┘
```
Navigation:
- From server → drill into label → see all other servers with that label
- From alert → see all servers it applies to, current values
- From group → see subgroups, expand tree, see members
- From label → see puppet classes, alerts, secrets it provides
- Everything is cross-linked — follow any relationship in either direction
### Deployment Targets
Same label → multiple targets:
| Target | What happens |
|--------|-------------|
| VM (any cloud) | Provision VM → enroll OpenVox → apply classes live |
| Bare metal | PXE boot → enroll OpenVox → apply classes live |
| Container | Build image with classes baked in → push to registry |
| ASG | Launch template with OpenVox enrollment → auto-apply |
| K8s pod | Deploy container artifact to cluster |
### Four-Pillar Status
Every resource shows four things:
1. **Sync** — is the actual infrastructure state matching the declared spec?
(instance type, security groups, disks, network — via Pulumi state)
2. **Puppet** — did OpenVox successfully apply all classes?
(last run status, any failures, catalog compilation errors)
3. **Health** — are monitoring checks passing?
(aggregates from Prometheus alerts, Naemon checks, cloud health APIs)
4. **Identity** — is the resource fully enrolled?
(DNS registered, certs valid, Vault authenticated, SSH host key signed)
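The four pillars could be modelled as a small struct with a worst-of roll-up, as in the sketch below. The `State` levels and the `Overall` rule are assumptions made for illustration; the spec only says that every resource shows the four values.

```go
package main

import "fmt"

// State is a hypothetical three-level status shared by all four pillars.
type State int

const (
	OK State = iota
	Warn
	Fail
)

func (s State) String() string {
	switch s {
	case OK:
		return "ok"
	case Warn:
		return "warn"
	default:
		return "fail"
	}
}

// Status holds one State per pillar. The zero value means "all ok".
type Status struct {
	Sync, Puppet, Health, Identity State
}

// Overall returns the worst of the four pillars (an assumed roll-up rule).
func (s Status) Overall() State {
	worst := OK
	for _, p := range []State{s.Sync, s.Puppet, s.Health, s.Identity} {
		if p > worst {
			worst = p
		}
	}
	return worst
}

func main() {
	db1 := Status{Sync: Warn}                     // drift detected, everything else ok
	worker3 := Status{Puppet: Fail, Health: Warn} // failed run plus active alerts
	fmt.Println("db-1:", db1.Overall())
	fmt.Println("worker-3:", worker3.Overall())
}
```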
### Provider Plugin System
Extensible provider model — each provider implements an interface:
```go
type Provider interface {
	Name() string

	// Lifecycle
	Plan(spec ResourceSpec) (*PlanResult, error)
	Apply(spec ResourceSpec) (*Resource, error)
	Destroy(id string) error

	// State
	Get(id string) (*Resource, error)
	List(filters Filters) ([]*Resource, error)
	Diff(spec ResourceSpec) (*DiffResult, error)

	// Introspection (like DA's type-writer)
	DiscoverResources() ([]*Resource, error)
	AvailableSizes() ([]Size, error)
	AvailableImages() ([]Image, error)
}
```
Built-in providers:
- `provider-aws` — wraps Pulumi AWS
- `provider-xcpng` — wraps Pulumi XO / Xen Orchestra API
- `provider-baremetal` — wraps Tinkerbell / iPXE + IPMI/Redfish
- `provider-k8s` — wraps Pulumi Kubernetes
Community can add: GCP, Azure, Hetzner, Proxmox, etc.
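A registry for such plugins might look like the sketch below. To stay self-contained it uses a trimmed-down `Provider` interface with only `Name` and `Plan`, and the `ResourceSpec`, `PlanResult`, `Registry`, and `fakeProvider` types are hypothetical stand-ins rather than the real definitions.

```go
package main

import (
	"fmt"
	"sync"
)

// Minimal stand-ins for the spec's types (hypothetical field sets).
type ResourceSpec struct{ Name, Size string }
type PlanResult struct{ Actions []string }

// Provider is a reduced version of the interface above.
type Provider interface {
	Name() string
	Plan(spec ResourceSpec) (*PlanResult, error)
}

// Registry maps provider names to implementations and is safe for concurrent use.
type Registry struct {
	mu        sync.RWMutex
	providers map[string]Provider
}

func NewRegistry() *Registry {
	return &Registry{providers: make(map[string]Provider)}
}

// Register adds a provider and rejects duplicate names.
func (r *Registry) Register(p Provider) error {
	r.mu.Lock()
	defer r.mu.Unlock()
	if _, exists := r.providers[p.Name()]; exists {
		return fmt.Errorf("provider %q already registered", p.Name())
	}
	r.providers[p.Name()] = p
	return nil
}

// Get looks a provider up by name.
func (r *Registry) Get(name string) (Provider, error) {
	r.mu.RLock()
	defer r.mu.RUnlock()
	p, ok := r.providers[name]
	if !ok {
		return nil, fmt.Errorf("no provider named %q", name)
	}
	return p, nil
}

// fakeProvider plans a single "create" action and touches no real infrastructure.
type fakeProvider struct{ name string }

func (f fakeProvider) Name() string { return f.name }

func (f fakeProvider) Plan(spec ResourceSpec) (*PlanResult, error) {
	action := fmt.Sprintf("+ create %s (%s) on %s", spec.Name, spec.Size, f.name)
	return &PlanResult{Actions: []string{action}}, nil
}

func main() {
	reg := NewRegistry()
	for _, n := range []string{"aws", "xcpng", "baremetal"} {
		if err := reg.Register(fakeProvider{name: n}); err != nil {
			panic(err)
		}
	}
	p, err := reg.Get("aws")
	if err != nil {
		panic(err)
	}
	plan, _ := p.Plan(ResourceSpec{Name: "new-mail-3", Size: "medium"})
	fmt.Println(plan.Actions[0])
}
```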
### Health Aggregator Plugin System
```go
type HealthSource interface {
	Name() string
	CheckHealth(resource *Resource) (*HealthResult, error)
}
```
Built-in sources:
- `health-prometheus` — queries Prometheus alerting rules targeting the resource
- `health-naemon` — queries Naemon host/service checks
- `health-cloudwatch` — queries AWS CloudWatch alarms
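The aggregation step could be a worst-of fold over all sources, sketched below. The `Level` scale, the `HealthResult` fields, and the choice to treat an unreachable source as "degraded" are assumptions; `staticSource` is a test double, not a real backend.

```go
package main

import (
	"errors"
	"fmt"
)

type Resource struct{ Name string }

// Level is a hypothetical severity scale; higher is worse.
type Level int

const (
	Healthy Level = iota
	Degraded
	Critical
)

type HealthResult struct {
	Level  Level
	Detail string
}

type HealthSource interface {
	Name() string
	CheckHealth(resource *Resource) (*HealthResult, error)
}

// Aggregate queries every source and returns the worst level seen, plus one
// note per non-healthy source. A source that errors counts as Degraded
// (an assumption: missing data should be visible but not page anyone).
func Aggregate(r *Resource, sources []HealthSource) (Level, []string) {
	worst := Healthy
	var notes []string
	for _, src := range sources {
		res, err := src.CheckHealth(r)
		if err != nil {
			res = &HealthResult{Level: Degraded, Detail: "source unreachable: " + err.Error()}
		} else if res == nil {
			continue
		}
		if res.Level > worst {
			worst = res.Level
		}
		if res.Level != Healthy {
			notes = append(notes, src.Name()+": "+res.Detail)
		}
	}
	return worst, notes
}

// staticSource returns a fixed result, or an error when down is set.
type staticSource struct {
	name   string
	level  Level
	detail string
	down   bool
}

func (s staticSource) Name() string { return s.name }

func (s staticSource) CheckHealth(*Resource) (*HealthResult, error) {
	if s.down {
		return nil, errors.New("connection refused")
	}
	return &HealthResult{Level: s.level, Detail: s.detail}, nil
}

func main() {
	level, notes := Aggregate(&Resource{Name: "worker-3"}, []HealthSource{
		staticSource{name: "health-prometheus", level: Critical, detail: "2 alerts firing"},
		staticSource{name: "health-naemon", level: Healthy},
	})
	fmt.Println(level, notes)
}
```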
### Profiles — T-Shirt Sizing
User-owned mappings:
```yaml
sizes:
  medium:
    abstract: { cores: 4, memory: 8GB }
    providers:
      aws: { instance_type: t3.large }
      xcpng: { cores: 4, memory: 8192MB }
      baremetal: { min_cores: 4, min_memory: 8GB, maas_tag: medium }
```
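Resolving a T-shirt size for a given provider is then a two-level lookup. The sketch below assumes the YAML has already been parsed into a hypothetical `SizeProfile` structure and that provider parameters are plain string key/value pairs.

```go
package main

import "fmt"

// SizeProfile is a hypothetical parsed form of one entry under `sizes:`.
type SizeProfile struct {
	Cores     int
	MemoryGB  int
	Providers map[string]map[string]string // provider name -> provider-specific params
}

// ProviderParams returns the provider-specific parameters for a size,
// failing loudly when either the size or the provider mapping is missing.
func ProviderParams(sizes map[string]SizeProfile, size, provider string) (map[string]string, error) {
	p, ok := sizes[size]
	if !ok {
		return nil, fmt.Errorf("unknown size %q", size)
	}
	params, ok := p.Providers[provider]
	if !ok {
		return nil, fmt.Errorf("size %q has no mapping for provider %q", size, provider)
	}
	return params, nil
}

func main() {
	sizes := map[string]SizeProfile{
		"medium": {
			Cores:    4,
			MemoryGB: 8,
			Providers: map[string]map[string]string{
				"aws":       {"instance_type": "t3.large"},
				"xcpng":     {"cores": "4", "memory": "8192MB"},
				"baremetal": {"min_cores": "4", "min_memory": "8GB", "maas_tag": "medium"},
			},
		},
	}
	for _, prov := range []string{"aws", "xcpng", "gcp"} {
		params, err := ProviderParams(sizes, "medium", prov)
		if err != nil {
			fmt.Println("error:", err)
			continue
		}
		fmt.Println(prov, params)
	}
}
```

Returning an error for an unmapped provider, rather than falling back to the abstract values, keeps the mapping "user-owned": nothing is provisioned with a size the user did not explicitly define.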
### Artifact Builder
Puppet modules → container images:
```
label "mailserver"
  → puppet classes [postfix, dovecot, spamassassin, fail2ban]
  → Dockerfile generated:
      FROM ubuntu:24.04
      RUN apt-get update && apt-get install -y puppet-agent
      COPY modules/ /etc/puppetlabs/code/modules/
      RUN puppet apply -e 'include postfix, dovecot, spamassassin, fail2ban'
      # Clean up puppet, leave only configured services
  → Image pushed to registry
  → Available as k8s deployment or standalone container
```
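The Dockerfile generation step could be sketched as a pure function from a class list to Dockerfile text. The `Dockerfile` function name and the class-name validation pattern are hypothetical, and the package install line is carried over from the outline above as a placeholder (the package name and any repository setup for the agent are unverified assumptions). Classes are applied with `puppet apply -e 'include ...'`.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// className accepts names like "postfix" or "kubernetes::worker". Validating
// here keeps arbitrary text out of the generated RUN line.
var className = regexp.MustCompile(`^[a-z][a-z0-9_]*(::[a-z][a-z0-9_]*)*$`)

// Dockerfile renders a Dockerfile that bakes the given Puppet classes into an image.
func Dockerfile(baseImage string, classes []string) (string, error) {
	if len(classes) == 0 {
		return "", fmt.Errorf("no puppet classes given")
	}
	for _, c := range classes {
		if !className.MatchString(c) {
			return "", fmt.Errorf("invalid puppet class name %q", c)
		}
	}
	var b strings.Builder
	fmt.Fprintf(&b, "FROM %s\n", baseImage)
	// Placeholder install step: package name and repo setup are assumptions.
	b.WriteString("RUN apt-get update && apt-get install -y puppet-agent\n")
	b.WriteString("COPY modules/ /etc/puppetlabs/code/modules/\n")
	fmt.Fprintf(&b, "RUN puppet apply -e 'include %s'\n", strings.Join(classes, ", "))
	return b.String(), nil
}

func main() {
	df, err := Dockerfile("ubuntu:24.04", []string{"postfix", "dovecot", "spamassassin", "fail2ban"})
	if err != nil {
		panic(err)
	}
	fmt.Print(df)
}
```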
## Tech Stack
| Component | Technology | Why |
|-----------|-----------|-----|
| Server | Go | Performance, single binary, Pulumi SDK, gRPC native |
| CLI | Go (cobra) | Same binary, kubectl-style |
| TUI | Go (bubbletea) | Same binary, k9s-style |
| API | gRPC + REST (grpc-gateway) | Type-safe, fast, REST fallback |
| IaC engine | Pulumi (Go SDK) | Multi-provider, plan/preview, component packages |
| Config mgmt | OpenVox | Puppet modules, ENC, cert management |
| Bare metal | Tinkerbell or custom iPXE | PXE boot, IPMI/Redfish |
| Container build | Buildah or Docker | OCI images from puppet classes |
| State store | TBD — NOT etcd (see State Storage section) | Resource state, label definitions |
| K8s integration | client-go | Direct k8s API for deployments |
## Under The Hood — What We DON'T Build
- Cloud APIs → Pulumi providers handle this
- Puppet language/runtime → OpenVox handles this
- Container runtime → containerd/Docker handles this
- Monitoring → Prometheus/Naemon handle this
- K8s orchestration → k3s/EKS handles this
- PXE/DHCP/TFTP → Tinkerbell handles this
- Certificate management → OpenVox CA handles this
**We build the glue, the abstraction, the UX, and the lifecycle orchestration.**
## Kubernetes Management
Lab also controls what runs on k8s clusters:
```
$ lab get deployments
NAME CLUSTER LABEL REPLICAS IMAGE STATUS
mailserver homelab mailserver 2/2 org/mailserver:03.15 ✓ running
api production app 4/4 org/app:03.15 ✓ running
postgres homelab postgres 1/1 org/postgres:03.14 ✓ running
$ lab deploy --label app --cluster production --replicas 4
$ lab scale --label app --cluster production --replicas 6
```
Deployments reference labels — same label that defines puppet classes also defines
the container image, ports, health checks, and k8s resources.
## Bootstrap, Onboarding, and Self-Deployment
### Core Idea: Your Device Is The First Coordinator
You don't need a server to start. Your laptop/workstation runs the full lab engine
locally. You onboard servers from it — including bare metal PXE boot. When ready,
you migrate the coordinator role to one of the servers you've onboarded.
```
┌────────────┐ ┌────────────┐ ┌────────────┐ ┌────────────┐
│ Phase 0 │ │ Phase 1 │ │ Phase 2 │ │ Phase 3 │
│ │ │ │ │ │ │ │
│ lab init │────►│ Onboard │────►│ Move lab │────►│ Onboard │
│ --local │ │ servers │ │ to a real │ │ remaining │
│ │ │ from your │ │ server │ │ from the │
│ Your device│ │ laptop │ │ │ │ server │
│ = lab │ │ │ │ │ │ │
└────────────┘ └────────────┘ └────────────┘ └────────────┘
```
### Architecture: CLI = Embedded Server
The CLI binary contains the full lab-server engine. The difference between modes
is where state lives and whether the engine runs persistently.
```
┌──────────────────────────────────────┐
│ lab (single binary) │
│ │
│ ┌─────────────────────────────────┐ │
│ │ Core Engine │ │
│ │ (providers, labels, render, │ │
│ │ lifecycle, identity, secrets, │ │
│ │ PXE server, everything) │ │
│ └─────────────────────────────────┘ │
│ │
│ Modes: │
│ ├── $ lab init --local → local mode │
│ │ State: ~/.lab/state.db │
│ │ PXE/DHCP: served from laptop │
│ │ Full engine, no remote server │
│ │ │
│ ├── $ lab server → daemon mode │
│ │ State: /var/lib/lab/state.db │
│ │ PXE/DHCP: served from this box │
│ │ Persistent API on port 7443 │
│ │ │
│ └── $ lab <command> → client mode │
│ Talks to remote lab-server │
│ (or local engine if no server) │
└──────────────────────────────────────┘
```
### Onboarding Flow
`lab onboard` is the command to bring a new machine under management. It handles
two scenarios: machines with an OS already installed, and bare metal that needs
network boot + OS installation.
#### Scenario A: Machine has OS (SSH onboard)
For machines that already have an OS (like DGX Spark with Ubuntu, or Mac Studio):
```
$ lab onboard dgx-spark --provider ssh --host 192.168.1.50 --user admin
Step 1: Render
┌──────────────┬────────────────────────┐
│ Name │ dgx-spark │
│ Provider │ ssh (existing machine) │
│ Host │ 192.168.1.50 │
│ OS │ Ubuntu (detected) │
│ Arch │ aarch64 (Grace) │
│ RAM │ 128GB │
│ GPU │ CUDA (detected) │
└──────────────┴────────────────────────┘
Onboarding will:
+ Install lab agent
+ Generate one-time enrollment token
+ Register in DNS: dgx-spark.lab.internal
+ Sign OpenVox certificate
+ Assign labels (interactive or --labels flag)
Proceed? [y/N]: y
Step 2: Detect & assign labels
Detected hardware:
GPU: NVIDIA GB10 Grace Blackwell → suggesting label: cuda
RAM: 128GB → suggesting label: ai-inference
Arch: aarch64 → suggesting label: arm
Assign labels [cuda,ai-inference,arm]: cuda,ai-inference,dgx-spark
Step 3: Apply (same engine as lab apply)
→ SSH into 192.168.1.50
→ Install lab agent binary
→ Generate one-time token
→ Lab agent enrolls:
→ OpenVox cert signed, classified in environment "production"
→ DNS A record: dgx-spark.lab.internal → 192.168.1.50
→ Identity established
→ Apply puppet classes from labels:
→ cuda: nvidia-drivers, cuda-toolkit
→ ai-inference: inference-runtime
→ Machine fully managed
$ lab get servers
NAME PROVIDER LABELS SYNC PUPPET HEALTH IDENTITY
dgx-spark ssh cuda,ai-inference,dgx-spark ✓ ✓ ok ✓ ✓ enrolled
```
#### Scenario B: Bare metal (PXE network boot)
For machines with no OS. Lab (on your laptop or server) becomes a PXE server
on the local network, serves the OS installer, and onboards after installation:
```
$ lab onboard beelink-max --provider baremetal \
--mac AA:BB:CC:DD:EE:FF \
--image ubuntu-24.04 \
--labels k8s-worker,rocm,longhorn
Step 1: Render
┌──────────────┬────────────────────────┐
│ Name │ beelink-max │
│ Provider │ baremetal (PXE boot) │
│ MAC │ AA:BB:CC:DD:EE:FF │
│ Image │ ubuntu-24.04 │
│ Labels │ k8s-worker,rocm,longhorn│
│ PXE server │ this device (laptop) │
└──────────────┴────────────────────────┘
Onboarding will:
+ Start PXE/DHCP/TFTP on local network interface
+ Wait for machine with MAC AA:BB:CC:DD:EE:FF to boot
+ Serve unattended Ubuntu 24.04 installer
+ After install: auto-enroll with one-time token baked into installer
+ Assign labels, apply puppet classes
⚠ PXE requires: network interface on same L2 segment as target machine
⚠ DHCP: will respond ONLY to MAC AA:BB:CC:DD:EE:FF (safe for existing networks)
Proceed? [y/N]: y
Step 2: PXE boot phase
→ Starting PXE server on en0 (192.168.1.x)
→ DHCP offer scoped to MAC AA:BB:CC:DD:EE:FF only
→ Waiting for network boot request...
⏳ Power on the Beelink SER9 MAX and set it to boot from network (PXE)
→ Boot request received from AA:BB:CC:DD:EE:FF
→ Serving iPXE → kernel + initrd → autoinstall config
→ OS installation in progress...
→ Installation complete, machine rebooting
Step 3: Post-install enrollment (same as SSH onboard from here)
→ Machine boots with installed OS
→ Lab agent runs on first boot (installed during OS setup)
→ Uses one-time token (baked into autoinstall config) to enroll:
→ OpenVox cert signed
→ DNS: beelink-max.lab.internal → 192.168.1.100
→ Identity established
→ Apply puppet classes from labels:
→ k8s-worker: kubernetes::worker, containerd
→ rocm: rocm-drivers
→ longhorn: longhorn::node
→ Machine fully managed
$ lab get servers
NAME PROVIDER LABELS SYNC PUPPET HEALTH IDENTITY
dgx-spark ssh cuda,ai-inference ✓ ✓ ok ✓ ✓ enrolled
beelink-max baremetal k8s-worker,rocm,longhorn ✓ ✓ ok ✓ ✓ enrolled
```
#### Scenario C: Onboard with IPMI/Redfish (remote power control)
For bare metal where you have IPMI/BMC access — Lab can power on the machine
and set PXE boot remotely, fully hands-free:
```
$ lab onboard beelink-max --provider baremetal \
--mac AA:BB:CC:DD:EE:FF \
--ipmi 192.168.1.200 --ipmi-user admin \
--image ubuntu-24.04 \
--labels k8s-worker,rocm,longhorn
→ IPMI: setting next boot to PXE
→ IPMI: powering on machine
→ PXE server waiting for boot request...
→ (fully automated from here)
```
### Homelab Bootstrap Walkthrough
The complete flow for setting up the homelab from zero:
```
# Phase 0: Local mode on your laptop
$ lab init --local
✓ Lab engine running locally
✓ State: ~/.lab/state.db
✓ Ready to onboard servers
# Phase 1: Onboard servers that already have an OS
$ lab onboard dgx-spark --provider ssh --host 192.168.1.50
→ Labels: [cuda, ai-inference, dgx-spark]
$ lab onboard mac-studio --provider ssh --host 192.168.1.51
→ Labels: [k8s-server, etcd, arm]
# Phase 2: Onboard bare metal (PXE from your laptop)
$ lab onboard beelink-ser9-pro --provider baremetal --mac XX:XX:XX:XX:XX:01 \
--image ubuntu-24.04 --labels bootstrap,lab-server
→ PXE boot from laptop → install OS → enroll
→ This will become the permanent lab-server host
# Phase 3: Move lab-server to a real server
$ lab server migrate --target ssh --host beelink-ser9-pro
→ Lab-server deployed on Beelink SER9 Pro
→ State migrated from ~/.lab/state.db
→ PXE/DHCP now served from Beelink, not your laptop
→ CLI config updated: lab talks to beelink-ser9-pro:7443
# Phase 4: Onboard remaining servers (PXE from beelink-ser9-pro now)
$ lab onboard beelink-ser9-max --provider baremetal --mac XX:XX:XX:XX:XX:02 \
--image ubuntu-24.04 --labels k8s-worker,rocm,longhorn
→ PXE served by beelink-ser9-pro (not your laptop anymore)
$ lab onboard minisforum-ms-r1 --provider baremetal --mac XX:XX:XX:XX:XX:03 \
--image ubuntu-24.04 --labels k8s-worker,arm
# Phase 5: Set up k8s
$ lab apply cluster homelab --servers mac-studio,beelink-ser9-max,minisforum-ms-r1
→ mac-studio becomes k3s server (etcd)
→ beelink-ser9-max joins as worker
→ minisforum-ms-r1 joins as worker
→ All via puppet classes from labels
# Phase 6: Optionally move lab-server into k8s
$ lab server migrate --target kubernetes --cluster homelab
→ Lab-server now runs as k8s pod
→ Still manages everything including the cluster it runs on
# Final state:
$ lab get servers
NAME PROVIDER LABELS SYNC PUPPET HEALTH IDENTITY
dgx-spark ssh cuda,ai-inference ✓ ✓ ok ✓ ✓ enrolled
mac-studio ssh k8s-server,etcd,arm ✓ ✓ ok ✓ ✓ enrolled
beelink-ser9-pro baremetal bootstrap ✓ ✓ ok ✓ ✓ enrolled
beelink-ser9-max baremetal k8s-worker,rocm,longhorn ✓ ✓ ok ✓ ✓ enrolled
minisforum-ms-r1 baremetal k8s-worker,arm ✓ ✓ ok ✓ ✓ enrolled
lab-server kubernetes lab,control-plane ✓ ✓ ok ✓ ✓ enrolled
```
### Enterprise Application: XCP-ng Bare Metal Deploy
Same onboarding flow works for deploying XCP-ng to enterprise bare metal:
```
$ lab onboard xen-host-42 --provider baremetal \
--mac AA:BB:CC:DD:EE:FF \
--ipmi 10.0.0.142 --ipmi-user admin \
--image xcpng-8.3 \
--labels xen-host,production,eu-west
→ IPMI: power on, PXE boot
→ Install XCP-ng 8.3 (unattended)
→ Enroll, apply puppet classes:
→ xen-host: xcpng::host, xcpng::networking, xcpng::storage
→ Host registered in Xen Orchestra pool
→ Ready to provision VMs on it
# Now create VMs on the XCP-ng host we just onboarded:
$ lab apply server app-12 --provider xcpng --labels app,production
→ VM created on xen-host-42 via Xen Orchestra API
→ OS installed, enrolled, puppet applied
→ Same flow as AWS EC2, just different provider
```
### PXE Server Capabilities
When running in local or server mode, Lab includes an embedded PXE server:
- **DHCP**: scoped to specific MACs only (safe for existing networks with DHCP)
- **TFTP**: serves iPXE bootloader
- **HTTP**: serves kernel, initrd, autoinstall configs
- **Autoinstall generation**: creates unattended install configs per-machine with:
- Lab agent pre-installed
- One-time enrollment token baked in
- Network config for the target environment
- Disk layout per label/profile
- **Supported images**: Ubuntu, Debian, RHEL/Rocky, XCP-ng (extensible)
PXE serving moves with lab-server — if you migrate lab to a new host,
PXE is served from there. If lab is on your laptop, PXE is on your laptop.
Same engine, same binary.
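The "respond only to specific MACs" rule is what makes the embedded DHCP safe on a network that already has a DHCP server. A minimal sketch of that check, using a hypothetical `Allowlist` type and the standard library's `net.ParseMAC` for normalization, might look like this (the actual DHCP packet handling is out of scope):

```go
package main

import (
	"fmt"
	"net"
)

// Allowlist holds the MAC addresses the embedded DHCP server may answer.
type Allowlist struct {
	macs map[string]bool
}

// NewAllowlist parses and normalizes each MAC so that "AA:BB:..." and
// "aa-bb-..." compare equal later.
func NewAllowlist(macs ...string) (*Allowlist, error) {
	a := &Allowlist{macs: make(map[string]bool)}
	for _, m := range macs {
		hw, err := net.ParseMAC(m)
		if err != nil {
			return nil, fmt.Errorf("invalid MAC %q: %w", m, err)
		}
		a.macs[hw.String()] = true
	}
	return a, nil
}

// ShouldOffer reports whether a DHCP request from clientMAC may be answered.
// Unparseable or unknown addresses are ignored, never answered.
func (a *Allowlist) ShouldOffer(clientMAC string) bool {
	hw, err := net.ParseMAC(clientMAC)
	if err != nil {
		return false
	}
	return a.macs[hw.String()]
}

func main() {
	allow, err := NewAllowlist("AA:BB:CC:DD:EE:FF")
	if err != nil {
		panic(err)
	}
	fmt.Println(allow.ShouldOffer("aa:bb:cc:dd:ee:ff")) // the machine being onboarded
	fmt.Println(allow.ShouldOffer("11:22:33:44:55:66")) // some other device on the LAN
}
```

Defaulting to "do not answer" on any parse failure keeps the failure mode on the safe side for existing networks.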
### Hardware Detection During Onboard
When onboarding via SSH (existing OS), Lab detects hardware and suggests labels:
```
$ lab onboard new-server --provider ssh --host 10.0.0.50
Detected hardware:
CPU: AMD EPYC 7763 (x86_64, 64 cores) → suggest: compute
RAM: 256 GB → suggest: high-memory
GPU: NVIDIA A100 80GB → suggest: cuda, ai-training
Disk: 2x NVMe 1.92TB, 4x SSD 3.84TB → suggest: storage
NIC: 2x 25GbE, 1x 1GbE IPMI → suggest: high-bandwidth
Suggested labels: [compute, high-memory, cuda, ai-training, storage, high-bandwidth]
Assign labels [accept/edit]: _
```
For PXE onboard, hardware detection happens after OS installation, and labels
can be auto-confirmed or require interactive approval.
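Label suggestion can be expressed as a small set of rules over detected hardware facts. The sketch below is purely illustrative: the `Hardware` fields, the thresholds (32 cores, 256 GB), and the vendor-to-label mapping are assumptions, and the result is only a suggestion that the user still confirms or edits.

```go
package main

import (
	"fmt"
	"strings"
)

// Hardware is a hypothetical summary of what detection found on a machine.
type Hardware struct {
	Arch      string // e.g. "x86_64", "aarch64"
	Cores     int
	RAMGB     int
	GPUVendor string // e.g. "NVIDIA", "AMD", or empty
}

// SuggestLabels maps hardware facts to candidate labels.
// All thresholds are illustrative assumptions, not values from the spec.
func SuggestLabels(h Hardware) []string {
	var out []string
	if h.Cores >= 32 {
		out = append(out, "compute")
	}
	if h.RAMGB >= 256 {
		out = append(out, "high-memory")
	}
	switch strings.ToLower(h.GPUVendor) {
	case "nvidia":
		out = append(out, "cuda")
	case "amd":
		out = append(out, "rocm")
	}
	if h.Arch == "aarch64" {
		out = append(out, "arm")
	}
	return out
}

func main() {
	epyc := Hardware{Arch: "x86_64", Cores: 64, RAMGB: 256, GPUVendor: "NVIDIA"}
	fmt.Println("Suggested labels:", SuggestLabels(epyc))
}
```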
### No Server? CLI Runs Locally
If no remote server is configured, every `lab` command runs the engine locally.
This means you can use Lab in permanent local mode for simple setups:
```
$ lab get servers # no remote server configured
ⓘ Running locally (~/.lab/state.db)
Tip: run `lab server migrate --target <target>` to deploy a persistent server
NAME PROVIDER LABELS SYNC PUPPET HEALTH IDENTITY
...
```
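The fallback described here boils down to one decision at CLI start-up: use the configured server if there is one, otherwise open the local state file. A minimal sketch, with hypothetical `Config`, `Backend`, and `ResolveBackend` names, could be:

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// Config is a hypothetical slice of the CLI configuration.
type Config struct {
	ServerURL string // empty when no remote lab-server is configured
}

// Backend describes where a command will run.
type Backend struct {
	Remote    bool
	Endpoint  string // set when Remote is true
	StatePath string // set when running the engine locally
}

// ResolveBackend picks the remote server when configured and otherwise
// falls back to the embedded engine with state under ~/.lab/state.db.
func ResolveBackend(cfg Config, homeDir string) Backend {
	if cfg.ServerURL != "" {
		return Backend{Remote: true, Endpoint: cfg.ServerURL}
	}
	return Backend{StatePath: filepath.Join(homeDir, ".lab", "state.db")}
}

func main() {
	home, err := os.UserHomeDir()
	if err != nil {
		panic(err)
	}
	local := ResolveBackend(Config{}, home)
	fmt.Printf("Running locally (%s)\n", local.StatePath)

	remote := ResolveBackend(Config{ServerURL: "beelink-ser9-pro:7443"}, home)
	fmt.Printf("Talking to %s\n", remote.Endpoint)
}
```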
### Self-Migration
Migration uses the same plan/apply as everything else:
```
$ lab server migrate --target ssh --host beelink-ser9-pro
Step 1: Plan
~ migrate lab-server from local (~/.lab) to ssh://beelink-ser9-pro
+ deploy lab-server container on beelink-ser9-pro
+ copy state.db to remote host
+ start PXE/DHCP services on remote host
+ stop local PXE/DHCP services
+ update CLI config to new endpoint
Step 2: Apply
→ Deploy lab-server on beelink-ser9-pro
→ Copy state to remote
→ Verify remote is healthy
→ Switch CLI config
→ Stop local engine
$ lab server migrate --target kubernetes --cluster homelab
Step 1: Plan
~ migrate lab-server from ssh://beelink-ser9-pro to kubernetes://homelab
+ k8s Deployment lab-server (1 replica)
+ k8s Service lab-server (port 7443)
+ PersistentVolumeClaim lab-server-state (10Gi)
+ migrate state.db to PVC
+ PXE services: move to k8s hostNetwork pod or keep on bootstrap node
⚠ Note: PXE/DHCP requires L2 network access. If k8s node is on the same
L2 segment, use hostNetwork. Otherwise, keep PXE on the bootstrap node
and only migrate the API/state to k8s.
Step 2: Apply
→ Deploy to k8s
→ Migrate state
→ Verify healthy
→ Update CLI config
→ Tear down old deployment
```
### Key Design Principles
1. **One engine everywhere** — CLI, local mode, server mode, and init all share the same code
2. **Your device is the first coordinator** — no chicken-and-egg, start from nothing
3. **Onboard uses the same pipeline as apply** — render, plan, apply, enroll
4. **PXE is embedded** — no external PXE/DHCP server needed, Lab serves it
5. **Hardware detection suggests labels** — but the user confirms
6. **Migration is just plan/apply for lab-server** — same engine, no special case
7. **Enterprise and homelab are the same flow** — onboard XCP-ng bare metal = onboard homelab Beelink
## Identity and Trust Layer
Inspired by what FreeIPA did well (auto-DNS, centralized SSH, server-scoped secrets,
internal CA, IP mobility) without what it did badly (instability, hardcoded join secrets).
Lab controls the full lifecycle — it knows when a machine is born — so it can solve
the enrollment problem properly: generate a one-time join token at provision time,
inject it via cloud-init or iPXE userdata. No hardcoded secrets in images.
### Provision-to-Enrolled Flow
```
$ lab apply server new-worker-5 --label k8s-worker --provider aws

1. PROVISION  → Pulumi creates EC2 instance
2. IDENTITY   → Lab generates one-time join token (short-lived, single-use)
              → Token injected via cloud-init (or iPXE userdata for bare metal)
              → Token is NOT in the image; generated per-instance at provision time
3. ENROLL     → Machine boots, uses token to:
                  → Register with OpenVox (cert signed, node classified)
                  → Register in DNS (A record + PTR)
                  → Authenticate with Vault (get identity + policies per label)
                  → Get SSH CA-signed host key (no more TOFU)
4. CONFIGURE  → OpenVox applies classes
              → Machine pulls secrets it's allowed to access from Vault
              → e.g. k8s join token retrieved from Vault, node joins cluster
5. ENROLLED   → Lab marks resource identity as ✓ enrolled
```
### What Each Machine Gets on Enrollment
| Capability | What happens | Tool underneath (TBD — needs investigation) |
|-----------|-------------|----------------------------------------------|
| DNS auto-registration | A + PTR records created/updated automatically | CoreDNS API? ExternalDNS? PowerDNS? needs investigation |
| IP mobility | Machine restarts with new IP → DNS updated automatically | Lab agent on machine reports changes? DHCP hook? needs investigation |
| Server certificate | TLS cert issued for the machine, auto-renewed | OpenVox CA? Vault PKI secrets engine? cert-manager? needs investigation |
| SSH host key signing | Host key signed by CA, clients trust CA not individual keys | Vault SSH secrets engine? OpenVox CA? step-ca? needs investigation |
| SSH user access | Users get short-lived SSH certs, centrally managed | Vault SSH + OIDC? Teleport? Boundary? needs investigation |
| Secret access (RBAC) | Machine authenticates with Vault, gets label-scoped policy | Vault AppRole? Vault cert auth? needs investigation |
| K8s join tokens | Retrieved from Vault by entitled machines, used to join cluster | Vault KV + policy per label? needs investigation |
| OpenVox enrollment | Cert signed, environment + role + classes assigned | OpenVox CA + ENC — this one we know |
| One-time join tokens | Generated per-instance at provision, single-use, short-lived | Lab itself generates these — or delegate to Vault? needs investigation |
**Important: We don't need to build any of these from scratch.** Each row is a capability
that likely has an existing tool we can wrap. Just like we use Pulumi for cloud APIs and
OpenVox for config management, we'll find the right tool for each identity concern.
Each row requires investigation; we'll evaluate the options together, one by one.
### CLI: Identity Information
```
$ lab get servers
NAME       PROVIDER    LABELS       SYNC   PUPPET   HEALTH   IDENTITY
worker-5   aws         k8s-worker   ✓      ✓ ok     ✓        ✓ enrolled
worker-6   xcpng       k8s-worker   ✓      ✓ ok     ✓        ✓ enrolled
worker-7   baremetal   k8s-worker   ✓      ✗ fail   ⚠        ⚠ cert expiring
new-box    aws         k8s-worker   ✓      …        …        ⏳ enrolling
$ lab describe server worker-5
...
Identity:
  DNS:            worker-5.lab.internal (A: 10.0.1.45, PTR: ✓)
  OpenVox:        ✓ cert signed (expires 2027-03-15)
  Vault:          ✓ authenticated (policy: k8s-worker)
  SSH Host Key:   ✓ CA-signed (fingerprint: SHA256:abc...)
  Secrets:        k8s/join-token, tls/node-cert (2 accessible)
  Enrolled:       2026-03-15 14:22:03 (one-time token, consumed)
  Last Check-in:  2026-03-15 15:01:12 (38 seconds ago)
$ lab get secrets --label k8s-worker
SECRET               TYPE      ACCESSIBLE BY            LAST ROTATED
k8s/join-token       dynamic   k8s-worker (12 srv)      2026-03-15
tls/cluster-ca       static    k8s-worker, k8s-server   2026-01-01
monitoring/api-key   static    k8s-worker, monitoring   2026-02-28
$ lab identity renew worker-5 # force cert/key renewal
$ lab identity revoke worker-5 # revoke all creds, remove from DNS, unenroll
```
### Secrets — Code Is The Policy
**Design principle:** If your code/config declares "I use secret X", that IS the access
grant. No one goes to a separate UI to edit policies. Default is locked — if not
mentioned, no access. If mentioned, access is automatic.
**The declaration IS the policy:**
```yaml
labels:
  mailserver:
    puppet_classes:
      - postfix
      - dovecot
    secrets:
      - mail/tls-cert
      - mail/dkim-key
      - mail/relay-credentials
    ports: [25, 587, 993]
```
When Lab applies label `mailserver` to a server, it automatically:
1. Grants that server access to `mail/tls-cert`, `mail/dkim-key`, `mail/relay-credentials`
2. Denies access to everything else
3. Requires no separate policy file, no Vault admin, and no ticket to the security team
When a puppet class references a secret:
```puppet
# modules/postfix/manifests/init.pp
class postfix {
  $relay_creds = lab::secret('mail/relay-credentials')

  file { '/etc/postfix/sasl_passwd':
    content => $relay_creds,
    mode    => '0600',
  }
}
```
The `lab::secret()` call is both the usage AND the declaration that this class
needs this secret. Lab scans puppet classes, discovers secret references,
and auto-generates the access policy. If `postfix` class is applied to a server
via a label, that server gets access to `mail/relay-credentials`. Remove the
class → access revoked.
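The discovery step described above can be sketched as a scan for literal `lab::secret(...)` calls. This is only an illustration: `FindSecretRefs` is a hypothetical helper, and a regular expression misses computed arguments and cannot tell commented-out code from live code, so a real scanner would more likely walk the Puppet parser's AST.

```go
package main

import (
	"fmt"
	"regexp"
	"sort"
)

// secretRef matches lab::secret('path') or lab::secret("path") where the
// argument is a string literal. Variables and interpolation are not handled.
var secretRef = regexp.MustCompile(`lab::secret\(\s*['"]([^'"]+)['"]\s*\)`)

// FindSecretRefs returns the sorted, de-duplicated secret paths referenced
// by one manifest.
func FindSecretRefs(manifest string) []string {
	seen := map[string]bool{}
	for _, m := range secretRef.FindAllStringSubmatch(manifest, -1) {
		seen[m[1]] = true
	}
	refs := make([]string, 0, len(seen))
	for p := range seen {
		refs = append(refs, p)
	}
	sort.Strings(refs)
	return refs
}

func main() {
	manifest := `
class postfix {
  $relay_creds = lab::secret('mail/relay-credentials')
  file { '/etc/postfix/sasl_passwd':
    content => $relay_creds,
    mode    => '0600',
  }
}`
	fmt.Println(FindSecretRefs(manifest))
}
```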
**Secrets must be equally easy to access from anywhere:**
| Runtime | How you get a secret | What happens underneath |
|---------|---------------------|-----------------|
| Puppet code | `lab::secret('mail/tls-cert')` | Lab agent on machine fetches from secret backend |
| App on VM | `LAB_SECRET_MAIL_TLS_CERT` env var, or `/run/secrets/mail/tls-cert` file | Lab agent provides via env or tmpfs mount |
| App in Kubernetes | Same env var or volume mount | Lab k8s operator syncs to K8s Secret object |
| App in Docker (standalone) | `--env-file` or bind mount from lab agent | Lab agent on host provides |
| Script / cron job | `lab secret get mail/tls-cert` CLI call | Lab CLI authenticated via machine identity |
| cloud-init / bootstrap | Injected at provision time via one-time token | Lab server provides during enrollment |
**One way to consume secrets, regardless of where you run.** The lab agent (or k8s
operator, or CLI) handles authentication and fetching transparently. The app just
reads an env var or file.
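The table implies a fixed mapping from a secret path to an env var name and a file path. A minimal sketch of one such mapping, inferred only from the two examples in the table (`LAB_SECRET_MAIL_TLS_CERT` and `/run/secrets/mail/tls-cert`); the function names and the exact replacement rules are assumptions.

```go
package main

import (
	"fmt"
	"strings"
)

// EnvName maps a secret path to an env var name, for example
// "mail/tls-cert" becomes "LAB_SECRET_MAIL_TLS_CERT".
// Caveat: "mail/tls-cert" and "mail/tls_cert" map to the same name, so a
// real implementation would need to reject or disambiguate such pairs.
func EnvName(secretPath string) string {
	r := strings.NewReplacer("/", "_", "-", "_", ".", "_")
	return "LAB_SECRET_" + strings.ToUpper(r.Replace(secretPath))
}

// FilePath maps a secret path to its file location. Input validation
// (for example rejecting ".." segments) is deliberately left out here.
func FilePath(secretPath string) string {
	return "/run/secrets/" + secretPath
}

func main() {
	fmt.Println(EnvName("mail/tls-cert"))
	fmt.Println(FilePath("mail/tls-cert"))
}
```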
#### How Access Flows
```
Label "mailserver"
  declares secrets:
    - mail/tls-cert
    - mail/dkim-key
            │
            ▼
┌───────────────────────┐
│  Lab compiles policy  │
│                       │
│  server mail-1:       │
│    CAN access:        │
│      mail/tls-cert    │
│      mail/dkim-key    │
│    CANNOT access:     │
│      k8s/*            │
│      postgres/*       │
│      (everything else)│
└───────────┬───────────┘
            │
            ▼
┌───────────────────────┐
│  Secret backend       │
│  (TBD, needs          │
│  investigation)       │
│                       │
│  Enforces policy at   │
│  backend level, not   │
│  just in Lab          │
└───────────────────────┘
```
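The "Lab compiles policy" box can be sketched as a pure function from label definitions and server-to-label assignments to a per-server allow list, with everything else denied by default. The types and function names here are hypothetical; the spec does not define Lab's actual data model.

```go
package main

import (
	"fmt"
	"sort"
)

// Label is a hypothetical, trimmed-down label definition.
type Label struct {
	Secrets []string
}

// CompilePolicy returns, for each server, the sorted set of secrets it may
// read: the union of the secrets declared by all of its labels.
func CompilePolicy(labels map[string]Label, serverLabels map[string][]string) map[string][]string {
	policy := make(map[string][]string, len(serverLabels))
	for server, names := range serverLabels {
		set := map[string]bool{}
		for _, name := range names {
			for _, s := range labels[name].Secrets {
				set[s] = true
			}
		}
		allowed := make([]string, 0, len(set))
		for s := range set {
			allowed = append(allowed, s)
		}
		sort.Strings(allowed)
		policy[server] = allowed
	}
	return policy
}

// Allowed implements default deny: only explicitly compiled grants pass.
func Allowed(policy map[string][]string, server, secret string) bool {
	for _, s := range policy[server] {
		if s == secret {
			return true
		}
	}
	return false
}

func main() {
	labels := map[string]Label{
		"mailserver": {Secrets: []string{"mail/tls-cert", "mail/dkim-key"}},
	}
	policy := CompilePolicy(labels, map[string][]string{"mail-1": {"mailserver"}})
	fmt.Println(policy["mail-1"])
	fmt.Println(Allowed(policy, "mail-1", "mail/tls-cert"))
	fmt.Println(Allowed(policy, "mail-1", "k8s/join-token"))
}
```

Because the output is derived entirely from declarations, removing a label or a secret reference and recompiling is what revokes access; the backend would then be told to enforce the new allow list.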
#### Secret Sources
Secrets themselves can come from multiple places:
```yaml
secrets:
  mail/tls-cert:
    type: dynamic            # generated/rotated automatically
    generator: acme          # cert-manager / Let's Encrypt
    rotate_every: 90d

  mail/dkim-key:
    type: static             # manually set, stored encrypted
    set_by: admin            # who last set it

  mail/relay-credentials:
    type: static
    set_by: admin

  k8s/join-token:
    type: dynamic
    generator: kubernetes    # fetched from k8s API
    rotate_every: 24h

  tls/node-cert:
    type: dynamic
    generator: ca            # issued per-machine from internal CA
    per_machine: true        # each machine gets its own
```
#### CLI for Secrets
```
$ lab get secrets
SECRET                   TYPE      USED BY               LAST ROTATED
mail/tls-cert            dynamic   mailserver (2 srv)    2026-03-14
mail/dkim-key            static    mailserver (2 srv)    2026-01-15
mail/relay-credentials   static    mailserver (2 srv)    2026-02-01
k8s/join-token           dynamic   k8s-worker (12 srv)   2026-03-15
tls/node-cert            dynamic   * (all enrolled)      per-machine
$ lab secret set mail/relay-credentials
Enter value: ****
✓ Updated. Accessible by: mailserver (2 servers)
✓ Servers will pick up new value within 60s
$ lab show secret mail/relay-credentials
Secret: mail/relay-credentials
Type: static
Last set: 2026-03-15 by admin
Accessible by (derived from code):
Label "mailserver" → puppet class "postfix" → lab::secret('mail/relay-credentials')
├── mail-1 (xcpng) last fetched: 12m ago
└── mail-2 (aws) last fetched: 12m ago
No other references found in any applied code.
$ lab secret audit
✓ All secrets are referenced by at least one applied class/label
⚠ Secret "old/api-key" is defined but not referenced by any code — orphaned?
⚠ Secret "db/password" referenced by class "app::database" but never set — empty!
```
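The two warnings in the `lab secret audit` output are set differences between secrets that have a value and secrets that applied code references. A minimal sketch, with `Audit` and `AuditReport` as hypothetical names:

```go
package main

import (
	"fmt"
	"sort"
)

// AuditReport lists the two kinds of mismatch shown in the audit output.
type AuditReport struct {
	Orphaned []string // has a value, but no applied code references it
	Unset    []string // referenced by applied code, but never set
}

// Audit compares the two sets. Both maps are used as sets: a key is a
// member when its value is true.
func Audit(stored, referenced map[string]bool) AuditReport {
	var r AuditReport
	for p := range stored {
		if !referenced[p] {
			r.Orphaned = append(r.Orphaned, p)
		}
	}
	for p := range referenced {
		if !stored[p] {
			r.Unset = append(r.Unset, p)
		}
	}
	sort.Strings(r.Orphaned)
	sort.Strings(r.Unset)
	return r
}

func main() {
	stored := map[string]bool{"mail/dkim-key": true, "old/api-key": true}
	referenced := map[string]bool{"mail/dkim-key": true, "db/password": true}
	report := Audit(stored, referenced)
	fmt.Println("orphaned:", report.Orphaned)
	fmt.Println("unset:   ", report.Unset)
}
```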
#### Secret Architecture — Distributed, Offline-Capable
**Critical requirement:** Nothing breaks if the central secret server (or any server)
is unreachable. Everything continues to work, including creating new pods, deployments,
and puppet runs, using the local encrypted cache. This is not an edge case; it is a core design goal.
**This means secrets are NOT a central server you query.** They're a distributed,
synced, encrypted dataset with offline capability.
```
┌─────────────────────────────────────────────────────────────┐
│                  Secret Distribution Model                  │
│                                                             │
│  NOT this (central server):    THIS (distributed sync):     │
│                                                             │
│       ┌─────────┐              ┌──────┐   ┌──────┐          │
│       │  Vault  │              │ Node │◄─►│ Node │          │
│       └────┬────┘              └──┬───┘   └──┬───┘          │
│       ┌────┼────┐                 │  ▲       │              │
│       │    │    │                 ▼  │       ▼              │
│      ┌┴┐  ┌┴┐  ┌┴┐             ┌──────┐   ┌──────┐          │
│      │N│  │N│  │N│             │ Node │◄─►│ Node │          │
│      └─┘  └─┘  └─┘             └──┬───┘   └──────┘          │
│   (all dead if vault              │                         │
│    is unreachable)                ▼                         │
│                              ┌──────────┐                   │
│                              │ Git repo │  (encrypted       │
│                              │ (backup) │   backup of       │
│                              └──────────┘   last resort)    │
└─────────────────────────────────────────────────────────────┘
```
#### How It Works
**Layer 1: Local Encrypted Cache (on every machine)**
- Every machine that has access to secrets stores them locally, encrypted at rest
- Encrypted with machine-specific key (derived from machine identity/TPM/secure enclave)
- Puppet runs, app starts, pod deployments — all read from local cache
- If cache is fresh → use it, no network call needed
- Cache has TTL per secret, but stale cache is better than no secret
**Layer 2: Secret Store (privileged nodes that hold all secrets)**
- One or more nodes with the `secret-store` label hold the COMPLETE encrypted dataset
- This is NOT a special server type — it's a label, applied to pods, VMs, or bare metal
- Should have at least 2 replicas for HA
- Machines fetch ONLY the secrets their labels entitle them to from the store
- The store enforces policy — a machine with label `mailserver` gets `mail/*`, nothing else
- Machines NEVER sync with each other directly — they only talk to the store
- This prevents secret sprawl (no machine accumulates secrets it shouldn't have)
**Layer 3: Git Encrypted Backup (last resort recovery)**
- All secrets (encrypted with a master key) backed up to a Git repo
- If a machine has an empty cache AND no secret-store replica is reachable → the operator restores the secret-store from the Git backup
- SOPS/age style encryption — secrets encrypted, metadata (paths, policies) in plaintext
- Git gives versioning, audit trail, and disaster recovery for free
- The Git repo alone is useless without the decryption key
**Layer 4: Lab-server (coordinator, NOT single point of failure)**
- Lab-server is the preferred interface to set/rotate secrets (via CLI/API)
- Lab-server does NOT need to be the secret-store (but can be, via label)
- If lab-server is down, machines keep running from local cache
- If the secret-store is down too, no new secrets can be distributed
- Even then nothing breaks: existing workloads continue uninterrupted
- When secret-store comes back, machines sync and catch up
**Separation of concerns:**
- `lab-server` = coordination, API, lifecycle management
- `secret-store` label = holds all secrets, serves policy-filtered requests
- These CAN be the same node (apply both labels) or separate nodes
- For homelab: same node is fine. For enterprise: separate for isolation
#### Recovery Scenarios
```
Scenario 1: Lab-server down, secret-store up
→ All machines continue working from local cache
→ Machines can still fetch/refresh secrets from secret-store
→ No new resources can be provisioned (lab-server manages lifecycle)
→ But existing workloads are unaffected
Scenario 2: Secret-store down, lab-server up
→ All machines continue working from local cache
→ Lab-server can still manage lifecycle (provision, plan, apply)
→ No new secrets can be distributed
→ No secret rotations until store is back
→ Lab-server shows: ⚠ secret-store unreachable
Scenario 3: Both down
→ All machines continue working from local cache
→ Nothing new can happen, but nothing breaks
→ Recovery priority: restore secret-store first (from Git backup)
Scenario 4: Machine reboots, cache intact
→ Reads from local encrypted cache immediately
→ Refreshes from secret-store in background to catch up
→ No dependency on lab-server for startup
Scenario 5: Machine rebuilt, cache empty
→ Machine has its identity (from enrollment) but no secrets
→ Fetches entitled secrets from secret-store (policy-filtered)
→ If secret-store unreachable → cannot start (needs secrets)
→ Operator can restore secret-store from Git backup to unblock
Scenario 6: Total disaster, only Git backup survives
→ Deploy new node, apply `secret-store` label
→ Restore encrypted secrets from Git backup
→ Deploy lab-server (lab init)
→ New machines enroll and receive their entitled secrets
→ System fully recovered
Scenario 7: New pod in k8s, secret-store unreachable
→ K8s node has local secret cache for its entitled secrets
→ Lab k8s operator serves pod secrets from node's local cache
→ Pod starts with cached secrets
→ No interruption to deployments
```
#### CLI for Secret Distribution
```
$ lab secret status
SECRET DISTRIBUTION STATUS:
Local cache: ✓ 8 secrets cached (of 8 entitled), encrypted, fresh (< 5m old)
Secret store: ✓ connected (2 replicas: store-1, store-2)
Lab-server: ✓ connected
Git backup: ✓ last push 2026-03-15 14:30:00 (47 total secrets)
$ lab secret status --store
SECRET STORE:
  Replicas: 2/2 healthy
    store-1   k8s pod    ✓ synced   47 secrets (all)
    store-2   vm/xcpng   ✓ synced   47 secrets (all)
  Git backup: ✓ synced 2026-03-15 14:30:00
  Total secrets: 47
  Entitled consumers:
    k8s-worker (12 machines) → 3 secrets each
    mailserver (2 machines)  → 5 secrets each
    postgres   (3 machines)  → 4 secrets each
    lab-server (1 machine)   → 2 secrets
$ lab secret cache
LOCAL CACHE:
SECRET           CACHED   TTL         STATUS
mail/tls-cert    ✓        89d left    fresh
mail/dkim-key    ✓        no expiry   fresh
k8s/join-token   ✓        23h left    fresh
tls/node-cert    ✓        346d left   fresh
$ lab secret recover --from git
→ Fetching encrypted backup from git@github.com:org/lab-secrets.git
→ Decrypting with master key...
→ Restored 47 secrets
→ Syncing to secret-store replicas...
```
#### Local Cache Security
The local cache must be stored securely — needs investigation:
- Encrypted at rest with machine-specific key
- Key derived from: TPM 2.0? Secure enclave? LUKS-bound? needs investigation
- Decrypted values held only in locked memory (mlock), so they are never swapped to disk
- Accessible only by lab agent (file permissions + MAC/SELinux)
- Wiped on machine decommission (`lab identity revoke`)
- Possibly use kernel keyring on Linux — needs investigation
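Whatever the key source turns out to be, the at-rest encryption itself can be sketched with authenticated encryption from Go's standard library. This assumes a 32-byte machine key is already available; how that key is derived (TPM, enclave, LUKS) is exactly the open question above and is not addressed here. `Seal` and `Open` are hypothetical helper names.

```go
package main

import (
	"crypto/aes"
	"crypto/cipher"
	"crypto/rand"
	"errors"
	"fmt"
)

func newGCM(key []byte) (cipher.AEAD, error) {
	block, err := aes.NewCipher(key) // a 32-byte key selects AES-256
	if err != nil {
		return nil, err
	}
	return cipher.NewGCM(block)
}

// Seal encrypts one cache entry. The random nonce is prepended to the
// ciphertext so that Open can recover it.
func Seal(key, plaintext []byte) ([]byte, error) {
	gcm, err := newGCM(key)
	if err != nil {
		return nil, err
	}
	nonce := make([]byte, gcm.NonceSize())
	if _, err := rand.Read(nonce); err != nil {
		return nil, err
	}
	return gcm.Seal(nonce, nonce, plaintext, nil), nil
}

// Open decrypts and authenticates an entry produced by Seal. It fails if
// the data was tampered with or the key is wrong.
func Open(key, sealed []byte) ([]byte, error) {
	gcm, err := newGCM(key)
	if err != nil {
		return nil, err
	}
	if len(sealed) < gcm.NonceSize() {
		return nil, errors.New("sealed data too short")
	}
	nonce, ciphertext := sealed[:gcm.NonceSize()], sealed[gcm.NonceSize():]
	return gcm.Open(nil, nonce, ciphertext, nil)
}

func main() {
	key := make([]byte, 32) // placeholder: a real key must never be all zeros
	if _, err := rand.Read(key); err != nil {
		panic(err)
	}
	sealed, err := Seal(key, []byte("relay-password"))
	if err != nil {
		panic(err)
	}
	plain, err := Open(key, sealed)
	fmt.Println(string(plain), err)
}
```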
#### Secret Backend — NOT Decided
The underlying secret storage/sync mechanism is pluggable:
```go
type SecretBackend interface {
	Name() string

	// CRUD
	Get(path string, identity *MachineIdentity) ([]byte, error)
	Set(path string, value []byte) error
	Delete(path string) error
	List(prefix string) ([]string, error)

	// Policy (auto-generated from code/labels)
	GrantAccess(path string, identity *MachineIdentity) error
	RevokeAccess(path string, identity *MachineIdentity) error

	// Dynamic
	Generate(path string, generator GeneratorConfig) ([]byte, error)
	Rotate(path string) error

	// Distribution
	SyncWith(peer PeerInfo) error
	CacheLocally(secrets []Secret) error
	RestoreFromBackup(source BackupSource) error
}
```
Possible approaches (each needs investigation):
- **SOPS + age + Git** — simplest, encrypted files in Git, but no peer sync
- **OpenBao** — Vault fork, has replication, but still central-server mindset
- **Sealed Secrets / External Secrets Operator** — k8s-native, but not universal
- **Infisical** — developer-friendly, but SaaS-oriented
- **Custom: encrypted SQLite + peer sync** — simple, we control the sync protocol
- **etcd with encryption** — distributed by nature, but might be overkill
- **CockroachDB** — distributed SQL, encrypted, survives node failures
- **Consul** — distributed KV with gossip, HashiCorp though
- **Lab's own sync protocol** — gossip-based, encrypted, purpose-built
The right answer might be a combination:
- SOPS/age for encryption format (proven, auditable)
- Custom gossip sync for distribution (lightweight)
- Git for backup (free versioning and DR)
- Or wrap an existing distributed KV that already handles sync
**This is the most complex subsystem in Lab and needs careful investigation.**
### Identity Plugin System
Same extensible pattern as providers and health sources:
```go
type IdentityPlugin interface {
	Name() string

	// Enrollment
	Enroll(resource *Resource, token string) (*Identity, error)
	Revoke(resource *Resource) error

	// Status
	Status(resource *Resource) (*IdentityStatus, error)

	// Renewal
	Renew(resource *Resource) error
}
```
This allows swapping identity backends without changing the rest of Lab.
We might start with Vault + OpenVox CA and later add/replace components.
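To illustrate the swap-ability, here is a minimal in-memory implementation of the interface, the kind of thing that could back unit tests. `Resource`, `Identity`, and `IdentityStatus` are stand-in types with invented fields, since the spec does not define them, and `MemoryPlugin` is hypothetical.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
	"time"
)

// Stand-ins for Lab's real types; the fields are assumptions.
type Resource struct{ Name string }

type Identity struct {
	Resource   string
	EnrolledAt time.Time
}

type IdentityStatus struct {
	Enrolled  bool
	RenewedAt time.Time
}

type IdentityPlugin interface {
	Name() string
	Enroll(resource *Resource, token string) (*Identity, error)
	Revoke(resource *Resource) error
	Status(resource *Resource) (*IdentityStatus, error)
	Renew(resource *Resource) error
}

// MemoryPlugin keeps enrollment state in a map. It checks only that a token
// was supplied; validating it is the job of whatever issues the tokens.
type MemoryPlugin struct {
	mu       sync.Mutex
	enrolled map[string]time.Time
}

// Compile-time check that MemoryPlugin satisfies the interface.
var _ IdentityPlugin = (*MemoryPlugin)(nil)

func NewMemoryPlugin() *MemoryPlugin {
	return &MemoryPlugin{enrolled: make(map[string]time.Time)}
}

func (p *MemoryPlugin) Name() string { return "memory" }

func (p *MemoryPlugin) Enroll(r *Resource, token string) (*Identity, error) {
	if token == "" {
		return nil, errors.New("enrollment requires a join token")
	}
	p.mu.Lock()
	defer p.mu.Unlock()
	now := time.Now()
	p.enrolled[r.Name] = now
	return &Identity{Resource: r.Name, EnrolledAt: now}, nil
}

func (p *MemoryPlugin) Revoke(r *Resource) error {
	p.mu.Lock()
	defer p.mu.Unlock()
	delete(p.enrolled, r.Name)
	return nil
}

func (p *MemoryPlugin) Status(r *Resource) (*IdentityStatus, error) {
	p.mu.Lock()
	defer p.mu.Unlock()
	at, ok := p.enrolled[r.Name]
	return &IdentityStatus{Enrolled: ok, RenewedAt: at}, nil
}

func (p *MemoryPlugin) Renew(r *Resource) error {
	p.mu.Lock()
	defer p.mu.Unlock()
	if _, ok := p.enrolled[r.Name]; !ok {
		return fmt.Errorf("%s is not enrolled", r.Name)
	}
	p.enrolled[r.Name] = time.Now()
	return nil
}

func main() {
	var plugin IdentityPlugin = NewMemoryPlugin()
	worker := &Resource{Name: "worker-5"}
	if _, err := plugin.Enroll(worker, "one-time-token"); err != nil {
		panic(err)
	}
	status, _ := plugin.Status(worker)
	fmt.Println(plugin.Name(), "enrolled:", status.Enrolled)
}
```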
## State Storage — Design Principles
**NOT etcd.** etcd prioritizes consistency over availability: when it loses quorum it stops
accepting writes rather than risk serving inconsistent data. For Lab, availability wins:
- Losing a few events is better than total outage
- Should auto-backup and auto-restore on corruption
- Should degrade gracefully, never crash and refuse to start
- Stale data is acceptable, no data is not
Requirements:
- Stores: resource state, label definitions, group membership, alert configs, audit log
- Must survive lab-server restart
- Must be migratable (lab-server can move between hosts)
- Should auto-backup (to Git, S3, or local snapshots)
- Should auto-recover from corruption without operator intervention
- Embedded (no external dependency) preferred for simplicity
Candidates (needs investigation):
- **SQLite** — embedded, simple, proven, WAL mode for concurrent reads, easy to backup (copy file)
- **bbolt/BoltDB** — embedded KV, used by etcd ironically, simpler than etcd itself
- **Badger** — embedded KV in Go, LSM-tree, good performance
- **DuckDB** — embedded analytical DB, might be overkill
- **PostgreSQL** — if we need multi-server state, but adds external dependency
- **Litestream** — SQLite + continuous replication to S3/GCS/Azure (interesting combo)
**SQLite + Litestream** is the current leading candidate:
- SQLite for simplicity and embeddability
- Litestream for continuous backup to S3/GCS/local without stopping the database
- Auto-restore: if DB is missing, Litestream restores from latest backup
- Single file, easy to migrate when lab-server moves
- But needs investigation to confirm it handles our scale
## Open Questions
1. Name: "lab" is simple but generic. Alternatives?
2. GitOps integration — should label/profile changes go through Git, or direct API?
3. Multi-tenancy — how to scope labels/resources per team?
4. Auth — mTLS between CLI and server? OIDC? Vault-issued tokens?
5. Input format — TypeScript (DA-style), YAML (Compose-style), or both?
6. Should `lab init` deploy lab-server as a container (portable) or native binary (simpler)?