# Lab — Unified Infrastructure Lifecycle Platform

## What It Is

A tool that abstracts infrastructure lifecycle across clouds, hypervisors, bare metal,
and Kubernetes — using labels as the universal abstraction and existing tools under the hood.

**Not reinventing the wheel.** Uses Pulumi, OpenVox, Tinkerbell, Prometheus, Naemon,
existing Puppet modules, cloud APIs — but provides a unified interface over all of them.

## Architecture

```
┌────────────────────────────────────────────────────────────┐
│                 lab-server (control plane)                 │
│                                                            │
│  ┌──────────┐  ┌──────────┐  ┌──────────┐  ┌────────────┐  │
│  │ Provider │  │  Label   │  │ Lifecycle│  │  Artifact  │  │
│  │ Registry │  │  Engine  │  │ Manager  │  │  Builder   │  │
│  └──────────┘  └──────────┘  └──────────┘  └────────────┘  │
│  ┌──────────┐  ┌──────────┐  ┌──────────┐  ┌────────────┐  │
│  │ OpenVox  │  │  Health  │  │   K8s    │  │   Render   │  │
│  │ Enroller │  │Aggregator│  │ Deployer │  │   Engine   │  │
│  └──────────┘  └──────────┘  └──────────┘  └────────────┘  │
│  ┌──────────┐  ┌──────────┐  ┌──────────┐  ┌────────────┐  │
│  │ Identity │  │   DNS    │  │  Secret  │  │   Token    │  │
│  │ Manager  │  │ Manager  │  │ Manager  │  │   Issuer   │  │
│  └──────────┘  └──────────┘  └──────────┘  └────────────┘  │
│                                                            │
│                     API (gRPC + REST)                      │
└──────────────┬─────────────────────────────────────────────┘
               │
    ┌──────────┴──────────┐
    │                     │
┌───┴───┐            ┌────┴────┐
│  lab  │            │ lab-tui │
│ (CLI) │            │  (k9s)  │
└───────┘            └─────────┘
```

### Control Plane (lab-server)

Runs as a service (on the bootstrap node, or in k8s). Hosts:

- **Provider Registry** — pluggable providers (AWS, XCP-ng, bare metal, GCP, etc.)
- **Label Engine** — resolves labels → puppet classes, sizes, ports, config
- **Lifecycle Manager** — orchestrates provision → enroll → configure → observe
- **Artifact Builder** — puppet classes → container images
- **OpenVox Enroller** — secure cert signing, node classification, environment assignment
- **Health Aggregator** — queries Prometheus, Naemon, cloud health APIs
- **K8s Deployer** — manages workloads on k3s/EKS clusters
- **Render Engine** — side-by-side provider comparison, cost estimates, drift detection
- **Identity Manager** — tracks enrollment state, certs, Vault auth, SSH keys per resource
- **DNS Manager** — auto-registers/updates DNS for every managed resource
- **Secret Manager** — controls which resources can access which secrets (per-label policies)
- **Token Issuer** — generates one-time join tokens at provision time (no hardcoded secrets)

### CLI (lab)

kubectl-like interface for browsing and managing resources:

```
$ lab get servers
NAME        PROVIDER    LABELS               SIZE     SYNC      PUPPET     HEALTH     IDENTITY
api-1       aws         app,prod,eu-west     medium   ✓ sync    ✓ ok       ✓ ok       ✓ enrolled
api-2       aws         app,prod,eu-west     medium   ✓ sync    ✓ ok       ✓ ok       ✓ enrolled
mail-1      xcpng       mailserver,prod      medium   ✓ sync    ✓ ok       ✓ ok       ✓ enrolled
db-1        baremetal   postgres,prod        large    ⚠ drift   ✓ ok       ✓ ok       ✓ enrolled
worker-3    aws         k8s-worker,staging   large    ✓ sync    ✗ failed   ⚠ 2 alrt   ✓ enrolled
gateway-1   baremetal   k8s-server,prod      small    ✓ sync    ✓ ok       ✓ ok       ⚠ cert exp

$ lab get servers --label mailserver
NAME     PROVIDER   SIZE     SYNC     PUPPET   HEALTH   IDENTITY
mail-1   xcpng      medium   ✓ sync   ✓ ok     ✓ ok     ✓ enrolled
mail-2   aws        medium   ✓ sync   ✓ ok     ✓ ok     ✓ enrolled

$ lab describe server db-1
Name:       db-1
Provider:   baremetal
Labels:     [postgres, prod, eu-west]
Size:       large (8 cores, 32GB, 500GB NVMe)
Status:     DRIFT DETECTED
  Expected: size=large, disk=500GB
  Actual:   size=large, disk=500GB, extra_mount=/data (unmanaged)
Puppet:
  Environment: production
  Role:        postgres
  Classes:     [postgresql::server, backup::pgbackrest, node_exporter]
  Last run:    2026-03-15 14:22:03 (success)
  Next run:    2026-03-15 14:52:03
Health:
  Prometheus:  ✓ all targets up
  Naemon:      ✓ all checks passing
  Alerts:      none active

$ lab get labels
LABEL        PUPPET CLASSES                   SERVERS   CONTAINERS
mailserver   postfix, dovecot, spamassassin   2         1
k8s-worker   kubernetes::worker, containerd   12        0
postgres     postgresql::server, pgbackrest   3         1
app          nginx, app::deploy               4         2

$ lab get containers
NAME         IMAGE                               LABEL        K8S CLUSTER   STATUS
mailserver   ghcr.io/org/mailserver:2026.03.15   mailserver   homelab       running
postgres     ghcr.io/org/postgres:2026.03.14     postgres     homelab       running
app          ghcr.io/org/app:2026.03.15          app          production    running

$ lab diff server db-1
  size: large
  disk: 500GB
+ extra_mount: /data    ← unmanaged, not in spec

$ lab sync server db-1                                           # reconcile drift
$ lab plan server new-mail-3 --label mailserver --provider aws   # preview
$ lab apply server new-mail-3                                    # create it

$ lab build --label mailserver    # puppet modules → container image
Building mailserver from puppet classes:
  ✓ postfix
  ✓ dovecot
  ✓ spamassassin
  ✓ fail2ban
  → ghcr.io/org/mailserver:2026.03.15

$ lab render --label mailserver --all-providers
┌──────────────┬──────────────┬──────────┬────────────┐
│              │ AWS          │ XCP-ng   │ Bare Metal │
├──────────────┼──────────────┼──────────┼────────────┤
│ Compute      │ t3.large     │ 4c/8GB   │ IPMI boot  │
│ Puppet       │ postfix,...  │ postfix,.│ postfix,...│
│ Est. Cost    │ ~$62/mo      │ —        │ —          │
└──────────────┴──────────────┴──────────┴────────────┘
```

### TUI (lab-tui)

k9s-style interactive terminal UI:

- Real-time server list with sync/puppet/health status
- Drill into any server for details
- Watch puppet runs live
- Filter by labels, providers, health status
- Trigger actions (sync, plan, apply, build)

## Core Concepts

### Labels — The Universal Abstraction

Everything is a thing with labels. Configuration attaches to labels, not machines.

```yaml
labels:
  mailserver:
    puppet_classes:
      - postfix
      - dovecot
      - spamassassin
      - fail2ban
    ports: [25, 587, 993]
    size: medium
    alerts:
      - smtp_connect        # auto-generated: is SMTP responding?
      - imap_connect        # auto-generated: is IMAP responding?
      - mail_queue_length   # auto-generated: is mail queue healthy?
    secrets:
      - mail/tls-cert
      - mail/dkim-key

  k8s-worker:
    puppet_classes:
      - kubernetes::worker
      - containerd
      - node_exporter
    size: large
    alerts:
      - kubelet_healthy
      - node_ready
    secrets:
      - k8s/join-token
```

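As a rough illustration of "configuration attaches to labels", the sketch below merges the Puppet classes of several labels into one deduplicated list. `LabelDef` and `ResolveClasses` are hypothetical names chosen for this example; the document does not specify how the Label Engine is implemented.

```go
package main

import "fmt"

// LabelDef is a hypothetical, trimmed-down view of one entry under `labels:`.
type LabelDef struct {
	PuppetClasses []string
	Ports         []int
	Size          string
}

// ResolveClasses returns the union of Puppet classes for the given labels,
// preserving first-seen order and skipping labels that have no definition.
func ResolveClasses(defs map[string]LabelDef, labels []string) []string {
	seen := make(map[string]bool)
	var classes []string
	for _, name := range labels {
		def, ok := defs[name]
		if !ok {
			continue // unknown label: contributes nothing
		}
		for _, class := range def.PuppetClasses {
			if !seen[class] {
				seen[class] = true
				classes = append(classes, class)
			}
		}
	}
	return classes
}

func main() {
	defs := map[string]LabelDef{
		"mailserver": {
			PuppetClasses: []string{"postfix", "dovecot", "spamassassin", "fail2ban"},
			Ports:         []int{25, 587, 993},
			Size:          "medium",
		},
		"k8s-worker": {
			PuppetClasses: []string{"kubernetes::worker", "containerd", "node_exporter"},
			Size:          "large",
		},
	}
	fmt.Println(ResolveClasses(defs, []string{"k8s-worker", "mailserver"}))
}
```

Keeping first-seen order makes the result deterministic, which matters if the list is later compared against a previous run to detect drift.
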
### Groups — Nested Targeting with Exclusions

Groups compose labels, other groups, and individual servers into reusable targets.
Groups can nest (subgroups). Exclusions allow fine-grained control.

```yaml
groups:
  # Simple group: all production servers
  production:
    match:
      environment: prod

  # Group by label combination
  production-mail:
    match:
      labels: [mailserver]
      environment: prod

  # Nested group with subgroups
  eu-infrastructure:
    groups:
      - eu-west-compute
      - eu-west-storage
      - eu-west-network
    exclude:
      servers: [test-box-1]     # exclude specific server
      labels: [experimental]    # exclude servers with this label

  eu-west-compute:
    match:
      labels: [k8s-worker, k8s-server]
      region: eu-west
    exclude:
      servers: [legacy-node-3]

  # Group targeting everything except a subgroup
  all-except-staging:
    match:
      environment: [prod, dev]
    exclude:
      environment: staging

  # Custom group by explicit membership
  database-tier:
    servers: [db-1, db-2, db-3]
    groups: [replica-set-eu]
```

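One possible way to resolve such a group to a concrete server list is sketched below. The `Server` and `Group` types are hypothetical, and two semantics are assumptions rather than anything the document states: a `labels` list matches a server carrying any of the listed labels, and exclusions are applied last so they also remove servers inherited from subgroups.

```go
package main

import (
	"fmt"
	"sort"
)

// Server is a hypothetical, minimal inventory record.
type Server struct {
	Name        string
	Labels      []string
	Environment string
}

// Group mirrors a subset of the YAML above.
type Group struct {
	MatchLabels   []string // assumption: ANY of these labels matches
	MatchEnv      string
	Servers       []string
	Groups        []string
	ExcludeNames  []string
	ExcludeLabels []string
}

// hasAny reports whether have contains at least one element of want.
func hasAny(have, want []string) bool {
	for _, w := range want {
		for _, h := range have {
			if h == w {
				return true
			}
		}
	}
	return false
}

// Members resolves a group to a sorted list of server names. The visiting
// map guards against cycles between nested groups.
func Members(name string, groups map[string]Group, inv []Server, visiting map[string]bool) []string {
	g, ok := groups[name]
	if !ok || visiting[name] {
		return nil
	}
	visiting[name] = true
	defer delete(visiting, name)

	set := make(map[string]bool)
	hasMatch := len(g.MatchLabels) > 0 || g.MatchEnv != ""
	for _, s := range inv {
		explicit := hasAny(g.Servers, []string{s.Name})
		matched := hasMatch &&
			(len(g.MatchLabels) == 0 || hasAny(s.Labels, g.MatchLabels)) &&
			(g.MatchEnv == "" || s.Environment == g.MatchEnv)
		if explicit || matched {
			set[s.Name] = true
		}
	}
	for _, sub := range g.Groups {
		for _, n := range Members(sub, groups, inv, visiting) {
			set[n] = true
		}
	}

	// Exclusions run last, so they also apply to inherited members.
	var out []string
	for _, s := range inv {
		excluded := hasAny(g.ExcludeNames, []string{s.Name}) || hasAny(s.Labels, g.ExcludeLabels)
		if set[s.Name] && !excluded {
			out = append(out, s.Name)
		}
	}
	sort.Strings(out)
	return out
}

func main() {
	inv := []Server{
		{Name: "worker-5", Labels: []string{"k8s-worker"}, Environment: "prod"},
		{Name: "test-box-1", Labels: []string{"k8s-worker"}, Environment: "prod"},
		{Name: "lab-rat", Labels: []string{"k8s-worker", "experimental"}, Environment: "prod"},
	}
	groups := map[string]Group{
		"eu-west-compute": {MatchLabels: []string{"k8s-worker", "k8s-server"}},
		"eu-infrastructure": {
			Groups:        []string{"eu-west-compute"},
			ExcludeNames:  []string{"test-box-1"},
			ExcludeLabels: []string{"experimental"},
		},
	}
	fmt.Println(Members("eu-infrastructure", groups, inv, map[string]bool{}))
}
```
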
### Alerts — Auto-Generated and User-Defined

Alerts attach to labels, groups, servers, or environments — same targeting as everything else.

#### Auto-Generated Alerts

When Lab provisions a resource, it generates baseline alerts based on:

- **Label**: mailserver label → SMTP/IMAP checks
- **Puppet classes**: `postgresql::server` → postgres process, replication lag
- **Ports**: if port 443 is declared → HTTPS health check
- **Size**: resource limits → CPU/memory threshold alerts
- **Identity**: cert expiry alerts auto-generated for all enrolled machines

#### User-Defined Alerts

Users can add custom alerts targeting any scope:

```yaml
alerts:
  # Target by label
  - name: mail_queue_critical
    target:
      labels: [mailserver]
    condition: mail_queue_length > 1000
    severity: critical
    for: 5m

  # Target by group
  - name: disk_space_low
    target:
      groups: [production]
    condition: disk_usage_percent > 85
    severity: warning

  # Target by environment
  - name: high_cpu
    target:
      environment: prod
    condition: cpu_usage_percent > 90
    for: 10m
    severity: warning

  # Target specific servers
  - name: gpu_temperature
    target:
      servers: [dgx-spark, beelink-ser9-max]
    condition: gpu_temp_celsius > 80
    severity: critical

  # Target by label but exclude some
  - name: memory_pressure
    target:
      labels: [k8s-worker]
      exclude:
        servers: [batch-worker-1]   # this one is expected to run hot
    condition: memory_usage_percent > 90
    severity: warning
```

Alerts are rendered to the underlying monitoring system (Prometheus rules, Naemon checks,
CloudWatch alarms) — we don't build an alerting engine, we generate configs for existing ones.
Which monitoring backend to use for each alert type: **needs investigation**.

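To show what "generating configs for existing engines" could look like, here is a minimal sketch that renders alerts into a Prometheus rule group. The `Alert` type and `RenderPrometheusGroup` are hypothetical, and the sketch assumes each `condition` is already a valid PromQL expression over metrics that actually exist (names such as `disk_usage_percent` are placeholders from the YAML above, not real exporter metrics).

```go
package main

import (
	"fmt"
	"strings"
)

// Alert is a hypothetical in-memory form of one user-defined alert.
type Alert struct {
	Name      string
	Condition string // assumed to already be valid PromQL
	Severity  string
	For       string // optional duration such as "5m"
}

// RenderPrometheusGroup emits a Prometheus rule file containing one group.
// The expression is written as a double-quoted YAML string via %q, which is
// adequate for plain ASCII expressions.
func RenderPrometheusGroup(group string, alerts []Alert) string {
	var b strings.Builder
	b.WriteString("groups:\n")
	fmt.Fprintf(&b, "  - name: %s\n", group)
	b.WriteString("    rules:\n")
	for _, a := range alerts {
		fmt.Fprintf(&b, "      - alert: %s\n", a.Name)
		fmt.Fprintf(&b, "        expr: %q\n", a.Condition)
		if a.For != "" {
			fmt.Fprintf(&b, "        for: %s\n", a.For)
		}
		b.WriteString("        labels:\n")
		fmt.Fprintf(&b, "          severity: %s\n", a.Severity)
	}
	return b.String()
}

func main() {
	fmt.Print(RenderPrometheusGroup("production", []Alert{
		{Name: "disk_space_low", Condition: "disk_usage_percent > 85", Severity: "warning"},
		{Name: "high_cpu", Condition: "cpu_usage_percent > 90", Severity: "warning", For: "10m"},
	}))
}
```

A Naemon or CloudWatch renderer would take the same `Alert` values and emit its own format, which keeps the targeting logic independent of the backend.
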
### Targeting — Unified Query System

The same targeting syntax works everywhere: alerts, puppet classes, secrets, and queries.
Target by label, group, server name, environment, region, or any combination with exclusions.

```
# CLI targeting syntax
$ lab get servers --label k8s-worker
$ lab get servers --group production
$ lab get servers --environment staging
$ lab get servers --label k8s-worker --environment prod --exclude worker-3

# What's applied WHERE (server → everything)
$ lab show server worker-5
```

### Visibility — Show What's Applied Where

Two directions of querying: "what does this server get?" and "where does this thing apply?"

#### Server View: Everything applied to a server

```
$ lab show server worker-5

Server: worker-5 (aws, eu-west-1)
Labels: [k8s-worker, production, eu-west]
Groups: [production, eu-west-compute, eu-infrastructure]
Environment: prod

Puppet Classes (6):
  FROM LABEL k8s-worker:
    ├── kubernetes::worker
    ├── containerd
    └── node_exporter
  FROM LABEL production:
    ├── base::hardening
    └── base::monitoring
  FROM LABEL eu-west:
    └── base::ntp_eu

Alerts (8):
  FROM LABEL k8s-worker:
    ├── kubelet_healthy
    └── node_ready
  FROM GROUP production:
    ├── disk_space_low
    └── high_cpu
  AUTO-GENERATED:
    ├── cpu_threshold (from size: large)
    ├── memory_threshold (from size: large)
    ├── cert_expiry (from identity)
    └── puppet_run_failed (from enrollment)

Secrets (2):
  FROM LABEL k8s-worker:
    ├── k8s/join-token (read)
    └── tls/node-cert (dynamic)

Excluded From:
  └── alert "memory_pressure" (explicitly excluded)
```

#### Label/Group View: Where does this apply?

```
$ lab show label mailserver

Label: mailserver
Applied to: 2 servers

Servers:
  ├── mail-1 (xcpng, prod)   ✓ sync  ✓ puppet  ✓ health  ✓ identity
  └── mail-2 (aws, prod)     ✓ sync  ✓ puppet  ✓ health  ✓ identity

Provides:
  Puppet Classes: postfix, dovecot, spamassassin, fail2ban
  Alerts:         smtp_connect, imap_connect, mail_queue_length
  Secrets:        mail/tls-cert, mail/dkim-key
  Ports:          25, 587, 993
  Size:           medium

$ lab show group eu-infrastructure

Group: eu-infrastructure
Contains: 3 subgroups, 47 servers (2 excluded)

Subgroups:
  ├── eu-west-compute (28 servers)
  ├── eu-west-storage (12 servers)
  └── eu-west-network (9 servers)

Excluded:
  ├── test-box-1 (by name)
  └── 1 server with label "experimental"

Alerts targeting this group:
  ├── disk_space_low (warning)
  └── network_latency_high (critical)
```

#### Alert View: Where does this alert fire?

```
$ lab show alert disk_space_low

Alert: disk_space_low
Severity: warning
Condition: disk_usage_percent > 85
Target: group "production"
Excludes: none

Applies to 63 servers:
  ├── api-1 (aws)         currently: 42% ✓
  ├── api-2 (aws)         currently: 38% ✓
  ├── mail-1 (xcpng)      currently: 71% ✓
  ├── db-1 (baremetal)    currently: 83% ⚠ approaching
  └── ... (59 more)

Rendered to:
  ├── Prometheus: rule "disk_space_low" in rules/production.yaml
  └── Naemon: service check on 4 bare-metal hosts
```

#### Reverse Query: What targets this server?

```
$ lab targets server db-1

Everything targeting db-1:
  Labels: [postgres, production, eu-west]
  Groups: [production, database-tier, eu-infrastructure, eu-west-storage]
  Environment: prod

Alerts (11):
  ├── postgres_replication_lag (from label: postgres)
  ├── postgres_connections (from label: postgres)
  ├── disk_space_low (from group: production)
  ├── high_cpu (from group: production)
  ├── storage_iops (from group: eu-west-storage)
  ├── cert_expiry (auto-generated)
  └── ... (5 more)

Puppet Classes (9):
  ├── postgresql::server (from label: postgres)
  ├── backup::pgbackrest (from label: postgres)
  └── ... (7 more)

Secrets (4):
  ├── postgres/master-password (from label: postgres)
  └── ... (3 more)
```

### TUI Visualization (lab-tui)

The k9s-style TUI should support navigating these relationships interactively:

```
┌─ lab-tui ──────────────────────────────────────────────────────────┐
│ View: Servers > worker-5                                    [?]Help│
├────────────────────────────────────────────────────────────────────┤
│                                                                    │
│  ┌─ Server: worker-5 ───────────────────────────────────────────┐  │
│  │ Provider: aws    Size: large    Env: prod                    │  │
│  │ Sync: ✓    Puppet: ✓    Health: ✓    Identity: ✓             │  │
│  └──────────────────────────────────────────────────────────────┘  │
│                                                                    │
│  [L]abels  [A]lerts  [P]uppet  [S]ecrets  [G]roups                 │
│                                                                    │
│  Labels ───────────────────  Alerts ─────────────────────────────  │
│    ► k8s-worker              ● kubelet_healthy         ✓ OK        │
│    ► production              ● node_ready              ✓ OK        │
│    ► eu-west                 ● disk_space_low          ✓ 42%       │
│                              ● high_cpu                ✓ 12%       │
│  Groups ───────────────────  ● cert_expiry             ✓ 347d      │
│    ► production                                                    │
│    ► eu-infrastructure       Puppet Classes ─────────────────────  │
│    ► eu-west-compute         ● kubernetes::worker      ✓ applied   │
│                              ● containerd              ✓ applied   │
│  Secrets ──────────────────  ● node_exporter           ✓ applied   │
│    ● k8s/join-token (read)   ● base::hardening         ✓ applied   │
│    ● tls/node-cert (dyn)     ● base::monitoring        ✓ applied   │
│                                                                    │
│  [Enter] drill down  [Esc] back  [/] search  [Tab] switch pane     │
└────────────────────────────────────────────────────────────────────┘
```

Navigation:

- From server → drill into label → see all other servers with that label
- From alert → see all servers it applies to, current values
- From group → see subgroups, expand tree, see members
- From label → see puppet classes, alerts, secrets it provides
- Everything is cross-linked — follow any relationship in either direction

### Deployment Targets

Same label → multiple targets:

| Target | What happens |
|--------|-------------|
| VM (any cloud) | Provision VM → enroll OpenVox → apply classes live |
| Bare metal | PXE boot → enroll OpenVox → apply classes live |
| Container | Build image with classes baked in → push to registry |
| ASG | Launch template with OpenVox enrollment → auto-apply |
| K8s pod | Deploy container artifact to cluster |

### Four-Pillar Status

Every resource shows four things:

1. **Sync** — does the actual infrastructure state match the declared spec?
   (instance type, security groups, disks, network — via Pulumi state)
2. **Puppet** — did OpenVox successfully apply all classes?
   (last run status, any failures, catalog compilation errors)
3. **Health** — are monitoring checks passing?
   (aggregates from Prometheus alerts, Naemon checks, cloud health APIs)
4. **Identity** — is the resource fully enrolled?
   (DNS registered, certs valid, Vault authenticated, SSH host key signed)

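The four pillars map naturally onto a small value type. The sketch below is one hypothetical shape for it: `State`, `Status`, and the "worst pillar wins" roll-up in `Overall` are assumptions made for illustration, not a documented part of Lab.

```go
package main

import "fmt"

// State is a hypothetical three-level status, ordered from best to worst.
type State int

const (
	OK State = iota
	Warning
	Failed
)

// String renders the state the way a CLI column might show it.
func (s State) String() string {
	switch s {
	case OK:
		return "ok"
	case Warning:
		return "warning"
	default:
		return "failed"
	}
}

// Status holds one State per pillar. The zero value means "all ok".
type Status struct {
	Sync, Puppet, Health, Identity State
}

// Overall returns the worst of the four pillars.
func (s Status) Overall() State {
	worst := OK
	for _, p := range []State{s.Sync, s.Puppet, s.Health, s.Identity} {
		if p > worst {
			worst = p
		}
	}
	return worst
}

func main() {
	dbOne := Status{Sync: Warning}                         // drift detected, rest fine
	workerThree := Status{Puppet: Failed, Health: Warning} // failed run plus active alerts
	fmt.Println("db-1:", dbOne.Overall())
	fmt.Println("worker-3:", workerThree.Overall())
}
```
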
### Provider Plugin System

Extensible provider model — each provider implements an interface:

```go
type Provider interface {
	Name() string

	// Lifecycle
	Plan(spec ResourceSpec) (*PlanResult, error)
	Apply(spec ResourceSpec) (*Resource, error)
	Destroy(id string) error

	// State
	Get(id string) (*Resource, error)
	List(filters Filters) ([]*Resource, error)
	Diff(spec ResourceSpec) (*DiffResult, error)

	// Introspection (like DA's type-writer)
	DiscoverResources() ([]*Resource, error)
	AvailableSizes() ([]Size, error)
	AvailableImages() ([]Image, error)
}
```

Built-in providers:

- `provider-aws` — wraps Pulumi AWS
- `provider-xcpng` — wraps Pulumi XO / Xen Orchestra API
- `provider-baremetal` — wraps Tinkerbell / iPXE + IPMI/Redfish
- `provider-k8s` — wraps Pulumi Kubernetes

Community can add: GCP, Azure, Hetzner, Proxmox, etc.

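A registry for such providers can be very small. The following sketch is an assumption about how the Provider Registry might look: `Registry`, `Register`, `Get`, and `Names` are hypothetical, and `Provider` is cut down to `Name()` here only so the example compiles on its own.

```go
package main

import (
	"fmt"
	"sort"
	"sync"
)

// Provider is reduced to Name() for this sketch; the full interface is
// the one shown in the section above.
type Provider interface {
	Name() string
}

// Registry is a hypothetical concurrency-safe lookup table of providers.
type Registry struct {
	mu        sync.RWMutex
	providers map[string]Provider
}

func NewRegistry() *Registry {
	return &Registry{providers: make(map[string]Provider)}
}

// Register adds a provider and rejects duplicate names.
func (r *Registry) Register(p Provider) error {
	r.mu.Lock()
	defer r.mu.Unlock()
	if _, dup := r.providers[p.Name()]; dup {
		return fmt.Errorf("provider %q already registered", p.Name())
	}
	r.providers[p.Name()] = p
	return nil
}

// Get looks a provider up by name.
func (r *Registry) Get(name string) (Provider, bool) {
	r.mu.RLock()
	defer r.mu.RUnlock()
	p, ok := r.providers[name]
	return p, ok
}

// Names returns all registered provider names in sorted order.
func (r *Registry) Names() []string {
	r.mu.RLock()
	defer r.mu.RUnlock()
	names := make([]string, 0, len(r.providers))
	for n := range r.providers {
		names = append(names, n)
	}
	sort.Strings(names)
	return names
}

// fakeProvider stands in for a real implementation such as provider-aws.
type fakeProvider string

func (f fakeProvider) Name() string { return string(f) }

func main() {
	reg := NewRegistry()
	for _, name := range []string{"aws", "xcpng", "baremetal"} {
		if err := reg.Register(fakeProvider(name)); err != nil {
			fmt.Println("register:", err)
		}
	}
	fmt.Println(reg.Names())
}
```
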
### Health Aggregator Plugin System

```go
type HealthSource interface {
	Name() string
	CheckHealth(resource *Resource) (*HealthResult, error)
}
```

Built-in sources:

- `health-prometheus` — queries Prometheus alerting rules targeting the resource
- `health-naemon` — queries Naemon host/service checks
- `health-cloudwatch` — queries AWS CloudWatch alarms

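How results from several sources combine is not specified in the document. The sketch below assumes the simplest policy: a resource is healthy only if every source says so, and a source that errors counts as unhealthy. `Resource`, `HealthResult`, `Aggregate`, and `staticSource` are hypothetical definitions added so the example is self-contained.

```go
package main

import (
	"fmt"
	"strings"
)

// Resource and HealthResult are minimal stand-ins for the real types.
type Resource struct{ Name string }

type HealthResult struct {
	Healthy bool
	Detail  string
}

type HealthSource interface {
	Name() string
	CheckHealth(resource *Resource) (*HealthResult, error)
}

// Aggregate queries every source and reports unhealthy if any source does.
// Errors and nil results are surfaced as problems instead of being ignored.
func Aggregate(res *Resource, sources []HealthSource) *HealthResult {
	var problems []string
	for _, src := range sources {
		r, err := src.CheckHealth(res)
		switch {
		case err != nil:
			problems = append(problems, fmt.Sprintf("%s: error: %v", src.Name(), err))
		case r == nil:
			problems = append(problems, fmt.Sprintf("%s: no result", src.Name()))
		case !r.Healthy:
			problems = append(problems, fmt.Sprintf("%s: %s", src.Name(), r.Detail))
		}
	}
	if len(problems) > 0 {
		return &HealthResult{Healthy: false, Detail: strings.Join(problems, "; ")}
	}
	return &HealthResult{Healthy: true}
}

// staticSource is a fake source that always returns the same answer.
type staticSource struct {
	name   string
	result *HealthResult
	err    error
}

func (s staticSource) Name() string { return s.name }

func (s staticSource) CheckHealth(*Resource) (*HealthResult, error) {
	return s.result, s.err
}

func main() {
	sources := []HealthSource{
		staticSource{name: "prometheus", result: &HealthResult{Healthy: true}},
		staticSource{name: "naemon", result: &HealthResult{Detail: "2 checks failing"}},
		staticSource{name: "cloudwatch", err: fmt.Errorf("request timed out")},
	}
	out := Aggregate(&Resource{Name: "worker-3"}, sources)
	fmt.Println(out.Healthy, out.Detail)
}
```
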
### Profiles — T-Shirt Sizing

User-owned mappings:

```yaml
sizes:
  medium:
    abstract: { cores: 4, memory: 8GB }
    providers:
      aws: { instance_type: t3.large }
      xcpng: { cores: 4, memory: 8192MB }
      baremetal: { min_cores: 4, min_memory: 8GB, maas_tag: medium }
```

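A lookup over such a mapping might work as sketched below. The `Size` type and `Resolve` function are hypothetical, and the fallback to the abstract cores/memory when a provider has no explicit entry is an assumption; the document does not say what should happen in that case.

```go
package main

import (
	"fmt"
	"strconv"
)

// Size is a hypothetical in-memory form of one entry under `sizes:`.
type Size struct {
	Cores     int
	MemoryGB  int
	Providers map[string]map[string]string
}

// Resolve returns the provider-specific settings for a size. If the provider
// has no explicit mapping, it falls back to the abstract cores and memory
// (an assumed policy, chosen so new providers work without extra config).
func Resolve(sizes map[string]Size, size, provider string) (map[string]string, error) {
	s, ok := sizes[size]
	if !ok {
		return nil, fmt.Errorf("unknown size %q", size)
	}
	if cfg, ok := s.Providers[provider]; ok {
		return cfg, nil
	}
	return map[string]string{
		"cores":  strconv.Itoa(s.Cores),
		"memory": fmt.Sprintf("%dGB", s.MemoryGB),
	}, nil
}

func main() {
	sizes := map[string]Size{
		"medium": {
			Cores:    4,
			MemoryGB: 8,
			Providers: map[string]map[string]string{
				"aws":   {"instance_type": "t3.large"},
				"xcpng": {"cores": "4", "memory": "8192MB"},
			},
		},
	}
	for _, provider := range []string{"aws", "xcpng", "hetzner"} {
		cfg, err := Resolve(sizes, "medium", provider)
		fmt.Println(provider, cfg, err)
	}
}
```
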
### Artifact Builder

Puppet modules → container images:

```
label "mailserver"
  → puppet classes [postfix, dovecot, spamassassin]
  → Dockerfile generated:
      FROM ubuntu:24.04
      RUN apt-get update && apt-get install -y puppet-agent
      COPY modules/ /etc/puppetlabs/code/modules/
      RUN puppet apply -e 'include postfix, dovecot, spamassassin'
      # Clean up puppet, leave only configured services
  → Image pushed to registry
  → Available as k8s deployment or standalone container
```

## Tech Stack

| Component | Technology | Why |
|-----------|-----------|-----|
| Server | Go | Performance, single binary, Pulumi SDK, gRPC native |
| CLI | Go (cobra) | Same binary, kubectl-style |
| TUI | Go (bubbletea) | Same binary, k9s-style |
| API | gRPC + REST (grpc-gateway) | Type-safe, fast, REST fallback |
| IaC engine | Pulumi (Go SDK) | Multi-provider, plan/preview, component packages |
| Config mgmt | OpenVox | Puppet modules, ENC, cert management |
| Bare metal | Tinkerbell or custom iPXE | PXE boot, IPMI/Redfish |
| Container build | Buildah or Docker | OCI images from puppet classes |
| State store | TBD — NOT etcd (see State Storage section) | Resource state, label definitions |
| K8s integration | client-go | Direct k8s API for deployments |

## Under The Hood — What We DON'T Build

- Cloud APIs → Pulumi providers handle this
- Puppet language/runtime → OpenVox handles this
- Container runtime → containerd/Docker handles this
- Monitoring → Prometheus/Naemon handle this
- K8s orchestration → k3s/EKS handles this
- PXE/DHCP/TFTP → Tinkerbell handles this
- Certificate management → OpenVox CA handles this

**We build the glue, the abstraction, the UX, and the lifecycle orchestration.**

## Kubernetes Management

Lab also controls what runs on k8s clusters:

```
$ lab get deployments
NAME         CLUSTER      LABEL        REPLICAS   IMAGE                  STATUS
mailserver   homelab      mailserver   2/2        org/mailserver:03.15   ✓ running
api          production   app          4/4        org/app:03.15          ✓ running
postgres     homelab      postgres     1/1        org/postgres:03.14     ✓ running

$ lab deploy --label app --cluster production --replicas 4
$ lab scale --label app --cluster production --replicas 6
```

Deployments reference labels — the same label that defines puppet classes also defines
the container image, ports, health checks, and k8s resources.

## Bootstrap, Onboarding, and Self-Deployment

### Core Idea: Your Device Is The First Coordinator

You don't need a server to start. Your laptop/workstation runs the full lab engine
locally. You onboard servers from it — including bare metal PXE boot. When ready,
you migrate the coordinator role to one of the servers you've onboarded.

```
┌────────────┐     ┌────────────┐     ┌────────────┐     ┌────────────┐
│  Phase 0   │     │  Phase 1   │     │  Phase 2   │     │  Phase 3   │
│            │     │            │     │            │     │            │
│ lab init   │────►│ Onboard    │────►│ Move lab   │────►│ Onboard    │
│ --local    │     │ servers    │     │ to a real  │     │ remaining  │
│            │     │ from your  │     │ server     │     │ from the   │
│ Your device│     │ laptop     │     │            │     │ server     │
│ = lab      │     │            │     │            │     │            │
└────────────┘     └────────────┘     └────────────┘     └────────────┘
```

### Architecture: CLI = Embedded Server

The CLI binary contains the full lab-server engine. The difference between modes
is where state lives and whether the engine runs persistently.

```
┌──────────────────────────────────────┐
│  lab (single binary)                 │
│                                      │
│  ┌─────────────────────────────────┐ │
│  │  Core Engine                    │ │
│  │  (providers, labels, render,    │ │
│  │  lifecycle, identity, secrets,  │ │
│  │  PXE server, everything)        │ │
│  └─────────────────────────────────┘ │
│                                      │
│  Modes:                              │
│  ├── $ lab init --local → local mode │
│  │   State: ~/.lab/state.db          │
│  │   PXE/DHCP: served from laptop    │
│  │   Full engine, no remote server   │
│  │                                   │
│  ├── $ lab server → daemon mode      │
│  │   State: /var/lib/lab/state.db    │
│  │   PXE/DHCP: served from this box  │
│  │   Persistent API on port 7443     │
│  │                                   │
│  └── $ lab <command> → client mode   │
│      Talks to remote lab-server      │
│      (or local engine if no server)  │
└──────────────────────────────────────┘
```

### Onboarding Flow

`lab onboard` is the command to bring a new machine under management. It handles
two scenarios: machines with an OS already installed, and bare metal that needs
network boot + OS installation.

#### Scenario A: Machine has OS (SSH onboard)

For machines that already have an OS (like DGX Spark with Ubuntu, or Mac Studio):

```
$ lab onboard dgx-spark --provider ssh --host 192.168.1.50 --user admin

Step 1: Render
┌──────────────┬────────────────────────┐
│ Name         │ dgx-spark              │
│ Provider     │ ssh (existing machine) │
│ Host         │ 192.168.1.50           │
│ OS           │ Ubuntu (detected)      │
│ Arch         │ aarch64 (Grace)        │
│ RAM          │ 128GB                  │
│ GPU          │ CUDA (detected)        │
└──────────────┴────────────────────────┘

Onboarding will:
  + Install lab agent
  + Generate one-time enrollment token
  + Register in DNS: dgx-spark.lab.internal
  + Sign OpenVox certificate
  + Assign labels (interactive or --labels flag)

Proceed? [y/N]: y

Step 2: Detect & assign labels
  Detected hardware:
    GPU: NVIDIA GB10 Grace Blackwell → suggesting label: cuda
    RAM: 128GB → suggesting label: ai-inference
    Arch: aarch64 → suggesting label: arm

  Assign labels [cuda,ai-inference,arm]: cuda,ai-inference,dgx-spark

Step 3: Apply (same engine as lab apply)
  → SSH into 192.168.1.50
  → Install lab agent binary
  → Generate one-time token
  → Lab agent enrolls:
    → OpenVox cert signed, classified in environment "production"
    → DNS A record: dgx-spark.lab.internal → 192.168.1.50
    → Identity established
  → Apply puppet classes from labels:
    → cuda: nvidia-drivers, cuda-toolkit
    → ai-inference: inference-runtime
  → Machine fully managed

$ lab get servers
NAME        PROVIDER   LABELS                        SYNC   PUPPET   HEALTH   IDENTITY
dgx-spark   ssh        cuda,ai-inference,dgx-spark   ✓      ✓ ok     ✓        ✓ enrolled
```

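The one-time enrollment token used in Step 3 could be handled roughly as in the sketch below. Everything here is hypothetical: the `TokenIssuer` type, binding a token to a server name, the TTL, and storing only a SHA-256 hash of the token are design assumptions for illustration, not a description of Lab's Token Issuer.

```go
package main

import (
	"crypto/rand"
	"crypto/sha256"
	"encoding/hex"
	"errors"
	"fmt"
	"sync"
	"time"
)

// ErrInvalidToken is returned for unknown, expired, mismatched, or reused tokens.
var ErrInvalidToken = errors.New("invalid, expired, or already used token")

type pendingToken struct {
	server  string
	expires time.Time
}

// TokenIssuer keeps only hashes of outstanding tokens in memory.
type TokenIssuer struct {
	mu      sync.Mutex
	pending map[[32]byte]pendingToken
}

func NewTokenIssuer() *TokenIssuer {
	return &TokenIssuer{pending: make(map[[32]byte]pendingToken)}
}

// Issue creates a random 256-bit token bound to one server name and a TTL.
func (t *TokenIssuer) Issue(server string, ttl time.Duration) (string, error) {
	raw := make([]byte, 32)
	if _, err := rand.Read(raw); err != nil {
		return "", err
	}
	token := hex.EncodeToString(raw)
	t.mu.Lock()
	defer t.mu.Unlock()
	t.pending[sha256.Sum256([]byte(token))] = pendingToken{
		server:  server,
		expires: time.Now().Add(ttl),
	}
	return token, nil
}

// Redeem succeeds at most once per token, and only for the bound server.
func (t *TokenIssuer) Redeem(token, server string) error {
	key := sha256.Sum256([]byte(token))
	t.mu.Lock()
	defer t.mu.Unlock()
	p, ok := t.pending[key]
	if !ok || p.server != server || time.Now().After(p.expires) {
		return ErrInvalidToken
	}
	delete(t.pending, key) // consume the token so it cannot be replayed
	return nil
}

func main() {
	issuer := NewTokenIssuer()
	token, err := issuer.Issue("dgx-spark", 15*time.Minute)
	if err != nil {
		fmt.Println("issue:", err)
		return
	}
	fmt.Println("first redeem:", issuer.Redeem(token, "dgx-spark"))
	fmt.Println("second redeem:", issuer.Redeem(token, "dgx-spark"))
}
```

A persistent implementation would keep the pending set in the state store so that tokens survive a restart of the coordinator.
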
#### Scenario B: Bare metal (PXE network boot)

For machines with no OS. Lab (on your laptop or server) becomes a PXE server
on the local network, serves the OS installer, and onboards after installation:

```
$ lab onboard beelink-max --provider baremetal \
    --mac AA:BB:CC:DD:EE:FF \
    --image ubuntu-24.04 \
    --labels k8s-worker,rocm,longhorn

Step 1: Render
┌──────────────┬──────────────────────────┐
│ Name         │ beelink-max              │
│ Provider     │ baremetal (PXE boot)     │
│ MAC          │ AA:BB:CC:DD:EE:FF        │
│ Image        │ ubuntu-24.04             │
│ Labels       │ k8s-worker,rocm,longhorn │
│ PXE server   │ this device (laptop)     │
└──────────────┴──────────────────────────┘

Onboarding will:
  + Start PXE/DHCP/TFTP on local network interface
  + Wait for machine with MAC AA:BB:CC:DD:EE:FF to boot
  + Serve unattended Ubuntu 24.04 installer
  + After install: auto-enroll with one-time token baked into installer
  + Assign labels, apply puppet classes

⚠ PXE requires: network interface on same L2 segment as target machine
⚠ DHCP: will respond ONLY to MAC AA:BB:CC:DD:EE:FF (safe for existing networks)

Proceed? [y/N]: y

Step 2: PXE boot phase
  → Starting PXE server on en0 (192.168.1.x)
  → DHCP offer scoped to MAC AA:BB:CC:DD:EE:FF only
  → Waiting for network boot request...

  ⏳ Power on the Beelink SER9 MAX and set it to boot from network (PXE)

  → Boot request received from AA:BB:CC:DD:EE:FF
  → Serving iPXE → kernel + initrd → autoinstall config
  → OS installation in progress...
  → Installation complete, machine rebooting

Step 3: Post-install enrollment (same as SSH onboard from here)
  → Machine boots with installed OS
  → Lab agent runs on first boot (installed during OS setup)
  → Uses one-time token (baked into autoinstall config) to enroll:
    → OpenVox cert signed
    → DNS: beelink-max.lab.internal → 192.168.1.100
    → Identity established
  → Apply puppet classes from labels:
    → k8s-worker: kubernetes::worker, containerd
    → rocm: rocm-drivers
    → longhorn: longhorn::node
  → Machine fully managed

$ lab get servers
NAME          PROVIDER    LABELS                        SYNC   PUPPET   HEALTH   IDENTITY
dgx-spark     ssh         cuda,ai-inference,dgx-spark   ✓      ✓ ok     ✓        ✓ enrolled
beelink-max   baremetal   k8s-worker,rocm,longhorn      ✓      ✓ ok     ✓        ✓ enrolled
```

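The "respond ONLY to this MAC" safety rule comes down to an allowlist check before any DHCP offer is sent. The sketch below shows that check in isolation; `MACAllowlist` and its methods are hypothetical names, and the actual DHCP/PXE packet handling is out of scope here. It relies on `net.ParseMAC` from the Go standard library, which accepts colon, hyphen, and dot notation.

```go
package main

import (
	"bytes"
	"fmt"
	"net"
)

// MACAllowlist decides which DHCP/PXE clients the server may answer.
type MACAllowlist struct {
	allowed []net.HardwareAddr
}

// NewMACAllowlist parses the given addresses and fails on the first bad one.
func NewMACAllowlist(macs ...string) (*MACAllowlist, error) {
	list := &MACAllowlist{}
	for _, m := range macs {
		hw, err := net.ParseMAC(m)
		if err != nil {
			return nil, fmt.Errorf("parse %q: %w", m, err)
		}
		list.allowed = append(list.allowed, hw)
	}
	return list, nil
}

// Allows reports whether a request from the given client address is in scope.
// Unparseable addresses are rejected, so every other client is ignored.
func (l *MACAllowlist) Allows(mac string) bool {
	client, err := net.ParseMAC(mac)
	if err != nil {
		return false
	}
	for _, hw := range l.allowed {
		if bytes.Equal(hw, client) {
			return true
		}
	}
	return false
}

func main() {
	list, err := NewMACAllowlist("AA:BB:CC:DD:EE:FF")
	if err != nil {
		fmt.Println("allowlist:", err)
		return
	}
	for _, mac := range []string{"aa:bb:cc:dd:ee:ff", "11:22:33:44:55:66"} {
		fmt.Printf("%s respond=%v\n", mac, list.Allows(mac))
	}
}
```

Comparing parsed bytes instead of strings avoids false negatives from case or separator differences between what the user typed and what the NIC reports.
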
#### Scenario C: Onboard with IPMI/Redfish (remote power control)

For bare metal where you have IPMI/BMC access — Lab can power on the machine
and set PXE boot remotely, fully hands-free:

```
$ lab onboard beelink-max --provider baremetal \
    --mac AA:BB:CC:DD:EE:FF \
    --ipmi 192.168.1.200 --ipmi-user admin \
    --image ubuntu-24.04 \
    --labels k8s-worker,rocm,longhorn

  → IPMI: setting next boot to PXE
  → IPMI: powering on machine
  → PXE server waiting for boot request...
  → (fully automated from here)
```

### Homelab Bootstrap Walkthrough

The complete flow for setting up the homelab from zero:

```
# Phase 0: Local mode on your laptop
$ lab init --local
  ✓ Lab engine running locally
  ✓ State: ~/.lab/state.db
  ✓ Ready to onboard servers

# Phase 1: Onboard servers that already have an OS
$ lab onboard dgx-spark --provider ssh --host 192.168.1.50
  → Labels: [cuda, ai-inference, dgx-spark]

$ lab onboard mac-studio --provider ssh --host 192.168.1.51
  → Labels: [k8s-server, etcd, arm]

# Phase 2: Onboard bare metal (PXE from your laptop)
$ lab onboard beelink-ser9-pro --provider baremetal --mac XX:XX:XX:XX:XX:01 \
    --image ubuntu-24.04 --labels bootstrap,lab-server
  → PXE boot from laptop → install OS → enroll
  → This will become the permanent lab-server host

# Phase 3: Move lab-server to a real server
$ lab server migrate --target ssh --host beelink-ser9-pro
  → Lab-server deployed on Beelink SER9 Pro
  → State migrated from ~/.lab/state.db
  → PXE/DHCP now served from Beelink, not your laptop
  → CLI config updated: lab talks to beelink-ser9-pro:7443

# Phase 4: Onboard remaining servers (PXE from beelink-ser9-pro now)
$ lab onboard beelink-ser9-max --provider baremetal --mac XX:XX:XX:XX:XX:02 \
    --image ubuntu-24.04 --labels k8s-worker,rocm,longhorn
  → PXE served by beelink-ser9-pro (not your laptop anymore)

$ lab onboard minisforum-ms-r1 --provider baremetal --mac XX:XX:XX:XX:XX:03 \
    --image ubuntu-24.04 --labels k8s-worker,arm

# Phase 5: Set up k8s
$ lab apply cluster homelab --servers mac-studio,beelink-ser9-max,minisforum-ms-r1
  → mac-studio becomes k3s server (etcd)
  → beelink-ser9-max joins as worker
  → minisforum-ms-r1 joins as worker
  → All via puppet classes from labels

# Phase 6: Optionally move lab-server into k8s
$ lab server migrate --target kubernetes --cluster homelab
  → Lab-server now runs as k8s pod
  → Still manages everything including the cluster it runs on

# Final state:
$ lab get servers
NAME              PROVIDER    LABELS                    SYNC  PUPPET  HEALTH  IDENTITY
dgx-spark         ssh         cuda,ai-inference         ✓     ✓ ok    ✓       ✓ enrolled
mac-studio        ssh         k8s-server,etcd,arm       ✓     ✓ ok    ✓       ✓ enrolled
beelink-ser9-pro  baremetal   bootstrap                 ✓     ✓ ok    ✓       ✓ enrolled
beelink-ser9-max  baremetal   k8s-worker,rocm,longhorn  ✓     ✓ ok    ✓       ✓ enrolled
minisforum-ms-r1  baremetal   k8s-worker,arm            ✓     ✓ ok    ✓       ✓ enrolled
lab-server        kubernetes  lab,control-plane         ✓     ✓ ok    ✓       ✓ enrolled
```

### Enterprise Application: XCP-ng Bare Metal Deploy

The same onboarding flow works for deploying XCP-ng to enterprise bare metal:

```
$ lab onboard xen-host-42 --provider baremetal \
    --mac AA:BB:CC:DD:EE:FF \
    --ipmi 10.0.0.142 --ipmi-user admin \
    --image xcpng-8.3 \
    --labels xen-host,production,eu-west

  → IPMI: power on, PXE boot
  → Install XCP-ng 8.3 (unattended)
  → Enroll, apply puppet classes:
      → xen-host: xcpng::host, xcpng::networking, xcpng::storage
  → Host registered in Xen Orchestra pool
  → Ready to provision VMs on it

# Now create VMs on the XCP-ng host we just onboarded:
$ lab apply server app-12 --provider xcpng --labels app,production
  → VM created on xen-host-42 via Xen Orchestra API
  → OS installed, enrolled, puppet applied
  → Same flow as AWS EC2, just different provider
```

### PXE Server Capabilities

When running in local or server mode, Lab includes an embedded PXE server:

- **DHCP**: scoped to specific MACs only (safe for existing networks with DHCP)
- **TFTP**: serves iPXE bootloader
- **HTTP**: serves kernel, initrd, autoinstall configs
- **Autoinstall generation**: creates unattended install configs per-machine with:
  - Lab agent pre-installed
  - One-time enrollment token baked in
  - Network config for the target environment
  - Disk layout per label/profile
- **Supported images**: Ubuntu, Debian, RHEL/Rocky, XCP-ng (extensible)

PXE serving moves with lab-server — if you migrate lab to a new host,
PXE is served from there. If lab is on your laptop, PXE is on your laptop.
Same engine, same binary.

### Hardware Detection During Onboard

When onboarding via SSH (existing OS), Lab detects hardware and suggests labels:

```
$ lab onboard new-server --provider ssh --host 10.0.0.50

Detected hardware:
  CPU:  AMD EPYC 7763 (x86_64, 64 cores)  → suggest: compute
  RAM:  256 GB                            → suggest: high-memory
  GPU:  NVIDIA A100 80GB                  → suggest: cuda, ai-training
  Disk: 2x NVMe 1.92TB, 4x SSD 3.84TB     → suggest: storage
  NIC:  2x 25GbE, 1x 1GbE IPMI            → suggest: high-bandwidth

Suggested labels: [compute, high-memory, cuda, ai-training, storage, high-bandwidth]
Assign labels [accept/edit]: _
```

For PXE onboard, hardware detection happens after OS installation, and labels
can be auto-confirmed or require interactive approval.

### No Server? CLI Runs Locally

If no remote server is configured, every `lab` command runs the engine locally.
This means you can use Lab in permanent local mode for simple setups:

```
$ lab get servers   # no remote server configured
ⓘ Running locally (~/.lab/state.db)
  Tip: run `lab server migrate --target <target>` to deploy a persistent server

NAME              PROVIDER    LABELS                    SYNC  PUPPET  HEALTH  IDENTITY
...
```

### Self-Migration

Migration uses the same plan/apply as everything else:

```
$ lab server migrate --target ssh --host beelink-ser9-pro

Step 1: Plan
  ~ migrate lab-server from local (~/.lab) to ssh://beelink-ser9-pro
    + deploy lab-server container on beelink-ser9-pro
    + copy state.db to remote host
    + start PXE/DHCP services on remote host
    + stop local PXE/DHCP services
    + update CLI config to new endpoint

Step 2: Apply
  → Deploy lab-server on beelink-ser9-pro
  → Copy state to remote
  → Verify remote is healthy
  → Switch CLI config
  → Stop local engine

$ lab server migrate --target kubernetes --cluster homelab

Step 1: Plan
  ~ migrate lab-server from ssh://beelink-ser9-pro to kubernetes://homelab
    + k8s Deployment lab-server (1 replica)
    + k8s Service lab-server (port 7443)
    + PersistentVolumeClaim lab-server-state (10Gi)
    + migrate state.db to PVC
    + PXE services: move to k8s hostNetwork pod or keep on bootstrap node

  ⚠ Note: PXE/DHCP requires L2 network access. If k8s node is on the same
    L2 segment, use hostNetwork. Otherwise, keep PXE on the bootstrap node
    and only migrate the API/state to k8s.

Step 2: Apply
  → Deploy to k8s
  → Migrate state
  → Verify healthy
  → Update CLI config
  → Tear down old deployment
```

### Key Design Principles

1. **One engine everywhere** — CLI, local mode, server mode, and init all share the same code
2. **Your device is the first coordinator** — no chicken-and-egg, start from nothing
3. **Onboard uses the same pipeline as apply** — render, plan, apply, enroll
4. **PXE is embedded** — no external PXE/DHCP server needed, Lab serves it
5. **Hardware detection suggests labels** — but the user confirms
6. **Migration is just plan/apply for lab-server** — same engine, no special case
7. **Enterprise and homelab are the same flow** — onboard XCP-ng bare metal = onboard homelab Beelink

## Identity and Trust Layer

Inspired by what FreeIPA did well (auto-DNS, centralized SSH, server-scoped secrets,
internal CA, IP mobility) without what it did badly (instability, hardcoded join secrets).

Lab controls the full lifecycle — it knows when a machine is born — so it can solve
the enrollment problem properly: generate a one-time join token at provision time,
inject it via cloud-init or iPXE userdata. No hardcoded secrets in images.

### Provision-to-Enrolled Flow

```
$ lab apply server new-worker-5 --labels k8s-worker --provider aws

1. PROVISION  → Pulumi creates EC2 instance
2. IDENTITY   → Lab generates one-time join token (short-lived, single-use)
              → Token injected via cloud-init (or iPXE userdata for bare metal)
              → Token is NOT in the image — generated per-instance at provision time
3. ENROLL     → Machine boots, uses token to:
                  → Register with OpenVox (cert signed, node classified)
                  → Register in DNS (A record + PTR)
                  → Authenticate with Vault (get identity + policies per label)
                  → Get SSH CA-signed host key (no more TOFU)
4. CONFIGURE  → OpenVox applies classes
              → Machine pulls secrets it's allowed to access from Vault
              → e.g. k8s join token retrieved from Vault, node joins cluster
5. ENROLLED   → Lab marks resource identity as ✓ enrolled
```

### What Each Machine Gets on Enrollment

| Capability | What happens | Tool underneath (TBD — needs investigation) |
|-----------|-------------|----------------------------------------------|
| DNS auto-registration | A + PTR records created/updated automatically | CoreDNS API? ExternalDNS? PowerDNS? needs investigation |
| IP mobility | Machine restarts with new IP → DNS updated automatically | Lab agent on machine reports changes? DHCP hook? needs investigation |
| Server certificate | TLS cert issued for the machine, auto-renewed | OpenVox CA? Vault PKI secrets engine? cert-manager? needs investigation |
| SSH host key signing | Host key signed by CA, clients trust CA not individual keys | Vault SSH secrets engine? OpenVox CA? step-ca? needs investigation |
| SSH user access | Users get short-lived SSH certs, centrally managed | Vault SSH + OIDC? Teleport? Boundary? needs investigation |
| Secret access (RBAC) | Machine authenticates with Vault, gets label-scoped policy | Vault AppRole? Vault cert auth? needs investigation |
| K8s join tokens | Retrieved from Vault by entitled machines, used to join cluster | Vault KV + policy per label? needs investigation |
| OpenVox enrollment | Cert signed, environment + role + classes assigned | OpenVox CA + ENC — this one we know |
| One-time join tokens | Generated per-instance at provision, single-use, short-lived | Lab itself generates these — or delegate to Vault? needs investigation |

**Important: We don't need to build any of these from scratch.** Each row is a capability
that likely has an existing tool we can wrap. Just like we use Pulumi for cloud APIs and
OpenVox for config management, we'll find the right tool for each identity concern.
Each position requires investigation — we'll evaluate options together, one by one.

### CLI: Identity Information

```
$ lab get servers
NAME      PROVIDER   LABELS      SYNC  PUPPET  HEALTH  IDENTITY
worker-5  aws        k8s-worker  ✓     ✓ ok    ✓       ✓ enrolled
worker-6  xcpng      k8s-worker  ✓     ✓ ok    ✓       ✓ enrolled
worker-7  baremetal  k8s-worker  ✓     ✗ fail  ⚠       ⚠ cert expiring
new-box   aws        k8s-worker  ✓     …       …       ⏳ enrolling

$ lab describe server worker-5
...
Identity:
  DNS:           worker-5.lab.internal (A: 10.0.1.45, PTR: ✓)
  OpenVox:       ✓ cert signed (expires 2027-03-15)
  Vault:         ✓ authenticated (policy: k8s-worker)
  SSH Host Key:  ✓ CA-signed (fingerprint: SHA256:abc...)
  Secrets:       k8s/join-token, tls/node-cert (2 accessible)
  Enrolled:      2026-03-15 14:22:03 (one-time token, consumed)
  Last Check-in: 2026-03-15 15:01:12 (38 seconds ago)

$ lab get secrets --label k8s-worker
SECRET              TYPE     ACCESSIBLE BY           LAST ROTATED
k8s/join-token      dynamic  k8s-worker (12 srv)     2026-03-15
tls/cluster-ca      static   k8s-worker, k8s-server  2026-01-01
monitoring/api-key  static   k8s-worker, monitoring  2026-02-28

$ lab identity renew worker-5    # force cert/key renewal
$ lab identity revoke worker-5   # revoke all creds, remove from DNS, unenroll
```

### Secrets — Code Is The Policy

**Design principle:** If your code/config declares "I use secret X", that IS the access
grant. No one goes to a separate UI to edit policies. Default is locked — if not
mentioned, no access. If mentioned, access is automatic.

**The declaration IS the policy:**

```yaml
labels:
  mailserver:
    puppet_classes:
      - postfix
      - dovecot
    secrets:
      - mail/tls-cert
      - mail/dkim-key
      - mail/relay-credentials
    ports: [25, 587, 993]
```

When Lab applies label `mailserver` to a server, it automatically:
1. Grants that server access to `mail/tls-cert`, `mail/dkim-key`, `mail/relay-credentials`
2. Denies access to everything else
3. Requires no separate policy file, no Vault admin, no ticket to the security team

When a puppet class references a secret:

```puppet
# modules/postfix/manifests/init.pp
class postfix {
  $relay_creds = lab::secret('mail/relay-credentials')

  file { '/etc/postfix/sasl_passwd':
    content => $relay_creds,
    mode    => '0600',
  }
}
```

The `lab::secret()` call is both the usage AND the declaration that this class
needs this secret. Lab scans puppet classes, discovers secret references,
and auto-generates the access policy. If the `postfix` class is applied to a server
via a label, that server gets access to `mail/relay-credentials`. Remove the
class → access revoked.

**Secrets must be equally easy to access from anywhere:**

| Runtime | How you get a secret | What happens underneath |
|---------|---------------------|-----------------|
| Puppet code | `lab::secret('mail/tls-cert')` | Lab agent on machine fetches from secret backend |
| App on VM | `LAB_SECRET_MAIL_TLS_CERT` env var, or `/run/secrets/mail/tls-cert` file | Lab agent provides via env or tmpfs mount |
| App in Kubernetes | Same env var or volume mount | Lab k8s operator syncs to K8s Secret object |
| App in Docker (standalone) | `--env-file` or bind mount from lab agent | Lab agent on host provides |
| Script / cron job | `lab secret get mail/tls-cert` CLI call | Lab CLI authenticated via machine identity |
| cloud-init / bootstrap | Injected at provision time via one-time token | Lab server provides during enrollment |

**One way to consume secrets, regardless of where you run.** The lab agent (or k8s
operator, or CLI) handles authentication and fetching transparently. The app just
reads an env var or file.

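The table implies a fixed mapping from secret path to environment variable name (`mail/tls-cert` becomes `LAB_SECRET_MAIL_TLS_CERT`). One possible rule, shown here as an assumption rather than a specification, is to upper-case the path and replace every non-alphanumeric character with an underscore. `envVarName` is a hypothetical helper.

```go
package main

import (
	"fmt"
	"strings"
)

// envVarName derives an environment variable name from a secret path.
// Assumed rule: upper-case, with every character outside A-Z and 0-9
// replaced by an underscore, prefixed with LAB_SECRET_.
func envVarName(path string) string {
	var b strings.Builder
	b.WriteString("LAB_SECRET_")
	for _, r := range strings.ToUpper(path) {
		if (r >= 'A' && r <= 'Z') || (r >= '0' && r <= '9') {
			b.WriteRune(r)
		} else {
			b.WriteByte('_')
		}
	}
	return b.String()
}

func main() {
	fmt.Println(envVarName("mail/tls-cert"))
	fmt.Println(envVarName("k8s/join-token"))
}
```

Under this rule `mail/tls-cert` and `mail/tls_cert` map to the same variable, so secret paths would need to be checked for such collisions when they are defined.
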
#### How Access Flows

```
Label "mailserver"
  declares secrets:
    - mail/tls-cert
    - mail/dkim-key
            │
            ▼
┌───────────────────────┐
│ Lab compiles policy   │
│                       │
│ server mail-1:        │
│   CAN access:         │
│     mail/tls-cert     │
│     mail/dkim-key     │
│   CANNOT access:      │
│     k8s/*             │
│     postgres/*        │
│     (everything else) │
└───────────┬───────────┘
            │
            ▼
┌───────────────────────┐
│ Secret backend        │
│ (TBD — needs          │
│ investigation)        │
│                       │
│ Enforces policy at    │
│ backend level, not    │
│ just in Lab           │
└───────────────────────┘
```

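The "compile policy" box can be read as a function from label declarations and label assignments to a per-server allow set, with everything absent from the set denied. The sketch below makes that default-deny behaviour explicit. `Policy`, `compilePolicy`, and `Allowed` are hypothetical names, and enforcement at the backend is out of scope here.

```go
package main

import "fmt"

// Policy maps a server name to the set of secret paths it may read.
// Anything not present is denied.
type Policy map[string]map[string]bool

// compilePolicy derives the per-server allow set from what each label
// declares and which labels each server carries.
func compilePolicy(labelSecrets, serverLabels map[string][]string) Policy {
	policy := make(Policy)
	for server, labels := range serverLabels {
		allowed := make(map[string]bool)
		for _, label := range labels {
			for _, path := range labelSecrets[label] {
				allowed[path] = true
			}
		}
		policy[server] = allowed
	}
	return policy
}

// Allowed reports whether server may read path. Unknown servers and
// unlisted paths both yield false (default deny).
func (p Policy) Allowed(server, path string) bool {
	return p[server][path]
}

func main() {
	labelSecrets := map[string][]string{
		"mailserver": {"mail/tls-cert", "mail/dkim-key"},
		"k8s-worker": {"k8s/join-token"},
	}
	serverLabels := map[string][]string{
		"mail-1": {"mailserver"},
	}
	policy := compilePolicy(labelSecrets, serverLabels)
	fmt.Println(policy.Allowed("mail-1", "mail/tls-cert"))  // declared by its label
	fmt.Println(policy.Allowed("mail-1", "k8s/join-token")) // not declared: denied
}
```

Removing a label from a server and recompiling drops the grant, which matches the "remove the class, access revoked" behaviour described earlier.
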
#### Secret Sources

Secrets themselves can come from multiple places:

```yaml
secrets:
  mail/tls-cert:
    type: dynamic          # generated/rotated automatically
    generator: acme        # cert-manager / Let's Encrypt
    rotate_every: 90d

  mail/dkim-key:
    type: static           # manually set, stored encrypted
    set_by: admin          # who last set it

  mail/relay-credentials:
    type: static
    set_by: admin

  k8s/join-token:
    type: dynamic
    generator: kubernetes  # fetched from k8s API
    rotate_every: 24h

  tls/node-cert:
    type: dynamic
    generator: ca          # issued per-machine from internal CA
    per_machine: true      # each machine gets its own
```

#### CLI for Secrets

```
$ lab get secrets
SECRET                  TYPE     USED BY              LAST ROTATED
mail/tls-cert           dynamic  mailserver (2 srv)   2026-03-14
mail/dkim-key           static   mailserver (2 srv)   2026-01-15
mail/relay-credentials  static   mailserver (2 srv)   2026-02-01
k8s/join-token          dynamic  k8s-worker (12 srv)  2026-03-15
tls/node-cert           dynamic  * (all enrolled)     per-machine

$ lab secret set mail/relay-credentials
Enter value: ****
✓ Updated. Accessible by: mailserver (2 servers)
✓ Servers will pick up new value within 60s

$ lab show secret mail/relay-credentials
Secret: mail/relay-credentials
Type: static
Last set: 2026-03-15 by admin

Accessible by (derived from code):
  Label "mailserver" → puppet class "postfix" → lab::secret('mail/relay-credentials')
    ├── mail-1 (xcpng)  last fetched: 12m ago
    └── mail-2 (aws)    last fetched: 12m ago

No other references found in any applied code.

$ lab secret audit
✓ All secrets are referenced by at least one applied class/label
⚠ Secret "old/api-key" is defined but not referenced by any code — orphaned?
⚠ Secret "db/password" referenced by class "app::database" but never set — empty!
```

#### Secret Architecture — Distributed, Offline-Capable

**Critical requirement:** Nothing breaks if the central secret server (or any server)
is unreachable. Everything continues to work — including making new pods, deployments,
puppet runs — using local encrypted cache. This is not an edge case, it's a core design.

**This means secrets are NOT a central server you query.** They're a distributed,
synced, encrypted dataset with offline capability.

```
┌─────────────────────────────────────────────────────────────┐
│                  Secret Distribution Model                  │
│                                                             │
│  NOT this (central server):     THIS (distributed sync):    │
│                                                             │
│    ┌─────────┐                   ┌──────┐   ┌──────┐        │
│    │  Vault  │                   │ Node │◄─►│ Node │        │
│    └────┬────┘                   └──┬───┘   └──┬───┘        │
│    ┌────┼────┐                      │  ▲       │            │
│    │    │    │                      ▼  │       ▼            │
│   ┌┴┐  ┌┴┐  ┌┴┐                  ┌──────┐   ┌──────┐        │
│   │N│  │N│  │N│                  │ Node │◄─►│ Node │        │
│   └─┘  └─┘  └─┘                  └──┬───┘   └──────┘        │
│  (all dead if vault                 │                       │
│   is unreachable)                   ▼                       │
│                                ┌──────────┐                 │
│                                │ Git repo │  (encrypted     │
│                                │ (backup) │   backup of     │
│                                └──────────┘   last resort)  │
└─────────────────────────────────────────────────────────────┘
```

#### How It Works

**Layer 1: Local Encrypted Cache (on every machine)**
- Every machine that has access to secrets stores them locally, encrypted at rest
- Encrypted with machine-specific key (derived from machine identity/TPM/secure enclave)
- Puppet runs, app starts, pod deployments — all read from local cache
- If cache is fresh → use it, no network call needed
- Each cached secret has a TTL, but an expired entry is still served when the store is unreachable: stale cache is better than no secret

**Layer 2: Secret Store (privileged nodes that hold all secrets)**
- One or more nodes with the `secret-store` label hold the COMPLETE encrypted dataset
- This is NOT a special server type — it's a label, applied to pods, VMs, or bare metal
- Should have at least 2 replicas for HA
- Machines fetch ONLY the secrets their labels entitle them to from the store
- The store enforces policy — a machine with label `mailserver` gets `mail/*`, nothing else
- Machines NEVER sync with each other directly — they only talk to the store
- This prevents secret sprawl (no machine accumulates secrets it shouldn't have)

**Layer 3: Git Encrypted Backup (last resort recovery)**
- All secrets (encrypted with a master key) backed up to a Git repo
- If the secret-store is lost or unreachable and a machine's cache is empty → restore the store from Git backup
- SOPS/age style encryption — secrets encrypted, metadata (paths, policies) in plaintext
- Git gives versioning, audit trail, and disaster recovery for free
- The Git repo alone is useless without the decryption key

**Layer 4: Lab-server (coordinator, NOT single point of failure)**
- Lab-server is the preferred interface to set/rotate secrets (via CLI/API)
- Lab-server does NOT need to be the secret-store (but can be, via label)
- If lab-server is down, machines keep running from local cache
- No new secrets can be distributed while secret-store is down
- But nothing breaks — existing workloads continue uninterrupted
- When secret-store comes back, machines sync and catch up

**Separation of concerns:**
- `lab-server` = coordination, API, lifecycle management
- `secret-store` label = holds all secrets, serves policy-filtered requests
- These CAN be the same node (apply both labels) or separate nodes
- For homelab: same node is fine. For enterprise: separate for isolation

#### Recovery Scenarios

```
Scenario 1: Lab-server down, secret-store up
  → All machines continue working from local cache
  → Machines can still fetch/refresh secrets from secret-store
  → No new resources can be provisioned (lab-server manages lifecycle)
  → But existing workloads are unaffected

Scenario 2: Secret-store down, lab-server up
  → All machines continue working from local cache
  → Lab-server can still manage lifecycle (provision, plan, apply)
  → No new secrets can be distributed
  → No secret rotations until store is back
  → Lab-server shows: ⚠ secret-store unreachable

Scenario 3: Both down
  → All machines continue working from local cache
  → Nothing new can happen, but nothing breaks
  → Recovery priority: restore secret-store first (from Git backup)

Scenario 4: Machine reboots, cache intact
  → Reads from local encrypted cache immediately
  → Refreshes from secret-store in background to catch up
  → No dependency on lab-server for startup

Scenario 5: Machine rebuilt, cache empty
  → Machine has its identity (from enrollment) but no secrets
  → Fetches entitled secrets from secret-store (policy-filtered)
  → If secret-store unreachable → cannot start (needs secrets)
  → Operator can restore secret-store from Git backup to unblock

Scenario 6: Total disaster, only Git backup survives
  → Deploy new node, apply `secret-store` label
  → Restore encrypted secrets from Git backup
  → Deploy lab-server (lab init)
  → New machines enroll and receive their entitled secrets
  → System fully recovered

Scenario 7: New pod in k8s, secret-store unreachable
  → K8s node has local secret cache for its entitled secrets
  → Lab k8s operator serves pod secrets from node's local cache
  → Pod starts with cached secrets
  → No interruption to deployments
```

#### CLI for Secret Distribution

```
$ lab secret status
SECRET DISTRIBUTION STATUS:
  Local cache:   ✓ 8 secrets cached (of 8 entitled), encrypted, fresh (< 5m old)
  Secret store:  ✓ connected (2 replicas: store-1, store-2)
  Lab-server:    ✓ connected
  Git backup:    ✓ last push 2026-03-15 14:30:00 (47 total secrets)

$ lab secret status --store
SECRET STORE:
  Replicas: 2/2 healthy
    store-1  k8s pod   ✓ synced  47 secrets (all)
    store-2  vm/xcpng  ✓ synced  47 secrets (all)
  Git backup: ✓ synced 2026-03-15 14:30:00
  Total secrets: 47
  Entitled consumers:
    k8s-worker (12 machines) → 3 secrets each
    mailserver (2 machines)  → 5 secrets each
    postgres (3 machines)    → 4 secrets each
    lab-server (1 machine)   → 2 secrets

$ lab secret cache
LOCAL CACHE:
  SECRET          CACHED  TTL        STATUS
  mail/tls-cert   ✓       89d left   fresh
  mail/dkim-key   ✓       no expiry  fresh
  k8s/join-token  ✓       23h left   fresh
  tls/node-cert   ✓       346d left  fresh

$ lab secret recover --from git
  → Fetching encrypted backup from git@github.com:org/lab-secrets.git
  → Decrypting with master key...
  → Restored 47 secrets
  → Syncing with secret-store replicas...
```

#### Local Cache Security

The local cache must be stored securely — needs investigation:
- Encrypted at rest with machine-specific key
- Key derived from: TPM 2.0? Secure enclave? LUKS-bound? needs investigation
- Decrypted secrets held in locked memory so they are never swapped to disk (mlock)
- Accessible only by lab agent (file permissions plus mandatory access control such as SELinux)
- Wiped on machine decommission (`lab identity revoke`)
- Possibly use kernel keyring on Linux — needs investigation

#### Secret Backend — NOT Decided

The underlying secret storage/sync mechanism is pluggable:

```go
type SecretBackend interface {
	Name() string

	// CRUD
	Get(path string, identity *MachineIdentity) ([]byte, error)
	Set(path string, value []byte) error
	Delete(path string) error
	List(prefix string) ([]string, error)

	// Policy (auto-generated from code/labels)
	GrantAccess(path string, identity *MachineIdentity) error
	RevokeAccess(path string, identity *MachineIdentity) error

	// Dynamic
	Generate(path string, generator GeneratorConfig) ([]byte, error)
	Rotate(path string) error

	// Distribution
	SyncWith(peer PeerInfo) error
	CacheLocally(secrets []Secret) error
	RestoreFromBackup(source BackupSource) error
}
```

Possible approaches (each needs investigation):
- **SOPS + age + Git** — simplest, encrypted files in Git, but no peer sync
- **OpenBao** — Vault fork, has replication, but still central-server mindset
- **Sealed Secrets / External Secrets Operator** — k8s-native, but not universal
- **Infisical** — developer-friendly, but SaaS-oriented
- **Custom: encrypted SQLite + peer sync** — simple, we control the sync protocol
- **etcd with encryption** — distributed by nature, but might be overkill
- **CockroachDB** — distributed SQL, encrypted, survives node failures
- **Consul** — distributed KV with gossip, HashiCorp though
- **Lab's own sync protocol** — gossip-based, encrypted, purpose-built

The right answer might be a combination:
- SOPS/age for encryption format (proven, auditable)
- Custom gossip sync for distribution (lightweight)
- Git for backup (free versioning and DR)
- Or wrap an existing distributed KV that already handles sync

**This is the most complex subsystem in Lab and needs careful investigation.**

### Identity Plugin System

Same extensible pattern as providers and health sources:

```go
type IdentityPlugin interface {
	Name() string

	// Enrollment
	Enroll(resource *Resource, token string) (*Identity, error)
	Revoke(resource *Resource) error

	// Status
	Status(resource *Resource) (*IdentityStatus, error)

	// Renewal
	Renew(resource *Resource) error
}
```

This allows swapping identity backends without changing the rest of Lab.
We might start with Vault + OpenVox CA and later add/replace components.

## State Storage — Design Principles

**NOT etcd.** etcd prioritizes consistency over availability: when it loses quorum it
stops accepting writes rather than risk serving inconsistent data. For Lab, availability wins:

- Losing a few events is better than total outage
- Should auto-backup and auto-restore on corruption
- Should degrade gracefully, never crash and refuse to start
- Stale data is acceptable, no data is not

Requirements:
- Stores: resource state, label definitions, group membership, alert configs, audit log
- Must survive lab-server restart
- Must be migratable (lab-server can move between hosts)
- Should auto-backup (to Git, S3, or local snapshots)
- Should auto-recover from corruption without operator intervention
- Embedded (no external dependency) preferred for simplicity

Candidates (needs investigation):
- **SQLite** — embedded, simple, proven, WAL mode for concurrent reads, easy to backup (copy file)
- **bbolt/BoltDB** — embedded KV, used by etcd ironically, simpler than etcd itself
- **Badger** — embedded KV in Go, LSM-tree, good performance
- **DuckDB** — embedded analytical DB, might be overkill
- **PostgreSQL** — if we need multi-server state, but adds external dependency
- **Litestream** — SQLite + continuous replication to S3/GCS/Azure (interesting combo)

**SQLite + Litestream** is the current leading candidate:
- SQLite for simplicity and embeddability
- Litestream for continuous backup to S3/GCS/local without stopping the database
- Auto-restore: if DB is missing, Litestream restores from latest backup
- Single file, easy to migrate when lab-server moves
- But needs investigation to confirm it handles our scale

## Open Questions

1. Name: "lab" is simple but generic. Alternatives?
2. GitOps integration — should label/profile changes go through Git, or direct API?
3. Multi-tenancy — how to scope labels/resources per team?
4. Auth — mTLS between CLI and server? OIDC? Vault-issued tokens?
5. Input format — TypeScript (DA-style), YAML (Compose-style), or both?
6. Should `lab init` deploy lab-server as a container (portable) or native binary (simpler)?