Architecture Decisions

Core Principles

Build for homelab first, design for AWS/multi-cloud from the start
Labels as the universal abstraction — config attaches to labels, not machines
Code is the policy — declarations grant access, no separate policy management
Availability over consistency — stale data is acceptable, no data is not
No single point of failure — everything works offline with local cache
Don't reinvent the wheel — wrap existing tools, build the glue and UX
One engine everywhere — CLI, server, and init all use the same code path

The Tool: "lab"

Unified infrastructure lifecycle platform. Full spec in `lab-tool-spec.md`.

Component Dependency Map

┌─────────────────────────────────────────────────────────────────────┐
│                        LAB PLATFORM                                  │
│                                                                      │
│  ┌─────────────────────────────────────────────────────────────┐    │
│  │                    CORE (no external deps)                   │    │
│  │  ┌──────────┐ ┌──────────┐ ┌──────────┐ ┌───────────────┐  │    │
│  │  │ Label    │ │ Group    │ │ Targeting│ │ Render Engine │  │    │
│  │  │ Engine   │ │ Engine   │ │ Engine   │ │ (CLI tables,  │  │    │
│  │  │          │ │          │ │          │ │  TUI, diff)   │  │    │
│  │  └──────────┘ └──────────┘ └──────────┘ └───────────────┘  │    │
│  │  ┌──────────────┐ ┌──────────────┐ ┌──────────────────┐    │    │
│  │  │ Profile      │ │ State Store  │ │ Plugin Registry  │    │    │
│  │  │ Engine       │ │ (SQLite +    │ │                  │    │    │
│  │  │ (t-shirt     │ │  Litestream) │ │                  │    │    │
│  │  │  sizes)      │ │              │ │                  │    │    │
│  │  └──────────────┘ └──────────────┘ └──────────────────┘    │    │
│  └─────────────────────────────────────────────────────────────┘    │
│       ▲ depends on core                                              │
│  ┌────┴────────────────────────────────────────────────────────┐    │
│  │              LIFECYCLE (depends on: core + providers)        │    │
│  │  ┌──────────────┐ ┌──────────────┐ ┌──────────────────┐    │    │
│  │  │ Lifecycle    │ │ Artifact     │ │ K8s Deployer     │    │    │
│  │  │ Manager      │ │ Builder      │ │                  │    │    │
│  │  │ (plan/apply/ │ │ (puppet →    │ │                  │    │    │
│  │  │  destroy)    │ │  container)  │ │                  │    │    │
│  │  └──────────────┘ └──────────────┘ └──────────────────┘    │    │
│  └─────────────────────────────────────────────────────────────┘    │
│       ▲ depends on lifecycle                                         │
│  ┌────┴────────────────────────────────────────────────────────┐    │
│  │              IDENTITY & SECRETS (depends on: lifecycle)      │    │
│  │  ┌──────────────┐ ┌──────────────┐ ┌──────────────────┐    │    │
│  │  │ Identity     │ │ Secret Store │ │ Token Issuer     │    │    │
│  │  │ Manager      │ │ (privileged  │ │ (one-time join   │    │    │
│  │  │ (enroll,     │ │  label, local│ │  tokens)         │    │    │
│  │  │  DNS, certs, │ │  cache, git  │ │                  │    │    │
│  │  │  SSH keys)   │ │  backup)     │ │                  │    │    │
│  │  └──────────────┘ └──────────────┘ └──────────────────┘    │    │
│  └─────────────────────────────────────────────────────────────┘    │
│       ▲ depends on identity                                          │
│  ┌────┴────────────────────────────────────────────────────────┐    │
│  │              OBSERVABILITY (depends on: core + identity)    │    │
│  │  ┌──────────────┐ ┌──────────────┐ ┌──────────────────┐    │    │
│  │  │ Health       │ │ Alert        │ │ Audit Log        │    │    │
│  │  │ Aggregator   │ │ Generator    │ │                  │    │    │
│  │  │              │ │ (auto + user │ │                  │    │    │
│  │  │              │ │  defined)    │ │                  │    │    │
│  │  └──────────────┘ └──────────────┘ └──────────────────┘    │    │
│  └─────────────────────────────────────────────────────────────┘    │
│                                                                      │
│  ┌─────────────────────────────────────────────────────────────┐    │
│  │              INTERFACES (depends on: everything above)      │    │
│  │  ┌──────────┐ ┌──────────┐ ┌──────────┐ ┌──────────────┐  │    │
│  │  │ gRPC/REST│ │ CLI      │ │ TUI      │ │ Web UI       │  │    │
│  │  │ API      │ │ (cobra)  │ │(bubbletea)│ │ (future)     │  │    │
│  │  └──────────┘ └──────────┘ └──────────┘ └──────────────┘  │    │
│  └─────────────────────────────────────────────────────────────┘    │
└─────────────────────────────────────────────────────────────────────┘

PROVIDER PLUGINS (external, loaded at runtime):
  ┌────────────┐ ┌────────────┐ ┌──────────────┐ ┌────────────┐
  │provider-aws│ │provider-   │ │provider-     │ │provider-k8s│
  │ (Pulumi)   │ │xcpng (XO)  │ │baremetal     │ │ (Pulumi)   │
  └────────────┘ └────────────┘ │(Tinkerbell)  │ └────────────┘
                                └──────────────┘
HEALTH PLUGINS:                 IDENTITY PLUGINS:
  ┌────────────┐ ┌──────────┐   ┌───────────┐ ┌─────────────┐
  │health-     │ │health-   │   │id-openvox │ │id-dns       │
  │prometheus  │ │naemon    │   │           │ │             │
  └────────────┘ └──────────┘   └───────────┘ └─────────────┘
  ┌────────────┐                ┌───────────┐ ┌─────────────┐
  │health-     │                │id-ssh-ca  │ │id-secret    │
  │cloudwatch  │                │           │ │             │
  └────────────┘                └───────────┘ └─────────────┘

Build Order (what depends on what)

Phase 1: CORE (can be built and tested independently)
  ├── Label Engine
  ├── Group Engine (depends on: labels)
  ├── Targeting Engine (depends on: labels, groups)
  ├── Profile Engine (t-shirt sizes)
  ├── Render Engine
  ├── State Store (SQLite + Litestream)
  ├── Plugin Registry
  ├── CLI framework (cobra)
  └── gRPC/REST API skeleton

Phase 2: PROVIDERS (can be built in parallel, each independent)
  ├── provider-ssh (simplest, needed for onboarding existing machines)
  ├── provider-baremetal (PXE boot — embedded DHCP/TFTP/HTTP server)
  ├── provider-portainer (deploy via Portainer API)
  ├── provider-k8s (needed for k8s deployments)
  ├── provider-aws (Pulumi AWS)
  └── provider-xcpng (Pulumi XO / XO REST API)

Phase 3: LIFECYCLE (depends on: core + at least one provider)
  ├── Lifecycle Manager (plan/apply/destroy)
  ├── Onboarding (lab onboard — SSH detect + PXE boot + auto-enroll)
  ├── Hardware detection (suggest labels from detected CPU/GPU/RAM/disk)
  ├── Local mode (lab init --local, engine on user device)
  ├── Self-deploy (lab init — deploy to remote target)
  ├── Self-migration (lab server migrate)
  └── Artifact Builder (puppet → container)

Phase 4: IDENTITY (depends on: lifecycle)
  ├── Token Issuer (one-time join tokens)
  ├── OpenVox Enroller (cert signing, node classification)
  ├── DNS Manager (auto-registration, IP mobility)
  ├── SSH CA integration
  └── Secret Store (privileged label, local cache, git backup)

Phase 5: OBSERVABILITY (depends on: core + identity)
  ├── Health Aggregator (Prometheus, Naemon, CloudWatch plugins)
  ├── Alert Generator (auto + user-defined, targeting engine)
  ├── Four-pillar status (sync + puppet + health + identity)
  └── Audit log

Phase 6: UX POLISH
  ├── TUI (bubbletea, k9s-style, cross-linked navigation)
  ├── lab show / lab targets (visibility commands)
  ├── lab render (multi-provider comparison)
  └── Web UI (future)

Key Concepts

Concept	Description
Labels	Universal abstraction. Config (puppet classes, alerts, secrets, sizes) attaches to labels
Groups	Composable, nested, with exclusions. Target by label, group, server, environment
Targeting	Unified query syntax used everywhere: alerts, secrets, puppet, queries
Four Pillars	Every resource shows: Sync + Puppet + Health + Identity
Profiles	T-shirt sizing with per-provider mappings, user-owned
Secret Store	Privileged label holding all secrets, machines get only entitled subset
Code = Policy	`lab::secret()` in puppet code = usage AND access declaration
Artifact Builder	Same puppet modules → VM config OR container image
Self-deploy	Lab deploys itself using same engine as everything else
Visibility	Two-way: server→everything applied, label→all servers affected

Infrastructure Stack

Layer	Homelab	AWS Equivalent	Status
Orchestration	k3s	EKS	Decided
IaC engine	Pulumi	Pulumi	Decided
GitOps	ArgoCD	ArgoCD	Decided
Monitoring (k8s)	Prometheus + Grafana	Prometheus + Grafana	Decided
Monitoring (infra)	Naemon	N/A (bare metal only)	Decided
Secrets backend	TBD	TBD	Needs investigation
DNS	PowerDNS + ExternalDNS	Route53 + ExternalDNS	Decided — see `dns-research.md`
TLS / CA	TBD	TBD	Needs investigation
SSH CA	TBD	TBD	Needs investigation
Storage	Longhorn	EBS CSI	Decided
Config mgmt	OpenVox	OpenVox	Decided
Bare metal boot	Tinkerbell / iPXE	N/A	Needs investigation
State store	SQLite + Litestream	SQLite + Litestream	Leading candidate
Container build	Buildah / Docker	Buildah / Docker	Needs investigation

Decisions Made

Decision	Choice	Why	Alternatives Considered
IaC engine	Pulumi	Real languages, plan/preview, component packages, XCP-ng provider exists	Terraform (no abstraction), Crossplane (no plan)
Config mgmt	OpenVox	Puppet fork, Apache 2.0, existing modules, active community	Puppet (Perforce EULA, 25-node limit)
Multi-cloud abstraction	Custom (Lab)	Nothing exists that does labels + plan + bare metal + XCP-ng	Crossplane (no plan), Terraform (re-implement per cloud)
Kubernetes	k3s	Puppet-friendly, multi-arch, lightweight, same K8s API as EKS	OpenShift (fights puppet), Talos (no SSH/puppet), MicroK8s (snap-based)
Target OS list	Ubuntu, Debian, Fedora, AlmaLinux, XCP-ng, VyOS	Multi-arch, each with different install automation	See `os-install-research.md`
State store	NOT etcd	etcd would rather crash than serve stale data; we rank availability above consistency	Leading: SQLite + Litestream
Secret access model	Code = policy	Declarations in code/labels auto-grant access, no manual Vault policies	Manual Vault policy management
Secret distribution	Privileged store + local cache	Prevents secret sprawl, machines only get entitled secrets	Peer-to-peer sync (leaks secrets sideways)
Resilience model	Offline-capable	Local cache keeps everything running, git backup for DR	Central server dependency (FreeIPA burned us)
Bootstrap	Self-deploying	lab init uses same engine as lab apply, no special codepath	Separate init provider interface

Evaluated and Rejected

Tool	Why Rejected	Details
Crossplane	No plan/preview — dealbreaker for enterprise	`crossplane-evaluation.md`
Foreman	Obsolete, poor UX; ruled out after first-hand use	Memory: `feedback_foreman.md`
Terraform/OpenTofu	No multi-platform abstraction	Re-implement per cloud at thousands of nodes
MAAS	Bare metal only	No cloud VMs, no Puppet integration
OpenShift	Fights external config mgmt, heavy, limited ARM	See `kubernetes-flavors.md`
Talos	Immutable OS, no SSH, no puppet	Incompatible with our approach
MicroK8s	Snap-based	Puppet managing snaps is awkward
HashiCorp Vault	Not impressed, central-server mindset	Will evaluate alternatives (OpenBao, Infisical, etc.)
etcd	Consistency over availability	Crashes rather than serving stale data
FreeIPA	Unstable	Good features (DNS, SSH, CA, secrets) but unreliable

Investigation Queue

Things we've identified but haven't evaluated yet, in rough priority order:

#	Topic	Context	Options to Investigate
1	Secret backend	Distributed, offline-capable, policy-filtered	OpenBao, Infisical, Conjur, SOPS+age, custom encrypted SQLite
2	~~DNS auto-registration~~	~~Every managed resource auto-registered~~	DECIDED: PowerDNS + ExternalDNS — see `dns-research.md`
3	SSH CA	CA-signed host keys, short-lived user certs	Vault SSH engine, OpenVox CA, step-ca, Teleport, Boundary
4	TLS / Internal CA	Machine certs, auto-renewal	OpenVox CA, Vault PKI, step-ca, cert-manager
5	Bare metal provisioning	Universal PXE agent + rootfs deploy (NOT native installers)	Wrap Tinkerbell vs build own agent — see `os-install-research.md`
6	State store	Embedded, auto-backup, auto-recover	SQLite+Litestream, bbolt, Badger
7	Container build	Puppet modules → OCI images	Buildah, Docker, Kaniko
8	Local cache encryption	Machine-specific key for secret cache	TPM 2.0, kernel keyring, LUKS-bound, secure enclave
9	Alert rendering	Generate monitoring configs from lab alerts	Prometheus rules, Naemon configs, CloudWatch
10	Input format	How users define resources and labels	YAML (Compose-like), Pkl, KCL, CUE, TypeScript
11	Auth (CLI to server)	Secure CLI-to-lab-server communication	mTLS, OIDC, Vault tokens
12	XCP-ng Pulumi provider	May need Upjet wrapper or direct API	Existing Terraform provider via Upjet, Pulumi XO provider
13	Multi-tenancy	Team scoping for labels/resources	Namespaces, RBAC, org hierarchy
14	Image production pipeline	Build rootfs tarballs per OS per arch	mkosi, debootstrap, dnf --installroot, Packer
15	Tinkerbell evaluation	Hands-on: does wrapping it work, or build our own agent?	HookOS + actions vs custom LinuxKit agent
16	XCP-ng rootfs extraction	How to produce deployable XCP-ng rootfs (not native installer)	Extract from ISO, capture installed system
17	VyOS rootfs extraction	How to produce deployable VyOS rootfs	VyOS build system, published images, Docker mode
18	Multi-arch PXE	Different boot chains for x86 BIOS, x86 UEFI, ARM UEFI	Per-arch agent OS builds, iPXE configs

Project Files

File	Contents
`lab-tool-spec.md`	Full platform specification (CLI examples, plugin interfaces, secrets, identity, bootstrap)
`architecture.md`	This file — decisions, dependencies, investigation queue
`hardware.md`	Homelab hardware inventory and node roles
`crossplane-evaluation.md`	Crossplane evaluation and rejection rationale
`config-format-research.md`	YAML alternatives research (Pkl, KCL, CUE, CDK8s, etc.)
`os-install-research.md`	OS install automation, rootfs production, image pipeline, deployment matrix
`kubernetes-flavors.md`	k3s chosen, OpenShift/Talos/MicroK8s rejected with rationale
`dns-research.md`	PowerDNS + ExternalDNS chosen, domain claims, health-checked DNS
