# Architecture Decisions

## Core Principles

1. Build for homelab first, design for AWS/multi-cloud from the start
2. Labels as the universal abstraction — config attaches to labels, not machines
3. Code is the policy — declarations grant access, no separate policy management
4. Availability over consistency — stale data is acceptable, no data is not
5. No single point of failure — everything works offline with local cache
6. Don't reinvent the wheel — wrap existing tools, build the glue and UX
7. One engine everywhere — CLI, server, and init all use the same code path

## The Tool: "lab"

Unified infrastructure lifecycle platform. Full spec in `lab-tool-spec.md`.

### Component Dependency Map

```
┌─────────────────────────────────────────────────────────────────────┐
│                            LAB PLATFORM                             │
│                                                                     │
│  ┌─────────────────────────────────────────────────────────────┐    │
│  │                    CORE (no external deps)                  │    │
│  │  ┌──────────┐ ┌──────────┐ ┌──────────┐ ┌───────────────┐   │    │
│  │  │ Label    │ │ Group    │ │ Targeting│ │ Render Engine │   │    │
│  │  │ Engine   │ │ Engine   │ │ Engine   │ │ (CLI tables,  │   │    │
│  │  │          │ │          │ │          │ │ TUI, diff)    │   │    │
│  │  └──────────┘ └──────────┘ └──────────┘ └───────────────┘   │    │
│  │  ┌──────────────┐  ┌──────────────┐  ┌──────────────────┐   │    │
│  │  │ Profile      │  │ State Store  │  │ Plugin Registry  │   │    │
│  │  │ Engine       │  │ (SQLite +    │  │                  │   │    │
│  │  │ (t-shirt     │  │ Litestream)  │  │                  │   │    │
│  │  │ sizes)       │  │              │  │                  │   │    │
│  │  └──────────────┘  └──────────────┘  └──────────────────┘   │    │
│  └─────────────────────────────────────────────────────────────┘    │
│       ▲ depends on core                                             │
│  ┌────┴────────────────────────────────────────────────────────┐    │
│  │          LIFECYCLE (depends on: core + providers)           │    │
│  │  ┌──────────────┐  ┌──────────────┐  ┌──────────────────┐   │    │
│  │  │ Lifecycle    │  │ Artifact     │  │ K8s Deployer     │   │    │
│  │  │ Manager      │  │ Builder      │  │                  │   │    │
│  │  │ (plan/apply/ │  │ (puppet →    │  │                  │   │    │
│  │  │ destroy)     │  │ container)   │  │                  │   │    │
│  │  └──────────────┘  └──────────────┘  └──────────────────┘   │    │
│  └─────────────────────────────────────────────────────────────┘    │
│       ▲ depends on lifecycle                                        │
│  ┌────┴────────────────────────────────────────────────────────┐    │
│  │         IDENTITY & SECRETS (depends on: lifecycle)          │    │
│  │  ┌──────────────┐  ┌──────────────┐  ┌──────────────────┐   │    │
│  │  │ Identity     │  │ Secret Store │  │ Token Issuer     │   │    │
│  │  │ Manager      │  │ (privileged  │  │ (one-time join   │   │    │
│  │  │ (enroll,     │  │  label, local│  │ tokens)          │   │    │
│  │  │ DNS, certs,  │  │  cache, git  │  │                  │   │    │
│  │  │ SSH keys)    │  │  backup)     │  │                  │   │    │
│  │  └──────────────┘  └──────────────┘  └──────────────────┘   │    │
│  └─────────────────────────────────────────────────────────────┘    │
│       ▲ depends on identity                                         │
│  ┌────┴────────────────────────────────────────────────────────┐    │
│  │         OBSERVABILITY (depends on: core + identity)         │    │
│  │  ┌──────────────┐  ┌──────────────┐  ┌──────────────────┐   │    │
│  │  │ Health       │  │ Alert        │  │ Audit Log        │   │    │
│  │  │ Aggregator   │  │ Generator    │  │                  │   │    │
│  │  │              │  │ (auto + user │  │                  │   │    │
│  │  │              │  │ defined)     │  │                  │   │    │
│  │  └──────────────┘  └──────────────┘  └──────────────────┘   │    │
│  └─────────────────────────────────────────────────────────────┘    │
│                                                                     │
│  ┌─────────────────────────────────────────────────────────────┐    │
│  │          INTERFACES (depends on: everything above)          │    │
│  │  ┌──────────┐ ┌──────────┐ ┌────────────┐ ┌──────────────┐  │    │
│  │  │ gRPC/REST│ │ CLI      │ │ TUI        │ │ Web UI       │  │    │
│  │  │ API      │ │ (cobra)  │ │ (bubbletea)│ │ (future)     │  │    │
│  │  └──────────┘ └──────────┘ └────────────┘ └──────────────┘  │    │
│  └─────────────────────────────────────────────────────────────┘    │
└─────────────────────────────────────────────────────────────────────┘

PROVIDER PLUGINS (external, loaded at runtime):
┌────────────┐ ┌────────────┐ ┌──────────────┐ ┌────────────┐
│provider-aws│ │provider-   │ │provider-     │ │provider-k8s│
│ (Pulumi)   │ │xcpng (XO)  │ │baremetal     │ │ (Pulumi)   │
└────────────┘ └────────────┘ │(Tinkerbell)  │ └────────────┘
                              └──────────────┘

HEALTH PLUGINS:               IDENTITY PLUGINS:
┌────────────┐ ┌──────────┐   ┌───────────┐ ┌─────────────┐
│health-     │ │health-   │   │id-openvox │ │id-dns       │
│prometheus  │ │naemon    │   │           │ │             │
└────────────┘ └──────────┘   └───────────┘ └─────────────┘
┌────────────┐                ┌───────────┐ ┌─────────────┐
│health-     │                │id-ssh-ca  │ │id-secret    │
│cloudwatch  │                │           │ │             │
└────────────┘                └───────────┘ └─────────────┘
```

### Build Order (what depends on what)

```
Phase 1: CORE (can be built and tested independently)
├── Label Engine
├── Group Engine (depends on: labels)
├── Targeting Engine (depends on: labels, groups)
├── Profile Engine (t-shirt sizes)
├── Render Engine
├── State Store (SQLite + Litestream)
├── Plugin Registry
├── CLI framework (cobra)
└── gRPC/REST API skeleton

Phase 2: PROVIDERS (can be built in parallel, each independent)
├── provider-ssh (simplest, needed for onboarding existing machines)
├── provider-baremetal (PXE boot — embedded DHCP/TFTP/HTTP server)
├── provider-portainer (deploy via Portainer API)
├── provider-k8s (needed for k8s deployments)
├── provider-aws (Pulumi AWS)
└── provider-xcpng (Pulumi XO / XO REST API)

Phase 3: LIFECYCLE (depends on: core + at least one provider)
├── Lifecycle Manager (plan/apply/destroy)
├── Onboarding (lab onboard — SSH detect + PXE boot + auto-enroll)
├── Hardware detection (suggest labels from detected CPU/GPU/RAM/disk)
├── Local mode (lab init --local, engine on user device)
├── Self-deploy (lab init — deploy to remote target)
├── Self-migration (lab server migrate)
└── Artifact Builder (puppet → container)

Phase 4: IDENTITY (depends on: lifecycle)
├── Token Issuer (one-time join tokens)
├── OpenVox Enroller (cert signing, node classification)
├── DNS Manager (auto-registration, IP mobility)
├── SSH CA integration
└── Secret Store (privileged label, local cache, git backup)

Phase 5: OBSERVABILITY (depends on: core + identity)
├── Health Aggregator (Prometheus, Naemon, CloudWatch plugins)
├── Alert Generator (auto + user-defined, targeting engine)
├── Four-pillar status (sync + puppet + health + identity)
└── Audit log

Phase 6: UX POLISH
├── TUI (bubbletea, k9s-style, cross-linked navigation)
├── lab show / lab targets (visibility commands)
├── lab render (multi-provider comparison)
└── Web UI (future)
```

### Key Concepts

| Concept | Description |
|---------|-------------|
| **Labels** | Universal abstraction. Config (puppet classes, alerts, secrets, sizes) attaches to labels |
| **Groups** | Composable, nested, with exclusions. Target by label, group, server, environment |
| **Targeting** | Unified query syntax used everywhere: alerts, secrets, puppet, queries |
| **Four Pillars** | Every resource shows: Sync + Puppet + Health + Identity |
| **Profiles** | T-shirt sizing with per-provider mappings, user-owned |
| **Secret Store** | Privileged label holding all secrets, machines get only entitled subset |
| **Code = Policy** | `lab::secret()` in puppet code = usage AND access declaration |
| **Artifact Builder** | Same puppet modules → VM config OR container image |
| **Self-deploy** | Lab deploys itself using same engine as everything else |
| **Visibility** | Two-way: server→everything applied, label→all servers affected |

## Infrastructure Stack

| Layer | Homelab | AWS Equivalent | Status |
|-------|---------|----------------|--------|
| Orchestration | k3s | EKS | Decided |
| IaC engine | Pulumi | Pulumi | Decided |
| GitOps | ArgoCD | ArgoCD | Decided |
| Monitoring (k8s) | Prometheus + Grafana | Prometheus + Grafana | Decided |
| Monitoring (infra) | Naemon | N/A (bare metal only) | Decided |
| Secrets backend | TBD | TBD | Needs investigation |
| DNS | PowerDNS + ExternalDNS | Route53 + ExternalDNS | Decided — see `dns-research.md` |
| TLS / CA | TBD | TBD | Needs investigation |
| SSH CA | TBD | TBD | Needs investigation |
| Storage | Longhorn | EBS CSI | Decided |
| Config mgmt | OpenVox | OpenVox | Decided |
| Bare metal boot | Tinkerbell / iPXE | N/A | Needs investigation |
| State store | SQLite + Litestream | SQLite + Litestream | Leading candidate |
| Container build | Buildah / Docker | Buildah / Docker | Needs investigation |

## Decisions Made

| Decision | Choice | Why | Alternatives Considered |
|----------|--------|-----|------------------------|
| IaC engine | Pulumi | Real languages, plan/preview, component packages, XCP-ng provider exists | Terraform (no abstraction), Crossplane (no plan) |
| Config mgmt | OpenVox | Puppet fork, Apache 2.0, existing modules, active community | Puppet (Perforce EULA, 25-node limit) |
| Multi-cloud abstraction | Custom (Lab) | Nothing exists that does labels + plan + bare metal + XCP-ng | Crossplane (no plan), Terraform (re-implement per cloud) |
| Kubernetes | k3s | Puppet-friendly, multi-arch, lightweight, same K8s API as EKS | OpenShift (fights puppet), Talos (no SSH/puppet), MicroK8s (snap-based) |
| Target OS list | Ubuntu, Debian, Fedora, AlmaLinux, XCP-ng, VyOS | Multi-arch, each with different install automation | See `os-install-research.md` |
| State store | NOT etcd | etcd crashes rather than serving stale data; availability > consistency | Leading: SQLite + Litestream |
| Secret access model | Code = policy | Declarations in code/labels auto-grant access, no manual Vault policies | Manual Vault policy management |
| Secret distribution | Privileged store + local cache | Prevents secret sprawl, machines only get entitled secrets | Peer-to-peer sync (leaks secrets sideways) |
| Resilience model | Offline-capable | Local cache keeps everything running, git backup for DR | Central server dependency (FreeIPA burned us) |
| Bootstrap | Self-deploying | `lab init` uses same engine as `lab apply`, no special codepath | Separate init provider interface |

## Evaluated and Rejected

| Tool | Why Rejected | Details |
|------|-------------|---------|
| **Crossplane** | No plan/preview — dealbreaker for enterprise | `crossplane-evaluation.md` |
| **Foreman** | Obsolete, poor UX, user has used it | Memory: `feedback_foreman.md` |
| **Terraform/OpenTofu** | No multi-platform abstraction | Re-implement per cloud at thousands of nodes |
| **MAAS** | Bare metal only | No cloud VMs, no Puppet integration |
| **OpenShift** | Fights external config mgmt, heavy, limited ARM | See `kubernetes-flavors.md` |
| **Talos** | Immutable OS, no SSH, no puppet | Incompatible with our approach |
| **MicroK8s** | Snap-based | Puppet managing snaps is awkward |
| **HashiCorp Vault** | Not impressed, central-server mindset | Will evaluate alternatives (OpenBao, Infisical, etc.) |
| **etcd** | Consistency over availability | Crashes rather than serving stale data |
| **FreeIPA** | Unstable | Good features (DNS, SSH, CA, secrets) but unreliable |

## Investigation Queue

Things we've identified but haven't evaluated yet, in rough priority order:

| # | Topic | Context | Options to Investigate |
|---|-------|---------|----------------------|
| 1 | Secret backend | Distributed, offline-capable, policy-filtered | OpenBao, Infisical, Conjur, SOPS+age, custom encrypted SQLite |
| 2 | ~~DNS auto-registration~~ | ~~Every managed resource auto-registered~~ | **DECIDED: PowerDNS + ExternalDNS** — see `dns-research.md` |
| 3 | SSH CA | CA-signed host keys, short-lived user certs | Vault SSH engine, OpenVox CA, step-ca, Teleport, Boundary |
| 4 | TLS / Internal CA | Machine certs, auto-renewal | OpenVox CA, Vault PKI, step-ca, cert-manager |
| 5 | Bare metal provisioning | Universal PXE agent + rootfs deploy (NOT native installers) | Wrap Tinkerbell vs build own agent — see `os-install-research.md` |
| 6 | State store | Embedded, auto-backup, auto-recover | SQLite+Litestream, bbolt, Badger |
| 7 | Container build | Puppet modules → OCI images | Buildah, Docker, Kaniko |
| 8 | Local cache encryption | Machine-specific key for secret cache | TPM 2.0, kernel keyring, LUKS-bound, secure enclave |
| 9 | Alert rendering | Generate monitoring configs from lab alerts | Prometheus rules, Naemon configs, CloudWatch |
| 10 | Input format | How users define resources and labels | YAML (Compose-like), Pkl, KCL, CUE, TypeScript |
| 11 | Auth (CLI to server) | Secure CLI-to-lab-server communication | mTLS, OIDC, Vault tokens |
| 12 | XCP-ng Pulumi provider | May need Upjet wrapper or direct API | Existing Terraform provider via Upjet, Pulumi XO provider |
| 13 | Multi-tenancy | Team scoping for labels/resources | Namespaces, RBAC, org hierarchy |
| 14 | Image production pipeline | Build rootfs tarballs per OS per arch | mkosi, debootstrap, dnf --installroot, Packer |
| 15 | Tinkerbell evaluation | Hands-on: does wrapping it work, or build our own agent? | HookOS + actions vs custom LinuxKit agent |
| 16 | XCP-ng rootfs extraction | How to produce deployable XCP-ng rootfs (not native installer) | Extract from ISO, capture installed system |
| 17 | VyOS rootfs extraction | How to produce deployable VyOS rootfs | VyOS build system, published images, Docker mode |
| 18 | Multi-arch PXE | Different boot chains for x86 BIOS, x86 UEFI, ARM UEFI | Per-arch agent OS builds, iPXE configs |

## Project Files

| File | Contents |
|------|----------|
| `lab-tool-spec.md` | Full platform specification (CLI examples, plugin interfaces, secrets, identity, bootstrap) |
| `architecture.md` | This file — decisions, dependencies, investigation queue |
| `hardware.md` | Homelab hardware inventory and node roles |
| `crossplane-evaluation.md` | Crossplane evaluation and rejection rationale |
| `config-format-research.md` | YAML alternatives research (Pkl, KCL, CUE, CDK8s, etc.) |
| `os-install-research.md` | OS install automation, rootfs production, image pipeline, deployment matrix |
| `kubernetes-flavors.md` | k3s chosen, OpenShift/Talos/MicroK8s rejected with rationale |
| `dns-research.md` | PowerDNS + ExternalDNS chosen, domain claims, health-checked DNS |
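
As a rough illustration of the Labels, Groups, and Visibility rows under Key Concepts, the sketch below shows one way label-based targeting with nested groups and exclusions could behave. It is not taken from `lab-tool-spec.md`: the names `Machine`, `Group`, `resolve`, `config_for`, and `machines_for` are hypothetical, the real query syntax is not shown here, and groups are assumed to be acyclic.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass(frozen=True)
class Machine:
    """Hypothetical inventory entry: a name plus the labels attached to it."""
    name: str
    labels: frozenset[str]


@dataclass
class Group:
    """Hypothetical group: label matches, nested groups, and exclusions."""
    name: str
    include_labels: set[str] = field(default_factory=set)
    include_groups: list[Group] = field(default_factory=list)
    exclude_labels: set[str] = field(default_factory=set)


def resolve(group: Group, inventory: list[Machine]) -> set[str]:
    """Return the names of the machines a group targets.

    Assumption: exclusions are applied last, after nested groups are merged.
    """
    selected = {m.name for m in inventory if m.labels & group.include_labels}
    for child in group.include_groups:
        selected |= resolve(child, inventory)
    excluded = {m.name for m in inventory if m.labels & group.exclude_labels}
    return selected - excluded


def config_for(machine: Machine, attachments: dict[str, list[str]]) -> list[str]:
    """Server -> everything applied: collect config attached to its labels."""
    classes: list[str] = []
    for label in sorted(machine.labels):
        classes.extend(attachments.get(label, []))
    return classes


def machines_for(label: str, inventory: list[Machine]) -> list[str]:
    """Label -> all servers affected."""
    return sorted(m.name for m in inventory if label in m.labels)


# Made-up inventory and attachments, for illustration only.
inventory = [
    Machine("web1", frozenset({"web", "prod"})),
    Machine("web2", frozenset({"web", "staging"})),
    Machine("db1", frozenset({"db", "prod"})),
]
attachments = {"web": ["nginx"], "db": ["postgresql"], "prod": ["monitoring::strict"]}

web = Group("web", include_labels={"web"})
db = Group("db", include_labels={"db"})
prod_apps = Group("prod-apps", include_groups=[web, db], exclude_labels={"staging"})

print(sorted(resolve(prod_apps, inventory)))   # ['db1', 'web1']
print(config_for(inventory[0], attachments))   # ['monitoring::strict', 'nginx']
print(machines_for("prod", inventory))         # ['db1', 'web1']
```

Config never references a machine name directly in this sketch, so relabeling a machine is enough to change what it receives, which is the property principle 2 describes.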