lab/architecture.md
Michal Rydlikowski ac695f506f first commit
2026-03-15 23:50:43 +00:00

Architecture Decisions

Core Principles

  1. Build for homelab first, design for AWS/multi-cloud from the start
  2. Labels as the universal abstraction — config attaches to labels, not machines
  3. Code is the policy — declarations grant access, no separate policy management
  4. Availability over consistency — stale data is acceptable, no data is not
  5. No single point of failure — everything works offline with local cache
  6. Don't reinvent the wheel — wrap existing tools, build the glue and UX
  7. One engine everywhere — CLI, server, and init all use the same code path
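
Principles 4 and 5 can be illustrated with a small read path that prefers fresh data but falls back to a local cache when the server is unreachable. This is a minimal sketch in Python, not the project's implementation: `OfflineFirstReader`, `CachedValue`, and the in-memory dictionary standing in for an on-disk cache are all hypothetical names chosen for the example.

```python
import time
from dataclasses import dataclass
from typing import Callable, Dict, Optional


@dataclass
class CachedValue:
    value: str
    fetched_at: float   # Unix timestamp of the last successful fetch
    stale: bool = False  # True when served from cache after a failed fetch


class OfflineFirstReader:
    """Hypothetical read path: prefer fresh data, fall back to the local cache."""

    def __init__(self, fetch: Callable[[str], str]) -> None:
        self._fetch = fetch  # remote lookup; assumed to raise ConnectionError when offline
        self._cache: Dict[str, CachedValue] = {}  # stand-in for an on-disk cache

    def get(self, key: str) -> Optional[CachedValue]:
        try:
            fresh = CachedValue(self._fetch(key), time.time())
            self._cache[key] = fresh
            return fresh
        except ConnectionError:
            cached = self._cache.get(key)
            if cached is None:
                return None  # never fetched before: the only case with no data
            # Stale data is acceptable: serve it, but flag it as stale.
            return CachedValue(cached.value, cached.fetched_at, stale=True)
```

Callers can inspect `stale` and `fetched_at` to decide whether to warn the user, which keeps the "stale is acceptable" choice visible rather than silent.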

The Tool: "lab"

Lab is a unified infrastructure lifecycle platform. The full specification is in lab-tool-spec.md.

Component Dependency Map

┌─────────────────────────────────────────────────────────────────────┐
│                        LAB PLATFORM                                  │
│                                                                      │
│  ┌─────────────────────────────────────────────────────────────┐    │
│  │                    CORE (no external deps)                   │    │
│  │  ┌──────────┐ ┌──────────┐ ┌──────────┐ ┌───────────────┐  │    │
│  │  │ Label    │ │ Group    │ │ Targeting│ │ Render Engine │  │    │
│  │  │ Engine   │ │ Engine   │ │ Engine   │ │ (CLI tables,  │  │    │
│  │  │          │ │          │ │          │ │  TUI, diff)   │  │    │
│  │  └──────────┘ └──────────┘ └──────────┘ └───────────────┘  │    │
│  │  ┌──────────────┐ ┌──────────────┐ ┌──────────────────┐    │    │
│  │  │ Profile      │ │ State Store  │ │ Plugin Registry  │    │    │
│  │  │ Engine       │ │ (SQLite +    │ │                  │    │    │
│  │  │ (t-shirt     │ │  Litestream) │ │                  │    │    │
│  │  │  sizes)      │ │              │ │                  │    │    │
│  │  └──────────────┘ └──────────────┘ └──────────────────┘    │    │
│  └─────────────────────────────────────────────────────────────┘    │
│       ▲ depends on core                                              │
│  ┌────┴────────────────────────────────────────────────────────┐    │
│  │              LIFECYCLE (depends on: core + providers)        │    │
│  │  ┌──────────────┐ ┌──────────────┐ ┌──────────────────┐    │    │
│  │  │ Lifecycle    │ │ Artifact     │ │ K8s Deployer     │    │    │
│  │  │ Manager      │ │ Builder      │ │                  │    │    │
│  │  │ (plan/apply/ │ │ (puppet →    │ │                  │    │    │
│  │  │  destroy)    │ │  container)  │ │                  │    │    │
│  │  └──────────────┘ └──────────────┘ └──────────────────┘    │    │
│  └─────────────────────────────────────────────────────────────┘    │
│       ▲ depends on lifecycle                                         │
│  ┌────┴────────────────────────────────────────────────────────┐    │
│  │              IDENTITY & SECRETS (depends on: lifecycle)      │    │
│  │  ┌──────────────┐ ┌──────────────┐ ┌──────────────────┐    │    │
│  │  │ Identity     │ │ Secret Store │ │ Token Issuer     │    │    │
│  │  │ Manager      │ │ (privileged  │ │ (one-time join   │    │    │
│  │  │ (enroll,     │ │  label, local│ │  tokens)         │    │    │
│  │  │  DNS, certs, │ │  cache, git  │ │                  │    │    │
│  │  │  SSH keys)   │ │  backup)     │ │                  │    │    │
│  │  └──────────────┘ └──────────────┘ └──────────────────┘    │    │
│  └─────────────────────────────────────────────────────────────┘    │
│       ▲ depends on identity                                          │
│  ┌────┴────────────────────────────────────────────────────────┐    │
│  │              OBSERVABILITY (depends on: core + identity)    │    │
│  │  ┌──────────────┐ ┌──────────────┐ ┌──────────────────┐    │    │
│  │  │ Health       │ │ Alert        │ │ Audit Log        │    │    │
│  │  │ Aggregator   │ │ Generator    │ │                  │    │    │
│  │  │              │ │ (auto + user │ │                  │    │    │
│  │  │              │ │  defined)    │ │                  │    │    │
│  │  └──────────────┘ └──────────────┘ └──────────────────┘    │    │
│  └─────────────────────────────────────────────────────────────┘    │
│                                                                      │
│  ┌─────────────────────────────────────────────────────────────┐    │
│  │              INTERFACES (depends on: everything above)      │    │
│  │  ┌──────────┐ ┌──────────┐ ┌──────────┐ ┌──────────────┐  │    │
│  │  │ gRPC/REST│ │ CLI      │ │ TUI      │ │ Web UI       │  │    │
│  │  │ API      │ │ (cobra)  │ │(bubbletea)│ │ (future)     │  │    │
│  │  └──────────┘ └──────────┘ └──────────┘ └──────────────┘  │    │
│  └─────────────────────────────────────────────────────────────┘    │
└─────────────────────────────────────────────────────────────────────┘

PROVIDER PLUGINS (external, loaded at runtime):
  ┌────────────┐ ┌────────────┐ ┌──────────────┐ ┌────────────┐
  │provider-aws│ │provider-   │ │provider-     │ │provider-k8s│
  │ (Pulumi)   │ │xcpng (XO)  │ │baremetal     │ │ (Pulumi)   │
  └────────────┘ └────────────┘ │(Tinkerbell)  │ └────────────┘
                                └──────────────┘
HEALTH PLUGINS:                 IDENTITY PLUGINS:
  ┌────────────┐ ┌──────────┐   ┌───────────┐ ┌─────────────┐
  │health-     │ │health-   │   │id-openvox │ │id-dns       │
  │prometheus  │ │naemon    │   │           │ │             │
  └────────────┘ └──────────┘   └───────────┘ └─────────────┘
  ┌────────────┐                ┌───────────┐ ┌─────────────┐
  │health-     │                │id-ssh-ca  │ │id-secret    │
  │cloudwatch  │                │           │ │             │
  └────────────┘                └───────────┘ └─────────────┘
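
The relationship between the core Plugin Registry and the runtime-loaded plugins above might look roughly like the following. This is a sketch under assumptions: the `Provider` contract is inferred from the Lifecycle Manager's plan/apply/destroy verbs, and `PluginRegistry`, `FakeSSHProvider`, and their method names are hypothetical. The actual plugin interfaces are described in lab-tool-spec.md.

```python
from typing import Callable, Dict, List, Protocol


class Provider(Protocol):
    """Hypothetical provider contract mirroring plan/apply/destroy."""

    def plan(self, desired: dict) -> List[str]: ...
    def apply(self, desired: dict) -> None: ...
    def destroy(self, name: str) -> None: ...


class PluginRegistry:
    """Maps plugin names such as 'provider-ssh' to factories."""

    def __init__(self) -> None:
        self._factories: Dict[str, Callable[[], Provider]] = {}

    def register(self, name: str, factory: Callable[[], Provider]) -> None:
        if name in self._factories:
            raise ValueError(f"plugin already registered: {name}")
        self._factories[name] = factory

    def get(self, name: str) -> Provider:
        factory = self._factories.get(name)
        if factory is None:
            raise LookupError(f"no such plugin: {name}")
        return factory()  # a new instance per call

    def names(self, prefix: str = "") -> List[str]:
        return sorted(n for n in self._factories if n.startswith(prefix))


class FakeSSHProvider:
    """In-memory stand-in used only for this example."""

    def __init__(self) -> None:
        self.hosts: Dict[str, dict] = {}

    def plan(self, desired: dict) -> List[str]:
        verb = "update" if desired["name"] in self.hosts else "create"
        return [f"{verb} {desired['name']}"]

    def apply(self, desired: dict) -> None:
        self.hosts[desired["name"]] = desired

    def destroy(self, name: str) -> None:
        self.hosts.pop(name, None)
```

Keeping the registry in core and the providers outside it matches the diagram: core never imports a provider, it only looks one up by name.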

Build Order (what depends on what)

Phase 1: CORE (can be built and tested independently)
  ├── Label Engine
  ├── Group Engine (depends on: labels)
  ├── Targeting Engine (depends on: labels, groups)
  ├── Profile Engine (t-shirt sizes)
  ├── Render Engine
  ├── State Store (SQLite + Litestream)
  ├── Plugin Registry
  ├── CLI framework (cobra)
  └── gRPC/REST API skeleton

Phase 2: PROVIDERS (can be built in parallel, each independent)
  ├── provider-ssh (simplest, needed for onboarding existing machines)
  ├── provider-baremetal (PXE boot — embedded DHCP/TFTP/HTTP server)
  ├── provider-portainer (deploy via Portainer API)
  ├── provider-k8s (needed for k8s deployments)
  ├── provider-aws (Pulumi AWS)
  └── provider-xcpng (Pulumi XO / XO REST API)

Phase 3: LIFECYCLE (depends on: core + at least one provider)
  ├── Lifecycle Manager (plan/apply/destroy)
  ├── Onboarding (lab onboard — SSH detect + PXE boot + auto-enroll)
  ├── Hardware detection (suggest labels from detected CPU/GPU/RAM/disk)
  ├── Local mode (lab init --local, engine on user device)
  ├── Self-deploy (lab init — deploy to remote target)
  ├── Self-migration (lab server migrate)
  └── Artifact Builder (puppet → container)

Phase 4: IDENTITY (depends on: lifecycle)
  ├── Token Issuer (one-time join tokens)
  ├── OpenVox Enroller (cert signing, node classification)
  ├── DNS Manager (auto-registration, IP mobility)
  ├── SSH CA integration
  └── Secret Store (privileged label, local cache, git backup)

Phase 5: OBSERVABILITY (depends on: core + identity)
  ├── Health Aggregator (Prometheus, Naemon, CloudWatch plugins)
  ├── Alert Generator (auto + user-defined, targeting engine)
  ├── Four-pillar status (sync + puppet + health + identity)
  └── Audit log

Phase 6: UX POLISH
  ├── TUI (bubbletea, k9s-style, cross-linked navigation)
  ├── lab show / lab targets (visibility commands)
  ├── lab render (multi-provider comparison)
  └── Web UI (future)
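
The "depends on" notes in the build order can be checked mechanically with a topological sort. The sketch below uses Python's standard-library `graphlib` and encodes only the dependencies stated above; UX polish is left out because its dependencies are not spelled out, and the `build_waves` helper is a hypothetical name.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+
from typing import Dict, List, Set

# Each phase maps to the phases it depends on, as stated in the build order.
DEPENDS_ON: Dict[str, Set[str]] = {
    "core": set(),
    "providers": set(),  # "can be built in parallel, each independent"
    "lifecycle": {"core", "providers"},
    "identity": {"lifecycle"},
    "observability": {"core", "identity"},
}


def build_waves(depends_on: Dict[str, Set[str]]) -> List[List[str]]:
    """Group phases into waves; phases within a wave can be built in parallel."""
    sorter = TopologicalSorter(depends_on)
    sorter.prepare()  # raises graphlib.CycleError on circular dependencies
    waves: List[List[str]] = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())
        waves.append(ready)
        sorter.done(*ready)
    return waves


if __name__ == "__main__":
    for number, wave in enumerate(build_waves(DEPENDS_ON), start=1):
        print(number, ", ".join(wave))
```

With these edges, core and providers land in the first wave, followed by lifecycle, identity, and observability in that order.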

Key Concepts

| Concept | Description |
|---|---|
| Labels | Universal abstraction. Config (puppet classes, alerts, secrets, sizes) attaches to labels |
| Groups | Composable, nested, with exclusions. Target by label, group, server, environment |
| Targeting | Unified query syntax used everywhere: alerts, secrets, puppet, queries |
| Four Pillars | Every resource shows: Sync + Puppet + Health + Identity |
| Profiles | T-shirt sizing with per-provider mappings, user-owned |
| Secret Store | Privileged label holding all secrets, machines get only entitled subset |
| Code = Policy | lab::secret() in puppet code = usage AND access declaration |
| Artifact Builder | Same puppet modules → VM config OR container image |
| Self-deploy | Lab deploys itself using same engine as everything else |
| Visibility | Two-way: server→everything applied, label→all servers affected |
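
The Labels, Groups, and Targeting rows above could resolve along the following lines. This is a sketch under assumptions, not the real query syntax: the `Group` fields, the "any label matches" rule, and label-based exclusion are all choices made for the example.

```python
from dataclasses import dataclass
from typing import Dict, FrozenSet, Set


@dataclass(frozen=True)
class Group:
    """Hypothetical group: label selectors, nested groups, and exclusions."""
    labels: FrozenSet[str] = frozenset()          # member if it carries ANY of these
    include: FrozenSet[str] = frozenset()         # names of nested groups
    exclude_labels: FrozenSet[str] = frozenset()  # drop servers carrying ANY of these


def resolve(
    name: str,
    groups: Dict[str, Group],
    inventory: Dict[str, Set[str]],  # server name -> labels on that server
    _path: FrozenSet[str] = frozenset(),
) -> Set[str]:
    """Return the set of server names that a group targets."""
    if name in _path:
        raise ValueError(f"group cycle through {name!r}")
    path = _path | {name}
    group = groups[name]
    # Direct members: servers matching at least one of the group's labels.
    members = {srv for srv, labels in inventory.items() if labels & group.labels}
    # Nested groups contribute their own resolved members.
    for child in group.include:
        members |= resolve(child, groups, inventory, path)
    # Exclusions are applied last, so they also filter nested members.
    excluded = {srv for srv in members if inventory[srv] & group.exclude_labels}
    return members - excluded
```

The reverse lookup in the Visibility row (label to all affected servers) is the same data walked in the other direction: filter `inventory` by label, then list every group whose resolved set contains those servers.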

Infrastructure Stack

| Layer | Homelab | AWS Equivalent | Status |
|---|---|---|---|
| Orchestration | k3s | EKS | Decided |
| IaC engine | Pulumi | Pulumi | Decided |
| GitOps | ArgoCD | ArgoCD | Decided |
| Monitoring (k8s) | Prometheus + Grafana | Prometheus + Grafana | Decided |
| Monitoring (infra) | Naemon | N/A (bare metal only) | Decided |
| Secrets backend | TBD | TBD | Needs investigation |
| DNS | PowerDNS + ExternalDNS | Route53 + ExternalDNS | Decided (see dns-research.md) |
| TLS / CA | TBD | TBD | Needs investigation |
| SSH CA | TBD | TBD | Needs investigation |
| Storage | Longhorn | EBS CSI | Decided |
| Config mgmt | OpenVox | OpenVox | Decided |
| Bare metal boot | Tinkerbell / iPXE | N/A | Needs investigation |
| State store | SQLite + Litestream | SQLite + Litestream | Leading candidate |
| Container build | Buildah / Docker | Buildah / Docker | Needs investigation |
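
The homelab and AWS columns above are where the Profiles concept (t-shirt sizes with per-provider mappings) would apply: one size name, a different concrete shape per provider. The table contents and the `resolve_profile` helper below are hypothetical. The AWS instance types are real, but choosing them for these sizes is an assumption, as are the XCP-ng field names.

```python
from typing import Dict

# Hypothetical, user-owned size table: one t-shirt size, one entry per provider.
PROFILES: Dict[str, Dict[str, dict]] = {
    "small": {
        "provider-aws": {"instance_type": "t3.small"},      # 2 vCPU, 2 GiB
        "provider-xcpng": {"vcpus": 2, "memory_gib": 2},
    },
    "large": {
        "provider-aws": {"instance_type": "m5.xlarge"},     # 4 vCPU, 16 GiB
        "provider-xcpng": {"vcpus": 4, "memory_gib": 16},
    },
}


def resolve_profile(size: str, provider: str) -> dict:
    """Translate a t-shirt size into provider-specific settings."""
    by_provider = PROFILES.get(size)
    if by_provider is None:
        raise KeyError(f"unknown size {size!r}; known sizes: {sorted(PROFILES)}")
    if provider not in by_provider:
        raise KeyError(f"size {size!r} has no mapping for {provider}")
    return dict(by_provider[provider])  # copy so callers cannot mutate the table
```

Because the table is plain data, users can own and edit it without touching provider code, which is the point of "user-owned" in the Key Concepts entry.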

Decisions Made

| Decision | Choice | Why | Alternatives Considered |
|---|---|---|---|
| IaC engine | Pulumi | Real languages, plan/preview, component packages, XCP-ng provider exists | Terraform (no abstraction), Crossplane (no plan) |
| Config mgmt | OpenVox | Puppet fork, Apache 2.0, existing modules, active community | Puppet (Perforce EULA, 25-node limit) |
| Multi-cloud abstraction | Custom (Lab) | Nothing exists that does labels + plan + bare metal + XCP-ng | Crossplane (no plan), Terraform (re-implement per cloud) |
| Kubernetes | k3s | Puppet-friendly, multi-arch, lightweight, same K8s API as EKS | OpenShift (fights puppet), Talos (no SSH/puppet), MicroK8s (snap-based) |
| Target OS list | Ubuntu, Debian, Fedora, AlmaLinux, XCP-ng, VyOS | Multi-arch, each with different install automation | See os-install-research.md |
| State store | NOT etcd | etcd crashes rather than serving stale data; availability > consistency | Leading: SQLite + Litestream |
| Secret access model | Code = policy | Declarations in code/labels auto-grant access, no manual Vault policies | Manual Vault policy management |
| Secret distribution | Privileged store + local cache | Prevents secret sprawl, machines only get entitled secrets | Peer-to-peer sync (leaks secrets sideways) |
| Resilience model | Offline-capable | Local cache keeps everything running, git backup for DR | Central server dependency (FreeIPA burned us) |
| Bootstrap | Self-deploying | lab init uses same engine as lab apply, no special codepath | Separate init provider interface |
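
The secret access and distribution decisions above imply that a machine's entitlement can be derived from code: its labels select puppet classes, and the `lab::secret()` calls in those classes name the secrets it may receive. The sketch below assumes that `lab::secret()` takes a single quoted secret name, which the source does not specify. The function names and the dictionaries standing in for label and class lookups are also hypothetical.

```python
import re
from typing import Dict, Iterable, List, Set

# Assumed call shape: lab::secret('name') or lab::secret("name").
SECRET_CALL = re.compile(r'''lab::secret\(\s*['"]([^'"]+)['"]\s*\)''')


def declared_secrets(puppet_source: str) -> Set[str]:
    """Collect every secret name declared in a piece of puppet source."""
    return set(SECRET_CALL.findall(puppet_source))


def entitled_secrets(
    machine_labels: Iterable[str],
    classes_by_label: Dict[str, List[str]],  # label -> puppet classes attached to it
    source_by_class: Dict[str, str],         # puppet class -> its source text
) -> Set[str]:
    """A machine is entitled to exactly the secrets its classes declare."""
    secrets: Set[str] = set()
    for label in machine_labels:
        for cls in classes_by_label.get(label, ()):
            secrets |= declared_secrets(source_by_class.get(cls, ""))
    return secrets
```

A machine with no matching label gets an empty set, so the default is no access. That is how the design avoids the sideways leakage noted for peer-to-peer sync.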

Evaluated and Rejected

| Tool | Why Rejected | Details |
|---|---|---|
| Crossplane | No plan/preview, a dealbreaker for enterprise | crossplane-evaluation.md |
| Foreman | Obsolete, poor UX, user has used it | Memory: feedback_foreman.md |
| Terraform/OpenTofu | No multi-platform abstraction | Re-implement per cloud at thousands of nodes |
| MAAS | Bare metal only | No cloud VMs, no Puppet integration |
| OpenShift | Fights external config mgmt, heavy, limited ARM | See kubernetes-flavors.md |
| Talos | Immutable OS, no SSH, no puppet | Incompatible with our approach |
| MicroK8s | Snap-based | Puppet managing snaps is awkward |
| HashiCorp Vault | Not impressed, central-server mindset | Will evaluate alternatives (OpenBao, Infisical, etc.) |
| etcd | Consistency over availability | Crashes rather than serving stale data |
| FreeIPA | Unstable | Good features (DNS, SSH, CA, secrets) but unreliable |

Investigation Queue

Things we've identified but haven't evaluated yet, in rough priority order:

| # | Topic | Context | Options to Investigate |
|---|---|---|---|
| 1 | Secret backend | Distributed, offline-capable, policy-filtered | OpenBao, Infisical, Conjur, SOPS+age, custom encrypted SQLite |
| 2 | DNS auto-registration | Every managed resource auto-registered | DECIDED: PowerDNS + ExternalDNS (see dns-research.md) |
| 3 | SSH CA | CA-signed host keys, short-lived user certs | Vault SSH engine, OpenVox CA, step-ca, Teleport, Boundary |
| 4 | TLS / Internal CA | Machine certs, auto-renewal | OpenVox CA, Vault PKI, step-ca, cert-manager |
| 5 | Bare metal provisioning | Universal PXE agent + rootfs deploy (NOT native installers) | Wrap Tinkerbell vs build own agent (see os-install-research.md) |
| 6 | State store | Embedded, auto-backup, auto-recover | SQLite+Litestream, bbolt, Badger |
| 7 | Container build | Puppet modules → OCI images | Buildah, Docker, Kaniko |
| 8 | Local cache encryption | Machine-specific key for secret cache | TPM 2.0, kernel keyring, LUKS-bound, secure enclave |
| 9 | Alert rendering | Generate monitoring configs from lab alerts | Prometheus rules, Naemon configs, CloudWatch |
| 10 | Input format | How users define resources and labels | YAML (Compose-like), Pkl, KCL, CUE, TypeScript |
| 11 | Auth (CLI to server) | Secure CLI-to-lab-server communication | mTLS, OIDC, Vault tokens |
| 12 | XCP-ng Pulumi provider | May need Upjet wrapper or direct API | Existing Terraform provider via Upjet, Pulumi XO provider |
| 13 | Multi-tenancy | Team scoping for labels/resources | Namespaces, RBAC, org hierarchy |
| 14 | Image production pipeline | Build rootfs tarballs per OS per arch | mkosi, debootstrap, dnf --installroot, Packer |
| 15 | Tinkerbell evaluation | Hands-on: does wrapping it work, or build our own agent? | HookOS + actions vs custom LinuxKit agent |
| 16 | XCP-ng rootfs extraction | How to produce deployable XCP-ng rootfs (not native installer) | Extract from ISO, capture installed system |
| 17 | VyOS rootfs extraction | How to produce deployable VyOS rootfs | VyOS build system, published images, Docker mode |
| 18 | Multi-arch PXE | Different boot chains for x86 BIOS, x86 UEFI, ARM UEFI | Per-arch agent OS builds, iPXE configs |

Project Files

| File | Contents |
|---|---|
| lab-tool-spec.md | Full platform specification (CLI examples, plugin interfaces, secrets, identity, bootstrap) |
| architecture.md | This file: decisions, dependencies, investigation queue |
| hardware.md | Homelab hardware inventory and node roles |
| crossplane-evaluation.md | Crossplane evaluation and rejection rationale |
| config-format-research.md | YAML alternatives research (Pkl, KCL, CUE, CDK8s, etc.) |
| os-install-research.md | OS install automation, rootfs production, image pipeline, deployment matrix |
| kubernetes-flavors.md | k3s chosen, OpenShift/Talos/MicroK8s rejected with rationale |
| dns-research.md | PowerDNS + ExternalDNS chosen, domain claims, health-checked DNS |