# Architecture Decisions

## Core Principles

1. Build for homelab first, design for AWS/multi-cloud from the start
2. Labels as the universal abstraction: config attaches to labels, not machines
3. Code is the policy: declarations grant access, no separate policy management
4. Availability over consistency: stale data is acceptable, no data is not
5. No single point of failure: everything works offline with local cache
6. Don't reinvent the wheel: wrap existing tools, build the glue and UX
7. One engine everywhere: CLI, server, and init all use the same code path

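Principles 4 and 5 can be illustrated with a small read path that prefers fresh data but falls back to a local cache when the server is unreachable. This is a minimal sketch in Python, not lab's actual implementation: the function name `read_with_stale_fallback`, the JSON-file cache layout, and the `fetch` callback are all hypothetical.

```python
import json
import time
from pathlib import Path
from typing import Any, Callable


def read_with_stale_fallback(
    key: str, fetch: Callable[[str], Any], cache_dir: Path
) -> tuple[Any, bool]:
    """Return (value, is_stale): fresh data if possible, cached data otherwise."""
    cache_file = cache_dir / f"{key}.json"  # hypothetical on-disk cache layout
    try:
        value = fetch(key)
    except OSError:
        # Server unreachable (ConnectionError is an OSError subclass):
        # stale data is acceptable, no data is not.
        if not cache_file.exists():
            raise  # nothing cached yet, so there is genuinely no data
        cached = json.loads(cache_file.read_text())
        return cached["value"], True
    # Fetch succeeded: refresh the local cache for the next outage.
    cache_dir.mkdir(parents=True, exist_ok=True)
    cache_file.write_text(json.dumps({"value": value, "fetched_at": time.time()}))
    return value, False
```

The caller receives an explicit `is_stale` flag, so a CLI or TUI could mark the result as possibly outdated instead of failing outright.
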
## The Tool: "lab"

Unified infrastructure lifecycle platform. Full spec in `lab-tool-spec.md`.

### Component Dependency Map

```
┌─────────────────────────────────────────────────────────────────────┐
│                            LAB PLATFORM                             │
│                                                                     │
│   ┌─────────────────────────────────────────────────────────────┐   │
│   │  CORE (no external deps)                                    │   │
│   │  ┌──────────┐ ┌──────────┐ ┌──────────┐ ┌───────────────┐   │   │
│   │  │ Label    │ │ Group    │ │ Targeting│ │ Render Engine │   │   │
│   │  │ Engine   │ │ Engine   │ │ Engine   │ │ (CLI tables,  │   │   │
│   │  │          │ │          │ │          │ │ TUI, diff)    │   │   │
│   │  └──────────┘ └──────────┘ └──────────┘ └───────────────┘   │   │
│   │  ┌──────────────┐ ┌──────────────┐ ┌──────────────────┐     │   │
│   │  │ Profile      │ │ State Store  │ │ Plugin Registry  │     │   │
│   │  │ Engine       │ │ (SQLite +    │ │                  │     │   │
│   │  │ (t-shirt     │ │ Litestream)  │ │                  │     │   │
│   │  │ sizes)       │ │              │ │                  │     │   │
│   │  └──────────────┘ └──────────────┘ └──────────────────┘     │   │
│   └─────────────────────────────────────────────────────────────┘   │
│        ▲ depends on core                                            │
│   ┌────┴────────────────────────────────────────────────────────┐   │
│   │  LIFECYCLE (depends on: core + providers)                   │   │
│   │  ┌──────────────┐ ┌──────────────┐ ┌──────────────────┐     │   │
│   │  │ Lifecycle    │ │ Artifact     │ │ K8s Deployer     │     │   │
│   │  │ Manager      │ │ Builder      │ │                  │     │   │
│   │  │ (plan/apply/ │ │ (puppet →    │ │                  │     │   │
│   │  │ destroy)     │ │ container)   │ │                  │     │   │
│   │  └──────────────┘ └──────────────┘ └──────────────────┘     │   │
│   └─────────────────────────────────────────────────────────────┘   │
│        ▲ depends on lifecycle                                       │
│   ┌────┴────────────────────────────────────────────────────────┐   │
│   │  IDENTITY & SECRETS (depends on: lifecycle)                 │   │
│   │  ┌──────────────┐ ┌──────────────┐ ┌──────────────────┐     │   │
│   │  │ Identity     │ │ Secret Store │ │ Token Issuer     │     │   │
│   │  │ Manager      │ │ (privileged  │ │ (one-time join   │     │   │
│   │  │ (enroll,     │ │ label, local │ │ tokens)          │     │   │
│   │  │ DNS, certs,  │ │ cache, git   │ │                  │     │   │
│   │  │ SSH keys)    │ │ backup)      │ │                  │     │   │
│   │  └──────────────┘ └──────────────┘ └──────────────────┘     │   │
│   └─────────────────────────────────────────────────────────────┘   │
│        ▲ depends on identity                                        │
│   ┌────┴────────────────────────────────────────────────────────┐   │
│   │  OBSERVABILITY (depends on: core + identity)                │   │
│   │  ┌──────────────┐ ┌──────────────┐ ┌──────────────────┐     │   │
│   │  │ Health       │ │ Alert        │ │ Audit Log        │     │   │
│   │  │ Aggregator   │ │ Generator    │ │                  │     │   │
│   │  │              │ │ (auto + user │ │                  │     │   │
│   │  │              │ │ defined)     │ │                  │     │   │
│   │  └──────────────┘ └──────────────┘ └──────────────────┘     │   │
│   └─────────────────────────────────────────────────────────────┘   │
│                                                                     │
│   ┌─────────────────────────────────────────────────────────────┐   │
│   │  INTERFACES (depends on: everything above)                  │   │
│   │  ┌──────────┐ ┌──────────┐ ┌───────────┐ ┌──────────────┐   │   │
│   │  │ gRPC/REST│ │ CLI      │ │ TUI       │ │ Web UI       │   │   │
│   │  │ API      │ │ (cobra)  │ │(bubbletea)│ │ (future)     │   │   │
│   │  └──────────┘ └──────────┘ └───────────┘ └──────────────┘   │   │
│   └─────────────────────────────────────────────────────────────┘   │
└─────────────────────────────────────────────────────────────────────┘

PROVIDER PLUGINS (external, loaded at runtime):
┌────────────┐ ┌────────────┐ ┌──────────────┐ ┌────────────┐
│provider-aws│ │provider-   │ │provider-     │ │provider-k8s│
│ (Pulumi)   │ │xcpng (XO)  │ │baremetal     │ │ (Pulumi)   │
└────────────┘ └────────────┘ │(Tinkerbell)  │ └────────────┘
                              └──────────────┘
HEALTH PLUGINS:                IDENTITY PLUGINS:
┌────────────┐ ┌──────────┐    ┌───────────┐ ┌─────────────┐
│health-     │ │health-   │    │id-openvox │ │id-dns       │
│prometheus  │ │naemon    │    │           │ │             │
└────────────┘ └──────────┘    └───────────┘ └─────────────┘
┌────────────┐                 ┌───────────┐ ┌─────────────┐
│health-     │                 │id-ssh-ca  │ │id-secret    │
│cloudwatch  │                 │           │ │             │
└────────────┘                 └───────────┘ └─────────────┘
```

### Build Order (what depends on what)

```
Phase 1: CORE (can be built and tested independently)
├── Label Engine
├── Group Engine (depends on: labels)
├── Targeting Engine (depends on: labels, groups)
├── Profile Engine (t-shirt sizes)
├── Render Engine
├── State Store (SQLite + Litestream)
├── Plugin Registry
├── CLI framework (cobra)
└── gRPC/REST API skeleton

Phase 2: PROVIDERS (can be built in parallel, each independent)
├── provider-ssh (simplest, needed for onboarding existing machines)
├── provider-baremetal (PXE boot, embedded DHCP/TFTP/HTTP server)
├── provider-portainer (deploy via Portainer API)
├── provider-k8s (needed for k8s deployments)
├── provider-aws (Pulumi AWS)
└── provider-xcpng (Pulumi XO / XO REST API)

Phase 3: LIFECYCLE (depends on: core + at least one provider)
├── Lifecycle Manager (plan/apply/destroy)
├── Onboarding (lab onboard: SSH detect + PXE boot + auto-enroll)
├── Hardware detection (suggest labels from detected CPU/GPU/RAM/disk)
├── Local mode (lab init --local, engine on user device)
├── Self-deploy (lab init: deploy to remote target)
├── Self-migration (lab server migrate)
└── Artifact Builder (puppet → container)

Phase 4: IDENTITY (depends on: lifecycle)
├── Token Issuer (one-time join tokens)
├── OpenVox Enroller (cert signing, node classification)
├── DNS Manager (auto-registration, IP mobility)
├── SSH CA integration
└── Secret Store (privileged label, local cache, git backup)

Phase 5: OBSERVABILITY (depends on: core + identity)
├── Health Aggregator (Prometheus, Naemon, CloudWatch plugins)
├── Alert Generator (auto + user-defined, targeting engine)
├── Four-pillar status (sync + puppet + health + identity)
└── Audit log

Phase 6: UX POLISH
├── TUI (bubbletea, k9s-style, cross-linked navigation)
├── lab show / lab targets (visibility commands)
├── lab render (multi-provider comparison)
└── Web UI (future)
```

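The phase ordering above is a dependency graph, so it can be checked mechanically with a topological sort. The sketch below uses Python's standard `graphlib`. The dictionaries are transcribed from the "depends on" notes, with two assumptions that the build order does not state: that providers depend on core (for the Plugin Registry) and that UX polish comes after observability. The names `PHASE_DEPS`, `CORE_DEPS`, and `build_order` are hypothetical.

```python
from graphlib import CycleError, TopologicalSorter

# Phase -> set of phases it depends on.
PHASE_DEPS = {
    "core": set(),
    "providers": {"core"},              # assumption: plugins need the core Plugin Registry
    "lifecycle": {"core", "providers"},
    "identity": {"lifecycle"},
    "observability": {"core", "identity"},
    "ux-polish": {"observability"},     # assumption: not stated in the build order
}

# The same idea at component level, for the three Phase 1 engines.
CORE_DEPS = {
    "label-engine": set(),
    "group-engine": {"label-engine"},
    "targeting-engine": {"label-engine", "group-engine"},
}


def build_order(deps: dict[str, set[str]]) -> list[str]:
    """Return one valid build order; raises graphlib.CycleError on circular deps."""
    return list(TopologicalSorter(deps).static_order())
```

Calling `build_order(PHASE_DEPS)` yields the phases in an order where every dependency is built first, and a circular declaration fails with `CycleError` rather than producing a silently wrong plan.
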
### Key Concepts

| Concept | Description |
|---------|-------------|
| **Labels** | Universal abstraction. Config (puppet classes, alerts, secrets, sizes) attaches to labels |
| **Groups** | Composable, nested, with exclusions. Target by label, group, server, environment |
| **Targeting** | Unified query syntax used everywhere: alerts, secrets, puppet, queries |
| **Four Pillars** | Every resource shows: Sync + Puppet + Health + Identity |
| **Profiles** | T-shirt sizing with per-provider mappings, user-owned |
| **Secret Store** | Privileged label holding all secrets, machines get only entitled subset |
| **Code = Policy** | `lab::secret()` in puppet code = usage AND access declaration |
| **Artifact Builder** | Same puppet modules → VM config OR container image |
| **Self-deploy** | Lab deploys itself using same engine as everything else |
| **Visibility** | Two-way: server → everything applied to it; label → all servers affected |

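To make the Labels, Groups, and Targeting rows concrete, here is one possible way to resolve a nested group with exclusions against an inventory of labelled machines. The matching semantics are assumptions, since the table does not define them: a machine matches a group if it carries any of the group's labels, nested groups are unioned in, and exclusions are applied last. The `Group` class and `members` function are hypothetical names, not lab's real query syntax.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Group:
    name: str
    labels: frozenset = frozenset()   # match machines carrying ANY of these labels
    subgroups: tuple = ()             # nested groups, unioned into the result
    exclude: frozenset = frozenset()  # machines carrying any of these are removed


def members(group: Group, inventory: dict[str, set[str]]) -> set[str]:
    """Resolve a group to machine names; inventory maps machine name -> labels."""
    matched = {name for name, labels in inventory.items() if labels & group.labels}
    for sub in group.subgroups:
        matched |= members(sub, inventory)
    # Exclusions are applied after nesting, so they also filter subgroup members.
    return {name for name in matched if not (inventory[name] & group.exclude)}


inventory = {
    "web1": {"web", "prod"},
    "web2": {"web", "staging"},
    "db1": {"db", "prod"},
    "gpu1": {"gpu", "prod", "maintenance"},
}
web = Group("web", labels=frozenset({"web"}))
serving = Group(
    "prod-serving",
    labels=frozenset({"db", "gpu"}),
    subgroups=(web,),
    exclude=frozenset({"staging", "maintenance"}),
)
```

Under these assumed semantics, `members(serving, inventory)` resolves to `web1` and `db1`: `web2` is dropped for carrying `staging` and `gpu1` for carrying `maintenance`.
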
## Infrastructure Stack

| Layer | Homelab | AWS Equivalent | Status |
|-------|---------|----------------|--------|
| Orchestration | k3s | EKS | Decided |
| IaC engine | Pulumi | Pulumi | Decided |
| GitOps | ArgoCD | ArgoCD | Decided |
| Monitoring (k8s) | Prometheus + Grafana | Prometheus + Grafana | Decided |
| Monitoring (infra) | Naemon | N/A (bare metal only) | Decided |
| Secrets backend | TBD | TBD | Needs investigation |
| DNS | PowerDNS + ExternalDNS | Route53 + ExternalDNS | Decided (see `dns-research.md`) |
| TLS / CA | TBD | TBD | Needs investigation |
| SSH CA | TBD | TBD | Needs investigation |
| Storage | Longhorn | EBS CSI | Decided |
| Config mgmt | OpenVox | OpenVox | Decided |
| Bare metal boot | Tinkerbell / iPXE | N/A | Needs investigation |
| State store | SQLite + Litestream | SQLite + Litestream | Leading candidate |
| Container build | Buildah / Docker | Buildah / Docker | Needs investigation |

## Decisions Made

| Decision | Choice | Why | Alternatives Considered |
|----------|--------|-----|------------------------|
| IaC engine | Pulumi | Real languages, plan/preview, component packages, XCP-ng provider exists | Terraform (no abstraction), Crossplane (no plan) |
| Config mgmt | OpenVox | Puppet fork, Apache 2.0, existing modules, active community | Puppet (Perforce EULA, 25-node limit) |
| Multi-cloud abstraction | Custom (Lab) | Nothing exists that does labels + plan + bare metal + XCP-ng | Crossplane (no plan), Terraform (re-implement per cloud) |
| Kubernetes | k3s | Puppet-friendly, multi-arch, lightweight, same K8s API as EKS | OpenShift (fights puppet), Talos (no SSH/puppet), MicroK8s (snap-based) |
| Target OS list | Ubuntu, Debian, Fedora, AlmaLinux, XCP-ng, VyOS | Multi-arch, each with different install automation | See `os-install-research.md` |
| State store | NOT etcd | etcd crashes rather than serving stale data; availability > consistency | Leading: SQLite + Litestream |
| Secret access model | Code = policy | Declarations in code/labels auto-grant access, no manual Vault policies | Manual Vault policy management |
| Secret distribution | Privileged store + local cache | Prevents secret sprawl, machines only get entitled secrets | Peer-to-peer sync (leaks secrets sideways) |
| Resilience model | Offline-capable | Local cache keeps everything running, git backup for DR | Central server dependency (FreeIPA burned us) |
| Bootstrap | Self-deploying | `lab init` uses same engine as `lab apply`, no special codepath | Separate init provider interface |

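The "Code = policy" and "Privileged store + local cache" rows together imply that a machine's secret entitlements can be derived from the declarations attached to its labels. The sketch below shows that derivation under stated assumptions: the exact call syntax `lab::secret('name')` with a quoted string argument is hypothetical (the source only names `lab::secret()`), as are the function names and the idea of looking up puppet source per label.

```python
import re

# Assumed call shape: lab::secret('some/name') or lab::secret("some/name").
SECRET_CALL = re.compile(r'''lab::secret\(\s*['"]([^'"]+)['"]\s*\)''')


def declared_secrets(puppet_source: str) -> set[str]:
    """Secret names a piece of puppet code declares (and is thereby granted)."""
    return set(SECRET_CALL.findall(puppet_source))


def entitlements(machine_labels: set[str], code_by_label: dict[str, str]) -> set[str]:
    """Union of secrets declared by the code attached to each of the machine's labels."""
    granted: set[str] = set()
    for label in machine_labels:
        granted |= declared_secrets(code_by_label.get(label, ""))
    return granted


def entitled_subset(
    store: dict[str, str], machine_labels: set[str], code_by_label: dict[str, str]
) -> dict[str, str]:
    """The slice of the privileged store that one machine may cache locally."""
    allowed = entitlements(machine_labels, code_by_label)
    return {name: value for name, value in store.items() if name in allowed}


code_by_label = {
    "web": "class web { $key = lab::secret('tls/web-key') }",
    "db": 'class db { $pw = lab::secret("db/password") }',
}
store = {"tls/web-key": "K1", "db/password": "P1", "ssh/ca": "C1"}
```

With this shape there is no separate policy file to maintain: a machine labelled only `web` receives `tls/web-key` and nothing else, and removing the declaration from the code removes the grant.
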
## Evaluated and Rejected

| Tool | Why Rejected | Details |
|------|-------------|---------|
| **Crossplane** | No plan/preview, a dealbreaker for enterprise use | `crossplane-evaluation.md` |
| **Foreman** | Obsolete, poor UX (from the user's hands-on experience) | Memory: `feedback_foreman.md` |
| **Terraform/OpenTofu** | No multi-platform abstraction | Re-implement per cloud at thousands of nodes |
| **MAAS** | Bare metal only | No cloud VMs, no Puppet integration |
| **OpenShift** | Fights external config mgmt, heavy, limited ARM | See `kubernetes-flavors.md` |
| **Talos** | Immutable OS, no SSH, no puppet | Incompatible with our approach |
| **MicroK8s** | Snap-based | Puppet managing snaps is awkward |
| **HashiCorp Vault** | Not impressed, central-server mindset | Will evaluate alternatives (OpenBao, Infisical, etc.) |
| **etcd** | Consistency over availability | Crashes rather than serving stale data |
| **FreeIPA** | Unstable | Good features (DNS, SSH, CA, secrets) but unreliable |

## Investigation Queue

Things we've identified but haven't evaluated yet, in rough priority order:

| # | Topic | Context | Options to Investigate |
|---|-------|---------|----------------------|
| 1 | Secret backend | Distributed, offline-capable, policy-filtered | OpenBao, Infisical, Conjur, SOPS+age, custom encrypted SQLite |
| 2 | ~~DNS auto-registration~~ | ~~Every managed resource auto-registered~~ | **DECIDED: PowerDNS + ExternalDNS** (see `dns-research.md`) |
| 3 | SSH CA | CA-signed host keys, short-lived user certs | Vault SSH engine, OpenVox CA, step-ca, Teleport, Boundary |
| 4 | TLS / Internal CA | Machine certs, auto-renewal | OpenVox CA, Vault PKI, step-ca, cert-manager |
| 5 | Bare metal provisioning | Universal PXE agent + rootfs deploy (NOT native installers) | Wrap Tinkerbell vs build own agent (see `os-install-research.md`) |
| 6 | State store | Embedded, auto-backup, auto-recover | SQLite+Litestream, bbolt, Badger |
| 7 | Container build | Puppet modules → OCI images | Buildah, Docker, Kaniko |
| 8 | Local cache encryption | Machine-specific key for secret cache | TPM 2.0, kernel keyring, LUKS-bound, secure enclave |
| 9 | Alert rendering | Generate monitoring configs from lab alerts | Prometheus rules, Naemon configs, CloudWatch |
| 10 | Input format | How users define resources and labels | YAML (Compose-like), Pkl, KCL, CUE, TypeScript |
| 11 | Auth (CLI to server) | Secure CLI-to-lab-server communication | mTLS, OIDC, Vault tokens |
| 12 | XCP-ng Pulumi provider | May need Upjet wrapper or direct API | Existing Terraform provider via Upjet, Pulumi XO provider |
| 13 | Multi-tenancy | Team scoping for labels/resources | Namespaces, RBAC, org hierarchy |
| 14 | Image production pipeline | Build rootfs tarballs per OS per arch | mkosi, debootstrap, dnf --installroot, Packer |
| 15 | Tinkerbell evaluation | Hands-on: does wrapping it work, or build our own agent? | HookOS + actions vs custom LinuxKit agent |
| 16 | XCP-ng rootfs extraction | How to produce deployable XCP-ng rootfs (not native installer) | Extract from ISO, capture installed system |
| 17 | VyOS rootfs extraction | How to produce deployable VyOS rootfs | VyOS build system, published images, Docker mode |
| 18 | Multi-arch PXE | Different boot chains for x86 BIOS, x86 UEFI, ARM UEFI | Per-arch agent OS builds, iPXE configs |

## Project Files

| File | Contents |
|------|----------|
| `lab-tool-spec.md` | Full platform specification (CLI examples, plugin interfaces, secrets, identity, bootstrap) |
| `architecture.md` | This file: decisions, dependencies, investigation queue |
| `hardware.md` | Homelab hardware inventory and node roles |
| `crossplane-evaluation.md` | Crossplane evaluation and rejection rationale |
| `config-format-research.md` | YAML alternatives research (Pkl, KCL, CUE, CDK8s, etc.) |
| `os-install-research.md` | OS install automation, rootfs production, image pipeline, deployment matrix |
| `kubernetes-flavors.md` | k3s chosen, OpenShift/Talos/MicroK8s rejected with rationale |
| `dns-research.md` | PowerDNS + ExternalDNS chosen, domain claims, health-checked DNS |