# Architecture Decisions

## Core Principles

1. Build for homelab first, design for AWS/multi-cloud from the start
2. Labels as the universal abstraction — config attaches to labels, not machines
3. Code is the policy — declarations grant access, no separate policy management
4. Availability over consistency — stale data is acceptable, no data is not
5. No single point of failure — everything works offline with local cache
6. Don't reinvent the wheel — wrap existing tools, build the glue and UX
7. One engine everywhere — CLI, server, and init all use the same code path
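
Principles 4 and 5 together imply a particular read path: try the server first, and if it is unreachable, serve whatever the local cache holds and flag the result as stale. The sketch below shows that idea in Python, used here for illustration only. The function name `read_with_stale_fallback`, the `fetch` callable, and the JSON cache layout are all assumptions, not part of the lab spec.

```python
import json
import time
from pathlib import Path
from typing import Callable, Tuple


def read_with_stale_fallback(
    key: str, fetch: Callable[[str], dict], cache_dir: Path
) -> Tuple[dict, bool]:
    """Return (value, is_stale): fresh data if possible, cached data otherwise."""
    cache_file = cache_dir / f"{key}.json"
    try:
        value = fetch(key)  # hypothetical call to the lab server
    except (ConnectionError, TimeoutError):
        if cache_file.exists():
            # Stale data is acceptable (principle 4).
            return json.loads(cache_file.read_text())["value"], True
        raise  # no cached copy at all: nothing to fall back to
    # Refresh the local cache on every successful read (principle 5).
    cache_dir.mkdir(parents=True, exist_ok=True)
    cache_file.write_text(json.dumps({"value": value, "fetched_at": time.time()}))
    return value, False
```

The caller receives the staleness flag rather than an error, so a CLI or TUI can still render results while marking them as possibly out of date.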

## The Tool: "lab"

Unified infrastructure lifecycle platform. Full spec in `lab-tool-spec.md`.

### Component Dependency Map

```
┌─────────────────────────────────────────────────────────────────────┐
│                        LAB PLATFORM                                  │
│                                                                      │
│  ┌─────────────────────────────────────────────────────────────┐    │
│  │                    CORE (no external deps)                   │    │
│  │  ┌──────────┐ ┌──────────┐ ┌──────────┐ ┌───────────────┐  │    │
│  │  │ Label    │ │ Group    │ │ Targeting│ │ Render Engine │  │    │
│  │  │ Engine   │ │ Engine   │ │ Engine   │ │ (CLI tables,  │  │    │
│  │  │          │ │          │ │          │ │  TUI, diff)   │  │    │
│  │  └──────────┘ └──────────┘ └──────────┘ └───────────────┘  │    │
│  │  ┌──────────────┐ ┌──────────────┐ ┌──────────────────┐    │    │
│  │  │ Profile      │ │ State Store  │ │ Plugin Registry  │    │    │
│  │  │ Engine       │ │ (SQLite +    │ │                  │    │    │
│  │  │ (t-shirt     │ │  Litestream) │ │                  │    │    │
│  │  │  sizes)      │ │              │ │                  │    │    │
│  │  └──────────────┘ └──────────────┘ └──────────────────┘    │    │
│  └─────────────────────────────────────────────────────────────┘    │
│       ▲ depends on core                                              │
│  ┌────┴────────────────────────────────────────────────────────┐    │
│  │              LIFECYCLE (depends on: core + providers)        │    │
│  │  ┌──────────────┐ ┌──────────────┐ ┌──────────────────┐    │    │
│  │  │ Lifecycle    │ │ Artifact     │ │ K8s Deployer     │    │    │
│  │  │ Manager      │ │ Builder      │ │                  │    │    │
│  │  │ (plan/apply/ │ │ (puppet →    │ │                  │    │    │
│  │  │  destroy)    │ │  container)  │ │                  │    │    │
│  │  └──────────────┘ └──────────────┘ └──────────────────┘    │    │
│  └─────────────────────────────────────────────────────────────┘    │
│       ▲ depends on lifecycle                                         │
│  ┌────┴────────────────────────────────────────────────────────┐    │
│  │              IDENTITY & SECRETS (depends on: lifecycle)      │    │
│  │  ┌──────────────┐ ┌──────────────┐ ┌──────────────────┐    │    │
│  │  │ Identity     │ │ Secret Store │ │ Token Issuer     │    │    │
│  │  │ Manager      │ │ (privileged  │ │ (one-time join   │    │    │
│  │  │ (enroll,     │ │  label, local│ │  tokens)         │    │    │
│  │  │  DNS, certs, │ │  cache, git  │ │                  │    │    │
│  │  │  SSH keys)   │ │  backup)     │ │                  │    │    │
│  │  └──────────────┘ └──────────────┘ └──────────────────┘    │    │
│  └─────────────────────────────────────────────────────────────┘    │
│       ▲ depends on identity                                          │
│  ┌────┴────────────────────────────────────────────────────────┐    │
│  │              OBSERVABILITY (depends on: core + identity)    │    │
│  │  ┌──────────────┐ ┌──────────────┐ ┌──────────────────┐    │    │
│  │  │ Health       │ │ Alert        │ │ Audit Log        │    │    │
│  │  │ Aggregator   │ │ Generator    │ │                  │    │    │
│  │  │              │ │ (auto + user │ │                  │    │    │
│  │  │              │ │  defined)    │ │                  │    │    │
│  │  └──────────────┘ └──────────────┘ └──────────────────┘    │    │
│  └─────────────────────────────────────────────────────────────┘    │
│                                                                      │
│  ┌─────────────────────────────────────────────────────────────┐    │
│  │              INTERFACES (depends on: everything above)      │    │
│  │  ┌──────────┐ ┌──────────┐ ┌───────────┐ ┌──────────────┐ │    │
│  │  │ gRPC/REST│ │ CLI      │ │ TUI       │ │ Web UI       │ │    │
│  │  │ API      │ │ (cobra)  │ │(bubbletea)│ │ (future)     │ │    │
│  │  └──────────┘ └──────────┘ └───────────┘ └──────────────┘ │    │
│  └─────────────────────────────────────────────────────────────┘    │
└─────────────────────────────────────────────────────────────────────┘

PROVIDER PLUGINS (external, loaded at runtime):
  ┌────────────┐ ┌────────────┐ ┌──────────────┐ ┌────────────┐
  │provider-aws│ │provider-   │ │provider-     │ │provider-k8s│
  │ (Pulumi)   │ │xcpng (XO)  │ │baremetal     │ │ (Pulumi)   │
  └────────────┘ └────────────┘ │(Tinkerbell)  │ └────────────┘
                                └──────────────┘
HEALTH PLUGINS:                 IDENTITY PLUGINS:
  ┌────────────┐ ┌──────────┐   ┌───────────┐ ┌─────────────┐
  │health-     │ │health-   │   │id-openvox │ │id-dns       │
  │prometheus  │ │naemon    │   │           │ │             │
  └────────────┘ └──────────┘   └───────────┘ └─────────────┘
  ┌────────────┐                ┌───────────┐ ┌─────────────┐
  │health-     │                │id-ssh-ca  │ │id-secret    │
  │cloudwatch  │                │           │ │             │
  └────────────┘                └───────────┘ └─────────────┘
```

### Build Order (what depends on what)

```
Phase 1: CORE (can be built and tested independently)
  ├── Label Engine
  ├── Group Engine (depends on: labels)
  ├── Targeting Engine (depends on: labels, groups)
  ├── Profile Engine (t-shirt sizes)
  ├── Render Engine
  ├── State Store (SQLite + Litestream)
  ├── Plugin Registry
  ├── CLI framework (cobra)
  └── gRPC/REST API skeleton

Phase 2: PROVIDERS (can be built in parallel, each independent)
  ├── provider-ssh (simplest, needed for onboarding existing machines)
  ├── provider-baremetal (PXE boot — embedded DHCP/TFTP/HTTP server)
  ├── provider-portainer (deploy via Portainer API)
  ├── provider-k8s (needed for k8s deployments)
  ├── provider-aws (Pulumi AWS)
  └── provider-xcpng (Pulumi XO / XO REST API)

Phase 3: LIFECYCLE (depends on: core + at least one provider)
  ├── Lifecycle Manager (plan/apply/destroy)
  ├── Onboarding (lab onboard — SSH detect + PXE boot + auto-enroll)
  ├── Hardware detection (suggest labels from detected CPU/GPU/RAM/disk)
  ├── Local mode (lab init --local, engine on user device)
  ├── Self-deploy (lab init — deploy to remote target)
  ├── Self-migration (lab server migrate)
  └── Artifact Builder (puppet → container)

Phase 4: IDENTITY (depends on: lifecycle)
  ├── Token Issuer (one-time join tokens)
  ├── OpenVox Enroller (cert signing, node classification)
  ├── DNS Manager (auto-registration, IP mobility)
  ├── SSH CA integration
  └── Secret Store (privileged label, local cache, git backup)

Phase 5: OBSERVABILITY (depends on: core + identity)
  ├── Health Aggregator (Prometheus, Naemon, CloudWatch plugins)
  ├── Alert Generator (auto + user-defined, targeting engine)
  ├── Four-pillar status (sync + puppet + health + identity)
  └── Audit log

Phase 6: UX POLISH
  ├── TUI (bubbletea, k9s-style, cross-linked navigation)
  ├── lab show / lab targets (visibility commands)
  ├── lab render (multi-provider comparison)
  └── Web UI (future)
```
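
The build order above is a dependency graph, so it can be checked mechanically. The following is a minimal Python sketch using the standard library's `graphlib`. The phase and component names are shortened identifiers invented for this example. Two edges are assumptions rather than statements from the list above: that providers depend on core, and that UX polish comes after observability.

```python
from graphlib import TopologicalSorter

# Phase -> phases it depends on, transcribed from the build order.
PHASE_DEPS = {
    "core": set(),
    "providers": {"core"},                # assumption: plugins need the core registry
    "lifecycle": {"core", "providers"},
    "identity": {"lifecycle"},
    "observability": {"core", "identity"},
    "ux_polish": {"observability"},       # assumption: not stated in the source
}

# Dependencies inside phase 1 that the list spells out explicitly.
CORE_DEPS = {
    "label_engine": set(),
    "group_engine": {"label_engine"},
    "targeting_engine": {"label_engine", "group_engine"},
}


def build_order(deps):
    """Return one valid build order; raises graphlib.CycleError on a cycle."""
    return list(TopologicalSorter(deps).static_order())


if __name__ == "__main__":
    print(build_order(PHASE_DEPS))
    print(build_order(CORE_DEPS))
```

Running a check like this in CI would catch a new component that accidentally introduces a circular dependency between phases.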

### Key Concepts

| Concept | Description |
|---------|-------------|
| **Labels** | Universal abstraction. Config (puppet classes, alerts, secrets, sizes) attaches to labels |
| **Groups** | Composable, nested, with exclusions. Target by label, group, server, environment |
| **Targeting** | Unified query syntax used everywhere: alerts, secrets, puppet, queries |
| **Four Pillars** | Every resource shows: Sync + Puppet + Health + Identity |
| **Profiles** | T-shirt sizing with per-provider mappings, user-owned |
| **Secret Store** | Privileged label holding all secrets, machines get only entitled subset |
| **Code = Policy** | `lab::secret()` in puppet code = usage AND access declaration |
| **Artifact Builder** | Same puppet modules → VM config OR container image |
| **Self-deploy** | Lab deploys itself using same engine as everything else |
| **Visibility** | Two-way: server→everything applied, label→all servers affected |
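
To make the Labels, Groups, and Visibility rows concrete, here is a small Python sketch of how nested groups with exclusions could resolve to machines, and how the two-way visibility lookups could work. The data model (`Group`, `LABEL_CONFIG`, the machine names) is hypothetical and only illustrates the concepts; the real syntax is defined in `lab-tool-spec.md`.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set


@dataclass
class Group:
    labels: Set[str] = field(default_factory=set)   # match machines with any of these labels
    groups: List[str] = field(default_factory=list)  # names of nested groups
    exclude: Set[str] = field(default_factory=set)   # machine names removed from the result


# Hypothetical config attached to labels, never to machines.
LABEL_CONFIG = {"web": ["nginx"], "gpu": ["nvidia_driver"]}


def resolve(name, groups: Dict[str, Group], machines: Dict[str, Set[str]],
            _seen=frozenset()) -> Set[str]:
    """Return the machine names that belong to group `name`."""
    if name in _seen:
        raise ValueError(f"group cycle at {name!r}")
    group = groups[name]
    members = {m for m, labels in machines.items() if labels & group.labels}
    for child in group.groups:
        members |= resolve(child, groups, machines, _seen | {name})
    return members - group.exclude


def applied_to(machine, machines):
    """Server -> everything applied to it via its labels."""
    return sorted(c for label in machines[machine] for c in LABEL_CONFIG.get(label, []))


def affected_by(label, machines):
    """Label -> all servers affected by it."""
    return sorted(m for m, labels in machines.items() if label in labels)
```

In this sketch an exclusion applies after nested groups are merged, so excluding a machine in a parent group removes it even if a child group matched it. Whether lab uses that precedence is not stated in this file.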

## Infrastructure Stack

| Layer | Homelab | AWS Equivalent | Status |
|-------|---------|----------------|--------|
| Orchestration | k3s | EKS | Decided |
| IaC engine | Pulumi | Pulumi | Decided |
| GitOps | ArgoCD | ArgoCD | Decided |
| Monitoring (k8s) | Prometheus + Grafana | Prometheus + Grafana | Decided |
| Monitoring (infra) | Naemon | N/A (bare metal only) | Decided |
| Secrets backend | TBD | TBD | Needs investigation |
| DNS | PowerDNS + ExternalDNS | Route53 + ExternalDNS | Decided — see `dns-research.md` |
| TLS / CA | TBD | TBD | Needs investigation |
| SSH CA | TBD | TBD | Needs investigation |
| Storage | Longhorn | EBS CSI | Decided |
| Config mgmt | OpenVox | OpenVox | Decided |
| Bare metal boot | Tinkerbell / iPXE | N/A | Needs investigation |
| State store | SQLite + Litestream | SQLite + Litestream | Leading candidate |
| Container build | Buildah / Docker | Buildah / Docker | Needs investigation |
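
The Homelab and AWS columns are the gap that the Profiles concept (T-shirt sizing with per-provider mappings) is meant to bridge. The Python sketch below shows one possible shape for that mapping. The size definitions, the provider keys, and the specific AWS instance types are illustrative assumptions, not values taken from the spec.

```python
# Abstract, user-owned size definitions (hypothetical values).
SIZES = {
    "small": {"cpu": 2, "ram_gb": 4},
    "medium": {"cpu": 2, "ram_gb": 8},
}

# Per-provider overrides; providers without an entry get raw CPU/RAM numbers.
PROVIDER_OVERRIDES = {
    "aws": {
        "small": {"instance_type": "t3.medium"},  # 2 vCPU, 4 GiB
        "medium": {"instance_type": "t3.large"},  # 2 vCPU, 8 GiB
    },
}


def render(size: str, provider: str) -> dict:
    """Translate a T-shirt size into provider-specific parameters."""
    spec = SIZES[size]  # KeyError for an unknown size is intentional
    override = PROVIDER_OVERRIDES.get(provider, {}).get(size)
    if override is not None:
        return dict(override)
    return {"vcpus": spec["cpu"], "memory_mb": spec["ram_gb"] * 1024}


def compare(size: str, providers) -> dict:
    """Side-by-side view of one size across providers."""
    return {p: render(size, p) for p in providers}
```

A comparison such as `compare("small", ["aws", "xcpng"])` is roughly what a multi-provider view like `lab render` would need as input, under the assumptions above.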

## Decisions Made

| Decision | Choice | Why | Alternatives Considered |
|----------|--------|-----|------------------------|
| IaC engine | Pulumi | Real languages, plan/preview, component packages, XCP-ng provider exists | Terraform (no abstraction), Crossplane (no plan) |
| Config mgmt | OpenVox | Puppet fork, Apache 2.0, existing modules, active community | Puppet (Perforce EULA, 25-node limit) |
| Multi-cloud abstraction | Custom (Lab) | Nothing exists that does labels + plan + bare metal + XCP-ng | Crossplane (no plan), Terraform (re-implement per cloud) |
| Kubernetes | k3s | Puppet-friendly, multi-arch, lightweight, same K8s API as EKS | OpenShift (fights puppet), Talos (no SSH/puppet), MicroK8s (snap-based) |
| Target OS list | Ubuntu, Debian, Fedora, AlmaLinux, XCP-ng, VyOS | Multi-arch, each with different install automation | See `os-install-research.md` |
| State store | NOT etcd | etcd crashes rather than serve stale data; we prioritize availability over consistency | Leading: SQLite + Litestream |
| Secret access model | Code = policy | Declarations in code/labels auto-grant access, no manual Vault policies | Manual Vault policy management |
| Secret distribution | Privileged store + local cache | Prevents secret sprawl, machines only get entitled secrets | Peer-to-peer sync (leaks secrets sideways) |
| Resilience model | Offline-capable | Local cache keeps everything running, git backup for DR | Central server dependency (FreeIPA burned us) |
| Bootstrap | Self-deploying | lab init uses same engine as lab apply, no special codepath | Separate init provider interface |
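
The "Code = policy" and "Privileged store + local cache" decisions can be illustrated together: the secrets a machine may receive are derived from the `lab::secret()` calls in the code attached to its labels, and nothing else. The Python sketch below assumes a call form of `lab::secret('name')` with a quoted string argument. That argument syntax, the function names, and the sample data are assumptions made for this example.

```python
import re

# Assumed call form: lab::secret('name') or lab::secret("name").
SECRET_CALL = re.compile(r"lab::secret\(\s*['\"]([^'\"]+)['\"]\s*\)")


def declared_secrets(puppet_source: str) -> set:
    """Secret names that a piece of Puppet code declares it uses."""
    return set(SECRET_CALL.findall(puppet_source))


def entitled_subset(machine_labels, label_code: dict, store: dict) -> dict:
    """Filter the privileged store down to what this machine's code declares."""
    wanted = set()
    for label in machine_labels:
        wanted |= declared_secrets(label_code.get(label, ""))
    # Declared names missing from the store are skipped here; a real
    # implementation would probably report them as errors.
    return {name: store[name] for name in wanted if name in store}
```

Because entitlement is computed from declarations, removing the `lab::secret()` call from the code also revokes access on the next sync, with no separate policy to update.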

## Evaluated and Rejected

| Tool | Why Rejected | Details |
|------|-------------|---------|
| **Crossplane** | No plan/preview — dealbreaker for enterprise | `crossplane-evaluation.md` |
| **Foreman** | Obsolete, poor UX (based on first-hand use) | Memory: `feedback_foreman.md` |
| **Terraform/OpenTofu** | No multi-platform abstraction | Everything must be re-implemented per cloud, which is unworkable at thousands of nodes |
| **MAAS** | Bare metal only | No cloud VMs, no Puppet integration |
| **OpenShift** | Fights external config mgmt, heavy, limited ARM | See `kubernetes-flavors.md` |
| **Talos** | Immutable OS, no SSH, no puppet | Incompatible with our approach |
| **MicroK8s** | Snap-based | Puppet managing snaps is awkward |
| **HashiCorp Vault** | Not impressed, central-server mindset | Will evaluate alternatives (OpenBao, Infisical, etc.) |
| **etcd** | Consistency over availability | Crashes rather than serving stale data |
| **FreeIPA** | Unstable | Good features (DNS, SSH, CA, secrets) but unreliable |

## Investigation Queue

Things we've identified but haven't evaluated yet, in rough priority order:

| # | Topic | Context | Options to Investigate |
|---|-------|---------|----------------------|
| 1 | Secret backend | Distributed, offline-capable, policy-filtered | OpenBao, Infisical, Conjur, SOPS+age, custom encrypted SQLite |
| 2 | ~~DNS auto-registration~~ | ~~Every managed resource auto-registered~~ | **DECIDED: PowerDNS + ExternalDNS** — see `dns-research.md` |
| 3 | SSH CA | CA-signed host keys, short-lived user certs | Vault SSH engine, OpenVox CA, step-ca, Teleport, Boundary |
| 4 | TLS / Internal CA | Machine certs, auto-renewal | OpenVox CA, Vault PKI, step-ca, cert-manager |
| 5 | Bare metal provisioning | Universal PXE agent + rootfs deploy (NOT native installers) | Wrap Tinkerbell vs build own agent — see `os-install-research.md` |
| 6 | State store | Embedded, auto-backup, auto-recover | SQLite+Litestream, bbolt, Badger |
| 7 | Container build | Puppet modules → OCI images | Buildah, Docker, Kaniko |
| 8 | Local cache encryption | Machine-specific key for secret cache | TPM 2.0, kernel keyring, LUKS-bound, secure enclave |
| 9 | Alert rendering | Generate monitoring configs from lab alerts | Prometheus rules, Naemon configs, CloudWatch |
| 10 | Input format | How users define resources and labels | YAML (Compose-like), Pkl, KCL, CUE, TypeScript |
| 11 | Auth (CLI to server) | Secure CLI-to-lab-server communication | mTLS, OIDC, Vault tokens |
| 12 | XCP-ng Pulumi provider | May need a bridged Terraform provider or direct API | Existing Terraform provider via the Pulumi Terraform Bridge, Pulumi XO provider |
| 13 | Multi-tenancy | Team scoping for labels/resources | Namespaces, RBAC, org hierarchy |
| 14 | Image production pipeline | Build rootfs tarballs per OS per arch | mkosi, debootstrap, dnf --installroot, Packer |
| 15 | Tinkerbell evaluation | Hands-on: does wrapping it work, or build our own agent? | HookOS + actions vs custom LinuxKit agent |
| 16 | XCP-ng rootfs extraction | How to produce deployable XCP-ng rootfs (not native installer) | Extract from ISO, capture installed system |
| 17 | VyOS rootfs extraction | How to produce deployable VyOS rootfs | VyOS build system, published images, Docker mode |
| 18 | Multi-arch PXE | Different boot chains for x86 BIOS, x86 UEFI, ARM UEFI | Per-arch agent OS builds, iPXE configs |

## Project Files

| File | Contents |
|------|----------|
| `lab-tool-spec.md` | Full platform specification (CLI examples, plugin interfaces, secrets, identity, bootstrap) |
| `architecture.md` | This file — decisions, dependencies, investigation queue |
| `hardware.md` | Homelab hardware inventory and node roles |
| `crossplane-evaluation.md` | Crossplane evaluation and rejection rationale |
| `config-format-research.md` | YAML alternatives research (Pkl, KCL, CUE, CDK8s, etc.) |
| `os-install-research.md` | OS install automation, rootfs production, image pipeline, deployment matrix |
| `kubernetes-flavors.md` | k3s chosen, OpenShift/Talos/MicroK8s rejected with rationale |
| `dns-research.md` | PowerDNS + ExternalDNS chosen, domain claims, health-checked DNS |