lab/bastion/.taskmaster/docs/pulumi-k3s-refactor.md
Michal 46b017d77e
2026-03-26 22:26:33 +00:00


PRD: Refactor K3s Module from Bash Heredocs to Pulumi TypeScript

Problem

The k3s install/configure/health module currently generates ~300 lines of bash as heredoc strings embedded in TypeScript files (install.ts, configure.ts, health.ts). These strings are hard to maintain, cannot be unit-tested, and cannot be composed. This is the same bash-in-code problem that drove the bastion TypeScript rewrite.

Vision

The lab platform uses Pulumi as its IaC engine:

  • Central execution: labd runs Pulumi programs in labcontroller k8s for cloud/remote resources with RBAC, global state, and audit trail (PulumiRun table already exists in CockroachDB)
  • Local execution: lab-agents run Pulumi programs directly on bare-metal nodes
  • Multi-environment: supports multiple datacenters, multiple targets (bare metal, AWS, GCP), and production/dev/ephemeral environments

Current State

Files to replace

  • src/modules/modules/k3s/src/install.ts — 275 lines, generates bash for 10 install phases
  • src/modules/modules/k3s/src/configure.ts — 118 lines, generates bash for 5 configure phases
  • src/modules/modules/k3s/src/health.ts — 57 lines, generates bash for 6 health checks

Existing infrastructure

  • sshExec(ip, user, command, opts) and sshExecStreaming() — SSH execution primitives in src/modules/src/ssh.ts
  • Module system: ModuleRunner, ModuleRegistry, Module interface with install/configure/health phases
  • @lab/shared types: BastionConfig, K3sInstallContext, roles, OS types
  • PulumiRun model in Prisma schema (labd) — tracks Pulumi execution state
  • labcontroller module generates k8s manifests (cockroachdb.ts, labd.ts, bastion.ts) — these also need Pulumi migration eventually

21 distinct operations currently in bash

Install phase (10 steps):

  1. Load kernel modules (br_netfilter, overlay, ip_conntrack)
  2. Apply CIS sysctl hardening (9 params)
  3. Disable swap
  4. Disable firewall (firewalld/ufw — mask to survive reboot)
  5. Set SELinux permissive
  6. Write k3s server config (flannel=none, secrets-encryption, audit, CIS hardened)
  7. Write audit policy YAML
  8. Clean up stale CNI (flannel.1 vxlan, cilium interfaces, port 8472 conflicts)
  9. Install k3s binary (curl | sh)
  10. Install Cilium CNI (detect arch, detect interface, kubeProxyReplacement)
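
As an illustration of how step 6 could move out of a heredoc, the server config can be held as a typed object and rendered to YAML. This is a minimal sketch, not the module's actual code: the names K3sServerConfig and renderK3sConfig are hypothetical, and the audit log path in the usage note is only an example value. The keys flannel-backend, secrets-encryption, and kube-apiserver-arg are k3s server config options.

```typescript
// Hypothetical typed view of a few k3s server config keys (step 6).
type K3sServerConfig = {
  "flannel-backend": string;
  "secrets-encryption": boolean;
  "kube-apiserver-arg": string[];
};

// Render the flat config object as YAML.
// Only strings, booleans, and string lists are handled; strings are
// emitted as JSON double-quoted scalars, which are also valid YAML.
function renderK3sConfig(config: K3sServerConfig): string {
  const lines: string[] = [];
  for (const [key, value] of Object.entries(config)) {
    if (Array.isArray(value)) {
      lines.push(`${key}:`);
      for (const item of value) {
        lines.push(`  - ${JSON.stringify(item)}`);
      }
    } else if (typeof value === "string") {
      lines.push(`${key}: ${JSON.stringify(value)}`);
    } else {
      lines.push(`${key}: ${value}`);
    }
  }
  return lines.join("\n") + "\n";
}
```

Calling renderK3sConfig({ "flannel-backend": "none", "secrets-encryption": true, "kube-apiserver-arg": ["audit-log-path=/var/log/k3s-audit.log"] }) yields text that could be written to /etc/rancher/k3s/config.yaml, and the function can be unit-tested without any SSH connection.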

Configure phase (5 steps):

  1. Fix CoreDNS upstream DNS (systemd-resolved 127.0.0.53 unreachable from pod netns)
  2. Configure log rotation
  3. Check certificate expiry
  4. Apply default network policies (deny-ingress, allow-dns-egress)
  5. Apply Pod Security Standards (restricted)
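
Step 4 could likewise build its manifests as typed objects instead of inline YAML. The sketch below assumes the two default policies select every pod in a namespace; the function names and the NetworkPolicy interface are hypothetical and cover only the fields used here. Since kubectl accepts JSON manifests, the objects can be serialized with JSON.stringify and piped to kubectl apply -f -.

```typescript
// Subset of the networking.k8s.io/v1 NetworkPolicy shape used by this sketch.
interface NetworkPolicy {
  apiVersion: "networking.k8s.io/v1";
  kind: "NetworkPolicy";
  metadata: { name: string; namespace: string };
  spec: {
    podSelector: Record<string, never>; // {} selects all pods in the namespace
    policyTypes: Array<"Ingress" | "Egress">;
    egress?: Array<{ ports: Array<{ protocol: "UDP" | "TCP"; port: number }> }>;
  };
}

// Default deny: selecting all pods for Ingress with no ingress rules
// blocks inbound traffic unless another policy allows it.
function denyIngress(namespace: string): NetworkPolicy {
  return {
    apiVersion: "networking.k8s.io/v1",
    kind: "NetworkPolicy",
    metadata: { name: "deny-ingress", namespace },
    spec: { podSelector: {}, policyTypes: ["Ingress"] },
  };
}

// Allow DNS egress on port 53 (UDP and TCP). Because the policy selects
// all pods for Egress, other egress is blocked unless another policy allows it.
function allowDnsEgress(namespace: string): NetworkPolicy {
  return {
    apiVersion: "networking.k8s.io/v1",
    kind: "NetworkPolicy",
    metadata: { name: "allow-dns-egress", namespace },
    spec: {
      podSelector: {},
      policyTypes: ["Egress"],
      egress: [
        {
          ports: [
            { protocol: "UDP", port: 53 },
            { protocol: "TCP", port: 53 },
          ],
        },
      ],
    },
  };
}
```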

Health checks (6 checks):

  1. k3s service active
  2. Node Ready condition
  3. API server /healthz
  4. Secrets encryption enabled
  5. Cilium status
  6. kube-system pod status
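
A health check such as "Node Ready condition" can be split into a command that fetches data and a pure function that interprets it, which keeps the logic testable. The following is a sketch under assumptions: it expects the output of kubectl get nodes -o json as input, and the names HealthResult and checkNodesReady are hypothetical.

```typescript
// Hypothetical result shape for a single health check.
interface HealthResult {
  check: string;
  healthy: boolean;
  detail: string;
}

// Subset of `kubectl get nodes -o json` that this check reads.
interface NodeList {
  items: Array<{
    metadata: { name: string };
    status: { conditions: Array<{ type: string; status: string }> };
  }>;
}

// Healthy only if at least one node exists and every node reports
// a condition of type "Ready" with status "True".
function checkNodesReady(kubectlJson: string): HealthResult {
  const nodes = (JSON.parse(kubectlJson) as NodeList).items;
  const notReady = nodes
    .filter(
      (node) =>
        !node.status.conditions.some(
          (c) => c.type === "Ready" && c.status === "True",
        ),
    )
    .map((node) => node.metadata.name);

  if (nodes.length === 0) {
    return { check: "node-ready", healthy: false, detail: "no nodes found" };
  }
  if (notReady.length > 0) {
    return {
      check: "node-ready",
      healthy: false,
      detail: `not ready: ${notReady.join(", ")}`,
    };
  }
  return {
    check: "node-ready",
    healthy: true,
    detail: `${nodes.length} node(s) ready`,
  };
}
```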

Requirements

Architecture decisions needed (discuss with user via task-master)

  1. Pulumi structure: micro-stacks vs monorepo-by-env vs component-library vs GitOps operator
  2. Multi-cloud support: how stacks are organized across baremetal/AWS/GCP
  3. Environment model: how prod/dev/ephemeral environments are represented
  4. State backend: Pulumi Cloud vs self-hosted (S3/CockroachDB)
  5. Execution model: who runs pulumi up — labd central, lab-agent local, or both?

Operation design

  • Each operation is a typed TypeScript async function using sshExec()
  • Standard interface: OperationContext in, OperationResult out
  • Idempotent: check before act, report changed: boolean
  • Composable: operations grouped into logical units (host-prep, networking, hardening)
  • Testable: mock sshExec for unit tests
  • Future Pulumi-ready: each function maps 1:1 to a command.remote.Command resource
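
The operation contract above could look roughly like the sketch below, using "disable swap" as the example. Several things here are assumptions: the SshExec type is a simplified stand-in for sshExec() in src/modules/src/ssh.ts (whose real signature also takes opts and whose return shape is not specified in this document), the fields of OperationContext and OperationResult are guesses, and passing the executor through the context is just one way to make it mockable.

```typescript
// Simplified, hypothetical stand-in for sshExec(); the real return shape may differ.
type SshExec = (
  ip: string,
  user: string,
  command: string,
) => Promise<{ exitCode: number; stdout: string }>;

interface OperationContext {
  ip: string;
  user: string;
  sshExec: SshExec; // injected so unit tests can pass a mock
}

interface OperationResult {
  operation: string;
  changed: boolean;
  message: string;
}

// Idempotent "disable swap": check first, act only if needed.
// Note: this sketch does not edit /etc/fstab, so it is not persistent across reboots.
async function disableSwap(ctx: OperationContext): Promise<OperationResult> {
  // `swapon --show --noheadings` prints nothing when no swap is active.
  const check = await ctx.sshExec(ctx.ip, ctx.user, "swapon --show --noheadings");
  if (check.stdout.trim() === "") {
    return { operation: "swap", changed: false, message: "swap already off" };
  }
  const act = await ctx.sshExec(ctx.ip, ctx.user, "swapoff -a");
  if (act.exitCode !== 0) {
    throw new Error(`swapoff failed with exit code ${act.exitCode}`);
  }
  return { operation: "swap", changed: true, message: "swap disabled" };
}
```

In a unit test, sshExec can be replaced by a function that records the commands it receives and returns canned output, so both the "already off" and the "changed" paths are covered without a real host.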

Groups (logical composition)

  • host-prep: kernel-modules + sysctl + swap + firewall + selinux
  • k3s-server: k3s-config + audit-policy + cni-cleanup + k3s-install
  • k3s-agent: k3s-config (agent) + k3s-install (agent mode)
  • networking: cilium + dns-fix + network-policy
  • hardening: pod-security + cert-check + log-rotation
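
A group can then be little more than an ordered list of operations. The sketch below shows one possible shape for the runSequential() helper named in the file structure; its actual signature is not defined in this document, so the parameters and the GroupResult type are assumptions.

```typescript
interface OperationResult {
  operation: string;
  changed: boolean;
}

// An operation takes some context and reports whether it changed anything.
type Operation<Ctx> = (ctx: Ctx) => Promise<OperationResult>;

// Hypothetical aggregate result for a group such as host-prep.
interface GroupResult {
  group: string;
  changed: boolean; // true if any operation in the group changed something
  results: OperationResult[];
}

// Run operations strictly in order. An operation that throws aborts the
// group, so later operations never run against a half-prepared host.
async function runSequential<Ctx>(
  group: string,
  ctx: Ctx,
  operations: Operation<Ctx>[],
): Promise<GroupResult> {
  const results: OperationResult[] = [];
  for (const op of operations) {
    results.push(await op(ctx));
  }
  return { group, changed: results.some((r) => r.changed), results };
}
```

With this shape, host-prep would be a call such as runSequential("host-prep", ctx, [kernelModules, sysctl, swap, firewall, selinux]), where each array entry is an operation function (names here are placeholders).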

Pulumi integration (when added)

  • Add @pulumi/pulumi and @pulumi/command as dependencies
  • Each operation becomes a command.remote.Command resource
  • Groups become pulumi.ComponentResource classes
  • K3sCluster becomes a top-level ComponentResource that composes groups
  • Stacks per environment: lab-baremetal, aws-prod, dev, ephemeral-pr-123
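
To make the 1:1 mapping concrete without pulling in the Pulumi packages yet, the sketch below builds the argument object that a command.remote.Command would receive. The connection, create, and triggers inputs are part of that resource's arguments; everything else (OperationSpec, toRemoteCommandArgs, the choice of triggers) is hypothetical and only illustrates the idea.

```typescript
// Mirrors the subset of command.remote.Command arguments used in this sketch.
interface RemoteCommandArgs {
  connection: { host: string; user: string; privateKey: string };
  create: string;
  triggers: string[];
}

// Hypothetical description of an operation as a named shell command.
interface OperationSpec {
  name: string;
  command: string;
}

function toRemoteCommandArgs(
  op: OperationSpec,
  host: string,
  user: string,
  privateKey: string,
): RemoteCommandArgs {
  return {
    connection: { host, user, privateKey },
    create: op.command,
    // Assumption: using the command text as a trigger, so that a changed
    // command replaces the resource and therefore runs again.
    triggers: [op.command],
  };
}

// Once @pulumi/command is a dependency, a group could pass the result on, e.g.:
//   new command.remote.Command(op.name, toRemoteCommandArgs(op, host, user, key), { parent: this });
```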

File structure

src/modules/modules/k3s/src/
├── types.ts              # K3sConfig, OperationContext, OperationResult
├── utils.ts              # sshOpts(), runSequential(), file helpers
├── operations/           # ~15 atomic operations
│   ├── kernel-modules.ts
│   ├── sysctl.ts
│   ├── swap.ts
│   ├── firewall.ts
│   ├── selinux.ts
│   ├── k3s-config.ts
│   ├── audit-policy.ts
│   ├── cni-cleanup.ts
│   ├── k3s-install.ts
│   ├── cilium.ts
│   ├── dns-fix.ts
│   ├── log-rotation.ts
│   ├── network-policy.ts
│   ├── pod-security.ts
│   └── cert-check.ts
├── groups/               # Logical groupings
│   ├── host-prep.ts
│   ├── k3s-server.ts
│   ├── k3s-agent.ts
│   ├── networking.ts
│   └── hardening.ts
├── health/               # Health checks
│   ├── k3s-service.ts
│   ├── node-ready.ts
│   ├── api-health.ts
│   ├── secrets-encryption.ts
│   ├── cilium-status.ts
│   └── pod-status.ts
├── k3s-module.ts         # Module implementation
└── index.ts              # Public exports

Success criteria

  • Zero bash heredoc strings in the k3s module
  • Every operation independently testable with mocked sshExec
  • labctl app k3s install <target> works end-to-end
  • labctl app k3s health works end-to-end
  • Existing test suite passes (updated for new API)
  • Clear path to wrapping operations as Pulumi resources