# PRD: Refactor K3s Module from Bash Heredocs to Pulumi TypeScript

## Problem

The k3s install/configure/health module currently generates ~300 lines of bash heredoc strings embedded in TypeScript files (`install.ts`, `configure.ts`, `health.ts`). These are unmaintainable, untestable, and impossible to compose. This is the same bash-in-code problem that drove the bastion TypeScript rewrite.

## Vision

The lab platform uses Pulumi as its IaC engine:

- **Central execution**: labd runs Pulumi programs in labcontroller k8s for cloud/remote resources with RBAC, global state, and audit trail (PulumiRun table already exists in CockroachDB)
- **Local execution**: lab-agents run Pulumi programs directly on bare-metal nodes
- **Multi-environment**: supports multiple datacenters, clouds (baremetal, AWS, GCP), production/dev/ephemeral environments

## Current State

### Files to replace

- `src/modules/modules/k3s/src/install.ts`: 275 lines, generates bash for 10 install phases
- `src/modules/modules/k3s/src/configure.ts`: 118 lines, generates bash for 5 configure phases
- `src/modules/modules/k3s/src/health.ts`: 57 lines, generates bash for 6 health checks

### Existing infrastructure

- `sshExec(ip, user, command, opts)` and `sshExecStreaming()`: SSH execution primitives in `src/modules/src/ssh.ts`
- Module system: `ModuleRunner`, `ModuleRegistry`, `Module` interface with install/configure/health phases
- `@lab/shared` types: `BastionConfig`, `K3sInstallContext`, roles, OS types
- PulumiRun model in Prisma schema (labd): tracks Pulumi execution state
- labcontroller module generates k8s manifests (cockroachdb.ts, labd.ts, bastion.ts); these also need Pulumi migration eventually

### 32 distinct operations currently in bash

**Install phase (10 steps):**

1. Load kernel modules (br_netfilter, overlay, ip_conntrack)
2. Apply CIS sysctl hardening (9 params)
3. Disable swap
4. Disable firewall (firewalld/ufw; mask to survive reboot)
5. Set SELinux permissive
6. Write k3s server config (flannel=none, secrets-encryption, audit, CIS hardened)
7. Write audit policy YAML
8. Clean up stale CNI (flannel.1 vxlan, cilium interfaces, port 8472 conflicts)
9. Install k3s binary (curl | sh)
10. Install Cilium CNI (detect arch, detect interface, kubeProxyReplacement)

**Configure phase (5 steps):**

1. Fix CoreDNS upstream DNS (systemd-resolved 127.0.0.53 unreachable from pod netns)
2. Configure log rotation
3. Check certificate expiry
4. Apply default network policies (deny-ingress, allow-dns-egress)
5. Apply Pod Security Standards (restricted)

**Health checks (6 checks):**

1. k3s service active
2. Node Ready condition
3. API server /healthz
4. Secrets encryption enabled
5. Cilium status
6. kube-system pod status

## Requirements

### Architecture decisions needed (discuss with user via task-master)

1. **Pulumi structure**: micro-stacks vs monorepo-by-env vs component-library vs GitOps operator
2. **Multi-cloud support**: how stacks are organized across baremetal/AWS/GCP
3. **Environment model**: how prod/dev/ephemeral environments are represented
4. **State backend**: Pulumi Cloud vs self-hosted (S3/CockroachDB)
5. **Execution model**: who runs `pulumi up`: labd central, lab-agent local, or both?

### Operation design

- Each operation is a typed TypeScript async function using `sshExec()`
- Standard interface: `OperationContext` in, `OperationResult` out
- **Idempotent**: check before act, report `changed: boolean`
- **Composable**: operations grouped into logical units (host-prep, networking, hardening)
- **Testable**: mock sshExec for unit tests
- **Future Pulumi-ready**: each function maps 1:1 to a `remote.Command` resource

### Groups (logical composition)

- `host-prep`: kernel-modules + sysctl + swap + firewall + selinux
- `k3s-server`: k3s-config + audit-policy + cni-cleanup + k3s-install
- `k3s-agent`: k3s-config (agent) + k3s-install (agent mode)
- `networking`: cilium + dns-fix + network-policy
- `hardening`: pod-security + cert-check + log-rotation

### Pulumi integration (when added)

- Add `@pulumi/pulumi` and `@pulumi/command` as dependencies
- Each operation becomes a `command.remote.Command` resource
- Groups become `pulumi.ComponentResource` classes
- K3sCluster becomes a top-level ComponentResource that composes groups
- Stacks per environment: `lab-baremetal`, `aws-prod`, `dev`, `ephemeral-pr-123`

## File structure

```
src/modules/modules/k3s/src/
├── types.ts              # K3sConfig, OperationContext, OperationResult
├── utils.ts              # sshOpts(), runSequential(), file helpers
├── operations/           # ~15 atomic operations
│   ├── kernel-modules.ts
│   ├── sysctl.ts
│   ├── swap.ts
│   ├── firewall.ts
│   ├── selinux.ts
│   ├── k3s-config.ts
│   ├── audit-policy.ts
│   ├── cni-cleanup.ts
│   ├── k3s-install.ts
│   ├── cilium.ts
│   ├── dns-fix.ts
│   ├── log-rotation.ts
│   ├── network-policy.ts
│   ├── pod-security.ts
│   └── cert-check.ts
├── groups/               # Logical groupings
│   ├── host-prep.ts
│   ├── k3s-server.ts
│   ├── k3s-agent.ts
│   ├── networking.ts
│   └── hardening.ts
├── health/               # Health checks
│   ├── k3s-service.ts
│   ├── node-ready.ts
│   ├── api-health.ts
│   ├── secrets-encryption.ts
│   ├── cilium-status.ts
│   └── pod-status.ts
├── k3s-module.ts         # Module implementation
└── index.ts              # Public exports
```

## Success criteria

- Zero bash heredoc strings in the k3s module
- Every operation independently testable with mocked sshExec
- `labctl app k3s install ` works end-to-end
- `labctl app k3s health` works end-to-end
- Existing test suite passes (updated for new API)
- Clear path to wrapping operations as Pulumi resources
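
### Illustrative sketch: one idempotent, mockable operation

The "independently testable with mocked sshExec" criterion could look roughly like the sketch below. It is a minimal illustration, not the intended implementation. This PRD names `OperationContext`, `OperationResult`, and `runSequential()` but does not define them, so their fields and signatures here are assumptions. `Exec`, `ExecResult`, and `mockExec` are hypothetical names. `Exec` stands in for `sshExec` already bound to one host, because the return type of the real `sshExec(ip, user, command, opts)` is not stated here.

```typescript
// All shapes below are assumptions made for illustration.
type ExecResult = { stdout: string; exitCode: number };
type Exec = (command: string) => Promise<ExecResult>;

interface OperationContext {
  exec: Exec; // hypothetical stand-in for sshExec bound to a single host
}

interface OperationResult {
  name: string;
  changed: boolean;
  message: string;
}

type Operation = (ctx: OperationContext) => Promise<OperationResult>;

// Check before act: run `swapoff -a` only when a swap device is active.
// This covers the runtime half only; persisting the change is out of scope here.
const disableSwap: Operation = async (ctx) => {
  const check = await ctx.exec("swapon --show --noheadings");
  if (check.stdout.trim() === "") {
    return { name: "swap", changed: false, message: "swap already off" };
  }
  const off = await ctx.exec("swapoff -a");
  if (off.exitCode !== 0) {
    throw new Error(`swapoff failed with exit code ${off.exitCode}`);
  }
  return { name: "swap", changed: true, message: "swap disabled" };
};

// Run operations in order; a thrown error stops the sequence.
async function runSequential(
  ctx: OperationContext,
  ops: Operation[],
): Promise<OperationResult[]> {
  const results: OperationResult[] = [];
  for (const op of ops) {
    results.push(await op(ctx));
  }
  return results;
}

// Test double: records every command and answers from a canned table.
// Commands without a canned reply succeed with empty output.
function mockExec(
  replies: Record<string, ExecResult>,
): { exec: Exec; calls: string[] } {
  const calls: string[] = [];
  const exec: Exec = async (command) => {
    calls.push(command);
    return replies[command] ?? { stdout: "", exitCode: 0 };
  };
  return { exec, calls };
}
```

Passing the executor in through the context, rather than importing `sshExec` inside each operation, is one way to keep unit tests free of SSH. A group such as `host-prep` would then be an ordered list of operations handed to `runSequential()`.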