# PRD: Refactor K3s Module from Bash Heredocs to Pulumi TypeScript

## Problem

The k3s install/configure/health module currently generates roughly 300 lines of bash as heredoc strings embedded in three TypeScript files (`install.ts`, `configure.ts`, `health.ts`). These strings are hard to maintain, cannot be unit-tested, and do not compose. This is the same bash-in-code problem that drove the bastion TypeScript rewrite.

## Vision

The lab platform uses Pulumi as its IaC engine:
- **Central execution**: labd runs Pulumi programs in labcontroller k8s for cloud/remote resources with RBAC, global state, and audit trail (PulumiRun table already exists in CockroachDB)
- **Local execution**: lab-agents run Pulumi programs directly on bare-metal nodes
- **Multi-environment**: supports multiple datacenters, clouds (baremetal, AWS, GCP), production/dev/ephemeral environments

## Current State

### Files to replace
- `src/modules/modules/k3s/src/install.ts` — 275 lines, generates bash for 10 install phases
- `src/modules/modules/k3s/src/configure.ts` — 118 lines, generates bash for 5 configure phases
- `src/modules/modules/k3s/src/health.ts` — 57 lines, generates bash for 6 health checks

### Existing infrastructure
- `sshExec(ip, user, command, opts)` and `sshExecStreaming()` — SSH execution primitives in `src/modules/src/ssh.ts`
- Module system: `ModuleRunner`, `ModuleRegistry`, `Module` interface with install/configure/health phases
- `@lab/shared` types: `BastionConfig`, `K3sInstallContext`, roles, OS types
- PulumiRun model in Prisma schema (labd) — tracks Pulumi execution state
- labcontroller module generates k8s manifests (cockroachdb.ts, labd.ts, bastion.ts) — these also need Pulumi migration eventually

### 21 operations currently in bash
**Install phase (10 steps):**
1. Load kernel modules (br_netfilter, overlay, ip_conntrack)
2. Apply CIS sysctl hardening (9 params)
3. Disable swap
4. Disable firewall (firewalld/ufw — mask to survive reboot)
5. Set SELinux permissive
6. Write k3s server config (flannel=none, secrets-encryption, audit, CIS hardened)
7. Write audit policy YAML
8. Clean up stale CNI (flannel.1 vxlan, cilium interfaces, port 8472 conflicts)
9. Install k3s binary (curl | sh)
10. Install Cilium CNI (detect arch, detect interface, kubeProxyReplacement)

**Configure phase (5 steps):**
1. Fix CoreDNS upstream DNS (systemd-resolved 127.0.0.53 unreachable from pod netns)
2. Configure log rotation
3. Check certificate expiry
4. Apply default network policies (deny-ingress, allow-dns-egress)
5. Apply Pod Security Standards (restricted)

**Health checks (6 checks):**
1. k3s service active
2. Node Ready condition
3. API server /healthz
4. Secrets encryption enabled
5. Cilium status
6. kube-system pod status

## Requirements

### Architecture decisions needed (discuss with user via task-master)
1. **Pulumi structure**: micro-stacks vs monorepo-by-env vs component-library vs GitOps operator
2. **Multi-cloud support**: how stacks are organized across baremetal/AWS/GCP
3. **Environment model**: how prod/dev/ephemeral environments are represented
4. **State backend**: Pulumi Cloud vs self-hosted (S3/CockroachDB)
5. **Execution model**: who runs `pulumi up` — labd central, lab-agent local, or both?

### Operation design
- Each operation is a typed TypeScript async function using `sshExec()`
- Standard interface: `OperationContext` in, `OperationResult` out
- **Idempotent**: check before act, report `changed: boolean`
- **Composable**: operations grouped into logical units (host-prep, networking, hardening)
- **Testable**: mock sshExec for unit tests
- **Future Pulumi-ready**: each function maps 1:1 to a `command.remote.Command` resource
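
The operation contract above could look roughly like the following sketch. The field names on `OperationContext` and `OperationResult`, the `ExecResult` shape, and the idea of injecting the executor through the context are all assumptions made for illustration; the only thing taken from the existing code is the `sshExec(ip, user, command, opts)` argument order. Swap is used as the example because its check-before-act logic is short.

```typescript
// Assumed result shape for an SSH call; the real sshExec may return more.
interface ExecResult {
  stdout: string;
  exitCode: number;
}

// Assumed signature, modelled on sshExec(ip, user, command, opts).
type SshExec = (
  ip: string,
  user: string,
  command: string,
  opts?: Record<string, unknown>,
) => Promise<ExecResult>;

// Hypothetical fields: the PRD names these types but does not define them.
interface OperationContext {
  ip: string;
  user: string;
  exec: SshExec; // injected so unit tests can pass a mock
}

interface OperationResult {
  name: string;
  changed: boolean;
  message: string;
}

// Check before act: swapoff only runs when swap is currently active.
// Persisting the change (for example editing /etc/fstab) is left out here.
async function disableSwap(ctx: OperationContext): Promise<OperationResult> {
  const check = await ctx.exec(ctx.ip, ctx.user, "swapon --show --noheadings");
  if (check.stdout.trim() === "") {
    return { name: "swap", changed: false, message: "swap already off" };
  }
  const off = await ctx.exec(ctx.ip, ctx.user, "sudo swapoff -a");
  if (off.exitCode !== 0) {
    throw new Error(`swapoff failed with exit code ${off.exitCode}`);
  }
  return { name: "swap", changed: true, message: "swap disabled" };
}

// Test double: records every command and fakes the swapon output.
function mockExec(swapActive: boolean): { exec: SshExec; calls: string[] } {
  const calls: string[] = [];
  const exec: SshExec = async (_ip, _user, command) => {
    calls.push(command);
    if (command.startsWith("swapon")) {
      return { stdout: swapActive ? "/swapfile file 2G 0B -2\n" : "", exitCode: 0 };
    }
    return { stdout: "", exitCode: 0 };
  };
  return { exec, calls };
}
```

With the executor injected, a unit test can assert both the reported `changed` flag and the exact commands issued, without opening an SSH connection.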

### Groups (logical composition)
- `host-prep`: kernel-modules + sysctl + swap + firewall + selinux
- `k3s-server`: k3s-config + audit-policy + cni-cleanup + k3s-install
- `k3s-agent`: k3s-config (agent) + k3s-install (agent mode)
- `networking`: cilium + dns-fix + network-policy
- `hardening`: pod-security + cert-check + log-rotation
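
A group can then be little more than an ordered list of operations. The sketch below assumes a `runSequential()` helper with a hypothetical signature (the PRD only names it), a reduced `OperationResult`, and stub operations standing in for the real ones; it shows one possible way to roll per-operation results up into a group result.

```typescript
// Reduced, hypothetical result shape for this sketch.
interface OperationResult {
  name: string;
  changed: boolean;
}

// An operation takes some context and reports what it did.
type Operation<C> = (ctx: C) => Promise<OperationResult>;

interface GroupResult {
  name: string;
  changed: boolean; // true if any operation in the group changed the host
  results: OperationResult[];
}

// Runs operations strictly in order. An operation that throws aborts the
// group, so later steps never run against a half-prepared host.
async function runSequential<C>(
  name: string,
  ops: Operation<C>[],
  ctx: C,
): Promise<GroupResult> {
  const results: OperationResult[] = [];
  for (const op of ops) {
    results.push(await op(ctx));
  }
  return { name, changed: results.some((r) => r.changed), results };
}

// Stub factory standing in for the real operations in operations/*.ts.
const stub = (name: string, changed: boolean): Operation<unknown> =>
  async () => ({ name, changed });

// host-prep as listed above: kernel-modules + sysctl + swap + firewall + selinux.
const hostPrep = (ctx: unknown): Promise<GroupResult> =>
  runSequential(
    "host-prep",
    [
      stub("kernel-modules", false),
      stub("sysctl", true),
      stub("swap", false),
      stub("firewall", false),
      stub("selinux", false),
    ],
    ctx,
  );
```

Keeping groups as plain ordered lists leaves room to turn each one into a component later without changing the operations themselves.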

### Pulumi integration (when added)
- Add `@pulumi/pulumi` and `@pulumi/command` as dependencies
- Each operation becomes a `command.remote.Command` resource
- Groups become `pulumi.ComponentResource` classes
- K3sCluster becomes a top-level ComponentResource that composes groups
- Stacks per environment: `lab-baremetal`, `aws-prod`, `dev`, `ephemeral-pr-123`

## File structure

```
src/modules/modules/k3s/src/
├── types.ts              # K3sConfig, OperationContext, OperationResult
├── utils.ts              # sshOpts(), runSequential(), file helpers
├── operations/           # ~15 atomic operations
│   ├── kernel-modules.ts
│   ├── sysctl.ts
│   ├── swap.ts
│   ├── firewall.ts
│   ├── selinux.ts
│   ├── k3s-config.ts
│   ├── audit-policy.ts
│   ├── cni-cleanup.ts
│   ├── k3s-install.ts
│   ├── cilium.ts
│   ├── dns-fix.ts
│   ├── log-rotation.ts
│   ├── network-policy.ts
│   ├── pod-security.ts
│   └── cert-check.ts
├── groups/               # Logical groupings
│   ├── host-prep.ts
│   ├── k3s-server.ts
│   ├── k3s-agent.ts
│   ├── networking.ts
│   └── hardening.ts
├── health/               # Health checks
│   ├── k3s-service.ts
│   ├── node-ready.ts
│   ├── api-health.ts
│   ├── secrets-encryption.ts
│   ├── cilium-status.ts
│   └── pod-status.ts
├── k3s-module.ts         # Module implementation
└── index.ts              # Public exports
```
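
As an illustration of what the files under `health/` might contain, the sketch below implements two of the six checks as typed functions and aggregates them. The `Exec` type (an executor already bound to one host), the `HealthCheckResult` and `HealthReport` shapes, and the function names are assumptions for this example, not existing code.

```typescript
// Assumed shapes; Exec is a simplified executor already bound to one host.
interface ExecResult {
  stdout: string;
  exitCode: number;
}
type Exec = (command: string) => Promise<ExecResult>;

interface HealthCheckResult {
  name: string;
  healthy: boolean;
  detail: string;
}

interface HealthReport {
  healthy: boolean; // true only if every check passed
  checks: HealthCheckResult[];
}

// systemctl is-active prints the unit state and exits 0 only when active.
async function checkK3sService(exec: Exec): Promise<HealthCheckResult> {
  const r = await exec("systemctl is-active k3s");
  const state = r.stdout.trim();
  return {
    name: "k3s-service",
    healthy: r.exitCode === 0 && state === "active",
    detail: state || `exit code ${r.exitCode}`,
  };
}

// The API server answers /healthz with the literal body "ok" when healthy.
async function checkApiHealthz(exec: Exec): Promise<HealthCheckResult> {
  const r = await exec("k3s kubectl get --raw /healthz");
  const body = r.stdout.trim();
  return {
    name: "api-health",
    healthy: r.exitCode === 0 && body === "ok",
    detail: body || `exit code ${r.exitCode}`,
  };
}

// Checks are independent reads, so they can run concurrently.
async function runHealthChecks(exec: Exec): Promise<HealthReport> {
  const checks = await Promise.all([checkK3sService(exec), checkApiHealthz(exec)]);
  return { healthy: checks.every((c) => c.healthy), checks };
}
```

Returning a structured result per check, rather than a single pass/fail, lets `labctl app k3s health` show which check failed and why.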

## Success criteria
- Zero bash heredoc strings in the k3s module
- Every operation independently testable with mocked sshExec
- `labctl app k3s install <target>` works end-to-end
- `labctl app k3s health` works end-to-end
- Existing test suite passes (updated for new API)
- Clear path to wrapping operations as Pulumi resources