feat: install logging, error trapping, PXE/ISO integration tests
Some checks failed
CI/CD / lint (pull_request) Failing after 13s
CI/CD / test (pull_request) Failing after 10s
CI/CD / typecheck (pull_request) Failing after 36s
CI/CD / build (pull_request) Has been skipped
CI/CD / publish-rpm (pull_request) Has been skipped
CI/CD / publish-deb (pull_request) Has been skipped
Kickstart installs on real hardware failed silently: no error reporting, only 3 progress callbacks, zero log streaming. This overhaul makes every install fully observable.

Kickstart improvements:
- Error trapping in %pre and %post (trap ERR sends failure details to bastion)
- 12+ granular progress stages (was 3): SSH, hostname, k3s prep, EFI boot, metadata
- Background log streamer: tails %post output and batch-sends to /api/log
- bastion_log() function for explicit log lines from kickstart scripts

Bastion API:
- POST /api/log: receives raw log lines from kickstart (single or batch)
- InstallLogBuffer: per-MAC ring buffer (2000 lines) + file persistence
- GET /api/logs/:mac: now returns log_lines + log_total alongside stages
- SSE /api/logs/:mac/follow: uses named events (event: stage vs event: log)
- Progress events forwarded to labd via bastion-progress WebSocket message
- Post-provision k3s logs routed through progressBus (was console-only)

dnsmasq fixes found during VM testing:
- HTTP Boot filename: ipxe-real.efi → ipxe.efi (leftover from old 2-stage approach)
- pxe-service directives: only in proxy mode (breaks OVMF PXE in full mode)
- PXEClient vendor class echo for UEFI firmware compatibility

Integration tests:
- PXE boot test: blank UEFI VM → dnsmasq → HTTP Boot → iPXE → bastion → install
- ISO boot test: blank VM boots from bastion-generated ISO → same flow
- Shared helpers: pxe-network (no DHCP, nftables fix), pxe-vm (UEFI + ISO boot)
- test-provision.sh: runs both PXE + ISO tests with prerequisite checks
- 250GB sparse QCOW2 disk (LVM layout needs ~204GB)

201 unit tests passing (11 new).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
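
To illustrate the per-MAC buffer mentioned under "Bastion API", here is a minimal in-memory sketch. Only the class name `InstallLogBuffer`, the 2000-line limit, and the `log_lines` / `log_total` field names come from the commit message; the method names `append` and `snapshot` and all internals are assumptions, and file persistence is left out. It is a bounded buffer that drops the oldest lines, which may differ from how the actual ring buffer is implemented.

```typescript
// Sketch only: method names and internals are assumptions, not the PR's code.
// File persistence, which the commit message also mentions, is omitted.
class InstallLogBuffer {
  private readonly buffers = new Map<string, { lines: string[]; total: number }>();
  private readonly capacity: number;

  constructor(capacity = 2000) {
    this.capacity = capacity;
  }

  // Accepts a single line or a batch, mirroring what POST /api/log receives.
  append(mac: string, input: string | string[]): void {
    const key = mac.toLowerCase();
    const entry = this.buffers.get(key) ?? { lines: [], total: 0 };
    const batch = Array.isArray(input) ? input : [input];
    for (const line of batch) {
      entry.lines.push(line);
      entry.total += 1;
    }
    // Drop the oldest lines once the buffer exceeds its capacity.
    if (entry.lines.length > this.capacity) {
      entry.lines.splice(0, entry.lines.length - this.capacity);
    }
    this.buffers.set(key, entry);
  }

  // Returns the retained lines plus the count of all lines ever received.
  snapshot(mac: string): { log_lines: string[]; log_total: number } {
    const entry = this.buffers.get(mac.toLowerCase());
    return {
      log_lines: entry ? [...entry.lines] : [],
      log_total: entry ? entry.total : 0,
    };
  }
}
```

Keeping `log_total` separate from the retained lines lets a client notice that older lines were dropped.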
132
bastion/.taskmaster/docs/pulumi-k3s-refactor.md
Normal file
@@ -0,0 +1,132 @@
# PRD: Refactor K3s Module from Bash Heredocs to Pulumi TypeScript

## Problem

The k3s install/configure/health module currently generates ~300 lines of bash heredoc strings embedded in TypeScript files (`install.ts`, `configure.ts`, `health.ts`). These are unmaintainable, untestable, and impossible to compose. This is the same bash-in-code problem that drove the bastion TypeScript rewrite.

## Vision

The lab platform uses Pulumi as its IaC engine:
- **Central execution**: labd runs Pulumi programs in labcontroller k8s for cloud/remote resources with RBAC, global state, and audit trail (PulumiRun table already exists in CockroachDB)
- **Local execution**: lab-agents run Pulumi programs directly on bare-metal nodes
- **Multi-environment**: supports multiple datacenters, clouds (baremetal, AWS, GCP), production/dev/ephemeral environments

## Current State

### Files to replace
- `src/modules/modules/k3s/src/install.ts`: 275 lines, generates bash for 10 install phases
- `src/modules/modules/k3s/src/configure.ts`: 118 lines, generates bash for 5 configure phases
- `src/modules/modules/k3s/src/health.ts`: 57 lines, generates bash for 6 health checks

### Existing infrastructure
- `sshExec(ip, user, command, opts)` and `sshExecStreaming()`: SSH execution primitives in `src/modules/src/ssh.ts`
- Module system: `ModuleRunner`, `ModuleRegistry`, `Module` interface with install/configure/health phases
- `@lab/shared` types: `BastionConfig`, `K3sInstallContext`, roles, OS types
- PulumiRun model in Prisma schema (labd): tracks Pulumi execution state
- labcontroller module generates k8s manifests (cockroachdb.ts, labd.ts, bastion.ts); these also need Pulumi migration eventually

### 21 distinct operations currently in bash
**Install phase (10 steps):**
1. Load kernel modules (br_netfilter, overlay, ip_conntrack)
2. Apply CIS sysctl hardening (9 params)
3. Disable swap
4. Disable firewall (firewalld/ufw; mask to survive reboot)
5. Set SELinux permissive
6. Write k3s server config (flannel=none, secrets-encryption, audit, CIS hardened)
7. Write audit policy YAML
8. Clean up stale CNI (flannel.1 vxlan, cilium interfaces, port 8472 conflicts)
9. Install k3s binary (curl | sh)
10. Install Cilium CNI (detect arch, detect interface, kubeProxyReplacement)

**Configure phase (5 steps):**
1. Fix CoreDNS upstream DNS (systemd-resolved 127.0.0.53 unreachable from pod netns)
2. Configure log rotation
3. Check certificate expiry
4. Apply default network policies (deny-ingress, allow-dns-egress)
5. Apply Pod Security Standards (restricted)

**Health checks (6 checks):**
1. k3s service active
2. Node Ready condition
3. API server /healthz
4. Secrets encryption enabled
5. Cilium status
6. kube-system pod status
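
As a rough illustration of how the health checks above might look once they are no longer bash heredocs, the sketch below models two of them as small typed functions. The names `HealthCheck`, `Exec`, `ExecResult`, and `runHealthChecks` are hypothetical, and the `{ exitCode, stdout }` result shape is an assumption, since the PRD does not show what `sshExec` returns.

```typescript
// Assumed shape of a command runner; the real sshExec return type is not shown in the PRD.
type ExecResult = { exitCode: number; stdout: string };
type Exec = (command: string) => Promise<ExecResult>;

// Hypothetical interface: one named check that reports pass or fail.
interface HealthCheck {
  name: string;
  run(exec: Exec): Promise<boolean>;
}

// Check 1: the k3s systemd unit reports "active".
const k3sServiceActive: HealthCheck = {
  name: "k3s-service",
  async run(exec) {
    const r = await exec("systemctl is-active k3s");
    return r.exitCode === 0 && r.stdout.trim() === "active";
  },
};

// Check 3: the API server answers "ok" on /healthz.
const apiHealthz: HealthCheck = {
  name: "api-health",
  async run(exec) {
    const r = await exec("k3s kubectl get --raw /healthz");
    return r.exitCode === 0 && r.stdout.trim() === "ok";
  },
};

// Runs every check, even after a failure, so the report is complete.
async function runHealthChecks(checks: HealthCheck[], exec: Exec) {
  const results: { name: string; ok: boolean }[] = [];
  for (const check of checks) {
    let ok = false;
    try {
      ok = await check.run(exec);
    } catch {
      ok = false; // a thrown error counts as a failed check
    }
    results.push({ name: check.name, ok });
  }
  return { healthy: results.every((r) => r.ok), results };
}
```

Passing `exec` in as a parameter means a unit test can supply canned output instead of opening an SSH connection.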

## Requirements

### Architecture decisions needed (discuss with user via task-master)
1. **Pulumi structure**: micro-stacks vs monorepo-by-env vs component-library vs GitOps operator
2. **Multi-cloud support**: how stacks are organized across baremetal/AWS/GCP
3. **Environment model**: how prod/dev/ephemeral environments are represented
4. **State backend**: Pulumi Cloud vs self-hosted (S3/CockroachDB)
5. **Execution model**: who runs `pulumi up` (labd central, lab-agent local, or both)?

### Operation design
- Each operation is a typed TypeScript async function using `sshExec()`
- Standard interface: `OperationContext` in, `OperationResult` out
- **Idempotent**: check before act, report `changed: boolean`
- **Composable**: operations grouped into logical units (host-prep, networking, hardening)
- **Testable**: mock sshExec for unit tests
- **Future Pulumi-ready**: each function maps 1:1 to a `remote.Command` resource
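
The operation design above can be sketched with a single operation. The PRD names `OperationContext` and `OperationResult` but not their fields, so every field below is an assumption, as is the `SshResult` shape and the choice to inject `sshExec` through the context. The example covers only the runtime part of "disable swap" and does not edit `/etc/fstab`.

```typescript
// Hypothetical shapes: the PRD names these types but does not define their fields.
type SshResult = { exitCode: number; stdout: string; stderr: string };
type SshExec = (
  ip: string,
  user: string,
  command: string,
  opts?: Record<string, unknown>,
) => Promise<SshResult>;

interface OperationContext {
  ip: string;
  user: string;
  sshExec: SshExec; // injected so unit tests can pass a mock
}

interface OperationResult {
  name: string;
  changed: boolean;
  message: string;
}

// Disable swap: check first, act only when swap is currently active.
async function disableSwap(ctx: OperationContext): Promise<OperationResult> {
  const run = (command: string) => ctx.sshExec(ctx.ip, ctx.user, command);

  // Empty output from swapon means no swap devices are active.
  const check = await run("swapon --show --noheadings");
  if (check.exitCode !== 0) {
    throw new Error(`swap check failed: ${check.stderr}`);
  }
  if (check.stdout.trim() === "") {
    return { name: "swap", changed: false, message: "swap already off" };
  }

  const act = await run("swapoff -a");
  if (act.exitCode !== 0) {
    throw new Error(`swapoff failed: ${act.stderr}`);
  }
  return { name: "swap", changed: true, message: "swap disabled" };
}
```

Because the check runs before the action, a second run against the same host reports `changed: false` and issues no `swapoff`.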

### Groups (logical composition)
- `host-prep`: kernel-modules + sysctl + swap + firewall + selinux
- `k3s-server`: k3s-config + audit-policy + cni-cleanup + k3s-install
- `k3s-agent`: k3s-config (agent) + k3s-install (agent mode)
- `networking`: cilium + dns-fix + network-policy
- `hardening`: pod-security + cert-check + log-rotation
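
One possible way to express these groups is to treat a group as an operation built from other operations. The sketch below assumes a `runSequential()` helper with the signature shown (the PRD lists the name under `utils.ts` but not its signature) and uses placeholder operations in place of the real ones; `group`, `stub`, and the simplified `OperationContext` / `OperationResult` fields are hypothetical.

```typescript
// Minimal stand-ins; field names are assumptions made for this sketch.
interface OperationContext {
  host: string;
}
interface OperationResult {
  name: string;
  changed: boolean;
}
type Operation = (ctx: OperationContext) => Promise<OperationResult>;

// Runs operations in order; a thrown error stops the sequence.
async function runSequential(
  ctx: OperationContext,
  ops: Operation[],
): Promise<OperationResult[]> {
  const results: OperationResult[] = [];
  for (const op of ops) {
    results.push(await op(ctx));
  }
  return results;
}

// A group is itself an operation, so groups compose the same way operations do.
function group(name: string, ops: Operation[]): Operation {
  return async (ctx) => {
    const results = await runSequential(ctx, ops);
    return { name, changed: results.some((r) => r.changed) };
  };
}

// Placeholder operations standing in for the real kernel-modules, sysctl, etc.
const stub = (name: string, changed: boolean): Operation =>
  async () => ({ name, changed });

const hostPrep = group("host-prep", [
  stub("kernel-modules", false),
  stub("sysctl", true),
  stub("swap", false),
  stub("firewall", false),
  stub("selinux", false),
]);
```

With this shape, a group reports `changed: true` if any member changed something, which keeps the idempotency signal intact at every level.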

### Pulumi integration (when added)
- Add `@pulumi/pulumi` and `@pulumi/command` as dependencies
- Each operation becomes a `command.remote.Command` resource
- Groups become `pulumi.ComponentResource` classes
- K3sCluster becomes a top-level ComponentResource that composes groups
- Stacks per environment: `lab-baremetal`, `aws-prod`, `dev`, `ephemeral-pr-123`
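
To make the "operation becomes a `command.remote.Command`" step more concrete, the sketch below builds the argument object such a resource might receive, as plain data and without importing the Pulumi SDK. The `connection`, `create`, and `triggers` field names follow the `remote.Command` arguments in `@pulumi/command`; `OperationSpec`, `toRemoteCommandArgs`, and the example host address are hypothetical.

```typescript
// Plain-data sketch: no Pulumi import, so it runs without the SDK installed.
// Only a subset of the remote.Command arguments is modelled here.
interface RemoteCommandArgs {
  connection: { host: string; user: string };
  create: string;
  triggers?: string[];
}

// Hypothetical description of one operation as a named shell command.
interface OperationSpec {
  name: string;
  command: string;
}

function toRemoteCommandArgs(
  op: OperationSpec,
  host: string,
  user: string,
): RemoteCommandArgs {
  return {
    connection: { host, user },
    create: op.command,
    // Re-run the command whenever its text changes.
    triggers: [op.command],
  };
}

// 192.0.2.10 is a documentation-only address used for illustration.
const swapArgs = toRemoteCommandArgs(
  { name: "swap", command: "swapoff -a" },
  "192.0.2.10",
  "root",
);
```

With the SDK added, an object like `swapArgs` would presumably be passed as the second argument to `new command.remote.Command(name, args, opts)` inside a `pulumi.ComponentResource`, though the exact wiring depends on the architecture decisions listed earlier.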

## File structure

```
src/modules/modules/k3s/src/
├── types.ts              # K3sConfig, OperationContext, OperationResult
├── utils.ts              # sshOpts(), runSequential(), file helpers
├── operations/           # ~15 atomic operations
│   ├── kernel-modules.ts
│   ├── sysctl.ts
│   ├── swap.ts
│   ├── firewall.ts
│   ├── selinux.ts
│   ├── k3s-config.ts
│   ├── audit-policy.ts
│   ├── cni-cleanup.ts
│   ├── k3s-install.ts
│   ├── cilium.ts
│   ├── dns-fix.ts
│   ├── log-rotation.ts
│   ├── network-policy.ts
│   ├── pod-security.ts
│   └── cert-check.ts
├── groups/               # Logical groupings
│   ├── host-prep.ts
│   ├── k3s-server.ts
│   ├── k3s-agent.ts
│   ├── networking.ts
│   └── hardening.ts
├── health/               # Health checks
│   ├── k3s-service.ts
│   ├── node-ready.ts
│   ├── api-health.ts
│   ├── secrets-encryption.ts
│   ├── cilium-status.ts
│   └── pod-status.ts
├── k3s-module.ts         # Module implementation
└── index.ts              # Public exports
```

## Success criteria
- Zero bash heredoc strings in the k3s module
- Every operation independently testable with mocked sshExec
- `labctl app k3s install <target>` works end-to-end
- `labctl app k3s health` works end-to-end
- Existing test suite passes (updated for new API)
- Clear path to wrapping operations as Pulumi resources