Commit Graph

34 Commits

Author SHA1 Message Date
Michal
c0fb1310cb fix: re-enable logging --host (removed invalid --level flag)
ksvalidator caught the issue: --level=info is not valid for F43.
Correct syntax is just: logging --host=<ip> --port=<port>

Also added ksvalidator syntax check to unit tests — validates
rendered kickstart for all roles (vanilla, worker, infra) against
F43 pykickstart. This catches kickstart syntax errors at test time
instead of during a 12-minute VM install.

Integration test passes: 21/22 (1 skipped: log lines capture).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-29 00:45:11 +00:00
Michal
48b2230665 fix: disable logging --host (breaks Anaconda), add integration config
The kickstart `logging --host` directive stalls Anaconda install —
likely firewall blocks UDP syslog or Fedora 43 Anaconda has issues
with it. Commented out for now. Syslog listener infrastructure is
in place and ready once we resolve the Anaconda/firewall issue.

Added vitest.integration.config.ts for running integration tests:
  pnpm exec vitest run --config vitest.integration.config.ts

All 21 integration tests pass, serial console rsyslog forwarding works.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-29 00:19:48 +00:00
Michal
3dc1317301 feat: Anaconda syslog logging, serial console forwarding, protocol types
- Add UDP syslog listener (port 5514) for receiving Anaconda install logs
  via native `logging --host` kickstart directive — no background processes
- Add rsyslog serial console forwarding in %post (AWS EC2 compatible ttyS0@115200n8)
- Add ProvisionStackType ("dhcpproxy" | "iso" | "cloud-init") to shared types
- Add bastion-install-log WebSocket protocol message for bastion→labd log sync
- Add syslogPort to BastionConfig (default 5514)
- Wire syslog listener into bastion startup/shutdown lifecycle

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-28 23:14:10 +00:00
Michal
cac7514014 feat: admin user 'lab' with SSH key auth (Step 7 — PASS)
Changed admin user from 'michal' to generic 'lab' user.
SSH key auth works for both root and lab user.
21/22 tests pass (1 skipped: log lines, needs log streamer redesign).

Bisection complete — all features work except background log streamer
which prevents Anaconda from syncing filesystem writes.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-28 22:30:59 +00:00
Michal
25a2beccff fix: add error trap, bastion helpers, serial console (Steps 2-5 pass)
Bisection results:
- Step 2: bastion_log/bastion_error helpers — PASS
- Step 3: ERR trap in %post — PASS
- Step 4: background log streamer — FAIL (breaks boot, NOT included)
- Step 5: serial console on ttyS0 — PASS

The background log streamer (tail -f subprocess in %post) prevents
Anaconda from properly syncing the installed filesystem. This was
the root cause of all boot failures. Will need a different approach
for real-time log streaming.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-28 22:17:47 +00:00
Michal
2a1a29c03b fix: revert kickstart to near-original baseline (Step 0 — boots clean)
Reverted install.ks.ts to near-original state from commit 64533b2.
This is the bisection baseline — 21/22 integration tests pass,
0 failed systemd services, SSH works, /boot/efi mounts.

Removed all accumulated fixes that collectively broke boot:
- ERR trap, background log streamer, bastion_log/bastion_error
- depmod rebuild, nofail on /boot/efi, SELinux autorelabel
- chcon/restorecon for /etc /var /root
- kernel-modules and dosfstools packages

Kept from current branch:
- rootpw --plaintext lab-root-pw (console debug access)
- Network-first boot order (bastion controls boot)
- Vanilla role support, rancher partition support
- Boot screenshots during SSH wait (1/sec rolling buffer)
- Test runner script (run-pxe-test.sh)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-28 20:47:34 +00:00
Michal
a664074fa3 wip: save current ks debugging state before bisect revert
All accumulated changes to kickstart template, test infrastructure,
and dnsmasq config. None of these produce a clean boot yet — saving
state before reverting to baseline for bisection.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-28 20:24:14 +00:00
Michal
cc289c0f94 feat: serial console on test VMs for debugging without SSH
Some checks failed
CI/CD / typecheck (pull_request) Failing after 9s
CI/CD / test (pull_request) Failing after 9s
CI/CD / lint (pull_request) Failing after 21s
CI/CD / build (pull_request) Has been skipped
CI/CD / publish-rpm (pull_request) Has been skipped
CI/CD / publish-deb (pull_request) Has been skipped
- VMs get serial console on TCP (PXE: port 4555, ISO: port 4556)
- serialExec() helper: runs commands via telnet when SSH/network is down
- PXE test: on SSH failure, dumps hostname, IP, NetworkManager, sshd,
  failed units, and fstab via serial console before failing
- Kickstart enables serial-getty@ttyS0 for auto-login on serial

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-27 15:22:43 +00:00
Michal
ea7e437241 fix: network-first boot order, OVMF dispatch chain working
Some checks failed
CI/CD / typecheck (pull_request) Failing after 13s
CI/CD / lint (pull_request) Failing after 23s
CI/CD / test (pull_request) Failing after 7m0s
CI/CD / build (pull_request) Has been skipped
CI/CD / publish-rpm (pull_request) Has been skipped
CI/CD / publish-deb (pull_request) Has been skipped
- Kickstart %post now restores network-first EFI boot order (undoes
  Anaconda's disk-first default). Grep pattern includes HTTP boot entries.
- Test force-restarts VM after install so OVMF rereads NVRAM.
- VM successfully network-boots after install, hits /dispatch, bastion
  returns exit (local boot). Confirmed in test logs.
- nofail on /boot/efi fstab entry prevents emergency mode.
- Remaining: Fedora disk boot after iPXE exit may still fail.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-27 14:35:33 +00:00
Michal
7446d669c1 feat: ARM ISO boot integration test, OVMF boot fixes
Some checks failed
CI/CD / typecheck (pull_request) Failing after 9s
CI/CD / test (pull_request) Failing after 10s
CI/CD / lint (pull_request) Failing after 22s
CI/CD / build (pull_request) Has been skipped
CI/CD / publish-rpm (pull_request) Has been skipped
CI/CD / publish-deb (pull_request) Has been skipped
ARM integration test:
- arm-iso-provision.test.ts: aarch64 VM boots from bastion-generated ISO
- Uses QEMU aarch64 emulation (slow but validates the R1 scenario)
- Generous timeouts for emulated boot (15min discovery, 60min install)
- test-provision.sh updated: `sudo ./scripts/test-provision.sh arm`

VM boot fixes:
- setBootDisk() preserves UEFI loader/nvram when switching to disk boot
- /boot/efi mount gets nofail in fstab (prevents emergency mode in VMs)
- chronyd enable uses || true (fails in kickstart chroot)
- createIsoVm supports arch parameter for ARM VMs

Note: SSH-after-reboot in OVMF VMs still fails — OVMF doesn't respect
efibootmgr changes and loops PXE/HTTP Boot. Real hardware works fine.
The install flow itself (discovery → kickstart → complete) is validated.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-27 00:26:12 +00:00
Michal
46b017d77e feat: install logging, error trapping, PXE/ISO integration tests
Some checks failed
CI/CD / lint (pull_request) Failing after 13s
CI/CD / test (pull_request) Failing after 10s
CI/CD / typecheck (pull_request) Failing after 36s
CI/CD / build (pull_request) Has been skipped
CI/CD / publish-rpm (pull_request) Has been skipped
CI/CD / publish-deb (pull_request) Has been skipped
Kickstart installs on real hardware failed silently — no error reporting,
only 3 progress callbacks, zero log streaming. This overhaul makes every
install fully observable.

Kickstart improvements:
- Error trapping in %pre and %post (trap ERR sends failure details to bastion)
- 12+ granular progress stages (was 3): SSH, hostname, k3s prep, EFI boot, metadata
- Background log streamer: tails %post output and batch-sends to /api/log
- bastion_log() function for explicit log lines from kickstart scripts

Bastion API:
- POST /api/log — receives raw log lines from kickstart (single or batch)
- InstallLogBuffer — per-MAC ring buffer (2000 lines) + file persistence
- GET /api/logs/:mac — now returns log_lines + log_total alongside stages
- SSE /api/logs/:mac/follow — uses named events (event: stage vs event: log)
- Progress events forwarded to labd via bastion-progress WebSocket message
- Post-provision k3s logs routed through progressBus (was console-only)

dnsmasq fixes found during VM testing:
- HTTP Boot filename: ipxe-real.efi → ipxe.efi (leftover from old 2-stage approach)
- pxe-service directives: only in proxy mode (breaks OVMF PXE in full mode)
- PXEClient vendor class echo for UEFI firmware compatibility

Integration tests:
- PXE boot test: blank UEFI VM → dnsmasq → HTTP Boot → iPXE → bastion → install
- ISO boot test: blank VM boots from bastion-generated ISO → same flow
- Shared helpers: pxe-network (no DHCP, nftables fix), pxe-vm (UEFI + ISO boot)
- test-provision.sh: runs both PXE + ISO tests with prerequisite checks
- 250GB sparse QCOW2 disk (LVM layout needs ~204GB)

201 unit tests passing (11 new).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-26 22:26:33 +00:00
Michal
ffc4a782d2 docs: comprehensive PRD for taskmaster — labctl platform
Full product requirements covering: architecture, CLI commands,
partition layout, modules, testing strategy, cloud model, app model,
implementation phases, tech stack, and lessons from mcpctl.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-18 00:23:24 +00:00
Michal
44f1ebb843 feat: scaffold labd — master daemon with CockroachDB + Prisma
New @lab/labd workspace package:
- Fastify HTTP server + WebSocket for agent connections
- Prisma schema (CockroachDB): Server, Agent, User, Role, Permission,
  UserRole, JoinToken, AuditLog, PulumiRun, Cluster models
- Health endpoint with DB connectivity check
- Server listing with cloud/env/status filters
- Auth routes: agent enrollment, join token management
- Placeholder mTLS auth middleware
- Dev stack: CockroachDB single-node in docker-compose
- 32 tests passing (2 new for labd health)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-18 00:13:16 +00:00
Michal
897844fae0 refactor: rename CLI binary from lab to labctl
Updated everywhere: constants, package.json bin, completions,
nfpm packaging, build scripts, CI, banner text. Binary is now
/usr/bin/labctl. Internal package names (@lab/*) unchanged.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-18 00:07:17 +00:00
Michal
e3a1460593 docs: README with full command reference + platform design
Complete CLI reference for labctl covering:
- Bastion (PXE provisioning) — implemented
- Provisioning (install/reprovision/forget) — implemented
- Server management (get/describe) — planned
- Remote execution (exec) — planned
- Kubernetes proxy (kubectl) — planned
- Resource-scoped logs (server/app/pulumi/audit) — planned
- Apps (Pulumi charts replacing Helm) — planned
- RBAC (roles/permissions/deny rules) — planned
- Infrastructure as Code (apply/plan/destroy) — planned

Plus: partition layout, architecture diagram, tech stack, dev setup.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-18 00:02:19 +00:00
Michal
dbbdf5f971 docs: lab platform design — labd, agent, RBAC, multi-cloud, testing strategy
Comprehensive design document covering:
- labd master daemon with CA, RBAC, Pulumi executor
- lab-agent with mTLS enrollment, heartbeat, log shipping
- Module system (built-in + external repos)
- Cloud/environment model (baremetal + AWS)
- Ephemeral test environments (containers, VMs, cloud)
- Security test patterns for RBAC
- Health gates for deployment promotion
- Database strategy: PostgreSQL now, CockroachDB later
- Networking: Tailscale mesh + Cilium CNI

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-17 23:46:29 +00:00
Michal
7cfd8fe1b8 feat: daemonize bastion start, fix status for root-owned processes
- `lab init bastion standalone start` now runs in background by default
- `--foreground` flag for running in foreground (debugging/containers)
- Shows startup output then detaches with PID + log path
- Status command uses /proc check instead of kill -0 (works cross-user)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-17 22:38:46 +00:00
Michal
4d2e8677d4 fix: PID file permission handling + root check
- Require root when dnsmasq is needed (clear error message)
- Handle stale PID files owned by different user (remove + recreate)
- Create bastion dir with 755 permissions
- 3 new PID file tests (30 total)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-17 22:31:17 +00:00
Michal
52e1932bde feat: multi-architecture builds (x86_64 + arm64)
- build-rpm.sh: --arch flag for targeting x86_64 or arm64, --all for both
  Uses bun cross-compile with --target=bun-linux-x64/arm64
- build-bastion.sh: --arch flag for Docker platform targeting
- release.sh: builds both architectures by default
- CI: builds + publishes RPM/DEB for both architectures

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-17 22:02:52 +00:00
Michal
86cd961ee4 feat: release pipeline, k3s manifests, infra k3s bootstrap
- scripts/release.sh: full release orchestration (build, publish, install)
- deploy/k3s/: Deployment, ConfigMap, PVC, Namespace with kustomize
  hostNetwork for dnsmasq, NET_ADMIN caps, local-path PVC
- Infra role gets /var/lib/rancher partition (20GB, preserved on reprovision)
  for k3s etcd data persistence across reinstalls
- Infra %post installs k3s server (INSTALL_K3S_SKIP_START=true)
- 5 new kickstart tests (27 total)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-17 21:56:39 +00:00
Michal
ed1df8a77c feat: ESLint, shell completions, Docker, nfpm packaging, CI/CD
- ESLint with typescript-eslint + prettier (eslint.config.js)
- Shell completions for bash and fish (scripts/generate-completions.ts)
- Multi-stage Dockerfile for bastion (fedora:43 + dnsmasq + node)
- nfpm.yaml for RPM/DEB packaging with bun-compiled binary
- Build scripts: build-rpm.sh, build-bastion.sh, publish-rpm/deb.sh
- Gitea Actions CI/CD: lint, typecheck, test, build, publish

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-17 21:51:01 +00:00
Michal
520af41a52 feat: colorful progress output with icons and SSH command
Progress callbacks from kickstart now show:
  ◆ 78:55:36:08:35:14  partitioning -- preparing disk layout
  ◆◆◆ 78:55:36:08:35:14  post-install -- configuring system
  ✔ 78:55:36:08:35:14  complete -- ready at 10.0.1.88

    ssh michal@10.0.1.88

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-17 12:04:52 +00:00
Michal
9803817004 fix: reprovision SSH reboot is expected to close connection
The SSH connection closing during reboot is normal, not an error.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-17 11:45:16 +00:00
Michal
db26c5ecb1 docs: architecture design document
ARCHITECTURE.md covering: CLI structure (lab init/provision), project layout,
boot flow, partition scheme, container architecture, CI/CD pipeline,
state management, and technology stack.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-17 11:42:16 +00:00
Michal
d01b675cca feat: firewall management + reprovision SSH key fix
- Open firewall ports (dhcp, tftp, http, 4011) on bastion start
- Close firewall ports on bastion shutdown
- Auto-detect firewall zone for interface
- Fix reprovision SSH to use execFileSync with explicit key path

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-17 11:39:57 +00:00
Michal
62f896593d feat: CLI subcommands, PID self-restart, unit tests (22 passing)
CLI restructured:
  lab init bastion standalone start/stop/status
  lab provision list/install/reprovision/forget

- Nested commander subcommand groups (init > bastion > standalone, provision)
- PID file management: auto-kills old bastion on start, cleans up on stop
- stop command reads PID file and sends SIGTERM
- status command shows running state, port, machine counts
- forget command (DELETE /api/machines/:mac) removes from all state

Unit tests (22 tests, 3 files):
- kickstart.test.ts: worker/infra roles, SSH keys, partitions, admin user
- state.test.ts: load/save, atomic writes
- dispatch.test.ts: install/discover/local-boot routing, progress, forget

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-17 11:12:17 +00:00
Michal
64533b2dcf refactor: restructure bastion as pnpm monorepo (@lab/shared, @lab/bastion, @lab/cli)
- Split into 3 workspace packages: shared (types/constants), bastion (server), cli
- CLI binary renamed from "bastion" to "lab"
- Cross-package imports via @lab/shared and @lab/bastion workspace references
- Extracted BastionConfig, BastionState, HardwareInfo types into @lab/shared
- Added APP_NAME/APP_VERSION constants
- tsconfig.base.json with project references for build ordering
- Root workspace scripts: build, test, typecheck, clean

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-17 11:05:41 +00:00
Michal
937c01f5d9 fix: add --skip-dnsmasq/--skip-artifacts flags, fix config propagation
Enables running the TS bastion without dnsmasq for testing.
VM-tested: SSH works, partitions correct, k3s prereqs configured.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-17 03:11:29 +00:00
Michal
177e993736 feat: TypeScript bastion rewrite (initial scaffold)
Full rewrite of the bash bastion.sh into a TypeScript application:
- Fastify HTTP server with typed routes (dispatch, kickstart, API)
- Commander CLI (serve, install, list, reprovision)
- Kickstart templates as TypeScript template literals (no more heredoc hell)
- dnsmasq management via execa subprocess
- Merged machine list view (hardware + install info in one table)
- Containerized via podman-compose (Dockerfile + docker-compose.yml)
- All partition logic preserved (LVM, reprovision detection, role-based)

Not yet tested end-to-end — needs VM validation before replacing bash version.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-17 02:55:52 +00:00
Michal
fac14b6d4a feat: server kickstart with LVM, user creation, progress callbacks, reprovision
- LVM partition layout: /, /var, /var/log, /home, /srv, swap, tmpfs /tmp
  plus /var/lib/longhorn for worker role (grows to fill disk)
- Reprovision preserves /home, /srv, /var/lib/longhorn via %pre detection
- Admin user created matching the user running the bastion script
  with SSH keys from authorized_keys + local pubkeys, passwordless sudo
- Progress callbacks from %pre and %post to /api/progress endpoint
  with IP reported on completion (ssh command printed)
- Installed machines boot from local disk (iPXE exit) instead of
  re-entering discovery mode
- --role worker|infra flag (infra skips longhorn partition)
- reprovision subcommand: queues install + SSH reboot into PXE
- Self-cleanup: kills old bastion instances on start
- Domain config (DOMAIN env, default ad.itaz.eu)
- efibootmgr in %post to set local disk first in boot order
- k3s prereqs: kernel modules, sysctl, firewalld disabled, chrony
- VM reprovision test script (test-reprovision.sh)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-17 02:40:40 +00:00
Michal
75d17eb87c fix: HTTP Content-Length, firewall zones, UEFI boot improvements
- Fix Content-Length using byte count instead of character count
  (em dash in iPXE scripts caused mismatch, breaking iPXE chain)
- Use firewall zone-aware commands matching interface zone
- Add UEFI HTTP Boot support (arch 16/20) alongside PXE TFTP
- Add pxe-service directives for proper proxy DHCP responses
- Use bind-dynamic instead of bind-interfaces for bridge compat
- Add tftp-no-blocksize for UEFI firmware compatibility
- Use local ipxe packages instead of downloading from internet
- Add custom UEFI PXE loader stub (pxeloader.c) for chainloading
- Enable HTTP request logging for debugging boot issues

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-17 00:59:27 +00:00
Michal Rydlikowski
2a429088c5 bastion: discover-first PXE provisioning with multi-arch support
Rewrote bastion from install-only to discover-first flow:
- Default mode discovers hardware (PXE boot → inventory → poweroff)
- Discovered machines promoted to install via subcommand
- Per-MAC iPXE dispatch (/dispatch?mac=) routes discover vs install
- Python HTTP server with discovery API, state management, kickstart gen
- Added full DHCP mode (DHCP_MODE=full) for isolated/test networks
- Added arm64 UEFI support (client-arch 11, iPXE arm64 binary)
- Added QEMU test script (aarch64+KVM on Asahi Linux)
- All API endpoints unit tested and working

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-16 00:06:04 +00:00
Michal Rydlikowski
5ba22b94ea first commit 2026-03-16 00:00:13 +00:00
Michal Rydlikowski
ac695f506f first commit 2026-03-15 23:50:43 +00:00