Commit Graph

56 Commits

Author SHA1 Message Date
a0f6161533 Merge pull request 'docs: PXE boot debugging post-mortem' (#5) from docs/pxe-boot-debugging into main
Some checks failed
CI/CD / lint (push) Failing after 21s
CI/CD / typecheck (push) Failing after 22s
CI/CD / test (push) Failing after 22s
CI/CD / build (push) Has been skipped
CI/CD / publish-rpm (push) Has been skipped
CI/CD / publish-deb (push) Has been skipped
2026-03-30 03:01:12 +00:00
Michal
87c1a34232 docs: PXE boot debugging post-mortem — serial console root cause
Some checks failed
CI/CD / lint (pull_request) Failing after 10s
CI/CD / typecheck (pull_request) Failing after 23s
CI/CD / test (pull_request) Failing after 7m4s
CI/CD / build (pull_request) Has been skipped
CI/CD / publish-rpm (pull_request) Has been skipped
CI/CD / publish-deb (pull_request) Has been skipped
Documents the 2026-03-30 debugging session: root cause (console=ttyS0
on UART-less hardware), what was tried, what was fixed, and remaining
work items.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-30 04:00:51 +01:00
84afe7d5e4 Merge pull request 'feat: PXE debug boot mode for rescue/diagnostics' (#4) from wip/ks-debugging into main
Some checks failed
CI/CD / lint (push) Failing after 9s
CI/CD / test (push) Failing after 10s
CI/CD / typecheck (push) Failing after 22s
CI/CD / build (push) Has been skipped
CI/CD / publish-rpm (push) Has been skipped
CI/CD / publish-deb (push) Has been skipped
2026-03-30 02:59:34 +00:00
Michal
0a4916d3c9 fix: remove serial console (root cause of 30s boot delay), enable syslog logging, disk auto-detect
Some checks failed
CI/CD / typecheck (pull_request) Failing after 9s
CI/CD / test (pull_request) Failing after 9s
CI/CD / lint (pull_request) Failing after 22s
CI/CD / build (pull_request) Has been skipped
CI/CD / publish-rpm (pull_request) Has been skipped
CI/CD / publish-deb (pull_request) Has been skipped
Root cause found: console=ttyS0,115200n8 causes 30-second timeout at every
systemd boot phase on hardware without a physical serial UART. Each phase
transition blocks waiting for the non-existent UART.

Changes:
- Remove console=ttyS0 from kickstart bootloader args and %post setup
- Enable Anaconda syslog forwarding (logging --host --port) for install visibility
- Improve syslog IP→MAC resolution (register from kickstart fetch + progress)
- Fix disk auto-detect: default to empty string (not /dev/sda) for NVMe support
- Enable SysRq magic keys (kernel.sysrq=1) for emergency reboot via JetKVM
- Simplify debug command: remove --sshd flag (inst.sshd always available),
  add /debug-setup.sh HTTP endpoint for nc listener setup
- Add labctl provision logs -f (follow mode with polling)
- Add syslog listener unit tests
- Enable syslog log capture test in integration suite

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-30 03:58:51 +01:00
Michal
a4a4840930 feat: debug --pxe-boot flag, boot installed system via PXE
Some checks failed
CI/CD / lint (pull_request) Failing after 10s
CI/CD / test (pull_request) Failing after 10s
CI/CD / typecheck (pull_request) Failing after 22s
CI/CD / build (pull_request) Has been skipped
CI/CD / publish-rpm (pull_request) Has been skipped
CI/CD / publish-deb (pull_request) Has been skipped
Loads kernel+initrd from bastion HTTP server, mounts root from local
NVMe. Workaround for UEFI firmware bugs that make local disk boot
100x slower. One-time use, auto-clears after boot.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-30 00:49:44 +01:00
Michal
8da947a1c3 fix: use %pre instead of %post for debug --sshd (rescue mode skips %post)
Some checks failed
CI/CD / typecheck (pull_request) Failing after 9s
CI/CD / test (pull_request) Failing after 10s
CI/CD / lint (pull_request) Failing after 23s
CI/CD / build (pull_request) Has been skipped
CI/CD / publish-rpm (pull_request) Has been skipped
CI/CD / publish-deb (pull_request) Has been skipped
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-30 00:25:19 +01:00
Michal
92c65b4672 fix: generic rescue instructions in debug command output
Some checks failed
CI/CD / typecheck (pull_request) Failing after 9s
CI/CD / test (pull_request) Failing after 9s
CI/CD / lint (pull_request) Failing after 22s
CI/CD / build (pull_request) Has been skipped
CI/CD / publish-rpm (pull_request) Has been skipped
CI/CD / publish-deb (pull_request) Has been skipped
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-29 23:59:38 +01:00
Michal
3835fefba1 feat: debug --sshd flag, auto SSH + nc listener + IP callback
Some checks failed
CI/CD / lint (pull_request) Failing after 9s
CI/CD / test (pull_request) Failing after 9s
CI/CD / typecheck (pull_request) Failing after 22s
CI/CD / build (pull_request) Has been skipped
CI/CD / publish-rpm (pull_request) Has been skipped
CI/CD / publish-deb (pull_request) Has been skipped
When using `labctl provision debug <target> --sshd`, the rescue
kickstart generates host keys, starts sshd (pw: debug) and nc
listener (port 2323), and reports the IP back to bastion via
/api/progress callback. Fully self-contained, no mounted FS needed.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-29 23:54:22 +01:00
Michal
d7a59665ad fix: route command-debug through bastion WebSocket handler
Some checks failed
CI/CD / typecheck (pull_request) Failing after 9s
CI/CD / lint (pull_request) Failing after 23s
CI/CD / test (pull_request) Failing after 6m53s
CI/CD / build (pull_request) Has been skipped
CI/CD / publish-rpm (pull_request) Has been skipped
CI/CD / publish-deb (pull_request) Has been skipped
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-29 23:01:16 +01:00
Michal
82ca93f4d7 fix: add debug field to inline BastionState in labd server
Some checks failed
CI/CD / typecheck (pull_request) Failing after 9s
CI/CD / test (pull_request) Failing after 8s
CI/CD / lint (pull_request) Failing after 22s
CI/CD / build (pull_request) Has been skipped
CI/CD / publish-rpm (pull_request) Has been skipped
CI/CD / publish-deb (pull_request) Has been skipped
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-29 22:54:02 +01:00
Michal
52150fd955 fix: add command-debug to LabdBastionMessage protocol types
Some checks failed
CI/CD / lint (pull_request) Failing after 9s
CI/CD / test (pull_request) Failing after 9s
CI/CD / typecheck (pull_request) Failing after 22s
CI/CD / build (pull_request) Has been skipped
CI/CD / publish-rpm (pull_request) Has been skipped
CI/CD / publish-deb (pull_request) Has been skipped
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-29 22:42:52 +01:00
Michal
e87edfcfbd feat: PXE debug boot mode for rescue/diagnostics
Some checks failed
CI/CD / lint (pull_request) Failing after 11s
CI/CD / test (pull_request) Failing after 9s
CI/CD / typecheck (pull_request) Failing after 22s
CI/CD / build (pull_request) Has been skipped
CI/CD / publish-rpm (pull_request) Has been skipped
CI/CD / publish-deb (pull_request) Has been skipped
New `labctl provision debug <target>` command that PXE boots a machine
into Fedora rescue mode (inst.rescue) for live debugging. Auto-clears
after one boot so next reboot returns to normal.

Adds debug state to BastionState, dispatch routing, API endpoints,
labd command routing, and CLI with rescue workflow guide.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-29 22:25:44 +01:00
Michal
6c6d5763c4 fix: skip USB-attached disks in %pre (JetKVM virtual media is SCSI-over-USB)
Check sysfs device path for 'usb' to skip JetKVM virtual media which
appears as /dev/sda but is not a real install target.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-29 12:51:44 +01:00
Michal
a7a6ad8098 fix: skip removable/USB disks in %pre, wait for NVMe init
JetKVM virtual media appears as /dev/sda before NVMe initializes.
Now: wait up to 10s for disks, skip removable disks and anything
under 20GB. Fixes "ignoredisk: sda does not exist" on SER9MAX.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-29 12:38:41 +01:00
Michal
e3523d642c fix: remove serial console from iPXE kernel args (may hang on SER9MAX)
ttyS0 console output on iPXE kernel line may cause kernel hang on
hardware without physical serial port. Removed from both discover
and install iPXE scripts. Serial console stays in bootloader config
for the installed system only.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-29 12:32:02 +01:00
Michal
5b04d3162b fix: disable logging --host (UDP not exposed), add nomodeset + JetKVM helper
- logging --host blocks Anaconda when syslog UDP port not reachable
- nomodeset prevents amdgpu hang on SER9MAX (Radeon 780M)
- JetKVM helper script for device control (status, reboot, power)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-29 11:07:48 +01:00
Michal
a14fd04947 fix: add nomodeset to iPXE kernel args (amdgpu hangs on SER9MAX)
Radeon 780M GPU driver initialization hangs during Anaconda boot
on SER9MAX. nomodeset disables kernel modesetting so the installer
doesn't try to initialize the GPU.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-29 03:01:21 +01:00
Michal
0c1e18cee1 feat: persist machine state to CockroachDB on bastion-state-sync
When bastion syncs state, labd now upserts discovered and installed
machines into the Server table. /api/machines merges live bastion
state with DB records, so machines survive pod restarts.

Discovered machines get status=discovered with hardware labels.
Installed machines get status=online with hostname, role, IP.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-29 02:34:26 +01:00
Michal
aae03d9877 fix: syslog parser TS strict null check, deploy script
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-29 00:58:00 +00:00
d4e9101bb6 Merge pull request 'fix: PXE boot debugging — bisect root cause, syslog logging, serial console' (#3) from wip/ks-debugging into main
Some checks failed
CI/CD / lint (push) Failing after 9s
CI/CD / test (push) Failing after 9s
CI/CD / typecheck (push) Failing after 22s
CI/CD / build (push) Has been skipped
CI/CD / publish-rpm (push) Has been skipped
CI/CD / publish-deb (push) Has been skipped
Reviewed-on: #3
2026-03-29 00:50:04 +00:00
Michal
84f1a7b133 feat: serial console on iPXE kernel boot args
Some checks failed
CI/CD / lint (pull_request) Failing after 12s
CI/CD / test (pull_request) Failing after 9s
CI/CD / typecheck (pull_request) Failing after 23s
CI/CD / build (pull_request) Has been skipped
CI/CD / publish-rpm (pull_request) Has been skipped
CI/CD / publish-deb (pull_request) Has been skipped
Add console=ttyS0,115200n8 to both discover and install iPXE kernel
lines so Anaconda output is visible on serial during install phase.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-29 00:46:25 +00:00
Michal
c0fb1310cb fix: re-enable logging --host (removed invalid --level flag)
ksvalidator caught the issue: --level=info is not valid for F43.
Correct syntax is just: logging --host=<ip> --port=<port>

Also added ksvalidator syntax check to unit tests — validates
rendered kickstart for all roles (vanilla, worker, infra) against
F43 pykickstart. This catches kickstart syntax errors at test time
instead of during a 12-minute VM install.

Integration test passes: 21/22 (1 skipped: log lines capture).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-29 00:45:11 +00:00
Michal
48b2230665 fix: disable logging --host (breaks Anaconda), add integration config
The kickstart `logging --host` directive stalls Anaconda install —
likely firewall blocks UDP syslog or Fedora 43 Anaconda has issues
with it. Commented out for now. Syslog listener infrastructure is
in place and ready once we resolve the Anaconda/firewall issue.

Added vitest.integration.config.ts for running integration tests:
  pnpm exec vitest run --config vitest.integration.config.ts

All 21 integration tests pass, serial console rsyslog forwarding works.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-29 00:19:48 +00:00
Michal
3dc1317301 feat: Anaconda syslog logging, serial console forwarding, protocol types
- Add UDP syslog listener (port 5514) for receiving Anaconda install logs
  via native `logging --host` kickstart directive — no background processes
- Add rsyslog serial console forwarding in %post (AWS EC2 compatible ttyS0@115200n8)
- Add ProvisionStackType ("dhcpproxy" | "iso" | "cloud-init") to shared types
- Add bastion-install-log WebSocket protocol message for bastion→labd log sync
- Add syslogPort to BastionConfig (default 5514)
- Wire syslog listener into bastion startup/shutdown lifecycle

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-28 23:14:10 +00:00
Michal
cac7514014 feat: admin user 'lab' with SSH key auth (Step 7 — PASS)
Changed admin user from 'michal' to generic 'lab' user.
SSH key auth works for both root and lab user.
21/22 tests pass (1 skipped: log lines, needs log streamer redesign).

Bisection complete — all features work except background log streamer
which prevents Anaconda from syncing filesystem writes.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-28 22:30:59 +00:00
Michal
25a2beccff fix: add error trap, bastion helpers, serial console (Steps 2-5 pass)
Bisection results:
- Step 2: bastion_log/bastion_error helpers — PASS
- Step 3: ERR trap in %post — PASS
- Step 4: background log streamer — FAIL (breaks boot, NOT included)
- Step 5: serial console on ttyS0 — PASS

The background log streamer (tail -f subprocess in %post) prevents
Anaconda from properly syncing the installed filesystem. This was
the root cause of all boot failures. Will need a different approach
for real-time log streaming.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-28 22:17:47 +00:00
Michal
2a1a29c03b fix: revert kickstart to near-original baseline (Step 0 — boots clean)
Reverted install.ks.ts to near-original state from commit 64533b2.
This is the bisection baseline — 21/22 integration tests pass,
0 failed systemd services, SSH works, /boot/efi mounts.

Removed all accumulated fixes that collectively broke boot:
- ERR trap, background log streamer, bastion_log/bastion_error
- depmod rebuild, nofail on /boot/efi, SELinux autorelabel
- chcon/restorecon for /etc /var /root
- kernel-modules and dosfstools packages

Kept from current branch:
- rootpw --plaintext lab-root-pw (console debug access)
- Network-first boot order (bastion controls boot)
- Vanilla role support, rancher partition support
- Boot screenshots during SSH wait (1/sec rolling buffer)
- Test runner script (run-pxe-test.sh)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-28 20:47:34 +00:00
Michal
a664074fa3 wip: save current ks debugging state before bisect revert
All accumulated changes to kickstart template, test infrastructure,
and dnsmasq config. None of these produce a clean boot yet — saving
state before reverting to baseline for bisection.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-28 20:24:14 +00:00
Michal
cc289c0f94 feat: serial console on test VMs for debugging without SSH
Some checks failed
CI/CD / typecheck (pull_request) Failing after 9s
CI/CD / test (pull_request) Failing after 9s
CI/CD / lint (pull_request) Failing after 21s
CI/CD / build (pull_request) Has been skipped
CI/CD / publish-rpm (pull_request) Has been skipped
CI/CD / publish-deb (pull_request) Has been skipped
- VMs get serial console on TCP (PXE: port 4555, ISO: port 4556)
- serialExec() helper: runs commands via telnet when SSH/network is down
- PXE test: on SSH failure, dumps hostname, IP, NetworkManager, sshd,
  failed units, and fstab via serial console before failing
- Kickstart enables serial-getty@ttyS0 for auto-login on serial

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-27 15:22:43 +00:00
Michal
ea7e437241 fix: network-first boot order, OVMF dispatch chain working
Some checks failed
CI/CD / typecheck (pull_request) Failing after 13s
CI/CD / lint (pull_request) Failing after 23s
CI/CD / test (pull_request) Failing after 7m0s
CI/CD / build (pull_request) Has been skipped
CI/CD / publish-rpm (pull_request) Has been skipped
CI/CD / publish-deb (pull_request) Has been skipped
- Kickstart %post now restores network-first EFI boot order (undoes
  Anaconda's disk-first default). Grep pattern includes HTTP boot entries.
- Test force-restarts VM after install so OVMF rereads NVRAM.
- VM successfully network-boots after install, hits /dispatch, bastion
  returns exit (local boot). Confirmed in test logs.
- nofail on /boot/efi fstab entry prevents emergency mode.
- Remaining: Fedora disk boot after iPXE exit may still fail.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-27 14:35:33 +00:00
Michal
7446d669c1 feat: ARM ISO boot integration test, OVMF boot fixes
Some checks failed
CI/CD / typecheck (pull_request) Failing after 9s
CI/CD / test (pull_request) Failing after 10s
CI/CD / lint (pull_request) Failing after 22s
CI/CD / build (pull_request) Has been skipped
CI/CD / publish-rpm (pull_request) Has been skipped
CI/CD / publish-deb (pull_request) Has been skipped
ARM integration test:
- arm-iso-provision.test.ts: aarch64 VM boots from bastion-generated ISO
- Uses QEMU aarch64 emulation (slow but validates the R1 scenario)
- Generous timeouts for emulated boot (15min discovery, 60min install)
- test-provision.sh updated: `sudo ./scripts/test-provision.sh arm`

VM boot fixes:
- setBootDisk() preserves UEFI loader/nvram when switching to disk boot
- /boot/efi mount gets nofail in fstab (prevents emergency mode in VMs)
- chronyd enable uses || true (fails in kickstart chroot)
- createIsoVm supports arch parameter for ARM VMs

Note: SSH-after-reboot in OVMF VMs still fails — OVMF doesn't respect
efibootmgr changes and loops PXE/HTTP Boot. Real hardware works fine.
The install flow itself (discovery → kickstart → complete) is validated.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-27 00:26:12 +00:00
Michal
46b017d77e feat: install logging, error trapping, PXE/ISO integration tests
Some checks failed
CI/CD / lint (pull_request) Failing after 13s
CI/CD / test (pull_request) Failing after 10s
CI/CD / typecheck (pull_request) Failing after 36s
CI/CD / build (pull_request) Has been skipped
CI/CD / publish-rpm (pull_request) Has been skipped
CI/CD / publish-deb (pull_request) Has been skipped
Kickstart installs on real hardware failed silently — no error reporting,
only 3 progress callbacks, zero log streaming. This overhaul makes every
install fully observable.

Kickstart improvements:
- Error trapping in %pre and %post (trap ERR sends failure details to bastion)
- 12+ granular progress stages (was 3): SSH, hostname, k3s prep, EFI boot, metadata
- Background log streamer: tails %post output and batch-sends to /api/log
- bastion_log() function for explicit log lines from kickstart scripts

Bastion API:
- POST /api/log — receives raw log lines from kickstart (single or batch)
- InstallLogBuffer — per-MAC ring buffer (2000 lines) + file persistence
- GET /api/logs/:mac — now returns log_lines + log_total alongside stages
- SSE /api/logs/:mac/follow — uses named events (event: stage vs event: log)
- Progress events forwarded to labd via bastion-progress WebSocket message
- Post-provision k3s logs routed through progressBus (was console-only)

dnsmasq fixes found during VM testing:
- HTTP Boot filename: ipxe-real.efi → ipxe.efi (leftover from old 2-stage approach)
- pxe-service directives: only in proxy mode (breaks OVMF PXE in full mode)
- PXEClient vendor class echo for UEFI firmware compatibility

Integration tests:
- PXE boot test: blank UEFI VM → dnsmasq → HTTP Boot → iPXE → bastion → install
- ISO boot test: blank VM boots from bastion-generated ISO → same flow
- Shared helpers: pxe-network (no DHCP, nftables fix), pxe-vm (UEFI + ISO boot)
- test-provision.sh: runs both PXE + ISO tests with prerequisite checks
- 250GB sparse QCOW2 disk (LVM layout needs ~204GB)

201 unit tests passing (11 new).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-26 22:26:33 +00:00
Michal
ffc4a782d2 docs: comprehensive PRD for taskmaster — labctl platform
Full product requirements covering: architecture, CLI commands,
partition layout, modules, testing strategy, cloud model, app model,
implementation phases, tech stack, and lessons from mcpctl.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-18 00:23:24 +00:00
Michal
44f1ebb843 feat: scaffold labd — master daemon with CockroachDB + Prisma
New @lab/labd workspace package:
- Fastify HTTP server + WebSocket for agent connections
- Prisma schema (CockroachDB): Server, Agent, User, Role, Permission,
  UserRole, JoinToken, AuditLog, PulumiRun, Cluster models
- Health endpoint with DB connectivity check
- Server listing with cloud/env/status filters
- Auth routes: agent enrollment, join token management
- Placeholder mTLS auth middleware
- Dev stack: CockroachDB single-node in docker-compose
- 32 tests passing (2 new for labd health)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-18 00:13:16 +00:00
Michal
897844fae0 refactor: rename CLI binary from lab to labctl
Updated everywhere: constants, package.json bin, completions,
nfpm packaging, build scripts, CI, banner text. Binary is now
/usr/bin/labctl. Internal package names (@lab/*) unchanged.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-18 00:07:17 +00:00
Michal
e3a1460593 docs: README with full command reference + platform design
Complete CLI reference for labctl covering:
- Bastion (PXE provisioning) — implemented
- Provisioning (install/reprovision/forget) — implemented
- Server management (get/describe) — planned
- Remote execution (exec) — planned
- Kubernetes proxy (kubectl) — planned
- Resource-scoped logs (server/app/pulumi/audit) — planned
- Apps (Pulumi charts replacing Helm) — planned
- RBAC (roles/permissions/deny rules) — planned
- Infrastructure as Code (apply/plan/destroy) — planned

Plus: partition layout, architecture diagram, tech stack, dev setup.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-18 00:02:19 +00:00
Michal
dbbdf5f971 docs: lab platform design — labd, agent, RBAC, multi-cloud, testing strategy
Comprehensive design document covering:
- labd master daemon with CA, RBAC, Pulumi executor
- lab-agent with mTLS enrollment, heartbeat, log shipping
- Module system (built-in + external repos)
- Cloud/environment model (baremetal + AWS)
- Ephemeral test environments (containers, VMs, cloud)
- Security test patterns for RBAC
- Health gates for deployment promotion
- Database strategy: PostgreSQL now, CockroachDB later
- Networking: Tailscale mesh + Cilium CNI

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-17 23:46:29 +00:00
Michal
7cfd8fe1b8 feat: daemonize bastion start, fix status for root-owned processes
- `lab init bastion standalone start` now runs in background by default
- `--foreground` flag for running in foreground (debugging/containers)
- Shows startup output then detaches with PID + log path
- Status command uses /proc check instead of kill -0 (works cross-user)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-17 22:38:46 +00:00
Michal
4d2e8677d4 fix: PID file permission handling + root check
- Require root when dnsmasq is needed (clear error message)
- Handle stale PID files owned by different user (remove + recreate)
- Create bastion dir with 755 permissions
- 3 new PID file tests (30 total)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-17 22:31:17 +00:00
Michal
52e1932bde feat: multi-architecture builds (x86_64 + arm64)
- build-rpm.sh: --arch flag for targeting x86_64 or arm64, --all for both
  Uses bun cross-compile with --target=bun-linux-x64/arm64
- build-bastion.sh: --arch flag for Docker platform targeting
- release.sh: builds both architectures by default
- CI: builds + publishes RPM/DEB for both architectures

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-17 22:02:52 +00:00
Michal
86cd961ee4 feat: release pipeline, k3s manifests, infra k3s bootstrap
- scripts/release.sh: full release orchestration (build, publish, install)
- deploy/k3s/: Deployment, ConfigMap, PVC, Namespace with kustomize
  hostNetwork for dnsmasq, NET_ADMIN caps, local-path PVC
- Infra role gets /var/lib/rancher partition (20GB, preserved on reprovision)
  for k3s etcd data persistence across reinstalls
- Infra %post installs k3s server (INSTALL_K3S_SKIP_START=true)
- 5 new kickstart tests (27 total)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-17 21:56:39 +00:00
Michal
ed1df8a77c feat: ESLint, shell completions, Docker, nfpm packaging, CI/CD
- ESLint with typescript-eslint + prettier (eslint.config.js)
- Shell completions for bash and fish (scripts/generate-completions.ts)
- Multi-stage Dockerfile for bastion (fedora:43 + dnsmasq + node)
- nfpm.yaml for RPM/DEB packaging with bun-compiled binary
- Build scripts: build-rpm.sh, build-bastion.sh, publish-rpm/deb.sh
- Gitea Actions CI/CD: lint, typecheck, test, build, publish

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-17 21:51:01 +00:00
Michal
520af41a52 feat: colorful progress output with icons and SSH command
Progress callbacks from kickstart now show:
  ◆ 78:55:36:08:35:14  partitioning -- preparing disk layout
  ◆◆◆ 78:55:36:08:35:14  post-install -- configuring system
  ✔ 78:55:36:08:35:14  complete -- ready at 10.0.1.88

    ssh michal@10.0.1.88

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-17 12:04:52 +00:00
Michal
9803817004 fix: reprovision SSH reboot is expected to close connection
The SSH connection closing during reboot is normal, not an error.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-17 11:45:16 +00:00
Michal
db26c5ecb1 docs: architecture design document
ARCHITECTURE.md covering: CLI structure (lab init/provision), project layout,
boot flow, partition scheme, container architecture, CI/CD pipeline,
state management, and technology stack.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-17 11:42:16 +00:00
Michal
d01b675cca feat: firewall management + reprovision SSH key fix
- Open firewall ports (dhcp, tftp, http, 4011) on bastion start
- Close firewall ports on bastion shutdown
- Auto-detect firewall zone for interface
- Fix reprovision SSH to use execFileSync with explicit key path

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-17 11:39:57 +00:00
Michal
62f896593d feat: CLI subcommands, PID self-restart, unit tests (22 passing)
CLI restructured:
  lab init bastion standalone start/stop/status
  lab provision list/install/reprovision/forget

- Nested commander subcommand groups (init > bastion > standalone, provision)
- PID file management: auto-kills old bastion on start, cleans up on stop
- stop command reads PID file and sends SIGTERM
- status command shows running state, port, machine counts
- forget command (DELETE /api/machines/:mac) removes from all state

Unit tests (22 tests, 3 files):
- kickstart.test.ts: worker/infra roles, SSH keys, partitions, admin user
- state.test.ts: load/save, atomic writes
- dispatch.test.ts: install/discover/local-boot routing, progress, forget

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-17 11:12:17 +00:00
Michal
64533b2dcf refactor: restructure bastion as pnpm monorepo (@lab/shared, @lab/bastion, @lab/cli)
- Split into 3 workspace packages: shared (types/constants), bastion (server), cli
- CLI binary renamed from "bastion" to "lab"
- Cross-package imports via @lab/shared and @lab/bastion workspace references
- Extracted BastionConfig, BastionState, HardwareInfo types into @lab/shared
- Added APP_NAME/APP_VERSION constants
- tsconfig.base.json with project references for build ordering
- Root workspace scripts: build, test, typecheck, clean

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-17 11:05:41 +00:00
Michal
937c01f5d9 fix: add --skip-dnsmasq/--skip-artifacts flags, fix config propagation
Enables running the TS bastion without dnsmasq for testing.
VM-tested: SSH works, partitions correct, k3s prereqs configured.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-17 03:11:29 +00:00
Michal
177e993736 feat: TypeScript bastion rewrite (initial scaffold)
Full rewrite of the bash bastion.sh into a TypeScript application:
- Fastify HTTP server with typed routes (dispatch, kickstart, API)
- Commander CLI (serve, install, list, reprovision)
- Kickstart templates as TypeScript template literals (no more heredoc hell)
- dnsmasq management via execa subprocess
- Merged machine list view (hardware + install info in one table)
- Containerized via podman-compose (Dockerfile + docker-compose.yml)
- All partition logic preserved (LVM, reprovision detection, role-based)

Not yet tested end-to-end — needs VM validation before replacing bash version.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-17 02:55:52 +00:00