Commit Graph

79 Commits

Author SHA1 Message Date
Michal
ae91f2895e feat: dynamic /ks-auto kickstart for ISO boot (R1 ARM support)
Some checks failed
CI/CD / lint (push) Failing after 11s
CI/CD / typecheck (push) Failing after 22s
CI/CD / test (push) Failing after 7m5s
CI/CD / build (push) Has been skipped
CI/CD / publish-rpm (push) Has been skipped
CI/CD / publish-deb (push) Has been skipped
Add state-aware kickstart dispatch for machines that boot from ISO
(no PXE/network at UEFI level). Replaces hardcoded discover.ks.

- /ks-auto: %pre detects MAC, queries /api/machine-state/<mac>,
  writes discover or install kickstart to /tmp/dynamic.ks,
  main body %include's it
- /api/machine-state/<mac>: simple state endpoint returning
  unknown|discovered|queued|installing|installed|debug
- ISO kernel cmdline updated: discover.ks → ks-auto
- Handles: discovery (first boot), install (queued), debug modes

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-01 16:17:08 +01:00
Michal
06fc40a857 fix: k3s install automation — skip Cilium on join, Longhorn via server, default root user
Some checks failed
CI/CD / typecheck (push) Failing after 10s
CI/CD / test (push) Failing after 9s
CI/CD / lint (push) Failing after 22s
CI/CD / build (push) Has been skipped
CI/CD / publish-rpm (push) Has been skipped
CI/CD / publish-deb (push) Has been skipped
- Skip Cilium install for joining servers (already in cluster via daemonset)
- Longhorn annotation for workers: SSH to server node from CLI to apply
  kubectl annotation (workers don't have kubectl access)
- Default SSH user for k3s/app commands changed to 'root' (operations
  need root privileges, using 'lab' user broke installs)
- k3s server config: cluster-init for initial server, server+token for joins

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-01 16:02:19 +01:00
Michal
a68d6d617e feat: k3s cluster-init for etcd HA, fix Cilium duplicate install
Some checks failed
CI/CD / lint (push) Failing after 11s
CI/CD / test (push) Failing after 10s
CI/CD / typecheck (push) Failing after 22s
CI/CD / build (push) Has been skipped
CI/CD / publish-rpm (push) Has been skipped
CI/CD / publish-deb (push) Has been skipped
- Server config now uses cluster-init: true for initial server (enables
  embedded etcd). Joining servers get server: + token: in config.
- Cilium install already checks for existing installation, so joining
  servers skip it gracefully (the "release name in use" error is non-fatal)

Cluster rebuilt as etcd HA:
  worker0-k8s0  control-plane,etcd  (initial server, cluster-init)
  worker1-k8s0  control-plane,etcd  (joined server, Mac Studio aarch64)
  spark-2935    worker              (DGX Spark, aarch64)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-01 15:53:18 +01:00
Michal
c49a650888 fix: firstboot fstab handling — no duplicates, compatible with Asahi sed
Some checks failed
CI/CD / typecheck (push) Failing after 10s
CI/CD / test (push) Failing after 11s
CI/CD / lint (push) Failing after 23s
CI/CD / build (push) Has been skipped
CI/CD / publish-rpm (push) Has been skipped
CI/CD / publish-deb (push) Has been skipped
- Replace sed with grep -v / awk for fstab manipulation (Asahi Fedora's
  sed doesn't support \| delimiter or \? quantifier)
- Use idempotent write_lab_fstab function: removes all old entries first,
  comments out conflicting btrfs subvol entries, adds fresh LVM entries
- Fix sed for SSH hardening: use #* instead of \? (POSIX compatible)
- Tested on Mac Studio: no duplicate fstab entries after multiple runs

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-01 15:40:29 +01:00
Michal
87e09af941 fix: default admin user to 'lab', case-insensitive OS detection for iSCSI
Some checks failed
CI/CD / typecheck (push) Failing after 10s
CI/CD / test (push) Failing after 10s
CI/CD / lint (push) Failing after 22s
CI/CD / build (push) Has been skipped
CI/CD / publish-rpm (push) Has been skipped
CI/CD / publish-deb (push) Has been skipped
- Firstboot script defaults admin user to 'lab' instead of bastion's
  config.adminUser (which was 'michal' from host system)
- iSCSI OS detection uses case-insensitive match for 'fedora'

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-01 15:13:53 +01:00
Michal
6f13e284fd fix: firstboot script auto-detects hostname and MAC, no query params needed
Some checks failed
CI/CD / typecheck (push) Failing after 10s
CI/CD / test (push) Failing after 10s
CI/CD / lint (push) Failing after 23s
CI/CD / build (push) Has been skipped
CI/CD / publish-rpm (push) Has been skipped
CI/CD / publish-deb (push) Has been skipped
The firstboot script now auto-detects hostname (from hostnamectl) and
MAC address (from first UP interface) at runtime. No URL query parameters
required — just `curl bastion/asahi/firstboot.sh | sudo bash`.

Fixes the shell escaping issue where `&` in query params broke curl piping.
Updated labctl provision asahi instructions accordingly.

Tested on Mac Studio (worker1-k8s0): hostname, MAC, and bastion
registration all auto-detected correctly.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-01 15:05:25 +01:00
Michal
6c963a15bd fix: firstboot reprovision path now runs hostname, user, and registration
Some checks failed
CI/CD / lint (push) Failing after 12s
CI/CD / test (push) Failing after 10s
CI/CD / typecheck (push) Failing after 29s
CI/CD / build (push) Has been skipped
CI/CD / publish-rpm (push) Has been skipped
CI/CD / publish-deb (push) Has been skipped
Previously the reprovision path exited early after re-mounting LVs,
skipping hostname setup, admin user creation, metadata, and bastion
registration. Now both paths fall through to the common post-setup code.

Tested on Mac Studio (worker1-k8s0) — reprovision + self-registration
confirmed working via curl | bash pipe.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-01 09:59:02 +01:00
8c737d163d Merge pull request 'feat: Asahi Linux provisioning for Apple Silicon' (#10) from feat/asahi-provisioning into main
Some checks failed
CI/CD / lint (push) Failing after 11s
CI/CD / typecheck (push) Failing after 22s
CI/CD / test (push) Failing after 7m7s
CI/CD / build (push) Has been skipped
CI/CD / publish-rpm (push) Has been skipped
CI/CD / publish-deb (push) Has been skipped
2026-03-31 23:30:41 +00:00
Michal
17bae7ddbf fix: pre-download rootfs ZIP to avoid macOS Python HTTP streaming issues
Some checks failed
CI/CD / lint (pull_request) Failing after 11s
CI/CD / test (pull_request) Failing after 10s
CI/CD / typecheck (pull_request) Failing after 22s
CI/CD / build (pull_request) Has been skipped
CI/CD / publish-rpm (pull_request) Has been skipped
CI/CD / publish-deb (pull_request) Has been skipped
The Asahi installer's urlcache.py fails with AssertionError on macOS
when streaming ZIP via HTTP Range requests from Fastify. Fix: download
the ZIP with curl first (reliable on macOS), then set REPO_BASE to the
local directory so the installer opens it as a local file.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-01 00:30:29 +01:00
Michal
bb8f37ef7d feat: iSCSI, Longhorn disk labels, labctl asahi command, ZIP32 fix
Some checks failed
CI/CD / typecheck (pull_request) Failing after 12s
CI/CD / lint (pull_request) Failing after 22s
CI/CD / test (pull_request) Failing after 10s
CI/CD / build (pull_request) Has been skipped
CI/CD / publish-rpm (pull_request) Has been skipped
CI/CD / publish-deb (pull_request) Has been skipped
k3s host prep:
- Add iSCSI initiator install+enable (Fedora: iscsi-initiator-utils,
  Ubuntu: open-iscsi) — required by Longhorn
- Add Longhorn disk label to k3s server+agent configs
- Add Longhorn disk annotation operation in post-install hardening

CLI:
- Add `labctl provision asahi` command with interactive install guide
- Change default SSH user from "michal" to "lab" in all commands
- Change admin user in bastion progress callback to "lab"

Asahi provisioning fixes:
- Download installer_data.json locally (installer reads it as file)
- Use REPO_BASE to serve upstream ZIP from bastion (LAN speed)
- Fix ZIP32 vs ZIP64: serve original upstream ZIP unmodified
  (our repackaged ZIP used ZIP64 which breaks Asahi urlcache)
- Add /data/asahi-repo fallback path for k3s container PVC mount
- Deploy script syncs asahi-repo to bastion pod after deployment

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-31 23:32:38 +01:00
Michal
a8dc79bc5a feat: Asahi validation tests, rootfs build fixes, shellcheck-clean scripts
Some checks failed
CI/CD / lint (pull_request) Failing after 12s
CI/CD / test (pull_request) Failing after 10s
CI/CD / typecheck (pull_request) Failing after 22s
CI/CD / build (pull_request) Has been skipped
CI/CD / publish-rpm (pull_request) Has been skipped
CI/CD / publish-deb (pull_request) Has been skipped
- Add 16 validation tests: shellcheck (3 roles), installer_data.json
  schema (8), Python parser validation, ZIP structure (3), rootfs mount
- Fix empty SSH keys generating invalid bash (SC1073)
- Fix __dirname crash in ESM modules (use import.meta.url)
- Fix rootfs build: mkdir -p before writing, correct binary paths
- Add .gitignore for large build artifacts (.asahi-cache, *.zip)
- Bump smoke test timeout for additional static plugin registration

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-31 13:22:24 +01:00
Michal
ad76c74020 fix: rootfs build script — mkdir before write, fix package path checks
Some checks failed
CI/CD / typecheck (pull_request) Failing after 10s
CI/CD / lint (pull_request) Failing after 21s
CI/CD / test (pull_request) Failing after 11s
CI/CD / build (pull_request) Has been skipped
CI/CD / publish-rpm (pull_request) Has been skipped
CI/CD / publish-deb (pull_request) Has been skipped
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-31 03:26:26 +01:00
Michal
6807632d46 feat: Asahi rootfs build pipeline + serve from bastion
Some checks failed
CI/CD / lint (pull_request) Failing after 10s
CI/CD / test (pull_request) Failing after 10s
CI/CD / typecheck (pull_request) Failing after 22s
CI/CD / build (pull_request) Has been skipped
CI/CD / publish-rpm (pull_request) Has been skipped
CI/CD / publish-deb (pull_request) Has been skipped
- Add scripts/build-asahi-rootfs.sh: downloads upstream Fedora Asahi
  Remix Server, injects lab firstboot script + systemd service + SSH
  keys, repackages with installer_data.json that adds LVM Data partition
- Bastion serves built artifacts at /asahi/repo/* via fastify-static
- installer_data.json prefers built config, falls back to minimal
- Fix __dirname crash in ESM module (use import.meta.url)
- Fix smoke test timeout (was crashing due to __dirname)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-31 03:20:12 +01:00
Michal
53265bb18c test: integration test for Asahi firstboot LVM setup
Some checks failed
CI/CD / lint (pull_request) Failing after 21s
CI/CD / typecheck (pull_request) Failing after 22s
CI/CD / test (pull_request) Failing after 22s
CI/CD / build (pull_request) Has been skipped
CI/CD / publish-rpm (pull_request) Has been skipped
CI/CD / publish-deb (pull_request) Has been skipped
VM-based end-to-end test using Fedora cloud image with two disks:
root (20GB) + data (200GB). Verifies the firstboot script creates
labvg with correct LV sizes, mounts volumes, migrates /home content,
sets hostname, creates admin user, and handles reprovision.

Fixes to firstboot script:
- Detect whole disks (not just partitions) for LVM PV
- Handle btrfs subvolume paths in root device detection
- Copy /home content before mounting LV (preserves SSH keys)
- Don't restart sshd (config takes effect on reboot)
- Make swapon and mount operations resilient to failures

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-31 03:07:38 +01:00
Michal
863c7f2b83 feat: Asahi Linux provisioning for Apple Silicon (Mac Studio)
Some checks failed
CI/CD / typecheck (pull_request) Failing after 11s
CI/CD / lint (pull_request) Failing after 22s
CI/CD / test (pull_request) Failing after 11s
CI/CD / build (pull_request) Has been skipped
CI/CD / publish-rpm (pull_request) Has been skipped
CI/CD / publish-deb (pull_request) Has been skipped
Add bastion endpoints for provisioning Apple Silicon machines via the
Asahi Linux installer with custom LVM partitioning:

- GET /asahi — wrapper script (curl bastion:8080/asahi | sh)
- GET /asahi/installer_data.json — custom partition layout (60GB root + LVM data)
- GET /asahi/firstboot.sh — first-boot LVM setup matching kickstart layout
- GET /asahi/firstboot.service — systemd oneshot unit

The firstboot script creates labvg with role-specific LVs (var, varlog,
home, srv, rancher, longhorn) and handles reprovision by detecting
existing VGs. Includes 19 new tests.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-31 02:46:27 +01:00
906f93f6f2 Merge pull request 'fix: Cilium multi-node support' (#9) from fix/cilium-multi-node into main
Some checks failed
CI/CD / lint (push) Failing after 22s
CI/CD / typecheck (push) Failing after 21s
CI/CD / test (push) Failing after 22s
CI/CD / build (push) Has been skipped
CI/CD / publish-rpm (push) Has been skipped
CI/CD / publish-deb (push) Has been skipped
2026-03-31 00:36:17 +00:00
Michal
aea28b5a0f fix: Cilium multi-node support — auto-detect NIC, k3s agent API port, worker label
Some checks failed
CI/CD / typecheck (pull_request) Failing after 10s
CI/CD / lint (pull_request) Failing after 22s
CI/CD / test (pull_request) Failing after 7m8s
CI/CD / build (pull_request) Has been skipped
CI/CD / publish-rpm (pull_request) Has been skipped
CI/CD / publish-deb (pull_request) Has been skipped
- Remove hardcoded devices/directRoutingDevice from Cilium install (let
  Cilium auto-detect per node — needed for heterogeneous NICs like eno1 vs enP7s7)
- Set k8sServiceHost=127.0.0.1 k8sServicePort=6444 so Cilium init
  containers can reach the API via k3s agent's local LB proxy
- Add node-role.kubernetes.io/worker label to agent config

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-31 01:35:51 +01:00
f3f0ea48e7 Merge pull request 'feat: provision register + k3s kubeconfig' (#8) from feat/register-and-kubeconfig into main
Some checks failed
CI/CD / lint (push) Failing after 10s
CI/CD / test (push) Failing after 10s
CI/CD / typecheck (push) Failing after 21s
CI/CD / build (push) Has been skipped
CI/CD / publish-rpm (push) Has been skipped
CI/CD / publish-deb (push) Has been skipped
2026-03-31 00:16:06 +00:00
Michal
49d747db98 feat: provision register command and k3s kubeconfig merge
Some checks failed
CI/CD / lint (pull_request) Failing after 11s
CI/CD / test (pull_request) Failing after 11s
CI/CD / typecheck (pull_request) Failing after 22s
CI/CD / build (pull_request) Has been skipped
CI/CD / publish-rpm (pull_request) Has been skipped
CI/CD / publish-deb (pull_request) Has been skipped
Add `labctl provision register` to re-add machines to installed state
without reprovisioning (e.g. after bastion state loss). Full stack:
protocol type, bastion API + WS handler, labd route, CLI command.

Add `labctl app k3s kubeconfig <target>` to fetch kubeconfig from a
k3s node via SSH, rewrite server URL, and merge into ~/.kube/config.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-31 01:15:31 +01:00
8635da08a6 Merge pull request 'fix: reprovision workflow bugs' (#7) from fix/reprovision-bugs into main
Some checks failed
CI/CD / typecheck (push) Failing after 10s
CI/CD / test (push) Failing after 10s
CI/CD / lint (push) Failing after 23s
CI/CD / build (push) Has been skipped
CI/CD / publish-rpm (push) Has been skipped
CI/CD / publish-deb (push) Has been skipped
Reviewed-on: #7
2026-03-30 22:44:44 +00:00
Michal
6a5f23c0f5 fix: reprovision workflow bugs — SSH host key warnings, log following, status priority
Some checks failed
CI/CD / lint (pull_request) Failing after 10s
CI/CD / test (pull_request) Failing after 10s
CI/CD / typecheck (pull_request) Failing after 23s
CI/CD / build (pull_request) Has been skipped
CI/CD / publish-rpm (pull_request) Has been skipped
CI/CD / publish-deb (pull_request) Has been skipped
- Add UserKnownHostsFile=/dev/null to SSH in debug and reprovision commands
- Track install state in log follower so it doesn't exit prematurely on "installed"
- Reorder bastion status check to prioritize active queue over stale installed state
- Update .gitignore with task file entries

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-30 22:59:45 +01:00
63cc033e3e Merge pull request 'docs: comprehensive architecture document' (#6) from docs/architecture into main
Some checks failed
CI/CD / typecheck (push) Failing after 10s
CI/CD / test (push) Failing after 11s
CI/CD / lint (push) Failing after 24s
CI/CD / build (push) Has been skipped
CI/CD / publish-rpm (push) Has been skipped
CI/CD / publish-deb (push) Has been skipped
2026-03-30 16:31:41 +00:00
Michal
d7a25066bd docs: comprehensive architecture document
Some checks failed
CI/CD / lint (pull_request) Failing after 13s
CI/CD / typecheck (pull_request) Failing after 23s
CI/CD / test (pull_request) Failing after 14s
CI/CD / build (pull_request) Has been skipped
CI/CD / publish-rpm (pull_request) Has been skipped
CI/CD / publish-deb (pull_request) Has been skipped
Covers all components (bastion, labd, labctl, agent, modules),
data flow, machine lifecycle, disk layout, kickstart features,
deployment, testing, security, known issues, and planned work.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-30 17:31:29 +01:00
a0f6161533 Merge pull request 'docs: PXE boot debugging post-mortem' (#5) from docs/pxe-boot-debugging into main
Some checks failed
CI/CD / lint (push) Failing after 21s
CI/CD / typecheck (push) Failing after 22s
CI/CD / test (push) Failing after 22s
CI/CD / build (push) Has been skipped
CI/CD / publish-rpm (push) Has been skipped
CI/CD / publish-deb (push) Has been skipped
2026-03-30 03:01:12 +00:00
Michal
87c1a34232 docs: PXE boot debugging post-mortem — serial console root cause
Some checks failed
CI/CD / lint (pull_request) Failing after 10s
CI/CD / typecheck (pull_request) Failing after 23s
CI/CD / test (pull_request) Failing after 7m4s
CI/CD / build (pull_request) Has been skipped
CI/CD / publish-rpm (pull_request) Has been skipped
CI/CD / publish-deb (pull_request) Has been skipped
Documents the 2026-03-30 debugging session: root cause (console=ttyS0
on UART-less hardware), what was tried, what was fixed, and remaining
work items.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-30 04:00:51 +01:00
84afe7d5e4 Merge pull request 'feat: PXE debug boot mode for rescue/diagnostics' (#4) from wip/ks-debugging into main
Some checks failed
CI/CD / lint (push) Failing after 9s
CI/CD / test (push) Failing after 10s
CI/CD / typecheck (push) Failing after 22s
CI/CD / build (push) Has been skipped
CI/CD / publish-rpm (push) Has been skipped
CI/CD / publish-deb (push) Has been skipped
2026-03-30 02:59:34 +00:00
Michal
0a4916d3c9 fix: remove serial console (root cause of 30s boot delay), enable syslog logging, disk auto-detect
Some checks failed
CI/CD / typecheck (pull_request) Failing after 9s
CI/CD / test (pull_request) Failing after 9s
CI/CD / lint (pull_request) Failing after 22s
CI/CD / build (pull_request) Has been skipped
CI/CD / publish-rpm (pull_request) Has been skipped
CI/CD / publish-deb (pull_request) Has been skipped
Root cause found: console=ttyS0,115200n8 causes 30-second timeout at every
systemd boot phase on hardware without a physical serial UART. Each phase
transition blocks waiting for the non-existent UART.

Changes:
- Remove console=ttyS0 from kickstart bootloader args and %post setup
- Enable Anaconda syslog forwarding (logging --host --port) for install visibility
- Improve syslog IP→MAC resolution (register from kickstart fetch + progress)
- Fix disk auto-detect: default to empty string (not /dev/sda) for NVMe support
- Enable SysRq magic keys (kernel.sysrq=1) for emergency reboot via JetKVM
- Simplify debug command: remove --sshd flag (inst.sshd always available),
  add /debug-setup.sh HTTP endpoint for nc listener setup
- Add labctl provision logs -f (follow mode with polling)
- Add syslog listener unit tests
- Enable syslog log capture test in integration suite

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-30 03:58:51 +01:00
Michal
a4a4840930 feat: debug --pxe-boot flag, boot installed system via PXE
Some checks failed
CI/CD / lint (pull_request) Failing after 10s
CI/CD / test (pull_request) Failing after 10s
CI/CD / typecheck (pull_request) Failing after 22s
CI/CD / build (pull_request) Has been skipped
CI/CD / publish-rpm (pull_request) Has been skipped
CI/CD / publish-deb (pull_request) Has been skipped
Loads kernel+initrd from bastion HTTP server, mounts root from local
NVMe. Workaround for UEFI firmware bugs that make local disk boot
100x slower. One-time use, auto-clears after boot.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-30 00:49:44 +01:00
Michal
8da947a1c3 fix: use %pre instead of %post for debug --sshd (rescue mode skips %post)
Some checks failed
CI/CD / typecheck (pull_request) Failing after 9s
CI/CD / test (pull_request) Failing after 10s
CI/CD / lint (pull_request) Failing after 23s
CI/CD / build (pull_request) Has been skipped
CI/CD / publish-rpm (pull_request) Has been skipped
CI/CD / publish-deb (pull_request) Has been skipped
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-30 00:25:19 +01:00
Michal
92c65b4672 fix: generic rescue instructions in debug command output
Some checks failed
CI/CD / typecheck (pull_request) Failing after 9s
CI/CD / test (pull_request) Failing after 9s
CI/CD / lint (pull_request) Failing after 22s
CI/CD / build (pull_request) Has been skipped
CI/CD / publish-rpm (pull_request) Has been skipped
CI/CD / publish-deb (pull_request) Has been skipped
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-29 23:59:38 +01:00
Michal
3835fefba1 feat: debug --sshd flag, auto SSH + nc listener + IP callback
Some checks failed
CI/CD / lint (pull_request) Failing after 9s
CI/CD / test (pull_request) Failing after 9s
CI/CD / typecheck (pull_request) Failing after 22s
CI/CD / build (pull_request) Has been skipped
CI/CD / publish-rpm (pull_request) Has been skipped
CI/CD / publish-deb (pull_request) Has been skipped
When using `labctl provision debug <target> --sshd`, the rescue
kickstart generates host keys, starts sshd (pw: debug) and nc
listener (port 2323), and reports the IP back to bastion via
/api/progress callback. Fully self-contained, no mounted FS needed.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-29 23:54:22 +01:00
Michal
d7a59665ad fix: route command-debug through bastion WebSocket handler
Some checks failed
CI/CD / typecheck (pull_request) Failing after 9s
CI/CD / lint (pull_request) Failing after 23s
CI/CD / test (pull_request) Failing after 6m53s
CI/CD / build (pull_request) Has been skipped
CI/CD / publish-rpm (pull_request) Has been skipped
CI/CD / publish-deb (pull_request) Has been skipped
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-29 23:01:16 +01:00
Michal
82ca93f4d7 fix: add debug field to inline BastionState in labd server
Some checks failed
CI/CD / typecheck (pull_request) Failing after 9s
CI/CD / test (pull_request) Failing after 8s
CI/CD / lint (pull_request) Failing after 22s
CI/CD / build (pull_request) Has been skipped
CI/CD / publish-rpm (pull_request) Has been skipped
CI/CD / publish-deb (pull_request) Has been skipped
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-29 22:54:02 +01:00
Michal
52150fd955 fix: add command-debug to LabdBastionMessage protocol types
Some checks failed
CI/CD / lint (pull_request) Failing after 9s
CI/CD / test (pull_request) Failing after 9s
CI/CD / typecheck (pull_request) Failing after 22s
CI/CD / build (pull_request) Has been skipped
CI/CD / publish-rpm (pull_request) Has been skipped
CI/CD / publish-deb (pull_request) Has been skipped
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-29 22:42:52 +01:00
Michal
e87edfcfbd feat: PXE debug boot mode for rescue/diagnostics
Some checks failed
CI/CD / lint (pull_request) Failing after 11s
CI/CD / test (pull_request) Failing after 9s
CI/CD / typecheck (pull_request) Failing after 22s
CI/CD / build (pull_request) Has been skipped
CI/CD / publish-rpm (pull_request) Has been skipped
CI/CD / publish-deb (pull_request) Has been skipped
New `labctl provision debug <target>` command that PXE boots a machine
into Fedora rescue mode (inst.rescue) for live debugging. Auto-clears
after one boot so next reboot returns to normal.

Adds debug state to BastionState, dispatch routing, API endpoints,
labd command routing, and CLI with rescue workflow guide.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-29 22:25:44 +01:00
Michal
6c6d5763c4 fix: skip USB-attached disks in %pre (JetKVM virtual media is SCSI-over-USB)
Check sysfs device path for 'usb' to skip JetKVM virtual media which
appears as /dev/sda but is not a real install target.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-29 12:51:44 +01:00
Michal
a7a6ad8098 fix: skip removable/USB disks in %pre, wait for NVMe init
JetKVM virtual media appears as /dev/sda before NVMe initializes.
Now: wait up to 10s for disks, skip removable disks and anything
under 20GB. Fixes "ignoredisk: sda does not exist" on SER9MAX.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-29 12:38:41 +01:00
Michal
e3523d642c fix: remove serial console from iPXE kernel args (may hang on SER9MAX)
ttyS0 console output on iPXE kernel line may cause kernel hang on
hardware without physical serial port. Removed from both discover
and install iPXE scripts. Serial console stays in bootloader config
for the installed system only.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-29 12:32:02 +01:00
Michal
5b04d3162b fix: disable logging --host (UDP not exposed), add nomodeset + JetKVM helper
- logging --host blocks Anaconda when syslog UDP port not reachable
- nomodeset prevents amdgpu hang on SER9MAX (Radeon 780M)
- JetKVM helper script for device control (status, reboot, power)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-29 11:07:48 +01:00
Michal
a14fd04947 fix: add nomodeset to iPXE kernel args (amdgpu hangs on SER9MAX)
Radeon 780M GPU driver initialization hangs during Anaconda boot
on SER9MAX. nomodeset disables kernel modesetting so the installer
doesn't try to initialize the GPU.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-29 03:01:21 +01:00
Michal
0c1e18cee1 feat: persist machine state to CockroachDB on bastion-state-sync
When bastion syncs state, labd now upserts discovered and installed
machines into the Server table. /api/machines merges live bastion
state with DB records, so machines survive pod restarts.

Discovered machines get status=discovered with hardware labels.
Installed machines get status=online with hostname, role, IP.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-29 02:34:26 +01:00
Michal
aae03d9877 fix: syslog parser TS strict null check, deploy script
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-29 00:58:00 +00:00
d4e9101bb6 Merge pull request 'fix: PXE boot debugging — bisect root cause, syslog logging, serial console' (#3) from wip/ks-debugging into main
Some checks failed
CI/CD / lint (push) Failing after 9s
CI/CD / test (push) Failing after 9s
CI/CD / typecheck (push) Failing after 22s
CI/CD / build (push) Has been skipped
CI/CD / publish-rpm (push) Has been skipped
CI/CD / publish-deb (push) Has been skipped
Reviewed-on: #3
2026-03-29 00:50:04 +00:00
Michal
84f1a7b133 feat: serial console on iPXE kernel boot args
Some checks failed
CI/CD / lint (pull_request) Failing after 12s
CI/CD / test (pull_request) Failing after 9s
CI/CD / typecheck (pull_request) Failing after 23s
CI/CD / build (pull_request) Has been skipped
CI/CD / publish-rpm (pull_request) Has been skipped
CI/CD / publish-deb (pull_request) Has been skipped
Add console=ttyS0,115200n8 to both discover and install iPXE kernel
lines so Anaconda output is visible on serial during install phase.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-29 00:46:25 +00:00
Michal
c0fb1310cb fix: re-enable logging --host (removed invalid --level flag)
ksvalidator caught the issue: --level=info is not valid for F43.
Correct syntax is just: logging --host=<ip> --port=<port>

Also added ksvalidator syntax check to unit tests — validates
rendered kickstart for all roles (vanilla, worker, infra) against
F43 pykickstart. This catches kickstart syntax errors at test time
instead of during a 12-minute VM install.

Integration test passes: 21/22 (1 skipped: log lines capture).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-29 00:45:11 +00:00
Michal
48b2230665 fix: disable logging --host (breaks Anaconda), add integration config
The kickstart `logging --host` directive stalls Anaconda install —
likely firewall blocks UDP syslog or Fedora 43 Anaconda has issues
with it. Commented out for now. Syslog listener infrastructure is
in place and ready once we resolve the Anaconda/firewall issue.

Added vitest.integration.config.ts for running integration tests:
  pnpm exec vitest run --config vitest.integration.config.ts

All 21 integration tests pass, serial console rsyslog forwarding works.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-29 00:19:48 +00:00
Michal
3dc1317301 feat: Anaconda syslog logging, serial console forwarding, protocol types
- Add UDP syslog listener (port 5514) for receiving Anaconda install logs
  via native `logging --host` kickstart directive — no background processes
- Add rsyslog serial console forwarding in %post (AWS EC2 compatible ttyS0@115200n8)
- Add ProvisionStackType ("dhcpproxy" | "iso" | "cloud-init") to shared types
- Add bastion-install-log WebSocket protocol message for bastion→labd log sync
- Add syslogPort to BastionConfig (default 5514)
- Wire syslog listener into bastion startup/shutdown lifecycle

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-28 23:14:10 +00:00
Michal
cac7514014 feat: admin user 'lab' with SSH key auth (Step 7 — PASS)
Changed admin user from 'michal' to generic 'lab' user.
SSH key auth works for both root and lab user.
21/22 tests pass (1 skipped: log lines, needs log streamer redesign).

Bisection complete — all features work except background log streamer
which prevents Anaconda from syncing filesystem writes.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-28 22:30:59 +00:00
Michal
25a2beccff fix: add error trap, bastion helpers, serial console (Steps 2-5 pass)
Bisection results:
- Step 2: bastion_log/bastion_error helpers — PASS
- Step 3: ERR trap in %post — PASS
- Step 4: background log streamer — FAIL (breaks boot, NOT included)
- Step 5: serial console on ttyS0 — PASS

The background log streamer (tail -f subprocess in %post) prevents
Anaconda from properly syncing the installed filesystem. This was
the root cause of all boot failures. Will need a different approach
for real-time log streaming.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-28 22:17:47 +00:00
Michal
2a1a29c03b fix: revert kickstart to near-original baseline (Step 0 — boots clean)
Reverted install.ks.ts to near-original state from commit 64533b2.
This is the bisection baseline — 21/22 integration tests pass,
0 failed systemd services, SSH works, /boot/efi mounts.

Removed all accumulated fixes that collectively broke boot:
- ERR trap, background log streamer, bastion_log/bastion_error
- depmod rebuild, nofail on /boot/efi, SELinux autorelabel
- chcon/restorecon for /etc /var /root
- kernel-modules and dosfstools packages

Kept from current branch:
- rootpw --plaintext lab-root-pw (console debug access)
- Network-first boot order (bastion controls boot)
- Vanilla role support, rancher partition support
- Boot screenshots during SSH wait (1/sec rolling buffer)
- Test runner script (run-pxe-test.sh)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-28 20:47:34 +00:00