feat: PXE debug boot mode for rescue/diagnostics #4

Merged
michal merged 16 commits from wip/ks-debugging into main 2026-03-30 02:59:35 +00:00

16 Commits

Author SHA1 Message Date
Michal
0a4916d3c9 fix: remove serial console (root cause of 30s boot delay), enable syslog logging, disk auto-detect
Some checks failed
CI/CD / typecheck (pull_request) Failing after 9s
CI/CD / test (pull_request) Failing after 9s
CI/CD / lint (pull_request) Failing after 22s
CI/CD / build (pull_request) Has been skipped
CI/CD / publish-rpm (pull_request) Has been skipped
CI/CD / publish-deb (pull_request) Has been skipped
Root cause found: console=ttyS0,115200n8 causes 30-second timeout at every
systemd boot phase on hardware without a physical serial UART. Each phase
transition blocks waiting for the non-existent UART.

Changes:
- Remove console=ttyS0 from kickstart bootloader args and %post setup
- Enable Anaconda syslog forwarding (logging --host --port) for install visibility
- Improve syslog IP→MAC resolution (register from kickstart fetch + progress)
- Fix disk auto-detect: default to empty string (not /dev/sda) for NVMe support
- Enable SysRq magic keys (kernel.sysrq=1) for emergency reboot via JetKVM
- Simplify debug command: remove --sshd flag (inst.sshd always available),
  add /debug-setup.sh HTTP endpoint for nc listener setup
- Add labctl provision logs -f (follow mode with polling)
- Add syslog listener unit tests
- Enable syslog log capture test in integration suite

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-30 03:58:51 +01:00
Michal
a4a4840930 feat: debug --pxe-boot flag, boot installed system via PXE
Some checks failed
CI/CD / lint (pull_request) Failing after 10s
CI/CD / test (pull_request) Failing after 10s
CI/CD / typecheck (pull_request) Failing after 22s
CI/CD / build (pull_request) Has been skipped
CI/CD / publish-rpm (pull_request) Has been skipped
CI/CD / publish-deb (pull_request) Has been skipped
Loads kernel+initrd from bastion HTTP server, mounts root from local
NVMe. Workaround for UEFI firmware bugs that make local disk boot
100x slower. One-time use, auto-clears after boot.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-30 00:49:44 +01:00
Michal
8da947a1c3 fix: use %pre instead of %post for debug --sshd (rescue mode skips %post)
Some checks failed
CI/CD / typecheck (pull_request) Failing after 9s
CI/CD / test (pull_request) Failing after 10s
CI/CD / lint (pull_request) Failing after 23s
CI/CD / build (pull_request) Has been skipped
CI/CD / publish-rpm (pull_request) Has been skipped
CI/CD / publish-deb (pull_request) Has been skipped
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-30 00:25:19 +01:00
Michal
92c65b4672 fix: generic rescue instructions in debug command output
Some checks failed
CI/CD / typecheck (pull_request) Failing after 9s
CI/CD / test (pull_request) Failing after 9s
CI/CD / lint (pull_request) Failing after 22s
CI/CD / build (pull_request) Has been skipped
CI/CD / publish-rpm (pull_request) Has been skipped
CI/CD / publish-deb (pull_request) Has been skipped
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-29 23:59:38 +01:00
Michal
3835fefba1 feat: debug --sshd flag, auto SSH + nc listener + IP callback
Some checks failed
CI/CD / lint (pull_request) Failing after 9s
CI/CD / test (pull_request) Failing after 9s
CI/CD / typecheck (pull_request) Failing after 22s
CI/CD / build (pull_request) Has been skipped
CI/CD / publish-rpm (pull_request) Has been skipped
CI/CD / publish-deb (pull_request) Has been skipped
When using `labctl provision debug <target> --sshd`, the rescue
kickstart generates host keys, starts sshd (pw: debug) and nc
listener (port 2323), and reports the IP back to bastion via
/api/progress callback. Fully self-contained, no mounted FS needed.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-29 23:54:22 +01:00
Michal
d7a59665ad fix: route command-debug through bastion WebSocket handler
Some checks failed
CI/CD / typecheck (pull_request) Failing after 9s
CI/CD / lint (pull_request) Failing after 23s
CI/CD / test (pull_request) Failing after 6m53s
CI/CD / build (pull_request) Has been skipped
CI/CD / publish-rpm (pull_request) Has been skipped
CI/CD / publish-deb (pull_request) Has been skipped
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-29 23:01:16 +01:00
Michal
82ca93f4d7 fix: add debug field to inline BastionState in labd server
Some checks failed
CI/CD / typecheck (pull_request) Failing after 9s
CI/CD / test (pull_request) Failing after 8s
CI/CD / lint (pull_request) Failing after 22s
CI/CD / build (pull_request) Has been skipped
CI/CD / publish-rpm (pull_request) Has been skipped
CI/CD / publish-deb (pull_request) Has been skipped
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-29 22:54:02 +01:00
Michal
52150fd955 fix: add command-debug to LabdBastionMessage protocol types
Some checks failed
CI/CD / lint (pull_request) Failing after 9s
CI/CD / test (pull_request) Failing after 9s
CI/CD / typecheck (pull_request) Failing after 22s
CI/CD / build (pull_request) Has been skipped
CI/CD / publish-rpm (pull_request) Has been skipped
CI/CD / publish-deb (pull_request) Has been skipped
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-29 22:42:52 +01:00
Michal
e87edfcfbd feat: PXE debug boot mode for rescue/diagnostics
Some checks failed
CI/CD / lint (pull_request) Failing after 11s
CI/CD / test (pull_request) Failing after 9s
CI/CD / typecheck (pull_request) Failing after 22s
CI/CD / build (pull_request) Has been skipped
CI/CD / publish-rpm (pull_request) Has been skipped
CI/CD / publish-deb (pull_request) Has been skipped
New `labctl provision debug <target>` command that PXE boots a machine
into Fedora rescue mode (inst.rescue) for live debugging. Auto-clears
after one boot so next reboot returns to normal.

Adds debug state to BastionState, dispatch routing, API endpoints,
labd command routing, and CLI with rescue workflow guide.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-29 22:25:44 +01:00
Michal
6c6d5763c4 fix: skip USB-attached disks in %pre (JetKVM virtual media is SCSI-over-USB)
Check sysfs device path for 'usb' to skip JetKVM virtual media which
appears as /dev/sda but is not a real install target.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-29 12:51:44 +01:00
Michal
a7a6ad8098 fix: skip removable/USB disks in %pre, wait for NVMe init
JetKVM virtual media appears as /dev/sda before NVMe initializes.
Now: wait up to 10s for disks, skip removable disks and anything
under 20GB. Fixes "ignoredisk: sda does not exist" on SER9MAX.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-29 12:38:41 +01:00
Michal
e3523d642c fix: remove serial console from iPXE kernel args (may hang on SER9MAX)
ttyS0 console output on iPXE kernel line may cause kernel hang on
hardware without physical serial port. Removed from both discover
and install iPXE scripts. Serial console stays in bootloader config
for the installed system only.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-29 12:32:02 +01:00
Michal
5b04d3162b fix: disable logging --host (UDP not exposed), add nomodeset + JetKVM helper
- logging --host blocks Anaconda when syslog UDP port not reachable
- nomodeset prevents amdgpu hang on SER9MAX (Radeon 780M)
- JetKVM helper script for device control (status, reboot, power)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-29 11:07:48 +01:00
Michal
a14fd04947 fix: add nomodeset to iPXE kernel args (amdgpu hangs on SER9MAX)
Radeon 780M GPU driver initialization hangs during Anaconda boot
on SER9MAX. nomodeset disables kernel modesetting so the installer
doesn't try to initialize the GPU.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-29 03:01:21 +01:00
Michal
0c1e18cee1 feat: persist machine state to CockroachDB on bastion-state-sync
When bastion syncs state, labd now upserts discovered and installed
machines into the Server table. /api/machines merges live bastion
state with DB records, so machines survive pod restarts.

Discovered machines get status=discovered with hardware labels.
Installed machines get status=online with hostname, role, IP.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-29 02:34:26 +01:00
Michal
aae03d9877 fix: syslog parser TS strict null check, deploy script
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-29 00:58:00 +00:00