lab/bastion/docs/pxe-boot-debugging-2026-03-30.md
Michal 87c1a34232
docs: PXE boot debugging post-mortem — serial console root cause
Documents the 2026-03-30 debugging session: root cause (console=ttyS0
on UART-less hardware), what was tried, what was fixed, and remaining
work items.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-30 04:00:51 +01:00


PXE Boot Debugging Session — 2026-03-30

Problem

Beelink SER Mini Pro (AMD Ryzen 7 255, Radeon 780M, 64GB DDR5, 1TB NVMe) boots Fedora 43 100x slower than normal after PXE kickstart install. Every systemd boot phase takes ~30 seconds. The Anaconda installer/rescue mode boots fast on the same hardware.

Root Cause

console=ttyS0,115200n8 in kernel cmdline — added via kickstart bootloader --append during install.

This mini PC has no physical serial UART. When systemd writes to ttyS0, each log write blocks for ~30 seconds waiting for the non-existent UART hardware. Since systemd logs at every phase transition, the total boot time was 10+ minutes.

The Anaconda installer was unaffected because it uses a different init flow that doesn't go through the same systemd phase transitions.
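As a rough illustration of guarding against this, the sketch below scans a kernel command line for serial console arguments and can strip them. Both function names are hypothetical and are not part of the bastion codebase; the check is purely textual and does not probe for a real UART.

```typescript
// Hypothetical helpers: illustrative only, not taken from the bastion code.

// Matches console=ttyS<N> with or without trailing options
// such as ",115200n8".
const SERIAL_CONSOLE = /^console=ttyS\d+(,|$)/;

// Returns the serial console arguments found, so a caller can warn
// before provisioning hardware that has no UART.
function findSerialConsoles(cmdline: string): string[] {
  return cmdline.split(/\s+/).filter((arg) => SERIAL_CONSOLE.test(arg));
}

// Removes serial console arguments and keeps every other argument
// in its original order.
function stripSerialConsole(cmdline: string): string {
  return cmdline
    .split(/\s+/)
    .filter((arg) => arg.length > 0 && !SERIAL_CONSOLE.test(arg))
    .join(" ");
}
```

A non-serial argument such as console=tty0 is left untouched, so the local display keeps its console.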

How We Found It

Hours of systematic elimination:

| What we tried | Result | Ruled out |
|---|---|---|
| modprobe.blacklist=amdgpu | No change | GPU driver |
| amd_iommu=off | No change | IOMMU |
| Rebuild initramfs without plymouth/drm/fips | No change | Initramfs bloat |
| systemd-boot instead of GRUB | Still slow | Bootloader |
| PXE-boot kernel+initrd (skip local GRUB entirely) | Still slow | Local bootloader/firmware |
| Disable TPM in BIOS | No change | TPM |
| Remove resume= + resume dracut module | No change | Hibernate resume |
| Manual LVM activation in rescue shell | Fast | NVMe/LVM themselves |
| Remove console=ttyS0,115200n8 from GRUB | FAST BOOT | This was it |

The key breakthrough was noticing that the timestamps showed exactly 30-second gaps between boot phases, which is a timeout pattern rather than general slowness. The next step was realising that the serial console had been added during install and the system had never been tested without it.
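The timestamp check can be automated in a small way. The sketch below is a hypothetical helper (not part of the project) that takes boot-log timestamps in seconds, buckets the large gaps to the nearest second, and reports the most common gap if it repeats. The thresholds `minGapSeconds` and `minRepeats` are assumptions chosen for illustration.

```typescript
// Hypothetical helper for spotting timeout-shaped gaps in a boot log.
// `timestamps` are seconds since boot, in ascending order.
function findRepeatedGap(
  timestamps: number[],
  minGapSeconds = 5,
  minRepeats = 3,
): { gapSeconds: number; count: number } | null {
  // Count large gaps, bucketed to the nearest whole second.
  const buckets = new Map<number, number>();
  let prev: number | undefined;
  for (const t of timestamps) {
    const gap = prev === undefined ? 0 : t - prev;
    prev = t;
    if (gap < minGapSeconds) continue;
    const rounded = Math.round(gap);
    buckets.set(rounded, (buckets.get(rounded) ?? 0) + 1);
  }

  // Pick the most frequent bucket.
  let best: { gapSeconds: number; count: number } | null = null;
  for (const [gapSeconds, count] of Array.from(buckets.entries())) {
    if (best === null || count > best.count) best = { gapSeconds, count };
  }

  // A gap that repeats is more likely a timeout than ordinary slowness.
  return best !== null && best.count >= minRepeats ? best : null;
}
```

A result such as a 30-second gap seen several times points at a timeout source rather than a performance problem.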

What Was Fixed (PR #4, merged)

1. Removed serial console from kickstart

  • Removed console=ttyS0,115200n8 from bootloader --append
  • Removed serial-getty@ttyS0.service enablement
  • Removed rsyslog serial forwarding

2. Enabled Anaconda syslog forwarding

  • Uncommented logging --host --port directive in kickstart
  • Bastion's SyslogListener was already built — just needed IP→MAC resolution improvement
  • Added registerIp() calls from kickstart fetch and progress callbacks
  • Added syslog listener unit tests
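The IP→MAC resolution can be pictured as a small registry: HTTP requests that identify a machine record its current IP, and the syslog listener looks up the sender IP later. The sketch below is an assumption about the shape of that idea. The class name, the `resolve` method, and the `registerIp` signature are hypothetical, since the real SyslogListener API is not shown here.

```typescript
// Hypothetical sketch: the real SyslogListener API is not shown in this
// document, so the names and signatures here are illustrative.
class IpToMacRegistry {
  private readonly byIp = new Map<string, string>();

  // Called whenever a machine identifies itself over HTTP, for example
  // when it fetches its kickstart or posts a progress callback.
  registerIp(ip: string, mac: string): void {
    // A later registration for the same IP replaces the earlier one.
    this.byIp.set(ip, mac.toLowerCase());
  }

  // Returns the MAC for a syslog sender, or undefined if that IP has
  // not registered yet.
  resolve(ip: string): string | undefined {
    return this.byIp.get(ip);
  }
}
```

Registering from both the kickstart fetch and the progress callbacks means the mapping is refreshed several times during an install, so a syslog packet is unlikely to arrive from an unknown address.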

3. Fixed disk auto-detection

  • Default disk changed from /dev/sda to "" (auto-detect) in labd route and bastion command handler
  • The kickstart %pre auto-detect logic probes nvme0n1, sda, sdb, vda in order
  • Without this fix, NVMe-only machines (like the SER Mini Pro) fail immediately
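The combined behaviour (explicit disk wins, empty string triggers probing) can be sketched as below. The real probing happens in the kickstart %pre script; this TypeScript version is only an illustration of the same order, and `chooseInstallDisk` and its injected `exists` predicate are hypothetical names.

```typescript
// Probe order taken from the kickstart %pre logic described above.
const DISK_CANDIDATES = ["nvme0n1", "sda", "sdb", "vda"];

// Hypothetical helper. `exists` is injected so the logic can be tested
// without touching /dev. `configured` is the operator-supplied disk,
// where an empty string means "auto-detect".
function chooseInstallDisk(
  configured: string,
  exists: (device: string) => boolean,
): string | null {
  if (configured !== "") return configured;
  for (const name of DISK_CANDIDATES) {
    const device = `/dev/${name}`;
    if (exists(device)) return device;
  }
  // Nothing matched: the caller should fail with a clear message.
  return null;
}
```

On an NVMe-only machine, `chooseInstallDisk("", (d) => d === "/dev/nvme0n1")` returns `/dev/nvme0n1`, whereas a hard-coded `/dev/sda` default would point at a device that does not exist.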

4. SysRq magic keys

  • Added kernel.sysrq=1 sysctl to kickstart %post
  • Enables Alt+SysRq+REISUB via JetKVM for emergency reboot of stuck machines

5. Simplified debug command

  • Removed --sshd flag (SSH always available via inst.sshd + sshpw in rescue mode)
  • Added /debug-setup.sh HTTP endpoint for nc listener setup from rescue shell
  • Cleaned up sshd field from DebugConfig, protocol types, all routes

6. Added labctl provision logs -f

  • Follow mode with 5-second polling for real-time install monitoring
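One polling step of a follow mode can be sketched as a pure function: given the full log fetched on this poll and a cursor of what was already printed, return only the new lines. This is a hypothetical illustration, not labctl's implementation; `takeNewLines`, `FollowCursor`, and the reset-on-shrink behaviour are assumptions.

```typescript
// Hypothetical sketch of one follow-mode polling step; not labctl's code.
const POLL_INTERVAL_MS = 5000; // the 5-second interval mentioned above

interface FollowCursor {
  seen: number; // number of log lines already printed
}

function takeNewLines(
  allLines: string[],
  cursor: FollowCursor,
): { fresh: string[]; cursor: FollowCursor } {
  // If the log shrank (for example a new install started), start over.
  const start = allLines.length < cursor.seen ? 0 : cursor.seen;
  return { fresh: allLines.slice(start), cursor: { seen: allLines.length } };
}
```

A caller would fetch the log, print `fresh`, keep the returned cursor, wait `POLL_INTERVAL_MS`, and repeat until interrupted.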

What Works

  • PXE discovery → install → boot — full flow works end-to-end
  • Anaconda syslog forwarding — install logs stream to bastion
  • Progress callbacks — stage-by-stage install tracking via curl
  • Auto disk detection — works for NVMe and SATA
  • Debug rescue mode: labctl provision debug <target> boots Anaconda rescue with SSH
  • Network-first boot order — bastion controls every reboot via efibootmgr
  • SysRq keys — emergency reboot via JetKVM keyboard

What Doesn't Work / Known Issues

  • --sshd in rescue mode — Anaconda rescue mode skips both %pre and %post kickstart sections. inst.sshd + sshpw should provide SSH access, but hasn't been verified end-to-end yet. The /debug-setup.sh curl workaround exists for nc.
  • arm64 container build — iPXE cross-compilation fails on arm64 (GCC flag incompatibility). Workaround: build with --platforms linux/amd64 only.
  • Integration test SSH timeout — VM boots fine but SSH times out due to libvirt nftables reject rules after VM restart. Test infrastructure issue, not a code bug.

What Was Skipped / Left To Do

  1. Syslog UDP port in k3s — works because bastion uses hostNetwork: true, but should be documented properly
  2. Background log streamer — the old tail -f approach broke Anaconda filesystem sync. Replaced with syslog forwarding. If more granular %post logging is needed, a synchronous log push at end of %post would be safe.
  3. Per-machine hardware overrides — turned out not to be needed (serial console was the only "special" setting, and removing it is universal)
  4. Ubuntu autoinstall disk default: ubuntu-autoinstall.ts still has a disk || "/dev/sda" fallback (line 38), which should be changed to auto-detect
  5. Verify inst.sshd works in rescue mode — test SSH with password "debug" next time debug mode is used
  6. Re-enable TPM in BIOS: it was disabled during debugging and should be restored (the user plans to reset the BIOS to factory defaults)

Key Learnings

  1. console=ttyS0 on hardware without UART = 30s timeout per boot phase. Never add serial console to kernel cmdline unless the hardware has a verified physical UART.
  2. Exactly-N-second gaps in boot logs = timeout, not slowness. Look for the timeout source, not performance issues.
  3. Systematic elimination works, but slowly. Removing one suspect at a time found the root cause, yet it took hours because the serial console had been added early and seemed harmless.
  4. Anaconda rescue mode is limited. It skips %pre and %post, so you can't automate setup via kickstart. Use inst.sshd + sshpw for SSH, and serve helper scripts via HTTP for everything else.
  5. Default disk paths break NVMe machines. Always default to auto-detect (empty string) rather than /dev/sda.