From 87c1a342322119aea54fedd689b6992469016e66 Mon Sep 17 00:00:00 2001
From: Michal
Date: Mon, 30 Mar 2026 04:00:51 +0100
Subject: [PATCH] =?UTF-8?q?docs:=20PXE=20boot=20debugging=20post-mortem=20?=
 =?UTF-8?q?=E2=80=94=20serial=20console=20root=20cause?=
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Documents the 2026-03-30 debugging session: root cause (console=ttyS0 on
UART-less hardware), what was tried, what was fixed, and remaining work items.

Co-Authored-By: Claude Opus 4.6 (1M context)
---
 bastion/docs/pxe-boot-debugging-2026-03-30.md | 91 +++++++++++++++++++
 1 file changed, 91 insertions(+)
 create mode 100644 bastion/docs/pxe-boot-debugging-2026-03-30.md

diff --git a/bastion/docs/pxe-boot-debugging-2026-03-30.md b/bastion/docs/pxe-boot-debugging-2026-03-30.md
new file mode 100644
index 0000000..144bd47
--- /dev/null
+++ b/bastion/docs/pxe-boot-debugging-2026-03-30.md
@@ -0,0 +1,91 @@
+# PXE Boot Debugging Session — 2026-03-30
+
+## Problem
+Beelink SER Mini Pro (AMD Ryzen 7 255, Radeon 780M, 64GB DDR5, 1TB NVMe) boots Fedora 43 100x slower than normal after PXE kickstart install. Every systemd boot phase takes ~30 seconds. The Anaconda installer/rescue mode boots fast on the same hardware.
+
+## Root Cause
+**`console=ttyS0,115200n8` in kernel cmdline** — added via kickstart `bootloader --append` during install.
+
+This mini PC has **no physical serial UART**. When systemd writes to ttyS0, each log write blocks for ~30 seconds waiting for the non-existent UART hardware. Since systemd logs at every phase transition, the total boot time was 10+ minutes.
+
+The Anaconda installer was unaffected because it uses a different init flow that doesn't go through the same systemd phase transitions.
+
+## How We Found It
+Hours of systematic elimination:
+
+| What we tried | Result | Ruled out |
+|---|---|---|
+| `modprobe.blacklist=amdgpu` | No change | GPU driver |
+| `amd_iommu=off` | No change | IOMMU |
+| Rebuild initramfs without plymouth/drm/fips | No change | Initramfs bloat |
+| systemd-boot instead of GRUB | Still slow | Bootloader |
+| PXE-boot kernel+initrd (skip local GRUB entirely) | Still slow | Local bootloader/firmware |
+| Disable TPM in BIOS | No change | TPM |
+| Remove `resume=` + resume dracut module | No change | Hibernate resume |
+| Manual LVM activation in rescue shell | **Fast** | NVMe/LVM themselves |
+| Remove `console=ttyS0,115200n8` from GRUB | **FAST BOOT** | **This was it** |
+
+The key breakthrough was noticing the timestamps showed **exactly 30-second gaps** between boot phases — a timeout pattern, not general slowness. Then realising the serial console was added during install and had never been tested without.
+
+## What Was Fixed (PR #4, merged)
+
+### 1. Removed serial console from kickstart
+- Removed `console=ttyS0,115200n8` from `bootloader --append`
+- Removed `serial-getty@ttyS0.service` enablement
+- Removed rsyslog serial forwarding
+
+### 2. Enabled Anaconda syslog forwarding
+- Uncommented `logging --host --port` directive in kickstart
+- Bastion's SyslogListener was already built — just needed IP→MAC resolution improvement
+- Added `registerIp()` calls from kickstart fetch and progress callbacks
+- Added syslog listener unit tests
+
+### 3. Fixed disk auto-detection
+- Default disk changed from `/dev/sda` to `""` (auto-detect) in labd route and bastion command handler
+- The kickstart `%pre` auto-detect logic probes nvme0n1, sda, sdb, vda in order
+- Without this fix, NVMe-only machines (like the SER Mini Pro) fail immediately
+
+### 4. SysRq magic keys
+- Added `kernel.sysrq=1` sysctl to kickstart `%post`
+- Enables Alt+SysRq+REISUB via JetKVM for emergency reboot of stuck machines
+
+### 5. Simplified debug command
+- Removed `--sshd` flag (SSH always available via `inst.sshd` + `sshpw` in rescue mode)
+- Added `/debug-setup.sh` HTTP endpoint for nc listener setup from rescue shell
+- Cleaned up `sshd` field from DebugConfig, protocol types, all routes
+
+### 6. Added `labctl provision logs -f`
+- Follow mode with 5-second polling for real-time install monitoring
+
+## What Works
+
+- **PXE discovery → install → boot** — full flow works end-to-end
+- **Anaconda syslog forwarding** — install logs stream to bastion
+- **Progress callbacks** — stage-by-stage install tracking via curl
+- **Auto disk detection** — works for NVMe and SATA
+- **Debug rescue mode** — `labctl provision debug ` boots Anaconda rescue with SSH
+- **Network-first boot order** — bastion controls every reboot via efibootmgr
+- **SysRq keys** — emergency reboot via JetKVM keyboard
+
+## What Doesn't Work / Known Issues
+
+- **`--sshd` in rescue mode** — Anaconda rescue mode skips both `%pre` and `%post` kickstart sections. `inst.sshd` + `sshpw` should provide SSH access, but hasn't been verified end-to-end yet. The `/debug-setup.sh` curl workaround exists for nc.
+- **arm64 container build** — iPXE cross-compilation fails on arm64 (GCC flag incompatibility). Workaround: build with `--platforms linux/amd64` only.
+- **Integration test SSH timeout** — VM boots fine but SSH times out due to libvirt nftables reject rules after VM restart. Test infrastructure issue, not a code bug.
+
+## What Was Skipped / Left To Do
+
+1. **Syslog UDP port in k3s** — works because bastion uses `hostNetwork: true`, but should be documented properly
+2. **Background log streamer** — the old `tail -f` approach broke Anaconda filesystem sync. Replaced with syslog forwarding. If more granular %post logging is needed, a synchronous log push at end of %post would be safe.
+3. **Per-machine hardware overrides** — turned out not to be needed (serial console was the only "special" setting, and removing it is universal)
+4. **Ubuntu autoinstall disk default** — `ubuntu-autoinstall.ts` still has `disk || "/dev/sda"` fallback (line 38), should be changed to auto-detect
+5. **Verify `inst.sshd` works in rescue mode** — test SSH with password "debug" next time debug mode is used
+6. **Re-enable TPM in BIOS** — was disabled during debugging, should be factory-reset (user plans to reset BIOS to factory)
+
+## Key Learnings
+
+1. **`console=ttyS0` on hardware without UART = 30s timeout per boot phase.** Never add serial console to kernel cmdline unless the hardware has a verified physical UART.
+2. **Exactly-N-second gaps in boot logs = timeout, not slowness.** Look for the timeout source, not performance issues.
+3. **The bisection approach works.** Systematically removing features one at a time found the root cause. But it took hours because the serial console was added early and seemed harmless.
+4. **Anaconda rescue mode is limited.** It skips `%pre` and `%post`, so you can't automate setup via kickstart. Use `inst.sshd` + `sshpw` for SSH, and serve helper scripts via HTTP for everything else.
+5. **Default disk paths break NVMe machines.** Always default to auto-detect (empty string) rather than `/dev/sda`.
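
Note on key learning 2: the "exactly-N-second gaps" heuristic can be checked mechanically instead of by eye. The following is a minimal sketch, not part of this patch or of the bastion codebase. It assumes log lines carry the `[seconds.micros]` monotonic prefix printed by `dmesg` and `journalctl -o short-monotonic`. The function names, the thresholds, and the sample log lines are all made up for illustration.

```python
import re
from collections import Counter

# Matches the "[   12.345678]" monotonic prefix at the start of a log line.
STAMP = re.compile(r"^\[\s*(\d+\.\d+)\]")

# Illustrative sample only; these are not logs captured from the real machine.
SAMPLE = """\
[    1.200000] systemd[1]: Reached target A.
[   31.250000] systemd[1]: Reached target B.
[   61.310000] systemd[1]: Reached target C.
[   61.900000] systemd[1]: Starting unit D.
[   91.950000] systemd[1]: Reached target E.
""".splitlines()


def parse_stamps(lines):
    """Return the monotonic timestamps (in seconds) found at the start of each line."""
    stamps = []
    for line in lines:
        match = STAMP.match(line)
        if match:
            stamps.append(float(match.group(1)))
    return stamps


def repeated_gaps(stamps, min_gap=5.0, tolerance=0.2, min_repeats=3):
    """Map whole-second gap lengths to how often they recur between consecutive stamps.

    A gap is counted only if it is at least min_gap seconds long and lies
    within tolerance seconds of a whole number. Lengths seen fewer than
    min_repeats times are dropped from the result.
    """
    counts = Counter()
    for earlier, later in zip(stamps, stamps[1:]):
        gap = later - earlier
        nearest = round(gap)
        if gap >= min_gap and abs(gap - nearest) <= tolerance:
            counts[nearest] += 1
    return {length: n for length, n in counts.items() if n >= min_repeats}


if __name__ == "__main__":
    print(repeated_gaps(parse_stamps(SAMPLE)))
```

A non-empty result, such as a 30-second gap recurring several times, suggests looking for a timeout source rather than profiling for general slowness. The thresholds are guesses and would need tuning against real boot logs.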