The v2.0 Phase 1 commit (04faa07) introduced the @lab/core package but
the labd Dockerfile still only copied @lab/shared and @lab/labd, so the
container build would fail to resolve @lab/core imports.
Both stages updated:
- Builder: copy @lab/core package.json/tsconfig + src, add it to the
build order between @lab/shared and @lab/labd.
- Runtime: copy @lab/core dist and package.json into the final image.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The worker0-k8s0 bug: when labd restarted, the in-memory installed map
was lost. The next DHCP/PXE re-discovery for that MAC then ran an upsert
that wrote status="discovered", silently downgrading the DB record from
"online" or "offline" and erasing the machine's known hostname/role
identity from the CLI view.
- server.ts: drop status="discovered" from the upsert update branch so
re-discovery cannot downgrade an installed record.
- routes/bastions.ts (/api/machines): when the DB knows a real
hostname+role for a MAC currently only in live.discovered, promote
it back to live.installed so the CLI sees the right state. Also
reordered the live-vs-DB fallback so DB online/offline maps to
live.installed and the discovered branch is the else.
- tests: 3 new vitest cases covering promotion, fresh-discovery, and
unknown-MAC fallback.
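
A minimal sketch of the promotion and fallback ordering described above. The `DbRecord` shape, the `LiveBucket` type, and the `resolveBucket` helper are hypothetical names for illustration; the real labd types and route code are not shown in this commit.

```typescript
// Hypothetical shapes; the actual labd/Prisma types may differ.
type DbStatus = "discovered" | "online" | "offline";
type LiveBucket = "installed" | "discovered";

interface DbRecord {
  mac: string;
  status: DbStatus;
  hostname?: string;
  role?: string;
}

// Decide which bucket a MAC should be reported in by /api/machines.
function resolveBucket(
  live: LiveBucket | undefined,
  db: DbRecord | undefined,
): LiveBucket {
  // Live state already says installed: nothing to fix up.
  if (live === "installed") return "installed";
  // Promotion: the bastion only re-discovered the MAC, but the DB
  // still knows a real hostname and role for it.
  if (live === "discovered" && db?.hostname && db?.role) return "installed";
  if (live === "discovered") return "discovered";
  // No live state at all: fall back to the DB. online/offline map to
  // installed, and discovered is the else branch.
  if (db && (db.status === "online" || db.status === "offline")) {
    return "installed";
  }
  return "discovered";
}
```

For example, `resolveBucket("discovered", { mac, status: "online", hostname: "worker0-k8s0", role: "worker" })` yields `"installed"`, while a MAC with no DB record stays `"discovered"`.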
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two changes prompted by today's etcd raft panic on worker1-k8s0
(tocommit out of range, lost-write on follower) and the cascading
disk pressure that surfaced underneath it.
Audit logs to journald
- kube-apiserver now uses audit-log-path=- so audit events flow to
k3s.service stdout and into journald instead of growing files in
/var/log/kubernetes. The previous setup combined apiserver's
internal rotation with a logrotate *.log glob that double-rotated
the rotated files into permanent orphans (observed: 7+ GB).
- New journald-limits operation writes a SystemMaxUse=2G drop-in so
audit volume cannot fill /var/log even under bursty load.
- log-rotation operation repurposed to decommission the obsolete
logrotate rule and reap leftover audit files. Idempotent: no-op
on fresh installs.
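
A sketch of what the journald drop-in could look like when rendered from code. The `renderJournaldLimits` helper and the `90-lab-limits.conf` file name are assumptions; only the `SystemMaxUse=2G` setting comes from this commit.

```typescript
// Assumed location; journald reads drop-ins from /etc/systemd/journald.conf.d/,
// but the file name used by the journald-limits operation is not stated.
const DROP_IN_PATH = "/etc/systemd/journald.conf.d/90-lab-limits.conf";

// Render a drop-in that caps total persistent journal size.
function renderJournaldLimits(maxUse: string = "2G"): string {
  return ["[Journal]", `SystemMaxUse=${maxUse}`, ""].join("\n");
}
```

Writing the same content on every run keeps the operation idempotent; journald needs a restart to pick up the new limit.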
Etcd member recovery
- New recoverEtcdMember(broken, peer, hostname) codifies the
documented k3s recovery: stop k3s, etcdctl member remove, wipe
/var/lib/rancher/k3s/server/{db,tls,cred}, restart, poll for
rejoin. Refuses to operate when cluster size < 3 to preserve
quorum.
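
The recovery sequence can be sketched as a plan of commands guarded by the quorum check. `planEtcdRecovery`, `RecoveryStep`, and the choice of which host runs each command are illustrative assumptions, not the signature of `recoverEtcdMember`. The etcdctl endpoint and certificate flags are omitted, and polling for rejoin is left to the caller.

```typescript
interface RecoveryStep {
  host: "broken" | "peer"; // assumed split: which node runs the command
  command: string;
}

const K3S_SERVER_DIR = "/var/lib/rancher/k3s/server";

function planEtcdRecovery(clusterSize: number, memberId: string): RecoveryStep[] {
  // Removing a member from a cluster smaller than 3 would cost quorum.
  if (clusterSize < 3) {
    throw new Error(`refusing recovery: cluster size ${clusterSize} < 3`);
  }
  const wipe = ["db", "tls", "cred"].map((d) => `${K3S_SERVER_DIR}/${d}`).join(" ");
  return [
    { host: "broken", command: "systemctl stop k3s" },
    { host: "peer", command: `etcdctl member remove ${memberId}` },
    { host: "broken", command: `rm -rf ${wipe}` },
    { host: "broken", command: "systemctl start k3s" },
  ];
}
```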
Tests
- 7 new unit tests covering both decommission paths and the
recovery procedure (54 total, all green).
- install.test.ts asserts the file-based audit args are gone.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- Add `labctl provision recheck` to refresh hardware info via SSH
- Preserve hardware info in InstalledInfo when install completes
- Fix /ks-auto: run nested %pre scripts from included kickstarts
- Add command-discover WebSocket routing for hw info updates
- Fix k3s join: clean stale TLS/cred when joining existing cluster
- Add --tls-verify=false for internal HTTP registry pushes
- Add fix-ssh-root.sh script for root SSH access on all nodes
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Add state-aware kickstart dispatch for machines that boot from ISO
(no PXE/network at UEFI level). Replaces hardcoded discover.ks.
- /ks-auto: %pre detects MAC, queries /api/machine-state/<mac>,
writes discover or install kickstart to /tmp/dynamic.ks,
main body %include's it
- /api/machine-state/<mac>: simple state endpoint returning
unknown|discovered|queued|installing|installed|debug
- ISO kernel cmdline updated: discover.ks → ks-auto
- Handles: discovery (first boot), install (queued), debug modes
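
The dispatch decision can be sketched as a pure function over the state returned by /api/machine-state/<mac>. The real logic runs in the kickstart %pre shell script; `selectKickstart` and the handling of every state other than `queued` and `debug` are assumptions made for illustration.

```typescript
type MachineState =
  | "unknown"
  | "discovered"
  | "queued"
  | "installing"
  | "installed"
  | "debug";

type KickstartKind = "discover" | "install" | "debug";

function selectKickstart(state: MachineState): KickstartKind {
  switch (state) {
    case "queued":
      return "install"; // a queued machine gets the install kickstart
    case "debug":
      return "debug";
    default:
      // Assumption: anything else falls back to discovery, which is
      // also the first-boot path for unknown machines.
      return "discover";
  }
}
```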
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Skip Cilium install for joining servers (already in cluster via daemonset)
- Longhorn annotation for workers: SSH to server node from CLI to apply
kubectl annotation (workers don't have kubectl access)
- Default SSH user for k3s/app commands changed to 'root' (operations
need root privileges, using 'lab' user broke installs)
- k3s server config: cluster-init for initial server, server+token for joins
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Server config now uses cluster-init: true for initial server (enables
embedded etcd). Joining servers get server: + token: in config.
- Cilium install already checks for existing installation, so joining
servers skip it gracefully (the "release name in use" error is non-fatal)
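
A sketch of the two config shapes, assuming a hypothetical `renderK3sServerConfig` helper. The `cluster-init`, `server`, and `token` keys are the k3s config file options named above; the real generator likely writes more keys than this.

```typescript
interface K3sServerOptions {
  joinUrl?: string; // e.g. https://<initial-server>:6443; absent for the initial server
  token?: string;
}

function renderK3sServerConfig(opts: K3sServerOptions): string {
  if (opts.joinUrl) {
    if (!opts.token) throw new Error("a joining server needs a token");
    // JSON strings are valid YAML double-quoted scalars.
    return [
      `server: ${JSON.stringify(opts.joinUrl)}`,
      `token: ${JSON.stringify(opts.token)}`,
      "",
    ].join("\n");
  }
  // Initial server: enable embedded etcd.
  return "cluster-init: true\n";
}
```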
Cluster rebuilt as etcd HA:
worker0-k8s0 control-plane,etcd (initial server, cluster-init)
worker1-k8s0 control-plane,etcd (joined server, Mac Studio aarch64)
spark-2935 worker (DGX Spark, aarch64)
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Replace sed with grep -v / awk for fstab manipulation (Asahi Fedora's
sed doesn't support \| delimiter or \? quantifier)
- Use idempotent write_lab_fstab function: removes all old entries first,
comments out conflicting btrfs subvol entries, adds fresh LVM entries
- Fix sed for SSH hardening: use #* instead of \? (POSIX compatible)
- Tested on Mac Studio: no duplicate fstab entries after multiple runs
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Firstboot script defaults admin user to 'lab' instead of bastion's
config.adminUser (which was 'michal' from host system)
- iSCSI OS detection uses case-insensitive match for 'fedora'
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The firstboot script now auto-detects hostname (from hostnamectl) and
MAC address (from first UP interface) at runtime. No URL query parameters
required — just `curl bastion/asahi/firstboot.sh | sudo bash`.
Fixes the shell escaping issue where `&` in query params broke curl piping.
Updated labctl provision asahi instructions accordingly.
Tested on Mac Studio (worker1-k8s0): hostname, MAC, and bastion
registration all auto-detected correctly.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Previously the reprovision path exited early after re-mounting LVs,
skipping hostname setup, admin user creation, metadata, and bastion
registration. Now both paths fall through to the common post-setup code.
Tested on Mac Studio (worker1-k8s0) — reprovision + self-registration
confirmed working via curl | bash pipe.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The Asahi installer's urlcache.py fails with AssertionError on macOS
when streaming ZIP via HTTP Range requests from Fastify. Fix: download
the ZIP with curl first (reliable on macOS), then set REPO_BASE to the
local directory so the installer opens it as a local file.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
k3s host prep:
- Add iSCSI initiator install+enable (Fedora: iscsi-initiator-utils,
Ubuntu: open-iscsi) — required by Longhorn
- Add Longhorn disk label to k3s server+agent configs
- Add Longhorn disk annotation operation in post-install hardening
CLI:
- Add `labctl provision asahi` command with interactive install guide
- Change default SSH user from "michal" to "lab" in all commands
- Change admin user in bastion progress callback to "lab"
Asahi provisioning fixes:
- Download installer_data.json locally (installer reads it as file)
- Use REPO_BASE to serve upstream ZIP from bastion (LAN speed)
- Fix ZIP32 vs ZIP64: serve original upstream ZIP unmodified
(our repackaged ZIP used ZIP64 which breaks Asahi urlcache)
- Add /data/asahi-repo fallback path for k3s container PVC mount
- Deploy script syncs asahi-repo to bastion pod after deployment
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Add scripts/build-asahi-rootfs.sh: downloads upstream Fedora Asahi
Remix Server, injects lab firstboot script + systemd service + SSH
keys, repackages with installer_data.json that adds LVM Data partition
- Bastion serves built artifacts at /asahi/repo/* via fastify-static
- installer_data.json prefers built config, falls back to minimal
- Fix __dirname crash in ESM module (use import.meta.url)
- Fix smoke test timeout (was crashing due to __dirname)
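
For reference, the usual ESM replacement for `__dirname` looks roughly like this. `moduleDir` is a hypothetical helper name; the actual fix may inline the two calls.

```typescript
import { dirname } from "node:path";
import { fileURLToPath } from "node:url";

// Resolve the directory of a module from its file:// URL.
function moduleDir(moduleUrl: string): string {
  return dirname(fileURLToPath(moduleUrl));
}
```

Inside an ES module this would be called as `moduleDir(import.meta.url)`.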
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
VM-based end-to-end test using Fedora cloud image with two disks:
root (20GB) + data (200GB). Verifies the firstboot script creates
labvg with correct LV sizes, mounts volumes, migrates /home content,
sets hostname, creates admin user, and handles reprovision.
Fixes to firstboot script:
- Detect whole disks (not just partitions) for LVM PV
- Handle btrfs subvolume paths in root device detection
- Copy /home content before mounting LV (preserves SSH keys)
- Don't restart sshd (config takes effect on reboot)
- Make swapon and mount operations resilient to failures
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Add bastion endpoints for provisioning Apple Silicon machines via the
Asahi Linux installer with custom LVM partitioning:
- GET /asahi — wrapper script (curl bastion:8080/asahi | sh)
- GET /asahi/installer_data.json — custom partition layout (60GB root + LVM data)
- GET /asahi/firstboot.sh — first-boot LVM setup matching kickstart layout
- GET /asahi/firstboot.service — systemd oneshot unit
The firstboot script creates labvg with role-specific LVs (var, varlog,
home, srv, rancher, longhorn) and handles reprovision by detecting
existing VGs. Includes 19 new tests.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Remove hardcoded devices/directRoutingDevice from Cilium install (let
Cilium auto-detect per node — needed for heterogeneous NICs like eno1 vs enP7s7)
- Set k8sServiceHost=127.0.0.1 k8sServicePort=6444 so Cilium init
containers can reach the API via k3s agent's local LB proxy
- Add node-role.kubernetes.io/worker label to agent config
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Add `labctl provision register` to re-add machines to installed state
without reprovisioning (e.g. after bastion state loss). Full stack:
protocol type, bastion API + WS handler, labd route, CLI command.
Add `labctl app k3s kubeconfig <target>` to fetch kubeconfig from a
k3s node via SSH, rewrite server URL, and merge into ~/.kube/config.
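
A sketch of the server URL rewrite step only. `rewriteServerUrl` is a hypothetical helper, and the merge into ~/.kube/config is not shown. It relies on the kubeconfig generated by k3s pointing at https://127.0.0.1:6443, which is not reachable from the CLI host.

```typescript
// Replace the host in every `server:` line with the node's address.
function rewriteServerUrl(kubeconfig: string, host: string, port: number = 6443): string {
  return kubeconfig.replace(
    /^([ \t]*server:[ \t]*)https:\/\/\S+$/gm,
    (_match: string, prefix: string) => `${prefix}https://${host}:${port}`,
  );
}
```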
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Add UserKnownHostsFile=/dev/null to SSH in debug and reprovision commands
- Track install state in log follower so it doesn't exit prematurely on "installed"
- Reorder bastion status check to prioritize active queue over stale installed state
- Update .gitignore with task file entries
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Covers all components (bastion, labd, labctl, agent, modules),
data flow, machine lifecycle, disk layout, kickstart features,
deployment, testing, security, known issues, and planned work.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Documents the 2026-03-30 debugging session: root cause (console=ttyS0
on UART-less hardware), what was tried, what was fixed, and remaining
work items.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Root cause found: console=ttyS0,115200n8 causes a 30-second timeout at every
systemd boot phase on hardware without a physical serial UART. Each phase
transition blocks waiting for the non-existent UART.
Changes:
- Remove console=ttyS0 from kickstart bootloader args and %post setup
- Enable Anaconda syslog forwarding (logging --host --port) for install visibility
- Improve syslog IP→MAC resolution (register from kickstart fetch + progress)
- Fix disk auto-detect: default to empty string (not /dev/sda) for NVMe support
- Enable SysRq magic keys (kernel.sysrq=1) for emergency reboot via JetKVM
- Simplify debug command: remove --sshd flag (inst.sshd always available),
add /debug-setup.sh HTTP endpoint for nc listener setup
- Add labctl provision logs -f (follow mode with polling)
- Add syslog listener unit tests
- Enable syslog log capture test in integration suite
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Loads kernel+initrd from bastion HTTP server, mounts root from local
NVMe. Workaround for UEFI firmware bugs that make local disk boot
100x slower. One-time use, auto-clears after boot.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
When using `labctl provision debug <target> --sshd`, the rescue
kickstart generates host keys, starts sshd (pw: debug) and nc
listener (port 2323), and reports the IP back to bastion via
/api/progress callback. Fully self-contained, no mounted FS needed.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
New `labctl provision debug <target>` command that PXE boots a machine
into Fedora rescue mode (inst.rescue) for live debugging. Auto-clears
after one boot so next reboot returns to normal.
Adds debug state to BastionState, dispatch routing, API endpoints,
labd command routing, and CLI with rescue workflow guide.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Check sysfs device path for 'usb' to skip JetKVM virtual media which
appears as /dev/sda but is not a real install target.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
JetKVM virtual media appears as /dev/sda before NVMe initializes.
Now: wait up to 10s for disks, skip removable disks and anything
under 20GB. Fixes "ignoredisk: sda does not exist" on SER9MAX.
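
The selection rules can be sketched as a filter. The real check runs in the kickstart %pre shell script; the `DiskInfo` shape and `pickInstallDisk` are hypothetical, the 20GB threshold is read here as GiB, and the USB sysfs check and empty-string fallback are taken from the neighbouring commits.

```typescript
interface DiskInfo {
  name: string;       // e.g. "nvme0n1"
  sizeBytes: number;
  removable: boolean; // value of /sys/block/<name>/removable
  sysfsPath: string;  // resolved /sys/block/<name> device path
}

const MIN_SIZE_BYTES = 20 * 1024 * 1024 * 1024; // assumption: 20 GiB

// Return the first disk that looks like a real install target,
// or an empty string when none qualifies.
function pickInstallDisk(disks: DiskInfo[]): string {
  const candidate = disks.find(
    (d) =>
      !d.removable &&
      !d.sysfsPath.includes("usb") && // JetKVM virtual media shows up on USB
      d.sizeBytes >= MIN_SIZE_BYTES,
  );
  return candidate ? candidate.name : "";
}
```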
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
ttyS0 console output on the iPXE kernel line may cause a kernel hang on
hardware without a physical serial port. Removed from both the discover
and install iPXE scripts. Serial console stays in the bootloader config
for the installed system only.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- logging --host blocks Anaconda when syslog UDP port not reachable
- nomodeset prevents amdgpu hang on SER9MAX (Radeon 780M)
- JetKVM helper script for device control (status, reboot, power)
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Radeon 780M GPU driver initialization hangs during Anaconda boot
on SER9MAX. nomodeset disables kernel modesetting so the installer
doesn't try to initialize the GPU.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
When bastion syncs state, labd now upserts discovered and installed
machines into the Server table. /api/machines merges live bastion
state with DB records, so machines survive pod restarts.
Discovered machines get status=discovered with hardware labels.
Installed machines get status=online with hostname, role, IP.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Add console=ttyS0,115200n8 to both discover and install iPXE kernel
lines so Anaconda output is visible on serial during install phase.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
ksvalidator caught the issue: --level=info is not valid for F43.
Correct syntax is just: logging --host=<ip> --port=<port>
Also added ksvalidator syntax check to unit tests — validates
rendered kickstart for all roles (vanilla, worker, infra) against
F43 pykickstart. This catches kickstart syntax errors at test time
instead of during a 12-minute VM install.
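
A small sketch of rendering the corrected directive. `renderLoggingDirective` is a hypothetical helper and not necessarily how the kickstart template builds the line.

```typescript
// Build the kickstart logging directive without the rejected --level option.
function renderLoggingDirective(host: string, port: number): string {
  if (!Number.isInteger(port) || port < 1 || port > 65535) {
    throw new Error(`invalid syslog port: ${port}`);
  }
  return `logging --host=${host} --port=${port}`;
}
```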
Integration test passes: 21/22 (1 skipped: log lines capture).
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The kickstart `logging --host` directive stalls the Anaconda install,
likely because a firewall blocks UDP syslog or because Fedora 43 Anaconda
has issues with the directive. Commented out for now. The syslog listener
infrastructure is in place and ready once the Anaconda/firewall issue is
resolved.
Added vitest.integration.config.ts for running integration tests:
pnpm exec vitest run --config vitest.integration.config.ts
All 21 integration tests pass, serial console rsyslog forwarding works.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Changed admin user from 'michal' to generic 'lab' user.
SSH key auth works for both root and lab user.
21/22 tests pass (1 skipped: log lines, needs log streamer redesign).
Bisection complete — all features work except background log streamer
which prevents Anaconda from syncing filesystem writes.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Bisection results:
- Step 2: bastion_log/bastion_error helpers — PASS
- Step 3: ERR trap in %post — PASS
- Step 4: background log streamer — FAIL (breaks boot, NOT included)
- Step 5: serial console on ttyS0 — PASS
The background log streamer (tail -f subprocess in %post) prevents
Anaconda from properly syncing the installed filesystem. This was
the root cause of all boot failures. Will need a different approach
for real-time log streaming.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
All accumulated changes to kickstart template, test infrastructure,
and dnsmasq config. None of these produce a clean boot yet — saving
state before reverting to baseline for bisection.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- VMs get serial console on TCP (PXE: port 4555, ISO: port 4556)
- serialExec() helper: runs commands via telnet when SSH/network is down
- PXE test: on SSH failure, dumps hostname, IP, NetworkManager, sshd,
failed units, and fstab via serial console before failing
- Kickstart enables serial-getty@ttyS0 for auto-login on serial
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>