The v2.0 Phase 1 commit (04faa07) added AuthService, RbacService,
ResourceStore, AuditService, the bearer auth middleware, and the
v2-auth/environments/resources route files, but createApp() never
registered any of them. They sat in the codebase as dead code: a
running labd would 404 on /api/auth/login, /api/resources, /api/events,
etc.
Wiring (server.ts)
- Instantiate AuthService, RbacService, ResourceStore, AuditService at
app creation. Cast DbClient to PrismaClient (the runtime db is a real
PrismaClient; DbClient is a structural shim).
- Start the AuditService timer and register an onClose hook that stops
it on shutdown, so the last batch is never lost.
- Register v2 routes inside a Fastify scope with the bearer-auth
middleware as preHandler. v1 routes (registered on the root scope)
are unaffected so existing labd clients keep working.
AuditService (audit.ts)
- Expose flushPending() so tests can deterministically observe events
without leaning on the 5-second flush interval. Implementation
delegates to the existing private flush().
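The timer, shutdown, and flushPending() behaviour described above can be sketched roughly as follows. This is a minimal illustration, not the actual audit.ts: the class name AuditBuffer, the sink callback, and the event shape are assumptions; only flushPending(), the private flush(), and the 5-second interval come from this commit.

```typescript
// Hypothetical event shape; the real AuditService schema may differ.
type AuditEvent = { action: string; correlationId?: string; parentEventId?: string };

class AuditBuffer {
  private pending: AuditEvent[] = [];
  private timer: ReturnType<typeof setInterval> | undefined;
  // `sink` stands in for the database write (assumption).
  private readonly sink: (batch: AuditEvent[]) => Promise<void>;
  private readonly intervalMs: number;

  constructor(sink: (batch: AuditEvent[]) => Promise<void>, intervalMs = 5000) {
    this.sink = sink;
    this.intervalMs = intervalMs;
  }

  emit(event: AuditEvent): void {
    this.pending.push(event);
  }

  start(): void {
    if (this.timer) return;
    this.timer = setInterval(() => { void this.flush(); }, this.intervalMs);
  }

  // Shutdown path: stop the timer, then write whatever is still buffered.
  async stop(): Promise<void> {
    if (this.timer) clearInterval(this.timer);
    this.timer = undefined;
    await this.flush();
  }

  // Test hook: force a flush instead of waiting for the interval.
  flushPending(): Promise<void> {
    return this.flush();
  }

  private async flush(): Promise<void> {
    if (this.pending.length === 0) return;
    const batch = this.pending;
    this.pending = [];
    await this.sink(batch);
  }
}
```

A test can then call `await audit.flushPending()` right after the request under test and query the events immediately.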
Smoke tests (v2-smoke.test.ts, 11 cases)
- Bootstrap: first POST /api/auth/login with empty users creates the
admin (role=ADMIN, hashed password), returns a 64-hex token, marks
isBootstrap=true, emits an auth_bootstrap audit event. Second login
uses the normal flow. Wrong password returns 401 and audits failure.
Missing credentials returns 400.
- RBAC: missing/empty/invalid bearer tokens return 401. ADMIN role
bypasses RBAC. A non-admin with no role bindings gets 403 with
"no matching role binding". A user with an env-A binding is denied
for env-B resources.
- Audit: bootstrap event is queryable via /api/events?correlation=...
Explicit parent/child chain (shared correlationId, parentEventId)
is preserved across emits.
All 246 workspace tests pass.
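For reference, the RBAC outcomes exercised by the smoke tests can be modelled as below. This is a simplified sketch under assumptions: the User and RoleBinding shapes and the authorize() helper are hypothetical, and the real RbacService may match on more than the environment.

```typescript
// Hypothetical shapes; only the ADMIN role name and the 403 message
// are taken from the tests described above.
interface User { id: string; role: string }
interface RoleBinding { userId: string; environment: string }
type Decision = { status: 200 } | { status: 403; error: string };

function authorize(user: User, environment: string, bindings: RoleBinding[]): Decision {
  // ADMIN bypasses RBAC entirely.
  if (user.role === "ADMIN") return { status: 200 };
  // Everyone else needs a binding scoped to the requested environment.
  const allowed = bindings.some(
    (b) => b.userId === user.id && b.environment === environment,
  );
  return allowed
    ? { status: 200 }
    : { status: 403, error: "no matching role binding" };
}
```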
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The --dir default for bastion serve/stop was hardcoded to
/tmp/lab-bastion. It now reads BASTION_DIR from the environment when
set, so a deployed bastion daemon can run from a persistent directory
without callers having to pass --dir on every invocation.
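The resolution order can be sketched as follows. The helper name resolveBastionDir is hypothetical, and the precedence of an explicit --dir over BASTION_DIR is an assumption; the commit only states that the default now comes from the environment.

```typescript
const DEFAULT_BASTION_DIR = "/tmp/lab-bastion";

// `env` is passed in (rather than reading process.env directly) so the
// function is easy to test; the caller would pass process.env.
function resolveBastionDir(
  flagValue: string | undefined,
  env: Record<string, string | undefined>,
): string {
  if (flagValue) return flagValue;                 // explicit --dir (assumed to win)
  return env["BASTION_DIR"] || DEFAULT_BASTION_DIR; // env override, then old default
}
```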
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The v2.0 Phase 1 commit (04faa07) introduced the @lab/core package but
the labd Dockerfile still only copied @lab/shared and @lab/labd, so the
container build would fail to resolve @lab/core imports.
Both stages updated:
- Builder: copy @lab/core package.json/tsconfig + src, add it to the
build order between @lab/shared and @lab/labd.
- Runtime: copy @lab/core dist and package.json into the final image.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The worker0-k8s0 bug: when labd restarted, the in-memory installed map
was lost. The next DHCP/PXE re-discovery for that MAC then ran an upsert
that wrote status="discovered", silently downgrading the DB record from
"online" or "offline" and erasing the machine's known hostname/role
identity from the CLI view.
- server.ts: drop status="discovered" from the upsert update branch so
re-discovery cannot downgrade an installed record.
- routes/bastions.ts (/api/machines): when the DB knows a real
hostname+role for a MAC currently only in live.discovered, promote
it back to live.installed so the CLI sees the right state. Also
reordered the live-vs-DB fallback so DB online/offline maps to
live.installed and the discovered branch is the else.
- tests: 3 new vitest cases covering promotion, fresh-discovery, and
unknown-MAC fallback.
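The intent of both fixes, reduced to a minimal in-memory model. Everything here is illustrative: the Map stands in for the database, and the names MachineRecord, upsertOnDiscovery, and liveBucket are hypothetical rather than the real server.ts or routes/bastions.ts code.

```typescript
type Status = "discovered" | "online" | "offline";
interface MachineRecord {
  mac: string;
  status: Status;
  hostname?: string;
  role?: string;
  lastSeenIp?: string;
}

// Upsert on re-discovery: the update branch no longer touches `status`.
function upsertOnDiscovery(db: Map<string, MachineRecord>, mac: string, ip: string): MachineRecord {
  const existing = db.get(mac);
  const next: MachineRecord = existing
    ? { ...existing, lastSeenIp: ip }                  // update: keep status/hostname/role
    : { mac, status: "discovered", lastSeenIp: ip };   // create: fresh discovery
  db.set(mac, next);
  return next;
}

// Which live bucket the CLI should see for a MAC, given what the DB knows.
function liveBucket(record: MachineRecord | undefined): "installed" | "discovered" {
  if (record && record.hostname && record.role) return "installed"; // promotion
  if (record && (record.status === "online" || record.status === "offline")) return "installed";
  return "discovered"; // else branch: fresh or unknown MAC
}
```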
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two changes prompted by today's etcd raft panic on worker1-k8s0
(tocommit out of range, lost-write on follower) and the cascading
disk pressure that surfaced underneath it.
Audit logs to journald
- kube-apiserver now uses audit-log-path=- so audit events flow to
k3s.service stdout and into journald instead of growing files in
/var/log/kubernetes. The previous setup combined apiserver's
internal rotation with a logrotate *.log glob that double-rotated
the rotated files into permanent orphans (observed: 7+ GB).
- New journald-limits operation writes a SystemMaxUse=2G drop-in so
audit volume cannot fill /var/log even under bursty load.
- log-rotation operation repurposed to decommission the obsolete
logrotate rule and reap leftover audit files. Idempotent: no-op
on fresh installs.
Etcd member recovery
- New recoverEtcdMember(broken, peer, hostname) codifies the
documented k3s recovery: stop k3s, etcdctl member remove, wipe
/var/lib/rancher/k3s/server/{db,tls,cred}, restart, poll for
rejoin. Refuses to operate when cluster size < 3 to preserve
quorum.
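A rough sketch of the quorum guard and step ordering. It is not the real recoverEtcdMember: planEtcdRecovery, the Step type, the choice of which host runs each command, and the memberId parameter are assumptions, and the poll-for-rejoin phase is left out.

```typescript
// Which host each command runs on is an assumption for illustration.
interface Step { host: "broken" | "peer"; command: string }

function planEtcdRecovery(clusterSize: number, memberId: string): Step[] {
  // Removing a member from a cluster smaller than 3 would endanger quorum.
  if (clusterSize < 3) {
    throw new Error(`refusing etcd recovery: cluster size ${clusterSize} < 3`);
  }
  const base = "/var/lib/rancher/k3s/server";
  return [
    { host: "broken", command: "systemctl stop k3s" },
    { host: "peer", command: `etcdctl member remove ${memberId}` },
    { host: "broken", command: `rm -rf ${base}/db ${base}/tls ${base}/cred` },
    { host: "broken", command: "systemctl start k3s" },
  ];
}
```

Returning the plan as data keeps the ordering unit-testable without touching a real cluster.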
Tests
- 7 new unit tests covering both decommission paths and the
recovery procedure (54 total, all green).
- install.test.ts asserts the file-based audit args are gone.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Project tracking for labctl v2.0 platform design. Includes P1 (arch doc update),
P2 (SSH emergency mode, Prometheus metrics), and P3 (graph viz, import, secrets rotation)
items from the CEO and eng review sessions.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Add `labctl provision recheck` to refresh hardware info via SSH
- Preserve hardware info in InstalledInfo when install completes
- Fix /ks-auto: run nested %pre scripts from included kickstarts
- Add command-discover WebSocket routing for hw info updates
- Fix k3s join: clean stale TLS/cred when joining existing cluster
- Add --tls-verify=false for internal HTTP registry pushes
- Add fix-ssh-root.sh script for root SSH access on all nodes
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Add state-aware kickstart dispatch for machines that boot from ISO
(no PXE/network at UEFI level). Replaces hardcoded discover.ks.
- /ks-auto: %pre detects MAC, queries /api/machine-state/<mac>,
writes discover or install kickstart to /tmp/dynamic.ks,
main body %include's it
- /api/machine-state/<mac>: simple state endpoint returning
unknown|discovered|queued|installing|installed|debug
- ISO kernel cmdline updated: discover.ks → ks-auto
- Handles: discovery (first boot), install (queued), debug modes
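The dispatch decision made in %pre can be expressed as a small mapping. This is only a sketch: selectKickstart is a hypothetical name, and the handling of "installing" and "installed" (falling back to discover) is an assumption, since the commit only names the discovery, queued, and debug cases.

```typescript
type MachineState =
  | "unknown" | "discovered" | "queued" | "installing" | "installed" | "debug";
type Kickstart = "discover" | "install" | "debug";

function selectKickstart(state: MachineState): Kickstart {
  switch (state) {
    case "queued":
      return "install";  // an install has been requested for this MAC
    case "debug":
      return "debug";    // one-shot debug boot
    default:
      return "discover"; // unknown/discovered; other states are an assumed fallback
  }
}
```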
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Skip Cilium install for joining servers (already in cluster via daemonset)
- Longhorn annotation for workers: SSH to server node from CLI to apply
kubectl annotation (workers don't have kubectl access)
- Default SSH user for k3s/app commands changed to 'root' (operations
need root privileges, using 'lab' user broke installs)
- k3s server config: cluster-init for initial server, server+token for joins
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Server config now uses cluster-init: true for initial server (enables
embedded etcd). Joining servers get server: + token: in config.
- Cilium install already checks for existing installation, so joining
servers skip it gracefully (the "release name in use" error is non-fatal)
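The two config shapes can be sketched like this. The function name and options are hypothetical; cluster-init, server, and token are the k3s config keys named above.

```typescript
interface ServerConfigOptions {
  joinUrl?: string; // e.g. "https://worker0-k8s0:6443" when joining (illustrative)
  token?: string;   // cluster join token
}

function renderK3sServerConfig(opts: ServerConfigOptions): string {
  if (!opts.joinUrl) {
    // Initial server: bootstrap embedded etcd.
    return "cluster-init: true\n";
  }
  if (!opts.token) {
    throw new Error("a joining server needs a token");
  }
  // JSON.stringify yields a double-quoted string, which is also valid YAML.
  return `server: ${JSON.stringify(opts.joinUrl)}\ntoken: ${JSON.stringify(opts.token)}\n`;
}
```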
Cluster rebuilt as etcd HA:
worker0-k8s0 control-plane,etcd (initial server, cluster-init)
worker1-k8s0 control-plane,etcd (joined server, Mac Studio aarch64)
spark-2935 worker (DGX Spark, aarch64)
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Replace sed with grep -v / awk for fstab manipulation (Asahi Fedora's
sed doesn't support \| delimiter or \? quantifier)
- Use idempotent write_lab_fstab function: removes all old entries first,
comments out conflicting btrfs subvol entries, adds fresh LVM entries
- Fix sed for SSH hardening: use #* instead of \? (POSIX compatible)
- Tested on Mac Studio: no duplicate fstab entries after multiple runs
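The idempotent rewrite is implemented in shell; the same logic is sketched here in TypeScript for illustration only. It assumes lab entries are recognisable by a /dev/labvg/ device prefix and that a "conflicting" btrfs entry is one sharing a mount point with a new LVM entry. Both are assumptions, not a description of the actual write_lab_fstab.

```typescript
function writeLabFstab(current: string, labEntries: string[]): string {
  // Mount points claimed by the fresh LVM entries (second fstab field).
  const mountPoints = new Set(labEntries.map((e) => e.trim().split(/\s+/)[1]));
  const kept: string[] = [];
  for (const line of current.split("\n")) {
    if (line.trim() === "") continue;
    const fields = line.trim().split(/\s+/);
    // 1. Drop every old lab entry so re-runs never duplicate them.
    if ((fields[0] ?? "").startsWith("/dev/labvg/")) continue;
    // 2. Comment out btrfs subvol entries that claim the same mount point.
    const conflicts =
      !line.startsWith("#") && fields[2] === "btrfs" && mountPoints.has(fields[1]);
    kept.push(conflicts ? `# ${line}` : line);
  }
  // 3. Append the fresh LVM entries.
  return [...kept, ...labEntries].join("\n") + "\n";
}
```

Running it on its own output gives the same file, which is the property the multi-run test on the Mac Studio checked.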
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Firstboot script defaults admin user to 'lab' instead of bastion's
config.adminUser (which was 'michal' from host system)
- iSCSI OS detection uses case-insensitive match for 'fedora'
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The firstboot script now auto-detects hostname (from hostnamectl) and
MAC address (from first UP interface) at runtime. No URL query parameters
required — just `curl bastion/asahi/firstboot.sh | sudo bash`.
Fixes the shell escaping issue where `&` in query params broke curl piping.
Updated labctl provision asahi instructions accordingly.
Tested on Mac Studio (worker1-k8s0): hostname, MAC, and bastion
registration all auto-detected correctly.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Previously the reprovision path exited early after re-mounting LVs,
skipping hostname setup, admin user creation, metadata, and bastion
registration. Now both paths fall through to the common post-setup code.
Tested on Mac Studio (worker1-k8s0) — reprovision + self-registration
confirmed working via curl | bash pipe.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The Asahi installer's urlcache.py fails with AssertionError on macOS
when streaming ZIP via HTTP Range requests from Fastify. Fix: download
the ZIP with curl first (reliable on macOS), then set REPO_BASE to the
local directory so the installer opens it as a local file.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
k3s host prep:
- Add iSCSI initiator install+enable (Fedora: iscsi-initiator-utils,
Ubuntu: open-iscsi) — required by Longhorn
- Add Longhorn disk label to k3s server+agent configs
- Add Longhorn disk annotation operation in post-install hardening
CLI:
- Add `labctl provision asahi` command with interactive install guide
- Change default SSH user from "michal" to "lab" in all commands
- Change admin user in bastion progress callback to "lab"
Asahi provisioning fixes:
- Download installer_data.json locally (installer reads it as file)
- Use REPO_BASE to serve upstream ZIP from bastion (LAN speed)
- Fix ZIP32 vs ZIP64: serve original upstream ZIP unmodified
(our repackaged ZIP used ZIP64 which breaks Asahi urlcache)
- Add /data/asahi-repo fallback path for k3s container PVC mount
- Deploy script syncs asahi-repo to bastion pod after deployment
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Add scripts/build-asahi-rootfs.sh: downloads upstream Fedora Asahi
Remix Server, injects lab firstboot script + systemd service + SSH
keys, repackages with installer_data.json that adds LVM Data partition
- Bastion serves built artifacts at /asahi/repo/* via fastify-static
- installer_data.json prefers built config, falls back to minimal
- Fix __dirname crash in ESM module (use import.meta.url)
- Fix smoke test timeout (was crashing due to __dirname)
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
VM-based end-to-end test using Fedora cloud image with two disks:
root (20GB) + data (200GB). Verifies the firstboot script creates
labvg with correct LV sizes, mounts volumes, migrates /home content,
sets hostname, creates admin user, and handles reprovision.
Fixes to firstboot script:
- Detect whole disks (not just partitions) for LVM PV
- Handle btrfs subvolume paths in root device detection
- Copy /home content before mounting LV (preserves SSH keys)
- Don't restart sshd (config takes effect on reboot)
- Make swapon and mount operations resilient to failures
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Add bastion endpoints for provisioning Apple Silicon machines via the
Asahi Linux installer with custom LVM partitioning:
- GET /asahi — wrapper script (curl bastion:8080/asahi | sh)
- GET /asahi/installer_data.json — custom partition layout (60GB root + LVM data)
- GET /asahi/firstboot.sh — first-boot LVM setup matching kickstart layout
- GET /asahi/firstboot.service — systemd oneshot unit
The firstboot script creates labvg with role-specific LVs (var, varlog,
home, srv, rancher, longhorn) and handles reprovision by detecting
existing VGs. Includes 19 new tests.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Remove hardcoded devices/directRoutingDevice from Cilium install (let
Cilium auto-detect per node — needed for heterogeneous NICs like eno1 vs enP7s7)
- Set k8sServiceHost=127.0.0.1 k8sServicePort=6444 so Cilium init
containers can reach the API via k3s agent's local LB proxy
- Add node-role.kubernetes.io/worker label to agent config
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Add `labctl provision register` to re-add machines to installed state
without reprovisioning (e.g. after bastion state loss). Full stack:
protocol type, bastion API + WS handler, labd route, CLI command.
Add `labctl app k3s kubeconfig <target>` to fetch kubeconfig from a
k3s node via SSH, rewrite server URL, and merge into ~/.kube/config.
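The server URL rewrite step might look roughly like the following. It is a sketch that assumes the fetched kubeconfig contains a single `server: https://...` line (k3s writes https://127.0.0.1:6443 by default); the function name is hypothetical and the merge into ~/.kube/config is not shown.

```typescript
function rewriteKubeconfigServer(kubeconfig: string, host: string, port = 6443): string {
  // Replace only the URL, preserving the original indentation of the line.
  return kubeconfig.replace(
    /^(\s*server:\s*)https:\/\/\S+$/m,
    `$1https://${host}:${port}`,
  );
}
```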
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Add UserKnownHostsFile=/dev/null to SSH in debug and reprovision commands
- Track install state in log follower so it doesn't exit prematurely on "installed"
- Reorder bastion status check to prioritize active queue over stale installed state
- Update .gitignore with task file entries
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Covers all components (bastion, labd, labctl, agent, modules),
data flow, machine lifecycle, disk layout, kickstart features,
deployment, testing, security, known issues, and planned work.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Documents the 2026-03-30 debugging session: root cause (console=ttyS0
on UART-less hardware), what was tried, what was fixed, and remaining
work items.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Root cause found: console=ttyS0,115200n8 causes 30-second timeout at every
systemd boot phase on hardware without a physical serial UART. Each phase
transition blocks waiting for the non-existent UART.
Changes:
- Remove console=ttyS0 from kickstart bootloader args and %post setup
- Enable Anaconda syslog forwarding (logging --host --port) for install visibility
- Improve syslog IP→MAC resolution (register from kickstart fetch + progress)
- Fix disk auto-detect: default to empty string (not /dev/sda) for NVMe support
- Enable SysRq magic keys (kernel.sysrq=1) for emergency reboot via JetKVM
- Simplify debug command: remove --sshd flag (inst.sshd always available),
add /debug-setup.sh HTTP endpoint for nc listener setup
- Add labctl provision logs -f (follow mode with polling)
- Add syslog listener unit tests
- Enable syslog log capture test in integration suite
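For illustration, dropping the serial console argument from a kernel command line can be written as below. The helper is hypothetical (the actual change edits the kickstart bootloader args directly) and only targets ttyS0, the device named in the root cause.

```typescript
function stripSerialConsole(cmdline: string): string {
  return cmdline
    .split(/\s+/)
    .filter((arg) => arg !== "")
    // Drop console=ttyS0 with or without options such as ",115200n8".
    .filter((arg) => !/^console=ttyS0(,|$)/.test(arg))
    .join(" ");
}
```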
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Loads kernel+initrd from bastion HTTP server, mounts root from local
NVMe. Workaround for UEFI firmware bugs that make local disk boot
100x slower. One-time use, auto-clears after boot.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
When using `labctl provision debug <target> --sshd`, the rescue
kickstart generates host keys, starts sshd (pw: debug) and nc
listener (port 2323), and reports the IP back to bastion via
/api/progress callback. Fully self-contained, no mounted FS needed.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
New `labctl provision debug <target>` command that PXE boots a machine
into Fedora rescue mode (inst.rescue) for live debugging. Auto-clears
after one boot so next reboot returns to normal.
Adds debug state to BastionState, dispatch routing, API endpoints,
labd command routing, and CLI with rescue workflow guide.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>