michal/lab

Author	SHA1	Message	Date
Michal	37a3b51e57	build(labd): include @lab/core in the Dockerfile build chain The v2.0 Phase 1 commit (`04faa07`) introduced the @lab/core package but the labd Dockerfile still only copied @lab/shared and @lab/labd, so the container build would fail to resolve @lab/core imports. Both stages updated: - Builder: copy @lab/core package.json/tsconfig + src, add it to the build order between @lab/shared and @lab/labd. - Runtime: copy @lab/core dist and package.json into the final image. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-05 22:09:24 +01:00
Michal	d6e1f3c74d	fix(labd): preserve machine identity across bastion restarts The worker0-k8s0 bug: when labd restarts, the in-memory installed map is lost. The next DHCP/PXE re-discovery for that MAC ran an upsert that wrote status="discovered", silently downgrading the DB record from "online" or "offline" and erasing the machine's known hostname/role identity from the CLI view. - server.ts: drop status="discovered" from the upsert update branch so re-discovery cannot downgrade an installed record. - routes/bastions.ts (/api/machines): when the DB knows a real hostname+role for a MAC currently only in live.discovered, promote it back to live.installed so the CLI sees the right state. Also reordered the live-vs-DB fallback so DB online/offline maps to live.installed and the discovered branch is the else. - tests: 3 new vitest cases covering promotion, fresh-discovery, and unknown-MAC fallback. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-05 22:09:24 +01:00
Michal	52e831b8c1	Merge branch 'main' into feat/v2-phase1-foundation	2026-05-05 22:06:34 +01:00
Michal	dd92147341	fix(k3s): route audit logs through journald, codify etcd member recovery Two changes prompted by today's etcd raft panic on worker1-k8s0 (tocommit out of range, lost-write on follower) and the cascading disk pressure that surfaced underneath it. Audit logs to journald - kube-apiserver now uses audit-log-path=- so audit events flow to k3s.service stdout and into journald instead of growing files in /var/log/kubernetes. The previous setup combined apiserver's internal rotation with a logrotate *.log glob that double-rotated the rotated files into permanent orphans (observed: 7+ GB). - New journald-limits operation writes a SystemMaxUse=2G drop-in so audit volume cannot fill /var/log even under bursty load. - log-rotation operation repurposed to decommission the obsolete logrotate rule and reap leftover audit files. Idempotent: no-op on fresh installs. Etcd member recovery - New recoverEtcdMember(broken, peer, hostname) codifies the documented k3s recovery: stop k3s, etcdctl member remove, wipe /var/lib/rancher/k3s/server/{db,tls,cred}, restart, poll for rejoin. Refuses to operate when cluster size < 3 to preserve quorum. Tests - 7 new unit tests covering both decommission paths and the recovery procedure (54 total, all green). - install.test.ts asserts the file-based audit args are gone. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-05 21:29:16 +01:00
Michal	04faa079e2	feat: v2.0 Phase 1 foundation — @lab/core, auth, RBAC, audit, resource store New packages: - @lab/core: Resource types, Output<T> (Pulumi), audit event types, auth types, environment/account types, resource kind registry New Prisma schema (mcpctl pattern): - User (email/password/bcrypt), Session (bearer tokens), Group, GroupMember - ServiceAccount, RbacDefinition (JSON subjects + roleBindings) - AuditEvent (correlation IDs, causal chains, fire-and-forget batching) - Environment, Account (driver config, Infisical secret path), Binding - Resource (generic, kind/name/env unique, origin/managedBy tracking) - Secret, Fleet, FleetMember, GitSource - Keeps v1.0 models: Server, Agent, Bastion, Cluster, JoinToken New services: - AuthService: bearer token login, bootstrap (first login creates admin), session management with 30-day expiry - RbacService: environment-scoped permission checks, group membership, role hierarchy (admin > edit > view) - AuditService: fire-and-forget event collection, batch 50 / flush 5s, correlation IDs for causal chains - ResourceStore: CRUD with origin/managedBy, RBAC-enforced routes New routes: - POST /api/auth/login, POST /api/auth/logout (bearer token auth) - GET/POST/PUT/DELETE /api/resources (RBAC-enforced CRUD) - GET/POST /api/environments, GET/POST /api/accounts - POST /api/accounts/bind, GET /api/bindings - GET /api/events (audit query with --last, --kind, --env, --correlation) New middleware: - Bearer token auth (validates Authorization header, resolves user identity) Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-04-02 01:42:28 +01:00
Michal	9ddab24931	feat: provision recheck, hardware info preservation, ISO boot fixes - Add `labctl provision recheck` to refresh hardware info via SSH - Preserve hardware info in InstalledInfo when install completes - Fix /ks-auto: run nested %pre scripts from included kickstarts - Add command-discover WebSocket routing for hw info updates - Fix k3s join: clean stale TLS/cred when joining existing cluster - Add --tls-verify=false for internal HTTP registry pushes - Add fix-ssh-root.sh script for root SSH access on all nodes Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-01 17:59:39 +01:00
Michal	ae91f2895e	feat: dynamic /ks-auto kickstart for ISO boot (R1 ARM support) Add state-aware kickstart dispatch for machines that boot from ISO (no PXE/network at UEFI level). Replaces hardcoded discover.ks. - /ks-auto: %pre detects MAC, queries /api/machine-state/<mac>, writes discover or install kickstart to /tmp/dynamic.ks, main body %include's it - /api/machine-state/<mac>: simple state endpoint returning unknown|discovered|queued|installing|installed|debug - ISO kernel cmdline updated: discover.ks → ks-auto - Handles: discovery (first boot), install (queued), debug modes Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-01 16:17:08 +01:00
Michal	06fc40a857	fix: k3s install automation — skip Cilium on join, Longhorn via server, default root user - Skip Cilium install for joining servers (already in cluster via daemonset) - Longhorn annotation for workers: SSH to server node from CLI to apply kubectl annotation (workers don't have kubectl access) - Default SSH user for k3s/app commands changed to 'root' (operations need root privileges, using 'lab' user broke installs) - k3s server config: cluster-init for initial server, server+token for joins Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-01 16:02:19 +01:00
Michal	a68d6d617e	feat: k3s cluster-init for etcd HA, fix Cilium duplicate install - Server config now uses cluster-init: true for initial server (enables embedded etcd). Joining servers get server: + token: in config. - Cilium install already checks for existing installation, so joining servers skip it gracefully (the "release name in use" error is non-fatal) Cluster rebuilt as etcd HA: worker0-k8s0 control-plane,etcd (initial server, cluster-init) worker1-k8s0 control-plane,etcd (joined server, Mac Studio aarch64) spark-2935 worker (DGX Spark, aarch64) Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-01 15:53:18 +01:00
Michal	c49a650888	fix: firstboot fstab handling — no duplicates, compatible with Asahi sed - Replace sed with grep -v / awk for fstab manipulation (Asahi Fedora's sed doesn't support \| delimiter or \? quantifier) - Use idempotent write_lab_fstab function: removes all old entries first, comments out conflicting btrfs subvol entries, adds fresh LVM entries - Fix sed for SSH hardening: use #* instead of \? (POSIX compatible) - Tested on Mac Studio: no duplicate fstab entries after multiple runs Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-01 15:40:29 +01:00
Michal	87e09af941	fix: default admin user to 'lab', case-insensitive OS detection for iSCSI - Firstboot script defaults admin user to 'lab' instead of bastion's config.adminUser (which was 'michal' from host system) - iSCSI OS detection uses case-insensitive match for 'fedora' Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-01 15:13:53 +01:00
Michal	6f13e284fd	fix: firstboot script auto-detects hostname and MAC, no query params needed The firstboot script now auto-detects hostname (from hostnamectl) and MAC address (from first UP interface) at runtime. No URL query parameters required — just `curl bastion/asahi/firstboot.sh | sudo bash`. Fixes the shell escaping issue where `&` in query params broke curl piping. Updated labctl provision asahi instructions accordingly. Tested on Mac Studio (worker1-k8s0): hostname, MAC, and bastion registration all auto-detected correctly. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-01 15:05:25 +01:00
Michal	6c963a15bd	fix: firstboot reprovision path now runs hostname, user, and registration Previously the reprovision path exited early after re-mounting LVs, skipping hostname setup, admin user creation, metadata, and bastion registration. Now both paths fall through to the common post-setup code. Tested on Mac Studio (worker1-k8s0) — reprovision + self-registration confirmed working via curl | bash pipe. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-01 09:59:02 +01:00
Michal	17bae7ddbf	fix: pre-download rootfs ZIP to avoid macOS Python HTTP streaming issues The Asahi installer's urlcache.py fails with AssertionError on macOS when streaming ZIP via HTTP Range requests from Fastify. Fix: download the ZIP with curl first (reliable on macOS), then set REPO_BASE to the local directory so the installer opens it as a local file. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-01 00:30:29 +01:00
Michal	bb8f37ef7d	feat: iSCSI, Longhorn disk labels, labctl asahi command, ZIP32 fix k3s host prep: - Add iSCSI initiator install+enable (Fedora: iscsi-initiator-utils, Ubuntu: open-iscsi) — required by Longhorn - Add Longhorn disk label to k3s server+agent configs - Add Longhorn disk annotation operation in post-install hardening CLI: - Add `labctl provision asahi` command with interactive install guide - Change default SSH user from "michal" to "lab" in all commands - Change admin user in bastion progress callback to "lab" Asahi provisioning fixes: - Download installer_data.json locally (installer reads it as file) - Use REPO_BASE to serve upstream ZIP from bastion (LAN speed) - Fix ZIP32 vs ZIP64: serve original upstream ZIP unmodified (our repackaged ZIP used ZIP64 which breaks Asahi urlcache) - Add /data/asahi-repo fallback path for k3s container PVC mount - Deploy script syncs asahi-repo to bastion pod after deployment Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-31 23:32:38 +01:00
Michal	a8dc79bc5a	feat: Asahi validation tests, rootfs build fixes, shellcheck-clean scripts - Add 16 validation tests: shellcheck (3 roles), installer_data.json schema (8), Python parser validation, ZIP structure (3), rootfs mount - Fix empty SSH keys generating invalid bash (SC1073) - Fix __dirname crash in ESM modules (use import.meta.url) - Fix rootfs build: mkdir -p before writing, correct binary paths - Add .gitignore for large build artifacts (.asahi-cache, *.zip) - Bump smoke test timeout for additional static plugin registration Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-31 13:22:24 +01:00
Michal	ad76c74020	fix: rootfs build script — mkdir before write, fix package path checks Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-31 03:26:26 +01:00
Michal	6807632d46	feat: Asahi rootfs build pipeline + serve from bastion - Add scripts/build-asahi-rootfs.sh: downloads upstream Fedora Asahi Remix Server, injects lab firstboot script + systemd service + SSH keys, repackages with installer_data.json that adds LVM Data partition - Bastion serves built artifacts at /asahi/repo/* via fastify-static - installer_data.json prefers built config, falls back to minimal - Fix __dirname crash in ESM module (use import.meta.url) - Fix smoke test timeout (was crashing due to __dirname) Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-31 03:20:12 +01:00
Michal	53265bb18c	test: integration test for Asahi firstboot LVM setup VM-based end-to-end test using Fedora cloud image with two disks: root (20GB) + data (200GB). Verifies the firstboot script creates labvg with correct LV sizes, mounts volumes, migrates /home content, sets hostname, creates admin user, and handles reprovision. Fixes to firstboot script: - Detect whole disks (not just partitions) for LVM PV - Handle btrfs subvolume paths in root device detection - Copy /home content before mounting LV (preserves SSH keys) - Don't restart sshd (config takes effect on reboot) - Make swapon and mount operations resilient to failures Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-31 03:07:38 +01:00
Michal	863c7f2b83	feat: Asahi Linux provisioning for Apple Silicon (Mac Studio) Add bastion endpoints for provisioning Apple Silicon machines via the Asahi Linux installer with custom LVM partitioning: - GET /asahi — wrapper script (curl bastion:8080/asahi | sh) - GET /asahi/installer_data.json — custom partition layout (60GB root + LVM data) - GET /asahi/firstboot.sh — first-boot LVM setup matching kickstart layout - GET /asahi/firstboot.service — systemd oneshot unit The firstboot script creates labvg with role-specific LVs (var, varlog, home, srv, rancher, longhorn) and handles reprovision by detecting existing VGs. Includes 19 new tests. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-31 02:46:27 +01:00
Michal	aea28b5a0f	fix: Cilium multi-node support — auto-detect NIC, k3s agent API port, worker label - Remove hardcoded devices/directRoutingDevice from Cilium install (let Cilium auto-detect per node — needed for heterogeneous NICs like eno1 vs enP7s7) - Set k8sServiceHost=127.0.0.1 k8sServicePort=6444 so Cilium init containers can reach the API via k3s agent's local LB proxy - Add node-role.kubernetes.io/worker label to agent config Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-31 01:35:51 +01:00
Michal	49d747db98	feat: provision register command and k3s kubeconfig merge Add `labctl provision register` to re-add machines to installed state without reprovisioning (e.g. after bastion state loss). Full stack: protocol type, bastion API + WS handler, labd route, CLI command. Add `labctl app k3s kubeconfig <target>` to fetch kubeconfig from a k3s node via SSH, rewrite server URL, and merge into ~/.kube/config. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-31 01:15:31 +01:00
Michal	6a5f23c0f5	fix: reprovision workflow bugs — SSH host key warnings, log following, status priority - Add UserKnownHostsFile=/dev/null to SSH in debug and reprovision commands - Track install state in log follower so it doesn't exit prematurely on "installed" - Reorder bastion status check to prioritize active queue over stale installed state - Update .gitignore with task file entries Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-30 22:59:45 +01:00
Michal	d7a25066bd	docs: comprehensive architecture document Covers all components (bastion, labd, labctl, agent, modules), data flow, machine lifecycle, disk layout, kickstart features, deployment, testing, security, known issues, and planned work. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-30 17:31:29 +01:00
Michal	87c1a34232	docs: PXE boot debugging post-mortem — serial console root cause Documents the 2026-03-30 debugging session: root cause (console=ttyS0 on UART-less hardware), what was tried, what was fixed, and remaining work items. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-30 04:00:51 +01:00
Michal	0a4916d3c9	fix: remove serial console (root cause of 30s boot delay), enable syslog logging, disk auto-detect Root cause found: console=ttyS0,115200n8 causes 30-second timeout at every systemd boot phase on hardware without a physical serial UART. Each phase transition blocks waiting for the non-existent UART. Changes: - Remove console=ttyS0 from kickstart bootloader args and %post setup - Enable Anaconda syslog forwarding (logging --host --port) for install visibility - Improve syslog IP→MAC resolution (register from kickstart fetch + progress) - Fix disk auto-detect: default to empty string (not /dev/sda) for NVMe support - Enable SysRq magic keys (kernel.sysrq=1) for emergency reboot via JetKVM - Simplify debug command: remove --sshd flag (inst.sshd always available), add /debug-setup.sh HTTP endpoint for nc listener setup - Add labctl provision logs -f (follow mode with polling) - Add syslog listener unit tests - Enable syslog log capture test in integration suite Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-30 03:58:51 +01:00
Michal	a4a4840930	feat: debug --pxe-boot flag, boot installed system via PXE Loads kernel+initrd from bastion HTTP server, mounts root from local NVMe. Workaround for UEFI firmware bugs that make local disk boot 100x slower. One-time use, auto-clears after boot. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-30 00:49:44 +01:00
Michal	8da947a1c3	fix: use %pre instead of %post for debug --sshd (rescue mode skips %post) Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-30 00:25:19 +01:00
Michal	92c65b4672	fix: generic rescue instructions in debug command output Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-29 23:59:38 +01:00
Michal	3835fefba1	feat: debug --sshd flag, auto SSH + nc listener + IP callback When using `labctl provision debug <target> --sshd`, the rescue kickstart generates host keys, starts sshd (pw: debug) and nc listener (port 2323), and reports the IP back to bastion via /api/progress callback. Fully self-contained, no mounted FS needed. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-29 23:54:22 +01:00
Michal	d7a59665ad	fix: route command-debug through bastion WebSocket handler Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-29 23:01:16 +01:00
Michal	82ca93f4d7	fix: add debug field to inline BastionState in labd server Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-29 22:54:02 +01:00
Michal	52150fd955	fix: add command-debug to LabdBastionMessage protocol types Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-29 22:42:52 +01:00
Michal	e87edfcfbd	feat: PXE debug boot mode for rescue/diagnostics New `labctl provision debug <target>` command that PXE boots a machine into Fedora rescue mode (inst.rescue) for live debugging. Auto-clears after one boot so next reboot returns to normal. Adds debug state to BastionState, dispatch routing, API endpoints, labd command routing, and CLI with rescue workflow guide. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-29 22:25:44 +01:00
Michal	6c6d5763c4	fix: skip USB-attached disks in %pre (JetKVM virtual media is SCSI-over-USB) Check sysfs device path for 'usb' to skip JetKVM virtual media which appears as /dev/sda but is not a real install target. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-29 12:51:44 +01:00
Michal	a7a6ad8098	fix: skip removable/USB disks in %pre, wait for NVMe init JetKVM virtual media appears as /dev/sda before NVMe initializes. Now: wait up to 10s for disks, skip removable disks and anything under 20GB. Fixes "ignoredisk: sda does not exist" on SER9MAX. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-29 12:38:41 +01:00
Michal	e3523d642c	fix: remove serial console from iPXE kernel args (may hang on SER9MAX) ttyS0 console output on iPXE kernel line may cause kernel hang on hardware without physical serial port. Removed from both discover and install iPXE scripts. Serial console stays in bootloader config for the installed system only. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-29 12:32:02 +01:00
Michal	5b04d3162b	fix: disable logging --host (UDP not exposed), add nomodeset + JetKVM helper - logging --host blocks Anaconda when syslog UDP port not reachable - nomodeset prevents amdgpu hang on SER9MAX (Radeon 780M) - JetKVM helper script for device control (status, reboot, power) Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-29 11:07:48 +01:00
Michal	a14fd04947	fix: add nomodeset to iPXE kernel args (amdgpu hangs on SER9MAX) Radeon 780M GPU driver initialization hangs during Anaconda boot on SER9MAX. nomodeset disables kernel modesetting so the installer doesn't try to initialize the GPU. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-29 03:01:21 +01:00
Michal	0c1e18cee1	feat: persist machine state to CockroachDB on bastion-state-sync When bastion syncs state, labd now upserts discovered and installed machines into the Server table. /api/machines merges live bastion state with DB records, so machines survive pod restarts. Discovered machines get status=discovered with hardware labels. Installed machines get status=online with hostname, role, IP. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-29 02:34:26 +01:00
Michal	aae03d9877	fix: syslog parser TS strict null check, deploy script Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-29 00:58:00 +00:00
Michal	84f1a7b133	feat: serial console on iPXE kernel boot args Add console=ttyS0,115200n8 to both discover and install iPXE kernel lines so Anaconda output is visible on serial during install phase. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-29 00:46:25 +00:00
Michal	c0fb1310cb	fix: re-enable logging --host (removed invalid --level flag) ksvalidator caught the issue: --level=info is not valid for F43. Correct syntax is just: logging --host=<ip> --port=<port> Also added ksvalidator syntax check to unit tests — validates rendered kickstart for all roles (vanilla, worker, infra) against F43 pykickstart. This catches kickstart syntax errors at test time instead of during a 12-minute VM install. Integration test passes: 21/22 (1 skipped: log lines capture). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-29 00:45:11 +00:00
Michal	48b2230665	fix: disable logging --host (breaks Anaconda), add integration config The kickstart `logging --host` directive stalls Anaconda install — likely firewall blocks UDP syslog or Fedora 43 Anaconda has issues with it. Commented out for now. Syslog listener infrastructure is in place and ready once we resolve the Anaconda/firewall issue. Added vitest.integration.config.ts for running integration tests: pnpm exec vitest run --config vitest.integration.config.ts All 21 integration tests pass, serial console rsyslog forwarding works. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-29 00:19:48 +00:00
Michal	3dc1317301	feat: Anaconda syslog logging, serial console forwarding, protocol types - Add UDP syslog listener (port 5514) for receiving Anaconda install logs via native `logging --host` kickstart directive, no background processes - Add rsyslog serial console forwarding in %post (AWS EC2 compatible ttyS0@115200n8) - Add ProvisionStackType ("dhcpproxy" | "iso" | "cloud-init") to shared types - Add bastion-install-log WebSocket protocol message for bastion→labd log sync - Add syslogPort to BastionConfig (default 5514) - Wire syslog listener into bastion startup/shutdown lifecycle Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-28 23:14:10 +00:00
Michal	cac7514014	feat: admin user 'lab' with SSH key auth (Step 7 — PASS) Changed admin user from 'michal' to generic 'lab' user. SSH key auth works for both root and lab user. 21/22 tests pass (1 skipped: log lines, needs log streamer redesign). Bisection complete — all features work except background log streamer which prevents Anaconda from syncing filesystem writes. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-28 22:30:59 +00:00
Michal	25a2beccff	fix: add error trap, bastion helpers, serial console (Steps 2-5 pass) Bisection results: - Step 2: bastion_log/bastion_error helpers — PASS - Step 3: ERR trap in %post — PASS - Step 4: background log streamer — FAIL (breaks boot, NOT included) - Step 5: serial console on ttyS0 — PASS The background log streamer (tail -f subprocess in %post) prevents Anaconda from properly syncing the installed filesystem. This was the root cause of all boot failures. Will need a different approach for real-time log streaming. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-28 22:17:47 +00:00
Michal	2a1a29c03b	fix: revert kickstart to near-original baseline (Step 0 — boots clean) Reverted install.ks.ts to near-original state from commit `64533b2`. This is the bisection baseline — 21/22 integration tests pass, 0 failed systemd services, SSH works, /boot/efi mounts. Removed all accumulated fixes that collectively broke boot: - ERR trap, background log streamer, bastion_log/bastion_error - depmod rebuild, nofail on /boot/efi, SELinux autorelabel - chcon/restorecon for /etc /var /root - kernel-modules and dosfstools packages Kept from current branch: - rootpw --plaintext lab-root-pw (console debug access) - Network-first boot order (bastion controls boot) - Vanilla role support, rancher partition support - Boot screenshots during SSH wait (1/sec rolling buffer) - Test runner script (run-pxe-test.sh) Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-28 20:47:34 +00:00
Michal	a664074fa3	wip: save current ks debugging state before bisect revert All accumulated changes to kickstart template, test infrastructure, and dnsmasq config. None of these produce a clean boot yet — saving state before reverting to baseline for bisection. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-28 20:24:14 +00:00
Michal	cc289c0f94	feat: serial console on test VMs for debugging without SSH - VMs get serial console on TCP (PXE: port 4555, ISO: port 4556) - serialExec() helper: runs commands via telnet when SSH/network is down - PXE test: on SSH failure, dumps hostname, IP, NetworkManager, sshd, failed units, and fstab via serial console before failing - Kickstart enables serial-getty@ttyS0 for auto-login on serial Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-27 15:22:43 +00:00

69 Commits