Commit Graph

10 Commits

Author SHA1 Message Date
Michal
49d747db98 feat: provision register command and k3s kubeconfig merge
Some checks failed
CI/CD / lint (pull_request) Failing after 11s
CI/CD / test (pull_request) Failing after 11s
CI/CD / typecheck (pull_request) Failing after 22s
CI/CD / build (pull_request) Has been skipped
CI/CD / publish-rpm (pull_request) Has been skipped
CI/CD / publish-deb (pull_request) Has been skipped
Add `labctl provision register` to re-add machines to installed state
without reprovisioning (e.g. after bastion state loss). Full stack:
protocol type, bastion API + WS handler, labd route, CLI command.

Add `labctl app k3s kubeconfig <target>` to fetch kubeconfig from a
k3s node via SSH, rewrite server URL, and merge into ~/.kube/config.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-31 01:15:31 +01:00
Michal
6a5f23c0f5 fix: reprovision workflow bugs — SSH host key warnings, log following, status priority
Some checks failed
CI/CD / lint (pull_request) Failing after 10s
CI/CD / test (pull_request) Failing after 10s
CI/CD / typecheck (pull_request) Failing after 23s
CI/CD / build (pull_request) Has been skipped
CI/CD / publish-rpm (pull_request) Has been skipped
CI/CD / publish-deb (pull_request) Has been skipped
- Add UserKnownHostsFile=/dev/null to SSH in debug and reprovision commands
- Track install state in log follower so it doesn't exit prematurely on "installed"
- Reorder bastion status check to prioritize active queue over stale installed state
- Update .gitignore with task file entries

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-30 22:59:45 +01:00
Michal
0a4916d3c9 fix: remove serial console (root cause of 30s boot delay), enable syslog logging, disk auto-detect
Some checks failed
CI/CD / typecheck (pull_request) Failing after 9s
CI/CD / test (pull_request) Failing after 9s
CI/CD / lint (pull_request) Failing after 22s
CI/CD / build (pull_request) Has been skipped
CI/CD / publish-rpm (pull_request) Has been skipped
CI/CD / publish-deb (pull_request) Has been skipped
Root cause found: console=ttyS0,115200n8 causes 30-second timeout at every
systemd boot phase on hardware without a physical serial UART. Each phase
transition blocks waiting for the non-existent UART.

Changes:
- Remove console=ttyS0 from kickstart bootloader args and %post setup
- Enable Anaconda syslog forwarding (logging --host --port) for install visibility
- Improve syslog IP→MAC resolution (register from kickstart fetch + progress)
- Fix disk auto-detect: default to empty string (not /dev/sda) for NVMe support
- Enable SysRq magic keys (kernel.sysrq=1) for emergency reboot via JetKVM
- Simplify debug command: remove --sshd flag (inst.sshd always available),
  add /debug-setup.sh HTTP endpoint for nc listener setup
- Add labctl provision logs -f (follow mode with polling)
- Add syslog listener unit tests
- Enable syslog log capture test in integration suite

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-30 03:58:51 +01:00
Michal
a4a4840930 feat: debug --pxe-boot flag, boot installed system via PXE
Some checks failed
CI/CD / lint (pull_request) Failing after 10s
CI/CD / test (pull_request) Failing after 10s
CI/CD / typecheck (pull_request) Failing after 22s
CI/CD / build (pull_request) Has been skipped
CI/CD / publish-rpm (pull_request) Has been skipped
CI/CD / publish-deb (pull_request) Has been skipped
Loads kernel+initrd from bastion HTTP server, mounts root from local
NVMe. Workaround for UEFI firmware bugs that make local disk boot
100x slower. One-time use, auto-clears after boot.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-30 00:49:44 +01:00
Michal
3835fefba1 feat: debug --sshd flag, auto SSH + nc listener + IP callback
Some checks failed
CI/CD / lint (pull_request) Failing after 9s
CI/CD / test (pull_request) Failing after 9s
CI/CD / typecheck (pull_request) Failing after 22s
CI/CD / build (pull_request) Has been skipped
CI/CD / publish-rpm (pull_request) Has been skipped
CI/CD / publish-deb (pull_request) Has been skipped
When using `labctl provision debug <target> --sshd`, the rescue
kickstart generates host keys, starts sshd (pw: debug) and nc
listener (port 2323), and reports the IP back to bastion via
/api/progress callback. Fully self-contained, no mounted FS needed.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-29 23:54:22 +01:00
Michal
82ca93f4d7 fix: add debug field to inline BastionState in labd server
Some checks failed
CI/CD / typecheck (pull_request) Failing after 9s
CI/CD / test (pull_request) Failing after 8s
CI/CD / lint (pull_request) Failing after 22s
CI/CD / build (pull_request) Has been skipped
CI/CD / publish-rpm (pull_request) Has been skipped
CI/CD / publish-deb (pull_request) Has been skipped
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-29 22:54:02 +01:00
Michal
e87edfcfbd feat: PXE debug boot mode for rescue/diagnostics
Some checks failed
CI/CD / lint (pull_request) Failing after 11s
CI/CD / test (pull_request) Failing after 9s
CI/CD / typecheck (pull_request) Failing after 22s
CI/CD / build (pull_request) Has been skipped
CI/CD / publish-rpm (pull_request) Has been skipped
CI/CD / publish-deb (pull_request) Has been skipped
New `labctl provision debug <target>` command that PXE boots a machine
into Fedora rescue mode (inst.rescue) for live debugging. Auto-clears
after one boot so next reboot returns to normal.

Adds debug state to BastionState, dispatch routing, API endpoints,
labd command routing, and CLI with rescue workflow guide.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-29 22:25:44 +01:00
Michal
0c1e18cee1 feat: persist machine state to CockroachDB on bastion-state-sync
When bastion syncs state, labd now upserts discovered and installed
machines into the Server table. /api/machines merges live bastion
state with DB records, so machines survive pod restarts.

Discovered machines get status=discovered with hardware labels.
Installed machines get status=online with hostname, role, IP.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-29 02:34:26 +01:00
Michal
46b017d77e feat: install logging, error trapping, PXE/ISO integration tests
Some checks failed
CI/CD / lint (pull_request) Failing after 13s
CI/CD / test (pull_request) Failing after 10s
CI/CD / typecheck (pull_request) Failing after 36s
CI/CD / build (pull_request) Has been skipped
CI/CD / publish-rpm (pull_request) Has been skipped
CI/CD / publish-deb (pull_request) Has been skipped
Kickstart installs on real hardware failed silently — no error reporting,
only 3 progress callbacks, zero log streaming. This overhaul makes every
install fully observable.

Kickstart improvements:
- Error trapping in %pre and %post (trap ERR sends failure details to bastion)
- 12+ granular progress stages (was 3): SSH, hostname, k3s prep, EFI boot, metadata
- Background log streamer: tails %post output and batch-sends to /api/log
- bastion_log() function for explicit log lines from kickstart scripts

Bastion API:
- POST /api/log — receives raw log lines from kickstart (single or batch)
- InstallLogBuffer — per-MAC ring buffer (2000 lines) + file persistence
- GET /api/logs/:mac — now returns log_lines + log_total alongside stages
- SSE /api/logs/:mac/follow — uses named events (event: stage vs event: log)
- Progress events forwarded to labd via bastion-progress WebSocket message
- Post-provision k3s logs routed through progressBus (was console-only)

dnsmasq fixes found during VM testing:
- HTTP Boot filename: ipxe-real.efi → ipxe.efi (leftover from old 2-stage approach)
- pxe-service directives: only in proxy mode (breaks OVMF PXE in full mode)
- PXEClient vendor class echo for UEFI firmware compatibility

Integration tests:
- PXE boot test: blank UEFI VM → dnsmasq → HTTP Boot → iPXE → bastion → install
- ISO boot test: blank VM boots from bastion-generated ISO → same flow
- Shared helpers: pxe-network (no DHCP, nftables fix), pxe-vm (UEFI + ISO boot)
- test-provision.sh: runs both PXE + ISO tests with prerequisite checks
- 250GB sparse QCOW2 disk (LVM layout needs ~204GB)

201 unit tests passing (11 new).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-26 22:26:33 +00:00
Michal
44f1ebb843 feat: scaffold labd — master daemon with CockroachDB + Prisma
New @lab/labd workspace package:
- Fastify HTTP server + WebSocket for agent connections
- Prisma schema (CockroachDB): Server, Agent, User, Role, Permission,
  UserRole, JoinToken, AuditLog, PulumiRun, Cluster models
- Health endpoint with DB connectivity check
- Server listing with cloud/env/status filters
- Auth routes: agent enrollment, join token management
- Placeholder mTLS auth middleware
- Dev stack: CockroachDB single-node in docker-compose
- 32 tests passing (2 new for labd health)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-18 00:13:16 +00:00