labctl Platform — Implementation Status
What This Document Is
An honest assessment of what code exists, what works, what is stubbed, and what hasn't been started — measured against the PRD phases.
Architecture Overview (as built)
labctl CLI ──HTTP──▶ bastion (PXE server) ← WORKING
labctl CLI ──HTTP──▶ labd (master daemon) ← PARTIALLY WORKING
                     │
                     ├── CockroachDB/Prisma ← SCHEMA DEFINED, NOT DEPLOYED
                     ├── /ws/agent WebSocket ← ACCEPTS CONNECTIONS, DOES NOT ROUTE
                     └── mTLS CA ← NOT IMPLEMENTED
lab-agent ──WS──▶ labd ← LIBRARY CODE, NO DAEMON BINARY
Package Inventory
| Package | Lines of Source | Tests | Status |
|---|---|---|---|
| @lab/shared | ~200 | 0 | Complete — types, protocol, errors |
| @lab/bastion | ~800 | 32 | Production-ready — PXE discovery, install, reprovision |
| @lab/cli | ~600 | 0 (uses bastion tests) | Complete — all commands implemented |
| @lab/labd | ~500 | 2 | Partial — routes exist, core features stubbed |
| @lab/agent | ~300 | 0 | Library only — no daemon binary |
All 5 packages compile. 32 tests pass.
Phase 1: Foundation
DONE — Working in production
| Feature | Code | How It Works |
|---|---|---|
| PXE bastion server | src/bastion/ | Fastify HTTP + dnsmasq DHCP/TFTP. Machines PXE boot, get iPXE script from /dispatch?mac=XX, chain to discovery or install kickstart. State persisted to JSON file. |
| Machine discovery | routes/dispatch.ts, templates/discover.ks.ts | Unknown MACs get a mini-kickstart that boots a RAM-only Fedora, scrapes hardware via /proc, /sys, dmidecode, POSTs to /api/discover, then reboots. No disk touch. |
| Machine installation | routes/api.ts, templates/install.ks.ts | Queue a MAC via POST /api/install. Next PXE boot gets a full Kickstart with LVM partitioning (worker: longhorn LV, infra: rancher LV), SSH keys, k3s kernel prereqs, progress callbacks. |
| Reprovision with data preservation | commands/reprovision.ts, install.ks.ts | %pre script detects existing LVM. Reformats /, /var, /boot but preserves /home, /srv, /var/lib/longhorn, /var/lib/rancher. |
| CLI: init/provision commands | src/cli/src/commands/ | labctl init bastion standalone start/stop/status, labctl provision list/install/reprovision/forget. All talk to bastion HTTP API. |
| CLI: config management | config/index.ts, commands/config.ts | labctl config list/get/set/path. YAML config at ~/.labctl/config.yaml with env var overrides. |
| labd scaffold | src/labd/ | Fastify server with health, server listing, token management routes. Prisma schema for all models. Starts with or without database. |
| Prisma schema | prisma/schema.prisma | 10 models: Server, Agent, User, Role, Permission, UserRole, JoinToken, AuditLog, PulumiRun, Cluster. CockroachDB provider. |
| Database seeding | prisma/seed.ts | Creates admin/viewer/operator roles with proper allow/deny permissions. Idempotent via upsert. |
| Multi-arch builds + packaging | nfpm.yaml, scripts/ | nfpm config for RPM/DEB. Bun compile for standalone binary (102MB labctl in dist/). |
| Gitea CI/CD | .gitea/ (on remote) | Lint → typecheck → test → build → publish pipeline on mysources.co.uk. |
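
The dispatch behaviour in the table (unknown MACs get the discovery kickstart, queued MACs get the install kickstart) can be sketched as a small pure function. This is an illustration, not the code in routes/dispatch.ts: `MachineState`, `chooseBootTarget`, `normalizeMac`, and the `"local"` fallback for machines that are known but not queued are all assumptions.

```typescript
// Hypothetical sketch of the /dispatch decision. All names are illustrative.
type BootTarget = "discover" | "install" | "local";

interface MachineState {
  mac: string;
  installQueued: boolean; // assumed to be set by POST /api/install
}

// Normalise MAC strings so "aa-bb-..." and "AA:BB:..." compare equal.
function normalizeMac(mac: string): string {
  return mac.trim().toUpperCase().replace(/-/g, ":");
}

function chooseBootTarget(mac: string, known: Map<string, MachineState>): BootTarget {
  const state = known.get(normalizeMac(mac));
  if (!state) return "discover"; // unknown MAC: RAM-only hardware scan
  if (state.installQueued) return "install"; // queued MAC: full Kickstart install
  return "local"; // assumption: known but not queued boots from disk
}

const machines = new Map<string, MachineState>([
  ["AA:BB:CC:DD:EE:FF", { mac: "AA:BB:CC:DD:EE:FF", installQueued: true }],
  ["11:22:33:44:55:66", { mac: "11:22:33:44:55:66", installQueued: false }],
]);

console.log(chooseBootTarget("aa-bb-cc-dd-ee-ff", machines));
```

Keeping the decision separate from the Fastify route would make it testable without dnsmasq or a PXE client.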
DONE — Code exists, not yet connected end-to-end
| Feature | Code | What's Real | What's Missing |
|---|---|---|---|
| lab-agent connection library | lab-agent/src/services/connection.ts | AgentConnection class: WebSocket to labd, heartbeat (10s), exponential backoff reconnect (1-30s), state machine (disconnected/connecting/connected/reconnecting), handles server-shutdown messages. | No daemon binary. This is a library: nothing starts it. No systemd unit. No enrollment flow. |
| lab-agent command executor | lab-agent/src/services/executor.ts | CommandExecutor class: spawn() with timeout handling (SIGTERM then SIGKILL after 5s), stdout/stderr streaming via EventEmitter, stdin writing, signal forwarding. | Not wired to WebSocket. The executor and connection don't talk to each other. No message dispatch. |
| Agent registry (labd) | labd/src/services/agent-registry.ts | AgentRegistry: in-memory Map tracking by serverId and hostname, lifecycle events, heartbeat updates. Singleton exported. | Not used by /ws/agent handler. The WebSocket handler in server.ts just logs messages; it doesn't call agentRegistry.register(). |
| Message router (labd) | labd/src/services/message-router.ts | MessageRouter: handler registration, pending request tracking with timeouts, streaming support, log subscription, agent cleanup on disconnect. | Not used. server.ts doesn't call messageRouter.handleMessage(). The router exists but is dead code. |
| Token management | labd/src/routes/auth.ts | Create, list, revoke join tokens. Validates one-time vs reusable, expiry, revocation. Marks tokens as used. | Token validation works. But enrollment returns certificatePem: null, so no actual certificate is issued. |
| CLI API client | cli/src/api/client.ts | LabdClient with mTLS support, typed methods for servers/tokens/health/enrollment. | Works for REST endpoints. No CLI commands use it yet; existing commands still talk directly to bastion HTTP. |
| CLI WebSocket streaming | cli/src/api/websocket.ts | streamExec() and streamLogs() functions. | No labctl exec or labctl logs commands exist. The streaming code has no consumer. |
| Zod validation | labd/src/validation/ | Schemas for createToken, enrollment, serverFilters, createRole, permission patterns. Middleware for body/query validation. | Not applied to routes. The schemas and middleware exist but no route uses preHandler: [validateBody(schema)]. |
| Encryption service | labd/src/services/encryption.ts | AES-256-GCM with scrypt key derivation. Encrypt/decrypt roundtrip. Singleton from CA_ENCRYPTION_KEY env var. | Not used anywhere. No CA key is encrypted, no kubeconfig is stored. |
| Graceful shutdown | labd/src/services/shutdown.ts | SIGTERM/SIGINT handlers, agent notification, message router cleanup, DB disconnect, force exit timer. | Works but agent notification is a no-op since no agents are registered (see above). |
| Rate limiting | labd/src/middleware/rate-limit.ts | @fastify/rate-limit: 100/min global, 10/min for enrollment, 20/min for tokens. | Wired up in server.ts. This actually works. |
| Health checks | labd/src/routes/health.ts | /healthz, /health, /health/detailed, /health/live, /health/ready. Checks DB latency and agent count. | Works. Returns agents: { connected: 0 } since no agents ever register. |
| Error hierarchy | shared/src/errors/ | LabError, NotFoundError, PermissionDeniedError, ValidationError, AgentNotConnectedError. | Not used in routes. Routes still use inline reply.code(404).send({error: ...}). |
| Table formatting | cli/src/utils/table.ts | printTable, formatStatus, formatRelativeTime, predefined column sets. | Not used by existing commands. provision list has its own inline formatting. |
| Resource parsing | cli/src/utils/resource.ts | Parse server/labmaster, app/kube-system/nginx format. | Not used. No commands accept type/name arguments yet. |
| Doctor command | cli/src/commands/doctor.ts | Config, cert, connectivity diagnostics. | Works standalone. |
| Login command | cli/src/commands/login.ts | Generates EC keypair, prompts for token, POSTs to /api/auth/user-enroll. | labd has no /api/auth/user-enroll endpoint. Only /api/auth/enroll exists (for agents). Login will 404. |
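
For reference, the "exponential backoff reconnect (1-30s)" behaviour listed for AgentConnection could look roughly like the helper below. Only the 1s floor and 30s cap come from the table. The doubling factor, the absence of jitter, and the name `reconnectDelayMs` are assumptions, and connection.ts may compute its delays differently.

```typescript
// Hypothetical helper: delay before reconnect attempt `attempt` (0-based).
const BASE_DELAY_MS = 1000; // 1s floor, from the 1-30s range in the table
const MAX_DELAY_MS = 30000; // 30s cap

function reconnectDelayMs(attempt: number): number {
  // Assumption: the delay doubles on every failed attempt.
  const exponential = BASE_DELAY_MS * 2 ** Math.max(0, attempt);
  return Math.min(MAX_DELAY_MS, exponential);
}

const delays = [0, 1, 2, 3, 4, 5, 6].map(reconnectDelayMs);
console.log(delays); // [1000, 2000, 4000, 8000, 16000, 30000, 30000]
```

A real agent fleet would probably also want random jitter so that many agents do not reconnect in lockstep after a labd restart.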
NOT DONE — Phase 1 items from PRD with no code
| Feature | PRD Description | Status |
|---|---|---|
| Certificate Authority | Built-in CA in labd. Generate root CA, sign CSRs, revoke certs, rotate. | Nothing. No CA code. No X.509 operations. No @peculiar/x509 dependency. EncryptionService exists but it's for data-at-rest, not PKI. |
| RBAC engine | Middleware that checks permissions on every request. Deny overrides allow. | Nothing. auth.ts middleware is a placeholder. No route checks permissions. Anyone can call any endpoint. |
| Audit logging | Log every action with user, session, action, resource, result, duration. | Nothing. AuditLog Prisma model exists but nothing writes to it. No audit middleware. |
| labctl exec | Remote command execution via labd → agent WebSocket relay. | Nothing. No exec CLI command. The executor library exists in lab-agent but isn't connected. |
| labctl logs | Resource-scoped log streaming (server, app, bastion, audit). | Nothing. No logs CLI command. |
| labctl get servers | List servers from labd with filters. | Nothing. No get CLI command. The API client has getServers() but no command calls it. |
| Smoke test stack | podman-compose with CockroachDB + labd + 2 agents, testing enrollment/heartbeat/exec/RBAC. | Nothing. stack/docker-compose.yml exists but only runs bastion + CockroachDB, not labd or agents. |
| Agent enrollment during PXE | Embed join token in kickstart, agent auto-enrolls on first boot. | Nothing. Kickstart installs k3s prereqs but doesn't install or start lab-agent. |
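
The RBAC rule from the PRD, "deny overrides allow", is small enough to sketch. The rule shape below (`effect`, `action`, `resource`, with `*` and `type/*` patterns) is an assumption for illustration; the actual Permission model in prisma/schema.prisma may be structured differently, and nothing like `isAllowed` exists in labd today.

```typescript
// Hypothetical permission shape. The real Prisma Permission model may differ.
type Effect = "allow" | "deny";

interface PermissionRule {
  effect: Effect;
  action: string; // e.g. "exec", "logs", or "*" for any action
  resource: string; // e.g. "server/labmaster", or "server/*" for a prefix match
}

function matchesPattern(pattern: string, value: string): boolean {
  if (pattern === "*") return true;
  if (pattern.endsWith("/*")) return value.startsWith(pattern.slice(0, -1));
  return pattern === value;
}

function isAllowed(rules: PermissionRule[], action: string, resource: string): boolean {
  const applicable = rules.filter(
    (rule) => matchesPattern(rule.action, action) && matchesPattern(rule.resource, resource),
  );
  // Deny overrides allow: a single matching deny rule wins.
  if (applicable.some((rule) => rule.effect === "deny")) return false;
  // No matching rule at all means no access.
  return applicable.some((rule) => rule.effect === "allow");
}

const operatorRules: PermissionRule[] = [
  { effect: "allow", action: "*", resource: "server/*" },
  { effect: "deny", action: "exec", resource: "server/labmaster" },
];

console.log(isAllowed(operatorRules, "exec", "server/worker-1")); // true
console.log(isAllowed(operatorRules, "exec", "server/labmaster")); // false
```

Defaulting to "no access" when nothing matches is a design choice of this sketch; the PRD excerpt above only specifies the deny-over-allow precedence.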
Phase 2: Deployment
Nothing from Phase 2 has been built.
| Feature | Status |
|---|---|
| Reprovision labmaster as labmaster.ad.itaz.eu | Not done — manual operation |
| Deploy k3s with Cilium CNI | Not done — kickstart only sets up kernel prereqs, leaves a comment "run curl -sfL https://get.k3s.io" |
| Deploy CockroachDB on k3s | Not done — docker-compose.yml runs it in-memory for dev, no k8s manifests for CRDB |
| Deploy labd on k3s | K8s manifests exist (deploy/k8s/labd/base/) — Deployment, Service, ConfigMap, HPA, PDB. But no CockroachDB to connect to and no TLS configured. |
| Deploy bastion as managed app | Not done — bastion runs standalone, no Pulumi chart |
| Auto-enroll agents during PXE | Not done — no agent install in kickstart, no token embedding |
Phase 3: Infrastructure as Code
Nothing from Phase 3 has been built.
| Feature | Status |
|---|---|
| Module system | Not done: no module.yaml, no module loader |
| Pulumi charts | Not done: no Pulumi dependency, no chart structure |
| labctl apps install/upgrade/rollback | Not done: no apps command |
| labctl apply -f | Not done: no apply command |
| kubectl proxy (audited) | Not done: no kubectl proxy |
| Kubeconfig store (encrypted) | EncryptionService exists but nothing uses it. Cluster.kubeconfigEnc field exists in Prisma but nothing reads/writes it. |
Phase 4: Multi-Cloud
Nothing from Phase 4 has been built.
| Feature | Status |
|---|---|
| AWS provider | Not done |
| Reusable join tokens for ASGs | Token model supports reusable type, but no AWS integration |
| Cilium Cluster Mesh | Not done |
| Ephemeral test environments | Not done |
| Grafana Loki | Not done |
Infrastructure Files
| File | Status |
|---|---|
| Dockerfile.labd | Exists. Multi-stage Alpine build. Would work if you docker build it. |
| Dockerfile.bastion | Exists. Multi-stage Fedora build. Would work. |
| .dockerignore | Exists. |
| deploy/k8s/labd/base/ | Kustomize manifests for labd (Deployment, Service, ConfigMap, HPA, PDB). Points at a non-existent CockroachDB and has no TLS. |
| stack/docker-compose.yml | Runs bastion + CockroachDB for local dev. Works. |
| nfpm.yaml | RPM/DEB packaging config. Works with nfpm pkg. |
The Disconnection Problem
The core issue is that many services were built in isolation but never wired together:
┌───────────────────────────────────────────────────────────┐
│                  BUILT BUT NOT CONNECTED                  │
│                                                           │
│ AgentConnection   ──✗──▶ /ws/agent handler                │
│ CommandExecutor   ──✗──▶ MessageRouter                    │
│ MessageRouter     ──✗──▶ /ws/agent handler                │
│ AgentRegistry     ──✗──▶ /ws/agent handler                │
│ Zod schemas       ──✗──▶ Route preHandlers                │
│ Error classes     ──✗──▶ Route error handling             │
│ LabdClient        ──✗──▶ CLI commands (get/exec/logs)     │
│ Table formatting  ──✗──▶ CLI commands                     │
│ Resource parsing  ──✗──▶ CLI commands                     │
│ EncryptionService ──✗──▶ CA / kubeconfig storage          │
│ Login command     ──✗──▶ /api/auth/user-enroll (missing)  │
│ Audit logging     ──✗──▶ Any middleware                   │
│ RBAC engine       ──✗──▶ Any middleware                   │
└───────────────────────────────────────────────────────────┘
What Actually Works End-to-End Today
- PXE boot a bare-metal machine:

      labctl init bastion standalone start
      # Machine PXE boots → discovered automatically
      labctl provision list
      labctl provision install AA:BB:CC:DD:EE:FF worker-1 --role worker
      # Machine reboots → installs Fedora → reports complete

- Manage bastion lifecycle:

      labctl init bastion standalone status
      labctl init bastion standalone stop

- Start labd (without database):

      LABD_PORT=3100 tsx src/labd/src/main.ts
      # Starts with stub DB, health endpoint works, token/server routes return errors

- Start labd (with CockroachDB):

      docker-compose -f stack/docker-compose.yml up cockroachdb
      DATABASE_URL=postgresql://root@localhost:26257/lab tsx src/labd/src/main.ts
      # Token creation/listing/revocation works
      # Server listing works (empty until agents register)

- CLI diagnostics:

      labctl doctor
      labctl config list
      labctl version

That's it. No agent communication, no remote exec, no log streaming, no RBAC, no certificates.
Recommended Next Steps (to make Phase 1 actually work)
Priority 1: Wire up the agent connection
- Update /ws/agent handler to use agentRegistry.register() and messageRouter.handleMessage()
- Create lab-agent daemon binary that uses AgentConnection + CommandExecutor
- Create systemd unit for lab-agent
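
A rough sketch of what the first item means in practice is below. The two classes are minimal stand-ins so the example runs on its own. They borrow the names AgentRegistry and MessageRouter and the method names register() and handleMessage() from this document, but their signatures, the `AgentMessage` shape, and the `attachAgent` helper are assumptions, not the real labd code.

```typescript
// Stand-ins for the real labd services. Argument and return types are assumed.
interface AgentMessage {
  type: string;
  serverId: string;
}

class AgentRegistry {
  private agents = new Map<string, string>(); // serverId -> hostname
  register(serverId: string, hostname: string): void {
    this.agents.set(serverId, hostname);
  }
  unregister(serverId: string): void {
    this.agents.delete(serverId);
  }
  count(): number {
    return this.agents.size;
  }
}

class MessageRouter {
  private handlers = new Map<string, (msg: AgentMessage) => void>();
  on(type: string, handler: (msg: AgentMessage) => void): void {
    this.handlers.set(type, handler);
  }
  // Returns false when no handler is registered for the message type.
  handleMessage(msg: AgentMessage): boolean {
    const handler = this.handlers.get(msg.type);
    if (!handler) return false;
    handler(msg);
    return true;
  }
}

function parseMessage(raw: string): AgentMessage | null {
  try {
    const value = JSON.parse(raw);
    return typeof value?.type === "string" ? (value as AgentMessage) : null;
  } catch {
    return null; // malformed frames are dropped instead of crashing the handler
  }
}

// What the /ws/agent handler would do per connection instead of only logging.
function attachAgent(registry: AgentRegistry, router: MessageRouter, serverId: string, hostname: string) {
  registry.register(serverId, hostname);
  return {
    onMessage(raw: string): boolean {
      const msg = parseMessage(raw);
      return msg ? router.handleMessage(msg) : false;
    },
    onClose(): void {
      registry.unregister(serverId);
    },
  };
}

const registry = new AgentRegistry();
const router = new MessageRouter();
const heartbeats: string[] = [];
router.on("heartbeat", (msg) => {
  heartbeats.push(msg.serverId);
});
const agentSocket = attachAgent(registry, router, "srv-1", "worker-1");
agentSocket.onMessage(JSON.stringify({ type: "heartbeat", serverId: "srv-1" }));
```

In the real handler, `onMessage` and `onClose` would be bound to the WebSocket's message and close events, and the health endpoint's agent count would stop reporting 0.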
Priority 2: Certificate Authority
- Add @peculiar/x509 dependency
- Implement CA service: generate root CA, sign CSRs
- Wire enrollment route to actually sign and return certificates
- Store CA key encrypted using EncryptionService
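
For the last item, the technique EncryptionService is described as using (AES-256-GCM with scrypt key derivation) can be sketched with Node's built-in crypto module. This is not the code in labd/src/services/encryption.ts: the function names, the passphrase parameter, and the `salt | iv | tag | ciphertext` blob layout are assumptions chosen for the example.

```typescript
import { createCipheriv, createDecipheriv, randomBytes, scryptSync } from "node:crypto";

// Sketch only. Assumed blob layout: salt (16) | iv (12) | auth tag (16) | ciphertext.
function encryptSecret(plaintext: string, passphrase: string): Buffer {
  const salt = randomBytes(16);
  const key = scryptSync(passphrase, salt, 32); // 32 bytes = AES-256 key
  const iv = randomBytes(12); // 96-bit nonce, the usual size for GCM
  const cipher = createCipheriv("aes-256-gcm", key, iv);
  const ciphertext = Buffer.concat([cipher.update(plaintext, "utf8"), cipher.final()]);
  return Buffer.concat([salt, iv, cipher.getAuthTag(), ciphertext]);
}

function decryptSecret(blob: Buffer, passphrase: string): string {
  const salt = blob.subarray(0, 16);
  const iv = blob.subarray(16, 28);
  const tag = blob.subarray(28, 44);
  const ciphertext = blob.subarray(44);
  const key = scryptSync(passphrase, salt, 32);
  const decipher = createDecipheriv("aes-256-gcm", key, iv);
  decipher.setAuthTag(tag); // final() throws if the blob or passphrase is wrong
  return Buffer.concat([decipher.update(ciphertext), decipher.final()]).toString("utf8");
}

// Placeholder value standing in for a PEM-encoded CA private key.
const caKeyPem = "-----BEGIN PRIVATE KEY-----\n(placeholder)\n-----END PRIVATE KEY-----";
const encryptedCaKey = encryptSecret(caKeyPem, "dev-only-passphrase");
console.log(decryptSecret(encryptedCaKey, "dev-only-passphrase") === caKeyPem);
```

In labd the passphrase would presumably come from the CA_ENCRYPTION_KEY env var rather than a literal, and the encrypted blob would be what gets persisted.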
Priority 3: RBAC + Audit
- Create RBAC middleware that checks Permission table
- Create audit middleware that writes to AuditLog
- Apply both to all routes
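
The audit half could start from a pure function that turns request metadata into one record with the fields the PRD lists (user, session, action, resource, result, duration). Everything below is a sketch: the field names, the `denied`/`error` split by HTTP status code, and `buildAuditEntry` itself are assumptions, and the real AuditLog model may use different columns.

```typescript
// Hypothetical audit record mirroring the PRD field list.
interface AuditContext {
  user: string;
  session: string;
  action: string;
  resource: string;
}

interface AuditEntry extends AuditContext {
  result: "success" | "denied" | "error";
  durationMs: number;
}

function buildAuditEntry(
  ctx: AuditContext,
  statusCode: number,
  startedAtMs: number,
  finishedAtMs: number,
): AuditEntry {
  // Assumption: the result is derived from the HTTP status code.
  let result: AuditEntry["result"] = "success";
  if (statusCode === 401 || statusCode === 403) result = "denied";
  else if (statusCode >= 400) result = "error";
  return { ...ctx, result, durationMs: Math.max(0, finishedAtMs - startedAtMs) };
}

const sampleEntry = buildAuditEntry(
  { user: "alice", session: "sess-1", action: "exec", resource: "server/worker-1" },
  403,
  1000,
  1042,
);
console.log(sampleEntry.result, sampleEntry.durationMs); // denied 42
```

A middleware would call this from a response hook and write the entry to AuditLog, so denied requests from the RBAC check are recorded as well as successful ones.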
Priority 4: CLI commands for labd
- labctl get servers using LabdClient.getServers()
- labctl exec server/<name> using streamExec()
- labctl logs server/<name> using streamLogs()
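
These commands take type/name arguments such as server/labmaster or app/kube-system/nginx. A parser for that format might look like the sketch below. It is written from the format description only, not copied from cli/src/utils/resource.ts, so `ResourceRef`, `parseResource`, and the error messages are assumptions.

```typescript
// Hypothetical parser for "type/name" and "type/namespace/name" arguments.
interface ResourceRef {
  type: string;
  namespace?: string;
  name: string;
}

function parseResource(input: string): ResourceRef {
  const parts = input.split("/");
  const valid = parts.length >= 2 && parts.length <= 3 && parts.every((part) => part.length > 0);
  if (!valid) {
    throw new Error(`invalid resource "${input}": expected type/name or type/namespace/name`);
  }
  const type = parts[0] ?? "";
  const name = parts[parts.length - 1] ?? "";
  if (parts.length === 3) {
    return { type, namespace: parts[1] ?? "", name };
  }
  return { type, name };
}

console.log(parseResource("server/labmaster"));
console.log(parseResource("app/kube-system/nginx"));
```

Rejecting empty segments up front means `labctl exec server/` fails with a clear message instead of sending an empty name to labd.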
Priority 5: Smoke test stack
- Update docker-compose.yml to include labd + 2 agents
- Write integration tests for enrollment → heartbeat → exec → logs