lab/STATUS.md
Michal 46b017d77e
feat: install logging, error trapping, PXE/ISO integration tests
Kickstart installs on real hardware failed silently — no error reporting,
only 3 progress callbacks, zero log streaming. This overhaul makes every
install fully observable.

Kickstart improvements:
- Error trapping in %pre and %post (trap ERR sends failure details to bastion)
- 12+ granular progress stages (was 3): SSH, hostname, k3s prep, EFI boot, metadata
- Background log streamer: tails %post output and batch-sends to /api/log
- bastion_log() function for explicit log lines from kickstart scripts

Bastion API:
- POST /api/log — receives raw log lines from kickstart (single or batch)
- InstallLogBuffer — per-MAC ring buffer (2000 lines) + file persistence
- GET /api/logs/:mac — now returns log_lines + log_total alongside stages
- SSE /api/logs/:mac/follow — uses named events (event: stage vs event: log)
- Progress events forwarded to labd via bastion-progress WebSocket message
- Post-provision k3s logs routed through progressBus (was console-only)
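
The per-MAC buffer described above can be sketched as follows. This is a minimal illustration, not the bastion's actual InstallLogBuffer: the method names (`append`, `snapshot`) are hypothetical, file persistence is left out, and a bounded array stands in for a true ring buffer. Only the 2000-line capacity and the `log_lines` / `log_total` field names come from the notes above.

```typescript
// Hypothetical sketch of a per-MAC bounded log buffer.
// Method names are assumptions; file persistence is omitted.
class InstallLogBuffer {
  private readonly capacity: number;
  private readonly buffers = new Map<string, string[]>();
  private readonly totals = new Map<string, number>();

  constructor(capacity = 2000) {
    this.capacity = capacity;
  }

  /** Append a single line or a batch for one MAC, dropping the oldest lines once full. */
  append(mac: string, lines: string | string[]): void {
    const key = mac.toLowerCase();
    const batch = Array.isArray(lines) ? lines : [lines];
    const buf = this.buffers.get(key) ?? [];
    for (const line of batch) buf.push(line);
    // Keep only the newest `capacity` lines.
    if (buf.length > this.capacity) buf.splice(0, buf.length - this.capacity);
    this.buffers.set(key, buf);
    // The total counts every line ever received, including evicted ones.
    this.totals.set(key, (this.totals.get(key) ?? 0) + batch.length);
  }

  /** Return the retained lines plus the running total for one MAC. */
  snapshot(mac: string): { log_lines: string[]; log_total: number } {
    const key = mac.toLowerCase();
    return {
      log_lines: [...(this.buffers.get(key) ?? [])],
      log_total: this.totals.get(key) ?? 0,
    };
  }
}
```

Keeping a separate total lets a client notice that older lines were evicted when `log_total` exceeds the number of lines returned.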

dnsmasq fixes found during VM testing:
- HTTP Boot filename: ipxe-real.efi → ipxe.efi (leftover from old 2-stage approach)
- pxe-service directives: only in proxy mode (breaks OVMF PXE in full mode)
- PXEClient vendor class echo for UEFI firmware compatibility

Integration tests:
- PXE boot test: blank UEFI VM → dnsmasq → HTTP Boot → iPXE → bastion → install
- ISO boot test: blank VM boots from bastion-generated ISO → same flow
- Shared helpers: pxe-network (no DHCP, nftables fix), pxe-vm (UEFI + ISO boot)
- test-provision.sh: runs both PXE + ISO tests with prerequisite checks
- 250GB sparse QCOW2 disk (LVM layout needs ~204GB)

201 unit tests passing (11 new).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-26 22:26:33 +00:00


labctl Platform — Implementation Status

What This Document Is

An honest assessment of what code exists, what works, what is stubbed, and what hasn't been started — measured against the PRD phases.


Architecture Overview (as built)

labctl CLI ──HTTP──▶ bastion (PXE server)     ← WORKING
labctl CLI ──HTTP──▶ labd (master daemon)     ← PARTIALLY WORKING
                       │
                       ├── CockroachDB/Prisma  ← SCHEMA DEFINED, NOT DEPLOYED
                       ├── /ws/agent WebSocket  ← ACCEPTS CONNECTIONS, DOES NOT ROUTE
                       └── mTLS CA              ← NOT IMPLEMENTED

lab-agent ──WS──▶ labd                        ← LIBRARY CODE, NO DAEMON BINARY

Package Inventory

| Package | Lines of Source | Tests | Status |
|---|---|---|---|
| @lab/shared | ~200 | 0 | Complete: types, protocol, errors |
| @lab/bastion | ~800 | 32 | Production-ready: PXE discovery, install, reprovision |
| @lab/cli | ~600 | 0 (uses bastion tests) | Complete: all commands implemented |
| @lab/labd | ~500 | 2 | Partial: routes exist, core features stubbed |
| @lab/agent | ~300 | 0 | Library only: no daemon binary |

All 5 packages compile. 32 tests pass.


Phase 1: Foundation

DONE — Working in production

| Feature | Code | How It Works |
|---|---|---|
| PXE bastion server | src/bastion/ | Fastify HTTP + dnsmasq DHCP/TFTP. Machines PXE boot, get an iPXE script from /dispatch?mac=XX, and chain to the discovery or install kickstart. State is persisted to a JSON file. |
| Machine discovery | routes/dispatch.ts, templates/discover.ks.ts | Unknown MACs get a mini-kickstart that boots a RAM-only Fedora, scrapes hardware via /proc, /sys, and dmidecode, POSTs to /api/discover, then reboots. The disk is not touched. |
| Machine installation | routes/api.ts, templates/install.ks.ts | Queue a MAC via POST /api/install. The next PXE boot gets a full Kickstart with LVM partitioning (worker: longhorn LV, infra: rancher LV), SSH keys, k3s kernel prereqs, and progress callbacks. |
| Reprovision with data preservation | commands/reprovision.ts, install.ks.ts | The %pre script detects existing LVM. It reformats /, /var, and /boot but preserves /home, /srv, /var/lib/longhorn, and /var/lib/rancher. |
| CLI: init/provision commands | src/cli/src/commands/ | labctl init bastion standalone start/stop/status, labctl provision list/install/reprovision/forget. All talk to the bastion HTTP API. |
| CLI: config management | config/index.ts, commands/config.ts | labctl config list/get/set/path. YAML config at ~/.labctl/config.yaml with env var overrides. |
| labd scaffold | src/labd/ | Fastify server with health, server listing, and token management routes. Prisma schema for all models. Starts with or without a database. |
| Prisma schema | prisma/schema.prisma | 10 models: Server, Agent, User, Role, Permission, UserRole, JoinToken, AuditLog, PulumiRun, Cluster. CockroachDB provider. |
| Database seeding | prisma/seed.ts | Creates admin/viewer/operator roles with proper allow/deny permissions. Idempotent via upsert. |
| Multi-arch builds + packaging | nfpm.yaml, scripts/ | nfpm config for RPM/DEB. Bun compile for a standalone binary (102MB labctl in dist/). |
| Gitea CI/CD | .gitea/ (on remote) | Lint → typecheck → test → build → publish pipeline on mysources.co.uk. |
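
The dispatch step described in the table (unknown MACs go to discovery, queued MACs go to install) can be illustrated with a small decision function. This is a sketch under stated assumptions, not the code in routes/dispatch.ts: the `MachineState` shape, the function names, and the `"local"` fallback for a known machine that is not queued are all hypothetical.

```typescript
// Hypothetical sketch of the /dispatch decision; names and the "local"
// fallback are assumptions, not taken from the bastion source.
type BootAction = "discover" | "install" | "local";

interface MachineState {
  discovered: Set<string>;   // MACs that have already reported hardware
  installQueue: Set<string>; // MACs queued via POST /api/install
}

/** Normalize a MAC to lowercase, colon-separated form. */
function normalizeMac(mac: string): string {
  return mac.trim().toLowerCase().replace(/-/g, ":");
}

/** Decide which boot path a PXE-booting machine should chain to. */
function decideBootAction(state: MachineState, mac: string): BootAction {
  const key = normalizeMac(mac);
  if (state.installQueue.has(key)) return "install"; // full Kickstart
  if (!state.discovered.has(key)) return "discover"; // RAM-only discovery
  return "local"; // assumption: known but not queued, leave the disk alone
}
```

Normalizing the MAC before lookup avoids treating `AA-BB-...` and `aa:bb:...` as different machines.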

DONE — Code exists, not yet connected end-to-end

| Feature | Code | What's Real | What's Missing |
|---|---|---|---|
| lab-agent connection library | lab-agent/src/services/connection.ts | AgentConnection class: WebSocket to labd, heartbeat (10s), exponential backoff reconnect (1-30s), state machine (disconnected/connecting/connected/reconnecting), handles server-shutdown messages. | No daemon binary. This is a library; nothing starts it. No systemd unit. No enrollment flow. |
| lab-agent command executor | lab-agent/src/services/executor.ts | CommandExecutor class: spawn() with timeout handling (SIGTERM then SIGKILL after 5s), stdout/stderr streaming via EventEmitter, stdin writing, signal forwarding. | Not wired to WebSocket. The executor and connection don't talk to each other. No message dispatch. |
| Agent registry (labd) | labd/src/services/agent-registry.ts | AgentRegistry: in-memory Map tracking by serverId and hostname, lifecycle events, heartbeat updates. Singleton exported. | Not used by the /ws/agent handler. The WebSocket handler in server.ts just logs messages; it doesn't call agentRegistry.register(). |
| Message router (labd) | labd/src/services/message-router.ts | MessageRouter: handler registration, pending request tracking with timeouts, streaming support, log subscription, agent cleanup on disconnect. | Not used. server.ts doesn't call messageRouter.handleMessage(). The router exists but is dead code. |
| Token management | labd/src/routes/auth.ts | Create, list, revoke join tokens. Validates one-time vs reusable, expiry, revocation. Marks tokens as used. | Token validation works, but enrollment returns certificatePem: null. No actual certificate is issued. |
| CLI API client | cli/src/api/client.ts | LabdClient with mTLS support, typed methods for servers/tokens/health/enrollment. | Works for REST endpoints. No CLI commands use it yet; existing commands still talk directly to bastion HTTP. |
| CLI WebSocket streaming | cli/src/api/websocket.ts | streamExec() and streamLogs() functions. | No labctl exec or labctl logs commands exist. The streaming code has no consumer. |
| Zod validation | labd/src/validation/ | Schemas for createToken, enrollment, serverFilters, createRole, permission patterns. Middleware for body/query validation. | Not applied to routes. The schemas and middleware exist but no route uses preHandler: [validateBody(schema)]. |
| Encryption service | labd/src/services/encryption.ts | AES-256-GCM with scrypt key derivation. Encrypt/decrypt roundtrip. Singleton from CA_ENCRYPTION_KEY env var. | Not used anywhere. No CA key is encrypted, no kubeconfig is stored. |
| Graceful shutdown | labd/src/services/shutdown.ts | SIGTERM/SIGINT handlers, agent notification, message router cleanup, DB disconnect, force exit timer. | Works, but agent notification is a no-op since no agents are registered (see above). |
| Rate limiting | labd/src/middleware/rate-limit.ts | @fastify/rate-limit: 100/min global, 10/min for enrollment, 20/min for tokens. | Nothing. It is wired up in server.ts and works. |
| Health checks | labd/src/routes/health.ts | /healthz, /health, /health/detailed, /health/live, /health/ready. Checks DB latency and agent count. | Works. Returns agents: { connected: 0 } since no agents ever register. |
| Error hierarchy | shared/src/errors/ | LabError, NotFoundError, PermissionDeniedError, ValidationError, AgentNotConnectedError. | Not used in routes. Routes still use inline reply.code(404).send({error: ...}). |
| Table formatting | cli/src/utils/table.ts | printTable, formatStatus, formatRelativeTime, predefined column sets. | Not used by existing commands. provision list has its own inline formatting. |
| Resource parsing | cli/src/utils/resource.ts | Parses the server/labmaster and app/kube-system/nginx formats. | Not used. No commands accept type/name arguments yet. |
| Doctor command | cli/src/commands/doctor.ts | Config, cert, connectivity diagnostics. | Nothing. Works standalone. |
| Login command | cli/src/commands/login.ts | Generates EC keypair, prompts for token, POSTs to /api/auth/user-enroll. | labd has no /api/auth/user-enroll endpoint. Only /api/auth/enroll exists (for agents). Login will 404. |
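
The AgentConnection row above mentions exponential backoff reconnect in the 1-30s range. A minimal sketch of one way to compute such delays follows. It assumes plain doubling with no jitter, which the table does not specify, and the function and constant names are hypothetical rather than taken from connection.ts.

```typescript
// Hypothetical sketch: doubling backoff clamped to the 1-30s range.
// The doubling curve and the absence of jitter are assumptions.
const BASE_DELAY_MS = 1_000;
const MAX_DELAY_MS = 30_000;

/** Delay before reconnect attempt `attempt` (0-based): 1s, 2s, 4s, ... capped at 30s. */
function reconnectDelayMs(attempt: number): number {
  // Clamp the exponent so very large attempt counts cannot overflow.
  const exponent = Math.min(Math.max(attempt, 0), 20);
  return Math.min(MAX_DELAY_MS, BASE_DELAY_MS * Math.pow(2, exponent));
}
```

A real client would usually reset the attempt counter after a successful connection and may add random jitter so that many agents do not reconnect in lockstep after a labd restart.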

NOT DONE — Phase 1 items from PRD with no code

| Feature | PRD Description | Status |
|---|---|---|
| Certificate Authority | Built-in CA in labd. Generate root CA, sign CSRs, revoke certs, rotate. | Nothing. No CA code. No X.509 operations. No @peculiar/x509 dependency. EncryptionService exists but it's for data-at-rest, not PKI. |
| RBAC engine | Middleware that checks permissions on every request. Deny overrides allow. | Nothing. auth.ts middleware is a placeholder. No route checks permissions. Anyone can call any endpoint. |
| Audit logging | Log every action with user, session, action, resource, result, duration. | Nothing. AuditLog Prisma model exists but nothing writes to it. No audit middleware. |
| labctl exec | Remote command execution via labd → agent WebSocket relay. | Nothing. No exec CLI command. The executor library exists in lab-agent but isn't connected. |
| labctl logs | Resource-scoped log streaming (server, app, bastion, audit). | Nothing. No logs CLI command. |
| labctl get servers | List servers from labd with filters. | Nothing. No get CLI command. The API client has getServers() but no command calls it. |
| Smoke test stack | podman-compose with CockroachDB + labd + 2 agents, testing enrollment/heartbeat/exec/RBAC. | Nothing. stack/docker-compose.yml exists but only runs bastion + CockroachDB, not labd or agents. |
| Agent enrollment during PXE | Embed join token in kickstart, agent auto-enrolls on first boot. | Nothing. Kickstart installs k3s prereqs but doesn't install or start lab-agent. |

Phase 2: Deployment

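Nothing from Phase 2 has been deployed. The only artifact that exists is a set of labd Kubernetes manifests under deploy/k8s/labd/base/.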

| Feature | Status |
|---|---|
| Reprovision labmaster as labmaster.ad.itaz.eu | Not done: manual operation |
| Deploy k3s with Cilium CNI | Not done: kickstart only sets up kernel prereqs and leaves a comment "run curl -sfL https://get.k3s.io" |
| Deploy CockroachDB on k3s | Not done: docker-compose.yml runs it in-memory for dev, no k8s manifests for CRDB |
| Deploy labd on k3s | K8s manifests exist (deploy/k8s/labd/base/): Deployment, Service, ConfigMap, HPA, PDB. But there is no CockroachDB to connect to and no TLS configured. |
| Deploy bastion as managed app | Not done: bastion runs standalone, no Pulumi chart |
| Auto-enroll agents during PXE | Not done: no agent install in kickstart, no token embedding |

Phase 3: Infrastructure as Code

Nothing from Phase 3 has been built.

| Feature | Status |
|---|---|
| Module system | Not done: no module.yaml, no module loader |
| Pulumi charts | Not done: no Pulumi dependency, no chart structure |
| labctl apps install/upgrade/rollback | Not done: no apps command |
| labctl apply -f | Not done: no apply command |
| kubectl proxy (audited) | Not done: no kubectl proxy |
| Kubeconfig store (encrypted) | EncryptionService exists but nothing uses it. Cluster.kubeconfigEnc field exists in Prisma but nothing reads/writes it. |

Phase 4: Multi-Cloud

Nothing from Phase 4 has been built.

| Feature | Status |
|---|---|
| AWS provider | Not done |
| Reusable join tokens for ASGs | Token model supports reusable type, but no AWS integration |
| Cilium Cluster Mesh | Not done |
| Ephemeral test environments | Not done |
| Grafana Loki | Not done |

Infrastructure Files

| File | Status |
|---|---|
| Dockerfile.labd | Exists. Multi-stage Alpine build. Would work if you docker build it. |
| Dockerfile.bastion | Exists. Multi-stage Fedora build. Would work. |
| .dockerignore | Exists. |
| deploy/k8s/labd/base/ | Kustomize manifests for labd (Deployment, Service, ConfigMap, HPA, PDB). Points at a non-existent CockroachDB and has no TLS. |
| stack/docker-compose.yml | Runs bastion + CockroachDB for local dev. Works. |
| nfpm.yaml | RPM/DEB packaging config. Works with nfpm pkg. |

The Disconnection Problem

The core issue is that many services were built in isolation but never wired together:

┌─────────────────────────────────────────────────────────┐
│  BUILT BUT NOT CONNECTED                                │
│                                                         │
│  AgentConnection ──✗──▶ /ws/agent handler               │
│  CommandExecutor ──✗──▶ MessageRouter                   │
│  MessageRouter   ──✗──▶ /ws/agent handler               │
│  AgentRegistry   ──✗──▶ /ws/agent handler               │
│  Zod schemas     ──✗──▶ Route preHandlers               │
│  Error classes   ──✗──▶ Route error handling             │
│  LabdClient      ──✗──▶ CLI commands (get/exec/logs)    │
│  Table formatting──✗──▶ CLI commands                    │
│  Resource parsing──✗──▶ CLI commands                    │
│  EncryptionService──✗──▶ CA / kubeconfig storage        │
│  Login command   ──✗──▶ /api/auth/user-enroll (missing) │
│  Audit logging   ──✗──▶ Any middleware                  │
│  RBAC engine     ──✗──▶ Any middleware                  │
└─────────────────────────────────────────────────────────┘

What Actually Works End-to-End Today

  1. PXE boot a bare-metal machine:

    labctl init bastion standalone start
    # Machine PXE boots → discovered automatically
    labctl provision list
    labctl provision install AA:BB:CC:DD:EE:FF worker-1 --role worker
    # Machine reboots → installs Fedora → reports complete
    
  2. Manage bastion lifecycle:

    labctl init bastion standalone status
    labctl init bastion standalone stop
    
  3. Start labd (without database):

    LABD_PORT=3100 tsx src/labd/src/main.ts
    # Starts with stub DB, health endpoint works, token/server routes return errors
    
  4. Start labd (with CockroachDB):

    docker-compose -f stack/docker-compose.yml up cockroachdb
    DATABASE_URL=postgresql://root@localhost:26257/lab tsx src/labd/src/main.ts
    # Token creation/listing/revocation works
    # Server listing works (empty until agents register)
    
  5. CLI diagnostics:

    labctl doctor
    labctl config list
    labctl version
    

That's it. No agent communication, no remote exec, no log streaming, no RBAC, no certificates.


Priority 1: Wire up the agent connection

  1. Update /ws/agent handler to use agentRegistry.register() and messageRouter.handleMessage()
  2. Create lab-agent daemon binary that uses AgentConnection + CommandExecutor
  3. Create systemd unit for lab-agent
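
Step 1 above could look roughly like the sketch below. It is framework-agnostic and deliberately avoids real Fastify or WebSocket APIs. The `AgentRegistryLike` and `MessageRouterLike` interfaces, the `register` message type, and the method signatures are assumptions for illustration; only the names agentRegistry.register() and messageRouter.handleMessage() come from this document, and their real signatures may differ.

```typescript
// Hypothetical sketch of wiring a socket handler to the registry and router.
// Interfaces and message shapes are assumptions, not the labd source.
interface AgentMessage {
  type: string;
  serverId?: string;
  hostname?: string;
  payload?: unknown;
}

interface AgentRegistryLike {
  register(serverId: string, hostname: string): void;
  unregister(serverId: string): void;
}

interface MessageRouterLike {
  handleMessage(serverId: string, message: AgentMessage): void;
}

function createAgentSocketHandler(registry: AgentRegistryLike, router: MessageRouterLike) {
  let serverId: string | null = null;
  return {
    /** Call for every raw text frame received on the agent socket. */
    onMessage(raw: string): void {
      let parsed: unknown;
      try {
        parsed = JSON.parse(raw);
      } catch {
        return; // ignore malformed frames
      }
      if (typeof parsed !== "object" || parsed === null) return;
      const msg = parsed as AgentMessage;
      if (msg.type === "register" && typeof msg.serverId === "string") {
        serverId = msg.serverId;
        registry.register(serverId, msg.hostname ?? "unknown");
        return;
      }
      if (serverId === null) return; // drop traffic from unregistered sockets
      router.handleMessage(serverId, msg);
    },
    /** Call when the socket closes so the registry does not keep stale agents. */
    onClose(): void {
      if (serverId !== null) registry.unregister(serverId);
      serverId = null;
    },
  };
}
```

Keeping the handler independent of the transport makes it possible to unit test the wiring with fake registry and router objects before touching server.ts.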

Priority 2: Certificate Authority

  1. Add @peculiar/x509 dependency
  2. Implement CA service: generate root CA, sign CSRs
  3. Wire enrollment route to actually sign and return certificates
  4. Store CA key encrypted using EncryptionService
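
For step 4, the following sketch shows the general AES-256-GCM plus scrypt pattern using Node's built-in crypto module. It is not the existing EncryptionService: the function names and the salt + IV + tag + ciphertext layout are assumptions chosen for illustration, and the real service may store these fields differently.

```typescript
import { createCipheriv, createDecipheriv, randomBytes, scryptSync } from "node:crypto";

// Hypothetical sketch; the byte layout (salt | iv | tag | ciphertext) is an assumption.
const SALT_LEN = 16;
const IV_LEN = 12; // 96-bit nonce, the usual choice for GCM
const TAG_LEN = 16;

/** Encrypt a secret with a key derived from `passphrase` via scrypt. */
function encryptSecret(plaintext: string, passphrase: string): string {
  const salt = randomBytes(SALT_LEN);
  const iv = randomBytes(IV_LEN);
  const key = scryptSync(passphrase, salt, 32); // 32 bytes for AES-256
  const cipher = createCipheriv("aes-256-gcm", key, iv);
  const ciphertext = Buffer.concat([cipher.update(plaintext, "utf8"), cipher.final()]);
  const tag = cipher.getAuthTag();
  return Buffer.concat([salt, iv, tag, ciphertext]).toString("base64");
}

/** Reverse of encryptSecret; throws if the passphrase is wrong or the data was tampered with. */
function decryptSecret(encoded: string, passphrase: string): string {
  const raw = Buffer.from(encoded, "base64");
  const salt = raw.subarray(0, SALT_LEN);
  const iv = raw.subarray(SALT_LEN, SALT_LEN + IV_LEN);
  const tag = raw.subarray(SALT_LEN + IV_LEN, SALT_LEN + IV_LEN + TAG_LEN);
  const ciphertext = raw.subarray(SALT_LEN + IV_LEN + TAG_LEN);
  const key = scryptSync(passphrase, salt, 32);
  const decipher = createDecipheriv("aes-256-gcm", key, iv);
  decipher.setAuthTag(tag);
  return Buffer.concat([decipher.update(ciphertext), decipher.final()]).toString("utf8");
}
```

In labd the passphrase would presumably come from the CA_ENCRYPTION_KEY environment variable mentioned earlier, and the encoded string would be what gets written to storage in place of the raw CA private key.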

Priority 3: RBAC + Audit

  1. Create RBAC middleware that checks Permission table
  2. Create audit middleware that writes to AuditLog
  3. Apply both to all routes
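
The PRD rule "deny overrides allow" can be sketched as a pure function that the RBAC middleware in step 1 might call. The `PermissionRule` shape and the trailing-`*` pattern matching are assumptions for illustration; the actual Permission table columns and pattern syntax are not described in this document.

```typescript
// Hypothetical sketch of deny-overrides-allow evaluation.
// Field names and wildcard semantics are assumptions.
type Effect = "allow" | "deny";

interface PermissionRule {
  effect: Effect;
  action: string;   // e.g. "exec" or "*"
  resource: string; // e.g. "server/labmaster" or "server/*"
}

/** Match an exact value, a bare "*", or a prefix pattern ending in "*". */
function matchesPattern(pattern: string, value: string): boolean {
  if (pattern === "*") return true;
  if (pattern.endsWith("*")) return value.startsWith(pattern.slice(0, -1));
  return pattern === value;
}

/** True only if at least one rule allows the request and no matching rule denies it. */
function isAllowed(rules: PermissionRule[], action: string, resource: string): boolean {
  let allowed = false;
  for (const rule of rules) {
    if (!matchesPattern(rule.action, action)) continue;
    if (!matchesPattern(rule.resource, resource)) continue;
    if (rule.effect === "deny") return false; // a single deny wins immediately
    allowed = true;
  }
  return allowed; // default deny when nothing matched
}
```

Defaulting to deny when no rule matches keeps a missing or incomplete role from silently granting access.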

Priority 4: CLI commands for labd

  1. labctl get servers using LabdClient.getServers()
  2. labctl exec server/<name> using streamExec()
  3. labctl logs server/<name> using streamLogs()
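
The `server/<name>` arguments above use the same format that cli/src/utils/resource.ts is described as parsing (server/labmaster, app/kube-system/nginx). A minimal sketch of such a parser follows. The `ResourceRef` shape and the reading of the middle segment as a namespace are assumptions; the existing utility may model this differently.

```typescript
// Hypothetical sketch of a type/name and type/namespace/name parser.
interface ResourceRef {
  type: string;
  namespace?: string; // assumption: the middle segment is a namespace
  name: string;
}

/** Parse "server/labmaster" or "app/kube-system/nginx"; throws on anything else. */
function parseResource(input: string): ResourceRef {
  const parts = input.split("/");
  const malformed =
    parts.length < 2 || parts.length > 3 || parts.some((part) => part.length === 0);
  if (malformed) {
    throw new Error(`invalid resource "${input}": expected type/name or type/namespace/name`);
  }
  const type = parts[0];
  const second = parts[1];
  const third = parts[2]; // undefined for the two-segment form
  return third === undefined
    ? { type, name: second }
    : { type, namespace: second, name: third };
}
```

Rejecting empty segments up front means inputs such as `server/` fail with a clear message instead of producing a resource with an empty name.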

Priority 5: Smoke test stack

  1. Update docker-compose.yml to include labd + 2 agents
  2. Write integration tests for enrollment → heartbeat → exec → logs