# labctl Platform — Implementation Status

## What This Document Is

An honest assessment of what code exists, what works, what is stubbed, and what
hasn't been started — measured against the PRD phases.

---

## Architecture Overview (as built)

```
labctl CLI ──HTTP──▶ bastion (PXE server)       ← WORKING
labctl CLI ──HTTP──▶ labd (master daemon)       ← PARTIALLY WORKING
                       │
                       ├── CockroachDB/Prisma   ← SCHEMA DEFINED, NOT DEPLOYED
                       ├── /ws/agent WebSocket  ← ACCEPTS CONNECTIONS, DOES NOT ROUTE
                       └── mTLS CA              ← NOT IMPLEMENTED

lab-agent ──WS──▶ labd                          ← LIBRARY CODE, NO DAEMON BINARY
```

---

## Package Inventory

| Package | Lines of Source | Tests | Status |
|---------|---------------|-------|--------|
| @lab/shared | ~200 | 0 | Complete — types, protocol, errors |
| @lab/bastion | ~800 | 32 | **Production-ready** — PXE discovery, install, reprovision |
| @lab/cli | ~600 | 0 (uses bastion tests) | Complete — all commands implemented |
| @lab/labd | ~500 | 2 | Partial — routes exist, core features stubbed |
| @lab/agent | ~300 | 0 | Library only — no daemon binary |

All 5 packages compile. 32 tests pass.

---

## Phase 1: Foundation

### DONE — Working in production

| Feature | Code | How It Works |
|---------|------|-------------|
| PXE bastion server | `src/bastion/` | Fastify HTTP + dnsmasq DHCP/TFTP. Machines PXE boot, get iPXE script from `/dispatch?mac=XX`, chain to discovery or install kickstart. State persisted to JSON file. |
| Machine discovery | `routes/dispatch.ts`, `templates/discover.ks.ts` | Unknown MACs get a mini-kickstart that boots a RAM-only Fedora, scrapes hardware via `/proc`, `/sys`, `dmidecode`, POSTs to `/api/discover`, then reboots. No disk touch. |
| Machine installation | `routes/api.ts`, `templates/install.ks.ts` | Queue a MAC via `POST /api/install`. Next PXE boot gets a full Kickstart with LVM partitioning (worker: longhorn LV, infra: rancher LV), SSH keys, k3s kernel prereqs, progress callbacks. |
| Reprovision with data preservation | `commands/reprovision.ts`, `install.ks.ts` | `%pre` script detects existing LVM. Reformats `/`, `/var`, `/boot` but preserves `/home`, `/srv`, `/var/lib/longhorn`, `/var/lib/rancher`. |
| CLI: init/provision commands | `src/cli/src/commands/` | `labctl init bastion standalone start/stop/status`, `labctl provision list/install/reprovision/forget`. All talk to bastion HTTP API. |
| CLI: config management | `config/index.ts`, `commands/config.ts` | `labctl config list/get/set/path`. YAML config at `~/.labctl/config.yaml` with env var overrides. |
| labd scaffold | `src/labd/` | Fastify server with health, server listing, token management routes. Prisma schema for all models. Starts with or without database. |
| Prisma schema | `prisma/schema.prisma` | 10 models: Server, Agent, User, Role, Permission, UserRole, JoinToken, AuditLog, PulumiRun, Cluster. CockroachDB provider. |
| Database seeding | `prisma/seed.ts` | Creates admin/viewer/operator roles with proper allow/deny permissions. Idempotent via upsert. |
| Multi-arch builds + packaging | `nfpm.yaml`, `scripts/` | nfpm config for RPM/DEB. Bun compile for standalone binary (102MB labctl in `dist/`). |
| Gitea CI/CD | `.gitea/` (on remote) | Lint → typecheck → test → build → publish pipeline on mysources.co.uk. |
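The dispatch behaviour described in the first rows of this table (unknown MACs go to discovery, queued MACs get the installer) can be pictured as a small decision function. The sketch below is illustrative only and is not the contents of `routes/dispatch.ts`: the names `MachineRecord`, `BastionState`, `dispatchAction`, and `ipxeScriptFor`, the status values, the `/boot` and `/ks` URL paths, and the local-boot fallback are all assumptions.

```typescript
// Hypothetical per-MAC state; the real JSON file layout may differ.
type MachineStatus = "discovered" | "install-queued" | "installed";

interface MachineRecord {
  mac: string;
  status: MachineStatus;
}

type BastionState = Record<string, MachineRecord>;
type DispatchAction = "discover" | "install" | "local-boot";

// Normalise a MAC so "AA-BB-..." and "aa:bb:..." name the same machine.
function normalizeMac(mac: string): string {
  return mac.trim().toLowerCase().replace(/-/g, ":");
}

// Unknown MACs are sent to discovery and queued MACs to the installer.
// Falling back to the local disk for everything else is an assumption.
function dispatchAction(state: BastionState, mac: string): DispatchAction {
  const record: MachineRecord | undefined = state[normalizeMac(mac)];
  if (record === undefined) return "discover";
  if (record.status === "install-queued") return "install";
  return "local-boot";
}

// Render a minimal iPXE script. The /boot and /ks paths are placeholders.
function ipxeScriptFor(action: DispatchAction, baseUrl: string, mac: string): string {
  if (action === "local-boot") {
    // "exit" hands control back to the firmware's next boot device.
    return ["#!ipxe", "exit", ""].join("\n");
  }
  const kickstart = `${baseUrl}/ks/${action}?mac=${encodeURIComponent(normalizeMac(mac))}`;
  return [
    "#!ipxe",
    `kernel ${baseUrl}/boot/vmlinuz inst.ks=${kickstart}`,
    `initrd ${baseUrl}/boot/initrd.img`,
    "boot",
    "",
  ].join("\n");
}
```

Keeping the decision (`dispatchAction`) separate from the rendering (`ipxeScriptFor`) would make the state transitions testable without a PXE client.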

### DONE — Code exists, not yet connected end-to-end

| Feature | Code | What's Real | What's Missing |
|---------|------|------------|----------------|
| lab-agent connection library | `lab-agent/src/services/connection.ts` | `AgentConnection` class: WebSocket to labd, heartbeat (10s), exponential backoff reconnect (1-30s), state machine (disconnected/connecting/connected/reconnecting), handles server-shutdown messages. | **No daemon binary.** This is a library — nothing starts it. No systemd unit. No enrollment flow. |
| lab-agent command executor | `lab-agent/src/services/executor.ts` | `CommandExecutor` class: `spawn()` with timeout handling (SIGTERM then SIGKILL after 5s), stdout/stderr streaming via EventEmitter, stdin writing, signal forwarding. | **Not wired to WebSocket.** The executor and connection don't talk to each other. No message dispatch. |
| Agent registry (labd) | `labd/src/services/agent-registry.ts` | `AgentRegistry`: in-memory Map tracking by serverId and hostname, lifecycle events, heartbeat updates. Singleton exported. | **Not used by /ws/agent handler.** The WebSocket handler in `server.ts` just logs messages — it doesn't call `agentRegistry.register()`. |
| Message router (labd) | `labd/src/services/message-router.ts` | `MessageRouter`: handler registration, pending request tracking with timeouts, streaming support, log subscription, agent cleanup on disconnect. | **Not used.** `server.ts` doesn't call `messageRouter.handleMessage()`. The router exists but is dead code. |
| Token management | `labd/src/routes/auth.ts` | Create, list, revoke join tokens. Validates one-time vs reusable, expiry, revocation. Marks tokens as used. | Token validation works. **But enrollment returns `certificatePem: null`** — no actual certificate is issued. |
| CLI API client | `cli/src/api/client.ts` | `LabdClient` with mTLS support, typed methods for servers/tokens/health/enrollment. | Works for REST endpoints. **No CLI commands use it yet** — existing commands still talk directly to bastion HTTP. |
| CLI WebSocket streaming | `cli/src/api/websocket.ts` | `streamExec()` and `streamLogs()` functions. | **No `labctl exec` or `labctl logs` commands exist.** The streaming code has no consumer. |
| Zod validation | `labd/src/validation/` | Schemas for createToken, enrollment, serverFilters, createRole, permission patterns. Middleware for body/query validation. | **Not applied to routes.** The schemas and middleware exist but no route uses `preHandler: [validateBody(schema)]`. |
| Encryption service | `labd/src/services/encryption.ts` | AES-256-GCM with scrypt key derivation. Encrypt/decrypt roundtrip. Singleton from `CA_ENCRYPTION_KEY` env var. | **Not used anywhere.** No CA key is encrypted, no kubeconfig is stored. |
| Graceful shutdown | `labd/src/services/shutdown.ts` | SIGTERM/SIGINT handlers, agent notification, message router cleanup, DB disconnect, force exit timer. | Works but agent notification is a no-op since no agents are registered (see above). |
| Rate limiting | `labd/src/middleware/rate-limit.ts` | `@fastify/rate-limit`: 100/min global, 10/min for enrollment, 20/min for tokens. | **Wired up in `server.ts`.** This actually works. |
| Health checks | `labd/src/routes/health.ts` | `/healthz`, `/health`, `/health/detailed`, `/health/live`, `/health/ready`. Checks DB latency and agent count. | Works. Returns `agents: { connected: 0 }` since no agents ever register. |
| Error hierarchy | `shared/src/errors/` | `LabError`, `NotFoundError`, `PermissionDeniedError`, `ValidationError`, `AgentNotConnectedError`. | **Not used in routes.** Routes still use inline `reply.code(404).send({error: ...})`. |
| Table formatting | `cli/src/utils/table.ts` | `printTable`, `formatStatus`, `formatRelativeTime`, predefined column sets. | **Not used by existing commands.** `provision list` has its own inline formatting. |
| Resource parsing | `cli/src/utils/resource.ts` | Parse `server/labmaster`, `app/kube-system/nginx` format. | **Not used.** No commands accept `type/name` arguments yet. |
| Doctor command | `cli/src/commands/doctor.ts` | Config, cert, connectivity diagnostics. | Works standalone. |
| Login command | `cli/src/commands/login.ts` | Generates EC keypair, prompts for token, POSTs to `/api/auth/user-enroll`. | **labd has no `/api/auth/user-enroll` endpoint.** Only `/api/auth/enroll` exists (for agents). Login will 404. |
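The reconnect behaviour listed for `AgentConnection` (exponential backoff between 1 s and 30 s) reduces to a one-line formula. The following is a minimal sketch of that idea, not the code in `connection.ts`: the function name `reconnectDelayMs`, the doubling factor, and the absence of jitter are assumptions.

```typescript
// Delay before reconnect attempt `attempt` (0-based): 1s, 2s, 4s, ... capped at 30s.
// The doubling factor and the lack of jitter are assumptions for this sketch.
function reconnectDelayMs(attempt: number, baseMs = 1000, maxMs = 30000): number {
  const safeAttempt = Math.max(0, Math.floor(attempt));
  // Cap the exponent so 2 ** n stays a sane number for very long outages.
  const exponent = Math.min(safeAttempt, 30);
  return Math.min(maxMs, baseMs * 2 ** exponent);
}

// First seven delays: 1000, 2000, 4000, 8000, 16000, 30000, 30000
const exampleSchedule = [0, 1, 2, 3, 4, 5, 6].map((n) => reconnectDelayMs(n));
console.log(exampleSchedule.join(", "));
```

With these values an agent reaches the 30 s ceiling on its sixth attempt, so a long labd outage settles at roughly two reconnect attempts per minute per agent.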

### NOT DONE — Phase 1 items from PRD with no code

| Feature | PRD Description | Status |
|---------|----------------|--------|
| Certificate Authority | Built-in CA in labd. Generate root CA, sign CSRs, revoke certs, rotate. | **Nothing.** No CA code. No X.509 operations. No `@peculiar/x509` dependency. `EncryptionService` exists but it's for data-at-rest, not PKI. |
| RBAC engine | Middleware that checks permissions on every request. Deny overrides allow. | **Nothing.** `auth.ts` middleware is a placeholder. No route checks permissions. Anyone can call any endpoint. |
| Audit logging | Log every action with user, session, action, resource, result, duration. | **Nothing.** `AuditLog` Prisma model exists but nothing writes to it. No audit middleware. |
| `labctl exec` | Remote command execution via labd → agent WebSocket relay. | **Nothing.** No `exec` CLI command. The executor library exists in lab-agent but isn't connected. |
| `labctl logs` | Resource-scoped log streaming (server, app, bastion, audit). | **Nothing.** No `logs` CLI command. |
| `labctl get servers` | List servers from labd with filters. | **Nothing.** No `get` CLI command. The API client has `getServers()` but no command calls it. |
| Smoke test stack | `podman-compose` with CockroachDB + labd + 2 agents, testing enrollment/heartbeat/exec/RBAC. | **Nothing.** `stack/docker-compose.yml` exists but only runs bastion + CockroachDB, not labd or agents. |
| Agent enrollment during PXE | Embed join token in kickstart, agent auto-enrolls on first boot. | **Nothing.** Kickstart installs k3s prereqs but doesn't install or start lab-agent. |
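The "deny overrides allow" rule in the RBAC engine row is easy to get subtly wrong, so a small worked sketch may help. It is not existing code: the `PermissionRule` shape, the trailing-`*` pattern syntax, and the default-deny behaviour when no rule matches are assumptions chosen for illustration. For example, with an allow on `exec` for `server/*` and a deny on `exec` for `server/labmaster`, `isAllowed` returns `true` for `server/worker-1` and `false` for `server/labmaster`.

```typescript
type Effect = "allow" | "deny";

// Hypothetical rule shape; the real Permission model may differ.
interface PermissionRule {
  effect: Effect;
  action: string;   // e.g. "exec", or "*" for any action
  resource: string; // e.g. "server/labmaster", "server/*", or "*"
}

// Match a literal, "*", or a trailing-wildcard pattern such as "server/*".
function matchesPattern(pattern: string, value: string): boolean {
  if (pattern === "*") return true;
  if (pattern.endsWith("*")) return value.startsWith(pattern.slice(0, -1));
  return pattern === value;
}

// A request is allowed only if at least one allow rule matches
// and no deny rule matches. Nothing matching means denied.
function isAllowed(
  rules: readonly PermissionRule[],
  action: string,
  resource: string,
): boolean {
  let allowed = false;
  for (const rule of rules) {
    if (!matchesPattern(rule.action, action)) continue;
    if (!matchesPattern(rule.resource, resource)) continue;
    if (rule.effect === "deny") return false; // deny always wins
    allowed = true;
  }
  return allowed;
}
```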

---

## Phase 2: Deployment

**Nothing from Phase 2 has been deployed.** The only Phase 2 artifact that exists is the set of labd K8s manifests.

| Feature | Status |
|---------|--------|
| Reprovision labmaster as labmaster.ad.itaz.eu | Not done — manual operation |
| Deploy k3s with Cilium CNI | Not done — kickstart only sets up kernel prereqs, leaves a comment "run `curl -sfL https://get.k3s.io`" |
| Deploy CockroachDB on k3s | Not done — `docker-compose.yml` runs it in-memory for dev, no k8s manifests for CRDB |
| Deploy labd on k3s | **K8s manifests exist** (`deploy/k8s/labd/base/`) — Deployment, Service, ConfigMap, HPA, PDB. But no CockroachDB to connect to and no TLS configured. |
| Deploy bastion as managed app | Not done — bastion runs standalone, no Pulumi chart |
| Auto-enroll agents during PXE | Not done — no agent install in kickstart, no token embedding |

---

## Phase 3: Infrastructure as Code

**Nothing from Phase 3 has been built.**

| Feature | Status |
|---------|--------|
| Module system | Not done — no `module.yaml`, no module loader |
| Pulumi charts | Not done — no Pulumi dependency, no chart structure |
| `labctl apps install/upgrade/rollback` | Not done — no `apps` command |
| `labctl apply -f` | Not done — no `apply` command |
| `kubectl proxy` (audited) | Not done — no kubectl proxy |
| Kubeconfig store (encrypted) | `EncryptionService` exists but nothing uses it. `Cluster.kubeconfigEnc` field exists in Prisma but nothing reads/writes it. |

---

## Phase 4: Multi-Cloud

**Nothing from Phase 4 has been built.**

| Feature | Status |
|---------|--------|
| AWS provider | Not done |
| Reusable join tokens for ASGs | Token model supports `reusable` type, but no AWS integration |
| Cilium Cluster Mesh | Not done |
| Ephemeral test environments | Not done |
| Grafana Loki | Not done |

---

## Infrastructure Files

| File | Status |
|------|--------|
| `Dockerfile.labd` | Exists. Multi-stage Alpine build. Would work if you `docker build` it. |
| `Dockerfile.bastion` | Exists. Multi-stage Fedora build. Would work. |
| `.dockerignore` | Exists. |
| `deploy/k8s/labd/base/` | Kustomize manifests for labd (Deployment, Service, ConfigMap, HPA, PDB). Points at a non-existent CockroachDB and has no TLS. |
| `stack/docker-compose.yml` | Runs bastion + CockroachDB for local dev. Works. |
| `nfpm.yaml` | RPM/DEB packaging config. Works with `nfpm pkg`. |

---

## The Disconnection Problem

The core issue is that many services were built in isolation but never wired together:

```
┌────────────────────────────────────────────────────────────┐
│                  BUILT BUT NOT CONNECTED                   │
│                                                            │
│  AgentConnection   ──✗──▶ /ws/agent handler                │
│  CommandExecutor   ──✗──▶ MessageRouter                    │
│  MessageRouter     ──✗──▶ /ws/agent handler                │
│  AgentRegistry     ──✗──▶ /ws/agent handler                │
│  Zod schemas       ──✗──▶ Route preHandlers                │
│  Error classes     ──✗──▶ Route error handling             │
│  LabdClient        ──✗──▶ CLI commands (get/exec/logs)     │
│  Table formatting  ──✗──▶ CLI commands                     │
│  Resource parsing  ──✗──▶ CLI commands                     │
│  EncryptionService ──✗──▶ CA / kubeconfig storage          │
│  Login command     ──✗──▶ /api/auth/user-enroll (missing)  │
│  Audit logging     ──✗──▶ Any middleware                   │
│  RBAC engine       ──✗──▶ Any middleware                   │
└────────────────────────────────────────────────────────────┘
```

---

## What Actually Works End-to-End Today

1. **PXE boot a bare-metal machine:**
   ```
   labctl init bastion standalone start
   # Machine PXE boots → discovered automatically
   labctl provision list
   labctl provision install AA:BB:CC:DD:EE:FF worker-1 --role worker
   # Machine reboots → installs Fedora → reports complete
   ```

2. **Manage bastion lifecycle:**
   ```
   labctl init bastion standalone status
   labctl init bastion standalone stop
   ```

3. **Start labd (without database):**
   ```
   LABD_PORT=3100 tsx src/labd/src/main.ts
   # Starts with stub DB, health endpoint works, token/server routes return errors
   ```

4. **Start labd (with CockroachDB):**
   ```
   docker-compose -f stack/docker-compose.yml up cockroachdb
   DATABASE_URL=postgresql://root@localhost:26257/lab tsx src/labd/src/main.ts
   # Token creation/listing/revocation works
   # Server listing works (empty until agents register)
   ```

5. **CLI diagnostics:**
   ```
   labctl doctor
   labctl config list
   labctl version
   ```

That's it. No agent communication, no remote exec, no log streaming, no RBAC, no certificates.

---

## Recommended Next Steps (to make Phase 1 actually work)

### Priority 1: Wire up the agent connection
1. Update `/ws/agent` handler to use `agentRegistry.register()` and `messageRouter.handleMessage()`
2. Create lab-agent daemon binary that uses `AgentConnection` + `CommandExecutor`
3. Create systemd unit for lab-agent
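Step 1 could look roughly like the sketch below. Only the method names `register()` and `handleMessage()` come from this document; their signatures, the `unregister()` method, the `AgentMessage` shape, and the stand-in classes are assumptions made so the example runs on its own. It also assumes the socket is an `EventEmitter`-style object that emits `message` and `close`, as sockets from the `ws` package do.

```typescript
import { EventEmitter } from "node:events";

// Hypothetical message envelope; the real protocol lives in @lab/shared.
interface AgentMessage {
  type: string;
  serverId: string;
  payload?: unknown;
}

// Stand-in for the real AgentRegistry (signatures are assumptions).
class AgentRegistry {
  private readonly agents = new Map<string, { hostname: string; lastSeen: number }>();

  register(serverId: string, hostname: string): void {
    this.agents.set(serverId, { hostname, lastSeen: Date.now() });
  }

  unregister(serverId: string): void {
    this.agents.delete(serverId);
  }

  get size(): number {
    return this.agents.size;
  }
}

// Stand-in for the real MessageRouter; it only records what it receives.
class MessageRouter {
  readonly handled: AgentMessage[] = [];

  handleMessage(message: AgentMessage): void {
    this.handled.push(message);
  }
}

// What the /ws/agent handler could do per connection instead of only logging.
function attachAgentSocket(
  socket: EventEmitter,
  serverId: string,
  hostname: string,
  registry: AgentRegistry,
  router: MessageRouter,
): void {
  registry.register(serverId, hostname);

  socket.on("message", (raw: unknown) => {
    let message: AgentMessage;
    try {
      message = JSON.parse(String(raw)) as AgentMessage;
    } catch {
      return; // ignore malformed frames in this sketch
    }
    router.handleMessage(message);
  });

  // Drop the agent from the registry when the connection goes away.
  socket.on("close", () => registry.unregister(serverId));
}
```

Once agents are actually registered, the health endpoint's `agents.connected` count and the shutdown notification path would have real data to work with.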

### Priority 2: Certificate Authority
1. Add `@peculiar/x509` dependency
2. Implement CA service: generate root CA, sign CSRs
3. Wire enrollment route to actually sign and return certificates
4. Store CA key encrypted using `EncryptionService`
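For step 4, the scheme this document attributes to `EncryptionService` (AES-256-GCM with scrypt key derivation) can be sketched with Node's built-in `node:crypto` module. This is an illustration of the technique under stated assumptions, not the service's actual code: the `SealedSecret` layout, the function names `sealSecret` and `unsealSecret`, the salt and nonce sizes, and the base64 encoding are all hypothetical. Signing CSRs with `@peculiar/x509` is deliberately left out.

```typescript
import { createCipheriv, createDecipheriv, randomBytes, scryptSync } from "node:crypto";

// Hypothetical storage format for an encrypted CA private key.
interface SealedSecret {
  salt: string;       // base64, random scrypt salt
  iv: string;         // base64, 12-byte GCM nonce
  tag: string;        // base64, GCM authentication tag
  ciphertext: string; // base64
}

// Derive a 32-byte AES-256 key from a passphrase (e.g. CA_ENCRYPTION_KEY).
function deriveKey(passphrase: string, salt: Buffer): Buffer {
  return scryptSync(passphrase, salt, 32);
}

function sealSecret(plaintext: string, passphrase: string): SealedSecret {
  const salt = randomBytes(16);
  const iv = randomBytes(12);
  const cipher = createCipheriv("aes-256-gcm", deriveKey(passphrase, salt), iv);
  const ciphertext = Buffer.concat([cipher.update(plaintext, "utf8"), cipher.final()]);
  return {
    salt: salt.toString("base64"),
    iv: iv.toString("base64"),
    tag: cipher.getAuthTag().toString("base64"),
    ciphertext: ciphertext.toString("base64"),
  };
}

function unsealSecret(sealed: SealedSecret, passphrase: string): string {
  const key = deriveKey(passphrase, Buffer.from(sealed.salt, "base64"));
  const decipher = createDecipheriv("aes-256-gcm", key, Buffer.from(sealed.iv, "base64"));
  decipher.setAuthTag(Buffer.from(sealed.tag, "base64"));
  const plaintext = Buffer.concat([
    decipher.update(Buffer.from(sealed.ciphertext, "base64")),
    decipher.final(), // throws if the passphrase or the data is wrong
  ]);
  return plaintext.toString("utf8");
}
```

Because GCM authenticates the ciphertext, a wrong `CA_ENCRYPTION_KEY` or a tampered record fails loudly at `decipher.final()` rather than yielding a corrupt key.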

### Priority 3: RBAC + Audit
1. Create RBAC middleware that checks `Permission` table
2. Create audit middleware that writes to `AuditLog`
3. Apply both to all routes

### Priority 4: CLI commands for labd
1. `labctl get servers` using `LabdClient.getServers()`
2. `labctl exec server/<name>` using `streamExec()`
3. `labctl logs server/<name>` using `streamLogs()`
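The `exec` and `logs` commands take `type/name` arguments, which is the format the existing resource parser is described as handling (`server/labmaster`, `app/kube-system/nginx`). A minimal sketch of such a parser follows; the `ResourceRef` shape, the function name `parseResource`, and the error wording are assumptions and may not match `cli/src/utils/resource.ts`.

```typescript
// Hypothetical parsed form of a CLI resource argument.
interface ResourceRef {
  type: string;       // e.g. "server" or "app"
  namespace?: string; // only present for three-part references
  name: string;
}

// Parse "type/name" or "type/namespace/name" into its parts.
function parseResource(input: string): ResourceRef {
  const [type, second, third, ...rest] = input.trim().split("/");
  if (!type || !second || third === "" || rest.length > 0) {
    throw new Error(
      `Invalid resource "${input}": expected type/name or type/namespace/name`,
    );
  }
  if (third === undefined) {
    return { type, name: second };
  }
  return { type, namespace: second, name: third };
}

console.log(parseResource("server/labmaster"));
console.log(parseResource("app/kube-system/nginx"));
```

Rejecting malformed arguments in the CLI gives the user an immediate error instead of a failed round trip to labd.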

### Priority 5: Smoke test stack
1. Update `docker-compose.yml` to include labd + 2 agents
2. Write integration tests for enrollment → heartbeat → exec → logs