feat: install logging, error trapping, PXE/ISO integration tests

Kickstart installs on real hardware failed silently: no error reporting, only 3 progress callbacks, zero log streaming. This overhaul makes every install fully observable.

Kickstart improvements:
- Error trapping in %pre and %post (trap ERR sends failure details to bastion)
- 12+ granular progress stages (was 3): SSH, hostname, k3s prep, EFI boot, metadata
- Background log streamer: tails %post output and batch-sends to /api/log
- bastion_log() function for explicit log lines from kickstart scripts

Bastion API:
- POST /api/log: receives raw log lines from kickstart (single or batch)
- InstallLogBuffer: per-MAC ring buffer (2000 lines) + file persistence
- GET /api/logs/:mac: now returns log_lines + log_total alongside stages
- SSE /api/logs/:mac/follow: uses named events (event: stage vs event: log)
- Progress events forwarded to labd via bastion-progress WebSocket message
- Post-provision k3s logs routed through progressBus (was console-only)

dnsmasq fixes found during VM testing:
- HTTP Boot filename: ipxe-real.efi → ipxe.efi (leftover from old 2-stage approach)
- pxe-service directives: only in proxy mode (breaks OVMF PXE in full mode)
- PXEClient vendor class echo for UEFI firmware compatibility

Integration tests:
- PXE boot test: blank UEFI VM → dnsmasq → HTTP Boot → iPXE → bastion → install
- ISO boot test: blank VM boots from bastion-generated ISO → same flow
- Shared helpers: pxe-network (no DHCP, nftables fix), pxe-vm (UEFI + ISO boot)
- test-provision.sh: runs both PXE + ISO tests with prerequisite checks
- 250GB sparse QCOW2 disk (LVM layout needs ~204GB)

201 unit tests passing (11 new).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
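
Illustrative note on the per-MAC ring buffer described above. This is a minimal sketch, not the committed InstallLogBuffer: the class names (LineRing, InstallLogSketch), the method names, and the reading of log_total as "lines ever received, including evicted ones" are all assumptions made for illustration. File persistence is left out.

```typescript
// Hypothetical sketch of a per-MAC ring buffer; names and semantics are assumed.
const MAX_LINES = 2000; // capacity taken from the commit message

class LineRing {
  private readonly lines: string[] = [];
  private readonly capacity: number;
  private total = 0; // assumed meaning of log_total: every line ever appended

  constructor(capacity: number = MAX_LINES) {
    this.capacity = capacity;
  }

  push(line: string): void {
    this.lines.push(line);
    this.total += 1;
    // Drop the oldest line once the buffer is over capacity.
    if (this.lines.length > this.capacity) this.lines.shift();
  }

  snapshot(): { log_lines: string[]; log_total: number } {
    return { log_lines: [...this.lines], log_total: this.total };
  }
}

class InstallLogSketch {
  private readonly byMac = new Map<string, LineRing>();
  private readonly capacity: number;

  constructor(capacity: number = MAX_LINES) {
    this.capacity = capacity;
  }

  // Accepts a single line or a batch, mirroring "single or batch" on POST /api/log.
  append(mac: string, batch: string | string[]): void {
    const key = mac.toLowerCase(); // normalize so AA:BB... and aa:bb... share a buffer
    let ring = this.byMac.get(key);
    if (!ring) {
      ring = new LineRing(this.capacity);
      this.byMac.set(key, ring);
    }
    for (const line of Array.isArray(batch) ? batch : [batch]) ring.push(line);
  }

  get(mac: string): { log_lines: string[]; log_total: number } {
    const ring = this.byMac.get(mac.toLowerCase());
    return ring ? ring.snapshot() : { log_lines: [], log_total: 0 };
  }
}
```

Keeping a running total next to the bounded array lets a reader of GET /api/logs/:mac tell how many lines were dropped, assuming log_total is meant that way.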
2026-03-26 22:26:33 +00:00
parent ffc4a782d2
commit 46b017d77e
189 changed files with 16241 additions and 432 deletions
--- a/.taskmaster/config.json
+++ b/.taskmaster/config.json
@@ -1,22 +1,21 @@
 {
  "models": {
    "main": {
-      "provider": "anthropic",
-      "modelId": "claude-sonnet-4-20250514",
-      "maxTokens": 64000,
+      "provider": "claude-code",
+      "modelId": "opus",
+      "maxTokens": 32000,
      "temperature": 0.2
    },
    "research": {
-      "provider": "anthropic",
-      "modelId": "claude-sonnet-4-20250514",
-      "maxTokens": 64000,
+      "provider": "claude-code",
+      "modelId": "opus",
+      "maxTokens": 32000,
      "temperature": 0.2
    },
-    "resolution": "main",
    "fallback": {
-      "provider": "anthropic",
-      "modelId": "claude-3-7-sonnet-20250219",
-      "maxTokens": 120000,
+      "provider": "claude-code",
+      "modelId": "sonnet",
+      "maxTokens": 64000,
      "temperature": 0.2
    }
  },
--- a/.taskmaster/state.json
+++ b/.taskmaster/state.json
@@ -0,0 +1,6 @@
+{
+  "currentTag": "master",
+  "lastSwitched": "2026-03-18T00:17:54.213Z",
+  "branchTagMapping": {},
+  "migrationNoticeShown": true
+}
--- a/.taskmaster/tasks/tasks.json
+++ b/.taskmaster/tasks/tasks.json
@@ -0,0 +1,180 @@
+{
+  "master": {
+    "tasks": [
+      {
+        "id": 72,
+        "title": "Expand Prisma Schema with Resource Relationships",
+        "description": "Add Network, ServerNic, ServerDisk, and ClusterMember models to the Prisma schema. Add bastionId foreign key to Server model to track which bastion owns each server.",
+        "details": "Edit `bastion/src/labd/prisma/schema.prisma` to add:\n\n1. **Server model changes**:\n   - Add `bastionId String?` with relation to Bastion\n   - Add `hardwareInfo Json?` for storing raw HardwareInfo\n   - Add `os String?` for installed OS\n\n2. **Network model**:\n```prisma\nmodel Network {\n  id          String   @id @default(uuid())\n  name        String   @unique\n  cidr        String\n  vlan        Int?\n  gateway     String?\n  domain      String?\n  dhcpEnabled Boolean  @default(false)\n  createdAt   DateTime @default(now())\n  updatedAt   DateTime @updatedAt\n  \n  nics ServerNic[]\n}\n```\n\n3. **ServerNic model**:\n```prisma\nmodel ServerNic {\n  id        String  @id @default(uuid())\n  serverId  String\n  server    Server  @relation(fields: [serverId], references: [id], onDelete: Cascade)\n  networkId String?\n  network   Network? @relation(fields: [networkId], references: [id])\n  mac       String\n  ip        String?\n  name      String\n  state     String  @default(\"DOWN\")\n  \n  @@unique([serverId, mac])\n  @@index([networkId])\n}\n```\n\n4. **ServerDisk model**:\n```prisma\nmodel ServerDisk {\n  id       String @id @default(uuid())\n  serverId String\n  server   Server @relation(fields: [serverId], references: [id], onDelete: Cascade)\n  name     String\n  sizeGb   Float\n  model    String?\n  \n  @@unique([serverId, name])\n}\n```\n\n5. **ClusterMember model**:\n```prisma\nmodel ClusterMember {\n  id        String @id @default(uuid())\n  clusterId String\n  cluster   Cluster @relation(fields: [clusterId], references: [id], onDelete: Cascade)\n  serverId  String\n  server    Server  @relation(fields: [serverId], references: [id], onDelete: Cascade)\n  role      String  @default(\"worker\") // control-plane, worker\n  joinedAt  DateTime @default(now())\n  \n  @@unique([clusterId, serverId])\n  @@index([clusterId])\n  @@index([serverId])\n}\n```\n\n6. 
Update Server model with relations to nics, disks, clusterMemberships, and bastion.\n\nRun `pnpm prisma generate` and `pnpm prisma migrate dev --name add-resource-models`.",
+        "testStrategy": "1. Run `pnpm prisma validate` to verify schema syntax\n2. Run `pnpm prisma generate` to confirm client generation\n3. Create migration and verify it applies cleanly to local CockroachDB\n4. Write unit tests that create/read/delete each new model\n5. Verify cascade deletes work (deleting Server removes its NICs and Disks)",
+        "priority": "high",
+        "dependencies": [],
+        "status": "pending",
+        "subtasks": []
+      },
+      {
+        "id": 73,
+        "title": "Implement State Persistence Service in labd",
+        "description": "Create a new service in labd that persists bastion state syncs to the Server table in CockroachDB. When bastion-state-sync messages arrive, upsert machines into Server with their hardware info, status, and ownership.",
+        "details": "Create `bastion/src/labd/src/services/state-persistence.ts`:\n\n```typescript\nimport type { PrismaClient } from \"@prisma/client\";\nimport type { BastionState, HardwareInfo, InstallConfig, InstalledInfo } from \"@lab/shared\";\nimport { logger } from \"./logger.js\";\n\nexport class StatePersistence {\n  constructor(private readonly db: PrismaClient) {}\n\n  async syncBastionState(bastionId: string, state: BastionState): Promise<void> {\n    // Process discovered machines\n    for (const [mac, hw] of Object.entries(state.discovered)) {\n      await this.upsertDiscoveredServer(bastionId, mac, hw);\n    }\n    \n    // Process queued machines (update status to provisioning)\n    for (const [mac, cfg] of Object.entries(state.install_queue)) {\n      await this.upsertQueuedServer(bastionId, mac, cfg);\n    }\n    \n    // Process installed machines\n    for (const [mac, info] of Object.entries(state.installed)) {\n      await this.upsertInstalledServer(bastionId, mac, info);\n    }\n  }\n\n  private async upsertDiscoveredServer(bastionId: string, mac: string, hw: HardwareInfo): Promise<void> {\n    const normalized = mac.toLowerCase();\n    \n    await this.db.server.upsert({\n      where: { mac: normalized },\n      create: {\n        hostname: `unknown-${normalized.replace(/:/g, \"\").slice(-6)}`,\n        mac: normalized,\n        bastionId,\n        status: \"discovered\",\n        hardwareInfo: hw as any,\n        labels: {\n          arch: hw.arch,\n          cpu_model: hw.cpu_model,\n          cpu_cores: hw.cpu_cores,\n          memory_gb: hw.memory_gb,\n        },\n      },\n      update: {\n        bastionId,\n        status: \"discovered\", // only if not already provisioning/installed\n        hardwareInfo: hw as any,\n      },\n    });\n    \n    // Sync NICs and Disks\n    await this.syncServerHardware(normalized, hw);\n  }\n  \n  private async syncServerHardware(mac: string, hw: HardwareInfo): Promise<void> {\n    const server = 
await this.db.server.findUnique({ where: { mac } });\n    if (!server) return;\n    \n    // Upsert NICs\n    for (const nic of hw.nics) {\n      await this.db.serverNic.upsert({\n        where: { serverId_mac: { serverId: server.id, mac: nic.mac.toLowerCase() } },\n        create: { serverId: server.id, mac: nic.mac.toLowerCase(), name: nic.name, state: nic.state },\n        update: { name: nic.name, state: nic.state },\n      });\n    }\n    \n    // Upsert Disks\n    for (const disk of hw.disks) {\n      await this.db.serverDisk.upsert({\n        where: { serverId_name: { serverId: server.id, name: disk.name } },\n        create: { serverId: server.id, name: disk.name, sizeGb: disk.size_gb, model: disk.model },\n        update: { sizeGb: disk.size_gb, model: disk.model },\n      });\n    }\n  }\n  \n  // Similar methods for upsertQueuedServer and upsertInstalledServer...\n}\n```\n\nIntegrate into `server.ts` WebSocket handler by calling `statePersistence.syncBastionState()` when `bastion-state-sync` messages arrive.",
+        "testStrategy": "1. Unit test StatePersistence with mocked PrismaClient\n2. Integration test: simulate bastion-state-sync message, verify Server rows created\n3. Test idempotency: send same state twice, verify no duplicates\n4. Test status transitions: discovered -> provisioning -> installed\n5. Verify hardware info (NICs, Disks) is correctly persisted",
+        "priority": "high",
+        "dependencies": [
+          72
+        ],
+        "status": "pending",
+        "subtasks": []
+      },
+      {
+        "id": 74,
+        "title": "Add State Loading from labd on Bastion Startup",
+        "description": "Modify bastion startup to request its persisted state from labd before using the local JSON cache. This ensures bastions restore their state after pod restarts.",
+        "details": "1. Add new labd API endpoint `GET /api/bastions/:id/state` that returns the aggregated state for a specific bastion from the Server table:\n\n```typescript\n// bastion/src/labd/src/routes/bastions.ts\napp.get<{ Params: { id: string } }>(\"/api/bastions/:id/state\", async (request, reply) => {\n  const { id } = request.params;\n  \n  const servers = await db.server.findMany({\n    where: { bastionId: id },\n    include: { nics: true, disks: true },\n  });\n  \n  // Transform back to BastionState format\n  const state: BastionState = { discovered: {}, install_queue: {}, installed: {} };\n  for (const server of servers) {\n    const mac = server.mac;\n    if (!mac) continue;\n    \n    switch (server.status) {\n      case \"discovered\":\n        state.discovered[mac] = transformToHardwareInfo(server);\n        break;\n      case \"provisioning\":\n        state.install_queue[mac] = transformToInstallConfig(server);\n        break;\n      case \"installed\":\n        state.installed[mac] = transformToInstalledInfo(server);\n        break;\n    }\n  }\n  \n  return reply.send(state);\n});\n```\n\n2. Modify `BastionConnection.connect()` in `labd-connection.ts` to fetch state after enrollment:\n\n```typescript\nprivate async loadRemoteState(): Promise<BastionState | null> {\n  if (!this.bastionId || !this.config.labdUrl) return null;\n  try {\n    const resp = await fetch(`${this.config.labdUrl}/api/bastions/${this.bastionId}/state`);\n    if (resp.ok) return await resp.json();\n  } catch { /* fall back to local */ }\n  return null;\n}\n```\n\n3. In bastion `main.ts`, after establishing labd connection, merge remote state with local state (remote takes precedence for installed machines, local wins for in-progress installs).",
+        "testStrategy": "1. Integration test: start bastion, let it persist state, restart bastion, verify state restored\n2. Test merge logic: local has in-progress install, remote has discovered - verify install preserved\n3. Test offline mode: labd unavailable, bastion falls back to local JSON\n4. Test fresh start: no local state, no remote state - bastion starts with empty state",
+        "priority": "high",
+        "dependencies": [
+          73
+        ],
+        "status": "pending",
+        "subtasks": []
+      },
+      {
+        "id": 75,
+        "title": "Fix Bastion --dir Environment Variable Default",
+        "description": "Fix the bug where CLI's --dir default overrides the BASTION_DIR environment variable. The CLI option should use the env var as its default.",
+        "details": "Edit `bastion/src/cli/src/commands/serve.ts`:\n\n```typescript\n// Before (line 14):\n.option(\"--dir <dir>\", \"Bastion data directory\", \"/tmp/lab-bastion\")\n\n// After:\n.option(\n  \"--dir <dir>\",\n  \"Bastion data directory\",\n  process.env[\"BASTION_DIR\"] ?? \"/tmp/lab-bastion\"\n)\n```\n\nThis ensures:\n1. If `BASTION_DIR` env var is set (e.g., in k8s deployment), it's used as default\n2. Explicit `--dir` flag still overrides both\n3. Falls back to `/tmp/lab-bastion` if neither is set\n\nAlso update the k8s deployment manifest `bastion/deploy/k3s/deployment.yaml` to ensure `BASTION_DIR=/data` is properly set.",
+        "testStrategy": "1. Unit test: verify option default reads from process.env\n2. Integration test: set BASTION_DIR, run labctl without --dir, verify correct dir used\n3. Integration test: set BASTION_DIR, run labctl with --dir /custom, verify /custom used\n4. Test no env var: verify default /tmp/lab-bastion used",
+        "priority": "high",
+        "dependencies": [],
+        "status": "pending",
+        "subtasks": []
+      },
+      {
+        "id": 76,
+        "title": "Create Resource Type Registry with Aliases",
+        "description": "Create a centralized resource type registry that maps resource names, plurals, and short aliases to canonical types. This enables kubectl-style resource resolution.",
+        "details": "Create `bastion/src/cli/src/utils/resources.ts`:\n\n```typescript\nexport interface ResourceDefinition {\n  kind: string;           // Canonical type: \"Server\", \"Cluster\", etc.\n  singular: string;       // \"server\"\n  plural: string;         // \"servers\"\n  aliases: string[];      // [\"srv\"]\n  apiPath: string;        // \"/api/servers\"\n  columns: TableColumn[]; // Default columns for 'get' output\n  wideColumns?: TableColumn[]; // Extra columns for -o wide\n}\n\nconst RESOURCE_DEFINITIONS: ResourceDefinition[] = [\n  {\n    kind: \"Server\",\n    singular: \"server\",\n    plural: \"servers\",\n    aliases: [\"srv\"],\n    apiPath: \"/api/servers\",\n    columns: serverColumns,\n    wideColumns: serverWideColumns,\n  },\n  {\n    kind: \"Cluster\",\n    singular: \"cluster\",\n    plural: \"clusters\",\n    aliases: [],\n    apiPath: \"/api/clusters\",\n    columns: clusterColumns,\n  },\n  {\n    kind: \"Network\",\n    singular: \"network\",\n    plural: \"networks\",\n    aliases: [\"net\"],\n    apiPath: \"/api/networks\",\n    columns: networkColumns,\n  },\n  // ... bastion, role, user, token, audit\n];\n\nconst aliasMap = new Map<string, ResourceDefinition>();\nfor (const def of RESOURCE_DEFINITIONS) {\n  aliasMap.set(def.singular, def);\n  aliasMap.set(def.plural, def);\n  for (const alias of def.aliases) {\n    aliasMap.set(alias, def);\n  }\n}\n\nexport function resolveResourceType(input: string): ResourceDefinition {\n  const normalized = input.toLowerCase();\n  const def = aliasMap.get(normalized);\n  if (!def) {\n    const valid = RESOURCE_DEFINITIONS.map(d => d.plural).join(\", \");\n    throw new Error(`Unknown resource type \"${input}\". 
Valid types: ${valid}`);\n  }\n  return def;\n}\n\nexport function resolveResourceIdentifier(input: string): {\n  type: ResourceDefinition;\n  name?: string;\n} {\n  // Handle \"server/labmaster\" or just \"servers\"\n  const parts = input.split(\"/\");\n  const type = resolveResourceType(parts[0]);\n  const name = parts.length > 1 ? parts.slice(1).join(\"/\") : undefined;\n  return { type, name };\n}\n```\n\nUpdate `bastion/src/cli/src/utils/resource.ts` to use the new registry.",
+        "testStrategy": "1. Unit test resolveResourceType with all aliases: server, servers, srv -> Server\n2. Test unknown resource type throws descriptive error\n3. Test case insensitivity: SERVER, Server, server all resolve correctly\n4. Test resolveResourceIdentifier parses \"server/labmaster\" correctly",
+        "priority": "high",
+        "dependencies": [],
+        "status": "pending",
+        "subtasks": []
+      },
+      {
+        "id": 77,
+        "title": "Implement 'labctl get' Command",
+        "description": "Create the core 'labctl get <resource> [name]' command that lists resources with filtering and output format support. This is the foundation of the kubectl-style CLI.",
+        "details": "Create `bastion/src/cli/src/commands/get.ts`:\n\n```typescript\nimport { Command } from \"commander\";\nimport { resolveResourceType, type ResourceDefinition } from \"../utils/resources.js\";\nimport { getLabdClient } from \"../api/config.js\";\nimport { formatOutput, type TableColumn } from \"../utils/table.js\";\n\nexport function registerGetCommand(program: Command): void {\n  program\n    .command(\"get <resource> [name]\")\n    .description(\"List resources or get a specific resource by name\")\n    .option(\"--status <status>\", \"Filter by status\")\n    .option(\"--role <role>\", \"Filter by role (servers only)\")\n    .option(\"--cloud <cloud>\", \"Filter by cloud\")\n    .option(\"--env <environment>\", \"Filter by environment\")\n    .option(\"-l, --label <label>\", \"Filter by label (key=value)\")\n    .option(\"-A, --all-namespaces\", \"List across all clouds/environments\")\n    .action(async (resource: string, name: string | undefined, opts) => {\n      const config = program.opts()[\"_config\"];\n      const resourceDef = resolveResourceType(resource);\n      const client = getLabdClient();\n      \n      try {\n        let data: unknown[];\n        \n        if (name) {\n          // Get specific resource - could be name, ID, or MAC\n          const item = await client.getResource(resourceDef, name);\n          data = item ? [item] : [];\n        } else {\n          // List with filters\n          data = await client.listResources(resourceDef, {\n            status: opts.status,\n            role: opts.role,\n            cloud: opts.allNamespaces ? undefined : (opts.cloud ?? config.defaultCloud),\n            environment: opts.allNamespaces ? undefined : (opts.env ?? 
config.defaultEnvironment),\n            label: opts.label,\n          });\n        }\n        \n        if (data.length === 0) {\n          console.log(`No ${resourceDef.plural} found.`);\n          return;\n        }\n        \n        const columns = config.outputFormat === \"wide\" && resourceDef.wideColumns\n          ? [...resourceDef.columns, ...resourceDef.wideColumns]\n          : resourceDef.columns;\n        \n        formatOutput(data, config.outputFormat, columns);\n      } catch (err) {\n        console.error(`Error: ${err instanceof Error ? err.message : String(err)}`);\n        process.exit(1);\n      }\n    });\n}\n```\n\nAdd to `index.ts`: `registerGetCommand(program);`\n\nExtend LabdClient with generic resource methods.",
+        "testStrategy": "1. Integration test: `labctl get servers` returns list from labd\n2. Test filtering: `labctl get servers --status discovered` only shows discovered\n3. Test name lookup: `labctl get server labmaster` returns single server\n4. Test MAC lookup: `labctl get server 38:05:25:33:e2:e4` resolves by MAC\n5. Test output formats: -o json, -o yaml, -o wide produce correct output\n6. Test unknown resource: `labctl get foo` shows helpful error",
+        "priority": "high",
+        "dependencies": [
+          76
+        ],
+        "status": "pending",
+        "subtasks": []
+      },
+      {
+        "id": 78,
+        "title": "Implement 'labctl describe' Command",
+        "description": "Create the 'labctl describe <resource> <name>' command that shows detailed information about a resource including relationships, hardware info, and history.",
+        "details": "Create `bastion/src/cli/src/commands/describe.ts`:\n\n```typescript\nimport { Command } from \"commander\";\nimport { resolveResourceType } from \"../utils/resources.js\";\nimport { getLabdClient } from \"../api/config.js\";\n\nconst BOLD = \"\\x1b[1m\";\nconst DIM = \"\\x1b[2m\";\nconst RESET = \"\\x1b[0m\";\n\ninterface DescribeSection {\n  title: string;\n  fields: Array<[string, string | undefined]>;\n}\n\nfunction printDescribe(name: string, sections: DescribeSection[]): void {\n  console.log(`${BOLD}Name:${RESET} ${name}`);\n  for (const section of sections) {\n    console.log(`\\n${BOLD}${section.title}:${RESET}`);\n    for (const [key, value] of section.fields) {\n      if (value !== undefined) {\n        console.log(`  ${DIM}${key}:${RESET} ${value}`);\n      }\n    }\n  }\n}\n\nexport function registerDescribeCommand(program: Command): void {\n  program\n    .command(\"describe <resource> <name>\")\n    .description(\"Show detailed information about a resource\")\n    .action(async (resource: string, name: string) => {\n      const resourceDef = resolveResourceType(resource);\n      const client = getLabdClient();\n      \n      try {\n        const item = await client.describeResource(resourceDef, name);\n        if (!item) {\n          console.error(`${resourceDef.singular} \"${name}\" not found.`);\n          process.exit(1);\n        }\n        \n        // Resource-specific formatting\n        switch (resourceDef.kind) {\n          case \"Server\":\n            printServerDescription(item);\n            break;\n          case \"Cluster\":\n            printClusterDescription(item);\n            break;\n          default:\n            console.log(JSON.stringify(item, null, 2));\n        }\n      } catch (err) {\n        console.error(`Error: ${err instanceof Error ? 
err.message : String(err)}`);\n        process.exit(1);\n      }\n    });\n}\n\nfunction printServerDescription(server: any): void {\n  const sections: DescribeSection[] = [\n    {\n      title: \"Metadata\",\n      fields: [\n        [\"ID\", server.id],\n        [\"Cloud\", server.cloud],\n        [\"Environment\", server.environment],\n        [\"Role\", server.role],\n        [\"Status\", server.status],\n        [\"Created\", server.createdAt],\n        [\"Last Seen\", server.lastHeartbeat],\n      ],\n    },\n    {\n      title: \"Hardware\",\n      fields: [\n        [\"MAC\", server.mac],\n        [\"IP\", server.ip],\n        [\"Architecture\", server.hardwareInfo?.arch],\n        [\"CPU\", server.hardwareInfo?.cpu_model],\n        [\"Cores\", String(server.hardwareInfo?.cpu_cores)],\n        [\"Memory\", `${server.hardwareInfo?.memory_gb}GB`],\n        [\"Product\", server.hardwareInfo?.product],\n      ],\n    },\n  ];\n  \n  if (server.nics?.length > 0) {\n    sections.push({\n      title: \"Network Interfaces\",\n      fields: server.nics.map((n: any) => [n.name, `${n.mac} ${n.ip ?? \"\"} (${n.state})`]),\n    });\n  }\n  \n  if (server.disks?.length > 0) {\n    sections.push({\n      title: \"Disks\",\n      fields: server.disks.map((d: any) => [d.name, `${d.sizeGb}GB ${d.model ?? \"\"}`]),\n    });\n  }\n  \n  if (server.clusterMemberships?.length > 0) {\n    sections.push({\n      title: \"Cluster Membership\",\n      fields: server.clusterMemberships.map((m: any) => [m.cluster.name, m.role]),\n    });\n  }\n  \n  printDescribe(server.hostname, sections);\n}\n```",
+        "testStrategy": "1. Integration test: `labctl describe server labmaster` shows full details\n2. Test hardware info display: CPU, memory, disks, NICs all shown\n3. Test cluster membership: server in cluster shows membership section\n4. Test not found: `labctl describe server nonexistent` shows helpful error\n5. Test different resource types: describe cluster, network, bastion",
+        "priority": "medium",
+        "dependencies": [
+          77
+        ],
+        "status": "pending",
+        "subtasks": []
+      },
+      {
+        "id": 79,
+        "title": "Implement 'labctl create/delete' Commands",
+        "description": "Create the 'labctl create <resource>' and 'labctl delete <resource> <name>' commands for creating and removing resources like networks, clusters, and tokens.",
+        "details": "Create `bastion/src/cli/src/commands/create.ts`:\n\n```typescript\nimport { Command } from \"commander\";\nimport { resolveResourceType } from \"../utils/resources.js\";\nimport { getLabdClient } from \"../api/config.js\";\n\nexport function registerCreateCommand(program: Command): void {\n  const create = program\n    .command(\"create <resource>\")\n    .description(\"Create a resource\");\n  \n  // labctl create network --name lab --cidr 192.168.8.0/24\n  create\n    .command(\"network\")\n    .description(\"Create a network\")\n    .requiredOption(\"--name <name>\", \"Network name\")\n    .requiredOption(\"--cidr <cidr>\", \"Network CIDR (e.g., 192.168.8.0/24)\")\n    .option(\"--gateway <gateway>\", \"Gateway IP\")\n    .option(\"--vlan <vlan>\", \"VLAN ID\", parseInt)\n    .option(\"--domain <domain>\", \"DNS domain\")\n    .option(\"--dhcp\", \"Enable DHCP\")\n    .action(async (opts) => {\n      const client = getLabdClient();\n      try {\n        const network = await client.createNetwork({\n          name: opts.name,\n          cidr: opts.cidr,\n          gateway: opts.gateway,\n          vlan: opts.vlan,\n          domain: opts.domain,\n          dhcpEnabled: opts.dhcp ?? false,\n        });\n        console.log(`network/${network.name} created`);\n      } catch (err) {\n        console.error(`Error: ${err instanceof Error ? 
err.message : String(err)}`);\n        process.exit(1);\n      }\n    });\n  \n  // labctl create token --label \"worker enrollment\" --type reusable\n  create\n    .command(\"token\")\n    .description(\"Create a join token\")\n    .option(\"--label <label>\", \"Token label/description\")\n    .option(\"--type <type>\", \"Token type: one-time or reusable\", \"one-time\")\n    .option(\"--expires <duration>\", \"Expiration (e.g., 24h, 7d)\")\n    .action(async (opts) => {\n      const client = getLabdClient();\n      try {\n        const token = await client.createToken(opts);\n        console.log(`Token created: ${token.token}`);\n        if (opts.label) console.log(`Label: ${opts.label}`);\n        if (token.expiresAt) console.log(`Expires: ${token.expiresAt}`);\n      } catch (err) {\n        console.error(`Error: ${err instanceof Error ? err.message : String(err)}`);\n        process.exit(1);\n      }\n    });\n}\n```\n\nCreate `bastion/src/cli/src/commands/delete.ts`:\n\n```typescript\nexport function registerDeleteCommand(program: Command): void {\n  program\n    .command(\"delete <resource> <name>\")\n    .description(\"Delete a resource\")\n    .option(\"--force\", \"Skip confirmation\")\n    .action(async (resource: string, name: string, opts) => {\n      const resourceDef = resolveResourceType(resource);\n      const client = getLabdClient();\n      \n      if (!opts.force) {\n        const { confirm } = await import(\"../utils/prompts.js\");\n        const yes = await confirm(`Delete ${resourceDef.singular} \"${name}\"?`);\n        if (!yes) {\n          console.log(\"Cancelled.\");\n          return;\n        }\n      }\n      \n      try {\n        await client.deleteResource(resourceDef, name);\n        console.log(`${resourceDef.singular}/${name} deleted`);\n      } catch (err) {\n        console.error(`Error: ${err instanceof Error ? err.message : String(err)}`);\n        process.exit(1);\n      }\n    });\n}\n```",
+        "testStrategy": "1. Integration test: `labctl create network` creates network in DB\n2. Test validation: missing required flags shows helpful error\n3. Test token creation: token returned is valid UUID, stored in DB\n4. Test delete with confirmation: prompts user, respects --force\n5. Test delete cascade: deleting server removes NICs, disks\n6. Test delete protection: cannot delete bastion with connected servers",
+        "priority": "medium",
+        "dependencies": [
+          77
+        ],
+        "status": "pending",
+        "subtasks": []
+      },
+      {
+        "id": 80,
+        "title": "Refactor Provision Commands to kubectl-style",
+        "description": "Refactor existing provision commands to use kubectl-style syntax: 'labctl provision <server>' instead of 'labctl provision install <mac>'.",
+        "details": "The new command structure should be:\n- `labctl provision <server> --os fedora-43 --role worker` (queue install)\n- `labctl reprovision <server>` (reinstall)\n- `labctl forget <server>` (remove from tracking)\n\nModify `bastion/src/cli/src/commands/install.ts` → rename to `provision.ts`:\n\n```typescript\nexport function registerProvisionCommand(program: Command): void {\n  program\n    .command(\"provision <server>\")\n    .description(\"Queue a server for OS installation\")\n    .requiredOption(\"--os <os>\", \"Operating system\", \"fedora-43\")\n    .requiredOption(\"--role <role>\", \"Server role\", \"worker\")\n    .option(\"--disk <disk>\", \"Target disk (auto-detected if not specified)\")\n    .option(\"--hostname <hostname>\", \"Override hostname\")\n    .action(async (server: string, opts) => {\n      const client = getLabdClient();\n      \n      // Resolve server: could be hostname, MAC, or ID\n      const resolved = await client.resolveServer(server);\n      if (!resolved) {\n        console.error(`Server \"${server}\" not found.`);\n        console.error(\"Tip: Use 'labctl get servers' to see available servers.\");\n        process.exit(1);\n      }\n      \n      if (resolved.status === \"installed\") {\n        console.error(`Server \"${resolved.hostname}\" is already installed.`);\n        console.error(\"Tip: Use 'labctl reprovision' to reinstall.\");\n        process.exit(1);\n      }\n      \n      try {\n        await client.provisionServer(resolved.mac, {\n          hostname: opts.hostname ?? resolved.hostname,\n          os: opts.os,\n          role: opts.role,\n          disk: opts.disk,\n        });\n        console.log(`Server ${resolved.hostname} queued for ${opts.os} installation as ${opts.role}.`);\n      } catch (err) {\n        console.error(`Error: ${err instanceof Error ? 
err.message : String(err)}`);\n        process.exit(1);\n      }\n    });\n}\n```\n\nSimilarly update reprovision.ts and forget.ts to accept server name/MAC/ID.\n\nUpdate index.ts to register commands at top level instead of under 'provision' subcommand.",
+        "testStrategy": "1. Test server resolution: provision by hostname, MAC, or UUID all work\n2. Test already installed: provisioning installed server shows reprovision hint\n3. Test unknown server: helpful error message with tip\n4. Test reprovision: reinstalls installed server\n5. Test forget: removes server from all state categories\n6. Backward compat: verify 'labctl provision list' still works (deprecation warning)",
+        "priority": "medium",
+        "dependencies": [
+          77
+        ],
+        "status": "pending",
+        "subtasks": []
+      },
+      {
+        "id": 81,
+        "title": "Implement Server and Resource API Endpoints in labd",
+        "description": "Add REST API endpoints in labd for full resource CRUD operations: networks, clusters, tokens. Extend servers endpoint with filters and relationship includes.",
+        "details": "Create/extend labd route files:\n\n1. **Extend servers.ts**:\n```typescript\n// GET /api/servers - with extended filters and includes\napp.get(\"/api/servers\", async (request, reply) => {\n  const { status, role, cloud, environment, label, include } = request.query;\n  \n  const where: Record<string, unknown> = {};\n  if (status) where.status = status;\n  if (role) where.role = role;\n  if (cloud) where.cloud = cloud;\n  if (environment) where.environment = environment;\n  if (label) {\n    // label arrives as \"key=value\"; the value may itself contain \"=\"\n    const [labelKey, ...rest] = label.split(\"=\");\n    where.labels = { path: [labelKey], equals: rest.join(\"=\") };\n  }\n  \n  const servers = await db.server.findMany({\n    where,\n    include: {\n      nics: include?.includes(\"nics\"),\n      disks: include?.includes(\"disks\"),\n      clusterMemberships: include?.includes(\"clusters\") ? { include: { cluster: true } } : false,\n      bastion: include?.includes(\"bastion\"),\n    },\n  });\n  return servers;\n});\n\n// GET /api/servers/:identifier - by ID, hostname, or MAC\napp.get(\"/api/servers/:identifier\", async (request, reply) => {\n  const { identifier } = request.params;\n  \n  // Try UUID first\n  let server = await db.server.findUnique({ where: { id: identifier }, include: fullInclude });\n  // Try hostname\n  if (!server) server = await db.server.findUnique({ where: { hostname: identifier }, include: fullInclude });\n  // Try MAC\n  if (!server) server = await db.server.findUnique({ where: { mac: identifier.toLowerCase() }, include: fullInclude });\n  \n  if (!server) return reply.code(404).send({ error: \"Server not found\" });\n  return server;\n});\n```\n\n2. 
**Create networks.ts**:\n```typescript\n// GET /api/networks, POST /api/networks, DELETE /api/networks/:id\nexport function registerNetworkRoutes(app: FastifyInstance, db: DbClient): void {\n  app.get(\"/api/networks\", async () => db.network.findMany());\n  \n  app.post(\"/api/networks\", async (request, reply) => {\n    const { name, cidr, gateway, vlan, domain, dhcpEnabled } = request.body;\n    // Validate CIDR format\n    const network = await db.network.create({ data: { name, cidr, gateway, vlan, domain, dhcpEnabled } });\n    return reply.code(201).send(network);\n  });\n  \n  app.delete(\"/api/networks/:id\", async (request, reply) => {\n    await db.network.delete({ where: { id: request.params.id } });\n    return reply.code(204).send();\n  });\n}\n```\n\n3. **Create clusters.ts**:\n```typescript\n// Similar CRUD for clusters with member management\napp.get(\"/api/clusters/:id/members\", ...);\napp.post(\"/api/clusters/:id/members\", ...);\napp.delete(\"/api/clusters/:id/members/:serverId\", ...);\n```",
+        "testStrategy": "1. Integration test all CRUD endpoints with HTTP client\n2. Test server resolution: by id, hostname, and MAC all return same server\n3. Test include parameter: nics, disks, clusters included when requested\n4. Test validation: invalid CIDR rejected, duplicate names rejected\n5. Test cascade: delete network with NICs fails or cascades appropriately",
+        "priority": "medium",
+        "dependencies": [
+          72,
+          73
+        ],
+        "status": "pending",
+        "subtasks": []
+      },
+      {
+        "id": 82,
+        "title": "Implement RBAC Permission Checks in CLI",
+        "description": "Wire RBAC permission checks into CLI commands. Check user permissions before executing operations using the existing Permission model.",
+        "details": "1. Create `bastion/src/cli/src/middleware/rbac.ts`:\n\n```typescript\nimport { getLabdClient } from \"../api/config.js\";\n\nexport interface PermissionContext {\n  action: string;      // read, exec, apply, destroy, manage, admin\n  cloud?: string;\n  environment?: string;\n  server?: string;\n}\n\nexport async function checkPermission(ctx: PermissionContext): Promise<boolean> {\n  const client = getLabdClient();\n  try {\n    const result = await client.checkPermission(ctx);\n    return result.allowed;\n  } catch {\n    // If can't reach labd, fail open for local operations\n    return true;\n  }\n}\n\nexport async function requirePermission(ctx: PermissionContext): Promise<void> {\n  const allowed = await checkPermission(ctx);\n  if (!allowed) {\n    throw new Error(\n      `Permission denied: ${ctx.action} on ${ctx.server ?? \"*\"}@${ctx.cloud ?? \"*\"}/${ctx.environment ?? \"*\"}`\n    );\n  }\n}\n```\n\n2. Add labd endpoint `POST /api/auth/check-permission`:\n```typescript\napp.post(\"/api/auth/check-permission\", async (request, reply) => {\n  const user = await authenticateRequest(request); // from cert or token\n  const { action, cloud, environment, server } = request.body;\n  \n  const permissions = await db.permission.findMany({\n    where: {\n      role: { userBindings: { some: { userId: user.id } } },\n    },\n  });\n  \n  const allowed = permissions.some(p => \n    matchesPattern(p.action, action) &&\n    matchesPattern(p.cloud, cloud ?? \"*\") &&\n    matchesPattern(p.environment, environment ?? \"*\") &&\n    matchesPattern(p.server, server ?? \"*\")\n  );\n  \n  return { allowed };\n});\n```\n\n3. 
Integrate into commands:\n```typescript\n// In provision command\nawait requirePermission({ action: \"apply\", cloud, environment, server: resolved.hostname });\n\n// In delete command\nawait requirePermission({ action: \"destroy\", cloud, environment, server: name });\n\n// In get command (filter results)\nconst servers = await client.listServers(filters);\nconst visible = await filterByPermission(servers, \"read\");\n```",
+        "testStrategy": "1. Unit test permission matching logic with wildcards\n2. Test admin role: has access to all resources\n3. Test operator role: can read/exec but not destroy\n4. Test viewer role: can only read, provision denied\n5. Test scope matching: permission for cloud=aws doesn't grant access to cloud=baremetal\n6. Test denied action is audit-logged",
+        "priority": "medium",
+        "dependencies": [
+          77,
+          81
+        ],
+        "status": "pending",
+        "subtasks": []
+      },
+      {
+        "id": 83,
+        "title": "Implement Audit Logging for Resource Operations",
+        "description": "Log all resource mutations to the AuditLog table. Include user, action, resource type/name, result, and source IP.",
+        "details": "1. Create `bastion/src/labd/src/services/audit.ts`:\n\n```typescript\nimport type { PrismaClient } from \"@prisma/client\";\n\nexport interface AuditEntry {\n  userId?: string;\n  serverId?: string;\n  sessionId?: string;\n  action: string;         // create, update, delete, provision, exec, rbac-denied\n  resourceType: string;   // server, cluster, network, token, etc.\n  resourceName: string;\n  args?: string;          // sanitized args (no secrets)\n  result: \"success\" | \"denied\" | \"error\";\n  durationMs?: number;\n  sourceIp?: string;\n}\n\nexport class AuditService {\n  constructor(private readonly db: PrismaClient) {}\n  \n  async log(entry: AuditEntry): Promise<void> {\n    await this.db.auditLog.create({\n      data: {\n        userId: entry.userId,\n        serverId: entry.serverId,\n        sessionId: entry.sessionId,\n        action: entry.action,\n        resourceType: entry.resourceType,\n        resourceName: entry.resourceName,\n        args: entry.args,\n        result: entry.result,\n        durationMs: entry.durationMs,\n        sourceIp: entry.sourceIp,\n      },\n    });\n  }\n  \n  async query(filters: {\n    userId?: string;\n    action?: string;\n    resourceType?: string;\n    since?: Date;\n    limit?: number;\n  }): Promise<AuditEntry[]> {\n    return this.db.auditLog.findMany({\n      where: {\n        userId: filters.userId,\n        action: filters.action,\n        resourceType: filters.resourceType,\n        timestamp: filters.since ? { gte: filters.since } : undefined,\n      },\n      orderBy: { timestamp: \"desc\" },\n      take: filters.limit ?? 100,\n    });\n  }\n}\n```\n\n2. 
Add Fastify hook to wrap route handlers:\n```typescript\napp.addHook(\"onResponse\", async (request, reply) => {\n  // Log mutations (POST, PUT, DELETE)\n  if ([\"POST\", \"PUT\", \"DELETE\"].includes(request.method)) {\n    const path = request.url;\n    const resourceMatch = path.match(/\\/api\\/(\\w+)(?:\\/([^/]+))?/);\n    if (resourceMatch) {\n      await auditService.log({\n        action: methodToAction(request.method),\n        resourceType: resourceMatch[1],\n        resourceName: resourceMatch[2] ?? \"\",\n        result: reply.statusCode < 400 ? \"success\" : \"error\",\n        sourceIp: request.ip,\n      });\n    }\n  }\n});\n```\n\n3. Add `labctl get audit` command to view audit logs.",
+        "testStrategy": "1. Integration test: create network, verify audit log entry created\n2. Test RBAC denial is logged with result=denied\n3. Test sensitive data sanitization: tokens/passwords not in args\n4. Test query filters: by user, action, resourceType, time range\n5. Test `labctl get audit` displays recent entries correctly",
+        "priority": "medium",
+        "dependencies": [
+          81,
+          82
+        ],
+        "status": "pending",
+        "subtasks": []
+      },
+      {
+        "id": 84,
+        "title": "Update CLI Entry Point and Help Text",
+        "description": "Update the CLI entry point to register all new commands and update help text to reflect the kubectl-style interface. Add deprecation warnings for old command structure.",
+        "details": "Update `bastion/src/cli/src/index.ts`:\n\n```typescript\nimport { Command } from \"commander\";\nimport { APP_VERSION } from \"@lab/shared\";\nimport { loadConfig } from \"./config/index.js\";\n\n// New kubectl-style commands\nimport { registerGetCommand } from \"./commands/get.js\";\nimport { registerDescribeCommand } from \"./commands/describe.js\";\nimport { registerCreateCommand } from \"./commands/create.js\";\nimport { registerDeleteCommand } from \"./commands/delete.js\";\nimport { registerApplyCommand } from \"./commands/apply.js\";\nimport { registerEditCommand } from \"./commands/edit.js\";\n\n// Action commands\nimport { registerProvisionCommand } from \"./commands/provision.js\";\nimport { registerReprovisionCommand } from \"./commands/reprovision.js\";\nimport { registerForgetCommand } from \"./commands/forget.js\";\n\n// Bastion management\nimport { registerBastionCommand } from \"./commands/bastion.js\"; // start/stop/status\n\n// App management (unchanged)\nimport { registerAppCommand } from \"./commands/app.js\";\n\n// Utility\nimport { registerConfigCommand } from \"./commands/config.js\";\nimport { registerLoginCommand } from \"./commands/login.js\";\nimport { registerDoctorCommand } from \"./commands/doctor.js\";\n\nexport function createProgram(): Command {\n  const program = new Command();\n  \n  program\n    .name(\"labctl\")\n    .description(\"Lab infrastructure management CLI\")\n    .version(APP_VERSION);\n  \n  // Global options\n  program\n    .option(\"-o, --output <format>\", \"output format (table, json, yaml, wide)\", \"table\")\n    .option(\"--server <url>\", \"override labd server URL\")\n    .option(\"--env <name>\", \"override default environment\")\n    .option(\"--cloud <name>\", \"override default cloud\")\n    .option(\"--debug\", \"enable debug output\")\n    .option(\"--no-color\", \"disable colored output\");\n  \n  // Core CRUD commands\n  registerGetCommand(program);        // labctl get <resource> 
[name]\n  registerDescribeCommand(program);   // labctl describe <resource> <name>\n  registerCreateCommand(program);     // labctl create <resource>\n  registerDeleteCommand(program);     // labctl delete <resource> <name>\n  registerApplyCommand(program);      // labctl apply -f <file>\n  registerEditCommand(program);       // labctl edit <resource> <name>\n  \n  // Provisioning actions\n  registerProvisionCommand(program);  // labctl provision <server>\n  registerReprovisionCommand(program);// labctl reprovision <server>\n  registerForgetCommand(program);     // labctl forget <server>\n  \n  // Bastion management\n  registerBastionCommand(program);    // labctl bastion start|stop|status\n  \n  // App management\n  registerAppCommand(program);        // labctl app install|health k3s\n  \n  // Utility\n  registerConfigCommand(program);\n  registerLoginCommand(program);\n  registerDoctorCommand(program);\n  \n  // Legacy compatibility with deprecation warnings\n  registerLegacyCommands(program);\n  \n  return program;\n}\n\nfunction registerLegacyCommands(program: Command): void {\n  // labctl provision list -> labctl get servers (with warning)\n  program\n    .command(\"provision\")\n    .command(\"list\")\n    .action(() => {\n      console.warn(\"DEPRECATED: Use 'labctl get servers' instead.\");\n      // Delegate to get servers\n    });\n}\n```\n\nUpdate shell completions in `scripts/generate-completions.ts` for new command structure.",
+        "testStrategy": "1. Test --help shows all new commands with descriptions\n2. Test resource type help: `labctl get --help` lists valid resources\n3. Test deprecated commands show warning but still work\n4. Test shell completions generated for new commands\n5. Test global options: -o, --server, --env, --cloud all work",
+        "priority": "low",
+        "dependencies": [
+          77,
+          78,
+          79,
+          80
+        ],
+        "status": "pending",
+        "subtasks": []
+      }
+    ],
+    "metadata": {
+      "created": "2026-03-26T04:26:49.813Z",
+      "updated": "2026-03-26T04:26:49.813Z",
+      "description": "Tasks for master context"
+    }
+  }
+}
--- a/.taskmaster/templates/example_prd.txt
+++ b/.taskmaster/templates/example_prd.txt
@@ -0,0 +1,47 @@
+<context>
+# Overview  
+[Provide a high-level overview of your product here. Explain what problem it solves, who it's for, and why it's valuable.]
+
+# Core Features  
+[List and describe the main features of your product. For each feature, include:
+- What it does
+- Why it's important
+- How it works at a high level]
+
+# User Experience  
+[Describe the user journey and experience. Include:
+- User personas
+- Key user flows
+- UI/UX considerations]
+</context>
+<PRD>
+# Technical Architecture  
+[Outline the technical implementation details:
+- System components
+- Data models
+- APIs and integrations
+- Infrastructure requirements]
+
+# Development Roadmap  
+[Break down the development process into phases:
+- MVP requirements
+- Future enhancements
+- Do not think about timelines whatsoever -- all that matters is scope and detailing exactly what needs to be built in each phase so it can later be cut up into tasks]
+
+# Logical Dependency Chain
+[Define the logical order of development:
+- Which features need to be built first (foundation)
+- Getting as quickly as possible to a usable, visible front end that works
+- Properly pacing and scoping each feature so it is atomic but can also be built upon and improved as development progresses]
+
+# Risks and Mitigations  
+[Identify potential risks and how they'll be addressed:
+- Technical challenges
+- Figuring out the MVP that we can build upon
+- Resource constraints]
+
+# Appendix  
+[Include any additional information:
+- Research findings
+- Technical specifications]
+</PRD>
--- a/.taskmaster/templates/example_prd_rpg.txt
+++ b/.taskmaster/templates/example_prd_rpg.txt
@@ -0,0 +1,511 @@
+<rpg-method>
+# Repository Planning Graph (RPG) Method - PRD Template
+
+This template teaches you (AI or human) how to create structured, dependency-aware PRDs using the RPG methodology from Microsoft Research. The key insight: separate WHAT (functional) from HOW (structural), then connect them with explicit dependencies.
+
+## Core Principles
+
+1. **Dual-Semantics**: Think functional (capabilities) AND structural (code organization) separately, then map them
+2. **Explicit Dependencies**: Never assume - always state what depends on what
+3. **Topological Order**: Build foundation first, then layers on top
+4. **Progressive Refinement**: Start broad, refine iteratively
+
+## How to Use This Template
+
+- Follow the instructions in each `<instruction>` block
+- Look at `<example>` blocks to see good vs bad patterns
+- Fill in the content sections with your project details
+- The AI reading this will learn the RPG method by following along
+- Task Master will parse the resulting PRD into dependency-aware tasks
+
+## Recommended Tools for Creating PRDs
+
+When using this template to **create** a PRD (not parse it), use **code-context-aware AI assistants** for best results:
+
+**Why?** The AI needs to understand your existing codebase to make good architectural decisions about modules, dependencies, and integration points.
+
+**Recommended tools:**
+- **Claude Code** (claude-code CLI) - Best for structured reasoning and large contexts
+- **Cursor/Windsurf** - IDE integration with full codebase context
+- **Gemini CLI** (gemini-cli) - Massive context window for large codebases
+- **Codex/Grok CLI** - Strong code generation with context awareness
+
+**Note:** Once your PRD is created, `task-master parse-prd` works with any configured AI model - it just needs to read the PRD text itself, not your codebase.
+</rpg-method>
+
+---
+
+<overview>
+<instruction>
+Start with the problem, not the solution. Be specific about:
+- What pain point exists?
+- Who experiences it?
+- Why don't existing solutions work?
+- What does success look like (measurable outcomes)?
+
+Keep this section focused - don't jump into implementation details yet.
+</instruction>
+
+## Problem Statement
+[Describe the core problem. Be concrete about user pain points.]
+
+## Target Users
+[Define personas, their workflows, and what they're trying to achieve.]
+
+## Success Metrics
+[Quantifiable outcomes. Examples: "80% task completion via autopilot", "< 5% manual intervention rate"]
+
+</overview>
+
+---
+
+<functional-decomposition>
+<instruction>
+Now think about CAPABILITIES (what the system DOES), not code structure yet.
+
+Step 1: Identify high-level capability domains
+- Think: "What major things does this system do?"
+- Examples: Data Management, Core Processing, Presentation Layer
+
+Step 2: For each capability, enumerate specific features
+- Use explore-exploit strategy:
+  * Exploit: What features are REQUIRED for core value?
+  * Explore: What features make this domain COMPLETE?
+
+Step 3: For each feature, define:
+- Description: What it does in one sentence
+- Inputs: What data/context it needs
+- Outputs: What it produces/returns
+- Behavior: Key logic or transformations
+
+<example type="good">
+Capability: Data Validation
+  Feature: Schema validation
+    - Description: Validate JSON payloads against defined schemas
+    - Inputs: JSON object, schema definition
+    - Outputs: Validation result (pass/fail) + error details
+    - Behavior: Iterate fields, check types, enforce constraints
+
+  Feature: Business rule validation
+    - Description: Apply domain-specific validation rules
+    - Inputs: Validated data object, rule set
+    - Outputs: Boolean + list of violated rules
+    - Behavior: Execute rules sequentially, short-circuit on failure
+</example>
+
+<example type="bad">
+Capability: validation.js
+  (Problem: This is a FILE, not a CAPABILITY. Mixing structure into functional thinking.)
+
+Capability: Validation
+  Feature: Make sure data is good
+  (Problem: Too vague. No inputs/outputs. Not actionable.)
+</example>
+</instruction>
+
+## Capability Tree
+
+### Capability: [Name]
+[Brief description of what this capability domain covers]
+
+#### Feature: [Name]
+- **Description**: [One sentence]
+- **Inputs**: [What it needs]
+- **Outputs**: [What it produces]
+- **Behavior**: [Key logic]
+
+#### Feature: [Name]
+- **Description**:
+- **Inputs**:
+- **Outputs**:
+- **Behavior**:
+
+### Capability: [Name]
+...
+
+</functional-decomposition>
+
+---
+
+<structural-decomposition>
+<instruction>
+NOW think about code organization. Map capabilities to actual file/folder structure.
+
+Rules:
+1. Each capability maps to a module (folder or file)
+2. Features within a capability map to functions/classes
+3. Use clear module boundaries - each module has ONE responsibility
+4. Define what each module exports (public interface)
+
+The goal: Create a clear mapping between "what it does" (functional) and "where it lives" (structural).
+
+<example type="good">
+Capability: Data Validation
+  → Maps to: src/validation/
+    ├── schema-validator.js      (Schema validation feature)
+    ├── rule-validator.js         (Business rule validation feature)
+    └── index.js                  (Public exports)
+
+Exports:
+  - validateSchema(data, schema)
+  - validateRules(data, rules)
+</example>
+
+<example type="bad">
+Capability: Data Validation
+  → Maps to: src/utils.js
+  (Problem: "utils" is not a clear module boundary. Where do I find validation logic?)
+
+Capability: Data Validation
+  → Maps to: src/validation/everything.js
+  (Problem: One giant file. Features should map to separate files for maintainability.)
+</example>
+</instruction>
+
+## Repository Structure
+
+```
+project-root/
+├── src/
+│   ├── [module-name]/       # Maps to: [Capability Name]
+│   │   ├── [file].js        # Maps to: [Feature Name]
+│   │   └── index.js         # Public exports
+│   └── [module-name]/
+├── tests/
+└── docs/
+```
+
+## Module Definitions
+
+### Module: [Name]
+- **Maps to capability**: [Capability from functional decomposition]
+- **Responsibility**: [Single clear purpose]
+- **File structure**:
+  ```
+  module-name/
+  ├── feature1.js
+  ├── feature2.js
+  └── index.js
+  ```
+- **Exports**:
+  - `functionName()` - [what it does]
+  - `ClassName` - [what it does]
+
+</structural-decomposition>
+
+---
+
+<dependency-graph>
+<instruction>
+This is THE CRITICAL SECTION for Task Master parsing.
+
+Define explicit dependencies between modules. This creates the topological order for task execution.
+
+Rules:
+1. List modules in dependency order (foundation first)
+2. For each module, state what it depends on
+3. Foundation modules should have NO dependencies
+4. Every non-foundation module should depend on at least one other module
+5. Think: "What must EXIST before I can build this module?"
+
+<example type="good">
+Foundation Layer (no dependencies):
+  - error-handling: No dependencies
+  - config-manager: No dependencies
+  - base-types: No dependencies
+
+Data Layer:
+  - schema-validator: Depends on [base-types, error-handling]
+  - data-ingestion: Depends on [schema-validator, config-manager]
+
+Core Layer:
+  - algorithm-engine: Depends on [base-types, error-handling]
+  - pipeline-orchestrator: Depends on [algorithm-engine, data-ingestion]
+</example>
+
+<example type="bad">
+- validation: Depends on API
+- API: Depends on validation
+(Problem: Circular dependency. This will cause build/runtime issues.)
+
+- user-auth: Depends on everything
+(Problem: Too many dependencies. Should be more focused.)
+</example>
+</instruction>
+
+## Dependency Chain
+
+### Foundation Layer (Phase 0)
+No dependencies - these are built first.
+
+- **[Module Name]**: [What it provides]
+- **[Module Name]**: [What it provides]
+
+### [Layer Name] (Phase 1)
+- **[Module Name]**: Depends on [[module-from-phase-0], [module-from-phase-0]]
+- **[Module Name]**: Depends on [[module-from-phase-0]]
+
+### [Layer Name] (Phase 2)
+- **[Module Name]**: Depends on [[module-from-phase-1], [module-from-foundation]]
+
+[Continue building up layers...]
+
+</dependency-graph>
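
The "foundation first" ordering described in the dependency-graph section can be checked mechanically. The following is a minimal sketch, assuming the graph is written as a plain object that maps each module to the modules it depends on. `DependencyGraph` and `topologicalOrder` are hypothetical names used for illustration and are not part of Task Master. The module names come from the good example in that section.

```typescript
// Hypothetical shape: module name -> names of the modules it depends on.
type DependencyGraph = Record<string, string[]>;

// Returns the modules ordered so that each one appears after all of its
// dependencies. Throws on unknown modules and on circular dependencies.
function topologicalOrder(graph: DependencyGraph): string[] {
  const names = Object.keys(graph);
  // Copy each dependency list so the caller's graph is not mutated.
  const pending: Record<string, string[]> = {};
  for (const name of names) {
    const deps = graph[name] ?? [];
    for (const dep of deps) {
      if (names.indexOf(dep) === -1) {
        throw new Error(`${name} depends on unknown module ${dep}`);
      }
    }
    pending[name] = deps.slice();
  }

  const order: string[] = [];
  let waiting = names.slice();
  while (waiting.length > 0) {
    // A module is ready once every one of its dependencies has been emitted.
    const ready = waiting.filter((name) =>
      (pending[name] ?? []).every((dep) => order.indexOf(dep) !== -1)
    );
    if (ready.length === 0) {
      // Nothing can make progress, so the remaining modules form a cycle.
      throw new Error(`Circular dependency among: ${waiting.join(", ")}`);
    }
    order.push(...ready);
    waiting = waiting.filter((name) => ready.indexOf(name) === -1);
  }
  return order;
}

// The graph from the good example, deliberately listed out of order.
const modules: DependencyGraph = {
  "pipeline-orchestrator": ["algorithm-engine", "data-ingestion"],
  "data-ingestion": ["schema-validator", "config-manager"],
  "schema-validator": ["base-types", "error-handling"],
  "algorithm-engine": ["base-types", "error-handling"],
  "error-handling": [],
  "config-manager": [],
  "base-types": [],
};

const buildOrder = topologicalOrder(modules);
console.log(buildOrder.join(" -> "));
```

Running the same function on the bad example (`validation` depends on `API`, `API` depends on `validation`) throws instead of returning an order, which is the kind of early failure the section asks for.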
+
+---
+
+<implementation-roadmap>
+<instruction>
+Turn the dependency graph into concrete development phases.
+
+Each phase should:
+1. Have clear entry criteria (what must exist before starting)
+2. Contain tasks that can be parallelized (no inter-dependencies within phase)
+3. Have clear exit criteria (how do we know phase is complete?)
+4. Build toward something USABLE (not just infrastructure)
+
+Phase ordering follows topological sort of dependency graph.
+
+<example type="good">
+Phase 0: Foundation
+  Entry: Clean repository
+  Tasks:
+    - Implement error handling utilities
+    - Create base type definitions
+    - Setup configuration system
+  Exit: Other modules can import foundation without errors
+
+Phase 1: Data Layer
+  Entry: Phase 0 complete
+  Tasks:
+    - Implement schema validator (uses: base types, error handling)
+    - Build data ingestion pipeline (uses: validator, config)
+  Exit: End-to-end data flow from input to validated output
+</example>
+
+<example type="bad">
+Phase 1: Build Everything
+  Tasks:
+    - API
+    - Database
+    - UI
+    - Tests
+  (Problem: No clear focus. Too broad. Dependencies not considered.)
+</example>
+</instruction>
+
+## Development Phases
+
+### Phase 0: [Foundation Name]
+**Goal**: [What foundational capability this establishes]
+
+**Entry Criteria**: [What must be true before starting]
+
+**Tasks**:
+- [ ] [Task name] (depends on: [none or list])
+  - Acceptance criteria: [How we know it's done]
+  - Test strategy: [What tests prove it works]
+
+- [ ] [Task name] (depends on: [none or list])
+
+**Exit Criteria**: [Observable outcome that proves phase complete]
+
+**Delivers**: [What can users/developers do after this phase?]
+
+---
+
+### Phase 1: [Layer Name]
+**Goal**:
+
+**Entry Criteria**: Phase 0 complete
+
+**Tasks**:
+- [ ] [Task name] (depends on: [[tasks-from-phase-0]])
+- [ ] [Task name] (depends on: [[tasks-from-phase-0]])
+
+**Exit Criteria**:
+
+**Delivers**:
+
+---
+
+[Continue with more phases...]
+
+</implementation-roadmap>
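
One way to derive phases from a dependency graph, so that tasks inside a phase have no dependencies on each other, is to give every module a phase number one higher than the highest phase among its dependencies. The sketch below assumes the same plain-object graph shape as the roadmap examples. `ModuleDeps` and `assignPhases` are hypothetical names, and the rule itself is one possible reading of "phase ordering follows topological sort", not something Task Master is confirmed to do.

```typescript
// Hypothetical shape: module name -> names of the modules it depends on.
type ModuleDeps = Record<string, string[]>;

// Assigns each module a phase: 0 for modules without dependencies,
// otherwise 1 + the highest phase among its dependencies.
// Dependencies that are not keys of the graph are treated as phase 0.
function assignPhases(graph: ModuleDeps): Record<string, number> {
  const phases: Record<string, number> = {};
  const visiting: string[] = [];

  function phaseOf(name: string): number {
    const known = phases[name];
    if (known !== undefined) return known;
    if (visiting.indexOf(name) !== -1) {
      throw new Error(`Circular dependency involving ${name}`);
    }
    visiting.push(name);
    let phase = 0;
    for (const dep of graph[name] ?? []) {
      phase = Math.max(phase, phaseOf(dep) + 1);
    }
    visiting.pop();
    phases[name] = phase;
    return phase;
  }

  for (const name of Object.keys(graph)) {
    phaseOf(name);
  }
  return phases;
}

const phaseByModule = assignPhases({
  "base-types": [],
  "error-handling": [],
  "config-manager": [],
  "schema-validator": ["base-types", "error-handling"],
  "data-ingestion": ["schema-validator", "config-manager"],
  "algorithm-engine": ["base-types", "error-handling"],
  "pipeline-orchestrator": ["algorithm-engine", "data-ingestion"],
});

for (const name of Object.keys(phaseByModule)) {
  console.log(`Phase ${phaseByModule[name]}: ${name}`);
}
```

Because two modules in the same phase can never depend on each other under this rule, each phase satisfies the "can be parallelized" criterion by construction. Whether a phase also delivers something usable still needs human judgment.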
+
+---
+
+<test-strategy>
+<instruction>
+Define how testing will be integrated throughout development (TDD approach).
+
+Specify:
+1. Test pyramid ratios (unit vs integration vs e2e)
+2. Coverage requirements
+3. Critical test scenarios
+4. Test generation guidelines for Surgical Test Generator
+
+This section guides the AI when generating tests during the RED phase of TDD.
+
+<example type="good">
+Critical Test Scenarios for Data Validation module:
+  - Happy path: Valid data passes all checks
+  - Edge cases: Empty strings, null values, boundary numbers
+  - Error cases: Invalid types, missing required fields
+  - Integration: Validator works with ingestion pipeline
+</example>
+</instruction>
+
+## Test Pyramid
+
+```
+        /\
+       /E2E\       ← [X]% (End-to-end, slow, comprehensive)
+      /------\
+     /Integration\ ← [Y]% (Module interactions)
+    /------------\
+   /  Unit Tests  \ ← [Z]% (Fast, isolated, deterministic)
+  /----------------\
+```
+
+## Coverage Requirements
+- Line coverage: [X]% minimum
+- Branch coverage: [X]% minimum
+- Function coverage: [X]% minimum
+- Statement coverage: [X]% minimum
+
+## Critical Test Scenarios
+
+### [Module/Feature Name]
+**Happy path**:
+- [Scenario description]
+- Expected: [What should happen]
+
+**Edge cases**:
+- [Scenario description]
+- Expected: [What should happen]
+
+**Error cases**:
+- [Scenario description]
+- Expected: [How system handles failure]
+
+**Integration points**:
+- [What interactions to test]
+- Expected: [End-to-end behavior]
+
+## Test Generation Guidelines
+[Specific instructions for Surgical Test Generator about what to focus on, what patterns to follow, project-specific test conventions]
+
+</test-strategy>
+
+---
+
+<architecture>
+<instruction>
+Describe technical architecture, data models, and key design decisions.
+
+Keep this section AFTER functional/structural decomposition - implementation details come after understanding structure.
+</instruction>
+
+## System Components
+[Major architectural pieces and their responsibilities]
+
+## Data Models
+[Core data structures, schemas, database design]
+
+## Technology Stack
+[Languages, frameworks, key libraries]
+
+**Decision: [Technology/Pattern]**
+- **Rationale**: [Why chosen]
+- **Trade-offs**: [What we're giving up]
+- **Alternatives considered**: [What else we looked at]
+
+</architecture>
+
+---
+
+<risks>
+<instruction>
+Identify risks that could derail development and how to mitigate them.
+
+Categories:
+- Technical risks (complexity, unknowns)
+- Dependency risks (blocking issues)
+- Scope risks (creep, underestimation)
+</instruction>
+
+## Technical Risks
+**Risk**: [Description]
+- **Impact**: [High/Medium/Low - effect on project]
+- **Likelihood**: [High/Medium/Low]
+- **Mitigation**: [How to address]
+- **Fallback**: [Plan B if mitigation fails]
+
+## Dependency Risks
+[External dependencies, blocking issues]
+
+## Scope Risks
+[Scope creep, underestimation, unclear requirements]
+
+</risks>
+
+---
+
+<appendix>
+## References
+[Papers, documentation, similar systems]
+
+## Glossary
+[Domain-specific terms]
+
+## Open Questions
+[Things to resolve during development]
+</appendix>
+
+---
+
+<task-master-integration>
+# How Task Master Uses This PRD
+
+When you run `task-master parse-prd <file>.txt`, the parser:
+
+1. **Extracts capabilities** → Main tasks
+   - Each `### Capability:` becomes a top-level task
+
+2. **Extracts features** → Subtasks
+   - Each `#### Feature:` becomes a subtask under its capability
+
+3. **Parses dependencies** → Task dependencies
+   - `Depends on: [X, Y]` sets task.dependencies = ["X", "Y"]
+
+4. **Orders by phases** → Task priorities
+   - Phase 0 tasks = highest priority
+   - Phase N tasks = lower priority, properly sequenced
+
+5. **Uses test strategy** → Test generation context
+   - Feeds test scenarios to Surgical Test Generator during implementation
+
+**Result**: A dependency-aware task graph that can be executed in topological order.
+
+## Why RPG Structure Matters
+
+Traditional flat PRDs lead to:
+- ❌ Unclear task dependencies
+- ❌ Arbitrary task ordering
+- ❌ Circular dependencies discovered late
+- ❌ Poorly scoped tasks
+
+RPG-structured PRDs provide:
+- ✅ Explicit dependency chains
+- ✅ Topological execution order
+- ✅ Clear module boundaries
+- ✅ Validated task graph before implementation
+
+## Tips for Best Results
+
+1. **Spend time on dependency graph** - This is the most valuable section for Task Master
+2. **Keep features atomic** - Each feature should be independently testable
+3. **Progressive refinement** - Start broad, use `task-master expand` to break down complex tasks
+4. **Use research mode** - `task-master parse-prd --research` leverages AI for better task generation
+</task-master-integration>
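
To make the "Parses dependencies" step concrete, here is a minimal sketch of how a `Depends on [X, Y]` line could be turned into a dependency list. It is an illustration only: `parseDependencyLine` and `ParsedDependency` are hypothetical names, the regular expressions assume the plain `- name: Depends on [a, b]` form used in the good example, and none of this is taken from Task Master's actual parser.

```typescript
interface ParsedDependency {
  module: string;
  dependsOn: string[];
}

// Hypothetical helper, not Task Master's real parser.
// Returns null for lines that are not dependency declarations.
function parseDependencyLine(line: string): ParsedDependency | null {
  // Foundation modules are written as "- name: No dependencies".
  const foundation = /^\s*-\s*([A-Za-z0-9_-]+):\s*No dependencies\s*$/.exec(line);
  if (foundation !== null) {
    return { module: foundation[1] ?? "", dependsOn: [] };
  }
  // Accepts both "Depends on [a, b]" and "Depends on: [a, b]".
  const match = /^\s*-\s*([A-Za-z0-9_-]+):\s*Depends on:?\s*\[(.*)\]\s*$/.exec(line);
  if (match === null) return null;
  const dependsOn = (match[2] ?? "")
    .split(",")
    .map((entry) => entry.trim())
    .filter((entry) => entry.length > 0);
  return { module: match[1] ?? "", dependsOn };
}

const sampleLines = [
  "Data Layer:",
  "  - schema-validator: Depends on [base-types, error-handling]",
  "  - data-ingestion: Depends on [schema-validator, config-manager]",
];

const parsedTasks = sampleLines
  .map((line) => parseDependencyLine(line))
  .filter((task): task is ParsedDependency => task !== null);

console.log(JSON.stringify(parsedTasks, null, 2));
```

Layer headings such as `Data Layer:` are skipped because they do not start with a list marker, so only module lines contribute to the task graph in this sketch.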