michal/lab

Michal Rydlikowski ac695f506f first commit

2026-03-15 23:50:43 +00:00

OS Installation Research

Target Operating Systems

All must support unattended network installation and automated OpenVox enrollment. All must work across multiple CPU architectures where the OS supports it.

OS	Install System	Answer Format	Architectures	PXE Difficulty
Ubuntu 24.04	autoinstall (cloud-init)	YAML	x86_64, aarch64, RISC-V	Easy
Debian 12	preseed	preseed.cfg	x86_64, aarch64, many others	Medium
Fedora 41+	Anaconda/kickstart	.ks file	x86_64, aarch64	Easy
AlmaLinux 9	Anaconda/kickstart	.ks file	x86_64, aarch64	Easy
XCP-ng 8.3	Custom Python TUI	XML answer file	x86_64 only	HARD
VyOS 1.4	Custom installer	config.boot	x86_64, aarch64	Medium

XCP-ng Network Install — Known Hard

Why it's difficult

Booting the installer through iPXE in UEFI mode is broken (open bug: multiboot module corruption)
Serial/headless install hangs after detecting storage; no known fix
No VNC installer mode (unlike RHEL/Debian)
TFTP is very slow for the large install.img
Custom Python TUI is designed for a VGA console, not automation
No major provisioning tool has first-class XCP-ng support

What works

BIOS PXE more reliable than UEFI
IPMI virtual media with remastered ISO is most reliable
Answer file XML with <post-install-script> and <script stage="filesystem-populated">
Post-install puppet enrollment via /etc/firstboot.d/ scripts
XCP-ng enables SSH by default after install

Answer file format (XML, custom to XenServer/XCP-ng)

<?xml version="1.0"?>
<installation mode="fresh" srtype="ext">
    <primary-disk>sda</primary-disk>
    <keymap>us</keymap>
    <root-password type="hash">$6$...</root-password>
    <source type="url">http://server/xcp-ng/</source>
    <admin-interface name="eth0" proto="dhcp" />
    <hostname>xcphost01</hostname>
    <timezone>Europe/London</timezone>
    <ntp-server>pool.ntp.org</ntp-server>
    <network-backend>openvswitch</network-backend>
    <post-install-script type="url">http://server/scripts/post-install.sh</post-install-script>
    <script stage="filesystem-populated" type="url">http://server/scripts/fs-setup.sh</script>
</installation>

Post-install puppet enrollment

The filesystem-populated stage script drops a firstboot script:

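#!/bin/bash
# Runs at the filesystem-populated stage; $1 is the mount point of the new root.
set -euo pipefail
MOUNT=$1
# The directory may not exist yet on a freshly populated filesystem
mkdir -p "$MOUNT/etc/firstboot.d"
cat > "$MOUNT/etc/firstboot.d/99-lab-enroll" << 'SCRIPT'
#!/bin/bash
# Install puppet agent (XCP-ng is CentOS-based, yum works).
# Assumes a yum repo that provides puppet-agent is already configured.
yum install -y puppet-agent
# puppet-agent installs under /opt/puppetlabs/bin, which may not be on PATH at first boot
PUPPET=/opt/puppetlabs/bin/puppet
# Configure and start
"$PUPPET" config set server puppet.lab.internal --section main
systemctl enable --now puppet
SCRIPT
chmod +x "$MOUNT/etc/firstboot.d/99-lab-enroll"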

Lab Install Profile Abstraction

Lab needs an InstallerPlugin interface so the same lab onboard command works for all OS types. Each plugin handles answer file generation, PXE chain setup, and post-install enrollment for its OS type.

type InstallerPlugin interface {
    Name() string
    SupportedArchitectures() []string

    // Generate the answer/config file for unattended install
    GenerateAnswerFile(config InstallConfig) ([]byte, error)

    // Set up PXE boot artifacts (kernel, initrd, bootloader configs)
    PreparePXE(config PXEConfig) error

    // Generate post-install enrollment script
    GenerateEnrollmentScript(token string, labels []string) ([]byte, error)
}

Built-in installer plugins:

installer-autoinstall — Ubuntu (cloud-init based autoinstall YAML)
installer-kickstart — Fedora, AlmaLinux, RHEL (kickstart .ks files)
installer-preseed — Debian (preseed.cfg)
installer-xcpng — XCP-ng (custom XML + firstboot.d scripts)
installer-vyos — VyOS (config.boot)
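
As a rough illustration of what GenerateAnswerFile could look like inside installer-xcpng, the sketch below marshals a subset of the XCP-ng answer file with Go's encoding/xml. The InstallConfig fields are assumptions (the interface above does not define them), the keymap, interface name and SR type are hard-coded for brevity, and only some of the elements from the answer file example are covered.

```go
package main

import (
	"encoding/xml"
	"errors"
	"fmt"
)

// InstallConfig is a hypothetical field set; the design does not define it yet.
type InstallConfig struct {
	Hostname     string
	PrimaryDisk  string
	RootPassHash string // already hashed (crypt format)
	SourceURL    string
	Timezone     string
}

// typedValue models elements like <source type="url">...</source>.
type typedValue struct {
	Type  string `xml:"type,attr"`
	Value string `xml:",chardata"`
}

type adminInterface struct {
	Name  string `xml:"name,attr"`
	Proto string `xml:"proto,attr"`
}

// answerFile covers a subset of the answer file elements shown earlier.
type answerFile struct {
	XMLName        xml.Name       `xml:"installation"`
	Mode           string         `xml:"mode,attr"`
	SRType         string         `xml:"srtype,attr"`
	PrimaryDisk    string         `xml:"primary-disk"`
	Keymap         string         `xml:"keymap"`
	RootPassword   typedValue     `xml:"root-password"`
	Source         typedValue     `xml:"source"`
	AdminInterface adminInterface `xml:"admin-interface"`
	Hostname       string         `xml:"hostname"`
	Timezone       string         `xml:"timezone"`
}

// GenerateAnswerFile renders the XML answer file for one host.
func GenerateAnswerFile(cfg InstallConfig) ([]byte, error) {
	if cfg.PrimaryDisk == "" || cfg.SourceURL == "" {
		return nil, errors.New("primary disk and source URL are required")
	}
	af := answerFile{
		Mode:           "fresh",
		SRType:         "ext",
		PrimaryDisk:    cfg.PrimaryDisk,
		Keymap:         "us",
		RootPassword:   typedValue{Type: "hash", Value: cfg.RootPassHash},
		Source:         typedValue{Type: "url", Value: cfg.SourceURL},
		AdminInterface: adminInterface{Name: "eth0", Proto: "dhcp"},
		Hostname:       cfg.Hostname,
		Timezone:       cfg.Timezone,
	}
	body, err := xml.MarshalIndent(af, "", "    ")
	if err != nil {
		return nil, err
	}
	return append([]byte(xml.Header), body...), nil
}

func main() {
	out, err := GenerateAnswerFile(InstallConfig{
		Hostname:     "xcphost01",
		PrimaryDisk:  "sda",
		RootPassHash: "$6$...",
		SourceURL:    "http://server/xcp-ng/",
		Timezone:     "Europe/London",
	})
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(string(out))
}
```

Note that encoding/xml writes empty elements as an open/close pair (for example admin-interface) rather than in self-closing form; both are equivalent XML.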

Auto-Onboard Rules

Automatic onboarding based on detected hardware characteristics:

auto-onboard:
  rules:
    - name: large-compute-to-xcpng
      conditions:
        cores: ">= 40"
        memory: ">= 500GB"
        provider: ovh
      action:
        image: xcpng-8.3
        labels: [xen-host, production]

    - name: arm-to-ubuntu
      conditions:
        arch: aarch64
      action:
        image: ubuntu-24.04
        labels: [arm, k8s-worker]

Must support:

Preview: show which existing servers match/don't match rules
Dry-run: show what would happen for pending servers
Apply: actually onboard matching servers
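
The preview mode could be built on a small rule matcher like the sketch below. It is only an illustration: the Server and Rule types are hypothetical, only the ">=" operator from the example rules is parsed, memory is assumed to be expressed in GB, and "first matching rule wins" is an assumption the rules above do not state.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// Server holds detected hardware facts (hypothetical field set).
type Server struct {
	Name     string
	Cores    int
	MemoryGB int
	Arch     string
	Provider string
}

// Rule mirrors one entry under auto-onboard.rules. Empty conditions match anything.
type Rule struct {
	Name     string
	Cores    string // e.g. ">= 40"
	Memory   string // e.g. ">= 500GB"
	Arch     string
	Provider string
	Image    string
}

// parseMin extracts N from ">= N<unit>". Only the ">=" operator is handled.
func parseMin(cond, unit string) (int, error) {
	s := strings.TrimSpace(cond)
	if !strings.HasPrefix(s, ">=") {
		return 0, fmt.Errorf("unsupported condition %q", cond)
	}
	s = strings.TrimSpace(strings.TrimPrefix(s, ">="))
	return strconv.Atoi(strings.TrimSuffix(s, unit))
}

// Matches reports whether every non-empty condition holds for the server.
func (r Rule) Matches(s Server) (bool, error) {
	if r.Cores != "" {
		need, err := parseMin(r.Cores, "")
		if err != nil {
			return false, err
		}
		if s.Cores < need {
			return false, nil
		}
	}
	if r.Memory != "" {
		need, err := parseMin(r.Memory, "GB")
		if err != nil {
			return false, err
		}
		if s.MemoryGB < need {
			return false, nil
		}
	}
	if r.Arch != "" && r.Arch != s.Arch {
		return false, nil
	}
	if r.Provider != "" && r.Provider != s.Provider {
		return false, nil
	}
	return true, nil
}

// Preview maps each server name to the image of the first matching rule,
// or to "" when no rule matches. Nothing is onboarded.
func Preview(rules []Rule, servers []Server) (map[string]string, error) {
	out := make(map[string]string)
	for _, s := range servers {
		out[s.Name] = ""
		for _, r := range rules {
			ok, err := r.Matches(s)
			if err != nil {
				return nil, err
			}
			if ok {
				out[s.Name] = r.Image
				break
			}
		}
	}
	return out, nil
}

func main() {
	rules := []Rule{
		{Name: "large-compute-to-xcpng", Cores: ">= 40", Memory: ">= 500GB", Provider: "ovh", Image: "xcpng-8.3"},
		{Name: "arm-to-ubuntu", Arch: "aarch64", Image: "ubuntu-24.04"},
	}
	servers := []Server{
		{Name: "big01", Cores: 64, MemoryGB: 512, Arch: "x86_64", Provider: "ovh"},
		{Name: "arm01", Cores: 8, MemoryGB: 16, Arch: "aarch64"},
		{Name: "mini01", Cores: 8, MemoryGB: 32, Arch: "x86_64"},
	}
	preview, err := Preview(rules, servers)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	for _, s := range servers {
		img := preview[s.Name]
		if img == "" {
			img = "(no match)"
		}
		fmt.Printf("%-8s -> %s\n", s.Name, img)
	}
}
```

Dry-run and apply could reuse the same Preview result and differ only in whether the onboard action is executed.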

Deployment Approach: Universal PXE Agent + Rootfs Images

Decision: NOT using native installers

Instead of dealing with five different installer formats (autoinstall, kickstart, preseed, XCP-ng XML, VyOS config), Lab uses a universal approach:

PXE boot ONE agent OS (same for all target distros)
Agent contacts Lab server, gets instructions
Agent partitions disk, deploys rootfs tarball, injects config, reboots
Target OS boots with lab-agent, enrolls with OpenVox

This avoids the nightmare of maintaining five installer plugins × 3 architectures.

Tool Evaluation

Tool	What It Does	For Lab?
Tinkerbell (CNCF)	PXE → HookOS agent → workflow actions (partition, deploy, inject)	Best candidate to wrap
LinuxKit	Build minimal agent OS (used by Tinkerbell's HookOS)	Build our PXE agent
mkosi	Build rootfs tarballs for any distro (Fedora, Ubuntu, Debian, etc.)	Image production
iPXE	Universal PXE bootloader with scripting	PXE foundation
Pixiecore	Simple Go PXE server with per-MAC API mode	PXE building block
bootc	Bootable OCI containers → install to disk (RHEL-family)	Image format option
cloud-init	First-boot config injection	Post-deploy config
Packer	Build VM/machine images	Golden image building
MAAS/Curtin	Production-grade, same pattern, but Ubuntu-centric + heavy	Too opinionated
Warewulf	Stateless/diskless boot from container images	Wrong model (RAM-only)
Kairos	Immutable k8s-focused OS from containers	Too opinionated
FOG/Clonezilla	Block-level disk cloning	Too rigid
FAI	Debian-centric installer framework	Too narrow
Razor (Puppet)	Dead (archived 2019)	Dead
netboot.xyz	PXE boot menu into native installers	Opposite of what we want

Tinkerbell — Closest Match

Tinkerbell already implements this pattern:

HookOS: minimal agent OS built with LinuxKit, boots via PXE, multi-arch (x86 + ARM)
Tink Worker: runs inside HookOS, contacts server via gRPC, executes workflows
Workflow Actions:
- rootio — partition disks, create filesystems
- archive2disk — stream compressed rootfs tarball to mounted filesystem
- image2disk — write raw disk image (dd-style)
- oci2disk — pull OCI container image, write to disk
- writefile — write individual files (puppet certs, config, enrollment token)
- cexec — chroot and run commands (install bootloader, etc.)
- kexec — kexec into new kernel (avoids reboot)

Tinkerbell's limitation: requires Kubernetes to run (Tink Server is k8s-native). Options:

Run on bootstrap node's k3s (works but adds k3s dependency before we have k3s)
Extract just HookOS + actions, replace Tink Server with Lab's own API
Use Tinkerbell after initial bootstrap

Option A: Wrap Tinkerbell

Use Tinkerbell's HookOS and actions; Lab translates lab onboard into Tinkerbell workflows. Proven, multi-arch, and battle-tested by Equinix Metal.

Option B: Build our own lightweight agent

If Tinkerbell's k8s dependency is too heavy:

Build agent OS with LinuxKit (like HookOS but simpler)
Small Go binary as the agent: contacts lab-server, gets instructions, partitions, deploys rootfs, injects files, installs bootloader, reboots
Embedded in Lab binary — no k8s dependency
Essentially "Tinkerbell actions without Tinkerbell's workflow engine"
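
To make Option B more concrete, here is a minimal sketch of the agent's core loop: an ordered list of steps, executed until the first failure, with a dry-run mode that only reports the plan. The Step type, the step names and ErrNotImplemented are hypothetical; real steps would shell out to partitioning and bootloader tools, which this sketch deliberately does not do.

```go
package main

import (
	"errors"
	"fmt"
)

// ErrNotImplemented marks steps that this sketch does not actually perform.
var ErrNotImplemented = errors.New("not implemented in this sketch")

// Step is one unit of deployment work on the target machine.
type Step struct {
	Name string
	Run  func() error
}

// RunSteps executes steps in order and stops at the first failure.
// It returns the names of the steps that completed. In dry-run mode
// nothing is executed and all step names are returned as the plan.
func RunSteps(steps []Step, dryRun bool) ([]string, error) {
	var done []string
	for _, s := range steps {
		if !dryRun {
			if err := s.Run(); err != nil {
				return done, fmt.Errorf("step %q failed: %w", s.Name, err)
			}
		}
		done = append(done, s.Name)
	}
	return done, nil
}

func main() {
	noop := func() error { return nil }
	steps := []Step{
		{Name: "partition", Run: noop},
		{Name: "deploy-rootfs", Run: noop},
		{Name: "inject-files", Run: noop},
		{Name: "install-bootloader", Run: func() error { return ErrNotImplemented }},
		{Name: "reboot", Run: noop},
	}

	plan, _ := RunSteps(steps, true)
	fmt.Println("plan:", plan)

	done, err := RunSteps(steps, false)
	fmt.Println("completed:", done)
	if err != nil {
		fmt.Println("error:", err)
	}
}
```

Stopping at the first failed step means the agent never reboots into a half-deployed disk; reporting the completed steps back to lab-server would be the natural next addition.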

Decision: TBD — needs hands-on evaluation of Tinkerbell

VyOS Inspiration

VyOS proves this pattern works:

Image-based install (rootfs deployed to partition)
Also runs as Docker container (same config system)
Same concept as Lab: one definition → VM image, bare metal, or container

Image Production Pipeline

Lab needs to produce rootfs tarballs for each OS × architecture:

$ lab image build ubuntu-24.04 --arch x86_64,aarch64
  → Uses mkosi or debootstrap to build rootfs
  → Injects lab-agent, cloud-init datasource
  → Produces: ubuntu-24.04-x86_64.tar.gz, ubuntu-24.04-aarch64.tar.gz

$ lab image build xcpng-8.3 --arch x86_64
  → Extract/capture rootfs from XCP-ng installer/installed system
  → Produces: xcpng-8.3-x86_64.tar.gz

$ lab image list
IMAGE              ARCH              SIZE      BUILT
ubuntu-24.04       x86_64, aarch64   850MB     2026-03-15
debian-12          x86_64, aarch64   620MB     2026-03-14
fedora-41          x86_64, aarch64   920MB     2026-03-14
almalinux-9        x86_64, aarch64   780MB     2026-03-13
xcpng-8.3          x86_64            1.2GB     2026-03-10
vyos-1.4           x86_64, aarch64   450MB     2026-03-12

Image build tools per OS:

Ubuntu/Debian: debootstrap or mkosi
Fedora/AlmaLinux: dnf --installroot or mkosi
XCP-ng: install in QEMU + Packer, capture rootfs (only viable method)
VyOS: extract squashfs from ISO (unsquashfs /mnt/live/filesystem.squashfs)
Asahi Linux: NOT BUILDABLE — SSH onboard only, OS already installed by user

XCP-ng Rootfs Production — Detailed

Why package-based build doesn't work

install.img is the installer ramdisk, NOT the target system
The installer (host-installer/backend.py) does post-install XAPI setup that can't be replicated with just yum --installroot
Nobody has successfully built XCP-ng from packages alone
create-install-image scripts only produce ISOs

Viable approach: Packer + QEMU capture

1. Boot XCP-ng ISO in QEMU with answerfile (unattended)
2. Installer runs normally, does all XAPI/Xen setup
3. Mount resulting disk image
4. Tar up root partition
5. Generalize: remove SSH host keys, XAPI state.db, hostname, UUIDs, persistent net rules
6. Output: xcpng-8.3-x86_64.tar.gz
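
Step 5 could start from something like the sketch below, which deletes host-specific files from an extracted rootfs by glob pattern. The paths in hostSpecific are assumptions for illustration and need checking against a real XCP-ng install; UUIDs and persistent net rules are not handled here because they likely need regeneration rather than deletion.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// hostSpecific lists glob patterns relative to the rootfs root.
// These paths are ASSUMPTIONS, not verified against XCP-ng 8.3.
var hostSpecific = []string{
	"etc/ssh/ssh_host_*",
	"var/lib/xcp/state.db",
	"etc/hostname",
}

// Generalize removes everything under root that matches one of the
// patterns and returns the removed paths relative to root.
func Generalize(root string, patterns []string) ([]string, error) {
	var removed []string
	for _, p := range patterns {
		matches, err := filepath.Glob(filepath.Join(root, p))
		if err != nil {
			return removed, err
		}
		for _, m := range matches {
			if err := os.RemoveAll(m); err != nil {
				return removed, err
			}
			rel, err := filepath.Rel(root, m)
			if err != nil {
				return removed, err
			}
			removed = append(removed, rel)
		}
	}
	return removed, nil
}

// demoRootfs builds a throwaway directory tree that imitates a captured rootfs.
func demoRootfs() (string, error) {
	root, err := os.MkdirTemp("", "rootfs-")
	if err != nil {
		return "", err
	}
	files := []string{
		"etc/ssh/ssh_host_rsa_key",
		"etc/hostname",
		"var/lib/xcp/state.db",
		"etc/passwd", // must survive generalization
	}
	for _, f := range files {
		p := filepath.Join(root, f)
		if err := os.MkdirAll(filepath.Dir(p), 0o755); err != nil {
			return "", err
		}
		if err := os.WriteFile(p, []byte("x"), 0o600); err != nil {
			return "", err
		}
	}
	return root, nil
}

func main() {
	root, err := demoRootfs()
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	defer os.RemoveAll(root)

	removed, err := Generalize(root, hostSpecific)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println("removed:", removed)
}
```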

XCP-ng partition layout (PXE agent must recreate this)

sda1: 18GB  ext3  /           (dom0 root)
sda2: 18GB  ext3  (backup)    (upgrade slot)
sda3: rest  LVM   (SR)        (VM storage repository)
sda4: 512MB vfat  /boot/efi   (UEFI ESP)
sda5: 4GB   ext3  /var/log
sda6: 1GB   swap
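
For the PXE agent, the layout above can be held as data and the "rest" partition resolved against the actual disk size. The sketch below does that arithmetic under simplifying assumptions: sizes are treated as MiB/GiB, and GPT overhead and alignment are ignored. The Partition type and Resolve function are hypothetical names.

```go
package main

import (
	"errors"
	"fmt"
)

// Partition describes one entry of the layout. SizeMiB == 0 means "rest of disk".
type Partition struct {
	Number  int
	SizeMiB int64
	FS      string
	Purpose string
}

// xcpngLayout mirrors the partition table listed above.
var xcpngLayout = []Partition{
	{1, 18 * 1024, "ext3", "dom0 root"},
	{2, 18 * 1024, "ext3", "backup (upgrade slot)"},
	{3, 0, "lvm", "SR (VM storage)"},
	{4, 512, "vfat", "UEFI ESP"},
	{5, 4 * 1024, "ext3", "/var/log"},
	{6, 1024, "swap", "swap"},
}

// Resolve returns a copy of the layout with the rest-of-disk partition
// sized to whatever the fixed partitions leave over.
func Resolve(layout []Partition, diskMiB int64) ([]Partition, error) {
	out := make([]Partition, len(layout))
	copy(out, layout)

	var fixed int64
	rest := -1
	for i, p := range out {
		if p.SizeMiB == 0 {
			if rest != -1 {
				return nil, errors.New("only one rest-of-disk partition is allowed")
			}
			rest = i
			continue
		}
		fixed += p.SizeMiB
	}
	if fixed >= diskMiB {
		return nil, fmt.Errorf("disk too small: need more than %d MiB, have %d MiB", fixed, diskMiB)
	}
	if rest != -1 {
		out[rest].SizeMiB = diskMiB - fixed
	}
	return out, nil
}

func main() {
	// A 480 GiB disk, chosen only as an example.
	parts, err := Resolve(xcpngLayout, 480*1024)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	for _, p := range parts {
		fmt.Printf("sda%d %8d MiB  %-5s %s\n", p.Number, p.SizeMiB, p.FS, p.Purpose)
	}
}
```

The fixed partitions add up to 42496 MiB (18432 + 18432 + 512 + 4096 + 1024), so on the 480 GiB example disk the SR gets 491520 - 42496 = 449024 MiB.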

Asahi Linux — Special Case

Why it can't follow the standard path

No PXE boot — Apple Silicon only boots from internal NVMe or USB (iBoot)
Firmware partition — m1n1 must be in Apple's APFS container, coexists with macOS
Device tree — generated per-chip at install time
GPU drivers — Asahi's reverse-engineered drivers are kernel-specific
Boot chain: iBoot → m1n1 → U-Boot/GRUB → Linux (completely non-standard)

How Lab handles it

SSH onboard only: lab onboard mac-studio --provider ssh --host <ip>
Asahi is already installed (user did this manually or via Asahi installer)
Lab manages the userspace (Fedora-based) via puppet normally
Kernel updates from Asahi repos, managed by puppet/dnf
m1n1/U-Boot/firmware layer is untouched by Lab

Lesson

Not everything is PXE-bootable. Lab needs two onboard paths:

PXE onboard: bare metal with no OS (Beelinks, OVH servers, XCP-ng hosts)
SSH onboard: OS already installed (Mac Studio, DGX Spark, cloud VMs)

Image Deployment Matrix

                    PXE Deploy    SSH Onboard    Container    VM Image
Ubuntu 24.04        ✓ rootfs      ✓              ✓            ✓ qcow2
Debian 12           ✓ rootfs      ✓              ✓            ✓ qcow2
Fedora 41           ✓ rootfs      ✓              ✓            ✓ qcow2
AlmaLinux 9         ✓ rootfs      ✓              ✓            ✓ qcow2
XCP-ng 8.3          ✓ rootfs      ✓ (existing)   ✗            ✗
VyOS 1.4            ✓ rootfs      ✓ (existing)   ✓ docker     ✓ qcow2
Asahi Linux         ✗ impossible  ✓ (only way)   ✗            ✗

Automated Image Pipeline

Images must be rebuilt regularly to include security updates and new lab-agent versions.

Pipeline Configuration

image-pipelines:
  ubuntu-24.04:
    method: debootstrap
    schedule: weekly
    architectures: [x86_64, aarch64]
    outputs: [rootfs-tarball, container-base, qcow2]
    retention: 4 builds

  xcpng-8.3:
    method: packer-qemu          # install in QEMU, capture
    schedule: monthly
    architectures: [x86_64]
    outputs: [rootfs-tarball]
    retention: 3 builds

  vyos-1.4:
    method: squashfs-extract     # extract from ISO
    schedule: monthly
    architectures: [x86_64, aarch64]
    outputs: [rootfs-tarball, container-base]
    retention: 3 builds
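
The retention setting could be enforced with a pruning helper along these lines. This is a sketch only: the Build type is hypothetical, and builds are ordered by a Unix timestamp field to keep the example dependency-free.

```go
package main

import (
	"fmt"
	"sort"
)

// Build identifies one finished image build (hypothetical shape).
type Build struct {
	ID        string
	BuiltUnix int64 // build time as Unix seconds
}

// Prune splits builds into the newest `retention` builds to keep and
// the older ones to drop. The input slice is not modified.
func Prune(builds []Build, retention int) (keep, drop []Build) {
	sorted := append([]Build(nil), builds...)
	sort.Slice(sorted, func(i, j int) bool {
		return sorted[i].BuiltUnix > sorted[j].BuiltUnix
	})
	if retention < 0 {
		retention = 0
	}
	if retention > len(sorted) {
		retention = len(sorted)
	}
	return sorted[:retention], sorted[retention:]
}

func main() {
	builds := []Build{
		{ID: "xcpng-8.3-b1", BuiltUnix: 100},
		{ID: "xcpng-8.3-b2", BuiltUnix: 200},
		{ID: "xcpng-8.3-b3", BuiltUnix: 300},
		{ID: "xcpng-8.3-b4", BuiltUnix: 400},
	}
	keep, drop := Prune(builds, 3) // retention: 3 builds
	fmt.Println("keep:", keep)
	fmt.Println("drop:", drop)
}
```

A real implementation would probably also refuse to drop a build that is still the promoted default or still referenced by deployed servers.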

Build runs on Lab itself (dogfooding)

x86 images build on x86 machines (Beelink SER9 MAX)
ARM images build on ARM machines (DGX Spark, Minisforum)
XCP-ng builds on any x86 with QEMU/KVM
Lab picks the right builder based on architecture

Upgrade flow

New image built → Lab knows which servers run old version
lab image diff shows package changes
lab image promote makes new image the default for new deploys
Existing servers: puppet manages package updates (not re-imaged unless requested)
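
The package comparison behind lab image diff could be as simple as the sketch below, which compares two package-to-version maps. The output format ("+" added, "-" removed, "~" changed) is made up for illustration and is not the actual output of the command; how the package lists are extracted from each image is left out.

```go
package main

import (
	"fmt"
	"sort"
)

// DiffPackages compares two package->version maps and returns one line
// per difference, sorted by package name.
func DiffPackages(old, next map[string]string) []string {
	names := make(map[string]struct{})
	for n := range old {
		names[n] = struct{}{}
	}
	for n := range next {
		names[n] = struct{}{}
	}
	sorted := make([]string, 0, len(names))
	for n := range names {
		sorted = append(sorted, n)
	}
	sort.Strings(sorted)

	var lines []string
	for _, n := range sorted {
		o, inOld := old[n]
		v, inNew := next[n]
		switch {
		case !inOld:
			lines = append(lines, "+ "+n+" "+v)
		case !inNew:
			lines = append(lines, "- "+n+" "+o)
		case o != v:
			lines = append(lines, "~ "+n+" "+o+" -> "+v)
		}
	}
	return lines
}

func main() {
	// Example package sets; names and versions are illustrative only.
	old := map[string]string{"openssl": "3.0.13", "curl": "8.5.0", "vim": "9.1"}
	next := map[string]string{"openssl": "3.0.14", "curl": "8.5.0", "nano": "7.2"}
	for _, line := range DiffPackages(old, next) {
		fmt.Println(line)
	}
}
```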

Connection to Puppet → Container Artifact Builder

Same pipeline, different output targets:

Label "mailserver" + base image "ubuntu-24.04":
  → rootfs + puppet classes = bare metal image (tar.gz for PXE deploy)
  → rootfs + puppet classes = container image (OCI for k8s/docker)
  → rootfs + puppet classes = VM image (qcow2/vmdk for XCP-ng/AWS)

One label, one set of puppet modules, three deployment formats.

Multi-Architecture Considerations

PXE boot chain differs between x86 (BIOS/UEFI) and ARM (UEFI only)
Need separate kernel/initrd per architecture for the agent OS
Rootfs tarballs are architecture-specific
Some OS images don't exist for all architectures (XCP-ng = x86 only)
Lab must track architecture per image and refuse mismatches
Tinkerbell's HookOS already builds for x86_64 and aarch64
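
The "track architecture per image and refuse mismatches" requirement could be enforced with a check like the following before any deploy is scheduled. The Image and Catalog types are hypothetical, and the catalog contents are taken from the image list shown earlier.

```go
package main

import (
	"fmt"
	"strings"
)

// Image records which architectures an image has been built for.
type Image struct {
	Name  string
	Archs []string
}

// Catalog indexes images by name.
type Catalog map[string]Image

// CheckDeploy returns an error if the image is unknown or was not
// built for the server's architecture.
func (c Catalog) CheckDeploy(image, arch string) error {
	img, ok := c[image]
	if !ok {
		return fmt.Errorf("unknown image %q", image)
	}
	for _, a := range img.Archs {
		if a == arch {
			return nil
		}
	}
	return fmt.Errorf("image %q is not built for %s (available: %s)",
		image, arch, strings.Join(img.Archs, ", "))
}

func main() {
	catalog := Catalog{
		"ubuntu-24.04": {Name: "ubuntu-24.04", Archs: []string{"x86_64", "aarch64"}},
		"xcpng-8.3":    {Name: "xcpng-8.3", Archs: []string{"x86_64"}},
	}
	fmt.Println(catalog.CheckDeploy("ubuntu-24.04", "aarch64"))
	fmt.Println(catalog.CheckDeploy("xcpng-8.3", "aarch64"))
}
```

Running the same check when auto-onboard rules are loaded would catch a rule that pairs an ARM condition with an x86-only image before any server matches it.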
