lab/os-install-research.md
Michal Rydlikowski ac695f506f first commit
2026-03-15 23:50:43 +00:00


OS Installation Research

Target Operating Systems

All must support unattended network installation and automated OpenVox enrollment. All must work across multiple CPU architectures where the OS supports it.

OS             Install System            Answer Format    Architectures                 PXE Difficulty
Ubuntu 24.04   autoinstall (cloud-init)  YAML             x86_64, aarch64, RISC-V       Easy
Debian 12      preseed                   preseed.cfg      x86_64, aarch64, many others  Medium
Fedora 41+     Anaconda/kickstart        .ks file         x86_64, aarch64               Easy
AlmaLinux 9    Anaconda/kickstart        .ks file         x86_64, aarch64               Easy
XCP-ng 8.3     Custom Python TUI         XML answer file  x86_64 only                   HARD
VyOS 1.4       Custom installer          config.boot      x86_64, aarch64               Medium

XCP-ng Network Install — Known Hard

Why it's difficult

  • Booting the installer through iPXE under UEFI is broken (open bug: multiboot modules get corrupted)
  • Serial/headless install hangs after detecting storage, with no known fix
  • No VNC installer mode (unlike RHEL/Debian)
  • TFTP transfers of the large install.img are very slow
  • Custom Python TUI designed for VGA console, not automation
  • No major provisioning tool has first-class XCP-ng support

What works

  • BIOS PXE is more reliable than UEFI PXE
  • IPMI virtual media with a remastered ISO is the most reliable method
  • Answer file XML with <post-install-script> and <script stage="filesystem-populated">
  • Post-install puppet enrollment via /etc/firstboot.d/ scripts
  • XCP-ng enables SSH by default after install

Answer file format (XML, custom to XenServer/XCP-ng)

<?xml version="1.0"?>
<installation mode="fresh" srtype="ext">
    <primary-disk>sda</primary-disk>
    <keymap>us</keymap>
    <root-password type="hash">$6$...</root-password>
    <source type="url">http://server/xcp-ng/</source>
    <admin-interface name="eth0" proto="dhcp" />
    <hostname>xcphost01</hostname>
    <timezone>Europe/London</timezone>
    <ntp-server>pool.ntp.org</ntp-server>
    <network-backend>openvswitch</network-backend>
    <post-install-script type="url">http://server/scripts/post-install.sh</post-install-script>
    <script stage="filesystem-populated" type="url">http://server/scripts/fs-setup.sh</script>
</installation>
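
A minimal Go sketch of how an installer plugin might render this answer file with encoding/xml rather than string concatenation. The type names, the renderAnswerFile helper and its parameters are hypothetical, and only a subset of the elements above is modelled; the element and attribute names are taken from the example.

```go
package main

import (
	"encoding/xml"
	"fmt"
)

// typedValue models elements such as <source type="url">...</source>.
type typedValue struct {
	Type  string `xml:"type,attr"`
	Value string `xml:",chardata"`
}

// adminInterface models <admin-interface name="..." proto="..."/>.
type adminInterface struct {
	Name  string `xml:"name,attr"`
	Proto string `xml:"proto,attr"`
}

// answerFile covers only part of the answer file shown above.
type answerFile struct {
	XMLName        xml.Name       `xml:"installation"`
	Mode           string         `xml:"mode,attr"`
	SRType         string         `xml:"srtype,attr"`
	PrimaryDisk    string         `xml:"primary-disk"`
	Keymap         string         `xml:"keymap"`
	RootPassword   typedValue     `xml:"root-password"`
	Source         typedValue     `xml:"source"`
	AdminInterface adminInterface `xml:"admin-interface"`
	Hostname       string         `xml:"hostname"`
	Timezone       string         `xml:"timezone"`
}

// renderAnswerFile is a hypothetical helper; fixed values mirror the example.
func renderAnswerFile(hostname, disk, passwordHash, repoURL string) (string, error) {
	af := answerFile{
		Mode:           "fresh",
		SRType:         "ext",
		PrimaryDisk:    disk,
		Keymap:         "us",
		RootPassword:   typedValue{Type: "hash", Value: passwordHash},
		Source:         typedValue{Type: "url", Value: repoURL},
		AdminInterface: adminInterface{Name: "eth0", Proto: "dhcp"},
		Hostname:       hostname,
		Timezone:       "Europe/London",
	}
	body, err := xml.MarshalIndent(af, "", "    ")
	if err != nil {
		return "", fmt.Errorf("marshal answer file: %w", err)
	}
	return xml.Header + string(body) + "\n", nil
}

func main() {
	out, err := renderAnswerFile("xcphost01", "sda", "$6$...", "http://server/xcp-ng/")
	if err != nil {
		panic(err)
	}
	fmt.Print(out)
}
```

Marshalling from a struct means hostnames or URLs containing XML metacharacters are escaped automatically. Note that encoding/xml writes empty elements as an open/close pair rather than a self-closing tag, which is equivalent XML.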

Post-install puppet enrollment

The filesystem-populated stage script drops a firstboot script:

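#!/bin/bash
# Runs at the filesystem-populated stage; $1 is the mounted target root.
set -euo pipefail
MOUNT=$1
mkdir -p "$MOUNT/etc/firstboot.d"
cat > "$MOUNT/etc/firstboot.d/99-lab-enroll" << 'SCRIPT'
#!/bin/bash
# Install puppet agent (XCP-ng is CentOS-based, yum works).
# Assumes a repo that provides puppet-agent is already configured.
yum install -y puppet-agent
# puppet-agent installs into /opt/puppetlabs/bin, which may not be on PATH here
PUPPET=/opt/puppetlabs/bin/puppet
# Configure and start
"$PUPPET" config set server puppet.lab.internal
systemctl enable --now puppet
SCRIPT
chmod +x "$MOUNT/etc/firstboot.d/99-lab-enroll"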

Lab Install Profile Abstraction

Lab needs an InstallerPlugin interface so the same lab onboard command works for all OS types. Each plugin handles answer file generation, PXE chain setup, and post-install enrollment for its OS type.

type InstallerPlugin interface {
    Name() string
    SupportedArchitectures() []string

    // Generate the answer/config file for unattended install
    GenerateAnswerFile(config InstallConfig) ([]byte, error)

    // Set up PXE boot artifacts (kernel, initrd, bootloader configs)
    PreparePXE(config PXEConfig) error

    // Generate post-install enrollment script
    GenerateEnrollmentScript(token string, labels []string) ([]byte, error)
}

Built-in installer plugins:

  • installer-autoinstall — Ubuntu (cloud-init based autoinstall YAML)
  • installer-kickstart — Fedora, AlmaLinux, RHEL (kickstart .ks files)
  • installer-preseed — Debian (preseed.cfg)
  • installer-xcpng — XCP-ng (custom XML + firstboot.d scripts)
  • installer-vyos — VyOS (config.boot)
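
As a rough illustration of how the built-in plugins could be selected, the sketch below keeps a registry keyed by OS family and rejects unsupported architectures. The interface is a trimmed-down stand-in for InstallerPlugin; the registry keys, the staticPlugin type and the "riscv64" identifier are assumptions made for the example.

```go
package main

import "fmt"

// installerPlugin is a reduced stand-in for the InstallerPlugin interface.
type installerPlugin interface {
	Name() string
	SupportedArchitectures() []string
}

// staticPlugin is a hypothetical plugin that only carries metadata.
type staticPlugin struct {
	name  string
	archs []string
}

func (p staticPlugin) Name() string                     { return p.name }
func (p staticPlugin) SupportedArchitectures() []string { return p.archs }

// registry maps an OS family key (naming is an assumption) to its plugin.
type registry map[string]installerPlugin

// lookup returns the plugin for osFamily after checking that it supports arch.
func (r registry) lookup(osFamily, arch string) (installerPlugin, error) {
	p, ok := r[osFamily]
	if !ok {
		return nil, fmt.Errorf("no installer plugin for %q", osFamily)
	}
	for _, a := range p.SupportedArchitectures() {
		if a == arch {
			return p, nil
		}
	}
	return nil, fmt.Errorf("%s does not support %s", p.Name(), arch)
}

// defaultRegistry mirrors the built-in plugin list and the OS table.
func defaultRegistry() registry {
	return registry{
		"ubuntu":    staticPlugin{"installer-autoinstall", []string{"x86_64", "aarch64", "riscv64"}},
		"fedora":    staticPlugin{"installer-kickstart", []string{"x86_64", "aarch64"}},
		"almalinux": staticPlugin{"installer-kickstart", []string{"x86_64", "aarch64"}},
		"debian":    staticPlugin{"installer-preseed", []string{"x86_64", "aarch64"}},
		"xcp-ng":    staticPlugin{"installer-xcpng", []string{"x86_64"}},
		"vyos":      staticPlugin{"installer-vyos", []string{"x86_64", "aarch64"}},
	}
}

func main() {
	r := defaultRegistry()
	for _, q := range [][2]string{{"ubuntu", "riscv64"}, {"xcp-ng", "aarch64"}} {
		p, err := r.lookup(q[0], q[1])
		if err != nil {
			fmt.Println("error:", err)
			continue
		}
		fmt.Printf("%s/%s -> %s\n", q[0], q[1], p.Name())
	}
}
```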

Auto-Onboard Rules

Automatic onboarding based on detected hardware characteristics:

auto-onboard:
  rules:
    - name: large-compute-to-xcpng
      conditions:
        cores: ">= 40"
        memory: ">= 500GB"
        provider: ovh
      action:
        image: xcpng-8.3
        labels: [xen-host, production]

    - name: arm-to-ubuntu
      conditions:
        arch: aarch64
      action:
        image: ubuntu-24.04
        labels: [arm, k8s-worker]

Must support:

  • Preview: show which existing servers match/don't match rules
  • Dry-run: show what would happen for pending servers
  • Apply: actually onboard matching servers
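
The preview mode could be sketched roughly as follows. This assumes the string conditions (">= 40", ">= 500GB") have already been parsed into numeric minimums, that an unset condition means "unconstrained", and that the first matching rule wins; none of that is specified above, and the server names are made up.

```go
package main

import "fmt"

// server holds the detected hardware facts used by the rules.
type server struct {
	Name     string
	Cores    int
	MemoryGB int
	Provider string
	Arch     string
}

// rule holds already-parsed conditions; zero values mean "not constrained".
type rule struct {
	Name        string
	MinCores    int
	MinMemoryGB int
	Provider    string
	Arch        string
	Image       string
	Labels      []string
}

// matches reports whether every constrained condition holds for s.
func (r rule) matches(s server) bool {
	if s.Cores < r.MinCores || s.MemoryGB < r.MinMemoryGB {
		return false
	}
	if r.Provider != "" && r.Provider != s.Provider {
		return false
	}
	if r.Arch != "" && r.Arch != s.Arch {
		return false
	}
	return true
}

// preview maps each server name to the first matching rule name, or "".
func preview(rules []rule, servers []server) map[string]string {
	out := make(map[string]string, len(servers))
	for _, s := range servers {
		out[s.Name] = ""
		for _, r := range rules {
			if r.matches(s) {
				out[s.Name] = r.Name
				break
			}
		}
	}
	return out
}

// exampleRules mirrors the two YAML rules above.
func exampleRules() []rule {
	return []rule{
		{Name: "large-compute-to-xcpng", MinCores: 40, MinMemoryGB: 500, Provider: "ovh",
			Image: "xcpng-8.3", Labels: []string{"xen-host", "production"}},
		{Name: "arm-to-ubuntu", Arch: "aarch64",
			Image: "ubuntu-24.04", Labels: []string{"arm", "k8s-worker"}},
	}
}

func main() {
	servers := []server{
		{"ovh-big-01", 48, 512, "ovh", "x86_64"},
		{"arm-node-01", 4, 8, "", "aarch64"},
		{"small-x86-01", 16, 64, "", "x86_64"},
	}
	result := preview(exampleRules(), servers)
	for _, s := range servers {
		if name := result[s.Name]; name != "" {
			fmt.Printf("%-14s matches %s\n", s.Name, name)
		} else {
			fmt.Printf("%-14s no match\n", s.Name)
		}
	}
}
```

Dry-run and apply could reuse the same matching function and differ only in whether the resulting action is executed.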

Deployment Approach: Universal PXE Agent + Rootfs Images

Decision: NOT using native installers

Instead of dealing with five different installer formats (autoinstall, kickstart, preseed, XCP-ng XML, VyOS config), Lab uses a universal approach:

  1. PXE boot ONE agent OS (same for all target distros)
  2. Agent contacts Lab server, gets instructions
  3. Agent partitions disk, deploys rootfs tarball, injects config, reboots
  4. Target OS boots with lab-agent, enrolls with OpenVox

This avoids the nightmare of maintaining five installer plugins × 3 architectures.
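
On the agent side, the flow above amounts to running an ordered list of steps and stopping at the first failure. The sketch below shows that control flow only; the step names and the runPlan, noop and failWith helpers are hypothetical, and real steps would partition disks, unpack the tarball and so on.

```go
package main

import (
	"errors"
	"fmt"
)

// step is one unit of work in a deployment plan.
type step struct {
	Name string
	Run  func() error
}

// runPlan executes steps in order and stops at the first failure.
// It returns the names of the steps that completed successfully.
func runPlan(steps []step) ([]string, error) {
	var done []string
	for _, s := range steps {
		if err := s.Run(); err != nil {
			return done, fmt.Errorf("step %q failed: %w", s.Name, err)
		}
		done = append(done, s.Name)
	}
	return done, nil
}

// noop stands in for a real action that succeeds.
func noop() error { return nil }

// failWith returns an action that always fails with msg.
func failWith(msg string) func() error {
	return func() error { return errors.New(msg) }
}

func main() {
	plan := []step{
		{"partition-disk", noop},
		{"deploy-rootfs", noop},
		{"inject-config", failWith("lab server unreachable")},
		{"reboot", noop},
	}
	done, err := runPlan(plan)
	fmt.Println("completed:", done)
	fmt.Println("error:", err)
}
```

Stopping before the reboot step when an earlier step fails keeps a half-deployed machine in the agent OS, where it can still report the error back to the server.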

Tool Evaluation

Tool               What It Does                                                         For Lab?
Tinkerbell (CNCF)  PXE → HookOS agent → workflow actions (partition, deploy, inject)    Best candidate to wrap
LinuxKit           Build minimal agent OS (used by Tinkerbell's HookOS)                 Build our PXE agent
mkosi              Build rootfs tarballs for any distro (Fedora, Ubuntu, Debian, etc.)  Image production
iPXE               Universal PXE bootloader with scripting                              PXE foundation
Pixiecore          Simple Go PXE server with per-MAC API mode                           PXE building block
bootc              Bootable OCI containers → install to disk (RHEL-family)              Image format option
cloud-init         First-boot config injection                                          Post-deploy config
Packer             Build VM/machine images                                              Golden image building
MAAS/Curtin        Production-grade, same pattern, but Ubuntu-centric + heavy           Too opinionated
Warewulf           Stateless/diskless boot from container images                        Wrong model (RAM-only)
Kairos             Immutable k8s-focused OS from containers                             Too opinionated
FOG/Clonezilla     Block-level disk cloning                                             Too rigid
FAI                Debian-centric installer framework                                   Too narrow
Razor (Puppet)     Dead (archived 2019)                                                 Dead
netboot.xyz        PXE boot menu into native installers                                 Opposite of what we want

Tinkerbell — Closest Match

Tinkerbell already implements this pattern:

  • HookOS: minimal agent OS built with LinuxKit, boots via PXE, multi-arch (x86 + ARM)
  • Tink Worker: runs inside HookOS, contacts server via gRPC, executes workflows
  • Workflow Actions:
    • rootio — partition disks, create filesystems
    • archive2disk — stream compressed rootfs tarball to mounted filesystem
    • image2disk — write raw disk image (dd-style)
    • oci2disk — pull OCI container image, write to disk
    • writefile — write individual files (puppet certs, config, enrollment token)
    • cexec — chroot and run commands (install bootloader, etc.)
    • kexec — kexec into new kernel (avoids reboot)

Tinkerbell's limitation: requires Kubernetes to run (Tink Server is k8s-native). Options:

  • Run on bootstrap node's k3s (works but adds k3s dependency before we have k3s)
  • Extract just HookOS + actions, replace Tink Server with Lab's own API
  • Use Tinkerbell after initial bootstrap

Option A: Wrap Tinkerbell

Use Tinkerbell's HookOS and actions, with Lab translating lab onboard into Tinkerbell workflows. Proven, multi-arch, and battle-tested by Equinix Metal.

Option B: Build our own lightweight agent

If Tinkerbell's k8s dependency is too heavy:

  • Build agent OS with LinuxKit (like HookOS but simpler)
  • Small Go binary as the agent: contacts lab-server, gets instructions, partitions, deploys rootfs, injects files, installs bootloader, reboots
  • Embedded in Lab binary — no k8s dependency
  • Essentially "Tinkerbell actions without Tinkerbell's workflow engine"

Decision: TBD — needs hands-on evaluation of Tinkerbell

VyOS Inspiration

VyOS proves this pattern works:

  • Image-based install (rootfs deployed to partition)
  • Also runs as Docker container (same config system)
  • Same concept as Lab: one definition → VM image, bare metal, or container

Image Production Pipeline

Lab needs to produce rootfs tarballs for each OS × architecture:

$ lab image build ubuntu-24.04 --arch x86_64,aarch64
  → Uses mkosi or debootstrap to build rootfs
  → Injects lab-agent, cloud-init datasource
  → Produces: ubuntu-24.04-x86_64.tar.gz, ubuntu-24.04-aarch64.tar.gz

$ lab image build xcpng-8.3 --arch x86_64
  → Extract/capture rootfs from XCP-ng installer/installed system
  → Produces: xcpng-8.3-x86_64.tar.gz

$ lab image list
IMAGE              ARCH              SIZE      BUILT
ubuntu-24.04       x86_64, aarch64   850MB     2026-03-15
debian-12          x86_64, aarch64   620MB     2026-03-14
fedora-41          x86_64, aarch64   920MB     2026-03-14
almalinux-9        x86_64, aarch64   780MB     2026-03-13
xcpng-8.3          x86_64            1.2GB     2026-03-10
vyos-1.4           x86_64, aarch64   450MB     2026-03-12

Image build tools per OS:

  • Ubuntu/Debian: debootstrap or mkosi
  • Fedora/AlmaLinux: dnf --installroot or mkosi
  • XCP-ng: install in QEMU + Packer, capture rootfs (only viable method)
  • VyOS: extract squashfs from ISO (unsquashfs /mnt/live/filesystem.squashfs)
  • Asahi Linux: NOT BUILDABLE — SSH onboard only, OS already installed by user

XCP-ng Rootfs Production — Detailed

Why package-based build doesn't work

  • install.img is the installer ramdisk, NOT the target system
  • The installer (host-installer/backend.py) does post-install XAPI setup that can't be replicated with just yum --installroot
  • Nobody has successfully built XCP-ng from packages alone
  • create-install-image scripts only produce ISOs

Viable approach: Packer + QEMU capture

1. Boot XCP-ng ISO in QEMU with answerfile (unattended)
2. Installer runs normally, does all XAPI/Xen setup
3. Mount resulting disk image
4. Tar up root partition
5. Generalize: remove SSH keys, XAPI state.db, hostname, UUIDs, persistent net rules
6. Output: xcpng-8.3-x86_64.tar.gz
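
Step 5 could be sketched as a pattern-driven cleanup over the mounted root partition. The path patterns below are illustrative guesses and should be verified against a real XCP-ng install; UUID regeneration is not covered. The generalize and demoRootfs helpers are hypothetical.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// generalizePatterns are assumed locations, relative to the rootfs.
var generalizePatterns = []string{
	"etc/ssh/ssh_host_*",                       // SSH host keys
	"var/lib/xcp/state.db",                     // XAPI database (assumed path)
	"etc/hostname",                             // hostname
	"etc/udev/rules.d/70-persistent-net.rules", // persistent net rules (assumed path)
}

// generalize removes host-specific files under rootfs and returns
// the number of paths it removed.
func generalize(rootfs string, patterns []string) (int, error) {
	removed := 0
	for _, p := range patterns {
		matches, err := filepath.Glob(filepath.Join(rootfs, p))
		if err != nil {
			return removed, fmt.Errorf("bad pattern %q: %w", p, err)
		}
		for _, m := range matches {
			if err := os.RemoveAll(m); err != nil {
				return removed, err
			}
			removed++
		}
	}
	return removed, nil
}

// demoRootfs builds a throwaway directory tree to exercise generalize.
func demoRootfs() (string, error) {
	root, err := os.MkdirTemp("", "rootfs-")
	if err != nil {
		return "", err
	}
	files := []string{
		"etc/ssh/ssh_host_rsa_key",
		"etc/ssh/ssh_host_rsa_key.pub",
		"etc/ssh/sshd_config", // must survive generalization
		"etc/hostname",
	}
	for _, f := range files {
		p := filepath.Join(root, f)
		if err := os.MkdirAll(filepath.Dir(p), 0o755); err != nil {
			return "", err
		}
		if err := os.WriteFile(p, []byte("x"), 0o600); err != nil {
			return "", err
		}
	}
	return root, nil
}

func main() {
	root, err := demoRootfs()
	if err != nil {
		panic(err)
	}
	defer os.RemoveAll(root)
	n, err := generalize(root, generalizePatterns)
	fmt.Println("removed:", n, "error:", err)
}
```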

XCP-ng partition layout (PXE agent must recreate this)

sda1: 18GB  ext3  /           (dom0 root)
sda2: 18GB  ext3  (backup)    (upgrade slot)
sda3: rest  LVM   (SR)        (VM storage repository)
sda4: 512MB vfat  /boot/efi   (UEFI ESP)
sda5: 4GB   ext3  /var/log
sda6: 1GB   swap
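
To recreate this layout on disks of different sizes, the agent has to compute the size of the "rest" partition. A small sketch under simplifying assumptions: sizes are in MB with 1 GB = 1024 MB, and GPT overhead and alignment are ignored. The partition type and resolve function are hypothetical.

```go
package main

import "fmt"

// partition describes one entry of the layout; SizeMB == 0 means
// "use whatever space is left".
type partition struct {
	Number int
	SizeMB int
	FS     string
	Role   string
}

// xcpngLayout mirrors the table above.
var xcpngLayout = []partition{
	{1, 18 * 1024, "ext3", "dom0 root"},
	{2, 18 * 1024, "ext3", "backup"},
	{3, 0, "lvm", "SR"},
	{4, 512, "vfat", "ESP"},
	{5, 4 * 1024, "ext3", "/var/log"},
	{6, 1024, "swap", "swap"},
}

// resolve returns a copy of layout with the single "rest" partition
// sized to fill the disk. It fails if the fixed partitions do not fit.
func resolve(layout []partition, diskMB int) ([]partition, error) {
	fixed, restIdx := 0, -1
	for i, p := range layout {
		if p.SizeMB > 0 {
			fixed += p.SizeMB
			continue
		}
		if restIdx != -1 {
			return nil, fmt.Errorf("layout has more than one 'rest' partition")
		}
		restIdx = i
	}
	if fixed >= diskMB {
		return nil, fmt.Errorf("disk too small: need more than %d MB, have %d MB", fixed, diskMB)
	}
	out := append([]partition(nil), layout...)
	if restIdx >= 0 {
		out[restIdx].SizeMB = diskMB - fixed
	}
	return out, nil
}

func main() {
	parts, err := resolve(xcpngLayout, 500*1024)
	if err != nil {
		panic(err)
	}
	for _, p := range parts {
		fmt.Printf("sda%d %8d MB  %-5s %s\n", p.Number, p.SizeMB, p.FS, p.Role)
	}
}
```

The fixed partitions add up to 42496 MB, so a 500 GB disk leaves 469504 MB for the storage repository, and anything at or below roughly 41.5 GB is rejected.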

Asahi Linux — Special Case

Why it can't follow the standard path

  • No PXE boot — Apple Silicon only boots from internal NVMe or USB (iBoot)
  • Firmware partition — m1n1 must be in Apple's APFS container, coexists with macOS
  • Device tree — generated per-chip at install time
  • GPU drivers — Asahi's reverse-engineered drivers are kernel-specific
  • Boot chain: iBoot → m1n1 → U-Boot/GRUB → Linux (completely non-standard)

How Lab handles it

  • SSH onboard only: lab onboard mac-studio --provider ssh --host <ip>
  • Asahi is already installed (user did this manually or via Asahi installer)
  • Lab manages the userspace (Fedora-based) via puppet normally
  • Kernel updates from Asahi repos, managed by puppet/dnf
  • m1n1/U-Boot/firmware layer is untouched by Lab

Lesson

Not everything is PXE-bootable. Lab needs two onboard paths:

  • PXE onboard: bare metal with no OS (Beelinks, OVH servers, XCP-ng hosts)
  • SSH onboard: OS already installed (Mac Studio, DGX Spark, cloud VMs)

Image Deployment Matrix

                    PXE Deploy    SSH Onboard    Container    VM Image
Ubuntu 24.04        ✓ rootfs      ✓              ✓            ✓ qcow2
Debian 12           ✓ rootfs      ✓              ✓            ✓ qcow2
Fedora 41           ✓ rootfs      ✓              ✓            ✓ qcow2
AlmaLinux 9         ✓ rootfs      ✓              ✓            ✓ qcow2
XCP-ng 8.3          ✓ rootfs      ✓ (existing)   ✗            ✗
VyOS 1.4            ✓ rootfs      ✓ (existing)   ✓ docker     ✓ qcow2
Asahi Linux         ✗ impossible  ✓ (only way)   ✗            ✗
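
One way Lab could enforce this matrix is a lookup table consulted before any deploy. The sketch below encodes the rows above; the target constants, the image keys (in particular "asahi-linux") and the checkTarget function are hypothetical.

```go
package main

import "fmt"

// target is a delivery method from the matrix columns.
type target string

const (
	pxeDeploy  target = "pxe"
	sshOnboard target = "ssh"
	container  target = "container"
	vmImage    target = "vm"
)

// matrix encodes the table above; missing entries mean "not supported".
var matrix = map[string]map[target]bool{
	"ubuntu-24.04": {pxeDeploy: true, sshOnboard: true, container: true, vmImage: true},
	"debian-12":    {pxeDeploy: true, sshOnboard: true, container: true, vmImage: true},
	"fedora-41":    {pxeDeploy: true, sshOnboard: true, container: true, vmImage: true},
	"almalinux-9":  {pxeDeploy: true, sshOnboard: true, container: true, vmImage: true},
	"xcpng-8.3":    {pxeDeploy: true, sshOnboard: true},
	"vyos-1.4":     {pxeDeploy: true, sshOnboard: true, container: true, vmImage: true},
	"asahi-linux":  {sshOnboard: true},
}

// checkTarget returns an error if image cannot be delivered via t.
func checkTarget(image string, t target) error {
	caps, ok := matrix[image]
	if !ok {
		return fmt.Errorf("unknown image %q", image)
	}
	if !caps[t] {
		return fmt.Errorf("%s cannot be delivered via %s", image, t)
	}
	return nil
}

func main() {
	fmt.Println(checkTarget("xcpng-8.3", pxeDeploy))   // <nil>
	fmt.Println(checkTarget("asahi-linux", pxeDeploy)) // error
}
```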

Automated Image Pipeline

Images must be rebuilt regularly to include security updates and new lab-agent versions.

Pipeline Configuration

image-pipelines:
  ubuntu-24.04:
    method: debootstrap
    schedule: weekly
    architectures: [x86_64, aarch64]
    outputs: [rootfs-tarball, container-base, qcow2]
    retention: 4 builds

  xcpng-8.3:
    method: packer-qemu          # install in QEMU, capture
    schedule: monthly
    architectures: [x86_64]
    outputs: [rootfs-tarball]
    retention: 3 builds

  vyos-1.4:
    method: squashfs-extract     # extract from ISO
    schedule: monthly
    architectures: [x86_64, aarch64]
    outputs: [rootfs-tarball, container-base]
    retention: 3 builds
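
The retention setting could be applied with a simple "keep the N newest" rule, sketched below. The build type, the prune function and the build IDs are hypothetical; the configuration above only states how many builds to keep, not how they are chosen.

```go
package main

import (
	"fmt"
	"sort"
	"time"
)

// build is one produced image build.
type build struct {
	ID    string
	Built time.Time
}

// prune splits builds into the newest `retention` entries to keep and
// the remainder to remove. The input slice is not modified.
func prune(builds []build, retention int) (keep, remove []build) {
	sorted := append([]build(nil), builds...)
	sort.Slice(sorted, func(i, j int) bool {
		return sorted[i].Built.After(sorted[j].Built)
	})
	if retention < 0 {
		retention = 0
	}
	if retention > len(sorted) {
		retention = len(sorted)
	}
	return sorted[:retention], sorted[retention:]
}

// day parses a YYYY-MM-DD date and panics on malformed input (demo helper).
func day(s string) time.Time {
	t, err := time.Parse("2006-01-02", s)
	if err != nil {
		panic(err)
	}
	return t
}

func main() {
	builds := []build{
		{"xcpng-8.3-b1", day("2025-11-10")},
		{"xcpng-8.3-b2", day("2025-12-10")},
		{"xcpng-8.3-b3", day("2026-01-10")},
		{"xcpng-8.3-b4", day("2026-02-10")},
		{"xcpng-8.3-b5", day("2026-03-10")},
	}
	keep, remove := prune(builds, 3)
	for _, b := range keep {
		fmt.Println("keep  ", b.ID)
	}
	for _, b := range remove {
		fmt.Println("remove", b.ID)
	}
}
```

A production version would probably also refuse to remove a build that is still the promoted default or is referenced by deployed servers.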

Build runs on Lab itself (dogfooding)

  • x86 images build on x86 machines (Beelink SER9 MAX)
  • ARM images build on ARM machines (DGX Spark, Minisforum)
  • XCP-ng builds on any x86 with QEMU/KVM
  • Lab picks the right builder based on architecture

Upgrade flow

  • New image built → Lab knows which servers run old version
  • lab image diff shows package changes
  • lab image promote makes new image the default for new deploys
  • Existing servers: puppet manages package updates (not re-imaged unless requested)
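
The package comparison behind lab image diff might look like the sketch below, which assumes each image build records a package name to version map. The output format ("+", "-", "~" prefixes) and the diffPackages function are invented for illustration; the actual command output is not specified here.

```go
package main

import (
	"fmt"
	"sort"
)

// diffPackages compares two name->version maps and returns sorted lines:
// "+ name ver" for added, "- name ver" for removed, "~ name old -> new"
// for changed packages. Unchanged packages are omitted.
func diffPackages(prev, next map[string]string) []string {
	var lines []string
	for name, nv := range next {
		pv, ok := prev[name]
		switch {
		case !ok:
			lines = append(lines, fmt.Sprintf("+ %s %s", name, nv))
		case pv != nv:
			lines = append(lines, fmt.Sprintf("~ %s %s -> %s", name, pv, nv))
		}
	}
	for name, pv := range prev {
		if _, ok := next[name]; !ok {
			lines = append(lines, fmt.Sprintf("- %s %s", name, pv))
		}
	}
	sort.Strings(lines) // deterministic output despite map iteration order
	return lines
}

func main() {
	prev := map[string]string{"openssl": "3.0.13", "curl": "8.5.0", "nano": "7.2"}
	next := map[string]string{"openssl": "3.0.14", "curl": "8.5.0", "vim": "9.1"}
	for _, l := range diffPackages(prev, next) {
		fmt.Println(l)
	}
}
```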

Connection to Puppet → Container Artifact Builder

Same pipeline, different output targets:

Label "mailserver" + base image "ubuntu-24.04":
  → rootfs + puppet classes = bare metal image (tar.gz for PXE deploy)
  → rootfs + puppet classes = container image (OCI for k8s/docker)
  → rootfs + puppet classes = VM image (qcow2/vmdk for XCP-ng/AWS)

One label, one set of puppet modules, three deployment formats.

Multi-Architecture Considerations

  • PXE boot chain differs between x86 (BIOS/UEFI) and ARM (UEFI only)
  • Need separate kernel/initrd per architecture for the agent OS
  • Rootfs tarballs are architecture-specific
  • Some OS images don't exist for all architectures (XCP-ng = x86 only)
  • Lab must track architecture per image and refuse mismatches
  • Tinkerbell's HookOS already builds for x86_64 and aarch64
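
The "refuse mismatches" requirement could be enforced with a check like the one below. It also normalizes the common amd64/arm64 spellings to the x86_64/aarch64 names used in this document, since different tools report the same architecture under different names. The image type and both functions are hypothetical.

```go
package main

import "fmt"

// image records which architectures an image has been built for.
type image struct {
	Name  string
	Archs []string
}

// canonicalArch maps common aliases to the names used in this document.
func canonicalArch(a string) string {
	switch a {
	case "amd64":
		return "x86_64"
	case "arm64":
		return "aarch64"
	default:
		return a
	}
}

// checkArch returns an error unless img was built for serverArch.
func checkArch(img image, serverArch string) error {
	want := canonicalArch(serverArch)
	for _, a := range img.Archs {
		if canonicalArch(a) == want {
			return nil
		}
	}
	return fmt.Errorf("image %s is not built for %s (available: %v)", img.Name, want, img.Archs)
}

func main() {
	xcpng := image{"xcpng-8.3", []string{"x86_64"}}
	ubuntu := image{"ubuntu-24.04", []string{"x86_64", "aarch64"}}
	fmt.Println(checkArch(ubuntu, "arm64")) // <nil>
	fmt.Println(checkArch(xcpng, "aarch64"))
}
```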