Files
lab/crossplane-evaluation.md

107 lines
5.0 KiB
Markdown
Raw Normal View History

2026-03-15 23:50:43 +00:00
# Crossplane Evaluation
## Decision: NOT ADOPTING
Crossplane will not be used in this stack. The lack of a plan/preview mechanism is a dealbreaker
for enterprise adoption and safe infrastructure management.
---
## Why We Evaluated It
The core problem: Terraform/OpenTofu requires re-implementing the same infrastructure concepts
per platform (AWS, XCP-ng, bare metal). At thousands of nodes across multiple platforms, this is
a massive maintenance burden. Crossplane's XRD/Composition model promised a unified API:
```
XRD: "VirtualMachine" (universal API)
├── Composition: AWS → EC2 instance
├── Composition: XCP-ng → XO VM
└── Composition: bare metal → MAAS / Ansible
```
One API, multiple backends — teams request a "VirtualMachine" and the right composition handles it.
## Strengths
- **CNCF Graduated** (Nov 2025, v2.2) — Apache 2.0 license, top-tier maturity
- **Continuous drift detection** — automatically reverts manual changes, unlike Terraform's on-demand plan/apply
- **No state file management** — no remote backends, locking issues, or state corruption
- **Kubernetes-native** — works with ArgoCD, Flux, kubectl, RBAC out of the box
- **XRDs/Compositions** — genuine multi-platform abstraction layer, solves the "re-implement per cloud" problem
- **Eventual consistency** — resources with complex dependencies don't get stuck like Terraform's dependency graph
- **Enterprise adoption** — Deutsche Kreditbank, Elastic, Nike, Apple, NASA, Grafana Labs, 60+ orgs
- **Deutsche Kreditbank** replaced Terraform; deployments went from weeks to under one hour
## Dealbreaker: No Plan/Preview
The single biggest issue. Terraform's `terraform plan` lets operators see exactly what will change
before applying. Crossplane applies changes immediately upon resource creation/modification.
- Discussed in the community for 2+ years with no resolution
- A Kubernetes-native solution would be a `Plan` CRD that shows proposed changes before approval
- ArgoCD `sync --dry-run` is a partial workaround but only shows k8s resource diffs, not what the
cloud provider will actually do underneath
- **For regulated environments and SRE teams at scale, change preview is non-negotiable**
Possible reasons it hasn't been implemented:
- The continuous reconciliation architecture may make point-in-time snapshots fundamentally hard
- Upbound (commercial entity) may be reserving it for their paid platform
- Or simply not prioritised
## Other Significant Concerns
### CRD Bloat
- `provider-aws` installs 900+ CRDs — can make API server unresponsive for up to an hour (GitHub #2649)
- Exceeds Kubernetes' recommended ~500 CRD limit
- Mitigated by "Provider Families" (install per-service sub-providers) but requires careful planning
### Debugging Difficulty
- Errors propagate through layers: Claim → XR → Composition → Managed Resource → Provider → Cloud API
- Multiple sources report debugging compositions is painful
- Pipeline Inspector (alpha in v2.2) is being introduced but not production-ready
### Chicken-and-Egg Problem
- Crossplane runs inside Kubernetes — cannot provision the cluster it runs on
- Requires a "management cluster" bootstrapped by other means (Terraform, Puppet, etc.)
- If the management cluster dies, no drift detection or reconciliation runs
- Recovery: applying YAMLs to a new cluster works if deterministic resource names are used,
otherwise risks creating duplicate cloud resources
### Cluster Loss / Immutability Concerns
- State lives in etcd, not a versionable state file
- No independent audit trail or easy way to diff historical states
- On new cluster: resources with explicit external names get adopted; auto-named resources get duplicated
- Need etcd backups as insurance, and deterministic naming everywhere
### Performance at Scale
- ~2000 composites took 6+ minutes to reconcile on k3d (GitHub #2256)
- Reconciliation interval not easily configurable globally (GitHub #5934)
### YAML Limitations
- No native loops, conditionals, or programming constructs
- Complex compositions require changes in multiple locations
## XCP-ng Provider Gap
- No Crossplane provider for XCP-ng exists today
- A mature Terraform provider (`terraform-provider-xenorchestra`) exists, maintained by Vates
- Could be wrapped via Upjet to auto-generate a Crossplane provider — but nobody has done it
- Would be a greenfield open-source project
## Real Issues Reported
- API server unresponsiveness with too many CRDs (GitHub #2649)
- CRD scaling issues beyond ~500 CRDs (GitHub #2895)
- GCP SQL resources randomly marked for deletion — dangerous for production databases
- Reconciliation rate limiting at scale (GitHub #2256)
## Conclusion
Crossplane solves a real problem (multi-platform abstraction) that we need, but the lack of
plan/preview makes it unsuitable for enterprise-scale production infrastructure management.
The operational concerns (CRD bloat, debugging, cluster dependency) add further risk.
We need to find an alternative approach to the multi-platform abstraction problem that Crossplane
solves, while retaining plan/preview capabilities.