Project Motivation
I run a Kubernetes cluster on Proxmox in my homelab. Most of the time it sits there idle with 2-3 worker nodes. Then I'll spin up a machine learning workload or a batch job that needs 64GB of RAM and suddenly I'm SSHing into Proxmox to manually clone VMs and join them to the cluster.
Beyond the ML workloads, I wanted to be able to spin up random stuff without the hassle of setting up new hosts or expanding existing ones. Try out a new database? Deploy a testing environment? Just kubectl apply and let the cluster handle it. I also wanted to run ephemeral GitHub Actions runners - the kind that spin up for a build job and disappear after. Those are a perfect fit for autoscaling but terrible for manual provisioning.
Things I Considered
Cluster Autoscaler is the official Kubernetes autoscaling solution. It expects cloud provider node groups with pre-configured templates and makes assumptions that don't fit homelabs well. I just wanted to tell Proxmox "create a VM with 8 cores and 16GB RAM" when pods can't schedule.
Karpenter is brilliant but it's deeply integrated with AWS concepts - instance types, launch templates, spot instances. It's built for cloud environments where you select from a catalog of predefined machine types rather than specifying arbitrary resource amounts.
Metal3 is designed for bare metal provisioning with Kubernetes. But I'd either have to wrap it around the Proxmox API or migrate my entire infrastructure to Metal3 and ditch Proxmox entirely. That felt like overkill.
kproximate is specifically built for Proxmox autoscaling, but it uses RabbitMQ for message passing, whereas I wanted to keep everything native Kubernetes with CRDs. More fundamentally, it relies on VM templates. The whole reason I picked up NixOS was to avoid Packer and traditional image building. Templates are fine, but flakes give you reproducibility that image snapshots can't match. If you want to rebuild an Ubuntu image from a few weeks ago, you're stuck - pulling new packages changes everything. With a flake, I can modify the system while keeping every dependency locked to a specific version. I wanted that property for my worker nodes.
Talos might also work here - it has a minimal base image and I think it can be made to auto-join clusters. But I was already using NixOS for other servers in my homelab, so I haven't explored it yet.
None of these felt quite right, and I figured I'd learn more by building exactly what I wanted. So I built KubeNodeSmith using Kubebuilder. It watches for unschedulable pods, provisions VMs from Proxmox, waits for them to register with the cluster, and cleans them up when idle. The architecture follows Karpenter's NodeClaim model but strips away the cloud-specific assumptions.
Architecture Decisions
I went with custom resources instead of a stateful daemon. This gives me declarative configuration, GitOps support, and automatic reconciliation if the controller crashes. Three CRDs handle the lifecycle:
NodeSmithProvider stores infrastructure credentials and VM settings. One provider can serve multiple pools:
apiVersion: kubenodesmith.parawell.cloud/v1alpha1
kind: NodeSmithProvider
metadata:
  name: proxmox-production
spec:
  type: proxmox
  credentialsSecretRef:
    name: proxmox-api-secret
  proxmox:
    endpoint: https://10.0.4.30:8006/api2/json
    nodeWhitelist: [alfaromeo, porsche]
    vmIDRange: { lower: 1250, upper: 1300 }
    networkInterfaces:
      - { name: net0, model: virtio, bridge: vmbr0, vlanTag: 20 }
    vmOptions:
      - { name: boot, value: order=net0 } # PXE boot
NodeSmithPool defines capacity limits and scaling behavior:
apiVersion: kubenodesmith.parawell.cloud/v1alpha1
kind: NodeSmithPool
metadata:
  name: proxmox-small
spec:
  providerRef: proxmox-production
  limits:
    minNodes: 0
    maxNodes: 5
    memoryMiB: 30720
  machineTemplate:
    kubeNodeNamePrefix: worker-auto
  scaleUp:
    stabilizationWindow: 2m
  scaleDown:
    stabilizationWindow: 5m
NodeSmithClaim represents a single node request. The NodePoolReconciler creates these automatically when pods can't schedule, and the NodeClaimReconciler provisions the actual VM:
apiVersion: kubenodesmith.parawell.cloud/v1alpha1
kind: NodeSmithClaim
metadata:
  name: proxmox-small-abc123
spec:
  poolRef: proxmox-small
  requirements:
    cpuCores: 4
    memoryMiB: 8192
status:
  conditions:
    - type: Launched
      status: "True"
    - type: Registered
      status: "True"
    - type: Ready
      status: "True"
  providerID: proxmox://cluster/vms/1251
  nodeName: worker-auto-1
The claim lifecycle tracks four states: Launched (VM created) → Registered (node has joined the cluster) → Initialized (node reports Ready) → Ready (claim is available for scheduling). This gives visibility into where provisioning is stuck when things go wrong.
The Netboot + Cloud-Init Setup
Early on I tried using Proxmox's cloud-init integration to inject cluster join tokens into VMs. This worked but had a fatal flaw: the cloud-init image Proxmox generates is tied to a single VM. It has to be built per-VM and can't be shared.
For autoscaling, this means every time you spin up a new node, you'd need to dynamically build a new cloud-init image on the Proxmox host. How do you do that? Run a service on the Proxmox host that processes build requests? SSH from the controller into Proxmox to run build commands? Both approaches are messy, insecure, and brittle. And any configuration change - rotating the k3s join token, updating SSH keys, adjusting kubelet flags - means rebuilding every single image. You end up maintaining a separate image per VM for what should be simple configuration changes, and each one takes time to build and upload.
I worked around this using Proxmox's cicustom feature. Instead of embedding cloud-init data in the image, cicustom points to a snippet file stored on the Proxmox host at /var/lib/vz/snippets/. This snippet can be shared across all VMs, and changes take effect immediately on new boots without touching the base image.
My setup uses a NixOS netboot image with k3s and cloud-init enabled, but no secrets baked in. VMs boot from net0 and pull the image via PXE from a dedicated machine called "aspen".
The boot flow looks like this:
1. VM powers on, sends DHCP request over the network interface
2. aspen's dnsmasq assigns an IP and tells the VM to load boot.ipxe via TFTP
3. iPXE chainloads the actual NixOS kernel (bzImage) and initrd from /srv/tftp
4. The kernel boots with init=/init netboot=true, and the initrd fetches the full root filesystem over HTTP from nginx
5. Cloud-init runs and reads the k3s token from the Proxmox snippet
The k3s token lives in a snippet file on each Proxmox host:
# /var/lib/vz/snippets/zagato-shared-cloud-init-meta-data.yaml
k3s_token: K10abc123def456...
The Proxmox provider configures each VM to reference this shared snippet via cicustom:
# Proxmox VM config (INI format for visualization purposes)
name = zagato-worker-auto-1
cores = 4
memory = 8192
boot = order=net0
cicustom = meta=local:snippets/zagato-shared-cloud-init-meta-data.yaml
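To make that concrete, here's a rough sketch of how the Proxmox provider can assemble these options when creating a VM through the API. This is illustrative, not the project's actual code: the function name, storage names, vmid selection, and TLS handling for the self-signed cert are all simplified assumptions.

// Sketch: creating a worker VM via the Proxmox API. Parameter names match the
// qemu config options shown above; everything else is simplified or assumed.
import (
    "context"
    "fmt"
    "net/http"
    "net/url"
    "strings"
)

func createVM(ctx context.Context, endpoint, node, apiToken string, vmID, cores, memoryMiB int) error {
    params := url.Values{}
    params.Set("vmid", fmt.Sprint(vmID))
    params.Set("name", fmt.Sprintf("worker-auto-%d", vmID))
    params.Set("cores", fmt.Sprint(cores))
    params.Set("memory", fmt.Sprint(memoryMiB)) // Proxmox expects MiB here
    params.Set("net0", "virtio,bridge=vmbr0,tag=20")
    params.Set("boot", "order=net0")            // PXE boot, no disk image
    params.Set("ide2", "local-lvm:cloudinit")   // cloud-init drive (storage name is illustrative)
    params.Set("cicustom", "meta=local:snippets/zagato-shared-cloud-init-meta-data.yaml")

    req, err := http.NewRequestWithContext(ctx, http.MethodPost,
        fmt.Sprintf("%s/nodes/%s/qemu", endpoint, node), strings.NewReader(params.Encode()))
    if err != nil {
        return err
    }
    req.Header.Set("Content-Type", "application/x-www-form-urlencoded")
    req.Header.Set("Authorization", "PVEAPIToken="+apiToken) // user@realm!tokenid=secret

    resp, err := http.DefaultClient.Do(req)
    if err != nil {
        return err
    }
    defer resp.Body.Close()
    if resp.StatusCode >= 300 {
        return fmt.Errorf("proxmox create VM: %s", resp.Status)
    }
    return nil
}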
Now I can update the base image (rebuild the NixOS flake, update /srv/tftp/bzImage and /srv/tftp/initrd on aspen) without touching secrets, and rotate the k3s token (edit the snippet) without rebuilding images. One snippet file services all the autoscaled nodes. The images themselves are truly stateless - no configuration, no credentials, just a reproducible system built from a flake.
Scale-Up Logic
The NodePoolReconciler runs every 30 seconds or whenever a pod becomes unschedulable. The flow:
1. List all unschedulable pods and calculate their total resource requirements
2. Check if we're within the stabilization window (default 2 minutes since last scale-up)
- If yes, wait and requeue
- If no, proceed
3. Create a NodeSmithClaim with the calculated CPU and memory requirements
4. NodeClaimReconciler picks up the new claim
5. Provider creates a VM with the requested resources
6. VM boots via PXE, pulls the netboot image from aspen
7. Cloud-init injects the k3s token from the Proxmox snippet
8. Node registers with the cluster
9. Kubernetes scheduler places pending pods on the new node
The stabilization window prevents thrashing. If a pod can't schedule due to a transient issue (image pull, init container failure), we don't want to immediately provision a new node. Wait 2 minutes and check if the pod is still pending.
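Condensed into code, the scale-up decision looks roughly like this. It's a sketch rather than the real reconciler: the pool status field (LastScaleUp) and the newClaimFor helper are illustrative assumptions, while the unschedulable check uses the standard pod condition the scheduler sets.

// Sketch of the scale-up decision in the NodePoolReconciler.
// Assumed imports: context, time, corev1 "k8s.io/api/core/v1".
func (r *NodePoolReconciler) maybeScaleUp(ctx context.Context, pool *v1alpha1.NodeSmithPool) error {
    var pods corev1.PodList
    if err := r.List(ctx, &pods); err != nil {
        return err
    }

    // 1. Sum the requests of every pod the scheduler marked unschedulable.
    var cpuMilli, memMiB int64
    pending := false
    for i := range pods.Items {
        pod := &pods.Items[i]
        if !isUnschedulable(pod) {
            continue
        }
        pending = true
        for _, c := range pod.Spec.Containers {
            cpuMilli += c.Resources.Requests.Cpu().MilliValue()
            memMiB += c.Resources.Requests.Memory().Value() / (1024 * 1024)
        }
    }
    if !pending {
        return nil
    }

    // 2. Respect the stabilization window before reacting.
    if last := pool.Status.LastScaleUp; last != nil && time.Since(last.Time) < 2*time.Minute {
        return nil // the requeue re-checks once the window has passed
    }

    // 3. Create a claim sized for the pending pods (round CPU up to whole cores).
    claim := newClaimFor(pool, (cpuMilli+999)/1000, memMiB)
    return r.Create(ctx, claim)
}

// isUnschedulable reports whether the scheduler has marked the pod unschedulable.
func isUnschedulable(pod *corev1.Pod) bool {
    for _, cond := range pod.Status.Conditions {
        if cond.Type == corev1.PodScheduled &&
            cond.Status == corev1.ConditionFalse &&
            cond.Reason == corev1.PodReasonUnschedulable {
            return true
        }
    }
    return false
}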
Scale-Down and Draining
Scale-down runs when there are no unschedulable pods. The controller finds nodes that are empty (no running pods except DaemonSets) and have been idle for longer than the stabilization window (default 5 minutes). It deletes the corresponding NodeSmithClaim, which triggers cleanup:
1. Cordon the node to prevent new pods from scheduling
2. Drain pods with the configured grace period (default 60s)
3. Wait for pods to terminate
4. Call the provider to delete the VM
5. Remove the finalizer
The finalizer ensures we don't orphan VMs if the controller crashes mid-drain. On restart, it'll see the claim has a deletion timestamp and resume cleanup.
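In code, the cleanup path is driven by that finalizer. Here's a sketch that mirrors steps 1-5 above; the finalizer name, the drainNode helper, and the Machine fields passed to the provider are assumptions, not the actual implementation.

// Sketch of claim cleanup, matching steps 1-5 above.
// Assumed imports: corev1, "k8s.io/apimachinery/pkg/types",
// "sigs.k8s.io/controller-runtime/pkg/controller/controllerutil".
func (r *NodeClaimReconciler) finalize(ctx context.Context, claim *v1alpha1.NodeSmithClaim) error {
    if !controllerutil.ContainsFinalizer(claim, nodeClaimFinalizer) {
        return nil
    }

    // 1. Cordon the node so nothing new schedules onto it.
    var node corev1.Node
    if err := r.Get(ctx, types.NamespacedName{Name: claim.Status.NodeName}, &node); err == nil {
        node.Spec.Unschedulable = true
        if err := r.Update(ctx, &node); err != nil {
            return err
        }
    }

    // 2. + 3. Drain non-DaemonSet pods with the configured grace period and wait.
    // (A production drain should go through the eviction API so PDBs are honored.)
    if err := r.drainNode(ctx, claim.Status.NodeName, 60*time.Second); err != nil {
        return err
    }

    // 4. Tell the provider to delete the VM.
    if err := r.Provider.DeprovisionMachine(ctx, provider.Machine{Name: claim.Status.NodeName}); err != nil {
        return err
    }

    // 5. Remove the finalizer so the claim object can finally be deleted.
    controllerutil.RemoveFinalizer(claim, nodeClaimFinalizer)
    return r.Update(ctx, claim)
}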
Provider Interface
The provider interface is intentionally minimal. This makes it easy to add support for other infrastructure:
// internal/provider/provider.go
type Provider interface {
    ProvisionMachine(ctx context.Context, spec MachineSpec) (*Machine, error)
    DeprovisionMachine(ctx context.Context, machine Machine) error
    ListMachines(ctx context.Context, namePrefix string) ([]Machine, error)
}

type MachineSpec struct {
    NamePrefix string
    CPUCores   int64
    MemoryMiB  int64
}
The Proxmox implementation is about 300 lines. Adding Redfish support for bare metal would be straightforward - implement the three methods and add a new provider type to the CRD.
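As an illustration of that last point, provider selection can be a small factory keyed off the CRD's type field. This is hypothetical code - the constructor name and the redfish case aren't in the project yet.

// Hypothetical factory mapping spec.type to an implementation.
func newProvider(p *v1alpha1.NodeSmithProvider, creds map[string][]byte) (Provider, error) {
    switch p.Spec.Type {
    case "proxmox":
        return newProxmoxProvider(p.Spec.Proxmox, creds)
    case "redfish":
        return nil, fmt.Errorf("redfish provider not implemented yet")
    default:
        return nil, fmt.Errorf("unknown provider type %q", p.Spec.Type)
    }
}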
Testing and Reliability
I added retry logic with exponential backoff for VM provisioning. Proxmox occasionally returns 500 errors when it's under load. Rather than marking the claim as failed immediately, the controller retries up to 5 times with increasing delays.
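A minimal sketch of that retry using the standard client-go helpers; the isTransient classifier (which errors count as retryable) is an assumption about how the provider's errors are categorized.

import (
    "context"
    "time"

    "k8s.io/apimachinery/pkg/util/wait"
    "k8s.io/client-go/util/retry"
)

// provisionWithRetry retries transient Proxmox failures (e.g. 5xx under load)
// with exponential backoff instead of failing the claim on the first error.
func provisionWithRetry(ctx context.Context, p Provider, spec MachineSpec) (*Machine, error) {
    var machine *Machine
    backoff := wait.Backoff{Steps: 5, Duration: 2 * time.Second, Factor: 2.0, Jitter: 0.1}
    err := retry.OnError(backoff, isTransient, func() error {
        m, err := p.ProvisionMachine(ctx, spec)
        if err != nil {
            return err
        }
        machine = m
        return nil
    })
    return machine, err
}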
The other common failure mode is nodes that provision successfully but never register with the cluster. This usually means the PXE boot image is misconfigured or the cluster join token expired. After 5 minutes, the controller gives up and marks the claim as failed with a RegistrationTimeout condition.
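The check itself is only a few lines in the claim reconcile path - roughly this fragment, using the condition helpers from apimachinery. The condition types follow the status example earlier; the claim fields and hard-coded 5-minute deadline are simplifications.

// Fragment: give up on claims that launched but never registered.
// Assumed imports: time, metav1, "k8s.io/apimachinery/pkg/api/meta".
launched := meta.FindStatusCondition(claim.Status.Conditions, "Launched")
if launched != nil &&
    !meta.IsStatusConditionTrue(claim.Status.Conditions, "Registered") &&
    time.Since(launched.LastTransitionTime.Time) > 5*time.Minute {
    meta.SetStatusCondition(&claim.Status.Conditions, metav1.Condition{
        Type:    "Ready",
        Status:  metav1.ConditionFalse,
        Reason:  "RegistrationTimeout",
        Message: "VM launched but the node never joined the cluster",
    })
    return r.Status().Update(ctx, claim)
}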
Current Status
The autoscaler has been running in my homelab for about 3 months. It handles my ML workloads without manual intervention - pods schedule, nodes appear, work completes, nodes disappear.
What works:
- Automatic scale-up when pods can't schedule
- PXE boot provisioning with NixOS
- Graceful scale-down with pod draining
- Retry logic for transient failures
- Declarative configuration via CRDs
Future Improvements
Secret infrastructure doesn't match the rest of the system.
The cicustom snippet approach works well for a v1 and is reasonably secure. But it doesn't match the rest of my setup, which is declarative and reproducible end to end. I have a YAML snippet at /var/lib/vz/snippets/ that needs to be manually copied to each Proxmox node in the cluster. If I add a new Proxmox host or rebuild one, I have to remember to copy the snippet over. This feels out of place in a system where everything else is declarative. Ideally the k3s token would be stored in a Kubernetes Secret and injected at boot time, but that creates a chicken-and-egg problem - the node needs the token to join the cluster that stores the token. There's probably a solution involving an init service that fetches from an external secret store, but I haven't built it yet.
Bin packing pools.
Right now the scale-up logic is simple: pod can't schedule → create node big enough for pod. This works but isn't efficient. If I have three 2-core pods pending, I create three 2-core nodes instead of one 8-core node that could run all three. I want to add "bucket" pools with predefined sizes (small: 4c/8GB, medium: 8c/16GB, large: 16c/32GB) and bin-pack pods into the smallest bucket that fits. Karpenter does this with consolidation, and it makes a big difference for cost. For my homelab it's less critical since I'm not paying per-hour, but it would reduce the VM sprawl. I'm just not convinced it's worth the added complexity yet.
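If I do build it, the selection logic itself is small - something like picking the smallest predefined bucket that fits the aggregate pending request. A hypothetical sketch; none of this is implemented.

// Hypothetical bucket sizes and selection; not implemented yet.
type bucket struct {
    Name      string
    CPUCores  int64
    MemoryMiB int64
}

var buckets = []bucket{ // kept sorted smallest to largest
    {"small", 4, 8192},
    {"medium", 8, 16384},
    {"large", 16, 32768},
}

// smallestFit returns the smallest bucket that can hold the aggregate pending
// request; the bool is false if even the largest bucket is too small, in which
// case the request would have to be split across multiple nodes.
func smallestFit(cpuCores, memoryMiB int64) (bucket, bool) {
    for _, b := range buckets {
        if b.CPUCores >= cpuCores && b.MemoryMiB >= memoryMiB {
            return b, true
        }
    }
    return bucket{}, false
}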
Redfish provider for on-demand bare metal.
I have some older servers - Dell T630s and similar - that are still useful but don't need to be online all the time. The problem is thermal: if I keep them all running 24/7, the room gets too hot. I've already upgraded the ventilation, but keeping another server or two online permanently is more heat than the server room can handle.
The solution is a Redfish provider. Redfish is an industry-standard API for managing servers - power on/off, BIOS configuration, virtual media mounting, all over HTTP. Most modern server BMCs support it. The provider interface is intentionally minimal (just three methods), so adding Redfish support would be straightforward. When a workload needs extra capacity, KubeNodeSmith would power on one of the idle servers via Redfish, wait for it to netboot and join the cluster, schedule the pods, then power it back off when it's idle. The server only runs when needed.
This would make KubeNodeSmith a bare-metal/hypervisor hybrid autoscaler - something like Metal3's bare metal provisioning with Karpenter's flexible, declarative model. You could configure pools with different providers based on workload requirements. Need quick ephemeral capacity? Use Proxmox VMs. Need a large batch job that can tolerate a 2-3 minute boot time? Power on a physical server with more cores and RAM.
I haven't implemented this yet, but the architecture is already there. The Proxmox provider is about 300 lines, and a Redfish provider would be similar - implement ProvisionMachine (power on and set the PXE boot order), DeprovisionMachine (power off), and ListMachines (report which managed machines exist and their power state). The hard part is getting the netboot image and cluster join token to the machine, but since I've solved that for VMs, bare metal should work the same way.
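For reference, the power-on half of that is a single standards-defined Redfish call - roughly the sketch below, with the BMC address, system ID, and credential handling all simplified assumptions.

// Sketch: power on a server via the standard Redfish ComputerSystem.Reset action.
// A real provider would also set the boot override to PXE and deal with the
// self-signed certificates most BMCs ship with.
import (
    "context"
    "fmt"
    "net/http"
    "strings"
)

func powerOn(ctx context.Context, bmcAddr, systemID, user, pass string) error {
    url := fmt.Sprintf("https://%s/redfish/v1/Systems/%s/Actions/ComputerSystem.Reset", bmcAddr, systemID)
    req, err := http.NewRequestWithContext(ctx, http.MethodPost, url, strings.NewReader(`{"ResetType": "On"}`))
    if err != nil {
        return err
    }
    req.SetBasicAuth(user, pass)
    req.Header.Set("Content-Type", "application/json")

    resp, err := http.DefaultClient.Do(req)
    if err != nil {
        return err
    }
    defer resp.Body.Close()
    if resp.StatusCode >= 300 {
        return fmt.Errorf("redfish reset: %s", resp.Status)
    }
    return nil
}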
What I Learned
The NodeClaim model from Karpenter is brilliant. Separating "I want a node" (the claim) from "here's how to make nodes" (the pool) and "here's where to get machines" (the provider) makes the system composable. I can have multiple pools using the same provider, or switch providers without changing pools.
PXE boot combined with NixOS flakes gives me the reproducibility I wanted. No template maintenance, no image versioning, no "which Ubuntu AMI am I running again?" If I need to rebuild a system from three months ago, the flake lock ensures I get the exact same packages. That's impossible with traditional image snapshots where pulling new packages during the build changes everything.
Custom resources with controller-runtime handle edge cases I didn't anticipate. Controller crashes? It resumes where it left off. Network partition? It reconciles when connectivity returns. Accidental claim deletion? The finalizer prevents VM orphaning.
The hardest part wasn't the Kubernetes API or the provider integration - it was getting the timing right. Too aggressive on scale-up and you waste resources. Too conservative and pods sit pending. The stabilization windows help but there's still tuning involved.
With the right configuration, the cluster feels dynamic. Deploy a big workload and watch nodes appear. Delete the workload and watch them disappear. Exactly what I wanted when I started this project.