
This post assumes Kubernetes 1.27+ and the autoscaling/v2 API. It targets senior ICs and platform engineers who operate autoscaling systems in production.




Prerequisites: Setting Up Your Lab Cluster

Before diving in, spin up a local kind cluster with metrics-server pre-configured. All exercises in this guide assume this setup.

# Install kind if you haven't already
brew install kind # macOS
# or: curl -Lo ./kind https://kind.sigs.k8s.io/dl/v0.22.0/kind-linux-amd64 && chmod +x kind && mv kind /usr/local/bin/
# Create a 3-node cluster (1 control-plane + 2 workers)
cat <<EOF | kind create cluster --name autoscaling-lab --config=-
kind: Cluster
apiVersion: kind.x-k8s.io/v1alpha4
nodes:
- role: control-plane
- role: worker
- role: worker
EOF
# Install metrics-server (kind doesn't ship it)
kubectl apply -f https://github.com/kubernetes-sigs/metrics-server/releases/latest/download/components.yaml
# Patch metrics-server to work without TLS verification (required in kind)
kubectl patch deployment metrics-server -n kube-system --type='json' \
-p='[{"op":"add","path":"/spec/template/spec/containers/0/args/-","value":"--kubelet-insecure-tls"}]'
# Wait for metrics-server to be ready
kubectl rollout status deployment/metrics-server -n kube-system --timeout=60s
# Verify it's working (may take ~30s after rollout)
kubectl top nodes

Autoscaling Is a Multi-Loop System

Before diving into HPA and VPA internals, it is worth establishing the full system. Kubernetes autoscaling is not one controller — it is four independent control loops operating on different timescales and different variables:

| Loop | What it controls | Timescale |
|---|---|---|
| HPA | Replica count | Seconds to minutes |
| VPA | Per-pod resource requests | Minutes to hours |
| Cluster Autoscaler | Node count | Minutes |
| Scheduler | Pod placement | Milliseconds |

Most production autoscaling incidents do not occur because a single loop misbehaved. They occur because two loops reacted to the same signal on different timescales — HPA scaling out while VPA evicts, CA provisioning for a transient condition, the scheduler unable to place pods while CA is still bootstrapping. Understanding each loop in isolation is necessary but not sufficient. This post focuses on HPA and VPA, but always with awareness of how they interact with the broader system.


The Problem Space

Autoscaling is not “automatic scaling”; it is approximate control under delayed, noisy signals. Within this post's scope, that means two independent controllers, HPA and VPA, manipulating different variables with incomplete information and non-zero lag. They operate on fundamentally different axes, use different control models, and interact with each other in ways that will cause production incidents if misunderstood. The goal of this post is to build the internal mental model needed to tune and debug them without flying blind.

🧠 Mental Model: Autoscaling is Approximation

Autoscalers operate on metrics that are sampled, aggregated, and delayed. They apply changes that take tens of seconds to minutes to materialize. Perfect elastic scaling is not achievable — only bounded approximation is. The engineering goal is not to eliminate the gap between supply and demand, but to constrain how large that gap can grow and how long it can persist.


Horizontal Pod Autoscaler (HPA)

The Control Loop

HPA is a classic reconciliation controller running in kube-controller-manager. Every 15 seconds (configurable via --horizontal-pod-autoscaler-sync-period), it wakes up, samples metrics, computes a desired replica count, and patches the target’s spec.replicas.

HPA as a Delayed, Saturating P-Controller

HPA is not just a proportional controller — it is a delayed, rate-limited, saturating P-controller operating on a lagging signal. It reacts to the instantaneous ratio between observed and desired metric values with no integral or derivative terms. This framing matters because it predicts failure modes precisely:

  • No integral term: Steady-state error persists. If your metric target is set too high, HPA will converge to a replica count that satisfies the ratio on paper but still leaves the service under-provisioned relative to actual demand.
  • No derivative term: HPA cannot anticipate spikes. It has no model of metric velocity or acceleration — only current deviation from target.
  • High phase lag: The 75–135 second reaction chain means HPA is always responding to load conditions that no longer exist at the moment the new pods are ready.
  • Hard saturation: minReplicas/maxReplicas and scaling policies create non-linear saturation effects. At saturation boundaries, the proportional response is simply clipped.

This combination makes HPA inherently prone to limit cycles under bursty load: it oscillates between under-provisioned and recovering states because it cannot hold position at steady state under noisy input. Stabilization windows exist as bolt-on hysteresis mechanisms rather than intrinsic damping — they reduce oscillation frequency but do not eliminate the underlying phase lag.

🧠 Mental Model: HPA Buys Time, Not Capacity

HPA does not handle spikes — it reacts after the spike has already started. Your system must survive the first 75–135 seconds without any additional pods. Conservative CPU targets (50–65%), generous minReplicas, and pre-warmed capacity buffers are not timidity — they are the engineering response to a controller with 90+ second phase lag.

The End-to-End Reaction Time

A critical mental model that most teams lack is a quantified timing chain. When traffic spikes, the time before new pods are actually serving requests is approximately:

Total Reaction Time ≈
metric scrape interval (~15s for metrics-server)
+ metrics aggregation lag (~15s)
+ HPA sync period (~15s)
+ pod startup time (20–60s depending on image and init)
+ readiness probe delay (10–30s)
─────────────────────────────────────
Realistic range: 75–135 seconds

This means that under a sharp traffic spike, your service absorbs load for over a minute before a single additional pod is ready. Setting CPU targets at 80–90% leaves no headroom for that window. Conservative targets (50–65%) exist precisely to buy time for this pipeline to execute.
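
The arithmetic behind that range can be sketched in a few lines of shell. This is only an illustration: the function name reaction_time_range is hypothetical, and the stage values are the estimates from the chain above, not measurements from your cluster.

```shell
# Sum best-case and worst-case stage latencies (seconds) from the timing chain.
# Assumption: scrape, aggregation, and HPA sync are ~15s each, as estimated above.
reaction_time_range() {
  fixed=$((15 + 15 + 15))
  startup_min=$1; startup_max=$2   # pod startup range
  ready_min=$3;   ready_max=$4     # readiness probe delay range
  echo "$((fixed + startup_min + ready_min))-$((fixed + startup_max + ready_max))"
}

reaction_time_range 20 60 10 30   # prints 75-135
```

Plugging in your own image pull and readiness numbers gives a first estimate of how long the existing replicas must carry a spike alone.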


🧪 Exercise 1: Observe the HPA Reaction Time Pipeline

Deploy a simple CPU-bound workload and watch the timing chain in action.

Step 1: Deploy the target workload

kubectl create deployment php-apache \
--image=registry.k8s.io/hpa-example \
--port=80
kubectl set resources deployment php-apache \
--requests=cpu=200m,memory=64Mi \
--limits=cpu=500m,memory=128Mi
kubectl expose deployment php-apache --port=80 --name=php-apache

Step 2: Create an HPA targeting 50% CPU

kubectl autoscale deployment php-apache \
--cpu-percent=50 \
--min=1 --max=10
# Watch the HPA state in one terminal
kubectl get hpa php-apache --watch

Step 3: Generate load and timestamp the spike

# In a second terminal: record the exact time and start load
echo "Load started at: $(date +%T)"
kubectl run -i --tty load-generator --rm \
--image=busybox:1.28 --restart=Never -- \
/bin/sh -c "while sleep 0.01; do wget -q -O- http://php-apache; done"

Step 4: Measure the delay

# In a third terminal, stream events for php-apache as they arrive
# (--sort-by is ignored when combined with --watch, so it is omitted)
kubectl get events --field-selector involvedObject.name=php-apache --watch

What to observe: Note the timestamp when you started the load vs. when the first SuccessfulRescale event appears. You should see roughly 45–90 seconds of lag. Compare the gap against the timing chain formula above.

Expected signal shape if you were graphing this:

  • CPU utilization: sharp spike within 15s of load start
  • Replica count: flat for 45–90s, then a step increase
  • Phase lag between CPU spike and replica step is the controller’s entire reaction pipeline made visible
  • After scaling, CPU drops as load spreads across new pods — but there is typically a secondary spike as readiness probes pass and traffic routing catches up

Stop the load: Ctrl+C in the load-generator terminal. The pod will self-delete (it was --rm).


The Scaling Algorithm

The core formula:

desiredReplicas = ceil[currentReplicas × (currentMetricValue / desiredMetricValue)]

Several stabilizing mechanisms layer on top:

Stabilization windows prevent oscillation. Because HPA lacks intrinsic damping, stabilization windows act as an external hysteresis mechanism. The controller maintains a rolling window of past recommendations. For scale-down, it selects the maximum recommendation seen during the window (default: 300s), preventing premature scale-in. For scale-up, it selects the minimum recommendation (default: 0s — acts immediately). This asymmetry is intentional: be aggressive about adding capacity, conservative about removing it. CA mirrors this philosophy at the node level — its scale-down is even more conservative, with a default 10-minute idle delay before a node is considered for removal. Both loops are deliberately slow to release capacity.
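
As a rough sketch of the window selection described above (not the controller's actual code), the following hypothetical helper picks the stabilized value from a list of recent recommendations: the maximum for scale-down, the minimum for scale-up. The real controller also clamps against the current replica count, which this sketch leaves out.

```shell
# Pick the stabilized recommendation from a window of past recommendations.
# $1 = direction ("down" keeps the max, "up" keeps the min); rest = recommendations.
stabilized_recommendation() {
  direction=$1; shift
  result=$1
  for rec in "$@"; do
    if [ "$direction" = "down" ] && [ "$rec" -gt "$result" ]; then result=$rec; fi
    if [ "$direction" = "up" ] && [ "$rec" -lt "$result" ]; then result=$rec; fi
  done
  echo "$result"
}

# Recommendations recorded over the last 300s while load was falling:
stabilized_recommendation down 8 6 4 3   # prints 8
```

Until the recommendation of 8 ages out of the window, scale-down stays pinned at 8 replicas even though the latest recommendation is 3.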

Tolerance (default 0.1 = 10%) means HPA won’t act if currentValue is within 10% of targetValue, preventing constant micro-adjustments under noisy metrics.
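
To make the formula and the tolerance band concrete, here is a minimal sketch using integer arithmetic. The function name hpa_desired_replicas is hypothetical, and the sketch assumes integer metric values and the default 10% tolerance.

```shell
# desiredReplicas = ceil(currentReplicas * currentMetricValue / desiredMetricValue),
# skipped when the ratio is within the default 10% tolerance.
hpa_desired_replicas() {
  current_replicas=$1; current_value=$2; target_value=$3
  diff=$((current_value - target_value))
  if [ "$diff" -lt 0 ]; then diff=$((-diff)); fi
  # |current/target - 1| <= 0.1  is equivalent to  |current - target| * 10 <= target
  if [ $((diff * 10)) -le "$target_value" ]; then
    echo "$current_replicas"
    return
  fi
  # Integer ceiling division
  echo $(( (current_replicas * current_value + target_value - 1) / target_value ))
}

hpa_desired_replicas 4 90 50   # prints 8 (ceil(4 * 1.8) = ceil(7.2))
hpa_desired_replicas 4 53 50   # prints 4 (within tolerance, no change)
```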

Missing pod handling: For pods that have no metrics (not yet Running, or mid-startup), HPA recomputes the average conservatively so that missing data dampens the scaling decision instead of amplifying it. When the initial calculation points toward scale-up, it assumes those pods are consuming 0% of the target value. When it points toward scale-down, it assumes they are consuming 100% of the target value. If the recomputed ratio reverses the scaling direction or falls within tolerance, HPA takes no action.


🧪 Exercise 2: Verify the Stabilization Window During Scale-Down

This exercise makes the 300-second scale-down stabilization window visible. You’ll drive scale-up, stop the load, and watch HPA refuse to scale down immediately.

Setup (continuing from Exercise 1, or re-run setup):

# Ensure php-apache HPA exists with default behavior
kubectl get hpa php-apache

Generate load until HPA scales out to 3+ replicas:

kubectl run -i --tty load-generator --rm \
--image=busybox:1.28 --restart=Never -- \
/bin/sh -c "while sleep 0.01; do wget -q -O- http://php-apache; done"
# Wait until replicas >= 3
kubectl get hpa php-apache --watch

Stop load and record the time:

# Ctrl+C in load-generator terminal, then:
echo "Load stopped at: $(date +%T)"
kubectl get hpa php-apache --watch

What to observe: After load stops, CPU will drop immediately, but HPA will hold replica count for ~5 minutes before scaling down. This is the 300-second scaleDown stabilization window in action.

Shortcut the wait — override the stabilization window:

kubectl patch hpa php-apache --type='merge' -p='
spec:
  behavior:
    scaleDown:
      stabilizationWindowSeconds: 30
      policies:
      - type: Percent
        value: 100
        periodSeconds: 30'
# Now watch scale-down happen much faster
kubectl get hpa php-apache --watch

Key insight: The default 300s window exists to prevent flapping. Override it with care — a too-aggressive scale-down policy can cause oscillation under bursty traffic.


Multi-Metric Behavior

computeReplicasForMetrics in pkg/controller/podautoscaler/ iterates over all configured metrics and takes the maximum desired replica count; metrics are not averaged. Consider a service currently running 5 replicas, configured with both CPU and RPS targets:

| Metric | Current | Target | Desired Replicas |
|---|---|---|---|
| CPU | 70% | 50% | 7 |
| RPS | 800 | 400 | 10 |

HPA sets replicas = 10, driven by RPS. This is mathematically correct but operationally dangerous when one metric is noisy or misconfigured — a spurious spike in any single metric drives the entire replica count up. Monitor individual metric recommendations, not just the resulting replica count.
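
The max-over-metrics rule can be sketched as follows. This is an illustration, not the controller's code: the helper names are hypothetical, each metric is passed as a current:target pair of integers, and 5 current replicas are assumed.

```shell
# ceil(replicas * current / target) using integer arithmetic
desired_for_metric() { echo $(( ($1 * $2 + $3 - 1) / $3 )); }

# Take the largest per-metric proposal. Each argument after the replica
# count is a "current:target" pair.
hpa_multi_metric() {
  replicas=$1; shift
  max=0
  for pair in "$@"; do
    current=${pair%%:*}; target=${pair##*:}
    want=$(desired_for_metric "$replicas" "$current" "$target")
    if [ "$want" -gt "$max" ]; then max=$want; fi
  done
  echo "$max"
}

hpa_multi_metric 5 70:50 800:400   # prints 10 (CPU proposes 7, RPS proposes 10)
```

Logging each per-metric proposal, not only the winning maximum, is what makes a noisy metric visible before it drives the replica count.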

CPU vs. External Metrics: An Explicit Tradeoff

The choice of HPA signal is one of the highest-leverage tuning decisions you make. Most teams default to CPU because it requires no additional pipeline — but that convenience has a cost:

| Dimension | CPU | RPS / Queue Depth |
|---|---|---|
| Signal freshness | ❌ Lagging (scrape + aggregation + sync = 45s+) | ✅ Near-real-time |
| Infra independence | ✅ Always available | ❌ Requires metrics pipeline (Prometheus Adapter, KEDA) |
| VPA coupling risk | ❌ High: VPA changes requests, distorts utilization ratio | ✅ None: orthogonal signal |
| Throttling blind spot | ❌ Throttled CPUs appear underloaded | ✅ Not affected |
| Stability | ✅ High: noisy workloads still converge | ⚠️ Lower: noisy metrics drive unnecessary scale events |
| Failure mode | Under-scaling (HPA reacts too late) | Over-scaling (transient metric spikes) |

CPU is safer to configure but slower to react and couples badly with VPA. RPS is faster and decoupled, but requires a functioning metrics pipeline and careful target setting. Production systems often blend both — RPS as the primary signal with a CPU ceiling to catch cases where the metrics pipeline has a gap.

🧠 Mental Model: VPA is a Batch System Disguised as Real-Time

VPA reacts on a timescale of minutes to hours, applies changes via pod restarts, and builds recommendations from historical data. Treat it as an offline optimizer that runs continuously in the background — not a real-time controller. Its job is to right-size pods between load cycles, not to respond to them.

HPA v2 Scaling Policies

A commonly overlooked feature of autoscaling/v2 is scaling rate policies. These cap how fast replica counts can change, and in practice they are more important than stabilization windows for protecting downstream systems from traffic amplification during burst scale-out:

behavior:
  scaleUp:
    policies:
    - type: Percent
      value: 100 # at most double replicas per period
      periodSeconds: 60
    - type: Pods
      value: 4 # or add at most 4 pods per period
      periodSeconds: 60
    selectPolicy: Min # use whichever is more conservative
  scaleDown:
    policies:
    - type: Percent
      value: 10
      periodSeconds: 120

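Even without explicit policies, HPA is not fully unconstrained: the default scale-up behavior allows adding the larger of 4 pods or 100% of current replicas every 15 seconds. That still lets a sudden load spike take a workload from 3 to roughly 50 replicas in about a minute (3 → 7 → 14 → 28 → 50). Rate-limiting scale-out more tightly smooths the curve and gives downstream dependencies time to adapt.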
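
A small sketch shows how a Percent policy and a Pods policy combine under selectPolicy. The function scale_up_limit is hypothetical and simplified: it computes the ceiling for one period from the replica count at the start of that period and ignores how the controller tracks scale events within a period.

```shell
# Maximum replica count allowed after one period, given a Percent policy,
# a Pods policy, and selectPolicy (Min or Max).
scale_up_limit() {
  current=$1; percent=$2; pods=$3; select_policy=$4
  by_percent=$(( current + (current * percent + 99) / 100 ))  # rounds up
  by_pods=$(( current + pods ))
  if [ "$select_policy" = "Min" ]; then
    if [ "$by_percent" -lt "$by_pods" ]; then echo "$by_percent"; else echo "$by_pods"; fi
  else
    if [ "$by_percent" -gt "$by_pods" ]; then echo "$by_percent"; else echo "$by_pods"; fi
  fi
}

scale_up_limit 3 100 4 Min    # prints 6  (min of 3+3 and 3+4)
scale_up_limit 10 100 4 Min   # prints 14 (min of 10+10 and 10+4)
```

With selectPolicy: Min, the Percent rule dominates at small replica counts and the Pods rule takes over as the workload grows, which keeps large deployments from doubling in one step.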


🧪 Exercise 3: Observe Unconstrained vs. Rate-Limited Scale-Out

This exercise demonstrates why scaling rate policies matter. You’ll compare the replica jump with and without a Pods rate cap.

Step 1: Deploy a fresh workload with low CPU requests (makes it easy to saturate)

kubectl create deployment rate-test \
--image=registry.k8s.io/hpa-example --port=80
kubectl set resources deployment rate-test \
--requests=cpu=50m --limits=cpu=100m
kubectl expose deployment rate-test --port=80
kubectl scale deployment rate-test --replicas=2

Step 2: Create an HPA without rate policies

cat <<EOF | kubectl apply -f -
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: rate-test
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: rate-test
  minReplicas: 2
  maxReplicas: 20
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 30
EOF

Step 3: Blast it with load and watch the replica jump

# Run 5 parallel load generators
for i in {1..5}; do
kubectl run load-$i --image=busybox:1.28 --restart=Never -- \
/bin/sh -c "while true; do wget -q -O- http://rate-test; done" &
done
# Watch replicas — note the size of the jump
kubectl get hpa rate-test --watch

Step 4: Kill load, reset, and add a rate policy

# Delete the five load generator pods by name
for i in {1..5}; do kubectl delete pod load-$i --ignore-not-found; done
kubectl scale deployment rate-test --replicas=2
# Now patch the HPA with a rate cap
kubectl patch hpa rate-test --type='merge' -p='
spec:
  behavior:
    scaleUp:
      policies:
      - type: Pods
        value: 2
        periodSeconds: 30
      selectPolicy: Min'
# Re-run the same load burst
for i in {1..5}; do
kubectl run load-$i --image=busybox:1.28 --restart=Never -- \
/bin/sh -c "while true; do wget -q -O- http://rate-test; done" &
done
kubectl get hpa rate-test --watch

What to observe: With only the default behavior, replicas can climb quickly (for example 2→6→12 over two 15-second periods). With the 2-pods-per-30s cap, the scale-out is gradual. Neither is always “better”; this illustrates the tradeoff between responsiveness and stability.

Cleanup:

for i in {1..5}; do kubectl delete pod load-$i --ignore-not-found; done
kubectl delete deployment rate-test
kubectl delete hpa rate-test
kubectl delete svc rate-test

The CPU Request Coupling Problem (Why VPA Breaks CPU HPA)

This is the most architecturally significant HPA pitfall that teams consistently miss. CPU utilization in HPA is computed relative to the pod’s requested CPU, not actual node capacity:

cpuUtilization = currentCPUUsage / requestedCPU

This creates direct coupling between resource requests and scaling behavior. If you over-request CPU (e.g., requests: 2000m for a service that realistically uses 400m), computed utilization is suppressed — HPA sees a low percentage and refuses to scale out even under genuine load. Conversely, under-requesting CPU inflates utilization and causes premature scale-out.

This is why VPA and HPA must be used together carefully: VPA continuously adjusts requests, which directly shifts HPA’s utilization baseline. Run them on separate metrics or you get a feedback loop.
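
The coupling is easy to see with numbers. The helper below is a hypothetical illustration that divides usage by request, both in millicores, and truncates to a whole percentage.

```shell
# Utilization percentage relative to the request: usage / request (millicores).
cpu_utilization_pct() { echo $(( $1 * 100 / $2 )); }

cpu_utilization_pct 400 2000   # prints 20  (over-requested: looks idle)
cpu_utilization_pct 400 400    # prints 100 (request matches real usage)
```

The container burns the same 400m in both cases. Only the denominator changed, and with a 50% target the first case never scales while the second scales immediately.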


🧪 Exercise 4: Demonstrate CPU Request Coupling

This exercise shows how the same real CPU usage produces different HPA behavior depending on the resource request value.

Step 1: Deploy with a very high CPU request (simulates over-provisioning)

kubectl create deployment coupling-test \
--image=registry.k8s.io/hpa-example --port=80
# Set a deliberately inflated CPU request
kubectl set resources deployment coupling-test \
--requests=cpu=1000m --limits=cpu=2000m
kubectl expose deployment coupling-test --port=80
cat <<EOF | kubectl apply -f -
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: coupling-test
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: coupling-test
  minReplicas: 1
  maxReplicas: 10
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 50
EOF

Step 2: Generate load and observe the HPA metric value

kubectl run load-test --image=busybox:1.28 --restart=Never -- \
/bin/sh -c "while true; do wget -q -O- http://coupling-test; done"
# Watch — the CPU utilization % will be much lower than real usage
# because it's divided by the 1000m request
kubectl get hpa coupling-test --watch
# In another terminal:
kubectl top pods -l app=coupling-test

Step 3: Reset with a realistic request and observe the difference

kubectl delete pod load-test --ignore-not-found
kubectl scale deployment coupling-test --replicas=1
# Now set a realistic (low) CPU request
kubectl set resources deployment coupling-test \
--requests=cpu=100m --limits=cpu=500m
# set resources changes the pod template and triggers a rollout; wait for it to finish
kubectl rollout status deployment coupling-test
# Re-run load
kubectl run load-test --image=busybox:1.28 --restart=Never -- \
/bin/sh -c "while true; do wget -q -O- http://coupling-test; done"
kubectl get hpa coupling-test --watch
kubectl top pods -l app=coupling-test

What to observe: The same real CPU usage produces a dramatically different utilization percentage depending on the request. With 1000m request, HPA may show 15-20% and not scale. With 100m request, the same workload shows 150-200%+ and triggers aggressive scale-out. This is exactly the feedback loop that emerges when VPA adjusts requests while CPU-based HPA is running.

Cleanup:

kubectl delete pod load-test --ignore-not-found
kubectl delete deployment coupling-test
kubectl delete hpa coupling-test
kubectl delete svc coupling-test

Metrics Pipeline

HPA talks to one of three metrics APIs:

  • metrics.k8s.io — Resource metrics (CPU/memory) served by metrics-server, which scrapes kubelet’s Summary API at ~15s resolution. End-to-end metric freshness (scrape + aggregation + HPA sync cycle) still introduces meaningful lag of 30–60s under normal conditions.
  • custom.metrics.k8s.io — Arbitrary per-object metrics. Backed by adapters like Prometheus Adapter or Datadog Cluster Agent.
  • external.metrics.k8s.io — Metrics external to the cluster (queue depths, SQS, etc).

The latency consequence: CPU-based HPA reacts to load that has already materialized. For latency-sensitive services, augment with external or custom metrics that reflect current load (active connections, queue depth, RPS) rather than CPU, which lags by the full pipeline round-trip.

Scale-to-Zero

minReplicas defaults to 1. Setting it to 0 in autoscaling/v2 requires the alpha HPAScaleToZero feature gate and at least one Object or External metric. CPU-based scaling cannot recover from zero because there are no pods to report metrics, so scale-to-zero is only viable with external or object metrics, where the metric source exists independently of pod count. In practice, KEDA is the standard solution, as it handles the zero-to-one activation needed to bridge the cold-start gap.


Vertical Pod Autoscaler (VPA)

Architecture: Three Separate Components

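Unlike HPA (a single controller loop), VPA is split into three distinct processes with distinct responsibilities and failure modes:

  • Recommender: builds per-container usage histograms and writes recommendations to the VPA object's status.
  • Updater: evicts pods whose current requests fall outside the recommended bounds.
  • Admission Controller: a mutating webhook that injects the recommended requests into pods at creation time.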


🧪 Exercise 5: Install VPA and Observe Recommendations

Install the VPA components and run it in Off mode first — as a pure recommendation engine, with no evictions. This is the safest first step for any production environment.

Step 1: Install VPA from the official repo

git clone https://github.com/kubernetes/autoscaler.git /tmp/autoscaler
cd /tmp/autoscaler/vertical-pod-autoscaler
# Install CRDs and components
./hack/vpa-up.sh
# Verify all 3 components are running
kubectl get pods -n kube-system | grep vpa
# Expect: vpa-admission-controller, vpa-recommender, vpa-updater

Step 2: Deploy a workload to monitor

cat <<EOF | kubectl apply -f -
apiVersion: apps/v1
kind: Deployment
metadata:
  name: hamster
spec:
  replicas: 2
  selector:
    matchLabels:
      app: hamster
  template:
    metadata:
      labels:
        app: hamster
    spec:
      containers:
      - name: hamster
        image: registry.k8s.io/ubuntu-slim:0.14
        resources:
          requests:
            cpu: 100m
            memory: 50Mi
        command: ["/bin/sh"]
        args:
        - "-c"
        - "while true; do timeout 0.5s yes >/dev/null; sleep 0.5s; done"
EOF

Step 3: Create a VPA in Off mode (recommendation only)

cat <<EOF | kubectl apply -f -
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: hamster-vpa
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: hamster
  updatePolicy:
    updateMode: "Off"
  resourcePolicy:
    containerPolicies:
    - containerName: hamster
      minAllowed:
        cpu: 50m
        memory: 50Mi
      maxAllowed:
        cpu: 2
        memory: 1Gi
EOF

Step 4: Wait ~5 minutes for recommendations to populate, then inspect

# Poll until recommendations appear
kubectl get vpa hamster-vpa --watch
# When RECOMMENDED shows values, describe for full detail
kubectl describe vpa hamster-vpa

What to observe: The status.recommendation.containerRecommendations section shows lowerBound, target, and upperBound for both CPU and memory. Compare these against your manifest’s requests. The gap is your rightsizing debt.

Key things to note:

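  • The memory recommendation is likely much higher than 50Mi (processes have real overhead, and the Recommender applies a minimum memory recommendation, 250Mi by default via --pod-recommendation-min-memory-mb)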
  • The CPU recommendation may differ significantly from 100m
  • These are updated continuously as the workload runs

The Recommender: Statistical Core

The Recommender maintains an in-memory histogram of CPU and memory usage per container, modeled as a decay-weighted percentile estimator. Older samples are down-weighted exponentially, giving more influence to recent behavior while retaining long-tail signal.
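
Exponential down-weighting can be sketched with a half-life: a sample's weight halves every fixed interval. The function name sample_weight is hypothetical, and the 24-hour half-life in the examples is an illustrative assumption; check your Recommender's flags for the value in effect.

```shell
# Weight of a sample that is age_hours old, halving every half_life_hours.
sample_weight() {
  awk -v age="$1" -v half="$2" 'BEGIN { printf "%.3f\n", exp(-log(2) * age / half) }'
}

sample_weight 0 24    # prints 1.000
sample_weight 24 24   # prints 0.500
sample_weight 48 24   # prints 0.250
```

A week-old peak still contributes under this scheme, just with under 1% of the weight of a fresh sample, which is how long-tail signal survives without dominating.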

The histogram uses exponential bucket boundaries: each bucket is ~5% wider than the previous one (the Recommender's default growth ratio), enabling compact representation across orders of magnitude of resource values.
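
To see why geometrically growing buckets stay compact, here is a sketch of where bucket n starts when every bucket is a fixed fraction wider than the one before. The function bucket_start is hypothetical, and the examples use a deliberately exaggerated growth of 100% (each bucket twice as wide) so the arithmetic is easy to check by hand.

```shell
# Start of bucket n when bucket 0 has width `first` and every bucket is
# `growth` (a fraction) wider than the one before: first * (r^n - 1) / (r - 1).
bucket_start() {
  awk -v first="$1" -v growth="$2" -v n="$3" 'BEGIN {
    r = 1 + growth
    printf "%.2f\n", first * (r ^ n - 1) / (r - 1)
  }'
}

bucket_start 1 1.0 0   # prints 0.00
bucket_start 1 1.0 3   # prints 7.00 (buckets of width 1, 2, 4 precede it)
```

Because the start grows geometrically with n, a modest number of buckets covers values from a few millicores to many cores.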

Two important asymmetries in how CPU and memory are modeled:

Memory uses peak samples, not averages. Since memory is not compressible (a process that allocates 2GB cannot be throttled down to 1GB without an OOMKill), the Recommender intentionally biases toward observed peaks rather than typical usage. This makes memory recommendations more conservative than CPU by design.

CPU recommendations smooth over bursts. CPU is compressible — throttling slows a process but doesn’t kill it. The recommender uses a smoother model for CPU, accepting that brief spikes will be throttled rather than sizing for them. However, this creates a blind spot: if CPU limits are enforced aggressively, throttling suppresses the observed usage signal, making VPA’s histogram reflect artificially low CPU consumption. The Recommender cannot distinguish “this container uses 200m” from “this container is throttled at 200m.” If you see VPA recommending low CPU while your application has high p99 latency, check container_cpu_throttled_seconds_total before trusting the recommendation.

Key estimation parameters:

  • Target percentile: by default, CPU is recommended at p90 of observed usage and memory at p90 of observed peak samples; lowerBound and upperBound use p50 and p95 respectively. All are configurable.
  • Safety margin: +15% added on top of the percentile estimate (configurable via --recommendation-margin-fraction).
  • Confidence: For containers with sparse samples, confidence intervals widen and recommendations inflate conservatively.
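
The safety margin step is simple enough to sketch directly. The helper with_margin is hypothetical; it takes a percentile estimate (millicores or bytes) and a margin fraction, and rounds to the nearest whole unit. Confidence scaling and minimum floors are left out.

```shell
# Apply the safety margin to a percentile estimate and round to the nearest unit.
with_margin() {
  awk -v est="$1" -v margin="$2" 'BEGIN { printf "%d\n", est * (1 + margin) + 0.5 }'
}

with_margin 200 0.15   # prints 230 (a 200m p90 estimate becomes a 230m target)
```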

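Critically, the Recommender produces three values written to VPA.status.recommendation:

  • lowerBound: the minimum request, below which the container is considered under-provisioned.
  • target: the recommended request, and the value the Admission Controller injects.
  • upperBound: the maximum request, above which the container is considered over-provisioned.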

The Updater only evicts a pod if its current requests fall outside the [lowerBound, upperBound] range — not every time the target shifts. This prevents constant churn under normal variance.
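
The bounds check can be sketched as a simple range test. The function outside_bounds is hypothetical and covers only the range comparison described above. The real Updater applies further conditions before evicting.

```shell
# Succeeds (exit 0) when the current request lies outside [lower, upper].
outside_bounds() {
  current=$1; lower=$2; upper=$3
  [ "$current" -lt "$lower" ] || [ "$current" -gt "$upper" ]
}

if outside_bounds 100 150 600; then echo "evict"; else echo "keep"; fi   # prints evict
if outside_bounds 300 150 600; then echo "evict"; else echo "keep"; fi   # prints keep
```

A pod requesting 300m stays put even if the target drifts from 280m to 320m, because both values sit inside the same bounds.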

The Updater: The Disruptive Actor

The Updater runs every 1 minute. If a pod’s current requests are outside the recommended bounds, the Updater evicts it. The pod is recreated by its owning controller, and the Admission Webhook intercepts that new pod creation to inject the updated requests.

Two important constraints on Updater behavior:

PodDisruptionBudgets are respected. If a PDB is too strict, or the workload is running at minimum replicas, VPA will refuse to evict and silently do nothing. Teams often discover this when VPA appears “stuck” — recommendations update in .status but pods never change. Check PDB disruptions allowed if VPA seems inert.

It requires pod restarts. In-Place Pod Vertical Scaling (KEP-1287) is beta in recent Kubernetes releases but requires feature gates and has provider-specific support constraints. Do not assume it is available without verifying your cluster version and managed Kubernetes provider.

For stateful workloads, control eviction behavior explicitly:

updatePolicy:
  updateMode: "Off" # Recommendations only, never evict
  # updateMode: "Initial" # Inject on creation, never evict running pods
  # updateMode: "Auto" # Full lifecycle management (default)

Starting with Off and using VPA as a recommendation engine is the safest posture for stateful workloads. Apply recommendations via a GitOps pipeline or scheduled maintenance window.


🧪 Exercise 6: Observe VPA Auto Mode and the PDB Blocker

This exercise demonstrates VPA’s Auto mode evicting pods, then shows how a PDB silently blocks it.

Part A: Enable Auto mode and watch the eviction

# Switch hamster VPA to Auto mode
kubectl patch vpa hamster-vpa --type='merge' -p='
spec:
  updatePolicy:
    updateMode: "Auto"'
# Watch for evictions — VPA will evict pods whose requests differ from recommendation
kubectl get pods -l app=hamster --watch &
kubectl get events --field-selector reason=EvictedByVPA --watch &
# After eviction, check the new pod's actual resource requests
# (these are injected by the VPA Admission Controller)
kubectl get pod -l app=hamster -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.containers[0].resources}{"\n"}{end}'

What to observe: The new pods will have different requests than what’s in the Deployment spec. The VPA Admission Controller mutated them at pod creation time. Run kubectl get deployment hamster -o yaml | grep -A5 resources — the Deployment spec is unchanged. This is the “advisory manifest” behavior described in the Admission Controller section.

Part B: Create a PDB that blocks eviction

# First, scale down to 1 replica to make the PDB bite
kubectl scale deployment hamster --replicas=1
# Create a PDB requiring minAvailable=1
cat <<EOF | kubectl apply -f -
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: hamster-pdb
spec:
  minAvailable: 1
  selector:
    matchLabels:
      app: hamster
EOF
# Force VPA to want to evict by temporarily setting a request far outside bounds
kubectl patch deployment hamster --type='json' \
-p='[{"op":"replace","path":"/spec/template/spec/containers/0/resources/requests/cpu","value":"999m"}]'
# Wait a couple minutes, then check — VPA recommendations will show divergence
# but no eviction will occur
kubectl describe vpa hamster-vpa | grep -A20 "Conditions:"
kubectl describe pdb hamster-pdb

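What to observe: The VPA status will show that the pod's requests diverge from the recommendation, but the pod is not evicted. The PDB shows Disruptions Allowed: 0. Note that with a single replica the Updater's own --min-replicas guard (default 2) also prevents eviction, so both safeguards are in play here. This is the “VPA appears stuck” scenario described in the Updater section.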

Cleanup:

kubectl delete pdb hamster-pdb
kubectl scale deployment hamster --replicas=2
kubectl patch vpa hamster-vpa --type='merge' -p='{"spec":{"updatePolicy":{"updateMode":"Off"}}}'

The Admission Controller: The Mutation Point

When a pod creation request reaches the API server, the VPA Admission Controller (MutatingWebhookConfiguration) intercepts it, looks up the VPA object for the pod’s owner, and overwrites resources.requests in the pod spec before it is persisted.

Your Deployment YAML’s resource requests become advisory at runtime — VPA owns the actual values. This is intentional but can surprise teams who expect kubectl get pod -o yaml to match their manifests.


🧪 Exercise 7: Confirm the Admission Webhook Mutation

This is a quick but important exercise to internalize that VPA mutates pods at creation time, making manifests advisory.

Step 1: Inspect the MutatingWebhookConfiguration

kubectl get mutatingwebhookconfigurations | grep vpa
kubectl describe mutatingwebhookconfiguration vpa-webhook-config | grep -A10 "Rules:"

Step 2: Check the current VPA mode

Before restarting pods, confirm which update mode VPA is in:

kubectl get vpa hamster-vpa -o jsonpath='{.spec.updatePolicy.updateMode}'

⚠️ If the mode is Off, the Admission Webhook will not mutate pod requests — the pod spec will match the Deployment manifest exactly. This is expected. You must switch to Initial (Step 3) to observe the mutation.

Step 3: Switch to Initial mode to enable webhook mutation

Initial mode instructs VPA to inject recommendations at pod creation time, but never evict running pods. This is the safest mode to observe mutation without disruption:

kubectl patch vpa hamster-vpa --type='merge' \
-p='{"spec":{"updatePolicy":{"updateMode":"Initial"}}}'
# Confirm the mode change took effect
kubectl get vpa hamster-vpa -o jsonpath='{.spec.updatePolicy.updateMode}'

Step 4: Restart pods and compare requests

# Trigger a rollout so new pods are created (and mutated by the webhook)
kubectl rollout restart deployment hamster
kubectl rollout status deployment hamster
# Compare Deployment spec requests vs. actual pod requests
echo "=== Deployment spec requests ==="
kubectl get deployment hamster \
-o jsonpath='{.spec.template.spec.containers[0].resources}' | python3 -m json.tool
echo "=== Actual pod requests (post-mutation) ==="
kubectl get pods -l app=hamster \
-o jsonpath='{range .items[*]}{.metadata.name}{"\n"}{.spec.containers[0].resources}{"\n\n"}{end}'

What to observe: The pod’s actual CPU and memory requests should now differ from the Deployment manifest — they reflect VPA’s recommendation values injected by the Admission Webhook at pod creation time. The Deployment spec itself is unchanged; VPA only mutates the live pod spec.

💡 If requests still match after switching to Initial mode, VPA may not have built up enough sample history yet to generate a recommendation. Wait 2–3 minutes and check: kubectl describe vpa hamster-vpa | grep -A10 "Recommendation:". If the Recommendation section is empty, give the workload more time to run before restarting.

Step 5: Reset VPA mode

# Return to Off mode so later exercises start from a known state
kubectl patch vpa hamster-vpa --type='merge' \
-p='{"spec":{"updatePolicy":{"updateMode":"Off"}}}'

HPA vs VPA: When to Use Which

The safe combination rule: Never run HPA and VPA on the same metric. If HPA is managing CPU utilization while VPA is adjusting CPU requests, they form a destabilizing positive feedback loop:

  1. VPA increases CPU requests
  2. Same real CPU usage is now a smaller fraction of the larger request — HPA utilization drops
  3. HPA scales in (fewer replicas)
  4. Load concentrates on remaining pods — per-pod CPU rises
  5. VPA observes higher per-pod usage, increases requests further
  6. Repeat

This loop does not converge. It oscillates with each VPA eviction cycle acting as a perturbation that resets the HPA signal. The safe pattern is HPA on external/custom metrics (RPS, queue depth, active connections) with VPA managing CPU/memory requests. Operating on orthogonal signals, the two controllers cannot interfere with each other’s feedback paths.
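
To make steps 1 to 3 concrete, the sketch below applies the documented HPA replica formula (desired = ceil(current × currentUtilization / targetUtilization), with the default 10% tolerance) to a workload whose real CPU demand never changes. The function name and the numbers (600m of total demand, a 50% target) are illustrative assumptions, not values measured on a cluster.

```python
import math

def hpa_desired_replicas(current_replicas, per_pod_usage_m, request_m,
                         target_pct, tolerance=0.1):
    """Replica count HPA would ask for on a CPU Utilization target.

    Simplified: ignores min/max bounds, behavior policies, and unready pods.
    """
    utilization_pct = 100.0 * per_pod_usage_m / request_m
    ratio = utilization_pct / target_pct
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas  # within tolerance: no scaling action
    return math.ceil(current_replicas * ratio)

TOTAL_DEMAND_M = 600  # assumed constant real CPU demand, in millicores

# Before VPA acts: 6 pods, 200m request, 100m used per pod -> 50% utilization.
before = hpa_desired_replicas(6, TOTAL_DEMAND_M / 6, request_m=200, target_pct=50)

# VPA doubles the request to 400m. Real usage per pod is still 100m,
# but utilization now reads 25%, so HPA scales in.
after = hpa_desired_replicas(6, TOTAL_DEMAND_M / 6, request_m=400, target_pct=50)

print(before, after)
```

With these inputs the desired replica count halves from 6 to 3 even though demand is flat. Each remaining pod then carries 200m, which is the higher per-pod usage VPA observes on its next pass.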


🧪 Exercise 8: Reproduce the HPA + VPA Feedback Loop

This is the most important exercise in the guide. You will deliberately create the feedback loop described above and observe it destabilize replica count.

Step 1: Deploy a workload with both CPU-based HPA and VPA in Auto mode

kubectl create deployment feedback-test \
--image=registry.k8s.io/hpa-example --port=80
kubectl set resources deployment feedback-test \
--requests=cpu=200m --limits=cpu=500m
kubectl expose deployment feedback-test --port=80
# CPU-based HPA
kubectl autoscale deployment feedback-test \
--cpu-percent=50 --min=1 --max=8
# VPA in Auto mode on the same workload
cat <<EOF | kubectl apply -f -
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: feedback-test-vpa
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: feedback-test
  updatePolicy:
    updateMode: "Auto"
  resourcePolicy:
    containerPolicies:
    # kubectl create deployment names the container after the image, so it is "hpa-example"
    - containerName: hpa-example
      minAllowed:
        cpu: 50m
      maxAllowed:
        cpu: 2
EOF

Step 2: Apply moderate, sustained load

kubectl run feedback-load --image=busybox:1.28 --restart=Never -- \
/bin/sh -c "while true; do wget -q -O- http://feedback-test; sleep 0.1; done"
# Monitor replica count and CPU utilization over 10+ minutes
watch -n5 "kubectl get hpa feedback-test && echo '---' && kubectl get vpa feedback-test-vpa && echo '---' && kubectl top pods -l app=feedback-test"

What to observe: VPA will adjust CPU requests upward. Each time it does, the same real CPU usage becomes a smaller percentage of the new (larger) request. HPA sees lower utilization and scales in. Fewer pods means more load per pod. VPA observes higher per-pod CPU and adjusts requests further. Watch for oscillation in replica count.

Step 3: Fix it — switch HPA to a non-CPU metric

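In a real cluster you would drive HPA from RPS via Prometheus. The kind lab has no custom or external metrics pipeline, so here the fix is simply to remove the CPU-based HPA. Switching HPA to memory is not a real alternative: VPA also manages memory requests by default, and memory is a poor scale-in signal (see HPA Anti-Patterns).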

# The correct fix: delete the CPU-based HPA, use a different signal
kubectl delete hpa feedback-test
# In production, replace with:
# - An ingress RPS metric via Prometheus Adapter
# - A queue depth metric via KEDA
# - An active connections metric from your load balancer

Cleanup:

kubectl delete pod feedback-load --ignore-not-found
kubectl delete deployment feedback-test
kubectl delete hpa feedback-test --ignore-not-found
kubectl delete vpa feedback-test-vpa
kubectl delete svc feedback-test

Cluster Autoscaler Interaction

HPA and VPA both create pressure on the node pool, but on different timescales and through different mechanisms.

HPA is fast. CA is slow. Node bootstrap time is the dominant constant in the system — every autoscaling strategy is bounded by it. The 2–4 minute bootstrap lag (longer for GPU or large instance types) sets a hard floor on how quickly new capacity can serve traffic. Any strategy that relies on CA to absorb spikes has accepted this floor as a design constraint.

CA solves local schedulability, not global efficiency. CA provisions enough nodes to schedule the pods that are currently Pending. It does not optimize bin-packing across the cluster — it does not rebalance existing pods, consolidate fragmented nodes, or optimize for cost. This is why VPA can increase node count even when actual CPU utilization is low: the scheduler makes placement decisions based on requests, not observed usage. VPA inflates requests → pods no longer fit on existing nodes → CA provisions new nodes → actual utilization stays flat or even falls. The cluster grows without the workload growing.

VPA raises the node pressure threshold and can increase your bill. VPA’s recommendations target requests; by default it also scales limits to keep the original limit-to-request ratio, but only requests matter to the scheduler. Larger requests make pods harder to schedule on existing nodes, pushing CA to provision additional capacity or larger instance types. This silently changes your node pool’s instance shape economics. You may end up with fewer, larger nodes than intended, or more total nodes, without any increase in actual cluster utilization. Monitor instance type distribution and node count trends after enabling VPA in Auto mode; the cost impact will appear there before it shows up in billing reports.

🧠 Mental Model: Requests Drive Cost, Not Usage

In Kubernetes, you pay for what you reserve, not what you use. The scheduler, the bin-packer, and CA all operate on requests. VPA optimizes requests. This means every VPA recommendation upward is a potential cost event — even if actual utilization is unchanged.
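
A small sketch of that arithmetic, under loud simplifications: a single resource dimension (CPU), identical pods, no system pods or DaemonSets, and an assumed 4-core node. The function name and all numbers are hypothetical; the point is only that node count follows requests while real usage stays flat.

```python
import math

def nodes_needed(replicas, request_m, node_allocatable_m):
    """Nodes required to schedule `replicas` pods by CPU request alone."""
    pods_per_node = node_allocatable_m // request_m
    if pods_per_node == 0:
        raise ValueError("request exceeds node allocatable; pod is unschedulable")
    return math.ceil(replicas / pods_per_node)

NODE_ALLOCATABLE_M = 4000  # assumed allocatable CPU per node, in millicores
REPLICAS = 30              # assumed replica count; real usage does not change

# Same workload, three different per-pod CPU requests.
for request_m in (250, 400, 600):
    print(request_m, nodes_needed(REPLICAS, request_m, NODE_ALLOCATABLE_M))
```

Raising the request from 250m to 600m takes this hypothetical workload from 2 nodes to 5 without a single extra request being served.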

Autoscaling optimizes for performance first, cost second unless explicitly constrained. HPA and VPA have no cost objective — they optimize to keep the metric within bounds. Over-scaling is operationally safer than under-scaling from their perspective. If cost matters (it always does), you need to encode it through maxReplicas, maxAllowed bounds, and node pool configuration — the autoscalers will not self-constrain.

Prevent CA overshoot: If VPA evicts a large batch of pods simultaneously, the scheduler may not fit them all, triggering CA to provision capacity for a transient condition. Stage transitions between VPA updateMode values, and consider CA’s --scale-down-delay-after-add to prevent immediate scale-in after a VPA-triggered provisioning event.

Overprovisioning buffers address the CA latency problem directly. Deploy a Deployment of low-priority placeholder pods (using a PriorityClass with a negative value) sized to your expected burst headroom. These pods reserve cluster capacity when idle, keeping nodes warm and schedulable. When real pods scale out, the scheduler preempts the placeholder pods to make room, so the real pods start without waiting on CA; the displaced placeholders go Pending and CA replaces the buffer in the background. The cost is always-on reserved capacity; the benefit is removing the 2–4 minute bootstrap lag from your scaling critical path.


🧪 Exercise 9: Trigger Pending Pods via VPA Request Inflation

In kind, nodes have fixed resources. You can reproduce the scenario where VPA-inflated requests leave pods unschedulable by setting minAllowed high enough that the resulting requests exceed the headroom left on your kind nodes.

# First, check your kind nodes' allocatable CPU and memory
kubectl describe nodes | grep -A5 "Allocatable:"
# Deploy a tight workload
kubectl create deployment inflate-test --image=nginx
kubectl set resources deployment inflate-test \
--requests=cpu=100m,memory=64Mi --limits=cpu=200m,memory=128Mi
kubectl scale deployment inflate-test --replicas=3
# Create a VPA whose minAllowed forces requests beyond the available per-node headroom
# Adjust these numbers so the three replicas cannot all fit in your workers' free allocatable capacity
cat <<EOF | kubectl apply -f -
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: inflate-test-vpa
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: inflate-test
  updatePolicy:
    updateMode: "Auto"
  resourcePolicy:
    containerPolicies:
    - containerName: nginx
      minAllowed:
        cpu: 800m # Intentionally large; adjust to exceed your node headroom
        memory: 512Mi
      maxAllowed:
        cpu: 2
        memory: 2Gi
EOF
# Watch for Pending pods
kubectl get pods -l app=inflate-test --watch &
# After VPA evicts and re-creates pods with large requests, check for Pending
kubectl get events --field-selector reason=FailedScheduling --watch

What to observe: After VPA injects the inflated requests, some pods may enter Pending state because no single node has enough remaining allocatable resources. In a real cluster, this is the trigger for Cluster Autoscaler to provision new nodes.

Cleanup:

kubectl delete deployment inflate-test
kubectl delete vpa inflate-test-vpa

Operational Gotchas

VPA’s OOM learning problem: VPA recommends based on observed usage. If your application hasn’t experienced peak load during the observation window, VPA will under-recommend memory. VPA does react to an OOMKill by recording an inflated memory sample, but that correction only arrives after the crash. Always set minAllowed bounds anchored to values from load testing, not from observed idle-state usage.

CPU throttling blindspot: If your containers have tight CPU limits, container_cpu_cfs_throttled_seconds_total will be high while observed CPU usage stays capped at the limit and therefore looks modest. VPA only sees the capped usage, so it can recommend CPU requests at or below the current value, which worsens the throttling if you apply them. Always check the throttling metric before acting on VPA CPU recommendations.
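
One way to turn "check the throttling metric" into a number is the fraction of CFS periods in which the container was throttled, computed from the deltas of the cAdvisor counters container_cpu_cfs_throttled_periods_total and container_cpu_cfs_periods_total over the same window. The sketch below assumes you have already fetched those two deltas; the 5% cut-off and the function names are arbitrary illustrations, not a VPA or Kubernetes default.

```python
def throttle_ratio(throttled_periods_delta, total_periods_delta):
    """Fraction of CFS scheduling periods in which the container was throttled."""
    if total_periods_delta <= 0:
        return 0.0  # no periods elapsed (or no CPU limit set): nothing to report
    return throttled_periods_delta / total_periods_delta

def cpu_recommendation_trustworthy(throttled_periods_delta, total_periods_delta,
                                   max_ratio=0.05):
    """Hypothetical gate: only act on a VPA CPU target when throttling is low."""
    return throttle_ratio(throttled_periods_delta, total_periods_delta) <= max_ratio

# Assumed 5-minute window of 100ms periods (3000 periods in total).
print(throttle_ratio(2900, 3000))                  # heavily throttled container
print(cpu_recommendation_trustworthy(2900, 3000))  # do not trust the VPA target yet
print(cpu_recommendation_trustworthy(10, 3000))    # throttling is negligible
```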

Memory target at p95 is not a ceiling: VPA recommends memory at p95, meaning 5% of observed samples exceeded the recommendation. For workloads with heavy GC or periodic batch operations, the tail can be large. Setting maxAllowed memory without headroom above p95 will still produce OOMKills at peak.
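
The gap between a percentile and the true peak is easy to underestimate. The sketch below takes the p95 figure above at face value and uses a simple nearest-rank percentile over hand-made samples; VPA itself works on decaying histograms, so this illustrates the statistic, not the recommender's implementation. The sample values are hypothetical.

```python
def nearest_rank_percentile(samples, pct):
    """Nearest-rank percentile for an integer pct in 1..100."""
    ordered = sorted(samples)
    rank = (pct * len(ordered) + 99) // 100  # integer ceil(pct * n / 100)
    return ordered[max(rank, 1) - 1]

# Hypothetical memory samples in MiB: mostly steady, plus a periodic batch spike.
samples_mib = [300] * 94 + [310] + [520, 560, 600, 640, 700]

p95 = nearest_rank_percentile(samples_mib, 95)
over = [s for s in samples_mib if s > p95]

print(p95, len(over), max(over))
```

Here the p95 is 310 MiB while the peak is 700 MiB, so a memory limit sized close to the percentile would be exceeded by every one of the five tail samples.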


🧪 Exercise 10: Inspect VPA Recommendations Under CPU Throttling

This exercise demonstrates the CPU throttling blindspot: tight limits cause VPA to recommend less CPU, creating a vicious cycle.

Step 1: Deploy with intentionally tight CPU limits

cat <<EOF | kubectl apply -f -
apiVersion: apps/v1
kind: Deployment
metadata:
  name: throttle-test
spec:
  replicas: 1
  selector:
    matchLabels:
      app: throttle-test
  template:
    metadata:
      labels:
        app: throttle-test
    spec:
      containers:
      - name: app
        image: registry.k8s.io/ubuntu-slim:0.14
        resources:
          requests:
            cpu: 200m
          limits:
            cpu: 210m # Limit barely above request: maximum throttling
        command: ["/bin/sh"]
        args:
        - "-c"
        - "while true; do yes >/dev/null; done" # 100% CPU burn
EOF
# Attach a VPA
cat <<EOF | kubectl apply -f -
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: throttle-test-vpa
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: throttle-test
  updatePolicy:
    updateMode: "Off"
  resourcePolicy:
    containerPolicies:
    - containerName: app
      minAllowed:
        cpu: 50m
      maxAllowed:
        cpu: 4
EOF

Step 2: Check throttling and VPA recommendation

# Check actual CPU usage — it will appear bounded by the limit
kubectl top pods -l app=throttle-test
# After 5+ minutes, check VPA recommendation
kubectl describe vpa throttle-test-vpa | grep -A10 "Container Recommendations"

What to observe: Even though the container is trying to burn a full core, kubectl top shows only ~210m (the limit). VPA sees this capped observation and may recommend a value near or below the current request. In a real environment, you would check container_cpu_cfs_throttled_seconds_total in Prometheus to confirm throttling.

Cleanup:

kubectl delete deployment throttle-test
kubectl delete vpa throttle-test-vpa

Autoscaling Failure Taxonomy

Production autoscaling incidents tend to fall into a small number of reusable classes. Naming them makes debugging faster — you can pattern-match a symptom to a class before you have the full picture.

| Failure Class | Root Cause | Observable Symptom | Canonical Example |
| --- | --- | --- | --- |
| Lag-induced saturation | Reaction pipeline slower than load ramp | High error rate for 90–120s before replicas increase | CPU HPA at 80% target + sudden 3× traffic spike |
| Signal distortion | Metric ≠ actual load | VPA recommends lower CPU despite high latency | CPU throttling suppresses observed usage |
| Control loop interference | Two loops reacting to the same signal | Oscillating replica count without load change | CPU-based HPA + VPA Auto mode running simultaneously |
| Capacity illusion | Scheduler or CA lag hides true capacity deficit | Pods Pending despite “sufficient” cluster capacity | VPA evicts pods during CA bootstrap window |
| Overcorrection / oscillation | Aggressive scale policies or too-low stabilization window | Replica count thrashes up and down under steady load | scaleDown.stabilizationWindowSeconds: 0 on noisy metric |
| Bound-induced blindness | maxReplicas or maxAllowed set too conservatively | ScalingLimited condition True; SLO degraded but HPA appears healthy | maxReplicas: 5 on a service that needs 20 during peak |

When an autoscaling incident starts, the first question is: which class is this? The answer determines whether you look at metric freshness, HPA/VPA coupling, scheduler events, or policy configuration.


Production Incident Pattern: The Black Friday Failure Mode

Consider a typical API service under sudden high load:

  1. Traffic spikes 5× over 2 minutes.
  2. CPU metrics are ~30s stale. HPA does not yet see elevated utilization.
  3. HPA eventually fires — but CPU target was set at 80%. The service is already saturated before the first new pod starts.
  4. VPA, running in Auto mode, decides this is a good time to evict two pods to update their memory requests. Pod count temporarily drops.
  5. The evicted pods cannot fit on existing nodes due to larger VPA-requested resources. CA begins provisioning — with a 2–4 minute bootstrap lag.
  6. By the time new capacity is available, the load spike has peaked and is declining. CA provisions nodes that are no longer needed.

The fix is not a single knob. It requires: external metrics for HPA (RPS instead of CPU), VPA in Initial mode during high-risk windows, CA warm pools or overprovisioning buffers, and load-tested minAllowed VPA bounds.
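
A back-of-the-envelope sketch shows why the CPU target alone cannot close the gap. It assumes utilization sits exactly at the HPA target before the spike, traffic ramps linearly, CPU scales linearly with traffic, no new replicas arrive during the ramp, and 100% of the CPU request is treated as the saturation point. The function name and all of these assumptions are mine, for illustration only.

```python
def seconds_to_saturation(target_pct, spike_multiplier, ramp_seconds):
    """Seconds into a linear traffic ramp at which pods reach 100% of request.

    Returns None if the spike never exhausts the headroom above the target.
    """
    headroom_factor = 100.0 / target_pct  # load multiple the pods can absorb
    if spike_multiplier <= headroom_factor:
        return None
    return (headroom_factor - 1.0) / (spike_multiplier - 1.0) * ramp_seconds

# The scenario above: 5x traffic over 2 minutes.
print(seconds_to_saturation(80, 5, 120))  # 80% target
print(seconds_to_saturation(50, 5, 120))  # 50% target
```

Under these assumptions an 80% target saturates 7.5 seconds into the ramp and a 50% target after 30 seconds. Both are well inside the 90–120 second reaction pipeline, which is why the fix also needs a faster signal and pre-warmed capacity.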


🧪 Exercise 11: Simulate the Black Friday Failure Mode End-to-End

This pulls together HPA, VPA, and the scheduler to reproduce the scenario.

Step 1: Deploy the reference “API service”

cat <<EOF | kubectl apply -f -
apiVersion: apps/v1
kind: Deployment
metadata:
  name: api-service
spec:
  replicas: 2
  selector:
    matchLabels:
      app: api-service
  template:
    metadata:
      labels:
        app: api-service
    spec:
      containers:
      - name: api
        image: registry.k8s.io/hpa-example
        ports:
        - containerPort: 80
        resources:
          requests:
            cpu: 200m
            memory: 64Mi
          limits:
            cpu: 500m
            memory: 128Mi
EOF
kubectl expose deployment api-service --port=80
# HPA with 80% CPU target (the anti-pattern)
cat <<EOF | kubectl apply -f -
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: api-service
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api-service
  minReplicas: 2
  maxReplicas: 10
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 80 # Anti-pattern: too high, no headroom
EOF
# VPA in Auto mode (will evict during the spike)
cat <<EOF | kubectl apply -f -
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: api-service-vpa
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api-service
  updatePolicy:
    updateMode: "Auto"
EOF

Step 2: Apply a sudden 5× load spike

echo "Spike started at: $(date +%T)"
for i in {1..5}; do
kubectl run spike-$i --image=busybox:1.28 --restart=Never -- \
/bin/sh -c "while true; do wget -q -O- http://api-service; done" &
done
# Monitor everything simultaneously
watch -n3 "
echo '=== HPA ==='; kubectl get hpa api-service;
echo '=== Pods ==='; kubectl get pods -l app=api-service;
echo '=== Events (last 5) ==='; kubectl get events --sort-by='.lastTimestamp' | tail -5
"

What to observe over ~10 minutes:

  • Initial delay before HPA fires (metric lag + sync period)
  • VPA evicting a pod during the spike (pod count temporarily drops)
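  • HPA and VPA interfering: VPA evictions and request changes shift the utilization HPA computes, so replica count moves without a change in load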
  • If pods request more resources after eviction, potential scheduling pressure

Step 3: Apply the fix and compare

# Stop the spike
for i in {1..5}; do kubectl delete pod spike-$i --ignore-not-found; done
# Fix 1: Lower HPA CPU target to leave headroom
kubectl patch hpa api-service --type='merge' -p='
spec:
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 50'
# Fix 2: Switch VPA to Initial mode (no evictions of running pods)
kubectl patch vpa api-service-vpa --type='merge' -p='
spec:
  updatePolicy:
    updateMode: "Initial"'
echo "Fixed config applied at: $(date +%T)"
# Re-run the spike
for i in {1..5}; do
kubectl run spike-$i --image=busybox:1.28 --restart=Never -- \
/bin/sh -c "while true; do wget -q -O- http://api-service; done" &
done
watch -n3 "
echo '=== HPA ==='; kubectl get hpa api-service;
echo '=== Pods ==='; kubectl get pods -l app=api-service;
echo '=== Events (last 5) ==='; kubectl get events --sort-by='.lastTimestamp' | tail -5
"

Cleanup everything:

for i in {1..5}; do kubectl delete pod spike-$i --ignore-not-found; done
kubectl delete deployment api-service php-apache hamster --ignore-not-found
kubectl delete hpa api-service php-apache --ignore-not-found
kubectl delete vpa api-service-vpa hamster-vpa --ignore-not-found
kubectl delete svc api-service php-apache --ignore-not-found

Choosing an Autoscaling Strategy

Given a workload, how do you decide what to configure? The right answer depends on the workload’s scheduling properties, traffic shape, and operational risk tolerance — not on what’s easiest to configure.

Workload is stateless and traffic is spiky?
→ HPA with external metric (RPS or queue depth)
→ Add CPU as a secondary ceiling if external pipeline is unreliable
Workload is stateful (database, queue, cache)?
→ VPA in Off or Initial mode only — use recommendations to right-size at deploy time
→ HPA only if the workload supports safe horizontal scaling
Traffic is queue-driven (async workers, batch processors)?
→ KEDA with a queue-depth metric (it drives an HPA for you and adds scale-to-zero); CPU-based HPA is a poor fit because worker CPU does not track backlog
Workload is latency-sensitive (p99 SLO < 100ms)?
→ HPA with headroom baked into the target (50% CPU or lower, not 80%)
→ Overprovisioning buffer to absorb the CA bootstrap window
→ VPA in Initial mode; never Auto
CPU-bound workload with well-understood load curve?
→ HPA on CPU is acceptable if: target ≤ 60%, minReplicas absorbs the reaction window,
and VPA is on an orthogonal metric or in Off mode
You are starting from scratch with no load data?
→ VPA in Off mode for 1–2 full traffic cycles to collect recommendations
→ Use recommendations to set initial requests, then graduate to HPA

The general principle: configure autoscaling conservatively (lower CPU targets, wider stabilization windows, explicit maxReplicas) and then loosen based on observed behavior. The failure modes of over-conservative configuration (slightly higher cost, slightly slower reaction) are far more recoverable than the failure modes of over-aggressive configuration (oscillation, cascading evictions, CA thrash).


Production Design Pattern: A Battle-Tested Reference Architecture

For a stateless, latency-sensitive service that you want to operate safely at scale:

# HPA: scale on RPS, not CPU
metrics:
- type: External
  external:
    metric:
      name: http_requests_per_second
    target:
      type: AverageValue
      averageValue: "400"
# Scaling policies: don't surge, don't collapse
behavior:
  scaleUp:
    policies:
    - type: Percent
      value: 50
      periodSeconds: 60
    selectPolicy: Min
  scaleDown:
    stabilizationWindowSeconds: 300
    policies:
    - type: Percent
      value: 10
      periodSeconds: 120
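
To see what the 50%-per-60s scaleUp policy means in wall-clock terms, the sketch below steps a replica count toward a fixed desired value, one policy period at a time. It assumes the controller rounds the percent allowance up (which matches my reading of the HPA controller, but treat it as an approximation) and that the desired count does not change during the ramp. The function name and the 4-to-20 scenario are hypothetical.

```python
import math

def scale_up_trajectory(start, desired, percent, max_periods=20):
    """Replica count at the end of each policy period under one Percent policy."""
    counts = [start]
    current = start
    while current < desired and len(counts) <= max_periods:
        allowed = math.ceil(current * (1 + percent / 100))  # cap for this period
        current = min(desired, allowed)
        counts.append(current)
    return counts

trajectory = scale_up_trajectory(start=4, desired=20, percent=50)
print(trajectory)
print((len(trajectory) - 1) * 60, "seconds to reach the desired count")
```

Going from 4 to 20 replicas takes four periods, or 240 seconds, under this policy. That is the price of not surging, and it is worth comparing against the length of your typical traffic ramp before adopting the value.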

Pair this with VPA in Initial mode:

updatePolicy:
  updateMode: "Initial" # inject at pod creation, never evict running pods
resourcePolicy:
  containerPolicies:
  - containerName: "*"
    minAllowed:
      cpu: 100m
      memory: 256Mi # anchored to load test p99
    maxAllowed:
      cpu: 4
      memory: 4Gi

And complete the stack with:

  • CA warm pool or overprovisioning buffer (low-priority placeholder pods that get evicted first, keeping spare capacity pre-warmed)
  • CPU requests tuned to p50 load (not observed idle), informed by VPA recommendations after a full traffic cycle
  • --scale-down-delay-after-add on CA set to at least 10 minutes to prevent thrashing after a provisioning event

This architecture means HPA scales on a signal with no CPU-request coupling, VPA rightsizes without disrupting running pods, and CA only sees pressure from genuine, sustained scheduling demand.


Cost Dynamics of Autoscaling

Autoscalers have no cost objective — they optimize to keep metrics within bounds. This means cost consequences are entirely a function of how you constrain them.

HPA cost profile: Over-scaling costs money (idle pods billed at full rate). Under-scaling costs SLO attainment. The tradeoff is asymmetric: SLO violations have reputational and sometimes contractual consequences; idle capacity has a predictable cost. Most production systems err toward over-scaling by design, using minReplicas floors that keep pods warm even during off-peak hours. The cost of that floor is the explicit price of low-latency reaction.

VPA cost profile: VPA’s cost impact is indirect and counterintuitive. By inflating requests, VPA can reduce bin-packing efficiency — larger requests mean fewer pods per node, which means more nodes for the same actual workload. The mechanism: CA provisions for request pressure, not usage pressure. A cluster running at 30% actual CPU utilization but 90% request utilization looks fully packed to CA. VPA can worsen this by pushing requests upward toward observed peaks. Track both actual utilization and request utilization as separate metrics.

CA cost profile: CA cost is a step function — it changes in node increments. This creates a zone of structural over-provisioning: the last node in a pool will typically carry only whatever load couldn’t fit elsewhere, but it is billed the same as a fully loaded node. Overprovisioning buffer pods deliberately fill this slack, converting the wasted allocation into controlled headroom rather than accidental waste.

The lever most teams forget: maxAllowed in VPA and maxReplicas in HPA are your primary cost controls. Without explicit upper bounds, both systems will scale toward whatever is needed to satisfy the metric — with no regard for what that costs. Set these bounds based on cost budgets, not just technical ceilings.


What Experienced Engineers Actually Do

Theory and configuration syntax are table stakes. The harder-won knowledge is what practitioners actually run in production after a few incidents:

On metric selection: RPS or queue depth as the primary HPA signal, with CPU as a secondary ceiling to catch cases where the metrics pipeline has gaps or delays. CPU-only HPA is treated as a legacy pattern to migrate away from when the metrics infrastructure is available.

On VPA modes: Initial only for production workloads. Auto mode is reserved for non-critical batch workloads or development environments where evictions are acceptable. The workflow for using VPA in production is: run in Off mode for two to four weeks across a full traffic cycle, collect recommendations, apply them to manifests via GitOps during a low-traffic window, and re-evaluate quarterly.

On request sizing: minAllowed in VPA is always anchored to load test p99 observed usage, not to VPA’s recommendation from off-peak periods. This prevents VPA from shrinking requests toward near-zero values observed at 3am and then evicting pods at 9am when the requests no longer fit.

On CA and warm capacity: Overprovisioning buffer pods (low-priority Deployment + negative PriorityClass) are standard practice at any org that has been burned by CA bootstrap lag during a traffic event. The sizing is calibrated from load tests: buffer = expected peak replica count minus baseline replica count, sized for the workload’s request footprint.
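
The sizing rule in the previous paragraph can be sketched as simple arithmetic. The replica counts and per-pod requests below are placeholders, and the function name is hypothetical; substitute the figures from your own load tests.

```python
def buffer_footprint(peak_replicas, baseline_replicas, cpu_request_m, memory_request_mi):
    """Capacity the placeholder Deployment should reserve.

    buffer = (peak replicas - baseline replicas) x per-pod request footprint
    """
    placeholder_pods = max(peak_replicas - baseline_replicas, 0)
    return {
        "placeholder_pods": placeholder_pods,
        "cpu_m": placeholder_pods * cpu_request_m,
        "memory_mi": placeholder_pods * memory_request_mi,
    }

# Assumed load-test results: 8 replicas at baseline, 20 at peak,
# each requesting 200m CPU and 256Mi memory.
buffer = buffer_footprint(20, 8, cpu_request_m=200, memory_request_mi=256)
print(buffer)
```

The placeholder pods do not need to mirror the real pods one-to-one; what matters is that their total requests reserve this much schedulable capacity.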

On stabilization: Scale-down stabilizationWindowSeconds of 300s (the default) is treated as a floor, not a ceiling. For services with expensive startup (JVM warmup, cache population), it is extended to 600–900s to prevent premature scale-in during multi-wave traffic patterns.
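
The effect of the window on multi-wave traffic can be sketched directly: for scale-down, HPA acts on the highest replica recommendation seen within the stabilization window. The recommendation history below is invented, and the function is a simplification that ignores scale-down rate policies.

```python
def stabilized_scale_down(recommendations, now, window_seconds):
    """Replica count HPA would hold: the max recommendation inside the window.

    recommendations: list of (timestamp_seconds, desired_replicas) tuples.
    """
    in_window = [replicas for ts, replicas in recommendations
                 if now - window_seconds <= ts <= now]
    return max(in_window)

# Hypothetical history: a first wave ends, then a second wave hits at t=240.
history = [(0, 12), (60, 12), (120, 6), (180, 5), (240, 11), (300, 5)]

# At t=180, between the two waves:
print(stabilized_scale_down(history, now=180, window_seconds=60))   # short window
print(stabilized_scale_down(history, now=180, window_seconds=300))  # default window
```

With a 60-second window the service would already have dropped to 6 replicas when the second wave arrives and would have to scale back up. The 300-second window holds it at 12 across the lull.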

On observability: Alert on ScalingLimited=True lasting more than two minutes and on sustained Pending pods, and check for rising container_cpu_cfs_throttled_seconds_total before trusting VPA recommendations. The debugging workflow is always: metrics first, then events, then pod resource comparison, then cross-loop interaction analysis.


Common Misconfigurations

HPA Anti-Patterns

  • CPU target above 75%: Leaves insufficient headroom for the ~90–120s reaction time pipeline. The service is already degraded before new pods serve traffic.
  • No scaleUp policies: Allows HPA to multiply replicas in a single cycle, potentially overwhelming downstream dependencies.
  • Using memory as a scale-out trigger: Memory-based HPA often fails to scale back in because most applications do not release allocated memory after load drops — the process holds the heap. HPA will see sustained high memory utilization and resist scale-in indefinitely. Use memory as an HPA metric only if you have confirmed your application actively releases memory under reduced load.
  • Not accounting for pod warm-up: A newly scheduled pod is not immediately useful. If your service has a slow startup (JVM warmup, cache population), include minReadySeconds and configure readiness probes that reflect actual traffic-readiness.

VPA Anti-Patterns

  • Auto mode on stateful workloads: Eviction of a database or queue pod mid-operation is a data risk. Use Off or Initial.
  • No minAllowed: Without a lower bound, VPA will shrink requests toward observed minimums, which may be near zero during off-peak hours.
  • Switching to Auto during peak traffic: Triggers an immediate wave of evictions. Always test mode changes in off-peak windows.
  • Combining with CPU-based HPA: Creates the feedback loop described earlier. Use orthogonal metrics.

Observability: Metrics That Matter

Autoscaling is only debuggable if you are measuring the right signals. The closing principle of this post — that the goal is observability, not perfect configuration — requires knowing exactly which metrics to watch.

HPA signals:

  • kube_horizontalpodautoscaler_status_desired_replicas vs kube_horizontalpodautoscaler_status_current_replicas — the gap between these is your scale lag in real time
  • kube_horizontalpodautoscaler_status_condition — surfaces ScalingLimited, AbleToScale, and ScalingActive conditions; ScalingLimited means rate policies or min/max bounds are constraining HPA from reaching its desired count
  • The raw metric value vs the target threshold for each configured metric — monitor these independently to catch noisy metrics driving unexpected scale decisions

VPA signals:

  • VPA.status.recommendation.containerRecommendations[].target vs actual pod requests: the gap is your rightsizing debt
  • Eviction events on VPA-managed pods (kubectl get events --field-selector reason=EvictedByVPA): unexpected eviction frequency signals too-aggressive bounds or PDB misconfiguration
  • container_cpu_cfs_throttled_seconds_total: a high value means VPA’s CPU observation is artificially suppressed; recommendations cannot be trusted until throttling is resolved
  • kube_pod_container_status_last_terminated_reason{reason="OOMKilled"}: indicates VPA memory recommendations are too low or minAllowed is not set correctly

Cross-loop signals:

  • kube_pod_status_phase{phase="Pending"} together with kube_pod_status_unschedulable: the trigger condition for CA; sustained Pending pods mean either CA is still bootstrapping or no node shape can fit the requested resources
  • Node instance type distribution over time: VPA-driven request inflation silently changing your node pool shape will appear here before it appears in cost reports

🧪 Exercise 12: Interrogate HPA Status Conditions

Practice reading the HPA status conditions that appear in production debugging. These conditions surface the internal state of the control loop.

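# Recreate the php-apache Deployment and Service if an earlier cleanup removed them
# (same image and requests as Exercise 8)
kubectl get deployment php-apache >/dev/null 2>&1 || {
  kubectl create deployment php-apache --image=registry.k8s.io/hpa-example --port=80
  kubectl set resources deployment php-apache --requests=cpu=200m --limits=cpu=500m
  kubectl expose deployment php-apache --port=80
}
# Create the HPA with a deliberately low maxReplicas (skipped if it already exists)
kubectl autoscale deployment php-apache --cpu-percent=50 --min=1 --max=2 2>/dev/null || true
# Read the full status conditions
kubectl get hpa php-apache -o jsonpath='{.status.conditions}' | python3 -m json.tool
# Drive it to its maxReplicas to trigger ScalingLimited
kubectl run limit-test --image=busybox:1.28 --restart=Never -- \
/bin/sh -c "while true; do wget -q -O- http://php-apache; sleep 0.01; done"
# Wait for HPA to hit maxReplicas=2, then check conditions
sleep 60
kubectl describe hpa php-apache | grep -A20 "Conditions:"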

What to observe: Once HPA hits maxReplicas, the ScalingLimited condition becomes True with a message indicating the bound was hit. In production, alerting on ScalingLimited=True for more than a few minutes signals that your maxReplicas is too low or your workload has genuinely outgrown its current sizing.
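
If you want to check the condition programmatically rather than by eye, the JSON printed by the jsonpath query above can be filtered in a few lines. The sample below is hand-written to resemble `.status.conditions` on an autoscaling/v2 HPA; the reason and message strings are illustrative and may differ across versions, and the function name is hypothetical.

```python
import json

# Hand-written sample shaped like `.status.conditions`; not captured from a cluster.
SAMPLE_CONDITIONS = """
[
  {"type": "AbleToScale", "status": "True", "reason": "ReadyForNewScale",
   "message": "recommended size matches current size"},
  {"type": "ScalingActive", "status": "True", "reason": "ValidMetricFound",
   "message": "the HPA was able to successfully calculate a replica count"},
  {"type": "ScalingLimited", "status": "True", "reason": "TooManyReplicas",
   "message": "the desired replica count is more than the maximum replica count"}
]
"""

def scaling_limited(conditions_json):
    """Return the ScalingLimited condition if its status is "True", else None."""
    for condition in json.loads(conditions_json):
        if condition.get("type") == "ScalingLimited" and condition.get("status") == "True":
            return condition
    return None

hit = scaling_limited(SAMPLE_CONDITIONS)
if hit is not None:
    print(f"ScalingLimited: {hit['reason']}")
```

The same function could be fed the live output by reading it from standard input instead of the embedded sample.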

# Also useful: describe shows human-readable metric values
kubectl describe hpa php-apache | grep -A5 "Metrics:"
# Cleanup
kubectl delete pod limit-test --ignore-not-found
kubectl delete hpa php-apache --ignore-not-found
kubectl delete deployment php-apache --ignore-not-found
kubectl delete svc php-apache --ignore-not-found

Final Cleanup

When you’re done with all exercises:

# Delete the kind cluster entirely
kind delete cluster --name autoscaling-lab

Summary

HPA Signal Tradeoff: CPU vs. External Metrics

| Dimension | CPU | RPS / Queue Depth |
| --- | --- | --- |
| Signal latency | High (lagging ~45s+) | Low (near-real-time) |
| Infra dependency | None | Metrics pipeline required |
| VPA coupling risk | High (distorts utilization ratio) | None (orthogonal signal) |
| Throttling blind spot | Yes | No |
| Stability | Higher | Lower (noisy metrics amplify) |
| Failure mode | Under-scaling | Over-scaling |

HPA vs. VPA at a Glance

| Dimension | HPA | VPA |
| --- | --- | --- |
| Scaling axis | Horizontal (replica count) | Vertical (resource requests) |
| Reaction speed | Seconds to minutes | Minutes to hours |
| Pod disruption | None | Restart required (unless In-Place beta) |
| Control model | Delayed P-controller | Statistical percentile estimator |
| Safe to combine | Only on orthogonal metrics | Only on orthogonal metrics |
| Best for | Spiky, stateless workloads | Rightsizing; stateful workloads |
| PDB interaction | Respects during rolling update | Updater respects PDB; can stall silently |

Full Reference

| | HPA | VPA |
| --- | --- | --- |
| Controller location | kube-controller-manager | Separate Deployment (3 components) |
| Metrics source | Metrics APIs (resource/custom/external) | metrics-server + historical samples |

Understanding HPA as a delayed, saturating proportional controller with a 90–120 second reaction pipeline, and VPA as a statistical offline optimizer that must restart pods to apply its recommendations and cannot observe throttled CPU accurately, reframes how you tune both systems. Neither loop operates in isolation — they share the same node pool, react to overlapping signals, and can amplify each other’s effects into the destabilizing positive feedback loops described above. Map symptoms to the failure taxonomy before reaching for knobs. Instrument the signals in the observability section, and you will know which loop to blame before the incident review is scheduled.

Autoscaling is not about making systems perfectly elastic — that’s impossible given the phase lag, signal noise, and discrete provisioning steps involved. It is about designing systems where the failure modes are predictable, observable, and bounded. The engineers who succeed with autoscaling aren’t the ones who tune it perfectly — they’re the ones who understand how it breaks.


Further reading: KEP-1287 In-Place Pod Vertical Scaling · HPA algorithm design doc · autoscaling/v2 API reference

