Audience: Senior/Advanced engineers building production-grade operators managing O(10³–10⁵) Custom Resources.
Table of Contents
- Scope & Assumptions
- The Reconciliation Mental Model
- A Throughput Model for Capacity Planning
- The API Server Is Your Real Bottleneck
- Reconcile Loop Performance & Latency
- Rate Limiting & Queue Depth
- Work Queue Backpressure Strategies
- Write Amplification
- Finalizers at Scale
- Cache & Informer Efficiency
- Leader Election & Operator Sharding
- Observability & SLOs
- Load & Failure Testing
- Common Anti-Patterns
- Putting It All Together: A Scaling Checklist
- Closing Thoughts
Scope & Assumptions
Before diving in, it’s worth being explicit about what this document covers — and what it doesn’t.
In scope:
- Kubernetes ≥ 1.26
- controller-runtime ≥ v0.17 (Go 1.21+)
- Single cluster control plane
- Operators managing O(10³–10⁵) CRs
- API server is a shared, rate-limited dependency — not dedicated to your operator
Non-goals:
- CRD design best practices
- Admission webhooks
- Multi-cluster federation
- Operator SDK specifics (patterns apply, APIs differ)
Violating the “shared API server” assumption changes almost every tuning recommendation in this post. If you have a dedicated API server, you can be more aggressive. In most real environments, you don’t.
The Reconciliation Mental Model
Most operator engineers understand the basic loop: watch for changes, reconcile desired vs. actual state, repeat. What becomes non-obvious at scale is that the reconciler isn’t a single pipeline — it is a concurrent, rate-limited, queue-backed distributed system in miniature, and every design decision compounds when you go from 50 to 50,000 Custom Resources (CRs).
Before diving into individual dimensions, establish the full lifecycle of a reconciliation event: a watch event arrives from the informer, passes (or is dropped by) predicate filters, is deduplicated and enqueued on the work queue, waits out the rate limiter, is picked up by a worker, runs through Reconcile(), and issues API writes whose own watch events may start the cycle again.

Every stage in that lifecycle is a knob you can tune. The rest of this post unpacks each one.
A Throughput Model for Capacity Planning
Before touching any configuration, build a mental model of your system’s throughput ceiling. This gives you a principled basis for tuning rather than cargo-culting numbers from blog posts.
Steady-State Throughput
Your operator’s throughput has three independent ceilings. The one with the lowest value wins — raising any other ceiling above it gives you nothing until you raise that one first.
Ceiling 1 — Workers
Workers are your goroutines running Reconcile(). Each worker can only process one reconcile at a time, so:
worker_throughput = MaxConcurrentReconciles / avg_reconcile_duration
Think of it like a checkout line: if you have 50 cashiers and each transaction takes 200ms, you can serve 250 customers/sec.
50 workers ÷ 0.2s per reconcile = 250 reconciles/sec
Ceiling 2 — Rate Limiter
The work queue’s token bucket caps how many reconciles per second are dispatched, regardless of how many workers are free. If your rate limiter is set to 100 QPS, your workers will sit idle even when the queue is full.
Ceiling 3 — API Server Write Capacity
Your operator shares the API server with the whole cluster. Even if your workers and rate limiter are healthy, the API server can reject or throttle writes. This is gated by your APF PriorityLevelConfiguration and etcd write throughput.
Effective throughput is the minimum of all three:
```
effective_throughput = min(
    worker_throughput,          # MaxConcurrentReconciles / avg_duration
    rate_limiter_QPS,           # your token bucket cap
    api_server_write_capacity,  # from APF config + etcd
)
```
Practical implication: If your rate limiter is 100 QPS and you have 200 workers, you have 100 idle workers. Adding more workers won’t help — raise the rate limiter (and verify your API server budget can absorb it).
Drain Time Under Burst
When a thundering herd hits — rolling node restart, mass deployment, operator pod bounce — every CR may enqueue at once. Use this to estimate how long your system takes to recover:
drain_time = queued_CRs / effective_throughput
Worked example — two scenarios with 10,000 CRs enqueued:
Scenario A: 50 workers, 250ms avg reconcile, rate limiter at 500 QPS
```
worker_throughput    = 50 / 0.25s = 200 rec/sec
effective_throughput = min(200, 500) = 200 rec/sec   ← workers are the bottleneck
drain_time           = 10,000 / 200 = 50 seconds
```
Scenario B: Same workers, but rate limiter capped at 100 QPS
```
worker_throughput    = 200 rec/sec (unchanged)
effective_throughput = min(200, 100) = 100 rec/sec   ← rate limiter is now the bottleneck
drain_time           = 10,000 / 100 = 100 seconds
```
The only change between scenarios is the rate limiter — nothing else. Drain time doubled. This is the most common “why isn’t adding workers helping?” situation.
The rate limiter is often the binding constraint, not worker count. Blindly increasing MaxConcurrentReconciles without tuning the rate limiter and cfg.QPS produces no improvement.
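The arithmetic is simple enough to script. A minimal sketch that reproduces the two scenarios above — the helper function and scenario values are mine, not controller-runtime API:

```go
package main

import "fmt"

// effectiveThroughput returns the binding ceiling in reconciles/sec: the lower
// of worker throughput and the rate limiter's QPS cap. (API server capacity
// would be a third term in a fuller model.)
func effectiveThroughput(workers int, avgReconcileSeconds, rateLimiterQPS float64) float64 {
    workerThroughput := float64(workers) / avgReconcileSeconds
    if rateLimiterQPS < workerThroughput {
        return rateLimiterQPS
    }
    return workerThroughput
}

func main() {
    const queuedCRs = 10000.0

    // Scenario A: 50 workers, 250ms reconciles, limiter at 500 QPS → 200 rec/sec, 50s drain
    a := effectiveThroughput(50, 0.25, 500)
    fmt.Printf("A: %.0f rec/sec, drain %.0fs\n", a, queuedCRs/a)

    // Scenario B: same workers, limiter capped at 100 QPS → 100 rec/sec, 100s drain
    b := effectiveThroughput(50, 0.25, 100)
    fmt.Printf("B: %.0f rec/sec, drain %.0fs\n", b, queuedCRs/b)
}
```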
Capacity Planning Reference
Use this table to size your operator given a target drain time of ≤ 30 seconds under full burst. “Workers Needed” and “API Server QPS Needed” are the minimum for each respective bottleneck — you need both to hold, not just one.
| CR Count | Avg Reconcile Duration | Workers Needed | API Server QPS Needed |
|---|---|---|---|
| 1,000 | 100ms | 4 | 34 |
| 10,000 | 100ms | 34 | 334 |
| 10,000 | 500ms | 167 | 334 |
| 50,000 | 200ms | 334 | 1,667 |
How to read this: At 10,000 CRs with 500ms reconciles, you need 167 workers and 334 QPS of API server budget. If you only have 50 workers, your actual drain time is 10,000 / min(100, 334) = 100 seconds — bounded by workers. If you have 167 workers but only 100 QPS of API server budget, your drain time is 10,000 / min(334, 100) = 100 seconds — now bounded by the API server. Both ceilings must be high enough.
At the bottom of this table you’re well past what a single operator instance should attempt. That’s where sharding enters.
The API Server Is Your Real Bottleneck
This is the most commonly overlooked constraint in operator scaling.
Default client-go Limits Are Tiny
The default rest.Config QPS settings are designed for human-paced tooling, not operators:
```go
// client-go defaults — almost certainly wrong for a production operator
cfg.QPS = 5    // 5 requests/sec
cfg.Burst = 10 // burst up to 10
```
An operator with 50 workers will be throttled to a crawl by these defaults. You won’t see obvious 429 errors — you’ll see mysteriously slow reconciles as the HTTP client self-throttles before requests even leave the pod. Always override them:
```go
cfg, err := ctrl.GetConfig()
cfg.QPS = 100   // tune to your API server capacity and APF priority level
cfg.Burst = 200

mgr, err := ctrl.NewManager(cfg, ctrl.Options{ ... })
```
The right value depends on your API server’s total load, the APF priority level assigned to your operator’s ServiceAccount, and how many other controllers compete for capacity.
API Priority & Fairness (APF)
Kubernetes 1.20+ layers API Priority and Fairness on top of the old --max-requests-inflight limits: those flag values now seed a total concurrency budget that APF divides among priority levels. Your operator’s requests are classified by a FlowSchema and queued in a PriorityLevelConfiguration. Without explicit configuration, your operator’s ServiceAccount traffic falls through to a shared catch-all bucket (typically workload-low, via the built-in service-accounts FlowSchema) that gets exhausted under any meaningful cluster load.
APF works through two cooperating objects:
- PriorityLevelConfiguration — defines a bucket of API server concurrency slots and how they’re managed
- FlowSchema — defines which requests (from which subjects, targeting which resources) get routed into that bucket
Think of a PriorityLevelConfiguration as a dedicated lane on a highway, and a FlowSchema as the on-ramp rules for who is allowed into it.
For high-throughput operators, create a dedicated priority level:
```yaml
apiVersion: flowcontrol.apiserver.k8s.io/v1
kind: PriorityLevelConfiguration
metadata:
  name: my-operator-high
spec:
  # "Limited" means this level competes for a share of the API server's
  # total concurrency. The alternative is "Exempt" — used by system-critical
  # components like kube-controller-manager — which bypasses all limits entirely.
  # Never use Exempt for your own operator.
  type: Limited
  limited:
    # nominalConcurrencyShares is not a hard slot count — it's a weight.
    # The API server distributes its total concurrency budget proportionally
    # across all Limited priority levels based on these weights.
    #
    # Example: If the API server allows 600 total concurrent requests and
    # you have three Limited levels with shares of 30, 30, and 40, they get
    # 180, 180, and 240 slots respectively.
    #
    # The built-in "global-default" level has 20 shares. Setting yours to 30
    # gives your operator ~1.5× the concurrency of the default bucket.
    # Tune this based on your operator's share of total cluster API load.
    nominalConcurrencyShares: 30
    limitResponse:
      # "Queue" means requests that exceed current concurrency are held in a
      # queue and dispatched when a slot opens. The alternative is "Reject",
      # which returns 429 immediately when slots are full — only appropriate
      # for truly latency-sensitive paths where queuing delay is worse than failure.
      type: Queue
      queuing:
        # The number of FIFO sub-queues within this priority level.
        # APF uses shuffle sharding: each incoming request is assigned to a
        # small random subset of queues (controlled by handSize) and placed
        # in the shortest one. More queues = better isolation between flows,
        # but higher memory overhead. 16 is a reasonable default.
        queues: 16
        # handSize controls how many queues each request "tries" when being
        # placed. A request is assigned to the shortest queue among its
        # handSize candidates. Higher handSize = better load balancing
        # across queues, but increases the probability that two competing
        # flows share a queue (reducing isolation).
        #
        # The trade-off: low handSize (e.g., 1) gives perfect isolation but
        # poor load balancing; high handSize (e.g., queues-1) gives perfect
        # balancing but no isolation. 4 is the standard default.
        handSize: 4
        # Maximum number of requests allowed to wait in any single sub-queue.
        # Once a sub-queue hits this limit, new requests for that flow are
        # rejected with 429 rather than queued further.
        #
        # Total buffer capacity = queues × queueLengthLimit = 16 × 50 = 800
        # pending requests. Size this based on your burst profile and
        # acceptable tail latency — a longer queue absorbs more burst but
        # increases worst-case wait time.
        queueLengthLimit: 50
---
apiVersion: flowcontrol.apiserver.k8s.io/v1
kind: FlowSchema
metadata:
  name: my-operator
spec:
  # Which PriorityLevelConfiguration this schema routes matched requests into.
  priorityLevelConfiguration:
    name: my-operator-high
  # Lower number = higher precedence. APF evaluates all FlowSchemas in
  # ascending precedence order and routes each request to the first match.
  # System-critical built-in schemas use low values; the built-in catch-alls
  # that would otherwise capture your traffic sit at the top of the range
  # (service-accounts at 9000, global-default at 9900, catch-all at 10000).
  # 500 ensures your operator's requests match here before falling through
  # to any of those shared buckets, without cutting ahead of system flows.
  matchingPrecedence: 500
  rules:
    - subjects:
        # Match requests originating from your operator's ServiceAccount.
        # This is the most precise subject — locks routing to your operator's
        # identity specifically, not to any other workload in the namespace.
        - kind: ServiceAccount
          serviceAccount:
            name: my-operator
            namespace: my-operator-ns
      resourceRules:
        # Which API requests from the above subjects this schema captures.
        # Tighten this to the actual resources your operator touches in production.
        # Overly broad rules (["*"] on everything) can inadvertently capture
        # read traffic (List, Watch) that you might prefer to route differently.
        - verbs: ["*"]
          apiGroups: ["mygroup.example.com"]
          resources: ["*"]
```
Verifying APF is working
After applying, confirm your operator’s requests are landing in the right priority level:
```bash
# Watch APF metrics — look for your priority level name
kubectl get --raw /metrics | grep apiserver_flowcontrol | grep my-operator-high

# Key metrics to check:
# apiserver_flowcontrol_current_inqueue_requests — requests waiting in queue
# apiserver_flowcontrol_current_executing_requests — requests actively being served
# apiserver_flowcontrol_rejected_requests_total — requests dropped (429'd)
# apiserver_flowcontrol_request_wait_duration_seconds — time spent in queue
```
If current_inqueue_requests is consistently non-zero, your concurrency share is too small — increase nominalConcurrencyShares. If rejected_requests_total is climbing, your queue depth (queues × queueLengthLimit) is being exhausted under burst — increase queueLengthLimit or nominalConcurrencyShares.
Common misconfiguration: routing reads and writes together
Your operator issues two categories of API requests with very different throughput profiles:
- Reads (Get, List, Watch) — high volume, cache-backed, cheap on etcd
- Writes (Create, Update, Patch, Delete) — lower volume, always hit etcd, expensive
A single PriorityLevelConfiguration for both means a flood of List/Watch requests can exhaust queue slots and delay your writes. For operators managing O(10k+) CRs, consider splitting into two schemas — one for reads with a broad queueLengthLimit, one for writes with a higher nominalConcurrencyShares:
```yaml
# Write-path priority level — higher concurrency share, tighter queue
# (writes are fewer but more critical to reconcile throughput)
nominalConcurrencyShares: 30
queuing:
  queues: 16
  queueLengthLimit: 30   # fail fast if write slots are exhausted

# Read-path priority level — lower concurrency share, larger queue buffer
# (reads are more tolerant of queuing delay)
nominalConcurrencyShares: 10
queuing:
  queues: 16
  queueLengthLimit: 100  # absorb read bursts without dropping
```
etcd Write Amplification
Every API write translates to an etcd write. etcd’s performance is bounded by disk fsync latency — typically 1–10ms on SSDs. At high write rates, etcd becomes the bottleneck long before your operator or API server saturates CPU.
Signs of etcd pressure:
- etcd_disk_backend_commit_duration_seconds P99 climbing
- API server latency increasing without an obvious load spike on the operator side
- 429 responses despite low operator QPS
Mitigations: minimize writes per reconcile (see Write Amplification), prefer Server-Side Apply to reduce conflict retries, and batch status updates.
Reconcile Loop Performance & Latency
Measure First
Instrument your reconcile duration as a histogram before optimizing anything:
```go
var reconcileDuration = prometheus.NewHistogramVec(
    prometheus.HistogramOpts{
        Name: "operator_reconcile_duration_seconds",
        Help: "Duration of reconcile loop iterations",
        // Buckets starting at 1ms, doubling 15 times: 1ms → 2ms → 4ms → ... → 16s.
        // This range covers both fast cache-hit reconciles (sub-10ms) and
        // slow outliers with external calls or retries (multi-second).
        // Adjust the upper bound if your reconciles regularly exceed 16s.
        Buckets: prometheus.ExponentialBuckets(0.001, 2, 15),
    },
    // "controller" lets you reuse this metric across multiple controllers
    // in the same operator binary. "result" splits success vs. error so
    // you can compute error rate directly from this histogram without a
    // separate counter.
    []string{"controller", "result"},
)

func (r *MyReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
    // Capture the wall-clock start time before any work begins — including
    // the cache read in r.reconcile. This measures total Reconcile() cost,
    // not just business logic.
    start := time.Now()

    result, err := r.reconcile(ctx, req)

    // Separate success and error observations so you can alert on error rate
    // and compare latency distributions between healthy and failing reconciles.
    // A rising P99 on the "error" label often indicates retry storms or slow
    // external dependencies, while a rising P99 on "success" points to logic cost.
    resultLabel := "success"
    if err != nil {
        resultLabel = "error"
    }
    reconcileDuration.WithLabelValues("mycontroller", resultLabel).
        Observe(time.Since(start).Seconds())

    return result, err
}
```
Target: P99 reconcile duration should fit within your drain-time budget from the throughput model. If your P99 is 10× your P50, you have outlier reconciles worth investigating — they disproportionately consume worker slots and inflate your effective throughput calculation.
Common Latency Culprits
1. Unconditional API reads
The controller-runtime Client is cache-backed by default — r.Client.Get and r.Client.List read from the local informer cache, not the API server. The exception is r.APIReader, which bypasses the cache and hits the API server directly. Use APIReader only when you need a strongly consistent read and are willing to pay the latency and QPS cost.
Be aware of cache staleness: the cache reflects the last List+Watch state. Under normal operation, staleness is bounded by network RTT plus informer processing time — typically milliseconds, not the resync period. Resync is a correctness backstop, not a freshness mechanism.
Correctness invariant: Reads from the cache are eventually consistent. Design reconcile logic to tolerate stale reads — never assume a read reflects the latest write you just made in the same reconcile pass. If you write an object and immediately read it back, you may see the pre-write state. This is a common source of subtle off-by-one reconcile cycles.
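A concrete illustration of that invariant — a sketch, not a prescription: after a write, keep working with the object the client just populated for you rather than re-reading it through the cache.

```go
// The controller-runtime client updates `obj` in place with the server's
// response (new resourceVersion included), so there is no need to read it back.
if err := r.Update(ctx, obj); err != nil {
    return ctrl.Result{}, err
}

// ❌ Don't re-Get here — the informer cache may still hold the pre-write state.
// _ = r.Get(ctx, req.NamespacedName, obj)

// ✅ Either keep using `obj` as-is, or return and let the watch event for your
// own write drive the next reconcile pass.
return ctrl.Result{}, nil
```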
2. Sequential writes when parallel is possible
```go
// ❌ Sequential — each write blocks the next
for _, child := range children {
    if err := r.Create(ctx, child); err != nil {
        return ctrl.Result{}, err
    }
}
```
Replace with bounded concurrency using errgroup:
```go
// ✅ Bounded parallel writes
g, gctx := errgroup.WithContext(ctx)
sem := make(chan struct{}, 10) // max 10 concurrent API calls per reconcile
for _, child := range children {
    child := child
    g.Go(func() error {
        sem <- struct{}{}
        defer func() { <-sem }()
        return r.Create(gctx, child)
    })
}
if err := g.Wait(); err != nil {
    return ctrl.Result{}, err
}
```
Note: parallelism within a reconcile amplifies API server pressure — bound the semaphore relative to your global QPS budget, not just local convenience.
3. Expensive status computation
If computing .Status requires listing many child objects, structure your owned-object watches so status can be derived from events rather than re-enumerated on every reconcile. Computing aggregated status by listing 100 child Pods on every reconcile, at 50 reconciles/sec, is 5,000 list operations per second — expensive in CPU and allocations even against the local cache.
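One way to keep that child enumeration cheap is an owner-reference field index, so the per-reconcile child lookup is an indexed cache query rather than a full scan. A sketch — the index key name and the "MyResource" kind check are assumptions for illustration:

```go
const ownerKey = ".metadata.controller" // assumed index key name

// At manager setup: index Pods by their controlling owner's name.
if err := mgr.GetFieldIndexer().IndexField(context.Background(), &corev1.Pod{}, ownerKey,
    func(obj client.Object) []string {
        owner := metav1.GetControllerOf(obj.(*corev1.Pod))
        if owner == nil || owner.Kind != "MyResource" {
            return nil
        }
        return []string{owner.Name}
    },
); err != nil {
    return err
}

// In Reconcile: an indexed lookup instead of filtering the whole Pod cache.
var children corev1.PodList
if err := r.List(ctx, &children,
    client.InNamespace(req.Namespace),
    client.MatchingFields{ownerKey: req.Name},
); err != nil {
    return ctrl.Result{}, err
}
```

Field indexing gets a fuller treatment under Cache & Informer Efficiency below.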
4. Unnecessary requeues
ctrl.Result{RequeueAfter: 30 * time.Second} on a healthy, stable object is self-inflicted load. At 10,000 CRs with a 30-second requeue, that’s ~333 reconciles/sec of purely unnecessary work — consuming a large fraction of your effective QPS budget to accomplish nothing.
```go
if myObj.Status.Phase == PhasePending {
    return ctrl.Result{RequeueAfter: 10 * time.Second}, nil
}
// Stable — let watches drive the next reconcile
return ctrl.Result{}, nil
```
Rate Limiting & Queue Depth
How controller-runtime Rate Limiting Works
controller-runtime uses a workqueue.RateLimiter on the internal work queue. By default it uses BucketRateLimiter combined with ItemExponentialFailureRateLimiter, composed via MaxOf:

An item waits for whichever limiter imposes the longer delay. A high global QPS cap doesn’t help an item in exponential backoff from repeated errors. A low global cap throttles everything regardless of per-item error state. Size each independently.
Customizing the Rate Limiter
The default global QPS cap is 10 — correct for a kubebuilder tutorial, wrong for a production operator:
```go
ctrl.NewControllerManagedBy(mgr).
    For(&myv1.MyResource{}).
    WithOptions(controller.Options{
        MaxConcurrentReconciles: 50,
        RateLimiter: workqueue.NewMaxOfRateLimiter(
            // Per-item exponential backoff: 5ms base, 500s max
            // (default 1000s max is often too long — tune to your recovery SLO)
            workqueue.NewItemExponentialFailureRateLimiter(
                5*time.Millisecond,
                500*time.Second,
            ),
            // Global token bucket: ~cfg.QPS / avg_writes_per_reconcile
            &workqueue.BucketRateLimiter{
                Limiter: rate.NewLimiter(rate.Limit(100), 200),
            },
        ),
    }).
    Complete(r)
```
Sizing guidance: Set rate.Limit to approximately cfg.QPS / avg_writes_per_reconcile. If each reconcile issues 5 API writes on average and cfg.QPS = 500, the rate limiter ceiling should be around 100 reconciles/sec.
Rate Limiting Topology: One Queue vs. Many

A single global queue means one noisy tenant can starve all others. At multi-tenant platform scale, per-tenant or per-namespace queues with independent rate limiters give you isolation. This requires stepping outside controller-runtime’s default queue management — significant complexity, but necessary at the upper end of scale.
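A sketch of what per-tenant isolation can look like without replacing the queue itself: a custom rate limiter that keeps one token bucket per namespace and composes it with the usual per-item failure backoff. The type, the per-tenant QPS values, and namespace-as-tenant are all illustrative assumptions, not controller-runtime API (imports assumed: sync, time, golang.org/x/time/rate, k8s.io/client-go/util/workqueue, sigs.k8s.io/controller-runtime/pkg/reconcile).

```go
// perNamespaceRateLimiter satisfies workqueue.RateLimiter and can be plugged
// into controller.Options.RateLimiter.
type perNamespaceRateLimiter struct {
    mu       sync.Mutex
    failures workqueue.RateLimiter    // shared per-item failure backoff
    buckets  map[string]*rate.Limiter // one token bucket per namespace
}

func newPerNamespaceRateLimiter() *perNamespaceRateLimiter {
    return &perNamespaceRateLimiter{
        failures: workqueue.NewItemExponentialFailureRateLimiter(5*time.Millisecond, 500*time.Second),
        buckets:  map[string]*rate.Limiter{},
    }
}

func (p *perNamespaceRateLimiter) bucketFor(ns string) *rate.Limiter {
    p.mu.Lock()
    defer p.mu.Unlock()
    if b, ok := p.buckets[ns]; ok {
        return b
    }
    b := rate.NewLimiter(rate.Limit(10), 20) // illustrative per-tenant cap
    p.buckets[ns] = b
    return b
}

// When mirrors MaxOfRateLimiter semantics: the item waits for whichever of the
// failure backoff and its tenant's token bucket imposes the longer delay.
func (p *perNamespaceRateLimiter) When(item interface{}) time.Duration {
    backoff := p.failures.When(item)
    ns := ""
    if req, ok := item.(reconcile.Request); ok {
        ns = req.Namespace
    }
    if delay := p.bucketFor(ns).Reserve().Delay(); delay > backoff {
        return delay
    }
    return backoff
}

func (p *perNamespaceRateLimiter) Forget(item interface{}) { p.failures.Forget(item) }

func (p *perNamespaceRateLimiter) NumRequeues(item interface{}) int {
    return p.failures.NumRequeues(item)
}
```

Note this caps each tenant’s dispatch rate but still shares one physical queue; fully independent per-tenant queues require custom queue construction beyond this sketch.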
Queue Depth as a Signal
| Queue Depth Behavior | Diagnosis |
|---|---|
| Spikes during deploys, drains quickly | Normal — burst absorbed correctly |
| Consistently high, drains slowly | Workers undersized or rate limiter too tight |
| Slowly growing without bound | Queue leak — unconditional requeues |
| Flat at zero | Under-watched or predicates too aggressive |
```go
// controller-runtime exposes workqueue_depth automatically.
// For custom queues, instrument explicitly:
prometheus.MustRegister(prometheus.NewGaugeFunc(
    prometheus.GaugeOpts{
        Name: "operator_queue_depth",
        Help: "Current depth of the reconcile work queue",
    },
    func() float64 { return float64(r.Queue.Len()) },
))
```
Work Queue Backpressure Strategies
The Fundamental Invariant
Reconcile must be idempotent and level-triggered.
If this invariant is violated, no scaling technique — rate limiting, sharding, caching — will save you. They will merely obscure the correctness problem until load reveals it.
This is not a style preference — it is a correctness requirement that the entire controller model depends on.
Why Level-Triggered Matters at Scale
The work queue deduplicates by key: if CR foo/bar is enqueued 50 times while a worker is busy, only one reconcile fires when the worker is free. Intermediate states are invisible.
Failure mode of edge-triggered reconcile:
```
Event 1: CR spec.replicas = 3 → reconcile enqueued
Event 2: CR spec.replicas = 5 → deduplicated (same key, not yet processed)
Worker picks up the key → reconcile sees spec.replicas = 5
If the reconciler acted on "what changed" from Event 1, it reconciles toward 3.
→ Permanent drift
```
```go
// ❌ Edge-triggered — broken at scale
func (r *Reconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
    if lastEvent.Type == EventTypeDelete {
        return r.handleDelete(ctx, req)
    }
    // ...
}

// ✅ Level-triggered — always correct
func (r *Reconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
    var obj myv1.MyResource
    if err := r.Get(ctx, req.NamespacedName, &obj); err != nil {
        if apierrors.IsNotFound(err) {
            return ctrl.Result{}, nil // Object gone — nothing to do
        }
        return ctrl.Result{}, err
    }
    // Always reconcile toward obj's current spec, not any specific event
    return r.reconcileCurrentState(ctx, &obj)
}
```
The Thundering Herd
A cluster upgrade rolling 500 nodes simultaneously produces 5,000+ reconcile events. Without backpressure, that burst lands on the work queue, the rate limiter, and the API server all at once.

Key insight: deduplication is what makes this survivable. Without it, 5,000 events become 5,000 reconciles in flight simultaneously.
Backoff Tuning for Transient Errors
| Error Type | Strategy |
|---|---|
| API server 429 | Fixed delay, do not count toward per-item failure backoff |
| Dependency not ready | Moderate fixed backoff + watch-driven requeue when dependency changes |
| Network timeout | Return err — let ItemExponentialFailureRateLimiter handle backoff |
| Validation/terminal error | Set status condition, return nil — no requeue |
```go
func (r *Reconciler) reconcileCurrentState(ctx context.Context, obj *myv1.MyResource) (ctrl.Result, error) {
    dep, err := r.ensureDependency(ctx, obj)
    if err != nil {
        if apierrors.IsTooManyRequests(err) {
            // Honor API server backpressure — fixed delay, not counted as item failure
            return ctrl.Result{RequeueAfter: 5 * time.Second}, nil
        }
        if isTerminalError(err) {
            r.setFailedStatus(ctx, obj, err)
            return ctrl.Result{}, nil // No requeue
        }
        // Transient — return err, let ItemExponentialFailureRateLimiter back off
        return ctrl.Result{}, err
    }
    _ = dep
    return ctrl.Result{}, nil
}
```
Note on Retry-After: apierrors.IsTooManyRequests detects 429 status, but the actual Retry-After response header is not surfaced by client-go’s error types — parsing it requires intercepting the raw HTTP response. A conservative fixed delay (5–10s) is a practical approximation for most operators.
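If you do want the header, one option — a sketch, not part of client-go — is to wrap the REST transport and record Retry-After whenever a 429 comes back (imports assumed: net/http, strconv, sync/atomic):

```go
// retryAfterRecorder remembers the most recent Retry-After hint from a 429.
type retryAfterRecorder struct {
    next http.RoundTripper
    last atomic.Int64 // seconds suggested by the last 429, 0 if none seen yet
}

func (t *retryAfterRecorder) RoundTrip(req *http.Request) (*http.Response, error) {
    resp, err := t.next.RoundTrip(req)
    if err == nil && resp.StatusCode == http.StatusTooManyRequests {
        if secs, parseErr := strconv.Atoi(resp.Header.Get("Retry-After")); parseErr == nil {
            t.last.Store(int64(secs))
        }
    }
    return resp, err
}

// Install on the rest.Config before building the manager; the reconciler can
// then consult recorder.last when choosing its fixed 429 delay.
recorder := &retryAfterRecorder{}
cfg.Wrap(func(rt http.RoundTripper) http.RoundTripper {
    recorder.next = rt
    return recorder
})
```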
Generation & ResourceVersion Gating
Use .metadata.generation to skip reconciles triggered by status-only updates (which increment resourceVersion but not generation):
```go
func (r *Reconciler) SetupWithManager(mgr ctrl.Manager) error {
    return ctrl.NewControllerManagedBy(mgr).
        For(&myv1.MyResource{}, builder.WithPredicates(
            predicate.GenerationChangedPredicate{},
        )).
        Complete(r)
}
```
Caution: GenerationChangedPredicate applies only to the primary watched type. Owned child objects don’t have a meaningful generation in the same sense — use ResourceVersionChangedPredicate for those, or write custom predicates that filter on semantically meaningful field changes.
Write Amplification
Write amplification is one of the most insidious scaling killers in production operators, and it’s rarely discussed explicitly.
The Amplification Chain
A single CR spec change cascades into many API writes: child creates and updates, status writes, conflict retries, plus the watch events each of those writes emits — which can themselves enqueue further reconciles.

At scale: 1 CR change → 10–50 API writes is not unusual. At 100 reconciles/sec, that’s 1,000–5,000 API writes/sec — enough to saturate etcd on standard hardware.
Mitigations
1. Use Server-Side Apply (SSA)
SSA lets the API server merge fields rather than requiring a full object replace, eliminating resourceVersion conflict retries:
```go
obj := &appsv1.Deployment{
    TypeMeta: metav1.TypeMeta{
        APIVersion: "apps/v1",
        Kind:       "Deployment",
    },
    ObjectMeta: metav1.ObjectMeta{
        Name:      "my-deployment",
        Namespace: req.Namespace,
    },
    Spec: desiredSpec,
}
if err := r.Patch(ctx, obj, client.Apply,
    client.FieldOwner("my-operator"),
    client.ForceOwnership,
); err != nil {
    return ctrl.Result{}, err
}
```
2. Avoid no-op updates
Before writing, check if the current state already matches desired. This is especially important for status updates, which trigger watch events and may cascade into additional reconciles:
```go
if !equality.Semantic.DeepEqual(obj.Status, desiredStatus) {
    obj.Status = desiredStatus
    if err := r.Status().Update(ctx, obj); err != nil {
        return ctrl.Result{}, err
    }
}
```
3. Batch status updates
If your reconcile has multiple stages that each want to update status, defer all status writes to the end of the reconcile. One status write per reconcile, not one per stage.
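One pattern for this — a sketch assuming your status type has a generated DeepCopy, as CRD types normally do — is to let every stage mutate obj.Status in memory and defer a single, change-gated write to the end of Reconcile:

```go
func (r *Reconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
    var obj myv1.MyResource
    if err := r.Get(ctx, req.NamespacedName, &obj); err != nil {
        return ctrl.Result{}, client.IgnoreNotFound(err)
    }

    // Snapshot status before any stage runs.
    original := obj.Status.DeepCopy()
    defer func() {
        // One write per reconcile, and only if something actually changed.
        if !equality.Semantic.DeepEqual(original, &obj.Status) {
            if err := r.Status().Update(ctx, &obj); err != nil {
                log.FromContext(ctx).Error(err, "deferred status update failed")
            }
        }
    }()

    // ... stages mutate obj.Status.Phase, obj.Status.Conditions, etc. ...
    return ctrl.Result{}, nil
}
```

The deferred write swallows its error into a log line; if a failed status write must trigger a retry, hoist the update back into the return path instead.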
4. Minimize status churn
Avoid high-cardinality or frequently-changing data in status fields — sub-second timestamps, counters that increment on every reconcile. Every status change triggers a resourceVersion bump → watch event → potential reconcile enqueue. This is a feedback loop that generates load proportional to CR count × reconcile frequency.
Finalizers at Scale
Finalizers are one of the most underestimated scaling problems in operator design. They turn deletes into reconciles — which is exactly the right abstraction for cleanup, but carries significant throughput implications at scale.
How Finalizers Become a Throughput Problem
Without a finalizer, deleting a CR removes it immediately. With a finalizer, the API server sets metadata.deletionTimestamp and leaves the object in a Terminating state; your operator must complete cleanup and remove the finalizer — only then does the object disappear. This means:
```
Mass delete of 1,000 CRs
→ 1,000 objects enter Terminating state simultaneously
→ 1,000 reconciles enqueued (all triggered by the deletionTimestamp update)
→ Each reconcile does cleanup work (API calls, external system calls)
→ Each successful cleanup writes a finalizer patch + triggers a watch event
→ Cleanup throughput bounded by reconcile throughput
```
If cleanup is slow (external API calls, database writes, cascading child deletions), the Terminating backlog can grow faster than it drains — leaving hundreds of CRs stuck in Terminating indefinitely.
Failure Modes
- Slow cleanup blocking the queue: Finalizer cleanup that blocks on slow external systems holds worker slots, reducing effective reconcile throughput for all other CRs
- Non-idempotent cleanup: If cleanup is not safe to re-run, a partial cleanup followed by a reconcile loop requeue can cause double-delete or data corruption
- Permanent leaks: A bug in cleanup logic (an unhandled error path that never removes the finalizer) leaves objects stuck in Terminating forever — they accumulate silently
- External dependency unavailability: If your cleanup calls an external system that’s down during a mass delete, every CR in Terminating stays stuck until the system recovers
Mitigations
1. Make cleanup idempotent and bounded in time
Cleanup must be safe to run multiple times and must not block indefinitely:
```go
func (r *Reconciler) handleDeletion(ctx context.Context, obj *myv1.MyResource) (ctrl.Result, error) {
    if !controllerutil.ContainsFinalizer(obj, myFinalizer) {
        return ctrl.Result{}, nil // Already cleaned up
    }

    // Cleanup must be idempotent — safe to call multiple times
    if err := r.cleanupExternalResources(ctx, obj); err != nil {
        if isExternalSystemUnavailable(err) {
            // Don't block the queue — requeue with backoff
            return ctrl.Result{RequeueAfter: 30 * time.Second}, nil
        }
        return ctrl.Result{}, err
    }

    // Remove finalizer only after cleanup succeeds
    controllerutil.RemoveFinalizer(obj, myFinalizer)
    if err := r.Update(ctx, obj); err != nil {
        return ctrl.Result{}, err
    }
    return ctrl.Result{}, nil
}
```
2. Use status conditions for long-running cleanup
If cleanup takes more than a few seconds, track progress in status rather than blocking the worker:
```go
type MyResourceStatus struct {
    // ...
    CleanupPhase    string `json:"cleanupPhase,omitempty"`    // e.g., "DeletingChildren", "DeletingExternalResources", "Complete"
    CleanupProgress int    `json:"cleanupProgress,omitempty"` // e.g., children deleted so far
}
```
This gives operators visibility into cleanup state and allows partial progress to survive pod restarts.
3. Avoid blocking on external systems synchronously
If cleanup requires calling an external API that might be slow or unavailable, use an async pattern: issue the external call, record that it was issued in status, and requeue to verify completion:
```go
if obj.Status.CleanupPhase == "" {
    // Phase 1: Issue async cleanup request
    if err := r.issueExternalCleanup(ctx, obj); err != nil {
        return ctrl.Result{}, err
    }
    obj.Status.CleanupPhase = "Pending"
    r.Status().Update(ctx, obj)
    return ctrl.Result{RequeueAfter: 10 * time.Second}, nil
}

if obj.Status.CleanupPhase == "Pending" {
    // Phase 2: Poll for completion
    done, err := r.checkExternalCleanupComplete(ctx, obj)
    if err != nil || !done {
        return ctrl.Result{RequeueAfter: 10 * time.Second}, nil
    }
    // Cleanup done — remove finalizer
}
```
4. Monitor terminating object count
Terminating object count is a first-class operational metric. A growing Terminating count is a signal of a cleanup bottleneck or bug:
```go
// Expose via a periodic background scan
terminatingCount := prometheus.NewGauge(prometheus.GaugeOpts{
    Name: "operator_terminating_objects_total",
    Help: "Number of CRs currently in Terminating state",
})

// In a background goroutine or periodic reconcile:
var list myv1.MyResourceList
r.List(ctx, &list)
count := 0
for _, obj := range list.Items {
    if !obj.DeletionTimestamp.IsZero() {
        count++
    }
}
terminatingCount.Set(float64(count))
```
Alert when this value grows beyond a threshold that implies the cleanup queue cannot drain within your deletion SLO.
Cache & Informer Efficiency
How the Informer Cache Works
controller-runtime wraps client-go’s informer cache. Every object type your operator interacts with is listed and watched at startup, then maintained in an in-memory indexed store. Reads (r.Client.Get, r.Client.List) hit this store; writes go to the API server.

Cache Memory Footprint
At scale, cache memory becomes a hard constraint on operator pod sizing:
```
Approximate memory per cached object: 1–3 KB
(depends on CRD size — large specs with many fields cost more)

Example: 50k CRs + 200k child objects (Deployments, Services, ConfigMaps)
       = 250k objects × 2 KB avg
       = ~500 MB for the cache alone

Add: controller-runtime overhead, goroutine stacks, Go runtime GC headroom
Realistic total operator memory budget at this scale: 1–2 GB
```
Implications:
- Set pod memory limits accordingly — OOMKill during a thundering herd is a correctness failure, not just an availability event
- Avoid watching resource types you don’t need
- Prefer narrow watches (namespace-scoped, label-selected) over cluster-wide watches for high-cardinality resources like Pods or Events
Cache Scope: Cluster-wide vs. Namespaced
By default, the cache indexes all namespaces. Scope it when possible:
```go
// Namespace-scoped cache
mgr, err := ctrl.NewManager(cfg, ctrl.Options{
    Cache: cache.Options{
        DefaultNamespaces: map[string]cache.Config{
            "team-a": {},
            "team-b": {},
        },
    },
})

// Label-selector scoped cache for high-cardinality types
mgr, err := ctrl.NewManager(cfg, ctrl.Options{
    Cache: cache.Options{
        ByObject: map[client.Object]cache.ByObject{
            &corev1.Pod{}: {
                Label: labels.SelectorFromSet(labels.Set{
                    "managed-by": "my-operator",
                }),
            },
        },
    },
})
```
Selective Field Indexing
Without field indexes, r.List with arbitrary field filters scans the full cache in O(n). At 50k objects, that’s expensive in CPU and allocations on every reconcile:
```go
// Register at startup
const clusterRefField = ".spec.clusterRef"

if err := mgr.GetFieldIndexer().IndexField(
    context.Background(),
    &myv1.MyResource{},
    clusterRefField,
    func(obj client.Object) []string {
        r := obj.(*myv1.MyResource)
        if r.Spec.ClusterRef == "" {
            return nil
        }
        return []string{r.Spec.ClusterRef}
    },
); err != nil {
    return err
}

// O(1) lookup in reconcile
var relatedResources myv1.MyResourceList
if err := r.List(ctx, &relatedResources,
    client.InNamespace(req.Namespace),
    client.MatchingFields{clusterRefField: clusterName},
); err != nil {
    return ctrl.Result{}, err
}
```
Watch Predicate Filtering
Every event on a watched resource type enqueues a reconcile. Without filtering, a highly active child resource (e.g., Pods with frequent annotation churn) generates enormous spurious load:
```go
ctrl.NewControllerManagedBy(mgr).
    For(&myv1.MyResource{}).
    Owns(&corev1.Pod{}, builder.WithPredicates(predicate.Funcs{
        UpdateFunc: func(e event.UpdateEvent) bool {
            oldPod := e.ObjectOld.(*corev1.Pod)
            newPod := e.ObjectNew.(*corev1.Pod)
            // Only trigger on phase transitions, not annotation/label churn
            return oldPod.Status.Phase != newPod.Status.Phase
        },
        CreateFunc:  func(e event.CreateEvent) bool { return true },
        DeleteFunc:  func(e event.DeleteEvent) bool { return true },
        GenericFunc: func(e event.GenericEvent) bool { return false },
    })).
    Complete(r)
```
Resync Period
The informer resync period triggers event handlers for all cached objects, which results in a full re-enqueue of reconcile keys — it’s a correctness safety net. The cache is kept fresh continuously via the Watch stream; resync is a belt-and-suspenders mechanism that fires reconciles for any object whose real state may have drifted from what the operator last acted on.
At 50,000 CRs with the default 10-minute resync, every 10 minutes you flood your queue with 50,000 reconcile enqueues — a self-inflicted thundering herd. Extend the period and compensate with explicit correctness mechanisms:
```go
mgr, err := ctrl.NewManager(cfg, ctrl.Options{
    Cache: cache.Options{
        SyncPeriod: ptr.To(6 * time.Hour),
    },
})
```
Correctness mechanisms to substitute for frequent resyncs:
- Generation-based status conditions with lastTransitionTime
- Observation timestamps in status, with alerting when they fall behind SLO
- Periodic controller-level health scans that target specific stale-detection patterns
Leader Election & Operator Sharding
Why Leader Election Isn’t Enough Alone
controller-runtime’s built-in leader election ensures only one replica actively reconciles — this gives you HA (fast failover) but not horizontal scale. A single leader is still bottlenecked to one process.
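For reference, enabling the built-in election is a few manager options — the ID and namespace values here are placeholders:

```go
mgr, err := ctrl.NewManager(cfg, ctrl.Options{
    LeaderElection:          true,
    LeaderElectionID:        "my-operator.mygroup.example.com", // placeholder lock name
    LeaderElectionNamespace: "my-operator-ns",
    // Release the lock on clean shutdown so a replacement leader takes over quickly.
    LeaderElectionReleaseOnCancel: true,
})
```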

Sharding Strategies
Strategy 1: Label-based shard assignment
A shard-assignment controller (or webhook) stamps each CR with a shard label. Each operator shard watches only its assigned subset:
```go
shardID := os.Getenv("SHARD_ID")

ctrl.NewControllerManagedBy(mgr).
    For(&myv1.MyResource{}, builder.WithPredicates(predicate.Funcs{
        CreateFunc: func(e event.CreateEvent) bool {
            return e.Object.GetLabels()["operator-shard"] == shardID
        },
        UpdateFunc: func(e event.UpdateEvent) bool {
            return e.ObjectNew.GetLabels()["operator-shard"] == shardID
        },
        DeleteFunc: func(e event.DeleteEvent) bool {
            return e.Object.GetLabels()["operator-shard"] == shardID
        },
        GenericFunc: func(e event.GenericEvent) bool { return false },
    })).
    Complete(r)
```
Strategy 2: Hash-based sharding (no label needed)
```go
func belongsToShard(obj client.Object, shardID, totalShards int) bool {
    key := fmt.Sprintf("%s/%s", obj.GetNamespace(), obj.GetName())
    h := fnv.New32a()
    h.Write([]byte(key))
    return int(h.Sum32())%totalShards == shardID
}

UpdateFunc: func(e event.UpdateEvent) bool {
    return belongsToShard(e.ObjectNew, r.shardID, r.totalShards)
},
```
Sharding Strategy Trade-offs
| Dimension | Label-based | Hash-based |
|---|---|---|
| Operational complexity | Higher (label management, assignment controller) | Lower |
| Stability on rescale | Stable (labels don’t change) | Rehash — assignments shift |
| Hot partition risk | Possible if labels are distributed unevenly | Low (hash distributes uniformly) |
| Split-brain risk during rescale | Low | Higher — transient dual ownership |
| Recommendation | Prefer if shard count changes frequently | Prefer for fixed or slow-changing shard count |
Sharding Failure Modes
- Split-brain: Multiple shards reconcile the same CR simultaneously. Happens during rescale windows with hash-based sharding, or if label propagation lags. Mandatory mitigation: idempotent reconcile.
- Rehash storms: Increasing shard count causes a large fraction of CRs to be transiently owned by two shards. Use two-phase expansion.
- Hot partitions: If your CR namespace/name distribution is non-uniform, a hash-based shard can become overwhelmed. Monitor per-shard queue depth independently.
Two-Phase Shard Expansion
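One way to make an expansion two-phase, building on belongsToShard above — the old/new shard counts and the cut-over mechanism are assumptions, not a prescribed design. In phase one, every replica knows both the old and the new shard count and reconciles only objects it owns under both mappings; objects whose owner changes are paused, trading a brief reconcile pause for zero dual ownership. Phase two is the cut-over: once every replica runs with the new count, the old mapping is dropped and the paused objects resume under their new owner.

```go
// ownsDuringTransition: while an expansion is in flight (oldTotal != newTotal),
// reconcile only objects this shard owns under BOTH mappings. The two counts
// would be distributed to every replica by whatever mechanism carries your
// shard configuration.
func ownsDuringTransition(obj client.Object, shardID, oldTotal, newTotal int) bool {
    if oldTotal == newTotal {
        return belongsToShard(obj, shardID, newTotal)
    }
    return belongsToShard(obj, shardID, oldTotal) && belongsToShard(obj, shardID, newTotal)
}
```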

Shard Auto-scaling with HPA
```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: my-operator-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-operator
  minReplicas: 3
  maxReplicas: 20
  metrics:
    - type: Pods
      pods:
        metric:
          name: operator_queue_depth_per_shard
        target:
          type: AverageValue
          averageValue: "500"
```
Warning: HPA-driven shard rescaling requires that your shard assignment logic handles count changes safely. Test rehash windows explicitly — a scale-out event during a thundering herd is the worst time to discover split-brain.
When Not to Shard
Sharding adds significant operational complexity. Before reaching for it, verify that the actual bottleneck is single-process saturation:
- API server QPS is the bottleneck, not CPU: Adding shards multiplies your write throughput but also multiplies your API server load. If etcd is already under pressure, more shards make it worse.
- Reconcile latency is high due to inefficient logic: Sharding distributes a slow reconcile across more pods — it doesn’t make each reconcile faster. Fix the logic first.
- Queue depth is low but lag is high: Low queue depth with high lag means workers are spending time doing expensive work per reconcile, not that there are too many items. More workers won’t help; faster reconciles will.
- You haven’t profiled MaxConcurrentReconciles headroom: A single process with well-tuned concurrency and rate limiting can handle tens of thousands of CRs in many real workloads. Benchmark before sharding.
Sharding is the right tool when a single operator process is demonstrably CPU-saturated or API-server-QPS-saturated at maximum sensible worker count. Not before.
Observability & SLOs
Define the SLO First
Without an SLO, metrics are noise. Define what “working correctly” means before instrumenting:
Example SLO: “99% of CRs are reconciled to match their desired spec within 30 seconds of a spec change, measured over any 1-hour rolling window.”
This single statement immediately implies:
- You need to measure reconcile lag (time from spec change to successful reconcile completion — not just reconcile duration)
- You need to track it at the 99th percentile, not the average
- You need error budget alerting when the burn rate is unsustainable
Measuring Reconcile Lag
Reconcile duration and reconcile lag are not the same thing.
- Reconcile duration: How long a single call to Reconcile() takes
- Reconcile lag: Time from spec change to the completion of a successful reconcile for that generation
Lag is what your SLO should be based on. A fast reconcile function that runs infrequently produces low duration but high lag.
Implementation approach:
- Set status.observedGeneration to metadata.generation on each successful reconcile
- Record status.lastReconciledTime as the completion timestamp
- Expose a metric: time.Since(lastReconciledTime) for objects where observedGeneration < generation
```go
// In your status type
type MyResourceStatus struct {
    ObservedGeneration int64       `json:"observedGeneration,omitempty"`
    LastReconciledTime metav1.Time `json:"lastReconciledTime,omitempty"`
    // ...
}

// At the end of a successful reconcile
obj.Status.ObservedGeneration = obj.Generation
obj.Status.LastReconciledTime = metav1.Now()
if err := r.Status().Update(ctx, obj); err != nil {
    return ctrl.Result{}, err
}
```
A scraper or a controller-level probe can then compute lag across all CRs:
```go
// Computed metric: max lag across all CRs where observedGeneration < generation
lagSeconds := time.Since(obj.Status.LastReconciledTime.Time).Seconds()
if obj.Status.ObservedGeneration < obj.Generation {
    reconcileLag.WithLabelValues(obj.Name).Set(lagSeconds)
}
```
This is the signal your SLO alert fires on — not reconcile_duration_seconds.
Derived Alerts from the SLO
| Alert | Threshold | Rationale |
|---|---|---|
| queue_depth > 1000 for 5+ min | Tunable per shard | Queue not draining — SLO at risk |
| reconcile_p99_duration > 20s | 2/3 of SLO budget | Latency headroom shrinking |
| reconcile_error_rate > 5% for 10m | Burn rate | Error budget burning fast |
| workqueue_longest_running_processor > 60s | Fixed | Stuck reconcile — goroutine leak candidate |
Key Metrics
| Metric | Type | Why It Matters |
|---|---|---|
| reconcile_duration_seconds | Histogram | Primary throughput & latency signal |
| reconcile_errors_total | Counter | Error rate by type |
| workqueue_depth | Gauge | Backpressure health |
| workqueue_longest_running_processor_seconds | Gauge | Stuck reconcile detection |
| rest_client_requests_total | Counter | API server write amplification |
| workqueue_retries_total | Counter | Requeue storm detection |
Wire up controller-runtime’s default metric server:
```go
mgr, err := ctrl.NewManager(cfg, ctrl.Options{
    Metrics: server.Options{
        BindAddress: ":8080",
    },
})
```
controller-runtime registers workqueue_* metrics automatically via client-go’s workqueue instrumentation. Verify they appear at /metrics before building dashboards around them.
Structured Per-Stage Reconcile Logging
For operators managing complex multi-step reconciles, stage-level breadcrumbs let you pinpoint which stage is slow or failing without requiring distributed tracing infrastructure:
```go
func (r *Reconciler) reconcileCurrentState(ctx context.Context, obj *myv1.MyResource) (ctrl.Result, error) {
    log := log.FromContext(ctx).WithValues(
        "resource", obj.Name,
        "generation", obj.Generation,
        "resourceVersion", obj.ResourceVersion,
    )

    stages := []struct {
        name string
        fn   func() error
    }{
        {"ensure-configmap", func() error { return r.ensureConfigMap(ctx, obj) }},
        {"ensure-deployment", func() error { return r.ensureDeployment(ctx, obj) }},
        {"sync-status", func() error { return r.syncStatus(ctx, obj) }},
    }

    for _, stage := range stages {
        start := time.Now()
        if err := stage.fn(); err != nil {
            log.Error(err, "stage failed", "stage", stage.name, "elapsed", time.Since(start))
            return ctrl.Result{}, err
        }
        log.V(1).Info("stage complete", "stage", stage.name, "elapsed", time.Since(start))
    }
    return ctrl.Result{}, nil
}
```
Load & Failure Testing
A scaling design is only as good as its tests. These scenarios are expected at staff level before a high-scale operator goes to production.
Test Matrix
| Scenario | What It Validates |
|---|---|
| Mass CR creation (10k objects in 60s) | Queue drain time, worker saturation, memory spike |
| Mass CR deletion | Finalizer cleanup throughput, Terminating backlog, orphan prevention |
| Finalizer cleanup with external system down | Cleanup backoff, non-blocking behavior, Terminating count stability |
| Node rolling restart (thundering herd) | Backpressure, deduplication, event-collapse correctness |
| API server throttling (inject synthetic 429s) | Backoff behavior, recovery time, queue stability |
| Slow dependency (sleep in reconcile) | Worker starvation, P99 impact, queue depth growth |
| Operator pod restart mid-reconcile | Level-triggered correctness, no duplicate side effects |
| Shard rescale (increase shard count) | Rehash window safety, split-brain absence |
Load Generator
```bash
# Mass CR creation — 10k objects, batched to avoid fork-bomb
for i in $(seq 1 10000); do
  kubectl apply -f - <<YAML &
apiVersion: mygroup.example.com/v1
kind: MyResource
metadata:
  name: load-test-$i
  namespace: default
spec:
  replicas: 1
YAML
  if (( i % 100 == 0 )); then wait; fi
done
```
Validating Idempotency
Run the same reconcile twice on an already-reconciled object and verify zero API writes on the second pass:
```go
func TestReconcileIsIdempotent(t *testing.T) {
    // First reconcile — sets up state
    _, err := reconciler.Reconcile(ctx, req)
    require.NoError(t, err)

    // Capture baseline action count
    baselineActions := len(fakeClient.Actions())

    // Second reconcile — nothing changed
    _, err = reconciler.Reconcile(ctx, req)
    require.NoError(t, err)

    // Count writes issued on second pass
    secondPassActions := fakeClient.Actions()[baselineActions:]
    writes := 0
    for _, a := range secondPassActions {
        if a.GetVerb() != "get" && a.GetVerb() != "list" {
            writes++
        }
    }
    assert.Zero(t, writes, "reconcile must be idempotent: second pass should issue no writes")
}
```
What to Measure Under Load
- Queue depth over time — should reach steady-state, not grow unbounded
- P99 reconcile duration under load vs. at rest — the delta is your burst cost
- Operator pod memory at peak CR count — should be stable, not growing (GC pressure)
- API server request rate — should never exceed the cfg.QPS ceiling
- Error rate — should be zero under normal load, non-zero but bounded under injected failures
Common Anti-Patterns
These patterns appear regularly in operators that don’t scale. Each is a correctness or performance cliff that reveals itself under load.
Edge-triggered reconcile logic
Acting on event type or delta rather than current desired state. Correct at low scale where deduplication rarely collapses events; catastrophic at high scale where it’s constant.
Blind RequeueAfter on stable objects
RequeueAfter on every healthy object is self-inflicted polling. At 10k CRs with a 60-second poll, that’s ~167 reconciles/sec of pure no-op work saturating worker capacity.
Ignoring cfg.QPS and cfg.Burst
Leaving client-go at its default 5 QPS/10 burst means your operator is throttled by its own HTTP client. You won’t see 429s — you’ll see mysteriously slow reconciles as the client queues requests internally.
Listing the entire cluster on every reconcile
r.List(&pods) without namespace or label selectors walks the full cache. At 100k Pods, this is expensive in CPU and allocations — and usually unnecessary if field indexes are registered.
Writing status unconditionally
Updating status even when nothing changed creates a resourceVersion bump → watch event → potential reconcile enqueue. This is a feedback loop that generates load proportional to CR count × reconcile frequency.
Unbounded parallel API calls
errgroup without a semaphore means a reconcile processing 500 children fires 500 concurrent API writes. At 50 concurrent reconciles, that’s 25,000 simultaneous requests — enough to saturate an API server.
Treating resync as a freshness mechanism
The informer cache is kept fresh by the Watch stream, not by resync. Resync is a correctness backstop. A short resync period doesn’t improve freshness — it floods the queue at regular intervals.
No per-shard queue depth monitoring
Without this, hot partitions are invisible until they produce latency SLO violations.
Using APIReader everywhere
Bypassing the cache with r.APIReader for all reads turns your operator into a direct API client on every reconcile. You lose all caching benefits and impose proportionally higher load on the API server. Use APIReader only for strongly-consistent reads where cache staleness is genuinely unacceptable — and treat each such call as spending from your QPS budget.
Putting It All Together: A Scaling Checklist
Throughput & Capacity
- [ ] Throughput model calculated: min(workers/avg_duration, rate_limiter_QPS, api_server_capacity)
- [ ] Drain time under thundering herd estimated and within SLO budget
- [ ] cfg.QPS and cfg.Burst explicitly set (not left at client-go defaults of 5/10)
- [ ] APF FlowSchema + PriorityLevelConfiguration created for operator ServiceAccount
Reconcile Loop
- [ ] Reconcile is fully level-triggered — no dependency on event type or delta
- [ ] Reconcile is idempotent — second pass on already-reconciled object issues zero writes
- [ ] Reconcile duration P99 measured and within SLO budget
- [ ] Parallel API writes bounded by a semaphore sized relative to cfg.QPS
- [ ] No unconditional RequeueAfter on stable/healthy objects
- [ ] Terminal errors set status condition and return nil (no requeue)
Finalizers
- [ ] Finalizer cleanup is idempotent — safe to run multiple times
- [ ] Cleanup does not block workers on slow external systems — uses backoff + requeue
- [ ] Long-running cleanup progress tracked in status conditions, not just in-memory
- [ ] operator_terminating_objects_total metric exposed and alerted on
- [ ] Mass deletion load tested: Terminating count drains within deletion SLO
Write Amplification
- [ ] Server-Side Apply used for owned resource management
- [ ] Status updates guarded by deep equality check before writing
- [ ] Status writes batched to once per reconcile pass
- [ ] High-churn fields (counters, sub-second timestamps) absent from status
Rate Limiting & Queue
- [ ] Custom RateLimiter set: global QPS ≈ cfg.QPS / avg_writes_per_reconcile
- [ ] Per-item backoff max tuned to recovery SLO (not left at 1000s default)
- [ ] Queue depth exposed as Prometheus metric and alerted on
- [ ] Queue depth diagnostic table understood and SLO breakpoints defined
Backpressure
- [ ] GenerationChangedPredicate applied to primary resource
- [ ] Child-resource watch predicates filter to meaningful transitions only
- [ ] 429 responses handled with fixed delay (not counted as item failures)
- [ ] Thundering herd load tested: queue drains within SLO window, memory stable
Cache & Informers
- [ ] Cache scoped to relevant namespaces / label selectors
- [ ] Field indexes registered for all non-trivial List queries
- [ ] Resync period extended (≥ 1 hour at scale) with explicit correctness compensations
- [ ] Cache memory profiled at full CR count — pod limits set accordingly
- [ ] Unnecessary resource type watches removed
Sharding
- [ ] Sharding strategy chosen with trade-offs documented (label-based vs. hash-based)
- [ ] MaxConcurrentReconciles profiled under full load per shard
- [ ] Rehash window tested for split-brain absence
- [ ] Per-shard queue depth monitored independently
Observability & SLOs
- [ ] Operator SLO defined: “P99 of CRs reconciled within Xs of spec change”
- [ ] Reconcile lag metric instrumented (from spec change timestamp, not just reconcile duration)
- [ ] Queue depth + P99 duration + error rate alerts wired to SLO burn rate
- [ ] Structured per-stage reconcile logging in place
Load & Failure Testing
- [ ] Mass CR creation tested: queue drains within SLO window
- [ ] Mass CR deletion tested: Terminating backlog drains, no permanent leaks
- [ ] Finalizer cleanup with external system down tested: operator remains stable
- [ ] Thundering herd tested: rolling node restart simulation
- [ ] API server 429 injection tested: operator recovers without manual intervention
- [ ] Operator pod restart mid-reconcile tested: no duplicate side effects
- [ ] Shard rescale tested if applicable: no split-brain during rehash window
Closing Thoughts
Scaling an operator is not about adding workers — it’s about respecting system limits:
- The API server is finite. Your operator shares it with the entire cluster. The default client-go QPS settings assume you don’t.
- The queue is lossy by design. Events collapse. Intermediate states disappear. Your reconciler must treat this as a feature, not a bug.
- Writes amplify. One CR change becomes dozens of API calls becomes etcd fsyncs. At scale, every unnecessary write is a tax paid by every other controller in the cluster.
- Deletes are reconciles. Finalizers turn mass deletions into a throughput problem. A cleanup bottleneck can leave thousands of objects permanently Terminating.
If your reconcile loop is level-triggered, idempotent, and backpressure-aware, scaling becomes a tuning problem — and tuning problems are solved with data.
If not, scaling becomes a correctness problem. No rate limiter, no shard count, no HPA configuration will save a reconciler that acts on events instead of state.
Build the loop correctly first. Measure everything. Tune from data.
All code examples use controller-runtime ≥ v0.17 and Go 1.21+. APF configuration targets Kubernetes ≥ 1.26 (GA flowcontrol.apiserver.k8s.io/v1).