
How we built zero-downtime deployments for every service type

Rolling deploys sound simple until you try to do them correctly across web services, background workers, cron jobs, and stateful databases simultaneously. Here's the full architecture and what we learned the hard way.

Sarah Kim

Co-founder & CTO

April 8, 2026 · 12 min read

When we first started building StackBlaze, "zero-downtime deployments" felt like a solved problem. Kubernetes has rolling updates built in. How hard could it be?

Very hard, it turns out, but only if you care about getting it right for every service type, not just stateless web processes. After eighteen months of running production workloads for hundreds of teams, here is the full architecture we landed on and everything we learned along the way.

The naive approach: kill-and-replace

The first version of our deployment system did what most people do when they're moving fast: scale the new ReplicaSet up, wait for pods to become Ready, then scale the old one down. On a quiet Tuesday afternoon with a single stateless web service, this works fine. In production, it breaks in at least three different ways simultaneously.

First, in-flight requests get dropped. The moment the old pod receives SIGTERM, it stops accepting new connections. But if your load balancer is still routing to it (and it often is, because iptables rules propagate asynchronously across nodes), those requests see a connection reset. Second, your database migration runs before the old pods are gone, so you now have two different code versions talking to a schema that only one of them understands. Third, your background worker picks up a job, gets killed mid-flight, and that job silently disappears unless you configured retries or a dead-letter queue (DLQ).

Rolling updates with readiness probes

Kubernetes rolling updates fix the availability problem, but only if your readiness probes are actually meaningful. We've seen teams deploy with readiness probes that just check whether the process is running, not whether it's actually ready to serve traffic. That's the same as having no readiness probe at all.

A useful readiness probe verifies that your app has completed its startup sequence: database connections are pooled, caches are warm, feature flags are loaded. For HTTP services, we require a dedicated /healthz/ready endpoint that checks all of this, separate from a /healthz/live endpoint that just returns 200 as long as the process hasn't deadlocked.
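To make the split between the two endpoints concrete, here is a minimal sketch in Python using only the standard library. It is illustrative, not our production handler: the `CHECKS` table, the `readiness` helper, and the `serve` function are hypothetical names, and the lambdas stand in for real database, cache, and feature-flag checks.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer
from typing import Callable, Dict, Tuple

# Hypothetical dependency checks. A real service would ping its connection
# pool, confirm the cache is warm, and confirm feature flags are loaded.
CHECKS: Dict[str, Callable[[], bool]] = {
    "database": lambda: True,
    "cache": lambda: True,
    "feature_flags": lambda: True,
}


def readiness(checks: Dict[str, Callable[[], bool]]) -> Tuple[int, str]:
    """Return (status, body): 200 only when every dependency check passes."""
    failed = []
    for name, check in checks.items():
        try:
            ok = check()
        except Exception:
            ok = False  # a check that raises counts as a failed check
        if not ok:
            failed.append(name)
    if failed:
        return 503, "not ready: " + ", ".join(failed)
    return 200, "ready"


class HealthHandler(BaseHTTPRequestHandler):
    def do_GET(self) -> None:
        if self.path == "/healthz/live":
            # Liveness only proves the process can still answer requests.
            status, body = 200, "alive"
        elif self.path == "/healthz/ready":
            status, body = readiness(CHECKS)
        else:
            status, body = 404, "not found"
        payload = body.encode()
        self.send_response(status)
        self.send_header("Content-Type", "text/plain")
        self.send_header("Content-Length", str(len(payload)))
        self.end_headers()
        self.wfile.write(payload)


def serve(port: int = 3000) -> None:
    """Blocking entry point; the port matches the probe config in this post."""
    HTTPServer(("0.0.0.0", port), HealthHandler).serve_forever()
```

Keeping the dependency checks out of the liveness path matters: a database outage should pull pods out of rotation, not trigger a restart loop.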

deployment.yaml

apiVersion: apps/v1
kind: Deployment
metadata:
  name: web
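spec:
  selector:                      # required by apps/v1; the label is an example
    matchLabels:
      app: web
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1
      maxUnavailable: 0          # never remove a pod before the new one is ready
  template:
    metadata:
      labels:
        app: web                 # must match spec.selector.matchLabels
    spec: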
      terminationGracePeriodSeconds: 60
      containers:
        - name: app
          image: your-image:tag
          readinessProbe:
            httpGet:
              path: /healthz/ready
              port: 3000
            initialDelaySeconds: 5
            periodSeconds: 5
            failureThreshold: 3
          livenessProbe:
            httpGet:
              path: /healthz/live
              port: 3000
            initialDelaySeconds: 15
            periodSeconds: 10
          lifecycle:
            preStop:
              exec:
                command: ["sleep", "5"]   # let iptables drain before SIGTERM

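The preStop sleep is not a hack; it is the correct way to handle the iptables propagation delay. When a pod is marked for deletion, there is a race between the endpoint controller removing the pod from the Service endpoints (and every node updating its iptables rules to match) and the pod receiving SIGTERM. Kubernetes runs the preStop hook before it sends SIGTERM, so a 5-second sleep gives the load balancer time to stop sending new traffic before your app starts shutting down. The sleep counts against terminationGracePeriodSeconds, which leaves 55 of the 60 seconds above for the app's own shutdown, and the exec form requires a sleep binary in the container image.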

Set maxUnavailable: 0

With maxUnavailable set to 0 and maxSurge set to 1, Kubernetes will always bring up a new pod before terminating an old one. This means you temporarily run at N+1 capacity during a deploy, which is usually fine and always safer than running at N-1.
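The capacity arithmetic can be sketched as a small Python helper. The function names are hypothetical; the rounding rules follow the documented Kubernetes behavior, where a percentage maxSurge rounds up and a percentage maxUnavailable rounds down.

```python
import math
from typing import Tuple, Union

IntOrPercent = Union[int, str]


def _resolve(value: IntOrPercent, replicas: int, round_up: bool) -> int:
    """Resolve an absolute pod count or a percentage string such as "25%"."""
    if isinstance(value, str) and value.endswith("%"):
        exact = replicas * int(value[:-1]) / 100
        return math.ceil(exact) if round_up else math.floor(exact)
    return int(value)


def rollout_bounds(
    replicas: int, max_surge: IntOrPercent, max_unavailable: IntOrPercent
) -> Tuple[int, int]:
    """Return (min_available, max_total) pods during a rolling update."""
    surge = _resolve(max_surge, replicas, round_up=True)
    unavailable = _resolve(max_unavailable, replicas, round_up=False)
    if surge == 0 and unavailable == 0:
        # Kubernetes rejects this combination: the rollout could never progress.
        raise ValueError("maxSurge and maxUnavailable cannot both be 0")
    return replicas - unavailable, replicas + surge
```

For a 4-replica Deployment with the settings above, `rollout_bounds(4, 1, 0)` gives `(4, 5)`: never fewer than 4 available pods, never more than 5 in total.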

The database migration problem

Rolling updates create an overlap window where both the old and new versions of your code are running simultaneously. That overlap forces a rule onto your database migrations: every migration must be backward-compatible with the previous version of your application.

This sounds obvious until you try to rename a column. You cannot just do ALTER TABLE users RENAME COLUMN email TO email_address in a single migration and deploy it alongside a rolling update. During the overlap window, old pods still read and write email, a column that no longer exists, while new pods use email_address. The old pods' queries fail, and every write they attempted is lost.

The correct approach is the expand-contract pattern: add the new column, deploy code that writes to both columns, backfill the old data, deploy code that reads from and writes to the new column only, then drop the old column in a separate migration on the next deploy. More steps, but every step is safe to run while two versions of the code are live.
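The expand, dual-write, and backfill steps can be walked through against an in-memory SQLite database. This is a sketch under simplifying assumptions: the helper names are hypothetical, SQLite stands in for the production database, and the final contract step is left as a comment because it belongs in a later deploy.

```python
import sqlite3
from typing import List, Tuple


def expand(conn: sqlite3.Connection) -> None:
    """Expand: add the new column. Old code simply ignores it."""
    conn.execute("ALTER TABLE users ADD COLUMN email_address TEXT")


def dual_write(conn: sqlite3.Connection, user_id: int, email: str) -> None:
    """New code during the transition writes both columns."""
    conn.execute(
        "UPDATE users SET email = ?, email_address = ? WHERE id = ?",
        (email, email, user_id),
    )


def backfill(conn: sqlite3.Connection) -> int:
    """Copy rows the dual-write code has not touched yet; returns rows updated."""
    cur = conn.execute(
        "UPDATE users SET email_address = email WHERE email_address IS NULL"
    )
    return cur.rowcount


def demo() -> Tuple[int, List[Tuple[int, str, str]]]:
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT)")
    conn.executemany(
        "INSERT INTO users (id, email) VALUES (?, ?)",
        [(1, "a@example.com"), (2, "b@example.com")],
    )
    expand(conn)
    # Overlap window: a new pod dual-writes row 1, an old pod only knows `email`.
    dual_write(conn, 1, "a2@example.com")
    conn.execute("UPDATE users SET email = ? WHERE id = ?", ("b2@example.com", 2))
    # Backfill runs once the dual-write rollout has finished.
    updated = backfill(conn)
    rows = conn.execute(
        "SELECT id, email, email_address FROM users ORDER BY id"
    ).fetchall()
    # Contract (dropping `email`) happens in a separate, later migration.
    return updated, rows
```

Neither version of the code ever sees a schema it cannot use, which is the whole point of splitting the rename into steps.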

We enforce this at the platform level with a migration linter that runs in CI. It flags any migration that drops a column, renames a column, adds a NOT NULL constraint without a default, or changes a column type in a way that would break the previous schema version.
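A linter of this kind can be approximated with a few regular expressions. The sketch below is not our production linter: the rule set, the `lint_migration` name, and the naive split on semicolons are all assumptions, and it flags every column type change rather than only the breaking ones.

```python
import re
from typing import List

# (message, pattern) pairs for statements that break the previous app version.
RULES = [
    ("drops a column", re.compile(r"\bDROP\s+COLUMN\b", re.I)),
    ("renames a column", re.compile(r"\bRENAME\s+COLUMN\b", re.I)),
    (
        "changes a column type",
        re.compile(r"\bALTER\s+COLUMN\s+\w+\s+(?:SET\s+DATA\s+)?TYPE\b", re.I),
    ),
]
ADD_NOT_NULL = re.compile(r"\bADD\s+COLUMN\b.*\bNOT\s+NULL\b", re.I | re.S)
HAS_DEFAULT = re.compile(r"\bDEFAULT\b", re.I)


def lint_migration(sql: str) -> List[str]:
    """Return one message per backward-incompatible statement found."""
    problems: List[str] = []
    # Naive statement split; a real linter would use a SQL parser.
    for statement in (part.strip() for part in sql.split(";")):
        if not statement:
            continue
        for message, pattern in RULES:
            if pattern.search(statement):
                problems.append(f"{message}: {statement}")
        if ADD_NOT_NULL.search(statement) and not HAS_DEFAULT.search(statement):
            problems.append(f"adds NOT NULL without a default: {statement}")
    return problems
```

In CI, a non-empty result would fail the build, for example by printing the messages and exiting with a non-zero status.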

Background workers and cron jobs

Web services are relatively easy because HTTP is stateless and connections are short-lived. Background workers are harder because they hold onto jobs for extended periods. If you kill a worker mid-job, you need to decide: should the job be retried, or is it safe to drop?

For StackBlaze deployments, we give workers a long grace period, up to 5 minutes by default, to finish in-flight jobs before we send SIGKILL. The worker's job framework is responsible for catching SIGTERM and finishing the current job gracefully before exiting. We also require that all jobs be idempotent, so that a job which does get killed and retried does not cause double-writes or duplicate emails.
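Both requirements, finishing the current job on SIGTERM and tolerating retries, can be sketched in a few lines of Python. `GracefulWorker` is a hypothetical class, not the API of any particular job framework, and the in-memory set stands in for a durable idempotency store.

```python
import signal
import threading
from typing import Callable, Iterable, List, Set


class GracefulWorker:
    """Finishes the in-flight job on SIGTERM and skips already-completed jobs."""

    def __init__(self, handler: Callable[[str], None]) -> None:
        self._handler = handler
        self._stop = threading.Event()
        # Stand-in for a durable store (database table, Redis set, ...).
        self._completed: Set[str] = set()

    def request_stop(self, signum=None, frame=None) -> None:
        """Signal-handler compatible: only sets a flag, never interrupts a job."""
        self._stop.set()

    def install_signal_handlers(self) -> None:
        # Must be called from the main thread.
        signal.signal(signal.SIGTERM, self.request_stop)

    def run(self, jobs: Iterable[str]) -> List[str]:
        processed: List[str] = []
        for job_id in jobs:
            if self._stop.is_set():
                break  # leave the remaining jobs on the queue for another worker
            if job_id in self._completed:
                continue  # idempotency guard: a retried job is a no-op
            self._handler(job_id)  # the in-flight job always runs to completion
            self._completed.add(job_id)
            processed.append(job_id)
        return processed
```

The stop flag is checked only between jobs, so the grace period has to be at least as long as the slowest job you expect to finish.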

Cron jobs are simpler: because they're short-lived Kubernetes Jobs rather than long-running Deployments, they don't go through rolling updates at all. We let the current run finish, then the next run will pick up the new image. The only complication is if a cron job runs for longer than your deploy interval, but that's a problem with the cron job, not the deploy system.

Health check lifecycle

Every StackBlaze service goes through a defined health check lifecycle during a deploy. Understanding this sequence is critical for debugging deploy failures.

Phase	Check	Failure action	Timeout
Startup	startupProbe (httpGet /healthz/live)	Kill and restart pod	30s total (10 retries x 3s)
Ready gate	readinessProbe (/healthz/ready)	Hold pod out of Service	60s before rollback trigger
Steady state	livenessProbe (/healthz/live)	Restart pod in place	No timeout; restarts after 3 consecutive failures
Shutdown	preStop hook + SIGTERM handler	SIGKILL after grace period	60s grace period
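When debugging a failed deploy, it helps to know how long each probe can keep failing before Kubernetes acts. A rough estimate is periodSeconds multiplied by failureThreshold. The sketch below uses a hypothetical `Probe` class and ignores timeoutSeconds and initialDelaySeconds, so treat the results as approximations.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Probe:
    period_seconds: int
    failure_threshold: int = 3  # Kubernetes default when the field is omitted

    def failure_window(self) -> int:
        """Approximate seconds of consecutive failures before the probe trips."""
        return self.period_seconds * self.failure_threshold


# Values taken from the table and the deployment.yaml in this post.
startup = Probe(period_seconds=3, failure_threshold=10)
readiness = Probe(period_seconds=5, failure_threshold=3)
liveness = Probe(period_seconds=10)
```

With these values the startup probe allows about 30 seconds, the readiness probe pulls a pod out of the Service after about 15 seconds of failures, and the liveness probe restarts it after about 30. The 60-second rollback trigger in the table is a platform-level timer, not a probe setting.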

Testing zero-downtime in CI

We test every deploy in CI using a traffic replay harness. The test spins up two versions of the service, triggers a rolling deploy, and simultaneously hammers the service with real requests. Any 5xx response or dropped connection fails the test.

scripts/test-zero-downtime.sh

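#!/usr/bin/env bash
set -euo pipefail

IMAGE_OLD="${1:-}"
IMAGE_NEW="${2:-}"

if [[ -z "$IMAGE_OLD" || -z "$IMAGE_NEW" ]]; then
  echo "Usage: $0 <old-image> <new-image>" >&2
  exit 1
fi

RESULTS="$(mktemp)"
trap 'rm -f "$RESULTS"' EXIT

# Start from the old image so the rollout below is a real old-to-new transition
kubectl set image deployment/web app="$IMAGE_OLD"
kubectl rollout status deployment/web --timeout=120s

# Start traffic generator in background, writing JSON results to a file.
# Assumes the Service is reachable on localhost:8080 through the CI cluster's
# ingress or load balancer, and that the rollout finishes within --duration.
npx autocannon \
  --connections 10 \
  --duration 60 \
  --json \
  http://localhost:8080/healthz/ready > "$RESULTS" &
TRAFFIC_PID=$!

# Trigger rolling deploy
kubectl set image deployment/web app="$IMAGE_NEW"

# Wait for rollout to complete
kubectl rollout status deployment/web --timeout=120s

# Stop traffic and collect results
wait "$TRAFFIC_PID"

# Fail on any non-2xx response, connection error, or timeout (requires jq;
# field names as reported in autocannon's JSON output)
FAILURES="$(jq '.non2xx + .errors + .timeouts' "$RESULTS")"
if [[ "$FAILURES" -ne 0 ]]; then
  echo "Deploy test failed: $FAILURES failed requests" >&2
  exit 1
fi
echo "Deploy test complete"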

Lessons learned

Readiness probes that check process health but not application health are worthless. Build a real /healthz/ready endpoint.
The iptables drain race is real. Always add a preStop sleep of at least 5 seconds.
Database migrations must be backward-compatible. Enforce this in CI, not in code review.
Set maxUnavailable to 0. The temporary N+1 capacity cost is always worth it.
Test zero-downtime in CI with real traffic. A deploy that "looked fine" in staging can still fail in production under load.
Background worker grace periods matter. Give workers enough time to finish in-flight jobs; don't just send SIGKILL.
Cron jobs and Deployments have different rolling semantics. Don't mix them up.

Zero-downtime deployments are not a single feature; they're a contract between your application code, your deployment configuration, and your database migration strategy. Get all three right, and your users will never notice you shipped.

Sarah Kim

Co-founder & CTO at StackBlaze

Member of the founding team at StackBlaze. Writes about infrastructure, engineering culture, and the systems that keep production running.
