Health Checks
StackBlaze only routes traffic to pods that pass health checks. A new deployment is not marked live until its readiness probe succeeds. If a running pod fails its liveness probe, Kubernetes restarts it automatically. This gives you zero-downtime deploys and automatic recovery from crashes.
Health checks are configured per service. Readiness and liveness probes run continuously throughout the pod's lifetime; the startup probe runs only while the pod is starting. There are three probe types: readiness (is the pod ready for traffic?), liveness (is the pod still healthy?), and startup (for slow-starting services).
Probe types
Readiness probe
HTTP GET to your health check path, checked every 10 seconds. A 200–299 response means the pod is ready to receive traffic. Any other response or a timeout counts as a failed check; after 3 consecutive failures the pod is removed from the load balancer endpoints until it recovers.
Liveness probe
HTTP GET to your health check path. If the pod fails 3 consecutive liveness checks, Kubernetes kills and restarts the pod. Catches processes that are stuck or deadlocked but haven't crashed. Checked every 15 seconds after startup.
Startup probe
Runs on pod start only. The liveness probe is paused until the startup probe succeeds. Configure a grace period of up to 300 seconds for services with slow initialization (JVM warmup, loading ML models, etc.).
Health endpoint examples
Node.js (Express)
const express = require('express');
const app = express();

// `db` is assumed to be your database client, initialized elsewhere.

// Minimal health check, always returns 200
app.get('/health', (req, res) => {
  res.json({ status: 'ok', uptime: process.uptime() });
});

// Advanced: check DB before declaring ready
app.get('/health/ready', async (req, res) => {
  try {
    await db.query('SELECT 1');
    res.json({ status: 'ready', db: 'connected' });
  } catch {
    // Any non-2xx status tells the probe this pod is not ready
    res.status(503).json({ status: 'unhealthy', db: 'disconnected' });
  }
});
Python (FastAPI)
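from fastapi import FastAPI, HTTPException
from sqlalchemy import text

app = FastAPI()

# `db` is assumed to be an async SQLAlchemy session or connection created elsewhere.


@app.get("/health")
async def health_check():
    try:
        # Cheap round trip to confirm the database is reachable
        await db.execute(text("SELECT 1"))
        return {"status": "ok"}
    except Exception as e:
        # 503 marks this check as failed
        raise HTTPException(status_code=503, detail=str(e))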
Under the hood
- readinessProbe: Kubernetes runs an HTTP GET against your configured path every 10 seconds. If 3 consecutive checks fail, the pod is removed from the Service's Endpoints object, stopping all traffic to it. It is re-added automatically when checks start passing again.
- livenessProbe: runs on the same path every 15 seconds with a timeout of 10 seconds. After 3 consecutive failures, kubelet sends SIGTERM to the container (followed by SIGKILL if it does not exit within its termination grace period) and starts a new one. This catches infinite loops, deadlocks, and other unresponsive states that don't cause a crash.
- startupProbe: during the grace period, only the startup probe runs (liveness is paused). This prevents premature kills during slow initialization. The startup probe polls every 10 seconds up to the configured failure threshold (gracePeriodSeconds / 10 attempts). For example, a 300-second grace period allows 30 attempts.
- Rolling deploy gating: during a rolling update, new pods must pass their readiness probe before the old pods are scaled down. If a new pod never becomes ready, the rolling update stalls and the old version continues serving traffic. StackBlaze times out stalled deploys after 10 minutes and marks them failed.
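The consecutive-failure rule and the startup attempt budget described above can be modeled in a few lines of Python. This is an illustrative sketch only, not how Kubernetes or StackBlaze implement probes: `ProbeTracker` and `startup_attempts` are hypothetical names, the numbers are the ones quoted in this section, and rounding up for grace periods that are not a multiple of 10 is an assumption.

```python
import math
from dataclasses import dataclass


@dataclass
class ProbeTracker:
    """Tracks consecutive probe failures (hypothetical helper for illustration)."""

    failure_threshold: int = 3
    consecutive_failures: int = 0

    def record(self, success: bool) -> bool:
        """Record one probe result; return True once the threshold is reached."""
        if success:
            self.consecutive_failures = 0  # any success resets the streak
        else:
            self.consecutive_failures += 1
        return self.consecutive_failures >= self.failure_threshold


def startup_attempts(grace_period_seconds: int, period_seconds: int = 10) -> int:
    """Startup probe attempts that fit in the grace period (rounding up is assumed)."""
    return math.ceil(grace_period_seconds / period_seconds)


if __name__ == "__main__":
    tracker = ProbeTracker()
    # Two failures, a success that resets the streak, then three failures in a row.
    results = [False, False, True, False, False, False]
    print([tracker.record(ok) for ok in results])
    print(startup_attempts(300))
```

In this model only the final result in the sequence trips the threshold, because the single success in the middle resets the failure count.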
Step by step
Implement a /health endpoint
Add a GET /health route to your application that returns HTTP 200. At minimum it can return an empty 200 response. Optionally include checks for database connectivity, cache availability, or any other critical dependencies your app relies on.
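As a rough sketch of the optional dependency checks, the following uses only the Python standard library. The names `check_database`, `CHECKS`, `evaluate_health`, and `serve`, as well as port 8080, are assumptions made for illustration; swap the placeholder check for a real query against your own database client.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer
from typing import Callable, Dict, Tuple


def check_database() -> bool:
    """Placeholder dependency check; replace with a real query such as SELECT 1."""
    return True


# Hypothetical registry of dependency checks; add cache, queue, etc. as needed.
CHECKS: Dict[str, Callable[[], bool]] = {"db": check_database}


def evaluate_health(checks: Dict[str, Callable[[], bool]]) -> Tuple[int, dict]:
    """Run every check; return 200 if all pass, otherwise 503."""
    results = {}
    for name, check in checks.items():
        try:
            results[name] = bool(check())
        except Exception:
            results[name] = False  # a check that raises counts as a failure
    healthy = all(results.values())
    status = 200 if healthy else 503
    return status, {"status": "ok" if healthy else "unhealthy", "checks": results}


class HealthHandler(BaseHTTPRequestHandler):
    def do_GET(self) -> None:
        if self.path != "/health":
            self.send_error(404)
            return
        status, body = evaluate_health(CHECKS)
        payload = json.dumps(body).encode("utf-8")
        self.send_response(status)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(payload)))
        self.end_headers()
        self.wfile.write(payload)


def serve(port: int = 8080) -> None:
    """Start a blocking HTTP server that answers GET /health."""
    HTTPServer(("0.0.0.0", port), HealthHandler).serve_forever()
```

Calling `serve()` starts the server. Keeping the check logic in `evaluate_health` separate from the HTTP handler makes it easy to unit test without opening a socket.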
Configure the health check path
Go to Service → Settings → Health Check. The default path is /health. Change it to match your endpoint (e.g. /api/health or /status). You can also configure the timeout (default 10s, max 60s) and initial delay for slow-starting services.
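Before changing the path or timeout, it can help to exercise the endpoint locally in roughly the way a probe would. The sketch below is an approximation using Python's standard library, not the actual probe client: `probe` is a hypothetical helper, and unlike the success rule described earlier, `urllib` follows redirects automatically before the final status is checked.

```python
import http.client
import urllib.request


def probe(url: str, timeout_seconds: float = 10.0) -> bool:
    """Return True if GET `url` answers with a 2xx status within the timeout."""
    try:
        with urllib.request.urlopen(url, timeout=timeout_seconds) as response:
            return 200 <= response.status <= 299
    except (OSError, http.client.HTTPException):
        # 4xx/5xx responses raise HTTPError (an OSError subclass); timeouts,
        # refused connections, and malformed responses also count as failures.
        return False
```

For example, `probe("http://localhost:8080/health", timeout_seconds=10)` checks a locally running service against the default timeout; the host and port here are placeholders.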
Configure startup grace period if needed
For services that take a long time to start (e.g. JVM services loading large datasets), set a startup probe grace period of up to 300 seconds. This prevents Kubernetes from killing a slow-starting pod before it has a chance to become ready.
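One common way to pair a slow start with health checks is to have the health route report failure until initialization finishes. The sketch below illustrates that pattern under stated assumptions: `StartupGate` and `slow_initialization` are hypothetical names, and the `time.sleep` call stands in for real warmup work such as loading a model.

```python
import threading
import time
from typing import Tuple


class StartupGate:
    """Reports 503 until initialization has been marked complete."""

    def __init__(self) -> None:
        self._ready = threading.Event()

    def mark_ready(self) -> None:
        self._ready.set()

    def wait_until_ready(self) -> None:
        self._ready.wait()

    def health_status(self) -> Tuple[int, dict]:
        """Status code and body the health route should return right now."""
        if self._ready.is_set():
            return 200, {"status": "ok"}
        return 503, {"status": "starting"}


def slow_initialization(gate: StartupGate, load_seconds: float) -> None:
    """Stand-in for slow startup work; marks the gate ready when done."""
    time.sleep(load_seconds)
    gate.mark_ready()


if __name__ == "__main__":
    gate = StartupGate()
    worker = threading.Thread(target=slow_initialization, args=(gate, 0.2), daemon=True)
    worker.start()
    print(gate.health_status())  # reported while initialization is still running
    gate.wait_until_ready()
    print(gate.health_status())  # reported after initialization has finished
```

Running initialization in a background thread lets the HTTP server start answering immediately, so checks during the grace period get a fast 503 rather than a timeout.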
Monitor health check status
Open Service → Overview to see the current health status of each replica. Green indicates passing, red indicates failing. Click on a failing replica to view its recent health check response bodies and identify what is causing the failure.