From skillry-runtime-and-local-app
Use when you need to add or verify readiness gates, health checks, and startup sequencing.
Slash command
/skillry-runtime-and-local-app:09-startup-health-readiness
This skill ensures that a service does not accept traffic or report itself healthy until all its dependencies are actually ready — database connections established, migrations run, caches warm, external clients initialized. It covers implementing and verifying `/health`, `/ready`, and `/live` endpoints, dependency wait loops, graceful startup sequencing, and Kubernetes/Docker probe configuration.
- readinessProbe / livenessProbe configuration.
- depends_on is not sufficient because the dependency container starts but is not yet accepting connections.
- /health endpoint that distinguishes "process alive" from "ready to serve traffic."
Audit the existing startup sequence. Read the main entry point file and identify the exact order of: config load, database pool creation, migration execution, external client initialization, cache warm-up, and the moment server.listen() / app.run() is called. Note any step that can fail silently or take variable time.
Design the three-endpoint contract (or two, if liveness/readiness overlap). Define clearly:
- GET /live (liveness): returns 200 if the process is not deadlocked and the event loop is running. Should never depend on external systems. Returns 503 if the app is in a shutdown sequence.
- GET /ready (readiness): returns 200 only when all dependencies are reachable and the server is ready to handle real requests. Returns 503 otherwise. Kubernetes stops sending traffic when this returns non-200.
- GET /health (optional combined): returns a JSON body with per-dependency status for human debugging.
Before calling server.listen(), add a retry loop with exponential backoff for each external dependency. Example for Postgres (Node/pg):
    // Resolve after ms milliseconds.
    const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));
    // Probe the database until it answers, doubling the delay after each failure.
    async function waitForDb(pool, retries = 10, delayMs = 1000) {
      for (let i = 0; i < retries; i++) {
        try { await pool.query('SELECT 1'); return; }
        catch (e) { if (i === retries - 1) throw e; await sleep(delayMs * 2**i); }
      }
    }
Cap the total wait time — do not retry indefinitely; let the orchestrator restart the pod.
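With its defaults (10 retries, 1000 ms base delay), the doubling loop above can sleep for roughly 511 seconds in total before giving up. One way to cap that is to precompute the delays against a time budget. The sketch below is only an illustration: `backoffDelays`, `waitFor`, and the 8 s / 30 s limits are hypothetical names and values, not part of the original skill.

```javascript
// Hypothetical helper: builds the sleep durations for a capped exponential
// backoff. All names and limits here are illustrative assumptions.
function backoffDelays(retries, baseMs, maxDelayMs, maxTotalMs) {
  const delays = [];
  let total = 0;
  // `retries` attempts need at most `retries - 1` sleeps between them.
  for (let i = 0; i < retries - 1; i++) {
    const delay = Math.min(baseMs * 2 ** i, maxDelayMs);
    if (total + delay > maxTotalMs) break; // stop once the budget is spent
    delays.push(delay);
    total += delay;
  }
  return delays;
}

const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Calls `probe` once, then once more after each delay; rethrows the last error.
async function waitFor(probe, delays) {
  let lastError;
  for (let attempt = 0; attempt <= delays.length; attempt++) {
    try {
      return await probe();
    } catch (err) {
      lastError = err;
      if (attempt < delays.length) await sleep(delays[attempt]);
    }
  }
  throw lastError;
}
```

Assuming a `pool` like the one in the example above, a call might look like `await waitFor(() => pool.query('SELECT 1'), backoffDelays(10, 1000, 8000, 30000))`, which sleeps at most 23 seconds before the error propagates and the orchestrator can restart the pod.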
Run migrations inside the startup sequence, not as a separate pre-deploy job, unless you have a migration lock strategy. If migrations run inside the app: ensure only one instance runs them (use a DB advisory lock: SELECT pg_try_advisory_lock(12345)). If migrations run as a separate init container or pre-deploy hook, confirm the app checks that migrations are at the expected version before marking itself ready.
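A minimal sketch of the advisory-lock approach, assuming `client` is a single dedicated connection with a pg-style `query(text, params)` method and `runMigrations` is whatever migration runner the app already uses. Both names are placeholders, and `12345` is the example lock ID from the text.

```javascript
// Assumption: `client` is one dedicated connection exposing
// query(text, params) -> Promise<{ rows }>, as a pg Client does.
// Session-level advisory locks belong to the connection that took them,
// so lock, migrate, and unlock must all go through the same client.
async function migrateWithAdvisoryLock(client, runMigrations, lockId = 12345) {
  const res = await client.query(
    'SELECT pg_try_advisory_lock($1::bigint) AS locked',
    [lockId]
  );
  if (!res.rows[0].locked) {
    return false; // another instance holds the lock and is migrating
  }
  try {
    await runMigrations(client);
    return true;
  } finally {
    // Release the lock even if the migration threw.
    await client.query('SELECT pg_advisory_unlock($1::bigint)', [lockId]);
  }
}
```

`pg_try_advisory_lock` returns immediately rather than blocking, so an instance that gets `false` back should probably still confirm the schema is at the expected version before it reports ready.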
Implement the readiness check as a real probe, not a stub. A /ready endpoint that always returns 200 defeats the purpose. At minimum, execute a lightweight query against each dependency:
- SELECT 1
- PING
- GET /_cluster/health?wait_for_status=yellow&timeout=1s
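One possible shape for a readiness check built on probes like these is sketched below. It is framework-agnostic: `readiness` returns a status code and a body that a route handler can send. The function names, the `checks` map, and the 1500 ms default are assumptions made for illustration.

```javascript
// Rejects if `promise` does not settle within `ms` milliseconds.
function withTimeout(promise, ms) {
  let timer;
  const timeout = new Promise((_, reject) => {
    timer = setTimeout(() => reject(new Error('timeout')), ms);
  });
  return Promise.race([promise, timeout]).finally(() => clearTimeout(timer));
}

// `checks` maps a dependency name to a function returning a promise.
// Names and the default timeout are illustrative assumptions.
async function readiness(checks, timeoutMs = 1500) {
  const entries = await Promise.all(
    Object.entries(checks).map(async ([name, check]) => {
      try {
        await withTimeout(check(), timeoutMs);
        return [name, 'ok'];
      } catch {
        // Report only the status; the error itself may contain secrets.
        return [name, 'unavailable'];
      }
    })
  );
  const ready = entries.every(([, status]) => status === 'ok');
  return { statusCode: ready ? 200 : 503, body: Object.fromEntries(entries) };
}
```

A `/ready` handler could then call something like `readiness({ postgres: () => pool.query('SELECT 1') })` and send back the resulting `statusCode` and `body`. The checks run in parallel, so the endpoint's latency is bounded by the slowest check or the timeout.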
Use a short timeout (1-2s) on each probe query so a slow dependency does not cause the health check itself to hang.
SIGTERM: stop accepting new requests, allow in-flight requests to finish (max 30s), then close DB pool and exit. Example (Node):
    process.on('SIGTERM', () => {
      // Stop accepting connections; once in-flight requests finish, close the pool.
      server.close(async () => { await pool.end(); process.exit(0); });
      // Force exit if draining exceeds 30s; unref() keeps this timer from holding the process open.
      setTimeout(() => process.exit(1), 30_000).unref();
    });
Kubernetes sends SIGTERM and then SIGKILL after terminationGracePeriodSeconds — ensure your grace period is shorter than that value.
    livenessProbe:
      httpGet: { path: /live, port: 3000 }
      initialDelaySeconds: 10
      periodSeconds: 15
      failureThreshold: 3
    readinessProbe:
      httpGet: { path: /ready, port: 3000 }
      initialDelaySeconds: 5
      periodSeconds: 5
      failureThreshold: 6
initialDelaySeconds must be longer than the worst-case startup time.
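To sanity-check the probe values above, it can help to work out roughly how long a persistent failure lasts before Kubernetes acts on it. The helper below is a hypothetical back-of-the-envelope calculation; it ignores `timeoutSeconds` and scheduling jitter, so treat the results as approximate.

```javascript
// Approximate seconds of continuous failure before the probe's action fires
// (restart for liveness, removal from service endpoints for readiness).
// Hypothetical helper; ignores timeoutSeconds and scheduling jitter.
function failureWindowSeconds({ periodSeconds, failureThreshold }) {
  return periodSeconds * failureThreshold;
}

const livenessConfig = { initialDelaySeconds: 10, periodSeconds: 15, failureThreshold: 3 };
const readinessConfig = { initialDelaySeconds: 5, periodSeconds: 5, failureThreshold: 6 };

console.log(failureWindowSeconds(livenessConfig));  // 45
console.log(failureWindowSeconds(readinessConfig)); // 30
```

With these numbers a dependency outage shorter than about 30 seconds should not take the pod out of rotation, and the container is restarted only after about 45 seconds of failed liveness checks.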
With DATABASE_URL pointing to a non-existent host, confirm the app retries, logs clearly, and does not mark itself ready. Then restore the correct URL and confirm the app becomes ready without a restart.
- /live endpoint returns 200 without querying external systems
- /ready endpoint performs real lightweight queries against each dependency
- /ready returns 503 with a JSON body describing which dependency failed
- Dependency wait loops complete before server.listen()
- SIGTERM handler closes connections and drains in-flight requests before exit
- livenessProbe.initialDelaySeconds exceeds worst-case startup time
- readinessProbe.failureThreshold * periodSeconds gives enough time for transient recovery

- /health route added as an afterthought that does res.json({ status: 'ok' }) regardless of state: this is a false positive that defeats load balancer routing.
- Migrations running after listen(): the server accepts requests while migrations are in progress, causing schema mismatch errors on the first few requests after deploy.
- terminationGracePeriodSeconds is set to the default 30s but requests can take 60s.

Report must include:
- /live and /ready with per-dependency checks

- Do not set livenessProbe.failureThreshold to 1: a single slow response will cause unnecessary pod restarts.
- /ready or /live response body: return only dependency name and "ok"/"unavailable".
npx claudepluginhub fluxonlab/skillry --plugin skillry-runtime-and-local-app