From clickhouse-best-practices
Provides an evidence-based RCA workflow for ClickHouse-managed Postgres performance issues. Scrapes Prometheus metrics and Slow Query Patterns API to identify root causes.
How this skill is triggered — by the user, by Claude, or both
Slash command
/clickhouse-best-practices:clickhouse-managed-postgres-rcaThe summary Claude sees in its skill listing — used to decide when to auto-load this skill
Trigger whenever a user reports slowness, high CPU, low
Trigger whenever a user reports slowness, high CPU, low throughput, cache thrash, or any unexplained pain on a ClickHouse-managed Postgres instance.
Two APIs on https://api.clickhouse.cloud (HTTP Basic auth
using a ClickHouse Cloud API key/secret pair):
postgresInstancePrometheusGet
under the Prometheus tag. Returns Prometheus exposition format.
System and workload metrics for one Postgres service.slowQueryPatternsGetList
under the Postgres tag. Returns per-digest latency, IO, and
call statistics for normalized query patterns. Beta.Both endpoints require an organizationId and a serviceId as
path parameters. The user must supply both, plus the API
key/secret pair.
seq_scan / idx_scan).Reason from IO and timing signals, not from a plan tree.
Six steps, in order. Do not skip ahead.
Steps 2 and 3 only share auth — no data dependency between
them. Run them in parallel (background curls, & + wait) to
cut wall time from sequential ~2s to ~1s.
These endpoints are Beta — paths, params, and JSON field names
can shift. Follow rules/openapi-discovery.md to:
https://api.clickhouse.cloud/v1.operationId:
postgresInstancePrometheusGet (Prometheus tag)slowQueryPatternsGetList (Postgres tag){ semantic role → actual field name }.Use the resolved names in every subsequent request and citation. Never hardcode field names from memory.
Follow rules/prometheus-scrape.md. One scrape, no wait.
You're after gauges (current values) that don't need a delta:
CacheHitRatio, ActiveConnections, MemoryUsedPercent,
FilesystemUsedPercent.
A CacheHitRatio well below ~95% on a workload that should
fit in cache is a real signal on its own. Climbing
ActiveConnections toward the pool ceiling is a real signal
on its own. These don't need rate-of-change.
A second scrape for counter deltas is opt-in, used only when Step 4 triage points at write-congestion (where deadlock and rollback rates matter and the Slow Query Patterns API can't substitute). For the read-path case (the most common RCA shape) the single scrape is enough.
Request the slow query patterns. Follow
rules/slow-query-patterns-fields.md for the fields that
matter and how to read them. This is the primary diagnostic —
it returns per-pattern accumulated totals (call count, runtime,
blocks, rows) over the window you request, which is the
"rate-of-change" data you'd otherwise derive from two Prom
scrapes — but per query and without waiting.
If no patterns return a meaningful totalDurationUs, the
report may be overstated or the issue isn't query-shaped.
Stop and tell the user what you looked at.
Follow rules/triage.md. Match the combined Prom + slow-query
signal to one of the heuristic shapes. Each shape points to a
specific heuristic file:
rules/heuristic-full-scan.md — read-path full scan.rules/heuristic-hot-loop.md — N+1 / hot loop from the app.rules/heuristic-write-congestion.md — deadlocks, slow
writes, high rollback rate.If the signal does not match any shape cleanly, do not invent a hypothesis. Surface the top patterns and ask the user which workload they recognize. New heuristics are welcome as PRs.
Use the format in rules/output-template.md. Always include:
symptom, evidence, hypothesis (noting any alternative cause
you cannot rule out from this surface alone), short-term fix,
and long-term follow-ups.
Follow rules/recommend-only.md. Never run DDL. Never call
pg_cancel_backend or pg_terminate_backend. Write the
recommendation, explain why, and let the human apply it.
For the complete guide with every rule expanded in a single
context load: AGENTS.md.
npx claudepluginhub clickhouse/agent-skills --plugin chdb-sqlTriages and remediates ClickHouse production incidents—downtime, OOM, slow queries, errors—using system tables, SQL, curl pings, and kubectl. For on-call emergencies.
Sets up Grafana Cloud Database Observability for MySQL and PostgreSQL with query-level RED metrics, explain plans, and trace correlation. Use for diagnosing slow queries or connecting DB metrics to APM.
Troubleshoots database performance across PostgreSQL, MySQL, MongoDB, Redis using pg_stat_statements, EXPLAIN ANALYZE, Performance Schema, SLOWLOG, and benchmarks like pgbench, sysbench. For slow queries, locks, deadlocks, I/O, memory issues.