Skill

clickhouse-managed-postgres-rca

Provides an evidence-based RCA workflow for ClickHouse-managed Postgres performance issues. Scrapes Prometheus metrics and Slow Query Patterns API to identify root causes.

ClickHouse

Popularity

Stars

458

Forks

Invocation

How this skill is triggered — by the user, by Claude, or both

Slash command

/clickhouse-best-practices:clickhouse-managed-postgres-rca

User invocable

Model invocable

Inline context

Default effort

Context Preview

The summary Claude sees in its skill listing — used to decide when to auto-load this skill

Trigger whenever a user reports slowness, high CPU, low

Supporting Files

AGENTS.mdREADME.mdmetadata.jsonrules/heuristic-full-scan.mdrules/heuristic-hot-loop.mdrules/heuristic-write-congestion.mdrules/openapi-discovery.mdrules/output-template.mdrules/prometheus-scrape.mdrules/recommend-only.mdrules/slow-query-patterns-fields.mdrules/triage.md

SKILL.md

131 lines · ~1.2k tokens

Stats

LanguageJavaScript

Stars458

Forks28

MaintenanceExcellent

Last CommitJun 8, 2026

Actions

View Source View Plugin View on GitHub View README

ClickHouse Managed Postgres RCA

When to use

Trigger whenever a user reports slowness, high CPU, low throughput, cache thrash, or any unexplained pain on a ClickHouse-managed Postgres instance.

What you have access to

Two APIs on https://api.clickhouse.cloud (HTTP Basic auth using a ClickHouse Cloud API key/secret pair):

Prometheus metrics — operation postgresInstancePrometheusGet under the Prometheus tag. Returns Prometheus exposition format. System and workload metrics for one Postgres service.
Slow Query Patterns — operation slowQueryPatternsGetList under the Postgres tag. Returns per-digest latency, IO, and call statistics for normalized query patterns. Beta.

Both endpoints require an organizationId and a serviceId as path parameters. The user must supply both, plus the API key/secret pair.

What you do NOT have

Query plans / EXPLAIN output.
Per-table scan-type counters (seq_scan / idx_scan).
Autovacuum or last-ANALYZE timestamps.

Reason from IO and timing signals, not from a plan tree.

Workflow

Six steps, in order. Do not skip ahead.

Steps 2 and 3 only share auth — no data dependency between them. Run them in parallel (background curls, & + wait) to cut wall time from sequential ~2s to ~1s.

1. Discover the live API shape

These endpoints are Beta — paths, params, and JSON field names can shift. Follow rules/openapi-discovery.md to:

Fetch the OpenAPI spec from https://api.clickhouse.cloud/v1.
Locate the two operations by operationId:
- postgresInstancePrometheusGet (Prometheus tag)
- slowQueryPatternsGetList (Postgres tag)
Resolve their path templates, required query parameters, and (for the slow-query endpoint) the response schema.
Build a session-scoped role map from the schema property descriptions: { semantic role → actual field name }.

Use the resolved names in every subsequent request and citation. Never hardcode field names from memory.

2. Scrape Prom once for system gauges

Follow rules/prometheus-scrape.md. One scrape, no wait. You're after gauges (current values) that don't need a delta: CacheHitRatio, ActiveConnections, MemoryUsedPercent, FilesystemUsedPercent.

A CacheHitRatio well below ~95% on a workload that should fit in cache is a real signal on its own. Climbing ActiveConnections toward the pool ceiling is a real signal on its own. These don't need rate-of-change.

A second scrape for counter deltas is opt-in, used only when Step 4 triage points at write-congestion (where deadlock and rollback rates matter and the Slow Query Patterns API can't substitute). For the read-path case (the most common RCA shape) the single scrape is enough.

3. Pull top slow query patterns

Request the slow query patterns. Follow rules/slow-query-patterns-fields.md for the fields that matter and how to read them. This is the primary diagnostic — it returns per-pattern accumulated totals (call count, runtime, blocks, rows) over the window you request, which is the "rate-of-change" data you'd otherwise derive from two Prom scrapes — but per query and without waiting.

If no patterns return a meaningful totalDurationUs, the report may be overstated or the issue isn't query-shaped. Stop and tell the user what you looked at.

4. Triage: pick the right heuristic

Follow rules/triage.md. Match the combined Prom + slow-query signal to one of the heuristic shapes. Each shape points to a specific heuristic file:

rules/heuristic-full-scan.md — read-path full scan.
rules/heuristic-hot-loop.md — N+1 / hot loop from the app.
rules/heuristic-write-congestion.md — deadlocks, slow writes, high rollback rate.

If the signal does not match any shape cleanly, do not invent a hypothesis. Surface the top patterns and ask the user which workload they recognize. New heuristics are welcome as PRs.

5. Reason, then recommend

Use the format in rules/output-template.md. Always include: symptom, evidence, hypothesis (noting any alternative cause you cannot rule out from this surface alone), short-term fix, and long-term follow-ups.

6. Do not apply the fix

Follow rules/recommend-only.md. Never run DDL. Never call pg_cancel_backend or pg_terminate_backend. Write the recommendation, explain why, and let the human apply it.

Full Compiled Document

For the complete guide with every rule expanded in a single context load: AGENTS.md.

clickhouse-managed-postgres-rca

Popularity

Invocation

Context Preview

Supporting Files

SKILL.md

clickhouse-managed-postgres-rca

Popularity

Invocation

Context Preview

Supporting Files

SKILL.md

ClickHouse Managed Postgres RCA

When to use

What you have access to

What you do NOT have

Workflow

1. Discover the live API shape

2. Scrape Prom once for system gauges

3. Pull top slow query patterns

4. Triage: pick the right heuristic

5. Reason, then recommend

6. Do not apply the fix

Full Compiled Document

Similar Skills

ClickHouse Managed Postgres RCA

When to use

What you have access to

What you do NOT have

Workflow

1. Discover the live API shape

2. Scrape Prom once for system gauges

3. Pull top slow query patterns

4. Triage: pick the right heuristic

5. Reason, then recommend

6. Do not apply the fix

Full Compiled Document

Similar Skills