Skill

clickhouse-prod-checklist

Provides ClickHouse production checklist: schema design, server tuning, S3 backups, monitoring, deployment verification. For go-live reviews and audits.

AWS

Popularity

Parent stars

2,203

Parent forks

296

Invocation

How this skill is triggered — by the user, by Claude, or both

Slash command

/clickhouse-pack:clickhouse-prod-checklist

User invocable

Model invocable

Inline context

Default effort

Tool Access

This skill is limited to the following tools:

ReadBash(kubectl:*)Bash(curl:*)Grep

Context Preview

The summary Claude sees in its skill listing — used to decide when to auto-load this skill

Comprehensive go-live checklist for ClickHouse covering server tuning, schema design,

SKILL.md

190 lines · ~1.6k tokens

Stats

LanguagePython

Parent stars2,203

Parent forks296

MaintenanceGood

Last CommitMar 22, 2026

Actions

View Source View Plugin View on GitHub View README

ClickHouse Production Checklist

Overview

Comprehensive go-live checklist for ClickHouse covering server tuning, schema design, backup configuration, monitoring, and operational readiness.

Prerequisites

ClickHouse instance provisioned (Cloud or self-hosted)
Application integration code tested in staging

Checklist

1. Schema & Engine Design

Tables use MergeTree family engines (not Memory, Log, or TinyLog)
ORDER BY columns match primary filter/group patterns
PARTITION BY is coarse (monthly or weekly, never by ID)
TTL configured for data retention policy
LowCardinality(String) used for low-cardinality columns
CODEC(ZSTD) applied to large String/JSON columns
ReplacingMergeTree used with FINAL or dedup logic if upserts needed

2. Server Configuration (Self-Hosted)

<!-- Key production settings in config.xml / users.xml -->

<!-- Memory: set to ~80% of available RAM -->
<max_server_memory_usage_to_ram_ratio>0.8</max_server_memory_usage_to_ram_ratio>

<!-- Query limits -->
<max_concurrent_queries>150</max_concurrent_queries>
<max_memory_usage>10000000000</max_memory_usage>  <!-- 10GB per query -->
<max_execution_time>300</max_execution_time>       <!-- 5 min timeout -->

<!-- Merge settings -->
<background_pool_size>16</background_pool_size>
<background_schedule_pool_size>16</background_schedule_pool_size>

<!-- Logging -->
<query_log>
    <database>system</database>
    <table>query_log</table>
    <flush_interval_milliseconds>7500</flush_interval_milliseconds>
</query_log>

3. Backup Configuration

-- ClickHouse native BACKUP to S3
BACKUP TABLE analytics.events
    TO S3(
        'https://my-bucket.s3.us-east-1.amazonaws.com/backups/events',
        'ACCESS_KEY',
        'SECRET_KEY'
    )
    SETTINGS compression_method = 'zstd';

-- Incremental backup (base + delta)
BACKUP TABLE analytics.events
    TO S3('s3://my-bucket/backups/events-incremental')
    SETTINGS base_backup = S3('s3://my-bucket/backups/events-base');

ClickHouse Cloud: Backups are automatic. Configure retention and frequency in the Cloud console under Service Settings.

Backup schedule configured (daily minimum)
Backup restore tested and documented
Point-in-time recovery possible (incremental backups)
Backup stored in different region/account from primary

4. Monitoring & Alerting

-- Key metrics to monitor
-- 1. Query latency (p50/p95/p99)
-- 2. Active parts per table (alert > 300)
-- 3. Merge queue depth
-- 4. Memory usage
-- 5. Disk space free
-- 6. Replication lag (if clustered)

-- Health check query (use for load balancer probes)
SELECT 1;

-- More thorough health check
SELECT
    (SELECT count() FROM system.processes) AS running_queries,
    (SELECT sum(value) FROM system.metrics WHERE metric = 'MemoryTracking') AS memory_bytes,
    (SELECT count() FROM system.merges) AS active_merges;

Prometheus endpoint configured (/metrics via ClickHouse Exporter or Cloud endpoint)
Grafana dashboard with key panels (QPS, latency, memory, parts, merges)
Alerts on: error rate > 5%, p95 latency > 5s, parts > 300, disk < 20%
Query log monitoring for slow/failed queries

5. Security

Default user password changed or user disabled
Application user has minimal privileges (see clickhouse-security-basics)
TLS enabled (HTTPS on port 8443)
IP allowlist configured
Secrets in environment variables or secret manager (not code)

6. Application Integration

Connection pooling configured (max_open_connections)
Graceful shutdown calls client.close()
Insert batching in place (10K+ rows per INSERT)
Retry logic for transient errors (see clickhouse-rate-limits)
Health check endpoint includes ClickHouse ping

// Health check endpoint
app.get('/health', async (req, res) => {
  try {
    const { success } = await client.ping();
    res.json({ status: success ? 'healthy' : 'degraded', clickhouse: success });
  } catch {
    res.status(503).json({ status: 'unhealthy', clickhouse: false });
  }
});

7. Operational Readiness

Incident runbook documented (see clickhouse-incident-runbook)
On-call escalation path defined
Key rotation procedure documented
Schema migration process in place (see clickhouse-migration-deep-dive)
Load testing completed at expected peak traffic

8. Verification Queries

-- Verify table sizes and part health
SELECT database, table, count() AS parts, sum(rows) AS rows,
       formatReadableSize(sum(bytes_on_disk)) AS size
FROM system.parts WHERE active
GROUP BY database, table ORDER BY sum(bytes_on_disk) DESC;

-- Verify no pending mutations
SELECT * FROM system.mutations WHERE NOT is_done;

-- Verify no replication lag
SELECT database, table, queue_size, inserts_in_queue
FROM system.replicas WHERE queue_size > 0;

Error Handling

Issue	Detection	Action
Parts > 300	Monitoring alert	Review insert patterns, wait for merges
Disk > 80%	Disk alert	Add storage, drop old partitions
Query p95 > 5s	Latency alert	Check `system.query_log` for slow queries
Replication lag	Replica check	Investigate network, merge backlog

Resources

Next Steps

For SDK version upgrades, see clickhouse-upgrade-migration.

clickhouse-prod-checklist

Popularity

Invocation

Tool Access

Context Preview

SKILL.md

clickhouse-prod-checklist

Popularity

Invocation

Tool Access

Context Preview

SKILL.md

ClickHouse Production Checklist

Overview

Prerequisites

Checklist

1. Schema & Engine Design

2. Server Configuration (Self-Hosted)

3. Backup Configuration

4. Monitoring & Alerting

5. Security

6. Application Integration

7. Operational Readiness

8. Verification Queries

Error Handling

Resources

Next Steps

Similar Skills

ClickHouse Production Checklist

Overview

Prerequisites

Checklist

1. Schema & Engine Design

2. Server Configuration (Self-Hosted)

3. Backup Configuration

4. Monitoring & Alerting

5. Security

6. Application Integration

7. Operational Readiness

8. Verification Queries

Error Handling

Resources

Next Steps

Similar Skills