From qe-framework
Configures monitoring systems, structured logging, Prometheus/Grafana dashboards, alerting rules, and distributed tracing. Use for observability, debugging production issues, load testing, profiling, and capacity planning.
How this skill is triggered — by the user, by Claude, or both
Slash command
/qe-framework:Qmonitoring-expertThe summary Claude sees in its skill listing — used to decide when to auto-load this skill
Observability and performance specialist implementing comprehensive monitoring, alerting, tracing, and performance testing systems.
Observability and performance specialist implementing comprehensive monitoring, alerting, tracing, and performance testing systems.
import pino from 'pino';
const logger = pino({ level: 'info' });
// Good — structured fields, includes correlation ID
logger.info({ requestId: req.id, userId: req.user.id, durationMs: elapsed }, 'order.created');
// Bad — string interpolation, no correlation
console.log(`Order created for user ${userId}`);
import { Counter, Histogram, register } from 'prom-client';
const httpRequests = new Counter({
name: 'http_requests_total',
help: 'Total HTTP requests',
labelNames: ['method', 'route', 'status'],
});
const httpDuration = new Histogram({
name: 'http_request_duration_seconds',
help: 'HTTP request latency',
labelNames: ['method', 'route'],
buckets: [0.05, 0.1, 0.3, 0.5, 1, 2, 5],
});
// Instrument a route
app.use((req, res, next) => {
const end = httpDuration.startTimer({ method: req.method, route: req.path });
res.on('finish', () => {
httpRequests.inc({ method: req.method, route: req.path, status: res.statusCode });
end();
});
next();
});
// Expose scrape endpoint
app.get('/metrics', async (req, res) => {
res.set('Content-Type', register.contentType);
res.end(await register.metrics());
});
import { NodeSDK } from '@opentelemetry/sdk-node';
import { OTLPTraceExporter } from '@opentelemetry/exporter-trace-otlp-http';
import { trace } from '@opentelemetry/api';
const sdk = new NodeSDK({
traceExporter: new OTLPTraceExporter({ url: 'http://jaeger:4318/v1/traces' }),
});
sdk.start();
// Manual span around a critical operation
const tracer = trace.getTracer('order-service');
async function processOrder(orderId) {
const span = tracer.startSpan('order.process');
span.setAttribute('order.id', orderId);
try {
const result = await db.saveOrder(orderId);
span.setStatus({ code: SpanStatusCode.OK });
return result;
} catch (err) {
span.recordException(err);
span.setStatus({ code: SpanStatusCode.ERROR });
throw err;
} finally {
span.end();
}
}
groups:
- name: api.rules
rules:
- alert: HighErrorRate
expr: |
rate(http_requests_total{status=~"5.."}[5m])
/ rate(http_requests_total[5m]) > 0.05
for: 2m
labels:
severity: critical
annotations:
summary: "Error rate above 5% on {{ $labels.route }}"
import http from 'k6/http';
import { check, sleep } from 'k6';
export const options = {
stages: [
{ duration: '1m', target: 50 }, // ramp up
{ duration: '5m', target: 50 }, // sustained load
{ duration: '1m', target: 0 }, // ramp down
],
thresholds: {
http_req_duration: ['p(95)<500'], // 95th percentile < 500 ms
http_req_failed: ['rate<0.01'], // error rate < 1%
},
};
export default function () {
const res = http.get('https://api.example.com/orders');
check(res, { 'status is 200': (r) => r.status === 200 });
sleep(1);
}
Load detailed guidance based on context:
| Topic | Reference | Load When |
|---|---|---|
| Logging | references/structured-logging.md | Pino, JSON logging |
| Metrics | references/prometheus-metrics.md | Counter, Histogram, Gauge |
| Tracing | references/opentelemetry.md | OpenTelemetry, spans |
| Alerting | references/alerting-rules.md | Prometheus alerts |
| Dashboards | references/dashboards.md | RED/USE method, Grafana |
| Performance Testing | references/performance-testing.md | Load testing, k6, Artillery, benchmarks |
| Profiling | references/application-profiling.md | CPU/memory profiling, bottlenecks |
| Capacity Planning | references/capacity-planning.md | Scaling, forecasting, budgets |
Pattern 1: Prometheus Metric Definition
- name: database_query_duration_seconds
help: Database query execution time
type: histogram
labels: [query_type, status]
buckets: [0.01, 0.05, 0.1, 0.5, 1, 5]
Pattern 2: Grafana Dashboard JSON (RED method)
{
"title": "API Service",
"panels": [
{"title": "Request Rate", "expr": "rate(http_requests_total[5m])"},
{"title": "Error Rate", "expr": "rate(http_requests_total{status=~'5..'}[5m])"},
{"title": "P95 Latency", "expr": "histogram_quantile(0.95, http_request_duration_seconds_bucket)"}
]
}
Pattern 3: Alert Rule (PromQL)
- alert: HighLatencyP99
expr: histogram_quantile(0.99, rate(http_request_duration_seconds_bucket[5m])) > 1
for: 3m
labels: {severity: warning}
annotations: {summary: "P99 latency above 1s"}
# Metric naming: {namespace}_{subsystem}_{name}_{unit}
# Labels (low cardinality): job, instance, method, status
# Avoid: user_id, request_id, session_token (high cardinality)
# Config example:
scrape_configs:
- job_name: 'api'
static_configs:
- targets: ['localhost:8080'] # Single instance
scrape_interval: 15s # Adjust per load; avoid <5s
promtool check
promtool check rules alerting_rules.yml
promtool check config prometheus.yml
yamllint for configs
yamllint -d '{rules: {line-length: {max: 120}}}' *.yml
jsonnet lint for dashboards
jsonnetfmt --check dashboards/*.jsonnet
metric_relabel_configs to drop high-cardinality labels (e.g., user_id).scheme: https and tls_config in Prometheus scrape configs.| Wrong | Correct |
|---|---|
Label: user_id=12345 (high cardinality) | Label: user_type=premium (low cardinality) |
Alert: up{instance=...} == 0 (symptom) | Alert: rate(errors_total[5m]) > threshold (root cause) |
| Alert with no runbook link | Alert with annotations.runbook_url pointing to on-call guide |
| 50+ dashboards per team (sprawl) | 1 service dashboard + shared platform dashboard |
| Scrape every second (storage overhead) | Scrape 30s–60s (adjust per SLA, not per moment) |
npx claudepluginhub inho-team/qe-framework --plugin qe-frameworkCreates, edits, and optimizes skills for Claude Code, including drafting, evaluating with test prompts, iterating on performance, and improving skill descriptions for better triggering accuracy.