Use when the user reports a GlassFlow pipeline failure, missing events, or unexpected behavior. Starts with component logs (via kubectl or whichever log backend the OTel Collector routes to) and only reaches for metrics when the logs are clean or ambiguous. Triggers on phrases like "pipeline is failing", "no events arriving in ClickHouse", "diagnose pipeline X", "why is pipeline Y stuck".
Slash command: /glassflow-agent-skills:debug-pipeline
Localize a failure or data-flow problem on a specific pipeline. Logs are the primary diagnostic surface because they usually contain the exact error; metrics come in second as a sanity check for silent failures, throughput pain, or to confirm the fix landed. Move from broad (which component) to narrow (which line in which log) as fast as the evidence allows.
- API endpoint: defaults to http://localhost:8081; ask the user if remote.
- kubectl configured for the cluster, with read access to the per-pipeline namespace (pipeline-<id>) and the GlassFlow control-plane namespace (typically glassflow).

- pipeline_id: required.
- api_endpoint: optional; default http://localhost:8081.
- kubectl_context (cluster name) or the URL and query syntax for the user's log backend.
- prometheus_endpoint and metric_namespace: optional, only needed if the diagnosis lands in step 4.

Two cheap HTTP calls give the baseline. Always run both before reading any logs. Per-message failures (bad transform expression, malformed input JSON, filter eval error) do not appear in pod logs because the DLQ middleware swallows them; the only signal is the DLQ count rising. A pipeline can look perfectly healthy in the logs while the DLQ has thousands of rejected records.
ENDPOINT="<api-endpoint>"
PID="<id>"
# Control-plane state
curl -s "$ENDPOINT/api/v1/pipeline/$PID/health" | python3 -m json.tool
# DLQ depth (silent-rejection signal)
curl -s "$ENDPOINT/api/v1/pipeline/$PID/dlq/state" | python3 -m json.tool
Branch on overall_status from the health response:
- Created: still reconciling, not running yet. Step 2 on the operator and per-pipeline Pod events; metrics may not exist yet.
- Running: control plane is up. Continue.
- Failed: control plane errored. Step 2 on the operator logs first, then the per-pipeline pods.
- Stopping / Stopped / Terminating: lifecycle event in progress. Confirm with the user whether intentional.
- Resuming: wait and re-poll.

Then branch on unconsumed_messages from the DLQ response:
- unconsumed_messages == 0 and total_messages == 0: no records have ever been rejected. Proceed to step 2.
- unconsumed_messages == 0 but total_messages > 0: rejections happened in the past but have been drained. Note the last_received_at timestamp; if it is recent, treat as an active signal and run the DLQ-consume step below.
- unconsumed_messages > 0: records are currently rejected. Do not stop at "logs are clean"; the actual error is in the DLQ payload. Run the DLQ-consume step now and surface the result before doing any further investigation.

# When unconsumed_messages > 0: fetch the actual rejected records with their errors
curl -s "$ENDPOINT/api/v1/pipeline/$PID/dlq/consume?batch_size=10" | python3 -m json.tool
Each item in the response carries component (which stage rejected the record), error (the verbatim error message), and original_message (the record content). The error field usually localizes the problem completely. For example:
- "run transformation 5: cannot fetch source from string": a stateless transform expression references a field that does not exist on the record.
- "convert result for column amount: unable to cast nil to float64": an expression returned nil for a non-nullable column.
- "filter evaluation error: type string has no field source": a filter expression references a nested field on a flat string.
- "unmarshal input data: invalid character ...": the source is producing malformed JSON.

If the DLQ error explains the symptom, surface it to the user with the verbatim error and propose the fix. Skip the remaining steps.
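The branching rules in step 1 can be sketched as two small shell helpers. This is a minimal sketch: the function names and the returned labels are hypothetical and not part of GlassFlow; only the status values and the counter comparisons come from the text above.

```shell
# Hypothetical helpers: names and output labels are illustrative only.

# Map overall_status from the health response to the next move.
health_next_step() {
  case "$1" in
    Created)                      echo "read-operator-and-pod-events" ;;
    Running)                      echo "continue" ;;
    Failed)                       echo "read-operator-then-pipeline-pods" ;;
    Stopping|Stopped|Terminating) echo "confirm-intent-with-user" ;;
    Resuming)                     echo "wait-and-repoll" ;;
    *)                            echo "unknown-status" ;;
  esac
}

# Map the DLQ counters to the next move.
# $1 = unconsumed_messages, $2 = total_messages
dlq_next_step() {
  if [ "$1" -gt 0 ]; then
    echo "consume-dlq-now"                 # records are currently rejected
  elif [ "$2" -gt 0 ]; then
    echo "drained-check-last-received-at"  # past rejections, already drained
  else
    echo "proceed-to-logs"                 # nothing has ever been rejected
  fi
}
```

Feeding these from the API responses needs a JSON extractor. Assuming jq is installed and the counters sit at the top level of the DLQ response (which the text does not confirm), something like `dlq_next_step "$(jq -r .unconsumed_messages dlq.json)" "$(jq -r .total_messages dlq.json)"` would work on a saved response.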
The operator stamps one Deployment per role per pipeline in the namespace pipeline-<id> (or the operator's configured namespace when chart auto=false). Pick the roles to read based on the symptom:
| Symptom from the user | Read first | Read next if first is clean |
|---|---|---|
| "Nothing arrives in ClickHouse" | sink, then ingestor | dedup, transform, OTLP receiver (if OTLP source) |
| "Pipeline fails on create or edit" | operator (control-plane namespace) | api (control-plane namespace) |
| "Started failing this morning" | the component the user last touched | sink (most common silent failure surface) |
| "Backpressure / slow" | ingestor + sink | dedup if enabled, NATS pod for stream pressure |
| Generic / unknown | sink, then ingestor, then operator | all the rest |
kubectl --context <ctx> -n pipeline-<id> get pods
kubectl --context <ctx> -n pipeline-<id> logs deploy/glassflow-sink-<id> --tail 200
kubectl --context <ctx> -n pipeline-<id> logs deploy/glassflow-ingestor-<id> --tail 200
# StatefulSet for dedup if enabled:
kubectl --context <ctx> -n pipeline-<id> logs sts/glassflow-dedup-<id> --tail 200
The shared OTLP receiver runs in the control-plane namespace, not the per-pipeline one:
kubectl --context <ctx> -n glassflow logs deploy/glassflow-otlp-receiver --tail 200
For lifecycle / Failed / Created issues, the operator log is where the reconciliation error surfaces:
kubectl --context <ctx> -n glassflow logs deploy/glassflow-controller-manager --tail 200
If a pod is crash-looping, add --previous to read the prior container's logs.
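The commands above follow a regular naming scheme, so a small helper can print the right one for a given role. This is a sketch under assumptions: log_cmd is a hypothetical name, and the default branch assumes that roles other than sink and ingestor also follow the glassflow-<role>-<id> Deployment pattern, which should be checked against kubectl get pods first.

```shell
# Print (not run) the kubectl logs command for one role.
# $1 = kubectl context, $2 = pipeline id, $3 = role, $4 = optional "previous"
log_cmd() {
  ctx="$1"; id="$2"; role="$3"; extra=""
  if [ "${4:-}" = "previous" ]; then extra=" --previous"; fi
  case "$role" in
    # dedup runs as a StatefulSet in the per-pipeline namespace
    dedup)         ns="pipeline-$id"; target="sts/glassflow-dedup-$id" ;;
    # shared components live in the control-plane namespace
    otlp-receiver) ns="glassflow";    target="deploy/glassflow-otlp-receiver" ;;
    operator)      ns="glassflow";    target="deploy/glassflow-controller-manager" ;;
    # assumption: every other role is a Deployment named glassflow-<role>-<id>
    *)             ns="pipeline-$id"; target="deploy/glassflow-$role-$id" ;;
  esac
  echo "kubectl --context $ctx -n $ns logs $target --tail 200$extra"
}
```

Printing the command instead of running it keeps the helper safe to use while the cluster context is still being confirmed; pipe the output to sh once it looks right.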
When the deployment exports logs through the Collector to an external backend, query that backend with its standard log selectors. The Collector adds pipeline_id as an attribute on every log line it forwards. These filter expressions cover the common backends:

- {namespace="pipeline-<id>"} | json | pipeline_id="<id>"
- kubernetes.namespace_name: "pipeline-<id>" and attributes.pipeline_id: "<id>"
- Log group /aws/containerinsights/<cluster>/application, filter: fields @message | filter kubernetes.namespace_name = "pipeline-<id>"

Ask the user for the exact log-backend URL or saved query if they have one.
The most common failure modes and the log lines that announce them:
| Component | Pattern in the log | What it means |
|---|---|---|
| ingestor | consumer group rebalance failed, OFFSET_OUT_OF_RANGE | Kafka topic was rotated/recreated, or the consumer group offsets are stale. Reset via kafka-consumer-groups --reset-offsets. |
| ingestor | SASL authentication failed, EOF reading SASL | Kafka credentials are wrong or expired. |
| dedup | batch processing failed ... failed to compile transformation | A stateless transform expression no longer compiles against the current schema version. The component signals itself to stop; the pipeline transitions to Failed. Roll back the schema version or fix the expression and re-edit the pipeline. |
| dedup | batch processing failed ... signal to stop component is sent | Same root cause as above: the component just sent its fatal signal and is shutting down. Confirm with the operator log to see the signal arrive. |
| dedup | batch processing failed ... schema id is missing in header | Messages arrive at dedup without the upstream gf-schema-version-id header. Indicates the ingestor or a chained component is misconfigured. |
| dedup | badger: ErrConflict, value log out of space | BadgerDB PVC is full or under contention. Grow resources.transform[].storage.size. |
| dedup | dlq_records_written_total{reason="dedup_overflow"} rate > 0 with no error in the pod log | Per-message filter, transform, or dedup failures are being silently routed to the DLQ. They do not appear in pod logs because the component routes them via the DLQ middleware and continues. Inspect the DLQ contents to see the actual errors (unmarshal input data: ..., run transformation N: cannot fetch X from string, convert result for column X: unable to cast Y to Z, filter evaluation error: ...). |
| sink | code: 60 ... unknown table | The destination table does not exist. Create it before retrying. |
| sink | code: 47 ... unknown identifier, cannot parse string as <type> | Sink mapping references a column that does not exist or the column type does not match the source field. |
| sink | clickhouse: read tcp ... i/o timeout | ClickHouse is unreachable or overloaded; sink will NACK and retry. |
| OTLP receiver | missing x-glassflow-pipeline-id header | OTLP client is misconfigured; the receiver rejects every request. |
| operator | failed to reconcile, image pull backoff | Reconciliation cannot finish; check Pod events for the underlying cause. |
If a log line directly explains the symptom, surface it to the user with the verbatim error and propose the fix. Skip step 4.
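The pattern table can be turned into a quick first-pass scan over a captured log. The sketch below is illustrative only: scan_known_patterns and _hit are hypothetical names, the regular expressions are simplified from the table's pattern column, and the printed labels are shortened summaries of its last column. A hit is a pointer to the log line, not a diagnosis.

```shell
# Print a label when the log text matches an extended regex.
# $1 = log text, $2 = extended regex, $3 = label
_hit() {
  if printf '%s\n' "$1" | grep -Eq -- "$2"; then
    printf '%s\n' "$3"
  fi
}

# Read log text on stdin; print one line per known pattern found.
scan_known_patterns() {
  log="$(cat)"
  _hit "$log" 'consumer group rebalance failed|OFFSET_OUT_OF_RANGE' 'ingestor: Kafka offsets stale or topic recreated'
  _hit "$log" 'SASL authentication failed|EOF reading SASL' 'ingestor: Kafka credentials wrong or expired'
  _hit "$log" 'failed to compile transformation' 'dedup: transform expression no longer compiles'
  _hit "$log" 'schema id is missing in header' 'dedup: upstream schema header missing'
  _hit "$log" 'badger: ErrConflict|value log out of space' 'dedup: BadgerDB PVC full or under contention'
  _hit "$log" 'code: 60([^0-9]|$)' 'sink: destination table does not exist'
  _hit "$log" 'code: 47([^0-9]|$)|cannot parse string as' 'sink: column missing or type mismatch'
  _hit "$log" 'clickhouse: read tcp.*i/o timeout' 'sink: ClickHouse unreachable or overloaded'
  _hit "$log" 'missing x-glassflow-pipeline-id header' 'OTLP receiver: client misconfigured'
  _hit "$log" 'failed to reconcile|image pull backoff' 'operator: reconciliation cannot finish'
}
```

Typical use is to pipe a kubectl logs command into it, for example `kubectl --context <ctx> -n pipeline-<id> logs deploy/glassflow-sink-<id> --tail 200 | scan_known_patterns`. Empty output means none of the listed patterns appeared, which is the cue to move on to metrics.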
Special case: per-message expression failures. The dedup component (which hosts the filter, dedup, and stateless transform stages) does not log per-message failures to the pod's stdout. A bad expression hitting a single record routes that record to the DLQ and keeps the pipeline running. The user sees this only in DLQ contents or metrics, not in kubectl logs. This is exactly what step 1's DLQ check catches; if you reached step 3 without seeing DLQ activity, this case is ruled out.
Logs do not catch silent failures (records routed to DLQ without raising an error, slow drift in throughput, backpressure that has not yet caused timeouts). When step 3 turned up nothing actionable, the metric surface confirms whether data is actually moving.
PROM="<prometheus_endpoint>"
NS="<metric_namespace>"
# Source rate (Kafka)
curl -sG "$PROM/api/v1/query" \
--data-urlencode "query=rate(${NS}_gfm_kafka_records_read_total{pipeline_id=\"$PID\"}[5m])"
# Sink rate
curl -sG "$PROM/api/v1/query" \
--data-urlencode "query=rate(${NS}_gfm_clickhouse_records_written_total{pipeline_id=\"$PID\"}[5m])"
# DLQ rate by reason (silent rejection surface)
curl -sG "$PROM/api/v1/query" \
--data-urlencode "query=sum by (reason) (rate(${NS}_gfm_dlq_records_written_total{pipeline_id=\"$PID\"}[5m]))"
# Backpressure right now (1 = active, 0 = clear)
curl -sG "$PROM/api/v1/query" \
--data-urlencode "query=${NS}_gfm_ingestor_backpressure_active{pipeline_id=\"$PID\"}"
curl -sG "$PROM/api/v1/query" \
--data-urlencode "query=${NS}_gfm_component_backpressure_active{pipeline_id=\"$PID\"}"
# Stream fullness
curl -sG "$PROM/api/v1/query" \
--data-urlencode "query=${NS}_gfm_stream_depth_ratio{pipeline_id=\"$PID\"}"
What the patterns mean:
- The reason label tells you which classifier triggered. The /dlq/consume endpoint (already covered in step 1 above) is the source of truth for the actual error per record; the metric rate just tells you how fast the rejections are happening. Note that reason="dedup_overflow" is a catch-all for any failure inside the dedup component, including bad filter expression, transform expression error, type-conversion failure, or actual BadgerDB-state overflow. The label alone does not tell you which one; you have to read the per-record error.
- ... tune-pipeline job, not a debug job.
- stream_depth_ratio sustained > 0.8: the stream is filling because the downstream consumer is slower than the source.

For component-level detail when needed:
# Per-stage processing latency (p95)
curl -sG "$PROM/api/v1/query" \
--data-urlencode "query=histogram_quantile(0.95, sum by (stage, le) (rate(${NS}_gfm_processing_duration_seconds_bucket{pipeline_id=\"$PID\"}[5m])))"
# Processor message counts by status (ok / error / filtered / duplicate)
curl -sG "$PROM/api/v1/query" \
--data-urlencode "query=sum by (component, status) (rate(${NS}_gfm_processor_messages_total{pipeline_id=\"$PID\"}[5m]))"
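The repeated curl calls share one shape, so they can be factored into helpers. This is a sketch with hypothetical function names; the metric names, the 5m window, and the 0.8 threshold are taken from the queries and notes in this step, and nothing here is a GlassFlow-provided tool.

```shell
# Build the per-pipeline rate expression used in the queries above.
# $1 = metric namespace, $2 = metric name after "_gfm_", $3 = pipeline id
promql_rate() {
  printf 'rate(%s_gfm_%s{pipeline_id="%s"}[5m])\n' "$1" "$2" "$3"
}

# Run an instant query against Prometheus.
# $1 = Prometheus endpoint, $2 = PromQL expression
prom_query() {
  curl -sG "$1/api/v1/query" --data-urlencode "query=$2"
}

# Succeed when a single stream_depth_ratio sample is above 0.8.
# "Sustained" still has to be judged over several samples.
stream_is_filling() {
  awk -v r="$1" 'BEGIN { exit !(r + 0 > 0.8) }'
}
```

For example, `prom_query "$PROM" "$(promql_rate "$NS" kafka_records_read_total "$PID")"` reproduces the source-rate query, and the same call with clickhouse_records_written_total gives the sink rate to compare against it.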
Report a structured diagnosis:
Pipeline: <id>
Status: <overall_status>
DLQ: <unconsumed_messages> unconsumed / <total_messages> total
  last_received_at=<timestamp>
Bottleneck: <ingestor | dedup | join | sink | source | none>
Evidence:
  primary error: "<verbatim DLQ error OR pod log line>"
  source: <DLQ component / pod role / metric name>
  metric backup: <only when reached: rate / backpressure values>
Next move: <run tune-pipeline | check kafka creds | drain DLQ | fix CH schema | escalate>
Always report DLQ counts even when zero (the absence is a positive signal). Always cite the verbatim error (DLQ payload, log line, or metric value) that drove the conclusion. Do not propose destructive actions (delete, restart, purge) without explicit user confirmation. The skill diagnoses; the user authorizes the fix.
Do not report "no problem found" while unconsumed_messages > 0. A non-zero DLQ is a problem; if you cannot explain it from the DLQ payload, escalate to the user with the evidence and let them decide what to do.
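If the report is assembled by a script, the template can be rendered with a plain printf helper. This is a minimal sketch: render_diagnosis is a hypothetical name, the argument order is an arbitrary choice, and the optional metric backup line is left out for brevity.

```shell
# Render the structured diagnosis from plain string arguments.
# $1 id, $2 status, $3 unconsumed, $4 total, $5 last_received_at,
# $6 bottleneck, $7 verbatim error, $8 evidence source, $9 next move
render_diagnosis() {
  printf 'Pipeline: %s\n' "$1"
  printf 'Status: %s\n' "$2"
  # DLQ counts are always printed, even when both are zero
  printf 'DLQ: %s unconsumed / %s total\n' "$3" "$4"
  printf '  last_received_at=%s\n' "$5"
  printf 'Bottleneck: %s\n' "$6"
  printf 'Evidence:\n'
  # the error is passed as an argument, so any % in it is printed literally
  printf '  primary error: "%s"\n' "$7"
  printf '  source: %s\n' "$8"
  printf 'Next move: %s\n' "$9"
}
```

Passing the error text as a printf argument rather than inside the format string keeps the verbatim message intact even when it contains percent signs or backslashes.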
- overall_status: Running does not mean events are flowing. It only confirms the control plane stamped the workloads. If logs are clean, step 4 metrics are the next check.
- Run kubectl get pods -n pipeline-<id> before constructing log commands.
- --previous is essential for crash loops. A pod restarting every few seconds has a current container too young to have logged anything useful; the previous container's logs hold the actual error.
- The operator log is where Failed and Created root causes surface.
- ... total_messages) survive; live depth (unconsumed_messages) does not.
- ... stream_depth metric. Confirm which side produced the name before assuming you know which stream the user means.
- The reason label dedup_overflow is overloaded. Every dedup-component failure (filter expression error, transform expression error, type-conversion failure, BadgerDB state issue, etc.) lands under this single label. Do not infer the cause from the label alone; read the per-record error in the DLQ payload.
npx claudepluginhub glassflow/agent-skills --plugin glassflow-agent-skills