From lens
The Lens of Observability: a Socratic battery over operational recovery and visibility for money flows, state machines, webhooks, and background jobs. Use for anything where a lost external event or a stalled multi-step flow can strand a user (payments, webhooks, onboarding funnels, queues, crons). Modes: design (questions to you) and audit (questions to the code).
Slash command
/lens:observability
Read first: ${CLAUDE_PLUGIN_ROOT}/core/protocol.md and ${CLAUDE_PLUGIN_ROOT}/core/dossier.md. Dossier rules: read it, skip answered questions WITH skip lines, append your outputs.
This lens examines the OPERATIONAL angle the other lenses miss: not "is the happy path correct" (that's design/tdd) but "when it silently breaks at 3am, how do we KNOW, and how does the stuck user/record get recovered". Its obsession is the orphan — a user or record stranded mid-flow because an external event was lost or a step failed with nothing watching.
Ask one at a time; follow answers per protocol.md. Append answers to the dossier.
If external events / webhooks (Stripe, SendGrid, queues, partner callbacks):
3. What happens when an inbound event is LOST, or arrives LATE / out-of-order? Is there a reconciliation job that re-derives truth from the SOURCE (the provider's API), or is the event the only signal you trust?
4. Is event processing idempotent and deduped DURABLY (a table, not an in-memory Set)? Can a redelivery double-apply an effect (double charge, double grant)?
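The two webhook questions above can be made concrete with a minimal sketch. It assumes a SQLite store and invents the table names (processed_events, grants), the function names, and the shape of the provider listing purely for illustration; none of these come from the lens itself. The primary key on event_id plays the role of the durable dedupe table, and reconcile replays whatever the source reports, relying on that same guard so replays cannot double-apply.

```python
import sqlite3


def open_store(path=":memory:"):
    """Create the hypothetical tables; the PRIMARY KEY is the durable dedupe guard."""
    db = sqlite3.connect(path)
    db.execute("CREATE TABLE IF NOT EXISTS processed_events (event_id TEXT PRIMARY KEY)")
    db.execute(
        "CREATE TABLE IF NOT EXISTS grants (user_id TEXT PRIMARY KEY, credits INTEGER NOT NULL)"
    )
    return db


def apply_event_once(db, event_id, user_id, credits):
    """Apply a credit grant at most once per event_id; return True if it was applied."""
    try:
        with db:  # one transaction: the dedupe row and the effect commit or roll back together
            db.execute("INSERT INTO processed_events (event_id) VALUES (?)", (event_id,))
            db.execute("INSERT OR IGNORE INTO grants (user_id, credits) VALUES (?, 0)", (user_id,))
            db.execute(
                "UPDATE grants SET credits = credits + ? WHERE user_id = ?", (credits, user_id)
            )
        return True
    except sqlite3.IntegrityError:
        return False  # redelivery: this event_id was already processed, nothing applied


def reconcile(db, provider_events):
    """Replay events listed by the source of truth; return the ids that had been missed.

    provider_events stands in for a listing fetched from the provider's API.
    """
    return [
        e["id"]
        for e in provider_events
        if apply_event_once(db, e["id"], e["user_id"], e["credits"])
    ]
```

Because the webhook handler and the reconciliation job share one guarded code path, a late or redelivered event is harmless, and a lost event is recovered the next time reconcile runs.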
If a multi-step flow / state machine (onboarding, checkout, wizards):
5. Can a user get STUCK between steps — abandoned, failed mid-step, returned from a redirect that got lost? Is the resumable state SERVER-derived (re-entry routes to the right step) or client-held (lost on refresh / new device)?
6. Is there a sweep that finds records stuck in a transient state past a deadline (e.g. PROCESSING > N minutes) and repairs or escalates them — or do they sit forever?
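As an illustration of the sweep and of server-derived resume, here is a small sketch over plain dictionaries. The state names other than PROCESSING, the step names, the 15-minute deadline, and the retry limit are all assumptions chosen for the example; a real system would read these records from its own store.

```python
from datetime import datetime, timedelta, timezone

TRANSIENT_STATES = {"PROCESSING"}
STEPS = ["account", "profile", "payment"]  # hypothetical onboarding steps


def resume_step(completed_steps):
    """Derive the next step from server-side progress; None means the flow is finished."""
    for step in STEPS:
        if step not in completed_steps:
            return step
    return None


def find_stuck(records, now, deadline=timedelta(minutes=15)):
    """Return records that have sat in a transient state longer than the deadline."""
    return [
        r for r in records
        if r["state"] in TRANSIENT_STATES and now - r["updated_at"] > deadline
    ]


def sweep(records, now, deadline=timedelta(minutes=15), max_retries=3):
    """Requeue stuck records until retries run out, then escalate them for review."""
    actions = []
    for r in find_stuck(records, now, deadline):
        if r["retries"] < max_retries:
            r["state"], r["retries"], r["updated_at"] = "PENDING", r["retries"] + 1, now
            actions.append((r["id"], "requeued"))
        else:
            r["state"], r["updated_at"] = "NEEDS_REVIEW", now
            actions.append((r["id"], "escalated"))
    return actions
```

Escalating after a bounded number of retries is one possible design choice: it keeps a permanently failing record from cycling forever while still surfacing it to a human.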
If money / irreversible side effects:
7. Can you reconstruct, after the fact, the full sequence of attempts and outcomes for ONE user's charge/grant? What is the audit trail, and would it survive a dispute or a "you charged me twice" complaint?
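One way to picture the audit trail asked about here is an append-only log of attempts that can be filtered per user. The sketch below is in-memory only and every name in it (Attempt, AuditLog, the outcome strings) is hypothetical; a trail meant to survive a dispute would need durable, append-only storage.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Attempt:
    seq: int               # global order in which attempts were recorded
    user_id: str
    idempotency_key: str   # ties retries of one logical charge together
    action: str            # e.g. "charge" or "grant"
    outcome: str           # e.g. "succeeded", "failed", "timeout"
    amount_cents: int


class AuditLog:
    """Append-only record of every attempt, including failures and timeouts."""

    def __init__(self):
        self._entries = []

    def record(self, user_id, idempotency_key, action, outcome, amount_cents):
        entry = Attempt(
            len(self._entries) + 1, user_id, idempotency_key, action, outcome, amount_cents
        )
        self._entries.append(entry)
        return entry

    def history(self, user_id):
        """Full ordered sequence of attempts and outcomes for one user."""
        return [e for e in self._entries if e.user_id == user_id]

    def duplicate_charges(self, user_id):
        """Idempotency keys that show more than one succeeded charge for this user."""
        counts = {}
        for e in self.history(user_id):
            if e.action == "charge" and e.outcome == "succeeded":
                counts[e.idempotency_key] = counts.get(e.idempotency_key, 0) + 1
        return sorted(k for k, n in counts.items() if n > 1)
```

Recording failed and timed-out attempts alongside successes is what lets a "you charged me twice" complaint be answered from the log rather than from guesswork.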
If background jobs / crons:
8. How do you know a cron actually RAN and SUCCEEDED — not silently skipped, crashed, or no-op'd? Is "did nothing because nothing to do" distinguishable from "failed"?
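A sketch of what a positive answer to this question might look like: each run leaves a record whose status separates "ok", "noop", and "failed", and a separate check flags a job with no recent finished run. The record fields, the status strings, and the convention that the job returns a processed count are assumptions made for this example.

```python
from datetime import datetime, timedelta


def run_job(name, job, runs, now: datetime):
    """Run job() and always append a run record, whatever happens.

    job is assumed to return the number of items it processed.
    """
    record = {"job": name, "started_at": now, "status": None, "processed": 0, "error": None}
    try:
        processed = job()
        record["processed"] = processed
        # "noop" means it ran and found nothing to do, which is different from failing
        record["status"] = "ok" if processed > 0 else "noop"
    except Exception as exc:
        record["status"] = "failed"
        record["error"] = repr(exc)
    runs.append(record)
    return record


def is_overdue(name, runs, now: datetime, max_gap: timedelta):
    """True if the job has no ok/noop run within max_gap (skipped, crashing, or never ran)."""
    finished = [
        r["started_at"] for r in runs
        if r["job"] == name and r["status"] in ("ok", "noop")
    ]
    return not finished or now - max(finished) > max_gap
```

The overdue check has to run somewhere other than the job itself, since a cron that never starts cannot report its own absence.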
The CODE answers. Never edit code; produce a report.
file:line (e.g. a reconciliation cron exists; dedupe is a durable unique index; resume state is server-derived).
file:line, assign severity (critical / high / medium / low).
If the riskiest external dependency went silent for an hour right now, how many users would be stranded, and which of them would we ever find out about?
If the honest answer is "we wouldn't know", that is the top finding.
file:line · severity) +
the orphan/recovery gaps ranked by user-harm + a one-line recovery direction each.
Append a summary to dossier.
Gaps → {"type":"gap","date":"<YYYY-MM-DD>","lens":"observability","note":"<one line>"} appended to <foundry>/pending-retros.jsonl when the lens config exists; otherwise mention the gap in your output and move on.
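The gap record above is one JSON object per line. A minimal sketch of writing it follows, assuming the caller already knows the foundry directory and whether the lens config exists; the function name record_gap and both parameters are hypothetical, since the source does not say how either is discovered.

```python
import json
from datetime import date
from pathlib import Path


def record_gap(note, foundry_dir, lens_configured, today=None):
    """Append one gap line to pending-retros.jsonl, or return None if there is no lens config.

    foundry_dir and lens_configured are supplied by the caller (assumptions of this sketch).
    """
    if not lens_configured:
        return None  # caller should mention the gap in its output instead
    entry = {
        "type": "gap",
        "date": (today or date.today()).isoformat(),  # YYYY-MM-DD
        "lens": "observability",
        "note": note,
    }
    path = Path(foundry_dir) / "pending-retros.jsonl"
    with path.open("a", encoding="utf-8") as fh:  # append mode: never rewrite earlier lines
        fh.write(json.dumps(entry) + "\n")
    return path
```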
npx claudepluginhub tiltely/lens --plugin lens