Skill

lightrun-live-runtime-debugging

Guides runtime debugging in live environments using Lightrun MCP, with problem framing, hypothesis ranking, evidence capture, diagnosis confidence, and blocker handling.

developer-tools

Invocation

How this skill is triggered — by the user, by Claude, or both

Slash command

/lightrun-ai:lightrun-live-runtime-debugging

User invocable

Model invocable

Inline context

Default effort

Context Preview

The summary Claude sees in its skill listing — used to decide when to auto-load this skill

Provide a repeatable live runtime debugging workflow that helps QA and engineers investigate incidents to a diagnosis with focused, high-signal runtime evidence.

Supporting Files

agents/openai.yaml
assets/lightrun-small.svg
assets/lightrun.png

SKILL.md

343 lines · ~5.2k tokens (exceeds 5k compaction limit)

Stats

Stars: 27

Maintenance: Excellent

Last Commit: Jun 11, 2026


Goal

Provide a repeatable live runtime debugging workflow that helps QA and engineers investigate incidents to a diagnosis with focused, high-signal runtime evidence.

Scope

In scope: problem framing, hypothesis ranking, runtime evidence capture, hypothesis elimination, diagnosis confidence, blocker handling, and investigation handoff.
Out of scope: code changes, rollout decisions, or postmortem ownership.

Preconditions

User can access the target service source path and line location.
Lightrun MCP server is installed and authenticated.
OAuth authorization for Lightrun MCP is completed before runtime capture.

MCP Preflight

Required gate tool:
- get_runtime_sources
Pass criteria:
- At least one valid agent pool is returned.
- A concrete target is selected: agentNames or customSourceName or tagNames.
Fail criteria:
- Tool is unavailable, call fails, or source list is empty.

Missing-MCP Recovery

Classify failure: missing tool, runtime call error, or empty source list.
Instruct remediation:
- Install/enable Lightrun MCP.
- Complete MCP OAuth authorization.
- Verify access to the expected environment/agent pool.
Re-run get_runtime_sources.
Continue investigation after preflight success.
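
The preflight gate and its recovery path can be sketched as a small classifier. This is a minimal illustration, not part of the Lightrun MCP API: the function name, its parameters, and the pairing of each failure class with one remediation line are assumptions made for the example. It only covers the first pass criterion (at least one agent pool returned); selecting a concrete target is a separate step.

```python
from enum import Enum
from typing import List, Optional


class PreflightFailure(Enum):
    MISSING_TOOL = "missing tool"
    CALL_ERROR = "runtime call error"
    EMPTY_SOURCES = "empty source list"


# Assumed pairing of failure class to remediation; the skill lists the
# remediation steps but does not say which one applies to which failure.
REMEDIATION = {
    PreflightFailure.MISSING_TOOL: "Install/enable Lightrun MCP.",
    PreflightFailure.CALL_ERROR: "Complete MCP OAuth authorization.",
    PreflightFailure.EMPTY_SOURCES: "Verify access to the expected environment/agent pool.",
}


def classify_preflight(tool_available: bool,
                       call_error: Optional[str],
                       agent_pools: List[str]) -> Optional[PreflightFailure]:
    """Return None when get_runtime_sources passed, else the failure class.

    The three arguments are a hypothetical summary of the tool call outcome.
    """
    if not tool_available:
        return PreflightFailure.MISSING_TOOL
    if call_error is not None:
        return PreflightFailure.CALL_ERROR
    if not agent_pools:
        return PreflightFailure.EMPTY_SOURCES
    return None
```

A caller would look up `REMEDIATION[failure]`, report it, and re-run get_runtime_sources before continuing.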

Resume Criteria

Resume the investigation after get_runtime_sources returns valid sources.
Runtime evidence tools are activated after preflight success.
For asynchronous runtime actions, resume by re-checking previously created action IDs before creating duplicate actions.
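
Resuming without creating duplicates implies that action state is stored somewhere between runs. The sketch below assumes a JSON payload as the storage format and uses hypothetical field and function names; the fields mirror the ones the Async Runtime Action Protocol section asks to persist.

```python
import json
from dataclasses import asdict, dataclass
from typing import List


@dataclass
class AsyncActionRecord:
    action_id: str
    hypothesis_id: str
    source_target: str
    code_location: str
    purpose: str
    created_at: str            # ISO-8601 timestamp, stored as text
    max_wait_seconds: int
    last_known_status: str
    last_retrieved_hit_count: int = 0


def dump_records(records: List[AsyncActionRecord]) -> str:
    """Serialize records so a later skill run can load them."""
    return json.dumps([asdict(r) for r in records], indent=2)


def load_records(payload: str) -> List[AsyncActionRecord]:
    """Rebuild records from a payload produced by dump_records."""
    return [AsyncActionRecord(**item) for item in json.loads(payload)]


def find_existing(records: List[AsyncActionRecord],
                  hypothesis_id: str,
                  code_location: str) -> List[AsyncActionRecord]:
    """Return records that already cover this hypothesis and location.

    A non-empty result means: re-check those action IDs instead of
    creating a duplicate action.
    """
    return [r for r in records
            if r.hypothesis_id == hypothesis_id
            and r.code_location == code_location]
```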

Runtime Tool Selection Strategy

Keep preflight fixed on get_runtime_sources.
At run start, inspect currently exposed Lightrun runtime tools and their descriptions before selecting an evidence path.
For evidence collection, select the best-fit tool set for each hypothesis signal based on both investigation needs and currently exposed capabilities.
Record the selected tool identifier exactly as exposed by MCP.
Before each action, state what decision this action can change.
If an action cannot change any diagnosis decision, do not run it.
After each action, reassess information gain and change strategy when gain is low for two consecutive actions.
Avoid repeating similar probes across many locations without new rationale.
Re-check currently exposed runtime tools when resuming a later run, and adapt the evidence path if available capabilities changed.
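
Two of the rules above are simple enough to express as checks: skip an action that cannot change a decision, and change strategy after two consecutive low-gain actions. The following is an illustrative sketch; the function names and the boolean "low gain" judgement are assumptions, since the skill does not define how gain is measured.

```python
from typing import List


def should_run_action(decision_it_can_change: str) -> bool:
    """Run an action only if it names a diagnosis decision it could change."""
    return bool(decision_it_can_change.strip())


def should_change_strategy(low_gain_history: List[bool]) -> bool:
    """True when the two most recent actions were both judged low-gain.

    Each entry is True if that action produced low information gain;
    the most recent action is last.
    """
    return len(low_gain_history) >= 2 and all(low_gain_history[-2:])
```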

Source Selection Confidence

First evaluate candidate agent/tag/custom-source options and choose the best-fit target when confidence is sufficient.
If several targets can fit, select one or multiple strongest candidates using explicit reasoning (service ownership, environment match, and expected trigger path).
Ask the user for source clarification when confidence remains low after this evaluation.
When clarification is needed, present a short comparison of candidates and continue after the user selects the source.
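
One way to make "confidence is sufficient" concrete is a scored comparison of candidates. The sketch below is hypothetical throughout: the scores, the 0.7 threshold, and the 0.1 tie margin are illustrative values, not figures from the skill, and how a score is derived from service ownership, environment match, and trigger path is left to the investigator.

```python
from typing import Dict, List, Tuple


def select_sources(scores: Dict[str, float],
                   min_confidence: float = 0.7,
                   tie_margin: float = 0.1) -> Tuple[List[str], bool]:
    """Return (candidates, needs_user_clarification).

    scores maps a candidate source name to a confidence in [0, 1].
    When the best score is below min_confidence, all candidates are
    returned ranked so they can be shown to the user for comparison.
    Otherwise every candidate within tie_margin of the best is selected.
    """
    if not scores:
        return [], True
    ranked = sorted(scores, key=lambda name: scores[name], reverse=True)
    best = scores[ranked[0]]
    if best < min_confidence:
        return ranked, True
    selected = [name for name in ranked if best - scores[name] <= tie_margin]
    return selected, False
```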

Investigation Principles

Start with hypotheses, then choose tools.
Capture evidence that can confirm or falsify a specific hypothesis.
Collect runtime evidence whenever feasible, even when a bug cause appears obvious.
For user-complaint investigations, evidence must explain whether the observed failure was expected or unexpected for a concrete request context.
Prefer regular (non-async) runtime tools for same-run investigation when they can produce required evidence in the current session.
Use asynchronous runtime actions only when the expected signal likely needs a longer or uncertain reproduction window.
When async actions are used, treat them as investigation state that can span multiple skill runs.
Do not issue a final diagnosis while required async actions are still pending/running unless their status and available results have been checked first.
Prefer eliminating wrong hypotheses quickly over collecting broad low-signal data.
End with a diagnosis statement that includes confidence and remaining uncertainty.
Do not close investigation based only on occurrence evidence; closure requires mechanism evidence linking runtime state to failure path.

Async Activation Gate

Async mode is optional, but MUST be activated when either condition is met:
- two consecutive no-hit/timeout outcomes occur on correctly targeted synchronous runtime captures for the active hypothesis, or
- reproduction is not available in the current session window.
After the first failed synchronous cycle, force an explicit mode decision:
- continue with synchronous capture when user can reproduce now in-session,
- switch to asynchronous capture and pause investigation for resume in a later run when reproduction timing is uncertain or delayed.
Do not run more than 2 consecutive no-hit synchronous probes on the same codepath before switching to async mode.
When async mode is activated, create async action(s), persist action IDs, provide reproduction-required handoff, and stop active diagnosis until next run.
If reproducibility confidence is low or user-reported failure is intermittent, favor async in the first evidence cycle.
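
The gate above reduces to a small decision function. This is a sketch under stated assumptions: the `CaptureState` type and its field names are invented for the example, and treating "favor async" as an immediate switch is one possible reading of the last rule.

```python
from dataclasses import dataclass


@dataclass
class CaptureState:
    consecutive_sync_misses: int    # no-hit/timeout outcomes on the same codepath
    can_reproduce_in_session: bool  # user can trigger the issue right now


def must_activate_async(state: CaptureState) -> bool:
    """Mandatory activation: two consecutive misses, or no in-session repro."""
    return state.consecutive_sync_misses >= 2 or not state.can_reproduce_in_session


def choose_mode(state: CaptureState,
                intermittent_or_low_confidence: bool = False) -> str:
    """Return "async" or "sync" for the next evidence cycle."""
    if must_activate_async(state):
        return "async"
    if intermittent_or_low_confidence:
        # Assumed reading of "favor async in the first evidence cycle".
        return "async"
    return "sync"
```

After one failed synchronous cycle with a user who can still reproduce in-session, this returns "sync"; a second consecutive miss on the same codepath forces "async".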

Async Runtime Action Protocol

Use this protocol when async runtime tools are available in MCP.
Discover currently available runtime tool names from MCP at run time and use the exact exposed identifiers.
The protocol requires these capabilities:
- create an async runtime action for a hypothesis signal,
- check async action status by action ID,
- retrieve captured values and/or call stack when new hits are available,
- cancel async action when it is no longer needed.
At action creation time, persist: actionId, hypothesis ID, source target, code location, purpose, creation time, max wait, last known status, last retrieved hit count.
For uncertain reproduction timing, use a long async window by default (recommended baseline: 1800-3600 seconds), then adjust only with explicit reason.
After creating an async action, perform bounded in-session status polling for a short but meaningful window.
- use a small polling budget (for example 60-90 seconds total in-session),
- if new hits arrive in this window, retrieve data immediately and continue investigation in the same run,
- if no usable results arrive by budget end, keep action active and switch to reproduction-required handoff for later resume.
On each new skill run for the same investigation:
- load persisted action IDs first,
- call status first for each still-relevant action,
- retrieve data only when status hit count increased beyond previously retrieved count.
During bounded in-session polling, stop when the status is terminal (COMPLETED, FAILED, ERROR, TIMEOUT, or CANCELLED) or when the in-session polling budget is reached.
If status reaches terminal during bounded in-session polling and hit count increased since last getter call, perform one final getter call before closing the action outcome.
If status is still pending/running with no usable results by end of current run, return a handoff that includes:
- active action IDs,
- exact reproduction steps,
- retry condition for the next run.
Cancel stale or no-longer-needed actions and record cleanup decision in handoff.
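
The bounded polling rules can be sketched as a loop around a status callback. The callback stands in for whichever async status tool MCP currently exposes; its `(status, hit_count)` return shape, the 10-second interval, and the function name are assumptions for illustration, not the Lightrun API.

```python
import time
from typing import Callable, Tuple

TERMINAL = {"COMPLETED", "FAILED", "ERROR", "TIMEOUT", "CANCELLED"}


def poll_bounded(check_status: Callable[[], Tuple[str, int]],
                 last_retrieved_hits: int,
                 budget_seconds: float = 90.0,
                 interval_seconds: float = 10.0,
                 sleep: Callable[[float], None] = time.sleep,
                 clock: Callable[[], float] = time.monotonic) -> Tuple[str, int, bool]:
    """Poll until new hits arrive, status is terminal, or the budget ends.

    Returns (status, hit_count, should_retrieve). should_retrieve is True
    only when the hit count grew beyond what was already retrieved, which
    also covers the one final getter call on a terminal status.
    """
    deadline = clock() + budget_seconds
    while True:
        status, hits = check_status()
        new_hits = hits > last_retrieved_hits
        if new_hits or status in TERMINAL:
            return status, hits, new_hits
        if clock() + interval_seconds > deadline:
            # Budget exhausted: keep the action active and hand off.
            return status, hits, False
        sleep(interval_seconds)
```

The same `should_retrieve` test applies when a later run resumes: call status first, and fetch data only if the hit count exceeds the persisted retrieved count.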

Runtime Action Cleanup Gate

Cleanup review is mandatory before any final response in both same-run and async branches.
Maintain an investigation-owned action list for this run/session state, including each created action ID and its purpose.
Before final handoff, review each investigation-owned action and assign one of:
- cancelled (no longer needed),
- retained (still required for next reproduction window),
- already terminal (completed/failed/timeout/cancelled by system or prior run).
Cancel an investigation-owned action when:
- the mapped hypothesis is ruled out or already confirmed with sufficient evidence,
- the action is duplicate, stale, mistargeted, or replaced by a newer action,
- the investigation is complete and the action is no longer needed.
Retain an action only when a concrete next reproduction step depends on it; include retention reason and expected expiry window.
Do not cancel actions outside the investigation-owned action list.
Do not emit final handoff until cleanup review is complete and reported.
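
A cleanup review over the investigation-owned action list might look like the sketch below. The input shape (action ID mapped to a status and a "still needed" flag) and the function names are hypothetical; the terminal set reuses the statuses named in the polling rules, including ERROR.

```python
from enum import Enum
from typing import Dict, Tuple

TERMINAL = {"COMPLETED", "FAILED", "ERROR", "TIMEOUT", "CANCELLED"}


class CleanupDecision(Enum):
    CANCELLED = "cancelled"
    RETAINED = "retained"
    ALREADY_TERMINAL = "already terminal"


def review_action(status: str, needed_for_next_repro: bool) -> CleanupDecision:
    """Assign exactly one cleanup decision to an owned action."""
    if status.upper() in TERMINAL:
        return CleanupDecision.ALREADY_TERMINAL
    if needed_for_next_repro:
        return CleanupDecision.RETAINED
    return CleanupDecision.CANCELLED


def cleanup_review(owned: Dict[str, Tuple[str, bool]]) -> Dict[str, CleanupDecision]:
    """Review only investigation-owned actions.

    owned maps action ID -> (latest status, needed for next reproduction).
    Actions absent from this mapping are never touched.
    """
    return {action_id: review_action(status, needed)
            for action_id, (status, needed) in owned.items()}
```

Actions marked `CANCELLED` still need an actual cancel call through the exposed MCP tool; this function only records the decision.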

Tool Call Timing

Use tool default collection timing unless the investigation clearly benefits from a different window.
Avoid adding extra timeout constraints to runtime tool calls during normal investigation flow.
When timing is adjusted, include a short reason describing the expected diagnostic benefit.

Investigation Efficiency

Keep evidence collection focused on actions that can change diagnosis or next steps.
Once the bug mechanism is sufficiently confirmed for the current user-impact question, prefer synthesis and handoff over additional broad sampling.
Choose practical capture windows for the current goal; use longer waits only when the expected diagnostic value justifies it.

Action Error Mitigation

If a runtime action returns no hits or a timeout-related failure:
- verify whether a custom timeout/window was set,
- increase the active collection window when the scenario needs more trigger time,
- ask the user to reproduce again within the updated window.
For async actions, distinguish:
- pending/running with no hits yet: reproduction-required handoff, keep action active if still needed,
- terminal with no usable hits: blocker/mitigation path, then either retry with refined targeting or close hypothesis as inconclusive.
If reproduction is confirmed but action has no hits, treat as targeting mismatch:
- do not repeat the same reproduction request with unchanged source/location/signal/hypothesis,
- retarget at least one of source, location, signal definition, or leading hypothesis before next reproduction request.
Re-check source targeting after timeout/no-hit outcomes:
- confirm selected source target(s) still match the suspected execution path,
- ask the user to confirm source choice when confidence drops after failed captures.
For other action errors, consult the Lightrun troubleshooting guide and apply the most relevant remediation:
- https://docs.lightrun.com/troubleshooting/overview/
- summarize which troubleshooting path was used and why it fits the observed error.
Log mitigation decisions in the handoff (what changed and why).
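
The retargeting rule (change at least one of source, location, signal definition, or leading hypothesis before asking for another reproduction) can be checked mechanically. This is a minimal sketch with hypothetical type and function names; the list of changed dimensions doubles as the "what changed" entry for the handoff log.

```python
from typing import List, NamedTuple


class Probe(NamedTuple):
    source: str
    location: str
    signal: str
    hypothesis: str


def changed_dimensions(previous: Probe, current: Probe) -> List[str]:
    """Names of the targeting dimensions that differ between two probes."""
    return [name for name in Probe._fields
            if getattr(previous, name) != getattr(current, name)]


def may_request_repro_again(previous: Probe, current: Probe) -> bool:
    """After a reproduced-but-no-hit outcome, require at least one change."""
    return len(changed_dimensions(previous, current)) > 0
```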

Bug Explanation and Fix Proposal Standard

Explain the bug as a concrete execution path:
- trigger conditions
- key state values observed at runtime
- exact code path or branch that produces the failure
- visible impact for users or system behavior
Tie each diagnosis claim to runtime evidence and code location.
Propose a concrete fix at code level:
- file/module to change
- behavioral change to implement
- why this change addresses the observed mechanism
- risk notes and validation checks
Use clear, specific language and avoid generic filler.

Quick Use Guide

Use this skill in the following sequence:

Define the investigation question in one sentence.
List top hypotheses and the signal expected for each.
Inspect currently exposed runtime tools and choose the evidence path that best fits the hypotheses and available capabilities.
Run preflight and pick runtime source target.
Apply async activation gate after failed sync evidence cycles.
Update hypothesis status after each retrieved signal.
Publish diagnosis, confidence, and next best action, or reproduction-required handoff (async mode only).

Investigation template:

Question:
Impact:
Hypotheses:
- H1:
  - Confirms when:
  - Weakens when:
- H2:
  - Confirms when:
  - Weakens when:
Selected runtime target:
Signals collected:
Leading diagnosis:
Confidence:
Next action:
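
For tooling that tracks the template programmatically, the hypothesis matrix could be modeled roughly as follows. This is an assumed structure: the status vocabulary and the rule that a single strengthened hypothesis is "leading" are simplifications chosen for the example, not definitions from the skill.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Hypothesis:
    hid: str
    statement: str
    confirms_when: str
    weakens_when: str
    status: str = "open"  # assumed values: open, strengthened, weakened


@dataclass
class Investigation:
    question: str
    impact: str
    hypotheses: List[Hypothesis] = field(default_factory=list)

    def record_signal(self, hid: str, confirms: bool) -> None:
        """Update one hypothesis after a retrieved signal."""
        for h in self.hypotheses:
            if h.hid == hid:
                h.status = "strengthened" if confirms else "weakened"
                return
        raise KeyError(hid)

    def leading(self) -> Optional[Hypothesis]:
        """Return the single strengthened hypothesis, if exactly one exists."""
        strong = [h for h in self.hypotheses if h.status == "strengthened"]
        return strong[0] if len(strong) == 1 else None
```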

Flow

1. Frame the investigation problem.
   - Tools: none
   - Success: symptom, impact, expected behavior, and investigation question are explicit.
2. Create a hypothesis matrix.
   - Tools: none
   - Success: at least 2 plausible hypotheses are listed, each with a planned confirming and falsifying signal.
   - Each hypothesis must include: what request/state would make this failure unexpected.
3. Inspect currently exposed runtime tools and choose the initial evidence path.
   - Tools: none
   - Success: selected path matches both investigation needs and currently exposed tool capabilities.
4. Run preflight and select a target source.
   - Tools: get_runtime_sources
   - Success: one or multiple source targets are selected and justified by agent reasoning, with user clarification only when confidence remains low.
5. Map full codepath and choose triggerable evidence points.
   - Tools: none
   - Success: full assumed bug codepath is explored, and action points are placed on executable lines likely to trigger in reproduction.
6. Execute focused same-run evidence steps per hypothesis when immediate capture is feasible.
   - Tools: choose from currently available Lightrun MCP runtime tools based on their descriptions and fit to the active hypothesis.
   - Success: each evidence step is mapped to one hypothesis, changes or preserves a specific decision, and either strengthens or weakens it.
7. Apply async activation gate.
   - Tools: none
   - Success: choose synchronous continuation or asynchronous capture with later resume explicitly; switch to async no later than the second consecutive no-hit/timeout on same codepath.
8. If same-run evidence is insufficient due to longer/uncertain reproduction window, switch to async branch when matching capabilities are currently exposed.
   - Tools: async action creation/status/data retrieval/cancel capabilities, discovered from current MCP tool list.
   - Success: async action IDs are created or resumed only for hypotheses that require delayed evidence.
9. In async branch, run bounded in-session status polling after async action creation.
   - Tools: async status/getter tools.
   - Success: if hits arrive within the in-session polling budget, retrieve them and continue investigation in this run; otherwise proceed to later-resume handoff.
10. In async branch, resume existing async actions before creating new ones.
    - Tools: async status/getter tools for existing action IDs.
    - Success: existing action outcomes are incorporated, duplicate action creation is avoided.
11. Ask for issue reproduction within action window (when async branch is active and bounded in-session polling had no usable results).
    - Tools: none
    - Success: user receives clear reproduction instructions and timing window while runtime actions are active.
12. Iterate investigation loop.
    - Tools: same minimal subset as steps 6-10, depending on active branch.
    - Success: repeat capture and assessment until one leading hypothesis remains, or all hypotheses are inconclusive, including timeout/source mitigation when captures fail; reproduced-no-hit loops must include retargeting before next repro request.
13. Synthesize diagnosis and confidence.
    - Tools: none
    - Success: findings differentiate facts from inference, ruled-out hypotheses are explicit, and uncertainty is bounded.
14. Run mandatory runtime action cleanup gate.
    - Tools: async status/cancel capabilities when relevant.
    - Success: each investigation-owned action is marked cancelled/retained/already-terminal, and cancellations are applied when required.
15. Produce decision-ready handoff with next actions.
    - Tools: none
    - Success: output contract is fully populated with diagnosis quality fields and concrete fix proposal details.

Output Contract

Preflight pass:
- selected source target(s) (agentPoolName + selector mode(s))
- next runtime action (first evidence tool and why)
Preflight fail:
- blocker category
- exact remediation required
- explicit retry condition
Runtime blocker:
- failed Lightrun MCP tool
- reason/error class
- mitigation applied (timeout/window update and/or source revalidation)
- troubleshooting reference used (when applicable)
- immediate next action
Reproduction required:
- active async action IDs (only when async branch is active)
- selected source target(s)
- exact reproduction instruction
- action window used
- expected next signal to capture
- explicit retry condition
Mode decision summary:
- async activation rule met: yes/no
- selected mode: synchronous continuation or asynchronous capture with later resume
- if async not activated, explicit reason
Final handoff:
- selected source target(s) and source-selection note (if user clarification was needed)
- async action state summary (only when async branch is used; per action: id, hypothesis mapping, latest status, retrieved hit count, cleanup decision)
- cleanup summary:
  - cancelled action IDs
  - retained action IDs with reason and expected expiry window
  - already-terminal action IDs
- reproduction instruction + action time window used
- investigation question
- hypothesis matrix result (leading, ruled out, inconclusive)
- evidence summary (facts first)
- bug mechanism summary (trigger, path, failure point, impact)
- diagnosis statement with confidence level
- disconfirming evidence considered
- remaining unknowns and why they matter
- concrete code-fix proposal (target files/modules, behavior change, validation plan)
- recommended next step
- artifact path + checklist status
