Estimates heterogeneous treatment effects using Causal Forest and DML with BLP/GATES/CLAN/TOC validation and policy learning (policytree).
Slash command: /everyday-causal-skills:causal-hte
You guide users through a complete heterogeneous treatment effect analysis following a 5-stage pattern: Setup → Assumptions → Implementation → Validation → Interpretation + Policy.
- references/lessons.md: known mistakes. Do not repeat them.
- references/assumptions/hte.md: the assumption checklist for HTE methods.
- references/method-registry.md → "Heterogeneous Treatment Effects (HTE) / CATE Estimation" section.
- docs/causal-plans/*/plan.md: check whether a plan exists. If it does, read it for context.

If a plan document from /causal-planner is provided: Extract the study design (treatment, population, outcome, data structure, language) directly from the plan. Do not re-ask questions the planner already answered. Acknowledge the plan and build on it.
If coming from another ATE skill (matching, experiments, DiD, IV): Inherit the treatment, outcome, covariates, and identification strategy. Ask only HTE-specific questions below.
If plan exists: Read it. Extract business objective, treatment, covariates, outcome, language, data structure. Confirm: "I've read your analysis plan. You're estimating the effect of [treatment] on [outcome] and now want to explore heterogeneity. Does that sound right?"
If no plan / standalone: Ask:
HTE-specific questions (always ask):
X vs W decision aid (MUST present to user):
| Category | Goes in... | Meaning |
|---|---|---|
| W (confounders) | Nuisance models only | Affects both treatment AND outcome. Needed for identification. |
| X (effect modifiers) | CATE model | Might change the SIZE of the treatment effect. |
| Both X and W | Both stages | Variable is a confounder AND might moderate the effect. When in doubt, include in both. |
Platform note: In grf, all covariates go in one matrix X — there is no separate W argument. grf handles confounding control internally. In econml, X and W are separate arguments — putting a confounder only in X (not W) biases estimates.
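The decision aid above can be sketched as a small helper that turns a role assignment into column lists. This is only an illustration: `split_covariates` and the column names are hypothetical, not part of grf or econml, and the "both" rule simply follows the table.

```python
def split_covariates(roles):
    """Split covariates by role, following the X vs W table.

    roles maps column name -> "X" (effect modifier), "W" (confounder),
    or "both". Returns (x_cols, w_cols, grf_cols).
    """
    valid = {"X", "W", "both"}
    x_cols, w_cols = [], []
    for name, role in roles.items():
        if role not in valid:
            raise ValueError(f"{name}: role must be one of {sorted(valid)}")
        if role in ("X", "both"):
            x_cols.append(name)
        if role in ("W", "both"):
            w_cols.append(name)
    # grf takes one covariate matrix: the union, with duplicates removed
    # while keeping first-seen order.
    grf_cols = list(dict.fromkeys(x_cols + w_cols))
    return x_cols, w_cols, grf_cols


# Hypothetical column names, for illustration only.
roles = {"tenure_months": "both", "region": "W", "plan_tier": "X"}
x_cols, w_cols, grf_cols = split_covariates(roles)
print(x_cols, w_cols, grf_cols)
```

Writing the roles down in one place makes it easy to show the user the assignment and have them confirm it before any model is fit.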
Pre-flight checks (before proceeding to Stage 2):
Read references/assumptions/hte.md. Walk through each assumption interactively:
Critical framing (state this explicitly): "HTE estimation does NOT relax identification assumptions. If your ATE would be biased (e.g., unmeasured confounders), your CATEs are biased too. Machine learning does not overcome confounding."
For each assumption:
Key assumptions to walk through:
Conditional independence / unconfoundedness: Same as matching — must believe all confounders are measured. If coming from an RCT, this is satisfied by design. If observational, discuss plausibility.
Overlap / positivity (subgroup-level): "For HTE, overlap must hold within each CATE subgroup, not just overall. If the highest-effect group has propensity scores near 1, the effect estimate for that group is extrapolation."
SUTVA (no interference): Same as all methods.
Effect modifiers must be pre-treatment: "Variables in X must be measured before treatment. Post-treatment variables create spurious heterogeneity — the forest will 'discover' patterns that are mechanical, not real."
Sufficient sample size: n ≥ 2,000 for causal forests, n ≥ 100 per CATE quintile for reliable GATES.
Honest estimation / sample splitting: Verify honesty = TRUE in grf, cv >= 3 in econml.
After all assumptions, summarize with status indicators per assumption.
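The subgroup-level overlap assumption can be checked mechanically once CATE predictions and propensity scores exist. The sketch below is a plain-Python illustration, not the template code: `overlap_by_cate_quintile` is a hypothetical helper, the input numbers are made up, and the 0.05 / 0.95 bounds are the ones used in this skill's severity table.

```python
def overlap_by_cate_quintile(cate, propensity, low=0.05, high=0.95):
    """Group units into CATE quintiles and report the propensity range in each.

    Returns one dict per quintile (1 = lowest predicted effect).
    """
    if len(cate) != len(propensity):
        raise ValueError("cate and propensity must have the same length")
    n = len(cate)
    if n < 5:
        raise ValueError("need at least 5 units to form quintiles")
    order = sorted(range(n), key=lambda i: cate[i])  # unit indices by CATE
    report = []
    for q in range(5):
        idx = order[q * n // 5:(q + 1) * n // 5]
        ps = [propensity[i] for i in idx]
        report.append({
            "quintile": q + 1,
            "n": len(idx),
            "min_ps": min(ps),
            "max_ps": max(ps),
            "overlap_ok": min(ps) >= low and max(ps) <= high,
        })
    return report


# Made-up predictions for ten units.
cate = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0]
propensity = [0.5, 0.4, 0.6, 0.5, 0.3, 0.7, 0.5, 0.6, 0.9, 0.97]
report = overlap_by_cate_quintile(cate, propensity)
for row in report:
    print(row)
```

In this toy input the highest-effect quintile contains a propensity of 0.97, so its effect estimate would rest on extrapolation.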
Generate complete analysis code. Read the appropriate template from templates/r/hte.md or templates/python/hte.md for code patterns.
Missing-package preflight: The template's Prerequisites block detects (never installs) missing packages. Follow references/preflight.md: report what's missing, then ask the user whether they want you to install it for them or do it themselves — install only on an explicit yes.
IMPORTANT — Template adherence: Copy the code pattern from the appropriate template exactly, then adapt only variable names to match the user's data. Do not restructure the code, use alternative function APIs, or improvise. The templates have been tested; deviations introduce bugs.
Two-pass approach (always follow this order):
LinearDML first pass (always run — fast, interpretable, screens for signal):
- R: best_linear_projection() on a quick causal forest
- Python: LinearDML with summary()

Causal Forest (primary estimator):
W.hat = rep(0.5, n) in R, DummyClassifier in Python)

Always include:
Generate validation code from templates. Full sequence:
Calibration test (gatekeeper — R only via test_calibration()):
BLP (Best Linear Predictor):
average_treatment_effect()

GATES + overlap-within-quintile check:
CLAN (Classification Analysis):
TOC/RATE (R: rank_average_treatment_effect(eval_forest, priorities)):
rank_average_treatment_effect(cf) grades the forest with its own predictions; its symmetric CI is anti-conservative (grf rejects a true null ~30% of the time that way). priorities is a required argument in current grf.

Stability check:
Before proceeding to interpretation, confirm ALL of the following from actual code output:
If any box is unchecked: Flag it to the user — explain which evidence is missing and why it matters. Offer to run the missing step before interpreting. If the user chooses to continue anyway, carry the gap forward as a caveat in the interpretation.
Watch for premature conclusions — phrases like "The heterogeneity suggests..." before the gate passes. Quote actual output instead.
Severity verdicts must appear BEFORE this gate. If a Fatal or Serious issue was identified during Stage 2 or Stage 3, the severity verdict block must already be visible in the output above.
| Signal | Severity | Action |
|---|---|---|
| Post-treatment variable in X | 🚨 Fatal | Spurious heterogeneity. Remove variable before estimation. |
| Propensity < 0.05 or > 0.95 in any CATE quintile | 🚨 Fatal | GATE for that quintile is extrapolation. Warn user. |
| Honest splitting turned off (honesty=FALSE / cv=1) | 🚨 Fatal | CIs invalid, CATEs overfit. Require re-estimation. |
| n < 2,000 total | ⚠️ Serious | Low power for heterogeneity detection. Recommend LinearDML only. |
| n < 100 per CATE quintile | ⚠️ Serious | GATES unreliable for small quintiles. |
| Calibration test fails (both terms non-significant) | ⚠️ Serious | Forest may be fitting noise. |
| BLP coefficient not significant | ⚠️ Serious | Cannot detect heterogeneity at this sample size. |
| GATES CIs all overlap | ⚠️ Serious | No detectable difference between CATE groups. |
| Single variable > 60% of importance | ⚠️ Serious | May indicate confounding with treatment, not moderation. Investigate. |
| Variable importance changes across seeds | ⚠️ Serious | Heterogeneity signal is not robust. |
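Several rows of the table are simple threshold checks, so they can be screened automatically before any judgment call. The following is a minimal sketch under that assumption: `triage` is a hypothetical helper, it covers only the mechanically checkable rows, and the inputs are made up.

```python
def triage(n_total, quintile_sizes, honest, top_importance_share):
    """Map threshold-style signals from the severity table to a severity.

    Returns a list of (signal, severity) pairs; an empty list means none
    of these particular checks fired.
    """
    findings = []
    if not honest:
        findings.append(("Honest splitting turned off", "FATAL"))
    if n_total < 2000:
        findings.append(("n < 2,000 total", "SERIOUS"))
    if min(quintile_sizes) < 100:
        findings.append(("n < 100 per CATE quintile", "SERIOUS"))
    if top_importance_share > 0.60:
        findings.append(("Single variable > 60% of importance", "SERIOUS"))
    return findings


# Made-up inputs: 1,800 units, equal quintiles, one dominant variable.
findings = triage(
    n_total=1800,
    quintile_sizes=[360, 360, 360, 360, 360],
    honest=True,
    top_importance_share=0.72,
)
for signal, severity in findings:
    print(f"{severity}: {signal}")
```

Signals such as post-treatment variables or overlapping GATES intervals still need the diagnostics and the user's domain knowledge; this screen does not replace them.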
🚨 Fatal = Emit this verdict block immediately after the diagnostic that reveals the violation:
FATAL: [violation name]
[One sentence: what was found in the data.]
This analysis should not proceed without addressing this issue. Results produced under this violation are not trustworthy.

If you cannot yet confirm the violation (because the user hasn't run diagnostic code), use CONDITIONAL FATAL: [violation name] with the same format but replace the consequence line with: "If [specific diagnostic condition], this analysis should not proceed. Run the diagnostic above and report the result before continuing."

If the user chooses to continue despite a Fatal verdict, repeat the verdict verbatim in Stage 5 interpretation.
⚠️ Serious = Emit this block:
SERIOUS: [limitation name]
[One sentence: what was found.]
Proceeding is possible, but the interpretation must prominently acknowledge this limitation and its consequences.
Use only FATAL and SERIOUS severity labels. Do not invent additional tiers.
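To keep verdict wording consistent across a session, the three block formats can be generated from one function. This is a sketch only: `verdict_block` is a hypothetical helper, the three-line layout is an assumption about how the blocks are laid out, and the example finding is invented.

```python
FATAL_CONSEQUENCE = (
    "This analysis should not proceed without addressing this issue. "
    "Results produced under this violation are not trustworthy."
)
SERIOUS_CONSEQUENCE = (
    "Proceeding is possible, but the interpretation must prominently "
    "acknowledge this limitation and its consequences."
)


def verdict_block(severity, name, finding, condition=None):
    """Render a verdict as three lines: header, finding, consequence.

    Passing `condition` with severity "FATAL" yields the CONDITIONAL FATAL
    form; `condition` is ignored for "SERIOUS".
    """
    if severity == "FATAL" and condition is not None:
        header = f"CONDITIONAL FATAL: {name}"
        consequence = (
            f"If {condition}, this analysis should not proceed. Run the "
            "diagnostic above and report the result before continuing."
        )
    elif severity == "FATAL":
        header, consequence = f"FATAL: {name}", FATAL_CONSEQUENCE
    elif severity == "SERIOUS":
        header, consequence = f"SERIOUS: {name}", SERIOUS_CONSEQUENCE
    else:
        # Only two tiers exist; anything else is a caller error.
        raise ValueError(f"unknown severity: {severity!r}")
    return "\n".join([header, finding, consequence])


# Invented finding, for illustration only.
block = verdict_block(
    "FATAL",
    "Post-treatment variable in X",
    "A covariate in X was measured after treatment assignment.",
)
print(block)
```

Rejecting any other label in code mirrors the rule that no additional severity tiers should be invented.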
| Shortcut | Reality |
|---|---|
| "The causal forest found the heterogeneity, so it must be real" | Causal forests discover patterns in data. Without validation (BLP, GATES), you don't know if the pattern is real or noise. |
| "Variable importance tells us what drives the treatment effect" | Variable importance measures splitting value, not causal moderation. A variable can be important for splitting without being a true effect modifier. |
| "We can skip LinearDML — the forest is more flexible" | LinearDML is a diagnostic, not a competitor. It screens for signal quickly and provides interpretable coefficients. Always run it first. |
| "No heterogeneity detected means effects are homogeneous" | It means you lack power to detect heterogeneity. The ATE applies broadly — which is a valid and useful finding. |
| "The policy tree tells us who to treat" | It's an exploratory rule, not a deployment-ready policy. Validate on held-out data and run a confirmatory experiment. |
Three layers, presented in order:
"Based on the HTE analysis:
Important caveat: Variable importance measures splitting value, not causal moderation. A variable that is important for the forest's predictions is not necessarily a causal modifier — it could correlate with a true modifier."
Ask: "What is the cost of treatment per unit? (If free or unknown, I'll use 0.)"
Present three benchmarks:
Only offer if:
Default: depth = 2 (shallow, interpretable). Cost-adjusted rewards.
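One way to read "cost-adjusted rewards", assumed here because the text does not spell out the formula, is: the reward for treating a unit is its estimated effect minus the per-unit cost, and the reward for not treating is zero. The sketch below uses made-up point predictions and hypothetical helper names; it does not fit a depth-2 tree, and a real analysis would typically feed doubly robust scores to the policy learner rather than raw CATE predictions.

```python
def cost_adjusted_rewards(cate, cost):
    """Return one (reward_if_untreated, reward_if_treated) pair per unit."""
    return [(0.0, effect - cost) for effect in cate]


def policy_value(rewards, policy):
    """Average reward when unit i receives action policy[i] (0 or 1)."""
    if len(rewards) != len(policy):
        raise ValueError("rewards and policy must have the same length")
    total = sum(pair[action] for pair, action in zip(rewards, policy))
    return total / len(policy)


# Made-up predicted effects and cost, in the same units as the outcome.
cate = [5.0, 1.0, -2.0, 8.0]
rewards = cost_adjusted_rewards(cate, cost=3.0)

# Hypothetical rule: treat only where the cost-adjusted reward is positive.
targeted = [1 if treat > 0 else 0 for _, treat in rewards]
print(targeted, policy_value(rewards, targeted))
```

With a cost of 0 the rewards reduce to the effects themselves, which matches the fallback of using 0 when the treatment is free or its cost is unknown.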
Deployment disclaimer (ALWAYS shown with any policy output):
This is an exploratory targeting rule, not a deployment-ready policy. Before operationalizing: (1) validate on held-out data, (2) run a confirmatory experiment, (3) review for fairness and equity, (4) get domain expert review.
Fairness check (ALWAYS run with policy output): Check if the policy correlates with protected attributes (gender, race, age group) even if they were not used in the tree.
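A first-pass version of this check is to compare the share of units the policy would treat across levels of a protected attribute. The sketch below is one simple way to do that, with a hypothetical helper name and made-up data; a large gap is a prompt for review, not a verdict on its own.

```python
from collections import defaultdict


def treat_rate_by_group(policy, group):
    """Share of units assigned to treatment within each group level."""
    if len(policy) != len(group):
        raise ValueError("policy and group must have the same length")
    counts = defaultdict(lambda: [0, 0])  # level -> [n_treated, n_total]
    for assigned, level in zip(policy, group):
        counts[level][0] += assigned
        counts[level][1] += 1
    return {level: treated / total for level, (treated, total) in counts.items()}


# Made-up policy assignments (1 = treat) and a protected attribute
# that was not used to build the policy.
policy = [1, 1, 0, 1, 0, 0, 0, 1]
group = ["a", "a", "a", "a", "b", "b", "b", "b"]

rates = treat_rate_by_group(policy, group)
gap = max(rates.values()) - min(rates.values())
print(rates, gap)
```

Because the attribute is supplied separately from the policy inputs, the same function works for attributes that were deliberately excluded from the tree.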
Finding no heterogeneity is a valid result: "Your ATE estimate from [upstream method] appears to apply broadly. This is useful — it means you don't need to segment or target."
Save alongside the plan (or create a new directory if standalone):
docs/causal-plans/YYYY-MM-DD-<project>/
├── plan.md # From planner (or created here if standalone)
├── implementation.md # This skill's stage-by-stage summary
└── analysis.[R|py] # Generated code
Use the Write tool. Tell the user where files are saved.
"Your HTE analysis is complete. Recommended next steps:
- /causal-auditor to stress-test for threats to validity.
- /causal-exercises to try HTE on simulated data with known ground truth."

Before this skill:
- /causal-planner -- Identifies method and saves analysis plan (recommended)
- /causal-matching, /causal-experiments, /causal-did, /causal-iv -- Any ATE skill can hand off here

After this skill:
- /causal-auditor -- Stress-test results for threats to validity (recommended)
- /causal-exercises -- Practice HTE on simulated data (optional)

If assumptions fail:
- /causal-matching -- If overlap is the main issue (re-examine propensity model)
- /causal-experiments -- If you can run an RCT (strongest identification for HTE)

If the user corrects you, append to references/lessons.md:
### HTE: [Short description]
**Trigger**: [When this tends to happen]
**Mistake**: [What went wrong]
**Rule**: [What to do instead]
**Source**: User correction, [date]
npx claudepluginhub robsontigre/everyday-causal-skills --plugin everyday-causal-skills