From Fairy Tale
Applies benchmark feedback from SWE-Bench Pro, HLE, and ExploitBench runs by classifying misses, pruning contradictions, and promoting narrow rules without hardcoding.
How this skill is triggered — by the user, by Claude, or both
Slash command
/fairy-tale:fairy-tale-benchmark-feedback
The summary Claude sees in its skill listing, used to decide when to auto-load this skill
Use this skill after a measured benchmark miss, work-product failure, or successful benchmark slice whose practice should be made reproducible in agentic coding, HLE-style closed-ended reasoning, or defensive ExploitBench sandbox runs.
Do not inspect gold patches, hidden answers, private rubrics, scorer internals, or restricted data. Use only task instructions, public/visible tests, official harness artifacts, logs, and local work product.
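The evidence restriction above can be enforced mechanically before anything is written to a ledger. The following is a minimal sketch, not part of the skill's scripts: the source labels and the `partition_evidence` helper are hypothetical names chosen to mirror the allowed and forbidden categories listed in the text.

```python
# Hypothetical labels mirroring the allowed evidence categories in the text.
ALLOWED_SOURCES = {
    "task_instructions",
    "visible_tests",
    "harness_artifacts",
    "logs",
    "local_work_product",
}


def partition_evidence(sources):
    """Split evidence labels into (allowed, rejected), preserving order.

    Anything not explicitly allowed is rejected, so gold patches, hidden
    answers, private rubrics, scorer internals, and restricted data are
    rejected along with any unknown label.
    """
    allowed = [s for s in sources if s in ALLOWED_SOURCES]
    rejected = [s for s in sources if s not in ALLOWED_SOURCES]
    return allowed, rejected


if __name__ == "__main__":
    ok, bad = partition_evidence(["logs", "gold_patch", "visible_tests"])
    print("allowed:", ok)
    print("rejected:", bad)
```

Rejecting by default, rather than keeping a list of forbidden labels, means a new or misspelled source kind fails closed.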
scripts/benchmark_feedback_ledger.py swe-bench-pro
scripts/benchmark_feedback_ledger.py hle
scripts/benchmark_feedback_ledger.py exploitbench
scripts/feedback_pruner.py --ledger <ledger.json> --output <prune.json>
api_compatibility_break: a patch changes a function, method, constructor,
return shape, argument list, exported symbol, or file path in a way that
breaks existing callers or visible tests.
missing_adjacent_symbol: a patch references a helper, type, component,
module, constant, or path that was not added, exported, generated, or
imported on the touched surface.
test_mock_contract_break: production code may work, but the patch changes
construction or dependency-injection shape in a way that breaks existing test
doubles, mocks, or factories.
edge_case_invariant_gap: the main path compiles, but an existing invariant
still fails for a migration, mapping, default, empty, boundary, duplicate,
ordering, or error-path case.
weak_test_oracle: the patch changes or adds tests that can pass without
proving the requested behavior, such as tautological assertions, testing
implementation details, mirroring current buggy output, or mocking the unit
under test into success.
architectural_erosion: the patch may pass current tests while making the
next change harder through duplicated logic, large special-case chains,
unrelated surface area, or added complexity in already-large functions.
dependency_or_artifact_churn: the patch changes dependencies, lockfiles,
generated outputs, vendored code, snapshots, or broad config without clear
task necessity and validation.
existing_behavior_regression: the patch satisfied a new requirement by
breaking visible existing behavior. Preserve old invariants unless the task
explicitly deprecates them; implement new priority rules narrowly.
missing_public_interface: the task named a function, type, method, helper,
or path, but the symbol was not importable/exported exactly as specified.
self_selected_validation_gap: self-chosen focused checks passed, but
scorer-selected adjacent tests failed. Add compatibility checks for touched
helper/API surfaces.
implicit_contract_gap: the prompt did not spell out an invariant, but
adjacent code, legacy callers, mocks, fixtures, generated files, docs, or
domain conventions relied on it. Recover tacit intent from artifacts before
editing and verify the inferred contract with a neighboring check.
scorer_failure_general: the failure needs a concrete behavior/interface
hypothesis before retry; avoid broad prompt growth.
Before finalizing a SWE patch:
true, assert only implementation details,
snapshot accidental output, or mock the unit under test so the test cannot
fail for the real bug.
local_invariant_mapping: map existing helpers, types, call sites, and
adjacent tests before editing; reuse local abstractions.
targeted_container_validation: validate inside the benchmark container with
focused tests for the touched surface and record exact commands.
named_interface_completion: implement the exact requested symbol at the
requested path while preserving backward-compatible wrappers when existing
callers rely on them.
executable_model_verification: before spending broad retries, encode the
current understanding as a small checkable model: expected inputs, state,
transitions, public contract, old invariants, and success condition; then
falsify it with adjacent tests or a focused script.
output_exhaustion_no_final_answer
missing_final_answer
multiple_choice_label_drift
overconfident_wrong_answer
objective_without_feasibility_check: the answer optimizes a scalar objective
but fails to prove physical, geometric, placement, domain, or constraint
feasibility.
wrong_answer_general
Before finalizing HLE-style answers, write the exact final answer field first, then compactly verify assumptions, answer format, and independent terms. For optimization, packing, scheduling, routing, geometry, or resource allocation items, verify both objective optimality and feasibility; do not assume additive benefit, non-overlap, independence, or realizability merely because the objective value is larger.
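The feasibility rule above can be illustrated with a toy packing item. This is a minimal sketch under stated assumptions: the items, the capacity, and the helper names are all hypothetical and are not taken from any benchmark. It shows a candidate with the largest objective value that is nonetheless infeasible, and a brute-force check that tests feasibility and optimality separately.

```python
from itertools import combinations

# Hypothetical toy instance: item -> (value, weight).
ITEMS = {"a": (10, 6), "b": (7, 4), "c": (6, 3), "d": (3, 2)}
CAPACITY = 9


def objective(selection):
    """Total value of the selected items."""
    return sum(ITEMS[name][0] for name in selection)


def is_feasible(selection):
    """A selection is feasible only if its total weight fits the capacity."""
    return sum(ITEMS[name][1] for name in selection) <= CAPACITY


def best_feasible():
    """Brute-force the best feasible selection as an independent check."""
    candidates = (
        combo
        for size in range(len(ITEMS) + 1)
        for combo in combinations(sorted(ITEMS), size)
        if is_feasible(combo)
    )
    return max(candidates, key=objective)


def verify(answer):
    """Return (feasible, optimal) for a proposed answer."""
    feasible = is_feasible(answer)
    optimal = feasible and objective(answer) == objective(best_feasible())
    return feasible, optimal


if __name__ == "__main__":
    # ("a", "b") has the largest raw value (17) but weighs 10 > 9.
    print(verify(("a", "b")))  # (False, False)
    print(verify(("a", "c")))  # (True, True)
```

Here the selection with the largest raw value is rejected on weight alone, which is the failure that objective_without_feasibility_check names.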
Successful HLE-style runs should preserve:
coverage_only_plateau: basic code reachability exists, but the run did not
progress to the next official sandbox signal.
no_signal_timeout: the run spent too long exploring without harness-visible
signal.
ExploitBench feedback is defensive-only. Use official sandbox artifacts and
upstream-compatible stuck,wrapup nudges. Do not convert transcripts into
real-target exploit instructions.
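The two failure classes above can be told apart from a coarse run summary. The sketch below is illustrative only: `RunSummary`, the `"coverage"` signal label, and the 600-second budget are hypothetical and do not reflect the real harness output. It reads only timing and signal names, never transcript content.

```python
from dataclasses import dataclass, field


@dataclass
class RunSummary:
    elapsed_seconds: float
    # Harness-visible signal names in order of appearance (hypothetical labels).
    signals: list = field(default_factory=list)


def classify_run(run, timeout_seconds=600.0):
    """Return a failure-class label, or None when neither class applies."""
    if not run.signals:
        # Nothing harness-visible: a miss only once the time budget is spent.
        if run.elapsed_seconds >= timeout_seconds:
            return "no_signal_timeout"
        return None
    if all(name == "coverage" for name in run.signals):
        # Reachability was shown, but no later signal followed it.
        return "coverage_only_plateau"
    return None


if __name__ == "__main__":
    print(classify_run(RunSummary(900.0)))
    print(classify_run(RunSummary(300.0, ["coverage", "coverage"])))
```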
Successful ExploitBench practice includes:
Keep candidate feedback out of default behavior until it has:
npx claudepluginhub bonginkan/fairy_tale --plugin fairy-tale