From llm-prompts
Research-backed, provider-agnostic guidance for writing LLM prompts and agent instructions — positive framing, XML structure, contrastive examples, outcome-first agentic contracts, and provider tuning knobs. Use when writing or revising a system prompt, agent or tool instructions, a tool description, or few-shot examples; or when the user asks to improve a prompt.
How this skill is triggered — by the user, by Claude, or both
Slash command
/llm-prompts:llm-promptsThe summary Claude sees in its skill listing — used to decide when to auto-load this skill
Prompt engineering is the wording of a single instruction. Context engineering is the architecture around it — tools, memory, sub-agents, retrieval, how context is layered and ordered. For production agents the architecture matters more than any one phrasing. The principles below are provider-agnostic; per-model knobs live in `reference/`.
Prompt engineering is the wording of a single instruction. Context engineering is the architecture around it — tools, memory, sub-agents, retrieval, how context is layered and ordered. For production agents the architecture matters more than any one phrasing. The principles below are provider-agnostic; per-model knobs live in reference/.
Seven research-backed rules govern every prompt. Each traces to a finding in reference/research.md.
Tell the model what TO do, not what NOT to do. Models handle negation poorly at the token level, and instruction-following degrades with negative framing. Replace a <forbidden_patterns> block with <preferred_patterns>. When a negative constraint is unavoidable (safety, data loss), pair it with the positive alternative and the reason.
# Avoid
Do NOT bundle multiple facts. NEVER include unsupported claims.
# Prefer
Each claim is atomic — one verifiable fact, grounded in the source.
This lets downstream systems validate each claim independently.
Pair correct and incorrect side by side in labeled <example> tags, each with a one-line reason. Contrastive pairs beat good-only or bad-only sets. A standalone <bad_examples> block with no adjacent correct counterpart is counterproductive.
<examples>
<example label="incomplete">
"Revenue increased by 50%"
Missing: whose revenue, which timeframe — not verifiable alone.
</example>
<example label="self-contained">
"Acme's Q3 2025 revenue increased by 47% year over year"
Entity, metric, timeframe, baseline all present.
</example>
</examples>
Replace CRITICAL:, IMPORTANT:, NEVER: with a plain statement that explains why the constraint exists. Emphasis markers overtrigger modern models and waste reasoning tokens. Reserve ALL CAPS for genuine safety boundaries (deletion, payment, irreversible side effects).
# Avoid
IMPORTANT: Each question must ask for ONE value. NEVER ask for comparisons.
# Prefer
Each question asks for one value that can be looked up directly. A question
comparing two values is analytical, not a lookup — it needs interpretation.
Strong models do worse with over-constrained prompts; instruction-following degrades as the rule list grows. Write the shortest prompt that produces the correct output. Add a constraint only after you observe the specific failure it prevents — never speculatively.
Keep the reasoning step in free text. A JSON or constrained schema on the reasoning step drops accuracy sharply by forcing answer-before-reasoning ordering. Let the model reason freely, then capture structure in a separate step: a structured-output call, a follow-up request, or a dedicated format field that runs after the reasoning.
State what a correct output looks like and let the model self-verify. Append "Before finishing, verify your answer against these criteria" rather than prescribing PASS/FAIL gates. Reasoning models verify natively; step-by-step gates over-constrain the reasoning and can lower quality.
# Avoid: procedural gate
Step 1: identify the entity — if generic, SKIP.
Step 2: check if permanent — if study-specific, SKIP.
# Prefer: declarative criteria
<success_criteria>
A claim is extractable when it names a specific entity, describes a permanent
property (not study-specific), and would be independently citable.
</success_criteria>
For long-context prompts, put reference material first and the instruction or query last. All major providers confirm this ordering; one reports up to 30% quality improvement from it.
State the destination, not the route. Define the expected outcome, success criteria, allowed side effects, evidence rules, and output shape; let the model pick the path. Reserve step-by-step procedures for cases where the path itself matters (a fixed audit order, a required tool sequence).
Resolve the request end to end.
Success: the decision follows from the provided data; any allowed action is
completed before responding; the reply includes completed_actions and blockers.
Allowed side effects: read-only lookups, and the single action named above.
Reasoning effort, verbosity, and thinking depth tune the same prompt — they are not the primary quality lever. When output is incomplete or under-verified, add structure first: a completeness contract, a verification loop, tool-persistence rules. Raise effort only after the prompt itself is sound, and re-baseline cost and latency at the new level.
Modern models follow instructions literally and will not silently generalize one item's rule to all items. State scope explicitly: "apply this to every section, not just the first." On a mid-conversation change, say which earlier rules still hold:
<task_update>
For the next response only: [instruction].
All earlier instructions still apply unless they conflict with this update.
</task_update>
XML tags constrain structure without constraining reasoning, and outperform JSON/YAML for organizing a prompt. Use descriptive, self-documenting tag names.
Top-level structural tags:
<instruction> the task </instruction>
<context> reference material, placed before the instruction for long inputs </context>
<output_format> exact shape of the expected output </output_format>
<examples> contrastive pairs </examples>
Content-specific tags carry meaning in the name: <success_criteria>, <filtering_criteria> (include/exclude with reasons), <action_rules> (when to append/merge/add/drop), <preferred_patterns>, <key_constraints> (constraints with their reasons).
The example pattern is always contrastive — correct and incorrect adjacent, each with a reason:
<examples>
<example label="non-atomic">
"Company laid off 30% of staff and the CEO says all is well"
Two independent facts bundled — neither is separately verifiable.
</example>
<example label="atomic">
"Company laid off 30% of its staff in 2025"
One entity, one action, one timeframe.
</example>
</examples>
Self-check tags state the test declaratively, not as a procedure:
<merge_test>
Two claims describe the same fact when they would produce the same headline.
Same headline, different wording → merge. Two headlines → keep separate.
</merge_test>
For agents, encode behavior as reusable XML contract blocks rather than prose. Each is one focused block; full text is in reference/openai.md.
<tool_persistence_rules> — keep calling tools until the task is complete and verification passes; retry on empty/partial results.<completeness_contract> — track an internal checklist; treat the task as incomplete until every item is covered or marked [blocked].<verification_loop> — before finalizing, check correctness, grounding, formatting, and irreversible side effects.<empty_result_recovery> — on empty or suspiciously narrow results, try one or two fallback strategies before concluding nothing exists.<dependency_checks> — resolve prerequisite lookups before acting; don't skip them because the final action seems obvious.<parallel_tool_calling> — parallelize independent lookups; run dependent steps sequentially.Claude's latest models default to action and parallelize well: they make independent tool calls simultaneously, and respond to "investigate before answering" by reading referenced files before claiming anything about code. They favor reasoning over tool calls by default, so raising effort is the lever that increases tool usage. The full Claude blocks are in reference/claude.md.
| Provider | Reasoning / effort (default) | Verbosity | Thinking default |
|---|---|---|---|
| Anthropic Opus 4.8 | effort: low–max, default high; set xhigh for coding/agentic | calibrated to task complexity; prompt to constrain | adaptive thinking off unless thinking: {type: "adaptive"} |
| OpenAI GPT-5.5 | reasoning_effort: none–xhigh, default medium | verbosity, default medium; low is often a good start | internal reasoning; no explicit CoT prompt |
| Google Gemini 3 | thinkingLevel: minimal–high; Pro can't fully disable thinking | — | thinking on; keep temperature at 1.0 |
Defer per-model detail to reference/.
CRITICAL:/IMPORTANT:/NEVER overtrigger modern models. Explain why the constraint exists.<forbidden_patterns> instead of <preferred_patterns>. State what to do; pair any necessary negation with the positive alternative.reference/claude.md — Anthropic Opus 4.x: effort levels, adaptive thinking, parallel-tool and investigate-before-answering blocks, prefill removal.reference/openai.md — OpenAI GPT-5.x: reasoning effort, verbosity, and the full agentic XML contract blocks.reference/gemini.md — Google Gemini 3: thinkingLevel, temperature, few-shot-over-instructions, thought signatures.reference/research.md — the findings table behind the core principles, with sources.Creates, edits, and optimizes skills for Claude Code, including drafting, evaluating with test prompts, iterating on performance, and improving skill descriptions for better triggering accuracy.
npx claudepluginhub yasyf/cc-skills --plugin llm-prompts