Analyzes failure modes, generates prompt variants (direct, few-shot, CoT), designs rubrics, and produces test suites for LLM prompt engineering.
Slash command: /armory:prompt-lab
Replaces trial-and-error prompt engineering with structured methodology: objective definition, current prompt analysis, variant generation (instruction clarity, example strategies, output format specification), evaluation rubric design, test case creation, and failure mode identification.
| File | Contents | Load When |
|---|---|---|
| references/prompt-patterns.md | Prompt structure catalog: zero-shot, few-shot, CoT, persona, structured output | Always |
| references/evaluation-metrics.md | Quality metrics (accuracy, format compliance, completeness), rubric design | Evaluation needed |
| references/failure-modes.md | Common prompt failure taxonomy, detection strategies, mitigations | Failure analysis requested |
| references/output-constraints.md | Techniques for constraining LLM output format, JSON mode, schema enforcement | Format control needed |
If an existing prompt is provided:
- Which failure modes (see references/failure-modes.md) apply to this prompt?

Create 2-4 prompt variants, each testing a different hypothesis:
| Variant Type | Hypothesis | When to Use |
|---|---|---|
| Direct instruction | Clear instruction is sufficient | Simple tasks, capable models |
| Few-shot | Examples improve output consistency | Pattern-following tasks |
| Chain-of-thought | Reasoning improves accuracy | Multi-step logic, math, analysis |
| Persona/role | Role framing improves tone/expertise | Domain-specific tasks |
| Structured output | Format specification prevents errors | JSON, CSV, specific templates |
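As a rough illustration of the variant types in the table above, the following sketch builds three variants of the same task from one task description. The function name `build_variants`, the `{input}` placeholder, and the exact wording of each prompt are assumptions made for this example, not part of the skill.

```python
from typing import Dict, List, Tuple

PLACEHOLDER = "{input}"  # hypothetical marker replaced with the real input at run time


def build_variants(task: str, examples: List[Tuple[str, str]]) -> Dict[str, str]:
    """Build direct, few-shot, and chain-of-thought variants of one task."""
    # Direct instruction: the task statement and nothing else.
    direct = f"{task}\n\nInput: {PLACEHOLDER}\nOutput:"

    # Few-shot: the same task preceded by worked input/output pairs.
    shots = "\n\n".join(f"Input: {i}\nOutput: {o}" for i, o in examples)
    few_shot = f"{task}\n\n{shots}\n\nInput: {PLACEHOLDER}\nOutput:"

    # Chain-of-thought: ask for reasoning before a clearly marked answer.
    cot = (
        f"{task}\n"
        "Reason step by step, then give the final answer on a line "
        "starting with 'Answer:'.\n\n"
        f"Input: {PLACEHOLDER}"
    )
    return {"direct": direct, "few_shot": few_shot, "chain_of_thought": cot}


variants = build_variants(
    "Classify the sentiment of the input as positive or negative.",
    [("great product", "positive"), ("broke in a day", "negative")],
)
filled = variants["few_shot"].replace(PLACEHOLDER, "works fine")
```

Keeping every variant tied to the same task string means a score difference can be attributed to the strategy rather than to a reworded task.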
For each variant:
Rubric — Define weighted criteria:
| Criterion | What It Measures | Typical Weight |
|---|---|---|
| Correctness | Output matches expected answer | 30-50% |
| Format compliance | Follows specified structure | 15-25% |
| Completeness | All required elements present | 15-25% |
| Conciseness | No unnecessary content | 5-15% |
| Tone/style | Matches requested voice | 5-10% |
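One way to turn the weighted criteria above into a single number is sketched below. The specific weights are hypothetical values picked from inside the typical ranges in the table, and the 0-3 per-criterion scale is an assumption borrowed from the scoring column of the output template.

```python
from typing import Dict

# Hypothetical weights, each within the "Typical Weight" range; they sum to 1.0.
WEIGHTS: Dict[str, float] = {
    "correctness": 0.40,
    "format_compliance": 0.20,
    "completeness": 0.20,
    "conciseness": 0.10,
    "tone_style": 0.10,
}

MAX_SCORE = 3  # assumed 0-3 scale per criterion


def weighted_score(scores: Dict[str, int], weights: Dict[str, float] = WEIGHTS) -> float:
    """Combine per-criterion scores (0-3) into one weighted score in [0.0, 1.0]."""
    if abs(sum(weights.values()) - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1.0")
    missing = set(weights) - set(scores)
    if missing:
        raise ValueError(f"missing scores for: {sorted(missing)}")

    total = 0.0
    for name, weight in weights.items():
        value = scores[name]
        if not 0 <= value <= MAX_SCORE:
            raise ValueError(f"{name} score {value} is outside 0-{MAX_SCORE}")
        # Normalize each criterion to 0-1 before applying its weight.
        total += weight * (value / MAX_SCORE)
    return total


example = weighted_score({
    "correctness": 3,
    "format_compliance": 0,  # e.g. the output ignored the requested structure
    "completeness": 3,
    "conciseness": 3,
    "tone_style": 3,
})
```

Rejecting weight sets that do not sum to 1.0 keeps scores comparable across rubrics.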
Test cases — Minimum 5 cases covering:
Present variants, rubric, and test cases in a structured format ready for execution.
## Prompt Lab: {Task Name}
### Objective
{What the prompt should achieve — specific and measurable}
### Success Criteria
- [ ] {Criterion 1 — measurable}
- [ ] {Criterion 2 — measurable}
### Current Prompt Analysis
{If existing prompt provided}
- **Strengths:** {what works}
- **Weaknesses:** {what fails or is ambiguous}
- **Missing:** {what's not specified}
### Variants
#### Variant A: {Strategy Name}
{Complete prompt text}
**Hypothesis:** {Why this approach might work}
**Risk:** {What could go wrong}
#### Variant B: {Strategy Name}
{Complete prompt text}
**Hypothesis:** {Why this approach might work}
**Risk:** {What could go wrong}
#### Variant C: {Strategy Name}
{Complete prompt text}
**Hypothesis:** {Why this approach might work}
**Risk:** {What could go wrong}
### Evaluation Rubric
| Criterion | Weight | Scoring |
|-----------|--------|---------|
| {criterion} | {%} | {how to score: 0-3 scale or pass/fail} |
### Test Cases
| # | Input | Expected Output | Tests Criteria |
|---|-------|-----------------|---------------|
| 1 | {standard input} | {expected} | Correctness, Format |
| 2 | {edge case} | {expected} | Completeness |
| 3 | {adversarial} | {expected} | Robustness |
### Failure Modes to Monitor
- {Failure mode 1}: {detection method}
- {Failure mode 2}: {detection method}
### Recommended Next Steps
1. Run all variants against the test suite
2. Score using the rubric
3. Select the highest-scoring variant
4. Iterate on the winner with targeted improvements
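The first three of the recommended next steps can be sketched as a small evaluation loop. Everything here is illustrative: `evaluate_variants`, `fake_model`, and `exact_match` are hypothetical names, and the fake model stands in for a real LLM call so that the example runs offline.

```python
from typing import Callable, Dict, List, Tuple

TestCase = Tuple[str, str]  # (input, expected output)


def evaluate_variants(
    variants: Dict[str, str],
    test_cases: List[TestCase],
    run_model: Callable[[str, str], str],
    score: Callable[[str, str], float],
) -> List[Tuple[str, float]]:
    """Run every variant on every test case; return (name, mean score), best first."""
    if not test_cases:
        raise ValueError("at least one test case is required")
    results = []
    for name, prompt in variants.items():
        scores = [score(run_model(prompt, inp), expected) for inp, expected in test_cases]
        results.append((name, sum(scores) / len(scores)))
    return sorted(results, key=lambda pair: pair[1], reverse=True)


# Stand-ins for demonstration only; replace with a real model call and rubric.
def fake_model(prompt: str, user_input: str) -> str:
    return user_input.upper() if "uppercase" in prompt else user_input


def exact_match(output: str, expected: str) -> float:
    return 1.0 if output == expected else 0.0


ranking = evaluate_variants(
    {"A": "Repeat the input.", "B": "Return the input in uppercase."},
    [("abc", "ABC"), ("Hi", "HI")],
    fake_model,
    exact_match,
)
best_name, best_score = ranking[0]
```

Passing the model call and the scorer as parameters lets the same loop serve both a quick exact-match check and a full weighted rubric.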
| Problem | Resolution |
|---|---|
| No clear objective | Ask the user to define what "good output" looks like with 2-3 examples. |
| Prompt is for a task LLMs are bad at (math, counting) | Flag the limitation. Suggest tool-augmented approaches or pre/post-processing. |
| Too many variables to test | Focus on the highest-impact variable first. Iterative refinement beats combinatorial testing. |
| No existing prompt to analyze | Start with the simplest possible prompt. The first variant IS the baseline. |
| Output format requirements are strict | Use structured output mode (JSON mode, function calling) instead of prompt-only constraints. |
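Even when a structured output mode is used, a post-check on the returned text can feed the "Format compliance" criterion. The sketch below uses only Python's standard `json` module. The function name `check_json_output` and the flat key-to-type schema are assumptions for illustration; a real project might prefer a schema library.

```python
import json
from typing import Dict, List


def check_json_output(raw: str, required: Dict[str, type]) -> List[str]:
    """Return a list of format problems; an empty list means the output passed."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        return [f"invalid JSON: {exc.msg}"]
    if not isinstance(data, dict):
        return ["top-level value is not an object"]

    problems = []
    for key, expected_type in required.items():
        if key not in data:
            problems.append(f"missing key: {key}")
        elif not isinstance(data[key], expected_type):
            problems.append(f"wrong type for {key}: expected {expected_type.__name__}")
    return problems


# Hypothetical schema for a classification task.
schema = {"label": str, "confidence": float}
ok = check_json_output('{"label": "spam", "confidence": 0.9}', schema)
bad = check_json_output('{"label": 1}', schema)
```

Returning a list of problems instead of a boolean makes each failure easy to log against the failure modes being monitored.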
Push back if:
npx claudepluginhub mathews-tom/armory --plugin armory