From msec
Review: Evaluate test quality using Dave Farley's 8 Properties of Good Tests
## ENTRY POINT: Determine User Intent

**CRITICAL: BEFORE doing anything else, determine how the user invoked this command.**

### If invoked WITH a target path

If the user's message includes a file path, directory path, or specifies what to analyse (e.g., `/msec:farley-score tests/`, `/msec:farley-score src/test/java/`, `/msec:farley-score evaluate test quality for the whole project`), skip the welcome menu and go directly to the **4-Phase Workflow** section below.

### If invoked WITHOUT a target path

If the user simply typed `/msec:farley-score` with no additional arguments or context, present the welcome menu.
Use the AskUserQuestion tool to present the menu. Do NOT output the options as plain text. This ensures the user gets a navigable interactive menu (arrow keys + Enter) rather than having to type a number.
Question: "Welcome to Farley Score! What would you like to do?" Header: "Farley Score"
Options (use these exact labels and descriptions):
Present this overview to the user:
Farley Score measures test quality, not just test coverage.
It evaluates your tests against Dave Farley's 8 Properties of Good Tests -- Understandable, Maintainable, Repeatable, Atomic, Necessary, Granular, Fast, and First (TDD) -- and produces a single number: the Farley Index (0-10).
The tool reads your test code (it never modifies anything), detects quality signals and anti-patterns, scores each property on a 0-10 scale, and produces a detailed report with evidence, worst offenders, and prioritised recommendations.
Not all properties are equal. Understandable and Maintainable have the highest weight (1.5x) because tests that read like specifications and survive refactoring deliver the most long-term value. Fast has the lowest weight (0.75x) because a slow but well-designed test beats a fast but poorly designed one.
A typical analysis takes 3-5 minutes depending on the size of your test suite.
Rating scale: Exemplary (9-10), Excellent (7.5-8.9), Good (6-7.4), Fair (4.5-5.9), Poor (3-4.4), Critical (0-2.9)
After presenting the overview, ask the user what they'd like to do next (show the menu options again).
Display the pre-packaged Farley Score report from the bundled sample project.
To locate the report:
1. Resolve the plugin directory: `PLUGIN_DIR=$(find ~/.claude/plugins -name "cli_calculator.py" 2>/dev/null | head -1 | sed 's|/lib/cli_calculator.py||')` -- this finds the plugin's root directory.
2. The report is at `$PLUGIN_DIR/examples/sample-project/farley-score-report.md`.

Present the full report to the user, then walk them through the key sections:
Farley Index: "This sample suite scored 5.7 out of 10, rated 'Fair'. The sample project contains 21 test methods -- 7 well-designed and 14 deliberately flawed -- to demonstrate the kinds of issues the tool detects."
Property Breakdown: "The two highest-weighted properties -- Understandable (1.5x) and Maintainable (1.5x) -- scored 4.9 and 4.5 respectively. These are the biggest opportunities for improvement. Meanwhile, Fast scored 8.3 because almost all tests are pure computation."
Tautology Theatre: "The tool detected 6 tests that are 'tautology theatre' -- tests whose outcome is predetermined regardless of what the production code does. For example, assertTrue(True) will always pass even if you delete all your code. These tests provide zero value while inflating coverage numbers."
Worst Offenders: "The 5 lowest-scoring tests are listed with specific reasons. The worst is a mock-only test where every object is a MagicMock and no real class is ever instantiated."
Recommendations: "The tool prioritises recommendations by impact -- fixing the highest-weighted, lowest-scoring properties first."
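To make the tautology-theatre category from the walkthrough concrete, the sketch below contrasts three hypothetical Python tests. `apply_discount` and the test names are illustrative assumptions, not taken from the bundled sample project. Only the first test would fail if the production function were deleted or broken; the other two pass no matter what the production code does.

```python
import unittest
from unittest.mock import MagicMock


def apply_discount(price, percent):
    """Hypothetical production code, present only for this illustration."""
    return round(price * (1 - percent / 100), 2)


class DiscountTests(unittest.TestCase):
    def test_ten_percent_discount_reduces_price(self):
        # Real test: calls production code and checks its result.
        self.assertEqual(apply_discount(200.0, 10), 180.0)

    def test_trivial_tautology(self):
        # Trivial tautology: always true, whatever the production code does.
        self.assertTrue(True)

    def test_mock_tautology(self):
        # Mock tautology: asserts the value the mock was just told to return.
        pricing = MagicMock()
        pricing.apply_discount.return_value = 180.0
        self.assertEqual(pricing.apply_discount(200.0, 10), 180.0)
```

All three tests pass, which is exactly the problem: a green run says nothing about which of them actually guards the production code.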
After the walkthrough, ask: "This report was generated from bundled sample tests. Want me to analyse YOUR tests now? Just tell me where they are (e.g., tests/ or src/test/)."
Ask the user: "Where are your test files? For example: tests/, src/test/java/, or just say 'the whole project' and I'll find them."
Wait for their response, then proceed to the 4-Phase Workflow section with their specified target.
Say: "Switching to coaching mode! The Farley Score Coach teaches test quality through guided practice and Socratic questioning."
Then invoke /msec:farley-score-coach for the user.
If the user types something that doesn't match the options above, respond helpfully based on what they asked. If they want analysis, proceed to the 4-Phase Workflow. If they want to learn, redirect to coaching mode.
Autonomous analysis: produces a Farley Index (0-10) with per-property breakdown, signal evidence, worst offenders, and prioritized recommendations.
This command ANALYZES test code. It NEVER modifies code.
For implementation work, use /msec:tdd.
When you run /msec:farley-score, Claude works in autonomous review mode: it does the analysis and produces the report.
Want to LEARN test quality instead? Use /msec:farley-score-coach for interactive teaching.
/msec:farley-score tests/
/msec:farley-score src/test/java/
/msec:farley-score evaluate test quality for the whole project
/msec:farley-score review test design in tests/unit/
| Code | Property | Weight | What It Measures |
|---|---|---|---|
| U | Understandable | 1.50x | Tests read like specifications; behavior-driven naming; clear organization |
| M | Maintainable | 1.50x | Tests verify behavior not implementation; resilient to refactoring |
| R | Repeatable | 1.25x | Deterministic results; no external dependencies (time, filesystem, network) |
| A | Atomic | 1.00x | Isolated; no shared mutable state; parallelizable |
| N | Necessary | 1.00x | Each test adds unique value; no redundancy or trivial assertions |
| G | Granular | 1.00x | Single outcome per test; failures pinpoint exact issues |
| F | Fast | 0.75x | Pure computation; no I/O or delays |
| T | First (TDD) | 1.00x | Evidence of test-first approach; tests drive design |
Methodology: Andrea LaForgia's test-design-reviewer -- scoring system, signal detection patterns, tautology theatre analysis. Used with permission.
Framework: Dave Farley's Properties of Good Tests
These files live in the plugin directory, not the user's project. To locate them:
1. Resolve the plugin directory: `PLUGIN_DIR=$(find ~/.claude/plugins -name "cli_calculator.py" 2>/dev/null | head -1 | sed 's|/lib/cli_calculator.py||')`
2. The knowledge files are in `$PLUGIN_DIR/knowledge/farley/`.

Load these documents during the phases indicated:

- `$PLUGIN_DIR/knowledge/farley/signal-detection-patterns.md` for language-specific detection heuristics
- `$PLUGIN_DIR/knowledge/farley/farley-properties-and-scoring.md` for rubrics and formula

Note: Only read these files when you reach the relevant phase. Don't load them upfront.
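The `find | head -1 | sed` pipeline leaves `PLUGIN_DIR` empty when nothing matches, and every later path then silently resolves from `/`. As a rough Python equivalent that makes the not-found case explicit, the sketch below assumes the same layout (`lib/cli_calculator.py` directly under the plugin root). `resolve_plugin_dir` is a hypothetical helper, not something the plugin ships.

```python
from pathlib import Path
from typing import Optional


def resolve_plugin_dir(plugins_root: Path) -> Optional[Path]:
    """Return the plugin root that contains lib/cli_calculator.py, or None."""
    if not plugins_root.is_dir():
        return None
    # sorted() keeps the choice stable if several copies are installed.
    for calculator in sorted(plugins_root.rglob("cli_calculator.py")):
        if calculator.parent.name == "lib":
            return calculator.parent.parent
    return None
```

A caller might run `resolve_plugin_dir(Path.home() / ".claude" / "plugins")` and stop with a clear message when the result is `None`, rather than continuing with an empty path.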
These 7 principles define the methodology:
- Test file patterns: `test/`, `tests/`, `*_test.*`, `*Test.*`, `*.spec.*`, `*.test.*`
- `$PLUGIN_DIR/knowledge/farley/signal-detection-patterns.md` for language-specific patterns (resolve PLUGIN_DIR as described in Knowledge Base section)
- Trivial tautologies such as `assertTrue(true)`, `assertEquals(1, 1)`, `assertNotNull(new Object())` (affects N)
- Locations (`file:line`) for evidence
- `$PLUGIN_DIR/knowledge/farley/farley-properties-and-scoring.md` for rubrics and formula

For each property, compute a 0-10 score based on signal densities using the rubric from the knowledge file.
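The test-file patterns listed above can be sketched as a small path classifier. This is only an illustration of how such patterns might be applied: `looks_like_test_file` and the two constants are hypothetical names, and the real detection heuristics live in the knowledge file and may be broader.

```python
from fnmatch import fnmatchcase
from pathlib import PurePosixPath

# Directory names and filename globs copied from the pattern list above.
TEST_DIR_NAMES = {"test", "tests"}
TEST_FILE_GLOBS = ("*_test.*", "*Test.*", "*.spec.*", "*.test.*")


def looks_like_test_file(path: str) -> bool:
    """True if the path sits under a test directory or matches a test filename glob."""
    p = PurePosixPath(path)
    # Any parent directory named test/ or tests/ qualifies the file.
    if any(part in TEST_DIR_NAMES for part in p.parts[:-1]):
        return True
    # fnmatchcase is case-sensitive, so "*Test.*" does not match "latest.py".
    return any(fnmatchcase(p.name, glob) for glob in TEST_FILE_GLOBS)
```

Note that a pytest-style `test_calc.py` outside a `tests/` directory matches none of the listed globs, so a classifier built only from this list would miss it.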
For each property, assess the test code holistically against the rubric, providing a 0-10 score with brief justification. Focus on semantic aspects that static analysis misses: naming quality, assertion appropriateness, design influence, tautology theatre.
CRITICAL: Use the Python CLI calculator for ALL math. Do NOT compute scores manually.
Locating the calculator: The CLI calculator lives in the plugin's lib/ directory, NOT in the user's project.
# Resolve the plugin directory (if not already set)
PLUGIN_DIR=$(find ~/.claude/plugins -name "cli_calculator.py" 2>/dev/null | head -1 | sed 's|/lib/cli_calculator.py||')
CLI="$PLUGIN_DIR/lib/cli_calculator.py"
# Normalize a single property
python3 "$CLI" normalize-property '{"prop":"U","neg_count":2,"pos_count":8,"total_methods":20}'
# Blend static and LLM scores
python3 "$CLI" blend-scores '{"static_score":7.5,"llm_score":8.0}'
# Full end-to-end pipeline (recommended)
python3 "$CLI" full-pipeline '{"properties":{"U":{"neg_count":2,"pos_count":15,"total_methods":20},"M":{"neg_count":3,"pos_count":10,"total_methods":20},"R":{"neg_count":1,"pos_count":18,"total_methods":20},"A":{"neg_count":0,"pos_count":12,"total_methods":20},"N":{"neg_count":4,"pos_count":8,"total_methods":20},"G":{"neg_count":1,"pos_count":16,"total_methods":20},"F":{"neg_count":0,"pos_count":18,"total_methods":20},"T":{"neg_count":2,"pos_count":10,"total_methods":20}},"llm_scores":{"U":8.0,"M":7.5,"R":9.0,"A":8.5,"N":6.5,"G":8.0,"F":9.0,"T":7.0}}'
# Get rating from index
python3 "$CLI" get-rating '{"farley_index":7.8}'
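Hand-writing the long `full-pipeline` JSON argument is error-prone, so it may help to build it programmatically. The sketch below mirrors the payload shape shown in the example above; `build_full_pipeline_args` is a hypothetical helper, and the calculator's output format is not shown in this document, so the sketch stops at building the argument list.

```python
import json

PROPERTY_CODES = ("U", "M", "R", "A", "N", "G", "F", "T")


def build_full_pipeline_args(cli_path, counts, llm_scores):
    """Build the argv list for the full-pipeline subcommand.

    counts maps each property code to (neg_count, pos_count, total_methods);
    llm_scores maps each property code to a 0-10 score.
    """
    missing = [c for c in PROPERTY_CODES if c not in counts or c not in llm_scores]
    if missing:
        raise ValueError(f"missing properties: {missing}")
    properties = {}
    for code in PROPERTY_CODES:
        neg, pos, total = counts[code]
        properties[code] = {"neg_count": neg, "pos_count": pos, "total_methods": total}
    payload = {
        "properties": properties,
        "llm_scores": {code: float(llm_scores[code]) for code in PROPERTY_CODES},
    }
    # A list of arguments avoids shell quoting problems with the JSON string.
    return ["python3", str(cli_path), "full-pipeline", json.dumps(payload)]
```

The returned list can be handed to `subprocess.run(args, capture_output=True, text=True, check=True)`, which passes the JSON as a single argument without going through a shell.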
Blend formula: final_property_score = 0.60 * static_score + 0.40 * llm_score
Farley Index formula: (U*1.5 + M*1.5 + R*1.25 + A*1.0 + N*1.0 + G*1.0 + F*0.75 + T*1.0) / 9.0
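For readers who want to see the arithmetic, the two formulas above can be written out as follows. This is an illustration only, not the calculator's implementation: the command itself should still delegate all math to the CLI calculator, and the function names here are assumptions.

```python
# Weights from the property table; they sum to 9.0, the divisor in the formula.
WEIGHTS = {"U": 1.5, "M": 1.5, "R": 1.25, "A": 1.0, "N": 1.0, "G": 1.0, "F": 0.75, "T": 1.0}


def blend(static_score, llm_score):
    """60/40 blend of the static and LLM scores for one property."""
    return 0.60 * static_score + 0.40 * llm_score


def farley_index(blended_scores):
    """Weighted mean of the eight blended property scores (0-10)."""
    total_weight = sum(WEIGHTS.values())
    weighted_sum = sum(blended_scores[code] * weight for code, weight in WEIGHTS.items())
    return weighted_sum / total_weight
```

Because the divisor is the sum of the weights rather than the number of properties, a suite that scores the same value on every property gets that value as its index.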
# Test Design Review
## Farley Index: X.X / 10.0 (Rating)
### Property Breakdown
| Property | Static | LLM | Blended | Weight | Weighted | Key Evidence |
|---|---|---|---|---|---|---|
| Understandable | X.X | X.X | X.X | 1.50x | X.XX | ... |
| Maintainable | X.X | X.X | X.X | 1.50x | X.XX | ... |
| Repeatable | X.X | X.X | X.X | 1.25x | X.XX | ... |
| Atomic | X.X | X.X | X.X | 1.00x | X.XX | ... |
| Necessary | X.X | X.X | X.X | 1.00x | X.XX | ... |
| Granular | X.X | X.X | X.X | 1.00x | X.XX | ... |
| Fast | X.X | X.X | X.X | 0.75x | X.XX | ... |
| First (TDD) | X.X | X.X | X.X | 1.00x | X.XX | ... |
### Signal Summary
| Signal | Count | Affects | Severity |
|---|---|---|---|
| {signal_name} | {count} | {properties} | {High/Medium/Low} |
### Tautology Theatre Analysis
Tests whose outcome is predetermined by their own setup, independent of production code. The defining test: "Would this test still pass if all production code were deleted?" If yes, it is tautology theatre.
#### Mock Tautologies
Configures a mock return value, then asserts that the mock returns it, with no production code in between.
| # | Test Method | Line | Mock Setup | Assertion |
|---|---|---|---|---|
| 1 | {method_name} | {line} | {mock setup} | {assertion} |
> If none detected: "None detected."
#### Mock-Only Tests
Every object in the test is a mock; no real class is instantiated or invoked.
| # | Test Method | Line | Evidence |
|---|---|---|---|
| 1 | {method_name} | {line} | {evidence} |
> If none detected: "None detected."
#### Trivial Tautologies
Assertions that are always true regardless of any code: `assertTrue(true)`, `assertEquals(1, 1)`.
| # | Test Method | Line | Assertion |
|---|---|---|---|
| 1 | {method_name} | {line} | {assertion} |
> If none detected: "None detected."
#### Framework Tests
Tests that verify language or framework behavior, not application code.
| # | Test Method | Line | Assertion | What It Actually Tests |
|---|---|---|---|---|
| 1 | {method_name} | {line} | {assertion} | {explanation} |
> If none detected: "None detected."
#### Tautology Theatre Summary
**{total}** tautology theatre instances across **{affected_methods}** of **{total_test_methods}** test methods: {count} mock tautologies, {count} mock-only tests, {count} trivial tautologies, {count} framework tests.
### Top 5 Worst Offenders
1. {file}:{method} -- Farley {score}/10 -- {key issues}
2. ...
### Recommendations
1. {highest-impact improvement targeting weakest high-weight property}
2. ...
3. ...
### Methodology Notes
- Static/LLM blend: 60/40
- LLM model: {model_id}
- Files analyzed: {count} ({sampling_note})
- Test methods analyzed: {count}
- Language: {language}
- Framework: {framework}
### Dimensions Not Measured
Predictive, Inspiring, Composable, Writable (from Beck's Test Desiderata -- require runtime or team context)
### Reference
Based on Dave Farley's Properties of Good Tests:
https://www.linkedin.com/pulse/tdd-properties-good-tests-dave-farley-iexge/
Scoring methodology by Andrea LaForgia:
https://github.com/andlaf-ak/claude-code-agents/tree/main/test-design-reviewer
| Farley Index | Rating | Interpretation |
|---|---|---|
| 9.0 - 10.0 | Exemplary | Model for the industry; tests serve as living documentation |
| 7.5 - 8.9 | Excellent | High quality with minor improvement opportunities |
| 6.0 - 7.4 | Good | Solid foundation with clear areas for improvement |
| 4.5 - 5.9 | Fair | Functional but needs significant attention to test design |
| 3.0 - 4.4 | Poor | Tests provide limited value; major refactoring needed |
| 0.0 - 2.9 | Critical | Tests may be harmful; consider rewriting from scratch |
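The rating table can be read as a simple lookup on the lower bound of each band. The sketch below assumes lower bounds are inclusive and that values between the printed one-decimal ranges (for example 8.95) fall into the lower band; the calculator's `get-rating` subcommand may round differently, and `rating_for` is a hypothetical name.

```python
# (inclusive lower bound, label), highest band first; taken from the table above.
RATING_BANDS = (
    (9.0, "Exemplary"),
    (7.5, "Excellent"),
    (6.0, "Good"),
    (4.5, "Fair"),
    (3.0, "Poor"),
    (0.0, "Critical"),
)


def rating_for(farley_index):
    """Map a Farley Index in [0, 10] to its rating label."""
    if not 0.0 <= farley_index <= 10.0:
        raise ValueError(f"Farley Index out of range: {farley_index}")
    for lower_bound, label in RATING_BANDS:
        if farley_index >= lower_bound:
            return label
    raise AssertionError("unreachable: the last band starts at 0.0")
```

With these bands, the sample project's 5.7 maps to "Fair", matching the rating quoted in the demo walkthrough.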
| Language | Test Framework | Mocking Framework |
|---|---|---|
| Java | JUnit 5, JUnit 4 | Mockito |
| Python | pytest, unittest | unittest.mock, pytest-mock |
| JavaScript/TypeScript | Jest, Vitest | Jest mocks, Sinon |
| Go | testing | testify/mock, gomock |
| C# | NUnit, xUnit | Moq, NSubstitute |
- `file:line` references for every signal detected. Evidence without location is unverifiable.
- `python3 $PLUGIN_DIR/lib/cli_calculator.py`. See "CLI Calculator Integration" in Phase 3 for path resolution steps.
- `(U*1.5 + M*1.5 + R*1.25 + A*1.0 + N*1.0 + G*1.0 + F*0.75 + T*1.0) / 9.0`. The divisor is 9.0 (sum of weights), not 8 (number of properties).
- `"inline"` -- Report displayed in conversation
- `"file"` -- Report written to farley-score-report.md in the project root
- Number of test files before activating SHA-256 deterministic sampling (30% selection).
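This document does not say how the SHA-256 sampling picks its 30%, so the following is one possible scheme, offered as a sketch under stated assumptions: files are ordered by the SHA-256 digest of their path and the first 30% (rounded up) are kept. `sample_test_files` and its `threshold` parameter are hypothetical, and treating the threshold as "sample only when the count exceeds it" is also an assumption.

```python
import hashlib


def sample_test_files(paths, threshold, percent=30):
    """Return every path up to the threshold, otherwise a stable subset of them."""
    unique = sorted(set(paths))
    if len(unique) <= threshold:
        return unique
    # Integer ceiling of len * percent / 100, avoiding float rounding.
    keep = (len(unique) * percent + 99) // 100
    # Ordering by digest depends only on the path strings, so the same
    # file set always yields the same sample, whatever the input order.
    by_digest = sorted(unique, key=lambda p: hashlib.sha256(p.encode("utf-8")).hexdigest())
    return sorted(by_digest[:keep])
```

Hashing the path rather than drawing random numbers keeps repeated runs over an unchanged suite comparable, since the same files are analysed each time.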
| Situation | Command | Why |
|---|---|---|
| Automated quality review | /msec:farley-score | Full autonomous analysis with Farley Index |
| Want to learn test quality | /msec:farley-score-coach | Interactive Socratic coaching |
| Writing new code with TDD | /msec:tdd | TDD productivity mode |
| Learning TDD methodology | /msec:tdd-coach | TDD coaching mode |
npx claudepluginhub mse-online/farley_score_plugin --plugin msec