By testland
LLM and prompt evaluation: 6 skills (promptfoo-evaluation, openai-evals, deepeval-evaluation, ragas-evaluation, giskard-llm, langfuse-tracing) and 1 agent (prompt-eval-reviewer).
Action-taking orchestrator that plans and scaffolds a multi-class LLM adversarial probe campaign beyond canned scanners - enumerates an attack taxonomy (jailbreaks, indirect prompt injection chains, data exfiltration, harmful-content bypass, OWASP LLM Top 10 classes), maps each class to a Giskard detector or a promptfoo red-team plugin, sequences the campaign into phases, and writes the resulting scan scripts and promptfoo redteam YAML configs. Distinct from `prompt-eval-reviewer` (read-only anti-pattern reviewer) and `giskard-llm` / `promptfoo-evaluation` skills (single-tool wrappers). Use when a senior AI-safety or security engineer needs a bespoke red-team campaign plan that goes beyond running default scanner presets.
Adversarial reviewer for an LLM eval suite (Promptfoo, OpenAI Evals, DeepEval, Ragas, Giskard, Langfuse-driven, or custom). Flags 8 anti-patterns: too-few test cases (<10), single-provider lock-in, missing model-graded for creative output, missing semantic-similarity for paraphrase-tolerant output, no baseline diff in CI, no cost/latency cap, hard-coded model versions absent, no adversarial coverage (Giskard or equivalent). Returns Critical / Warning / Info findings table. Use proactively after any LLM eval suite is added or modified.
Authors and runs DeepEval - pytest-native LLM eval framework with `LLMTestCase` (input + actual_output + expected_output + retrieval_context) and ~11 built-in metrics (G-Eval, Answer-Relevancy, Faithfulness, Contextual-Recall / Precision / Relevancy, Hallucination, Bias, Toxicity, Summarization, JSON-Correctness); runs via `deepeval test run <file.py>` with `assert_test()` per test or `evaluate()` for batch; integrates Confident-AI dashboard. Use when the user prefers pytest workflow, works with RAG and needs faithfulness/contextual metrics out-of-the-box, or wants a managed dashboard.
Authors and runs Giskard LLM scans - adversarial test-case generation for LLM applications via `giskard.scan(model)` covering 7 vulnerability categories (hallucination, harmful_content, prompt_injection, sensitive_information_disclosure, stereotypes, robustness, basic_sycophancy); wraps any callable model behind `giskard.Model(model_predict, model_type="text_generation", ...)`; emits HTML report. Use when the user needs adversarial / red-team coverage on top of functional eval suites.
Wires Langfuse tracing into LLM apps for production observability and offline eval - instruments via `@observe` (Python) / `startActiveObservation` (TS) decorators that auto-capture inputs / outputs / timings / errors per generation; exposes `langfuse.update_current_span()` for metadata + cost / latency annotation; supports trace-bound scoring for eval datasets and prompt-as-code management. Use when the user needs production LLM observability beyond pre-deploy eval, or wants to ship traces from production to an eval dataset for offline regression testing.
Builds a versioned golden-dataset LLM regression suite for tracking quality across model upgrades: structures a versioned JSONL/CSV golden dataset, configures deterministic eval runs (temperature 0, seed), wires assertion layers (exact, semantic similarity, LLM-as-judge, rubric), enforces a pass-rate threshold with diff reporting vs the baseline model, and gates CI on regression. Use when upgrading an LLM provider model and needing a repeatable before/after quality gate, or when a prompt regression suite must track output quality across model versions over time.
Authors and runs OpenAI Evals - Python framework + registry for evaluating LLMs and LLM-backed systems with `oaieval <model> <eval-name>` CLI; supports template-based evals (Match / Includes / FuzzyMatch / ModelBasedClassify) defined in `evals/registry/evals/*.yaml` against JSONL data files in `evals/registry/data/`, plus custom Python eval classes implementing the Eval interface. Use when the user works with the openai/evals repo, needs the OpenAI-curated eval registry, or contributes new evals via PR to the registry.
Uses power tools
Uses Bash, Write, or Edit tools
Own this plugin?
Verify ownership to unlock analytics, metadata editing, and a verified badge. GitHub access is read-only (username + org membership).
Sign in to claimOwn this plugin?
Verify ownership to unlock analytics, metadata editing, and a verified badge. GitHub access is read-only (username + org membership).
Sign in to claimBased on adoption, maintenance, documentation, and repository signals. Not a security audit or endorsement.
A rigorously curated quality-engineering plugin marketplace for Claude Code. 77 plugins, 695 components, every one rating-gated before merge.
d6 floordocs/REVIEWER_TRAINING.mdSee Quality bar and docs/REVIEWER_CHECKLIST.md.
The marketplace ships three kinds of building block:
qa-api-testing, qa-load-testing). You install only the plugins your
stack needs.great-expectations,
oauth-flow-test-author). Claude loads a skill when your request matches
its trigger; you can also ask for it by name.schema-diff-reviewer reviews a migration diff and returns a findings
table). An agent may preload one or more skills to do its work.Installed components stay dormant until a matching task comes up, so adding a plugin doesn't add noise — it adds capability that activates on demand.
/plugin marketplace add testland/qa
/plugin install <plugin-name>@testland-qa
For example:
/plugin install qa-data-quality@testland-qa
/plugin marketplace add https://github.com/testland/qa
git clone https://github.com/testland/qa ~/.claude/marketplaces/testland-qa
Before you install: plugins run inside your Claude Code session and ship agent instructions and tool wrappers. Anthropic doesn't vet marketplace contents — review a plugin's components before installing it into a sensitive project. Every component here is rating-gated (see Quality bar), but you remain in control of what runs.
New to the marketplace? Install one or two plugins for your role rather than everything — components activate on demand, so a focused set keeps things sharp.
| If you're a… | Try first |
|---|---|
| Manual / exploratory tester | qa-manual-testing · qa-bdd · qa-bug-repro |
| Test automation engineer | qa-web-e2e · qa-api-testing · qa-unit-tests-js |
| Performance engineer | qa-load-testing · qa-chaos-resilience |
| Security tester | qa-sast · qa-secrets · qa-dast |
| Lead / manager / head of quality | qa-roles · qa-test-management · qa-process |
The full catalog is below; for versions and component counts see
CATALOG.md.
Once a plugin is installed, its skills and agents are available to Claude
Code — invoke them by describing the task in plain language. Example with
qa-data-quality:
/plugin install qa-data-quality@testland-qa
great-expectations skill scaffolds an ExpectationSuite + Checkpoint and
wires the results into a CI gate.schema-diff-reviewer agent returns a Critical / Warning / Info findings
table covering breaking-vs-additive changes and downstream impact.Each plugin's README.md lists its skills and agents and what each one does.
Visual regression testing: 7 skills (percy-visual-regression-testing, chromatic-visual-regression-testing, playwright-snapshots, storybook-visual-regression-testing, responsive-breakpoint-runner, visual-baseline-conventions, visual-baseline-gate) and 2 agents (visual-diff-classifier, visual-baseline-curator).
Contract testing for microservices: 5 skills (pact-contract-testing, openapi-contract-diff, graphql-schema-regression, protobuf-compat-checking, contract-compatibility-gate) and 2 agents (contract-drift-investigator, contract-test-scaffolder).
Flake triage: 2 skills (flaky-test-quarantine, flake-pattern-reference) and 5 agents (e2e-flake-bisector, parallel-isolation-checker, regression-bisector, ai-flake-detector, e2e-test-trend-reporter).
Bug reproduction workflow: 1 skill (bug-report-template) and 8 agents (bug-report-from-recording, bug-repro-builder, crash-stack-trace-analyzer, defect-clusterer, defect-trend-narrator, escape-defect-analyzer, failure-classifier, test-failure-debugger).
Data quality testing for analytical pipelines: 5 skills (dbt-testing, great-expectations, soda-checks, data-quality-gate, data-quality-conventions) and 2 agents (schema-diff-reviewer, data-anomaly-triager).
npx claudepluginhub testland/qa --plugin qa-llm-evaluationComprehensive skill pack with 66 specialized skills for full-stack developers: 12 language experts (Python, TypeScript, Go, Rust, C++, Swift, Kotlin, C#, PHP, Java, SQL, JavaScript), 10 backend frameworks, 6 frontend/mobile, plus infrastructure, DevOps, security, and testing. Features progressive disclosure architecture for 50% faster loading.
Develop, test, build, and deploy Godot 4.x games with Claude Code. Includes GdUnit4 testing, web/desktop exports, CI/CD pipelines, and deployment to Vercel/GitHub Pages/itch.io.
Comprehensive PR review agents specializing in comments, tests, error handling, type design, code quality, and code simplification
Comprehensive feature development workflow with specialized agents for codebase exploration, architecture design, and quality review
A growing collection of Claude-compatible academic workflow bundles. Covers scientific figures, manuscript writing and polishing, reviewer assessment, citation retrieval, data availability, paper reading, literature search, response letters, paper-to-PPTX conversion, and evidence-grounded Chinese invention patent drafting. Rules are organized as reusable skill folders with explicit workflows and quality checks.
Unity Development Toolkit - Expert agents for scripting/refactoring/optimization, script templates, and Agent Skills for Unity C# development