By markac007
AI agent evaluation framework based on Anthropic best practices. Create use cases, LLM judges, A/B prompt tests, and model comparisons.
- Existing use case with test cases and a prompt
**Implements the Science Protocol for prompt experimentation.**
---
Save all generated config files to `~/Downloads/evals/<name>/` before moving to the project.
- Use case config.yaml exists
Private monorepo marketplace of Claude Code plugins for ComplianceGenie workflows.
cg-claude-workspaces-plugins/
├── .claude-plugin/
│   └── marketplace.json      ← Marketplace catalog
├── plugins/
│   └── marketing/
│       └── art/              ← Visual content plugin
│           ├── commands/     ← 22 slash commands (/art:*)
│           ├── skills/       ← Ambient skills
│           ├── tools/        ← TypeScript generation tools
│           └── lib/          ← Discord/Midjourney libraries
└── README.md
claude plugin marketplace add ~/GitHub/cg-claude-workspaces-plugins
claude plugin install art@cg-plugins
claude plugin marketplace add MarkAC007/cg-claude-workspaces-plugins
claude plugin install art@cg-plugins
Requires `gh auth login` to be configured; the GitHub CLI handles authentication transparently for private repos.
claude --plugin-dir ~/GitHub/cg-claude-workspaces-plugins/plugins/marketing/art
Loads the plugin directly without install or restart.
| Plugin | Category | Commands | Description |
|---|---|---|---|
| art | marketing | 22 `/art:*` | Visual content: illustrations, diagrams, thumbnails, infographics |
- `plugins/<category>/<name>/` with plugin structure
- `.claude-plugin/marketplace.json`
- `claude plugin update`
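
One way to read the items above is as the steps for adding a plugin to this marketplace: create the plugin directory, register it in the catalog, then update. The sketch below covers the first two steps only. It makes several assumptions: the helper name `add_plugin` is hypothetical, the `commands` and `skills` subdirectories are borrowed from the `art` layout in the tree above, and the catalog entry fields (`name`, `source`, `description`, `category`) are illustrative and not confirmed by this README. Check the actual `marketplace.json` in the repo before relying on them.

```python
import json
from pathlib import Path


def add_plugin(repo_root: Path, category: str, name: str, description: str) -> dict:
    """Scaffold plugins/<category>/<name>/ and register it in the catalog.

    Hypothetical helper: a sketch, not part of the repository.
    """
    plugin_dir = repo_root / "plugins" / category / name
    # Subdirectories mirror part of the art plugin's layout (an assumption
    # about what "plugin structure" means for a new plugin).
    for sub in ("commands", "skills"):
        (plugin_dir / sub).mkdir(parents=True, exist_ok=True)

    catalog_path = repo_root / ".claude-plugin" / "marketplace.json"
    catalog_path.parent.mkdir(parents=True, exist_ok=True)
    if catalog_path.exists():
        catalog = json.loads(catalog_path.read_text(encoding="utf-8"))
    else:
        catalog = {"plugins": []}

    # ASSUMPTION: these entry fields are illustrative; the real catalog
    # schema is not shown in this README.
    entry = {
        "name": name,
        "source": f"./plugins/{category}/{name}",
        "description": description,
        "category": category,
    }

    plugins = catalog.setdefault("plugins", [])
    # Drop any existing entry with the same name so re-running is idempotent.
    plugins[:] = [p for p in plugins if p.get("name") != name]
    plugins.append(entry)

    catalog_path.write_text(json.dumps(catalog, indent=2) + "\n", encoding="utf-8")
    return entry
```

Calling `add_plugin(Path("."), "marketing", "demo", "Example plugin")` from the repo root would create `plugins/marketing/demo/` and add or replace the `demo` entry in the catalog. Running `claude plugin update` afterwards is left as a manual step.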
npx claudepluginhub markac007/cg-claude-workspaces-plugins --plugin evals
Adversarial analysis using 32 parallel expert agents. Stress-tests arguments, produces steelman + counter-argument, and validates decisions through competition.
Multi-agent debate system. Specialized agents discuss topics in rounds, challenge each other, and surface insights through intellectual friction.
Transform product documentation into compelling sales narratives with talking points, elevator pitches, and visual concepts.
First principles reasoning framework. Deconstruct problems to fundamentals, challenge assumptions, and reconstruct optimal solutions.
Data Protection Security Impact Assessment for vendor security evaluation against CIA triad, certification requirements, and GDPR compliance.
Measure AI output quality, user satisfaction, task success, and design effectiveness.
Open-source testing and regression detection framework for AI agents. Golden baseline diffing, CI/CD integration, works with LangGraph, CrewAI, OpenAI, Anthropic Claude, HuggingFace, Ollama, and MCP.
Set up evaluation of AI agents with tool call validation, correctness checks, task completion, and tool reliability using Dokimos. Framework-agnostic — works with any agent framework.
Build evals, A/B test prompts, audit skills, and benchmark LLM outputs at production quality
Skills for adding DeepEval evaluations, tracing, datasets, Confident AI reports, and iterative improvement loops to AI applications.
SDK Usability Benchmark — generate, execute, judge, and analyze AI agent benchmark suites