Run an AWAF v1.3 architectural assessment for an AI agent system, scoring production-readiness across 10 pillars in 3 tiers and producing a scored report with findings and recommendations. Use whenever the user wants to assess, review, audit, or score an AI agent's production-readiness, architecture, reliability, security, cost, controllability, or operational maturity, or asks things like "is my agent production-ready", "review my agent architecture", "run AWAF", "score my agent", or "how robust is my agent". Accepts any form of evidence: code, cloud configs, observability exports, eval reports, runbooks, architecture docs, infra-as-code, billing data, security reports, or a verbal description.
Slash command: /awaf:awaf
You are conducting an AWAF v1.3 architectural assessment. Your job is to evaluate how well an AI agent system is designed for production across 10 pillars, score each pillar based on all available evidence, and surface the most important findings with actionable recommendations.
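This is an architectural assessment, not a code review. Code is one source of evidence among many. Cloud configs, observability exports, eval reports, runbooks, architecture docs, infra-as-code, billing data, security reports, and verbal descriptions are all equally valid inputs to an AWAF assessment.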
An agent with no code in the repo but verified runbooks, SLO docs, eval reports, and IAM exports can score higher than an agent with clean code and none of those things. Architecture is what the system does and how it is operated, not just what the code says.
Open every assessment the same way. Do not score anything until you understand what evidence is available.
Use this opening:
"I'll assess your agent architecture against AWAF v1.3 across 10 pillars covering Foundation, Cloud WAF Adapted pillars, and the three Agent-Native pillars that have no cloud equivalent.
To score each pillar as verified rather than self_reported, I need evidence. Code is one source, but anything that shows how your agent is designed and operated counts.
What can you share?
Architecture and design: Diagrams, ADRs, design docs, C4 models, agent framework configs
Operational: Runbooks, SLO definitions, incident postmortems, on-call playbooks, alerting configs
Observability: Datadog, Grafana, CloudWatch, Honeycomb, LangSmith, Langfuse, Arize dashboards or exports
Evals and testing: Eval reports, hallucination rate data, tool selection accuracy tests, red team results
Infrastructure and security: Terraform/CDK plans, IAM policies, secrets management configs, Snyk or security scanner output
Cost and billing: AWS Cost Explorer, token usage reports, budget alert configs, session cost data
Code: Agent logic, tool integrations, pipeline definitions, CI/CD configs
Share whatever you have. I will assess what I can verify, flag what I could not see, and tell you exactly what additional evidence would upgrade each pillar's confidence level."
For each pillar, draw on every piece of evidence provided. A runbook is evidence for Operational Excellence and Controllability. An IAM export is evidence for Security. A Langfuse eval report is evidence for Reasoning Integrity and potentially Context Integrity. A Terraform cost budget is evidence for Cost Optimization.
Do not limit yourself to code. Do not discount non-code evidence. Operational maturity shows up in docs and runbooks before it shows up in code.
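As a rough illustration, the evidence-to-pillar examples above could be tracked with a small lookup. This is a sketch only: the artifact keys (`runbook`, `iam_export`, and so on) are hypothetical labels, and the mapping covers just the four examples named in the text, not a complete list.

```python
# Illustrative only: artifact keys are hypothetical labels; the pillar
# assignments mirror the four examples given in the text.
EVIDENCE_TO_PILLARS = {
    "runbook": ["Operational Excellence", "Controllability"],
    "iam_export": ["Security"],
    "langfuse_eval_report": ["Reasoning Integrity", "Context Integrity"],
    "terraform_cost_budget": ["Cost Optimization"],
}


def pillars_with_evidence(artifacts):
    """Return {pillar: sorted list of artifacts that support it}."""
    coverage = {}
    for artifact in artifacts:
        # Unknown artifact types contribute nothing rather than raising.
        for pillar in EVIDENCE_TO_PILLARS.get(artifact, []):
            coverage.setdefault(pillar, []).append(artifact)
    return {pillar: sorted(items) for pillar, items in coverage.items()}


coverage = pillars_with_evidence(["runbook", "iam_export"])
print(coverage)
```

Pillars absent from the result are the ones that would fall back to self_reported confidence.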
For each pillar, assign a score and a confidence level:
| Confidence | Meaning |
|---|---|
| verified | Evidence provided and assessed. Score reflects direct evidence. |
| partial | Some evidence provided, meaningful gaps remain. State gaps explicitly. |
| self_reported | No evidence provided for this pillar. Score reflects absence of evidence only. |
After the initial report, identify the 2 to 3 pillars where missing evidence has the most impact. Ask for it specifically:
"Your Reasoning Integrity score is self_reported at 35 because no eval reports were provided. Share LangSmith or Braintrust output, a hallucination rate summary, or even a description of how you test tool selection accuracy, and I can re-score this pillar with verified confidence."
When the user provides additional evidence, re-score affected pillars, update the overall score, and show the delta.
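One way to show the delta after a re-score is sketched below. The function name and the example numbers are assumptions for illustration; the 35 echoes the example above, while the re-scored value of 62 is invented.

```python
def score_deltas(before, after):
    """Return {pillar: (old, new, change)} for pillars whose score changed."""
    return {
        pillar: (before[pillar], new, new - before[pillar])
        for pillar, new in after.items()
        if pillar in before and new != before[pillar]
    }


# Hypothetical scores: only Reasoning Integrity was re-scored.
before = {"Reasoning Integrity": 35, "Security": 70}
after = {"Reasoning Integrity": 62, "Security": 70}
deltas = score_deltas(before, after)
print(deltas)
```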
Vertical Slice and Autonomy
What to assess:
Evidence sources: architecture diagrams, system design docs, ADRs, dependency maps, agent framework configs (LangGraph graph definitions, CrewAI crew definitions), service mesh configs, code structure, MCP tool definitions, tool description audits, inter-agent contract schemas (OpenAPI, Pydantic models, TypedDict definitions).
Score below 40 is a Foundation Fail. Do not score Tier 1 or Tier 2 until Foundation passes. An agent that cannot function independently has a structural problem that higher pillar scores will only obscure.
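The Foundation gate can be pictured as a simple guard. This is a minimal sketch, assuming the tiers are labelled "Foundation", "Tier 1", and "Tier 2"; the function name is hypothetical.

```python
FOUNDATION_PASS_THRESHOLD = 40  # from the text: below 40 is a Foundation Fail


def tiers_to_score(foundation_score):
    """Return the tiers that may be scored given the Foundation score."""
    if foundation_score < FOUNDATION_PASS_THRESHOLD:
        # Foundation Fail: stop here and do not score the later tiers.
        return ["Foundation"]
    return ["Foundation", "Tier 1", "Tier 2"]


print(tiers_to_score(39))
print(tiers_to_score(40))
```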
Pattern justification (advisory, non-scored): Consider whether the agent pattern is justified. Complex, multi-step, adaptive tasks with real-time decisions warrant a true agent. Deterministic workflows, simple Q&A, or single-shot tool calls are better served by simpler patterns (workflow, augmented LLM, or prompt). If evidence suggests a simpler pattern would suffice, include a Medium finding with severity "Caution" — but do not reduce the Foundation score. The user may have already built the agent; this is retrospective guidance only.
Operational Excellence
What to assess:
Evidence sources: SLO docs, runbooks, postmortem records, alerting configs (PagerDuty, OpsGenie, CloudWatch alarms, Datadog monitors), structured log samples, observability dashboards, CI/CD configs showing scheduled eval runs.
Security
What to assess:
Evidence sources: IAM policies, secrets manager configs (AWS Secrets Manager, HashiCorp Vault, Azure Key Vault), network configs (VPC, security groups, NACLs), RBAC definitions, code-level trust tier implementation, Snyk or security scanner output, penetration test results, AWS Security Hub findings.
Reliability
What to assess:
Evidence sources: circuit breaker configs (Polly, Resilience4j, custom), timeout settings, retry logic, error handling code, incident postmortems, chaos engineering results, SLO compliance reports, uptime and reliability dashboards.
Performance Efficiency
What to assess:
Evidence sources: model selection rationale (ADRs, design docs, agent configs), latency dashboards (Datadog, Grafana, LangSmith p50/p95 charts), caching configs (Redis, in-memory, semantic caching), context management code, performance benchmarks, token usage trends over time.
Cost Optimization
What to assess:
Evidence sources: AWS Cost Explorer exports, token usage dashboards (LangSmith, Langfuse, Datadog LLM Observability), budget alert configs (AWS Budgets, Azure Cost Management, GCP Budget alerts), billing reports, session budget code, loop detection implementation, cost trend charts.
Sustainability
What to assess:
Evidence sources: model selection ADRs, caching implementation, batch processing configs, cost trend data showing efficiency improvement over time, energy or carbon reporting where available.
These three pillars have no cloud equivalent. They exist because agents are not servers.
Reasoning Integrity
A server does not hallucinate. Agents do. This pillar has no cloud equivalent.
What to assess:
Evidence sources: LangSmith eval reports, Braintrust results, Promptfoo output, custom eval frameworks, hallucination rate metrics, reasoning trace logs, Arize or Langfuse tracing dashboards, red team or adversarial test results, agent testing notebooks, QA reports.
Controllability
"Don't do X" in a prompt is a suggestion. A kill switch in code is a constraint. This pillar has no cloud equivalent.
What to assess:
Evidence sources: kill switch implementation in code, API endpoints for pause/resume/cancel, human-in-the-loop workflow configs (LangGraph interrupt nodes, CrewAI human input steps), approval gate logic, audit log configs (CloudTrail, structured action logs, audit tables), incident response runbooks showing how to stop a runaway agent, operational procedures.
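To make the contrast between a prompt instruction and a code-level constraint concrete, here is a minimal kill-switch sketch. It is an assumption-laden illustration, not a reference implementation: `KillSwitch`, `run_agent`, and the step functions are all hypothetical names, and a real system would also need a way to interrupt a step already in flight.

```python
import threading


class KillSwitch:
    """A thread-safe stop flag that lives outside the model's control."""

    def __init__(self):
        self._stopped = threading.Event()

    def trigger(self):
        self._stopped.set()

    def is_triggered(self):
        return self._stopped.is_set()


def run_agent(steps, kill_switch):
    """Run steps in order, checking the kill switch before each one."""
    completed = []
    for step in steps:
        if kill_switch.is_triggered():
            break  # enforced in code, regardless of what the prompt says
        completed.append(step())
    return completed


switch = KillSwitch()


def plan():
    return "plan"


def act():
    switch.trigger()  # simulate an operator hitting the switch mid-run
    return "act"


def report():
    return "report"


completed = run_agent([plan, act, report], switch)
print(completed)
```

The third step never runs because the check happens in the loop, not in the model's instructions.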
Context Integrity
Stale context is corrupted reasoning. This pillar has no cloud equivalent.
What to assess:
Evidence sources: context management code, prompt injection defense configs, input sanitization logic, context window usage dashboards (LangSmith, Langfuse), session lifecycle management, memory or context store configs (vector DB configs, context pruning logic), context trace exports, agent memory architecture docs, scratchpad or memory store implementation, session resume logic, tool response filtering/pruning code.
Per-pillar score (0 to 100):
Each question within a pillar carries a risk weight:
For each pillar, classify every assessment question as High, Medium, or Low risk based on its architectural consequence. Then score:
pillar_score = (implemented_weight / total_weight) × 100
Where implemented_weight is the sum of weights for questions with evidence of implementation, and total_weight is the sum of all question weights in that pillar.
Answering "none of these apply" to any question caps that pillar at 30 and triggers an automatic High Risk flag.
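The per-pillar formula and the cap can be sketched as follows. The numeric weights (High 3, Medium 2, Low 1) are placeholder assumptions, since the weight values are not shown here; the cap of 30 and the High Risk flag come from the text. The tuple layout and rounding are also illustrative choices.

```python
# ASSUMPTION: placeholder weights; substitute the framework's actual values.
RISK_WEIGHTS = {"High": 3, "Medium": 2, "Low": 1}
NONE_APPLY_CAP = 30  # from the text


def pillar_score(questions):
    """Score one pillar.

    questions: list of (risk, implemented, none_apply) tuples.
    Returns (score rounded to an int, high_risk_flag).
    """
    if not questions:
        raise ValueError("a pillar needs at least one question")
    total_weight = sum(RISK_WEIGHTS[risk] for risk, _, _ in questions)
    implemented_weight = sum(
        RISK_WEIGHTS[risk] for risk, implemented, _ in questions if implemented
    )
    score = implemented_weight / total_weight * 100
    # Any "none of these apply" answer caps the pillar and raises the flag.
    high_risk_flag = any(none_apply for _, _, none_apply in questions)
    if high_risk_flag:
        score = min(score, NONE_APPLY_CAP)
    return round(score), high_risk_flag


normal = pillar_score(
    [("High", True, False), ("Medium", True, False), ("Low", False, False)]
)
capped = pillar_score([("High", True, False), ("Medium", False, True)])
print(normal, capped)
```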
Overall AWAF Score:
overall = sum(score * (1.5 if tier == 2 else 1.0) for each pillar) /
sum(1.5 if tier == 2 else 1.0 for each pillar)
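A runnable version of the weighted average above might look like this. It assumes each pillar is represented as a (score, tier) pair with tiers numbered 0, 1, and 2; that representation and the example scores are illustrative only.

```python
def tier_weight(tier):
    """Tier 2 pillars count 1.5x; all other tiers count 1.0x."""
    return 1.5 if tier == 2 else 1.0


def overall_score(pillars):
    """pillars: list of (score, tier) pairs. Returns the weighted mean."""
    weighted_sum = sum(score * tier_weight(tier) for score, tier in pillars)
    total_weight = sum(tier_weight(tier) for _, tier in pillars)
    return weighted_sum / total_weight


# Hypothetical three-pillar example: (80*1 + 60*1 + 40*1.5) / 3.5
overall = overall_score([(80, 0), (60, 1), (40, 2)])
print(round(overall, 2))
```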
Readiness bands:
Scores are bands, not point estimates. LLM assessment has run-to-run variance; moving within a band is noise. A band change is only meaningful when agentic code changed and multiple assessments confirm the new band. This skill produces a single-run assessment. For multi-run averaging, use awaf run --runs N from the CLI.
| Band | Range | Label | What It Means |
|---|---|---|---|
| 5 | 85–100 | Production Ready | Fully ready. Variance within this band is noise. |
| 4 | 70–84 | Near Ready | Close to production. Address findings before deploying. |
| 3 | 50–69 | Needs Work | Notable gaps. Resolve High findings before production use. |
| 2 | 25–49 | High Risk | Significant control failures. Not suitable for production. |
| 1 | 0–24 | Not Ready | Critical gaps across multiple pillars. Major rework required. |
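The band table translates directly into a lookup. This sketch assumes fractional scores fall into the band whose lower bound they meet (so 84.5 lands in band 4); the function name is hypothetical.

```python
# (lower bound, band number, label), highest band first, from the table above.
BANDS = [
    (85, 5, "Production Ready"),
    (70, 4, "Near Ready"),
    (50, 3, "Needs Work"),
    (25, 2, "High Risk"),
    (0, 1, "Not Ready"),
]


def readiness_band(score):
    """Return (band number, label) for a score between 0 and 100."""
    if not 0 <= score <= 100:
        raise ValueError("score must be between 0 and 100")
    for lower_bound, band, label in BANDS:
        if score >= lower_bound:
            return band, label


print(readiness_band(72))
```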
Produce output that matches the awaf run CLI format exactly, so a skill assessment and a CLI assessment are visually interchangeable. The full report template (ASCII banner, pillar table, findings, recommendations, evidence gaps) and the precise formatting rules (progress-bar width, confidence abbreviations, padding, line wrapping, Foundation FAIL handling) live in references/output-format.md. Read that file before writing the report and follow it exactly.
Two constraints carry into the rendered report (the band scale above drives the report's Scale: line):
If Foundation scores below 40, mark it FAIL and do not score Tier 1 or Tier 2 pillars.
This skill does not write awaf-report.txt to disk. For a saved artifact, the user should run awaf run from the CLI.
Architecture is not code. Code is one artifact that reveals architectural decisions. Runbooks, IAM policies, eval reports, and incident postmortems reveal just as much. Never treat a missing code file as equivalent to a missing architectural property.
Operational maturity counts. An agent with documented SLOs, runbooks, and postmortems has better Operational Excellence than one with clean code and no operational artifacts. Score what the evidence shows, not what the code implies.
Confidence is as important as score. A verified score of 60 is more useful than a self_reported score of 85. Always display confidence. Always explain what drove it down.
Always invite more evidence. The assessment improves with every additional artifact. After the initial report, be explicit about what would most improve score accuracy. Name the specific tool, report type, or document that would help. The goal is to get to verified across as many pillars as possible.
Never penalize for what was not provided. Mark it self_reported, flag the gap, and explain what it would take to upgrade. The user may simply not have thought to share it.
npx claudepluginhub yogiraja/awaf-skill --plugin awaf