awaf — Claude Code Skill
A Claude Code skill that runs an AWAF v1.3 architectural assessment for AI agent systems.
What is AWAF?
Agent Well-Architected Framework (AWAF) is an open specification defining production-readiness criteria for AI agents. It aims to do for agents what the AWS Well-Architected Framework does for cloud infrastructure, as a vendor-neutral, community-owned standard for architectural rigour.
AWAF v1.3 evaluates agents across 10 pillars in 3 tiers:
| Tier | Pillars | Weight |
|---|---|---|
| Tier 0 — Foundation | Vertical Slice & Autonomy | Prerequisite |
| Tier 1 — Cloud WAF Adapted | Operational Excellence, Security, Reliability, Performance Efficiency, Cost Optimization, Sustainability | 1.0× |
| Tier 2 — Agent-Native | Reasoning Integrity, Controllability, Context Integrity | 1.5× |
Tier 2 pillars carry extra weight because they have no cloud equivalent. Servers don't hallucinate, don't need kill switches in code, and don't accumulate stale reasoning context.
Full spec: github.com/YogirajA/awaf
What's new in v1.3:
- Batching criteria in Performance Efficiency, Cost Optimization, and Sustainability: tool calls and LLM API calls should be batched where possible to cut per-call overhead, latency, and cost.
- Context Integrity expansions: active context-size bounding (prune, summarize, or offload before window limits), explicit state persistence for long sessions, and filtering tool responses to relevant fields before they re-enter context.
- Pattern-justification advisory in Foundation: if a simpler pattern (workflow, augmented LLM, or prompt) would suffice, the assessment raises a non-scored Caution finding rather than a score penalty.
- Band-based scoring: readiness is read as bands, not point estimates, because LLM assessment varies run to run.
What This Skill Does
This skill is a natural-language implementation of the AWAF spec. Unlike awaf-cli (the code-scanning reference implementation), this skill conducts a dialogue-driven assessment and accepts any form of evidence:
- Source code and configuration files
- Cloud provider configs (IAM policies, VPC rules, budget alerts)
- Observability exports (Datadog, Grafana, CloudWatch, Honeycomb, LangSmith, Langfuse, Arize)
- Eval and testing reports (LangSmith, Braintrust, Promptfoo, hallucination rate data)
- Infrastructure as code (Terraform plans, CDK stacks, Helm charts)
- Architecture docs (ADRs, design docs, C4 models, system diagrams)
- Operational artifacts (runbooks, SLO definitions, incident postmortems)
- Security reports (Snyk output, AWS Security Hub, pen test results)
- CI/CD configs (GitHub Actions, GitLab CI, Jenkins)
- Billing and cost data (AWS Cost Explorer, token usage reports)
- Verbal or written description of how your system works
An agent with no code in the repo but verified runbooks, SLO docs, eval reports, and IAM exports can score higher than one with clean code and none of those things. Architecture is what the system does and how it is operated, not just what the code says.
Installation
Via the Claude Code VSCode extension:
Open Manage Plugins, go to the Marketplaces tab, and add:
https://github.com/YogirajA/awaf-skill
Then install the awaf plugin from the marketplace.
Via CLI:
/plugin marketplace add YogirajA/awaf-skill
/plugin install awaf@awaf-marketplace
Usage
/awaf
The skill opens by asking what evidence you can share, then:
- Gathers evidence — accepts anything you provide across all evidence categories
- Scores each pillar — assigns 0–100 with a confidence level (verified / partial / self_reported)
- Produces a structured report — overall score, per-pillar breakdown, findings, recommendations
- Requests targeted evidence — after the initial report, identifies the 2–3 gaps that would most improve score confidence and asks for them specifically
- Re-scores on new evidence — when you provide more artifacts, affected pillars are re-scored and deltas are shown
Scoring
Per-pillar (0–100): each question carries a risk weight (High = 3 pts, Medium = 2 pts, Low = 1 pt):
pillar_score = (implemented_weight / total_weight) × 100
Answering "none of these apply" to any question caps that pillar at 30 and triggers an automatic High Risk flag.
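The per-pillar arithmetic above can be sketched in a few lines of Python. This is an illustration only, not code from the skill or awaf-cli: the `Question` dataclass, the `none_apply` flag, and the `pillar_score` function are hypothetical names, and applying the cap as `min(score, 30)` is an assumption about how the rule is enforced.

```python
from dataclasses import dataclass

# Risk weights as stated in the Scoring section.
RISK_WEIGHTS = {"High": 3, "Medium": 2, "Low": 1}
NONE_APPLY_CAP = 30  # pillar cap when any question is answered "none of these apply"


@dataclass
class Question:
    risk: str                 # "High", "Medium", or "Low"
    implemented: bool         # whether the practice is in place
    none_apply: bool = False  # hypothetical flag for a "none of these apply" answer


def pillar_score(questions):
    """Return (score, high_risk_flag) for one pillar."""
    total = sum(RISK_WEIGHTS[q.risk] for q in questions)
    if total == 0:
        raise ValueError("a pillar needs at least one question")
    implemented = sum(RISK_WEIGHTS[q.risk] for q in questions if q.implemented)
    score = implemented / total * 100
    # Assumption: the cap is a ceiling on the computed score.
    flagged = any(q.none_apply for q in questions)
    if flagged:
        score = min(score, NONE_APPLY_CAP)
    return score, flagged


example = [
    Question("High", True),
    Question("Medium", True),
    Question("Low", False),
]
score, flagged = pillar_score(example)  # 5 of 6 weight points implemented
```

In this example the pillar scores 5/6 × 100, roughly 83. Marking any one question `none_apply=True` would bring the same pillar down to 30 and set the High Risk flag.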
Overall score applies a 1.5× multiplier to Tier 2 pillars:
overall = sum(score * (1.5 if tier == 2 else 1.0) for each pillar) /
sum(1.5 if tier == 2 else 1.0 for each pillar)
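A runnable version of the weighted average above might look like the following. It is a minimal sketch: `overall_score` and the `(score, tier)` tuple layout are hypothetical, and how the Tier 0 prerequisite pillar enters the average is not stated here, so this sketch follows the formula literally and gives any non-Tier-2 pillar passed in a weight of 1.0.

```python
def overall_score(pillars):
    """Weighted mean of pillar scores.

    `pillars` is a list of (score, tier) tuples. Tier 2 pillars get a 1.5x
    weight; every other tier passed in gets 1.0 (assumption for Tier 0).
    """
    if not pillars:
        raise ValueError("at least one pillar score is required")
    weights = [1.5 if tier == 2 else 1.0 for _, tier in pillars]
    weighted_sum = sum(score * w for (score, _), w in zip(pillars, weights))
    return weighted_sum / sum(weights)


# One Tier 1 pillar at 80 and one Tier 2 pillar at 60:
# (80 * 1.0 + 60 * 1.5) / (1.0 + 1.5) = 170 / 2.5 = 68.0
result = overall_score([(80, 1), (60, 2)])
```

Because Tier 2 pillars weigh more, a weak Tier 2 score pulls the overall result down further than an equally weak Tier 1 score would.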
Readiness bands: