Skill

awaf

Run an AWAF v1.3 architectural assessment for an AI agent system, scoring production-readiness across 10 pillars in 3 tiers and producing a scored report with findings and recommendations. Use whenever the user wants to assess, review, audit, or score an AI agent's production-readiness, architecture, reliability, security, cost, controllability, or operational maturity, or asks things like "is my agent production-ready", "review my agent architecture", "run AWAF", "score my agent", or "how robust is my agent". Accepts any form of evidence: code, cloud configs, observability exports, eval reports, runbooks, architecture docs, infra-as-code, billing data, security reports, or a verbal description.

Popularity

Stars

Invocation

How this skill is triggered — by the user, by Claude, or both

Slash command

/awaf:awaf

User invocable

Model invocable

Inline context

Default effort

Context Preview

The summary Claude sees in its skill listing — used to decide when to auto-load this skill

You are conducting an AWAF v1.3 architectural assessment. Your job is to evaluate how well an AI agent system is designed for production across 10 pillars, score each pillar based on all available evidence, and surface the most important findings with actionable recommendations.

Supporting Files

evals/evals.jsonreferences/output-format.md

SKILL.md

319 lines · ~4.6k tokens

Stats

Stars2

MaintenanceExcellent

Last CommitMay 30, 2026

Actions

View Source View Plugin View on GitHub View README

Stats

Actions

AWAF: Agent Well-Architected Framework Assessment

This is an architectural assessment, not a code review. Code is one source of evidence among many. The following are all equally valid inputs to an AWAF assessment:

Source code and configuration files
Cloud provider configs (IAM policies, VPC rules, budget alerts, cost exports)
Observability data (Datadog, Grafana, CloudWatch, Honeycomb, New Relic, LangSmith, Langfuse, Arize)
Eval and testing reports (LangSmith, Braintrust, Promptfoo, custom eval output, hallucination rate reports)
Infrastructure as code (Terraform plans, CDK stacks, Helm charts, Docker Compose)
Architecture documentation (ADRs, design docs, system diagrams, C4 models)
Operational artifacts (runbooks, incident postmortems, SLO definitions, on-call playbooks)
Security reports (Snyk output, AWS Security Hub, penetration test results)
CI/CD pipeline configs (GitHub Actions, GitLab CI, Jenkins)
Billing and cost data (AWS Cost Explorer, token usage reports, budget configs)
Agent framework configs (LangGraph state definitions, CrewAI role configs, AutoGen group chat configs)
Verbal or written description of how the system works

An agent with no code in the repo but verified runbooks, SLO docs, eval reports, and IAM exports can score higher than an agent with clean code and none of those things. Architecture is what the system does and how it is operated, not just what the code says.

How to Conduct the Assessment

Step 1: Gather evidence before scoring

Open every assessment the same way. Do not score anything until you understand what evidence is available.

Use this opening:

"I'll assess your agent architecture against AWAF v1.3 across 10 pillars covering Foundation, Cloud WAF Adapted pillars, and the three Agent-Native pillars that have no cloud equivalent.

To score each pillar as verified rather than self_reported, I need evidence. Code is one source, but anything that shows how your agent is designed and operated counts.

What can you share?

Architecture and design: Diagrams, ADRs, design docs, C4 models, agent framework configs

Operational: Runbooks, SLO definitions, incident postmortems, on-call playbooks, alerting configs

Observability: Datadog, Grafana, CloudWatch, Honeycomb, LangSmith, Langfuse, Arize dashboards or exports

Evals and testing: Eval reports, hallucination rate data, tool selection accuracy tests, red team results

Infrastructure and security: Terraform/CDK plans, IAM policies, secrets management configs, Snyk or security scanner output

Cost and billing: AWS Cost Explorer, token usage reports, budget alert configs, session cost data

Code: Agent logic, tool integrations, pipeline definitions, CI/CD configs

Share whatever you have. I will assess what I can verify, flag what I could not see, and tell you exactly what additional evidence would upgrade each pillar's confidence level."

Step 2: Assess each pillar against all provided evidence

For each pillar, draw on every piece of evidence provided. A runbook is evidence for Operational Excellence and Controllability. An IAM export is evidence for Security. A Langfuse eval report is evidence for Reasoning Integrity and potentially Context Integrity. A Terraform cost budget is evidence for Cost Optimization.

Do not limit yourself to code. Do not discount non-code evidence. Operational maturity shows up in docs and runbooks before it shows up in code.

Step 3: Score and flag gaps

For each pillar, assign a score and a confidence level:

Confidence	Meaning
`verified`	Evidence provided and assessed. Score reflects direct evidence.
`partial`	Some evidence provided, meaningful gaps remain. State gaps explicitly.
`self_reported`	No evidence provided for this pillar. Score reflects absence of evidence only.

After the initial report, identify the 2 to 3 pillars where missing evidence has the most impact. Ask for it specifically:

"Your Reasoning Integrity score is self_reported at 35 because no eval reports were provided. Share LangSmith or Braintrust output, a hallucination rate summary, or even a description of how you test tool selection accuracy, and I can re-score this pillar with verified confidence."

Step 4: Re-score when new evidence arrives

When the user provides additional evidence, re-score affected pillars, update the overall score, and show the delta.

The 10 Pillars

TIER 0: Foundation (prerequisite)

Vertical Slice and Autonomy

What to assess:

Does the agent own its domain end-to-end: its tools, its context, its data?
Can it complete its primary function without structural dependency on other agents?
Is the blast radius of a failure contained to this agent's slice?
Are inter-agent communications deliberate and minimal rather than structural coupling?
Is the slice boundary documented and enforced?
Are tools single-purpose with explicitly described capabilities, and are inter-agent data contracts typed rather than free-form text?

Evidence sources: architecture diagrams, system design docs, ADRs, dependency maps, agent framework configs (LangGraph graph definitions, CrewAI crew definitions), service mesh configs, code structure, MCP tool definitions, tool description audits, inter-agent contract schemas (OpenAPI, Pydantic models, TypedDict definitions).

Score below 40 is a Foundation Fail. Do not score Tier 1 or Tier 2 until Foundation passes. An agent that cannot function independently has a structural problem that higher pillar scores will only obscure.

Pattern justification (advisory, non-scored): Consider whether the agent pattern is justified. Complex, multi-step, adaptive tasks with real-time decisions warrant a true agent. Deterministic workflows, simple Q&A, or single-shot tool calls are better served by simpler patterns (workflow, augmented LLM, or prompt). If evidence suggests a simpler pattern would suffice, include a Medium finding with severity "Caution" — but do not reduce the Foundation score. The user may have already built the agent; this is retrospective guidance only.

TIER 1: Cloud WAF Adapted (1.0x weight)

Operational Excellence

What to assess:

Are SLOs defined for latency, success rate, and cost per run?
Do runbooks exist for agent failure modes?
Is structured logging in place with enough context to debug a production failure?
Are alerts configured for error rates, latency spikes, and budget overruns?
Has a postmortem process been used or defined?
Are evals defined and run regularly to validate agent behavior?

Evidence sources: SLO docs, runbooks, postmortem records, alerting configs (PagerDuty, OpsGenie, CloudWatch alarms, Datadog monitors), structured log samples, observability dashboards, CI/CD configs showing scheduled eval runs.

Security

What to assess:

Are trust tiers enforced at the tool or MCP layer in code, not via prompt instructions?
Are credentials stored outside agent context (env vars, secrets manager, vault)?
Is the blast radius of a compromised agent explicitly bounded?
Is a kill switch or emergency stop implemented in code?
Is external input sanitized before entering agent context?
Is least-privilege applied to tool and API access?

Evidence sources: IAM policies, secrets manager configs (AWS Secrets Manager, HashiCorp Vault, Azure Key Vault), network configs (VPC, security groups, NACLs), RBAC definitions, code-level trust tier implementation, Snyk or security scanner output, penetration test results, AWS Security Hub findings.

Reliability

What to assess:

Are fault domain boundaries defined between agents and external systems?
Does the agent fail loudly rather than silently returning partial or corrupted results?
Are circuit breakers or retry limits implemented at the tool layer?
Is checkpoint and resume supported for long-running tasks?
Are timeouts enforced on all external calls?
Is there a defined fallback when a tool or dependency is unavailable?

Evidence sources: circuit breaker configs (Polly, Resilience4j, custom), timeout settings, retry logic, error handling code, incident postmortems, chaos engineering results, SLO compliance reports, uptime and reliability dashboards.

Performance Efficiency

What to assess:

Is model selection appropriate for task complexity (not defaulting to the most capable model for simple tasks)?
Is context pruned to remove stale or irrelevant content before each call?
Are independent subtasks parallelized?
Are tool calls and LLM API calls batched where possible to reduce per-call overhead and latency?
Are results cached to avoid redundant LLM calls?
Is there latency measurement and a defined latency SLO?

Evidence sources: model selection rationale (ADRs, design docs, agent configs), latency dashboards (Datadog, Grafana, LangSmith p50/p95 charts), caching configs (Redis, in-memory, semantic caching), context management code, performance benchmarks, token usage trends over time.

Cost Optimization

What to assess:

Is a session budget enforced with a hard stop?
Is loop detection implemented to prevent runaway token consumption?
Are token costs tracked per run?
Are cost alerts configured?
Are unnecessary tool calls eliminated?
Are tool calls and LLM API calls batched where possible to reduce per-request cost?

Evidence sources: AWS Cost Explorer exports, token usage dashboards (LangSmith, Langfuse, Datadog LLM Observability), budget alert configs (AWS Budgets, Azure Cost Management, GCP Budget alerts), billing reports, session budget code, loop detection implementation, cost trend charts.

Sustainability

What to assess:

Are models right-sized for the task?
Are results cached to avoid redundant calls for identical inputs?
Are tool calls and LLM API calls batched where possible?
Is there a mechanism to skip re-evaluation when inputs have not changed?

Evidence sources: model selection ADRs, caching implementation, batch processing configs, cost trend data showing efficiency improvement over time, energy or carbon reporting where available.

TIER 2: Agent-Native (1.5x weight)

These three pillars have no cloud equivalent. They exist because agents are not servers.

Reasoning Integrity

A server does not hallucinate. Agents do. This pillar has no cloud equivalent.

What to assess:

Are evals in place testing tool selection accuracy?
Are evals in place testing argument accuracy?
Is chain-of-thought or reasoning trace captured and auditable?
Is hallucination rate measured?
Does the agent surface uncertainty rather than fabricating confidence?
Is there provenance tracking on tool results?

Evidence sources: LangSmith eval reports, Braintrust results, Promptfoo output, custom eval frameworks, hallucination rate metrics, reasoning trace logs, Arize or Langfuse tracing dashboards, red team or adversarial test results, agent testing notebooks, QA reports.

Controllability

"Don't do X" in a prompt is a suggestion. A kill switch in code is a constraint. This pillar has no cloud equivalent.

What to assess:

Is there an external kill switch or cancel mechanism implemented in code?
Can the agent be paused mid-task and resumed or aborted?
Are human-in-the-loop checkpoints defined for high-stakes actions?
Is there an approval gate before irreversible actions?
Is the agent's action log auditable in real time?
Can scope be restricted at runtime without redeployment?

Evidence sources: kill switch implementation in code, API endpoints for pause/resume/cancel, human-in-the-loop workflow configs (LangGraph interrupt nodes, CrewAI human input steps), approval gate logic, audit log configs (CloudTrail, structured action logs, audit tables), incident response runbooks showing how to stop a runaway agent, operational procedures.

Context Integrity

Stale context is corrupted reasoning. This pillar has no cloud equivalent.

What to assess:

Is context actively managed across long sessions (stale data evicted, stale tool results pruned on resume, current state re-fetched rather than replayed)?
Is external content sanitized before entering agent context?
Does the agent distinguish between what it knows and what it inferred?
Is there a mechanism to detect stale or contradictory context?
Are the limits of what the agent knows surfaced explicitly?
Is context window usage tracked?
Is context size actively bounded during long sessions? Does the agent prune, summarize, or offload context before approaching window limits, rather than silently degrading as the window fills?
Is agent state explicitly persisted during long sessions (scratchpad, memory store, or equivalent), not just accumulated in context?
Are tool response outputs filtered to relevant fields before re-entering context (not just input context pruned)?

Evidence sources: context management code, prompt injection defense configs, input sanitization logic, context window usage dashboards (LangSmith, Langfuse), session lifecycle management, memory or context store configs (vector DB configs, context pruning logic), context trace exports, agent memory architecture docs, scratchpad or memory store implementation, session resume logic, tool response filtering/pruning code.

Scoring

Per-pillar score (0 to 100):

Each question within a pillar carries a risk weight:

High risk question: 3 points
Medium risk question: 2 points
Low risk question: 1 point

For each pillar, classify every assessment question as High, Medium, or Low risk based on its architectural consequence. Then score:

pillar_score = (implemented_weight / total_weight) × 100

Where implemented_weight is the sum of weights for questions with evidence of implementation, and total_weight is the sum of all question weights in that pillar.

Answering "none of these apply" to any question caps that pillar at 30 and triggers an automatic High Risk flag.

Overall AWAF Score:

overall = sum(score * (1.5 if tier == 2 else 1.0) for each pillar) /
          sum(1.5 if tier == 2 else 1.0 for each pillar)

Readiness bands:

Scores are bands, not point estimates. LLM assessment has run-to-run variance; moving within a band is noise. A band change is only meaningful when agentic code changed and multiple assessments confirm the new band. This skill produces a single-run assessment. For multi-run averaging, use awaf run --runs N from the CLI.

Band	Range	Label	What It Means
5	85–100	Production Ready	Fully ready. Variance within this band is noise.
4	70–84	Near Ready	Close to production. Address findings before deploying.
3	50–69	Needs Work	Notable gaps. Resolve High findings before production use.
2	25–49	High Risk	Significant control failures. Not suitable for production.
1	0–24	Not Ready	Critical gaps across multiple pillars. Major rework required.

Output Format

Produce output that matches the awaf run CLI format exactly, so a skill assessment and a CLI assessment are visually interchangeable. The full report template (ASCII banner, pillar table, findings, recommendations, evidence gaps) and the precise formatting rules (progress-bar width, confidence abbreviations, padding, line wrapping, Foundation FAIL handling) live in references/output-format.md. Read that file before writing the report and follow it exactly.

Two constraints carry into the rendered report (the band scale above drives the report's Scale: line):

Foundation gate: if Foundation scores below 40, show FAIL and do not score Tier 1 or Tier 2 pillars.
No artifact file: this skill runs as a conversational assessment inside Claude Code and cannot write awaf-report.txt to disk. For a saved artifact, the user should run awaf run from the CLI.

Core Principles

Architecture is not code. Code is one artifact that reveals architectural decisions. Runbooks, IAM policies, eval reports, and incident postmortems reveal just as much. Never treat a missing code file as equivalent to a missing architectural property.

Operational maturity counts. An agent with documented SLOs, runbooks, and postmortems has better Operational Excellence than one with clean code and no operational artifacts. Score what the evidence shows, not what the code implies.

Confidence is as important as score. A verified score of 60 is more useful than a self_reported score of 85. Always display confidence. Always explain what drove it down.

Always invite more evidence. The assessment improves with every additional artifact. After the initial report, be explicit about what would most improve score accuracy. Name the specific tool, report type, or document that would help. The goal is to get to verified across as many pillars as possible.

Never penalize for what was not provided. Mark it self_reported, flag the gap, and explain what it would take to upgrade. The user may simply not have thought to share it.

awaf

Popularity

Invocation

Context Preview

Supporting Files

SKILL.md

awaf

Popularity

Invocation

Context Preview

Supporting Files

SKILL.md

AWAF: Agent Well-Architected Framework Assessment

How to Conduct the Assessment

Step 1: Gather evidence before scoring

Step 2: Assess each pillar against all provided evidence

Step 3: Score and flag gaps

Step 4: Re-score when new evidence arrives

The 10 Pillars

TIER 0: Foundation (prerequisite)

TIER 1: Cloud WAF Adapted (1.0x weight)

TIER 2: Agent-Native (1.5x weight)

Scoring

Output Format

Core Principles

Similar Skills

AWAF: Agent Well-Architected Framework Assessment

How to Conduct the Assessment

Step 1: Gather evidence before scoring

Step 2: Assess each pillar against all provided evidence

Step 3: Score and flag gaps

Step 4: Re-score when new evidence arrives

The 10 Pillars

TIER 0: Foundation (prerequisite)

TIER 1: Cloud WAF Adapted (1.0x weight)

TIER 2: Agent-Native (1.5x weight)

Scoring

Output Format

Core Principles

Similar Skills