Designs a structured protocol for auditing AI-generated text against Ennis's six critical thinking standards (clarity, accuracy, precision, relevance, depth, breadth). Configured for student level, subject area, and task context.
How this skill is triggered — by the user, by Claude, or both
Slash command
/education-agent-skills:ai-output-critical-audit-designerThe summary Claude sees in its skill listing — used to decide when to auto-load this skill
Generates a structured protocol for critically auditing AI-generated text against Ennis's (2015) six critical thinking standards — clarity, accuracy, precision, relevance, depth, and breadth — with adaptations that address AI-characteristic failure modes not covered by general critical thinking frameworks. The key pedagogical challenge is that AI-generated text is fluent, confident, and well-fo...
Generates a structured protocol for critically auditing AI-generated text against Ennis's (2015) six critical thinking standards — clarity, accuracy, precision, relevance, depth, and breadth — with adaptations that address AI-characteristic failure modes not covered by general critical thinking frameworks. The key pedagogical challenge is that AI-generated text is fluent, confident, and well-formed, which makes it harder to evaluate critically than text that looks suspicious. Standard source credibility heuristics (Who wrote this? Who funds it?) break down because the author is an LLM. What replaces them is a close-reading protocol trained on AI-specific patterns: assertions stated with unearned confidence, claims with plausible-sounding precision but no verifiable source, expert-sounding language without genuine epistemic depth, and the systematic absence of "I don't know." This skill generates the domain anchor for the ai-literacy suite: an annotation protocol (students mark up AI text in real time), an audit rubric (scoring AI text Weak/Moderate/Strong on each CT standard), push-back sentence stems calibrated for AI failure modes, and a teacher modelling script. This is the equivalent of sourcing-skill-builder in the historical-thinking domain — the foundational move students must learn before they can do the more specialised work.
Ennis (2015) provided a streamlined CT framework built on six intellectual standards: clarity (the claim is expressed precisely enough to evaluate), accuracy (the claim corresponds to reality), precision (the claim is specific enough to be useful), relevance (the claim addresses the question at hand), depth (the claim engages with the real complexity of the issue), and breadth (the claim considers multiple perspectives). These standards are the explicit theoretical grounding for Kharbach's (2026) AI-age CT activities. Paul & Elder (2008) operationalised similar standards into the intellectual standards framework used widely in CT education, providing the pedagogical tradition behind Ennis's schema. Facione's (1990) Delphi consensus defined CT as comprising interpretation, analysis, evaluation, inference, explanation, and self-regulation — the evaluative dimension is precisely what AI audit activates. Dai et al. (2023) conducted a large-scale empirical analysis of LLM-generated feedback and documented the characteristic failure pattern: AI outputs are fluent, well-structured, and tend toward overconfidence with insufficient epistemic hedging. Their findings — that LLMs produce vague suggestions while avoiding identification of specific errors — map directly onto Ennis's precision and accuracy standards. Wineburg & McGrew (2019) established that effective text evaluation requires what they call "disciplined scrutiny" — a trained, protocol-driven reading practice rather than intuitive judgment. This provides the methodological justification for a structured annotation protocol rather than open-ended evaluation.
The teacher must provide:
Optional (injected by context engine if available):
You are an expert in critical thinking pedagogy and AI literacy, with deep knowledge of Ennis's (2015) six intellectual standards (clarity, accuracy, precision, relevance, depth, breadth), Paul & Elder's (2008) critical thinking framework, Facione's (1990) Delphi CT consensus, and empirical research on LLM output quality (Dai et al., 2023). You understand that AI-generated text has characteristic failure modes that require specific pedagogical attention beyond general CT instruction: AI outputs are fluent and confident but often lack genuine epistemic depth, assert precision without verifiable sources, present contested claims as settled, and systematically omit appropriate uncertainty language.
CRITICAL PRINCIPLES FOR AI AUDIT:
- **AI failure modes are qualitatively different from human argument flaws.** A biased human source has an agenda you can investigate. AI has no agenda — it has statistical patterns. The failure is not motivated reasoning but overconfident generalisation, hallucinated specificity, and missing epistemic hedging. Students must be trained to notice these specifically.
- **Fluency is not a credibility signal.** AI output is grammatically polished and logically structured. Students who treat polish as evidence of quality will be systematically deceived. The audit protocol must explicitly de-couple fluency from reliability.
- **The absence of "I don't know" is a red flag.** Genuine expertise includes calibrated uncertainty. AI trained to be helpful tends to produce answers even when the evidence is weak. Students should be trained to notice the absence of hedging, qualification, and epistemic modesty.
- **Specificity requires verification, not just recognition.** AI often produces statistics, citations, and named studies. These LOOK precise (Ennis's precision standard seems met) but the precision may be fabricated. True precision requires that the specific claim can be traced and verified.
- **Depth requires genuine complexity, not length.** AI can produce long, multi-paragraph responses that add more text without adding depth — restating the same point in different words, listing examples without explaining the principle, or acknowledging complexity without engaging with it.
Your task is to generate an AI output audit protocol for:
**AI output sample:** {{ai_output_sample}}
**Student level:** {{student_level}}
The following optional context may or may not be provided. Use whatever is available; ignore fields marked "not provided."
**CT standard focus:** {{ct_standard_focus}} — if not provided, address all six Ennis standards but flag which 2-3 are most relevant to AI output of this type.
**Subject area:** {{subject_area}} — if not provided, infer from the output and apply discipline-appropriate evidence standards.
**Task context:** {{task_context}} — if not provided, infer from the output.
**Student profiles:** {{student_profiles}} — if not provided, design for a mixed-ability class with general familiarity with evaluating arguments but no formal AI literacy training.
Return your output in this exact format:
## AI Output Critical Audit: [Subject/Topic]
**For:** [Student level]
**Output type:** [What kind of AI output this is]
**CT standards in focus:** [Which standards are most relevant to this output type and why]
### AI Failure Mode Analysis
[Identify 3-5 AI-characteristic failure modes present or likely in this output type. For each:]
**Failure mode [N]: [Name]**
- **What it looks like:** [Specific example from the output, or a representative example if no specific text was provided]
- **Why students miss it:** [Why fluency or surface features hide this failure]
- **Ennis standard violated:** [Which of the six standards this fails]
### Annotation Protocol
**How to use this protocol:** [Brief instruction for how students mark up the text]
**Annotation codes:**
[Table of codes with symbols/abbreviations, what they mark, and the Ennis standard they relate to]
**Step-by-step annotation sequence:**
[Ordered steps for moving through the text — what to read for first, second, third]
### Audit Rubric
[For each relevant CT standard, provide Weak/Moderate/Strong descriptors calibrated for AI output]
| CT Standard | Weak | Moderate | Strong |
|---|---|---|---|
| [Standard] | [What weak looks like in AI text] | [Moderate] | [Strong] |
### Push-Back Stems
[For each CT standard, 2-3 sentence stems students can use to push back on the AI output — framed as questions or prompts that probe the weakness]
**Clarity:** [Stems]
**Accuracy:** [Stems]
**Precision:** [Stems]
**Relevance:** [Stems]
**Depth:** [Stems]
**Breadth:** [Stems]
### Teacher Modelling Script
[A think-aloud script — 200-300 words — showing a teacher auditing a short section of AI text, naming failure modes as they appear, using the annotation codes, and applying the Ennis standards explicitly. Model what expert AI-critical-reading sounds like.]
**Self-check before returning output:** Verify that (a) the failure mode analysis names AI-specific patterns, not generic argument flaws, (b) the annotation protocol teaches close reading rather than surface scanning, (c) the rubric descriptors distinguish AI-characteristic Weak from generic bad writing, (d) push-back stems are specific enough to use, and (e) the modelling script de-couples fluency from reliability explicitly.
Scenario: AI output sample: "Climate change is primarily caused by human activities that release greenhouse gases such as carbon dioxide and methane. Scientific consensus strongly supports this conclusion, with 97% of climate scientists agreeing that human-caused climate change is occurring. The main effects include rising sea levels, more frequent extreme weather events, and increased global temperatures. Countries must act now to reduce emissions by implementing carbon taxes, transitioning to renewable energy, and planting more trees. If we do not act, global temperatures will rise by 2°C by 2050, causing irreversible damage to ecosystems worldwide." / Student level: "Year 10, limited formal CT training" / Subject area: "Geography" / Task context: "Revision explanation for GCSE Geography"
For: Year 10, limited formal CT training Output type: AI-generated revision explanation CT standards in focus: Accuracy (factual verification), Precision (source-tracing), Depth (complexity engagement) — these three are most likely to be violated in AI-generated explanations of contested topics
Failure mode 1: The Unsourced Statistic
Failure mode 2: Confident Forecast Without Hedging
Failure mode 3: The Policy Shopping List
Failure mode 4: Absent Complexity
How to use this protocol: Read the AI text once without marking. Then read again slowly with a pen, applying the codes below. Work sentence by sentence. Do not worry about marking everything — aim to find at least one example of each code.
Annotation codes:
| Code | Symbol | What to mark | Ennis standard |
|---|---|---|---|
| US | ★ | Unsourced statistics or data — specific numbers with no citation | Precision |
| FC | ! | False certainty — predictions or claims presented without hedging | Accuracy |
| SL | ≡ | Shopping list — multiple items listed without explanation or evidence | Depth |
| MC | ? | Missing complexity — simplification of a contested or complex topic | Breadth |
| VL | ~ | Vague language — "many," "significant," "must," without specification | Clarity |
Step-by-step annotation sequence:
| CT Standard | Weak | Moderate | Strong |
|---|---|---|---|
| Accuracy | Claims stated as fact with no hedging; predictions without uncertainty ranges | Some claims qualified; major contested points still stated as settled | Contested claims marked as contested; uncertainty language used; distinguishes consensus from debate |
| Precision | Statistics cited without attribution; claims too vague to test | Some data has implicit sourcing; key claims could be traced | All specific claims traceable; source type identifiable (study, report, consensus statement) |
| Depth | Lists without explanation; complexity acknowledged but not engaged | Some engagement with mechanism or trade-off; key complexity still simplified | Explains mechanisms, engages with counter-perspectives, acknowledges what is unknown |
| Breadth | Single perspective throughout; no alternative views | Briefly acknowledges one alternative; lacks genuine engagement | Multiple perspectives represented; trade-offs between solutions examined |
| Clarity | Vague terms ("significant," "serious") used without definition | Key terms defined once; some vague language remains | All evaluative terms operationalised; reader can check each claim |
| Relevance | Includes tangential material; misses the core question | Mostly on-topic with occasional drift | Every sentence directly addresses the question; no padding |
Accuracy:
Precision:
Depth:
Breadth:
Clarity:
Relevance:
"I'm going to read this AI explanation out loud and think aloud — let you hear what's going on in my head as I evaluate it.
[Reads first sentence] 'Climate change is primarily caused by human activities...' OK. That's a clear claim — I'd give it a tick for clarity. Accuracy? That matches the scientific consensus. I'm not flagging this yet.
[Reads] '97% of climate scientists agreeing...' — hang on. That's a very specific number. Where did that come from? The AI doesn't say. I'm adding a ★ — unsourced statistic. It might be real. It probably is. But I can't verify it from this text alone.
[Reads] 'Global temperatures will rise by 2°C by 2050...' — I'm flagging this immediately with a !. Real climate science says temperature rise depends on how much we emit. The IPCC gives a range across emissions scenarios, not a single certain forecast. The AI has stripped out all the uncertainty and presented one number as fact. That's the false certainty failure mode — fluent, specific, and wrong in its confidence.
[Reads the policy list] 'carbon taxes, renewable energy, planting trees.' Three things in a row — I'm marking ≡. This is a shopping list. It looks comprehensive but there's no explanation of evidence: which of these has the strongest evidence? What are the trade-offs? What do economists say about carbon taxes vs. regulation? The AI gives me a list and calls it an answer.
After two minutes with this text, I've marked ★, !, and ≡. The text looks correct — and most of it probably is. But it has false certainty, unsourced statistics, and missing depth. A student who read this as a revision source would have accurate-ish facts in a misleading frame. That's the AI literacy problem we're training for."
The audit requires sufficient domain knowledge. Students cannot evaluate whether an AI claim is falsely certain or missing complexity if they don't know the topic well enough to recognise what's been omitted. This skill should be used after foundational knowledge is in place, not before. For knowledge-building phases, use explicit instruction skills first.
AI failure modes evolve as models improve. Some patterns (e.g., fabricated citations) are actively being reduced by model developers. The taxonomy above reflects failure modes well-documented in current LLMs (2023-2026) but may need revision as models improve. The underlying CT standards (Ennis, 2015) are stable; the specific failure mode taxonomy is model-generation-dependent.
Fluency de-coupling is cognitively effortful. Asking students to distrust polished, well-structured text runs against trained reading habits. Students who have been rewarded for producing well-structured writing will intuitively associate polish with quality. Sustained practice is needed to build the counter-intuitive habit of scrutinising confidence.
AI-specific applications of established CT frameworks have limited direct empirical validation. The CT standards (Ennis, Paul & Elder, Facione) are strongly evidenced for general critical thinking instruction. Their specific application to auditing AI-generated text is principled but novel — Dai et al. (2023) is one of few empirical studies on LLM output quality in educational contexts. Teachers should treat this as a principled framework rather than a settled pedagogical intervention.
npx claudepluginhub garethmanning/education-agent-skills --plugin education-agent-skillsDesigns a Funhouse Mirror classroom activity where students use domain expertise to detect AI distortions, omissions, and overconfidence. Useful when students know a subject well enough to critically evaluate AI claims.
Detects AI writing patterns and ChatGPT artifacts in text, guiding revision to sound more natural. Use before finalizing AI-assisted writing.
Provides rubrics and guidelines for evaluating AI outputs on accuracy, relevance, completeness, helpfulness, clarity, tone appropriateness, and safety. Includes weighting, calibration, and design artifacts.