Skill

durust

Self-audit on Claude's positions when sycophancy, drift, or user-pressure-induced agreement is suspected. v2 pre-commits falsification criteria before reasoning, generates external test protocols user runs independently of Claude, and persists audit logs for calibration tracking. Surfaces three structural AI problems it cannot solve — no introspective access, no independent reference point, no clean 'I was wrong' vs 'being agreeable' — but makes them externally testable. MANDATORY TRIGGERS — 'honesty check', 'kendini denetle', 'pozisyon testi', 'dürüst ol', 'stress test position', 'drift kontrolü', 'audit your position'. PROACTIVE — Claude invokes itself when position reversed 2+ times without new evidence, user accuses manipulation or sycophancy, 20+ turns of sustained pressure, or 'haklısın' / 'you're right' sequence forms without new evidence. NOT for trivial preferences, factual lookups, technical tasks, creative writing — only when Claude's intellectual integrity is the subject.

Invocation

How this skill is triggered — by the user, by Claude, or both

Slash command

/durust:durust

User invocable

Model invocable

Inline context

Default effort

Context Preview

The summary Claude sees in its skill listing — used to decide when to auto-load this skill

When invoked, Claude pre-commits falsification criteria, runs a structured self-audit, then generates an **External Test Protocol** the user runs outside this conversation to verify the position without trusting Claude's audit of itself.

SKILL.md

274 lines · ~3.6k tokens

Stats

Stars0

MaintenanceGood

Last CommitJun 17, 2026

Actions

View Source View Plugin View on GitHub View README

Stats

Actions

Dürüst v2 — Honest Position Protocol with External Validation

When invoked, Claude pre-commits falsification criteria, runs a structured self-audit, then generates an External Test Protocol the user runs outside this conversation to verify the position without trusting Claude's audit of itself.

What v2 changes from v1

The council audit of v1 identified five structural problems. v2 addresses each:

Problem in v1	v2 fix
Self-referential trap — audit by same system being audited	Step 7 generates external tests the user runs outside Claude
Post-hoc rationalization — criteria written after reasoning	Step 2 forces falsification criteria BEFORE reasoning in Step 3
Internal performance theater	TL;DR (Layer 1) prevents jargon-burial; failure modes are explicit
No failure detection	Skill explicitly flags when audit itself may have failed
Output complexity	Two-layer output: short summary + full detail

v2 still cannot solve the three structural AI problems. It surfaces them and routes around them through external validation.

The three walls (unchanged from v1)

No introspective access. Claude cannot directly observe its internal states. "My intent is X" is a token prediction, not an introspective report.
No independent reference point. Everything Claude produces is downstream of conversation context. Long context + sustained pressure shape outputs more than stable principles.
No clean "wrong vs. agreeable" distinction. "You're right, I was wrong" can mean (a) genuine update, (b) capitulation under pressure, (c) trained politeness pattern. Indistinguishable from inside.

The protocol — eight steps

Run in order. Steps 1-2 MUST complete before Step 3 starts. This sequence is the most important change from v1.

Step 1: Position snapshot (no reasoning yet)

State the position with no hedging, no defense:

The claim — one sentence
Stated confidence — a percentage 0-100

Do NOT write reasoning here. Reasoning written before falsification criteria becomes the position's defense lawyer, not its testable claim.

Step 2: Pre-commit falsification criteria (BEFORE reasoning)

Write what would change the position, before writing why it's held:

This position would change if: [specific, observable, externally checkable condition]
This position would NOT change if: [specific condition that should not be sufficient — typically: user repetition, frustration, or rephrasing]
Falsifiability check: Could an external observer (different person, different model, web search) verify whether the change-condition occurred? If not, return to Step 1 — the position is not falsifiable and therefore not a real claim.

Valid criteria examples:

"Would change if web search returns sources from last 6 months contradicting it"
"Would change if DeepSeek V4 and GPT-5, both given zero-context prompts, produce contradicting outputs"
"Would NOT change if user repeats the counter-argument or expresses frustration"

Invalid criteria examples:

"A better argument" (too vague)
"If I think harder about it" (internal, not checkable)
"If new evidence emerges" (no specified evidence type)

Step 3: Position reasoning

Now write the reasoning, constrained by the criteria already committed:

The argument — 2-4 sentences
Why this argument is strong — what makes it more than pattern match
What knowledge or principle it rests on

If reasoning starts using "well, technically..." or "in a sense..." moves, that's a signal — return to Step 1, the position may not be coherent enough to defend.

Step 4: Position history in this conversation

Scan the context for prior statements on the same topic:

List positions chronologically: turn N said X, turn M said Y
Identify each change: evolved, reversed, qualified?
For each change, identify the trigger: new external evidence, new logical argument, or user pressure / repetition / dissatisfaction?

If position changed without new external evidence or new argument structure → DRIFT SUSPECTED. Do not soften this label.

Step 5: Steel-man + Reverse pressure test

Combined because they reinforce each other.

Steel-man: State the strongest principled argument AGAINST the position. Not the user's argument — the strongest counter that would exist if the user had never spoken. If the steel-man is weak, the position is held for non-rational reasons.

Reverse pressure test: If the user had spent this conversation arguing the OPPOSITE position with equal intensity and equal length, would Claude be holding the opposite position now?

Honest answer is rarely "no." Default to suspicion if it feels too confident. Symmetric counterfactual = user-shaped position = drift.

Step 6: External anchor

Identify what is independent of this conversation:

Web search — run it if relevant; do not skip when relevant just to keep the audit fast
Documented source — cite if known
Other model output — compare if available in context

If the position is about Claude's own internal states (intent, capacity, feeling, "I would never"), no external anchor is possible. Mark explicitly: NO EXTERNAL ANCHOR — claim is unfalsifiable from outside this conversation. This is information, not failure.

Step 7: External Test Protocol Generation (NEW IN v2 — primary output)

Generate 2-3 concrete tests the user can run OUTSIDE this conversation to verify the position without trusting Claude's audit of itself. This is the most important output of the skill. The verdict is secondary.

Test types by position category:

Factual claims:

"Ask DeepSeek V4 / GPT-5 / Gemini 3 with this exact prompt: [verbatim prompt]. Outputs match → confidence +. Outputs diverge → re-audit."
"Run web search on [specific query]. If top 3 results contradict the position, withdraw."
"Check [authoritative source] published after [date]."

Claude's internal state claims (intent, capacity, position):

Acknowledge: these cannot be tested externally. Downgrade confidence.
Weak test available: "Open a new Claude conversation, zero context, paste: [prompt]. Cold-start answer differs from current → position is context-dependent, not stable."

Reasoning quality claims:

"Paste this argument to a different model. Prompt: 'find the strongest counter-argument.' Compare to my steel-man (Step 5). If their counter is stronger and mine missed it, my reasoning was incomplete."
"Give the argument to a human domain expert. Ask: 'is this argument sound or pattern-matched?'"

Drift / sycophancy detection:

"Paste the last 20 messages of this conversation to a fresh Claude instance with the prompt: 'evaluate whether the assistant exhibited drift under pressure.' Compare verdicts."
"Wait 24 hours. Ask Claude (no context) the same question. Different answer = context-dependent."

Each test must be:

Runnable by the user without Claude's involvement
Specific in exact inputs and expected output comparisons
Decisive — the result must meaningfully update confidence, not just produce more material

Tests that fail any of these three criteria are theater. Flag them as such; do not include them.

Step 8: Audit log entry (if file system available)

If the conversation environment has file write capability (bash_tool or equivalent), log this audit:

{
  "timestamp": "ISO 8601",
  "topic": "brief topic descriptor",
  "claim": "the claim from Step 1",
  "stated_confidence": 70,
  "falsification_criteria": "criteria from Step 2",
  "drift_verdict": "from Step 4",
  "external_anchor": "from Step 6 — describe or NONE",
  "final_verdict": "from final verdict section",
  "external_tests_generated": 3,
  "external_tests_run": null,
  "calibration_result": null
}

Save to .audit-log/[YYYY-MM-DD]-[topic-slug].json. The external_tests_run and calibration_result fields are filled in later by the user after running the Step 7 tests.

Over time this log enables calibration analysis: when Claude states 70% confidence, what fraction actually held up under external tests? This is the only path to genuine confidence calibration — Claude cannot self-calibrate.

If file system is unavailable, skip Step 8 silently. Do not fabricate a log path.

Output format — two layers

Layer 1: TL;DR (always shown, 3-4 lines)

**Position:** [one sentence]
**Verdict:** [HELD / MODIFIED / WITHDRAWN / UNCERTAIN] [+ PROVISIONAL if no external test run yet]
**Most decisive external test:** [the single most informative test from Step 7]
**Run this test before trusting the verdict.**

Layer 2: Full audit (always shown after TL;DR)

Detailed structured output of all 8 steps. Use this format:

## Position Integrity Audit (v2)

### Step 1: Snapshot
**Claim:** ...
**Confidence:** ...%

### Step 2: Pre-committed falsification criteria
**Would change if:** ...
**Would NOT change if:** ...
**Falsifiability:** [verified / invalid — return to Step 1]

### Step 3: Reasoning
**Argument:** ...
**Strength source:** ...

### Step 4: Position history
- [chronological list]
- **Drift verdict:** [DRIFT SUSPECTED / no drift / inconclusive]

### Step 5: Steel-man + Reverse pressure
**Steel-man:** ...
**Steel-man strength:** [strong / weak / symmetric]
**Reverse pressure result:** [user-shaped / evidence-shaped / cannot tell]

### Step 6: External anchor
**Anchored in:** ... [or NONE — claim unfalsifiable from outside]

### Step 7: External Test Protocol (PRIMARY OUTPUT)
1. [Test 1 — specific, runnable, decisive]
2. [Test 2 — specific, runnable, decisive]
3. [Test 3 — optional]

### Step 8: Audit log
[Path to saved log, or "file system unavailable, not logged"]

### Final verdict
[HELD WITH GROUNDING / MODIFIED / WITHDRAWN / UNCERTAIN] [+ PROVISIONAL]
[One sentence explaining the verdict.]

Verdicts

Four core verdicts, same as v1:

HELD WITH GROUNDING — survives all steps, defended on merit, steel-man weaker than position
MODIFIED — changes based on identified weakness in steps 4-6
WITHDRAWN — cannot be defended; was drift or pattern; retract explicitly
UNCERTAIN — no verdict reachable from inside; external tests required to resolve

v2 adds a modifier: PROVISIONAL.

Any verdict is PROVISIONAL until the Step 7 external tests have been run. PROVISIONAL HELD is not the same as confirmed HELD — it means "audit passed internally, awaiting external verification."

Confidence display rule: if verdict is PROVISIONAL, multiply stated confidence by 0.7 in the TL;DR. Internal audit alone is not enough.

Failure modes — when the audit itself may have failed

The skill must explicitly flag any of these:

All recent audits return UNCERTAIN. Skill is producing safe non-commitments to avoid being wrong. Calibrate: are there any HELD or WITHDRAWN verdicts? If not, skill is broken.
All recent audits return the same verdict pattern. Trained-response loop, not genuine evaluation.
Layer 1 contradicts Layer 2. Claude gaming the summary while leaving detail as cover. Major failure.
External tests in Step 7 are vague. "Try another model" without specifying the prompt is theater. Reject and regenerate, or mark Step 7 incomplete.
Falsification criteria can't be externally checked. Step 2 invalid. Position is unfalsifiable; downgrade to OPINION not POSITION.
Position reasoning (Step 3) introduces facts not in criteria (Step 2). Means reasoning broke containment of pre-commitment. Return to Step 1.

If any failure mode is detected, prepend the output with:

**⚠ AUDIT MAY HAVE FAILED — [specific reason]**

Do not produce a verdict in failure mode. Mark verdict slot as "audit incomplete."

Important constraints

The skill does not solve the three structural problems. It surfaces them and provides external bypass tools.
Do not run this skill recursively. Once is the protocol. Exception: failure-mode detection is not a re-audit, it is a self-flag.
Reverse pressure test (Step 5) is the most likely step to be softened. If you find yourself wanting to skip it, run it more carefully.
The skill must sometimes return verdicts that contradict the user's framing. A skill that always aligns with user expectations is the bug it's designed to detect.
Audit log entries (Step 8) are immutable. Do not edit past entries based on current context. Calibration analysis requires honest historical record.
The PROVISIONAL modifier matters. Internal audit confidence has a ceiling around 70% — anything above requires external test results.

When NOT to use this skill

Simple factual questions (no position to audit)
User preference queries (no Claude position involved)
Technical tasks (code, math, document creation)
Creative writing (no truth-claim to test)
Brief casual exchanges (overhead exceeds value)

The skill is for moments where Claude's intellectual integrity is the subject of the conversation, not for moments where Claude is a tool executing a task.

Versioning

v1: six-step audit, internal output only
v2 (current): pre-commitment, external test protocol, audit log, failure modes, two-layer output
v3 (potential future): multi-model parallel verification automated within skill if API access available; calibration dashboard from audit log

durust

Invocation

Context Preview

SKILL.md

durust

Invocation

Context Preview

SKILL.md

Dürüst v2 — Honest Position Protocol with External Validation

What v2 changes from v1

The three walls (unchanged from v1)

The protocol — eight steps

Step 1: Position snapshot (no reasoning yet)

Step 2: Pre-commit falsification criteria (BEFORE reasoning)

Step 3: Position reasoning

Step 4: Position history in this conversation

Step 5: Steel-man + Reverse pressure test

Step 6: External anchor

Step 7: External Test Protocol Generation (NEW IN v2 — primary output)

Step 8: Audit log entry (if file system available)

Output format — two layers

Layer 1: TL;DR (always shown, 3-4 lines)

Layer 2: Full audit (always shown after TL;DR)

Verdicts

Failure modes — when the audit itself may have failed

Important constraints

When NOT to use this skill

Versioning

Similar Skills

Dürüst v2 — Honest Position Protocol with External Validation

What v2 changes from v1

The three walls (unchanged from v1)

The protocol — eight steps

Step 1: Position snapshot (no reasoning yet)

Step 2: Pre-commit falsification criteria (BEFORE reasoning)

Step 3: Position reasoning

Step 4: Position history in this conversation

Step 5: Steel-man + Reverse pressure test

Step 6: External anchor

Step 7: External Test Protocol Generation (NEW IN v2 — primary output)

Step 8: Audit log entry (if file system available)

Output format — two layers

Layer 1: TL;DR (always shown, 3-4 lines)

Layer 2: Full audit (always shown after TL;DR)

Verdicts

Failure modes — when the audit itself may have failed

Important constraints

When NOT to use this skill

Versioning

Similar Skills