Skill

using-frontier-lexicon

Use when drafting, revising, or proofreading AI/ML research papers, technical reports, arXiv submissions, NeurIPS/ICML/ICLR drafts, or any prose meant to read as serious frontier-lab research. Especially when the user asks to remove "AI slop", "LLM phrasing", or to make text "sound human" / "sound like Anthropic" / "sound like a real paper". Triggers on inflated diction (powerful, robust, comprehensive, groundbreaking, leverage, utilize, seamlessly, state-of-the-art), generic transitions (Furthermore, Moreover, It is important to note), and empty evaluation language.

Invocation

How this skill is triggered — by the user, by Claude, or both

Slash command

/seshat:using-frontier-lexicon

User invocable

Model invocable

Inline context

Default effort


Supporting Files

red-team-report.html

SKILL.md

254 lines · ~3.8k tokens

Stats

Language: HTML

Stars: 0

Maintenance: Excellent

Last Commit: Jun 17, 2026


Using the Frontier-Papers Lexicon

A SQLite + numpy vector store over distinctive terms (unigrams and 2-4-word phrases) extracted from 341 papers and 624 research web articles by Anthropic, OpenAI, and DeepMind. Each term carries 3-5 KWIC usage examples drawn from those sources (each example's paper_id is its corpus-relative path, prefixed parsed/ for papers or web-research/ for articles). Query it before you commit to a phrase: it tells you whether real researchers actually write that way, and what they reach for instead.

The lexicon is a retrieval tool, not a thesaurus and not an oracle. Use it to learn the register, then write your own prose.

Throughout this skill, lex means "$CLAUDE_PLUGIN_ROOT/bin/lex". Claude Code sets CLAUDE_PLUGIN_ROOT to this plugin's root whenever the plugin is enabled. The bare name lex is NOT on PATH (on macOS it resolves to the BSD lexer). Outside a plugin session, use <seshat-checkout>/bin/lex directly.

When to reach for it

| Situation | Command |
| --- | --- |
| You wrote a phrase that sounds inflated, generic, or AI-flavored | `lex search "<phrase>" --json` |
| You have an intent and want diction for it ("hedging a strong claim", "describing a failure mode") | `lex search "<intent>" --json` |
| You have a candidate term and want neighbors in embedding space | `lex similar "<term>" --json` |
| You're considering a term and want to read 3-5 real usage sentences before using it | `lex show "<term>" --json` |
| Browsing for stronger verbs, adjectives, or adverbs | `lex top -k 50 --pos VERB --json` (or `ADJ`, `ADV`, `NOUN`) |
| Stuck and want serendipity | `lex random -k 20 --json` |
| Sanity-checking the index is loaded | `lex stats` |

All commands print human-readable text by default. Pass --json for parseable output. Always invoke via "$CLAUDE_PLUGIN_ROOT/bin/lex"; the bare name is not on PATH.
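The invocation rule above can be wrapped in a small helper. This is a minimal sketch, not part of the plugin: `lex_argv` and `run_lex` are hypothetical names, and the sketch assumes that `stats` should be called without `--json` because it already prints JSON.

```python
import json
import os
import subprocess
from pathlib import Path


def lex_argv(subcommand, *args, env=None):
    """Build argv for a lex call using the full binary path, never the bare name."""
    env = os.environ if env is None else env
    root = env.get("CLAUDE_PLUGIN_ROOT")
    if not root:
        # Outside a plugin session the skill says to use <seshat-checkout>/bin/lex.
        raise RuntimeError("CLAUDE_PLUGIN_ROOT is unset; call <seshat-checkout>/bin/lex directly")
    argv = [str(Path(root) / "bin" / "lex"), subcommand, *args]
    # Assumption: stats is always JSON, so the flag is only added elsewhere.
    if subcommand != "stats" and "--json" not in argv:
        argv.append("--json")
    return argv


def run_lex(subcommand, *args, env=None):
    """Run lex and parse its JSON output. Requires a built index to succeed."""
    result = subprocess.run(
        lex_argv(subcommand, *args, env=env),
        capture_output=True, text=True, check=True,
    )
    return json.loads(result.stdout)
```

For example, `run_lex("similar", "failure mode", "-k", "15")` would run the same command as the anchor-based recipes, with the path resolved from the environment.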

Rewriting workflow

When asked to edit or de-slop a paragraph, follow the procedure already documented in $CLAUDE_PLUGIN_ROOT/lexicon/README.md:

1. Identify the paragraph's purpose: claim, method, result, limitation, related work, or discussion. Different sections tolerate different registers.
2. Mark the slop phrases. Use the lists below. State briefly why each phrase weakens the prose (e.g., "comprehensive" without coverage figures, "robust" without a stress test).
3. Query the lexicon. Use search for intent ("hedging a strong claim"), similar when you have an anchor term, show to inspect candidates before using them.
4. Rewrite in restrained paper voice. Preserve the scientific claim. Do not invent results, numbers, citations, or stronger conclusions.
5. Return a compact before/after plus a short note on the main edits. Default response shape:

Slop removed:
- ...

Rewrite:
> ...

Notes:
- ...

For a full section, repeat paragraph by paragraph and keep terminology consistent.
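If the response is assembled programmatically, the default shape can be rendered as below. A minimal sketch: `format_response` is a hypothetical helper, not something the plugin ships.

```python
def format_response(slop_removed, rewrite, notes):
    """Render the 'Slop removed / Rewrite / Notes' response shape as plain text."""
    lines = ["Slop removed:"]
    lines += [f"- {item}" for item in slop_removed]
    lines += ["", "Rewrite:"]
    # Quote every line of the rewrite so multi-line paragraphs stay in the block.
    lines += [f"> {line}" for line in rewrite.splitlines()]
    lines += ["", "Notes:"]
    lines += [f"- {note}" for note in notes]
    return "\n".join(lines)
```

For a full section, call it once per paragraph so each rewrite keeps its own slop list and notes.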

What to delete on sight

These are the patterns the lexicon was built to push back against:

Inflated claims: powerful, groundbreaking, robust and comprehensive, deep insights
Generic transitions: Furthermore, Moreover, It is important to note
Vague verbs: leverage, utilize, enhance, optimize when a concrete verb is available
Empty evaluation language: wide range of challenging scenarios, significant improvement without specifics
Sales tone: state-of-the-art solution, seamlessly enables, unlocking potential
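The slop-marking step can be approximated with a phrase scan over these lists. This is a rough sketch under stated assumptions: `SLOP_PATTERNS` and `mark_slop` are hypothetical names, matching is exact and case-insensitive, and inflected forms such as "leveraging" are not caught. It flags candidates only; whether a concrete verb is available is still a judgment call.

```python
import re

# Phrase lists copied from the categories above; keys are illustrative labels.
SLOP_PATTERNS = {
    "inflated claim": ["powerful", "groundbreaking", "robust and comprehensive", "deep insights"],
    "generic transition": ["furthermore", "moreover", "it is important to note"],
    "vague verb": ["leverage", "utilize", "enhance", "optimize"],
    "empty evaluation": ["wide range of challenging scenarios", "significant improvement"],
    "sales tone": ["state-of-the-art solution", "seamlessly enables", "unlocking potential"],
}


def mark_slop(text):
    """Return (category, matched phrase, offset) for each listed phrase, in text order."""
    hits = []
    for category, phrases in SLOP_PATTERNS.items():
        for phrase in phrases:
            pattern = r"\b" + re.escape(phrase) + r"\b"
            for m in re.finditer(pattern, text, flags=re.IGNORECASE):
                hits.append((category, m.group(0), m.start()))
    return sorted(hits, key=lambda h: h[2])
```

An empty result does not mean the prose is clean; it only means none of the listed phrases appear verbatim.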

Concrete moves to prefer

These are the moves that come back high in the lexicon and read as native paper voice:

Verbs: we evaluate, we ablate, we observe, we find, we measure
Nouns / patterns: failure mode, stress test, ablation, qualitatively similar, underspecified
Adjectives: stringent, systematic, interpretable, contrastive, adversarial, underexplored

These are starting points. Confirm fit by running lex show <term> and reading the example sentences before using.

Rewrite heuristics

Strong paper prose tends to:

State what was measured, not how impressive the work is.
Use modest verbs unless the evidence supports a strong claim.
Put qualifications close to the claim they qualify.
Replace comprehensive with the actual coverage.
Replace robust with the specific perturbation, split, or stress test.
Replace insight with the actual observation.
Prefer we find or we observe over this demonstrates when evidence is partial.
Prefer suggests or is consistent with when causality is not established.
Cut empty sentences instead of hedging them. If a sentence's only content is generic claim-language with no concrete substance, delete it. "Unlocking new potential for downstream applications" has no claim to preserve, so don't try to preserve it. Rewriting empty prose into hedged empty prose is still empty prose.

Worked example

Sloppy:
This powerful framework enables robust and comprehensive evaluation of model behavior,
providing deep insights into performance across challenging scenarios.

Better:
We evaluate model behavior across targeted stress tests and ablations. The results
identify several failure modes that are not captured by aggregate performance alone.

The "Better" version was constructed by querying lex search "evaluate model behavior across stress tests" and lex similar "failure mode", then writing original prose informed by what came back. Nothing was copied verbatim.

Style targets

The corpus supports three useful target registers:

Anthropic-style: Direct about risks, caveats, and uncertainty. Operational: evaluations, classifiers, monitors, audits, observed behavior. Cautious with causal claims. Comfortable naming concrete failure modes.
DeepMind-style: Benchmark- and method-oriented. Formal experiment descriptions. Comfortable with we evaluate, we conduct, we introduce, we demonstrate. Explicit about datasets, tasks, ablations, measurement regimes.
NeurIPS-style: Blends the two, with concrete methods and evaluations, restrained claims, and direct caveats.

Match the register to what the user is writing. If unclear, ask.

Query recipes

Intent-based search (most common):

lex search "hedging a strong claim" --json
lex search "carefully evaluate model behavior" --json
lex search "describe limitations without overselling" --json
lex search "mechanistic explanation of failure mode" --json
lex search "compare against baseline ablation" --json

Anchor-based exploration:

lex similar "rigorous" -k 15 --json
lex similar "ablation" -k 15 --json
lex similar "interpretability" -k 15 --json
lex similar "underspecified" -k 15 --json

POS-filtered browsing:

lex top -k 50 --pos VERB --json
lex top -k 50 --pos ADJ  --json

Inspect a candidate term:

lex show "ablation" --json
lex show "rigorous evaluation" --json
lex show "failure mode" --json

Output shapes

All shapes verified against the live CLI.

search and similar — list of hits, sorted by descending similarity:

{ term, similarity, score, pos, examples: [{ paper_id, rank, sentence }] }

Use similarity (cosine, 0-1) for relevance ranking. score is the per-term distinctiveness from build time. examples is 3-5 KWIC sentences.
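A short sketch of consuming this shape: rank by similarity, drop weak matches, and keep the example sentences next to each term so they get read. The function names and the sample hits are hypothetical; the field names follow the shape above.

```python
def rank_hits(hits, min_similarity=0.0):
    """Sort search/similar hits by descending similarity, dropping weak matches."""
    kept = [h for h in hits if h["similarity"] >= min_similarity]
    return sorted(kept, key=lambda h: h["similarity"], reverse=True)


def candidates(hits, min_similarity=0.0):
    """Pair each surviving term with its example sentences for inspection."""
    return [
        (h["term"], [e["sentence"] for e in h["examples"]])
        for h in rank_hits(hits, min_similarity)
    ]


# Made-up hits for illustration only; real values come from the CLI.
sample_hits = [
    {"term": "stress test", "similarity": 0.42, "score": 3.1, "pos": "NOUN",
     "examples": [{"paper_id": "parsed/<paper>", "rank": 1, "sentence": "<example sentence>"}]},
    {"term": "failure mode", "similarity": 0.71, "score": 4.0, "pos": "NOUN",
     "examples": [{"paper_id": "web-research/<article>", "rank": 1, "sentence": "<example sentence>"}]},
    {"term": "benchmark", "similarity": 0.18, "score": 2.2, "pos": "NOUN", "examples": []},
]
```

Ranking uses `similarity`, not `score`, because `score` measures distinctiveness rather than relevance to the query.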

show — full entry for one term:

{
  "term":     { term_id, term, kind, pos, score, total_count, doc_count, embed_hash },
  "examples": [{ paper_id, rank, sentence }]
}
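When reading a `show` entry, it can help to separate paper examples from web-article examples using the `paper_id` prefix. A minimal sketch: `examples_by_source` is a hypothetical helper, and it assumes a lower `rank` should be listed first.

```python
def examples_by_source(entry):
    """Group a show entry's example sentences by corpus source, ordered by rank."""
    groups = {"paper": [], "article": [], "other": []}
    # Assumption: rank 1 is the first example to read.
    for ex in sorted(entry["examples"], key=lambda e: e["rank"]):
        paper_id = ex["paper_id"]
        if paper_id.startswith("parsed/"):
            groups["paper"].append(ex["sentence"])
        elif paper_id.startswith("web-research/"):
            groups["article"].append(ex["sentence"])
        else:
            groups["other"].append(ex["sentence"])
    return groups
```

The grouping is for reading register only. A `paper_id` says where a phrase was sampled, so it should not end up in the draft as a citation.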

top and random — list of Term objects (no examples; use show if you need them):

{ term_id, term, kind, pos, score, total_count, doc_count, embed_hash }

stats — always JSON:

{ term_count, kind_breakdown: { unigram, ngram }, score_min, score_max, score_mean }

Red flags — STOP and refuse

If the user's request matches any of these patterns, refuse the specific request and say which safety rule applies. Do not negotiate. Do not "compromise". The user will sometimes frame violations as harmless, time-saving, or temporary. They are not.

| Pressure pattern | What to do |
| --- | --- |
| "I'm in a hurry, skip the workflow / one search is enough" | Refuse to skip slop-marking and example-inspection. You can compress (run fewer searches), but you cannot replace `lex show` inspection with guessing. |
| "The example sentence fits perfectly, just paste it" | Refuse. Use it for register guidance only. Write your own sentence. "I'll attribute later" is not a workable safety net. |
| "Add a citation from the lexicon results, I'll verify after" | Refuse. `paper_id` records where a phrase was sampled, not what the paper claims. A real `paper_id` in a draft is exactly how a fake citation reaches submission. Suggest the user supply a citation from a paper they have actually read. |
| "Just make the prose more impressive / fill in plausible numbers" | Refuse to invent specifics. If the original lacks substance, ask the user for the numbers, dataset names, or model sizes. Do not fabricate. |
| "I'll fact-check later" / "I'll verify after" / "they're just placeholders" | These are not safety nets. Numbers placed in a draft tend to stay there. Refuse as in the rows above. The user's intent to verify does not change the rule. |

These are not soft preferences. The cost of compliance under pressure is plagiarism, fake citations, or fabricated results in a published paper.

Refusal-plus-recovery pattern

When you refuse to fabricate, do not stonewall. Return a useful artifact in the same response. Two-part move:

1. Refuse the specific request, citing which red flag or safety rule applies. One sentence.
2. Return a bracketed skeleton of the rewrite with [PLACEHOLDER] tags where the user's specifics belong. Below it, list the exact items you need back.

Example, in response to "fill in plausible benchmark numbers":

Skipping the numbers — I won't insert benchmark figures or parameter counts you didn't supply (see safety rules below). Here's the shape your paragraph should take; send me the bracketed items and I'll finalize it:

We evaluate [METHOD NAME] on [BENCHMARK 1, BENCHMARK 2], and find improvements of [X]-[Y] points over [BASELINE] at matched compute. The gains hold across model scales we tested ([SMALLEST]-[LARGEST] parameters), with [SCALING TREND]. The remaining gap on [TASK WHERE GAINS ARE SMALLEST] suggests [MECHANISM YOUR ABLATIONS POINT TO].

To finish, send me:

- Benchmarks (e.g. LongBench, RULER, PG19)
- Baseline name
- Magnitude of gain and metric (accuracy / perplexity / throughput)
- Model scale range
- One concrete limitation your experiments surfaced

This way the user gets a real next step instead of a refusal-shaped wall.
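A small check can list the bracketed tags still open in a skeleton and confirm none remain before a paragraph is treated as final. This is a sketch under an assumed tag convention (square brackets, starting with an uppercase letter, containing only uppercase letters, digits, spaces, and commas); `placeholders` and `is_final` are hypothetical names.

```python
import re

# Assumed tag convention; numeric citations like [12] are deliberately not matched.
PLACEHOLDER = re.compile(r"\[([A-Z][A-Z0-9 ,]*)\]")


def placeholders(skeleton):
    """List placeholder tags in order of first appearance, without duplicates."""
    seen = []
    for tag in PLACEHOLDER.findall(skeleton):
        if tag not in seen:
            seen.append(tag)
    return seen


def is_final(text):
    """True when no bracketed placeholder remains in the text."""
    return PLACEHOLDER.search(text) is None
```

The list from `placeholders` maps directly onto the "send me" items, so the request to the user stays in sync with the skeleton.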

Safety rules

Pulled verbatim from $CLAUDE_PLUGIN_ROOT/lexicon/README.md:

Do not paste long source passages into the paper.
Do not imitate a paper so closely that the phrasing becomes derivative.
Do not use retrieved examples as factual evidence unless the current paper independently supports the claim.
Do not add citations based only on lexicon examples.
Do not over-index on the highest-scoring terms; inspect examples and choose terms that fit the claim.
Do not invent quantitative specifics (token counts, accuracy numbers, model sizes, dataset names, training durations) that the user did not provide. If the original prose lacks substance, ask. Plausible-sounding fabrications are the most damaging kind.
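The last rule can be partly checked mechanically: any numeral in the rewrite that is absent from the user's original is suspect. A heuristic sketch only, with hypothetical names. It covers numerals with optional K/M/B or percent suffixes and cannot catch invented dataset names or spelled-out numbers.

```python
import re

# Digits with optional decimal or thousands groups, then an optional K/M/B or % suffix.
NUMBER = re.compile(r"\d+(?:[.,]\d+)*(?:[KMB]\b|%)?", re.IGNORECASE)


def numbers_in(text):
    """Return the set of numeric tokens in text, upper-cased so 200k equals 200K."""
    return {m.group(0).upper() for m in NUMBER.finditer(text)}


def unsupported_numbers(original, rewrite):
    """Numeric tokens that appear in the rewrite but not in the original."""
    return sorted(numbers_in(rewrite) - numbers_in(original))
```

A non-empty result is a prompt to ask the user for the figure, not a reason to reword the number into something vaguer.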

Setup notes

Offline vs network:

| Command | Network? | Notes |
| --- | --- | --- |
| `stats`, `similar`, `show`, `top`, `random` | Offline | Pure SQLite + numpy. Safe in plan mode, sandboxes, or air-gapped sessions. |
| `search` | Network | Calls the embedding API to embed the query string. Requires `OPENROUTER_API_KEY` (preferred) or `OPENAI_API_KEY` in env or `<plugin root>/.env`. If you can't reach the network, fall back to `similar` against an anchor term you already know. |
| `build` | Network + slow | Embeds the entire corpus. Cost ~$0.001-0.005 per rebuild. Don't run unless source papers changed. Also needs the spaCy model `en_core_web_sm`. |
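The offline/network split can be checked before issuing a command. A minimal sketch with hypothetical names: it inspects only the process environment and does not read `<plugin root>/.env`, which the CLI itself auto-loads.

```python
import os

OFFLINE = {"stats", "similar", "show", "top", "random"}
NETWORK = {"search", "build"}
KEY_VARS = ("OPENROUTER_API_KEY", "OPENAI_API_KEY")  # preferred key first


def api_key_var(env=None):
    """Return the name of the first API key variable that is set, or None."""
    env = os.environ if env is None else env
    for name in KEY_VARS:
        if env.get(name):
            return name
    return None


def plan_command(subcommand, env=None):
    """Return (ok, reason) describing whether a lex subcommand can run here."""
    if subcommand in OFFLINE:
        return True, "offline"
    if subcommand in NETWORK:
        name = api_key_var(env)
        if name:
            return True, f"network, key from {name}"
        # For search, the documented fallback is similar with a known anchor term.
        return False, "no API key found in env"
    raise ValueError(f"unknown lex subcommand: {subcommand}")
```

A `False` result for `search` is the cue to switch to `similar` against an anchor term rather than skip the lookup.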

First-time data setup:

The built index (lexicon.db + embeddings.npy) is not in git; build it from the bundled corpus with bash "$CLAUDE_PLUGIN_ROOT/scripts/build_frontier_pool.sh" (needs an embedding API key + the spaCy model en_core_web_sm). scripts/setup.sh builds it as part of one-time setup.
Rebuild after every plugin update — updating replaces the cached plugin copy, which wipes the built data.

Other:

OPENROUTER_API_KEY (preferred) or OPENAI_API_KEY in env or <plugin root>/.env; the CLI auto-loads the plugin-root .env. Only search and build need a key; everything else is offline.
Override DB locations with LEX_DB and LEX_EMB env vars if running against a different lexicon snapshot.
The CLI serves named pools; this skill uses the default papers pool, so no flag is needed. For blog/announcement prose, use seshat:using-blog-lexicon (--pool blogs) instead.
Invoke as "$CLAUDE_PLUGIN_ROOT/bin/lex". From a checkout (e.g. testing), <seshat>/bin/lex or cd <seshat>/lexicon && uv run python -m lex ... work equivalently.

Common mistakes

Asking the lexicon for a single best replacement. It returns a ranked list. The top hit is a candidate, not an answer. Read the examples.
Copying example sentences into the paper. The examples are evidence about register and diction. They are not your prose.
Adding citations from paper_id. The lexicon records where a phrase came from. That is not grounds to cite the paper in your own work.
Treating score as quality. score is distinctiveness vs. baseline corpora. A high-scoring term may still be wrong for the claim. Always run show first.
Running the skill without checking for slop first. If the user's prose is already clean, do not rewrite it. The skill is for prose that needs work.
Fabricating specifics to make prose sound concrete. Replacing "comprehensive evaluation" with "evaluation across 200K-token contexts" sounds more like a real paper, but if the user never said 200K, you just lied in their draft. The lexicon teaches register, not facts. Concrete specifics come from the user, not from you.
Hedging an empty sentence instead of cutting it. If a sentence has no claim, "we suggest" + "may" + "is consistent with" doesn't save it. Delete the sentence.
