From isagentready
Fixes AI content discovery issues — creates and optimizes robots.txt, AI crawler directives, XML sitemaps, llms.txt, meta robots tags, and content freshness signals so AI systems can find, crawl, and understand website content. Use when asked to "fix robots.txt", "add llms.txt", "create a sitemap", "allow AI crawlers", "fix AI discoverability", "improve AI content discovery score", "make site crawlable by AI", "add dateModified", "fix content freshness", or any robots.txt, sitemap, or llms.txt task.
Slash command: /isagentready:ai-content-discovery
Fixes Category 1 (AI Content Discovery, 30% weight) issues from [IsAgentReady.com](https://isagentready.com). This category checks whether AI systems can find, crawl, and understand your website's content. It evaluates 7 checkpoints worth 100 points total.
Other categories are covered by separate skills: structured-data, content-semantics, agent-protocols, and security-trust.

| ID | Checkpoint | Max Points | What It Tests |
|---|---|---|---|
| 1.8 | HTTP bot accessibility | 15 | Page returns HTTP 200-299 (not 401/403 from WAF) |
| 1.1 | robots.txt present | 15 | /robots.txt returns 200 with text/plain Content-Type |
| 1.2 | AI crawler directives | 15 | Allow/Disallow rules for 13 AI user-agents in robots.txt |
| 1.3 | XML Sitemap | 15 | Valid XML sitemap with <urlset> or <sitemapindex> |
| 1.4 | llms.txt | 15 | /llms.txt with markdown heading + URLs; bonus for /llms-full.txt |
| 1.5 | Meta robots / X-Robots-Tag | 15 | No restrictive directives (noindex, noai, noimageai) |
| 1.6 | Content freshness signals | 10 | dateModified in JSON-LD, article:modified_time, or Last-Modified |
Checkpoint 1.8: HTTP bot accessibility
What passes: HTTP status 200-299.
What fails: HTTP 401 or 403 (WAF/CDN blocking bots).
Diagnose — test with an AI crawler user-agent:
curl -sI -A "Mozilla/5.0 (compatible; GPTBot/1.0)" https://example.com/
curl -sI -A "Mozilla/5.0 (compatible; ClaudeBot/1.0)" https://example.com/
If blocked by Cloudflare — create a WAF exception:
# Dashboard -> Security -> WAF -> Custom Rules -> Create rule:
# Field: User Agent | Operator: contains | Value: GPTBot
# Action: Skip remaining rules
#
# Repeat for ClaudeBot, Amazonbot, ChatGPT-User, etc.
If blocked by Nginx rate limiting — allow AI user-agents:
map $http_user_agent $is_ai_bot {
    default 0;
    "~*GPTBot" 1;
    "~*ClaudeBot" 1;
    "~*Amazonbot" 1;
    "~*ChatGPT" 1;
}
# Give AI bots an empty rate-limit key: nginx does not count requests whose key is empty
map $is_ai_bot $limit_key {
    0 $binary_remote_addr;
    1 "";
}
limit_req_zone $limit_key zone=general:10m rate=10r/s;
server {
    location / {
        # Applies to everyone except the user-agents matched above
        limit_req zone=general burst=20;
    }
}
If blocked by Apache — allow in .htaccess:
RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} (GPTBot|ClaudeBot|Amazonbot) [NC]
RewriteRule ^ - [L]
Verify — re-test with curl to confirm 200 response.
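The curl checks above can also be scripted. The following is a minimal sketch, not the scanner's own logic: the helper names are hypothetical, the User-Agent strings are copied from the curl commands above, and the pass/fail mapping follows the thresholds stated for this checkpoint. Note that `urllib` follows redirects, whereas `curl -sI` reports the first response.

```python
import urllib.error
import urllib.request

# User-Agent strings taken from the curl commands above; names are illustrative.
AI_USER_AGENTS = {
    "GPTBot": "Mozilla/5.0 (compatible; GPTBot/1.0)",
    "ClaudeBot": "Mozilla/5.0 (compatible; ClaudeBot/1.0)",
}


def classify_status(status: int) -> str:
    """Map an HTTP status code to the wording used for this checkpoint."""
    if 200 <= status <= 299:
        return "pass"
    if status in (401, 403):
        return "blocked"
    return "other"


def fetch_status(url: str, user_agent: str, timeout: float = 10.0) -> int:
    """Send a HEAD request with the given User-Agent and return the status code."""
    request = urllib.request.Request(
        url, method="HEAD", headers={"User-Agent": user_agent}
    )
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as error:
        # urllib raises on 4xx/5xx responses; the status code is still available.
        return error.code


def check_bot_access(url: str) -> dict:
    """Return a verdict per AI user-agent, e.g. {"GPTBot": "pass", ...}."""
    return {
        name: classify_status(fetch_status(url, user_agent))
        for name, user_agent in AI_USER_AGENTS.items()
    }
```

Calling `check_bot_access("https://example.com/")` before and after a WAF change gives a quick comparison; any `"blocked"` entry means that user-agent still receives 401 or 403.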
Checkpoint 1.1: robots.txt present
What passes: /robots.txt returns HTTP 200 with Content-Type: text/plain.
What fails: Missing file (404), HTML error page served, or wrong Content-Type.
Check current state:
curl -sI https://example.com/robots.txt | head -20
Create /robots.txt at your web root:
User-agent: *
Allow: /
User-agent: GPTBot
Allow: /
User-agent: ClaudeBot
Allow: /
User-agent: Google-Extended
Allow: /
Sitemap: https://example.com/sitemap.xml
Ensure correct Content-Type — must return text/plain. Nginx: default_type text/plain; in the location block. Apache: ForceType text/plain in a <Files> directive.
Verify:
curl -sI https://example.com/robots.txt | grep -i content-type
# Expected: Content-Type: text/plain
See references/robots-txt-guide.md for complete robots.txt syntax and rules.
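Servers often append a charset to the header (for example `text/plain; charset=utf-8`), so a plain string comparison can give a false failure. The sketch below shows one way to evaluate the two conditions stated for this checkpoint. It assumes the media type is what matters and that parameters such as charset are ignored; the scanner's exact behavior is not documented here, and the function name is hypothetical.

```python
from typing import Optional


def robots_txt_passes(status: int, content_type: Optional[str]) -> bool:
    """Return True when the response matches the pass criteria for /robots.txt.

    Assumption: only the media type is compared; parameters such as
    "charset=utf-8" are ignored.
    """
    if status != 200 or not content_type:
        return False
    media_type = content_type.split(";", 1)[0].strip().lower()
    return media_type == "text/plain"
```

An HTML error page served with status 200 is rejected here because its media type is `text/html`, which matches the "HTML error page served" failure case above.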
Checkpoint 1.2: AI crawler directives
What passes: All 13 AI crawlers explicitly allowed (15 pts), or some allowed with none blocked (15 pts), or wildcard Allow: / with none blocked (15 pts).
Partial credit: No AI crawlers mentioned but default allow applies (10 pts), or mixed policies with some blocked (7 pts).
What fails: All AI crawlers explicitly disallowed (0 pts).
| User-Agent | Owner | Purpose |
|---|---|---|
| GPTBot | OpenAI | Training data crawling |
| ChatGPT-User | OpenAI | Real-time browsing in ChatGPT |
| OAI-SearchBot | OpenAI | SearchGPT results |
| ClaudeBot | Anthropic | Training data crawling |
| Claude-User | Anthropic | Real-time browsing in Claude |
| Claude-SearchBot | Anthropic | Claude search results |
| Google-Extended | Google | Gemini AI training |
| Amazonbot | Amazon | Alexa/AI training |
| Bytespider | ByteDance | TikTok/AI training |
| CCBot | Common Crawl | Open dataset crawling |
| PerplexityBot | Perplexity | AI search results |
| Applebot-Extended | Apple | Apple Intelligence training |
| meta-externalagent | Meta | Meta AI training |
Check current directives:
curl -s https://example.com/robots.txt
Add explicit Allow directives for each AI crawler to your robots.txt:
# AI Crawlers — explicitly allow (one block per agent)
User-agent: GPTBot
Allow: /
User-agent: ChatGPT-User
Allow: /
User-agent: OAI-SearchBot
Allow: /
User-agent: ClaudeBot
Allow: /
User-agent: Claude-User
Allow: /
User-agent: Claude-SearchBot
Allow: /
User-agent: Google-Extended
Allow: /
User-agent: Amazonbot
Allow: /
User-agent: Bytespider
Allow: /
User-agent: CCBot
Allow: /
User-agent: PerplexityBot
Allow: /
User-agent: Applebot-Extended
Allow: /
User-agent: meta-externalagent
Allow: /
If you want to allow all crawlers — a simple wildcard also works:
User-agent: *
Allow: /
If you want selective control — allow some, block others:
# Allow search-oriented AI crawlers
User-agent: ChatGPT-User
Allow: /
User-agent: PerplexityBot
Allow: /
# Block training-oriented crawlers
User-agent: GPTBot
Disallow: /
User-agent: CCBot
Disallow: /
See references/robots-txt-guide.md for full syntax and AI user-agent details.
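To see which of the 13 user-agents a given robots.txt blocks, the standard-library parser can evaluate the rules offline. This is a sketch under stated assumptions: it only tests the root path `/`, the function names are hypothetical, and the three-way verdict is a simplification of the scoring tiers above rather than the scanner's actual algorithm.

```python
from urllib.robotparser import RobotFileParser

# The 13 AI user-agents listed in the table above.
AI_CRAWLERS = [
    "GPTBot", "ChatGPT-User", "OAI-SearchBot",
    "ClaudeBot", "Claude-User", "Claude-SearchBot",
    "Google-Extended", "Amazonbot", "Bytespider", "CCBot",
    "PerplexityBot", "Applebot-Extended", "meta-externalagent",
]


def blocked_ai_crawlers(robots_txt: str) -> list:
    """Return the AI user-agents that may not fetch "/" under these rules."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [agent for agent in AI_CRAWLERS if not parser.can_fetch(agent, "/")]


def classify_policy(robots_txt: str) -> str:
    """Summarize the policy as "none blocked", "mixed", or "all blocked"."""
    blocked = blocked_ai_crawlers(robots_txt)
    if not blocked:
        return "none blocked"
    if len(blocked) == len(AI_CRAWLERS):
        return "all blocked"
    return "mixed"
```

Running `classify_policy` on the selective-control example above yields `"mixed"`, which corresponds to the partial-credit tier. A `Disallow` on a subpath only is not detected by this sketch, since it checks `/` alone.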
Checkpoint 1.3: XML Sitemap
What passes: Valid XML sitemap found at a discoverable URL with <urlset> or <sitemapindex>.
What fails: No sitemap found, or sitemap is not valid XML.
The scanner checks these locations in order:
1. Sitemap: directives in robots.txt
2. /sitemap.xml
3. /sitemap_index.xml

Check if a sitemap exists:
curl -sI https://example.com/sitemap.xml | head -5
curl -s https://example.com/robots.txt | grep -i sitemap
Create /sitemap.xml:
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
<url>
<loc>https://example.com/</loc>
<lastmod>2025-01-15</lastmod>
<changefreq>weekly</changefreq>
<priority>1.0</priority>
</url>
<url>
<loc>https://example.com/about</loc>
<lastmod>2025-01-10</lastmod>
</url>
</urlset>
For large sites, use a sitemap index:
<?xml version="1.0" encoding="UTF-8"?>
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
<sitemap>
<loc>https://example.com/sitemap-pages.xml</loc>
<lastmod>2025-01-15</lastmod>
</sitemap>
<sitemap>
<loc>https://example.com/sitemap-blog.xml</loc>
<lastmod>2025-01-14</lastmod>
</sitemap>
</sitemapindex>
Add the Sitemap directive to robots.txt:
Sitemap: https://example.com/sitemap.xml
Verify the sitemap is valid XML:
curl -s https://example.com/sitemap.xml | head -5
# Should start with <?xml and contain <urlset or <sitemapindex
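Looking at the first five lines only confirms the file starts like XML. A stricter check is to parse the whole document and inspect the root element. The sketch below uses Python's standard XML parser; the function names are hypothetical, and it assumes the sitemap uses the sitemaps.org 0.9 namespace shown in the examples above.

```python
import xml.etree.ElementTree as ET
from typing import Optional

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def sitemap_kind(xml_text: str) -> Optional[str]:
    """Return "urlset" or "sitemapindex" for a well-formed sitemap, else None."""
    try:
        root = ET.fromstring(xml_text.encode("utf-8"))
    except ET.ParseError:
        return None  # not well-formed XML
    local_name = root.tag.rsplit("}", 1)[-1]  # strip the "{namespace}" prefix
    return local_name if local_name in ("urlset", "sitemapindex") else None


def sitemap_locs(xml_text: str) -> list:
    """Return every <loc> value in document order (empty list if unparsable)."""
    try:
        root = ET.fromstring(xml_text.encode("utf-8"))
    except ET.ParseError:
        return []
    return [
        (element.text or "").strip()
        for element in root.iter("{%s}loc" % SITEMAP_NS)
    ]
```

Feeding it the output of `curl -s https://example.com/sitemap.xml` shows whether the file would count as a `<urlset>` or a `<sitemapindex>`, and `sitemap_locs` gives the URLs to spot-check.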
Most frameworks have sitemap plugins; prefer automated generation over manual files:
- WordPress: built-in sitemap (/wp-sitemap.xml)
- Next.js: next-sitemap package or App Router sitemap.ts
- Rails: sitemap_generator gem
- Django: django.contrib.sitemaps
- Laravel: spatie/laravel-sitemap

Checkpoint 1.4: llms.txt
What passes: /llms.txt returns HTTP 200 with text/plain or text/markdown, starts with a # heading, and contains at least one URL. Bonus: /llms-full.txt companion found.
What fails: Missing file, wrong content type, no heading, or no URLs.
Create /llms.txt — a markdown-formatted overview of your site for LLMs:
# Your Company Name
> Brief description of what your company does.
## Docs
- [Getting Started](https://example.com/docs/getting-started)
- [API Reference](https://example.com/docs/api)
- [Tutorials](https://example.com/docs/tutorials)
## Products
- [Product Overview](https://example.com/products)
- [Pricing](https://example.com/pricing)
## Optional
- [Blog](https://example.com/blog)
- [Changelog](https://example.com/changelog)
- [Status Page](https://status.example.com)
Optionally create /llms-full.txt — expanded version with more detail:
# Your Company Name
> Detailed description of your company, products, and services.
## Getting Started
Full getting started content here, not just a link.
Include setup instructions, prerequisites, etc.
## API Reference
Inline API documentation or detailed summaries of endpoints.
Ensure correct Content-Type — must be text/plain or text/markdown. Same server config as robots.txt (see checkpoint 1.1).
Verify:
curl -sI https://example.com/llms.txt | grep -i content-type
curl -s https://example.com/llms.txt | head -5
# First line must start with #
# Must contain at least one https:// URL
See references/llms-txt-guide.md for the full specification and examples for different site types.
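The two content rules above (heading on the first line, at least one URL) are easy to check before deploying the file. This is a minimal sketch with a hypothetical function name; it does not check the HTTP status or Content-Type, and the URL pattern is a simple approximation rather than a full URL grammar.

```python
import re

# Rough URL matcher: "http://" or "https://" followed by non-space characters.
URL_PATTERN = re.compile(r"https?://\S+")


def llms_txt_problems(text: str) -> list:
    """Return a list of problems found; an empty list means both rules pass."""
    problems = []
    lines = text.splitlines()
    if not lines or not lines[0].startswith("#"):
        problems.append("first line does not start with #")
    if not URL_PATTERN.search(text):
        problems.append("no URL found")
    return problems
```

For example, `llms_txt_problems(open("llms.txt", encoding="utf-8").read())` returns `[]` for the sample file above, and a non-empty list names the rule to fix.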
Checkpoint 1.5: Meta robots / X-Robots-Tag
What passes: No restrictive directives found (15 pts).
Partial credit: Restrictive directives other than noindex found (8 pts) — e.g., nofollow, nosnippet, noai, noimageai.
What fails: noindex directive found (0 pts).
The scanner checks both:
- <meta name="robots" content="..."> in HTML
- X-Robots-Tag HTTP response header

| Directive | Effect |
|---|---|
| noindex | Prevents indexing entirely (worst for AI) |
| nofollow | Prevents following links on the page |
| nosnippet | Prevents showing snippets in search results |
| noai | Signals no AI usage (some crawlers respect) |
| noimageai | Signals no AI usage of images |
Check current meta robots:
curl -s https://example.com/ | grep -i 'name="robots"'
curl -sI https://example.com/ | grep -i x-robots-tag
Remove or replace restrictive tags in your HTML <head>:
<!-- WRONG: blocks AI indexing -->
<meta name="robots" content="noindex, nofollow">
<!-- CORRECT: allows AI indexing -->
<meta name="robots" content="index, follow">
<!-- ALSO CORRECT: omit entirely (default is index, follow) -->
Remove restrictive X-Robots-Tag headers:
Nginx:
# Remove if present:
# add_header X-Robots-Tag "noindex";
# Replace with (or remove entirely):
add_header X-Robots-Tag "index, follow";
For specific AI directives — if you have noai or noimageai and want to allow AI:
<!-- Remove noai/noimageai to allow AI systems -->
<meta name="robots" content="index, follow">
Verify:
curl -s https://example.com/ | grep -i 'name="robots"'
# Should show: content="index, follow" or no meta robots tag at all
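The scoring tiers for this checkpoint (15, 8, or 0 points) can be approximated by combining both sources of directives. The sketch below is an illustration only: the names are hypothetical, it treats the header value as a plain comma-separated list (bot-specific prefixes are not handled), and the point values are the ones stated above, not taken from the scanner's code.

```python
from html.parser import HTMLParser
from typing import Optional

# Directives this section lists as restrictive.
RESTRICTIVE = {"noindex", "nofollow", "nosnippet", "noai", "noimageai"}


def split_directives(value: str) -> set:
    """Split "noindex, nofollow" into {"noindex", "nofollow"}."""
    return {part.strip().lower() for part in value.split(",") if part.strip()}


class RobotsMetaParser(HTMLParser):
    """Collect directives from every <meta name="robots"> tag."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = {name: (value or "") for name, value in attrs}
        if attributes.get("name", "").lower() == "robots":
            self.directives |= split_directives(attributes.get("content", ""))


def score_robots_directives(html: str, x_robots_tag: Optional[str] = None) -> int:
    """Return 0 for noindex, 8 for other restrictive directives, else 15."""
    parser = RobotsMetaParser()
    parser.feed(html)
    directives = set(parser.directives)
    if x_robots_tag:
        directives |= split_directives(x_robots_tag)
    if "noindex" in directives:
        return 0
    if directives & RESTRICTIVE:
        return 8
    return 15
```

Passing the page body and the `X-Robots-Tag` header value together matters, because a clean meta tag does not help if the server still sends a restrictive header.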
Checkpoint 1.6: Content freshness signals
What passes: dateModified in JSON-LD or article:modified_time meta tag (10 pts).
Partial: Only datePublished, article:published_time, or <time datetime> (7 pts). Only Last-Modified header (5 pts).
What fails: No freshness signals detected (0 pts).
Why it matters: ChatGPT shows 3.2x preference for content with fresh date signals. AI systems use dates to prioritize recent, authoritative content.
Check current signals:
curl -sI https://example.com/ | grep -i last-modified
curl -s https://example.com/ | grep -iE 'dateModified|article:modified_time'
Add dateModified and datePublished to JSON-LD (best — 10 pts):
{
"@context": "https://schema.org",
"@type": "Article",
"datePublished": "2024-01-15T09:00:00Z",
"dateModified": "2024-03-01T14:30:00Z"
}
Add Open Graph meta tags (also 10 pts):
<meta property="article:published_time" content="2024-01-15T09:00:00Z">
<meta property="article:modified_time" content="2024-03-01T14:30:00Z">
Last-Modified HTTP header scores only 5 pts — use JSON-LD or meta tags for full credit.
<time datetime> element scores 7 pts as a fallback:
<time datetime="2024-03-01T14:30:00Z">March 1, 2024</time>
Verify:
curl -s https://example.com/ | grep -iE 'dateModified|article:modified_time'
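The point tiers for this checkpoint can be estimated with a few pattern checks, extending the grep above to the partial-credit signals. This is a rough sketch: the function name is hypothetical, the regular expressions only look for the property names (they do not validate the dates or parse the JSON-LD), and the point values are the ones stated in this section.

```python
import re


def score_freshness(html: str, headers: dict) -> int:
    """Estimate the freshness score from page HTML and response headers."""
    header_names = {name.lower() for name in headers}

    # Full credit: a modification date in JSON-LD or Open Graph.
    if re.search(r'"dateModified"\s*:', html) or re.search(
        r"""property=["']article:modified_time["']""", html, re.IGNORECASE
    ):
        return 10

    # Partial credit: only a publication date or a <time datetime> element.
    if (
        re.search(r'"datePublished"\s*:', html)
        or re.search(r"""property=["']article:published_time["']""", html, re.IGNORECASE)
        or re.search(r"<time\b[^>]*\bdatetime=", html, re.IGNORECASE)
    ):
        return 7

    # Lowest credit: only the Last-Modified HTTP header.
    if "last-modified" in header_names:
        return 5
    return 0
```

A page that returns 7 or 5 here usually needs only a `dateModified` property added to its existing JSON-LD to reach the full 10 points.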
Common gotchas:
- robots.txt or llms.txt served as text/html instead of text/plain
- User-agent: * / Disallow: / blocks all AI crawlers
- No sitemap at /sitemap.xml
- noai costs 7 points, noindex costs all 15
- Use dateModified in JSON-LD for full 10 points

See references/gotchas.md for detailed correct vs incorrect examples of each.
If $ARGUMENTS is provided, interpret it as the URL to fix or the specific checkpoint to address.
npx claudepluginhub bartwaardenburg/isagentready-skills --plugin isagentready