
ai-content-discovery

npx machina-cli add skill BartWaardenburg/isagentready-skills/ai-content-discovery --openclaw

AI Content Discovery

Fixes Category 1 (AI Content Discovery, 30% weight) issues from IsAgentReady.com. This category checks whether AI systems can find, crawl, and understand your website's content. It evaluates 7 checkpoints worth 100 points total.

When to Use

  • Fixing robots.txt issues (missing, wrong content type, blocking bots)
  • Adding or updating AI crawler directives (GPTBot, ClaudeBot, etc.)
  • Creating or fixing XML sitemaps
  • Creating llms.txt or llms-full.txt files
  • Removing restrictive meta robots tags (noindex, noai)
  • Fixing WAF/CDN bot blocking issues
  • Adding content freshness signals (dateModified, article:modified_time)
  • Any task to "improve AI discoverability" or "make site crawlable by AI"

When NOT to Use

  • Adding structured data / JSON-LD (use structured-data skill)
  • Fixing semantic HTML or heading hierarchy (use content-semantics skill)
  • Setting up agent protocols like WebMCP or A2A (use agent-protocols skill)
  • Configuring security headers like CSP or HSTS (use security-trust skill)

Checkpoints Overview

ID    Checkpoint                    Max Points  What It Tests
1.8   HTTP bot accessibility        15          Page returns HTTP 200-299 (not 401/403 from WAF)
1.1   robots.txt present            15          /robots.txt returns 200 with text/plain Content-Type
1.2   AI crawler directives         15          Allow/Disallow rules for 13 AI user-agents in robots.txt
1.3   XML Sitemap                   15          Valid XML sitemap with <urlset> or <sitemapindex>
1.4   llms.txt                      15          /llms.txt with markdown heading + URLs; bonus for /llms-full.txt
1.5   Meta robots / X-Robots-Tag    15          No restrictive directives (noindex, noai, noimageai)
1.6   Content freshness signals     10          dateModified in JSON-LD, article:modified_time, or Last-Modified

Checkpoint 1.8: HTTP Bot Accessibility (15 pts)

What passes: HTTP status 200-299. What fails: HTTP 401 or 403 (WAF/CDN blocking bots).

Fix Workflow

  1. Diagnose — test with an AI crawler user-agent:

    curl -sI -A "Mozilla/5.0 (compatible; GPTBot/1.0)" https://example.com/
    curl -sI -A "Mozilla/5.0 (compatible; ClaudeBot/1.0)" https://example.com/
    
  2. If blocked by Cloudflare — create a WAF exception:

    # Dashboard -> Security -> WAF -> Custom Rules -> Create rule:
    #   Field: User Agent | Operator: contains | Value: GPTBot
    #   Action: Skip remaining rules
    #
    # Repeat for ClaudeBot, Amazonbot, ChatGPT-User, etc.
    
  3. If blocked by Nginx rate limiting — allow AI user-agents:

    # nginx skips limit_req entirely when the zone key is empty, so map AI
    # user-agents to an empty key and everything else to the client IP.
    map $http_user_agent $limit_key {
      default         $binary_remote_addr;
      "~*GPTBot"      "";
      "~*ClaudeBot"   "";
      "~*Amazonbot"   "";
      "~*ChatGPT"     "";
    }
    
    limit_req_zone $limit_key zone=general:10m rate=10r/s;
    
    server {
      location / {
        limit_req zone=general burst=20;
      }
    }
    
  4. If blocked by Apache — allow in .htaccess:

    # Place these lines before any bot-blocking rules; [L] stops further
    # rewrite processing for matching AI user-agents.
    RewriteEngine On
    RewriteCond %{HTTP_USER_AGENT} (GPTBot|ClaudeBot|Amazonbot) [NC]
    RewriteRule ^ - [L]
    
  5. Verify — re-test with curl to confirm 200 response.
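
To spot-check several AI user-agents in one pass, a small shell loop works (a sketch; adjust the user-agent list and URL for the site being fixed):

    for ua in GPTBot ClaudeBot Amazonbot PerplexityBot CCBot; do
      code=$(curl -s -o /dev/null -w '%{http_code}' \
        -A "Mozilla/5.0 (compatible; $ua/1.0)" https://example.com/)
      echo "$ua -> HTTP $code"
    done
    # Every line should report a 2xx status; 401/403 means that bot is still blocked.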


Checkpoint 1.1: robots.txt Present (15 pts)

What passes: /robots.txt returns HTTP 200 with Content-Type: text/plain. What fails: Missing file (404), HTML error page served, or wrong Content-Type.

Fix Workflow

  1. Check current state:

    curl -sI https://example.com/robots.txt | head -20
    
  2. Create /robots.txt at your web root:

    User-agent: *
    Allow: /
    User-agent: GPTBot
    Allow: /
    User-agent: ClaudeBot
    Allow: /
    User-agent: Google-Extended
    Allow: /
    Sitemap: https://example.com/sitemap.xml
    
  3. Ensure correct Content-Type — the file must return text/plain. Nginx: default_type text/plain; in the location block. Apache: ForceType text/plain in a <Files> directive. A config sketch is shown below this workflow.

  4. Verify:

    curl -sI https://example.com/robots.txt | grep -i content-type
    # Expected: Content-Type: text/plain
    

See references/robots-txt-guide.md for complete robots.txt syntax and rules.
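
For step 3 above, a minimal Nginx sketch of the Content-Type fix (adapt to the existing server block; Apache users can instead wrap ForceType text/plain in a <Files "robots.txt"> directive as noted in that step):

    # Force text/plain for robots.txt; the same pattern works for llms.txt.
    location = /robots.txt {
      default_type text/plain;
    }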


Checkpoint 1.2: AI Crawler Directives (15 pts)

What passes: All 13 AI crawlers explicitly allowed (15 pts), or some allowed with none blocked (15 pts), or wildcard Allow: / with none blocked (15 pts). Partial credit: No AI crawlers mentioned but default allow applies (10 pts), or mixed policies with some blocked (7 pts). What fails: All AI crawlers explicitly disallowed (0 pts).

The 13 AI User-Agents

User-Agent          Owner         Purpose
GPTBot              OpenAI        Training data crawling
ChatGPT-User        OpenAI        Real-time browsing in ChatGPT
OAI-SearchBot       OpenAI        SearchGPT results
ClaudeBot           Anthropic     Training data crawling
Claude-User         Anthropic     Real-time browsing in Claude
Claude-SearchBot    Anthropic     Claude search results
Google-Extended     Google        Gemini AI training
Amazonbot           Amazon        Alexa/AI training
Bytespider          ByteDance     TikTok/AI training
CCBot               Common Crawl  Open dataset crawling
PerplexityBot       Perplexity    AI search results
Applebot-Extended   Apple         Apple Intelligence training
meta-externalagent  Meta          Meta AI training

Fix Workflow

  1. Check current directives:

    curl -s https://example.com/robots.txt
    
  2. Add explicit Allow directives for each AI crawler to your robots.txt:

    # AI Crawlers — explicitly allow (one block per agent)
    User-agent: GPTBot
    Allow: /
    User-agent: ChatGPT-User
    Allow: /
    User-agent: OAI-SearchBot
    Allow: /
    User-agent: ClaudeBot
    Allow: /
    User-agent: Claude-User
    Allow: /
    User-agent: Claude-SearchBot
    Allow: /
    User-agent: Google-Extended
    Allow: /
    User-agent: Amazonbot
    Allow: /
    User-agent: Bytespider
    Allow: /
    User-agent: CCBot
    Allow: /
    User-agent: PerplexityBot
    Allow: /
    User-agent: Applebot-Extended
    Allow: /
    User-agent: meta-externalagent
    Allow: /
    
  3. If you want to allow all crawlers — a simple wildcard also works:

    User-agent: *
    Allow: /
    
  4. If you want selective control — allow some, block others:

    # Allow search-oriented AI crawlers
    User-agent: ChatGPT-User
    Allow: /
    
    User-agent: PerplexityBot
    Allow: /
    
    # Block training-oriented crawlers
    User-agent: GPTBot
    Disallow: /
    
    User-agent: CCBot
    Disallow: /
    

See references/robots-txt-guide.md for full syntax and AI user-agent details.


Checkpoint 1.3: XML Sitemap (15 pts)

What passes: Valid XML sitemap found at a discoverable URL with <urlset> or <sitemapindex>. What fails: No sitemap found, or sitemap is not valid XML.

The scanner checks these locations in order:

  1. URLs from Sitemap: directives in robots.txt
  2. /sitemap.xml
  3. /sitemap_index.xml

Fix Workflow

  1. Check if a sitemap exists:

    curl -sI https://example.com/sitemap.xml | head -5
    curl -s https://example.com/robots.txt | grep -i sitemap
    
  2. Create /sitemap.xml:

    <?xml version="1.0" encoding="UTF-8"?>
    <urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
      <url>
        <loc>https://example.com/</loc>
        <lastmod>2025-01-15</lastmod>
        <changefreq>weekly</changefreq>
        <priority>1.0</priority>
      </url>
      <url>
        <loc>https://example.com/about</loc>
        <lastmod>2025-01-10</lastmod>
      </url>
    </urlset>
    
  3. For large sites, use a sitemap index:

    <?xml version="1.0" encoding="UTF-8"?>
    <sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
      <sitemap>
        <loc>https://example.com/sitemap-pages.xml</loc>
        <lastmod>2025-01-15</lastmod>
      </sitemap>
      <sitemap>
        <loc>https://example.com/sitemap-blog.xml</loc>
        <lastmod>2025-01-14</lastmod>
      </sitemap>
    </sitemapindex>
    
  4. Add the Sitemap directive to robots.txt:

    Sitemap: https://example.com/sitemap.xml
    
  5. Verify the sitemap is valid XML:

    curl -s https://example.com/sitemap.xml | head -5
    # Should start with <?xml and contain <urlset or <sitemapindex
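
If xmllint (part of libxml2) is available, it gives a stricter well-formedness check than inspecting the first lines by hand:

    curl -s https://example.com/sitemap.xml | xmllint --noout -
    # No output means the XML parsed cleanly; parse errors are printed otherwise.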
    

Framework-Specific Generation

Most frameworks have sitemap plugins — prefer automated generation over manual files:

  • WordPress: Yoast SEO or built-in (/wp-sitemap.xml)
  • Next.js: next-sitemap package or App Router sitemap.ts
  • Rails: sitemap_generator gem
  • Django: django.contrib.sitemaps
  • Laravel: spatie/laravel-sitemap
  • Phoenix/Elixir: Custom plug or controller route

Checkpoint 1.4: llms.txt (15 pts)

What passes: /llms.txt returns HTTP 200 with text/plain or text/markdown, starts with a # heading, and contains at least one URL. Bonus: /llms-full.txt companion found. What fails: Missing file, wrong content type, no heading, or no URLs.

Fix Workflow

  1. Create /llms.txt — a markdown-formatted overview of your site for LLMs:

    # Your Company Name
    
    > Brief description of what your company does.
    
    ## Docs
    - [Getting Started](https://example.com/docs/getting-started)
    - [API Reference](https://example.com/docs/api)
    - [Tutorials](https://example.com/docs/tutorials)
    
    ## Products
    - [Product Overview](https://example.com/products)
    - [Pricing](https://example.com/pricing)
    
    ## Optional
    - [Blog](https://example.com/blog)
    - [Changelog](https://example.com/changelog)
    - [Status Page](https://status.example.com)
    
  2. Optionally create /llms-full.txt — expanded version with more detail:

    # Your Company Name
    
    > Detailed description of your company, products, and services.
    
    ## Getting Started
    
    Full getting started content here, not just a link.
    Include setup instructions, prerequisites, etc.
    
    ## API Reference
    
    Inline API documentation or detailed summaries of endpoints.
    
  3. Ensure correct Content-Type — must be text/plain or text/markdown. Same server config as robots.txt (see checkpoint 1.1).

  4. Verify:

    curl -sI https://example.com/llms.txt | grep -i content-type
    curl -s https://example.com/llms.txt | head -5
    # First line must start with #
    # Must contain at least one https:// URL
    

See references/llms-txt-guide.md for the full specification and examples for different site types.
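
If the site already has a flat sitemap.xml, a starter llms.txt can be seeded from its URLs. This is a rough sketch only (assumes GNU or BSD grep/sed); the heading, description, and section grouping should still be written by hand:

    {
      echo "# Example Company"
      echo
      echo "> One-sentence description of what the site offers."
      echo
      echo "## Pages"
      curl -s https://example.com/sitemap.xml \
        | grep -oE '<loc>[^<]+</loc>' \
        | sed -E 's#</?loc>##g' \
        | sed -E 's#(.+)#- [\1](\1)#'
    } > llms.txt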


Checkpoint 1.5: Meta Robots / X-Robots-Tag (15 pts)

What passes: No restrictive directives found (15 pts). Partial credit: Restrictive directives other than noindex found (8 pts) — e.g., nofollow, nosnippet, noai, noimageai. What fails: noindex directive found (0 pts).

The scanner checks both:

  • <meta name="robots" content="..."> in HTML
  • X-Robots-Tag HTTP response header

Restrictive Directives

Directive   Effect
noindex     Prevents indexing entirely (worst for AI)
nofollow    Prevents following links on the page
nosnippet   Prevents showing snippets in search results
noai        Signals no AI usage (some crawlers respect)
noimageai   Signals no AI usage of images

Fix Workflow

  1. Check current meta robots:

    curl -s https://example.com/ | grep -i 'name="robots"'
    curl -sI https://example.com/ | grep -i x-robots-tag
    
  2. Remove or replace restrictive tags in your HTML <head>:

    <!-- WRONG: blocks AI indexing -->
    <meta name="robots" content="noindex, nofollow">
    
    <!-- CORRECT: allows AI indexing -->
    <meta name="robots" content="index, follow">
    
    <!-- ALSO CORRECT: omit entirely (default is index, follow) -->
    
  3. Remove restrictive X-Robots-Tag headers:

    Nginx:

    # Remove if present:
    # add_header X-Robots-Tag "noindex";
    
    # Replace with (or remove entirely):
    add_header X-Robots-Tag "index, follow";
    
  4. For specific AI directives — if you have noai or noimageai and want to allow AI:

    <!-- Remove noai/noimageai to allow AI systems -->
    <meta name="robots" content="index, follow">
    
  5. Verify:

    curl -s https://example.com/ | grep -i 'name="robots"'
    # Should show: content="index, follow" or no meta robots tag at all
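
Homepage-only checks can miss page-level tags, so a short loop over key URLs covers both the meta tag and the header (a sketch; substitute the site's own URL list):

    for url in https://example.com/ https://example.com/about https://example.com/pricing; do
      echo "== $url"
      curl -sI "$url" | grep -i '^x-robots-tag'
      curl -s "$url" | grep -io '<meta name="robots"[^>]*>'
    done
    # No output under a URL means no robots directives were found for it.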
    

Checkpoint 1.6: Content Freshness Signals (10 pts)

What passes: dateModified in JSON-LD or article:modified_time meta tag (10 pts). Partial: Only datePublished, article:published_time, or <time datetime> (7 pts). Only Last-Modified header (5 pts). What fails: No freshness signals detected (0 pts).

Why it matters: ChatGPT shows 3.2x preference for content with fresh date signals. AI systems use dates to prioritize recent, authoritative content.

Fix Workflow

  1. Check current signals:

    curl -sI https://example.com/ | grep -i last-modified
    curl -s https://example.com/ | grep -iE 'dateModified|article:modified_time'
    
  2. Add dateModified and datePublished to JSON-LD (best — 10 pts):

    {
      "@context": "https://schema.org",
      "@type": "Article",
      "datePublished": "2024-01-15T09:00:00Z",
      "dateModified": "2024-03-01T14:30:00Z"
    }
    
  3. Add Open Graph meta tags (also 10 pts):

    <meta property="article:published_time" content="2024-01-15T09:00:00Z">
    <meta property="article:modified_time" content="2024-03-01T14:30:00Z">
    
  4. Last-Modified HTTP header scores only 5 pts — use JSON-LD or meta tags for full credit.

  5. <time datetime> element scores 7 pts as a fallback:

    <time datetime="2024-03-01T14:30:00Z">March 1, 2024</time>
    
  6. Verify:

    curl -s https://example.com/ | grep -iE 'dateModified|article:modified_time'
    

Key Gotchas

  1. robots.txt returns HTML — 404 page served instead of a real robots.txt file
  2. Wrong Content-Type — robots.txt or llms.txt served as text/html instead of text/plain
  3. Wildcard Disallow blocks everything — User-agent: * / Disallow: / blocks all AI crawlers
  4. Sitemap is not valid XML — JSON or HTML page served at /sitemap.xml
  5. llms.txt has no URLs — File exists but is just plain text without any links
  6. noai vs noindex confusion — noai costs 7 points, noindex costs all 15
  7. Last-Modified header only — scores 5/10; add dateModified in JSON-LD for full 10 points

See references/gotchas.md for detailed correct vs incorrect examples of each.

References

  • robots-txt-guide.md — Complete robots.txt syntax, AI user-agents, and testing
  • llms-txt-guide.md — llms.txt specification, content structure, and site-type examples
  • gotchas.md — Common pitfalls with wrong vs correct examples

Instructions

  1. Identify failing checkpoints from the IsAgentReady.com scan results
  2. Follow the fix workflow for each failing checkpoint above
  3. Apply the code examples — adapt URLs, domain names, and content to the user's site
  4. Verify each fix using the curl commands provided in each workflow
  5. Re-scan at isagentready.com to confirm improvements

If $ARGUMENTS is provided, interpret it as the URL to fix or the specific checkpoint to address.
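
As a convenience, the individual verification commands can be rolled into one spot-check before re-scanning (a sketch; it assumes the site root URL is passed as the first argument, defaulting to example.com):

    SITE="${1:-https://example.com}"
    echo "robots.txt:";  curl -sI "$SITE/robots.txt"  | grep -iE '^HTTP|^content-type'
    echo "sitemap:";     curl -sI "$SITE/sitemap.xml" | grep -iE '^HTTP'
    echo "llms.txt:";    curl -sI "$SITE/llms.txt"    | grep -iE '^HTTP|^content-type'
    echo "bot access:";  curl -s -o /dev/null -w 'GPTBot -> HTTP %{http_code}\n' \
                           -A "Mozilla/5.0 (compatible; GPTBot/1.0)" "$SITE/"
    echo "meta robots:"; curl -s "$SITE/" | grep -io '<meta name="robots"[^>]*>'
    echo "freshness:";   curl -s "$SITE/" | grep -ioE 'dateModified|article:modified_time'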

Source

git clone https://github.com/BartWaardenburg/isagentready-skills.git
View on GitHub: https://github.com/BartWaardenburg/isagentready-skills/blob/main/ai-content-discovery/SKILL.md

Overview

ai-content-discovery fixes AI content discovery issues by creating and optimizing robots.txt, AI crawler directives, XML sitemaps, llms.txt (and llms-full.txt), meta robots tags, and content freshness signals so AI systems can find, crawl, and understand your site.

How This Skill Works

The skill analyzes the current configuration, generates or updates robots.txt, llms.txt (and llms-full.txt), and builds a valid XML sitemap. It configures AI-friendly directives for key crawlers and ensures freshness signals (dateModified, article:modified_time) are present in the site's structured data and meta tags.

When to Use It

  • Fix robots.txt issues or add AI-specific rules
  • Add or update AI crawler directives (GPTBot, ClaudeBot, etc.)
  • Create or fix an XML sitemap
  • Create llms.txt or llms-full.txt with URLs
  • Remove restrictive meta robots tags and improve content freshness signals

Quick Start

  1. Run diagnostic checks for robots.txt, AI directives, XML sitemap, llms.txt, and meta robots tags
  2. Generate or update robots.txt, llms.txt (and llms-full.txt), and sitemap.xml with AI-friendly rules
  3. Validate accessibility using AI crawler user-agents (e.g., GPTBot, ClaudeBot) and confirm freshness signals are present

Best Practices

  • Audit HTTP bot access using AI crawler user-agents to ensure 200-299 responses
  • Keep robots.txt concise and explicit for AI crawlers with clear Allow/Disallow rules
  • Validate XML sitemap for correct structure (<urlset> or <sitemapindex>) and indexable URLs
  • Include dateModified signals in JSON-LD, article:modified_time, or Last-Modified headers
  • Avoid noindex/noai in meta robots and test changes with AI agents

Example Use Cases

  • Add GPTBot allowance in robots.txt to ensure AI access
  • Generate a clean sitemap.xml for a large site and verify with a sitemap checker
  • Create llms.txt with a heading and URL list for key pages; optionally llms-full.txt
  • Include dateModified in article data to signal freshness to AI crawlers
  • Remove restrictive meta robots directives on pages previously blocked by WAF
