robots.txt for AI Bots

Configure your robots.txt correctly so that GPTBot, ClaudeBot, PerplexityBot and other AI crawlers can access your content, which is a precondition for being cited in AI-generated answers.

What is robots.txt?

robots.txt is a plain-text file placed at the root of your domain (e.g. https://example.com/robots.txt). It tells web crawlers which pages or sections of your site they may or may not access. The file follows the Robots Exclusion Protocol, standardized as RFC 9309 and honored by the major search engines and most AI crawlers. Compliance is voluntary: robots.txt is a request to well-behaved bots, not an access control mechanism.

Each entry consists of a User-agent line identifying the bot, followed by one or more Allow or Disallow directives.
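
To make the group structure concrete, here is a minimal Python sketch that parses a small, made-up robots.txt with the standard library's urllib.robotparser and asks which paths two bots may fetch. The rules, the example.com host, the helper name allowed, and the bot name SomeOtherBot are illustrative assumptions. Python's parser applies the first matching rule inside a group, which may differ from how a particular crawler resolves overlapping rules.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt with two groups: one for GPTBot, one for everyone else.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /private/

User-agent: *
Disallow: /admin/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def allowed(bot: str, path: str) -> bool:
    """Return True if `bot` may fetch `path` under the rules above."""
    return parser.can_fetch(bot, "https://example.com" + path)


if __name__ == "__main__":
    for bot in ("GPTBot", "SomeOtherBot"):
        for path in ("/private/page", "/admin/"):
            print(bot, path, allowed(bot, path))
```

Because GPTBot has its own group, it does not inherit the rules under User-agent: *, which is why /admin/ is not blocked for it in this sketch.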

Why AI Bots Need Explicit Access

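Most AI crawlers, including GPTBot (OpenAI), ClaudeBot (Anthropic), and PerplexityBot, respect robots.txt. If your file blocks them, either with a blanket Disallow: / under User-agent: * or with a Disallow rule in a group that names them, they will not fetch your content, and it will stay out of the AI-generated answers that depend on those crawlers. A bot that is not mentioned at all falls back to the User-agent: * group, and anything no rule forbids is allowed by default. Explicit groups are therefore not strictly required, but they make your intent unambiguous: a bot that finds a group with its own name follows that group and ignores the restrictions under User-agent: *.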

Many websites have robots.txt files originally written to control search engine crawlers. These files often contain broad restrictions that unintentionally block AI bots. Auditing and updating your robots.txt is one of the highest-impact steps you can take for AI visibility.
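
An audit like this can be sketched in a few lines of Python. The function below is a hypothetical helper, not part of any official tool: it takes the text of a robots.txt file and returns the AI bots that may not fetch a given path. The bot list is a small sample taken from this guide, and example.com is a placeholder host.

```python
from urllib.robotparser import RobotFileParser

# Sample of AI crawler names from this guide; extend as needed.
AI_BOTS = ["GPTBot", "ClaudeBot", "PerplexityBot", "ChatGPT-User", "OAI-SearchBot"]


def blocked_bots(robots_txt: str, bots=AI_BOTS, path: str = "/") -> list:
    """Return the bots from `bots` that are not allowed to fetch `path`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    url = "https://example.com" + path  # placeholder host, only the path matters
    return [bot for bot in bots if not parser.can_fetch(bot, url)]


if __name__ == "__main__":
    # A typical staging leftover: everything is blocked for everyone.
    leftover = "User-agent: *\nDisallow: /\n"
    print(blocked_bots(leftover))
```

Run against a file that only contains a blanket block, the helper reports every bot in the list; adding a dedicated group with Allow: / for one bot removes that bot from the result.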

Two Types of AI Crawlers — and Why the Difference Matters

Not all AI bots work the same way. There are two fundamentally different mechanisms, and your robots.txt settings produce very different effects for each.

1. Live Crawlers (Retrieval)

These bots fetch your page in real time when a user asks the AI a question. Your content appears directly in the answer, often with a citation link. Examples:

ChatGPT-User — OpenAI ChatGPT Browsing
OAI-SearchBot — OpenAI SearchGPT
PerplexityBot, Perplexity-User — Perplexity AI
Claude-Web, Claude-SearchBot — Anthropic Claude with web access

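Effect of robots.txt changes: near-immediate. Crawlers may cache robots.txt for a while (RFC 9309 recommends no longer than 24 hours), so a change takes effect as soon as the bot re-reads the file. Allowing a bot makes your pages eligible for citation right away; blocking it removes them from live answers just as quickly.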

2. Training Crawlers

These bots collect data to train future versions of an AI model. When a model answers without browsing, it draws on its training data, which usually ends months before the answer is given, often six to twelve months or more. Blocking a training bot today does not affect what the AI already knows; it affects what the next model version will know. Examples:

GPTBot — OpenAI training
ClaudeBot, anthropic-ai — Anthropic training
Google-Extended — Google Gemini training
CCBot — Common Crawl (used by many AI companies)
Bytespider, Amazonbot, Applebot-Extended, cohere-ai

Effect of robots.txt changes: delayed. Your changes only show up when the AI company releases its next trained model — which can take months.

Practical advice: Allow retrieval bots unless you have a specific reason not to, because they bring real-time visibility with citations. Decide about training bots based on your content strategy: allow them if you want your work included in future AI knowledge; block them if you want to retain control over how your content is used.
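
The split between the two bot types can be turned into a small generator. The following Python sketch is one possible way to do it, not a definitive implementation: build_robots_txt is a hypothetical function, the bot lists are copied from the examples above, and the sitemap URL is a placeholder.

```python
# Bot names copied from the lists in this guide.
RETRIEVAL_BOTS = [
    "ChatGPT-User", "OAI-SearchBot", "PerplexityBot",
    "Perplexity-User", "Claude-Web", "Claude-SearchBot",
]
TRAINING_BOTS = [
    "GPTBot", "ClaudeBot", "anthropic-ai", "Google-Extended", "CCBot",
    "Bytespider", "Amazonbot", "Applebot-Extended", "cohere-ai",
]


def build_robots_txt(allow_training: bool, sitemap_url: str) -> str:
    """Build a robots.txt that always allows retrieval bots and
    allows or blocks training bots depending on `allow_training`."""
    lines = ["User-agent: *", "Allow: /", ""]
    for bot in RETRIEVAL_BOTS:
        lines += [f"User-agent: {bot}", "Allow: /", ""]
    training_rule = "Allow: /" if allow_training else "Disallow: /"
    for bot in TRAINING_BOTS:
        lines += [f"User-agent: {bot}", training_rule, ""]
    lines.append(f"Sitemap: {sitemap_url}")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(build_robots_txt(allow_training=False,
                           sitemap_url="https://example.com/sitemap.xml"))
```

Calling the function with allow_training=True produces a file equivalent in effect to the full template in the next section; with allow_training=False only the training groups switch to Disallow: /.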

Complete robots.txt Example

Copy this template and adapt it for your domain. The Sitemap line at the bottom helps crawlers discover your pages efficiently.

User-agent: *
Allow: /

# ── Live Crawlers (Retrieval — immediate effect) ──

User-agent: ChatGPT-User
Allow: /

User-agent: OAI-SearchBot
Allow: /

User-agent: PerplexityBot
Allow: /

User-agent: Perplexity-User
Allow: /

User-agent: Claude-Web
Allow: /

User-agent: Claude-SearchBot
Allow: /

# ── Training Crawlers (delayed effect until next model) ──

User-agent: GPTBot
Allow: /

User-agent: ClaudeBot
Allow: /

User-agent: Google-Extended
Allow: /

User-agent: CCBot
Allow: /

User-agent: Bytespider
Allow: /

User-agent: Amazonbot
Allow: /

User-agent: Applebot-Extended
Allow: /

User-agent: cohere-ai
Allow: /

User-agent: anthropic-ai
Allow: /

Sitemap: https://example.com/sitemap.xml

Replace https://example.com/sitemap.xml with your actual sitemap URL. If you have multiple sitemaps, add one Sitemap: line per file.

How to Verify Your robots.txt

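Google Search Console: The robots.txt report (under Settings) shows which robots.txt files Google found for your site, when each was last fetched, and any warnings or errors. It replaced the older robots.txt Tester and reflects only how Google reads the file, not how AI crawlers do.
curl: Run curl -i https://yourdomain.com/robots.txt from a terminal. The -i flag prints the response headers, so you can confirm that the file is served with a 200 status code and a text/plain content type.
AI Visibility Scanner: Our free scanner checks your robots.txt as part of a full 19-point AI visibility audit.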
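
The curl check can also be scripted. The Python sketch below is an assumption-laden example: fetch_status and looks_like_robots_txt are hypothetical helper names, and the heuristic simply checks for a 200 status and at least one recognizable directive, which catches the common case of an HTML error page being served at /robots.txt. An empty file served with 200 is technically valid but is flagged here so that you can review it.

```python
import urllib.error
import urllib.request


def fetch_status(url: str, timeout: float = 10.0):
    """Fetch `url` and return (status_code, body_text).
    HTTP error responses (4xx/5xx) are returned as (code, "")."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status, resp.read().decode("utf-8", errors="replace")
    except urllib.error.HTTPError as err:
        return err.code, ""


def looks_like_robots_txt(status: int, body: str) -> bool:
    """Heuristic: status is 200 and at least one line starts with a
    known robots.txt directive (case-insensitive)."""
    if status != 200:
        return False
    known = ("user-agent:", "allow:", "disallow:", "sitemap:")
    return any(line.strip().lower().startswith(known)
               for line in body.splitlines())


if __name__ == "__main__":
    # Replace with your own domain before running.
    status, body = fetch_status("https://example.com/robots.txt")
    print(status, looks_like_robots_txt(status, body))
```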

Common Mistakes

Blanket block: User-agent: * followed by Disallow: / blocks all bots — including every AI crawler. This is often left over from development or staging environments.
Missing Sitemap line: Without a Sitemap declaration, crawlers have to discover pages through links alone. Always include your sitemap URL.
Wrong file location: robots.txt must be at the root of your domain (/robots.txt), not in a subdirectory. A file at /blog/robots.txt is ignored by crawlers.
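Case sensitivity: Under RFC 9309, crawlers match User-agent names case-insensitively, so GPTBot and gptbot select the same group. Paths in Allow and Disallow rules, however, are case-sensitive: Disallow: /Private/ does not block /private/. Use the bot names exactly as the vendors document them to avoid surprises with non-compliant parsers.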
No-index via meta tag: robots.txt controls crawl access; it does not control indexing. Use <meta name="robots" content="noindex"> tags or X-Robots-Tag HTTP headers to keep specific pages out of an index. A crawler can only see these signals if it is allowed to fetch the page, so do not combine noindex with a Disallow rule for the same URL.
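
For the file-location mistake above, the expected robots.txt address can be derived from any page URL. This short Python sketch (robots_url is a hypothetical helper name) keeps the scheme and host and replaces everything else with /robots.txt, which also shows that each subdomain has its own file.

```python
from urllib.parse import urlsplit, urlunsplit


def robots_url(page_url: str) -> str:
    """Return the robots.txt URL that governs `page_url`:
    same scheme and host, path fixed to /robots.txt."""
    parts = urlsplit(page_url)
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


if __name__ == "__main__":
    print(robots_url("https://example.com/blog/post?id=1"))
    print(robots_url("https://blog.example.com/a/b"))
```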
