robots.txt for AI Bots
Configure your robots.txt correctly so GPTBot, ClaudeBot, PerplexityBot and other AI crawlers can access your content — and you get cited in AI-generated answers.
What is robots.txt?
robots.txt is a plain-text file placed at the root of your domain (e.g. https://example.com/robots.txt). It tells web crawlers which pages or sections of your site they may or may not access. The file follows the Robots Exclusion Protocol, standardized as RFC 9309 and honored by all major search engines and most AI crawlers. Compliance is voluntary: the file is a request to crawlers, not an access control.
Each entry consists of a User-agent line identifying the bot, followed by one or more Allow or Disallow directives.
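As a rough illustration of how a User-agent line and its directives are evaluated, the sketch below uses Python's standard-library urllib.robotparser. The rules, the helper name is_allowed, and the example.com URLs are assumptions made up for this example. Note that Python's parser applies the first matching rule in a group, whereas RFC 9309 specifies the longest match, so results can differ when rules overlap.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules for illustration only.
ROBOTS_TXT = """\
User-agent: GPTBot
Allow: /blog/
Disallow: /private/
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if user_agent may fetch url under the given rules."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


print(is_allowed(ROBOTS_TXT, "GPTBot", "https://example.com/blog/post"))    # True
print(is_allowed(ROBOTS_TXT, "GPTBot", "https://example.com/private/x"))    # False
# A bot that matches no group (and no "*" group exists) is allowed.
print(is_allowed(ROBOTS_TXT, "OtherBot", "https://example.com/private/x"))  # True
```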
Why AI Bots Need Explicit Access
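Most AI crawlers, including GPTBot (OpenAI), ClaudeBot (Anthropic), and PerplexityBot, respect robots.txt. If your file blocks them, whether through a blanket Disallow: / under User-agent: * or through a rule that names them, they will not fetch your content, and it will stay out of the AI-generated answers that depend on those crawlers. A bot that matches no rule at all is allowed by default, so an explicit Allow is not strictly required, but it documents your intent and takes precedence over a restrictive User-agent: * group.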
Many websites have robots.txt files originally written to control search engine crawlers. These files often contain broad restrictions that unintentionally block AI bots. Auditing and updating your robots.txt is one of the highest-impact steps you can take for AI visibility.
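To make the audit concrete, here is a minimal sketch that reports which bots an older, search-engine-era file would block. The LEGACY file content and the helper name blocked_bots are hypothetical; the check relies on Python's urllib.robotparser, which may differ from a given crawler's own parser in edge cases.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical legacy file: only Googlebot is let in, everything else is blocked.
LEGACY = """\
User-agent: Googlebot
Allow: /

User-agent: *
Disallow: /
"""


def blocked_bots(robots_txt: str, bots: list, url: str = "https://example.com/") -> list:
    """Return the bots from `bots` that may not fetch `url`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [bot for bot in bots if not parser.can_fetch(bot, url)]


print(blocked_bots(LEGACY, ["Googlebot", "GPTBot", "ClaudeBot", "PerplexityBot"]))
# ['GPTBot', 'ClaudeBot', 'PerplexityBot']
```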
Two Types of AI Crawlers — and Why the Difference Matters
Not all AI bots work the same way. There are two fundamentally different mechanisms, and your robots.txt settings produce very different effects for each.
1. Live Crawlers (Retrieval)
These bots fetch your page in real time when a user asks the AI a question. Your content appears directly in the answer, often with a citation link. Examples:
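- ChatGPT-User: OpenAI ChatGPT Browsing
- OAI-SearchBot: OpenAI SearchGPT
- PerplexityBot, Perplexity-User: Perplexity AI
- Claude-Web, Claude-SearchBot: Anthropic Claude with web access
Effect of robots.txt changes: fast. A change applies as soon as the crawler re-reads your robots.txt; crawlers may cache the file, and RFC 9309 recommends a cache lifetime of no more than 24 hours. Allowing a bot makes a citation possible but does not guarantee one. Blocking a bot removes your pages from the answers it retrieves once the new rules are picked up.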
2. Training Crawlers
These bots collect data to train the next version of an AI model. When ChatGPT answers without browsing, it draws on its training dataset, which is typically 6–12 months old. Blocking a training bot today does not affect what the AI already knows; it affects what the next model version will know. Examples:
- GPTBot: OpenAI training
- ClaudeBot, anthropic-ai: Anthropic training
- Google-Extended: Google Gemini training
- CCBot: Common Crawl (used by many AI companies)
- Bytespider, Amazonbot, Applebot-Extended, cohere-ai
Effect of robots.txt changes: delayed. Your changes only show up when the AI company releases its next trained model — which can take months.
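The two categories can be checked separately. The sketch below groups the bot names listed above into live and training crawlers and reports which ones a given file blocks in each group. The SAMPLE file, which blocks two training bots while leaving everything else open, and the function name blocked_by_category are assumptions for illustration.

```python
from urllib.robotparser import RobotFileParser

# Bot names taken from the two lists in this article.
CRAWLERS = {
    "live": ["ChatGPT-User", "OAI-SearchBot", "PerplexityBot",
             "Perplexity-User", "Claude-Web", "Claude-SearchBot"],
    "training": ["GPTBot", "ClaudeBot", "anthropic-ai", "Google-Extended",
                 "CCBot", "Bytespider", "Amazonbot", "Applebot-Extended",
                 "cohere-ai"],
}

# Hypothetical policy: block two training bots, allow everyone else.
SAMPLE = """\
User-agent: GPTBot
User-agent: CCBot
Disallow: /

User-agent: *
Allow: /
"""


def blocked_by_category(robots_txt: str, url: str = "https://example.com/") -> dict:
    """Map each crawler category to the bots that may not fetch `url`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return {
        category: [bot for bot in bots if not parser.can_fetch(bot, url)]
        for category, bots in CRAWLERS.items()
    }


print(blocked_by_category(SAMPLE))
# {'live': [], 'training': ['GPTBot', 'CCBot']}
```

With a result like this, retrieval-based answers are unaffected, while the blocked training bots stop collecting new pages for future model versions.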
Complete robots.txt Example
Copy this template and adapt it for your domain. The Sitemap line at the bottom helps crawlers discover your pages efficiently.
User-agent: *
Allow: /
# ── Live Crawlers (Retrieval — immediate effect) ──
User-agent: ChatGPT-User
Allow: /
User-agent: OAI-SearchBot
Allow: /
User-agent: PerplexityBot
Allow: /
User-agent: Perplexity-User
Allow: /
User-agent: Claude-Web
Allow: /
User-agent: Claude-SearchBot
Allow: /
# ── Training Crawlers (delayed effect until next model) ──
User-agent: GPTBot
Allow: /
User-agent: ClaudeBot
Allow: /
User-agent: Google-Extended
Allow: /
User-agent: CCBot
Allow: /
User-agent: Bytespider
Allow: /
User-agent: Amazonbot
Allow: /
User-agent: Applebot-Extended
Allow: /
User-agent: cohere-ai
Allow: /
User-agent: anthropic-ai
Allow: /
Sitemap: https://example.com/sitemap.xml
Replace https://example.com/sitemap.xml with your actual sitemap URL. If you have multiple sitemaps, add one Sitemap: line per file.
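If you maintain the bot list in code, a template like the one above can be generated rather than hand-edited. This is a minimal sketch, not a required approach: the function name build_robots_txt and the second sitemap URL are hypothetical, and the output simply mirrors the structure of the template, with one Sitemap: line per file.

```python
def build_robots_txt(bots: list, sitemaps: list) -> str:
    """Build an allow-all robots.txt with one explicit group per bot."""
    lines = ["User-agent: *", "Allow: /", ""]
    for bot in bots:
        lines += [f"User-agent: {bot}", "Allow: /", ""]
    # One Sitemap line per sitemap file.
    lines += [f"Sitemap: {url}" for url in sitemaps]
    return "\n".join(lines) + "\n"


print(build_robots_txt(
    ["GPTBot", "ClaudeBot"],
    ["https://example.com/sitemap.xml",
     "https://example.com/news-sitemap.xml"],  # hypothetical second sitemap
))
```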
How to Verify Your robots.txt
- Google Search Console: The robots.txt report (which replaced the older robots.txt Tester) shows whether Google could fetch and parse your file. The URL Inspection tool shows whether a given URL is blocked.
- curl: Run curl -sI https://yourdomain.com/robots.txt from a terminal to confirm the file is served with a 200 status code, then curl -s https://yourdomain.com/robots.txt to inspect its contents.
- AI Visibility Scanner: Our free scanner checks your robots.txt as part of a full 19-point AI visibility audit.
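The same checks can be scripted. The sketch below is one possible approach using only the Python standard library: fetch_robots downloads the file and returns its status code and body, and evaluate lists any problems found. Both function names are hypothetical, and network errors other than HTTP error statuses (for example DNS failures) are deliberately left unhandled.

```python
import urllib.error
import urllib.request
from urllib.robotparser import RobotFileParser


def fetch_robots(domain: str, timeout: float = 10.0) -> tuple:
    """Fetch https://<domain>/robots.txt and return (status_code, body)."""
    url = f"https://{domain}/robots.txt"
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status, resp.read().decode("utf-8", errors="replace")
    except urllib.error.HTTPError as err:
        # urlopen raises on 4xx/5xx; report the code instead of failing.
        return err.code, ""


def evaluate(status: int, body: str, bots: list) -> list:
    """Return a list of problems; an empty list means none were found."""
    if status != 200:
        return [f"robots.txt returned HTTP {status}, expected 200"]
    parser = RobotFileParser()
    parser.parse(body.splitlines())
    return [f"{bot} is blocked from /" for bot in bots
            if not parser.can_fetch(bot, "/")]
```

A typical call would be evaluate(*fetch_robots("yourdomain.com"), ["GPTBot", "ClaudeBot", "PerplexityBot"]), which needs network access to run.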
Common Mistakes
- Blanket block: User-agent: * followed by Disallow: / blocks every bot that has no more specific group of its own, which usually means every AI crawler. This is often left over from development or staging environments.
- Missing Sitemap line: Without a Sitemap declaration, crawlers have to discover pages through links alone. Always include your sitemap URL.
- Wrong file location: robots.txt must be at the root of your domain (/robots.txt), not in a subdirectory. A file at /blog/robots.txt is ignored by crawlers.
- Case sensitivity: Under RFC 9309, crawlers match User-agent names case-insensitively, so GPTBot and gptbot select the same group. Paths in Allow and Disallow rules are case-sensitive: /Blog/ and /blog/ are different. Copying each bot's documented spelling is still the safest choice, since not every parser follows the standard.
- No-index via meta tag: robots.txt controls crawl access; it does not control indexing. Use <meta name="robots" content="noindex"> tags or X-Robots-Tag HTTP headers to keep specific pages out of an index even if they are crawled. The page must remain crawlable, or the crawler never sees the noindex instruction.
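For the X-Robots-Tag approach, the header is set by whatever serves your pages. As a minimal sketch using Python's built-in http.server, the handler below adds X-Robots-Tag: noindex to responses under certain path prefixes. The prefixes, the helper robots_headers, and the placeholder page body are hypothetical; a production site would normally configure this in its web server or framework instead.

```python
from http.server import BaseHTTPRequestHandler

# Hypothetical sections that should be crawlable but not indexed.
NOINDEX_PREFIXES = ("/drafts/", "/internal/")


def robots_headers(path: str) -> dict:
    """Return the extra response headers for a request path."""
    if path.startswith(NOINDEX_PREFIXES):
        return {"X-Robots-Tag": "noindex"}
    return {}


class Handler(BaseHTTPRequestHandler):
    def do_GET(self):
        body = b"<h1>Hello</h1>"  # placeholder page content
        self.send_response(200)
        self.send_header("Content-Type", "text/html; charset=utf-8")
        self.send_header("Content-Length", str(len(body)))
        for name, value in robots_headers(self.path).items():
            self.send_header(name, value)
        self.end_headers()
        self.wfile.write(body)
```

To try it locally, pass Handler to http.server.HTTPServer bound to an address such as ("127.0.0.1", 8000) and call serve_forever().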
Official Sources
- Google — robots.txt specification and syntax
- OpenAI — GPTBot documentation
- Anthropic — ClaudeBot and anthropic-ai crawler docs
- Perplexity — How to get indexed by PerplexityBot