How to allow AI crawlers in robots.txt

By Abiot Y. Derbie - Founder, RankEcho - Updated 2026-07-22

The short answer

To allow AI crawlers, name each one explicitly in robots.txt: GPTBot, OAI-SearchBot, ChatGPT-User, ClaudeBot, Claude-User, Claude-SearchBot, PerplexityBot, and Google-Extended each read their own User-agent block. A blanket User-agent: * with Allow: / opens the site to every compliant bot, but two things break AI access more often than robots.txt itself. Most providers run two crawlers, one for training and one for live retrieval, and blocking the retrieval crawler removes the site from AI answers even while the training crawler is welcome. A CDN or firewall can also return 403 to bots the file allows, before robots.txt is ever consulted. Verify the result with a request carrying the bot user-agent: a 200 means the door is open, a 403 or a challenge page means it is not.

How do I allow all crawlers in robots.txt?

The universal permission is two lines: a User-agent line with an asterisk, then an Allow line pointing at the site root. Every crawler that honors robots.txt reads that as full access to public pages. A Sitemap line is optional and can follow them.

Allowing all is the right default for a public marketing site, but it is not the same as being readable by AI engines. Three things override or bypass it. A named User-agent block beats the wildcard for that bot, because a compliant crawler follows only the most specific group that matches its name, so one legacy Disallow rule can quietly exclude a single engine. A CDN or firewall can refuse the request before the file is consulted. And some crawlers ignore robots.txt entirely.

The sections below cover the named blocks worth adding, the two crawler families behind them, and how to confirm the outcome instead of assuming it.

User-agent: *
Allow: /
Sitemap: https://yourdomain.com/sitemap.xml
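
The precedence rule can be checked offline before a file is deployed. The sketch below uses Python's standard-library urllib.robotparser on a hypothetical robots.txt in which a legacy GPTBot block sits beside the wildcard. The file contents, the is_allowed helper, and the yourdomain.com URL are illustrative assumptions, and Python's parser is only one implementation of the matching rules, so real crawlers may differ in edge cases.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical file: the wildcard allows everything, but a legacy
# named block still disallows GPTBot.
ROBOTS_TXT = """\
User-agent: *
Allow: /

User-agent: GPTBot
Disallow: /
"""

URL = "https://yourdomain.com/pricing"


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if this parser reads robots_txt as allowing user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


for bot in ("GPTBot", "PerplexityBot"):
    # GPTBot matches its own named block; PerplexityBot falls back to the wildcard.
    print(bot, is_allowed(ROBOTS_TXT, bot, URL))
```

With this file, the named block decides the outcome for GPTBot, while PerplexityBot has no block of its own and falls back to the wildcard.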

Which AI crawlers should I allow?

Each major provider publishes named user agents. These are the ones that decide whether a brand can appear in AI answers at all.

Crawler	Operator	What it does	Typical setting
GPTBot	OpenAI	Crawls pages that may inform model training	Allow
OAI-SearchBot	OpenAI	Indexes pages for ChatGPT search results	Allow
ChatGPT-User	OpenAI	Fetches a page live when a user asks ChatGPT to read it	Allow
ClaudeBot	Anthropic	Crawls pages that may inform model training	Allow
Claude-SearchBot	Anthropic	Indexes pages for Claude search results	Allow
Claude-User	Anthropic	Fetches a page live during a Claude conversation	Allow
PerplexityBot	Perplexity	Indexes pages for Perplexity answers and citations	Allow
Google-Extended	Google	Governs Gemini and AI training, not Search indexing	Allow
Googlebot	Google	Powers Search and Google AI Overviews	Allow
CCBot	Common Crawl	Builds a public archive many models train on	Policy choice
Bytespider	ByteDance	Crawls aggressively and is reported to ignore robots.txt	Block at the server

What is the difference between a training crawler and a retrieval crawler?

Most providers run two crawlers that answer two different questions. A training crawler collects pages that may inform what a future model knows about a brand. A retrieval crawler fetches pages at answer time, which is what produces a citation inside a live response.

The distinction matters because the two are configured separately and the consequences are not symmetrical. Blocking a training crawler has a slow effect: over months, models learn less that would associate the brand with its category. Blocking a retrieval crawler removes the site from live answers immediately, even when every training crawler is allowed.

A site that wants citations should allow the retrieval crawlers without exception. Whether to allow training crawlers is a separate policy decision that can reasonably go either way.

Training crawlers: GPTBot, ClaudeBot, Google-Extended, CCBot
Retrieval crawlers: OAI-SearchBot, ChatGPT-User, Claude-SearchBot, Claude-User, PerplexityBot
Block a retrieval crawler and citations from that engine stop
Block a training crawler and future models learn less about the brand

What should the robots.txt look like?

Two configurations cover almost every case. The first allows every citation-relevant crawler explicitly, so no legacy wildcard rule can exclude one by accident. Add a User-agent line for each name below, each followed by Allow: / on the next line, above any existing rules.

The second keeps citations while opting out of training. Allow the retrieval crawlers exactly as above, and set Disallow: / under GPTBot, ClaudeBot, Google-Extended, and CCBot instead. Expect fewer unprompted brand mentions over time in exchange.

GPTBot - Allow: /
OAI-SearchBot - Allow: /
ChatGPT-User - Allow: /
ClaudeBot - Allow: /
Claude-SearchBot - Allow: /
Claude-User - Allow: /
PerplexityBot - Allow: /
Google-Extended - Allow: /
Googlebot - Allow: /
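
As a rough sketch of the two configurations, the Python below generates the robots.txt text for either policy and then checks it with the standard-library parser. The function names build_robots_txt and allowed, the grouping constants, and the yourdomain.com URL are assumptions made for illustration. The crawler names and the Allow and Disallow choices come from the lists in this guide.

```python
from urllib.robotparser import RobotFileParser

# Crawlers that stay allowed under both configurations.
ALWAYS_ALLOW = [
    "OAI-SearchBot", "ChatGPT-User", "Claude-SearchBot",
    "Claude-User", "PerplexityBot", "Googlebot",
]
# Training crawlers named in the first configuration.
TRAINING = ["GPTBot", "ClaudeBot", "Google-Extended"]


def build_robots_txt(opt_out_of_training: bool) -> str:
    """Return robots.txt text with one User-agent group per crawler."""
    groups = [(name, "Allow: /") for name in ALWAYS_ALLOW]
    if opt_out_of_training:
        # Second configuration: also disallow CCBot, as described above.
        groups += [(name, "Disallow: /") for name in TRAINING + ["CCBot"]]
    else:
        groups += [(name, "Allow: /") for name in TRAINING]
    return "\n".join(f"User-agent: {name}\n{rule}\n" for name, rule in groups)


def allowed(robots_txt: str, bot: str, url: str = "https://yourdomain.com/") -> bool:
    """Check the generated text with Python's parser (one implementation of the rules)."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(bot, url)


citations_only = build_robots_txt(opt_out_of_training=True)
print(citations_only)
for bot in ("GPTBot", "ChatGPT-User", "CCBot"):
    print(bot, allowed(citations_only, bot))
```

Generating the file from two lists keeps the training and retrieval decisions visibly separate, which makes it harder to block a retrieval crawler by accident when the training policy changes.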

Is Cloudflare blocking AI crawlers even though robots.txt allows them?

A block at the CDN is the most common cause of a site that looks correctly configured and is still invisible to AI engines. Bot management at the CDN runs before the origin serves anything, so a firewall rule can return 403 or a challenge page to a crawler that robots.txt welcomes.

Cloudflare ships a one-click control for blocking AI crawlers, and on some plans it is on by default. Check Security, then Bots, and review whether AI scraper blocking is enabled. Check WAF rules for user-agent matches. If Cloudflare is also managing robots.txt, confirm which file is actually being served.

The same applies to any bot-protection layer: Akamai, Fastly, AWS WAF, or a plugin. Server logs settle it. Search for the crawler user agents and look at the status codes they receive.
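
The log search can be sketched in a few lines of Python. This version assumes access logs in the common "combined" format, where the status code follows the quoted request line and the user agent is the last quoted field. The sample lines, IP addresses, and the status_by_crawler helper are hypothetical. Requests refused at the CDN may never reach origin logs, in which case the same tally would have to run on the CDN's own logs.

```python
import re
from collections import Counter, defaultdict

AI_CRAWLERS = (
    "GPTBot", "OAI-SearchBot", "ChatGPT-User", "ClaudeBot",
    "Claude-SearchBot", "Claude-User", "PerplexityBot",
)

# Combined log format: "METHOD path protocol" status bytes "referer" "user-agent"
LINE_RE = re.compile(
    r'"[A-Z]+ \S+ [^"]*" (?P<status>\d{3}) \S+ "[^"]*" "(?P<ua>[^"]*)"'
)

# Hypothetical sample lines for illustration only.
SAMPLE_LOG = [
    '203.0.113.7 - - [22/Jul/2026:10:00:00 +0000] "GET /pricing HTTP/1.1" 403 512 "-" "Mozilla/5.0 (compatible; GPTBot/1.0)"',
    '203.0.113.8 - - [22/Jul/2026:10:00:05 +0000] "GET /pricing HTTP/1.1" 200 8042 "-" "Mozilla/5.0 (compatible; PerplexityBot/1.0)"',
    '203.0.113.7 - - [22/Jul/2026:10:01:00 +0000] "GET / HTTP/1.1" 403 512 "-" "Mozilla/5.0 (compatible; GPTBot/1.0)"',
]


def status_by_crawler(lines):
    """Count HTTP status codes per AI crawler found in the user-agent field."""
    counts = defaultdict(Counter)
    for line in lines:
        match = LINE_RE.search(line)
        if not match:
            continue  # skip lines that are not in the assumed format
        user_agent = match.group("ua").lower()
        for bot in AI_CRAWLERS:
            if bot.lower() in user_agent:
                counts[bot][int(match.group("status"))] += 1
    return counts


for bot, statuses in status_by_crawler(SAMPLE_LOG).items():
    print(bot, dict(statuses))
```

A crawler that shows only 403 or 429 responses is being refused somewhere in front of the content, whatever robots.txt says.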

How do I verify a crawler can reach the page?

Request the page with the bot user-agent and read the status code. A 200 means the door is open. A 403, a 429, or an HTML challenge page means something between the crawler and the content is refusing.

Two more checks catch the rest. A page that carries a noindex directive is excluded from AI citation even when crawlers can fetch it, so any page meant to be cited must be indexable. And a page whose content only appears after JavaScript runs may return an empty shell to crawlers that do not execute scripts, which reads as a page with nothing to cite.

Run the check against the pages that matter most: pricing, comparisons, and the answers to the questions buyers ask.

curl -A GPTBot/1.0 -I https://yourdomain.com
curl -A PerplexityBot -I https://yourdomain.com/pricing
200 means reachable, 403 or a challenge means blocked
View source and confirm the answer text is in the HTML, not only in the rendered page
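
The same checks can be scripted. The sketch below is a minimal version using only Python's standard library: fetch_as sends a request with a chosen user agent, and classify turns the status code and raw HTML into a verdict. The helper names, the challenge-page markers, and the noindex pattern are assumptions and rough heuristics, and the script does not inspect an X-Robots-Tag response header. A request that merely carries a bot's user-agent string also comes from a different IP address than the real crawler, so a firewall that verifies crawler IPs may treat the two differently.

```python
import re
import urllib.error
import urllib.request

# Hypothetical heuristics for spotting a challenge page in the body.
CHALLENGE_MARKERS = ("just a moment", "captcha", "attention required")

# Crude match for <meta name="robots" content="...noindex...">.
NOINDEX_RE = re.compile(
    r'<meta[^>]*name=["\']robots["\'][^>]*content=["\'][^"\']*noindex',
    re.IGNORECASE,
)


def fetch_as(user_agent: str, url: str, timeout: float = 10.0):
    """Fetch url with the given User-Agent header; return (status, body text)."""
    request = urllib.request.Request(url, headers={"User-Agent": user_agent})
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status, response.read().decode("utf-8", errors="replace")
    except urllib.error.HTTPError as error:
        # 4xx and 5xx responses arrive as exceptions but still carry a body.
        return error.code, error.read().decode("utf-8", errors="replace")


def classify(status: int, body: str, expected_text=None) -> str:
    """Turn a status code and raw HTML into one of the outcomes described above."""
    lowered = body.lower()
    if status in (403, 429):
        return "blocked"
    if any(marker in lowered for marker in CHALLENGE_MARKERS):
        return "challenge"
    if status != 200:
        return f"status {status}"
    if NOINDEX_RE.search(body):
        return "reachable, but noindex"
    if expected_text is not None and expected_text.lower() not in lowered:
        return "reachable, but text missing from raw HTML"
    return "reachable"


# Offline examples; no network request is made here.
print(classify(403, ""))
print(classify(200, '<div id="root"></div>', expected_text="Pricing"))
print(classify(200, "<h1>Pricing</h1>", expected_text="Pricing"))
```

To run it against a live page, pass the result of fetch_as("GPTBot/1.0", "https://yourdomain.com/pricing") into classify, optionally with a phrase that should appear on the page.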

What about crawlers that ignore robots.txt?

Robots.txt is a request, not an enforcement mechanism. The major commercial crawlers publish their user agents and state that they honor it. Others are reported to ignore it, which means a Disallow line alone will not stop them.

Where the goal is genuinely to stop a crawler rather than to signal a preference, the block belongs at the server or CDN layer, matched on user agent and verified in logs. Where the goal is AI visibility, the opposite applies: the file should welcome the named crawlers and the CDN should be checked for rules that contradict it.
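
A server-layer block matched on user agent might look like the sketch below, written as Python WSGI middleware purely for illustration. In practice the rule usually lives in the web server or CDN configuration instead. The names block_user_agents and hello_app are hypothetical, and Bytespider is used only because the table above lists it as a candidate for blocking. User-agent strings can be spoofed, so this stops only bots that identify themselves.

```python
# Lowercase substrings to refuse; an assumption for this example.
BLOCKED_AGENTS = ("bytespider",)


def block_user_agents(app, blocked=BLOCKED_AGENTS):
    """Wrap a WSGI app so that matching user agents receive 403 Forbidden."""
    def middleware(environ, start_response):
        user_agent = environ.get("HTTP_USER_AGENT", "").lower()
        if any(token in user_agent for token in blocked):
            start_response(
                "403 Forbidden",
                [("Content-Type", "text/plain; charset=utf-8")],
            )
            return [b"Forbidden\n"]
        # Everything else passes through to the wrapped application.
        return app(environ, start_response)
    return middleware


def hello_app(environ, start_response):
    """Stand-in application that always answers 200."""
    start_response("200 OK", [("Content-Type", "text/plain; charset=utf-8")])
    return [b"Hello\n"]


application = block_user_agents(hello_app)
```

Whatever layer enforces the rule, the status codes in the logs are what confirm that it is working.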

What if crawlers can reach the page and still do not cite it?

Access is necessary but not sufficient. In our analysis of 5,542 business websites, 33.7% carried no structured data of any kind and 10.9% returned no server-rendered homepage content, which means an engine could fetch the page and still find little it could confidently attribute.

Once access is confirmed, the next two checks are extraction and identity. Does the raw HTML response contain the answer, the headings, and the description, or does that text only appear after JavaScript runs? And does the page state what the business is as structured data, rather than leaving an engine to infer it from prose?
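
Both questions can be asked of the raw HTML directly, without running any JavaScript. The sketch below uses Python's standard-library html.parser to pull the headings, the meta description, and any JSON-LD @type values out of a response body. The RawHtmlAudit class, the audit_raw_html helper, and the sample page are hypothetical. The sketch also assumes structured data is published as JSON-LD in a script tag, and it does not unpack nested @graph structures or read microdata.

```python
import json
from html.parser import HTMLParser


class RawHtmlAudit(HTMLParser):
    """Collect h1/h2 text, the meta description, and JSON-LD types from raw HTML."""

    def __init__(self):
        super().__init__()
        self.headings = []
        self.description = None
        self.json_ld_types = []
        self._capture = None  # "heading", "jsonld", or None
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag in ("h1", "h2"):
            self._capture, self._buffer = "heading", []
        elif tag == "script" and (attrs.get("type") or "").lower() == "application/ld+json":
            self._capture, self._buffer = "jsonld", []
        elif tag == "meta" and (attrs.get("name") or "").lower() == "description":
            self.description = attrs.get("content")

    def handle_data(self, data):
        if self._capture:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        text = "".join(self._buffer).strip()
        if self._capture == "heading" and tag in ("h1", "h2"):
            self.headings.append(text)
            self._capture = None
        elif self._capture == "jsonld" and tag == "script":
            try:
                data = json.loads(text)
            except json.JSONDecodeError:
                data = None  # malformed JSON-LD is treated as absent
            items = data if isinstance(data, list) else [data]
            self.json_ld_types += [i.get("@type") for i in items if isinstance(i, dict)]
            self._capture = None


def audit_raw_html(html: str) -> dict:
    """Summarize what a non-rendering crawler could read from the raw response."""
    parser = RawHtmlAudit()
    parser.feed(html)
    parser.close()
    return {
        "headings": parser.headings,
        "description": parser.description,
        "json_ld_types": parser.json_ld_types,
    }


# Hypothetical page used only to show the shape of the result.
SAMPLE = """<html><head>
<meta name="description" content="Example Co sells widgets.">
<script type="application/ld+json">{"@context": "https://schema.org", "@type": "Organization", "name": "Example Co"}</script>
</head><body><h1>Widget pricing</h1><div id="root"></div></body></html>"""

print(audit_raw_html(SAMPLE))
```

An empty headings list or an empty json_ld_types list on a page that looks complete in the browser suggests the content or the identity markup is only being added after scripts run.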

Both are cheaper to fix than anything downstream of them, and both gate whether crawler access converts into a citation.

How often should this be reviewed?

Quarterly, and after any CDN or platform change. New engines launch with new user agents, providers split single crawlers into training and retrieval pairs, and bot-protection defaults can change without the customer being notified.

A crawler policy that was correct six months ago can silently exclude an engine that did not exist then. The check takes minutes; the cost of missing it is absence from an entire engine.

Frequently asked questions

How do I allow all bots in robots.txt?

Add a User-agent line with an asterisk followed by Allow: / at the top of the file. That grants access to every crawler that honors robots.txt. Named User-agent blocks still take precedence over the wildcard for those specific bots, so check the file for legacy Disallow rules that name individual crawlers.

What is the difference between GPTBot and ChatGPT-User?

GPTBot crawls pages that may inform model training. ChatGPT-User fetches a page live when someone asks ChatGPT to read it. They are configured separately, and blocking ChatGPT-User removes the site from live answers even when GPTBot is allowed.

Does blocking GPTBot remove me from ChatGPT answers?

Not by itself. GPTBot governs training. ChatGPT search results are governed by OAI-SearchBot, and live page fetches by ChatGPT-User. Blocking all three removes the site from ChatGPT entirely; blocking only GPTBot limits what future models learn about the brand.

Why are AI crawlers still blocked when my robots.txt allows them?

Almost always a CDN or firewall rule that runs before the origin. Cloudflare and similar providers ship one-click AI bot blocking, enabled by default on some plans, and it overrides what robots.txt says. Request the page with the bot user-agent and check for a 403 or a challenge page.

Do AI crawlers obey robots.txt?

The major commercial crawlers publish their user agents and state that they honor robots.txt. Some crawlers are reported to ignore it. If the goal is to stop a specific bot rather than signal a preference, enforce it at the server or CDN layer and verify in logs.

Last updated 2026-07-22 · RankEcho · Operated by Nexus Decision Systems LLC