How to allow AI crawlers (without blocking yourself out of AI answers)

The short answer

AI engines can only cite pages they are allowed to fetch, so an over-aggressive robots.txt or bot rule is one of the most common reasons a site is invisible in AI answers. The key distinction is between crawlers that fetch pages for live answers and retrieval — which you almost always want to allow — and crawlers or tokens that gather data for model training, which is a separate content-licensing decision. Blocking the training bots does not remove you from live AI answers, and accidentally blocking the retrieval bots can make you disappear from them.

Two kinds of AI crawler, two different decisions

Reputable AI companies run more than one bot, and they do different jobs. Retrieval and answer bots fetch a page at answer time or to power an AI search index — these are the ones that get you cited, so you generally want them allowed. Training crawlers gather large amounts of content to train or ground future models; allowing them can help a model learn about your brand over time, but it is a licensing and strategy choice rather than a visibility necessity. Conflating the two is what leads sites to block their own retrieval by accident.

The AI crawler user-agents to know

The major user-agents, grouped by who runs them:

  • OpenAI: GPTBot (bulk crawling, largely for training), OAI-SearchBot (powers ChatGPT search indexing), and ChatGPT-User (fetches a page when a user or action requests it live).
  • Anthropic: ClaudeBot (bulk crawling, largely for training), Claude-SearchBot (search indexing), and Claude-User (fetches a page when a user request needs it), plus older agents like anthropic-ai and Claude-Web seen in some logs.
  • Perplexity: PerplexityBot (indexing for the answer engine) and Perplexity-User (user-initiated fetches).
  • Google: Googlebot (powers Search, which also feeds AI Overviews), and Google-Extended, which is a robots.txt token that controls whether your content is used for Gemini and Vertex AI rather than a separate crawler.
  • Microsoft: Bingbot (the Bing index, which feeds Copilot).
  • Others: Applebot and Applebot-Extended (Apple), CCBot (Common Crawl, used by many model trainers), Amazonbot, Meta-ExternalAgent, and Bytespider (ByteDance).

What to allow if you want to be cited

For visibility in AI answers, the bots that matter most are the retrieval and search ones: ChatGPT-User and OAI-SearchBot, PerplexityBot and Perplexity-User, Claude-User and Claude-SearchBot, Googlebot, and Bingbot. If any of these are blocked, the corresponding engine cannot fetch your pages to cite them. The training-oriented agents (GPTBot, ClaudeBot, Google-Extended, Applebot-Extended, CCBot) are the ones where reasonable sites differ: allow them if you want your content to inform future models, restrict them if you have content-licensing concerns. Critically, blocking GPTBot does not block ChatGPT from fetching your page live via ChatGPT-User, so blocking training does not have to mean blocking citations.
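The GPTBot point can be checked mechanically. The sketch below uses Python's standard urllib.robotparser to list which agents a given robots.txt blocks. The helper name blocked_agents, the sample policy, and the example.com URL are assumptions made for illustration, and Python's parser is only an approximation of how each real crawler interprets the file.

```python
from urllib.robotparser import RobotFileParser


def blocked_agents(robots_txt, agents, url):
    """Return the agents from `agents` that robots_txt forbids from fetching url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [agent for agent in agents if not parser.can_fetch(agent, url)]


# Hypothetical policy: opt out of OpenAI training crawls, allow everything else.
TRAINING_OPT_OUT = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Allow: /
"""

# The classic accident: a blanket rule that catches every bot.
BLANKET_DISALLOW = """\
User-agent: *
Disallow: /
"""

OPENAI_AGENTS = ["GPTBot", "OAI-SearchBot", "ChatGPT-User"]
PAGE = "https://example.com/pricing"  # placeholder URL

print(blocked_agents(TRAINING_OPT_OUT, OPENAI_AGENTS, PAGE))  # ['GPTBot']
print(blocked_agents(BLANKET_DISALLOW, OPENAI_AGENTS, PAGE))  # all three agents
```

With the first policy only the training crawler is shut out, while the search and live-fetch agents remain allowed. With the blanket rule, all three are blocked.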

A sane robots.txt baseline

A simple, visibility-friendly approach is to explicitly allow the retrieval and search bots and decide deliberately about the training bots. In robots.txt that means a group for each agent you want to permit, naming the user-agent and giving it an Allow rule for your site, with any Disallow rules kept narrow and intentional. Avoid a blanket Disallow: / under User-agent: *, which catches every bot that has no group of its own, and remove stale rules left over from an old setup. After publishing, confirm the file is reachable at /robots.txt on your domain root and that it does not accidentally disallow the paths you most want cited.
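As a rough illustration of such a baseline, the sketch below holds one possible robots.txt as a string and checks a few agent and path combinations with Python's urllib.robotparser. The choice to block the training agents, the /admin/ path, the name SomeOtherBot, and example.com are all assumptions for the example, not recommendations for your site.

```python
from urllib.robotparser import RobotFileParser

BASELINE_ROBOTS_TXT = """\
# Retrieval and search agents: allowed so pages can be cited.
User-agent: OAI-SearchBot
User-agent: ChatGPT-User
User-agent: PerplexityBot
User-agent: Perplexity-User
User-agent: Googlebot
User-agent: Bingbot
Allow: /

# Anthropic agents: all allowed in this sketch; restrict ClaudeBot
# if you treat it as a training crawler.
User-agent: ClaudeBot
User-agent: Claude-SearchBot
User-agent: Claude-User
Allow: /

# Training-oriented agents: a deliberate choice (blocked in this sketch).
User-agent: GPTBot
User-agent: Google-Extended
User-agent: CCBot
Disallow: /

# Everyone else: one narrow, intentional rule.
User-agent: *
Disallow: /admin/
"""


def is_allowed(robots_txt, agent, path, site="https://example.com"):
    """Check one agent/path pair against robots_txt (site is a placeholder)."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, site + path)


for agent in ("ChatGPT-User", "PerplexityBot", "GPTBot", "SomeOtherBot"):
    for path in ("/pricing", "/admin/login"):
        allowed = is_allowed(BASELINE_ROBOTS_TXT, agent, path)
        print(f"{agent:14} {path:13} allowed={allowed}")
```

One design point worth noticing: a bot that has its own group follows only that group, so the /admin/ rule under User-agent: * does not apply to ChatGPT-User here. If a narrow Disallow should bind the named bots too, repeat it inside their group.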

robots.txt is necessary but not sufficient

Allowing a crawler only gives it permission; the page still has to be fetchable and readable. Two things commonly undo a correct robots.txt: a bot-management or firewall layer that challenges or blocks automated agents before they reach the page, and content that only appears after client-side JavaScript, which many crawlers will not execute. Allowing the bot in robots.txt while a security layer silently blocks it is one of the most common hidden causes of AI invisibility.
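The JavaScript half of this can be approximated offline. The sketch below takes the raw HTML of a page (for instance, saved from a plain HTTP request with no browser involved) and checks whether a key phrase appears in the text outside script and style tags, which is roughly what a crawler that does not run JavaScript can read. The function name readable_without_js and both sample pages are made up for illustration, and real crawlers may extract text differently.

```python
from html.parser import HTMLParser


class _VisibleText(HTMLParser):
    """Collects text that sits outside <script> and <style> elements."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.chunks.append(data)


def readable_without_js(raw_html, phrase):
    """True if phrase appears in the static text of raw_html (case-insensitive)."""
    parser = _VisibleText()
    parser.feed(raw_html)
    parser.close()
    text = " ".join(" ".join(parser.chunks).split())
    return phrase.lower() in text.lower()


# Illustrative pages: one rendered on the server, one filled in by a script.
SERVER_RENDERED = "<html><body><h1>Pricing</h1><p>Plans start at $10 per month.</p></body></html>"
CLIENT_RENDERED = (
    '<html><body><div id="root"></div>'
    '<script>render("Plans start at $10 per month.")</script></body></html>'
)

print(readable_without_js(SERVER_RENDERED, "Plans start at $10"))  # True
print(readable_without_js(CLIENT_RENDERED, "Plans start at $10"))  # False
```

This check says nothing about the firewall half of the problem. That one only shows up in real traffic, for example as 403 or challenge responses to AI user-agents in your logs.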

Verify, do not assume

robots.txt is a voluntary standard: reputable bots honor it, but the only way to be sure a given engine can actually reach you is to test. Check that the engines can fetch and cite representative pages, watch your server logs for the AI user-agents above, and re-test after any change to robots.txt, your CDN, or your bot-management settings. Permission, reachability, and readability all have to line up before a page can be cited.
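Watching the logs can start as simply as counting hits per AI user-agent and HTTP status. The sketch below assumes access logs in the common "combined" format, where the user-agent is the last quoted field. The regular expression, the agent list, and the sample lines (with simplified, invented user-agent strings and documentation-range IP addresses) are illustrative assumptions, so adjust them to your server's actual format.

```python
import re
from collections import Counter

# Crawler names from the list earlier in this article. The robots.txt-only
# tokens such as Google-Extended are left out because they never appear in logs.
AI_AGENTS = [
    "GPTBot", "OAI-SearchBot", "ChatGPT-User",
    "ClaudeBot", "Claude-SearchBot", "Claude-User",
    "PerplexityBot", "Perplexity-User",
    "Googlebot", "Bingbot", "Applebot", "CCBot",
    "Amazonbot", "Meta-ExternalAgent", "Bytespider",
]

# Combined log format tail: "request" status size "referer" "user-agent"
LINE_RE = re.compile(r'"[^"]*" (\d{3}) \S+ "[^"]*" "([^"]*)"\s*$')


def count_ai_hits(lines):
    """Count requests per (agent, status) for lines whose user-agent names an AI bot."""
    hits = Counter()
    for line in lines:
        match = LINE_RE.search(line)
        if not match:
            continue  # not in the assumed format
        status, user_agent = match.group(1), match.group(2).lower()
        for agent in AI_AGENTS:
            if agent.lower() in user_agent:
                hits[(agent, status)] += 1
                break
    return hits


# Invented sample lines, for illustration only.
SAMPLE_LOG = [
    '203.0.113.5 - - [08/Jun/2026:10:00:00 +0000] "GET /pricing HTTP/1.1" 200 5123 "-" "ExampleFetcher (compatible; ChatGPT-User/1.0)"',
    '203.0.113.9 - - [08/Jun/2026:10:00:05 +0000] "GET /pricing HTTP/1.1" 403 312 "-" "ExampleFetcher (compatible; PerplexityBot/1.0)"',
    '198.51.100.7 - - [08/Jun/2026:10:00:09 +0000] "GET / HTTP/1.1" 200 8000 "-" "Mozilla/5.0 (X11; Linux x86_64)"',
]

for (agent, status), count in sorted(count_ai_hits(SAMPLE_LOG).items()):
    print(f"{agent:15} status={status} hits={count}")
```

A retrieval agent that shows up with 200 responses is reaching you. One that appears only with 403s, or never appears at all, is worth checking against your CDN and bot-management rules. User-agent strings can be spoofed, so treat these counts as a signal rather than proof.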

Frequently asked questions

Will blocking GPTBot remove me from ChatGPT?

Not from live answers. GPTBot is largely a training crawler; ChatGPT fetches pages live through ChatGPT-User and indexes via OAI-SearchBot, so blocking GPTBot is a training-data choice, not a visibility one.

What is Google-Extended?

It is a robots.txt token that controls whether your content is used for Gemini and Vertex AI, not a separate crawler. Blocking it does not remove you from Google Search or AI Overviews, which are governed by Googlebot.

Do all AI crawlers obey robots.txt?

Reputable ones do, but robots.txt is voluntary, so some agents may ignore it. Use it to guide well-behaved bots, and rely on server-side controls if you need to enforce access.

I allowed the bots but I am still not cited — why?

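Permission is only step one. A bot-management layer may be blocking the agent before it reaches the page, or your content may render only after JavaScript runs. The first stops the fetch outright and the second leaves the bot with a page it cannot read, so either one keeps you from being cited even with a correct robots.txt.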

Last updated 2026-06-08 · RankEcho · Operated by Nexus Decision Systems LLC