Should you block or allow AI crawlers? The GPTBot and ClaudeBot decision
Blocking AI crawlers can protect your content or quietly delete you from AI answers, depending entirely on which bots you block. Here is the one distinction that decides it, a per-bot map for 2026, and a posture to pick.
Last updated June 17, 2026
Whether to block AI crawlers comes down to one split: training crawlers (which only feed models) versus answer crawlers (which let you be cited in ChatGPT, Claude, and Perplexity). Block training crawlers if your content is a moat. But block the answer crawlers and you vanish from AI search, which for most launching products is the bigger loss. Check your settings, because tools like Cloudflare may have blocked them for you by default.
- The decision is per-bot, not all-or-nothing. Each AI company runs separate, separately-controllable crawlers for training, search, and live fetches.
- Blocking OAI-SearchBot removes you from ChatGPT search answers; blocking GPTBot only opts you out of training. They are different bots doing different jobs.
- Never block Googlebot. Blocking Google-Extended opts you out of Gemini training without touching your Google Search ranking.
- Cloudflare now blocks AI crawlers by default on new domains, so you may already be invisible to AI answers without choosing to be. Check your AI Crawl Control settings.
The short answer: allow the crawlers that answer questions and, if you block anything at all, block only the ones that train. The move to avoid is blocking everything and quietly deleting yourself from AI answers, which is exactly what some setups, including Cloudflare’s newer defaults, now do on your behalf. The whole decision turns on one distinction most guides skip, so start there.
01 · The framework
The distinction that decides everything
AI crawler: an automated bot run by an AI company that fetches web pages, either to train a model on their content, to build a search index the model answers from, or to retrieve a page in real time when a user asks about it. Each job is usually a separate, separately-controllable bot, which is why “block AI crawlers” is rarely the right instruction.
AI crawlers do not all do the same job, and that is the entire decision. They split three ways:
- Training crawlers fetch your pages to train a model. You get nothing back. Blocking these keeps your content out of a model you do not control.
- Answer crawlers fetch your pages to build the index an assistant answers from. Allowing these is how you get cited in ChatGPT, Claude, or Perplexity, with a link back to you.
- User-initiated fetches happen when a person pastes your URL into an assistant and asks about it. Blocking these mostly just frustrates someone who already wanted your page.
So “should I block AI crawlers” is the wrong question. The right one is “which of those three jobs do I want to allow?” For most people the answer is the same: block training if you like, keep the answer crawlers open, because the answer crawlers are your visibility.
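To see which of the three jobs is actually hitting your site, you can sort the user-agent strings in your server logs by job. The sketch below is one way to do that, assuming the crawler tokens from the bot map in the next section; the function name and category labels are illustrative, not any provider's API.

```python
from typing import Optional

# Job categories from the three-way split above.
TRAINING, ANSWER, USER_FETCH = "training", "answer", "user-fetch"

# Tokens as listed in this article's bot map (mid-2026). Providers rename
# bots, so check their documentation before relying on this table.
# Google-Extended is left out on purpose: it is a robots.txt control token,
# not a user-agent that shows up in request logs.
CRAWLER_JOBS = {
    "GPTBot": TRAINING,
    "ClaudeBot": TRAINING,
    "CCBot": TRAINING,
    "Bytespider": TRAINING,
    "OAI-SearchBot": ANSWER,
    "Claude-SearchBot": ANSWER,
    "PerplexityBot": ANSWER,
    "ChatGPT-User": USER_FETCH,
    "Claude-User": USER_FETCH,
}


def classify_user_agent(user_agent: str) -> Optional[str]:
    """Return the job of the first known crawler token in the header, else None."""
    lowered = user_agent.lower()
    for token, job in CRAWLER_JOBS.items():
        if token.lower() in lowered:
            return job
    return None
```

A header containing `GPTBot` comes back as `"training"`, one containing `OAI-SearchBot` as `"answer"`, and an ordinary browser as `None`. This only reads the claimed identity, which can be faked, so treat it as a first pass.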
02 · The bot map
Who is crawling you in 2026
| Crawler | Run by | Job | Block it? |
|---|---|---|---|
| GPTBot | OpenAI | Training | Optional (opts out of training) |
| OAI-SearchBot | OpenAI | ChatGPT search index | No, if you want ChatGPT citations |
| ChatGPT-User | OpenAI | User-initiated fetch | No |
| ClaudeBot | Anthropic | Training | Optional |
| Claude-SearchBot | Anthropic | Claude search index | No, if you want Claude citations |
| Claude-User | Anthropic | User-initiated fetch | No |
| Google-Extended | Google | Gemini training | Optional (safe for SEO) |
| Googlebot | Google | Google Search | Never |
| PerplexityBot | Perplexity | Perplexity index | No, if you want Perplexity citations |
| CCBot | Common Crawl | Training (feeds many models) | Optional |
| Bytespider | ByteDance | Training | Optional |
This map is current as of mid-2026, but providers rename and split their bots often. Anthropic’s old anthropic-ai and Claude-Web strings, for example, are deprecated, so any robots.txt still blocking those is not blocking Anthropic at all. Before you rely on a directive, check the bot against the provider’s own documentation, and verify a suspicious visitor by reverse DNS rather than trusting the user-agent string, which anyone can fake.
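Reverse DNS verification is usually done as a forward-confirmed lookup: resolve the IP to a hostname, check the hostname's domain, then resolve that hostname back and confirm it includes the original IP. The sketch below assumes the provider documents a hostname suffix for its crawlers (Google does for Googlebot, with `googlebot.com` and `google.com`); other providers may publish IP ranges instead, so check their documentation. The function name and the injectable resolver parameters are illustrative.

```python
import socket


def verify_crawler_ip(ip, allowed_suffixes,
                      reverse=socket.gethostbyaddr,
                      forward=socket.gethostbyname_ex):
    """Forward-confirmed reverse DNS check for a visitor's IP (IPv4 only).

    allowed_suffixes: hostname endings the provider documents,
    e.g. (".googlebot.com", ".google.com").
    reverse / forward: resolver functions, injectable for testing.
    """
    try:
        hostname = reverse(ip)[0]  # step 1: IP -> hostname
    except OSError:
        return False
    hostname = hostname.rstrip(".").lower()
    # Step 2: the hostname must sit under a documented domain.
    if not any(hostname.endswith(suffix) for suffix in allowed_suffixes):
        return False
    try:
        addresses = forward(hostname)[2]  # step 3: hostname -> IPs
    except OSError:
        return False
    # Step 4: the forward lookup must lead back to the same IP.
    return ip in addresses
```

A visitor claiming to be Googlebot would be checked with `verify_crawler_ip(client_ip, (".googlebot.com", ".google.com"))`. The forward step matters because whoever controls an IP block can set its reverse record to any name they like.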
03 · The decision
Pick a posture
There are really only three coherent positions. Match yourself to one.
| Posture | What you do | Who it fits |
|---|---|---|
| Open | Allow every AI crawler | Products that live or die on being found: most indie tools, SaaS, anyone pre-audience |
| Selective | Allow answer crawlers, block training crawlers | Content businesses that want citations without feeding training: blogs, docs, publishers |
| Closed | Block all AI crawlers | Proprietary, paywalled, or premium content where being scraped costs more than being cited earns |
The selective posture is the popular middle ground, and in robots.txt it looks like this:
# Block training-only crawlers
User-agent: GPTBot
User-agent: CCBot
User-agent: Google-Extended
User-agent: ClaudeBot
Disallow: /
# Allow the crawlers that cite you
User-agent: OAI-SearchBot
User-agent: Claude-SearchBot
User-agent: PerplexityBot
Allow: /
# Never block this one
User-agent: Googlebot
Allow: /
A note on that file: in robots.txt, a missing directive already means allow, so those answer crawlers are permitted by default. The Allow lines are there to document intent, which matters when someone inherits the file a year later. Whatever posture you pick, pair it with an llms.txt so the crawlers you do allow can find your best pages in one fetch instead of guessing.
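Before deploying a file like this, it is worth checking that it does what you think it does. The sketch below runs the selective file through Python's standard-library `urllib.robotparser`. The helper name `allowed_agents` is illustrative, and each crawler uses its own parser, so treat the result as a sanity check rather than a guarantee of how any given bot will behave.

```python
from urllib.robotparser import RobotFileParser

# The selective-posture file from above, as a string.
ROBOTS_TXT = """\
# Block training-only crawlers
User-agent: GPTBot
User-agent: CCBot
User-agent: Google-Extended
User-agent: ClaudeBot
Disallow: /

# Allow the crawlers that cite you
User-agent: OAI-SearchBot
User-agent: Claude-SearchBot
User-agent: PerplexityBot
Allow: /

# Never block this one
User-agent: Googlebot
Allow: /
"""


def allowed_agents(robots_txt, agents, path="/"):
    """Map each user-agent token to whether this file lets it fetch `path`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return {agent: parser.can_fetch(agent, path) for agent in agents}


if __name__ == "__main__":
    bots = ["GPTBot", "ClaudeBot", "OAI-SearchBot", "ChatGPT-User", "Googlebot"]
    for agent, allowed in allowed_agents(ROBOTS_TXT, bots).items():
        print(f"{agent:16} {'allowed' if allowed else 'blocked'}")
```

ChatGPT-User is not named anywhere in the file, yet it comes back as allowed, which is the missing-directive-means-allow rule in action.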
04 · The default trap
Cloudflare may have decided for you
If your site is behind Cloudflare, you may have already made this decision without realising it.
On July 1, 2025, Cloudflare became the first major infrastructure provider to block AI crawlers by default, and “block on all pages” is now the default for newly created domains. New sites are asked up front whether to allow AI crawlers, and if you clicked through setup without thinking about it, the answer may be no. The result is blunt: a product that wants to be found in AI answers can be invisible to them straight out of the box, while competitors who opted in become the sources those assistants cite instead.
Cloudflare’s AI Crawl Control dashboard is where you change this. It shows which crawlers are hitting you and whether they actually respect your robots.txt, and it lets you allow or block each one individually. It also offers Pay Per Crawl, a third option that charges crawlers through an HTTP 402 (Payment Required) response, aimed at publishers with content worth licensing rather than at most indie products.
The action item is simple: do not accept whatever the default left you on. Open the dashboard, see who is allowed, and make it a deliberate choice.
05 · Enforcement reality
robots.txt is a request, not a wall
Even a perfect robots.txt only works on crawlers that choose to obey it. The major labs say theirs do, but a large share of AI traffic ignores the file entirely, and robots.txt has no technical means to stop them. If you genuinely need to block a bot, that lives at the server or firewall layer, where a rule is evaluated before robots.txt is ever read, so a firewall block overrides any Allow.
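As a rough illustration of what a server-layer block means, here is a minimal WSGI middleware sketch that refuses requests before the application (or any robots.txt) is consulted. The token list and the names `block_crawlers` and `site` are hypothetical. In practice this rule usually lives in your CDN, firewall, or web server configuration rather than in application code.

```python
# Hypothetical block list: lowercase substrings matched against User-Agent.
BLOCKED_TOKENS = ("gptbot", "ccbot", "claudebot", "bytespider")


def block_crawlers(app, blocked_tokens=BLOCKED_TOKENS):
    """Wrap a WSGI app so blocked user-agents get a 403 before the app runs."""
    def middleware(environ, start_response):
        user_agent = environ.get("HTTP_USER_AGENT", "").lower()
        if any(token in user_agent for token in blocked_tokens):
            body = b"Forbidden\n"
            start_response("403 Forbidden", [
                ("Content-Type", "text/plain"),
                ("Content-Length", str(len(body))),
            ])
            return [body]
        return app(environ, start_response)
    return middleware


def site(environ, start_response):
    """Stand-in for the real application."""
    body = b"Hello\n"
    start_response("200 OK", [
        ("Content-Type", "text/plain"),
        ("Content-Length", str(len(body))),
    ])
    return [body]


if __name__ == "__main__":
    from wsgiref.simple_server import make_server
    make_server("127.0.0.1", 8000, block_crawlers(site)).serve_forever()
```

This only stops bots that announce themselves honestly. A crawler that ignores robots.txt may also send a fake user-agent, so a serious block pairs a rule like this with IP-based checks.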
It is worth being honest about why blocking tempts people. The exchange that made crawling fair, where search indexes you and sends visitors in return, has broken for AI.
By Cloudflare’s own measurement for June 2025, Anthropic’s crawl-to-referral ratio was well above OpenAI’s 1,700:1 and Google’s 14:1. AI crawlers read far more than they send back, which breaks the old crawl-for-traffic bargain.
That imbalance is the real argument for blocking training crawlers: they take much more than they return. But it cuts the other way for anyone without an audience yet. If nobody is visiting, there is no referral relationship to protect, and the content a cold product guards is rarely a moat someone is racing to copy. Invisibility costs a new product more than scraping does.
One more reality check: if your pages are a client-side JavaScript app, this whole debate may be moot. Most AI crawlers do not run JavaScript, so they receive an empty shell no matter what your robots.txt says. Fix that first, because it is part of being parseable at all.
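A quick way to approximate what a non-JavaScript crawler receives is to take the raw HTML your server returns (for example, saved with curl) and measure how much readable text it contains without running any scripts. The sketch below does that with the standard-library HTML parser. The function names and the 200-character threshold are arbitrary assumptions for illustration.

```python
from html.parser import HTMLParser


class _VisibleText(HTMLParser):
    """Collect text that is present in the markup without executing scripts."""
    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Ignore script/style bodies and whitespace-only runs.
        if not self._skip_depth and data.strip():
            self.chunks.append(data.strip())


def visible_text(html):
    """Return the readable text in raw HTML, joined by single spaces."""
    parser = _VisibleText()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


def looks_like_empty_shell(html, min_chars=200):
    """True if the raw HTML carries less readable text than the threshold."""
    return len(visible_text(html)) < min_chars
```

If `looks_like_empty_shell` returns True for your key pages, server-side rendering or prerendering is likely the first fix, ahead of any robots.txt tuning.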
06 · The recommendation
So what should you actually do?
Stripped down:
- If you are launching something and want to be found, allow the answer crawlers (OAI-SearchBot, Claude-SearchBot, PerplexityBot) and stop worrying about the rest. Optionally block the training-only ones. Above all, do not let a default block delete you from AI search.
- If you run a content business with a real audience and real IP, take the selective posture: be cited, do not feed training.
- If your content is genuinely proprietary or premium, block it all and accept you will not be cited. That is a valid choice, as long as it is a choice.
Whatever you pick, the failure mode is the same: not choosing, and letting an inherited robots.txt or a platform default decide for you. Remember which half of the launch picture this sits in too. Being absent from AI answers is a Footprint problem, not a content one. The worst posture is an accidental one.