nilkick

Should you block or allow AI crawlers? The GPTBot and ClaudeBot decision

Blocking AI crawlers can protect your content or quietly delete you from AI answers, depending entirely on which bots you block. Here is the one distinction that decides it, a per-bot map for 2026, and a posture to pick.

Last updated June 17, 2026

Key takeaway

Whether to block AI crawlers comes down to one split: training crawlers (which only feed models) versus answer crawlers (which let you be cited in ChatGPT, Claude, and Perplexity). Block training crawlers if your content is a moat. But block the answer crawlers and you vanish from AI search, which for most launching products is the bigger loss. Check your settings, because tools like Cloudflare may have blocked them for you by default.

  • The decision is per-bot, not all-or-nothing. Each AI company runs separate, separately-controllable crawlers for training, search, and live fetches.
  • Blocking OAI-SearchBot removes you from ChatGPT search answers; blocking GPTBot only opts you out of training. They are different bots doing different jobs.
  • Never block Googlebot. Blocking Google-Extended opts you out of Gemini training without touching your Google Search ranking.
  • Cloudflare now blocks AI crawlers by default on new domains, so you may already be invisible to AI answers without choosing to be. Check your AI Crawl Control settings.

Mostly: allow the crawlers that answer questions, and block the ones that only train, if you block anything at all. The move to avoid is blocking everything and quietly deleting yourself from AI answers, which is exactly what some setups, including Cloudflare’s newer defaults, now do on your behalf. The whole decision turns on one distinction most guides skip, so start there.

01 · The framework
The distinction that decides everything

AI crawler (noun)

An automated bot run by an AI company that fetches web pages, either to train a model on their content, to build a search index the model answers from, or to retrieve a page in real time when a user asks about it. Each job is usually a separate, separately-controllable bot, which is why “block AI crawlers” is rarely the right instruction.

AI crawlers do not all do the same job, and that is the entire decision. They split three ways:

  • Training crawlers fetch your pages to train a model. You get nothing back. Blocking these keeps your content out of a model you do not control.
  • Answer crawlers fetch your pages to build the index an assistant answers from. Allowing these is how you get cited in ChatGPT, Claude, or Perplexity, with a link back to you.
  • User-initiated fetches happen when a person pastes your URL into an assistant and asks about it. Blocking these mostly just frustrates someone who already wanted your page.

So “should I block AI crawlers” is the wrong question. The right one is “which of those three jobs do I want to allow?” For most people the answer is the same: block training if you like, keep the answer crawlers open, because the answer crawlers are your visibility.
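The three-way split can be sketched as a simple lookup. This is a minimal illustration, assuming the user-agent tokens named in this article's bot map; the `CRAWLER_JOBS` table and `classify_crawler` function are hypothetical names, and the token list is a snapshot that should be checked against each provider's documentation.

```python
from typing import Optional

# Job for each user-agent token named in this article's bot map.
# Assumption: this list is a snapshot and providers may rename bots.
# Google-Extended is left out because it is a robots.txt control token
# rather than a user agent that shows up in request logs.
CRAWLER_JOBS = {
    "GPTBot": "training",
    "OAI-SearchBot": "answer",
    "ChatGPT-User": "user-fetch",
    "ClaudeBot": "training",
    "Claude-SearchBot": "answer",
    "Claude-User": "user-fetch",
    "PerplexityBot": "answer",
    "CCBot": "training",
    "Bytespider": "training",
}


def classify_crawler(user_agent: str) -> Optional[str]:
    """Return 'training', 'answer', 'user-fetch', or None if unrecognised."""
    ua = user_agent.lower()
    for token, job in CRAWLER_JOBS.items():
        # Case-insensitive substring match against the full header value.
        if token.lower() in ua:
            return job
    return None
```

Run over a server log, a function like this shows which of the three jobs your current traffic belongs to before you decide what to block. It trusts the user-agent string, so it describes what a visitor claims to be, not what it is.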

02 · The bot map
Who is crawling you in 2026

| Crawler | Run by | Job | Block it? |
| --- | --- | --- | --- |
| GPTBot | OpenAI | Training | Optional (opts out of training) |
| OAI-SearchBot | OpenAI | ChatGPT search index | No, if you want ChatGPT citations |
| ChatGPT-User | OpenAI | User-initiated fetch | No |
| ClaudeBot | Anthropic | Training | Optional |
| Claude-SearchBot | Anthropic | Claude search index | No, if you want Claude citations |
| Claude-User | Anthropic | User-initiated fetch | No |
| Google-Extended | Google | Gemini training | Optional (safe for SEO) |
| Googlebot | Google | Google Search | Never |
| PerplexityBot | Perplexity | Perplexity index | No, if you want Perplexity citations |
| CCBot | Common Crawl | Training (feeds many models) | Optional |
| Bytespider | ByteDance | Training | Optional |

Verify the strings, because old templates go stale

This map is current as of mid-2026, but providers rename and split their bots often. Anthropic’s old anthropic-ai and Claude-Web strings, for example, are deprecated, so a robots.txt that blocks only those is not blocking Anthropic at all. Before you rely on a directive, check the bot against the provider’s own documentation. To verify a suspicious visitor, use reverse DNS or the provider’s published IP ranges, where it offers them, rather than trusting the user-agent string, which anyone can fake.
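A reverse DNS check is usually done as a two-step "forward-confirmed" lookup: resolve the visitor's IP to a hostname, check the hostname belongs to the provider, then resolve that hostname back and confirm it returns the same IP. The sketch below assumes the provider documents hostname suffixes for its crawlers. `is_verified_crawler` is a hypothetical helper, and the suffix list is something you supply from the provider's own documentation.

```python
import socket
from typing import Callable, List, Sequence


def _reverse(ip: str) -> str:
    # PTR lookup: IP address -> primary hostname.
    return socket.gethostbyaddr(ip)[0]


def _forward(host: str) -> List[str]:
    # A-record lookup: hostname -> IPv4 addresses (IPv4 only in this sketch).
    return socket.gethostbyname_ex(host)[2]


def is_verified_crawler(
    ip: str,
    allowed_suffixes: Sequence[str],
    reverse_lookup: Callable[[str], str] = _reverse,
    forward_lookup: Callable[[str], List[str]] = _forward,
) -> bool:
    """Forward-confirmed reverse DNS check for a claimed crawler IP."""
    try:
        host = reverse_lookup(ip).rstrip(".").lower()
    except OSError:
        return False  # no PTR record, or the lookup failed
    # The hostname must be the suffix itself or a subdomain of it.
    if not any(host == s or host.endswith("." + s) for s in allowed_suffixes):
        return False
    try:
        # The hostname must resolve back to the original IP.
        return ip in forward_lookup(host)
    except OSError:
        return False
```

For Googlebot, for example, the call might look like `is_verified_crawler(ip, ["googlebot.com", "google.com"])`. The lookups are injectable so the logic can be exercised without network access. For providers that publish IP ranges instead of hostnames, an IP-range comparison would replace this check.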

03 · The decision
Pick a posture

There are really only three coherent positions. Match yourself to one.

| Posture | What you do | Who it fits |
| --- | --- | --- |
| Open | Allow every AI crawler | Products that live or die on being found: most indie tools, SaaS, anyone pre-audience |
| Selective | Allow answer crawlers, block training crawlers | Content businesses that want citations without feeding training: blogs, docs, publishers |
| Closed | Block all AI crawlers | Proprietary, paywalled, or premium content where being scraped costs more than being cited earns |

The selective posture is the popular middle ground, and in robots.txt it looks like this:

robots.txt (selective posture)
# Block training-only crawlers
User-agent: GPTBot
User-agent: CCBot
User-agent: Google-Extended
User-agent: ClaudeBot
Disallow: /

# Allow the crawlers that cite you
User-agent: OAI-SearchBot
User-agent: Claude-SearchBot
User-agent: PerplexityBot
Allow: /

# Never block this one
User-agent: Googlebot
Allow: /

A note on that file: in robots.txt, a missing directive already means allow, so those answer crawlers are permitted by default. The Allow lines are there to document intent, which matters when someone inherits the file a year later. Whatever posture you pick, pair it with an llms.txt so the crawlers you do allow can find your best pages in one fetch instead of guessing.
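Before deploying a file like the one above, it can help to check how a parser reads it. The sketch below uses Python's standard-library `urllib.robotparser` on the selective-posture file. `allowed_bots` is a hypothetical helper, `example.com` is a placeholder, and real crawlers may match rules somewhat differently from this parser, so treat the result as a sanity check rather than a guarantee.

```python
from urllib.robotparser import RobotFileParser

# The selective-posture file from this article, reproduced verbatim.
SELECTIVE_ROBOTS_TXT = """\
# Block training-only crawlers
User-agent: GPTBot
User-agent: CCBot
User-agent: Google-Extended
User-agent: ClaudeBot
Disallow: /

# Allow the crawlers that cite you
User-agent: OAI-SearchBot
User-agent: Claude-SearchBot
User-agent: PerplexityBot
Allow: /

# Never block this one
User-agent: Googlebot
Allow: /
"""


def allowed_bots(robots_txt, bots, url="https://example.com/"):
    """Map each bot name to True/False: may it fetch `url` under this file?"""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return {bot: parser.can_fetch(bot, url) for bot in bots}


if __name__ == "__main__":
    bots = ["GPTBot", "ClaudeBot", "OAI-SearchBot", "PerplexityBot", "Googlebot"]
    for bot, ok in allowed_bots(SELECTIVE_ROBOTS_TXT, bots).items():
        print(f"{bot:16} {'allowed' if ok else 'blocked'}")
```

A bot that the file never names, such as Bytespider here, comes back as allowed, which is the "missing directive means allow" behaviour described above.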

04 · The default trap
Cloudflare may have decided for you

If your site is behind Cloudflare, you may have already made this decision without realising it.

On July 1, 2025, Cloudflare became the first major infrastructure provider to block AI crawlers by default, and “block on all pages” is now the default for newly created domains. New sites are asked up front whether to allow AI crawlers, and if you clicked through setup without thinking about it, the answer may be no. The result is blunt: a product that wants to be found in AI answers can be invisible to them straight out of the box, while competitors who opted in become the sources those assistants cite instead.

Cloudflare’s AI Crawl Control dashboard is where you change this. It shows which crawlers are hitting you, whether they actually respect your robots.txt, and lets you allow or block each one individually. It also offers Pay Per Crawl, a third option that charges crawlers through a 402 response, aimed at publishers with content worth licensing rather than at most indie products.

The action item is simple: do not accept whatever the default left you on. Open the dashboard, see who is allowed, and make it a deliberate choice.


05 · Enforcement reality
robots.txt is a request, not a wall

Even a perfect robots.txt only works on crawlers that choose to obey it. The major labs say theirs do, but a large share of AI traffic ignores the file entirely, and robots.txt has no technical means to stop it. If you genuinely need to block a bot, do it at the server or firewall layer. A rule there is evaluated on every request, whether or not the crawler ever read robots.txt, so a firewall block overrides any Allow.
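As a rough illustration of server-layer blocking, the sketch below is a small WSGI middleware that answers 403 to any request whose user-agent header contains a blocked token. `BlockCrawlers` and `hello_app` are hypothetical names, and the token list is an example. Because it matches on the user-agent string, it stops bots that identify themselves honestly but not ones that fake the header, which is where IP-level firewall rules come in.

```python
from typing import Callable, Iterable


class BlockCrawlers:
    """WSGI middleware that refuses requests from listed user-agent tokens."""

    def __init__(self, app: Callable, blocked_tokens: Iterable[str]):
        self.app = app
        self.blocked = [t.lower() for t in blocked_tokens]

    def __call__(self, environ: dict, start_response: Callable) -> Iterable[bytes]:
        ua = environ.get("HTTP_USER_AGENT", "").lower()
        if any(token in ua for token in self.blocked):
            # Refuse before the wrapped application ever sees the request.
            body = b"Forbidden\n"
            start_response("403 Forbidden", [
                ("Content-Type", "text/plain"),
                ("Content-Length", str(len(body))),
            ])
            return [body]
        return self.app(environ, start_response)


def hello_app(environ, start_response):
    """Stand-in for the real site."""
    start_response("200 OK", [("Content-Type", "text/plain")])
    return [b"hello\n"]


# Example: block two training-only crawlers, let everything else through.
app = BlockCrawlers(hello_app, ["Bytespider", "CCBot"])
```

Unlike a robots.txt Disallow, this does not depend on the crawler's cooperation: the request is rejected regardless of what the bot read or chose to honour.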

It is worth being honest about why blocking tempts people. The exchange that made crawling fair, where search indexes you and sends visitors in return, has broken for AI.

73k:1 was Anthropic’s crawl-to-referral ratio in June 2025, by Cloudflare’s own measurement, against OpenAI’s 1,700:1 and Google’s 14:1. AI crawlers read far more than they send back, which breaks the old crawl-for-traffic bargain.

That imbalance is the real argument for blocking training crawlers: they take much more than they return. But it cuts the other way for anyone without an audience yet. If nobody is visiting, there is no referral relationship to protect, and the content a cold product guards is rarely a moat someone is racing to copy. Invisibility costs a new product more than scraping does.
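To see where your own site sits, a crawl-to-referral ratio can be approximated from access logs: count requests made by a provider's crawlers, count visits referred from that provider's product, and divide. This is a sketch under stated assumptions, not Cloudflare's methodology. The record shape (`user_agent` and `referer` keys), the function name, and the referrer hostnames you pass in are all hypothetical.

```python
from typing import Iterable, Mapping, Optional
from urllib.parse import urlparse


def crawl_to_referral_ratio(
    records: Iterable[Mapping[str, str]],
    crawler_tokens: Iterable[str],
    referrer_hosts: Iterable[str],
) -> Optional[float]:
    """Crawler requests divided by referred visits; None if no referrals."""
    tokens = [t.lower() for t in crawler_tokens]
    hosts = {h.lower() for h in referrer_hosts}
    crawls = 0
    referrals = 0
    for rec in records:
        ua = rec.get("user_agent", "").lower()
        if any(t in ua for t in tokens):
            crawls += 1  # request made by one of the provider's bots
            continue
        ref_host = (urlparse(rec.get("referer", "")).hostname or "").lower()
        if ref_host in hosts:
            referrals += 1  # human visit sent by the provider's product
    return crawls / referrals if referrals else None
```

A result of 1700.0 would mean 1,700 crawler requests for every referred visit. A result of `None` means the provider crawled you and sent nobody at all, or never crawled you in the first place.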

One more reality check: if your pages are a client-side JavaScript app, this whole debate may be moot. Most AI crawlers do not run JavaScript, so they receive an empty shell no matter what your robots.txt says. Fix that first, because it is part of being parseable at all.
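One rough way to test for the empty-shell problem is to look at the raw HTML a non-JavaScript client receives and count the visible words in it. The sketch below does that with the standard-library `html.parser`. The helper names and the 50-word threshold are assumptions chosen for illustration, not a standard.

```python
from html.parser import HTMLParser


class _VisibleText(HTMLParser):
    """Counts words outside <script>, <style>, and <template> elements."""

    SKIP = {"script", "style", "template"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.words = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Only text a non-JavaScript client could read counts as visible.
        if not self._skip_depth:
            self.words += len(data.split())


def visible_word_count(html: str) -> int:
    parser = _VisibleText()
    parser.feed(html)
    parser.close()
    return parser.words


def looks_like_empty_shell(html: str, min_words: int = 50) -> bool:
    """True if the raw HTML carries almost no readable text."""
    return visible_word_count(html) < min_words
```

In practice the `html` argument would be the body of a plain HTTP fetch of your page, for example via `urllib.request.urlopen`, with no browser involved. If the result is `True` for a page that looks full in a browser, the content is probably being rendered client-side.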

06 · The recommendation
So what should you actually do?

Stripped down:

  • If you are launching something and want to be found, allow the answer crawlers (OAI-SearchBot, Claude-SearchBot, PerplexityBot) and stop worrying about the rest. Optionally block the training-only ones. Above all, do not let a default block delete you from AI search.
  • If you run a content business with a real audience and real IP, take the selective posture: be cited, do not feed training.
  • If your content is genuinely proprietary or premium, block it all and accept you will not be cited. That is a valid choice, as long as it is a choice.

Whatever you pick, the failure mode is the same: not choosing, and letting an inherited robots.txt or a platform default decide for you. Remember which half of the launch picture this sits in too. Being absent from AI answers is a Footprint problem, not a content one. The worst posture is an accidental one.


FAQ

Common questions

Blocking GPTBot does not remove you from ChatGPT search. GPTBot is OpenAI’s training crawler, so blocking it only signals that your content should not be used to train models. The bot that affects ChatGPT search answers is OAI-SearchBot. Block that one and your site will not appear in ChatGPT search results, though it can still be fetched when a user pastes your link, via ChatGPT-User.