Why AI Crawlers Ignore Robots.txt And What It Means For Websites

You updated your robots.txt. You blocked GPTBot, ClaudeBot, and every crawler you could name. You followed every guide and did everything right.

Then you checked your server logs.

The bots were still there.

This isn’t a hypothetical scenario. It’s exactly what happened to Ziff Davis, the parent company behind PCMag, Mashable, and Everyday Health. They followed OpenAI’s own instructions to modify their robots.txt file to block its crawler. According to their lawsuit filed in 2025, the result wasn’t compliance. It was an increase in scraping activity.

So here’s the real question. It isn’t “how do I block AI crawlers,” but something far more uncomfortable: if a crawler ignores your robots.txt, who is legally responsible?

The answer, right now, is: nobody. Not clearly. Not reliably.

That’s the accountability vacuum. And it’s bigger than most people realize.

📌 Key Takeaways

Robots.txt is a voluntary standard with no built-in enforcement mechanism
Courts have rejected DMCA anti-circumvention claims for robots.txt alone (Ziff Davis v. OpenAI, 2025)
Contract law via Terms of Service is currently the strongest legal path, but it’s imperfect
AI crawler non-compliance jumped from 3.3% in Q4 2024 to 13.26% in Q2 2025
The EU AI Act Article 53 (in force August 2025) creates binding obligations, but only in Europe
Cloudflare’s AIndependence feature now blocks AI scrapers by default for new domains
The IETF’s AIPREF working group is building the next-generation standard, but it’s still voluntary
Publishers need a layered defense: robots.txt + ToS + Cloudflare blocking + logging + technical barriers

What Robots.txt Actually Is (And What It Isn’t)

Before we get into the legal tangle, let’s make sure we’re clear on the fundamentals.

Robots.txt is a plain text file sitting at the root of your domain. Its job is to tell automated bots which parts of your site they’re allowed to crawl. That’s it.

It doesn’t lock anything. It doesn’t encrypt anything. There’s no password. There’s no gate.

Think of it like a “Please Do Not Disturb” sign on a hotel room door. A polite request. Nothing more.

As legal technology experts at ByteTunnels explain, the file doesn’t create a legal obligation. It doesn’t form a contract: there’s no offer, no acceptance, no consideration. It relies entirely on the good faith of the bot reading it.
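To make the "sign, not a gate" point concrete, here is a minimal sketch using Python's standard-library `urllib.robotparser`. The robots.txt content and the `example.com` URL are placeholders, and `is_allowed` is a hypothetical helper name. The thing to notice is that the check runs inside the crawler's own code: a bot that never calls it can still request the page.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt for illustration; the paths and domain are placeholders.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Allow: /
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return what a *compliant* crawler would conclude from robots.txt."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# The crawler performs this check voluntarily; nothing on the server enforces it.
gptbot_allowed = is_allowed(ROBOTS_TXT, "GPTBot", "https://example.com/article")
other_allowed = is_allowed(ROBOTS_TXT, "SomeOtherBot", "https://example.com/article")
print(gptbot_allowed, other_allowed)  # False True
```

A well-behaved bot stops when the first value is `False`. A non-compliant one simply skips the check, which is why the file on its own cannot keep anyone out.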

For search engines like Google and Bing, that good faith has generally held. They built their businesses on the trust of the web, and violating robots.txt would be catastrophic for their reputation.

AI training crawlers operate under a different logic entirely.

They aren’t trying to send you traffic. They don’t need your ongoing cooperation. They need your content once, for training, and then they’re done. The incentive structure that keeps search engines honest simply doesn’t apply.

The Scale of the Problem

Let’s look at some numbers, because the scale here is genuinely striking.

According to data from Tollbit’s Q2 2025 report, covered by The Register, non-compliance with robots.txt directives across all AI bots jumped from 3.3% of requests in Q4 2024 to 13.26% in Q2 2025. That’s a fourfold increase in less than a year.

At the same time, the number of sites blocking OpenAI’s GPTBot grew by nearly 70% in just a few months, reaching approximately 5.6 million websites. ClaudeBot and Applebot face similar resistance.

But here’s the part that gets overlooked: blocking isn’t working.

A May 2025 large-scale empirical study published on arXiv found that many bots never even check the robots.txt file at all. And compliance actually dropped when restrictions became stricter, meaning bots were less likely to follow a rule that said “disallow everything” than one that just set rate limits. More restriction, less compliance.

Research published on arXiv tracking reputable news sites from September 2023 to May 2025 found that AI-blocking by legitimate publishers increased from 23% to nearly 60%. Meanwhile, misinformation sites, which have almost no incentive to block training crawlers, saw blocking rise from just 4.6% to 9.2% over the same period.

That asymmetry has serious long-term implications for AI quality. More on that shortly.

📊 Crawler Compliance at a Glance (2025–2026)

Metric	Data Point	Source
Robots.txt non-compliance (Q4 2024)	3.3% of AI bot requests	Tollbit
Robots.txt non-compliance (Q2 2025)	13.26% of AI bot requests	Tollbit
Sites blocking GPTBot (mid-2025)	~5.6 million websites	The Register
Sites blocking ClaudeBot	~5.8 million websites	The Register
ClaudeBot crawl-to-referral ratio	20,583 pages crawled per 1 referral sent	Cloudflare / TechnologyChecker
Reputable sites blocking AI crawlers (May 2025)	~60% (up from 23% in Sept 2023)	arXiv
Top 1M sites accessed by AI bots (June 2024)	39%, but only 2.98% blocked or challenged those requests	Cloudflare
Cloudflare AI blocker adoptions (2024–2025)	Over 1 million customers enabled it	Cloudflare Blog

Two Legal Theories. Two Problems.

When publishers realized blocking wasn’t working, they turned to the courts. Two main legal theories have emerged, and both have significant weaknesses.

Theory 1: DMCA Anti-Circumvention

The Digital Millennium Copyright Act (DMCA) includes a provision under 17 U.S.C. § 1201 that prohibits circumventing “technological measures that effectively control access” to copyrighted works.

The argument: robots.txt is a technological measure. Ignoring it circumvents access controls. Therefore, violating it should trigger DMCA liability.

It’s a clean theory. Courts haven’t bought it, at least not yet.

In Ziff Davis v. OpenAI (No. 1:25-cv-04315, S.D.N.Y. 2025), Ziff Davis made exactly this argument. The court dismissed it. As summarized by technology law firm FKKS, the ruling found that robots.txt is “more akin to a sign than a barrier.” It doesn’t technically prevent access; it just asks for compliance. Any HTTP client can fetch any URL regardless of what robots.txt says. Therefore, it fails the “effectively controls access” test that the DMCA requires.

That said, this ruling isn’t final. Reddit has since filed a lawsuit against Perplexity AI using a DMCA Section 1201 theory combined with other technical measures, including rate limits, CAPTCHA protection, and anomaly-detection tools. The argument there is that bypassing the combination of measures constitutes circumvention. That case, Reddit v. SerpApi, is still pending.

The lesson from DMCA cases so far: robots.txt alone won’t support a DMCA claim under current case law, but it remains evidence of lack of authorization when combined with other technical measures.

Theory 2: Breach of Contract via Terms of Service

The second theory is simpler, but also messier.

If a website’s Terms of Service explicitly prohibit scraping, and a bot operator is deemed to have agreed to those terms, then scraping in violation of robots.txt could constitute a breach of contract.

Here’s where understanding the type of agreement matters critically.

Browsewrap agreements are terms buried in a footer or on a separate page that users never explicitly accept; the assumption is that merely visiting a site binds you to them. Courts have been deeply skeptical of enforcing these, especially against commercial actors who never clicked anything to agree.

Clickwrap agreements require an active, affirmative action, such as clicking “I Agree” during signup or login. These create a much more defensible contract. If a company created an account on a platform, clicked through a ToS that explicitly prohibits scraping, and then scraped anyway, that’s a clear breach of contract regardless of what robots.txt says.

Reddit’s lawsuit against Anthropic (Reddit v. Anthropic, No. 3:25-cv-05643, N.D. Cal. 2025) navigates this tension directly. Reddit argues that Anthropic, as a sophisticated commercial actor with actual knowledge of Reddit’s terms, cannot plausibly claim ignorance of the browsewrap restrictions. That “sophisticated commercial actor” framing may shift how courts evaluate browsewrap enforceability, but it’s still untested at trial.

There’s a critical subplot here too. Reddit already has licensing agreements with both Google and OpenAI. In its separate suit against Perplexity, Reddit alleges that Perplexity obtained that same data through third-party scraping services, essentially circumventing a licensing ecosystem that already existed. That framing shifts the complaint from “you scraped us” to “you undercut our business model.” Courts tend to take business injury arguments more seriously.

Case Studies: What’s Actually Happening in the Courts

Case Study 1: Ziff Davis v. OpenAI – The Company That Did Everything Right

Ziff Davis is one of the most instructive cases precisely because they followed the rules.

When OpenAI published guidance on how publishers could block GPTBot via robots.txt, Ziff Davis complied immediately. They updated their robots.txt files across their portfolio of health and technology publications.

The result, according to their complaint: scraping activity increased. OpenAI’s bots didn’t slow down; they accelerated.

Ziff Davis sued for copyright infringement, DMCA violations, unjust enrichment, and trademark dilution. The DMCA claim was dismissed on the grounds described above. The copyright and unjust enrichment claims are still active as of mid-2026.

The key lesson here isn’t just legal; it’s strategic. Updating robots.txt doesn’t necessarily signal protection to a determined crawler. It may signal something closer to the opposite: that you’ve identified content worth protecting.

Case Study 2: Reddit v. Anthropic + Perplexity – The Multi-Front War

Reddit’s battle against AI companies is unfolding on multiple fronts simultaneously, and it illustrates just how complex the accountability picture has become.

In June 2025, Reddit sued Anthropic in California state court on six theories: breach of contract, unjust enrichment, trespass to chattels, tortious interference, unfair competition, and more. According to Troutman Pepper Locke’s analysis, Reddit’s complaint reads less like a standard copyright case and more like a description of cyber-intrusion, with allegations of identity masking, proxy usage, and industrial-scale scraping designed to evade detection.

Then, in October 2025, Reddit sued Perplexity in federal court in New York, using the DMCA Section 1201 theory and alleging that Perplexity used SerpApi and proxy services to circumvent Reddit’s technical defenses. Not just robots.txt, but rate limits, IP detection, and bot-blocking systems.

The difference in venue matters. Federal court opens up different legal remedies and different precedent-setting potential.

Lesson: Reddit’s multi-theory, multi-court approach is deliberate. No single theory is strong enough on its own. The goal is to establish precedent across as many fronts as possible.

Case Study 3: The OpenAI DDoS Incident – Infrastructure as Victim

This one doesn’t involve a lawsuit, but it should be mentioned.

In January 2025, a seven-person company had its website effectively taken offline when OpenAI’s bot hit it with what TechCrunch described as DDoS-level traffic. The bot crawled so aggressively that the server couldn’t handle the load and collapsed under the weight of what were, technically, legitimate crawl requests.

The company had robots.txt in place. It didn’t help.

For small publishers on shared hosting or usage-based infrastructure, this isn’t a theoretical risk. It’s an operational one. According to research from ALM Corp, AI crawlers often request complete page content including images, scripts, and assets, generating real infrastructure costs for publishers that receive no corresponding traffic or revenue in return.

The Crawl-to-Referral Problem Nobody Talks About

AI crawlers extract content with no reciprocity

Here’s a number worth sitting with.

Cloudflare’s analysis of robots.txt data across its network found that Anthropic’s ClaudeBot crawls 20,583 pages for every single referral it sends back to publishers. OpenAI’s ratio is 1,255:1. Meta’s crawlers send zero referrals.
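If you want to compute the same kind of ratio for your own site, the arithmetic is simple. The sketch below assumes you already have two counts from your own logs and analytics: pages fetched by a given crawler and referral visits attributed to the same operator. The function name and the sample counts are hypothetical, chosen only so the result matches the ClaudeBot figure quoted above.

```python
def crawl_to_referral_ratio(pages_crawled: int, referrals: int) -> float:
    """Pages crawled per referral sent; infinite when no referrals arrive."""
    if referrals == 0:
        return float("inf")
    return pages_crawled / referrals


# Hypothetical counts from your own logs, not Cloudflare's underlying data.
ratio = crawl_to_referral_ratio(pages_crawled=41_166, referrals=2)
print(f"{ratio:,.0f}:1")  # 20,583:1
```

A crawler that sends no referrals at all produces an infinite ratio, which is the case described for Meta’s crawlers.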

Compare that to Googlebot, which has built its entire business model on the idea that crawling creates value for both parties. Google takes your content and sends you traffic. The value exchange is imperfect, but it exists.

AI training crawlers often take content and offer nothing in return. No traffic. No citations. No licensing fees. No acknowledgment.

That’s not a neutral transaction. It’s extraction.

And yet there’s currently no legal framework that treats it as such. Copyright law asks whether copying occurred and whether it was transformative. It doesn’t ask whether the economic relationship between publisher and crawler was fair.

This is arguably the deepest hole in the accountability framework: the absence of any concept of reciprocal value in how we evaluate AI crawler behavior.

⚠️ Common Mistakes Publishers Make

Relying on robots.txt alone: It’s a request, not a barrier. Without layered technical measures, it provides minimal legal protection on its own.
Using browsewrap ToS without clickwrap backup: Browsewrap agreements (terms buried in a footer) are hard to enforce. If your platform allows account creation, ensure your signup flow includes explicit acceptance of anti-scraping terms.
Blocking all AI crawlers indiscriminately: Training bots and retrieval bots are different. Blocking retrieval bots removes you from AI search results — traffic you might actually want.
Not logging bot activity: Courts need evidence. Server logs showing which bots accessed what, when, and how frequently are essential if you ever plan to pursue legal action.
Assuming compliance without verification: Actually check whether the bots you’ve blocked are still appearing in your logs. Some are. Regularly.
Ignoring infrastructure-level tools: Robots.txt is a text file. Cloudflare’s bot management is an active filter. The two aren’t equivalent — use both.
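The "verify, don’t assume" advice can be sketched as a small log scan. This example assumes your server writes the common "combined" access-log format and matches bots by user-agent substring. The sample log lines, IP addresses, and user-agent strings are invented for illustration, and `count_ai_bot_hits` is a hypothetical helper name. Keep in mind that user-agent strings can be spoofed, so this only catches bots that identify themselves.

```python
import re
from collections import Counter

# User-agent tokens named in this article; extend the list for your own site.
AI_BOT_TOKENS = ["GPTBot", "ClaudeBot", "Meta-ExternalAgent", "PerplexityBot"]

# Combined Log Format: ip - - [time] "request" status bytes "referer" "user-agent"
LOG_LINE = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] "(?P<request>[^"]*)" '
    r'(?P<status>\d{3}) \S+ "[^"]*" "(?P<agent>[^"]*)"$'
)


def count_ai_bot_hits(lines):
    """Count requests per AI bot token, based on the user-agent field."""
    hits = Counter()
    for line in lines:
        match = LOG_LINE.match(line.strip())
        if not match:
            continue  # skip lines that are not in the assumed format
        agent = match.group("agent").lower()
        for token in AI_BOT_TOKENS:
            if token.lower() in agent:
                hits[token] += 1
                break
    return hits


# Invented sample lines (documentation IP ranges, illustrative user agents).
SAMPLE_LOG = [
    '203.0.113.7 - - [10/Jun/2025:13:55:36 +0000] "GET /reviews/laptop HTTP/1.1" '
    '200 5120 "-" "Mozilla/5.0 (compatible; GPTBot/1.2)"',
    '198.51.100.4 - - [10/Jun/2025:13:56:01 +0000] "GET / HTTP/1.1" '
    '200 2048 "-" "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"',
    '203.0.113.9 - - [10/Jun/2025:13:57:12 +0000] "GET /news HTTP/1.1" '
    '200 4096 "-" "Mozilla/5.0 (compatible; ClaudeBot/1.0)"',
]

hits = count_ai_bot_hits(SAMPLE_LOG)
print(dict(hits))
```

Run against real logs on a schedule, a non-zero count for a bot you have disallowed is the signal to investigate further.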

The Contrarian View: Blocking May Make Things Worse

Here’s an insight that doesn’t get discussed enough.

The more reputable publishers block AI crawlers, the more AI models end up training on content from sources that don’t block, which includes misinformation sites, low-quality aggregators, and sources with no interest in accuracy.

The arXiv research tracking publisher robots.txt behavior puts this starkly: reputable news sites are blocking AI crawlers at nearly 60%. Misinformation sites are blocking at under 10%.

If that trend continues, the quality of AI training data will systematically skew toward lower-credibility sources. Not because AI companies are choosing bad data, but because the good data has opted out.

This creates an uncomfortable paradox for publishers. Every additional site that blocks AI training crawlers is, in aggregate, making future AI models slightly less reliable. The short-term protection of your content may contribute to long-term degradation of the information ecosystem you depend on.

There’s no clean answer here. But it’s a tradeoff that deserves to be made consciously, not accidentally.

The Infrastructure Response: Cloudflare Changes the Default

While publishers and courts were battling over legal theory, Cloudflare took a different approach: make protection the default.

In July 2024, Cloudflare launched its one-click “AI Scrapers and Crawlers” toggle, available to all customers including those on the free tier. Navigate to Security > Bots in your dashboard, flip the switch, and Cloudflare begins blocking verified AI crawlers immediately. The feature auto-updates as new bot fingerprints are identified. Over one million customers enabled it within the first year.

Then in July 2025, Cloudflare went further. Every new domain registering with Cloudflare is now asked upfront whether it wants AI crawlers to access its content, shifting from opt-out to opt-in. The company also launched a “Pay Per Crawl” beta, allowing publishers to charge AI companies for access rather than simply allowing or blocking them.

This is significant for a few reasons.

First, it provides a technical enforcement layer that robots.txt cannot. Cloudflare operates at the infrastructure level: it can block requests before they reach your server, regardless of whether the bot reads your robots.txt.
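The difference between a request and an enforced refusal can be sketched at the origin server too. The example below is not how Cloudflare implements its blocking (which runs at the network edge and relies on more than user-agent strings). It is a minimal WSGI middleware, with hypothetical names `block_ai_bots`, `site`, and `status_for`, that refuses matching requests with a 403 whether or not the client ever read robots.txt.

```python
from wsgiref.util import setup_testing_defaults

# Illustrative tokens only; user-agent matching is easy to spoof.
BLOCKED_TOKENS = ("gptbot", "claudebot")


def block_ai_bots(app):
    """Wrap a WSGI app so matching user agents get 403 before the app runs."""
    def middleware(environ, start_response):
        agent = environ.get("HTTP_USER_AGENT", "").lower()
        if any(token in agent for token in BLOCKED_TOKENS):
            start_response("403 Forbidden", [("Content-Type", "text/plain")])
            return [b"Automated access is not permitted.\n"]
        return app(environ, start_response)
    return middleware


def site(environ, start_response):
    """Stand-in for the real site."""
    start_response("200 OK", [("Content-Type", "text/plain")])
    return [b"article body\n"]


def status_for(user_agent: str) -> str:
    """Send one fake request through the middleware and return the status line."""
    environ = {}
    setup_testing_defaults(environ)
    environ["HTTP_USER_AGENT"] = user_agent
    seen = []
    body = block_ai_bots(site)(environ, lambda status, headers: seen.append(status))
    list(body)  # consume the response body
    return seen[0]


print(status_for("Mozilla/5.0 (compatible; GPTBot/1.2)"))  # 403 Forbidden
print(status_for("Mozilla/5.0 (Windows NT 10.0)"))         # 200 OK
```

Unlike robots.txt, the refusal here does not depend on the bot’s cooperation, though a bot that disguises its user agent would still get through this particular check.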

Second, and legally important, Cloudflare now allows AI companies to declare their crawler’s purpose, whether for training, inference, or search. That metadata creates a paper trail. If a company declares its crawler is for “search” but uses the data for training, that misrepresentation potentially strengthens any subsequent legal claim.

Third, it represents an emerging market model. Publishers including Condé Nast, Dotdash Meredith, The Associated Press, TIME, and Reddit have already endorsed the permission-based framework. When enough publishers adopt it, AI companies that ignore it face reputational and business consequences beyond any single lawsuit.

Cloudflare handles roughly 20% of global internet traffic. That’s not a niche tool. That’s infrastructure.

What the Courts Still Haven’t Decided

The legal landscape is in genuine flux. Several questions that will shape publisher rights for years remain unanswered:

Does robots.txt establish implied license?

Earlier case law (like Field v. Google) found that the absence of disallow rules implied consent to indexing. Does the presence of disallow rules imply all access is unauthorized? Courts haven’t ruled definitively.

Can browsewrap ToS bind sophisticated commercial actors?

Reddit’s theory that a billion-dollar AI company can’t plead ignorance of publicly posted terms is novel and potentially powerful. It’s also untested at trial.

Does Section 1201 DMCA apply to technical measures beyond DRM?

The Second Circuit’s pending review of Yout v. RIAA could reshape how courts interpret “technological measures that effectively control access”, directly affecting whether robots.txt-plus-additional-barriers triggers DMCA protection.

Where does fair use end for AI training data?

The Thomson Reuters v. Ross Intelligence ruling in early 2025 found fair use didn’t protect AI-assisted reproduction of headnotes. How that reasoning applies to web-scale training data is still being worked out.

Every one of these questions represents a gap where the current legal system has no clear answer. That’s the accountability vacuum in practice.

The IETF Is Building the Next Layer – But It’s Still Voluntary

The Internet Engineering Task Force recognized that robots.txt was never designed for this problem. In January 2025, it chartered the AI Preferences (AIPREF) Working Group, a body tasked with creating a new technical standard that separates content access from content use.

The core idea is elegant. Right now, robots.txt only answers one question: can you crawl this? The new standard would allow site owners to say something more nuanced: you can crawl this for search, but not for training. You can index this, but not use it for RAG. You can access this page, but not build a model on it.

The proposed syntax would look something like this:

Content Usage Policy — Opt Out of AI Training

User-Agent: *
Allow: /

Content-Usage: train-ai=n

That single line would tell any AIPREF-compliant crawler: you can access this content, but you may not use it to train AI models.
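As a rough illustration of how a crawler might read that line, here is a small parser. The syntax is still a draft, so everything here is an assumption based only on the example above: that preferences are `key=y` or `key=n` pairs, optionally comma-separated. The function names `parse_content_usage` and `may_use_for_training` are hypothetical.

```python
def parse_content_usage(line: str) -> dict[str, bool]:
    """Parse a line like 'Content-Usage: train-ai=n' into {'train-ai': False}.

    Assumes comma-separated key=y / key=n pairs; the final standard may differ.
    """
    name, _, value = line.partition(":")
    if name.strip().lower() != "content-usage":
        raise ValueError("not a Content-Usage line")
    preferences = {}
    for item in value.split(","):
        key, sep, flag = item.strip().partition("=")
        if not sep or flag not in ("y", "n"):
            continue  # ignore anything this sketch does not understand
        preferences[key] = flag == "y"
    return preferences


def may_use_for_training(preferences: dict[str, bool]):
    """True or False if the site stated a preference; None if it stated nothing."""
    return preferences.get("train-ai")


prefs = parse_content_usage("Content-Usage: train-ai=n")
print(prefs, may_use_for_training(prefs))  # {'train-ai': False} False
```

Returning `None` for an unstated preference keeps "no opinion expressed" distinct from an explicit yes or no, which matters because the standard separates access from use.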

IETF meeting minutes from April 2026 show the working group wrestling with genuinely hard questions, like what to do when users paste URLs directly into AI models (bypassing robots.txt entirely), and how to handle RAG pipelines that feed back into training loops.

But co-chair Mark Nottingham has acknowledged that current systems are failing, and that the new standard will also be voluntary. Just like robots.txt. Just like everything before it.

The IETF doesn’t make law. It makes standards. Compliance remains a choice.

🔄 Robots.txt vs. AIPREF: What Changes

Feature	Robots.txt (Current)	AIPREF (Proposed)
Controls access?	✅ Yes (by crawler type)	✅ Yes (inherits from robots.txt)
Controls use of content?	❌ No	✅ Yes (train-ai, search, RAG)
Granularity	Domain/path level	Domain/path + use-case level
Legal enforcement	Voluntary / court-dependent	Voluntary / still being defined
HTTP header support	❌ No	✅ Yes (Content-Usage header)
Industry adoption	Universal (30+ years)	In progress (2026 milestone)

What Publishers Should Actually Do Right Now

The legal framework will take years to settle. The AIPREF standard may take equally long. So what do you do today?

Build a layered defense

Robots.txt alone is insufficient. The combination most defensible in court includes robots.txt directives, explicit ToS prohibition on scraping, Cloudflare-level bot blocking, server-side rate limiting, and logging of all bot activity. As legal analysis from FKKS explains, each layer on its own is weak; together they start to build the case that access was unauthorized, unwanted, and willful.
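Of the layers listed, server-side rate limiting is the easiest to sketch in code. The example below is a minimal in-memory token bucket keyed by client (for instance, an IP address). The class names, the rate, and the capacity are illustrative assumptions. A production setup would more likely use the rate limiting built into your web server, CDN, or framework.

```python
import time


class TokenBucket:
    """Allow bursts up to `capacity`, refilled at `rate_per_sec` tokens per second."""

    def __init__(self, rate_per_sec: float, capacity: float, clock=time.monotonic):
        self.rate = rate_per_sec
        self.capacity = capacity
        self.clock = clock
        self.tokens = capacity
        self.updated = clock()

    def allow(self) -> bool:
        now = self.clock()
        elapsed = now - self.updated
        self.updated = now
        # Refill for the time that has passed, never beyond capacity.
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False


class RateLimiter:
    """One bucket per client key (for example, an IP address)."""

    def __init__(self, rate_per_sec: float, capacity: float, clock=time.monotonic):
        self._args = (rate_per_sec, capacity, clock)
        self._buckets = {}

    def allow(self, client_key: str) -> bool:
        bucket = self._buckets.get(client_key)
        if bucket is None:
            bucket = self._buckets[client_key] = TokenBucket(*self._args)
        return bucket.allow()


# Demo with a fake clock so the outcome is deterministic.
fake_now = [0.0]
limiter = RateLimiter(rate_per_sec=1, capacity=3, clock=lambda: fake_now[0])
results = [limiter.allow("203.0.113.7") for _ in range(5)]
print(results)  # [True, True, True, False, False]

fake_now[0] = 2.0  # two seconds later, two tokens have been refilled
after_wait = limiter.allow("203.0.113.7")
print(after_wait)  # True
```

Logging each refused request alongside the client key also feeds the documentation layer, since it records that the limit existed and was hit.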

Enable Cloudflare’s AI blocker if you’re on their network

It’s free, takes under a minute to enable, and operates at the infrastructure level, meaning bots can’t bypass it the way they bypass robots.txt. Navigate to Security > Bots > AI Scrapers and Crawlers. Done. For ad-supported publishers, the newer option to block AI bots only on ad-serving pages adds another layer of protection precisely where revenue matters most.

Use clickwrap, not just browsewrap, for your ToS

If your platform allows account creation, ensure your signup flow includes an explicit, affirmative acceptance of terms that prohibit AI scraping. Browsewrap terms (buried in a footer) are hard to enforce. Clickwrap agreements, where users actively click “I Agree”, create an enforceable contract.

Be strategic about what you block

Not all AI crawlers are the same. Cloudflare’s robots.txt data shows PerplexityBot and ChatGPT-User, bots that power search results and send referral traffic, are actually welcomed by more sites than they’re blocked. Blocking training crawlers while allowing retrieval bots is more nuanced and often more commercially sensible than blocking everything.
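A selective policy like this can be generated and checked programmatically. The sketch below builds a robots.txt that disallows the training crawlers named in this article and allows the retrieval bots, then verifies the result with Python's standard-library parser. The grouping of bots follows the article's own lists; `build_robots_txt` and the sample path are hypothetical, and you should confirm current user-agent names against each operator's documentation.

```python
from urllib.robotparser import RobotFileParser

# Groupings as described in this article; verify names before relying on them.
TRAINING_BOTS = ["GPTBot", "ClaudeBot", "Meta-ExternalAgent"]
RETRIEVAL_BOTS = ["ChatGPT-User", "OAI-SearchBot", "PerplexityBot"]


def build_robots_txt(blocked, allowed) -> str:
    """Emit one group per bot, then a permissive default group."""
    groups = [f"User-agent: {agent}\nDisallow: /\n" for agent in blocked]
    groups += [f"User-agent: {agent}\nAllow: /\n" for agent in allowed]
    groups.append("User-agent: *\nAllow: /\n")
    return "\n".join(groups)  # blank line between groups


robots_txt = build_robots_txt(TRAINING_BOTS, RETRIEVAL_BOTS)

# Check the generated file the way a compliant crawler would read it.
parser = RobotFileParser()
parser.parse(robots_txt.splitlines())
training_blocked = not any(parser.can_fetch(bot, "/articles/1") for bot in TRAINING_BOTS)
retrieval_allowed = all(parser.can_fetch(bot, "/articles/1") for bot in RETRIEVAL_BOTS)
print(training_blocked, retrieval_allowed)  # True True
```

Generating the file from two explicit lists makes the business decision visible in one place, and the parser check catches typos before the file is published.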

Document everything

If you ever pursue legal action, you need evidence of specific bot behavior over time. Keep server logs. Record when you updated robots.txt. Note when you saw crawl activity from named bots after issuing disallow directives. This documentation is what turns a complaint into a lawsuit.
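One way to "record when you updated robots.txt" is to snapshot the file each time it changes. The sketch below is an assumption about what a useful record might contain (a UTC timestamp, a SHA-256 hash, and the content itself), with hypothetical function names. What actually counts as usable evidence is a question for your attorney.

```python
import hashlib
import json
from datetime import datetime, timezone


def snapshot_robots_txt(robots_txt: str, now=None) -> dict:
    """Build a timestamped, hashed record of the robots.txt content."""
    now = now or datetime.now(timezone.utc)
    return {
        "recorded_at": now.isoformat(),
        "sha256": hashlib.sha256(robots_txt.encode("utf-8")).hexdigest(),
        "content": robots_txt,
    }


def to_jsonl_line(record: dict) -> str:
    """Serialize one record as a single JSON line for an append-only log."""
    return json.dumps(record, sort_keys=True)


# Fixed timestamp for the demo; omit `now` to use the current time.
record = snapshot_robots_txt(
    "User-agent: GPTBot\nDisallow: /\n",
    now=datetime(2025, 6, 1, tzinfo=timezone.utc),
)
line = to_jsonl_line(record)
print(record["recorded_at"])  # 2025-06-01T00:00:00+00:00
```

Appending each line to a file you never rewrite gives you a dated history of your directives that can be lined up against the bot activity in your server logs.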

Monitor the AIPREF process

The standard is still being written. If you run a publisher, a media company, or any content-heavy site, following the AIPREF working group’s progress while standards are being shaped is a real opportunity to influence how this technology evolves.

Before vs. After: The Shift in Publisher Strategy

Before 2023, robots.txt was essentially a search engine tool. Publishers used it to guide crawlers toward the content they wanted indexed and away from the content they didn’t. Compliance was near-universal. The question was never “will the bot obey?” It was “what should we tell it to do?”

After 2025, the same file has become part of a contested legal strategy, a data point in multi-million-dollar lawsuits, and evidence of notice in copyright disputes. Publishers who used to think about robots.txt once and forget it now need to think about it as a living document that intersects with legal strategy, business model, and infrastructure management simultaneously.

The technical file hasn’t changed. The world around it has.

The Deeper Problem: Law Built for a Different Era

Every legal framework being applied to this problem was built before generative AI existed.

The DMCA was written in 1998 to stop DVD piracy. The CFAA was written to prosecute hackers. Copyright law evolved to protect against human reproduction of creative works. None of these frameworks anticipated a situation where a company could read every page on the internet simultaneously, extract patterns without storing verbatim text, and use the resulting model to produce outputs that compete directly with the original sources.

In the US, no federal statute specifically addresses AI training data rights. Publishers are left stitching together DMCA, CFAA, copyright, and contract theories, each with significant gaps.

In the EU, the situation is meaningfully different. Article 53 of the EU AI Act, which came into force on August 2, 2025, requires providers of general-purpose AI models to put in place a policy to comply with EU copyright law, specifically including the obligation to identify and respect rights reservations expressed by content owners under Article 4(3) of the DSM Directive. That obligation applies regardless of where the AI company is based, as long as the model is placed on the EU market.

The EU’s GPAI Code of Practice, published in July 2025 alongside Article 53’s entry into force, goes further: it commits signatories to using crawlers that read and respect the robots.txt protocol per RFC 9309, and to respecting other machine-readable opt-outs. It’s voluntary as a code, but it serves as the compliance pathway under a binding legal obligation.

A German court ruling in Kneschke v. LAION further affirmed that rightsholders outside scientific research contexts have the right to opt out of text and data mining, and that a natural language opt-out in terms of use doesn’t qualify. The opt-out must be machine-readable. Robots.txt counts. A paragraph in your footer does not.

In the US, nothing equivalent exists yet. That gap is the accountability vacuum.

If you’re ready to build your technical defenses now, see our comprehensive, practical guide to setting up a robots.txt file, which includes copy-and-paste templates as well as server-level protection steps.

The Bottom Line

The accountability vacuum is real, and it’s not going to close itself.

Robots.txt was a gentleman’s agreement designed for a web that operated on different incentives. AI training crawlers don’t share those incentives. And the legal system built for DMCA-era problems doesn’t neatly map onto AI-era realities.

What we’re watching unfold, across courtrooms in New York and California, in IETF working groups in Brussels and Bangkok, in Cloudflare’s infrastructure decisions affecting 20% of the web, and in the EU’s binding obligations under Article 53, is the early, messy process of a new accountability framework being built in real time.

Publishers who understand the landscape have a better chance of protecting themselves. That means layered technical defenses, aligned legal documentation, strategic blocking decisions, and close attention to the case law and standards being written right now.

The sign on your door matters. It just needs better locks behind it.

For a broader look at how robots.txt became AI’s accidental governance layer, see our companion piece: From Crawl Control to AI Governance: The Unintended Evolution of Robots.txt.

⚖️ Legal Disclaimer

This article is intended for informational purposes only and does not constitute legal advice. The legal landscape around AI crawlers, robots.txt, and web scraping is actively evolving, and case outcomes referenced here may change. Laws and obligations vary significantly by jurisdiction. If you are facing a specific legal situation involving AI data scraping, copyright infringement, or Terms of Service enforcement, consult a qualified attorney in your jurisdiction before taking action.

Frequently Asked Questions (FAQs)

Is robots.txt legally binding?

No. Robots.txt is a voluntary technical protocol with no built-in legal enforcement. Courts have found it functions more like a “no trespassing” sign than a legal barrier. Ignoring it doesn’t automatically create liability — but it can be used as evidence of bad faith or willful disregard in broader legal disputes involving DMCA, contract, or copyright claims.

What happened in Ziff Davis v. OpenAI?

Ziff Davis sued OpenAI after following OpenAI’s own instructions to block its crawler via robots.txt — only to see scraping activity increase. The court dismissed their DMCA anti-circumvention claim, ruling that robots.txt doesn’t “effectively control access” as required by DMCA Section 1201. Copyright and unjust enrichment claims remain active as of mid-2026.

What is the difference between browsewrap and clickwrap agreements?

Browsewrap agreements are terms that users are assumed to accept simply by using a website — no explicit agreement required. Courts are skeptical of enforcing these. Clickwrap agreements require an active affirmative action (clicking “I Agree”) during account creation or login. These are far more legally defensible. For publishers building a legal case against scrapers, having clickwrap ToS that explicitly prohibit AI training scraping is significantly stronger than browsewrap terms buried in a footer.

What is the IETF AIPREF working group?

AIPREF is a working group chartered in January 2025 to develop a new internet standard that extends beyond robots.txt. It aims to let content owners express preferences not just about whether content can be crawled, but how it can be used, separating access rights from use rights. The proposed Content-Usage mechanism would allow distinctions like “index for search, but don’t use for AI training.” The standard is still in development with an August 2026 milestone for key specifications.

Should I block all AI crawlers from my site?

Not necessarily. There’s an important distinction between training crawlers (GPTBot, ClaudeBot, Meta-ExternalAgent) and retrieval/search crawlers (ChatGPT-User, OAI-SearchBot, PerplexityBot). Training crawlers take your content for model development and typically send no traffic back. Retrieval bots power AI search results and may generate referral visits. A nuanced strategy blocks training crawlers while explicitly allowing retrieval bots.

What does the EU AI Act say about web scraping?

Article 53 of the EU AI Act, in force since August 2, 2025, requires providers of general-purpose AI models to implement policies complying with EU copyright law, specifically respecting rights reservations under Article 4(3) of the DSM Directive. This applies to any AI company placing a model on the EU market, regardless of where they’re based. The GPAI Code of Practice published alongside it commits signatories to respecting robots.txt per RFC 9309 and other machine-readable opt-outs.

Will AIPREF actually solve the problem?

Technically, it will be a meaningful improvement, allowing much more precise expression of content preferences. But practically, compliance will still be voluntary. AIPREF co-chair Mark Nottingham has acknowledged this directly. The standard will only matter if AI companies choose to honor it, or if regulators require them to. Without enforcement, it becomes another robots.txt: a polite request that determined actors can ignore.
