The Definitive Evolution Of Robots.txt: AI Governance

In February 1994, a Dutch software engineer named Martijn Koster sat down at his keyboard and wrote a short message to the www-talk mailing list at CERN.

His server was being hammered.

A poorly written crawler, built by science fiction author Charles Stross, of all people, had crawled Koster’s site so aggressively it had effectively taken it offline. No malice. No intent. Just a bot with no manners and no mechanism to know better.

Robots.txt Evolution

Introduction

The Accidental Constitution

In 1994, a simple 30-line text file was created to stop a runaway crawler from crashing a server. Thirty years later, that same file is at the center of global AI governance debates.

⚡ The Core Concept

What started as a “Do Not Disturb” sign for early web crawlers has unintentionally evolved into the de facto governance layer for AI training data.

Era 1: 1993–1994

The Origin Story

Martijn Koster’s server was being hammered by an aggressive bot. His solution was elegant: a plain text file at the root of domains telling bots where they were welcome.

🔍 The Design Flaw

Built on “trust, not enforcement.” It assumed good faith and had no technical teeth. This design choice from 1994 is still causing headaches in the AI era.

Era 2: 1994–2019

The Search Engine Era

For 25 years, robots.txt worked remarkably well. Search engines needed publishers, and publishers needed search traffic. The relationship was mutually beneficial.

🛠 Mutual Dependence

Violating robots.txt meant business suicide for search engines. Compliance was driven by market dynamics, not legal enforcement. It became invisible infrastructure.

Era 3: 2019–2023

The Standardization Gap

It took 28 years for robots.txt to become an official internet standard. In September 2022, the IETF published RFC 9309, formalizing a protocol already universally adopted.

⏳ Why The Delay?

Formal standardization is slow. The informal status quo worked perfectly for search engines, removing any urgent incentive to codify it until new, extractive crawlers emerged.

Era 4: Aug 2023

The GPTBot Pivot

OpenAI launched GPTBot. Unlike search engines, AI crawlers take content and send back nothing. It was purely extractive, fundamentally breaking the mutual dependence model.

📊 Rapid Adoption

Within 12 months, 35.7% of the top 1,000 websites blocked GPTBot. The fastest adoption shift in robots.txt history. A technical file became a political statement.

The Data Asymmetry

A Looming AI Quality Crisis

As reputable news sites block AI crawlers (reaching ~60%), misinformation sites barely block them (under 10%). Good data is opting out, leaving AI to train on the bad.

📉 The Impact

The publishers who care most about accuracy are excluding themselves. Every block increases the relative proportion of low-quality, misleading content in AI training datasets.

The Irony

Blocking Doesn’t Stop Citations

Research shows that blocking AI crawlers does not reliably reduce citation rates. AI models rely on historical data they’ve already ingested, rendering new blocks ineffective.

🛡 Traffic Loss

Publishers who block AI crawlers see a 23.1% decline in monthly visits, yet AI systems still cite them at similar rates. The ultimate irony of the governance gap.

Era 5: 2025–Present

Becoming A Policy Instrument

The EU AI Act (Article 53) made history by attaching binding legal obligations to robots.txt compliance. It transformed from a social norm into a legal requirement.

⚖️ Fragmented Governance

In Europe, ignoring robots.txt carries legal consequences. In the US, it provides minimal protection under current case law. The governance landscape is split.

The Future

Enter AIPREF

The IETF chartered the AIPREF Working Group to build robots.txt’s successor. Why? Because the current protocol controls *access*, but AI demands control over *use*.

💡 The Solution

AIPREF proposes a Content-Usage mechanism: allow crawling for search, but prohibit AI training. It separates “can you visit” from “what can you do with it.”

The Bottom Line

A Governance Gap

Robots.txt looks like it’s doing its job, but in many cases, it’s not. The architecture of mutual trust is gone, replaced by AI crawlers that don’t need ongoing goodwill.

📖 Key Takeaway

The sign on the door hasn’t changed. The world it was designed for has. The purely voluntary model is no longer sufficient for the problems it’s being asked to solve.

Koster’s solution was elegant in its simplicity: a plain text file at the root of every domain, telling crawlers which parts of the site they were welcome to visit. He called it, with characteristic pragmatism, a “proposed standard for robot exclusion.”

Nobody voted on it. Nobody signed it into law. It wasn’t a regulation, a treaty, or even a formal specification. It was a suggestion, a mailing list proposal from one developer to others, built on nothing more substantial than mutual goodwill.

Thirty years later, that same text file is being cited in federal lawsuits, discussed in IETF working groups alongside international policy experts, and treated as the front line of a global debate over who controls the data that trains artificial intelligence.

Koster’s “Do Not Disturb” sign became the web’s accidental constitution.

This is the story of how that happened.

📌 What You’ll Learn in This Article

How robots.txt was born from a single overloaded server in 1994, and why its design reflects that origin
The five distinct eras of robots.txt, from de facto convention to global AI policy instrument
Why it took 28 years for robots.txt to become an official internet standard (RFC 9309, 2022)
The pivotal moment in August 2023 when GPTBot’s launch transformed robots.txt into an AI governance tool
Why nearly 60% of reputable news sites now block AI crawlers, and why that may be making AI worse
What Lawrence Lessig’s “code is law” framework tells us about where robots.txt goes next

Era 1: The Accidental Standard (1993–1994)

The origin story of robots.txt matters more than most people realize, because it explains every limitation the file still has today.

The early web was a chaotic, wonderful experiment. By 1993, the number of websites had grown from a handful to thousands. Alongside them came a new kind of software: the web crawler. These automated programs roamed the young internet, following links, indexing content, building the primitive databases that would eventually become search engines.

The problem was that nobody had agreed on any rules.

Some crawlers were responsible. Others were not. And a site owner in 1993 had no mechanism to communicate preferences to a bot, or even to know one had visited.

The immediate trigger for robots.txt, as documented by Koster himself on his personal site, was a crawler called Websnarf, written by Charles Stross. It hit Koster’s server repeatedly and relentlessly, with no rate limiting, no pause, no consideration for the resources it was consuming. For a server in 1993, this was catastrophic.

Koster’s first public mention of a proposed solution came on February 25, 1994, in a message to the www-talk mailing list, the main communication channel for WWW-related work at the time. He proposed a simple standard: a text file at the root of any domain, readable by any bot, expressing the site owner’s preferences.

By June 1994, the file had been renamed to robots.txt and was rapidly adopted as a de facto standard by the web community. Early search engines such as WebCrawler, Lycos, and AltaVista all began checking it before crawling.

No one voted. No standards body ratified it. It spread because it solved a real problem, and because the web in 1994 was small enough that community norms actually worked.

This is the design philosophy embedded in every robots.txt file that exists today: trust, not enforcement. The file assumes good faith. It has no mechanism to verify compliance, no way to prevent access, no technical teeth at all. It is a request, dressed up to look like a rule.

That design choice was entirely appropriate for 1994. Its consequences are still playing out in 2026.

📅 The Complete Robots.txt Timeline

Year	Event	Significance
Feb 1994	Koster proposes Robot Exclusion Protocol on www-talk mailing list	The original proposal — motivated by server overload, not governance
Jun 1994	robots.txt becomes de facto standard; WebCrawler, Lycos adopt it	First era of voluntary compliance begins
1997–2006	Wildcard extensions, Allow directive emerge informally	Protocol grows beyond its original spec through convention
2007	Google, Bing, Yahoo begin harmonizing interpretations	De facto standardization by market dominance
Jul 2019	Google submits REP to IETF for formal standardization; open-sources its parser	First attempt at official status after 25 years
Sep 2022	RFC 9309 published — robots.txt becomes official internet standard	28 years after creation; co-authored by Koster and Google engineers
Aug 2023	OpenAI launches GPTBot — the first major AI training crawler with a named user-agent	The pivot point: robots.txt becomes an AI governance tool overnight
Aug 2024	35.7% of top 1,000 websites block GPTBot — up from 5% at launch	Seven-fold increase in 12 months; fastest adoption shift in robots.txt history
Aug 2025	EU AI Act Article 53 enters force, mandates respect for machine-readable opt-outs	First binding legal obligation tied to robots.txt compliance
Jan 2026	IETF charters AIPREF Working Group to build robots.txt successor	Official acknowledgment that the 1994 standard is no longer sufficient

Era 2: The Search Engine Era – Trust as Infrastructure (1994–2019)

For the first twenty-five years of its life, robots.txt worked remarkably well.

Search engines needed publishers. Publishers needed search engines. The relationship was symbiotic: crawl our content, send us traffic. In that environment, violating robots.txt would have been business suicide for any search engine. The entire model depended on the goodwill of the web.

This era saw robots.txt quietly become one of the most universally adopted technical standards in internet history, not through mandate, but through market dynamics. RFC 9309 itself, co-authored by Koster and Google engineers, was written only after 25+ years of real-world use: the standard was formalized only after it had already become universal.

The informality of that process had real consequences. Without a governing body, the standard evolved through convention rather than coordination. Google introduced the Allow directive to override Disallow rules. Microsoft added wildcard support. Yahoo experimented with crawl-delay parameters. Each engine interpreted edge cases differently, and nobody had final authority to say who was right.

As Search Engine World documented, Google’s 2019 decision to push REP into the formal IETF process was partly driven by this inconsistency problem: the goal was to align interpretations across crawlers and finally create a single documented rule set.

RFC 9309, published in September 2022, was the result: the first official specification of a standard that had been in widespread use for 28 years. But by the time it was published, the world it was designed for was already changing.

Era 3: The Standardization Gap – Why It Took 28 Years

Here’s a question that’s easy to skip over but genuinely revealing: why did it take until 2022 for robots.txt to become an official standard?

The answer says something important about how internet governance actually works.

Formal standardization through the IETF is slow and deliberate. It requires rough consensus, public review, and careful documentation of edge cases and failure modes. For a protocol that already worked, that search engines already honored, that site owners already understood, there was no urgent incentive to go through that process.

The status quo was functional. Informal norms held. The cost of standardization seemed to outweigh the benefit.

That calculation changed in two ways. First, the number of bots had multiplied far beyond what any informal community could track. The web in 2019 was not the web in 1994: it had billions of pages, thousands of crawlers, and edge cases the original proposal never anticipated.

Second, and more importantly, a new category of crawler was emerging that operated under entirely different incentives. AI training bots didn’t need to maintain good relationships with publishers. They didn’t send traffic back. The mutual dependence that had kept search engine compliance near-universal simply didn’t apply.

As one legal scholar’s analysis on ResearchGate notes, Lawrence Lessig’s concept of “code is law” applies directly here: the architecture of robots.txt (its voluntary, trust-based design) embedded a specific set of assumptions about who would be reading it and why. Those assumptions were correct for 1994. They became increasingly fragile as the ecosystem diversified.

RFC 9309 arrived just in time to be tested by a problem it wasn’t designed to solve.

Era 4: The GPTBot Pivot – August 2023

If there is a single moment when robots.txt’s identity changed permanently, it is August 2023.

On August 7, 2023, OpenAI announced GPTBot, the company’s first publicly named web crawler, dedicated to collecting training data for its language models. With the announcement came a simple instruction: if you want to block it, add these lines to your robots.txt.
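As an illustration, the kind of block a publisher adds for GPTBot can be checked with Python’s standard-library `urllib.robotparser`. This is a minimal sketch under stated assumptions: the `example.com` URL and the `can_fetch` helper are placeholders introduced here, and the check only matters if the crawler chooses to run it before fetching.

```python
from urllib.robotparser import RobotFileParser

# A robots.txt group that opts the whole site out of GPTBot crawling.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /
"""


def can_fetch(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if robots_txt permits user_agent to fetch url.

    Hypothetical helper for illustration; a well-behaved crawler would
    run an equivalent check itself before requesting the page.
    """
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    page = "https://example.com/articles/1"  # placeholder URL
    print("GPTBot allowed:", can_fetch(ROBOTS_TXT, "GPTBot", page))
    print("Googlebot allowed:", can_fetch(ROBOTS_TXT, "Googlebot", page))
```

Nothing in this check runs on the publisher’s server. The file states a preference, and the crawler decides whether to honor it.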

The response was unlike anything in the history of the protocol.

According to data tracked by multiple sources including PPC.land, GPTBot blocking started at around 5% of the top 1,000 websites at launch. Within twelve months, it had reached 35.7%. A seven-fold increase in one year.

For context: the adoption rate of robots.txt itself in 1994, when it was brand new and solving an urgent problem, was considered impressive. The rate at which publishers responded to GPTBot was faster.

Why? Because this was the first time that a crawler’s purpose was unambiguously extractive. Search engine crawlers took content and sent back traffic. GPTBot took content and sent back nothing. As Cloudflare’s robots.txt analysis across its network shows, Anthropic’s ClaudeBot crawls 20,583 pages for every single referral it sends back to publishers. OpenAI’s ratio is 1,255 to 1. Meta’s AI crawlers send zero referrals.

That’s not a value exchange. It’s pure extraction.

And publishers responded accordingly, not through legal action (that would come later), but through the only tool available to them: the same plain text file that Martijn Koster had proposed to solve a much simpler problem thirty years earlier.

💡 The Insight Most Articles Miss

The GPTBot launch didn’t just create a new use case for robots.txt. It revealed that the file was already functioning as a governance layer — it just hadn’t been recognized as one. For thirty years, robots.txt was a technical tool. In August 2023, it became a political statement. The same lines of text that once said “please don’t crawl my admin directory” now said “I do not consent to having my content used to train AI.” Same syntax. Entirely different meaning.

Era 5: The Governance Instrument (2025–Present)

By 2025, robots.txt had crossed a threshold it had never crossed before: it began appearing in binding legal and regulatory frameworks.

The most significant development was Article 53 of the EU AI Act, which came into force on August 2, 2025. For the first time in history, a binding legal obligation was attached to robots.txt compliance: providers of general-purpose AI models operating in the EU market must implement policies to respect machine-readable rights reservations, of which robots.txt, per RFC 9309, is the primary example.

A German court ruling in Kneschke v. LAION added further clarity: a natural language opt-out buried in terms of service doesn’t qualify. The opt-out must be machine-readable. Robots.txt counts. A paragraph in your footer does not.

These developments mark a qualitative shift in the file’s status. For thirty years, robots.txt was a social norm, respected because ignoring it was socially costly. Now, in certain jurisdictions, ignoring it carries legal consequences.

But the EU AI Act’s reach is limited by geography. In the United States, no equivalent obligation exists. Publishers relying on robots.txt as a legal instrument in US courts have found it provides limited protection, as the Ziff Davis v. OpenAI ruling demonstrated when a court found that robots.txt fails the “effectively controls access” test required for DMCA protection.

The result is a fragmented governance landscape: robots.txt is legally meaningful in Europe, technically meaningful everywhere, and legally vulnerable in the US.

The Adoption Asymmetry No One Is Talking About

As publishers began blocking AI crawlers in large numbers, a troubling pattern emerged in the data.

Research published on arXiv, tracking publisher robots.txt behavior from September 2023 to May 2025, found that reputable news sites increased AI-blocking from 23% to nearly 60% over the period. Major publishers including The New York Times, The Guardian, CNN, Reuters, and Bloomberg all implemented blocks.

But misinformation sites? Their blocking rate moved from 4.6% to just 9.2% over the same period.

The implication is profound. As credible publishers opt out of AI training data, the relative proportion of low-quality, inaccurate, and misleading content in AI training datasets increases, not because AI companies are choosing bad data, but because the good data has chosen to leave.

A study from arXiv examining 5,818 publishers found that reputable sites forbid an average of 15.5 AI user-agents in their robots.txt files. Misinformation sites prohibit fewer than one.

This is the adoption asymmetry: the publishers who care most about accuracy are opting out at the highest rates. The publishers who care least are staying in.

It’s worth sitting with what this means for the long-term quality of AI systems. Every percentage point increase in reputable publisher blocking is, in aggregate, a shift in the composition of AI training data toward less reliable sources. The short-term protection of intellectual property may be contributing to a long-term degradation of AI quality.

There is no easy resolution to this tension. But it should be part of any serious conversation about how robots.txt is being used as a governance tool.

📊 The Adoption Asymmetry: Reputable vs. Misinformation Sites

Metric	Reputable News Sites	Misinformation Sites
AI blocking rate (Sep 2023)	23%	4.6%
AI blocking rate (May 2025)	~60%	9.2%
Avg. AI user-agents blocked	15.5 agents	Less than 1
Growth in blocking (2 years)	+37 percentage points	+4.6 percentage points
Implication for AI training data	Credible content progressively excluded; lower-quality sources remain accessible

Source: arXiv research (2510.10315), September 2023 – May 2025
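The percentage-point figures in the table can be reproduced with a few lines of arithmetic. The sketch below uses only the rates quoted above and treats the “~60%” figure as exactly 60 for illustration; the function names are introduced here and are not from the study.

```python
# Blocking rates (percent) as quoted in the table; 60.0 stands in for "~60%".
RATES = {
    "reputable": {"sep_2023": 23.0, "may_2025": 60.0},
    "misinformation": {"sep_2023": 4.6, "may_2025": 9.2},
}


def point_change(group: str) -> float:
    """Percentage-point growth in blocking for one group of sites."""
    r = RATES[group]
    return round(r["may_2025"] - r["sep_2023"], 1)


def gap(period: str) -> float:
    """Percentage-point gap between reputable and misinformation sites."""
    return round(RATES["reputable"][period] - RATES["misinformation"][period], 1)


if __name__ == "__main__":
    for group in RATES:
        print(group, "growth:", point_change(group), "points")
    print("gap in Sep 2023:", gap("sep_2023"), "points")
    print("gap in May 2025:", gap("may_2025"), "points")
```

On these inputs the gap between the two groups widens from 18.4 to 50.8 percentage points over the period.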

Three Case Studies That Define the Evolution

The numbers tell one story. These three cases tell it in human terms, showing exactly how robots.txt moved from infrastructure tool to policy instrument, and what that transition cost each party involved.

Case Study 1: The New York Times – From robots.txt to Federal Court in Four Months

The New York Times is the most documented example of a publisher moving through every stage of the robots.txt evolution in rapid succession.

The challenge: When OpenAI launched GPTBot in August 2023, the Times faced the same decision every major publisher faced: block, allow, or wait and see. Unlike many peers, the Times moved immediately on two fronts simultaneously.

The action: According to Editor & Publisher’s contemporaneous reporting, the Times blocked GPTBot in its robots.txt file as early as August 17, 2023, just ten days after GPTBot’s launch. At the same time, they updated their Terms of Service to explicitly prohibit the use of Times content for AI training.

The Times didn’t stop there. By the end of 2023, they blocked Common Crawl’s CCBot, recognizing that a robots.txt block on GPTBot meant nothing if OpenAI could obtain the same content via Common Crawl’s historical archive. In January 2025, the Times began hard-blocking the Internet Archive’s crawlers, citing the Wayback Machine as another vector for AI companies to access archived content without authorization.

The outcome: In December 2023, the Times filed suit against OpenAI and Microsoft, alleging millions of articles had been used to train AI systems that now compete directly with the publication. The lawsuit, still active as of mid-2026, cites both copyright infringement and the ToS update published months earlier.

The lesson: The Times’ approach shows why robots.txt alone was never sufficient. The file was one layer of a multi-front strategy: ToS updates, Common Crawl blocking, Internet Archive restrictions, and legal action. Each layer closed a gap the previous one left open. The four-month arc from robots.txt update to federal lawsuit shows how fast the governance gap can escalate when a determined publisher decides to close it.

Case Study 2: Common Crawl — The Nonprofit That Became AI’s Backdoor

Common Crawl is the most important case study for understanding why robots.txt compliance alone doesn’t solve the data governance problem.

The challenge: Founded in 2007 by Gil Elbaz in San Francisco, Common Crawl began publishing web crawl data publicly in 2011. For most of its early life, it was a modest research project. Then the AI industry discovered it. According to a February 2024 Mozilla Foundation study, 64% of the large language models reviewed used at least one filtered version of Common Crawl for pre-training. GPT-3’s training data was approximately 60–80% derived from it.

The funding picture shifted accordingly. By 2024, over 60% of Common Crawl’s donated funds came from entities affiliated with generative AI companies, including Anthropic, OpenAI, and the Schmidt Foundation. A nonprofit web archive had effectively become the primary data pipeline for a multi-trillion-dollar industry.

The action: Publishers began to realize that blocking GPTBot in robots.txt accomplished almost nothing if Common Crawl had already archived years of their content. The Danish Rights Alliance filed a formal removal request in July 2024. Common Crawl’s attorney responded in December 2024, five months later, confirming approximately 50% of requested content had been removed. A November 2025 investigation by The Atlantic found Times and Danish publisher content still accessible in historical archives, with file-system logs showing no content modifications since 2016.

On April 29, 2026, the News/Media Alliance sent a formal demand letter to Common Crawl requiring four actions: remove publisher content on request, publish a statement that it doesn’t authorize AI training use of scraped content, revise its Terms of Use to prohibit AI training use, and establish a clear opt-out registry.

The outcome: Still unresolved as of mid-2026. Common Crawl’s existing indemnity clauses already cover use of crawled content for “developing, training, or deploying AI Systems.” The NMA letter describes that language as tacit acknowledgment of the use case, not protection from it.

The lesson: Robots.txt cannot retroactively protect content that was scraped before the file was updated. Common Crawl’s historical archive represents a fundamental gap in the opt-out model. The governance tool that robots.txt provides is prospective only, and the past is still very much in play.

Case Study 3: The Internet Archive — When Good Intentions Create Bad Loopholes

The Internet Archive case is the most philosophically complex: a story about how a tool built for preservation became, unintentionally, an infrastructure for extraction.

The challenge: The Wayback Machine has archived over 916 billion web pages since 1996. For that preservation mission, strict robots.txt compliance would be counterproductive: important content would vanish simply because a site owner added a Disallow rule. The Internet Archive has therefore maintained a nuanced position: it honors robots.txt for ongoing crawls but applies different standards to historical preservation.

This made sense when the Wayback Machine served researchers and historians. It made much less sense when analysis of Google’s C4 training dataset by the Washington Post in 2023 showed that web.archive.org ranked as the 187th most present domain in the training data used to build Google’s T5 model and Meta’s Llama models.

The action: Reddit was among the first to connect the dots. After striking a $60 million data licensing deal with Google in February 2024, Reddit became aware that AI companies were scraping archived Reddit content from the Wayback Machine, effectively circumventing the licensing ecosystem Reddit had built. It began severely restricting Internet Archive access, limiting crawls to the homepage only.

The New York Times followed in late 2025, announcing it was hard-blocking Internet Archive crawlers, citing the Wayback Machine as providing “unfettered access to Times content, including by AI companies, without authorization.” According to Nieman Lab’s January 2026 investigation, 93% of publishers in their dataset now disallow at least two Internet Archive crawler bots. In September 2025 alone, Gannett reported blocking 75 million AI bots across its platforms, approximately 70 million from OpenAI.

The outcome: The Internet Archive finds itself in an impossible position: its preservation mission conflicts directly with publisher rights in the AI era. Its executive director has warned that removing archived materials threatens the open web, but publishers are no longer willing to treat the archive as a neutral actor when its contents actively feed AI training pipelines.

The lesson: The Internet Archive case reveals a governance gap that no robots.txt configuration can close. Third parties can archive your content and distribute it to AI companies regardless of what your robots.txt says. The opt-out model assumes the party receiving your directive is the same party using your content. Increasingly, they are not.

A New Insight from the Data: Blocking Doesn’t Stop Citations

Here is something that emerged from 2026 research and fundamentally complicates the entire blocking debate.

A BuzzStream study published in March 2026, drawing on 4 million citations across 3,600 prompts from ChatGPT, Gemini, Google AI Overviews, and AI Mode, found that blocking AI crawlers via robots.txt does not reliably reduce citation rates in AI-generated responses.

Publishers who blocked AI crawlers were still being cited by AI systems, at roughly similar rates to publishers who didn’t block.

The mechanism isn’t hard to understand. AI models are trained on historical data. By the time a publisher adds a block, the model has often already ingested years of their content. Future training may exclude them, but currently deployed models cite from what they already know.

Research from Rutgers Business School and The Wharton School, published December 2025, added the traffic dimension: publishers who blocked AI crawlers experienced a 23.1% decline in monthly visits and a 13.9% decline in human-only browsing, with no corresponding reduction in AI citation rates.

This is the deepest irony in the entire robots.txt story: the tool meant to give publishers control may be reducing their traffic without reducing AI’s use of their content.

It doesn’t mean blocking is wrong; there are still legal and ethical reasons to do it. But it means the governance picture is even more complex than the robots.txt file alone suggests.

“Code Is Law” – And What Happened When the Code Stopped Working

In 1999, Harvard Law professor Lawrence Lessig published a book, and then a widely cited Harvard Magazine article, arguing that in cyberspace, the architecture of technical systems functions as a form of regulation. Code doesn’t just implement rules; it is rules. The design of a system determines what’s possible, what’s permitted, and what’s prevented.

Robots.txt is a near-perfect illustration of Lessig’s thesis, and of its limits.

For twenty-five years, the code worked as intended. The architecture of the web (the mutual dependence between publishers and search engines, the reputational stakes of non-compliance, the small size of the bot community) meant that the technical protocol and the social norm reinforced each other. Robots.txt didn’t need enforcement because the incentives aligned.

When AI training crawlers entered the picture, those incentives broke. The architecture hadn’t changed. The file was the same. But the actors reading it had fundamentally different motivations, and the code that had functioned as law for a generation suddenly had no teeth.

This is the governance gap that lawyers, regulators, and standards bodies are now racing to fill. Lessig’s framework suggests the resolution will come through one of three mechanisms: new code (technical standards like AIPREF), new law (regulatory frameworks like the EU AI Act), or new norms (industry self-regulation through codes of practice).

History suggests it will require all three, and that the transition will be messy.

The RFC 9309 Moment: What Formal Standardization Actually Changed

When RFC 9309 was published in September 2022, it codified what had been true in practice for nearly three decades. But formalization changed something important beyond documentation.

Before RFC 9309, a crawler could plausibly claim that “robots.txt” was merely a convention: advisory, informal, and not binding even as a technical matter. After RFC 9309, that argument became significantly harder to make. The IETF standard specifies that crawlers “should” honor the protocol, using the formal RFC language that carries specific meaning in standards contexts.

As Search Engine World’s analysis of RFC 9309 notes, the standardization also resolved long-standing ambiguities around the “longest match rule” (more specific rules override general ones), the handling of Allow/Disallow conflicts, and how crawlers should behave when robots.txt is unreachable.
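The longest-match rule can be sketched in a few lines of Python. This is a simplified illustration rather than a full RFC 9309 parser: it assumes plain path prefixes (no `*` or `$` wildcards, no percent-encoding normalization), and the `is_allowed` helper and sample paths are hypothetical names introduced here.

```python
from typing import Iterable, Tuple

Rule = Tuple[str, str]  # (directive, path prefix), e.g. ("Disallow", "/private/")


def is_allowed(rules: Iterable[Rule], path: str) -> bool:
    """Apply the longest-match rule to one user-agent group.

    The matching rule with the longest path wins. If an Allow and a
    Disallow rule match with equal length, Allow is preferred. A path
    that matches no rule is allowed.
    """
    best_len = -1
    allowed = True
    for directive, prefix in rules:
        # Skip empty patterns and rules whose prefix does not match the path.
        if not prefix or not path.startswith(prefix):
            continue
        is_allow = directive.lower() == "allow"
        longer = len(prefix) > best_len
        tie_for_allow = len(prefix) == best_len and is_allow
        if longer or tie_for_allow:
            best_len = len(prefix)
            allowed = is_allow
    return allowed


if __name__ == "__main__":
    group = [("Disallow", "/private/"), ("Allow", "/private/press/")]
    for p in ("/private/notes.html", "/private/press/2024.html", "/public/"):
        print(p, "->", "allowed" if is_allowed(group, p) else "disallowed")
```

In this sketch the more specific `Allow: /private/press/` overrides the broader `Disallow: /private/` regardless of the order in which the rules appear in the file.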

More practically, RFC 9309 gave regulators and courts a document to point to. The EU AI Act’s reference to “machine-readable rights reservations” gains much of its practical meaning from RFC 9309 providing a defined, documented standard for what such a reservation looks like.

The file didn’t change. Its legal context did.

What AIPREF Is Trying to Build – And Why It’s Not Enough Alone

The IETF’s acknowledgment that robots.txt is insufficient came officially in January 2026 with the chartering of the AI Preferences (AIPREF) Working Group.

The core insight driving AIPREF is that robots.txt conflates two distinct concepts: access and use. The file was designed to control whether a bot could crawl a page. It was never designed to control what a bot could do with the content once crawled.

In the search engine era, this distinction didn’t matter. Googlebot crawls your page and indexes it for search: crawl and use are effectively the same action, and the value exchange is understood.

In the AI era, the distinction is everything. A publisher might be perfectly comfortable with Perplexity crawling its content for search results but deeply opposed to the same content being used to train a large language model. Under robots.txt, there is no way to express that distinction.

AIPREF’s proposed solution is a Content-Usage mechanism, either as an extension to robots.txt syntax or as an HTTP header, allowing site owners to express preferences about use, not just access:

Content Usage Policy — Allow Search & RAG, Block Training

User-Agent: *
Allow: /

Content-Usage: train-ai=n, search=y, rag=y

That single additional line does something robots.txt has never been able to do: separate the question of “can you visit this page” from “what can you do with what you find.”
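For a sense of how a crawler might read that line, here is a rough Python sketch. It assumes the exact `key=y|n` syntax shown in the example above; the AIPREF specification is still in progress and its final syntax may differ. The functions `parse_content_usage` and `may_use` are hypothetical names made up for this illustration.

```python
from typing import Dict, Optional


def parse_content_usage(line: str) -> Dict[str, bool]:
    """Parse 'Content-Usage: train-ai=n, search=y' into {'train-ai': False, ...}."""
    name, _, value = line.partition(":")
    if name.strip().lower() != "content-usage":
        raise ValueError("not a Content-Usage line")
    prefs: Dict[str, bool] = {}
    for item in value.split(","):
        key, sep, flag = item.strip().partition("=")
        if not sep:
            continue  # skip fragments without a key=value shape
        flag = flag.strip().lower()
        if flag in ("y", "n"):
            prefs[key.strip().lower()] = flag == "y"
    return prefs


def may_use(prefs: Dict[str, bool], purpose: str) -> Optional[bool]:
    """Return True/False if a preference is stated, or None if the site is silent."""
    return prefs.get(purpose.lower())


prefs = parse_content_usage("Content-Usage: train-ai=n, search=y, rag=y")
print(prefs)
print(may_use(prefs, "train-ai"))
```

Returning `None` for an unstated purpose keeps "the site said no" separate from "the site said nothing", which a crawler operator would need to handle as a policy choice of its own.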

The AIPREF working group is targeting August 2026 for key specification milestones. But as co-chair Mark Nottingham has acknowledged, the new standard will also be voluntary. The IETF makes standards; it doesn’t make law.

The challenge of enforcement, the same challenge that has haunted robots.txt since 1994, will remain.

🔄 Five Eras of Robots.txt: A Summary

Era	Period	Primary Function	Enforcement Mechanism
1. Accidental Standard	1993–1994	Server protection from overload	Community goodwill
2. Search Engine Era	1994–2019	Crawl management and indexing control	Mutual business dependence
3. Formalization	2019–2023	Technical standardization	RFC 9309 (advisory)
4. AI Governance Tool	2023–2025	AI training data consent	Legal action + platform pressure
5. Policy Instrument	2025–present	Regulatory compliance signal	EU AI Act (binding in EU); voluntary elsewhere

The Three Tensions That Define Robots.txt Today

Understanding where robots.txt goes next requires understanding the three unresolved tensions at its core.

Tension 1: Access vs. Use

Robots.txt was built to control access. The AI era demands control over use. These are fundamentally different problems requiring different technical solutions. AIPREF is attempting to bridge this gap, but the architecture of the web, where crawling and using were always the same action, makes clean separation genuinely difficult.

Tension 2: Voluntary vs. Enforceable

Every standard in the robots.txt ecosystem, from the original 1994 proposal to RFC 9309 to the forthcoming AIPREF specification, is voluntary. Enforcement depends either on reputational incentives (which only work when actors care about reputation) or on external legal frameworks (which are still being built). The EU AI Act represents the first attempt to make compliance binding, but its geographic scope is limited and its enforcement mechanisms are still being tested.

Tension 3: Individual Rights vs. Collective Goods

Each publisher who blocks AI crawlers is making a rational individual decision. In aggregate, those decisions may be producing a collectively harmful outcome: AI systems trained on progressively lower-quality data. This is a classic commons problem, and it has no easy resolution within the current robots.txt framework, which is built around individual site preferences rather than collective information ecosystem health.

These tensions don’t have obvious technical solutions. They require policy, law, and negotiation between parties with fundamentally misaligned interests. That’s territory robots.txt was never designed to navigate.

Before vs. After: What Changed in August 2023

Before August 2023, robots.txt was a background infrastructure tool. Site owners set it once, occasionally updated it when they launched new sections, and largely forgot about it. The population of crawlers was dominated by search engines that had strong incentives to comply. Non-compliance was rare enough to be anecdotal.

After August 2023, robots.txt became a front-line decision. Every publisher with original content needed a position on AI crawlers. A file that had once been changed only occasionally was now being revised again and again as new AI bots appeared and new data about non-compliance emerged. Legal teams got involved. The question “what does your robots.txt say?” became as strategically significant as “what does your Terms of Service say?”

The technical file didn’t change. The stakes around it did.

⚠️ The Governance Illusion: What Most Publishers Still Get Wrong

Treating robots.txt as a firewall: It’s a signal, not a barrier. Non-compliant bots ignore it entirely. Technical enforcement requires additional measures.
Not distinguishing training from retrieval bots: Blocking ChatGPT-User or PerplexityBot (search/retrieval bots) removes you from AI search results. Blocking GPTBot and ClaudeBot (training bots) does not. These are different decisions with different consequences.
Assuming the file is current: New AI crawlers launch regularly. A robots.txt written in 2023 may not name bots that launched in 2025. Regular audits against updated lists (Dark Visitors, Cloudflare Radar) are necessary.
Conflating robots.txt with legal protection: Without accompanying Terms of Service language, server-level blocking, and documentation of violations, robots.txt provides minimal legal protection under current US case law.
Ignoring the collective dimension: Individual blocking decisions aggregate into systemic effects on AI training data quality. This isn’t a reason not to block — but it should be a conscious tradeoff, not an accidental one.
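A quick audit along the lines of the second and third points can be sketched in Python with the standard library's `urllib.robotparser`. The sample robots.txt, the bot list, and the helper names `audit` and `unnamed_bots` are all assumptions for illustration; a real audit would pull the live file and a maintained crawler list. Note that this only reports what the file asks for, not whether any bot actually obeys it.

```python
from typing import Dict, Iterable, List
from urllib.robotparser import RobotFileParser

# Hypothetical file: block two training bots, leave everything else open.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: *
Allow: /
"""


def audit(robots_txt: str, bots: Iterable[str],
          url: str = "https://example.com/article") -> Dict[str, bool]:
    """Report, per bot, whether the file permits fetching the given URL."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return {bot: parser.can_fetch(bot, url) for bot in bots}


def unnamed_bots(robots_txt: str, bots: Iterable[str]) -> List[str]:
    """List bots that have no explicit User-agent group and fall back to '*'."""
    named = set()
    for line in robots_txt.splitlines():
        field, _, value = line.partition(":")
        if field.strip().lower() == "user-agent":
            named.add(value.strip().lower())
    return [bot for bot in bots if bot.lower() not in named]


BOTS = ["GPTBot", "ClaudeBot", "ChatGPT-User", "PerplexityBot"]
print(audit(ROBOTS_TXT, BOTS))
print(unnamed_bots(ROBOTS_TXT, BOTS))
```

With this sample file the two training bots are blocked and the two retrieval bots are allowed through the wildcard group. The second function flags bots that were never named at all, which is where a file written in 2023 tends to fall behind. One caveat: the standard-library parser applies rules in file order, so files that mix overlapping Allow and Disallow rules may be resolved differently by crawlers that follow RFC 9309 precedence.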

What Comes Next: Three Possible Futures

Future 1: Fragmented Governance

The most likely near-term scenario. The EU has binding obligations via Article 53. The US relies on a patchwork of copyright, DMCA, and contract law. Other jurisdictions develop their own approaches. AI companies comply where legally required and ignore voluntary standards elsewhere. Robots.txt remains meaningful in some contexts and irrelevant in others.

Future 2: Industry Self-Regulation

AI companies, facing reputational pressure and the threat of regulatory action, adopt the AIPREF standard voluntarily and honor it consistently. Publishers gain meaningful control over how their content is used. The market develops licensing mechanisms (like Cloudflare’s Pay Per Crawl) that create economic incentives for compliance. This outcome requires sustained pressure from publishers, regulators, and users, and a degree of cooperation from AI companies that hasn’t been consistently demonstrated.

Future 3: Mandatory Compliance

Regulators in the US, following the EU’s lead, establish binding requirements for AI companies to honor machine-readable content preferences. The AIPREF standard (or something like it) becomes legally enforceable. Robots.txt and its successors transition from social norms to genuine legal instruments. This is the most transformative outcome, and the one that most closely matches the actual scale of the problem.

The likeliest outcome is some combination of all three, varying by jurisdiction and context. The common thread is that the purely voluntary model that served robots.txt for thirty years is no longer sufficient for the problems it’s being asked to solve.

The Bottom Line

Martijn Koster didn’t set out to create a governance system. He set out to stop a runaway crawler from crashing his server.

The file he proposed in 1994 was a technical convenience: thirty lines of plain text built on the assumption that the actors reading it would act in good faith. For twenty-five years, that assumption held. The web’s architecture enforced what the protocol couldn’t: a mutual dependency between publishers and crawlers that made compliance rational.

That architecture is gone now. AI training crawlers don’t need ongoing goodwill from publishers. The content they need, they’ve already taken, or are taking, faster than any single protocol can track.

What we’re left with is a governance gap wearing the costume of a solved problem. Robots.txt looks like it’s doing its job. In many cases, it’s not.

The IETF is building the successor. The EU has passed the first binding law. Courts are working out the case law. None of it is complete, and none of it is fast enough for the pace at which the problem is moving.

The sign on the door hasn’t changed. The world it was designed for has.

This article is part of a series. For a detailed look at the legal accountability questions raised by AI crawler non-compliance, including current lawsuits and what publishers can do, read: [The Accountability Vacuum: Who Is Legally Responsible When AI Crawlers Ignore Robots.txt?]

⚖️ Legal Disclaimer

This article is intended for informational purposes only and does not constitute legal advice. The legal and regulatory landscape around robots.txt, AI crawlers, and web governance is actively evolving. Laws and obligations vary by jurisdiction. If you are facing a specific legal situation involving AI data collection or content rights, consult a qualified attorney in your jurisdiction before taking action.

Frequently Asked Questions (FAQs)

Who invented robots.txt and why?

Robots.txt was proposed by Martijn Koster, a Dutch software engineer working at Nexor, in February 1994. The immediate trigger was a poorly written web crawler built by Charles Stross that hit Koster’s server so aggressively it caused effective downtime. Koster proposed a plain text file at the root of any domain that could communicate crawl preferences to automated bots. By June 1994 it had become a de facto standard, and the major search engines that followed, including WebCrawler, Lycos, and AltaVista, complied with it.

When did robots.txt become an official standard?

Despite being in widespread use since 1994, robots.txt wasn’t formally standardized until September 2022, when the IETF published RFC 9309 — the Robots Exclusion Protocol. The process was initiated by Google in July 2019, which submitted a draft to the IETF and open-sourced its own robots.txt parser. RFC 9309 was co-authored by Koster and Google engineers, resolving longstanding ambiguities around wildcard syntax, conflict resolution, and crawl behavior when the file is unreachable.

Why did GPTBot’s launch in 2023 change robots.txt?

GPTBot was the first major AI training crawler with a publicly named user-agent and clear opt-out instructions. Its launch was the first time a crawler’s purpose was unambiguously extractive — taking content for model training while offering no traffic or referrals in return. Publishers responded at unprecedented speed: GPTBot blocking went from 5% to 35.7% of the top 1,000 websites within 12 months. The launch transformed robots.txt from a technical infrastructure tool into an active expression of AI data consent.

What is the difference between robots.txt and AIPREF?

Robots.txt controls access — whether a bot can crawl a page. AIPREF, the standard being developed by the IETF’s AI Preferences Working Group, aims to control use — what a bot can do with content after crawling it. The proposed Content-Usage mechanism would let publishers say “you can crawl this for search, but not use it for AI training” in a single machine-readable statement. AIPREF is targeting August 2026 for key specifications but will also be a voluntary standard.

Does the EU AI Act make robots.txt compliance legally required?

In the EU, Article 53 of the EU AI Act (whose obligations for general-purpose AI models have applied since August 2, 2025) requires providers of general-purpose AI models to respect machine-readable rights reservations under Article 4(3) of the DSM Directive. Robots.txt, per RFC 9309, is the primary recognized mechanism for such reservations. This creates a binding legal obligation for AI companies operating in the EU market, regardless of where they’re headquartered. No equivalent obligation exists in US law as of mid-2026.

Why are reputable news sites blocking AI crawlers more than misinformation sites?

Reputable publishers have more to protect — established brand value, paywalled content, advertising-dependent traffic — and more organizational capacity to implement robots.txt changes. Misinformation sites often lack the same resources or motivations, and some may actively benefit from AI systems training on their content. Research tracking both site types from September 2023 to May 2025 found reputable sites reaching nearly 60% AI blocking, while misinformation sites stayed below 10% — a gap that has widened every year.

What does “code is law” mean in the context of robots.txt?

Lawrence Lessig’s “code is law” concept, from his 1999 book and 2000 Harvard Magazine article, argues that technical architecture functions as regulation in digital environments. Applied to robots.txt: the file’s design (voluntary, trust-based, no enforcement mechanism) embedded assumptions about who would be reading it. For search engines, those assumptions held. For AI training crawlers, they didn’t — creating a governance gap that technical architecture alone can’t close. The lesson is that code-as-governance only works when the incentive structures of the actors reading the code align with its intent.
