Robots.txt is a plain text file placed at the root of a website. It tells search engine bots which parts of the site they can crawl and which parts to skip. This file works under the Robots Exclusion Protocol, a rule system introduced in 1994 to guide web crawlers.

It helps manage crawler traffic so that bots do not overload a server or crawl unwanted pages. For example, site owners can use robots.txt to keep crawlers out of admin folders, login pages, or test areas. This control helps keep private or non-indexable URLs out of routine crawling, though it does not by itself guarantee they stay out of search results.

Compliance with robots.txt is voluntary. Search engines like Google and Bing follow its rules, but harmful bots may ignore them completely. In the 2020s, websites also began using robots.txt to block bots that collect data for AI model training, including crawlers that gather content for large language models and generative search engines.

Today, robots.txt is not just about managing crawler load. It is also a tool for data protection, indexing control, and reducing content scraping across the web.

Origin and Development of Robots.txt

Robots.txt began in 1994 as a simple fix to a big problem: badly behaved bots crashing websites. It grew from an informal practice into an official web standard with clear rules for modern web crawlers.

Early Proposal and Origin

In February 1994, developer Martijn Koster proposed a solution on the WWW-talk mailing list after a rogue crawler caused a denial-of-service issue on his server. He suggested a simple file—initially called RobotsNotWanted.txt—that would list paths bots should avoid.

At the time, the web was small. It was practical to maintain lists of bots, and the main concern was avoiding crawler overload, not protecting content.

Adoption by Early Search Engines

By June 1994, the file was renamed to robots.txt. It quickly gained popularity. Early search engines like WebCrawler, Lycos, and AltaVista started checking this file before crawling sites. The practice became a de facto standard even though it had no official status.

Lack of Formal Standardization

For years, robots.txt was guided by convention. But without formal rules, issues arose around edge cases such as:

  • Handling of unknown directives
  • File encoding and character sets
  • Behavior during server errors or redirects
  • Caching policies

Different bots interpreted the rules differently, leading to inconsistent results across platforms.

Move Toward Official Standard

In July 2019, Google joined forces with Koster and others to formalize the Robots Exclusion Protocol. They submitted a draft to the Internet Engineering Task Force (IETF) that outlined current usage and clarified ambiguous cases.

RFC 9309: Formal Approval

After public feedback and revisions, the IETF published RFC 9309 in September 2022. The document, co-authored by Koster and Google engineers, made robots.txt an official Proposed Standard. It codified long-standing practices and gave precise instructions on:

  • Crawler behavior
  • Error handling
  • Encoding formats
  • Caching rules
  • Interpretation of special characters

RFC 9309 brought structure and clarity to a rule set that had guided the web for decades.

How Robots.txt Guides Web Crawlers

The robots.txt file helps website owners manage how search engine crawlers interact with their sites. It acts as a simple set of instructions that tells compliant bots which pages to visit and which to avoid.

Location and Scope

  • The file must be named robots.txt
  • It must be placed at the root of a domain (e.g. https://example.com/robots.txt)
  • Each subdomain or domain requires its own separate robots.txt file
  • Crawlers will not look for the file in other directories or with alternate names

If a site does not include a robots.txt file, crawlers generally assume there are no restrictions and may crawl all accessible content. If the file exists, bots like Googlebot or Bingbot check its rules before crawling.
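
Compliant crawlers perform this check automatically. As a rough sketch of the same logic, Python's standard-library urllib.robotparser can fetch a site's robots.txt and test whether a given path may be crawled (the domain and paths here are placeholders):

from urllib import robotparser

rp = robotparser.RobotFileParser()
rp.set_url("https://example.com/robots.txt")  # the file must live at the domain root
rp.read()  # fetch and parse; a missing file (HTTP 404) is treated as "no restrictions" by this parser

print(rp.can_fetch("Googlebot", "https://example.com/private/page.html"))
print(rp.can_fetch("*", "https://example.com/blog/post.html"))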

Primary Uses

Website owners use robots.txt for two main reasons:

  • Control crawler access to unimportant, duplicate, or private pages
  • Reduce server load by limiting the number of unnecessary requests

Examples of pages commonly disallowed:

  • Login screens or admin panels
  • Archive folders or staging environments
  • Dynamically generated URLs
  • Temporary files or incomplete content

This ensures that the crawl budget is spent on valuable content, improving index quality and search efficiency.

Robots.txt and Crawler Behavior

The file is not a security tool. It simply requests bots to avoid certain areas. Well-behaved crawlers respect this, treating the file like a “do not enter” sign. However, malicious bots or scrapers may choose to ignore it entirely.

Because it depends on voluntary compliance, robots.txt is often described as a handshake agreement or a social contract between webmasters and bot developers.

Mutual Benefits

  • Search engines benefit by skipping irrelevant or sensitive content, which improves crawl efficiency
  • Site owners benefit by protecting server performance and guiding search bots to focus on important pages

Despite its simplicity, robots.txt remains a widely respected method of directing automated access across the web.

How Is a Robots.txt File Structured and Written

Robots.txt uses a simple, line-based syntax to tell crawlers which URLs they may access. Each instruction follows a plain text format with specific fields and values. The file must be saved in UTF-8 and placed in the root directory of the site.

Structure of the File

A robots.txt file is divided into one or more rule groups. Each group starts with a user agent declaration and is followed by one or more rules. The format is one field/value pair per line. Blank lines separate different groups.

Key Fields

User-agent
Specifies the bot the rules apply to. For example:
User-agent: Googlebot applies rules to Google’s crawler.
User-agent: * is a wildcard and applies to all bots.

Disallow
Blocks bots from crawling specific paths.
Example: Disallow: /private/ prevents access to anything under the /private/ directory.

  • A blank Disallow line (e.g. Disallow:) means no restrictions
  • Disallow: / blocks the entire site

Allow
Grants access to specific paths, even if a broader disallow rule exists.
Example:
Disallow: /archive/
Allow: /archive/public/

Comments
Lines starting with # are ignored by bots. These help human readers understand the rules.
Example: Disallow: /temp/ # staging area

Examples of Robots.txt Rules

Block one directory for all bots

User-agent: *
Disallow: /private/

Block multiple directories and one specific page

User-agent: *
Disallow: /cgi-bin/
Disallow: /temp/
Disallow: /junk/page.html

Allow everything

User-agent: *
Disallow:
Or
User-agent: *
Allow: /

Block entire site

User-agent: *
Disallow: /

Disallow one bot only

User-agent: BadBot
Disallow: /
User-agent: *
Disallow:

Apply same rules to multiple specific bots

User-agent: Googlebot
User-agent: Bingbot
Disallow: /private/

Rule Matching Behavior

  • Crawlers follow the most specific rule that matches their name
    • If multiple user-agent groups are listed, bots choose the group that directly matches their name
  • If no exact match is found, bots may fall back to the User-agent: * group

This rule structure ensures clear, separate instructions for each crawler and lets webmasters fine-tune what gets crawled and what stays hidden.
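
As a rough illustration of this selection logic, the sketch below feeds the "disallow one bot only" example from above into Python's standard-library parser and checks the same path for two different user agents. It is a simplified stand-in for how real crawlers match groups:

from urllib import robotparser

rules = [
    "User-agent: BadBot",
    "Disallow: /",
    "",
    "User-agent: *",
    "Disallow:",
]

rp = robotparser.RobotFileParser()
rp.parse(rules)  # parse rules supplied as a list of lines

print(rp.can_fetch("BadBot", "/page.html"))     # False: BadBot matches its own group
print(rp.can_fetch("Googlebot", "/page.html"))  # True: falls back to the * group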

How Do Wildcards Work in Robots.txt Directives

Robots.txt originally used only simple path matching, but modern standards now allow wildcards for greater flexibility. These wildcards help website owners block or allow specific types of URLs using patterns rather than exact paths.

Basic Prefix Matching

The original 1994 approach matched URL paths by prefix only.
For example:

Disallow: /folder

This would match any URL that begins with /folder, such as:

  • /folder/page.html
  • /folder123/image.jpg

No advanced pattern recognition was included in the early rules.

Supported Wildcards in Modern Use

RFC 9309 standardized two special characters for pattern matching, which major crawlers had already supported informally:

  • Asterisk (*)
    Acts as a wildcard for any sequence of characters (including nothing).

    • Disallow: /*.pdf blocks any URL containing .pdf
    • Disallow: /files/*.jpg blocks all .jpg images in /files/, regardless of file name
  • Dollar sign ($)
    Anchors the match to the end of a URL

    • Disallow: *.php$ blocks only URLs that end with .php
    • Without $, the same rule might also match .php?id=123 or longer strings

These characters give precise control over file types, directories, or URL patterns that should or should not be crawled.

Examples of Pattern-Based Rules

Disallow: /downloads/*.zip # blocks all .zip files in the /downloads/ folder
Disallow: /temp/*?debug=true # blocks any temp URL with a debug query
Disallow: /*.log$ # blocks all URLs that end in .log
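
As a simplified illustration of how a crawler might evaluate such patterns, the Python sketch below converts a robots.txt path pattern into a regular expression. It is only a sketch of the matching idea, not the full RFC 9309 algorithm:

import re

def pattern_to_regex(pattern: str) -> re.Pattern:
    # Escape regex metacharacters, then restore the two robots.txt wildcards.
    escaped = re.escape(pattern)
    escaped = escaped.replace(r"\*", ".*")   # * matches any sequence of characters
    if escaped.endswith(r"\$"):
        escaped = escaped[:-2] + "$"         # a trailing $ anchors the match to the end of the path
    return re.compile(escaped)

print(bool(pattern_to_regex("/downloads/*.zip").match("/downloads/archive/big.zip")))  # True
print(bool(pattern_to_regex("/*.log$").match("/logs/app.log")))                        # True
print(bool(pattern_to_regex("/*.log$").match("/logs/app.log.txt")))                    # False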

Encoding Special Characters

If a literal * or $ needs to appear in a URL (not as a wildcard), it must be percent-encoded:

* becomes %2A
$ becomes %24

This ensures that crawlers do not misread the symbol as a pattern operator.
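
As a quick illustration, Python's urllib.parse.quote produces these encodings (the path below is just an example):

from urllib.parse import quote

print(quote("*"))                 # %2A
print(quote("$"))                 # %24
print(quote("/files/a*b$.txt"))   # /files/a%2Ab%24.txt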

Implementation Caveats

Not all crawlers follow wildcard rules perfectly. However, major bots like Googlebot and Bingbot follow the wildcard behavior defined in RFC 9309, making these rules reliable for mainstream search engines.

Using pattern matching in robots.txt allows targeted control over file types, folders, or parameters—saving crawl resources and focusing search indexing on valuable content.

How Do Non-Standard Directives Work in Robots.txt

Robots.txt supports a few common extensions beyond the core User-agent, Disallow, and Allow rules. These additions are not part of the original 1994 standard but have emerged through widespread adoption or crawler-specific support. The 2022 REP standard allows for such optional directives to be included, provided they do not interfere with known rules.

Crawl-delay

Crawl-delay sets a pause between successive crawler requests. It is not officially standardized and is handled inconsistently:

  • Yandex: Treats Crawl-delay: 10 as “wait 10 seconds between each request”
  • Bing: Interprets it as a crawl rate limit over time
  • Googlebot: Ignores this directive completely; crawl behavior must be managed via Search Console

Due to these differences, Crawl-delay should be used cautiously. It may help reduce server strain for some bots but is ineffective for others.

Sitemap

The Sitemap: directive lets site owners point crawlers directly to one or more XML sitemaps:

Sitemap: https://www.example.com/sitemap.xml

Multiple Sitemap: lines are allowed. This extension is widely supported and helps crawlers discover important URLs without submitting the sitemap separately.
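
Both the Crawl-delay and Sitemap values can also be read programmatically. For example, Python's standard-library parser exposes them (site_maps() requires Python 3.8 or newer; both methods return None when the directive is absent, and the URL below is a placeholder):

from urllib import robotparser

rp = robotparser.RobotFileParser()
rp.set_url("https://www.example.com/robots.txt")  # placeholder URL
rp.read()

print(rp.site_maps())             # list of Sitemap: URLs, or None
print(rp.crawl_delay("Bingbot"))  # Crawl-delay for this agent, or None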

Host

Used mainly by Yandex, the Host: directive indicates the preferred domain name in case a site has multiple accessible hostnames:

Host: example.com

This directive is not recognized by Google or Bing and is not part of the official REP.

Nofollow

Some webmasters mistakenly placed a Nofollow line in robots.txt to prevent bots from following links. This directive is not supported in robots.txt and has no effect. Proper use of nofollow should be within HTML <a> tags or <meta> elements in the page head.
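
For reference, the link-level and page-level forms look like this (the URL is a placeholder):

<!-- rel="nofollow" applies to a single link; the meta tag applies to every link on the page -->
<a href="https://example.com/some-page" rel="nofollow">Example link</a>
<meta name="robots" content="nofollow">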

Noindex

Noindex was never part of the official specification, but some believed it could block indexing of disallowed pages. For a time, Googlebot may have heuristically honored this, but as of September 1, 2019, Google no longer supports Noindex in robots.txt files.

Site owners should instead use:

  • noindex meta tags
  • HTTP response headers
  • Password protection or proper server configurations

Handling Unknown Directives

Robots.txt is extensible by design. Crawlers are expected to ignore unrecognized directives. For example:

Scanner: on

If this line is not understood by a bot, it is simply skipped without error, and the rest of the file is parsed normally. This flexibility allows new directives to evolve without disrupting established parsing behavior.

In summary, non-standard directives offer useful crawler-specific controls but should be used with an understanding that support varies. For broad compatibility, it’s best to rely on standardized fields or confirm which bots support any extended rules.

How Is Robots.txt Adopted Across the Web

Robots.txt is widely respected by major web crawlers, but compliance remains voluntary. It functions on an honor system where responsible bots follow its rules, while others may ignore them.

Mainstream Search Engine Support

Most major search engines fully support robots.txt. These include:

  • Googlebot (Google)
  • Bingbot (Microsoft)
  • Slurp (Yahoo)
  • Yandex Bot
  • Baidu Spider
  • DuckDuckGo's crawler

These bots check the robots.txt file before crawling and will skip disallowed paths, such as /search, if instructed. This behavior is built into their algorithms and reflects established crawler etiquette.

Other Respectful Bots

Beyond search engines, many other bots also follow robots.txt rules:

  • Content aggregators
  • Feed readers
  • SEO tools
  • Research crawlers with ethical guidelines
  • Web crawling libraries and frameworks that are configured to respect it

Non-Compliant or Malicious Crawlers

Because robots.txt is not enforceable:

  • Malicious bots, spammers, or scrapers may completely ignore it
  • Poorly coded bots might not even check for the file
  • Some malicious crawlers may read robots.txt to locate sensitive paths, treating it as a map to hidden content

Listing private folders (e.g. /admin/ or /confidential/) in robots.txt does not secure them. Security experts advise using authentication or server-level restrictions for protection.

Use in Web Archiving

Historically, robots.txt influenced how archiving tools handled websites:

  • The Wayback Machine at Internet Archive used to honor robots.txt
  • In 2017, it stopped honoring the file, citing misuse by site owners trying to erase history
  • Now, it archives pages regardless of robots.txt, treating the file as part of the live web but not a rule for historical content

Other groups like Archive Team may still read robots.txt for discovery purposes, but often ignore it to preserve public content.

Use of Robots.txt in Generative AI Data Collection

In the 2020s, robots.txt became a key tool for website owners trying to block AI companies from harvesting their content for training datasets. As large language models began collecting vast amounts of web content, concerns over unregulated scraping intensified.

Rise of AI Crawlers

Unlike search engines, which drive traffic back to content sources, AI crawlers often take content without giving anything in return. This triggered backlash among publishers, many of whom updated their robots.txt files to block known AI bots.

A key moment came in 2023, when OpenAI introduced GPTBot, a crawler used to collect data for training models like ChatGPT. In response:

  • 306 out of the top 1,000 websites blocked GPTBot using robots.txt (Originality.AI, 2023)
  • Only 85 sites blocked Google-Extended, Google’s opt-out for generative AI crawling
  • Many publishers targeted GPTBot specifically, allowing other bots but disallowing AI data scrapers
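
Blocking these crawlers uses the same syntax as any other user-agent group. A minimal robots.txt that disallows GPTBot and Google-Extended while leaving other bots unrestricted looks like this:

User-agent: GPTBot
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: *
Disallow: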

Major Publishers and Platforms

Prominent websites that blocked GPTBot include:

  • The New York Times
  • BBC
  • Medium.com, which blocked all AI-related bots across the platform

Medium stated that AI firms were taking content without permission and diluting the web with AI-generated text, prompting a blanket disallow for all machine learning crawlers.

Compliance and Criticism

OpenAI confirmed that GPTBot respects robots.txt and offered clear instructions for opting out. However, critics noted:

  • The crawler began respecting rules after much of the open web had already been scraped
  • As The Verge’s David Pierce remarked, GPTBot started asking permission after it had already taken a full meal

Evasion and Evolving Threats

Not all AI scraping is done by stable, named bots. Some companies:

  • Rotate user-agent strings
  • Use new or distributed crawlers
  • Bypass robots.txt entirely

This makes it difficult for site owners to block AI collection reliably. The honor system at the heart of robots.txt is easily circumvented by determined actors.

Current Role and Future Outlook

Robots.txt now acts as a first line of defense against unauthorized AI data collection. However, its effectiveness relies entirely on the crawler’s willingness to comply. As scraping tactics evolve, many in the web community are calling for stronger, enforceable protections—possibly including legal or technical frameworks—to defend content against unwanted use in AI datasets.

What Are the Main Limitations of Robots.txt

Robots.txt is a helpful tool for guiding bots, but it is not a secure barrier. It works on trust, not enforcement. Understanding its limits is critical for proper use.

Voluntary Compliance

Robots.txt works on an honor system. Bots choose whether to follow the rules. Major search engines do; malicious bots usually do not.

  • No built-in way to block bad actors
  • Bots like Googlebot, Bingbot follow it
  • Spam scrapers often ignore it
  • No penalties for disobedient bots

Even when properly configured, robots.txt cannot prevent access; it can only request that bots stay away.

No Protection for Sensitive Data

Disallowing a page does not make it private. The robots.txt file itself is public, and listed URLs can be discovered by anyone.

  • Disallowed paths are visible to everyone
  • Attackers can use it as a map to hidden areas
  • Sensitive data must be protected using login systems, firewalls, or IP blocks
  • Never rely on robots.txt to keep anything secret

Security through obscurity is not true security.

Crawling vs Indexing

Robots.txt blocks crawling, not indexing. Pages can still show up in search results even if they are never crawled.

  • Search engines may index URLs from external links
  • Disallowed pages might appear with no snippet
  • Use noindex meta tags or HTTP headers to prevent indexing
  • Do not disallow if you want bots to see and apply noindex

For effective removal, let bots crawl and instruct them with proper indexing tags.

Disallowed URLs in Search Results

If other websites link to your disallowed content, search engines might index the link anyway—despite never visiting the page.

  • Google may list the page with “No information available”
  • Title might come from anchor text
  • Snippet will be blank or missing
  • Deletion requires allowing crawl or returning 404/410

Blocking a page does not guarantee its invisibility in search.

File Size and Fetch Limits

Search engine bots have limits on how much of a robots.txt file they read. Large files may result in skipped rules.

  • Maximum guaranteed read size: 500 KiB
  • Google follows this limit strictly
  • Rules beyond this size may be ignored
  • Keep the file short and clear

Structure your file efficiently to ensure all rules are respected.

Delays in Rule Updates

Bots cache the robots.txt file and may not recheck it immediately. Changes may take time to take effect.

  • Most bots refresh once every 24 hours
  • Updates might not apply instantly
  • Cache behavior can vary
  • Server errors may trigger bots to stop crawling entirely

After editing, allow time for changes to be seen and followed.

Inconsistent Bot Behavior

Not all crawlers support the same directives. Some understand wildcards or Allow; others ignore them completely.

  • Googlebot ignores Crawl-delay
  • Bingbot supports it
  • Unknown bots may break syntax rules
  • Poorly coded bots may bypass restrictions

Always assume that unsupported bots will behave unpredictably.

Public Visibility of robots.txt

The robots.txt file is public and easy to access. It is not meant to be hidden.

  • Accessible at yoursite.com/robots.txt
  • Reveals all disallowed paths
  • Competitors and bad actors can view it
  • Don’t list anything you truly want to hide

Use access controls for anything that must remain private.

Proper Use and Alternatives

Robots.txt is one tool among many. For full control, use other methods alongside it.

  • For privacy: use authentication, firewalls, or IP restrictions
  • For indexing control: use noindex tags or headers
  • For page removal: use 404/410 or Search Console tools

Avoid combining Disallow with noindex—bots cannot see the tag if blocked from crawling the page.

What Other Tools Can Control Crawling Besides Robots.txt

Robots.txt is only one part of a broader system used to manage how bots access and index content. Several other tools and standards complement or extend its role, depending on what the site owner needs to control.

Meta Robots Tags

Meta robots tags are HTML elements placed in the <head> section of a webpage. They give crawlers page-specific instructions such as whether to index the content or follow links.

This method is widely supported and ideal when you want a bot to visit the page but not index it or follow its links.

Example:

<meta name="robots" content="noindex, nofollow">

Key points:

  • Works on individual pages only
  • Needs the crawler to access the page to read the tag
  • Does not work if the page is blocked in robots.txt

X-Robots-Tag HTTP Header

The X-Robots-Tag is a server-side instruction delivered via HTTP headers. It performs a similar function to the meta tag but is used for non-HTML files like PDFs, images, or documents.

This allows you to control indexing behavior for files that do not have HTML markup.

Example:

X-Robots-Tag: noindex

Key points:

  • Useful for non-HTML files
  • Crawler must still be able to fetch the file
  • Commonly used for PDFs or dynamic file responses
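
A quick way to confirm that a file is being served with the header is to inspect the HTTP response. A small Python sketch (the URL is a placeholder):

import urllib.request

url = "https://example.com/files/report.pdf"  # placeholder URL
with urllib.request.urlopen(url) as response:
    # Prints the header value (e.g. "noindex"), or None if the header is not set
    print(response.headers.get("X-Robots-Tag"))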

Password Protection and Access Controls

The most secure way to prevent indexing is to restrict access to the content altogether. Search engines do not log in or solve CAPTCHAs, so placing content behind authentication makes it invisible to crawlers.

This approach is best for sensitive or private information that should never be accessible to the public or bots.

Key points:

  • Prevents both crawling and indexing
  • Applies to full pages, folders, or files
  • Should be used for anything confidential

URL Removal Tools

Search engines like Google provide tools to temporarily remove URLs from their search index. These are useful for urgent removals but should be combined with noindex or blocking rules to prevent reappearance.

This method is quick but not permanent unless paired with other techniques.

Key points:

  • Fastest method for emergency deindexing
  • Available in Google Search Console
  • Works best when followed by long-term controls like noindex or access restrictions

Other *.txt Files Inspired by Robots.txt

The idea of using plain text files for structured bot or service instructions has expanded into other areas. These formats live at the root of a site, just like robots.txt, but serve different purposes.

Examples:

  • ads.txt: Lists approved ad vendors
  • security.txt: Shares contact information for reporting vulnerabilities
  • humans.txt: A fun, unofficial file crediting the site's creators

While these do not affect crawling or indexing, they follow the same basic pattern: a public text file with machine-readable info.