llms.txt is a Markdown file placed at the root of a website (/llms.txt) that provides LLMs a structured summary of the site content. It includes an H1 title, a blockquote description, and organized links to key pages. Created by Jeremy Howard (Answer.AI) in 2024, it is widely adopted by companies like Anthropic, Stripe, and Vercel but is not an IETF or W3C standard.
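As a rough illustration of the layout described above (H1 title, blockquote description, organized links), the following sketch renders a minimal llms.txt body. The `Link` class, the `build_llms_txt` helper, the site name, and the URL are all hypothetical and exist only for this example; check llmstxt.org for the authoritative format.

```python
from dataclasses import dataclass


@dataclass
class Link:
    """One entry in an llms.txt link section (hypothetical helper type)."""
    title: str
    url: str
    note: str = ""


def build_llms_txt(title: str, summary: str, sections: dict[str, list[Link]]) -> str:
    """Render an H1 title, a blockquote summary, and H2 sections of Markdown links."""
    lines = [f"# {title}", "", f"> {summary}", ""]
    for heading, links in sections.items():
        lines.append(f"## {heading}")
        lines.append("")
        for link in links:
            suffix = f": {link.note}" if link.note else ""
            lines.append(f"- [{link.title}]({link.url}){suffix}")
        lines.append("")
    # Collapse trailing blank lines to a single newline at end of file.
    return "\n".join(lines).rstrip() + "\n"


# Placeholder site data, not taken from any real deployment.
example = build_llms_txt(
    "Example Co",
    "Example Co builds a hypothetical data-sync product.",
    {"Docs": [Link("Quickstart", "https://example.com/docs/quickstart.md", "install and first sync")]},
)
print(example)
```

The output would be saved as /llms.txt at the site root.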

What is the difference between AEO and GEO?

AEO (Answer Engine Optimization) focuses on structuring content so AI-powered answer engines like ChatGPT, Perplexity, and Google AI Overviews cite your site when generating responses. GEO (Generative Engine Optimization) extends this to focus on appearing in AI-generated summaries across all platforms. AEO targets question-answer extraction; GEO targets topical authority and citation rate across the entire AI ecosystem.

What AI crawlers should I allow in robots.txt?

As of 2026, the major AI crawlers are: GPTBot, OAI-SearchBot, ChatGPT-User (OpenAI), ClaudeBot, Claude-SearchBot, Claude-User (Anthropic), Google-Extended (Gemini), PerplexityBot, Perplexity-User, meta-externalagent, Applebot-Extended, Amazonbot, CCBot, Bytespider, and cohere-ai. Search/retrieval bots cite your content in AI answers, while training bots absorb content into model weights.

brand.txt is a plain-text file placed at the root of a website that provides AI systems with explicit instructions on how to represent a brand. It defines the canonical brand name with exact capitalization, preferred and prohibited terminology, product names, tone guidance, and competitor disambiguation. It reduces AI hallucinations about your brand identity.

ai.txt is a plain-text file that declares permissions for AI use of website content. While robots.txt controls crawl access, ai.txt specifies what AI systems may do with the content: training, indexing, citation, or summarization. It includes owner contact information and links to other discovery files like llms.txt and sitemap.xml.

AI Discovery Standards, Open-Source Reference for AI Web Discoverability

The Visibility Problem

Search changed. In 2025, nearly 60% of queries ended without a click. The user got their answer directly from an AI summary. Google AI Overviews, ChatGPT Search, Perplexity, and Copilot now synthesize answers from multiple sources and present them as a single response. The "ten blue links" page is fading. If your content isn't structured for extraction and citation by these systems, you're invisible to a growing share of your audience.

This page documents every file, protocol, and technique that determines whether AI systems can find, understand, and cite your website. It's the result of building and testing these standards across production sites, not theory.

What the Data Shows

Numbers from 2025-2026 industry research on AI crawler behavior, blocking rates, and adoption.

~28% of websites now block at least one major AI crawler via robots.txt, CDN, or WAF rules.

79% of top news publishers block AI training bots. GPTBot is the most blocked crawler (17-62% depending on dataset).

~10% of domains have adopted llms.txt. Among the top 1,000 sites, it drops to 0.3%. No major AI provider officially uses it as a ranking signal.

The blocking numbers reveal a market that hasn't settled on a strategy. Most publishers are reacting to AI crawlers the same way they reacted to early search engines in the 2000s: with blanket blocks. The problem: blocking search bots (OAI-SearchBot, Claude-SearchBot) removes you from AI-generated answers entirely. Blocking training bots (GPTBot, ClaudeBot) stops your content from being absorbed into model weights without attribution. These are different decisions with different consequences, and most sites are treating them as the same thing.

The llms.txt adoption curve is interesting for what it reveals about the standard's actual utility. No major AI provider (Google, OpenAI, Anthropic, Meta) has committed to using it as a retrieval signal. Its real value has shifted toward B2A (Business-to-Agent) communication: giving coding assistants, IDE agents, and documentation crawlers a structured entry point into your site. That's a narrower use case than the original pitch, but it's a real one.

The Training vs. Retrieval Split

The most important distinction in AI discoverability is between training crawlers and retrieval crawlers. They look similar in your access logs, but they do fundamentally different things with your content.

Training crawlers (GPTBot, ClaudeBot, Google-Extended, CCBot) ingest your content into model weights. Once absorbed, your words become part of the model's knowledge but are never attributed back to you. There is no referral traffic. No citation. Your content improves someone else's product.

Retrieval crawlers (OAI-SearchBot, Claude-SearchBot, PerplexityBot) fetch your content at query time to include in AI-generated answers. These crawlers do cite you. They do send traffic. AI-referred visitors convert at roughly 14% versus 3% for traditional organic, nearly five times higher, because they arrive with specific intent already formed by the AI summary.

The strategic play: allow all retrieval bots, selectively manage training bots. The robots.txt strategy section below shows exactly how to do this.
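To see how the two groups show up in your own access logs, a small classifier along these lines may help. It is a sketch that assumes simple substring matching on the user-agent header is good enough; the function name and the sample strings are illustrative, not real log data. Google-Extended is left out because, to my knowledge, it is a robots.txt control token rather than a string that appears in request headers.

```python
from collections import Counter

# Tokens taken from the training/retrieval lists above.
TRAINING_TOKENS = ("GPTBot", "ClaudeBot", "CCBot")
RETRIEVAL_TOKENS = ("OAI-SearchBot", "Claude-SearchBot", "PerplexityBot")


def classify_user_agent(user_agent: str) -> str:
    """Return 'retrieval', 'training', or 'other' for a raw User-Agent header."""
    ua = user_agent.lower()
    if any(token.lower() in ua for token in RETRIEVAL_TOKENS):
        return "retrieval"
    if any(token.lower() in ua for token in TRAINING_TOKENS):
        return "training"
    return "other"


# Illustrative header values only; real strings vary by vendor and version.
SAMPLE_USER_AGENTS = [
    "Mozilla/5.0 (compatible; GPTBot/1.2; +https://openai.com/gptbot)",
    "Mozilla/5.0 (compatible; OAI-SearchBot/1.0)",
    "Mozilla/5.0 (compatible; ClaudeBot/1.0)",
    "Mozilla/5.0 (X11; Linux x86_64) Firefox/126.0",
]

counts = Counter(classify_user_agent(ua) for ua in SAMPLE_USER_AGENTS)
print(dict(counts))
```

User-agent headers can be spoofed, so treat the counts as an estimate unless you also verify the requester's published IP ranges.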

Measuring AI Visibility

Traditional SEO metrics (clicks, impressions, keyword rankings) don't capture AI performance. These are the metrics that matter now.

Citation Frequency

Primary

How often AI systems cite your domain when answering questions in your topic area.

Share of AI Voice

Primary

Your brand's presence as a percentage of all citations in AI-generated answers for core queries.

AI Referral Traffic

Measurable

Visits from ChatGPT, Perplexity, Claude, and Copilot. Track via UTM params or referrer headers.

AI Conversion Rate

Measurable

Conversion rate of AI-referred visitors vs. organic. Industry benchmarks: ~14% vs. ~3%.

Brand Mention Accuracy

Qualitative

Whether AI systems correctly describe your brand, products, and positioning. brand.txt is intended to improve this by reducing misattribution.

Answer Selection Rate

AEO-specific

How often your content is chosen as the direct answer (not just cited) in AI Overviews and snippets.
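For the AI Referral Traffic metric, one possible way to tag a visit is to check the referrer host first and fall back to a utm_source parameter on the landing URL. The host list below is an assumption based on the products named above, and `ai_source` is a hypothetical helper; adjust both to what you actually see in your analytics.

```python
from typing import Optional
from urllib.parse import parse_qs, urlparse

# Assumed referrer hosts for the assistants named above; extend as needed.
AI_REFERRER_HOSTS = {
    "chatgpt.com": "ChatGPT",
    "chat.openai.com": "ChatGPT",
    "perplexity.ai": "Perplexity",
    "claude.ai": "Claude",
    "copilot.microsoft.com": "Copilot",
}


def _normalize(host: str) -> str:
    """Lowercase a host name and drop a leading 'www.'."""
    host = host.lower()
    return host[4:] if host.startswith("www.") else host


def ai_source(referrer: str, landing_url: str = "") -> Optional[str]:
    """Return the AI product a visit came from, or None if it is not AI-referred."""
    host = _normalize(urlparse(referrer).hostname or "")
    if host in AI_REFERRER_HOSTS:
        return AI_REFERRER_HOSTS[host]
    # Fall back to utm_source, which some assistants append to outbound links.
    utm = parse_qs(urlparse(landing_url).query).get("utm_source", [""])[0]
    return AI_REFERRER_HOSTS.get(_normalize(utm))


print(ai_source("https://www.perplexity.ai/search?q=example"))
print(ai_source("", "https://example.com/pricing?utm_source=chatgpt.com"))
```

Dividing conversions by visits per returned label gives the AI Conversion Rate to compare against organic traffic.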

File Registry

Standard = ratified RFC, W3C spec, or other formally published specification. Adopted = widely used convention. Emerging = growing adoption, no formal spec. Proposed = draft or specification proposal.

Access Control

robots.txt

/robots.txt

Crawler access directives for 25+ AI bots. Separates training bots (GPTBot, ClaudeBot) from search bots (OAI-SearchBot, PerplexityBot).

Standard (RFC 9309)

ai.txt

/ai.txt

AI-specific usage permissions: training, citation, indexing, summarization. Granular control beyond robots.txt.

Emerging (Community)

tdmrep.json

/.well-known/tdmrep.json

Text and Data Mining reservation. EU CDSM Directive Article 4 compliance. Machine-readable opt-out for AI training.

Standard (W3C TDMRep)
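A tdmrep.json file is small enough to generate by hand. The sketch below writes a single rule that reserves text-and-data-mining rights for the whole site. The property names (`location`, `tdm-reservation`, `tdm-policy`) follow my reading of the TDMRep report and should be verified against it; the policy URL and the `build_tdmrep` helper are placeholders.

```python
import json


def build_tdmrep(policy_url: str, location: str = "/") -> str:
    """Return a tdmrep.json body with one rule reserving TDM rights for `location`."""
    rules = [
        {
            "location": location,      # path pattern the rule applies to
            "tdm-reservation": 1,      # 1 = rights reserved
            "tdm-policy": policy_url,  # where licensing terms can be found
        }
    ]
    return json.dumps(rules, indent=2)


# Hypothetical policy URL for illustration.
tdmrep_body = build_tdmrep("https://example.com/policies/tdm-policy.json")
print(tdmrep_body)
```

The result would be served at /.well-known/tdmrep.json with a JSON content type.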

Content Discovery

llms.txt

/llms.txt

Structured Markdown summary for LLMs. Title, description, and organized links to key pages. Created by Jeremy Howard (Answer.AI), 2024.

Adopted (llmstxt.org)

llms-full.txt

/llms-full.txt

Full-text content export for deep AI ingestion. Extended version of llms.txt with complete page content.

Adopted (llmstxt.org)

sitemap.xml

/sitemap.xml

URL index with lastmod, changefreq, and priority metadata. Used by Google, Bing, and AI crawlers.

Standard (sitemaps.org)

feed.xml

/feed.xml

RSS/Atom feed for syndication. Chronological content updates consumed by readers and aggregators.

Standard (RSS 2.0 / Atom)

feed.json

/feed.json

JSON Feed (jsonfeed.org). Machine-readable alternative to RSS/Atom. Easier for AI agents to parse.

Adopted (JSON Feed 1.1)

Agent Discovery

ai-plugin.json

/ai-plugin.json

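ChatGPT plugin manifest. Declares site capabilities, API endpoints, and authentication for OpenAI agents. OpenAI's plugin documentation placed this file at /.well-known/ai-plugin.json, and ChatGPT plugins were retired in 2024, so treat it as a legacy signal.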

Adopted (OpenAI Plugin)

agents.json

/agents.json

Agent-to-Agent (A2A) capability advertisement. Declares skills, I/O modes, and authentication for autonomous agents.

Emerging (A2A Protocol)

MCP Server Card

/.well-known/mcp/server-card.json

MCP Server Card (SEP-1649). Exposes transport config, capabilities, and auth requirements for MCP clients.

Proposed (MCP / AAIF)

openapi.json

/api/openapi.json

OpenAPI 3.x specification. Machine-readable API contract. Foundation for agent tool discovery.

Standard (OpenAPI 3.1)

Structured Data

JSON-LD

Embedded in HTML <head>

Schema.org structured data (Organization, Article, FAQPage, WebSite with SearchAction). Primary signal for AI entity recognition.

Standard (Schema.org)
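As a minimal sketch of the JSON-LD entry above, the snippet below builds an Organization object and wraps it in the script tag that would go in the page head. The organization name, URLs, and the `organization_jsonld` helper are placeholders for illustration.

```python
import json


def organization_jsonld(name: str, url: str, same_as: list[str]) -> str:
    """Return a <script type="application/ld+json"> tag describing an Organization."""
    data = {
        "@context": "https://schema.org",
        "@type": "Organization",
        "name": name,
        "url": url,
        "sameAs": same_as,  # profiles that identify the same entity elsewhere
    }
    # Escape "</" so a stray "</script>" inside a value cannot close the tag early.
    payload = json.dumps(data, indent=2).replace("</", "<\\/")
    return f'<script type="application/ld+json">\n{payload}\n</script>'


# Placeholder values, not a real organization.
org_tag = organization_jsonld(
    "Example Co",
    "https://example.com",
    ["https://github.com/example-co", "https://www.linkedin.com/company/example-co"],
)
print(org_tag)
```

Keeping the `name` value identical to the one used in visible page copy supports the entity-consistency goal described elsewhere on this page.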

Brand & Identity

brand.txt

/brand.txt

Brand governance for AI systems: name capitalization, preferred terminology, prohibited terms, tone, competitor disambiguation.

Emerging (Community)

ai.json

/ai.json

Structured content map for AI agents. JSON metadata declaring site sections, topics, and content types.

Emerging (Community)

Trust & Security

security.txt

/.well-known/security.txt

Vulnerability reporting policy per RFC 9116: contact, expiry date, encryption key, and a link to the disclosure policy.

Standard (RFC 9116)

humans.txt

/humans.txt

Team credits, technologies used, and acknowledgments. Human-readable provenance signal.

Adopted (humanstxt.org)

dnt-policy.txt

/.well-known/dnt-policy.txt

Do Not Track compliance declaration. EFF standard format. Privacy-respecting signal for browsers and extensions.

Adopted (EFF DNT)

Sustainability

carbon.txt

/carbon.txt

Sustainability disclosure: hosting provider, energy sources, carbon offsets. Green Web Foundation standard.

Adopted (carbontxt.org)

Platform

manifest.json

/manifest.json

PWA metadata: app name, icons, theme colors, display mode. Required for installable web apps.

Standard (W3C Web App Manifest)

browserconfig.xml

/browserconfig.xml

Windows tile configuration for pinned sites. Tile images and background colors.

Standard (Microsoft)

ads.txt

/ads.txt

Authorized Digital Sellers. Declares which ad networks are authorized to sell inventory on your domain.

Standard (IAB Tech Lab)

Developer Agent

AGENTS.md

Repository root

Cross-tool project context for coding agents. Build commands, architecture, conventions. Emerging universal standard.

Emerging (agents.md)

CLAUDE.md

Repository root

Claude Code project context. Architecture, workflows, and coding conventions for Anthropic agents.

Adopted (Anthropic)

.cursorrules

Repository root

Cursor IDE agent rules. .cursorrules is the original single-file format; newer Cursor releases use modular .mdc files in .cursor/rules/ for context-aware coding assistance.

Adopted (Cursor)

AI Crawler Registry

Major AI crawler user-agent tokens by vendor as of Q2 2026. Separate training bots (content absorbed into model weights, no attribution) from search bots (content cited in AI-generated answers).

OpenAI

GPTBot, OAI-SearchBot, ChatGPT-User

Anthropic

ClaudeBot, Claude-SearchBot, Claude-User

Google

Googlebot, Google-Extended, GoogleOther

Perplexity

PerplexityBot, Perplexity-User

AEO vs GEO

Answer Engine Optimization targets direct answer selection. Generative Engine Optimization targets citation frequency across AI platforms. You need both.

| | AEO | GEO |
|---|---|---|
| Goal | Be selected as the direct answer | Be cited as a source across AI platforms |
| Targets | Perplexity, ChatGPT Search, Google AI Overviews | Claude, ChatGPT, Gemini recommendations |
| Content pattern | H2 headings as literal questions, 2-3 sentence answer below | Consistent terminology, clear authorship, JSON-LD, llms.txt |
| Key schema | FAQPage, HowTo | Organization, Person, WebSite + SearchAction, sameAs |
| Metric | Answer selection rate | Share of AI Voice (citation frequency) |
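The AEO content pattern (a literal question followed by a short answer) maps directly onto the FAQPage schema. The sketch below turns question/answer pairs into FAQPage JSON-LD; the `faq_jsonld` helper is hypothetical and the sample answer is abbreviated from this page.

```python
import json


def faq_jsonld(pairs: list[tuple[str, str]]) -> str:
    """Return FAQPage JSON-LD for a list of (question, answer) pairs."""
    data = {
        "@context": "https://schema.org",
        "@type": "FAQPage",
        "mainEntity": [
            {
                "@type": "Question",
                "name": question,
                "acceptedAnswer": {"@type": "Answer", "text": answer},
            }
            for question, answer in pairs
        ],
    }
    return json.dumps(data, indent=2)


faq_body = faq_jsonld(
    [
        (
            "What is the difference between AEO and GEO?",
            "AEO targets question-answer extraction; GEO targets topical "
            "authority and citation rate across the entire AI ecosystem.",
        )
    ]
)
print(faq_body)
```

The answer text in the markup should match the visible answer under the corresponding heading.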

Content That Gets Cited

AI systems don't rank content. They extract it. The difference matters. A page that ranks #1 on Google can be completely ignored by ChatGPT if it's structured poorly. What LLMs actually favor:

Answer-first formatting. Put the answer in the first 2-3 sentences after a heading, then explain. AI systems extract the answer block and move on. If your answer is buried in paragraph four, it won't be found.

Evidence density. Research shows that LLMs are biased toward content that reads as "evidentiary": numbers, citations, specific claims with sources. Pages with original data, named experts, and precise figures get cited at significantly higher rates than opinion pieces or generic overviews.

Entity consistency. AI models work with entities, not keywords. If you call your product "DataSync" on one page and "Data Sync Platform" on another, the model can't confidently attribute information to you. Use the same terminology everywhere. brand.txt exists to enforce this at the AI layer.

Structured data as ground truth. JSON-LD schema (Organization, Person, Article, FAQPage) gives AI systems a machine-readable source of truth about who you are and what your content is about. It's the most underrated signal in AI discoverability. Most sites either skip it or implement it incorrectly.
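The entity-consistency point can be checked mechanically. The following sketch scans page text for spellings that match a canonical product name apart from case, spacing, or hyphenation, using the "DataSync" example from above. The `find_name_variants` helper and its splitting rule are assumptions made for this illustration and will not catch every variant.

```python
import re


def find_name_variants(text: str, canonical: str) -> list[str]:
    """Return spellings in `text` that differ from `canonical` only by case, spaces, or hyphens."""
    # Split a CamelCase name into parts, e.g. "DataSync" -> ["Data", "Sync"].
    parts = re.findall(r"[A-Z][a-z0-9]*|[a-z0-9]+", canonical)
    # Allow an optional space or hyphen between the parts.
    pattern = r"\b" + r"[\s-]?".join(re.escape(part) for part in parts) + r"\b"
    matches = re.findall(pattern, text, flags=re.IGNORECASE)
    return sorted({match for match in matches if match != canonical})


# Made-up page copy for illustration.
page_text = (
    "DataSync exports CSV. The Data Sync Platform also supports "
    "scheduled datasync jobs."
)
variants = find_name_variants(page_text, "DataSync")
print(variants)
```

Running a check like this across a site before publishing brand.txt gives a list of pages whose wording needs to be aligned first.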

robots.txt Strategy

Separate training bots from search bots. Allow what you want cited, block what you want protected.

Search bots (OAI-SearchBot, Claude-SearchBot, PerplexityBot)

Crawl your site to include content in AI-generated answers. Blocking them removes you from AI search results entirely.

Training bots (GPTBot, ClaudeBot, Google-Extended)

Ingest content into model weights. Your content becomes part of the model but is not attributed to you.

Recommended strategy

Allow all search bots (you want citations). Selectively allow or block training bots. Always allow Googlebot. For EU sites, add /.well-known/tdmrep.json for CDSM Directive Article 4 compliance.
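The recommended strategy can be sketched as a small generator that emits one robots.txt group for Googlebot, one for search bots, and one for training bots. The bot lists come from this section; the `build_robots_txt` function, its parameters, and the sitemap URL are hypothetical, and this is not necessarily what the CLI described in the Setup section generates.

```python
SEARCH_BOTS = ["OAI-SearchBot", "Claude-SearchBot", "PerplexityBot"]
TRAINING_BOTS = ["GPTBot", "ClaudeBot", "Google-Extended"]


def _group(bots: list[str], rule: str) -> str:
    """Build one robots.txt group: several User-agent lines sharing one rule."""
    return "\n".join([f"User-agent: {bot}" for bot in bots] + [rule])


def build_robots_txt(block_training: bool = True, sitemap_url: str = "") -> str:
    """Allow Googlebot and search bots; allow or block training bots as a set."""
    training_rule = "Disallow: /" if block_training else "Allow: /"
    groups = [
        _group(["Googlebot"], "Allow: /"),
        _group(SEARCH_BOTS, "Allow: /"),
        _group(TRAINING_BOTS, training_rule),
    ]
    body = "\n\n".join(groups)
    if sitemap_url:
        body += f"\n\nSitemap: {sitemap_url}"
    return body + "\n"


# Placeholder sitemap URL.
robots_body = build_robots_txt(block_training=True, sitemap_url="https://example.com/sitemap.xml")
print(robots_body)
```

Grouping several User-agent lines above a shared rule keeps the file short. Splitting the training list into per-bot groups is the natural extension if you want to allow some training crawlers and block others.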

Setup

One command generates all discovery files. Auto-detects public/ or static/ directories. Existing files are never overwritten.

$ npx ai-discovery-standards

Works with Next.js, React, Vue, Hugo, Gatsby, and any static site. Full documentation on GitHub.

Standards evolve. Last updated May 2026. File an issue on GitHub if something is missing or outdated.
