The Visibility Problem
Search changed. In 2025, nearly 60% of queries ended without a click. The user got their answer directly from an AI summary. Google AI Overviews, ChatGPT Search, Perplexity, and Copilot now synthesize answers from multiple sources and present them as a single response. The "ten blue links" page is fading. If your content isn't structured for extraction and citation by these systems, you're invisible to a growing share of your audience.
This page documents every file, protocol, and technique that determines whether AI systems can find, understand, and cite your website. It's the result of building and testing these standards across production sites, not theory.
What the Data Shows
Numbers from 2025-2026 industry research on AI crawler behavior, blocking rates, and adoption.
~28%
of websites now block at least one major AI crawler via robots.txt, CDN, or WAF rules.
79%
of top news publishers block AI training bots. GPTBot is the most blocked crawler (17-62% depending on dataset).
~10%
of domains have adopted llms.txt. Among the top 1,000 sites, it drops to 0.3%. No major AI provider officially uses it as a ranking signal.
The blocking numbers reveal a market that hasn't settled on a strategy. Most publishers are reacting to AI crawlers the same way they reacted to early search engines in the 2000s: with blanket blocks. The problem: blocking search bots (OAI-SearchBot, Claude-SearchBot) removes you from AI-generated answers entirely. Blocking training bots (GPTBot, ClaudeBot) stops your content from being absorbed into model weights without attribution. These are different decisions with different consequences, and most sites are treating them as the same thing.
The llms.txt adoption curve is interesting for what it reveals about the standard's actual utility. No major AI provider (Google, OpenAI, Anthropic, Meta) has committed to using it as a retrieval signal. Its real value has shifted toward B2A (Business-to-Agent) communication: giving coding assistants, IDE agents, and documentation crawlers a structured entry point into your site. That's a narrower use case than the original pitch, but it's a real one.
The Training vs. Retrieval Split
The most important distinction in AI discoverability is between training crawlers and retrieval crawlers. They look similar in your access logs, but they do fundamentally different things with your content.
Training crawlers (GPTBot, ClaudeBot, Google-Extended, CCBot) ingest your content into model weights. Once absorbed, your words become part of the model's knowledge but are never attributed back to you. There is no referral traffic. No citation. Your content improves someone else's product.
Retrieval crawlers (OAI-SearchBot, Claude-SearchBot, PerplexityBot) fetch your content at query time to include in AI-generated answers. These crawlers do cite you. They do send traffic. AI-referred visitors convert at roughly 14% versus 3% for traditional organic, nearly five times higher, because they arrive with specific intent already formed by the AI summary.
The strategic play: allow all retrieval bots, selectively manage training bots. The robots.txt strategy section below shows exactly how to do this.
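The split described above can be sketched as a small access-log classifier. This is a minimal sketch, not a definitive implementation: the function name `classify_crawler` and the sample user-agent string are hypothetical, and the token lists only cover the crawlers named in this section. Google-Extended is left out on the assumption that it is a robots.txt control token rather than a string that shows up in request logs.

```python
# Tokens taken from the crawler names mentioned in this section.
# Assumption: Google-Extended is a robots.txt token only and does not
# appear as a user-agent in request logs, so it is omitted here.
TRAINING_TOKENS = ("GPTBot", "ClaudeBot", "CCBot")
RETRIEVAL_TOKENS = ("OAI-SearchBot", "Claude-SearchBot", "PerplexityBot")


def classify_crawler(user_agent):
    """Return 'retrieval', 'training', or 'other' for a raw user-agent string."""
    ua = user_agent.lower()
    # Check retrieval tokens first so a search bot is never mislabeled.
    if any(token.lower() in ua for token in RETRIEVAL_TOKENS):
        return "retrieval"
    if any(token.lower() in ua for token in TRAINING_TOKENS):
        return "training"
    return "other"


if __name__ == "__main__":
    # Illustrative user-agent string, not copied from any vendor's docs.
    sample = "Mozilla/5.0 (compatible; GPTBot/1.0; +https://example.com/bot)"
    print(classify_crawler(sample))
```

Matching on substrings is a deliberate simplification. User-agent strings can be spoofed, so a production setup would likely also verify the requesting IP ranges.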
Measuring AI Visibility
Traditional SEO metrics (clicks, impressions, keyword rankings) don't capture AI performance. These are the metrics that matter now.
How often AI systems cite your domain when answering questions in your topic area.
Your brand's presence as a percentage of all citations in AI-generated answers for core queries.
Visits from ChatGPT, Perplexity, Claude, and Copilot. Track via UTM params or referrer headers.
Conversion rate of AI-referred visitors vs. organic. Industry benchmarks: ~14% vs. ~3%.
Whether AI systems correctly describe your brand, products, and positioning. Inaccuracies here are what brand.txt is meant to reduce.
How often your content is chosen as the direct answer (not just cited) in AI Overviews and snippets.
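Tracking AI referral traffic and its conversion rate from referrer headers could look roughly like the following. This is a sketch under stated assumptions: the hostname-to-source mapping is my own guess at the domains these assistants send traffic from, and the function names `ai_source` and `conversion_rate` are hypothetical. Check the mapping against your own analytics before relying on it.

```python
from urllib.parse import urlparse

# Assumed referrer hostnames for each assistant; verify against real traffic.
AI_REFERRER_HOSTS = {
    "chatgpt.com": "ChatGPT",
    "perplexity.ai": "Perplexity",
    "claude.ai": "Claude",
    "copilot.microsoft.com": "Copilot",
}


def ai_source(referrer):
    """Return the AI source name for a referrer URL, or None if it is not one."""
    host = (urlparse(referrer).hostname or "").lower()
    for domain, name in AI_REFERRER_HOSTS.items():
        # Match the bare domain and any subdomain such as www.
        if host == domain or host.endswith("." + domain):
            return name
    return None


def conversion_rate(conversions, visits):
    """Conversions divided by visits, or 0.0 when there were no visits."""
    return conversions / visits if visits else 0.0


if __name__ == "__main__":
    print(ai_source("https://www.perplexity.ai/search?q=example"))
    print(conversion_rate(14, 100))
```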
File Registry
Standard = ratified RFC or W3C spec. Adopted = widely used convention. Emerging = growing adoption, no formal spec. Proposed = draft or specification proposal.
Access Control
Crawler access directives for 25+ AI bots. Separates training bots (GPTBot, ClaudeBot) from search bots (OAI-SearchBot, PerplexityBot).
AI-specific usage permissions: training, citation, indexing, summarization. Granular control beyond robots.txt.
Text and Data Mining reservation. EU CDSM Directive Article 4 compliance. Machine-readable opt-out for AI training.
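As a rough illustration of the machine-readable opt-out, the sketch below builds a tdmrep.json body. The field names (`location`, `tdm-reservation`, `tdm-policy`) reflect my understanding of the TDM Reservation Protocol and should be checked against the current specification. The function name `build_tdmrep` and the policy URL are hypothetical.

```python
import json


def build_tdmrep(location="/", policy_url=None):
    """Build a tdmrep.json body reserving TDM rights for one path prefix.

    Assumption: a value of 1 for "tdm-reservation" means rights are reserved.
    """
    rule = {"location": location, "tdm-reservation": 1}
    if policy_url:
        # Optional pointer to a policy describing licensing terms.
        rule["tdm-policy"] = policy_url
    return json.dumps([rule], indent=2)


if __name__ == "__main__":
    # Hypothetical policy URL, for illustration only.
    print(build_tdmrep("/", "https://example.com/tdm-policy.json"))
```

The output would be served at /.well-known/tdmrep.json on the site it covers.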
Content Discovery
Structured Markdown summary for LLMs. Title, description, and organized links to key pages. Created by Jeremy Howard (Answer.AI), 2024.
Full-text content export for deep AI ingestion. Extended version of llms.txt with complete page content.
URL index with lastmod, changefreq, and priority metadata. Used by Google, Bing, and AI crawlers.
RSS/Atom feed for syndication. Chronological content updates consumed by readers and aggregators.
JSON Feed (jsonfeed.org). Machine-readable alternative to RSS/Atom. Easier for AI agents to parse.
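The llms.txt structure described above (title, description, organized links) can be sketched as a small generator. This assumes the commonly described layout of an H1 title, a blockquote summary, and H2 sections of Markdown links. The function name `build_llms_txt` and all example URLs are hypothetical.

```python
def build_llms_txt(title, summary, sections):
    """Render an llms.txt document.

    `sections` maps a section heading to a list of (name, url, note) tuples.
    """
    lines = ["# " + title, "", "> " + summary, ""]
    for heading, links in sections.items():
        lines.append("## " + heading)
        lines.append("")
        for name, url, note in links:
            lines.append("- [{}]({}): {}".format(name, url, note))
        lines.append("")
    # Drop trailing blank lines and end with a single newline.
    return "\n".join(lines).rstrip() + "\n"


if __name__ == "__main__":
    text = build_llms_txt(
        "Example Site",
        "Documentation for a hypothetical product.",
        {"Docs": [("Guide", "https://example.com/guide", "Getting started")]},
    )
    print(text)
```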
Agent Discovery
ChatGPT plugin manifest. Declares site capabilities, API endpoints, and authentication for OpenAI agents.
Agent-to-Agent (A2A) capability advertisement. Declares skills, I/O modes, and authentication for autonomous agents.
MCP Server Card (SEP-1649). Exposes transport config, capabilities, and auth requirements for MCP clients.
OpenAPI 3.x specification. Machine-readable API contract. Foundation for agent tool discovery.
Structured Data
Schema.org structured data (Organization, Article, FAQPage, WebSite with SearchAction). Primary signal for AI entity recognition.
Brand & Identity
Brand governance for AI systems: name capitalization, preferred terminology, prohibited terms, tone, competitor disambiguation.
Structured content map for AI agents. JSON metadata declaring site sections, topics, and content types.
Trust & Security
Vulnerability reporting policy. Contact, encryption key, and disclosure timeline per RFC 9116.
Team credits, technologies used, and acknowledgments. Human-readable provenance signal.
Do Not Track compliance declaration. EFF standard format. Privacy-respecting signal for browsers and extensions.
Sustainability
Sustainability disclosure: hosting provider, energy sources, carbon offsets. Green Web Foundation standard.
Platform
PWA metadata: app name, icons, theme colors, display mode. Required for installable web apps.
Windows tile configuration for pinned sites. Tile images and background colors.
Authorized Digital Sellers. Declares which ad networks are authorized to sell inventory on your domain.
Developer Agent
Cross-tool project context for coding agents. Build commands, architecture, conventions. Emerging universal standard.
Claude Code project context. Architecture, workflows, and coding conventions for Anthropic agents.
Cursor IDE agent rules. Modular .mdc files in .cursor/rules/ for context-aware coding assistance.
AI Crawler Registry
All known AI crawler user-agent strings as of Q2 2026. Separate training bots (content absorbed into model weights, no attribution) from search bots (content cited in AI-generated answers).
GPTBot, OAI-SearchBot, ChatGPT-User, ClaudeBot, Claude-SearchBot, Claude-User, Googlebot, Google-Extended, GoogleOther, PerplexityBot, Perplexity-User, meta-externalagent, meta-externalfetcher, Applebot, Applebot-Extended, Amazonbot, Bytespider, TikTokSpider, bingbot, CopilotBot, CCBot, cohere-ai, YouBot, Diffbot, SemrushBot, AhrefsBot
AEO vs GEO
Answer Engine Optimization targets direct answer selection. Generative Engine Optimization targets citation frequency across AI platforms. You need both.
Content That Gets Cited
AI systems don't rank content. They extract it. The difference matters. A page that ranks #1 on Google can be completely ignored by ChatGPT if it's structured poorly. What LLMs actually favor:
Answer-first formatting. Put the answer in the first 2-3 sentences after a heading, then explain. AI systems extract the answer block and move on. If your answer is buried in paragraph four, it won't be found.
Evidence density. Research shows that LLMs are biased toward content that reads as "evidentiary": numbers, citations, specific claims with sources. Pages with original data, named experts, and precise figures get cited at significantly higher rates than opinion pieces or generic overviews.
Entity consistency. AI models work with entities, not keywords. If you call your product "DataSync" on one page and "Data Sync Platform" on another, the model can't confidently attribute information to you. Use the same terminology everywhere. brand.txt exists to enforce this at the AI layer.
Structured data as ground truth. JSON-LD schema (Organization, Person, Article, FAQPage) gives AI systems a machine-readable source of truth about who you are and what your content is about. It's the most underrated signal in AI discoverability. Most sites either skip it or implement it incorrectly.
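To make the structured-data point concrete, here is a minimal sketch that renders an Organization JSON-LD block. The `@context`, `@type`, `name`, `url`, and `sameAs` keys are standard Schema.org vocabulary. The function name `organization_jsonld` and the example values are hypothetical.

```python
import json

SCRIPT_OPEN = '<script type="application/ld+json">'
SCRIPT_CLOSE = "</script>"


def organization_jsonld(name, url, same_as):
    """Return an HTML script tag containing Organization JSON-LD."""
    data = {
        "@context": "https://schema.org",
        "@type": "Organization",
        "name": name,  # use the exact same spelling on every page
        "url": url,
        "sameAs": list(same_as),  # profiles that identify the same entity
    }
    # Escape "</" so a stray "</script>" in a value cannot end the tag early.
    body = json.dumps(data, indent=2).replace("</", "<\\/")
    return SCRIPT_OPEN + body + SCRIPT_CLOSE


if __name__ == "__main__":
    print(organization_jsonld(
        "DataSync", "https://example.com", ["https://github.com/example"]
    ))
```

Generating the block from one function is one way to keep the entity name identical across pages, which ties back to the entity-consistency point.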
robots.txt Strategy
Separate training bots from search bots. Allow what you want cited, block what you want protected.
Search bots crawl your site to include content in AI-generated answers. Blocking them removes you from AI search results entirely.
Training bots ingest content into model weights. Your content becomes part of the model but is not attributed to you.
Allow all search bots (you want citations). Selectively allow or block training bots. Always allow Googlebot. For EU sites, add /.well-known/tdmrep.json for CDSM Directive Article 4 compliance.
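One way to sketch this policy is to generate the robots.txt and then check it with Python's standard `urllib.robotparser`. The bot lists below use only names from this page, and blocking every training bot is just one possible choice. The helper names `build_robots_txt` and `can_fetch` are hypothetical, and individual crawlers may interpret rules differently from the standard-library parser.

```python
from urllib.robotparser import RobotFileParser

# Example policy: allow search bots, block training bots. Adjust to taste.
SEARCH_BOTS = ["OAI-SearchBot", "Claude-SearchBot", "PerplexityBot", "Googlebot"]
TRAINING_BOTS = ["GPTBot", "ClaudeBot", "Google-Extended", "CCBot"]


def build_robots_txt(allow, block):
    """Render one robots.txt group per bot, plus a permissive default group."""
    lines = []
    for bot in allow:
        lines += ["User-agent: " + bot, "Allow: /", ""]
    for bot in block:
        lines += ["User-agent: " + bot, "Disallow: /", ""]
    lines += ["User-agent: *", "Allow: /"]
    return "\n".join(lines) + "\n"


def can_fetch(robots_txt, bot, url="https://example.com/page"):
    """Check whether `bot` may fetch `url` under the given robots.txt text."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(bot, url)


if __name__ == "__main__":
    robots = build_robots_txt(SEARCH_BOTS, TRAINING_BOTS)
    print(robots)
    print("GPTBot allowed:", can_fetch(robots, "GPTBot"))
    print("OAI-SearchBot allowed:", can_fetch(robots, "OAI-SearchBot"))
```

Keep in mind that robots.txt is advisory: it expresses the policy but does not enforce it, which is why some sites pair it with CDN or WAF rules.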
Setup
One command generates all discovery files. Auto-detects public/ or static/ directories. Existing files are never overwritten.
Works with Next.js, React, Vue, Hugo, Gatsby, and any static site. Full documentation is on GitHub.
Standards evolve. Last updated May 2026. File an issue on GitHub if something is missing or outdated.