AI Discovery Standards
25 files across 9 categories. Every protocol and metadata standard used to make websites discoverable by AI systems, search engines, and autonomous agents.
By Vedang Vatsa · GitHub → · npx ai-discovery-standards
File Registry
Standard = ratified RFC or W3C spec. Adopted = widely used convention. Emerging = growing adoption, no formal spec. Proposed = draft or specification proposal.
Access Control
Crawler access directives for 25+ AI bots. Separates training bots (GPTBot, ClaudeBot) from search bots (OAI-SearchBot, PerplexityBot).
AI-specific usage permissions: training, citation, indexing, summarization. Granular control beyond robots.txt.
Text and Data Mining reservation. EU CDSM Directive Article 4 compliance. Machine-readable opt-out for AI training.
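As a rough illustration of the Text and Data Mining reservation entry above, the sketch below builds the JSON body for a /.well-known/tdmrep.json file. The field names (location, tdm-reservation, tdm-policy) follow one reading of the W3C TDM Reservation Protocol community report and should be checked against the current text; the helper name build_tdmrep and the example policy URL are hypothetical.

```python
import json


def build_tdmrep(rules):
    """Return JSON text for a tdmrep.json file.

    `rules` is a list of (location, reserved, policy_url) tuples.
    `policy_url` may be None when no policy document is published.
    """
    entries = []
    for location, reserved, policy_url in rules:
        # Assumed field names; 1 means TDM rights are reserved, 0 means not reserved.
        entry = {"location": location, "tdm-reservation": 1 if reserved else 0}
        if policy_url:
            entry["tdm-policy"] = policy_url
        entries.append(entry)
    return json.dumps(entries, indent=2)


if __name__ == "__main__":
    # Hypothetical example: reserve rights for the whole site.
    print(build_tdmrep([("/", True, "https://example.com/tdm-policy.json")]))
```

The output would be saved as tdmrep.json under /.well-known/ on the site's origin.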
Content Discovery
Curated Markdown summary for LLMs. Title, description, and organized links to key pages. Created by Jeremy Howard (Answer.AI), 2024.
Full-text content export for deep AI ingestion. Extended version of llms.txt with complete page content.
URL index with lastmod, changefreq, and priority metadata. Used by Google, Bing, and AI crawlers.
RSS/Atom feed for syndication. Chronological content updates consumed by readers and aggregators.
JSON Feed (jsonfeed.org). Machine-readable alternative to RSS/Atom. Easier for AI agents to parse.
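The llms.txt entry in this category describes a Markdown file with a title, a description, and organized links. A minimal sketch of that layout follows, assuming the commonly used shape of an H1 title, a blockquote summary, and H2 sections of link bullets. The function name build_llms_txt and the example URLs are hypothetical.

```python
def build_llms_txt(title, description, sections):
    """Render an llms.txt-style Markdown summary.

    `sections` maps a section heading to a list of (name, url, note) tuples.
    `note` may be an empty string.
    """
    lines = [f"# {title}", "", f"> {description}", ""]
    for heading, links in sections.items():
        lines.append(f"## {heading}")
        lines.append("")
        for name, url, note in links:
            suffix = f": {note}" if note else ""
            lines.append(f"- [{name}]({url}){suffix}")
        lines.append("")
    # End the file with exactly one trailing newline.
    return "\n".join(lines).rstrip() + "\n"


if __name__ == "__main__":
    print(build_llms_txt(
        "Example Site",
        "A hypothetical site used for illustration.",
        {"Docs": [("Guide", "https://example.com/guide.md", "Getting started")]},
    ))
```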
Agent Discovery
ChatGPT plugin manifest. Declares site capabilities, API endpoints, and authentication for OpenAI agents.
Agent-to-Agent (A2A) capability advertisement. Declares skills, I/O modes, and authentication for autonomous agents.
MCP Server Card (SEP-1649). Exposes transport config, capabilities, and auth requirements for MCP clients.
OpenAPI 3.x specification. Machine-readable API contract. Foundation for agent tool discovery.
Structured Data
Schema.org structured data (Organization, Article, FAQPage, WebSite with SearchAction). Primary signal for AI entity recognition.
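To make the WebSite with SearchAction case concrete, here is a small sketch that builds the JSON-LD object and wraps it in a script tag for embedding in a page. The property layout mirrors the widely published sitelinks-search-box pattern. The helper names and the example.com URLs are assumptions for illustration only.

```python
import json


def website_jsonld(name, url, search_url_template):
    """Build a Schema.org WebSite object with a SearchAction."""
    return {
        "@context": "https://schema.org",
        "@type": "WebSite",
        "name": name,
        "url": url,
        "potentialAction": {
            "@type": "SearchAction",
            "target": {"@type": "EntryPoint", "urlTemplate": search_url_template},
            "query-input": "required name=search_term_string",
        },
    }


def to_script_tag(data):
    """Serialize JSON-LD into an HTML script element."""
    # Escape "</" so a string value cannot close the script element early.
    payload = json.dumps(data, indent=2).replace("</", "<\\/")
    return f'<script type="application/ld+json">\n{payload}\n</script>'


if __name__ == "__main__":
    data = website_jsonld(
        "Example Site",
        "https://example.com",
        "https://example.com/search?q={search_term_string}",
    )
    print(to_script_tag(data))
```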
Brand & Identity
Brand governance for AI systems: name capitalization, preferred terminology, prohibited terms, tone, competitor disambiguation.
Structured content map for AI agents. JSON metadata declaring site sections, topics, and content types.
Trust & Security
Vulnerability reporting policy. Contact, encryption key, and disclosure timeline per RFC 9116.
Team credits, technologies used, and acknowledgments. Human-readable provenance signal.
Do Not Track compliance declaration. EFF standard format. Privacy-respecting signal for browsers and extensions.
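For the vulnerability reporting policy listed in this category, the sketch below renders a small security.txt body. It uses the Contact, Expires, Encryption, and Policy fields defined in RFC 9116, where Contact and Expires are the required ones. The function name, the 180-day default, and the example addresses are assumptions, not part of any tool described here.

```python
from datetime import datetime, timedelta, timezone


def build_security_txt(contact, encryption=None, policy=None, valid_days=180, now=None):
    """Render a minimal security.txt body.

    `now` should be a timezone-aware datetime; it defaults to the current UTC time.
    """
    now = (now or datetime.now(timezone.utc)).astimezone(timezone.utc)
    expires = (now + timedelta(days=valid_days)).replace(microsecond=0)
    lines = [
        f"Contact: {contact}",
        f"Expires: {expires.strftime('%Y-%m-%dT%H:%M:%SZ')}",
    ]
    if encryption:
        lines.append(f"Encryption: {encryption}")
    if policy:
        lines.append(f"Policy: {policy}")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Hypothetical contact and policy URLs.
    print(build_security_txt(
        "mailto:security@example.com",
        policy="https://example.com/security-policy",
    ))
```

The result is normally served at /.well-known/security.txt over HTTPS.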
Sustainability
Sustainability disclosure: hosting provider, energy sources, carbon offsets. Green Web Foundation standard.
Platform
PWA metadata: app name, icons, theme colors, display mode. Required for installable web apps.
Windows tile configuration for pinned sites. Tile images and background colors.
Authorized Digital Sellers. Declares which ad networks are authorized to sell inventory on your domain.
Developer Agent
Cross-tool project context for coding agents. Build commands, architecture, conventions. Emerging universal standard.
Claude Code project context. Architecture, workflows, and coding conventions for Anthropic agents.
Cursor IDE agent rules. Modular .mdc files in .cursor/rules/ for context-aware coding assistance.
AI Crawler Registry
All known AI crawler user-agent strings as of Q2 2026. Separate training bots (content absorbed into model weights, no attribution) from search bots (content cited in AI-generated answers).
GPTBot, OAI-SearchBot, ChatGPT-User, ClaudeBot, Claude-SearchBot, Claude-User, Googlebot, Google-Extended, GoogleOther, PerplexityBot, Perplexity-User, meta-externalagent, meta-externalfetcher, Applebot, Applebot-Extended, Amazonbot, Bytespider, TikTokSpider, bingbot, CopilotBot, CCBot, cohere-ai, YouBot, Diffbot, SemrushBot, AhrefsBot
AEO vs GEO
Answer Engine Optimization targets direct answer selection. Generative Engine Optimization targets citation frequency across AI platforms. You need both.
robots.txt Strategy
Separate training bots from search bots. Allow what you want cited, block what you want protected.
Search bots crawl your site to include content in AI-generated answers. Blocking them removes you from AI search results entirely.
Training bots ingest content into model weights. Your content becomes part of the model but is not attributed to you.
Allow all search bots (you want citations). Selectively allow or block training bots. Always allow Googlebot. For EU sites, add /.well-known/tdmrep.json for CDSM Directive Article 4 compliance.
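The recommendation above can be sketched as a small generator that emits one robots.txt group per crawler. The bot lists use only the examples this page names explicitly (OAI-SearchBot and PerplexityBot as search bots, GPTBot and ClaudeBot as training bots, plus Googlebot). The function name and the allow_training switch are hypothetical, and a real site would extend the lists from the crawler registry.

```python
# Crawlers the page names explicitly; extend these lists for a real deployment.
SEARCH_BOTS = ["OAI-SearchBot", "PerplexityBot"]
TRAINING_BOTS = ["GPTBot", "ClaudeBot"]
ALWAYS_ALLOW = ["Googlebot"]


def build_robots_txt(allow_training=False):
    """Return robots.txt text that allows search bots and gates training bots."""
    groups = []
    for bot in ALWAYS_ALLOW + SEARCH_BOTS:
        groups.append(f"User-agent: {bot}\nAllow: /")
    rule = "Allow: /" if allow_training else "Disallow: /"
    for bot in TRAINING_BOTS:
        groups.append(f"User-agent: {bot}\n{rule}")
    # Groups are separated by a blank line.
    return "\n\n".join(groups) + "\n"


if __name__ == "__main__":
    print(build_robots_txt(allow_training=False))
```

Keeping each crawler in its own group makes it easy to flip a single training bot from Disallow to Allow later without touching the search bots.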
Setup
One command, npx ai-discovery-standards, generates all discovery files. It auto-detects public/ or static/ directories and never overwrites existing files.
Works with Next.js, React, Vue, Hugo, Gatsby, and any static site. Full documentation is on GitHub.
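The two behaviors described in this section, directory detection and never overwriting existing files, could look roughly like the sketch below. This is not the tool's actual implementation. The function names and the fallback to the project root when neither directory exists are assumptions made for illustration.

```python
from pathlib import Path


def detect_output_dir(project_root):
    """Pick public/ or static/ if present, else fall back to the root (assumed behavior)."""
    root = Path(project_root)
    for name in ("public", "static"):
        candidate = root / name
        if candidate.is_dir():
            return candidate
    return root


def write_if_absent(path, content):
    """Create the file only if it does not exist; return True when written."""
    try:
        # Mode "x" fails with FileExistsError instead of truncating an existing file.
        with open(path, "x", encoding="utf-8") as handle:
            handle.write(content)
    except FileExistsError:
        return False
    return True


if __name__ == "__main__":
    out_dir = detect_output_dir(".")
    created = write_if_absent(out_dir / "llms.txt", "# Example Site\n")
    print("created" if created else "skipped (already exists)")
```

Opening with exclusive-create mode avoids the race between checking for a file and writing it, which a separate exists() check would leave open.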
Standards evolve. Last updated May 2026. File an issue on GitHub if something is missing or outdated.