veda.ng

AI Discovery Standards

25 files across 9 categories. Every protocol and metadata standard used to make websites discoverable by AI systems, search engines, and autonomous agents.

By Vedang Vatsa · GitHub · npx ai-discovery-standards

File Registry

Standard = ratified RFC or W3C spec. Adopted = widely used convention. Emerging = growing adoption, no formal spec. Proposed = draft or specification proposal.

Access Control

robots.txt
/robots.txt

Crawler access directives for 25+ AI bots. Separates training bots (GPTBot, ClaudeBot) from search bots (OAI-SearchBot, PerplexityBot).

Standard · RFC 9309

ai.txt
/ai.txt

AI-specific usage permissions: training, citation, indexing, summarization. Granular control beyond robots.txt.

Emerging · Community

tdmrep.json
/.well-known/tdmrep.json

Text and Data Mining reservation. EU CDSM Directive Article 4 compliance. Machine-readable opt-out for AI training.

Standard · W3C TDMRep
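
As a rough illustration of the tdmrep.json entry above, the sketch below builds a reservation file in Python. The `location`, `tdm-reservation`, and `tdm-policy` keys follow the W3C TDMRep report; the helper name `build_tdmrep`, the paths, and the example.com policy URL are hypothetical.

```python
import json

def build_tdmrep(rules):
    """Return a tdmrep.json body for (path, reserved, policy_url) rules."""
    entries = []
    for path, reserved, policy_url in rules:
        # 1 = text and data mining rights are reserved, 0 = not reserved.
        entry = {"location": path, "tdm-reservation": 1 if reserved else 0}
        if policy_url:
            # Optional pointer to a machine-readable licensing policy.
            entry["tdm-policy"] = policy_url
        entries.append(entry)
    return json.dumps(entries, indent=2)

# Hypothetical rules: reserve rights on the blog, leave press pages open.
TDMREP = build_tdmrep([
    ("/blog/*", True, "https://example.com/tdm-policy.json"),
    ("/press/*", False, None),
])
print(TDMREP)
```

The output would be served at /.well-known/tdmrep.json.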

Content Discovery

llms.txt
/llms.txt

Curated Markdown summary for LLMs. Title, description, and organized links to key pages. Created by Jeremy Howard (Answer.AI), 2024.

llms-full.txt
/llms-full.txt

Full-text content export for deep AI ingestion. Extended version of llms.txt with complete page content.

sitemap.xml
/sitemap.xml

URL index with lastmod, changefreq, and priority metadata. Used by Google, Bing, and AI crawlers.

feed.xml
/feed.xml

RSS/Atom feed for syndication. Chronological content updates consumed by readers and aggregators.

Standard · RSS 2.0 / Atom

feed.json
/feed.json

JSON Feed (jsonfeed.org). Machine-readable alternative to RSS/Atom. Easier for AI agents to parse.
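
To make the llms.txt layout described above concrete (title, description, organized links), here is a minimal Python sketch. The function name `build_llms_txt` and all example.com content are assumptions for illustration; the Markdown shape (H1 title, blockquote summary, H2 sections of links) follows the llms.txt proposal.

```python
def build_llms_txt(title, summary, sections):
    """Render llms.txt: H1 title, blockquote summary, H2 sections of links."""
    lines = [f"# {title}", "", f"> {summary}", ""]
    for heading, links in sections.items():
        lines.append(f"## {heading}")
        lines.append("")
        for name, url, note in links:
            # One Markdown link per key page, followed by a short note.
            lines.append(f"- [{name}]({url}): {note}")
        lines.append("")
    return "\n".join(lines).rstrip() + "\n"

# Hypothetical site content.
LLMS_TXT = build_llms_txt(
    "Example Site",
    "Hypothetical site used to illustrate the llms.txt layout.",
    {"Docs": [("Getting started", "https://example.com/start.md",
               "Install and first run")]},
)
print(LLMS_TXT)
```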

Agent Discovery

ai-plugin.json
/ai-plugin.json

ChatGPT plugin manifest. Declares site capabilities, API endpoints, and authentication for OpenAI agents.

agents.json
/agents.json

Agent-to-Agent (A2A) capability advertisement. Declares skills, I/O modes, and authentication for autonomous agents.

Emerging · A2A Protocol

MCP Server Card
/.well-known/mcp/server-card.json

MCP Server Card (SEP-1649). Exposes transport config, capabilities, and auth requirements for MCP clients.

Proposed · MCP / AAIF

openapi.json
/api/openapi.json

OpenAPI 3.x specification. Machine-readable API contract. Foundation for agent tool discovery.

Standard · OpenAPI 3.1

Structured Data

JSON-LD
Embedded in HTML <head>

Schema.org structured data (Organization, Article, FAQPage, WebSite with SearchAction). Primary signal for AI entity recognition.

Standard · Schema.org
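
A minimal sketch of the WebSite with SearchAction markup mentioned above, rendered as a script tag for the HTML head. The helper name `website_jsonld`, the `/search?q=` path, and example.com are hypothetical; the property names come from Schema.org.

```python
import json

def website_jsonld(name, url, search_path="/search?q="):
    """Return a JSON-LD <script> tag describing a WebSite with a SearchAction."""
    data = {
        "@context": "https://schema.org",
        "@type": "WebSite",
        "name": name,
        "url": url,
        "potentialAction": {
            "@type": "SearchAction",
            "target": {
                "@type": "EntryPoint",
                # {search_term_string} is a literal placeholder in the template.
                "urlTemplate": f"{url.rstrip('/')}{search_path}{{search_term_string}}",
            },
            "query-input": "required name=search_term_string",
        },
    }
    # Escape "</" so a string value cannot close the script tag early.
    body = json.dumps(data, indent=2).replace("</", "<\\/")
    return f'<script type="application/ld+json">\n{body}\n</script>'

SNIPPET = website_jsonld("Example Site", "https://example.com/")
print(SNIPPET)
```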

Brand & Identity

brand.txt
/brand.txt

Brand governance for AI systems: name capitalization, preferred terminology, prohibited terms, tone, competitor disambiguation.

Emerging · Community

ai.json
/ai.json

Structured content map for AI agents. JSON metadata declaring site sections, topics, and content types.

Emerging · Community

Trust & Security

security.txt
/.well-known/security.txt

Vulnerability reporting policy. Contact, encryption key, and disclosure timeline per RFC 9116.

Standard · RFC 9116

humans.txt
/humans.txt

Team credits, technologies used, and acknowledgments. Human-readable provenance signal.

dnt-policy.txt
/.well-known/dnt-policy.txt

Do Not Track compliance declaration. EFF standard format. Privacy-respecting signal for browsers and extensions.

Adopted · EFF DNT

Sustainability

carbon.txt
/carbon.txt

Sustainability disclosure: hosting provider, energy sources, carbon offsets. Green Web Foundation standard.

Platform

manifest.json
/manifest.json

PWA metadata: app name, icons, theme colors, display mode. Required for installable web apps.

browserconfig.xml
/browserconfig.xml

Windows tile configuration for pinned sites. Tile images and background colors.

Standard · Microsoft

ads.txt
/ads.txt

Authorized Digital Sellers. Declares which ad networks are authorized to sell inventory on your domain.

Developer Agent

AGENTS.md
Repository root

Cross-tool project context for coding agents. Build commands, architecture, conventions. Emerging universal standard.

Emerging · agents.md

CLAUDE.md
Repository root

Claude Code project context. Architecture, workflows, and coding conventions for Anthropic agents.

Adopted · Anthropic

.cursorrules
Repository root

Cursor IDE agent rules. Modular .mdc files in .cursor/rules/ for context-aware coding assistance.

Adopted · Cursor

AI Crawler Registry

All known AI crawler user-agent strings as of Q2 2026. Separate training bots (content absorbed into model weights, no attribution) from search bots (content cited in AI-generated answers).

OpenAI
GPTBot, OAI-SearchBot, ChatGPT-User

Anthropic
ClaudeBot, Claude-SearchBot, Claude-User

Google
Googlebot, Google-Extended, GoogleOther

Perplexity
PerplexityBot, Perplexity-User

Meta
meta-externalagent, meta-externalfetcher

Apple
Applebot, Applebot-Extended

Amazon
Amazonbot

ByteDance
Bytespider, TikTokSpider

Microsoft
bingbot, CopilotBot

Others
CCBot, cohere-ai, YouBot, Diffbot, SemrushBot, AhrefsBot

AEO vs GEO

Answer Engine Optimization targets direct answer selection. Generative Engine Optimization targets citation frequency across AI platforms. You need both.

Goal
AEO: Be selected as the direct answer
GEO: Be cited as a source across AI platforms

Targets
AEO: Perplexity, ChatGPT Search, Google AI Overviews
GEO: Claude, ChatGPT, Gemini recommendations

Content pattern
AEO: H2 headings as literal questions, 2-3 sentence answer below
GEO: Consistent terminology, clear authorship, JSON-LD, llms.txt

Key schema
AEO: FAQPage, HowTo
GEO: Organization, Person, WebSite + SearchAction, sameAs

Metric
AEO: Answer selection rate
GEO: Share of AI Voice (citation frequency)
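
The AEO content pattern (a literal question with a short answer) maps directly onto FAQPage markup. The sketch below shows one way to generate it in Python; the helper name `faq_jsonld` and the sample question are hypothetical, and the property names come from Schema.org.

```python
import json

def faq_jsonld(pairs):
    """Build FAQPage JSON-LD from (question, answer) pairs."""
    return {
        "@context": "https://schema.org",
        "@type": "FAQPage",
        "mainEntity": [
            {
                "@type": "Question",
                "name": question,  # ideally the same text as the H2 heading
                "acceptedAnswer": {"@type": "Answer", "text": answer},
            }
            for question, answer in pairs
        ],
    }

# Hypothetical question and 2-sentence answer.
FAQ = faq_jsonld([
    ("What is llms.txt?",
     "A curated Markdown summary of a site for LLMs. "
     "It lists a title, a description, and links to key pages."),
])
print(json.dumps(FAQ, indent=2))
```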

robots.txt Strategy

Separate training bots from search bots. Allow what you want cited, block what you want protected.

Search bots (OAI-SearchBot, Claude-SearchBot, PerplexityBot)

Crawl your site to include content in AI-generated answers. Blocking them removes you from AI search results entirely.

Training bots (GPTBot, ClaudeBot, Google-Extended)

Ingest content into model weights. Your content becomes part of the model but is not attributed to you.

Recommended strategy

Allow all search bots (you want citations). Selectively allow or block training bots. Always allow Googlebot. For EU sites, add /.well-known/tdmrep.json for CDSM Directive Article 4 compliance.
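
The recommended strategy can be sketched as a small robots.txt generator. This is an illustration only, not the output of the ai-discovery-standards tool: the function name `build_robots_txt` and the sitemap URL are hypothetical, and the bot lists are limited to the names given above. It relies on RFC 9309 allowing several User-agent lines to share one group of rules.

```python
SEARCH_BOTS = ["OAI-SearchBot", "Claude-SearchBot", "PerplexityBot", "Googlebot"]
TRAINING_BOTS = ["GPTBot", "ClaudeBot", "Google-Extended"]

def build_robots_txt(block_training=True, sitemap_url=None):
    """Allow search bots; allow or block training bots as one group."""
    lines = []
    # Group 1: search bots are always allowed so content can be cited.
    lines += [f"User-agent: {bot}" for bot in SEARCH_BOTS]
    lines += ["Allow: /", ""]
    # Group 2: training bots share a single rule chosen by the site owner.
    lines += [f"User-agent: {bot}" for bot in TRAINING_BOTS]
    lines += ["Disallow: /" if block_training else "Allow: /", ""]
    # Fallback group for every other crawler.
    lines += ["User-agent: *", "Allow: /"]
    if sitemap_url:
        lines += ["", f"Sitemap: {sitemap_url}"]
    return "\n".join(lines) + "\n"

ROBOTS = build_robots_txt(sitemap_url="https://example.com/sitemap.xml")
print(ROBOTS)
```

Passing `block_training=False` flips the training group to `Allow: /` without touching the search group.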

Setup

One command generates all discovery files. Auto-detects public/ or static/ directories. Existing files are never overwritten.

$ npx ai-discovery-standards

Works with Next.js, React, Vue, Hugo, Gatsby, and any static site. Full documentation is on GitHub.

Standards evolve. Last updated May 2026. File an issue on GitHub if something is missing or outdated.