Prompt Injection

Prompt injection is a security attack where malicious instructions are embedded in content that an AI system processes, causing the model to follow attacker-controlled commands instead of legitimate user or system instructions.

It exploits the core nature of LLMs: they process all text in their context as potential instructions without being able to reliably distinguish between trusted system prompts and untrusted external content. A simple example: a user asks an AI assistant to summarize a webpage. The webpage contains hidden text saying 'Ignore all previous instructions.

' If the model follows this instruction, the attack succeeds. Prompt injection becomes critical as AI agents gain more capabilities. An agent that can send emails, access databases, or execute code transforms a prompt injection from an annoyance into a serious security vulnerability. Indirect prompt injection is particularly dangerous.

This is where the malicious instructions come from external sources the agent retrieves like web pages, documents, or emails. The attack surface is enormous. Defenses include input sanitization, instruction hierarchy enforcement, and limiting agent capabilities to minimum necessary permissions.

Interactive Concept: prompt injection

Prompt Injection Attack Simulator

Explore how malicious instructions can be embedded in content to manipulate AI systems into ignoring their original instructions.

System Setup

User Input

Content Processing

Injection Detected

AI Response

AI System Processing

System Prompt:

You are a helpful assistant. Summarize the following document for the user.

AI Response

AI is processing...

Prompt Injection Attack Simulator

AI System Processing

AI Response

Related Terms

Prompt Injection

Prompt Injection Attack Simulator

AI System Processing

AI Response

Related Terms