veda.ng

A tokenizer is a preprocessing system that converts raw text into a sequence of discrete tokens a language model can process, serving as the bridge between human-readable text and numerical model inputs. Different tokenizers produce different token sequences for the same text, which affects model behavior, efficiency, and capabilities. The tokenizer defines the vocabulary (the set of all possible tokens) and the rules for splitting text into tokens. Common approaches include byte-pair encoding (BPE), WordPiece, and SentencePiece, all of which learn subword units that balance vocabulary size against sequence length.

English text typically tokenizes at about 0.75 words per token (roughly 1.3 tokens per word), but this ratio varies significantly by language, domain, and specific tokenizer. Languages with complex morphology or non-Latin scripts often tokenize less efficiently, using more tokens per word and consuming more of the context window. Code, mathematical notation, and technical content may also tokenize poorly if they were underrepresented in the tokenizer's training data.

The tokenizer determines what the model can see: if important distinctions are not captured in tokenization, the model cannot learn them. Tokenizer choice therefore affects model speed (more tokens means more computation), context utilization (inefficient tokenization wastes context window), and capabilities (models struggle with text that tokenizes badly). Training and inference must use the same tokenizer to ensure consistency.
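The byte-pair encoding idea mentioned above can be sketched in a few lines: start from individual characters, then repeatedly merge the most frequent adjacent pair of symbols to build subword units. This is a minimal illustrative sketch, not any production tokenizer; the toy corpus, the `</w>` end-of-word marker, and the merge count are assumptions chosen for clarity.

```python
from collections import Counter

def get_pair_counts(words):
    # words: dict mapping a tuple of symbols to its corpus frequency
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_pair(words, pair):
    # Replace every adjacent occurrence of `pair` with one merged symbol.
    a, b = pair
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i < len(symbols) - 1 and symbols[i] == a and symbols[i + 1] == b:
                out.append(a + b)
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[tuple(out)] = freq
    return merged

# Toy corpus: each word is split into characters plus an end-of-word
# marker (a common BPE convention), mapped to its frequency.
corpus = {
    ("l", "o", "w", "</w>"): 5,
    ("l", "o", "w", "e", "r", "</w>"): 2,
    ("n", "e", "w", "e", "s", "t", "</w>"): 6,
    ("w", "i", "d", "e", "s", "t", "</w>"): 3,
}

merges = []
for _ in range(4):  # learn 4 merges
    pairs = get_pair_counts(corpus)
    best = max(pairs, key=pairs.get)
    merges.append(best)
    corpus = merge_pair(corpus, best)

print(merges)
# → [('e', 's'), ('es', 't'), ('est', '</w>'), ('l', 'o')]
```

Frequent endings like "est" become single vocabulary entries after a few merges, while rare words remain split into smaller pieces; this is the trade-off between vocabulary size and sequence length described above.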