veda.ng

A tokenizer is a preprocessing system that converts raw text into a sequence of discrete tokens a language model can process, serving as the bridge between human-readable text and numerical model inputs. Different tokenizers produce different token sequences for the same text, which affects model behavior, efficiency, and capabilities. The tokenizer defines the vocabulary (the set of all possible tokens) and the rules for splitting text into tokens. Common approaches include byte-pair encoding (BPE), WordPiece, and SentencePiece, all of which learn subword units that balance vocabulary size against sequence length.

English text typically tokenizes at about 0.75 words per token (roughly 1.3 tokens per word), but this ratio varies significantly by language, domain, and specific tokenizer. Languages with complex morphology or non-Latin scripts often tokenize less efficiently, using more tokens per word and consuming more of the context window. Code, mathematical notation, and technical content may also tokenize poorly if they were underrepresented in the tokenizer's training data.

The tokenizer determines what the model can see: if important distinctions are not captured in tokenization, the model cannot learn them. Tokenizer choice therefore affects model speed (more tokens means more computation), context utilization (inefficient tokenization wastes context window), and capabilities (models struggle with text that tokenizes badly). Training and inference must use the same tokenizer to ensure consistency.
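The byte-pair encoding idea mentioned above can be sketched in a few lines: start from individual characters, then repeatedly merge the most frequent adjacent pair of symbols to build subword units. This is a minimal illustrative sketch, not any production tokenizer; the toy corpus, the `</w>` end-of-word marker, and the merge count are assumptions chosen for clarity.

```python
from collections import Counter

def get_pair_counts(words):
    # words: dict mapping a tuple of symbols to its corpus frequency
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_pair(words, pair):
    # Replace every adjacent occurrence of `pair` with one merged symbol.
    a, b = pair
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i < len(symbols) - 1 and symbols[i] == a and symbols[i + 1] == b:
                out.append(a + b)
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[tuple(out)] = freq
    return merged

# Toy corpus: each word is split into characters plus an end-of-word
# marker (a common BPE convention), mapped to its frequency.
corpus = {
    ("l", "o", "w", "</w>"): 5,
    ("l", "o", "w", "e", "r", "</w>"): 2,
    ("n", "e", "w", "e", "s", "t", "</w>"): 6,
    ("w", "i", "d", "e", "s", "t", "</w>"): 3,
}

merges = []
for _ in range(4):  # learn 4 merges
    pairs = get_pair_counts(corpus)
    best = max(pairs, key=pairs.get)
    merges.append(best)
    corpus = merge_pair(corpus, best)

print(merges)
# → [('e', 's'), ('es', 't'), ('est', '</w>'), ('l', 'o')]
```

Frequent endings like "est" become single vocabulary entries after a few merges, while rare words remain split into smaller pieces; this is the trade-off between vocabulary size and sequence length described above.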