veda.ng
Back to Glossary

Byte-Pair Encoding

Byte-pair encoding is a subword tokenization algorithm that iteratively builds a vocabulary by merging the most frequently occurring pairs of characters or tokens, balancing vocabulary size against sequence compression. Starting from individual characters (or bytes), BPE counts all adjacent pairs in the training corpus and merges the most frequent pair into a single new token. This process repeats until reaching the desired vocabulary size, typically 30,000 to 100,000 tokens. Common words become single tokens through repeated merges. Rare words remain split into frequent subword pieces. The beauty of BPE is that it handles out-of-vocabulary words gracefully: any word can be represented as a sequence of subwords, and the model learns relationships between subword pieces. This is why language models can handle misspellings, neologisms, and technical terms they've never seen exactly. BPE training is corpus-dependent: a tokenizer trained on English performs poorly on code or other languages. The merge rules are deterministic once trained, ensuring consistent tokenization. GPT models use BPE variants; BERT uses WordPiece which is similar but scores merges differently. Understanding BPE explains why some text tokenizes efficiently (common words in the training corpus) and other text tokenizes poorly (rare domains or languages).

Related Terms