Vocabulary in language models is the complete set of tokens the model recognizes, with each token mapping to a learnable embedding vector and an entry in the output prediction layer. Vocabulary size represents a fundamental design tradeoff. Larger vocabularies mean more parameters (embedding matrices scale linearly with vocabulary size) and shorter sequences (common words become single tokens), but a sparser learning signal, since rare tokens appear infrequently in training. Smaller vocabularies mean longer sequences (words split into more pieces) and smaller embedding matrices, but more computation per word. GPT-4's tokenizer uses roughly 100,000 tokens; earlier GPT models used about 50,000.

Multilingual models need larger vocabularies to represent multiple languages efficiently without excessive splitting, and specialized domains can benefit from custom vocabularies trained on domain-specific corpora. Special tokens serve functional roles: [PAD] for sequence padding, [BOS]/[EOS] for beginning/end of sequence, [UNK] for unknown tokens, [SEP] for separating segments, and [MASK] for masked language modeling.

The vocabulary is frozen after tokenizer training; adding new tokens requires architectural changes or special handling. When important concepts aren't well-represented in the vocabulary, models must learn to compose them from subwords, which can limit capability on specialized domains.
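A minimal sketch of the tradeoff described above, using a toy greedy longest-match tokenizer (illustrative only, not real BPE) and hypothetical vocabulary sizes and model dimensions:

```python
def embedding_params(vocab_size: int, d_model: int) -> int:
    # Input embedding plus an untied output projection,
    # each of shape (vocab_size, d_model).
    return 2 * vocab_size * d_model

# Larger vocabulary -> more embedding parameters (hypothetical d_model).
print(embedding_params(50_000, 4096))   # 409,600,000
print(embedding_params(100_000, 4096))  # 819,200,000


def greedy_tokenize(text: str, vocab: set[str]) -> list[str]:
    # Greedy longest-match subword tokenization; single characters
    # act as the fallback "byte-level" pieces so nothing is unknown.
    tokens, i = [], 0
    while i < len(text):
        for j in range(len(text), i, -1):
            piece = text[i:j]
            if piece in vocab or j == i + 1:
                tokens.append(piece)
                i = j
                break
    return tokens

# Smaller vocabulary -> the same word splits into more pieces.
small_vocab = {"token", "ization", "un", "believ", "able"}
large_vocab = small_vocab | {"tokenization"}
print(greedy_tokenize("tokenization", small_vocab))  # ['token', 'ization']
print(greedy_tokenize("tokenization", large_vocab))  # ['tokenization']
```

Doubling the vocabulary here doubles the embedding parameter count but halves the token count for words the larger vocabulary covers whole, which is the sequence-length side of the tradeoff.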