
Tokenizer

A tokenizer for large language models (LLMs) is a tool that converts text into tokens, which are smaller, more manageable pieces such as words or subwords. This process is essential for LLMs to understand and process text efficiently. Tokenizers handle tasks like splitting sentences, handling punctuation, and converting words into a numerical format that models can work with.

Byte-level BPE (the encodings shipped with OpenAI's tiktoken library) is used by GPT-2/GPT-3 and Llama 3. In comparison, SentencePiece is used by Llama/Llama 2 and Mistral, among many other models.
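
As a quick illustration (assuming the tiktoken package is installed), the GPT-2 byte-level BPE encoding can be loaded and applied like this:

```python
import tiktoken

# Load the byte-level BPE encoding used by GPT-2.
enc = tiktoken.get_encoding("gpt2")

text = "Tokenizers convert text into token ids."
token_ids = enc.encode(text)             # text -> list of integer token ids
print(token_ids)
print(enc.decode(token_ids) == text)     # decoding round-trips to the original text -> True
```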

Byte Pair Encoding (BPE)

Would it be feasible to tokenize text into just the 26 letters of the alphabet for an English language model?

Here is what GPT thinks:

  1. Balance Between Granularity and Vocabulary Size:

    • Tokenizing into individual characters results in very small tokens (like ‘a’, ‘b’, ‘c’), which leads to a very large number of tokens to represent even simple words. While this keeps the vocabulary size minimal, it can bloat the sequence length, making training and inference more computationally expensive due to longer input sequences.
    • BPE and other subword tokenization methods strike a balance between having too large a vocabulary (like word-level tokenization) and too many tokens per sentence (like character-level tokenization).
  2. Effective Handling of Rare and Unseen Words:

    • BPE can break down rare or previously unseen words into smaller, more frequent subwords (see the sketch after this list). This helps the model handle out-of-vocabulary (OOV) words more effectively because they are often composed of known subwords.
  3. Preservation of Meaningful Chunks:

    • Words generally carry more meaning than individual characters. Subword tokenization methods like BPE allow the tokenizer to preserve meaningful chunks of words, leading to better contextual embeddings. For example, the word “unhappiness” might be tokenized as [“un”, “happiness”] rather than [“u”, “n”, “h”, “a”, “p”, “p”, “i”, “n”, “e”, “s”, “s”], which retains more semantic information.
  4. Compatibility with Neural Architectures:

    • Modern neural architectures like Transformers are designed to efficiently process sequences of tokens. Subword tokens are a good fit for these models, allowing them to learn and generalize effectively across different granularities of text input.
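
The core of BPE training is simple: start from a small base vocabulary and repeatedly merge the most frequent adjacent pair of tokens into a new token. Here is a minimal, illustrative sketch of that loop operating on raw UTF-8 bytes; it is not the exact implementation used by any particular model, and the training text and merge count are made up for the example.

```python
from collections import Counter

def get_pair_counts(ids):
    """Count occurrences of each adjacent token pair in the sequence."""
    return Counter(zip(ids, ids[1:]))

def merge(ids, pair, new_id):
    """Replace every occurrence of `pair` in `ids` with the single token `new_id`."""
    out, i = [], 0
    while i < len(ids):
        if i < len(ids) - 1 and (ids[i], ids[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(ids[i])
            i += 1
    return out

def train_bpe(text, num_merges):
    """Learn `num_merges` BPE merge rules, starting from raw UTF-8 bytes."""
    ids = list(text.encode("utf-8"))        # initial vocabulary: the 256 byte values
    merges = {}                             # (id, id) -> new merged token id
    for n in range(num_merges):
        counts = get_pair_counts(ids)
        if not counts:
            break
        pair = max(counts, key=counts.get)  # most frequent adjacent pair
        new_id = 256 + n                    # new token ids start after the byte range
        ids = merge(ids, pair, new_id)
        merges[pair] = new_id
    return merges, ids

merges, ids = train_bpe("low lower lowest low low", num_merges=5)
print(merges)  # learned merge rules, e.g. the byte pair for "lo" -> 256
print(ids)     # the text re-encoded using the merged tokens
```

Encoding new text then amounts to replaying the learned merges in order, so frequent chunks like common prefixes and suffixes collapse into single tokens while rare words fall back to smaller pieces.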

byte-level BPE

Instead of Unicode characters, byte-level BPE cleverly starts the tokenization process with bytes, which have 256 unique values (a byte has 8 bits, and 2^8 = 256). This way the vocabulary starts at a smaller size of 256, yet it can still represent every character you can think of, rather than mapping unseen characters to an unknown token.
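
For instance, in plain Python (just to illustrate the idea, not any specific tokenizer), any string, including accented characters and emoji, reduces to UTF-8 bytes in the range 0-255, so nothing ever needs to map to an unknown token:

```python
text = "héllo 👋"
byte_ids = list(text.encode("utf-8"))            # every character becomes one or more bytes
print(byte_ids)                                  # all values fall in 0..255
print(all(0 <= b <= 255 for b in byte_ids))      # True
print(bytes(byte_ids).decode("utf-8") == text)   # lossless round trip -> True
```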

