Byte Pair Encoding

GPT Tokenizer

Byte-level byte pair encoding (BPE), as implemented in the tiktoken library, is used for the GPT-2/GPT-3 tokenizers. In comparison, SentencePiece is used in Llama and Mistral, among many other models.
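A minimal sketch of loading a GPT-2-style byte-level BPE tokenizer via tiktoken (assuming tiktoken is installed; "gpt2" is the encoding name tiktoken ships for GPT-2):

```python
import tiktoken

# Load the byte-level BPE encoding used by GPT-2.
enc = tiktoken.get_encoding("gpt2")

text = "Byte pair encoding operates on raw bytes."
ids = enc.encode(text)          # list of integer token ids
print(ids)
print(enc.decode(ids) == text)  # byte-level BPE round-trips losslessly: True
```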

GPT-3 improves its ability to understand Python code by encoding runs of consecutive blank spaces as single tokens, which compresses Python source into a shorter sequence that fits within the context length.
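A rough illustration with tiktoken, assuming the whitespace-run tokens live in the "p50k_base" encoding used by later GPT-3-era models, while the older "gpt2" encoding spends one token per space:

```python
import tiktoken

gpt2 = tiktoken.get_encoding("gpt2")       # GPT-2 era encoding
p50k = tiktoken.get_encoding("p50k_base")  # later GPT-3 / Codex era encoding

snippet = "def f(x):\n        return x + 1\n"  # 8-space indentation

# The older encoding emits roughly one token per leading space; the newer
# one can collapse the whole run of spaces into a single token.
print(len(gpt2.encode(snippet)))  # longer sequence
print(len(p50k.encode(snippet)))  # shorter sequence
```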

WordPiece Tokenizer

Used in BERT. Unlike BPE, WordPiece picks merges that maximize training-data likelihood rather than raw pair frequency, and it marks word-internal subwords with a "##" prefix.
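A quick sketch using the Hugging Face transformers library (assuming the "bert-base-uncased" checkpoint is available for download):

```python
from transformers import AutoTokenizer

# Load BERT's WordPiece tokenizer; word-internal pieces carry a "##" prefix.
tok = AutoTokenizer.from_pretrained("bert-base-uncased")

print(tok.tokenize("tokenization"))  # e.g. ['token', '##ization']
```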
