Byte Pair Encoding
GPT Tokenizer
Byte-level byte pair encoding (BPE), as implemented in the tiktoken library, was used for the GPT-2/GPT-3 tokenizers. In comparison, SentencePiece was used in Llama and Mistral, among many other models.
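To make the BPE idea concrete, here is a minimal training sketch: start from raw UTF-8 bytes (the "byte-level" part), then repeatedly merge the most frequent adjacent pair into a new token id. This is an illustrative toy, not the actual tiktoken or GPT-2 training code; the function names are my own.

```python
from collections import Counter

def get_pair_counts(ids):
    # Count occurrences of each adjacent token-id pair.
    return Counter(zip(ids, ids[1:]))

def merge(ids, pair, new_id):
    # Replace every occurrence of `pair` with the new token id.
    out, i = [], 0
    while i < len(ids):
        if i < len(ids) - 1 and (ids[i], ids[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(ids[i])
            i += 1
    return out

def train_bpe(text, num_merges):
    # Byte-level: the base vocabulary is the 256 byte values,
    # so any Unicode string is representable with no unknown tokens.
    ids = list(text.encode("utf-8"))
    merges = {}
    for step in range(num_merges):
        counts = get_pair_counts(ids)
        if not counts:
            break
        pair = max(counts, key=counts.get)  # most frequent pair
        new_id = 256 + step                 # mint a fresh token id
        ids = merge(ids, pair, new_id)
        merges[pair] = new_id
    return ids, merges

ids, merges = train_bpe("low lower lowest low low", 5)
```

After a few merges, frequent substrings such as "low" collapse into single ids, so the encoded sequence is shorter than the raw byte sequence.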
GPT-3's code-oriented successors improved their capability to understand Python code by encoding runs of consecutive spaces as single tokens, which compresses an indented Python source string into a shorter sequence that fits within the context length.
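A rough sketch of why this helps: compare a tokenizer that spends one token per space with one that maps each run of spaces to a single run token. The `<n-spaces>` token names below are hypothetical stand-ins; real vocabularies reach the same effect through learned merge rules.

```python
import re

def tokenize_naive(code):
    # Worst case: every space is its own token, so deep Python
    # indentation wastes many tokens per line.
    return re.findall(r" |\S+|\n", code)

def tokenize_with_space_runs(code):
    # Collapse each run of 2+ spaces into one hypothetical run token.
    tokens = []
    for m in re.finditer(r" {2,}| |\S+|\n", code):
        s = m.group()
        if len(s) > 1 and set(s) == {" "}:
            tokens.append(f"<{len(s)}-spaces>")  # e.g. "<4-spaces>"
        else:
            tokens.append(s)
    return tokens

code = "def f(x):\n    if x:\n        return x\n"
```

On this snippet the naive scheme spends 12 tokens on indentation alone, while the run-token scheme spends 2, leaving more of the context window for actual code.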
The WordPiece tokenizer was used in BERT.
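Unlike BPE's merge-based encoding, WordPiece segments a word greedily, longest-match-first, marking continuation pieces with a "##" prefix as in BERT. A minimal sketch with a toy vocabulary (the vocabulary contents are illustrative, not BERT's real vocabulary):

```python
def wordpiece_tokenize(word, vocab, unk="[UNK]"):
    # Greedy longest-match-first: at each position, take the longest
    # substring present in the vocabulary; non-initial pieces get "##".
    tokens, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        while start < end:
            sub = word[start:end]
            if start > 0:
                sub = "##" + sub
            if sub in vocab:
                piece = sub
                break
            end -= 1
        if piece is None:
            return [unk]  # no prefix matched: whole word is unknown
        tokens.append(piece)
        start = end
    return tokens

vocab = {"un", "##aff", "##able", "play", "##ing"}
wordpiece_tokenize("unaffable", vocab)  # ["un", "##aff", "##able"]
```

The "##" prefix lets the model distinguish a piece occurring word-initially from the same characters occurring mid-word, which BPE's plain merges do not encode.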