BPE Tokenization Demystified: Implementation and Examples
A taxonomy of tokenization methods
In NLP, a fundamental question is how to tokenize the text. Three methods are commonly used:
- Char-level
- Word-level
- Subword-level
Let’s talk about the char-level tokenizer first. That is, we tokenize the text into a stream of characters. For instance, `highest` -> `h, i, g, h, e, s, t`. One advantage of the char-level tokenizer is that the vocabulary stays small: its size equals the size of the alphabet, so you probably won’t meet the infamous out-of-vocabulary (OOV) problem. However, the downside is that a single character does not convey much information, and we get far too many tokens after tokenizing. Imagine that a simple word like highest gives us 7 tokens😨
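The idea above can be sketched in a few lines of Python. This is a minimal illustration, not a production tokenizer; the function name `char_tokenize` is made up for this example:

```python
def char_tokenize(text: str) -> list[str]:
    """Split text into a stream of single-character tokens (char-level)."""
    return list(text)

tokens = char_tokenize("highest")
print(tokens)       # ['h', 'i', 'g', 'h', 'e', 's', 't']
print(len(tokens))  # 7 tokens for one 7-letter word

# The vocabulary is just the set of distinct characters seen in the corpus,
# so it stays tiny compared to a word-level vocabulary.
vocab = sorted(set("".join(["highest", "lowest"])))
print(vocab)
```

Note how the token count grows with string length, which is exactly the long-sequence downside described above.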