About This Tokenizer
- Uses the CC-CEDICT dictionary.
- Uses a longest-prefix-match algorithm: it walks a trie built from the dictionary and, at each position in the text, takes the longest dictionary word that starts there.
- When several dictionary entries share the same characters, it returns all of them.
- Planned improvement: look ahead several words instead of matching greedily, so that the segmentation maximizes overall word length. (Example: the current greedy approach can match a 2-character word even when that breaks up a longer 3-character word that follows, which lowers the average word length.)
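
The greedy longest-prefix matching described above can be sketched roughly as follows. This is a minimal illustration in Python, not this tokenizer's actual code: the names `TrieNode`, `Trie`, `longest_prefix`, and `tokenize` are hypothetical, and the sample entries are hand-written stand-ins rather than real CC-CEDICT lines.

```python
class TrieNode:
    def __init__(self):
        self.children = {}  # character -> TrieNode
        self.entries = []   # every dictionary entry whose word ends at this node


class Trie:
    def __init__(self):
        self.root = TrieNode()

    def insert(self, word, entry):
        node = self.root
        for ch in word:
            node = node.children.setdefault(ch, TrieNode())
        # Words with identical characters accumulate on the same node.
        node.entries.append(entry)

    def longest_prefix(self, text, start):
        """Return (end, entries) for the longest word beginning at `start`.

        If no dictionary word starts there, end == start and entries is empty.
        """
        node = self.root
        best_end, best_entries = start, []
        i = start
        while i < len(text) and text[i] in node.children:
            node = node.children[text[i]]
            i += 1
            if node.entries:  # a complete word ends here; remember it
                best_end, best_entries = i, node.entries
        return best_end, best_entries


def tokenize(text, trie):
    """Greedy segmentation: always take the longest match at the current position."""
    tokens = []
    i = 0
    while i < len(text):
        end, entries = trie.longest_prefix(text, i)
        if end == i:
            end = i + 1  # unknown character: emit it alone, with no entries
        tokens.append((text[i:end], list(entries)))
        i = end
    return tokens


# Hypothetical sample data (assumed for illustration only).
trie = Trie()
for word, entry in [
    ("中", "middle"),
    ("中国", "China"),
    ("中国人", "Chinese person"),
    ("国", "country"),
    ("人", "person"),
    ("行", "xing2: to walk"),
    ("行", "hang2: row"),
]:
    trie.insert(word, entry)

tokens = tokenize("中国人行", trie)
print(tokens)
```

With this sample data, "中国人" is chosen over the shorter "中" and "中国", and "行" comes back with both of its entries. A lookahead version would replace the greedy choice in `tokenize` with a comparison of several candidate segmentations.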