About This Tokenizer

  • Uses the CC-CEDICT dictionary.
  • Uses a longest-prefix matching algorithm, traversing a trie data structure to find matches.
  • When multiple dictionary entries share the same characters, all matching entries are returned.
  • I plan to improve the algorithm to look ahead several words, maximizing the total matched length instead of matching greedily. (Example: greedy matching may take a 2-character word even when doing so breaks up the 3-character word that follows, lowering the overall average word length.)
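
The greedy longest-prefix approach described above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the toy dictionary stands in for CC-CEDICT, and the `Trie`/`tokenize` names are hypothetical.

```python
# Minimal sketch of greedy longest-prefix trie tokenization.
# The toy word list below is a stand-in for CC-CEDICT entries.

class Trie:
    def __init__(self):
        self.root = {}

    def insert(self, word):
        node = self.root
        for ch in word:
            node = node.setdefault(ch, {})
        node["$"] = True  # marks the end of a dictionary word

def tokenize(text, trie):
    """At each position, take the longest dictionary word that
    starts there; fall back to a single character if none match."""
    tokens, i = [], 0
    while i < len(text):
        node, longest = trie.root, 1
        for j in range(i, len(text)):
            ch = text[j]
            if ch not in node:
                break
            node = node[ch]
            if "$" in node:
                longest = j - i + 1
        tokens.append(text[i:i + longest])
        i += longest
    return tokens

trie = Trie()
for word in ["中国", "中国人", "人民", "民币"]:  # toy entries
    trie.insert(word)

print(tokenize("中国人民", trie))  # → ['中国人', '民']
```

Note that this example also shows the greedy pitfall: the tokenizer eats the 3-character 中国人, leaving 民 stranded, where a lookahead would prefer 中国 + 人民.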