# Fuzzy text search




Last updated Jan 1, 2023

A token is a minimal element of a string (text). Tokens can be letters, words, punctuation, vowels, consonants, etc. Depending on the task, we may choose different types of tokens.

The most trivial way to split a string into tokens is by letters (characters, symbols). This approach doesn't need to know the language of the string; it only needs to know the encoding. This was a problem in the past, but now almost everybody uses UTF-8.
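As a sketch of character tokenization in Python (a Python 3 `str` is already decoded text, so the encoding only matters when reading raw bytes):

```python
# Character tokenization: needs no knowledge of the language,
# only the encoding of the raw bytes.
raw = "Привет, мир!".encode("utf-8")  # bytes as stored or transmitted
text = raw.decode("utf-8")            # decode once, then work with characters
tokens = list(text)                   # one token per Unicode character
print(tokens[:6])  # ['П', 'р', 'и', 'в', 'е', 'т']
```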

The encoding can be, for example, ASCII, UTF-8, UTF-16, etc.

Splitting into words is a more complex task, because word boundaries depend on the language.

Examples of tokenizers:
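A minimal word tokenizer can be sketched with a regular expression (a deliberate simplification: real tokenizers must also handle apostrophes, hyphens, and languages without spaces between words):

```python
import re

def word_tokens(text: str) -> list[str]:
    # Naive rule: a word is a maximal run of letters/digits.
    # "Don't" becomes two tokens, which shows the limits of the rule.
    return re.findall(r"\w+", text.lower())

print(word_tokens("Don't split me, please!"))
# ['don', 't', 'split', 'me', 'please']
```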

# String is…

A string is converted to a sequence (list, stream) of tokens, which can then be converted to subsequences; those, in turn, can be converted to a set, a multiset (bag), or a vector.

```mermaid
flowchart TD
    string --> tokens
    tokens --> subsequences
    tokens ---> sequence
    subsequences --> set
    subsequences --> bag[multiset or bag]
    bag --> vector
    vector --> local[local frequency]
    vector --> global[global frequency]
    local --> TF
    global --> TFIDF[TF-IDF]
```
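As a sketch of this pipeline in Python, using character bigrams as the subsequences:

```python
from collections import Counter

text = "banana"
tokens = list(text)                                    # sequence of character tokens
bigrams = [a + b for a, b in zip(tokens, tokens[1:])]  # subsequences (2-grams)

print(bigrams)           # ['ba', 'an', 'na', 'an', 'na'] -- keeps order and counts
print(set(bigrams))      # {'ba', 'an', 'na'}             -- drops order and counts
print(Counter(bigrams))  # Counter({'an': 2, 'na': 2, 'ba': 1}) -- bag: local frequencies (TF)
```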


# String measures

Depending on which representation we selected, we can use different measures. For example (this is not a full list):

```mermaid
flowchart LR
    sequence --> ed[edit distance]
    ed --> Hamming
    ed --> LCS
    ed --> lev[Levenshtein]
    ed --> dl[Damerau-Levenshtein]
    sequence --> Jaro
    sequence --> jw[Jaro-Winkler]
    sequence --> sol[sequence of letters] --> p[phonetic algorithms]
    set --> Dice
    set --> Overlap
    set --> Jaccard
    bag[multiset or bag] --> JG[Jaccard, generalized]
    bag --> bagd[Bag distance]
    vector --> cos[cosine similarity]
```
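The set-based measures from the diagram are essentially one-liners; a minimal sketch:

```python
def jaccard(a: set, b: set) -> float:
    return len(a & b) / len(a | b)

def dice(a: set, b: set) -> float:
    return 2 * len(a & b) / (len(a) + len(b))

def overlap(a: set, b: set) -> float:
    return len(a & b) / min(len(a), len(b))

x, y = set("night"), set("nacht")  # {n,i,g,h,t} and {n,a,c,h,t}
print(jaccard(x, y))  # 3 shared letters / 7 in the union = 0.428...
print(dice(x, y))     # 0.6
print(overlap(x, y))  # 0.6
```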

# Combinations

We can combine different tokenizers and different algorithms.

There are typical combinations, for example:

And not-so-typical, but still valid, combinations:
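As a sketch of one such combination, Jaccard similarity over sets of character trigrams:

```python
def ngrams(text: str, n: int = 3) -> set[str]:
    # Tokenizer: split the string into overlapping character n-grams.
    return {text[i:i + n] for i in range(len(text) - n + 1)}

def jaccard(a: set, b: set) -> float:
    union = a | b
    return len(a & b) / len(union) if union else 1.0

# A single typo keeps most trigrams intact, so the strings still look similar.
print(jaccard(ngrams("michael"), ngrams("michale")))  # 3/7 ≈ 0.43
```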

# Characteristics

A sequence of tokens has the following characteristics: we know which exact tokens occur, in what order they appear, and how many times each token occurs. Any other representation compromises on some of these characteristics.

| representation  | items | frequency | order   |
| --------------- | ----- | --------- | ------- |
| bag of unigrams | yes   | yes       |         |
| set of unigrams | yes   |           |         |
| bag of bigrams  | yes   | yes*      | partial |
| set of bigrams  | yes   |           | partial |

It doesn't make sense to use a sequence of bigrams, because a sequence already contains information about order.
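These compromises can be illustrated with anagrams: sets and bags of unigrams cannot tell them apart, while bigrams, which partially preserve order, can. A sketch:

```python
from collections import Counter

def bigrams(s: str) -> set[str]:
    return {s[i:i + 2] for i in range(len(s) - 1)}

a, b = "listen", "silent"
print(set(a) == set(b))          # True  -- set of unigrams: identical
print(Counter(a) == Counter(b))  # True  -- bag of unigrams: identical
print(bigrams(a) == bigrams(b))  # False -- bigrams partially capture order
```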

# Transformation or normalization

Not shown in the previous diagram, but there can be an optional transformation step after tokenization.

```mermaid
flowchart TD
    tokens --> t[Transformation step]
    t --> subsequences
    t --> sequence
```

For example: lowercasing, removing punctuation, stemming, or dropping stop words.
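A sketch of such a transformation step (the choice of lowercasing plus a tiny stop-word list here is illustrative, not prescriptive):

```python
STOP_WORDS = {"the", "a", "an", "of"}  # illustrative stop-word list

def normalize(tokens: list[str]) -> list[str]:
    # Transformation step between tokenization and comparison:
    # lowercase every token and drop stop words.
    lowered = (t.lower() for t in tokens)
    return [t for t in lowered if t not in STOP_WORDS]

print(normalize(["The", "Lord", "of", "the", "Rings"]))  # ['lord', 'rings']
```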