Intro

A fuzzy comparison of strings can be used to do:

approximate search
full-text search
spellcheck
etc.

Overview

One way to organize subjects is by splitting the process into stages. Not all stages are required and in some cases, stages can go in different order.

Use two fingers to pan and zoom

splitting

For example:

string to characters (Crème Brûlée → C r è m e B r û l é e)
string to words aka tokenization (Hello, world! → Hello world)
string to sentences

transformation

For example:

for characters and words
- to lowercase (A → a) or to uppercase (a → A)
- remove diacritics (è → e)
for words
- remove stopwords (the end → end)
- replace with synonyms or by dictionary (gorgeous → beautiful)
- stemming (liked → like)

grouping

For example:

unigrams or no grouping (a b c → a b c)
bigrams (a b c → ab bc)
trigrams (a b c → abc)
bigrams with padding 1 (a b c → #a ab bc c#)
trigrams with padding 1 (a b c → #ab abc bc#)
skip-grams of length-2-skip-1 (a b c → ab, ac, bc)
etc.

Grouping is typically used together with representations that do not preserve order e.g. all except sequence (see next section).

representation

Representations:

sequence (a b b c → a b b c)
set (a b b c → a b c)
multiset
- bag (a b b c → a:1 b:2 c:1)
- vector (a b b c → 1 2 1, assuming that vector space or alphabet is a, b, c)

measure

String similarity measure - depends on the type of the representation, for example:

for sequence, typically called "Edit distance"
for set
for bag
- Bag distance
- Jaccard index, generalized
for vector
- Cosine similarity
- Term Frequency — Inverse Document Frequency