Intro
A fuzzy comparison of strings can be used to do:
- approximate search
- full-text search
- spellcheck
- etc.
Overview
One way to organize subjects is by splitting the process into stages. Not all stages are required and in some cases, stages can go in different order.
splitting
For example:
- string to characters (
Crème Brûlée
→C
r
è
m
e
B
r
û
l
é
e
) - string to words aka tokenization (
Hello, world!
→Hello
world
) - string to sentences
transformation
For example:
- for characters and words
- to lowercase (
A
→a
) or to uppercase (a
→A
) - remove diacritics (
è
→e
)
- to lowercase (
- for words
- remove stopwords (
the
end
→end
) - replace with synonyms or by dictionary (
gorgeous
→beautiful
) - stemming (
liked
→like
)
- remove stopwords (
grouping
For example:
- unigrams or no grouping (
a
b
c
→a
b
c
) - bigrams (
a
b
c
→ab
bc
) - trigrams (
a
b
c
→abc
) - bigrams with padding 1 (
a
b
c
→#a
ab
bc
c#
) - trigrams with padding 1 (
a
b
c
→#ab
abc
bc#
) - skip-grams of length-2-skip-1 (
a
b
c
→ab
,ac
,bc
) - etc.
Grouping is typically used together with representations that do not preserve order e.g. all except sequence (see next section).
representation
- sequence (
a
b
b
c
→a
b
b
c
) - set (
a
b
b
c
→a
b
c
) - multiset
- bag (
a
b
b
c
→a:1
b:2
c:1
) - vector (
a
b
b
c
→1
2
1
, assuming that vector space or alphabet isa
,b
,c
)
- bag (
measure
String similarity measure - depends on the type of the representation, for example:
- for sequence, typically called "Edit distance"
- for set
- for bag
- for vector