Skip to content

Intro

A fuzzy comparison of strings can be used to do:

  • approximate search
  • full-text search
  • spellcheck
  • etc.

Overview

One way to organize subjects is by splitting the process into stages. Not all stages are required and in some cases, stages can go in different order.

splitting
transformation
grouping
representation
measure

splitting

For example:

  • string to characters (Crème BrûléeC r è m e B r û l é e)
  • string to words aka tokenization (Hello, world!Hello world)
  • string to sentences

transformation

For example:

  • for characters and words
    • to lowercase (Aa) or to uppercase (aA)
    • remove diacritics (èe)
  • for words
    • remove stopwords (the endend)
    • replace with synonyms or by dictionary (gorgeousbeautiful)
    • stemming (likedlike)

grouping

For example:

  • unigrams or no grouping (a b ca b c)
  • bigrams (a b cab bc)
  • trigrams (a b cabc)
  • bigrams with padding 1 (a b c#a ab bc c#)
  • trigrams with padding 1 (a b c#ab abc bc#)
  • skip-grams of length-2-skip-1 (a b cab, ac, bc)
  • etc.

Grouping is typically used together with representations that do not preserve order e.g. all except sequence (see next section).

representation

  • sequence (a b b ca b b c)
  • set (a b b ca b c)
  • multiset
    • bag (a b b ca:1 b:2 c:1)
    • vector (a b b c1 2 1, assuming that vector space or alphabet is a, b, c)

measure

String similarity measure - depends on the type of the representation, for example: