Skip to content

Intro

A fuzzy comparison of strings can be used to do:

  • approximate search
  • full-text search
  • spellcheck
  • etc.

Overview

One way to organize subjects is by splitting the process into stages. Not all stages are required and in some cases, stages can go in different order.

splittingtransformationgroupingrepresentationmeasure

splitting

For example:

  • string to characters (Crème BrûléeC r è m e B r û l é e)
  • string to words aka tokenization (Hello, world!Hello world)
  • string to sentences

transformation

For example:

  • for characters and words
    • to lowercase (Aa) or to uppercase (aA)
    • remove diacritics (èe)
  • for words
    • remove stopwords (the endend)
    • replace with synonyms or by dictionary (gorgeousbeautiful)
    • stemming (likedlike)

grouping

For example:

  • unigrams or no grouping (a b ca b c)
  • bigrams (a b cab bc)
  • trigrams (a b cabc)
  • bigrams with padding 1 (a b c#a ab bc c#)
  • trigrams with padding 1 (a b c#ab abc bc#)
  • skip-grams of length-2-skip-1 (a b cab, ac, bc)
  • etc.

Grouping is typically used together with representations that do not preserve order e.g. all except sequence (see next section).

representation

Representations:

  • sequence (a b b ca b b c)
  • set (a b b ca b c)
  • multiset
    • bag (a b b ca:1 b:2 c:1)
    • vector (a b b c1 2 1, assuming that vector space or alphabet is a, b, c)

measure

String similarity measure - depends on the type of the representation, for example: