🏷️ fuzzy text search

Search

Search IconIcon to open search

String similarity measure

Last updated Jan 1, 2023

String similarity measure is a function wich takes two strings and answers the question of how are similar two given strings.

Measures can be separated in following categories:

  1. similarity or dissimilarity
    • similarity - the higher value means more closer strings
    • dissimilarity - the lower value means more closer strings
  2. normalized: yes or no
    • measure is normalised if it’s values are in the range 0..1
    • normalised similarity can be converted to dissimilarity using formula $dis(x,y) = 1 - sim(x,y)$
  3. expected input. See String is…
    • sequence
    • set
    • multiset (bag) or vector
  4. ranking or relevance
    • if measure returns only true and false it can be used as relevance, but not as ranking function
    • similarity can be used as ranking if ordered descending
    • dissimilarity can be used as ranking if ordered ascending
    • normalized similarity can be used as relevance, for example $relevance(x,y) = sim(x,y) \geq k$, where $k$ is some coefficient between 0 and 1
  5. global or local
    • local takes into account the similarity between some part of the target and the query
    • global takes into account the similarity of all target data to the query. I think TF-IDF is an example here
  6. assumed error
    • phonetic (if words sound similar). Good for names, for example Claire, Clare
    • ortographic (if words look similar). Good for detecting typos and errors

# Reading