Skip to content

String similarity measure

String similarity measure is a function wich takes two strings and answers the question on how two given strings are similar.

Measures can be separated in following categories:

  1. similarity or dissimilarity
    • similarity - the higher value means more closer strings
    • dissimilarity - the lower value means more closer strings
  2. normalized: yes or no
    • measure is normalised if it’s values are in the range 0..1
    • normalised similarity can be converted to dissimilarity using formula dis(x,y)=1sim(x,y)dis(x,y) = 1 - sim(x,y)
  3. expected input
    • sequence
    • set
    • multiset (bag) or vector
  4. ranking or relevance
    • if measure returns only true and false it can be used as relevance, but not as ranking function
    • similarity can be used as ranking if ordered descending
    • dissimilarity can be used as ranking if ordered ascending
    • normalized similarity can be used as relevance, for example relevance(x,y)=sim(x,y)krelevance(x,y) = sim(x,y) \geq k, where kk is some coefficient between 0 and 1
  5. global or local
    • local takes into account the similarity between some part of the target and the query
    • global takes into account the similarity of all target data to the query. I think TF-IDF is an example here
  6. assumed error
    • phonetic (if words sound similar). Good for names, for example Claire, Clare
    • ortographic (if words look similar). Good for detecting typos and errors