String similarity measure

String similarity measure is a function wich takes two strings and answers the question on how two given strings are similar.

Measures can be separated in following categories:

similarity or dissimilarity
- similarity - the higher value means more closer strings
- dissimilarity - the lower value means more closer strings
  - Other name is distance. Not all distance are true Metric measure though
normalized: yes or no
- measure is normalised if it’s values are in the range 0..1
- normalised similarity can be converted to dissimilarity using formula $dis(x,y) = 1 - sim(x,y)$
expected input
- sequence
- set
- multiset (bag) or vector
ranking or relevance
- if measure returns only true and false it can be used as relevance, but not as ranking function
- similarity can be used as ranking if ordered descending
- dissimilarity can be used as ranking if ordered ascending
- normalized similarity can be used as relevance, for example $relevance(x,y) = sim(x,y) \geq k$ , where $k$ is some coefficient between 0 and 1
global or local
- local takes into account the similarity between some part of the target and the query
- global takes into account the similarity of all target data to the query. I think TF-IDF is an example here
assumed error
- phonetic (if words sound similar). Good for names, for example Claire, Clare
- ortographic (if words look similar). Good for detecting typos and errors