String similarity measure
A string similarity measure is a function which takes two strings and answers the question of how similar they are.
Measures can be divided into the following categories:
- similarity or dissimilarity
  - similarity: a higher value means closer strings
  - dissimilarity: a lower value means closer strings
    - Another name is distance. Not all distances are true metrics, though
- normalized: yes or no
  - a measure is normalized if its values are in the range 0..1
  - a normalized similarity can be converted to a dissimilarity using the formula dissimilarity = 1 - similarity
- expected input
  - sequence
  - set
  - multiset (bag) or vector
- ranking or relevance
  - if a measure returns only true and false, it can be used as relevance, but not as a ranking function
  - a similarity can be used as ranking if ordered descending
  - a dissimilarity can be used as ranking if ordered ascending
  - a normalized similarity can be used as relevance, for example by checking whether it exceeds some coefficient between 0 and 1
- global or local
  - local takes into account the similarity between some part of the target and the query
  - global takes into account the similarity of all target data to the query. I think TF-IDF is an example here
- assumed error
  - phonetic (if words sound similar). Good for names, for example Claire, Clare
  - orthographic (if words look similar). Good for detecting typos and errors
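
To make a few of these categories concrete, here is a minimal Python sketch. It is only an illustration under some assumptions: Levenshtein distance stands in for a sequence-based dissimilarity, Jaccard over character bigrams stands in for a set-based normalized similarity, and the function names, the candidate list and the `THRESHOLD` value are all hypothetical choices made for this example rather than anything prescribed above.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance: a dissimilarity on sequences (lower means closer)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            substitution = prev[j - 1] + cost
            curr.append(min(deletion, insertion, substitution))
        prev = curr
    return prev[-1]


def normalized_similarity(a: str, b: str) -> float:
    """Map the distance into 0..1, where 1 means identical strings."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(a, b) / longest


def bigrams(s: str) -> set:
    """Turn a string into a set of adjacent character pairs."""
    return {s[i:i + 2] for i in range(len(s) - 1)}


def jaccard(a: str, b: str) -> float:
    """Normalized similarity on sets (here: sets of character bigrams)."""
    x, y = bigrams(a), bigrams(b)
    if not x and not y:
        return 1.0
    return len(x & y) / len(x | y)


query = "clare"
candidates = ["claire", "clara", "carl", "blair"]

# Ranking: order by similarity descending (same order as distance ascending here).
ranked = sorted(candidates, key=lambda c: normalized_similarity(query, c), reverse=True)

# Relevance: a yes/no answer obtained by thresholding a normalized similarity.
THRESHOLD = 0.7  # hypothetical cut-off between 0 and 1
relevant = [c for c in candidates if normalized_similarity(query, c) >= THRESHOLD]

# Normalized similarity converted to a dissimilarity.
dissimilarity = 1.0 - normalized_similarity(query, "claire")

print(ranked)
print(relevant)
```

Note that the same pair of strings can score quite differently depending on the expected input: "clare" and "claire" are one edit apart as sequences, but share only three of six distinct bigrams when treated as sets.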