String similarity measure
A string similarity measure is a function which takes two strings and answers the question of how similar they are.
Measures can be separated into the following categories:
- similarity or dissimilarity
  - similarity: a higher value means the strings are closer
  - dissimilarity: a lower value means the strings are closer
  - distance: a dissimilarity which has the properties of a mathematical metric
- normalized: yes or no
  - a measure is normalized if its values are in the range 0..1
  - a normalized similarity can be converted to a dissimilarity using the formula $dis(x,y) = 1 - sim(x,y)$
- expected input. See String is…
  - sequence
  - set
  - multiset (bag) or vector
- ranking or relevance
  - if a measure returns only `true` and `false`, it can be used as a relevance function, but not as a ranking function
  - a similarity can be used for ranking if results are ordered descending
  - a dissimilarity can be used for ranking if results are ordered ascending
  - a normalized similarity can be used as relevance, for example $relevance(x,y) = sim(x,y) \geq k$, where $k$ is some threshold between 0 and 1
- global or local
  - local: takes into account the similarity between some part of the target and the query
  - global: takes into account the similarity of all target data to the query. I think TF-IDF is an example here
- assumed error
  - phonetic (if words sound similar). Good for names, for example Claire, Clare
  - orthographic (if words look similar). Good for detecting typos and errors
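
Several of the categories above can be made concrete with a small sketch. It is only an illustration: the choice of Levenshtein distance (input viewed as a sequence) and Jaccard similarity over character bigrams (input viewed as a set), the helper names, the sample words, and the threshold `k = 0.5` are all assumptions made for this example.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance over sequences of characters; lower means closer."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


def bigrams(s: str) -> set:
    """View a string as a set of character bigrams."""
    return {s[i:i + 2] for i in range(len(s) - 1)}


def jaccard(a: str, b: str) -> float:
    """Normalized similarity in 0..1 over bigram sets; higher means closer."""
    x, y = bigrams(a), bigrams(b)
    if not x and not y:
        return 1.0  # two strings with no bigrams are treated as identical
    return len(x & y) / len(x | y)


def jaccard_dissimilarity(a: str, b: str) -> float:
    """dis(x, y) = 1 - sim(x, y), valid because jaccard is normalized."""
    return 1.0 - jaccard(a, b)


def is_relevant(a: str, b: str, k: float = 0.5) -> bool:
    """relevance(x, y) = sim(x, y) >= k, with k between 0 and 1."""
    return jaccard(a, b) >= k


query = "clare"
targets = ["claire", "clara", "carl"]

# Similarity ranks when ordered descending, dissimilarity when ordered ascending.
by_similarity = sorted(targets, key=lambda t: jaccard(query, t), reverse=True)
by_distance = sorted(targets, key=lambda t: levenshtein(query, t))

# A boolean relevance function can filter targets, but cannot order them.
relevant = [t for t in targets if is_relevant(query, t)]

print(by_similarity)  # ['clara', 'claire', 'carl']
print(by_distance)    # ['claire', 'clara', 'carl']
print(relevant)       # ['claire', 'clara']
```

Note that the two rankings disagree on the first place: the sequence-based and the set-based view of the same strings do not have to produce the same order.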
# Reading
- COPADS, I: Distance Coefficients between Two Lists or Sets
- Comprehensive Survey on Distance/Similarity Measures between Probability Density Functions
- An Empirical Study on Similarity Functions: Parameter Estimation for the Information Contrast Model
- pg_similarity
- https://abydos.readthedocs.io/en/latest/abydos.distance.html