# String similarity measure

String similarity measure is a function wich takes two strings and answers the question of how are similar two given strings.

Measures can be separated in following categories:

**similarity**or**dissimilarity**- similarity - the higher value means more closer strings
- dissimilarity - the lower value means more closer strings
**distance**- is the disimilarity which has mathematical metric properties

**normalized**: yes or no- measure is normalised if it’s values are in the range 0..1
**normalised similarity**can be converted to dissimilarity using formula $dis(x,y) = 1 - sim(x,y)$

**expected input**. See String is…**sequence****set****multiset**(bag) or vector

**ranking**or**relevance**- if measure returns only
`true`

and`false`

it can be used as**relevance**, but not as**ranking**function **similarity**can be used as**ranking**if ordered descending**dissimilarity**can be used as**ranking**if ordered ascending**normalized**similarity can be used as**relevance**, for example $relevance(x,y) = sim(x,y) \geq k$, where $k$ is some coefficient between 0 and 1

- if measure returns only
**global**or**local****local**takes into account the similarity between some part of the target and the query**global**takes into account the similarity of all target data to the query. I think`TF-IDF`

is an example here

**assumed error****phonetic**(if words sound similar). Good for names, for example`Claire`

,`Clare`

**ortographic**(if words look similar). Good for detecting typos and errors

## # Reading

- COPADS, I: Distance Coefficients between Two Lists or Sets
- Comprehensive Survey on Distance/Similarity Measures between Probability Density Functions
- An Empirical Study on Similarity Functions: Parameter Estimation for the Information Contrast Model
- pg_similarity
- https://abydos.readthedocs.io/en/latest/abydos.distance.html