Edit distance
Edit distance of two strings is the minimal number of edit operations needed to transform first string () into the second (). Edit distance is a measure.
Diffrent edit distances can take into account different edit operations: delete, insert, substitute, transpose.
Name | delete | insert | substitute | transpose |
---|---|---|---|---|
Hamming distance | + | |||
Harmonic edit distance | + | + | ||
LCS distance | + | + | ||
Levenshtein distance | + | + | + | |
Marzal-Vidal edit distance | + | + | + | |
Damerau-Levenshtein distance | + | + | + | + |
Hamming distance
Hamming distance is a very specific measure. It assumes that strings are of the same length. It is rather makes sense for networks, signal processing, where messages (think of network packets) have constant size.
So I won’t compare it to other measures.
See Hamming distance
LCS distnace
LCS stands for Longest Common Subsequence (not substring). I will denote it as . And length of the string as . Than:
- is
est
- is
4
See LCS distance
Levenshtein distance
is 2
. But is also 2
, which doesn’t make sense. Levenshtein distance improves “LCS edit distance” by adding substitution operation, so is 1
.
Damerau-Levenshtein distance
is 2
, but this is common typo - accidentally type letters in reverse order. Damerau-Levenshtein distance improves Levenshtein distance by adding transposition operation, so is 1
.
See Damerau-Levenshtein distance
More…
We can think of another types of operations or weights, for example:
- deleting or inserting duplicate letter, like in
mesage
andmessage
can weight less - typical error - deleting or inserting letters in the start and end can weight more. To take into account cognitive effect which allows people to recognize scrambled words. See: Typoglycemia ↗
- allow transposition with a distance
Jaro similarity
. But are FARMVIL
, FARMVIE
e.g. two different strings, but the same length. Jaro similarity improves LCS similarity by taking into account such situations. . jaro_subsequence(FAREMVIEL, FARMVILLE)
are FARMVIEL
are FARMVILE
. TODO: add Jaro to Ngrams, Qgrams, skip-grams, etc.
But Jaro similarity is not direct extension of LCS similarity, otherwise in the absence of transposition they would return the same value:
See Jaro similarity
Side note about empty string
Let’s denote empty string as and operation of replacing one letter with another as then:
- replacement operation would be
- delete operation would be
- insert operation would be
- and even transpose operation can be represented as
Which makes replacement operation only one required to calculate edit distance, but this is not a trditional point of view. Otherwise we would not have Hamming distance and LCS distance.
One more side idea: replacement can be viewed as insertion and deletion than we can get triangle inequality: where is a cost of operation.
Other combinations
Following “rule of product” we can come up with more imaginary edit distances (they may be not useful though)
delete | insert | substitute | transpose | |
---|---|---|---|---|
Unknown 1 | + | + | ||
Unknown 2 | + | + | + |