Skip to content

Edit distance

Edit distance of two strings dist(x,y)dist(x, y) is the minimal number of edit operations needed to transform first string (xx) into the second (yy). Edit distance is a measure.

Diffrent edit distances can take into account different edit operations: delete, insert, substitute, transpose.

Name delete insert substitute transpose
Hamming distance +
Harmonic edit distance + +
LCS distance + +
Levenshtein distance + + +
Marzal-Vidal edit distance + + +
Damerau-Levenshtein distance + + + +

Hamming distance

Hamming distance is a very specific measure. It assumes that strings are of the same length. It is rather makes sense for networks, signal processing, where messages (think of network packets) have constant size.

So I won’t compare it to other measures.

See Hamming distance

LCS distnace

LCS stands for Longest Common Subsequence (not substring). I will denote it as lcs(x,y)lcs(x, y). And length of the string as x|x|. Than:

  • lcs(test,east)lcs(test, east) is est
  • test|test| is 4

See LCS distance

Levenshtein distance

distlcs(test,east)dist_{lcs}(test, east) is 2. But distlcs(test,text)dist_{lcs}(test, text) is also 2, which doesn’t make sense. Levenshtein distance improves “LCS edit distance” by adding substitution operation, so distlev(test,text)dist_{lev}(test, text) is 1.

See Levenshtein distance

Damerau-Levenshtein distance

distlev(test,tset)dist_{lev}(test, tset) is 2, but this is common typo - accidentally type letters in reverse order. Damerau-Levenshtein distance improves Levenshtein distance by adding transposition operation, so distdamlev(test,tset)dist_{damlev}(test, tset) is 1.

See Damerau-Levenshtein distance

More…

We can think of another types of operations or weights, for example:

  • deleting or inserting duplicate letter, like in mesage and message can weight less - typical error
  • deleting or inserting letters in the start and end can weight more. To take into account cognitive effect which allows people to recognize scrambled words. See: Typoglycemia
  • allow transposition with a distance

Jaro similarity

simLCS(FAREMVIEL,FARMVILLE)=7/90.77sim_{LCS}(FAREMVIEL, FARMVILLE) = 7 / 9 \approx 0.77. But lcs(FAREMVIEL,FARMVILLE)lcs(FAREMVIEL, FARMVILLE) are FARMVIL, FARMVIE e.g. two different strings, but the same length. Jaro similarity improves LCS similarity by taking into account such situations. simJaro(FAREMVIEL,FARMVILLE)=0.88sim_{Jaro}(FAREMVIEL, FARMVILLE) = 0.88. jaro_subsequence(FAREMVIEL, FARMVILLE) are FARMVIEL are FARMVILE. TODO: add Jaro to Ngrams, Qgrams, skip-grams, etc.

But Jaro similarity is not direct extension of LCS similarity, otherwise in the absence of transposition they would return the same value:

  • simLCS(test,east)=3/4sim_{LCS}(test, east) = 3/4
  • simJaro(test,east)=5/6sim_{Jaro}(test, east) = 5/6
  • simLev(test,east)=1/2sim_{Lev}(test, east) = 1/2

See Jaro similarity

Side note about empty string

Let’s denote empty string as ε\varepsilon and operation of replacing one letter with another as δ(x,y)\delta(x, y) then:

  • replacement operation would be δ(x,y)\delta(x,y)
  • delete operation would be δ(x,ε)\delta(x,\varepsilon)
  • insert operation would be δ(ε,x)\delta(\varepsilon, x)
  • and even transpose operation can be represented as δ(xy,yx)\delta(xy, yx)

Which makes replacement operation only one required to calculate edit distance, but this is not a trditional point of view. Otherwise we would not have Hamming distance and LCS distance.

One more side idea: replacement can be viewed as insertion and deletion than we can get triangle inequality: c(δ(x,y))c(δ(x,ε))+c(δ(ε,y))c(\delta(x,y)) \leq c(\delta(x, \varepsilon)) + c(\delta(\varepsilon, y)) where cc is a cost of operation.

Other combinations

Following “rule of product” we can come up with more imaginary edit distances (they may be not useful though)

deleteinsertsubstitutetranspose
Unknown 1++
Unknown 2+++