# Edit distance

**Edit distance** of two strings $dist(x, y)$ is the minimal number of edit operations needed to transform first string ($x$) into the second ($y$). Edit distance is a measure.

Diffrent edit distances can take into account different **edit operations**: delete, insert, substitute, transpose.

Name | delete | insert | substitute | transpose |
---|---|---|---|---|

Hamming distance | + | |||

Harmonic edit distance | + | + | ||

LCS distance | + | + | ||

Levenshtein distance | + | + | + | |

Marzal-Vidal edit distance | + | + | + | |

Damerau-Levenshtein distance | + | + | + | + |

## Hamming distance

Hamming distance is a very specific measure. It assumes that strings are of the same length. It is rather makes sense for networks, signal processing, where messages (think of network packets) have constant size.

So I won’t compare it to other measures.

See Hamming distance

## LCS distnace

LCS stands for Longest Common Subsequence (not substring). I will denote it as $lcs(x, y)$. And length of the string as $|x|$. Than:

- $lcs(test, east)$ is
`est`

- $|test|$ is
`4`

See LCS distance

## Levenshtein distance

$dist_{lcs}(test, east)$ is `2`

. But $dist_{lcs}(test, text)$ is also `2`

, which doesn’t make sense. Levenshtein distance improves “LCS edit distance” by adding **substitution** operation, so $dist_{lev}(test, text)$ is `1`

.

## Damerau-Levenshtein distance

$dist_{lev}(test, tset)$ is `2`

, but this is common typo - accidentally type letters in reverse order. Damerau-Levenshtein distance improves Levenshtein distance by adding **transposition** operation, so $dist_{damlev}(test, tset)$ is `1`

.

See Damerau-Levenshtein distance

## More…

We can think of another types of operations or weights, for example:

- deleting or inserting duplicate letter, like in
`mesage`

and`message`

can weight less - typical error - deleting or inserting letters in the start and end can weight more. To take into account cognitive effect which allows people to recognize scrambled words. See: Typoglycemia
- allow transposition with a distance

## Jaro similarity

$sim_{LCS}(FAREMVIEL, FARMVILLE) = 7 / 9 \approx 0.77$. But $lcs(FAREMVIEL, FARMVILLE)$ are `FARMVIL`

, `FARMVIE`

e.g. two different strings, but the same length. Jaro similarity improves LCS similarity by taking into account such situations. $sim_{Jaro}(FAREMVIEL, FARMVILLE) = 0.88$. `jaro_subsequence(FAREMVIEL, FARMVILLE)`

are `FARMVIEL`

are `FARMVILE`

. **TODO:** add Jaro to Ngrams, Qgrams, skip-grams, etc.

But Jaro similarity is not direct extension of LCS similarity, otherwise in the absence of transposition they would return the same value:

- $sim_{LCS}(test, east) = 3/4$
- $sim_{Jaro}(test, east) = 5/6$
- $sim_{Lev}(test, east) = 1/2$

See Jaro similarity

## Side note about empty string

Let’s denote empty string as $\varepsilon$ and operation of replacing one letter with another as $\delta(x, y)$ then:

- replacement operation would be $\delta(x,y)$
- delete operation would be $\delta(x,\varepsilon)$
- insert operation would be $\delta(\varepsilon, x)$
- and even transpose operation can be represented as $\delta(xy, yx)$

Which makes replacement operation only one required to calculate edit distance, but this is not a trditional point of view. Otherwise we would not have Hamming distance and LCS distance.

One more side idea: replacement can be viewed as insertion and deletion than we can get triangle inequality: $c(\delta(x,y)) \leq c(\delta(x, \varepsilon)) + c(\delta(\varepsilon, y))$ where $c$ is a cost of operation.

## Other combinations

Following “rule of product” we can come up with more imaginary edit distances (they may be not useful though)

delete | insert | substitute | transpose | |
---|---|---|---|---|

Unknown 1 | + | + | ||

Unknown 2 | + | + | + |