LCS distance

Type of Edit distance, which takes into account only insert and delete operations.

LCS stands for Longest Common Subsequence (not substring). I will denote it as $lcs(x, y)$ . And length of the string as $|x|$ . Than:

$lcs(test, east)$ is est
$|test|$ is 4

Now let’s assume we want to measure edit distance between string, using only delete and insert operations:

number of letters we need to delete from $x$ in order to turn it in $y$ is $|x| - |lcs(x, y)|$
number of letters we need to insert in $x$ in order to turn it in $y$ is $|y| - |lcs(x, y)|$

So edit distance is:

dist_{LCS}(x, y) = |x| + |y| - 2\cdot|lcs(x, y)|

Interesting observation

\begin{align*} &\text{LCS similarity with Higuera-Mico} &\text{Dice coefficient} \\ &sim_{HM-LCS}(x,y) = \frac{2 \cdot |lcs(x, y)|}{|x| + |y|} &sim_{Dice}(x,y) = \frac{2 |x \cap y|}{|x| + |y|}\\ \\ &\text{Marzal-Vidal similarity w/o replace} &\text{Jaccard index}\\ &sim_{\text{MV-wor}}(x,y) = \frac{|lcs(x, y)|}{|scs(x, y)|} &sim_{Jaccard}(x,y) = {{|x \cap y|}\over{|x \cup y|}} \end{align*}

LCS similarity with Higuera-Mico normalization

We can see that maximum of “LCS edit distance” is $|x| + |y|$ , than normalized distance is (Higuera, Mico (normalisation)):

dist_{HM-LCS}(x,y) = 1 - \frac{2 \cdot |lcs(x, y)|}{|x| + |y|}

sim_{HM-LCS}(x,y) = \frac{2 \cdot |lcs(x, y)|}{|x| + |y|}

Which looks similar to Dice coefficient.

Marzal-Vidal similarity without replace

If we would use only insert and delete operations (like in LCS) and apply the same formula as in Marzal-Vidal edit distance, than “edit path” length would be the same as SCS (shortest common supersequence, $|scs(x,y)| = |x| + |y| - |lcs(x, y)|$ ):

dis_{\text{MV-wor}}(x,y) = \frac{|x| + |y| - 2\cdot|lcs(x, y)|}{|x| + |y| - |lcs(x, y)|} = 1 - \frac{|lcs(x, y)|}{|x| + |y| - |lcs(x, y)|}

sim_{\text{MV-wor}}(x,y) = \frac{|lcs(x, y)|}{|scs(x, y)|}

Which looks similar to Jaccard index

LCS similarity with Li-Bo normalization

If we apply Li, Bo (normalisation) to LCS distance:

dis_{LB-LCS}(x, y) = \frac{2 \cdot dis(x,y)}{|x| + |y| + dis(x, y)} = 1 - \frac{|lcs(x, y)|}{|x| + |y| - |lcs(x, y)|}

Which is the same as “Marzal-Vidal similarity without replace” (see above).

Others

We can continue to substitue $lcs$ instead of $\cap$ and $scs$ instead of $\cup$ in formulas for sets:

\begin{align*} &\text{} &\text{Otsuka-Ochiai coefficient} \\ &sim(x,y) = \frac{|lcs(x,y)|}{\sqrt{|x|\cdot|y|}} &sim_{OO}(x,y) = \frac{|x \cap y|}{\sqrt{|x|\cdot|y|}}\\ \\ &\text{} &\text{Overlap coefficient}\\ &sim(x,y) = \frac{|lcs(x,y)|}{\min(|x|,|y|)} &sim_{overlap}(x,y) = \frac{|x \cap y|}{\min(|x|,|y|)} \end{align*}

See Otsuka-Ochiai coefficient, Overlap coefficient.

Algorithms

Algorithm	Time complexity	Space complexity
Dynamic programming solution ¹	$O(nm)$	$O(nm)$
Larsen²	$O(\log(m) \log(n))$	$O(mn^2)$
Hunt–Szymanski³	$O(r \log(n) + m \log(m))$

Vizualization

Dynamic programming solution:
- algorithm-visualizer ↗

Reading

Algorithms for computing variants of the longest common subsequence problem ↗

string2string: A Modern Python Library for String-to-String Algorithms ↗ ↩
“Length of Maximal Common Subsequences”, K.S. Larsen. ↩
The Hunt-Szymanski Algorithm for LCS ↗ ↩

type	dissimilarity
normalized	no
representation	sequence