Skip to content

LCS distance

Type of Edit distance, which takes into account only insert and delete operations.

LCS stands for Longest Common Subsequence (not substring). I will denote it as lcs(x,y)lcs(x, y). And length of the string as x|x|. Than:

  • lcs(test,east)lcs(test, east) is est
  • test|test| is 4

Now let’s assume we want to measure edit distance between string, using only delete and insert operations:

  • number of letters we need to delete from xx in order to turn it in yy is xlcs(x,y)|x| - |lcs(x, y)|
  • number of letters we need to insert in xx in order to turn it in yy is ylcs(x,y)|y| - |lcs(x, y)|

So edit distance is:

distLCS(x,y)=x+y2lcs(x,y)dist_{LCS}(x, y) = |x| + |y| - 2\cdot|lcs(x, y)|

Interesting observation

LCS similarity with Higuera-MicoDice coefficientsimHMLCS(x,y)=2lcs(x,y)x+ysimDice(x,y)=2xyx+yMarzal-Vidal similarity w/o replaceJaccard indexsimMV-wor(x,y)=lcs(x,y)scs(x,y)simJaccard(x,y)=xyxy\begin{align*} &\text{LCS similarity with Higuera-Mico} &\text{Dice coefficient} \\ &sim_{HM-LCS}(x,y) = \frac{2 \cdot |lcs(x, y)|}{|x| + |y|} &sim_{Dice}(x,y) = \frac{2 |x \cap y|}{|x| + |y|}\\ \\ &\text{Marzal-Vidal similarity w/o replace} &\text{Jaccard index}\\ &sim_{\text{MV-wor}}(x,y) = \frac{|lcs(x, y)|}{|scs(x, y)|} &sim_{Jaccard}(x,y) = {{|x \cap y|}\over{|x \cup y|}} \end{align*}

LCS similarity with Higuera-Mico normalization

We can see that maximum of “LCS edit distance” is x+y|x| + |y|, than normalized distance is (Higuera, Mico (normalisation)):

distHMLCS(x,y)=12lcs(x,y)x+ydist_{HM-LCS}(x,y) = 1 - \frac{2 \cdot |lcs(x, y)|}{|x| + |y|} simHMLCS(x,y)=2lcs(x,y)x+ysim_{HM-LCS}(x,y) = \frac{2 \cdot |lcs(x, y)|}{|x| + |y|}

Which looks similar to Dice coefficient.

Marzal-Vidal similarity without replace

If we would use only insert and delete operations (like in LCS) and apply the same formula as in Marzal-Vidal edit distance, than “edit path” length would be the same as SCS (shortest common supersequence, scs(x,y)=x+ylcs(x,y)|scs(x,y)| = |x| + |y| - |lcs(x, y)|):

disMV-wor(x,y)=x+y2lcs(x,y)x+ylcs(x,y)=1lcs(x,y)x+ylcs(x,y)dis_{\text{MV-wor}}(x,y) = \frac{|x| + |y| - 2\cdot|lcs(x, y)|}{|x| + |y| - |lcs(x, y)|} = 1 - \frac{|lcs(x, y)|}{|x| + |y| - |lcs(x, y)|} simMV-wor(x,y)=lcs(x,y)scs(x,y)sim_{\text{MV-wor}}(x,y) = \frac{|lcs(x, y)|}{|scs(x, y)|}

Which looks similar to Jaccard index

LCS similarity with Li-Bo normalization

If we apply Li, Bo (normalisation) to LCS distance:

disLBLCS(x,y)=2dis(x,y)x+y+dis(x,y)=1lcs(x,y)x+ylcs(x,y)dis_{LB-LCS}(x, y) = \frac{2 \cdot dis(x,y)}{|x| + |y| + dis(x, y)} = 1 - \frac{|lcs(x, y)|}{|x| + |y| - |lcs(x, y)|}

Which is the same as “Marzal-Vidal similarity without replace” (see above).

Others

We can continue to substitue lcslcs instead of \cap and scsscs instead of \cup in formulas for sets:

Otsuka-Ochiai coefficientsim(x,y)=lcs(x,y)xysimOO(x,y)=xyxyOverlap coefficientsim(x,y)=lcs(x,y)min(x,y)simoverlap(x,y)=xymin(x,y)\begin{align*} &\text{} &\text{Otsuka-Ochiai coefficient} \\ &sim(x,y) = \frac{|lcs(x,y)|}{\sqrt{|x|\cdot|y|}} &sim_{OO}(x,y) = \frac{|x \cap y|}{\sqrt{|x|\cdot|y|}}\\ \\ &\text{} &\text{Overlap coefficient}\\ &sim(x,y) = \frac{|lcs(x,y)|}{\min(|x|,|y|)} &sim_{overlap}(x,y) = \frac{|x \cap y|}{\min(|x|,|y|)} \end{align*}

See Otsuka-Ochiai coefficient, Overlap coefficient.

Algorithms

AlgorithmTime complexitySpace complexity
Dynamic programming solution 1O(nm)O(nm)O(nm)O(nm)
Larsen2O(log(m)log(n))O(\log(m) \log(n))O(mn2)O(mn^2)
Hunt–Szymanski3O(rlog(n)+mlog(m))O(r \log(n) + m \log(m))

Vizualization

Reading

Footnotes

  1. string2string: A Modern Python Library for String-to-String Algorithms

  2. “Length of Maximal Common Subsequences”, K.S. Larsen.

  3. The Hunt-Szymanski Algorithm for LCS