Type of Edit distance , which takes into account only insert
and delete
operations.
LCS stands for Longest Common Subsequence (not substring). I will denote it as l c s ( x , y ) lcs(x, y) l cs ( x , y ) . And length of the string as ∣ x ∣ |x| ∣ x ∣ . Than:
l c s ( t e s t , e a s t ) lcs(test, east) l cs ( t es t , e a s t ) is est
∣ t e s t ∣ |test| ∣ t es t ∣ is 4
Now let’s assume we want to measure edit distance between string, using only delete and insert operations:
number of letters we need to delete from x x x in order to turn it in y y y is ∣ x ∣ − ∣ l c s ( x , y ) ∣ |x| - |lcs(x, y)| ∣ x ∣ − ∣ l cs ( x , y ) ∣
number of letters we need to insert in x x x in order to turn it in y y y is ∣ y ∣ − ∣ l c s ( x , y ) ∣ |y| - |lcs(x, y)| ∣ y ∣ − ∣ l cs ( x , y ) ∣
So edit distance is:
d i s t L C S ( x , y ) = ∣ x ∣ + ∣ y ∣ − 2 ⋅ ∣ l c s ( x , y ) ∣ dist_{LCS}(x, y) = |x| + |y| - 2\cdot|lcs(x, y)| d i s t L CS ( x , y ) = ∣ x ∣ + ∣ y ∣ − 2 ⋅ ∣ l cs ( x , y ) ∣
Interesting observation
LCS similarity with Higuera-Mico Dice coefficient s i m H M − L C S ( x , y ) = 2 ⋅ ∣ l c s ( x , y ) ∣ ∣ x ∣ + ∣ y ∣ s i m D i c e ( x , y ) = 2 ∣ x ∩ y ∣ ∣ x ∣ + ∣ y ∣ Marzal-Vidal similarity w/o replace Jaccard index s i m MV-wor ( x , y ) = ∣ l c s ( x , y ) ∣ ∣ s c s ( x , y ) ∣ s i m J a c c a r d ( x , y ) = ∣ x ∩ y ∣ ∣ x ∪ y ∣ \begin{align*}
&\text{LCS similarity with Higuera-Mico} &\text{Dice coefficient} \\
&sim_{HM-LCS}(x,y) = \frac{2 \cdot |lcs(x, y)|}{|x| + |y|} &sim_{Dice}(x,y) = \frac{2 |x \cap y|}{|x| + |y|}\\
\\
&\text{Marzal-Vidal similarity w/o replace} &\text{Jaccard index}\\
&sim_{\text{MV-wor}}(x,y) = \frac{|lcs(x, y)|}{|scs(x, y)|} &sim_{Jaccard}(x,y) = {{|x \cap y|}\over{|x \cup y|}}
\end{align*} LCS similarity with Higuera-Mico s i m H M − L CS ( x , y ) = ∣ x ∣ + ∣ y ∣ 2 ⋅ ∣ l cs ( x , y ) ∣ Marzal-Vidal similarity w/o replace s i m MV-wor ( x , y ) = ∣ scs ( x , y ) ∣ ∣ l cs ( x , y ) ∣ Dice coefficient s i m D i ce ( x , y ) = ∣ x ∣ + ∣ y ∣ 2∣ x ∩ y ∣ Jaccard index s i m J a cc a r d ( x , y ) = ∣ x ∪ y ∣ ∣ x ∩ y ∣
LCS similarity with Higuera-Mico normalization
We can see that maximum of “LCS edit distance” is ∣ x ∣ + ∣ y ∣ |x| + |y| ∣ x ∣ + ∣ y ∣ , than normalized distance is (Higuera, Mico (normalisation) ):
d i s t H M − L C S ( x , y ) = 1 − 2 ⋅ ∣ l c s ( x , y ) ∣ ∣ x ∣ + ∣ y ∣ dist_{HM-LCS}(x,y) = 1 - \frac{2 \cdot |lcs(x, y)|}{|x| + |y|} d i s t H M − L CS ( x , y ) = 1 − ∣ x ∣ + ∣ y ∣ 2 ⋅ ∣ l cs ( x , y ) ∣
s i m H M − L C S ( x , y ) = 2 ⋅ ∣ l c s ( x , y ) ∣ ∣ x ∣ + ∣ y ∣ sim_{HM-LCS}(x,y) = \frac{2 \cdot |lcs(x, y)|}{|x| + |y|} s i m H M − L CS ( x , y ) = ∣ x ∣ + ∣ y ∣ 2 ⋅ ∣ l cs ( x , y ) ∣
Which looks similar to Dice coefficient .
Marzal-Vidal similarity without replace
If we would use only insert
and delete
operations (like in LCS) and apply the same formula as in Marzal-Vidal edit distance , than “edit path” length would be the same as SCS (shortest common supersequence, ∣ s c s ( x , y ) ∣ = ∣ x ∣ + ∣ y ∣ − ∣ l c s ( x , y ) ∣ |scs(x,y)| = |x| + |y| - |lcs(x, y)| ∣ scs ( x , y ) ∣ = ∣ x ∣ + ∣ y ∣ − ∣ l cs ( x , y ) ∣ ):
d i s MV-wor ( x , y ) = ∣ x ∣ + ∣ y ∣ − 2 ⋅ ∣ l c s ( x , y ) ∣ ∣ x ∣ + ∣ y ∣ − ∣ l c s ( x , y ) ∣ = 1 − ∣ l c s ( x , y ) ∣ ∣ x ∣ + ∣ y ∣ − ∣ l c s ( x , y ) ∣ dis_{\text{MV-wor}}(x,y) = \frac{|x| + |y| - 2\cdot|lcs(x, y)|}{|x| + |y| - |lcs(x, y)|} = 1 - \frac{|lcs(x, y)|}{|x| + |y| - |lcs(x, y)|} d i s MV-wor ( x , y ) = ∣ x ∣ + ∣ y ∣ − ∣ l cs ( x , y ) ∣ ∣ x ∣ + ∣ y ∣ − 2 ⋅ ∣ l cs ( x , y ) ∣ = 1 − ∣ x ∣ + ∣ y ∣ − ∣ l cs ( x , y ) ∣ ∣ l cs ( x , y ) ∣
s i m MV-wor ( x , y ) = ∣ l c s ( x , y ) ∣ ∣ s c s ( x , y ) ∣ sim_{\text{MV-wor}}(x,y) = \frac{|lcs(x, y)|}{|scs(x, y)|} s i m MV-wor ( x , y ) = ∣ scs ( x , y ) ∣ ∣ l cs ( x , y ) ∣
Which looks similar to Jaccard index
LCS similarity with Li-Bo normalization
If we apply Li, Bo (normalisation) to LCS distance:
d i s L B − L C S ( x , y ) = 2 ⋅ d i s ( x , y ) ∣ x ∣ + ∣ y ∣ + d i s ( x , y ) = 1 − ∣ l c s ( x , y ) ∣ ∣ x ∣ + ∣ y ∣ − ∣ l c s ( x , y ) ∣ dis_{LB-LCS}(x, y) = \frac{2 \cdot dis(x,y)}{|x| + |y| + dis(x, y)} = 1 - \frac{|lcs(x, y)|}{|x| + |y| - |lcs(x, y)|} d i s L B − L CS ( x , y ) = ∣ x ∣ + ∣ y ∣ + d i s ( x , y ) 2 ⋅ d i s ( x , y ) = 1 − ∣ x ∣ + ∣ y ∣ − ∣ l cs ( x , y ) ∣ ∣ l cs ( x , y ) ∣
Which is the same as “Marzal-Vidal similarity without replace” (see above).
Others
We can continue to substitue l c s lcs l cs instead of ∩ \cap ∩ and s c s scs scs instead of ∪ \cup ∪ in formulas for sets:
Otsuka-Ochiai coefficient s i m ( x , y ) = ∣ l c s ( x , y ) ∣ ∣ x ∣ ⋅ ∣ y ∣ s i m O O ( x , y ) = ∣ x ∩ y ∣ ∣ x ∣ ⋅ ∣ y ∣ Overlap coefficient s i m ( x , y ) = ∣ l c s ( x , y ) ∣ min ( ∣ x ∣ , ∣ y ∣ ) s i m o v e r l a p ( x , y ) = ∣ x ∩ y ∣ min ( ∣ x ∣ , ∣ y ∣ ) \begin{align*}
&\text{} &\text{Otsuka-Ochiai coefficient} \\
&sim(x,y) = \frac{|lcs(x,y)|}{\sqrt{|x|\cdot|y|}} &sim_{OO}(x,y) = \frac{|x \cap y|}{\sqrt{|x|\cdot|y|}}\\
\\
&\text{} &\text{Overlap coefficient}\\
&sim(x,y) = \frac{|lcs(x,y)|}{\min(|x|,|y|)} &sim_{overlap}(x,y) = \frac{|x \cap y|}{\min(|x|,|y|)}
\end{align*} s im ( x , y ) = ∣ x ∣ ⋅ ∣ y ∣ ∣ l cs ( x , y ) ∣ s im ( x , y ) = min ( ∣ x ∣ , ∣ y ∣ ) ∣ l cs ( x , y ) ∣ Otsuka-Ochiai coefficient s i m OO ( x , y ) = ∣ x ∣ ⋅ ∣ y ∣ ∣ x ∩ y ∣ Overlap coefficient s i m o v er l a p ( x , y ) = min ( ∣ x ∣ , ∣ y ∣ ) ∣ x ∩ y ∣
See Otsuka-Ochiai coefficient , Overlap coefficient .
Algorithms
Algorithm Time complexity Space complexity Dynamic programming solution 1 O ( n m ) O(nm) O ( nm ) O ( n m ) O(nm) O ( nm ) Larsen2 O ( log ( m ) log ( n ) ) O(\log(m) \log(n)) O ( log ( m ) log ( n )) O ( m n 2 ) O(mn^2) O ( m n 2 ) Hunt–Szymanski3 O ( r log ( n ) + m log ( m ) ) O(r \log(n) + m \log(m)) O ( r log ( n ) + m log ( m ))
Vizualization
Dynamic programming solution:
Reading