Skip to content

Bag distance

Given a string xx over an alphabet Σ\Sigma, let x=ms(x)x = ms(x) denote the multiset (bag) of symbols in xx. For instance, ms(peer)={{e,e,p,r}}ms(peer) = \lbrace\lbrace e, e, p, r \rbrace\rbrace. The following can be easily proved to be a metric on multisets:

disbag(x,y)=max(xy,yx)dis_{bag}(x, y) = \max(|x \setminus y|, |y \setminus x|)

where the difference has a bag semantics (e.g. {{a,a,a,b}}{{a,a,b,c,c}}={{a}}\lbrace\lbrace a, a, a, b\rbrace\rbrace \setminus \lbrace\lbrace a, a, b, c, c\rbrace\rbrace = \lbrace\lbrace a \rbrace\rbrace), and | · | counts the number of elements in a multiset (e.g. {{a,a}}=2|\lbrace\lbrace a, a \rbrace\rbrace | = 2).
In practice, disbagdis_{bag} first “drops” common elements, then takes the maximum
considering the number of “residual” elements. For instance:

disbag({{a,a,a,b}}{{a,a,b,c,c}})=max({{a}},{{c,c}})=2dis_{bag}(\lbrace\lbrace a, a, a, b \rbrace\rbrace \setminus \lbrace\lbrace a, a, b, c, c \rbrace\rbrace ) = \max(|\lbrace\lbrace a \rbrace\rbrace |, |\lbrace\lbrace c, c \rbrace\rbrace |) = 2

It is immediate to observe that disbag(x,y)disLev(x,y),x,yΣdis_{bag}(x, y) \leq dis_{Lev}(x, y), \forall x, y \in \Sigma^{\ast}. See Levenshtein distance.

Further, since computing disbag(x,y)dis_{bag}(x, y) is O(x+y)O(|x| + |y|), disbagdis_{bag} can indeed be effectively used to quickly filter out candidate strings.