Given a string x over an alphabet Σ, let x=ms(x) denote the multiset (bag) of symbols in x. For instance, ms(peer)={{e,e,p,r}}. The following can be easily proved to be a metric on multisets:
disbag(x,y)=max(∣x∖y∣,∣y∖x∣)
where the difference has a bag semantics (e.g. {{a,a,a,b}}∖{{a,a,b,c,c}}={{a}}), and ∣⋅∣ counts the number of elements in a multiset (e.g. ∣{{a,a}}∣=2).
In practice, disbag first “drops” common elements, then takes the maximum
considering the number of “residual” elements. For instance:
disbag({{a,a,a,b}},{{a,a,b,c,c}})=max(∣{{a}}∣,∣{{c,c}}∣)=2
It is immediate to observe that disbag(x,y)≤disLev(x,y),∀x,y∈Σ∗. See Levenshtein distance.
Further, since computing disbag(x,y) is O(∣x∣+∣y∣), disbag can indeed be effectively used to quickly filter out candidate strings.
Reading