Jaccard index, generalized

The Jaccard index can be generalized to multisets (also called bags), which are sets in which elements may occur more than once.

Two multisets $x$ and $y$ can be represented over a common support as vectors $\vec{x} = [x_1, x_2, \ldots, x_N]$ and $\vec{y} = [y_1, y_2, \ldots, y_N]$, where $N$ is the number of distinct elements in the union of the two multisets, and $x_i$ (respectively $y_i$) is the multiplicity of element $i$ in $x$ (respectively $y$). The Jaccard index for multisets then becomes:

$$sim_{GenJaccard}(x, y) = \frac{\sum_{i=1}^{N} \min(x_i, y_i)}{\sum_{i=1}^{N} \max(x_i, y_i)}$$

with $0 \leq sim_{GenJaccard}(x, y) \leq 1$, since multiplicities are non-negative and $\min(x_i, y_i) \leq \max(x_i, y_i)$ for every $i$.
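The formula translates almost directly into code. The following is a minimal Python sketch, not a reference implementation: the function name `generalized_jaccard` is hypothetical, and returning 1.0 for two empty multisets (where the denominator is zero and the formula is undefined) is an assumption made here for convenience.

```python
from typing import Sequence


def generalized_jaccard(x: Sequence[float], y: Sequence[float]) -> float:
    """Generalized Jaccard similarity of two equal-length multiplicity vectors."""
    if len(x) != len(y):
        raise ValueError("x and y must be indexed over the same support")
    if any(v < 0 for v in x) or any(v < 0 for v in y):
        raise ValueError("multiplicities must be non-negative")

    # Numerator: sum of element-wise minima; denominator: sum of element-wise maxima.
    numerator = sum(min(a, b) for a, b in zip(x, y))
    denominator = sum(max(a, b) for a, b in zip(x, y))

    # Assumption (not part of the formula): two empty multisets count as identical.
    if denominator == 0:
        return 1.0
    return numerator / denominator
```

Identical vectors give 1.0, and vectors with disjoint support give 0.0, which matches the bounds stated above.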

As an example, consider $x = \lbrace\lbrace a, a, a, b, b \rbrace\rbrace$ and $y = \lbrace\lbrace a, a, b, c, c, d \rbrace\rbrace$. Indexing the possible elements by the vector $\vec{p} = [a, b, c, d]$ gives $\vec{x} = [3, 2, 0, 0]$ and $\vec{y} = [2, 1, 2, 1]$. The order of elements in $\vec{p}$ does not affect the result, provided both vectors use the same order.

Then, we have:

$$sim_{GenJaccard}(x, y) = \frac{2 + 1 + 0 + 0}{3 + 2 + 2 + 1} = \frac{3}{8}$$
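The worked example can be checked by counting multiplicities straight from the raw bags, without fixing an indexing vector by hand. This sketch uses `collections.Counter` and `fractions.Fraction` from the Python standard library. The helper name `generalized_jaccard_bags` is hypothetical, and treating two empty bags as identical is an assumption rather than part of the definition.

```python
from collections import Counter
from fractions import Fraction
from typing import Hashable, Iterable


def generalized_jaccard_bags(x: Iterable[Hashable], y: Iterable[Hashable]) -> Fraction:
    """Generalized Jaccard similarity computed directly from two bags of items."""
    cx, cy = Counter(x), Counter(y)

    # The support is the union of distinct elements; a Counter returns 0
    # for an element it has never seen, which gives the zero multiplicities.
    support = set(cx) | set(cy)

    numerator = sum(min(cx[e], cy[e]) for e in support)
    denominator = sum(max(cx[e], cy[e]) for e in support)

    # Assumption (not part of the formula): two empty bags count as identical.
    if denominator == 0:
        return Fraction(1)
    return Fraction(numerator, denominator)


x = ["a", "a", "a", "b", "b"]
y = ["a", "a", "b", "c", "c", "d"]
print(generalized_jaccard_bags(x, y))  # prints 3/8
```

Because the sums run over a set, the result is independent of any ordering of the elements. Returning a `Fraction` keeps the exact ratio; convert with `float(...)` if a decimal value is preferred.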

Reading

Costa, Luciano da Fontoura. “Further Generalizations of the Jaccard Index.” arXiv:2110.09619 (2021).

type	similarity
normalized	yes
representation	bag