Skip to content

Jaccard index, generalized

The Jaccard index can be generalized to multisets or bags, which are basically sets in which repeated elements are allowed.

The multisets xx and yy sharing the same elements (support) can be simply represented as respective vectors x=[x1,x2,,xN],y=[y1,y2,,yN]\vec{x} = [x_1, x_2, \ldots, x_N], \vec{y} = [y_1, y_2, \ldots, y_N], where NN is the total number of possible distinct elements in the universe defined by the union of the two multiset elements, and xix_i corresponds to the multiplicity of element ii in the multiset xx. The Jaccard index for multisets then becomes:

simGenJaccard(x,y)=i=1Nmin(xi,yi)i=1Nmax(xi,yi)sim_{GenJaccard}(x, y) = \frac{ \sum_{i=1}^{N} \min(x_i, y_i)}{ \sum_{i=1}^{N} \max(x_i, y_i)}

with 0simGenJaccard(x,y)10 \leq sim_{GenJaccard}(x, y) \leq 1.

As an example, let’s consider x={{a,a,a,b,b}}x = \lbrace\lbrace a, a, a, b, b \rbrace\rbrace and y={{a,a,b,c,c,d}}y = \lbrace\lbrace a, a, b, c, c, d \rbrace\rbrace. If we have the set of possible elements organized into the indexing vector p=[a,b,c,d]\vec{p} = [a, b, c, d], we will obtain x=[3,2,0,0]\vec{x} = [3, 2, 0, 0] and y=[2,1,2,1]\vec{y} = [2, 1, 2, 1]. Observe that the order of elements in p\vec{p} is immaterial to our analysis.

Then, we have:

simGenJaccard(x,y)=2+1+0+03+2+2+1=38sim_{GenJaccard}(x, y) = \frac{2 + 1 + 0 + 0}{3 + 2 + 2 + 1} = \frac{3}{8}