How to compute Jaccard similarity from a pandas DataFrame

Use pairwise_distances to calculate the Jaccard distance between the columns and subtract that distance from 1 to get the similarity score:

from sklearn.metrics.pairwise import pairwise_distances
1 - pairwise_distances(df.T.to_numpy(), metric="jaccard")
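
If you want to be explicit that the data is binary, you can cast to bool first; the jaccard metric works on boolean values, and depending on your scikit-learn version, passing integer data may trigger a warning about the conversion. A minimal variant:

# Same result; the explicit bool cast just makes the binary assumption visible
1 - pairwise_distances(df.T.to_numpy().astype(bool), metric="jaccard")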

Explanation:

In newer versions of scikit-learn, the definition of jaccard_score matches the Jaccard similarity coefficient definition on Wikipedia:

J(A, B) = M11 / (M01 + M10 + M11)

where

  • M11 represents the total number of attributes where A and B both have a value of 1.
  • M01 represents the total number of attributes where the attribute of A is 0 and the attribute of B is 1.
  • M10 represents the total number of attributes where the attribute of A is 1 and the attribute of B is 0.
  • M00 represents the total number of attributes where A and B both have a value of 0.
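
Spelled out in code, the definition boils down to this (a small illustrative helper, not part of any library):

import numpy as np

def jaccard_similarity(a, b):
    # a, b: 1-D arrays of 0/1 values
    a = np.asarray(a, dtype=bool)
    b = np.asarray(b, dtype=bool)
    m11 = np.sum(a & b)    # both 1
    m01 = np.sum(~a & b)   # a is 0, b is 1
    m10 = np.sum(a & ~b)   # a is 1, b is 0
    # M00 (both 0) does not appear in the formula -- matching zeros are ignored
    return m11 / (m01 + m10 + m11)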

Let’s create a sample dataset to see if the results match:

from pandas import DataFrame, crosstab
from numpy.random import default_rng
rng = default_rng(0)

# Create a dataframe of 40 rows and 5 columns (named A, B, C, D, E)
# Each cell in the DataFrame is either 0 or 1 with 50% probability
df = DataFrame(rng.binomial(1, 0.5, size=(40, 5)), columns=list('ABCDE'))

This yields the following crosstab for columns A and B:

A \ B    0    1
    0   10    7
    1   14    9
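
These counts can be reproduced with the crosstab imported earlier (the exact output layout may differ slightly):

print(crosstab(df['A'], df['B']))  # rows: values of A, columns: values of B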

Based on the definition, the Jaccard similarity score is:

M00 = (df['A'].eq(0) & df['B'].eq(0)).sum()  # 10
M01 = (df['A'].eq(0) & df['B'].eq(1)).sum()  # 7
M10 = (df['A'].eq(1) & df['B'].eq(0)).sum()  # 14
M11 = (df['A'].eq(1) & df['B'].eq(1)).sum()  # 9


print(M11 / (M01 + M10 + M11))  # 9 / (7 + 14 + 9) = 0.3

This is what you would get with jaccard_score:

from sklearn.metrics import jaccard_score
print(jaccard_score(df['A'], df['B']))  # 0.3

The problem with jaccard_score is that it is not vectorized: you would have to loop over every pair of columns to compute the score for each pair (a sketch of that loop is shown further below). To avoid the loop, use the vectorized distance version instead. Since it returns a distance rather than a similarity, subtract the result from 1:

from sklearn.metrics.pairwise import pairwise_distances
print(1 - pairwise_distances(df.T.to_numpy(), metric="jaccard"))

# [[1.         0.3        0.45714286 0.34285714 0.46666667]
#  [0.3        1.         0.29411765 0.33333333 0.23333333]
#  [0.45714286 0.29411765 1.         0.40540541 0.44117647]
#  [0.34285714 0.33333333 0.40540541 1.         0.36363636]
#  [0.46666667 0.23333333 0.44117647 0.36363636 1.        ]]
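
For comparison, the explicit loop mentioned above would look roughly like this (a sketch only; the vectorized call is what you would actually use):

from itertools import combinations
from pandas import DataFrame
from sklearn.metrics import jaccard_score

# Build the same similarity matrix pair by pair (slower, more code)
jac_loop = DataFrame(1.0, index=df.columns, columns=df.columns)
for a, b in combinations(df.columns, 2):
    score = jaccard_score(df[a], df[b])
    jac_loop.loc[a, b] = jac_loop.loc[b, a] = score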

Optionally, you can wrap the result in a DataFrame, using the column names as labels:

jac_sim = 1 - pairwise_distances(df.T.to_numpy(), metric="jaccard")
jac_sim_df = DataFrame(jac_sim, index=df.columns, columns=df.columns)

#           A         B         C         D         E
#  A  1.000000  0.300000  0.457143  0.342857  0.466667
#  B  0.300000  1.000000  0.294118  0.333333  0.233333
#  C  0.457143  0.294118  1.000000  0.405405  0.441176
#  D  0.342857  0.333333  0.405405  1.000000  0.363636
#  E  0.466667  0.233333  0.441176  0.363636  1.000000
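
With labeled rows and columns, individual pairs can then be looked up directly:

print(jac_sim_df.loc['A', 'B'])  # 0.3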

Note: A previous version of this answer used the hamming metric with pairwise_distances, because in earlier versions of scikit-learn, jaccard_score was computed similarly to the accuracy score, i.e. (M00 + M11) / (M00 + M01 + M10 + M11). That is no longer the case, so the answer was updated to use the jaccard metric instead of hamming.
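
To see the difference on the sample data, the hamming-based value also counts the matching zeros (a quick check, using the counts from the crosstab above):

print(1 - pairwise_distances(df[['A', 'B']].T.to_numpy(), metric="hamming"))
# off-diagonal entry: (M00 + M11) / 40 = (10 + 9) / 40 = 0.475
# versus the Jaccard similarity M11 / (M01 + M10 + M11) = 0.3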
