How to compute Jaccard similarity from a pandas DataFrame

Use pairwise_distances to calculate the Jaccard distance, then subtract that distance from 1 to get the similarity score:

    from sklearn.metrics.pairwise import pairwise_distances
    1 - pairwise_distances(df.T.to_numpy(), metric="jaccard")

Explanation: In newer versions of scikit-learn, the definition of jaccard_score is similar to the Jaccard similarity coefficient definition in Wikipedia: J = M11 / (M01 + M10 + M11), where M11 represents the total number of attributes where …
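
To make the score concrete, here is a minimal sketch that computes the same ratio by hand for binary columns, without scikit-learn. The column names and values are hypothetical stand-ins for the columns of a DataFrame, and the handling of two all-zero columns is a convention chosen for this sketch, not something the answer specifies.

```python
from itertools import combinations


def jaccard_similarity(a, b):
    """Jaccard similarity of two equal-length binary sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    # M11: positions where both sequences are set.
    m11 = sum(1 for x, y in zip(a, b) if x and y)
    # M01 + M10 + M11: positions where at least one sequence is set.
    any_set = sum(1 for x, y in zip(a, b) if x or y)
    # Convention chosen for this sketch: two all-zero sequences score 0.0.
    return m11 / any_set if any_set else 0.0


# Hypothetical binary columns, standing in for the columns of a DataFrame.
columns = {
    "a": [1, 1, 0, 0],
    "b": [1, 0, 1, 0],
    "c": [1, 1, 1, 0],
}

# Similarity for every unordered pair of columns.
pairwise = {
    (p, q): jaccard_similarity(columns[p], columns[q])
    for p, q in combinations(columns, 2)
}
```

For binary data, each value here should match the corresponding off-diagonal entry of 1 minus the Jaccard distance matrix, since the Jaccard distance is the share of mismatching positions among positions where at least one column is set.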

Get cosine similarity between two documents in Lucene

As Julia points out, Sujit Pal’s example is very useful, but the Lucene 4 API has changed substantially. Here is a version rewritten for Lucene 4.

    import java.io.IOException;
    import java.util.*;

    import org.apache.commons.math3.linear.*;
    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.core.SimpleAnalyzer;
    import org.apache.lucene.document.*;
    import org.apache.lucene.document.Field.Store;
    import org.apache.lucene.index.*;
    import org.apache.lucene.store.*;
    import org.apache.lucene.util.*;

    public class CosineDocumentSimilarity {

        public static final String CONTENT …
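
Since the Java listing is cut off, the following is not Lucene code but a small Python sketch of the underlying measure: cosine similarity between the term-frequency vectors of two texts. The tokenizer (lowercased runs of ASCII letters) is an assumption made for illustration and only loosely resembles what an analyzer such as SimpleAnalyzer does.

```python
import math
import re
from collections import Counter


def term_frequencies(text):
    """Count terms in a text. Assumed tokenizer: lowercased ASCII letter runs."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine_similarity(text_a, text_b):
    """Cosine of the angle between the two term-frequency vectors."""
    tf_a = term_frequencies(text_a)
    tf_b = term_frequencies(text_b)
    # Dot product over the terms of A; a Counter returns 0 for missing terms.
    dot = sum(count * tf_b[term] for term, count in tf_a.items())
    norm_a = math.sqrt(sum(c * c for c in tf_a.values()))
    norm_b = math.sqrt(sum(c * c for c in tf_b.values()))
    if norm_a == 0 or norm_b == 0:
        # A text with no terms has no direction; treat it as dissimilar.
        return 0.0
    return dot / (norm_a * norm_b)
```

Texts with the same term counts score close to 1, and texts with no terms in common score 0. A real index would usually weight terms (for example by inverse document frequency) instead of using raw counts.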

Selecting close matches from one array based on another reference array

Approach #1: With NumPy broadcasting, we can compute the absolute element-wise differences between the input arrays and use an appropriate threshold to filter out unwanted elements from A. For the given sample inputs, a threshold of 90 seems to work. Thus, we would have an implementation like so:

    thresh = 90
    Aout = A[(np.abs(A[:,None] …
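
The listing above is cut off, so here is a self-contained sketch of the broadcasting idea it describes. The sample arrays are hypothetical, and both the strict `<` comparison and the "close to any element of B" rule are assumptions about how the mask is completed.

```python
import numpy as np


def close_matches(A, B, thresh):
    """Keep the elements of A that lie within thresh of some element of B."""
    A = np.asarray(A)
    B = np.asarray(B)
    # A[:, None] has shape (len(A), 1); subtracting B broadcasts the result
    # to shape (len(A), len(B)), one row of differences per element of A.
    diffs = np.abs(A[:, None] - B)
    # Assumption: an element is kept if it is close to ANY element of B.
    mask = (diffs < thresh).any(axis=1)
    return A[mask]


# Hypothetical sample inputs.
A = np.array([100, 250, 400, 900])
B = np.array([120, 880])
Aout = close_matches(A, B, thresh=90)
```

The intermediate array holds len(A) × len(B) values, so this approach trades memory for simplicity and may not suit very large inputs.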

Solr Custom Similarity

I figured it out on my own. I stored my own implementation of DefaultSimilarity under the /dist/ folder in Solr. Then I added

    <lib dir="../../../dist/org/apache/lucene/search/similarities/" regex=".*\.jar"/>

to my solrconfig.xml, and everything works fine.

    package org.apache.lucene.search.similarities;

    import org.apache.lucene.index.FieldInvertState;
    import org.apache.lucene.search.similarities.DefaultSimilarity;

    public class MyNewSimilarityClass extends DefaultSimilarity {

        @Override
        public float coord(int overlap, int maxOverlap) {
            return 1.0f;
    …
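
To illustrate what overriding coord changes, here is a simplified Python sketch, not Solr or Lucene code. DefaultSimilarity computes coord as overlap / maxOverlap, which rewards documents that match more of the query terms; returning 1.0 removes that reward. The scoring function below is a deliberate simplification (it ignores query normalization and boosts), and the per-term scores are made-up values.

```python
def default_coord(overlap, max_overlap):
    """Fraction of query terms that the document matched."""
    return overlap / max_overlap


def constant_coord(overlap, max_overlap):
    """Mirrors the override above: the coordination factor is disabled."""
    return 1.0


def simplified_score(term_scores, max_overlap, coord):
    """Simplified score: coord factor times the sum of matched-term scores."""
    overlap = len(term_scores)
    return coord(overlap, max_overlap) * sum(term_scores)


# Hypothetical case: a 4-term query where the document matches 2 terms.
matched_term_scores = [0.5, 0.25]
with_default = simplified_score(matched_term_scores, 4, default_coord)
with_override = simplified_score(matched_term_scores, 4, constant_coord)
```

With the default factor the partial match is scaled down by half; with the override the document keeps the full sum of its matched-term scores.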

String similarity with Python + SQLite (Levenshtein distance / edit distance)

Here is a ready-to-use example test.py:

    import sqlite3
    db = sqlite3.connect(':memory:')
    db.enable_load_extension(True)
    db.load_extension('./spellfix')                 # for Linux
    #db.load_extension('./spellfix.dll')            # <-- UNCOMMENT HERE FOR WINDOWS
    db.enable_load_extension(False)
    c = db.cursor()
    c.execute('CREATE TABLE mytable (id integer, description text)')
    c.execute('INSERT INTO mytable VALUES (1, "hello world, guys")')
    c.execute('INSERT INTO mytable VALUES (2, "hello there everybody")')
    c.execute('SELECT * FROM mytable WHERE …
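
If the spellfix extension cannot be loaded, one possible fallback is to register a pure-Python edit-distance function with sqlite3's `create_function`. This is a different technique from the spellfix approach above, sketched under assumptions: the SQL function name `levenshtein`, the helper `find_similar`, and the distance limit of 3 are all hypothetical choices for illustration.

```python
import sqlite3


def levenshtein(a, b):
    """Classic dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        cur = [i]
        for j, ch_b in enumerate(b, start=1):
            cur.append(min(
                prev[j] + 1,                   # deletion
                cur[j - 1] + 1,                # insertion
                prev[j - 1] + (ch_a != ch_b),  # substitution (free on a match)
            ))
        prev = cur
    return prev[-1]


db = sqlite3.connect(":memory:")
# Expose the Python function to SQL under a hypothetical name.
db.create_function("levenshtein", 2, levenshtein)
db.execute("CREATE TABLE mytable (id INTEGER, description TEXT)")
db.executemany(
    "INSERT INTO mytable VALUES (?, ?)",
    [(1, "hello world, guys"), (2, "hello there everybody")],
)


def find_similar(connection, text, max_distance):
    """Return rows whose description is within max_distance edits of text."""
    rows = connection.execute(
        "SELECT id, description FROM mytable "
        "WHERE levenshtein(description, ?) <= ? ORDER BY id",
        (text, max_distance),
    )
    return rows.fetchall()


matches = find_similar(db, "hello world guys", 3)
```

Because SQLite calls back into Python for every row, this scans the whole table and is likely to be much slower than a native extension on large tables. It also assumes the column contains no NULL values.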

Algorithm to find articles with similar text

Edit distance isn’t a likely candidate, as it would be spelling/word-order dependent, and much more computationally expensive than Will is leading you to believe, considering the size and number of the documents you’d actually be interested in searching. Something like Lucene is the way to go. You index all your documents, and then when you …