Finding similar strings with PostgreSQL quickly

The way you have it, similarity between every element and every other element of the table has to be calculated (almost a cross join). If your table has 1000 rows, that’s already 1,000,000 (!) similarity calculations, before those can be checked against the condition and sorted. Scales terribly. Use SET pg_trgm.similarity_threshold and the % operator … Read more

Comparing strings with tolerance

You could use the Levenshtein Distance algorithm. “The Levenshtein distance between two strings is defined as the minimum number of edits needed to transform one string into the other, with the allowable edit operations being insertion, deletion, or substitution of a single character.” – Wikipedia.com This one is from dotnetperls.com: using System; /// <summary> /// … Read more

how to compute similarity between two strings in MYSQL

you can use this function (cop^H^H^Hadapted from http://www.artfulsoftware.com/infotree/queries.php#552): CREATE FUNCTION `levenshtein`( s1 text, s2 text) RETURNS int(11) DETERMINISTIC BEGIN DECLARE s1_len, s2_len, i, j, c, c_temp, cost INT; DECLARE s1_char CHAR; DECLARE cv0, cv1 text; SET s1_len = CHAR_LENGTH(s1), s2_len = CHAR_LENGTH(s2), cv1 = 0x00, j = 1, i = 1, c = 0; IF … Read more

What’s the fastest way in Python to calculate cosine similarity given sparse matrix data?

You can compute pairwise cosine similarity on the rows of a sparse matrix directly using sklearn. As of version 0.17 it also supports sparse output: from sklearn.metrics.pairwise import cosine_similarity from scipy import sparse A = np.array([[0, 1, 0, 0, 1], [0, 0, 1, 1, 1],[1, 1, 0, 1, 0]]) A_sparse = sparse.csr_matrix(A) similarities = cosine_similarity(A_sparse) … Read more

A better similarity ranking algorithm for variable length strings

Simon White of Catalysoft wrote an article about a very clever algorithm that compares adjacent character pairs that works really well for my purposes: http://www.catalysoft.com/articles/StrikeAMatch.html Simon has a Java version of the algorithm and below I wrote a PL/Ruby version of it (taken from the plain ruby version done in the related forum entry comment … Read more

Calculate cosine similarity given 2 sentence strings

A simple pure-Python implementation would be: import math import re from collections import Counter WORD = re.compile(r”\w+”) def get_cosine(vec1, vec2): intersection = set(vec1.keys()) & set(vec2.keys()) numerator = sum([vec1[x] * vec2[x] for x in intersection]) sum1 = sum([vec1[x] ** 2 for x in list(vec1.keys())]) sum2 = sum([vec2[x] ** 2 for x in list(vec2.keys())]) denominator = math.sqrt(sum1) … Read more