similarity - w3toppers.com

Finding similar strings with PostgreSQL quickly

The way you have it, similarity between every element and every other element of the table has to be calculated (almost a cross join). If your table has 1000 rows, that’s already 1,000,000 (!) similarity calculations, before those can be checked against the condition and sorted. This scales terribly. Use SET pg_trgm.similarity_threshold and the % operator … Read more
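To make the quantity being compared concrete, here is a rough Python sketch of a trigram similarity score in the spirit of pg_trgm. It assumes each word is lowercased and padded with two leading spaces and one trailing space, and that the score is the number of shared trigrams divided by the number of distinct trigrams overall. The function names are hypothetical, and this does not reproduce the index support that makes the % operator fast in PostgreSQL.

```python
import re


def trigrams(text):
    """Return the set of 3-character substrings of each padded, lowercased word."""
    result = set()
    for word in re.findall(r"\w+", text.lower()):
        # Assumed padding: two spaces before the word, one space after it.
        padded = "  " + word + " "
        for i in range(len(padded) - 2):
            result.add(padded[i:i + 3])
    return result


def trigram_similarity(a, b):
    """Shared trigrams divided by all distinct trigrams (0.0 to 1.0)."""
    ta, tb = trigrams(a), trigrams(b)
    if not ta and not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)
```

Comparing every row with every other row means calling a function like this n * n times, which is the cost the excerpt warns about.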

Comparing strings with tolerance

You could use the Levenshtein Distance algorithm. “The Levenshtein distance between two strings is defined as the minimum number of edits needed to transform one string into the other, with the allowable edit operations being insertion, deletion, or substitution of a single character.” – Wikipedia.com This one is from dotnetperls.com: using System; /// <summary> /// … Read more
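The definition quoted above can be sketched in Python as a standard dynamic-programming routine. This is an illustrative version, not the dotnetperls.com code the excerpt refers to; the function name is an assumption.

```python
def levenshtein(a, b):
    """Minimum number of single-character insertions, deletions,
    or substitutions needed to turn string a into string b."""
    # Keep the shorter string as b so each row is as short as possible.
    if len(a) < len(b):
        a, b = b, a
    # previous[j] is the distance between the empty prefix of a and b[:j].
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]  # distance between a[:i] and the empty string
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution (or match)
            ))
        previous = current
    return previous[-1]
```

A tolerance check then becomes a comparison such as `levenshtein(s1, s2) <= 2`, where the threshold is chosen for the application.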

how to compute similarity between two strings in MYSQL

You can use this function (cop^H^H^Hadapted from http://www.artfulsoftware.com/infotree/queries.php#552):

    CREATE FUNCTION `levenshtein`( s1 text, s2 text) RETURNS int(11) DETERMINISTIC
    BEGIN
      DECLARE s1_len, s2_len, i, j, c, c_temp, cost INT;
      DECLARE s1_char CHAR;
      DECLARE cv0, cv1 text;
      SET s1_len = CHAR_LENGTH(s1), s2_len = CHAR_LENGTH(s2), cv1 = 0x00, j = 1, i = 1, c = 0;
      IF … Read more

What’s the fastest way in Python to calculate cosine similarity given sparse matrix data?

You can compute pairwise cosine similarity on the rows of a sparse matrix directly using sklearn. As of version 0.17 it also supports sparse output:

    import numpy as np
    from sklearn.metrics.pairwise import cosine_similarity
    from scipy import sparse

    A = np.array([[0, 1, 0, 0, 1], [0, 0, 1, 1, 1], [1, 1, 0, 1, 0]])
    A_sparse = sparse.csr_matrix(A)
    similarities = cosine_similarity(A_sparse)

… Read more

A better similarity ranking algorithm for variable length strings

Simon White of Catalysoft wrote an article about a very clever algorithm that compares adjacent character pairs, and it works really well for my purposes: http://www.catalysoft.com/articles/StrikeAMatch.html Simon has a Java version of the algorithm and below I wrote a PL/Ruby version of it (taken from the plain Ruby version done in the related forum entry comment … Read more
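The adjacent-character-pair idea can be sketched in Python as follows. This is not the PL/Ruby or Java code mentioned above. It assumes the score is twice the number of shared pairs divided by the total number of pairs in both strings, with pairs taken per word and compared case-insensitively. The function names are hypothetical.

```python
def letter_pairs(text):
    """List the adjacent character pairs of every whitespace-separated word."""
    pairs = []
    for word in text.upper().split():
        pairs.extend(word[i:i + 2] for i in range(len(word) - 1))
    return pairs


def pair_similarity(a, b):
    """Score from 0.0 (no shared pairs) to 1.0 (identical pair lists)."""
    pairs_a = letter_pairs(a)
    pairs_b = letter_pairs(b)
    total = len(pairs_a) + len(pairs_b)
    if total == 0:
        return 0.0
    matches = 0
    for pair in pairs_a:
        if pair in pairs_b:
            matches += 1
            # Remove the matched pair so a repeated pair is not counted twice.
            pairs_b.remove(pair)
    return 2.0 * matches / total
```

Because the score depends on shared pairs rather than on absolute positions, strings of different lengths and with reordered words can still rank as similar.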

Checking images for similarity with OpenCV

This is a huge topic, with answers from 3 lines of code to entire research magazines. I will outline the most common such techniques and their results. Comparing histograms: one of the simplest and fastest methods, proposed decades ago as a means to find picture similarities. The idea is that a forest will have a … Read more

How to calculate distance similarity measure of given 2 strings?

I just addressed this exact same issue a few weeks ago. Since someone is asking now, I’ll share the code. In my exhaustive tests my code is about 10x faster than the C# example on Wikipedia even when no maximum distance is supplied. When a maximum distance is supplied, this performance gain increases to 30x … Read more
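One common reason a maximum distance speeds things up is that the computation can stop as soon as the limit is provably exceeded. The Python sketch below illustrates that general idea only. It is not the author's code, which is not shown in this excerpt, and the function name and the -1 return convention are assumptions.

```python
def bounded_levenshtein(a, b, max_distance):
    """Return the edit distance between a and b,
    or -1 if it is greater than max_distance."""
    # The length difference is a lower bound on the distance.
    if abs(len(a) - len(b)) > max_distance:
        return -1
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution (or match)
            ))
        # Row minima never decrease, so once every cell exceeds the
        # limit the final distance must exceed it as well.
        if min(current) > max_distance:
            return -1
        previous = current
    return previous[-1] if previous[-1] <= max_distance else -1
```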

How to find similar results and sort by similarity?

I have found out that the Levenshtein distance may be good when you are searching a full string against another full string, but when you are looking for keywords within a string, this method sometimes does not return the wanted results. Moreover, the SOUNDEX function is not suitable for languages other than English, so it … Read more

Calculate cosine similarity given 2 sentence strings

A simple pure-Python implementation would be:

    import math
    import re
    from collections import Counter

    WORD = re.compile(r"\w+")

    def get_cosine(vec1, vec2):
        intersection = set(vec1.keys()) & set(vec2.keys())
        numerator = sum([vec1[x] * vec2[x] for x in intersection])
        sum1 = sum([vec1[x] ** 2 for x in list(vec1.keys())])
        sum2 = sum([vec2[x] ** 2 for x in list(vec2.keys())])
        denominator = math.sqrt(sum1) … Read more
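Since the excerpt is cut off, here is a separate, self-contained Python sketch of the same idea: turn each sentence into word counts and take the cosine of the angle between the two count vectors. The helper names and the lowercasing step are assumptions of this sketch, not part of the truncated code above.

```python
import math
import re
from collections import Counter

WORD = re.compile(r"\w+")


def word_counts(text):
    """Map each lowercased word in text to its number of occurrences."""
    return Counter(WORD.findall(text.lower()))


def sentence_cosine(text_a, text_b):
    """Cosine similarity of the two word-count vectors (0.0 to 1.0)."""
    va, vb = word_counts(text_a), word_counts(text_b)
    # Dot product over the words that both sentences share.
    dot = sum(va[w] * vb[w] for w in va.keys() & vb.keys())
    norm_a = math.sqrt(sum(c * c for c in va.values()))
    norm_b = math.sqrt(sum(c * c for c in vb.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # at least one sentence has no words
    return dot / (norm_a * norm_b)
```

This treats a sentence as a bag of words, so word order has no effect on the score.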

Find the similarity metric between two strings

There is a built-in:

    from difflib import SequenceMatcher

    def similar(a, b):
        return SequenceMatcher(None, a, b).ratio()

Using it:

    >>> similar("Apple", "Appel")
    0.8
    >>> similar("Apple", "Mango")
    0.0