levenshtein-distance
String similarity metrics in Python [duplicate]
I realize it's not the same thing, but this is close enough:

>>> import difflib
>>> a = "Hello, All you people"
>>> b = 'hello, all You peopl'
>>> seq = difflib.SequenceMatcher(a=a.lower(), b=b.lower())
>>> seq.ratio()
0.97560975609756095

You can wrap this in a function:

def similar(seq1, seq2):
    return difflib.SequenceMatcher(a=seq1.lower(), b=seq2.lower()).ratio() > 0.9

>>> similar(a, b)
True
>>> similar('Hello, world', … Read more
Most efficient way to calculate Levenshtein distance
The Wikipedia entry on Levenshtein distance has useful suggestions for optimizing the computation. The most applicable one in your case is that if you can put a bound k on the maximum distance of interest (anything beyond that might as well be infinity!), you can reduce the computation to O(n*k) instead of … Read more
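The bounded computation described above can be sketched as follows (a hypothetical `bounded_levenshtein` helper, not code from the answer): only a band of width 2k+1 around the matrix diagonal is filled, and the walk gives up early once every cell in the band exceeds the bound.

```python
def bounded_levenshtein(s, t, k):
    """Levenshtein distance clamped to k + 1: only a band of width 2k + 1
    around the diagonal is filled, so the work is O(len(s) * k) rather
    than O(len(s) * len(t))."""
    n, m = len(s), len(t)
    if abs(n - m) > k:            # the length gap alone already needs > k edits
        return k + 1
    if n == 0 or m == 0:          # one string empty: distance is the other's length
        return max(n, m)
    INF = k + 1
    prev = list(range(m + 1))     # row 0 of the standard DP matrix
    for i in range(1, n + 1):
        cur = [INF] * (m + 1)
        cur[0] = i
        lo, hi = max(1, i - k), min(m, i + k)
        for j in range(lo, hi + 1):
            cost = 0 if s[i - 1] == t[j - 1] else 1
            cur[j] = min(prev[j] + 1,         # deletion
                         cur[j - 1] + 1,      # insertion
                         prev[j - 1] + cost)  # substitution or match
        if min(cur[lo:hi + 1]) > k:           # whole band over budget: give up early
            return INF
        prev = cur
    return min(prev[m], INF)
```

Anything at distance greater than k comes back as k + 1, which is exactly the "might as well be infinity" behaviour the answer describes.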
Fast Levenshtein distance in R?
The stringdist function in the stringdist package does it too, and under certain conditions it is even faster than levenshteinDist.
Text clustering with Levenshtein distances
This may be a bit simplistic, but here's a code example that uses hierarchical clustering based on Levenshtein distance in R.

set.seed(1)
# vector of n random char(k) strings
rstr <- function(n, k) {
  sapply(1:n, function(i) {do.call(paste0, as.list(sample(letters, k, replace = TRUE)))})
}
str <- c(paste0("aa", rstr(10, 3)), paste0("bb", rstr(10, 3)), paste0("cc", rstr(10, 3)))

# Levenshtein distance
d <- adist(str)
rownames(d) <- str
hc <- hclust(as.dist(d))
plot(hc)
rect.hclust(hc, k = 3)
df <- data.frame(str, cutree(hc, k = 3))

In this … Read more
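A rough Python analogue of the data-generation and distance-matrix steps above, using only the standard library (a hand-rolled `levenshtein` stands in for R's `adist`; the clustering itself would then go to a library such as SciPy):

```python
import random
import string

def levenshtein(s, t):
    """Classic two-row dynamic-programming edit distance."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, 1):
        cur = [i]
        for j, ct in enumerate(t, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1,
                           prev[j - 1] + (cs != ct)))
        prev = cur
    return prev[-1]

random.seed(1)

def rstr(n, k):
    """n random strings of k lowercase letters (mirrors the R rstr above)."""
    return ["".join(random.choices(string.ascii_lowercase, k=k)) for _ in range(n)]

strs = (["aa" + s for s in rstr(10, 3)] +
        ["bb" + s for s in rstr(10, 3)] +
        ["cc" + s for s in rstr(10, 3)])

# pairwise distance matrix, the equivalent of R's adist(str)
d = [[levenshtein(a, b) for b in strs] for a in strs]
```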
Levenshtein Distance Algorithm better than O(n*m)?
Are you interested in reducing the time complexity or the space complexity? The average time complexity can be reduced to O(n + d^2), where n is the length of the longer string and d is the edit distance. If you are only interested in the edit distance and not in reconstructing the edit sequence, … Read more
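One common way to exploit a small d in practice is a sketch like the following (my reconstruction, not the answer's algorithm, which is presumably the diagonal method): run a banded computation with an exponentially doubling threshold until the true distance fits inside the band. This gives O(n*d) time, better than O(n*m) when d is small, though not the full O(n + d^2) of the diagonal algorithm.

```python
def banded(s, t, k):
    """Edit distance clamped to k + 1; only a band of width 2k + 1 is filled."""
    n, m = len(s), len(t)
    if abs(n - m) > k:
        return k + 1
    prev = list(range(m + 1))
    for i in range(1, n + 1):
        cur = [k + 1] * (m + 1)
        cur[0] = i
        for j in range(max(1, i - k), min(m, i + k) + 1):
            cur[j] = min(prev[j] + 1, cur[j - 1] + 1,
                         prev[j - 1] + (s[i - 1] != t[j - 1]))
        prev = cur
    return min(prev[m], k + 1)

def edit_distance(s, t):
    """Double the band until the true distance fits inside it."""
    k = 1
    while True:
        d = banded(s, t, k)
        if d <= k:          # the band was wide enough, so d is exact
            return d
        k *= 2
```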
Finding closest neighbour using optimized Levenshtein Algorithm
I've done it a number of times. The way I do it is with a recursive depth-first tree-walk of the game tree of possible changes. There is a budget k of changes that I use to prune the tree. With that routine in hand, first I run it with k=0, then k=1, then k=2, until … Read more
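A minimal sketch of that idea (my reconstruction, not the author's code): a recursive walk that spends a budget k of edits and prunes any branch that exhausts it, wrapped in a loop that tries k = 0, then 1, then 2, and so on.

```python
def within_budget(s, t, k):
    """True if s can be turned into t with at most k single-character edits.

    Depth-first walk of the tree of possible edits, pruned by the budget.
    """
    if abs(len(s) - len(t)) > k:
        return False                 # prune: not enough budget left
    if not s or not t:
        return max(len(s), len(t)) <= k
    if s[0] == t[0]:
        return within_budget(s[1:], t[1:], k)       # free match
    if k == 0:
        return False
    return (within_budget(s[1:], t[1:], k - 1)      # substitution
            or within_budget(s[1:], t, k - 1)       # deletion from s
            or within_budget(s, t[1:], k - 1))      # insertion into s

def iterative_deepening_distance(s, t):
    """Try k = 0, 1, 2, ... until the pruned walk succeeds."""
    k = 0
    while not within_budget(s, t, k):
        k += 1
    return k
```

Because most candidate neighbours fail at a small k, the cheap early runs reject them quickly, which is what makes the deepening loop pay off for nearest-neighbour search.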
High performance fuzzy string comparison in Python, use Levenshtein or difflib [closed]
In case you're interested in a quick visual comparison of Levenshtein and difflib similarity, I calculated both for ~2.3 million book titles:

import codecs, difflib, Levenshtein, distance

with codecs.open("titles.tsv", "r", "utf-8") as f:
    title_list = f.read().split("\n")[:-1]

for row in title_list:
    sr = row.lower().split("\t")
    diffl = difflib.SequenceMatcher(None, sr[3], sr[4]).ratio()
    lev = Levenshtein.ratio(sr[3], sr[4])
    sor = 1 - distance.sorensen(sr[3], … Read more
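For a self-contained taste of the same comparison without the Levenshtein and distance packages, here is a sketch that pits difflib's ratio against a hand-rolled normalised edit distance (a simple 1 - dist/max(len) normalisation, which is close to, but not identical to, what Levenshtein.ratio computes):

```python
import difflib

def lev(s, t):
    """Plain dynamic-programming Levenshtein distance."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, 1):
        cur = [i]
        for j, ct in enumerate(t, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1,
                           prev[j - 1] + (cs != ct)))
        prev = cur
    return prev[-1]

def lev_ratio(s, t):
    """Normalised similarity in [0, 1]; 1.0 means identical strings."""
    if not s and not t:
        return 1.0
    return 1 - lev(s, t) / max(len(s), len(t))

# hypothetical sample pairs, standing in for the TSV of book titles
pairs = [("the great gatsby", "the great gatsby: a novel"),
         ("moby dick", "moby-dick; or, the whale")]
for a, b in pairs:
    diffl = difflib.SequenceMatcher(None, a, b).ratio()
    print(f"difflib={diffl:.3f}  lev={lev_ratio(a, b):.3f}  {a!r} vs {b!r}")
```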
Sort an array by the “Levenshtein Distance” with best performance in Javascript
I wrote an inline spell checker a few years ago and implemented a Levenshtein algorithm – since it was inline and for IE8 I did quite a lot of performance optimisation.

var levDist = function(s, t) {
    var d = []; // 2d matrix

    // Step 1
    var n = s.length;
    var m = t.length;

    if … Read more
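The same sorting idea expressed in Python (a hypothetical sketch, not the answer's optimised JavaScript): compute each candidate's distance to the query and use it as the sort key.

```python
def lev_dist(s, t):
    """Standard two-row Levenshtein DP."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, 1):
        cur = [i]
        for j, ct in enumerate(t, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1,
                           prev[j - 1] + (cs != ct)))
        prev = cur
    return prev[-1]

def sort_by_distance(query, words):
    """Closest matches first."""
    return sorted(words, key=lambda w: lev_dist(query, w))
```

For large arrays, caching the distances (or precomputing them once before sorting) avoids recomputing the DP for the same word on every comparison, which is the kind of optimisation the answer is describing.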
What algorithm gives suggestions in a spell checker?
There is a good essay by Peter Norvig on how to implement a spelling corrector. It's basically a brute-force approach that tries candidate strings within a given edit distance. (Here are some tips on how you can improve the spelling corrector's performance using a Bloom filter and faster candidate hashing.) The requirements for a spell checker are weaker. … Read more
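The core of that brute-force approach can be sketched in a few lines, after Norvig's essay: `edits1` generates every string one edit away, and the corrector intersects that set with the known vocabulary. (A real corrector, as in the essay, ranks candidates by corpus frequency; the tie-break here is a simplification.)

```python
import string

def edits1(word):
    """All strings at edit distance 1: deletes, transposes, replaces, inserts."""
    letters = string.ascii_lowercase
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [L + R[1:] for L, R in splits if R]
    transposes = [L + R[1] + R[0] + R[2:] for L, R in splits if len(R) > 1]
    replaces = [L + c + R[1:] for L, R in splits if R for c in letters]
    inserts = [L + c + R for L, R in splits for c in letters]
    return set(deletes + transposes + replaces + inserts)

def correct(word, vocabulary):
    """Pick a known word at distance 0, else 1, else return the word unchanged."""
    if word in vocabulary:
        return word
    candidates = edits1(word) & vocabulary
    return min(candidates) if candidates else word  # min() = deterministic tie-break
```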