String similarity metrics in Python [duplicate]

I realize it's not the same thing, but this is close enough:

>>> import difflib
>>> a = "Hello, All you people"
>>> b = "hello, all You peopl"
>>> seq = difflib.SequenceMatcher(a=a.lower(), b=b.lower())
>>> seq.ratio()
0.97560975609756095

You can wrap this in a function:

def similar(seq1, seq2):
    return difflib.SequenceMatcher(a=seq1.lower(), b=seq2.lower()).ratio() > 0.9

>>> similar(a, b)
True
>>> similar('Hello, world', … Read more
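The excerpt's approach can be sketched as a small self-contained function; the 0.9 cutoff is the value the answer uses, and the threshold parameter is an addition here for tuning:

```python
import difflib

def similar(seq1: str, seq2: str, threshold: float = 0.9) -> bool:
    """Case-insensitive similarity test using difflib's ratio().

    ratio() returns 2*M/T, where M is the number of matching characters
    and T is the total length of both strings, so 1.0 means identical.
    """
    ratio = difflib.SequenceMatcher(a=seq1.lower(), b=seq2.lower()).ratio()
    return ratio > threshold

print(similar("Hello, All you people", "hello, all You peopl"))  # True
print(similar("Hello, world", "completely different text"))      # False
```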

Text clustering with Levenshtein distances

This may be a bit simplistic, but here's a code example that uses hierarchical clustering based on Levenshtein distance in R.

set.seed(1)
# vector of n random char(k) strings
rstr <- function(n, k) {
  sapply(1:n, function(i) {
    do.call(paste0, as.list(sample(letters, k, replace = TRUE)))
  })
}
str <- c(paste0("aa", rstr(10, 3)), paste0("bb", rstr(10, 3)), paste0("cc", rstr(10, 3)))
# Levenshtein distance
d <- adist(str)
rownames(d) <- str
hc <- hclust(as.dist(d))
plot(hc)
rect.hclust(hc, k = 3)
df <- data.frame(str, cutree(hc, k = 3))

In this … Read more
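For readers working in Python rather than R, the adist() step can be reproduced with a hand-rolled edit-distance function; this is a stdlib-only sketch (the example strings are made up), and the resulting matrix can then be fed to a clustering routine such as scipy.cluster.hierarchy in place of R's hclust():

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (what R's adist computes)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

# Pairwise distance matrix, analogous to d <- adist(str) in the R example.
strings = ["aaxyz", "aaxyw", "bbpqr", "bbpqs", "ccmno"]
d = [[levenshtein(s, t) for t in strings] for s in strings]
```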

High performance fuzzy string comparison in Python, use Levenshtein or difflib [closed]

In case you're interested in a quick visual comparison of Levenshtein and Difflib similarity, I calculated both for ~2.3 million book titles:

import codecs, difflib, Levenshtein, distance

with codecs.open("titles.tsv", "r", "utf-8") as f:
    title_list = f.read().split("\n")[:-1]

for row in title_list:
    sr = row.lower().split("\t")
    diffl = difflib.SequenceMatcher(None, sr[3], sr[4]).ratio()
    lev = Levenshtein.ratio(sr[3], sr[4])
    sor = 1 - distance.sorensen(sr[3], … Read more
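The answer above benchmarks the third-party Levenshtein and distance packages against difflib. If you only have the stdlib, difflib itself offers cheap documented upper bounds on ratio() — real_quick_ratio() and quick_ratio() — which can filter out obvious non-matches before paying for the full quadratic comparison; a sketch (the book titles here are illustrative):

```python
import difflib

def fast_similar(a: str, b: str, threshold: float = 0.9) -> bool:
    """Short-circuit comparison: each *_ratio() is an upper bound on the
    next, so cheap checks reject most non-matches before ratio() runs."""
    sm = difflib.SequenceMatcher(None, a.lower(), b.lower())
    return (sm.real_quick_ratio() >= threshold
            and sm.quick_ratio() >= threshold
            and sm.ratio() >= threshold)

print(fast_similar("The Great Gatsby", "the great gatsby"))  # True
print(fast_similar("The Great Gatsby", "Moby-Dick"))         # False
```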

What algorithm gives suggestions in a spell checker?

There is a good essay by Peter Norvig on how to implement a spelling corrector. It's basically a brute-force approach that tries candidate strings within a given edit distance. (Here are some tips on how you can improve the spelling corrector's performance using a Bloom filter and faster candidate hashing.) The requirements for a spell checker are weaker. … Read more
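The candidate-generation step from Norvig's essay can be sketched as follows; edits1() enumerates every string one edit away, and the word_counts frequency table in the usage example is a hypothetical stand-in for a real corpus:

```python
import string
from collections import Counter

def edits1(word: str) -> set:
    """All strings one edit (delete, transpose, replace, insert) away
    from `word` -- the brute-force candidate set from Norvig's essay."""
    letters = string.ascii_lowercase
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [L + R[1:] for L, R in splits if R]
    transposes = [L + R[1] + R[0] + R[2:] for L, R in splits if len(R) > 1]
    replaces = [L + c + R[1:] for L, R in splits if R for c in letters]
    inserts = [L + c + R for L, R in splits for c in letters]
    return set(deletes + transposes + replaces + inserts)

def correct(word: str, word_counts: Counter) -> str:
    """Pick the most frequent known candidate, preferring the word itself."""
    candidates = ({word} & word_counts.keys()) or \
                 (edits1(word) & word_counts.keys()) or {word}
    return max(candidates, key=lambda w: word_counts[w])

# Hypothetical corpus counts, for illustration only:
word_counts = Counter({"spelling": 5, "the": 10, "of": 8})
print(correct("speling", word_counts))  # spelling
```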