agrep: only return best match(es)

The agrep function uses a generalized Levenshtein distance to match strings. The RecordLinkage package has a C function that calculates the Levenshtein distance, which you can call directly to speed up your computation. Here is a reworked ClosestMatch function that is around 10x faster:
    library(RecordLinkage)
    ClosestMatch2 = function(string, stringVector){
      # levenshteinSim returns a similarity, so the best match has the highest value
      distance = levenshteinSim(string, stringVector);
      # keep every element tied for the best similarity
      stringVector[distance == max(distance)]
    }
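For readers outside R, the same "keep only the best match(es)" idea can be sketched in plain Python. This is a rough illustration, not the RecordLinkage implementation: the names levenshtein and closest_matches are made up here, and the sketch minimizes the raw edit distance instead of maximizing a normalized similarity, so results may differ when candidates vary a lot in length.

```python
def levenshtein(a, b):
    """Edit distance with unit-cost insert, delete and substitute."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def closest_matches(query, candidates):
    """Return every candidate tied for the smallest distance to query."""
    if not candidates:
        return []
    distances = [levenshtein(query, c) for c in candidates]
    best = min(distances)
    return [c for c, d in zip(candidates, distances) if d == best]
```

Calling closest_matches("aple", ["apple", "maple", "grape"]) returns both "apple" and "maple", since each is one edit away.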

Regex for existence of some words whose order doesn’t matter

See this regex:
    /^(?=.*Tim)(?=.*stupid).+/
Regex explanation:
    ^ Asserts position at start of string.
    (?=.*Tim) Asserts that “Tim” is present in the string.
    (?=.*stupid) Asserts that “stupid” is present in the string.
    .+ Now that our phrases are present, this string is valid. Go ahead and use .+ or .++ to match the entire string.
To …
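The lookahead trick above carries over to Python's re module. Here is a small sketch that builds one lookahead per word; the helper name contains_all_words is just for illustration. Note that it does substring matching (so "Timothy" would satisfy "Tim"), and that "." does not cross newlines unless you pass re.DOTALL.

```python
import re


def contains_all_words(text, words):
    """True if every word occurs somewhere in text, in any order."""
    # One lookahead per word; each is anchored at the start of the string
    # and asserts that the word appears somewhere after it.
    lookaheads = "".join(f"(?=.*{re.escape(w)})" for w in words)
    pattern = "^" + lookaheads + ".+"
    return re.search(pattern, text) is not None
```

With this, both "Tim is stupid" and "stupid Tim" pass for the words Tim and stupid, while "Tim is smart" does not.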

Javascript fuzzy search that makes sense

I tried using existing fuzzy libraries like fuse.js and also found them to be terrible, so I wrote one which behaves basically like Sublime's search. https://github.com/farzher/fuzzysort The only typo it allows is a transpose. It's pretty solid (1k stars, 0 issues), very fast, and handles your case easily:
    fuzzysort.go('int', ['international', 'splint', 'tinder']) // [{highlighted: '*int*ernational', …
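To give a feel for Sublime-style matching, here is a rough Python sketch of subsequence matching with a simple gap penalty. It is not fuzzysort's algorithm or scoring (it does not handle transposes, for one); the function names and the penalty weights are assumptions chosen for illustration, and the greedy leftmost match is not guaranteed to find the tightest possible alignment.

```python
def subsequence_score(query, target):
    """Return None if query is not a subsequence of target, else a score
    where lower is better."""
    q, t = query.lower(), target.lower()
    positions = []
    start = 0
    for ch in q:
        idx = t.find(ch, start)  # greedy: take the leftmost usable position
        if idx == -1:
            return None
        positions.append(idx)
        start = idx + 1
    if not positions:
        return 0
    # Penalise gaps between matched characters, then a late first match.
    span = positions[-1] - positions[0] + 1
    gaps = span - len(positions)
    return gaps * 10 + positions[0]


def fuzzy_filter(query, targets):
    """Keep targets containing query as a subsequence, best score first."""
    scored = [(subsequence_score(query, t), t) for t in targets]
    hits = [(s, t) for s, t in scored if s is not None]
    return [t for s, t in sorted(hits, key=lambda pair: pair[0])]
```

In this sketch, fuzzy_filter("int", ["international", "splint", "tinder"]) keeps "international" and "splint" and drops "tinder", because no "t" follows the "in" in "tinder".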

High performance fuzzy string comparison in Python, use Levenshtein or difflib [closed]

In case you're interested in a quick visual comparison of Levenshtein and Difflib similarity, I calculated both for ~2.3 million book titles:
    import codecs, difflib, Levenshtein, distance
    with codecs.open("titles.tsv","r","utf-8") as f:
        title_list = f.read().split("\n")[:-1]
    for row in title_list:
        sr = row.lower().split("\t")
        diffl = difflib.SequenceMatcher(None, sr[3], sr[4]).ratio()
        lev = Levenshtein.ratio(sr[3], sr[4])
        sor = 1 - distance.sorensen(sr[3], …
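If you just want to try the comparison on a few strings without the 2.3 million title file, a minimal sketch could look like this. The helper compare_titles and the sample pairs are made up for illustration. The third-party Levenshtein module is treated as optional, so the difflib column still works when it is not installed.

```python
import difflib

try:
    import Levenshtein  # third-party package; optional in this sketch
except ImportError:
    Levenshtein = None


def compare_titles(pairs):
    """Return (a, b, difflib_ratio, levenshtein_ratio_or_None) per pair."""
    rows = []
    for a, b in pairs:
        a, b = a.lower(), b.lower()
        diffl = difflib.SequenceMatcher(None, a, b).ratio()
        lev = Levenshtein.ratio(a, b) if Levenshtein is not None else None
        rows.append((a, b, diffl, lev))
    return rows


if __name__ == "__main__":
    # Hypothetical sample data, not taken from the original title list.
    sample_pairs = [("Moby Dick", "Moby-Dick"), ("Dracula", "Emma")]
    for row in compare_titles(sample_pairs):
        print(row)
```

Both ratios fall between 0 and 1, so they can be plotted against each other directly.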

javascript regular expression to check for IP addresses

Maybe late, but someone could try:
Examples of VALID IP addresses:
    115.42.150.37
    192.168.0.1
    110.234.52.124
Examples of INVALID IP addresses:
    210.110 - must have 4 octets
    255 - must have 4 octets
    y.y.y.y - only digits are allowed
    255.0.0.y - only digits are allowed
    666.10.10.20 - octet number must be between [0-255]
    4444.11.11.11 - octet …
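The rules listed above (four octets, digits only, each octet in 0-255) can also be checked without a regex. Here is a small Python sketch of those rules. The function name is_valid_ipv4 is an assumption for illustration, and the sketch accepts leading zeros such as "01", which you may or may not want.

```python
def is_valid_ipv4(address):
    """Check the dotted-quad rules: 4 octets, digits only, each in 0-255."""
    parts = address.split(".")
    if len(parts) != 4:
        return False
    for part in parts:
        # Rejects empty parts, letters, signs, spaces and non-ASCII digits.
        if not (part.isascii() and part.isdigit()):
            return False
        if int(part) > 255:
            return False
    return True
```

Python's standard ipaddress module is another option if you prefer a library check over hand-written rules.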

A better similarity ranking algorithm for variable length strings

Simon White of Catalysoft wrote an article about a very clever algorithm that compares adjacent character pairs, and it works really well for my purposes: http://www.catalysoft.com/articles/StrikeAMatch.html Simon has a Java version of the algorithm, and below I wrote a PL/Ruby version of it (taken from the plain ruby version done in the related forum entry comment …
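As a rough idea of how the adjacent-pair comparison works, here is a Python sketch based on my reading of the approach: split each string into words, collect the two-character pairs of each word, and score with twice the number of shared pairs divided by the total number of pairs. It is not the PL/Ruby or Java version mentioned above, and the function names are made up for illustration.

```python
def letter_pairs(text):
    """Adjacent character pairs of each whitespace-separated word, upper-cased."""
    pairs = []
    for word in text.upper().split():
        pairs.extend(word[i:i + 2] for i in range(len(word) - 1))
    return pairs


def pair_similarity(a, b):
    """Similarity in [0, 1]: 2 * shared pairs / total pairs in both strings."""
    pairs_a = letter_pairs(a)
    pairs_b = letter_pairs(b)
    total = len(pairs_a) + len(pairs_b)
    if total == 0:
        return 0.0
    matches = 0
    for pair in pairs_a:
        if pair in pairs_b:
            matches += 1
            pairs_b.remove(pair)  # each pair in b can be matched only once
    return 2.0 * matches / total
```

For example, "France" and "French" share the pairs FR and NC out of five pairs each, so pair_similarity gives 2 * 2 / 10 = 0.4.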