how to parallelize many (fuzzy) string comparisons using apply in Pandas?

You can parallelize this with Dask.dataframe.

```python
>>> dmaster = dd.from_pandas(master, npartitions=4)
>>> dmaster['my_value'] = dmaster.original.apply(lambda x: helper(x, slave), name='my_value')
>>> dmaster.compute()
                  original  my_value
0  this is a nice sentence         2
1      this is another one         3
2    stackoverflow is nice         1
```

Additionally, you should think about the tradeoffs between using threads vs. processes here. Your … Read more
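The threads-vs-processes tradeoff above can be sketched without Dask at all. Below is a minimal, self-contained illustration using `concurrent.futures`; `best_score` is a hypothetical stand-in for the answer's `helper(x, slave)`, and `difflib.SequenceMatcher` stands in for fuzzywuzzy so the example runs with only the standard library plus pandas:

```python
from concurrent.futures import ThreadPoolExecutor
from difflib import SequenceMatcher

import pandas as pd

# Toy data mirroring the answer's master/slave frames.
master = pd.DataFrame({"original": [
    "this is a nice sentence",
    "this is another one",
    "stackoverflow is nice",
]})
slave = pd.DataFrame({"name": ["nice sentence", "another one", "stackoverflow"]})

def best_score(text: str) -> float:
    # Highest similarity (0..1) between `text` and any slave name.
    return max(SequenceMatcher(None, text, s).ratio() for s in slave["name"])

# Threads sidestep pickling; for CPU-bound scorers a ProcessPoolExecutor
# (or Dask with processes) is usually the better choice.
with ThreadPoolExecutor(max_workers=4) as pool:
    master["my_value"] = list(pool.map(best_score, master["original"]))
```

Because fuzzy scoring is CPU-bound pure-Python work, threads mostly help only if the scorer releases the GIL; that is the tradeoff the answer alludes to.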

Vectorizing or Speeding up Fuzzywuzzy String Matching on PANDAS Column

Given your task, you're comparing 70k strings with each other using fuzz.WRatio, so you have a total of 4,900,000,000 comparisons, each of which uses the Levenshtein distance inside fuzzywuzzy, an O(N*M) operation. fuzz.WRatio is a combination of multiple different string matching ratios with different weights. It then selects the best … Read more
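The cost argument can be made concrete: all-pairs matching is N×N score calls, and each call runs a dynamic-programming edit distance that is itself quadratic in string length. A minimal sketch (plain Levenshtein, not the weighted WRatio combination):

```python
# Classic dynamic-programming Levenshtein distance: O(len(a) * len(b))
# per call, which is the inner cost the answer refers to.
def levenshtein(a: str, b: str) -> int:
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,          # deletion
                           cur[j - 1] + 1,       # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

# Outer cost: comparing 70k strings against each other.
n = 70_000
pairs = n * n  # 4,900,000,000 distance computations
```

This is why the usual advice is to prune pairs first (e.g. by length or token overlap) or to use a C-accelerated library rather than calling the scorer pairwise from Python.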

Apply fuzzy matching across a dataframe column and save results in a new column

I couldn't tell what you were doing. This is how I would do it.

```python
from fuzzywuzzy import fuzz
from fuzzywuzzy import process
```

Create a series of tuples to compare:

```python
compare = pd.MultiIndex.from_product([df1['Company'], df2['FDA Company']]).to_series()
```

Create a special function to calculate fuzzy metrics and return a series:

```python
def metrics(tup):
    return pd.Series([fuzz.ratio(*tup), fuzz.token_sort_ratio(*tup)], ['ratio', 'token'])
```

Apply metrics … Read more
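A runnable sketch of this approach, with `difflib.SequenceMatcher` standing in for `fuzz.ratio` / `fuzz.token_sort_ratio` (so fuzzywuzzy need not be installed) and toy data in place of the asker's frames; the column names `Company` and `FDA Company` come from the answer:

```python
from difflib import SequenceMatcher

import pandas as pd

df1 = pd.DataFrame({"Company": ["Acme Corp", "Globex"]})
df2 = pd.DataFrame({"FDA Company": ["ACME Corporation", "Globex Inc"]})

# Cartesian product of the two columns; .to_series() yields a Series
# whose values are (Company, FDA Company) tuples.
compare = pd.MultiIndex.from_product(
    [df1["Company"], df2["FDA Company"]]
).to_series()

def metrics(tup):
    # Stand-ins for fuzz.ratio and fuzz.token_sort_ratio.
    a, b = tup
    ratio = SequenceMatcher(None, a, b).ratio()
    token = SequenceMatcher(None,
                            " ".join(sorted(a.split())),
                            " ".join(sorted(b.split()))).ratio()
    return pd.Series([ratio, token], index=["ratio", "token"])

# Applying a Series-returning function expands each row into columns.
scores = compare.apply(metrics)
```

Because `metrics` returns a `Series`, `compare.apply(metrics)` produces a DataFrame with one `ratio`/`token` row per company pair, indexed by the MultiIndex.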