Vectorizing or Speeding up Fuzzywuzzy String Matching on PANDAS Column

Given your task your comparing 70k strings with each other using fuzz.WRatio, so your having a total of 4,900,000,000 comparisions, with each of these comparisions using the levenshtein distance inside fuzzywuzzy which is a O(N*M) operation. fuzz.WRatio is a combination of multiple different string matching ratios that have different weights. It then selects the best ratio among them. Therefore it even has to calculate the Levenshtein distance multiple times. So one goal should be to reduce the search space by excluding some possibilities using a way faster matching algorithm. Another issue is that the strings are preprocessed to remove punctuation and to lowercase the strings. While this is required for the matching (so e.g. a uppercased word becomes equal to a lowercased one) we can do this ahead of time. So we only have to preprocess the 70k strings once. I will use RapidFuzz instead of FuzzyWuzzy here, since it is quite a bit faster (I am the author).

The following version performs more than 10 times as fast as your previous solution in my experiments and applies the following improvements:

it preprocesses the strings ahead of time
it passes a score_cutoff to extractOne so it can skip calculations where it already knows they can not reach this ratio

import pandas as pd, numpy as np
from rapidfuzz import process, utils

org_list = df['org_name']
processed_orgs = [utils.default_process(org) for org in org_list]

for (i, processed_query) in enumerate(processed_orgs):
    # None is skipped by extractOne, so we set the current element to None an
    # revert this change after the comparision
    processed_orgs[i] = None
    match = process.extractOne(processed_query, processed_orgs, processor=None, score_cutoff=93)
    processed_orgs[i] = processed_query
    if match:
        df.loc[i, 'fuzzy_match'] = org_list[match[2]]
        df.loc[i, 'fuzzy_match_score'] = match[1]

Here is a list of the most relevant improvements of RapidFuzz to make it faster than FuzzyWuzzy in this example:

It is implemented fully in C++ while a big part of FuzzyWuzzy is implemented in Python
When calculating the levenshtein distance it takes into account the score_cutoff to choose an optimized implementation based. E.g. when the length difference between the strings is to big it can exit in O(1).
FuzzyWuzzy uses Python-Levenshtein to calculate the similarity between two strings, which uses a weightened Levenshtein distance with a weight of 2 for substitutions. This is implemented using Wagner-Fischer. RapidFuzz on the other hand uses a bitparallel implementation for this based on BitPal, which is faster
fuzz.WRatio is combining the results of multiple other string matching algorithms like fuzz.ratio, fuzz.token_sort_ratio and fuzz.token_set_ratio and takes the maximum result after weighting them. So while fuzz.ratio has a weighting of 1 fuzz.token_sort_ratio and fuzz.token_set_ratio have one of 0.95. When the score_cutoff is bigger than 95 fuzz.token_sort_ratio and fuzz.token_set_ratio are not calculated anymore, since the results are guaranteed to be smaller than the score_cutoff
In process.extractOne RapidFuzz avoids calls through Python whenever possible and preprocesses the query once ahead of time. E.g. the BitPal algorithm requires one of the two strings which are compared to be stored into a bitvector which takes a big part of the algorithms runtime. In process.extractOne the query is stored into this bitvector only once and the bitvector is reused afterwards making the algorithm a lot faster.
since extractOne only searches for the best match it uses the ratio of the current best match as score_cutoff for the next elements. This way it can quickly discard more elements by using the improvements to the levenshtein distance calculation from 2) in many cases. When it finds a element with a similarity of 100 it exits early since there can’t be a better match afterwards.

More Related Contents:

Leave a Comment Cancel reply