How can I speed up fuzzy string matching?

Time: 2018-07-05 15:14:56

Tags: mysql python-3.x levenshtein-distance

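I have a number of tables in MySQL. I need to fuzzy-match a specified column in a reference table (e.g., Table A, Column 1) against similar columns in several other tables (e.g., Table B Column 1, Table C Column 1, ...).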

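I tried the Python Levenshtein package (its ratio function), looping over every element (e.g., each element in Table B Column 1) and comparing it against a list (e.g., Table A Column 1). However, the whole loop is too slow. Table B Column 1 has millions of elements, and the list consists of 60K unique elements, so the full run would take days. Is there a more efficient way to do this?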

    from Levenshtein import ratio  # provided by the python-Levenshtein package

    # Collects the best match for each row of df1.
    # Not defined in the original snippet; assumed to be a dict of lists.
    matched_dict = {'Matched': []}

    ## Perform fuzzy matching using Levenshtein distance
    def get_closest_match(previous_string, sample_string, df, fun):
        # Initialize variables
        best_match = ''
        highest_ratio = 0
        # Compare sample_string with previous_string (the previous best match)
        # to identify duplicates
        if sample_string == previous_string:
            # If it is a duplicate, skip fuzzy matching for efficiency
            best_match = previous_string
        # If it is not a duplicate, perform subsequent matching
        else:
            # Compare sample_string with each current_string in the reference list
            for current_string in df.values.tolist():
                if sample_string == current_string[0]:
                    # Exact match: nothing can score higher, so stop searching
                    highest_ratio = 1
                    best_match = current_string[0]
                    break
                elif (sample_string.split(' ')[0] == current_string[0].split(' ')[0]
                      and highest_ratio != 1):
                    # Not an exact match, but the first words agree:
                    # proceed with fuzzy matching
                    current_score = fun(sample_string, current_string[0])
                    if current_score > highest_ratio:
                        highest_ratio = current_score
                        best_match = current_string[0]
        return best_match

    def LevRatioMerge(df1, df2, fun):
        temp_string = ''
        for row in df1.itertuples():
            # row[0] is the index; row[1] is the first (only) column
            best_match = get_closest_match(temp_string, row[1], df2, fun)
            temp_string = best_match
            matched_dict['Matched'].append(best_match)

    # df_tableBCol1 and df_tableACol1 are single-column DataFrames
    # loaded from MySQL beforehand (loading code not shown)
    LevRatioMerge(df_tableBCol1, df_tableACol1, ratio)
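
One possible direction, offered as a sketch under assumptions rather than a tested solution: the loop above rescans all 60K reference strings for every sample, yet it only scores the ones that share the sample's first word. Grouping the reference strings by first word once, and caching the result for repeated samples, should avoid most of that rescanning. The names `difflib_ratio`, `build_first_word_index`, and `match_all` are hypothetical, and the standard library's `difflib.SequenceMatcher` stands in for `ratio` so the example runs without extra packages.

```python
from collections import defaultdict
from difflib import SequenceMatcher


def difflib_ratio(a, b):
    # Stand-in scorer (assumption): not the Levenshtein ratio used above.
    return SequenceMatcher(None, a, b).ratio()


def build_first_word_index(reference_strings):
    """Group the unique reference strings by their first word."""
    index = defaultdict(list)
    # sorted() only makes tie-breaking deterministic
    for ref in sorted(set(reference_strings)):
        index[ref.split(' ')[0]].append(ref)
    return index


def match_all(samples, reference_strings, score=difflib_ratio):
    """Return the best match per sample ('' if no reference shares its first word)."""
    exact = set(reference_strings)
    index = build_first_word_index(exact)
    cache = {}  # sample -> best match, so repeated samples are scored once
    results = []
    for sample in samples:
        if sample not in cache:
            if sample in exact:
                # Exact match: constant-time set lookup, no scoring needed
                cache[sample] = sample
            else:
                # Only score references that share the first word
                candidates = index.get(sample.split(' ')[0], [])
                cache[sample] = max(
                    candidates,
                    key=lambda ref: score(sample, ref),
                    default='',
                )
        results.append(cache[sample])
    return results


refs = ["acme corp", "acme corporation", "globex inc"]
samples = ["acme corp", "acme corpration", "globex", "unknown co", "acme corpration"]
matches = match_all(samples, refs)
```

With plain lists pulled from the two columns, `match_all(samples, reference_strings, score=ratio)` would play the role of `LevRatioMerge`. The two scorers do not return identical values, so the chosen matches may differ in borderline cases.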

0 answers:

No answers yet