Question

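I have many tables residing in MySQL, and I need to perform similar string matching on similar columns across multiple tables (e.g. Table B column 1, Table C column 1, ...), using a specified column in a reference table (e.g. Table A column 1).

I tried Python Levenshtein (the ratio function), looping over each element (e.g. every element in Table B column 1) and comparing it against the list (e.g. Table A column 1). However, the whole loop is far too slow. Table B column 1 has millions of elements, while the list consists of 60K unique elements. It would take me days to finish the whole process. Is there a more efficient way?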

    # Perform fuzzy matching using Levenshtein distance
    from Levenshtein import ratio  # from the python-Levenshtein package

    def get_closest_match(previous_string, sample_string, df, fun):
        # Initialize variables
        best_match = ''
        highest_ratio = 0
        # Compare sample_string with previous_string to identify duplicates
        if sample_string == previous_string:
            # If it is a duplicate, skip fuzzy matching for efficiency
            best_match = previous_string
        else:
            # Otherwise compare sample_string with every string in the reference list
            for current_string in df.values.tolist():
                if sample_string == current_string[0]:
                    # If it is an exact match, skip fuzzy matching for efficiency
                    highest_ratio = 1
                    best_match = current_string[0]
                elif (sample_string.split(' ')[0] == current_string[0].split(' ')[0]
                      and highest_ratio != 1):
                    # If it is not an exact match but the first words agree,
                    # proceed with fuzzy matching
                    current_score = fun(sample_string, current_string[0])
                    if current_score > highest_ratio:
                        highest_ratio = current_score
                        best_match = current_string[0]
        return best_match

    # Collects the best match found for each row of df1
    matched_dict = {'Matched': []}

    def LevRatioMerge(df1, df2, fun):
        temp_string = ''
        for row in df1.itertuples():
            # row[0] is the index, row[1] is the first column
            best_match = get_closest_match(temp_string, row[1], df2, fun)
            temp_string = best_match
            matched_dict['Matched'].append(best_match)

    # df_tableBCol1 and df_tableACol1 are assumed to be single-column
    # DataFrames already loaded from the MySQL tables
    LevRatioMerge(df_tableBCol1, df_tableACol1, ratio)

How can I speed up this kind of fuzzy string matching?
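One direction that might help, given as a rough sketch rather than a tested solution: the loop above rescans all 60K reference strings for every row, even though only strings sharing the first word are ever scored. Grouping the reference strings by first word once, checking exact matches through a set, and caching results for repeated inputs should cut most of that work. The names `make_matcher` and `similarity` are hypothetical, and `difflib.SequenceMatcher` is used only as a standard-library stand-in for the scorer; passing `Levenshtein.ratio` as `score` would keep the original metric.

```python
from collections import defaultdict
from difflib import SequenceMatcher
from functools import lru_cache


def similarity(a, b):
    # Stand-in scorer from the standard library (an assumption of this
    # sketch); pass Levenshtein.ratio as `score` to keep the original metric.
    return SequenceMatcher(None, a, b).ratio()


def make_matcher(reference, score=similarity):
    """Build a cached lookup function over an iterable of reference strings."""
    # Deduplicate while preserving the original order of the reference list.
    unique_refs = list(dict.fromkeys(reference))
    exact = set(unique_refs)

    # Index the reference strings by first word, built once up front,
    # so each lookup only scores the strings in one bucket.
    index = defaultdict(list)
    for ref in unique_refs:
        index[ref.split(' ')[0]].append(ref)

    @lru_cache(maxsize=None)
    def closest(sample):
        # Exact matches cost one set lookup instead of a full scan.
        if sample in exact:
            return sample
        candidates = index.get(sample.split(' ')[0], [])
        if not candidates:
            # No reference string shares the first word.
            return ''
        # Score only the candidates that share the first word.
        return max(candidates, key=lambda ref: score(sample, ref))

    return closest


if __name__ == '__main__':
    closest = make_matcher(['acme corp', 'acme corporation', 'globex inc'])
    for s in ['acme corpo', 'globex incorporated', 'acme corpo', 'initech']:
        print(repr(s), '->', repr(closest(s)))
```

With pandas, something like `df_tableBCol1.iloc[:, 0].map(closest)` would apply the lookup to a whole column. Because the cache is keyed on the input string, repeated values are scored only once regardless of row order, which the previous-string check in the original code only achieves for adjacent duplicates.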

0 answers: