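I have many tables in MySQL, and I need to run the same fuzzy string matching on similar columns across several of them (e.g., Table B Column 1, Table C Column 1, ...), matching each against a specified column in a reference table (e.g., Table A Column 1).
I tried using python-Levenshtein (the ratio function), looping over every element of Table B Column 1 and comparing it against a list built from Table A Column 1. However, the whole loop is far too slow: Table B Column 1 has millions of elements, and the reference list has 60K unique elements, so the full run would take days. Is there a more efficient way?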
## Perform fuzzy matching using Levenshtein distance
from Levenshtein import ratio  # provided by the python-Levenshtein package
def get_closest_match(previous_string, sample_string, df, fun):
    # Initialize variables
    best_match = ''
    highest_ratio = 0
    # Compare sample_string with previous_string to identify duplicates
    if sample_string == previous_string:
        # If it is duplicate, skip fuzzy matching for efficiency
        best_match = previous_string
    # If it is not duplicate, perform subsequent matching
    else:
        # Compare sample_string with current_string in reference list
        for current_string in df.values.tolist():
            if sample_string == current_string[0]:
                # If total match, skip fuzzy matching for efficiency
                highest_ratio = 1
                best_match = current_string[0]
            elif (sample_string.split(' ')[0] == current_string[0].split(' ')[0]
                  and highest_ratio != 1):
                # If it is not total match and pass first word matching,
                # proceed with fuzzy matching
                current_score = fun(sample_string, current_string[0])
                if current_score > highest_ratio:
                    highest_ratio = current_score
                    best_match = current_string[0]
    return best_match
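# Results accumulator; it must exist before LevRatioMerge is called
matched_dict = {'Matched': []}

def LevRatioMerge(df1, df2, fun):
    temp_string = ''
    for row in df1.itertuples():
        # row[0] is the index, row[1] is the first column of df1
        best_match = get_closest_match(temp_string, row[1], df2, fun)
        temp_string = best_match
        matched_dict['Matched'].append(best_match)

LevRatioMerge(df_tableBCol1, df_tableACol1, ratio)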
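
One direction that might help, shown only as a rough sketch: the loop above rescans all 60K reference strings for every row, even though only strings sharing the first word are ever scored. Grouping the reference list by first word once, and caching the result for each distinct sample string, should reduce the work considerably. The names `build_first_word_index`, `match_all`, and `difflib_ratio` are hypothetical, and `difflib_ratio` is only a standard-library stand-in so the snippet runs on its own; the Levenshtein `ratio` function can be passed as `fun` instead.

```python
from collections import defaultdict
from difflib import SequenceMatcher


def difflib_ratio(a, b):
    # Stand-in scorer from the standard library (an assumption for this
    # sketch); pass Levenshtein's ratio as `fun` to keep the original scoring.
    return SequenceMatcher(None, a, b).ratio()


def build_first_word_index(reference):
    """Group reference strings by their first space-separated word."""
    index = defaultdict(list)
    for ref in reference:
        index[ref.split(' ')[0]].append(ref)
    return index


def match_all(samples, reference, fun=difflib_ratio):
    """Return the best reference match for each sample ('' if none)."""
    reference_set = set(reference)            # O(1) exact-match lookup
    index = build_first_word_index(reference)
    cache = {}                                # one computation per distinct sample
    results = []
    for sample in samples:
        if sample not in cache:
            if sample in reference_set:
                cache[sample] = sample
            else:
                # Only score reference strings that share the first word
                candidates = index.get(sample.split(' ')[0], [])
                cache[sample] = max(
                    candidates, key=lambda ref: fun(sample, ref), default=''
                )
        results.append(cache[sample])
    return results


reference = ["acme corp", "acme corporation ltd", "beta llc"]
samples = ["acme corp", "acme corpp", "beta lc", "gamma inc", "acme corpp"]
matches = match_all(samples, reference)
```

With the DataFrames from the question, the call might look like `match_all(df_tableBCol1.iloc[:, 0].tolist(), df_tableACol1.iloc[:, 0].tolist(), ratio)`, assuming the strings sit in the first column of each frame. How much this saves depends on how evenly the first words are distributed across the reference list.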