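I have two large files, File1 and File2, each containing company names. For every company name ("companyname") in File2, I am trying to find its best fuzzy match in File1. At the moment the job times out before it can finish. Is there a more efficient way to speed up the processing?
Here is my code: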
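import pandas as pd
# Assumption: fuzz is imported from fuzzywuzzy (thefuzz and rapidfuzz also provide fuzz.ratio)
from fuzzywuzzy import fuzz

File1 = pd.read_csv("directory/File1.csv")
File2 = pd.read_csv("directory/File2.csv")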
def match_name(name, list_names, min_score=0):
    # -1 score in case we don't get any matches
    max_score = -1
    # Returning empty name for no match as well
    max_name = ""
    # Iterating over all names in the other list
    for name2 in list_names:
        # Finding fuzzy match score
        score = fuzz.ratio(name, name2)
        # Checking if we are above our threshold and have a better score
        if score > min_score and score > max_score:
            max_name = name2
            max_score = score
    return (max_name, max_score)
dict_list = []
for name in File2.companyname:
    # Use our method to find the best match in File1, we can set a threshold here
    match = match_name(name, File1.companyname, 70)
    # New dict for storing data
    dict_ = {
        "companyname": name,
        "match_companyname": match[0],
        "score": match[1],
    }
    dict_list.append(dict_)
merge_table = pd.DataFrame(dict_list)
# Save results
merge_table.to_csv("directory/Saved.csv")
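The loop above scores every File2 name against every File1 name, so the work grows with the product of the two file sizes. One direction that might reduce it is sketched below, under assumptions: score each distinct name only once, and compare a name only against candidates that share a cheap "blocking" key. The helpers `block_key`, `score`, and `match_names` and the sample names are hypothetical, and `difflib.SequenceMatcher` is only a standard-library stand-in for `fuzz.ratio`, so the scores may not be identical.

```python
from collections import defaultdict
from difflib import SequenceMatcher


def block_key(name):
    """Hypothetical blocking key: the lowercased first character of the name."""
    return name.strip().lower()[:1]


def score(a, b):
    """Similarity on a 0-100 scale (stdlib stand-in for fuzz.ratio)."""
    return round(100 * SequenceMatcher(None, a, b).ratio())


def match_names(queries, candidates, min_score=0):
    """Return {query: (best_candidate, best_score)}, comparing only within blocks."""
    # Group the distinct candidates by blocking key so each query sees a small subset.
    blocks = defaultdict(list)
    for cand in set(candidates):
        blocks[block_key(cand)].append(cand)

    results = {}
    # Score each distinct query once, even if it appears many times in the input.
    for query in set(queries):
        best_name, best_score = "", -1
        for cand in blocks.get(block_key(query), []):
            s = score(query, cand)
            if s > min_score and s > best_score:
                best_name, best_score = cand, s
        results[query] = (best_name, best_score)
    return results


# Made-up sample data standing in for File1.companyname and File2.companyname.
file1_names = ["Acme Corporation", "Globex Inc", "Initech LLC"]
file2_names = ["Acme Corp", "Globex Incorporated", "Umbrella Co", "Acme Corp"]
matches = match_names(file2_names, file1_names, min_score=70)
```

The trade-off is that a first-character key will miss pairs whose first letters differ (for example a leading "The"), so the key would need to suit the data. Separately, libraries such as rapidfuzz provide a faster scorer and a `process.extractOne` helper, which may be enough without any blocking.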