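Suppose I have 3 dataframes, df1, df2 and df3, each with id, city and name columns.
I want to find similar building names by computing their similarity with the fuzzywuzzy package. This is my current solution, which I need to improve:
First, I concatenate all three dataframes into a single full_name column. Strictly speaking, I should not add id to full_name at this step, but I included it to better distinguish building names that come from different dataframes:
import pandas as pd
df1['full_name'] = df1['id'].apply(str) + '_' + df1['city'] + '_' + df1['name']
df2['full_name'] = df2['id'].apply(str) + '_' + df2['city'] + '_' + df2['name']
df3['full_name'] = df3['id'].apply(str) + '_' + df3['city'] + '_' + df3['name']
df4 = df1['full_name']
df5 = df2['full_name']
df6 = df3['full_name']
frames = [df4, df5, df6]
df = pd.concat(frames)
df.columns = ["full_name"]
df.to_excel('concated_names.xlsx', index = False)
Second, I iterate over all full_names and compare them with each other to get the similarity_ratio of each pair of building names:
import csv
import itertools
from fuzzywuzzy import fuzz
df = pd.read_excel('concated_names.xlsx')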
projects = df.full_name.tolist()
processedProjects = []
matchers = []
threshold_ratio = 10
for project in projects:
    if project:
        processedProject = fuzz._process_and_sort(project, True, True)
        processedProjects.append(processedProject)
        matchers.append(fuzz.SequenceMatcher(None, processedProject))
with open('output10.csv', 'w', encoding = 'utf_8_sig') as f1:
    writer = csv.writer(f1, delimiter=',', lineterminator='\n', )
    writer.writerow(('name', 'matched_name', 'similarity_ratio'))
    for project1, project2 in itertools.combinations(enumerate(processedProjects), 2):
        matcher = matchers[project1[0]]
        matcher.set_seq2(project2[1])
        ratio = int(round(100 * matcher.ratio()))
        if ratio >= threshold_ratio:
            #print(projects[project1[0]], projects[project2[0]])
            my_list = projects[project1[0]], projects[project2[0]], ratio
            print(my_list)
            writer.writerow(my_list)
my_list result:
('1010667747_Suzhou_Suzhou IFS', '1010667356_Shenzhen_Kingkey 100', 44)
('1010667747_Suzhou_Suzhou IFS', '1010667289_Wuhan_Wuhan Center', 49)
('1010667747_Suzhou_Suzhou IFS', '190010_Shenzhen_Ping An Finance Centre', 33)
('1010667747_Suzhou_Suzhou IFS', '190012_Guangzhou_Guangzhou CTF Finance Centre', 47)
......
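For readers who want to see the pairing logic in isolation, here is a minimal sketch of the same idea: normalize each name, compare every unordered pair once (n names give n(n-1)/2 comparisons), and keep pairs at or above a threshold. It uses difflib.SequenceMatcher from the standard library as a stand-in for fuzzywuzzy, and the normalize helper is a hypothetical approximation of the token-sort preprocessing, not fuzzywuzzy's own function.

```python
import re
from difflib import SequenceMatcher
from itertools import combinations

def normalize(text):
    """Lowercase, replace non-word characters with spaces, and sort the tokens."""
    cleaned = re.sub(r'\W', ' ', text).lower()
    return ' '.join(sorted(cleaned.split()))

def similar_pairs(names, threshold=10):
    """Return (name_a, name_b, ratio) for each unordered pair scoring >= threshold."""
    cleaned = [normalize(n) for n in names]
    results = []
    # enumerate() keeps the index so the original, unprocessed names can be reported
    for (i, a), (j, b) in combinations(enumerate(cleaned), 2):
        ratio = int(round(100 * SequenceMatcher(None, a, b).ratio()))
        if ratio >= threshold:
            results.append((names[i], names[j], ratio))
    return results

# Hypothetical sample names; the last two differ only in token order
names = ['Suzhou IFS', 'Wuhan Center', 'Center Wuhan']
pairs = similar_pairs(names, threshold=90)
print(pairs)  # [('Wuhan Center', 'Center Wuhan', 100)]
```

Because the tokens are sorted before comparison, word order does not affect the score, which is why the last two sample names match exactly.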
In the final step, I manually split output10.csv in Excel and got the final expected result (it would be even better if each building also had its source dataframe):
           id    city        name  matched_id  matched_name                matched_name.1  similarity_ratio
0  1010667747  Suzhou  Suzhou IFS  1010667356      Shenzhen                   Kingkey 100                44
1  1010667747  Suzhou  Suzhou IFS  1010667289         Wuhan                  Wuhan Center                49
2  1010667747  Suzhou  Suzhou IFS      190010      Shenzhen        Ping An Finance Centre                33
3  1010667747  Suzhou  Suzhou IFS      190012     Guangzhou  Guangzhou CTF Finance Centre                47
4  1010667747  Suzhou  Suzhou IFS      190015       Beijing                     China Zun                27
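The manual Excel split could probably be replaced by splitting each full_name on its first two underscores before writing the row. The sketch below assumes the id and city never contain an underscore; split_full_name and expand_row are hypothetical helpers, the header name matched_city is my own label, and io.StringIO stands in for the real output file.

```python
import csv
import io

def split_full_name(full_name):
    """Split 'id_city_name' on the first two underscores only."""
    building_id, city, name = full_name.split('_', 2)
    return building_id, city, name

def expand_row(row):
    """Turn (full_name, matched_full_name, ratio) into seven separate fields."""
    left, right, ratio = row
    return (*split_full_name(left), *split_full_name(right), ratio)

# Two rows copied from the printed my_list output
rows = [
    ('1010667747_Suzhou_Suzhou IFS', '1010667356_Shenzhen_Kingkey 100', 44),
    ('1010667747_Suzhou_Suzhou IFS', '190010_Shenzhen_Ping An Finance Centre', 33),
]

header = ('id', 'city', 'name', 'matched_id', 'matched_city',
          'matched_name', 'similarity_ratio')
buffer = io.StringIO()  # stands in for open('output10.csv', 'w', ...)
writer = csv.writer(buffer, lineterminator='\n')
writer.writerow(header)
expanded = [expand_row(row) for row in rows]
writer.writerows(expanded)
print(buffer.getvalue())
```

Passing a maxsplit of 2 matters here: building names may themselves contain underscores, and only the first two separators belong to the id and city.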
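How can I get the final expected result more efficiently in Python? Thanks.
Answer 0 (score: 1)
Try this solution: I use numpy and itertools to speed up and simplify the code, without going through Excel files...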
import numpy as np
from fuzzywuzzy import fuzz
from itertools import product
import pandas as pd
frames = [pd.DataFrame(df4), pd.DataFrame(df5), pd.DataFrame(df6)]
df = pd.concat(frames).reset_index(drop=True)
# similarity ratio for every ordered pair of names (n * n values)
dist = [fuzz.ratio(*x) for x in product(df.full_name, repeat=2)]
df1 = pd.DataFrame(np.array(dist).reshape(df.shape[0], df.shape[0]), columns=df.full_name.values.tolist())
# create a list of dataframes (each row is a dataframe)
listOfDfs = [df1.loc[idx] for idx in np.split(df1.index, df.shape[0])]
# in the dictionary, each name maps to a dataframe that contains its ratios against all names
DataFrameDict = {df['full_name'][i]: listOfDfs[i] for i in range(df1.shape[0])}
for name in DataFrameDict.keys():
    print(name)
    # print(DataFrameDict[name])
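If the goal is the flat table from the question with the source dataframe attached, one option is to keep the fields separate instead of concatenating them into a string and splitting later. This is only a sketch under assumptions: the records and their source labels are hypothetical, difflib.SequenceMatcher stands in for fuzz.ratio, and the score is computed on city and name only, leaving id out as the question suggests.

```python
from difflib import SequenceMatcher
from itertools import combinations

# Hypothetical records: (source, id, city, name); values echo the question's output
records = [
    ('df1', 1010667747, 'Suzhou', 'Suzhou IFS'),
    ('df2', 1010667289, 'Wuhan', 'Wuhan Center'),
    ('df3', 190012, 'Guangzhou', 'Guangzhou CTF Finance Centre'),
]

def score(left, right):
    """0-100 similarity of the 'city_name' strings (id deliberately left out)."""
    a = f'{left[2]}_{left[3]}'.lower()
    b = f'{right[2]}_{right[3]}'.lower()
    return int(round(100 * SequenceMatcher(None, a, b).ratio()))

# One flat row per unordered pair: both records' fields, then the ratio
rows = [(*left, *right, score(left, right))
        for left, right in combinations(records, 2)]

for row in rows:
    print(row)
```

Each row already has the source, id, city and name of both buildings as separate fields, so it can be passed to csv.writer or to a DataFrame constructor without any manual splitting.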