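I am trying to compare every combination of phrases within a group and score how well they match. I'm stuck on looping through the contents of the groups: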
import pandas as pd
from fuzzywuzzy import fuzz as fz
import itertools
data = [[1,'ab'],[1,'bc'],[1,'de'],[2,'gh'],[2,'hi'],[2,'jk'],[3,'kl'],[3,'lm'],[3,'yz']]
df = pd.DataFrame(data,columns=['Ids','DESCR'])
def iterated(df):
    for a, b in itertools.product(df['DESCR'],df['DESCR']):
        try:
            print(a, b, fz.partial_ratio(a, b), fz.token_set_ratio(a,b))
        except:
            pass
    return result
df.groupby('Ids').apply(iterated(df))
The code above compares each DESCR against everything in the whole list instead of restricting the comparison to each group. I get:
ab ab 100 100
ab bc 50 50
ab de 0 0
ab gh 0 0
ab hi 0 0
ab jk 0 0
ab kl 0 0
ab lm 0 0
ab yz 0 0
bc ab 67 50
bc bc 100 100
bc de 0 0
bc gh 0 0
bc hi 0 0
bc jk 0 0
bc kl 0 0
bc lm 0 0
bc yz 0 0
...
But it should be:
ab bc 50 50
ab de 0 0
bc de 0 0
gh hi 50 50
gh jk 0 0
hi jk 50 50
...
Thanks.
Answer 0 (score: 1)
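I think the problem is that you are not handling the groups correctly. You do the grouping, but the call .apply(iterated(df)) runs iterated on the entire df: the function is called with the full frame before apply ever sees a group. Also, I think you want combinations rather than product.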
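To illustrate the difference between the two on one group's values, here is a small sketch using only the standard library. The variable names (`group_values`, `all_pairs`, `unique_pairs`) are made up for this illustration.

```python
import itertools

# The DESCR values of group 1 from the question
group_values = ['ab', 'bc', 'de']

# product pairs every value with every value, including itself and both orders
all_pairs = list(itertools.product(group_values, group_values))

# combinations yields each unordered pair exactly once, with no self-pairs
unique_pairs = list(itertools.combinations(group_values, 2))

print(len(all_pairs))   # 9
print(unique_pairs)     # [('ab', 'bc'), ('ab', 'de'), ('bc', 'de')]
```

With three values per group, product gives 9 pairs while combinations gives the 3 pairs listed in the expected output of the question.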
You may need to take it apart and handle each group separately. Consider:
import pandas as pd
import itertools
data = [[1,'ab'],[1,'bc'],[1,'de'],[2,'gh'],[2,'hi'],[2,'jk'],[3,'kl'],[3,'lm'],[3,'yz']]
df = pd.DataFrame(data,columns=['Ids','DESCR'])
def show_combos(df): #replace with your function...
    combos = itertools.combinations(df.DESCR, 2)
    for c in combos:
        print(c)
groups = df.groupby('Ids')
#iterate through the groups, which are mini-data frames
for name, group in groups:
    print('group name: {}'.format(name))
    show_combos(group)
    print()
Which produces the groups you want:
group name: 1
('ab', 'bc')
('ab', 'de')
('bc', 'de')
group name: 2
('gh', 'hi')
('gh', 'jk')
('hi', 'jk')
group name: 3
('kl', 'lm')
('kl', 'yz')
('lm', 'yz')
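To go from printing the pairs to scoring them, one possible sketch is below. It collects one row per pair into a new DataFrame. The helper names `similarity` and `score_pairs` and the output column names are hypothetical, and the scorer is a stand-in built on the standard library's `difflib.SequenceMatcher`, not fuzzywuzzy, so its numbers will not always agree with `fz.partial_ratio` or `fz.token_set_ratio`. Assuming fuzzywuzzy is installed, you could pass `scorer=fz.partial_ratio` instead.

```python
import itertools
from difflib import SequenceMatcher

import pandas as pd


def similarity(a, b):
    # Stand-in scorer (an assumption, not fuzzywuzzy): difflib ratio scaled to 0-100
    return round(100 * SequenceMatcher(None, a, b).ratio())


def score_pairs(df, scorer=similarity):
    # Hypothetical helper: score each unordered pair of DESCR values within each Ids group
    rows = []
    for name, group in df.groupby('Ids'):
        for a, b in itertools.combinations(group['DESCR'], 2):
            rows.append((name, a, b, scorer(a, b)))
    return pd.DataFrame(rows, columns=['Ids', 'left', 'right', 'score'])


data = [[1,'ab'],[1,'bc'],[1,'de'],[2,'gh'],[2,'hi'],[2,'jk'],[3,'kl'],[3,'lm'],[3,'yz']]
df = pd.DataFrame(data, columns=['Ids', 'DESCR'])

result = score_pairs(df)
print(result)
```

Iterating over the groups explicitly, rather than going through apply, keeps the group name available for each row and avoids depending on how apply assembles its return values.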