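I have a DataFrame with three features: Product_detail, S.I_Units, and Value (the S.I_Units and Value columns are named D and F in the sample below).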
df4 = pd.DataFrame({'Product_detail': ['XYZ', 'ABC', 'DEF', 'GHI'],'D': ['g', 'Kg', 'l', 'ml'],'F': ['500', '1', '1', '1000']} )
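My Product_detail column contains text, so I have vectorized it with TfidfVectorizer. I need to compute a similarity matrix, but I don't know how to make use of the S.I_Units and Value columns. For example, rows of the DataFrame look like ('Amul Butter','g','200'), ('Amul Butter','g','100'), ('Amul Butter','g','300'), ('Amul Milk','ml','1000'). I want the top n products most similar to Amul Butter.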
Answer 0 (score: 0)
I'm not sure what your expected output is, but you could do something with fuzzywuzzy, i.e. percentage-based string matching.
Let's assume this df:
import pandas as pd
df4 = pd.DataFrame({'Product_detail': ['butter', 'amul butter', 'amul milk', 'milk'],'D': ['g', 'Kg', 'l', 'ml'],'F': ['500', '1', '1', '1000']})
  Product_detail   D     F
0         butter   g   500
1    amul butter  Kg     1
2      amul milk   l     1
3           milk  ml  1000
Then you can build a list of choices and call process.extract():
from fuzzywuzzy import process
# create a list of choices from df['Product_detail']
choices = list(df4['Product_detail'].values)
# use fuzzywuzzy's process.extract()
# limit is the number of returned results
process.extract('amul butter', choices, limit=3)
Out:
[('amul butter', 100), ('butter', 90), ('amul milk', 59)]
If you don't want the match percentages, use a list comprehension:
result = process.extract('amul butter', choices, limit=3)
# list comprehension to remove the percent
[x[0] for x in result]
Out:
['amul butter', 'butter', 'amul milk']
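If you want the matching rows of your df back (note that isin() returns them in the DataFrame's original order, not ranked by score):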
# list comprehension to remove the percent
result_list = [x[0] for x in result]
# if you want to return your df
df4[df4['Product_detail'].isin(result_list)]
  Product_detail   D    F
0         butter   g  500
1    amul butter  Kg    1
2      amul milk   l    1
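The string matching above ignores the unit and value columns that the question asks about. One possible way to bring them in is sketched below. It is only a sketch under several assumptions: the UNIT_TO_BASE conversion table, the 0.7 text weight, and the helper names to_base, similarity, and top_n are all hypothetical, and difflib.SequenceMatcher from the standard library stands in for fuzzywuzzy (or a TF-IDF cosine similarity) so that the example runs without extra dependencies. The idea is to convert each quantity to a common base unit, score how close two quantities are, and blend that with the text score.

```python
from difflib import SequenceMatcher

# Assumed conversion table: unit -> (dimension, factor to the base unit).
# Base units chosen here are grams for mass and millilitres for volume.
UNIT_TO_BASE = {
    'g': ('mass', 1.0),
    'kg': ('mass', 1000.0),
    'ml': ('volume', 1.0),
    'l': ('volume', 1000.0),
}


def to_base(unit, value):
    """Return (dimension, quantity expressed in the base unit)."""
    dimension, factor = UNIT_TO_BASE[unit.lower()]
    return dimension, float(value) * factor


def similarity(a, b, text_weight=0.7):
    """Blend text and quantity similarity for two (name, unit, value) rows."""
    text_sim = SequenceMatcher(None, a[0].lower(), b[0].lower()).ratio()
    dim_a, qty_a = to_base(a[1], a[2])
    dim_b, qty_b = to_base(b[1], b[2])
    if dim_a != dim_b:
        # Mass and volume are not comparable, so the quantity part scores 0.
        qty_sim = 0.0
    elif max(qty_a, qty_b) == 0:
        qty_sim = 1.0
    else:
        # Ratio of the smaller to the larger quantity, in [0, 1].
        qty_sim = min(qty_a, qty_b) / max(qty_a, qty_b)
    return text_weight * text_sim + (1 - text_weight) * qty_sim


def top_n(query, rows, n=3):
    """Return the n highest-scoring rows, skipping rows identical to the query."""
    scored = [(row, similarity(query, row)) for row in rows if row != query]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:n]


rows = [
    ('Amul Butter', 'g', '200'),
    ('Amul Butter', 'g', '100'),
    ('Amul Butter', 'g', '300'),
    ('Amul Milk', 'ml', '1000'),
]
results = top_n(('Amul Butter', 'g', '200'), rows, n=2)
for row, score in results:
    print(row, round(score, 3))
```

With pandas, the rows could be taken from df4.itertuples(index=False, name=None). The weight between text and quantity is a judgment call and would need tuning on real data.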