Handling categorical unit data and numeric data together

Time: 2018-11-29 18:39:40

Tags: python-3.x pandas categorical-data feature-selection

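I have a DataFrame with three features: Product_detail, S.I_Units and Value. (In the sample below, column D holds the units and column F holds the values.)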

df4 = pd.DataFrame({'Product_detail': ['XYZ', 'ABC', 'DEF', 'GHI'],'D': ['g', 'Kg', 'l', 'ml'],'F': ['500', '1', '1', '1000']} )

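My Product_detail column contains text, so I have vectorized it with TfidfVectorizer.

I need to compute a similarity matrix, but I don't know how to use the S.I_Units column together with the Value column. For example, rows of the DataFrame look like ('Amul Butter', 'g', '200'), ('Amul Butter', 'g', '100'), ('Amul Butter', 'g', '300'), ('Amul Milk', 'ml', '1000'). I want the top n products most similar to Amul Butter.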

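1 Answer:

Answer 0: (score: 0)

I'm not sure what your expected output is, but you could do something with fuzzywuzzy, i.e. percentage-based string matching:

Let's assume this df: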

import pandas as pd
df4 = pd.DataFrame({'Product_detail': ['butter', 'amul butter', 'amul milk', 'milk'], 'D': ['g', 'Kg', 'l', 'ml'], 'F': ['500', '1', '1', '1000']})


    Product_detail  D   F
0   butter          g   500
1   amul butter     Kg  1
2   amul milk       l   1
3   milk            ml  1000

Then you can build a list of choices and pass it to process.extract():

from fuzzywuzzy import process

# create a list of choices from df['Product_detail']
choices = df4['Product_detail'].tolist()

# use fuzzywuzzy's process.extract()
# limit is the number of returned results
process.extract('amul butter', choices, limit=3)

Out:

[('amul butter', 100), ('butter', 90), ('amul milk', 59)]
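
As the output shows, process.extract() returns (choice, score) pairs ordered from best to worst match. To illustrate that ranking idea without a third-party dependency, here is a minimal sketch built on the standard library's difflib.SequenceMatcher. This is a swapped-in technique, not fuzzywuzzy's scorer, so the scores will generally differ from the ones above; the helper name rank_choices is hypothetical.

```python
from difflib import SequenceMatcher


def rank_choices(query, choices, limit=3):
    """Return up to `limit` (choice, score) pairs, best match first.

    Scores are difflib similarity ratios scaled to 0-100; they are NOT
    the same numbers fuzzywuzzy would produce.
    """
    scored = [
        (choice, round(100 * SequenceMatcher(None, query, choice).ratio()))
        for choice in choices
    ]
    # Highest score first; sorted() is stable, so ties keep input order.
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:limit]


choices = ['butter', 'amul butter', 'amul milk', 'milk']
ranked = rank_choices('amul butter', choices)
print(ranked)
```

The exact match always scores 100 and therefore comes first; the order of the remaining entries depends on the scorer used.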

If you don't want the percentage scores, use a list comprehension:

result = process.extract('amul butter', choices, limit=3)

# list comprehension to remove the percent
[x[0] for x in result]

Out:

['amul butter', 'butter', 'amul milk']

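If you want to get back the matching rows of your own df (note that isin() returns them in the DataFrame's original order, not in ranking order):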

# list comprehension to remove the percent
result_list = [x[0] for x in result]

# if you want to return your df
df4[df4['Product_detail'].isin(result_list)]

    Product_detail  D   F
0   butter          g   500
1   amul butter     Kg  1
2   amul milk       l   1
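
The approach above matches on product names only and does not use the unit and value columns the question asks about. One possible way to bring them in, which is an assumption and not part of the original answer, is to convert every quantity to a common base unit first, so that '1 Kg' and '500 g' become comparable numbers. The conversion table UNIT_FACTORS and the helper to_base_quantity below are hypothetical names chosen for this sketch, and the table only covers the four units that appear in the sample data.

```python
# Hypothetical conversion table: unit -> (dimension, factor to base unit).
# The base units assumed here are grams for mass and millilitres for volume.
UNIT_FACTORS = {
    'g': ('mass', 1.0),
    'kg': ('mass', 1000.0),
    'ml': ('volume', 1.0),
    'l': ('volume', 1000.0),
}


def to_base_quantity(unit, value):
    """Convert a (unit, value) pair to (dimension, quantity in base unit)."""
    # Units are matched case-insensitively, so 'Kg' and 'kg' are the same.
    dimension, factor = UNIT_FACTORS[unit.strip().lower()]
    return dimension, float(value) * factor


# Same rows as the sample df: (Product_detail, D, F)
rows = [
    ('butter', 'g', '500'),
    ('amul butter', 'Kg', '1'),
    ('amul milk', 'l', '1'),
    ('milk', 'ml', '1000'),
]

normalized = [
    (name, *to_base_quantity(unit, value)) for name, unit, value in rows
]
for row in normalized:
    print(row)
```

Once the quantities are on a common scale, candidates could for instance be restricted to the same dimension as the query product before or after the name matching. How much weight the quantity should carry relative to the name similarity is a design choice the question leaves open.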