My DataFrame looks like this:
Customer_ID Category Products
1 Veg A
2 Veg B
3 Fruit A
3 Fruit B
3 Veg B
1 Fruit A
3 Veg C
1 Fruit C
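For each customer ID, I want to find the products bought in each category and create one column per product accordingly. The output should look like this: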
Customer_ID Category Pro_1 Pro_2 Pro_3
1 Veg A NA NA
1 Fruit A NA C
2 Veg NA B NA
3 Veg NA B C
3 Fruit A B NA
Answer 0: (score: 1)
Another option, using crosstab:
pd.crosstab([df['Customer_ID'],df['Category']], df['Products'])
Output:
Products A B C
Customer_ID Category
1 Fruit 1 0 1
Veg 1 0 0
2 Veg 0 1 0
3 Fruit 1 1 0
Veg 0 1 1
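Afterwards, you can reset the index of the crosstab result to get something closer to the layout you want.
df = pd.crosstab([df['Customer_ID'], df['Category']], df['Products']).reset_index()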
Products Customer_ID Category A B C
0 1 Fruit 1 0 1
1 1 Veg 1 0 0
2 2 Veg 0 1 0
3 3 Fruit 1 1 0
4 3 Veg 0 1 1
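The crosstab result holds counts, while the question asks for product names in columns Pro_1, Pro_2, Pro_3. A minimal sketch of one way to bridge that gap, assuming each product column should show its own name when the count is non-zero and a missing value otherwise (the names `counts`, `labels` and `out` are only illustrative):

```python
import pandas as pd

df = pd.DataFrame({
    'Customer_ID': [1, 2, 3, 3, 3, 1, 3, 1],
    'Category': ['Veg', 'Veg', 'Fruit', 'Fruit', 'Veg', 'Fruit', 'Veg', 'Fruit'],
    'Products': ['A', 'B', 'A', 'B', 'B', 'A', 'C', 'C'],
})

counts = pd.crosstab([df['Customer_ID'], df['Category']], df['Products'])

# For every product column, keep the product name where the count is
# positive and leave the cell empty (None) otherwise.
labels = counts.apply(
    lambda col: pd.Series(
        [col.name if n > 0 else None for n in col], index=col.index
    )
)

# Rename the product columns to Pro_1, Pro_2, ... in column order.
labels.columns = ['Pro_{}'.format(i) for i in range(1, labels.shape[1] + 1)]
out = labels.reset_index()
print(out)
```

The rows come out sorted by Customer_ID and Category, so their order differs from the table in the question.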
Answer 1: (score: 1)
Use groupby with unstack; note that if there are duplicate rows, their values are concatenated:
df = df.groupby(['Customer_ID','Category','Products'])['Products'].sum().unstack()
df.columns = ['Pro_{}'.format(x) for x in range(1, len(df.columns)+1)]
df = df.reset_index()
print(df)
Customer_ID Category Pro_1 Pro_2 Pro_3
0 1 Fruit A None C
1 1 Veg A None None
2 2 Veg None B None
3 3 Fruit A B None
4 3 Veg None B C
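To illustrate the caveat about duplicate rows in the groupby approach, here is a small sketch on a hypothetical three-row sample in which the triple (1, Fruit, A) appears twice. The frame `dup` and the names `with_dups` and `no_dups` are made up for this illustration:

```python
import pandas as pd

# Hypothetical sample: the row (1, Fruit, A) appears twice.
dup = pd.DataFrame({
    'Customer_ID': [1, 1, 1],
    'Category': ['Fruit', 'Fruit', 'Fruit'],
    'Products': ['A', 'A', 'C'],
})

keys = ['Customer_ID', 'Category', 'Products']

# sum() on strings concatenates them, so the duplicated product shows up as 'AA'.
with_dups = dup.groupby(keys)['Products'].sum().unstack()
print(with_dups)

# Dropping duplicate triples first keeps a single 'A'.
no_dups = dup.drop_duplicates(keys).groupby(keys)['Products'].sum().unstack()
print(no_dups)
```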
Another solution with a helper column; the triples must be unique:
# if the triples are not unique, remove duplicates first
df = df.drop_duplicates(['Customer_ID','Category','Products'])
df['a'] = df['Products']
df = df.set_index(['Customer_ID','Category','Products'])['a'].unstack()
df.columns = ['Pro_{}'.format(x) for x in range(1, len(df.columns)+1)]
df = df.reset_index()
print(df)
Customer_ID Category Pro_1 Pro_2 Pro_3
0 1 Fruit A None C
1 1 Veg A None None
2 2 Veg None B None
3 3 Fruit A B None
4 3 Veg None B C
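As a possible variation on the helper-column idea (not part of the answer above), pivot_table with aggfunc='first' tolerates duplicate triples without a separate drop_duplicates step. This is only a sketch; the extra ninth row is added here to create a duplicate on purpose:

```python
import pandas as pd

df = pd.DataFrame({
    'Customer_ID': [1, 2, 3, 3, 3, 1, 3, 1, 1],
    'Category': ['Veg', 'Veg', 'Fruit', 'Fruit', 'Veg', 'Fruit', 'Veg', 'Fruit', 'Fruit'],
    # The last row repeats the triple (1, Fruit, A) on purpose.
    'Products': ['A', 'B', 'A', 'B', 'B', 'A', 'C', 'C', 'A'],
})

# Helper column 'a' carries the product name into the cells;
# aggfunc='first' keeps one value when a triple occurs more than once.
wide = df.assign(a=df['Products']).pivot_table(
    index=['Customer_ID', 'Category'],
    columns='Products',
    values='a',
    aggfunc='first',
)
wide.columns = ['Pro_{}'.format(i) for i in range(1, len(wide.columns) + 1)]
wide = wide.reset_index()
print(wide)
```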
Answer 2: (score: 0)
Try this (never mind the IO, it is only there for easy copy/paste):
import pandas as pd
from io import StringIO
df = pd.read_csv(StringIO("""
Customer_ID Category Products
1 Veg A
2 Veg B
3 Fruit A
3 Fruit B
3 Veg B
1 Fruit A
3 Veg C
1 Fruit C"""), sep=r'\s+')
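# dtype=int keeps 0/1 indicators (newer pandas versions default to booleans)
df = df.join(pd.get_dummies(df['Products'], dtype=int))
# drop the string column so that sum() only adds up the indicator columns
g = df.drop(columns='Products').groupby(['Customer_ID', 'Category']).sum()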
print(g)
Output:
A B C
Customer_ID Category
1 Fruit 1 0 1
Veg 1 0 0
2 Veg 0 1 0
3 Fruit 1 1 0
Veg 0 1 1