I have two pandas DataFrames that need to be merged and then pivoted. In one of them, a column holds a comma-separated string. The DataFrames are:
import pandas as pd
import numpy as np
tableA = [(100, 'chocolate, sprinkles'),
(101, 'chocolate, sprinkles'),
(102, 'glazed')]
labels = ['product', 'tags']
dfA = pd.DataFrame.from_records(tableA, columns=labels)
tableB = [('A', 100),
('A', 101),
('B', 101),
('C', 100),
('C', 102),
('B', 101),
('A', 100),
('C', 102)]
labels = ['customer', 'product']
dfB = pd.DataFrame.from_records(tableB, columns=labels)
dfA:
product tags
0 100 chocolate, sprinkles
1 101 chocolate, sprinkles
2 102 glazed
dfB:
customer product
0 A 100
1 A 101
2 B 101
3 C 100
4 C 102
5 B 101
6 A 100
7 C 102
The result must look like this:
customer sprinkles chocolate glazed
A ? ? ?
B ? ? ?
C ? ? ?
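I have tried various functions without success. Any suggestions would be much appreciated!
Here is some of my code. I know it does not work, but it should give you an idea of what I am trying to do: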
dfC=dfB.merge(dfA, left_on='product', right_on='product')
print(dfC)
which produces:
customer product tags
0 A 100 chocolate, sprinkles
1 C 100 chocolate, sprinkles
2 A 100 chocolate, sprinkles
3 A 101 chocolate, sprinkles
4 B 101 chocolate, sprinkles
5 B 101 chocolate, sprinkles
6 C 102 glazed
7 C 102 glazed
and
dfS = pd.DataFrame(dfC.tags.str.split(',').tolist(),index=dfC.customer).stack()
dfS = dfS.reset_index()[[ 'customer',0]]
dfS.columns = ['var1', 'var2']
print(dfS)
which produces:
var1 var2
0 A chocolate
1 A sprinkles
2 C chocolate
3 C sprinkles
4 A chocolate
5 A sprinkles
6 A chocolate
7 A sprinkles
8 B chocolate
9 B sprinkles
10 B chocolate
11 B sprinkles
12 C glazed
13 C glazed
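Answer 0 (score: 1)
First you need to strip the whitespace from var2 (the second tag in each pair comes out of the split with a leading space, e.g. ' sprinkles'):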
dfS['var2'] = dfS['var2'].str.strip()
Then you can create one indicator column per tag, for example:
dfS['chocolate'] = dfS['var2'].apply(lambda x: 1 if x == 'chocolate' else 0)
dfS['sprinkles'] = dfS['var2'].apply(lambda x: 1 if x == 'sprinkles' else 0)
dfS['glazed'] = dfS['var2'].apply(lambda x: 1 if x == 'glazed' else 0)
Then you can group by var1 and aggregate the other columns by summing them, for example:
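# Sum only the indicator columns; including the string column var2 would concatenate its values in recent pandas versions
dfS.groupby('var1')[['chocolate', 'sprinkles', 'glazed']].sum().reset_index().rename(columns={'var1': 'customer'})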
The output looks like this:
customer chocolate sprinkles glazed
0 A 3 3 0
1 B 2 2 0
2 C 1 1 2
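If the set of tags is not known in advance, writing one line per tag gets tedious. Here is a minimal sketch of an alternative, assuming the tags are always separated by a comma followed by a single space: Series.str.get_dummies can build the indicator columns directly from the merged frame (dfC in the question). The name tag_counts is only illustrative.

```python
import pandas as pd

# Same data as in the question
dfA = pd.DataFrame({'product': [100, 101, 102],
                    'tags': ['chocolate, sprinkles', 'chocolate, sprinkles', 'glazed']})
dfB = pd.DataFrame({'customer': ['A', 'A', 'B', 'C', 'C', 'B', 'A', 'C'],
                    'product': [100, 101, 101, 100, 102, 101, 100, 102]})

# Attach the tag string to every purchase
dfC = dfB.merge(dfA, on='product')

# One 0/1 column per distinct tag (assumes ', ' is the only separator used)
dummies = dfC['tags'].str.get_dummies(sep=', ')

# Sum the indicators per customer
tag_counts = dummies.groupby(dfC['customer']).sum().reset_index()
print(tag_counts)
```

The columns come out in alphabetical order; index them with a list such as ['customer', 'sprinkles', 'chocolate', 'glazed'] if the order in the question matters.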
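Answer 1 (score: 1)
Using the reshaped DataFrame dfS from the question (with the whitespace stripped from var2, otherwise ' sprinkles' and 'sprinkles' would count as different tags), you can use pd.crosstab to get the number of times each tag occurs per customer:
pd.crosstab(dfS.var1, dfS.var2)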
var2 chocolate glazed sprinkles
var1
A 3 0 3
B 2 0 2
C 1 2 1
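For completeness, here is a sketch of the whole pipeline from dfA and dfB to the crosstab. It assumes pandas 0.25 or later, because it uses DataFrame.explode instead of the split/stack step from the question. The names long_df and counts are made up for illustration.

```python
import pandas as pd

# Same data as in the question
dfA = pd.DataFrame({'product': [100, 101, 102],
                    'tags': ['chocolate, sprinkles', 'chocolate, sprinkles', 'glazed']})
dfB = pd.DataFrame({'customer': ['A', 'A', 'B', 'C', 'C', 'B', 'A', 'C'],
                    'product': [100, 101, 101, 100, 102, 101, 100, 102]})

# Merge, split the tag string into a list, and give each tag its own row
long_df = dfB.merge(dfA, on='product')
long_df['tag'] = long_df['tags'].str.split(',')
long_df = long_df.explode('tag').reset_index(drop=True)

# Remove the leading space left over from splitting on ','
long_df['tag'] = long_df['tag'].str.strip()

# Count how often each tag occurs per customer
counts = pd.crosstab(long_df['customer'], long_df['tag'])

# Optional: put the columns in the order requested in the question
counts = counts[['sprinkles', 'chocolate', 'glazed']]
print(counts)
```

Resetting the index after explode is a defensive choice: explode repeats the original row labels, and a unique index avoids any alignment surprises later on.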