Complex split, merge, and pivot across multiple DataFrames in pandas

Time: 2018-04-13 17:06:55

Tags: python pandas dataframe merge

I have two pandas DataFrames that need to be merged and pivoted. In one of them, a column holds a comma-separated string. The DataFrames are:

import pandas as pd
import numpy as np

tableA = [(100, 'chocolate, sprinkles'),
     (101, 'chocolate, sprinkles'),
     (102, 'glazed')]
labels = ['product', 'tags']
dfA = pd.DataFrame.from_records(tableA, columns=labels)

tableB = [('A', 100),
       ('A', 101),
       ('B', 101),
       ('C', 100),
       ('C', 102),
       ('B', 101),
       ('A', 100),
       ('C', 102)]
labels = ['customer', 'product']
dfB = pd.DataFrame.from_records(tableB, columns=labels) 

dfA:
     product                  tags
 0      100  chocolate, sprinkles
 1      101  chocolate, sprinkles
 2      102                glazed
dfB:
   customer  product
 0        A      100
 1        A      101
 2        B      101
 3        C      100
 4        C      102
 5        B      101
 6        A      100
 7        C      102

The result must look like this:

 customer   sprinkles   chocolate   glazed
 A          ?            ?              ?
 B          ?            ?              ?   
 C          ?            ?              ?   

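I have tried various functions without success. Any suggestions would be much appreciated!

Here is some of my code. I know it does not work, but it should give you an idea of what I am trying to do: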

dfC=dfB.merge(dfA, left_on='product', right_on='product')
print(dfC)

which gives:

   customer  product                  tags
 0        A      100  chocolate, sprinkles
 1        C      100  chocolate, sprinkles
 2        A      100  chocolate, sprinkles
 3        A      101  chocolate, sprinkles
 4        B      101  chocolate, sprinkles
 5        B      101  chocolate, sprinkles
 6        C      102                glazed
 7        C      102                glazed

dfS = pd.DataFrame(dfC.tags.str.split(',').tolist(),index=dfC.customer).stack()
dfS = dfS.reset_index()[[ 'customer',0]] 
dfS.columns = ['var1', 'var2'] 
print(dfS)

which gives:

     var1        var2
0     A   chocolate
1     A   sprinkles
2     C   chocolate
3     C   sprinkles
4     A   chocolate
5     A   sprinkles
6     A   chocolate
7     A   sprinkles
8     B   chocolate
9     B   sprinkles
10    B   chocolate
11    B   sprinkles
12    C      glazed
13    C      glazed
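
The split step can also be written so that each tag lands on its own row already stripped of whitespace. The following is only a sketch, assuming a pandas version that provides `DataFrame.explode` (0.25 or later); the small `dfC` stand-in and the names `long_tags` and `tag` are illustrative and not part of the original code.

```python
import pandas as pd

# Small stand-in for the merged frame dfC (illustrative data only).
dfC = pd.DataFrame({
    'customer': ['A', 'C', 'C'],
    'product': [100, 100, 102],
    'tags': ['chocolate, sprinkles', 'chocolate, sprinkles', 'glazed'],
})

long_tags = (
    dfC.assign(tag=dfC['tags'].str.split(','))    # list of tags per row
       .explode('tag')                            # one row per tag
       .reset_index(drop=True)
       .assign(tag=lambda d: d['tag'].str.strip())  # drop the space after each comma
       [['customer', 'tag']]
)
print(long_tags)
```

Because the customer column is carried along by `explode`, there is no need to rebuild the frame from a list and re-attach the customer as an index.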

2 Answers:

Answer 0: (score: 1)

First, you need to strip the whitespace from var2:

dfS['var2'] = dfS['var2'].str.strip()

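This removes the leading space that the split on ',' leaves in front of tags such as ' sprinkles'. Then you can create a column for each tag, for example: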

dfS['chocolate'] = dfS['var2'].apply(lambda x: 1 if x == 'chocolate' else 0)
dfS['sprinkles'] = dfS['var2'].apply(lambda x: 1 if x == 'sprinkles' else 0)
dfS['glazed'] = dfS['var2'].apply(lambda x: 1 if x == 'glazed' else 0)

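Then you can group by var1 and sum the tag columns. Selecting the three tag columns explicitly keeps the string column var2 out of the aggregation:

dfS.groupby('var1')[['chocolate', 'sprinkles', 'glazed']].sum().reset_index().rename(columns={'var1': 'customer'})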

The output looks like this:

  customer  chocolate  sprinkles  glazed
0        A          3          3       0
1        B          2          2       0
2        C          1          1       2
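
If the set of tags is not known in advance, the three hand-written lambda columns could be replaced by `pd.get_dummies`, which creates one indicator column per distinct value. This is a minimal sketch and not the answer's own code; the `dfS` stand-in below is illustrative data, and the names `indicators` and `counts` are assumptions.

```python
import pandas as pd

# Stand-in for dfS after var2 has been stripped (illustrative data only).
dfS = pd.DataFrame({
    'var1': ['A', 'A', 'C', 'C', 'C'],
    'var2': ['chocolate', 'sprinkles', 'chocolate', 'sprinkles', 'glazed'],
})

# One 0/1 indicator column per distinct tag, without naming the tags by hand.
indicators = pd.get_dummies(dfS['var2'], dtype=int)

# Sum the indicators per customer and expose the customer as a regular column.
counts = (
    indicators.groupby(dfS['var1']).sum()
              .rename_axis('customer')
              .reset_index()
)
print(counts)
```

Note that `get_dummies` orders the tag columns alphabetically, so the column order may differ from the hand-built version.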

Answer 1: (score: 1)

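With the merged and split data frame dfS (after var2 has been stripped of whitespace), you can use pd.crosstab to count how often each tag occurs per customer:

pd.crosstab(dfS.var1, dfS.var2)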

var2  chocolate  glazed  sprinkles
var1
A             3       0          3
B             2       0          2
C             1       2          1
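
For reference, the whole pipeline from the question's two data frames to the requested layout might be sketched as follows. It assumes pandas 0.25 or later for `DataFrame.explode`; the intermediate names `tags`, `merged`, and `result` are illustrative and do not come from the question or the answers.

```python
import pandas as pd

dfA = pd.DataFrame({
    'product': [100, 101, 102],
    'tags': ['chocolate, sprinkles', 'chocolate, sprinkles', 'glazed'],
})
dfB = pd.DataFrame({
    'customer': ['A', 'A', 'B', 'C', 'C', 'B', 'A', 'C'],
    'product': [100, 101, 101, 100, 102, 101, 100, 102],
})

# One row per (product, tag), with the space after each comma removed.
tags = (
    dfA.assign(tag=dfA['tags'].str.split(','))
       .explode('tag')
       .reset_index(drop=True)
       .assign(tag=lambda d: d['tag'].str.strip())
       [['product', 'tag']]
)

# Attach the tags to every purchase, then count tags per customer.
merged = dfB.merge(tags, on='product')
result = pd.crosstab(merged['customer'], merged['tag']).reset_index()
result.columns.name = None  # drop the leftover 'tag' label on the column axis
print(result)
```

Splitting the tags before the merge keeps the exploded table small (one row per product and tag), and the resulting counts agree with the outputs shown in both answers.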