Reshaping a pandas DataFrame with pivot and providing multiple columns as the index

Time: 2018-03-16 06:41:24

Tags: python pandas

I have a DataFrame like this (sample):

     A   B                           C   D            E
0   V1  B1                    Clearing  C1   1538884.46
1   V1  B1  CustomerPayment_Difference  C1  13537679.70
2   V1  B1                     Invoice  C1 -15771005.81
3   V1  B1           PaymentDifference  C1         0.00
4   V2  B2                    Clearing  C2    104457.22
5   V2  B2                     Invoice  C2   -400073.56
6   V2  B2                     Payment  C2    297856.45
7   V3  B3                    Clearing  C3   1989462.95
8   V3  B3                  CreditMemo  C3       538.95
9   V3  B3  CustomerPayment_Difference  C3   2112329.00
10  V3  B3                     Invoice  C3  -4066485.69
11  V4  B4                    Clearing  C4   -123946.13
12  V4  B4                  CreditMemo  C4    127624.66
13  V4  B4                  Accounting  C4    424774.52
14  V4  B4                     Invoice  C4 -40446521.41
15  V4  B4                     Payment  C4  44441419.95

I want to reshape this DataFrame as follows:

   A  B  D    Accounting    Clearing  CreditMemo  CustomerPayment_Difference  \
  V1  B1 C1          NaN  1538884.46         NaN                  13537679.7   
  V2  B2 C2          NaN   104457.22         NaN                         NaN   
  V3  B3 C3          NaN  1989462.95      538.95                   2112329.0   
  V4  B4 C4    424774.52  -123946.13   127624.66                         NaN   

C      Invoice      Payment  PaymentDifference  
0 -15771005.81          NaN                0.0  
1   -400073.56    297856.45                NaN  
2  -4066485.69          NaN                NaN  
3 -40446521.41  44441419.95                NaN 

So far I have tried pivot: df.pivot(index='A',columns='C', values='E').reset_index()

It gives the following result:

C   A  Accounting    Clearing  CreditMemo  CustomerPayment_Difference  \
0  V1         NaN  1538884.46         NaN                  13537679.7   
1  V2         NaN   104457.22         NaN                         NaN   
2  V3         NaN  1989462.95      538.95                   2112329.0   
3  V4   424774.52  -123946.13   127624.66                         NaN   

C      Invoice      Payment  PaymentDifference  
0 -15771005.81          NaN                0.0  
1   -400073.56    297856.45                NaN  
2  -4066485.69          NaN                NaN  
3 -40446521.41  44441419.95                NaN

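The result above drops columns B and D, and I need those columns as well.

For simplicity I have provided this sample data. In the future, however, the data will also look like this: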

     A   B                           C   D            E
0   V1  B1                    Clearing  C1   1538884.46
1   V1  B1  CustomerPayment_Difference  C1  13537679.70
2   V1  B1                     Invoice  C1 -15771005.81
3   V1  B1           PaymentDifference  C1         0.00
4   V1  B2                    Clearing  C1         88.9
5   V1  B2                    Clearing  C2         79.9

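In this case my code raises a duplicate index error, because A alone no longer identifies a row.

To solve both problems I need to specify A, B and D as the index. I need code similar to this: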

df.pivot(index=['A','B','D'],columns='C', values='E').reset_index()

This code gives me an error.

How can I solve this? How do I provide multiple columns as the index in a pandas pivot?

1 Answer:

Answer 0: (score: 1)

I think you need:

df = df.set_index(['A','B','D', 'C'])['E'].unstack().reset_index()
print (df)
C   A   B   D  Accounting    Clearing  CreditMemo  CustomerPayment_Difference  \
0  V1  B1  C1         NaN  1538884.46         NaN                  13537679.7   
1  V2  B2  C2         NaN   104457.22         NaN                         NaN   
2  V3  B3  C3         NaN  1989462.95      538.95                   2112329.0   
3  V4  B4  C4   424774.52  -123946.13   127624.66                         NaN   

C      Invoice      Payment  PaymentDifference  
0 -15771005.81          NaN                0.0  
1   -400073.56    297856.45                NaN  
2  -4066485.69          NaN                NaN  
3 -40446521.41  44441419.95                NaN  
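
To make the steps in that one-liner explicit, here is a minimal, self-contained sketch. It uses a made-up four-row subset of the sample data (the variable name wide is only for illustration) and is not the only way to write it:

```python
import pandas as pd

# Small subset of the sample data from the question (two groups only).
df = pd.DataFrame({
    "A": ["V1", "V1", "V2", "V2"],
    "B": ["B1", "B1", "B2", "B2"],
    "C": ["Clearing", "Invoice", "Clearing", "Payment"],
    "D": ["C1", "C1", "C2", "C2"],
    "E": [1538884.46, -15771005.81, 104457.22, 297856.45],
})

# Move the three key columns plus C into the index, keep E as a Series,
# then unstack the innermost level (C) into columns.
wide = df.set_index(["A", "B", "D", "C"])["E"].unstack().reset_index()

# The columns axis is still named "C" (the "C" in the printed header);
# clearing the name is optional and purely cosmetic.
wide.columns.name = None
print(wide)
```

As far as I know, pandas 1.1.0 and later also accept a list for index in DataFrame.pivot itself, so the call from the question may work unchanged on a recent version. Treat that as an assumption about the installed version.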

Another solution is to use pivot_table:

df = df.pivot_table(index=['A','B','D'], columns='C', values='E')
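
The pivot_table call leaves A, B and D in the index, so the flat layout asked for in the question would still need reset_index(). A rough sketch on a made-up four-row subset of the sample data (names flat_pt and flat_unstack are illustrative), comparing it with set_index + unstack when no key is duplicated:

```python
import pandas as pd

# Small subset of the sample data; every (A, B, D, C) combination is unique.
df = pd.DataFrame({
    "A": ["V1", "V1", "V2", "V2"],
    "B": ["B1", "B1", "B2", "B2"],
    "C": ["Clearing", "Invoice", "Clearing", "Payment"],
    "D": ["C1", "C1", "C2", "C2"],
    "E": [1538884.46, -15771005.81, 104457.22, 297856.45],
})

# pivot_table accepts a list of columns for the index.
flat_pt = (
    df.pivot_table(index=["A", "B", "D"], columns="C", values="E")
      .reset_index()
)

# The set_index + unstack route, for comparison.
flat_unstack = df.set_index(["A", "B", "D", "C"])["E"].unstack().reset_index()

print(flat_pt)
print(flat_unstack)
```

With unique keys both routes should give the same columns and values, because the default aggregation (the mean) of a single value is that value.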

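However, pivot_table aggregates (by default with the mean) when rows are duplicated across columns A, B, C and D. With the first solution, duplicates raise an error instead: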

print (df)
    A   B                           C   D            E
0  V1  B1                    Clearing  C1      3000.00 <-V1,B1,Clearing,C1
1  V1  B1  CustomerPayment_Difference  C1  13537679.70
2  V1  B1                     Invoice  C1 -15771005.81
3  V1  B1           PaymentDifference  C1         0.00
4  V1  B1                    Clearing  C1      1000.00 <-V1,B1,Clearing,C1


df = df.set_index(['A','B','D', 'C'])['E'].unstack().reset_index()
print (df)
  

ValueError: Index contains duplicate entries, cannot reshape

But pivot_table aggregates:

df = df.pivot_table(index=['A','B','D'], columns='C', values='E')
print (df)

C         Clearing  CustomerPayment_Difference      Invoice  PaymentDifference
A  B  D                                                                       
V1 B1 C1    2000.0                  13537679.7 -15771005.81                0.0
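
The contrast between the two behaviours can be reproduced in one script. This is a sketch on a reduced version of the duplicate example above; the variable names and the extra aggfunc="sum" call are illustrative additions, shown only to make clear that the 2000.0 comes from the default mean:

```python
import pandas as pd

# Two rows share the key (V1, B1, C1, Clearing): 3000.0 and 1000.0.
df = pd.DataFrame({
    "A": ["V1", "V1", "V1"],
    "B": ["B1", "B1", "B1"],
    "C": ["Clearing", "Invoice", "Clearing"],
    "D": ["C1", "C1", "C1"],
    "E": [3000.0, -15771005.81, 1000.0],
})
keys = ["A", "B", "D"]

# set_index + unstack refuses to reshape duplicated keys.
try:
    df.set_index(keys + ["C"])["E"].unstack()
    unstack_error = None
except ValueError as exc:
    unstack_error = str(exc)
print("unstack error:", unstack_error)

# pivot_table silently aggregates; the default aggfunc is the mean.
mean_wide = df.pivot_table(index=keys, columns="C", values="E")

# A different aggregation can be requested explicitly.
sum_wide = df.pivot_table(index=keys, columns="C", values="E", aggfunc="sum")

print(mean_wide)
print(sum_wide)
```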

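So the question is: is it a good idea to always use pivot_table?

In my opinion it depends on whether you need to care about duplicates. With pivot or set_index + unstack you get an error, so you know duplicates exist; pivot_table always aggregates, so duplicates pass unnoticed.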
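
If you want to know about duplicates without relying on the exception, one option is to check the key columns before reshaping. A minimal sketch, assuming the same column names as in the question; the names dupes and wide are hypothetical:

```python
import pandas as pd

df = pd.DataFrame({
    "A": ["V1", "V1", "V1"],
    "B": ["B1", "B1", "B1"],
    "C": ["Clearing", "Invoice", "Clearing"],
    "D": ["C1", "C1", "C1"],
    "E": [3000.0, -15771005.81, 1000.0],
})
keys = ["A", "B", "D", "C"]

# keep=False marks every row that belongs to a duplicated key,
# not only the second and later occurrences.
dupes = df[df.duplicated(subset=keys, keep=False)]

if dupes.empty:
    # Safe to reshape without any aggregation.
    wide = df.set_index(keys)["E"].unstack().reset_index()
else:
    # Inspect the offending rows and decide how to aggregate them.
    wide = None
    print("Duplicated keys found:")
    print(dupes)
```

Once the duplicated rows are visible, you can decide deliberately whether to drop them, fix them, or aggregate with pivot_table and an explicit aggfunc.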