Question

I have a DataFrame like this (sample):

     A   B                           C   D            E
0   V1  B1                    Clearing  C1   1538884.46
1   V1  B1  CustomerPayment_Difference  C1  13537679.70
2   V1  B1                     Invoice  C1 -15771005.81
3   V1  B1           PaymentDifference  C1         0.00
4   V2  B2                    Clearing  C2    104457.22
5   V2  B2                     Invoice  C2   -400073.56
6   V2  B2                     Payment  C2    297856.45
7   V3  B3                    Clearing  C3   1989462.95
8   V3  B3                  CreditMemo  C3       538.95
9   V3  B3  CustomerPayment_Difference  C3   2112329.00
10  V3  B3                     Invoice  C3  -4066485.69
11  V4  B4                    Clearing  C4   -123946.13
12  V4  B4                  CreditMemo  C4    127624.66
13  V4  B4                  Accounting  C4    424774.52
14  V4  B4                     Invoice  C4 -40446521.41
15  V4  B4                     Payment  C4  44441419.95

I want to reshape this DataFrame as follows:

   A  B  D    Accounting    Clearing  CreditMemo  CustomerPayment_Difference  \
  V1  B1 C1          NaN  1538884.46         NaN                  13537679.7   
  V2  B2 C2          NaN   104457.22         NaN                         NaN   
  V3  B3 C3          NaN  1989462.95      538.95                   2112329.0   
  V4  B4 C4    424774.52  -123946.13   127624.66                         NaN   

C      Invoice      Payment  PaymentDifference  
0 -15771005.81          NaN                0.0  
1   -400073.56    297856.45                NaN  
2  -4066485.69          NaN                NaN  
3 -40446521.41  44441419.95                NaN

So far I have tried pivot:

df.pivot(index='A', columns='C', values='E').reset_index()

It gives the following result:

C   A  Accounting    Clearing  CreditMemo  CustomerPayment_Difference  \
0  V1         NaN  1538884.46         NaN                  13537679.7   
1  V2         NaN   104457.22         NaN                         NaN   
2  V3         NaN  1989462.95      538.95                   2112329.0   
3  V4   424774.52  -123946.13   127624.66                         NaN   

C      Invoice      Payment  PaymentDifference  
0 -15771005.81          NaN                0.0  
1   -400073.56    297856.45                NaN  
2  -4066485.69          NaN                NaN  
3 -40446521.41  44441419.95                NaN

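The table above drops columns B and D, and I need those columns as well.

For simplicity I provided this sample data. In the future, however, the data may also look like this (note rows 4 and 5, where V1 appears again with different B and D values):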

     A   B                           C   D            E
0   V1  B1                    Clearing  C1   1538884.46
1   V1  B1  CustomerPayment_Difference  C1  13537679.70
2   V1  B1                     Invoice  C1 -15771005.81
3   V1  B1           PaymentDifference  C1         0.00
4   V1  B2                    Clearing  C1         88.9
5   V1  B2                    Clearing  C2         79.9

In this case my code throws a duplicate index error.

To solve both problems I need to specify A, B and D as the index. I need code similar to this:

df.pivot(index=['A','B','D'],columns='C', values='E').reset_index()

This code gives me an error.

How can I solve this? How do I provide multiple columns as the index in a pandas pivot?

Answer 1

I think you need:

df = df.set_index(['A','B','D', 'C'])['E'].unstack().reset_index()
print (df)
C   A   B   D  Accounting    Clearing  CreditMemo  CustomerPayment_Difference  \
0  V1  B1  C1         NaN  1538884.46         NaN                  13537679.7   
1  V2  B2  C2         NaN   104457.22         NaN                         NaN   
2  V3  B3  C3         NaN  1989462.95      538.95                   2112329.0   
3  V4  B4  C4   424774.52  -123946.13   127624.66                         NaN   

C      Invoice      Payment  PaymentDifference  
0 -15771005.81          NaN                0.0  
1   -400073.56    297856.45                NaN  
2  -4066485.69          NaN                NaN  
3 -40446521.41  44441419.95                NaN
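For anyone who wants to run this, here is a minimal self-contained sketch of the same approach on the first seven rows of the question's sample. The variable name `wide` and the cosmetic `columns.name = None` step are my own additions, not part of the answer.

```python
import pandas as pd

# First seven rows of the sample data from the question.
df = pd.DataFrame({
    "A": ["V1", "V1", "V1", "V1", "V2", "V2", "V2"],
    "B": ["B1", "B1", "B1", "B1", "B2", "B2", "B2"],
    "C": ["Clearing", "CustomerPayment_Difference", "Invoice",
          "PaymentDifference", "Clearing", "Invoice", "Payment"],
    "D": ["C1", "C1", "C1", "C1", "C2", "C2", "C2"],
    "E": [1538884.46, 13537679.70, -15771005.81, 0.00,
          104457.22, -400073.56, 297856.45],
})

# Move A, B, D and C into the index, then turn the innermost level (C)
# into columns. reset_index() brings A, B and D back as ordinary columns.
wide = df.set_index(["A", "B", "D", "C"])["E"].unstack().reset_index()

# Drop the leftover "C" label on the column axis (purely cosmetic).
wide.columns.name = None
print(wide)
```

Recent pandas versions (1.1 and later, as far as I know) also accept a list for `index` in `DataFrame.pivot`, so the call from the question may work unchanged there.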

Another solution is to use pivot_table:

df = df.pivot_table(index=['A','B','D'], columns='C', values='E')

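However, pivot_table aggregates when the same combination of A, B, C and D occurs more than once (by default it takes the mean). The first solution raises an error in that case: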

print (df)
    A   B                           C   D            E
0  V1  B1                    Clearing  C1      3000.00 <-V1,B1,Clearing,C1
1  V1  B1  CustomerPayment_Difference  C1  13537679.70
2  V1  B1                     Invoice  C1 -15771005.81
3  V1  B1           PaymentDifference  C1         0.00
4  V1  B1                    Clearing  C1      1000.00 <-V1,B1,Clearing,C1


df = df.set_index(['A','B','D', 'C'])['E'].unstack().reset_index()
print (df)

ValueError: Index contains duplicate entries, cannot reshape

But pivot_table aggregates (the two Clearing values, 3000 and 1000, are averaged to 2000):

df = df.pivot_table(index=['A','B','D'], columns='C', values='E')
print (df)

C         Clearing  CustomerPayment_Difference      Invoice  PaymentDifference
A  B  D                                                                       
V1 B1 C1    2000.0                  13537679.7 -15771005.81                0.0
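If averaging is not what you want, the aggregation can be stated explicitly. The sketch below reuses the duplicated sample above and compares the default with `aggfunc="sum"`. Choosing a sum is only an assumption for illustration, and `by_mean` / `by_sum` are illustrative names.

```python
import pandas as pd

# Sample with a duplicated (V1, B1, C1, Clearing) key, as in the answer.
df = pd.DataFrame({
    "A": ["V1", "V1", "V1", "V1", "V1"],
    "B": ["B1", "B1", "B1", "B1", "B1"],
    "C": ["Clearing", "CustomerPayment_Difference", "Invoice",
          "PaymentDifference", "Clearing"],
    "D": ["C1", "C1", "C1", "C1", "C1"],
    "E": [3000.00, 13537679.70, -15771005.81, 0.00, 1000.00],
})

idx = ["A", "B", "D"]

# Default aggregation is the mean: (3000 + 1000) / 2 = 2000.
by_mean = df.pivot_table(index=idx, columns="C", values="E")

# An explicit aggfunc makes the choice visible; here duplicates are summed.
by_sum = df.pivot_table(index=idx, columns="C", values="E", aggfunc="sum")

# reset_index() turns A, B and D back into ordinary columns.
print(by_mean.reset_index())
print(by_sum.reset_index())
```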

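So the question is: is it a good idea to always use pivot_table?

In my opinion it depends on whether you need to care about duplicates. If you use pivot or set_index + unstack and get an error, you know the data contains duplicates. pivot_table always aggregates, so you never find out about them.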
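One way to get both behaviours is to look for duplicated keys first and only then decide how to reshape. This is a rough sketch, not part of the original answer. Falling back to a sum is an assumption chosen for illustration, and the names `keys`, `dupes` and `wide` are mine.

```python
import pandas as pd

# Sample with a duplicated (V1, B1, C1, Clearing) key, as in the answer.
df = pd.DataFrame({
    "A": ["V1", "V1", "V1", "V1", "V1"],
    "B": ["B1", "B1", "B1", "B1", "B1"],
    "C": ["Clearing", "CustomerPayment_Difference", "Invoice",
          "PaymentDifference", "Clearing"],
    "D": ["C1", "C1", "C1", "C1", "C1"],
    "E": [3000.00, 13537679.70, -15771005.81, 0.00, 1000.00],
})

keys = ["A", "B", "D", "C"]

# keep=False flags every row of a duplicated group, not only the later repeats.
dupes = df[df.duplicated(subset=keys, keep=False)]

if dupes.empty:
    # No duplicates: the strict reshape is safe and loses nothing.
    wide = df.set_index(keys)["E"].unstack().reset_index()
else:
    # Duplicates exist: show them, then aggregate deliberately (sum assumed).
    print(dupes)
    wide = df.pivot_table(index=keys[:3], columns="C", values="E",
                          aggfunc="sum").reset_index()

print(wide)
```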
