I have a DataFrame like this (sample):
A B C D E
0 V1 B1 Clearing C1 1538884.46
1 V1 B1 CustomerPayment_Difference C1 13537679.70
2 V1 B1 Invoice C1 -15771005.81
3 V1 B1 PaymentDifference C1 0.00
4 V2 B2 Clearing C2 104457.22
5 V2 B2 Invoice C2 -400073.56
6 V2 B2 Payment C2 297856.45
7 V3 B3 Clearing C3 1989462.95
8 V3 B3 CreditMemo C3 538.95
9 V3 B3 CustomerPayment_Difference C3 2112329.00
10 V3 B3 Invoice C3 -4066485.69
11 V4 B4 Clearing C4 -123946.13
12 V4 B4 CreditMemo C4 127624.66
13 V4 B4 Accounting C4 424774.52
14 V4 B4 Invoice C4 -40446521.41
15 V4 B4 Payment C4 44441419.95
I want to reshape this DataFrame as shown below:
A B D Accounting Clearing CreditMemo CustomerPayment_Difference \
V1 B1 C1 NaN 1538884.46 NaN 13537679.7
V2 B2 C2 NaN 104457.22 NaN NaN
V3 B3 C3 NaN 1989462.95 538.95 2112329.0
V4 B4 C4 424774.52 -123946.13 127624.66 NaN
C Invoice Payment PaymentDifference
0 -15771005.81 NaN 0.0
1 -400073.56 297856.45 NaN
2 -4066485.69 NaN NaN
3 -40446521.41 44441419.95 NaN
So far I have tried to do this with pivot:
df.pivot(index='A',columns='C', values='E').reset_index()
which gives the following result:
C A Accounting Clearing CreditMemo CustomerPayment_Difference \
0 V1 NaN 1538884.46 NaN 13537679.7
1 V2 NaN 104457.22 NaN NaN
2 V3 NaN 1989462.95 538.95 2112329.0
3 V4 424774.52 -123946.13 127624.66 NaN
C Invoice Payment PaymentDifference
0 -15771005.81 NaN 0.0
1 -400073.56 297856.45 NaN
2 -4066485.69 NaN NaN
3 -40446521.41 44441419.95 NaN
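The table above leaves out columns B and D, and I need those columns as well.
For simplicity I have provided this sample data. In the future, though, the data may also look like this: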
A B C D E
0 V1 B1 Clearing C1 1538884.46
1 V1 B1 CustomerPayment_Difference C1 13537679.70
2 V1 B1 Invoice C1 -15771005.81
3 V1 B1 PaymentDifference C1 0.00
4 V1 B2 Clearing C1 88.9
5 V1 B2 Clearing C2 79.9
In this scenario my code throws a duplicate index error.
To solve both problems I need to specify A, B, and D as the index. I need code similar to this:
df.pivot(index=['A','B','D'],columns='C', values='E').reset_index()
This code gives me an error.
How can I solve this? How do I supply multiple columns as the index in a pandas pivot table?
Answer 0 (score: 1)
I think you need:
df = df.set_index(['A','B','D', 'C'])['E'].unstack().reset_index()
print (df)
C A B D Accounting Clearing CreditMemo CustomerPayment_Difference \
0 V1 B1 C1 NaN 1538884.46 NaN 13537679.7
1 V2 B2 C2 NaN 104457.22 NaN NaN
2 V3 B3 C3 NaN 1989462.95 538.95 2112329.0
3 V4 B4 C4 424774.52 -123946.13 127624.66 NaN
C Invoice Payment PaymentDifference
0 -15771005.81 NaN 0.0
1 -400073.56 297856.45 NaN
2 -4066485.69 NaN NaN
3 -40446521.41 44441419.95 NaN
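For anyone who wants to run this, here is a minimal self-contained sketch of the set_index + unstack approach. It uses only a small subset of the question's sample rows, and the variable name `wide` and the step that clears the leftover `C` label on the column axis are my own additions for illustration.

```python
import pandas as pd

# Small subset of the question's sample data (groups V1 and V2 only).
df = pd.DataFrame({
    "A": ["V1", "V1", "V2", "V2"],
    "B": ["B1", "B1", "B2", "B2"],
    "C": ["Clearing", "Invoice", "Clearing", "Payment"],
    "D": ["C1", "C1", "C2", "C2"],
    "E": [1538884.46, -15771005.81, 104457.22, 297856.45],
})

# Put the three key columns plus the column to spread ('C') into the index,
# select the values, then unstack the last index level into columns.
wide = df.set_index(["A", "B", "D", "C"])["E"].unstack().reset_index()

# Optional: drop the leftover 'C' label on the column axis.
wide.columns.name = None

print(wide)
```

On pandas 1.1 or newer, the call from the question, df.pivot(index=['A','B','D'], columns='C', values='E'), should also work, because pivot accepts a list for index from that version on.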
Another solution is to use pivot_table:
df = df.pivot_table(index=['A','B','D'], columns='C', values='E')
But it aggregates if there are duplicates in columns A, B, C, D. With the first solution, duplicates raise an error:
print (df)
A B C D E
0 V1 B1 Clearing C1 3000.00 <-V1,B1,Clearing,C1
1 V1 B1 CustomerPayment_Difference C1 13537679.70
2 V1 B1 Invoice C1 -15771005.81
3 V1 B1 PaymentDifference C1 0.00
4 V1 B1 Clearing C1 1000.00 <-V1,B1,Clearing,C1
df = df.set_index(['A','B','D', 'C'])['E'].unstack().reset_index()
print (df)
ValueError: Index contains duplicate entries, cannot reshape
But pivot_table aggregates (by default with the mean, which is why Clearing becomes 2000.0 here):
df = df.pivot_table(index=['A','B','D'], columns='C', values='E')
print (df)
C Clearing CustomerPayment_Difference Invoice PaymentDifference
A B D
V1 B1 C1 2000.0 13537679.7 -15771005.81 0.0
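Since pivot_table averages duplicates by default, it may be worth passing aggfunc explicitly so the aggregation is a deliberate choice. The sketch below assumes a sum is what you want for amounts like these, which is only my guess about the data; the names `mean_wide` and `sum_wide` are illustrative.

```python
import pandas as pd

# Reduced version of the duplicate example: two 'Clearing' rows share
# the same A, B, D key.
df = pd.DataFrame({
    "A": ["V1", "V1", "V1"],
    "B": ["B1", "B1", "B1"],
    "C": ["Clearing", "Clearing", "Invoice"],
    "D": ["C1", "C1", "C1"],
    "E": [3000.0, 1000.0, -15771005.81],
})

# Default behaviour: duplicates are averaged (aggfunc defaults to the mean).
mean_wide = df.pivot_table(index=["A", "B", "D"], columns="C", values="E")

# Explicit choice: duplicates are summed instead.
sum_wide = df.pivot_table(index=["A", "B", "D"], columns="C", values="E",
                          aggfunc="sum")

print(mean_wide)
print(sum_wide)
```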
So the question is: is it a good idea to always use pivot_table?
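In my opinion, it depends on whether you need to care about duplicates. If you use pivot or set_index + unstack and get an error, you know there are duplicates; pivot_table always aggregates, so you would never find out about them.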
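If you want both behaviours, one possible approach is to check for duplicate keys yourself before reshaping, then decide what to do with them. This is only a sketch: the names `keys`, `dupes` and `wide`, and the choice to fall back to a summing pivot_table, are my assumptions and not something the question asked for.

```python
import pandas as pd

df = pd.DataFrame({
    "A": ["V1", "V1", "V1"],
    "B": ["B1", "B1", "B1"],
    "C": ["Clearing", "Clearing", "Invoice"],
    "D": ["C1", "C1", "C1"],
    "E": [3000.0, 1000.0, -15771005.81],
})

keys = ["A", "B", "D", "C"]

# keep=False marks every row that shares its key with another row,
# so all members of a duplicate group are listed.
dupes = df[df.duplicated(subset=keys, keep=False)]

if dupes.empty:
    # No duplicates: the strict reshape is safe.
    wide = df.set_index(keys)["E"].unstack().reset_index()
else:
    # Duplicates found: show them, then aggregate on purpose.
    print("Duplicate keys found:")
    print(dupes)
    wide = df.pivot_table(index=["A", "B", "D"], columns="C", values="E",
                          aggfunc="sum").reset_index()

print(wide)
```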