I have 2 pandas DataFrames
(df1
and df2
) with the following headers:
DF1: cnpj num_doc bc_icms
DF2: cnpj num_doc vlr_item
My question:
How can I compare the
sum()
ofBC_ICMS
of eachNUM_DOC
for eachCNPJ
fromDF1
, against thesum()
ofVLR_ITEM
of each ocurrence from the sameNUM_DOC
andCNPJ
fromDF2
?
My thoughts:
I believe it's something like "subtract the row that contains the result of aggfunc="sum" of df1 from df2"
Or maybe finding a way to create something like:
cnpj num_doc sum_bc_icms_df1 sum_vlr_item_df2 diff
Where diff
is a columns that contains the result of vlr_item_df1 - vlr_item_df2
My data:
DF1:
cnpj num_doc bc_icms
0 02817342000124 0000010154 17827.07
1 54921580000189 0000112428 108000.00
2 08953538000122 0000012865 232.00
3 08953538000122 0000012865 239.00
4 08953538000122 0000012865 215.00
5 07374346000107 0000014224 320.12
6 07374346000107 0000014231 385.04
7 07374346000107 0000014263 401.28
8 07374346000107 0000014279 391.26
9 02364118000124 0000015263 37353.10
10 02364118000124 0000015264 56214.14
11 02364118000124 0000015265 2115.92
12 07720786000160 0000020128 355.53
13 57101370000132 0000002042 200.00
14 07720786000160 0000020437 461.32
15 07720786000160 0000020438 168.04
16 07720786000160 0000020439 385.23
17 07720786000160 0000020444 378.80
18 07720786000160 0000020446 289.67
19 07720786000160 0000020451 223.36
20 07720786000160 0000020461 305.31
DF2:
cnpj num_doc vlr_item
0 02817342000124 0000010154 1391,04
1 02817342000124 0000010154 100,0
2 02817342000124 0000010154 1587,09
3 02817342000124 0000010154 200,0
4 02817342000124 0000010154 1430,0
5 02817342000124 0000010154 592,0
6 02817342000124 0000010154 7508,99
7 02817342000124 0000010154 443,03
8 02817342000124 0000010154 29,9
9 02817342000124 0000010154 343,26
10 07374346000107 0000014224 320,12
11 07374346000107 0000014231 385,04
12 07374346000107 0000014263 401,28
13 07374346000107 0000014279 391,26
14 02364118000124 0000015263 29725,96
15 02364118000124 0000015263 6590,31
16 02364118000124 0000015263 1036,83
17 02364118000124 0000015264 47348,71
18 02364118000124 0000015264 7641,97
19 02364118000124 0000015264 1223,46
20 02364118000124 0000015265 2115,92
Rules:
CNPJ
emits N NUM_DOC
.CNPJ A
emits NUM_DOC
1 and CNPJ B
emits NUM_DOC
1 too.CNPJ
can appear more than once.NUM_DOC
can appear more than once.My code:
From df1:
g1 = df2.groupby(['cnpj','num_doc'])['vlr_item'].sum()
Output:
cnpj num_doc
00903175000109 0000059253 100.00
01114678000168 0000004705 358.00
01116428000167 0000006883 165.00
02295440000149 0000051845 72.65
0000051866 69.90
02364118000124 0000015263 37353.10
0000015264 56214.14
0000015265 2115.92
0000015571 39873.74
0000015572 52611.06
0000015573 1121.64
02817342000124 0000010154 17827.07
0000010299 407.02
03509978001143 0000145539 17000.00
0000145744 108000.00
0000145745 108000.00
0000145746 108000.00
0000145749 15000.00
0000145750 15000.00
0000145753 15000.00
03572854000130 0000007298 7190.54
03688838000108 0000006286 155.77
04967974000280 0000025175 388.18
0000025230 384.07
0000025273 363.35
0000025356 351.88
0000025455 434.77
0000025487 352.18
0000025527 382.65
0000025630 383.61
And from df2:
g2 = df2.groupby(['cnpj','num_doc'])['vlr_item'].sum()
Output:
cnpj num_doc
02364118000124 0000015263 37353.10
0000015264 56214.14
0000015265 2115.92
0000015571 39873.74
0000015572 52611.06
0000015573 1121.64
02817342000124 0000010154 13625.31
0000010299 407.02
03572854000130 0000007298 7170.54
04967974000280 0000025175 388.18
0000025230 384.07
0000025273 363.35
0000025356 351.88
0000025455 434.77
0000025487 352.18
0000025527 382.65
0000025630 383.61
0000025660 400.15
0000025663 325.72
0000025785 342.14
0000025860 287.19
0000025933 378.77
0000026016 334.84
0000026083 357.55
0000026149 367.37
0000026197 388.14
05413136000147 0000005704 199.85
0000005709 187.22
0000005710 157.40
0000005711 81.30
So.. How can I create the following dataframe?
cnpj num_doc sum_bc_icms_df1 sum_vlr_item_df2 diff
0001 0000001 100.000,00 98.000,00 2.000,00
0001 0000002 50.000,00 50.000,00 0,00
0002 0000001 10.000,00 12.000,00 -2.000,00
0003 0000001 60.000,00 55.000,00 5.000,00
Or maybe achieving what I want (get the difference between the sum of values) by creating a way to conditionally sum the values using an AND condition... but I don't know how... yet.
答案 0 :(得分:2)
you can do it this way:
In [48]: g1 = df1.groupby(['cnpj','num_doc']).sum()
In [49]: g2 = df2.groupby(['cnpj','num_doc']).sum()
In [50]: joined = g1.join(g2).fillna(0)
In [51]: joined
Out[51]:
bc_icms vlr_item
cnpj num_doc
2364118000124 15263 37353.10 37353.10
15264 56214.14 56214.14
15265 2115.92 2115.92
2817342000124 10154 17827.07 13625.31
7374346000107 14224 320.12 320.12
14231 385.04 385.04
14263 401.28 401.28
14279 391.26 391.26
7720786000160 20128 355.53 0.00
20437 461.32 0.00
20438 168.04 0.00
20439 385.23 0.00
20444 378.80 0.00
20446 289.67 0.00
20451 223.36 0.00
20461 305.31 0.00
8953538000122 12865 686.00 0.00
54921580000189 112428 108000.00 0.00
57101370000132 2042 200.00 0.00
In [53]: joined['diff'] = joined['bc_icms'] - joined['vlr_item']
In [54]: joined
Out[54]:
bc_icms vlr_item diff
cnpj num_doc
2364118000124 15263 37353.10 37353.10 0.00
15264 56214.14 56214.14 0.00
15265 2115.92 2115.92 0.00
2817342000124 10154 17827.07 13625.31 4201.76
7374346000107 14224 320.12 320.12 0.00
14231 385.04 385.04 0.00
14263 401.28 401.28 0.00
14279 391.26 391.26 0.00
7720786000160 20128 355.53 0.00 355.53
20437 461.32 0.00 461.32
20438 168.04 0.00 168.04
20439 385.23 0.00 385.23
20444 378.80 0.00 378.80
20446 289.67 0.00 289.67
20451 223.36 0.00 223.36
20461 305.31 0.00 305.31
8953538000122 12865 686.00 0.00 686.00
54921580000189 112428 108000.00 0.00 108000.00
57101370000132 2042 200.00 0.00 200.00
One-liner:
joined = df1.groupby(['cnpj','num_doc']).sum() \
.join(df2.groupby(['cnpj','num_doc']).sum()) \
.fillna(0)