How can I get the difference between the conditional sum of values from 2 dataframes?

时间:2016-04-04 18:31:44

标签: python numpy pandas group-by dataframe

I have 2 pandas DataFrames (df1 and df2) with the following headers:

DF1: cnpj     num_doc   bc_icms
DF2: cnpj     num_doc   vlr_item

My question:

How can I compare the sum() of BC_ICMS of each NUM_DOC for each CNPJ from DF1, against the sum() of VLR_ITEM of each ocurrence from the same NUM_DOC and CNPJ from DF2 ?


My thoughts:

I believe it's something like "subtract the row that contains the result of aggfunc="sum" of df1 from df2"

Or maybe finding a way to create something like:

cnpj    num_doc    sum_bc_icms_df1    sum_vlr_item_df2    diff

Where diff is a columns that contains the result of vlr_item_df1 - vlr_item_df2


My data:

DF1:

               cnpj     num_doc    bc_icms
0    02817342000124  0000010154   17827.07
1    54921580000189  0000112428  108000.00
2    08953538000122  0000012865     232.00
3    08953538000122  0000012865     239.00
4    08953538000122  0000012865     215.00
5    07374346000107  0000014224     320.12
6    07374346000107  0000014231     385.04
7    07374346000107  0000014263     401.28
8    07374346000107  0000014279     391.26
9    02364118000124  0000015263   37353.10
10   02364118000124  0000015264   56214.14
11   02364118000124  0000015265    2115.92
12   07720786000160  0000020128     355.53
13   57101370000132  0000002042     200.00
14   07720786000160  0000020437     461.32
15   07720786000160  0000020438     168.04
16   07720786000160  0000020439     385.23
17   07720786000160  0000020444     378.80
18   07720786000160  0000020446     289.67
19   07720786000160  0000020451     223.36
20   07720786000160  0000020461     305.31

DF2:

               cnpj     num_doc  vlr_item
0    02817342000124  0000010154   1391,04
1    02817342000124  0000010154     100,0
2    02817342000124  0000010154   1587,09
3    02817342000124  0000010154     200,0
4    02817342000124  0000010154    1430,0
5    02817342000124  0000010154     592,0
6    02817342000124  0000010154   7508,99
7    02817342000124  0000010154    443,03
8    02817342000124  0000010154      29,9
9    02817342000124  0000010154    343,26
10   07374346000107  0000014224    320,12
11   07374346000107  0000014231    385,04
12   07374346000107  0000014263    401,28
13   07374346000107  0000014279    391,26
14   02364118000124  0000015263  29725,96
15   02364118000124  0000015263   6590,31
16   02364118000124  0000015263   1036,83
17   02364118000124  0000015264  47348,71
18   02364118000124  0000015264   7641,97
19   02364118000124  0000015264   1223,46
20   02364118000124  0000015265   2115,92

Rules:

  • 1 CNPJ emits N NUM_DOC.
  • It can happen that CNPJ A emits NUM_DOC 1 and CNPJ B emits NUM_DOC 1 too.
  • Each CNPJ can appear more than once.
  • Each NUM_DOC can appear more than once.

My code:

From df1:

g1 = df2.groupby(['cnpj','num_doc'])['vlr_item'].sum()

Output:    

cnpj            num_doc   
00903175000109  0000059253       100.00
01114678000168  0000004705       358.00
01116428000167  0000006883       165.00
02295440000149  0000051845        72.65
                0000051866        69.90
02364118000124  0000015263     37353.10
                0000015264     56214.14
                0000015265      2115.92
                0000015571     39873.74
                0000015572     52611.06
                0000015573      1121.64
02817342000124  0000010154     17827.07
                0000010299       407.02
03509978001143  0000145539     17000.00
                0000145744    108000.00
                0000145745    108000.00
                0000145746    108000.00
                0000145749     15000.00
                0000145750     15000.00
                0000145753     15000.00
03572854000130  0000007298      7190.54
03688838000108  0000006286       155.77
04967974000280  0000025175       388.18
                0000025230       384.07
                0000025273       363.35
                0000025356       351.88
                0000025455       434.77
                0000025487       352.18
                0000025527       382.65
                0000025630       383.61

And from df2:

g2 = df2.groupby(['cnpj','num_doc'])['vlr_item'].sum()

Output:

cnpj            num_doc   
02364118000124  0000015263    37353.10
                0000015264    56214.14
                0000015265     2115.92
                0000015571    39873.74
                0000015572    52611.06
                0000015573     1121.64
02817342000124  0000010154    13625.31
                0000010299      407.02
03572854000130  0000007298     7170.54
04967974000280  0000025175      388.18
                0000025230      384.07
                0000025273      363.35
                0000025356      351.88
                0000025455      434.77
                0000025487      352.18
                0000025527      382.65
                0000025630      383.61
                0000025660      400.15
                0000025663      325.72
                0000025785      342.14
                0000025860      287.19
                0000025933      378.77
                0000026016      334.84
                0000026083      357.55
                0000026149      367.37
                0000026197      388.14
05413136000147  0000005704      199.85
                0000005709      187.22
                0000005710      157.40
                0000005711       81.30

So.. How can I create the following dataframe?

cnpj    num_doc    sum_bc_icms_df1    sum_vlr_item_df2    diff
0001    0000001    100.000,00          98.000,00          2.000,00
0001    0000002    50.000,00           50.000,00          0,00
0002    0000001    10.000,00           12.000,00          -2.000,00
0003    0000001    60.000,00           55.000,00          5.000,00

Or maybe achieving what I want (get the difference between the sum of values) by creating a way to conditionally sum the values using an AND condition... but I don't know how... yet.

1 个答案:

答案 0 :(得分:2)

you can do it this way:

In [48]: g1 = df1.groupby(['cnpj','num_doc']).sum()

In [49]: g2 = df2.groupby(['cnpj','num_doc']).sum()

In [50]: joined = g1.join(g2).fillna(0)

In [51]: joined
Out[51]:
                          bc_icms  vlr_item
cnpj           num_doc
2364118000124  15263     37353.10  37353.10
               15264     56214.14  56214.14
               15265      2115.92   2115.92
2817342000124  10154     17827.07  13625.31
7374346000107  14224       320.12    320.12
               14231       385.04    385.04
               14263       401.28    401.28
               14279       391.26    391.26
7720786000160  20128       355.53      0.00
               20437       461.32      0.00
               20438       168.04      0.00
               20439       385.23      0.00
               20444       378.80      0.00
               20446       289.67      0.00
               20451       223.36      0.00
               20461       305.31      0.00
8953538000122  12865       686.00      0.00
54921580000189 112428   108000.00      0.00
57101370000132 2042        200.00      0.00

In [53]: joined['diff'] = joined['bc_icms'] - joined['vlr_item']

In [54]: joined
Out[54]:
                          bc_icms  vlr_item       diff
cnpj           num_doc
2364118000124  15263     37353.10  37353.10       0.00
               15264     56214.14  56214.14       0.00
               15265      2115.92   2115.92       0.00
2817342000124  10154     17827.07  13625.31    4201.76
7374346000107  14224       320.12    320.12       0.00
               14231       385.04    385.04       0.00
               14263       401.28    401.28       0.00
               14279       391.26    391.26       0.00
7720786000160  20128       355.53      0.00     355.53
               20437       461.32      0.00     461.32
               20438       168.04      0.00     168.04
               20439       385.23      0.00     385.23
               20444       378.80      0.00     378.80
               20446       289.67      0.00     289.67
               20451       223.36      0.00     223.36
               20461       305.31      0.00     305.31
8953538000122  12865       686.00      0.00     686.00
54921580000189 112428   108000.00      0.00  108000.00
57101370000132 2042        200.00      0.00     200.00

One-liner:

joined = df1.groupby(['cnpj','num_doc']).sum() \
            .join(df2.groupby(['cnpj','num_doc']).sum()) \
            .fillna(0)