Summing certain columns in a DataFrame with MultiIndex columns

Time: 2018-08-28 02:41:34

Tags: python pandas

I have a dataframe, created from a pivot table, whose columns are a two-level MultiIndex: the company name on the upper level and False Negative (FN), False Positive (FP), and True Positive (TP) on the lower level.

I am iterating over the upper level of the MultiIndex columns to create two sum columns for each company:

FSUM = FN + FP

SUM = FN + FP + TP

    import pandas as pd

    # Keys are (company, metric) tuples, so the columns become a two-level MultiIndex.
    d = {
        ('company1', 'False Negative'): {'April- 2012': 112.0, 'April- 2013': 370.0, 'April- 2014': 499.0,
                                         'August- 2012': 431.0, 'August- 2013': 496.0, 'August- 2014': 221.0},
        ('company1', 'False Positive'): {'April- 2012': 0.0, 'April- 2013': 544.0, 'April- 2014': 50.0,
                                         'August- 2012': 0.0, 'August- 2013': 0.0, 'August- 2014': 426.0},
        ('company1', 'True Positive'): {'April- 2012': 0.0, 'April- 2013': 140.0, 'April- 2014': 24.0,
                                        'August- 2012': 0.0, 'August- 2013': 0.0, 'August- 2014': 77.0},
        ('company2', 'False Negative'): {'April- 2012': 112.0, 'April- 2013': 370.0, 'April- 2014': 499.0,
                                         'August- 2012': 431.0, 'August- 2013': 496.0, 'August- 2014': 221.0},
        ('company2', 'False Positive'): {'April- 2012': 0.0, 'April- 2013': 544.0, 'April- 2014': 50.0,
                                         'August- 2012': 0.0, 'August- 2013': 0.0, 'August- 2014': 426.0},
        ('company2', 'True Positive'): {'April- 2012': 0.0, 'April- 2013': 140.0, 'April- 2014': 24.0,
                                        'August- 2012': 0.0, 'August- 2013': 0.0, 'August- 2014': 77.0},
    }
    df = pd.DataFrame(d)

                  company1         company2
                  FN   FP   TP     FN   FP   TP
    April- 2012   112  0    0      112  0    0
    April- 2013   370  544  140    370  544  140
    April- 2014   499  50   24     499  50   24
    August- 2012  431  0    0      431  0    0
    August- 2013  496  0    0      496  0    0
    August- 2014  221  426  77     221  426  77

I do not know the company names in advance, so this has to be done in a loop.
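The loop described above could look something like the following minimal sketch. It uses only two of the months from the data above to stay short. Reading the company names from `df.columns` at run time is one way to avoid hard-coding them; it is an assumption about how the loop might be written, not necessarily the original code.

```python
import pandas as pd

# A two-month subset of the data above, enough to show the loop.
d = {
    ('company1', 'False Negative'): {'April- 2012': 112.0, 'April- 2013': 370.0},
    ('company1', 'False Positive'): {'April- 2012': 0.0, 'April- 2013': 544.0},
    ('company1', 'True Positive'): {'April- 2012': 0.0, 'April- 2013': 140.0},
    ('company2', 'False Negative'): {'April- 2012': 112.0, 'April- 2013': 370.0},
    ('company2', 'False Positive'): {'April- 2012': 0.0, 'April- 2013': 544.0},
    ('company2', 'True Positive'): {'April- 2012': 0.0, 'April- 2013': 140.0},
}
df = pd.DataFrame(d)

# Company names are read from the upper column level, so none are hard-coded.
companies = df.columns.get_level_values(0).unique()

for company in companies:
    fn = df[(company, 'False Negative')]
    fp = df[(company, 'False Positive')]
    tp = df[(company, 'True Positive')]
    df[(company, 'FSUM')] = fn + fp        # FSUM = FN + FP
    df[(company, 'SUM')] = fn + fp + tp    # SUM = FN + FP + TP

# New columns are appended at the end; sorting groups them under each company.
df = df.sort_index(axis=1)
```

The list of companies is computed once before the loop starts, so adding columns inside the loop does not affect the iteration.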

2 Answers:

Answer 0: (score: 2)

Rearranging things with a few .stack and .unstack calls makes this a bit easier:

In [96]: df = df.unstack().unstack(1)

In [97]: df
Out[97]:
                       False Negative  False Positive  True Positive
company1 April- 2012            112.0             0.0            0.0
         April- 2013            370.0           544.0          140.0
         April- 2014            499.0            50.0           24.0
         August- 2012           431.0             0.0            0.0
         August- 2013           496.0             0.0            0.0
         August- 2014           221.0           426.0           77.0
company2 April- 2012            112.0             0.0            0.0
         April- 2013            370.0           544.0          140.0
         April- 2014            499.0            50.0           24.0
         August- 2012           431.0             0.0            0.0
         August- 2013           496.0             0.0            0.0
         August- 2014           221.0           426.0           77.0

In [98]: df['SUM'] = df.sum(axis=1)

In [99]: df['FSUM'] = df['False Negative'] + df['False Positive']

In [100]: df = df.stack().unstack([0,2])

In [101]: df
Out[101]:
                   company1                                              \
             False Negative False Positive True Positive     SUM   FSUM
April- 2012           112.0            0.0           0.0   112.0  112.0
April- 2013           370.0          544.0         140.0  1054.0  914.0
April- 2014           499.0           50.0          24.0   573.0  549.0
August- 2012          431.0            0.0           0.0   431.0  431.0
August- 2013          496.0            0.0           0.0   496.0  496.0
August- 2014          221.0          426.0          77.0   724.0  647.0

                   company2
             False Negative False Positive True Positive     SUM   FSUM
April- 2012           112.0            0.0           0.0   112.0  112.0
April- 2013           370.0          544.0         140.0  1054.0  914.0
April- 2014           499.0           50.0          24.0   573.0  549.0
August- 2012          431.0            0.0           0.0   431.0  431.0
August- 2013          496.0            0.0           0.0   496.0  496.0
August- 2014          221.0          426.0          77.0   724.0  647.0
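
The session above can be collected into a single script, sketched here on a two-month subset of the question's data. The one point worth noting is ordering: `SUM` is computed before `FSUM` is added, because `sum(axis=1)` adds up every column present at that moment. The variable names `long` and `wide` are illustrative only.

```python
import pandas as pd

# A two-month subset of the question's data.
d = {
    ('company1', 'False Negative'): {'April- 2012': 112.0, 'April- 2013': 370.0},
    ('company1', 'False Positive'): {'April- 2012': 0.0, 'April- 2013': 544.0},
    ('company1', 'True Positive'): {'April- 2012': 0.0, 'April- 2013': 140.0},
    ('company2', 'False Negative'): {'April- 2012': 112.0, 'April- 2013': 370.0},
    ('company2', 'False Positive'): {'April- 2012': 0.0, 'April- 2013': 544.0},
    ('company2', 'True Positive'): {'April- 2012': 0.0, 'April- 2013': 140.0},
}
df = pd.DataFrame(d)

# Move both column levels into the index, then bring the metric level back
# out as columns: rows are (company, month), columns are the three metrics.
long = df.unstack().unstack(1)

# SUM first: sum(axis=1) adds every column that exists at this point.
long['SUM'] = long.sum(axis=1)
long['FSUM'] = long['False Negative'] + long['False Positive']

# Back to the original layout: months as rows, (company, metric) as columns.
wide = long.stack().unstack([0, 2])
```

Because the sums are computed once on the reshaped frame, no explicit loop over company names is needed.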

Answer 1: (score: 1)

One approach is to use sum with the level argument, then pd.concat, and finally sort_index:

pd.concat([df,
           df.loc(axis=1)[:,['False Negative','False Positive']].sum(level=0, axis=1).assign(indx2 = 'FSUM').set_index('indx2', append=True).unstack(),
           df.sum(level=0, axis=1).assign(indx2='SUM').set_index('indx2', append=True).unstack()],
          axis=1).sort_index(axis=1)

Output:

             company1                                                      \
                 FSUM False Negative False Positive     SUM True Positive   
April- 2012     112.0          112.0            0.0   112.0           0.0   
April- 2013     914.0          370.0          544.0  1054.0         140.0   
April- 2014     549.0          499.0           50.0   573.0          24.0   
August- 2012    431.0          431.0            0.0   431.0           0.0   
August- 2013    496.0          496.0            0.0   496.0           0.0   
August- 2014    647.0          221.0          426.0   724.0          77.0   

             company2                                                      
                 FSUM False Negative False Positive     SUM True Positive  
April- 2012     112.0          112.0            0.0   112.0           0.0  
April- 2013     914.0          370.0          544.0  1054.0         140.0  
April- 2014     549.0          499.0           50.0   573.0          24.0  
August- 2012    431.0          431.0            0.0   431.0           0.0  
August- 2013    496.0          496.0            0.0   496.0           0.0  
August- 2014    647.0          221.0          426.0   724.0          77.0
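
In recent pandas releases the `level` argument of `sum` is no longer available (it was deprecated in pandas 1.3 and removed in 2.0), so the snippet above may fail on a current installation. The sketch below is one possible equivalent that groups on the transposed frame instead. It is not the answer's original code, the helper name `sum_by_company` is hypothetical, and it runs on a two-month subset of the question's data.

```python
import pandas as pd

# A two-month subset of the question's data.
d = {
    ('company1', 'False Negative'): {'April- 2012': 112.0, 'April- 2013': 370.0},
    ('company1', 'False Positive'): {'April- 2012': 0.0, 'April- 2013': 544.0},
    ('company1', 'True Positive'): {'April- 2012': 0.0, 'April- 2013': 140.0},
    ('company2', 'False Negative'): {'April- 2012': 112.0, 'April- 2013': 370.0},
    ('company2', 'False Positive'): {'April- 2012': 0.0, 'April- 2013': 544.0},
    ('company2', 'True Positive'): {'April- 2012': 0.0, 'April- 2013': 140.0},
}
df = pd.DataFrame(d)


def sum_by_company(frame, label):
    """Sum the columns of `frame` per company and label the result `label`."""
    # Transpose so companies sit on the row index, group on that level,
    # then transpose back so months are rows again.
    totals = frame.T.groupby(level=0).sum().T
    # Rebuild two-level columns: (company, label).
    totals.columns = pd.MultiIndex.from_product([totals.columns, [label]])
    return totals


# Boolean mask over the columns selecting only the two "False" metrics.
false_mask = df.columns.get_level_values(1).isin(['False Negative', 'False Positive'])

out = pd.concat(
    [df, sum_by_company(df.loc[:, false_mask], 'FSUM'), sum_by_company(df, 'SUM')],
    axis=1,
).sort_index(axis=1)
```

As in the original answer, `sort_index(axis=1)` orders the lower level alphabetically, so `FSUM` ends up before `False Negative` under each company.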