How do I add a row for each sub-index in a pandas MultiIndex DataFrame?

Asked: 2017-01-04 15:04:55

Tags: python pandas

Suppose I have the following DataFrame:

import numpy as np
import pandas as pd

df = pd.DataFrame(
    {
        'state': ['CA', 'WA', 'CO', 'AZ'] * 3,
        'office_id': list(range(1, 7)) * 2,
        # pd.np was removed in pandas 2.0, so call NumPy directly
        'sales': np.random.randint(100000, 999999, size=12)
    }
)

which gives:

    office_id   sales state
0           1  903325    CA
1           2  364594    WA
2           3  737728    CO
3           4  239378    AZ
4           5  833003    CA
5           6  501536    WA
6           1  920821    CO
7           2  879602    AZ
8           3  661818    CA
9           4  548888    WA
10          5  842459    CO
11          6  906791    AZ

Now I group by office_id and state:

df.groupby(["office_id", "state"]).aggregate({"sales": "sum"})

This results in:

                  sales
office_id state
1         CA     903325
          CO     920821
2         AZ     879602
          WA     364594
3         CA     661818
          CO     737728
4         AZ     239378
          WA     548888
5         CA     833003
          CO     842459
6         AZ     906791
          WA     501536

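Is it possible to add a new index entry, total, for each office_id, holding the sum of the sales column over that office's states?

I can compute it by grouping by "office_id" and summing, but that gives me a new DataFrame and I have not managed to merge it back in.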

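3 answers:

Answer 0: (score: 2)

You can reshape with Series.unstack, add a new column total, and then reshape back with DataFrame.stack. If you need a DataFrame with a MultiIndex, finish with Series.to_frame: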

df1 = df.groupby(["office_id", "state"])['sales'].sum().unstack()
df1['total'] = df1.sum(axis=1)
df1 = df1.stack().to_frame('sales')
print (df1)
                     sales
office_id state           
1         CA      505047.0
          CO      724412.0
          total  1229459.0
2         AZ      402775.0
          WA      339803.0
          total   742578.0
3         CA      343655.0
          CO      833474.0
          total  1177129.0
4         AZ      574130.0
          WA      656577.0
          total  1230707.0
5         CA      122260.0
          CO      207717.0
          total   329977.0
6         AZ      262568.0
          WA      504491.0
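          total   767059.0

The values come back as floats because unstack fills the missing office/state pairs with NaN. If the sales are always integers, cast the column back to int:

df1 = df.groupby(["office_id", "state"])['sales'].sum().unstack()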
df1['total'] = df1.sum(axis=1)
df1 = df1.stack().to_frame('sales')
#cast if sales are always integers
df1.sales = df1.sales.astype(int)
print (df1)
                   sales
office_id state         
1         CA      323107
          CO      658336
          total   981443
2         AZ      273728
          WA      942249
          total  1215977
3         CA      773390
          CO      692275
          total  1465665
4         AZ      669435
          WA      735141
          total  1404576
5         CA      530182
          CO      232104
          total   762286
6         AZ      532248
          WA      951481
          total  1483729
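
Since the outputs above come from random data, here is a minimal reproducible sketch of the same unstack / total / stack round trip. The four-row dataset and the names wide, long_df and totals are made up for illustration. The explicit dropna() is an assumption about version differences: newer pandas releases may keep the NaN rows when stacking, while older ones drop them.

```python
import pandas as pd

# Small fixed dataset (made-up numbers) so the result is reproducible.
df = pd.DataFrame({
    "office_id": [1, 1, 2, 2],
    "state": ["CA", "CO", "AZ", "WA"],
    "sales": [100, 200, 300, 400],
})

# Wide layout: one row per office, one column per state (NaN where missing).
wide = df.groupby(["office_id", "state"])["sales"].sum().unstack()

# Row-wise sum skips NaN, so each office gets the total of its own states.
wide["total"] = wide.sum(axis=1)

# Back to long layout; drop the NaN placeholders explicitly, then cast to int.
long_df = wide.stack().dropna().astype(int).to_frame("sales")

# Pull out just the 'total' rows to compare against a plain groupby sum.
totals = long_df.xs("total", level="state")["sales"]
print(long_df)
```

Comparing totals with df.groupby("office_id")["sales"].sum() is a quick way to check that the added rows are what you expect.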

Timings

def jez(df):
    df1 = df.groupby(["office_id", "state"])['sales'].sum().unstack()
    df1['total'] = df1.sum(axis=1)
    df1 = df1.stack().to_frame('sales')
    return df1

print (jez(df))

In [339]: %timeit (df.pivot_table(index='office_id', columns='state', margins=True, margins_name='total', aggfunc='sum').stack())
100 loops, best of 3: 14.6 ms per loop

In [340]: %timeit (jez(df))
100 loops, best of 3: 2.78 ms per loop

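Answer 1: (score: 2)

pandas has this built in: call pivot_table with the margins parameter set to True. The result only sorts correctly because 'total' is lowercase, and uppercase letters sort before lowercase ones.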

df.pivot_table(index='office_id', columns='state', margins=True,
               margins_name='total', aggfunc='sum').stack()

                     sales
office_id state           
1         CA      415727.0
          CO      240142.0
          total   655869.0
2         AZ      126350.0
          WA      385698.0
          total   512048.0
3         CA      387320.0
          CO      487075.0
          total   874395.0
4         AZ      978018.0
          WA      878368.0
          total  1856386.0
5         CA      105057.0
          CO      852025.0
          total   957082.0
6         AZ      130853.0
          WA      435940.0
          total   566793.0
total     AZ     1235221.0
          CA      908104.0
          CO     1579242.0
          WA     1700006.0
          total  5422573.0
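
The margins output above also contains a grand-total block under office_id 'total', which the question did not ask for. A possible way to keep only the per-office totals is sketched below. The small dataset and the names table, per_office and order are made up for illustration, and the explicit dropna() is there because stack may or may not drop NaN rows depending on the pandas version.

```python
import pandas as pd

# Small fixed dataset (made-up numbers) so the result is reproducible.
df = pd.DataFrame({
    "office_id": [1, 1, 2, 2],
    "state": ["CA", "CO", "AZ", "WA"],
    "sales": [100, 200, 300, 400],
})

# margins=True adds a 'total' column (per office) and a 'total' row (per state).
table = df.pivot_table(index="office_id", columns="state", values="sales",
                       aggfunc="sum", margins=True, margins_name="total")

# Drop the grand-total row, then go back to the long layout.
per_office = (table.drop(index="total")
                   .stack()
                   .dropna()
                   .astype(int)
                   .to_frame("sales"))
print(per_office)

# Plain string ordering: uppercase state codes come before lowercase 'total'.
order = sorted(["total", "WA", "CA"])
```

If the state labels were lowercase, a label such as 'total' could end up in the middle of a sorted index, so a margins_name that sorts last in your data may be the safer choice.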

Answer 2: (score: 0)

You can also use concat to append the aggregated totals, as shown below.

pd.concat([df.groupby(["office_id", "state"]).aggregate({"sales": "sum"}),
           df.groupby(["state"]).aggregate({"sales": "sum"})
            .set_index([['Total', 'Total', 'Total', 'Total']], append=True).swaplevel(0, 1)])

which returns

                   sales
office_id state         
1         CA      914776
          CO      902173
2         AZ      605783
          WA      865189
3         CA      280203
          CO      556867
4         AZ      958747
          WA      643333
5         CA      703606
          CO      644399
6         AZ      768268
          WA      834051
Total     AZ     2332798
          CA     1898585
          CO     2103439
          WA     2342573

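Here, the DataFrame is aggregated once at the office/state level and once at the state level, and the two results are joined with pd.concat. Before concatenating, the state-level DataFrame must be given an extra index level, which is done with set_index. The index levels must then be swapped so that they line up with the office/state DataFrame. Note that this appends one total per state across all offices, not one total per office_id.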
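
For the per-office totals the question asks about, the same concat idea might be adapted as in the sketch below. This is a variation, not the code from the answer above. The small dataset and the names detail, totals and combined are made up for illustration, and the approach relies on 'total' sorting after the uppercase state codes.

```python
import pandas as pd

# Small fixed dataset (made-up numbers) so the result is reproducible.
df = pd.DataFrame({
    "office_id": [1, 1, 2, 2],
    "state": ["CA", "CO", "AZ", "WA"],
    "sales": [100, 200, 300, 400],
})

# Office/state level sums, indexed by (office_id, state).
detail = df.groupby(["office_id", "state"])[["sales"]].sum()

# Office level sums, given a 'state' level whose only label is 'total'
# so that the index has the same two levels as `detail`.
totals = (df.groupby("office_id")[["sales"]].sum()
            .assign(state="total")
            .set_index("state", append=True))

# Stack the two frames and sort so each total follows its office's rows.
combined = pd.concat([detail, totals]).sort_index()
print(combined)
```

Because the totals frame is built with the levels already in (office_id, state) order, no swaplevel call is needed here.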