Suppose I have the following DataFrame:
import numpy as np
import pandas as pd
df = pd.DataFrame(
    {
        'state': ['CA', 'WA', 'CO', 'AZ'] * 3,
        'office_id': list(range(1, 7)) * 2,
        # pd.np was removed in pandas 2.0, so call NumPy directly
        'sales': [np.random.randint(100000, 999999) for _ in range(12)]
    }
)
which looks like this:
office_id sales state
0 1 903325 CA
1 2 364594 WA
2 3 737728 CO
3 4 239378 AZ
4 5 833003 CA
5 6 501536 WA
6 1 920821 CO
7 2 879602 AZ
8 3 661818 CA
9 4 548888 WA
10 5 842459 CO
11 6 906791 AZ
Now I do a groupby on office_id and state:
df.groupby(["office_id", "state"]).aggregate({"sales": "sum"})
This results in:
sales
office_id state
1 CA 903325
CO 920821
2 AZ 879602
WA 364594
3 CA 661818
CO 737728
4 AZ 239378
WA 548888
5 CA 833003
CO 842459
6 AZ 906791
WA 501536
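Is it possible to add, for each office_id, a new index entry total that holds the sum of the sales column over that office's states?
I can compute it by grouping by "office_id" and summing, but that gives me a new DataFrame and I have not managed to merge it back in.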
Answer 0 (score: 2)
You can reshape with Series.unstack, add a new column total, and then reshape back with DataFrame.stack. If you need a DataFrame with a MultiIndex rather than a Series, use Series.to_frame:
df1 = df.groupby(["office_id", "state"])['sales'].sum().unstack()
df1['total'] = df1.sum(axis=1)
df1 = df1.stack().to_frame('sales')
print (df1)
sales
office_id state
1 CA 505047.0
CO 724412.0
total 1229459.0
2 AZ 402775.0
WA 339803.0
total 742578.0
3 CA 343655.0
CO 833474.0
total 1177129.0
4 AZ 574130.0
WA 656577.0
total 1230707.0
5 CA 122260.0
CO 207717.0
total 329977.0
6 AZ 262568.0
WA 504491.0
total 767059.0
df1 = df.groupby(["office_id", "state"])['sales'].sum().unstack()
df1['total'] = df1.sum(axis=1)
df1 = df1.stack().to_frame('sales')
#cast if sales are always integers
df1.sales = df1.sales.astype(int)
print (df1)
sales
office_id state
1 CA 323107
CO 658336
total 981443
2 AZ 273728
WA 942249
total 1215977
3 CA 773390
CO 692275
total 1465665
4 AZ 669435
WA 735141
total 1404576
5 CA 530182
CO 232104
total 762286
6 AZ 532248
WA 951481
total 1483729
Timings:
def jez(df):
    df1 = df.groupby(["office_id", "state"])['sales'].sum().unstack()
    df1['total'] = df1.sum(axis=1)
    df1 = df1.stack().to_frame('sales')
    return df1
print (jez(df))
In [339]: %timeit (df.pivot_table(index='office_id', columns='state', margins=True, margins_name='total', aggfunc='sum').stack())
100 loops, best of 3: 14.6 ms per loop
In [340]: %timeit (jez(df))
100 loops, best of 3: 2.78 ms per loop
Answer 1 (score: 2)
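pandas has built-in support for this: call pivot_table with the margins parameter set to True. The total rows only end up in the right place because 'total' is lowercase, and uppercase letters sort before lowercase ones.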
df.pivot_table(index='office_id', columns='state', margins=True,
margins_name='total', aggfunc='sum').stack()
sales
office_id state
1 CA 415727.0
CO 240142.0
total 655869.0
2 AZ 126350.0
WA 385698.0
total 512048.0
3 CA 387320.0
CO 487075.0
total 874395.0
4 AZ 978018.0
WA 878368.0
total 1856386.0
5 CA 105057.0
CO 852025.0
total 957082.0
6 AZ 130853.0
WA 435940.0
total 566793.0
total AZ 1235221.0
CA 908104.0
CO 1579242.0
WA 1700006.0
total 5422573.0
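The ordering remark can be checked without pandas. The sketch below assumes the labels are compared as ordinary strings by code point, which is how Python sorts str values; the label lists are made up for illustration.

```python
# Uppercase ASCII letters have lower code points than lowercase ones,
# so a lowercase 'total' sorts after every uppercase state code.
lower = sorted(["total", "WA", "AZ"])

# A capitalized 'Total' is compared letter by letter with the state codes
# and can land between them.
upper = sorted(["Total", "WA", "AZ"])

print(lower)
print(upper)
```

With a capitalized margins name, sorting the index could therefore place the total row between AZ and WA instead of last.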
Answer 2 (score: 0)
You can also use concat to append the aggregated totals, as shown below.
pd.concat([df.groupby(["office_id", "state"]).aggregate({"sales": "sum"}),
df.groupby(["state"]).aggregate({"sales": "sum"})
.set_index([['Total', 'Total', 'Total', 'Total']], append=True).swaplevel(0, 1)])
This returns:
sales
office_id state
1 CA 914776
CO 902173
2 AZ 605783
WA 865189
3 CA 280203
CO 556867
4 AZ 958747
WA 643333
5 CA 703606
CO 644399
6 AZ 768268
WA 834051
Total AZ 2332798
CA 1898585
CO 2103439
WA 2342573
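Here the DataFrame is aggregated twice: once at the office/state level and once at the state level. The two results are joined with pd.concat. Before concatenating, the state-level DataFrame must be given an additional index level, which is done with set_index. The index levels must then be swapped so that they line up with the office/state-level DataFrame.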
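The code above totals by state. For the per-office totals the question asks about, the same concat idea might look like the sketch below. The sales figures are fixed, made-up values so the result can be checked, and the names by_office_state and by_office are illustrative only.

```python
import pandas as pd

# Fixed, made-up sales figures so the totals are easy to verify by hand.
df = pd.DataFrame({
    "state": ["CA", "WA", "CO", "AZ"] * 3,
    "office_id": list(range(1, 7)) * 2,
    "sales": [10, 20, 30, 40, 50, 60, 70, 80, 90, 100, 110, 120],
})

# Sales per (office_id, state) pair.
by_office_state = df.groupby(["office_id", "state"])["sales"].sum()

# Sales per office, re-indexed as (office_id, 'total') so that both
# Series share the same two index levels.
by_office = df.groupby("office_id")["sales"].sum()
by_office.index = pd.MultiIndex.from_product(
    [by_office.index, ["total"]], names=["office_id", "state"]
)

# Stack the two Series and sort so each total follows its office's states.
result = pd.concat([by_office_state, by_office]).sort_index().to_frame("sales")
print(result)
```

Because the values never pass through a wide table with missing cells, the sales column stays integer and no cast is needed. The lowercase 'total' label is what makes sort_index place it after the uppercase state codes.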