想想下表
np.random.seed(42)
ix = pd.date_range('2017-01-01', '2017-01-15', freq='60s')
df = pd.DataFrame(
{
'val': np.random.random(size=ix.shape[0]),
'active': np.random.choice([0,1], size=ix.shape[0])
},
index=ix
)
df.sample(10)
屈服:
active val
2017-01-02 06:05:00 1 0.774654
2017-01-04 08:15:00 1 0.934796
2017-01-13 01:02:00 0 0.792351...
我的目标是计算:
每日总和这个很简单:
gb = df.groupby(pd.to_datetime(df.index.date))
overall_sum_per_day = gb['val'].sum().rename('overall')
每个活跃日期的总和这有点棘手(参见this)。
active_sum_per_day = gb.agg(lambda x: x[x.active==1]['val'].sum())['val'].rename('active')
我的问题如何将两者结合起来。使用concat
:
pd.concat([overall_sum_per_day, active_sum_per_day], axis=1)
我可以实现我的目标。但我没有一次完成它并立即应用这两个聚合。可能吗?请参阅此comment。
答案 0 :(得分:4)
您可以使用GroupBy.apply
:
b = gb.apply(lambda x: pd.Series([x['val'].sum(), x.loc[x.active==1, 'val'].sum()],
index=['overall', 'active']))
print (b)
overall active
2017-01-01 715.997165 366.856234
2017-01-02 720.101832 355.100828
2017-01-03 711.247370 335.231948
2017-01-04 713.688122 338.088299
2017-01-05 716.127970 342.889442
2017-01-06 697.319129 338.741027
2017-01-07 708.121948 361.086977
2017-01-08 731.032093 370.697884
2017-01-09 718.386679 342.162494
2017-01-10 709.706473 349.657514
2017-01-11 720.477342 368.407343
2017-01-12 738.286682 378.618305
2017-01-13 735.805583 372.039108
2017-01-14 727.502271 345.612816
2017-01-15 0.613559 0.613559
另一种解决方案:
b = gb.agg(lambda x: [x['val'].sum(), x.loc[x.active==1, 'val'].sum()])
.rename(columns={'val':'overall'})
print (b)
active overall
2017-01-01 715.997165 366.856234
2017-01-02 720.101832 355.100828
2017-01-03 711.247370 335.231948
2017-01-04 713.688122 338.088299
2017-01-05 716.127970 342.889442
2017-01-06 697.319129 338.741027
2017-01-07 708.121948 361.086977
2017-01-08 731.032093 370.697884
2017-01-09 718.386679 342.162494
2017-01-10 709.706473 349.657514
2017-01-11 720.477342 368.407343
2017-01-12 738.286682 378.618305
2017-01-13 735.805583 372.039108
2017-01-14 727.502271 345.612816
2017-01-15 0.613559 0.613559
答案 1 :(得分:3)
IIUC我们可以一步完成,使用您原来的DF:
In [105]: df.groupby([df.index.normalize(), 'active'])['val'] \
...: .sum() \
...: .unstack(fill_value=0) \
...: .rename(columns={0:'overall', 1:'active'}) \
...: .assign(overall=lambda x: x['overall'] + x['active'])
Out[105]:
active overall active
2017-01-01 715.997165 366.856234
2017-01-02 720.101832 355.100828
2017-01-03 711.247370 335.231948
2017-01-04 713.688122 338.088299
2017-01-05 716.127970 342.889442
... ... ...
2017-01-11 720.477342 368.407343
2017-01-12 738.286682 378.618305
2017-01-13 735.805583 372.039108
2017-01-14 727.502271 345.612816
2017-01-15 0.613559 0.613559
[15 rows x 2 columns]
说明:
In [64]: df.groupby([df.index.normalize(), 'active'])['val'].sum()
Out[64]:
active
2017-01-01 0 349.140931
1 366.856234
2017-01-02 0 365.001004
1 355.100828
2017-01-03 0 376.015422
...
2017-01-13 0 363.766475
1 372.039108
2017-01-14 0 381.889455
1 345.612816
2017-01-15 1 0.613559
Name: val, Length: 29, dtype: float64
In [65]: df.groupby([df.index.normalize(), 'active'])['val'].sum().unstack(fill_value=0)
Out[65]:
active 0 1
2017-01-01 349.140931 366.856234
2017-01-02 365.001004 355.100828
2017-01-03 376.015422 335.231948
2017-01-04 375.599823 338.088299
2017-01-05 373.238528 342.889442
... ... ...
2017-01-11 352.069999 368.407343
2017-01-12 359.668377 378.618305
2017-01-13 363.766475 372.039108
2017-01-14 381.889455 345.612816
2017-01-15 0.000000 0.613559
[15 rows x 2 columns]
答案 2 :(得分:1)
我认为使用为{date}分组构建的pd.Grouper
进行分组更加清晰。您也可以定义一个功能,以便清晰。
def func(df):
active = (df['active'] * df['val']).sum()
overall = df['val'].sum()
return pd.Series(data=[active, overall], index=['active','overall'])
df.groupby(pd.Grouper(freq='d')).apply(func)
active overall
2017-01-01 366.856234 715.997165
2017-01-02 355.100828 720.101832
2017-01-03 335.231948 711.247370
2017-01-04 338.088299 713.688122
2017-01-05 342.889442 716.127970
2017-01-06 338.741027 697.319129
2017-01-07 361.086977 708.121948
2017-01-08 370.697884 731.032093
2017-01-09 342.162494 718.386679
2017-01-10 349.657514 709.706473
2017-01-11 368.407343 720.477342
2017-01-12 378.618305 738.286682
2017-01-13 372.039108 735.805583
2017-01-14 345.612816 727.502271
2017-01-15 0.613559 0.613559
您应该可以使用resample
和apply
there is a bug执行此操作。
df.resample('d').apply(func) # should work but doens't produce correct output
active val
2017-01-01 366.856234 NaN
2017-01-02 355.100828 NaN
2017-01-03 335.231948 NaN
2017-01-04 338.088299 NaN
2017-01-05 342.889442 NaN
2017-01-06 338.741027 NaN
2017-01-07 361.086977 NaN
2017-01-08 370.697884 NaN
2017-01-09 342.162494 NaN
2017-01-10 349.657514 NaN
2017-01-11 368.407343 NaN
2017-01-12 378.618305 NaN
2017-01-13 372.039108 NaN
2017-01-14 345.612816 NaN
2017-01-15 0.613559 NaN