I'm selecting some data like this:
base = spark.sql("""
SELECT
...
...
""")
print(base.count())
base.cache()
base = base.toPandas()
base['yyyy_mm_dd'] = pd.to_datetime(base['yyyy_mm_dd'])
base.set_index("yyyy_mm_dd", inplace=True)
This gives me a dataframe like this:
            id  aggregated_field  aggregated_field2
yyyy_mm_dd
I'd like to group by yyyy_mm_dd and id, but sum the aggregated fields, so that I can see the daily total of the aggregated fields per provider. Then I'd like to roll that up to monthly. This is what I did:
agg = base.groupby(['yyyy_mm_dd', 'id'])[['aggregated_field','aggregated_field2']].sum()
My dataframe now looks like this:
                aggregated_field  aggregated_field2
yyyy_mm_dd  id
Finally, I try to resample() to monthly:
agg = agg.resample('M').sum()
Then I get this error:
TypeError: Only valid with DatetimeIndex, TimedeltaIndex or PeriodIndex, but got an instance of 'MultiIndex'
I'm not sure why, since I converted yyyy_mm_dd to a date index earlier.
Edit: the output I'm looking for is something like this:
yyyy_mm_dd  id  aggregated_metric  aggregated_metric2
2019-01-01   1  ...                ...
             2
             3
2019-01-02   1
             2
             3
Answer (score: 1)
Perhaps you'll find this useful. The resample('M') call fails because the groupby result is indexed by a MultiIndex (yyyy_mm_dd, id), and resample() only works directly on a DatetimeIndex, TimedeltaIndex or PeriodIndex. Here are two ways around that:
Solution 1 (adopt pd.Period and its natural way of displaying monthly data)
>>> import pandas as pd
>>> base = \
pd.DataFrame(
{
'yyyy_mm_dd': ['2012-01-01','2012-01-01','2012-01-02','2012-01-02','2012-02-01','2012-02-01','2012-02-02','2012-02-02'],
'id': [1,2,1,2,1,2,1,2],
'aggregated_field': [0,1,2,3,4,5,6,7],
'aggregated_field2': [100,101,102,103,104,105,106,107]
}
)
>>> base
  yyyy_mm_dd  id  aggregated_field  aggregated_field2
0 2012-01-01   1                 0                100
1 2012-01-01   2                 1                101
2 2012-01-02   1                 2                102
3 2012-01-02   2                 3                103
4 2012-02-01   1                 4                104
5 2012-02-01   2                 5                105
6 2012-02-02   1                 6                106
7 2012-02-02   2                 7                107
>>> base['yyyy_mm_dd'] = pd.to_datetime(base['yyyy_mm_dd'])
>>> base['yyyy_mm'] = base['yyyy_mm_dd'].dt.to_period('M')
>>> agg = base.groupby(['yyyy_mm', 'id'])[['aggregated_field','aggregated_field2']].sum()
>>> agg
            aggregated_field  aggregated_field2
yyyy_mm id
2012-01 1                  2                202
        2                  4                204
2012-02 1                 10                210
        2                 12                212
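If you prefer the flat layout shown in the question's expected output (dates and ids as ordinary columns rather than an index), you could reset the index afterwards. A minimal sketch, continuing the Solution 1 session above (agg_flat is just an illustrative name):
>>> agg_flat = agg.reset_index()  # turn the (yyyy_mm, id) MultiIndex back into ordinary columns
>>> agg_flat
   yyyy_mm  id  aggregated_field  aggregated_field2
0  2012-01   1                 2                202
1  2012-01   2                 4                204
2  2012-02   1                10                210
3  2012-02   2                12                212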
Solution 2 (stick with datetime64)
>>> import pandas as pd
>>> base = \
pd.DataFrame(
{
'yyyy_mm_dd': ['2012-01-01','2012-01-01','2012-01-02','2012-01-02','2012-02-01','2012-02-01','2012-02-02','2012-02-02'],
'id': [1,2,1,2,1,2,1,2],
'aggregated_field': [0,1,2,3,4,5,6,7],
'aggregated_field2': [100,101,102,103,104,105,106,107]
}
)
>>> base
  yyyy_mm_dd  id  aggregated_field  aggregated_field2
0 2012-01-01   1                 0                100
1 2012-01-01   2                 1                101
2 2012-01-02   1                 2                102
3 2012-01-02   2                 3                103
4 2012-02-01   1                 4                104
5 2012-02-01   2                 5                105
6 2012-02-02   1                 6                106
7 2012-02-02   2                 7                107
>>> base['yyyy_mm_dd'] = pd.to_datetime(base['yyyy_mm_dd'])
>>> base['yyyy_mm_dd_month_start'] = base['yyyy_mm_dd'].values.astype('datetime64[M]')
>>> agg = base.groupby(['yyyy_mm_dd_month_start', 'id'])[['aggregated_field','aggregated_field2']].sum()
>>> agg
                            aggregated_field  aggregated_field2
yyyy_mm_dd_month_start  id
2012-01-01              1                  2                202
                        2                  4                204
2012-02-01              1                 10                210
                        2                 12                212
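Finally, coming back to the error in the question: you can also keep the daily frame with its (yyyy_mm_dd, id) MultiIndex and roll it up to monthly by grouping again with pd.Grouper, rather than resampling. A minimal sketch, continuing the session above (daily and monthly are illustrative names); note that freq='M' labels each bucket with the month end (e.g. 2012-01-31), just as resample('M') would:
>>> # rebuild the daily frame from the question (MultiIndex: yyyy_mm_dd, id)
>>> daily = base.groupby(['yyyy_mm_dd', 'id'])[['aggregated_field', 'aggregated_field2']].sum()
>>> # roll it up to monthly: pd.Grouper buckets the datetime level, so no resample() is needed
>>> monthly = \
        daily.groupby(
            [
                pd.Grouper(level='yyyy_mm_dd', freq='M'),  # month-end buckets on the datetime level
                pd.Grouper(level='id')                     # keep grouping by the id level
            ]
        ).sum()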