Grouping on a MultiIndex in pandas

Date: 2019-11-13 18:55:10

Tags: python python-3.x pandas

I'm selecting some data like this:

base = spark.sql("""
    SELECT
        ...
        ...
""")
print(base.count())
base.cache()
base=base.toPandas()
base['yyyy_mm_dd'] = pd.to_datetime(base['yyyy_mm_dd'])
base.set_index("yyyy_mm_dd", inplace=True)

This gives me a dataframe like this:

              id    aggregated_field    aggregated_field2
yyyy_mm_dd

I want to group by yyyy_mm_dd and id, but sum the aggregated fields. That way I can see, per day, the sum of each aggregated field per provider. Then I want to roll this up to monthly. This is what I did:

agg = base.groupby(['yyyy_mm_dd', 'id'])[['aggregated_field','aggregated_field2']].sum()

My dataframe now looks like this:

                  aggregated_field    aggregated_field2
yyyy_mm_dd  id

Finally, I try to resample() to monthly:

agg = agg.resample('M').sum()

Then I get this error:

TypeError: Only valid with DatetimeIndex, TimedeltaIndex or PeriodIndex, but got an instance of 'MultiIndex'

I'm not sure why, since I converted yyyy_mm_dd to a datetime index earlier.
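For reference, inspecting the index at this point (a minimal check, assuming the steps above) shows that the groupby result no longer has a plain DatetimeIndex, which is what the error is complaining about:

>>> type(agg.index)
<class 'pandas.core.indexes.multi.MultiIndex'>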

Edit: the output I'm looking for is something like this:

yyyy_mm_dd    id   aggregated_metric    aggregated_metric2
2019-01-01    1    ...                  ...
              2
              3
2019-01-02    1
              2
              3

1 Answer:

Answer 0 (score: 1):

Perhaps you will find this useful:

Solution 1 (using pd.Period and its "native" way of displaying monthly data)

>>> import pandas as pd

>>> base = \
pd.DataFrame(
    {
        'yyyy_mm_dd': ['2012-01-01','2012-01-01','2012-01-02','2012-01-02','2012-02-01','2012-02-01','2012-02-02','2012-02-02'],
        'id': [1,2,1,2,1,2,1,2],
        'aggregated_field': [0,1,2,3,4,5,6,7],
        'aggregated_field2': [100,101,102,103,104,105,106,107]
    }
)

>>> base
   yyyy_mm_dd  id  aggregated_field  aggregated_field2
0  2012-01-01   1                 0                100
1  2012-01-01   2                 1                101
2  2012-01-02   1                 2                102
3  2012-01-02   2                 3                103
4  2012-02-01   1                 4                104
5  2012-02-01   2                 5                105
6  2012-02-02   1                 6                106
7  2012-02-02   2                 7                107

>>> base['yyyy_mm_dd'] = pd.to_datetime(base['yyyy_mm_dd'])
>>> base['yyyy_mm'] = base['yyyy_mm_dd'].dt.to_period('M')
>>> agg = base.groupby(['yyyy_mm', 'id'])[['aggregated_field','aggregated_field2']].sum()

>>> agg
            aggregated_field  aggregated_field2
yyyy_mm id                                     
2012-01 1                  2                202
        2                  4                204
2012-02 1                 10                210
        2                 12                212
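If the flat column layout from the question is preferred over the MultiIndex display, the grouped levels can be moved back into columns with reset_index; a minimal sketch, assuming the agg frame built above:

>>> agg.reset_index()
   yyyy_mm  id  aggregated_field  aggregated_field2
0  2012-01   1                 2                202
1  2012-01   2                 4                204
2  2012-02   1                10                210
3  2012-02   2                12                212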

Solution 2 (sticking with datetime64)

>>> import pandas as pd

>>> base = \
pd.DataFrame(
    {
        'yyyy_mm_dd': ['2012-01-01','2012-01-01','2012-01-02','2012-01-02','2012-02-01','2012-02-01','2012-02-02','2012-02-02'],
        'id': [1,2,1,2,1,2,1,2],
        'aggregated_field': [0,1,2,3,4,5,6,7],
        'aggregated_field2': [100,101,102,103,104,105,106,107]
    }
)

>>> base
   yyyy_mm_dd  id  aggregated_field  aggregated_field2
0  2012-01-01   1                 0                100
1  2012-01-01   2                 1                101
2  2012-01-02   1                 2                102
3  2012-01-02   2                 3                103
4  2012-02-01   1                 4                104
5  2012-02-01   2                 5                105
6  2012-02-02   1                 6                106
7  2012-02-02   2                 7                107

>>> base['yyyy_mm_dd'] = pd.to_datetime(base['yyyy_mm_dd'])
>>> base['yyyy_mm_dd_month_start'] = base['yyyy_mm_dd'].values.astype('datetime64[M]')
>>> agg = base.groupby(['yyyy_mm_dd_month_start', 'id'])[['aggregated_field','aggregated_field2']].sum()

>>> agg
                           aggregated_field  aggregated_field2
yyyy_mm_dd_month_start id                                     
2012-01-01             1                  2                202
                       2                  4                204
2012-02-01             1                 10                210
                       2                 12                212
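As for the original TypeError: resample('M') only works when the frame's index itself is a DatetimeIndex, TimedeltaIndex or PeriodIndex, and after your groupby the index is a MultiIndex of (yyyy_mm_dd, id), so pandas refuses. If you want to keep the daily-to-monthly step in a single groupby, a pd.Grouper on the datetime column works too; a minimal sketch, assuming the same example base as above (not part of the original answer, and note the month-end labels):

>>> # group the datetime column by calendar month and by id in one pass
>>> agg = base.groupby([pd.Grouper(key='yyyy_mm_dd', freq='M'), 'id'])[['aggregated_field', 'aggregated_field2']].sum()
>>> agg
               aggregated_field  aggregated_field2
yyyy_mm_dd id                                     
2012-01-31 1                  2                202
           2                  4                204
2012-02-29 1                 10                210
           2                 12                212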