Dask DataFrame Groupby: most frequent value of a column in aggregate

Time: 2020-08-10 14:50:42

Tags: python pandas pandas-groupby dask dask-dataframe

Custom Dask GroupBy Aggregations are very handy, but I am struggling to define one that works for the most frequent value in a column.

What I have:

Following the example here, we can define a custom aggregation function like this:

import dask.dataframe as dd

custom_sum = dd.Aggregation('custom_sum', lambda s: s.sum(), lambda s0: s0.sum())
my_aggregate = {
    'A': custom_sum,
    'B': custom_most_often_value, ### <<< This is the goal.
    'C': ['max','min','mean'],
    'D': ['max','min','mean']
}
col_name = 'Z'
ddf_agg = ddf.groupby(col_name).agg(my_aggregate).compute()

While this works for custom_sum (as shown on the example page), an adaptation for the most frequent value might look like this (from the example here):

custom_most_often_value = dd.Aggregation('custom_most_often_value', lambda x:x.value_counts().index[0], lambda x0:x0.value_counts().index[0])

but it produces

ValueError: Metadata inference failed in `_agg_finalize`.

You have supplied a custom function and Dask is unable to 
determine the type of output that that function returns. 

I then tried to find a meta keyword in the dd.Aggregation implementation to define the output type, but could not find one. And the fact that it is not needed in the custom_sum example makes me think the error lies elsewhere.

So my question is: how do I get the most frequently occurring value of a column in df.groupby(..).agg(..)? Thanks!

2 answers:

Answer 0 (score: 1):

A quick clarification rather than an answer: the meta parameter is used in .agg() to specify the desired column data types, best expressed as a zero-length pandas DataFrame. Dask feeds your functions dummy data to try to guess those types, but this does not always succeed.
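Following this suggestion, a zero-length pandas DataFrame describing the output dtypes might be built like below; the column names and dtypes are illustrative (matching the question's my_aggregate dict), and the commented .agg() call is a sketch, not tested against a specific dask version:

```python
import pandas as pd

# Zero-length frame: no rows, but the dtypes describe the expected output.
# Column names and dtypes here are illustrative, based on the question.
meta = pd.DataFrame({'A': pd.Series(dtype='float64'),
                     'B': pd.Series(dtype='object')})

# Passed to .agg() so dask can skip metadata inference, e.g. (hypothetical):
# ddf_agg = ddf.groupby('Z').agg(my_aggregate, meta=meta).compute()
```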

Answer 1 (score: 0):

The problem you are running into is that the individual stages of the aggregation cannot be the same function applied recursively, as they are in the custom_sum example you are looking at.
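A small pandas sketch of why: a sum of partial sums is the overall sum, but value_counts applied to partial counts does not combine them into overall counts:

```python
import pandas as pd

# Sum composes with itself: summing the partial sums gives the total.
partial = pd.Series([pd.Series([1, 2]).sum(), pd.Series([3]).sum()])
total = partial.sum()  # 6, same as pd.Series([1, 2, 3]).sum()

# value_counts does not compose: applied to partial counts, it counts how
# often each *count* occurs, which says nothing about the mode of the data.
s = pd.Series(['a', 'a', 'b'])
counts = s.value_counts()                 # a -> 2, b -> 1
counts_of_counts = counts.value_counts()  # 2 -> 1, 1 -> 1
```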

I have adapted the code from this answer and kept the comments by @user8570642, since they are very helpful. Note that this method will handle a list of groupby keys: https://stackoverflow.com/a/46082075/3968619

def chunk(s):
    # for the comments, assume only a single grouping column; the
    # implementation can handle multiple group columns.
    #
    # s is a grouped series. value_counts creates a multi-index series like
    # (group, value): count
    return s.value_counts()


def agg(s):
    # s is a grouped multi-index series. In .apply the full sub-df will be
    # passed, multi-index and all. Group on the value level and sum the
    # counts. The result of the lambda function is a series. Therefore, the
    # result of the apply is a multi-index series like (group, value): count
    # return s.apply(lambda s: s.groupby(level=-1).sum())

    # faster version using pandas internals
    s = s._selected_obj
    return s.groupby(level=list(range(s.index.nlevels))).sum()


def finalize(s):
    # s is a multi-index series of the form (group, value): count. First
    # manually group on the group part of the index. The lambda will receive a
    # sub-series with multi index. Next, drop the group part from the index.
    # Finally, determine the index with the maximum value, i.e., the mode.
    level = list(range(s.index.nlevels - 1))
    return (
        s.groupby(level=level)
        .apply(lambda s: s.reset_index(level=level, drop=True).idxmax())
    )

max_occurence = dd.Aggregation('mode', chunk, agg, finalize)

chunk will count the values of the groupby object in each partition. agg takes the results from chunk, regroups on the original groupby keys, and sums the value counts, so that we get the value counts for every group. finalize takes the multi-index series provided by agg and, for each group in B, returns the most frequently occurring value of Z.
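The three stages can be traced in plain pandas on two hand-made "partitions" (a sketch assuming a single group column B; the real dask version handles multiple group columns):

```python
import pandas as pd

# Two "partitions" of the same dataset, to be grouped on B.
p1 = pd.DataFrame({'B': [5, 5, 1], 'Z': ['amy', 'amy', 'chris']})
p2 = pd.DataFrame({'B': [5, 1], 'Z': ['mike', 'chris']})

# chunk: per-partition value counts -> multi-index series (B, Z): count
c1 = p1.groupby('B')['Z'].value_counts()
c2 = p2.groupby('B')['Z'].value_counts()

# agg: concatenate the partial results and sum the counts per (B, Z)
summed = pd.concat([c1, c2]).groupby(level=[0, 1]).sum()

# finalize: per B group, drop the group level and pick the Z with the
# highest count
result = summed.groupby(level=0).apply(
    lambda s: s.reset_index(level=0, drop=True).idxmax())
# B=1 -> 'chris', B=5 -> 'amy'
```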

Here is a test case:

import dask.dataframe as dd
import pandas as pd

df = dd.from_pandas(
    pd.DataFrame({"A": [1, 1, 1, 1, 2, 2, 3]*10, "B": [5, 5, 5, 5, 1, 1, 1]*10,
                  'Z': ['mike', 'amy', 'amy', 'amy', 'chris', 'chris', 'sandra']*10}),
    npartitions=10)
res = df.groupby(['B']).agg({'Z': max_occurence}).compute()
print(res)
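Independently of dask, the per-group mode for this data can be cross-checked with plain pandas (value_counts sorts counts in descending order, so index[0] is the most frequent value):

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 1, 1, 1, 2, 2, 3]*10, "B": [5, 5, 5, 5, 1, 1, 1]*10,
                   'Z': ['mike', 'amy', 'amy', 'amy', 'chris', 'chris', 'sandra']*10})
expected = df.groupby('B')['Z'].agg(lambda s: s.value_counts().index[0])
# B=1 -> 'chris' (20 'chris' vs 10 'sandra'), B=5 -> 'amy' (30 'amy' vs 10 'mike')
```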