Custom dask GroupBy Aggregation

dask's custom GroupBy Aggregation is very convenient, but I am having a hard time defining one that returns the most frequent value of a column.
What I have:
Following the example here, we can define custom aggregation functions as follows:
custom_sum = dd.Aggregation('custom_sum', lambda s: s.sum(), lambda s0: s0.sum())
my_aggregate = {
    'A': custom_sum,
    'B': custom_most_often_value,  ### <<< This is the goal.
    'C': ['max', 'min', 'mean'],
    'D': ['max', 'min', 'mean']
}
col_name = 'Z'
ddf_agg = ddf.groupby(col_name).agg(my_aggregate).compute()
While this works for custom_sum (as shown on the example page), my adaptation for the most frequent value looks like this (from the example here):
custom_most_often_value = dd.Aggregation('custom_most_often_value', lambda x:x.value_counts().index[0], lambda x0:x0.value_counts().index[0])
but it raises:
ValueError: Metadata inference failed in `_agg_finalize`.
You have supplied a custom function and Dask is unable to
determine the type of output that that function returns.
I then tried to find the keyword meta in the dd.Aggregation implementation so I could define it, but could not find it. And since it is not needed in the custom_sum example, I suspect the error lies elsewhere.
So my question is: how can I get the most frequent value of a column with df.groupby(..).agg(..)? Thanks!
Answer 0 (score: 1)
A quick clarification rather than an answer: meta is a parameter used with .agg() to specify the data types of the desired columns, best expressed as a zero-length pandas dataframe. Dask feeds dummy data to your function to try to guess those types, but this does not always work.
Answer 1 (score: 0)
The problem you are facing is that the stages of the aggregation cannot all be the same function applied recursively, as they are in the custom_sum example you are working from.
I have adapted the code from this answer and kept the comments by @user8570642, since they are very helpful. Note that this method also handles a list of groupby keys: https://stackoverflow.com/a/46082075/3968619
def chunk(s):
# for the comments, assume only a single grouping column, the
# implementation can handle multiple group columns.
#
# s is a grouped series. value_counts creates a multi-series like
# (group, value): count
return s.value_counts()
def agg(s):
    # s is a grouped multi-index series. In .apply the full sub-df will be
    # passed, multi-index and all. Group on the value level and sum the
    # counts. The result of the lambda is a series, so the result of the
    # apply is a multi-index series like (group, value): count
    # return s.apply(lambda s: s.groupby(level=-1).sum())

    # faster version using pandas internals
    s = s._selected_obj
    return s.groupby(level=list(range(s.index.nlevels))).sum()
def finalize(s):
# s is a multi-index series of the form (group, value): count. First
# manually group on the group part of the index. The lambda will receive a
# sub-series with multi index. Next, drop the group part from the index.
# Finally, determine the index with the maximum value, i.e., the mode.
level = list(range(s.index.nlevels - 1))
return (
s.groupby(level=level)
.apply(lambda s: s.reset_index(level=level, drop=True).idxmax())
)
mode = dd.Aggregation('mode', chunk, agg, finalize)
chunk counts the values of the groupby object in each partition. agg takes the results from chunk, groups by the original groupby key again, and sums the value counts, so we end up with the value counts of each group over the whole data set. finalize takes the multi-index series provided by agg and, for each group in Z, returns the most frequently occurring value of B.
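Since dd.Aggregation runs each stage on ordinary pandas objects under the hood, the three stages can be traced in pure pandas (a sketch with made-up data, manually simulating two partitions):

```python
import pandas as pd

# Two "partitions" of the same data set, as dask would see them.
part1 = pd.DataFrame({"Z": ["a", "a", "b"], "B": [5, 5, 1]})
part2 = pd.DataFrame({"Z": ["a", "b", "b"], "B": [5, 1, 1]})

# chunk: value_counts per partition -> multi-index series (Z, B): count
c1 = part1.groupby("Z")["B"].value_counts()
c2 = part2.groupby("Z")["B"].value_counts()

# agg: combine the partition results and sum the counts per (Z, B)
combined = pd.concat([c1, c2])
summed = combined.groupby(level=[0, 1]).sum()

# finalize: within each Z group, drop the Z index level and pick the
# B value with the highest total count, i.e. the mode.
modes = summed.groupby(level=0).apply(
    lambda s: s.reset_index(level=0, drop=True).idxmax())
print(modes)  # a -> 5, b -> 1
```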
Here is a test case:
import pandas as pd
import dask.dataframe as dd

df = dd.from_pandas(
    pd.DataFrame({"A": [1, 1, 1, 1, 2, 2, 3] * 10,
                  "B": [5, 5, 5, 5, 1, 1, 1] * 10,
                  "Z": ['mike', 'amy', 'amy', 'amy', 'chris', 'chris', 'sandra'] * 10}),
    npartitions=10)
res = df.groupby(['Z']).agg({'B': mode}).compute()
print(res)