Question

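I would like to use reduce and accumulate functions in Pandas in the same way they are applied to lists in native Python. In the itertools and functools implementations, reduce and accumulate (sometimes called fold and cumulative fold in other languages) require a function with two arguments. Pandas has no similar implementation. The function takes two parameters: f(accumulated_value, popped_value)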
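For reference, here is a minimal sketch of that two-argument convention in plain Python. The `add` function and the `values` list are made up for illustration only:

```python
from functools import reduce
from itertools import accumulate

values = [1, 2, 3, 4]

def add(accumulated_value, popped_value):
    # First argument: the running result so far; second: the next element.
    return accumulated_value + popped_value

# reduce folds the whole list down to a single value.
total = reduce(add, values)

# accumulate keeps every intermediate value of the same fold.
running_totals = list(accumulate(values, add))

print(total)
print(running_totals)
```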

So, I have a list of binary variables and want to calculate how long each stretch in the 1 state has lasted:

In [1]: from itertools import accumulate
        import pandas as pd
        drawdown_periods = [0,1,1,1,0,0,0,1,1,1,1,0,1,1,0]

Applying accumulate to this with the lambda function

lambda x,y: (x+y)*y

gives

In [2]: list(accumulate(drawdown_periods, lambda x,y: (x+y)*y))
Out[2]: [0, 1, 2, 3, 0, 0, 0, 1, 2, 3, 4, 0, 1, 2, 0]

which counts the length of each drawdown_period.

Is there a smart but quirky way to supply a lambda function with two arguments? I may be missing a trick here.

I know there is a lovely recipe with groupby (see StackOverflow How to calculate consecutive Equal Values in Pandas/How to emulate itertools.groupby with a series/dataframe). I'll repeat it since it's so lovely:

In [3]: df = pd.DataFrame(data=drawdown_periods, columns=['dd'])
       df['dd'].groupby((df['dd'] != df['dd'].shift()).cumsum()).cumsum()
Out[3]:
    0     0
    1     1
    2     2
    3     3
    4     0
    5     0
    6     0
    7     1
    8     2
    9     3
    10    4
    11    0
    12    1
    13    2
    14    0
    Name: dd, dtype: int64
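To make the recipe above easier to follow, here is a sketch that splits the one-liner into its intermediate steps. The names `changed`, `run_id` and `run_length` are introduced here purely for illustration:

```python
import pandas as pd

drawdown_periods = [0, 1, 1, 1, 0, 0, 0, 1, 1, 1, 1, 0, 1, 1, 0]
df = pd.DataFrame(data=drawdown_periods, columns=['dd'])

# True wherever the value differs from the previous row (the first row
# compares against NaN, so it is always True).
changed = df['dd'] != df['dd'].shift()

# A cumulative sum over the booleans gives every run of equal values
# its own integer label.
run_id = changed.cumsum()

# A cumulative sum inside each run counts up within runs of 1s and
# stays at 0 within runs of 0s.
run_length = df['dd'].groupby(run_id).cumsum()

print(run_id.tolist())
print(run_length.tolist())
```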

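This is not the solution I want. I need a way to pass a two-argument lambda function to a pandas-native reduce/accumulate function, since this would also work for many other functional programming recipes.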

Answer 1

You can make this work with numpy, albeit inefficiently. In practice, you are probably better off writing ad-hoc vectorised solutions.

Using np.frompyfunc:

import numpy as np
import pandas as pd

s = pd.Series([0,1,1,1,0,0,0,1,1,1,1,0,1,1,0])
f = np.frompyfunc(lambda x, y: (x+y) * y, 2, 1)  # wrap the two-argument function as a ufunc
f.accumulate(s.astype(object))

0     0
1     1
2     2
3     3
4     0
5     0
6     0
7     1
8     2
9     3
10    4
11    0
12    1
13    2
14    0
dtype: object
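The same ufunc also exposes a reduce method. Here is a rough sketch of both folds run on a plain object array and then wrapped back into a Series. Going through `to_numpy` and casting back to the original dtype are my own choices for illustration, not something the approach requires:

```python
import numpy as np
import pandas as pd

s = pd.Series([0, 1, 1, 1, 0, 0, 0, 1, 1, 1, 1, 0, 1, 1, 0])
f = np.frompyfunc(lambda x, y: (x + y) * y, 2, 1)

# frompyfunc ufuncs work on Python objects, so hand them an object array.
values = s.to_numpy(dtype=object)

# Cumulative fold: rebuild a Series on the original index and cast the
# object results back to the original integer dtype.
accumulated = pd.Series(f.accumulate(values), index=s.index).astype(s.dtype)

# Plain fold: a single value, equal to the last accumulated element.
reduced = f.reduce(values)

print(accumulated.tolist())
print(reduced)
```

Every element still passes through the Python lambda, so this is a convenience rather than a speed-up.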

Answer 2

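What you are looking for is a pandas method that would extract all the objects from the Series, convert them to Python objects, call a Python function on them, and keep an accumulator that is also a Python object.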

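This kind of behaviour does not scale well when you have a lot of data, because wrapping raw data in Python objects carries a large time/memory overhead. Pandas methods try to work directly on the underlying (numpy) raw data, which lets them process large amounts of data without wrapping it in Python objects. The groupby + cumsum example you give is a clever way to avoid using .apply and a Python function, which would be slower.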

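Nevertheless, if you don't care about performance, you are of course free to implement this yourself in Python. Since it is all Python anyway, and there is no way to speed it up on the pandas side, you can just write it yourself: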

df["cev"] = list(accumulate(df.dd, lambda x,y:(x+y)*y))
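If you use this pattern often, it could be wrapped in a pair of small helpers. The names `accumulate_series` and `reduce_series` below are hypothetical, not part of pandas; this is only a sketch of one way to keep the index and name intact:

```python
from functools import reduce
from itertools import accumulate

import pandas as pd

def accumulate_series(series, func):
    """Hypothetical helper: cumulative fold that keeps the index and name."""
    return pd.Series(list(accumulate(series, func)),
                     index=series.index, name=series.name)

def reduce_series(series, func):
    """Hypothetical helper: plain fold; raises TypeError on an empty Series."""
    return reduce(func, series)

df = pd.DataFrame({"dd": [0, 1, 1, 1, 0, 0, 0, 1, 1, 1, 1, 0, 1, 1, 0]})

# Length of the current run of 1s at each row.
df["cev"] = accumulate_series(df["dd"], lambda x, y: (x + y) * y)

# Fold the run lengths with max to get the longest run.
longest = reduce_series(df["cev"], max)

print(df["cev"].tolist())
print(longest)
```

Both helpers loop in pure Python, so the performance caveat above still applies.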

Answer 3

Using pandas.DataFrame.aggregate and functools.reduce:

import pandas as pd
import operator
from functools import reduce

def reduce_or(series):
    return reduce(operator.or_, series)


df = pd.DataFrame([1,0,0,0], index='a b a b'.split()).astype(bool)
df

df.groupby(df.index).aggregate(reduce_or)
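As a sanity check, here is a sketch that runs the snippet above and compares it with the built-in groupby `any()`, which I am assuming is an acceptable stand-in for an "or" fold over boolean data. The variable names `folded` and `builtin` are mine:

```python
import operator
from functools import reduce

import pandas as pd

def reduce_or(series):
    # Fold each group's values with the two-argument "or" operator.
    return reduce(operator.or_, series)

df = pd.DataFrame([1, 0, 0, 0], index='a b a b'.split()).astype(bool)

# Group "a" holds [True, False]; group "b" holds [False, False].
folded = df.groupby(df.index).aggregate(reduce_or)

# Built-in equivalent for this particular fold.
builtin = df.groupby(df.index).any()

print(folded[0].tolist())
print(builtin[0].tolist())
```

The reduce version is the one that generalises to an arbitrary two-argument function; the built-in only covers this specific case.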

Pandas "reduce" and "accumulate" functions - incomplete implementation
