Question

I am trying to use the pandas.DataFrame.rolling.apply() rolling function on multiple columns. The Python version is 3.7 and the pandas version is 1.0.2.

import pandas as pd

# function to calculate
def masscenter(x):
    print(x)  # for debug purposes
    return 0

#simple DF creation routine
df = pd.DataFrame( [['02:59:47.000282', 87.60, 739],
                    ['03:00:01.042391', 87.51, 10],
                    ['03:00:01.630182', 87.51, 10],
                    ['03:00:01.635150', 88.00, 792],
                    ['03:00:01.914104', 88.00, 10]], 
                   columns=['stamp', 'price','nQty'])
df['stamp'] = pd.to_datetime(df['stamp'], format='%H:%M:%S.%f')
df.set_index('stamp', inplace=True, drop=True)

'stamp' is monotonic and unique, 'price' is a double and contains no NaNs, and 'nQty' is an integer and also contains no NaNs.

So, I need to calculate a rolling "center of mass", i.e. sum(price*nQty)/sum(nQty).
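As a quick check of the formula, the center of mass of the first two rows of the sample data can be worked out with NumPy. This is only an illustrative sketch; the array names are made up for the example.

```python
import numpy as np

# First two rows of the sample data: prices and quantities
price = np.array([87.60, 87.51])
qty = np.array([739, 10])

# Quantity-weighted mean price: sum(price * nQty) / sum(nQty)
center = np.sum(price * qty) / np.sum(qty)
print(round(center, 6))
```

The result is pulled strongly toward 87.60 because that row carries almost all of the quantity.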

What I have tried so far:

df.apply(masscenter, axis = 1)

masscenter is called 5 times, once per row, and the output looks like

price     87.6
nQty     739.0
Name: 1900-01-01 02:59:47.000282, dtype: float64

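This is the desired input to masscenter, because I can easily access price and nQty using x[0], x[1]. However, I am stuck with rolling.apply(). Reading the docs for DataFrame.rolling() and rolling.apply(), I supposed that using 'axis' in rolling() and 'raw' in apply() would achieve similar behaviour. A naive approach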

rol = df.rolling(window=2)
rol.apply(masscenter)

prints the columns one at a time (in chunks of window-size rows)

stamp
1900-01-01 02:59:47.000282    87.60
1900-01-01 03:00:01.042391    87.51
dtype: float64

and then

stamp
1900-01-01 02:59:47.000282    739.0
1900-01-01 03:00:01.042391     10.0
dtype: float64

So the columns are passed to masscenter separately (as expected).

Sadly, the docs say next to nothing about 'axis'. However, the next variant was, obviously,

rol = df.rolling(window=2, axis = 1)
rol.apply(masscenter)

It never calls masscenter and raises a ValueError in rol.apply(..)

> Length of passed values is 1, index implies 5

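I admit that, due to the lack of documentation, I am not sure about the 'axis' parameter and how it works. This is the first part of the question: What is going on here? How do I use 'axis' properly? What is it designed for?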
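As far as I can tell, axis=1 is meant to slide the window across the columns within each row rather than down the rows, which is not what a two-column masscenter needs. The sketch below illustrates that reading on a small made-up frame. It transposes the frame instead of passing axis, so it does not depend on how a particular pandas version treats that argument.

```python
import pandas as pd

# Hypothetical toy frame, not the question's data
toy = pd.DataFrame({"a": [1.0, 2.0], "b": [10.0, 20.0], "c": [100.0, 200.0]})

# Rolling over columns: transpose, roll down the rows, transpose back
across = toy.T.rolling(window=2).sum().T
print(across)
```

Each cell holds the sum of its own column and the column to its left, so the first column has no complete window and is NaN.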

Of course, there are earlier answers, namely:

How-to-apply-a-function-to-two-columns-of-pandas-dataframe
It works for the whole DataFrame, not for a rolling window.

How-to-invoke-pandas-rolling-apply-with-parameters-from-multiple-column
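The answer suggests writing my own rolling function, but the catch for me is the same one raised in the comments: what if one needs an offset window size (e.g. '1T') for non-uniform timestamps?
I don't like the idea of reinventing the wheel from scratch. I would also like to use pandas for everything, to avoid inconsistencies between sets obtained from pandas and a "self-made roll". There is another answer to that question, which suggests populating the dataframe separately and calculating whatever is needed, but it will not work: the amount of stored data would be enormous. The same idea is presented here: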
Apply-rolling-function-on-pandas-dataframe-with-multiple-arguments

Another Q&A is posted here
Pandas-using-rolling-on-multiple-columns
It is good and the closest to my problem, but again, it offers no way to use offset window sizes (window = '1T').

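Some of these questions were asked before pandas 1.0 came out, and given that the docs could be much better, I hope it is now possible to roll over multiple columns simultaneously.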

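The second part of the question is: Is there any way to roll over multiple columns simultaneously using pandas 1.0.x with an offset window size?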

Thank you very much.

Answer 1

How about this?

def masscenter(ser):
    print(df.loc[ser.index])
    return 0

rol = df.price.rolling(window=2)
rol.apply(masscenter, raw=False)

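It uses the rolling logic to get subsets via an arbitrary column. The raw=False option gives you the index values of those subsets (which are passed to you as Series), and you then use those index values to get multi-column slices from the original DataFrame.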
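The same trick can be sketched with an offset window and a masscenter that actually computes the weighted mean. This is an illustration, not part of the original answer: the '1s' window and the 'center' column name are arbitrary choices, and the lookup by index relies on 'stamp' being unique, as the question states.

```python
import pandas as pd

df = pd.DataFrame(
    [['02:59:47.000282', 87.60, 739],
     ['03:00:01.042391', 87.51, 10],
     ['03:00:01.630182', 87.51, 10],
     ['03:00:01.635150', 88.00, 792],
     ['03:00:01.914104', 88.00, 10]],
    columns=['stamp', 'price', 'nQty'])
df['stamp'] = pd.to_datetime(df['stamp'], format='%H:%M:%S.%f')
df = df.set_index('stamp')

def masscenter(ser):
    # ser only carries the window's index; look up both columns in df
    window = df.loc[ser.index]
    return (window['price'] * window['nQty']).sum() / window['nQty'].sum()

# Offset window of one second; '1s' is an arbitrary choice for this sample
df['center'] = df['price'].rolling('1s').apply(masscenter, raw=False)
print(df)
```

Because the function runs once per window in Python and does a label lookup each time, this is likely to be slow on large frames.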

Answer 2

You can use the rolling_apply function from the numpy_ext module:

import numpy as np
import pandas as pd
from numpy_ext import rolling_apply


def masscenter(price, nQty):
    return np.sum(price * nQty) / np.sum(nQty)


df = pd.DataFrame( [['02:59:47.000282', 87.60, 739],
                    ['03:00:01.042391', 87.51, 10],
                    ['03:00:01.630182', 87.51, 10],
                    ['03:00:01.635150', 88.00, 792],
                    ['03:00:01.914104', 88.00, 10]], 
                   columns=['stamp', 'price','nQty'])
df['stamp'] = pd.to_datetime(df['stamp'], format='%H:%M:%S.%f')
df.set_index('stamp', inplace=True, drop=True)

window = 2
df['y'] = rolling_apply(masscenter, window, df.price.values, df.nQty.values)
print(df)

                            price  nQty          y
stamp                                             
1900-01-01 02:59:47.000282  87.60   739        NaN
1900-01-01 03:00:01.042391  87.51    10  87.598798
1900-01-01 03:00:01.630182  87.51    10  87.510000
1900-01-01 03:00:01.635150  88.00   792  87.993890
1900-01-01 03:00:01.914104  88.00    10  88.000000
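If installing numpy_ext is not an option, the same fixed-window result can probably be reproduced with NumPy alone. The sketch below is an assumption on my part rather than something from the answer: it uses numpy.lib.stride_tricks.sliding_window_view (available in NumPy 1.20 and later) and, like rolling_apply, only handles a fixed number of rows, not offset windows.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

price = np.array([87.60, 87.51, 87.51, 88.00, 88.00])
qty = np.array([739, 10, 10, 792, 10])
window = 2

# Each row of the views holds one full window of consecutive values
p_win = sliding_window_view(price, window)
q_win = sliding_window_view(qty, window)

# Weighted mean per window, padded with NaN where the window is incomplete
full = (p_win * q_win).sum(axis=1) / q_win.sum(axis=1)
y = np.concatenate([np.full(window - 1, np.nan), full])
print(y)
```

The values should agree with the 'y' column printed above.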

Answer 3

So I found a way to roll over two columns, though not with built-in pandas functions. The code is listed below.

import numpy as np
import pandas as pd
from pandas.tseries.frequencies import to_offset

# function to find an index corresponding
# to current value minus offset value
def prevInd(series, offset, date):
    offset = to_offset(offset)
    end_date = date - offset
    end = series.index.searchsorted(end_date, side="left")
    return end

# function to find an index corresponding
# to the first value greater than current
# it is useful when one has timeseries with non-unique
# but monotonically increasing values
def nextInd(series, date):
    end = series.index.searchsorted(date, side="right")
    return end

def twoColumnsRoll(dFrame, offset, usecols, fn, columnName = 'twoColRol'):
    # find all unique indices
    uniqueIndices = dFrame.index.unique()
    numOfPoints = len(uniqueIndices)
    # prepare an output array
    moving = np.zeros(numOfPoints)
    # nameholders
    price = dFrame[usecols[0]]
    qty   = dFrame[usecols[1]]

    # iterate over unique indices
    for ii in range(numOfPoints):
        # nameholder
        pp = uniqueIndices[ii]
        # right index - first position with a value greater than current
        rInd = nextInd(dFrame, pp)
        # left index - first position with a value
        # greater than or equal to (pp - offset)
        lInd = prevInd(dFrame, offset, pp)
        # call the actual calculating function over the two positional slices
        moving[ii] = fn(price.iloc[lInd:rInd], qty.iloc[lInd:rInd])
    # construct and return DataFrame
    return pd.DataFrame(data=moving, index=uniqueIndices, columns=[columnName])

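This code works, but it is relatively slow and inefficient. I suppose one could use numpy.lib.stride_tricks from "How to invoke pandas.rolling.apply with parameters from multiple column?" to speed things up. However, go big or go home: I ended up writing the function in C++ with a wrapper for it.
I would rather not have posted this as an answer, since it is a workaround and does not answer either part of my question, but it is too long for a comment.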
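For the specific center-of-mass formula from the question, a Python-level loop may not be needed at all: the ratio is a quotient of two sums, and each sum involves a single series, so two ordinary rolling sums with an offset window should be enough. This is a sketch under that assumption and does not generalize to arbitrary two-column functions; the '1s' window and the 'center' column name are arbitrary choices.

```python
import pandas as pd

df = pd.DataFrame(
    [['02:59:47.000282', 87.60, 739],
     ['03:00:01.042391', 87.51, 10],
     ['03:00:01.630182', 87.51, 10],
     ['03:00:01.635150', 88.00, 792],
     ['03:00:01.914104', 88.00, 10]],
    columns=['stamp', 'price', 'nQty'])
df['stamp'] = pd.to_datetime(df['stamp'], format='%H:%M:%S.%f')
df = df.set_index('stamp')

# Numerator and denominator are each a single-column rolling sum
weighted = (df['price'] * df['nQty']).rolling('1s').sum()
total = df['nQty'].rolling('1s').sum()

# Element-wise ratio gives the rolling center of mass
df['center'] = weighted / total
print(df)
```

Both sums run inside pandas, so this keeps pandas' own window semantics and avoids calling a Python function per window.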

Answer 4

This builds on the excellent answer from @saninstein.

Install numpy_ext from: https://pypi.org/project/numpy-ext/

import numpy as np
import pandas as pd
from numpy_ext import rolling_apply as rolling_apply_ext

def box_sum(a,b):
    return np.sum(a) + np.sum(b)

df = pd.DataFrame({"x": [1,2,3,4], "y": [1,2,3,4]})

window = 2
df["sum"] = rolling_apply_ext(box_sum, window , df.x.values, df.y.values)

Output:

print(df.to_string(index=False))
 x  y  sum
 1  1  NaN
 2  2  6.0
 3  3 10.0
 4  4 14.0

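Notes

The rolling function is time-series friendly. By default it always looks backwards, so the 6 is the sum of the current and past values in the arrays.
In the sample above, rolling_apply is imported as rolling_apply_ext so that it cannot interfere with any existing calls to Pandas rolling_apply (thanks to the comment by @LudoSchmidt).

As a side note, I gave up trying to use Pandas for this. It is fundamentally broken: it handles single-column aggregation and apply with few problems, but it becomes an overly complex Rube Goldberg machine when you try to make it work with two or more columns.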

Pandas rolling apply on multiple columns
