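How can I apply a function that takes a sliding-window slice of a DataFrame as its argument in pandas?

Asked: 2017-11-04 08:39:59

Tags: python pandas numpy dataframe sliding-window

I have time series data stored in a pandas DataFrame, which looks like this: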

    Date         Open    High    Low     Close   Volume
0   2016-01-19   22.86   22.92   22.36   22.60   838024
1   2016-01-20   22.19   22.98   21.87   22.77   796745
2   2016-01-21   22.75   23.10   22.62   22.76   573068
3   2016-01-22   23.13   23.35   22.96   23.33   586967
4   2016-01-25   23.22   23.42   23.01   23.26   645551
5   2016-01-26   23.28   23.85   23.22   23.74   592658
6   2016-01-27   23.68   23.78   18.76   20.09   5351850
7   2016-01-28   20.05   20.69   19.11   19.37   2255635
8   2016-01-29   19.51   20.02   19.40   19.90   1203969
9   2016-02-01   19.77   19.80   19.13   19.14   1203375

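I want to create a function that can be applied row by row and that receives a slice of the original dataset, so that the slice can be aggregated by any custom aggregation operator.

Say the function is applied like this:

aggregated_df = data.apply(calculateMySpecificAggregation, axis=1)

where calculateMySpecificAggregation receives, for each row of the original DataFrame, a slice of the original DataFrame of size 3. For each row, the DataFrame passed as the argument contains the row itself plus the previous and the next row of the original DataFrame.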

# pseudocode example
def calculateMySpecificAggregation(df_slice):

    # I want to know which row this function was applied to (I would like to have its index here)
    ri = ???   # index of the row this function was applied to

    # df_slice contains 3 rows and all columns
    return float(df_slice["Close"][ri - 1]
                 + (df_slice["High"][ri] + df_slice["Low"][ri]) / 2
                 + df_slice["Open"][ri + 1])
    # this line will fail on the borders, but don't worry, I will handle it later...

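I want to parameterize the sliding window size, access the other columns of the rows, and know the index of the original row the function was applied to.

That means that if slideWindow = 3, I want the following parameter DataFrames: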

#parameter dataframe when the function is applied on row[0]:
    Date         Open    High    Low     Close   Volume
0   2016-01-19   22.86   22.92   22.36   22.60   838024
1   2016-01-20   22.19   22.98   21.87   22.77   796745

#parameter dataframe when the function is applied on row[1]:
    Date         Open    High    Low     Close   Volume
0   2016-01-19   22.86   22.92   22.36   22.60   838024
1   2016-01-20   22.19   22.98   21.87   22.77   796745
2   2016-01-21   22.75   23.10   22.62   22.76   573068

#parameter dataframe when the function is applied on row[2]:
    Date         Open    High    Low     Close   Volume
1   2016-01-20   22.19   22.98   21.87   22.77   796745
2   2016-01-21   22.75   23.10   22.62   22.76   573068
3   2016-01-22   23.13   23.35   22.96   23.33   586967

#parameter dataframe when the function is applied on row[3]:
    Date         Open    High    Low     Close   Volume
2   2016-01-21   22.75   23.10   22.62   22.76   573068
3   2016-01-22   23.13   23.35   22.96   23.33   586967
4   2016-01-25   23.22   23.42   23.01   23.26   645551

...            

#parameter dataframe when the function is applied on row[7]:
    Date         Open    High    Low     Close   Volume
6   2016-01-27   23.68   23.78   18.76   20.09   5351850
7   2016-01-28   20.05   20.69   19.11   19.37   2255635
8   2016-01-29   19.51   20.02   19.40   19.90   1203969

#parameter dataframe when the function is applied on row[8]:
    Date         Open    High    Low     Close   Volume
7   2016-01-28   20.05   20.69   19.11   19.37   2255635
8   2016-01-29   19.51   20.02   19.40   19.90   1203969
9   2016-02-01   19.77   19.80   19.13   19.14   1203375

#parameter dataframe when the function is applied on row[9]:
    Date         Open    High    Low     Close   Volume
8   2016-01-29   19.51   20.02   19.40   19.90   1203969
9   2016-02-01   19.77   19.80   19.13   19.14   1203375
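
The windows listed above follow a simple rule: take the rows around the current position and clip the range at both ends of the frame. A minimal sketch of that rule in plain Python, assuming an odd window size and positional (0-based) row numbers; the helper name window_bounds is hypothetical and not part of pandas:

```python
def window_bounds(pos, n_rows, window):
    """Return (start, stop) for a centered window, clipped to [0, n_rows).

    Assumes `window` is odd, so the same number of rows lies on each side.
    `stop` is exclusive, like a Python slice end.
    """
    half = window // 2
    start = max(0, pos - half)
    stop = min(n_rows, pos + half + 1)
    return start, stop


# Bounds for every row of a 10-row frame with a window of 3.
bounds = [window_bounds(pos, 10, 3) for pos in range(10)]
```

The first and last entries are shorter than the rest, which matches the two-row frames shown for row[0] and row[9].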

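If possible, I would like to avoid a loop combined with iloc indexing.

I have tried pandas.DataFrame.rolling and pandas.rolling_apply, but without success.

Does anyone know how to solve this problem?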
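
For context, a small sketch of what rolling handles well, using the first five Close values from the table. It assumes a recent pandas release, where the module-level pandas.rolling_apply has been superseded by the .rolling(...) method. A centered window with min_periods=1 is truncated at the borders, but a rolling aggregation works on one column at a time, so it does not directly give a function the High, Low, and Open values of neighbouring rows together.

```python
import pandas as pd

# First five Close values from the sample data.
close = pd.Series([22.60, 22.77, 22.76, 23.33, 23.26], name="Close")

# Centered window of 3 rows; min_periods=1 keeps the shorter windows
# at the first and last row instead of returning NaN there.
centered_mean = close.rolling(window=3, center=True, min_periods=1).mean()
```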

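1 Answer:

Answer 0 (score: 0)

OK, after a long struggle I have solved the problem. I could not avoid iloc (which is not a big problem in this case), but at least no explicit loop is used here.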

contextSizeLeft = 2
contextSizeRight = 3

def aggregateWithContext(df, row, func, contextSizeLeft, contextSizeRight):

    # row.name is the index label of the current row. Using it as a position
    # assumes the default integer index 0..n-1.
    leftBorder  = max(0,       row.name - contextSizeLeft)
    # +1 because the slice end is exclusive; clamp to the length of the frame.
    rightBorder = min(len(df), row.name + contextSizeRight + 1)

    #print("pos: ", row.name, "\t", (leftBorder, rightBorder), "\t", len(df.iloc[leftBorder:rightBorder]))

    # Pass the window and the label of the center row to the aggregation.
    return func(df.iloc[leftBorder:rightBorder], row.name)

def aggregate(df, center):
    print()
    print("center", center)
    print(df["Date"])
    return len(df)


df.apply(lambda x: aggregateWithContext(df, x, aggregate, contextSizeLeft, contextSizeRight), axis=1)
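
As an illustration, the same pattern can be used to compute the aggregation from the question. This is only a sketch under a few assumptions: the frame has the default 0..n-1 index, only the first five rows of the sample data are used, and the names aggregate_with_context and specific_aggregation are hypothetical. Border rows, where a neighbour is missing, return NaN here; that is one possible choice, not something the question prescribes.

```python
import math

import pandas as pd

# First five rows of the sample data (only the columns the formula needs).
df = pd.DataFrame({
    "Open":  [22.86, 22.19, 22.75, 23.13, 23.22],
    "High":  [22.92, 22.98, 23.10, 23.35, 23.42],
    "Low":   [22.36, 21.87, 22.62, 22.96, 23.01],
    "Close": [22.60, 22.77, 22.76, 23.33, 23.26],
})


def aggregate_with_context(df, row, func, left, right):
    """Call func(window, center) with the rows around `row`, clipped at the borders."""
    pos = row.name  # assumes the default integer index, so label == position
    start = max(0, pos - left)
    stop = min(len(df), pos + right + 1)
    return func(df.iloc[start:stop], pos)


def specific_aggregation(window, center):
    """Previous Close + midpoint of the current High/Low + next Open."""
    # The window keeps the original index labels, so neighbours can be
    # looked up by label; at the borders one of them is missing.
    if center - 1 not in window.index or center + 1 not in window.index:
        return float("nan")
    midpoint = (window.loc[center, "High"] + window.loc[center, "Low"]) / 2
    return float(window.loc[center - 1, "Close"] + midpoint + window.loc[center + 1, "Open"])


result = df.apply(
    lambda r: aggregate_with_context(df, r, specific_aggregation, 1, 1),
    axis=1,
)
```

Because the slice keeps the original index labels, the center value passed to the aggregation answers the "which row was this applied on" part of the question.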

In case anyone needs it, here is the same idea with a window defined by dates:

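import pandas as pd
from datetime import timedelta

def aggregateWithContext(df, row, func, timedeltaLeft, timedeltaRight):

    # Assumes df["Date"] holds datetime values (e.g. converted with pd.to_datetime).
    dateInRecord = row["Date"]
    leftBorder  = pd.to_datetime(dateInRecord - timedeltaLeft)
    rightBorder = pd.to_datetime(dateInRecord + timedeltaRight)

    # Select every row whose date falls inside the window (both ends inclusive).
    dfs = df[(df['Date'] >= leftBorder) & (df['Date'] <= rightBorder)]
    #print(dateInRecord, ":\t", leftBorder, "\t", rightBorder, "\t", len(dfs))

    return func(dfs, row.name)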

def aggregate(df, center):
    #print()
    #print("center", center)
    #print(df["Date"])
    return len(df)


timedeltaLeft  = timedelta(days=2)
timedeltaRight = timedelta(days=2)
df.apply(lambda x: aggregateWithContext(df, x, aggregate, timedeltaLeft, timedeltaRight), axis=1)
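
A self-contained sketch of the date-based variant on the sample dates, assuming the Date column is first converted with pd.to_datetime (string dates would not support the timedelta arithmetic). The function names are hypothetical snake_case stand-ins for the ones above. Since the data skips weekends, the number of rows per window varies even in the middle of the frame.

```python
from datetime import timedelta

import pandas as pd

# Dates and Close values from the sample data; Date is parsed to datetime64.
df = pd.DataFrame({
    "Date": pd.to_datetime([
        "2016-01-19", "2016-01-20", "2016-01-21", "2016-01-22", "2016-01-25",
        "2016-01-26", "2016-01-27", "2016-01-28", "2016-01-29", "2016-02-01",
    ]),
    "Close": [22.60, 22.77, 22.76, 23.33, 23.26, 23.74, 20.09, 19.37, 19.90, 19.14],
})


def aggregate_with_date_context(df, row, func, delta_left, delta_right):
    """Call func(window, center) with all rows whose Date lies within the given deltas."""
    left = row["Date"] - delta_left
    right = row["Date"] + delta_right
    window = df[(df["Date"] >= left) & (df["Date"] <= right)]  # both ends inclusive
    return func(window, row.name)


def window_length(window, center):
    """Example aggregation: the number of rows in the window."""
    return len(window)


two_days = timedelta(days=2)
counts = df.apply(
    lambda r: aggregate_with_date_context(df, r, window_length, two_days, two_days),
    axis=1,
)
```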