在数据框上滚动函数

时间:2015-01-28 10:52:51

标签: python pandas dataframe apply

我有以下数据框C

>>> C
              a    b   c
2011-01-01    0    0 NaN
2011-01-02   41   12 NaN
2011-01-03   82   24 NaN
2011-01-04  123   36 NaN
2011-01-05  164   48 NaN
2011-01-06  205   60   2
2011-01-07  246   72   4
2011-01-08  287   84   6
2011-01-09  328   96   8
2011-01-10  369  108  10

我想添加一个新列d,在我应用滚动函数的地方,在一个固定的窗口(这里是6),我在某种程度上,对于每一行(或日期),修复c。这个滚动函数中的一个循环应该是(伪):

              a    b   c   d
2011-01-01    0    0 NaN   a + b*2 (a,b from this row, '2' is from 'c' on 2011-01-06)
2011-01-02   41   12 NaN   a + b*2 (a,b from this row, '2' is still from 2011-01-06)
2011-01-03   82   24 NaN   a + b*2
2011-01-04  123   36 NaN   a + b*2
2011-01-05  164   48 NaN   a + b*2
2011-01-06  205   60   2   a + b*2
2011-01-07  246   72   4   
2011-01-08  287   84   6   
2011-01-09  328   96   8   
2011-01-10  369  108  10

此后"循环"我想在d中获取所有这6个计算行并运行一个函数调用,然后返回一个值,该值应存储在另一列e中说:

              a    b   c   d                               e

2011-01-01    0    0 NaN   a + b*2 ---|                   NaN
2011-01-02   41   12 NaN   a + b*2    |                   NaN
2011-01-03   82   24 NaN   a + b*2    | These values      NaN
2011-01-04  123   36 NaN   a + b*2    | are input to      NaN
2011-01-05  164   48 NaN   a + b*2    | function          NaN
2011-01-06  205   60   2   a + b*2 ---| yielding          X
2011-01-07  246   72   4                value X in
2011-01-08  287   84   6                column 'e'
2011-01-09  328   96   8   
2011-01-10  369  108  10

然后将此过程迭代到 next 窗口(再次为6长),如:

              a    b   c   d             e
2011-01-01    0    0 NaN   
2011-01-02   41   12 NaN   a + b*4 (a,b from this row, '4' is from 'c' now from 2011-01-07)
2011-01-03   82   24 NaN   a + b*4 (a,b from this row, '4' is still from 2011-01-07)
2011-01-04  123   36 NaN   a + b*4
2011-01-05  164   48 NaN   a + b*4
2011-01-06  205   60   2   a + b*4       X
2011-01-07  246   72   4   a + b*4
2011-01-08  287   84   6   
2011-01-09  328   96   8   
2011-01-10  369  108  10

              a    b   c   d                               e

2011-01-01    0    0 NaN                                  NaN
2011-01-02   41   12 NaN   a + b*4 ---|                   NaN
2011-01-03   82   24 NaN   a + b*4    | These values      NaN
2011-01-04  123   36 NaN   a + b*4    | are input to      NaN
2011-01-05  164   48 NaN   a + b*4    | function          NaN
2011-01-06  205   60   2   a + b*4    | yielding          X
2011-01-07  246   72   4   a + b*4 ---| value Y in        Y
2011-01-08  287   84   6                column 'e'
2011-01-09  328   96   8   
2011-01-10  369  108  10

希望这很清楚,

谢谢, Ñ

1 个答案:

答案 0 :(得分:7)

您可以使用pd.rolling_apply

import numpy as np
import pandas as pd
df = pd.read_table('data', sep='\s+')

def foo(x, df):
    window = df.iloc[x]
    # print(window)
    c = df.ix[int(x[-1]), 'c']
    dvals = window['a'] + window['b']*c
    return bar(dvals)

def bar(dvals):
    # print(dvals)
    return dvals.mean()

df['e'] = pd.rolling_apply(np.arange(len(df)), 6, foo, args=(df,))
print(df)

产量

              a    b   c       e
2011-01-01    0    0 NaN     NaN
2011-01-02   41   12 NaN     NaN
2011-01-03   82   24 NaN     NaN
2011-01-04  123   36 NaN     NaN
2011-01-05  164   48 NaN     NaN
2011-01-06  205   60   2   162.5
2011-01-07  246   72   4   311.5
2011-01-08  287   84   6   508.5
2011-01-09  328   96   8   753.5
2011-01-10  369  108  10  1046.5

argskwargs参数为added to rolling_apply in Pandas version 0.14.0

因为在上面的示例中,df是一个全局变量,所以它并不是必需的 将其作为参数传递给foo。您只需从df行中删除def foo,并在args=(df,)的调用中省略rolling_apply

但是,有时可能无法在df可访问的范围中定义foo。在这种情况下,有一个简单的解决方法 - 制作一个闭包:

def foo(df):
    def inner_foo(x):
        window = df.iloc[x]
        # print(window)
        c = df.ix[int(x[-1]), 'c']
        dvals = window['a'] + window['b']*c
        return bar(dvals)
    return inner_foo

df['e'] = pd.rolling_apply(np.arange(len(df)), 6, foo(df))