Optimize code to find the median of the values from the past 4 to 6 days for each row in a DataFrame

Time: 2017-05-27 05:50:07

Tags: python pandas optimization dataframe time-series

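Given a DataFrame of timestamped data, I would like to compute the median of a variable over the past 4-6 days. The median over the past 1-3 days can be computed with pandas.DataFrame.rolling, but I cannot work out how to use rolling to compute the median over the past 4-6 days.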

import pandas as pd
import numpy as np
import datetime
df = pd.DataFrame()
df['timestamp'] = pd.date_range('1/1/2011', periods=100, freq='6H')
np.random.seed(1)
df['var'] = pd.Series(np.random.randn(len(df['timestamp'])))

The data looks like this. In my real data there are gaps in time, and there may be more data points within a single day.

              timestamp       var
0   2011-01-01 00:00:00  1.624345
1   2011-01-01 06:00:00 -0.611756
2   2011-01-01 12:00:00 -0.528172
3   2011-01-01 18:00:00 -1.072969
4   2011-01-02 00:00:00  0.865408
5   2011-01-02 06:00:00 -2.301539
6   2011-01-02 12:00:00  1.744812
7   2011-01-02 18:00:00 -0.761207
8   2011-01-03 00:00:00  0.319039
9   2011-01-03 06:00:00 -0.249370
10  2011-01-03 12:00:00  1.462108

Expected output:

              timestamp       var  past4d-6d_var_median
0   2011-01-01 00:00:00  1.624345                   NaN # no data in past 4-6 days
1   2011-01-01 06:00:00 -0.611756                   NaN # no data in past 4-6 days
2   2011-01-01 12:00:00 -0.528172                   NaN # no data in past 4-6 days
3   2011-01-01 18:00:00 -1.072969                   NaN # no data in past 4-6 days
4   2011-01-02 00:00:00  0.865408                   NaN # no data in past 4-6 days
5   2011-01-02 06:00:00 -2.301539                   NaN # no data in past 4-6 days
6   2011-01-02 12:00:00  1.744812                   NaN # no data in past 4-6 days
7   2011-01-02 18:00:00 -0.761207                   NaN # no data in past 4-6 days
8   2011-01-03 00:00:00  0.319039                   NaN # no data in past 4-6 days
9   2011-01-03 06:00:00 -0.249370                   NaN # no data in past 4-6 days
10  2011-01-03 12:00:00  1.462108                   NaN # no data in past 4-6 days
11  2011-01-03 18:00:00 -2.060141                   NaN # no data in past 4-6 days
12  2011-01-04 00:00:00 -0.322417                   NaN # no data in past 4-6 days
13  2011-01-04 06:00:00 -0.384054                   NaN # no data in past 4-6 days
14  2011-01-04 12:00:00  1.133769                   NaN # no data in past 4-6 days
15  2011-01-04 18:00:00 -1.099891                   NaN # no data in past 4-6 days
16  2011-01-05 00:00:00 -0.172428                   NaN # only 4 data in past 4-6 days
17  2011-01-05 06:00:00 -0.877858             -0.528172
18  2011-01-05 12:00:00  0.042214             -0.569964
19  2011-01-05 18:00:00  0.582815             -0.528172
20  2011-01-06 00:00:00 -1.100619             -0.569964
21  2011-01-06 06:00:00  1.144724             -0.528172
22  2011-01-06 12:00:00  0.901591             -0.388771
23  2011-01-06 18:00:00  0.502494             -0.249370

My current code:

def findPastVar2(df, var='var', window=3, method='median'):
    # window = number of past days
    col = 'past{}d-{}d_{}_median'.format(window + 1, window * 2, var)
    for i in range(len(df)):  # range() instead of Python 2's xrange()
        delta = df['timestamp'] - df['timestamp'].loc[i]
        # rows more than `window` days and at most 2*window days before row i
        in_window = (delta < datetime.timedelta(days=-window)) & (delta >= datetime.timedelta(days=-window * 2))
        pastVar2 = df[var].loc[in_window]
        if pastVar2.shape[0] >= 5:  # at least 5 data points
            if method == 'median':
                df.loc[i, col] = np.median(pastVar2.values)
    return df

Current speed:

In [35]: %timeit df2 = findPastVar2(df)
1 loop, best of 3: 821 ms per loop

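I edited the post to show the expected output clearly, requiring at least 5 data points. I have set the random seed so that everyone should get the same input and be able to show the same output. As far as I can tell, a simple rolling followed by shift does not work when there are multiple data points on the same day.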

3 answers:

Answer 0: (score: 2)

Here we go:

df.set_index('timestamp', inplace = True)
df['var'] =df['var'].rolling('3D', min_periods = 3).median().shift(freq = pd.Timedelta('4d')).shift(-1)

df['var'] 
Out[55]: 
timestamp
2011-01-01 00:00:00         NaN
2011-01-01 06:00:00         NaN
2011-01-01 12:00:00         NaN
2011-01-01 18:00:00         NaN
2011-01-02 00:00:00         NaN
2011-01-02 06:00:00         NaN
2011-01-02 12:00:00         NaN
2011-01-02 18:00:00         NaN
2011-01-03 00:00:00         NaN
2011-01-03 06:00:00         NaN
2011-01-03 12:00:00         NaN
2011-01-03 18:00:00         NaN
2011-01-04 00:00:00         NaN
2011-01-04 06:00:00         NaN
2011-01-04 12:00:00         NaN
2011-01-04 18:00:00         NaN
2011-01-05 00:00:00         NaN
2011-01-05 06:00:00   -0.528172
2011-01-05 12:00:00   -0.569964
2011-01-05 18:00:00   -0.528172
2011-01-06 00:00:00   -0.569964
2011-01-06 06:00:00   -0.528172
2011-01-06 12:00:00   -0.569964
2011-01-06 18:00:00   -0.528172
2011-01-07 00:00:00   -0.388771
2011-01-07 06:00:00   -0.249370
2011-01-07 12:00:00   -0.388771
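
As a hedged variation on the same rolling-plus-shift idea (not the answerer's exact code), the sketch below uses closed='left' and min_periods=5 so that the window matches the question's definition: values in [t - 6 days, t - 3 days), with at least 5 points. It assumes a regular grid, so that for every timestamp t the label t - 3 days also exists in the index; that assumption does not hold for the gappy real data described in the question. The names s, rolled, shifted and past_median are illustrative.

```python
import numpy as np
import pandas as pd

# Same synthetic data as the question (seed 1, one point every 6 hours).
np.random.seed(1)
idx = pd.date_range('2011-01-01', periods=100, freq=pd.Timedelta(hours=6))
s = pd.Series(np.random.randn(len(idx)), index=idx, name='var')

# At label u, closed='left' makes the 3-day window cover [u - 3 days, u).
rolled = s.rolling('3D', closed='left', min_periods=5).median()

# shift(freq=...) moves the index labels, not the values: the result computed
# at u is relabelled u + 3 days, so at t it describes [t - 6 days, t - 3 days).
shifted = rolled.shift(freq='3D')

# The shifted labels run past the original index; keep only the original
# timestamps. ASSUMPTION: t - 3 days is itself a label for every t, which is
# true on this regular grid but not for irregular data.
past_median = shifted.reindex(s.index)

print(past_median.iloc[15:20])
```

On this data the first non-NaN value lands at 2011-01-05 06:00:00, in line with the expected output in the question.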

Answer 1: (score: 1)

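The way this is set up, the window is defined per row, and because the time series is irregular the windows will contain varying numbers of points, so it calls for an iterative approach like the one you have already started. However, if we make the timestamps the index: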

# setup the df:
df = pd.DataFrame(index = pd.date_range('1/1/2011', periods=100, freq='12H'))
df['var'] = np.random.randn(len(df))

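In this case I chose an interval of 12 hours, but it can be whatever is available, including irregular spacing. A modified function that takes a window for the median together with an offset (here, a positive Delta looks backward) gives you the flexibility you need: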

def GetMedian(df,var='var',window='2D',Delta='3D'):
    for Ti in df.index:
        Vals=df[(df.index < Ti-pd.Timedelta(Delta)) & \
                (df.index > Ti-pd.Timedelta(Delta)-pd.Timedelta(window))]
        df.loc[Ti,'Medians']=Vals[var].median()
    return df

This is considerably faster:

%timeit GetMedian(df)
84.8 ms ± 3.04 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
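
If the boolean mask over the whole index becomes the bottleneck, one possible refinement (a sketch, not part of the answer above) is to locate each window's bounds with np.searchsorted and take the median of a slice. The function name past_window_median and its parameters are hypothetical; it assumes the timestamps are sorted in ascending order, that near_days and far_days are whole days, and it uses the question's window [t - 6 days, t - 3 days) with at least 5 points.

```python
import numpy as np
import pandas as pd

def past_window_median(timestamps, values, near_days=3, far_days=6, min_points=5):
    """Median of values whose timestamp lies in [t - far_days, t - near_days).

    Hypothetical helper. ASSUMPTIONS: `timestamps` is sorted ascending and
    the day counts are integers.
    """
    t = pd.DatetimeIndex(timestamps).values  # numpy datetime64 array
    v = np.asarray(values, dtype=float)

    # lo[i]: first position with t >= t[i] - far_days  (inclusive lower bound)
    # hi[i]: first position with t >= t[i] - near_days (exclusive upper bound)
    lo = np.searchsorted(t, t - np.timedelta64(far_days, 'D'), side='left')
    hi = np.searchsorted(t, t - np.timedelta64(near_days, 'D'), side='left')

    out = np.full(len(t), np.nan)
    for i, (a, b) in enumerate(zip(lo, hi)):
        if b - a >= min_points:          # enough points in the window
            out[i] = np.median(v[a:b])   # slice, not a full-length mask
    return out

# Usage on the question's synthetic data (seed 1, one point every 6 hours).
np.random.seed(1)
ts = pd.date_range('2011-01-01', periods=100, freq=pd.Timedelta(hours=6))
var = np.random.randn(len(ts))
medians = past_window_median(ts, var)
```

There is still a Python-level loop, but each iteration touches only the rows inside its window, and nothing here relies on the timestamps being evenly spaced.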

Answer 2: (score: 0)

min_periods should be 2 rather than 5, because you should not count the window size. (5 - 3 = 2)

import pandas as pd
import numpy as np
import datetime
np.random.seed(1)  # set random seed for easier comparison
df = pd.DataFrame()
df['timestamp'] = pd.date_range('1/1/2011', periods=100, freq='D')
df['var'] = pd.Series(np.random.randn(len(df['timestamp'])))

def first():
    df['past4d-6d_var_median'] = [np.nan]*3 + df.rolling(window=3, min_periods=2).median()[:-3]['var'].tolist()
    return df

%timeit -n1000 first()
1000 loops, best of 3: 6.23 ms per loop

My first attempt did not use shift(), but then I saw Noobie's answer.

Using shift(), I came up with the following, which is much faster than the previous version.

def test():
    df['past4d-6d_var_median'] = df['var'].rolling(window=3, min_periods=2).median().shift(3)
    return df

%timeit -n1000 test()
1000 loops, best of 3: 1.66 ms per loop

The second is about 4 times as fast as the first (1.66 ms versus 6.23 ms per loop).

Both functions produce the same result, shown below:

df2 = test()
df2
                  timestamp       var   past4d-6d_var_median
    0   2011-01-01 00:00:00  1.624345                    NaN
    1   2011-01-02 00:00:00 -0.611756                    NaN
    2   2011-01-03 00:00:00 -0.528172                    NaN
    3   2011-01-04 00:00:00 -1.072969                    NaN
    4   2011-01-05 00:00:00  0.865408               0.506294
    5   2011-01-06 00:00:00 -2.301539              -0.528172
    6   2011-01-07 00:00:00  1.744812              -0.611756
    ...         ...            ...             ...
    93  2011-04-04 00:00:00 -0.638730               1.129484
    94  2011-04-05 00:00:00  0.423494               1.129484
    95  2011-04-06 00:00:00  0.077340               0.185156
    96  2011-04-07 00:00:00 -0.343854              -0.375285
    97  2011-04-08 00:00:00  0.043597              -0.375285
    98  2011-04-09 00:00:00 -0.620001               0.077340
    99  2011-04-10 00:00:00  0.698032               0.077340
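
To see which rows the rolling(window=3, min_periods=2).median().shift(3) expression draws on, here is a small sketch on a made-up series (the values 1 to 7 are illustrative, not from the thread):

```python
import pandas as pd

s = pd.Series([1, 2, 3, 4, 5, 6, 7], dtype=float)

# Row-count window: at row i this is the median of rows i-2..i,
# and NaN when fewer than 2 of those rows exist.
rolled = s.rolling(window=3, min_periods=2).median()

# shift(3) moves every value down by 3 rows, so row i now holds the
# median of rows i-5..i-3.
shifted = rolled.shift(3)

print(pd.DataFrame({'value': s, 'rolled': rolled, 'shifted': shifted}))
```

Row 4 receives 1.5, the median of rows 0 and 1, and from row 5 onward row i receives the median of rows i-5 through i-3. These are row offsets, so they line up with calendar days only when there is exactly one row per day, as in this answer's freq='D' setup.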