Speeding up a pandas query

Time: 2016-06-08 18:16:13

Tags: python pandas

I have 2 pandas Series: data and events.

  • data is an ordered series of data points
  • events holds the indices of the points of interest within data

I want to extract a fixed-size window around each point of interest.

I came up with this:

res = []
for k in events:
    win = data.loc[k - ticks_before:k + ticks_after].values
    res.append(win)

new_df = pd.DataFrame(res)

It works, but it is slow. Is there any pandas-fu that can make it fast?

EDIT: I found a solution that is 5x faster:

res = np.zeros((len(events), win_len))
i = 0
for k in events:
    res[i] = data.loc[k - ticks_before:k + ticks_after]
    i+=1

new_df = pd.DataFrame(res)

Any ideas to make it even faster?

Below is the code for the input and output:

Input:

import numpy as np
import pandas as pd

data = pd.Series(xrange(200))  # Python 2 example; use range() on Python 3
events = [50, 77, 98, 125, 133, 159, 161]
ticks_before = 32
ticks_after = 16
def slow_loop(data, events, ticks_before, ticks_after):
    res = []
    for k in events:
        win = data.loc[k - ticks_before:k + ticks_after].values
        res.append(win)
    new_df = pd.DataFrame(res)
    return new_df.mean()

def fast_loop(data, events, ticks_before, ticks_after):
    win_len = ticks_before + ticks_after + 1
    res = np.zeros((len(events), win_len))
    i = 0
    for k in events:
        res[i] = data.loc[k - ticks_before:k + ticks_after]
        i+=1

    new_df = pd.DataFrame(res)
    return new_df.mean()


assert(all(slow_loop(data, events, ticks_before, ticks_after) ==  
           fast_loop(data, events, ticks_before, ticks_after)))
%timeit slow_loop(data, events, ticks_before, ticks_after)
%timeit fast_loop(data, events, ticks_before, ticks_after)
fast_loop(data, events, ticks_before, ticks_after)

Output:

100 loops, best of 3: 3.66 ms per loop
1000 loops, best of 3: 632 µs per loop

0      82.714286
1      83.714286
2      84.714286
3      85.714286
4      86.714286
5      87.714286
6      88.714286
7      89.714286
8      90.714286
9      91.714286
10     92.714286
11     93.714286
12     94.714286
13     95.714286
14     96.714286
15     97.714286
16     98.714286
17     99.714286
18    100.714286
19    101.714286
20    102.714286
21    103.714286
22    104.714286
23    105.714286
24    106.714286
25    107.714286
26    108.714286
27    109.714286
28    110.714286
29    111.714286
30    112.714286
31    113.714286
32    114.714286
33    115.714286
34    116.714286
35    117.714286
36    118.714286
37    119.714286
38    120.714286
39    121.714286
40    122.714286
41    123.714286
42    124.714286
43    125.714286
44    126.714286
45    127.714286
46    128.714286
47    129.714286
48    130.714286
dtype: float64

4 Answers:

Answer 0 (score: 2)

Here is a NumPy solution, which is roughly 10x faster than fast_loop:

# numpy solution
def np1(data, events, ticks_before, ticks_after):
    return pd.Series(
                np.concatenate(
                    [data.values[x - ticks_before: x + ticks_after+1] for x in events])
                .reshape(len(events), ticks_before + ticks_after+1)
                .mean(0))

# similar Pandas solution
def pd1(data, events, ticks_before, ticks_after):
    return pd.Series(
            pd.concat(
                [data[x - ticks_before : x + ticks_after +1] for x in events],
                ignore_index=True)
              # note: Series.reshape was removed in newer pandas; use .values.reshape(...) there
              .reshape((len(events), ticks_before + ticks_after +1))
              .mean(0))
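
A further vectorized variant along the same lines, shown here only as a sketch: build the full index matrix with np.add.outer and pull all windows out of the underlying array in one fancy-indexing step. The np2 name is illustrative, and it assumes every window lies within the bounds of data and that data has a default integer index (the same assumption np1 makes):

# vectorized window extraction via a 2-D index matrix (illustrative sketch)
def np2(data, events, ticks_before, ticks_after):
    offsets = np.arange(-ticks_before, ticks_after + 1)   # relative positions within a window
    idx = np.add.outer(np.asarray(events), offsets)       # shape (n_events, win_len)
    return pd.Series(data.values[idx].mean(axis=0))       # mean over events for each offset
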
Timings against a 20M-row Series:

In [440]: %timeit slow_loop(data2, events, ticks_before, ticks_after)
The slowest run took 10.67 times longer than the fastest. This could mean that an intermediate result is being cached.
100 loops, best of 3: 4.7 ms per loop

In [441]: %timeit fast_loop(data2, events, ticks_before, ticks_after)
1000 loops, best of 3: 936 µs per loop

In [442]: %timeit pir5(data2, events, ticks_before, ticks_after)
1000 loops, best of 3: 436 µs per loop

In [443]: %timeit pd1(data2, events, ticks_before, ticks_after)
1000 loops, best of 3: 804 µs per loop

In [444]: %timeit np1(data2, events, ticks_before, ticks_after)
10000 loops, best of 3: 75.8 µs per loop

Setup:

In [435]: data2 = data.copy()

In [436]: data2 = pd.concat([data2] * 10**5, ignore_index=True)

In [437]: data2.shape
Out[437]: (20000000,)

OLD answer:

Timings (on a different, slower machine):

In [353]: %timeit fast_loop(data, events, ticks_before, ticks_after)
100 loops, best of 3: 2.27 ms per loop

In [354]: %timeit np1(data, events, ticks_before, ticks_after)
1000 loops, best of 3: 222 µs per loop

In [360]: %timeit slow_loop(data, events, ticks_before, ticks_after)
100 loops, best of 3: 12.5 ms per loop

Check:

In [356]: (fast_loop(data, events, ticks_before, ticks_after) == np1(data, events, ticks_before, ticks_after)).all()
Out[356]: True

Answer 1 (score: 1)

I give up! I tried a lot of things. Here are just a few of them:

def pir1(data, events, ticks_before, ticks_after):
    rng = np.add.outer(events, [-1 * ticks_before, ticks_after + 1])
    res = np.zeros(ticks_before + ticks_after + 1)
    for r in rng:
        res += data[r[0]:r[1]]
    res /= len(rng)
    return res

def pir2(data, events, ticks_before, ticks_after):
    rng = np.add.outer(events, [-1 * ticks_before, ticks_after + 1])
    return np.array([data[r[0]:r[1]] for r in rng]).mean(axis=0)

def pir3(data, events, ticks_before, ticks_after):
    events = np.asarray(events)
    return pd.DataFrame([data[offset + events].mean() for offset in range(-ticks_before, ticks_after + 1)])

def pir4(data, events, ticks_before, ticks_after):
    events = np.asarray(events)
    return pd.DataFrame([data[offset + events] for offset in range(-ticks_before, ticks_after + 1)]).mean(axis=1)

def pir5(data, events, ticks_before, ticks_after):
    events = np.asarray(events)
    data = data.values
    return np.dstack((data[offset + events] for offset in range(-ticks_before, ticks_after + 1))).mean(axis=1)

def pir6(data, events, ticks_before, ticks_after):
    events = np.asarray(events)
    cums = data.cumsum()
    return np.dstack((data[offset + events] for offset in range(-ticks_before, ticks_after + 1))).mean(axis=1)
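
As a sanity check, here is a sketch of how pir5 can be compared against the question's fast_loop. Note that pir5 returns a NumPy array of shape (1, window_length), and that this assumes an environment where the generator passed to np.dstack is accepted (newer NumPy versions reject it):

import numpy as np

res_pir5 = np.asarray(pir5(data, events, ticks_before, ticks_after)).ravel()
res_fast = fast_loop(data, events, ticks_before, ticks_after).values
assert np.allclose(res_pir5, res_fast)  # same per-offset means
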

Timings: pir5 just edges out the rest.

[timing plot not reproduced]

Answer 2 (score: 0)

I think filtering on a time index with pandas should be quite efficient.

Something like this:
df.set_index('my_time_variable',inplace=True)
df[time - ticks_before:time + ticks_after]

should speed up the query. You can then still loop over all the dates. Just make sure time - ticks_before is in a date format that pandas recognizes, e.g. '2015-05-05 13:30'.
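
A minimal, self-contained sketch of that idea; the index, frequency, and event times below are made up purely for illustration:

import numpy as np
import pandas as pd

# toy series with a DatetimeIndex (illustrative data only)
idx = pd.date_range('2015-05-05 13:30', periods=200, freq='T')
s = pd.Series(np.arange(200), index=idx)

ticks_before = pd.Timedelta(minutes=32)
ticks_after = pd.Timedelta(minutes=16)
event_times = idx[[50, 77, 98, 125, 133, 159, 161]]

# label-based slicing on a sorted DatetimeIndex is inclusive on both ends
windows = [s[t - ticks_before:t + ticks_after] for t in event_times]
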

Answer 3 (score: 0)

Simplified it a bit. One question: why does xxx.mean() return a series of numbers? A mean should be a single number! That may be adding processing time.

def fast_loop(data, events, ticks_before, ticks_after):
    win_len = ticks_before + ticks_after + 1
    res = np.zeros((len(events), win_len))

    # enumerate supplies the row index i that a manual counter would otherwise track
    for i, k in enumerate(events):
        res[i] = data.loc[k - ticks_before:k + ticks_after]

    return pd.DataFrame(res).mean()
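
For context, a small sketch of why the mean comes back as a series of numbers: DataFrame.mean() reduces column-wise and returns one value per window offset, whereas a single number over the whole window matrix would be the mean of the underlying array:

import numpy as np
import pandas as pd

res = np.zeros((7, 49))    # 7 events x 49 window positions, as in the example above
df = pd.DataFrame(res)

print(df.mean().shape)     # (49,) -> one mean per column, i.e. per window offset
print(res.mean())          # 0.0   -> a single scalar over all elements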