I have two pandas Series: data and events.
I want to extract a fixed-size window around each point of interest.
I came up with:
res = []
for k in events:
    win = data.loc[k - ticks_before:k + ticks_after].values
    res.append(win)
new_df = pd.DataFrame(res)
This works, but it is slow. Any pandas-fu to make it fast?
EDIT: found a solution that is about 5x faster:
res = np.zeros((len(events), win_len))
i = 0
for k in events:
    res[i] = data.loc[k - ticks_before:k + ticks_after]
    i += 1
new_df = pd.DataFrame(res)
Any ideas on making it faster still?
Here is the code for the input and output:
Input:
data = pd.Series(xrange(200))
events = [50, 77, 98, 125, 133, 159, 161]
ticks_before = 32
ticks_after = 16
def slow_loop(data, events, ticks_before, ticks_after):
    res = []
    for k in events:
        win = data.loc[k - ticks_before:k + ticks_after].values
        res.append(win)
    new_df = pd.DataFrame(res)
    return new_df.mean()
def fast_loop(data, events, ticks_before, ticks_after):
    win_len = ticks_before + ticks_after + 1
    res = np.zeros((len(events), win_len))
    i = 0
    for k in events:
        res[i] = data.loc[k - ticks_before:k + ticks_after]
        i += 1
    new_df = pd.DataFrame(res)
    return new_df.mean()
assert(all(slow_loop(data, events, ticks_before, ticks_after) ==
fast_loop(data, events, ticks_before, ticks_after)))
%timeit slow_loop(data, events, ticks_before, ticks_after)
%timeit fast_loop(data, events, ticks_before, ticks_after)
fast_loop(data, events, ticks_before, ticks_after)
Output:
100 loops, best of 3: 3.66 ms per loop
1000 loops, best of 3: 632 µs per loop
0 82.714286
1 83.714286
2 84.714286
3 85.714286
4 86.714286
5 87.714286
6 88.714286
7 89.714286
8 90.714286
9 91.714286
10 92.714286
11 93.714286
12 94.714286
13 95.714286
14 96.714286
15 97.714286
16 98.714286
17 99.714286
18 100.714286
19 101.714286
20 102.714286
21 103.714286
22 104.714286
23 105.714286
24 106.714286
25 107.714286
26 108.714286
27 109.714286
28 110.714286
29 111.714286
30 112.714286
31 113.714286
32 114.714286
33 115.714286
34 116.714286
35 117.714286
36 118.714286
37 119.714286
38 120.714286
39 121.714286
40 122.714286
41 123.714286
42 124.714286
43 125.714286
44 126.714286
45 127.714286
46 128.714286
47 129.714286
48 130.714286
dtype: float64
Answer 0 (score: 2)
Here is a numpy solution, roughly 10 times faster compared to fast_loop:
# numpy solution
def np1(data, events, ticks_before, ticks_after):
    return pd.Series(
        np.concatenate(
            [data.values[x - ticks_before: x + ticks_after + 1] for x in events])
        .reshape(len(events), ticks_before + ticks_after + 1)
        .mean(0))
# similar Pandas solution (.values is needed here, since a concatenated
# Series has no .reshape in current pandas)
def pd1(data, events, ticks_before, ticks_after):
    return pd.Series(
        pd.concat(
            [data[x - ticks_before : x + ticks_after + 1] for x in events],
            ignore_index=True)
        .values.reshape((len(events), ticks_before + ticks_after + 1))
        .mean(0))
Timings for a 20M-row Series:
In [440]: %timeit slow_loop(data2, events, ticks_before, ticks_after)
The slowest run took 10.67 times longer than the fastest. This could mean that an intermediate result is being cached.
100 loops, best of 3: 4.7 ms per loop
In [441]: %timeit fast_loop(data2, events, ticks_before, ticks_after)
1000 loops, best of 3: 936 µs per loop
In [442]: %timeit pir5(data2, events, ticks_before, ticks_after)
1000 loops, best of 3: 436 µs per loop
In [443]: %timeit pd1(data2, events, ticks_before, ticks_after)
1000 loops, best of 3: 804 µs per loop
In [444]: %timeit np1(data2, events, ticks_before, ticks_after)
10000 loops, best of 3: 75.8 µs per loop
Setup:
In [435]: data2 = data.copy()
In [436]: data2 = pd.concat([data2] * 10**5, ignore_index=True)
In [437]: data2.shape
Out[437]: (20000000,)
OLD answer:
Timings (on a different/slower machine):
In [353]: %timeit fast_loop(data, events, ticks_before, ticks_after)
100 loops, best of 3: 2.27 ms per loop
In [354]: %timeit np1(data, events, ticks_before, ticks_after)
1000 loops, best of 3: 222 µs per loop
In [360]: %timeit slow_loop(data, events, ticks_before, ticks_after)
100 loops, best of 3: 12.5 ms per loop
Check:
In [356]: (fast_loop(data, events, ticks_before, ticks_after) == np1(data, events, ticks_before, ticks_after)).all()
Out[356]: True
Answer 1 (score: 1)
def pir1(data, events, ticks_before, ticks_after):
    rng = np.add.outer(events, [-1 * ticks_before, ticks_after + 1])
    res = np.zeros(ticks_before + ticks_after + 1)
    for r in rng:
        res += data[r[0]:r[1]]
    res /= len(rng)
    return res

def pir2(data, events, ticks_before, ticks_after):
    rng = np.add.outer(events, [-1 * ticks_before, ticks_after + 1])
    return np.array([data[r[0]:r[1]] for r in rng]).mean(axis=0)

def pir3(data, events, ticks_before, ticks_after):
    events = np.asarray(events)
    return pd.DataFrame([data[offset + events].mean()
                         for offset in range(-ticks_before, ticks_after + 1)])

def pir4(data, events, ticks_before, ticks_after):
    events = np.asarray(events)
    return pd.DataFrame([data[offset + events]
                         for offset in range(-ticks_before, ticks_after + 1)]).mean(axis=1)

def pir5(data, events, ticks_before, ticks_after):
    events = np.asarray(events)
    data = data.values
    # a list, not a generator, since np.dstack no longer accepts generators
    return np.dstack([data[offset + events]
                      for offset in range(-ticks_before, ticks_after + 1)]).mean(axis=1)

def pir6(data, events, ticks_before, ticks_after):
    events = np.asarray(events)
    cums = data.cumsum()
    return np.dstack([data[offset + events]
                      for offset in range(-ticks_before, ticks_after + 1)]).mean(axis=1)
Timings: pir5 beats it by a bit.
Answer 2 (score: 0)
I think filtering on a time index with pandas should be quite efficient.
Something like

df.set_index('my_time_variable', inplace=True)
df[time - ticks_before:time + ticks_after]

should speed up the query. You can then still loop over all the dates.
Be sure to use a format that pandas recognizes as a date for
time - ticks_before, e.g. '2015-05-05 13:30'
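A minimal sketch of that idea, assuming a DataFrame with a datetime column (the column name my_time_variable comes from the answer; the timestamps, the value column, and the 5-minute window are illustrative, not from the question):

```python
import pandas as pd

# hypothetical frame: one row per minute from 13:00 to 13:59
df = pd.DataFrame({
    'my_time_variable': pd.date_range('2015-05-05 13:00', periods=60, freq='min'),
    'value': range(60),
})
df = df.set_index('my_time_variable')

time = pd.Timestamp('2015-05-05 13:30')
ticks_before = pd.Timedelta(minutes=5)
ticks_after = pd.Timedelta(minutes=5)

# label slicing on a sorted DatetimeIndex is an efficient range lookup,
# and it is inclusive on both endpoints
win = df[time - ticks_before : time + ticks_after]
print(len(win))  # 11 rows: 13:25 through 13:35 inclusive
```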
Answer 3 (score: 0)
Simplifying a bit, a question: why does xxx.mean() return a series of numbers? A mean should be a single number! That may add processing time.
def fast_loop(data, events, ticks_before, ticks_after):
    win_len = ticks_before + ticks_after + 1
    res = np.zeros((len(events), win_len))
    for i, k in enumerate(events):
        res[i] = data.loc[k - ticks_before:k + ticks_after]
    return pd.DataFrame(res).mean()