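I'm quite comfortable using pandas and numpy for vectorized operations over entire columns of data. However, I've run into a situation that I can't seem to vectorize. When a calculation uses the previous row's computed value to calculate the current row, I have to fall back to a for loop.
Is it possible to vectorize this kind of thing? Here is a simple example of what I mean: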
import numpy as np
import pandas as pd

# Test set of 20 random integers
df = pd.DataFrame({'base': [15, 16, 2, 16, 14,
                            1, 18, 18, 4, 7,
                            4, 18, 19, 13, 16,
                            11, 1, 8, 1, 9]})
# Empty array to hold calculated values
calc_data = np.empty((20, 1))
period = 14
for idx, value in enumerate(df.base):
    # Seeding the first element of the calculated array
    if idx == 0:
        calc_data[idx] = 5
    else:
        calc_data[idx] = (calc_data[idx - 1] * (period - 1) + df.base.iloc[idx]) / period
# Adding the column to the dataframe
df['calculated'] = calc_data
print(df)
Output:
base calculated
0 15 5.000000
1 16 5.785714
2 2 5.515306
3 16 6.264213
4 14 6.816769
5 1 6.401286
6 18 7.229765
7 18 7.999068
8 4 7.713420
9 7 7.662461
10 4 7.400857
11 18 8.157939
12 19 8.932372
13 13 9.222916
14 16 9.706994
15 11 9.799351
16 1 9.170826
17 8 9.087196
18 1 8.509539
19 9 8.544572
Answer 0 (score: 2)
One vectorized approach (taking "vectorized" to mean "avoiding a Python-level loop") is to treat the recurrence as a linear signal filter:
import numpy as np
import pandas as pd
import scipy.signal

def via_lfilter(arr):
    period = 14
    y0 = 5.0  # initial value
    # calc_data[idx] = (calc_data[idx - 1] * (period - 1) + df.base.iloc[idx]) / period
    b = [1.0/period]  # coefficients of 'original' terms
    a = [1.0, -(period-1)/period]  # coefficients of 'computed' terms
    zi = scipy.signal.lfiltic(b, a, [y0], x=arr[1::-1])
    y = np.zeros_like(arr)
    y[0] = y0
    result = scipy.signal.lfilter(b, a, arr[1:], axis=0, zi=zi)
    y[1:] = result[0]
    return y
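The coefficients come from rearranging the recurrence into the form lfilter expects, a[0]*y[n] + a[1]*y[n-1] = b[0]*x[n]. As a rough sanity check, the sketch below compares a filter-based version with a plain loop on a short input. The names loop_reference and filter_version are made up for this illustration, and the seed 5.0 and period 14 are taken from the question.

```python
import numpy as np
from scipy.signal import lfilter, lfiltic

PERIOD = 14
Y0 = 5.0  # seed value from the question


def loop_reference(x):
    """Plain Python loop implementing the recurrence from the question."""
    y = np.empty(len(x), dtype=float)
    y[0] = Y0
    for i in range(1, len(x)):
        y[i] = (y[i - 1] * (PERIOD - 1) + x[i]) / PERIOD
    return y


def filter_version(x):
    """Same recurrence expressed as a first-order IIR filter."""
    b = [1.0 / PERIOD]                  # weight on the current input
    a = [1.0, -(PERIOD - 1) / PERIOD]   # weight on the previous output
    zi = lfiltic(b, a, [Y0])            # filter state implied by y[-1] = Y0
    tail, _ = lfilter(b, a, x[1:], zi=zi)
    return np.concatenate(([Y0], tail))


x = np.array([15, 16, 2, 16, 14], dtype=float)
print(np.allclose(loop_reference(x), filter_version(x)))
```

Since b has a single coefficient, the filter state depends only on the previous output, which is why only the seed is passed to lfiltic in this sketch.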
But in the real world, I'd just use numba, which is designed precisely to give us the performance benefits of vectorization without the headaches:
import numba

@numba.jit(nopython=True)
def via_numba(arr):
    calc_data = np.zeros_like(arr)
    period = 14
    calc_data[0] = 5.0  # initial value
    for idx in range(1, len(arr)):
        calc_data[idx] = (calc_data[idx - 1] * (period - 1) + arr[idx]) / period
    return calc_data
These give me:
In [238]: df["vect"] = via_lfilter(df.base.values.astype(float))
...: df["via_numba"] = via_numba(df.base.values.astype(float))
...:
...:
In [239]: df
Out[239]:
base calculated vect via_numba
0 15 5.000000 5.000000 5.000000
1 16 5.785714 5.785714 5.785714
2 2 5.515306 5.515306 5.515306
3 16 6.264213 6.264213 6.264213
4 14 6.816769 6.816769 6.816769
5 1 6.401286 6.401286 6.401286
6 18 7.229765 7.229765 7.229765
7 18 7.999068 7.999068 7.999068
8 4 7.713420 7.713420 7.713420
9 7 7.662461 7.662461 7.662461
10 4 7.400857 7.400857 7.400857
11 18 8.157939 8.157939 8.157939
12 19 8.932372 8.932372 8.932372
13 13 9.222916 9.222916 9.222916
14 16 9.706994 9.706994 9.706994
15 11 9.799351 9.799351 9.799351
16 1 9.170826 9.170826 9.170826
17 8 9.087196 9.087196 9.087196
18 1 8.509539 8.509539 8.509539
19 9 8.544572 8.544572 8.544572
Both perform reasonably on larger frames:
In [240]: df = pd.DataFrame({"base": np.random.uniform(1, 100, 10**6)})
In [241]: %timeit via_lfilter(df.base.values.astype(float))
11.4 ms ± 49.5 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [242]: %timeit via_numba(df.base.values.astype(float))
11 ms ± 342 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Answer 1 (score: 1)
The following is vectorized in the sense that every operation used is an array operation at the pandas and numpy level.
X = ((period-1)/period) ** np.arange(len(df)) / period
a = df.base.copy()
a.loc[0] = 5*period
df['calculated'] = a.expanding().apply(lambda x: np.sum(x * X[:len(x)][::-1]), raw=True)
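A fast solution can be built by unrolling the sequential nature of the recurrence.
Namely, notice that each element of the result follows a specific pattern: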
0: 5
1: 5 (13/14) + 16 (1/14)
2: 5 (13 / 14)^2 + 16 (13 / 14^2) + 2 (1/14)
...
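Written out in general, the pattern is the unrolled form of the recurrence. The symbols here are notation introduced only for this note: p is the period, r = (p-1)/p, x_k is the k-th value of df.base, and y_n is the n-th calculated value.

```latex
y_0 = 5, \qquad
y_n = r\,y_{n-1} + \frac{x_n}{p}, \qquad r = \frac{p-1}{p}
\;\Longrightarrow\;
y_n = r^{n}\,y_0 + \frac{1}{p}\sum_{k=1}^{n} r^{\,n-k}\,x_k
```

For n = 2 this gives 5 (13/14)^2 + 16 (13/14^2) + 2 (1/14), matching row 2.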
If the first element is multiplied by 14, we can express the above as
0: sum{(1/14)*[70]}
1: sum{(1/14)*[70(13/14), 16]}
2: sum{(1/14)*[70(13/14)^2, 16(13/14), 2]}
...
If we strip the df.base elements out, we are left with the sequences of weights that they are multiplied by:
0: (1/14) * [1]
1: (1/14) * [(13/14), 1]
2: (1/14) * [(13/14)^2, (13/14), 1]
...
Each sequence in the series above can be obtained as a reversed slice of:
X = ((period-1)/period) ** np.arange(len(df)) / period
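To make the "reversed slice" concrete, here is a small sketch for the first few rows. The helper weights_for_row is a made-up name for illustration, and the values 70, 16, 2 are the modified first three inputs from the example.

```python
import numpy as np

period = 14
n_rows = 4
# Same definition of X as above, for a 4-row frame
X = ((period - 1) / period) ** np.arange(n_rows) / period


def weights_for_row(k):
    """Weights applied to rows 0..k when computing row k."""
    return X[:k + 1][::-1]


w2 = weights_for_row(2)  # [(13/14)^2 / 14, (13/14) / 14, 1/14]
row2 = np.sum(np.array([70.0, 16.0, 2.0]) * w2)
print(w2, row2)
```

The oldest input gets the smallest weight and the current input gets 1/14, so reversing the slice lines the weights up with the window in row order.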
Also note that the first value of df.base is not used in the construction of calculated; it is replaced by 5*period = 70.
Hence the n-th result is the sum of the expanding window of the modified df.base multiplied by the appropriate slice of X:
a = df.base.copy()
a.loc[0] = 5*period
df['calculated'] = a.expanding().apply(lambda x: np.sum(x * X[:len(x)][::-1]), raw=True)
# df outputs:
base calculated
0 15 5.000000
1 16 5.785714
2 2 5.515306
3 16 6.264213
4 14 6.816769
5 1 6.401286
6 18 7.229765
7 18 7.999068
8 4 7.713420
9 7 7.662461
10 4 7.400857
11 18 8.157939
12 19 8.932372
13 13 9.222916
14 16 9.706994
15 11 9.799351
16 1 9.170826
17 8 9.087196
18 1 8.509539
19 9 8.544572
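As a side note, the weights above are those of an exponentially weighted mean with alpha = 1/period. A different technique from the expanding-window approach, sketched here rather than taken from the answer, is pandas' ewm with adjust=False, which computes y[0] = x[0] and y[t] = (1 - alpha) * y[t-1] + alpha * x[t]. Under that definition the seed can be written directly into the first row (5, not 70). The variable names seeded and via_ewm are illustrative only.

```python
import pandas as pd

period = 14
base = pd.Series([15, 16, 2, 16, 14], dtype=float)

# Replace the first input with the seed, since ewm starts from x[0]
seeded = base.copy()
seeded.iloc[0] = 5.0

# adjust=False selects the recursive form of the exponentially weighted mean
via_ewm = seeded.ewm(alpha=1 / period, adjust=False).mean()
print(via_ewm)
```

This runs in compiled pandas code and avoids calling a Python lambda once per window.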
Answer 2 (score: -1)
You can use the shift() method to access values shifted by n positions,
which should make your task easier:
df.value.shift(1) + df.value
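For reference, a small sketch of what shift(1) returns, using the question's base column (df.value above is assumed to be a placeholder column name). The names prev and pair_sum are illustrative.

```python
import pandas as pd

df = pd.DataFrame({"base": [15, 16, 2, 16, 14]})

# shift(1) moves every value down one row; the first row becomes NaN
prev = df["base"].shift(1)

# Element-wise sum of each input with the previous input
pair_sum = prev + df["base"]
print(pair_sum)
```

Because shift reads a column that already exists, it sees the previous input rather than the previous computed result, so on its own it does not express the recurrence in the question.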