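Python vectorized operations that involve data from the previous row

Time: 2018-07-30 13:18:20

Tags: python pandas numpy vectorization

I'm comfortable using pandas and numpy to perform vectorized operations on whole columns of data. However, I've run into a case that I can't seem to vectorize. When a calculation uses the previous row's computed value to compute the current row, I have to fall back on a for loop.

Is it possible to vectorize this kind of thing? Here is a simple example of what I mean: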

import numpy as np
import pandas as pd

# Test set of 20 random integers
df = pd.DataFrame({'base': [15, 16, 2, 16, 14,
                            1, 18, 18, 4, 7,
                            4, 18, 19, 13, 16,
                            11, 1, 8, 1, 9]})


# Empty 1-D array to hold calculated values
calc_data = np.empty(len(df))

period = 14

for idx, value in enumerate(df.base):

    # Seeding the first element of the calculated array
    if idx == 0:
        calc_data[idx] = 5

    else:
        calc_data[idx] = (calc_data[idx - 1] * (period - 1) + df.base.iloc[idx]) / period

# Adding the column to the dataframe
df['calculated'] = calc_data

print(df)

Output:

    base  calculated
0     15    5.000000
1     16    5.785714
2      2    5.515306
3     16    6.264213
4     14    6.816769
5      1    6.401286
6     18    7.229765
7     18    7.999068
8      4    7.713420
9      7    7.662461
10     4    7.400857
11    18    8.157939
12    19    8.932372
13    13    9.222916
14    16    9.706994
15    11    9.799351
16     1    9.170826
17     8    9.087196
18     1    8.509539
19     9    8.544572

3 answers:

Answer 0 (score: 2)

One vectorized approach (taking "vectorized" to mean "avoiding Python-level loops") is to treat this as a linear signal filter:

import numpy as np
import pandas as pd
import scipy.signal

def via_lfilter(arr):
    period = 14
    y0 = 5.0  # initial value

    # calc_data[idx] = (calc_data[idx - 1] * (period - 1) + df.base.iloc[idx]) / period
    b = [1.0/period]  # coefficients of 'original' terms
    a = [1.0, -(period-1)/period]  # coefficients of 'computed' terms

    zi = scipy.signal.lfiltic(b, a, [y0], x=arr[1::-1])

    y = np.zeros_like(arr)
    y[0] = y0
    result = scipy.signal.lfilter(b, a, arr[1:], axis=0, zi=zi)
    y[1:] = result[0]

    return y
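
To see where b and a come from: scipy.signal.lfilter is documented as evaluating the difference equation a[0]*y[n] = b[0]*x[n] + ... - a[1]*y[n-1] - ..., so with a[0] = 1 the question's recursion reads y[n] = (1/period)*x[n] + ((period-1)/period)*y[n-1]. Below is a minimal sketch of that mapping that does not need scipy; direct_form is a hypothetical helper written for illustration, not a library function.

```python
import numpy as np

def direct_form(x, b0, a1, y_prev):
    """Hypothetical helper: evaluate y[n] = b0*x[n] - a1*y[n-1] in a plain loop."""
    out = np.empty(len(x), dtype=float)
    for n, xn in enumerate(x):
        y_prev = b0 * xn - a1 * y_prev
        out[n] = y_prev
    return out

period = 14
base = np.array([15, 16, 2, 16, 14], dtype=float)

# Same coefficients as in via_lfilter
b0 = 1.0 / period
a1 = -(period - 1) / period

# Seed with 5.0 and filter everything after the first element
filtered = direct_form(base[1:], b0, a1, y_prev=5.0)

# The question's recursion, for comparison
expected = [5.0]
for value in base[1:]:
    expected.append((expected[-1] * (period - 1) + value) / period)

print(np.allclose(filtered, expected[1:]))
```

The minus sign on a1 is why the feedback coefficient is stored as a negative number in a.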

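In the real world, though, I'd just use numba, which is designed precisely to give us the performance benefits of vectorization without the headaches: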

import numba

@numba.jit(nopython=True)
def via_numba(arr):
    calc_data = np.zeros_like(arr)
    period = 14
    calc_data[0] = 5.0  # initial value
    for idx in range(1, len(arr)):
        calc_data[idx] = (calc_data[idx - 1] * (period - 1) + arr[idx]) / period
    return calc_data

These give me:

In [238]: df["vect"] = via_lfilter(df.base.values.astype(float))
     ...: df["via_numba"] = via_numba(df.base.values.astype(float))
     ...: 
     ...: 

In [239]: df
Out[239]: 
    base  calculated      vect  via_numba
0     15    5.000000  5.000000   5.000000
1     16    5.785714  5.785714   5.785714
2      2    5.515306  5.515306   5.515306
3     16    6.264213  6.264213   6.264213
4     14    6.816769  6.816769   6.816769
5      1    6.401286  6.401286   6.401286
6     18    7.229765  7.229765   7.229765
7     18    7.999068  7.999068   7.999068
8      4    7.713420  7.713420   7.713420
9      7    7.662461  7.662461   7.662461
10     4    7.400857  7.400857   7.400857
11    18    8.157939  8.157939   8.157939
12    19    8.932372  8.932372   8.932372
13    13    9.222916  9.222916   9.222916
14    16    9.706994  9.706994   9.706994
15    11    9.799351  9.799351   9.799351
16     1    9.170826  9.170826   9.170826
17     8    9.087196  9.087196   9.087196
18     1    8.509539  8.509539   8.509539
19     9    8.544572  8.544572   8.544572

Both perform reasonably on larger frames:

In [240]: df = pd.DataFrame({"base": np.random.uniform(1, 100, 10**6)})

In [241]: %timeit via_lfilter(df.base.values.astype(float))
11.4 ms ± 49.5 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [242]: %timeit via_numba(df.base.values.astype(float))
11 ms ± 342 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

Answer 1 (score: 1)

tl;dr:

The following is vectorized in the sense that every operation used is an array operation at the pandas or numpy level.

X = ((period-1)/period) ** np.arange(len(df)) / period
a = df.base.copy()
a.loc[0] = 5*period
df['calculated'] = a.expanding().apply(lambda x: np.sum(x * X[:len(x)][::-1]), raw=True)

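Explanation:

A fast solution can be built by unrolling the sequential nature of the recursion.

Namely, note that each element of the result follows a particular pattern: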

0: 5
1: 5 (13/14) + 16 (1/14)
2: 5 (13 / 14)^2 + 16 (13 / 14^2) + 2 (1/14)
...

If the first element is multiplied by 14, we can express the above as

0: sum{(1/14)*[70]}
1: sum{(1/14)*[70(13/14), 16]}
2: sum{(1/14)*[70(13/14)^2, 16(13/14), 2]}
...

If we strip out the elements that come from df.base, we are left with the weight sequences to sum against:

0: (1/14) * [1]
1: (1/14) * [(13/14), 1]
2: (1/14) * [(13/14)^2, (13/14), 1]
...

Each of the sequences above can be obtained as a reversed slice of the following:

X = ((period-1)/period) ** np.arange(len(df)) / period
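
As a quick check of the reversed-slice claim, here is a small sketch that rebuilds the weights for result index 2 and applies them to the first three values. It assumes a length of 20 in place of len(df) and uses the seed-times-period value 70 for the first element, as in the derivation.

```python
import numpy as np

period = 14
n = 20  # stands in for len(df)
r = (period - 1) / period

X = r ** np.arange(n) / period

# Weights for result index 2: the first three entries of X, reversed
weights = X[:3][::-1]
print(np.allclose(weights, [r**2 / period, r / period, 1 / period]))

# Applying them to the modified data [70, 16, 2] gives row 2 of the result
row2 = np.sum(np.array([5 * period, 16, 2]) * weights)
print(round(row2, 6))
```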

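Also note that the first value of df.base is not used in the construction of calculated. It is replaced by 5*period = 70 instead.

So the n-th result is the sum of the expanding window of the modified df.base, multiplied elementwise by the appropriate slice of X: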

a = df.base.copy()
a.loc[0] = 5*period
df['calculated'] = a.expanding().apply(lambda x: np.sum(x * X[:len(x)][::-1]), raw=True)
# df outputs:
    base  calculated
0     15    5.000000
1     16    5.785714
2      2    5.515306
3     16    6.264213
4     14    6.816769
5      1    6.401286
6     18    7.229765
7     18    7.999068
8      4    7.713420
9      7    7.662461
10     4    7.400857
11    18    8.157939
12    19    8.932372
13    13    9.222916
14    16    9.706994
15    11    9.799351
16     1    9.170826
17     8    9.087196
18     1    8.509539
19     9    8.544572
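
The weights in X decay exponentially, which is the shape of an exponentially weighted moving average. As an aside (this is a swapped-in technique, not part of the answer above), the same numbers can be reproduced with pandas' built-in ewm, assuming adjust=False, which selects the recursive form y[t] = (1 - alpha)*y[t-1] + alpha*x[t] with y[0] = x[0]. The column name via_ewm is made up for this sketch.

```python
import numpy as np
import pandas as pd

period = 14
df = pd.DataFrame({'base': [15, 16, 2, 16, 14, 1, 18, 18, 4, 7,
                            4, 18, 19, 13, 16, 11, 1, 8, 1, 9]})

# Replace the first value with the seed itself (5, not 5*period):
# with adjust=False the first output is simply the first input
seeded = df['base'].astype(float)
seeded.iloc[0] = 5.0

# alpha = 1/period matches (prev * (period - 1) + value) / period
df['via_ewm'] = seeded.ewm(alpha=1 / period, adjust=False).mean()

# Reference: the loop from the question
ref = [5.0]
for value in df['base'].iloc[1:]:
    ref.append((ref[-1] * (period - 1) + value) / period)

print(np.allclose(df['via_ewm'], ref))
```

Unlike the expanding().apply() version, this avoids calling a Python lambda once per row.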

Answer 2 (score: -1)

You can use the shift() method to access values shifted by n positions, which should make your task easier:

df.value.shift(1) + df.value
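
A small sketch of what this expression produces on the first five values from the question, assuming the column is named base (the question's name) rather than value. shift() reads the previous row of an existing column, so it exposes the previous input rather than the previous computed result.

```python
import pandas as pd

df = pd.DataFrame({'base': [15, 16, 2, 16, 14]})

# shift(1) moves every value down one row; row 0 becomes NaN
prev = df['base'].shift(1)

# Pairwise sum of each row with the row before it
df['pair_sum'] = prev + df['base']
print(df)
```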