Fast EMA calculation on a large dataset with irregular time intervals

Date: 2019-07-09 16:19:43

Tags: python pandas numpy

I have 800,000+ rows of data. I want to take an exponential moving average (EMA) of one of the columns. The times are not uniformly sampled, and I want to decay the EMA on every update (row). My code looks like this:

window = 5
result = numpy.empty(len(series))
result[0] = series['midpoint'].iloc[0]  # seed the EMA with the first observation
for i in range(1, len(series)):
    dt = series['datetime'][i] - series['datetime'][i - 1]
    decay = 1 - numpy.exp(-dt / window)
    result[i] = (1 - decay) * result[i - 1] + decay * series['midpoint'].iloc[i]
return pandas.Series(result, index=series.index)

The problem is that this is very slow for 800,000 rows. Is there a way to optimize it using other numpy features? I can't vectorize it, because result[i] depends on result[i-1].

Sample data here:

Timestamp             Midpoint
1559655000001096130    2769.125
1559655000001162260    2769.127
1559655000001171688    2769.154
1559655000001408734    2769.138
1559655000001424200    2769.123
1559655000001433128    2769.110
1559655000001541560    2769.125
1559655000001640406    2769.125
1559655000001658436    2769.127
1559655000001755924    2769.129
1559655000001793266    2769.125
1559655000001878688    2769.143
1559655000002061024    2769.125
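Worth noting, although it is not part of the original post: since pandas 1.1, Series.ewm accepts a times argument that handles irregular spacing directly. Only the default adjust=True weighting is supported with times, which normalizes the weights and so differs slightly from the recursion above, and choosing halflife = window * ln(2) reproduces the question's exp(-dt/window) decay rate. A minimal sketch, under the assumptions that the integer timestamps are epoch nanoseconds and that the window is meant in seconds:

```python
import numpy as np
import pandas as pd

# A few rows from the table above; timestamps assumed to be epoch nanoseconds.
ts_ns = [1559655000001096130, 1559655000001162260, 1559655000001171688]
mid = [2769.125, 2769.127, 2769.154]
s = pd.Series(mid, index=pd.to_datetime(ts_ns, unit="ns"))

window_s = 5  # assumption: the question's window, interpreted as seconds
# pandas decays observations by 0.5 ** (dt / halflife); halflife = window * ln(2)
# matches the question's exp(-dt / window) decay rate.
halflife = pd.Timedelta(seconds=window_s * np.log(2))
ema = s.ewm(halflife=halflife, times=s.index.to_numpy()).mean()
print(ema.iloc[0])  # first EMA value equals the first midpoint: 2769.125
```

Because adjust=True divides by the sum of the weights, this will not match the loop above value-for-value, but it tracks the same decay rate and runs entirely in compiled code.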

2 Answers:

Answer 0: (score: 2)

How about an approach like the one below? It takes me 0.34 seconds to process a series of 900k rows of irregularly spaced data. I'm assuming that a window of 5 means a 5-day span.

First, let's create some sample data.

import numpy as np
import pandas as pd

# Create sample data for a price stream of 2.6m price observations sampled 1 second apart.
seconds_per_day = 60 * 60 * 24  # 60 seconds / minute * 60 minutes / hour * 24 hours / day
starting_value = 100
annualized_vol = .3
sampling_percentage = .35  # 35%
start_date = '2018-12-01'
end_date = '2018-12-31'

np.random.seed(0)
idx = pd.date_range(start=start_date, end=end_date, freq='s')  # One second intervals.
periodic_vol = annualized_vol * (1 / 252 / seconds_per_day) ** 0.5  # per-second vol
periodic_returns = np.random.randn(len(idx)) * periodic_vol
cumulative_indexed_return = (1 + periodic_returns).cumprod() * starting_value
index_level = pd.Series(cumulative_indexed_return, index=idx)

# Sample 35% of the simulated prices to create a time series of 907k rows with irregular time intervals.
s = index_level.sample(frac=sampling_percentage).sort_index()

Now let's create a generator function that keeps track of the latest value of the exponentially weighted time series. This can run roughly 4x faster if you install numba, import it, and add the single decorator line @jit(nopython=True) above the function definition.

from numba import jit  # Optional, see below.

@jit(nopython=True)  # Optional, see below.
def ewma_generator(vals, decay_vals):
    result = vals[0]  # seed with the first observation
    yield result
    for val, decay in zip(vals[1:], decay_vals[1:]):
        result = result * (1 - decay) + val * decay
        yield result
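To see what the generator computes, here is a tiny pure-Python check (toy numbers invented for illustration, numba decorator omitted): the first value passes through unchanged, and each later value blends the running result with the new observation by its decay weight.

```python
def ewma_generator(vals, decay_vals):
    result = vals[0]  # seed with the first observation
    yield result
    for val, decay in zip(vals[1:], decay_vals[1:]):
        result = result * (1 - decay) + val * decay
        yield result

# Toy inputs, invented for illustration; decay_vals[0] is never used.
vals = [10.0, 12.0, 11.0, 13.0]
decays = [0.0, 0.5, 0.5, 0.5]
out = list(ewma_generator(vals, decays))
print(out)  # [10.0, 11.0, 11.0, 12.0]
```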

Now let's run this generator over the irregularly spaced series s. For the sample with 900k rows, it takes me 1.2 seconds to run the code below. By opting into numba's just-in-time compiler I can cut the execution time further, to 0.34 seconds; you first need to install the package, e.g. conda install numba. Note that I use a list comprehension to pull the ewma values from the generator, then assign those values back to the original series after first converting it to a dataframe.

# Assumes time series data is now named `s`.
window = 5  # Span of 5 days?
dt = pd.Series(s.index).diff().dt.total_seconds().div(seconds_per_day)  # Measured in days.
decay = (1 - (dt / -window).apply(np.exp))
g = ewma_generator(s.values, decay.values)
result = s.to_frame('midpoint').assign(
    ewma=pd.Series([next(g) for _ in range(len(s))], index=s.index))

>>> result.tail()
                       midpoint        ewma
2018-12-30 23:59:45  103.894471  105.546004
2018-12-30 23:59:49  103.914077  105.545929
2018-12-30 23:59:50  103.901910  105.545910
2018-12-30 23:59:53  103.913476  105.545853
2018-12-31 00:00:00  103.910422  105.545720

>>> result.shape
(907200, 2)

To make sure the numbers match our intuition, let's sample hourly and visualize the result. This looks good to me.

obs_per_day = 24  # 24 hourly observations per day.
step = int(seconds_per_day / obs_per_day)
>>> result.iloc[::step, :].plot()

[plot of the hourly-sampled midpoint and ewma series]

Answer 1: (score: 0)

You may get some improvement by iterating over the underlying numpy arrays instead of the pandas DataFrames and Series:

window = 5
serdt = series['datetime'].values
sermp = series['midpoint'].values
result = numpy.empty(len(series))
result[0] = sermp[0]  # seed the EMA with the first observation
for i in range(1, len(series)):
    dt = serdt[i] - serdt[i - 1]
    decay = 1 - numpy.exp(-dt / window)
    result[i] = (1 - decay) * result[i - 1] + decay * sermp[i]
return pandas.Series(result, index=series.index)

Using the sample data, this is roughly 6x faster than the original approach.
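One further option, not covered by either answer: the recursion can in fact be vectorized, because the per-step carry factors exp(-dt/window) telescope into exp(-(t[i]-t[j])/window), which turns the EMA into a single cumulative sum. A sketch, with the caveat that np.exp(tt) overflows once (t[-1] - t[0]) / window exceeds roughly 700, so it is only safe when the total time span is a modest multiple of the window (for a 30-day span with a 5-day window the exponent stays below 7):

```python
import numpy as np

def ewma_closed_form(t, x, window):
    """EMA of x at irregular times t, equivalent to the loop
    y[i] = (1 - d) * y[i-1] + d * x[i] with d = 1 - exp(-dt / window)."""
    tt = (t - t[0]) / window              # dimensionless times, tt[0] == 0
    d = np.empty_like(x, dtype=float)
    d[0] = 1.0                            # seeds y[0] = x[0]
    d[1:] = 1.0 - np.exp(-np.diff(tt))    # per-step decay weights
    return np.exp(-tt) * np.cumsum(d * x * np.exp(tt))

t = np.array([0.0, 1.0, 2.5, 3.0, 7.0])  # toy irregular times, same unit as window
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = ewma_closed_form(t, x, window=5.0)
```

This trades the Python loop for three vectorized array passes; the overflow caveat is the price of removing the recursion.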