I have 800,000+ rows of data. I want to take the exponential moving average (EMA) of one of the columns. The times are not evenly sampled, and I want to decay the EMA on each update (row). My code looks like this:
import numpy
import pandas

def ema(series, window=5):
    result = numpy.empty(len(series))
    result[0] = series['midpoint'].iloc[0]
    for i in range(1, len(series)):
        dt = series['datetime'][i] - series['datetime'][i - 1]
        decay = 1 - numpy.exp(-dt / window)
        result[i] = (1 - decay) * result[i - 1] + decay * series['midpoint'].iloc[i]
    return pandas.Series(result, index=series.index)
The problem is that this is very slow for 800,000 rows. Is there some way to optimize this using other numpy features? I can't vectorize it because result[i] depends on result[i-1].
Sample data here:
Timestamp Midpoint
1559655000001096130 2769.125
1559655000001162260 2769.127
1559655000001171688 2769.154
1559655000001408734 2769.138
1559655000001424200 2769.123
1559655000001433128 2769.110
1559655000001541560 2769.125
1559655000001640406 2769.125
1559655000001658436 2769.127
1559655000001755924 2769.129
1559655000001793266 2769.125
1559655000001878688 2769.143
1559655000002061024 2769.125
Answer 0 (score: 2)
How about an approach like the one below, which takes me 0.34 seconds to process a series of 900k rows of irregularly spaced data? I'm assuming a window of 5 means a 5-day span.

First, let's create some sample data.
# Create sample data for a price stream of 2.6m price observations sampled 1 second apart.
seconds_per_day = 60 * 60 * 24 # 60 seconds / minute * 60 minutes / hour * 24 hours / day
starting_value = 100
annualized_vol = .3
sampling_percentage = .35 # 35%
start_date = '2018-12-01'
end_date = '2018-12-31'
np.random.seed(0)
idx = pd.date_range(start=start_date, end=end_date, freq='s') # One second intervals.
periodic_vol = annualized_vol * (1/ 252 / seconds_per_day) ** 0.5
daily_returns = np.random.randn(len(idx)) * periodic_vol
cumulative_indexed_return = (1 + daily_returns).cumprod() * starting_value
index_level = pd.Series(cumulative_indexed_return, index=idx)
# Sample 35% of the simulated prices to create a time series of 907k rows with irregular time intervals.
s = index_level.sample(frac=sampling_percentage).sort_index()
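As a quick sanity check on this setup, a minimal self-contained sketch (tiny 10-point index; the names here are illustrative) shows that sampling a fraction of a regular one-second grid and re-sorting is exactly what produces the irregular gaps:

```python
import numpy as np
import pandas as pd

np.random.seed(0)
idx = pd.date_range('2018-12-01', periods=10, freq='s')  # Regular 1s grid.
level = pd.Series(np.arange(10.0), index=idx)
s = level.sample(frac=0.5).sort_index()  # Keep ~half the rows, chronological order.
gaps = pd.Series(s.index).diff().dt.total_seconds()  # First entry is NaN.
print(gaps.tolist())
```

Every gap is at least one second, but dropped rows leave larger, uneven gaps.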
Now let's create a generator function that carries the latest value of the exponentially weighted series from one observation to the next. It can be made roughly 4x faster by installing numba, importing it, and adding the single decorator line @jit(nopython=True) above the function definition.
from numba import jit  # Optional, see below.

@jit(nopython=True)  # Optional, see below.
def ewma_generator(vals, decay_vals):
    result = vals[0]
    yield result
    for val, decay in zip(vals[1:], decay_vals[1:]):
        result = result * (1 - decay) + val * decay
        yield result
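To sanity-check the recurrence before running it on 900k rows, here is the same generator (numba omitted) applied to a three-element toy array; the first decay value is never read, mirroring the NaN that .diff() produces:

```python
import numpy as np

def ewma_generator(vals, decay_vals):
    result = vals[0]  # Seed with the first observation.
    yield result
    for val, decay in zip(vals[1:], decay_vals[1:]):
        result = result * (1 - decay) + val * decay
        yield result

vals = np.array([100.0, 102.0, 101.0])
decays = np.array([np.nan, 0.5, 0.25])  # decays[0] is skipped by the zip.
print([float(x) for x in ewma_generator(vals, decays)])  # [100.0, 101.0, 101.0]
```

The second output blends 100 and 102 equally; the third blends 101 with itself and stays at 101.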
Now let's run this generator over the irregularly spaced series s. For the 900k-row sample, the code below takes me 1.2 seconds to run. By opting into numba's just-in-time compiler, I can cut the execution time further, to 0.34 seconds; you first need to install the package, e.g. conda install numba. Note that I use a list comprehension to pull the ewma values out of the generator, then assign those values back to the original series after first converting it to a dataframe.
# Assumes time series data is now named `s`.
window = 5  # Span of 5 days?
dt = pd.Series(s.index).diff().dt.total_seconds().div(seconds_per_day)  # Measured in days.
decay = 1 - (dt / -window).apply(np.exp)
g = ewma_generator(s.values, decay.values)
result = s.to_frame('midpoint').assign(
    ewma=pd.Series([next(g) for _ in range(len(s))], index=s.index))
>>> result.tail()
midpoint ewma
2018-12-30 23:59:45 103.894471 105.546004
2018-12-30 23:59:49 103.914077 105.545929
2018-12-30 23:59:50 103.901910 105.545910
2018-12-30 23:59:53 103.913476 105.545853
2018-12-31 00:00:00 103.910422 105.545720
>>> result.shape
(907200, 2)
To make sure the numbers match our intuition, let's sample hourly and visualize the result. It looks good to me.
obs_per_day = 24 # 24 hourly observations per day.
step = int(seconds_per_day / obs_per_day)
>>> result.iloc[::step, :].plot()
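The question's index is raw epoch nanoseconds rather than a DatetimeIndex, so the dt step above needs a unit conversion first. A minimal sketch, assuming (as above) that the 5-unit window means 5 days, here expressed in nanoseconds:

```python
import numpy as np
import pandas as pd

# First three timestamp/midpoint pairs from the question's sample data.
ts = pd.Series([1559655000001096130, 1559655000001162260, 1559655000001171688])
mid = pd.Series([2769.125, 2769.127, 2769.154])

window_ns = 5 * 24 * 60 * 60 * 1e9  # 5-day window expressed in nanoseconds.
dt = ts.diff()                      # Gap in ns between updates; first entry NaN.
decay = 1 - np.exp(-dt / window_ns) # Tiny decays, since the gaps are microseconds.
```

With microsecond gaps and a multi-day window, each individual decay factor is vanishingly small, which is why the window's unit matters so much here.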
Answer 1 (score: 0)
You may get some improvement by iterating over the underlying numpy arrays instead of the pandas DataFrames and Series:
def ema(series, window=5):
    result = numpy.empty(len(series))
    serdt = series['datetime'].values
    sermp = series['midpoint'].values
    result[0] = sermp[0]
    for i in range(1, len(series)):
        dt = serdt[i] - serdt[i - 1]
        decay = 1 - numpy.exp(-dt / window)
        result[i] = (1 - decay) * result[i - 1] + decay * sermp[i]
    return pandas.Series(result, index=series.index)
With the sample data, this runs about 6 times faster than the original method.
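The two answers compose naturally: the raw-array loop here is exactly the shape of code numba's JIT accelerates. A hedged sketch (the name ema_loop is mine; it falls back to plain Python if numba isn't installed):

```python
import numpy as np

try:
    from numba import njit  # Optional speedup; same semantics without it.
except ImportError:
    def njit(func):
        return func  # No-op fallback when numba is unavailable.

@njit
def ema_loop(times, mids, window):
    result = np.empty(len(mids))
    result[0] = mids[0]  # Seed with the first observation.
    for i in range(1, len(mids)):
        decay = 1 - np.exp(-(times[i] - times[i - 1]) / window)
        result[i] = (1 - decay) * result[i - 1] + decay * mids[i]
    return result

out = ema_loop(np.array([0.0, 1.0, 2.0]), np.array([100.0, 102.0, 101.0]), 5.0)
```

The loop body touches only scalars and arrays, so nopython mode can compile it without falling back to the Python object model.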