Question

I have a time series s stored as a pandas.Series, and I need to determine when the value tracked by the series has changed by at least x.

In pseudocode:

print s(0)
s*=s(0)
for all t in ]0, t_max]:
    if |s(t)-s*| > x:
        s* = s(t)
        print s*

Naively, this can be written in Python as follows:

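import pandas as pd

def find_changes(s, x):
    """Return (index, value) pairs where s moves more than x from the last recorded value."""
    changes = []
    s_last = None

    # Series.items() replaces iteritems(), which was removed in pandas 2.0.
    for index, value in s.items():

        if s_last is None:
            # The first observation becomes the initial reference value.
            s_last = value

        if abs(value - s_last) > x:
            # Append one (index, value) pair; `+=` would flatten both into the list.
            changes.append((index, value))
            s_last = value
    return changes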

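My dataset is large, so I can't just use the approach above. Also, because of framework restrictions, I can't use Cython or Numba. I can (and plan to) use pandas and NumPy.

I'm looking for guidance on which NumPy vectorized/optimized methods to use, and how to use them.

Thanks!

Edit: changed the code to match the pseudocode.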

Answer 1

I'm not sure I understood you correctly, but here is my interpretation of the problem:

import pandas as pd
import numpy as np

# Our series of data.

data = pd.DataFrame(np.random.rand(10), columns = ['value'])

# The threshold.

threshold = .33

# For each point t, grab t - 1. 

data['value_shifted'] = data['value'].shift(1)

# Absolute difference of t and t - 1.

data['abs_change'] = abs(data['value'] - data['value_shifted'])

# Test against the threshold.

data['change_exceeds_threshold'] = np.where(data['abs_change'] > threshold, 1, 0)

print(data)

This gives:

      value  value_shifted  abs_change  change_exceeds_threshold
0  0.005382            NaN         NaN                         0
1  0.060954       0.005382    0.055573                         0
2  0.090456       0.060954    0.029502                         0
3  0.603118       0.090456    0.512661                         1
4  0.178681       0.603118    0.424436                         1
5  0.597814       0.178681    0.419133                         1
6  0.976092       0.597814    0.378278                         1
7  0.660010       0.976092    0.316082                         0
8  0.805768       0.660010    0.145758                         0
9  0.698369       0.805768    0.107400                         0
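The same consecutive-difference test can be sketched directly on a Series with `Series.diff()`. This is only an illustration of the interpretation above; the function name `consecutive_changes` is an assumption made for the example and does not come from the thread.

```python
import pandas as pd

def consecutive_changes(s, threshold):
    """Return the entries of s whose absolute change from the previous entry exceeds threshold.

    Hypothetical helper for illustration only.
    """
    # diff() yields s[t] - s[t-1]; the first entry is NaN and never passes the test.
    step = s.diff().abs()
    return s[step > threshold]

if __name__ == "__main__":
    s = pd.Series([0.0, 0.1, 0.6, 0.7, 0.2])
    print(consecutive_changes(s, 0.33))
```

Note that this compares each point with its immediate predecessor, not with the last recorded value s* from the question's pseudocode. A slow drift such as 0.0, 0.2, 0.4, 0.6 is never flagged with a threshold of 0.33, even though the value moves more than 0.33 away from its starting point.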

Answer 2

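I don't think the pseudocode can be vectorized, because the next state of s* depends on the previous state. Here is a pure Python solution (a single pass):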

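import random
import pandas as pd

s = [random.randint(0, 100) for _ in range(100)]
res = []  # recorded changes as [position, value] pairs
thres = 20

ss = s[0]  # last recorded value, i.e. s* in the pseudocode
for i in range(len(s)):
    if abs(s[i] - ss) > thres:
        ss = s[i]
        res.append([i, s[i]])

# Each row has two fields, so the frame needs two column names.
df = pd.DataFrame(res, columns=['index', 'value'])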

In this case, I don't think the running time can be better than O(N).
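Although the dependence on s* rules out a single vectorized expression, the per-element Python work can be reduced by letting NumPy search for the next change in blocks. The following is a minimal sketch under that assumption, not a definitive implementation; the name `find_changes_chunked` and the `chunk` parameter are hypothetical and were chosen for this example.

```python
import numpy as np
import pandas as pd

def find_changes_chunked(s, x, chunk=4096):
    """Return the entries of s that differ by more than x from the last recorded value.

    Hypothetical helper: the Python loop advances once per block or per
    recorded change, while NumPy scans each block.
    """
    values = s.to_numpy()
    n = len(values)
    if n == 0:
        return s.iloc[:0]

    positions = []
    ref = values[0]   # s* in the pseudocode
    start = 1
    while start < n:
        block = values[start:start + chunk]
        # Vectorized test of one block against the current reference value.
        hits = np.flatnonzero(np.abs(block - ref) > x)
        if hits.size == 0:
            # No change in this block: skip it entirely.
            start += len(block)
            continue
        pos = start + int(hits[0])
        positions.append(pos)
        ref = values[pos]   # the reference moves to the new value
        start = pos + 1
    return s.iloc[positions]

if __name__ == "__main__":
    s = pd.Series([0, 1, 5, 6, 12, 11, 0])
    print(find_changes_chunked(s, 4))
```

This pays off mainly when changes are rare relative to the series length. If nearly every sample exceeds the threshold, each iteration rescans a block and advances by only one element, so a small `chunk` or the plain loop is likely the better choice.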

Finding when the value of a pandas.Series has changed by at least x
