Question

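I have a DataFrame with a time series of scores. My goal is to detect when the score rises above a certain threshold th, and then find when it returns to 0. Finding each condition separately is easy: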

dates_1 = score > th
dates_2 = np.sign(score[1:]) == np.sign(score.shift(1).dropna())

However, I don't know the most Pythonic way to rewrite dates_2 so that it only flags dates while an "active" dates_1 event has been observed.

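One option would be a helper column 'active' that is set to True whenever score > th and set back to False when the dates_2 condition is met. That way I could require a sign change AND active == True. However, this approach requires iteration, and I wonder whether there is a vectorized solution to my problem.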

Any ideas on how to improve my approach?

Sample data:

date         score
2010-01-04   0.0
2010-01-05  -0.3667779798467592
2010-01-06  -1.9641427199568868
2010-01-07  -0.49976215445519134
2010-01-08  -0.7069108074548405
2010-01-11  -1.4624766212523337
2010-01-12  -0.9132777669357441
2010-01-13   0.16204588193577152
2010-01-14   0.958085568609925
2010-01-15   1.4683022129399834
2010-01-19   3.036016680985081
2010-01-20   2.2357911432637345
2010-01-21   2.8827438241030707
2010-01-22   -3.395977874791837

Expected output

With th = 0.94:

date    active
2010-01-04  False
2010-01-05  False
2010-01-06  False
2010-01-07  False
2010-01-08  False
2010-01-11  False
2010-01-12  False
2010-01-13  False
2010-01-14  True
2010-01-15  True
2010-01-19  True
2010-01-20  True
2010-01-21  True
2010-01-22  False

Answer 1

Not vectorized!

def alt_cond(s, th):
    active = False
    for x in s:
        active = [x >= th, x > 0][int(active)]
        yield active

df.assign(A=[*alt_cond(df.score, 0.94)])

          date     score      A
0   2010-01-04  0.000000  False
1   2010-01-05 -0.366778  False
2   2010-01-06 -1.964143  False
3   2010-01-07 -0.499762  False
4   2010-01-08 -0.706911  False
5   2010-01-11 -1.462477  False
6   2010-01-12 -0.913278  False
7   2010-01-13  0.162046  False
8   2010-01-14  0.958086   True
9   2010-01-15  1.468302   True
10  2010-01-19  3.036017   True
11  2010-01-20  2.235791   True
12  2010-01-21  2.882744   True
13  2010-01-22 -3.395978  False

Vectorized (sort of)

I use Numba to really speed things up. It is still a loop, but it should be very fast if you can install numba.

import numpy as np
from numba import njit

@njit
def alt_cond(s, th):
    active = False
    out = np.zeros(len(s), dtype=np.bool_)
    for i, x in enumerate(s):
        if active:
            if x <= 0:
                active = False
        else:
            if x >= th:
                active = True
        out[i] = active
    return out

df.assign(A=alt_cond(df.score.values, .94))
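If a loop-free version is preferred, the same on/off logic can probably be expressed with a forward fill. The following is only a sketch and not part of the original answer: the helper name hysteresis_mask is made up for illustration, the scores are the sample data rounded to two decimals, and it assumes th > 0 and no NaN values in the scores.

```python
import numpy as np
import pandas as pd


def hysteresis_mask(score: pd.Series, th: float) -> pd.Series:
    """Hypothetical helper: True from the first score >= th until score <= 0.

    Assumes th > 0 and that `score` contains no NaN values.
    """
    # 1.0 where the state must switch on, 0.0 where it must switch off,
    # NaN where the previous state should simply carry over.
    switches = np.where(score >= th, 1.0, np.where(score <= 0, 0.0, np.nan))
    state = pd.Series(switches, index=score.index)
    # Forward fill propagates the most recent switch; rows before the
    # first switch are treated as inactive.
    return state.ffill().fillna(0.0).astype(bool)


# Sample scores from the question, rounded to two decimals.
score = pd.Series([0.0, -0.37, -1.96, -0.50, -0.71, -1.46, -0.91,
                   0.16, 0.96, 1.47, 3.04, 2.24, 2.88, -3.40])
active = hysteresis_mask(score, 0.94)
```

With th > 0 this should agree with the loop: a score at or above th always leaves the state on, a score at or below 0 always leaves it off, and anything in between keeps whatever came before.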

Response to comment

You can keep a dictionary of column names and thresholds and iterate over it:

th = {'score': 0.94}

df.join(pd.DataFrame(
    np.column_stack([[*alt_cond(df[k], v)] for k, v in th.items()]),
    df.index, [f"{k}_A" for k in th]
))


          date     score  score_A
0   2010-01-04  0.000000    False
1   2010-01-05 -0.366778    False
2   2010-01-06 -1.964143    False
3   2010-01-07 -0.499762    False
4   2010-01-08 -0.706911    False
5   2010-01-11 -1.462477    False
6   2010-01-12 -0.913278    False
7   2010-01-13  0.162046    False
8   2010-01-14  0.958086     True
9   2010-01-15  1.468302     True
10  2010-01-19  3.036017     True
11  2010-01-20  2.235791     True
12  2010-01-21  2.882744     True
13  2010-01-22 -3.395978    False

Answer 2

I am assuming your data is in a pandas DataFrame and that 'date' is your index column. Then this is how I would do it:

th = 0.94 # Threshold value
i = df[df.score>th].index[0] # Check the index for the first condition

df[i:][df.score<0].index[0] # Check the index for the second condition, after the index of the first condition

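So, use conditional indexing to find the index of the first condition ([df.score>th]), then check for the second condition ([df.score<0]), but start looking from the index found for the first condition ([i:]).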
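To make the two lookups concrete, here is a minimal self-contained sketch of the same idea on the sample data (scores rounded to two decimals). The variable names start, end, above, after and below are illustrative, and the explicit None handling is an addition for the case where a condition is never met. Filtering the sliced frame with a mask built from that same slice is a choice made here so that pandas does not have to align a full-length boolean mask.

```python
import pandas as pd

# Sample data from the question, with 'date' as the index.
dates = pd.to_datetime([
    "2010-01-04", "2010-01-05", "2010-01-06", "2010-01-07", "2010-01-08",
    "2010-01-11", "2010-01-12", "2010-01-13", "2010-01-14", "2010-01-15",
    "2010-01-19", "2010-01-20", "2010-01-21", "2010-01-22",
])
df = pd.DataFrame(
    {"score": [0.0, -0.37, -1.96, -0.50, -0.71, -1.46, -0.91,
               0.16, 0.96, 1.47, 3.04, 2.24, 2.88, -3.40]},
    index=dates,
)

th = 0.94  # threshold value

# First date on which the score exceeds the threshold (None if it never does).
above = df.index[df["score"] > th]
start = above[0] if len(above) else None

# First date at or after `start` on which the score is negative.
end = None
if start is not None:
    after = df.loc[start:]
    below = after.index[after["score"] < 0]
    end = below[0] if len(below) else None
```

Note that this only finds the first activation and its end; repeated activations would need the lookup to be repeated from end onward.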
