Question

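I have implemented a script with pandas, but I am not sure it is the most elegant or fastest implementation. (I am using for loops...)

Suppose you have a variable y (float) recorded over time.

What I need to do is create a new column (0 or 1) in the DataFrame that marks the interval in which the Y value is stable within an allowed variation/noise range:

abs(Y_start - Y_end) < noise_allowed

If there are several such intervals, only the one with the largest Y should be marked.

What is the most elegant way to do this? Maybe with a rolling window?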

Answer 1

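Here is a vectorized NumPy approach:

import numpy as np

def max_stable_mask(a, thresh):  # thresh controls the allowed noise
    # pad with False so every run of True has both a start and an end boundary
    mask = np.r_[False, np.abs(a - a.max()) < thresh, False]
    # positions where the mask flips: even entries are run starts, odd entries are exclusive run ends
    idx = np.flatnonzero(mask[1:] != mask[:-1])
    # index of the longest run; argmax returns the first one on ties
    s0 = (idx[1::2] - idx[::2]).argmax()
    valid_mask = np.zeros(a.size, dtype=int)  # use dtype=bool for a boolean mask
    valid_mask[idx[2*s0]:idx[2*s0+1]] = 1
    return valid_mask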

Sample run:

In [193]: a = np.array([0, 2, 4, 9.2, 6, 6, 9, 9 , 9.2, 9.1, 9.2, 5, 0])

In [194]: max_stable_mask(a, thresh = 0.5)
Out[194]: array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 0, 0])

In [195]: max_stable_mask(a, thresh = 0.1)
Out[195]: array([0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0])

In [196]: max_stable_mask(a, thresh = 0.01)
Out[196]: array([0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0])

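In the last sample case, with a threshold of 0.01, there are three intervals (at indices 3, 8 and 10), each containing a single element. For such ties, the function picks the first interval.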
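
To make the run detection easier to follow, here is a minimal sketch that exposes the intermediate values for the `thresh = 0.5` case. The helper name `run_bounds` is hypothetical and not part of the answer; it only repeats the padding and `np.flatnonzero` steps from `max_stable_mask`.

```python
import numpy as np

def run_bounds(mask):
    """Return (starts, stops) for each run of True in a 1-D boolean array.

    stops are exclusive, so a run covers mask[start:stop].
    (Hypothetical helper, written only to illustrate the steps above.)
    """
    # pad with False on both sides so runs touching the edges are closed
    padded = np.r_[False, mask, False]
    # indices where consecutive values differ: starts and stops alternate
    idx = np.flatnonzero(padded[1:] != padded[:-1])
    return idx[::2], idx[1::2]

a = np.array([0, 2, 4, 9.2, 6, 6, 9, 9, 9.2, 9.1, 9.2, 5, 0])
near_max = np.abs(a - a.max()) < 0.5   # True where a is within 0.5 of its maximum
starts, stops = run_bounds(near_max)
lengths = stops - starts

print(starts)   # [3 6]
print(stops)    # [ 4 11]
print(lengths)  # [1 5]
```

The longest run is `a[6:11]`, which is the interval marked in the `thresh = 0.5` output. Note that the selection is by run length among values close to the global maximum, so the result depends on `a.max()` being part of the plateau of interest.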

Answer 2

Not exactly elegant, but it does not involve any explicit for loops:

import pandas as pd

threshold = 0.015

# construct a test DataFrame
df = pd.DataFrame({'y': [0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.5, 0.5, 0.5,
                         0.6, 0.7, 0.8, 0.9, 1, 1.01, 1.02, 1.03, 1.04,
                         0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0]})
# calculate a shifted series for differencing
df['y_diff'] = df.y.shift(1) - df.y
# find flat areas
df['y_flat'] = abs(df['y_diff']) < threshold
# index blocks
df['y_flat_index'] = (df.y_flat.shift(1) != df.y_flat).astype(int).cumsum()
# calculate mean y for flat areas
df['y_flat_mean'] = df.groupby('y_flat_index')['y'].transform('mean')
# mark the flat area with the highest mean y
df['y_marked'] = (df['y_flat'] & (df['y_flat_mean'] == df['y_flat_mean'].max())).astype(int)
# take into account that we shifted the series, i.e. add one more True value at the beginning,
# unless there's only one maximum value, in which case assign mark to that index
# (df.loc avoids chained assignment; the "- 1" assumes the default RangeIndex)
if df['y_marked'].sum() == 0:
    df.loc[df['y'] == df['y'].max(), 'y_marked'] = 1
else:
    df.loc[df[df['y_marked'] == 1].index - 1, 'y_marked'] = 1
print(df)

Result:

       y  y_diff y_flat  y_flat_index  y_flat_mean y_marked
0   0.00     NaN  False             1        0.250    0
1   0.10   -0.10  False             1        0.250    0
2   0.20   -0.10  False             1        0.250    0
3   0.30   -0.10  False             1        0.250    0
4   0.40   -0.10  False             1        0.250    0
5   0.50   -0.10  False             1        0.250    0
6   0.50    0.00   True             2        0.500    0
7   0.50    0.00   True             2        0.500    0
8   0.50    0.00   True             2        0.500    0
9   0.60   -0.10  False             3        0.800    0
10  0.70   -0.10  False             3        0.800    0
11  0.80   -0.10  False             3        0.800    0
12  0.90   -0.10  False             3        0.800    0
13  1.00   -0.10  False             3        0.800    1
14  1.01   -0.01   True             4        1.025    1
15  1.02   -0.01   True             4        1.025    1
16  1.03   -0.01   True             4        1.025    1
17  1.04   -0.01   True             4        1.025    1
18  0.90    0.14  False             5        0.450    0
19  0.80    0.10  False             5        0.450    0
20  0.70    0.10  False             5        0.450    0
21  0.60    0.10  False             5        0.450    0
22  0.50    0.10  False             5        0.450    0
23  0.40    0.10  False             5        0.450    0
24  0.30    0.10  False             5        0.450    0
25  0.20    0.10  False             5        0.450    0
26  0.10    0.10  False             5        0.450    0
27  0.00    0.10  False             5        0.450    0

Note that if there are several flat areas with the maximum y, all of them are marked, except in cases such as [0,2,0,0,2,2].
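
The behaviour described in this note can be checked with a condensed restatement of the steps above. This is only a sketch: the function name `mark_flat_max` is hypothetical, it works on a plain Series instead of adding helper columns, and it assumes the default integer index.

```python
import pandas as pd

def mark_flat_max(values, threshold):
    """Condensed restatement of the steps above; returns the 0/1 marks as a list.

    (Hypothetical helper for illustration; assumes a default 0..n-1 index.)
    """
    y = pd.Series(values, dtype=float)
    # a point is "flat" when it differs from its predecessor by less than threshold
    flat = (y.shift(1) - y).abs() < threshold
    # label consecutive blocks of equal flat/non-flat state
    block = flat.astype(int).diff().ne(0).cumsum()
    # mean y of each block, broadcast back to every row
    block_mean = y.groupby(block).transform('mean')
    marked = (flat & (block_mean == block_mean.max())).astype(int)
    if marked.sum() == 0:
        # no flat block reaches the maximum mean: mark the maximum value(s) instead
        marked[y == y.max()] = 1
    else:
        # the diff drops the first point of each flat area, so add it back
        marked.loc[marked[marked == 1].index - 1] = 1
    return marked.tolist()

# two flat areas tie for the highest mean: both are marked
print(mark_flat_max([0, 1, 1, 0, 1, 1, 0], 0.015))  # [0, 1, 1, 0, 1, 1, 0]

# the edge case from the note: the isolated 2 at index 1 is not marked
print(mark_flat_max([0, 2, 0, 0, 2, 2], 0.015))     # [0, 0, 0, 0, 1, 1]
```

Because ties are all marked, this approach does not by itself guarantee a single interval; an extra tie-breaking step would be needed if exactly one interval is required.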
