Question

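I am trying to replace "bad values" below and above thresholds with a default value (e.g. setting them to NaN). I am working with a numpy array of 1000k values and more, so performance is an issue.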

My prototype does the operation in two steps. Is it possible to do it in one?

import numpy as np

data = np.array([0., 1., 2., 3., 4., 5., 6., 7., 8., 9.])

upper_threshold = 7
lower_threshold = 1
default_value = np.nan  # np.NaN was removed in NumPy 2.0; np.nan works everywhere

# is it possible to do this in one expression?
data[data > upper_threshold] = default_value
data[data < lower_threshold] = default_value

print(data)  # [ nan   1.   2.   3.   4.   5.   6.   7.  nan  nan]

As stated in a related question (Pythonic way to replace list values with upper and lower bound (clamping, clipping, thresholding)?):

Like many other functions, np.clip is Python, but it defers to the arr.clip method. For regular arrays that method is compiled, so it will be faster (about 2x). - hpaulj
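For context, here is a minimal sketch of what the two clip spellings do on the sample array. Note that clip clamps out-of-range values to the nearest bound instead of replacing them with a default, so it is not a drop-in solution for this question. The variable names are illustrative only.

```python
import numpy as np

data = np.array([0., 1., 2., 3., 4., 5., 6., 7., 8., 9.])

# Function form and method form return the same clamped copy.
clipped_func = np.clip(data, 1, 7)
clipped_method = data.clip(1, 7)

# Values below 1 become 1 and values above 7 become 7 (no NaN is inserted).
print(clipped_method)  # [1. 1. 2. 3. 4. 5. 6. 7. 7. 7.]
```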

I hope to find a faster way. Thanks in advance!

Answer 1

Use boolean indexing in a single step with a combined mask -

data[(data > upper_threshold) | (data < lower_threshold)] = default_value
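As a quick sanity check, the sketch below confirms that the combined-mask assignment gives the same result as the two-step version on the sample data. It also shows an out-of-place variant using np.where, for cases where the input must stay untouched. The helper name replace_outside is hypothetical, introduced here only for illustration.

```python
import numpy as np

def replace_outside(data, lower, upper, default):
    """Hypothetical helper: return a copy with values outside [lower, upper] set to default."""
    mask = (data > upper) | (data < lower)
    return np.where(mask, default, data)

data = np.array([0., 1., 2., 3., 4., 5., 6., 7., 8., 9.])

# One step, in place, with a combined mask.
in_place = data.copy()
in_place[(in_place > 7) | (in_place < 1)] = np.nan

# Two steps, in place, as in the question.
two_pass = data.copy()
two_pass[two_pass > 7] = np.nan
two_pass[two_pass < 1] = np.nan

# Out of place: the original array is left unchanged.
out_of_place = replace_outside(data, 1, 7, np.nan)

# assert_array_equal treats NaNs in matching positions as equal.
np.testing.assert_array_equal(in_place, two_pass)
np.testing.assert_array_equal(in_place, out_of_place)
```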

Runtime test -

In [109]: def onepass(data, upper_threshold, lower_threshold, default_value):
     ...:     mask = (data > upper_threshold) | (data < lower_threshold)
     ...:     data[mask] = default_value
     ...: 
     ...: def twopass(data, upper_threshold, lower_threshold, default_value):
     ...:     data[data > upper_threshold] = default_value
     ...:     data[data < lower_threshold] = default_value
     ...:     

In [110]: upper_threshold = 7
     ...: lower_threshold = 1
     ...: default_value = np.nan
     ...: 

In [111]: data = np.random.randint(-4,11,(1000000)).astype(float)

In [112]: %timeit twopass(data, upper_threshold, lower_threshold, default_value)
100 loops, best of 3: 2.41 ms per loop

In [113]: data = np.random.randint(-4,11,(1000000)).astype(float)

In [114]: %timeit onepass(data, upper_threshold, lower_threshold, default_value)
100 loops, best of 3: 2.74 ms per loop

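The proposed one-pass indexing approach does not seem to do any better. The likely reason is that OR-ing the two masks costs slightly more than the assignment through boolean indexing itself.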
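One caveat worth checking on your own machine: both functions modify the array in place, so repeated timing loops on the same array see data whose out-of-range values are already NaN, and later iterations have nothing left to replace. The sketch below is one possible way to time each call on a fresh copy using the standard timeit module. The helper best_time and the seed are assumptions made for illustration, and the numbers will vary by hardware and NumPy version.

```python
import timeit
import numpy as np

def onepass(data, upper_threshold, lower_threshold, default_value):
    mask = (data > upper_threshold) | (data < lower_threshold)
    data[mask] = default_value

def twopass(data, upper_threshold, lower_threshold, default_value):
    data[data > upper_threshold] = default_value
    data[data < lower_threshold] = default_value

# Same value range and size as the session above; the seed is arbitrary.
rng = np.random.default_rng(0)
base = rng.integers(-4, 11, size=1_000_000).astype(float)

def best_time(func, repeats=20):
    """Hypothetical helper: best single-call time, each call on a fresh copy."""
    timer = timeit.Timer(
        stmt="func(data, 7, 1, np.nan)",
        setup="data = base.copy()",  # setup runs before every repeat and is not timed
        globals={"func": func, "base": base, "np": np},
    )
    return min(timer.repeat(repeat=repeats, number=1))

t_one = best_time(onepass)
t_two = best_time(twopass)
print(f"onepass: {t_one * 1e3:.2f} ms, twopass: {t_two * 1e3:.2f} ms")
```

With number=1, the copy in setup is made before every timed call, so each measurement covers a full replacement and excludes the cost of copying.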

Replace list values above and below thresholds with default value in python?
