Question

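I have a strictly increasing numpy array of "cutoff" values of length m, and a pandas Series of values of length n (the index is unimportant, so it can be converted to a numpy array). I need an efficient way to produce a length-m vector whose j-th entry is the number of elements in the pandas Series that are less than the j-th element of the "cutoff" array.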

I could do this with a list comprehension:

output = array([(pan_series < cutoff_val).sum() for cutoff_val in cutoff_ar])

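But I was wondering whether there is a way to leverage more of numpy's magic speed, since I have to do this many times inside several nested loops and it keeps bogging down my computer.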

Thanks!

Answer 1

Is this what you are looking for?

In [36]: a = np.random.random(20)

In [37]: a
Out[37]: 
array([ 0.68574307,  0.15743428,  0.68006876,  0.63572484,  0.26279663,
        0.14346269,  0.56267286,  0.47250091,  0.91168387,  0.98915746,
        0.22174062,  0.11930722,  0.30848231,  0.1550406 ,  0.60717858,
        0.23805205,  0.57718675,  0.78075297,  0.17083826,  0.87301963])

In [38]: b = np.array((0.3,0.7))

In [39]: np.sum(a[:,None]<b[None,:], axis=0)
Out[39]: array([ 8, 16])

In [40]: np.sum(a[:,None]<b, axis=0) # b's new axis above is unnecessary...
Out[40]: array([ 8, 16])

In [41]: (a[:,None]<b).sum(axis=0)   # even simpler
Out[41]: array([ 8, 16])
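
To spell out what the broadcasting in the session above is doing, here is a minimal sketch. The seeded generator and the variable names mask, counts and loop_counts are illustrative assumptions, not part of the original session.

```python
import numpy as np

rng = np.random.default_rng(0)   # seeded generator; the values are illustrative only
a = rng.random(20)               # stands in for the pandas Series values
b = np.array([0.3, 0.7])         # strictly increasing cutoffs

# a[:, None] has shape (20, 1); comparing it with b (shape (2,)) broadcasts
# to a (20, 2) boolean array where mask[i, j] is True when a[i] < b[j].
mask = a[:, None] < b

# Summing down the rows (axis=0) leaves one count per cutoff.
counts = mask.sum(axis=0)

# The list comprehension from the question gives the same numbers.
loop_counts = np.array([(a < cutoff).sum() for cutoff in b])
print(counts, loop_counts)
```

Note that the intermediate mask holds n*m booleans, so its memory footprint grows with the product of the two lengths.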

Timings are always appreciated (here for a longish array of 2E6 elements)

In [47]: a = np.random.random(2000000)

In [48]: %timeit (a[:,None]<b).sum(axis=0)
10 loops, best of 3: 78.2 ms per loop

In [49]: %timeit np.searchsorted(a, b, 'right',sorter=a.argsort())
1 loop, best of 3: 448 ms per loop

For a smaller array

In [50]: a = np.random.random(2000)

In [51]: %timeit (a[:,None]<b).sum(axis=0)
10000 loops, best of 3: 89 µs per loop

In [52]: %timeit np.searchsorted(a, b, 'right',sorter=a.argsort())
The slowest run took 4.86 times longer than the fastest. This could mean that an intermediate result is being cached.
10000 loops, best of 3: 141 µs per loop

Edit

Divakar says that things may be different for a longer b, so let's see

In [71]: a = np.random.random(2000)

In [72]: b =np.random.random(200)

In [73]: %timeit (a[:,None]<b).sum(axis=0)
1000 loops, best of 3: 1.44 ms per loop

In [74]: %timeit np.searchsorted(a, b, 'right',sorter=a.argsort())
10000 loops, best of 3: 172 µs per loop

Different indeed! Thank you for piquing my curiosity.

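The OP should probably test with their own use case: is the sample very long relative to the cutoff sequence? Where is the break-even point?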
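
One way to look for that break-even point on your own machine is a small timing harness such as the sketch below. The function names count_broadcast, count_searchsorted and compare are hypothetical helpers introduced here for illustration, and the sizes are arbitrary; the numbers it prints depend on the machine and the NumPy version.

```python
import timeit
import numpy as np

def count_broadcast(a, b):
    # Builds an (n, m) boolean array, then sums each column.
    return (a[:, None] < b).sum(axis=0)

def count_searchsorted(a, b):
    # Sorts a once (via argsort), then binary-searches every cutoff.
    return np.searchsorted(a, b, 'right', sorter=a.argsort())

def compare(n, m, repeats=5, seed=0):
    """Return the best-of-`repeats` time in seconds for each approach."""
    rng = np.random.default_rng(seed)
    a = rng.random(n)
    b = np.sort(rng.random(m))   # cutoffs, sorted to match the question's setup
    t_broadcast = min(timeit.repeat(lambda: count_broadcast(a, b),
                                    number=1, repeat=repeats))
    t_search = min(timeit.repeat(lambda: count_searchsorted(a, b),
                                 number=1, repeat=repeats))
    return t_broadcast, t_search

if __name__ == "__main__":
    for m in (2, 20, 200):
        t_broadcast, t_search = compare(2000, m)
        print(f"n=2000 m={m}: broadcast {t_broadcast:.2e} s, "
              f"searchsorted {t_search:.2e} s")
```

Varying n and m to match the real workload shows which approach wins for that particular shape of data.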

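Edit #2

I made a blooper in my timings: I forgot the axis=0 argument to .sum()...

I have edited the timings with the corrected statement and, of course, the correct timings. Apologies.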

Answer 2 (score: 2)

You can use np.searchsorted for some NumPy magic:

# Convert to numpy array for some "magic"
pan_series_arr = np.array(pan_series)

# Let the magic begin!
sortidx = pan_series_arr.argsort()
out = np.searchsorted(pan_series_arr,cutoff_ar,'right',sorter=sortidx)

Explanation

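You are performing [(pan_series < cutoff_val).sum() for cutoff_val in cutoff_ar], i.e. for each element in cutoff_ar, we count the number of pan_series elements that are less than it. With np.searchsorted, we instead look for where each element of cutoff_ar would be inserted into a sorted pan_series_arr, taking the 'right' insertion position, and collect the indices of those positions. These indices essentially represent the number of pan_series elements below the current cutoff_ar element, thus giving us the desired output.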

Sample run

cutoff_ar
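
As a hedged illustration with made-up data (not the answerer's own sample run), the sketch below checks the searchsorted result against the list comprehension from the question. The input values and the names expected, out_left and out_right are assumptions chosen for the example. It also shows one detail worth knowing: side='left' counts elements strictly less than each cutoff, while side='right' counts elements less than or equal to it, so the two differ only when a cutoff value also occurs in the data.

```python
import numpy as np

# Hypothetical data: unsorted, with a repeated value (4) that equals a cutoff.
pan_series_arr = np.array([5, 1, 4, 2, 4, 3])   # stands in for np.array(pan_series)
cutoff_ar = np.array([2, 4, 6])                 # strictly increasing cutoffs

sortidx = pan_series_arr.argsort()              # indices that would sort the data

# Reference: the list comprehension from the question (strict <).
expected = np.array([(pan_series_arr < c).sum() for c in cutoff_ar])

# side='left': number of elements strictly less than each cutoff.
out_left = np.searchsorted(pan_series_arr, cutoff_ar, 'left', sorter=sortidx)

# side='right': number of elements less than or equal to each cutoff.
out_right = np.searchsorted(pan_series_arr, cutoff_ar, 'right', sorter=sortidx)

print(expected)    # [1 3 6]
print(out_left)    # [1 3 6]
print(out_right)   # [2 5 6]
```

For continuous random data such as np.random.random, exact ties are very unlikely, so both sides give the same counts in practice.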
