Question

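I'm trying to quickly count how many items in a list fall below each of a series of thresholds, similar to doing what is described here many times over. The point is to run some diagnostics on machine learning models that go deeper than what is built into scikit-learn (ROC curves, etc.).

Imagine preds is a list of predictions (probabilities between 0 and 1). In practice I will have over 1 million of them, which is why I'm trying to speed this up.

This creates some fake scores, normally distributed and rescaled to lie between 0 and 1.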

import numpy as np

fake_preds = [np.random.normal(0, 1) for i in range(1000)]
fake_preds = [(pred + np.abs(min(fake_preds)))/max(fake_preds + np.abs(min(fake_preds))) for pred in fake_preds]

Right now, the way I do this is to loop over roughly 100 threshold levels and count how many predictions fall below each threshold:

thresholds = [round(n,2) for n in np.arange(0.01, 1.0, 0.01)]
thresh_cov = [sum(fake_preds < thresh) for thresh in thresholds]

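This takes about 1.5 seconds for 10k predictions (less time than generating the fake predictions), but you can imagine it takes much longer with more predictions. I also have to do it a few thousand times to compare a bunch of different models.

Any thoughts on a way to make that second code block faster? I'm thinking there must be a way to order the predictions so that it is easier for the computer to check the thresholds (similar to indexing in a SQL-like scenario), but I can't figure out any way other than sum(fake_preds < thresh) to check them, and that doesn't take advantage of any indexing or ordering.

Thanks in advance for the help!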

Answer 1

One way is to use numpy.histogram.

With timeit, I get:

%timeit my_cov = np.histogram(fake_preds, len(thresholds))[0].cumsum()
169 µs ± 6.51 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

%timeit thresh_cov = [sum(fake_preds < thresh) for thresh in thresholds]
172 ms ± 1.22 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
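
Passing only a bin count to np.histogram lets NumPy choose equal-width bins between the data's minimum and maximum, so the bin edges need not line up with the thresholds. The following is a minimal sketch, not part of the original answer, that passes the thresholds as explicit bin edges instead. The helper name counts_below_histogram and the padding edges lower and upper are assumptions made for illustration, and the thresholds are assumed to be sorted in ascending order.

```python
import numpy as np

def counts_below_histogram(preds, thresholds):
    """Count, for each threshold, how many predictions are strictly below it."""
    preds = np.asarray(preds, dtype=float)
    thresholds = np.asarray(thresholds, dtype=float)  # assumed sorted ascending
    # Pad with one edge below and one above everything, so every prediction
    # lands in some bin and the closed right edge of the last bin is harmless.
    lower = min(preds.min(), thresholds[0]) - 1.0
    upper = max(preds.max(), thresholds[-1]) + 1.0
    edges = np.concatenate(([lower], thresholds, [upper]))
    counts, _ = np.histogram(preds, bins=edges)
    # Bin j covers [edges[j], edges[j + 1]), so the running total over the
    # first len(thresholds) bins is the count below each threshold.
    return counts[:-1].cumsum()

rng = np.random.default_rng(0)
preds = rng.random(10_000)
thresholds = np.arange(1, 100) / 100.0
result = counts_below_histogram(preds, thresholds)
```

Because all bins except the last are half-open, a prediction exactly equal to a threshold is not counted as below it, which matches the strict comparison in the question.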

Answer 2

Approach #1

You can sort the predictions array and then use searchsorted or np.digitize, like so -

np.searchsorted(np.sort(fake_preds), thresholds, 'right')

np.digitize(thresholds, np.sort(fake_preds))

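If you don't mind mutating the predictions array, sort it in place with fake_preds.sort() and then use fake_preds in place of np.sort(fake_preds). This should be more efficient, since it avoids allocating extra memory for a sorted copy.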
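
One detail worth checking is the side argument of np.searchsorted: 'right' counts predictions less than or equal to each threshold, while 'left' counts those strictly below it. For continuous random scores the two rarely differ, but with ties they do. A small sketch with hand-picked values (the arrays below are illustrative, not from the original answer):

```python
import numpy as np

preds = np.array([0.10, 0.20, 0.20, 0.50])
thresholds = np.array([0.20, 0.50, 0.60])

sorted_preds = np.sort(preds)

# side="left": insertion point before equal values -> counts preds < t
strict = np.searchsorted(sorted_preds, thresholds, side="left")

# side="right": insertion point after equal values -> counts preds <= t
inclusive = np.searchsorted(sorted_preds, thresholds, side="right")

print(strict)     # [1 3 4]
print(inclusive)  # [3 4 4]
```

If the result must match sum(fake_preds < thresh) exactly even when a prediction equals a threshold, side="left" is the one that corresponds to the strict comparison.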

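Approach #2

Since the thresholds are roughly 100 evenly spaced values between 0 and 1, they are all multiples of 0.01. So we can simply scale each prediction up by 100 and convert it to an int, and feed the result straight to np.bincount as bin indices. Then, to get the desired result, use cumsum, like so -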

np.bincount((fake_preds*100).astype(int),minlength=99)[:99].cumsum()
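
To make the steps of this one-liner easier to follow, here is a small worked sketch of the same idea on a toy input. It uses thresholds at multiples of 0.25 (a scale factor of 4 instead of 100) so that the intermediate arrays are short; the names scale and n_thresholds are illustrative assumptions.

```python
import numpy as np

preds = np.array([0.1, 0.3, 0.3, 0.6, 0.9])
scale = 4            # thresholds are 0.25, 0.5, 0.75
n_thresholds = scale - 1

# Step 1: scale up and truncate, so each prediction maps to the index of
# the interval [k/scale, (k+1)/scale) that contains it.
bin_index = (preds * scale).astype(int)            # [0 1 1 2 3]

# Step 2: count how many predictions fall in each interval.
per_bin = np.bincount(bin_index, minlength=n_thresholds)  # [1 2 1 1]

# Step 3: drop intervals at or above the last threshold, then take a
# running total to get the count below each threshold.
counts_below = per_bin[:n_thresholds].cumsum()     # [1 3 4]
```

Note that this relies on the thresholds being evenly spaced multiples of 1/scale, and that a prediction lying within floating-point rounding of a threshold may land on either side of it after the multiplication.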

Benchmarking

Approaches -

def searchsorted_app(fake_preds, thresholds):
    return np.searchsorted(np.sort(fake_preds), thresholds, 'right')

def digitize_app(fake_preds, thresholds):
    return np.digitize(thresholds, np.sort(fake_preds) )

def bincount_app(fake_preds, thresholds):
    return np.bincount((fake_preds*100).astype(int),minlength=99)[:99].cumsum()

Runtime test and verification on 10000 elements -

In [210]: np.random.seed(0)
     ...: fake_preds = np.random.rand(10000)
     ...: thresholds = [round(n,2) for n in np.arange(0.01, 1.0, 0.01)]
     ...: thresh_cov = [sum(fake_preds < thresh) for thresh in thresholds]
     ...: 

In [211]: print np.allclose(thresh_cov, searchsorted_app(fake_preds, thresholds))
     ...: print np.allclose(thresh_cov, digitize_app(fake_preds, thresholds))
     ...: print np.allclose(thresh_cov, bincount_app(fake_preds, thresholds))
     ...: 
True
True
True

In [214]: %timeit [sum(fake_preds < thresh) for thresh in thresholds]
1 loop, best of 3: 1.43 s per loop

In [215]: %timeit searchsorted_app(fake_preds, thresholds)
     ...: %timeit digitize_app(fake_preds, thresholds)
     ...: %timeit bincount_app(fake_preds, thresholds)
     ...: 
1000 loops, best of 3: 528 µs per loop
1000 loops, best of 3: 535 µs per loop
10000 loops, best of 3: 24.9 µs per loop

That's a 2,700x+ speedup with searchsorted and 57,000x+ with bincount! With larger datasets, the gap between bincount and searchsorted is bound to increase, since bincount doesn't need to sort.

Answer 3

You can reshape thresholds here to enable broadcasting. First, here are a few possible changes to how you create fake_preds and thresholds that get rid of the loops.

np.random.seed(123)
fake_preds = np.random.normal(size=1000)
fake_preds = (fake_preds + np.abs(fake_preds.min())) \
           / (np.max(fake_preds + np.abs((fake_preds.min()))))
thresholds = np.linspace(.01, 1, 100)

Then what you want can be done in one line:

print(np.sum(np.less(fake_preds, np.tile(thresholds, (1000,1)).T), axis=1))
[  2   2   2   2   2   2   5   5   6   7   7  11  11  11  15  18  21  26
  28  34  40  48  54  63  71  77  90 100 114 129 143 165 176 191 206 222
 240 268 288 312 329 361 392 417 444 479 503 532 560 598 615 648 671 696
 710 726 747 768 787 800 818 840 860 877 891 902 912 919 928 942 947 960
 965 970 978 981 986 987 988 991 993 994 995 995 995 997 997 997 998 998
 999 999 999 999 999 999 999 999 999 999]

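Walkthrough:

fake_preds has shape (1000,). You need to manipulate thresholds into a shape that is broadcast-compatible with this. (See general broadcasting rules.)

A second shape that is broadcastable against it is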

print(np.tile(thresholds, (1000,1)).T.shape)
# (100, 1000)
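
The same (100, 1000) comparison can also be obtained without np.tile, by adding a trailing axis to thresholds and letting broadcasting expand it. This is a minimal sketch of that variant, not the original answer's code; the random data and variable names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(123)
preds = rng.random(1000)                 # shape (1000,)
thresholds = np.linspace(0.01, 1, 100)   # shape (100,)

# thresholds[:, None] has shape (100, 1); comparing it with shape (1000,)
# broadcasts both operands to (100, 1000) without copying thresholds.
below = preds < thresholds[:, None]

# Row i holds the comparison against thresholds[i]; sum across each row.
counts = below.sum(axis=1)               # shape (100,)
```

Either way, the intermediate boolean array has one entry per (threshold, prediction) pair, so with 100 thresholds and a million predictions it holds about 100 million values. That memory cost is worth keeping in mind for large inputs.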

Answer 4

Option 1:

from scipy.stats import percentileofscore
# Note: percentileofscore returns a percentage (0-100), not a count.
thresh_cov = [percentileofscore(fake_preds, thresh) for thresh in thresholds]

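Option 2: Same as above, but sort the list first.

Option 3: Insert the thresholds into the list, sort the list, and find the indices of the thresholds. Note that if you have a quicksort algorithm, you can optimize it by using the thresholds as pivots and terminating the sort once everything has been partitioned according to the thresholds.

Option 4: Building on the above: put the thresholds in a binary tree, then for each item in the list, compare it against the thresholds with a binary search. You can do this item by item, or split the list into subsets at each step.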
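
One possible reading of Option 4 can be sketched with NumPy, using a sorted threshold array and np.searchsorted in place of an explicit binary tree (that substitution is an assumption, as is the helper name counts_below_by_threshold_search). Each prediction is binary-searched among the thresholds, so the predictions themselves never need to be sorted.

```python
import numpy as np

def counts_below_by_threshold_search(preds, thresholds):
    """For each threshold (ascending), count predictions strictly below it."""
    preds = np.asarray(preds, dtype=float)
    thresholds = np.sort(np.asarray(thresholds, dtype=float))
    # For each prediction, find how many thresholds are <= it. That number is
    # also the index of the first threshold strictly greater than it.
    slot = np.searchsorted(thresholds, preds, side="right")
    # Tally predictions per slot; the extra slot collects predictions that
    # are not below any threshold.
    per_slot = np.bincount(slot, minlength=thresholds.size + 1)
    # A prediction in slot k is below thresholds k, k+1, ..., so a running
    # total gives the count below each threshold.
    return per_slot.cumsum()[:thresholds.size]

example = counts_below_by_threshold_search([0.1, 0.2, 0.2, 0.5], [0.2, 0.5, 0.6])
print(example)  # [1 3 4]
```

With n predictions and m thresholds this does n binary searches of depth about log2(m), rather than sorting all n predictions.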

Fast way to count the number of items in a list below a threshold
