Question

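So, if I have something like x=np.random.rand(60000)*400-200, iPython's %timeit says:

x.astype(int) takes 0.14 ms
np.rint(x) and np.around(x) take 1.01 ms

Note that in the rint and around cases you still need to spend the extra 0.14 ms on a final astype(int) (assuming that is what you ultimately want).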
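For anyone who wants to reproduce the comparison outside IPython, here is a minimal sketch using the standard timeit module. The helper name best_ms, the seed, and the repeat counts are arbitrary choices for illustration, and the absolute numbers will differ by machine and NumPy version.

```python
import timeit

import numpy as np

np.random.seed(0)  # arbitrary seed, only for repeatability
x = np.random.rand(60000) * 400 - 200


def best_ms(func, repeat=3, number=20):
    """Return the best observed time per call, in milliseconds."""
    return min(timeit.repeat(func, repeat=repeat, number=number)) / number * 1e3


timings = {
    "x.astype(int)": best_ms(lambda: x.astype(int)),
    "np.rint(x)": best_ms(lambda: np.rint(x)),
    "np.around(x)": best_ms(lambda: np.around(x)),
    # Rounding followed by the integer conversion, i.e. the full pipeline.
    "np.rint(x).astype(int)": best_ms(lambda: np.rint(x).astype(int)),
}

for label, ms in timings.items():
    print(f"{label:<24} {ms:.3f} ms")
```

The last entry measures the combined cost mentioned above: rounding first and then converting.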

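Question: am I right in thinking that most modern hardware can do both operations in the same amount of time? If so, why does numpy take 8 times longer to do the rounding?

As it happens I am not very fussy about the exactness of the arithmetic, but I can't see how to take advantage of that with numpy (I'm doing messy biology, not particle physics).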

Answer 1

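np.around(x).astype(int) and x.astype(int) don't produce the same values. The former rounds to the nearest integer, with ties going to the even one (apart from exact ties it gives the same result as (x + np.copysign(0.5, x)).astype(int), which rounds ties away from zero), whereas the latter rounds towards zero. However,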

y = np.trunc(x).astype(int)
z = x.astype(int)

gives y==z, yet computing y is much slower. So it is the np.trunc and np.around functions themselves that are slow.
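A small sketch of the three behaviours on hand-picked values, including exact .5 ties and negative numbers. The np.copysign line is just one way to spell the "add 0.5 and truncate" shortcut and is my own assumption, not necessarily how the answer meant it.

```python
import numpy as np

x = np.array([-2.5, -1.7, -0.5, 0.4, 0.5, 1.5, 2.5, 3.7])

rounded = np.around(x).astype(int)    # nearest integer, ties go to even
truncated = x.astype(int)             # the cast alone drops the fraction
via_trunc = np.trunc(x).astype(int)   # explicit truncation, then the cast

# Shortcut: push each value 0.5 away from zero, then let the cast truncate.
# This rounds ties away from zero, so it differs from np.around on .5 values.
half_away = (x + np.copysign(0.5, x)).astype(int)

print(rounded)                              # [-2 -2  0  0  0  2  2  4]
print(truncated)                            # [-2 -1  0  0  0  1  2  3]
print(half_away)                            # [-3 -2 -1  0  1  2  3  4]
print(np.array_equal(via_trunc, truncated)) # True
```

The shortcut only agrees with np.around away from ties, so it suits cases where that difference does not matter.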

In [165]: x.dtype
Out[165]: dtype('float64')
In [168]: y.dtype
Out[168]: dtype('int64')

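So np.trunc(x) rounds towards zero from double to double. Then astype(int) has to convert the double to int64.

Internally I don't know what Python or numpy is doing, but I know how I would do this in C. Let's discuss some hardware. With SSE4.1 it is possible to do round, floor, ceil, and trunc from double to double using: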

_mm_round_pd(a, 0); //round: round even
_mm_round_pd(a, 1); //floor: round towards minus infinity
_mm_round_pd(a, 2); //ceil:  round towards positive infinity
_mm_round_pd(a, 3); //trunc: round towards zero

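But numpy needs to support systems without SSE4.1 as well, so it would have to be built both with and without SSE4.1 and then use a dispatcher.

Going from double directly to int64 with SSE/AVX is not efficient until AVX512. However, it is possible to round double to int32 efficiently using only SSE2: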

_mm_cvtpd_epi32(a);  //round double to int32 then expand to int64
_mm_cvttpd_epi32(a); //trunc double to int32 then expand to int64

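These convert two doubles to two int32 values, which are then widened to int64 as the comments note.

In your case this would work fine, since the range is certainly within int32. But unless Python knows the range fits in int32 it can't assume so, and it would have to round or truncate to int64, which is slow. Also, once again, numpy would have to be built with SSE2 support to do this anyway.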
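On the NumPy side, the "I know my data fits in int32" assumption can at least be made explicit. The sketch below uses a hypothetical helper, round_to_int32, that checks the range before converting. Whether targeting int32 is actually faster depends on the NumPy build and the CPU, so it is worth timing rather than assuming.

```python
import numpy as np


def round_to_int32(values):
    """Round to nearest (ties to even) and return int32.

    Hypothetical helper for illustration: it refuses NaN/inf and values
    that would not fit in int32 instead of silently producing garbage.
    """
    info = np.iinfo(np.int32)
    rounded = np.rint(np.asarray(values, dtype=np.float64))
    if not np.all(np.isfinite(rounded)):
        raise ValueError("input contains NaN or infinity")
    if rounded.size and (rounded.min() < info.min or rounded.max() > info.max):
        raise OverflowError("values do not fit in int32")
    return rounded.astype(np.int32)


out = round_to_int32([-199.6, 0.5, 1.5, 199.4])
print(out, out.dtype)
```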

But maybe you could have used a single-precision float array to begin with. In that case you could have done:

_mm_cvtps_epi32(a); //round single to int32
_mm_cvttps_epi32(a); //trunc single to int32

These convert four singles to four int32 values.
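In NumPy terms, starting from single precision would look roughly like the sketch below. It makes no claim about speed, which depends on how your NumPy was built. It only shows that the dtypes line up (float32 in, int32 out) and that the results stay within one unit of the float64 answer, since float32 cannot represent every double exactly.

```python
import numpy as np

np.random.seed(0)  # arbitrary seed, only for repeatability
x64 = np.random.rand(60000) * 400 - 200
x32 = x64.astype(np.float32)  # half the memory, about 7 significant digits

r32 = np.rint(x32).astype(np.int32)  # round single to int32
t32 = x32.astype(np.int32)           # truncate single to int32

# Values that sit extremely close to a .5 boundary can round differently
# in float32 than in float64, so compare with a tolerance of one unit.
max_diff = np.abs(r32 - np.rint(x64)).max()
print(x32.dtype, r32.dtype, max_diff)
```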

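So, to answer your question: SSE2 can round or truncate from double to int32 efficiently. AVX512 will also be able to round or truncate from double to int64 efficiently, using _mm512_cvtpd_epi64(a) or _mm512_cvttpd_epi64(a). SSE4.1 can round/trunc/floor/ceil efficiently from float to float or from double to double.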

Answer 2

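As @jme pointed out in the comments, the rint and around functions have to work out whether to round the fraction up or down to the nearest integer. In contrast, astype always truncates (rounds towards zero), so it can discard the fractional part immediately. A number of other functions do the same thing. You can also gain some speed by using an integer type with fewer bits. However, you must take care that it can hold the full range of your input data.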

%%timeit
np.int8(x)
10000 loops, best of 3: 165 µs per loop

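Note that this cannot store values outside the range -128 to 127, because it is 8-bit. Some of the values in your example fall outside this range.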
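A quick way to check whether a narrower integer type is safe before casting is to compare the truncated data against np.iinfo. The helper name fits_in is hypothetical and the sketch assumes a non-empty array of finite values.

```python
import numpy as np


def fits_in(values, dtype):
    """Return True if every value, once truncated, is representable in dtype."""
    info = np.iinfo(dtype)
    truncated = np.trunc(values)
    return bool(truncated.min() >= info.min and truncated.max() <= info.max)


np.random.seed(0)  # arbitrary seed, only for repeatability
x = np.random.rand(60000) * 400 - 200

print(fits_in(x, np.int8))   # False: the data spans roughly -200 to 200
print(fits_in(x, np.int16))  # True
```

When the check fails, the result of the cast is not something to rely on, so picking the next wider type is the safer choice.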

Of all the others I tried, np.intc seems to be the fastest:

%%timeit
np.int16(x)
10000 loops, best of 3: 186 µs per loop

%%timeit
np.intc(x)
10000 loops, best of 3: 169 µs per loop

%%timeit
np.int0(x)
10000 loops, best of 3: 170 µs per loop

%%timeit
np.int_(x)
10000 loops, best of 3: 188 µs per loop

%%timeit
np.int32(x)
10000 loops, best of 3: 187 µs per loop

%%timeit
np.trunc(x)
1000 loops, best of 3: 940 µs per loop

Your examples, on my machine:

%%timeit
np.around(x)
1000 loops, best of 3: 1.48 ms per loop

%%timeit
np.rint(x)
1000 loops, best of 3: 1.49 ms per loop

%%timeit
x.astype(int)
10000 loops, best of 3: 188 µs per loop

numpy around/rint slow compared to astype(int)
