Question

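Suppose we are given a double-precision numpy array arr and a small positive integer n. I am looking for an efficient way to set the n least significant mantissa bits of each element of arr to 0 or to 1. Is there a ufunc for that? If not, are there suitable C functions that I could apply to the elements from Cython?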

Motivation

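Below I give the motivation for this question. If you find that the end goal can be reached without an answer to the question above, I am happy to receive corresponding comments. I will then create a separate question to keep things sorted.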

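The motivation for this question is to implement a version of np.unique(arr, True) that accepts a relative tolerance parameter. The second argument of np.unique matters here: I need to know the indices of the unique elements (first occurrence!) in the original array. It is not important that the elements come out sorted.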

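I am aware of questions and solutions on np.unique with tolerance. However, I have not found a solution that also returns the indices of the first occurrences of the unique elements in the original array. Furthermore, the solutions I have seen are based on sorting, which runs in O(arr.size log(arr.size)). With a hash map, a solution in O(arr.size) average time should be possible, since each element costs only constant average time.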

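The idea is to round each element of arr both down and up and to put these two values into a hash map. If either of the two values is already in the hash map, the element is ignored. Otherwise the element is included in the result. Since insertion and lookup in a hash map run in constant average time, this method should in theory be faster than a sorting-based method.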

Find my Cython implementation below:

# distutils: language = c++
import numpy as np
cimport numpy as np
import cython
from libcpp.unordered_map cimport unordered_map

# element types used in the buffer declarations below
ctypedef np.float64_t DOUBLE_t
ctypedef np.int64_t INT_t

@cython.boundscheck(False)
@cython.wraparound(False)
def unique_tol(np.ndarray[DOUBLE_t, ndim=1] lower,
               np.ndarray[DOUBLE_t, ndim=1] higher):
    cdef long i, count
    cdef long endIndex = lower.size
    cdef unordered_map[double, short] vals = unordered_map[double, short]()
    cdef np.ndarray[DOUBLE_t, ndim=1] result_vals = np.empty_like(lower)
    cdef np.ndarray[INT_t, ndim=1] result_indices = np.empty_like(lower,
                                                                  dtype=np.int64)

    count = 0
    for i in range(endIndex):
        # keep element i only if neither rounded value has been seen before
        if not vals.count(lower[i]) and not vals.count(higher[i]):

            # insert in result
            result_vals[count] = lower[i]
            result_indices[count] = i

            # put lowerVal and higherVal in the hashMap
            vals[lower[i]] = 1
            vals[higher[i]] = 1

            # update the index in the result
            count += 1

    return result_vals[:count], result_indices[:count]

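This function is called with appropriately rounded arrays. For example, if differences smaller than 10^-6 should be ignored, we would write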

unique_tol(np.round(a, 6), np.round(a+1e-6, 6))
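
For checking small inputs without compiling anything, the same scheme can be sketched in pure Python. This is only an illustrative reference, not the Cython function itself: the name unique_tol_py and the sample array a are made up for this example, and a Python set stands in for the unordered_map.

```python
import numpy as np

def unique_tol_py(lower, higher):
    """Pure-Python reference for the hash-based scheme (hypothetical helper)."""
    seen = set()            # plays the role of the unordered_map
    vals, idx = [], []
    for i, (lo, hi) in enumerate(zip(lower.tolist(), higher.tolist())):
        # keep element i only if neither rounded value was seen before
        if lo not in seen and hi not in seen:
            vals.append(lo)
            idx.append(i)
            seen.add(lo)
            seen.add(hi)
    return np.array(vals, dtype=np.float64), np.array(idx, dtype=np.int64)

# made-up sample data: two clusters, around 1.0 and around 2.0
a = np.array([1.0, 1.0000004, 2.0, 1.0000001, 2.0000003])
vals, idx = unique_tol_py(np.round(a, 6), np.round(a + 1e-6, 6))
print(vals, idx)   # first occurrences are at indices 0 and 2
```

The indices refer to the first occurrence in the original order, which is the part a sorting-based approach does not give directly.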

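Now I would like to replace np.round with a relative rounding procedure based on manipulating the mantissa. I am aware of alternative ways of relative rounding, but I think that manipulating the mantissa directly should be more efficient and more elegant. (Admittedly, I do not think the performance gain is significant. But I would be interested in the solution.)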

Edit

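The solution by Warren Weckesser works like a charm. However, the result is not applicable in the way I had hoped, because two numbers that differ only slightly can have different exponents. Unifying the mantissas then does not lead to equal numbers. I guess I have to stick with the relative rounding solutions.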
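
A small sketch of that failure case, using the uint64 view trick from Answer 1. The two sample values and n = 21 are chosen here purely for illustration: they sit on either side of 1.0, so they differ by about 2e-12 but have different exponents.

```python
import numpy as np

n = 21
mask = np.uint64(2**n - 1)

# two values that are very close but straddle the power of two 1.0
pair = np.array([1.0 - 1e-12, 1.0 + 1e-12])
bits = pair.view(np.uint64)

# "round down": clear the n lowest mantissa bits (result is a new array)
lower = (bits & ~mask).view(np.float64)
# "round up": set the n lowest mantissa bits (result is a new array)
higher = (bits | mask).view(np.float64)

print(np.frexp(pair)[1])   # the binary exponents of the two inputs differ
print(lower, higher)       # no value is shared between the two elements
```

Because none of the four rounded values coincide, the hash-map lookup would treat the two inputs as distinct even though their relative difference is far below 2**-31.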

Answer 1

"I am looking for an efficient way to set the n least significant mantissa bits of each element of arr to 0 or to 1."

You can create a view of the array with data type numpy.uint64, and then manipulate the bits in that view as needed.

For example, I'll set the lowest 21 bits of the mantissa of each element of this array to 0.

In [46]: np.set_printoptions(precision=15)                                                            

In [47]: x = np.array([0.0, -1/3, 1/5, -1/7, np.pi, 6.02214076e23])                                   

In [48]: x                                                                                            
Out[48]: 
array([ 0.000000000000000e+00, -3.333333333333333e-01,
        2.000000000000000e-01, -1.428571428571428e-01,
        3.141592653589793e+00,  6.022140760000000e+23])

Create a view of the data in x with data type numpy.uint64:

In [49]: u = x.view(np.uint64)

Take a look at the binary representation of the values.

In [50]: [np.binary_repr(t, width=64) for t in u]                                                     
Out[50]: 
['0000000000000000000000000000000000000000000000000000000000000000',
 '1011111111010101010101010101010101010101010101010101010101010101',
 '0011111111001001100110011001100110011001100110011001100110011010',
 '1011111111000010010010010010010010010010010010010010010010010010',
 '0100000000001001001000011111101101010100010001000010110100011000',
 '0100010011011111111000011000010111001010010101111100010100010111']

Set the lowest n bits to 0, and take another look.

In [51]: n = 21                                                                                       

In [52]: u &= ~np.uint64(2**n-1)                                                              

In [53]: [np.binary_repr(t, width=64) for t in u]                                                     
Out[53]: 
['0000000000000000000000000000000000000000000000000000000000000000',
 '1011111111010101010101010101010101010101010000000000000000000000',
 '0011111111001001100110011001100110011001100000000000000000000000',
 '1011111111000010010010010010010010010010010000000000000000000000',
 '0100000000001001001000011111101101010100010000000000000000000000',
 '0100010011011111111000011000010111001010010000000000000000000000']

Since u is a view of the same data as x, x has also been modified in place.

In [54]: x                                                                      
Out[54]: 
array([ 0.000000000000000e+00, -3.333333332557231e-01,
        1.999999999534339e-01, -1.428571428405121e-01,
        3.141592653468251e+00,  6.022140758954589e+23])
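
The question also asks about setting the bits to 1, which works the same way with a bitwise OR. Here is a minimal sketch that wraps both cases in a function working on a copy, so the input is left untouched. The name set_low_mantissa_bits and its signature are made up for this example, and it assumes 0 <= n <= 52.

```python
import numpy as np

def set_low_mantissa_bits(arr, n, value):
    """Return a copy of arr with the n lowest mantissa bits set to value (0 or 1).

    Hypothetical helper; assumes 0 <= n <= 52 for float64 data.
    """
    out = np.array(arr, dtype=np.float64)   # np.array copies by default
    bits = out.view(np.uint64)              # same memory, seen as integers
    mask = np.uint64(2**n - 1)
    if value:
        bits |= mask                        # set the low bits to 1
    else:
        bits &= ~mask                       # clear the low bits
    return out

x = np.array([0.0, -1/3, 1/5, -1/7, np.pi, 6.02214076e23])
lo = set_low_mantissa_bits(x, 21, 0)   # magnitude rounded toward zero
hi = set_low_mantissa_bits(x, 21, 1)   # magnitude rounded away from zero
```

One caveat: for inf, whose mantissa bits are all zero, setting any mantissa bit to 1 yields a NaN, so non-finite values may need separate handling.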

Answer 2

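Similar to @WarrenWeckesser's approach, but without the black magic, using "official" ufuncs instead. Downside: I'm pretty sure it is slower, quite possibly significantly so.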

>>> a = np.random.normal(size=10)**5
>>> a
array([ 9.87664561e-12, -1.79654870e-03,  4.36740261e-01,  7.49256141e+00,
       -8.76894617e-01,  2.93850753e+00, -1.44149959e-02, -1.03026094e-03,
        3.18390143e-03,  3.05521581e-03])
>>> 
>>> mant,expn = np.frexp(a)
>>> mant
array([ 0.67871792, -0.91983293,  0.87348052,  0.93657018, -0.87689462,
        0.73462688, -0.92255974, -0.5274936 ,  0.81507877,  0.78213525])
>>> expn
array([-36,  -9,  -1,   3,   0,   2,  -6,  -9,  -8,  -8], dtype=int32)
>>> a_binned = np.ldexp(np.round(mant,5),expn)
>>> a_binned
array([ 9.87667590e-12, -1.79654297e-03,  4.36740000e-01,  7.49256000e+00,
       -8.76890000e-01,  2.93852000e+00, -1.44150000e-02, -1.03025391e-03,
        3.18390625e-03,  3.05523437e-03])
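
Note that np.round(mant, 5) above rounds the mantissa to 5 decimal digits rather than to a number of bits. If a bit count is wanted, one possible variation is to scale the mantissa by a power of two before rounding. This is a sketch with a made-up function name and parameter, not part of the answer above.

```python
import numpy as np

def round_mantissa_bits(a, keep_bits):
    """Round each element of a to keep_bits binary digits of mantissa.

    Hypothetical helper built on np.frexp / np.ldexp.
    """
    mant, expn = np.frexp(a)          # a == mant * 2**expn, 0.5 <= |mant| < 1
    scale = 2.0 ** keep_bits
    # round the mantissa to a multiple of 2**-keep_bits, then reassemble
    return np.ldexp(np.round(mant * scale) / scale, expn)

a = np.array([1.0, 0.1, -3.0, 0.0])
r = round_mantissa_bits(a, 10)
```

Since |mant| is at least 0.5 for nonzero inputs and the rounding error in mant is at most 2**-(keep_bits + 1), the relative error of the result is bounded by 2**-keep_bits.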

numpy: Set n mantissa bits in an array of doubles
