NumPy and diff()

Time: 2016-06-29 14:26:33

Tags: python numpy

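I am trying to create a diff of my sorted numpy array so that, if I record the values of the first row plus the diffs, I can recreate the original table while storing less data.

So here is an example of the table: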

my_array = numpy.array([(0,  0,   0,  0,  0,   0,  0,  0,   0,  0,  0,   0,  0,  0), 
                        (0,  0,   0,  0,  0,   0,  0,  0,   0,  0,  0,   0,  0,  1),
                        (0,  0,   0,  0,  0,   0,  0,  0,   0,  0,  0,   0,  0,  2), 
                        (9, 36, 146, 73, 36, 146, 73, 36, 146, 73, 36, 146, 73, 34),
                        (9, 36, 146, 73, 36, 146, 73, 36, 146, 73, 36, 146, 73, 35), 
                        (9, 36, 146, 73, 36, 146, 73, 36, 146, 73, 36, 146, 73, 36)
                       ],'uint8,uint8,uint8,uint8,uint8,uint8,uint8,uint8,uint8,uint8,uint8,uint8,uint8,uint8')

After running numpy.diff(my_array), I would expect something like this:

[(0,  0,   0,  0,  0,   0,  0,  0,   0,  0,  0,   0,  0,  1), 
 (0,  0,   0,  0,  0,   0,  0,  0,   0,  0,  0,   0,  0,  1),
 (9, 36, 146, 73, 36, 146, 73, 36, 146, 73, 36, 146, 73, 32),
 (0,  0,   0,  0,  0,   0,  0,  0,   0,  0,  0,   0,  0,  1),
 (0,  0,   0,  0,  0,   0,  0,  0,   0,  0,  0,   0,  0,  1)
]
  

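Note: the data above is the first and last three rows of the "real" data, which is far larger. With the full data set, most rows after the diff will be 0,0,0,0,0,0,0,0,0,0,0,0,0,1, which a) can be stored in a much smaller structure and b) will compress very well on disk, because most rows contain very similar data.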

     

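I should point out that the reason I have a bunch of uint8 columns in the first place is that I need to store an array of extremely large numbers in as little memory as possible. The largest number is 185439173519100986733232011757860, which is too big for uint64. In fact, the minimum number of bits needed to store it is 108, or 14 bytes (rounded up to the nearest byte). So, to fit these large numbers into numpy, I use the following two functions:

def large_number_to_numpy(number, columns):
    # Split the number into `columns` bytes, most significant byte first.
    return tuple((number >> (8*x)) & 255 for x in range(columns-1, -1, -1))

def numpy_to_large_number(numbers):
    # Reassemble the number from its bytes (most significant byte first).
    return sum([y << (8*x) for x, y in enumerate(numbers[::-1])])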

     

They are used as follows:

>>> large_number_to_numpy(185439173519100986733232011757860L,14)
(9L, 36L, 146L, 73L, 36L, 146L, 73L, 36L, 146L, 73L, 36L, 146L, 73L, 36L)

>>> numpy_to_large_number((9L, 36L, 146L, 73L, 36L, 146L, 73L, 36L, 146L, 73L, 36L, 146L, 73L, 36L))
185439173519100986733232011757860L
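The session above is Python 2 (note the `L` suffixes). As a minimal sketch, assuming Python 3, the same round trip can be written with the built-in `int.to_bytes` and `int.from_bytes`. The function names `large_number_to_bytes` and `bytes_to_large_number` are made up for this illustration and are not part of the original post.

```python
def large_number_to_bytes(number, columns):
    """Split a non-negative int into `columns` byte values, most significant first."""
    return tuple(number.to_bytes(columns, byteorder="big"))


def bytes_to_large_number(values):
    """Rebuild the int from a sequence of byte values, most significant first."""
    return int.from_bytes(bytes(int(v) for v in values), byteorder="big")


n = 185439173519100986733232011757860
row = large_number_to_bytes(n, 14)

print(row)                             # (9, 36, 146, 73, 36, 146, 73, 36, 146, 73, 36, 146, 73, 36)
print(n.bit_length())                  # 108
print(bytes_to_large_number(row) == n) # True
```

`int.to_bytes` raises `OverflowError` if the number does not fit in the requested number of bytes, whereas the bit-shifting version silently drops the high bits.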

     

The array is created like this:

     

my_array = numpy.zeros(TOTAL_ROWS,','.join(14*['uint8']))

     

and then populated with:

     

my_array[x] = large_number_to_numpy(large_number,14)

But this is what I get:

>>> my_array
array([(0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0),
       (0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1),
       (0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2),
       (9, 36, 146, 73, 36, 146, 73, 36, 146, 73, 36, 146, 73, 34),
       (9, 36, 146, 73, 36, 146, 73, 36, 146, 73, 36, 146, 73, 35),
       (9, 36, 146, 73, 36, 146, 73, 36, 146, 73, 36, 146, 73, 36)],
      dtype=[('f0', 'u1'), ('f1', 'u1'), ('f2', 'u1'), ('f3', 'u1'), ('f4', 'u1'), ('f5', 'u1'), ('f6', 'u1'), ('f7', 'u1'), ('f8', 'u1'), ('f9', 'u1'), ('f10', 'u1'), ('f11', 'u1'), ('f12', 'u1'), ('f13', 'u1')])
>>> numpy.diff(my_array)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python2.7/site-packages/numpy/lib/function_base.py", line 1567, in diff
    return a[slice1]-a[slice2]
TypeError: ufunc 'subtract' did not contain a loop with signature matching types dtype([('f0', 'u1'), ('f1', 'u1'), ('f2', 'u1'), ('f3', 'u1'), ('f4', 'u1'), ('f5', 'u1'), ('f6', 'u1'), ('f7', 'u1'), ('f8', 'u1'), ('f9', 'u1'), ('f10', 'u1'), ('f11', 'u1'), ('f12', 'u1'), ('f13', 'u1')]) dtype([('f0', 'u1'), ('f1', 'u1'), ('f2', 'u1'), ('f3', 'u1'), ('f4', 'u1'), ('f5', 'u1'), ('f6', 'u1'), ('f7', 'u1'), ('f8', 'u1'), ('f9', 'u1'), ('f10', 'u1'), ('f11', 'u1'), ('f12', 'u1'), ('f13', 'u1')]) dtype([('f0', 'u1'), ('f1', 'u1'), ('f2', 'u1'), ('f3', 'u1'), ('f4', 'u1'), ('f5', 'u1'), ('f6', 'u1'), ('f7', 'u1'), ('f8', 'u1'), ('f9', 'u1'), ('f10', 'u1'), ('f11', 'u1'), ('f12', 'u1'), ('f13', 'u1')])

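2 Answers:

Answer 0 (score: 4)

The problem is that you have a structured array rather than a regular 2-D array, so numpy doesn't know how to subtract one record (tuple) from another.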

Convert the structured array to a regular array (from this SO question):

my_array = my_array.view(numpy.uint8).reshape((my_array.shape[0], -1))

Then do numpy.diff(my_array, axis=0)
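Put together, the conversion and the diff might look like the following sketch. It uses a three-row stand-in for the real table and assumes every field is `uint8` and the array is contiguous, which is what makes the plain byte view line up with the fields.

```python
import numpy as np

# A small structured array with 14 uint8 fields per record (illustrative data).
structured = np.array(
    [(0,) * 13 + (0,), (0,) * 13 + (1,), (0,) * 13 + (2,)],
    dtype=",".join(14 * ["uint8"]),
)

# Reinterpret the same memory as raw bytes, then give it one row per record.
plain = structured.view(np.uint8).reshape((structured.shape[0], -1))

# Row-wise differences: row[i + 1] - row[i].
diffs = np.diff(plain, axis=0)

print(plain.shape)  # (3, 14)
print(diffs)
```

Because `view` does not copy, `plain` shares memory with `structured`; writing to one changes the other.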

Alternatively, if you can, avoid creating a structured array in the first place by defining my_array as:

numpy.array([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
             [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1],
             [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2],
             [9, 36, 146, 73, 36, 146, 73, 36, 146, 73, 36, 146, 73, 34],
             [9, 36, 146, 73, 36, 146, 73, 36, 146, 73, 36, 146, 73, 35],
             [9, 36, 146, 73, 36, 146, 73, 36, 146, 73, 36, 146, 73, 36]],
            dtype=numpy.uint8)

Answer 1 (score: 0)

Thanks to Alberto and Andras, all I needed to do was:

Change my array from:

my_array = numpy.zeros(6,'uint8,uint8,uint8,uint8,uint8,uint8,uint8,uint8,uint8,uint8,uint8,uint8,uint8,uint8')

to:

my_array = numpy.zeros((6,14),'uint8')

I have no idea why one behaves differently from the other, but I guess that's just how numpy rolls.
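For what it's worth, the difference can be inspected directly. In this short sketch, which assumes a reasonably recent NumPy, the first form is a 1-D array of six records with named fields, while the second is a 2-D array of plain bytes. Both occupy the same amount of memory, but only the second has arithmetic defined on its elements.

```python
import numpy as np

fields = np.zeros(6, ",".join(14 * ["uint8"]))  # 1-D: 6 records, 14 named fields each
plain = np.zeros((6, 14), "uint8")              # 2-D: 6 rows x 14 plain bytes

print(fields.shape, fields.dtype.names[:3])  # (6,) ('f0', 'f1', 'f2')
print(plain.shape, plain.dtype)              # (6, 14) uint8
print(fields.nbytes, plain.nbytes)           # 84 84

# Subtraction is not defined for whole records, so diff fails on `fields`.
try:
    np.diff(fields)
    diff_failed = False
except TypeError:
    diff_failed = True

print(diff_failed)  # True
```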

I can then populate it:

my_array[0] = large_number_to_numpy(0,14)
my_array[1] = large_number_to_numpy(1,14)
my_array[2] = large_number_to_numpy(2,14)
my_array[3] = large_number_to_numpy(185439173519100986733232011757858,14)
my_array[4] = large_number_to_numpy(185439173519100986733232011757859,14)
my_array[5] = large_number_to_numpy(185439173519100986733232011757860,14)

which produces:

array([[  0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0],
       [  0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   1],
       [  0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   2],
       [  9,  36, 146,  73,  36, 146,  73,  36, 146,  73,  36, 146,  73,  34],
       [  9,  36, 146,  73,  36, 146,  73,  36, 146,  73,  36, 146,  73,  35],
       [  9,  36, 146,  73,  36, 146,  73,  36, 146,  73,  36, 146,  73,  36]], dtype=uint8)

numpy.diff(my_array,1,0) then computes the row-wise differences, giving:

array([[  0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,  1],
       [  0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,  1],
       [  9,  36, 146,  73,  36, 146,  73,  36, 146,  73,  36, 146,  73, 32],
       [  0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,  1],
       [  0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,  1]], dtype=uint8)

Which is perfect :)
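Since the stated goal was to recreate the original table from the first row plus the diffs, here is a hedged sketch of the reverse step. It relies on `uint8` arithmetic wrapping modulo 256 in both `numpy.diff` and `cumsum`, so the wraparound cancels out. The helper `to_row` and the extra values 255 and 256 are additions for illustration, chosen to exercise a carry between bytes.

```python
import numpy as np


def to_row(number, columns=14):
    """Same byte split as large_number_to_numpy: most significant byte first."""
    return tuple((number >> (8 * x)) & 255 for x in range(columns - 1, -1, -1))


numbers = [
    0, 1, 2, 255, 256,
    185439173519100986733232011757858,
    185439173519100986733232011757859,
    185439173519100986733232011757860,
]
table = np.array([to_row(n) for n in numbers], dtype=np.uint8)

first_row = table[:1]           # kept as a 1 x 14 array
diffs = np.diff(table, axis=0)  # byte-wise differences, wrapping modulo 256

# Undo the diff: a running sum over [first_row, diffs], again modulo 256.
restored = np.concatenate([first_row, diffs]).cumsum(axis=0, dtype=np.uint8)

print(np.array_equal(restored, table))  # True
print(diffs[3, -2:])                    # [1 1]
```

Note that the diff is taken per byte, not on the numbers themselves. Going from 255 to 256 yields a row ending in 1, 1 rather than 0, 1, so rows where a carry occurs will not match the common 0,...,0,1 pattern, although the reconstruction is still exact.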