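I'm trying to create a diff of my sorted numpy array, so that if I record the values of the first row plus the diffs, I can recreate the original table while storing less data.
So here is an example of the table: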
my_array = numpy.array([(0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0),
(0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1),
(0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2),
(9, 36, 146, 73, 36, 146, 73, 36, 146, 73, 36, 146, 73, 34),
(9, 36, 146, 73, 36, 146, 73, 36, 146, 73, 36, 146, 73, 35),
(9, 36, 146, 73, 36, 146, 73, 36, 146, 73, 36, 146, 73, 36)
],'uint8,uint8,uint8,uint8,uint8,uint8,uint8,uint8,uint8,uint8,uint8,uint8,uint8,uint8')
After running numpy.diff(my_array), I would expect something like this:
[(0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1),
(0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1),
(9, 36, 146, 73, 36, 146, 73, 36, 146, 73, 36, 146, 73, 32),
(0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1),
(0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1)
]
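Note: the data above comes from the first and last three rows of the "real" data, which is much larger. With the full data set, most of the rows after the diff will be 0,0,0,0,0,0,0,0,0,0,0,0,0,1, which a) can be stored in a much smaller structure, and b) will compress very well on disk, because most rows contain very similar data.
I should point out that the reason I have a big pile of uint8s in the first place is that I need to store an array of extremely large numbers in as little memory as possible. The largest number is 185439173519100986733232011757860, which is too big for uint64. In fact, the minimum number of bits needed to store it is 108, or 14 bytes (rounded up to the nearest byte). So, to make these large numbers fit into numpy, I use the following two functions: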
def large_number_to_numpy(number,columns): return tuple((number >> (8*x)) & 255 for x in range(columns-1,-1,-1))
def numpy_to_large_number(numbers): return sum([y << (8*x) for x,y in enumerate(numbers[::-1])])
They are used like this:
>>> large_number_to_numpy(185439173519100986733232011757860L,14)
(9L, 36L, 146L, 73L, 36L, 146L, 73L, 36L, 146L, 73L, 36L, 146L, 73L, 36L)
>>> numpy_to_large_number((9L, 36L, 146L, 73L, 36L, 146L, 73L, 36L, 146L, 73L, 36L, 146L, 73L, 36L))
185439173519100986733232011757860L
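For anyone reading this on Python 3, where the L suffix no longer exists, here is a minimal sketch of the same round trip. It assumes Python 3 and compares the two helper functions against the built-in int.to_bytes / int.from_bytes; the variable names are only for illustration.

```python
def large_number_to_numpy(number, columns):
    # Split the integer into `columns` bytes, most significant byte first.
    return tuple((number >> (8 * x)) & 255 for x in range(columns - 1, -1, -1))

def numpy_to_large_number(numbers):
    # Reassemble the integer from its big-endian bytes.
    return sum(y << (8 * x) for x, y in enumerate(numbers[::-1]))

n = 185439173519100986733232011757860

as_bytes = large_number_to_numpy(n, 14)
builtin = tuple(n.to_bytes(14, "big"))          # built-in equivalent (Python 3)
back = int.from_bytes(bytes(as_bytes), "big")   # built-in inverse (Python 3)

print(n.bit_length())                        # number of bits needed for n
print(as_bytes == builtin)                   # both split the number the same way
print(numpy_to_large_number(as_bytes) == n)  # round trip with the helpers
```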
The array is created like this:
my_array = numpy.zeros(TOTAL_ROWS,','.join(14*['uint8']))
and then populated:
my_array[x] = large_number_to_numpy(large_number,14)
But this is what I get:
>>> my_array
array([(0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0),
(0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1),
(0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2),
(9, 36, 146, 73, 36, 146, 73, 36, 146, 73, 36, 146, 73, 34),
(9, 36, 146, 73, 36, 146, 73, 36, 146, 73, 36, 146, 73, 35),
(9, 36, 146, 73, 36, 146, 73, 36, 146, 73, 36, 146, 73, 36)],
dtype=[('f0', 'u1'), ('f1', 'u1'), ('f2', 'u1'), ('f3', 'u1'), ('f4', 'u1'), ('f5', 'u1'), ('f6', 'u1'), ('f7', 'u1'), ('f8', 'u1'), ('f9', 'u1'), ('f10', 'u1'), ('f11', 'u1'), ('f12', 'u1'), ('f13', 'u1')])
>>> numpy.diff(my_array)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/local/lib/python2.7/site-packages/numpy/lib/function_base.py", line 1567, in diff
return a[slice1]-a[slice2]
TypeError: ufunc 'subtract' did not contain a loop with signature matching types dtype([('f0', 'u1'), ('f1', 'u1'), ('f2', 'u1'), ('f3', 'u1'), ('f4', 'u1'), ('f5', 'u1'), ('f6', 'u1'), ('f7', 'u1'), ('f8', 'u1'), ('f9', 'u1'), ('f10', 'u1'), ('f11', 'u1'), ('f12', 'u1'), ('f13', 'u1')]) dtype([('f0', 'u1'), ('f1', 'u1'), ('f2', 'u1'), ('f3', 'u1'), ('f4', 'u1'), ('f5', 'u1'), ('f6', 'u1'), ('f7', 'u1'), ('f8', 'u1'), ('f9', 'u1'), ('f10', 'u1'), ('f11', 'u1'), ('f12', 'u1'), ('f13', 'u1')]) dtype([('f0', 'u1'), ('f1', 'u1'), ('f2', 'u1'), ('f3', 'u1'), ('f4', 'u1'), ('f5', 'u1'), ('f6', 'u1'), ('f7', 'u1'), ('f8', 'u1'), ('f9', 'u1'), ('f10', 'u1'), ('f11', 'u1'), ('f12', 'u1'), ('f13', 'u1')])
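Answer 0 (score: 4)
The problem is that you have a structured array rather than a regular 2-D array, so numpy doesn't know how to subtract one tuple from another.
Convert the structured array to a regular array (from this SO question):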
my_array = my_array.view(numpy.uint8).reshape((my_array.shape[0], -1))
Then run numpy.diff(my_array, axis=0).
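As a rough sketch of those two steps together, the snippet below rebuilds the six example rows as a structured array, views the same bytes as a plain (rows, 14) uint8 array, and diffs along axis 0. The names tail, rows, structured, plain and diffs are only illustrative, and it assumes every field is uint8 so that the byte view lines up with the columns.

```python
import numpy as np

# The six example rows from the question, built as tuples.
tail = (9, 36, 146, 73, 36, 146, 73, 36, 146, 73, 36, 146, 73)
rows = [(0,) * 13 + (k,) for k in (0, 1, 2)] + [tail + (k,) for k in (34, 35, 36)]

# Structured array: 1-D, each element is a record with 14 uint8 fields.
structured = np.array(rows, dtype=",".join(14 * ["uint8"]))

# Reinterpret the same memory as plain uint8 and restore the row structure.
plain = structured.view(np.uint8).reshape(structured.shape[0], -1)

# Row-to-row differences (axis=0), which is what the question wants.
diffs = np.diff(plain, axis=0)
print(plain.shape, diffs.shape)
print(diffs)
```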
Alternatively, if you can, avoid the structured array altogether by defining my_array as
numpy.array([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1],
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2],
[9, 36, 146, 73, 36, 146, 73, 36, 146, 73, 36, 146, 73, 34],
[9, 36, 146, 73, 36, 146, 73, 36, 146, 73, 36, 146, 73, 35],
[9, 36, 146, 73, 36, 146, 73, 36, 146, 73, 36, 146, 73, 36]],
dtype=numpy.uint8)
Answer 1 (score: 0)
Thanks to Alberto and Andras, all I needed to do was:
Change my array from:
my_array = numpy.zeros(6,'uint8,uint8,uint8,uint8,uint8,uint8,uint8,uint8,uint8,uint8,uint8,uint8,uint8,uint8')
to:
my_array = numpy.zeros((6,14),'uint8')
I have no idea why one behaves differently from the other, but I guess that's just how numpy rolls.
Then I can populate it:
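For what it's worth, the difference appears to be this: the comma-separated dtype string creates a 1-D array of records with 14 named fields each, while the second form creates an ordinary 2-D array of uint8. A small sketch that makes this visible (variable names are only illustrative):

```python
import numpy as np

# 1-D array of 6 records; each record has fields f0..f13 of type uint8.
structured = np.zeros(6, dtype=",".join(14 * ["uint8"]))

# Ordinary 2-D array: 6 rows by 14 columns of uint8.
plain = np.zeros((6, 14), dtype="uint8")

print(structured.shape, structured.dtype.names)
print(plain.shape, plain.dtype)

# Both occupy the same amount of memory; only the interpretation differs.
print(structured.nbytes, plain.nbytes)
```

Arithmetic such as subtraction is defined for the plain uint8 array but not for the record dtype, which is why numpy.diff raised the TypeError in the question.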
my_array[0] = large_number_to_numpy(0,14)
my_array[1] = large_number_to_numpy(1,14)
my_array[2] = large_number_to_numpy(2,14)
my_array[3] = large_number_to_numpy(185439173519100986733232011757858,14)
my_array[4] = large_number_to_numpy(185439173519100986733232011757859,14)
my_array[5] = large_number_to_numpy(185439173519100986733232011757860,14)
Which produces:
array([[ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
[ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1],
[ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2],
[ 9, 36, 146, 73, 36, 146, 73, 36, 146, 73, 36, 146, 73, 34],
[ 9, 36, 146, 73, 36, 146, 73, 36, 146, 73, 36, 146, 73, 35],
[ 9, 36, 146, 73, 36, 146, 73, 36, 146, 73, 36, 146, 73, 36]], dtype=uint8)
Taking the diff with numpy.diff(my_array,1,0) gives:
array([[ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1],
[ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1],
[ 9, 36, 146, 73, 36, 146, 73, 36, 146, 73, 36, 146, 73, 32],
[ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1],
[ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1]], dtype=uint8)
Which is perfect :)
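Since the original goal was to rebuild the table from the first row plus the diffs, here is a minimal sketch of that step. It assumes the diffs are kept as uint8 and uses a cumulative sum in uint8 as the inverse. One caveat: the diff is taken per byte, so when a borrow crosses a byte boundary (255 to 256 in the extra rows added here) the byte differences wrap modulo 256 and are not the numeric difference. The cumulative sum wraps the same way, so the rows should still come back exactly.

```python
import numpy as np

def large_number_to_numpy(number, columns):
    # Same helper as in the question: big-endian bytes of `number`.
    return tuple((number >> (8 * x)) & 255 for x in range(columns - 1, -1, -1))

# Example values; 255 and 256 are added here to exercise the wrap-around case.
numbers = [0, 1, 2, 255, 256,
           185439173519100986733232011757858,
           185439173519100986733232011757859,
           185439173519100986733232011757860]
table = np.array([large_number_to_numpy(n, 14) for n in numbers], dtype=np.uint8)

# What gets stored: the first row and the per-byte row differences.
first_row = table[0]
diffs = np.diff(table, axis=0)

# Rebuild: cumulative sum down the rows, kept in uint8 so it wraps like the diff.
restored = np.cumsum(np.vstack([first_row, diffs]), axis=0, dtype=np.uint8)

print(diffs[3])                          # the 255 -> 256 step, differenced per byte
print(np.array_equal(restored, table))
```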