Question

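I have a 3D numpy array, arr, with shape m*n*k.

For every set of values along the m axis (e.g. arr[:, 0, 0]) I want to generate a single value to represent that set, so that I end up with a 2D matrix of shape n*k. If a set of values is repeated at another (n, k) position, the same value should be generated each time.

In other words, it is a hashing problem.

I created a solution to the problem using a dictionary, but it drastically reduces performance. For each set of values, I call a function that does the dictionary lookup.

The array itself is typically of size 30 * 256 * 256, so each set contains 30 values. I have hundreds of these arrays to process at any one time. Currently, all the processing that needs to be done up to computing the hash takes 1.3 s for a block of 100 arrays. Including the hashing bumps that up to 75 s.

Is there a faster way to generate the single representative value?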

Answer 1

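Depending on how many new keys versus old keys need to be generated, it is hard to say what would be optimal. But using your logic, the following should be fairly fast: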

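import collections
import hashlib

_key = 0

def _get_new_key():
    # Hand out the next unused integer id.
    global _key
    _key += 1
    return _key

# Maps an md5 digest to an integer id; unseen digests get a fresh id.
attributes = collections.defaultdict(_get_new_key)

def get_cell_id(series):
    # Hash the raw bytes of the series and look up (or assign) its id.
    return attributes[hashlib.md5(series.tobytes()).digest()]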

Edit

Following your question, I have now updated this to loop over all data series using strides:

In [99]: import numpy as np

In [100]: A = np.random.random((30, 256, 256))

In [101]: A_strided = np.lib.stride_tricks.as_strided(A, (A.shape[1] * A.shape[2], A.shape[0]), (A.itemsize, A.itemsize * A.shape[1] * A.shape[2]))

In [102]: %timeit tuple(get_cell_id(S) for S in A_strided)
10 loops, best of 3: 169 ms per loop

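The above performs 256x256 lookups/assignments, one for each 30-element array. There is of course no guarantee that md5 hashes will not collide. If that is a concern, you can switch to another hash from the same library.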
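If collisions are a concern, one option is to skip the digest and use the raw bytes of each series as the dictionary key, since a dict compares keys for equality whenever their hashes match. The following is only a sketch under that assumption; `label_series` is a hypothetical helper name, and it trades extra memory (one bytes key per distinct series) for exactness:

```python
import numpy as np

def label_series(arr):
    """Return an (n, k) int array; equal series along axis 0 get equal labels."""
    m, n, k = arr.shape
    # One row per (n, k) position, each row holding the m values of that series.
    rows = np.ascontiguousarray(np.moveaxis(arr, 0, -1)).reshape(n * k, m)
    ids = {}
    out = np.empty(n * k, dtype=np.intp)
    for i, row in enumerate(rows):
        # setdefault assigns the next free id the first time a key is seen.
        out[i] = ids.setdefault(row.tobytes(), len(ids))
    return out.reshape(n, k)

arr = np.random.randint(0, 3, size=(2, 5, 4))
labels = label_series(arr)
```

Note that the comparison is byte-wise, so for floating-point data 0.0 and -0.0 would count as different series.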

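Edit 2:

Given that you seem to do most of the expensive operations along the first axis of your 3D array, I would suggest reorganizing the array: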

In [254]: A2 = np.random.random((256, 256, 30))

In [255]: A2_strided = np.lib.stride_tricks.as_strided(A2, (A2.shape[0] * A2.shape[1], A2.shape[2]), (A2.itemsize * A2.shape[2], A2.itemsize))

In [256]: %timeit tuple(get_cell_id(S) for S in A2_strided)
10 loops, best of 3: 126 ms per loop

Not having to jump long distances in memory gives a speedup of roughly 25%.
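For an array that already exists in the (30, 256, 256) layout, the reorganization could be done with a one-time copy. This is a sketch of how one might apply the suggestion, not code from the answer; the name `A2_rows` is made up for illustration:

```python
import numpy as np

A = np.random.random((30, 256, 256))

# Move the series axis last and copy once so each series is contiguous in memory.
A2 = np.ascontiguousarray(np.moveaxis(A, 0, -1))   # shape (256, 256, 30)

# On a C-contiguous array a plain reshape is a view, so no stride arithmetic is needed.
A2_rows = A2.reshape(-1, A2.shape[-1])             # shape (65536, 30)
```

Whether the copy pays off likely depends on how many times the series are traversed afterwards.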

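Edit 3:

If there is no actual need to cache a hash-to-int lookup and you only need the hashes themselves, and if the 3D array is of int8 type, then with the A2 and A2_strided organization the time can be reduced somewhat further. About 15 ms of that time is the tuple looping.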
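A minimal sketch of hashing each series directly, without the dictionary lookup, might look like the following. It assumes int8 data in the (256, 256, 30) layout; the random input and the names `rows` and `digests` are illustrative only, and no timing is claimed for it:

```python
import hashlib
import numpy as np

# Illustrative int8 input in the reorganized layout.
A2 = np.random.randint(-128, 128, size=(256, 256, 30)).astype(np.int8)

# View with one 30-byte series per row (no copy for a C-contiguous array).
rows = A2.reshape(-1, A2.shape[-1])

# One 16-byte md5 digest per series.
digests = tuple(hashlib.md5(row.tobytes()).digest() for row in rows)
```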

Answer 2

This could be one approach using basic numpy functions:

import numpy as np

# Random input for demo
arr = np.random.randint(0,3,[2,5,4])

# Get dimensions for later usage
m,n,k = arr.shape

# Reshape arr to a 2D array that has each slice arr[:, n, k] in each row
arr2d = np.transpose(arr,(1,2,0)).reshape([-1,m])

# Perform lexsort & get corresponding indices and sorted array 
sorted_idx = np.lexsort(arr2d.T)
sorted_arr2d =  arr2d[sorted_idx,:]

# Differentiation along rows for sorted array
df1 = np.diff(sorted_arr2d,axis=0)

# Look for changes along df1 that represent new labels to be put there
df2 = np.append([False],np.any(df1!=0,1),0)

# Get unique labels
labels = df2.cumsum(0)

# Store those unique labels in a n x k shaped 2D array
pos_labels = np.zeros_like(labels)
pos_labels[sorted_idx] = labels
out = pos_labels.reshape([n,k])

Sample run:

In [216]: arr
Out[216]: 
array([[[2, 1, 2, 1],
        [1, 0, 2, 1],
        [2, 0, 1, 1],
        [0, 0, 1, 1],
        [1, 0, 0, 2]],

       [[2, 1, 2, 2],
        [0, 0, 2, 1],
        [2, 1, 0, 0],
        [1, 0, 1, 0],
        [0, 1, 1, 0]]])

In [217]: out
Out[217]: 
array([[6, 4, 6, 5],
       [1, 0, 6, 4],
       [6, 3, 1, 1],
       [3, 0, 4, 1],
       [1, 3, 3, 2]], dtype=int32)
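On NumPy versions that support the `axis` argument of `np.unique`, the same kind of labeling can probably be obtained more compactly. This is a sketch of an alternative, not the method used above, and the labels may be numbered in a different order than in the lexsort version:

```python
import numpy as np

arr = np.random.randint(0, 3, size=(2, 5, 4))
m, n, k = arr.shape

# One row per (n, k) position, each row holding the m values of that series.
arr2d = np.moveaxis(arr, 0, -1).reshape(-1, m)

# return_inverse gives, for every row, the index of its unique row.
unique_rows, inverse = np.unique(arr2d, axis=0, return_inverse=True)

# Reshape the per-row labels back onto the n x k grid.
out = inverse.reshape(n, k)
```

Indexing `unique_rows[out]` recovers the original series for each position, which makes the labeling easy to check.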

Answer 3

If it is just about hashing, try this:

import numpy as np

# create random data
a = np.random.randint(10, size=(5, 3, 3))

# create some identical 0-axis data
a[:, 0, 0] = np.arange(5)
a[:, 0, 1] = np.arange(5)

# create matrix with the hash values
h = np.apply_along_axis(lambda x: hash(tuple(x)), 0, a)

h[0, 0] == h[0, 1]
# Output: True

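However, use it with caution and test this code with your own data first. All I can say is that it works for this simple example.

Also, two sets of values can have the same hash value even though they differ. This can always happen with hash functions, but it is very unlikely.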
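To make the collision risk concrete, here is a small sketch that looks for different series sharing a tuple hash. `find_hash_collisions` is a hypothetical helper, and the example relies on a CPython detail (`hash(-1)` and `hash(-2)` are both -2), so it may behave differently on other interpreters:

```python
import numpy as np

def find_hash_collisions(a):
    """List pairs of different series along axis 0 whose tuple hashes are equal."""
    m = a.shape[0]
    rows = np.moveaxis(a, 0, -1).reshape(-1, m)
    seen = {}          # hash value -> first series that produced it
    collisions = []
    for row in rows:
        key = tuple(row.tolist())
        first = seen.setdefault(hash(key), key)
        if first != key:
            collisions.append((first, key))
    return collisions

# In CPython, hash(-1) and hash(-2) are both -2, so these two series collide.
a = np.array([[[-1, -2]]])   # shape (1, 1, 2): series (-1,) and (-2,)
collisions = find_hash_collisions(a)
```

Running such a check on representative data is one way to follow the advice above about testing first.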

Edit: for comparison with the other solutions

timeit(np.apply_along_axis(lambda x: hash(tuple(x)),0,a))
# output: 1 loops, best of 3: 677 ms per loop

Generate unique values based on rows in a numpy array
