Question

I have an array holding the coordinates of N points. Another array contains the masses of these N points.

 >>> import numpy as np
 >>> N=10
 >>> xyz=np.random.randint(0,2,(N,3))
 >>> mass=np.random.rand(len(xyz))
 >>> xyz
 array([[1, 0, 1],
        [1, 1, 0],
        [0, 1, 1],
        [0, 0, 0],
        [0, 1, 0],
        [1, 1, 0],
        [1, 0, 1],
        [0, 0, 1],
        [1, 0, 1],
        [0, 0, 1]])
 >>> mass
 array([ 0.38668401,  0.44385111,  0.47756182,  0.74896529,  0.20424403,
         0.21828435,  0.98937523,  0.08736635,  0.24790248,  0.67759276])

Now I want to obtain an array with the unique rows of xyz and a corresponding array of summed masses. That means the following arrays:

 >>> xyz_unique
 array([[0, 1, 1],
        [1, 1, 0],
        [0, 0, 1],
        [1, 0, 1],
        [0, 0, 0],
        [0, 1, 0]])
 >>> mass_unique
 array([ 0.47756182,  0.66213546,  0.76495911,  1.62396172,  0.74896529,
         0.20424403])

My attempt was the following code with a double loop:

 >>> xyz_unique=np.array(list(set(tuple(p) for p in xyz)))
 >>> mass_unique=np.zeros(len(xyz_unique))
 >>> for j in np.arange(len(xyz_unique)):
 ...     indices=np.array([],dtype=np.int64)
 ...     for i in np.arange(len(xyz)):
 ...         if np.all(xyz[i]==xyz_unique[j]):
 ...             indices=np.append(indices,i)
 ...     mass_unique[j]=np.sum(mass[indices])

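The problem is that this takes far too long, and in practice I have N = 100000. Is there a faster way, or how can I improve my code?

EDIT: My coordinates are actually floating-point numbers. To keep things simple, I generated random integers so that duplicates show up at low N.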

Answer 1

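Case 1: Binary numbers in xyz

If the elements of the input array xyz are only 0s and 1s, you can convert each row into an equivalent decimal number and then label each row by its uniqueness relative to the other decimal numbers. Based on those labels, you can use np.bincount to accumulate the sums, much as you would use accumarray in MATLAB. Here's an implementation that does all of that -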

import numpy as np

# Input arrays xyz and mass
xyz = np.array([
   [1, 0, 1],
   [1, 1, 0],
   [0, 1, 1],
   [0, 0, 0],
   [0, 1, 0],
   [1, 1, 0],
   [1, 0, 1],
   [0, 0, 1],
   [1, 0, 1],
   [0, 0, 1]])

mass = np.array([ 0.38668401,  0.44385111,  0.47756182,  0.74896529,  0.20424403,
    0.21828435,  0.98937523,  0.08736635,  0.24790248,  0.67759276])

# Convert each row entry in xyz into equivalent decimal numbers
# (1-D weights keep dec_num 1-D, so the labels from np.unique stay 1-D for np.bincount)
dec_num = np.dot(xyz, 2**np.arange(xyz.shape[1]))

# Get indices of the first occurrences of the unique values and also label each row
_, unq_idx,row_labels = np.unique(dec_num, return_index=True, return_inverse=True)

# Find unique rows from xyz array
xyz_unique = xyz[unq_idx,:]

# Accumulate the summations from mass based on the row labels
mass_unique = np.bincount(row_labels, weights=mass)

Output -

In [148]: xyz_unique
Out[148]: 
array([[0, 0, 0],
       [0, 1, 0],
       [1, 1, 0],
       [0, 0, 1],
       [1, 0, 1],
       [0, 1, 1]])

In [149]: mass_unique
Out[149]: 
array([ 0.74896529,  0.20424403,  0.66213546,  0.76495911,  1.62396172,
        0.47756182])
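To make the two building blocks above concrete, here is a minimal sketch on made-up toy data (the arrays `rows`, `labels`, and `values` are hypothetical and not from the question). It shows how a dot product with powers of two turns a binary row into a single integer code, and how np.bincount with `weights` sums values per label.

```python
import numpy as np

# Hypothetical toy rows containing only 0s and 1s
rows = np.array([[1, 0, 1],
                 [0, 1, 1]])

# Weights are [1, 2, 4], so [1, 0, 1] -> 1 + 4 = 5 and [0, 1, 1] -> 2 + 4 = 6
codes = rows.dot(2 ** np.arange(rows.shape[1]))

# Hypothetical labels (group ids) and the values to accumulate per group
labels = np.array([0, 2, 1, 0, 2])
values = np.array([1.0, 2.0, 3.0, 4.0, 5.0])

# Label 0 gets 1 + 4, label 1 gets 3, label 2 gets 2 + 5
sums = np.bincount(labels, weights=values)

print(codes)  # [5 6]
print(sums)   # [5. 3. 7.]
```

Note that np.bincount expects non-negative integer labels in a 1-D array, which is why the rows are first reduced to integer labels.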

Case 2: Generic

For the general case, you can use this -

import numpy as np

# Perform lex sort and get the sorted indices
sorted_idx = np.lexsort(xyz.T)
sorted_xyz =  xyz[sorted_idx,:]

# Differentiation along rows for sorted array
df1 = np.diff(sorted_xyz,axis=0)
df2 = np.append([True],np.any(df1!=0,1),0)

# Get unique sorted labels
sorted_labels = df2.cumsum(0)-1

# Get labels
labels = np.zeros_like(sorted_idx)
labels[sorted_idx] = sorted_labels

# Get unique indices
unq_idx  = sorted_idx[df2]

# Get unique xyz's and the mass counts using accumulation with bincount
xyz_unique = xyz[unq_idx,:]
mass_unique = np.bincount(labels, weights=mass)

Sample run -

In [238]: xyz
Out[238]: 
array([[1, 2, 1],
       [1, 2, 1],
       [0, 1, 0],
       [1, 0, 1],
       [2, 1, 2],
       [2, 1, 1],
       [0, 1, 0],
       [1, 0, 0],
       [2, 1, 0],
       [2, 0, 1]])

In [239]: mass
Out[239]: 
array([ 0.5126308 ,  0.69075674,  0.02749734,  0.384824  ,  0.65151772,
        0.77718427,  0.18839268,  0.78364902,  0.15962722,  0.09906355])

In [240]: xyz_unique
Out[240]: 
array([[1, 0, 0],
       [0, 1, 0],
       [2, 1, 0],
       [1, 0, 1],
       [2, 0, 1],
       [2, 1, 1],
       [1, 2, 1],
       [2, 1, 2]])

In [241]: mass_unique
Out[241]: 
array([ 0.78364902,  0.21589002,  0.15962722,  0.384824  ,  0.09906355,
        0.77718427,  1.20338754,  0.65151772])
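As a possible alternative to the lexsort approach, and not part of the original answer, newer NumPy versions (1.13 and later) let np.unique work on whole rows via `axis=0`. The sketch below assumes that is available; the helper name `sum_mass_by_unique_rows` and the float sample data are made up for illustration. It compares rows for exact equality, so float coordinates that differ only by rounding noise would stay separate unless you round them first.

```python
import numpy as np

def sum_mass_by_unique_rows(xyz, mass):
    """Hypothetical helper: unique rows of xyz plus summed mass per row."""
    xyz = np.asarray(xyz)
    mass = np.asarray(mass, dtype=float)
    # inverse[i] is the index into xyz_unique of the row equal to xyz[i]
    xyz_unique, inverse = np.unique(xyz, axis=0, return_inverse=True)
    # ravel() guards against NumPy versions that return a 2-D inverse
    mass_unique = np.bincount(inverse.ravel(), weights=mass)
    return xyz_unique, mass_unique

# Made-up float coordinates with exact duplicates
xyz = np.array([[0.5, 1.0, 2.0],
                [0.0, 0.0, 1.5],
                [0.5, 1.0, 2.0],
                [0.0, 0.0, 1.5],
                [3.0, 0.25, 0.0]])
mass = np.array([1.0, 2.0, 3.0, 4.0, 5.0])

xyz_unique, mass_unique = sum_mass_by_unique_rows(xyz, mass)
print(xyz_unique)   # rows sorted with the first column as the primary key
print(mass_unique)  # [6. 4. 5.]
```

The unique rows come back sorted by the first column, whereas np.lexsort(xyz.T) uses the last column as the primary key, so the two approaches can return the same groups in a different order.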
