Question

Let's say I have a numpy array

x = np.array([[2, 5],
              [3, 4],
              [1, 3],
              [2, 5],
              [4, 5],
              [1, 3],
              [1, 4],
              [3, 4]])

I want to get an array from it that contains only the rows that are NOT duplicated, i.e., from this example I expect

array([[4, 5],
       [1, 4]])

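I'm looking for a method that is reasonably fast and scales well. The only way I can think of is:

First find the set of unique rows in x, as a new array y.
Create a new array z in which one occurrence of each row of y has been removed from x, so that z is the list of duplicated rows of x.
Take the set difference between x and z.

This seems terribly inefficient, though. Does anyone have a better method?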

If it matters, each of my rows is guaranteed to be sorted from smallest to largest, so a row such as [5, 2] will never occur.

Answer 1

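Approach #1

Here is an approach based on np.unique that treats each row as an indexing tuple for efficiency (assuming the input array holds non-negative integers, as np.ravel_multi_index requires):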

# Consider each row as indexing tuple & get linear indexing value             
lid = np.ravel_multi_index(x.T,x.max(0)+1)

# Get counts and unique indices
_,idx,count = np.unique(lid,return_index=True,return_counts=True)

# See which counts are exactly 1 and select the corresponding unique indices 
# and thus the corresponding rows from input as the final output
out = x[idx[count==1]]
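For comparison, here is a minimal sketch of the same counting idea without the linear index. It assumes NumPy 1.13 or later, where np.unique accepts axis=0 and compares whole rows. The function name non_repeated_rows is made up for this illustration. Rows come back in sorted order, not in their original order.

```python
import numpy as np

def non_repeated_rows(x):
    """Return the rows of a 2-D array that occur exactly once (hypothetical helper)."""
    # axis=0 makes np.unique compare whole rows (assumes NumPy >= 1.13)
    rows, counts = np.unique(x, axis=0, return_counts=True)
    # Keep only the rows that were seen once; they are in sorted row order
    return rows[counts == 1]

x = np.array([[2, 5], [3, 4], [1, 3], [2, 5],
              [4, 5], [1, 3], [1, 4], [3, 4]])
print(non_repeated_rows(x))
# [[1 4]
#  [4 5]]
```

Unlike the linear-index version, this variant is not limited to non-negative integers, though it may be slower on large inputs.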

Note: if the input array has a large number of columns, you may want to compute the linear index lid manually, which you can do with np.cumprod, like so:

lid = x.dot(np.append(1,(x.max(0)+1)[::-1][:-1].cumprod())[::-1])
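As a quick sanity check, the sketch below runs the manual formula on the question's array and compares it with np.ravel_multi_index. The variable names dims, weights, lid_manual and lid_builtin are introduced here only for illustration.

```python
import numpy as np

x = np.array([[2, 5], [3, 4], [1, 3], [2, 5],
              [4, 5], [1, 3], [1, 4], [3, 4]])

dims = x.max(0) + 1   # extent of each column: [5, 6]
# Row-major weights: product of the extents of all columns to the right
weights = np.append(1, dims[::-1][:-1].cumprod())[::-1]   # [6, 1]
lid_manual = x.dot(weights)

lid_builtin = np.ravel_multi_index(x.T, dims)
print(lid_manual)                               # [17 22  9 17 29  9 10 22]
print(np.array_equal(lid_manual, lid_builtin))  # True
```

Equal rows get equal linear indices, so the duplicate search reduces to a 1-D problem.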

Approach #2

Here is an alternative that offloads the counting to np.bincount, which may be more efficient for this purpose:

# Consider each row as indexing tuple & get linear indexing value             
lid = np.ravel_multi_index(x.T,x.max(0)+1)

# Get unique indices and tagged indices for all elements
_,unq_idx,tag_idx = np.unique(lid,return_index=True,return_inverse=True)

# Use the tagged indices to count and look for count==1 and repeat like before
out = x[unq_idx[np.bincount(tag_idx)==1]]

Approach #3

Here is a different approach that uses convolution to catch the pattern. The inline comments should help explain the underlying idea:

# Consider each row as indexing tuple & get linear indexing value             
lid = np.ravel_multi_index(x.T,x.max(0)+1)

# Store sorted indices for lid
sidx = lid.argsort()

# Append 1s at either ends of sorted and differentiated version of lid
mask = np.hstack((True,np.diff(lid[sidx])!=0,True))

# Perform convolution on it. Thus non duplicate elements would have
# consecutive two True elements, which could be caught with convolution
# kernel of [1,1]. Get the corresponding mask. 
# Index into sorted indices with it for final output
out = x[sidx[(np.convolve(mask,[1,1])>1)[1:-1]]]
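To make the mask logic easier to follow, here is a small trace on the linear indices of the question's array. It is a sketch that replaces the convolution with an equivalent pair of slices (mask[:-1] & mask[1:]); the name is_single is introduced only for this illustration.

```python
import numpy as np

# Linear indices of the question's eight rows (x[:, 0] * 6 + x[:, 1])
lid = np.array([17, 22, 9, 17, 29, 9, 10, 22])

sidx = lid.argsort()
sorted_lid = lid[sidx]          # [ 9  9 10 17 17 22 22 29]

# True where a value differs from its left neighbour, padded with True at both ends
mask = np.hstack((True, np.diff(sorted_lid) != 0, True))

# A value is unique when it differs from both neighbours, i.e. when the
# mask entries on its left and on its right are both True
is_single = mask[:-1] & mask[1:]

print(sidx[is_single])          # [6 4] -> rows [1, 4] and [4, 5] of x
```

The slice form and the [1,1] convolution followed by > 1 select the same positions; the convolution simply sums each adjacent pair of mask entries.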

Answer 2

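Here is a pandas approach:

import pandas as pd

# as_matrix() was removed in pandas 1.0; to_numpy() (pandas >= 0.24) replaces it
pd.DataFrame(x.T).T.drop_duplicates(keep=False).to_numpy()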

#array([[4, 5],
#       [1, 4]])
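A related sketch, assuming a reasonably recent pandas: DataFrame.duplicated(keep=False) returns a boolean flag per row, which can be used to index the original NumPy array directly, so the result keeps the dtype and row order of x. The names dup and out are illustrative only.

```python
import numpy as np
import pandas as pd

x = np.array([[2, 5], [3, 4], [1, 3], [2, 5],
              [4, 5], [1, 3], [1, 4], [3, 4]])

# keep=False flags every member of a duplicated group, not only the repeats
dup = pd.DataFrame(x).duplicated(keep=False).to_numpy()

# Invert the flags to select the rows that appear exactly once
out = x[~dup]
print(out)
# [[4 5]
#  [1 4]]
```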

Answer 3

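One possibility (which needs a lot of memory for arrays with many rows) is to first build a matrix that counts, for every pair of rows, how many elements are equal: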

b = np.sum(x[:, None, :] == x, axis=2)
b
array([[2, 0, 0, 2, 1, 0, 0, 0],
       [0, 2, 0, 0, 0, 0, 1, 2],
       [0, 0, 2, 0, 0, 2, 1, 0],
       [2, 0, 0, 2, 1, 0, 0, 0],
       [1, 0, 0, 1, 2, 0, 0, 0],
       [0, 0, 2, 0, 0, 2, 1, 0],
       [0, 1, 1, 0, 0, 1, 2, 1],
       [0, 2, 0, 0, 0, 0, 1, 2]])

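This array shows how many elements each row has in common with every other row. The diagonal compares each row with itself, so it needs to be set to zero: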

np.fill_diagonal(b, 0)
b
array([[0, 0, 0, 2, 1, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 1, 2],
       [0, 0, 0, 0, 0, 2, 1, 0],
       [2, 0, 0, 0, 1, 0, 0, 0],
       [1, 0, 0, 1, 0, 0, 0, 0],
       [0, 0, 2, 0, 0, 0, 1, 0],
       [0, 1, 1, 0, 0, 1, 0, 1],
       [0, 2, 0, 0, 0, 0, 1, 0]])

Now let's look at the maximum along each column (b is symmetric, so this equals the maximum of each row):

c = np.max(b, axis=0)
c
array([2, 2, 2, 2, 1, 2, 1, 2])

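Then we find the positions where this maximum is != 2 (2 being the number of columns), i.e. the rows that never fully match another row, and index the original array with them:

x[c != 2]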
array([[4, 5],
       [1, 4]])
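The steps above can be condensed into one function. This is a sketch under the same O(n^2) memory caveat; it uses .all(axis=2) in place of counting equal elements, so it does not depend on the number of columns. The name rows_without_twin is made up for this illustration.

```python
import numpy as np

def rows_without_twin(x):
    """Return the rows of x that equal no other row (needs O(n^2) memory)."""
    # eq[i, j] is True when row i and row j agree in every column
    eq = (x[:, None, :] == x).all(axis=2)
    # Every row matches itself, so a total of 1 means there is no other match
    return x[eq.sum(axis=1) == 1]

x = np.array([[2, 5], [3, 4], [1, 3], [2, 5],
              [4, 5], [1, 3], [1, 4], [3, 4]])
print(rows_without_twin(x))
# [[4 5]
#  [1 4]]
```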

Answer 4

For completeness, see also item 78 in http://www.labri.fr/perso/nrougier/teaching/numpy.100/

Answer 5

The numpy_indexed package (disclaimer: I am its author) can be used to solve this kind of problem efficiently:

import numpy_indexed as npi
x[npi.multiplicity(x) == 1]

This solution is not only very readable but also very efficient, and it works with any number of columns or dtypes.

Get non-duplicate rows from numpy array