仅删除3D numpy阵列的该行中包含重复项的行

时间:2018-07-14 09:02:12

标签: python numpy multidimensional-array

我有一个像这样的3D numpy数组:

>>> a
array([[[0, 1, 2],
        [0, 1, 2],
        [6, 7, 8]],
       [[6, 7, 8],
        [0, 1, 2],
        [6, 7, 8]],
       [[0, 1, 2],
        [3, 4, 5],
        [6, 7, 8]]])

我只想删除那些自身包含重复项的行。例如,输出应如下所示:

>>> remove_row_duplicates(a)
array([[[0, 1, 2],
        [3, 4, 5],
        [6, 7, 8]]])

这是我正在使用的功能:

delindices = np.empty(0, dtype=int)

for i in range(len(a)):
    _, indices = np.unique(np.around(a[i], decimals=10), axis=0, return_index=True)

    if len(indices) < len(a[i]):

        delindices = np.append(delindices, i) 

a = np.delete(a, delindices, 0)

这很好用,但是问题是我的数组形状像(1000000,7,3)。 for循环在python中非常慢,这需要很多时间。我的原始数组还包含浮点数。有更好的解决方案或可以帮助我将这一功能向量化的人吗?

2 个答案:

答案 0 :(得分:2)

沿着每个2D block的行,即沿着axis=1对其进行排序,然后沿着连续的行寻找匹配的行,最后沿着相同的any寻找axis=1匹配-

b = np.sort(a,axis=1)
out = a[~((b[:,1:] == b[:,:-1]).all(-1)).any(1)]

示例运行说明

输入数组:

In [51]: a
Out[51]: 
array([[[0, 1, 2],
        [0, 1, 2],
        [6, 7, 8]],

       [[6, 7, 8],
        [0, 1, 2],
        [6, 7, 8]],

       [[0, 1, 2],
        [3, 4, 5],
        [6, 7, 8]]])

代码步骤:

# Sort along axis=1, i.e rows in each 2D block
In [52]: b = np.sort(a,axis=1)

In [53]: b
Out[53]: 
array([[[0, 1, 2],
        [0, 1, 2],
        [6, 7, 8]],

       [[0, 1, 2],
        [6, 7, 8],
        [6, 7, 8]],

       [[0, 1, 2],
        [3, 4, 5],
        [6, 7, 8]]])

In [54]: (b[:,1:] == b[:,:-1]).all(-1) # Look for successive matching rows
Out[54]: 
array([[ True, False],
       [False,  True],
       [False, False]])

# Look for matches along each row, which indicates presence
# of duplicate rows within each 2D block in original 2D array
In [55]: ((b[:,1:] == b[:,:-1]).all(-1)).any(1)
Out[55]: array([ True,  True, False])

# Invert those as we need to remove those cases
# Finally index with boolean indexing and get the output
In [57]: a[~((b[:,1:] == b[:,:-1]).all(-1)).any(1)]
Out[57]: 
array([[[0, 1, 2],
        [3, 4, 5],
        [6, 7, 8]]])

答案 1 :(得分:1)

您可能可以使用广播轻松地做到这一点,但是由于您要处理的二维数组以上,因此它不会像您期望的那样被优化,甚至在某些情况下非常慢。相反,您可以使用受Jaime's answer启发的以下方法:

In [28]: u = np.unique(arr.view(np.dtype((np.void, arr.dtype.itemsize*arr.shape[1])))).view(arr.dtype).reshape(-1, arr.shape[1])

In [29]: inds = np.where((arr == u).all(2).sum(0) == u.shape[1])

In [30]: arr[inds]
Out[30]: 
array([[[0, 1, 2],
        [3, 4, 5],
        [6, 7, 8]]])