Check for common elements of two 2D numpy arrays, either row-wise or column-wise

Time: 2016-12-20 02:55:01

Tags: python arrays performance numpy

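Given two numpy arrays of shape nx3 and mx3, what is an efficient way to determine the row indices (counter) at which the rows are common to both arrays? For example, I have the following solution, which is noticeably slow even for arrays that are not much larger: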

import numpy as np

def arrangment(arr1,arr2):
    hits = []
    for i in range(arr2.shape[0]):
        current_row = np.repeat(arr2[i,:][None,:],arr1.shape[0],axis=0)
        x = current_row - arr1
        for j in range(arr1.shape[0]):
            if np.isclose(x[j,0],0.0) and np.isclose(x[j,1],0.0) and np.isclose(x[j,2],0.0):
                hits.append(j)

    return hits

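It checks whether each row of arr2 exists in arr1 and returns the arr1 row indices where the rows match. I need the result to always follow the row order of arr2. For example, given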

arr1 = np.array([[-1., -1., -1.],
       [ 1., -1., -1.],
       [ 1.,  1., -1.],
       [-1.,  1., -1.],
       [-1., -1.,  1.],
       [ 1., -1.,  1.],
       [ 1.,  1.,  1.],
       [-1.,  1.,  1.]])
arr2 = np.array([[-1.,  1., -1.],
       [ 1.,  1., -1.],
       [ 1.,  1.,  1.],
       [-1.,  1.,  1.]])

the function should return:

[3, 2, 6, 7]

2 answers:

Answer 0: (score: 3)

Quick and dirty answer

(arr1[:, None] == arr2).all(-1).argmax(0)

array([3, 2, 6, 7])

Better answer

This accounts for the possibility that a row in arr2 does not match anything in arr1:

t = (arr1[:, None] == arr2).all(-1)
np.where(t.any(0), t.argmax(0), np.nan)

array([ 3.,  2.,  6.,  7.])

As @Divakar pointed out, use np.isclose to account for rounding errors when comparing floats:

t = np.isclose(arr1[:, None], arr2).all(-1)
np.where(t.any(0), t.argmax(0), np.nan)
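
To make the broadcasting step concrete, here is a minimal sketch of the same idea. The function name match_rows and the -1 sentinel are assumptions for illustration and not part of the answer above; using -1 instead of np.nan keeps the result an integer array.

```python
import numpy as np

def match_rows(arr1, arr2):
    """For each row of arr2, return the index of the first matching row
    in arr1, or -1 if there is none (hypothetical helper)."""
    # arr1[:, None] has shape (n, 1, 3); broadcasting it against arr2 (m, 3)
    # yields an (n, m, 3) array of element-wise closeness checks.
    t = np.isclose(arr1[:, None], arr2).all(-1)  # shape (n, m)
    # argmax(0) finds the first True in each column; any(0) masks columns
    # that contain no match at all.
    return np.where(t.any(0), t.argmax(0), -1)

arr1 = np.array([[-1., -1., -1.],
                 [ 1., -1., -1.],
                 [ 1.,  1., -1.],
                 [-1.,  1., -1.],
                 [-1., -1.,  1.],
                 [ 1., -1.,  1.],
                 [ 1.,  1.,  1.],
                 [-1.,  1.,  1.]])
arr2 = np.array([[-1.,  1., -1.],
                 [ 1.,  1., -1.],
                 [ 5.,  5.,  5.],   # deliberately absent from arr1
                 [-1.,  1.,  1.]])

result = match_rows(arr1, arr2)
print(result)
```

Note that the intermediate array holds one entry per pair of rows and per column, so memory use grows with the product of the two row counts.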

Answer 1: (score: 0)

I had a similar problem in the past and came up with a fairly optimised solution for it.

First, you need a generalisation of numpy.unique for multi-dimensional arrays, which I copy here for the sake of completeness:

def unique2d(arr,consider_sort=False,return_index=False,return_inverse=False): 
    """Get unique values along an axis for 2D arrays.

        input:
            arr:
                2D array
            consider_sort:
                Does permutation of the values within the axis matter? 
                Two rows can contain the same values but with 
                different arrangements. If consider_sort 
                is True then those rows would be considered equal
            return_index:
                Similar to numpy unique
            return_inverse:
                Similar to numpy unique
        returns:
            2D array of unique rows
            If return_index is True also returns indices
            If return_inverse is True also returns the inverse array 
            """

    if consider_sort is True:
        a = np.sort(arr,axis=1)
    else:
        a = arr
    b = np.ascontiguousarray(a).view(np.dtype((np.void, 
            a.dtype.itemsize * a.shape[1])))

    if return_inverse is False:
        _, idx = np.unique(b, return_index=True)
    else:
        _, idx, inv = np.unique(b, return_index=True, return_inverse=True)

    if return_index == False and return_inverse == False:
        return arr[idx]
    elif return_index == True and return_inverse == False:
        return arr[idx], idx
    elif return_index == False and return_inverse == True:
        return arr[idx], inv
    else:
        return arr[idx], idx, inv

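Now all you need to do is concatenate (np.vstack) your arrays and find the unique rows. The inverse mapping, together with np.searchsorted, gives you the indices you need. So let's write another function, similar to numpy.in1d but for multi-dimensional (2D) arrays: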

def in2d_unsorted(arr1, arr2, axis=1, consider_sort=False):
    """Find the elements in arr1 which are also in 
       arr2 and sort them as they appear in arr2"""

    assert arr1.dtype == arr2.dtype

    if axis == 0:
        arr1 = np.copy(arr1.T,order='C')
        arr2 = np.copy(arr2.T,order='C')

    if consider_sort is True:
        sorter_arr1 = np.argsort(arr1)
        arr1 = arr1[np.arange(arr1.shape[0])[:,None],sorter_arr1]
        sorter_arr2 = np.argsort(arr2)
        arr2 = arr2[np.arange(arr2.shape[0])[:,None],sorter_arr2]


    arr = np.vstack((arr1,arr2))
    _, inv = unique2d(arr, return_inverse=True)

    size1 = arr1.shape[0]
    size2 = arr2.shape[0]

    arr3 = inv[:size1]
    arr4 = inv[-size2:]

    # Sort the indices as they appear in arr2
    sorter = np.argsort(arr3)
    idx = sorter[arr3.searchsorted(arr4, sorter=sorter)]

    return idx 

Now all you need to do is call in2d_unsorted with your input arrays:

>>> in2d_unsorted(arr1,arr2)
array([ 3,  2,  6,  7])
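
On NumPy 1.13 and later, np.unique accepts an axis argument, so the same stack, unique, searchsorted idea can be sketched without the void-view helper. This is a swapped-in variant, not the code from this answer: the name rows_in_order is an assumption, and the sketch presumes exact equality, unique rows in arr1, and that every row of arr2 occurs in arr1.

```python
import numpy as np

def rows_in_order(arr1, arr2):
    """Indices of arr1 rows, ordered as the rows appear in arr2
    (hypothetical helper; assumes every arr2 row is present in arr1)."""
    stacked = np.vstack((arr1, arr2))
    # Label every row with the index of its unique representative.
    _, inv = np.unique(stacked, axis=0, return_inverse=True)
    inv = inv.ravel()  # keep the labels 1-D regardless of NumPy version
    n = arr1.shape[0]
    labels1, labels2 = inv[:n], inv[n:]
    # Look up each arr2 label among the arr1 labels via a sorted search.
    sorter = np.argsort(labels1)
    pos = np.searchsorted(labels1, labels2, sorter=sorter)
    return sorter[pos]

arr1 = np.array([[-1., -1., -1.],
                 [ 1., -1., -1.],
                 [ 1.,  1., -1.],
                 [-1.,  1., -1.],
                 [-1., -1.,  1.],
                 [ 1., -1.,  1.],
                 [ 1.,  1.,  1.],
                 [-1.,  1.,  1.]])
arr2 = np.array([[-1.,  1., -1.],
                 [ 1.,  1., -1.],
                 [ 1.,  1.,  1.],
                 [-1.,  1.,  1.]])

idx = rows_in_order(arr1, arr2)
print(idx)  # [3 2 6 7]
```

Because the rows are compared for exact equality, this variant does not tolerate rounding differences the way np.isclose does.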

Although it may not be fully optimised, this approach is much faster. Let's benchmark it against @piRSquared's solution:

def indices_piR(arr1,arr2):
    t = np.isclose(arr1[:, None], arr2).all(-1)
    return np.where(t.any(0), t.argmax(0), np.nan)

with the following arrays:

n=150
arr1 = np.random.permutation(n).reshape(n//3, 3)
idx = np.random.permutation(n//3)
arr2 = arr1[idx]

In [13]: np.allclose(in2d_unsorted(arr1,arr2),indices_piR(arr1,arr2))
True

In [14]: %timeit indices_piR(arr1,arr2)
10000 loops, best of 3: 181 µs per loop
In [15]: %timeit in2d_unsorted(arr1,arr2)
10000 loops, best of 3: 85.7 µs per loop

Now, for n=1500

In [24]: %timeit indices_piR(arr1,arr2)
100 loops, best of 3: 10.3 ms per loop
In [25]: %timeit in2d_unsorted(arr1,arr2)
1000 loops, best of 3: 403 µs per loop

n=15000

In [28]: %timeit indices_piR(A,B)
1 loop, best of 3: 1.02 s per loop
In [29]: %timeit in2d_unsorted(arr1,arr2)
100 loops, best of 3: 4.65 ms per loop

So for largish arrays this is over 200X faster than @piRSquared's vectorised solution.
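
The timings above are consistent with how much work each approach does: the broadcast comparison touches every pair of rows, so its cost grows quadratically with the row count, whereas the unique-based approach relies on sorting. A rough sketch of that arithmetic follows; the helper name comparison_elements is an assumption for illustration.

```python
import numpy as np

def comparison_elements(n, n_cols=3):
    """Number of element-wise comparisons the broadcast approach performs
    when both arrays have n // n_cols rows (hypothetical helper)."""
    rows = n // n_cols
    return rows * rows * n_cols

# The three benchmark sizes used above.
for n in (150, 1500, 15000):
    print(n, comparison_elements(n))

# Sanity check against an actual broadcast comparison for the smallest case.
n = 150
arr1 = np.random.permutation(n).reshape(n // 3, 3)
arr2 = arr1[np.random.permutation(n // 3)]
broadcast_size = (arr1[:, None] == arr2).size
```

Each tenfold increase in n multiplies the number of comparisons by one hundred, which matches the roughly hundredfold jumps in the indices_piR timings.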