Find a set of indices that maps the rows of one NumPy ndarray to another

Time: 2016-02-05 22:23:24

Tags: python algorithm sorting numpy mapping

I have two structured 2D numpy arrays that are equal in principle, meaning

A = numpy.array([[a1,b1,c1],

B = numpy.array([[a2,b2,c2],


numpy.array_equal(A,B) # False
numpy.array_equiv(A,B) # False
numpy.equal(A,B) # ndarray of True and False


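What is an efficient way to sort/shuffle B so that it matches A, or alternatively to sort A so that it equals B? The equality check itself is not really important, as long as both arrays end up shuffled to match each other. A, and therefore B, has unique rows.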


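import numpy as np

def sort2d(A):
    # View each row as one opaque np.void element (needs C-contiguous data)
    A_view = np.ascontiguousarray(A).view(np.dtype((np.void,
             A.dtype.itemsize * A.shape[1])))
    A_view = np.sort(A_view, axis=0)  # the view has shape (n, 1): sort down the rows
    return A_view.view(A.dtype).reshape(-1,A.shape[1])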


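2 answers:

Answer 0 (score: 4)

From your example it seems that you have shuffled all of the columns simultaneously, so there is a vector of row indices that maps A→B. Here is a toy example: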

import numpy as np

A = np.random.permutation(12).reshape(4, 3)
idx = np.random.permutation(4)
B = A[idx]

print(repr(A))  # one possible result:
# array([[ 7, 11,  6],
#        [ 4, 10,  8],
#        [ 9,  2,  0],
#        [ 1,  3,  5]])

print(repr(B))
# array([[ 1,  3,  5],
#        [ 4, 10,  8],
#        [ 7, 11,  6],
#        [ 9,  2,  0]])

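We want to recover a set of indices, idx, such that A[idx] == B. This mapping is unique if and only if A and B contain no repeated rows.

One efficient* approach is to find the indices that would lexically sort the rows of A, and then find where each row of B would fall within the sorted version of A. A useful trick is to view A and B as 1D arrays using an np.void dtype that treats each row as a single element: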
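To make the uniqueness condition concrete, here is a minimal sketch that is not part of the original answer. The helper name `has_unique_rows` is hypothetical, and the code assumes NumPy 1.13 or later, where `np.unique` accepts an `axis` argument.

```python
import numpy as np

def has_unique_rows(arr):
    """Return True if no row of the 2D array `arr` is repeated (hypothetical helper)."""
    arr = np.asarray(arr)
    # np.unique(..., axis=0) collapses identical rows, so the row count
    # drops whenever at least one row is duplicated.
    return np.unique(arr, axis=0).shape[0] == arr.shape[0]

A_dup = np.array([[1, 2], [1, 2], [3, 4]])
# With a repeated row, two different index vectors reproduce the same array,
# so the row mapping is no longer unique.
same_result = np.array_equal(A_dup[[0, 1, 2]], A_dup[[1, 0, 2]])
```

If `has_unique_rows` returns False for A, any idx that is found is only one of several valid answers.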

# use integer division: a float byte count is rejected by np.dtype on Python 3
rowtype = np.dtype((np.void, A.dtype.itemsize * A.size // A.shape[0]))
# A and B must be C-contiguous, might need to force a copy here
a = np.ascontiguousarray(A).view(rowtype).ravel()
b = np.ascontiguousarray(B).view(rowtype).ravel()

a_to_as = np.argsort(a)     # indices that sort the rows of A in lexical order

Now we can use np.searchsorted to perform a binary search for where each row of B falls within the sorted version of A:

# using the `sorter=` argument rather than `a[a_to_as]` avoids making a copy of `a`
as_to_b = a.searchsorted(b, sorter=a_to_as)

The mapping A→B can be expressed as the composite A→As→B, where As is the sorted version of A:

a_to_b = a_to_as.take(as_to_b)
print(np.all(A[a_to_b] == B))
# True
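The same sort-then-match idea can also be sketched without the np.void view, by lexically sorting both arrays with `np.lexsort` and pairing up the sorted positions. This is an alternative illustration rather than the method above; the name `find_row_mapping_lexsort` is hypothetical, and the sketch assumes 2D inputs in which B really is a row permutation of A.

```python
import numpy as np

def find_row_mapping_lexsort(A, B):
    """Return idx such that A[idx] == B, for 2D arrays (hypothetical sketch)."""
    A = np.asarray(A)
    B = np.asarray(B)
    if A.shape != B.shape:
        raise ValueError('A and B must have the same shape')

    # np.lexsort uses the LAST key as the primary one, so pass the columns
    # in reverse order to sort rows by column 0, then column 1, and so on.
    sort_a = np.lexsort(A.T[::-1])
    sort_b = np.lexsort(B.T[::-1])
    if not np.array_equal(A[sort_a], B[sort_b]):
        raise ValueError('B is not a row permutation of A')

    # The k-th sorted row of B equals the k-th sorted row of A, so row
    # sort_b[k] of B must come from row sort_a[k] of A.
    idx = np.empty(A.shape[0], dtype=np.intp)
    idx[sort_b] = sort_a
    return idx

A_demo = np.array([[7, 11, 6], [4, 10, 8], [9, 2, 0], [1, 3, 5]])
B_demo = A_demo[[3, 1, 0, 2]]
idx_demo = find_row_mapping_lexsort(A_demo, B_demo)
```

This version performs two sorts instead of one sort plus a binary search, so it is likely to be somewhat slower, but it compares values rather than raw bytes.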

If A and B contain no repeated rows, the inverse mapping B→A can also be obtained using

b_to_a = np.argsort(a_to_b)
print(np.all(B[b_to_a] == A))
# True
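The reason np.argsort yields the inverse is that a_to_b is a permutation of 0..n-1 when there are no repeated rows. A small sketch with a hand-written permutation shows this, alongside an equivalent scatter assignment; `invert_permutation` is a hypothetical helper name.

```python
import numpy as np

def invert_permutation(perm):
    """Return inv such that inv[perm[i]] == i (hypothetical helper)."""
    perm = np.asarray(perm)
    inv = np.empty_like(perm)
    # Scatter: position perm[i] of the inverse holds i.
    inv[perm] = np.arange(perm.size)
    return inv

perm = np.array([2, 0, 3, 1])
inv_scatter = invert_permutation(perm)
inv_argsort = np.argsort(perm)  # sorting a permutation recovers its inverse
```

The scatter form runs in linear time, whereas np.argsort sorts; both give the same result for a valid permutation.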


def find_row_mapping(A, B):
    """
    Given A and B, where B is a copy of A permuted over the first dimension, find
    a set of indices idx such that A[idx] == B.
    This is a unique mapping if and only if there are no repeated rows in A and B.

    Arguments:
        A, B:   n-dimensional arrays with same shape and dtype
    Returns:
        idx:    vector of indices into the rows of A
    """

    if not (A.shape == B.shape):
        raise ValueError('A and B must have the same shape')
    if not (A.dtype == B.dtype):
        raise TypeError('A and B must have the same dtype')

    # integer division keeps the byte count an int on Python 3
    rowtype = np.dtype((np.void, A.dtype.itemsize * A.size // A.shape[0]))
    a = np.ascontiguousarray(A).view(rowtype).ravel()
    b = np.ascontiguousarray(B).view(rowtype).ravel()
    a_to_as = np.argsort(a)
    as_to_b = a.searchsorted(b, sorter=a_to_as)

    return a_to_as.take(as_to_b)


In [1]: gen = np.random.RandomState(0)
In [2]: %%timeit A = gen.rand(1000000, 100); B = A.copy(); gen.shuffle(B)
....: find_row_mapping(A, B)
1 loop, best of 3: 2.76 s per loop

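*The most expensive step is the quicksort, which is O(n log n) on average. I am not sure whether it is possible to do any better than that.

Answer 1 (score: 1)

Since either array could be shuffled to match the other, nothing stops us from rearranging both. Using Jaime's Answer, we can vstack the two arrays and find the unique rows. The inverse indices returned by unique are then essentially the desired mapping (since the arrays contain no duplicate rows).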
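As a hedged aside on the complexity question, a hash table keyed on the raw bytes of each row gives expected O(n) lookups. The sketch below is not from the original answer, the name `find_row_mapping_hashed` is hypothetical, and the Python-level loops may well make it slower in practice than the vectorized sort despite the better asymptotic bound.

```python
import numpy as np

def find_row_mapping_hashed(A, B):
    """Return idx such that A[idx] == B using a dict lookup (hypothetical sketch)."""
    A = np.ascontiguousarray(A)
    B = np.ascontiguousarray(B)
    if A.shape != B.shape or A.dtype != B.dtype:
        raise ValueError('A and B must have the same shape and dtype')

    # Map the raw bytes of every row of A to its row index. Rows are compared
    # byte-for-byte, so 0.0 and -0.0 count as different rows here.
    position = {row.tobytes(): i for i, row in enumerate(A)}
    # A KeyError here means some row of B does not occur in A.
    return np.array([position[row.tobytes()] for row in B], dtype=np.intp)

A_demo = np.array([[7, 11, 6], [4, 10, 8], [9, 2, 0], [1, 3, 5]])
B_demo = A_demo[[3, 1, 0, 2]]
idx_demo = find_row_mapping_hashed(A_demo, B_demo)
```

If A contains repeated rows, the dict keeps the last index seen for each row, so the result is still valid but not unique.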


def unique2d(arr,consider_sort=False,return_index=False,return_inverse=False): 
    """Get unique values along an axis for 2D arrays.

        input:
            arr:
                2D array
            consider_sort:
                Does permutation of the values within the axis matter? 
                Two rows can contain the same values but with 
                different arrangements. If consider_sort 
                is True then those rows would be considered equal
            return_index:
                Similar to numpy unique
            return_inverse:
                Similar to numpy unique
        returns:
            2D array of unique rows
            If return_index is True also returns indices
            If return_inverse is True also returns the inverse array 
    """

    if consider_sort is True:
        a = np.sort(arr,axis=1)
    else:
        a = arr
    # view each row as one np.void element; ravel so that the inverse
    # returned by np.unique is 1D on every NumPy version
    b = np.ascontiguousarray(a).view(np.dtype((np.void, 
            a.dtype.itemsize * a.shape[1]))).ravel()

    if return_inverse is False:
        _, idx = np.unique(b, return_index=True)
    else:
        _, idx, inv = np.unique(b, return_index=True, return_inverse=True)

    if return_index == False and return_inverse == False:
        return arr[idx]
    elif return_index == True and return_inverse == False:
        return arr[idx], idx
    elif return_index == False and return_inverse == True:
        return arr[idx], inv
    else:
        return arr[idx], idx, inv


def row_mapper(a,b,consider_sort=False):
    """Given two 2D numpy arrays returns mappers idx_a and idx_b 
        such that a[idx_a] == b[idx_b] """

    assert a.dtype == b.dtype
    assert a.shape == b.shape

    c = np.concatenate((a,b))
    _, inv = unique2d(c, consider_sort=consider_sort, return_inverse=True)
    mapper_a = inv[:b.shape[0]]
    mapper_b = inv[b.shape[0]:]

    return np.argsort(mapper_a), np.argsort(mapper_b) 


n = 100000
A = np.arange(n).reshape(n//4,4)
B = A[::-1,:]

idx_a, idx_b  = row_mapper(A,B)
print(np.all(A[idx_a]==B[idx_b]))
# True
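On NumPy 1.13 or later, the stack-and-unique idea can be sketched more directly with `np.unique(..., axis=0)`, which removes the need for the np.void view. This is an illustrative variant rather than the code benchmarked in this answer; the name `row_mapper_unique` is hypothetical, and unique rows within each array are assumed.

```python
import numpy as np

def row_mapper_unique(a, b):
    """Return idx_a, idx_b such that a[idx_a] == b[idx_b] (hypothetical sketch)."""
    a = np.asarray(a)
    b = np.asarray(b)
    if a.shape != b.shape or a.dtype != b.dtype:
        raise ValueError('a and b must have the same shape and dtype')

    stacked = np.concatenate((a, b))
    # inv[i] is the position of row i of `stacked` among the sorted unique rows.
    _, inv = np.unique(stacked, axis=0, return_inverse=True)
    inv = inv.reshape(-1)  # guard against versions that return a 2D inverse

    n = a.shape[0]
    # With unique rows, each half of inv is a permutation of 0..n-1, and
    # argsort orders the rows of a and of b in the same (sorted) order.
    return np.argsort(inv[:n]), np.argsort(inv[n:])

A_demo = np.arange(20).reshape(5, 4)
B_demo = A_demo[::-1, :]
idx_a_demo, idx_b_demo = row_mapper_unique(A_demo, B_demo)
```

Both arrays end up in lexically sorted row order, which matches the question's requirement that the two only need to agree with each other.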

Benchmark against @ali_m's solution:

%timeit find_row_mapping(A,B) # ali_m's solution
%timeit row_mapper(A,B) # current solution

# n = 100
100000 loops, best of 3: 12.2 µs per loop
10000 loops, best of 3: 47.3 µs per loop

# n = 1000
10000 loops, best of 3: 49.1 µs per loop
10000 loops, best of 3: 148 µs per loop

# n = 10000
1000 loops, best of 3: 548 µs per loop
1000 loops, best of 3: 1.6 ms per loop

# n = 100000
100 loops, best of 3: 6.96 ms per loop
100 loops, best of 3: 19.3 ms per loop

# n = 1000000
10 loops, best of 3: 160 ms per loop
1 loops, best of 3: 372 ms per loop

# n = 10000000
1 loops, best of 3: 2.54 s per loop
1 loops, best of 3: 5.92 s per loop
