Question

I have a NumPy array in Python:

import numpy as np

leafs = np.array([[1,2,3],[1,2,4],[2,3,4],[4,2,1]])

For every pair of rows, I want to count the number of positions at which the two rows have the same element.

In this case, I would get a 4x4 proximity matrix:

proximity = array([[3, 2, 0, 1],
                   [2, 3, 1, 1],
                   [0, 1, 3, 0],
                   [1, 1, 0, 3]])

This is the code I am currently using:

proximity = []

n = leafs.shape[0]  # number of rows
for i in range(n):
    print(i)
    proximity.append(np.apply_along_axis(lambda x: sum(x == leafs[i, :]), axis=1,
                                         arr=leafs))

I need a faster solution.

Edit: the accepted solution does not work on this example:

>>> type(f.leafs)
<class 'numpy.ndarray'>
>>> f.leafs.shape
(7210, 1000)
>>> f.leafs.dtype
dtype('int64')

>>> f.leafs.reshape(7210, 1, 1000) == f.leafs.reshape(1, 7210, 1000)
False
>>> f.leafs
array([[ 19,  32,  16, ..., 143, 194, 157],
       [ 19,  32,  16, ..., 143, 194, 157],
       [ 19,  32,  16, ..., 143, 194, 157],
       ..., 
       [139,  32,  16, ...,   5, 194, 157],
       [170,  32,  16, ...,   5, 194, 157],
       [170,  32,  16, ...,   5, 194, 157]])
>>>

Answer 1

Here is one way to do it using broadcasting. Be warned: the temporary array eq has shape (nrows, nrows, ncols), so if nrows is 4000 and ncols is 1000, eq will require 16 GB of memory.

In [38]: leafs
Out[38]: 
array([[1, 2, 3],
       [1, 2, 4],
       [2, 3, 4],
       [4, 2, 1]])

In [39]: nrows, ncols = leafs.shape

In [40]: eq = leafs.reshape(nrows,1,ncols) == leafs.reshape(1,nrows,ncols)

In [41]: proximity = eq.sum(axis=-1)

In [42]: proximity
Out[42]: 
array([[3, 2, 0, 1],
       [2, 3, 1, 1],
       [0, 1, 3, 0],
       [1, 1, 0, 3]])
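The memory warning above can be checked with a quick estimate before running the comparison. This is only a sketch: eq_nbytes is a hypothetical helper name, and it assumes the boolean result uses one byte per element, which is how NumPy stores np.bool_. For the (7210, 1000) array from the question's edit, the temporary would need roughly 52 GB, which may be why the comparison there did not produce an array.

```python
import numpy as np

def eq_nbytes(nrows, ncols):
    """Bytes needed for the boolean temporary of shape (nrows, nrows, ncols)."""
    # np.bool_ occupies one byte per element
    return nrows * nrows * ncols * np.dtype(np.bool_).itemsize

print(eq_nbytes(4000, 1000) / 1e9)  # 16.0 (GB), the figure quoted above
print(eq_nbytes(7210, 1000) / 1e9)  # roughly 52 GB for the array in the question's edit
```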

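Also be aware that this is not an efficient solution: proximity is symmetric and its diagonal always equals ncols, but this solution computes the full array, so it does more than twice as much work as necessary.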
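One way to use the symmetry is to compare each row only with the rows after it, mirror the counts, and fill the diagonal with ncols. The following is a minimal sketch, not part of the original answer; proximity_symmetric is a hypothetical name, and the loop trades the large temporary for a Python-level loop over rows.

```python
import numpy as np

def proximity_symmetric(leafs):
    """Count equal positions per row pair, computing each pair only once."""
    n, ncols = leafs.shape
    proximity = np.zeros((n, n), dtype=np.intp)
    for i in range(n):
        # Compare row i only with the rows that come after it.
        counts = (leafs[i + 1:] == leafs[i]).sum(axis=1)
        proximity[i, i + 1:] = counts   # upper triangle
        proximity[i + 1:, i] = counts   # mirrored lower triangle
    # Every row matches itself in all ncols positions.
    np.fill_diagonal(proximity, ncols)
    return proximity

leafs = np.array([[1, 2, 3], [1, 2, 4], [2, 3, 4], [4, 2, 1]])
print(proximity_symmetric(leafs))
```

On the 4x3 example this should reproduce the proximity matrix shown in the question.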

Answer 2

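Warren Weckesser offered a very elegant solution using broadcasting. However, even a straightforward approach with a loop can have comparable performance. np.apply_along_axis is slow in your initial solution because it does not take advantage of vectorization. The following fixes that: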

def proximity_1(leafs):
    n = len(leafs)
    proximity = np.zeros((n,n))
    for i in range(n):
        proximity[i] = (leafs == leafs[i]).sum(1)  
    return proximity

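You could also use a list comprehension to make the above code more concise. The difference is that np.apply_along_axis loops over all the rows in a non-optimized manner, whereas leafs == leafs[i] takes advantage of NumPy's vectorized speed.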
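For reference, the list-comprehension variant mentioned above could look like the sketch below. proximity_listcomp is a hypothetical name; note that it returns an integer array, while proximity_1 fills a float array created by np.zeros.

```python
import numpy as np

def proximity_listcomp(leafs):
    """Same row-by-row comparison as proximity_1, written as a list comprehension."""
    # Each (leafs == row) is a vectorized comparison of one row against all rows.
    return np.array([(leafs == row).sum(axis=1) for row in leafs])

leafs = np.array([[1, 2, 3], [1, 2, 4], [2, 3, 4], [4, 2, 1]])
print(proximity_listcomp(leafs))
```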

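Warren Weckesser's solution truly shows NumPy's beauty. However, it carries the overhead of creating an intermediate 3-d array of size nrows*nrows*ncols. So if you have large data, the simple loop might be more efficient.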

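Here is an example. Below is the code offered by Warren Weckesser, wrapped in a function. (I don't know what the code copyright rules are here, so I assume this attribution is enough :))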

def proximity_2(leafs):
    nrows, ncols = leafs.shape    
    eq = leafs.reshape(nrows,1,ncols) == leafs.reshape(1,nrows,ncols)
    proximity = eq.sum(axis=-1)  
    return proximity

Now let's evaluate the performance on an array of random integers of size 10000 x 100.

leafs = np.random.randint(1,100,(10000,100))
time proximity_1(leafs)
>> 28.6 s
time proximity_2(leafs) 
>> 35.4 s

I ran both examples in an IPython environment on the same machine.
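A possible middle ground between the two functions is to broadcast over blocks of rows, so the temporary array has shape (chunk, nrows, ncols) instead of (nrows, nrows, ncols). This is a sketch under assumptions, not something I timed: proximity_chunked and the default chunk size of 256 are hypothetical choices, and the best chunk size will depend on available memory.

```python
import numpy as np

def proximity_chunked(leafs, chunk=256):
    """Broadcasted comparison done block by block to bound peak memory."""
    nrows, ncols = leafs.shape
    proximity = np.empty((nrows, nrows), dtype=np.intp)
    for start in range(0, nrows, chunk):
        block = leafs[start:start + chunk]            # at most `chunk` rows
        # Temporary has shape (len(block), nrows, ncols).
        eq = block[:, None, :] == leafs[None, :, :]
        proximity[start:start + chunk] = eq.sum(axis=-1)
    return proximity

leafs = np.array([[1, 2, 3], [1, 2, 4], [2, 3, 4], [4, 2, 1]])
print(proximity_chunked(leafs, chunk=3))
```

With chunk=1 this behaves like the row-by-row loop, and with chunk >= nrows it is the same as the full broadcasting version.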
