Question

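I have a list of entries

l = [5, 3, 8, 12, 24]

and a matrix M:

12 34  5  8  7
 0 24 12  3  1

I want to find the indices in the matrix where the numbers in l appear. For the k-th entry of l, I want to save one random pair of indices i, j such that

M[i][j]==l[k]

I am currently doing this with a loop over the entries of l, and I would like to see if there is a way to avoid that loop.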

Answer 1

One way you should be able to speed up your code significantly is to avoid repeating work:

tmp = np.where(M == i)

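Since this gives you all the positions in M where the value equals i, it has to search the whole matrix. So for every element of l, you are scanning the complete matrix again.

Instead of doing that, try indexing your matrix as a first step: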

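matrix_index = {}
# len() returns an int, so iterate over range(len(...))
for i in range(len(M)):
    for j in range(len(M[i])):
        # Map each value to the list of (i, j) positions where it occurs
        if M[i][j] not in matrix_index:
            matrix_index[M[i][j]] = [(i, j)]
        else:
            matrix_index[M[i][j]].append((i, j))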

Then, for each value in l, instead of doing a costly search through the full matrix, you can get its positions directly from the matrix index.
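As a rough sketch of that lookup step, the snippet below builds the index once and then picks one random position per entry of l. The sample M and l are taken from the question; the use of np.ndenumerate, random.choice and the name pairs are illustrative choices, not part of the answer above, and the sketch assumes every value of l occurs in M.

```python
import random
import numpy as np

M = np.array([[12, 34, 5, 8, 7],
              [0, 24, 12, 3, 1]])
l = [5, 3, 8, 12, 24]

# Build the value -> positions index in a single pass over M.
matrix_index = {}
for (i, j), value in np.ndenumerate(M):
    matrix_index.setdefault(int(value), []).append((i, j))

# One dictionary lookup per entry of l, then a random pick among its positions.
# Assumption: every value of l occurs in M; a missing value raises KeyError.
pairs = [random.choice(matrix_index[value]) for value in l]
```

The matrix is scanned once, and each lookup afterwards costs a dictionary access instead of a full pass over M.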

Note: I haven't used numpy very much, so I may have gotten some of the specific syntax wrong. There may also be a more idiomatic way to do this in numpy.

Answer 2

A solution that does not use the word for is

c = np.apply_along_axis(lambda row: np.random.choice(np.argwhere(row).ravel()), 1, M.ravel()[np.newaxis, :] == l[:, np.newaxis])
indI, indJ = c // M.shape[1], c % M.shape[1]

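Note that while this solves the problem, M.ravel()[np.newaxis, :] == l[:, np.newaxis] will quickly produce MemoryErrors, because it builds a boolean array with len(l) * M.size elements. A more pragmatic approach is to get the indices of interest with something like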

s = np.argwhere(M.ravel()[np.newaxis, :] == l[:, np.newaxis])

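and then do the random-choice post-processing by hand. This, however, will probably not give you any significant performance gain over your search.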
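One possible way to do that post-processing by hand is sketched below, under the assumption that every entry of l occurs at least once in M. The helper names (rng, counts, starts) are illustrative, and the small M and l come from the question.

```python
import numpy as np

rng = np.random.default_rng()
M = np.array([[12, 34, 5, 8, 7],
              [0, 24, 12, 3, 1]])
l = np.array([5, 3, 8, 12, 24])

# Row k of the comparison marks where l[k] occurs in the flattened M.
s = np.argwhere(M.ravel()[np.newaxis, :] == l[:, np.newaxis])
ks, flat = s[:, 0], s[:, 1]  # argwhere returns its rows ordered by k

# Length and start offset of each k's run of matches inside s.
counts = np.bincount(ks, minlength=len(l))
starts = np.cumsum(counts) - counts

# Pick one random match per k (assumption: every count is > 0,
# otherwise rng.integers raises ValueError).
c = flat[starts + rng.integers(0, counts)]
indI, indJ = np.unravel_index(c, M.shape)
```

This still materialises the full comparison array, so it shares the memory problem described above.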

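What makes your version slow is that you search the entire matrix at every step of the loop. Pre-sorting the matrix (at some up-front cost) gives you a straightforward way to make each individual search much faster: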

In [312]: %paste
def direct_search(M, l):
    indI = []
    indJ = []
    for i in l:
        tmp = np.where(M == i)
        rd = np.random.randint(len(tmp[0]))  # Note the fix here
        indI.append(tmp[0][rd])
        indJ.append(tmp[1][rd])
    return indI, indJ

def using_presorted(M, l):
    a = np.argsort(M.ravel())
    M_sorted = M.ravel()[a]
    def find_indices(i):
        s = np.searchsorted(M_sorted, i)
        j = 0
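        # Bounds check so that matching the largest value does not run off the end
        while s + j < len(M_sorted) and M_sorted[s + j] == i: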
            yield a[s + j]
            j += 1
    indices = [list(find_indices(i)) for i in l]
    c = np.array([np.random.choice(i) for i in indices])
    return c // M.shape[1], c % M.shape[1]

## -- End pasted text --

In [313]: M = np.random.randint(0, 1000000, (1000, 1000))

In [314]: l = np.random.choice(M.ravel(), 1000)

In [315]: %timeit direct_search(M, l)
1 loop, best of 3: 4.76 s per loop

In [316]: %timeit using_presorted(M, l)
1 loop, best of 3: 208 ms per loop

In [317]: indI, indJ = using_presorted(M, l)  # Let us check that it actually works

In [318]: np.all(M[indI, indJ] == l)
Out[318]: True
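The same pre-sorting idea can also be written without the Python-level while loop, by asking np.searchsorted for both ends of each run of equal values. This is only a sketch of a variant, not the code that was timed above; the function name using_presorted_vectorized and the rng parameter are made up for illustration.

```python
import numpy as np

def using_presorted_vectorized(M, l, rng=None):
    """Return one random (row, col) per entry of l, using a single sort of M."""
    rng = np.random.default_rng() if rng is None else rng
    flat = M.ravel()
    order = np.argsort(flat)
    flat_sorted = flat[order]
    # [lo, hi) is the run of positions in flat_sorted equal to each l[k].
    lo = np.searchsorted(flat_sorted, l, side="left")
    hi = np.searchsorted(flat_sorted, l, side="right")
    if np.any(lo == hi):
        raise ValueError("some values of l do not occur in M")
    # One uniformly random position inside each run.
    pick = rng.integers(lo, hi)
    return np.unravel_index(order[pick], M.shape)

M = np.array([[12, 34, 5, 8, 7],
              [0, 24, 12, 3, 1]])
l = [5, 3, 8, 12, 24]
indI, indJ = using_presorted_vectorized(M, l)
```

No timing is claimed here; whether it beats using_presorted on your data is something to measure with %timeit.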

Answer 3

If neither l nor M is a large array, as in the following:

    In: l0 = [5, 3, 8, 12, 34, 1, 12]
    In: M0 = [[12, 34,  5,  8,  7],
    In:       [ 0, 24, 12,  3,  1]]

    In: l = np.asarray(l0)
    In: M = np.asarray(M0)

you can try this:

    In: np.where(l[None, None, :] == M[:, :, None])

    Out:
        (array([0, 0, 0, 0, 0, 1, 1, 1, 1]),   <- i
         array([0, 0, 1, 2, 3, 2, 2, 3, 4]),   <- j
         array([3, 6, 4, 0, 2, 3, 6, 1, 5]))   <- k

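The rows are i, j and k respectively; read down each column to get every (i, j, k) you need. For example, the first column [0, 0, 3] means M[0, 0] = l[3], the second column [0, 0, 6] means M[0, 0] = l[6], and so on. I think these are what you want.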
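To go from all matching (i, j, k) triples to one random (i, j) per k, one option is to shuffle the triples and keep the first occurrence of each k. The sketch below assumes that approach; rng, perm, uniq_k and the *_sel names are illustrative and not part of the answer.

```python
import numpy as np

rng = np.random.default_rng()
l = np.asarray([5, 3, 8, 12, 34, 1, 12])
M = np.asarray([[12, 34, 5, 8, 7],
                [0, 24, 12, 3, 1]])

i, j, k = np.where(l[None, None, :] == M[:, :, None])

# Shuffle the triples; np.unique with return_index=True then reports the
# first occurrence of each k, which is a uniformly random one of its matches.
perm = rng.permutation(len(k))
uniq_k, first = np.unique(k[perm], return_index=True)
chosen = perm[first]
i_sel, j_sel = i[chosen], j[chosen]
```

Entries of l that never occur in M simply do not show up in uniq_k, so M[i_sel, j_sel] lines up with l[uniq_k] rather than with l itself.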

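However, these numpy tricks do not scale to very large arrays, such as 2M elements in l or 2500x2500 elements in M. They need a great deal of memory and a very long time to compute... if you are lucky enough that they don't crash from running out of memory. :)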
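A back-of-the-envelope estimate for those sizes, assuming numpy's boolean dtype at one byte per element (the variable names are illustrative):

```python
n_l = 2_000_000           # elements in l
rows, cols = 2500, 2500   # shape of M

# The broadcast comparison creates one boolean per (i, j, k) combination.
n_bytes = n_l * rows * cols
print(n_bytes / 1e12)  # 12.5 (terabytes)
```

That is far beyond the RAM of an ordinary machine, which is why the pre-indexing and pre-sorting approaches are the ones to use at that scale.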

Python: how to avoid the loop?
