Question

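I have the following code that I would like to optimize with numpy, preferably by removing the loop. I can't see how to approach it, so any suggestions would be helpful.

indices is an (N, 2) numpy array of integers, and N can be a few million. The code finds the repeated indices in the first column. For each of those indices, I build all pairwise combinations of the corresponding values in the second column, and then collect them together with the index from the first column.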

from itertools import combinations
import numpy as np

index_sets = []
uniques, counts = np.unique(indices[:,0], return_counts=True)
potentials = uniques[counts > 1]
for p in potentials:
    correspondents = indices[(indices[:,0] == p),1]
    combs = np.vstack(list(combinations(correspondents, 2)))
    combs = np.hstack((np.tile(p, (combs.shape[0], 1)), combs))
    index_sets.append(combs)
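
To make the intended output concrete, here is a small sketch of the same loop run on a toy input. The six-row `indices` array is made up for illustration; `np.full` stands in for the `np.tile` call, which should give the same key column.

```python
from itertools import combinations
import numpy as np

# Toy input (hypothetical): key 0 appears three times, key 1 once, key 2 twice.
indices = np.array([[0, 10], [0, 11], [1, 20], [0, 12], [2, 30], [2, 31]])

index_sets = []
uniques, counts = np.unique(indices[:, 0], return_counts=True)
for p in uniques[counts > 1]:
    # Second-column values that share the key p.
    correspondents = indices[indices[:, 0] == p, 1]
    # All unordered pairs of those values, one pair per row.
    combs = np.array(list(combinations(correspondents, 2)))
    # Prepend the key as the first column.
    keys = np.full((combs.shape[0], 1), p)
    index_sets.append(np.hstack((keys, combs)))

for block in index_sets:
    print(block)
```

Key 1 occurs only once, so it contributes nothing; key 0 yields three rows and key 2 yields one.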

Answer 1

A few improvements can be suggested:

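Initialize the output array up front. We can pre-compute the number of rows needed to store the combinations for each group: a group of N elements has N*(N-1)/2 pairwise combinations, which gives the interval length for that group. The total number of rows in the output array is the sum of all those interval lengths.
Pre-compute as much as possible in a vectorized manner before entering the loop.
Use a loop to get the combinations, since the ragged pattern (groups of different sizes) cannot be vectorized directly. Use np.repeat to simulate the tiling, and do it before the loop, so that each group's key is written into the first column of the output array.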
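
The row counting and the np.repeat step can be sketched in isolation as follows. This is only an illustration under the assumption of a sorted first column; `first_col`, `pair_counts` and `bounds` are made-up names.

```python
import numpy as np

# Hypothetical sorted first column: key 0 has 3 rows, key 2 has 2, key 5 has 4.
first_col = np.array([0, 0, 0, 2, 2, 5, 5, 5, 5])

keys, group_sizes = np.unique(first_col, return_counts=True)

# Number of unordered pairs per group: n * (n - 1) / 2, kept as integers.
pair_counts = group_sizes * (group_sizes - 1) // 2

# Total rows of the output array, and the slice boundaries of each group.
total_rows = pair_counts.sum()
bounds = np.append(0, np.cumsum(pair_counts))

# First output column: each key repeated once per pair in its group.
out = np.zeros((total_rows, 3), dtype=int)
out[:, 0] = np.repeat(keys, pair_counts)
```

Integer division matters here: with `/` the counts become floats, which cannot be used as an array shape or as repeat counts.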

So, with all those improvements in place, an implementation would look like this:

from itertools import combinations
import numpy as np

# The grouping below assumes a sorted first column, so sort the rows by it
# (this step can be skipped if indices is already sorted)
indices = indices[np.argsort(indices[:,0], kind='stable')]

# Remove rows with counts == 1
_, idx, counts = np.unique(indices[:,0], return_index=True, return_counts=True)
indices = np.delete(indices, idx[counts==1], axis=0)

# Decide the starting indices corresponding to the start of new groups,
# characterized by new elements along the sorted first column
start_idx = np.unique(indices[:,0], return_index=True)[1]
all_idx = np.append(start_idx, indices.shape[0])

# Get interval lengths that are required to store pairwise combinations
# of each group for unique ID from column-0 (integer division keeps them ints)
group_sizes = np.diff(all_idx)
interval_lens = group_sizes*(group_sizes-1)//2

# Setup output array and set the first column as a repeated array
out = np.zeros((interval_lens.sum(),3),dtype=int)
out[:,0] = np.repeat(indices[start_idx,0],interval_lens)

# Decide the start-stop indices for storing into output array
ssidx = np.append(0,np.cumsum(interval_lens))

# Finally run a loop to store all the combinations into the initialized o/p array
# (one iteration per remaining group, hence start_idx.size rather than idx.size)
for i in range(start_idx.size):
    out[ssidx[i]:ssidx[i+1],1:] = \
    np.vstack(list(combinations(indices[all_idx[i]:all_idx[i+1],1],2)))

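Please note that the output is one big array of shape (M, 3), not split into a list of arrays as produced by the original code. If that is still needed, np.split can be used.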
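
A minimal sketch of that split, assuming `ssidx` holds the cumulative row boundaries as in the code above. The small `out` array is a made-up stand-in for the real result.

```python
import numpy as np

# Hypothetical stacked result: three rows for key 0, one row for key 2.
out = np.array([[0, 10, 11],
                [0, 10, 12],
                [0, 11, 12],
                [2, 30, 31]])

# Cumulative row boundaries per group (leading 0, trailing total).
ssidx = np.array([0, 3, 4])

# Split at the interior boundaries only; including the leading 0 or the
# trailing total would add empty arrays at the ends of the list.
index_sets = np.split(out, ssidx[1:-1])
```

np.split returns views into `out` rather than copies, so this step should be cheap even for a large result.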

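Also, a quick runtime test showed that the proposed code is not much of an improvement. So most of the runtime is probably spent in getting the combinations. An alternative approach such as networkx, which is specifically suited to connection-based problems like this one, might therefore be a better fit.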

Answer 2

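Here is a solution that is vectorized over N. Note that it still contains a for loop, but it is a loop over each group of keys with equal multiplicity, which is guaranteed to be a much smaller number (usually a few dozen at most).

For N = 1,000,000, the runtime is on the order of a second on my PC.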

import numpy as np
import numpy_indexed as npi
N = 1000000
indices = np.random.randint(0, N//10, size=(N, 2))

def combinations(x):
    """vectorized computation of combinations for an array of sequences of equal length

    Parameters
    ----------
    x : ndarray, [..., n_items]

    Returns
    -------
    ndarray, [..., n_items * (n_items - 1) / 2, 2]
    """
    return np.rollaxis(x[..., np.triu_indices(x.shape[-1], 1)], -2, x.ndim+1)

def process(indices):
    """process a subgroup of indices, all having equal multiplicity

    Parameters
    ----------
    indices : ndarray, [n, 2]

    Returns
    -------
    ndarray, [m, 3]
    """
    keys, vals = npi.group_by(indices[:, 0], indices[:, 1])
    combs = combinations(vals)
    keys = np.repeat(keys, combs.shape[1])
    return np.concatenate([keys[:, None], combs.reshape(-1, 2)], axis=1)

index_groups = npi.group_by(npi.multiplicity(indices[:, 0])).split(indices)
result = np.concatenate([process(ind) for ind in index_groups])
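
For readers who do not have numpy_indexed installed, the same idea (loop over distinct multiplicities, vectorize within each) can be sketched in plain NumPy. This is a rough equivalent under my own assumptions, not the package's implementation; the function name `pairs_by_multiplicity` is hypothetical.

```python
import numpy as np

def pairs_by_multiplicity(indices):
    """Return an (M, 3) array of [key, value_a, value_b] rows."""
    # Sort rows by key so that each key's rows are contiguous.
    order = np.argsort(indices[:, 0], kind="stable")
    keys_sorted = indices[order, 0]
    vals_sorted = indices[order, 1]

    keys, starts, counts = np.unique(
        keys_sorted, return_index=True, return_counts=True)

    blocks = []
    # One iteration per distinct multiplicity, not per key.
    for n in np.unique(counts):
        if n < 2:
            continue
        sel = counts == n
        # Values of all keys with multiplicity n, gathered into shape (k, n).
        vals = vals_sorted[starts[sel][:, None] + np.arange(n)]
        # All position pairs a < b within a group of size n.
        a, b = np.triu_indices(n, 1)
        combs = np.stack([vals[:, a], vals[:, b]], axis=-1)  # shape (k, m, 2)
        key_col = np.repeat(keys[sel], a.size)
        blocks.append(np.column_stack([key_col, combs.reshape(-1, 2)]))

    if not blocks:
        return np.empty((0, 3), dtype=indices.dtype)
    return np.concatenate(blocks)

# Toy input (hypothetical): key 0 three times, key 1 once, key 2 twice.
indices = np.array([[0, 10], [0, 11], [1, 20], [0, 12], [2, 30], [2, 31]])
result = pairs_by_multiplicity(indices)
```

Rows come out grouped by multiplicity rather than by key, so the order differs from the loop in the question.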

Disclaimer: I am the author of the numpy_indexed package.
