Question

I have two 2D numpy arrays of the same shape:

idx = np.array([[1, 2, 5, 6],[1, 3, 5, 2]])
val = np.array([[0.1, 0.5, 0.3, 0.2], [0.1, 0., 0.8, 0.2]])

I know we can use np.bincount with val as the weights:

np.bincount(idx.reshape(-1), weights=val.reshape(-1))

But this is not what I want. np.bincount puts zeros at the positions of indices that do not occur. In the example, the result is:

array([0. , 0.2, 0.7, 0. , 0. , 1.1, 0.2])

However, I don't want these extra zeros for indices that do not occur. I want the weighted counts to correspond to np.unique(idx):

array([1, 2, 3, 5, 6])

My expected result is:

array([0.2, 0.7, 0., 1.1, 0.2])

Does anyone have an idea how to do this efficiently? My idx and val are very large, with over 1 million elements.

Answer 1

You can do this efficiently with the numpy library.

Check this out:

Tom

This is very fast. Hope it helps.

Answer 2

As you may know, for loops in Python are not a good idea when efficiency matters.

You can try indexing the output of bincount with np.unique:

>>> np.bincount(idx.reshape(-1), val.reshape(-1))[np.unique(idx)]
array([0.2, 0.7, 0. , 1.1, 0.2])

If you just want to get rid of the zeros, that is probably the fastest way.
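
The one-liner above can be wrapped in a small helper, sketched below. The name `weighted_counts_dense` is hypothetical and not from the answer, and the sketch assumes the indices are non-negative integers that are small enough for a dense bincount to fit in memory.

```python
import numpy as np

def weighted_counts_dense(idx, val):
    """Sum val per distinct value of idx via a dense bincount (illustrative helper)."""
    flat_idx = np.ravel(idx)
    flat_val = np.ravel(val)
    # bincount allocates max(idx) + 1 bins, including bins for absent indices
    totals = np.bincount(flat_idx, weights=flat_val)
    # Keep only the bins whose index actually occurs in idx
    keys = np.unique(flat_idx)
    return keys, totals[keys]

idx = np.array([[1, 2, 5, 6], [1, 3, 5, 2]])
val = np.array([[0.1, 0.5, 0.3, 0.2], [0.1, 0., 0.8, 0.2]])
keys, sums = weighted_counts_dense(idx, val)
print(keys)
print(sums)
```

Returning the keys together with the sums makes it explicit which index each weighted count belongs to.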

Answer 3

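The key to success is to:

map the unique values of idx to consecutive integers, starting from 0,
compute bincount on the result of that mapping, rather than on idx itself.

The code to do this (quite concise, without any loop) is: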

import pandas as pd

unq = np.unique(idx)
# Map each unique value of idx to its position 0..n-1
mapper = pd.Series(range(unq.size), index=unq)
np.bincount(mapper[idx.reshape(-1)], weights=val.reshape(-1))

For your sample data, the result is:

array([0.2, 0.7, 0. , 1.1, 0.2])
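
The same mapping idea can probably be done without pandas. The sketch below swaps the pd.Series lookup for np.searchsorted, which is a different technique from the one in this answer. It relies on np.unique returning sorted values and on every element of idx being present in unq.

```python
import numpy as np

idx = np.array([[1, 2, 5, 6], [1, 3, 5, 2]])
val = np.array([[0.1, 0.5, 0.3, 0.2], [0.1, 0., 0.8, 0.2]])

unq = np.unique(idx)  # sorted distinct values of idx
# Position of each element of idx within unq (exact, since every element occurs in unq)
pos = np.searchsorted(unq, idx.reshape(-1))
# One bin per unique value, so no bins are created for absent indices
result = np.bincount(pos, weights=val.reshape(-1), minlength=unq.size)
print(result)
```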

Answer 4

Method 1:

Use np.unique with return_inverse=True.

idx = np.array([[1, 2, 5, 6],[1, 3, 5, 2]])
val = np.array([[0.1, 0.5, 0.3, 0.2], [0.1, 0., 0.8, 0.2]])

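# Flatten first: newer NumPy versions may return inv with the 2-D shape of idx, which bincount rejects
unq, inv = np.unique(idx.reshape(-1), return_inverse=True)
np.bincount(inv, val.reshape(-1))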
# array([0.2, 0.7, 0. , 1.1, 0.2])

Method 2:

Use bincount, then drop the (true) zeros, i.e. the bins whose index never occurs.

np.bincount(idx.reshape(-1),val.reshape(-1))[np.bincount(idx.reshape(-1)).nonzero()]
# array([0.2, 0.7, 0. , 1.1, 0.2])
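
To illustrate why the filter is built from the unweighted counts, here is a small sketch on the sample data. Filtering on the weighted sums themselves would also drop index 3, which does occur in idx but whose weights sum to zero. The variable names are illustrative only.

```python
import numpy as np

idx = np.array([[1, 2, 5, 6], [1, 3, 5, 2]])
val = np.array([[0.1, 0.5, 0.3, 0.2], [0.1, 0., 0.8, 0.2]])

flat_idx = idx.reshape(-1)
weighted = np.bincount(flat_idx, weights=val.reshape(-1))
# Bins that occur at least once in idx, regardless of their weight
present = np.bincount(flat_idx).nonzero()
by_count = weighted[present]               # keeps index 3 (weight sum is 0)
by_weight = weighted[weighted.nonzero()]   # would lose index 3
print(by_count)
print(by_weight)
```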

Which one is better will depend on how spread out idx is, since Method 2 allocates max(idx) + 1 bins.
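
One way to check this on your own data is a rough timing comparison like the sketch below. The function names, the array size, and the two index ranges are arbitrary assumptions chosen for illustration. Timings will vary by machine and NumPy version, so no particular outcome is claimed here.

```python
import timeit
import numpy as np

def via_unique_inverse(idx, val):
    # Method 1: flatten first so that inv is 1-D
    _, inv = np.unique(idx.reshape(-1), return_inverse=True)
    return np.bincount(inv, weights=val.reshape(-1))

def via_dense_bincount(idx, val):
    # Method 2: dense bincount, then keep only the bins that occur
    flat = idx.reshape(-1)
    counts = np.bincount(flat)
    return np.bincount(flat, weights=val.reshape(-1))[counts.nonzero()]

rng = np.random.default_rng(0)
n = 100_000
val = rng.random(n)
for label, high in [("dense", 1_000), ("spread out", 2_000_000)]:
    idx = rng.integers(0, high, size=n)
    a = via_unique_inverse(idx, val)
    b = via_dense_bincount(idx, val)
    same = np.allclose(a, b)
    t_a = timeit.timeit(lambda: via_unique_inverse(idx, val), number=3)
    t_b = timeit.timeit(lambda: via_dense_bincount(idx, val), number=3)
    print(f"{label}: match={same}, unique+inverse {t_a:.4f}s, dense bincount {t_b:.4f}s")
```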
