Question

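I have a numpy array A of n 1x3 arrays, where n is the total number of possible combinations of elements in a 1x3 array whose elements each range from 0 to 50. That is,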

 A = [[0,0,0],[0,0,1]...[0,1,0]...[50,50,50]]

and

 len(A) = 50*50*50 = 125000

I have a second numpy array B of m 1x3 arrays, where m = 10 million, and each of its rows can take any value from the set described by A.

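I want to count how many times each combination occurs in B, that is, how many times [0,0,0] appears in B, how many times [0,0,1] appears ... how many times [50,50,50] appears. So far, I have the following:

 for i in range(len(A)):
     for j in range(len(B)):
         if np.array_equal(A[i], B[j]):
             y[i] += 1

where y keeps track of how many times the i-th array occurs. So y[0] is the number of times [0,0,0] appears in B, y[1] is the number of times [0,0,1] appears, and so on up to [50,50,50].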

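The problem is that this takes forever: it has to check 10 million entries, 125000 times. Is there a faster and more efficient way to do this?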

Answer 1

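Here is a fast approach. It processes ten million tuples drawn from range(50)^3 in a fraction of a second, about 100 times faster than the next best solution (@Primusa's):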

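It uses the fact that there is a direct translation between such tuples and the numbers 0 to 50^3 - 1. (The mapping happens to be the same as the one between the rows of A and their row numbers.) The functions np.ravel_multi_index and np.unravel_index implement this translation and its inverse.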
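As a quick check of that correspondence, here is a minimal sketch that converts a few triples to flat indices and back. It assumes the 0 to 49 range that this answer adopts; the variable names are illustrative only.

```python
import numpy as np

shape = (50, 50, 50)

# A few example triples, one per row.
triples = np.array([[0, 0, 0], [0, 0, 1], [0, 1, 0], [49, 49, 49]])

# ravel_multi_index expects one array per dimension, hence the transpose.
flat = np.ravel_multi_index(triples.T, shape)

# For a C-ordered (50, 50, 50) grid the flat index is i*50*50 + j*50 + k.
manual = triples[:, 0] * 2500 + triples[:, 1] * 50 + triples[:, 2]

# unravel_index inverts the mapping; it returns a tuple of three arrays.
back = np.stack(np.unravel_index(flat, shape), axis=1)

print(flat)  # flat indices of the four triples
```

Here flat and manual agree, and back reproduces triples row for row.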

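Once B has been translated into numbers, their frequencies can be determined very efficiently with np.bincount. Below I reshape the result to obtain a 50x50x50 histogram, but that is just a matter of taste and can be left out. (I have taken the liberty of using only the numbers 0 through 49, so that len(A) becomes 125000):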

>>> import numpy as np
>>> B = np.random.randint(0, 50, (10000000, 3))
>>> Br = np.ravel_multi_index(B.T, (50, 50, 50))
>>> result = np.bincount(Br, minlength=125000).reshape(50, 50, 50)

Let's look at a smaller example:

>>> B = np.random.randint(0, 3, (10, 3))
>>> Br = np.ravel_multi_index(B.T, (3, 3, 3))
>>> result = np.bincount(Br, minlength=27).reshape(3, 3, 3)
>>> 
>>> B
array([[1, 1, 2],
       [2, 1, 2],
       [2, 0, 0],
       [2, 1, 0],
       [2, 0, 2],
       [0, 0, 2],
       [0, 0, 2],
       [0, 2, 2],
       [2, 0, 0],
       [0, 2, 0]])
>>> result
array([[[0, 0, 2],
        [0, 0, 0],
        [1, 0, 1]],

       [[0, 0, 0],
        [0, 0, 1],
        [0, 0, 0]],

       [[2, 0, 1],
        [1, 0, 1],
        [0, 0, 0]]])

For example, to query how many times [2,1,0] occurs in B:

>>> result[2,1,0]
1

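As mentioned above: to convert between indices into your A and the actual rows of A (which are the indices into my result), np.ravel_multi_index and np.unravel_index can be used. Or you can leave out the final reshape (i.e. use result = np.bincount(Br, minlength=125000)); the counts are then indexed in exactly the same way as A.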
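To illustrate the last point, the sketch below builds a small A in lexicographic order and checks that the flat counts line up with its rows. It assumes a range of 0 to 2 instead of 0 to 49 so the result is easy to inspect; building A with itertools.product is an assumption about how the asker's A is ordered.

```python
import itertools
import numpy as np

n = 3  # small stand-in for 50

# A lists every triple in lexicographic order, as in the question.
A = np.array(list(itertools.product(range(n), repeat=3)))

rng = np.random.default_rng(0)
B = rng.integers(0, n, size=(1000, 3))

# Flat counts: y[i] is the number of rows of B equal to A[i].
y = np.bincount(np.ravel_multi_index(B.T, (n, n, n)), minlength=n**3)

# Row i of A can be recovered from its index with unravel_index.
i = 5
row = np.unravel_index(i, (n, n, n))
print(A[i], row, y[i])
```

With this layout y[i] can be read directly next to A[i], with no reshape needed.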

Answer 2

You can use a dict() to speed this up, so that the work is reduced to a single pass over the 10 million entries.

The first thing to do is to turn all the sublists in A into hashable objects, so that you can use them as keys in a dictionary.

Convert all the sublists into tuples:

A = [tuple(i) for i in A]

Then create a dict() with every value in A as a key and 0 as its value.

d = {i:0 for i in A}

Now, for every subarray in your numpy array B, convert it to a tuple and increment d[that tuple] by 1:

for subarray in B:
    d[tuple(subarray)] += 1

d is now a dictionary in which the value for each key is the number of times that key occurs in B.
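A variant of the same idea can be sketched with collections.Counter from the standard library instead of a pre-filled dict. This is a substitution rather than part of the answer above; combinations that never occur simply report 0. The small random B is a stand-in for the real data.

```python
from collections import Counter

import numpy as np

rng = np.random.default_rng(0)
B = rng.integers(0, 3, size=(20, 3))  # small stand-in for the real B

# B.tolist() gives lists of plain Python ints; tuple() makes them hashable.
counts = Counter(map(tuple, B.tolist()))

# Missing keys return 0 instead of raising KeyError.
never_seen = counts[(99, 99, 99)]
print(counts.most_common(3), never_seen)
```

Going through B.tolist() first means the keys are tuples of Python ints, so a lookup such as counts[(0, 0, 1)] works as expected.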

Answer 3

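You can find the unique rows of array B and their counts by calling np.unique over its first axis with return_counts=True. You can then use broadcasting to find the indices of the unique rows of B that occur in A, by calling the ndarray.all and ndarray.any methods on the appropriate axes. After that, all you need is simple indexing: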

In [82]: unique, counts = np.unique(B, axis=0, return_counts=True)

In [83]: indices = np.where((unique == A[:,None,:]).all(axis=2).any(axis=0))[0]

# Get items from A that exist in B
In [84]: unique[indices]

# Get the counts 
In [85]: counts[indices]

Example:

In [86]: arr = np.array([[2 ,3, 4], [5, 6, 0], [2, 3, 4], [1, 0, 4], [3, 3, 3], [5, 6, 0], [2, 3, 4]])

In [87]: a = np.array([[2, 3, 4], [1, 9, 5], [3, 3, 3]])

In [88]: unique, counts = np.unique(arr, axis=0, return_counts=True)

In [89]: indices = np.where((unique == a[:,None,:]).all(axis=2).any(axis=0))[0]

In [90]: unique[indices]
Out[90]: 
array([[2, 3, 4],
       [3, 3, 3]])

In [91]: counts[indices]
Out[91]: array([3, 1])
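If a full-length count vector aligned with A is wanted, one possible sketch combines np.unique with np.ravel_multi_index from Answer 1. This is not part of the answer above, and it assumes that all values lie in 0 to n-1 (n = 7 here is chosen only to fit the example data) and that A is in lexicographic order. It avoids the (len(A), len(unique), 3) intermediate array that the broadcast comparison creates.

```python
import numpy as np

n = 7  # assumed exclusive upper bound on the values
arr = np.array([[2, 3, 4], [5, 6, 0], [2, 3, 4], [1, 0, 4],
                [3, 3, 3], [5, 6, 0], [2, 3, 4]])

unique, counts = np.unique(arr, axis=0, return_counts=True)

# Scatter the counts into a vector with one slot per possible triple.
y = np.zeros(n**3, dtype=counts.dtype)
y[np.ravel_multi_index(unique.T, (n, n, n))] = counts

# Look up the count of a particular triple.
count_234 = y[np.ravel_multi_index((2, 3, 4), (n, n, n))]
print(count_234)
```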

Answer 4

You can do this:

y=[np.where(np.all(B==arr,axis=1))[0].shape[0] for arr in A]

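arr simply iterates over A, and np.all checks where it matches B. np.where returns the positions of those matches as an array, and shape[0] returns the length of that array, in other words the desired frequency.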
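A small runnable sketch of this one-liner, reusing the example arrays from Answer 3 as stand-ins for A and B, with an equivalent form based on np.count_nonzero added for comparison (the second form is an addition, not part of the answer):

```python
import numpy as np

B = np.array([[2, 3, 4], [5, 6, 0], [2, 3, 4], [1, 0, 4],
              [3, 3, 3], [5, 6, 0], [2, 3, 4]])
A = np.array([[2, 3, 4], [1, 9, 5], [3, 3, 3]])

# One vectorized pass over B per row of A, as in the one-liner above.
y = [np.where(np.all(B == arr, axis=1))[0].shape[0] for arr in A]

# Equivalent count without building the index array.
y_alt = [int(np.count_nonzero(np.all(B == arr, axis=1))) for arr in A]

print(y, y_alt)
```

Each pass is vectorized, but the total work is still len(A) comparisons against all of B.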

Comparing large arrays
