Question

I'm trying to count how many times each row appears in an np.array, for example:

import numpy as np
my_array = np.array([[1, 2, 0, 1, 1, 1],
                     [1, 2, 0, 1, 1, 1], # duplicate of row 0
                     [9, 7, 5, 3, 2, 1],
                     [1, 1, 1, 0, 0, 0], 
                     [1, 2, 0, 1, 1, 1], # duplicate of row 0
                     [1, 1, 1, 1, 1, 0]])

The row [1, 2, 0, 1, 1, 1] appears 3 times.

A simple, naive solution would be to convert all my rows to tuples and apply collections.Counter, like this:

from collections import Counter
def row_counter(my_array):
    list_of_tups = [tuple(ele) for ele in my_array]
    return Counter(list_of_tups)

Which yields:

In [2]: row_counter(my_array)
Out[2]: Counter({(1, 2, 0, 1, 1, 1): 3, (1, 1, 1, 1, 1, 0): 1, (9, 7, 5, 3, 2, 1): 1, (1, 1, 1, 0, 0, 0): 1})

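However, I'm concerned about the efficiency of my approach. Maybe there is a library that provides a built-in way of doing this. I tagged the question as pandas because I think pandas might have the tool I'm looking for.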

Answer 1

You can use the answer to this other question of yours to get the counts of the unique items.

In numpy 1.9 there is a return_counts optional keyword argument, so you can simply do:

>>> my_array
array([[1, 2, 0, 1, 1, 1],
       [1, 2, 0, 1, 1, 1],
       [9, 7, 5, 3, 2, 1],
       [1, 1, 1, 0, 0, 0],
       [1, 2, 0, 1, 1, 1],
       [1, 1, 1, 1, 1, 0]])
>>> dt = np.dtype((np.void, my_array.dtype.itemsize * my_array.shape[1]))
>>> b = np.ascontiguousarray(my_array).view(dt)
>>> unq, cnt = np.unique(b, return_counts=True)
>>> unq = unq.view(my_array.dtype).reshape(-1, my_array.shape[1])
>>> unq
array([[1, 1, 1, 0, 0, 0],
       [1, 1, 1, 1, 1, 0],
       [1, 2, 0, 1, 1, 1],
       [9, 7, 5, 3, 2, 1]])
>>> cnt
array([1, 1, 3, 1])

In earlier versions, you can do it as:

>>> unq, inv = np.unique(b, return_inverse=True)
>>> # ravel(): the inverse may come back with b's (n, 1) shape
>>> cnt = np.bincount(inv.ravel())
>>> unq = unq.view(my_array.dtype).reshape(-1, my_array.shape[1])
>>> unq
array([[1, 1, 1, 0, 0, 0],
       [1, 1, 1, 1, 1, 0],
       [1, 2, 0, 1, 1, 1],
       [9, 7, 5, 3, 2, 1]])
>>> cnt
array([1, 1, 3, 1])
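
To see why np.bincount on the inverse gives the same counts as return_counts, here is a minimal sketch on a 1-D array. The `labels` array is a hypothetical stand-in for the void-viewed rows above, not part of the original answer:

```python
import numpy as np

# Hypothetical 1-D labels standing in for the void-viewed rows.
labels = np.array([3, 1, 3, 2, 3])

# inv[i] is the position of labels[i] inside unq, so unq[inv] rebuilds labels.
unq, inv = np.unique(labels, return_inverse=True)

# Counting how often each position occurs is the same as counting each unique value.
cnt_old = np.bincount(inv)

# The return_counts route, for comparison.
unq_new, cnt_new = np.unique(labels, return_counts=True)

print(unq, cnt_old, cnt_new)
```

Both routes should agree on the counts; the only difference is that the inverse route needs the extra bincount step.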

Answer 2

(This assumes that the array is fairly small, e.g. fewer than 1000 rows.)

Here's a short NumPy way to count how many times each row appears in the array:

>>> (my_array[:, np.newaxis] == my_array).all(axis=2).sum(axis=1)
array([3, 3, 1, 1, 3, 1])

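This counts how many times each row appears in my_array and returns an array in which the first value is the number of times the first row appears, the second value is the number of times the second row appears, and so on.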
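
The one-liner can be unpacked into its intermediate steps, roughly as in the sketch below. The last part, which keeps only the first occurrence of each row, is my own addition and not part of the answer. Note that the comparison builds an (n, n, m) boolean array, which is why this only suits small inputs.

```python
import numpy as np

my_array = np.array([[1, 2, 0, 1, 1, 1],
                     [1, 2, 0, 1, 1, 1],
                     [9, 7, 5, 3, 2, 1],
                     [1, 1, 1, 0, 0, 0],
                     [1, 2, 0, 1, 1, 1],
                     [1, 1, 1, 1, 1, 0]])

# (6, 1, 6) == (6, 6) broadcasts to (6, 6, 6):
# eq[i, j, k] is True when my_array[i, k] == my_array[j, k].
eq = my_array[:, np.newaxis] == my_array

# same_row[i, j] is True when rows i and j match in every column.
same_row = eq.all(axis=2)

# Number of rows equal to row i (including row i itself).
per_row_counts = same_row.sum(axis=1)

# Addition (not in the answer): keep one entry per distinct row.
# argmax on a boolean row returns the index of its first True value.
first_idx = same_row.argmax(axis=1)
is_first = first_idx == np.arange(len(my_array))
unique_rows = my_array[is_first]
unique_counts = per_row_counts[is_first]

print(per_row_counts)   # [3 3 1 1 3 1]
print(unique_counts)    # [3 1 1 1]
```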

Answer 3

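Your solution is not bad, but if your matrix is large you will probably want to hash the rows more efficiently (compared with the default hashing that Counter uses) before counting. You can do that with joblib: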

import numpy as np
import pandas as pd
import joblib
from collections import Counter

A = np.random.rand(5, 10000)

%timeit (A[:,np.newaxis,:] == A).all(axis=2).sum(axis=1)
10000 loops, best of 3: 132 µs per loop

%timeit Counter(joblib.hash(row) for row in A).values()
1000 loops, best of 3: 1.37 ms per loop

%timeit Counter(tuple(ele) for ele in A).values()
100 loops, best of 3: 3.75 ms per loop

%timeit pd.DataFrame(A).groupby(range(A.shape[1])).size()
1 loops, best of 3: 2.24 s per loop

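With this many columns, the pandas solution is extremely slow (around 2 s per loop). For a small matrix like the one you showed, your method is faster than the joblib hashing but slower than numpy: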

numpy:  100000 loops, best of 3: 15.1 µs per loop
joblib: 1000 loops, best of 3: 885 µs per loop
tuple:  10000 loops, best of 3: 27 µs per loop
pandas: 100 loops, best of 3: 2.2 ms per loop

If you have a large number of rows, you can probably find a better substitute for Counter to find the hash frequencies.

Edit: added numpy benchmarks from @acjr's solution on my system so that it is easier to compare. The numpy solution is the fastest one in both cases.
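
As one possible cheaper key than a tuple of elements, each row's raw bytes can be used. This is a sketch of my own, not benchmarked here, and the name row_counter_bytes is hypothetical. It compares rows byte for byte, so for float data 0.0 and -0.0 count as different rows.

```python
from collections import Counter

import numpy as np


def row_counter_bytes(arr):
    """Count rows using each row's raw bytes as the dictionary key."""
    # Make sure every row is one contiguous block of memory.
    arr = np.ascontiguousarray(arr)
    return Counter(row.tobytes() for row in arr)


my_array = np.array([[1, 2, 0, 1, 1, 1],
                     [1, 2, 0, 1, 1, 1],
                     [9, 7, 5, 3, 2, 1],
                     [1, 1, 1, 0, 0, 0],
                     [1, 2, 0, 1, 1, 1],
                     [1, 1, 1, 1, 1, 0]])

counts = row_counter_bytes(my_array)

# Decode the byte keys back into rows when the values are needed.
decoded = {tuple(np.frombuffer(key, dtype=my_array.dtype).tolist()): n
           for key, n in counts.items()}
print(decoded)
```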

Answer 4

A pandas approach might look like this:

import pandas as pd

df = pd.DataFrame(my_array,columns=['c1','c2','c3','c4','c5','c6'])
df.groupby(['c1','c2','c3','c4','c5','c6']).size()

Note: providing column names is not necessary.
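
For example, a version without explicit column names could look like the sketch below. It relies on the default integer column labels; the conversion to a dict at the end is my own addition, for comparison with the Counter output in the question.

```python
import numpy as np
import pandas as pd

my_array = np.array([[1, 2, 0, 1, 1, 1],
                     [1, 2, 0, 1, 1, 1],
                     [9, 7, 5, 3, 2, 1],
                     [1, 1, 1, 0, 0, 0],
                     [1, 2, 0, 1, 1, 1],
                     [1, 1, 1, 1, 1, 0]])

# Without names, the columns are labelled 0, 1, ..., 5.
df = pd.DataFrame(my_array)

# Group on every column; size() gives the number of rows in each group.
counts = df.groupby(list(df.columns)).size()

# The result is a Series indexed by a MultiIndex of row values;
# to_dict() turns it into {row_tuple: count}.
as_dict = counts.to_dict()
print(as_dict)
```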

Answer 5

A solution identical to Jaime's can be found in the numpy_indexed package (disclaimer: I am its author):

import numpy_indexed as npi
npi.count(my_array)

Answer 6

I think just specifying axis in np.unique gives what you need.

import numpy as np
unq, cnt = np.unique(my_array, axis=0, return_counts=True)

Note: this feature is available only in numpy>=1.13.0.
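
Applied to the array from the question, this could look as follows. The unique rows come back sorted, and the final dict conversion is my own addition, for comparison with the Counter output.

```python
import numpy as np

my_array = np.array([[1, 2, 0, 1, 1, 1],
                     [1, 2, 0, 1, 1, 1],
                     [9, 7, 5, 3, 2, 1],
                     [1, 1, 1, 0, 0, 0],
                     [1, 2, 0, 1, 1, 1],
                     [1, 1, 1, 1, 1, 0]])

# axis=0 treats each row as one item instead of flattening the array.
unq, cnt = np.unique(my_array, axis=0, return_counts=True)

print(unq)
# [[1 1 1 0 0 0]
#  [1 1 1 1 1 0]
#  [1 2 0 1 1 1]
#  [9 7 5 3 2 1]]
print(cnt)
# [1 1 3 1]

# Same information keyed by row tuple, like the Counter in the question.
row_counts = {tuple(row): n for row, n in zip(unq.tolist(), cnt.tolist())}
```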

Counting the occurrences of each row in a numpy.array
