Question

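I am trying to implement an efficient way of creating a frequency table in Python, with a rather large numpy input array of ~30 million entries. Currently I am using a for-loop, but it takes far too long.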

The input is a sorted numpy array of the form

Y = np.array([4, 4, 4, 6, 6, 7, 8, 9, 9, 9..... etc])

and I would like an output of the form:

Z = {4:3, 5:0, 6:2, 7:1,8:1,9:3..... etc} (as any data type)

Currently I am using the following implementation:

Z = pd.Series(index = np.arange(Y.min(), Y.max() + 1))

for i in range(Y.min(), Y.max() + 1):
  Z[i] = (Y == i).sum()

Is there a quicker way of doing this, either with a loop or without iterating at all? Thanks for the help, and sorry if this has been asked before!

Answer 1

You can do this with Counter from the collections module. See the code below, which I ran for your test case.

import numpy as np
from collections import Counter
Y = np.array([4, 4, 4, 6, 6, 7, 8, 9, 9, 9,10,5,5,5])
print(Counter(Y))

It gives the following output

Counter({4: 3, 9: 3, 5: 3, 6: 2, 7: 1, 8: 1, 10: 1})

You can easily work with this object. I hope this helps.
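As a rough sketch of how the Counter could be turned into the dense table from the question (including zero counts such as 5:0): a Counter returns 0 for keys it has never seen, so the gaps can be filled by looking up every value in the range. The `tolist()` conversion and the names `counts` and `Z` are additions for illustration, not part of the answer above.

```python
import numpy as np
from collections import Counter

Y = np.array([4, 4, 4, 6, 6, 7, 8, 9, 9, 9])

# tolist() converts numpy integers to plain Python ints, so the keys are ordinary ints
counts = Counter(Y.tolist())

# A Counter returns 0 for missing keys, so values absent from Y (here 5) get a count of 0
Z = {value: counts[value] for value in range(int(Y.min()), int(Y.max()) + 1)}
print(Z)
```

Note that the dictionary comprehension is a Python-level loop over the value range, so this is only attractive when the range between the minimum and maximum is modest.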

Answer 2

I think numpy.unique is your solution.

https://docs.scipy.org/doc/numpy-1.13.0/reference/generated/numpy.unique.html

import numpy as np
t = np.random.randint(0, 1000, 100000000)
print(np.unique(t, return_counts=True))

This takes about 4 seconds. The collections.Counter approach takes about 10 seconds.

However, numpy.unique returns the frequencies as arrays, whereas collections.Counter returns a dictionary. Which one to use is a matter of convenience.
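If a dictionary or a table with explicit zeros is preferred, the arrays returned by numpy.unique can be converted afterwards. The following is a minimal sketch on the small example from the question; the names `sparse` and `dense` are illustrative and not from the original post.

```python
import numpy as np

Y = np.array([4, 4, 4, 6, 6, 7, 8, 9, 9, 9])

# values holds the distinct entries (sorted), counts how often each one occurs
values, counts = np.unique(Y, return_counts=True)

# Dictionary containing only the values that actually occur in Y
sparse = dict(zip(values.tolist(), counts.tolist()))

# Dense table over the whole range min..max; position k corresponds to value Y.min() + k
dense = np.zeros(int(Y.max() - Y.min()) + 1, dtype=np.int64)
dense[values - Y.min()] = counts

print(sparse)
print(dense)
```

Here `dense[1]` stays 0 because the value 5 never appears in `Y`.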

Edit: I can't comment on other posts, so I'll write it here: the solution by @lomereiters is fast (linear) and should be the accepted one.

Answer 3

If your input array x is sorted, you can do the following to get the counts in linear time:

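import numpy as np
# x is assumed to be a sorted, non-empty 1-D array
diff1 = np.diff(x)
# get indices of the elements at which jumps occurred, plus both end points
jumps = np.concatenate([[0], np.where(diff1 > 0)[0] + 1, [len(x)]])
unique_elements = x[jumps[:-1]]  # first element of each run of equal values
counts = np.diff(jumps)  # length of each run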
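To make the idea easier to try out, here is a small self-contained sketch that wraps the same steps in a function and runs them on the example array from the question. The function name `sorted_counts` is hypothetical, and the sketch assumes the input is sorted in ascending order and not empty.

```python
import numpy as np

def sorted_counts(x):
    """Return (unique_elements, counts) for a sorted, non-empty 1-D array."""
    # Positions where the value changes mark the start of a new run
    jumps = np.concatenate([[0], np.where(np.diff(x) > 0)[0] + 1, [len(x)]])
    # The first element of each run is the distinct value; run lengths are the counts
    return x[jumps[:-1]], np.diff(jumps)

x = np.array([4, 4, 4, 6, 6, 7, 8, 9, 9, 9])
unique_elements, counts = sorted_counts(x)
print(unique_elements)
print(counts)
```

Like numpy.unique, this reports only the values that occur, so a value such as 5 that is missing from the input does not get a zero entry.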

How to efficiently create a frequency table of the entries in an array in Python
