Question

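I have two arrays of the same dimension, one holding tags and one holding tag categories. I want to group the tags by category and count how many times each tag occurs.

As you can see, tags can share the same category ('world', 'hello').

I know this can easily be done with loops, but I'm sure numpy has some nifty way of doing it more efficiently. Any help would be greatly appreciated.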

# Tag category
A = [10, 10, 20, 10, 10, 10, 20, 10, 20, 20]
# Tags
B = ['hello', 'world', 'how', 'are', 'you', 'world', 'you', 'how', 'hello', 'hello']

Expected result:

[(10, (('hello', 1), ('are', 1), ('you', 1), ('world', 2))), (20, (('how', 1), ('you', 1), ('hello', 2)))]

Answer 1

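You can use a nested collections.defaultdict.

Here the integers from A are the keys of the outer dictionary. Each inner dictionary uses the words from B as keys, and the values are their counts.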

>>> from collections import defaultdict
>>> from pprint import pprint
>>> d = defaultdict(lambda: defaultdict(int))
>>> for k, v in zip(A, B):
...     d[k][v] += 1
...

Now d contains (I converted it to a normal dictionary because its output is less cluttered):

>>> pprint({k: dict(v) for k, v in d.items()})
{10: {'are': 1, 'hello': 1, 'how': 1, 'world': 2, 'you': 1},
 20: {'hello': 2, 'how': 1, 'you': 1}}

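Now we need to loop over the outer dictionary and call tuple(.items()) on each inner dictionary to get the desired output (on Python 2, use .iteritems() instead). On Python 3.7+, where dictionaries keep insertion order, this gives:

>>> pprint([(k, tuple(v.items())) for k, v in d.items()])
[(10, (('hello', 1), ('world', 2), ('are', 1), ('you', 1), ('how', 1))),
 (20, (('how', 1), ('you', 1), ('hello', 2)))]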
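The steps above can be wrapped in a single helper. This is only a sketch: the function name `group_tag_counts` is hypothetical, it swaps `defaultdict(int)` for `Counter` as the inner mapping, and it assumes Python 3.7+ so that dictionaries keep insertion order.

```python
from collections import Counter, defaultdict


def group_tag_counts(categories, tags):
    """Count tags per category (hypothetical helper, not part of any library)."""
    counts = defaultdict(Counter)  # category -> Counter of its tags
    for category, tag in zip(categories, tags):
        counts[category][tag] += 1
    # Freeze into the nested-tuple layout used in the question.
    return [(category, tuple(counter.items()))
            for category, counter in counts.items()]


A = [10, 10, 20, 10, 10, 10, 20, 10, 20, 20]
B = ['hello', 'world', 'how', 'are', 'you', 'world', 'you', 'how', 'hello', 'hello']
result = group_tag_counts(A, B)
print(result)
```

`Counter` behaves like `defaultdict(int)` here; it was chosen only because it also offers `most_common()` if the tags need to be sorted by frequency later.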

Answer 2

Since it has already been mentioned, here is a way to aggregate the values with Pandas.

Setting up the DataFrame...

>>> import pandas as pd
>>> df = pd.DataFrame({'A': A, 'B': B})
>>> df
    A      B
0  10  hello
1  10  world
2  20    how
3  10    are
4  10    you
5  10  world
6  20    you
7  10    how
8  20  hello
9  20  hello

Pivoting to aggregate the values...

>>> table = pd.pivot_table(df, index='B', columns='A', aggfunc='size')
>>> table
A      10  20
B            
are     1 NaN
hello   1   2
how     1   1
world   2 NaN
you     1   1

Converting back to a dictionary...

>>> table.to_dict()
{10: {'are': 1.0, 'hello': 1.0, 'how': 1.0, 'world': 2.0, 'you': 1.0},
 20: {'are': nan, 'hello': 2.0, 'how': 1.0, 'world': nan, 'you': 1.0}}

From here you can use plain Python to reshape the dictionary into the desired format (e.g. a list).
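As one possible way to do that last step, the sketch below starts from the dictionary printed above (typed in by hand, with `float('nan')` standing in for `nan`) and drops the missing entries. The variable names are only illustrative.

```python
import math

nan = float('nan')
# The result of table.to_dict() shown above, written out as a literal.
table_dict = {
    10: {'are': 1.0, 'hello': 1.0, 'how': 1.0, 'world': 2.0, 'you': 1.0},
    20: {'are': nan, 'hello': 2.0, 'how': 1.0, 'world': nan, 'you': 1.0},
}

# Skip NaN cells (tag absent from that category) and turn counts back into ints.
result = [
    (category, tuple((tag, int(count))
                     for tag, count in tag_counts.items()
                     if not math.isnan(count)))
    for category, tag_counts in table_dict.items()
]
print(result)
```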

Answer 3

Here is one way:

>>> from collections import Counter
>>> import numpy as np
>>> A = np.array([10, 10, 20, 10, 10, 10, 20, 10, 20, 20])
>>> B = np.array(['hello', 'world', 'how', 'are', 'you', 'world', 'you', 'how', 'hello', 'hello'])
>>> [(int(i), list(Counter(B[A == i].tolist()).items())) for i in np.unique(A)]
[(10, [('hello', 1), ('world', 2), ('are', 1), ('you', 1), ('how', 1)]), (20, [('how', 1), ('you', 1), ('hello', 2)])]
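To see what the comprehension is doing, here is the same idea unpacked for a single category. It is a sketch only; the intermediate names (`mask`, `tags_for_10`, `counts_for_10`) are made up for illustration.

```python
from collections import Counter

import numpy as np

A = np.array([10, 10, 20, 10, 10, 10, 20, 10, 20, 20])
B = np.array(['hello', 'world', 'how', 'are', 'you', 'world', 'you', 'how', 'hello', 'hello'])

mask = A == 10                   # boolean array: True where the category is 10
tags_for_10 = B[mask].tolist()   # plain Python strings for that category
counts_for_10 = Counter(tags_for_10)
print(counts_for_10)
```

Note that the mask is rebuilt once per category, so the whole array is scanned as many times as there are distinct categories.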

Answer 4

"but I'm sure numpy has some nifty way of doing it more efficiently"

And you are right! Here is the code:

import numpy

# convert categories and tags to integer codes
category_lookup, categories = numpy.unique(A, return_inverse=True)
tag_lookup, tags = numpy.unique(B, return_inverse=True)

# count every (category, tag) pair; the index must be a tuple of arrays
statistics = numpy.zeros([len(category_lookup), len(tag_lookup)])
numpy.add.at(statistics, (categories, tags), 1)

# keep only the tags that actually occur in each category
result = {}
for category, stat in zip(category_lookup, statistics):
    result[category] = dict(zip(tag_lookup[stat != 0], stat[stat != 0]))

For an explanation, see numpy tips and tricks. This gives the expected answer:

{10: {'are': 1.0, 'hello': 1.0, 'how': 1.0, 'world': 2.0, 'you': 1.0}, 20: {'hello': 2.0, 'how': 1.0, 'you': 1.0}}

I admit this is not the clearest way to do it (see the pandas solution), but it is really fast when you have a huge amount of data.
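The two ingredients of this approach can be seen on toy data (the arrays below are invented for illustration and are not from the question). With `return_inverse=True`, `numpy.unique` also returns, for every element, its position in the sorted unique array. `numpy.add.at` accumulates correctly when the same index pair occurs more than once, whereas a plain fancy-indexed `+=` counts each repeated pair only once.

```python
import numpy

labels = numpy.array(['b', 'a', 'b', 'c'])
lookup, codes = numpy.unique(labels, return_inverse=True)
# lookup holds the sorted unique labels; lookup[codes] rebuilds labels

rows = numpy.array([0, 0, 1])
cols = numpy.array([2, 2, 0])   # the pair (0, 2) appears twice

buffered = numpy.zeros((2, 3), dtype=int)
buffered[rows, cols] += 1                   # repeated pair is counted once

unbuffered = numpy.zeros((2, 3), dtype=int)
numpy.add.at(unbuffered, (rows, cols), 1)   # repeated pair is counted twice

print(buffered)
print(unbuffered)
```

Passing `dtype=int` to `numpy.zeros`, as done here, would also make the counts in the answer integers instead of floats.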

Answer 5

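Python: NumPy makes counting occurrences simple:

# import NumPy
import numpy as np

arr = np.array([0, 1, 2, 2, 3, 3, 7, 3, 4, 0, 4, 4, 0, 4, 5, 0, 5, 9, 5, 9, 5, 8, 5])
print(np.sum(arr == 4))  # count occurrences of the number 4

unique, counts = np.unique(arr, return_counts=True)
print(unique, counts)

4
[0 1 2 3 4 5 7 8 9] [4 1 2 3 4 5 1 1 2]

The output is shown above.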
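The same `return_counts=True` idea can be pointed back at the original question. The sketch below is one assumed way to do it, not something stated in the answers: it encodes each (category, tag) pair as a single integer and counts the unique codes. All variable names are illustrative.

```python
import numpy as np

A = np.array([10, 10, 20, 10, 10, 10, 20, 10, 20, 20])
B = np.array(['hello', 'world', 'how', 'are', 'you', 'world', 'you', 'how', 'hello', 'hello'])

category_lookup, category_codes = np.unique(A, return_inverse=True)
tag_lookup, tag_codes = np.unique(B, return_inverse=True)

# One integer per (category, tag) pair.
pair_codes = category_codes * len(tag_lookup) + tag_codes
unique_pairs, counts = np.unique(pair_codes, return_counts=True)

# Decode each pair code back into its category and tag.
result = {}
for pair, count in zip(unique_pairs.tolist(), counts.tolist()):
    cat_idx, tag_idx = divmod(pair, len(tag_lookup))
    category = int(category_lookup[cat_idx])
    result.setdefault(category, {})[str(tag_lookup[tag_idx])] = count
print(result)
```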

Counting occurrences in a numpy array
