Question

I am using Python / pandas.

I have a DataFrame like this:

     date         id         my_column
0    31.07.20     128909     ['hey', 'hi']
1    31.07.20     128914     ['hi']
3    31.07.20     853124     ['hi', 'hello', 'hey']
4    30.07.20     123456     ['hey']
...

The DataFrame has more than 1,000,000 rows. I want the 10 most frequent words in the my_column column.

Any help is appreciated.

Answer 1

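Use Series.explode together with Series.value_counts. By default value_counts sorts the result by count in descending order, so for the top 10 you only need the first 10 index values: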

out = df['my_column'].explode().value_counts().index[:10].tolist()
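
To see the counts as well as the words, a minimal sketch along the same lines follows. The sample frame is illustrative data modeled on the question, and the names counts, top_counts and top_words are chosen here for the example only.

```python
import pandas as pd

# Illustrative sample data modeled on the question
df = pd.DataFrame({
    "id": [128909, 128914, 853124, 123456],
    "my_column": [["hey", "hi"], ["hi"], ["hi", "hello", "hey"], ["hey"]],
})

# explode() gives one row per list element;
# value_counts() counts them and sorts by count, descending
counts = df["my_column"].explode().value_counts()

top_counts = counts.head(10)            # Series mapping word -> count
top_words = top_counts.index.tolist()   # only the words
```

Keeping the Series from head(10) instead of only the index can be handy when the frequencies themselves are needed.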

Alternatively, you can use a pure Python solution to flatten the lists and count the top 10:

from collections import Counter
from itertools import chain

c = Counter(chain.from_iterable(df['my_column']))
out = [a for a, b in c.most_common(10)]
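
As a small self-contained sketch of the same idea: most_common returns (word, count) pairs, so the counts are available too. The list rows below is a hypothetical stand-in for df['my_column'], since chain.from_iterable accepts any iterable of lists.

```python
from collections import Counter
from itertools import chain

# Hypothetical stand-in for df['my_column']; any iterable of lists works
rows = [["hey", "hi"], ["hi"], ["hi", "hello", "hey"], ["hey"]]

# Flatten all lists lazily and count every element
c = Counter(chain.from_iterable(rows))

pairs = c.most_common(10)             # list of (word, count) tuples
words = [word for word, _ in pairs]   # only the words
```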
