Question

I am trying to count the frequency of elements in a column of a pandas DataFrame, where each cell holds a list.

Some toy data:

d = pd.DataFrame({'letters':[['a', 'b', 'c'], np.nan, ['a', 'e', 'd', 'c'], ['a', 'e', 'c']]})

The best I could come up with is to iterate over the rows and add the values to a dictionary:

letter_count = {}
for i in range(len(d)):
    if d.iloc[i, ]['letters'] is np.nan:
        continue
    else:
        for letter in d.iloc[i, ]['letters']:
            letter_count[letter] = letter_count.get(letter, 0) + 1

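This works for me, but it is not very fast because my dataset is large. I suspect that avoiding the explicit for loop would help, but I can't come up with a more pandas-idiomatic approach.

Any help is appreciated.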

Answer 1

Use chain.from_iterable to flatten the lists, then Counter to count:

from itertools import chain
from collections import Counter

pd.Series(Counter(chain.from_iterable(d.letters.dropna())))

a    3
b    1
c    3
e    2
d    1
dtype: int64

Alternatively, use value_counts for the counting step:

pd.Series(list(chain.from_iterable(d.letters.dropna()))).value_counts()

a    3
c    3
e    2
b    1
d    1
dtype: int64

Alternatively, np.unique also performs well:

u, c = np.unique(list(chain.from_iterable(d.letters.dropna())), return_counts=True)

pd.Series(dict(zip(u, c)))

a    3
b    1
c    3
d    1
e    2
dtype: int64
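For anyone who wants to run the three variants side by side, here is a minimal self-contained sketch. It rebuilds the toy data from the question and sorts each result by index so they can be compared; the variable names (flat, by_counter, by_value_counts, by_unique) are only illustrative and not part of the original answer.

```python
from collections import Counter
from itertools import chain

import numpy as np
import pandas as pd

# Toy data from the question: a column of lists with one missing row.
d = pd.DataFrame({'letters': [['a', 'b', 'c'], np.nan,
                              ['a', 'e', 'd', 'c'], ['a', 'e', 'c']]})

# Flatten once; dropna() removes the NaN row so only lists are chained.
flat = list(chain.from_iterable(d['letters'].dropna()))

# 1) collections.Counter
by_counter = pd.Series(Counter(flat)).sort_index()

# 2) Series.value_counts
by_value_counts = pd.Series(flat).value_counts().sort_index()

# 3) np.unique with return_counts=True
u, c = np.unique(flat, return_counts=True)
by_unique = pd.Series(dict(zip(u, c))).sort_index()

print(by_counter.to_dict())
```

Which variant is fastest likely depends on the data size and the number of distinct values, so it is worth timing them on a sample of the real dataset.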

Answer 2

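Another option is unnesting, a user-defined helper function that is not part of pandas and whose definition is not shown here: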

unnesting(d.dropna(),['letters'])['letters'].value_counts()
Out[71]: 
a    3
c    3
e    2
d    1
b    1
Name: letters, dtype: int64
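Since the unnesting helper is not defined in this post, the line above will not run on its own. Assuming pandas 0.25 or later, the built-in Series.explode can probably serve the same purpose here. This is a sketch of a substitute, not the answerer's own helper:

```python
import numpy as np
import pandas as pd

# Toy data from the question.
d = pd.DataFrame({'letters': [['a', 'b', 'c'], np.nan,
                              ['a', 'e', 'd', 'c'], ['a', 'e', 'c']]})

# explode() turns each list element into its own row;
# dropping NaN first keeps the missing row out of the counts.
counts = d['letters'].dropna().explode().value_counts()

print(counts.to_dict())
```

The order of tied values in the value_counts output may differ between pandas versions, so compare the counts themselves rather than the row order.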

Efficient way to get frequency of elements in a pandas column of lists
