Question

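I have a dataframe with a column of sentences, and I'm trying to create a new column equal to the number of times any string from a list of strings appears in each sentence.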

For example, the relevant dataframe looks like

book['sentences']
0 The brown dog jumped over the big moon
1 The brown fox slid under the brown log

I'm trying to count the number of times "brown", "over", and "log" appear in each sentence (i.e. the new column would equal 2 and 3).

I know I can do this with str.count, but only for one string at a time, and then I have to add the results up:

book['count_brown'] = book['sentences'].str.count('brown')
book['count_over'] = book['sentences'].str.count('over')
book['count_log'] = book['sentences'].str.count('log')
book['count'] = book['count_brown']+book['count_over']+book['count_log']

The list of strings I'm searching for is over 300 words long, so even a loop doesn't seem optimal. Is there a better way?

Answer 1

Ganky!

lst = ['brown', 'over', 'log']

book['sentences'].str.extractall(
    '({})'.format('|'.join(lst))
).groupby(level=0)[0].value_counts().unstack(fill_value=0)

0  brown  log  over
0      1    0     1
1      2    1     0
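If only the per-row total is needed rather than a per-word breakdown, the same alternation pattern can probably be passed straight to str.count. The following is a minimal sketch under that assumption; the sample frame is rebuilt from the question, and re.escape is an addition here (not part of the answer above) to guard against regex metacharacters in a longer word list.

```python
import re
import pandas as pd

# Sample data reconstructed from the question
book = pd.DataFrame({'sentences': [
    'The brown dog jumped over the big moon',
    'The brown fox slid under the brown log',
]})
lst = ['brown', 'over', 'log']

# Escape each word so it is matched literally, then join with '|'
# so that a match on any one of the words is counted.
pattern = '|'.join(re.escape(w) for w in lst)

# str.count returns the number of non-overlapping regex matches per row
book['count'] = book['sentences'].str.count(pattern)
```

Note that this counts substrings, so 'over' would also match inside a word such as 'overt'; wrapping the pattern in word boundaries (r'\b(?:...)\b') is one way to restrict it to whole words.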

Answer 2

Similar to piRSquared's solution, but using get_dummies and sum for the counting.

df
                                sentences
0  The brown dog jumped over the big moon
1  The brown fox slid under the brown log

words = ['brown', 'over', 'log']
# extract every match, one-hot encode it, then sum per original row (index level 0)
df = df.sentences.str.extractall('({})'.format('|'.join(words)))\
                           .iloc[:, 0].str.get_dummies().groupby(level=0).sum()
df
   brown  log  over
0      1    0     1
1      2    1     0

If you want the total count of all words per row in a single column, just sum along axis 1.

df.sum(1)
0    2
1    3
dtype: int64
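One caveat with the extractall approach: a sentence that contains none of the words produces no match rows, so it drops out of the result entirely. A possible workaround, sketched below with an extra made-up third sentence, is to reindex the counts against the original index and the full word list so that missing rows and columns come back as zeros.

```python
import pandas as pd

# The third sentence is a hypothetical addition that contains none of the words
df = pd.DataFrame({'sentences': [
    'The brown dog jumped over the big moon',
    'The brown fox slid under the brown log',
    'A cat sat',
]})
words = ['brown', 'over', 'log']

pattern = '({})'.format('|'.join(words))

counts = (
    df['sentences'].str.extractall(pattern)[0]  # one row per match
    .str.get_dummies()                          # indicator column per matched word
    .groupby(level=0).sum()                     # add up matches per original row
    # restore rows with no match and words that never matched, filled with 0
    .reindex(index=df.index, columns=words, fill_value=0)
)

total = counts.sum(axis=1)
```

Here counts keeps one row per sentence in the original frame, so it can be joined back onto df without alignment surprises.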

Answer 3

With the help of nltk's frequency distribution, you can do this quite easily, i.e.

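import nltk  # word_tokenize requires NLTK's Punkt tokenizer data to be downloaded beforehand
import pandas as pd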
lst = ['brown', 'over', 'log']
ndf = df['sentences'].apply(nltk.tokenize.word_tokenize).apply(nltk.FreqDist).apply(pd.Series)[lst].fillna(0)

Output:

   brown  over  log
0    1.0   1.0  0.0
1    2.0   0.0  1.0

To get the total:

ndf['count'] = ndf.sum(1)

   brown  over  log  count
0    1.0   1.0  0.0    2.0
1    2.0   0.0  1.0    3.0
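If installing nltk is not an option, roughly the same token-frequency idea can be sketched with collections.Counter from the standard library. This is only an approximation of the approach above: it assumes plain whitespace splitting is good enough, whereas word_tokenize also separates punctuation. Like the nltk version, it counts whole tokens rather than substrings.

```python
from collections import Counter
import pandas as pd

df = pd.DataFrame({'sentences': [
    'The brown dog jumped over the big moon',
    'The brown fox slid under the brown log',
]})
lst = ['brown', 'over', 'log']

rows = []
for sentence in df['sentences']:
    # Naive tokenization: split on whitespace only
    freq = Counter(sentence.split())
    # A Counter returns 0 for tokens it has not seen
    rows.append({w: freq[w] for w in lst})

ndf = pd.DataFrame(rows, index=df.index)
ndf['count'] = ndf[lst].sum(axis=1)
```

Because the counts never pass through NaN, the columns stay integer-typed instead of becoming floats.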

Count the number of times multiple substrings appear in a dataframe column
