I have a pandas DataFrame like the one below, with a single column named 'texts':
texts
throne one
bar one
foo two
bar three
foo two
bar two
foo one
foo three
one three
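For each row, I want to count occurrences of the three words 'one', 'two', and 'three', counting a match only when it is a whole word. The expected output looks like this: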
texts counts
throne one 1
bar one 1
foo two 1
bar three 1
foo two 1
bar two 1
foo one 1
foo three 1
one three 2
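As you can see, the count for the first row is 1: 'throne' is not treated as one of the searched values, because the 'one' inside it is not a whole word but part of 'throne'.
Any help with this?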
Answer 0 (score: 7)
Use pd.Series.str.count with a regular expression built by joining words with '|':
words = 'one two three'.split()
df.assign(counts=df.texts.str.count('|'.join(words)))
texts counts
0 throne one 2
1 bar one 1
2 foo two 1
3 bar three 1
4 foo two 1
5 bar two 1
6 foo one 1
7 foo three 1
8 one three 2
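To see why the first row comes out as 2 here, the following minimal sketch uses the standard re module (an illustration added for clarity, not part of the original answer) to list what the unanchored pattern actually matches in 'throne one':

```python
import re

# Same pattern as above: 'one|two|three', with no word boundaries.
pattern = '|'.join('one two three'.split())

# findall returns every non-overlapping match, scanning left to right.
matches = re.findall(pattern, 'throne one')
print(matches)  # ['one', 'one']: the first 'one' comes from inside 'throne'
```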
To avoid counting the 'one' inside 'throne', we can add word boundaries (\b) to the regular expression:
words = 'one two three'.split()
df.assign(counts=df.texts.str.count('|'.join(map(r'\b{}\b'.format, words))))
texts counts
0 throne one 1
1 bar one 1
2 foo two 1
3 bar three 1
4 foo two 1
5 bar two 1
6 foo one 1
7 foo three 1
8 one three 2
For flair, use a raw f-string (Python 3.6+):
words = 'one two three'.split()
df.assign(counts=df.texts.str.count('|'.join(fr'\b{w}\b' for w in words)))
texts counts
0 throne one 1
1 bar one 1
2 foo two 1
3 bar three 1
4 foo two 1
5 bar two 1
6 foo one 1
7 foo three 1
8 one three 2
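For completeness, here is a self-contained sketch that builds the sample DataFrame and applies the word-boundary approach. Two details are assumptions beyond the answer above: re.escape is applied to each word in case a search term ever contains regex metacharacters, and the alternatives are wrapped in a single non-capturing group so that \b is written only once. Neither changes the result for 'one', 'two', and 'three'.

```python
import re

import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({'texts': [
    'throne one', 'bar one', 'foo two', 'bar three', 'foo two',
    'bar two', 'foo one', 'foo three', 'one three',
]})

words = ['one', 'two', 'three']

# Assumption: escape each word so characters like '.' or '+' are matched literally.
# The (?:...) group lets one pair of \b anchors cover every alternative.
pattern = r'\b(?:' + '|'.join(map(re.escape, words)) + r')\b'

result = df.assign(counts=df['texts'].str.count(pattern))
print(result)
```

With this data, the counts column should be 1 for every row except the last, which contains two whole-word matches.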