I have a pandas DataFrame df with a string column Posts, as shown below:
df['Posts']
0 this is an example sentence
1 this too is an example too is an example sentence
2 yup, still an example sentence
I also have another DataFrame df1 whose column Phrases holds a list of tags, as shown below:
df1['Phrases']
0 example
1 example sentence
2 is an
3 is an example
4 yup
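For anyone who wants to reproduce this, here is a minimal sketch of both frames. Only the printed values come from the post; the way the frames are constructed is an assumption.

```python
import pandas as pd

# Rebuild the two frames from the values printed above.
# The constructor calls are assumed; only the strings are from the post.
df = pd.DataFrame({"Posts": [
    "this is an example sentence",
    "this too is an example too is an example sentence",
    "yup, still an example sentence",
]})

df1 = pd.DataFrame({"Phrases": [
    "example",
    "example sentence",
    "is an",
    "is an example",
    "yup",
]})

print(df["Posts"])
print(df1["Phrases"])
```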
I need a DataFrame with the unique count of occurrences of each entry of Phrases in df1 within Posts of df.
Answer 0 (score: 2)
Use str.extract, check for non-missing values with notnull, and count the occurrences with sum. Each True is treated like 1 when summing:
df1['Count'] = [df['Posts'].str.extract('(' + x + ')', expand=False).notnull().sum()
                for x in df1['Phrases']]
print(df1)
            Phrases  Count
0           example      3
1  example sentence      3
2             is an      2
3     is an example      2
4               yup      1
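The same per-post counts can probably be obtained with str.contains, which returns booleans directly, so the notnull step is not needed. This is an alternative sketch and not the method above. It assumes the phrases should be matched literally (regex=False) and rebuilds the sample frames from the question.

```python
import pandas as pd

# Sample data rebuilt from the question (construction is assumed).
df = pd.DataFrame({"Posts": [
    "this is an example sentence",
    "this too is an example too is an example sentence",
    "yup, still an example sentence",
]})
df1 = pd.DataFrame({"Phrases": [
    "example", "example sentence", "is an", "is an example", "yup",
]})

# str.contains gives one boolean per post; summing counts the True values,
# so a phrase that appears twice in one post is still counted once.
df1["Count"] = [
    int(df["Posts"].str.contains(phrase, regex=False).sum())
    for phrase in df1["Phrases"]
]
print(df1)
```

With regex=False the phrase is treated as plain text, so characters such as ( or + in a phrase would not be interpreted as regex syntax.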
Edit:
To avoid counting partial matches of words, use word boundaries:
df1['Count'] = [df['Posts'].str.extract(r'(\b' + x + r'\b)', expand=False).notnull().sum()
                for x in df1['Phrases']]
print(df1)
            Phrases  Count
0           example      3
1  example sentence      3
2             is an      2
3     is an example      2
4               yup      1
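The sample phrases happen to give the same counts with and without word boundaries, so the effect is easier to see with a phrase that is only part of a word. The sketch below uses a hypothetical phrase, "sent", which is not in the original data. The use of str.contains and re.escape is also an assumption here, chosen to keep the comparison short.

```python
import re
import pandas as pd

# Posts rebuilt from the question (construction is assumed).
posts = pd.Series([
    "this is an example sentence",
    "this too is an example too is an example sentence",
    "yup, still an example sentence",
])

phrase = "sent"  # hypothetical phrase, only a fragment of "sentence"
escaped = re.escape(phrase)  # match the phrase literally inside the regex

# Plain substring match: "sent" is found inside "sentence" in every post.
plain = int(posts.str.contains(escaped).sum())

# Word-boundary match: "sent" must stand alone as a word, so nothing matches.
bounded = int(posts.str.contains(r"\b" + escaped + r"\b").sum())

print(plain, bounded)  # prints: 3 0
```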