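I have 100 keywords in df1 and 10,000 articles in df2. I want to count how many articles contain each keyword. For example, about 20 articles contain the keyword "apple".
I tried df.str.contains(), but then I have to run the count separately for every keyword. Is there an efficient way to do this?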
import pandas as pd
df1 = pd.DataFrame(['apple', 'mac', 'pc', 'ios', 'lg'], columns=['keywords'])
df2 = pd.DataFrame(['apple is good for health', 'mac is another pc', 'today is sunday', 'Star wars pc game', 'ios is a system,lg is not', 'lg is a japan company '], columns=['article'])
Expected result:
1 article contain "apple"
1 article contain 'mac'
2 article contain 'pc'
1 article contain "ios"
2 article contain 'lg'
Answer 0 (score: 2)
I think you need str.contains to get a boolean Series and sum to count the True values, inside a list comprehension over all keywords, with the result passed to the DataFrame constructor:
L = [(x, df2['article'].str.contains(x).sum()) for x in df1['keywords']]
#alternative solution
#L = [(x, sum(x in article for article in df2['article'])) for x in df1['keywords']]
df3 = pd.DataFrame(L, columns=['keyword', 'count'])
print (df3)
  keyword  count
0   apple      1
1     mac      1
2      pc      2
3     ios      1
4      lg      2
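One thing to keep in mind with the approach above: str.contains treats each keyword as a regular expression by default and matches it anywhere in the text, so "pc" would also be found inside "upcoming", and regex metacharacters in a keyword are read as pattern syntax rather than literal text. The sketch below is only an illustration, not part of the original answer; the sample data and the helper name whole_word_count are made up for the example. It compares a literal substring match with a whole-word match and uses na=False so that missing articles count as non-matches.

```python
import re
import pandas as pd

# Hypothetical sample data, chosen to show the difference between the two modes.
keywords = pd.Series(['apple', 'pc', 'c++'])
articles = pd.Series(['apple pie', 'an upcoming release', 'a new pc', 'c++ is fast', None])

# Literal substring match: regex=False disables regex interpretation,
# na=False turns missing articles into False instead of NaN.
substring_counts = {
    k: int(articles.str.contains(k, regex=False, na=False).sum())
    for k in keywords
}

def whole_word_count(keyword, texts):
    """Count texts containing keyword as a whole word (hypothetical helper)."""
    # Escape the keyword, then forbid word characters directly before and after it.
    pattern = r'(?<!\w)' + re.escape(keyword) + r'(?!\w)'
    return int(texts.str.contains(pattern, regex=True, na=False).sum())

word_counts = {k: whole_word_count(k, articles) for k in keywords}

print(substring_counts)
print(word_counts)
```

Whether substring or whole-word matching is the right choice depends on what "contains the keyword" should mean for your articles.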
If you only want to print the output:
for x in df1['keywords']:
    count = df2['article'].str.contains(x).sum()
    # alternative if there are no NaNs: sum over a generator that tests membership with `in`
    #count = sum(x in article for article in df2['article'])
    print ('{} article contain "{}"'.format(count, x))
1 article contain "apple"
1 article contain "mac"
2 article contain "pc"
1 article contain "ios"
2 article contain "lg"
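Since the question asks about efficiency with 100 keywords and 10,000 articles, a possible alternative is to scan each article once with a single combined pattern instead of once per keyword. The following is a minimal sketch under some assumptions, not the method from the answer above: it assumes plain substring matching is wanted and that no keyword is a substring of another keyword (regex alternation returns non-overlapping matches, so with keywords like "app" and "apple" the counts could differ from the per-keyword loop). The variable names are illustrative.

```python
import re
import pandas as pd

df1 = pd.DataFrame(['apple', 'mac', 'pc', 'ios', 'lg'], columns=['keywords'])
df2 = pd.DataFrame(['apple is good for health', 'mac is another pc',
                    'today is sunday', 'Star wars pc game',
                    'ios is a system,lg is not', 'lg is a japan company '],
                   columns=['article'])

# One alternation pattern covering every keyword, escaped so each is matched literally.
pattern = '|'.join(re.escape(k) for k in df1['keywords'])

# All keyword hits per article; missing articles are dropped first.
found = df2['article'].dropna().str.findall(pattern)

# Keep each keyword at most once per article, so articles are counted, not occurrences.
unique_hits = found.apply(lambda hits: list(set(hits)))

# One row per (article, keyword) hit, then count rows per keyword.
# Keywords that never appear get a count of 0 through reindex.
counts = (unique_hits.explode()
                     .value_counts()
                     .reindex(df1['keywords'].tolist(), fill_value=0))

df3 = counts.rename_axis('keyword').reset_index(name='count')
print(df3)
```

On the sample data this should give the same counts as the list comprehension. Whether it is actually faster on the real data is worth measuring, since that depends on the keywords and article lengths.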