Suppose I have two DataFrames.
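One contains a list of subjects; the other contains text from which those subjects should be extracted.
I would like the output for the text DataFrame to list the subjects found in each row. The two DataFrames are defined as follows: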
import pandas as pd
sub = pd.DataFrame(['Little Red', 'Grow Your', 'James Bond', 'Tom Brady'])
text = pd.DataFrame(['Little Red Corvette must Grow Your ego', 'Grow Your Beans', 'James Dean and his Little Red coat', 'I love pasta'])
Any ideas on how to achieve this? I was looking at this question: Check if words in one dataframe appear in another (python 3, pandas), but it does not quite produce my desired output. Thanks.
Answer 0 (score: 5)
Use str.findall, joining all values of sub with | and wrapping each one in regex word boundaries (\b):
pat = '|'.join(r"\b{}\b".format(x) for x in sub[0])
text['new'] = text[0].str.findall(pat).str.join(', ')
print (text)
                                        0                    new
0  Little Red Corvette must Grow Your ego  Little Red, Grow Your
1                         Grow Your Beans              Grow Your
2      James Dean and his Little Red coat             Little Red
3                            I love pasta
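A caveat on the pattern above: the subjects are inserted into the regex as-is, so a subject containing characters such as "." or "+" would be interpreted as regex syntax. The sample data has no such characters, so this is only a precaution. A minimal sketch, assuming literal matching is what is wanted, passes each subject through re.escape first:

```python
import re
import pandas as pd

sub = pd.DataFrame(['Little Red', 'Grow Your', 'James Bond', 'Tom Brady'])
text = pd.DataFrame(['Little Red Corvette must Grow Your ego', 'Grow Your Beans',
                     'James Dean and his Little Red coat', 'I love pasta'])

# re.escape makes every subject match literally, even if it contains
# regex metacharacters (an assumption about intent, not part of the answer).
pat = '|'.join(r"\b{}\b".format(re.escape(x)) for x in sub[0])

# findall returns a list of matches per row; join turns it into one string.
text['new'] = text[0].str.findall(pat).str.join(', ')
result = text['new'].tolist()
print(text)
```

For the sample data the result is the same as without escaping; rows with no match still get an empty string.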
If you want NaN for rows with no match, use loc:
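# Build one alternation pattern, with word boundaries around each subject
pat = '|'.join(r"\b{}\b".format(x) for x in sub[0])
lists = text[0].str.findall(pat)
# An empty list is falsy, so m is True only for rows with at least one match
m = lists.astype(bool)
# Assign only the matched rows; the remaining rows stay NaN
text.loc[m, 'new'] = lists.loc[m].str.join(',')
print(text)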
                                        0                   new
0  Little Red Corvette must Grow Your ego  Little Red,Grow Your
1                         Grow Your Beans             Grow Your
2      James Dean and his Little Red coat            Little Red
3                            I love pasta                   NaN
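One practical reason to prefer NaN over an empty string is that pandas' missing-value helpers then work on the new column. The sketch below repeats the loc approach and then uses notna and isna to separate matched from unmatched rows; the names matched and n_unmatched are only illustrative.

```python
import pandas as pd

sub = pd.DataFrame(['Little Red', 'Grow Your', 'James Bond', 'Tom Brady'])
text = pd.DataFrame(['Little Red Corvette must Grow Your ego', 'Grow Your Beans',
                     'James Dean and his Little Red coat', 'I love pasta'])

pat = '|'.join(r"\b{}\b".format(x) for x in sub[0])
lists = text[0].str.findall(pat)
m = lists.astype(bool)  # False where the list of matches is empty
text.loc[m, 'new'] = lists.loc[m].str.join(',')

# Rows that contain at least one subject
matched = text[text['new'].notna()]
# Number of rows in which no subject was found
n_unmatched = int(text['new'].isna().sum())
print(matched)
print(n_unmatched)
```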