Question

Suppose I have two DataFrames:

sub = pd.DataFrame(['Little Red', 'Grow Your', 'James Bond', 'Tom Brady'])
text = pd.DataFrame(['Little Red Corvette must Grow Your ego', 'Grow Your Beans', 'James Dean and his Little Red coat', 'I love pasta'])

One contains various subjects; the other contains text from which those subjects should be extracted.

I would like the text DataFrame to end up with an extra column listing the subjects found in each row.

Any ideas on how to achieve this? I was looking at this question: Check if words in one dataframe appear in another (python 3, pandas), but it does not produce quite the output I want. Thanks.

Answer 1

Use str.findall with a pattern that joins all values of sub with |, wrapping each value in regex word boundaries (\b):

pat = '|'.join(r"\b{}\b".format(x) for x in sub[0])
text['new'] = text[0].str.findall(pat).str.join(', ')
print (text)
                                        0                    new
0  Little Red Corvette must Grow Your ego  Little Red, Grow Your
1                         Grow Your Beans              Grow Your
2      James Dean and his Little Red coat             Little Red
3                            I love pasta
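
One caveat worth sketching: if a phrase in sub contained regex metacharacters (for example a dot or parentheses), the joined pattern would interpret them as regex syntax. The variant below is a minimal, self-contained sketch that assumes the phrases should be matched literally and therefore passes each one through re.escape; apart from that assumption it uses the same data and the same findall/join approach as above.

```python
import re

import pandas as pd

sub = pd.DataFrame(['Little Red', 'Grow Your', 'James Bond', 'Tom Brady'])
text = pd.DataFrame(['Little Red Corvette must Grow Your ego',
                     'Grow Your Beans',
                     'James Dean and his Little Red coat',
                     'I love pasta'])

# Escape each phrase so it is matched literally (assumption: the phrases
# are plain text, not regular expressions), then wrap it in word boundaries.
pat = '|'.join(r"\b{}\b".format(re.escape(x)) for x in sub[0])

# findall returns a list of matches per row; join turns each list into a string.
text['new'] = text[0].str.findall(pat).str.join(', ')
print(text)
```

Rows without any match get an empty string, because joining an empty list yields ''.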

If you want NaN for rows with no match, assign only the matching rows with loc:

pat = '|'.join(r"\b{}\b".format(x) for x in sub[0])
lists = text[0].str.findall(pat)
m = lists.astype(bool)
text.loc[m, 'new'] = lists.loc[m].str.join(',')
print (text)
                                        0                   new
0  Little Red Corvette must Grow Your ego  Little Red,Grow Your
1                         Grow Your Beans             Grow Your
2      James Dean and his Little Red coat            Little Red
3                            I love pasta                   NaN
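
As an alternative to the loc assignment, the same result can probably be reached by joining first and then masking the empty strings. This is only a sketch under the assumption that an empty joined string always means "no match"; it relies on Series.where, which replaces values where the condition is False with NaN by default.

```python
import pandas as pd

sub = pd.DataFrame(['Little Red', 'Grow Your', 'James Bond', 'Tom Brady'])
text = pd.DataFrame(['Little Red Corvette must Grow Your ego',
                     'Grow Your Beans',
                     'James Dean and his Little Red coat',
                     'I love pasta'])

pat = '|'.join(r"\b{}\b".format(x) for x in sub[0])

# Join the matches per row; rows without matches become ''.
joined = text[0].str.findall(pat).str.join(',')

# Keep non-empty strings and let where() fill everything else with NaN.
text['new'] = joined.where(joined.ne(''))
print(text)
```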

Find phrases stored in a DataFrame in sentences found in another DataFrame
