Extract the words around a word and insert the results into dataframe columns

Time: 2017-08-02 20:26:13

Tags: python pandas

I have a dataframe df with 3 columns, as follows:

company | year | text  
Apple   | 2016 |"The Company sells its products worldwide through its..."  

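I want to search df['text'] for the word "products", extract the 3 words before it and the 3 words after it, and insert them into two new dataframe columns, df['before'] and df['after'].

This is what I have done so far: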

import re

m = re.search(r'((?:\w+\W+){,3})(products)\W+((?:\w+\W+){,3})', df['text'])
if m:
    l = [x.strip().split() for x in m.groups()]
    df['left'], df['right'] = l[0], l[2]

However, I get this message:

TypeError: expected string or buffer

How can I make this work?

1 answer:

Answer 0: (score: 3)

Use pd.Series.str.extract, which applies the regular expression to every row of the column:

pat = r'(?P<before>(?:\w+\W+){,3})products\W+(?P<after>(?:\w+\W+){,3})'
new = df.text.str.extract(pat, expand=True)

new

               before                     after
0  Company sells its   worldwide through its...
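
To try the extraction end to end, here is a minimal self-contained sketch. The one-row dataframe is a hypothetical stand-in for the question's data (the original text is truncated, so the sentence ending here is made up), and the final whitespace strip is an optional addition that is not part of the answer above.

```python
import pandas as pd

# Hypothetical one-row frame mirroring the question's example; the sentence
# ending is invented because the original text is truncated.
df = pd.DataFrame({
    "company": ["Apple"],
    "year": [2016],
    "text": ["The Company sells its products worldwide through its retail stores"],
})

# Named groups become the column names of the extracted frame.
pat = r"(?P<before>(?:\w+\W+){,3})products\W+(?P<after>(?:\w+\W+){,3})"

# expand=True returns a DataFrame with one column per capture group.
new = df["text"].str.extract(pat, expand=True)

# Each group ends with the separator that follows its last word,
# so strip the surrounding whitespace from both columns.
new = new.apply(lambda col: col.str.strip())

print(new)
```

Rows in which "products" does not occur should come back as NaN in both columns, so it may be worth checking for missing values before using the result.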

You can then create a new dataframe that includes the new columns:

df.assign(**new)

  company  year                                               text                     after              before
0   Apple  2016  The Company sells its products worldwide throu...  worldwide through its...  Company sells its
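
For comparison, the attempt in the question fails because re.search expects a single string, while df['text'] is a whole Series, hence the TypeError. If lists of words are preferred over joined strings, one possible row-by-row alternative is sketched below. The helper name words_around and the two-row sample frame are hypothetical and only serve the illustration.

```python
import re
import pandas as pd

# Same pattern as the question, compiled once for reuse on every row.
PATTERN = re.compile(r"((?:\w+\W+){,3})products\W+((?:\w+\W+){,3})")

def words_around(text):
    """Return (before, after) word lists for the first match, else two empty lists."""
    # Guard against non-string cells such as NaN.
    m = PATTERN.search(text) if isinstance(text, str) else None
    if m is None:
        return [], []
    before, after = (group.split() for group in m.groups())
    return before, after

# Hypothetical sample data: one row with a match, one without.
df = pd.DataFrame({
    "company": ["Apple", "Acme"],
    "year": [2016, 2016],
    "text": [
        "The Company sells its products worldwide through its retail stores",
        "No match in this sentence",
    ],
})

# Call the helper once per string, then split the pairs into two columns.
pairs = [words_around(t) for t in df["text"]]
df["before"] = [before for before, _ in pairs]
df["after"] = [after for _, after in pairs]

print(df[["before", "after"]])
```

The vectorised str.extract call above is usually the simpler choice. This variant is mainly useful when each row needs custom handling, such as a fallback for rows without a match.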