Question

I have a DataFrame df with 3 columns, as follows:

company | year | text  
Apple   | 2016 |"The Company sells its products worldwide through its..."

I want to search for "products" in df['text'], extract the 3 words before and the 3 words after "products", and insert them into two columns of the DataFrame, df['before'] and df['after'].

This is what I have done so far:

m = re.search(r'((?:\w+\W+){,3})(products)\W+((?:\w+\W+){,3})', df['text'])
if m:
    l = [ x.strip().split() for x in m.groups()]
df['left'], df['right'] = l[0], l[2]

However, I get this message:

TypeError: expected string or buffer

How can I get this to work?

Answer 1

Use pd.Series.str.extract, which applies the pattern to every row of the column:

pat = r'(?P<before>(?:\w+\W+){,3})products\W+(?P<after>(?:\w+\W+){,3})'
new = df.text.str.extract(pat, expand=True)

new

               before                     after
0  Company sells its   worldwide through its...
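
The captured groups keep the separator that follows the last word (for example, the trailing space after "its"), and the question's own attempt split each side into a list of words. A minimal sketch of that cleanup, assuming the sample row from the question exactly as shown (including the literal "..." at the end of the text); the name before_words is only illustrative:

```python
import pandas as pd

# Sample row from the question; the text is truncated there and kept as-is.
df = pd.DataFrame({
    "company": ["Apple"],
    "year": [2016],
    "text": ["The Company sells its products worldwide through its..."],
})

pat = r"(?P<before>(?:\w+\W+){,3})products\W+(?P<after>(?:\w+\W+){,3})"
new = df["text"].str.extract(pat, expand=True)

# Remove the leading/trailing whitespace captured by the \W+ parts.
new["before"] = new["before"].str.strip()
new["after"] = new["after"].str.strip()

# Optional: split each phrase into a list of words (illustrative name).
before_words = new["before"].str.split()
```

Rows in which the pattern does not match come back as NaN in both columns, so they may need separate handling.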

You can create a new DataFrame with the new columns:

df.assign(**new)

  company  year                                               text                     after              before
0   Apple  2016  The Company sells its products worldwide throu...  worldwide through its...  Company sells its
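
As for the error in the question: re.search works on a single string, so passing the whole df['text'] column raises the TypeError. If a plain re approach is preferred, the search can be run one row at a time. The following is a rough sketch, not part of the answer above; context_words is a hypothetical helper name, and the parameters word and n are assumptions added for illustration:

```python
import re

def context_words(text, word="products", n=3):
    """Return (before, after): up to n words on each side of the first match."""
    # Same shape as the pattern in the question. Like that pattern, it
    # requires at least one non-word character right after `word`.
    pat = r"((?:\w+\W+){,%d})%s\W+((?:\w+\W+){,%d})" % (n, re.escape(word), n)
    m = re.search(pat, text)
    if m is None:
        return [], []
    return m.group(1).split(), m.group(2).split()

before, after = context_words(
    "The Company sells its products worldwide through its..."
)
```

With pandas, this could be applied per row, for example df['text'].apply(context_words), which yields a Series of (before, after) tuples. The vectorized str.extract call is usually the simpler choice.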
