I have a DataFrame df with 3 columns, as follows:
company | year | text
Apple | 2016 | "The Company sells its products worldwide through its..."
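I want to search df['text'] for the word "products", extract the 3 words before and the 3 words after it, and put them into two new columns of the DataFrame, df['before'] and df['after'].
This is what I have done so far: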
m = re.search(r'((?:\w+\W+){,3})(products)\W+((?:\w+\W+){,3})', df['text'])
if m:
    l = [x.strip().split() for x in m.groups()]
    df['left'], df['right'] = l[0], l[2]
However, I get this message:
TypeError: expected string or buffer
How can I make this work?
Answer 0: (score: 3)
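# Raw string so that \w and \W reach the regex engine unchanged; the named groups become the column names
pat = r'(?P<before>(?:\w+\W+){,3})products\W+(?P<after>(?:\w+\W+){,3})'
new = df.text.str.extract(pat, expand=True)
new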
              before                     after
0  Company sells its   worldwide through its...
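For anyone who wants to try this end to end, here is a minimal self-contained sketch. It assumes the one-row sample from the question (with the text left truncated exactly as posted) and adds a whitespace-stripping step that is not part of the answer above, since the captured groups keep the trailing separator.

```python
import pandas as pd

# Sample row from the question; the text is truncated exactly as it was posted.
df = pd.DataFrame({
    "company": ["Apple"],
    "year": [2016],
    "text": ["The Company sells its products worldwide through its..."],
})

# Named groups become the column names of the extracted frame.
pat = r"(?P<before>(?:\w+\W+){0,3})products\W+(?P<after>(?:\w+\W+){0,3})"

# str.extract applies the pattern to each row and returns one column per group.
new = df["text"].str.extract(pat, expand=True)

# Assumption: trailing whitespace in the captured groups is unwanted.
new = new.apply(lambda col: col.str.strip())

print(new)
```

Rows that do not contain "products" come back as NaN in both columns, so no explicit `if m:` check is needed.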
You can create a new DataFrame that includes the new columns:
df.assign(**new)
  company  year                                               text                     after              before
0   Apple  2016  The Company sells its products worldwide throu...  worldwide through its...  Company sells its
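As for the TypeError in the question: re.search expects a single string, but df['text'] is a whole Series. If you would rather keep the re-based approach, one option is to apply the search to each row. The sketch below is only an illustration; the helper name context_words and the second sample row are hypothetical and were added to show the no-match case.

```python
import re
import pandas as pd

# Same idea as the question's pattern, compiled once and reused for every row.
PATTERN = re.compile(r"((?:\w+\W+){0,3})products\W+((?:\w+\W+){0,3})")

def context_words(text):
    """Return (before, after) strings around 'products', or empty strings."""
    # Hypothetical helper: guards against non-string cells such as NaN.
    m = PATTERN.search(text) if isinstance(text, str) else None
    if m is None:
        return "", ""
    # split() + join() normalises whitespace and drops the trailing separator.
    return " ".join(m.group(1).split()), " ".join(m.group(2).split())

df = pd.DataFrame({
    "text": [
        "The Company sells its products worldwide through its...",
        "No match here",  # hypothetical row to show the no-match case
    ]
})

# apply calls context_words once per row, so re.search always sees a string.
pairs = df["text"].apply(context_words)
df["before"] = [before for before, _ in pairs]
df["after"] = [after for _, after in pairs]

print(df[["before", "after"]])
```

The str.extract version is shorter and is usually preferable; the row-wise version is mainly useful when each match needs extra Python logic.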