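Splitting a column in pandas using a regular expression

Time: 2020-02-24 23:31:26

Tags: regex python-3.x pandas split

This is my first question. I have a pandas DataFrame with a 'Description' column. The column contains a reference and a name, which I want to split into two columns. I have the 'Names' in a separate df.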

#  Description                                   #  Names
---------------------------------------          ---------------
0  A long walk by Miss D'Bus                     0  Teresa Green
1  A day in the country by Teresa Green          1  Tim Burr
2  Falling Trees by Tim Burr                     2  Miss D'Bus
3  Evergreens by Teresa Green
4  Late for Dinner by Miss D'Bus

I have successfully searched the descriptions for a matching name, using a regex string built from all the names:

regex = '$|'.join(map(re.escape, df['Names'])) + '$' 
df['Reference'] = df['Description'].str.split(regex, expand=True)

which gives

#  Description                                   Reference
-----------------------------------------------------------------------
0  A long walk by Miss D'Bus                     A long walk by
1  A day in the country by Teresa Green          A day in the country by
2  Falling Trees by Tim Burr                     Falling Trees by
3  Evergreens by Teresa Green                    Evergreens by
4  Late for Dinner by Miss D'Bus                 Late for Dinner by

But I would like to have the corresponding name (that is, the removed delimiter) as an additional column.

I tried adding *? to the regex.

I tried splitting the 'Description' column using the 'Reference' column:

df['Name'] = df['Description'].str.split(df['Reference'])

I tried slicing the 'Description' column using the length of the 'Reference' string:

# like: df['Name'] = df['Description'].str[-10:]
df['Name'] = df['Description'].str[-(df['Reference'].str.len()):]

But I get a constant slice length.

1 Answer:

Answer 0: (score: 2)

You can use Series.str.extract to pull both pieces of information out of the original column:

import re

regex = r'^(.*?)\s*({})$'.format('|'.join(map(re.escape, df['Names'])))
df[['Reference','Name']] = df['Description'].str.extract(regex, expand=True)

Output:

>>> df
                            Description                Reference          Name
0             A long walk by Miss D'Bus           A long walk by    Miss D'Bus
1  A day in the country by Teresa Green  A day in the country by  Teresa Green
2             Falling Trees by Tim Burr         Falling Trees by      Tim Burr
3            Evergreens by Teresa Green            Evergreens by  Teresa Green
4         Late for Dinner by Miss D'Bus       Late for Dinner by    Miss D'Bus
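
For anyone who wants to reproduce this, here is a minimal self-contained sketch. It rebuilds the sample data from the question by hand and assumes the names live in a separate frame, called names_df here (a hypothetical name; the question only says the names are in a separate df).

```python
import re

import pandas as pd

# Sample data from the question, rebuilt by hand.
df = pd.DataFrame({
    "Description": [
        "A long walk by Miss D'Bus",
        "A day in the country by Teresa Green",
        "Falling Trees by Tim Burr",
        "Evergreens by Teresa Green",
        "Late for Dinner by Miss D'Bus",
    ]
})
# Hypothetical name for the separate frame that holds the known names.
names_df = pd.DataFrame({"Names": ["Teresa Green", "Tim Burr", "Miss D'Bus"]})

# Escape each name so it is matched literally, then join them into one alternation.
alternatives = "|".join(map(re.escape, names_df["Names"]))
pattern = r"^(.*?)\s*({})$".format(alternatives)

# Two capture groups give two columns; rows that do not match get NaN in both.
df[["Reference", "Name"]] = df["Description"].str.extract(pattern, expand=True)
print(df)
```

Because the pattern is anchored with $, a description whose ending is not one of the known names ends up with NaN in both new columns, which makes unmatched rows easy to spot with df['Name'].isna().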

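The regex will look like ^(.*?)\s*(Teresa\ Green|Tim\ Burr|Miss\ D\'Bus)$ (on Python 3.7 and later, re.escape no longer escapes the apostrophe, so the last alternative appears as Miss\ D'Bus; the match is the same):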

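  • ^ - start of the string
  • (.*?) - Group 1 ("Reference"): any zero or more characters other than line breaks, as few as possible
  • \s* - zero or more whitespace characters
  • (Teresa\ Green|Tim\ Burr|Miss\ D\'Bus) - Group 2 ("Name"): an alternation group containing the known names
  • $ - end of the string.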
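
To see the two groups from the breakdown above in isolation, the same pattern can be tried on a single string with the standard re module. This is only an illustrative sketch: the names list is typed in by hand, and the second description ("Night Shift by Anne Other") is made up to show a non-matching case.

```python
import re

# Known names, typed in by hand for this illustration.
names = ["Teresa Green", "Tim Burr", "Miss D'Bus"]
pattern = re.compile(r"^(.*?)\s*({})$".format("|".join(map(re.escape, names))))

match = pattern.match("A day in the country by Teresa Green")
reference, name = match.groups()
print(repr(reference))  # 'A day in the country by'
print(repr(name))       # 'Teresa Green'

# Made-up description ending in a name that is not in the list: no match at all.
no_match = pattern.match("Night Shift by Anne Other")
print(no_match)  # None
```

Note that \s* sits outside both groups, so the space before the name is dropped from the reference rather than captured.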