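My first question... I have a Pandas dataframe with a column 'Description'. Each value contains a reference followed by a name, and I want to split these into two columns. The 'Names' are in a separate df: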
#  Description                             #  Names
---------------------------------------    ---------------
0  A long walk by Miss D'Bus               0  Teresa Green
1  A day in the country by Teresa Green    1  Tim Burr
2  Falling Trees by Tim Burr               2  Miss D'Bus
3  Evergreens by Teresa Green
4  Late for Dinner by Miss D'Bus
Using a regex built from all the names, I have managed to match the name at the end of each description and split it off:
regex = '$|'.join(map(re.escape, df['Names'])) + '$'
df['Reference'] = df['Description'].str.split(regex, expand=True)
which gives
#  Description                           Reference
-----------------------------------------------------------------------
0  A long walk by Miss D'Bus             A long walk by
1  A day in the country by Teresa Green  A day in the country by
2  Falling Trees by Tim Burr             Falling Trees by
3  Evergreens by Teresa Green            Evergreens by
4  Late for Dinner by Miss D'Bus         Late for Dinner by
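But I would like to have the corresponding name (that is, the delimiter that was removed) as an additional column.
I tried adding *? to the regex.
I also tried splitting the 'Description' column by the 'Reference' column: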
df['Name'] = df['Description'].str.split(df['Reference'])
And I tried slicing the 'Description' column using the length of the 'Reference' string:
# like: df['Name'] = df['Description'].str[-10:]
df['Name'] = df['Description'].str[-(df['Reference'].str.len()):]
But I get a constant slice length.
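For reference, slicing through `.str[...]` takes the same integer bounds for every row, so a per-row length needs a row-wise operation. A minimal sketch, assuming 'Reference' already holds the prefix that was split off (possibly with a trailing space); the two-row frame here is illustrative sample data, not the full table:

```python
import pandas as pd

# Illustrative subset of the question's data, with 'Reference' already filled in.
df = pd.DataFrame({
    "Description": ["A long walk by Miss D'Bus", "Falling Trees by Tim Burr"],
    "Reference": ["A long walk by ", "Falling Trees by "],
})

# Row-wise slice: drop the first len(Reference) characters of each Description,
# then strip any whitespace left around the remaining name.
df["Name"] = [
    desc[len(ref):].strip()
    for desc, ref in zip(df["Description"], df["Reference"])
]
```

This keeps the two-step approach from the attempts above; it only works when each 'Reference' really is a prefix of its 'Description'.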
Answer 0 (score: 2)
You can use Series.str.extract to get both pieces of information from the original column:
regex = r'^(.*?)\s*({})$'.format('|'.join(map(re.escape, df['Names'])))
df[['Reference','Name']] = df['Description'].str.extract(regex, expand=True)
Output:
>>> df
                            Description                Reference          Name
0             A long walk by Miss D'Bus           A long walk by    Miss D'Bus
1  A day in the country by Teresa Green  A day in the country by  Teresa Green
2             Falling Trees by Tim Burr         Falling Trees by      Tim Burr
3            Evergreens by Teresa Green            Evergreens by  Teresa Green
4         Late for Dinner by Miss D'Bus       Late for Dinner by    Miss D'Bus
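A self-contained version of the snippet above may help when trying it out. This is a sketch that rebuilds the sample data from the question; the variable name `names` for the separate frame of known names is an assumption, since the question only says the names live in a separate df.

```python
import re
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "Description": [
        "A long walk by Miss D'Bus",
        "A day in the country by Teresa Green",
        "Falling Trees by Tim Burr",
        "Evergreens by Teresa Green",
        "Late for Dinner by Miss D'Bus",
    ]
})
# Hypothetical name for the separate frame holding the known names.
names = pd.DataFrame({"Names": ["Teresa Green", "Tim Burr", "Miss D'Bus"]})

# One alternation of escaped names, anchored at the end of the string.
pattern = r"^(.*?)\s*({})$".format("|".join(map(re.escape, names["Names"])))

# str.extract returns one column per capture group; assign them to two new columns.
df[["Reference", "Name"]] = df["Description"].str.extract(pattern, expand=True)
print(df)
```

Rows whose description does not end in one of the known names get NaN in both new columns, so they are easy to spot afterwards.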
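The regex will look like ^(.*?)\s*(Teresa\ Green|Tim\ Burr|Miss\ D\'Bus)$ (on Python 3.7 and later, re.escape leaves the apostrophe unescaped, which matches the same text):
^ - start of the string
(.*?) - Group 1 ("Reference"): any zero or more characters other than line breaks, as few as possible
\s* - zero or more whitespace characters
(Teresa\ Green|Tim\ Burr|Miss\ D\'Bus) - Group 2 ("Name"): an alternation group of the known names
$ - end of the string.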
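The breakdown above can be checked on a single string with the standard `re` module. This is a small sketch; the helper name `split_description` is hypothetical and used only for illustration.

```python
import re

names = ["Teresa Green", "Tim Burr", "Miss D'Bus"]
# Same pattern as in the answer, compiled once for reuse.
pattern = re.compile(r"^(.*?)\s*({})$".format("|".join(map(re.escape, names))))

def split_description(text):
    """Return (reference, name), or (None, None) when no known name ends the text."""
    m = pattern.match(text)
    return m.groups() if m else (None, None)

# Group 1 is the lazy prefix, group 2 is whichever known name ends the string.
reference, name = split_description("Falling Trees by Tim Burr")
# A description that does not end in a known name produces no match.
unmatched = split_description("Anonymous pamphlet")
```

Because group 1 is lazy and the pattern is anchored with $, the whitespace before the name is consumed by \s* rather than ending up in the reference.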