I have a large dataset in which I need to strip leading text of varying length from each entry. Here is a minimal working example:
import pandas as pd

# Minimal example: each title carries an author/year prefix of varying length
data = {'Title': ['Bertram, C. et al., 2015a: Carbon',
                  'Bertram, C. et al., 2015b: Complementing',
                  'Bertram, C. et al., 2018: Targeted']}
df = pd.DataFrame(data, columns=['Title'])
which gives
Title
0 Bertram, C. et al., 2015a: Carbon
1 Bertram, C. et al., 2015b: Complementing
2 Bertram, C. et al., 2018: Targeted
First attempt
I used a regular expression with the pandas replace method:
df['Title'].replace(r'(\A[\D\s.,]*\d\d\d\d[ab:] )', '', regex=True, inplace=True)
But this does not handle all cases:
Title
0 Bertram, C. et al., 2015a: Carbon
1 Bertram, C. et al., 2015b: Complementing
2 Targeted
Second attempt
I passed a list of patterns to the regex argument of replace:
df['Title'].replace(regex=[r'(\A[\D\s.,]*\d\d\d\d:)',
r'(\A[\D\s.,]*\d\d\d\da:)'
r'(\A[\D\s.,]*\d\d\d\db:)'], value='', inplace=True)
But this gives the same result:
Title
0 Bertram, C. et al., 2015a: Carbon
1 Bertram, C. et al., 2015b: Complementing
2 Targeted
Third attempt
If I reorder the list of regular expressions:
df['Title'].replace(regex=[r'(\A[\D\s.,]*\d\d\d\da:)',
r'(\A[\D\s.,]*\d\d\d\db:)'
r'(\A[\D\s.,]*\d\d\d\d:)'], value='', inplace=True)
I make some progress, but not enough:
Title
0 Carbon
1 Bertram, C. et al., 2015b: Complementing
2 Targeted
Desired result
Title
0 Carbon
1 Complementing
2 Targeted
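No matching related questions
I have read the documentation for re and for pandas closely, but I am still getting something wrong. None of the existing SO questions and answers resolves this problem.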
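Answer 0 (score: 2)
If the prefix always ends with a colon (:) and you only want the text that follows it, you may not need the re module at all. Regular expressions are often slower than plain string operations.
An alternative could be: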
import pandas as pd

data = {'Title': ['Bertram, C. et al., 2015a: Carbon',
                  'Bertram, C. et al., 2015b: Complementing',
                  'Bertram, C. et al., 2018: Targeted']}
df = pd.DataFrame(data, columns=['Title'])
# Keep only the text after the last colon, then strip the leading space
df['title2'] = df.Title.str.split(':').str[-1].str.lstrip()
print(df)
Output
                                      Title         title2
0         Bertram, C. et al., 2015a: Carbon         Carbon
1  Bertram, C. et al., 2015b: Complementing  Complementing
2        Bertram, C. et al., 2018: Targeted       Targeted
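One caveat with splitting on every colon is that a title which itself contains a colon would lose everything before its last colon. The sketch below is one possible variation, not part of the original answer: it splits only once, on the first ": ", and keeps the remainder. The second row is a made-up title added purely to illustrate the difference, and the approach assumes the first ": " always marks the end of the author/year prefix.

```python
import pandas as pd

df = pd.DataFrame({"Title": [
    "Bertram, C. et al., 2015a: Carbon",
    # Hypothetical row (not from the question) whose title contains a colon
    "Doe, J., 2019: Mitigation: a review",
]})

# Approach from the answer: keep only the text after the LAST colon
last_part = df["Title"].str.split(":").str[-1].str.lstrip()

# Variation: split at most once on the first ': ' and keep the remainder
df["title2"] = df["Title"].str.split(": ", n=1).str[1]

print(last_part.tolist())
print(df["title2"].tolist())
```

Rows that contain no ": " at all end up as missing values in title2, so real data may need a fallback for those.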
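Answer 1 (score: 1)
[ab:] means "a or b or :", that is, exactly one of those characters. You need [ab:]+ ("a or b or :, possibly repeated"), because these characters occur one after another in the data, for example the a: in 2015a:. With this correction, the first approach works.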
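To make the difference concrete, here is a small sketch that applies both patterns with the standard re module to the three titles from the question. It uses plain re.sub rather than pandas so the regex behaviour can be checked in isolation; the variable names are illustrative only.

```python
import re

titles = [
    "Bertram, C. et al., 2015a: Carbon",
    "Bertram, C. et al., 2015b: Complementing",
    "Bertram, C. et al., 2018: Targeted",
]

# Pattern from the first attempt: [ab:] matches exactly one character,
# so "2015a: " fails because "a" is followed by ":" rather than a space.
original = r"(\A[\D\s.,]*\d\d\d\d[ab:] )"

# Corrected pattern: [ab:]+ matches one or more of those characters,
# which covers "a:", "b:" and ":" alike.
corrected = r"(\A[\D\s.,]*\d\d\d\d[ab:]+ )"

before = [re.sub(original, "", t) for t in titles]
after = [re.sub(corrected, "", t) for t in titles]

print(before)
print(after)
```

The corrected pattern string can be dropped into the replace call from the first attempt unchanged. Separately, in the list-based attempts as posted there is no comma after the second pattern, so Python joins the two adjacent string literals into one pattern and the list holds two entries rather than three.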