Python pandas `replace` behaves inconsistently

Date: 2020-11-03 03:35:19

Tags: python pandas re

I have a large dataset from which I need to strip leading text of varying length. Here is a minimal working example:

import pandas as pd

data = {'Title' : ['Bertram, C. et al., 2015a: Carbon',
                   'Bertram, C. et al., 2015b: Complementing', 
                   'Bertram, C. et al., 2018: Targeted']}
df = pd.DataFrame(data, columns = ['Title'])

which gives

                                      Title
0         Bertram, C. et al., 2015a: Carbon
1  Bertram, C. et al., 2015b: Complementing
2        Bertram, C. et al., 2018: Targeted

First attempt

I passed a regular expression to the pandas `replace` method:

df['Title'].replace(r'(\A[\D\s.,]*\d\d\d\d[ab:] )', '', regex=True, inplace=True)

but this does not handle every case:

                                      Title
0         Bertram, C. et al., 2015a: Carbon
1  Bertram, C. et al., 2015b: Complementing
2                                  Targeted

Second attempt

I used the `regex` argument of `replace` with a list of patterns:

df['Title'].replace(regex=[r'(\A[\D\s.,]*\d\d\d\d:)', 
                           r'(\A[\D\s.,]*\d\d\d\da:)'
                           r'(\A[\D\s.,]*\d\d\d\db:)'], value='', inplace=True)

but this gives the same result:

                                      Title
0         Bertram, C. et al., 2015a: Carbon
1  Bertram, C. et al., 2015b: Complementing
2                                  Targeted

Third attempt

If I reorder the list of regular expressions:

df['Title'].replace(regex=[r'(\A[\D\s.,]*\d\d\d\da:)', 
                           r'(\A[\D\s.,]*\d\d\d\db:)'
                           r'(\A[\D\s.,]*\d\d\d\d:)'], value='', inplace=True)

I get further, but still not all the way:

                                      Title
0                                    Carbon
1  Bertram, C. et al., 2015b: Complementing
2                                  Targeted

Desired result

    Title
0   Carbon
1   Complementing
2   Targeted

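No related questions

I have gone through the documentation for `re` and for the pandas `replace` method carefully, but I am still missing something. None of the SO Q&As I found address this problem.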

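2 answers:

Answer 0 (score: 2)

If the prefix always ends with a colon `:` and you want the word that follows it, you may not need the `re` module at all. Regular expressions are often slower than simple string operations.

An alternative could be: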

import pandas as pd

data = {'Title' : ['Bertram, C. et al., 2015a: Carbon',
                   'Bertram, C. et al., 2015b: Complementing', 
                   'Bertram, C. et al., 2018: Targeted']}
df = pd.DataFrame(data, columns = ['Title'])
df['title2'] = df.Title.str.split(':').str[-1].str.lstrip()

print(df)

Output

                                      Title         title2
0         Bertram, C. et al., 2015a: Carbon         Carbon
1  Bertram, C. et al., 2015b: Complementing  Complementing
2        Bertram, C. et al., 2018: Targeted       Targeted
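
One caveat with splitting on every `:` is that a title containing a colon of its own would be cut at the last one. The sketch below assumes the prefix always ends at the first `: ` and splits only once; the third title is a made-up example, not from the question's data.

```python
import pandas as pd

# The last title is hypothetical: it contains a second colon after the prefix.
titles = pd.Series(['Bertram, C. et al., 2015a: Carbon',
                    'Bertram, C. et al., 2018: Targeted',
                    'Doe, J., 2019: Pricing: a review'])

# Split at most once, on the first ": ", and keep everything after it.
stripped = titles.str.split(': ', n=1).str[-1]

print(stripped.tolist())
# ['Carbon', 'Targeted', 'Pricing: a review']
```

A string without any `: ` is returned unchanged, since the split then yields a single element.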

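Answer 1 (score: 1)

`[ab:]` means "a or b or :", matched exactly once. You need `[ab:]+` ("a or b or :, possibly repeated"), because those characters occur back to back in your data, as in `2015a:`. With this correction, the first approach works.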
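
A minimal sketch of that correction, applied to the example data from the question. The only change to the pattern is the `+` after the character class. Assigning the result back to the column instead of using `inplace=True` is my own choice here, not something the answer prescribes.

```python
import pandas as pd

data = {'Title': ['Bertram, C. et al., 2015a: Carbon',
                  'Bertram, C. et al., 2015b: Complementing',
                  'Bertram, C. et al., 2018: Targeted']}
df = pd.DataFrame(data, columns=['Title'])

# [ab:]+ matches one or more of 'a', 'b', ':' so "a: ", "b: " and ": " all match.
pattern = r'\A[\D\s.,]*\d\d\d\d[ab:]+ '

# Replace the matched prefix with an empty string and assign the result back.
df['Title'] = df['Title'].replace(pattern, '', regex=True)

print(df['Title'].tolist())
# ['Carbon', 'Complementing', 'Targeted']
```

Separately, the pattern lists in the second and third attempts have no comma between their last two entries. Python concatenates adjacent string literals, so each list holds two patterns rather than three, and the merged pattern cannot match. That is consistent with the `2015b` row never being replaced.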