pandas.replace与str.replace正则表达式冲突。代码订单

时间:2017-07-30 07:26:40

标签: python regex string pandas

我的任务是删除括号中的所有内容,并删除任何数字,后跟国家/地区名称。更改几个国家/地区的名称。

e.g。 玻利维亚(多民族国)'应该是'玻利维亚' 瑞士17'应该是'瑞士'。

我的原始代码在顺序中:

dict1 = {
"Republic of Korea": "South Korea",
"United States of America": "United States",
"United Kingdom of Great Britain and Northern Ireland": "United Kingdom",
"China, Hong Kong Special Administrative Region": "Hong Kong"} 

energy['Country'] = energy['Country'].replace(dict1)
energy['Country'] = energy['Country'].str.replace(r' \(.*\)', '')
energy['Country'] = energy['Country'].str.replace('\d+', '')
energy.loc[energy['Country'] == 'United States']

str.replace部分正常。任务完成了。 当我使用最后一行检查是否成功更改了国家/地区名称。这个原始代码不起作用。但是,如果我将代码的顺序更改为:

energy['Country'] = energy['Country'].str.replace(r' \(.*\)', '') energy['Country'] = energy['Country'].str.replace('\d+', '') energy['Country'] = energy['Country'].replace(dict1)

然后它成功更改了国家/地区名称。 所以我的Regex语法一定有问题,如何解决这个冲突呢?为什么会这样?

1 个答案:

答案 0 :(得分:3)

问题是您需要regex=True replace替换substrings

energy = pd.DataFrame({'Country':['United States of America4',
                                  'United States of America (aaa)','Slovakia']})
print (energy)
                          Country
0       United States of America4
1  United States of America (aaa)
2                        Slovakia

dict1 = {
"Republic of Korea": "South Korea",
"United States of America": "United States",
"United Kingdom of Great Britain and Northern Ireland": "United Kingdom",
"China, Hong Kong Special Administrative Region": "Hong Kong"} 
#no replace beacuse no match (numbers and ()) 
energy['Country'] = energy['Country'].replace(dict1)
print (energy)
                          Country
0       United States of America4
1  United States of America (aaa)
2                        Slovakia

energy['Country'] = energy['Country'].str.replace(r' \(.*\)', '')
energy['Country'] = energy['Country'].str.replace('\d+', '')
print (energy)
                    Country
0  United States of America
1  United States of America
2                  Slovakia

print (energy.loc[energy['Country'] == 'United States'])
Empty DataFrame
Columns: [Country]
Index: []
energy['Country'] = energy['Country'].replace(dict1, regex=True)
print (energy)
               Country
0       United States4
1  United States (aaa)
2             Slovakia

energy['Country'] = energy['Country'].str.replace(r' \(.*\)', '')
energy['Country'] = energy['Country'].str.replace('\d+', '')
print (energy)
         Country
0  United States
1  United States
2       Slovakia

print (energy.loc[energy['Country'] == 'United States'])
         Country
0  United States
1  United States
#first data cleaning
energy['Country'] = energy['Country'].str.replace(r' \(.*\)', '')
energy['Country'] = energy['Country'].str.replace('\d+', '')
print (energy)
                    Country
0  United States of America
1  United States of America
2                  Slovakia

#replace works nice
energy['Country'] = energy['Country'].replace(dict1)
print (energy)
         Country
0  United States
1  United States
2       Slovakia

print (energy.loc[energy['Country'] == 'United States'])
         Country
0  United States
1  United States