Question

My task is to remove everything in parentheses, remove any digits that follow a country name, and rename several countries.

E.g. 'Bolivia (Plurinational State of)' should become 'Bolivia', and 'Switzerland17' should become 'Switzerland'.

My original code ran in this order:

dict1 = {
"Republic of Korea": "South Korea",
"United States of America": "United States",
"United Kingdom of Great Britain and Northern Ireland": "United Kingdom",
"China, Hong Kong Special Administrative Region": "Hong Kong"} 

energy['Country'] = energy['Country'].replace(dict1)
energy['Country'] = energy['Country'].str.replace(r' \(.*\)', '', regex=True)
energy['Country'] = energy['Country'].str.replace(r'\d+', '', regex=True)
energy.loc[energy['Country'] == 'United States']

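The str.replace part works fine: the parentheses and digits are removed. But when I use the last line to check whether the country names were changed, the original code turns out not to have renamed them. However, if I change the order of the code to: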

energy['Country'] = energy['Country'].str.replace(r' \(.*\)', '', regex=True)
energy['Country'] = energy['Country'].str.replace(r'\d+', '', regex=True)
energy['Country'] = energy['Country'].replace(dict1)

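then the country names are changed successfully. So something must be wrong with my regex syntax. How can I resolve this conflict, and why does it happen?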

Answer 1

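The problem is that replace with a dict only matches whole cell values by default; you need regex=True for replace to substitute substrings: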

import pandas as pd

energy = pd.DataFrame({'Country':['United States of America4',
                                  'United States of America (aaa)','Slovakia']})
print (energy)
                          Country
0       United States of America4
1  United States of America (aaa)
2                        Slovakia

dict1 = {
"Republic of Korea": "South Korea",
"United States of America": "United States",
"United Kingdom of Great Britain and Northern Ireland": "United Kingdom",
"China, Hong Kong Special Administrative Region": "Hong Kong"}

# nothing is replaced because no value matches a key exactly (trailing digits and parentheses)
energy['Country'] = energy['Country'].replace(dict1)
print (energy)
                          Country
0       United States of America4
1  United States of America (aaa)
2                        Slovakia

energy['Country'] = energy['Country'].str.replace(r' \(.*\)', '', regex=True)
energy['Country'] = energy['Country'].str.replace(r'\d+', '', regex=True)
print (energy)
                    Country
0  United States of America
1  United States of America
2                  Slovakia

print (energy.loc[energy['Country'] == 'United States'])
Empty DataFrame
Columns: [Country]
Index: []

# starting again from the original data: regex=True makes replace match substrings
energy['Country'] = energy['Country'].replace(dict1, regex=True)
print (energy)
               Country
0       United States4
1  United States (aaa)
2             Slovakia

energy['Country'] = energy['Country'].str.replace(r' \(.*\)', '', regex=True)
energy['Country'] = energy['Country'].str.replace(r'\d+', '', regex=True)
print (energy)
         Country
0  United States
1  United States
2       Slovakia

print (energy.loc[energy['Country'] == 'United States'])
         Country
0  United States
1  United States
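
One caveat with regex=True is worth a quick sketch. Because the dict keys are then treated as patterns that may match anywhere in a value, a key such as "Republic of Korea" also matches inside a longer name. The sample names and variable names below are made up for illustration and are not from the question's data; the anchored pattern is just one possible way to restore whole-value matching.

```python
import pandas as pd

# Hypothetical sample: the second name merely contains the first as a substring.
names = pd.Series(["Republic of Korea", "Democratic People's Republic of Korea"])
mapping = {"Republic of Korea": "South Korea"}

# Default behaviour: a key must equal the whole cell value.
whole = names.replace(mapping)

# regex=True: the key is searched for anywhere inside each value.
partial = names.replace(mapping, regex=True)

# Anchoring the pattern with ^ and $ limits the regex to whole-value matches again.
anchored = names.replace({r"^Republic of Korea$": "South Korea"}, regex=True)

print(whole.tolist())     # ['South Korea', "Democratic People's Republic of Korea"]
print(partial.tolist())   # ['South Korea', "Democratic People's South Korea"]
print(anchored.tolist())  # ['South Korea', "Democratic People's Republic of Korea"]
```

This is one reason cleaning the names first and then doing a plain whole-value replace can be the safer order.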

# alternatively (again from the original data), clean the names first
energy['Country'] = energy['Country'].str.replace(r' \(.*\)', '', regex=True)
energy['Country'] = energy['Country'].str.replace(r'\d+', '', regex=True)
print (energy)
                    Country
0  United States of America
1  United States of America
2                  Slovakia

# now plain replace works, because the cleaned names match the dict keys exactly
energy['Country'] = energy['Country'].replace(dict1)
print (energy)
         Country
0  United States
1  United States
2       Slovakia

print (energy.loc[energy['Country'] == 'United States'])
         Country
0  United States
1  United States
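
Putting the clean-first order together, a minimal sketch might look like the following. The helper name clean_country, the sample values, and the extra str.strip() call are assumptions added for illustration; the two patterns are the ones from the question.

```python
import pandas as pd

def clean_country(names: pd.Series, renames: dict) -> pd.Series:
    """Hypothetical helper: strip parenthesised text and digits, then rename."""
    cleaned = (
        names.str.replace(r" \(.*\)", "", regex=True)  # drop " (…)" suffixes
             .str.replace(r"\d+", "", regex=True)      # drop footnote digits
             .str.strip()                              # assumption: trim stray spaces
    )
    # Whole-value replace: only names that now equal a key exactly are renamed.
    return cleaned.replace(renames)

raw = pd.Series([
    "United States of America4",
    "United States of America (aaa)",
    "Bolivia (Plurinational State of)",
    "Switzerland17",
    "Slovakia",
])
renames = {"United States of America": "United States"}

result = clean_country(raw, renames)
print(result.tolist())
# ['United States', 'United States', 'Bolivia', 'Switzerland', 'Slovakia']
```

Passing regex=True explicitly to str.replace keeps the behaviour the same across pandas versions, since newer releases treat the pattern as a literal string by default.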

pandas replace conflicts with str.replace regex: order of the code
