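My task is to remove everything in parentheses and any digits that follow a country name, and also to rename several countries.
E.g. 'Bolivia (Plurinational State of)' should become 'Bolivia', and 'Switzerland17' should become 'Switzerland'.
My original code, in this order: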
dict1 = {
"Republic of Korea": "South Korea",
"United States of America": "United States",
"United Kingdom of Great Britain and Northern Ireland": "United Kingdom",
"China, Hong Kong Special Administrative Region": "Hong Kong"}
energy['Country'] = energy['Country'].replace(dict1)
energy['Country'] = energy['Country'].str.replace(r' \(.*\)', '', regex=True)
energy['Country'] = energy['Country'].str.replace(r'\d+', '', regex=True)
energy.loc[energy['Country'] == 'United States']
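The str.replace part works fine, so that part of the task is done.
But when I use the last line to check whether the country names were actually renamed, it turns out this original code does not work. However, if I change the order of the code to: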
energy['Country'] = energy['Country'].str.replace(r' \(.*\)', '', regex=True)
energy['Country'] = energy['Country'].str.replace(r'\d+', '', regex=True)
energy['Country'] = energy['Country'].replace(dict1)
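then the country names are renamed successfully. So there must be something wrong with my regex syntax. How can I resolve this conflict, and why does it happen?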
Answer 0 (score: 3)
The problem is that replace needs regex=True in order to replace substrings:
import pandas as pd

energy = pd.DataFrame({'Country':['United States of America4',
                                  'United States of America (aaa)','Slovakia']})
print (energy)
Country
0 United States of America4
1 United States of America (aaa)
2 Slovakia
dict1 = {
"Republic of Korea": "South Korea",
"United States of America": "United States",
"United Kingdom of Great Britain and Northern Ireland": "United Kingdom",
"China, Hong Kong Special Administrative Region": "Hong Kong"}
# nothing is replaced because no value matches a key exactly (digits and parentheses are still present)
energy['Country'] = energy['Country'].replace(dict1)
print (energy)
Country
0 United States of America4
1 United States of America (aaa)
2 Slovakia
energy['Country'] = energy['Country'].str.replace(r' \(.*\)', '', regex=True)
energy['Country'] = energy['Country'].str.replace(r'\d+', '', regex=True)
print (energy)
Country
0 United States of America
1 United States of America
2 Slovakia
print (energy.loc[energy['Country'] == 'United States'])
Empty DataFrame
Columns: [Country]
Index: []
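The empty result comes from how Series.replace matches dictionary keys. Here is a minimal sketch of the difference, using two made-up sample values (not taken from the real energy data): by default a key has to equal the whole cell value, while with regex=True it is treated as a pattern that may match part of the value.

```python
import pandas as pd

# Hypothetical sample values, chosen only to illustrate the matching rule.
s = pd.Series(["United States of America", "United States of America4"])
mapping = {"United States of America": "United States"}

# Default (regex=False): a key must equal the entire cell value to match,
# so the second value, which still carries a trailing digit, is left alone.
exact = s.replace(mapping)

# regex=True: each key is treated as a pattern and can match a substring,
# so the second value is renamed and keeps its trailing digit.
partial = s.replace(mapping, regex=True)

print(exact.tolist())
print(partial.tolist())
```

So a plain dictionary replace only works once the values have already been cleaned down to the exact key strings.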
# starting again from the original sample data: regex=True replaces matching substrings
energy['Country'] = energy['Country'].replace(dict1, regex=True)
print (energy)
Country
0 United States4
1 United States (aaa)
2 Slovakia
energy['Country'] = energy['Country'].str.replace(r' \(.*\)', '', regex=True)
energy['Country'] = energy['Country'].str.replace(r'\d+', '', regex=True)
print (energy)
Country
0 United States
1 United States
2 Slovakia
print (energy.loc[energy['Country'] == 'United States'])
Country
0 United States
1 United States
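One caveat with regex=True, as far as I understand it: the dictionary keys are then interpreted as regular expressions, so a key containing characters such as parentheses no longer matches literally. The sketch below uses a hypothetical key that is not part of dict1, and re.escape from the standard library is one possible way to force a literal match.

```python
import re

import pandas as pd

# Hypothetical value and key, chosen because the key contains regex metacharacters.
s = pd.Series(["Bolivia (Plurinational State of)"])
key = "Bolivia (Plurinational State of)"

# With regex=True the parentheses in the key form a regex group,
# so the pattern does not match the literal parentheses in the value.
unescaped = s.replace({key: "Bolivia"}, regex=True)

# Escaping the key makes every character match literally.
escaped = s.replace({re.escape(key): "Bolivia"}, regex=True)

print(unescaped.tolist())
print(escaped.tolist())
```

None of the keys in dict1 contain such characters, so this only matters if the mapping grows.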
# alternatively, starting again from the original sample data: clean the values first
energy['Country'] = energy['Country'].str.replace(r' \(.*\)', '', regex=True)
energy['Country'] = energy['Country'].str.replace(r'\d+', '', regex=True)
print (energy)
Country
0 United States of America
1 United States of America
2 Slovakia
# now the plain replace works, because the cleaned values match the keys exactly
energy['Country'] = energy['Country'].replace(dict1)
print (energy)
Country
0 United States
1 United States
2 Slovakia
print (energy.loc[energy['Country'] == 'United States'])
Country
0 United States
1 United States
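To make the "clean first, then rename" order hard to get wrong, the steps can be wrapped in a single function. This is only a sketch: clean_country and RENAMES are names I made up, the sample values are illustrative, and the extra str.strip call is my own addition to drop stray whitespace.

```python
import pandas as pd

# Hypothetical subset of the rename mapping, for illustration only.
RENAMES = {
    "Republic of Korea": "South Korea",
    "United States of America": "United States",
}


def clean_country(names: pd.Series, renames: dict) -> pd.Series:
    """Strip parenthesised text and digits, then apply exact-match renames."""
    cleaned = (
        names.str.replace(r"\s*\(.*\)", "", regex=True)  # drop "(...)" and the space before it
        .str.replace(r"\d+", "", regex=True)             # drop footnote digits
        .str.strip()                                     # drop leftover whitespace
    )
    # Exact-match replace is enough here because the values are already clean.
    return cleaned.replace(renames)


raw = pd.Series([
    "Bolivia (Plurinational State of)",
    "Switzerland17",
    "United States of America20",
    "Republic of Korea",
])
result = clean_country(raw, RENAMES)
print(result.tolist())
```

It would be called as energy['Country'] = clean_country(energy['Country'], dict1).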