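I am trying to clean the df['Country'] column by creating a new column, df['Country Clean'], that takes the name of a country whenever that name is found in df['Country'].
I found that if I run the command a second time, the earlier matches are lost, and I end up with a column that only reports the "Russia" matches.
Is there a way to do this?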
import pandas as pd

data = {'Number':['1', '2', '1', '2', '1', '2'], 'Country':['Italy 1', 'Italie', 'Ecco', 'Russia is in Euroasia' , 'Yugoslavia', 'Russia']}
df = pd.DataFrame(data)
# Both assignments rebuild 'Country Clean' from df['Country'], so the second one discards the Italy matches
df['Country Clean'] = df['Country'].str.replace(r'(^.*Italy.*$)', 'Italy', regex=True)
df['Country Clean'] = df['Country'].str.replace(r'(^.*Russia.*$)', 'Russia', regex=True)
Expected output
import numpy as np

data2 = {'Number':['1', '2', '1', '2', '1', '2'], 'Country':['Italy', 'Italy', np.nan, 'Russia' , np.nan , 'Russia']}
exp = pd.DataFrame(data2)
exp
Answer 0 (score: 1)
Use:
In [15]: countries = ["italy", "russia", "yugoslavia", "italie"]
In [16]: for i in countries:df.loc[lambda x:x.Country.str.lower().str.contains(i), 'Country Clean'] = i.capitalize()
In [17]: df
Out[17]:
   Number                Country  Country Clean
0       1                Italy 1          Italy
1       2                 Italie         Italie
2       1                   Ecco            NaN
3       2  Russia is in Euroasia         Russia
4       1             Yugoslavia     Yugoslavia
5       2                 Russia         Russia
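For reference, here is the same loop as a self-contained script, as a minimal sketch only. The variants dictionary is an assumption added here (it is not part of the answer): it maps each lowercase substring to one canonical name, so that a spelling variant such as "italie" ends up as "Italy" rather than "Italie". Passing regex=False makes str.contains treat each key as a literal substring.

```python
import pandas as pd

data = {
    "Number": ["1", "2", "1", "2", "1", "2"],
    "Country": ["Italy 1", "Italie", "Ecco", "Russia is in Euroasia", "Yugoslavia", "Russia"],
}
df = pd.DataFrame(data)

# Hypothetical mapping (not from the answer): lowercase substring -> canonical name.
variants = {"italy": "Italy", "italie": "Italy", "russia": "Russia"}

lowered = df["Country"].str.lower()
for needle, canonical in variants.items():
    # Assign only to the rows that match, so earlier matches are not overwritten.
    mask = lowered.str.contains(needle, regex=False)
    df.loc[mask, "Country Clean"] = canonical

print(df)
```

Rows that match none of the keys ("Ecco" and "Yugoslavia" here) are never assigned and stay NaN, which is what the expected output in the question asks for.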
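Answer 1 (score: 1)
I suggest normalizing the country names first, and then setting the values of the 'Country Clean' column according to a list of allowed countries: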
import numpy as np

normalize_countries = {"Italie": "Italy", "Rusia": "Russia"}  # Spelling corrections
pattern = r"\b(?:{})\b".format("|".join(normalize_countries))  # Regex to find misspellings
countries = ["Italy", "Russia"]  # Country list
# regex=True is needed for a callable replacement (pandas 2.0 changed the default to regex=False)
df['Country Clean'] = df['Country'].str.replace(pattern, lambda x: normalize_countries[x.group()], regex=True)

def applyFunc(s):
    # Return the first allowed country contained in s, or a missing value if there is none
    for e in countries:
        if e in s:
            return e
    return np.nan

df['Country Clean'] = df['Country Clean'].apply(applyFunc)
Output:
>>> df
   Number                Country  Country Clean
0       1                Italy 1          Italy
1       2                 Italie          Italy
2       1                   Ecco            NaN
3       2  Russia is in Euroasia         Russia
4       1             Yugoslavia            NaN
5       2                 Russia         Russia
The df['Country'].str.replace(pattern, lambda x: normalize_countries[x.group()]) line searches the Country column for every misspelled country name as a whole word and replaces it with the correctly spelled variant.
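To see the normalization step in isolation, here is a small sketch that uses the standard re module instead of pandas. The pattern is built exactly as in the answer; the fix_spelling helper and the samples list are hypothetical names and values made up for illustration.

```python
import re

normalize_countries = {"Italie": "Italy", "Rusia": "Russia"}

# Same construction as in the answer: an alternation of the misspelled keys,
# wrapped in \b word boundaries so that only whole words match.
pattern = r"\b(?:{})\b".format("|".join(normalize_countries))
print(pattern)  # \b(?:Italie|Rusia)\b

def fix_spelling(match):
    # match.group() is the misspelled word that was found; look up its correction.
    return normalize_countries[match.group()]

# Hypothetical sample strings, not taken from the question.
samples = ["Italie", "Rusia is large", "Italiens", "Russia"]
fixed = [re.sub(pattern, fix_spelling, s) for s in samples]
print(fixed)  # ['Italy', 'Russia is large', 'Italiens', 'Russia']
```

"Italiens" is left alone because the word boundary after "Italie" fails, and "Russia" is left alone because it does not contain the misspelling "Rusia".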
You can also add a whole-word check to the country search itself: build a regular expression from the countries list and use re.search in applyFunc instead of the plain substring test (if e in s).
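The whole-word variant is only described in prose, so here is one way it might look, as a sketch under stated assumptions. The names country_rx and apply_func and the sample Series are hypothetical and not part of the answer; re.escape is added as a precaution in case a country name ever contains regex metacharacters.

```python
import re

import numpy as np
import pandas as pd

countries = ["Italy", "Russia"]  # Country list from the answer

# One compiled alternation of the allowed names, wrapped in \b word boundaries.
country_rx = re.compile(r"\b(?:{})\b".format("|".join(map(re.escape, countries))))

def apply_func(s):
    # Hypothetical whole-word variant of applyFunc: return the first allowed
    # country found as a whole word, or NaN when there is none.
    m = country_rx.search(s)
    return m.group() if m else np.nan

# Hypothetical sample values, assumed to be spelling-normalized already.
cleaned = pd.Series(["Italy 1", "Italy", "Ecco", "Russia is in Euroasia", "Russian", "Yugoslavia"])
result = cleaned.apply(apply_func)
print(result)
```

With the substring test, "Russian" would be reported as "Russia"; with the word boundaries it is treated as no match. One difference to keep in mind: the loop in applyFunc returns the first country in list order, while the regex returns the country that appears first in the string.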