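I'm trying to clean up some data in an Excel file. The file has 7400 rows and 18 columns and contains a list of customers with their addresses and other data. The problem is that some cities are misspelled, which distorts the information and makes further processing difficult.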
SURNAME | ADDRESS | CITY
0 Jenson | 252 Des Chênes | D.DO
1 Jean | 236 Gouin | DOLLARD
2 Denis | 993 Boul. Gouin | DOLLARD-DES-ORMEAUX
3 Bradford | 1690 Dollard #7 | DDO
4 Alisson | 115 Du Buisson | IL PERROT
5 Abdul | 9877 Boul. Gouin | Pierrefonds
6 O'Neil | 5 Du College | Ile Bizard
7 Bundy | 7345 Sherbrooke | ILLE Perot
8 Darcy | 8671 Anthony #2 | ILE Perrot
9 Adams | 845 Georges | Pierrefonds
In the example above, D.DO, DOLLARD and DDO should be spelled DOLLARD-DES-ORMEAUX, and IL PERROT, ILLE Perot and ILE Perrot should be spelled ILE-PERROT.
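For anyone who wants to reproduce the problem, the sample rows above can be loaded into a DataFrame roughly as follows. This is only a sketch of the ten rows shown; the real file would presumably be read with pd.read_excel instead, and the value_counts call is just one way to see how the misspellings split a single city into several groups.

```python
import pandas as pd

# The ten sample rows from the table above (the real file has 7400 rows).
df = pd.DataFrame({
    "SURNAME": ["Jenson", "Jean", "Denis", "Bradford", "Alisson",
                "Abdul", "O'Neil", "Bundy", "Darcy", "Adams"],
    "ADDRESS": ["252 Des Chênes", "236 Gouin", "993 Boul. Gouin",
                "1690 Dollard #7", "115 Du Buisson", "9877 Boul. Gouin",
                "5 Du College", "7345 Sherbrooke", "8671 Anthony #2",
                "845 Georges"],
    "CITY": ["D.DO", "DOLLARD", "DOLLARD-DES-ORMEAUX", "DDO", "IL PERROT",
             "Pierrefonds", "Ile Bizard", "ILLE Perot", "ILE Perrot",
             "Pierrefonds"],
})

# Each spelling variant is counted as its own city.
print(df["CITY"].value_counts())
```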
I have been able to replace the values using:
df["CITY"].replace(to_replace={"D.DO", "DOLLARD", "DDO"}, value="DOLLARD-DES-ORMEAUX", regex=True)
df["CITY"].replace(to_replace={"IL PERROT", "ILLE PEROT", "ILE PERROT"}, value="ILE-PERROT", regex=True)
Is there a way to combine the two operations above into one? I tried:
df["CITY"].replace({to_replace={"D.DO", "DOLLARD", "DDO"}, value="DOLLARD-DES-ORMEAUX", to_replace={"IL PERROT", "ILLE PEROT", "ILE PERROT"}, value="ILE-PERROT"}, regex=True)
but had no luck.
Answer 0 (score: 15)
replacements = {
    'CITY': {
        # D.DO, DDO, and anything starting with DOLLARD
        r'(D.*DO|DOLLARD.*)': 'DOLLARD-DES-ORMEAUX',
        # IL / ILE / ILLE followed by PERROT or PEROT, in any case;
        # anchored so that "Ile Bizard" is left alone
        r'(?i)^IL+E?[ -]PERR?OT$': 'ILE-PERROT'}
}
# Nested dict: column name -> {regex pattern: replacement}
df.replace(replacements, regex=True, inplace=True)
print(df)
Output:
SURNAME ADDRESS CITY
0 Jenson 252 Des Chênes DOLLARD-DES-ORMEAUX
1 Jean 236 Gouin DOLLARD-DES-ORMEAUX
2 Denis 993 Boul. Gouin DOLLARD-DES-ORMEAUX
3 Bradford 1690 Dollard #7 DOLLARD-DES-ORMEAUX
4 Alisson 115 Du Buisson ILE-PERROT
5 Abdul 9877 Boul. Gouin Pierrefonds
6 O'Neil 5 Du College Ile Bizard
7 Bundy 7345 Sherbrooke ILE-PERROT
8 Darcy 8671 Anthony #2 ILE-PERROT
9 Adams 845 Georges Pierrefonds
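The question called replace on the single CITY column, and the same idea can be sketched there by passing one dict of pattern-to-replacement pairs to Series.replace. This is a minimal sketch rather than part of the original answer: the names city and patterns are made up for illustration, and the anchored patterns are an assumption chosen to fit the ten sample rows, so they would likely need extending for the full file.

```python
import pandas as pd

# CITY column from the sample data in the question.
city = pd.Series(
    ["D.DO", "DOLLARD", "DOLLARD-DES-ORMEAUX", "DDO", "IL PERROT",
     "Pierrefonds", "Ile Bizard", "ILLE Perot", "ILE Perrot", "Pierrefonds"],
    name="CITY",
)

# One dict, one call: each key is a regex, each value its replacement.
# ^...$ anchors make a pattern match the whole cell, not a fragment of it.
patterns = {
    r"^(?:D\.?DO|DOLLARD.*)$": "DOLLARD-DES-ORMEAUX",
    r"(?i)^IL+E?[ -]PERR?OT$": "ILE-PERROT",
}

cleaned = city.replace(patterns, regex=True)
print(cleaned.tolist())
```

Anchoring matters here because a regex replacement substitutes only the matched part of the string: an unanchored "DOLLARD" pattern would also rewrite the start of a value that is already spelled DOLLARD-DES-ORMEAUX.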
Answer 1 (score: 3)
You can create a dictionary of replacements and then iterate over it, using 'loc' to do the replacement.
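target_for_values = {
    'DOLLARD-DES-ORMEAUX': ['D.DO', 'DOLLARD', 'DDO'],
    'ILE-PERROT': ['IL PERROT', 'ILLE PEROT', 'ILE PERROT']}
# .items() replaces the Python 2-only .iteritems()
for k, v in target_for_values.items():
    # Compare upper-cased city names against the known misspellings
    df.loc[df.CITY.str.upper().isin(v), 'CITY'] = k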
>>> df.CITY
CITY
0 DOLLARD-DES-ORMEAUX
1 DOLLARD-DES-ORMEAUX
2 DOLLARD-DES-ORMEAUX
3 DOLLARD-DES-ORMEAUX
4 ILE-PERROT
5 Pierrefonds
6 Ile Bizard
7 ILE-PERROT
8 ILE-PERROT
9 Pierrefonds
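The loop above scans the column once per target city. A possible variation, sketched here in plain Python, inverts the dictionary into a flat variant-to-target lookup so each value is normalized with a single dictionary access. The names canonical and normalize_city are hypothetical helpers introduced for this example, not part of the answer above.

```python
target_for_values = {
    'DOLLARD-DES-ORMEAUX': ['D.DO', 'DOLLARD', 'DDO'],
    'ILE-PERROT': ['IL PERROT', 'ILLE PEROT', 'ILE PERROT'],
}

# Invert {target: [variants]} into a flat {variant: target} lookup.
canonical = {
    variant: target
    for target, variants in target_for_values.items()
    for variant in variants
}


def normalize_city(name):
    """Return the canonical spelling for a known variant, else the input unchanged."""
    if not isinstance(name, str):
        return name  # leave missing values (None, NaN) as they are
    return canonical.get(name.strip().upper(), name)


cities = ["D.DO", "DOLLARD", "DOLLARD-DES-ORMEAUX", "DDO", "IL PERROT",
          "Pierrefonds", "Ile Bizard", "ILLE Perot", "ILE Perrot", "Pierrefonds"]
print([normalize_city(c) for c in cities])
```

With a DataFrame, the function could be applied as df["CITY"] = df["CITY"].map(normalize_city). Exact matching like this only catches the variants that are listed, so any new misspelling has to be added to the dictionary.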