I'm trying to remove all Chinese characters from a CSV that contains both Latin and Chinese characters. The data looks like this:
address lat
1 农工商超市, Zhangjiang, Pudong New District, 203718 31.204024
2 欧尚, 3057号, Jinke Road, Pudong, 201203, China 31.181804
I need it to look like this:
address lat
1 , Zhangjiang, Pudong New District, 203718 31.204024
2 , 3057, Jinke Road, Pudong, 201203, China 31.181804
I tried df.replace(/[^\x00-\x7F]/g, "")
and df.replace(/[\u{0080}-\u{FFFF}]/gu,"")
but got this error:
df1.replace([^\x00-\x7F],"");
^
SyntaxError: invalid syntax
Any help is appreciated. Thanks!
Answer 0 (score: 2)
df['address'] = df['address'].str.replace(r'[^\x00-\x7F]+', '', regex=True)
Result:
In [99]: df
Out[99]:
address lat
0 , Zhangjiang, Pudong New District, 203718 31.204024
1 , 3057, Jinke Road, Pudong, 201203, China 31.181804
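For reference, here is a self-contained sketch of this approach using the two sample rows from the question. The pattern is written as a Python string rather than the JavaScript-style /.../g literal from the question, which is what triggers the SyntaxError. It assumes a pandas version where str.replace accepts the regex keyword; passing regex=True explicitly matters on recent releases, where the pattern is otherwise treated as literal text. The second column is an assumption on my part: it removes only the basic CJK Unified Ideographs block (U+4E00 to U+9FFF), which covers the sample data but not every Chinese character, and it leaves other non-ASCII text such as accented Latin letters in place.

```python
import pandas as pd

# Sample rows copied from the question.
df = pd.DataFrame({
    'address': [
        '农工商超市, Zhangjiang, Pudong New District, 203718',
        '欧尚, 3057号, Jinke Road, Pudong, 201203, China',
    ],
    'lat': [31.204024, 31.181804],
})

# Drop every run of non-ASCII characters.
df['ascii_only'] = df['address'].str.replace(r'[^\x00-\x7F]+', '', regex=True)

# Narrower variant (assumption: the basic CJK block is enough for this data).
# A non-raw string is used so the character class holds the actual code points.
df['no_cjk'] = df['address'].str.replace('[\u4e00-\u9fff]+', '', regex=True)

print(df[['ascii_only', 'no_cjk']])
```

For this sample both columns come out the same; they would differ on input that contains non-ASCII characters outside the CJK block.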
Answer 1 (score: 2)
Another option is to use filter together with string.printable, similar to this link:
import string
printable = set(string.printable)
df['address'] = df['address'].apply(lambda row: ''.join(filter(lambda x: x in printable, row)))
df
Result:
address lat
1 , Zhangjiang, Pudong New District, 203718 31.204024
2 , 3057, Jinke Road, Pudong, 201203, China 31.181804
You can also use encode and decode inside a lambda, similar to this link:
df['address'] = df['address'].apply(lambda row: row.encode('ascii',errors='ignore').decode())
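The two variants in this answer are not quite equivalent, which the following plain-Python sketch tries to show. The helper names keep_printable and keep_ascii are hypothetical, and the trailing bell character in the sample is added here only to make the difference visible: string.printable excludes ASCII control characters other than whitespace, while encoding to ASCII with errors='ignore' keeps every ASCII code point.

```python
import string

PRINTABLE = set(string.printable)

def keep_printable(text):
    """Keep only characters listed in string.printable."""
    return ''.join(ch for ch in text if ch in PRINTABLE)

def keep_ascii(text):
    """Keep every ASCII character, including control characters."""
    return text.encode('ascii', errors='ignore').decode('ascii')

# Sample value from the question plus a bell character (\x07) for illustration.
sample = '欧尚, 3057号, Jinke Road\x07'

printable_result = keep_printable(sample)  # bell character is dropped
ascii_result = keep_ascii(sample)          # bell character is kept

print(repr(printable_result))
print(repr(ascii_result))
```

Either helper can be passed to df['address'].apply in place of the lambdas above.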
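Answer 2 (score: 0)
If you want to restrict the character set, an arguably more efficient approach is to open the file with the encoding you want and ignore decoding errors: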
import pandas as pd

with open('your_csv_file.csv', encoding='ascii', errors='ignore') as infile:
    df = pd.read_csv(infile)
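A minimal end-to-end sketch of this idea, assuming the CSV is stored as UTF-8 (the question does not say). The file name sample.csv and the shortened rows are made up for the demonstration. Note that this strips non-ASCII characters from every column and from the header, not just from address.

```python
import os
import tempfile

import pandas as pd

# Shortened sample rows; the address field is quoted because it contains commas.
csv_text = (
    'address,lat\n'
    '"农工商超市, Zhangjiang",31.204024\n'
    '"欧尚, 3057号, Jinke Road",31.181804\n'
)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'sample.csv')  # hypothetical file name

    # Write the sample as UTF-8 so the Chinese characters are multi-byte.
    with open(path, 'w', encoding='utf-8') as outfile:
        outfile.write(csv_text)

    # Decode as ASCII and silently drop any byte that is not valid ASCII.
    with open(path, encoding='ascii', errors='ignore') as infile:
        df = pd.read_csv(infile)

print(df)
```

This works for UTF-8 because every byte of a multi-byte character is outside the ASCII range; with other encodings some bytes of a Chinese character could fall inside the ASCII range and survive as stray characters.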