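Removing Chinese characters in pandas

Time: 2018-02-17 15:00:05

Tags: python string pandas dataframe replace

I am trying to remove all Chinese characters from a CSV that contains both Latin and Chinese characters. The data looks like this: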

    address                                                 lat
1   农工商超市, Zhangjiang, Pudong New District, 203718       31.204024
2   欧尚, 3057号, Jinke Road, Pudong, 201203, China          31.181804

I need it to look like this:

    address                                                 lat
1   , Zhangjiang, Pudong New District, 203718               31.204024
2   , 3057, Jinke Road, Pudong, 201203, China               31.181804

I tried df.replace(/[^\x00-\x7F]/g, "") and df.replace(/[\u{0080}-\u{FFFF}]/gu, ""), but I get an error:

    df1.replace([^\x00-\x7F],"");
                 ^
SyntaxError: invalid syntax

Any help is appreciated. Thanks!

3 answers:

Answer 0: (score: 2)

You are almost there:

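# regex=True is needed on newer pandas versions, where the pattern is treated as a literal string by default
df['address'] = df['address'].str.replace(r'[^\x00-\x7F]+', '', regex=True)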

Result:

In [99]: df
Out[99]:
                                     address        lat
0  , Zhangjiang, Pudong New District, 203718  31.204024
1  , 3057, Jinke Road, Pudong, 201203, China  31.181804
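
For anyone who wants to run this end to end, here is a minimal self-contained sketch built from the two sample rows in the question. The second pattern is an assumption on my part, not something from the answer above: it targets only the CJK Unified Ideographs block (U+4E00 to U+9FFF), which may be preferable if the data also contains other non-ASCII text, such as accented letters, that should be kept.

```python
import pandas as pd

# Sample rows copied from the question.
df = pd.DataFrame({
    "address": [
        "农工商超市, Zhangjiang, Pudong New District, 203718",
        "欧尚, 3057号, Jinke Road, Pudong, 201203, China",
    ],
    "lat": [31.204024, 31.181804],
})

# Drop every character outside the ASCII range (what the answer does).
# regex=True is passed explicitly so the pattern is not taken literally.
ascii_only = df["address"].str.replace(r"[^\x00-\x7F]+", "", regex=True)

# Assumption: a narrower pattern that removes only CJK Unified Ideographs
# and leaves any other non-ASCII characters in place.
no_cjk = df["address"].str.replace("[\u4e00-\u9fff]+", "", regex=True)

print(ascii_only.tolist())
print(no_cjk.tolist())
```

On this sample both patterns give the same result, because every non-ASCII character in the two rows falls inside that block.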

Answer 1: (score: 2)

Another option is to use filter together with string.printable:

import string
printable = set(string.printable)
df['address'] = df['address'].apply(lambda row: ''.join(filter(lambda x: x in printable, row)))
df

Result:

                                    address        lat
1  , Zhangjiang, Pudong New District, 203718  31.204024
2  , 3057, Jinke Road, Pudong, 201203, China  31.181804

Or use encode and decode inside a lambda:

df['address'] = df['address'].apply(lambda row: row.encode('ascii',errors='ignore').decode())
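
The two approaches in this answer are not quite equivalent, which the sketch below tries to show in plain Python. The helper names keep_printable and keep_ascii and the sample string are made up for illustration. string.printable only covers digits, letters, punctuation and whitespace, so it also drops ASCII control characters, whereas the encode/decode round trip keeps everything in the 0 to 127 range.

```python
import string

PRINTABLE = set(string.printable)


def keep_printable(text):
    """Keep only characters listed in string.printable."""
    return "".join(ch for ch in text if ch in PRINTABLE)


def keep_ascii(text):
    """Drop anything that cannot be encoded as ASCII."""
    return text.encode("ascii", errors="ignore").decode("ascii")


# Hypothetical sample: Chinese characters plus a trailing BEL control character.
sample = "欧尚, 3057号, Jinke Road\x07"

print(repr(keep_printable(sample)))  # control character removed as well
print(repr(keep_ascii(sample)))      # control character kept
```

Either helper can be passed to df['address'].apply(...). Note that rows holding missing values (NaN) are floats rather than strings, so they would need to be filled or skipped first.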

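Answer 2: (score: 0)

If you want to restrict your character set, an arguably more efficient approach is to open the file with the encoding you want and ignore decoding errors: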

import pandas as pd

# Bytes that are not valid ASCII are silently dropped while the file is read
with open('your_csv_file.csv', encoding='ascii', errors='ignore') as infile:
    df = pd.read_csv(infile)
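
A minimal way to try this without an existing file is sketched below. It writes the question's two rows to a temporary UTF-8 file (the file name addresses.csv and the quoting of the address field are assumptions made for the example) and reads it back as ASCII with errors ignored.

```python
import os
import tempfile

import pandas as pd

# The question's rows as CSV text; addresses are quoted because they contain commas.
csv_text = (
    "address,lat\n"
    '"农工商超市, Zhangjiang, Pudong New District, 203718",31.204024\n'
    '"欧尚, 3057号, Jinke Road, Pudong, 201203, China",31.181804\n'
)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "addresses.csv")  # hypothetical file name

    # Write the sample as UTF-8, the way such a file would typically be stored.
    with open(path, "w", encoding="utf-8") as outfile:
        outfile.write(csv_text)

    # Read it back as ASCII; undecodable bytes are dropped before pandas sees them.
    with open(path, encoding="ascii", errors="ignore") as infile:
        df = pd.read_csv(infile)

print(df["address"].tolist())
```

Keep in mind that this strips non-ASCII characters from every column, including the header row, and it relies on the file being in an ASCII-compatible encoding such as UTF-8. If only one column should be cleaned, the per-column approaches in the other answers are a better fit.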