CSV encoding, pandas DataFrame

Time: 2020-06-03 15:30:57

Tags: python pandas csv

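I am trying to get rid of sequences such as \xa0 and \xc2 in my strings. I know this is an encoding problem, but how do I deal with it? Neither the 'utf-8' nor the "ISO-8859-1" encoding option works for me.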

train = pd.read_csv('./data/train.csv',index_col = False,low_memory = False,encoding='utf-8')

test = pd.read_csv('./data/test.csv',index_col = False,low_memory = False,encoding="ISO-8859-1")

This is the output after running:

train = pd.DataFrame(data = train)
print(train)
        Insult  Date    Comment

1   0   20120528192215Z "i really don't understand your point.\xa0 It ...
2   0   NaN "A\\xc2\\xa0majority of Canadians can and has ...
3   0   NaN "listen if you dont wanna get married to a man...
4   0   20120619094753Z "C\xe1c b\u1ea1n xu\u1ed1ng \u0111\u01b0\u1edd...

1 answer:

Answer 0 (score: 0)

You can try a plain string replacement (str.replace is enough here; no regular expression is needed):

string_cleaned = "string_containing_unicode_or_latin".replace('\xa0', ' ')
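The printed output in the question suggests that some rows may hold the escape sequence as literal text (a backslash followed by "xa0") rather than the actual non-breaking space character. That is only an assumption about the data. Under that assumption, a minimal sketch of a helper that handles both forms could look like this (strip_nbsp is a hypothetical name, not a pandas or standard library function):

```python
def strip_nbsp(text: str) -> str:
    """Replace non-breaking spaces with ordinary spaces.

    Hypothetical helper: it handles the real U+00A0 character and,
    as an assumption about the data, the literal escaped text forms
    (backslash + "xc2" + backslash + "xa0", and backslash + "xa0").
    """
    # Replace the longer literal form first so that the "\xc2" prefix
    # is not left behind when the shorter form is replaced.
    for literal in ("\\xc2\\xa0", "\\xa0"):
        text = text.replace(literal, " ")
    # Finally replace the actual non-breaking space character.
    return text.replace("\xa0", " ")


print(strip_nbsp("i really don't understand your point.\xa0 It"))
print(strip_nbsp("A\\xc2\\xa0majority of Canadians"))
```

With pandas, something like train["Comment"].map(strip_nbsp) should apply it to the whole column, assuming the column contains only strings (missing values would need to be handled first).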

For more information: https://docs.python.org/3/howto/unicode.html

Another approach, and the one that is usually recommended: unicodedata.normalize
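As a rough sketch of the unicodedata.normalize approach: the NFKC form applies Unicode compatibility mappings, which turn the non-breaking space (U+00A0) into an ordinary space. The helper name normalize_text and the sample string are illustrative only, and this works only when the string contains the actual character rather than a literal backslash sequence.

```python
import unicodedata


def normalize_text(text: str) -> str:
    # NFKC applies compatibility mappings, so the non-breaking space
    # U+00A0 is mapped to a regular space U+0020.
    return unicodedata.normalize("NFKC", text)


sample = "i really don't understand your point.\xa0 It"
cleaned = normalize_text(sample)
print(repr(cleaned))
```

Note that NFKC also rewrites other compatibility characters (for example, ligatures and full-width letters), so it changes more than just \xa0; if only the non-breaking space should go, a plain replace is the narrower tool.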

Hope this helps.