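I am trying to get rid of strings such as \xa0 and \xc2. I know this is an encoding problem, but how do I deal with it? Neither the utf-8 nor the "ISO-8859-1" encoding option works for me.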
train = pd.read_csv('./data/train.csv',index_col = False,low_memory = False,encoding='utf-8')
test = pd.read_csv('./data/test.csv',index_col = False,low_memory = False,encoding="ISO-8859-1")
This is the output after running:
train = pd.DataFrame(data = train)
print(train)
Insult Date Comment
1 0 20120528192215Z "i really don't understand your point.\xa0 It ...
2 0 NaN "A\\xc2\\xa0majority of Canadians can and has ...
3 0 NaN "listen if you dont wanna get married to a man...
4 0 20120619094753Z "C\xe1c b\u1ea1n xu\u1ed1ng \u0111\u01b0\u1edd...
Answer 0 (score: 0)
You can try a plain string replacement (this uses str.replace, not a regular expression):
string_cleaned = "string_containing_unicode_or_latin".replace(u'\xa0', u' ')
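The sample output in the question seems to show two different cases: a real non-breaking space character (row 1) and the literal text \xc2\xa0 written out with backslashes (row 2). A minimal sketch that handles both with str.replace might look like this. The helper name clean_nbsp is hypothetical, and the assumption that the backslash sequences are literal text in the CSV is only a reading of the printed output.

```python
def clean_nbsp(text):
    """Replace non-breaking spaces, real or written as literal escapes, with a space.

    Hypothetical helper; assumes `text` is a str.
    """
    # Literal backslash sequences that ended up in the data as plain text.
    # The longer token goes first so "\\xc2\\xa0" is not left as "\\xc2 ".
    for token in ("\\xc2\\xa0", "\\xa0"):
        text = text.replace(token, " ")
    # The actual non-breaking space character (U+00A0).
    return text.replace("\xa0", " ")


real_nbsp = clean_nbsp("understand your point.\xa0 It")
literal_escape = clean_nbsp("A\\xc2\\xa0majority of Canadians")
```

With the DataFrame from the question, something like train['Comment'] = train['Comment'].map(clean_nbsp) should apply it to every comment, assuming the Comment column holds only strings.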
For more information: https://docs.python.org/3/howto/unicode.html
Another approach, and the one generally recommended: unicodedata.normalize
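As a rough illustration of unicodedata.normalize, the sketch below uses the NFKC form, which maps compatibility characters such as the non-breaking space (U+00A0) to their plain equivalents. The choice of NFKC and the helper name normalize_text are assumptions, not something the answer specifies. Note that NFKC also rewrites other compatibility characters (ligatures, full-width letters), which may or may not be wanted.

```python
import unicodedata


def normalize_text(text):
    """Apply NFKC normalization to a str (hypothetical helper)."""
    return unicodedata.normalize("NFKC", text)


sample = "i really don't understand your point.\xa0 It"
cleaned = normalize_text(sample)

# Accented letters that are already composed, such as U+00E1, are left alone.
vietnamese = normalize_text("C\xe1c b\u1ea1n")
```

This only affects real Unicode characters. Literal backslash text such as \\xc2\\xa0 in row 2 of the question would pass through unchanged and needs a separate replacement.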
Hope this helps.