Question

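I'm trying to get rid of strings such as \xa0 and \xc2. I know this is an encoding issue, but how do I handle it? Neither the utf-8 nor the "ISO-8859-1" encoding option works for me.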

train = pd.read_csv('./data/train.csv',index_col = False,low_memory = False,encoding='utf-8')

test = pd.read_csv('./data/test.csv',index_col = False,low_memory = False,encoding="ISO-8859-1")

This is the output after running:

train = pd.DataFrame(data = train)
print(train)

        Insult  Date    Comment

1   0   20120528192215Z "i really don't understand your point.\xa0 It ...
2   0   NaN "A\\xc2\\xa0majority of Canadians can and has ...
3   0   NaN "listen if you dont wanna get married to a man...
4   0   20120619094753Z "C\xe1c b\u1ea1n xu\u1ed1ng \u0111\u01b0\u1edd...

Answer 1

You can try a plain string replacement (no regular expression is needed for this):

string_cleaned = "string_containing_unicode_or_latin".replace(u'\xa0', u' ')
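The output in the question appears to show two different cases: a real non-breaking space (printed as \xa0) and escape sequences stored as literal text (printed as \\xc2\\xa0). Below is a minimal sketch that handles both with plain str.replace. The function name strip_nbsp and the sample strings are hypothetical, chosen only for illustration.

```python
def strip_nbsp(text):
    """Replace non-breaking spaces, real or written out as escape text, with a space."""
    # Literal escape text first: a backslash followed by 'xc2', 'xa0' stored as characters.
    text = text.replace("\\xc2\\xa0", " ")
    text = text.replace("\\xa0", " ")
    # Then the actual U+00A0 (non-breaking space) character.
    return text.replace("\xa0", " ")


# Hypothetical samples modelled on the rows shown in the question.
real_nbsp = strip_nbsp("your point.\xa0 It")
escaped_text = strip_nbsp("A\\xc2\\xa0majority")
print(repr(real_nbsp))
print(repr(escaped_text))
```

On the DataFrame from the question, this could probably be applied with train['Comment'] = train['Comment'].map(strip_nbsp), assuming every value in that column is a string.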

For more information: https://docs.python.org/3/howto/unicode.html

Another, generally more recommended approach: unicodedata.normalize
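As a rough illustration of the unicodedata.normalize approach, the sketch below uses the NFKC form, which maps the non-breaking space (U+00A0) to an ordinary space. The helper name normalize_text and the sample string are assumptions made for this example. Note that NFKC also rewrites other compatibility characters (ligatures, full-width forms), which may or may not be wanted.

```python
import unicodedata


def normalize_text(text):
    """Apply NFKC normalization; U+00A0 becomes a regular space (U+0020)."""
    return unicodedata.normalize("NFKC", text)


# Hypothetical sample modelled on the first row shown in the question.
cleaned = normalize_text("i really don't understand your point.\xa0 It")
print(repr(cleaned))
```

Unlike a single replace call, this handles every compatibility character in one pass, but it does not touch escape sequences that are stored as literal backslash text.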

Hope this helps.
