Remove Unicode codes (\uxxx) from a string in Python

Asked: 2017-05-16 20:10:15

Tags: python regex python-3.x unicode

I have some strings in a document that contain Unicode escape codes. I want to remove these codes or replace them with a space (" "). Example:

doc = "Hello my name is Ruth \u2026! I really like swimming and dancing \ud83c"

How can I convert it to the following?

doc = "Hello my name is Ruth! I really like swimming and dancing"

I have already tried this: https://stackoverflow.com/a/20078869/5505608, but nothing happens. I am using Python 3.

1 answer:

Answer 0 (score: 2)

You can encode to ASCII and ignore errors (i.e. code points that cannot be converted to ASCII characters).

>>> doc = "Hello my name is Ruth \u2026! I really like swimming and dancing \ud83c"
>>> doc.encode('ascii', errors='ignore')
b'Hello my name is Ruth ! I really like swimming and dancing '

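If the trailing whitespace bothers you, strip it off with strip(). Depending on your use case, you can then decode the result back to a str using ASCII. Chained together, it looks like this: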

>>> doc.encode('ascii', errors='ignore').strip().decode('ascii')
'Hello my name is Ruth ! I really like swimming and dancing'
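
Note that the chained result still has a space before the "!", so it does not exactly match the string asked for in the question. The following is a minimal sketch of one way to close that gap, assuming it is acceptable to collapse whitespace runs and to drop a space directly before punctuation. The helper name remove_non_ascii, its replacement parameter, and the punctuation set are hypothetical and not part of the original answer; it uses a regular expression instead of the encode/decode round trip so that a space can be substituted as well.

```python
import re

# Matches any single code point outside the ASCII range, lone surrogates included.
NON_ASCII = re.compile(r"[^\x00-\x7F]")


def remove_non_ascii(text: str, replacement: str = "") -> str:
    """Hypothetical helper: swap non-ASCII code points for `replacement`, then tidy spacing."""
    cleaned = NON_ASCII.sub(replacement, text)
    # Collapse whitespace runs left behind by the removed characters and trim both ends.
    cleaned = re.sub(r"\s+", " ", cleaned).strip()
    # Drop a space that now sits directly before punctuation, e.g. "Ruth !" -> "Ruth!".
    # The punctuation set here is an assumption; adjust it to your data.
    return re.sub(r" ([!?.,;:])", r"\1", cleaned)


doc = "Hello my name is Ruth \u2026! I really like swimming and dancing \ud83c"
print(remove_non_ascii(doc))
# Hello my name is Ruth! I really like swimming and dancing
```

Passing replacement=" " keeps neighbouring words apart when a removed character was the only thing between them, so "a\u2026b" becomes "a b" instead of "ab".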