Question

我尝试过str（）和x.encode（'UTF8'）。是否有一种快速简便的方法来删除unicode字符？我的列表如下所示：

mcd = [u'Chicken saut\xc3\xa9ed potatoes',  'Roasted lamb with mash potatoes', 'Rabbit casserole with tarragon, mushrooms and dijon mustard sauce. Served with mash potatoes']

我试图摆脱你的原因是因为我想将这些数据复制到CSV文件中。当我尝试这样做时，它会给我一个像下面那样的错误......

UnicodeEncodeError：'ascii'编解码器无法编码位置8-10的字符：序数不在范围内（128）

我认为完全删除unicode会更容易。

提前致谢！

Answer 1

这对我有用：

mcd = [u'Chicken saut\xc3\xa9ed potatoes',  'Roasted lamb with mash potatoes', 'Rabbit casserole with tarragon, mushrooms and dijon mustard sauce. Served with mash potatoes']

new = [str(m) for m in mcd]

for m,n in zip(mcd,new): # compare before and after
    print type(m), type(n)

OUT：

<type 'unicode'> <type 'str'>
<type 'str'> <type 'str'>
<type 'str'> <type 'str'>

如果上述方法不起作用（请参阅评论中的convo）：

new = [m.encode('utf-8') for m in mcd]

Answer 2

问题可能是您按Enter而不是打印结果。这称为 repr 而不是 str 。引用文档：

在交互式解释器中，输出字符串用引号括起来，特殊字符用反斜杠转义。虽然这有时可能与输入看起来不同（封闭的引号可能会改变），但这两个字符串是等价的。 reference

让我告诉你：

In [1]: mcd = [u'Chicken saut\xc3\xa9ed potatoes',  'Roasted lamb with mash potatoes', 'Rabbit casserole with tarragon, mushrooms and dijon mustard sauce. Served with mash potatoes']

In [2]: mcd[0]
Out[2]: u'Chicken saut\xc3\xa9ed potatoes'

In [3]: print repr(mcd[0])
u'Chicken saut\xc3\xa9ed potatoes'

In [4]: print mcd[0]  # Here will use my current OS encoding, i think utf8 in my case
Chicken sautÃ©ed potatoes

In [5]: print mcd[0].encode('utf8')  # yes! i was right
Chicken sautÃ©ed potatoes

您应首先选择编码类型，我认为在这种情况下您必须使用latin1：

In [20]: print mcd[0].encode('latin1')
Chicken sautéed potatoes

希望有所帮助。

编辑：如果您想要替换字符，我还没有看到问题的编辑，check this answer

Answer 3

如果您获得的字符串是网站抓取的结果，则表明您关闭的网站的编码设置不正确。

网站指定charset=utf-8然后将网站的内容实际放在其他字符集（特别是windows-1252）或反之亦然。这种现象没有简单的通用解决方法（也称为mojibake）。

您可能希望尝试使用不同的抓取库 - 大多数都有某种识别和处理此方案的策略，但它们在不同的方案中具有不同的成功率。如果您使用的是BeautifulSoup，则可能需要尝试使用chardet后端的不同参数。

当然，如果您只关心正确抓取单个网站，则可以对网站声明的字符编码进行硬编码。

你这样的问题没有多大意义。你要完成什么并不是很清楚。 u'Chicken and sauted potatoes'不再正确，只比u'Chicken and sautÃ©ed potatoes'稍微不那么吸引人（并且在某些方面更没有吸引力，因为你不能说有尝试使其正确，尽管它没有胜任执行）。

如果因为使用ASCII编码将Unicode提供给文件句柄而出现编码错误，则正确的解决方法是在打开文件进行写入时指定ASCII以外的编码（通常为UTF-8）。

摆脱列表中的unicode字符

3 个答案: