Question

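I have been struggling with this error for a while now, and although there are similar questions, I cannot seem to find a solution anywhere.

Here is my code: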

f = codecs.open(path, "a", encoding="utf-8")
value = "Bitte überprüfen"
f.write(("\"%s\" = \"%s\";\n" % ("no_internet", value)).encode("utf-8"))

The error I get is:

UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 23: ordinal not in range(128)

Why ascii if I said utf-8? I would really appreciate any help.

Answer 1

Try:

value = u"Bitte überprüfen"

to declare the value as a unicode string, and

# -*- coding: utf-8 -*-

at the top of your file, to declare that your Python file is saved with UTF-8 encoding.
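Putting both suggestions together, a minimal sketch of the corrected write might look like the following. The temporary directory and the file name `Localizable.strings` are assumptions made for illustration; the question only calls the target `path`. The point shown is that a file opened with `codecs.open(..., encoding="utf-8")` takes unicode text and does the encoding itself, so the explicit `.encode("utf-8")` call is dropped.

```python
# -*- coding: utf-8 -*-
import codecs
import os
import tempfile

# Hypothetical output location, used only for this demonstration.
path = os.path.join(tempfile.mkdtemp(), "Localizable.strings")

value = u"Bitte überprüfen"  # unicode literal, not a byte string
line = u"\"%s\" = \"%s\";\n" % (u"no_internet", value)

# The codecs writer encodes to UTF-8 on write, so `line` stays unicode.
f = codecs.open(path, "a", encoding="utf-8")
try:
    f.write(line)
finally:
    f.close()

# Read the file back to check what was written.
f = codecs.open(path, "r", encoding="utf-8")
try:
    written = f.read()
finally:
    f.close()
```

The `u` prefixes are accepted by Python 2 and by Python 3.3 or later, so the same sketch should behave the same way on either.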

Answer 2

To never be hurt by unicode errors again, switch to python3:

% python3
>>> with open('/tmp/foo', 'w') as f:
...     value = "Bitte überprüfen"
...     f.write(('"{}" = "{}";\n'.format('no_internet', value)))
... 
36
>>> import sys
>>> sys.exit(0)
% cat /tmp/foo
"no_internet" = "Bitte überprüfen";

Though if you are really tied to python2 and have no choice:

% python2
>>> with open('/tmp/foo2', 'w') as f:
...   value = u"Bitte überprüfen"
...   f.write(('"{}" = "{}";\n'.format('no_internet', value.encode('utf-8'))))
... 
>>> import sys
>>> sys.exit(0)
% cat /tmp/foo2
"no_internet" = "Bitte überprüfen";

As @JuniorCompressor suggests, do not forget to add # encoding: utf-8 at the start of your python2 file, to tell python to read the source file as UTF-8 rather than ASCII!

Your error in:

f.write(("\"%s\" = \"%s\";\n" % ("no_internet", value)).encode("utf-8"))

is that you are encoding the whole formatted string to utf-8, whereas you should encode the value string to utf-8 before doing the format. Encoding after the format fails even with a unicode value:

>>> with open('/tmp/foo2', 'w') as f:
...   value = u"Bitte überprüfen"
...   f.write(('"{}" = "{}";\n'.format('no_internet', value).encode('utf-8')))
...
Traceback (most recent call last):
  File "<stdin>", line 3, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\xfc' in position 6: ordinal not in range(128)

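This is because the format template is a byte string, so python has to turn the value into bytes while formatting, and it does that implicitly with the ascii codec before your .encode('utf-8') ever runs. So you have to use the unicode type (which is what u"" does), and then explicitly encode that value to utf-8 before feeding it to the format parser to build the new string.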

As Karl says in his answer, Python2 is really messy/buggy when working with unicode strings, defeating the "Explicit is better than implicit" zen of python. For more weird behaviour, the following works just fine in python2:

>>> value = "Bitte überprüfen"
>>> out = '"{}" = "{}";\n'.format('no_internet', value)
>>> out
'"no_internet" = "Bitte \xc3\xbcberpr\xc3\xbcfen";\n'
>>> print(out)
"no_internet" = "Bitte überprüfen";

Still not convinced to switch to python3? :-)

Update

Here is the way to read unicode strings from one file and write them to another:

% echo "Bitte überprüfen" > /tmp/foobar
% python2
>>> with open('/tmp/foobar', 'r') as f:
...   data = f.read().decode('utf-8').strip()
...
>>>
>>> with open('/tmp/foo2', 'w') as f:
...   f.write(('"{}" = "{}";\n'.format('no_internet', data.encode('utf-8'))))
...
>>> import sys;sys.exit(0)
% cat /tmp/foo2
"no_internet" = "Bitte überprüfen";
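The same read-then-write flow can also be sketched with `io.open`, which is available on Python 2.6+ and is the built-in `open` on Python 3. This swaps in a different API from the session above: the file object decodes on read and encodes on write, so no manual `.decode()` or `.encode()` call is needed. The temporary directory is an assumption standing in for `/tmp`.

```python
# -*- coding: utf-8 -*-
import io
import os
import tempfile

workdir = tempfile.mkdtemp()            # stand-in for /tmp
src = os.path.join(workdir, "foobar")   # input file
dst = os.path.join(workdir, "foo2")     # output file

# Prepare the input, equivalent to the `echo` command.
with io.open(src, "w", encoding="utf-8") as f:
    f.write(u"Bitte überprüfen\n")

# Decoding happens on read: `data` is text (unicode), not bytes.
with io.open(src, "r", encoding="utf-8") as f:
    data = f.read().strip()

# Encoding happens on write: pass text, never pre-encoded bytes.
with io.open(dst, "w", encoding="utf-8", newline="\n") as f:
    f.write(u'"{}" = "{}";\n'.format(u"no_internet", data))

with io.open(dst, "r", encoding="utf-8") as f:
    result = f.read()
```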

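Update

As a general rule:

When you get a UnicodeDecodeError, you should use .decode('utf-8') on the byte string that contains the unicode data

When you get a UnicodeEncodeError, you should use .encode('utf-8') on the unicode string that contains the data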
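One way to apply this rule without guessing which type you are holding is a pair of small helpers. The names `to_text` and `to_bytes` are hypothetical, chosen for this sketch, and UTF-8 is assumed as the default encoding.

```python
# -*- coding: utf-8 -*-

def to_text(data, encoding="utf-8"):
    """Return text (unicode); decode only if `data` is a byte string."""
    if isinstance(data, bytes):
        return data.decode(encoding)
    return data


def to_bytes(data, encoding="utf-8"):
    """Return bytes; encode only if `data` is text."""
    if isinstance(data, bytes):
        return data
    return data.encode(encoding)


raw = b"Bitte \xc3\xbcberpr\xc3\xbcfen"  # UTF-8 bytes, e.g. read from a file
text = to_text(raw)                       # bytes -> text
roundtrip = to_bytes(text)                # text -> bytes
```

Because each helper checks the type first, calling it on a value that is already in the target form is harmless, which avoids the implicit ascii conversion that Python 2 would otherwise attempt.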

Update: if you cannot upgrade to python3, you can at least make your python2 behave almost like python3 with the following __future__ import statement:

from __future__ import absolute_import, division, print_function, unicode_literals

HTH

Answer 3

Why ascii if I said utf-8?

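Because in Python 2, "Bitte überprüfen" is not a Unicode string. Before it can be encoded by your explicit .encode call, Python must implicitly decode it to Unicode (which is also why it raises a UnicodeDecodeError), and it chooses ASCII because it has no other information to go on. The ü is represented by bytes with values >= 128, so it is not valid ASCII.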
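The implicit step can be made visible by performing it by hand. The sketch below builds the formatted line as UTF-8 bytes, which is what a Python 2 `str` literal would hold if the source file is saved as UTF-8 (an assumption about the asker's setup), and then decodes it with the ascii codec explicitly.

```python
# -*- coding: utf-8 -*-

# The formatted line as UTF-8 bytes, i.e. the content of the Python 2 `str`.
line = (u'"%s" = "%s";\n' % (u"no_internet", u"Bitte überprüfen")).encode("utf-8")

error = None
try:
    # The step Python 2 performs implicitly before it can encode to UTF-8.
    line.decode("ascii")
except UnicodeDecodeError as exc:
    error = exc

# Offset of the first byte that is not valid ASCII.
position = error.start
offending_byte = bytearray(line)[position]
```

Here `position` is 23 and `offending_byte` is 0xc3, the first of the two bytes that encode ü, which matches the position and byte reported in the question's traceback.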

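The u prefix shown by @JuniorCompressor makes it a Unicode string. You should also specify the encoding of the file (do not blindly set utf-8; it needs to match what your text editor uses to save the .py file).

Switching to Python 3 really is (part of) a better long-term solution :) but understanding the problem is still essential. See http://bit.ly/unipain for details. The Python 2 behaviour really is a bug, or at least a failure to meet the Pythonic design principle "Explicit is better than implicit", and here we see very clearly why ;)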

Answer 4

As already suggested, your error results from this line:

f.write(("\"%s\" = \"%s\";\n" % ("no_internet", value)).encode("utf-8"))

It should be:

f.write(('"{}" = "{}";\n'.format('no_internet', value.encode('utf-8'))))

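A note on unicode and encodings

If you are working with Python 2, software should only use unicode strings internally, converting to a particular encoding on output.

To avoid making the same mistake over and over again, you should make sure you understand the difference between the ascii and utf-8 encodings, and also between str and unicode objects in Python.

Difference between the ASCII and UTF-8 encodings:

ASCII needs just one byte to represent every possible character in the ascii charset/encoding. UTF-8 needs up to four bytes to represent the complete charset.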

ascii (default)
1    If the code point is < 128, each byte is the same as the value of the code point.
2    If the code point is 128 or greater, the Unicode string can’t be represented in this encoding. (Python raises a UnicodeEncodeError exception in this case.)

utf-8 (unicode transformation format)
1    If the code point is <128, it’s represented by the corresponding byte value.
2    If the code point is between 128 and 0x7ff, it’s turned into two byte values between 128 and 255.
3    Code points >0x7ff are turned into three- or four-byte sequences, where each byte of the sequence is between 128 and 255.
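The byte counts listed above can be checked directly. The four sample characters are chosen for illustration, one from each UTF-8 length class.

```python
# -*- coding: utf-8 -*-

# U+0041, U+00FC, U+20AC, U+1F600
samples = [u"A", u"\u00fc", u"\u20ac", u"\U0001F600"]

# Number of bytes each character occupies once encoded as UTF-8.
utf8_lengths = [len(ch.encode("utf-8")) for ch in samples]
print(utf8_lengths)  # [1, 2, 3, 4]

# ASCII can only represent code points below 128.
try:
    u"\u00fc".encode("ascii")
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False
print(ascii_ok)  # False
```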

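Difference between str and unicode objects:

You could say that str is basically a byte string and unicode is a unicode string. A str holds bytes in some encoding such as ascii or utf-8, whereas a unicode object holds code points and only takes on an encoding when it is encoded to bytes.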

str vs. unicode
1   str     = byte string (8-bit) - uses \x and two digits
2   unicode = unicode string      - uses \u and four digits
3   basestring
        /\
       /  \
    str    unicode
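The difference between the two escape notations shows up when the same character is inspected as text and as bytes. This is a small sketch using the euro sign as an arbitrary example; it inspects values through `bytearray` and the `unicode_escape` codec so that it does not depend on how a particular Python version prints its reprs.

```python
# -*- coding: utf-8 -*-

text = u"\u20ac"             # one code point: EURO SIGN
data = text.encode("utf-8")  # the same character as UTF-8 bytes

text_length = len(text)      # 1 character
data_length = len(data)      # 3 bytes

# Each byte is written with \x and two hex digits.
byte_values = [hex(b) for b in bytearray(data)]

# The code point is written with \u and four hex digits.
escaped = text.encode("unicode_escape")
```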

If you follow a few simple rules, you should be fine handling str/unicode objects in different encodings like ascii or utf-8 or whatever encoding you have to use:

Rules
1    encode(): Gets you from Unicode -> bytes
     encode([encoding], [errors='strict']), returns an 8-bit string version of the Unicode string,
2    decode(): Gets you from bytes -> Unicode
     decode([encoding], [errors]) method that interprets the 8-bit string using the given encoding
3    codecs.open(encoding="utf-8"): Read and write files directly to/from Unicode (you can use any encoding, not just utf-8, but utf-8 is most common).
4    u"": Makes your string literals into Unicode objects rather than byte sequences.
5    unicode(string[, encoding, errors])

Warning: do not use encode() on bytes or decode() on Unicode objects
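Python 3, which the second answer recommends, enforces this warning at the type level: the wrong-direction methods simply do not exist. The sketch below assumes a Python 3 interpreter; on Python 2 both checks would report True, which is exactly how the implicit ascii conversion sneaks in.

```python
# -*- coding: utf-8 -*-

# Assumes Python 3; on Python 2 both of these would be True.
bytes_has_encode = hasattr(b"abc", "encode")
text_has_decode = hasattr(u"abc", "decode")
print(bytes_has_encode, text_has_decode)  # False False on Python 3

# The only directions left are the two from the rules above.
encoded = u"\u00fc".encode("utf-8")   # text -> bytes
decoded = b"\xc3\xbc".decode("utf-8") # bytes -> text
```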

Again: software should only work with Unicode strings internally, converting to a particular encoding on output.

Python UnicodeDecodeError when writing German letters
