I'm seeing different behavior when decoding a byte string with Python 3.4.3 on two machines, one running OS X and the other running Debian Wheezy.
On OS X:
$ python
Python 3.4.3 (default, Mar 10 2015, 14:53:35)
[GCC 4.2.1 Compatible Apple LLVM 6.0 (clang-600.0.56)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> s = b'\xc4\x8dtrn\xc3\xa1ct'
>>> print(s.decode("utf-8"))
čtrnáct
On Debian:
$ python
Python 3.4.3 (default, Apr 4 2015, 22:21:17)
[GCC 4.7.2] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> s = b'\xc4\x8dtrn\xc3\xa1ct'
>>> print(s.decode("utf-8"))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character '\u010d' in position 0: ordinal not in range(128)
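Some configuration must differ slightly between the two installations to cause this. I've checked the default encoding on both and it is the same, but I'm not sure what else to check.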
>>> import sys
>>> sys.getdefaultencoding()
'utf-8'
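Note that the traceback is a UnicodeEncodeError, not a decode error: s.decode("utf-8") succeeds, and it is print() that fails when it encodes the resulting str for the terminal. print() uses sys.stdout.encoding for that, not sys.getdefaultencoding(), so the values worth comparing on both machines are probably the ones below. This is only a diagnostic sketch; describe_io_encodings is a hypothetical helper name, not something from either installation.

```python
import locale
import sys


def describe_io_encodings():
    """Collect the encodings relevant to print(), next to the default one."""
    return {
        # Used for implicit str/bytes conversion; always 'utf-8' on Python 3.
        "default": sys.getdefaultencoding(),
        # What print() actually encodes with when writing to the terminal.
        "stdout": getattr(sys.stdout, "encoding", None),
        # Derived from the locale settings (LC_ALL, LC_CTYPE, LANG).
        "preferred": locale.getpreferredencoding(False),
    }


for name, value in sorted(describe_io_encodings().items()):
    print(name, "=", value)
```

If "stdout" or "preferred" reports an ASCII variant on one machine and UTF-8 on the other, the locale environment is the likely culprit.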
Update: the output of locale does differ between the two:
OS X:
LANG=
LC_COLLATE="C"
LC_CTYPE="UTF-8"
LC_MESSAGES="C"
LC_MONETARY="C"
LC_NUMERIC="C"
LC_TIME="C"
LC_ALL=
Debian:
$ locale
locale: Cannot set LC_CTYPE to default locale: No such file or directory
locale: Cannot set LC_ALL to default locale: No such file or directory
LANG=en_US.UTF-8
LANGUAGE=en_US:en
LC_CTYPE=UTF-8
LC_NUMERIC="en_US.UTF-8"
LC_TIME="en_US.UTF-8"
LC_COLLATE="en_US.UTF-8"
LC_MONETARY="en_US.UTF-8"
LC_MESSAGES="en_US.UTF-8"
LC_PAPER="en_US.UTF-8"
LC_NAME="en_US.UTF-8"
LC_ADDRESS="en_US.UTF-8"
LC_TELEPHONE="en_US.UTF-8"
LC_MEASUREMENT="en_US.UTF-8"
LC_IDENTIFICATION="en_US.UTF-8"
LC_ALL=
Answer 0 (score: 0)
I found the answer: I followed the "Locales: Configuration" section of http://perlgeek.de/en/article/set-up-a-clean-utf8-environment. Specifically, the steps that helped were:
export LC_ALL=en_US.UTF-8
export LANG=en_US.UTF-8
export LANGUAGE=en_US.UTF-8
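My understanding of why this helps, offered as an assumption rather than something I have verified on the Debian box: the "Cannot set LC_CTYPE to default locale" warnings suggest that LC_CTYPE=UTF-8 is not a locale that exists there, so Python falls back to the C locale and gives stdout an ASCII encoding. The sketch below reproduces the failing step in isolation; can_encode is a hypothetical helper used only for illustration.

```python
# Decoding the bytes from the question works regardless of locale.
text = b'\xc4\x8dtrn\xc3\xa1ct'.decode("utf-8")


def can_encode(value, encoding):
    """Return True if value can be encoded with the given codec."""
    try:
        value.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False


# With an ASCII stdout, print(text) fails at this encoding step.
print(can_encode(text, "ascii"))   # False
# With a UTF-8 stdout, the same string encodes without error.
print(can_encode(text, "utf-8"))   # True
```

After exporting the variables in a new shell, sys.stdout.encoding should report a UTF-8 encoding, and the original print(s.decode("utf-8")) should then work.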