What could cause bytes.decode() to behave differently on different Python 3.4 installations?

Time: 2015-04-04 22:46:24

Tags: python unicode

I'm seeing different behavior when decoding a byte string with Python 3.4.3 on two boxes: one running OS X, the other running Debian Wheezy.

On OS X:

$ python
Python 3.4.3 (default, Mar 10 2015, 14:53:35) 
[GCC 4.2.1 Compatible Apple LLVM 6.0 (clang-600.0.56)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> s = b'\xc4\x8dtrn\xc3\xa1ct'
>>> print(s.decode("utf-8"))
čtrnáct

On Debian:

$ python
Python 3.4.3 (default, Apr  4 2015, 22:21:17) 
[GCC 4.7.2] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> s = b'\xc4\x8dtrn\xc3\xa1ct'
>>> print(s.decode("utf-8"))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character '\u010d' in position 0: ordinal not in range(128)

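Some configuration must differ slightly between these two installations to cause this. I have checked the default encoding on both, and it is the same, but I'm not sure what else I can check.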

>>> import sys
>>> sys.getdefaultencoding()
'utf-8'
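
One further thing that may be worth checking here: the traceback above is a UnicodeEncodeError, which suggests that the decode itself succeeded and that print() failed while encoding the result for the terminal. That step uses the encoding of sys.stdout, which Python derives from the locale, not sys.getdefaultencoding(). The snippet below is only a sketch; describe_encodings is a hypothetical helper name, not something from the original post.

```python
import locale
import sys


def describe_encodings():
    """Collect the encodings involved in decoding versus printing.

    describe_encodings is a hypothetical helper used for illustration only.
    """
    return {
        # Used when no codec is named explicitly; 'utf-8' on Python 3.
        "default": sys.getdefaultencoding(),
        # Used by print() when it writes str objects to the terminal.
        "stdout": sys.stdout.encoding,
        # What the locale settings (LANG, LC_ALL, LC_CTYPE) suggest.
        "preferred": locale.getpreferredencoding(False),
    }


for name, value in describe_encodings().items():
    print(name, value)
```

If the "stdout" entry on the Debian box turns out to be an ASCII encoding while the OS X box reports UTF-8, that would be consistent with the error shown above.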

Update: the output of locale differs between the two:

OS X:

LANG=
LC_COLLATE="C"
LC_CTYPE="UTF-8"
LC_MESSAGES="C"
LC_MONETARY="C"
LC_NUMERIC="C"
LC_TIME="C"
LC_ALL=

Debian:

$ locale
locale: Cannot set LC_CTYPE to default locale: No such file or directory
locale: Cannot set LC_ALL to default locale: No such file or directory
LANG=en_US.UTF-8
LANGUAGE=en_US:en
LC_CTYPE=UTF-8
LC_NUMERIC="en_US.UTF-8"
LC_TIME="en_US.UTF-8"
LC_COLLATE="en_US.UTF-8"
LC_MONETARY="en_US.UTF-8"
LC_MESSAGES="en_US.UTF-8"
LC_PAPER="en_US.UTF-8"
LC_NAME="en_US.UTF-8"
LC_ADDRESS="en_US.UTF-8"
LC_TELEPHONE="en_US.UTF-8"
LC_MEASUREMENT="en_US.UTF-8"
LC_IDENTIFICATION="en_US.UTF-8"
LC_ALL=

1 answer:

Answer 0 (score: 0)

I found the answer: I followed the "Locales: Configuration" section of http://perlgeek.de/en/article/set-up-a-clean-utf8-environment. Specifically, the steps that helped were:

export LC_ALL=en_US.UTF-8
export LANG=en_US.UTF-8
export LANGUAGE=en_US.UTF-8
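
A likely reason these exports help is that Python picks the encoding of sys.stdout from the locale, so a broken or missing locale can leave the output stream on ASCII. The sketch below reproduces the effect without touching the real terminal by writing to in-memory streams; can_print is a hypothetical helper name, and the in-memory stream is an assumption standing in for the terminal.

```python
import io


def can_print(text, encoding):
    """Return True if print() can write `text` to a stream using `encoding`.

    can_print is a hypothetical helper; the in-memory stream stands in
    for a terminal whose encoding was chosen from the locale.
    """
    stream = io.TextIOWrapper(io.BytesIO(), encoding=encoding)
    try:
        print(text, file=stream)
        stream.flush()
    except UnicodeEncodeError:
        return False
    return True


# Decoding the bytes from the question succeeds regardless of the locale.
text = b'\xc4\x8dtrn\xc3\xa1ct'.decode("utf-8")

print(can_print(text, "ascii"))  # False: same failure as on the Debian box
print(can_print(text, "utf-8"))  # True: what a working UTF-8 locale gives
```

After applying the exports in a new shell, sys.stdout.encoding should report a UTF-8 encoding and the original print() call should work.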