Question

I am fetching a JSON feed and then trying to save some of its fields to a file.

First, I fetch the JSON feed with urllib2:

html = urllib2.urlopen(url).read()

Then I parse it with json.loads:

data = json.loads(html)

Then I try to grab the "Name" field of each item:

for i in range (len(data["response"]["feeds"])):
    Name = str(data["body"]["events"][i]["Name"])

Whenever the "Name" field contains an accented character, Python throws a UnicodeEncodeError.

Answer 1

This is a complicated topic that you should take some time to understand: unicode vs. bytestrings in Python 2.x. I found Ned Batchelder's Unicode Pain talk from PyCon 2012 very helpful for understanding it.

Because videos on the pyvideo site may not stay online indefinitely, here are a couple of links:

http://pyvideo.org/video/948/
http://www.youtube.com/watch?feature=player_embedded&v=sgHbC6udIqc

This is especially important when scraping websites from unknown sources with unknown encodings!

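Edit: To summarize some of the information in nedbat's talk: you really should know which encoding the data from your target site uses. urllib2 returns bytes to you, which may or may not get coerced into unicode later on. In this case, your Name field probably contains accented characters, which fall outside the standard ASCII table (i.e. A-Z, a-z, 0-9, etc.) and therefore cannot be converted to ASCII.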

The fix is to decode these bytes as utf-8 (or some other encoding that can handle accented characters), like so:

import urllib2
url = 'http://www.ltg.ed.ac.uk/~richard/unicode-sample.html'  # A page containing raw unicode!
# Decode the page contents as utf-8 instead of leaving them as bytes; bytes that can't be decoded become U+FFFD.
html = urllib2.urlopen(url).read().decode('utf-8', 'replace')

Here you can compare the output of the two approaches:

# Look at the last section of unicode data as bytes. Notice the \xef, signifying bytes, not unicode.
>>> urllib2.urlopen(url).read().splitlines()[-11]
'<dd>\xef\xbc\x81 \xef\xbc\x82 \xef\xbc\x83 \xef\xbc\x84 \xef\xbc\x85 \xef\xbc\x86 ... '

# Now, convert that data into unicode as you open the site.
>>> urllib2.urlopen(url).read().decode(u'utf-8').splitlines()[-11]
u'<dd>\uff01 \uff02 \uff03 \uff04 \uff05 \uff06 \uff07 \uff08 \uff09 \uff0a \uff0b ... '

In the first example, you can see that the data comes back as bytes; in the second, it is all unicode data.

A few caveats apply here. Not every page can be decoded as utf-8, although such cases are rare.
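For pages that turn out not to be utf-8, one possible approach is to try a short list of candidate encodings in order. The sketch below is only an illustration: the helper name decode_page and the choice of latin-1 as the fallback are assumptions, not something taken from the talk or from urllib2. It is written to run on Python 3 as well as Python 2.

```python
def decode_page(raw_bytes, encodings=("utf-8", "latin-1")):
    """Return raw_bytes decoded with the first candidate encoding that works.

    decode_page is a hypothetical helper; the default candidate list is an
    assumption and should be adjusted to the sites you actually scrape.
    """
    for encoding in encodings:
        try:
            return raw_bytes.decode(encoding)
        except UnicodeDecodeError:
            # This candidate cannot represent the bytes; try the next one.
            continue
    # Last resort: decode as utf-8 and substitute U+FFFD for bad bytes.
    return raw_bytes.decode("utf-8", "replace")


utf8_name = decode_page(u"caf\u00e9".encode("utf-8"))
latin1_name = decode_page(b"caf\xe9")
```

Note that latin-1 accepts every possible byte, so with the default list the last-resort branch is never reached; it only matters if you pass a stricter list of encodings. A latin-1 fallback can also silently produce the wrong characters when the page really uses some other encoding.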

One last piece of advice is to switch to the third-party requests library, which handles unicode for you automatically. An example:

>>> import requests
>>> url = 'http://www.ltg.ed.ac.uk/~richard/unicode-sample.html'
>>> response = requests.get(url)

# You can get bytes out of the response:
>>> type(response.content)  # Returns bytes
<type 'str'>

# Or, you can get unicode out of it:
>>> type(response.text)  # Returns unicode
<type 'unicode'>

With response.text, you can now call json.loads(response.text) and successfully get unicode out of the result. Then drop your str() wrapper.
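Since the original goal was to save the fields to a file, here is a minimal sketch of that last step, assuming the feed has the data["body"]["events"][i]["Name"] layout from the question. The helper name save_names, the sample payload, and the output file name are all made up for illustration. The key point is that the names stay unicode and the file is opened with an explicit encoding, so no str() call is needed.

```python
import io
import json
import os
import tempfile


def save_names(json_text, path):
    """Parse json_text and write each event's "Name" to path, one per line.

    save_names is a hypothetical helper; the "body"/"events"/"Name" keys
    mirror the question's code and may differ in a real feed.
    """
    data = json.loads(json_text)
    names = [event["Name"] for event in data["body"]["events"]]
    # io.open with an explicit encoding accepts unicode text on both
    # Python 2 and Python 3 and encodes it to utf-8 on the way out.
    with io.open(path, "w", encoding="utf-8") as handle:
        for name in names:
            handle.write(name + u"\n")
    return names


# Made-up sample payload standing in for response.text.
sample = json.dumps(
    {"body": {"events": [{"Name": u"Jos\u00e9"}, {"Name": u"Zo\u00eb"}]}}
)
out_path = os.path.join(tempfile.mkdtemp(), "names.txt")
saved = save_names(sample, out_path)
```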

The methods used above are documented in the requests method reference.

Answer 2

Name = data["body"]["events"][i]["Name"].encode('utf8')

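is most likely what you want.

The problem is that you are calling str(my_variable), and str will coerce the value to ASCII, which does not support accents.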
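The coercion described above can be reproduced without any network access. In Python 2, calling str() on a unicode object implicitly encodes it with the ASCII codec; the sketch below spells that step out as an explicit .encode("ascii") so that it behaves the same way on Python 3. The sample name is made up for illustration.

```python
name = u"Jos\u00e9"  # made-up sample value containing an accented character

try:
    # Equivalent to what Python 2's str(name) attempts behind the scenes.
    name.encode("ascii")
    ascii_failed = False
except UnicodeEncodeError:
    ascii_failed = True

# Encoding with a codec that covers accented characters succeeds.
utf8_bytes = name.encode("utf-8")
```

Here ascii_failed ends up True, while utf8_bytes holds a byte string that can be written to a file opened in binary mode and decoded back to the original name later.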

UnicodeEncodeError with accented characters when using json.loads in Python
