Question

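I'm using Requests to retrieve an Atom response and I'm running into an encoding problem:

When I retrieve it with curl, it comes out correctly, showing Ä: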

<?xml version="1.0" encoding="UTF-8"?>
<entry xmlns="http://www.w3.org/2005/Atom" xmlns:zapi="http://zotero.org/ns/api">
<title>The power broker : Robert Moses and the fall of New York</title>
(snip)
<content zapi:type="citation" type="xhtml">
    <span xmlns="http://www.w3.org/1999/xhtml">(Robert Ä. Caro 1974)</span>
</content>
</entry>

But when I retrieve it with Requests 2.2.1 on Python 2.7.4, I get this unicode response:

import requests
r = requests.get(url)
r.text
u'<?xml version="1.0" encoding="UTF-8"?>
<entry xmlns="http://www.w3.org/2005/Atom" xmlns:zapi="http://zotero.org/ns/api">
<title>The power broker : Robert Moses and the fall of New York</title>
(snip)
<content zapi:type="citation" type="xhtml">
    <span xmlns="http://www.w3.org/1999/xhtml">(Robert \u0102\x84. Caro 1974)</span>
</content>
</entry>'

Of course, encoding it as utf-8 doesn't get me back to the original. What should I do?

Answer 1

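Since you haven't included any of the response headers the server sent, I can't be conclusive, but my guess is that the server is sending back a utf8-encoded string along with a header that declares the wrong charset: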

Content-Type: text/html; charset=iso-8859-1

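Requests therefore takes the body as a byte stream (a str in Python 2) and decodes it to a unicode string using that charset. Re-encoding the unicode string as latin1 and decoding it back as utf8 should return the original string. The codec passed to encode has to be the one Requests actually decoded with (it is reported as r.encoding); iso-8859-1 here follows the header guessed above.

r.text.encode('iso-8859-1').decode('utf8')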
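
The round trip described above can be tried in isolation, without a network call. This is a minimal sketch, not the answerer's code: the helper name undo_wrong_decode is hypothetical, and the choice of ISO-8859-2 is an assumption inferred from the question's output, since the UTF-8 bytes C3 84 decode to \u0102\x84 under that codec.

```python
# -*- coding: utf-8 -*-
from __future__ import print_function


def undo_wrong_decode(text, wrong_codec):
    """Recover the raw bytes with the codec that was wrongly applied,
    then decode those bytes as the UTF-8 they really are."""
    return text.encode(wrong_codec).decode('utf-8')


# "Ä" (U+00C4) is the two bytes C3 84 in UTF-8.
raw = b'\xc3\x84'

# Assumption: the body was decoded as ISO-8859-2, where C3 maps to U+0102.
garbled = raw.decode('iso-8859-2')
repaired = undo_wrong_decode(garbled, 'iso-8859-2')

print(repr(garbled))
print(repr(repaired))
```

If the codec given to encode cannot represent a character in the garbled text, the call raises UnicodeEncodeError instead of repairing anything. U+0102 has no Latin-1 mapping, for example, so it is worth checking r.encoding before picking the codec.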

But yes, r.content returns a str, so you can apply the correct encoding by hand by decoding it as utf8.
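
Decoding the raw bytes directly might look like the sketch below. The helper decode_body is hypothetical, and the byte string stands in for r.content so that the example runs without a network call.

```python
def decode_body(content, encoding='utf-8'):
    """Decode raw response bytes explicitly, ignoring any charset the
    server declared in its Content-Type header."""
    return content.decode(encoding)


# Stand-in for r.content: the UTF-8 bytes of the citation in the question.
content = b'(Robert \xc3\x84. Caro 1974)'
text = decode_body(content)

# With a real response this would be:
#   r = requests.get(url)
#   text = decode_body(r.content)
```

Another option Requests offers is to set r.encoding = 'utf-8' before reading r.text, which makes r.text decode with that codec.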

Answer 2

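Are you sure you aren't trying to produce a "known" letter in place of the "Ă" that \u0102 stands for? Googling this author's name, the "A" is supposed to be plain (Robert Allan Caro). The u"\x84" character by itself is a closing-quote character (byte 0x84 is a low double quotation mark in Windows-1252; U+0084 itself is a control character, check http://www.fileformat.info/info/unicode/category/Cc/list.htm), so this could be an OCR artifact from scanning 'Robert "A." Caro' from somewhere, stored on the server exactly as you see it on the Python side.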

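Try using curl with the --raw option to check the actual content in this case.

(I've played around with the string, and in this case this hypothesis looks more likely to me than a double encoding.)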

Fixing double-encoded utf-8 returned by Requests
