Question

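I am trying to get a string from a website. I use the requests module to send the GET request:

import requests

text = requests.get("http://example.com")  # send a GET request to the website
print text.text  # print the response body

However, for some reason the text appears as gibberish instead of Hebrew:

<div>
<p>×©×¨×ª</p>
</div>

When I sniff the traffic with Fiddler or view the website in my browser, I see it in Hebrew:

<div> <p>שרת</p> </div>

By the way, the HTML contains a meta tag that defines the encoding, which is utf-8. I tried to encode the text to utf-8, but it was still gibberish. I tried to decode it using utf-8, but that throws a UnicodeEncodeError exception. I declared that I am using utf-8 on the first line of the script. The problem also occurs when I send the request with the built-in urllib module.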

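I read the Unicode HOWTO, but still could not fix it. I also read many threads here (both about the UnicodeEncodeError exception and about why Hebrew turns into gibberish in Python), but I still could not fix it.

I am using Python 2.7.9 on a Windows machine. I run my script in Python IDLE.

Thanks in advance.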

Answer 1

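The server did not declare the encoding correctly, so the UTF-8 bytes of the page were decoded as Latin-1. Reversing that wrong decode recovers the Hebrew: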

>>> print u'×©×¨×ª'.encode('latin-1').decode('utf-8')
שרת
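The same round trip can be reproduced without any network access. This is a minimal sketch, assuming the page bytes are UTF-8 and were decoded as Latin-1; the variable names are illustrative and do not come from the question. It is written to run on both Python 2.7 and Python 3.

```python
from __future__ import print_function

# The Hebrew word from the question, written with escapes so the
# snippet does not depend on the source file's own encoding.
hebrew = u"\u05e9\u05e8\u05ea"

raw_bytes = hebrew.encode("utf-8")       # the bytes the server sends
garbled = raw_bytes.decode("latin-1")    # wrong decode: one character per byte
repaired = garbled.encode("latin-1").decode("utf-8")  # undo the wrong decode

print(len(raw_bytes), len(garbled))  # 6 6
print(repaired == hebrew)            # True
```

Each Hebrew letter takes two bytes in UTF-8, which is why three letters turn into six Latin-1 characters.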

Set text.encoding before accessing text.text.

import requests

text = requests.get("http://example.com")  # send a GET request to the website
text.encoding = 'utf-8'  # override the wrongly detected page encoding
print text.text  # the body is now decoded as UTF-8
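Overriding the encoding unconditionally is fine for a single known page. As a hedged sketch for a more general case, the hypothetical helper `choose_encoding` below (not part of requests) keeps whatever charset was declared and substitutes a fallback only when the value is missing or is ISO-8859-1, which requests uses as its default for text content types that carry no charset. The utf-8 fallback is an assumption about the page, and a server that genuinely serves Latin-1 would be mis-decoded by it.

```python
from __future__ import print_function


def choose_encoding(declared, fallback="utf-8"):
    """Pick the encoding to use for decoding a response body.

    `declared` is the value read from response.encoding; `fallback` is an
    assumption about what the page really uses.
    """
    if declared is None:
        return fallback
    if declared.lower() in ("iso-8859-1", "latin-1", "latin1"):
        return fallback
    return declared


# Stand-in for response.content: the UTF-8 bytes of the Hebrew word.
body = u"\u05e9\u05e8\u05ea".encode("utf-8")
decoded = body.decode(choose_encoding("ISO-8859-1"))
print(decoded == u"\u05e9\u05e8\u05ea")  # True
```

With requests, the helper would be applied as `text.encoding = choose_encoding(text.encoding)` before reading `text.text`.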

Text from a website appears as gibberish instead of Hebrew
