Handling response encoding with requests

Asked: 2016-05-19 16:11:26

Tags: python python-3.x web-scraping python-requests

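I am using requests to fetch a page. The task is simple, but I have an encoding problem. The page contains non-ASCII Turkish characters, and in the HTML source they appear as follows: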

ÇINARTEPE # What it looks like
&#199;INARTEPE # What it is like in HTML source

As a result, the following operations do not return what I expect:

# What I have tried as encoding
req.encoding = "utf-8"
req.encoding = "iso-8859-9"
req.encoding = "iso-8859-1"

# The operations
"ÇINARTEPE" in req.text # False, it must return True
bytes("ÇINARTEPE", "utf-8") in req.content # False
bytes("ÇINARTEPE", "iso-8859-9") in req.content # False
bytes("ÇINARTEPE", "iso-8859-1") in req.content # False

What I want is to find out whether the string "ÇINARTEPE" is in the HTML source.

More information

An example:

req = requests.get("http://www.eshot.gov.tr/tr/OtobusumNerede/290")
"ÇINARTEPE" in req.text # False
req.encoding = "iso-8859-1"
"ÇINARTEPE" in req.text # False
req.encoding = "iso-8859-9"
"ÇINARTEPE" in req.text # False
# Supposed to return True

Environment

  • Python 3.5.1
  • requests 2.10.0

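1 answer:

Answer 0 (score: 3)

The problem is not the response encoding: the page stores the character as an HTML character reference (&#199; for Ç), so you need to unescape the HTML entities before searching. There are already some answers about this on Stack Overflow; check this post.

Basically, one way is: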

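# Python 2 only (in Python 3, use html.unescape instead)
from HTMLParser import HTMLParser
html_encoded_string = "&#199;INARTEPE"  # example input
parser = HTMLParser()
html_decoded_string = parser.unescape(html_encoded_string)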

Update

I found a better answer in the Python 3 docs and tested it:

>>> import html
>>> html.unescape("&#199;INARTEPE")
'ÇINARTEPE'
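
Applied to the question, the idea would be to unescape the page source before the membership test. The following is a minimal sketch, not a tested fix for that site: the helper name source_contains and the sample_source string are made up for illustration, and sample_source only stands in for req.text, since the real page content is not reproduced here.

```python
import html


def source_contains(html_source: str, needle: str) -> bool:
    """Return True if needle occurs in html_source once character
    references such as &#199; have been decoded."""
    return needle in html.unescape(html_source)


# Hypothetical stand-in for req.text (assumed shape, not the real page).
sample_source = "<td>&#199;INARTEPE</td>"

# The raw source holds the reference, not the character itself.
print("ÇINARTEPE" in sample_source)

# After unescaping, the search string can be found.
print(source_contains(sample_source, "ÇINARTEPE"))
```

With a real response, the call would presumably be source_contains(req.text, "ÇINARTEPE"). Changing req.encoding is unlikely to help here, because the reference &#199; consists only of ASCII characters and so decodes the same way under all three encodings that were tried.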