Question

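I am trying to search a web page using a regular expression, but I get the following error:

TypeError: cannot use a string pattern on a bytes-like object

I understand why: urllib.request.urlopen() returns a byte stream, so, at least I am guessing, re does not know which encoding to use. What should I do in this situation? Is there a way to specify the encoding in the URL request, or do I need to decode the bytes myself? If so, what should I be doing? I assume I should read the encoding from the header info, or from the encoding declared in the HTML if one is specified, and then decode with that?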
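For reference, the error can be reproduced without any network access. The snippet below is a minimal sketch: the `raw` bytes are a made-up stand-in for what `urlopen(...).read()` would return, and UTF-8 is assumed as the encoding. It shows the failing call and two possible ways around it.

```python
import re

# Hypothetical stand-in for the bytes that urlopen(...).read() returns.
raw = b"<html><title>Example</title></html>"

# A str pattern applied to bytes raises the TypeError from the question.
try:
    re.search(r"<title>(.*?)</title>", raw)
    raised = False
except TypeError:
    raised = True

# Option 1: decode the bytes first (UTF-8 is an assumption here),
# then search with a normal str pattern.
text = raw.decode("utf-8")
title_from_text = re.search(r"<title>(.*?)</title>", text).group(1)

# Option 2: leave the data as bytes and use a bytes pattern instead.
title_from_bytes = re.search(rb"<title>(.*?)</title>", raw).group(1)
```

Option 2 avoids guessing the encoding, but every match is then a bytes object as well.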

Answer 1

As for me, the solution is as follows (python3):

resource = urllib.request.urlopen(an_url)
content =  resource.read().decode(resource.headers.get_content_charset())
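One caveat: `get_content_charset()` returns `None` when the server sends no charset, and `decode(None)` then fails. A possible sketch with a fallback is shown here; the helper names `decode_body` and `fetch_text` and the UTF-8 default are my own assumptions, not part of urllib.

```python
import urllib.request


def decode_body(body, charset, default="utf-8"):
    """Decode bytes with the given charset, or with `default` if it is None.

    Undecodable bytes are replaced rather than raising an error.
    """
    return body.decode(charset or default, errors="replace")


def fetch_text(url, default="utf-8"):
    """Fetch a URL and return its body as str (hypothetical helper)."""
    with urllib.request.urlopen(url) as resource:
        # None when the Content-Type header carries no charset parameter.
        charset = resource.headers.get_content_charset()
        return decode_body(resource.read(), charset, default)
```

Something like `content = fetch_text(an_url)` would then replace the two lines above. A charset name that Python does not recognise would still raise `LookupError`.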

Answer 2

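You just need to decode the response using the charset from the Content-Type header, which is typically the last value in that header. There is also an example in the tutorial.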

output = response.decode('utf-8')

Answer 3

Use requests:

import requests

response = requests.get(URL).text

Answer 4

I had the same problem for the last two days, and I finally have a solution. I use the info() method of the object returned by urlopen():

req=urllib.request.urlopen(URL)
charset=req.info().get_content_charset()
content=req.read().decode(charset)

Answer 5

urllib.urlopen(url).headers.getheader('Content-Type')

will output something like this:

text/html; charset=utf-8
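Note that `urllib.urlopen` and `getheader` on the headers object are the Python 2 spelling; in Python 3 the same header should be reachable as `urllib.request.urlopen(url).headers['Content-Type']`. To pull the charset out of such a header value, one option is the standard library's `email.message.Message`. The sketch below assumes a fallback of UTF-8, and the function name `charset_from_content_type` is made up for illustration.

```python
from email.message import Message


def charset_from_content_type(value, default="utf-8"):
    """Extract the charset parameter from a Content-Type header value.

    Returns `default` when the header has no charset parameter.
    """
    msg = Message()
    msg["Content-Type"] = value
    # get_content_charset() returns the lowercased charset, or None.
    return msg.get_content_charset() or default
```

For the header shown above, `charset_from_content_type('text/html; charset=utf-8')` gives `'utf-8'`, which can be passed straight to `bytes.decode()`.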

Answer 6

Here is a simple HTTP request example (which I tested and which works)...

address = "http://stackoverflow.com"    
urllib.request.urlopen(address).read().decode('utf-8')

Be sure to read the documentation.

https://docs.python.org/3/library/urllib.request.html

If you want to make a more detailed GET / POST request:

import urllib.request
# HTTP REQUEST of some address
def REQUEST(address):
    req = urllib.request.Request(address)
    req.add_header('User-Agent', 'NAME (Linux/MacOS; FROM, USA)')
    response = urllib.request.urlopen(req)
    html = response.read().decode('utf-8')  # make sure its all text not binary
    print("REQUEST (ONLINE): " + address)
    return html

Answer 7

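After you make the request with req = urllib.request.urlopen(...), you have to read the response by calling html_string = req.read(). In Python 3 this gives you the response body as bytes, so decode it first if you want a string, and then you can parse it however you like.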
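If the headers give no charset, the question's other idea (reading the encoding declared in the HTML itself) can be approximated by searching the raw bytes with a bytes pattern. This is only a rough sketch: the regular expression, the 2048-byte window, and the helper names are assumptions of mine, and a real HTML parser would be more robust.

```python
import re

# Matches both <meta charset="..."> and
# <meta http-equiv="Content-Type" content="text/html; charset=...">.
_META_CHARSET = re.compile(
    rb"<meta[^>]+charset=[\"']?([A-Za-z0-9_-]+)", re.IGNORECASE
)


def sniff_meta_charset(body, default="utf-8"):
    """Look for a charset declaration near the start of the raw HTML bytes."""
    match = _META_CHARSET.search(body[:2048])
    return match.group(1).decode("ascii").lower() if match else default


def decode_html(body, header_charset=None):
    """Prefer the header charset; otherwise fall back to the <meta> tag."""
    charset = header_charset or sniff_meta_charset(body)
    return body.decode(charset, errors="replace")
```

With this, `html_string = decode_html(req.read(), req.headers.get_content_charset())` would yield a str suitable for string patterns.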

How to handle response encoding from urllib.request.urlopen()