Question

I am running this code with Python 3.2.3:

import re

regex = '<title>(.+?)</title>'
pattern = re.compile(regex)

Then I use findall to search for the pattern:

titles = re.findall(pattern,html)
print(titles)

The html object holds the HTML fetched from a particular URL:

html = response.read()

I get the error "can't use a string pattern on a bytes-like object". I tried using:

regex = b'<title>(.+?)</title>'

But then do my results get a 'b' prefixed to them? Thanks.

Answer 1

urllib.request responses give you bytes, not Unicode strings. That is why the re pattern needs to be a bytes object as well, and why you get bytes results back.
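As a minimal sketch of that behaviour, the snippet below uses a hard-coded bytes literal as a stand-in for response.read() (the sample page content is made up for illustration). It shows a bytes pattern returning bytes results, and a str pattern raising TypeError on the same data.

```python
import re

# Hypothetical stand-in for response.read(): the body arrives as bytes.
html = b"<html><head><title>Example Page</title></head></html>"

# A bytes pattern works on bytes input, and the matches are bytes too.
pattern = re.compile(b"<title>(.+?)</title>")
titles = pattern.findall(html)
print(titles)  # [b'Example Page']

# A str pattern on the same bytes input raises TypeError.
try:
    re.findall("<title>(.+?)</title>", html)
    raised = False
except TypeError:
    raised = True
```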

You can decode the response using the encoding the server supplied in the HTTP headers:

html = response.read()
# no codec set? We default to UTF-8 instead, a reasonable assumption
codec = response.info().get_param('charset', 'utf8')
html = html.decode(codec)

Now you have Unicode text and can use a Unicode (str) regular expression too.
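Put together, decoding first and then matching with a str pattern might look like the sketch below. The helper name extract_titles and the sample bytes are assumptions for illustration; in real use the raw bytes and codec would come from the response as shown above.

```python
import re

# str pattern, to be applied to decoded text
TITLE_RE = re.compile(r"<title>(.+?)</title>")

def extract_titles(raw_bytes, codec="utf-8"):
    """Decode the raw response body, then search the resulting str."""
    text = raw_bytes.decode(codec)
    return TITLE_RE.findall(text)

# Hypothetical sample body, encoded as UTF-8 to mimic response.read().
sample = "<title>Caf\u00e9</title>".encode("utf-8")
titles = extract_titles(sample)
print(titles)  # ['Café']
```

The results are ordinary str objects, so no b prefix appears when they are printed.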

The above can still raise UnicodeDecodeError if the server lied about the encoding, or if no encoding was set and the UTF-8 default is wrong as well.
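One possible way to cope with that case, not part of the original answer, is to fall back to a lossy decode when the declared codec fails. The helper name decode_html is hypothetical, and replacing undecodable bytes is only one option; whether it is acceptable depends on what you do with the text.

```python
def decode_html(raw_bytes, codec="utf-8"):
    """Decode with the declared codec, falling back to lossy UTF-8."""
    try:
        return raw_bytes.decode(codec)
    except (UnicodeDecodeError, LookupError):
        # UnicodeDecodeError: the bytes do not match the codec.
        # LookupError: the server named a codec Python does not know.
        return raw_bytes.decode("utf-8", errors="replace")

# Latin-1 encoded bytes that are not valid UTF-8 (sample data).
bad = b"<title>caf\xe9</title>"
text = decode_html(bad)  # the invalid byte becomes U+FFFD
```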

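In any case, return values shown as b'...' are bytes objects: raw string data that has not yet been decoded to Unicode. They are nothing to worry about if you know the correct encoding for the data.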
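If you do keep the bytes pattern, the matches can be decoded afterwards. The short sketch below assumes the data is UTF-8 and uses made-up sample results; the b'...' prefix is only how Python displays bytes objects, not part of the data.

```python
# Hypothetical results from a bytes regex; the second title is UTF-8 encoded.
titles_bytes = [b"Example Page", b"Caf\xc3\xa9"]

# Decode each match once the correct encoding is known (assumed UTF-8 here).
titles = [t.decode("utf-8") for t in titles_bytes]
print(titles)  # ['Example Page', 'Café']
```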
