Question

When I run the following code:

import urllib.request

with urllib.request.urlopen('http://google.ru') as url:
    print(url.read().decode())

I get this error:

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xcf in position 102: invalid continuation byte

Is there a way to fix it?

Answer 1

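You are decoding the data without specifying a codec. In that case the default (UTF-8) is used, and that default is wrong for this page. Given the domain name, I would expect it to be a Cyrillic encoding.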
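The failure can be reproduced without touching the network. This is only a minimal sketch: it assumes the page bytes are CP-1251 (a guess, based on byte 0xcf from the traceback being the letter П in that code page), and the sample word is purely illustrative.

```python
# Assumption: the page is CP-1251 encoded; the sample text is illustrative.
raw = "Поиск".encode("cp1251")  # first byte is 0xCF

try:
    raw.decode()  # no argument means UTF-8
    error = None
except UnicodeDecodeError as exc:
    error = exc
    print(exc)

# With the matching codec the same bytes decode cleanly.
print(raw.decode("cp1251"))
```

The byte 0xcf is a valid UTF-8 lead byte, but the byte that follows it is not a continuation byte, which is why the message says "invalid continuation byte".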

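If the response declares its codec, you will find it with url.info().get_content_charset(), which reads the charset parameter of the Content-Type header; the similarly named get_charset() does not parse that header. get_content_charset() returns None if no charset is set, at which point the HTML may contain a hint in a <meta> tag, which you have to parse manually.

The URL you are trying to load does not include a charset in its Content-Type header:

>>> import urllib.request
>>> url = urllib.request.urlopen('http://google.ru')
>>> url.info().get_content_charset() is None
True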
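Parsing the <meta> hint by hand could look roughly like the sketch below. The helper name guess_charset, the regular expression, and the 2048-byte search window are my own assumptions, not part of urllib, and a regex is only an approximation of real HTML parsing.

```python
import re

# Hypothetical helper, not part of urllib. Matches both
# <meta charset="..."> and <meta ... content="text/html; charset=...">.
META_CHARSET = re.compile(
    rb"<meta[^>]+charset\s*=\s*[\"']?([\w.:-]+)", re.IGNORECASE
)

def guess_charset(header_charset, body, default="latin-1"):
    """Return the header charset, else the <meta> charset, else default."""
    if header_charset:
        return header_charset
    # Charset declarations normally sit near the top of the document.
    match = META_CHARSET.search(body[:2048])
    if match:
        return match.group(1).decode("ascii")
    return default

sample = (b'<html><head><meta http-equiv="Content-Type" '
          b'content="text/html; charset=windows-1251"></head></html>')
print(guess_charset(None, sample))
```

With a real response you would read the body once with url.read(), pass the charset from the Content-Type header (or None) as the first argument, and then decode the body with the name that comes back.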

If neither a <meta> tag nor a Content-Type charset is set, the default is Latin-1; this decodes without error for the URL you gave:

print(url.read().decode('latin1'))

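However, this may not even be the right encoding. Latin-1 maps every possible byte to a character, so decoding with it never fails, and you may end up with Mojibake instead of readable text. In some cases you have to hardcode the encoding; this page looks to me like CP-1251 (a Windows Cyrillic code page).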
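To see why a successful Latin-1 decode proves nothing, here is a small offline sketch. It assumes CP-1251 input and uses an illustrative sample string rather than the real page bytes.

```python
# Assumption: CP-1251 input; the sample text is illustrative.
raw = "Поиск информации".encode("cp1251")

# Every byte value 0-255 is a valid Latin-1 character, so this never raises.
assert len(bytes(range(256)).decode("latin-1")) == 256

as_latin1 = raw.decode("latin-1")  # succeeds, but yields mojibake
as_cp1251 = raw.decode("cp1251")   # the readable text

print(as_latin1)
print(as_cp1251)

# Latin-1 decoding is lossless, so the mojibake can be undone
# by round-tripping back to bytes and decoding with the right codec.
repaired = as_latin1.encode("latin-1").decode("cp1251")
```

The round trip at the end is handy when text has already been decoded with the wrong codec somewhere upstream and you no longer have the original bytes.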

If you are going to parse the HTML, use BeautifulSoup and pass in the bytes content; it will auto-detect the encoding for you:

import urllib.request
from bs4 import BeautifulSoup

with urllib.request.urlopen('http://google.ru') as url:
    # Naming the parser explicitly avoids BeautifulSoup's "no parser specified" warning
    soup = BeautifulSoup(url, 'html.parser')

If the auto-detection gets it wrong, you can tell BeautifulSoup to use a specific encoding with from_encoding:

with urllib.request.urlopen('http://google.ru') as url:
    # from_encoding overrides auto-detection; the parser is named explicitly
    soup = BeautifulSoup(url, 'html.parser', from_encoding='cp1251')

Demo:

>>> import urllib.request
>>> from bs4 import BeautifulSoup
>>> url = urllib.request.urlopen('http://google.ru')
>>> soup = BeautifulSoup(url, from_encoding='cp1251')
>>> soup.head.meta
<meta content="Поиск информации в интернете: веб страницы, картинки, видео и многое другое." name="description"/>

I must say I am surprised that Google does not set a correct Content-Type charset on the response.

UnicodeDecodeError with urllib.request object
