Question

When parsing HTML with Requests and Beautiful Soup, the following line throws an exception on some web pages:

if 'var' in str(tag.string):

Here is the context:

response = requests.get(url)  
soup = bs4.BeautifulSoup(response.text.encode('utf-8'))

for tag in soup.findAll('script'):
    if 'var' in str(tag.string):    # This is the line throwing the exception
        print(tag.string)

Here is the exception:

UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 15: ordinal not in range(128)

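I have tried with and without the encode('utf-8') call on the BeautifulSoup line, and it makes no difference. I did notice that, for the pages that throw the exception, there is a character Ã in a comment in the JavaScript, even though the encoding reported by response.encoding is ISO-8859-1. I realise I could strip the offending character with something like unicodedata.normalize, but I would prefer to convert the tag variable to utf-8 and keep the characters. None of the following methods changes the variable to utf-8: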

tag.encode('utf-8')
tag.decode('ISO-8859-1').encode('utf-8')
tag.decode(response.encoding).encode('utf-8')

What must I do to this string to turn it into usable utf-8?

Answer 1

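OK, so basically you end up getting an HTTP response encoded in Latin-1. The character giving you trouble is indeed Ã: if you look at a Latin-1 character table, you can see that 0xC3 is exactly that character in Latin-1.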
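The claim that 0xC3 is Ã in Latin-1 can be checked without a lookup table. The following is a minimal sketch, assuming Python 3 (the code in the question appears to be Python 2); the variable names are made up for illustration.

```python
# The byte reported in the traceback.
raw = b"\xc3"

# Latin-1 maps every byte to the code point with the same number.
as_latin1 = raw.decode("latin-1")

# ASCII only covers 0x00-0x7F, so the same byte cannot be decoded.
try:
    raw.decode("ascii")
    ascii_error = None
except UnicodeDecodeError as exc:
    ascii_error = exc

# For comparison: the UTF-8 encoding of that character.
as_utf8 = as_latin1.encode("utf-8")

print(repr(as_latin1), ascii_error, as_utf8)
```

Note that the UTF-8 encoding of Ã also begins with the byte 0xC3, so the same byte can show up after an encode('utf-8') call as well.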

I think you blindly tested every combination of decoding and encoding you could imagine. First of all, if you do this:

if 'var' in str(tag.string):

then Python will complain whenever the string variable contains non-ASCII bytes.

Looking at the code you shared with us, the right approach IMHO would be:

response = requests.get(url)
# Pass the raw bytes and tell Beautiful Soup how to decode them.
# (response.text is already unicode, so from_encoding would be ignored for it.)
soup = bs4.BeautifulSoup(response.content, from_encoding=response.encoding)

for tag in soup.findAll('script'):
    # The soup now holds unicode strings, so compare against a unicode literal.
    # tag.string is None for empty or multi-child tags, hence the guard.
    if tag.string and u'var' in tag.string:
        # Encode to utf-8 only when you produce output.
        print(tag.string.encode('utf-8'))
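As a self-contained illustration of this approach that does not need a network request, here is a hedged sketch in Python 3. The sample bytes, the choice of the "html.parser" backend, and the variable names are assumptions made for the example, not part of the original question.

```python
import bs4

# Hypothetical page body, served as ISO-8859-1, with the byte 0xC3 in a JS comment.
html = b"<html><head><script>// \xc3 comment\nvar x = 1;</script></head></html>"

# Hand Beautiful Soup the raw bytes plus the encoding to decode them with.
soup = bs4.BeautifulSoup(html, "html.parser", from_encoding="iso-8859-1")

# Collect the script bodies that mention 'var'; skip tags whose .string is None.
matches = [
    tag.string
    for tag in soup.find_all("script")
    if tag.string and "var" in tag.string
]

# Encode to UTF-8 only at the output boundary.
encoded = [m.encode("utf-8") for m in matches]
print(encoded)
```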

EDIT: It may also be useful to take a look at the encoding section of the Beautiful Soup 4 documentation.

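Basically, the logic is:

You get some bytes encoded in encoding X
You decode them with bytes.decode('X'), which returns a unicode string
You work with unicode
You encode the unicode into some encoding Y for output with ubytes.encode('Y')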
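The decode, work, encode sequence can be sketched as follows. This is a minimal illustration assuming Python 3, with Latin-1 standing in for X and UTF-8 for Y; the sample bytes and variable names are invented for the example.

```python
# Bytes as they might arrive from the server, encoded as Latin-1 (encoding X).
raw = b"var name = '\xc3';"

# 1. Decode the bytes into a Unicode string.
text = raw.decode("latin-1")

# 2. Work with Unicode: ordinary string operations are safe here.
has_var = "var" in text

# 3. Encode to the output encoding (encoding Y) only at the boundary.
out = text.encode("utf-8")

print(has_var, out)
```

Keeping the decode and encode steps at the edges of the program means that everything in between deals with a single string type.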

Hope this sheds some light on the problem.

Answer 2

You can also try using the Unicode, Dammit library (it is part of BS4) to parse the page. A detailed description is here: http://scriptcult.com/subcategory_176/article_852-use-beautifulsoup-unicodedammit-with-lxml-html.html
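A rough sketch of what that could look like, assuming Python 3 and a Beautiful Soup 4 install. The sample bytes are hypothetical, and passing ["latin-1"] as the list of encodings to try first is an assumption based on the encoding reported in the question.

```python
from bs4 import UnicodeDammit

# Hypothetical page fragment containing the Latin-1 byte 0xC3.
raw = b"<script>// \xc3 comment\nvar x = 1;</script>"

# Suggest Latin-1 as the encoding to try first.
dammit = UnicodeDammit(raw, ["latin-1"])

# unicode_markup holds the decoded text as a Unicode string.
text = dammit.unicode_markup

# Re-encode as UTF-8 for output.
utf8_bytes = text.encode("utf-8")
print(utf8_bytes)
```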
