Question

# -*- coding: utf-8 -*-
import sys
import urllib
from bs4 import BeautifulSoup
print sys.getdefaultencoding()
html_req_url = 'http://www.superfix.com/'
html_content = urllib.urlopen(html_req_url)
soup = BeautifulSoup(html_content, 'lxml')
html_title = soup.findAll('title')
print html_title

This is the output:

ascii
[<title>SUPERFIX\u5b98\u7f51 \u2013 \u5b89\u5168\u4fbf\u6377\u7684\u624b\u673a\u7ef4\u4fee\u670d\u52a1</title>]

I'm using PyCharm on a Mac. I can't encode any of the str output from soup.findAll('title'), but print soup works fine.

Am I missing something?

Answer 1

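soup.findAll() (or soup.find_all(), the newer name) returns a list of Tag objects. These are not text themselves, so they cannot be encoded. Printing a list in Python 2 shows the repr() of each element, which is why the non-ASCII characters in your output appear as \uXXXX escapes.

An HTML document has only one <title> tag, so use soup.find() instead, then extract the text content with Tag.get_text():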
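To see that find_all() really hands back Tag objects rather than strings, here is a minimal sketch. It assumes an inline HTML string in place of the real page and uses the built-in 'html.parser' instead of 'lxml' so no extra parser is needed; both choices are illustrative only. It is written to run on Python 2 and Python 3.

```python
from __future__ import print_function

from bs4 import BeautifulSoup
from bs4.element import Tag

# Hypothetical inline page standing in for the real site (an assumption).
html = u'<html><head><title>SUPERFIX\u5b98\u7f51</title></head></html>'
soup = BeautifulSoup(html, 'html.parser')

tags = soup.find_all('title')

# Every result is a Tag object, not a string.
print(all(isinstance(t, Tag) for t in tags))

# Pull the text out of each Tag to get a list of unicode strings.
titles = [t.get_text() for t in tags]
print(len(titles))
```

Once the text has been extracted from each Tag, the items in titles are ordinary unicode strings and can be encoded like any other.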

title_tag = soup.find('title')
if title_tag is not None:
    title = title_tag.get_text()
    print title

Text in BeautifulSoup is returned as NavigableString objects, a subclass of unicode. You can encode these values to UTF-8 as needed.
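As a rough illustration of that last step, the sketch below extracts the title text and encodes it to UTF-8 bytes. The inline HTML string and the 'html.parser' choice are assumptions made so the example runs without network access or lxml; the variable title_utf8 is a name introduced here for illustration.

```python
from __future__ import print_function

from bs4 import BeautifulSoup

# Hypothetical inline page standing in for the real site (an assumption).
html = u'<html><head><title>SUPERFIX\u5b98\u7f51</title></head></html>'
soup = BeautifulSoup(html, 'html.parser')

title_tag = soup.find('title')
# Fall back to an empty unicode string if the page has no <title>.
title = title_tag.get_text() if title_tag is not None else u''

# Encode the unicode text to UTF-8 bytes, e.g. before writing to a file.
title_utf8 = title.encode('utf-8')
print(repr(title_utf8))
```

Encoding only at the point where bytes are actually required (file or socket output) keeps the rest of the program working with unicode text.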
