Question

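I parsed an HTML page with .findAll (BeautifulSoup) into a variable named result. If I type result in the Python shell and press Enter, I see normal text as expected. However, I want to post-process this result as a string object, and I noticed that str(result) returns garbage, like this sample: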

\xd1\x87\xd0\xb8\xd0\xbb\xd0\xbd\xd0\xb8\xd1\x86\xd0\xb0</a><br />\n<hr />\n</div>

The HTML page source is UTF-8 encoded.

How should I handle this?

The code is basically this, in case it matters:

import urllib
from BeautifulSoup import BeautifulSoup
# urllib.urlopen is the Python 2 call; urllib.open does not exist
soup = BeautifulSoup(urllib.urlopen(url).read())
result = soup.findAll(something)

The Python version is 2.7.

Answer 1

Python 2.6.7, BeautifulSoup version 3.2.0.

This worked for me:

unicode.join(u'\n',map(unicode,result))

I'm fairly sure result is a BeautifulSoup.ResultSet object, which appears to be an extension of the standard Python list.
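For readers on Python 3, where str is already Unicode text, the same idea might look like the sketch below. It only assumes that result is an iterable whose items convert cleanly with str(); join_results is a hypothetical helper name, not part of BeautifulSoup.

```python
# Python 3 sketch; join_results is a hypothetical helper, not a BeautifulSoup API.
def join_results(result, sep="\n"):
    """Convert each element to text and join everything into one string."""
    return sep.join(str(item) for item in result)


# Assumed usage with the bs4 package (not executed here):
#   from bs4 import BeautifulSoup
#   soup = BeautifulSoup(html, "html.parser")
#   text = join_results(soup.find_all("a"))
```

Because a ResultSet behaves like a list, the generator expression visits each matched tag in document order.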

Answer 2

import urllib
from BeautifulSoup import BeautifulSoup
# urllib.urlopen is the Python 2 call; urllib.open does not exist
soup = BeautifulSoup(urllib.urlopen(url).read())
# findAll returns a list-like ResultSet with every match
result = soup.findAll(something)
# then iterate over the result
for line in result:
    # get the str value of each item; replace 'charset' with utf-8 or whatever charset you need
    print line.__str__('charset')

BTW: the BeautifulSoup version is beautifulsoup-3.2.1.

Answer 3

That's not garbage, that's UTF-8 encoded text. Use Unicode instead.
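As a quick check, the escaped bytes from the question's sample can be decoded by hand. This is a minimal Python 3 sketch (the thread uses Python 2, where the equivalent is calling .decode('utf-8') on the str); the variable names are illustrative only.

```python
# The byte sequence copied from the sample output in the question.
raw = b"\xd1\x87\xd0\xb8\xd0\xbb\xd0\xbd\xd0\xb8\xd1\x86\xd0\xb0"

# Decoding as UTF-8 turns the 14 bytes into 7 Cyrillic characters.
text = raw.decode("utf-8")

print(text)       # чилница
print(len(raw))   # 14
print(len(text))  # 7
```

Each Cyrillic letter takes two bytes in UTF-8, which is why the escaped form looks so much longer than the text it represents.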

Answer 4

Use this:

unicodedata.normalize('NFKC', p.decode()).encode('ascii','ignore')

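Unicode has multiple normalization forms. That output should not be garbage.
Use the originalEncoding attribute to verify the encoding scheme.
For Python's Unicode implementation, refer to this document (it covers normalization as well).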
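To see what the one-liner above does to real input, here is a small Python 3 sketch. It passes the encoding explicitly, which is an assumption on my part (in Python 2 a bare decode() falls back to the ASCII default and fails on non-ASCII bytes), and to_ascii is a hypothetical helper name.

```python
import unicodedata


def to_ascii(raw, encoding="utf-8"):
    """Decode bytes, apply NFKC normalization, then drop every non-ASCII character."""
    text = raw.decode(encoding)
    normalized = unicodedata.normalize("NFKC", text)
    return normalized.encode("ascii", "ignore")


# NFKC expands the "fi" ligature (U+FB01) to plain "fi"; the accented "e" is dropped.
latin = to_ascii("caf\u00e9 \ufb01le".encode("utf-8"))

# The Cyrillic sample from the question has no ASCII equivalent, so nothing survives.
cyrillic = to_ascii(b"\xd1\x87\xd0\xb8\xd0\xbb\xd0\xbd\xd0\xb8\xd1\x86\xd0\xb0")

print(latin)     # b'caf file'
print(cyrillic)  # b''
```

So this approach is lossy: it suits input where only stray non-ASCII characters need to go, and it would discard the asker's Cyrillic text entirely.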

How to convert a BeautifulSoup.ResultSet to a string
