
I'm having trouble getting Python mechanize to return a unicode string rather than a str in Python 2.7.6.

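import mechanize
from bs4 import BeautifulSoup

url = 'http://www.huffingtonpost.com'
br = mechanize.Browser()
br.set_handle_robots(False)  # do not fetch or obey robots.txt
br.addheaders = [('User-agent', 'Firefox')]
html = br.open(url).read()  # raw response body
print(type(html))
print(BeautifulSoup(html).original_encoding)  # encoding detected by Beautiful Soup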

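html is of type str, yet the page's original encoding as detected by Beautiful Soup (and as you can see in the page source) is utf8, i.e. unicode. Is there a way to have the browser return html (and any text subsequently extracted from it) as unicode? One solution I've come across is to decode the html back into unicode, i.e. html.decode('utf8'), but it seems odd that the page needs to be converted to str and then back to unicode. Is there a way to get the html in the site's encoding through the mechanize browser?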
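For what it's worth, read() hands back the raw bytes as they arrived over the network, so the decode step is not really a round trip: it is the one conversion from bytes to unicode. A minimal sketch of that step follows, assuming the charset is taken from the Content-Type header and falls back to UTF-8. decode_body is a hypothetical helper name, not part of mechanize, and the regex is a simplification that ignores quoted charset values.

```python
import re

def decode_body(raw_bytes, content_type=None, default="utf-8"):
    """Decode a raw HTTP body to text (hypothetical helper, not a mechanize API).

    Uses the charset named in the Content-Type header when one is present,
    otherwise falls back to `default`.
    """
    charset = default
    if content_type:
        # Simplified parse: matches e.g. "text/html; charset=ISO-8859-1"
        match = re.search(r"charset=([\w-]+)", content_type, re.I)
        if match:
            charset = match.group(1)
    try:
        # "replace" substitutes U+FFFD for undecodable bytes instead of raising
        return raw_bytes.decode(charset, "replace")
    except LookupError:
        # The header named a codec Python does not know; use the fallback
        return raw_bytes.decode(default, "replace")
```

With the browser from the question, this would presumably be called as decode_body(response.read(), response.info().get('Content-Type')), where response = br.open(url). I have not checked that against every mechanize version, so treat the header lookup as an assumption.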

Reading HTML as Unicode with Python Mechanize

0 answers: