Question

I'm learning urllib2 and Beautiful Soup, and on my first tests I am getting errors like this:

UnicodeEncodeError: 'ascii' codec can't encode character u'\u2026' in position 10: ordinal not in range(128)

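There seem to be a lot of posts about this type of error, and I have tried the solutions I can understand, but each one seems to lead to a catch-22, e.g.:

I want to print post.text (where text is a Beautiful Soup method that returns only the text). Both str(post.text) and post.text produce unicode errors (on characters such as the right apostrophe ’ and the ellipsis …).

So I add post = unicode(post) above str(post.text), and then I get: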

AttributeError: 'unicode' object has no attribute 'text'

I also tried (post.text).encode() and (post.text).renderContents(). The latter produces the error:

AttributeError: 'unicode' object has no attribute 'renderContents'

Then I tried str(post.text).renderContents() and got the error:

AttributeError: 'str' object has no attribute 'renderContents'

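It would be great if I could declare somewhere at the top of the document "make this content 'interpretable'" and still have access to the text function I need.

Update, after suggestions:

If I add post = post.decode("utf-8") above str(post.text), I get: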

TypeError: unsupported operand type(s) for -: 'str' and 'int'

If I add post = post.decode() above str(post.text), I get:

AttributeError: 'unicode' object has no attribute 'text'

If I add post = post.encode("utf-8") above (post.text), I get:

AttributeError: 'str' object has no attribute 'text'
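
The AttributeError results above appear to share one cause, sketched here with a hypothetical stand-in class (FakePost is made up for illustration and is not part of Beautiful Soup): converting or encoding the whole post rebinds the name to a plain string, and a string has no text attribute.

```python
class FakePost(object):
    """Hypothetical stand-in for a parsed element that exposes .text."""

    def __init__(self, text):
        self.text = text

    def __str__(self):
        return "<p>" + self.text + "</p>"


post = FakePost("hello")
had_text = hasattr(post, "text")      # True: post is still the parsed object

post = str(post)                      # rebinds post to a plain string
has_text_now = hasattr(post, "text")  # False: strings have no .text
```

Under that assumption, any conversion would need to be applied to the extracted text rather than to the post object itself.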

I tried print post.text.encode('utf-8') and got:

UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 39: ordinal not in range(128)

To try anything that might help, I installed lxml for Windows and used it via:

parsed_content = BeautifulSoup(original_content, "lxml")

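as per http://www.crummy.com/software/BeautifulSoup/bs4/doc/#output-formatters.

These steps don't seem to have made any difference.

I'm using Python 2.7.4 and Beautiful Soup 4.

Solution:

After gaining a deeper understanding of unicode, utf-8 and the Beautiful Soup types, I found that the problem was in how I was printing. I removed all my str calls and concatenations, e.g. str(something) + post.text + str(something_else), so that the print statement became something, post.text, something_else. It now seems to print fine, except that I have less control over the formatting at this stage (e.g. spaces are inserted at each ,).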
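
One way to get the formatting control back, as a minimal sketch: keep every piece as text and build the line with a unicode format string instead of wrapping pieces in str(). The names something, post_text and something_else are placeholders, not taken from the original script, and the snippet is written so that it also runs on Python 3.

```python
# Placeholder values standing in for the pieces mentioned in the question.
something = 1
post_text = u"It\u2019s a test\u2026"  # text with a right apostrophe and an ellipsis
something_else = 2.5

# Formatting into a unicode template converts the numbers to text
# and never forces post_text through an ASCII conversion.
line = u"{0} | {1} | {2}".format(something, post_text, something_else)
```

On Python 2, print line should then output the result with exactly the separators chosen in the template, provided the console encoding can represent the characters.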

Answer 1

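In Python 2, unicode objects can only be printed if they can be converted to ASCII. If the text can't be encoded in ASCII, you'll get that error. You probably want to encode it explicitly and then print the resulting str: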

print post.text.encode('utf-8')
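
To see what the explicit encoding does, here is a small sketch. The sample text is a made-up stand-in for post.text, and the snippet is written so that it also runs on Python 3. It checks that encoding to ASCII fails with the error from the question, while encoding to UTF-8 produces a byte string.

```python
text = u"wait\u2026"  # hypothetical stand-in for post.text; U+2026 is the ellipsis

# Encoding to ASCII fails, which is the error reported in the question.
try:
    text.encode("ascii")
    ascii_ok = True
    reason = None
except UnicodeEncodeError as exc:
    ascii_ok = False
    reason = str(exc)

# Encoding to UTF-8 succeeds for this text and yields bytes.
encoded = text.encode("utf-8")
```

The ellipsis becomes the three bytes 0xe2 0x80 0xa6, which is why 0xe2 shows up in decode errors when such bytes are later mixed with unicode text.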

Answer 2

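    import urllib.request
    from bs4 import BeautifulSoup
    html = urllib.request.urlopen(THE_URL).read()  # THE_URL: the page to fetch
    soup = BeautifulSoup(html, "html.parser")
    # encode("ascii") returns bytes; decode them rather than printing the bytes repr
    print("'" + soup.encode("ascii").decode("ascii") + "'")

Works for me ;-)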

Answer 3

Have you tried .decode() or .decode("utf-8")?
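
For context, decode() goes from bytes to text and encode() goes from text to bytes. A hedged sketch with made-up sample bytes (standing in for what a download might return): the bytes are decoded once as UTF-8, and decoding the same bytes as ASCII reproduces the "can't decode byte 0xe2" error from the question. Note that post.text in Beautiful Soup 4 is already text, so decode() would apply to the raw bytes rather than to post.text.

```python
raw = b"It\xe2\x80\x99s here\xe2\x80\xa6"  # made-up UTF-8 bytes

text = raw.decode("utf-8")  # bytes -> text

# Decoding the same bytes as ASCII fails at the first non-ASCII byte (0xe2).
try:
    raw.decode("ascii")
    decode_error = None
except UnicodeDecodeError as exc:
    decode_error = exc

round_trip = text.encode("utf-8")  # text -> bytes again
```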

Also, I recommend using lxml with the html5lib parser:

http://lxml.de/html5parser.html
