Question

I am trying to use Beautiful Soup with Python 2.6.5 to extract text and HTML from a website that contains Scandinavian characters.

html = open('page.html', 'r').read()
soup = BeautifulSoup(html)

descriptions = soup.findAll(attrs={'class' : 'description' })

for i in descriptions:
    description_html = i.a.__str__()
    description_text = i.a.text.__str__()
    description_html = description_html.replace("/subdir/", "http://www.domain.com/subdir/")
    print description_html

However, when I run it, the program fails with the following error message:

Traceback (most recent call last):
    File "test01.py", line 40, in <module>
        description_text = i.a.text.__str__()
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe4' in position 19: ordinal not in range(128)

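The input page appears to be encoded in ISO-8859-1, if that is any help. I also tried setting the correct source encoding with BeautifulSoup(html, fromEncoding="latin-1"), but that did not help either.

It is 2011 and I am still struggling with trivial character encoding problems. I am sure there is a very simple solution to all of this.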

Answer 1

i.a.__str__('latin-1')

or

i.a.text.encode('latin-1')

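should work.

Are you sure it is latin-1? Beautiful Soup should detect the encoding correctly.

Also, why not just use str(i.a) if it turns out you don't need to specify the encoding?

Edit: You need to install chardet for the encoding to be detected automatically.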
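A minimal sketch of the explicit-encoding suggestion above, written without Beautiful Soup so that it stands alone. The variable text is a hypothetical stand-in for the Unicode value that i.a.text would return; the sample word is an assumption, not taken from the asker's page.

```python
# Hypothetical stand-in for the Unicode text returned by i.a.text.
text = u'p\xe4iv\xe4'

# Encoding explicitly avoids the implicit ASCII conversion that
# str() performs on Unicode objects in Python 2.
latin1_bytes = text.encode('latin-1')  # one byte per character
utf8_bytes = text.encode('utf-8')      # two bytes for each \xe4

# Decoding with the matching codec recovers the original text.
roundtrip = latin1_bytes.decode('latin-1')
```

Which codec to pick depends on what the consumer of the bytes expects (terminal, file, or another web page), so latin-1 here is only one possible choice.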

Answer 2

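The problem I ran into was that Beautiful Soup could not output lines of text containing German characters. Unfortunately, even the countless answers on Stack Overflow did not solve my problem: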

        title = str(link.contents[0].string)

This raised a UnicodeEncodeError: 'ascii' codec can't encode character u'\xe4' in position 32: ordinal not in range(128)

Many of those answers do contain valuable pointers towards the correct solution. As Lennart Regebro says in UnicodeEncodeError: 'ascii' codec can't encode character u'\u2013' in position 3 2: ordinal not in range(128):

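When you do str(u'\u2013') you are trying to convert a Unicode string to an 8-bit string. To do this you need to use an encoding, a mapping between Unicode data and 8-bit data. What str() does is use the system default encoding, which under Python 2 is ASCII. ASCII contains only the first 128 code points of Unicode, that is \u0000 to \u007F. The result is that you get the above error, because the ASCII codec simply doesn't know what \u2013 is (it's a long dash, by the way).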
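The quoted explanation can be checked with a small sketch. The helper try_encode is a hypothetical name introduced only for illustration; it calls the codec directly rather than going through str(), so it behaves the same on Python 2 and Python 3.

```python
dash = u'\u2013'  # EN DASH, outside the ASCII range


def try_encode(text, codec):
    """Return the encoded bytes, or None if the codec cannot represent the text."""
    try:
        return text.encode(codec)
    except UnicodeEncodeError:
        return None


# ASCII cannot represent the dash, which is the failure str() hits in Python 2.
ascii_result = try_encode(dash, 'ascii')

# UTF-8 can encode any Unicode code point; this one takes three bytes.
utf8_result = try_encode(dash, 'utf-8')
```

Here ascii_result is None while utf8_result holds the encoded bytes, which is why choosing an explicit encoding makes the error go away.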

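For me, it was simply a matter of not using str() to convert the Beautiful Soup object to a string. Fiddling with the console's default output encoding made no difference either.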

            ### title = str(link.contents[0].string)
            ### should be
            title = link.contents[0].encode('utf-8')

Beautiful Soup and character encoding
