Why does the html2text module throw a UnicodeDecodeError?

Date: 2014-07-15 09:50:52

Tags: python unicode

I'm having a problem with the html2text module. It raises a UnicodeDecodeError:

UnicodeDecodeError: 'ascii' codec can't decode byte 0xbe in position 6: ordinal not in range(128)

Example:

#!/usr/bin/python
# -*- coding: utf-8 -*-
import html2text
import urllib

h = html2text.HTML2Text()
h.ignore_links = True

html = urllib.urlopen( "http://google.com" ).read()

print h.handle( html )

I also tried h.handle(unicode(html, "utf-8")), but without success. Any help is appreciated.

Edit: the full traceback:

Traceback (most recent call last):
  File "test.py", line 12, in <module>
    print h.handle(html)
  File "/home/alex/Desktop/html2text-master/html2text.py", line 254, in handle
    return self.optwrap(self.close())
  File "/home/alex/Desktop/html2text-master/html2text.py", line 266, in close
    self.outtext = self.outtext.join(self.outtextlist)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xbe in position 6: ordinal not in range(128)

1 answer:

Answer 0 (score: 5)

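The problem is easy to reproduce when the source is not decoded, and it goes away when the source is decoded correctly. However, if you reuse the parser object after a failure, you will still get the error!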

You can try this with a known-good Unicode source, such as http://www.ltg.ed.ac.uk/~richard/unicode-sample.html

If you do not decode the response to unicode, the library fails:

>>> h = html2text.HTML2Text()
>>> h.handle(html)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/mj/Development/venvs/stackoverflow-2.7/lib/python2.7/site-packages/html2text.py", line 240, in handle
    return self.optwrap(self.close())
  File "/Users/mj/Development/venvs/stackoverflow-2.7/lib/python2.7/site-packages/html2text.py", line 252, in close
    self.outtext = self.outtext.join(self.outtextlist)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc2 in position 0: ordinal not in range(128)
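
The failing line in the traceback is a join. In Python 2, joining a unicode string with byte strings makes the interpreter decode those bytes implicitly with the default ASCII codec, which is presumably what happens here. A minimal sketch of that decode step made explicit; the sample bytes and the helper name try_decode are illustrative assumptions, not part of html2text:

```python
# Raw bytes as they might arrive from the network: the UTF-8 encoding of
# a non-breaking space followed by ASCII text (sample data, assumed).
raw = b"\xc2\xa0hello"


def try_decode(data, encoding):
    """Decode data with the given codec, returning the error text on failure."""
    try:
        return data.decode(encoding)
    except UnicodeDecodeError as exc:
        return "failed: %s" % exc


# Explicit version of the implicit decode Python 2 attempts during the join.
ascii_result = try_decode(raw, "ascii")

# Decoding with the encoding the bytes were actually written in succeeds.
utf8_result = try_decode(raw, "utf-8")

print(ascii_result)
print(repr(utf8_result))
```

The ASCII attempt fails on the first byte above 0x7f, which matches the "byte 0xc2 in position 0" part of the message above.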

Now, if you reuse the HTML2Text object, its state is not cleared and it still holds the incorrect data, so it fails even when you pass in Unicode:

>>> h.handle(html.decode('utf8'))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/mj/Development/venvs/stackoverflow-2.7/lib/python2.7/site-packages/html2text.py", line 240, in handle
    return self.optwrap(self.close())
  File "/Users/mj/Development/venvs/stackoverflow-2.7/lib/python2.7/site-packages/html2text.py", line 252, in close
    self.outtext = self.outtext.join(self.outtextlist)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc2 in position 0: ordinal not in range(128)
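
To illustrate how leftover state can make a later, correct call fail, here is a toy model. ToyConverter and safe_handle are hypothetical and are not html2text's real code; they only mimic an output buffer that is cleared on success and left untouched when the join raises:

```python
class ToyConverter(object):
    """Hypothetical stand-in for a stateful parser (not html2text itself)."""

    def __init__(self):
        self.pieces = []  # output buffer that lives across calls

    def handle(self, data):
        self.pieces.append(data)
        # Joining text with a leftover byte string raises here:
        # UnicodeDecodeError on Python 2 (non-ASCII bytes), TypeError on Python 3.
        result = u"".join(self.pieces)
        self.pieces = []  # only reached when the join succeeds
        return result


def safe_handle(converter, data):
    """Call handle() and report the exception type instead of raising."""
    try:
        return converter.handle(data)
    except (UnicodeError, TypeError) as exc:
        return "failed: %s" % type(exc).__name__


shared = ToyConverter()
first = safe_handle(shared, b"\xc2\xa0")      # raw bytes poison the buffer
second = safe_handle(shared, u"\xa0")         # correct input, same object
fresh = safe_handle(ToyConverter(), u"\xa0")  # correct input, new object

print(first)
print(second)
print(repr(fresh))
```

In this toy, the second call fails only because the bytes from the first call are still in the buffer; the same input succeeds on an object with no history.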

You need to use a new object, and then it works:

>>> h = html2text.HTML2Text()
>>> result = h.handle(html.decode('utf8'))
>>> len(result)
12750
>>> type(result)
<type 'unicode'>
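
Putting both points together, a small wrapper might look like the sketch below. The function names decode_html and html_to_text, the charset parameter, the UTF-8 fallback, and the choice of errors="replace" are all assumptions made for illustration; only HTML2Text(), ignore_links, and handle() come from the code above.

```python
def decode_html(raw, charset=None, fallback="utf-8"):
    """Return text for raw bytes; text that is already decoded passes through."""
    if not isinstance(raw, bytes):
        return raw
    # "replace" substitutes undecodable bytes instead of raising (a design
    # choice for this sketch; use "strict" to surface bad input instead).
    return raw.decode(charset or fallback, "replace")


def html_to_text(raw, charset=None):
    """Convert one HTML document to text using a fresh HTML2Text object."""
    import html2text  # third-party package; imported here so decode_html works without it

    parser = html2text.HTML2Text()  # new object per document, no leftover state
    parser.ignore_links = True
    return parser.handle(decode_html(raw, charset))


# decode_html can be checked on its own, without html2text installed.
print(repr(decode_html(b"\xc2\xa0caf\xc3\xa9")))
print(repr(decode_html(u"already text")))
```

Creating the parser inside the function means an earlier failure cannot affect later calls. If the server declares an encoding other than UTF-8, pass it as charset rather than relying on the fallback.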