Question

I'm having a problem with the html2text module: it gives me a UnicodeDecodeError:

UnicodeDecodeError: 'ascii' codec can't decode byte 0xbe in position 6: ordinal not in range(128)

Example:

#!/usr/bin/python
# -*- coding: utf-8 -*-
import html2text
import urllib

h = html2text.HTML2Text()
h.ignore_links = True

html = urllib.urlopen( "http://google.com" ).read()

print h.handle( html )

I also tried h.handle(unicode(html, "utf-8")), but without success. Any help is appreciated.

EDIT:

Traceback (most recent call last):
  File "test.py", line 12, in <module>
    print h.handle(html)
  File "/home/alex/Desktop/html2text-master/html2text.py", line 254, in handle
    return self.optwrap(self.close())
  File "/home/alex/Desktop/html2text-master/html2text.py", line 266, in close
    self.outtext = self.outtext.join(self.outtextlist)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xbe in position 6: ordinal not in range(128)

Answer 1

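The problem is easy to reproduce when the source is not decoded, but everything works when the source is decoded correctly. You will also get the error if you reuse the parser!

You can try it with a known-good Unicode source, such as http://www.ltg.ed.ac.uk/~richard/unicode-sample.html.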
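Decoding correctly means knowing which charset the server used. A minimal sketch, assuming the charset is declared in the Content-Type header and falling back to UTF-8 when it is not; both function names are hypothetical helpers, not part of html2text:

```python
import re


def charset_from_content_type(content_type, default="utf-8"):
    """Return the charset named in a Content-Type header value, or default."""
    match = re.search(r"charset=\"?([\w.:-]+)", content_type or "", re.I)
    return match.group(1).lower() if match else default


def decode_body(raw, content_type):
    """Decode response bytes with the declared charset.

    Undecodable bytes are replaced rather than raising; an unknown
    charset name still raises LookupError.
    """
    return raw.decode(charset_from_content_type(content_type), "replace")


print(charset_from_content_type("text/html; charset=ISO-8859-1"))
```

With Python 2's urllib, the header value should be available as response.info().getheader("Content-Type"); pass that string as content_type.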

If you do not decode the response to unicode, the library fails:

>>> h = html2text.HTML2Text()
>>> h.handle(html)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/mj/Development/venvs/stackoverflow-2.7/lib/python2.7/site-packages/html2text.py", line 240, in handle
    return self.optwrap(self.close())
  File "/Users/mj/Development/venvs/stackoverflow-2.7/lib/python2.7/site-packages/html2text.py", line 252, in close
    self.outtext = self.outtext.join(self.outtextlist)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc2 in position 0: ordinal not in range(128)
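The traceback points at a join that mixes a unicode string with undecoded bytes. In Python 2 that presumably triggers an implicit decode of the bytes with the ASCII codec. The sketch below performs the same decode explicitly, so it behaves the same way on Python 2 and 3; the sample character is an arbitrary choice, not taken from the page above:

```python
# -*- coding: utf-8 -*-
# The UTF-8 encoding of U+00BE is the two bytes 0xC2 0xBE.
raw = u"\u00be".encode("utf-8")

try:
    # This is what the implicit str-to-unicode coercion amounts to.
    raw.decode("ascii")
except UnicodeDecodeError as exc:
    error = exc
    print(exc)

# Decoding explicitly with the right codec succeeds.
text = raw.decode("utf-8")
print(repr(text))
```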

Now, if you reuse the HTML2Text object, its state is not cleared. It still holds the incorrect data, so it fails even when you pass in Unicode:

>>> h.handle(html.decode('utf8'))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/mj/Development/venvs/stackoverflow-2.7/lib/python2.7/site-packages/html2text.py", line 240, in handle
    return self.optwrap(self.close())
  File "/Users/mj/Development/venvs/stackoverflow-2.7/lib/python2.7/site-packages/html2text.py", line 252, in close
    self.outtext = self.outtext.join(self.outtextlist)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc2 in position 0: ordinal not in range(128)

You need to use a new object for it to work:

>>> h = html2text.HTML2Text()
>>> result = h.handle(html.decode('utf8'))
>>> len(result)
12750
>>> type(result)
<type 'unicode'>
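Both points (decode first, never reuse a parser) can be folded into one helper. This is only a sketch: html_to_text and its parser_factory parameter are hypothetical names, and UTF-8 is an assumed default encoding. The factory parameter exists so the helper can be exercised without html2text installed.

```python
def html_to_text(raw, encoding="utf-8", parser_factory=None):
    """Convert HTML bytes to text, using a fresh parser on every call."""
    if parser_factory is None:
        import html2text  # imported lazily; only needed for the default
        parser_factory = html2text.HTML2Text
    parser = parser_factory()  # new object, so no state left from earlier calls
    parser.ignore_links = True  # same option as in the question
    # Decode before handing the markup to the parser.
    return parser.handle(raw.decode(encoding))
```

Calling html_to_text(html) for each page then avoids the stale-state failure shown above, since no parser outlives a single call.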

Why does the html2text module throw a UnicodeDecodeError?