I'm having a problem with the html2text module. It raises a UnicodeDecodeError:
UnicodeDecodeError: 'ascii' codec can't decode byte
0xbe in position 6: ordinal not in range(128)
Example:
#!/usr/bin/python
# -*- coding: utf-8 -*-
import html2text
import urllib
h = html2text.HTML2Text()
h.ignore_links = True
html = urllib.urlopen( "http://google.com" ).read()
print h.handle( html )
I also tried h.handle( unicode( html, "utf-8" ) ), but without success. Any help would be appreciated.
Edit:
Traceback (most recent call last):
File "test.py", line 12, in <module>
print h.handle(html)
File "/home/alex/Desktop/html2text-master/html2text.py", line 254, in handle
return self.optwrap(self.close())
File "/home/alex/Desktop/html2text-master/html2text.py", line 266, in close
self.outtext = self.outtext.join(self.outtextlist)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xbe in position 6: ordinal not in range(128)
Answer 0 (score: 5)
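The problem is easy to reproduce when the source is not decoded, but everything works once the source is decoded correctly. You will also get the error if you reuse the parser!
You can try it with a known-good Unicode source, such as http://www.ltg.ed.ac.uk/~richard/unicode-sample.html.
If you do not decode the response to unicode, the library fails: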
>>> h = html2text.HTML2Text()
>>> h.handle(html)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/Users/mj/Development/venvs/stackoverflow-2.7/lib/python2.7/site-packages/html2text.py", line 240, in handle
return self.optwrap(self.close())
File "/Users/mj/Development/venvs/stackoverflow-2.7/lib/python2.7/site-packages/html2text.py", line 252, in close
self.outtext = self.outtext.join(self.outtextlist)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc2 in position 0: ordinal not in range(128)
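The traceback points at a join of outtext with outtextlist. A likely reading (an assumption, not checked against the html2text source) is that one side is unicode and the other holds undecoded byte strings, so Python 2 silently tries to decode the bytes as ASCII. The sketch below performs that decode explicitly on some made-up sample bytes, which shows why the failure appears at the first non-ASCII byte:

```python
# -*- coding: utf-8 -*-
# Illustrative data only: the UTF-8 encoding of a copyright sign plus a year.
raw = b"\xc2\xa9 2013"

ascii_failed = False
failing_position = None
try:
    # Python 2 does this implicitly when a byte string is mixed with unicode.
    raw.decode("ascii")
except UnicodeDecodeError as exc:
    ascii_failed = True
    failing_position = exc.start  # index of the first byte ASCII cannot handle

# Decoding with the encoding the page actually uses succeeds.
text = raw.decode("utf-8")
```

Here 0xc2 is the first byte, so the failure is reported at position 0, the same shape as the error in the traceback above.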
Now, if you reuse the HTML2Text object, its state is not cleared and it still holds the incorrect data, so it fails even when you pass in Unicode:
>>> h.handle(html.decode('utf8'))
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/Users/mj/Development/venvs/stackoverflow-2.7/lib/python2.7/site-packages/html2text.py", line 240, in handle
return self.optwrap(self.close())
File "/Users/mj/Development/venvs/stackoverflow-2.7/lib/python2.7/site-packages/html2text.py", line 252, in close
self.outtext = self.outtext.join(self.outtextlist)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc2 in position 0: ordinal not in range(128)
You need to use a new object, and then it works:
>>> h = html2text.HTML2Text()
>>> result = h.handle(html.decode('utf8'))
>>> len(result)
12750
>>> type(result)
<type 'unicode'>
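Both points (decode first, use a fresh object each time) can be wrapped in a small helper. This is only a sketch: to_text, convert and make_parser are hypothetical names, and the utf-8 default is an assumption, since a real page may declare a different charset in its headers or markup.

```python
def to_text(raw, charset="utf-8"):
    """Return text, decoding byte strings with the given charset."""
    if isinstance(raw, bytes):
        # "replace" keeps going on malformed bytes instead of raising.
        return raw.decode(charset, "replace")
    return raw


def convert(raw, make_parser, charset="utf-8"):
    """Decode raw HTML and hand it to a freshly created parser."""
    parser = make_parser()  # new object per document, so no stale state
    return parser.handle(to_text(raw, charset))
```

With html2text installed, the call would look like convert(html, html2text.HTML2Text). Passing the class rather than an instance is what guarantees that no state from an earlier failed call is carried over.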