Parsing error with BeautifulSoup 4 and Python 3.3

Time: 2013-02-15 03:25:45

Tags: python parsing encoding python-3.x beautifulsoup

Running this code:

from bs4 import BeautifulSoup
soup = BeautifulSoup(open("my.html"))
print(soup.prettify())

produces this error:

Traceback (most recent call last):
  File "soup.py", line 5, in <module>
    print(soup.prettify())
  File "C:\Python33\lib\encodings\cp437.py", line 19, in encode
    return codecs.charmap_encode(input,self.errors,encoding_map)[0]
UnicodeEncodeError: 'charmap' codec can't encode character '\u25ba' in position 9001: character maps to <undefined>

Then I tried:

print(soup.encode('UTF-8').prettify())

But that failed, because soup.encode('UTF-8') returns a bytes object, which has no prettify() method:

Traceback (most recent call last):
  File "soup.py", line 11, in <module>
    print(soup.encode('UTF-8').prettify())
AttributeError: 'bytes' object has no attribute 'prettify'

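I'm not sure how to resolve this. Any input would be appreciated.

1 Answer:

Answer 0: (score: 3)

Your (Windows) console is using the cp437 encoding, and that encoding cannot represent a character in the soup ('\u25ba'). By default an exception is raised in this case, but you can change that behavior.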

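import io
import sys
from bs4 import BeautifulSoup

# Re-wrap stdout so that characters cp437 cannot encode are printed
# as \uXXXX escapes instead of raising UnicodeEncodeError.
sys.stdout = io.TextIOWrapper(sys.stdout.buffer, encoding='cp437', errors='backslashreplace')
soup = BeautifulSoup(open("my.html"))
print(soup.prettify())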
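
To see what the error handler changes without needing a cp437 console, here is a minimal sketch that encodes the character from the traceback (U+25BA) first with the default strict handler and then with backslashreplace. The sample string is made up for illustration, and the io.BytesIO object is only a stand-in for sys.stdout.buffer.

```python
import io

text = "play \u25ba"  # hypothetical sample containing the character from the traceback

# Default 'strict' handler: cp437 has no mapping for U+25BA, so encoding raises.
try:
    text.encode("cp437")
    strict_failed = False
except UnicodeEncodeError:
    strict_failed = True

# 'backslashreplace' substitutes an escape sequence instead of raising.
escaped = text.encode("cp437", errors="backslashreplace")

# The same handler applied through a text wrapper, as in the snippet above.
fake_console = io.BytesIO()  # stand-in for sys.stdout.buffer (assumption)
wrapper = io.TextIOWrapper(fake_console, encoding="cp437", errors="backslashreplace")
wrapper.write(text)
wrapper.flush()
written = fake_console.getvalue()
```

On Python 3.7 and later (newer than the 3.3 used in the question), sys.stdout.reconfigure(errors="backslashreplace") should achieve the same effect without rebuilding the wrapper.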

Alternatively, write the soup to a file and read it with an editor that supports the encoding:

# On Windows, utf-8-sig will allow the file to be read by Notepad.
with open('out.txt', 'w', encoding='utf-8-sig') as f:
    f.write(soup.prettify())
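
A small round trip can illustrate what utf-8-sig does: it writes a byte-order mark at the start of the file, which is what lets Notepad detect UTF-8, and it strips that mark again on read. This is only a sketch; the temporary path and the sample text are placeholders, and soup.prettify() would take the place of text.

```python
import os
import tempfile

text = "play \u25ba"  # placeholder for soup.prettify()

# Write to a throwaway location rather than the working directory.
path = os.path.join(tempfile.mkdtemp(), "out.txt")
with open(path, "w", encoding="utf-8-sig") as f:
    f.write(text)

# Inspect the raw bytes: the file starts with the UTF-8 BOM (EF BB BF).
with open(path, "rb") as f:
    raw = f.read()
has_bom = raw.startswith(b"\xef\xbb\xbf")

# Reading with utf-8-sig removes the BOM, so the text comes back unchanged.
with open(path, encoding="utf-8-sig") as f:
    round_tripped = f.read()
```

Reading the same file with plain utf-8 instead would leave a leading '\ufeff' character in the string, which is why the reader should also use utf-8-sig.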