Question

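I just want to download an HTML file with urllib and then print the HTML to the terminal. The HTML file appears to be correctly encoded in UTF-8, as its meta tag specifies (saving it to a file and opening it in any other program shows the file being read and displayed correctly).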

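The problem is that when I try to print the whole HTML to the terminal, Python throws an encoding-related exception, and I am a bit lost. I thought it had to do with the file's encoding, and that maybe I was not specifying the encoding correctly. I also tried this in a cygwin terminal, and there the HTML does print, although with encoding problems (some characters are wrong).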

Here is the code:

from bs4 import BeautifulSoup
import gzip
import urllib.request
import sys, codecs

myheaders = dict()
myheaders['User-Agent'] = "Mozilla/5.0 (Windows NT 6.2; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/32.0.1667.0 Safari/537.36"
myheaders['Accept'] = "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8"
myheaders['Connection'] = "keep-alive"
myheaders['Accept-Encoding'] = "gzip"

request = urllib.request.Request("http://www.seriesyonkis.com", headers=myheaders)
responsehandler = urllib.request.urlopen(request)
rawresponse = responsehandler.read()
rawhtml = gzip.decompress(rawresponse)

rawhtml = str(rawhtml, encoding="utf-8")

print(rawhtml) #Throws encoding related exception

Here is the traceback from the Windows console:

(venv) F:\dev\own\pyscraper>python scraper.py
Traceback (most recent call last):
  File "scraper.py", line 20, in <module>
    print(rawhtml)
  File "F:\dev\own\pyscraper\venv\lib\encodings\cp850.py", line 19, in encode
    return codecs.charmap_encode(input,self.errors,encoding_map)[0]
UnicodeEncodeError: 'charmap' codec can't encode character '\u201c' in position
27973: character maps to <undefined>

What am I doing wrong?

Answer 1

The page does seem to contain some problematic characters. Looking at the HTML in Safari's "View Source" mode, I found the following:

[Screenshot of the page source in Safari; image not preserved]

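As you can see, the syntax coloring stops abruptly, which suggests there are some bad characters in the source. I don't think you did anything "wrong". I do wonder why you want to dump this text to the console...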

Answer 2

There is no way around this. Your terminal's encoding does not support the character, and that is all there is to it.
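The failure can be reproduced without a console at all. The sketch below assumes the console code page is cp850, as the `cp850.py` frame in the traceback suggests; `can_encode` is a hypothetical helper name used only for illustration.

```python
# The character from the traceback, U+201C (left double quotation mark),
# plus its closing counterpart U+201D.
text = "\u201cquoted\u201d"


def can_encode(s, encoding):
    """Return True if every character of s can be represented in encoding."""
    try:
        s.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False


print(can_encode(text, "cp850"))  # False: cp850 has no curly quotes
print(can_encode(text, "utf-8"))  # True: UTF-8 covers every code point here
```

So the decoding step in the question is fine; it is only the final encode to the console's code page inside print() that fails.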

As an alternative, you can write the data to a file:

# Pass the encoding explicitly; the platform default may fail the same way.
with open('html.txt', 'w', encoding='utf-8') as fout:
    fout.write(rawhtml)

Also, if you want to strip the bad characters, this might work:

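# Encode to the console's encoding (not UTF-8) so unsupported characters are dropped.
html = rawhtml.encode(sys.stdout.encoding, errors='ignore').decode(sys.stdout.encoding)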
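For characters to be dropped or replaced, the encoding passed to encode() has to be the console's own, since UTF-8 can represent every character and so filters nothing. A minimal sketch of that idea follows; `to_printable` is a hypothetical helper, and the fallback to "ascii" when `sys.stdout.encoding` is unset is an assumption.

```python
import sys


def to_printable(text, encoding=None):
    """Replace characters the target encoding cannot represent with '?'."""
    # Default to the console's encoding; fall back to ASCII if it is unknown.
    encoding = encoding or sys.stdout.encoding or "ascii"
    # Round-trip through bytes so the result is still a str, now safe to print.
    return text.encode(encoding, errors="replace").decode(encoding)


print(to_printable("\u201chello\u201d", "cp850"))  # prints ?hello?
```

With the code from the question, this would be used as print(to_printable(rawhtml)). On Python 3.7 and later, calling sys.stdout.reconfigure(errors="replace") once at startup should have a similar effect for every later print() call.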

Downloaded UTF-8 HTML file cannot be printed to the terminal with Python 3
