Question

I am trying to print out HTML content in the following way:

from lxml import html
import requests

url = 'http://www.amazon.co.uk/Membrane-Protectors-FujiFilm-FinePix-SL1000/dp/B00D2UVI9C/ref=pd_sim_ph_3?ie=UTF8&refRID=06BDVRBE6TT4DNRFWFVQ'
page = requests.get(url)
print page.text

Then I run python print_url.py > out and get the following error:

print page.text UnicodeEncodeError: 'ascii' codec can't encode character u'\xa3' in position 113525: ordinal not in range(128)

Can anyone give me some ideas? I have run into this kind of problem before, but I could not figure it out. Thanks

Answer 1

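Your page.text is not in your local encoding. Instead, it is probably unicode. To print the contents of page.text, you must first encode them in the encoding that stdout expects: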

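import sys
# In Python 2, sys.stdout.encoding is None when output is redirected to a file, so fall back to UTF-8
print page.text.encode(sys.stdout.encoding or 'utf-8')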
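One caveat with this approach: in Python 2, sys.stdout.encoding is None when the output is redirected to a file, as in the question, so a fallback encoding is needed. Below is a minimal sketch of that idea. The helper name encode_for_stream and the UTF-8 fallback are assumptions made for illustration, not part of the original answer.

```python
import sys


def encode_for_stream(text, stream, fallback="utf-8"):
    """Encode unicode text using the stream's encoding, or a fallback.

    Hypothetical helper: `stream` is any object that may expose an
    `encoding` attribute, such as sys.stdout.
    """
    # The attribute can be missing or None (e.g. redirected stdout in Python 2).
    encoding = getattr(stream, "encoding", None) or fallback
    # 'replace' substitutes '?' for characters the encoding cannot represent,
    # so a narrow terminal encoding does not raise UnicodeEncodeError.
    return text.encode(encoding, "replace")


encoded = encode_for_stream(u"\xa3 10", sys.stdout)
```

The result is a byte string. In Python 2 it can be passed straight to print, while in Python 3 it would go to sys.stdout.buffer.write instead.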

Answer 2

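The page contains non-ASCII unicode characters. You can get this error if you try to print to a shell that does not support them, or because you are redirecting the output to a file, in which case the output is assumed to be ASCII-encoded. I point this out because some shells have no problem with it while others do (for example, my current shell/terminal defaults to UTF-8).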

If you want the output encoded as UTF-8, you should encode it explicitly:

print page.text.encode('utf8')
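Since the goal in the question is to end up with a file, another option is to write the file from Python with an explicit encoding rather than relying on shell redirection. This is only a sketch: save_text, the sample string, and the temporary output path are made up for illustration, and in the question's script the text would come from page.text.

```python
import io
import os
import tempfile


def save_text(text, path, encoding="utf-8"):
    """Write unicode text to `path` using an explicit encoding."""
    # io.open encodes unicode text on write in both Python 2 and Python 3.
    with io.open(path, "w", encoding=encoding) as handle:
        handle.write(text)


sample = u"Price: \xa310"  # stands in for page.text
out_path = os.path.join(tempfile.mkdtemp(), "out")
save_text(sample, out_path)

# Read the raw bytes back to confirm the pound sign was stored as UTF-8.
with io.open(out_path, "rb") as handle:
    raw = handle.read()
```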

If you want it encoded as something your shell can handle, or as ASCII with the unencodable characters removed or replaced, use one of the following:

print page.text.encode(sys.stdout.encoding or "ascii", 'xmlcharrefreplace') - replaces unencodable characters with numeric character references

print page.text.encode(sys.stdout.encoding or "ascii", 'replace') - replaces unencodable characters with "?"

print page.text.encode(sys.stdout.encoding or "ascii", 'ignore') - replaces unencodable characters with nothing (removes them)
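To see what the options above do, here is a small sketch applied to the character from the error message, u'\xa3' (the pound sign). The sample string is made up for illustration, and the target encoding is fixed to ASCII so the results do not depend on the terminal.

```python
text = u"Price: \xa310"  # hypothetical sample containing the pound sign

# The default 'strict' handler is what raises the error from the question.
strict_failed = False
try:
    text.encode("ascii")
except UnicodeEncodeError:
    strict_failed = True

as_entities = text.encode("ascii", "xmlcharrefreplace")  # b'Price: &#163;10'
as_question = text.encode("ascii", "replace")            # b'Price: ?10'
as_dropped = text.encode("ascii", "ignore")              # b'Price: 10'
as_utf8 = text.encode("utf8")                            # b'Price: \xc2\xa310'
```

For HTML output, xmlcharrefreplace is the only one of the three handlers that loses no information, because a browser renders &#163; as the original character.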

Error printing HTML in the terminal
