Question

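I'm using Python 3.3.0 on Windows 7.

I wrote this script to bypass an HTTP proxy without authentication on a system. But when I run it, it gives the error: UnicodeEncodeError: 'charmap' codec can't encode characters in position 6242-6243: character maps to <undefined>. It seems it fails to decode Unicode characters into a string.

So what should I use or edit/do? Does anyone have a clue or a solution?

My .py file contains the following: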

import sys, urllib
import urllib.request

url = "http://www.python.org"
proxies = {'http': 'http://199.91.174.6:3128/'}

opener = urllib.request.FancyURLopener(proxies)

try:
    f = urllib.request.urlopen(url)
except urllib.error.HTTPError as  e:
    print ("[!] The connection could not be established.")
    print ("[!] Error code: ",  e.code)
    sys.exit(1)
except urllib.error.URLError as  e:
    print ("[!] The connection could not be established.")
    print ("[!] Reason: ",  e.reason)
    sys.exit(1)

source = f.read()

if "iso-8859-1" in str(source):
    source = source.decode('iso-8859-1')
else:
    source = source.decode('utf-8')

print("\n SOURCE:\n",source)

Answer 1

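1. This code doesn't even use your proxy: the opener you create is never used, and urllib.request.urlopen(url) goes through the default opener.
2. This form of encoding detection is very weak. You should only look for the declared encoding in well-defined places: the HTTP 'Content-Type' header and, if the response is HTML, the charset meta tag.
3. Since you didn't include a stack trace, I can only guess where the error is raised; my first suspect was the line if "iso-8859-1" in str(source):. Note that in Python 3, str() called on bytes without an encoding does not decode them; it returns the repr (b'...'). If you really want to keep this check (see point 2), write if b"iso-8859-1" in source: instead. That works on bytes rather than strings, so nothing has to be decoded beforehand. A UnicodeEncodeError from the 'charmap' codec is more typically raised by print(), when the text is encoded for a console whose code page cannot represent some of its characters.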
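
Points 1 and 2 can be sketched with the standard library alone. This is a minimal sketch, not the asker's final script: the helper names build_proxy_opener, declared_charset and fetch_text are made up for illustration, and falling back to utf-8 when the header declares no charset is an assumption.

```python
import urllib.request
from email.message import Message


def build_proxy_opener(proxies):
    """Return an opener that routes requests through the given proxies."""
    return urllib.request.build_opener(urllib.request.ProxyHandler(proxies))


def declared_charset(headers, default="utf-8"):
    """Return the charset declared in the Content-Type header, or a default."""
    # The utf-8 fallback is an assumption, not something the server promised.
    return headers.get_content_charset() or default


def fetch_text(url, proxies):
    """Fetch url through the proxy and decode it with the declared charset."""
    opener = build_proxy_opener(proxies)
    with opener.open(url, timeout=10) as response:
        body = response.read()
        charset = declared_charset(response.headers)
    return body.decode(charset, errors="replace")


# Offline check of the header parsing (no network access needed).
sample = Message()
sample["Content-Type"] = "text/html; charset=ISO-8859-1"
print(declared_charset(sample))
```

With the proxy from the question, the call would look like fetch_text("http://www.python.org", {'http': 'http://199.91.174.6:3128/'}); whether that proxy is still reachable is not something this sketch can promise.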

Note: this code works fine for me, presumably because my system uses utf-8 as its default encoding while your Windows system uses something else.
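
If the error is in fact raised by print() (an assumption, since the question has no stack trace), one way around it is to replace the characters the console cannot represent before printing. The helper name safe_for_console is hypothetical, and cp1252 in the example merely stands in for a typical Windows code page.

```python
import sys


def safe_for_console(text, encoding=None):
    """Replace characters that the target encoding cannot represent with '?'."""
    # Fall back to the console's own encoding, then to ASCII as a last resort.
    encoding = encoding or getattr(sys.stdout, "encoding", None) or "ascii"
    return text.encode(encoding, errors="replace").decode(encoding)


# cp1252 can encode the accented e but not the arrow, which becomes '?'.
print(safe_for_console("caf\u00e9 \u2192", "cp1252"))
```

In the script from the question, this would mean calling print(safe_for_console(source)) instead of printing source directly.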

Update: I'd recommend using python-requests when working with HTTP in Python.

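import requests

# your_proxy_here is a placeholder for your proxy URL
proxies = {'http': your_proxy_here}

with requests.Session() as sess:
    # Session() takes no arguments; proxies are set on the session object
    sess.proxies.update(proxies)
    r = sess.get('http://httpbin.org/ip')
    print(r.apparent_encoding)
    print(r.text)
    # more requests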

Note: this does not use the encoding specified in the HTML; you would need an HTML parser such as BeautifulSoup to extract it.
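
As a rough illustration of reading the charset declared in the HTML, here is a sketch that uses the standard library's html.parser in place of BeautifulSoup, so that it has no third-party dependency. The names MetaCharsetParser and meta_charset are hypothetical, and only the first 4096 bytes are scanned on the assumption that the declaration sits near the top of the document.

```python
from html.parser import HTMLParser


class MetaCharsetParser(HTMLParser):
    """Collect the first charset declared in a <meta> tag."""

    def __init__(self):
        super().__init__()
        self.charset = None

    def handle_starttag(self, tag, attrs):
        if tag != "meta" or self.charset is not None:
            return
        attrs = dict(attrs)
        if attrs.get("charset"):
            # HTML5 form: <meta charset="utf-8">
            self.charset = attrs["charset"].strip().lower()
        elif (attrs.get("http-equiv") or "").lower() == "content-type":
            # Older form: content="text/html; charset=..."
            content = attrs.get("content") or ""
            for part in content.split(";"):
                key, _, value = part.strip().partition("=")
                if key.lower() == "charset" and value:
                    self.charset = value.strip().lower()


def meta_charset(raw_bytes):
    """Return the charset declared in the HTML, or None if there is none."""
    parser = MetaCharsetParser()
    # The declaration itself is ASCII, so a lossy ASCII decode is enough to find it.
    parser.feed(raw_bytes[:4096].decode("ascii", errors="replace"))
    return parser.charset


print(meta_charset(b'<html><head><meta charset="ISO-8859-1"></head></html>'))
```

With requests, the raw bytes are available as r.content, so something like r.encoding = meta_charset(r.content) or r.encoding would apply the declared charset before reading r.text.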

python 3 - HTTP proxy issue
