Question

I have a very basic script that downloads a website using Python's urllib2.

It has worked fine for the past 6 months, but as of this morning it no longer works.

#!/usr/bin/python
import urllib2
proxy_support = urllib2.ProxyHandler({'http': 'http://DOMAIN\USER:PASS@PROXY:PORT/'})
opener = urllib2.build_opener(proxy_support)
urllib2.install_opener(opener)
translink = open('/tmp/trains.html' ,'w')
response = urllib2.urlopen('http://translink.com.au')
html = response.read()
translink.write(html)
translink.close()

I now get the following error:

Traceback (most recent call last):
  File "./gettrains.py", line 7, in <module>
    response = urllib2.urlopen('http://translink.com.au')
  File "/usr/lib/python2.7/urllib2.py", line 127, in urlopen
    return _opener.open(url, data, timeout)
  File "/usr/lib/python2.7/urllib2.py", line 407, in open
    response = meth(req, response)
  File "/usr/lib/python2.7/urllib2.py", line 520, in http_response
    'http', request, response, code, msg, hdrs)
  File "/usr/lib/python2.7/urllib2.py", line 445, in error
    return self._call_chain(*args)
  File "/usr/lib/python2.7/urllib2.py", line 379, in _call_chain
    result = func(*args)
  File "/usr/lib/python2.7/urllib2.py", line 528, in http_error_default
    raise HTTPError(req.get_full_url(), code, msg, hdrs, fp)
urllib2.HTTPError: HTTP Error 502: Proxy Error ( The HTTP message includes an unsupported header or an unsupported combination of headers.  )

I'm new to Python, so any help is much appreciated.

Cheers

#!/usr/bin/python
import requests
proxies = {
    "http": "http://domain\user:pass@proxy:port",
    "https": "http://domain\user:pass@proxy:port",
}
html = requests.get("http://translink.com.au", proxies=proxies)
translink = open('/tmp/trains.html' ,'w')
translink.write(html.content)
translink.close()

Answer 1

Try changing the headers. For example:

opener = urllib2.build_opener(proxy_support)
opener.addheaders = ([('User-Agent' , 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.0)')])
urllib2.install_opener(opener)

I ran into the same problem a few days ago. My proxy did not accept the default header User-Agent: Python-urllib/2.7.
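
For reference, here is a minimal sketch of the same idea wrapped in a helper, so you can confirm which headers the opener will send before any request goes through the proxy. The helper name make_opener and the proxy address proxy.example:8080 are placeholders I made up, not values from the question, and the try/except import is only there so the sketch also runs on Python 3, where urllib2 became urllib.request.

```python
try:
    import urllib2 as urlreq            # Python 2, as in the question
except ImportError:
    import urllib.request as urlreq     # Python 3 equivalent module


def make_opener(proxy_url, user_agent):
    """Build an opener that uses the given HTTP proxy and User-Agent.

    Hypothetical helper for illustration; nothing is sent over the
    network until opener.open() is called.
    """
    proxy_support = urlreq.ProxyHandler({'http': proxy_url})
    opener = urlreq.build_opener(proxy_support)
    # Replace the default ('User-agent', 'Python-urllib/x.y') header.
    opener.addheaders = [('User-Agent', user_agent)]
    return opener


# Placeholder proxy URL (assumption); substitute your real proxy.
opener = make_opener(
    'http://proxy.example:8080/',
    'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.0)',
)

# Inspect the headers the opener will add to every request.
print(opener.addheaders)
```

If the printed list shows the browser-style User-Agent, pass the opener to install_opener() as in the snippet above and retry the download. Whether a given proxy accepts that value is something you can only find out by trying it.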

Answer 2

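To simplify things, I would avoid configuring the proxy in Python and just let your operating system manage it. You can do that by setting an environment variable (for example, export http_proxy="your_proxy" on Linux). Then fetch the file directly from Python using urllib2 or requests; you could also consider the wget module.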
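
A rough sketch of how to check that Python actually sees the environment variable: the standard library's getproxies() returns the proxy settings it detects, and an opener built without an explicit ProxyHandler mapping reads the same variables. The proxy address below is a placeholder, and setting os.environ inside the script is only done to keep the sketch self-contained; normally the value would come from the shell.

```python
import os

try:
    from urllib import getproxies            # Python 2
except ImportError:
    from urllib.request import getproxies    # Python 3

# Normally done in the shell with: export http_proxy="your_proxy"
# Set here only so the sketch runs on its own (placeholder value).
os.environ['http_proxy'] = 'http://proxy.example:8080'

# Ask the standard library which proxies it detects.
detected = getproxies()
print(detected.get('http'))
```

If this prints your proxy URL, a plain urlopen() call with no ProxyHandler should be routed through it.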

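It is entirely possible that something changed on your proxy so that it now forwards requests with headers your final destination does not accept. In that case, there is very little you can do.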
