urllib.request.urlopen fails to fetch the Stack Overflow election primary page

Time: 2015-11-16 21:43:38

Tags: python python-3.4 urllib

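I have a small script that summarizes and sorts the candidate scores in Stack Exchange election primaries. It works for most sites, but not for Stack Overflow: retrieving that URL with urllib's request.urlopen fails with a 403 (Forbidden) error. The script below demonstrates the problem. The Math SE and Server Fault URLs work fine, but the Stack Overflow URL fails: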

from urllib import request

urls = (
    'http://math.stackexchange.com/election/5?tab=primary',
    'http://serverfault.com/election/5?tab=primary',
    'http://stackoverflow.com/election/7?tab=primary',
)

for url in urls:
    print('fetching {} ...'.format(url))
    request.urlopen(url).read()

Output:

fetching http://math.stackexchange.com/election/5?tab=primary ...
fetching http://serverfault.com/election/5?tab=primary ...
fetching http://stackoverflow.com/election/7?tab=primary ...
Traceback (most recent call last):
  File "examples/t.py", line 11, in <module>
    request.urlopen(url).read()
  File "/opt/local/Library/Frameworks/Python.framework/Versions/3.4/lib/python3.4/urllib/request.py", line 161, in urlopen
    return opener.open(url, data, timeout)
  File "/opt/local/Library/Frameworks/Python.framework/Versions/3.4/lib/python3.4/urllib/request.py", line 469, in open
    response = meth(req, response)
  File "/opt/local/Library/Frameworks/Python.framework/Versions/3.4/lib/python3.4/urllib/request.py", line 579, in http_response
    'http', request, response, code, msg, hdrs)
  File "/opt/local/Library/Frameworks/Python.framework/Versions/3.4/lib/python3.4/urllib/request.py", line 507, in error
    return self._call_chain(*args)
  File "/opt/local/Library/Frameworks/Python.framework/Versions/3.4/lib/python3.4/urllib/request.py", line 441, in _call_chain
    result = func(*args)
  File "/opt/local/Library/Frameworks/Python.framework/Versions/3.4/lib/python3.4/urllib/request.py", line 587, in http_error_default
    raise HTTPError(req.full_url, code, msg, hdrs, fp)
urllib.error.HTTPError: HTTP Error 403: Forbidden

With curl, all URLs work. So the problem seems to be specific to urllib's request.urlopen. I tried on OS X and Linux, with the same result. What is going on? How can this be explained?
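A minimal sketch of how the 403 could be inspected instead of being left as a bare traceback: catching urllib.error.HTTPError gives access to the status code and reason the server sent back. The helper names describe_http_error and try_fetch are hypothetical and not part of the original script.

```python
from urllib import error, request


def describe_http_error(err):
    """Summarize an HTTPError as '<code> <reason>' (hypothetical helper)."""
    return '{} {}'.format(err.code, err.reason)


def try_fetch(url, timeout=10):
    """Return the response body, or None if the server sent an error status."""
    try:
        with request.urlopen(url, timeout=timeout) as response:
            return response.read()
    except error.HTTPError as err:
        # The server did answer, but with an error status such as 403.
        print('{} -> {}'.format(url, describe_http_error(err)))
        return None
```

Calling try_fetch on each URL in the loop would let the script report the failing site and carry on with the others, rather than stopping at the first error.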

2 answers:

Answer 0 (score: 2)

Use requests instead of urllib:

import requests

urls = (
    'http://math.stackexchange.com/election/5?tab=primary',
    'http://serverfault.com/election/5?tab=primary',
    'http://stackoverflow.com/election/7?tab=primary',
)

for url in urls:
    print('fetching {} ...'.format(url))
    data = requests.get(url)

If you want to be more efficient, use a single HTTP session:

import requests

urls = (
    'http://math.stackexchange.com/election/5?tab=primary',
    'http://serverfault.com/election/5?tab=primary',
    'http://stackoverflow.com/election/7?tab=primary',
)
with requests.Session() as session:
    for url in urls:
        print('fetching {} ...'.format(url))
        data = session.get(url)

Answer 1 (score: 1)

It seems to be the user agent that urllib sends. This code works for me:

from urllib import error, request

urls = (
    'http://math.stackexchange.com/election/5?tab=primary',
    'http://serverfault.com/election/5?tab=primary',
    'http://stackoverflow.com/election/7?tab=primary',
)

for url in urls:
    print('fetching {} ...'.format(url))
    try:
        request.urlopen(url).read()
    except error.HTTPError:
        # Retry with the user agent that urllib sends by default, set explicitly.
        print('got an exception, changing user-agent to urllib default')
        req = request.Request(url)
        req.add_header('User-Agent', 'Python-urllib/3.4')
        try:
            request.urlopen(req)
        except error.HTTPError:
            # Retry once more with any user agent other than the urllib default.
            print('got another exception, changing user-agent to something else')
            req.add_header('User-Agent', 'not-Python-urllib/3.4')
            request.urlopen(req)
    print('success with url: {}'.format(url))

This is the current output (2015-11-16), with blank lines added for readability:

fetching http://math.stackexchange.com/election/5?tab=primary ...
success with url: http://math.stackexchange.com/election/5?tab=primary

fetching http://serverfault.com/election/5?tab=primary ...
success with url: http://serverfault.com/election/5?tab=primary

fetching http://stackoverflow.com/election/7?tab=primary ...
got an exception, changing user-agent to urllib default
got another exception, changing user-agent to something else
success with url: http://stackoverflow.com/election/7?tab=primary
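The retry logic above could probably be reduced to setting the header up front. This is a minimal sketch, assuming the 403 really is triggered by the default Python-urllib user agent, as the output suggests. The helper names and the agent string 'election-score-script/0.1' are made up for illustration.

```python
from urllib import request

# Hypothetical agent string, chosen only for this example.
USER_AGENT = 'election-score-script/0.1'


def default_user_agent():
    """Return the User-Agent value that urllib sends when none is set."""
    return dict(request.build_opener().addheaders)['User-agent']


def build_request(url, user_agent=USER_AGENT):
    """Return a Request that carries an explicit User-Agent header."""
    return request.Request(url, headers={'User-Agent': user_agent})


def fetch(url, timeout=10):
    """Fetch url with the explicit User-Agent and return the body as bytes."""
    with request.urlopen(build_request(url), timeout=timeout) as response:
        return response.read()
```

Printing default_user_agent() shows the value the server sees from a plain urlopen call, and replacing request.urlopen(url).read() with fetch(url) in the loop sends the custom value instead. Whether a particular agent string is accepted is up to the site, so the 403 may still need handling.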