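I'm trying to scrape web pages using urllib2 and BeautifulSoup. It was working fine, but after I put an input() call in a different part of my code to debug something, I got an HTTPError. Now, whenever I run the program again, I get an HTTPError on the line that calls read(). The stack trace is below: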
[2013-07-17 16:47:07,415: ERROR/MainProcess] Task program.tasks.testTask[460db7cf-ff58-4a51-9c0f-749affc66abb] raised exception: IOError()
16:47:07 celeryd.1 | Traceback (most recent call last):
16:47:07 celeryd.1 | File "/Users/username/folder/server2/venv/lib/python2.7/site-packages/celery/execute/trace.py", line 181, in trace_task
16:47:07 celeryd.1 | R = retval = fun(*args, **kwargs)
16:47:07 celeryd.1 | File "/Users/username/folder/server2/program/tasks.py", line 193, in run
16:47:07 celeryd.1 | self.get_top_itunes_game_by_genre(genre)
16:47:07 celeryd.1 | File "/Users/username/folder/server2/program/tasks.py", line 244, in get_top_itunes_game_by_genre
16:47:07 celeryd.1 | game_page = BeautifulSoup(urllib2.urlopen(game_url).read())
16:47:07 celeryd.1 | File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib2.py", line 126, in urlopen
16:47:07 celeryd.1 | return _opener.open(url, data, timeout)
16:47:07 celeryd.1 | File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib2.py", line 400, in open
16:47:07 celeryd.1 | response = meth(req, response)
16:47:07 celeryd.1 | File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib2.py", line 513, in http_response
16:47:07 celeryd.1 | 'http', request, response, code, msg, hdrs)
16:47:07 celeryd.1 | File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib2.py", line 438, in error
16:47:07 celeryd.1 | return self._call_chain(*args)
16:47:07 celeryd.1 | File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib2.py", line 372, in _call_chain
16:47:07 celeryd.1 | result = func(*args)
16:47:07 celeryd.1 | File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib2.py", line 521, in http_error_default
16:47:07 celeryd.1 | raise HTTPError(req.get_full_url(), code, msg, hdrs, fp)
16:47:07 celeryd.1 | HTTPError
Here is the code:
for game_url in urls:
    game_page = BeautifulSoup(urllib2.urlopen(game_url).read())
    # code to process page
Does anyone know why I started getting this error? Thanks!
Answer 0 (score: 1)
Converting my comment to an answer:
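The page you're scraping is (most likely) responding with a 4xx status, and urllib2 raises an HTTPError in that case, just as the docs say it will. It's your job to catch the exception and (hopefully) do something with it: log it, skip the page, or whatever suits you. For whatever reason, your traceback doesn't show the HTTPError's code or reason, but they are there. Look at the error's "code" and "reason" attributes.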
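As a rough illustration of catching the exception and inspecting it, here is a minimal sketch. The `fetch` helper and its `opener` parameter are hypothetical names introduced for this example, not part of urllib2; the import fallback is only there so the same snippet also runs on Python 3, where urllib2 was split into `urllib.request` and `urllib.error`.

```python
try:
    import urllib2 as request_lib            # Python 2, as in the question
    from urllib2 import HTTPError
except ImportError:
    import urllib.request as request_lib     # Python 3 equivalent
    from urllib.error import HTTPError


def fetch(url, opener=request_lib.urlopen):
    """Hypothetical helper: return (body, None), or (None, error) on an HTTP error."""
    try:
        return opener(url).read(), None
    except HTTPError as e:
        # e.code is the numeric status (e.g. 403). Fall back to e.msg in case
        # the running Python version does not expose e.reason on HTTPError.
        reason = getattr(e, "reason", e.msg)
        print("HTTP error %s (%s) for %s" % (e.code, reason, url))
        return None, e
```

In the loop from the question, one could call `fetch(game_url)` and skip to the next URL whenever the returned body is `None`, instead of letting a single bad response abort the whole task.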
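Edit: The site you're scraping may have figured out that you're a bot. You might want to take a little time to rewrite your scraper with a library that is friendlier to servers (and has a nicer API). urllib2 is fine for one-off tasks, but it has a lot of shortcomings that I won't go into here. Higher-level libraries worth considering are requests, mechanize, and maybe httplib2. They all have upsides and downsides, so I can't tell you which one fits your needs.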
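You may also want to look at the User-Agent header being sent with your requests, because if you're identifying yourself as a bot, then... yeah.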
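For reference, urllib2 sends a default User-Agent of the form "Python-urllib/2.7" unless told otherwise. A minimal sketch of overriding it is below; the `build_request` helper and the example User-Agent string are assumptions made up for illustration, and whether a different header actually resolves the error depends on the site. As before, the import fallback only exists so the snippet also runs on Python 3.

```python
try:
    from urllib2 import Request              # Python 2, as in the question
except ImportError:
    from urllib.request import Request       # Python 3 equivalent


def build_request(url, user_agent="Mozilla/5.0 (compatible; example-scraper)"):
    """Hypothetical helper: build a Request carrying an explicit User-Agent."""
    # The headers dict is sent with the request in place of the library default.
    return Request(url, headers={"User-Agent": user_agent})
```

The resulting object can be passed to `urllib2.urlopen()` in place of the bare URL string.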