Guarding against exceptions - PRAW

Date: 2015-01-20 04:14:56

Tags: python web-crawler praw

I currently have a script that pulls the top headline from Reddit's front page, and it works almost every time. Occasionally, though, I get the exception below. I know I should wrap my code in a try/except statement to guard against it, but where should I put it?

The crawler:

def crawlReddit():                                                     
    r = praw.Reddit(user_agent='challenge')             # PRAW object
    topHeadlines = []                                   # List of headlines 
    for item in r.get_front_page():
        topHeadlines.append(item)                       # Add headlines to list
    return topHeadlines[0].title                            # Return top headline

def main():
    headline = crawlReddit()                            # Pull top headline

if __name__ == "__main__":
    main()              

The error:

Traceback (most recent call last):
  File "makecall.py", line 57, in <module>
    main()                                      # Run
  File "makecall.py", line 53, in main
    headline = crawlReddit()                            # Pull top headline
  File "makecall.py", line 34, in crawlReddit
    for item in r.get_front_page():
  File "/Users/myusername/Documents/dir/lib/python2.7/site-packages/praw/__init__.py", line 480, in get_content
    page_data = self.request_json(url, params=params)
  File "/Users/myusername/Documents/dir/lib/python2.7/site-packages/praw/decorators.py", line 161, in wrapped
    return_value = function(reddit_session, *args, **kwargs)
  File "/Users/myusername/Documents/dir/lib/python2.7/site-packages/praw/__init__.py", line 519, in request_json
    response = self._request(url, params, data)
  File "/Users/myusername/Documents/dir/lib/python2.7/site-packages/praw/__init__.py", line 383, in _request
    _raise_response_exceptions(response)
  File "/Users/myusername/Documents/dir/lib/python2.7/site-packages/praw/internal.py", line 172, in _raise_response_exceptions
    response.raise_for_status()
  File "/Users/myusername/Documents/dir/lib/python2.7/site-packages/requests/models.py", line 831, in raise_for_status
    raise HTTPError(http_error_msg, response=self)
requests.exceptions.HTTPError: 503 Server Error: Service Unavailable
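The traceback shows the 503 surfaces only when the generator returned by `r.get_front_page()` makes its first request inside the `for` loop, so the `try`/`except` has to enclose the iteration, not just the `r.get_front_page()` call. A minimal sketch of that placement (the `first_title` helper is illustrative, not part of PRAW, and a stub `HTTPError` is defined in case `requests` is unavailable):

```python
try:
    from requests.exceptions import HTTPError
except ImportError:  # requests not installed; stand-in for this sketch
    class HTTPError(Exception):
        pass

def first_title(submissions):
    # The 503 is raised lazily, on the first iteration step,
    # so the try block must enclose the loop itself.
    try:
        for item in submissions:
            return item.title
    except HTTPError:
        return None
    return None  # iterable was empty
```

Here `submissions` would be whatever `r.get_front_page()` returns; returning `None` on failure lets the caller decide whether to retry.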

1 Answer:

Answer 0 (score: 1):

It looks like r.get_front_page() returns a lazily evaluated object, and you only need the first element from it. If so, try the following:

import time

import praw
from requests.exceptions import HTTPError

def crawlReddit():
    r = praw.Reddit(user_agent='challenge')             # PRAW object
    front_page = r.get_front_page()                     # Lazy generator; no request yet
    try:
        first_headline = next(front_page)               # First request happens here
    except HTTPError:
        return None
    else:
        return first_headline.title


def main():
    max_attempts = 3
    attempts = 1
    headline = crawlReddit()
    while not headline and attempts < max_attempts:
        time.sleep(1)  # Make the program wait a bit before resending request
        headline = crawlReddit()
        attempts += 1
    if not headline:
        print "Request failed after {} attempts".format(max_attempts)


if __name__ == "__main__":
    main()

EDIT: The code now tries to fetch the data up to 3 times, waiting one second between failed attempts, and gives up after the third try. The server may simply be offline, etc.
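If the server is overloaded rather than down, a common refinement of the fixed one-second wait above is exponential backoff, doubling the delay after each failed attempt. A sketch under those assumptions (the `fetch_with_backoff` helper is hypothetical, and a stub `HTTPError` is defined in case `requests` is unavailable):

```python
import time

try:
    from requests.exceptions import HTTPError
except ImportError:  # requests not installed; stand-in for this sketch
    class HTTPError(Exception):
        pass

def fetch_with_backoff(fetch, max_attempts=3, base_delay=1.0):
    """Call fetch(); on HTTPError, retry with doubling delays (1s, 2s, 4s, ...)."""
    for attempt in range(max_attempts):
        try:
            return fetch()
        except HTTPError:
            if attempt + 1 < max_attempts:
                time.sleep(base_delay * (2 ** attempt))
    return None  # every attempt failed
```

With this shape, `crawlReddit` would let the exception propagate instead of catching it, and `main` would call something like `fetch_with_backoff(crawlReddit)`.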