Multiple URL requests to an API without errors from urllib2 or requests

Time: 2016-08-20 18:19:54

Tags: python python-requests urllib2 urllib httplib2

I am trying to fetch data from different APIs. The responses arrive in JSON format, are stored in SQLite, and then parsed.

The problem I have is that when sending many requests, I eventually get errors, even though I use time.sleep between requests.

Current approach

My code looks like the following, where this runs inside a loop and the URL to open keeps changing:

base_url = 'https://www.whateversite.com/api/index.php?'
custom_url = 'variable_text1' + '&' + 'variable_text2'

url = base_url + custom_url  # url will be changing

time.sleep(1)
data = urllib2.urlopen(url).read()

This runs thousands of times inside the loop. The problem appears after the script has been running for a while (up to a few hours), when I get the following or a similar error:

    data = urllib2.urlopen(url).read()
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib2.py", line 127, in urlopen
    return _opener.open(url, data, timeout)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib2.py", line 404, in open
    response = self._open(req, data)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib2.py", line 422, in _open
    '_open', req)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib2.py", line 382, in _call_chain
    result = func(*args)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib2.py", line 1222, in https_open
    return self.do_open(httplib.HTTPSConnection, req)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib2.py", line 1184, in do_open
    raise URLError(err)
urllib2.URLError: <urlopen error [Errno 8] nodename nor servname provided, or not known>

    uh = urllib.urlopen(url)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib.py", line 87, in urlopen
    return opener.open(url)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib.py", line 208, in open
    return getattr(self, name)(url)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib.py", line 437, in open_https
    h.endheaders(data)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/httplib.py", line 969, in endheaders
    self._send_output(message_body)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/httplib.py", line 829, in _send_output
    self.send(msg)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/httplib.py", line 791, in send
    self.connect()
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/httplib.py", line 1172, in connect
    self.timeout, self.source_address)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/socket.py", line 553, in create_connection
    for res in getaddrinfo(host, port, 0, SOCK_STREAM):
IOError: [Errno socket error] [Errno 8] nodename nor servname provided, or not known

I believe this is because the modules fail at some point if used heavily over a short period. (Note that Errno 8, "nodename nor servname provided, or not known", comes out of getaddrinfo in both tracebacks, so the failure is in DNS name resolution rather than in the HTTP exchange itself.)

From what I have seen in the many different threads on which module is better, I think any of them would work for my needs, and the main criterion for choosing one is that it can open as many URLs as possible. In my experience, urllib and urllib2 are better than requests, because requests crashes in a shorter time.

Assuming I don't want to increase the waiting time used in time.sleep, these are the solutions I have considered so far:

Possible solutions?

A

I thought about combining all the different modules. That would be:

  • Start with one module, for example requests
  • After a certain time, or once an error occurs, automatically switch to urllib2
  • After a certain time, or once an error occurs, automatically switch to another module (such as httplib2 or urllib) or back to requests
  • And so on...
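The switching idea above can be sketched as a generic fallback chain. This is a minimal sketch, not the implementation from the question: the fetch functions are injected as plain callables, so real ones (wrapping urllib2.urlopen, requests.get, httplib2, etc.) can be swapped in; the stand-in functions in the demo are hypothetical.

```python
def fetch_with_fallback(url, fetchers):
    """Try each fetcher in turn; return the first successful result.

    `fetchers` is a non-empty list of callables that take a URL and
    return the response body, each raising IOError (or a subclass,
    such as urllib2.URLError) on failure.
    """
    last_error = None
    for fetch in fetchers:
        try:
            return fetch(url)
        except IOError as e:
            last_error = e  # remember why this module failed, try the next
    if last_error is not None:
        raise last_error  # every module failed for this URL
    raise ValueError('no fetchers given')
```

With real modules the list could be something like `[requests_fetch, urllib2_fetch, httplib2_fetch]`, each a small wrapper normalizing that library's errors to IOError.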

B

Handle that exception with a try .. except block, as suggested here.

C

I have also read about sending multiple requests at once or in parallel. I don't know exactly how this works, or whether it would actually help.
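For reference, parallel requests usually mean a small thread pool working through the URL list. This is only a sketch of the idea, assuming an injected `fetch` callable (e.g. wrapping urllib2.urlopen); multiprocessing.dummy provides a thread-based Pool that also exists on Python 2.7.

```python
from multiprocessing.dummy import Pool  # thread-based Pool, py2.7 and py3

def fetch_all(urls, fetch, workers=4):
    """Fetch every URL using up to `workers` concurrent threads.

    Returns a list of (url, result_or_exception) pairs in input order,
    so failures can be logged (e.g. into SQLite) and retried later.
    """
    def safe_fetch(url):
        try:
            return (url, fetch(url))
        except IOError as e:
            return (url, e)  # keep the error instead of aborting the batch

    pool = Pool(workers)
    try:
        return pool.map(safe_fetch, urls)
    finally:
        pool.close()
        pool.join()
```

Note that parallelism makes rate-limiting problems worse, not better, so a small `workers` value (2-4) is the safer choice against a single API.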

However, I am not convinced by any of these solutions.

Can you think of a more elegant and/or effective solution to handle this error?

I am using Python 2.7.

1 answer:

Answer 0 (score: 0)

Even though I was not convinced, I ended up trying to implement the try .. except block, and I am very happy with the result:

for url in list_of_urls:
    time.sleep(2)
    try:
        response = urllib2.urlopen(url)
        data = response.read()
        time.sleep(0.1)
        response.close() #as suggested by zachyee in the comments

        #code to save data in SQLite database

    except urllib2.URLError as e:
        print '***** urllib2.URLError: <urlopen error [Errno 8] nodename nor servname provided, or not known> *****'
        #save error in SQLite
        cur.execute('''INSERT INTO Errors (error_type, error_ts, url_queried)
        VALUES (?, ?, ?)''', ('urllib2.URLError', timestamp, url))
        conn.commit()
        time.sleep(30) #give it a small break

The script ran through to the end.

Out of many thousands of requests I got 8 errors, which were saved in my database together with their related URLs. That way, I can try to retrieve those URLs again if needed.
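Since the failed URLs end up in the database, a second pass can retry them with an exponential backoff instead of a fixed 30-second pause. This is a sketch under the same assumption of an injected `fetch` callable (e.g. wrapping urllib2.urlopen); it is not part of the original answer.

```python
import time

def retry_with_backoff(url, fetch, attempts=4, base_delay=1.0):
    """Retry a flaky fetch, doubling the wait after each failure."""
    delay = base_delay
    for attempt in range(attempts):
        try:
            return fetch(url)
        except IOError:
            if attempt == attempts - 1:
                raise  # out of attempts, surface the last error
            time.sleep(delay)
            delay *= 2  # exponential backoff: 1s, 2s, 4s, ...
```

Transient DNS failures like Errno 8 often clear up within a few seconds, so a handful of attempts with growing delays usually recovers most of the saved URLs.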