Handling timeout exceptions and time.sleep with urllib2 + pool.map

Asked: 2014-06-12 17:09:08

Tags: python-2.7 multiprocessing urllib2

I'm new to Python, and I've written some code to download data from a Web API. However, there are a few restrictions I have to respect when using the API:

  • 1 request per second per API key
  • If a timeout occurs, wait 30 seconds before trying again
  • Limit of 100k requests per day per API key
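
Just to make the limits concrete, here is a tiny sketch (made-up names, not code from my project) of one way the first rule could be tracked per key:

import time

last_request = {}  # made-up bookkeeping: api_key -> time of the last request made with that key

def wait_for_key(api_key):
    '''Block until api_key may be used again (at most 1 request per second per key).'''
    previous = last_request.get(api_key)
    if previous is not None:
        elapsed = time.time() - previous
        if elapsed < 1:
            time.sleep(1 - elapsed)
    last_request[api_key] = time.time()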

The code for the method that makes the requests to the Web API is:

def getMatchDetails(self,match_id):
    '''Calls the WEB Api and requests the data for the match with
    a specific id (in match_id). Then returns the data already decoded 
    from json.'''
    import urllib2
    import json
    import time
    url = self.__makeUrl__(api_key= self.api_key, parameters = ['match_id='+str(match_id)])
    # Sometimes a time out occurs, we keep trying
    while True:
        try:
            start = time.time()
            json_obj = urllib2.urlopen(url)
            end = time.time()
            if end - start < 1:
                time.sleep(1 - (end - start))
        except:
            print('Timed Out, Trying again in 30 seconds')
            time.sleep(30)
            continue
        else:
            break
    detailed_data = json.load(json_obj)
    return detailed_data

The __makeUrl__ method simply concatenates a few strings and returns the resulting URL (a rough sketch of what it might look like is shown after the next snippet). To change the API key on every call to the method above, I use:

def getMatchDetailsForMap(self,match_id):
    self.counter += 1
    self.api_key = self.api_keys[self.counter%len(self.api_keys)]
    return self.getMatchDetails(match_id)
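
__makeUrl__ itself is not shown here; a rough sketch of what such a method might look like, with an assumed endpoint and query format (purely illustrative):

def __makeUrl__(self, api_key, parameters):
    '''Sketch only: joins an assumed base URL, the API key and any extra
    query parameters into a single request URL.'''
    base = 'https://api.example.com/GetMatchDetails/'  # assumed endpoint
    query = '&'.join(['key=' + api_key] + parameters)
    return base + '?' + query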

In the snippet above, self.api_keys is a list containing all the API keys. I then feed the getMatchDetailsForMap method to the map function in the code below:

from multiprocessing.dummy import Pool as ThreadPool
pool = ThreadPool(14)
ids_to_get = self.__idsToGetChunks__(14)
for chunk in ids_to_get:
    results = pool.map(self.getMatchDetailsForMap, chunk)

The __idsToGetChunks__ method returns a list of lists (chunks) holding the arguments (match_id) that are fed to the getMatchDetailsForMap method.
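
That method is not shown either; a sketch of a typical chunking helper, assuming the pending ids live in an attribute I'm calling self.ids_to_get:

def __idsToGetChunks__(self, chunk_size):
    '''Sketch only: splits the pending match ids into lists of chunk_size
    elements, one list per pool.map call.'''
    ids = self.ids_to_get  # assumed attribute holding the pending match ids
    return [ids[i:i + chunk_size] for i in range(0, len(ids), chunk_size)]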

Questions:

  • Trying the code out, I realized that the 1-request-per-second limit per key is not respected; why is that?
  • When a timeout occurs, it really slows down the data fetching; is there a better way to handle this exception when using map? (hints appreciated)

Thanks for reading and helping! Sorry for the long post.

1 Answer:

Answer 0 (score: 0)

To satisfy the three requirements, I suggest writing a simple for loop that makes one request per iteration. Normally, wait one second. If a timeout occurs, wait 30 seconds. Don't loop more than 100k times. (I'm assuming this script runs once per day and takes less than 24 hours ;))

The main program starts one Process per API key.

Simple!

# 1 request per second per API key
# If a timeout occurs, wait 30 seconds before trying again
# Limit of 100k requests per day per API key

import logging, time, urllib2
import multiprocessing as mp

def do_fetch(key, timeout):
    return urllib2.urlopen(
        'http://example.com', timeout=timeout
    ).read()

def get_data(api_key):
    logger = mp.get_logger()
    data = None
    # Limit of 100k requests per day per API key
    for num in range(100*1000): 
        t = 1 if num != 1 else 0  # force a timeout on the second request to test the exception path
        try:
            data = do_fetch(api_key, timeout=t)
            logger.info('%d bytes', len(data))
        except urllib2.URLError as exc:
            logger.error('exc: %s', repr(exc))
            # If a timeout occurs, wait 30 seconds before trying again
            time.sleep(30)
        else:
            # "1 request per second per API key"
            time.sleep(1)


mp.log_to_stderr(level=logging.INFO)
keys = [123, 234]
pool = mp.Pool(len(keys))
pool.map( get_data, keys )

Output

[INFO/PoolWorker-1] child process calling self.run()
[INFO/PoolWorker-2] child process calling self.run()
[INFO/PoolWorker-2] 1270 bytes
[INFO/PoolWorker-1] 1270 bytes
[ERROR/PoolWorker-2] exc: URLError(error(115, 'Operation now in progress'),)
[ERROR/PoolWorker-1] exc: URLError(error(115, 'Operation now in progress'),)
[INFO/PoolWorker-2] 1270 bytes
[INFO/PoolWorker-1] 1270 bytes
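
One caveat on the sketch above: on platforms where multiprocessing spawns new interpreters instead of forking (e.g. Windows), the pool setup should sit under a main-module guard, roughly:

if __name__ == '__main__':
    mp.log_to_stderr(level=logging.INFO)
    keys = [123, 234]
    pool = mp.Pool(len(keys))
    pool.map(get_data, keys)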