Python threads randomly return incorrect results

Date: 2019-06-13 11:17:24

Tags: python multithreading rest python-multithreading

My problem

I have been using the Python Thread library, and it seems to have some problems returning correct results. When I run the same function ten times in a row, the results are correct eight times and wrong twice.

When the results are incorrect, it is because some of the result dictionaries from the individual calls appear to have been randomly merged together.

My code:

This function creates a session that retries REST calls for certain status codes:

import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry
from threading import Thread

# Makes retry sessions
def requests_retry_session(retries=3, backoff_factor=0.3, status_forcelist=(500, 502, 504), session=None):
    """
    Description:
        Creates a session which uses retries
    Input:
        retries (int):  Max number of retries
        backoff_factor (float): Time between retries
        status_forcelist (tuple): Statuses for which to retry
    Returns:
        session: Requests session which handles different status and connection errors
    """
    session = session or requests.Session()
    retry = Retry(
        total=retries,
        read=retries,
        connect=retries,
        redirect=retries,
        backoff_factor=backoff_factor,
        status_forcelist=status_forcelist,
    )
    adapter = HTTPAdapter(max_retries=retry)
    session.mount('http://', adapter)
    session.mount('https://', adapter)
    return session

This function makes the REST calls for multiple URLs:

def make_rest_calls(urls, header, store=None):
    """
    Description:
        Processes list of urls
    Input:
        urls (list): List of urls for rest call
        header (dictionary): Dictionary containing credentials
        store (dictionary): Dictionary for collecting results
    Returns:
        store (dictionary): Dictionary with results
    """
    if store is None:
        store = {}
    for url in urls:
        store[url] = requests_retry_session().get(url, headers=header, timeout=5)

    return store

This function runs the REST calls across multiple threads:

def run_multi_threaded(nthreads, list_of_urls, header):
    """
    Description:
        Runs multiple threads
    Input:
        nthreads (int): Number of threads to run
        list_of_urls(list): List of rest urls
        header (dictionary): Dictionary containing credentials
    Returns:
        store (dictionary): Dictionary with results
    """
    store = {}
    threads = []

    # create the threads
    for i in range(nthreads):
        small_list_of_urls = list_of_urls[i::nthreads]
        t = Thread(target=make_rest_calls, args=(small_list_of_urls))
        threads.append(t)

    # start the threads
    [t.start() for t in threads ]
    # wait for the threads to finish
    [ t.join() for t in threads ]

    return store

Question

Is this a weakness of the package? Should I use multiple processes instead? Or am I doing something wrong that causes this side effect?

I need to make a large number of calls, so multithreading is necessary. Obviously the results also have to be correct.

2 answers:

Answer 0 (score: 2)

As blhsing mentioned, the question is probably missing some details that would help in answering it.

From the description of the results (output that is randomly merged only some of the time), it seems a concurrency problem could arise if:

  1. The code passes store as an argument to the make_rest_calls function in the Thread initialization. (This could also happen if the default value of store in make_rest_calls were actually '{}' rather than None.)
  2. list_of_urls holds a few duplicate entries.

If both of these somehow happen, the results could look as described, because two different threads might try to modify the same store entry at the same time.

If that is the case, a simple solution might be to make sure your list_of_urls never holds the same URL twice. (For example, you might want to use set(list_of_urls).)

It is also worth mentioning that, at least with this usage, there is no point in returning anything from the make_rest_calls function, since a thread does not return its target function's return value. The only viable approach here is to mutate the store value that is passed in, without returning anything.
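
As an illustration of both points, here is a minimal sketch (not part of the original answer; it assumes the make_rest_calls function and the header dictionary from the question) in which the launcher deduplicates the URLs and passes store and header explicitly, so every thread mutates the shared dictionary instead of relying on a return value:

from threading import Thread

def run_multi_threaded_fixed(nthreads, list_of_urls, header):
    store = {}
    threads = []
    # Deduplicate so that two threads never write to the same store key.
    unique_urls = list(set(list_of_urls))
    for i in range(nthreads):
        small_list_of_urls = unique_urls[i::nthreads]
        # Pass the shared store and header explicitly; args must be a tuple.
        t = Thread(target=make_rest_calls, args=(small_list_of_urls, header, store))
        threads.append(t)
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return store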

Answer 1 (score: 0)

The problem comes from the fact that threads are not supposed to return anything, and it probably arises when the store dictionary is written to concurrently.

I would advise switching from threading to multiprocessing. It costs a small amount of memory (plus the time to instantiate the processes), but it runs truly parallel operations (which can be tuned with a timeout if you want to avoid firing too many REST calls within a given time frame). You can then use a Pool object and its powerful apply_async method to set up the full set of tasks (where each task is "make the REST call for this single url") and distribute them across the processes asynchronously. A Queue's put method can serve as the callback, so the results are gathered as tuples and finally converted into a dict before returning.

In code:

import multiprocessing

# requests_retry_session is unchanged

# modified based on make_rest_calls:
def make_rest_call(url, header):
    """
    Description:
        Requests a single url. Return its name plus the results.
        Designed for use as a multiprocessed function.
    Input:
        url (str): url for rest call
        header (dictionary): Dictionary containing credentials
    Returns:
        results (tuple): Tuple containing the url and the request's output
    """
    response = requests_retry_session().get(url, headers=header, timeout=5)
    return (url, response)


# full rewrite of run_multi_threaded:
def run_multi_processed(njobs, list_of_urls, header):
    """
    Description:
        Parallelizes url requests across multiple processes.
    Input:
        njobs (int): Number of processes to run
        list_of_urls(list): List of rest urls
        header (dictionary): Dictionary containing credentials
    Returns:
        results (dictionary): Dictionary with results
    """
    queue = multiprocessing.Manager().Queue()
    with multiprocessing.Pool(njobs) as pool:
        # Set up the tasks list, asynchronously handled by processes.
        for url in list_of_urls:
            pool.apply_async(
                make_rest_call,
                args=(url, header),
                callback=queue.put
            )
        # Prevent further task submission, then wait for all tasks to complete.
        pool.close()
        pool.join()
    # Gather the results and format them into a dict.
    results = {}
    while not queue.empty():
        url, response = queue.get()
        results[url] = response
    return results
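
A hypothetical usage might look like the following (the URLs and header here are placeholders; the __main__ guard matters because spawned processes may re-import the module):

if __name__ == '__main__':
    list_of_urls = [
        'https://api.example.com/items/1',  # placeholder URLs
        'https://api.example.com/items/2',
    ]
    header = {'Authorization': 'Bearer <token>'}  # placeholder credentials
    results = run_multi_processed(njobs=4, list_of_urls=list_of_urls, header=header)
    for url, response in results.items():
        print(url, response.status_code)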

I hope this helps :-)