I have been using the Python Thread library and it seems to have problems returning correct results. When I run the same function ten times in a row, the results are correct eight times and wrong twice.
When the results are incorrect, it is because some of the result dictionaries from the individual calls appear to have been randomly merged together.
This function creates a session that retries the REST calls for certain status codes:
import requests
from requests.adapters import HTTPAdapter
from threading import Thread
from urllib3.util.retry import Retry


# Makes retry sessions
def requests_retry_session(retries=3, backoff_factor=0.3, status_forcelist=(500, 502, 504), session=None):
    """
    Description:
        Creates a session which uses retries
    Input:
        retries (int): Max number of retries
        backoff_factor (float): Backoff factor for the time between retries
        status_forcelist (tuple): Status codes for which to retry
    Returns:
        session: Requests session which handles different status and connection errors
    """
    session = session or requests.Session()
    retry = Retry(
        total=retries,
        read=retries,
        connect=retries,
        redirect=retries,
        backoff_factor=backoff_factor,
        status_forcelist=status_forcelist,
    )
    adapter = HTTPAdapter(max_retries=retry)
    session.mount('http://', adapter)
    session.mount('https://', adapter)
    return session
This function makes the REST calls for multiple URLs:
def make_rest_calls(urls, header, store=None):
    """
    Description:
        Processes list of urls
    Input:
        urls (list): List of urls for rest call
        header (dictionary): Dictionary containing credentials
        store (dictionary): Dictionary for collecting results
    Returns:
        store (dictionary): Dictionary with results
    """
    if store is None:
        store = {}
    for url in urls:
        store[url] = requests_retry_session().get(url, headers=header, timeout=5)
    return store
This function runs the REST calls in multiple threads:
def run_multi_threaded(nthreads, list_of_urls, header):
    """
    Description:
        Runs multiple threads
    Input:
        nthreads (int): Number of threads to run
        list_of_urls (list): List of rest urls
        header (dictionary): Dictionary containing credentials
    Returns:
        store (dictionary): Dictionary with results
    """
    store = {}
    threads = []
    # create the threads
    for i in range(nthreads):
        small_list_of_urls = list_of_urls[i::nthreads]
        t = Thread(target=make_rest_calls, args=(small_list_of_urls, header, store))
        threads.append(t)
    # start the threads
    [t.start() for t in threads]
    # wait for the threads to finish
    [t.join() for t in threads]
    return store
Is this a weakness of the package? Should I use multiple processes instead? Or am I doing something wrong that causes this side effect?
I need to make a large number of calls, hence the multithreading. Obviously it also has to be correct.
Answer 0 (score: 2)
As blhsing mentioned, the question may be missing some details that would help in answering it.
Judging by the description of the results (output that is only sometimes randomly merged), a concurrency problem could arise if the same URL appears more than once in list_of_urls, so that it ends up in the sublists handled by two different threads.
If that happens, the results could look exactly as described, because two different threads may try to modify the same store entry at the same time.
If this is the case, a simple fix may be to make sure your list_of_urls never contains the same URL twice (for example, by using set(list_of_urls)).
It is also worth mentioning that, at least with this usage, the return value of make_rest_calls serves no purpose, because the threads do nothing with the function's return value. The only way to make this work is to mutate the input store dictionary and not return anything.
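A minimal sketch of that idea (run_multi_threaded_dedup is a made-up name, and it assumes make_rest_calls keeps the signature shown in the question and fills the shared store passed to it):

# Sketch only: run_multi_threaded_dedup is a hypothetical variant of the question's
# run_multi_threaded that deduplicates the URLs and lets every thread mutate the
# shared store instead of relying on return values.
from threading import Thread

def run_multi_threaded_dedup(nthreads, list_of_urls, header):
    unique_urls = list(set(list_of_urls))  # make sure no URL appears twice
    store = {}
    threads = []
    for i in range(nthreads):
        small_list_of_urls = unique_urls[i::nthreads]
        # pass the shared store in; make_rest_calls fills it, its return value is ignored
        t = Thread(target=make_rest_calls, args=(small_list_of_urls, header, store))
        threads.append(t)
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return store

With the duplicates removed, each thread writes only its own distinct keys into the shared dictionary, so no two threads ever touch the same store entry.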
Answer 1 (score: 0)
The problem comes from the fact that the threads are not supposed to return anything, and most likely occurs when the store dictionary is written to concurrently.
I would suggest switching from threading to multiprocessing. It costs a small amount of memory (plus the time to spawn the processes), but it runs truly parallel operations (and can be throttled with a timeout if you want to avoid firing too many REST calls within a given time window). You can then use a Pool object and its powerful apply_async method to set up the full set of tasks (where each task is "make the REST call for this single url") and distribute them across the processes asynchronously; you can use a Queue's put method as the callback to make sure the results are collected as tuples, and finally turn those tuples into a dict before returning.
In code:
import multiprocessing

# requests_retry_session is unchanged


# modified based on make_rest_calls:
def make_rest_call(url, header):
    """
    Description:
        Requests a single url. Return its name plus the results.
        Designed for use as a multiprocessed function.
    Input:
        url (str): url for rest call
        header (dictionary): Dictionary containing credentials
    Returns:
        results (tuple): Tuple containing the url and the request's output
    """
    response = requests_retry_session().get(url, headers=header, timeout=5)
    return (url, response)


# full rewrite of run_multi_threaded:
def run_multi_processed(njobs, list_of_urls, header):
    """
    Description:
        Parallelize urls requesting on multiple processes.
    Input:
        njobs (int): Number of processes to run
        list_of_urls (list): List of rest urls
        header (dictionary): Dictionary containing credentials
    Returns:
        results (dictionary): Dictionary with results
    """
    queue = multiprocessing.Manager().Queue()
    with multiprocessing.Pool(njobs) as pool:
        # Set up the tasks list, asynchronously handled by the processes.
        for url in list_of_urls:
            pool.apply_async(
                make_rest_call,
                args=(url, header),
                callback=queue.put
            )
        # Stop accepting new tasks, then wait for all tasks' completion.
        pool.close()
        pool.join()
    # Gather the results and format them into a dict.
    results = {}
    while not queue.empty():
        url, response = queue.get()
        results[url] = response
    return results
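For completeness, a hypothetical usage sketch (the URLs and token below are placeholders, not values from the question); the __main__ guard matters because multiprocessing may re-import the module in the worker processes:

# Hypothetical usage sketch; the URLs and token are placeholders.
if __name__ == '__main__':
    urls = ['https://example.com/api/a', 'https://example.com/api/b']
    header = {'Authorization': 'Bearer <token>'}
    results = run_multi_processed(njobs=4, list_of_urls=urls, header=header)
    for url, response in results.items():
        print(url, response.status_code)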
I hope this helps :-)