I'm having trouble with this conceptually... if a job fails, I can't simply re-add it to the futures. So I'm looking for a simpler Python multithreading approach.
Summary:
My concurrent.futures implementation requires you to load a list onto a specified number of threads and pass the items to the executor via submit. The program loads a large number of URLs from Redis and writes each page's DOM out to a file.
The entire Redis set is read in as a list by return_contents().
import concurrent.futures
import time
from concurrent.futures import ThreadPoolExecutor

URLs = return_contents()
# print(URLs)
print('complete')
start = time.time()
# pass the urls to the ThreadPoolExecutor
with ThreadPoolExecutor(max_workers=20) as executor:
    future_to_url = {executor.submit(load_url, url): url for url in URLs}
    print('jobs are loaded')
    for future in concurrent.futures.as_completed(future_to_url):
        url = future_to_url[future]
        completed = return_to_orig_names(url)
        name_check = check_ismemeber_crawled(completed)
        if name_check == 1:
            print('############################## we are skipping {0} because it is in crawled urls'.format(completed))
        else:
            try:
                data = future.result()
                html_data = str(data)
                add_url_es(url, html_data, es)
                add_completed_to_redis(completed)
            except Exception as exc:
                # on a max-retry (or any other) error, the URL should go into a
                # different Redis set or a file so it can be retried later
                print('%r generated an exception: %s' % (url, exc))
                print('we are going to add this back into the queue')
                # note: this only drops the future; nothing is actually re-queued
                del future_to_url[future]
end = time.time()
print(end - start)
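For what it's worth, re-adding failed jobs is possible without leaving concurrent.futures: keep a dict of pending futures, and after each batch completes, resubmit the failures (up to a retry cap) as the next batch. This is only a sketch, assuming `load_url` is the fetch function from the code above; `process_all` and `max_attempts` are names I made up for illustration.

```python
# A sketch of retrying failed jobs with concurrent.futures: failures are
# resubmitted to the same executor as a new batch until they succeed or
# run out of attempts.
import concurrent.futures
from concurrent.futures import ThreadPoolExecutor

def process_all(urls, load_url, max_workers=20, max_attempts=3):
    results = {}   # url -> fetched data
    failed = []    # urls that exhausted their retries
    attempts = {url: 1 for url in urls}
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        pending = {executor.submit(load_url, url): url for url in urls}
        while pending:
            retry_batch = {}
            for future in concurrent.futures.as_completed(pending):
                url = pending[future]
                try:
                    results[url] = future.result()
                except Exception:
                    if attempts[url] < max_attempts:
                        attempts[url] += 1
                        # "re-add to the queue": submit a fresh future for this url
                        retry_batch[executor.submit(load_url, url)] = url
                    else:
                        failed.append(url)
            pending = retry_batch
    return results, failed
```

Inside the `try` you would keep the existing `add_url_es`/`add_completed_to_redis` calls; the urls left in `failed` are the ones to push into a separate Redis set.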
I'm looking for something more like this...
Pseudocode:
while there are URLs in the set:
    assign a job to thread 1
    assign a job to thread 2
    assign a job to thread 3
    ...
    assign a job to thread 20
    if a job returns a valid result:
        remove the URL from Redis
    else:
        re-add it to Redis
Something like this must already exist...
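It does, in the standard library: the pseudocode maps naturally onto queue.Queue plus plain worker threads, where a failed URL is simply put back on the queue. A minimal sketch, with the Redis set replaced by an in-memory queue for illustration; `crawl` and `fetch` are hypothetical names standing in for the real crawl loop and fetch function.

```python
# Worker-pool sketch of the pseudocode: N threads pull URLs from a shared
# queue; on failure the URL is re-added (up to max_attempts), on success
# the result is recorded.
import queue
import threading

def crawl(urls, fetch, num_workers=20, max_attempts=3):
    work = queue.Queue()
    for url in urls:
        work.put((url, 1))        # (url, attempt number)
    results, failed = {}, []
    lock = threading.Lock()

    def worker():
        while True:
            try:
                url, attempt = work.get_nowait()
            except queue.Empty:
                return            # no work left for this thread
            try:
                html = fetch(url)
            except Exception:
                if attempt < max_attempts:
                    work.put((url, attempt + 1))   # re-add, as in the pseudocode
                else:
                    with lock:
                        failed.append(url)
            else:
                with lock:
                    results[url] = html
            finally:
                work.task_done()

    threads = [threading.Thread(target=worker) for _ in range(num_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results, failed
```

In the real program, the success branch would call `add_url_es`/`add_completed_to_redis` and the `failed` list would be pushed to a separate Redis set instead of kept in memory.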