I need to fetch JSON from each URL; the response contains child URLs and a reward (integer) value. The goal is to traverse the whole URL tree and compute the sum of the reward values. My code works, but I am trying to parallelize it. I found multiprocessing, but how do I use it to run my custom fetch() function on different URLs at the same time?
import requests


def fetch(url):
    json_data = requests.get(url).json()
    try:
        children = list(json_data['children'])  # no duplicate children
        for i in children:
            next_url.append(i)
    except KeyError:
        print('Tree end')
    reward = json_data['reward']
    reward_list.append(reward)
Answer 0 (score: 0)
Assuming you have a list of URLs and want to fetch the reward for each one, I would split the list into buckets and add a new loop inside fetch:
import multiprocessing as mp

import requests


def fetch(urls, reward_lst):
    for url in urls:
        json_data = requests.get(url).json()
        try:
            children = list(json_data['children'])
            for i in children:
                next_url.append(i)
        except KeyError:
            print('Tree end')
        reward = json_data['reward']
        reward_lst.append(reward)

def run():
    core_num = mp.cpu_count()
    bucket_size = (len(urls) // core_num) + 1
    reward_lst = mp.Manager().list()  # shared list, visible to all processes
    jobs = []
    for i in range(core_num):
        url_bucket = urls[i * bucket_size:(i + 1) * bucket_size]  # slice, not a tuple index
        p = mp.Process(target=fetch, args=(url_bucket, reward_lst))
        p.start()
        jobs.append(p)
    [p.join() for p in jobs]
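A simpler variant of the same bucket idea, when the URL list is known up front, is to let a Pool do the chunking for you. The sketch below uses the thread-based `multiprocessing.dummy.Pool` (same API as `multiprocessing.Pool`, and threads are fine for I/O-bound HTTP work); `fetch_one` is a hypothetical stand-in for the real `requests.get(url).json()['reward']` call, so the example runs without a network:

```python
from multiprocessing.dummy import Pool  # thread-based Pool with the same API


def fetch_one(url):
    # hypothetical stand-in for requests.get(url).json()['reward'];
    # here it just derives an integer from the URL so the sketch is self-contained
    return len(url)


def run(urls, workers=4):
    # Pool.map splits urls into chunks, runs fetch_one concurrently,
    # and returns the results in the original order
    with Pool(workers) as pool:
        rewards = pool.map(fetch_one, urls)
    return sum(rewards)


total = run(['http://a/1', 'http://a/22', 'http://a/333'])
```

Note this only covers a fixed list of URLs; it does not feed newly discovered children back into the pool, which is what the next answer addresses.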
Answer 1 (score: 0)
If I understood the question, you want to seed the system with one URL and have it dynamically feed newly discovered URLs back in. You can do that with a task queue and a list of threads. Put a single URL into the queue, and the threads will feed the URLs they discover back into it for further processing.
import queue
import threading

import requests


def fetch_worker(url_q, reward_list):
    while True:
        try:
            url = url_q.get()
            # controller requests exit
            if url is None:
                return
            # get url data
            json_data = requests.get(url).json()
            # queue child urls as more tasks
            for child in json_data.get('children', []):  # no duplicate children
                url_q.put(child)
            # add found reward to list
            reward_list.append(json_data['reward'])
        finally:
            url_q.task_done()

def fetch(url):
    NUM_WORKERS = 10  # just a guess
    reward_list = []
    url_q = queue.Queue()
    threads = [threading.Thread(target=fetch_worker, args=(url_q, reward_list))
               for _ in range(NUM_WORKERS)]
    for t in threads:
        t.start()
    url_q.put(url)
    # wait for the url and all subordinate urls to process
    url_q.join()
    # shut down the workers
    for _ in range(NUM_WORKERS):
        url_q.put(None)
    for t in threads:
        t.join()
    return reward_list
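With this in place, the sum the question asks for is just `sum(fetch(root_url))`. The queue-plus-sentinel shutdown used above is a pattern worth seeing in isolation; this is a minimal, network-free sketch of the same mechanics, where the worker doubles numbers instead of fetching URLs (all names here are illustrative):

```python
import queue
import threading


def worker(task_q, results):
    while True:
        item = task_q.get()
        try:
            if item is None:  # sentinel: controller requests exit
                return
            results.append(item * 2)  # stand-in for the real fetch work
        finally:
            task_q.task_done()  # every get() must be matched by task_done()


task_q = queue.Queue()
results = []
threads = [threading.Thread(target=worker, args=(task_q, results))
           for _ in range(3)]
for t in threads:
    t.start()
for n in (1, 2, 3):
    task_q.put(n)
task_q.join()          # blocks until every queued item is marked done
for _ in range(3):     # one sentinel per worker so each thread returns
    task_q.put(None)
for t in threads:
    t.join()
```

`task_q.join()` is what lets the controller know the tree is exhausted: it unblocks only when the done-count matches the put-count, which naturally accounts for items the workers themselves enqueue.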