So I'm using multiprocessing.Process to try to parallelize the execution of some code that downloads JSON data, prepares it, and writes it to CSV.
The code below works fine, but from what I've been reading, e.g. here, it seems like I should also(?) be threading the CPU-heavy parts of the code.
I'll simplify the code a bit to make it easier to follow:
from multiprocessing import Process, Lock
import requests, json, csv, os, re, pathlib
from lxml import html
import urllib.request

def getData(links_dict, l):
    for (key, vals) in links_dict.items():  # iterate through the keys
        dir = '/data/%s' % key
        pathlib.Path(dir).mkdir(parents=True, exist_ok=True)
        for link in vals:  # iterate through the links within a given key
            resp = urllib.request.urlopen(link).read()
            data = json.loads(resp)
            # process and clean the data, grab the relevant parts, store them in "curr_dict"
            # ...
            l.acquire()
            # "curr_data_name" is the name of the dataset currently being written
            with open(dir + '/' + curr_data_name + ".csv", mode="w", newline='') as csvfile:
                writer = csv.writer(csvfile, delimiter=",")
                writer.writerow(["Date", "Value"])
                for row in curr_dict.items():
                    writer.writerow(row)
            l.release()

if __name__ == '__main__':
    def run_parallel(*fn):
        # bunch of code to prep a dictionary structure "links_dict"
        # "links_dict" has 40 keys, and each key is associated with a list of 3000+ urls
        # "sliced_dicts" list is created, with len() = 4
        # each index is a dict with 10 of the keys, and their lists, from links_dict
        l = Lock()
        proc = []
        for i in range(4):
            p = Process(target=getData, args=(sliced_dicts[i], l))
            p.start()
            proc.append(p)
        for p in proc:
            p.join()

    run_parallel(getData)
The reason I initially avoided threading was that I can't have the different processes sharing state, since every link, once parsed, needs to be written to its own separate csv file.
Basically each process is initialized with 1/4 of the total links (since I have 4 cores). The only CPU-intensive operation comes after the data has been downloaded, so from my understanding I should be able to add a new thread for each resource being downloaded (as described here).
So my question is whether I can download the resources in parallel (using threads) inside a multiprocessing.Process, given that I don't combine output within or between these processes/threads? A rough sketch of what I mean is below.
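This isn't tested, just an illustration of the idea: a concurrent.futures.ThreadPoolExecutor inside each process so the downloads overlap. The process()/name_for() helpers and the worker count of 8 are placeholders standing in for my real cleaning/naming code, not anything from the code above:

from concurrent.futures import ThreadPoolExecutor
import json, csv, pathlib
import urllib.request

def getData(links_dict, l):
    def fetch(link):
        # network-bound: each thread blocks on urlopen, not on the CPU
        return json.loads(urllib.request.urlopen(link).read())

    for key, vals in links_dict.items():
        dir = '/data/%s' % key
        pathlib.Path(dir).mkdir(parents=True, exist_ok=True)
        # download up to 8 resources at a time within this process
        with ThreadPoolExecutor(max_workers=8) as pool:
            for link, data in zip(vals, pool.map(fetch, vals)):
                curr_dict = process(data)        # placeholder for the clean/prep step
                curr_data_name = name_for(link)  # placeholder for naming the csv
                l.acquire()
                with open(dir + '/' + curr_data_name + ".csv", mode="w", newline='') as csvfile:
                    writer = csv.writer(csvfile, delimiter=",")
                    writer.writerow(["Date", "Value"])
                    for row in curr_dict.items():
                        writer.writerow(row)
                l.release()

Since the downloads are network-bound, my understanding is the threads would mostly be blocked on I/O (so the GIL shouldn't matter much), and the lock is still only held around the csv write.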
Otherwise, I'd be glad to hear any general thoughts/suggestions on how to speed up the execution.