Multiprocessing and CPU usage for an I/O-bound Python task

Asked: 2018-05-30 15:21:14

Tags: python multithreading python-3.x parallel-processing multiprocessing

So I'm using multiprocessing.Process to try to parallelize the execution of some code that downloads JSON data, prepares it, and writes it to CSV.

The code below works fine, but from what I've been reading, e.g. here, it seems I should(?) also be using threads for the CPU-heavy parts of the code.

I'll simplify the code a bit to make it easier to follow:

from multiprocessing import Process, Lock
import requests, json, csv, os, re, pathlib
from lxml import html
import urllib.request

if __name__ == '__main__':
    def run_parallel(*fn):
        # bunch of code to prep a dictionary structure "links_dict"
        # "links_dict" has 40 keys, and each key is associated with a list of 3000+ urls
        # "sliced_dicts" list is created, with len() = 4
        # each index is a dict with 10 of the keys, and their lists, from links_dict
        l = Lock()
        proc = []
        for i in range(4):
            p = Process(target=getData, args=(sliced_dicts[i], l))
            p.start()
            proc.append(p)
        for p in proc:
            p.join()
    run_parallel(getData)


def getData(links_dict, l):
    for (key, vals) in links_dict.items():  # iterate through the keys
        dir = '/data/%s' % key
        pathlib.Path(dir).mkdir(parents=True, exist_ok=True)
        for link in vals:  # iterate through the links within a given key
            resp = urllib.request.urlopen(link).read()
            data = json.loads(resp)
            # process and clean the data, grab the relevant parts, store it in "curr_dict"
            # "curr_data_name" holds the name of the current dataset
            # ...
            l.acquire()
            with open(dir + '/' + curr_data_name + ".csv", mode="w", newline='') as csvfile:
                writer = csv.writer(csvfile, delimiter=",")
                writer.writerow(["Date", "Value"])
                for row in curr_dict.items():
                    writer.writerow(row)
            # the "with" block closes csvfile, so no explicit close() is needed
            l.release()

The reason I initially avoided threading is that I can't have the different processes sharing state, since each link that gets parsed needs to be written to its own csv file.

Basically each process is initialized with 1/4 of the total links (since I have 4 cores). The only CPU-intensive work happens after the data is downloaded, so from my understanding I should be able to add a new thread for each downloaded resource (as described here), along the lines of the sketch below.
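To make that concrete, here is roughly what I have in mind: each process keeps its loop over keys and links, but hands each download off to a thread pool. This is a minimal, untested sketch; handle_link is a hypothetical worker standing in for the download/parse/write logic in getData above, and max_workers=8 is an arbitrary guess.

from concurrent.futures import ThreadPoolExecutor
import urllib.request, json

def handle_link(key, link, l):
    # hypothetical per-link worker: download and parse here, then do the
    # CPU-heavy processing and the locked csv write, as in getData above
    resp = urllib.request.urlopen(link).read()
    data = json.loads(resp)
    # ... process "data" and write the csv for this key ...

def getData(links_dict, l):
    # one thread pool per process; the downloads are I/O-bound, so the
    # threads can overlap the time spent waiting on the network
    with ThreadPoolExecutor(max_workers=8) as pool:
        for key, vals in links_dict.items():
            for link in vals:
                pool.submit(handle_link, key, link, l)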

So my question is whether I can parallelize the downloading of the resources with threads inside each multiprocessing.Process, without having to combine output within or between those processes/threads?

Otherwise, I'd be happy to hear any general ideas/suggestions on how to speed up the execution.

0 Answers