Python multiprocessing Pool not running in parallel

Date: 2019-03-18 15:54:16

Tags: python parallel-processing multiprocessing

I am currently trying to compute, for several thousand vectors, the nearest neighbors among millions of 50-dimensional vectors in Python. Running this sequentially seems like a waste of time, so I want to parallelize it.

First, I load my preprocessed data with pickle: a dictionary with objects p as keys and their vector form (p2v) as values, plus some IDs that determine which ps to look at. I then split the IDs into 8 chunks, so that each process has to compute the neighbors for several ps. The method find_closest_n_chunks takes an array of IDs and looks up each one's vector in the p2v dict. It then iterates over the whole dictionary to compute the nearest neighbors (for my test I only look at 10 elements of the dictionary).
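As an aside, the per-key cosine loop described above can usually be vectorized. A minimal sketch with NumPy (an alternative I have not adopted in the code below; it assumes the vectors can be stacked into one array, which is not how my p2v dict is stored):

```python
import numpy as np

def cosine_distances(query, matrix):
    # Cosine distance of one query vector against every row of `matrix`,
    # computed in a single vectorized pass instead of a Python loop.
    q = query / np.linalg.norm(query)
    m = matrix / np.linalg.norm(matrix, axis=1, keepdims=True)
    return 1.0 - m @ q

rng = np.random.default_rng(0)
vecs = rng.standard_normal((1000, 50))
d = cosine_distances(vecs[0], vecs)
print(d.shape, d[0])  # d[0] is the distance of a vector to itself, approximately 0
```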

Currently I am using the following code on a Windows machine:

import time
import multiprocessing as mp
import pickle
from scipy.spatial.distance import cosine  # cosine() used in find_closest_n_chunks

def find_closest_n_chunks(ids, p2v, n):
    print("starting a chunk")
    neighbors = []
    for p in ids:
        # Distance of p's vector to (a sample of) every vector in p2v.
        p2distance = {}
        vector = p2v[p]
        for key in list(p2v.keys())[:10]:  # only 10 keys while testing
            distance = cosine(vector, p2v[key])
            p2distance[key] = distance
        # Keep the n closest neighbors for this p.
        sorted_by_distance = sorted(p2distance.items(), key=lambda kv: kv[1])
        neighbors.append(sorted_by_distance[:n])
    print("finished chunk")
    return neighbors

if __name__ == '__main__':
    # Load the preprocessed handler (evaluation IDs) and the p2v dict.
    with open('./prelim_results/DataHandlerC08.obj', "rb") as file:
        handler = pickle.load(file)
    with open("./prelim_results/vectorsC08.obj", "rb") as file:
        p2v = pickle.load(file)
    ids = handler.evaluation_ids

    chunks = []
    chunk_p2v = []
    for _ in range(8):
        chunks.append([])
        chunk_p2v.append(p2v.copy())
    for idx, p in enumerate(ids):
        array_idx = idx % 8
        chunks[array_idx].append(p)
    print("starting")
    start = time.time()
    with mp.Pool(processes=8, maxtasksperchild=1) as pool:
        results = []
        for idx, chunk in enumerate(chunks):
            results.append(pool.apply_async(find_closest_n_chunks, (chunk, chunk_p2v[idx], 100)))
        for result in results:
            step = time.time()
            print("step: {0}".format(step-start))
            result.wait()
    end = time.time()
    print(end - start)

This works fine for starting several distinct processes, but they do not run in parallel on my 4-core (8 with hyperthreading) CPU. The output confirms it:

starting
step: 0.1266627311706543
starting a chunk
finished chunk
step: 15.884509086608887
starting a chunk
finished chunk
step: 24.54252290725708
starting a chunk
finished chunk
step: 33.269429445266724
starting a chunk
finished chunk
step: 42.065810680389404

Each "starting a chunk" message only appears after a "finished chunk", and each chunk takes about 8-9 seconds, which means each chunk only starts once another one has finished.

Note that I already tried copying the p2v object, to make sure no object is shared between the tasks (as expected, nothing changed, since it is read-only anyway).
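One thing I have not ruled out: every apply_async call has to pickle its copy of p2v to send it to the worker, and for a dict this large that serialization alone could account for the stagger. A quick way to measure that cost (the payload below is a stand-in, since I cannot share the real p2v):

```python
import pickle
import time

# Hypothetical stand-in for the real p2v dict; size and shape are assumptions.
payload = {"p%d" % i: [0.1] * 50 for i in range(100000)}

start = time.time()
data = pickle.dumps(payload)
print("pickled {} bytes in {:.2f} s".format(len(data), time.time() - start))
```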

Can you point out any mistakes I made in my attempt to parallelize this?

Edit: added the imports at the top of the code, and the pickle statements that load my vectors and IDs.

Edit2: I swapped find_closest_n_chunks() out for a dummy method:

def dummy():
    print("start dummy")
    time.sleep(5)
    print("end dummy")

This gave me the correct output, where all processes start before the first one finishes. I then added a time.sleep(20) to the actual find_closest method, right after the starting print line. This shows that there is some parallelization:

starting
step: 0.07980728149414062
starting a chunk
starting a chunk
starting a chunk
starting a chunk
finished chunk
step: 35.71958327293396
starting a chunk
finished chunk
step: 45.673808574676514
finished chunk
starting a chunk
finished chunk
step: 55.73376154899597
step: 55.73949694633484
starting a chunk
finished chunk
step: 64.74968791007996
starting a chunk
finished chunk
step: 73.77850985527039
finished chunk
step: 82.86862134933472
finished chunk
91.36373448371887

But each "starting a chunk" still appears roughly 5-10 seconds after the previous one, so they do not start at the same time.

0 Answers