I am currently trying to compute, in Python, the nearest neighbors of several thousand vectors against several million vectors with 50 dimensions each. Running this sequentially seems like a waste of time, so I want to parallelize it.
First, I load preprocessed data with pickle: a dictionary with objects p as keys and their vector representations (p2v) as values, plus some IDs that determine which ps to look at. I then split the IDs into 8 chunks so that each process has to compute the neighbors for several ps. The method find_closest_n_chunks takes an array of IDs and looks up each vector in the p2v dict. It then iterates over the whole dictionary to compute the nearest neighbors (for my tests I only look at 10 elements of the dictionary).
Currently I am using the following code on a Windows machine:
import time
import multiprocessing as mp
import pickle

from scipy.spatial.distance import cosine

def find_closest_n_chunks(ids, p2v, n):
    print("starting a chunk")
    neighbors = []
    for p in ids:
        p2distance = {}
        vector = p2v[p]
        # only look at the first 10 entries for testing
        for key in list(p2v.keys())[:10]:
            distance = cosine(vector, p2v[key])
            p2distance[key] = distance
        sorted_by_distance = sorted(p2distance.items(), key=lambda kv: kv[1])
        neighbors.append(sorted_by_distance[:n])
    print("finished chunk")
    return neighbors
if __name__ == '__main__':
    file = open('./prelim_results/DataHandlerC08.obj', "rb")
    handler = pickle.load(file)
    file = open("./prelim_results/vectorsC08.obj", "rb")
    p2v = pickle.load(file)
    ids = handler.evaluation_ids

    # distribute the ids round robin over 8 chunks and give every
    # chunk its own copy of the p2v dict
    chunks = []
    chunk_p2v = []
    for _ in range(8):
        chunks.append([])
        chunk_p2v.append(p2v.copy())
    for idx, p in enumerate(ids):
        array_idx = idx % 8
        chunks[array_idx].append(p)

    print("starting")
    start = time.time()
    with mp.Pool(processes=8, maxtasksperchild=1) as pool:
        results = []
        for idx, chunk in enumerate(chunks):
            results.append(pool.apply_async(find_closest_n_chunks, (chunk, chunk_p2v[idx], 100)))
        for result in results:
            step = time.time()
            print("step: {0}".format(step - start))
            result.wait()
    end = time.time()
    print(end - start)
This works fine as far as starting several separate processes goes, but they do not run in parallel on my 4-core (8 with hyper-threading) CPU. The output confirms it:
starting
step: 0.1266627311706543
starting a chunk
finished chunk
step: 15.884509086608887
starting a chunk
finished chunk
step: 24.54252290725708
starting a chunk
finished chunk
step: 33.269429445266724
starting a chunk
finished chunk
step: 42.065810680389404
Each "starting a chunk" message only appears after a "finished chunk", and each chunk takes about 8-9 seconds, which means each chunk only starts once another one has finished.
Note that I already tried copying the p2v object to make sure no object is shared between the tasks (as expected, nothing changed, since it is read-only).
Can you point out any mistake I made in my attempt to parallelize this?
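One thing worth measuring at this point (a standalone sketch, not part of my original code; the dict below is synthetic stand-in data for p2v) is how long it takes to pickle a dict of comparable size, since multiprocessing has to serialize every argument passed to apply_async before a worker can start:

```python
import pickle
import time

# Synthetic stand-in for an id -> 50-dimensional-vector dict (assumption:
# the real p2v is much larger; this is just to get a feel for the cost).
p2v = {i: [float(i)] * 50 for i in range(100_000)}

start = time.time()
payload = pickle.dumps(p2v)  # what the Pool does for every task's arguments
elapsed = time.time() - start
print("pickled {0} bytes in {1:.2f}s".format(len(payload), elapsed))
```

If this takes seconds for a realistic dict size, passing a full copy of p2v to each of the 8 tasks would by itself stagger their start times.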
Edit: added the imports at the top of the code, and the pickle statements that load my vectors and IDs.
Edit2: I swapped out find_closest_n_chunks() for a dummy method:
def dummy():
    print("start dummy")
    time.sleep(5)
    print("end dummy")
This gives me the expected output, where all processes start before the first one finishes. I then added a time.sleep(20) to the actual find_closest method, directly after the "starting a chunk" print line. This shows that there is some parallelization:
starting
step: 0.07980728149414062
starting a chunk
starting a chunk
starting a chunk
starting a chunk
finished chunk
step: 35.71958327293396
starting a chunk
finished chunk
step: 45.673808574676514
finished chunk
starting a chunk
finished chunk
step: 55.73376154899597
step: 55.73949694633484
starting a chunk
finished chunk
step: 64.74968791007996
starting a chunk
finished chunk
step: 73.77850985527039
finished chunk
step: 82.86862134933472
finished chunk
91.36373448371887
But each "starting a chunk" still appears about 5-10 seconds after the previous one, so the tasks do not start at the same time.