Question

I want to find out which kind of multiprocessing suits my data best. I am trying to parallelize this loop:

import math
import numpy as np

def __pure_calc(args):
    # all inputs arrive packed in a single list
    j, point_array, empty, tree = args

    for i in j:
        # tree.query returns (distance, index) of the nearest neighbour
        p = tree.query(i)

        euc_dist = math.sqrt(np.sum((point_array[p[1]] - i) ** 2))

        # add one row at a time to the list
        empty.append([i[0], i[1], i[2], euc_dist, point_array[p[1]][0], point_array[p[1]][1], point_array[p[1]][2]])

    return empty

The pure function takes 6.52 s.

My first approach was multiprocessing's Pool.map:

import os
from multiprocessing import Pool

def __multiprocess(las_point_array, point_array, empty, tree):

    pool = Pool(os.cpu_count()) 

    for j in las_point_array:
        args=[j, point_array, empty, tree]
        results = pool.map(__pure_calc, args)

    #close the pool and wait for the work to finish 
    pool.close() 
    pool.join() 

    return results

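From the other answers I checked on how to multiprocess a function, it should be easy: map(function, inputs) and done. But for some reason my multiprocessing call does not accept my inputs, and it raises the error that the scipy.spatial.ckdtree.cKDTree object is not subscriptable.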

So I tried apply_async:

import os
from multiprocessing.pool import ThreadPool

def __multiprocess(arSegment, wires_point_array, ptList, tree):

    pool = ThreadPool(os.cpu_count())

    # the whole list is handed to __pure_calc as ONE positional argument
    args = [arSegment, wires_point_array, ptList, tree]

    result = pool.apply_async(__pure_calc, [args])

    results = result.get()

    return results

It works without any problem. For my test data, the calculation took 6.42 s.

Why does apply_async accept the cKDTree without any problem while pool.map does not? What do I need to change to make it run?

Answer 1

pool.map(function, iterable) works basically like an itertools-style map. Each item of the iterable becomes the args of one call to the function. In your code the iterable is [j, point_array, empty, tree], so one of the calls receives the cKDTree itself as args, and args[0] then fails with the "not subscriptable" error. apply_async(function, [args]) instead passes the whole list as a single argument, which is why it works.
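To make the difference visible, here is a minimal sketch. It assumes a hypothetical helper `describe` and a placeholder list `mixed` standing in for [j, point_array, empty, tree]; neither is from the question. It uses ThreadPool, which exposes the same map and apply_async methods as Pool, only to keep the example short.

```python
from multiprocessing.pool import ThreadPool

def describe(args):
    # Hypothetical helper: report the type of what a single call received.
    return type(args).__name__

def calls_seen_by_map(items):
    # map splits the iterable: one call per item.
    with ThreadPool(2) as pool:
        return pool.map(describe, items)

def call_seen_by_apply_async(items):
    # apply_async takes a tuple/list of positional arguments,
    # so [items] means "one argument, namely the whole list".
    with ThreadPool(2) as pool:
        return pool.apply_async(describe, [items]).get()

# Placeholder values standing in for [j, point_array, empty, tree].
mixed = [[1.0, 2.0], "tree-placeholder", 3]

print(calls_seen_by_map(mixed))         # ['list', 'str', 'int']
print(call_seen_by_apply_async(mixed))  # list
```

With map, the function is called three times and the second call gets the bare placeholder, just as __pure_calc would get the bare tree. With apply_async, the function is called once with the full list.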

In this case, I think you may want to change the way the arguments are passed to:

__pure_calc
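One possible way to restructure the call is sketched below. It is not the asker's exact code: `pure_calc`, `multiprocess` and `BruteForceTree` are illustrative names, and `BruteForceTree` is a hypothetical pure-Python stand-in that only mimics the `query(point) -> (distance, index)` shape of cKDTree so the example runs without SciPy. The idea is to build one complete args list per chunk and pass the list of those lists to map. The sketch also returns rows from each call instead of appending to a shared list, since with a process Pool the arguments are copied to the workers and appends made there would not reach the parent's list.

```python
import math
from multiprocessing.pool import ThreadPool

class BruteForceTree:
    """Hypothetical stand-in for cKDTree: query(p) -> (distance, index)."""
    def __init__(self, points):
        self.points = points

    def query(self, p):
        dists = [math.dist(p, q) for q in self.points]
        idx = min(range(len(dists)), key=dists.__getitem__)
        return dists[idx], idx

def pure_calc(args):
    # Each call receives ONE complete args list.
    chunk, point_array, tree = args
    rows = []
    for i in chunk:
        dist, idx = tree.query(i)
        # query point, distance, nearest point
        rows.append([*i, dist, *point_array[idx]])
    return rows

def multiprocess(chunks, point_array, tree, workers=2):
    # One inner list per chunk, so map hands each call a full set of inputs.
    tasks = [[chunk, point_array, tree] for chunk in chunks]
    with ThreadPool(workers) as pool:
        per_chunk = pool.map(pure_calc, tasks)
    # Flatten the per-chunk results into one list of rows.
    return [row for rows in per_chunk for row in rows]

points = [(0.0, 0.0, 0.0), (10.0, 0.0, 0.0)]
chunks = [[(1.0, 0.0, 0.0)], [(9.0, 0.0, 0.0), (0.0, 0.0, 0.0)]]
for row in multiprocess(chunks, points, BruteForceTree(points)):
    print(row)
```

ThreadPool is used here only so the sketch stays self-contained. With a process Pool the same task layout applies, but the worker function must be importable at module level and the pool should be created under an `if __name__ == "__main__":` guard.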
