Question

我正在使用Pool.map作为评分程序：

“游标”，包含来自数据源的数百万个数组
计算
将结果保存在数据接收器中

结果是独立的。

我只是想知道我是否可以避免内存需求。首先似乎每个数组都进入python，然后是2和3 继续。无论如何，我的速度有所改善。

#data src and sink is in mongodb#
def scoring(some_arguments):
        ### some stuff  and finally persist  ###
    collection.update({uid:_uid},{'$set':res_profile},upsert=True)


cursor = tracking.find(timeout=False)
score_proc_pool = Pool(options.cores)    
#finaly I use a wrapper so I have only the document as input for map
score_proc_pool.map(scoring_wrapper,cursor,chunksize=10000)

我做错了什么或为了这个目的有没有更好的python方法？

Answer 1

map的{{1}}函数在内部将iterable转换为列表（如果它没有Pool属性。相关代码位于Pool.map_async，因为__len__（和Pool.map）使用该代码生成结果 - 这也是一个列表。

如果您不想先将所有数据读入内存，则应使用Pool.imap或starmap，这将产生一个迭代器，在它们进入时产生结果。

Python多处理.Pool＆amp;记忆

1 个答案: