PyMongo concurrency and the multiprocessing module

Asked: 2015-08-28 21:10:51

Tags: python mongodb multiprocessing pymongo

I am trying to understand the best way to parallelize queries, or the processing of query results, with PyMongo.

Everything I have read says you should keep the number of MongoClient() objects small. Suppose I have two different implementations of a module data_interface.py. The first creates the database and collection objects on every call:

from pymongo import MongoClient

client = MongoClient('localhost', 27017)

def execute_query(id_to_find):
    db = client['mydatabase']
    my_collection = db.my_collection
    data_cursor = my_collection.find({'_id': id_to_find})
    return data_cursor

And the second creates them once at import time:

from pymongo import MongoClient

client = MongoClient('localhost', 27017)
db = client['mydatabase']
my_collection = db.my_collection

def execute_query(id_to_find):
    data_cursor = my_collection.find({'_id': id_to_find})
    return data_cursor

Suppose a function process_document performs some simple computation on each document, that the collection is many-to-one (a query for one id returns a thousand results), and that I have:

import data_interface
from multiprocessing import Pool

def process_data(ids_to_process):
    # ids_to_process is a list of ids to query
    pool = Pool(processes=4)
    results = pool.map(query_and_process_data, ids_to_process)
    pool.close()
    pool.join()
    return results

def query_and_process_data(id_to_query):
    cursor = data_interface.execute_query(id_to_query)
    processed_results = []
    for result in cursor:
        processed_result = process_document(result)
        processed_results.append(processed_result)

    return processed_results
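
The first driver pattern, parallel queries with sequential per-document processing inside each worker, can be sketched end-to-end with stubs (execute_query and process_document below are hypothetical stand-ins, not real PyMongo calls):

```python
from multiprocessing import Pool

def execute_query(id_to_query):
    # Hypothetical stand-in for data_interface.execute_query:
    # returns an iterable of documents for one id.
    return [{'_id': id_to_query, 'value': i} for i in range(2)]

def process_document(doc):
    # Simple per-document computation, as assumed in the question.
    return doc['value'] + 1

def query_and_process_data(id_to_query):
    # One worker handles one full query and all of its results.
    return [process_document(doc) for doc in execute_query(id_to_query)]

if __name__ == '__main__':
    with Pool(processes=4) as pool:
        results = pool.map(query_and_process_data, ['a', 'b'])
    print(results)  # [[1, 2], [1, 2]]
```

Here parallelism is across ids, not across documents: each worker runs one query and loops over its own results sequentially.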

Or:

import data_interface
from multiprocessing import Pool

def process_data(ids_to_process):
    # ids_to_process is a list of ids to query
    pool = Pool(processes=4)
    all_results = []
    for id_to_query in ids_to_process:
        cursor = data_interface.execute_query(id_to_query)
        data_returned = list(cursor)  # materialize the cursor
        all_results.append(pool.map(process_document, data_returned))
    pool.close()
    pool.join()
    return all_results
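
One detail worth noting about this second driver: execute_query returns a cursor, which is lazy — documents are only fetched as it is iterated, and a live cursor cannot be pickled and handed to pool workers. The results therefore have to be materialized with list() first. A minimal stdlib sketch, with fake_find as a hypothetical stand-in for my_collection.find():

```python
from multiprocessing import Pool

def fake_find(id_to_find):
    # Hypothetical stand-in for my_collection.find(): lazy, like a cursor.
    for i in range(3):
        yield {'_id': id_to_find, 'value': i}

def process_document(doc):
    # Simple per-document computation.
    return doc['value'] * 10

if __name__ == '__main__':
    cursor = fake_find('abc')
    data_returned = list(cursor)  # materialize before handing to pool.map
    with Pool(processes=2) as pool:
        results = pool.map(process_document, data_returned)
    print(results)  # [0, 10, 20]
```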

That gives four possible combinations of implementations here. Are there obvious flaws in any of them? In each case, I believe that when Pool spawns a new Python interpreter, a separate MongoClient is created for each process. Does the second implementation of data_interface support parallel queries, or would I need a new instance of the collection object to achieve that? The difference between the two implementations of process_data is whether the queries run in parallel or each document is processed in parallel.
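
On the per-process MongoClient question: PyMongo's documentation warns that MongoClient is not fork-safe, so the usual advice is for each worker process to create its own client rather than inherit one from the parent. With multiprocessing.Pool that is commonly done through the initializer argument. A stdlib-only sketch of the pattern, using a dict holding the worker's pid as a stand-in for a real MongoClient:

```python
import os
from multiprocessing import Pool

_client = None  # one per worker process, set by the initializer

def init_worker():
    # In real code this would be: _client = MongoClient('localhost', 27017)
    global _client
    _client = {'pid': os.getpid()}

def handle(id_to_query):
    # Each worker reuses its own _client; nothing is shared across forks.
    return (_client['pid'], id_to_query * 2)

if __name__ == '__main__':
    with Pool(processes=4, initializer=init_worker) as pool:
        results = pool.map(handle, [1, 2, 3, 4])
    print([value for _pid, value in results])  # [2, 4, 6, 8]
```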

Note: none of this code has been tested and it may contain errors. I hope it conveys my idea clearly.

0 answers