Question

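How can I import data from MongoDB in parallel? One solution would be to scan the whole MongoDB collection (say it has 1000 rows), split it, fetch the rows in batches of 100, and then combine the batches again so that all 1000 rows are back together.

Below is the code that imports data from MongoDB into Python.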

import pandas as pd
from pymongo import MongoClient


def _connect_mongo(host, port, username, password, db):
    """ A util for making a connection to mongo """

    if username and password:
        mongo_uri = 'mongodb://%s:%s@%s:%s/%s' % (username, password, host, port, db)
        conn = MongoClient(mongo_uri)
    else:
        conn = MongoClient(host, port)


    return conn[db]


def read_mongo(db, collection, query=None, host='localhost', port=27017, username=None, password=None, no_id=True):
    """ Read from Mongo and Store into DataFrame """

    # Connect to MongoDB
    db = _connect_mongo(host=host, port=port, username=username, password=password, db=db)

    # Make a query to the specific DB and Collection
    # (None means "match everything"; this avoids a mutable default argument)
    cursor = db[collection].find(query or {})

    # Expand the cursor and construct the DataFrame
    df = pd.DataFrame(list(cursor))

    # Delete the _id (the column is absent when the query matched no documents)
    if no_id and '_id' in df.columns:
        del df['_id']

    return df

Answer 1

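As I said, have you tried optimizing the database with indexes? If the database itself is slow, I don't think parallelizing will improve it. If you still want to go parallel, call read_mongo from multiple threads.

For indexes, you should check https://docs.mongodb.com/manual/indexes/

There is nothing code-related here; you just need to know your database better.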
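Although indexing is a database matter rather than a code one, it can be checked from Python. The following is only a rough sketch: it assumes a pymongo collection object, and the field name passed to ensure_index is whatever field your queries filter on (hypothetical here). The helpers look at the output of explain() for a COLLSCAN stage, which indicates a full collection scan. The exact layout of the explain document varies between server versions, so the plan is searched recursively instead of relying on fixed keys.

```python
def plan_stages(plan):
    """Recursively collect every "stage" value found in an explain() plan."""
    stages = []
    if isinstance(plan, dict):
        if "stage" in plan:
            stages.append(plan["stage"])
        for value in plan.values():
            stages.extend(plan_stages(value))
    elif isinstance(plan, list):
        for item in plan:
            stages.extend(plan_stages(item))
    return stages


def uses_collection_scan(collection, query):
    """Return True if the winning plan for `query` contains a COLLSCAN stage."""
    explain = collection.find(query).explain()
    winning = explain.get("queryPlanner", {}).get("winningPlan", {})
    return "COLLSCAN" in plan_stages(winning)


def ensure_index(collection, field):
    """Create an ascending single-field index; returns the index name."""
    # 1 is the value of pymongo.ASCENDING
    return collection.create_index([(field, 1)])
```

If uses_collection_scan returns True for the query you pass to read_mongo, creating an index on the filtered field is probably worth trying before any parallelization.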

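As for the code, Python offers concurrency (threading) or parallelism (the multiprocessing package). You need to call read_mongo with queries that have already been defined and split.

There are plenty of examples of that out there. I would try the indexes first, though, because they will also help the parallel approach later.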
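A minimal sketch of that idea, using a thread pool from the standard library. The names split_id_ranges and read_chunks_in_threads are made up for this illustration, and the sketch assumes the collection has a numeric, ideally indexed, field called "seq" (hypothetical) that the data can be split on. Adjust the filters to whatever field your documents actually have.

```python
from concurrent.futures import ThreadPoolExecutor


def split_id_ranges(low, high, n_chunks):
    """Split the half-open range [low, high) into at most n_chunks query filters.

    Assumes a hypothetical numeric field named "seq" in each document.
    """
    if high <= low:
        return []
    step = -(-(high - low) // n_chunks)  # ceiling division
    return [
        {"seq": {"$gte": start, "$lt": min(start + step, high)}}
        for start in range(low, high, step)
    ]


def read_chunks_in_threads(read_fn, queries, max_workers=4):
    """Call read_fn(query) for each query in a thread pool.

    Results are returned in the same order as `queries`.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(read_fn, queries))


# Usage with the read_mongo function from the question
# ('mydb' and 'mycoll' are placeholder names):
#
#   queries = split_id_ranges(0, 1000, 10)
#   frames = read_chunks_in_threads(
#       lambda q: read_mongo('mydb', 'mycoll', query=q), queries)
#   df = pd.concat(frames, ignore_index=True)
```

Threads are a reasonable first choice here because the work is mostly waiting on the network. Whether this is actually faster than a single query depends on the server, so it is worth measuring both.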

Parallelizing data import from MongoDB in Python
