I am working on a backend built with Node.js, Mongoose, MongoDB, and IronMQ. There is also another app (a Python FTP server) which is used as a data source.
The system, more or less, works like this:
A user uploads a CSV dump of data (almost 3 million entries) to the FTP server (this happens periodically, once every 24 hrs)
The FTP server parses the data and pushes it to an IronMQ queue in batches (of 2000) synchronously. I'm doing the batching here to optimize for memory
Another app (Node.js) keeps polling this queue for the data, 100 messages (the maximum number allowed) every 10 seconds, works on this data, and then updates my db (using findOneAndUpdate for each message). I have 5 of these apps running.
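To make the bottleneck concrete, this is roughly what each worker does per polled batch (a stripped-down sketch; the { _id, value } message shape is a placeholder for my actual payload):

// Current approach: one round-trip to MongoDB per queued message
messages.forEach(function(msg) {
    Model.findOneAndUpdate(
        { "_id": msg._id },
        { "$set": { "value": msg.value } }, // placeholder update
        function(err, doc) {
            if (err) console.error(err);
        }
    );
});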
Now there aren't any glaring issues with this setup except for the time taken for the whole operation to complete. It takes almost 2 hours for the parsed data to be pushed to the MQ completely, but this is not much of a problem since it's being done in batches. The actual problem comes with the "saving/updating to db" part.
On average, 20-24K entries are updated in the db every hour. But since I have 3 million entries, this is taking far more than 24 hrs (at ~22K/hour, the full 3 million would take roughly 5-6 days), which doesn't work since the files on the FTP server get refreshed every 24 hrs and the data is used to perform certain operations in other parts of my app.
I'm not exactly sure how to go on from here, but I have a couple of questions.
It would be awesome if you could provide some help on this. Please do let me know in case you need more information.
Answer (score: 1)
You can optimize the updates by using the bulk API methods, which are very efficient as they allow you to send many update operations to the server within a single request (as a batch). Consider the following examples, which demonstrate this approach for different MongoDB versions:
Suppose your Node.js application polls the message data into a list. For Mongoose versions >=4.3.0, which support MongoDB Server 3.2.x, you can use bulkWrite() to update the collection:
var bulkUpdateCallback = function(err, r) {
        console.log(r.matchedCount);
        console.log(r.modifiedCount);
    },
    operations = []; // Initialise the bulk operations array

messages.forEach(function(msg) {
    operations.push({
        "updateOne": {
            "filter": { "_id": msg._id },
            "update": { "$set": { "value": msg.value } } // example update operation
        }
    });

    // Send to the server once every 500 operations only
    if (operations.length % 500 === 0) {
        // Get the underlying collection via the native node.js driver collection object
        Model.collection.bulkWrite(
            operations,
            { "ordered": true, w: 1 },
            bulkUpdateCallback
        );
        operations = []; // reset the batch
    }
});

// Flush any remaining operations that didn't fill a complete batch of 500
if (operations.length > 0) {
    Model.collection.bulkWrite(operations, { "ordered": true, w: 1 }, bulkUpdateCallback);
}
In the above, you initialise the operations array and limit the writes to batches of 500. The reason for choosing a value lower than the default batch limit of 1000 is generally a controlled choice. As noted in the documentation, by default MongoDB will send operations to the server in batches of at most 1000 at a time, and there is no guarantee that these default 1000-operation requests actually fit under the 16MB BSON limit. So you still need to be on the "safe" side and impose a smaller batch size that you can effectively manage, so that the total stays under the data limit in size when sent to the server.
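If you want a rough sanity check that a batch of 500 stays under that limit, you can measure the BSON size of a representative operation with the bson package's calculateObjectSize() (a minimal sketch; the sampleOp shape below is a hypothetical stand-in for your real update documents):

// Estimate how many operations of a representative shape fit under 16MB
var calculateObjectSize = require('bson').calculateObjectSize;

var sampleOp = { // hypothetical representative update operation
    "updateOne": {
        "filter": { "_id": "507f1f77bcf86cd799439011" },
        "update": { "$set": { "value": 12345.67 } }
    }
};

var opSize = calculateObjectSize(sampleOp),   // bytes per operation
    maxBsonSize = 16 * 1024 * 1024,           // 16MB BSON document limit
    ceiling = Math.floor(maxBsonSize / opSize);

console.log(opSize + " bytes per op, so up to ~" + ceiling +
    " ops per batch in theory; 500 leaves ample headroom");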
If you are using older Mongoose versions ~3.8.8, ~3.8.22, or 4.x, which support MongoDB Server >=2.6.x, you could use the Bulk() API as follows:
var bulk = Model.collection.initializeOrderedBulkOp(),
    bulkUpdateCallback = function(err, r) {
        console.log(r.nMatched);
        console.log(r.nModified);
    },
    counter = 0;

messages.forEach(function(msg) {
    bulk.find({ "_id": msg._id }).updateOne({
        "$set": { "value": msg.value }
    });

    counter++;
    if (counter % 500 === 0) {
        bulk.execute(bulkUpdateCallback); // do something with the result in the callback
        // Re-initialise synchronously so the next iteration doesn't add
        // operations to a batch that has already been executed
        bulk = Model.collection.initializeOrderedBulkOp();
    }
});

// Catch any remaining operations that didn't fill a complete batch of 500
if (counter % 500 !== 0) {
    bulk.execute(bulkUpdateCallback);
}
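Finally, to tie this back to your setup: each worker can apply one bulkWrite() per polled batch of 100 messages instead of 100 individual findOneAndUpdate() calls. A minimal sketch, assuming a hypothetical getQueueMessages() wrapper around your IronMQ client's get({ n: 100 }) call and a hypothetical acknowledgeMessages() that deletes the processed messages from the queue:

// One bulk write per polled batch instead of one round-trip per message
function processNextBatch() {
    getQueueMessages(100, function(err, messages) {
        if (err || !messages || messages.length === 0) {
            return setTimeout(processNextBatch, 10000); // poll again in 10s
        }

        var operations = messages.map(function(msg) {
            return {
                "updateOne": {
                    "filter": { "_id": msg._id },
                    "update": { "$set": { "value": msg.value } }
                }
            };
        });

        Model.collection.bulkWrite(operations, { "ordered": false }, function(err, r) {
            if (!err) acknowledgeMessages(messages); // remove processed messages from IronMQ
            processNextBatch(); // keep consuming
        });
    });
}

processNextBatch();

Note that "ordered": false lets the server continue past individual failed operations (and apply them in no guaranteed order), which typically improves throughput; use "ordered": true if a failure should stop the rest of the batch.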