Question

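I am using a script to remove duplicates from a MongoDB collection. It worked on a test collection with 10 documents, but when I ran it against the real collection with 6 million documents, I got an error.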

This is the script I ran in Robomongo (now called Robo 3T):

var bulk = db.getCollection('RAW_COLLECTION').initializeOrderedBulkOp();
var count = 0;

db.getCollection('RAW_COLLECTION').aggregate([
  // Group on unique value storing _id values to array and count 
  { "$group": {
    "_id": { RegisterNumber: "$RegisterNumber", Region: "$Region" },
    "ids": { "$push": "$_id" },
    "count": { "$sum": 1 }      
  }},
  // Only return things that matched more than once. i.e a duplicate
  { "$match": { "count": { "$gt": 1 } } }
]).forEach(function(doc) {
  var keep = doc.ids.shift();     // takes the first _id from the array

  bulk.find({ "_id": { "$in": doc.ids }}).remove(); // remove all remaining _id matches
  count++;

  if ( count % 500 == 0 ) {  // only actually write per 500 operations
      bulk.execute();
      bulk = db.getCollection('RAW_COLLECTION').initializeOrderedBulkOp();  // re-init after execute
  }
});

// Clear any queued operations
if ( count % 500 != 0 )
    bulk.execute();

This is the error message:

Error: command failed: {
    "errmsg" : "exception: Exceeded memory limit for $group, but didn't allow external sort. Pass allowDiskUse:true to opt in.",
    "code" : 16945,
    "ok" : 0
} : aggregate failed :
_getErrorWithCode@src/mongo/shell/utils.js:23:13
doassert@src/mongo/shell/assert.js:13:14
assert.commandWorked@src/mongo/shell/assert.js:266:5
DBCollection.prototype.aggregate@src/mongo/shell/collection.js:1215:5
@(shell):1:1

So do I need to set allowDiskUse: true for this to work? Where in the script do I do that, and are there any problems with doing so?

Answer 1

{ allowDiskUse: true }

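should be passed as the second argument to aggregate(), right after the pipeline array.

In your code, that would look like this: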

db.getCollection('RAW_COLLECTION').aggregate([
  // Group on unique value storing _id values to array and count 
  { "$group": {
    "_id": { RegisterNumber: "$RegisterNumber", Region: "$Region" },
    "ids": { "$push": "$_id" },
    "count": { "$sum": 1 }      
  }},
  // Only return things that matched more than once. i.e a duplicate
  { "$match": { "count": { "$gt": 1 } } }
], { allowDiskUse: true } )
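
For completeness, here is a sketch of how the option might fit into the full de-duplication script from the question. The wrapper name `removeDuplicates` and its `batchSize` parameter are illustrative and not part of MongoDB; the collection methods are the same shell calls the question already uses.

```javascript
// Hypothetical wrapper (name and batchSize parameter are illustrative).
// `coll` is a shell collection object, e.g. db.getCollection('RAW_COLLECTION').
function removeDuplicates(coll, batchSize) {
  var bulk = coll.initializeOrderedBulkOp();
  var pending = 0;        // operations queued since the last execute()
  var duplicateGroups = 0; // number of key values that had duplicates

  coll.aggregate([
    // Group on the unique key, collecting the _id values and a count
    { "$group": {
      "_id": { RegisterNumber: "$RegisterNumber", Region: "$Region" },
      "ids": { "$push": "$_id" },
      "count": { "$sum": 1 }
    }},
    // Keep only keys that occur more than once
    { "$match": { "count": { "$gt": 1 } } }
  ], { allowDiskUse: true }).forEach(function (doc) {
    doc.ids.shift(); // drop the first _id from the list so that document is kept
    bulk.find({ "_id": { "$in": doc.ids } }).remove();
    pending++;
    duplicateGroups++;

    if (pending === batchSize) { // flush a full batch and start a new one
      bulk.execute();
      bulk = coll.initializeOrderedBulkOp();
      pending = 0;
    }
  });

  if (pending > 0) { // flush whatever is left
    bulk.execute();
  }
  return duplicateGroups;
}

// Shell usage (not executed here):
// removeDuplicates(db.getCollection('RAW_COLLECTION'), 500);
```

Tracking `pending` separately from the total avoids calling execute() on an empty bulk object when the number of duplicate groups is an exact multiple of the batch size.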

Answer 2

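Here is a simple undocumented trick that can help avoid using the disk in many cases.

You can use an intermediate $project stage to reduce the size of the records passed to a memory-limited stage such as $sort or, as in this case, $group.

In this example, that leads to: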

var bulk = db.getCollection('RAW_COLLECTION').initializeOrderedBulkOp();
var count = 0;

db.getCollection('RAW_COLLECTION').aggregate([
  // here is the important stage
  { "$project": { "_id": 1, "RegisterNumber": 1, "Region": 1 } }, // this will reduce the records size
  { "$group": {
    "_id": { RegisterNumber: "$RegisterNumber", Region: "$Region" },
    "ids": { "$push": "$_id" },
    "count": { "$sum": 1 }      
  }},
  { "$match": { "count": { "$gt": 1 } } }
]).forEach(function(doc) {
  var keep = doc.ids.shift();     // takes the first _id from the array

  bulk.find({ "_id": { "$in": doc.ids }}).remove(); // remove all remaining _id matches
  count++;

  if ( count % 500 == 0 ) {  // only actually write per 500 operations
      bulk.execute();
      bulk = db.getCollection('RAW_COLLECTION').initializeOrderedBulkOp();  // re-init after execute
  }
});

// Clear any queued operations
if ( count % 500 != 0 )
    bulk.execute();

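Note the first $project stage; it is there only to avoid using the disk.

This is especially useful for collections with large documents where most of the data is not used in the aggregation.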
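
The same idea can be written generically. The sketch below builds the projected pipeline from a list of key fields; `buildDuplicatePipeline` is a hypothetical helper name, and it assumes the keys are plain top-level field names (no dotted paths).

```javascript
// Hypothetical helper: builds a duplicate-finding pipeline that projects
// only _id and the key fields before grouping on them.
// Assumes each entry of keyFields is a top-level field name.
function buildDuplicatePipeline(keyFields) {
  var projection = { "_id": 1 };
  var groupKey = {};

  keyFields.forEach(function (field) {
    projection[field] = 1;          // keep the field in the slimmed-down record
    groupKey[field] = "$" + field;  // and group on its value
  });

  return [
    { "$project": projection },
    { "$group": {
      "_id": groupKey,
      "ids": { "$push": "$_id" },
      "count": { "$sum": 1 }
    }},
    { "$match": { "count": { "$gt": 1 } } }
  ];
}

var pipeline = buildDuplicatePipeline(["RegisterNumber", "Region"]);

// Shell usage (not executed here):
// db.getCollection('RAW_COLLECTION').aggregate(pipeline)
```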

Answer 3

From MongoDB Docs

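The $group stage has a limit of 100 megabytes of RAM. By default, if the stage exceeds this limit, $group will produce an error. However, to allow for the handling of large datasets, set the allowDiskUse option to true to enable $group operations to write to temporary files. See the db.collection.aggregate() method and the aggregate command for details.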
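
The quote mentions the aggregate command as well as the shell method. As a rough sketch, the command form of the question's pipeline might look like the document below. The variable name `aggregateCommand` is illustrative, and the `cursor` field is included on the assumption that your server version expects it (newer versions do).

```javascript
// Command document roughly equivalent to the aggregate() call in the question.
var aggregateCommand = {
  aggregate: "RAW_COLLECTION",   // collection name from the question
  pipeline: [
    { "$group": {
      "_id": { RegisterNumber: "$RegisterNumber", Region: "$Region" },
      "ids": { "$push": "$_id" },
      "count": { "$sum": 1 }
    }},
    { "$match": { "count": { "$gt": 1 } } }
  ],
  cursor: {},                    // assumption: required by newer server versions
  allowDiskUse: true             // lets $group spill to temporary files
};

// Shell usage (not executed here):
// var res = db.runCommand(aggregateCommand);
// Note that the command returns only a first batch of results plus a cursor id,
// so the aggregate() helper is usually more convenient for iterating.
```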

Answer 4

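When you have a large amount of data, it is better to use $match before $group. If you filter with $match before grouping, you are much less likely to run into this problem, because fewer documents reach the $group stage.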

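db.getCollection('sample').aggregate([
   // Filter first so that $group only sees the matching documents
   {$match:{State:'TAMIL NADU'}},
   {$group:{
       _id:{DiseCode:"$code", State:"$State"},
       totalCount:{$sum:1}
   }},
   {$project:{
       Code:"$_id.DiseCode",   // must reference the key name defined in $group
       totalCount:"$totalCount",
       _id:0
   }}
])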

If you still cannot get around the problem this way, then the solution is { allowDiskUse: true }
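
Applied to the original question, one way to use this advice might be to look for duplicates one Region at a time. Region is part of the duplicate key, so duplicates never span two regions. The helper name `duplicatePipelineForRegion` is hypothetical, and the sketch assumes every document has a Region value and that each region's groups fit within the memory limit.

```javascript
// Hypothetical helper: duplicate-finding pipeline restricted to one Region.
// Assumes every document has a Region value; documents without one would be
// skipped by the per-region loop sketched below.
function duplicatePipelineForRegion(region) {
  return [
    // Filter first so $group only holds state for a single region
    { "$match": { "Region": region } },
    { "$group": {
      "_id": { RegisterNumber: "$RegisterNumber", Region: "$Region" },
      "ids": { "$push": "$_id" },
      "count": { "$sum": 1 }
    }},
    { "$match": { "count": { "$gt": 1 } } }
  ];
}

// Shell usage (not executed here):
// var coll = db.getCollection('RAW_COLLECTION');
// coll.distinct('Region').forEach(function (region) {
//   coll.aggregate(duplicatePipelineForRegion(region)).forEach(function (doc) {
//     // same removal logic as in the question
//   });
// });
```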

Robomongo: Exceeded memory limit for $group
