MongoDB: Remove duplicates based on fields

Time: 2016-10-31 09:41:09

Tags: mongodb nosql

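I have a lot of duplicate entries in my Mongo database. Is there a quick way to remove these duplicates? I am interested in two different scenarios:

  1. Duplicate entries where every field is equal (except the ObjectID).

  2. Duplicate entries where only a subset of the fields is equal. In this case I would like to specify those fields and remove duplicates based on them.

What is the "mongoic" way of doing this?

An example entry is: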

    {
    "_id" : ObjectId("57294d7071f55974cdae318e"),
    "category" : "house",
    "city" : "Boston",
    "title" : "title here",
    "url" : "http://url.com",
    "text" : " some text here",
    "time" : ISODate("2016-05-03T23:49:00Z"),
    "user_online_since" : ISODate("2012-10-01T00:00:00Z"),
    "price_eur" : 85000
    }
    

1 answer:

Answer 0 (score: 1)

Here is a js script you can use to achieve this:

// collects the _ids of the documents to delete
var matchingId = [];

db.collectionName.aggregate([
   {
      // group stage: group documents by the selected fields;
      // this returns one document per unique combination of values
      $group:{
         _id:{
            category:"$category",
            city:"$city"
            // ...
            // add as many fields here as you need for the duplicate check
         },
         // number of documents that share the same values
         // for the selected fields
         count:{
            $sum:1
         },
         // _ids of the documents that share the same values
         // for the selected fields
         match:{
            $push:"$_id"
         }
      }
   },
   {
      // only keep groups where count > 1, i.e. groups that contain duplicates
      $match:{
         count:{
            $gt:1
         }
      }
   }],
   {
      // allow MongoDB to write temporary files to disk if the collection is too big
      allowDiskUse: true
   }
).forEach( function(doc) {
   doc.match.shift(); // drop the first _id so that one document per group is kept
   doc.match.forEach( function(duplicateId) {
      matchingId.push(duplicateId);
   });
});

// remove duplicate documents
db.collectionName.remove({_id: {$in: matchingId}})

Save it to a file named "script.js" and run it from the terminal:

mongo databaseName < script.js

You should try it on a test database first to make sure it behaves the way you want!
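If the list of fields changes often, the pipeline can be generated instead of edited by hand. The following is only a sketch: `buildDuplicatePipeline` and `fieldsExceptId` are hypothetical helper names, not part of MongoDB, and they assume plain top-level field names. `fieldsExceptId` is one possible way to approach scenario 1 (all fields equal except `_id`), under the assumption that every document has the same top-level fields as the sample document passed in.

```javascript
// Hypothetical helper: builds the two-stage pipeline used in the script
// for an arbitrary list of top-level field names.
function buildDuplicatePipeline(fields) {
  var groupId = {};
  fields.forEach(function (field) {
    groupId[field] = "$" + field; // e.g. { city: "$city" }
  });
  return [
    { $group: { _id: groupId, count: { $sum: 1 }, match: { $push: "$_id" } } },
    { $match: { count: { $gt: 1 } } }
  ];
}

// Hypothetical helper for scenario 1: every field except _id.
// Assumption: all documents share the sample document's top-level fields.
function fieldsExceptId(sampleDoc) {
  return Object.keys(sampleDoc).filter(function (key) {
    return key !== "_id";
  });
}

// In the shell the result could be passed to aggregate(), for example:
// db.collectionName.aggregate(
//    buildDuplicatePipeline(["category", "city"]),
//    { allowDiskUse: true }
// )
```

For scenario 1, something like `buildDuplicatePipeline(fieldsExceptId(db.collectionName.findOne()))` could serve as a starting point, again only if the documents really do share one schema.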

Edit: example

Suppose your collection looks like this

{
    "_id" : ObjectId("57294d7071f55974cdae318e"),
    "category" : "house",
    "city" : "Boston",
    "title" : "title here",
    "url" : "http://url.com",
    "text" : " some text here",
    "time" : ISODate("2016-05-03T23:49:00Z"),
    "user_online_since" : ISODate("2012-10-01T00:00:00Z"),
    "price_eur" : 85000
}
{
    "_id" : ObjectId("57294d7071f55974cdae318b"),
    "category" : "house",
    "city" : "NY",
    "title" : "title here",
    "url" : "http://url.com",
    "text" : " some text here",
    "time" : ISODate("2016-05-03T23:49:00Z"),
    "user_online_since" : ISODate("2012-10-01T00:00:00Z"),
    "price_eur" : 85000
}
{
    "_id" : ObjectId("57294d7071f55974cdae318f"),
    "category" : "house",
    "city" : "Boston",
    "title" : "title here",
    "url" : "http://url.com",
    "text" : " some text here",
    "time" : ISODate("2016-05-03T23:49:00Z"),
    "user_online_since" : ISODate("2012-10-01T00:00:00Z"),
    "price_eur" : 85000
}
{
    "_id" : ObjectId("57294d7071f55974cdae318c"),
    "category" : "house",
    "city" : "Boston",
    "title" : "title here",
    "url" : "http://url.com",
    "text" : " some text here",
    "time" : ISODate("2016-05-03T23:49:00Z"),
    "user_online_since" : ISODate("2012-10-01T00:00:00Z"),
    "price_eur" : 85000
}

The output of the aggregation query would be

{
    "_id" : {
        "category" : "house",
        "city" : "Boston"
    },
    "count" : 3,
    "match" : [
        ObjectId("57294d7071f55974cdae318e"),
        ObjectId("57294d7071f55974cdae318f"),
        ObjectId("57294d7071f55974cdae318c")
    ]
}

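So you iterate over the results and, for each result document, remove the first _id with match.shift() (because you need to keep one document out of each set of duplicates), then store the remaining _ids so that you can delete the corresponding documents.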
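The same keep-the-first logic can be tried without a database. This is an in-memory sketch only: `findDuplicateIds` is a hypothetical function, plain objects stand in for documents, and strings stand in for ObjectIds. It mirrors the group, match and shift steps of the script on the four example documents (reduced to the fields that matter here).

```javascript
// In-memory illustration of the script's logic; not a MongoDB API.
function findDuplicateIds(docs, fields) {
  var groups = {}; // grouping key -> list of _ids (mirrors $group with $push)
  docs.forEach(function (doc) {
    var key = JSON.stringify(fields.map(function (f) { return doc[f]; }));
    if (!groups[key]) {
      groups[key] = [];
    }
    groups[key].push(doc._id);
  });

  var toRemove = [];
  Object.keys(groups).forEach(function (key) {
    var ids = groups[key];
    if (ids.length > 1) {   // mirrors $match: { count: { $gt: 1 } }
      ids.shift();          // keep the first _id of the group
      ids.forEach(function (id) {
        toRemove.push(id);
      });
    }
  });
  return toRemove;
}

var docs = [
  { _id: "57294d7071f55974cdae318e", category: "house", city: "Boston" },
  { _id: "57294d7071f55974cdae318b", category: "house", city: "NY" },
  { _id: "57294d7071f55974cdae318f", category: "house", city: "Boston" },
  { _id: "57294d7071f55974cdae318c", category: "house", city: "Boston" }
];

var duplicateIds = findDuplicateIds(docs, ["category", "city"]);
// duplicateIds is ["57294d7071f55974cdae318f", "57294d7071f55974cdae318c"]
```

In this sketch the surviving document is simply the first one in the input array. In the real aggregation, which document comes first in `match` depends on the order in which the group stage receives the documents, so if it matters which copy is kept, a `$sort` stage before `$group` is worth considering.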

After running the script, the collection contains only these documents

{
    "_id" : ObjectId("57294d7071f55974cdae318e"),
    "category" : "house",
    "city" : "Boston",
    "title" : "title here",
    "url" : "http://url.com",
    "text" : " some text here",
    "time" : ISODate("2016-05-03T23:49:00Z"),
    "user_online_since" : ISODate("2012-10-01T00:00:00Z"),
    "price_eur" : 85000
}
{
    "_id" : ObjectId("57294d7071f55974cdae318b"),
    "category" : "house",
    "city" : "NY",
    "title" : "title here",
    "url" : "http://url.com",
    "text" : " some text here",
    "time" : ISODate("2016-05-03T23:49:00Z"),
    "user_online_since" : ISODate("2012-10-01T00:00:00Z"),
    "price_eur" : 85000
}
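One possible refinement for collections with a very large number of duplicates: the script sends every _id in a single `$in` array, and a query document is subject to MongoDB's BSON document size limit (16 MB). A sketch of splitting the list into batches follows; `chunk` is a hypothetical helper and the batch size of 1000 is an arbitrary assumption, not a recommended value.

```javascript
// Hypothetical helper: splits an array into consecutive batches of at most `size` items.
function chunk(ids, size) {
  var batches = [];
  for (var i = 0; i < ids.length; i += size) {
    batches.push(ids.slice(i, i + size));
  }
  return batches;
}

// In the shell, each batch could then be removed with its own call:
// chunk(matchingId, 1000).forEach(function (batch) {
//    db.collectionName.remove({ _id: { $in: batch } });
// });
```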