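My Mongo database contains a lot of duplicate entries. Is there a quick way to remove them? I'm interested in two different scenarios:
Duplicate entries where every field is equal (except the ObjectID)
Duplicate entries where only a subset of the fields is equal. In this case, I want to specify those fields and remove duplicates based on them.
What is the "mongoic" way to do this?
A sample entry is: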
{
"_id" : ObjectId("57294d7071f55974cdae318e"),
"category" : "house",
"city" : "Boston",
"title" : "title here",
"url" : "http://url.com",
"text" : " some text here",
"time" : ISODate("2016-05-03T23:49:00Z"),
"user_online_since" : ISODate("2012-10-01T00:00:00Z"),
"price_eur" : 85000
}
Answer 0 (score: 1)
Here is a JS script you can use to achieve this:
var matchingId = [];
db.collectionName.aggregate([
    {
        // group stage: group documents by the selected fields;
        // this returns one document per unique combination of values
        $group: {
            _id: {
                category: "$category",
                city: "$city"
                // ...
                // add as many fields here as you want for the duplicate check
            },
            // counts the number of documents that have the same
            // values for the selected fields
            count: {
                $sum: 1
            },
            // stores the _id of the documents that have the same
            // values for the selected fields
            match: {
                $push: "$_id"
            }
        }
    },
    {
        // only keep groups where count > 1
        $match: {
            count: {
                $gt: 1
            }
        }
    }],
    {
        // allow MongoDB to write to disk if your collection is too big
        allowDiskUse: true
    }
).forEach(function(doc) {
    doc.match.shift(); // drop the first ObjectId (that document is kept)
    doc.match.forEach(function(duplicateId) {
        matchingId.push(duplicateId);
    });
});
// remove duplicate documents
db.collectionName.remove({_id: {$in: matchingId}})
To use it, save it in a file named "script.js" and run it from the terminal:
mongo databaseName < script.js
You should try it on a test database first to make sure it behaves the way you want!
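Since the question asks about specifying the fields to compare, here is a minimal sketch of building the same two-stage pipeline from a list of field names. The helper name buildDuplicatePipeline is hypothetical (it is not part of the script above), and the sketch assumes top-level field names only. It is plain JavaScript, so it can be checked outside the shell.

```javascript
// Hypothetical helper, not part of the original script: builds the
// $group / $match pipeline for an arbitrary list of top-level field names.
function buildDuplicatePipeline(fields) {
  var groupKey = {};
  fields.forEach(function (field) {
    groupKey[field] = "$" + field; // e.g. { city: "$city" }
  });
  return [
    // one output document per unique combination of the selected fields
    { $group: { _id: groupKey, count: { $sum: 1 }, match: { $push: "$_id" } } },
    // keep only the combinations that occur more than once
    { $match: { count: { $gt: 1 } } }
  ];
}

var pipeline = buildDuplicatePipeline(["category", "city"]);
console.log(JSON.stringify(pipeline, null, 2));
```

In the shell, the resulting array could be passed as the first argument of db.collectionName.aggregate in place of the hand-written stages. For the first scenario in the question, the list would presumably contain every field except _id.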
Edit: example
Suppose your collection looks like this:
{
"_id" : ObjectId("57294d7071f55974cdae318e"),
"category" : "house",
"city" : "Boston",
"title" : "title here",
"url" : "http://url.com",
"text" : " some text here",
"time" : ISODate("2016-05-03T23:49:00Z"),
"user_online_since" : ISODate("2012-10-01T00:00:00Z"),
"price_eur" : 85000
}
{
"_id" : ObjectId("57294d7071f55974cdae318b"),
"category" : "house",
"city" : "NY",
"title" : "title here",
"url" : "http://url.com",
"text" : " some text here",
"time" : ISODate("2016-05-03T23:49:00Z"),
"user_online_since" : ISODate("2012-10-01T00:00:00Z"),
"price_eur" : 85000
}
{
"_id" : ObjectId("57294d7071f55974cdae318f"),
"category" : "house",
"city" : "Boston",
"title" : "title here",
"url" : "http://url.com",
"text" : " some text here",
"time" : ISODate("2016-05-03T23:49:00Z"),
"user_online_since" : ISODate("2012-10-01T00:00:00Z"),
"price_eur" : 85000
}
{
"_id" : ObjectId("57294d7071f55974cdae318c"),
"category" : "house",
"city" : "Boston",
"title" : "title here",
"url" : "http://url.com",
"text" : " some text here",
"time" : ISODate("2016-05-03T23:49:00Z"),
"user_online_since" : ISODate("2012-10-01T00:00:00Z"),
"price_eur" : 85000
}
The output of the aggregation query would be:
{
"_id" : {
"category" : "house",
"city" : "Boston"
},
"count" : 3,
"match" : [
ObjectId("57294d7071f55974cdae318e"),
ObjectId("57294d7071f55974cdae318f"),
ObjectId("57294d7071f55974cdae318c")
]
}
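So you iterate over the results and, for each result document, drop the first _id with match.shift() (because you need to keep one document out of each set of duplicates). You then store the remaining _ids so that you can delete the corresponding documents.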
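The keep-one step can be sketched in plain JavaScript on an in-memory copy of the aggregation output. The ids are plain strings here as a simplifying assumption (in the shell they would be ObjectId values), and collectIdsToRemove is a hypothetical name. The sketch uses slice(1) instead of shift(), which skips the first id without modifying the result document.

```javascript
// In-memory stand-in for the aggregation output shown above
// (assumption: ids as strings instead of ObjectId values).
var groups = [
  {
    _id: { category: "house", city: "Boston" },
    count: 3,
    match: [
      "57294d7071f55974cdae318e",
      "57294d7071f55974cdae318f",
      "57294d7071f55974cdae318c"
    ]
  }
];

// Hypothetical helper: collects every id except the first one of each group.
function collectIdsToRemove(groups) {
  var idsToRemove = [];
  groups.forEach(function (group) {
    // slice(1) skips the first id (the document that is kept)
    group.match.slice(1).forEach(function (id) {
      idsToRemove.push(id);
    });
  });
  return idsToRemove;
}

var idsToRemove = collectIdsToRemove(groups);
console.log(idsToRemove); // the two ids ending in 318f and 318c
```

Which document ends up first in match depends on the order in which documents reach the $group stage, so if it matters which copy survives, sorting before grouping would be worth considering.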
After running the script, the collection contains only these documents:
{
"_id" : ObjectId("57294d7071f55974cdae318e"),
"category" : "house",
"city" : "Boston",
"title" : "title here",
"url" : "http://url.com",
"text" : " some text here",
"time" : ISODate("2016-05-03T23:49:00Z"),
"user_online_since" : ISODate("2012-10-01T00:00:00Z"),
"price_eur" : 85000
}
{
"_id" : ObjectId("57294d7071f55974cdae318b"),
"category" : "house",
"city" : "NY",
"title" : "title here",
"url" : "http://url.com",
"text" : " some text here",
"time" : ISODate("2016-05-03T23:49:00Z"),
"user_online_since" : ISODate("2012-10-01T00:00:00Z"),
"price_eur" : 85000
}
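As a sanity check on this example, the same result can be reproduced without a database. The sketch below is plain JavaScript on trimmed copies of the four sample documents. The helper dedupeByFields is hypothetical and only mimics the effect of the script (keep the first document per key), not how MongoDB executes it.

```javascript
// Trimmed copies of the four sample documents (only the fields used here;
// ids are strings rather than ObjectId values).
var docs = [
  { _id: "57294d7071f55974cdae318e", category: "house", city: "Boston" },
  { _id: "57294d7071f55974cdae318b", category: "house", city: "NY" },
  { _id: "57294d7071f55974cdae318f", category: "house", city: "Boston" },
  { _id: "57294d7071f55974cdae318c", category: "house", city: "Boston" }
];

// Hypothetical helper: keeps the first document seen for each key.
function dedupeByFields(docs, fields) {
  var seen = {};
  return docs.filter(function (doc) {
    // build a string key from the values of the selected fields
    var key = JSON.stringify(fields.map(function (f) { return doc[f]; }));
    if (seen[key]) {
      return false; // a document with this key was already kept
    }
    seen[key] = true;
    return true;
  });
}

var remaining = dedupeByFields(docs, ["category", "city"]);
// Two documents remain: the first Boston entry (318e) and the NY entry (318b).
console.log(remaining.map(function (d) { return d._id; }));
```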