Question

My data looks like this:

{

    "foo_list": [
      {
        "id": "98aa4987-d812-4aba-ac20-92d1079f87b2",
        "name": "Foo 1",
        "slug": "foo-1"
      },
      {
        "id": "98aa4987-d812-4aba-ac20-92d1079f87b2",
        "name": "Foo 1",
        "slug": "foo-1"
      },
      {
        "id": "157569ec-abab-4bfb-b732-55e9c8f4a57d",
        "name": "Foo 3",
        "slug": "foo-3"
      }
    ]
}

foo_list is a field in a model called Bar. Notice that the first and second objects in the array are complete duplicates.

Aside from the obvious solution of switching to PostgreSQL, what MongoDB query can I run to remove the duplicate entries in foo_list?

Similar answers do not quite cut it. Those questions would answer this one if the array held only strings, but in my case the array is filled with objects.

I hope it is clear that I am not interested in just querying the database; I want the duplicates to be gone from the database forever.

Answer 1

Purely from an aggregation framework point of view, there are a few approaches.

In modern releases you can simply apply $setUnion:

 db.collection.aggregate([
     { "$project": { 
         "foo_list": { "$setUnion": [ "$foo_list", "$foo_list" ] }
     }}
 ])

Or, more traditionally, use $unwind and $addToSet:

db.collection.aggregate([
    { "$unwind": "$foo_list" },
    { "$group": {
        "_id": "$_id",
        "foo_list": { "$addToSet": "$foo_list" }
    }}
])
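Neither of the two pipelines above changes what is stored; they only return the de-duplicated arrays. If you want to write those arrays back, here is a minimal sketch. It assumes MongoDB 3.2 or later, where db.collection.bulkWrite() is available, and buildDedupeWrites is a hypothetical helper name, not part of the original answer. Keep in mind that $setUnion does not guarantee element order, and that overwriting the whole array could clobber a concurrent change to foo_list.

```javascript
// Hypothetical helper (an assumption, not from the original answer):
// turns aggregation output of the form { _id, foo_list } into the
// operation documents that db.collection.bulkWrite() accepts.
function buildDedupeWrites(results) {
    return results.map(function(doc) {
        return {
            updateOne: {
                filter: { _id: doc._id },
                // Overwrite the array with its de-duplicated version
                update: { $set: { foo_list: doc.foo_list } }
            }
        };
    });
}

// Stand-in for what db.collection.aggregate([...]).toArray() would return
var sampleResults = [
    { _id: 1, foo_list: [ { id: "a", name: "Foo 1", slug: "foo-1" } ] }
];

var ops = buildDedupeWrites(sampleResults);
console.log(JSON.stringify(ops, null, 2));

// In the shell, against a real collection, this would be roughly:
// db.collection.bulkWrite(buildDedupeWrites(db.collection.aggregate(pipeline).toArray()));
```

The helper only builds plain objects, so you can inspect the operations before sending anything to the server.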

Or, if you are only interested in the duplicates themselves, use general grouping:

db.collection.aggregate([
    { "$unwind": "$foo_list" },
    { "$group": {
        "_id": {
            "_id": "$_id",
            "foo_list": "$foo_list"
        },
        "count": { "$sum": 1 }
    }},
    { "$match": { "count": { "$ne": 1 } } },
    { "$group": {
        "_id": "$_id._id",
        "foo_list": { "$push": "$_id.foo_list" }
    }}
])

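The last form is useful if you actually want to "remove" the duplicates from your data with another update statement, since it identifies the elements that are duplicated.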

So with that last form, the result returned for your sample data identifies the duplicate:

{
    "_id" : ObjectId("53f5f7314ffa9b02cf01c076"),
    "foo_list" : [
            {
                    "id" : "98aa4987-d812-4aba-ac20-92d1079f87b2",
                    "name" : "Foo 1",
                    "slug" : "foo-1"
            }
    ]
}
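To make the logic of that pipeline more concrete, here is a rough in-memory equivalent in plain JavaScript. It is only an illustration, not how the server implements it: findDuplicates is a made-up name, and comparing elements via JSON.stringify is an assumption that treats objects as equal only when their fields and field order match.

```javascript
// Illustration only: mimics $unwind + $group (count) + $match + $group
// on an in-memory array of documents.
function findDuplicates(docs) {
    var out = [];
    docs.forEach(function(doc) {
        // Count how often each array element occurs within this document
        var counts = new Map();
        doc.foo_list.forEach(function(item) {
            var key = JSON.stringify(item);
            var entry = counts.get(key) || { value: item, count: 0 };
            entry.count += 1;
            counts.set(key, entry);
        });
        // Keep only elements whose count is not 1, like the $match stage
        var dups = [];
        counts.forEach(function(entry) {
            if (entry.count !== 1) dups.push(entry.value);
        });
        // Documents without duplicates produce no result at all
        if (dups.length > 0) out.push({ _id: doc._id, foo_list: dups });
    });
    return out;
}

var docs = [{
    _id: "doc1",  // placeholder _id for the sample
    foo_list: [
        { id: "98aa4987-d812-4aba-ac20-92d1079f87b2", name: "Foo 1", slug: "foo-1" },
        { id: "98aa4987-d812-4aba-ac20-92d1079f87b2", name: "Foo 1", slug: "foo-1" },
        { id: "157569ec-abab-4bfb-b732-55e9c8f4a57d", name: "Foo 3", slug: "foo-3" }
    ]
}];

var duplicates = findDuplicates(docs);
console.log(JSON.stringify(duplicates, null, 2));
```

Each duplicated value is listed once per document, no matter how many copies of it the array holds.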

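One result is returned for each document in your collection that contains duplicate entries in the array, along with which entries are duplicated. That is the information you need for the update: loop over the results and issue updates built from them to remove the duplicates.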

This is actually done with two update statements per document, because a simple $pull operation would remove "both" items, which is not what you want:

var cursor = db.collection.aggregate([
    { "$unwind": "$foo_list" },
    { "$group": {
        "_id": {
            "_id": "$_id",
            "foo_list": "$foo_list"
        },
        "count": { "$sum": 1 }
    }},
    { "$match": { "count": { "$ne": 1 } } },
    { "$group": {
        "_id": "$_id._id",
        "foo_list": { "$push": "$_id.foo_list" }
    }}
])    

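// Queue the updates in an ordered bulk operation so each $unset runs before its $pull
var batch = db.collection.initializeOrderedBulkOp();
var count = 0;

cursor.forEach(function(doc) {
    doc.foo_list.forEach(function(dup) {
        // Set the first array element matching the duplicate to null
        batch.find({ "_id": doc._id, "foo_list": { "$elemMatch": dup } }).updateOne({
            "$unset": { "foo_list.$": "" }
        });
        // Then remove that null placeholder from the array
        batch.find({ "_id": doc._id }).updateOne({ 
            "$pull": { "foo_list": null }
        });
    });

    count++;
    // Send the queued updates to the server every 500 documents
    if ( count % 500 == 0 ) {
        batch.execute();
        batch = db.collection.initializeOrderedBulkOp();
    }
});

// Flush whatever is left in the final partial batch
if ( count % 500 != 0 )
    batch.execute();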

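That is the modern way to do it in MongoDB 2.6 and above, with a cursor result from the aggregation and Bulk operations for the updates. But the principles remain the same: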

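Identify the duplicates in the documents
Loop over the results to issue the updates to the affected documents
Use $unset with the positional $ operator to set the "first" matched array element to null
Use $pull to remove the null entry from the array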
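The last two steps can be sketched on a plain in-memory array to show why one copy survives. This is only an illustration under simplifying assumptions: unsetFirstMatch and pullNulls are made-up names, and elements are compared by exact JSON.stringify equality, which is stricter than what $elemMatch does on the server.

```javascript
// Illustration only: mimics { "$unset": { "foo_list.$": "" } } by
// replacing the FIRST element equal to dup with null.
function unsetFirstMatch(list, dup) {
    var target = JSON.stringify(dup);
    var copy = list.slice();
    for (var i = 0; i < copy.length; i++) {
        if (copy[i] !== null && JSON.stringify(copy[i]) === target) {
            copy[i] = null;
            break;  // only the first match is touched
        }
    }
    return copy;
}

// Illustration only: mimics { "$pull": { "foo_list": null } }
function pullNulls(list) {
    return list.filter(function(item) { return item !== null; });
}

var dup = { id: "98aa4987-d812-4aba-ac20-92d1079f87b2", name: "Foo 1", slug: "foo-1" };
var fooList = [
    { id: "98aa4987-d812-4aba-ac20-92d1079f87b2", name: "Foo 1", slug: "foo-1" },
    { id: "98aa4987-d812-4aba-ac20-92d1079f87b2", name: "Foo 1", slug: "foo-1" },
    { id: "157569ec-abab-4bfb-b732-55e9c8f4a57d", name: "Foo 3", slug: "foo-3" }
];

var cleaned = pullNulls(unsetFirstMatch(fooList, dup));
console.log(JSON.stringify(cleaned, null, 2));
```

Note that each pass removes only one copy per duplicated value. As far as I can tell, if an element appears three or more times you would need to repeat the whole process until the aggregation returns no more results.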

So after processing the above operations, your sample now looks like this:

{
    "_id" : ObjectId("53f5f7314ffa9b02cf01c076"),
    "foo_list" : [
            {
                    "id" : "98aa4987-d812-4aba-ac20-92d1079f87b2",
                    "name" : "Foo 1",
                    "slug" : "foo-1"
            },
            {
                    "id" : "157569ec-abab-4bfb-b732-55e9c8f4a57d",
                    "name" : "Foo 3",
                    "slug" : "foo-3"
            }
    ]
}

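The duplicate is removed while one copy of the "duplicated" item is left intact. That is how you identify and remove the duplicate data from your collection.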

How do I remove duplicate objects from a MongoDB array?
