Question

My documents have the following structure:

{
    _id: ObjectId("59303aa1bad1081d4b98d636"),
    clear_number: "83490",
    items: [ 
        {
            name: "83490_1",
            file_id: "e7209bbb",
            hash: "2f568bb196f74263c64b7cf273f8ceaa",
        }, 
        {
            name: "83490_2",
            file_id: "9a56a935",
            hash: "9c6230f7bf19d3f3186c6c3231ac2055",
        }, 
        {
            name: "83490_2",
            file_id: "ce5f6773",
            hash: "9c6230f7bf19d3f3186c6c3231ac2055",
        }
    ],
    group_id: null
}

How can I remove one of the two subdocuments in items that share the same hash?

Answer 1

If I understand your question correctly, the following should do the trick:

collection.aggregate([{
    $unwind: "$items" // flatten the items array: one document per item
}, {
    $group: {
        "_id": { "_id": "$_id", "clear_number": "$clear_number", "group_id": "$group_id", "hash": "$items.hash" }, // within each document, group by hash value
        "items": { $first: "$items" } // keep only the first of all matching items per group
    }
}, {
    $group: {
        "_id": { "_id": "$_id._id", "clear_number": "$_id.clear_number", "group_id": "$_id.group_id" }, // now group everything again without the hashes
        "items": { $push: "$items" } // push all remaining single items back into the "items" array
    }
}, {
    $project: { // this is just to restore the original document layout
        "_id": "$_id._id",
        "clear_number": "$_id.clear_number",
        "group_id": "$_id.group_id",
        "items": "$items"
    }
}])
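
To make the effect of the stages above easier to follow, here is a minimal plain-JavaScript sketch of what they compute for a single document. It runs without a database. dedupeItemsByHash is a hypothetical helper name, not a MongoDB API, and the sketch assumes $first ends up picking the earliest item in array order.

```javascript
// Hypothetical helper (not part of MongoDB): returns a copy of the document
// whose items array keeps only the first item seen for each hash value.
function dedupeItemsByHash(doc) {
    const seen = new Set();
    const items = [];
    for (const item of doc.items) {
        if (seen.has(item.hash)) continue; // later duplicates are dropped, like $first per hash group
        seen.add(item.hash);
        items.push(item);
    }
    return { ...doc, items }; // the input document itself is left unchanged
}

// Sample data taken from the question (the _id field is omitted for brevity).
const sample = {
    clear_number: "83490",
    items: [
        { name: "83490_1", file_id: "e7209bbb", hash: "2f568bb196f74263c64b7cf273f8ceaa" },
        { name: "83490_2", file_id: "9a56a935", hash: "9c6230f7bf19d3f3186c6c3231ac2055" },
        { name: "83490_2", file_id: "ce5f6773", hash: "9c6230f7bf19d3f3186c6c3231ac2055" }
    ],
    group_id: null
};

const deduped = dedupeItemsByHash(sample);
console.log(deduped.items.map(item => item.file_id)); // [ 'e7209bbb', '9a56a935' ]
```

Like this sketch, the aggregation only returns deduplicated documents. It does not modify what is stored in the collection, so persisting the result would need a separate update step.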

In response to your comment, I'd suggest the following query to get a list of the IDs of all documents that contain duplicate hashes:

collection.aggregate([{
    $addFields: {
        "hashes": {
            $setUnion: [
                [ { $size: "$items.hash" } ], // total number of hashes
                [ { $size: { $setUnion: [ "$items.hash" ] } } ] // number of distinct hashes
            ]
        }
    }
}, {
    $match: {
        "hashes.1": { $exists: true } // find all documents with a different value for distinct vs total number of hashes
    }
}, {
    $project: { _id: 1 } // only return the _id field
}])

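There may be other ways to do this, but this one seems pretty straightforward:

Basically, in the $addFields stage, for each document we first create an array consisting of two numbers:

the total number of hashes
the number of distinct hashes

Then we run this array of two numbers through $setUnion. After this step there are

either two different numbers left in the array, in which case the hash field does contain duplicates,
or only one element left, in which case the number of distinct hashes equals the total number of hashes (so there are no duplicates).

We can check whether the array holds two items by testing whether an element at position 1 exists (arrays are zero-based!). That is what the $match stage does.

The final $project stage just limits the output to the _id field.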
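
The same check can be sketched in plain JavaScript, which may help when reasoning about the pipeline without a database at hand. hasDuplicateHashes and the two sample documents are hypothetical and only serve as illustration; the comments point to the pipeline expression each line mirrors.

```javascript
// Hypothetical helper (not part of MongoDB) that mirrors the $addFields + $match logic.
function hasDuplicateHashes(doc) {
    const hashes = doc.items.map(item => item.hash);
    const total = hashes.length;                     // { $size: "$items.hash" }
    const distinct = new Set(hashes).size;           // { $size: { $setUnion: "$items.hash" } }
    const counts = [...new Set([total, distinct])];  // outer $setUnion over the two counts
    return counts.length > 1;                        // "hashes.1": { $exists: true }
}

// Made-up sample documents: the first has a repeated hash, the second does not.
const docs = [
    { _id: 1, items: [{ hash: "a" }, { hash: "b" }, { hash: "b" }] },
    { _id: 2, items: [{ hash: "a" }, { hash: "b" }] }
];

const idsWithDuplicates = docs.filter(hasDuplicateHashes).map(doc => doc._id);
console.log(idsWithDuplicates); // [ 1 ]
```

Comparing total and distinct directly would be equivalent here; the detour through a set only mirrors the way the pipeline expresses the comparison.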

MongoDB: remove duplicate subdocuments in an array based on a specific field
