How to limit matches to unique values in a MongoDB aggregation

Time: 2016-09-16 09:27:49

Tags: mongodb mongodb-query aggregation-framework

Sample dataset:

{
    "source": "http://adress.com/",
    "date": ISODate("2016-08-31T08:41:00.000Z"),
    "author": "Some Guy",
    "thread": NumberInt(115265),
    "commentID": NumberInt(2693454),
    "title": ["A", "title", "for", "a", "comment"],
    "comment": ["This", "is", "a", "comment", "with", "a", "duplicate"]
}

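The dataset I am working with consists of user comments, each with a unique commentID. The comment itself is an array of words. I have managed to unwind the array, match a buzzword, and retrieve all matches.

My problem now is getting rid of the duplicates that occur when the buzzword appears more than once in a comment. I think I have to use a $group stage, but I can't find a way to do it.

The current pipeline is: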

[
    {"$unwind": "$comment"},
    {"$match": {"comment": buzzword } }
]

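This works well. But if I search for the buzzword "a", the example above returns the comment twice, because the word "a" appears twice.

What I need is the JSON for a pipeline that drops all duplicates after the first.

2 answers:

Answer 0: (score: 2)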
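To make the duplication concrete, here is a minimal plain-JavaScript sketch that models the two stages on the sample document without a MongoDB connection. The helpers `unwind` and `matchField` are hypothetical names made up for illustration and are not MongoDB APIs; they only approximate what $unwind and $match do for this simple case.

```javascript
// Plain-JavaScript model of the pipeline; no MongoDB connection is needed.
// `unwind` and `matchField` are hypothetical helpers, not MongoDB APIs.
const doc = {
    commentID: 2693454,
    comment: ["This", "is", "a", "comment", "with", "a", "duplicate"]
};

// Models {"$unwind": "$comment"}: one output document per array element.
function unwind(docs, field) {
    return docs.flatMap(d => d[field].map(value => ({ ...d, [field]: value })));
}

// Models {"$match": {"comment": buzzword}} applied to the unwound documents.
function matchField(docs, field, value) {
    return docs.filter(d => d[field] === value);
}

const buzzword = "a";
const hits = matchField(unwind([doc], "comment"), "comment", buzzword);

// The same comment is reported once per occurrence of the buzzword.
console.log(hits.length); // 2
```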

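You can run a single pipeline without $unwind that uses the array operators $arrayElemAt and $filter. $filter returns only the array elements that match the buzzword, and $arrayElemAt then returns the first element of that filtered array.

Follow this example to get the desired result: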

db.collection.aggregate([
    { "$match": { "comment": buzzword } },
    {
        "$project": {
            "source": 1,
            "date": 1,
            "author": 1,
            "thread": 1,
            "commentID": 1,
            "title": 1,
            "comment": 1,
            "distinct_matched_comment": {
                "$arrayElemAt": [ 
                    {
                        "$filter": {
                            "input": "$comment",
                            "as": "word",
                            "cond": {
                                "$eq": ["$$word", buzzword]
                            }
                        }
                    }, 0
                ]
            }
        }
    }
])
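The pipeline above assumes a variable named buzzword is already defined in the shell. One way to make that explicit is to wrap the pipeline in a function. This is only a sketch: `buildBuzzwordPipeline` is a hypothetical helper name, and the field list is copied from the sample document.

```javascript
// Hypothetical helper: builds the aggregation pipeline for a given buzzword.
function buildBuzzwordPipeline(buzzword) {
    return [
        // Keep only documents whose comment array contains the buzzword.
        { "$match": { "comment": buzzword } },
        {
            "$project": {
                "source": 1, "date": 1, "author": 1, "thread": 1,
                "commentID": 1, "title": 1, "comment": 1,
                // First element of the array of words equal to the buzzword.
                "distinct_matched_comment": {
                    "$arrayElemAt": [
                        {
                            "$filter": {
                                "input": "$comment",
                                "as": "word",
                                "cond": { "$eq": ["$$word", buzzword] }
                            }
                        },
                        0
                    ]
                }
            }
        }
    ];
}

const pipeline = buildBuzzwordPipeline("a");
console.log(JSON.stringify(pipeline, null, 2));
```

In the mongo shell the result would be passed straight to the collection, for example db.collection.aggregate(buildBuzzwordPipeline("a")).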

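Explanation

In the above pipeline, the trick is to first filter the comment array by selecting only the elements that satisfy a given condition. To demonstrate the concept, run this pipeline: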

db.collection.aggregate([
    {
        "$project": {
            "filtered_comment": {
                "$filter": {
                    "input": ["This", "is", "a", "comment", "with", "a", "duplicate"], /* hardcoded input array for demo */
                    "as": "word", /* The variable name for the element in the input array. 
                                     The as expression accesses each element in the input array by this variable.*/
                    "cond": { /* this condition determines whether to include the element in the resulting array. */
                        "$eq": ["$$word", "a"] /* condition where the variable equals the buzzword "a" */
                    }
                }
            }
        }
    }
])

Output

{
    "_id" : ObjectId("57dbd747be80cdcab63703dc"),
    "filtered_comment" : [ 
        "a", 
        "a"
    ]
}

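Since the input parameter of $filter accepts any expression that resolves to an array, you can use an array field instead of the hardcoded array.

Taking the result above further, we can show how the $arrayElemAt operator works: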

db.collection.aggregate([
    {
        "$project": {
            "distinct_matched_comment": {
                "$arrayElemAt": [ 
                    ["a", "a"], /* array produced by the above $filter expression */
                    0 /* the index position of the element we want to return, here being the first */
                ]   
            }
        }
    }
])

Output

{
    "_id" : ObjectId("57dbd747be80cdcab63703dc"),
    "distinct_matched_comment": "a"
}

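The <array> expression in the $arrayElemAt operator

{ "$arrayElemAt": [ <array>, <idx> ] }

can be any valid expression as long as it resolves to an array. Because the $filter expression from the beginning of this example returns an array, you can use it as the array expression, and your final pipeline will look like this: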

db.collection.aggregate([
    {
        "$project": {
            "distinct_matched_comment": {
                "$arrayElemAt": [ 
                    {  /* expression that produces an array with elements that match a condition */
                        "$filter": {
                            "input": "$comment",
                            "as": "word",
                            "cond": {
                                "$eq": ["$$word", buzzword]
                            }
                        }
                    },                  
                    0 /* the index position of the element we want to return, here being the first */
                ]
            }
        }
    }
])
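As a rough mental model, the combined expression behaves like the plain JavaScript below. This is an illustration only: `firstMatch` is a hypothetical function, and JavaScript's undefined stands in for MongoDB's behaviour of returning no value when the index is out of bounds.

```javascript
// Plain-JavaScript model of $arrayElemAt applied to the result of $filter.
// `firstMatch` is a hypothetical name used only for this illustration.
const comment = ["This", "is", "a", "comment", "with", "a", "duplicate"];

function firstMatch(words, buzzword) {
    // Models $filter: keep only the elements equal to the buzzword.
    const filtered = words.filter(word => word === buzzword);
    // Models $arrayElemAt with index 0: take the first element, if any.
    return filtered[0];
}

console.log(firstMatch(comment, "a"));       // "a"
console.log(firstMatch(comment, "missing")); // undefined
```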

Answer 1: (score: 1)

One possible solution is to use $group, like this:

...
{ $unwind: "$comment"},
{ $match: {"comment": buzzword } },
{
    $group: {
        _id : "$_id",
        source: { $first: "$source" },
        date: { $first: "$date" },
        author: { $first: "$author" },
        thread: { $first: "$thread" },
        commentID: { $first: "$commentID" },
        title: { $first: "$title" }
    } 
}
...
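The effect of this $group stage can be modelled in plain JavaScript as shown below. It is only a sketch under assumptions: `groupById` is a hypothetical helper, the two input documents stand for what $unwind and $match would emit for the sample comment, and the _id value is a made-up placeholder.

```javascript
// Two documents as they would leave $unwind + $match for the buzzword "a".
// The _id value is a made-up placeholder.
const matched = [
    { _id: "doc1", author: "Some Guy", commentID: 2693454, comment: "a" },
    { _id: "doc1", author: "Some Guy", commentID: 2693454, comment: "a" }
];

// Hypothetical helper modelling $group on _id with $first for every field.
function groupById(docs) {
    const groups = new Map();
    for (const d of docs) {
        if (!groups.has(d._id)) {
            // The $group above does not carry the comment field, so drop it.
            const { comment, ...rest } = d;
            groups.set(d._id, rest);
        }
    }
    return [...groups.values()];
}

const grouped = groupById(matched);
console.log(grouped.length); // 1
```

Note that, as in the $group stage above, the matched word itself is not part of the output unless you add a field for it.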

Another way is to use $project before unwinding the array to get rid of the duplicate words, like this:

...
{
    $project: {             
        source: 1,
        date: 1,
        author: 1,
        thread: 1,
        commentID: 1,
        title: 1,
        comment: { $setUnion: ["$comment"] }
    }
},
{$unwind: "$comment"},
{$match: {"comment": buzzword } }
...
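With a single array argument, $setUnion simply removes the duplicate entries. A plain-JavaScript Set gives a comparable result and shows why only one match is left afterwards. One difference to keep in mind: a Set preserves insertion order, whereas MongoDB does not specify the order of the elements $setUnion returns.

```javascript
// Plain-JavaScript model of { $setUnion: ["$comment"] } followed by $unwind and $match.
const comment = ["This", "is", "a", "comment", "with", "a", "duplicate"];

// Models $setUnion with one array: duplicates are removed.
const uniqueWords = [...new Set(comment)];

// Models $unwind + $match: each remaining word is checked once.
const hits = uniqueWords.filter(word => word === "a");

console.log(uniqueWords.length); // 6
console.log(hits.length);        // 1
```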

Update due to comments:

To preserve the comment array, you can project the array to another field and unwind that field instead, like this:

...
{
    $project: {             
        source: 1,
        date: 1,
        author: 1,
        thread: 1,
        commentID: 1,
        title: 1,
        comment: 1,
        commentWord: { $setUnion: ["$comment"] }
    }
},
{$unwind: "$commentWord"},
{$match: {"commentWord": buzzword } }
...
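A small plain-JavaScript sketch of this variant shows that the original comment array survives while the deduplicated copy is unwound. It models the three stages on the sample document only and does not talk to MongoDB.

```javascript
// Plain-JavaScript model of projecting commentWord, unwinding it, and matching.
const doc = {
    commentID: 2693454,
    comment: ["This", "is", "a", "comment", "with", "a", "duplicate"]
};

// Models the $project stage: keep comment, add a deduplicated commentWord array.
const projected = { ...doc, commentWord: [...new Set(doc.comment)] };

// Models {$unwind: "$commentWord"}: one document per unique word.
const unwound = projected.commentWord.map(word => ({ ...projected, commentWord: word }));

// Models {$match: {"commentWord": buzzword}} for the buzzword "a".
const hits = unwound.filter(d => d.commentWord === "a");

console.log(hits.length);            // 1
console.log(hits[0].comment.length); // 7
```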

Hope that helps.