Sample document:
{
"source": "http://adress.com/",
"date": ISODate("2016-08-31T08:41:00.000Z"),
"author": "Some Guy",
"thread": NumberInt(115265),
"commentID": NumberInt(2693454),
"title": ["A", "title", "for", "a", "comment"],
"comment": ["This", "is", "a", "comment", "with", "a", "duplicate"]
}
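The dataset I am working with consists of user comments, each with a unique commentID. The comment itself is stored as an array of words. I have managed to unwind the array, match a buzzword, and get back all occurrences.
My problem now is getting rid of the duplicates that appear when a buzzword occurs more than once in a comment. I assume I need a $group stage, but I cannot find a way to do it.
The current pipeline is: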
[
{"$unwind": "$comment"},
{"$match": {"comment": buzzword } }
]
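This works well. However, if I search for the buzzword "a", the example above returns the comment twice, because the word "a" occurs twice in it.
What I need is the JSON for a pipeline that drops every duplicate after the first.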
Answer 0 (score: 2)
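You can run a single pipeline without $unwind that uses the array operators $arrayElemAt and $filter. $filter returns only the elements of the comment array that match the buzzword, and $arrayElemAt returns the first element of that filtered array.
The following example gives the desired result: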
db.collection.aggregate([
{ "$match": { "comment": buzzword } },
{
"$project": {
"source": 1,
"date": 1,
"author": 1,
"thread": 1,
"commentID": 1,
"title": 1,
"comment": 1,
"distinct_matched_comment": {
"$arrayElemAt": [
{
"$filter": {
"input": "$comment",
"as": "word",
"cond": {
"$eq": ["$$word", buzzword]
}
}
}, 0
]
}
}
}
])
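Explanation
The trick in the pipeline above is to first filter the comment array, keeping only the elements that satisfy a given condition. To demonstrate the concept, run this pipeline: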
db.collection.aggregate([
{
"$project": {
"filtered_comment": {
"$filter": {
"input": ["This", "is", "a", "comment", "with", "a", "duplicate"], /* hardcoded input array for demo */
"as": "word", /* The variable name for the element in the input array.
The as expression accesses each element in the input array by this variable.*/
"cond": { /* this condition determines whether to include the element in the resulting array. */
"$eq": ["$$word", "a"] /* condition where the variable equals the buzzword "a" */
}
}
}
}
}
])
Output
{
"_id" : ObjectId("57dbd747be80cdcab63703dc"),
"filtered_comment" : [
"a",
"a"
]
}
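Since the input parameter of $filter accepts any expression that resolves to an array, you can pass an array field such as "$comment" instead of the hardcoded array.
Taking the result above one step further, we can show how the $arrayElemAt operator works: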
db.collection.aggregate([
{
"$project": {
"distinct_matched_comment": {
"$arrayElemAt": [
["a", "a"], /* array produced by the above $filter expression */
0 /* the index position of the element we want to return, here being the first */
]
}
}
}
])
Output
{
"_id" : ObjectId("57dbd747be80cdcab63703dc"),
"distinct_matched_comment": "a"
}
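For intuition, the same two steps can be mimicked in plain JavaScript, outside MongoDB. This is only a sketch: filterWords and firstMatch are hypothetical helper names, with Array.prototype.filter standing in for $filter and array indexing standing in for $arrayElemAt.

```javascript
// Plain-JavaScript analogue of $filter followed by $arrayElemAt (not MongoDB code).
// filterWords and firstMatch are hypothetical names used only for illustration.
const comment = ["This", "is", "a", "comment", "with", "a", "duplicate"];

// Analogue of $filter: keep every element equal to the buzzword.
function filterWords(words, buzzword) {
  return words.filter((word) => word === buzzword);
}

// Analogue of $arrayElemAt with index 0: take the first match.
// In JavaScript, indexing an empty array yields undefined.
function firstMatch(words, buzzword) {
  return filterWords(words, buzzword)[0];
}

console.log(filterWords(comment, "a")); // both occurrences of "a"
console.log(firstMatch(comment, "a"));  // a single "a"
```

However many times the buzzword occurs, taking index 0 of the filtered array leaves exactly one value per document.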
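Since the array argument of the $arrayElemAt operator
{ "$arrayElemAt": [ <array>, <idx> ] }
can be any valid expression as long as it resolves to an array, you can plug in the $filter expression from the beginning of this example, because it returns an array. Your final pipeline will then look like this: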
db.collection.aggregate([
{
"$project": {
"distinct_matched_comment": {
"$arrayElemAt": [
{ /* expression that produces an array with elements that match a condition */
"$filter": {
"input": "$comment",
"as": "word",
"cond": {
"$eq": ["$$word", buzzword]
}
}
},
0 /* the index position of the element we want to return, here being the first */
]
}
}
}
])
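Because the buzzword changes from search to search, it may be convenient to build the pipeline from a function. The following is a minimal sketch, assuming a hypothetical helper named buildBuzzwordPipeline. It keeps the leading $match stage from the first pipeline in this answer, so that only documents containing the buzzword pass through, and projects just commentID and comment for brevity.

```javascript
// Hypothetical helper (not part of MongoDB): builds the aggregation pipeline
// for a given buzzword as a plain array of stage objects.
function buildBuzzwordPipeline(buzzword) {
  return [
    // Matching an array field against a scalar selects documents whose array
    // contains that value; each matching document is returned once.
    { "$match": { "comment": buzzword } },
    {
      "$project": {
        "commentID": 1,
        "comment": 1,
        "distinct_matched_comment": {
          "$arrayElemAt": [
            {
              "$filter": {
                "input": "$comment",
                "as": "word",
                "cond": { "$eq": ["$$word", buzzword] }
              }
            },
            0 // first element of the filtered array
          ]
        }
      }
    }
  ];
}

const pipeline = buildBuzzwordPipeline("a");
console.log(JSON.stringify(pipeline, null, 2));
// In the mongo shell the result could be passed on as:
// db.collection.aggregate(pipeline)
```

Note that the $match stage alone already returns each document at most once; the $project stage only adds the matched word as a separate field.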
Answer 1 (score: 1)
One possible solution is to use $group, like this:
...
{ $unwind: "$comment"},
{ $match: {"comment": buzzword } },
{
$group: {
_id : "$_id",
source: { $first: "$source" },
date: { $first: "$date" },
author: { $first: "$author" },
thread: { $first: "$thread" },
commentID: { $first: "$commentID" },
title: { $first: "$title" }
}
}
...
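To see why grouping on _id collapses the duplicates, here is a rough plain-JavaScript imitation of the three stages on in-memory data. It is a sketch only, not MongoDB code: the second document and the reduced field list are made up for illustration.

```javascript
// In-memory imitation of $unwind -> $match -> $group (illustration only).
// The second document is hypothetical sample data.
const docs = [
  { _id: 1, author: "Some Guy", comment: ["This", "is", "a", "comment", "with", "a", "duplicate"] },
  { _id: 2, author: "Other Guy", comment: ["no", "match", "here"] }
];
const buzzword = "a";

// $unwind: one output document per array element.
const unwound = docs.flatMap((doc) =>
  doc.comment.map((word) => ({ ...doc, comment: word }))
);

// $match: keep only the documents whose word equals the buzzword.
const matched = unwound.filter((doc) => doc.comment === buzzword);

// $group on _id with $first: keep the first document seen for each _id.
const grouped = new Map();
for (const doc of matched) {
  if (!grouped.has(doc._id)) {
    grouped.set(doc._id, { _id: doc._id, author: doc.author });
  }
}
const result = [...grouped.values()];

console.log(matched.length); // 2, one per occurrence of "a"
console.log(result.length);  // 1, duplicates collapsed by the grouping
```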
Another approach is to use $project to get rid of the duplicate words before unwinding the array, like this:
...
{
$project: {
source: 1,
date: 1,
author: 1,
thread: 1,
commentID: 1,
title: 1,
comment: { $setUnion: ["$comment"] }
}
},
{$unwind: "$comment"},
{$match: {"comment": buzzword } }
...
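Update in response to a comment:
To keep the comment array intact, you can instead project the deduplicated array into another field and unwind that field, like this: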
...
{
$project: {
source: 1,
date: 1,
author: 1,
thread: 1,
commentID: 1,
title: 1,
comment: 1,
commentWord: { $setUnion: ["$comment"] }
}
},
{$unwind: "$commentWord"},
{$match: {"commentWord": buzzword } }
...
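The effect of deduplicating into a separate field can be imitated in plain JavaScript with a Set. This is a sketch under assumptions, not MongoDB code: a Set stands in for $setUnion, and unlike a Set, $setUnion leaves the order of its output elements unspecified.

```javascript
// In-memory imitation of $project with $setUnion, then $unwind and $match.
const doc = {
  commentID: 2693454,
  comment: ["This", "is", "a", "comment", "with", "a", "duplicate"]
};

// $project: keep comment as-is and add a deduplicated copy as commentWord.
const projected = { ...doc, commentWord: [...new Set(doc.comment)] };

// $unwind on commentWord: one document per distinct word.
const unwoundWords = projected.commentWord.map((word) => ({
  ...projected,
  commentWord: word
}));

// $match: the buzzword can now match at most once per comment.
const matches = unwoundWords.filter((d) => d.commentWord === "a");

console.log(matches.length);            // 1
console.log(matches[0].comment.length); // 7, the original array is untouched
```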
Hope that helps.