MongoDB: get distinct elements from an array WITH the number of occurrences of each element

Time: 2018-04-10 16:02:01

Tags: mongodb group-by aggregation-framework

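I have the following documents in my collection. Each document contains the text of a tweet and an array of entities picked out of the tweet (using AWS Comprehend):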

{
    "text" : "some tweet by John Smith in New York about Stack Overflow",
    "entities" : [
        {
            "Type" : "ORGANIZATION",
            "Text" : "stack overflow"
        },
        {
            "Type" : "LOCATION",
            "Text" : "new york"
        },
        {
            "Type" : "PERSON",
            "Text" : "john smith"
        }
    ]
},
{
    "text" : "another tweet by John Smith but this one from California and about Google",
    "entities" : [
        {
            "Type" : "ORGANIZATION",
            "Text" : "google"
        },
        {
            "Type" : "LOCATION",
            "Text" : "california"
        },
        {
            "Type" : "PERSON",
            "Text" : "john smith"
        }
    ]
}

I want to get a list of the distinct entities.Text values, grouped by entities.Type, together with the number of occurrences of each entities.Text, like this:

{ "_id" : "ORGANIZATION", "values" : [ {text:"stack overflow",count:1},{text:"google",count:1} ] }
{ "_id" : "LOCATION", "values" : [ {text:"new york",count:1},{text:"california",count:1} ] }
{ "_id" : "PERSON", "values" : [ {text:"john smith",count:2} ] }

I can group by entities.Type and put all the entities.Text values into an array with this query:

db.collection.aggregate([
    {
        $unwind: '$entities'
    },
    {
        $group: {
            _id: '$entities.Type',
            values: {
                $push: '$entities.Text'
            }
        }
    }
])

This produces the following output, which contains duplicate values and no counts.

{ "_id" : "ORGANIZATION", "values" : [ "stack overflow", "google" ] }
{ "_id" : "LOCATION", "values" : [ "new york", "california" ] }
{ "_id" : "PERSON", "values" : [ "john smith", "john smith" ] }

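I started down the path of using $project as the last step of the aggregation and adding a computed field valuesMap with a javascript function. But then I realized that you can't write javascript in an aggregation pipeline.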

My next step would be to process the mongoDB output with plain javascript, but I would like (for the sake of learning) to do all of this with a mongoDB query.

Thanks!
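
For comparison, the plain-JavaScript post-processing mentioned in the question could be sketched roughly as follows. This is only an illustration: the `grouped` array is the query output shown above pasted in by hand, and `countValues` is a hypothetical helper name, not part of any MongoDB API.

```javascript
// Output of the single-$group query from the question (copied from above).
const grouped = [
  { _id: "ORGANIZATION", values: ["stack overflow", "google"] },
  { _id: "LOCATION", values: ["new york", "california"] },
  { _id: "PERSON", values: ["john smith", "john smith"] },
];

// Hypothetical helper: collapse repeated strings into { text, count } pairs.
function countValues(values) {
  const counts = new Map(); // a Map iterates in first-seen order
  for (const text of values) {
    counts.set(text, (counts.get(text) || 0) + 1);
  }
  return Array.from(counts, ([text, count]) => ({ text, count }));
}

// Keep each group's _id and replace its values with the counted pairs.
const result = grouped.map(({ _id, values }) => ({
  _id,
  values: countValues(values),
}));

console.log(JSON.stringify(result, null, 2));
```

This gives the desired shape, but only after the whole result set has been pulled into the client, which is what the question is trying to avoid.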

2 answers:

Answer 0: (score: 4)

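You can try the query below. You need an additional $group: the first one counts each distinct (type, text) pair, and the second one groups by type and pushes each text together with its count.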

db.collection.aggregate(
[
  {"$unwind":"$entities"},
  {"$group":{
    "_id":{"type":"$entities.Type","text":"$entities.Text"},
    "count":{"$sum":1}
  }},
  {"$group":{
    "_id":"$_id.type",
    "values":{"$push":{"text":"$_id.text","count":"$count"}}
  }}
])
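
To see what each stage contributes, the same steps can be traced in plain JavaScript on the two sample documents from the question. This is a rough simulation for illustration only, not how the server executes the pipeline; the variable names are made up, and MongoDB itself makes no promise about the order of the groups it returns.

```javascript
// Sample documents from the question (tweet text omitted for brevity).
const docs = [
  { entities: [
    { Type: "ORGANIZATION", Text: "stack overflow" },
    { Type: "LOCATION", Text: "new york" },
    { Type: "PERSON", Text: "john smith" },
  ] },
  { entities: [
    { Type: "ORGANIZATION", Text: "google" },
    { Type: "LOCATION", Text: "california" },
    { Type: "PERSON", Text: "john smith" },
  ] },
];

// $unwind: one record per array element.
const unwound = docs.flatMap((doc) => doc.entities);

// First $group: count each distinct (type, text) pair.
const pairCounts = new Map();
for (const { Type, Text } of unwound) {
  const key = JSON.stringify([Type, Text]);
  const entry = pairCounts.get(key) || { type: Type, text: Text, count: 0 };
  entry.count += 1;
  pairCounts.set(key, entry);
}

// Second $group: regroup by type, pushing { text, count } into values.
const byType = new Map();
for (const { type, text, count } of pairCounts.values()) {
  if (!byType.has(type)) byType.set(type, []);
  byType.get(type).push({ text, count });
}

const simulated = Array.from(byType, ([_id, values]) => ({ _id, values }));
console.log(JSON.stringify(simulated, null, 2));
```

After the first grouping, "john smith" under PERSON is a single entry with count 2, so the second grouping only has to collect entries and never sees a duplicate.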

Answer 1: (score: 0)

db.collection.aggregate(

    // Pipeline
    [
        // Stage 1: emit one document per element of the entities array
        {
            $unwind: {
                path: '$entities'
            }
        },

        // Stage 2: count occurrences of each Text and collect the Type values it appears with
        {
            $group: {
                _id: {
                    Text: '$entities.Text'
                },
                count: {
                    $sum: 1
                },
                Type: {
                    $addToSet: '$entities.Type'
                }
            }
        },

        // Stage 3: group by the collected Type set and gather the distinct {text, count} pairs
        {
            $group: {
                _id: {
                    Type: '$Type'
                },
                values: {
                    $addToSet: {
                        text: '$_id.Text',
                        count: '$count'
                    }
                }
            }
        },

        // Stage 4: unwrap _id by taking the first element of the Type array
        {
            $project: {
                values: 1,
                _id: {
                    $arrayElemAt: ['$_id.Type', 0]
                }
            }
        }

    ]
);
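
One thing to keep in mind with this pipeline: Stage 2 groups on Text alone, so a text that was tagged with two different types would have its occurrences counted together. The sample data in the question contains no such case. The effect of the grouping key can be sketched in plain JavaScript with made-up data; `sampleEntities` and `countBy` are hypothetical names used only for this illustration.

```javascript
// Hypothetical unwound entities: "amazon" appears under two different types.
const sampleEntities = [
  { Type: "ORGANIZATION", Text: "amazon" },
  { Type: "LOCATION", Text: "amazon" },
  { Type: "ORGANIZATION", Text: "google" },
];

// Hypothetical helper: count records per key produced by keyOf.
function countBy(records, keyOf) {
  const counts = {};
  for (const record of records) {
    const key = keyOf(record);
    counts[key] = (counts[key] || 0) + 1;
  }
  return counts;
}

// Key on Text alone, like Stage 2 above: both "amazon" records share one counter.
const byTextOnly = countBy(sampleEntities, (e) => e.Text);

// Key on Type and Text together, like the first $group in answer 0.
const byTypeAndText = countBy(sampleEntities, (e) => `${e.Type}|${e.Text}`);

console.log(byTextOnly);
console.log(byTypeAndText);
```

With the text-only key, "amazon" ends up with a count of 2, while the combined key keeps one count per type.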