Question

我目前正在尝试使用MongoDB。使用Twitters Streaming API我收集了一堆推文（似乎是学习使用MongoDB聚合选项的好方法）。

我有以下查询

db.twitter.aggregate([
    { $group : { _id : '$status.user.screen_name', count: { $sum : 1 } } },
    { $sort : { count : -1, _id : 1 } },
    { $skip : 0 },
    { $limit : 5 },
]);

正如所料，这是结果：

{
    "result" : [ 
        {
            "_id" : "VacaturesBreda",
            "count" : 5
        }, 
        {
            "_id" : "breda_nws",
            "count" : 3
        }, 
        {
            "_id" : "BredaDichtbij",
            "count" : 2
        }, 
        {
            "_id" : "JobbirdUTITBaan",
            "count" : 2
        }, 
        {
            "_id" : "vacatures_nr1",
            "count" : 2
        }
    ],
    "ok" : 1
}

问题是如何在用户id_str上匹配并返回screen_name，例如用户的followers_count。我尝试使用{ $project .... }执行此操作，但我最终得到一个空结果集。

对于那些不熟悉Twitters JSON响应中的用户对象的人来说，这是它的一部分（刚刚选择了db中的第一个用户）。

"user" : {
        "id" : 2678963916,
        "id_str" : "2678963916",
        "name" : "JobbirdUT IT Banen",
        "screen_name" : "JobbirdUTITBaan",
        "location" : "Utrecht",
        "url" : "http://www.jobbird.com",
        "description" : "Blijf op de hoogte van de nieuwste IT/Automatisering vacatures in Utrecht, via http://Jobbird.com",
        "protected" : false,
        "verified" : false,
        "followers_count" : 1,
        "friends_count" : 1,
        "listed_count" : 0,
        "favourites_count" : 0,
        "statuses_count" : 311,
        "created_at" : "Fri Jul 25 07:35:48 +0000 2014",
        ...
    },

更新：根据要求提供了关于建议回复的明确示例（抱歉不添加回复）。

因此，不要对screen_name上的id_str分组进行分组。为什么你可能会问，可以编辑你的screen_name，但你仍然是Twitter的同一个用户（所以应该返回最后一个screen_name：

db.twitter.aggregate([
    { $group : { _id : '$status.user.id_str', count: { $sum : 1 } } },
    { $sort : { count : -1, _id : 1 } },
    { $skip : 0 },
    { $limit : 5 },
]);

并且响应如下：

{
    "result" : [ 
        {
            "_id" : "123456789",
            "screen_name": "awsome_screen_name",
            "followers_count": 523,
            "count" : 5
        }, 
        ....
    ],
    "ok" : 1
}

Answer 1

您基本上是在寻找一个没有专门“聚合”内容的运营商，这基本上是$first和$last运营商所做的事情：

db.twitter.aggregate([
    { "$group": {
        "_id": "$status.user.id_str",
        "screen_name": { "$first": "$status.user.screen_name" },
        "followers_count": { "$sum": "$status.user.followers_count" },
        "count": { "$sum": 1 }
    }},
    { "$sort": { "followers_count": -1, "count": -1 } },
    { "$limit": 5 }
])

根据分组键选择字段的“第一次”出现。在文档中存在与分组键重复的相关数据时，这通常很有用。

另一种方法是在分组键中包含字段。您可以稍后使用 $project 进行重组：

db.twitter.aggregate([
    { "$group": {
        "_id": { 
            "_id": "$status.user.id_str",
             "screen_name": "$status.user.screen_name"
        },
        "followers_count": { "$sum": "$status.user.followers_count" },
        "count": { "$sum": 1 }
    }},
    { "$project": {
        "_id": "$_id._id",
        "screen_name": "$_id.screen_name"
        "followers_count": 1,
        "count": 1
    }},
    { "$sort": { "followers_count": -1, "count": -1 } },
    { "$limit": 5 }
])

如果您不确定相关的“唯一性”，那么这是有用的。

MongoDB中的聚合返回更多字段

1 个答案: