MongoDB distinct aggregation over 3 billion documents

Time: 2015-01-23 19:29:54

Tags: mongodb mapreduce mongodb-query aggregation-framework

I have a collection of 3 billion documents. Each document looks like this:

"_id" : ObjectId("54c1a013715faf2cc0047c77"),
"service_type" : "JE",
"receiver_id" : NumberLong("865438083645"),
"time" : ISODate("2012-12-05T23:07:36Z"),
"duration" : 24,
"service_description" : "NQ",
"receiver_cell_id" : null,
"location_id" : "658_55525",
"caller_id" : NumberLong("475035504705")

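I want to get a list of distinct users (each must appear at least once as a caller, i.e. in 'caller_id'), their counts (how many times each user appears in the collection as either caller or receiver), and, for the records where they are the caller, a count per location (i.e. the count for each location_id per user).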

I would like to end up with the following:

"number_of_records" : 20,
"locations" : [{"location_id" : "658_55525", "count" : 5}, {"location_id" : "840_5425", "count" : 15}],
"user" : NumberLong("475035504705")

I tried the solutions described here and here, but they are not efficient (very slow). What is an efficient way to achieve this?
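To pin down the requirement, here is a minimal plain-JavaScript sketch of the result described above, run over an in-memory array rather than MongoDB. The names summarize and sampleDocs are hypothetical, the ids are plain numbers instead of NumberLong, and it assumes that a record in which the same user is both caller and receiver counts twice.

```javascript
// Reference for the desired result on a small in-memory sample (not a MongoDB query).
// `summarize` and `sampleDocs` are hypothetical names used only for illustration.
function summarize(docs) {
    var users = Object.create(null); // caller_id (as string) -> running totals

    // Pass 1: every caller becomes a user; count its records and locations as caller.
    docs.forEach(function (doc) {
        var key = String(doc.caller_id);
        if (!users[key]) {
            users[key] = {
                user: doc.caller_id,
                number_of_records: 0,
                locationCounts: Object.create(null)
            };
        }
        var u = users[key];
        u.number_of_records += 1;
        u.locationCounts[doc.location_id] = (u.locationCounts[doc.location_id] || 0) + 1;
    });

    // Pass 2: add appearances as receiver, but only for users already seen as callers.
    // Assumption: a record with caller_id equal to receiver_id is counted twice.
    docs.forEach(function (doc) {
        var u = users[String(doc.receiver_id)];
        if (u) {
            u.number_of_records += 1;
        }
    });

    return Object.keys(users).map(function (key) {
        var u = users[key];
        return {
            number_of_records: u.number_of_records,
            locations: Object.keys(u.locationCounts).map(function (loc) {
                return { location_id: loc, count: u.locationCounts[loc] };
            }),
            user: u.user
        };
    });
}

var sampleDocs = [
    { caller_id: 1, receiver_id: 2, location_id: "A" },
    { caller_id: 1, receiver_id: 3, location_id: "A" },
    { caller_id: 1, receiver_id: 2, location_id: "B" },
    { caller_id: 2, receiver_id: 1, location_id: "B" }
];

var result = summarize(sampleDocs);
console.log(JSON.stringify(result, null, 2));
```

User 3 never appears as a caller, so it is left out of the result even though it receives a call.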

2 answers:

Answer 0: (score: 2)

Use aggregation to get the result:

db.<collection>.aggregate([
   // Count records per (caller, location) pair.
   { $group : { _id : { user : '$caller_id', location : '$location_id' }, count : { $sum : 1 } } },
   // Flatten the compound _id; the user becomes the new _id.
   { $project : { _id : '$_id.user', location : '$_id.location', count : 1 } },
   // Collect the per-location counts for each user and sum them.
   { $group : { _id : '$_id', locations : { $push : { location_id : '$location', count : '$count' } }, number_of_records : { $sum : '$count' } } },
   // Rename _id to user for the final shape.
   { $project : { _id : 0, user : '$_id', locations : 1, number_of_records : 1 } },
   // Write the result to a collection instead of returning it inline.
   { $out : 'outputCollection' }
])

The output will be:

{
    "0" : {
        "locations" : [ 
            {
                "location_id" : "840_5425",
                "count" : 8
            }, 
            {
                "location_id" : "658_55525",
                "count" : 5
            }
        ],
        "number_of_records" : 13,
        "user" : NumberLong(475035504705)
    }
}
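
Use allowDiskUse so that the $group stages can spill to temporary files on disk once they exceed the aggregation memory limit.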
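The two $group stages of the pipeline above can be traced in plain JavaScript on a few in-memory records. This is only an illustration of what each stage computes, not MongoDB code; groupByUserAndLocation, groupByUser and calls are hypothetical names, and the ids are plain numbers instead of NumberLong.

```javascript
// Stage 1: one row per (caller_id, location_id) pair with its record count.
function groupByUserAndLocation(docs) {
    var pairs = new Map();
    docs.forEach(function (doc) {
        var key = JSON.stringify([doc.caller_id, doc.location_id]);
        var row = pairs.get(key);
        if (!row) {
            row = { user: doc.caller_id, location: doc.location_id, count: 0 };
            pairs.set(key, row);
        }
        row.count += 1;
    });
    return Array.from(pairs.values());
}

// Stage 2: one row per user; push each location row and sum the counts.
function groupByUser(rows) {
    var users = new Map();
    rows.forEach(function (row) {
        var out = users.get(row.user);
        if (!out) {
            out = { locations: [], number_of_records: 0, user: row.user };
            users.set(row.user, out);
        }
        out.locations.push({ location_id: row.location, count: row.count });
        out.number_of_records += row.count;
    });
    return Array.from(users.values());
}

var calls = [
    { caller_id: 475035504705, location_id: "658_55525" },
    { caller_id: 475035504705, location_id: "840_5425" },
    { caller_id: 475035504705, location_id: "658_55525" }
];

var traced = groupByUser(groupByUserAndLocation(calls));
console.log(JSON.stringify(traced, null, 2));
```

Because neither stage reads receiver_id, number_of_records here counts only the records in which the user is the caller.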

Update

var pipe = [
   { $group : { _id : { user : '$caller_id', location : '$location_id' }, count : { $sum : 1 } } },
   { $project : { _id : '$_id.user', location : '$_id.location', count : 1 } },
   { $group : { _id : '$_id', locations : { $push : { location_id : '$location', count : '$count' } }, number_of_records : { $sum : '$count' } } },
   { $project : { _id : 0, user : '$_id', locations : 1, number_of_records : 1 } },
   { $out : 'outputCollection' }
];

db.runCommand(
   { aggregate: "collection",
     pipeline: pipe,
     allowDiskUse: true,
     cursor: {}   // required by MongoDB 3.6 and later
   }
)

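Answer 1: (score: 1)

A map-reduce solution is better suited here than the aggregation pipeline, because it avoids two unwinds. If you can come up with an aggregation solution that needs only a single unwind, go with that. The map-reduce solution below is one approach, though you need to measure its running time against the large data set to see whether it works for you.

The map function: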

var map = function(){
    emit(this.caller_id,
        {locs:[{"location_id":this.location_id,"count":1}]});
}

The reduce function:

var reduce = function(key,values){
    var result = {locs:[]};
    var locations = {};
    values.forEach(function(value){
        value.locs.forEach(function(loc){
                // Add the carried count: MongoDB may pass an earlier
                // reduce result back in, so a count can exceed 1.
                locations[loc.location_id] =
                    (locations[loc.location_id] || 0) + loc.count;
        })
    })
    Object.keys(locations).forEach(function(k){
        result.locs.push({"location_id":k,"count":locations[k]});
    })
    return result;
}
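MongoDB may call reduce more than once for the same key and feed an earlier reduce result back in as one of the values, so the result must not depend on how the values are batched. The sketch below checks that property in Node.js for a count-merging reduce of this shape. reduceLocs and one are hypothetical stand-alone names, and the merge adds loc.count rather than incrementing by one.

```javascript
// Stand-alone version of a count-merging reduce, so it can be checked without a database.
var reduceLocs = function (key, values) {
    var locations = {};
    values.forEach(function (value) {
        value.locs.forEach(function (loc) {
            // Add the carried count: a value may already be a partial result.
            locations[loc.location_id] = (locations[loc.location_id] || 0) + loc.count;
        });
    });
    return {
        locs: Object.keys(locations).map(function (k) {
            return { location_id: k, count: locations[k] };
        })
    };
};

// Helper that builds the value a map function would emit for one record.
var one = function (locationId) {
    return { locs: [{ location_id: locationId, count: 1 }] };
};

var emitted = [
    one("658_55525"), one("658_55525"),
    one("658_55525"), one("658_55525"), one("840_5425")
];

// Reduce everything in one call...
var allAtOnce = reduceLocs(1, emitted);
// ...and in two batches whose partial results are reduced again.
var inTwoSteps = reduceLocs(1, [
    reduceLocs(1, emitted.slice(0, 2)),
    reduceLocs(1, emitted.slice(2))
]);

console.log(JSON.stringify(allAtOnce) === JSON.stringify(inTwoSteps));
```

With a merge that increments by one instead of adding loc.count, the two-batch run would report 3 for "658_55525" rather than 4, because each partial result already carries a count of 2.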

The finalize function:

var finalize = function(key,value){
    var total = 0;
    value.locs.forEach(function(loc){
        total += loc.count;
    })
    return {"total":total,"locs":value.locs};
}

Invoke map-reduce:

db.collection.mapReduce(map,reduce,{"out":"t1","finalize":finalize});
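Before running this on 3 billion documents, it can help to exercise the three functions on a handful of in-memory records. The following is a rough Node.js sketch of such a harness. runLocally is a hypothetical stand-in for mapReduce (it is not a MongoDB API), the ids are plain numbers instead of NumberLong, and the reduce shown adds loc.count when merging.

```javascript
var emit; // assigned by runLocally; the mongo shell provides emit() itself

var map = function () {
    emit(this.caller_id, { locs: [{ location_id: this.location_id, count: 1 }] });
};

var reduce = function (key, values) {
    var locations = {};
    values.forEach(function (value) {
        value.locs.forEach(function (loc) {
            locations[loc.location_id] = (locations[loc.location_id] || 0) + loc.count;
        });
    });
    return {
        locs: Object.keys(locations).map(function (k) {
            return { location_id: k, count: locations[k] };
        })
    };
};

var finalize = function (key, value) {
    var total = 0;
    value.locs.forEach(function (loc) {
        total += loc.count;
    });
    return { total: total, locs: value.locs };
};

// Hypothetical in-memory harness: groups emitted values by key, then reduces and finalizes.
function runLocally(docs, mapFn, reduceFn, finalizeFn) {
    var groups = new Map(); // key -> array of emitted values
    emit = function (key, value) {
        if (!groups.has(key)) {
            groups.set(key, []);
        }
        groups.get(key).push(value);
    };
    docs.forEach(function (doc) {
        mapFn.call(doc); // map reads the current document through `this`
    });
    var out = [];
    groups.forEach(function (values, key) {
        // Like MongoDB, skip reduce when a key has only one value.
        var reduced = values.length === 1 ? values[0] : reduceFn(key, values);
        out.push({ _id: key, value: finalizeFn(key, reduced) });
    });
    return out;
}

var docs = [
    { caller_id: 2, location_id: "658_55525" },
    { caller_id: 2, location_id: "658_55525213" },
    { caller_id: 2, location_id: "658_55525213" },
    { caller_id: 475035504705, location_id: "658_55525" }
];

var t1 = runLocally(docs, map, reduce, finalize);
console.log(JSON.stringify(t1, null, 2));
```

Each element of t1 has the same _id and value layout that mapReduce writes to its output collection.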

Once map-reduce has produced its output, aggregate the results to reshape them:

db.t1.aggregate([
{$project:{"_id":0,
           "number_of_records":"$value.total",
           "locations":"$value.locs","user":"$_id"}}
])

Sample output:

{
        "number_of_records" : 3,
        "locations" : [
                {
                        "location_id" : "658_55525",
                        "count" : 1
                },
                {
                        "location_id" : "658_55525213",
                        "count" : 2
                }
        ],
        "user" : 2
}
{
        "number_of_records" : 1,
        "locations" : [
                {
                        "location_id" : "658_55525",
                        "count" : 1
                }
        ],
        "user" : NumberLong("475035504705")
}

The map-reduce JavaScript code should be self-explanatory.