Why does the mapper in a MongoDB mapReduce sometimes generate more documents than the original data contains?

Asked: 2014-02-09 11:14:54

Tags: mongodb mapreduce

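I am computing population totals per state and getting extra documents in the raw output. When I checked, I found that the mappers generate far more intermediate data than there are documents in MongoDB. How can I fix this? The source collection contains 29468 documents in total.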

Sample from the dataset:

{ "city" : "SPLENDORA", "loc" : [ -95.199308, 30.232609 ], "pop" : 11287, "state" : "TX", "_id" : "77372" }

{ "city" : "SPRING", "loc" : [ -95.377329, 30.053241 ], "pop" : 33118, "state" : "TX", "_id" : "77373" }

{ "city" : "TOMBALL", "loc" : [ -95.62006, 30.073923 ], "pop" : 19801, "state" : "TX", "_id" : "77375" }

{ "city" : "WILLIS", "loc" : [ -95.497583, 30.432025 ], "pop" : 9988, "state" : "TX", "_id" : "77378" }

{ "city" : "KLEIN", "loc" : [ -95.528481, 30.023377 ], "pop" : 35275, "state" : "TX", "_id" : "77379" }

{ "city" : "CONROE", "loc" : [ -95.492392, 30.225725 ], "pop" : 1635, "state" : "TX", "_id" : "77384" }

Map function:

var m=function(){ emit(this.city,this.pop);}

Reduce function:

var r=function(c,p){ return p;}

MR output to the new collection:

{ "_id" : "81080", "value" : 172 }
{ "_id" : "81250", "value" : 467 }
{ "_id" : "82057", "value" : 60 }
{ "_id" : "95411", "value" : 133 }
{ "_id" : "95414", "value" : 226 }
{ "_id" : "95440", "value" : 2876 }
{ "_id" : "95455", "value" : 843 }
{ "_id" : "95467", "value" : 328 }
{ "_id" : "95489", "value" : 358 }
{ "_id" : "95495", "value" : 367 }
{ "_id" : "98791", "value" : 5345 }
{ "_id" : "PLEASANT GROVE", "value" : [ 8458, 15703, 80, 772,
{ "_id" : "POINTBLANK", "value" : 2911 }
{ "_id" : "PORTER", "value" : [ 13541, 19024, 985, 425, 2705 ]
{ "_id" : "SHEPHERD", "value" : [ 9604, 17397, 2078 ] }
{ "_id" : "SPLENDORA", "value" : 11287 }
{ "_id" : "SPRING", "value" : [ 33118, 8379, 21805, 8540 ] }
{ "_id" : "TOMBALL", "value" : 19801 }
{ "_id" : "WILLIS", "value" : [ 9988, 2769, 2574 ] }
{ "_id" : "KLEIN", "value" : 35275 }

1 answer:

Answer 0 (score: 0)

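Your output is not what you expect because the reduce function is incorrect. The prototype of a reduce function is function(key, values) {...}, where values is an array of the values associated with key.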

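Your reduce function returns the values array unchanged instead of reducing it, which is why cities that occur more than once, such as SPRING, end up with an array as their value.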
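To see why some keys get an array and others a plain number, here is a minimal plain-JavaScript sketch of the grouping step. It is a simulation, not MongoDB's implementation: `simulateMapReduce` is a hypothetical helper, `emit` is passed to the map function as an argument instead of being a global, and the sample rows carry only the fields the map function reads (the second SPRING population is taken from the MR output in the question). It mirrors one documented behaviour: MongoDB does not call reduce for a key that has only a single value.

```javascript
// Hypothetical helper that mimics map -> group by key -> reduce in memory.
function simulateMapReduce(docs, mapFn, reduceFn) {
  const groups = new Map();
  const emit = (key, value) => {
    if (!groups.has(key)) groups.set(key, []);
    groups.get(key).push(value);
  };
  // As in MongoDB, the current document is bound to `this` inside map.
  for (const doc of docs) mapFn.call(doc, emit);

  const out = [];
  for (const [key, values] of groups) {
    // Mirrors MongoDB: reduce is skipped when a key has a single value.
    const value = values.length === 1 ? values[0] : reduceFn(key, values);
    out.push({ _id: key, value });
  }
  return out;
}

const docs = [
  { city: "SPLENDORA", pop: 11287 },
  { city: "SPRING", pop: 33118 },
  { city: "SPRING", pop: 8379 },
];

// The map and reduce functions from the question, adapted to take `emit`.
const mapByCity = function (emit) { emit(this.city, this.pop); };
const identityReduce = function (key, values) { return values; };

const result = simulateMapReduce(docs, mapByCity, identityReduce);
console.log(JSON.stringify(result));
```

In this sketch SPLENDORA keeps its scalar population because reduce never runs for it, while SPRING receives the untouched array of both populations, matching the mixed shapes in the output of the question.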

To sum the values for a given key, your reduce() function should look like this:

  var r=function(key, values) {
     return Array.sum(values);
  }
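`Array.sum` is a helper available in MongoDB's JavaScript environment, not part of standard JavaScript. MongoDB may also call reduce more than once for the same key, feeding an earlier result back in as one of the values, so the function has to give the same answer either way. The following plain-JavaScript sketch, with `sumReduce` as a hypothetical stand-in for `r`, illustrates that summing has this property, using the SPRING populations from the output in the question:

```javascript
// Plain-JavaScript stand-in for Array.sum(values); the name is hypothetical.
function sumReduce(key, values) {
  return values.reduce((total, v) => total + v, 0);
}

// Populations listed for SPRING in the MR output of the question.
const springPops = [33118, 8379, 21805, 8540];

// Reduce everything in a single call...
const oneShot = sumReduce("SPRING", springPops);

// ...or reduce a partial batch first and feed that result back in.
const partial = sumReduce("SPRING", springPops.slice(0, 2));
const reReduced = sumReduce("SPRING", [partial, ...springPops.slice(2)]);

console.log(oneShot, reReduced, oneShot === reReduced);
```

Returning the array itself, as the original reduce did, fails this check because a re-reduce would receive an array nested inside the values.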

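If you want to compute the population by state, your map() function is also wrong: you should emit state and population rather than city and population: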

  var m=function() {
     emit(this.state,this.pop);
  }

Putting it together, your output should look like this:

    {
        "_id" : "AK",
        "value" : 550043
    },
    {
        "_id" : "AL",
        "value" : 4040587
    },
    {
        "_id" : "AR",
        "value" : 2350725
    }
    ...
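As a quick sanity check that does not need a running server, the corrected logic can be applied in plain JavaScript to the six sample documents from the question. This is only an in-memory sketch: `totalsByState` is a hypothetical helper that folds the map step (key by state) and the reduce step (sum the populations) into one loop, and its result covers just these six rows, not the full collection.

```javascript
// The six sample documents from the question (only the fields used here).
const sample = [
  { city: "SPLENDORA", pop: 11287, state: "TX" },
  { city: "SPRING", pop: 33118, state: "TX" },
  { city: "TOMBALL", pop: 19801, state: "TX" },
  { city: "WILLIS", pop: 9988, state: "TX" },
  { city: "KLEIN", pop: 35275, state: "TX" },
  { city: "CONROE", pop: 1635, state: "TX" },
];

// Hypothetical helper: group by state and sum the populations.
function totalsByState(docs) {
  const totals = {};
  for (const doc of docs) {
    totals[doc.state] = (totals[doc.state] || 0) + doc.pop;
  }
  // One output document per state, sorted by key like the MR output.
  return Object.keys(totals)
    .sort()
    .map((state) => ({ _id: state, value: totals[state] }));
}

console.log(totalsByState(sample));
```

Because all six rows belong to TX, the sketch yields a single document, which shows how keying on state collapses many city rows into one result per state.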

The MongoDB manual provides more details on writing and testing reduce functions: