Here is my MapReduce code:
DBCollection mongoCollection = MongoDAO.getCollection();
String map = "function() {"
+ "for (index in this.positions.positionList) {"
+ "emit(this._id+'|'+this.headline+'|'+"
+ "this.location.name+'|'+this.location.country.code+'|'+this.publicProfileUrl+'|'+"
+ "this.positions.positionList[index].title+'|'+"
+ "this.positions.positionList[index].company.name+'|'+this.positions.positionList[index].company.industry+'|'+"
+ "this.positions.positionList[index].company.type+'|'+this.positions.positionList[index].company.size+'|'+"
+ "this.lastName+'|'+this.firstName+'|'+this.industry+'|'+this.updatedDate+'|' , {count: 1});"
+ "}}";
String reduce = "";
MapReduceCommand mapReduceCommand = new MapReduceCommand(
mongoCollection, map, reduce.toString(), "final_result",
MapReduceCommand.OutputType.REPLACE, null);
MapReduceOutput out = mongoCollection.mapReduce(mapReduceCommand);
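I am currently processing 140,000 records, but after running the MapReduce the number of records drops to 90,000. There are no duplicate records in the dataset.
Answer 0 (score: 1)
Change your emit so that it emits _id as the key and the pipe-delimited string as the value. For example: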
emit(this._id, [this._id, this.a, this.b,...].join('|'))
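I think what is happening is that the strings you are building for the key are too long. The emitted key becomes the _id of the output document, and _id values are limited to 1 KB (as of 2.0; the limit was about 800 bytes before that).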
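To make the suggestion concrete, here is a minimal sketch of how the two function strings from the question could be rebuilt so that the key stays short. The class name MapReduceSketch, the helper names buildMap and buildReduce, and the local variable p are hypothetical and used only for illustration. The reduce function is also an assumption, not part of the original answer: once _id is the key, a profile with several positions emits several values under the same key, so some reduce is needed to combine them, and here it simply joins them with newlines.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper class; only builds the JavaScript strings, no MongoDB calls.
class MapReduceSketch {

    // Field expressions taken from the question's map function, in the same order.
    // "p" stands for this.positions.positionList[index] inside the loop.
    static final List<String> FIELDS = Arrays.asList(
            "this._id", "this.headline", "this.location.name",
            "this.location.country.code", "this.publicProfileUrl",
            "p.title", "p.company.name", "p.company.industry",
            "p.company.type", "p.company.size",
            "this.lastName", "this.firstName", "this.industry", "this.updatedDate");

    // Map function: the short _id is the key, the pipe-delimited string is the value.
    static String buildMap() {
        return "function() {"
                + "for (var index in this.positions.positionList) {"
                + "var p = this.positions.positionList[index];"
                + "emit(this._id, [" + String.join(", ", FIELDS) + "].join('|'));"
                + "}}";
    }

    // Assumed reduce: combines all rows emitted for one _id into a single
    // newline-separated string, so the value type matches what map emits.
    static String buildReduce() {
        return "function(key, values) { return values.join('\\n'); }";
    }
}
```

The two strings would then be passed to MapReduceCommand in place of map and reduce from the question. Note that with this layout the output collection holds one document per _id rather than one per position.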
Also, you may want to look at the prepackaged mongodb-hadoop connector instead of rolling your own: https://github.com/mongodb/mongo-hadoop