Suppose I have streaming data from the Twitter API, and I store it as documents in MongoDB. What I want to find is how many times each screen_name under entities.user_mentions appears.
{
    "_id" : ObjectId("50657d5844956d06fb5b36c7"),
    "contributors" : null,
    "text" : "",
    "entities" : {
        "urls" : [ ],
        "hashtags" : [
            {
                "text" : "",
                "indices" : [
                    26,
                    30
                ]
            },
            {
                "text" : "",
                "indices" : []
            }
        ],
        "user_mentions" : [
            {
                "name" : "Twitter API",
                "indices" : [4, 15],
                "screen_name" : "twitterapi",
                "id" : 6253282,
                "id_str" : "6253282"
            }
        ]
    },
    ...
I tried to use map reduce:
map = function() {
    if (!this.entities.user_mentions.screen_name) {
        return;
    }
    for (index in this.entities.user_mentions.screen_name) {
        emit(this.entities.user_mentions.screen_name[index], 1);
    }
}
reduce = function(previous, current) {
    var count = 0;
    for (index in current) {
        count += current[index];
    }
    return count;
}
result = db.runCommand({
    "mapreduce" : "twitter_sample",
    "map" : map,
    "reduce" : reduce,
    "out" : "user_mentions"
});
But it doesn't quite work...
Answer 0 (score: 3)
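Since entities.user_mentions is an array, screen_name is a field of each array element rather than of the array itself, so you want to emit a value for each screen_name in map():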
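var map = function() {
    // Skip documents that have no entities.user_mentions array
    if (!this.entities || !this.entities.user_mentions) {
        return;
    }
    // Emit one value per mention, keyed by screen_name
    this.entities.user_mentions.forEach(function(mention) {
        emit(mention.screen_name, { count: 1 });
    });
};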
Then sum the counts per screen_name in reduce():
var reduce = function(key, values) {
    // NB: reduce() uses same format as results emitted by map()
    var result = { count: 0 };
    values.forEach(function(value) {
        result.count += value.count;
    });
    return result;
};
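To see why reduce() has to return the same shape that map() emits, here is a minimal plain-JavaScript sketch that simulates the two functions outside MongoDB. The names sampleTweets, emittedGroups, runMapReduce, mapFn and reduceFn are made up for illustration, and the chunking is only an assumption about how a server might batch values; the point is that reduce() can be fed its own earlier output.

```javascript
// Plain-JavaScript simulation of the map/reduce contract; no MongoDB needed.
// All names here (sampleTweets, runMapReduce, ...) are hypothetical.
var sampleTweets = [
  { entities: { user_mentions: [{ screen_name: "twitterapi" }, { screen_name: "mongodb" }] } },
  { entities: { user_mentions: [{ screen_name: "twitterapi" }] } },
  { entities: { user_mentions: [] } },
  { entities: { user_mentions: [{ screen_name: "twitterapi" }] } }
];

// Stand-in for the emit() that MongoDB provides inside map().
var emittedGroups = Object.create(null);
function emit(key, value) {
  (emittedGroups[key] = emittedGroups[key] || []).push(value);
}

var mapFn = function() {
  this.entities.user_mentions.forEach(function(mention) {
    emit(mention.screen_name, { count: 1 });
  });
};

var reduceFn = function(key, values) {
  var result = { count: 0 };
  values.forEach(function(value) {
    result.count += value.count;
  });
  return result;
};

function runMapReduce(docs, map, reduce) {
  emittedGroups = Object.create(null);
  docs.forEach(function(doc) { map.call(doc); });

  var results = {};
  Object.keys(emittedGroups).forEach(function(key) {
    var values = emittedGroups[key];
    // MongoDB does not call reduce() for a key that has a single value.
    if (values.length === 1) {
      results[key] = values[0];
      return;
    }
    // Reduce in chunks of two, then re-reduce the partial results.
    var partials = [];
    for (var i = 0; i < values.length; i += 2) {
      partials.push(reduce(key, values.slice(i, i + 2)));
    }
    results[key] = reduce(key, partials);
  });
  return results;
}

var mentionCounts = runMapReduce(sampleTweets, mapFn, reduceFn);
console.log(JSON.stringify(mentionCounts));
```

Against a real server, the two functions from the answer would be run with something like db.twitter_sample.mapReduce(map, reduce, { out: "user_mentions" }), and each document in the output collection then has the shape { _id: <screen_name>, value: { count: N } }.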
Note: to debug your map/reduce JavaScript functions, you can use the print() and printjson() commands. The output will appear in the mongod log.
EDIT: For comparison, here is an example using the new Aggregation Framework in MongoDB 2.2:
db.twitter_sample.aggregate(
    // Project to limit the document fields included
    { $project: {
        _id: 0,
        "entities.user_mentions" : 1
    }},
    // Split user_mentions array into a stream of documents
    { $unwind: "$entities.user_mentions" },
    // Group and count the unique mentions by screen_name
    { $group : {
        _id: "$entities.user_mentions.screen_name",
        count: { $sum : 1 }
    }},
    // Optional: sort by count, descending
    { $sort : {
        "count" : -1
    }}
)
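For readers unfamiliar with the pipeline stages, the following rough in-memory sketch mimics what $unwind, $group and $sort do to the data. It is plain JavaScript rather than anything MongoDB runs, and the names pipelineTweets and countMentions are invented for this illustration.

```javascript
// In-memory sketch of the $unwind -> $group -> $sort stages.
// pipelineTweets and countMentions are hypothetical names.
var pipelineTweets = [
  { entities: { user_mentions: [{ screen_name: "mongodb" }, { screen_name: "twitterapi" }] } },
  { entities: { user_mentions: [] } },
  { entities: { user_mentions: [{ screen_name: "twitterapi" }] } }
];

function countMentions(docs) {
  // $unwind: one record per array element; empty or missing arrays yield nothing.
  var unwound = [];
  docs.forEach(function(doc) {
    var mentions = (doc.entities && doc.entities.user_mentions) || [];
    mentions.forEach(function(mention) {
      unwound.push(mention);
    });
  });

  // $group with $sum: 1, i.e. count records per screen_name.
  var counts = Object.create(null);
  unwound.forEach(function(mention) {
    counts[mention.screen_name] = (counts[mention.screen_name] || 0) + 1;
  });

  // $sort by count, descending.
  return Object.keys(counts)
    .map(function(name) {
      return { _id: name, count: counts[name] };
    })
    .sort(function(a, b) {
      return b.count - a.count;
    });
}

var sortedMentions = countMentions(pipelineTweets);
console.log(JSON.stringify(sortedMentions));
```

The result has the same { _id, count } shape as the documents returned by the aggregate() call, which is flatter than the { _id, value: { count } } shape that Map/Reduce produces.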
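The original Map/Reduce approach is best suited for a large data set, as is implied by Twitter data. For a comparison of the limitations of Map/Reduce versus the Aggregation Framework, see the related discussion on the StackOverflow question MongoDB group(), $group and MapReduce.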