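OK, I want to build a kind of MapReduce algorithm that creates an inverted index for text documents. In the map part, I do something like this:
import re
letters = ['a']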
regx = re.compile("^("+"|".join(letters)+')')
selectedWords = directIndex.aggregate([
{ "$match": { "words.word": regx } },
{ "$unwind": "$words" },
{ "$match": { "words.word": regx } },
{ "$group": { "_id": { "word":"$words.word", "count":"$words.count", 'document' : '$document' } } }])
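Here I select, by first letter, all the words and the information that goes with them. After that, I write this information to another collection:
# aggregate() returns a cursor, so materialize it into a list before inserting
myinvcol.insert_one({'letter':str(''.join(letters)),'words':list(selectedWords) })
In the next step I read each inserted document and perform the reduce operation, building a dict of the form {'wordName': {documents: [document1: count1, document2: count2, etc.]}, 'wordName2': {documents: [...]}}, and then do some further operations on that dict.
Now the interesting part: is it possible to run the first step (the map part, i.e. the aggregation) entirely on the MongoDB server? In other words, I know there is the '$out' operator: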
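The reduce step described above can be sketched in plain Python. This is only an illustration: the function name `reduce_rows` and the sample rows are hypothetical, and each row is assumed to have the shape produced by the `$group` stage shown earlier.

```python
from collections import defaultdict


def reduce_rows(rows):
    """Group aggregation rows by word.

    Each row is assumed to look like the $group output:
    {"_id": {"word": ..., "count": ..., "document": ...}}.
    """
    index = defaultdict(lambda: {"documents": []})
    for row in rows:
        key = row["_id"]
        # Append one posting (document name + count) to the word's list.
        index[key["word"]]["documents"].append(
            {"document": key["document"], "count": key["count"]}
        )
    return dict(index)


# Hypothetical sample rows standing in for the stored 'words' array.
sample_rows = [
    {"_id": {"word": "apple", "count": 3, "document": "doc1"}},
    {"_id": {"word": "apple", "count": 1, "document": "doc2"}},
    {"_id": {"word": "avocado", "count": 2, "document": "doc1"}},
]
inverse = reduce_rows(sample_rows)
```

Each word ends up with a single entry whose `documents` list holds one posting per source document.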
letters = ['a']
regx = re.compile("^("+"|".join(letters)+')')
selectedWords = directIndex.aggregate([
{ "$match": { "words.word": regx } },
{ "$unwind": "$words" },
{ "$match": { "words.word": regx } },
{ "$group": { "_id": { "word":"$words.word", "count":"$words.count", 'document' : '$document' } } },
{ "$out" : 'InverseIndex'}])
It lets me write the aggregation result to another collection, but it does not do what I want: instead of inserting a single document:
{'letter':str(''.join(letters)),'words':selectedWords },
it inserts many documents of the form
{ "_id": { "word":"$words.word", "count":"$words.count", 'document' : '$document' } }.
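So, finally: is there a way to build a single aggregated document that merges all the results into one array before the $out stage?
Answer 0 (score: 0)
After some research, I found that this might be a solution: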
letters = 'ab'
regx = re.compile("^(" + "|".join(letters) + ")")
# $out makes the server write the result straight into InverseIndex,
# so no client-side insert is needed.
mydb.DirectIndex.aggregate([
    { "$match": { "words.word": regx } },
    { "$unwind": "$words" },
    { "$match": { "words.word": regx } },
    { "$group": { "_id": { "word": "$words.word", "count": "$words.count", "document": "$document" } } },
    # Second $group: collapse all rows into one document keyed by the letters.
    { "$group": {
        "_id": { "$substr": [letters, 0, len(letters)] },
        "words": {
            "$push": {
                "word": "$_id.word",
                "count": "$_id.count",
                "document": "$_id.document"
            }
        }
    }},
    { "$out": "InverseIndex" }
])
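(Found here: mongoDB: how to reverse $unwind.) But this is where Mongo lets you down: the $out stage overwrites the contents of the target collection, so if I run this call several times, the earlier results are lost. As I read here: How do I append Mongo DB aggregation results to an existing collection?, Mongo 4.2 is supposed to give $out a special option, mode: "replaceDocuments", which would let you add new content to an existing collection. For now, though, this approach is not a good idea.
OK, so I tried to do it through Mongo's built-in map_reduce call: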
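For reference, MongoDB 4.2 ended up shipping this capability as a separate `$merge` stage rather than as a `$out` mode. Below is a minimal sketch, assuming a 4.2+ server. The helper name `build_pipeline` is hypothetical, and the stages mirror the pipeline above with `$merge` swapped in for `$out` and `$literal` used for the constant `_id`.

```python
import re


def build_pipeline(letters, target="InverseIndex"):
    """Build the aggregation pipeline, ending in $merge instead of $out."""
    regx = re.compile("^(" + "|".join(letters) + ")")
    return [
        {"$match": {"words.word": regx}},
        {"$unwind": "$words"},
        {"$match": {"words.word": regx}},
        {"$group": {"_id": {"word": "$words.word",
                            "count": "$words.count",
                            "document": "$document"}}},
        # Collapse all rows into one document keyed by the letters.
        {"$group": {"_id": {"$literal": letters},
                    "words": {"$push": {"word": "$_id.word",
                                        "count": "$_id.count",
                                        "document": "$_id.document"}}}},
        # Upsert by _id: replace the document for these letters if it exists,
        # insert it otherwise; other documents in the target stay untouched.
        {"$merge": {"into": target,
                    "on": "_id",
                    "whenMatched": "replace",
                    "whenNotMatched": "insert"}},
    ]


pipeline = build_pipeline("ab")
# With a live connection (assumed handle `mydb`), this would run as:
#   mydb.DirectIndex.aggregate(pipeline)
```

Calling this once per letter group would then add or refresh one document per group instead of wiping the collection each time.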
from bson.code import Code
mape = Code("function () {"
"var docName =this.document;"
"this.words.forEach(function(z) {"
"z['document'] = docName;"
"var temp = z.word;"
"delete z.word;"
" emit(temp, {'documents':[z]});"
" });"
"}")
reduce = Code("function (key, values) {"
" var total = [];"
" for (var i = 0; i < values.length; i++) {"
"for (var j=0;j<values[i]['documents'].length;j++){"
"total.push({'document':values[i]['documents'][j]['document'], 'count':values[i]['documents'][j]['count'], 'tf':values[i]['documents'][j]['tf']});"
" }}"
" return {'documents': total};"
"}")
finalizeFunction = Code("function (key, reducedVal) {"
"if('documents' in reducedVal){"
"var normVal = Math.log((1+"+str(nrDocs)+")/(1+1+reducedVal.documents.length));"
"reducedVal['idf']=normVal;"
"return reducedVal;} else{ return null;}"
"};")
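# Note: Collection.map_reduce exists in PyMongo 3.x; it was removed in PyMongo 4.0.
result = mydb.DirectIndex.map_reduce(mape, reduce, {'merge':"InverseIndex"},finalize=finalizeFunction)
This does, more or less, what I need. The downside is speed: compared with a hand-written MapReduce (map, then aggregate into a dict keyed by word), the difference is large. Anyway, if anyone runs into this problem, these are the only two approaches I know.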
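The hand-written variant mentioned above might look roughly like the following. This is a sketch, not the original implementation: the function name `build_inverse_index` and the sample data are hypothetical, `nrDocs` is assumed to be the number of documents in the direct index, and the idf expression is copied from `finalizeFunction`.

```python
import math
from collections import defaultdict


def build_inverse_index(direct_docs):
    """Map, reduce and finalize in plain Python over direct-index documents."""
    nr_docs = len(direct_docs)  # assumption: nrDocs is the document count
    index = defaultdict(lambda: {"documents": []})
    # Map + reduce in one pass: each word entry becomes a posting that is
    # appended to the list kept under its word.
    for doc in direct_docs:
        for entry in doc["words"]:
            index[entry["word"]]["documents"].append({
                "document": doc["document"],
                "count": entry["count"],
                "tf": entry.get("tf"),
            })
    # Finalize: same idf expression as in finalizeFunction.
    for value in index.values():
        value["idf"] = math.log(
            (1 + nr_docs) / (1 + 1 + len(value["documents"]))
        )
    return dict(index)


# Hypothetical direct-index documents.
direct_docs = [
    {"document": "doc1", "words": [
        {"word": "apple", "count": 3, "tf": 0.6},
        {"word": "banana", "count": 2, "tf": 0.4},
    ]},
    {"document": "doc2", "words": [
        {"word": "apple", "count": 1, "tf": 1.0},
    ]},
]
index = build_inverse_index(direct_docs)
```

Everything here runs client-side in a single pass, with no per-key JavaScript calls on the server.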