Question

I have a set of 10,000 txt documents containing old Wikipedia articles. These articles were loaded into a MongoDB collection using a custom Java program.

The document for each article looks like this:

{
"_id" : ObjectId("....."),
"doc_id" : 335814,
"terms" : 
    [
          "2012", "2012", "adam", "knick", "basketball", ....
    ]
}

Now I want to count how many times each word occurs in the array, i.e. the so-called term frequency.

The resulting document should look like this:

{
"doc_id" : 335814,
"term_tf": [
      {term: "2012", tf: 2},
      {term: "adam", tf: 1},
      {term: "knick", tf: 1},
      {term: "basketball", tf: 1},
      .....
      ]
}

So far, however, this is as close as I have been able to get:

db.stemmedTerms.aggregate([{$unwind: "$terms" }, {$group: {_id: {id: "$doc_id", term: "$terms"},  tf: {$sum : 1}}}], { allowDiskUse:true } );

{ "_id" : { "id" : 335814, "term" : "2012" }, "tf" : 2 }
{ "_id" : { "id" : 335814, "term" : "adam" }, "tf" : 1 }
{ "_id" : { "id" : 335814, "term" : "knick" }, "tf" : 1 }
{ "_id" : { "id" : 335814, "term" : "basketball" }, "tf" : 1 }

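But as you can see, this document structure does not fit my needs. I want doc_id to appear only once, followed by an array containing all terms and their corresponding term frequencies.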

So I am looking for something that does the opposite of the $unwind operator.

Thanks for your help.

Answer 1

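Add a second $group stage that rebuilds the array with $push, then write the result to a new collection with $out. Your pipeline should look like this: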

db.stemmedTerms.aggregate([
    {$unwind: "$terms" }, 
    // count
    {$group: {
        _id: {id: "$doc_id", term: "$terms"},  
        tf: {$sum : 1}  
    }},
    // build array
    {$group: {
        _id: "$_id.id",  
        term_tf: {$push:  { term: "$_id.term", tf: "$tf" }}
    }},
    // write to new collection
    { $out : "occurences" }     
], 
{ allowDiskUse: true});
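
To make the three stages easier to follow, here is a rough plain-JavaScript imitation of what the pipeline does. It is only a sketch and does not talk to MongoDB: the sample data is shortened from the question, and the names docs, termFrequencies and result are made up for illustration. One difference worth noting is that the real pipeline stores the document id under _id (the $group key) rather than doc_id, and $group does not promise any particular output order, whereas this sketch simply keeps insertion order.

```javascript
// Plain-JavaScript imitation of $unwind -> $group (count) -> $group ($push).
// Hypothetical sample data, shortened from the question.
const docs = [
  { doc_id: 335814, terms: ["2012", "2012", "adam", "knick", "basketball"] },
];

// termFrequencies is an illustrative helper, not a MongoDB API.
function termFrequencies(docs) {
  // "$unwind" + first "$group": walk every term and count it
  // per (doc_id, term) pair.
  const counts = new Map(); // doc_id -> Map(term -> tf)
  for (const { doc_id, terms } of docs) {
    if (!counts.has(doc_id)) counts.set(doc_id, new Map());
    const perDoc = counts.get(doc_id);
    for (const term of terms) {
      perDoc.set(term, (perDoc.get(term) || 0) + 1);
    }
  }
  // Second "$group": one output object per doc_id, "pushing" a
  // { term, tf } entry for each counted pair.
  return Array.from(counts, ([doc_id, perDoc]) => ({
    doc_id,
    term_tf: Array.from(perDoc, ([term, tf]) => ({ term, tf })),
  }));
}

const result = termFrequencies(docs);
console.log(JSON.stringify(result, null, 2));
```

In other words, $group with $push acts as the inverse of $unwind here: the first $group collapses duplicate terms into counts, and the second collapses the per-term rows back into one array per document.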
