Mongodb的同谋

时间:2017-12-05 21:44:31

标签: mongodb self-join

我熟悉关系数据库,比如Mysql。最近,我正在研究使用MongoDb存储100000份文档的项目。这些文档的结构如下:

    {
      "_id" : "1",
      "abstract" : "In this book, we present ....",
      "doi" : "xxxx/xxxxxx",
      "authors" : [ 
                   "Davis, a", 
                   "louis, X", 
                   "CUI, Li", 
                   "FANG, Y"
                    ]
      }

我想提取所有可能的作者组合(仅对)的cooccurence矩阵或cooccurence值。预期的输出是: {[auth1,auth2:5] [auth1,auth3:1] [auth2,auth8:9] ....} 这意味着auth1和auth2 coollaborate 5次(在5本书中)auth2和auth8合作9次.... 在关系数据库中,可能的解决方案是:  例如,在表auth-book中:

    INSERT INTO auth-book (id_book, id_auth) VALUES
              (1, 'auth1'),
              (1, 'auth2'),
              (1, 'auth3'),
              (2, 'auth1'),
              (2, 'auth5'),
              (2, 'auth87'),
              (2, 'auth2')...

计算作者的共同合作或协作的查询是:

     SELECT   a.id_auth a, b.id_auth b, COUNT(*) cnt
     FROM     auth-book a JOIN auth-booke b ON b.id_book= a.id_book AND   .id_auth > a.id_auth
     GROUP BY a.id_auth, b.id_auth

输出:[auth1 auth2 => 2] [auth1 auth3 => 1] [auth2 auth3 => 1] .......等

我不知道如何在mongodb中实现这样的查询

2 个答案:

答案 0 :(得分:0)

可能不是你要找的东西,但这个"有效。"挑战在于创建对列表,这是SQL中的自联接正在做的事情。将配对移出查询也意味着以后更容易从成对更改为三元组,因为find()使用$all函数来确保匹配任意数量的项目。

c=db.foo.aggregate([
// Create single deduped list of authors:
{$unwind: "$authors"}
,{$group: {_id:null, uu: {$addToSet: "$authors"} }}
            ]);

doc = c.next(); // Get single doc output from above
ulist = doc['uu'];
c.close();

// Create unique pairs:
pairs = [];
for(var index = 0; index < ulist.length; index++) {
    var first = ulist[index];
    for(var p2 = index+1; p2 < ulist.length; p2++) {
        pairs.push([first, ulist[p2]]);
    }
}

// And now go get 'em:
pairs.forEach(function(qq) {
    n=db.foo.find({"authors": {$all: qq }}).count();
    print(qq[0] + "-" + qq[1] + ": " + n);
    });

答案 1 :(得分:0)

这有点可怕,但似乎有效。基本上,我们$lookup反对自己,并利用这样一个事实,即将字符串查找到字符串数组中是一种隐式的$in类操作。

 db.foo.aggregate([
 // Create single deduped list of authors:
 {$unwind: "$authors"}
 ,{$group: {_id:null, uu: {$addToSet: "$authors"} }}

 // Now unwind and go lookup into the authors array in the same collection.  If $uu
 // appears anywhere in the authors array, it is a match.
 ,{$unwind: "$uu"}
 ,{$lookup: { from: "foo", localField: "uu", foreignField: "authors", as: "X"}}

 // OK, lots of material in the pipe here.  Brace for a double unwind -- but
 // in this data domain it is probably OK because not every book has the same
 // set of authors, which will attenuate the result $lookup above.
 ,{$unwind: "$X"}
 ,{$unwind: "$X.authors"}

 // The data is starting to be clear here.  But we see two things: 
 // self-matching (author1->author1) and pairs of inverses
 // (author1->author2 / author2->author1).  We don't need self matches and we
 // only need "one side" of the inverses.   So... let's do a strcasecmp
 // on them.  0 means the same so that's out.  So pick 1 or -1 to get one side
 // of the pair; doesn't really matter!  This is why the OP has b.id_auth > a.id_auth
 // in the SQL equivalent: to eliminate self match and one side of inverses.
 ,{$addFields: {q: {$strcasecmp: ["$uu", "$X.authors"] }} }
 ,{$match: {q: 1}}  

 // And now group!
 ,{$group: {_id: {a: "$uu", b: "$X.authors"}, n: {$sum:1}}}
            ]);