我熟悉关系数据库,比如Mysql。最近,我正在研究使用MongoDb存储100000份文档的项目。这些文档的结构如下:
{
"_id" : "1",
"abstract" : "In this book, we present ....",
"doi" : "xxxx/xxxxxx",
"authors" : [
"Davis, a",
"louis, X",
"CUI, Li",
"FANG, Y"
]
}
我想提取所有可能的作者组合(仅对)的cooccurence矩阵或cooccurence值。预期的输出是: {[auth1,auth2:5] [auth1,auth3:1] [auth2,auth8:9] ....} 这意味着auth1和auth2 coollaborate 5次(在5本书中)auth2和auth8合作9次.... 在关系数据库中,可能的解决方案是: 例如,在表auth-book中:
INSERT INTO auth-book (id_book, id_auth) VALUES
(1, 'auth1'),
(1, 'auth2'),
(1, 'auth3'),
(2, 'auth1'),
(2, 'auth5'),
(2, 'auth87'),
(2, 'auth2')...
计算作者的共同合作或协作的查询是:
SELECT a.id_auth a, b.id_auth b, COUNT(*) cnt
FROM auth-book a JOIN auth-booke b ON b.id_book= a.id_book AND .id_auth > a.id_auth
GROUP BY a.id_auth, b.id_auth
输出:[auth1 auth2 => 2] [auth1 auth3 => 1] [auth2 auth3 => 1] .......等
我不知道如何在mongodb中实现这样的查询
答案 0 :(得分:0)
可能不是你要找的东西,但这个"有效。"挑战在于创建对列表,这是SQL中的自联接正在做的事情。将配对移出查询也意味着以后更容易从成对更改为三元组,因为find()
使用$all
函数来确保匹配任意数量的项目。
c=db.foo.aggregate([
// Create single deduped list of authors:
{$unwind: "$authors"}
,{$group: {_id:null, uu: {$addToSet: "$authors"} }}
]);
doc = c.next(); // Get single doc output from above
ulist = doc['uu'];
c.close();
// Create unique pairs:
pairs = [];
for(var index = 0; index < ulist.length; index++) {
var first = ulist[index];
for(var p2 = index+1; p2 < ulist.length; p2++) {
pairs.push([first, ulist[p2]]);
}
}
// And now go get 'em:
pairs.forEach(function(qq) {
n=db.foo.find({"authors": {$all: qq }}).count();
print(qq[0] + "-" + qq[1] + ": " + n);
});
答案 1 :(得分:0)
这有点可怕,但似乎有效。基本上,我们$lookup
反对自己,并利用这样一个事实,即将字符串查找到字符串数组中是一种隐式的$in
类操作。
db.foo.aggregate([
// Create single deduped list of authors:
{$unwind: "$authors"}
,{$group: {_id:null, uu: {$addToSet: "$authors"} }}
// Now unwind and go lookup into the authors array in the same collection. If $uu
// appears anywhere in the authors array, it is a match.
,{$unwind: "$uu"}
,{$lookup: { from: "foo", localField: "uu", foreignField: "authors", as: "X"}}
// OK, lots of material in the pipe here. Brace for a double unwind -- but
// in this data domain it is probably OK because not every book has the same
// set of authors, which will attenuate the result $lookup above.
,{$unwind: "$X"}
,{$unwind: "$X.authors"}
// The data is starting to be clear here. But we see two things:
// self-matching (author1->author1) and pairs of inverses
// (author1->author2 / author2->author1). We don't need self matches and we
// only need "one side" of the inverses. So... let's do a strcasecmp
// on them. 0 means the same so that's out. So pick 1 or -1 to get one side
// of the pair; doesn't really matter! This is why the OP has b.id_auth > a.id_auth
// in the SQL equivalent: to eliminate self match and one side of inverses.
,{$addFields: {q: {$strcasecmp: ["$uu", "$X.authors"] }} }
,{$match: {q: 1}}
// And now group!
,{$group: {_id: {a: "$uu", b: "$X.authors"}, n: {$sum:1}}}
]);