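I'm new to MongoDB and I'm trying to read data from several collections. I want to compute some statistics on GHTorrent, so I'm trying to write a .csv with the data I'm interested in. The problem is that my query has now been running for about 30 minutes. I'm sure my searches, although they work, are less than efficient, but I'm not sure how to improve them.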
First, I do:
    closed_issues = ghdb.issues.find(
        {"state": "closed"},  # query criteria
        {  # projection
            "id": 1,
            "created_at": 1,
            "closed_at": 1,
            "comments": 1,
            "repo": 1,
            "owner": 1,
            "number": 1,
        },
    )
Then, after opening the file and writing the header, I do:
    # `parser` is presumably dateutil's parser (from dateutil import parser).
    # Note: Cursor.count() was removed in PyMongo 4;
    # collection.count_documents(filter) is the replacement there.
    for issue in closed_issues:
        countMentioned = ghdb.issue_events.find({
            "issue_id": issue['number'],
            "repo": issue['repo'],
            "owner": issue['owner'],
            "event": "mentioned"}).count()
        countSubscribed = ghdb.issue_events.find({
            "issue_id": issue['number'],
            "repo": issue['repo'],
            "owner": issue['owner'],
            "event": "subscribed"}).count()
        countAssigned = ghdb.issue_events.find({
            "issue_id": issue['number'],
            "repo": issue['repo'],
            "owner": issue['owner'],
            "event": "assigned"}).count()
        time_created = parser.parse(issue['created_at'])
        time_closed = parser.parse(issue['closed_at'])
        timediff = time_closed - time_created
        f.write(
            str(issue['id']) + "," +
            str(issue['number']) + "," +
            str(issue['repo']) + "," +
            str(issue['owner']) + "," +
            str(timediff.total_seconds()) + "," +
            str(issue['comments']) + "," +
            str(countMentioned) + "," +
            str(countSubscribed) + "," +
            str(countAssigned) + '\n'
        )
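As you can see, for each issue I run three separate finds that share three of their four criteria. What is the most efficient way to search for one combination of issue_id, repo and owner and count each of the three different event values?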
Answer 0 (score: 1)
The MongoDB aggregation framework is a good tool for queries that produce aggregate statistics such as counts: http://docs.mongodb.org/manual/core/aggregation/
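I would start there and experiment a bit. For this kind of use case you can usually begin with an aggregation and then wrap some extra code around the results to export the data in the format you need.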
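As a rough sketch of what that could look like (not tested against GHTorrent itself): the pipeline below keeps only the three event types and groups them by owner, repo, issue_id and event, so that one pass over issue_events replaces the three finds per issue. It assumes the field names used in the question. The helper names build_event_count_pipeline, index_counts and issue_row are made up for illustration and are not part of PyMongo.

```python
from collections import defaultdict

# Event types counted in the question.
EVENTS = ["mentioned", "subscribed", "assigned"]


def build_event_count_pipeline(events=EVENTS):
    """Return an aggregation pipeline that yields one document per
    (owner, repo, issue_id, event) combination with its count."""
    return [
        # Keep only the event types we want to count.
        {"$match": {"event": {"$in": list(events)}}},
        # Count the documents in each (owner, repo, issue_id, event) group.
        {"$group": {
            "_id": {
                "owner": "$owner",
                "repo": "$repo",
                "issue_id": "$issue_id",
                "event": "$event",
            },
            "count": {"$sum": 1},
        }},
    ]


def index_counts(grouped_docs):
    """Turn the aggregation output into
    {(owner, repo, issue_id): {event: count}} for fast lookups."""
    counts = defaultdict(dict)
    for doc in grouped_docs:
        key = doc["_id"]
        issue_key = (key["owner"], key["repo"], key["issue_id"])
        counts[issue_key][key["event"]] = doc["count"]
    return dict(counts)


def issue_row(issue, counts, events=EVENTS):
    """Build the identifying columns plus one count column per event.
    As in the question, issue['number'] is matched against issue_id."""
    per_event = counts.get((issue["owner"], issue["repo"], issue["number"]), {})
    return [issue["id"], issue["number"], issue["repo"], issue["owner"]] + [
        per_event.get(event, 0) for event in events
    ]
```

With PyMongo this would be used roughly as `counts = index_counts(ghdb.issue_events.aggregate(build_event_count_pipeline(), allowDiskUse=True))`, followed by one `issue_row(issue, counts)` per closed issue, written out with `csv.writer`. Holding all counts in a Python dict is an assumption that may not hold for the full dataset. In that case, adding the closed issues' identifiers to the `$match` stage or processing one repository at a time might be worth trying.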