Speeding up Python MongoDB queries

Asked: 2014-04-21 23:15:24

Tags: python mongodb

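I am new to MongoDB. I am trying to read data from several collections. I want to compute some statistics on GHTorrent, so I am trying to write a .csv file with the data I am interested in. The problem is that my script has now been running for about 30 minutes. I am sure my queries are less than efficient, but I am not sure how to improve them.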

First, I do

closed_issues = ghdb.issues.find(
    { "state": "closed" }, # query criteria
    { # projection
    "id": 1,
    "created_at": 1,
    "closed_at": 1,
    "comments": 1,
    "repo": 1,
    "owner": 1,
    "number": 1,
    }
)

Then, after opening the file and writing the header, I do

for issue in closed_issues:
    countMentioned = ghdb.issue_events.find({
        "issue_id": issue['number'],
        "repo": issue['repo'],
        "owner": issue['owner'],
        "event": "mentioned" }).count()
    countSubscribed = ghdb.issue_events.find({
        "issue_id": issue['number'],
        "repo": issue['repo'],
        "owner": issue['owner'],
        "event": "subscribed" }).count()
    countAssigned = ghdb.issue_events.find({
        "issue_id": issue['number'],
        "repo": issue['repo'],
        "owner": issue['owner'],
        "event": "assigned" }).count()
    time_created = parser.parse(issue['created_at'])
    time_closed = parser.parse(issue['closed_at'])
    timediff = time_closed - time_created

    f.write(
        str(issue['id']) +","+
        str(issue['number']) +","+
        str(issue['repo']) +","+
        str(issue['owner']) +","+
        str(timediff.total_seconds()) +","+
        str(issue['comments']) +","+
        str(countMentioned) +","+
        str(countSubscribed) +","+
        str(countAssigned) +'\n'
        )

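As you can see, for each issue I run three different finds that share three of their four criteria. What is the most efficient way to search for one combination of issue_id, repo and owner and count each of the three different event values?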

1 Answer:

Answer 0: (score: 1)

The MongoDB aggregation framework is a good tool for queries that produce aggregate statistics such as counts: http://docs.mongodb.org/manual/core/aggregation/

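I would start there and play around with it a bit. For this kind of use case you can usually start from the aggregation output and then wrap some extra code around the results to export the data in the format you need.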
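
As a rough illustration of that suggestion (not something the answer itself spelled out), the three per-issue counts could be collected in one pass over issue_events with a $match stage followed by a $group stage, then looked up in memory while writing the CSV. This sketch assumes PyMongo 3 or later, where aggregate() returns a cursor, and the field names from the question. The helper names (build_event_count_pipeline, index_counts, load_event_counts, write_issue_row) are hypothetical.

```python
from collections import defaultdict

# The three event types counted in the question.
EVENTS = ("mentioned", "subscribed", "assigned")


def build_event_count_pipeline(events=EVENTS):
    """Pipeline counting issue_events per (issue_id, repo, owner, event)."""
    return [
        # Keep only the event types we want to count.
        {"$match": {"event": {"$in": list(events)}}},
        # One output document per issue/event combination, with its count.
        {"$group": {
            "_id": {
                "issue_id": "$issue_id",
                "repo": "$repo",
                "owner": "$owner",
                "event": "$event",
            },
            "count": {"$sum": 1},
        }},
    ]


def index_counts(rows):
    """Fold aggregation output into {(issue_id, repo, owner): {event: count}}."""
    counts = defaultdict(dict)
    for row in rows:
        key = row["_id"]
        issue_key = (key["issue_id"], key["repo"], key["owner"])
        counts[issue_key][key["event"]] = row["count"]
    return counts


def load_event_counts(issue_events):
    """Run the pipeline once. `issue_events` is assumed to be a PyMongo
    collection, e.g. ghdb.issue_events from the question."""
    cursor = issue_events.aggregate(
        build_event_count_pipeline(),
        allowDiskUse=True,  # assumption: the grouping may be too large for memory
    )
    return index_counts(cursor)


def write_issue_row(writer, issue, counts, seconds_open):
    """Write one CSV row; `writer` is a csv.writer, `seconds_open` is the
    close-minus-create duration computed by the caller."""
    # The question matches issue_events.issue_id against issue['number'].
    per_issue = counts.get((issue["number"], issue["repo"], issue["owner"]), {})
    writer.writerow(
        [issue["id"], issue["number"], issue["repo"], issue["owner"],
         seconds_open, issue["comments"]]
        + [per_issue.get(event, 0) for event in EVENTS]
    )
```

With this shape, the loop over closed_issues would call load_event_counts(ghdb.issue_events) once before it starts and write_issue_row(...) inside it, so the database is queried once for the counts instead of three times per issue. Whether this is actually faster depends on the data size and available indexes, so it is worth measuring on a subset first.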