pymongo takes 24+ hours to loop through 200K records

Posted: 2012-02-11 15:07:38

Tags: python mongodb pymongo

I have two collections in the database, page and pagearchive, that I'm trying to clean up. I noticed that new documents were being created in pagearchive instead of values being added to embedded documents as intended. Basically, what this script does is walk through every document in page, find all copies of that document in pagearchive, move the data I want into a single document, and delete the extras.

The problem is that there are only 200K documents in pagearchive, and, based on the count variable I print at the bottom, it is taking anywhere from 30 minutes to over 60 minutes to iterate through 1000 records. That is horribly slow. The largest duplicate count I've seen is 88 documents, but for the most part, when I query pageArchive on uu, I see only 1-2 duplicate documents.

mongodb is on a single 64-bit machine with 16GB of RAM. The uu key being iterated on the pageArchive collection is a string. I made sure there is an index on that field, db.pagearchive.ensureIndex({uu:1}), and I also did a mongod --repair for good measure.
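
To double-check the index from pymongo itself, here is a minimal sketch using the same connection details as the script below; ensure_index and index_information are standard pymongo collection methods:

from pymongo import Connection

db = Connection('dashboard.dev')['mydb']
# single-string key means an ascending index, same as the shell's ensureIndex({uu:1})
db['pagearchive'].ensure_index('uu')
# should list an entry like u'uu_1' alongside the default u'_id_'
print db['pagearchive'].index_information()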

My guess is that the problem is my sloppy Python code (I'm not very good at it), or perhaps something I'm missing that is necessary for mongodb. Why is it going so slowly, and what can I do to speed it up dramatically?

I thought maybe the uu field being a string was causing the bottleneck, but it's the unique property in the document (or it will be, once I clean up this collection). On top of that, when I stop the process and restart it, it speeds up to about 1000 records per second. It stays that fast until it starts finding duplicates in the collection again, then it slows back down (deleting about 100 records every 10-20 minutes).

from pymongo import Connection
import datetime


def match_dates(old, new):
    if old['coll_at'].month == new['coll_at'].month and old['coll_at'].day == new['coll_at'].day and old['coll_at'].year == new['coll_at'].year:
        return False

    return new

connection = Connection('dashboard.dev')


db = connection['mydb']

pageArchive = db['pagearchive']
pages = db['page']

count = 0
for page in pages.find(timeout=False):

    archive_keep = None
    ids_to_delete = []
    for archive in pageArchive.find({"uu" : page['uu']}):

        if archive_keep == None:
            #this is the first record we found, so we will store data from duplicate records with this one; delete the rest
            archive_keep = archive
        else:
            for attr in archive_keep.keys():
                #make sure we are dealing with an embedded document field
                if isinstance(archive_keep[attr], basestring) or attr == 'updated_at':
                    continue
                else:
                    try:
                        if len(archive_keep[attr]) == 0:
                            continue
                    except TypeError:
                        continue
                    try:
                        #We've got our first embedded doc from a property to compare against
                        for obj in archive_keep[attr]:
                            if archive['_id'] not in ids_to_delete:
                                ids_to_delete.append(archive['_id'])
                            #loop through secondary archive doc (comparing against the archive keep)
                            for attr_old in archive.keys():
                                #make sure we are dealing with an embedded document field
                                if isinstance(archive[attr_old], basestring) or attr_old == 'updated_at':
                                    continue
                                else:
                                    try:
                                        #now we know we're dealing with a list, make sure it has data
                                        if len(archive[attr_old]) == 0:
                                            continue
                                    except TypeError:
                                        continue
                                    if attr == attr_old:
                                        #document prop. match; loop through embedded document array and make sure data wasn't collected on the same day
                                        for obj2 in archive[attr_old]:
                                            new_obj = match_dates(obj, obj2)
                                            if new_obj != False:
                                                archive_keep[attr].append(new_obj)
                    except TypeError, te:
                        pass  # not iterable
        pageArchive.update({
                            '_id':archive_keep['_id']}, 
                           {"$set": archive_keep}, 
                           upsert=False)
        for mongoId in ids_to_delete:
            pageArchive.remove({'_id':mongoId})
        count += 1
        if count % 100 == 0:
            print str(datetime.datetime.now()) + ' ### ' + str(count) 

1 answer:

Answer 0 (score: 2)

I would make the following changes to the code (a sketch with all of them applied follows the list):

  • Have match_dates return None instead of False, and test with if new_obj is not None: this checks the reference without calling the object's __ne__ or __nonzero__.

  • In for page in pages.find(timeout=False):, if only the uu key is used and the pages are big, the fields=['uu'] parameter of find should speed up the queries.

  • Change archive_keep == None to archive_keep is None.

  • archive_keep[attr] is called 4 times. Saving keep_obj = archive_keep[attr] and then using keep_obj will be a little faster.

  • Change ids_to_delete = [] to ids_to_delete = set(). Then if archive['_id'] not in ids_to_delete: will be O(1).
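
For reference, here is a minimal sketch with all five changes applied, using the same Python 2 / pymongo 2.x-era API as the question's script. A few details are defensive additions of my own rather than part of this answer: new entries are collected into to_add and appended after the inner loop so keep_obj is not mutated while being iterated, _id is dropped from the $set document (MongoDB rejects modifications to _id), and the update/remove step runs once per page rather than inside the archive cursor loop.

from pymongo import Connection
import datetime


def match_dates(old, new):
    # Return None for a same-day match so the caller can use an identity
    # test (is not None) instead of != False, which calls __ne__.
    old_at, new_at = old['coll_at'], new['coll_at']
    if (old_at.year, old_at.month, old_at.day) == (new_at.year, new_at.month, new_at.day):
        return None
    return new

connection = Connection('dashboard.dev')
db = connection['mydb']
pageArchive = db['pagearchive']
pages = db['page']

count = 0
# fields=['uu'] asks the server to send back only the uu key instead of
# entire page documents.
for page in pages.find(timeout=False, fields=['uu']):
    archive_keep = None
    ids_to_delete = set()                   # O(1) membership tests
    for archive in pageArchive.find({"uu": page['uu']}):
        if archive_keep is None:            # identity check, not ==
            archive_keep = archive
            continue
        ids_to_delete.add(archive['_id'])
        for attr in archive_keep.keys():
            keep_obj = archive_keep[attr]   # fetch the field once, reuse it
            # skip anything that is not a non-empty embedded list
            if isinstance(keep_obj, basestring) or attr == 'updated_at':
                continue
            try:
                if len(keep_obj) == 0:
                    continue
            except TypeError:
                continue
            old_obj = archive.get(attr)
            if not isinstance(old_obj, list):
                continue
            # collect first, then extend, so keep_obj is not mutated
            # while it is being iterated
            to_add = []
            for obj in keep_obj:
                for obj2 in old_obj:
                    new_obj = match_dates(obj, obj2)
                    if new_obj is not None:
                        to_add.append(new_obj)
            keep_obj.extend(to_add)
    if archive_keep is not None:
        doc = dict(archive_keep)
        doc.pop('_id', None)                # $set may not modify _id
        pageArchive.update({'_id': archive_keep['_id']},
                           {"$set": doc},
                           upsert=False)
        for mongoId in ids_to_delete:
            pageArchive.remove({'_id': mongoId})
    count += 1
    if count % 100 == 0:
        print str(datetime.datetime.now()) + ' ### ' + str(count)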