pymongo: updating the same field value across a large collection

Time: 2020-07-18 07:27:57

Tags: python mongodb pymongo

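I have more than 10 million documents in my mongo collection. In roughly 6 million of them, the field variants.image_url is missing the .jpg image file extension, and I want to update this field so that it ends with .jpg.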

I first run a find query to locate all of these documents and then run an update query for each one, but this is very slow. How can I optimize it?

Example:

{"variants": [{"image_url": "http://assets.myassets.com/assets/images/2020/3/5/158642332113146/Arrow-ShirtsFossil-Smart-WatchesLee-Cooper-Formal-ShoesRoadster-Jeans"}]}

should be changed so that the URL ends in .jpg:

{"variants": [{"image_url": "http://assets.myassets.com/assets/images/2020/3/5/158642332113146/Arrow-ShirtsFossil-Smart-WatchesLee-Cooper-Formal-ShoesRoadster-Jeans.jpg"}]}

# wrapper name is illustrative; the original snippet showed only its body
def update_all_image_urls():
    # query all documents where version is not v2 and return only variants.image_url
    cursor = collection.find({"version": {"$ne": "v2"}}, {"variants.image_url": 1, "_id": 0})

    modified_count = 0
    for record in cursor:
        modified_count = modified_count + update_image_url(record)
    return modified_count


def update_image_url(record):
    for key1 in record:
        # list of variants
        for idx, elem in enumerate(record[key1]):
            # each variant is a dict
            for key2 in elem:
                if str(elem[key2])[-4:] == ".jpg" or str(elem[key2])[-4:] == ".JPG":
                    print(".jpg or .JPG extension present. skipping")
                    return 0
                else:
                    # match the document by the URL value itself
                    query = {"variants.image_url": {"$eq": elem[key2]}}
                    new_value = {"$set": {"variants." + str(idx) + ".image_url": str(elem[key2]) + ".jpg"}}

                    # update_one replaces the legacy collection.update (removed in PyMongo 4)
                    update_result = collection.update_one(query, new_value)
                    print(update_result.modified_count, "documents updated.")
                    return update_result.modified_count
    # nothing to update (e.g. a record without variants)
    return 0

1 answer:

Answer 0 (score: 0)

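I solved this by optimizing the second part, i.e. the update. Instead of matching each document on variants.image_url, I now match it on _id. Because every collection is indexed on _id, each update becomes an index lookup (close to O(1) in practice; strictly a B-tree lookup) rather than a search on the URL value, which is presumably not indexed here.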

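# wrapper name is illustrative; the original snippet showed only its body
def update_all_image_urls():
    # return _id as well (it is included unless explicitly excluded)
    cursor = collection.find({"version": {"$ne": "v2"}}, {"variants.image_url": 1})

    modified_count = 0
    for record in cursor:
        modified_count = modified_count + update_image_url(record)
    return modified_count


def update_image_url(record):
    # iterate the variants list directly; looping over every key of the record
    # would also visit _id, which is not iterable
    for idx, elem in enumerate(record.get("variants", [])):
        # each variant is a dict
        for key2 in elem:
            if str(elem[key2])[-4:] == ".jpg" or str(elem[key2])[-4:] == ".JPG":
                print(".jpg or .JPG extension present. skipping")
                return 0
            else:
                # query on the indexed _id field
                query = {"_id": {"$eq": record["_id"]}}
                new_value = {"$set": {"variants." + str(idx) + ".image_url": str(elem[key2]) + ".jpg"}}

                # update_one replaces the legacy collection.update (removed in PyMongo 4)
                update_result = collection.update_one(query, new_value)
                print(update_result.modified_count, "documents updated.")
                return update_result.modified_count
    # nothing to update (e.g. a record without variants)
    return 0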
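
A possible further refinement, which is not part of the original answer: even with the _id lookup, the loop still sends one update per document, so roughly 6 million round trips to the server. PyMongo's bulk_write can send many UpdateOne operations in a single request. The sketch below assumes the same collection object and version filter as the code above; the names build_set_fields and fix_image_urls and the batch size of 1000 are hypothetical, chosen only for illustration. Unlike the code above, it fixes every variant in a document rather than returning after the first one.

```python
def build_set_fields(record):
    """Return the $set fields for every variant whose image_url lacks .jpg.

    Keys look like "variants.<idx>.image_url"; the check is case-insensitive,
    so URLs ending in .JPG are left alone.
    """
    fields = {}
    for idx, variant in enumerate(record.get("variants", [])):
        url = variant.get("image_url")
        if isinstance(url, str) and not url.lower().endswith(".jpg"):
            fields["variants." + str(idx) + ".image_url"] = url + ".jpg"
    return fields


def fix_image_urls(collection, batch_size=1000):
    """Apply the fixes in batches; batch_size is an assumed, tunable value."""
    # imported here so that build_set_fields has no dependency on pymongo
    from pymongo import UpdateOne

    cursor = collection.find({"version": {"$ne": "v2"}}, {"variants.image_url": 1})
    ops = []
    modified = 0
    for record in cursor:
        fields = build_set_fields(record)
        if fields:
            # still matching on the indexed _id field, as in the answer
            ops.append(UpdateOne({"_id": record["_id"]}, {"$set": fields}))
        if len(ops) >= batch_size:
            # ordered=False lets the server continue past an individual failure
            modified += collection.bulk_write(ops, ordered=False).modified_count
            ops = []
    if ops:
        # flush the final partial batch
        modified += collection.bulk_write(ops, ordered=False).modified_count
    return modified
```

Calling fix_image_urls(collection) returns the total number of modified documents. Whether 1000 is a good batch size depends on document size and network latency, so it is worth measuring on a small subset first.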