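I have more than 10 million documents in my MongoDB collection. In roughly 6 million of them, the field variants.image_url is missing the image file extension .jpg, and I want to update that field so that it ends in .jpg.
I first run a find query to locate all such documents and then run an update query for each one, but this is very slow. How can I optimize it?
Example: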
{"variants": [{"image_url": "http://assets.myassets.com/assets/images/2020/3/5/158642332113146/Arrow-ShirtsFossil-Smart-WatchesLee-Cooper-Formal-ShoesRoadster -Jeans"}]}
will be changed to (with .jpg appended):
{"variants": [{"image_url": "http://assets.myassets.com/assets/images/2020/3/5/158642332113146/Arrow-ShirtsFossil-Smart-WatchesLee-Cooper-Formal-ShoesRoadster -Jeans.jpg"}]}
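The transformation in the example can be sketched as a small pure function. This is only an illustration: the helper name add_jpg_extension is hypothetical, and the case-insensitive check is an assumption that is slightly broader than testing for ".jpg" and ".JPG" alone.

```python
def add_jpg_extension(url: str) -> str:
    """Return url with ".jpg" appended unless it already ends in .jpg (any case)."""
    if url.lower().endswith(".jpg"):
        return url
    return url + ".jpg"


# The URL from the example above, copied as-is.
example = (
    "http://assets.myassets.com/assets/images/2020/3/5/158642332113146/"
    "Arrow-ShirtsFossil-Smart-WatchesLee-Cooper-Formal-ShoesRoadster -Jeans"
)
fixed = add_jpg_extension(example)
print(fixed)
```

Applying the function a second time leaves the value unchanged, so re-running a migration built on it should not append the extension twice.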
def update_all_image_urls():  # enclosing function name is illustrative; the original showed only its body
    # query all documents whose version is not v2 and return only variants.image_url
    cursor = collection.find({"version": {"$ne": "v2"}}, {"variants.image_url": 1, "_id": 0})
    modified_count = 0
    for record in cursor:
        modified_count = modified_count + update_image_url(record)
    return modified_count

def update_image_url(record):
    for key1 in record:
        # record[key1] is the "variants" list
        for idx, elem in enumerate(record[key1]):
            # elem is a dict holding image_url
            for key2 in elem:
                if str(elem[key2])[-4:] == ".jpg" or str(elem[key2])[-4:] == ".JPG":
                    print(".jpg or .JPG extension present. skipping")
                    return 0
                else:
                    # the update filter matches on the URL value itself
                    query = {"variants.image_url": {"$eq": elem[key2]}}
                    new_value = {"$set": {"variants." + str(idx) + ".image_url": str(elem[key2]) + ".jpg"}}
                    # collection.update is the legacy PyMongo 3 call (removed in PyMongo 4)
                    update_result = collection.update(query, new_value)
                    print(update_result["nModified"], "nModified documents updated.")
                    return update_result["nModified"]
    # no variants found in this record
    return 0
Answer 0 (score: 0)
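I solved this by optimizing the second part, the update. Instead of filtering on variants.image_url, which without an index on that field means searching the collection for the matching URL, I now filter on _id. Because _id is always indexed, each update is a single index lookup. I think of this as O(1) per update; strictly speaking a B-tree index lookup is O(log n), but in practice it is close to constant.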
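As a rough sketch of the idea described above, the filter and update for one document can be built without touching the database. The helper name build_update and the example.com URLs are hypothetical, and collecting all variants of a document into a single $set is an assumption of this sketch rather than something the answer does.

```python
def build_update(record):
    """Return (filter, update) for one document, or None if nothing needs changing."""
    changes = {}
    for idx, variant in enumerate(record.get("variants", [])):
        url = variant.get("image_url")
        # Only string URLs that do not already end in .jpg (any case) are rewritten.
        if isinstance(url, str) and not url.lower().endswith(".jpg"):
            changes[f"variants.{idx}.image_url"] = url + ".jpg"
    if not changes:
        return None
    # Filtering on _id lets the server use the index that always exists on that field.
    return {"_id": record["_id"]}, {"$set": changes}


# Made-up record for illustration only.
record = {
    "_id": 1,
    "variants": [
        {"image_url": "http://example.com/a"},
        {"image_url": "http://example.com/b.jpg"},
    ],
}
print(build_update(record))
```

With PyMongo, the returned pair could be passed to collection.update_one(filter, update), so each document costs at most one write regardless of how many variants it has.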
def update_all_image_urls():  # enclosing function name is illustrative; the original showed only its body
    # return _id as well (it is included by default unless excluded in the projection)
    cursor = collection.find({"version": {"$ne": "v2"}}, {"variants.image_url": 1})
    modified_count = 0
    for record in cursor:
        modified_count = modified_count + update_image_url(record)
    return modified_count

def update_image_url(record):
    modified = 0
    # iterate over the variants list only; looping over every key would also hit _id, which is not iterable
    for idx, elem in enumerate(record.get("variants", [])):
        if "image_url" not in elem:
            continue
        url = str(elem["image_url"])
        if url[-4:] == ".jpg" or url[-4:] == ".JPG":
            print(".jpg or .JPG extension present. skipping")
            continue
        # query on the _id field, which is always indexed
        query = {"_id": {"$eq": record["_id"]}}
        new_value = {"$set": {"variants." + str(idx) + ".image_url": url + ".jpg"}}
        # collection.update is the legacy PyMongo 3 call (removed in PyMongo 4)
        update_result = collection.update(query, new_value)
        print(update_result["nModified"], "nModified documents updated.")
        modified = modified + update_result["nModified"]
    return modified
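A possible further step, not part of the answer above, is to send the per-document updates in batches instead of one round trip each. The sketch below only prepares the batches in plain Python. The names batched and pending_updates, the batch size, and the example.com URLs are all hypothetical.

```python
from itertools import islice


def batched(iterable, size):
    """Yield lists of at most `size` items from iterable."""
    it = iter(iterable)
    while True:
        batch = list(islice(it, size))
        if not batch:
            return
        yield batch


def pending_updates(records):
    """Yield one (filter, update) pair per document that needs a change."""
    for record in records:
        changes = {}
        for idx, variant in enumerate(record.get("variants", [])):
            url = variant.get("image_url")
            if isinstance(url, str) and not url.lower().endswith(".jpg"):
                changes[f"variants.{idx}.image_url"] = url + ".jpg"
        if changes:
            yield {"_id": record["_id"]}, {"$set": changes}


# Made-up records standing in for the cursor returned by find().
records = [
    {"_id": i, "variants": [{"image_url": f"http://example.com/img{i}"}]}
    for i in range(5)
]
batches = list(batched(pending_updates(records), 2))
print([len(b) for b in batches])
```

With PyMongo, each pair could be wrapped as pymongo.UpdateOne(filter, update) and each batch passed to collection.bulk_write(batch, ordered=False). Whether this helps, and which batch size works best, would need to be measured on the real collection.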