I'm new to MongoDB and Python and have to write a script using pymongo. There is a website where users run searches; the backend is MongoDB, with one collection storing every user's search history and another collection storing the users themselves.
I need to iterate over all users, fetch each one's search history for the past 30 days, take the total, and set that total in one of the user's fields. Below is what I wrote. Is there a way to speed this up, e.g. by switching to an aggregation, or by using multithreading or making it asynchronous?
import pymongo
from datetime import datetime, timedelta
from bson.objectid import ObjectId

def lambda_handler(event, context):
    mongohost = '10.0.0.1'
    mongoport = 27017
    mongoclient = pymongo.MongoClient(mongohost, mongoport)
    mongodb = mongoclient["maindb"]
    mongo_search_logs_collection = mongodb["searchlogs"]
    mongo_users_collection = mongodb["users"]
    days_to_subtract_from_today = 30
    search_count_start_date = (datetime.today() - timedelta(days_to_subtract_from_today)).date()
    count = 0
    # Iterate over all users and update searchCount value
    for x in mongo_users_collection.find():
        # Get total searches last X days
        total_search_count = mongo_search_logs_collection.count_documents({
            'createdBy': ObjectId(x['_id']),
            'created': {'$gte': datetime(search_count_start_date.year, search_count_start_date.month, search_count_start_date.day)}
        })
        # Update searchCount value
        mongo_users_collection.update_one({
            '_id': ObjectId(x['_id'])
        }, {
            '$set': {
                'searchCount': total_search_count
            }
        }, upsert=False)
        # Increment counter
        count += 1
    print("Processed " + str(count) + " records")
Answer 0 (score: 1)
This could be one way to do it, using an aggregation plus bulk operations:
import pymongo
from datetime import datetime, timedelta
from bson.objectid import ObjectId

def lambda_handler(event, context):
    mongohost = '10.0.0.1'
    mongoport = 27017
    mongoclient = pymongo.MongoClient(mongohost, mongoport)
    mongodb = mongoclient["maindb"]
    mongo_search_logs_collection = mongodb["searchlogs"]
    mongo_users_collection = mongodb["users"]
    days_to_subtract_from_today = 30
    search_count_start_date = (datetime.today() - timedelta(days_to_subtract_from_today)).date()
    # One aggregation computes the per-user counts server-side
    cursor = mongo_search_logs_collection.aggregate([
        {
            "$match": {
                "created": {"$gte": datetime(search_count_start_date.year, search_count_start_date.month, search_count_start_date.day)}
            }
        },
        {
            "$group": {
                "_id": "$createdBy", "searchCount": {"$sum": 1}
            }
        }
    ])
    # NOTE: initialize_unordered_bulk_op() was deprecated in PyMongo 3.5 and
    # removed in 4.0; see the bulk_write() sketch below for newer drivers.
    bulk = mongo_users_collection.initialize_unordered_bulk_op()
    for res in cursor:
        # The old bulk API takes no upsert kwarg; without a .upsert() modifier
        # this is a plain (non-upserting) update, matching the original intent.
        bulk.find({"_id": res["_id"]}).update_one({"$set": {"searchCount": res["searchCount"]}})
    # execute() raises InvalidOperation if no operations were queued
    bulk.execute()
Let me know if you have any questions or doubts, since I haven't tested it ;)
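On PyMongo 3.5+ the bulk op builder above is deprecated (and removed in 4.0); the supported replacement is bulk_write() with a list of UpdateOne requests. A minimal, untested sketch of the same update step, reusing cursor and mongo_users_collection from the code above:

from pymongo import UpdateOne

# One UpdateOne request per aggregated user, sent in a single round trip.
# ordered=False mirrors the unordered bulk op: the server may apply the
# updates in any order and keeps going after individual failures.
requests = [
    UpdateOne({"_id": res["_id"]}, {"$set": {"searchCount": res["searchCount"]}})
    for res in cursor
]
if requests:  # bulk_write() raises InvalidOperation on an empty request list
    result = mongo_users_collection.bulk_write(requests, ordered=False)
    print("Matched " + str(result.matched_count) + ", modified " + str(result.modified_count))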
Answer 1 (score: 1)
Querying mongo_search_logs_collection repeatedly inside the loop is what slows this down. Instead, you can compute every user's searchCount in one go and then apply the updates; that will be much faster. The statement below fetches the search counts for all users in a single pass:
mongo_search_logs_collection.aggregate([
    {
        "$match": {
            "created": {
                "$gte": datetime(search_count_start_date.year, search_count_start_date.month, search_count_start_date.day)
            }
        }
    },
    {
        "$group": {
            "_id": "$createdBy",
            "total_search_count": {"$sum": 1}
        }
    }
])
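If your MongoDB server is 4.2 or newer, you could go one step further and let the aggregation write the counts itself with a $merge stage, removing the client-side update loop entirely. A hedged, untested sketch, assuming the users collection lives in the same database and its _id values match searchlogs.createdBy:

mongo_search_logs_collection.aggregate([
    {"$match": {"created": {"$gte": datetime(search_count_start_date.year, search_count_start_date.month, search_count_start_date.day)}}},
    # Group per user; naming the field searchCount lets $merge write it as-is
    {"$group": {"_id": "$createdBy", "searchCount": {"$sum": 1}}},
    # Merge each {_id, searchCount} result into the matching user document;
    # whenNotMatched: "discard" drops counts whose user no longer exists
    {"$merge": {"into": "users", "on": "_id", "whenMatched": "merge", "whenNotMatched": "discard"}}
])

Note that this, like both answers, only touches users who actually searched in the window; users with no recent searches keep their previous searchCount, so you may want to reset the field beforehand if stale counts matter.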