I have a MongoEngine model that represents one entry of a distance matrix:
class Distance(db.Document):
    """
    An instance of the distance matrix.
    """
    orig = db.ObjectIdField(required=True, unique_with='dest')
    dest = db.ObjectIdField(required=True, unique_with='orig')
    paths = db.ListField(db.EmbeddedDocumentField(PathEmbedded))
    meta = {
        'indexes': [
            {'fields': ['orig'], 'cls': False},
            {'fields': ['dest'], 'cls': False},
            {'fields': ['orig', 'dest'], 'cls': False},
            {'fields': ['dest', 'orig'], 'cls': False}
        ],
    }
MongoEngine does not support aggregation, so I wrote a pymongo aggregation that fetches all distances between two lists of location IDs:
from bson import json_util
from pymongo import MongoClient

pipeline = [
    {'$match': {
        'orig': {
            '$in': list_of_origins
        },
        'dest': {
            '$in': list_of_destinations
        }
    }},
    {'$project': {
        'paths': 1,
        '_id': 0,
        'ids': {'orig': '$orig', 'dest': '$dest'}
    }},
]

# Open a pymongo connection to the db
# (we're skipping the MongoEngine ORM because it lacks support for aggregation)
pymongo_connection = MongoClient(host=config.db_host, port=config.db_port)
pymongo_db = pymongo_connection[config.db_name]
cursor = pymongo_db.distance.aggregate(pipeline, cursor={}, explain=False)
response.data = json_util.dumps(cursor)
At the moment list_of_origins and list_of_destinations each contain 100 elements, so the query fetches 10,000 distances.
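To make the size of the result concrete: a $match with an $in filter on both orig and dest selects every stored document whose (orig, dest) pair falls in the cross product of the two lists. Below is a minimal in-memory sketch of that behaviour. It uses made-up string IDs in place of ObjectIds and assumes that one document exists for every pair; match_in is a hypothetical helper, not a pymongo call.

```python
from itertools import product

# Hypothetical stand-ins for the 100 origin and 100 destination ObjectIds.
list_of_origins = [f"orig-{i}" for i in range(100)]
list_of_destinations = [f"dest-{i}" for i in range(100)]

# Assumption: the collection holds one document per (orig, dest) pair,
# including some pairs that involve IDs outside the requested lists.
collection = [
    {"orig": o, "dest": d, "paths": []}
    for o, d in product(list_of_origins + ["orig-other"],
                        list_of_destinations + ["dest-other"])
]


def match_in(docs, origins, destinations):
    """Mimic {'orig': {'$in': origins}, 'dest': {'$in': destinations}}."""
    origin_set, destination_set = set(origins), set(destinations)
    return [d for d in docs
            if d["orig"] in origin_set and d["dest"] in destination_set]


matched = match_in(collection, list_of_origins, list_of_destinations)
print(len(collection), len(matched))
```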
The average running time of this query in pymongo is about 1.15 seconds.
That seemed slow to me, so for comparison I wrote the same aggregation in JS:
db.distance.aggregate([
    {$match: {
        'dest': {$in: [
            ObjectId('5436828e4ee264cf95bbb2a0'),
            ObjectId('543682904ee264cf95bbbd1d'),
            ...
            ObjectId('5436828e4ee264cf95bbb23a')
        ]},
        'orig': {$in: [
            ObjectId('5436828e4ee264cf95bbb0e1'),
            ObjectId('543682904ee264cf95bbbe5b'),
            ...
            ObjectId('543682904ee264cf95bbbc86')
        ]}
    }},
    {$project: {
        'paths': 1,
        '_id': 0,
        'ids': {'dest': '$dest', 'orig': '$orig'}}
    }
])
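and ran it from the shell to time it: time mongo < distance_matrix_aggregation.js
Run this way, it averages 65 ms.
So going through pymongo makes the query almost 20 times slower (1.15 s versus 65 ms). Given that this application will eventually need to return distance matrices with millions of elements, a slowdown of this size will become a serious problem.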
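To bring the query time down I ran explain to check whether the indexes are being used, but I get an error: "planError" : "InternalError No plan available to provide stats"
This appears to be related to a MongoDB bug. Since I explicitly declare the indexes in the MongoEngine model, though, I don't think an index is missing.
I'd like to know where the performance loss actually occurs: does json_util.dumps(cursor) iterate over the cursor element by element and so trigger a large number of round trips to the database? I have already tried setting cursor.batch_size(20000) and saw no improvement. Any tips for improving performance?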
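One way to narrow this down is to time each stage separately: issuing the aggregation, draining the cursor, and serializing the result. The sketch below is only an illustration of that idea. time_stages is a hypothetical helper, and the stages use in-memory stand-ins plus the standard json module so that it runs without a database; the comment shows where the pymongo calls from the question would go.

```python
import json
import time


def time_stages(stages):
    """Run callables in order, feeding each one the previous result.

    `stages` is a list of (name, callable) pairs. Returns the final
    result and a dict mapping each stage name to its elapsed seconds.
    """
    timings = {}
    value = None
    for index, (name, func) in enumerate(stages):
        start = time.perf_counter()
        # The first stage takes no input; later stages consume the prior value.
        value = func() if index == 0 else func(value)
        timings[name] = time.perf_counter() - start
    return value, timings


# Stand-in data so the sketch runs without a database. With the code from
# the question, the stages would instead be something like:
#   ("aggregate", lambda: pymongo_db.distance.aggregate(pipeline)),
#   ("drain", list),
#   ("serialize", json_util.dumps),
fake_docs = [{"ids": {"orig": i, "dest": j}, "paths": []}
             for i in range(100) for j in range(100)]

payload, timings = time_stages([
    ("aggregate", lambda: iter(fake_docs)),
    ("drain", list),
    ("serialize", json.dumps),
])

for name, seconds in timings.items():
    print(f"{name}: {seconds * 1000:.1f} ms")
```

If most of the time lands in the drain stage, the cost is in fetching and decoding documents; if it lands in the serialize stage, the JSON conversion is the place to look.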
Update
As suggested in the comments, I ran:
> db.distance.find({ "orig" : { "$in" : origins }, "dest" : { "$in" : destinations }}, { "paths" : 1, "_id" : 0 }).explain()
which returned:
{
    "cursor" : "BtreeCursor orig_1_dest_1",
    "isMultiKey" : false,
    "n" : 10100,
    "nscannedObjects" : 10100,
    "nscanned" : 10200,
    "nscannedObjectsAllPlans" : 10405,
    "nscannedAllPlans" : 10506,
    "scanAndOrder" : false,
    "indexOnly" : false,
    "nYields" : 81,
    "nChunkSkips" : 0,
    "millis" : 26,
    "indexBounds" : {
        "orig" : [
            [
                ObjectId("5436828e4ee264cf95bbb0e1"),
                ObjectId("5436828e4ee264cf95bbb0e1")
            ],
            <SNIP - 100 more OIds >
        ],
        "dest" : [
            [
                ObjectId("5436828e4ee264cf95bbb0e1"),
                ObjectId("5436828e4ee264cf95bbb0e1")
            ],
            <SNIP - 100 more OIds >
        ]
    },
    "server" : <SNIP>,
    "filterSet" : false
}
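The counters in this output give a quick check of how well the index serves the query. Here is a small sketch with the numbers copied from the output above; scan_summary is a hypothetical helper name. A ratio of n to nscanned close to 1 suggests that few index keys were examined without producing a result.

```python
# Counters copied from the explain() output above.
explain_output = {
    "cursor": "BtreeCursor orig_1_dest_1",
    "n": 10100,
    "nscannedObjects": 10100,
    "nscanned": 10200,
    "indexOnly": False,
    "millis": 26,
}


def scan_summary(explain):
    """Summarise how much work the server did per returned document."""
    return {
        # Fraction of examined index keys that led to a returned document.
        "returned_per_key": explain["n"] / explain["nscanned"],
        # Documents fetched per returned document (1.0 means no wasted fetches).
        "fetched_per_returned": explain["nscannedObjects"] / explain["n"],
        # A BtreeCursor means an index was used rather than a collection scan.
        "uses_index": explain["cursor"].startswith("BtreeCursor"),
    }


summary = scan_summary(explain_output)
print(summary)
```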
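So the query itself runs in 26 ms here, which is fine. Unfortunately, the equivalent MongoEngine query, Distance.objects.filter(orig__in=origins, dest__in=destinations), takes 3.9 seconds, and I really don't know what to make of its explain() output: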
{
    "nYields": 81,
    "nscannedAllPlans": 10608,
    "filterSet": false,
    "allPlans": [
        {
            "nChunkSkips": 0,
            "n": 10100,
            "cursor": "BtreeCursor orig_1_dest_1",
            "scanAndOrder": false,
            "indexBounds": {
                "dest": [ <SNIP> ],
                "orig": [ <SNIP> ],
                ]
            },
            "nscannedObjects": 10100,
            "isMultiKey": false,
            "indexOnly": false,
            "nscanned": 10200
        },
        {
            "nChunkSkips": 0,
            "n": 101,
            "cursor": "BtreeCursor dest_1_orig_1",
            "scanAndOrder": false,
            "indexBounds": {
                "dest": [ <SNIP> ],
                "orig": [ <SNIP> ],
            },
            "nscannedObjects": 101,
            "isMultiKey": false,
            "indexOnly": false,
            "nscanned": 102
        },
        {
            "nChunkSkips": 0,
            "n": 97,
            "cursor": "BtreeCursor orig_1",
            "scanAndOrder": false,
            "indexBounds": {
                "orig": [ <SNIP> ],
            },
            "nscannedObjects": 102,
            "isMultiKey": false,
            "indexOnly": false,
            "nscanned": 102
        },
        {
            "nChunkSkips": 0,
            "n": 95,
            "cursor": "BtreeCursor dest_1",
            "scanAndOrder": false,
            "indexBounds": {
                "dest": [ <SNIP> ],
            },
            "nscannedObjects": 102,
            "isMultiKey": false,
            "indexOnly": false,
            "nscanned": 102
        },
        {
            "nChunkSkips": 0,
            "n": 94,
            "cursor": "BtreeCursor _cls_1",
            "scanAndOrder": false,
            "indexBounds": {
                "_cls": [
                    [
                        "Distance",
                        "Distance"
                    ]
                ]
            },
            "nscannedObjects": 102,
            "isMultiKey": false,
            "indexOnly": false,
            "nscanned": 102
        }
    ],
    "millis": 11,
    "nChunkSkips": 0,
    "server": "darkStar9:27017",
    "n": 10100,
    "cursor": "BtreeCursor orig_1_dest_1",
    "scanAndOrder": false,
    "indexBounds": {
        "dest": [ <SNIP> ],
        "orig": [ <SNIP> ],
    },
    "nscannedObjectsAllPlans": 10507,
    "isMultiKey": false,
    "stats": {
        "works": 10201,
        "isEOF": 1,
        "needFetch": 0,
        "needTime": 100,
        "yields": 81,
        "invalidates": 0,
        "unyields": 81,
        "type": "KEEP_MUTATIONS",
        "children": [
            {
                "works": 10201,
                "isEOF": 1,
                "forcedFetches": 0,
                "needFetch": 0,
                "matchTested": 10100,
                "needTime": 100,
                "yields": 81,
                "alreadyHasObj": 0,
                "invalidates": 0,
                "unyields": 81,
                "type": "FETCH",
                "children": [
                    {
                        "works": 10201,
                        "boundsVerbose": "field #0['orig']: [ <SNIP> ], field #1['dest']: [ <SNIP> ]",
                        "dupsTested": 0,
                        "yieldMovedCursor": 0,
                        "isEOF": 1,
                        "needFetch": 0,
                        "matchTested": 0,
                        "needTime": 100,
                        "keysExamined": 10200,
                        "seenInvalidated": 0,
                        "dupsDropped": 0,
                        "yields": 81,
                        "isMultiKey": 0,
                        "invalidates": 0,
                        "unyields": 81,
                        "type": "IXSCAN",
                        "children": [ ],
                        "advanced": 10100,
                        "keyPattern": "{ orig: 1, dest: 1 }"
                    }
                ],
                "advanced": 10100
            }
        ],
        "advanced": 10100
    },
    "indexOnly": false,
    "nscanned": 10200,
    "nscannedObjects": 10100
}
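One way to read this output next to the 3.9 second measurement is to compare the server-reported millis with the wall-clock time seen by the client. This is a rough sketch using the two numbers quoted above. The function name split_time is hypothetical, and the split is only an estimate, since millis reflects only the server's own execution of the query.

```python
def split_time(wall_seconds, explain_millis):
    """Split wall-clock time into server-reported and remaining time."""
    server_seconds = explain_millis / 1000.0
    other_seconds = wall_seconds - server_seconds
    return {
        "server_seconds": server_seconds,
        "other_seconds": other_seconds,
        "server_share": server_seconds / wall_seconds,
    }


# 3.9 s measured for the MongoEngine query, "millis": 11 from its explain().
result = split_time(wall_seconds=3.9, explain_millis=11)
print(f"server: {result['server_seconds']:.3f} s, "
      f"elsewhere: {result['other_seconds']:.3f} s, "
      f"server share: {result['server_share']:.2%}")
```

Whatever remains is spent outside the server's query execution, for example on network transfer, BSON decoding, or building document objects on the client. This sketch cannot tell those apart.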