How to improve the performance of a Mongo aggregation

Time: 2014-12-19 18:35:24

Tags: python mongodb pymongo mongoengine

I have a MongoEngine model that represents one entry of a distance matrix:

class Distance(db.Document):
    """
        An instance of the distance matrix.
    """

    orig = db.ObjectIdField(required=True, unique_with='dest')
    dest = db.ObjectIdField(required=True, unique_with='orig')
    paths = db.ListField(db.EmbeddedDocumentField(PathEmbedded))

    meta = {
        'indexes': [
            {'fields': ['orig'], 'cls': False},
            {'fields': ['dest'], 'cls': False},
            {'fields': ['orig', 'dest'], 'cls': False},
            {'fields': ['dest', 'orig'], 'cls': False}
        ],
    }

Because MongoEngine does not support aggregation, I wrote a pymongo aggregation that fetches all distances between two lists of location IDs:

    pipeline = [
        {'$match': {
            'orig': {
                '$in': list_of_origins
            },
            'dest': {
                '$in': list_of_destinations
            }
        }},
        {'$project': {
            'paths': 1,
            '_id': 0,
            'ids': {'orig': '$orig', 'dest': '$dest'}
        }},
    ]

    # Open a pymongo connection to the db
    # (we're skipping the mongoengine ORM because it lacks support for aggregation)
    pymongo_connection = MongoClient(host=config.db_host, port=config.db_port)
    pymongo_db = pymongo_connection[config.db_name]

    cursor = pymongo_db.distance.aggregate(pipeline, cursor={}, explain=False)

    response.data = json_util.dumps(cursor)

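Currently list_of_origins and list_of_destinations each contain 100 elements, so the query fetches 10,000 distances.
The average running time of this query in pymongo is about 1.15 seconds. That seemed slow to me, so for comparison I wrote the same aggregation in JS: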

db.distance.aggregate([
    {$match: {
        'dest': {$in: [
            ObjectId('5436828e4ee264cf95bbb2a0'),
            ObjectId('543682904ee264cf95bbbd1d'),
            ...
            ObjectId('5436828e4ee264cf95bbb23a')
        ]}, 
        'orig': {$in: [
            ObjectId('5436828e4ee264cf95bbb0e1'),
            ObjectId('543682904ee264cf95bbbe5b'),
            ...
            ObjectId('543682904ee264cf95bbbc86')
        ]}
    }},
    {$project: {
        'paths': 1,
        '_id': 0,
        'ids': {'dest': '$dest', 'orig': '$orig'}}
    }
])

and ran it from the shell to time it: time mongo < distance_matrix_aggregation.js
The average running time in this form is 65 milliseconds.

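So using pymongo makes the query almost 20 times slower (about 1.15 s versus 65 ms). Given that this application will eventually need to return distance matrices with millions of elements, this performance penalty will become a serious problem.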

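To reduce the query time I ran explain to check whether the indexes are being used, but I got an error: "planError" : "InternalError No plan available to provide stats", which seems to be related to a MongoDB bug. Since I explicitly created the indexes in the MongoEngine model, though, I don't think an index is missing.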
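Since explain fails on the aggregation, one way to confirm that the compound index really exists is to inspect what pymongo's `Collection.index_information()` returns. The following is a minimal sketch: the helper name `find_index` and the sample dictionary are hypothetical, and the sample only imitates the shape of the real return value (index name mapped to a spec whose `key` is a list of `(field, direction)` pairs).

```python
def find_index(index_info, fields):
    """Return the name of the index whose key fields match `fields` in order, or None."""
    wanted = list(fields)
    for name, spec in index_info.items():
        # Each key entry is a (field, direction) pair; only the field order matters here.
        if [field for field, _direction in spec["key"]] == wanted:
            return name
    return None


# Illustrative data in the shape returned by Collection.index_information();
# these entries are assumptions, not copied from the real database.
sample_info = {
    "_id_": {"key": [("_id", 1)]},
    "orig_1_dest_1": {"key": [("orig", 1), ("dest", 1)], "unique": True},
    "dest_1_orig_1": {"key": [("dest", 1), ("orig", 1)]},
}

print(find_index(sample_info, ["orig", "dest"]))
```

Against the real collection the call would presumably look like `find_index(pymongo_db.distance.index_information(), ['orig', 'dest'])`.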

I would like to know where the performance loss actually occurs:

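  1. Does pymongo do some preprocessing before the actual query is executed against MongoDB? If so, running a raw query should avoid it, so I looked for a way to run a raw MongoDB query from pymongo but could not find one.
  2. Is it because json_util.dumps(cursor) iterates over each cursor element individually, causing a large number of database round trips? I tried setting cursor.batch_size(20000) and saw no performance improvement...
  3. Any tips for improving performance?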
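To tell questions 1 and 2 apart, the three client-side steps (sending the command, draining the cursor, serializing) can be timed separately. This is only a sketch under assumptions: `time_stages` and `FakeCollection` are hypothetical names, the fake exists solely so the code runs without a server, `json.dumps` stands in for `json_util.dumps`, and it assumes a pymongo version in which `aggregate(pipeline)` returns a cursor.

```python
import json
import time


def time_stages(collection, pipeline, serialize=json.dumps):
    """Time each client-side step separately; returns (seconds per step, payload)."""
    timings = {}

    start = time.perf_counter()
    cursor = collection.aggregate(pipeline)  # send the aggregate command
    timings["aggregate"] = time.perf_counter() - start

    start = time.perf_counter()
    documents = list(cursor)  # pull every batch and decode it into dicts
    timings["fetch"] = time.perf_counter() - start

    start = time.perf_counter()
    payload = serialize(documents)  # convert the in-memory list to JSON text
    timings["serialize"] = time.perf_counter() - start

    return timings, payload


class FakeCollection:
    """Hypothetical stand-in for a pymongo collection, used only to run the sketch."""

    def __init__(self, documents):
        self._documents = documents

    def aggregate(self, pipeline):
        return iter(self._documents)


timings, payload = time_stages(FakeCollection([{"paths": []}]), [])
print(sorted(timings))
```

With the real collection, `time_stages(pymongo_db.distance, pipeline, serialize=json_util.dumps)` would show whether the 1.15 seconds goes to the server round trip, to decoding the batches, or to the JSON serialization.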

Update

As suggested in the comments, I ran:

    > db.distance.find({ "orig" : { "$in" : origins }, "dest" : { "$in" : destinations }}, { "paths" : 1, "_id" : 0 }).explain()
    

which returned:

    {
            "cursor" : "BtreeCursor orig_1_dest_1",
            "isMultiKey" : false,
            "n" : 10100,
            "nscannedObjects" : 10100,
            "nscanned" : 10200,
            "nscannedObjectsAllPlans" : 10405,
            "nscannedAllPlans" : 10506,
            "scanAndOrder" : false,
            "indexOnly" : false,
            "nYields" : 81,
            "nChunkSkips" : 0,
            "millis" : 26,
            "indexBounds" : {
                    "orig" : [
                            [
                                    ObjectId("5436828e4ee264cf95bbb0e1"),
                                    ObjectId("5436828e4ee264cf95bbb0e1")
                            ],
                            <SNIP - 100 more OIds >
                    ],
                    "dest" : [
                            [
                                    ObjectId("5436828e4ee264cf95bbb0e1"),
                                    ObjectId("5436828e4ee264cf95bbb0e1")
                            ],
                            <SNIP - 100 more OIds >
                    ]
            },
            "server" : <SNIP>,
            "filterSet" : false
    }
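One way to read this output is to compare how many index keys and documents were examined per returned result; values close to 1 suggest the index is selective for this query. The sketch below uses the numbers from the output above; `summarize_explain` is a hypothetical helper, and the field names are those of this legacy explain format.

```python
def summarize_explain(plan):
    """Condense a legacy explain() document into a few efficiency figures."""
    returned = plan["n"]
    return {
        "index": plan["cursor"],
        "millis": plan["millis"],
        # Index keys examined per returned document.
        "keys_per_result": plan["nscanned"] / returned if returned else None,
        # Full documents fetched per returned document.
        "docs_per_result": plan["nscannedObjects"] / returned if returned else None,
    }


# Values copied from the explain() output above.
plan = {
    "cursor": "BtreeCursor orig_1_dest_1",
    "n": 10100,
    "nscannedObjects": 10100,
    "nscanned": 10200,
    "millis": 26,
}
summary = summarize_explain(plan)
print(summary)
```

Here every fetched document is returned and only about 1% more index keys are examined than results produced, which is consistent with the 26 ms server-side time.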
    

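So here the query runs in 26 milliseconds, which is fine. Unfortunately, the equivalent MongoEngine query, Distance.objects.filter(orig__in=origins, dest__in=destinations), takes 3.9 seconds, and I really don't know what to make of its explain() output: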

    {
        "nYields": 81,
        "nscannedAllPlans": 10608,
        "filterSet": false,
        "allPlans": [
            {
                "nChunkSkips": 0,
                "n": 10100,
                "cursor": "BtreeCursor orig_1_dest_1",
                "scanAndOrder": false,
                "indexBounds": {
                    "dest": [ <SNIP> ],
                    "orig": [ <SNIP> ],
                },
                "nscannedObjects": 10100,
                "isMultiKey": false,
                "indexOnly": false,
                "nscanned": 10200
            },
            {
                "nChunkSkips": 0,
                "n": 101,
                "cursor": "BtreeCursor dest_1_orig_1",
                "scanAndOrder": false,
                "indexBounds": {
                    "dest": [ <SNIP> ],
                    "orig": [ <SNIP> ],
                },
                "nscannedObjects": 101,
                "isMultiKey": false,
                "indexOnly": false,
                "nscanned": 102
            },
            {
                "nChunkSkips": 0,
                "n": 97,
                "cursor": "BtreeCursor orig_1",
                "scanAndOrder": false,
                "indexBounds": {
                    "orig": [ <SNIP> ],
                },
                "nscannedObjects": 102,
                "isMultiKey": false,
                "indexOnly": false,
                "nscanned": 102
            },
            {
                "nChunkSkips": 0,
                "n": 95,
                "cursor": "BtreeCursor dest_1",
                "scanAndOrder": false,
                "indexBounds": {
                    "dest": [ <SNIP> ],
                },
                "nscannedObjects": 102,
                "isMultiKey": false,
                "indexOnly": false,
                "nscanned": 102
            },
            {
                "nChunkSkips": 0,
                "n": 94,
                "cursor": "BtreeCursor _cls_1",
                "scanAndOrder": false,
                "indexBounds": {
                    "_cls": [
                        [
                            "Distance",
                            "Distance"
                        ]
                    ]
                },
                "nscannedObjects": 102,
                "isMultiKey": false,
                "indexOnly": false,
                "nscanned": 102
            }
        ],
        "millis": 11,
        "nChunkSkips": 0,
        "server": "darkStar9:27017",
        "n": 10100,
        "cursor": "BtreeCursor orig_1_dest_1",
        "scanAndOrder": false,
        "indexBounds": {
            "dest": [ <SNIP> ],
            "orig": [ <SNIP> ],
        },
        "nscannedObjectsAllPlans": 10507,
        "isMultiKey": false,
        "stats": {
            "works": 10201,
            "isEOF": 1,
            "needFetch": 0,
            "needTime": 100,
            "yields": 81,
            "invalidates": 0,
            "unyields": 81,
            "type": "KEEP_MUTATIONS",
            "children": [
                {
                    "works": 10201,
                    "isEOF": 1,
                    "forcedFetches": 0,
                    "needFetch": 0,
                    "matchTested": 10100,
                    "needTime": 100,
                    "yields": 81,
                    "alreadyHasObj": 0,
                    "invalidates": 0,
                    "unyields": 81,
                    "type": "FETCH",
                    "children": [
                        {
                            "works": 10201,
                            "boundsVerbose": "field #0['orig']: [ <SNIP> ], field #1['dest']: [ <SNIP> ]",
                            "dupsTested": 0,
                            "yieldMovedCursor": 0,
                            "isEOF": 1,
                            "needFetch": 0,
                            "matchTested": 0,
                            "needTime": 100,
                            "keysExamined": 10200,
                            "seenInvalidated": 0,
                            "dupsDropped": 0,
                            "yields": 81,
                            "isMultiKey": 0,
                            "invalidates": 0,
                            "unyields": 81,
                            "type": "IXSCAN",
                            "children": [ ],
                            "advanced": 10100,
                            "keyPattern": "{ orig: 1, dest: 1 }"
                        }
                    ],
                    "advanced": 10100
                }
            ],
            "advanced": 10100
        },
        "indexOnly": false,
        "nscanned": 10200,
        "nscannedObjects": 10100
    }
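The explain output reports "millis": 11, so most of the 3.9 seconds appears to be spent outside the server-side query. One thing that could be tried, assuming the cost lies in building Document objects (which this output does not prove), is to ask MongoEngine for plain dicts using the QuerySet methods `only()` and `as_pymongo()`. In the sketch below, `fetch_distances_raw` is a hypothetical helper, and `FakeQuerySet` and `FakeDistance` are stand-ins that exist only so the code runs without a database.

```python
def fetch_distances_raw(document_cls, origins, destinations):
    """Run the same filter, but return plain dicts instead of Document instances."""
    queryset = document_cls.objects.filter(orig__in=origins, dest__in=destinations)
    # only() limits the fields loaded; as_pymongo() skips Document construction.
    return list(queryset.only("orig", "dest", "paths").as_pymongo())


class FakeQuerySet:
    """Hypothetical stand-in imitating the three QuerySet methods used above."""

    def __init__(self, rows):
        self._rows = rows

    def filter(self, orig__in, dest__in):
        kept = [r for r in self._rows if r["orig"] in orig__in and r["dest"] in dest__in]
        return FakeQuerySet(kept)

    def only(self, *fields):
        return FakeQuerySet([{name: r[name] for name in fields} for r in self._rows])

    def as_pymongo(self):
        return self

    def __iter__(self):
        return iter(self._rows)


class FakeDistance:
    """Hypothetical stand-in for the Distance document class."""

    objects = FakeQuerySet([
        {"orig": "a", "dest": "b", "paths": [], "_cls": "Distance"},
        {"orig": "a", "dest": "c", "paths": [], "_cls": "Distance"},
    ])


rows = fetch_distances_raw(FakeDistance, ["a"], ["b"])
print(rows)
```

With the real model the call would be `fetch_distances_raw(Distance, origins, destinations)`; timing it against the original query would show how much of the gap comes from object construction.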
    
