Indexing in MongoDB is not as fast as expected?

Date: 2018-12-24 15:00:00

Tags: python mongodb indexing pymongo

I have a collection of unique Twitter words in my database. Its documents have the following shape:

{
    _id: <some-object-id>,
    initial: "t",
    word: "the",
    count: 986,
    tweets: <position information for the given word>
}

I tried each of the following to create an index:

db.tweet_words.create_index([("word", pymongo.ASCENDING), ("initial", pymongo.ASCENDING)], background=True)

db.tweet_words.create_index("word", background=True)

db.tweet_words.create_index([("word", pymongo.HASHED)], background=True)
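A side note on the first two definitions: the explain output further down shows a query on word alone being served by the compound index word_1_initial_1, which is what MongoDB's index prefix rule predicts. The sketch below is only a simplified illustration of that rule in plain Python, not MongoDB's actual query planner; the function name is_index_prefix is hypothetical.

```python
def is_index_prefix(query_fields, index_keys):
    """Return True if the equality-query fields match a leading prefix of the index keys.

    Simplified illustration only: the real planner also considers ranges,
    sorts, hashed and multikey indexes, and more.
    """
    n = len(query_fields)
    return 0 < n <= len(index_keys) and set(index_keys[:n]) == set(query_fields)


compound = ["word", "initial"]  # key order of the first index above

word_only = is_index_prefix({"word"}, compound)            # query on word alone
both = is_index_prefix({"word", "initial"}, compound)      # query on both fields
initial_only = is_index_prefix({"initial"}, compound)      # initial alone is not a prefix
print(word_only, both, initial_only)
```

Under this reading, the single-field index on word adds little on top of the compound one for the queries shown here.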

Using any of these indexes makes the update command run faster, but it is still relatively slow. I think there should be a better way.

Here is my update command:

from pymongo import UpdateOne
# connect to db stuff
# create indexing using one of the approaches above
commands = []
for word in words: # this is actually not the real loop I've used but it fits for this example
    # assume tweet_id's and position is calculated here
    initial = word[0]
    ret = {"tweet_id": tweet_id, "pos": (beg, end)} # additional information about word
    command = UpdateOne({"word": word, "initial": initial}, # use query {"word": word} only if one of bottom two indexing strategy is chosen
        {
            "$setOnInsert": {"initial": initial},
            "$inc": {"count": 1},
            "$push": {"tweets": ret},
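        }, upsert=True) # upsert is assumed here, since $setOnInsert only takes effect on an upsert
    commands.append(command)
    if len(commands) >= 1000: # send the batch once it reaches 1000 operations
        db.tweet_words.bulk_write(commands, ordered=False)
        commands = []
if commands: # flush whatever is left after the loop
    db.tweet_words.bulk_write(commands, ordered=False)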
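
One thing that might reduce the number of operations sent to the server, assuming the same word often appears several times within a batch, is to group the occurrences per word first and emit a single update per distinct word. `$push` with `$each` appends several array elements in one update. This is a minimal sketch in plain Python; the helper names group_occurrences and build_update_specs and the sample tuples are hypothetical, and positions are stored as lists rather than tuples.

```python
from collections import defaultdict


def group_occurrences(occurrences):
    """Group (word, tweet_id, beg, end) tuples by word, keeping input order."""
    grouped = defaultdict(list)
    for word, tweet_id, beg, end in occurrences:
        grouped[word].append({"tweet_id": tweet_id, "pos": [beg, end]})
    return grouped


def build_update_specs(grouped):
    """Return one (filter, update) pair per distinct word."""
    specs = []
    for word, positions in grouped.items():
        filter_doc = {"word": word}
        update_doc = {
            "$setOnInsert": {"initial": word[0]},
            "$inc": {"count": len(positions)},          # one increment for all occurrences
            "$push": {"tweets": {"$each": positions}},  # append every position at once
        }
        specs.append((filter_doc, update_doc))
    return specs


# Hypothetical sample: "the" occurs twice, "cat" once.
occurrences = [("the", 1, 0, 3), ("cat", 1, 4, 7), ("the", 2, 10, 13)]
specs = build_update_specs(group_occurrences(occurrences))
print(len(specs))
```

Each pair could then be wrapped as `UpdateOne(filter_doc, update_doc, upsert=True)` and passed to `bulk_write` as in the loop above. Whether this helps depends on how often words repeat inside one batch.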

You can find the real code here.

With about 130,000 documents in the collection, this is the output of pprint(db.tweet_words.find({"word": word}).explain()) (I could not use the explain method on UpdateOne or bulk_write):

{'executionStats': {'allPlansExecution': [],
                    'executionStages': {'advanced': 1,
                                        'alreadyHasObj': 0,
                                        'docsExamined': 1,
                                        'executionTimeMillisEstimate': 0,
                                        'inputStage': {'advanced': 1,
                                                       'direction': 'forward',
                                                       'dupsDropped': 0,
                                                       'dupsTested': 0,
                                                       'executionTimeMillisEstimate': 0,
                                                       'indexBounds': {'initial': ['[MinKey, '
                                                                                   'MaxKey]'],
                                                                       'word': ['["seval", '
                                                                                '"seval"]']},
                                                       'indexName': 'word_1_initial_1',
                                                       'indexVersion': 2,
                                                       'invalidates': 0,
                                                       'isEOF': 1,
                                                       'isMultiKey': False,
                                                       'isPartial': False,
                                                       'isSparse': False,
                                                       'isUnique': False,
                                                       'keyPattern': {'initial': 1,
                                                                      'word': 1},
                                                       'keysExamined': 1,
                                                       'multiKeyPaths': {'initial': [],
                                                                         'word': []},
                                                       'nReturned': 1,
                                                       'needTime': 0,
                                                       'needYield': 0,
                                                       'restoreState': 0,
                                                       'saveState': 0,
                                                       'seeks': 1,
                                                       'seenInvalidated': 0,
                                                       'stage': 'IXSCAN',
                                                       'works': 2},
                                        'invalidates': 0,
                                        'isEOF': 1,
                                        'nReturned': 1,
                                        'needTime': 0,
                                        'needYield': 0,
                                        'restoreState': 0,
                                        'saveState': 0,
                                        'stage': 'FETCH',
                                        'works': 2},
                    'executionSuccess': True,
                    'executionTimeMillis': 0,
                    'nReturned': 1,
                    'totalDocsExamined': 1,
                    'totalKeysExamined': 1},
 'ok': 1.0,
 'queryPlanner': {'indexFilterSet': False,
                  'namespace': 'twitter.tweet_words',
                  'parsedQuery': {'word': {'$eq': 'seval'}},
                  'plannerVersion': 1,
                  'rejectedPlans': [],
                  'winningPlan': {'inputStage': {'direction': 'forward',
                                                 'indexBounds': {'initial': ['[MinKey, '
                                                                             'MaxKey]'],
                                                                 'word': ['["seval", '
                                                                          '"seval"]']},
                                                 'indexName': 'word_1_initial_1',
                                                 'indexVersion': 2,
                                                 'isMultiKey': False,
                                                 'isPartial': False,
                                                 'isSparse': False,
                                                 'isUnique': False,
                                                 'keyPattern': {'initial': 1,
                                                                'word': 1},
                                                 'multiKeyPaths': {'initial': [],
                                                                   'word': []},
                                                 'stage': 'IXSCAN'},
                                  'stage': 'FETCH'}},
 'serverInfo': {'gitVersion': 'f288a3bdf201007f3693c58e140056adf8b04839',
                'host': 'MostWanted',
                'port': 27017,
                'version': '4.0.4'}}
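
To make the output above easier to read, here is a small sketch that pulls out the headline numbers from an explain result. The function name summarize_explain is hypothetical, and the sample dict is a trimmed copy of the output above that keeps only the fields the function reads.

```python
def summarize_explain(plan):
    """Pull a few headline numbers out of a find().explain() result dict."""
    stats = plan["executionStats"]
    # Walk down the inputStage links to the leaf stage (for example IXSCAN or COLLSCAN).
    leaf = plan["queryPlanner"]["winningPlan"]
    while "inputStage" in leaf:
        leaf = leaf["inputStage"]
    return {
        "stage": leaf["stage"],
        "index": leaf.get("indexName"),
        "keys_examined": stats["totalKeysExamined"],
        "docs_examined": stats["totalDocsExamined"],
        "returned": stats["nReturned"],
        "millis": stats["executionTimeMillis"],
    }


# Trimmed copy of the explain output above.
sample = {
    "executionStats": {
        "executionTimeMillis": 0,
        "nReturned": 1,
        "totalDocsExamined": 1,
        "totalKeysExamined": 1,
    },
    "queryPlanner": {
        "winningPlan": {
            "stage": "FETCH",
            "inputStage": {"stage": "IXSCAN", "indexName": "word_1_initial_1"},
        }
    },
}
summary = summarize_explain(sample)
print(summary)
```

For this output the lookup is an index scan that examines one key and one document, so the read side of the query looks cheap. The explain output alone does not show where the remaining update time goes.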

Is there a bottleneck in my code that can be fixed? Or is this as good as it gets?
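
Since explain() is not available on UpdateOne or bulk_write, one possible way to inspect a single update is the server's `explain` database command, which accepts an `update` command body. The sketch below assumes `db` is a pymongo Database and that `db.command("explain", body, verbosity=...)` is an acceptable way to send it; the helper names are hypothetical, and the `$push` part of the real update is left out to keep the example short.

```python
def build_update_explain(collection_name, word, initial):
    """Build the body of an `explain` command for a single upsert-style update."""
    return {
        "update": collection_name,  # the command name must be the first key
        "updates": [
            {
                "q": {"word": word, "initial": initial},
                "u": {"$setOnInsert": {"initial": initial}, "$inc": {"count": 1}},
                "upsert": True,
            }
        ],
    }


def explain_update(db, word):
    """Ask the server to explain the update; `db` is assumed to be a pymongo Database."""
    body = build_update_explain("tweet_words", word, word[0])
    return db.command("explain", body, verbosity="executionStats")


# Building the command body needs no server connection.
body = build_update_explain("tweet_words", "seval", "s")
print(body["updates"][0]["q"])
```

Calling `explain_update(db, "seval")` against a live server should return a plan for the update without applying the modification.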

0 answers:

No answers