Question

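I have a very large collection (~7M items) in MongoDB, consisting primarily of documents with three fields.

I'd like to be able to iterate over all the unique values of one of the fields in an efficient way.

Currently, I query for just that field and then process the returned results by iterating over the cursor and enforcing uniqueness myself. This works, but it is rather slow, and I suspect there must be a better way.

I know mongo has the db.collection.distinct() function, but it is limited by the maximum BSON size (16 MB), which my dataset exceeds.

Is there any way to iterate over something similar to db.collection.distinct(), but using a cursor or some other method, so that the record-size limit isn't an issue?

I think something like the map/reduce functionality might be suited to this kind of thing, but I don't really understand the map-reduce paradigm in the first place, so I have no idea what I'm doing. The project I'm working on is partly a way to learn about different database tools, so I'm rather inexperienced.

I'm using PyMongo, if that's relevant (I don't think it is). This should mostly depend on MongoDB alone.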

Example:

For this dataset:

{"basePath" : "foo", "internalPath" : "Neque", "itemhash": "49f4c6804be2523e2a5e74b1ffbf7e05"}
{"basePath" : "foo", "internalPath" : "porro", "itemhash": "ffc8fd5ef8a4515a0b743d5f52b444bf"}
{"basePath" : "bar", "internalPath" : "quisquam", "itemhash": "cf34a8047defea9a51b4a75e9c28f9e7"}
{"basePath" : "baz", "internalPath" : "est", "itemhash": "c07bc6f51234205efcdeedb7153fdb04"}
{"basePath" : "foo", "internalPath" : "qui", "itemhash": "5aa8cfe2f0fe08ee8b796e70662bfb42"}

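What I would like to do is iterate over only the basePath field. For the above dataset, this means I would iterate over each of foo, bar, and baz just once.

I'm not sure whether it's relevant, but the database is structured so that no single field is unique, while the combination of all three fields is unique (this is enforced with an index).

The query and filtering operation I'm currently using (note: I'm restricting the query to a subset of the items to reduce processing time):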

    self.log.info("Running path query")
    itemCursor = self.dbInt.coll.find({"basePath": pathRE}, fields={'_id': False, 'internalPath': False, 'itemhash': False}, exhaust=True)
    self.log.info("Query complete. Processing")
    self.log.info("Query returned %d items", itemCursor.count())
    self.log.info("Filtering returned items to require uniqueness.")
    items = set()
    for item in itemCursor:
        # print item
        items.add(item["basePath"])

    self.log.info("total unique items = %s", len(items))

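Running the same query with self.dbInt.coll.distinct("basePath") results in OperationFailure: command SON([('distinct', u'deduper_collection'), ('key', 'basePath')]) failed: exception: distinct too big, 16mb cap

OK, here is the solution I wound up using. I'd add it as an answer, but I don't want to detract from the actual answers that got me here.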

    reStr = "^%s" % fqPathBase
    pathRE = re.compile(reStr)
    self.log.info("Running path query")

    pipeline = [
        { "$match" :
            {
                "basePath" : pathRE
            }
        },
        # Group the keys
        {"$group":
            {
                "_id": "$basePath"
            }
        },

        # Output to a collection "tmp_unique_coll"
        {"$out": "tmp_unique_coll"}
        ]

    itemCursor = self.dbInt.coll.aggregate(pipeline, allowDiskUse=True)
    itemCursor = self.dbInt.db.tmp_unique_coll.find(exhaust=True)

    self.log.info("Query complete. Processing")
    self.log.info("Query returned %d items", itemCursor.count())
    self.log.info("Filtering returned items to require uniqueness.")
    items = set()
    retItems = 0
    for item in itemCursor:
        retItems += 1
        items.add(item["_id"])


    self.log.info("Received items = %d", retItems)
    self.log.info("total unique items = %s", len(items))

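Compared to my previous solution, overall performance is about 2X better in terms of wall-clock time. On a query that returns 834273 items, with 11467 unique values:

Original method (retrieve, stuff into a Python set to enforce uniqueness):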

real    0m22.538s
user    0m17.136s
sys     0m0.324s

Aggregation pipeline method:

real    0m9.881s
user    0m0.548s
sys     0m0.096s

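So while the overall execution time is only ~2X better, the aggregation pipeline is far more performant in terms of actual CPU time.

Update

I recently revisited this project and rewrote the DB layer to use a SQL database, and everything became much easier. The complex processing pipeline is now a simple SELECT DISTINCT(colName) WHERE xxx operation.

Realistically, MongoDB and NoSQL databases in general are very much the wrong type of database for what I'm trying to do here.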
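
For comparison, the SELECT DISTINCT approach from the update can be sketched with Python's built-in sqlite3 module. This is only an illustration: the table name `items` and the `LIKE` filter are assumptions standing in for the unspecified schema and `WHERE xxx` clause, and the rows are the sample documents from the question.

```python
import sqlite3

# Sample rows taken from the example dataset in the question.
rows = [
    ("foo", "Neque", "49f4c6804be2523e2a5e74b1ffbf7e05"),
    ("foo", "porro", "ffc8fd5ef8a4515a0b743d5f52b444bf"),
    ("bar", "quisquam", "cf34a8047defea9a51b4a75e9c28f9e7"),
    ("baz", "est", "c07bc6f51234205efcdeedb7153fdb04"),
    ("foo", "qui", "5aa8cfe2f0fe08ee8b796e70662bfb42"),
]

conn = sqlite3.connect(":memory:")
# "items" is a hypothetical table name; the three columns mirror the document fields.
conn.execute("CREATE TABLE items (basePath TEXT, internalPath TEXT, itemhash TEXT)")
conn.executemany("INSERT INTO items VALUES (?, ?, ?)", rows)

# The database deduplicates; the cursor is iterated row by row.
cursor = conn.execute(
    "SELECT DISTINCT basePath FROM items WHERE basePath LIKE ? ORDER BY basePath",
    ("ba%",),
)
distinct_paths = [row[0] for row in cursor]
print(distinct_paths)  # ['bar', 'baz']
```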

Answer 1

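From the discussion points so far, I'm going to take a stab at this. I'll also note that, as of writing, the 2.6 release of MongoDB should be just around the corner, good weather permitting, so I'm going to make some references to it.

Oh, and an FYI that didn't come up in chat: .distinct() is an entirely different animal that pre-dates the methods used in the responses here, and as such is subject to many limitations.

So this is ultimately a solution for 2.6 and up, or for any current release above 2.5.

The alternative for now is to use mapReduce, because its only restriction is the output size.

Without going into the inner workings of distinct, I'm going to presume that aggregate does this more efficiently [and even more so in the upcoming release].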

    db.collection.aggregate([
        // Group the key and increment the count per match
        {$group: {
            _id: "$basePath",
            count: {$sum: 1}
        }},
        // Hey you can even sort it without breaking things
        {$sort: { count: 1 }},
        // Output to a collection "output"
        {$out: "output"}
    ])

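So we use the $out pipeline stage to write the final results, which exceed 16MB, into a collection of their own. There you can do whatever you want with them.

As 2.6 is "just around the corner", there is one more tweak that can be added.

Use allowDiskUse from the runCommand form, which lets each stage use disk instead of being subject to memory restrictions.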
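
As a rough sketch of how this might look from PyMongo (the asker's driver): with MongoDB 2.6+ and PyMongo 3 or later, aggregate() returns a cursor, so the grouped values can also be streamed without going through $out. The helper name `iter_distinct` and its parameters are made up for illustration, and `collection` is assumed to be a pymongo Collection.

```python
def iter_distinct(collection, field, match=None):
    """Yield each distinct value of `field`, streamed from an aggregation cursor.

    Assumption: `collection` is a pymongo Collection (PyMongo 3+), whose
    aggregate() method returns an iterable cursor of result documents.
    """
    pipeline = []
    if match:
        # Optional filter, applied before grouping.
        pipeline.append({"$match": match})
    # One output document per distinct value; the value ends up in "_id".
    pipeline.append({"$group": {"_id": "$" + field}})
    # allowDiskUse lets stages spill to disk instead of hitting the memory limit.
    for doc in collection.aggregate(pipeline, allowDiskUse=True):
        yield doc["_id"]
```

It would be used along the lines of `for base_path in iter_distinct(coll, "basePath"): ...`, where `coll` stands for whatever collection object the application already holds.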

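The main point here is that this is nearly ready for production, and the performance will be better than the same operation in mapReduce. So go ahead and play. Install 2.5.5 now.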

Answer 2

MapReduce in the current version of Mongo avoids the problem of the results exceeding 16MB.

map = function() {
    if(this['basePath']) {
        emit(this['basePath'], 1);
    }
    // if basePath always exists you can just call the emit:
    // emit(this.basePath, 1);
};

reduce = function(key, values) {
    return Array.sum(values);
};

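For each document, basePath is emitted with a single value representing the count for that value. The reduce simply sums all the values. The resulting collection will contain every unique value of basePath along with its total number of occurrences.

And, since you need to store the results to prevent the error, use the out option, which specifies a destination collection.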

db.yourCollectionName.mapReduce(
                 map,
                 reduce,
                 { out: "distinctMR" }
               )
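
Since the question mentions being unfamiliar with the map-reduce paradigm, here is a small pure-Python imitation of what the two functions above do, run on the sample documents from the question. It is only meant to illustrate the data flow; the helper names are invented, and the real mapReduce runs inside the server.

```python
from collections import defaultdict

# basePath values of the five sample documents from the question.
docs = [
    {"basePath": "foo", "internalPath": "Neque"},
    {"basePath": "foo", "internalPath": "porro"},
    {"basePath": "bar", "internalPath": "quisquam"},
    {"basePath": "baz", "internalPath": "est"},
    {"basePath": "foo", "internalPath": "qui"},
]

def map_doc(doc):
    """Counterpart of the map function: emit (basePath, 1) if the field is set."""
    if doc.get("basePath"):
        yield doc["basePath"], 1

def reduce_values(key, values):
    """Counterpart of the reduce function: sum the emitted counts for one key."""
    return sum(values)

def map_reduce(documents):
    emitted = defaultdict(list)
    for doc in documents:
        for key, value in map_doc(doc):
            emitted[key].append(value)  # collect all values emitted per key
    return {key: reduce_values(key, values) for key, values in emitted.items()}

counts = map_reduce(docs)
print(counts)  # {'foo': 3, 'bar': 1, 'baz': 1}
```

The keys of the result are the distinct basePath values, which is what ends up in the `_id` field of the output collection.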

Answer 3

@Neil Lunn's answer can be simplified:

    field = 'basePath'  # Field I want
    db.collection.aggregate([{'$project': {field: 1, '_id': 0}}])

$project filters the fields for you. In particular, '_id': 0 filters out the _id field.

Result still too large? Process it in batches with $limit and $skip:

    field = 'basePath'  # Field I want
    db.collection.aggregate([{'$project': {field: 1, '_id': 0}}, {'$limit': X}, {'$skip': Y}])

Answer 4

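I think the most scalable solution is to perform one query per unique value. The queries must be executed one after another, and each one gives you the "next" unique value based on the previous query's result. The idea is that each query returns a single document containing the unique value you are looking for. With the proper projection, mongo can answer from the index loaded in memory without having to read from disk.

You can implement this strategy with the $gt operator, but you must account for values such as null or empty strings, possibly discarding them with the $ne or $nin operators. You can also extend the strategy to multiple keys, using $gte on one key and $gt on the other.

This strategy should give you the distinct values of a string field in alphabetical order, or the distinct values of a numeric field in ascending order.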
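
The strategy above could be sketched in PyMongo roughly as follows. This is a minimal sketch under several assumptions: `collection` is a pymongo Collection (PyMongo 3+, where find_one accepts a `sort` keyword), the field is indexed, and all of its values share one comparable type. The function name and the `skip_values` parameter are hypothetical.

```python
def iter_distinct_by_index(collection, field, skip_values=(None, "")):
    """Yield distinct values of `field` in ascending order, one query per value.

    Assumption: `collection` is a pymongo Collection with an index on `field`.
    """
    # First query: discard null/missing and empty-string values via $nin.
    query = {field: {"$nin": list(skip_values)}}
    while True:
        # Smallest matching value; the projection returns only the field itself.
        doc = collection.find_one(query, {field: 1, "_id": 0}, sort=[(field, 1)])
        if doc is None:
            return  # no value greater than the last one: iteration is done
        value = doc[field]
        yield value
        # Next query: the smallest value strictly greater than the one just seen.
        query = {field: {"$gt": value}}
```

This issues one round trip per distinct value plus a final empty one, so it suits fields with few distinct values relative to the number of documents.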

Iterate over distinct items in one field in MongoDB
