Question

I have a lot of documents in a MongoDB collection, each of which looks like this:

{
"_id" : ObjectId("539f5556e4b032123458ba30"),
"name" : "H0031324836",
"date" : ISODate("2014-04-01T04:00:00Z"),
"dateString" : "2014-04-01",
"elements" : [
    {
        "start_time" : ISODate("2014-04-01T15:00:00Z"),
        "end_time" : ISODate("2014-04-01T16:00:00Z"),
        "duration" : NumberLong(3600000),
        "value" : 0.6968
    },
    {
        "start_time" : ISODate("2014-04-01T16:00:00Z"),
        "end_time" : ISODate("2014-04-01T17:00:00Z"),
        "duration" : NumberLong(3600000),
        "value" : 1.4873
    },
    // ...
]
}

For each of these documents, I want (ideally via the aggregation framework) to end up with a document like this:

{
"_id" : ObjectId("539f5556e4b032123458ba30"),
"name" : "H0031324836",
"date" : ISODate("2014-04-01T04:00:00Z"),
"dateString" : "2014-04-01",
"duration" : NumberLong(...blahblah...), // sum of all "$duration" fields
"value" : ...blahblah..., // sum of all "$value" fields
}

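I don't see a way to vectorize over the $elements array and pick out values. Perhaps $unwind is an option, but that seems pretty inefficient if it truly explodes each document into a stream of documents just so I can collapse them again.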

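The collection is big (about half a billion documents now, and several billion once the full data is loaded), so I'd like to use the aggregation framework rather than MapReduce if possible.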

I'm running MongoDB 2.6.0 on a hash-sharded collection with 8 shards.

Answer 1

As you suspected, $unwind is the answer. This is how arrays are meant to be handled in the aggregation framework.

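First, about $unwind. Note that the aggregation framework will only use your collection's indexes up to the point where it transforms the documents, so be sure to do most of your filtering first. There is more on this in the SO answer "Aggregation pipeline and indexes". $unwind performs well because it is optimized inside MongoDB: it runs natively (in C++) within the aggregation pipeline, so you don't pay the cost of running interpreted JavaScript, as you would with MapReduce. The MongoDB team has worked hard, over many iterations, to make aggregation fast.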

Now, what would this pipeline actually look like?

db.collection.aggregate([
    { $match: { name: "H0031324836" } }, // limit to just this record (or set of records, uses index)
    { $unwind: "$elements" }, // explode that array into individual documents
    {
        $group: { // regroup all of the similar ones based on the `name` field
            _id: "$name",
            duration: { $sum: "$elements.duration" }, // sum elements[n].duration
            value: { $sum: "$elements.value" } // sum elements[n].value
        }
    }
]);
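
To make it easier to see what the $unwind and $group stages compute, here is a plain-JavaScript sketch that mimics them on the two sample elements from the question. It does not need a MongoDB server. The helpers `unwind` and `groupSum` are hypothetical names made up for illustration, and they say nothing about how MongoDB implements these stages internally.

```javascript
// Plain-JavaScript sketch of what $unwind + $group compute; no MongoDB needed.
// `unwind` and `groupSum` are hypothetical helpers, for illustration only.
const docs = [
  {
    name: "H0031324836",
    elements: [
      { duration: 3600000, value: 0.6968 },
      { duration: 3600000, value: 1.4873 },
    ],
  },
];

// Like $unwind: emit one copy of the document per array element,
// with the array field replaced by that single element.
function unwind(documents, field) {
  return documents.flatMap((doc) =>
    doc[field].map((element) => ({ ...doc, [field]: element }))
  );
}

// Like $group with two $sum accumulators: add up elements.duration
// and elements.value for every distinct value of keyField.
function groupSum(documents, keyField) {
  const totals = new Map();
  for (const doc of documents) {
    const key = doc[keyField];
    const acc = totals.get(key) || { _id: key, duration: 0, value: 0 };
    acc.duration += doc.elements.duration;
    acc.value += doc.elements.value;
    totals.set(key, acc);
  }
  return [...totals.values()];
}

const result = groupSum(unwind(docs, "elements"), "name");
console.log(result);
```

For the two sample elements this gives a single group with a duration of 7200000 and a value of about 2.1841 (subject to floating-point rounding).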

For more information on the pipeline, see the SO answer "$unwind an object in aggregation framework".
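
The question asks for one output document per input document, with name, date and dateString preserved. One possible variant, offered as a sketch rather than as part of the original answer, is to group on the document's own `_id` and carry the other fields along with the `$first` accumulator. The pipeline is held in a variable called `pipeline` here (a name chosen for this example), and the `$match` stage is optional.

```javascript
// Sketch only: one output document per input document, as the question asks.
const pipeline = [
  { $match: { name: "H0031324836" } }, // optional filter, kept first so it can use an index
  { $unwind: "$elements" }, // one document per array element
  {
    $group: {
      _id: "$_id", // regroup by the original document
      name: { $first: "$name" }, // same within a group, so take the first
      date: { $first: "$date" },
      dateString: { $first: "$dateString" },
      duration: { $sum: "$elements.duration" }, // sum elements[n].duration
      value: { $sum: "$elements.value" }, // sum elements[n].value
    },
  },
];

// In the mongo shell this would be run as:
// db.collection.aggregate(pipeline);
```

Because every unwound copy of a document carries the same name, date and dateString, `$first` simply picks that shared value back up when the group is rebuilt.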

Aggregation will certainly take advantage of your 8 shards. From "Aggregation Introduction - Additional Features and Behaviors":

The aggregation pipeline supports operations on sharded collections. See Aggregation Pipeline and Sharded Collections.

MongoDB: Sum fields in subdocuments
