Question

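I have a collection with coordinate data stored as GeoJSON Points, and I need to query it for the 10 latest entries within an area. There are currently 1,000,000 entries, but there will be about 10 times as many.

My problem is that when there are many entries within the desired area, the performance of my query drops dramatically (case 3). The test data I currently have is random, but the real data won't be, so picking a different index (as in case 4) purely based on the dimensions of the area isn't possible.

What should I do to make it perform predictably regardless of the area?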

1. Collection statistics:

> db.randomcoordinates.stats()
{
    "ns" : "test.randomcoordinates",
    "count" : 1000000,
    "size" : 224000000,
    "avgObjSize" : 224,
    "storageSize" : 315006976,
    "numExtents" : 15,
    "nindexes" : 3,
    "lastExtentSize" : 84426752,
    "paddingFactor" : 1,
    "systemFlags" : 0,
    "userFlags" : 0,
    "totalIndexSize" : 120416128,
    "indexSizes" : {
        "_id_" : 32458720,
        "position_2dsphere_timestamp_-1" : 55629504,
        "timestamp_-1" : 32327904
    },
    "ok" : 1
}

2. Indexes:

> db.randomcoordinates.getIndexes()
[
    {
        "v" : 1,
        "key" : {
            "_id" : 1
        },
        "ns" : "test.randomcoordinates",
        "name" : "_id_"
    },
    {
        "v" : 1,
        "key" : {
            "position" : "2dsphere",
            "timestamp" : -1
        },
        "ns" : "test.randomcoordinates",
        "name" : "position_2dsphere_timestamp_-1"
    },
    {
        "v" : 1,
        "key" : {
            "timestamp" : -1
        },
        "ns" : "test.randomcoordinates",
        "name" : "timestamp_-1"
    }
]

3. Find using the 2dsphere compound index:

> db.randomcoordinates.find({position: {$geoWithin: {$geometry: {type: "Polygon", coordinates: [[[1, 1], [1, 90], [180, 90], [180, 1], [1, 1]]]}}}}).sort({timestamp: -1}).limit(10).hint("position_2dsphere_timestamp_-1").explain()
{
    "cursor" : "S2Cursor",
    "isMultiKey" : true,
    "n" : 10,
    "nscannedObjects" : 116775,
    "nscanned" : 283424,
    "nscannedObjectsAllPlans" : 116775,
    "nscannedAllPlans" : 283424,
    "scanAndOrder" : true,
    "indexOnly" : false,
    "nYields" : 4,
    "nChunkSkips" : 0,
    "millis" : 3876,
    "indexBounds" : {

    },
    "nscanned" : 283424,
    "matchTested" : NumberLong(166649),
    "geoTested" : NumberLong(166649),
    "cellsInCover" : NumberLong(14),
    "server" : "chan:27017"
}

4. Find using the timestamp index:

> db.randomcoordinates.find({position: {$geoWithin: {$geometry: {type: "Polygon", coordinates: [[[1, 1], [1, 90], [180, 90], [180, 1], [1, 1]]]}}}}).sort({timestamp: -1}).limit(10).hint("timestamp_-1").explain()
{
    "cursor" : "BtreeCursor timestamp_-1",
    "isMultiKey" : false,
    "n" : 10,
    "nscannedObjects" : 63,
    "nscanned" : 63,
    "nscannedObjectsAllPlans" : 63,
    "nscannedAllPlans" : 63,
    "scanAndOrder" : false,
    "indexOnly" : false,
    "nYields" : 0,
    "nChunkSkips" : 0,
    "millis" : 0,
    "indexBounds" : {
        "timestamp" : [
            [
                {
                    "$maxElement" : 1
                },
                {
                    "$minElement" : 1
                }
            ]
        ]
    },
    "server" : "chan:27017"
}

Some people have suggested using a {timestamp: -1, position: "2dsphere"} index, so I tried that as well, but it doesn't seem to perform well enough.

5. Find using the timestamp + 2dsphere compound index:

> db.randomcoordinates.find({position: {$geoWithin: {$geometry: {type: "Polygon", coordinates: [[[1, 1], [1, 90], [180, 90], [180, 1], [1, 1]]]}}}}).sort({timestamp: -1}).limit(10).hint("timestamp_-1_position_2dsphere").explain()
{
    "cursor" : "S2Cursor",
    "isMultiKey" : true,
    "n" : 10,
    "nscannedObjects" : 116953,
    "nscanned" : 286513,
    "nscannedObjectsAllPlans" : 116953,
    "nscannedAllPlans" : 286513,
    "scanAndOrder" : true,
    "indexOnly" : false,
    "nYields" : 4,
    "nChunkSkips" : 0,
    "millis" : 4597,
    "indexBounds" : {

    },
    "nscanned" : 286513,
    "matchTested" : NumberLong(169560),
    "geoTested" : NumberLong(169560),
    "cellsInCover" : NumberLong(14),
    "server" : "chan:27017"
}

Answer 1

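I came across this question while looking for a solution to a similar problem. It is a very old question that was never resolved, so in case others are looking for a solution to this kind of situation, I will try to explain why the approaches mentioned are not suited to the task at hand, and how such queries can be fine-tuned.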

In the first case, it is completely normal that so many items are scanned. Let me try to explain why:

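When MongoDB builds the compound index "position_2dsphere_timestamp_-1", it effectively creates one B-tree to hold all the geometries contained in the position key (Points, in this case), and for each distinct value in that B-tree, another B-tree is created to hold the timestamps in descending order. This means that unless your entries are very close to each other (and I mean very close), each secondary B-tree will hold just one entry, and query performance will be almost the same as with an index on the position field alone. The only difference is that MongoDB can use the timestamp value from the secondary B-tree instead of bringing the actual document into memory to check the timestamp.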

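The same applies when building the compound index "timestamp_-1_position_2dsphere". It is very unlikely that two entries are inserted at the same time with millisecond precision. So in this case, yes, the data is sorted by the timestamp field, but for the many distinct timestamp values we again have many separate B-trees that each hold only one entry. Applying the geoWithin filter therefore does not perform well, because it has to check every entry until the limit is reached.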

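So how do you make this kind of query perform well? Personally, I start by putting as many fields as possible in front of the geospatial field. But the main trick is to keep another field, say "createdDay", that holds a number with day precision. If you need more precision you can use hour-level precision instead, at the cost of performance; it all depends on your project's needs. Your index should look like this: {createdDay: -1, position: "2dsphere"}.

Now every document created on the same day is stored and sorted in the same 2dsphere B-tree. The idea is that MongoDB starts with the current day, since that should be the largest value in the index, and does an index scan on the position B-tree of the documents whose createdDay is today. If it finds at least 10 documents, it stops and returns them; otherwise it moves on to the previous day, and so on. This approach should improve your performance considerably.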
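To make the day-bucket idea concrete, here is a minimal sketch in plain JavaScript. It is an illustration under assumptions, not something taken from the question: the helper names toCreatedDay and latestInArea, the findForDay callback, and the choice of "whole days since the Unix epoch (UTC)" as the createdDay value are all hypothetical. The sketch walks the buckets newest-first from application code, one day per query, and stops as soon as enough documents have been collected.

```javascript
// Assumption: createdDay is stored as whole days since the Unix epoch (UTC).
var MS_PER_DAY = 24 * 60 * 60 * 1000;

// Hypothetical helper: derive the createdDay value for a Date.
function toCreatedDay(date) {
    return Math.floor(date.getTime() / MS_PER_DAY);
}

// Walk day buckets from newest to oldest until `limit` documents are collected
// or `maxDaysBack` buckets have been tried.
// `findForDay(day)` is a hypothetical callback returning the documents of that
// day that lie inside the area, in any order. In the mongo shell it could wrap
// a query of the form
//   db.randomcoordinates.find({createdDay: day, position: {$geoWithin: ...}}).toArray()
function latestInArea(findForDay, startDay, maxDaysBack, limit) {
    var results = [];
    for (var day = startDay;
         day > startDay - maxDaysBack && results.length < limit;
         day--) {
        // Sort a copy of the day's matches so the newest come first.
        var docs = findForDay(day).slice().sort(function (a, b) {
            return b.timestamp - a.timestamp;
        });
        // Every document of an older day is older than all documents of a newer
        // day, so appending the buckets in order keeps the overall ordering.
        results = results.concat(docs);
    }
    return results.slice(0, limit);
}
```

With this shape, the cost of a query depends on how many day buckets have to be visited before 10 matches are found, rather than on the total number of points inside the area. The maxDaysBack bound is a design choice added here so that a sparsely populated area cannot trigger an unbounded number of queries.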

I hope this helps.

Answer 2

Have you tried using the aggregation framework on your dataset?

The query you want would look something like this:

db.randomcoordinates.aggregate([
    { $match: {position: {$geoWithin: {$geometry: {type: "Polygon", coordinates: [[[1, 1], [1, 90], [180, 90], [180, 1], [1, 1]]]}}}}},
    { $sort: { timestamp: -1 } },
    { $limit: 10 }
]);

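Unfortunately, the aggregation framework doesn't have explain in a production build yet, so the only way to find out whether it makes a big difference in time is to run it. If you are comfortable building from source, it looks like explain support may have landed late last month: https://jira.mongodb.org/browse/SERVER-4504. It also looks like it will appear in dev build 2.5.3, which is scheduled for release next Tuesday (October 15, 2013).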

Answer 3

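What should I do to make it perform predictably, regardless of the area?

$geoWithin simply does not run with Θ(1) efficiency. As I understand it, it runs with Θ(n) efficiency in the average case (given that the algorithm has to check at most n points and at least 10).

However, I would definitely do some preprocessing on the set of coordinates to make sure the most recently added coordinates are processed first, which gives you a better chance of getting Θ(10) efficiency (beyond that, using position_2dsphere_timestamp_-1 sounds like the way to go)!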
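As a rough illustration of what "processing the newest coordinates first" buys you, here is a minimal sketch in plain JavaScript. It is not how $geoWithin is implemented: the functions inBox and newestInBox are hypothetical, and a simple longitude/latitude box test stands in for the real polygon check. The sketch scans documents that are already sorted newest-first and stops at the 10th match, which is the same shape of work as the timestamp_-1 plan in case 4 of the question.

```javascript
// Hypothetical stand-in for $geoWithin: an axis-aligned lon/lat box test.
function inBox(point, box) {
    var lon = point.coordinates[0];
    var lat = point.coordinates[1];
    return lon >= box.minLon && lon <= box.maxLon &&
           lat >= box.minLat && lat <= box.maxLat;
}

// Scan documents that are already sorted by timestamp descending and stop as
// soon as `limit` matches are found. Returns the matches together with the
// number of documents that had to be inspected.
function newestInBox(docsNewestFirst, box, limit) {
    var matches = [];
    var scanned = 0;
    for (var i = 0; i < docsNewestFirst.length && matches.length < limit; i++) {
        scanned++;
        if (inBox(docsNewestFirst[i].position, box)) {
            matches.push(docsNewestFirst[i]);
        }
    }
    return {matches: matches, scanned: scanned};
}
```

The scan is cheap when a large share of recent documents fall inside the area and degrades toward all n documents when few or none do, which is why this plan alone cannot give predictable performance for arbitrary areas.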

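Some people have suggested using a {timestamp: -1, position: "2dsphere"} index, so I tried that as well, but it doesn't seem to perform well enough.

(See the reply to the initial question.)

Additionally, the following might be useful!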

Optimization Strategies for MongoDB

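Hope this helps!

TL;DR: You can play around with the indexes as much as you like, but you won't get significantly better efficiency out of $geoWithin unless you rewrite it.

That said, you can always focus on optimizing index performance and rewrite the function, if you'd like!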

MongoDB 2dsphere index $geoWithin performance
