Question

我们选择将 mongos 路由器部署在与我们的应用程序相同的 VM 中，但我们遇到了一些问题，即应用程序被 OOM 杀死，因为 mongos 占用的 RAM 比我们预期/想要的要多得多。

重新启动后，mongos 占用空间略低于 2GB，但从这里开始，它不断需要更多内存。每周大约 500MB。它上升到 4.5+GB

这是过去 2 周我们的一个 mongos 的统计数据，它显然看起来像是在泄漏内存......

所以我的问题是：如何调查这种行为？我们还没有真正找到解释为什么路由器可能需要更多内存，或者如何诊断行为。甚至如何为 mongos 设置内存使用限制。

使用 mongos 上的 db.serverStatus 我们可以看到分配：

    "tcmalloc" : {
        "generic" : {
            "current_allocated_bytes" : 536925728,
            "heap_size" : NumberLong("2530185216")
        },
        "tcmalloc" : {
            "pageheap_free_bytes" : 848211968,
            "pageheap_unmapped_bytes" : 213700608,
            "max_total_thread_cache_bytes" : NumberLong(1073741824),
            "current_total_thread_cache_bytes" : 819058352,
            "total_free_bytes" : 931346912,
            "central_cache_free_bytes" : 108358128,
            "transfer_cache_free_bytes" : 3930432,
            "thread_cache_free_bytes" : 819058352,
            "aggressive_memory_decommit" : 0,
            "pageheap_committed_bytes" : NumberLong("2316484608"),
            "pageheap_scavenge_count" : 35286,
            "pageheap_commit_count" : 64660,
            "pageheap_total_commit_bytes" : NumberLong("28015460352"),
            "pageheap_decommit_count" : 35286,
            "pageheap_total_decommit_bytes" : NumberLong("25698975744"),
            "pageheap_reserve_count" : 513,
            "pageheap_total_reserve_bytes" : NumberLong("2530185216"),
            "spinlock_total_delay_ns" : NumberLong("38522661660"),
            "release_rate" : 1
        }
    },

------------------------------------------------
MALLOC:      536926304 (  512.1 MiB) Bytes in use by application
MALLOC: +    848211968 (  808.9 MiB) Bytes in page heap freelist
MALLOC: +    108358128 (  103.3 MiB) Bytes in central cache freelist
MALLOC: +      3930432 (    3.7 MiB) Bytes in transfer cache freelist
MALLOC: +    819057776 (  781.1 MiB) Bytes in thread cache freelists
MALLOC: +     12411136 (   11.8 MiB) Bytes in malloc metadata
MALLOC:   ------------
MALLOC: =   2328895744 ( 2221.0 MiB) Actual memory used (physical + swap)
MALLOC: +    213700608 (  203.8 MiB) Bytes released to OS (aka unmapped)
MALLOC:   ------------
MALLOC: =   2542596352 ( 2424.8 MiB) Virtual address space used
MALLOC:
MALLOC:         127967              Spans in use
MALLOC:             73              Thread heaps in use
MALLOC:           4096              Tcmalloc page size
------------------------------------------------
Call ReleaseFreeMemory() to release freelist memory to the OS (via madvise()).
Bytes released to the OS take up virtual address space but no physical memory.

但我不能说它真的很有帮助。至少对我来说。

在服务器统计信息中，我们还可以看到对 killCursors 的调用次数非常高 (2909015)，但我不确定这将如何解释内存使用量的稳步增长？由于游标会在 30 秒后自动终止，并且在整个期间对 mongos 的调用次数非常稳定。

是的，您知道如何诊断/去哪里寻找/寻找什么吗？

Mongos 版本：4.0.19

编辑：似乎我们的监控基于 virt 而不是 res 内存，因此图表可能不是很相关。然而，我们最终还是得到了 4GB 以上的 RES 内存

Answer 1

为什么路由器需要更多内存？

如果分片集群中有任何系统需要进行分散收集的查询，则合并活动由 mongos 本身负责。

例如我正在运行一个查询 db.collectionanme.find({something : 1})

如果这里的 something 字段不是分片键本身，那么默认情况下它将执行分散收集，使用解释计划检查 query。它执行分散收集，因为 mongos 与配置服务器交互并意识到它没有相应字段的信息。 {这适用于分片的集合}

更糟糕的是，如果您有无法使用索引的排序操作，那么即使现在也必须在 mongos 本身上完成。排序操作必须阻塞内存段以根据数据量将页面放在一起然后排序工作，想象一下这里排序操作的最佳大 O。在完成之前，该操作的内存将被阻塞。

你应该怎么做？

根据设置（你的slowms设置，默认应该是100ms），检查日志，看看你在系统中的慢查询。如果您看到大量 SHARD_MERGE 和内存排序发生，那么您的罪魁祸首就在那里。

为了快速修复，请增加交换内存的可用性并确保设置合适。

一切顺利。

Answer 2

在没有访问机器的情况下只能推测，说我想说这是 Mongo 的某种预期行为。

Mongo 喜欢内存，它会在 RAM 中保存很多东西以提高性能。

导致这种情况的原因有很多，我将列出一些：

MongoDB 使用 RAM 来处理打开的连接、聚合、服务器端代码、打开的游标等。
WiredTiger 在其缓存中保存多个版本的记录（多版本并发控制，读操作访问最后在他们操作之前提交的版本）。
WiredTiger 在缓存中保存数据的校验和。

mongo 缓存/存储在内存中的东西还有很多，一个这样的例子可能是集合的索引树。

如果 Mongo 有内存来存储一些东西，它就会。这就是为什么随着您越来越多地使用它，RAM 使用量会增加。但是我个人认为这不是内存泄漏。正如我所说，Mongo 就像 RAM，很多。

<块引用>

我们选择在与我们的应用程序相同的 VM 中部署 mongos 路由器。

从个人经验来看，这是一个很大的禁忌，因为 Mongo 很饿，如果可能的话，我个人会尽量避免这种情况。

总而言之，我不认为您有内存泄漏（尽管可能），我只是认为时间越长 Mongo 在使用时将更多东西存储到 RAM 中。

您应该注意长时间运行的查询，因为它们最有可能是 IMO 的罪魁祸首。

Mongos 内存使用量不断增加

2 个答案: