We have set up a three-member replica set on MongoDB version 3.4, consisting of the following:
The problem we are seeing is that the secondary cannot keep up with the primary. When we seed it with data (copied from the primary's files) and add it to the replica set, it usually manages to sync, but an hour later it may be 10 minutes behind; a few hours later it is an hour behind, and so on, until after a day or two it goes stale.
We are trying to figure out why. The primary sits at a constant 0-1% CPU, while the secondary is under constant heavy load at 20-80% CPU. That seems to be the only potential resource constraint; disk and network load do not appear to be a problem. There also seems to be some locking going on on the secondary, because operations in the mongo shell (e.g. db.getReplicationInfo()) frequently take 5 minutes or more to complete, and mongostat rarely works against it (it just reports i/o timeout). Here is the mongostat output from one of the rare occasions when it did report data for the secondary:
host insert query update delete getmore command dirty used flushes vsize res qrw arw net_in net_out conn set repl time
localhost:27017 *0 33 743 *0 0 166|0 1.0% 78.7% 0 27.9G 27.0G 0|0 0|1 2.33m 337k 739 rs PRI Mar 27 14:41:54.578
primary.XXX.com:27017 *0 36 825 *0 0 131|0 1.0% 78.7% 0 27.9G 27.0G 0|0 0|0 1.73m 322k 739 rs PRI Mar 27 14:41:53.614
secondary.XXX.com:27017 *0 *0 *0 *0 0 109|0 4.3% 80.0% 0 8.69G 7.54G 0|0 0|10 6.69k 134k 592 rs SEC Mar 27 14:41:53.673
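(For anyone debugging something similar: the lag itself can also be watched from the mongo shell. A minimal sketch for 3.4, assuming a standard primary/secondary pair; rs.printSlaveReplicationInfo() is the 3.4-era helper name.)

// Per-secondary lag as reported by the replica set helper
rs.printSlaveReplicationInfo()

// Or compute it by hand from rs.status(); optimeDate is the wall-clock time of
// the last oplog entry each member has applied
var s = rs.status();
var primary = s.members.filter(function (m) { return m.stateStr === "PRIMARY"; })[0];
var secondary = s.members.filter(function (m) { return m.stateStr === "SECONDARY"; })[0];
print("lag in seconds: " + (primary.optimeDate - secondary.optimeDate) / 1000);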
I ran db.serverStatus() on the secondary and compared it with the primary, and one number that stands out is this:
"locks" : {"Global" : {"timeAcquiringMicros" : {"r" : NumberLong("21188001783")
At that point the secondary had an uptime of 14,000 seconds.
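(A minimal way to pull just those counters out of db.serverStatus() on each member, if you want to reproduce the comparison, would be something like the sketch below.)

// Run on both the primary and the secondary and compare the numbers.
// timeAcquiringMicros is cumulative microseconds spent waiting to acquire the
// global lock, broken down by lock mode ("r", "w", "R", "W").
var ss = db.serverStatus();
printjson({
    host: ss.host,
    uptimeSeconds: ss.uptime,
    globalLockWaits: ss.locks.Global.timeAcquiringMicros
});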
Any ideas about what this could be, or how to debug it, are welcome! We could upgrade the Amazon instance to something more powerful, but we have already done that three times, and at this point we think something else must be wrong.
I'll include the output of db.currentOp() on the secondary below, in case it helps. (The command took 5 minutes to run, after which the following was logged: Restarting oplog query due to error: CursorNotFound: Cursor not found, cursor id: 15728290121. Last fetched optime (with hash): { ts: Timestamp 1490613628000|756, t: 48 }[-5363878314895774690]. Restarts remaining: 3)
"desc":"conn605", "connectionId":605,"client":"127.0.0.1:61098", "appName":"MongoDB Shell", "secs_running":0, "microsecs_running":NumberLong(16), "op":"command", "ns":"admin.$cmd", "query":{"currentOp":1}, "locks":{}, "waitingForLock":false, "lockStats":{} "desc":"repl writer worker 10", "secs_running":0, "microsecs_running":NumberLong(14046), "op":"none", "ns":"CustomerDB.ed2112ec779f", "locks":{"Global":"W","Database":"W"}, "waitingForLock":false, "lockStats":{"Global":{"acquireCount":{"w":NumberLong(1),"W":NumberLong(1)}},"Database":{"acquireCount":{"W":NumberLong(1)}}} "desc":"ApplyBatchFinalizerForJournal", "op":"none", "ns":"", "locks":{}, "waitingForLock":false, "lockStats":{} "desc":"ReplBatcher", "secs_running":11545, "microsecs_running":NumberLong("11545663961"), "op":"none", "ns":"local.oplog.rs", "locks":{}, "waitingForLock":false, "lockStats":{"Global":{"acquireCount":{"r":NumberLong(2)}},"Database":{"acquireCount":{"r":NumberLong(1)}},"oplog":{"acquireCount":{"r":NumberLong(1)}}} "desc":"rsBackgroundSync", "secs_running":11545, "microsecs_running":NumberLong("11545281690"), "op":"none", "ns":"local.replset.minvalid", "locks":{}, "waitingForLock":false, "lockStats":{"Global":{"acquireCount":{"r":NumberLong(5),"w":NumberLong(1)}},"Database":{"acquireCount":{"r":NumberLong(2),"W":NumberLong(1)}},"Collection":{"acquireCount":{"r":NumberLong(2)}}} "desc":"TTLMonitor", "op":"none", "ns":"", "locks":{"Global":"r"}, "waitingForLock":true, "lockStats":{"Global":{"acquireCount":{"r":NumberLong(35)},"acquireWaitCount":{"r":NumberLong(2)},"timeAcquiringMicros":{"r":NumberLong(341534123)}},"Database":{"acquireCount":{"r":NumberLong(17)}},"Collection":{"acquireCount":{"r":NumberLong(17)}}} "desc":"SyncSourceFeedback", "op":"none", "ns":"", "locks":{}, "waitingForLock":false, "lockStats":{} "desc":"WT RecordStoreThread: local.oplog.rs", "secs_running":1163, "microsecs_running":NumberLong(1163137036), "op":"none", "ns":"local.oplog.rs", "locks":{}, "waitingForLock":false, "lockStats":{"Global":{"acquireCount":{"r":NumberLong(1),"w":NumberLong(1)}},"Database":{"acquireCount":{"w":NumberLong(1)}},"oplog":{"acquireCount":{"w":NumberLong(1)}}} "desc":"rsSync", "secs_running":11545, "microsecs_running":NumberLong("11545663926"), "op":"none", "ns":"local.replset.minvalid", "locks":{"Global":"W"}, "waitingForLock":false, "lockStats":{"Global":{"acquireCount":{"r":NumberLong(272095),"w":NumberLong(298255),"R":NumberLong(1),"W":NumberLong(74564)},"acquireWaitCount":{"W":NumberLong(3293)},"timeAcquiringMicros":{"W":NumberLong(17685)}},"Database":{"acquireCount":{"r":NumberLong(197529),"W":NumberLong(298255)},"acquireWaitCount":{"W":NumberLong(146)},"timeAcquiringMicros":{"W":NumberLong(651947)}},"Collection":{"acquireCount":{"r":NumberLong(2)}}} "desc":"clientcursormon", "secs_running":0, "microsecs_running":NumberLong(15649), "op":"none", "ns":"CustomerDB.b72ac80177ef", "locks":{"Global":"r"}, "waitingForLock":true, "lockStats":{"Global":{"acquireCount":{"r":NumberLong(387)},"acquireWaitCount":{"r":NumberLong(2)},"timeAcquiringMicros":{"r":NumberLong(397538606)}},"Database":{"acquireCount":{"r":NumberLong(193)}},"Collection":{"acquireCount":{"r":NumberLong(193)}}}}],"ok":1}
Answer (score: 1)
JJussi was exactly right (thank you!). The problem was that the active data set was larger than the available memory, and we were using Amazon EBS "Throughput Optimized HDD" volumes. We changed the volume type to "General Purpose SSD" and the problem disappeared immediately. We were even able to downgrade the server from m4.2xlarge to m4.large.
What confused us is that this showed up as high CPU load. We had assumed disk was not the bottleneck, given the fairly small amount of data being written to disk per second. But when we tried running the affected AWS server as the primary, we noticed a very strong correlation between high CPU load and disk queue length. Further testing of the disk showed that its performance was very poor for the kind of traffic MongoDB generates.
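(In hindsight, a quick way to confirm that the active data no longer fit in memory would have been to look at the WiredTiger cache counters in db.serverStatus(). A rough sketch follows; the statistic key names come from WiredTiger and may differ slightly between server versions.)

// If "bytes currently in the cache" is pinned near "maximum bytes configured"
// and "pages read into cache" keeps climbing, the working set is bigger than
// RAM and replicated writes degrade into random disk reads.
var cache = db.serverStatus().wiredTiger.cache;
printjson({
    configuredMaxBytes: cache["maximum bytes configured"],
    bytesInCache:       cache["bytes currently in the cache"],
    dirtyBytes:         cache["tracked dirty bytes in the cache"],
    pagesReadIntoCache: cache["pages read into cache"]
});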