Question

We have an embedded Hazelcast cluster running on 10 AWS instances. The Hazelcast version is 3.7.3, and we currently use the following Hazelcast settings:

hazelcast.max.no.heartbeat.seconds=30
hazelcast.max.no.master.confirmation.seconds=150                
hazelcast.heartbeat.interval.seconds=1
hazelcast.operation.call.timeout.millis=5000
hazelcast.merge.first.run.delay.seconds=60

Apart from the settings above, all other properties are left at their default values.
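A minimal sketch of the question's overrides collected into a `java.util.Properties` object, plus the arithmetic behind the heartbeat settings: with a 1-second heartbeat interval and a 30-second max-no-heartbeat window, about 30 consecutive heartbeats can be missed before the window is exceeded. The class and method names are hypothetical. The commented-out wiring uses `Config#setProperties(Properties)` from Hazelcast 3.x; it is left as a comment so the sketch runs without the Hazelcast jar.

```java
import java.util.Properties;

public class QuestionSettings {
    // The five overrides listed in the question; all other Hazelcast
    // properties are left unset so the member uses its defaults.
    public static Properties questionOverrides() {
        Properties p = new Properties();
        p.setProperty("hazelcast.max.no.heartbeat.seconds", "30");
        p.setProperty("hazelcast.max.no.master.confirmation.seconds", "150");
        p.setProperty("hazelcast.heartbeat.interval.seconds", "1");
        p.setProperty("hazelcast.operation.call.timeout.millis", "5000");
        p.setProperty("hazelcast.merge.first.run.delay.seconds", "60");
        return p;
    }

    // How many heartbeat intervals fit into the max-no-heartbeat window,
    // i.e. roughly how many consecutive heartbeats can be missed.
    public static long heartbeatIntervalsInWindow(Properties p) {
        long window = Long.parseLong(p.getProperty("hazelcast.max.no.heartbeat.seconds"));
        long interval = Long.parseLong(p.getProperty("hazelcast.heartbeat.interval.seconds"));
        return window / interval;
    }

    public static void main(String[] args) {
        Properties p = questionOverrides();
        // Wiring for an embedded member (requires the Hazelcast jar, not compiled here):
        //   Config config = new Config();
        //   config.setProperties(p);
        //   HazelcastInstance hz = Hazelcast.newHazelcastInstance(config);
        System.out.println(p.size() + " overrides; heartbeat window = "
            + heartbeatIntervalsInWindow(p) + " intervals");
        // prints "5 overrides; heartbeat window = 30 intervals"
    }
}
```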

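Recently one node became unreachable for a few minutes, and some operations that fetch data from the cache slowed down. We have a backup configured for every map, so if data on one member was unavailable, Hazelcast should have answered from the backup copy on another member. Instead, because one node was unreachable, everything slowed down.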

Below is the exception we see in the logs:

[3.7.2] PartitionIteratingOperation invocation failed to complete due to operation-heartbeat-timeout. Current time: 2017-05-30 16:12:52.442. Total elapsed time: 10825 ms. Last operation heartbeat: never. Last operation heartbeat from member: 2017-05-30 16:12:42.166. Invocation{op=com.hazelcast.spi.impl.operationservice.impl.operations.PartitionIteratingOperation{serviceName='hz:impl:mapService', identityHash=1798676695, partitionId=-1, replicaIndex=0, callId=0, invocationTime=1496160761670 (2017-05-30 16:12:41.670), waitTimeout=-1, callTimeout=5000, operationFactory=com.hazelcast.map.impl.operation.MapGetAllOperationFactory@2afbcab7}, tryCount=10, tryPauseMillis=300, invokeCount=1, callTimeoutMillis=5000, firstInvocationTimeMs=1496160761617, firstInvocationTime='2017-05-30 16:12:41.617', lastHeartbeatMillis=0, lastHeartbeatTime='1970-01-01 00:00:00.000', target=[172.18.84.36]:9123, pendingResponse={VOID}, backupsAcksExpected=0, backupsAcksReceived=0, connection=Connection[id=12, /177.18.64.219:9123->/172.18.84.36:48180, endpoint=[172.18.84.36]:9123, alive=true, type=MEMBER]}
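To see how the timings in this log line relate to each other, a small worked example can recompute them from the printed timestamps. This is only an illustration: the class name is hypothetical, and it assumes all timestamps in the log share the same time zone.

```java
import java.time.Duration;
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;

public class InvocationTiming {
    // Timestamp format used in the Hazelcast log line above.
    private static final DateTimeFormatter LOG_FORMAT =
        DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss.SSS");

    // Milliseconds between two log timestamps.
    public static long millisBetween(String from, String to) {
        LocalDateTime start = LocalDateTime.parse(from, LOG_FORMAT);
        LocalDateTime end = LocalDateTime.parse(to, LOG_FORMAT);
        return Duration.between(start, end).toMillis();
    }

    public static void main(String[] args) {
        String firstInvocation = "2017-05-30 16:12:41.617";     // firstInvocationTime
        String lastMemberHeartbeat = "2017-05-30 16:12:42.166"; // last operation heartbeat from member
        String failure = "2017-05-30 16:12:52.442";             // current time at failure

        long elapsed = millisBetween(firstInvocation, failure);
        long sinceMemberHeartbeat = millisBetween(lastMemberHeartbeat, failure);
        System.out.println("elapsed=" + elapsed + " ms, sinceMemberHeartbeat="
            + sinceMemberHeartbeat + " ms");
        // prints "elapsed=10825 ms, sinceMemberHeartbeat=10276 ms"
    }
}
```

The 10825 ms result matches the log's own "Total elapsed time". That is more than twice the configured 5000 ms `callTimeout`, and the target member's last operation heartbeat arrived about 10.3 s before the failure, while the connection was still reported as `alive=true`.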

Can anyone suggest the right Hazelcast settings so that one temporarily unreachable node does not slow down the whole cluster?

Answer 1

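The operation call timeout should not be set to a low value. It is probably best to leave it at the default (60000 ms in Hazelcast 3.x). Internal mechanisms such as heartbeats depend on the call timeout.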
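Following this advice means dropping the `hazelcast.operation.call.timeout.millis` override rather than picking another small value. A minimal sketch, assuming the overrides are kept in a `java.util.Properties` object; `effectiveCallTimeoutMillis` is a hypothetical helper that mimics the "explicit value if set, otherwise default" lookup and is not Hazelcast's own property resolver.

```java
import java.util.Properties;

public class RecommendedSettings {
    static final String CALL_TIMEOUT = "hazelcast.operation.call.timeout.millis";
    // Hazelcast 3.x default for hazelcast.operation.call.timeout.millis.
    static final long DEFAULT_CALL_TIMEOUT_MILLIS = 60_000L;

    // Returns a copy of the overrides without the call-timeout entry,
    // so the member would fall back to the built-in default.
    public static Properties dropCallTimeoutOverride(Properties overrides) {
        Properties copy = new Properties();
        copy.putAll(overrides);
        copy.remove(CALL_TIMEOUT);
        return copy;
    }

    // Hypothetical helper: explicit value if present, otherwise the default.
    public static long effectiveCallTimeoutMillis(Properties p) {
        String value = p.getProperty(CALL_TIMEOUT);
        return value == null ? DEFAULT_CALL_TIMEOUT_MILLIS : Long.parseLong(value);
    }

    public static void main(String[] args) {
        Properties current = new Properties();
        current.setProperty("hazelcast.heartbeat.interval.seconds", "1");
        current.setProperty(CALL_TIMEOUT, "5000");

        Properties recommended = dropCallTimeoutOverride(current);
        System.out.println(effectiveCallTimeoutMillis(current) + " -> "
            + effectiveCallTimeoutMillis(recommended));
        // prints "5000 -> 60000"
    }
}
```

Working on a copy keeps the original overrides intact, so the other settings (here the heartbeat interval) carry over unchanged.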

Answer 2

Based on the reference manual for version 3.11.7:

I suggest reading the section on split-brain syndrome.

You may also want to configure a quorum for the case where nodes cannot communicate with each other.
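The idea behind a quorum can be shown with a minimal sketch of a majority rule: a group of members may keep serving protected operations only while it sees at least a minimum number of members. This is not Hazelcast's implementation; in the 3.x line the real feature is configured with `QuorumConfig`, registered through `Config#addQuorumConfig`, and attached to a map with `MapConfig#setQuorumName`. The class and method names below are hypothetical.

```java
public class QuorumSketch {
    // Smallest member count that is a strict majority of the expected cluster size.
    public static int majority(int expectedMembers) {
        if (expectedMembers < 1) {
            throw new IllegalArgumentException("expectedMembers must be >= 1");
        }
        return expectedMembers / 2 + 1;
    }

    // A side of a split may keep serving quorum-protected operations
    // only if it still sees at least quorumSize members.
    public static boolean hasQuorum(int reachableMembers, int quorumSize) {
        return reachableMembers >= quorumSize;
    }

    public static void main(String[] args) {
        int quorum = majority(10); // the 10-member cluster from the question
        // One isolated node versus the remaining nine.
        System.out.println("quorum=" + quorum
            + ", isolated node ok=" + hasQuorum(1, quorum)
            + ", other nine ok=" + hasQuorum(9, quorum));
        // prints "quorum=6, isolated node ok=false, other nine ok=true"
    }
}
```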

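Also, from experience, I recommend getting the reference manual for your specific version. Even when a property's default is 5, I have found that the manual for a particular version recommends a different value.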

Hazelcast: tuning properties for nodes in a cluster with temporary network failures
