I have a 3-node Akka cluster with 3 actors running on each node. The cluster runs fine for about 2 hours, but after that I start getting the following warnings:

    [INFO] [06/07/2018 15:08:51.923] [ClusterSystem-akka.remote.default-remote-dispatcher-6] [akka.tcp://ClusterSystem@192.168.2.8:2552/system/transports/akkaprotocolmanager.tcp0/akkaProtocol-tcp%3A%2F%2FClusterSystem%40192.168.2.7%3A2552-112] No response from remote for outbound association. Handshake timed out after [15000 ms].
    [WARN] [06/07/2018 15:08:51.923] [ClusterSystem-akka.remote.default-remote-dispatcher-18] [akka.tcp://ClusterSystem@192.168.2.8:2552/system/endpointManager/reliableEndpointWriter-akka.tcp%3A%2F%2FClusterSystem%40192.168.2.7%3A2552-8] Association with remote system [akka.tcp://ClusterSystem@192.168.2.7:2552] has failed, address is now gated for [5000] ms. Reason: [Association failed with [akka.tcp://ClusterSystem@192.168.2.7:2552]] Caused by: [No response from remote for outbound association. Handshake timed out after [15000 ms].]
    [WARN] [06/07/2018 16:07:06.347] [ClusterSystem-akka.actor.default-dispatcher-101] [akka.remote.PhiAccrualFailureDetector@3895fa5b] heartbeat interval is growing too large: 2839 millis
**Edit:** response from the Akka Cluster Management API

    {
      "selfNode": "akka.tcp://ClusterSystem@127.0.0.1:2551",
      "leader": "akka.tcp://ClusterSystem@127.0.0.1:2551",
      "oldest": "akka.tcp://ClusterSystem@127.0.0.1:2551",
      "unreachable": [
        {
          "node": "akka.tcp://ClusterSystem@127.0.0.1:2552",
          "observedBy": [
            "akka.tcp://ClusterSystem@127.0.0.1:2551",
            "akka.tcp://ClusterSystem@127.0.0.1:2560"
          ]
        }
      ],
      "members": [
        {
          "node": "akka.tcp://ClusterSystem@127.0.0.1:2551",
          "nodeUid": "105742380",
          "status": "Up",
          "roles": [
            "Frontend",
            "dc-default"
          ]
        },
        {
          "node": "akka.tcp://ClusterSystem@127.0.0.1:2552",
          "nodeUid": "-150160059",
          "status": "Up",
          "roles": [
            "RuleExecutor",
            "dc-default"
          ]
        },
        {
          "node": "akka.tcp://ClusterSystem@127.0.0.1:2560",
          "nodeUid": "-158907672",
          "status": "Up",
          "roles": [
            "RuleExecutor",
            "dc-default"
          ]
        }
      ]
    }
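
To read a response like the one above at a glance, a small script can list each unreachable member together with its membership status and the nodes that reported it. This is only a minimal sketch that assumes the JSON has exactly the shape shown above; `summarize_unreachable` and `SAMPLE` are hypothetical names for illustration, not part of Akka Management.

```python
import json

# Trimmed copy of the response above (roles and nodeUid omitted for brevity).
SAMPLE = """
{
  "selfNode": "akka.tcp://ClusterSystem@127.0.0.1:2551",
  "unreachable": [
    {
      "node": "akka.tcp://ClusterSystem@127.0.0.1:2552",
      "observedBy": [
        "akka.tcp://ClusterSystem@127.0.0.1:2551",
        "akka.tcp://ClusterSystem@127.0.0.1:2560"
      ]
    }
  ],
  "members": [
    {"node": "akka.tcp://ClusterSystem@127.0.0.1:2551", "status": "Up"},
    {"node": "akka.tcp://ClusterSystem@127.0.0.1:2552", "status": "Up"},
    {"node": "akka.tcp://ClusterSystem@127.0.0.1:2560", "status": "Up"}
  ]
}
"""


def summarize_unreachable(response_text):
    """Return (node, status, observers) for every entry under "unreachable"."""
    data = json.loads(response_text)
    # Map each member address to its membership status for quick lookup.
    status_by_node = {m["node"]: m["status"] for m in data.get("members", [])}
    return [
        (
            entry["node"],
            status_by_node.get(entry["node"], "unknown"),
            sorted(entry.get("observedBy", [])),
        )
        for entry in data.get("unreachable", [])
    ]


if __name__ == "__main__":
    for node, status, observers in summarize_unreachable(SAMPLE):
        print(f"{node}: status={status}, reported unreachable by {len(observers)} node(s)")
        for observer in observers:
            print(f"  observed by {observer}")
```

In the response above, the node on port 2552 is still listed with status `Up` while both other nodes report it as unreachable.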
**Edit 1:** cluster settings and failure detector configuration

    cluster {
      jmx.multi-mbeans-in-same-jvm = on
      roles = ["Frontend"]
      seed-nodes = [
        "akka.tcp://ClusterSystem@192.168.2.9:2551"]
      auto-down-unreachable-after = off
      failure-detector {
        # FQCN of the failure detector implementation.
        # It must implement akka.remote.FailureDetector and have
        # a public constructor with a com.typesafe.config.Config and
        # akka.actor.EventStream parameter.
        implementation-class = "akka.remote.PhiAccrualFailureDetector"
        # How often keep-alive heartbeat messages should be sent to each connection.
        # heartbeat-interval = 10 s
        # Defines the failure detector threshold.
        # A low threshold is prone to generate many wrong suspicions but ensures
        # a quick detection in the event of a real crash. Conversely, a high
        # threshold generates fewer mistakes but needs more time to detect
        # actual crashes.
        threshold = 18.0
        # Number of the samples of inter-heartbeat arrival times to adaptively
        # calculate the failure timeout for connections.
        max-sample-size = 1000
        # Minimum standard deviation to use for the normal distribution in
        # AccrualFailureDetector. Too low standard deviation might result in
        # too much sensitivity for sudden, but normal, deviations in heartbeat
        # inter arrival times.
        min-std-deviation = 100 ms
        # Number of potentially lost/delayed heartbeats that will be
        # accepted before considering it to be an anomaly.
        # This margin is important to be able to survive sudden, occasional,
        # pauses in heartbeat arrivals, due to for example garbage collect or
        # network drop.
        acceptable-heartbeat-pause = 15 s
        # Number of member nodes that each member will send heartbeat messages to,
        # i.e. each node will be monitored by this number of other nodes.
        monitored-by-nr-of-members = 2
        # After the heartbeat request has been sent the first failure detection
        # will start after this period, even though no heartbeat message has
        # been received.
        expected-response-after = 10 s
      }
    }
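
To get a feel for how `threshold`, `min-std-deviation` and `acceptable-heartbeat-pause` interact, the phi accrual idea can be sketched in a few lines. This is a rough illustration, not Akka's code: it uses the exact normal tail probability (Akka, as far as I know, uses an approximation of it), it assumes the pause is simply added to the mean heartbeat interval, and the 1 s mean interval and 50 ms observed deviation are made-up inputs. The function names are hypothetical.

```python
import math


def phi(time_since_last_ms, mean_interval_ms, std_dev_ms,
        acceptable_pause_ms=15_000, min_std_dev_ms=100):
    """Suspicion level after a given silence, using a normal-tail model."""
    # Assumption: the acceptable pause shifts the expected arrival time.
    mean = mean_interval_ms + acceptable_pause_ms
    # min-std-deviation acts as a floor on the observed deviation.
    std = max(std_dev_ms, min_std_dev_ms)
    y = (time_since_last_ms - mean) / std
    # Probability that the next heartbeat arrives even later than this.
    tail = 0.5 * math.erfc(y / math.sqrt(2))
    return math.inf if tail == 0.0 else -math.log10(tail)


def silence_until_suspected_ms(threshold=18.0, mean_interval_ms=1_000, std_dev_ms=50):
    """Smallest whole-millisecond silence at which phi reaches the threshold."""
    t = 0
    while phi(t, mean_interval_ms, std_dev_ms) < threshold:
        t += 1
    return t


if __name__ == "__main__":
    for silence in (2_839, 16_000, 16_500, 17_000):
        print(f"silence {silence} ms -> phi {phi(silence, 1_000, 50):.2f}")
    print("suspected after", silence_until_suspected_ms(), "ms of silence")
```

Under these assumed inputs, phi stays far below `threshold = 18.0` for a gap like the 2839 ms in the warning, and only crosses it after a little under 17 s without a heartbeat.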