我正在使用Akka.NET的集群(1.0.5)功能来实现一个服务,该服务由一个主节点组成,该节点通过HTTP接收请求并将工作交给已加入集群的工作节点。
这个想法是能够轻松完成以下任务:
在需求高时将工作节点添加到群集(检查)
能够重启主节点或使其脱机(维护/错误行为/无论如何)并让工作人员在可用时重新连接(检查)
升级/重新启动行为不端的工作人员并将其重新连接到主节点(失败!)
第一点可以按照您的预期运行:新实例(Azure云服务工作者角色)正在启动,并加入主服务器 - 也就是种子节点。
对于第二点,所有工作节点都有一个侦听集群八卦的actor,它确定主节点是否已经死亡。如果是这种情况,将重新启动工作节点actor系统。
最后一点是我被困住的地方。主节点还侦听群集八卦以确定工作人员何时无法访问(ClusterEvent.UnreachableMember
)或正在关闭(退出状态)并确定是否应该将其关闭。根据我从文档中了解到的,让同一节点的“新”版本重新加入群集的唯一方法是首先删除旧版本。
不幸的是,这似乎并没有发生。在测试场景中,我试图在计算模拟器中本地重现问题,这些步骤是:
启动主节点(端口8090)
启动工作节点(端口9090)
做一些工作
突然杀死工作节点
启动工作节点备份
以下是我在此测试期间为两个节点收集的日志中的相关摘录:
主
工人无法到达:
[WARNING][07/12/2015 20:39:35][Thread 0023][[akka://InventoryService/system/cluster/core/daemon]] Cluster Node [akka.tcp://InventoryService@127.0.0.1:8090] - Marking node(s) as UNREACHABLE [Member(address = akka.tcp://InventoryService@0.0.0.0:9090, status = Up]
主节点在工作人员的地址上调用Cluster.Leave()
和Cluster.Down()
:
[DEBUG][07/12/2015 20:39:35][Thread 0021][[akka://InventoryService/system/cluster/core/daemon]] [Initialized] Received Akka.Cluster.ClusterUserAction+Leave
[INFO][07/12/2015 20:39:35][Thread 0021][[akka://InventoryService/system/cluster/core/daemon]] Marked address [akka.tcp://InventoryService@0.0.0.0:9090] as Leaving]
[DEBUG][07/12/2015 20:39:35][Thread 0021][[akka://InventoryService/system/cluster/core/daemon]] [Initialized] Received Akka.Cluster.ClusterUserAction+Down
[INFO][07/12/2015 20:39:35][Thread 0021][[akka://InventoryService/system/cluster/core/daemon]] Marking unreachable node [akka.tcp://InventoryService@0.0.0.0:9090] as Down
[DEBUG][07/12/2015 20:39:35][Thread 0020][[akka://InventoryService/system/cluster/core/daemon/heartbeatSender]] Cluster Node [akka.tcp://InventoryService@127.0.0.1:8090] - Heartbeat to [akka.tcp://InventoryService@0.0.0.0:9090]
[INFO][07/12/2015 20:39:36][Thread 0021][[akka://InventoryService/system/cluster/core/daemon]] Leader is removing unreachable node [akka.tcp://InventoryService@0.0.0.0:9090]
Master确认将不再允许旧节点加入(虽然似乎有一个bug,请看第一行 - gated instead for akka.tcp://InventoryService@0.0.0.0:9090 ms
,我想这应该是它应该被门控的时间):< / p>
[WARNING][07/12/2015 20:39:36][Thread 0013][remoting] Association to [akka.tcp://InventoryService@0.0.0.0:9090] with unknown UID is reported as quarantined, but address cannot be quarantined without knowing the UID, gated instead for akka.tcp://InventoryService@0.0.0.0:9090 ms
[DEBUG][07/12/2015 20:39:36][Thread 0015][[akka://InventoryService/system/endpointManager/reliableEndpointWriter-akka.tcp%3a%2f%2fInventoryService%400.0.0.0%3a9090-2/endpointWriter]] Disassociated [akka.tcp://InventoryService@127.0.0.1:8090] -> akka.tcp://InventoryService@0.0.0.0:9090
[DEBUG][07/12/2015 20:39:36][Thread 0020][[akka://InventoryService/system/cluster/core/daemon]] [Initialized] Received Association to [akka.tcp://InventoryService@0.0.0.0:9090] having UID [1198519768] is irrecoverably failed. UID is now quarantined and all messages to this UID will be delivered to dead letters. Remote actorsystem must be restarted to recover from this situation.
[WARNING][07/12/2015 20:39:36][Thread 0013][remoting] Association to [akka.tcp://InventoryService@0.0.0.0:9090] having UID [1198519768] is irrecoverably failed. UID is now quarantined and all messages to this UID will be delivered to dead letters. Remote actorsystem must be restarted to recover from this situation.
工作人员启动并尝试连接到主服务器:
[DEBUG][07/12/2015 20:40:20][Thread 0013][remoting] Associated [akka.tcp://InventoryService@127.0.0.1:8090] <- akka.tcp://InventoryService@0.0.0.0:9090
[DEBUG][07/12/2015 20:40:21][Thread 0023][[akka://InventoryService/system/cluster/core/daemon]] [Initialized] Received Akka.Cluster.InternalClusterAction+InitJoin
[DEBUG][07/12/2015 20:40:21][Thread 0020][[akka://InventoryService/system/cluster/core/daemon]] [Initialized] Received Akka.Cluster.InternalClusterAction+InitJoin
[DEBUG][07/12/2015 20:40:23][Thread 0021][[akka://InventoryService/system/cluster/core/daemon]] [Initialized] Received Akka.Cluster.InternalClusterAction+InitJoin
[DEBUG][07/12/2015 20:40:28][Thread 0020][[akka://InventoryService/system/cluster/core/daemon]] [Initialized] Received Akka.Cluster.InternalClusterAction+InitJoin
[DEBUG][07/12/2015 20:40:33][Thread 0023][[akka://InventoryService/system/cluster/core/daemon]] [Initialized] Received Akka.Cluster.InternalClusterAction+InitJoin
[DEBUG][07/12/2015 20:40:38][Thread 0022][[akka://InventoryService/system/cluster/core/daemon]] [Initialized] Received Akka.Cluster.InternalClusterAction+InitJoin
[DEBUG][07/12/2015 20:40:43][Thread 0021][[akka://InventoryService/system/cluster/core/daemon]] [Initialized] Received Akka.Cluster.InternalClusterAction+InitJoin
[DEBUG][07/12/2015 20:40:48][Thread 0020][[akka://InventoryService/system/cluster/core/daemon]] [Initialized] Received Akka.Cluster.InternalClusterAction+InitJoin
[DEBUG][07/12/2015 20:40:53][Thread 0023][[akka://InventoryService/system/cluster/core/daemon]] [Initialized] Received Akka.Cluster.InternalClusterAction+InitJoin
[DEBUG][07/12/2015 20:40:58][Thread 0022][[akka://InventoryService/system/cluster/core/daemon]] [Initialized] Received Akka.Cluster.InternalClusterAction+InitJoin
[DEBUG][07/12/2015 20:41:03][Thread 0021][[akka://InventoryService/system/cluster/core/daemon]] [Initialized] Received Akka.Cluster.InternalClusterAction+InitJoin
[DEBUG][07/12/2015 20:41:08][Thread 0021][[akka://InventoryService/system/cluster/core/daemon]] [Initialized] Received Akka.Cluster.InternalClusterAction+InitJoin
[DEBUG][07/12/2015 20:41:13][Thread 0020][[akka://InventoryService/system/cluster/core/daemon]] [Initialized] Received Akka.Cluster.InternalClusterAction+InitJoin
[DEBUG][07/12/2015 20:41:18][Thread 0023][[akka://InventoryService/system/cluster/core/daemon]] [Initialized] Received Akka.Cluster.InternalClusterAction+InitJoin
这里发生了什么?
工人:
被杀后重启:
[DEBUG][07/12/2015 20:40:18][Thread 0020][[akka://InventoryService/system/cluster/core/daemon]] [Uninitialized] Received Akka.Cluster.InternalClusterAction+JoinSeedNodes
[DEBUG][07/12/2015 20:40:18][Thread 0020][[akka://InventoryService/system/cluster/core/daemon]] [Uninitialized] Received Akka.Cluster.InternalClusterAction+Subscribe
[DEBUG][07/12/2015 20:40:18][Thread 0020][[akka://InventoryService/system/cluster/core/daemon]] [Uninitialized] Received Akka.Cluster.InternalClusterAction+JoinSeedNodes
[DEBUG][07/12/2015 20:40:20][Thread 0021][[akka://InventoryService/system/cluster/core/daemon]] [Uninitialized] Received Akka.Cluster.InternalClusterAction+Subscribe
[DEBUG][07/12/2015 20:40:18][Thread 0020][[akka://InventoryService/system/cluster/core/daemon]] [Uninitialized] Received Akka.Cluster.InternalClusterAction+Subscribe
[DEBUG][07/12/2015 20:40:21][Thread 0015][[akka://InventoryService/system/endpointManager/reliableEndpointWriter-akka.tcp%3a%2f%2fInventoryService%40127.0.0.1%3a8090-1/endpointWriter]] Drained buffer with maxWriteCount: 50, fullBackoffCount: 1,smallBackoffCount: 0, noBackoffCount: 0,adaptiveBackoff: 10000
就是这样......没有别的东西写入日志!
完整日志文件:
主群集配置:
cluster {
seed-nodes = ["master's address here"]
roles = [ InventoryServiceMaster, InventoryServiceWorker ]
failure-detector {
acceptable-heartbeat-pause = 5s
threshold = 10.0
}
}
工作人员的配置相同,但只有InventoryServiceWorker
角色。
我在这里缺少什么?这是配置问题吗? (我希望它不是一个错误 - 我在Github上见过其他人report a similar problem。
编辑:
为了清楚起见,我没有使用Nuget的Akka.dll
,因为它包含序列化错误 - 我检查了当前的主程序是否应用了修复程序并执行了发布版本。日志包含调试信息,因为我保留了构建中的PDB。
编辑2:
在工作日志中,重新启动后,事件Akka.Cluster.InternalClusterAction+JoinSeedNodes
出现两次,因为我最初手动调用了Cluster.JoinSeedNodes()
。我已经删除了这个,但结果仍然是一样的。
答案 0 :(得分:1)
从Akka.NET 1.1开始已经解决了这个问题 - 我们的UID系统在此之前没有正确实现(1.0.5,在本文发布时),但现在工作正常。