Question

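I'm using the clustering functionality of Akka.NET (1.0.5) to implement a service. It consists of a master node that receives requests over HTTP and hands the work off to worker nodes that have joined the cluster.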

The idea is to be able to do the following easily:

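Add worker nodes to the cluster when demand is high (check)
Restart the master node or take it offline (maintenance/misbehaviour/whatever) and have the workers reconnect once it is available again (check)
Upgrade/restart misbehaving workers and have them reconnect to the master (FAIL!)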

The first point works as you would expect: new instances (Azure Cloud Service worker roles) start up and join the master, which is also the seed node.

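For the second point, every worker node has an actor that listens to cluster gossip and determines whether the master node has died. If it has, the worker node's actor system is restarted.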
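The worker-side check described above boils down to a small decision rule. The following is a language-neutral sketch in Python, not Akka.NET code: `MemberEvent`, `WorkerWatchdog`, the event kind strings and the restart callback are hypothetical stand-ins for the real cluster event subscription and for whatever tears down and recreates the actor system. It assumes the master counts as gone when a member carrying the `InventoryServiceMaster` role is reported unreachable or removed.

```python
from dataclasses import dataclass
from typing import Callable, FrozenSet

# Hypothetical, simplified stand-ins for cluster membership events;
# these are NOT Akka.NET types.
MASTER_ROLE = "InventoryServiceMaster"
MASTER_GONE_KINDS = {"unreachable", "removed"}


@dataclass(frozen=True)
class MemberEvent:
    kind: str               # e.g. "up", "unreachable", "removed"
    address: str            # e.g. "akka.tcp://InventoryService@127.0.0.1:8090"
    roles: FrozenSet[str]


class WorkerWatchdog:
    """Calls the restart callback when a master-role member is reported gone."""

    def __init__(self, restart_actor_system: Callable[[], None]) -> None:
        self._restart = restart_actor_system

    def on_event(self, event: MemberEvent) -> None:
        # Events about other workers, or a master that is merely "up", are ignored.
        if MASTER_ROLE in event.roles and event.kind in MASTER_GONE_KINDS:
            self._restart()
```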

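The last point is where I'm stuck. The master also listens to cluster gossip to determine when a worker becomes unreachable (ClusterEvent.UnreachableMember) or is shutting down (Exiting status), and decides whether it should be downed. From what I understand from the documentation, the only way to let a "new" incarnation of the same node rejoin the cluster is to remove the old one first.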

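Unfortunately, that does not seem to work. In the test scenario, in which I try to reproduce the problem locally in the compute emulator, the steps are: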

Start the master node (port 8090)
Start a worker node (port 9090)
Do some work
Kill the worker node abruptly
Start the worker node back up
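The rule from the previous paragraph, applied to the steps above, can be modelled with members identified by their address plus a UID that changes every time the actor system starts (the worker's UID appears in the master's log below). This is a hypothetical Python model of the rule as described in the documentation quote above, not Akka.NET's actual join handling; `UniqueAddress`, `Membership` and its methods are illustrative names.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass(frozen=True)
class UniqueAddress:
    address: str  # host:port identity, e.g. "akka.tcp://InventoryService@0.0.0.0:9090"
    uid: int      # assumed: a fresh random id for every actor-system start


@dataclass
class Membership:
    members: Dict[UniqueAddress, str] = field(default_factory=dict)  # node -> status

    def try_join(self, node: UniqueAddress) -> bool:
        """Admit `node` unless an older incarnation on the same address is still a member."""
        for existing in self.members:
            if existing.address == node.address and existing.uid != node.uid:
                return False
        self.members.setdefault(node, "Joining")
        return True

    def down(self, node: UniqueAddress) -> None:
        # Downing only changes the status; the old incarnation still blocks the address.
        if node in self.members:
            self.members[node] = "Down"

    def remove(self, node: UniqueAddress) -> None:
        # In this model, removal by the leader is what frees the address.
        self.members.pop(node, None)
```

Walking through the test scenario: the restarted worker is refused while the killed incarnation is still listed (even as Down), and admitted only once it has been removed.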

Here are the relevant excerpts from the logs I collected for both nodes during this test:

Master

The worker becomes unreachable:

[WARNING][07/12/2015 20:39:35][Thread 0023][[akka://InventoryService/system/cluster/core/daemon]] Cluster Node [akka.tcp://InventoryService@127.0.0.1:8090] - Marking node(s) as UNREACHABLE [Member(address = akka.tcp://InventoryService@0.0.0.0:9090, status = Up]

The master calls Cluster.Leave() and Cluster.Down() on the worker's address:

[DEBUG][07/12/2015 20:39:35][Thread 0021][[akka://InventoryService/system/cluster/core/daemon]] [Initialized] Received Akka.Cluster.ClusterUserAction+Leave
[INFO][07/12/2015 20:39:35][Thread 0021][[akka://InventoryService/system/cluster/core/daemon]] Marked address [akka.tcp://InventoryService@0.0.0.0:9090] as Leaving]
[DEBUG][07/12/2015 20:39:35][Thread 0021][[akka://InventoryService/system/cluster/core/daemon]] [Initialized] Received Akka.Cluster.ClusterUserAction+Down
[INFO][07/12/2015 20:39:35][Thread 0021][[akka://InventoryService/system/cluster/core/daemon]] Marking unreachable node [akka.tcp://InventoryService@0.0.0.0:9090] as Down
[DEBUG][07/12/2015 20:39:35][Thread 0020][[akka://InventoryService/system/cluster/core/daemon/heartbeatSender]] Cluster Node [akka.tcp://InventoryService@127.0.0.1:8090] - Heartbeat to [akka.tcp://InventoryService@0.0.0.0:9090]
[INFO][07/12/2015 20:39:36][Thread 0021][[akka://InventoryService/system/cluster/core/daemon]] Leader is removing unreachable node [akka.tcp://InventoryService@0.0.0.0:9090]

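The master confirms that the old node will no longer be allowed to join (although there seems to be a bug: see the first line, "gated instead for akka.tcp://InventoryService@0.0.0.0:9090 ms", where I assume the address should have been the amount of time it is gated for):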

[WARNING][07/12/2015 20:39:36][Thread 0013][remoting] Association to [akka.tcp://InventoryService@0.0.0.0:9090] with unknown UID is reported as quarantined, but address cannot be quarantined without knowing the UID, gated instead for akka.tcp://InventoryService@0.0.0.0:9090 ms
[DEBUG][07/12/2015 20:39:36][Thread 0015][[akka://InventoryService/system/endpointManager/reliableEndpointWriter-akka.tcp%3a%2f%2fInventoryService%400.0.0.0%3a9090-2/endpointWriter]] Disassociated [akka.tcp://InventoryService@127.0.0.1:8090] -> akka.tcp://InventoryService@0.0.0.0:9090
[DEBUG][07/12/2015 20:39:36][Thread 0020][[akka://InventoryService/system/cluster/core/daemon]] [Initialized] Received Association to [akka.tcp://InventoryService@0.0.0.0:9090] having UID [1198519768] is irrecoverably failed. UID is now quarantined and all messages to this UID will be delivered to dead letters. Remote actorsystem must be restarted to recover from this situation.
[WARNING][07/12/2015 20:39:36][Thread 0013][remoting] Association to [akka.tcp://InventoryService@0.0.0.0:9090] having UID [1198519768] is irrecoverably failed. UID is now quarantined and all messages to this UID will be delivered to dead letters. Remote actorsystem must be restarted to recover from this situation.
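The two warnings above describe two different mechanisms. Read literally, a quarantine is tied to one UID, that is, one incarnation of an actor system, while an address whose UID is unknown can only be gated for a while. A toy Python model of that reading (hypothetical names; `gate_seconds` stands in for whatever gate duration the remoting layer actually uses) shows why a restarted worker, which comes back on the same host:port with a new UID, should in principle be admissible again:

```python
from typing import Dict, Optional, Set, Tuple


class Quarantine:
    """Toy model of UID-based quarantine versus address gating (not Akka.NET code)."""

    def __init__(self) -> None:
        self._quarantined: Set[Tuple[str, int]] = set()  # (address, uid) pairs
        self._gated_until: Dict[str, float] = {}         # address -> time gate reopens

    def quarantine(self, address: str, uid: Optional[int],
                   now: float, gate_seconds: float) -> None:
        if uid is None:
            # No UID known: the address can only be gated temporarily.
            self._gated_until[address] = now + gate_seconds
        else:
            # UID known: this one incarnation is blocked permanently.
            self._quarantined.add((address, uid))

    def accepts(self, address: str, uid: int, now: float) -> bool:
        if (address, uid) in self._quarantined:
            return False
        return now >= self._gated_until.get(address, float("-inf"))
```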

The worker starts up and tries to connect to the master:

[DEBUG][07/12/2015 20:40:20][Thread 0013][remoting] Associated [akka.tcp://InventoryService@127.0.0.1:8090] <- akka.tcp://InventoryService@0.0.0.0:9090
[DEBUG][07/12/2015 20:40:21][Thread 0023][[akka://InventoryService/system/cluster/core/daemon]] [Initialized] Received Akka.Cluster.InternalClusterAction+InitJoin
[DEBUG][07/12/2015 20:40:21][Thread 0020][[akka://InventoryService/system/cluster/core/daemon]] [Initialized] Received Akka.Cluster.InternalClusterAction+InitJoin
[DEBUG][07/12/2015 20:40:23][Thread 0021][[akka://InventoryService/system/cluster/core/daemon]] [Initialized] Received Akka.Cluster.InternalClusterAction+InitJoin
[DEBUG][07/12/2015 20:40:28][Thread 0020][[akka://InventoryService/system/cluster/core/daemon]] [Initialized] Received Akka.Cluster.InternalClusterAction+InitJoin
[DEBUG][07/12/2015 20:40:33][Thread 0023][[akka://InventoryService/system/cluster/core/daemon]] [Initialized] Received Akka.Cluster.InternalClusterAction+InitJoin
[DEBUG][07/12/2015 20:40:38][Thread 0022][[akka://InventoryService/system/cluster/core/daemon]] [Initialized] Received Akka.Cluster.InternalClusterAction+InitJoin
[DEBUG][07/12/2015 20:40:43][Thread 0021][[akka://InventoryService/system/cluster/core/daemon]] [Initialized] Received Akka.Cluster.InternalClusterAction+InitJoin
[DEBUG][07/12/2015 20:40:48][Thread 0020][[akka://InventoryService/system/cluster/core/daemon]] [Initialized] Received Akka.Cluster.InternalClusterAction+InitJoin
[DEBUG][07/12/2015 20:40:53][Thread 0023][[akka://InventoryService/system/cluster/core/daemon]] [Initialized] Received Akka.Cluster.InternalClusterAction+InitJoin
[DEBUG][07/12/2015 20:40:58][Thread 0022][[akka://InventoryService/system/cluster/core/daemon]] [Initialized] Received Akka.Cluster.InternalClusterAction+InitJoin
[DEBUG][07/12/2015 20:41:03][Thread 0021][[akka://InventoryService/system/cluster/core/daemon]] [Initialized] Received Akka.Cluster.InternalClusterAction+InitJoin
[DEBUG][07/12/2015 20:41:08][Thread 0021][[akka://InventoryService/system/cluster/core/daemon]] [Initialized] Received Akka.Cluster.InternalClusterAction+InitJoin
[DEBUG][07/12/2015 20:41:13][Thread 0020][[akka://InventoryService/system/cluster/core/daemon]] [Initialized] Received Akka.Cluster.InternalClusterAction+InitJoin
[DEBUG][07/12/2015 20:41:18][Thread 0023][[akka://InventoryService/system/cluster/core/daemon]] [Initialized] Received Akka.Cluster.InternalClusterAction+InitJoin

What is going on here?

Worker

Restarted after being killed:

[DEBUG][07/12/2015 20:40:18][Thread 0020][[akka://InventoryService/system/cluster/core/daemon]] [Uninitialized] Received Akka.Cluster.InternalClusterAction+JoinSeedNodes
[DEBUG][07/12/2015 20:40:18][Thread 0020][[akka://InventoryService/system/cluster/core/daemon]] [Uninitialized] Received Akka.Cluster.InternalClusterAction+Subscribe
[DEBUG][07/12/2015 20:40:18][Thread 0020][[akka://InventoryService/system/cluster/core/daemon]] [Uninitialized] Received Akka.Cluster.InternalClusterAction+JoinSeedNodes
[DEBUG][07/12/2015 20:40:20][Thread 0021][[akka://InventoryService/system/cluster/core/daemon]] [Uninitialized] Received Akka.Cluster.InternalClusterAction+Subscribe
[DEBUG][07/12/2015 20:40:18][Thread 0020][[akka://InventoryService/system/cluster/core/daemon]] [Uninitialized] Received Akka.Cluster.InternalClusterAction+Subscribe
[DEBUG][07/12/2015 20:40:21][Thread 0015][[akka://InventoryService/system/endpointManager/reliableEndpointWriter-akka.tcp%3a%2f%2fInventoryService%40127.0.0.1%3a8090-1/endpointWriter]] Drained buffer with maxWriteCount: 50, fullBackoffCount: 1,smallBackoffCount: 0, noBackoffCount: 0,adaptiveBackoff: 10000

That's it... nothing else is written to the log!

Full log files:

Master: http://pastebin.com/raw.php?i=WtjEhV1V
Worker: http://pastebin.com/raw.php?i=QGPxkqEd

Master cluster configuration:

cluster {
    seed-nodes = ["master's address here"]
    roles = [ InventoryServiceMaster, InventoryServiceWorker ]
    failure-detector {
        acceptable-heartbeat-pause = 5s
        threshold = 10.0
    }
}

The worker's configuration is the same, but with only the InventoryServiceWorker role.
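The failure-detector settings in this configuration tune a phi accrual failure detector. The sketch below shows, under stated assumptions, how `threshold` and `acceptable-heartbeat-pause` interact: the pause is added to the mean heartbeat interval, and a node counts as unreachable once phi reaches the threshold. It uses the exact normal CDF via `math.erfc`; Akka.NET's own implementation may use an approximation and has further settings (the 100 ms minimum standard deviation here is an assumed default), so the numbers are illustrative only.

```python
import math
import statistics
from typing import Sequence


def phi(elapsed_ms: float, intervals_ms: Sequence[float],
        acceptable_pause_ms: float, min_std_ms: float = 100.0) -> float:
    """Suspicion level for a node whose last heartbeat arrived `elapsed_ms` ago."""
    # Tolerated pauses are modelled by shifting the expected interval.
    mean = statistics.fmean(intervals_ms) + acceptable_pause_ms
    std = max(statistics.pstdev(intervals_ms), min_std_ms)
    # Probability, under a normal model, that the next heartbeat is still to come.
    p_later = 0.5 * math.erfc((elapsed_ms - mean) / (std * math.sqrt(2)))
    return -math.log10(p_later) if p_later > 0 else math.inf


def is_available(elapsed_ms: float, intervals_ms: Sequence[float],
                 acceptable_pause_ms: float = 5000.0, threshold: float = 10.0) -> bool:
    # Defaults mirror acceptable-heartbeat-pause = 5s and threshold = 10.0 above.
    return phi(elapsed_ms, intervals_ms, acceptable_pause_ms) < threshold
```

In this model, with a steady one-second heartbeat history, a worker is only suspected after roughly 6.6 seconds of silence; raising either setting makes the master slower, but less eager, to mark a node unreachable.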

What am I missing here? Is this a configuration issue? (I hope it isn't a bug; I've seen someone else report a similar problem on GitHub.)

Edit:

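To be clear, I'm not using the Akka.dll from NuGet, because it contains a serialization bug. I checked out the current master, which has the fix applied, and did a release build. The logs contain debug information because I kept the PDBs from the build.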

Edit 2:

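In the worker log, after the restart, the Akka.Cluster.InternalClusterAction+JoinSeedNodes event appears twice because I was initially calling Cluster.JoinSeedNodes() manually. I've removed that call, but the result is still the same.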

Answer 1

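This has been resolved as of Akka.NET 1.1. Our UID system was not implemented correctly before then (in 1.0.5, the current release when this was posted), but it works correctly now.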

Node does not rejoin the cluster after being downed
