Node does not rejoin the cluster after being downed

Time: 2015-12-07 23:12:10

Tags: akka.net akka.net-cluster

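I am using the clustering feature of Akka.NET (1.0.5) to implement a service that consists of a master node, which receives requests over HTTP and hands the work off to worker nodes that have joined the cluster.

The idea is to be able to do the following easily: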

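  • Add worker nodes to the cluster when demand is high (check)

  • Restart the master node or take it offline (maintenance, misbehavior, whatever) and have the workers reconnect once it is available again (check)

  • Upgrade or restart misbehaving workers and have them reconnect to the master node (fail!)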

The first point works as you would expect: new instances (Azure Cloud Service worker roles) start up and join the master, which is also the seed node.

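For the second point, every worker node has an actor that listens to cluster gossip and determines whether the master node has died. If it has, the worker node's actor system is restarted.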

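The last point is where I am stuck. The master node also listens to cluster gossip to determine when a worker becomes unreachable (ClusterEvent.UnreachableMember) or is shutting down (Exiting status), and then decides whether it should be downed. From what I understand of the documentation, the only way to let a "new" incarnation of the same node rejoin the cluster is to remove the old one first.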
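The post does not show the listener itself, so the following is only a minimal sketch of what such a master-side actor might look like. The class name `WorkerWatcher` is hypothetical, and the decision logic (always remove an unreachable worker) is an assumption; the sketch calls Leave and then Down only because that is the sequence visible in the master's log. Handling of the Exiting status is left out.

```csharp
using Akka.Actor;
using Akka.Cluster;

// Hypothetical actor: the original post does not include its listener code.
public class WorkerWatcher : ReceiveActor
{
    private readonly Akka.Cluster.Cluster _cluster =
        Akka.Cluster.Cluster.Get(Context.System);

    public WorkerWatcher()
    {
        // A member was flagged unreachable by the failure detector.
        Receive<ClusterEvent.UnreachableMember>(msg =>
        {
            // Assumption: every unreachable worker is removed right away.
            // The role name is taken from the configuration in the post.
            if (msg.Member.HasRole("InventoryServiceWorker"))
            {
                _cluster.Leave(msg.Member.Address);
                _cluster.Down(msg.Member.Address);
            }
        });

        // Published once the leader has actually removed the member.
        Receive<ClusterEvent.MemberRemoved>(msg =>
        {
            // A new incarnation at this address should be able to join from here on.
        });
    }

    protected override void PreStart()
    {
        // Subscribe to the two event types handled above.
        _cluster.Subscribe(Self, ClusterEvent.InitialStateAsEvents,
            new[] { typeof(ClusterEvent.UnreachableMember), typeof(ClusterEvent.MemberRemoved) });
    }

    protected override void PostStop()
    {
        _cluster.Unsubscribe(Self);
    }
}
```

Waiting for `MemberRemoved` rather than assuming removal right after `Down()` makes it easier to see in the logs whether the old incarnation was really taken out of the membership before the restarted worker tries to join.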

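Unfortunately, that does not seem to happen. In a test scenario where I try to reproduce the problem locally in the compute emulator, the steps are: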

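  1. Start the master node (port 8090)

  2. Start the worker node (port 9090)

  3. Do some work

  4. Kill the worker node abruptly

  5. Start the worker node back up

Here are the relevant excerpts from the logs I collected on both nodes during this test: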

The worker becomes unreachable:

    [WARNING][07/12/2015 20:39:35][Thread 0023][[akka://InventoryService/system/cluster/core/daemon]] Cluster Node [akka.tcp://InventoryService@127.0.0.1:8090] - Marking node(s) as UNREACHABLE [Member(address = akka.tcp://InventoryService@0.0.0.0:9090, status = Up]
    

The master node calls Cluster.Leave() and Cluster.Down() on the worker's address:

    [DEBUG][07/12/2015 20:39:35][Thread 0021][[akka://InventoryService/system/cluster/core/daemon]] [Initialized] Received Akka.Cluster.ClusterUserAction+Leave
    [INFO][07/12/2015 20:39:35][Thread 0021][[akka://InventoryService/system/cluster/core/daemon]] Marked address [akka.tcp://InventoryService@0.0.0.0:9090] as Leaving]
    [DEBUG][07/12/2015 20:39:35][Thread 0021][[akka://InventoryService/system/cluster/core/daemon]] [Initialized] Received Akka.Cluster.ClusterUserAction+Down
    [INFO][07/12/2015 20:39:35][Thread 0021][[akka://InventoryService/system/cluster/core/daemon]] Marking unreachable node [akka.tcp://InventoryService@0.0.0.0:9090] as Down
    [DEBUG][07/12/2015 20:39:35][Thread 0020][[akka://InventoryService/system/cluster/core/daemon/heartbeatSender]] Cluster Node [akka.tcp://InventoryService@127.0.0.1:8090] - Heartbeat to [akka.tcp://InventoryService@0.0.0.0:9090]
    [INFO][07/12/2015 20:39:36][Thread 0021][[akka://InventoryService/system/cluster/core/daemon]] Leader is removing unreachable node [akka.tcp://InventoryService@0.0.0.0:9090]
    

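The master acknowledges that the old node will no longer be allowed to join (although there seems to be a bug, see the first line: "gated instead for akka.tcp://InventoryService@0.0.0.0:9090 ms"; I guess that value should be the time it is gated for):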

    [WARNING][07/12/2015 20:39:36][Thread 0013][remoting] Association to [akka.tcp://InventoryService@0.0.0.0:9090] with unknown UID is reported as quarantined, but address cannot be quarantined without knowing the UID, gated instead for akka.tcp://InventoryService@0.0.0.0:9090 ms
    [DEBUG][07/12/2015 20:39:36][Thread 0015][[akka://InventoryService/system/endpointManager/reliableEndpointWriter-akka.tcp%3a%2f%2fInventoryService%400.0.0.0%3a9090-2/endpointWriter]] Disassociated [akka.tcp://InventoryService@127.0.0.1:8090] -> akka.tcp://InventoryService@0.0.0.0:9090
    [DEBUG][07/12/2015 20:39:36][Thread 0020][[akka://InventoryService/system/cluster/core/daemon]] [Initialized] Received Association to [akka.tcp://InventoryService@0.0.0.0:9090] having UID [1198519768] is irrecoverably failed. UID is now quarantined and all messages to this UID will be delivered to dead letters. Remote actorsystem must be restarted to recover from this situation.
    [WARNING][07/12/2015 20:39:36][Thread 0013][remoting] Association to [akka.tcp://InventoryService@0.0.0.0:9090] having UID [1198519768] is irrecoverably failed. UID is now quarantined and all messages to this UID will be delivered to dead letters. Remote actorsystem must be restarted to recover from this situation.
    

The worker starts up and tries to connect to the master:

    [DEBUG][07/12/2015 20:40:20][Thread 0013][remoting] Associated [akka.tcp://InventoryService@127.0.0.1:8090] <- akka.tcp://InventoryService@0.0.0.0:9090
    [DEBUG][07/12/2015 20:40:21][Thread 0023][[akka://InventoryService/system/cluster/core/daemon]] [Initialized] Received Akka.Cluster.InternalClusterAction+InitJoin
    [DEBUG][07/12/2015 20:40:21][Thread 0020][[akka://InventoryService/system/cluster/core/daemon]] [Initialized] Received Akka.Cluster.InternalClusterAction+InitJoin
    [DEBUG][07/12/2015 20:40:23][Thread 0021][[akka://InventoryService/system/cluster/core/daemon]] [Initialized] Received Akka.Cluster.InternalClusterAction+InitJoin
    [DEBUG][07/12/2015 20:40:28][Thread 0020][[akka://InventoryService/system/cluster/core/daemon]] [Initialized] Received Akka.Cluster.InternalClusterAction+InitJoin
    [DEBUG][07/12/2015 20:40:33][Thread 0023][[akka://InventoryService/system/cluster/core/daemon]] [Initialized] Received Akka.Cluster.InternalClusterAction+InitJoin
    [DEBUG][07/12/2015 20:40:38][Thread 0022][[akka://InventoryService/system/cluster/core/daemon]] [Initialized] Received Akka.Cluster.InternalClusterAction+InitJoin
    [DEBUG][07/12/2015 20:40:43][Thread 0021][[akka://InventoryService/system/cluster/core/daemon]] [Initialized] Received Akka.Cluster.InternalClusterAction+InitJoin
    [DEBUG][07/12/2015 20:40:48][Thread 0020][[akka://InventoryService/system/cluster/core/daemon]] [Initialized] Received Akka.Cluster.InternalClusterAction+InitJoin
    [DEBUG][07/12/2015 20:40:53][Thread 0023][[akka://InventoryService/system/cluster/core/daemon]] [Initialized] Received Akka.Cluster.InternalClusterAction+InitJoin
    [DEBUG][07/12/2015 20:40:58][Thread 0022][[akka://InventoryService/system/cluster/core/daemon]] [Initialized] Received Akka.Cluster.InternalClusterAction+InitJoin
    [DEBUG][07/12/2015 20:41:03][Thread 0021][[akka://InventoryService/system/cluster/core/daemon]] [Initialized] Received Akka.Cluster.InternalClusterAction+InitJoin
    [DEBUG][07/12/2015 20:41:08][Thread 0021][[akka://InventoryService/system/cluster/core/daemon]] [Initialized] Received Akka.Cluster.InternalClusterAction+InitJoin
    [DEBUG][07/12/2015 20:41:13][Thread 0020][[akka://InventoryService/system/cluster/core/daemon]] [Initialized] Received Akka.Cluster.InternalClusterAction+InitJoin
    [DEBUG][07/12/2015 20:41:18][Thread 0023][[akka://InventoryService/system/cluster/core/daemon]] [Initialized] Received Akka.Cluster.InternalClusterAction+InitJoin
    

What is happening here?

Worker:

Restart after being killed:

    [DEBUG][07/12/2015 20:40:18][Thread 0020][[akka://InventoryService/system/cluster/core/daemon]] [Uninitialized] Received Akka.Cluster.InternalClusterAction+JoinSeedNodes
    [DEBUG][07/12/2015 20:40:18][Thread 0020][[akka://InventoryService/system/cluster/core/daemon]] [Uninitialized] Received Akka.Cluster.InternalClusterAction+Subscribe
    [DEBUG][07/12/2015 20:40:18][Thread 0020][[akka://InventoryService/system/cluster/core/daemon]] [Uninitialized] Received Akka.Cluster.InternalClusterAction+JoinSeedNodes
    [DEBUG][07/12/2015 20:40:20][Thread 0021][[akka://InventoryService/system/cluster/core/daemon]] [Uninitialized] Received Akka.Cluster.InternalClusterAction+Subscribe
    [DEBUG][07/12/2015 20:40:18][Thread 0020][[akka://InventoryService/system/cluster/core/daemon]] [Uninitialized] Received Akka.Cluster.InternalClusterAction+Subscribe
    [DEBUG][07/12/2015 20:40:21][Thread 0015][[akka://InventoryService/system/endpointManager/reliableEndpointWriter-akka.tcp%3a%2f%2fInventoryService%40127.0.0.1%3a8090-1/endpointWriter]] Drained buffer with maxWriteCount: 50, fullBackoffCount: 1,smallBackoffCount: 0, noBackoffCount: 0,adaptiveBackoff: 10000
    

And that's it... nothing else is written to the log!

Full log files:

Master cluster configuration:

    cluster {
        seed-nodes = ["master's address here"]
        roles = [ InventoryServiceMaster, InventoryServiceWorker ]
        failure-detector {
            acceptable-heartbeat-pause = 5s
            threshold = 10.0
        }
    }
    

The worker's configuration is identical, except that it has only the InventoryServiceWorker role.
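Spelled out, the worker's cluster section would presumably look like the sketch below. It is derived from the master's configuration above with only the role list changed; the seed-node value is the same placeholder used in the post, not a real address.

```hocon
cluster {
    # Placeholder kept from the post; the real value is the master's address.
    seed-nodes = ["master's address here"]
    # Only the worker role, per the description above.
    roles = [ InventoryServiceWorker ]
    failure-detector {
        acceptable-heartbeat-pause = 5s
        threshold = 10.0
    }
}
```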

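What am I missing here? Is this a configuration problem? (I hope it is not a bug; I have seen others on GitHub report a similar problem.)

Edit:

To be clear, I am not using the Akka.dll from NuGet because it contains a serialization bug. I checked that the current master branch has the fix applied and made a release build from it. The logs contain debug information because I kept the PDBs in the build.

Edit 2:

In the worker's log, after the restart, the event Akka.Cluster.InternalClusterAction+JoinSeedNodes appears twice because I was originally calling Cluster.JoinSeedNodes() manually. I have removed that call, but the result is still the same.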

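1 Answer:

Answer 0 (score: 1)

This has been resolved as of Akka.NET 1.1. Our UID system was not implemented correctly before that (1.0.5, at the time this question was posted), but it works correctly now.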