Question

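Summary

I am currently trying to build an authentication application using Keycloak deployed as a Docker service. My infrastructure is as follows:

Server: CentOS 7
Docker: 17.06.2-ce, with the Weaveworks network plugin
Keycloak: 3.3.0.Final
PostgreSQL: 9.4
5 Keycloak instances deployed as a cluster in Docker Swarm

While building the cluster, I ran into caching problems. With a 2-node cluster I get no errors, but when I scale to 5 nodes, many warnings like this one appear: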

WARN [org.jboss.as.clustering.jgroups.protocol.NAKACK2] (thread-3) JGRP000041: bd3eeb23695b: message d8896fbba960::14 not found in retransmission table

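When these messages start to appear, the containers stop responding correctly, and eventually some of them stop their Keycloak instance. This kind of error occurs in various situations:

While the service is starting, so the application never even manages to start.
After Keycloak has started correctly, even when there is very little activity on the nodes.

Symptoms

When the application crashes, I see:

1) Many logs like the one shown above, which seem to repeat (i.e. messages from one node that are never found):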

2018-08-22 09:59:33,346 WARN  [org.jboss.as.clustering.jgroups.protocol.NAKACK2] (thread-2) JGRP000041: bd3eeb23695b: message d8896fbba960::15 not found in retransmission table
2018-08-22 09:59:33,346 WARN  [org.jboss.as.clustering.jgroups.protocol.NAKACK2] (thread-2) JGRP000041: bd3eeb23695b: message d8896fbba960::16 not found in retransmission table
2018-08-22 09:59:33,346 WARN  [org.jboss.as.clustering.jgroups.protocol.NAKACK2] (thread-2) JGRP000041: bd3eeb23695b: message d8896fbba960::17 not found in retransmission table
2018-08-22 09:59:33,346 WARN  [org.jboss.as.clustering.jgroups.protocol.NAKACK2] (thread-2) JGRP000041: bd3eeb23695b: message d8896fbba960::18 not found in retransmission table
...
2018-08-22 09:59:33,040 WARN  [org.jboss.as.clustering.jgroups.protocol.NAKACK2] (thread-2) JGRP000041: bd3eeb23695b: message d8896fbba960::15 not found in retransmission table
2018-08-22 09:59:33,040 WARN  [org.jboss.as.clustering.jgroups.protocol.NAKACK2] (thread-2) JGRP000041: bd3eeb23695b: message d8896fbba960::16 not found in retransmission table
2018-08-22 09:59:33,040 WARN  [org.jboss.as.clustering.jgroups.protocol.NAKACK2] (thread-2) JGRP000041: bd3eeb23695b: message d8896fbba960::17 not found in retransmission table
2018-08-22 09:59:33,040 WARN  [org.jboss.as.clustering.jgroups.protocol.NAKACK2] (thread-2) JGRP000041: bd3eeb23695b: message d8896fbba960::18 not found in retransmission table
...

2) The node that sent the messages then shows various cache errors:

2018-08-22 09:58:37,130 ERROR [org.infinispan.interceptors.InvocationContextInterceptor] (ServerService Thread Pool -- 61) ISPN000136: Error executing command PutKeyValueCommand, writing keys [cluster-start-time]: org.infinispan.util.concurrent.TimeoutException: Replication timeout

2018-08-22 09:58:37,149 ERROR [org.jboss.msc.service.fail] (ServerService Thread Pool -- 61) MSC000001: Failed to start service jboss.undertow.deployment.default-server.default-host./odino-stif-keycloak-int/auth: org.jboss.msc.service.StartException in service jboss.undertow.deployment.default-server.default-host./odino-stif-keycloak-int/auth: java.lang.RuntimeException: RESTEASY003325: Failed to construct public org.keycloak.services.resources.KeycloakApplication(javax.servlet.ServletContext,org.jboss.resteasy.core.Dispatcher)

2018-08-22 09:58:37,178 ERROR [org.jboss.as.controller.management-operation] (Controller Boot Thread) WFLYCTL0013: Operation ("add") failed - address: ([("deployment" => "keycloak-server.war")]) - failure description: {"WFLYCTL0080: Failed services" => {"jboss.undertow.deployment.default-server.default-host./odino-stif-keycloak-int/auth" => "java.lang.RuntimeException: RESTEASY003325: Failed to construct public org.keycloak.services.resources.KeycloakApplication(javax.servlet.ServletContext,org.jboss.resteasy.core.Dispatcher)
    Caused by: java.lang.RuntimeException: RESTEASY003325: Failed to construct public org.keycloak.services.resources.KeycloakApplication(javax.servlet.ServletContext,org.jboss.resteasy.core.Dispatcher)
    Caused by: org.infinispan.util.concurrent.TimeoutException: Replication timeout"}}

2018-08-22 09:58:37,409 WARN  [org.infinispan.topology.CacheTopologyControlCommand] (ServerService Thread Pool -- 60) ISPN000071: Caught exception when handling command CacheTopologyControlCommand{cache=actionTokens, type=LEAVE, sender=d8896fbba960, joinInfo=null, topologyId=0, rebalanceId=0, currentCH=null, pendingCH=null, availabilityMode=null, actualMembers=null, throwable=null, viewId=3}: java.lang.IllegalArgumentException: A cache topology's pending consistent hash must contain all the current consistent hash's members

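After that, the node usually stops all of its caches and Keycloak.

Configurations and solutions tried

I tried the following, without success:

Changing the timeout parameters on Keycloak's various caches (to give the cluster more time to stabilize).
Changing some default values of the NAKACK2 protocol in the Keycloak configuration file. The goal was to limit the traffic between nodes and to increase the number of elements in the retransmission table, so that messages are not dropped before all nodes have received them. However, these changes did not alleviate my problem.

The configuration I currently use is as follows: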

<subsystem xmlns="urn:jboss:domain:infinispan:4.0">
    <cache-container name="keycloak" jndi-name="infinispan/Keycloak">
        <transport lock-timeout="500000"/>
        <local-cache name="realms">
            <eviction max-entries="10000" strategy="LRU"/>
        </local-cache>
        <local-cache name="users">
            <eviction max-entries="10000" strategy="LRU"/>
        </local-cache>
        <distributed-cache name="sessions" mode="SYNC" owners="3"/>
        <distributed-cache name="authenticationSessions" mode="SYNC" owners="3"/>
        <distributed-cache name="offlineSessions" mode="SYNC" owners="1"/>
        <distributed-cache name="loginFailures" mode="SYNC" owners="1"/>
        <local-cache name="authorization">
            <eviction max-entries="10000" strategy="LRU"/>
        </local-cache>
        <replicated-cache name="work" mode="SYNC"/>
        <local-cache name="keys">
            <eviction max-entries="1000" strategy="LRU"/>
            <expiration max-idle="3600000"/>
        </local-cache>
        <distributed-cache name="actionTokens" mode="SYNC" owners="2">
            <eviction max-entries="-1" strategy="NONE"/>
            <expiration max-idle="-1" interval="300000"/>
        </distributed-cache>
    </cache-container>
...
    <cache-container name="ejb" aliases="sfsb" default-cache="dist" module="org.wildfly.clustering.ejb.infinispan">
        <transport lock-timeout="300000"/>
        <distributed-cache name="dist">
            <locking isolation="REPEATABLE_READ"/>
            <transaction mode="BATCH"/>
            <file-store/>
        </distributed-cache>
    </cache-container>
</subsystem>
...
<protocol type="pbcast.NAKACK2">
    <property name="use_mcast_xmit">false</property>
    <property name="xmit_table_num_rows">200</property>
</protocol>
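
To illustrate the two kinds of changes described above, here is a minimal sketch of what such tuning can look like. It assumes the `remote-timeout` attribute of the WildFly Infinispan subsystem (the timeout behind the "Replication timeout" error) and the JGroups NAKACK2 properties `xmit_table_msgs_per_row` and `xmit_interval`. All values are illustrative assumptions, not the values actually tested and not a confirmed fix.

```xml
<!-- Illustrative values only; these are assumptions, not a verified solution. -->

<!-- Inside <cache-container name="keycloak">:
     remote-timeout is the time (in ms) a SYNC cache waits for remote acknowledgments. -->
<distributed-cache name="sessions" mode="SYNC" owners="3" remote-timeout="60000"/>
<replicated-cache name="work" mode="SYNC" remote-timeout="60000"/>

<!-- Inside the JGroups stack:
     a larger retransmission table (rows x messages per row) and a longer
     interval (in ms) between retransmission requests for missing messages. -->
<protocol type="pbcast.NAKACK2">
    <property name="use_mcast_xmit">false</property>
    <property name="xmit_table_num_rows">200</property>
    <property name="xmit_table_msgs_per_row">4000</property>
    <property name="xmit_interval">2000</property>
</protocol>
```

The cache elements would replace the matching entries in the `keycloak` cache container, and the `protocol` element would replace the existing NAKACK2 entry in whichever JGroups stack the cluster uses.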

So, do you have any idea why this happens? How can I update my configuration to fix this problem?

Replication timeout and messages missing from the retransmission table when starting and using a Keycloak cluster

0 answers: