Replication timeouts and messages not found in the retransmission table when starting and using a Keycloak cluster

Asked: 2018-08-22 15:27:07

Tags: keycloak

Summary

I am currently trying to build an authentication application using Keycloak deployed as a Docker service. My infrastructure is as follows:

  • Server: CentOS 7
  • Docker: 17.06.2-ce, with the Weaveworks network plugin
  • Keycloak: 3.3.0.Final
  • PostgreSQL: 9.4
  • 5 Keycloak instances deployed as a cluster in Docker Swarm

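When building the cluster, I ran into caching problems. With a 2-node cluster I get no errors, but when I scale to 5 nodes, many warnings like this one appear: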

WARN [org.jboss.as.clustering.jgroups.protocol.NAKACK2] (thread-3) JGRP000041: bd3eeb23695b: message d8896fbba960::14 not found in retransmission table

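Once these messages start to appear, the containers stop responding correctly, and eventually some of them shut down their Keycloak instance. This kind of error occurs in several situations: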

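  • When starting the service, so the application cannot even start successfully.
  • After Keycloak has started correctly, even when there is very little activity on the nodes.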

Symptoms

When the application crashes, I see:

1) Many log entries like the one shown above, which appear to repeat (for example, messages from one node that are never found):

2018-08-22 09:59:33,346 WARN  [org.jboss.as.clustering.jgroups.protocol.NAKACK2] (thread-2) JGRP000041: bd3eeb23695b: message d8896fbba960::15 not found in retransmission table
2018-08-22 09:59:33,346 WARN  [org.jboss.as.clustering.jgroups.protocol.NAKACK2] (thread-2) JGRP000041: bd3eeb23695b: message d8896fbba960::16 not found in retransmission table
2018-08-22 09:59:33,346 WARN  [org.jboss.as.clustering.jgroups.protocol.NAKACK2] (thread-2) JGRP000041: bd3eeb23695b: message d8896fbba960::17 not found in retransmission table
2018-08-22 09:59:33,346 WARN  [org.jboss.as.clustering.jgroups.protocol.NAKACK2] (thread-2) JGRP000041: bd3eeb23695b: message d8896fbba960::18 not found in retransmission table
...
2018-08-22 09:59:33,040 WARN  [org.jboss.as.clustering.jgroups.protocol.NAKACK2] (thread-2) JGRP000041: bd3eeb23695b: message d8896fbba960::15 not found in retransmission table
2018-08-22 09:59:33,040 WARN  [org.jboss.as.clustering.jgroups.protocol.NAKACK2] (thread-2) JGRP000041: bd3eeb23695b: message d8896fbba960::16 not found in retransmission table
2018-08-22 09:59:33,040 WARN  [org.jboss.as.clustering.jgroups.protocol.NAKACK2] (thread-2) JGRP000041: bd3eeb23695b: message d8896fbba960::17 not found in retransmission table
2018-08-22 09:59:33,040 WARN  [org.jboss.as.clustering.jgroups.protocol.NAKACK2] (thread-2) JGRP000041: bd3eeb23695b: message d8896fbba960::18 not found in retransmission table
...
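
To see at a glance which node pairs and sequence numbers are involved, the warnings can be grouped with a small script. This is only a minimal sketch: the regular expression is modeled on the log lines quoted above, and the names `summarize`, `reporter` and `origin` are made up for illustration. It assumes the first address in the message is the node that logged the warning and the address before `::` is the node the message came from.

```python
import re
from collections import defaultdict

# Pattern modeled on the JGRP000041 lines quoted above.
# Assumption: "reporter" is the address printed first (the node logging the
# warning) and "origin" is the address printed before "::".
PATTERN = re.compile(
    r"JGRP000041: (?P<reporter>\S+): message (?P<origin>\S+)::(?P<seqno>\d+) "
    r"not found in retransmission table"
)


def summarize(lines):
    """Group the distinct missing sequence numbers by (reporter, origin)."""
    missing = defaultdict(set)
    for line in lines:
        match = PATTERN.search(line)
        if match:
            key = (match["reporter"], match["origin"])
            missing[key].add(int(match["seqno"]))
    # Sort so that gaps and ranges are easy to read.
    return {key: sorted(seqnos) for key, seqnos in missing.items()}


if __name__ == "__main__":
    prefix = ("2018-08-22 09:59:33,346 WARN  "
              "[org.jboss.as.clustering.jgroups.protocol.NAKACK2] (thread-2) ")
    sample = [
        prefix + "JGRP000041: bd3eeb23695b: message d8896fbba960::15 "
                 "not found in retransmission table",
        prefix + "JGRP000041: bd3eeb23695b: message d8896fbba960::16 "
                 "not found in retransmission table",
    ]
    for (reporter, origin), seqnos in summarize(sample).items():
        print(f"{reporter}: {len(seqnos)} missing message(s) from {origin}, "
              f"seqnos {seqnos[0]}..{seqnos[-1]}")
```

Running it over the full server log of each container should show whether the same small range of sequence numbers is requested again and again, as the excerpts above suggest.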

2) The node that sent those messages shows various cache errors:

2018-08-22 09:58:37,130 ERROR [org.infinispan.interceptors.InvocationContextInterceptor] (ServerService Thread Pool -- 61) ISPN000136: Error executing command PutKeyValueCommand, writing keys [cluster-start-time]: org.infinispan.util.concurrent.TimeoutException: Replication timeout

2018-08-22 09:58:37,149 ERROR [org.jboss.msc.service.fail] (ServerService Thread Pool -- 61) MSC000001: Failed to start service jboss.undertow.deployment.default-server.default-host./odino-stif-keycloak-int/auth: org.jboss.msc.service.StartException in service jboss.undertow.deployment.default-server.default-host./odino-stif-keycloak-int/auth: java.lang.RuntimeException: RESTEASY003325: Failed to construct public org.keycloak.services.resources.KeycloakApplication(javax.servlet.ServletContext,org.jboss.resteasy.core.Dispatcher)

2018-08-22 09:58:37,178 ERROR [org.jboss.as.controller.management-operation] (Controller Boot Thread) WFLYCTL0013: Operation ("add") failed - address: ([("deployment" => "keycloak-server.war")]) - failure description: {"WFLYCTL0080: Failed services" => {"jboss.undertow.deployment.default-server.default-host./odino-stif-keycloak-int/auth" => "java.lang.RuntimeException: RESTEASY003325: Failed to construct public org.keycloak.services.resources.KeycloakApplication(javax.servlet.ServletContext,org.jboss.resteasy.core.Dispatcher)
    Caused by: java.lang.RuntimeException: RESTEASY003325: Failed to construct public org.keycloak.services.resources.KeycloakApplication(javax.servlet.ServletContext,org.jboss.resteasy.core.Dispatcher)
    Caused by: org.infinispan.util.concurrent.TimeoutException: Replication timeout"}}

2018-08-22 09:58:37,409 WARN  [org.infinispan.topology.CacheTopologyControlCommand] (ServerService Thread Pool -- 60) ISPN000071: Caught exception when handling command CacheTopologyControlCommand{cache=actionTokens, type=LEAVE, sender=d8896fbba960, joinInfo=null, topologyId=0, rebalanceId=0, currentCH=null, pendingCH=null, availabilityMode=null, actualMembers=null, throwable=null, viewId=3}: java.lang.IllegalArgumentException: A cache topology's pending consistent hash must contain all the current consistent hash's members

After that, the node usually stops all of its caches and Keycloak itself.

Configurations and solutions attempted

I tried the following, without success:

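  • Changing the timeout parameters on Keycloak's various caches (to give the cluster more time to stabilize).
  • Changing some default values of the NAKACK2 protocol in the Keycloak configuration file. The aim was to limit the traffic between nodes and to increase the number of elements in the retransmission table, so that messages are not discarded before every node has received them. However, these changes did not alleviate my problem.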

The configuration I am currently using is as follows:

<subsystem xmlns="urn:jboss:domain:infinispan:4.0">
    <cache-container name="keycloak" jndi-name="infinispan/Keycloak">
        <transport lock-timeout="500000"/>
        <local-cache name="realms">
            <eviction max-entries="10000" strategy="LRU"/>
        </local-cache>
        <local-cache name="users">
            <eviction max-entries="10000" strategy="LRU"/>
        </local-cache>
        <distributed-cache name="sessions" mode="SYNC" owners="3"/>
        <distributed-cache name="authenticationSessions" mode="SYNC" owners="3"/>
        <distributed-cache name="offlineSessions" mode="SYNC" owners="1"/>
        <distributed-cache name="loginFailures" mode="SYNC" owners="1"/>
        <local-cache name="authorization">
            <eviction max-entries="10000" strategy="LRU"/>
        </local-cache>
        <replicated-cache name="work" mode="SYNC"/>
        <local-cache name="keys">
            <eviction max-entries="1000" strategy="LRU"/>
            <expiration max-idle="3600000"/>
        </local-cache>
        <distributed-cache name="actionTokens" mode="SYNC" owners="2">
            <eviction max-entries="-1" strategy="NONE"/>
            <expiration max-idle="-1" interval="300000"/>
        </distributed-cache>
    </cache-container>
...
    <cache-container name="ejb" aliases="sfsb" default-cache="dist" module="org.wildfly.clustering.ejb.infinispan">
        <transport lock-timeout="300000"/>
        <distributed-cache name="dist">
            <locking isolation="REPEATABLE_READ"/>
            <transaction mode="BATCH"/>
            <file-store/>
        </distributed-cache>
    </cache-container>
</subsystem>
...
<protocol type="pbcast.NAKACK2">
    <property name="use_mcast_xmit">false</property>
    <property name="xmit_table_num_rows">200</property>
</protocol>
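
To make the clustering side of this configuration easier to reason about, here is a minimal sketch that lists the clustered caches and how many nodes hold a copy of each entry for a given cluster size. It works on a trimmed copy of the `keycloak` cache container above (the `SAMPLE` string), and the helper names `clustered_caches` and `copies_per_entry` are hypothetical. It relies only on the usual Infinispan semantics: a distributed cache keeps `owners` copies of each entry, while a replicated cache keeps a copy on every node.

```python
import xml.etree.ElementTree as ET

NS = {"ispn": "urn:jboss:domain:infinispan:4.0"}

# Trimmed copy of the "keycloak" cache container shown above.
SAMPLE = """<subsystem xmlns="urn:jboss:domain:infinispan:4.0">
    <cache-container name="keycloak" jndi-name="infinispan/Keycloak">
        <local-cache name="realms"/>
        <distributed-cache name="sessions" mode="SYNC" owners="3"/>
        <distributed-cache name="authenticationSessions" mode="SYNC" owners="3"/>
        <distributed-cache name="offlineSessions" mode="SYNC" owners="1"/>
        <distributed-cache name="loginFailures" mode="SYNC" owners="1"/>
        <replicated-cache name="work" mode="SYNC"/>
        <distributed-cache name="actionTokens" mode="SYNC" owners="2"/>
    </cache-container>
</subsystem>"""


def clustered_caches(xml_text, container="keycloak"):
    """Return {cache name: (kind, owners)} for the non-local caches.

    owners is None when the attribute is absent (e.g. replicated caches).
    """
    root = ET.fromstring(xml_text)
    caches = {}
    for cc in root.findall("ispn:cache-container", NS):
        if cc.get("name") != container:
            continue
        for cache in cc:
            kind = cache.tag.rsplit("}", 1)[-1]  # strip the XML namespace
            if kind in ("distributed-cache", "replicated-cache"):
                owners = cache.get("owners")
                caches[cache.get("name")] = (
                    kind, int(owners) if owners is not None else None)
    return caches


def copies_per_entry(kind, owners, cluster_size):
    """Number of nodes that hold a copy of each cache entry."""
    if kind == "replicated-cache" or owners is None:
        return cluster_size                  # every node has a copy
    return min(owners, cluster_size)         # cannot exceed the node count


if __name__ == "__main__":
    for name, (kind, owners) in clustered_caches(SAMPLE).items():
        print(f"{name}: {kind}, 2 nodes -> {copies_per_entry(kind, owners, 2)} "
              f"copies, 5 nodes -> {copies_per_entry(kind, owners, 5)} copies")
```

With `mode="SYNC"`, a write has to wait for the other nodes that hold a copy, so the replicated `work` cache involves all 5 nodes on every write. The script only describes the configuration; it does not show that this is the cause of the timeouts.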

So, do you have any idea why this happens? How should I change my configuration to fix this problem?

0 answers:

No answers