Here is our background. We have an ES 5.6.2 cluster with 23 nodes (3 master + 20 data). On that cluster, roughly half of the indices were created before the migration to 5.6.2 and the other half after it. In order to benefit from the new features and keep up with recent versions, we decided to split the cluster: keep the older indices on the existing 5.6.2 cluster, move the newer indices to a brand-new 6.5.1 cluster, and bridge the two with cross-cluster search (CCS).
The reasoning behind this split is that the older indices could only be moved by massively reindexing them into ES 6.5.1, which, even done without disrupting the business, would take weeks. We may still do it at some point, but since those indices will eventually become frozen anyway, we see no point in spending time migrating data that is going to die/be frozen regardless.
So we were confident about moving the newer indices to 6.5.1, and indeed that went smoothly. Shard allocation filtering helped us move all of those indices onto the nodes that were to become part of the new 6.5.1 cluster. Then, in a rolling migration, we moved each of those nodes into the new 6.5.1 cluster, which has been green and humming along ever since.
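For context, this is the kind of allocation-filtering call we mean; a minimal sketch only, where the index name and node names are placeholders rather than our real ones:

PUT /some_newer_index/_settings
{
  "index.routing.allocation.include._name": "future-651-node-1,future-651-node-2"
}

Once such a setting is applied, Elasticsearch relocates that index's shards onto the listed nodes, so those nodes can later be carved out into the new cluster.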
Then came the tricky part, as you may have guessed. We set up CCS using three seed (data) nodes from the old cluster, and that is when the old cluster started to wobble. Aside from another CCS search bug that we found and reported, the symptom is that data nodes keep leaving and rejoining the cluster, causing almost constant shard rebalancing.
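For completeness, a minimal sketch of the kind of remote-cluster registration we mean on the 6.5.1 side; the alias legacy and the hostnames are placeholders, and the exact setting prefix may differ slightly between minor versions:

PUT _cluster/settings
{
  "persistent": {
    "cluster.remote.legacy.seeds": [
      "old-data-node-1:9300",
      "old-data-node-2:9300",
      "old-data-node-3:9300"
    ]
  }
}

Searches against the old cluster are then prefixed with that alias, for example:

GET legacy:transactions_2016/_search
{
  "query": { "match_all": {} }
}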
In other words, we are left with a yellow cluster that we fear could turn red at any moment. Occasionally it turns green again for a few minutes and then goes back to yellow (and sometimes briefly red). See the health history below (the big red initial state on the left is from when we were moving the newer indices to the new cluster, but all the other green/red arrow pairs are transient red states caused by the errors described below):
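(That history simply comes from polling the cluster health endpoint, along the lines of:

GET _cluster/health?filter_path=status,relocating_shards,unassigned_shards

so the green/yellow/red states above are the status field over time.)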
Concretely, what we see in the logs on the master node of the old 5.6.2 cluster is that the master disconnects from a data node after the following sequence of events:
First, we see the following error (very similar to #23939), where a node fails to obtain the lock on a given shard. (Note: we use scroll searches extensively, so this could be the cause explained in that issue; a quick check for this is sketched right after the stack trace below.)
[2019-02-14T23:53:38,331][WARN ][o.e.c.a.s.ShardStateAction] [IK-PRD-M3] [transactions_2016][1] received shard failed for shard id [[transactions_2016][1]], allocation id [Hy0REX6nScy49_2uXpKqrw], primary term [0], message [failed to create shard], failure [IOException[failed to obtain in-memory shard lock]; nested: ShardLockObtainFailedException[[transactions_2016][1]: obtaining shard lock timed out after 5000ms]; ]
java.io.IOException: failed to obtain in-memory shard lock
at org.elasticsearch.index.IndexService.createShard(IndexService.java:364) ~[elasticsearch-5.6.2.jar:5.6.2]
at org.elasticsearch.indices.IndicesService.createShard(IndicesService.java:499) ~[elasticsearch-5.6.2.jar:5.6.2]
at org.elasticsearch.indices.IndicesService.createShard(IndicesService.java:147) ~[elasticsearch-5.6.2.jar:5.6.2]
at org.elasticsearch.indices.cluster.IndicesClusterStateService.createShard(IndicesClusterStateService.java:542) ~[elasticsearch-5.6.2.jar:5.6.2]
at org.elasticsearch.indices.cluster.IndicesClusterStateService.createOrUpdateShards(IndicesClusterStateService.java:519) ~[elasticsearch-5.6.2.jar:5.6.2]
at org.elasticsearch.indices.cluster.IndicesClusterStateService.applyClusterState(IndicesClusterStateService.java:204) ~[elasticsearch-5.6.2.jar:5.6.2]
at org.elasticsearch.cluster.service.ClusterService.callClusterStateAppliers(ClusterService.java:814) ~[elasticsearch-5.6.2.jar:5.6.2]
at org.elasticsearch.cluster.service.ClusterService.publishAndApplyChanges(ClusterService.java:768) ~[elasticsearch-5.6.2.jar:5.6.2]
at org.elasticsearch.cluster.service.ClusterService.runTasks(ClusterService.java:587) ~[elasticsearch-5.6.2.jar:5.6.2]
at org.elasticsearch.cluster.service.ClusterService$ClusterServiceTaskBatcher.run(ClusterService.java:263) ~[elasticsearch-5.6.2.jar:5.6.2]
at org.elasticsearch.cluster.service.TaskBatcher.runIfNotProcessed(TaskBatcher.java:150) ~[elasticsearch-5.6.2.jar:5.6.2]
at org.elasticsearch.cluster.service.TaskBatcher$BatchedTask.run(TaskBatcher.java:188) ~[elasticsearch-5.6.2.jar:5.6.2]
at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:569) ~[elasticsearch-5.6.2.jar:5.6.2]
at org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.runAndClean(PrioritizedEsThreadPoolExecutor.java:247) ~[elasticsearch-5.6.2.jar:5.6.2]
at org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.run(PrioritizedEsThreadPoolExecutor.java:210) ~[elasticsearch-5.6.2.jar:5.6.2]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) ~[?:1.8.0_74]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) ~[?:1.8.0_74]
at java.lang.Thread.run(Thread.java:745) [?:1.8.0_74]
Caused by: org.elasticsearch.env.ShardLockObtainFailedException: [transactions_2016][1]: obtaining shard lock timed out after 5000ms
at org.elasticsearch.env.NodeEnvironment$InternalShardLock.acquire(NodeEnvironment.java:724) ~[elasticsearch-5.6.2.jar:5.6.2]
at org.elasticsearch.env.NodeEnvironment.shardLock(NodeEnvironment.java:643) ~[elasticsearch-5.6.2.jar:5.6.2]
at org.elasticsearch.index.IndexService.createShard(IndexService.java:294) ~[elasticsearch-5.6.2.jar:5.6.2]
... 17 more
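(As an aside to the scroll-search note above: one way to rule scroll searches in or out is to watch the number of open search contexts per node and clear stale scrolls aggressively. A rough sketch, with no guarantee it is related to the lock timeouts:

GET _nodes/stats/indices/search?filter_path=nodes.*.name,nodes.*.indices.search.open_contexts

DELETE _search/scroll/_all

The first call reports the open search contexts per node; the second discards all currently open scroll contexts.)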
After that, we see transport-level issues where messages cannot be fully read:
[2019-02-14T23:53:52,630][WARN ][o.e.t.n.Netty4Transport ] [IK-PRD-M3] exception caught on transport layer [[id: 0xd97a9d8c, L:/10.10.1.184:51594 - R:10.10.1.166/10.10.1.166:9300]], closing connection
java.lang.IllegalStateException: Message not fully read (response) for requestId [7719647], handler [org.elasticsearch.transport.TransportService$ContextRestoreResponseHandler/org.elasticsearch.transport.TransportActionProxy$ProxyResponseHandler@7f2fcd88], error [false]; resetting
at org.elasticsearch.transport.TcpTransport.messageReceived(TcpTransport.java:1399) ~[elasticsearch-5.6.2.jar:5.6.2]
at org.elasticsearch.transport.netty4.Netty4MessageChannelHandler.channelRead(Netty4MessageChannelHandler.java:74) ~[transport-netty4-5.6.2.jar:5.6.2]
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362) [netty-transport-4.1.13.Final.jar:4.1.13.Final]
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348) [netty-transport-4.1.13.Final.jar:4.1.13.Final]
at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:340) [netty-transport-4.1.13.Final.jar:4.1.13.Final]
at io.netty.handler.codec.ByteToMessageDecoder.fireChannelRead(ByteToMessageDecoder.java:310) [netty-codec-4.1.13.Final.jar:4.1.13.Final]
at io.netty.handler.codec.ByteToMessageDecoder.fireChannelRead(ByteToMessageDecoder.java:297) [netty-codec-4.1.13.Final.jar:4.1.13.Final]
at io.netty.handler.codec.ByteToMessageDecoder.callDecode(ByteToMessageDecoder.java:413) [netty-codec-4.1.13.Final.jar:4.1.13.Final]
at io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:265) [netty-codec-4.1.13.Final.jar:4.1.13.Final]
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362) [netty-transport-4.1.13.Final.jar:4.1.13.Final]
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348) [netty-transport-4.1.13.Final.jar:4.1.13.Final]
at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:340) [netty-transport-4.1.13.Final.jar:4.1.13.Final]
at io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1334) [netty-transport-4.1.13.Final.jar:4.1.13.Final]
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362) [netty-transport-4.1.13.Final.jar:4.1.13.Final]
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348) [netty-transport-4.1.13.Final.jar:4.1.13.Final]
at io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:926) [netty-transport-4.1.13.Final.jar:4.1.13.Final]
at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:134) [netty-transport-4.1.13.Final.jar:4.1.13.Final]
at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:644) [netty-transport-4.1.13.Final.jar:4.1.13.Final]
at io.netty.channel.nio.NioEventLoop.processSelectedKeysPlain(NioEventLoop.java:544) [netty-transport-4.1.13.Final.jar:4.1.13.Final]
at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:498) [netty-transport-4.1.13.Final.jar:4.1.13.Final]
at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:458) [netty-transport-4.1.13.Final.jar:4.1.13.Final]
at io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:858) [netty-common-4.1.13.Final.jar:4.1.13.Final]
at java.lang.Thread.run(Thread.java:745) [?:1.8.0_74]
Then the data node gets removed...
[2019-02-14T23:53:52,639][INFO ][o.e.c.s.ClusterService ] [IK-PRD-M3] removed {{IK-PRD-D103}{gAwPc0AvTyGR58ugLQ7K4Q}{-MdtgQHlT4SEQsDYTjvRBw}{10.10.1.166}{10.10.1.166:9300}{ml.max_open_jobs=10, ml.enabled=true, tag=hot},}, reason: zen-disco-node-failed({IK-PRD-D103}{gAwPc0AvTyGR58ugLQ7K4Q}{-MdtgQHlT4SEQsDYTjvRBw}{10.10.1.166}{10.10.1.166:9300}{ml.max_open_jobs=10, ml.enabled=true, tag=hot}), reason(transport disconnected)[{IK-PRD-D103}{gAwPc0AvTyGR58ugLQ7K4Q}{-MdtgQHlT4SEQsDYTjvRBw}{10.10.1.166}{10.10.1.166:9300}{ml.max_open_jobs=10, ml.enabled=true, tag=hot} transport disconnected]
...and added back a few seconds later:
[2019-02-14T23:53:58,367][INFO ][o.e.c.s.ClusterService ] [IK-PRD-M3] added {{IK-PRD-D103}{gAwPc0AvTyGR58ugLQ7K4Q}{-MdtgQHlT4SEQsDYTjvRBw}{10.10.1.166}{10.10.1.166:9300}{ml.max_open_jobs=10, ml.enabled=true, tag=hot},}, reason: zen-disco-node-join[{IK-PRD-D103}{gAwPc0AvTyGR58ugLQ7K4Q}{-MdtgQHlT4SEQsDYTjvRBw}{10.10.1.166}{10.10.1.166:9300}{ml.max_open_jobs=10, ml.enabled=true, tag=hot}]
It is also worth noting that the nodes dropping off the cluster are almost always the same three, one of which is in the CCS seed node list.
Agreed, there is no hard evidence that CCS has anything to do with this, but since it is pretty much the only change the old 5.6.2 cluster has gone through, and since one of the flapping nodes is a CCS gateway node, the odds that CCS is causing this are high.
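(To see exactly which gateway nodes the 6.5.1 cluster is holding connections to, the remote-cluster info API can be queried on the new cluster:

GET _remote/info

The response lists, per remote alias, the configured seeds, whether the remote is connected, and how many gateway nodes it is connected to, which makes it easy to check whether the flapping node is among them.)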
One thing that comes to mind is upgrading the old 5.6.2 cluster to the latest 5.6.14 patch release, but attempting that on an unstable yellow cluster could be risky, which is why we are asking for advice here.
Looking at the 5.6.3 release notes, we see a fix that could address our issue (#26833, fixed by @javanna in PR #27881), but we are not sure whether we would need to upgrade the whole cluster to 5.6.3 or only the seed nodes. One thing we tried was adding two 5.6.3 client nodes (i.e. non-master, non-data) to the 5.6.2 cluster and using them as the CCS seed nodes, but that made the cluster even more unstable, so we reverted the change; maybe we did not do it correctly, though.
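(What we attempted, roughly, was to re-point the remote-cluster seeds on the 6.5.1 side at those two client nodes only; hostnames below are placeholders and the alias is the same placeholder as above:

PUT _cluster/settings
{
  "persistent": {
    "cluster.remote.legacy.seeds": [
      "client-node-563-1:9300",
      "client-node-563-2:9300"
    ]
  }
}

so that all CCS traffic into the old cluster would go through nodes carrying neither data nor master duties.)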
We have not spotted anything in the release notes of the other 5.6.x versions that mentions a fix for the kind of errors we are seeing. We are looking for expert advice on how to tackle this issue; thanks again for your attention.
Note: this has also been posted on the official Elasticsearch forum: https://discuss.elastic.co/t/shaky-cross-cluster-search-between-6-5-1-and-5-6-2/168518/6
Answer (score: 0)
Upgrading our 5.6.2 cluster to 5.6.3 did solve the issue.
Our cluster has been green again for the past few hours.
Kudos to the Elastic support team for helping us pinpoint and fix this issue.
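For anyone finding this later, a rough sketch of the standard per-node rolling-upgrade steps for such a minor upgrade (not a verbatim record of what we ran; exact steps may vary for your setup):

PUT _cluster/settings
{ "transient": { "cluster.routing.allocation.enable": "none" } }

POST _flush/synced

Then stop the node, install 5.6.3, restart it, wait for it to rejoin, and re-enable allocation:

PUT _cluster/settings
{ "transient": { "cluster.routing.allocation.enable": "all" } }

Waiting for the cluster to go back to green before moving on to the next node keeps the rest of the cluster queryable throughout.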