Elasticsearch: how do I move primary shards?

Date: 2015-06-11 08:35:08

Tags: elasticsearch distributed-system

I currently have a cluster of three nodes. All nodes hold data and are master-eligible. There are 6 primary shards, so each node holds two primary shards. The parameter discovery.zen.minimum_master_nodes is set to 1.

The configuration I want is six nodes with 6 primary shards, one replica per shard, and discovery.zen.minimum_master_nodes = 3.
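The target configuration can be sketched as two fragments: a node-level setting in elasticsearch.yml and an index-level replica setting applied via the REST API. This is only a sketch of the settings named above; the index name `NAME_CLUSTER` is borrowed from the error log below and may differ in your cluster.

```yaml
# elasticsearch.yml on each of the six nodes.
# With 6 master-eligible nodes, a quorum is (6 / 2) + 1 = 4; the question
# asks for 3, which avoids split-brain only if you never run two isolated
# halves of 3 nodes each.
discovery.zen.minimum_master_nodes: 3
```

```
# Applied once all six nodes have joined (see the migration step below):
PUT /NAME_CLUSTER/_settings
{
  "index": { "number_of_replicas": 1 }
}
```

Note that `number_of_replicas` is dynamic and can be changed on a live index, whereas `minimum_master_nodes` can also be updated at runtime through the cluster settings API instead of a rolling restart.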

The problem is that this is a production cluster, and I have to migrate to the second configuration without losing data or availability.

The first step I am taking is to increase the number of nodes to 6; once the shards are placed properly, I will enable replication.
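The two-phase plan above (grow the cluster first, add replicas second) can be sketched as a sequence of REST calls. These are illustrative only; the index name `NAME_CLUSTER` comes from the log below, and the health check is just one way to confirm the cluster has settled before enabling replicas.

```
# Phase 1: after starting the three new nodes, wait for shards to
# rebalance and for the cluster to report green with 6 data nodes.
GET /_cluster/health?wait_for_nodes=6&wait_for_status=green&timeout=5m

# Phase 2: only then turn on one replica per primary shard.
PUT /NAME_CLUSTER/_settings
{
  "index": { "number_of_replicas": 1 }
}
```

Doing it in this order means each relocation or replica build copies one 75 GB shard at a time rather than triggering primary moves and replica creation simultaneously.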

So the first thing I did was add a new node. But when I do this, the cluster fails to relocate any shards. In the new node's error log I see:

[2015-06-10 18:43:25,929][WARN ][indices.cluster          ] [NEW_NODE] [[NAME_CLUSTER][2]] marking and sending shard failed due to [failed recovery]
org.elasticsearch.indices.recovery.RecoveryFailedException: [NAME_CLUSTER][2]: Recovery failed from [NODE1][fD-WDXVuSsu2QahBNKRLjg][NODE1][inet[IP_NODE1:9300]]{master=true} into [NEW_NODE][2xGUA-l8Qn-YUGzWkuUdSQ][NEW_NODE][inet[IP_NEW_NODE:9300]]{master=true}
        at org.elasticsearch.indices.recovery.RecoveryTarget.doRecovery(RecoveryTarget.java:274)
        at org.elasticsearch.indices.recovery.RecoveryTarget.access$700(RecoveryTarget.java:69)
        at org.elasticsearch.indices.recovery.RecoveryTarget$RecoveryRunner.doRun(RecoveryTarget.java:550)
        at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:36)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:745)
Caused by: org.elasticsearch.transport.RemoteTransportException: [NODE1][inet[IP_NODE1:9300]][internal:index/shard/recovery/start_recovery]
Caused by: org.elasticsearch.index.engine.RecoveryEngineException: [NAME_CLUSTER][2] Phase[2] Execution failed
        at org.elasticsearch.index.engine.InternalEngine.recover(InternalEngine.java:861)
        at org.elasticsearch.index.shard.IndexShard.recover(IndexShard.java:699)
        at org.elasticsearch.indices.recovery.RecoverySource.recover(RecoverySource.java:125)
        at org.elasticsearch.indices.recovery.RecoverySource.access$200(RecoverySource.java:49)
        at org.elasticsearch.indices.recovery.RecoverySource$StartRecoveryTransportRequestHandler.messageReceived(RecoverySource.java:146)
        at org.elasticsearch.indices.recovery.RecoverySource$StartRecoveryTransportRequestHandler.messageReceived(RecoverySource.java:132)
        at org.elasticsearch.transport.netty.MessageChannelHandler$RequestHandler.doRun(MessageChannelHandler.java:277)
        at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:36)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:745)
Caused by: org.elasticsearch.transport.ReceiveTimeoutTransportException: [NEW_NODE][inet[IP_NEW_NODE:9300]][internal:index/shard/recovery/prepare_translog] request_id [796196] timed out after [900000ms]

Additional information:

Shard size: 75 GB; RAM per node: 8 GB; disk per node: 300 GB

Elasticsearch version: 1.5.2

UPDATE

The phase that seems to be causing the problem is the one enabled by

index.shard.check_on_startup: true

If I set this field to false, replication works. This setting enables a phase that checks whether the shard is corrupted. My guess is that, because the shard is very large, this phase takes a long time and the TransportService throws a timeout exception. If this is correct, I would like to know a way to increase this timeout.
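The 900000 ms (15 minute) timeout in the stack trace matches the default of the recovery-internal action timeout. A sketch of raising it dynamically via the cluster settings API is below; the exact setting names (`indices.recovery.internal_action_timeout` and `indices.recovery.internal_action_long_timeout`) and their availability should be verified against the 1.5.x documentation before use, and the chosen values here are arbitrary examples.

```
PUT /_cluster/settings
{
  "transient": {
    "indices.recovery.internal_action_timeout": "60m",
    "indices.recovery.internal_action_long_timeout": "120m"
  }
}
```

As an alternative to raising timeouts, `index.shard.check_on_startup` also accepts `"checksum"` (verify file checksums only) rather than `true` (full physical and logical verification), which is considerably cheaper on a 75 GB shard; again, confirm the accepted values for your version.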

0 Answers:

There are no answers yet.