I currently have a cluster with three nodes. All of them hold data and are master-eligible. There are 6 primary shards, so each node hosts two primaries. The parameter discovery.zen.minimum_master_nodes is set to 1.
The configuration I want is six nodes, with the 6 primary shards and one replica per shard, and discovery.zen.minimum_master_nodes = 3.
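As far as I know, discovery.zen.minimum_master_nodes is a dynamic cluster setting in the 1.x line, so it could be raised through the cluster settings API once the new master-eligible nodes have joined, without restarting anything. A sketch of what I have in mind (the host/port is a placeholder for one of my cluster's HTTP endpoints):

```shell
# Raise minimum_master_nodes to 3 once all six master-eligible nodes
# have joined the cluster. "persistent" keeps the value across full
# cluster restarts. (localhost:9200 is a placeholder endpoint.)
curl -XPUT 'http://localhost:9200/_cluster/settings' -d '{
  "persistent": {
    "discovery.zen.minimum_master_nodes": 3
  }
}'
```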
The problem is that this is a production cluster, and I have to migrate to the second configuration without losing data or availability.
My first step is to grow the cluster to six nodes; once the shards are placed correctly, I will enable replication.
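For the replication step, what I intend to run once the cluster is at six nodes is an index settings update along these lines (the index name below is a placeholder):

```shell
# Add one replica per primary shard; Elasticsearch should then allocate
# the six replica shards onto the newly added nodes.
# (NAME_INDEX and localhost:9200 are placeholders for my setup.)
curl -XPUT 'http://localhost:9200/NAME_INDEX/_settings' -d '{
  "index": { "number_of_replicas": 1 }
}'
```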
So the first thing I did was add one new node. But when I do this, the cluster is unable to relocate any shard. In the new node's error log I see:
    [2015-06-10 18:43:25,929][WARN ][indices.cluster ] [NEW_NODE] [[NAME_CLUSTER][2]] marking and sending shard failed due to [failed recovery]
    org.elasticsearch.indices.recovery.RecoveryFailedException: [NAME_CLUSTER][2]: Recovery failed from [NODE1][fD-WDXVuSsu2QahBNKRLjg][NODE1][inet[IP_NODE1:9300]]{master=true} into [NEW_NODE][2xGUA-l8Qn-YUGzWkuUdSQ][NEW_NODE][inet[IP_NEW_NODE:9300]]{master=true}
        at org.elasticsearch.indices.recovery.RecoveryTarget.doRecovery(RecoveryTarget.java:274)
        at org.elasticsearch.indices.recovery.RecoveryTarget.access$700(RecoveryTarget.java:69)
        at org.elasticsearch.indices.recovery.RecoveryTarget$RecoveryRunner.doRun(RecoveryTarget.java:550)
        at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:36)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:745)
    Caused by: org.elasticsearch.transport.RemoteTransportException: [NODE1][inet[IP_NODE1:9300]][internal:index/shard/recovery/start_recovery]
    Caused by: org.elasticsearch.index.engine.RecoveryEngineException: [NAME_CLUSTER][2] Phase[2] Execution failed
        at org.elasticsearch.index.engine.InternalEngine.recover(InternalEngine.java:861)
        at org.elasticsearch.index.shard.IndexShard.recover(IndexShard.java:699)
        at org.elasticsearch.indices.recovery.RecoverySource.recover(RecoverySource.java:125)
        at org.elasticsearch.indices.recovery.RecoverySource.access$200(RecoverySource.java:49)
        at org.elasticsearch.indices.recovery.RecoverySource$StartRecoveryTransportRequestHandler.messageReceived(RecoverySource.java:146)
        at org.elasticsearch.indices.recovery.RecoverySource$StartRecoveryTransportRequestHandler.messageReceived(RecoverySource.java:132)
        at org.elasticsearch.transport.netty.MessageChannelHandler$RequestHandler.doRun(MessageChannelHandler.java:277)
        at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:36)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:745)
    Caused by: org.elasticsearch.transport.ReceiveTimeoutTransportException: [NEW_NODE][inet[IP_NEW_NODE:9300]][internal:index/shard/recovery/prepare_translog] request_id [796196] timed out after [900000ms]
Additional information:
Shard size: 75 GB; node RAM: 8 GB; node disk: 300 GB
Elasticsearch version: 1.5.2
更新
It seems the phase that causes the problem is the one activated by index.shard.check_on_startup: true.
If I set this field to false, the replication works. This setting enables a phase that checks whether a shard is corrupted before it is used. My guess is that, since the shards are very large, this phase takes a long time and the TransportService throws a timeout exception. If that is correct, I would like to know a way to increase this timeout.
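The request that times out is internal:index/shard/recovery/prepare_translog after 900000 ms, i.e. 15 minutes, which looks like a per-request recovery timeout rather than a global one. If I am reading the 1.x recovery settings correctly, there is a dynamically updatable setting indices.recovery.internal_action_timeout (with a companion indices.recovery.internal_action_long_timeout) that defaults to 15m, so something like the following might buy the check_on_startup phase enough time. The setting names and values are my assumption, not something I have verified against 1.5.2:

```shell
# Tentatively raise the internal per-request recovery timeouts so that
# the check_on_startup verification of a 75 GB shard can complete.
# "transient" applies immediately and is lost on full cluster restart.
# (Setting names are assumed from the 1.x recovery settings; the
# endpoint is a placeholder.)
curl -XPUT 'http://localhost:9200/_cluster/settings' -d '{
  "transient": {
    "indices.recovery.internal_action_timeout": "2h",
    "indices.recovery.internal_action_long_timeout": "4h"
  }
}'
```

If these settings turn out not to exist in 1.5.2, the fallback I see is keeping index.shard.check_on_startup at false during the expansion and re-enabling it afterwards.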