I currently have a cluster with three nodes. All nodes hold data and are master-eligible. There are 6 primary shards, so each node has two primaries. The parameter discovery.zen.minimum_master_nodes is set to 1.

The configuration I want is six nodes, still with 6 primary shards, one replica per shard, and discovery.zen.minimum_master_nodes = 3.
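As a sketch only, the target layout described above might look like this in each node's elasticsearch.yml (all names here are illustrative; note that the commonly recommended value for minimum_master_nodes is (master_eligible_nodes / 2) + 1, which would be 4 for six master-eligible nodes):

```yaml
# Hypothetical elasticsearch.yml fragment for the target six-node cluster
cluster.name: NAME_CLUSTER
node.master: true                        # all six nodes stay master-eligible
node.data: true                          # and all hold data
discovery.zen.minimum_master_nodes: 3    # value from the question; the usual
                                         # quorum guideline for 6 masters is 4
```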

The problem is that this is a **production** cluster, and I have to migrate to the second configuration without losing data or availability.

The first step I am taking is to grow the cluster to 6 nodes; once the shards are well placed, I will enable replication.
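Assuming the plan above, the replication step would just be a dynamic index-settings update once all six nodes have joined and the primaries are balanced (index name and host are placeholders from the question, not real values):

```shell
# Hedged sketch: enable one replica per primary shard on the index.
# Replace NAME_CLUSTER and localhost with the real index name and host.
curl -XPUT 'http://localhost:9200/NAME_CLUSTER/_settings' -d '{
  "index": { "number_of_replicas": 1 }
}'
```

number_of_replicas is a dynamic setting, so this requires no restart; the cluster starts allocating replica shards immediately.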

So the first thing I did was add one new node. But when I did, the cluster was unable to relocate any shard to it. In the new node's error log I see:

[2015-06-10 18:43:25,929][WARN ][indices.cluster          ] [NEW_NODE] [[NAME_CLUSTER][2]] marking and sending shard failed due to [failed recovery]
org.elasticsearch.indices.recovery.RecoveryFailedException: [NAME_CLUSTER][2]: Recovery failed from [NODE1][fD-WDXVuSsu2QahBNKRLjg][NODE1][inet[IP_NODE1:9300]]{master=true} into [NEW_NODE][2xGUA-l8Qn-YUGzWkuUdSQ][NEW_NODE][inet[IP_NEW_NODE:9300]]{master=true}
        at org.elasticsearch.indices.recovery.RecoveryTarget.doRecovery(RecoveryTarget.java:274)
        at org.elasticsearch.indices.recovery.RecoveryTarget.access$700(RecoveryTarget.java:69)
        at org.elasticsearch.indices.recovery.RecoveryTarget$RecoveryRunner.doRun(RecoveryTarget.java:550)
        at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:36)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:745)
Caused by: org.elasticsearch.transport.RemoteTransportException: [NODE1][inet[IP_NODE1:9300]][internal:index/shard/recovery/start_recovery]
Caused by: org.elasticsearch.index.engine.RecoveryEngineException: [NAME_CLUSTER][2] Phase[2] Execution failed
        at org.elasticsearch.index.engine.InternalEngine.recover(InternalEngine.java:861)
        at org.elasticsearch.index.shard.IndexShard.recover(IndexShard.java:699)
        at org.elasticsearch.indices.recovery.RecoverySource.recover(RecoverySource.java:125)
        at org.elasticsearch.indices.recovery.RecoverySource.access$200(RecoverySource.java:49)
        at org.elasticsearch.indices.recovery.RecoverySource$StartRecoveryTransportRequestHandler.messageReceived(RecoverySource.java:146)
        at org.elasticsearch.indices.recovery.RecoverySource$StartRecoveryTransportRequestHandler.messageReceived(RecoverySource.java:132)
        at org.elasticsearch.transport.netty.MessageChannelHandler$RequestHandler.doRun(MessageChannelHandler.java:277)
        at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:36)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:745)
Caused by: org.elasticsearch.transport.ReceiveTimeoutTransportException: [NEW_NODE][inet[IP_NEW_NODE:9300]][internal:index/shard/recovery/prepare_translog] request_id [796196] timed out after [900000ms]

Additional information:

Shard size: 75 GB; node RAM: 8 GB; node disk: 300 GB

Elasticsearch version: 1.5.2

UPDATE

It seems the phase causing the problem is the one enabled by:

index.shard.check_on_startup:true

If I set this field to false, replication works. This field enables a phase that checks whether a shard is corrupted when it is opened. My guess is that, because the shards are very large, this phase takes so long that the TransportService throws a timeout exception. If that is correct, I would like to know a way to increase this timeout.
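If the timeout theory holds, one lever to try is the internal recovery action timeouts: the `900000ms` in the log matches a 15-minute default, which looks like `indices.recovery.internal_action_timeout`. These are node-level recovery settings; a possible sketch, assuming they are dynamically updatable in 1.5.2 (if not, they can instead be set in elasticsearch.yml followed by a rolling restart):

```shell
# Hedged sketch: raise the internal recovery action timeouts so the
# check_on_startup phase on 75 GB shards has time to finish.
# Host and values are assumptions, not tested against 1.5.2.
curl -XPUT 'http://localhost:9200/_cluster/settings' -d '{
  "transient": {
    "indices.recovery.internal_action_timeout": "30m",
    "indices.recovery.internal_action_long_timeout": "60m"
  }
}'
```

A transient setting is lost on full-cluster restart; using "persistent" instead would keep it across restarts.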