
Elasticsearch fails to recover after a crash

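I ran out of disk space, which messed up my Elasticsearch shards. Three are now red, and two have recovered and are in yellow state. ES is running at 150% CPU with high memory usage while it tries to recover them, but there appears to be a version match conflict.

I freed up disk space and deleted the shards' translog to stop it from loading from the translog. Surprisingly, the translog was created again!

Please share how to stop this attempt to recover from the translog and get back to normal indexing operations. I do not want to delete the shard data.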

[2014-10-31 03:11:43,742][WARN ][cluster.action.shard     ] [Angela Cairn] [western_europe][4] sending failed shard for [western_europe][4], node[x5M73qVXS5eZIBdz40boEg], [P], s[INITIALIZING], indexUUID [wy-tIJqdQiynz5SGQ2IrGA], reason [Failed to start shard, message [IndexShardGatewayRecoveryException[[western_europe][4] failed to recover shard]; nested: ElasticsearchException[failed to read [tweet][527924645014818817]]; nested: ElasticsearchIllegalArgumentException[No version type match [101]]; ]]
[2014-10-31 03:11:43,742][WARN ][cluster.action.shard     ] [Angela Cairn] [western_europe][4] received shard failed for [western_europe][4], node[x5M73qVXS5eZIBdz40boEg], [P], s[INITIALIZING], indexUUID [wy-tIJqdQiynz5SGQ2IrGA], reason [Failed to start shard, message [IndexShardGatewayRecoveryException[[western_europe][4] failed to recover shard]; nested: ElasticsearchException[failed to read [tweet][527924645014818817]]; nested: ElasticsearchIllegalArgumentException[No version type match [101]]; ]]
[2014-10-31 03:11:43,859][WARN ][indices.cluster          ] [Angela Cairn] [western_europe][2] failed to start shard
org.elasticsearch.index.gateway.IndexShardGatewayRecoveryException: [western_europe][2] failed to recover shard
    at org.elasticsearch.index.gateway.local.LocalIndexShardGateway.recover(LocalIndexShardGateway.java:269)
    at org.elasticsearch.index.gateway.IndexShardGatewayService$1.run(IndexShardGatewayService.java:132)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:744)
Caused by: org.elasticsearch.ElasticsearchException: failed to read [tweet][527936245440065536]
    at org.elasticsearch.index.translog.Translog$Index.readFrom(Translog.java:511)
    at org.elasticsearch.index.translog.TranslogStreams.readTranslogOperation(TranslogStreams.java:52)
    at org.elasticsearch.index.gateway.local.LocalIndexShardGateway.recover(LocalIndexShardGateway.java:241)
    ... 4 more
Caused by: org.elasticsearch.ElasticsearchIllegalArgumentException: No version type match [116]
    at org.elasticsearch.index.VersionType.fromValue(VersionType.java:307)
    at org.elasticsearch.index.translog.Translog$Index.readFrom(Translog.java:508)

1 Answer

  • 4

    First, check that the shards themselves are actually OK. cd to your /usr/share/elasticsearch/lib directory or its equivalent, and run Lucene's CheckIndex as follows:

    java -cp "*" -ea:org.apache.lucene... org.apache.lucene.index.CheckIndex /var/lib/elasticsearch/<ES-NAME>/nodes/<NODE-NUMBER>/indices/<INDEX-NAME>/<SHARD-NUMBER>/index/
    

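    This checks the shard for problems, and it will take a while if the shard is large.

    Note that if your Java classpath is wrong, some required jar files will be missing, and CheckIndex may throw errors and incorrectly claim that every segment in the shard is corrupt, so read the output carefully.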
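
    To run the same check across every shard of an index on one node, a loop along the following lines might help. This is only a sketch: the function names shard_index_dirs and check_index_shards are made up for illustration, and it assumes the directory layout shown in the command above. It runs CheckIndex without -fix, so it only reports.

    ```shell
    # Print the Lucene index directory of every shard of one index on one node.
    # $1: node directory, e.g. /var/lib/elasticsearch/<ES-NAME>/nodes/<NODE-NUMBER>
    # $2: index name
    shard_index_dirs() {
        local node_dir="$1" index="$2" dir
        for dir in "$node_dir/indices/$index"/*/index; do
            # Skips anything that is not a shard (and the unmatched glob itself).
            [ -d "$dir" ] && printf '%s\n' "$dir"
        done
        return 0
    }

    # Run CheckIndex (no -fix) on each shard directory found above.
    # $1: directory holding the Elasticsearch/Lucene jars, then the same two
    #     arguments as shard_index_dirs.
    check_index_shards() {
        local lib_dir="$1" node_dir="$2" index="$3" dir
        shard_index_dirs "$node_dir" "$index" | while IFS= read -r dir; do
            echo "== $dir"
            # </dev/null keeps java from consuming the list of directories.
            (cd "$lib_dir" && java -cp "*" -ea:org.apache.lucene... \
                org.apache.lucene.index.CheckIndex "$dir" </dev/null) \
                || echo "CheckIndex failed or reported problems for $dir"
        done
    }
    ```

    For example: check_index_shards /usr/share/elasticsearch/lib /var/lib/elasticsearch/<ES-NAME>/nodes/<NODE-NUMBER> western_europe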

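    If the shard does have problems and you have no other way to restore it, running the same command with the -fix argument will repair the shard, but you will lose data. CheckIndex warns you how many documents (if any) will be lost from the shard.

    If CheckIndex reports that all shards are fine, then hopefully your problem is only in the translog. The transaction log is the write-ahead log Elasticsearch uses for atomicity. After a crash, ES tries to recover the shard, including writes that had not yet been flushed to the shard's index itself. Those writes live in the translog, so you will lose them if you delete it. Still, that is much better than losing the shard. In your case the translog has already been shown to be corrupt, and I am not aware of any way to recover it.

    To remove the corrupt translog used for recovery, simply delete the translog file in /var/lib/elasticsearch/<ES-NAME>/nodes/<NODE-NUMBER>/indices/<INDEX-NAME>/<SHARD-NUMBER>/translog/ for each relevant shard on each affected node. The last part is important: after you delete the translog on one node, you may see the cluster try to regenerate that shard's translog from another node.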
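
    If you would rather not delete outright, a more cautious variant is to move the translog files aside. Note that this is a swapped-in alternative to the deletion described above, not what the answer itself prescribes. The sketch below uses a made-up function name (move_translog_aside) and a backup directory of your choosing, and it assumes Elasticsearch is stopped on the node while the files are moved; that is an assumption on my part.

    ```shell
    # Move every file in one shard's translog/ directory into a backup location.
    # $1: node directory, e.g. /var/lib/elasticsearch/<ES-NAME>/nodes/<NODE-NUMBER>
    # $2: index name   $3: shard number   $4: backup directory (hypothetical)
    move_translog_aside() {
        local node_dir="$1" index="$2" shard="$3" backup="$4"
        local tdir="$node_dir/indices/$index/$shard/translog"
        local dest="$backup/$index/$shard"
        local file
        [ -d "$tdir" ] || { echo "no translog directory: $tdir" >&2; return 1; }
        mkdir -p "$dest" || return 1
        for file in "$tdir"/*; do
            # Only regular files are moved; an empty directory is left alone.
            [ -f "$file" ] || continue
            mv -- "$file" "$dest/" || return 1
            echo "moved $file -> $dest/"
        done
    }

    # Usage, repeated on each affected node (shards 2 and 4 are the ones that
    # fail in the log above):
    #   for shard in 2 4; do
    #       move_translog_aside /var/lib/elasticsearch/<ES-NAME>/nodes/<NODE-NUMBER> \
    #           western_europe "$shard" /path/to/backup
    #   done
    ```

    Keeping the files costs only disk space, which matters here since the original failure was a full disk, so put the backup on a volume with room to spare.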

    After that, the shards should initialize correctly, although it may take a while to complete.
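
    To watch for that, you could poll the cluster health API until the status leaves red. A rough sketch, assuming the node listens on the default http://localhost:9200 and that curl is installed; the function names are made up, and the sed-based parsing is a shortcut rather than a real JSON parser.

    ```shell
    # Extract the "status" field (green, yellow or red) from cluster-health
    # JSON read on stdin.
    health_status() {
        sed -n 's/.*"status" *: *"\([a-z]*\)".*/\1/p' | head -n 1
    }

    # Poll the cluster health API until the status is green or yellow.
    # $1: base URL (assumed default), $2: seconds to wait between polls
    wait_until_not_red() {
        local url="${1:-http://localhost:9200}" pause="${2:-10}" status
        while :; do
            status=$(curl -s "$url/_cluster/health" | health_status)
            echo "cluster status: ${status:-unknown}"
            case "$status" in
                green|yellow) return 0 ;;
            esac
            sleep "$pause"
        done
    }
    ```

    Calling wait_until_not_red with no arguments polls every 10 seconds and returns once no primary shard is unassigned any more.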
