We have a 2.3.3 ES cluster with 5 data nodes and the following ES configuration:
index.number_of_shards: 1
index.number_of_replicas: 4
Almost everything else is at the defaults. Things generally work fine, but when they are re-read, a few of our indices produce the following stack trace in the ES logs:
[2017-05-12 04:33:55,745][DEBUG][action.search ] [qa13-ost-1020x-h-ds01] All shards failed for phase: [query_fetch]
RemoteTransportException[[qa13-ost-1020x-h-as01][192.168.104.110:9300][indices:data/read/search[phase/query+fetch]]]; nested: ShardNotFoundException[no such shard];
Caused by: [qa-hsbcuk1][[qa-hsbcuk1][0]] ShardNotFoundException[no such shard]
at org.elasticsearch.index.IndexService.shardSafe(IndexService.java:197)
at org.elasticsearch.search.SearchService.createContext(SearchService.java:639)
at org.elasticsearch.search.SearchService.createAndPutContext(SearchService.java:620)
at org.elasticsearch.search.SearchService.executeFetchPhase(SearchService.java:463)
at org.elasticsearch.search.action.SearchServiceTransportAction$SearchQueryFetchTransportHandler.messageReceived(SearchServiceTransportAction.java:392)
at org.elasticsearch.search.action.SearchServiceTransportAction$SearchQueryFetchTransportHandler.messageReceived(SearchServiceTransportAction.java:389)
at org.elasticsearch.transport.TransportRequestHandler.messageReceived(TransportRequestHandler.java:33)
at org.elasticsearch.transport.RequestHandlerRegistry.processMessageReceived(RequestHandlerRegistry.java:75)
at org.elasticsearch.transport.netty.MessageChannelHandler$RequestHandler.doRun(MessageChannelHandler.java:300)
at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
...
These eventually surface as 503s to our application, which calls ES via the REST API. They are brief, and afterwards the shards go back to green.
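Because the failures are transient (the shards come back green on their own within moments), one client-side mitigation is to retry the search with a short backoff rather than surfacing the 503 immediately. A minimal sketch, assuming a generic search callable and a placeholder exception type for the 503 / all-shards-failed response (not our production code):

```python
import time


class TransientSearchError(Exception):
    """Placeholder for an HTTP 503 / all-shards-failed response from ES."""


def search_with_retry(do_search, retries=3, backoff_s=0.5):
    """Retry a search callable that raises TransientSearchError on a 503.

    ShardNotFoundException-style failures during a recovery usually clear
    within a few hundred milliseconds, so a handful of retries with
    exponential backoff is typically enough to ride them out.
    """
    for attempt in range(retries + 1):
        try:
            return do_search()
        except TransientSearchError:
            if attempt == retries:
                raise  # exhausted retries; let the caller see the failure
            time.sleep(backoff_s * (2 ** attempt))  # exponential backoff
```

This only papers over the symptom, of course; it does not explain why the shards briefly disappear in the first place.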
While trying to debug this, we noticed these recoveries, which seem to line up with the same time windows in which we see the failures above. They start with a STORE recovery on the node that appears to hold the primary shard:
"qa-hsbcuk1" : {
"shards" : [ {
"id" : 0,
"type" : "STORE",
"stage" : "DONE",
"primary" : true,
"start_time" : "2017-05-12T08:33:55.817Z",
"start_time_in_millis" : 1494578035817,
"stop_time" : "2017-05-12T08:33:55.827Z",
"stop_time_in_millis" : 1494578035827,
"total_time" : "10ms",
"total_time_in_millis" : 10,
"source" : {
"id" : "QZdQAM-oQ_e__vUeAzNOsw",
"host" : "192.168.104.110",
"transport_address" : "192.168.104.110:9300",
"ip" : "192.168.104.110",
"name" : "qa13-ost-1020x-h-as01"
},
"target" : {
"id" : "QZdQAM-oQ_e__vUeAzNOsw",
"host" : "192.168.104.110",
"transport_address" : "192.168.104.110:9300",
"ip" : "192.168.104.110",
"name" : "qa13-ost-1020x-h-as01"
},
"index" : {
"size" : {
"total" : "0b",
"total_in_bytes" : 0,
"reused" : "0b",
"reused_in_bytes" : 0,
"recovered" : "0b",
"recovered_in_bytes" : 0,
"percent" : "0.0%"
},
"files" : {
"total" : 0,
"reused" : 0,
"recovered" : 0,
"percent" : "0.0%"
},
"total_time" : "0s",
"total_time_in_millis" : 0,
"source_throttle_time" : "-1",
"source_throttle_time_in_millis" : 0,
"target_throttle_time" : "-1",
"target_throttle_time_in_millis" : 0
},
"translog" : {
"recovered" : 0,
"total" : 0,
"percent" : "100.0%",
"total_on_start" : 0,
"total_time" : "9ms",
"total_time_in_millis" : 9
},
"verify_index" : {
"check_index_time" : "0s",
"check_index_time_in_millis" : 0,
"total_time" : "0s",
"total_time_in_millis" : 0
}
followed by 4 REPLICA recoveries:
}, {
"id" : 0,
"type" : "REPLICA",
"stage" : "DONE",
"primary" : false,
"start_time" : "2017-05-12T08:33:55.881Z",
"start_time_in_millis" : 1494578035881,
"stop_time" : "2017-05-12T08:33:55.925Z",
"stop_time_in_millis" : 1494578035925,
"total_time" : "43ms",
"total_time_in_millis" : 43,
"source" : {
"id" : "QZdQAM-oQ_e__vUeAzNOsw",
"host" : "192.168.104.110",
"transport_address" : "192.168.104.110:9300",
"ip" : "192.168.104.110",
"name" : "qa13-ost-1020x-h-as01"
},
"target" : {
"id" : "v25bTq0sQcadYs-ORzisJg",
"host" : "192.168.104.109",
"transport_address" : "192.168.104.109:9300",
"ip" : "192.168.104.109",
"name" : "qa13-ost-1020x-h-ds01"
},
"index" : {
"size" : {
"total" : "130b",
"total_in_bytes" : 130,
"reused" : "0b",
"reused_in_bytes" : 0,
"recovered" : "130b",
"recovered_in_bytes" : 130,
"percent" : "100.0%"
},
"files" : {
"total" : 1,
"reused" : 0,
"recovered" : 1,
"percent" : "100.0%"
},
"total_time" : "30ms",
"total_time_in_millis" : 30,
"source_throttle_time" : "0s",
"source_throttle_time_in_millis" : 0,
"target_throttle_time" : "-1",
"target_throttle_time_in_millis" : 0
},
"translog" : {
"recovered" : 0,
"total" : 0,
"percent" : "100.0%",
"total_on_start" : 0,
"total_time" : "9ms",
"total_time_in_millis" : 9
},
"verify_index" : {
"check_index_time" : "0s",
"check_index_time_in_millis" : 0,
"total_time" : "0s",
"total_time_in_millis" : 0
}
....
It's not clear to us why this is happening.
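For reference, the recovery output above came from the index recovery API (`GET /<index>/_recovery`). To correlate the recovery windows with the failure timestamps from the log, we filter recoveries whose start/stop window contains a given epoch-millis timestamp; a small sketch of that parsing (the helper name and structure are ours, `recovery_json` stands for the parsed API response shown above):

```python
def recoveries_overlapping(recovery_json, ts_millis):
    """Return (index, type, primary) for every shard recovery whose
    start/stop window contains the given epoch-millis timestamp."""
    hits = []
    for index, data in recovery_json.items():
        for shard in data["shards"]:
            if shard["start_time_in_millis"] <= ts_millis <= shard["stop_time_in_millis"]:
                hits.append((index, shard["type"], shard["primary"]))
    return hits
```

Running this against the output above with the timestamp of the logged search failure is how we concluded the ShardNotFoundExceptions coincide with the STORE + REPLICA recovery cycle.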