We have a 3-node (c4.large) Elasticsearch 2.1.0 cluster (nodes I'll call es-live-0, es-live-1 and es-live-2) set up on AWS EC2. It has been running well and serves our webapp.
One of the nodes also hosts a Kibana instance, which collects and displays the data that marvel-agent sends to it.
Yesterday marvel-agent was unable to create a new index for the Marvel data and failed; some sample logs are provided below. The cluster subsequently went down, but its status recovered to green after about 20 minutes. However, on arriving at the office this morning I found that this was all lies! Our webapp requests were timing out, and although es-live-0 looked fine on the EC2 monitoring dashboard, I couldn't get into it. A restart fixed the problem, but seeing as this is our production system I'd really like to get to the bottom of it.
After reading this thread: https://discuss.elastic.co/t/marvel-high-index-rate/38935/4, I realise we should move Kibana to a dedicated node and send the Marvel data to an Elasticsearch instance running on it. Could that be the underlying problem? To give you an idea, our webapp uses 3 main indices with just over 23 shards between them. The total number of shards on the system is 145, most of which relate to the Marvel data. At the same time, it doesn't feel like that number of shards should be enough to make one of the nodes unresponsive, or am I wrong to assume that?
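For reference, my understanding is that pointing marvel-agent at a separate monitoring cluster is done with an HTTP exporter in elasticsearch.yml, roughly like the sketch below (the host name is just a placeholder, not our actual setup):

# elasticsearch.yml on each production node (sketch; "monitoring-node" is a placeholder)
marvel.agent.exporters:
  my_monitoring_cluster:
    type: http
    host: ["http://monitoring-node:9200"]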
Also, if one of the nodes was not responding, why didn't the cluster drop it and carry on as a two-node setup?
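We are running with mostly default discovery settings; as far as I understand, whether a non-responsive node gets dropped is governed by the zen fault-detection pings, roughly these knobs (the values below are my understanding of the 2.x defaults plus the usual 3-node recommendation, not something we have tuned):

# Sketch of the zen discovery/fault-detection settings in elasticsearch.yml
discovery.zen.minimum_master_nodes: 2   # usual recommendation for a 3-node cluster
discovery.zen.fd.ping_interval: 1s      # how often each node is pinged
discovery.zen.fd.ping_timeout: 30s      # how long to wait for a ping response
discovery.zen.fd.ping_retries: 3        # consecutive failures before a node is considered failed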
示例日志:
Feb 02 00:02:13 es-live-1 elasticsearch-live.log: [2016-02-02 00:02:17,006][ERROR][marvel.agent ] [Joshua Guthrie] background thread had an uncaught exception
Feb 02 00:02:13 es-live-1 elasticsearch-live.log: ElasticsearchException[failed to flush exporter bulks]
Feb 02 00:02:13 es-live-1 elasticsearch-live.log: at org.elasticsearch.marvel.agent.exporter.ExportBulk$Compound.flush(ExportBulk.java:104)
Feb 02 00:02:13 es-live-1 elasticsearch-live.log: at org.elasticsearch.marvel.agent.exporter.ExportBulk.close(ExportBulk.java:53)
Feb 02 00:02:13 es-live-1 elasticsearch-live.log: at org.elasticsearch.marvel.agent.AgentService$ExportingWorker.run(AgentService.java:201)
Feb 02 00:02:13 es-live-1 elasticsearch-live.log: at java.lang.Thread.run(Thread.java:745)
Feb 02 00:02:13 es-live-1 elasticsearch-live.log: Suppressed: ElasticsearchException[failed to flush [default_local] exporter bulk]; nested: ElasticsearchException[failure in bulk execution:
Feb 02 00:02:13 es-live-1 elasticsearch-live.log: [0]: index [.marvel-es-2016.02.02], type [node_stats], id [null], message [RemoteTransportException[[Corruptor][es-live-0:9300][indices:admin/create]]; nested: ProcessClusterEventTimeoutException[failed to process cluster event (acquire index lock) within 1m];]];
Feb 02 00:02:13 es-live-1 elasticsearch-live.log: at org.elasticsearch.marvel.agent.exporter.ExportBulk$Compound.flush(ExportBulk.java:106)
Feb 02 00:02:13 es-live-1 elasticsearch-live.log: ... 3 more
Feb 02 00:02:13 es-live-1 elasticsearch-live.log: Caused by: ElasticsearchException[failure in bulk execution:
Feb 02 00:02:13 es-live-1 elasticsearch-live.log: [0]: index [.marvel-es-2016.02.02], type [node_stats], id [null], message [RemoteTransportException[[Corruptor][es-live-0:9300][indices:admin/create]]; nested: ProcessClusterEventTimeoutException[failed to process cluster event (acquire index lock) within 1m];]]
Feb 02 00:02:13 es-live-1 elasticsearch-live.log: at org.elasticsearch.marvel.agent.exporter.local.LocalBulk.flush(LocalBulk.java:114)
Feb 02 00:02:13 es-live-1 elasticsearch-live.log: at org.elasticsearch.marvel.agent.exporter.ExportBulk$Compound.flush(ExportBulk.java:101)
Feb 02 00:02:13 es-live-1 elasticsearch-live.log: ... 3 more