Question

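Cluster: I am running Elasticsearch 1.3.1 with 6 nodes on separate servers, all connected over a LAN. Bandwidth is high, and each server has 45 GB of RAM.

Configuration: The heap size we allocate to each node is 10g. We use the default Elasticsearch configuration, apart from the discovery settings, the cluster name, the node names, and our 2 zones. 3 nodes belong to one zone and the others belong to the other zone.

Indices: 15; the total size of the indices is 76 GB.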

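I am now seeing SearchContextMissingException in the DEBUG logs. It smells like some search queries are taking a long time in the fetch phase. But I checked the queries, and none of them puts a heavy load on the cluster... I am wondering why this happens.

Problem: Because of this issue, all nodes start garbage collecting one after another, which ends in OOM :(

Here is my exception. Please help me understand it.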

What is SearchContextMissingException, and why does it happen?
How can we protect the cluster from this kind of query?

Error:

[YYYY-MM-DD HH:mm:ss,039][DEBUG][action.search.type ] [es_node_01] [5031530] 
   Failed to execute fetch phase 
   org.elasticsearch.transport.RemoteTransportException: [es_node_02][inet[/1x.x.xx.xx:9300]][search/phase/fetch/id] 
   Caused by: org.elasticsearch.search.SearchContextMissingException: No search context found for id [5031530] 
       at org.elasticsearch.search.SearchService.findContext(SearchService.java:480) 
       at org.elasticsearch.search.SearchService.executeFetchPhase(SearchService.java:450) 
       at org.elasticsearch.search.action.SearchServiceTransportAction$SearchFetchByIdTransportHandler.messageReceived(SearchServiceTransportAction.java:793) 
       at org.elasticsearch.search.action.SearchServiceTransportAction$SearchFetchByIdTransportHandler.messageReceived(SearchServiceTransportAction.java:782) 
       at org.elasticsearch.transport.netty.MessageChannelHandler$RequestHandler.run(MessageChannelHandler.java:275) 
       at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110) 
       at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603) 
       at java.lang.Thread.run(Thread.java:722)

Answer 1

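If you can, update to 1.4.2. It fixes a number of known resiliency issues, including the cascading failures you describe.

Either way, the default configuration will certainly get you into trouble. At a minimum, you may want to look into setting circuit breakers, e.g. for the field data cache.

Here is a snippet lifted from our production configuration. I assume you have also configured the Linux file handle limits correctly: see here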

# prevent swapping
bootstrap.mlockall: true

indices.breaker.total.limit: 70%
indices.fielddata.cache.size: 70%

# make elasticsearch work harder to migrate/allocate indices on startup (we have a lot of shards due to logstash); default was 2
cluster.routing.allocation.node_concurrent_recoveries: 8

# enable cors
http.cors.enabled: true
http.cors.allow-origin: /https?:\/\/(localhost|kibana.*\.linko\.io)(:[0-9]+)?/

index.query.bool.max_clause_count: 4096

Answer 2

The same error (or rather debug statement) still appears in 1.6.0, and it is not a bug.

When you create a new scroll request:
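As background on why this can be a harmless debug statement: a shard keeps a search context alive between the query and fetch phases (or between scroll calls) only for a limited keep-alive period, and a fetch that arrives after the context has been freed finds nothing under that id. The following is a toy model of that idea in plain Java. The class and method names are hypothetical and are not Elasticsearch's actual SearchService API.

```java
import java.util.HashMap;
import java.util.Map;

// Toy model only: hypothetical names, not Elasticsearch code.
class SearchContextDemo {
    // context id -> time (in ms) at which the context expires
    private final Map<Long, Long> expiryById = new HashMap<>();
    private long nextId = 1;

    // Query phase: open a context that stays valid for keepAliveMs.
    long openContext(long nowMs, long keepAliveMs) {
        long id = nextId++;
        expiryById.put(id, nowMs + keepAliveMs);
        return id;
    }

    // Reaper: free every context whose keep-alive has elapsed.
    void reap(long nowMs) {
        expiryById.values().removeIf(expiry -> expiry <= nowMs);
    }

    // Fetch phase: fails if the context has already been freed.
    String fetch(long id, long nowMs) {
        reap(nowMs);
        if (!expiryById.containsKey(id)) {
            throw new IllegalStateException("No search context found for id [" + id + "]");
        }
        return "hits for context " + id;
    }

    public static void main(String[] args) {
        SearchContextDemo node = new SearchContextDemo();
        long id = node.openContext(0, 60_000);      // 60 s keep-alive
        System.out.println(node.fetch(id, 30_000)); // still inside the keep-alive
        try {
            node.fetch(id, 90_000);                 // keep-alive has elapsed
        } catch (IllegalStateException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

In this sketch the second fetch fails simply because it arrives too late, which is consistent with the exception being logged at DEBUG level.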

SearchResponse scrollResponse = client.prepareSearch(index).setTypes(types).setSearchType(SearchType.SCAN)
            .setScroll(new TimeValue(60000)).setSize(maxItemsPerScrollRequest).setQuery(ElasticSearchQueryBuilder.createMatchAllQuery()).execute().actionGet();
String scrollId = scrollResponse.getScrollId();

a new scroll ID is created (apart from the scrollId, the response is empty). To fetch the results:

long resultCounter = 0L; // number of results retrieved so far
Long nResultsTotal = null; // total number of items we expect
do {
    final SearchResponse response = client.prepareSearchScroll(scrollId).setScroll(new TimeValue(600000)).execute().actionGet();
    scrollId = response.getScrollId(); // always continue with the most recent scroll ID
    // handle result
    if (nResultsTotal == null) // if not initialized
        nResultsTotal = response.getHits().getTotalHits(); // set total number of documents
    resultCounter += response.getHits().getHits().length; // keep track of the items retrieved
} while (resultCounter < nResultsTotal);

This approach works no matter how many shards you have. Another option is to add a break statement when:

boolean breakIf = response.getHits().getHits().length < (nShards * maxItemsPerScrollRequest);

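The number of items returned is maxItemsPerScrollRequest (per shard!), so we expect the number of items requested multiplied by the number of shards. But when we have multiple shards and one of them has run out of documents while the others have not, the former method will still give us all available documents. The latter will stop prematurely, I expect (not tested!).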
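The difference between the two stopping conditions can be sketched with a small simulation. This is a toy model under stated assumptions: no Elasticsearch classes are involved, shards are plain counters, each page returns up to `size` hits per shard, and the names `ScrollStopDemo`, `fetchUntilTotal`, and `fetchUntilShortPage` are made up for illustration.

```java
import java.util.Arrays;

// Toy model of a scan-type scroll: each page returns up to `size` hits PER SHARD.
class ScrollStopDemo {

    // Takes one page from the shards and returns how many hits it contained.
    static int nextPage(int[] remainingPerShard, int size) {
        int hits = 0;
        for (int i = 0; i < remainingPerShard.length; i++) {
            int taken = Math.min(size, remainingPerShard[i]);
            remainingPerShard[i] -= taken;
            hits += taken;
        }
        return hits;
    }

    // Former method: keep scrolling until the running count reaches the total.
    static int fetchUntilTotal(int[] docsPerShard, int size) {
        int[] remaining = docsPerShard.clone();
        long total = Arrays.stream(remaining).sum();
        int fetched = 0;
        while (fetched < total) {
            fetched += nextPage(remaining, size);
        }
        return fetched;
    }

    // Latter method: stop as soon as a page is smaller than nShards * size.
    static int fetchUntilShortPage(int[] docsPerShard, int size) {
        int[] remaining = docsPerShard.clone();
        int fetched = 0;
        while (true) {
            int hits = nextPage(remaining, size);
            fetched += hits;
            if (hits < remaining.length * size) {
                break;
            }
        }
        return fetched;
    }

    public static void main(String[] args) {
        int[] docsPerShard = {5, 1, 5}; // hypothetical uneven distribution
        System.out.println("count-based: " + fetchUntilTotal(docsPerShard, 2));
        System.out.println("short-page:  " + fetchUntilShortPage(docsPerShard, 2));
    }
}
```

With the hypothetical distribution {5, 1, 5} and a page size of 2, the count-based loop retrieves all 11 documents, while the short-page check stops after the first page with only 5, because the second shard cannot fill its share of that page.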

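Another way to stop seeing this exception (since it is only DEBUG) is to open the logging.yml file in Elasticsearch's config directory, then change:

action: DEBUG

to a less verbose level, such as:

action: INFO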

SearchContextMissingException Failed to execute fetch phase [search/phase/fetch/id]
