How to read a huge amount of data from ES 1.7 to index it into ES 6.7

Time: 2019-05-08 13:46:25

Tags: java elasticsearch elastic-stack jest

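I need to read data from ES 1.7 and index it into ES 6.7, because no direct upgrade path is available. Nearly 5 TB of data, about 200 million records, has to be indexed. We are using the ES REST high-level client (6.7.2) with the search-and-scroll approach, but scrolling with the scroll ID fails. The other approach we tried uses from together with a batch size. Reads are fast at first, but as the from offset grows, read performance becomes really poor. What is the best way to do this?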

First approach, using search and scroll:

    // Open the scroll context with the initial search.
    SearchRequest searchRequest = new SearchRequest("indexname");
    SearchSourceBuilder searchSourceBuilder = new SearchSourceBuilder();
    searchSourceBuilder.size(10);
    searchRequest.source(searchSourceBuilder);
    searchRequest.scroll(TimeValue.timeValueMinutes(2));
    SearchResponse searchResponse = client.search(searchRequest, RequestOptions.DEFAULT);
    String scrollId = searchResponse.getScrollId();
    SearchHits hits = searchResponse.getHits();

    // Fetch further pages until one comes back empty.
    boolean run = true;
    while (run) {
        SearchScrollRequest scrollRequest = new SearchScrollRequest(scrollId);
        scrollRequest.scroll(TimeValue.timeValueSeconds(60));
        SearchResponse searchScrollResponse = client.scroll(scrollRequest, RequestOptions.DEFAULT);
        scrollId = searchScrollResponse.getScrollId();
        hits = searchScrollResponse.getHits();

        if (hits.getHits().length == 0) {
            run = false;
        }
    }

Exception:

    Exception in thread "main" ElasticsearchStatusException[Elasticsearch exception [type=exception, reason=ElasticsearchIllegalArgumentException[Failed to decode scrollId]; nested: IOException[Bad Base64 input character decimal 123 in array position 0]; ]]
        at org.elasticsearch.rest.BytesRestResponse.errorFromXContent(BytesRestResponse.java:177)
        at org.elasticsearch.client.RestHighLevelClient.parseEntity(RestHighLevelClient.java:2050)
        at org.elasticsearch.client.RestHighLevelClient.parseResponseException(RestHighLevelClient.java:2026)
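
One possible reading of this error, offered as an assumption rather than a confirmed diagnosis: decimal 123 is the character `{`, which suggests the 1.7 server tried to Base64-decode the JSON request body sent by the 6.7 client as if the whole body were the scroll ID. The sketch below reproduces only that decoding step with the JDK's own Base64 decoder. The class name, the sample scroll ID and the JSON body are all made up for illustration, and no Elasticsearch code is involved.

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;

final class ScrollIdDecodeDemo {

    /** Returns true if the given text decodes as Base64, false otherwise. */
    static boolean decodesAsBase64(String body) {
        try {
            Base64.getDecoder().decode(body.getBytes(StandardCharsets.US_ASCII));
            return true;
        } catch (IllegalArgumentException e) {
            // The JDK decoder rejects any character outside the Base64 alphabet.
            return false;
        }
    }

    public static void main(String[] args) {
        // Hypothetical scroll ID: any valid Base64 string will do for the demo.
        String scrollId = Base64.getEncoder()
                .encodeToString("hypothetical-scroll-id".getBytes(StandardCharsets.US_ASCII));

        // Assumed shape of a JSON scroll request body, for illustration only.
        String jsonBody = "{\"scroll_id\":\"" + scrollId + "\",\"scroll\":\"60s\"}";

        System.out.println("first character code of JSON body: " + (int) jsonBody.charAt(0));
        System.out.println("bare scroll ID decodes: " + decodesAsBase64(scrollId));
        System.out.println("JSON body decodes: " + decodesAsBase64(jsonBody));
    }
}
```

If this reading is right, the scroll ID itself is fine and the mismatch is in how the two versions expect it to be sent.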

Second approach:

    int offset = 0;
    boolean run = true;
    while (run) {
        // Request the next page of 500 hits, starting at the current offset.
        SearchRequest searchRequest = new SearchRequest("indexname");
        SearchSourceBuilder searchSourceBuilder = new SearchSourceBuilder();
        searchSourceBuilder.from(offset);
        searchSourceBuilder.size(500);
        searchRequest.source(searchSourceBuilder);

        // Time each request to see how latency changes as the offset grows.
        long start = System.currentTimeMillis();
        SearchResponse searchResponse = client.search(searchRequest, RequestOptions.DEFAULT);
        long end = System.currentTimeMillis();

        SearchHits hits = searchResponse.getHits();
        System.out.println(" Total hits : " + hits.totalHits + " time : " + (end - start));
        offset += 500;
        if (hits.getHits().length == 0) {
            run = false;
        }
    }

Is there any other way to read the data?

1 answer:

Answer 0 (score: 0)

The best solution is usually reindex from remote: https://www.elastic.co/guide/en/elasticsearch/reference/6.7/docs-reindex.html#reindex-from-remote

I'm not sure whether the REST client is still compatible with 1.x, whereas reindex from remote should support it.
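
As a rough illustration of what such a request might look like, here is a minimal sketch that only builds the JSON body for `POST _reindex` on the 6.7 cluster. The host `http://es17-host:9200`, the index names and the batch size are placeholders, and the helper class is hypothetical. The 6.7 nodes would also need the old host listed under `reindex.remote.whitelist` in `elasticsearch.yml`.

```java
final class RemoteReindexBody {

    /**
     * Builds a reindex-from-remote request body.
     * Assumes the arguments contain no characters that need JSON escaping.
     */
    static String build(String remoteHost, String sourceIndex, String destIndex, int batchSize) {
        return "{"
                + "\"source\":{"
                + "\"remote\":{\"host\":\"" + remoteHost + "\"},"
                + "\"index\":\"" + sourceIndex + "\","
                + "\"size\":" + batchSize   // documents fetched per scroll batch
                + "},"
                + "\"dest\":{\"index\":\"" + destIndex + "\"}"
                + "}";
    }

    public static void main(String[] args) {
        // Placeholder values; replace with the real 1.7 host and index names.
        System.out.println(build("http://es17-host:9200", "indexname", "indexname", 500));
    }
}
```

The 6.7 cluster then pulls the documents from the old cluster itself, so no client code has to scroll through 1.7 directly. For a data set of this size it may be worth submitting the request with `wait_for_completion=false` and polling the resulting task.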

Deep pagination is very expensive, which is why it should be avoided, as your example shows.
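
To put a rough number on that cost, the sketch below uses a simplified model: it assumes every request with a given from and size makes each shard rank its top from + size hits, and it ignores the merge on the coordinating node. The class and method names are hypothetical, and the figures are the 200 million records and page size of 500 from the question.

```java
final class DeepPagingCost {

    /**
     * Total hits a single shard ranks across a full export when paging with
     * from/size, under the simplified model that page k (1-based) costs
     * k * pageSize hits.
     */
    static long hitsRankedPerShard(long totalDocs, long pageSize) {
        long pages = (totalDocs + pageSize - 1) / pageSize; // round up
        // pageSize * (1 + 2 + ... + pages)
        return pageSize * pages * (pages + 1) / 2;
    }

    public static void main(String[] args) {
        long totalDocs = 200_000_000L;
        long pageSize = 500L;
        long paged = hitsRankedPerShard(totalDocs, pageSize);
        System.out.println("hits ranked per shard with from/size: " + paged);
        System.out.println("ratio to a single pass over the data: " + (paged / totalDocs));
    }
}
```

Under this model the work grows quadratically with the number of pages, which matches the slowdown seen in the second approach. A scroll, or a remote reindex that scrolls internally, reads each document once.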