Indexing 100K documents into Elasticsearch with the BulkRequest API using the Java RestHighLevelClient

Asked: 2018-08-16 01:47:27

Tags: java elasticsearch elastic-stack

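I am using the scroll API to read the file paths of 100K documents from the index documents_qa. The actual files are available on my local d:\ drive. Using each file path, I read the actual file, convert it to base64, and reindex the document with the base64 content (of the file) into another index, document_attachment_qa.

My current implementation reads the filePath, converts the file to base64, and indexes the document together with its fileContent, one document at a time. This is slow: indexing 4,000 documents took more than 6 hours, and the connection was also terminated because of an IO exception.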
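The base64 step described above can be sketched with the JDK alone. This is a minimal illustration, not the asker's actual code: the class name FileBase64 and its method names are hypothetical. It reads the whole file into memory, which is fine for small attachments but may need streaming for very large files.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Base64;

// Hypothetical helper (not from the original post): file -> Base64 string.
public class FileBase64 {

    // Encodes raw bytes with the standard (RFC 4648) Base64 alphabet.
    public static String encode(byte[] bytes) {
        return Base64.getEncoder().encodeToString(bytes);
    }

    // Reads the whole file into memory, then encodes it.
    public static String encodeFile(Path file) throws IOException {
        return encode(Files.readAllBytes(file));
    }
}
```

The returned string would play the role of the `result` value that is later put into the `fileContent` field.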

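So now I want to index the documents with the BulkRequest API, but I am using RestHighLevelClient and I am not sure how to use the BulkRequest API together with RestHighLevelClient.

Below is my current implementation, which indexes the documents one at a time.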

jsonMap = new HashMap<String, Object>();
jsonMap.put("id", doc.getId());
jsonMap.put("app_language", doc.getApp_language());
jsonMap.put("fileContent", result);

String id = Long.toString(doc.getId());

IndexRequest request = new IndexRequest(ATTACHMENT, "doc", id) // ATTACHMENT is the index name
        .source(jsonMap)         // my single document
        .setPipeline(ATTACHMENT);

IndexResponse response = SearchEngineClient.getInstance3().index(request); // increased timeout

I found the following documentation for BulkRequest.

https://www.elastic.co/guide/en/elasticsearch/client/java-api/current/java-docs-bulk.html

But I am not sure how to implement BulkRequestBuilder bulkRequest = client.prepareBulk(); (that is, the client.prepareBulk() method) when using RestHighLevelClient.

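Update 1

I am trying to index all 100K documents in one shot. So I created a JSONArray and put all my JSONObjects into the array one by one. Finally, I try to build the bulk request, add all my documents (the JSONArray) to it as the source, and index them.

I am not sure how to convert the JSONArray into a list of strings.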

private final static String ATTACHMENT = "document_attachment_qa";
private final static String TYPE = "doc";
JSONArray reqJSONArray=new JSONArray();

while (searchHits != null && searchHits.length > 0) { 
...
...
    jsonMap = new HashMap<String, Object>();
    jsonMap.put("id", doc.getId());
    jsonMap.put("app_language", doc.getApp_language());
    jsonMap.put("fileContent", result);

    reqJSONArray.put(jsonMap);
}

String actionMetaData = String.format("{ \"index\" : { \"_index\" : \"%s\", \"_type\" : \"%s\" } }%n", ATTACHMENT, TYPE);
List<String> bulkData =   // not sure how to convert a list of my documents in JSON strings    
StringBuilder bulkRequestBody = new StringBuilder();
for (String bulkItem : bulkData) {
    bulkRequestBody.append(actionMetaData);
    bulkRequestBody.append(bulkItem);
    bulkRequestBody.append("\n");
}

HttpEntity entity = new NStringEntity(bulkRequestBody.toString(), ContentType.APPLICATION_JSON);
try {
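    // Build the endpoint from the constants; the literal "/ATTACHMENT/TYPE/_bulk" would target an index named "ATTACHMENT".
    Response response = SearchEngineClient.getRestClientInstance().performRequest("POST", "/" + ATTACHMENT + "/" + TYPE + "/_bulk", Collections.emptyMap(), entity);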
    return response.getStatusLine().getStatusCode() == HttpStatus.SC_OK;
} catch (Exception e) {
    // do something
}
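For reference, the newline-delimited body that the loop above is trying to assemble can be sketched as follows. This is only an illustration of the _bulk body layout (one action line followed by one source line per document, each terminated by a newline). The class name BulkBodyBuilder is hypothetical, and it assumes every document has already been serialized to a single-line JSON string.

```java
import java.util.List;

// Hypothetical helper (not from the original post): builds an NDJSON _bulk body.
public class BulkBodyBuilder {

    // docs: each element must already be a single-line JSON object.
    public static String build(String index, String type, List<String> docs) {
        String action = String.format(
                "{\"index\":{\"_index\":\"%s\",\"_type\":\"%s\"}}", index, type);
        StringBuilder body = new StringBuilder();
        for (String doc : docs) {
            body.append(action).append('\n'); // action/metadata line
            body.append(doc).append('\n');    // document source line
        }
        return body.toString();               // the body ends with a newline
    }
}
```

If the JSONArray comes from the org.json library (an assumption, since the post does not say), each element could probably be turned into such a string with reqJSONArray.getJSONObject(i).toString() inside a loop over reqJSONArray.length().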

2 Answers:

Answer 0 (score: 1)

You can simply create a new BulkRequest() and add requests to it, without using BulkRequestBuilder, for example:

BulkRequest request = new BulkRequest();
request.add(new IndexRequest("foo", "bar", "1")
        .source(XContentType.JSON,"field", "foobar"));
request.add(new IndexRequest("foo", "bar", "2")
        .source(XContentType.JSON,"field", "foobar"));
...
BulkResponse bulkResponse = myHighLevelClient.bulk(request, RequestOptions.DEFAULT);
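
With 100K base64-encoded files, putting everything into a single BulkRequest would likely produce a very large HTTP request (Elasticsearch limits the request size through http.max_content_length, 100mb by default), so it may be safer to send the documents in batches. The sketch below shows only the batching logic, using the JDK alone. The class Batcher and the batch size are hypothetical and are not part of the Elasticsearch client.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

// Hypothetical helper: buffers items and hands them over in fixed-size batches.
public class Batcher<T> {
    private final int batchSize;
    private final Consumer<List<T>> flushAction;
    private final List<T> buffer = new ArrayList<>();

    public Batcher(int batchSize, Consumer<List<T>> flushAction) {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        this.batchSize = batchSize;
        this.flushAction = flushAction;
    }

    // Adds one item and flushes automatically once the batch is full.
    public void add(T item) {
        buffer.add(item);
        if (buffer.size() >= batchSize) {
            flush();
        }
    }

    // Sends whatever is buffered; call once more after the last item.
    public void flush() {
        if (buffer.isEmpty()) {
            return;
        }
        flushAction.accept(new ArrayList<>(buffer)); // pass a copy, then reset
        buffer.clear();
    }
}
```

In this setting T would be IndexRequest, and the flush action would create a new BulkRequest, add every request of the batch to it, and call bulk(...) on the high-level client. The client library also ships a BulkProcessor helper that performs this kind of batching.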

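Answer 1 (score: 1)

In addition to @chengpohi's answer, I would like to add the following points:

A BulkRequest can be used to execute multiple index, update, and/or delete operations in a single request.

It requires at least one operation to be added to the bulk request: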

BulkRequest request = new BulkRequest(); 
request.add(new IndexRequest("posts", "doc", "1")  
        .source(XContentType.JSON,"field", "foo"));
request.add(new IndexRequest("posts", "doc", "2")  
        .source(XContentType.JSON,"field", "bar"));
request.add(new IndexRequest("posts", "doc", "3")  
        .source(XContentType.JSON,"field", "baz"));
  

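Note: the Bulk API supports only documents encoded in JSON or SMILE. Providing documents in any other format will result in an error.

Synchronous operation: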

BulkResponse bulkResponse = client.bulk(request, RequestOptions.DEFAULT);

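Here client is the RestHighLevelClient (the high-level REST client), and the execution is synchronous.

Asynchronous operation (recommended approach):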

client.bulkAsync(request, RequestOptions.DEFAULT, listener);

Asynchronous execution of a bulk request requires passing both the BulkRequest instance and an ActionListener instance to the asynchronous method.

Listener example:

ActionListener<BulkResponse> listener = new ActionListener<BulkResponse>() {
    @Override
    public void onResponse(BulkResponse bulkResponse) {

    }

    @Override
    public void onFailure(Exception e) {

    }
};

The returned BulkResponse contains information about the executed operations and allows you to iterate over each result as follows:

for (BulkItemResponse bulkItemResponse : bulkResponse) { 
    DocWriteResponse itemResponse = bulkItemResponse.getResponse(); 

    if (bulkItemResponse.getOpType() == DocWriteRequest.OpType.INDEX
            || bulkItemResponse.getOpType() == DocWriteRequest.OpType.CREATE) { 
        IndexResponse indexResponse = (IndexResponse) itemResponse;

    } else if (bulkItemResponse.getOpType() == DocWriteRequest.OpType.UPDATE) { 
        UpdateResponse updateResponse = (UpdateResponse) itemResponse;

    } else if (bulkItemResponse.getOpType() == DocWriteRequest.OpType.DELETE) { 
        DeleteResponse deleteResponse = (DeleteResponse) itemResponse;
    }
}

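The following arguments can optionally be provided, for example a timeout, given either as a TimeValue or as a String: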

request.timeout(TimeValue.timeValueMinutes(2)); 
request.timeout("2m");

I hope this helps.