Elasticsearch: some documents are missing when creating an index from another index

Time: 2018-06-28 19:59:45

Tags: java elasticsearch elastic-stack

I created an index in Elasticsearch named documents_local and loaded data into it from an Oracle database via Logstash. The index contains the following values.

"FilePath" : "Path of the file",
"FileName" : "filename.pdf",
"Language" : "Language_name"

I also want to index the contents of those files, so I created a second index in Elasticsearch named document_attachment. Using the Java code below, I fetch FilePath and FileName from the documents_local index, use the file path to read the file (which is available on my local drive), and index the file contents with the ingest attachment plugin processor.
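The read-and-encode step described above can be looked at in isolation. The following is a minimal sketch, not the code from the post: `FileEncoder` and `encodeFileToBase64` are hypothetical names, and returning null for a missing file is an assumption chosen to mirror what happens to `encodedfile` in the posted code when the file does not exist.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Base64;

public class FileEncoder {

    // Hypothetical helper: returns the Base64 text of a regular file,
    // or null when the path is missing or points at a directory.
    static String encodeFileToBase64(Path file) throws IOException {
        if (!Files.isRegularFile(file)) {
            return null;
        }
        // readAllBytes reads the complete file and closes it afterwards
        byte[] bytes = Files.readAllBytes(file);
        return Base64.getEncoder().encodeToString(bytes);
    }

    public static void main(String[] args) throws IOException {
        // A temporary file stands in for one of the PDFs on the local drive
        Path sample = Files.createTempFile("sample", ".pdf");
        Files.write(sample, "hello".getBytes(StandardCharsets.UTF_8));
        System.out.println(encodeFileToBase64(sample)); // prints aGVsbG8=
        Files.delete(sample);
    }
}
```

Reading through `Files.readAllBytes` avoids relying on a single `read` call filling the whole buffer, and it leaves no stream open.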

Please find my Java code below, which indexes the files (I set up the ingest attachment pipeline first and then ran this Java code).

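import java.io.File;
import java.io.FileInputStream;
import java.io.FileNotFoundException;
import java.io.IOException;
import java.util.Base64;
import java.util.HashMap;
import java.util.Map;

import org.apache.http.HttpHost;
import org.elasticsearch.ElasticsearchException;
import org.elasticsearch.action.index.IndexRequest;
import org.elasticsearch.action.index.IndexResponse;
import org.elasticsearch.action.search.SearchRequest;
import org.elasticsearch.action.search.SearchResponse;
import org.elasticsearch.client.RestClient;
import org.elasticsearch.client.RestHighLevelClient;
import org.elasticsearch.index.query.QueryBuilder;
import org.elasticsearch.index.query.QueryBuilders;
import org.elasticsearch.rest.RestStatus;
import org.elasticsearch.search.SearchHit;
import org.elasticsearch.search.builder.SearchSourceBuilder;

private final static String INDEX = "documents_local";  //Documents Table with file Path - Source Index
private final static String ATTACHMENT = "document_attachment"; // Documents with Attachment...  -- Destination Index
private final static String TYPE = "doc";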


public static void main(String args[]) throws IOException {


    RestHighLevelClient restHighLevelClient = null;
    Document doc=new Document();

    try {
        restHighLevelClient = new RestHighLevelClient(RestClient.builder(new HttpHost("localhost", 9200, "http"),
                new HttpHost("localhost", 9201, "http")));
    } catch (Exception e) {
        System.out.println(e.getMessage());
    }


    //Fetching Id, FilePath & FileName from Document Index. 
    SearchRequest searchRequest = new SearchRequest(INDEX); 
    searchRequest.types(TYPE);
    SearchSourceBuilder searchSourceBuilder = new SearchSourceBuilder();
    QueryBuilder qb = QueryBuilders.matchAllQuery();
    searchSourceBuilder.query(qb);
    searchSourceBuilder.size(3000);
    searchRequest.source(searchSourceBuilder);
    SearchResponse searchResponse = null;
    try {
         searchResponse = restHighLevelClient.search(searchRequest);
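    } catch (IOException e) {
        // Without a search response there is nothing to index: report the error and stop
        System.out.println(e.getLocalizedMessage());
        return;
    }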

    SearchHit[] searchHits = searchResponse.getHits().getHits();
    long totalHits=searchResponse.getHits().totalHits;

    int line=1;
    String docpath = null;

    Map<String, Object> jsonMap ;
    for (SearchHit hit : searchHits) {

        String encodedfile = null;
        File file=null;

        Map<String, Object> sourceAsMap = hit.getSourceAsMap();
        doc.setId((int) sourceAsMap.get("id"));
        doc.setLanguage(sourceAsMap.get("language"));
        doc.setFilename(sourceAsMap.get("filename").toString());
        doc.setPath(sourceAsMap.get("path").toString());

        String filepath=doc.getPath().concat(doc.getFilename());

        System.out.println("Line Number--> " + (line++) + "ID---> " + doc.getId() + "File Path --->" + filepath);

        file = new File(filepath);
        // encodedfile stays null when the file is missing or is a directory
        if(file.exists() && !file.isDirectory()) {
            // try-with-resources closes the stream even if reading fails
            try (FileInputStream fileInputStreamReader = new FileInputStream(file)) {
                byte[] bytes = new byte[(int) file.length()];
                fileInputStreamReader.read(bytes);
                encodedfile = Base64.getEncoder().encodeToString(bytes);
            } catch (FileNotFoundException e) {
                e.printStackTrace();
            }
        }

        jsonMap = new HashMap<>();
        jsonMap.put("id", doc.getId());
        jsonMap.put("language", doc.getLanguage());
        jsonMap.put("filename", doc.getFilename());
        jsonMap.put("path", doc.getPath());
        jsonMap.put("fileContent", encodedfile);

        String id=Long.toString(doc.getId());

        IndexRequest request = new IndexRequest(ATTACHMENT, "doc", id )
                .source(jsonMap)
                .setPipeline(ATTACHMENT);


        try {
            IndexResponse response = restHighLevelClient.index(request);
        } catch(ElasticsearchException e) {
            if (e.status() == RestStatus.CONFLICT) {
            }
            e.printStackTrace();
        }

    }

    // Release the HTTP connections held by the client
    restHighLevelClient.close();

    System.out.println("Indexing done...");
}

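The ingest pipeline for the ingest attachment plugin is defined as follows:

PUT _ingest/pipeline/document_attachment
{
  "description" : "Extract attachment information",
  "processors" : [
    {
      "attachment" : {
        "field" : "fileContent"
      }
    }
  ]
}

But while running this process, some documents go missing.

My source index is documents_local, which contains 2910 documents. I fetch all of these documents, attach the corresponding PDF (after converting it to Base64), and write the result to another index, document_attachment.

However, the document_attachment index has only 118 documents when it should have 2910. Some documents are missing. Indexing also takes a very long time.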

I'm not sure how documents are going missing from the second index (document_attachment). Is there another approach I could use to simplify this process?
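One way to narrow down where documents go missing is to check every file path up front and list the ones without a readable file. This is only a sketch, and it rests on an unconfirmed assumption: that the gap comes from records whose file cannot be found on disk, in which case the posted code sends `fileContent` as null. `MissingFileReport` and `findMissing` are hypothetical names, and the input list stands in for the `path` + `filename` values read from documents_local.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class MissingFileReport {

    // Returns the entries that do not point at a readable regular file.
    static List<String> findMissing(List<String> filePaths) {
        List<String> missing = new ArrayList<>();
        for (String p : filePaths) {
            Path path = Paths.get(p);
            if (!Files.isRegularFile(path) || !Files.isReadable(path)) {
                missing.add(p);
            }
        }
        return missing;
    }

    public static void main(String[] args) throws IOException {
        // One real temporary file and one path that does not exist
        Path existing = Files.createTempFile("present", ".pdf");
        List<String> paths = Arrays.asList(existing.toString(), existing.toString() + ".absent");

        List<String> missing = findMissing(paths);
        System.out.println(missing.size() + " of " + paths.size() + " files are missing: " + missing);
        Files.delete(existing);
    }
}
```

If the number of missing files does not account for the difference between the two indices, the stack traces printed by the indexing loop are the next place to look.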

Can we use a threading mechanism here? Later on I will have to index more than 100,000 PDFs in the same way.
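On the threading question, one possible shape is a fixed-size thread pool from the JDK in which each task handles one file. The sketch below is not a tested solution for this setup: `ParallelIndexer`, `indexAll`, the pool size, and the one-hour timeout are all assumptions, and the `indexer` callback is a hypothetical stand-in for whatever reads a file and sends it to Elasticsearch.

```java
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Consumer;

public class ParallelIndexer {

    // Runs indexer.accept(path) for every path on a fixed pool and returns
    // how many calls completed without throwing.
    static int indexAll(List<String> filePaths, int threads, Consumer<String> indexer)
            throws InterruptedException {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        AtomicInteger succeeded = new AtomicInteger();
        for (String path : filePaths) {
            pool.execute(() -> {
                try {
                    indexer.accept(path);
                    succeeded.incrementAndGet();
                } catch (RuntimeException e) {
                    // Report and continue so one bad file does not stop the rest
                    System.err.println("Failed: " + path + " -> " + e.getMessage());
                }
            });
        }
        pool.shutdown();                          // accept no new tasks
        pool.awaitTermination(1, TimeUnit.HOURS); // wait for the submitted tasks
        return succeeded.get();
    }

    public static void main(String[] args) throws InterruptedException {
        List<String> paths = Arrays.asList("a.pdf", "b.pdf", "bad.pdf", "c.pdf");
        // The lambda simulates indexing; "bad.pdf" fails on purpose
        int ok = indexAll(paths, 2, path -> {
            if (path.startsWith("bad")) {
                throw new IllegalStateException("simulated indexing failure");
            }
        });
        System.out.println(ok + " of " + paths.size() + " indexed");
    }
}
```

Counting successes and reporting each failure makes it visible when fewer documents are written than were read. Whether a single client instance can be shared between the worker threads should be checked against the client documentation for the version in use.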

0 answers:

No answers