I created an index in Elasticsearch named documents_local and loaded data into it from an Oracle database via Logstash. The index contains the following fields:

    "FilePath" : "Path of the file",
    "FileName" : "filename.pdf",
    "Language" : "Language_name"

I also want to index the contents of those files, so I created a second index in Elasticsearch named document_attachment. Using the Java code below, I fetch FilePath and FileName from the documents_local index. With the file path I retrieve the file, which is available on my local drive, and I use the ingest attachment plugin processor to index its contents.

I created the ingest pipeline for the attachment processor first and then ran this Java code; the pipeline definition follows the code.

Here is the Java code that indexes the files:

    private final static String INDEX = "documents_local";          // source index: documents table with file paths
    private final static String ATTACHMENT = "document_attachment"; // destination index, also used as the pipeline name
    private final static String TYPE = "doc";

    public static void main(String args[]) throws IOException {
        RestHighLevelClient restHighLevelClient = null;
        Document doc = new Document();

        try {
            restHighLevelClient = new RestHighLevelClient(RestClient.builder(
                    new HttpHost("localhost", 9200, "http"),
                    new HttpHost("localhost", 9201, "http")));
        } catch (Exception e) {
            System.out.println(e.getMessage());
        }

        // Fetch id, path and filename from the source index (at most 3000 hits).
        SearchRequest searchRequest = new SearchRequest(INDEX);
        searchRequest.types(TYPE);
        SearchSourceBuilder searchSourceBuilder = new SearchSourceBuilder();
        QueryBuilder qb = QueryBuilders.matchAllQuery();
        searchSourceBuilder.query(qb);
        searchSourceBuilder.size(3000);
        searchRequest.source(searchSourceBuilder);

        SearchResponse searchResponse = null;
        try {
            searchResponse = restHighLevelClient.search(searchRequest);
        } catch (IOException e) {
            e.getLocalizedMessage();
        }

        SearchHit[] searchHits = searchResponse.getHits().getHits();
        long totalHits = searchResponse.getHits().totalHits;

        int line = 1;
        String docpath = null;
        Map<String, Object> jsonMap;

        for (SearchHit hit : searchHits) {
            String encodedfile = null;
            File file = null;

            // Copy the fields of the source document into the Document bean.
            Map<String, Object> sourceAsMap = hit.getSourceAsMap();
            doc.setId((int) sourceAsMap.get("id"));
            doc.setLanguage(sourceAsMap.get("language"));
            doc.setFilename(sourceAsMap.get("filename").toString());
            doc.setPath(sourceAsMap.get("path").toString());

            String filepath = doc.getPath().concat(doc.getFilename());
            System.out.println("Line Number--> " + (line++) + "ID---> " + doc.getId() + "File Path --->" + filepath);

            // Read the file from the local drive and Base64-encode it.
            // encodedfile stays null when the file does not exist.
            file = new File(filepath);
            if (file.exists() && !file.isDirectory()) {
                try {
                    FileInputStream fileInputStreamReader = new FileInputStream(file);
                    byte[] bytes = new byte[(int) file.length()];
                    fileInputStreamReader.read(bytes);
                    encodedfile = new String(Base64.getEncoder().encodeToString(bytes));
                } catch (FileNotFoundException e) {
                    e.printStackTrace();
                }
            }

            // Build the destination document; fileContent is the field the pipeline reads.
            jsonMap = new HashMap<>();
            jsonMap.put("id", doc.getId());
            jsonMap.put("language", doc.getLanguage());
            jsonMap.put("filename", doc.getFilename());
            jsonMap.put("path", doc.getPath());
            jsonMap.put("fileContent", encodedfile);

            String id = Long.toString(doc.getId());

            // Index into the destination index through the ingest pipeline.
            IndexRequest request = new IndexRequest(ATTACHMENT, "doc", id)
                    .source(jsonMap)
                    .setPipeline(ATTACHMENT);

            try {
                IndexResponse response = restHighLevelClient.index(request);
            } catch (ElasticsearchException e) {
                if (e.status() == RestStatus.CONFLICT) {
                }
                e.printStackTrace();
            }
        }

        System.out.println("Indexing done...");
    }
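
For illustration only, the file-reading step of the loop can be isolated into a small helper. This is a sketch, not part of my original code: the class name `FileEncoder` and the method `encodeFile` are hypothetical. It assumes the directory and the file name are joined with a path separator (the code above uses plain string concatenation instead). It uses `Files.readAllBytes`, which either reads the whole file or throws, and it returns an empty `Optional` for a missing file so the caller can log or skip that document rather than send a null `fileContent`.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Base64;
import java.util.Optional;

public class FileEncoder {

    // Hypothetical helper: returns the Base64 text of a regular file,
    // or Optional.empty() when the path is not an existing regular file.
    public static Optional<String> encodeFile(String directory, String fileName) {
        Path path = Paths.get(directory, fileName);
        if (!Files.isRegularFile(path)) {
            return Optional.empty();
        }
        try {
            byte[] bytes = Files.readAllBytes(path);
            return Optional.of(Base64.getEncoder().encodeToString(bytes));
        } catch (IOException e) {
            // Surface read errors instead of silently continuing with null content.
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) throws IOException {
        // Demo with a temporary file standing in for a PDF on the local drive.
        Path tmp = Files.createTempFile("sample", ".pdf");
        Files.write(tmp, "hello".getBytes(StandardCharsets.UTF_8));

        String dir = tmp.getParent().toString();
        String name = tmp.getFileName().toString();

        System.out.println(encodeFile(dir, name).orElse("<missing>"));
        System.out.println(encodeFile(dir, "does-not-exist.pdf").orElse("<missing>"));

        Files.delete(tmp);
    }
}
```

With a helper like this, the loop could count or print the ids whose file was not found, which makes it easier to compare against the number of documents that reach the destination index.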
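
The ingest pipeline for the attachment processor is defined as follows:

    PUT _ingest/pipeline/document_attachment
    {
      "description" : "Extract attachment information",
      "processors" : [
        {
          "attachment" : {
            "field" : "fileContent"
          }
        }
      ]
    }

However, some documents go missing during this process. My source index is documents_local, which contains 2910 documents. I fetch all of them, attach the corresponding PDF (after converting it to base64), and write the result to the second index, document_attachment. But document_attachment ends up with only 118 documents, so some are missing. Indexing also takes a very long time.

I am not sure how documents are getting lost in the second index (document_attachment). Is there a better way to do this?
Could a threading mechanism be used here? In the future I will have to index more than 100k PDFs in the same way.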
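
To make the threading idea concrete, here is a rough sketch of what I have in mind, using only `java.util.concurrent`. Everything in it is an assumption for illustration: `ParallelIndexer` and `runAll` are hypothetical names, and the `indexOne` callback is a placeholder standing in for the real read, encode and index step for one document. It runs the callback on a fixed-size thread pool and collects the ids whose task threw, so failed documents can be listed instead of disappearing silently.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.function.Consumer;

public class ParallelIndexer {

    // Runs indexOne for every id on a fixed-size pool and returns the ids
    // whose task threw a RuntimeException. indexOne is a placeholder for
    // the per-document work (read file, encode, send to Elasticsearch).
    public static <T> List<T> runAll(List<T> ids, Consumer<T> indexOne, int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        Queue<T> failed = new ConcurrentLinkedQueue<>(); // thread-safe failure log

        for (T id : ids) {
            pool.execute(() -> {
                try {
                    indexOne.accept(id);
                } catch (RuntimeException e) {
                    failed.add(id);
                }
            });
        }

        pool.shutdown(); // no new tasks; already submitted tasks keep running
        try {
            if (!pool.awaitTermination(1, TimeUnit.HOURS)) {
                pool.shutdownNow();
            }
        } catch (InterruptedException e) {
            pool.shutdownNow();
            Thread.currentThread().interrupt();
        }
        return new ArrayList<>(failed);
    }

    public static void main(String[] args) {
        List<Integer> ids = Arrays.asList(1, 2, 3, 4, 5);

        // Simulated work: id 3 fails, the others succeed.
        List<Integer> failed = runAll(ids, id -> {
            if (id == 3) {
                throw new IllegalStateException("simulated indexing failure");
            }
        }, 4);

        System.out.println("Failed ids: " + failed);
    }
}
```

I do not know whether this is the right approach for 100k PDFs, or whether sharing one client across the worker threads is appropriate, which is part of what I am asking.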