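I have to use the Elasticsearch ingest attachment plugin to implement full-text search in PDF documents. When I try to search for the word someword in a PDF document, I get an empty hits array.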
//Code for creating pipeline
PUT _ingest/pipeline/attachment
{
  "description" : "Extract attachment information",
  "processors" : [
    {
      "attachment" : {
        "field" : "data",
        "indexed_chars" : -1
      }
    }
  ]
}
//Code for creating the index
PUT my_index/my_type/my_id?pipeline=attachment
{
  "filename" : "C:\\Users\\myname\\Desktop\\bh1.pdf",
  "title" : "Quick",
  "data": "e1xydGYxXGFuc2kNCkxvcmVtIGlwc3VtIGRvbG9yIHNpdCBhbWV0DQpccGFyIH0="
}
//Code for searching the word in pdf
GET /my_index/my_type/_search
{
  "query": {
    "match": {
      "data" : {
        "query" : "someword"
      }
    }
  }
}
Answer 0 (score: 2)
When you index the document with the second command, passing the Base64-encoded content, the stored document looks like this:
{
  "filename": "C:\\Users\\myname\\Desktop\\bh1.pdf",
  "data": "e1xydGYxXGFuc2kNCkxvcmVtIGlwc3VtIGRvbG9yIHNpdCBhbWV0DQpccGFyIH0=",
  "attachment": {
    "content_type": "application/rtf",
    "language": "ro",
    "content": "Lorem ipsum dolor sit amet",
    "content_length": 28
  },
  "title": "Quick"
}
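To see why the data field is a poor search target, it can help to decode the payload locally. The following is only a small sketch using Python's standard library; the variable names are illustrative and not part of the original post. It shows that the readable words exist only after decoding, which is the text the attachment processor extracts into attachment.content.

```python
import base64

# Payload copied from the "data" field of the indexed document above.
data = "e1xydGYxXGFuc2kNCkxvcmVtIGlwc3VtIGRvbG9yIHNpdCBhbWV0DQpccGFyIH0="

# Decode the Base64 string back into the raw bytes sent to the pipeline.
raw = base64.b64decode(data).decode("ascii")
print(repr(raw))

# The stored "data" string is opaque Base64 and does not contain the
# readable word, while the decoded content does.
print("lorem" in data.lower())
print("lorem" in raw.lower())
```

The decoded payload is a tiny RTF document, which matches the content_type reported in the attachment object.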
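So your query needs to target the attachment.content field rather than the data field (which only serves to send the raw content at indexing time).
Change your query to the following and it will work: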
POST /my_index/my_type/_search
{
  "query": {
    "match": {
      "attachment.content": {
        "query": "lorem"
      }
    }
  }
}
PS: use POST instead of GET when sending a payload.
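One related point: the attachment processor reads Base64 from the data field, so for a real PDF that field has to carry the file's encoded bytes, and the filename value is just metadata. Below is a minimal sketch of building such a request body in Python. The helper name build_document and the placeholder bytes are assumptions made for illustration; in practice the bytes would come from reading the PDF in binary mode.

```python
import base64
import json


def build_document(file_bytes: bytes, filename: str, title: str) -> str:
    """Return a JSON body whose "data" field holds the file as Base64.

    Hypothetical helper, not part of Elasticsearch or the original post.
    """
    encoded = base64.b64encode(file_bytes).decode("ascii")
    return json.dumps({"filename": filename, "title": title, "data": encoded})


# Placeholder bytes standing in for a real file; with an actual PDF you
# would use something like open(path, "rb").read() instead.
sample = b"%PDF-1.4 placeholder bytes, not a real PDF"
body = build_document(sample, "bh1.pdf", "Quick")
print(body)
```

The resulting string can be sent as the body of the PUT request with ?pipeline=attachment shown in the question.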