How do I index a PDF file with the Elasticsearch ingest-attachment plugin?

Time: 2017-02-08 10:01:17

Tags: elasticsearch full-text-search elasticsearch-plugin

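I need to implement full-text search over PDF documents using the Elasticsearch ingest-attachment plugin. When I search for the word someword in the PDF document, I get an empty hits array.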

//Code for creating pipeline

PUT _ingest/pipeline/attachment
{
    "description" : "Extract attachment information",
    "processors" : [
      {
        "attachment" : {
        "field" : "data",
        "indexed_chars" : -1
        }
      }
    ]
}

//Code for indexing the document through the pipeline

PUT my_index/my_type/my_id?pipeline=attachment
{
   "filename" : "C:\\Users\\myname\\Desktop\\bh1.pdf",
   "title" : "Quick",
   "data": "e1xydGYxXGFuc2kNCkxvcmVtIGlwc3VtIGRvbG9yIHNpdCBhbWV0DQpccGFyIH0="

}
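
The `data` field has to carry the Base64-encoded bytes of the file itself; the `filename` value is only a stored string and Elasticsearch does not read from that path. As a rough sketch of how the request body could be prepared, here is a minimal Python example using only the standard library. The helper names `encode_for_attachment` and `build_document` and the placeholder bytes are assumptions made for illustration, not part of the original question.

```python
import base64


def encode_for_attachment(raw_bytes: bytes) -> str:
    """Return Base64 text suitable for the `data` field of the document."""
    return base64.b64encode(raw_bytes).decode("ascii")


def build_document(path: str, title: str, raw_bytes: bytes) -> dict:
    """Assemble the JSON body used with PUT ...?pipeline=attachment."""
    return {
        "filename": path,  # informational only; the file is not read from here
        "title": title,
        "data": encode_for_attachment(raw_bytes),
    }


# Placeholder bytes (hypothetical). With a real file one would use:
#   with open(path, "rb") as handle:
#       raw_bytes = handle.read()
sample_bytes = b"placeholder file contents"
document = build_document("bh1.pdf", "Quick", sample_bytes)
```

Serializing `document` with `json.dumps` gives a body of the same shape as the request above.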

//Code for searching the word in pdf 

GET /my_index/my_type/_search
{
    "query": {
        "match": {
            "data" : {
                "query" : "someword"
            }
        }
    }
}

1 answer:

Answer 0 (score: 2)

When you index the document with the second command, passing the Base64-encoded content, the stored document looks like this:

        {
           "filename": "C:\\Users\\myname\\Desktop\\bh1.pdf",
           "data": "e1xydGYxXGFuc2kNCkxvcmVtIGlwc3VtIGRvbG9yIHNpdCBhbWV0DQpccGFyIH0=",
           "attachment": {
              "content_type": "application/rtf",
              "language": "ro",
              "content": "Lorem ipsum dolor sit amet",
              "content_length": 28
           },
           "title": "Quick"
        }

So your query needs to search the attachment.content field rather than the data field, which serves only to carry the raw content at indexing time.
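
It can help to decode the sample `data` value locally to see what was actually indexed. The Python sketch below (standard library only; the variable names are illustrative) shows that the payload from the question is a small RTF snippet containing "Lorem ipsum dolor sit amet", which matches the `application/rtf` content type above and explains why a search for someword returns no hits.

```python
import base64

# The Base64 string taken from the indexing request in the question.
data = "e1xydGYxXGFuc2kNCkxvcmVtIGlwc3VtIGRvbG9yIHNpdCBhbWV0DQpccGFyIH0="

raw = base64.b64decode(data)
text = raw.decode("ascii")

# RTF documents start with the "{\rtf1" control word.
is_rtf = text.startswith("{\\rtf1")
contains_someword = "someword" in text.lower()
contains_lorem = "lorem" in text.lower()

print(repr(text))
print("RTF payload:", is_rtf)
print("contains 'someword':", contains_someword)
print("contains 'lorem':", contains_lorem)
```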

Change your query to the following and it will work:

POST /my_index/my_type/_search
{
   "query": {
      "match": {
         "attachment.content": {         <---- change this
            "query": "lorem"
         }
      }
   }
}

PS: Use POST instead of GET when sending a payload.
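
For anyone issuing the search from a script rather than the console, the following is a minimal Python sketch that builds the same POST request with the standard library. The host `http://localhost:9200` and the helper name `build_search_request` are assumptions for illustration; the request is only constructed here, not sent.

```python
import json
import urllib.request


def build_search_request(base_url, index, doc_type, field, text):
    """Build a POST _search request with a match query on the given field."""
    body = {"query": {"match": {field: {"query": text}}}}
    return urllib.request.Request(
        url=f"{base_url}/{index}/{doc_type}/_search",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Assumed local node address; adjust to your own cluster.
request = build_search_request(
    "http://localhost:9200", "my_index", "my_type", "attachment.content", "lorem"
)

# To send it against a running cluster:
# with urllib.request.urlopen(request) as response:
#     hits = json.load(response)["hits"]["hits"]
```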