Question

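I am new to ElasticSearch. I have gone through a very basic tutorial on creating indexes, and I do understand the concept of indexing. I want ElasticSearch to search inside .PDF files. Based on my understanding of creating indexes, it seems I need to read the .PDF file and extract all the keywords for indexing. However, I do not understand what steps I need to follow. How do I read a .PDF file to extract the keywords?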

Answer 1

It seems that the elasticsearch-mapper-attachment plugin was deprecated in 5.0.0 (released October 26, 2016). The documentation suggests using the Ingest Attachment Processor Plugin as a replacement.

To install:

sudo bin/elasticsearch-plugin install ingest-attachment

For information on how to use the Ingest Attachment plugin, see How to index a pdf file in Elasticsearch 5.0.0 with ingest-attachment plugin?.
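As a rough illustration of what the ingest-attachment workflow involves, the sketch below builds the two request bodies: a pipeline containing an `attachment` processor, and a document carrying the Base64-encoded PDF. The pipeline name `attachment`, the index name `pdfs`, the field name `data`, the localhost URL, and all helper names are assumptions chosen for illustration. The URL shape also assumes an Elasticsearch version that exposes the `_doc` endpoint; on 5.x a mapping type name takes its place. Only the Python standard library is used.

```python
import base64
import json
import urllib.request

ES_URL = "http://localhost:9200"   # assumption: local single-node cluster
PIPELINE = "attachment"            # assumption: illustrative pipeline name
INDEX = "pdfs"                     # assumption: illustrative index name


def build_pipeline_body(field="data"):
    """Pipeline definition with one attachment processor reading `field`."""
    return {
        "description": "Extract text from Base64-encoded attachments",
        "processors": [{"attachment": {"field": field}}],
    }


def build_document_body(pdf_bytes, field="data"):
    """Document whose `field` holds the Base64-encoded file contents."""
    return {field: base64.b64encode(pdf_bytes).decode("ascii")}


def send_json(method, url, body):
    """Send a JSON request and return the decoded JSON response."""
    request = urllib.request.Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method=method,
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))


def index_pdf(path, doc_id):
    """Create (or overwrite) the pipeline, then index one PDF through it."""
    send_json("PUT", f"{ES_URL}/_ingest/pipeline/{PIPELINE}", build_pipeline_body())
    with open(path, "rb") as pdf_file:
        body = build_document_body(pdf_file.read())
    url = f"{ES_URL}/{INDEX}/_doc/{doc_id}?pipeline={PIPELINE}"
    return send_json("PUT", url, body)
```

With the plugin installed and a node running, something like `index_pdf("example.pdf", "1")` would send both requests. By default the processor stores the extracted text under `attachment.content`, so that is the field to query afterwards.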

Answer 2

You need to look at the elasticsearch-mapper-attachments plugin, as it will most likely help you achieve what you want.

Answer 3

Install the Elasticsearch mapper-attachment plugin and use code similar to the following:

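public String indexDocument(String filePath, DataDTO dto) {
    IndexResponse response = null;
    try {
        // The file is sent Base64-encoded in the "file" field; this assumes the
        // field is mapped with the plugin's attachment type.
        response = this.prepareIndexRequest("collectionName").setId(dto.getId())
                .setSource(jsonBuilder().startObject()
                .field("file", Base64.encodeFromFile(filePath))
                .endObject()).setRefresh(true).execute().actionGet();
    } catch (ElasticsearchException e) {
        // indexing failed; handle or log as appropriate
    } catch (IOException e) {
        // the file could not be read or the JSON could not be built
    }
    // response is still null if an exception was caught above
    return response != null ? response.getId() : null;
}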

Answer 4

As mentioned earlier, the elasticsearch-mapper-attachment plugin is deprecated; you can use the Ingest Attachment plugin instead:

https://www.elastic.co/guide/en/elasticsearch/plugins/current/ingest-attachment.html

Answer 5

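For my project, I also had to make my local .PDF files searchable. I achieved this as follows:

Extract the data from the .PDF file using Apache Tika. I used Apache Tika because it gives me the freedom to extract data from files with different extensions through the same pipeline.
Index the output of Apache Tika.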

Typically a document in my index looks like:

{ "filename": "FILENAME", "filebody": "Data extracted by Apache Tika" }

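There are many different solutions; as mentioned here, using the Elasticsearch mapper-attachment plugin is also a good one. I chose this approach because I wanted to work with large files and different extensions.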
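The two steps in this answer (extract with Tika, then index the output) might be sketched as follows. This is not the answerer's code: it assumes the third-party `tika` Python package (tika-python) with a Java runtime available, and the helper names `extract_with_tika` and `build_index_document` are hypothetical. The extractor is passed in as a parameter so the document-building step can be tried without Tika installed.

```python
import os


def extract_with_tika(path):
    """Extract plain text using tika-python (assumption: package installed,
    Java available so it can start its Tika server)."""
    from tika import parser  # imported lazily so the rest works without it
    parsed = parser.from_file(path)
    # "content" can be None when Tika finds no text in the file
    return parsed.get("content") or ""


def build_index_document(path, extract_text=extract_with_tika):
    """Build a document with the filename/filebody shape described above."""
    return {
        "filename": os.path.basename(path),
        "filebody": extract_text(path).strip(),
    }
```

The resulting dictionary can then be serialized with `json.dumps` and posted to an index like any other JSON document. Because Tika detects the file type itself, the same function should work for other extensions as well.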

Answer 6

I found the following code at Pdf to elastic search; it extracts text from a PDF and sends it to Elasticsearch.

import PyPDF2
import re
import requests
import json
import os
from datetime import date

class ElasticModel:

    name = ""
    msg = ""

    def toJSON(self):
        return json.dumps(self, default=lambda o: o.__dict__, 
            sort_keys=True, indent=4)

def __readPDF__(path):
    # open the pdf in binary mode; the context manager closes it afterwards
    with open(path, 'rb') as pdfFileObj:
        # pdf reader object (PyPDF2 3.x API; older releases used
        # PdfFileReader, numPages, getPage and extractText)
        pdfReader = PyPDF2.PdfReader(pdfFileObj)
        # number of pages in pdf
        print(len(pdfReader.pages))
        # first page only; loop over pdfReader.pages to read the whole file
        pageObj = pdfReader.pages[0]
        # extracting text from page
        line = pageObj.extract_text()
    line = line.replace("\n", "")
    print(line)
    return line


#line = pageObj.extractText()

def __prepareElasticModel__(line, name):
    eModel = ElasticModel()

    eModel.name = name
    eModel.msg = line
    return eModel


def __sendToElasticSearch__(elasticModel):
    # use the function argument rather than the global eModel
    print("Name : " + str(elasticModel.name))

############################################
####  #CHANGE INDEX NAME IF NEEDED
#############################################
    index = "samplepdf"

    url = "http://localhost:9200/" + index +"/_doc?pretty"
    data = elasticModel.toJSON()
    #data = serialize(eModel)
    response = requests.post(url, data=data,headers={
                    'Content-Type':'application/json',
                    'Accept-Language':'en'

                })
    print("Url : " + url)
    print("Data : " + str(data))

    # print the request that was sent, not the requests module itself
    print("Request : " + str(response.request))
    print("Response : " + str(response))


#################################
#Change pdf dir path
###################################
pdfdir = "C:/Users/abhis/Desktop/TemplatesPDF/SamplePdf"

listFiles = os.listdir(pdfdir)
for file in listFiles :
    path = pdfdir + "/" + file
    print(path)

    line = __readPDF__(path)
    eModel = __prepareElasticModel__(line, file)
    __sendToElasticSearch__(eModel)

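The code above extracts text from the sample PDF.

From the sample PDF above, a few fields (such as Name and Msg) are extracted using regular expressions and inserted into Elasticsearch. Hope this helps.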
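The posted script does not actually show the regular-expression step, and the sample PDF is not included here, so the following is only a guess at what such an extraction could look like. It assumes, purely for illustration, that the extracted text has the form "Name: ... Msg: ..."; the pattern and the helper name `extract_fields` are hypothetical.

```python
import re

# Assumption: the extracted text looks like "Name: <value> Msg: <value>".
# \s* tolerates missing whitespace, since the script above strips newlines.
FIELD_PATTERN = re.compile(
    r"Name\s*:\s*(?P<name>.*?)\s*Msg\s*:\s*(?P<msg>.*)",
    re.DOTALL,
)


def extract_fields(text):
    """Return {"name": ..., "msg": ...}, or None if the layout does not match."""
    match = FIELD_PATTERN.search(text)
    if match is None:
        return None
    return {
        "name": match.group("name").strip(),
        "msg": match.group("msg").strip(),
    }
```

The returned dictionary has the same two keys as `ElasticModel`, so its values could be assigned to `eModel.name` and `eModel.msg` before posting.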

How to index a .PDF file in ElasticSearch
