Extracting data from documents stored in HDFS into an Elasticsearch index

Time: 2016-04-05 07:18:32

Tags: hadoop elasticsearch full-text-search elasticsearch-hadoop

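I have an HDFS archive that stores a variety of documents, such as PDF, MS Word files, PPT, CSV, and so on. I want to build a platform with Elasticsearch for searching these files or their text content. I know I can use the es-hadoop plugin to index data from HDFS into ES. What I would like to know is the best way to extract the text data from documents stored in HDFS and index it.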

Any help would be greatly appreciated.

2 answers:

Answer 0: (score: 2)

You can use the Elasticsearch mapper attachments plugin. It uses Apache Tika to ingest almost every well-known document type and make the content searchable in Elasticsearch. Hope that helps.

Answer 1: (score: 1)

I did a lot of searching, and this is the list of approaches I have found so far.

Here is the full integrations/plugins page: https://www.elastic.co/guide/en/elasticsearch/plugins/master/integrations.html

Here is the new replacement for mapper attachments, the Ingest Attachment plugin: https://www.elastic.co/guide/en/elasticsearch/plugins/current/ingest-attachment.html
A post on how to use it: https://qbox.io/blog/index-attachments-files-elasticsearch-mapper
A discussion of the pros and cons of Ingest vs. FS crawler (dadoonet is an Elastic developer): https://discuss.elastic.co/t/mapper-attachment-plugin-vs-pre-parsing-and-extracting-content-from-binary-files/73764/10
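To make the Ingest Attachment approach concrete, here is a minimal sketch of the two request bodies involved: a pipeline with an `attachment` processor, and a document whose binary content is base64-encoded. The pipeline name `attachment`, the field name `data`, the `filename` field, the index name, and the `http://localhost:9200` address are all assumptions for illustration. The `_doc` path segment fits newer Elasticsearch versions; older ones expect a custom type name there instead.

```python
import base64
import json
import urllib.request

ES_URL = "http://localhost:9200"  # assumption: a local Elasticsearch node


def build_pipeline_body(field="data"):
    """Body for PUT _ingest/pipeline/<name>: one attachment processor."""
    return {
        "description": "Extract text from base64-encoded files",
        "processors": [{"attachment": {"field": field}}],
    }


def build_document_body(raw_bytes, filename, field="data"):
    """Body for indexing one file; the processor expects base64 text."""
    return {
        "filename": filename,  # hypothetical extra field, handy for search
        field: base64.b64encode(raw_bytes).decode("ascii"),
    }


def put_json(url, body):
    """Send a JSON body with HTTP PUT and return the decoded JSON reply."""
    request = urllib.request.Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="PUT",
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))


def index_file(raw_bytes, filename, doc_id, index="documents"):
    """Create the pipeline (idempotent), then index one file through it."""
    put_json(ES_URL + "/_ingest/pipeline/attachment", build_pipeline_body())
    url = "{}/{}/_doc/{}?pipeline=attachment".format(ES_URL, index, doc_id)
    return put_json(url, build_document_body(raw_bytes, filename))
```

With the plugin installed, calling `index_file(file_bytes, "report.pdf", "1")` should leave the extracted text under `attachment.content` in the stored document. The file bytes would first have to be read out of HDFS by whatever client you already use.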

Here is the File System Crawler (FS crawler) plugin: https://github.com/dadoonet/fscrawler

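Here is the Ambar document search system; they have a community GitHub repository with open-source code:
https://ambar.cloud/
https://github.com/RD17/ambar
https://blog.ambar.cloud/ingesting-documents-pdf-word-txt-etc-into-elasticsearch/
They seem to use two types of database server (MongoDB and Redis); I am not sure why.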

Here is Apache Tika, which both Ingest and Ambar use (it also provides OCR through Tesseract, which I have heard Ingest does not support): http://tika.apache.org/1.16/
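If you would rather pre-extract the text yourself before indexing, one possible route is to read the file over the WebHDFS REST API and send the bytes to a standalone Tika server. The sketch below assumes WebHDFS is enabled on the NameNode and that a Tika server is listening on its default port 9998. The host names and the helper function names are hypothetical, and a secured cluster would need authentication that is not shown here.

```python
import urllib.parse
import urllib.request


def webhdfs_open_url(namenode, hdfs_path, user=None):
    """Build the WebHDFS URL that reads a file (op=OPEN)."""
    query = {"op": "OPEN"}
    if user:
        query["user.name"] = user  # simple (non-Kerberos) auth only
    return "{}/webhdfs/v1{}?{}".format(
        namenode.rstrip("/"),
        urllib.parse.quote(hdfs_path),
        urllib.parse.urlencode(query),
    )


def tika_text_request(tika_url, raw_bytes):
    """Build a PUT request asking the Tika server for plain text."""
    return urllib.request.Request(
        tika_url.rstrip("/") + "/tika",
        data=raw_bytes,
        headers={"Accept": "text/plain"},
        method="PUT",
    )


def extract_text(namenode, hdfs_path, tika_url="http://localhost:9998"):
    """Fetch one file from HDFS and return the text Tika extracts."""
    # The NameNode redirects the read to a DataNode; urllib follows it.
    with urllib.request.urlopen(webhdfs_open_url(namenode, hdfs_path)) as resp:
        raw_bytes = resp.read()
    with urllib.request.urlopen(tika_text_request(tika_url, raw_bytes)) as resp:
        return resp.read().decode("utf-8")
```

The returned string can then be indexed as an ordinary text field, which keeps the Tika work (and any OCR) outside the Elasticsearch nodes.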

Also, although Ingest uses Tika, it supports only a subset of the file types: https://discuss.elastic.co/t/full-list-of-supported-document-formats-by-es/81149

I hope the above saves other developers some time. If people find more options, please post them in the comments below.

Thanks!