ElasticSearch 5.5.0: Finding related documents

Time: 2017-08-16 08:52:59

Tags: elasticsearch

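In ElasticSearch 5.5.0, I went through the "more_like_this" clause but could not figure out how to find related documents. I have the data below in ElasticSearch; the "description" field holds huge non-indexed data of more than 1 million bytes. I have about ten thousand documents like the one below. How can I find sets of documents that match each other by at least 80%?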

{
    "_index": "school",
    "_type": "book",
    "_id": "1",
    "_source": {
      "title": "How to drive safely",
      "description": "LOTS OF WORDS...The book is written to help readers about giving driving safety guidelines. Lorem Ipsum is simply dummy text of the printing and typesetting industry. Lorem Ipsum has been the industry's standard dummy text ever since the 1500s, when an unknown printer took a galley of type and scrambled it to make a type specimen book. It has survived not only five centuries, but also the leap into electronic typesetting, remaining essentially unchanged. It was popularised in the 1960s with the release of Letraset sheets containing Lorem Ipsum passages, and more recently with desktop publishing software like Aldus PageMaker including versions of Lorem Ipsum. Lorem Ipsum is simply dummy text of the printing and typesetting industry. Lorem Ipsum has been the industry's standard dummy text ever since the 1500s, when an unknown printer took a galley of type and scrambled it to make a type specimen book. It has survived not only five centuries, but also the leap into electronic typesetting, remaining essentially unchanged. It was popularised in the 1960s with the release of Letraset sheets containing Lorem Ipsum passages, and more recently with desktop publishing software like Aldus PageMaker including versions of Lorem Ipsum. LONG...."
    }
}

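In the end, I am looking for lists of document IDs whose content matches by at least 80%. A possible expected result containing the matching document IDs (any format is fine):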

[ [1,30, 500, 8000], [2, 40, 199], .... ]

Do I need to write a batch job that compares each document with every other document and builds the output set?

Please help.

1 answer:

Answer 0: (score: 2)

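The more like this query has a parameter called minimum_should_match, which can be set to 80%. However, the max_query_terms parameter also needs to be taken into account here, because it caps how many terms are selected from the input document to build the query.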
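As a rough sketch of what such a request body could look like, here it is built as a plain Python dict. The helper name `build_mlt_query` is made up for illustration, the index, type and ID are taken from the sample document in the question, and the value 25 for `max_query_terms` is only a placeholder that would likely need tuning for documents this large. Sending the body to the cluster is left out.

```python
import json


def build_mlt_query(index, doc_type, doc_id, fields=("description",),
                    minimum_should_match="80%", max_query_terms=25):
    """Build a more_like_this search body for one source document.

    Hypothetical helper: it only assembles the JSON body and does not
    talk to Elasticsearch.
    """
    return {
        "query": {
            "more_like_this": {
                # Fields whose terms are compared; they must be indexed.
                "fields": list(fields),
                # Use an already indexed document as the "like" input.
                "like": [
                    {"_index": index, "_type": doc_type, "_id": doc_id}
                ],
                # Share of the selected terms that a hit must contain.
                "minimum_should_match": minimum_should_match,
                # Upper bound on the number of terms selected for the query.
                "max_query_terms": max_query_terms,
            }
        }
    }


body = build_mlt_query("school", "book", "1")
print(json.dumps(body, indent=2))
```

Note that the 80% here applies to the terms selected for the query (at most `max_query_terms` of them), not to the full text of the document, so it is only an approximation of "80% of the content matches".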

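Most importantly, this only works if the content of those fields is indexed. Since the question describes the "description" field as non-indexed, it would have to be indexed first.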

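Also, doing this at query time sounds like a very slow operation. You may want to rethink your strategy here and cluster/group the documents at index time (something you would need to do yourself, since this is a very custom thing), so that searching becomes fast.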
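To give an idea of what such an offline grouping step might look like, here is a minimal sketch. It is not something Elasticsearch provides, and it rests on assumptions: "80% match" is interpreted as a Jaccard similarity of at least 0.8 between sets of three-word shingles, and the names `shingles`, `jaccard` and `group_similar` are made up for illustration. It compares every pair of documents, so it is essentially the batch job the question asks about.

```python
from itertools import combinations


def shingles(text, size=3):
    """Return the set of overlapping word n-grams (shingles) in text."""
    words = text.lower().split()
    if len(words) < size:
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + size]) for i in range(len(words) - size + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets; 0.0 when both are empty."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def group_similar(docs, threshold=0.8, size=3):
    """Group document IDs whose shingle sets are at least `threshold` similar.

    docs maps document ID -> text. Groups are connected components,
    built with a small union-find over all pairs above the threshold.
    """
    sets = {doc_id: shingles(text, size) for doc_id, text in docs.items()}
    parent = {doc_id: doc_id for doc_id in docs}

    def find(x):
        # Follow parent links to the root, shortening the path on the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in combinations(docs, 2):
        if jaccard(sets[a], sets[b]) >= threshold:
            parent[find(a)] = find(b)

    groups = {}
    for doc_id in docs:
        groups.setdefault(find(doc_id), []).append(doc_id)
    # Only report groups with more than one member.
    return [sorted(group) for group in groups.values() if len(group) > 1]


docs = {
    1: "the book gives driving safety guidelines to new drivers",
    2: "the book gives driving safety guidelines to new drivers",
    3: "a completely different text about cooking pasta at home",
}
print(group_similar(docs))
```

Because groups are connected components, two documents can end up in the same group through a chain of similar neighbours without being 80% similar to each other directly. The resulting group ID could then be stored on each document at index time so that lookups become a simple term filter.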