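Elasticsearch substring matching that preserves word order

Date: 2015-02-04 15:43:18

Tags: python elasticsearch

I have been working with ES for a little over a month. I am looking for information on substring matching that preserves word position.

Suppose I have indexed documents into Elasticsearch: two documents, with ids id1 and id2, each with a "doc_field".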

id1: " Once when a big Lion was asleep, a little Mouse began running up and down upon him. "
id2: " The mouse is very little"

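I don't know whether I should index the field as "not_analyzed" or "analyzed".

What I would like to know is whether the following query will give me the correct matches.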

query = {"query": {
    "match": {"document": {"query": "little mouse", "operator": "and"}}
}}

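I want it to return only the documents that contain "little mouse". It should not return documents that have "little" or "mouse" in separate places. Simply put, the order of the words in the query should be preserved. Please help.

1 Answer:

Answer 0: (score: 0)

Take a look at the shingle token filter (documentation). It is very similar to ngram, but works on tokens rather than characters.

With the default settings, it emits two-word tokens (shingles) in addition to the original single-word tokens. You can check its behavior with the _analyze API: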

POST _analyze?tokenizer=whitespace&filters=shingle&text=The mouse is very little

This outputs:

{
   "tokens": [
      {
         "token": "The",
         "start_offset": 0,
         "end_offset": 3,
         "type": "word",
         "position": 1
      },
      {
         "token": "The mouse",
         "start_offset": 0,
         "end_offset": 9,
         "type": "shingle",
         "position": 1
      },
      {
         "token": "mouse",
         "start_offset": 4,
         "end_offset": 9,
         "type": "word",
         "position": 2
      },
      {
         "token": "mouse is",
         "start_offset": 4,
         "end_offset": 12,
         "type": "shingle",
         "position": 2
      },
      {
         "token": "is",
         "start_offset": 10,
         "end_offset": 12,
         "type": "word",
         "position": 3
      },
      {
         "token": "is very",
         "start_offset": 10,
         "end_offset": 17,
         "type": "shingle",
         "position": 3
      },
      {
         "token": "very",
         "start_offset": 13,
         "end_offset": 17,
         "type": "word",
         "position": 4
      },
      {
         "token": "very little",
         "start_offset": 13,
         "end_offset": 24,
         "type": "shingle",
         "position": 4
      },
      {
         "token": "little",
         "start_offset": 18,
         "end_offset": 24,
         "type": "word",
         "position": 5
      }
   ]
}
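
To make the output above easier to reason about, here is a minimal pure-Python sketch that approximates a whitespace tokenizer followed by a two-word shingle filter. The function name `shingles` and its parameters are illustrative only and are not part of Elasticsearch; offsets and positions are left out.

```python
def shingles(text, size=2, output_unigrams=True):
    """Approximate a whitespace tokenizer followed by a shingle filter."""
    words = text.split()
    tokens = []
    for i, word in enumerate(words):
        if output_unigrams:
            # Keep the original single-word token.
            tokens.append(word)
        if i + size <= len(words):
            # Join this word with the following ones into one shingle.
            tokens.append(" ".join(words[i:i + size]))
    return tokens


tokens = shingles("The mouse is very little")
print(tokens)
```

The list it prints has the same tokens, in the same order, as the _analyze response above. No "little mouse" shingle appears for this document, because "little" is its last word.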

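Then, by querying this field, you will see the difference between the two sample documents.

You can find a detailed explanation of proximity search in this section of the Definitive Guide.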
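
As a rough sketch of what querying such a field might look like from Python, the dictionaries below define a custom analyzer and a match query. The names `bigram_filter`, `shingle_analyzer` and `doc_field.shingles` are hypothetical, and the field is assumed to be mapped with this analyzer (the mapping syntax depends on the Elasticsearch version and is not shown). Two choices here differ from the _analyze example: a lowercase filter is added so that "little Mouse" in id1 can match a lowercase query, and `output_unigrams` is turned off so that the query text should analyze to the single token "little mouse".

```python
import json

# Hypothetical names: "bigram_filter", "shingle_analyzer", "doc_field.shingles".
index_settings = {
    "settings": {
        "analysis": {
            "filter": {
                "bigram_filter": {
                    "type": "shingle",
                    "min_shingle_size": 2,
                    "max_shingle_size": 2,
                    # Emit only two-word shingles, no single words.
                    "output_unigrams": False,
                }
            },
            "analyzer": {
                "shingle_analyzer": {
                    "type": "custom",
                    "tokenizer": "standard",
                    "filter": ["lowercase", "bigram_filter"],
                }
            },
        }
    }
}

# Assumes the queried field is analyzed with "shingle_analyzer".
query = {"query": {"match": {"doc_field.shingles": "little mouse"}}}

print(json.dumps(index_settings, indent=2))
print(json.dumps(query, indent=2))
```

Under these assumptions, id1 contains the shingle "little mouse" while id2 only contains "the mouse", "mouse is", "is very" and "very little", so only id1 should match.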
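
One form of proximity search is the `match_phrase` query, which needs an analyzed field so that term positions are available. The sketch below builds such a query body; the helper `phrase_query` is hypothetical, and `doc_field` is assumed to be analyzed with an analyzer that lowercases terms, such as the standard analyzer.

```python
import json


def phrase_query(field, text, slop=0):
    """Build a match_phrase query body.

    With slop=0 the terms must be adjacent and in the given order.
    """
    return {"query": {"match_phrase": {field: {"query": text, "slop": slop}}}}


body = phrase_query("doc_field", "little mouse")
print(json.dumps(body, indent=2))
```

With these assumptions, the phrase should match id1, where "little" is directly followed by "Mouse", and not id2, where the two words are in the opposite order and separated by other terms.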