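I have been working with ES for a little over a month. I am looking for information on substring matching that preserves word positions.
Suppose I have indexed two documents into Elasticsearch, id1 and id2, each with a field "doc_field":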
id1: " Once when a big Lion was asleep, a little Mouse began running up and down upon him. "
id2: " The mouse is very little"
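I don't know whether I should index the field as "not_analyzed" or "analyzed".
What I would like to know is whether the following query will give me the correct matches: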
query = { "query": {
"match": { "doc_field": { "query": "little mouse", "operator": "and" } } } }
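I want it to return only those documents that contain "little mouse". It should not return documents that have "little" or "mouse" in separate places. Simply put, the order of the words in the query should be preserved. Any help is appreciated.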
Answer 0 (score: 0)
Take a look at the shingle TokenFilter (documentation). It is very similar to ngram, but it works on tokens instead of characters.
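With the default settings, it produces two-word tokens alongside the original single-word tokens. You can check its behavior with the _analyze API: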
POST _analyze?tokenizer=whitespace&filters=shingle&text=The mouse is very little
This outputs:
{
"tokens": [
{
"token": "The",
"start_offset": 0,
"end_offset": 3,
"type": "word",
"position": 1
},
{
"token": "The mouse",
"start_offset": 0,
"end_offset": 9,
"type": "shingle",
"position": 1
},
{
"token": "mouse",
"start_offset": 4,
"end_offset": 9,
"type": "word",
"position": 2
},
{
"token": "mouse is",
"start_offset": 4,
"end_offset": 12,
"type": "shingle",
"position": 2
},
{
"token": "is",
"start_offset": 10,
"end_offset": 12,
"type": "word",
"position": 3
},
{
"token": "is very",
"start_offset": 10,
"end_offset": 17,
"type": "shingle",
"position": 3
},
{
"token": "very",
"start_offset": 13,
"end_offset": 17,
"type": "word",
"position": 4
},
{
"token": "very little",
"start_offset": 13,
"end_offset": 24,
"type": "shingle",
"position": 4
},
{
"token": "little",
"start_offset": 18,
"end_offset": 24,
"type": "word",
"position": 5
}
]
}
Then, by querying this field, you will see the difference between your two example documents.
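To make that difference concrete, here is a minimal local sketch in Python that imitates two-word shingling on the two example documents. It does not call Elasticsearch: the helper `shingles` is hypothetical, and lowercasing plus whitespace splitting is only an assumed stand-in for whatever analyzer the field actually uses.

```python
def shingles(text, size=2):
    """Return the word-level shingles (token n-grams) of `text`."""
    # Assumption: lowercasing + whitespace splitting stands in for the analyzer.
    # Punctuation stays attached to words, as with a whitespace tokenizer.
    tokens = text.lower().split()
    return [" ".join(tokens[i:i + size]) for i in range(len(tokens) - size + 1)]


docs = {
    "id1": "Once when a big Lion was asleep, a little Mouse began running up and down upon him.",
    "id2": "The mouse is very little",
}

query = "little mouse"
# A document matches only if the query appears as one adjacent, ordered pair.
matches = [doc_id for doc_id, text in docs.items() if query in shingles(text)]
print(matches)  # ['id1']
```

Only id1 contains the pair "little mouse" as a single shingle. id2 contains both words, but never next to each other in that order, so it yields no matching shingle.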
You can find a detailed explanation of proximity searching in this section of the Definitive Guide.
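As a hedged illustration of proximity searching, the sketch below builds a `match_phrase` request body in Python. The helper name `phrase_query` is hypothetical, and the field name "doc_field" is taken from the question. It assumes the field is analyzed in a way that lowercases terms, so that "Mouse" in id1 and "mouse" in the query produce the same term.

```python
import json


def phrase_query(field, phrase, slop=0):
    """Build a match_phrase request body (hypothetical helper).

    slop=0 requires the terms to be adjacent and in the given order;
    a larger slop tolerates that many position moves between them.
    """
    return {"query": {"match_phrase": {field: {"query": phrase, "slop": slop}}}}


body = phrase_query("doc_field", "little mouse")
print(json.dumps(body, indent=2))
```

The printed JSON can be sent as the body of a search request. Unlike a `match` query with the `and` operator, which only requires both terms to be present somewhere in the field, a phrase query also checks their relative positions.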