Question

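Is it possible in ElasticSearch to form a query that preserves the ordering of the terms?

A simple example would be these documents, indexed using the standard analyzer: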

You know for search
You know search
Know search you

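I could query for +you +search, and this would return all of the documents, including the third one.

What if I only want to retrieve the documents that have the terms in this specific order? Can I form a query that does that for me?

Considering that this is possible for phrases by simply quoting the text, e.g. "you know" (which retrieves the first and second documents), it feels like there should be a way to preserve the order of multiple terms that are not adjacent.

In the simple example above I could use proximity searches, but that does not cover more complex cases.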

Answer 1

You can use a span_near query, which has an in_order parameter.

{
    "query": {
        "span_near": {
            "clauses": [
                {
                    "span_term": {
                        "field": "you"
                    }
                },
                {
                    "span_term": {
                        "field": "search"
                    }
                }
            ],
            "slop": 2,
            "in_order": true
        }
    }
}
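As a rough sketch, the same request body can be generated for any number of terms. The helper name `build_ordered_span_query` and the field name `title` used below are hypothetical. The key inside each `span_term` is the name of the field being searched, and because `span_term` is not analyzed, the terms are lowercased here on the assumption that the standard analyzer lowercased them at index time.

```python
import json

def build_ordered_span_query(field, terms, slop):
    """Build a span_near request body that requires `terms` to appear in order.

    Hypothetical helper: `field` is the name of the indexed text field.
    Terms are lowercased because span_term does not analyze its input
    (assumes the standard analyzer was used at index time).
    """
    clauses = [{"span_term": {field: term.lower()}} for term in terms]
    return {
        "query": {
            "span_near": {
                "clauses": clauses,
                "slop": slop,      # max number of intervening unmatched positions
                "in_order": True,  # clauses must match in the order given
            }
        }
    }

body = build_ordered_span_query("title", ["You", "search"], slop=2)
print(json.dumps(body, indent=2))
```

The printed JSON can be sent as the body of a `_search` request. With `in_order` set to true, a larger `slop` only widens the allowed gap between the terms; it does not allow them to swap places.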

Answer 2

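Phrase matching doesn't ensure order ;-). If you specify enough slop, 2 for example, "hello world" will match "world hello". But this is not necessarily a bad thing, because when two terms are "close" to each other, their order usually doesn't matter much. And I don't think the authors of this feature had in mind matching words that are 1000 words apart.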
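To see why a slop of 2 is enough to match the reversed phrase, here is a simplified model of sloppy phrase matching. It is only an illustration under stated assumptions, not Lucene's actual implementation: documents and queries are plain token lists, the function name `sloppy_phrase_matches` is made up, and queries that repeat the same term are not handled carefully.

```python
from itertools import product

def sloppy_phrase_matches(doc_tokens, query_tokens, slop):
    """Simplified model of a sloppy phrase match (not Lucene's actual code).

    Each query term's document position is shifted back by the term's
    offset in the query. The phrase matches if the shifted positions of
    all terms lie within `slop` of each other.
    """
    positions = []
    for term in query_tokens:
        found = [i for i, tok in enumerate(doc_tokens) if tok == term]
        if not found:
            return False  # every query term must be present
        positions.append(found)

    # Try every combination of occurrences, one per query term.
    for combo in product(*positions):
        shifted = [pos - offset for offset, pos in enumerate(combo)]
        if max(shifted) - min(shifted) <= slop:
            return True
    return False

query = ["hello", "world"]
print(sloppy_phrase_matches(["hello", "world"], query, 0))  # exact phrase
print(sloppy_phrase_matches(["world", "hello"], query, 1))  # reversed, slop 1
print(sloppy_phrase_matches(["world", "hello"], query, 2))  # reversed, slop 2
```

In this model, swapping two adjacent terms costs two position moves, which is why the reversed document is rejected with a slop of 1 but accepted with a slop of 2.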

There is one solution I could find that keeps the order, but it is not simple: using scripts. Here is an example:

POST /my_index/my_type/_bulk
{ "index": { "_id": 1 }}
{ "title": "hello world" }
{ "index": { "_id": 2 }}
{ "title": "world hello" }
{ "index": { "_id": 3 }}
{ "title": "hello term1 term2 term3 term4 world" }

POST my_index/_search
{
  "query": {
    "filtered": {
      "query": {
        "match": {
          "title": {
            "query": "hello world",
            "slop": 5,
            "type": "phrase"
          }
        }
      },
      "filter": {
        "script": {
          "script": "term1Pos=0;term2Pos=0;term1Info = _index['title'].get('hello',_POSITIONS);term2Info = _index['title'].get('world',_POSITIONS); for(pos in term1Info){term1Pos=pos.position;}; for(pos in term2Info){term2Pos=pos.position;}; return term1Pos<term2Pos;",
          "params": {}
        }
      }
    }
  }
}

To make the script itself more readable, I am rewriting it here with indentation:

term1Pos = 0;
term2Pos = 0;
term1Info = _index['title'].get('hello',_POSITIONS);
term2Info = _index['title'].get('world',_POSITIONS);
for(pos in term1Info) {
  term1Pos = pos.position;
}; 
for(pos in term2Info) {
  term2Pos = pos.position;
}; 
return term1Pos < term2Pos;

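The above is a query that searches for "hello world" with a slop of 5, which matches all of the documents above. But the script filter ensures that the position of the word "hello" in the document is lower than the position of the word "world". This way, no matter how much slop we set in the query, the fact that the positions come one after the other guarantees the order.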
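The filter's logic can be replayed outside Elasticsearch to check which of the three sample documents survive it. This is only a sketch: splitting on whitespace and lowercasing stands in for the real analyzer, and the function names are made up for illustration.

```python
def last_position(tokens, term):
    """Return the last position of `term` in `tokens`, or 0 if absent
    (mirroring the script, which initialises both positions to 0)."""
    pos = 0
    for i, tok in enumerate(tokens):
        if tok == term:
            pos = i
    return pos

def in_order(text, first, second):
    # Assumption: whitespace split + lowercase approximates the analyzer.
    tokens = text.lower().split()
    return last_position(tokens, first) < last_position(tokens, second)

docs = [
    "hello world",
    "world hello",
    "hello term1 term2 term3 term4 world",
]
kept = [d for d in docs if in_order(d, "hello", "world")]
print(kept)
```

Like the script, this compares only the last occurrence of each term, so a document in which the terms repeat (for example "world hello world") would also pass the check.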

This is the section in the documentation that sheds some light on the things used in the script above.

Answer 3

This is exactly what a match_phrase query does (see here).

It checks the positions of the terms, on top of their presence.

For example, these documents:

POST test/values
{
  "test": "Hello World"
}

POST test/values
{
  "test": "Hello nice World"
}

POST test/values
{
  "test": "World, I don't say hello"
}

will all be found using a basic match query:

POST test/_search
{
  "query": {
    "match": {
      "test": "Hello World"
    }
  }
}

But using match_phrase, only the first document is returned:

POST test/_search
{
  "query": {
    "match_phrase": {
      "test": "Hello World"
    }
  }
}

{
   ...
   "hits": {
      "total": 1,
      "max_score": 2.3953633,
      "hits": [
         {
            "_index": "test",
            "_type": "values",
            "_id": "qFZAKYOTQh2AuqplLQdHcA",
            "_score": 2.3953633,
            "_source": {
               "test": "Hello World"
            }
         }
      ]
   }
}

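In your case, you want to accept some distance between your terms. This can be achieved with the slop parameter, which indicates how far apart you allow your terms to be: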

POST test/_search
{
  "query": {
    "match": {
      "test": {
        "query": "Hello world",
        "slop":1,
        "type": "phrase"
      }
    }
  }
}

With this last request, you find the second document too:

{
   ...
   "hits": {
      "total": 2,
      "max_score": 0.38356602,
      "hits": [
         {
            "_index": "test",
            "_type": "values",
            "_id": "7mhBJgm5QaO2_aXOrTB_BA",
            "_score": 0.38356602,
            "_source": {
               "test": "Hello World"
            }
         },
         {
            "_index": "test",
            "_type": "values",
            "_id": "VKdUJSZFQNCFrxKk_hWz4A",
            "_score": 0.2169777,
            "_source": {
               "test": "Hello nice World"
            }
         }
      ]
   }
}
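As far as I know, the sloppy request can also be written directly as a match_phrase query with a slop parameter, instead of a match query with "type": "phrase". A minimal sketch of building that body follows; the helper name `build_sloppy_phrase_query` is hypothetical, and the field name `test` is taken from the documents above.

```python
import json

def build_sloppy_phrase_query(field, phrase, slop=0):
    """Build a match_phrase request body with an optional slop.

    Hypothetical helper; `field` is the name of the indexed text field.
    """
    return {
        "query": {
            "match_phrase": {
                field: {
                    "query": phrase,
                    "slop": slop,  # how far apart the terms may be
                }
            }
        }
    }

body = build_sloppy_phrase_query("test", "Hello world", slop=1)
print(json.dumps(body, indent=2))
```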

You can find a whole chapter about this use case in the definitive guide.

Preserving term order in an ElasticSearch query

3 answers: