Question

This is my index:

PUT /my_index
{
    "settings": {
        "number_of_shards": 1,
        "number_of_replicas": 0,
        "analysis": {
            "filter": {
                "my_ascii_folding": {
                    "type" : "asciifolding",
                    "preserve_original": "true"
                }
            },
            "analyzer": {
                "include_special_character": {
                    "type":      "custom",
                    "filter": [
                        "lowercase",
                        "my_ascii_folding"
                    ],
                    "tokenizer": "whitespace"
                }
            }
        }
    }
}

This is my mapping:

PUT /my_index/_mapping/formulas
{
   "properties": {
      "content": {
         "type": "text",
         "analyzer": "include_special_character"
      }
   }
}

My sample data:

POST /_bulk
{"index":{"_index":"my_index","_type":"formulas"}}
{"content":"formula =IF(SUM(3;4;5))"}
{"index":{"_index":"my_index","_type":"formulas"}}
{"content":"some if words: dif difuse"}

With this query I want to return only the record with the formula ("formula =IF(SUM(3;4;5))"), but it returns both.

GET /my_index/_search
{
  "query": {
    "simple_query_string" : {
        "query": "if(",
        "analyzer": "include_special_character",
        "fields": ["_all"]
    }
  }
}

This query does not return the record with the formula.

GET /my_index/_search
{
  "query": {
    "simple_query_string" : {
        "query": "=if(",
        "analyzer": "include_special_character",
        "fields": ["_all"]
    }
  }
}

How can I fix both queries so that they return what I expect?

Thanks

Answer 1

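First off, thank you for including all the requests needed to get your dataset working locally. That makes it much easier to look into an answer.

There are some rather interesting things going on here. The first thing I want to point out is what actually happens to your queries when you use the _all field, because some of the behavior is subtle and easily causes confusion.

I'll rely on the _analyze endpoint to help show what is going on.

First, here is a request that shows how your text is analyzed against the "content" field: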

GET my_index/_analyze
{
  "analyzer": "include_special_character",
  "text": [
    "formula =IF(SUM(3;4;5))"
  ],
  "field": "content"
}

Result:

{
  "tokens": [
    {
      "token": "formula",
      "start_offset": 0,
      "end_offset": 7,
      "type": "word",
      "position": 0
    },
    {
      "token": "=if(sum(3;4;5))",
      "start_offset": 8,
      "end_offset": 23,
      "type": "word",
      "position": 1
    }
  ]
}
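
As a rough illustration of why the whole formula ends up as one token, here is a minimal Python sketch that mimics the analyzer chain from the question (whitespace tokenizer, then lowercase). The function name is hypothetical, and asciifolding is left out on the assumption that it changes nothing for pure-ASCII input like this; it is not how Elasticsearch implements analysis.

```python
def analyze_whitespace_lowercase(text):
    """Rough stand-in for the custom analyzer: whitespace tokenizer + lowercase filter."""
    # str.split() with no arguments splits on runs of whitespace only,
    # so punctuation such as '=', '(' and ';' stays attached to its token.
    return [token.lower() for token in text.split()]


print(analyze_whitespace_lowercase("formula =IF(SUM(3;4;5))"))
print(analyze_whitespace_lowercase("some if words: dif difuse"))
```

Because nothing but whitespace separates tokens, a query term such as "=if(" can never equal the indexed term "=if(sum(3;4;5))".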

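So far, so good. This is probably what you expected to see. If you want really verbose output about what is happening, add the following to the analyze request: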

explain: true

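Now, if you remove the "analyzer" value from that analyze request, the token output stays the same. That is because the analyzer we passed only overrides the one that would be chosen anyway with the same analyzer that is already set. Without it, the request falls back on the field being analyzed and the analyzer specified for it.

To prove this, I'll analyze against a field that has no mapping in the index you provided, specifying the analyzer in one request and leaving it out in the other.

In:

GET my_index/_analyze
{
  "analyzer": "include_special_character",
  "text": [
    "formula =IF(SUM(3;4;5))"
  ],
  "field": "test"
}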

Output:

{ "tokens": [ { "token": "formula", "start_offset": 0, "end_offset": 7, "type": "word", "position": 0 }, { "token": "=if(sum(3;4;5))", "start_offset": 8, "end_offset": 23, "type": "word", "position": 1 } ] }

Now without the analyzer specified. In:

GET my_index/_analyze
{
  "text": [
    "formula =IF(SUM(3;4;5))"
  ],
  "field": "test"
}

Output:

{ "tokens": [ { "token": "formula", "start_offset": 0, "end_offset": 7, "type": "<ALPHANUM>", "position": 0 }, { "token": "if", "start_offset": 9, "end_offset": 11, "type": "<ALPHANUM>", "position": 1 }, { "token": "sum", "start_offset": 12, "end_offset": 15, "type": "<ALPHANUM>", "position": 2 }, { "token": "3;4;5", "start_offset": 16, "end_offset": 21, "type": "<NUM>", "position": 3 } ] }

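In the second example, the request falls back to the default analyzer and interprets the input that way, because no field named "test" exists in any mapping.

Now for some information about the "_all" field and why you get unexpected results. According to the documentation, you should think of "_all" as a special field that, unless explicitly disabled, is always treated as a "text" field:

"The _all field is just a text field, and accepts the same parameters that other string fields accept, including analyzer, term_vectors, index_options, and store."

For completeness, here is how the other document is analyzed at index time.

In:

GET my_index/_analyze
{
  "analyzer": "include_special_character",
  "text": [
    "some if words: dif difuse"
  ],
  "field": "content"
}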

Output:

{ "tokens": [ { "token": "some", "start_offset": 0, "end_offset": 4, "type": "word", "position": 0 }, { "token": "if", "start_offset": 5, "end_offset": 7, "type": "word", "position": 1 }, { "token": "words:", "start_offset": 8, "end_offset": 14, "type": "word", "position": 2 }, { "token": "dif", "start_offset": 15, "end_offset": 18, "type": "word", "position": 3 }, { "token": "difuse", "start_offset": 19, "end_offset": 25, "type": "word", "position": 4 } ] }

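So we now understand why the analyzer behaves a certain way against an existing field, and we can logically treat "_all" as a field that is already mapped as text. When analyzing against "_all", the specified analyzer appears to be ignored, so the override shown above is not allowed. Hopefully the following result is now less surprising.

In:

GET my_index/_analyze
{
  "analyzer": "include_special_character",
  "text": [
    "=if("
  ],
  "field": "_all"
}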

Output:

{ "tokens": [ { "token": "if", "start_offset": 1, "end_offset": 3, "type": "<ALPHANUM>", "position": 0 } ] }

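In the example above, no matter which analyzer I specify, the "_all" field is treated as a mapped text field and therefore uses the analyzer associated with it.

Now, when you search the "_all" field, you get hits because both index-time and search-time analysis produce the token "if". When you use the _all field, your indexed terms and your query terms both go through the default analyzer rather than the one you specified, so the token "if" is present both in your documents' "_all" field and in your query text.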
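
The matching behaviour described above can be sketched in Python. This is only an illustration: `crude_standard_analyze` is a hypothetical helper that lowercases and keeps runs of ASCII letters and digits, which is a crude approximation of the default analyzer (for example, it splits "3;4;5" into three tokens where the real analyzer output earlier in this answer kept it as one). The document labels are invented as well.

```python
import re


def crude_standard_analyze(text):
    """Crude approximation of the default analyzer: lowercase, keep alphanumeric runs."""
    return re.findall(r"[a-z0-9]+", text.lower())


# Hypothetical labels for the two sample documents from the question.
docs = {
    "formula_doc": "formula =IF(SUM(3;4;5))",
    "words_doc": "some if words: dif difuse",
}


def matching_docs(query):
    """Return the documents that share at least one analyzed term with the query."""
    query_terms = set(crude_standard_analyze(query))
    return [
        name
        for name, content in docs.items()
        if query_terms & set(crude_standard_analyze(content))
    ]


print(crude_standard_analyze("if("))  # the punctuation is dropped, only "if" survives
print(matching_docs("if("))           # both documents contain the term "if"
```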

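The most interesting part to me is that "=if(" returns no hits. I would normally expect it to be entirely equivalent to "if" or "if(" in this scenario, since everything except the "if" part is thrown away by the default analyzer. As for why you get no hit where you would expect one, I believe it has to do with how the query string is parsed because of the "=" character. I tried to research what exactly the equals character does, but I have not seen any good documentation beyond it being part of the Lucene syntax. I don't think knowing what happens with the equals sign is important for your question, but it is definitely something I am curious about if someone here can shed light on it.

When I moved away from "simple_query_string", I did manage to see both results with either of the following queries...

With equals:

GET /my_index/_search
{
  "query": {
    "match": {
      "_all": "=if("
    }
  }
}

Without equals:

GET /my_index/_search
{
  "query": {
    "match": {
      "_all": "if("
    }
  }
}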

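So now, with all of the exploration above, here are some thoughts on potential approaches to your problem.

Here are the tokens of the document we want to get a hit on...

In: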

GET my_index/formulas/AV9GIDTggkgblFY6zpKT/_termvectors?fields=content

Output:

{ "_index": "my_index", "_type": "formulas", "_id": "AV9GIDTggkgblFY6zpKT", "_version": 1, "found": true, "took": 0, "term_vectors": { "content": { "field_statistics": { "sum_doc_freq": 7, "doc_count": 2, "sum_ttf": 7 }, "terms": { "=if(sum(3;4;5))": { "term_freq": 1, "tokens": [ { "position": 1, "start_offset": 8, "end_offset": 23 } ] }, "formula": { "term_freq": 1, "tokens": [ { "position": 0, "start_offset": 0, "end_offset": 7 } ] } } } } }

Because of the above, if we change your query from "_all" to "content", you will only get the document we are interested in through one of the two tokens in the response above. You will get a hit if you search for "=if(sum(3;4;5))" or for "formula". While this is getting more accurate, I don't think it accomplishes your goal.

Another approach I might consider, depending on requirements, is a keyword mapping. However, that would be even more restrictive than the example above, since each "content" field would have only a single token: its entire value. What I think fits your problem best is adding an n-gram tokenizer to your mapping.

Here is the series of requests I would use to solve this.

Index settings:

PUT /my_index2
{
    "settings": {
        "number_of_shards": 1,
        "number_of_replicas": 0,
        "analysis": {
            "filter": {
                "my_ascii_folding": {
                    "type": "asciifolding",
                    "preserve_original": "true"
                }
            },
            "analyzer": {
                "include_special_character_gram": {
                    "type": "custom",
                    "filter": [
                        "lowercase",
                        "my_ascii_folding"
                    ],
                    "tokenizer": "ngram_tokenizer"
                }
            },
            "tokenizer": {
                "ngram_tokenizer": {
                    "type": "ngram",
                    "min_gram": 2,
                    "max_gram": 5,
                    "token_chars": [
                        "letter",
                        "digit",
                        "punctuation",
                        "symbol"
                    ]
                }
            }
        }
    }
}

Mapping:

PUT /my_index2/_mapping/formulas
{
   "properties": {
      "content": {
         "type": "text",
         "analyzer": "include_special_character_gram"
      }
   }
}

Add documents:

POST /_bulk
{"index":{"_index":"my_index2","_type":"formulas"}}
{"content":"formula =IF(SUM(3;4;5))"}
{"index":{"_index":"my_index2","_type":"formulas"}}
{"content":"some if words: dif difuse"}

Term vectors for the first doc:

GET my_index2/formulas/AV9GZ3sSgkgblFY6zpK2/_termvectors?fields=content

Output:

{ "_index": "my_index2", "_type": "formulas", "_id": "AV9GZ3sSgkgblFY6zpK2", "_version": 1, "found": true, "took": 0, "term_vectors": { "content": { "field_statistics": { "sum_doc_freq": 102, "doc_count": 2, "sum_ttf": 106 }, "terms": { "(3": { "term_freq": 1, "tokens": [ { "position": 46, "start_offset": 15, "end_offset": 17 } ] }, "(3;": { "term_freq": 1, "tokens": [ { "position": 47, "start_offset": 15, "end_offset": 18 } ] }, ... Omitting the rest because of max response lengths. } } }

Now let's wrap this example up... Here is the query I used before that returned both entries, and it continues to do the same here.

In:

GET /my_index2/_search
{
  "query": {
    "match": {
      "content": "=if("
    }
  }
}

Output:

{ "took": 1, "timed_out": false, "_shards": { "total": 1, "successful": 1, "skipped": 0, "failed": 0 }, "hits": { "total": 2, "max_score": 2.9511943, "hits": [ { "_index": "my_index2", "_type": "formulas", "_id": "AV9GZ3sSgkgblFY6zpK2", "_score": 2.9511943, "_source": { "content": "formula =IF(SUM(3;4;5))" } }, { "_index": "my_index2", "_type": "formulas", "_id": "AV9GZ3sSgkgblFY6zpK3", "_score": 0.30116585, "_source": { "content": "some if words: dif difuse" } } ] } }

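So we see the same results, but why does that happen? In the query above, the same n-gram analyzer is now applied to the input text, which means both documents still have matching tokens!

In:

GET my_index2/_analyze
{
  "analyzer": "include_special_character_gram",
  "text": [
    "=if("
  ],
  "field": "t"
}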

Output:

{ "tokens": [ { "token": "=i", "start_offset": 0, "end_offset": 2, "type": "word", "position": 0 }, { "token": "=if", "start_offset": 0, "end_offset": 3, "type": "word", "position": 1 }, { "token": "=if(", "start_offset": 0, "end_offset": 4, "type": "word", "position": 2 }, { "token": "if", "start_offset": 1, "end_offset": 3, "type": "word", "position": 3 }, { "token": "if(", "start_offset": 1, "end_offset": 4, "type": "word", "position": 4 }, { "token": "f(", "start_offset": 2, "end_offset": 4, "type": "word", "position": 5 } ] }
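
To make the token list above less mysterious, here is a small Python sketch of n-gram generation with min_gram 2 and max_gram 5. The function is hypothetical and deliberately ignores the token_chars splitting of the real tokenizer, on the assumption that the input contains no whitespace; for "=if(" it should yield the same tokens in the same order as the output above.

```python
def ngrams(text, min_gram=2, max_gram=5):
    """Return every substring of length min_gram..max_gram, ordered by start offset."""
    grams = []
    for start in range(len(text)):
        for size in range(min_gram, max_gram + 1):
            end = start + size
            if end > len(text):
                break  # longer grams from this offset would run past the end
            grams.append(text[start:end])
    return grams


# The analyzer also lowercases; "=if(" is already lowercase, so this is a no-op here.
print(ngrams("=if(".lower()))
```

Note that the plain term "if" is among the generated grams, which is why the second document still matches when the query is analyzed this way.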

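If you run the request above, you will see the tokens that the query generates. The key is to now specify the query analyzer as "keyword", so that one of your indexed terms matches the entire query value. In other words, we query with a different analyzer than the one used to index the field.

In:

GET my_index2/_analyze
{
  "analyzer": "keyword",
  "text": [
    "=if("
  ]
}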

Output:

{ "tokens": [ { "token": "=if(", "start_offset": 0, "end_offset": 4, "type": "word", "position": 0 } ] }

Let's see if it works...

In:

GET /my_index2/_search
{
  "query": {
    "match": {
      "content": {
        "query": "=if(",
        "analyzer": "keyword"
      }
    }
  }
}

Output:

{ "took": 0, "timed_out": false, "_shards": { "total": 1, "successful": 1, "skipped": 0, "failed": 0 }, "hits": { "total": 1, "max_score": 0.56074005, "hits": [ { "_index": "my_index2", "_type": "formulas", "_id": "AV9GZ3sSgkgblFY6zpK2", "_score": 0.56074005, "_source": { "content": "formula =IF(SUM(3;4;5))" } } ] } }
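
The difference between the two search-time analyzers can be sketched in Python as follows. This is a simplified model, not Elasticsearch's implementation: the helper names and document labels are hypothetical, tokens are split on whitespace only (which I assume is equivalent to the configured token_chars for these two sample strings), and scoring is ignored.

```python
def ngrams(token, min_gram=2, max_gram=5):
    """All substrings of a token with length between min_gram and max_gram."""
    return {
        token[start:start + size]
        for start in range(len(token))
        for size in range(min_gram, max_gram + 1)
        if start + size <= len(token)
    }


def ngram_terms(text):
    """Approximate the n-gram analyzer: split on whitespace, lowercase, emit n-grams."""
    terms = set()
    for token in text.lower().split():
        terms |= ngrams(token)
    return terms


# Hypothetical labels for the two sample documents, indexed with the n-gram analyzer.
index = {
    "formula": ngram_terms("formula =IF(SUM(3;4;5))"),
    "words": ngram_terms("some if words: dif difuse"),
}


def search(query_terms):
    """Return documents whose indexed terms share at least one term with the query."""
    return [name for name, terms in index.items() if terms & query_terms]


# Query analyzed with the n-gram analyzer: grams such as "if" match both documents.
print(search(ngram_terms("=if(")))
# Query analyzed with the keyword analyzer: the whole input is a single term.
print(search({"=if("}))
```

With the keyword analyzer the only query term is "=if(", and only the formula document has that exact 4-gram in its indexed terms.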

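Based on the above, you can see that explicitly specifying the keyword analyzer as the search analyzer works against our stored n-gram-analyzed field. Here is an update we can apply to the mapping that simplifies our requests... (note that you will need to destroy your existing index or ...)

PUT /my_index2/_mapping/formulas
{
   "properties": {
      "content": {
         "type": "text",
         "analyzer": "include_special_character_gram",
         "search_analyzer": "keyword"
      }
   }
}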

Now let's go back to the match query I originally used to show both documents being returned.

In:

GET /my_index2/_search
{
  "query": {
    "match": {
      "content": "=if("
    }
  }
}

Output:

{ "took": 0, "timed_out": false, "_shards": { "total": 1, "successful": 1, "skipped": 0, "failed": 0 }, "hits": { "total": 1, "max_score": 0.56074005, "hits": [ { "_index": "my_index2", "_type": "formulas", "_id": "AV9GZ3sSgkgblFY6zpK2", "_score": 0.56074005, "_source": { "content": "formula =IF(SUM(3;4;5))" } } ] } }

EDIT - Querying with simple_query_string

In:

GET /my_index2/_search
{
  "query": {
    "simple_query_string": {
      "query": "=if\\(",
      "fields": ["content"]
    }
  }
}

Output:

{ "took": 0, "timed_out": false, "_shards": { "total": 1, "successful": 1, "skipped": 0, "failed": 0 }, "hits": { "total": 1, "max_score": 0.56074005, "hits": [ { "_index": "my_index2", "_type": "formulas", "_id": "AV9GZ3sSgkgblFY6zpK2", "_score": 0.56074005, "_source": { "content": "formula =IF(SUM(3;4;5))" } } ] } }
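
The request above escapes the parenthesis because simple_query_string reserves a handful of characters as operators (+ | - " * ( ) and ~ followed by a number), and a backslash makes them literal. If you build such queries in code, a small helper like the following might do the escaping for you. It is a sketch only: the function and constant names are hypothetical, and it simply escapes every reserved character, which may be more than a given query strictly needs.

```python
import json

# Characters that simple_query_string treats as operators, plus the backslash itself.
SIMPLE_QUERY_STRING_RESERVED = set('+|-"*()~\\')


def escape_simple_query_string(text):
    """Prefix every reserved character with a backslash so it is matched literally."""
    return "".join(
        "\\" + ch if ch in SIMPLE_QUERY_STRING_RESERVED else ch
        for ch in text
    )


body = {
    "query": {
        "simple_query_string": {
            "query": escape_simple_query_string("=if("),
            "fields": ["content"],
        }
    }
}

# json.dumps doubles the backslash, giving the same "=if\\(" seen in the request above.
print(json.dumps(body, indent=2))
```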

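There you have it. If you choose to go this route, you can obviously fiddle with the n-gram sizes. This answer is already long enough, so I won't try to cover the other approaches you could take, but I figured one working solution would help. I think the important part is understanding what happens behind the scenes with the _all field and with how the query string is interpreted.

Hope this helps, and thanks for the interesting question.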

Simple query string with special characters such as ( and =
