Question

I have the following Elasticsearch query:

"query": {
    "filtered": {
        "filter": {
            "and": {
                "filters": [
                    {
                        "term": {
                            "entities.hashtags": "gf"
                        }
                    }
                ]
            }
        },
        "query": {
            "match_phrase": {
                "body": "anime"
            }
        }
    }
},

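entities.hashtags is an array, so I get back entries with hashtags such as gf_anime, gf_whatever, gf_foobar, and so on. What I need is to get back only entries where the exact "gf" hashtag is present. I have looked at other questions on SO and saw that the solution in this case is to change the analysis of entities.hashtags so that it matches only exact values (I am pretty new to Elasticsearch, so I may be getting the terminology wrong here).

My question is whether it is possible to define an exact-match search in the query itself, i.e. without changing how the field is indexed in Elasticsearch?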

Answer 1

Are you sure that you need to do anything? Given your examples, you don't, and you probably don't want to make the field not_analyzed:

curl -XPUT localhost:9200/test -d '{
  "mappings": {
    "test" : {
      "properties": {
        "body" : { "type" : "string" },
        "entities" : {
          "type" : "object",
          "properties": {
            "hashtags" : {
              "type" : "string"
            }
          }
        }
      }
    }
  }
}'

curl -XPUT localhost:9200/test/test/1 -d '{
  "body" : "anime", "entities" : { "hashtags" : "gf_anime" }
}'

curl -XPUT localhost:9200/test/test/2 -d '{
  "body" : "anime", "entities" : { "hashtags" : ["GF", "gf_anime"] }
}'

curl -XPUT localhost:9200/test/test/3 -d '{
  "body" : "anime", "entities" : { "hashtags" : ["gf_whatever", "gf_anime"] }
}'

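With the above data indexed, your query returns only document 2 (note: this is a simplified version of your query without the unnecessary and undesirable and filter; at least for the time being, you should always use the bool filter rather than and / or, because it understands how to use the filter caches):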

curl -XGET localhost:9200/test/_search -d '{
  "query": {
    "filtered": {
      "filter": {
        "term": {
          "entities.hashtags": "gf"
        }
      },
      "query": {
        "match_phrase": {
          "body": "anime"
        }
      }
    }
  }
}'

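Where this breaks down is when you start to put in hashtag values that get split into multiple tokens, which triggers false hits with the term filter. You can determine how the field's analyzer will treat any value by passing it to the _analyze endpoint and telling it which field's analyzer to use: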

curl -XGET localhost:9200/test/_analyze?field=entities.hashtags\&pretty -d 'gf_anime'
{
  "tokens" : [ {
    "token" : "gf_anime",
    "start_offset" : 0,
    "end_offset" : 8,
    "type" : "<ALPHANUM>",
    "position" : 1
  } ]
}

# Note the space instead of the underscore:
curl -XGET localhost:9200/test/_analyze?field=entities.hashtags\&pretty -d 'gf anime'
{
  "tokens" : [ {
    "token" : "gf",
    "start_offset" : 0,
    "end_offset" : 2,
    "type" : "<ALPHANUM>",
    "position" : 1
  }, {
    "token" : "anime",
    "start_offset" : 3,
    "end_offset" : 8,
    "type" : "<ALPHANUM>",
    "position" : 2
  } ]
}

If you were to add a fourth document with the "gf anime" variant, then you would get a false hit.

curl -XPUT localhost:9200/test/test/4 -d '{
  "body" : "anime", "entities" : { "hashtags" : ["gf whatever", "gf anime"] }
}'

This is not really an indexing problem, but a bad data problem.
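The false hit can be reproduced offline with a rough model of the analysis step. The sketch below is plain Python, not Elasticsearch: `analyze`, `indexed_terms`, and `term_filter` are hypothetical helpers, and the regex tokenizer only approximates the default analyzer for these sample values (lowercase everything, keep underscores inside a token, split on whitespace).

```python
import re

def analyze(value):
    """Rough stand-in for the default analyzer: split on non-word characters, lowercase."""
    return [token.lower() for token in re.findall(r"\w+", value)]

def indexed_terms(hashtags):
    """Collect the tokens indexed for a field holding a string or a list of strings."""
    values = [hashtags] if isinstance(hashtags, str) else hashtags
    return {token for value in values for token in analyze(value)}

# The four sample documents from this answer (hashtags field only).
docs = {
    1: "gf_anime",
    2: ["GF", "gf_anime"],
    3: ["gf_whatever", "gf_anime"],
    4: ["gf whatever", "gf anime"],
}

def term_filter(term):
    """Ids of documents whose indexed tokens contain the term verbatim."""
    return sorted(doc_id for doc_id, tags in docs.items() if term in indexed_terms(tags))

print(term_filter("gf"))  # [2, 4]
```

Document 2 matches because "GF" is lowercased to the token gf, and document 4 matches only because "gf anime" is split at the space. The underscore variants in documents 1 and 3 stay single tokens and are not hit.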

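With all of that explained, you can inefficiently work around this by using a script that always follows the term filter (so that the term filter efficiently rules out the more common case of documents that do not match at all):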

curl -XGET localhost:9200/test/_search -d '{
  "query": {
    "filtered": {
      "filter": {
        "bool" : {
          "must" : [{
            "term" : {
              "entities.hashtags" : "gf"
            }
          },
          {
            "script" : {
              "script" :
                "_source.entities.hashtags == tag || _source.entities.hashtags.find { it == tag } != null",
              "params" : {
                "tag" : "gf"
              }
            }
          }]
        }
      },
      "query": {
        "match_phrase": {
          "body": "anime"
        }
      }
    }
  }
}'

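This works by parsing the original _source (and not using the indexed doc values). That is why it is not going to be very efficient, but it will work until you reindex. The _source.entities.hashtags == tag portion is only necessary if hashtags is not always an array (in my example, document 1 is not an array). If it is always an array, then you can use _source.entities.hashtags.contains(tag) instead of _source.entities.hashtags == tag || _source.entities.hashtags.find { it == tag } != null.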
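The check that the script performs on _source can be sketched outside Elasticsearch as follows. This is a Python model of the logic only, not the Groovy script itself: `has_exact_tag` and its `ignore_case` option are hypothetical, and document 5 is an invented sample added to show a scalar value that passes. One thing the model makes visible is that a comparison against the raw _source is case-sensitive, unlike the analyzed term filter.

```python
def has_exact_tag(source, tag, ignore_case=False):
    """True if hashtags equals the tag, or is a list containing it (raw, unanalyzed values)."""
    hashtags = source.get("entities", {}).get("hashtags")
    # Normalize the scalar case so a single string and a list are handled alike.
    values = hashtags if isinstance(hashtags, list) else [hashtags]
    if ignore_case:
        # Assumption: fold case on both sides to mimic the analyzer's lowercasing.
        return any(isinstance(v, str) and v.lower() == tag.lower() for v in values)
    return tag in values

# Documents that already passed the term filter for "gf", plus hypothetical doc 5.
candidates = {
    2: {"entities": {"hashtags": ["GF", "gf_anime"]}},
    4: {"entities": {"hashtags": ["gf whatever", "gf anime"]}},
    5: {"entities": {"hashtags": "gf"}},
}

exact = [doc_id for doc_id, src in candidates.items() if has_exact_tag(src, "gf")]
folded = [doc_id for doc_id, src in candidates.items()
          if has_exact_tag(src, "gf", ignore_case=True)]

print(exact)   # [5]
print(folded)  # [2, 5]
```

Document 4 is rejected in both runs, which is the false hit this step is meant to remove. Document 2 stores "GF" in its source, so it only survives when case is folded; whether that is the desired behavior depends on how the hashtags were written at index time.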

Note: The script language is Groovy, which is the default starting in 1.4.0. It is not the default in earlier versions, where it must be explicitly enabled using script.default_lang : groovy.

Elasticsearch exact match without changing the index
