Elasticsearch match query on an array field returns wrong results

Date: 2016-08-06 11:41:41

Tags: elasticsearch

I am using this search query:

GET videosearch/_search
{
  "query": {
    "match": {
      "tags": "logs"
    }
  }
}

in order to return all documents that contain "logs" in the tags field.

The tags field has the following mapping:

    "tags": {
      "type": "string",
      "analyzer": "english",
      "fields": {
        "verbatim": {
          "type": "string",
          "index": "not_analyzed"
        }
      }
    }

The query returns results like this one:

{
    "_index": "videosearch",
    "_type": "videos",
    "_id": "10",
    "_score": 0.792282,
    "_source": {
      "id": "10",
      "url": "https://www.youtube.com/watch?v=yDLtyLi6Ny8",
      "title": "#bbuzz: Radu Gheorghe JSON Logging with Elasticsearch",
      "uploaded_by": "newthinking communications",
      "upload_date": "2013-06-19",
      "views": 370,
      "likes": 0,
      "tags": [
        "elasticsearch",
        "logs",
        "logstash",
        "rsyslog",
        "json"
      ]
    }
  }

But it also returns bad results:

{
    "_index": "videosearch",
    "_type": "videos",
    "_id": "15",
    "_score": 0.9054651,
    "_source": {
      "id": "15",
      "url": "https://www.youtube.com/watch?v=4L1DjY90Whk",
      "title": "Tuning Solr for Logs, by Radu Gheorghe",
      "uploaded_by": "Lucidworks",
      "upload_date": "2015-01-07",
      "views": 280,
      "likes": 2,
      "tags": [
        "logging",
        "solr",
        "tuning",
        "performance"
      ]
    }
  }

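I consider the last one a "bad" result because its tags field does not contain the string "logs". I also notice that, even though it is a "bad" result, it scores higher than the "good" one: 0.9054651 vs 0.792282.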

What is going on, and what am I missing?

1 answer:

Answer 0 (score: 0)

After more research, I read about analyzers, which Elasticsearch uses to break text into tokens.

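The english analyzer applies stemming when it builds tokens. In the example below, I use the english analyzer to break some words into search tokens: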

GET _analyze?pretty
{
  "analyzer": "english",
  "text": ["hair dryer", "introduction", "stars", "Introspective", "fishing", "logging"]
}

This results in the following tokens:

{
  "tokens": [
    {
      "token": "hair",
      "start_offset": 0,
      "end_offset": 4,
      "type": "<ALPHANUM>",
      "position": 0
    },
    {
      "token": "dryer",
      "start_offset": 5,
      "end_offset": 10,
      "type": "<ALPHANUM>",
      "position": 1
    },
    {
      "token": "introduct",
      "start_offset": 11,
      "end_offset": 23,
      "type": "<ALPHANUM>",
      "position": 2
    },
    {
      "token": "star",
      "start_offset": 24,
      "end_offset": 29,
      "type": "<ALPHANUM>",
      "position": 3
    },
    {
      "token": "introspect",
      "start_offset": 30,
      "end_offset": 43,
      "type": "<ALPHANUM>",
      "position": 4
    },
    {
      "token": "fish",
      "start_offset": 44,
      "end_offset": 51,
      "type": "<ALPHANUM>",
      "position": 5
    },
    {
      "token": "log",
      "start_offset": 52,
      "end_offset": 59,
      "type": "<ALPHANUM>",
      "position": 6
    }
  ]
}

Notice that each token is actually the stem of the corresponding analyzed word.
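
The word list above includes logging but not the query term itself. As a quick check, and assuming the same english analyzer is applied to the query text at search time (the default when the mapping sets no separate search analyzer), the same request can be run on logs alone:

```
GET _analyze?pretty
{
  "analyzer": "english",
  "text": "logs"
}
```

This should return the single token log, which is the same term that logging was indexed under in the tags field.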

In conclusion, the three words log, logs, and logging share the same stem, log, so all three are search result candidates.
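
If the goal is to match only documents whose tag is exactly logs, one option is to query the not_analyzed sub-field from the mapping in the question instead of the stemmed field. A minimal sketch, assuming tags.verbatim was populated when the documents were indexed:

```
GET videosearch/_search
{
  "query": {
    "term": {
      "tags.verbatim": "logs"
    }
  }
}
```

Because tags.verbatim is not analyzed and a term query does not analyze its input, the comparison is exact and case-sensitive: this matches the tag logs but not logging or Logs.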