Question

I have some code that queries the message field for a specific string. A sample value looks like this:

"message": "Oct 29 11:38:46 1893 192.168.1.114 TCP_MISS/200 153925 GET http://www.pravda.ru/science/ - DIRECT/185.103.135.90 text/html"

Here is my code:

from elasticsearch import Elasticsearch
import json

client = Elasticsearch(['http://192.168.1.114:9200'])

response = client.search(
  index="squidlog-2017.10.29",
  body={
      "query": {
          "match": {
            "message": 'GET'
          }
      }
  }
)

for hit in response['hits']['hits']:
    print(json.dumps(hit['_source'], indent=4, sort_keys=True))

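When I query for a specific string such as GET using the template above, everything works. But when I query for something that is part of the URL in the message, I get nothing back. For example, this query returns no hits: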

body={
      "query": {
          "match": {
            "message": 'pravda'
          }
      }
  }

Is the problem caused by the slashes in my message? Could someone give me some advice? Thanks.

Answer 1

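You could consider using a different tokenizer, which would make the search you want possible. But first, let me explain why your query returns no results in the second case.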

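`standard` analyzer and tokenizer

By default, the standard analyzer uses the standard tokenizer, which does not split a domain name on its dots. You can try out different analyzers and tokenizers with the _analyze endpoint, like this: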

GET _analyze
{
    "text": "Oct 29 11:38:46 1893 192.168.1.114 TCP_MISS/200 153925 GET http://www.pravda.ru/science/ - DIRECT/185.103.135.90 text/html"
}

The response is the list of tokens that Elasticsearch will use to represent this string when searching. Here it is:

{
   "tokens": [
      {
         "token": "oct",
         "start_offset": 0,
         "end_offset": 3,
         "type": "<ALPHANUM>",
         "position": 0
      }, ...
      {
         "token": "http",
         "start_offset": 59,
         "end_offset": 63,
         "type": "<ALPHANUM>",
         "position": 11
      },
      {
         "token": "www.pravda.ru",
         "start_offset": 66,
         "end_offset": 79,
         "type": "<ALPHANUM>",
         "position": 12
      },
      {
         "token": "science",
         "start_offset": 80,
         "end_offset": 87,
         "type": "<ALPHANUM>",
         "position": 13
      }, ...
   ]
}

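As you can see, "pravda" is not in the token list, so you cannot search for it. You can only search for tokens that the analyzer emits.

Note that "pravda" is part of the domain name, and the whole domain name is emitted as a single token: "www.pravda.ru". A match query for "pravda" looks for the exact term "pravda", which does not exist in the index.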
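The same check can be run from Python instead of the console. The following is a minimal sketch, assuming a client version that accepts `body=` (as in the question's code); `analyze_tokens` is a hypothetical helper name, not part of the library.

```python
def analyze_tokens(client, text, index=None, **options):
    """Return the token strings that the _analyze API reports for `text`.

    `options` is copied into the request body, for example
    tokenizer="lowercase" or analyzer="standard". With no options,
    Elasticsearch falls back to its default analyzer.
    """
    body = dict(options, text=text)
    response = client.indices.analyze(index=index, body=body)
    return [entry["token"] for entry in response["tokens"]]


# Usage (needs a running cluster, so it is left commented out):
# from elasticsearch import Elasticsearch
# client = Elasticsearch(['http://192.168.1.114:9200'])
# line = "... GET http://www.pravda.ru/science/ ..."
# print("pravda" in analyze_tokens(client, line))
# print("pravda" in analyze_tokens(client, line, tokenizer="lowercase"))
```

If a search term is missing from the returned list, a match query for that term on a field using the same analysis will not find the document.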

`lowercase` tokenizer

If you use a different tokenizer, for example the lowercase tokenizer, it emits pravda as a token, and you can then search for it:

GET _analyze
{
    "tokenizer" : "lowercase",
    "text": "Oct 29 11:38:46 1893 192.168.1.114 TCP_MISS/200 153925 GET http://www.pravda.ru/science/ - DIRECT/185.103.135.90 text/html"
}

The token list:

{
   "tokens": [
      {
         "token": "oct",
         "start_offset": 0,
         "end_offset": 3,
         "type": "word",
         "position": 0
      }, ...
      {
         "token": "http",
         "start_offset": 59,
         "end_offset": 63,
         "type": "word",
         "position": 4
      },
      {
         "token": "www",
         "start_offset": 66,
         "end_offset": 69,
         "type": "word",
         "position": 5
      },
      {
         "token": "pravda",
         "start_offset": 70,
         "end_offset": 76,
         "type": "word",
         "position": 6
      },
      {
         "token": "ru",
         "start_offset": 77,
         "end_offset": 79,
         "type": "word",
         "position": 7
      },
      {
         "token": "science",
         "start_offset": 80,
         "end_offset": 87,
         "type": "word",
         "position": 8
      }, ...
   ]
}
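To see why the lowercase tokenizer produces these tokens, here is a rough pure-Python approximation: it splits the text at every non-letter character and lowercases each piece. This is only an illustrative sketch of the behaviour, not Elasticsearch's implementation, and `lowercase_tokenize` is a name made up for this example.

```python
import re

LINE = ("Oct 29 11:38:46 1893 192.168.1.114 TCP_MISS/200 153925 GET "
        "http://www.pravda.ru/science/ - DIRECT/185.103.135.90 text/html")


def lowercase_tokenize(text):
    """Approximate the lowercase tokenizer: keep runs of letters, lowercased."""
    # [^\W\d_]+ matches one or more letters (word characters minus
    # digits and the underscore), so dots, slashes and digits all split.
    return [match.group(0).lower() for match in re.finditer(r"[^\W\d_]+", text)]


tokens = lowercase_tokenize(LINE)
print(tokens)
print(tokens.index("pravda"))  # 6, the same position as in the output above
```

Digits are dropped entirely under this scheme, so the IP addresses and the status code in the log line would no longer be searchable through a field analyzed this way.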

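How do I define the analyzer before indexing?

To be able to search for such tokens, you have to analyze the text differently at index time. That means defining a mapping that uses a different analyzer, as in this example: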

PUT yet_another_index
{
   "settings": {
      "analysis": {
         "analyzer": {
            "my_custom_analyzer": {
               "type": "custom",
               "tokenizer": "lowercase"
            }
         }
      }
   },
   "mappings": {
      "my_type": {
         "properties": {
            "message": {
               "type": "text",
               "fields": {
                  "lowercased": {
                     "type": "text",
                     "analyzer": "my_custom_analyzer"
                  }
               }
            }
         }
      }
   }
}
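From Python, the same index could be created roughly as follows. This is a sketch under assumptions: the helper names `build_index_body` and `create_index` are hypothetical, and the client is assumed to accept `body=` as in the question's code. Note that the `my_type` level reflects the Elasticsearch version used in this thread; mapping types were removed in Elasticsearch 7 and later, where `properties` sits directly under `mappings`.

```python
def build_index_body(doc_type="my_type"):
    """Build the settings and mappings from the example above as a dict."""
    properties = {
        "message": {
            "type": "text",  # indexed with the default analyzer
            "fields": {
                # second copy of the same text, analyzed by the custom analyzer
                "lowercased": {"type": "text", "analyzer": "my_custom_analyzer"}
            },
        }
    }
    return {
        "settings": {
            "analysis": {
                "analyzer": {
                    "my_custom_analyzer": {"type": "custom", "tokenizer": "lowercase"}
                }
            }
        },
        "mappings": {doc_type: {"properties": properties}},
    }


def create_index(client, name="yet_another_index"):
    """Create the index; `client` is an elasticsearch.Elasticsearch instance."""
    return client.indices.create(index=name, body=build_index_body())
```

Documents that were indexed before the mapping existed are not re-analyzed, so existing data has to be indexed again into the new index.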

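Here we first define a custom analyzer with the desired tokenizer, and then tell Elasticsearch to index the message field twice via the fields feature: implicitly with the default analyzer, and explicitly with my_custom_analyzer.

Now we can query for the desired token. A request against the original field returns no hits:

POST yet_another_index/my_type/_search
{
    "query": {
        "match": {
            "message": "pravda"
        }
    }
}

"hits": {
    "total": 0,
    "max_score": null,
    "hits": []
}

But the same query against the message.lowercased field succeeds.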
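Adapting the Python code from the question to target the message.lowercased sub-field might look like the sketch below. The function name `search_message` and its default arguments are assumptions made for illustration; only `client.search(index=..., body=...)` comes from the question.

```python
def search_message(client, text, field="message.lowercased",
                   index="yet_another_index"):
    """Run a match query on `field` and return the _source of each hit."""
    body = {"query": {"match": {field: text}}}
    response = client.search(index=index, body=body)
    return [hit["_source"] for hit in response["hits"]["hits"]]


# Usage (needs a running cluster, so it is left commented out):
# import json
# from elasticsearch import Elasticsearch
# client = Elasticsearch(['http://192.168.1.114:9200'])
# for source in search_message(client, "pravda"):
#     print(json.dumps(source, indent=4, sort_keys=True))
```

Passing field="message" instead runs the query against the field analyzed with the default analyzer, which is the case that returns no hits for "pravda".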

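There are many options, and this solution only covers the example you provided. Have a look at the different analyzers and tokenizers to find the ones that suit your data best.

Hope that helps!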
