Question

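I have some texts in Elasticsearch that contain URLs in various formats (http://www, www.). What I want to do is search for all texts that contain, for example, google.com.

For the search I currently use something like this query: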

query = {
    "query": {
        "bool": {
            "must": [
                {
                    "range": {
                        "cdate": {
                            "gt": dfrom,
                            "lte": dto
                        }
                    }
                },
                {
                    "query_string": {
                        "default_operator": "AND",
                        "default_field": "text",
                        "analyze_wildcard": "true",
                        "query": searchString
                    }
                }
            ]
        }
    }
}

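However, a query such as google.com never returns any results, while searching for a word such as "test" (without the quotes) works fine. I do want to use query_string because I like using boolean operators, but I really need to be able to search for substrings of whole words.

Thanks!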

Answer 1

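Indeed, the standard analyzer tokenizes http://www.google.com into http and www.google.com, so google.com will not be found.

So the standard analyzer alone will not help here; we need a token filter that correctly transforms URL tokens. If your text field contained only URLs, another option would have been the UAX Email URL tokenizer, but since the field can contain any other text (i.e. user comments), that will not work.

Fortunately, there is a new plugin called analysis-url that provides a URL token filter, which is exactly what we need (after a small modification I requested, thanks @jlinn ;-) ).

First, you need to install the plugin: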

bin/plugin install https://github.com/jlinn/elasticsearch-analysis-url/releases/download/v2.2.0/elasticsearch-analysis-url-2.2.0.zip

Then we can start playing. We need to create a suitable analyzer for your text field:

curl -XPUT localhost:9200/test -d '{
  "settings": {
    "analysis": {
      "filter": {
        "url_host": {
          "type": "url",
          "part": "host",
          "url_decode": true,
          "passthrough": true
        }
      },
      "analyzer": {
        "url_host": {
          "filter": [
            "url_host"
          ],
          "tokenizer": "whitespace"
        }
      }
    }
  },
  "mappings": {
    "url": {
      "properties": {
        "text": {
          "type": "string",
          "analyzer": "url_host"
        }
      }
    }
  }
}'

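With this analyzer and mapping, we can properly index the hosts you want to be able to search for. For example, let's analyze the string blabla bla http://www.google.com blabla with our new analyzer.

curl -XGET 'localhost:9200/test/_analyze?analyzer=url_host&pretty' -d 'blabla bla http://www.google.com blabla'

We get the following tokens: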

{
  "tokens" : [ {
    "token" : "blabla",
    "start_offset" : 0,
    "end_offset" : 0,
    "type" : "word",
    "position" : 0
  }, {
    "token" : "bla",
    "start_offset" : 0,
    "end_offset" : 0,
    "type" : "word",
    "position" : 1
  }, {
    "token" : "www.google.com",
    "start_offset" : 0,
    "end_offset" : 0,
    "type" : "word",
    "position" : 2
  }, {
    "token" : "google.com",
    "start_offset" : 0,
    "end_offset" : 0,
    "type" : "word",
    "position" : 3
  }, {
    "token" : "com",
    "start_offset" : 0,
    "end_offset" : 0,
    "type" : "word",
    "position" : 4
  }, {
    "token" : "blabla",
    "start_offset" : 0,
    "end_offset" : 0,
    "type" : "word",
    "position" : 5
  } ]
}

As you can see, the http://www.google.com part is tokenized into:

www.google.com
google.com, i.e. what you expected
com
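
To make the token expansion easier to picture, here is a minimal Python sketch that imitates the output shown above: it splits on whitespace and, for each http/https URL, emits the host followed by each shorter dot-separated suffix. This is only an illustration, not the plugin's actual implementation; the function names are hypothetical, and the way non-URL words pass through unchanged is an assumption modelled on the "passthrough": true setting.

```python
from urllib.parse import urlparse


def host_suffixes(host):
    """Return the host followed by each shorter dot-separated suffix."""
    labels = host.split(".")
    return [".".join(labels[i:]) for i in range(len(labels))]


def simulate_url_host_analyzer(text):
    """Rough imitation of a whitespace tokenizer plus a host token filter."""
    tokens = []
    for word in text.split():
        parsed = urlparse(word)
        if parsed.scheme in ("http", "https") and parsed.hostname:
            # URL words are replaced by their host and its suffixes.
            tokens.extend(host_suffixes(parsed.hostname))
        else:
            # Assumption: non-URL words are kept as they are.
            tokens.append(word)
    return tokens


print(simulate_url_host_analyzer("blabla bla http://www.google.com blabla"))
```

Because google.com is one of the emitted suffixes, a term query for google.com can match the document directly, without wildcards.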

Now, if your searchString is google.com, you will be able to find all the documents whose text field contains google.com (or www.google.com).
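
As a sketch of how this fits the query from the question, the helper below rebuilds the same bool query with the search string passed in. The function name build_query and the date values are hypothetical placeholders; the query structure itself is taken from the question unchanged.

```python
def build_query(dfrom, dto, search_string):
    """Build the bool query from the question for a given search string."""
    return {
        "query": {
            "bool": {
                "must": [
                    # Restrict results to the requested date window.
                    {"range": {"cdate": {"gt": dfrom, "lte": dto}}},
                    # query_string keeps boolean operators available.
                    {
                        "query_string": {
                            "default_operator": "AND",
                            "default_field": "text",
                            "analyze_wildcard": "true",
                            "query": search_string,
                        }
                    },
                ]
            }
        }
    }


# Hypothetical date range, for illustration only.
query = build_query("2016-01-01", "2016-02-01", "google.com")
print(query["query"]["bool"]["must"][1]["query_string"]["query"])
```

The query itself does not need to change; the difference comes from the text field now being indexed with the url_host analyzer.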

Answer 2

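Full-text search always looks for exact matches in the inverted index, unless you perform a wildcard search, which forces a traversal of the inverted index. Using a wildcard at the start of your queryString will lead to a full traversal of the index and is not recommended.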
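
For reference, the wildcard variant being warned about would look roughly like the sketch below, which wraps the term in leading and trailing wildcards inside a query_string query. The helper name wildcard_query is hypothetical; the field name text comes from the question.

```python
def wildcard_query(term, field="text"):
    """Build a query_string query that matches `term` as a substring."""
    return {
        "query": {
            "query_string": {
                "default_field": field,
                "analyze_wildcard": "true",
                # The leading * is what forces the costly index traversal.
                "query": "*" + term + "*",
            }
        }
    }


print(wildcard_query("google.com"))
```

This may be acceptable on a small index, but the cost grows with the number of terms, which is why indexing a searchable domain token is the better option.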

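Consider indexing not only the URL but also the domain (obtained by stripping the protocol, the subdomain, and anything after the domain) in a field that uses the Keyword Tokenizer. You can then search for domains against this field.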
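
A minimal sketch of that stripping step, done on the client before indexing, might look like this. The function name extract_domain is hypothetical, and treating the last two labels of the host as the domain is a naive assumption: it gives the wrong result for suffixes such as co.uk, where a public suffix list would be needed.

```python
from urllib.parse import urlparse


def extract_domain(url):
    """Strip protocol, subdomain, port and path from a URL."""
    # urlparse only recognises the host when "//" precedes it.
    if "//" not in url:
        url = "//" + url
    host = urlparse(url).hostname or ""
    labels = host.split(".")
    # Naive assumption: the domain is the last two labels of the host.
    return ".".join(labels[-2:])


print(extract_domain("http://www.google.com/search?q=test"))
print(extract_domain("www.google.com"))
```

The returned value would be stored in its own field next to the original text, so that a search for google.com becomes an exact match on that field.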
