Question

I'm trying to rebuild my Elasticsearch query because I found that I'm not getting all the documents I'm looking for.

So, let's assume I have a document like this:

{
  "id": 1234,
  "mail_id": 5,
  "sender": "john smith",
  "email": "johnsmith@gmail.com",
  "subject": "somesubject",
  "txt": "abcdefgh\r\n",
  "html": "<div dir=\"ltr\">abcdefgh</div>\r\n",
  "date": "2017-07-020 10:00:00"
}

I have a few million documents like this, and I'm trying to search for some of them with a query like this:

{
  "sort": [
    {
      "date": {
        "order": "desc"
      }
    }
  ],
  "query": {
    "bool": {
      "minimum_should_match": "100%",
      "should": [
        {
          "multi_match": {
            "type": "cross_fields",
            "query": "abcdefgh johnsmith john smith",
            "operator": "and",
            "fields": [
              "email.full",
              "sender",
              "subject",
              "txt",
              "html"
            ]
          }
        }
      ],
      "must": [
        {
          "ids": {
            "values": [
              "1234"
            ]
          }
        },
        {
          "term": {
            "mail_id": 5
          }
        }
      ]
    }
  }
}

Everything works fine with a query like this, but when I try to find the document by querying 'gmail' or 'com', it doesn't work.

"query": "abcdefgh johnsmith john smith gmail"
"query": "abcdefgh johnsmith john smith com"

It only works if I search for 'gmail.com':

"query": "abcdefgh johnsmith john smith gmail.com"

So... I tried attaching an analyzer:

...
"type": "cross_fields",
"query": "abcdefgh johnsmith john smith",
"operator": "and",
"analyzer": "simple",
...

It didn't help at all. The only way I was able to find this document was by defining a regexp, e.g.:

"minimum_should_match": 1,
"should": [
  {
    "multi_match": {
      "type": "cross_fields",
      "query": "fdsfs wukamil kam wuj gmail.com",
      "operator": "and",
      "fields": [
        "email.full",
        "sender",
        "subject",
        "txt",
        "html"
      ]
    }
  },
  {
    "regexp": {
      "email.full": ".*gmail.*"
    }
  }
],

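But with this approach I would have to add (query terms * fields) regexp objects to my JSON, so I don't think it's the best solution. I also know about wildcard queries, but they would be just as messy as regexp.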

If anyone has had this problem and knows a solution, I would appreciate your help :)

Answer 1

If you run your search term through the standard analyzer, you can see which tokens johnsmith@gmail.com gets broken into. You can do this directly in your browser using the following URL:

https://<your_site>:<es_port>/_analyze/?analyzer=standard&text=johnsmith@gmail.com

This shows that the email gets broken into the following tokens:

{
    "tokens": [
        {
            "token": "johnsmith",
            "start_offset": 0,
            "end_offset": 9,
            "type": "<ALPHANUM>",
            "position": 1
        },
        {
            "token": "gmail.com",
            "start_offset": 10,
            "end_offset": 19,
            "type": "<ALPHANUM>",
            "position": 2
        }
    ]
}

So you cannot search using just gmail, but you can search using gmail.com. To split the text on the dot as well, you can update your mapping to use the simple analyzer, which the documentation describes as follows:

The simple analyzer breaks text into terms whenever it encounters a character which is not a letter. All terms are lower cased.
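To make that description concrete, here is a rough Python emulation of the behaviour quoted above. It is only a sketch for reasoning about tokens, not Elasticsearch's actual implementation, and the function name `simple_analyze` is made up for this example.

```python
import re

# Rough emulation of the simple analyzer as described above:
# keep runs of letters, drop everything else, lowercase the result.
# This is an illustration only, not Elasticsearch's own code.
_LETTER_RUN = re.compile(r"[^\W\d_]+")  # word characters minus digits and "_"


def simple_analyze(text):
    """Return the lowercased runs of letters found in text."""
    return [match.group(0).lower() for match in _LETTER_RUN.finditer(text)]


if __name__ == "__main__":
    print(simple_analyze("johnsmith@gmail.com"))
```

Because both `@` and `.` are non-letters, the address ends up as three separate terms, which is what makes a search for just gmail or com possible.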

We can show this works by updating our earlier URL to use the simple analyzer, as follows:

https://<your_site>:<es_port>/_analyze/?analyzer=simple&text=johnsmith@gmail.com

Which returns:

{
    "tokens": [
        {
            "token": "johnsmith",
            "start_offset": 0,
            "end_offset": 9,
            "type": "word",
            "position": 1
        },
        {
            "token": "gmail",
            "start_offset": 10,
            "end_offset": 15,
            "type": "word",
            "position": 2
        },
        {
            "token": "com",
            "start_offset": 16,
            "end_offset": 19,
            "type": "word",
            "position": 3
        }
    ]
}
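Newer Elasticsearch versions may not accept the query-string form of _analyze and expect the analyzer and text in a JSON request body instead. Assuming that is the case for your cluster, the same check could be sketched in Python like this. The host `http://localhost:9200` and the helper names are placeholders for illustration, and the request is only built here, not sent.

```python
import json
import urllib.request


def build_analyze_request(base_url, analyzer, text):
    """Build (but do not send) a POST to <base_url>/_analyze with a JSON body."""
    body = json.dumps({"analyzer": analyzer, "text": text}).encode("utf-8")
    return urllib.request.Request(
        base_url.rstrip("/") + "/_analyze",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def token_strings(analyze_response):
    """Pull just the token strings out of a parsed _analyze response."""
    return [entry["token"] for entry in analyze_response["tokens"]]


if __name__ == "__main__":
    # Placeholder host; replace with your own site and port.
    req = build_analyze_request("http://localhost:9200", "simple", "johnsmith@gmail.com")
    print(req.get_method(), req.full_url)
    print(req.data.decode("utf-8"))
    # Sending it needs a reachable cluster:
    # with urllib.request.urlopen(req) as resp:
    #     print(token_strings(json.load(resp)))
```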

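This analyzer may not be the right tool for the job, since it ignores any non-letter characters, but you can experiment with analyzers and tokenizers until you get what you need.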
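As a starting point, a mapping that applies the simple analyzer to the `email.full` sub-field might look roughly like the sketch below, written as a Python dict and dumped to JSON. The question does not show the real mapping, so the field layout is a guess, and the `text` type assumes a version of Elasticsearch that has it. Setting `"analyzer": "simple"` only on the query does not change the tokens already stored in the index, which is why the change belongs in the mapping. The analyzer of an existing field cannot be changed in place, so the documents would need to be reindexed.

```python
import json

# Hypothetical mapping fragment: the asker's actual mapping is not shown,
# so the layout of "email" and its "full" sub-field is an assumption.
email_mapping = {
    "properties": {
        "email": {
            "type": "text",
            "fields": {
                # Index-time analysis with the simple analyzer, so that
                # "gmail" and "com" become separate terms.
                "full": {"type": "text", "analyzer": "simple"}
            },
        }
    }
}

if __name__ == "__main__":
    print(json.dumps(email_mapping, indent=2))
```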

elasticsearch multi_match with regexp
