Elasticsearch aggregation on email domains

Time: 2016-08-01 06:47:17

Tags: elasticsearch

I am new to Elasticsearch, and I am trying to count the distinct occurrences of a substring of a field.

My mail log index includes the email recipients, and I would like to count how many emails in the index belong to each distinct domain.

For example, if my index contains 3 mail logs from the following addresses: a@b.com, c@b.com, d@e.com; I would like to see 2 emails from the b.com domain and 1 email from e.com.

1 answer:

Answer 0: (score: 1)

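You need a pattern_capture filter to capture whatever comes after the @. Also, so as not to interfere with the original analysis of the text, I suggest adding a sub-field (email.domain) to the original email field and using that sub-field only for this particular aggregation: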

PUT /test
{
  "settings": {
    "analysis": {
      "filter": {
        "email_domains": {
          "type": "pattern_capture",
          "preserve_original": false,
          "patterns": [
            "@(.+)"
          ]
        }
      },
      "analyzer": {
        "email": {
          "tokenizer": "uax_url_email",
          "filter": [
            "email_domains",
            "lowercase",
            "unique"
          ]
        }
      }
    }
  },
  "mappings": {
    "emails": {
      "properties": {
        "email": {
          "type": "string",
          "fields": {
            "domain": {
              "type": "string",
              "analyzer": "email"
            }
          }
        }
      }
    }
  }
}
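
To make it easier to see which terms end up in email.domain, here is a minimal Python sketch of the analysis chain above. It is only an approximation and does not call Elasticsearch: the EMAIL_TOKEN regex is a hypothetical stand-in for the uax_url_email tokenizer, and analyze_email_domains is a name made up for this illustration.

```python
import re

# Hypothetical stand-in for the uax_url_email tokenizer: it only picks out
# email-like tokens and is not a real UAX #29 implementation.
EMAIL_TOKEN = re.compile(r"[^\s,;<>]+@[^\s,;<>]+")
# Same pattern as the email_domains filter in the settings above.
DOMAIN_CAPTURE = re.compile(r"@(.+)")


def analyze_email_domains(value):
    """Return roughly the terms the email.domain sub-field would index."""
    terms = []
    for token in EMAIL_TOKEN.findall(value):
        # pattern_capture with preserve_original disabled: keep only the capture
        domain = DOMAIN_CAPTURE.search(token).group(1)
        domain = domain.lower()      # lowercase filter
        if domain not in terms:      # unique filter
            terms.append(domain)
    return terms


print(analyze_email_domains("john.doe@gmail.com, john.doe@outlook.com"))
```

On a real cluster, the _analyze API can be used to check the actual tokens produced by the email analyzer.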

Try it with some test data:

POST /test/emails/_bulk
{"index":{"_id":"1"}}
{"email": "john.doe@gmail.com"}
{"index":{"_id":"2"}}
{"email": "john.doe@gmail.com, john.doe@outlook.com"}
{"index":{"_id":"3"}}
{"email": "hello-john.doe@outlook.com"}
{"index":{"_id":"4"}}
{"email": "john.doe@outlook.com"}
{"index":{"_id":"5"}}
{"email": "john@yahoo.com"}

For your specific use case, a simple terms aggregation like the following should do:

GET /test/emails/_search
{
  "size": 0,
  "aggs": {
    "by_domain": {
      "terms": {
        "field": "email.domain",
        "size": 10
      }
    }
  }
}

The result looks like this:

   "aggregations": {
      "by_domain": {
         "doc_count_error_upper_bound": 0,
         "sum_other_doc_count": 0,
         "buckets": [
            {
               "key": "outlook.com",
               "doc_count": 3
            },
            {
               "key": "gmail.com",
               "doc_count": 2
            },
            {
               "key": "yahoo.com",
               "doc_count": 1
            }
         ]
      }
   }
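
Note that doc_count counts documents, not individual addresses: document 2 contributes once to gmail.com and once to outlook.com. As a sanity check, the following Python sketch recomputes the buckets from the five test documents. It is a local approximation, not an Elasticsearch call; the regex and the count_docs_per_domain helper are assumptions made for this illustration.

```python
import re
from collections import Counter

# The five test documents from the bulk request above.
DOCS = [
    "john.doe@gmail.com",
    "john.doe@gmail.com, john.doe@outlook.com",
    "hello-john.doe@outlook.com",
    "john.doe@outlook.com",
    "john@yahoo.com",
]


def count_docs_per_domain(docs):
    """Count, per domain, how many documents mention it at least once."""
    counts = Counter()
    for value in docs:
        # A set mirrors terms-aggregation semantics: one count per document.
        domains = {d.lower() for d in re.findall(r"@([^\s,;<>]+)", value)}
        counts.update(domains)
    return counts


bucket_counts = count_docs_per_domain(DOCS)
for domain, doc_count in bucket_counts.most_common():
    print(domain, doc_count)
```

If you need the number of addresses rather than the number of documents, each address would have to be indexed as its own document (or nested object) instead.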