Question

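ElasticSearch lets us filter a set of documents by a regular expression on any given field, and also group the resulting documents by the terms in a given field (the same one or a different one) using "bucket aggregations". For example, on an index that contains a "Url" field and a "UserAgent" field (some kind of web server log), the following returns the top document counts for the terms found in the UserAgent field.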

{
    query: { filtered: { filter: { regexp: { Url : ".*interestingpage.*" } } } },
    size: 0,                            
    aggs: { myaggregation: { terms: { field: "UserAgent" } } }                          
}

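What I would like to do is use the power of the regexp filter (which operates on the whole field, not just on terms within the field) to define my aggregation buckets manually, so that I can split my documents/counts/hits relatively reliably by "user agent type" rather than by the arbitrary terms that ElasticSearch parses out of the field.

Basically, I am looking for the equivalent of a CASE statement inside a GROUP BY, in SQL terms. A SQL query that expresses my intent would look something like this: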

SELECT Bucket, Count(*)
FROM (
    SELECT CASE 
        WHEN UserAgent LIKE '%android%' OR UserAgent LIKE '%ipad%' OR UserAgent LIKE '%iphone%' OR UserAgent LIKE '%mobile%' THEN 'Mobile'
        WHEN UserAgent LIKE '%msie 7.0%' then 'IE7'
        WHEN UserAgent LIKE '%msie 8.0%' then 'IE8'
        WHEN UserAgent LIKE '%firefox%' then 'FireFox'
        ELSE 'OTHER'
        END Bucket
    FROM pagedata
    WHERE Url LIKE '%interestingpage%'
) Buckets
GROUP BY Bucket

Can this be done in an ElasticSearch query?

Answer 1

You can use a terms aggregation with a script:

{
  query: { filtered: { filter: { regexp: { Url : ".*interestingpage.*" } } } },
  size: 0,
  aggs: {
    myaggregation: {
      terms: {
        script: "doc['UserAgent'] =~ /.*android.*/ || doc['UserAgent'] =~ /.*ipad.*/ || doc['UserAgent'] =~ /.*iphone.*/ || doc['UserAgent'] =~ /.*mobile.*/ ? 'Mobile' : doc['UserAgent'] =~ /.*msie 7.0.*/ ? 'IE7' : '...you got the idea by now...'"
      }
    }
  }
}

But beware of the performance hit!
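The chained ternary in the script is a first-match-wins cascade, which is what the SQL CASE in the question does. As a rough illustration of that logic outside Elasticsearch, here is a minimal Python sketch. The names `RULES` and `bucket_for` are hypothetical, and the patterns are copied from the SQL LIKE clauses; this is not the code Elasticsearch runs.

```python
import re

# Ordered (pattern, bucket) rules; the first rule that matches wins,
# mirroring the WHEN clauses of the SQL CASE in the question.
RULES = [
    (re.compile(r"android|ipad|iphone|mobile"), "Mobile"),
    (re.compile(r"msie 7\.0"), "IE7"),
    (re.compile(r"msie 8\.0"), "IE8"),
    (re.compile(r"firefox"), "FireFox"),
]


def bucket_for(user_agent):
    """Return the single bucket name for one UserAgent string."""
    for pattern, bucket in RULES:
        if pattern.search(user_agent):
            return bucket
    return "OTHER"


print(bucket_for("firefox is my favourite browser"))  # FireFox
print(bucket_for("msie 7.0 sucks and ipad rules"))    # Mobile
```

Note that each document lands in exactly one bucket here: the second string matches both the Mobile and the IE7 rule, and the earlier rule wins.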

Answer 2

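This is an interesting use case.

Here is a more Elasticsearch-idiomatic solution. The idea is to do all of the regex matching at index time so that searches stay fast (scripts run at search time perform poorly and take a long time when there are many documents). Let me explain:

Define a sub-field for your main field in which the handling of terms is customized.
This handling is set up so that the only terms kept in the index are the ones you defined: FireFox, IE8, IE7, Mobile. Each document can contain more than one of these terms. That means a text like "msie 7.0 sucks and ipad rules" generates only two terms: IE7 and Mobile.

All of this is made possible by the keep token filter.

Another list of token filters performs the actual replacement. This uses the pattern_replace token filter.
Because some of the strings to replace consist of two words (msie 7.0, for example), you need a way to capture those two words (msie and 7.0) next to each other. This uses the shingle token filter.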
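To make the filter chain easier to follow, here is a minimal Python sketch that approximates it: whitespace tokenizing, shingling, pattern replacement, then keeping only the whitelisted words. It is only an approximation under simplifying assumptions (the real Lucene filters also track token positions and offsets, which are ignored here), and the helper names `shingles` and `analyze` are hypothetical. The sample strings are the three documents from the test data in this answer, and the final tally imitates the per-term document count that a terms aggregation reports.

```python
import re
from collections import Counter

# Replacement rules mirroring the four pattern_replace filters, in order.
REPLACEMENTS = [
    (re.compile(r"android|ipad|iphone|mobile"), "Mobile"),
    (re.compile(r"msie 7.0"), "IE7"),
    (re.compile(r"msie 8.0"), "IE8"),
    (re.compile(r"firefox"), "FireFox"),
]
# Whitelist mirroring the keep filter.
KEEP_WORDS = {"FireFox", "IE8", "IE7", "Mobile"}


def shingles(tokens, min_size=2, max_size=10):
    """Return the unigrams plus every run of min_size..max_size adjacent tokens."""
    out = list(tokens)
    for size in range(min_size, max_size + 1):
        for start in range(len(tokens) - size + 1):
            out.append(" ".join(tokens[start:start + size]))
    return out


def analyze(text):
    """Approximate the analyzer: split on whitespace, shingle, replace, keep."""
    kept = set()
    for token in shingles(text.split()):
        for pattern, replacement in REPLACEMENTS:
            token = pattern.sub(replacement, token)
        if token in KEEP_WORDS:  # only exact whitelist matches survive
            kept.add(token)
    return kept


docs = [
    "android OS is the best firefox",
    "firefox is my favourite browser",
    "msie 7.0 sucks and ipad rules",
]

print(sorted(analyze(docs[2])))  # ['IE7', 'Mobile']

# Count each term once per document, as a terms aggregation does.
counts = Counter(term for text in docs for term in analyze(text))
```

Unlike a first-match classification, a document here can contribute to several buckets at once, because every kept term is indexed.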

Let me put all of this together and provide the complete solution. Index settings and mapping:

PUT /test
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_pattern_replace_analyzer": {
          "tokenizer": "whitespace",
          "filter": [
            "filter_shingle",
            "my_pattern_replace1",
            "my_pattern_replace2",
            "my_pattern_replace3",
            "my_pattern_replace4",
            "words_to_be_kept"
          ]
        }
      },
      "filter": {
        "filter_shingle": {
          "type": "shingle",
          "max_shingle_size": 10,
          "min_shingle_size": 2,
          "output_unigrams": true
        },
        "my_pattern_replace1": {
          "type": "pattern_replace",
          "pattern": "android|ipad|iphone|mobile",
          "replacement": "Mobile"
        },
        "my_pattern_replace2": {
          "type": "pattern_replace",
          "pattern": "msie 7.0",
          "replacement": "IE7"
        },
        "my_pattern_replace3": {
          "type": "pattern_replace",
          "pattern": "msie 8.0",
          "replacement": "IE8"
        },
        "my_pattern_replace4": {
          "type": "pattern_replace",
          "pattern": "firefox",
          "replacement": "FireFox"
        },
        "words_to_be_kept": {
          "type": "keep",
          "keep_words": [
            "FireFox", "IE8", "IE7", "Mobile"
          ]
        }
      }
    }
  },
  "mappings": {
    "test": {
      "properties": {
        "UserAgent": {
          "type": "string",
          "fields": {
            "custom": {
              "analyzer": "my_pattern_replace_analyzer",
              "type": "string"
            }
          }
        }
      }
    }
  }
}

Test data:

POST /test/test/_bulk
{"index":{"_id":1}}
{"UserAgent": "android OS is the best firefox"}
{"index":{"_id":2}}
{"UserAgent": "firefox is my favourite browser"}
{"index":{"_id":3}}
{"UserAgent": "msie 7.0 sucks and ipad rules"}

Query:

GET /test/test/_search?search_type=count
{
  "aggs": {
    "myaggregation": {
      "terms": {
        "field": "UserAgent.custom",
        "size": 10
      }
    }
  }
}

How to define a bucket aggregation where the buckets are defined by arbitrary filters on a field (GROUP BY CASE equivalent)
