Hive - Sum of related values

Time: 2017-03-26 16:24:19

Tags: sql amazon-web-services count presto amazon-athena

I am using AWS Athena to filter load balancer logs. I created the following table and imported the logs into it.

CREATE EXTERNAL TABLE IF NOT EXISTS elb_logs  (
  request_timestamp string,
  elb_response_code string,
  url string
)

ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
         'serialization.format' = '1','input.regex' = '([^ ]*) ([^ ]*) ([^ ]*):([0-9]*) ([^ ]*)[:\-]([0-9]*) ([-.0-9]*) ([-.0-9]*) ([-.0-9]*) (|[-0-9]*) (-|[-0-9]*) ([-0-9]*) ([-0-9]*) \\\"([^ ]*) ([^ ]*) (- |[^ ]*)\\\" (\"[^\"]*\") ([A-Z0-9-]+) ([A-Za-z0-9.-]*)$' )
LOCATION 's3://athena-examples/elb/raw/';

Now I want counts of the 200 OK, 400, and 500 responses, so I ran the following query.

SELECT distinct(elb_response_code),
         count(url) AS count
FROM elb_logs
GROUP BY  elb_response_code

It works, but it returns every response code separately, as shown below.

**response  count**
401   1270
201   1369
422   342
200   3568727
400   1221
404   444
304   10435
413   3
206   30
500   1542

I want to add up the counts for all 4xx codes (400, 401, 404, 413, 422) into a single 4xx total, and do the same for 2xx, 3xx, and 5xx, so the result should look like this:

**response  count**
4xx           52145  
2xx           1363224
5xx           532

1 Answer:

Answer 0: (score: 0)

Assuming all codes are 3 characters long:

select      substr (elb_response_code,1,1) || 'xx' as elb_response_code_prefix
           ,count(*)                               as cnt

from        elb_logs

group by    1
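
As a quick check against the numbers in the question, the sketch below applies the same prefix expression to the per-code counts listed there, using an inline VALUES list in place of the real table. The alias `per_code` and the input column `cnt` are names made up for this illustration, and `sum(cnt)` replaces `count(*)` because each inline row is already a count.

```sql
-- Per-code counts copied from the question's output.
-- per_code and cnt are hypothetical names used only for this sketch.
select      substr (elb_response_code,1,1) || 'xx' as elb_response_code_prefix
           ,sum(cnt)                               as cnt

from        (values ('401', 1270), ('201', 1369), ('422', 342)
                   ,('200', 3568727), ('400', 1221), ('404', 444)
                   ,('304', 10435), ('413', 3), ('206', 30)
                   ,('500', 1542)
            ) as per_code (elb_response_code, cnt)

group by    1
order by    1
```

With those inputs the groups should come out as 2xx = 3570126, 3xx = 10435, 4xx = 3280, and 5xx = 1542.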

This is a more general solution, which does not depend on the length of the code:

select      rpad (substr (elb_response_code,1,1),length(elb_response_code),'x') 
                      as elb_response_code_prefix
           ,count(*)  as cnt

from        elb_logs

group by    1
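
To see where the two expressions differ, here is a small sketch that evaluates both side by side on a few inline sample values. The alias `samples`, the column names `fixed_prefix` and `padded_prefix`, and the values '-' and '1000' are hypothetical, chosen only to include inputs whose length is not 3; they are not taken from the logs in the question.

```sql
-- samples and its values are made up for illustration.
select      elb_response_code
           ,substr (elb_response_code,1,1) || 'xx'  as fixed_prefix
           ,rpad (substr (elb_response_code,1,1),length(elb_response_code),'x')
                                                    as padded_prefix

from        (values ('200'), ('404'), ('-'), ('1000')
            ) as samples (elb_response_code)
```

For the 3-character codes both columns agree ('2xx' and '4xx'). For '-' the fixed version yields '-xx' while the padded version keeps '-', and for '1000' the fixed version yields '1xx' while the padded version yields '1xxx', so the padded form preserves the original length of each value.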