雅典娜经典负载平衡器日志

时间:2019-11-18 10:25:03

标签: logging amazon-athena

我使用此link成功地查询了传统的负载平衡日志,但是随着存储桶的大小增加,我想对表进行分区。不幸的是,我无法使它正常工作,希望能得到一些建议。

以下是创建分区表的尝试,但已创建但返回了0行:

CREATE EXTERNAL TABLE `elb_logs_part`( 
`timestamp` string COMMENT '', 
  `elb_name` string COMMENT '', 
  `request_ip` string COMMENT '', 
  `request_port` int COMMENT '', 
  `backend_ip` string COMMENT '', 
  `backend_port` int COMMENT '', 
  `request_processing_time` double COMMENT '', 
  `backend_processing_time` double COMMENT '', 
  `client_response_time` double COMMENT '', 
  `elb_response_code` string COMMENT '', 
  `backend_response_code` string COMMENT '', 
  `received_bytes` bigint COMMENT '', 
  `sent_bytes` bigint COMMENT '', 
  `request_verb` string COMMENT '', 
  `url` string COMMENT '', 
  `protocol` string COMMENT '', 
  `user_agent` string COMMENT '', 
  `ssl_cipher` string COMMENT '', 
  `ssl_protocol` string COMMENT '')
PARTITIONED BY(year string, month string, day string)

ROW FORMAT SERDE 
  'org.apache.hadoop.hive.serde2.RegexSerDe' 

WITH SERDEPROPERTIES ( 
  'input.regex'='([^ ]*) ([^ ]*) ([^ ]*):([0-9]*) ([^ ]*)[:-]([0-9]*) ([-.0-9]*) ([-.0-9]*) ([-.0-9]*) (|[-0-9]*) (-|[-0-9]*) ([-0-9]*) ([-0-9]*) \\\"([^ ]*) ([^ ]*) (- |[^ ]*)\\\" (\"[^\"]*\") ([A-Z0-9-]+) ([A-Za-z0-9.-]*)$') 
STORED AS INPUTFORMAT 
  'org.apache.hadoop.mapred.TextInputFormat' 
OUTPUTFORMAT 
  'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION
  's3://bucketname/AWSLogs/XXXXXXXX/elasticloadbalancing'
TBLPROPERTIES (
  'transient_lastDdlTime'='1555268331')

我尝试编辑以将区域添加到位置的末尾(这看起来很合逻辑,但是上面的内容是根据雅典娜创建的create查询创建的,所以我同意了)。

我也尝试使用以下方法来更改现有表:

ALTER TABLE elb_logs.elb_logs ADD 
   PARTITION(year='2019' month = '11', day = '18') location 's3://buckets/bucketname/AWSLogs/XXXXXXXXXX/elasticloadbalancing/eu-west-1/2019/11/17'
   PARTITION(year='2019' month = '11', day = '17') location 's3://buckets/bucketname/AWSLogs/XXXXXXXXXX/elasticloadbalancing/eu-west-1/2019/11/17'
   PARTITION(year='2019' month = '11', day = '16') location 's3://buckets/bucketname/AWSLogs/XXXXXXXXXX/elasticloadbalancing/eu-west-1/2019/11/17'

不幸的是,这会产生错误:

  

第2行:4:在“分区”中缺少“列””

我不理解,因为上面的内容直接来自文档。我想这是由于未定义分区或其他原因造成的!!?

对不起我的所有新手,谁能帮助我使用Athena对存储在s3中的经典负载均衡器日志进行分区?

真的需要发现不断抓取我的网站的虫子,而我认为无意中使我们离线!

2 个答案:

答案 0 :(得分:1)

指定,后,您缺少逗号(year)。以下语句导致查询成功

ALTER TABLE elb_logs.elb_logs ADD 
   PARTITION(year='2019', month = '11', day = '18') location 's3://buckets/bucketname/AWSLogs/XXXXXXXXXX/elasticloadbalancing/eu-west-1/2019/11/17'
   PARTITION(year='2019', month = '11', day = '17') location 's3://buckets/bucketname/AWSLogs/XXXXXXXXXX/elasticloadbalancing/eu-west-1/2019/11/17'
   PARTITION(year='2019', month = '11', day = '16') location 's3://buckets/bucketname/AWSLogs/XXXXXXXXXX/elasticloadbalancing/eu-west-1/2019/11/17'

答案 1 :(得分:0)

要对数据进行分区,我必须执行两个步骤,首先创建一个新的分区表,但指向错误的位置,然后添加正确的分区。

第1步:创建分区表(请注意,该位置实际上不理想,不包括该区域)

CREATE EXTERNAL TABLE `elb_logs_part`(
  `timestamp` string COMMENT '', 
  `elb_name` string COMMENT '', 
  `request_ip` string COMMENT '', 
  `request_port` int COMMENT '', 
  `backend_ip` string COMMENT '', 
  `backend_port` int COMMENT '', 
  `request_processing_time` double COMMENT '', 
  `backend_processing_time` double COMMENT '', 
  `client_response_time` double COMMENT '', 
  `elb_response_code` string COMMENT '', 
  `backend_response_code` string COMMENT '', 
  `received_bytes` bigint COMMENT '', 
  `sent_bytes` bigint COMMENT '', 
  `request_verb` string COMMENT '', 
  `url` string COMMENT '', 
  `protocol` string COMMENT '', 
  `user_agent` string COMMENT '', 
  `ssl_cipher` string COMMENT '', 
  `ssl_protocol` string COMMENT '')
  PARTITIONED BY (year string, month string, day string)
ROW FORMAT SERDE 
  'org.apache.hadoop.hive.serde2.RegexSerDe' 
WITH SERDEPROPERTIES ( 
  'input.regex'='([^ ]*) ([^ ]*) ([^ ]*):([0-9]*) ([^ ]*)[:-]([0-9]*) ([-.0-9]*) ([-.0-9]*) ([-.0-9]*) (|[-0-9]*) (-|[-0-9]*) ([-0-9]*) ([-0-9]*) \\\"([^ ]*) ([^ ]*) (- |[^ ]*)\\\" (\"[^\"]*\") ([A-Z0-9-]+) ([A-Za-z0-9.-]*)$') 
STORED AS INPUTFORMAT 
  'org.apache.hadoop.mapred.TextInputFormat' 
OUTPUTFORMAT 
  'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION
  's3://bucketname/AWSLogs/XXXXXXXXX/elasticloadbalancing'
TBLPROPERTIES (
  'transient_lastDdlTime'='1555268331')

第2步:按以下方式对数据进行分区(感谢Ilya找出错字)

ALTER TABLE elb_logs.elb_logs_part ADD 
   PARTITION(year='2019', month = '11', day = '18') location 's3://bucketname/AWSLogs/XXXXXXXXX/elasticloadbalancing/region/2019/11/18'
   PARTITION(year='2019', month = '11', day = '17') location 's3://bucketname/AWSLogs/XXXXXXXXX/elasticloadbalancing/region/2019/11/17'
   PARTITION(year='2019', month = '11', day = '16') location 's3://bucketname/AWSLogs/XXXXXXXXX/elasticloadbalancing/region/2019/11/16'

这对我有用。