I used this link to successfully query classic load balancer logs, but as the bucket has grown I would like to partition the table. Unfortunately I cannot get it working and would appreciate some advice.
Here is my attempt at creating a partitioned table. The table is created, but it returns 0 rows:
CREATE EXTERNAL TABLE `elb_logs_part`(
`timestamp` string COMMENT '',
`elb_name` string COMMENT '',
`request_ip` string COMMENT '',
`request_port` int COMMENT '',
`backend_ip` string COMMENT '',
`backend_port` int COMMENT '',
`request_processing_time` double COMMENT '',
`backend_processing_time` double COMMENT '',
`client_response_time` double COMMENT '',
`elb_response_code` string COMMENT '',
`backend_response_code` string COMMENT '',
`received_bytes` bigint COMMENT '',
`sent_bytes` bigint COMMENT '',
`request_verb` string COMMENT '',
`url` string COMMENT '',
`protocol` string COMMENT '',
`user_agent` string COMMENT '',
`ssl_cipher` string COMMENT '',
`ssl_protocol` string COMMENT '')
PARTITIONED BY(year string, month string, day string)
ROW FORMAT SERDE
'org.apache.hadoop.hive.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
'input.regex'='([^ ]*) ([^ ]*) ([^ ]*):([0-9]*) ([^ ]*)[:-]([0-9]*) ([-.0-9]*) ([-.0-9]*) ([-.0-9]*) (|[-0-9]*) (-|[-0-9]*) ([-0-9]*) ([-0-9]*) \\\"([^ ]*) ([^ ]*) (- |[^ ]*)\\\" (\"[^\"]*\") ([A-Z0-9-]+) ([A-Za-z0-9.-]*)$')
STORED AS INPUTFORMAT
'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION
's3://bucketname/AWSLogs/XXXXXXXX/elasticloadbalancing'
TBLPROPERTIES (
'transient_lastDdlTime'='1555268331')
I tried editing the LOCATION to add the region on the end (that seemed logical, but the statement above was based on the CREATE query that Athena itself generated, so I went along with it).
I also tried to add partitions to the existing table with:
ALTER TABLE elb_logs.elb_logs ADD
PARTITION(year='2019' month = '11', day = '18') location 's3://buckets/bucketname/AWSLogs/XXXXXXXXXX/elasticloadbalancing/eu-west-1/2019/11/17'
PARTITION(year='2019' month = '11', day = '17') location 's3://buckets/bucketname/AWSLogs/XXXXXXXXXX/elasticloadbalancing/eu-west-1/2019/11/17'
PARTITION(year='2019' month = '11', day = '16') location 's3://buckets/bucketname/AWSLogs/XXXXXXXXXX/elasticloadbalancing/eu-west-1/2019/11/17'
Unfortunately this produces the error:
line 2:4: missing 'column' at 'partition'
I don't understand this, since the statement above comes straight from the documentation. I assume it is because the partitions are not defined, or something like that!?
Apologies for all the newbie questions. Can anyone help me partition classic load balancer logs stored in S3 using Athena?
I really need to find the bots that keep crawling my site and, I think, inadvertently taking us offline!
Answer 0 (score: 1)
You are missing a comma after year in each PARTITION clause. The following statement results in a successful query:
ALTER TABLE elb_logs.elb_logs ADD
PARTITION(year='2019', month = '11', day = '18') location 's3://buckets/bucketname/AWSLogs/XXXXXXXXXX/elasticloadbalancing/eu-west-1/2019/11/18'
PARTITION(year='2019', month = '11', day = '17') location 's3://buckets/bucketname/AWSLogs/XXXXXXXXXX/elasticloadbalancing/eu-west-1/2019/11/17'
PARTITION(year='2019', month = '11', day = '16') location 's3://buckets/bucketname/AWSLogs/XXXXXXXXXX/elasticloadbalancing/eu-west-1/2019/11/16'
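If the ALTER TABLE succeeds but queries still return nothing, it can be worth confirming which partitions Athena has actually registered. A minimal check, assuming the elb_logs.elb_logs table name used above:
-- List the partitions currently registered for the table
SHOW PARTITIONS elb_logs.elb_logs;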
Answer 1 (score: 0)
To partition the data I had to take two steps: first create a new partitioned table, still pointing at the not-quite-right location, and then add the correct partitions.
Step 1: create the partitioned table (note that the LOCATION is not actually ideal, since it does not include the region)
CREATE EXTERNAL TABLE `elb_logs_part`(
`timestamp` string COMMENT '',
`elb_name` string COMMENT '',
`request_ip` string COMMENT '',
`request_port` int COMMENT '',
`backend_ip` string COMMENT '',
`backend_port` int COMMENT '',
`request_processing_time` double COMMENT '',
`backend_processing_time` double COMMENT '',
`client_response_time` double COMMENT '',
`elb_response_code` string COMMENT '',
`backend_response_code` string COMMENT '',
`received_bytes` bigint COMMENT '',
`sent_bytes` bigint COMMENT '',
`request_verb` string COMMENT '',
`url` string COMMENT '',
`protocol` string COMMENT '',
`user_agent` string COMMENT '',
`ssl_cipher` string COMMENT '',
`ssl_protocol` string COMMENT '')
PARTITIONED BY (year string, month string, day string)
ROW FORMAT SERDE
'org.apache.hadoop.hive.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
'input.regex'='([^ ]*) ([^ ]*) ([^ ]*):([0-9]*) ([^ ]*)[:-]([0-9]*) ([-.0-9]*) ([-.0-9]*) ([-.0-9]*) (|[-0-9]*) (-|[-0-9]*) ([-0-9]*) ([-0-9]*) \\\"([^ ]*) ([^ ]*) (- |[^ ]*)\\\" (\"[^\"]*\") ([A-Z0-9-]+) ([A-Za-z0-9.-]*)$')
STORED AS INPUTFORMAT
'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION
's3://bucketname/AWSLogs/XXXXXXXXX/elasticloadbalancing'
TBLPROPERTIES (
'transient_lastDdlTime'='1555268331')
Step 2: add the partitions as follows (thanks to Ilya for spotting the typo):
ALTER TABLE elb_logs.elb_logs_part ADD
PARTITION(year='2019', month = '11', day = '18') location 's3://bucketname/AWSLogs/XXXXXXXXX/elasticloadbalancing/region/2019/11/18'
PARTITION(year='2019', month = '11', day = '17') location 's3://bucketname/AWSLogs/XXXXXXXXX/elasticloadbalancing/region/2019/11/17'
PARTITION(year='2019', month = '11', day = '16') location 's3://bucketname/AWSLogs/XXXXXXXXX/elasticloadbalancing/region/2019/11/16'
This worked for me.
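As a rough illustration of why the partitions help (a sketch only; the table and partition columns come from the DDL above, and the date values are just examples), a query that filters on the partition columns makes Athena scan only that day's log objects, which is what makes hunting for noisy crawlers practical:
-- Count requests per user agent for a single day's partition,
-- so Athena only scans that day's objects in S3.
SELECT user_agent,
       count(*) AS requests
FROM elb_logs.elb_logs_part
WHERE year = '2019'
  AND month = '11'
  AND day = '18'
GROUP BY user_agent
ORDER BY requests DESC
LIMIT 20;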