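I am trying to set up a table in Athena using partition projection. My logs are stored under s3://bucket/folder/year/month/day/hour, and each hour prefix contains a JSON file.
I tried to create the table with partition projection as follows: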
CREATE EXTERNAL TABLE `waf_logs_webacl1`(
`timestamp` bigint,
`formatversion` int,
`webaclid` string,
`terminatingruleid` string,
`terminatingruletype` string,
`action` string,
`terminatingrulematchdetails` array<
struct<
conditiontype:string,
location:string,
matcheddata:array<string>
>
>,
`httpsourcename` string,
`httpsourceid` string,
`rulegrouplist` array<
struct<
rulegroupid:string,
terminatingrule:struct<
ruleid:string,
action:string,
rulematchdetails:string
>,
nonterminatingmatchingrules:array<
struct<
ruleid:string,
action:string,
rulematchdetails:array<
struct<
conditiontype:string,
location:string,
matcheddata:array<string>
>
>
>
>,
excludedrules:array<
struct<
ruleid:string,
exclusiontype:string
>
>
>
>,
`ratebasedrulelist` array<
struct<
ratebasedruleid:string,
limitkey:string,
maxrateallowed:int
>
>,
`nonterminatingmatchingrules` array<
struct<
ruleid:string,
action:string
>
>,
`requestheadersinserted` string,
`responsecodesent` string,
`httprequest` struct<
clientip:string,
country:string,
headers:array<
struct<
name:string,
value:string
>
>,
uri:string,
args:string,
httpversion:string,
httpmethod:string,
requestid:string
>,
`labels` array<
struct<
name:string
>
>
)
PARTITIONED BY
(
day STRING
)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
STORED AS INPUTFORMAT 'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION 's3://bucket/folder/'
TBLPROPERTIES
(
"projection.enabled" = "true",
"projection.day.type" = "date",
"projection.day.range" = "2021/01/01,NOW",
"projection.day.format" = "yyyy/MM/dd/HH",
"projection.day.interval" = "1",
"projection.day.interval.unit" = "YEARS",
"storage.location.template" = "s3://bucket/folder/${year}/${month}/${day}/${hour}/"
)
The table was created successfully, but when I load all of its partitions I get this error:
Partitions not in metastore: waf_logs_webacl1:2021/05/16/23 waf_logs_webacl1:2021/05/17/00 waf_logs_webacl1:2021/05/17/01 waf_logs_webacl1:2021/05/17/02 waf_logs_webacl1:2021/05/17/03 etc
I also tried setting storage.location.template to s3://bucket/folder/ and to s3://bucket/folder/${year}/, and got the same error when loading partitions. Please help, thanks.
Answer 0 (score: 0)
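When you use partition projection you do not need to load partitions. Athena works out the partitions from the table properties when the query runs, instead of looking them up in the metastore.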
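As a rough illustration of querying without any partition-loading step: the sketch below assumes the table has been recreated with four string partition keys named year, month, day and hour (hypothetical names taken from the storage location template, with zero-padded values also being an assumption). Filtering on those keys should let Athena read only the matching S3 prefixes.

```sql
-- Sketch: query a projection-enabled table directly, with no partition loading beforehand.
-- Assumes string partition keys year/month/day/hour with zero-padded values.
SELECT action,
       count(*) AS requests
FROM waf_logs_webacl1
WHERE year = '2021'
  AND month = '05'
  AND day = '17'
GROUP BY action
ORDER BY requests DESC
```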
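The problem with your table is that it has a single partition key, day, but the storage location template tells Athena that the data is stored in a directory structure containing /${year}/${month}/${day}/${hour}/, which implies four partition keys.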
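Either create the table with all four partition keys and configure partition projection for each of them (projection.year.type and so on), or remove the undefined keys from the storage location template.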
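A minimal sketch of the first option, assuming the month, day and hour prefixes are zero-padded to two digits. The column list is cut down for brevity (the full schema from the question would go in its place), and the table name waf_logs_webacl1_by_hour, the upper bound 2030 on the year range, and the use of the integer projection type with digits are all illustrative assumptions rather than anything stated in the question.

```sql
-- Sketch only: four partition keys matching s3://bucket/folder/<year>/<month>/<day>/<hour>/.
-- Table name, year range and the reduced column list are assumptions for illustration.
CREATE EXTERNAL TABLE `waf_logs_webacl1_by_hour`(
  `timestamp` bigint,
  `action` string,
  `httpsourceid` string
)
PARTITIONED BY (
  `year` string,
  `month` string,
  `day` string,
  `hour` string
)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
STORED AS INPUTFORMAT 'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION 's3://bucket/folder/'
TBLPROPERTIES (
  "projection.enabled" = "true",
  "projection.year.type" = "integer",
  "projection.year.range" = "2021,2030",
  "projection.month.type" = "integer",
  "projection.month.range" = "1,12",
  "projection.month.digits" = "2",
  "projection.day.type" = "integer",
  "projection.day.range" = "1,31",
  "projection.day.digits" = "2",
  "projection.hour.type" = "integer",
  "projection.hour.range" = "0,23",
  "projection.hour.digits" = "2",
  "storage.location.template" = "s3://bucket/folder/${year}/${month}/${day}/${hour}/"
)
```

Every placeholder in the template now corresponds to a declared partition key, which is the mismatch the original DDL had. The integer type does not accept NOW in its range, hence the fixed upper bound on the year.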
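I think the former is the right approach, since that is how the data is organized. The Athena documentation has an example that you should be able to use as a starting point: https://docs.aws.amazon.com/athena/latest/ug/partition-projection-kinesis-firehose-example.html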
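For comparison, the Firehose example in the documentation, if I recall it correctly, uses a single date-typed partition key whose format spans the whole year/month/day/hour prefix. Here is a sketch of that variant adapted to this bucket layout, again with a reduced column list. The key name datehour and the table name are assumptions. Unlike the DDL in the question, the range includes the hour so that it matches the format, the interval unit is HOURS, and the template references only the one key that is defined.

```sql
-- Sketch only: one date-typed partition key covering year/month/day/hour.
-- Key name `datehour`, table name and the reduced column list are assumptions.
CREATE EXTERNAL TABLE `waf_logs_webacl1_datehour`(
  `timestamp` bigint,
  `action` string,
  `httpsourceid` string
)
PARTITIONED BY (
  `datehour` string
)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
STORED AS INPUTFORMAT 'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION 's3://bucket/folder/'
TBLPROPERTIES (
  "projection.enabled" = "true",
  "projection.datehour.type" = "date",
  "projection.datehour.range" = "2021/01/01/00,NOW",
  "projection.datehour.format" = "yyyy/MM/dd/HH",
  "projection.datehour.interval" = "1",
  "projection.datehour.interval.unit" = "HOURS",
  "storage.location.template" = "s3://bucket/folder/${datehour}/"
)
```

With this layout a day's worth of logs would be selected with a range filter such as datehour >= '2021/05/17/00' AND datehour < '2021/05/18/00'.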