Question

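I have some JSON files that I am trying to query using AWS Athena / Glue. There is one record per file. First, I had a Glue crawler look at the files. The schema Glue inferred is the same as mine, except that timeline is a struct rather than the map<string, string> I currently have.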

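The timeline data isn't really useful to me; it just needs to be queryable without errors. If possible, I'd like to avoid writing an ETL job (as described in this question/answer) to strip/flatten/change the timeline data, but if I have to, I will.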

Data in a JSON file:

{
  "id": "0093f8ee-406d-49a6-96c0-0ae43eb6a94e",
  "handlerId": "323d11be7e5f720224b9935a6476ebfd",
  "handlerUrn": "urn::::stack:AWS:EC2:Patching",
  "contextUrn": "urn::test:aab:aws:533:us-east-1:ec2:instance/i-07",
  "urn": "urn::test:aab:process:0093f8ee-406d-49a6-96c0-0ae43eb6a94e",
  "timeline": {
    "2019-05-17T16:55:06.715Z": "NEW",
    "2019-05-17T16:55:06.862Z": "READY",
    "2019-05-17T16:55:07.186Z": "WAITING",
    "2019-05-17T16:55:07.895Z": "RUNNING",
    "2019-05-17T17:03:09.775Z": "TERMINATED"
  },
  "state": "TERMINATED",
  "timestamp": "2019-05-17T17:03:09.775Z"
}
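
For reference, the kind of flattening an ETL step would have to do on a record like the one above can be sketched in Python. This is only a minimal sketch, not a Glue job: flatten_timeline and the "at"/"state" field names are hypothetical names picked for illustration, and the sample record is trimmed to a few fields.

```python
import json

def flatten_timeline(record):
    """Return a copy of `record` whose timeline object is replaced by a
    list of {"at": ..., "state": ...} entries, so that the timestamps
    become values instead of key names."""
    flat = dict(record)  # shallow copy; the input record is left untouched
    timeline = record.get("timeline") or {}
    # The keys are ISO-8601 strings in one fixed format, so sorting them
    # as plain strings also sorts them chronologically.
    flat["timeline"] = [
        {"at": ts, "state": state} for ts, state in sorted(timeline.items())
    ]
    return flat

# Trimmed copy of the sample record (illustration only).
sample = json.loads("""
{
  "id": "0093f8ee-406d-49a6-96c0-0ae43eb6a94e",
  "timeline": {
    "2019-05-17T16:55:07.895Z": "RUNNING",
    "2019-05-17T16:55:06.715Z": "NEW",
    "2019-05-17T17:03:09.775Z": "TERMINATED"
  },
  "state": "TERMINATED",
  "timestamp": "2019-05-17T17:03:09.775Z"
}
""")

result = flatten_timeline(sample)
# json.dumps without `indent` writes the whole record on a single line.
print(json.dumps(result))
```

With the data in that shape, the column would presumably be declared as an array of structs rather than a map, which is exactly the extra step I am hoping to avoid.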

This is the schema I have now:

CREATE EXTERNAL TABLE IF NOT EXISTS processes00.processesmap
 (id string,
  handlerId string,
  handlerUrn string,
  contextUrn string,
  urn string,
  timeline map<string, string>,
  state string,
  `timestamp` timestamp
  )
ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe'
STORED AS INPUTFORMAT 'org.apache.hadoop.mapred.TextInputFormat' 
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.IgnoreKeyTextOutputFormat'
LOCATION 's3://Processes/0/0/';

I am trying to treat the ugly timestamp key names as plain strings by declaring timeline as map<string, string>, but I don't know how well that works.

The simple answer is "don't use timestamps as key names", and I wish I could change that.

What is the correct Glue schema for handling timestamps in JSON key names?

0 answers: