I have a column named `uf` that contains an array of JSON objects. Here is a mock-up:
[
{"type": "browserId", "name": "", "value": "unknown"},
{"type": "campaign", "name": "", "value": "om_227dec0082a5"},
{"type": "custom", "name": "2351350529", "value": "10148"},
{"type": "custom", "name": "9501713387", "value": "true"},
{"type": "custom", "name": "9517735577", "value": "true"},
{"type": "custom", "name": "9507402548", "value": "true"},
{"type": "custom", "name": "9733902068", "value": "true"}
]
I am trying to treat these as sub-records, but for some reason I can't find the right way to unnest them first. Then I noticed that my whole array appears to be wrapped in another JSON object.
Here is where I am:
I ran a simple SELECT and noticed that the result was:
[{type=[{"type": "browserId", "name": "", "value": "ff"}, name=null, value=null}]
The column is defined as follows:
{
"Name": "uf",
"Type": "array<struct<type:string,name:string,value:string>>"
}
Is the definition incorrect, and is that why I'm getting the whole array wrapped in another JSON array?

EDIT:
Here is a sample from my CSV file (tab separated). I have spent much of the last two days checking whether something about the JSON was preventing Glue from recognizing the column as a JSON array, but when I created a new column with a simple JSON array, it was correctly typed as `array<struct<...>>`, and after querying it I got exactly the same problem as above.
timestamp project_id campaign_id experiment_id variation_id layer_holdback audience_names end_user_id uuid session_id snippet_revision user_ip user_agent user_engine user_engine_version referer global_holdback event_type event_name uf active_views event_features event_metrics event_uuid
1570326511 74971132 11089500404 11097730080 11078120202 false [] oeu1535997971348r0.4399811351004357 AUTO 6540 5.91.170.0 Mozilla/5.0 (Linux; Android 7.0; SAMSUNG SM-G925F Build/NRD90M) AppleWebKit/537.36 (KHTML, like Gecko) SamsungBrowser/9.2 Chrome/67.0.3396.87 Mobile Safari/537.36 js 0.128.0 https://www.zavamed.com/uk/account/ false view_activated 10832783364 [{"type": "browserId", "name": "", "value": "unknown"}, {"type": "device", "name": "", "value": "mobile"}, {"type": "device_type", "name": "", "value": "phone"}, {"type": "referrer", "name": "", "value": "https:\/\/www.google.co.uk\/"}, {"type": "source_type", "name": "", "value": "campaign"}, {"type": "currentTimestamp", "name": "", "value": "-1631518596"}, {"type": "offset", "name": "", "value": "-60"}] [] [] [] 4926a5f1-bbb5-4553-9d0b-b26f773fa0f4
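A quick sanity check of the type Athena actually resolves for the column (a sketch; `typeof` is a Presto/Athena function, and `my_table` is a placeholder for the real table name):

```sql
-- Sketch: inspect what type Athena sees for the uf column.
-- If it returns 'varchar', the data is being read as a plain string,
-- not as the array<struct<...>> declared in the schema.
SELECT typeof(uf) FROM my_table LIMIT 1
```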
Answer (score: 1):
I uploaded a sample CSV file with the content you provided to S3, then ran a Glue crawler over it. This is the table schema I ended up with:
CREATE EXTERNAL TABLE `question_58765672`(
`timestamp` bigint,
`project_id` bigint,
`campaign_id` bigint,
`experiment_id` bigint,
`variation_id` bigint,
`layer_holdback` boolean,
`audience_names` array<string>,
`end_user_id` string,
`uuid` string,
`session_id` string,
`snippet_revision` bigint,
`user_ip` string,
`user_agent` string,
`user_engine` string,
`user_engine_version` string,
`referer` string,
`global_holdback` boolean,
`event_type` string,
`event_name` bigint,
`uf` string,
`active_views` array<string>,
`event_features` array<string>,
`event_metrics` array<string>,
`event_uuid` string)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
STORED AS INPUTFORMAT
'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION
's3://__S3_PATH_IN_MY_BUCKET__/'
TBLPROPERTIES (
'CrawlerSchemaDeserializerVersion'='1.0',
'CrawlerSchemaSerializerVersion'='1.0',
'UPDATED_BY_CRAWLER'='some-crawler',
'areColumnsQuoted'='false',
'averageRecordSize'='553',
'classification'='csv',
'columnsOrdered'='true',
'compressionType'='none',
'delimiter'='\t',
'objectCount'='1',
'recordCount'='2',
'sizeKey'='1109',
'skip.header.line.count'='1',
'typeOfData'='file')
As you can see, it identified the column `uf` as a string, which did not surprise me. To unnest this column, I had to manually cast it to the correct type, `ARRAY(JSON)`:
SELECT
"timestamp",
_unnested_column
FROM
"stackoverflow"."question_58765672",
UNNEST( CAST(json_parse(uf) AS ARRAY(JSON)) ) AS t(_unnested_column)
Result:
timestamp _unnested_column
1 1570326511 {"name":"","type":"browserId","value":"unknown"}
2 1570326511 {"name":"","type":"device","value":"mobile"}
3 1570326511 {"name":"","type":"device_type","value":"phone"}
4 1570326511 {"name":"","type":"referrer","value":"https://www.google.co.uk/"}
5 1570326511 {"name":"","type":"source_type","value":"campaign"}
6 1570326511 {"name":"","type":"currentTimestamp","value":"-1631518596"}
7 1570326511 {"name":"","type":"offset","value":"-60"}
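From here, the individual fields can be pulled out of each unnested JSON object, e.g. with `json_extract_scalar` (a sketch building on the query above; the output column aliases are mine):

```sql
SELECT
    "timestamp",
    -- Extract each field of the JSON object as a plain string
    json_extract_scalar(_unnested_column, '$.type')  AS uf_type,
    json_extract_scalar(_unnested_column, '$.name')  AS uf_name,
    json_extract_scalar(_unnested_column, '$.value') AS uf_value
FROM
    "stackoverflow"."question_58765672",
    UNNEST( CAST(json_parse(uf) AS ARRAY(JSON)) ) AS t(_unnested_column)
```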
Then I thought about creating an Athena view in which the column `uf` would already be cast correctly:
CREATE OR REPLACE VIEW question_58765672_v1_json AS
SELECT
CAST(json_parse(uf) AS ARRAY(JSON)) as uf
-- ALL other columns from your table
FROM
"stackoverflow"."question_58765672"
However, I got the following error:

Invalid column type for column uf: Unsupported Hive type: json
My guess is that the schema of column `uf` is either too complex for the Glue crawler to identify correctly, or simply not supported by the formats in use, i.e. 'org.apache.hadoop.mapred.TextInputFormat' or 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'.
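One workaround for the unsupported `json` type in views is to cast to a concrete `ROW` type instead, which does map to a supported Hive type (a sketch, not verified against your full dataset; the view name is mine):

```sql
CREATE OR REPLACE VIEW question_58765672_v2_rows AS
SELECT
    -- Cast the JSON string to an array of typed rows rather than ARRAY(JSON),
    -- so the view's column type can be represented in the Hive metastore.
    CAST(json_parse(uf) AS ARRAY(ROW(type VARCHAR, name VARCHAR, value VARCHAR))) AS uf
    -- ... all other columns from your table
FROM
    "stackoverflow"."question_58765672"
```

The rows in the resulting array can then be unnested directly and their fields accessed with dot notation (e.g. `t._col.type`), with no further JSON parsing in downstream queries.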