My data flow is json -> parquet -> Athena, but I have run into a problem with the nested key-value field tags.
The JSON file is:
[{"myid":1,"name":"foo","tags":{"tag1":"a","tag2":"b"}},
{"myid":2,"name":"bar","tags":{"tag1":"c","tag2":"d"}}
]
The Athena table is:
CREATE EXTERNAL TABLE IF NOT EXISTS dbname.tablename (
`myid` int,
`name` string,
`tags` STRUCT < tag1 : string, tag2 : string >
)
STORED AS PARQUET
LOCATION 's3://path/to/folder'
TBLPROPERTIES (
"parquet.compress"="SNAPPY"
);
Testing with select * from dbname.tablename works fine.
However, if I replace the STRUCT with tags MAP < string, string >, the select query throws an exception:
HIVE_CANNOT_OPEN_SPLIT:
Error opening Hive split s3://path/file.snappy.parquet (offset=0, length=992):
Expected MAP column 'tags.entry' entry to have two fields, but has 1 fields
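My end goal is to import the JSON without having to write out the keys explicitly in the STRUCT of the create table statement. Any pointers?
Update: the conversion from JSON to Parquet runs on a Spark server (via databricks.com).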
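Answer 0 (score: 1)
Maps and structs look the same in JSON, but, as noted in the comments, they are stored differently in Parquet: a struct is a group with one named field per key, whereas a map is a repeated group of key/value pairs. Athena will not implicitly convert the underlying data, so there are two options: convert the data explicitly during the Spark conversion, or convert it in Athena with CTAS.
By default, Spark reads a JSON map as a struct: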
val jsonStr = """
[{"myid":1,"name":"foo","tags":{"tag1":"a","tag2":"b"}},
{"myid":2,"name":"bar","tags":{"tag1":"c","tag2":"d"}}
]
"""
val df_json = spark.read.json(Seq(jsonStr).toDS.rdd)
df_json.write.parquet(path)
val df_parquet = spark.read.parquet(path)
df_parquet.printSchema
df_parquet.show
root
 |-- myid: long (nullable = true)
 |-- name: string (nullable = true)
 |-- tags: struct (nullable = true)
 |    |-- tag1: string (nullable = true)
 |    |-- tag2: string (nullable = true)

+----+----+-----+
|myid|name| tags|
+----+----+-----+
|   1| foo|[a,b]|
|   2| bar|[c,d]|
+----+----+-----+
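The struct above comes from schema inference. As a hedged alternative that is not part of the original answer: if a schema is supplied when reading, tags can be declared as a map, so no key names have to be written anywhere. This is a minimal sketch that assumes every tag value is a string. The names `schema` and `dfMap` and the local `SparkSession` setup are illustrative, and it uses the `Dataset[String]` overload of `spark.read.json` (available from Spark 2.2).

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types._

// In spark-shell or a Databricks notebook a session already exists;
// getOrCreate() reuses it instead of starting a new one.
val spark = SparkSession.builder().master("local[*]").appName("tags-as-map").getOrCreate()
import spark.implicits._

val jsonStr = """
[{"myid":1,"name":"foo","tags":{"tag1":"a","tag2":"b"}},
{"myid":2,"name":"bar","tags":{"tag1":"c","tag2":"d"}}
]
"""

// Declare tags as map<string,string> instead of letting inference pick a struct.
// Assumption: all tag values are strings. IntegerType mirrors the `int` column
// in the Athena DDL from the question.
val schema = StructType(Seq(
  StructField("myid", IntegerType),
  StructField("name", StringType),
  StructField("tags", MapType(StringType, StringType))
))

val dfMap = spark.read.schema(schema).json(Seq(jsonStr).toDS)
dfMap.printSchema()
```

Writing `dfMap` with `.write.parquet(...)` should then produce a Parquet map column, which is the layout a MAP < string, string > column in the Athena table expects. That last step has not been checked against Athena here.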
The tags struct has to be converted to a map explicitly:
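import org.apache.spark.sql.functions.{lit, map}

df_json
  .select($"myid", $"name", map(
    lit("tag1"), $"tags.tag1",
    lit("tag2"), $"tags.tag2"
  ) as "tags")
  .write.mode("overwrite").parquet(path) // replace the struct-typed files written earlier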
val df_parquet = spark.read.parquet(path)
df_parquet.printSchema
df_parquet.show
root
 |-- myid: long (nullable = true)
 |-- name: string (nullable = true)
 |-- tags: map (nullable = true)
 |    |-- key: string
 |    |-- value: string (valueContainsNull = true)

+----+----+-------------------------+
|myid|name|tags                     |
+----+----+-------------------------+
|1   |foo |Map(tag1 -> a, tag2 -> b)|
|2   |bar |Map(tag1 -> c, tag2 -> d)|
+----+----+-------------------------+
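The conversion above still names tag1 and tag2 by hand. A possible way around that, sketched here as an assumption rather than as part of the original answer, is to take the field names from the inferred struct and build the `map(...)` arguments from them. It assumes all tag values share one type (strings here) and that field names contain no dots. The names `tagFields`, `kvCols` and `dfMap` are illustrative.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, lit, map}
import org.apache.spark.sql.types.StructType

// Reuses the existing session in spark-shell or a notebook.
val spark = SparkSession.builder().master("local[*]").appName("struct-to-map").getOrCreate()
import spark.implicits._

val jsonStr = """
[{"myid":1,"name":"foo","tags":{"tag1":"a","tag2":"b"}},
{"myid":2,"name":"bar","tags":{"tag1":"c","tag2":"d"}}
]
"""
val dfJson = spark.read.json(Seq(jsonStr).toDS)

// Field names of the inferred struct (tag1 and tag2 for the sample data).
val tagFields = dfJson.schema("tags").dataType.asInstanceOf[StructType].fieldNames

// Alternate key literal and value column: map(k1, v1, k2, v2, ...).
val kvCols = tagFields.flatMap(f => Seq(lit(f), col(s"tags.$f")))

val dfMap = dfJson.select(col("myid"), col("name"), map(kvCols: _*).as("tags"))
dfMap.printSchema()
```

One consequence of this approach: every row gets an entry for every key that appears anywhere in the input, with a null value where the record did not have that key.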
In Athena you cannot modify the existing table in place, but you can create a new table with a CTAS (CREATE TABLE AS SELECT) statement:
CREATE EXTERNAL TABLE IF NOT EXISTS temptable (
`myid` int,
`name` string,
`tags` STRUCT < tag1 : string,
tag2 : string >
) STORED AS PARQUET LOCATION 's3://xxx' TBLPROPERTIES ( "parquet.compress"="SNAPPY" );
CREATE table temptable2 AS
SELECT myid,
name,
MAP(ARRAY['tag1', 'tag2'],ARRAY[tags.tag1, tags.tag2]) AS tags
FROM temptable;
temptable2 now looks like:
CREATE EXTERNAL TABLE `temptable2`(
`myid` int COMMENT '',
`name` string COMMENT '',
`tags` map<varchar(4),string> COMMENT '')