Athena: is the data type struct<tag1:string,tag2:string> equivalent to map<string,string>?

Date: 2019-11-23 12:00:37

Tags: amazon-athena

My data flow is JSON -> Parquet -> Athena, but I have run into a problem with the nested key-value field tags.

The JSON file is:

[{"myid":1,"name":"foo","tags":{"tag1":"a","tag2":"b"}},
 {"myid":2,"name":"bar","tags":{"tag1":"c","tag2":"d"}}
]

The Athena table is:

CREATE EXTERNAL TABLE IF NOT EXISTS dbname.tablename (
  `myid` int,
  `name` string,
  `tags` STRUCT < tag1 : string, tag2 : string >
)
STORED AS PARQUET
LOCATION 's3://path/to/folder'
TBLPROPERTIES (
  "parquet.compress"="SNAPPY"
);

Testing with select * from dbname.tablename works fine.

However, if I replace the STRUCT with tags MAP < string, string >, the select query throws an exception:

HIVE_CANNOT_OPEN_SPLIT:
Error opening Hive split s3://path/file.snappy.parquet (offset=0, length=992):
Expected MAP column 'tags.entry' entry to have two fields, but has 1 fields

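My end goal is to import the JSON without explicitly writing out the keys in the STRUCT definition. Any pointers?

Update: on a Spark server (via databricks.com), the step that converts the JSON to Parquet is as follows: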

create table

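1 answer:

Answer 0 (score: 1)

A map and a struct look the same in JSON but, as noted in the comments, they are not stored the same way in Parquet. Athena cannot implicitly convert the underlying data, so you have two options: convert the data explicitly during the Spark conversion, or convert it in Athena with CTAS.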
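To illustrate the difference in layout, here is a minimal sketch in plain Python (not Spark or Parquet code). It assumes the usual Parquet convention that a struct has one named child column per key, while a map is a repeated entry with exactly two fields, key and value. The helper name `struct_to_map_entries` is hypothetical.

```python
import json


def struct_to_map_entries(struct_value):
    """Reshape a struct-like dict into the list of key/value entries a map column holds."""
    return [{"key": k, "value": v} for k, v in struct_value.items()]


records = json.loads("""
[{"myid":1,"name":"foo","tags":{"tag1":"a","tag2":"b"}},
 {"myid":2,"name":"bar","tags":{"tag1":"c","tag2":"d"}}]
""")

# Struct layout: the set of keys is part of the schema (one child column per key).
struct_columns = sorted({key for r in records for key in r["tags"]})

# Map layout: the schema only knows "key" and "value"; the keys themselves are data.
map_rows = [struct_to_map_entries(r["tags"]) for r in records]

print(struct_columns)
print(map_rows[0])
```

Because the two shapes differ on disk, declaring the column as MAP over files written with the struct shape cannot work without rewriting the data.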

Converting the struct to a map in Spark

By default, Spark reads a JSON object as a struct:

import spark.implicits._  // needed for toDS and the $"col" syntax

val jsonStr = """
[{"myid":1,"name":"foo","tags":{"tag1":"a","tag2":"b"}},
 {"myid":2,"name":"bar","tags":{"tag1":"c","tag2":"d"}}
]
"""

// `path` is the Parquet output location, defined elsewhere
val df_json = spark.read.json(Seq(jsonStr).toDS)
df_json.write.parquet(path)
val df_parquet = spark.read.parquet(path)
df_parquet.printSchema
df_parquet.show

root
 |-- myid: long (nullable = true)
 |-- name: string (nullable = true)
 |-- tags: struct (nullable = true)
 |    |-- tag1: string (nullable = true)
 |    |-- tag2: string (nullable = true)
+----+----+-----+
|myid|name| tags|
+----+----+-----+
|   1| foo|[a,b]|
|   2| bar|[c,d]|
+----+----+-----+

The tags struct has to be converted to a map explicitly:

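import org.apache.spark.sql.functions.{lit, map}

// Build the map column from the struct's fields, one key/value pair at a time
df_json
    .select($"myid", $"name", map(
        lit("tag1"), $"tags.tag1",
        lit("tag2"), $"tags.tag2"
    ) as "tags")
    .write.mode("overwrite").parquet(path)  // replace the struct version written earlier
val df_parquet = spark.read.parquet(path)
df_parquet.printSchema
df_parquet.show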

root
 |-- myid: long (nullable = true)
 |-- name: string (nullable = true)
 |-- tags: map (nullable = true)
 |    |-- key: string
 |    |-- value: string (valueContainsNull = true)
+----+----+-------------------------+
|myid|name|tags                     |
+----+----+-------------------------+
|1   |foo |Map(tag1 -> a, tag2 -> b)|
|2   |bar |Map(tag1 -> c, tag2 -> d)|
+----+----+-------------------------+

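Converting the struct to a map with CTAS

In Athena you cannot change an existing table in place, but you can create a new table with a CTAS (CREATE TABLE AS SELECT) statement: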

CREATE EXTERNAL TABLE IF NOT EXISTS temptable (
  `myid` int,
  `name` string,
  `tags` STRUCT < tag1 : string, tag2 : string >
)
STORED AS PARQUET
LOCATION 's3://xxx'
TBLPROPERTIES (
  "parquet.compress"="SNAPPY"
);

CREATE TABLE temptable2 AS
SELECT myid,
       name,
       MAP(ARRAY['tag1', 'tag2'], ARRAY[tags.tag1, tags.tag2]) AS tags
FROM temptable;

temptable2 now looks like:

CREATE EXTERNAL TABLE `temptable2`(
  `myid` int COMMENT '', 
  `name` string COMMENT '', 
  `tags` map<varchar(4),string> COMMENT '')
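
The CTAS statement in this answer still lists the tag keys by hand. If the keys are known to a script (for example, collected from the JSON files beforehand), the statement could be generated instead. The following is a minimal sketch in plain Python under that assumption. `build_map_ctas` and its parameters are hypothetical names, and the sketch assumes the keys are simple identifiers that need no quoting or escaping.

```python
def build_map_ctas(source_table, target_table, struct_column, keys, other_columns):
    """Return a CTAS statement that rebuilds a struct column as a map column."""
    # ARRAY['tag1', 'tag2'] : the map keys, as string literals
    key_array = ", ".join("'{}'".format(k) for k in keys)
    # ARRAY[tags.tag1, tags.tag2] : the map values, read from the struct fields
    value_array = ", ".join("{}.{}".format(struct_column, k) for k in keys)
    select_list = list(other_columns) + [
        "MAP(ARRAY[{}], ARRAY[{}]) AS {}".format(key_array, value_array, struct_column)
    ]
    return "CREATE TABLE {} AS\nSELECT {}\nFROM {};".format(
        target_table, ",\n       ".join(select_list), source_table
    )


sql = build_map_ctas(
    source_table="temptable",
    target_table="temptable2",
    struct_column="tags",
    keys=["tag1", "tag2"],
    other_columns=["myid", "name"],
)
print(sql)
```

The generated text has the same shape as the hand-written CTAS above. The source table still has to declare every key in its STRUCT, so this only removes the repetition in the SELECT.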