Question

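I have a .txt file in which one column contains arrays of comma-separated strings enclosed in square brackets, and I want to run some analysis on it in AWS Athena / QS. The raw data looks like this: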

col_id    col2
1         ["string1", "string2", "string3", "string4"] 
2         ["string1", "string2"]
3         ["string1", "string2", "string3"]
...

I created a table in Athena with the following:

create external table db.xx (
    col1 string,
    col2 array<string>
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
WITH SERDEPROPERTIES (
  'serialization.format' = '  ',
  'field.delim' = ' ',
  'collection.delim' = ','
) LOCATION 's3://xxx'
TBLPROPERTIES ("skip.header.line.count"="1");

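The table is created successfully, and the column is recognized as an array data type.

However, I cannot access the elements of the array.

select element_at(col2, 1) from table returns: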

string1, string2, string3, string4
string1, string2
string1, string2, string3

I also tried removing the [] and "" from the raw data, but I still get the same result.

Answer 1

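CSV has no array type, and there are many ways to encode an array in it. Unfortunately, even if you declare a column as array<string>, Athena will not automatically figure out how the data is encoded.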

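There is a workaround, though: use string as the column type, and then at query time either cast the value to JSON (the arrays look like they are encoded the way JSON encodes an array of strings) or use one of the many JSON functions to extract values from the array.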

Create the table like this:

create external table db.xx (
    col1 string,
    col2 string
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
WITH SERDEPROPERTIES (
  'serialization.format' = '  ',
  'field.delim' = ' ',
  'collection.delim' = ','
) LOCATION 's3://xxx'
TBLPROPERTIES ("skip.header.line.count"="1");

Then query it like this:

SELECT
  col1,
  json_array_get(col2, 0)
FROM db.xx
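
The other option mentioned above, casting the value to a real array, could look roughly like the sketch below. It assumes that col2 arrives in each row as one complete JSON array string (for example ["string1", "string2"]), which depends on the field delimiter not splitting the value. The aliases first_item, col2_array, t, and item are made up for illustration. Note that json_array_get takes a zero-based index, while element_at on an array is one-based.

```sql
-- Sketch only: assumes col2 holds a full JSON array string per row.

-- Extract the first element as a plain varchar,
-- then parse the whole string into a real array(varchar).
SELECT
  col1,
  json_extract_scalar(col2, '$[0]') AS first_item,
  CAST(json_parse(col2) AS array(varchar)) AS col2_array
FROM db.xx;

-- Expand the array into one row per element, which is often
-- easier to aggregate on for analysis.
SELECT
  col1,
  item
FROM db.xx
CROSS JOIN UNNEST(CAST(json_parse(col2) AS array(varchar))) AS t (item);
```

Once the value has been cast to array(varchar), the usual array functions such as element_at and cardinality should work on it.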
