Question

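I have an Athena table where some fields have a fairly complex nested format. The backing records in S3 are JSON. Along these lines (but we have several more levels of nesting):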

CREATE EXTERNAL TABLE IF NOT EXISTS test (
  timestamp double,
  stats array<struct<time:double, mean:double, var:double>>,
  dets array<struct<coords: array<double>, header:struct<frame:int, 
    seq:int, name:string>>>,
  pos struct<x:double, y:double, theta:double>
)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
WITH SERDEPROPERTIES ('ignore.malformed.json'='true')
LOCATION 's3://test-bucket/test-folder/'

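Now we need to be able to query the data and import the results into Python for analysis. Because of security restrictions I can't connect directly to Athena; I need to be able to give someone the query, and then they will give me the CSV results.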

If we just do a straight select *, we get the struct/array columns back in a format that isn't quite JSON. Here's a sample input file entry:

{"timestamp":1520640777.666096,"stats":[{"time":15,"mean":45.23,"var":0.31},{"time":19,"mean":17.315,"var":2.612}],"dets":[{"coords":[2.4,1.7,0.3], "header":{"frame":1,"seq":1,"name":"hello"}}],"pos": {"x":5,"y":1.4,"theta":0.04}}

Sample output:

select * from test

"timestamp","stats","dets","pos"
"1.520640777666096E9","[{time=15.0, mean=45.23, var=0.31}, {time=19.0, mean=17.315, var=2.612}]","[{coords=[2.4, 1.7, 0.3], header={frame=1, seq=1, name=hello}}]","{x=5.0, y=1.4, theta=0.04}"

I was hoping to get those nested fields exported in a more convenient format - getting them in JSON would be great.

Unfortunately, it seems that cast to JSON only works for maps, not structs, because it just flattens everything into arrays:

SELECT timestamp, cast(stats as JSON) as stats, cast(dets as JSON) as dets, cast(pos as JSON) as pos FROM "sampledb"."test"

"timestamp","stats","dets","pos"
"1.520640777666096E9","[[15.0,45.23,0.31],[19.0,17.315,2.612]]","[[[2.4,1.7,0.3],[1,1,""hello""]]]","[5.0,1.4,0.04]"

Is there a good way to convert to JSON (or another easy-to-import format), or should I just go ahead and write a custom parsing function?
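As a stopgap for the cast(... as JSON) output shown above, the lost field names can be re-attached from the table definition. This is only a rough sketch: it assumes the cast keeps the struct fields in the order they are declared in the CREATE TABLE statement, that the CSV has already been parsed (so ""hello"" has become "hello"), and the helper names name_fields and restore_row are made up for illustration.

```python
import json

# Field names copied by hand from the CREATE TABLE statement, in declaration order.
# Assumption: cast(... as JSON) emits struct values in this same order.
STATS_FIELDS = ["time", "mean", "var"]
HEADER_FIELDS = ["frame", "seq", "name"]
POS_FIELDS = ["x", "y", "theta"]


def name_fields(values, fields):
    """Pair a positional struct (a JSON array) with its field names."""
    return dict(zip(fields, values))


def restore_row(stats_json, dets_json, pos_json):
    """Rebuild named structures from the three flattened JSON columns."""
    stats = [name_fields(s, STATS_FIELDS) for s in json.loads(stats_json)]
    dets = []
    for coords, header in json.loads(dets_json):
        dets.append({"coords": coords,
                     "header": name_fields(header, HEADER_FIELDS)})
    pos = name_fields(json.loads(pos_json), POS_FIELDS)
    return {"stats": stats, "dets": dets, "pos": pos}


# Column values from the sample output, after CSV unescaping.
row = restore_row(
    '[[15.0,45.23,0.31],[19.0,17.315,2.612]]',
    '[[[2.4,1.7,0.3],[1,1,"hello"]]]',
    '[5.0,1.4,0.04]',
)
print(json.dumps(row))
```

The drawback is that the field lists have to be kept in sync with the table schema by hand, which gets tedious with several levels of nesting.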

Answer 1

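I have skimmed through all the documentation, and unfortunately there seems to be no way to do this as of now. The only possible workaround is the one described in "converting a struct to a json when querying athena", which selects each nested field explicitly:

SELECT
  my_field,
  my_field.a,
  my_field.b,
  my_field.c.d,
  my_field.c.e
FROM
  my_table

Or I would convert the data to JSON in post-processing. The script below shows how: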

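#!/usr/bin/env python
import io
import re

# Quote struct keys: a bare name followed by "=" right after "{" or ",".
pattern1 = re.compile(r'(?<=[{,])\s*([a-z_]\w*)=', re.I)
# Quote bare values that start with a letter (string values inside structs).
pattern2 = re.compile(r':([a-z][^,{}. [\]]+)', re.I)

with io.open("test.csv") as f:
    # The header cells keep their double quotes, so they serve as JSON keys as-is.
    headers = [h.strip() for h in f.readline().split(",")]
    for line in f:
        data = []
        # Athena quotes every field, so '","' separates the columns.
        for i, l in enumerate(line.rstrip("\r\n").split('","')):
            data.append(headers[i] + ":" + re.sub('^"|"$', "", l))

        line = "{" + ','.join(data) + "}"
        line = pattern1.sub(r'"\1":', line)
        line = pattern2.sub(r':"\1"', line)
        print(line)

For the input data in the question, the output of this script is valid JSON.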
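The script splits each line on '","' by hand, so a field that itself contains that sequence would be cut in the wrong place. As a sketch of an alternative for the splitting step only, Python's standard csv module can handle the quoting. The sample text is the select * output from the question, read from an in-memory string rather than a file; read_rows is a name made up for this example.

```python
import csv
import io

# The "select *" CSV output from the question, inlined for the example.
SAMPLE = '''"timestamp","stats","dets","pos"
"1.520640777666096E9","[{time=15.0, mean=45.23, var=0.31}, {time=19.0, mean=17.315, var=2.612}]","[{coords=[2.4, 1.7, 0.3], header={frame=1, seq=1, name=hello}}]","{x=5.0, y=1.4, theta=0.04}"
'''


def read_rows(text):
    """Return one dict per CSV row, keyed by the header names."""
    # csv.DictReader strips the surrounding quotes and undoes "" escaping.
    return list(csv.DictReader(io.StringIO(text)))


rows = read_rows(SAMPLE)
for row in rows:
    for column, raw in row.items():
        print(column, "->", raw)
```

Each value is still the raw Athena text for that column, so the struct and array columns need a further conversion step before they are JSON.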

Answer 2

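@tarun's Python code almost got me there, but I had to modify it in several ways because of my data. In particular, I have:

JSON structures saved as strings in Athena
Strings that contain multiple words and therefore have to be enclosed in double quotes. Some of them contain "[]" and "{}" symbols.

Here is the code that worked for me; hopefully it will be useful to others: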

#!/usr/bin/env python
import io
import re
import sys

pattern1 = re.compile(r'(?<={)([a-z]+)=', re.I)
pattern2 = re.compile(r':([a-z][^,{}. [\]]+)', re.I)

with io.open(sys.argv[1]) as f:
    headers = list(map(lambda f: f.strip(), f.readline().split(",")))
    print(headers)
    for line in f.readlines():

        orig_line = line
        #save the double quote cases, which mean there is a string with quotes inside
        #(this uses "#" as a placeholder, so the data itself must not contain "#")
        line = re.sub('""', "#", orig_line)
        data = []
        for i, l in enumerate(line.split('","')):
            item = re.sub('^"|"$', "", l.rstrip())
            #startswith/endswith also cope with an empty item (e.g. a blank line)
            if (item.startswith("{") and item.endswith("}")) or (item.startswith("[") and item.endswith("]")):
                data.append(headers[i] + ":" + item)
            else: #we have a string
                data.append(headers[i] + ": \"" + item + "\"")

        line = "{" + ','.join(data) + "}"
        line = pattern1.sub(r'"\1":', line)
        line = pattern2.sub(r':"\1"', line)

        #restate the double quotes to single ones, once inside the json
        line = re.sub("#", '"', line)

        print(line)
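Regex-based rewriting like this gets fragile as the nesting grows, so another option is a small recursive parser for the {key=value, ...} and [...] text that Athena prints. The following is only a sketch, not a library API: parse_athena_value and its helpers are hypothetical names, and it assumes that scalar values never contain ",", "}" or "]" and that keys never contain "=". It also cannot tell the string "1" from the number 1, because Athena's output does not quote strings.

```python
import json


def parse_athena_value(text):
    """Parse Athena's struct/array text into Python objects (hypothetical helper)."""
    value, pos = _parse(text, 0)
    if text[pos:].strip():
        raise ValueError("unexpected trailing text: %r" % text[pos:])
    return value


def _skip_ws(text, pos):
    while pos < len(text) and text[pos] == " ":
        pos += 1
    return pos


def _parse(text, pos):
    """Dispatch on the next character: struct, array, or scalar."""
    pos = _skip_ws(text, pos)
    if pos < len(text) and text[pos] == "{":
        return _parse_struct(text, pos + 1)
    if pos < len(text) and text[pos] == "[":
        return _parse_array(text, pos + 1)
    return _parse_scalar(text, pos)


def _parse_struct(text, pos):
    result = {}
    pos = _skip_ws(text, pos)
    if text[pos] == "}":
        return result, pos + 1
    while True:
        pos = _skip_ws(text, pos)
        eq = text.index("=", pos)          # key runs up to the next "="
        key = text[pos:eq].strip()
        result[key], pos = _parse(text, eq + 1)
        pos = _skip_ws(text, pos)
        if text[pos] == "}":
            return result, pos + 1
        if text[pos] != ",":
            raise ValueError("expected ',' or '}' at offset %d" % pos)
        pos += 1


def _parse_array(text, pos):
    items = []
    pos = _skip_ws(text, pos)
    if text[pos] == "]":
        return items, pos + 1
    while True:
        item, pos = _parse(text, pos)
        items.append(item)
        pos = _skip_ws(text, pos)
        if text[pos] == "]":
            return items, pos + 1
        if text[pos] != ",":
            raise ValueError("expected ',' or ']' at offset %d" % pos)
        pos += 1


def _parse_scalar(text, pos):
    """Read up to the next ',', '}' or ']' and guess the type."""
    end = pos
    while end < len(text) and text[end] not in ",}]":
        end += 1
    token = text[pos:end].strip()
    if token == "null":
        return None, end
    if token in ("true", "false"):
        return token == "true", end
    for convert in (int, float):
        try:
            return convert(token), end
        except ValueError:
            pass
    return token, end                      # anything else stays a string


dets = parse_athena_value(
    "[{coords=[2.4, 1.7, 0.3], header={frame=1, seq=1, name=hello}}]")
print(json.dumps(dets))
```

Because scalars are read up to the next delimiter rather than the next space, multi-word strings inside a struct survive, which the regex patterns above do not handle.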

Answer 3

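This approach does not involve modifying the query.

Instead, it relies on post-processing: for JavaScript/Node.js, we can use the npm package athena-struct-parser.

Detailed answer with an example:

https://stackoverflow.com/a/67899845/6662952

Reference - https://www.npmjs.com/package/athena-struct-parser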

AWS Athena export array of structs to JSON