我具有采用以下方案的JSON格式的历史数据。
npm run setup
CREATE EXTERNAL TABLE `raw_table`(
`time_stamp` timestamp,
`event_1` array<struct<name:string, value_1:double, value_2:double>>,
`event_2` array<struct<name:string, value_1:double, value_2:double>>)
...
出现在name
和event_1
event_2
出现多次,但name
和value_1
却不同我想用CTAS query将其转换为Parquet,同时我还想提取value_2
的值并将其用作新的分区键,以使新表的结构成为>
name
我的一般想法是先嵌套数组,然后按CREATE EXTERNAL TABLE `processed_table`(
`time_stamp` timestamp,
`event_1` array<struct<value_1:double, value_2:double>>,
`event_2` array<struct<value_1:double, value_2:double>>)
PARTITIONED BY (
`name` string)
...
和time_stamp
分组,同时通过汇总name
和value_1
创建新数组。
value_2
此查询可以很好地处理单个事件,并且不会花费太长时间。理想情况下,我希望并排保留这些事件,以简化以后的查询。
接下来,尝试将此方法扩展到两个事件,查询变得非常复杂:
SELECT
time_stamp,
unnested_event_1.name as "name",
array_agg(
MAP(
ARRAY['value_1', 'value_2'],
ARRAY[unnested_event_1.value_1, unnested_event_1.value_2]
)
) as "event_1"
FROM
some_db.raw_data,
UNNEST(event_1) as t(unnested_event_1)
GROUP BY 1, 2
尽管它提供了正确的结果,但是即使从源中读取数据也要花费一些时间,我认为这是由于取消嵌套和将其与两个子查询分组所致。
所以我尝试用SELECT
time_stamp,
event_1,
event_2,
COALESCE (event_1_name, event_2_name) as "name"
FROM (
SELECT
time_stamp,
event_1_name,
event_1,
unnested_event_2.name as "event_2_name",
array_agg(
MAP(
ARRAY['value_1', 'value_2'],
ARRAY[unnested_event_2.value_1, unnested_event_2.value_2]
)
) as "event_2"
FROM (
SELECT
time_stamp,
unnested_event_1.name as event_1_name,
array_agg(
MAP(
ARRAY['value_1', 'value_2'],
ARRAY[unnested_event_1.value_1, unnested_event_1.value_2]
)
) as "event_1",
event_2
FROM
some_db.raw_data,
UNNEST(event_1) as t(unnested_event_1)
GROUP BY 1, 2, 4
),
UNNEST(event_2) as t(unnested_event_2)
GROUP BY 1, 2, 3, 4
)
WHERE(
event_2_name = event_1_name
)
一次性完成:
GROUP BY GROUPING SETS (set_1, set_2)
性能方面的方法看起来更好,即查看读取基础数据所需的时间。但是,我不太确定如何合并此查询的结果,如下所示:
SELECT
time_stamp,
unnested_event_1.name as event_1_name,
unnested_event_2.name as event_2_name,
array_agg(
MAP(
ARRAY['value_1', 'value_2'],
ARRAY[unnested_event_1.value_1, unnested_event_1.value_2]
)
) as "event_1",
array_agg(
MAP(
ARRAY['value_1', 'value_2'],
ARRAY[unnested_event_2.value_1, unnested_event_2.value_2]
)
) as "event_2"
FROM
some_db.raw_data,
UNNEST(event_1, event_2) as t(unnested_event_1, unnested_event_2)
GROUP BY GROUPING SETS (
(time_stamp, event_1_name),
(time_stamp, event_2_name)
)
进入
| | time_stamp | event_1_name | event_2_name | event_1 | event_2 |
|---|------------|--------------|--------------|-------------------|-------------------|
| 1 | 12345 | foo | | foo_event_1 | false foo_event_2 |
| 2 | 12345 | bar | | bar_event_1 | false bar_event_2 |
| 3 | 12345 | | foo | false foo_event_1 | foo_event_2 |
| 4 | 12345 | | bar | false bar_event_1 | bar_event_2 |