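BigQuery challenge:
I have a table of product batches entering a factory, with multiple sensors along the line that measure different defects on different parts of a single product. We read the data from the devices in a flat structure, and it is written to an incoming table.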
Batch_id|Sensor_id|Product_part_id|defect_id|Count_defects|Event_Date
1.......|.1.......|1..............|2........|.5...........|.2018-7-1
1.......|.2.......|1..............|2........|.6...........|.2018-7-1
1.......|.2.......|2..............|3........|.7...........|.2018-7-1
1.......|.3.......|2..............|3........|.8...........|.2018-7-1
1.......|.3.......|2..............|4........|.9...........|.2018-7-1
1.......|.3.......|3..............|5........|.10..........|.2018-7-1
We can deduplicate these tables based on the latest [updated_time].
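As a minimal sketch of that deduplication step: the table name `mydataset.incoming` is hypothetical, and treating the four id columns plus Event_Date as the key of a reading is an assumption, not something stated above. Under those assumptions, keeping only the most recent row per key might look like this:

```sql
-- Sketch only: `mydataset.incoming` and the choice of key columns are assumptions.
SELECT * EXCEPT (rn)
FROM (
  SELECT
    *,
    -- Number the rows of each key from newest to oldest
    ROW_NUMBER() OVER (
      PARTITION BY Batch_id, Sensor_id, Product_part_id, defect_id, Event_Date
      ORDER BY updated_time DESC
    ) AS rn
  FROM `mydataset.incoming`
)
WHERE rn = 1  -- keep the latest version of each reading
```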
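Now I am trying to materialize the raw input into a fact table partitioned by Event_Date, but for the best performance and the cheapest storage I would like to achieve the following structure: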
Batch_id|Sensor_id|Product_part_id|defect_id|Count_defects|Event_Date
1.......|.1.......|1..............|2........|.5...........|.2018-7-1
........|.2.......|1..............|2........|.6...........|.2018-7-1
........|.........|2..............|3........|.7...........|.2018-7-1
........|.3.......|2..............|3........|.8...........|.2018-7-1
........|.........|...............|4........|.9...........|.2018-7-1
........|.........|3..............|5........|.10..........|.2018-7-1
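I cannot use multiple nested ARRAY() calls: that is not allowed, and it would also perform poorly because it would read the same base table as input several times.
I am looking for suggestions on how to work around this.
Thanks!
Answer 0 (score: 3)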
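I apply array_agg() + GROUP BY sequentially, starting with the innermost array. After the first iteration I put the query into a WITH clause and use array_agg() + GROUP BY again to build the next array.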
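Performance-wise, this approach has the same constraint as every GROUP BY query: avoid skewed group sizes if you can. Otherwise the query takes longer, because BigQuery has to re-plan resources in the background when a single group takes up too much memory. You can use the query execution plan to optimize this.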
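One rough way to look for skew before nesting is to count rows per outermost grouping key. This is only a sketch: `mydataset.incoming` is a hypothetical name for the flat input, and the LIMIT is arbitrary.

```sql
-- Sketch only: `mydataset.incoming` is a hypothetical name for the flat input table.
SELECT
  batch_id,
  COUNT(*) AS row_count   -- rows that would end up nested under this batch
FROM `mydataset.incoming`
GROUP BY batch_id
ORDER BY row_count DESC   -- largest groups first
LIMIT 10
```

If a handful of batches dominate the counts, those are the groups most likely to slow down the final aggregation.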
For your example table, my resulting query is as follows:
WITH t AS (
  -- Sample rows from the question
  SELECT 1 as batch_id, 1 as sensor_id, 1 as product_part_id, 2 as defect_id, 5 as count_defects, '2018-7-1' as event_date
  UNION ALL SELECT 1 as batch_id, 2 as sensor_id, 1 as product_part_id, 2 as defect_id, 6 as count_defects, '2018-7-1' as event_date
  UNION ALL SELECT 1 as batch_id, 2 as sensor_id, 2 as product_part_id, 3 as defect_id, 7 as count_defects, '2018-7-1' as event_date
  UNION ALL SELECT 1 as batch_id, 3 as sensor_id, 2 as product_part_id, 3 as defect_id, 8 as count_defects, '2018-7-1' as event_date
  UNION ALL SELECT 1 as batch_id, 3 as sensor_id, 2 as product_part_id, 4 as defect_id, 9 as count_defects, '2018-7-1' as event_date
  UNION ALL SELECT 1 as batch_id, 3 as sensor_id, 3 as product_part_id, 5 as defect_id, 10 as count_defects, '2018-7-1' as event_date
),
-- Innermost level: one array of defects per (batch, sensor, product part)
defect_nesting as (
  SELECT
    batch_id,
    sensor_id,
    product_part_id,
    array_agg(STRUCT(defect_id, count_defects, event_date) ORDER BY defect_id) defectInfo
  FROM t
  GROUP BY 1, 2, 3
),
-- Middle level: one array of product parts per (batch, sensor)
product_nesting as (
  SELECT
    batch_id,
    sensor_id,
    array_agg(STRUCT(product_part_id, defectInfo) ORDER BY product_part_id) productInfo
  FROM defect_nesting
  GROUP BY 1, 2
)
-- Outermost level: one array of sensors per batch
SELECT
  batch_id,
  array_agg(STRUCT(sensor_id, productInfo) ORDER BY sensor_id) sensorInfo
FROM product_nesting
GROUP BY 1
Resulting JSON:
[
  {
    "batch_id": "1",
    "sensorInfo": [
      {
        "sensor_id": "1",
        "productInfo": [
          {
            "product_part_id": "1",
            "defectInfo": [
              {
                "defect_id": "2",
                "count_defects": "5",
                "event_date": "2018-7-1"
              }
            ]
          }
        ]
      },
      {
        "sensor_id": "2",
        "productInfo": [
          {
            "product_part_id": "1",
            "defectInfo": [
              {
                "defect_id": "2",
                "count_defects": "6",
                "event_date": "2018-7-1"
              }
            ]
          },
          {
            "product_part_id": "2",
            "defectInfo": [
              {
                "defect_id": "3",
                "count_defects": "7",
                "event_date": "2018-7-1"
              }
            ]
          }
        ]
      },
      {
        "sensor_id": "3",
        "productInfo": [
          {
            "product_part_id": "2",
            "defectInfo": [
              {
                "defect_id": "3",
                "count_defects": "8",
                "event_date": "2018-7-1"
              },
              {
                "defect_id": "4",
                "count_defects": "9",
                "event_date": "2018-7-1"
              }
            ]
          },
          {
            "product_part_id": "3",
            "defectInfo": [
              {
                "defect_id": "5",
                "count_defects": "10",
                "event_date": "2018-7-1"
              }
            ]
          }
        ]
      }
    ]
  }
]
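To read a nested result like this back as flat rows, each array level can be expanded with UNNEST. The following is a sketch that assumes the nested output was saved to a table; the name `mydataset.fact_nested` is hypothetical, while the field names match the JSON above.

```sql
-- Sketch only: `mydataset.fact_nested` is a hypothetical table holding the nested result.
SELECT
  f.batch_id,
  s.sensor_id,
  p.product_part_id,
  d.defect_id,
  d.count_defects,
  d.event_date
FROM `mydataset.fact_nested` AS f,
  UNNEST(f.sensorInfo) AS s,    -- one row per sensor
  UNNEST(s.productInfo) AS p,   -- one row per product part within that sensor
  UNNEST(p.defectInfo) AS d     -- one row per defect within that product part
```

Each comma followed by UNNEST acts as a correlated join against the parent row, so only one table is scanned.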
Hope this helps!
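One caveat for the partitioning goal in the question: BigQuery requires the partitioning column to be a top-level field, so an event_date buried in the innermost STRUCT cannot be used for that. A possible variant, offered as a sketch rather than as the answer's method, keeps event_date as a grouping key at every level. It assumes a hypothetical flat source `mydataset.incoming` whose event_date column is already of type DATE, and a hypothetical target `mydataset.fact_nested`.

```sql
-- Sketch only: table names are hypothetical, and event_date is assumed to be a DATE column.
CREATE TABLE `mydataset.fact_nested`
PARTITION BY event_date AS
WITH defect_nesting AS (
  -- Innermost level: defects per (batch, date, sensor, product part)
  SELECT
    batch_id, event_date, sensor_id, product_part_id,
    ARRAY_AGG(STRUCT(defect_id, count_defects) ORDER BY defect_id) AS defectInfo
  FROM `mydataset.incoming`
  GROUP BY batch_id, event_date, sensor_id, product_part_id
),
product_nesting AS (
  -- Middle level: product parts per (batch, date, sensor)
  SELECT
    batch_id, event_date, sensor_id,
    ARRAY_AGG(STRUCT(product_part_id, defectInfo) ORDER BY product_part_id) AS productInfo
  FROM defect_nesting
  GROUP BY batch_id, event_date, sensor_id
)
-- Outermost level: sensors per (batch, date); event_date stays top-level for partitioning
SELECT
  batch_id, event_date,
  ARRAY_AGG(STRUCT(sensor_id, productInfo) ORDER BY sensor_id) AS sensorInfo
FROM product_nesting
GROUP BY batch_id, event_date
```

The trade-off is that a batch spanning several dates produces one row per date instead of a single row per batch.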