Creating a table with multiple levels of nesting from a flattened structure

Time: 2018-07-21 10:51:23

Tags: google-bigquery

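BigQuery challenge:

Input

I have a table of product batches entering a factory. Along the way, multiple sensors measure different defects on different parts of each individual product. We read the data from the devices in a flat structure, and it is written to an incoming table.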

Batch_id|Sensor_id|Product_part_id|defect_id|Count_defects|Event_Date
1.......|.1.......|1..............|2........|.5...........|.2018-7-1
1.......|.2.......|1..............|2........|.6...........|.2018-7-1
1.......|.2.......|2..............|3........|.7...........|.2018-7-1
1.......|.3.......|2..............|3........|.8...........|.2018-7-1
1.......|.3.......|2..............|4........|.9...........|.2018-7-1
1.......|.3.......|3..............|5........|.10..........|.2018-7-1

We can deduplicate these tables by keeping the row with the latest [updated_time].
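A deduplication step of that kind might look roughly like the sketch below. It is only an illustration: the table name `mydataset.incoming` is hypothetical, and the choice of (batch_id, sensor_id, product_part_id, defect_id) as the key that identifies a duplicate is an assumption, since the post does not say which columns define one.

```sql
-- Assumptions (not from the original post): the table name is hypothetical,
-- updated_time is a TIMESTAMP column, and a "duplicate" is a row with the
-- same batch, sensor, product part, and defect.
SELECT * EXCEPT (rn)
FROM (
  SELECT
    *,
    -- Number the rows of each key from newest to oldest.
    ROW_NUMBER() OVER (
      PARTITION BY batch_id, sensor_id, product_part_id, defect_id
      ORDER BY updated_time DESC
    ) AS rn
  FROM `mydataset.incoming`
)
WHERE rn = 1  -- keep only the most recently updated row per key
```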

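Problem: converting to multiple nested repeated structures

Now I'm trying to materialize the raw input into a fact table partitioned by Event_Date, but to get the best performance and the cheapest storage, I'd like to end up with the following structure: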

Batch_id|Sensor_id|Product_part_id|defect_id|Count_defects|Event_Date
1.......|.1.......|1..............|2........|.5...........|.2018-7-1
........|.2.......|1..............|2........|.6...........|.2018-7-1
........|.........|2..............|3........|.7...........|.2018-7-1
........|.3.......|2..............|3........|.8...........|.2018-7-1
........|.........|...............|4........|.9...........|.2018-7-1
........|.........|3..............|5........|.10..........|.2018-7-1

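I can't use multiple nested ARRAY() calls: that isn't allowed, and it would also perform poorly because it would read the same base table as input several times.

I'm looking for suggestions on how to approach this.

Thanks!

1 answer:

Answer 0 (score: 3)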

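I apply array_agg() + GROUP BY sequentially, starting with the innermost array. After the first iteration, I put the query into a WITH clause and apply array_agg() + GROUP BY again to build the next array.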

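Performance-wise, this approach has the same constraint as every GROUP BY query: where possible, you should avoid skewed group sizes. Otherwise the query takes longer, because BigQuery has to re-plan resources under the hood once it realizes that a single group takes up a lot of memory. You can use the query execution plan to optimize this.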
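One rough way to look for skewed groups before running the nesting query is to count the input rows per grouping key. This is only a sketch: the table name `mydataset.incoming` is hypothetical, and batch_id is used because it is the coarsest key in this example, so its groups end up the largest.

```sql
-- Assumption (not from the original post): the table name is hypothetical.
-- Lists the batches that contribute the most rows to a single group.
SELECT
  batch_id,
  COUNT(*) AS row_count
FROM `mydataset.incoming`
GROUP BY batch_id
ORDER BY row_count DESC
LIMIT 10
```

If a handful of batches dominate the counts, those are the groups most likely to slow down the aggregation.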

For your example table, my resulting query looks like this:

WITH t AS (
            SELECT 1 as batch_id, 1 as sensor_id, 1 as product_part_id, 2 as defect_id,  5 as count_defects, '2018-7-1' as event_date
  UNION ALL SELECT 1 as batch_id, 2 as sensor_id, 1 as product_part_id, 2 as defect_id,  6 as count_defects, '2018-7-1' as event_date
  UNION ALL SELECT 1 as batch_id, 2 as sensor_id, 2 as product_part_id, 3 as defect_id,  7 as count_defects, '2018-7-1' as event_date
  UNION ALL SELECT 1 as batch_id, 3 as sensor_id, 2 as product_part_id, 3 as defect_id,  8 as count_defects, '2018-7-1' as event_date
  UNION ALL SELECT 1 as batch_id, 3 as sensor_id, 2 as product_part_id, 4 as defect_id,  9 as count_defects, '2018-7-1' as event_date
  UNION ALL SELECT 1 as batch_id, 3 as sensor_id, 3 as product_part_id, 5 as defect_id, 10 as count_defects, '2018-7-1' as event_date
),
-- Innermost level: one array of defects per (batch, sensor, product part).
defect_nesting AS (
  SELECT
    batch_id,
    sensor_id,
    product_part_id,
    ARRAY_AGG(STRUCT(defect_id, count_defects, event_date) ORDER BY defect_id) AS defectInfo
  FROM t
  GROUP BY 1, 2, 3
),
-- Middle level: one array of product parts per (batch, sensor).
product_nesting AS (
  SELECT
    batch_id,
    sensor_id,
    ARRAY_AGG(STRUCT(product_part_id, defectInfo) ORDER BY product_part_id) AS productInfo
  FROM defect_nesting
  GROUP BY 1, 2
)
-- Outermost level: one array of sensors per batch.
SELECT
  batch_id,
  ARRAY_AGG(STRUCT(sensor_id, productInfo) ORDER BY sensor_id) AS sensorInfo
FROM product_nesting
GROUP BY 1

Resulting JSON:

[
  {
    "batch_id": "1",
    "sensorInfo": [
      {
        "sensor_id": "1",
        "productInfo": [
          {
            "product_part_id": "1",
            "defectInfo": [
              {
                "defect_id": "2",
                "count_defects": "5",
                "event_date": "2018-7-1"
              }
            ]
          }
        ]
      },
      {
        "sensor_id": "2",
        "productInfo": [
          {
            "product_part_id": "1",
            "defectInfo": [
              {
                "defect_id": "2",
                "count_defects": "6",
                "event_date": "2018-7-1"
              }
            ]
          },
          {
            "product_part_id": "2",
            "defectInfo": [
              {
                "defect_id": "3",
                "count_defects": "7",
                "event_date": "2018-7-1"
              }
            ]
          }
        ]
      },
      {
        "sensor_id": "3",
        "productInfo": [
          {
            "product_part_id": "2",
            "defectInfo": [
              {
                "defect_id": "3",
                "count_defects": "8",
                "event_date": "2018-7-1"
              },
              {
                "defect_id": "4",
                "count_defects": "9",
                "event_date": "2018-7-1"
              }
            ]
          },
          {
            "product_part_id": "3",
            "defectInfo": [
              {
                "defect_id": "5",
                "count_defects": "10",
                "event_date": "2018-7-1"
              }
            ]
          }
        ]
      }
    ]
  }
]
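Since the question asks for a fact table partitioned by Event_Date, one possible variation is to keep event_date as a top-level grouping column instead of placing it inside the innermost STRUCT, so that it can serve as the partitioning column. The sketch below is not part of the original answer; the names `mydataset.incoming` and `mydataset.batch_facts` are hypothetical, and it assumes event_date is stored as a DATE column rather than a string.

```sql
-- Assumptions (not from the original post): both table names are hypothetical,
-- and event_date is a DATE column in the incoming table.
CREATE TABLE `mydataset.batch_facts`
PARTITION BY event_date AS
WITH defect_nesting AS (
  -- Innermost arrays; event_date stays outside the STRUCT.
  SELECT
    batch_id, event_date, sensor_id, product_part_id,
    ARRAY_AGG(STRUCT(defect_id, count_defects) ORDER BY defect_id) AS defectInfo
  FROM `mydataset.incoming`
  GROUP BY batch_id, event_date, sensor_id, product_part_id
),
product_nesting AS (
  SELECT
    batch_id, event_date, sensor_id,
    ARRAY_AGG(STRUCT(product_part_id, defectInfo) ORDER BY product_part_id) AS productInfo
  FROM defect_nesting
  GROUP BY batch_id, event_date, sensor_id
)
SELECT
  batch_id, event_date,
  ARRAY_AGG(STRUCT(sensor_id, productInfo) ORDER BY sensor_id) AS sensorInfo
FROM product_nesting
GROUP BY batch_id, event_date;

-- Reading the nested table back as flat rows, one UNNEST per nesting level.
SELECT
  batch_id, s.sensor_id, p.product_part_id, d.defect_id, d.count_defects
FROM `mydataset.batch_facts`,
  UNNEST(sensorInfo) AS s,
  UNNEST(s.productInfo) AS p,
  UNNEST(p.defectInfo) AS d
WHERE event_date = DATE '2018-07-01';
```

With this layout a batch that spans several dates produces one row per date, which is the trade-off for being able to prune partitions on event_date.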

Hope that helps!