Updating a doubly nested repeated record

Time: 2019-07-17 15:55:39

Tags: google-bigquery

I am struggling with this query (a dummy version; the real one has many more fields):

UPDATE 
  table1 as base
SET 
  lines = 
    ARRAY(
          SELECT AS STRUCT 
            b.line_id,
            s.purch_id,
            ARRAY(
                  SELECT AS STRUCT
                    wh.warehouse_id,
                    s.is_proposed,
                  FROM table1 as t, UNNEST(lines) as lb, UNNEST(lb.warehouses) as wh
                  INNER JOIN 
                    (SELECT 
                      l.line_id,
                      wh.is_proposed
                     FROM table2, UNNEST(lines) as l, UNNEST(l.warehouses) as wh) as s
                  ON lb.line_id = s.line_id AND wh.warehouse_id = s.warehouse_id)
          FROM table1, UNNEST(lines) as b
          INNER JOIN UNNEST(supply.lines) as s
          ON b.line_id = s.line_id)
FROM 
  table2 as supply
WHERE 
  base.date = supply.date
  AND
  base.sales_id = supply.sales_id

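table1 and table2 have the same nesting:

  • lines: repeated record
  • lines.warehouses: repeated record inside each line

(so {..., lines[{..., warehouses[]}]})

In addition, table1 is a subset of table2 and carries only a subset of its fields. Those fields are NULL in table1 to begin with (the information arrives asynchronously, so I refresh it when the data becomes available).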
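For readers who want to reproduce the setup, the nesting described above could be declared roughly as follows. This is only a sketch: the dataset name `mydataset` and every column type are assumptions, since the post names the fields but not their types.

```sql
-- Hypothetical DDL: dataset name and column types are assumptions,
-- only the field names and the nesting come from the question.
CREATE TABLE mydataset.table1 (
  date DATE,
  sales_id INT64,
  -- First level: one entry per line.
  lines ARRAY<STRUCT<
    line_id INT64,
    purch_id INT64,                -- NULL until table2 provides it
    -- Second level: one entry per warehouse within a line.
    warehouses ARRAY<STRUCT<
      warehouse_id INT64,
      is_proposed BOOL             -- NULL until table2 provides it
    >>
  >>
);
```

table2 would have the same shape plus its additional fields.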

I first tried this intermediate step, which succeeded:

UPDATE 
  table1 as base
SET 
  lines = 
    ARRAY(
          SELECT AS STRUCT 
            b.line_id,
            s.purch_id,
            b.warehouses
          FROM table1, UNNEST(lines) as b
          INNER JOIN UNNEST(supply.lines) as s
          ON b.line_id = s.line_id)
FROM 
  table2 as supply
WHERE 
  base.date = supply.date
  AND
  base.sales_id = supply.sales_id

But I also need to update lines.warehouses, so while I am glad this works, it is not enough.

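The full query is valid, and when I run the innermost ARRAY part on its own in the terminal, it is fast and its output has no duplicates. The full UPDATE, however, never finishes (I killed it after 20 minutes).

The tables are not that big: about 20k rows on each side (220k once fully flattened).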

So am I doing something wrong? Is there a better way?

Thanks

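1 Answer:

Answer 0: (score: 1)

I finally solved this, and it was much simpler than I thought. I think I had misunderstood how nesting works across the whole query.

The original query scanned table1 and table2 again inside the subqueries instead of reusing the rows already matched by the UPDATE. So I simply chained the data that was already available, from the first matched row down to the innermost array, because the filtering done at the top level propagates to the lower levels.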

UPDATE 
  table1 as base
SET 
  lines = 
    ARRAY(
          SELECT AS STRUCT 
            b.line_id,
            s.purch_id,
            ARRAY(
                  SELECT AS STRUCT
                    wh.warehouse_id,
                    sh.is_proposed,
                  FROM UNNEST(b.warehouses) as wh -- use only the warehouses of the already matched base line
                  INNER JOIN UNNEST(s.warehouses) as sh -- same for the matched supply line
                  ON wh.warehouse_id = sh.warehouse_id) -- no need to redo the line-level join here
          FROM UNNEST(base.lines) as b
          INNER JOIN UNNEST(supply.lines) as s
          ON b.line_id = s.line_id)
FROM 
  table2 as supply
WHERE 
  base.date = supply.date
  AND
  base.sales_id = supply.sales_id

The query succeeded in under 1 minute.
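To check the rebuilt array before running the UPDATE, the same correlated pattern can be tried as a plain SELECT on inline data. The sketch below is an illustration only: the CTE names `base` and `supply` stand in for table1 and table2, and all literal values and column types are made up.

```sql
-- Inline sample rows; every value and type here is an assumption.
WITH base AS (
  SELECT
    DATE '2019-07-17' AS date,
    1 AS sales_id,
    [STRUCT(
       100 AS line_id,
       CAST(NULL AS INT64) AS purch_id,
       [STRUCT(10 AS warehouse_id, CAST(NULL AS BOOL) AS is_proposed),
        STRUCT(11 AS warehouse_id, CAST(NULL AS BOOL) AS is_proposed)] AS warehouses
     )] AS lines
),
supply AS (
  SELECT
    DATE '2019-07-17' AS date,
    1 AS sales_id,
    [STRUCT(
       100 AS line_id,
       555 AS purch_id,
       [STRUCT(10 AS warehouse_id, TRUE AS is_proposed),
        STRUCT(11 AS warehouse_id, FALSE AS is_proposed)] AS warehouses
     )] AS lines
)
SELECT
  base.date,
  base.sales_id,
  ARRAY(
    SELECT AS STRUCT
      b.line_id,
      s.purch_id,
      -- Inner array: join only the warehouses of the two matched lines.
      ARRAY(
        SELECT AS STRUCT
          wh.warehouse_id,
          sh.is_proposed
        FROM UNNEST(b.warehouses) AS wh
        INNER JOIN UNNEST(s.warehouses) AS sh
          ON wh.warehouse_id = sh.warehouse_id
      ) AS warehouses
    -- Outer array: join only the lines of the two matched rows.
    FROM UNNEST(base.lines) AS b
    INNER JOIN UNNEST(supply.lines) AS s
      ON b.line_id = s.line_id
  ) AS lines
FROM base
INNER JOIN supply
  ON base.date = supply.date
  AND base.sales_id = supply.sales_id;
```

Every UNNEST here reads from the current row's own arrays, so no table is scanned again inside the subqueries, which is presumably why the rewritten UPDATE finishes quickly.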