BigQuery - DELETE with two columns in the WHERE and IN clauses

Time: 2019-09-03 09:33:33

Tags: sql google-cloud-platform google-bigquery

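I have data streaming from an OLTP database into BQ. Whenever a row is updated, BQ ends up holding both the old and the new record as 2 separate rows. I have a query that de-duplicates the data when reading, but I would also like to delete the old values from BQ at least once a day.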

This is my select query, which gives me the latest records.

SELECT DISTINCT o._sdc_sequence
      FROM `my-production.bqtest.mytbl` o
INNER JOIN (
     SELECT id,
            MAX(_sdc_sequence) AS seq,
            MAX(_sdc_batched_at) AS batch
    FROM `my-production.bqtest.mytbl`
    GROUP BY id) oo
ON o.id = oo.id
AND o._sdc_sequence = oo.seq
AND o._sdc_batched_at = oo.batch
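
  • id: integer (primary key)
  • _sdc_sequence: Unix timestamp (when the data was inserted into BQ)
  • _sdc_batched_at: timestamp (the data is loaded in batches, so this is the time the batch started)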

Sample data for the columns above:

select id, 
_sdc_sequence, 
_sdc_batched_at 
FROM `my-production.bqtest.mytbl`

ID: 2741332
_sdc_sequence: 1565726907840002084
_sdc_batched_at: 2019-08-13 21:01:07.687 UTC

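I want to delete the old records. I could do a daily table rotation that keeps only the latest rows, but the ETL tool I'm using stops working if I change anything in the table structure.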

I tried the query below, but it also deletes some valid rows.

delete from  `my-production.bqtest.mytbl` where _sdc_sequence not in( 
SELECT DISTINCT o._sdc_sequence
      FROM `my-production.bqtest.mytbl` o
INNER JOIN (
     SELECT id,
            MAX(_sdc_sequence) AS seq,
            MAX(_sdc_batched_at) AS batch
    FROM `my-production.bqtest.mytbl` 
    GROUP BY id) oo
ON o.id = oo.id
AND o._sdc_sequence = oo.seq
AND o._sdc_batched_at = oo.batch)

Because I have 2 rows with the same sequence ID, I need to filter on where _sdc_sequence not in combined with max(_sdc_batched_at).

Or is there any other, better query to do this?

3 Answers:

Answer 0: (score: 0)

If you are keeping only one row per id, then tuple syntax is probably the simplest. For the records to keep:

select t.*
from `my-production.bqtest.mytbl` t
where (id, _sdc_sequence) in
          (select t2.id, MAX(t2._sdc_sequence)
           from `my-production.bqtest.mytbl` t2
           group by t2.id
          );

Based on your description, I'm not sure what the batch has to do with the problem, so I left it out.

You can turn this into a delete using not in or similar logic:

delete from `my-production.bqtest.mytbl` t
where (id, _sdc_sequence) not in
          (select t2.id, MAX(t2._sdc_sequence)
           from `my-production.bqtest.mytbl` t2
           group by t2.id
          );

You could also phrase it like this:

delete from `my-production.bqtest.mytbl` t
where _sdc_sequence <
          (select max(t2._sdc_sequence)
           from `my-production.bqtest.mytbl` t2
           where t2.id = t.id
          );
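
Since the question mentions rows that share the same _sdc_sequence, a tie-breaker on _sdc_batched_at may be what is needed. The following is a minimal, untested sketch in BigQuery Standard SQL. It assumes that the combination (id, _sdc_sequence, _sdc_batched_at) identifies a row and that none of these columns is NULL; the alias rn is a name introduced here for illustration only.

```sql
-- Preview: the single row per id that would be kept
-- (highest _sdc_sequence, ties broken by the latest _sdc_batched_at).
SELECT id, _sdc_sequence, _sdc_batched_at
FROM (
  SELECT id, _sdc_sequence, _sdc_batched_at,
         ROW_NUMBER() OVER (
           PARTITION BY id
           ORDER BY _sdc_sequence DESC, _sdc_batched_at DESC
         ) AS rn
  FROM `my-production.bqtest.mytbl`
)
WHERE rn = 1;

-- Delete every row whose (id, sequence, batch) values are not among the kept ones.
-- The three columns are wrapped in a struct so the subquery returns one column.
DELETE FROM `my-production.bqtest.mytbl` t
WHERE (t.id, t._sdc_sequence, t._sdc_batched_at) NOT IN (
  SELECT (id, _sdc_sequence, _sdc_batched_at)
  FROM (
    SELECT id, _sdc_sequence, _sdc_batched_at,
           ROW_NUMBER() OVER (
             PARTITION BY id
             ORDER BY _sdc_sequence DESC, _sdc_batched_at DESC
           ) AS rn
    FROM `my-production.bqtest.mytbl`
  )
  WHERE rn = 1
);
```

Unlike taking MAX(_sdc_sequence) and MAX(_sdc_batched_at) separately, which can pick values from two different rows of the same id, the ordering here always selects one existing row. Rows that are identical in all three columns would all survive the delete, so it is worth running the preview first.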

Answer 1: (score: 0)

select t.*
from `my-production.bqtest.mytbl` t
where (id, _sdc_sequence) in
          (select (t2.id, MAX(t2._sdc_sequence))
           from `my-production.bqtest.mytbl` t2
           group by t2.id
          );

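Wrap the column list of the subquery's select in () so that it returns a single struct column; this avoids the "Subquery of type IN must have only one output column" error.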
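
The same struct trick should carry over to the delete the question asks for. This is a sketch only, not run against the real table; it assumes NOT IN accepts the struct comparison the same way IN does, and that id and _sdc_sequence are never NULL (a NULL in either side would make NOT IN evaluate to NULL, so the row would not be deleted).

```sql
-- Delete every row that is not the highest-sequence row of its id.
DELETE FROM `my-production.bqtest.mytbl` t
WHERE (t.id, t._sdc_sequence) NOT IN (
  -- The parentheses build one struct per id, so the subquery has a single output column.
  SELECT (t2.id, MAX(t2._sdc_sequence))
  FROM `my-production.bqtest.mytbl` t2
  GROUP BY t2.id
);
```

Running the select version first and comparing its row count with the number of distinct ids is a cheap way to check what would remain.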

Answer 2: (score: 0)

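Simply concatenate the two fields, both in the where check and in the subquery, to turn them into a single field with unique values (make sure they really are unique before you run the delete):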

select t.*
from `my-production.bqtest.mytbl` t
where concat(cast(id as string), '|', cast(_sdc_sequence as string)) in
          (select concat(cast(t2.id as string), '|', cast(MAX(t2._sdc_sequence) as string))
           from `my-production.bqtest.mytbl` t2
           group by t2.id
          )
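
One way to do the uniqueness check mentioned above is to look for (id, _sdc_sequence) pairs that occur more than once. This is a small sketch; row_count is just an illustrative alias.

```sql
-- Lists every (id, _sdc_sequence) pair that appears on more than one row.
-- An empty result means the pair is unique across the table.
SELECT id, _sdc_sequence, COUNT(*) AS row_count
FROM `my-production.bqtest.mytbl`
GROUP BY id, _sdc_sequence
HAVING COUNT(*) > 1;
```

If this returns rows, a delete keyed only on these two columns cannot tell those rows apart, and another column such as _sdc_batched_at would be needed to pick one.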