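I have data flowing from an OLTP database into BigQuery (BQ). Whenever a row is updated at the source, BQ ends up holding both the old and the new record as two separate rows. I have a query to remove that data, and I want to delete the old values from BQ at least once a day.
This is my select query, which gives me the latest records: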
SELECT DISTINCT o._sdc_sequence
FROM `my-production.bqtest.mytbl` o
INNER JOIN (
SELECT id,
MAX(_sdc_sequence) AS seq,
MAX(_sdc_batched_at) AS batch
FROM `my-production.bqtest.mytbl`
GROUP BY id) oo
ON o.id = oo.id
AND o._sdc_sequence = oo.seq
AND o._sdc_batched_at = oo.batch
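id - integer (primary key)
_sdc_sequence - Unix timestamp (when the data was inserted into BQ)
_sdc_batched_at - timestamp (the data is loaded in batches, so this is the batch start time)
Sample data for the above columns: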
select id,
_sdc_sequence,
_sdc_batched_at
FROM `my-production.bqtest.mytbl`
ID: 2741332
_sdc_sequence: 1565726907840002084
_sdc_batched_at: 2019-08-13 21:01:07.687 UTC
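I want to delete the old records. I could do a daily table rotation that keeps only the latest rows, but the ETL tool I'm using stops working if I change anything in the table structure.
I tried the query below, but it also deletes some valid rows.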
delete from `my-production.bqtest.mytbl` where _sdc_sequence not in(
SELECT DISTINCT o._sdc_sequence
FROM `my-production.bqtest.mytbl` o
INNER JOIN (
SELECT id,
MAX(_sdc_sequence) AS seq,
MAX(_sdc_batched_at) AS batch
FROM `my-production.bqtest.mytbl`
GROUP BY id) oo
ON o.id = oo.id
AND o._sdc_sequence = oo.seq
AND o._sdc_batched_at = oo.batch
)
Because I have two rows with the same sequence ID, I need to filter them out using where _sdc_sequence not in combined with max(_sdc_batched_at), or with any better query that does this.
Answer 0 (score: 0)
If you want to keep exactly one row per id, tuple syntax is probably the simplest. To select the records to keep:
select t.*
from `my-production.bqtest.mytbl` t
where (id, _sdc_sequence) in
(select t2.id, MAX(t2._sdc_sequence)
from `my-production.bqtest.mytbl` t2
group by t2.id
);
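From your description, I'm not sure what the batch has to do with the problem, so I'm leaving it out.
You can turn this into a delete using not in or similar logic: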
delete from `my-production.bqtest.mytbl` t
where (id, _sdc_sequence) not in
(select t2.id, MAX(t2._sdc_sequence)
from `my-production.bqtest.mytbl` t2
group by t2.id
);
You can also express it like this:
delete from `my-production.bqtest.mytbl` t
where _sdc_sequence <
(select max(t2._sdc_sequence)
from `my-production.bqtest.mytbl` t2
where t2.id = t.id
);
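If several rows of the same id can share a _sdc_sequence value, as the question mentions, the queries above keep all of the tied rows. One possible variation, given here as an untested sketch, breaks the tie with _sdc_batched_at. It assumes BigQuery accepts this correlated EXISTS inside a DELETE and that a later _sdc_batched_at is the row you want to keep:

```sql
-- Delete a row when a newer row exists for the same id.
-- "Newer" means a higher _sdc_sequence, or the same _sdc_sequence
-- with a later _sdc_batched_at (assumed tie-break rule).
DELETE FROM `my-production.bqtest.mytbl` t
WHERE EXISTS (
  SELECT 1
  FROM `my-production.bqtest.mytbl` t2
  WHERE t2.id = t.id
    AND (t2._sdc_sequence > t._sdc_sequence
         OR (t2._sdc_sequence = t._sdc_sequence
             AND t2._sdc_batched_at > t._sdc_batched_at))
);
```

Rows that tie on both columns are still all kept, so exact duplicates would need separate handling.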
Answer 1 (score: 0)
select t.*
from `my-production.bqtest.mytbl` t
where (id, _sdc_sequence) in
(select (t2.id, MAX(t2._sdc_sequence))
from `my-production.bqtest.mytbl` t2
group by t2.id
);
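Wrap the column list of the subquery's select in () to avoid the "Subquery of type IN must have only one output column" error. In BigQuery the parentheses build a single STRUCT value, so the subquery returns one column.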
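The same parenthesized form should carry over to the delete from the first answer. A sketch, assuming NOT IN accepts a struct-valued subquery the same way IN does:

```sql
-- Keep only the row with the highest _sdc_sequence per id.
-- (t2.id, MAX(...)) is wrapped in parentheses so that the subquery
-- returns a single STRUCT column instead of two columns.
DELETE FROM `my-production.bqtest.mytbl` t
WHERE (id, _sdc_sequence) NOT IN
  (SELECT (t2.id, MAX(t2._sdc_sequence))
   FROM `my-production.bqtest.mytbl` t2
   GROUP BY t2.id
  );
```

Running the select version first and comparing row counts is a cheap way to confirm which rows would survive.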
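Answer 2 (score: 0)
Simply concatenate the two fields, both in the where check and in the subquery, to turn them into a single field with unique values (make sure they really are unique before running the delete):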
select t.*
from `my-production.bqtest.mytbl` t
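-- cast to string and add a separator so that pairs such as (1, 23) and (12, 3) do not collide
where concat(cast(id as string), '-', cast(_sdc_sequence as string)) in
(select concat(cast(t2.id as string), '-', cast(MAX(t2._sdc_sequence) as string))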
from `my-production.bqtest.mytbl` t2
group by t2.id
)
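The uniqueness check mentioned above could look like the following sketch. Any row it returns is an (id, _sdc_sequence) pair that occurs more than once; row_count is only an illustrative alias:

```sql
-- List (id, _sdc_sequence) pairs that appear on more than one row.
SELECT id,
       _sdc_sequence,
       COUNT(*) AS row_count
FROM `my-production.bqtest.mytbl`
GROUP BY id, _sdc_sequence
HAVING COUNT(*) > 1;
```

An empty result means each pair identifies a single row, so keeping the latest pair per id leaves exactly one row per id.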