I have a huge table with more than 2 million rows. Its structure is:
ThingName (STRING),
Date (DATE),
Value (INT64)
Sometimes Value is NULL, and I need to fix it by setting it to the Value of the row with the closest Date that has a NOT NULL Value for the same ThingName.
...
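I am not an SQL expert at all.
I tried to describe my task with this query (heavily simplified: it only looks at earlier dates, although in reality I also need to check later dates):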
update my_tbl as SDP
set SDP.Value = (select SDPI.Value
                 from my_tbl as SDPI
                 where SDPI.Date < SDP.Date
                   and SDP.ThingName = SDPI.ThingName
                   and SDPI.Value is not null
                 order by SDPI.Date desc limit 1)
where SDP.Value is null;
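I am trying to set Value of the row being updated to the Value of a row selected from the same table for the same ThingName, using limit 1 to keep only a single result.
But the query editor tells me: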
Correlated subqueries that reference other tables are not supported unless they can be de-correlated, such as by transforming them into an efficient JOIN.
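Honestly, I am not at all sure that my task can be solved with a query alone.
So, can anyone help me? If it is impossible, please tell me; if it is possible, please tell me which SQL constructs could help.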
Answer 0 (score: 3)
In BigQuery, update is rarely needed. The logic you seem to want is:
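select t.* except (value),
       -- last_value() is used because BigQuery's lag() does not accept IGNORE NULLS
       coalesce(t.value,
                last_value(t.value ignore nulls) over (
                  partition by t.thingname order by t.date
                  rows between unbounded preceding and 1 preceding)
               ) as value
from my_tbl t;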
I don't really see a reason to save this back into the table.
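The query above only looks at earlier dates, while the question also asks for later ones. A minimal sketch of a both-directions variant follows. It assumes that (ThingName, Date) is unique, and that a tie between an equally distant earlier and later row should go to the earlier one. The CTE name `neighbors`, the column names prev_value, prev_date, next_value, next_date, and the window names w_prev and w_next are made up for illustration; the table name follows the `project.dataset.my_tbl` placeholder used elsewhere in this thread.

```sql
#standardSQL
WITH neighbors AS (
  SELECT
    t.ThingName,
    t.Date,
    t.Value,
    -- Nearest earlier row that has a non-NULL Value, and its date.
    LAST_VALUE(t.Value IGNORE NULLS) OVER w_prev AS prev_value,
    LAST_VALUE(IF(t.Value IS NULL, NULL, t.Date) IGNORE NULLS) OVER w_prev AS prev_date,
    -- Nearest later row that has a non-NULL Value, and its date.
    FIRST_VALUE(t.Value IGNORE NULLS) OVER w_next AS next_value,
    FIRST_VALUE(IF(t.Value IS NULL, NULL, t.Date) IGNORE NULLS) OVER w_next AS next_date
  FROM `project.dataset.my_tbl` AS t
  WINDOW
    w_prev AS (PARTITION BY t.ThingName ORDER BY t.Date
               ROWS BETWEEN UNBOUNDED PRECEDING AND 1 PRECEDING),
    w_next AS (PARTITION BY t.ThingName ORDER BY t.Date
               ROWS BETWEEN 1 FOLLOWING AND UNBOUNDED FOLLOWING)
)
SELECT
  n.ThingName,
  n.Date,
  CASE
    WHEN n.Value IS NOT NULL THEN n.Value          -- nothing to fix
    WHEN n.prev_date IS NULL THEN n.next_value     -- only a later donor (or none)
    WHEN n.next_date IS NULL THEN n.prev_value     -- only an earlier donor
    WHEN DATE_DIFF(n.Date, n.prev_date, DAY)
         <= DATE_DIFF(n.next_date, n.Date, DAY) THEN n.prev_value  -- tie goes to earlier
    ELSE n.next_value
  END AS Value
FROM neighbors AS n
```

If a ThingName has no non-NULL Value at all, the result stays NULL for its rows.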
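Answer 1 (score: 3)
The following is for BigQuery Standard SQL.
In many (if not most) cases you do not want to update the table, because DML statements come with extra cost and limitations. Instead, you can fill in the 'missing' values in the query itself, as in the example below: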
#standardSQL
SELECT
  ThingName,
  date,
  IFNULL(value,
    LAST_VALUE(value IGNORE NULLS)
      OVER(PARTITION BY thingname ORDER BY date)
  ) AS value
FROM `project.dataset.my_tbl`
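If for some reason you really do need to update the table, the statement above will not help, because DML's UPDATE does not allow analytic functions, so you need a different approach. For example, the one below: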
#standardSQL
SELECT
  t1.ThingName, t1.date,
  ARRAY_AGG(t2.Value IGNORE NULLS ORDER BY t2.date DESC LIMIT 1)[OFFSET(0)] AS value
FROM `project.dataset.my_tbl` AS t1
LEFT JOIN `project.dataset.my_tbl` AS t2
  ON t2.ThingName = t1.ThingName
  AND t2.date <= t1.date
GROUP BY t1.ThingName, t1.date, t1.value
Now you can use it to update the table, as in the example below:
#standardSQL
UPDATE `project.dataset.my_tbl` t
SET value = new_value
FROM (
  SELECT TO_JSON_STRING(t1) AS id,
    ARRAY_AGG(t2.Value IGNORE NULLS ORDER BY t2.date DESC LIMIT 1)[OFFSET(0)] new_value
  FROM `project.dataset.my_tbl` AS t1
  LEFT JOIN `project.dataset.my_tbl` AS t2
    ON t2.ThingName = t1.ThingName
    AND t2.date <= t1.date
  GROUP BY id
)
WHERE TO_JSON_STRING(t) = id
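The statement above fills only from the same or earlier dates. Since the question also mentions later dates, here is a hedged sketch of how the same join-based pattern might be extended to take the nearest non-NULL row in either direction. It assumes a tie between equally distant rows should go to the earlier date, and it restricts the work to rows whose value is NULL. It has not been run against real data, and the self-join pairs every NULL row with every non-NULL row of the same ThingName, so it may be expensive on large groups.

```sql
#standardSQL
UPDATE `project.dataset.my_tbl` t
SET value = new_value
FROM (
  SELECT
    TO_JSON_STRING(t1) AS id,
    -- Closest donor by absolute day distance; on a tie the earlier date wins.
    ARRAY_AGG(t2.value
              ORDER BY ABS(DATE_DIFF(t2.date, t1.date, DAY)), t2.date
              LIMIT 1)[OFFSET(0)] AS new_value
  FROM `project.dataset.my_tbl` AS t1
  JOIN `project.dataset.my_tbl` AS t2
    ON t2.ThingName = t1.ThingName
    AND t2.value IS NOT NULL      -- only rows that can donate a value
  WHERE t1.value IS NULL          -- only rows that need fixing
  GROUP BY id
)
WHERE TO_JSON_STRING(t) = id
```

Because of the inner join, rows whose ThingName has no non-NULL value at all are simply left untouched.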