I have two tables, both with a timestamp and some additional data.

Table A:

| name | timestamp | a_data |
| ---- | ------------------- | ------ |
| 1 | 2018-01-01 11:10:00 | a |
| 2 | 2018-01-01 12:20:00 | b |
| 3 | 2018-01-01 13:30:00 | c |

Table B:

| name | timestamp | b_data |
| ---- | ------------------- | ------ |
| 1 | 2018-01-01 11:00:00 | w |
| 2 | 2018-01-01 12:00:00 | x |
| 3 | 2018-01-01 13:00:00 | y |
| 3 | 2018-01-01 13:10:00 | y |
| 3 | 2018-01-01 13:10:00 | z |

What I want to do is, for each row in Table A, LEFT JOIN the most recent record in Table B that predates it:

| name | timestamp | a_data | b_data |
| ---- | ------------------- | ------ | ------ |
| 1 | 2018-01-01 11:10:00 | a | w |
| 2 | 2018-01-01 12:20:00 | b | x |
| 3 | 2018-01-01 13:30:00 | c | z | <-- note z, not y
I think this involves a subquery, but I can't get it to work in BigQuery. What I have so far:
SELECT a.a_data, b.b_data
FROM `table_a` AS a
LEFT JOIN `table_b` AS b
ON a.name = b.name
WHERE a.timestamp = (
SELECT max(timestamp) from `table_b` as sub
WHERE sub.name = b.name
AND sub.timestamp < a.timestamp
)
On my actual dataset, which is a very small test set (under 2 MB), the query runs but never completes. Any pointers would be much appreciated.
Answer 0 (score: 2)
You can try using a subquery in the SELECT list.
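SELECT a.*, (
    -- Caveat: MAX(b.b_data) returns the largest b_data among the earlier rows,
    -- which is the latest row's value only when the two happen to coincide
    -- (as they do in the sample data).
    SELECT MAX(b.b_data)
    FROM `table_b` AS b
    WHERE a.name = b.name
      AND b.timestamp < a.timestamp
) AS b_data
FROM `table_a` AS a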
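To see how taking MAX(b_data) can differ from taking the b_data of the most recent earlier row, here is a minimal sketch. It uses SQLite through Python's `sqlite3` module as a stand-in for BigQuery, which is an assumption of this example, and the two `table_b` rows are hypothetical data chosen so that the later row carries the smaller value.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE table_a (name INTEGER, timestamp TEXT, a_data TEXT);
    CREATE TABLE table_b (name INTEGER, timestamp TEXT, b_data TEXT);
    INSERT INTO table_a VALUES (1, '2018-01-01 11:10:00', 'a');
    -- Hypothetical rows: the later table_b row carries the smaller value.
    INSERT INTO table_b VALUES
        (1, '2018-01-01 10:00:00', 'z'),
        (1, '2018-01-01 11:00:00', 'w');
""")

# Scalar subquery with MAX: largest b_data among the earlier rows.
max_pick = conn.execute("""
    SELECT a.a_data,
           (SELECT MAX(b.b_data) FROM table_b AS b
             WHERE b.name = a.name AND b.timestamp < a.timestamp) AS b_data
    FROM table_a AS a
""").fetchall()

# Scalar subquery ordered by timestamp: b_data of the most recent earlier row.
latest_pick = conn.execute("""
    SELECT a.a_data,
           (SELECT b.b_data FROM table_b AS b
             WHERE b.name = a.name AND b.timestamp < a.timestamp
             ORDER BY b.timestamp DESC LIMIT 1) AS b_data
    FROM table_a AS a
""").fetchall()

print(max_pick)     # [('a', 'z')]
print(latest_pick)  # [('a', 'w')]
```

The ORDER BY ... LIMIT 1 form is valid in SQLite; whether BigQuery accepts the same correlated subquery is not something this sketch verifies.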
Edit
Alternatively, you can try the ROW_NUMBER window function in a subquery.
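SELECT name, timestamp, a_data, b_data
FROM (
    SELECT a.*, b.b_data,
           -- Partition per row of table_a (name + timestamp), so several
           -- table_a rows sharing a name each keep their own match;
           -- b.b_data DESC breaks timestamp ties (z over y in the sample).
           ROW_NUMBER() OVER (PARTITION BY a.name, a.timestamp
                              ORDER BY b.timestamp DESC, b.b_data DESC) AS rn
    FROM `table_a` AS a
    LEFT JOIN `table_b` AS b ON a.name = b.name AND b.timestamp < a.timestamp
) t1
WHERE rn = 1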
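As a sanity check of the ROW_NUMBER approach against the sample data from the question, here is a small sketch run in SQLite via Python's `sqlite3` module (a stand-in for BigQuery, and it needs an SQLite build with window functions, 3.25 or later). Partitioning by name and timestamp and breaking timestamp ties on b_data are assumptions of this sketch, made so that the tie at 13:10:00 resolves to z.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE table_a (name INTEGER, timestamp TEXT, a_data TEXT);
    CREATE TABLE table_b (name INTEGER, timestamp TEXT, b_data TEXT);
    INSERT INTO table_a VALUES
        (1, '2018-01-01 11:10:00', 'a'),
        (2, '2018-01-01 12:20:00', 'b'),
        (3, '2018-01-01 13:30:00', 'c');
    INSERT INTO table_b VALUES
        (1, '2018-01-01 11:00:00', 'w'),
        (2, '2018-01-01 12:00:00', 'x'),
        (3, '2018-01-01 13:00:00', 'y'),
        (3, '2018-01-01 13:10:00', 'y'),
        (3, '2018-01-01 13:10:00', 'z');
""")

# Number the earlier table_b rows per table_a row, newest first, keep rank 1.
rows = conn.execute("""
    SELECT name, timestamp, a_data, b_data
    FROM (
        SELECT a.name, a.timestamp, a.a_data, b.b_data,
               ROW_NUMBER() OVER (
                   PARTITION BY a.name, a.timestamp
                   ORDER BY b.timestamp DESC, b.b_data DESC
               ) AS rn
        FROM table_a AS a
        LEFT JOIN table_b AS b
          ON a.name = b.name AND b.timestamp < a.timestamp
    ) AS t1
    WHERE rn = 1
    ORDER BY name
""").fetchall()

for row in rows:
    print(row)
# (1, '2018-01-01 11:10:00', 'a', 'w')
# (2, '2018-01-01 12:20:00', 'b', 'x')
# (3, '2018-01-01 13:30:00', 'c', 'z')
```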
Answer 1 (score: 1)
In BigQuery, arrays are often an efficient way to solve this kind of problem:
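SELECT a.a_data, b.b_data
FROM `table_a` a LEFT JOIN
     (SELECT b.name,
             -- LIMIT 1 leaves a one-element array, so the element is at OFFSET(0).
             -- Note: this takes the latest table_b row per name overall; it does
             -- not compare b.timestamp with a.timestamp.
             ARRAY_AGG(b.b_data ORDER BY b.timestamp DESC LIMIT 1)[OFFSET(0)] as b_data
      FROM `table_b` b
      GROUP BY b.name
     ) b
     ON a.name = b.name;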
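Answer 2 (score: 1)
The following works in BigQuery Standard SQL and does not require listing all columns on both sides, only name and timestamp. So it will work for any number of columns in the two tables (assuming there is no ambiguity in column names other than the two columns mentioned above).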
#standardSQL
SELECT a.*, b.* EXCEPT (name, timestamp)
FROM (
SELECT
ANY_VALUE(a) a,
ARRAY_AGG(b ORDER BY b.timestamp DESC LIMIT 1)[SAFE_OFFSET(0)] b
FROM `project.dataset.table_a` a
LEFT JOIN `project.dataset.table_b` b
USING (name)
WHERE a.timestamp > b.timestamp
GROUP BY TO_JSON_STRING(a)
)
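One thing that may be worth checking with a LEFT JOIN followed by a WHERE filter on the right-hand table: rows of the left table that have no earlier match are filtered out, whereas putting the same condition in the ON clause keeps them with NULLs. The sketch below illustrates this general SQL behaviour in SQLite via Python's `sqlite3` module (a stand-in for BigQuery); the rows with name 4 are hypothetical data added for the illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE table_a (name INTEGER, timestamp TEXT, a_data TEXT);
    CREATE TABLE table_b (name INTEGER, timestamp TEXT, b_data TEXT);
    -- Name 4 is hypothetical: its only table_b row is later than the table_a row.
    INSERT INTO table_a VALUES
        (1, '2018-01-01 11:10:00', 'a'),
        (4, '2018-01-01 09:00:00', 'd');
    INSERT INTO table_b VALUES
        (1, '2018-01-01 11:00:00', 'w'),
        (4, '2018-01-01 10:00:00', 'v');
""")

# Timestamp condition in WHERE: unmatched table_a rows are dropped.
filter_in_where = conn.execute("""
    SELECT a.a_data, b.b_data
    FROM table_a AS a
    LEFT JOIN table_b AS b ON a.name = b.name
    WHERE a.timestamp > b.timestamp
    ORDER BY a.name
""").fetchall()

# Timestamp condition in ON: unmatched table_a rows survive with NULL.
filter_in_on = conn.execute("""
    SELECT a.a_data, b.b_data
    FROM table_a AS a
    LEFT JOIN table_b AS b
      ON a.name = b.name AND a.timestamp > b.timestamp
    ORDER BY a.name
""").fetchall()

print(filter_in_where)  # [('a', 'w')]
print(filter_in_on)     # [('a', 'w'), ('d', None)]
```

If every row of table_a is guaranteed to have an earlier row in table_b, as in the sample data, the two forms return the same rows.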
Answer 3 (score: 1)
This is a common case where you can't just GROUP BY and get the minimum. I suggest the following:
SELECT *
FROM table_a as a inner join (SELECT name, min(timestamp) as timestamp
FROM table_b group by 1) as b
on (a.timestamp = b.timestamp and a.name = b.name)
This way you restrict the join to the minimum timestamp found in table b.
You can also use a WITH statement to achieve this in a more readable way:
WITH min_b as (
SELECT name,
min(timestamp) as timestamp
FROM table_b group by 1
)
SELECT *
FROM table_a as a inner join min_b
on (a.timestamp = min_b.timestamp and a.name = min_b.name)
Let me know if it works!