I have two tables that I need to join on the nearest timestamp, but I can't find a simple way to do this in SQL.
Sample data:
table_1
+---------------------+------+
| timestamp           | name |
+---------------------+------+
| 2020-02-11 14:50:00 | xxx  |
| 2020-02-11 14:51:00 | yyy  |
| 2020-02-11 14:52:00 | zzz  |
+---------------------+------+
table_2
+---------------------+-------+
| timestamp           | value |
+---------------------+-------+
| 2020-02-11 14:49:50 |     1 |
| 2020-02-11 14:49:58 |     2 |
| 2020-02-11 14:49:59 |     3 |
| 2020-02-11 14:50:50 |    11 |
| 2020-02-11 14:50:58 |    12 |
| 2020-02-11 14:50:59 |    13 |
| 2020-02-11 14:51:50 |    21 |
| 2020-02-11 14:51:58 |    22 |
| 2020-02-11 14:51:59 |    23 |
+---------------------+-------+
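I need to left join table_1 to table_2 on the closest timestamp, with the condition that the matching timestamp in table_2 is always slightly earlier than the timestamp in table_1. Following this logic, I expect the result table below.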
expected result
+---------------------+------+-------+
| timestamp           | name | value |
+---------------------+------+-------+
| 2020-02-11 14:50:00 | xxx  |     3 |
| 2020-02-11 14:51:00 | yyy  |    13 |
| 2020-02-11 14:52:00 | zzz  |    23 |
+---------------------+------+-------+
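To make the matching rule concrete, here is a minimal sketch of it in plain Python. It assumes "slightly earlier" means "less than or equal to" and that the timestamps are ISO-formatted strings, which compare correctly as text. The function name `asof_left_join` is made up for illustration; it is not part of any library.

```python
from bisect import bisect_right

# Sample rows from the tables above. ISO-formatted timestamps sort
# correctly as plain strings, so no datetime parsing is needed here.
table_1 = [
    ("2020-02-11 14:50:00", "xxx"),
    ("2020-02-11 14:51:00", "yyy"),
    ("2020-02-11 14:52:00", "zzz"),
]
table_2 = [
    ("2020-02-11 14:49:50", 1),
    ("2020-02-11 14:49:58", 2),
    ("2020-02-11 14:49:59", 3),
    ("2020-02-11 14:50:50", 11),
    ("2020-02-11 14:50:58", 12),
    ("2020-02-11 14:50:59", 13),
    ("2020-02-11 14:51:50", 21),
    ("2020-02-11 14:51:58", 22),
    ("2020-02-11 14:51:59", 23),
]


def asof_left_join(left, right):
    """Hypothetical helper: for each left row, attach the value of the
    latest right row whose timestamp is <= the left timestamp.
    The value is None when no such right row exists (left-join behaviour)."""
    right = sorted(right)                    # order by timestamp
    right_ts = [ts for ts, _ in right]
    result = []
    for ts, name in left:
        # Number of right rows with timestamp <= ts.
        idx = bisect_right(right_ts, ts)
        value = right[idx - 1][1] if idx > 0 else None
        result.append((ts, name, value))
    return result


rows = asof_left_join(table_1, table_2)
for row in rows:
    print(row)
```

Sorting table_2 once and doing a binary search per table_1 row keeps the lookup cheap; this is the behaviour I would like to reproduce in SQL or Spark.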
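Can I do this with a SQL query, even if the query is not very efficient? Otherwise, I am considering loading the data into a Spark DataFrame. Is there an algorithm like this implemented in Spark?
Thanks
Answer 0 (score: -1)
You can use a correlated subquery: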
select t1.*,
       (select t2.value
        from table_2 t2
        where t2.timestamp <= t1.timestamp
        order by t2.timestamp desc
        limit 1
       ) as t2_value
from table_1 t1;
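If you want to try this against the sample data, here is a small sketch using Python's built-in `sqlite3` module. The choice of SQLite and the `text`/`integer` column types are assumptions made for the demo; the question does not say which database is in use. The only change to the query is an added `order by` so the output order is deterministic.

```python
import sqlite3

# In-memory SQLite database loaded with the sample data from the question.
# Timestamps are stored as ISO-formatted text, which compares correctly.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    create table table_1 (timestamp text, name text);
    create table table_2 (timestamp text, value integer);

    insert into table_1 values
        ('2020-02-11 14:50:00', 'xxx'),
        ('2020-02-11 14:51:00', 'yyy'),
        ('2020-02-11 14:52:00', 'zzz');

    insert into table_2 values
        ('2020-02-11 14:49:50', 1),
        ('2020-02-11 14:49:58', 2),
        ('2020-02-11 14:49:59', 3),
        ('2020-02-11 14:50:50', 11),
        ('2020-02-11 14:50:58', 12),
        ('2020-02-11 14:50:59', 13),
        ('2020-02-11 14:51:50', 21),
        ('2020-02-11 14:51:58', 22),
        ('2020-02-11 14:51:59', 23);
""")

# For every table_1 row, the subquery picks the latest table_2 row
# whose timestamp is not later than the table_1 timestamp.
query = """
select t1.*,
       (select t2.value
        from table_2 t2
        where t2.timestamp <= t1.timestamp
        order by t2.timestamp desc
        limit 1
       ) as t2_value
from table_1 t1
order by t1.timestamp;
"""
subquery_rows = conn.execute(query).fetchall()
for row in subquery_rows:
    print(row)
```

A table_1 row with no earlier table_2 row simply gets NULL in `t2_value`, so the query behaves like a left join. On larger tables, an index on `table_2(timestamp)` will usually help, since the subquery runs once per table_1 row.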
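Answer 1 (score: -1)
If you only need value from table_2, I would use Gordon's answer. However, if you need to select more columns, you can use a correlated subquery in the LEFT JOIN's ON clause: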
select t1.timestamp, t1.name, t2.value
from table_1 t1
left join table_2 t2 on t2.timestamp = (
    select max(t2i.timestamp)
    from table_2 t2i
    where t2i.timestamp <= t1.timestamp
);
Result:
| timestamp | name | value |
| ------------------- | ---- | ----- |
| 2020-02-11 14:50:00 | xxx | 3 |
| 2020-02-11 14:51:00 | yyy | 13 |
| 2020-02-11 14:52:00 | zzz | 23 |
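Two edge cases are worth checking with this join variant. The sketch below again uses Python's built-in `sqlite3` module (SQLite is an assumption for the demo, not necessarily your database) and a reduced copy of the sample data. It adds two hypothetical rows that are not in the question: an `early` row in table_1 with no earlier match, and a second table_2 row that shares the timestamp `14:51:59`.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    create table table_1 (timestamp text, name text);
    create table table_2 (timestamp text, value integer);

    insert into table_1 values
        ('2020-02-11 14:00:00', 'early'),  -- hypothetical: earlier than all of table_2
        ('2020-02-11 14:50:00', 'xxx'),
        ('2020-02-11 14:51:00', 'yyy'),
        ('2020-02-11 14:52:00', 'zzz');

    insert into table_2 values
        ('2020-02-11 14:49:59', 3),
        ('2020-02-11 14:50:59', 13),
        ('2020-02-11 14:51:59', 23),
        ('2020-02-11 14:51:59', 24);       -- hypothetical: duplicate timestamp
""")

# Same join as above, also selecting the matched table_2 timestamp.
# The extra ordering on t2.value only makes the output deterministic.
query = """
select t1.timestamp, t1.name, t2.timestamp as t2_timestamp, t2.value
from table_1 t1
left join table_2 t2 on t2.timestamp = (
    select max(t2i.timestamp)
    from table_2 t2i
    where t2i.timestamp <= t1.timestamp
)
order by t1.timestamp, t2.value;
"""
join_rows = conn.execute(query).fetchall()
for row in join_rows:
    print(row)
```

The `early` row is kept with NULLs in the table_2 columns, as expected from a left join. The `zzz` row comes back twice, once per table_2 row with the tied timestamp, so if timestamps in table_2 are not unique you would need an extra tie-breaker.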