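Joining tables on the nearest timestamp

Date: 2020-02-11 14:00:06

Tags: mysql sql dataframe apache-spark dataset

I have two tables that I need to join on the nearest timestamp, but I cannot find a simple way to do this in SQL.

Sample data: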

table_1
+---------------------+------+
|      timestamp      | name |
+---------------------+------+
| 2020-02-11 14:50:00 | xxx  |
| 2020-02-11 14:51:00 | yyy  |
| 2020-02-11 14:52:00 | zzz  |
+---------------------+------+
table_2
+---------------------+-------+
|      timestamp      | value |
+---------------------+-------+
| 2020-02-11 14:49:50 |     1 |
| 2020-02-11 14:49:58 |     2 |
| 2020-02-11 14:49:59 |     3 |
| 2020-02-11 14:50:50 |    11 |
| 2020-02-11 14:50:58 |    12 |
| 2020-02-11 14:50:59 |    13 |
| 2020-02-11 14:51:50 |    21 |
| 2020-02-11 14:51:58 |    22 |
| 2020-02-11 14:51:59 |    23 |
+---------------------+-------+

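I need to left join table_1 with table_2 on the nearest timestamp, with the condition that the matching timestamp in table_2 is always earlier than the timestamp in table_1. By that logic, I expect to get this result table: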

expected result
+---------------------+------+-------+
|      timestamp      | name | value |
+---------------------+------+-------+
| 2020-02-11 14:50:00 | xxx  |     3 |
| 2020-02-11 14:51:00 | yyy  |    13 |
| 2020-02-11 14:52:00 | zzz  |    23 |
+---------------------+------+-------+

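Can I do this with a SQL query, even if the query is not efficient? Otherwise, I am thinking of loading the data into a Spark dataframe. Is an algorithm like this already implemented in Spark?

Thanks

2 answers:

Answer 0 (score: -1)

You can use a correlated subquery: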

select t1.*,
       (select t2.value
        from table_2 t2
        where t2.timestamp <= t1.timestamp
        order by t2.timestamp desc
        limit 1
       ) as t2_value
from table_1 t1;
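
To check the query against the sample data, here is a minimal sketch that runs it from Python. It assumes an in-memory SQLite database as a stand-in for MySQL (the query uses no MySQL-specific syntax) and stores the timestamps as ISO-formatted text, which sorts chronologically. An `order by` is added to the outer query only to make the output order deterministic.

```python
import sqlite3

# Assumption: SQLite stands in for MySQL; timestamps are stored as ISO text.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE table_1 (timestamp TEXT, name TEXT);
    CREATE TABLE table_2 (timestamp TEXT, value INTEGER);
""")
conn.executemany("INSERT INTO table_1 VALUES (?, ?)", [
    ("2020-02-11 14:50:00", "xxx"),
    ("2020-02-11 14:51:00", "yyy"),
    ("2020-02-11 14:52:00", "zzz"),
])
conn.executemany("INSERT INTO table_2 VALUES (?, ?)", [
    ("2020-02-11 14:49:50", 1), ("2020-02-11 14:49:58", 2),
    ("2020-02-11 14:49:59", 3), ("2020-02-11 14:50:50", 11),
    ("2020-02-11 14:50:58", 12), ("2020-02-11 14:50:59", 13),
    ("2020-02-11 14:51:50", 21), ("2020-02-11 14:51:58", 22),
    ("2020-02-11 14:51:59", 23),
])

# The correlated subquery picks, per table_1 row, the value of the
# latest table_2 row whose timestamp is at or before that row's timestamp.
QUERY = """
select t1.*,
       (select t2.value
        from table_2 t2
        where t2.timestamp <= t1.timestamp
        order by t2.timestamp desc
        limit 1
       ) as t2_value
from table_1 t1
order by t1.timestamp
"""
rows = conn.execute(QUERY).fetchall()
for row in rows:
    print(row)
```

Note that `<=` also accepts a table_2 row with exactly the same timestamp; if the match must be strictly earlier, as the question words it, use `<` instead. The sample data contains no equal timestamps, so both give the same result here.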

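Answer 1 (score: -1)

If you only need value from table_2, I would use Gordon's answer. However, if you need to select more columns, you can use a correlated subquery in the ON clause of a LEFT JOIN: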

select t1.timestamp, t1.name, t2.value
from table_1 t1
left join table_2 t2 on t2.timestamp = (
  select max(t2i.timestamp)
  from table_2 t2i
  where t2i.timestamp <= t1.timestamp
)

Result:

| timestamp           | name | value |
| ------------------- | ---- | ----- |
| 2020-02-11 14:50:00 | xxx  | 3     |
| 2020-02-11 14:51:00 | yyy  | 13    |
| 2020-02-11 14:52:00 | zzz  | 23    |
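
On the dataframe side of the question: this kind of lookup is usually called a backward "as-of" join (pandas, for example, exposes it as `merge_asof`). The following is only a plain-Python sketch of the idea, not Spark code; the function name `asof_join_backward` and the tuple layout are made up for illustration. It assumes the right-hand rows are sorted by timestamp and that timestamps are ISO-formatted strings, so string comparison matches chronological order.

```python
from bisect import bisect_right

def asof_join_backward(left, right):
    """Attach to each (timestamp, name) row in `left` the value of the
    latest (timestamp, value) row in `right` with timestamp <= left's.

    `right` must already be sorted by timestamp (hypothetical helper).
    """
    right_ts = [ts for ts, _ in right]
    joined = []
    for ts, name in left:
        # bisect_right returns how many right timestamps are <= ts.
        i = bisect_right(right_ts, ts)
        value = right[i - 1][1] if i > 0 else None  # None mimics LEFT JOIN
        joined.append((ts, name, value))
    return joined

table_1 = [
    ("2020-02-11 14:50:00", "xxx"),
    ("2020-02-11 14:51:00", "yyy"),
    ("2020-02-11 14:52:00", "zzz"),
]
table_2 = sorted([
    ("2020-02-11 14:49:50", 1), ("2020-02-11 14:49:58", 2),
    ("2020-02-11 14:49:59", 3), ("2020-02-11 14:50:50", 11),
    ("2020-02-11 14:50:58", 12), ("2020-02-11 14:50:59", 13),
    ("2020-02-11 14:51:50", 21), ("2020-02-11 14:51:58", 22),
    ("2020-02-11 14:51:59", 23),
])

result = asof_join_backward(table_1, table_2)
for row in result:
    print(row)
```

Each lookup is a binary search, so the work is roughly proportional to the number of left rows times the logarithm of the number of right rows, instead of scanning table_2 once per table_1 row.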
