Say I have 2 tables, and both have a column containing timestamps for various events. The timestamp values differ between the two tables because they apply to different events. I want to join the two tables so that each record in table1 is joined with the first lower timestamp from table2.
For example:
Table1     Table2
142.13     141.16
157.34     145.45
168.45     155.85
170.23     166.76
           168.44
The joined table should be:
142.13,141.16
157.34,155.85
168.45,166.76
170.23,168.44
I am using Apache Spark SQL.
I am a noob at SQL, and this does not look like a noob's job :). Thanks.
Answer 0 (score: 3)
Try this:
with t1 as (
select 142.13 v from dual union all
select 157.34 v from dual union all
select 168.45 v from dual union all
select 170.23 v from dual
),
t2 as (
select 141.16 v from dual union all
select 145.45 v from dual union all
select 155.85 v from dual union all
select 166.76 v from dual union all
select 168.44 v from dual
)
select v, ( select max(v) from t2 where t2.v <= t1.v )
from t1;
         V (SELECTMAX(V)FROMT2WHERET2.V<=T1.V)
---------- -----------------------------------
    142.13                              141.16
    157.34                              155.85
    168.45                              168.44
    170.23                              168.44
4 rows selected.
The WITH clauses are just me faking up the data... The simplified query is just:
select t1.v, ( select max(t2.v) from table2 t2 where t2.v <= t1.v ) from table1 t1
[Edit] Admittedly, I am not familiar with Spark... but this is pretty simple SQL, so I assume it works :) [/Edit]
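For reference, here is a minimal sketch of running that exact query through Spark SQL by registering the sample data as temporary views. It assumes Spark 2.0+ (which supports correlated scalar subqueries); the view and column names simply mirror the ones used above:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("nearest-lower-ts").master("local[*]").getOrCreate()
import spark.implicits._

// Fake the data the same way the WITH clauses above do
Seq(142.13, 157.34, 168.45, 170.23).toDF("v").createOrReplaceTempView("table1")
Seq(141.16, 145.45, 155.85, 166.76, 168.44).toDF("v").createOrReplaceTempView("table2")

// The same correlated scalar subquery as the simplified query above
spark.sql(
  """select t1.v,
    |       (select max(t2.v) from table2 t2 where t2.v <= t1.v) as t2_v
    |from table1 t1
    |order by t1.v""".stripMargin
).show()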
Answer 1 (score: 1)
Ditto has already shown the straightforward way to solve this. If Apache Spark really does have trouble with that very basic query, then join first (which can lead to a big intermediate result) and aggregate afterwards:
select t1.v, max(t2.v)
from table1 t1
join table2 t2 on t2.v <= t1.v
group by t1.v
order by t1.v;
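If plain SQL is inconvenient, the same join-then-aggregate plan can also be written directly against the DataFrame API. A rough sketch under the same assumptions (the sample data and a column named v are carried over from above):

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.max

val spark = SparkSession.builder().appName("join-then-aggregate").master("local[*]").getOrCreate()
import spark.implicits._

val t1 = Seq(142.13, 157.34, 168.45, 170.23).toDF("v")
val t2 = Seq(141.16, 145.45, 155.85, 166.76, 168.44).toDF("v")

// The non-equi join keeps every table2 value at or below each table1 value;
// the aggregation then keeps only the largest match per table1 row.
t1.as("a").join(t2.as("b"), $"b.v" <= $"a.v")
  .groupBy($"a.v")
  .agg(max($"b.v").as("t2_v"))
  .orderBy($"a.v")
  .show()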
Answer 2 (score: 0)
If you are using Apache Spark SQL, then you can join these two tables as DataFrames, adding an id column with monotonically_increasing_id():
import org.apache.spark.sql.functions.monotonically_increasing_id
import spark.implicits._  // assumes a SparkSession named `spark` (e.g. spark-shell)
val t1 = spark.sparkContext.parallelize(Seq(142.13, 157.34, 168.45, 170.23)).toDF("c1")
val t2 = spark.sparkContext.parallelize(Seq(141.16, 145.45, 155.85, 166.76, 168.44)).toDF("c2")
// monotonically_increasing_id() guarantees increasing ids, not consecutive ones,
// so this positional pairing is only reliable for single-partition data like this.
val t11 = t1.withColumn("id", monotonically_increasing_id())
val t22 = t2.withColumn("id", monotonically_increasing_id())
// pair each table1 row with the table2 row whose generated id is one greater
val res = t11.join(t22, t11("id") + 1 === t22("id")).drop("id")
Output:
+------+------+
| c1| c2|
+------+------+
|142.13|145.45|
|168.45|166.76|
|157.34|155.85|
|170.23|168.44|
+------+------+
Hope this helps.