I want to run the following query with Spark SQL 2.0:
SELECT
a.id as id,
(SELECT SUM(b.points)
FROM tableB b
WHERE b.id = a.id AND b.date <= a.date) AS points
FROM tableA a
But I get the following error:
The correlated scalar subquery can only contain equality predicates
Any ideas on how to rewrite the query, or how to use operations between the two DataFrames tableA and tableB, to make this work?
Answer 0 (score: 1)
select a.id as id,
sum(b.points) as points
from a, b
where a.id = b.id
and b.date <= a.date
group by a.id
;
Skip the sub-select and group by id, which ensures a one-to-one relationship between each id and the sum of b's points.
Here is the "quick and dirty" example I used:
select * from a ;
id|date
1|2017-01-22 17:59:49
2|2017-01-22 18:00:00
3|2017-01-22 18:00:05
4|2017-01-22 18:00:11
5|2017-01-22 18:00:15
select * from b ;
id|points|date
1|12|2017-01-21 18:03:20
3|25|2017-01-21 18:03:37
5|17|2017-01-21 18:03:55
2|-1|2017-01-22 18:04:27
4|-4|2017-01-22 18:04:35
5|400|2017-01-20 18:17:31
5|-1000|2017-01-23 18:18:36
Note that b has three entries with id = 5: two dated before a.date and one after it.
select a.id, sum(b.points) as points from a, b where a.id = b.id and b.date <= a.date group by a.id ;
1|12
3|25
5|417
I also confirmed that "group by" is supported: http://spark.apache.org/docs/latest/sql-programming-guide.html#supported-hive-features
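The example data can be replayed outside Spark to check the query logic. The sketch below is only an illustration: it uses Python's built-in sqlite3 module as a stand-in engine (an assumption, not Spark itself), so it verifies the SQL semantics and says nothing about what Spark 2.0 accepts. It also shows one difference worth knowing: the join with a WHERE clause drops ids that have no qualifying row in b (ids 2 and 4 here), whereas the original scalar subquery would return those ids with a NULL sum. A LEFT JOIN with the date condition moved into the ON clause keeps them.

```python
import sqlite3

# In-memory SQLite stands in for Spark; only the SQL logic is being checked.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE a (id INTEGER, date TEXT);
    CREATE TABLE b (id INTEGER, points INTEGER, date TEXT);
""")
conn.executemany("INSERT INTO a VALUES (?, ?)", [
    (1, "2017-01-22 17:59:49"),
    (2, "2017-01-22 18:00:00"),
    (3, "2017-01-22 18:00:05"),
    (4, "2017-01-22 18:00:11"),
    (5, "2017-01-22 18:00:15"),
])
conn.executemany("INSERT INTO b VALUES (?, ?, ?)", [
    (1, 12, "2017-01-21 18:03:20"),
    (3, 25, "2017-01-21 18:03:37"),
    (5, 17, "2017-01-21 18:03:55"),
    (2, -1, "2017-01-22 18:04:27"),
    (4, -4, "2017-01-22 18:04:35"),
    (5, 400, "2017-01-20 18:17:31"),
    (5, -1000, "2017-01-23 18:18:36"),
])

# Join + GROUP BY rewrite from the answer (inner join semantics).
inner_rows = conn.execute("""
    SELECT a.id, SUM(b.points) AS points
    FROM a, b
    WHERE a.id = b.id AND b.date <= a.date
    GROUP BY a.id
    ORDER BY a.id
""").fetchall()

# LEFT JOIN variant: ids without a qualifying row in b are kept with NULL points.
left_rows = conn.execute("""
    SELECT a.id, SUM(b.points) AS points
    FROM a
    LEFT JOIN b ON a.id = b.id AND b.date <= a.date
    GROUP BY a.id
    ORDER BY a.id
""").fetchall()

# The original correlated scalar subquery, which SQLite does accept.
subquery_rows = conn.execute("""
    SELECT a.id,
           (SELECT SUM(b.points) FROM b
            WHERE b.id = a.id AND b.date <= a.date) AS points
    FROM a
    ORDER BY a.id
""").fetchall()

print(inner_rows)
print(left_rows)
print(subquery_rows)
```

Both rewrites group by a.id alone, so they match the subquery row for row only when id is unique in a, as it is in this sample. If a can hold several rows per id, grouping by a.id and a.date is one way to keep them apart.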