I have noted some posts on this, but they may well be in error. In some discussions you can see that results over big data are not 100% accurate, but near enough. Do Bloom filters ring a bell?
I note that a DF does not have a mapPartitions option, and I assume that, since there is no real index other than the row's ordinal, partitioning is the key, and that for SPARK SQL a statement like this:
with X as (select device, time_asc, trip_id from trips where trip_id is not null)
select Y.TRIP_ID, Y.DEVICE, Y.TIME_ASC FROM (
select T1.TIME_ASC, T1.DEVICE, X.TRIP_ID, X.TIME_ASC AS TIME_ASC_COMPARE
,RANK() OVER (PARTITION BY T1.TIME_ASC, T1.DEVICE ORDER BY X.TIME_ASC) AS RANK_VAL from trips T1, X
where T1.DEVICE = X.DEVICE
and T1.TIME_ASC <= X.TIME_ASC) Y
where RANK_VAL = 1
order by TRIP_ID, TIME_ASC
will always be correct, but may suffer if a less-than-ideal partitioning/bucketing is applied.
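To make the intent of the SQL concrete, here is a minimal plain-Scala sketch of the same logic over an in-memory collection (the `Trip` case class and `label` helper are hypothetical, introduced only for illustration): for each reading, it assigns the `trip_id` of the earliest labelled row at or after it on the same device, which is what the self-join plus `RANK_VAL = 1` computes.

```scala
// Hypothetical in-memory model of the trips table.
case class Trip(device: String, timeAsc: Long, tripId: Option[Int])

object NearestTrip {
  // Mirrors the SQL: X is the labelled subset; for each T1 row, take the
  // joined X row with the smallest X.TIME_ASC (RANK_VAL = 1). Ties on
  // timeAsc would yield several rank-1 rows in SQL; headOption keeps one.
  def label(trips: Seq[Trip]): Seq[(Int, String, Long)] =
    {
      val labelled = trips.filter(_.tripId.isDefined)   // the X CTE
      trips.flatMap { t1 =>
        labelled
          .filter(x => x.device == t1.device && t1.timeAsc <= x.timeAsc)
          .sortBy(_.timeAsc)                            // ORDER BY X.TIME_ASC
          .headOption                                   // WHERE RANK_VAL = 1
          .map(x => (x.tripId.get, t1.device, t1.timeAsc))
      }.sortBy(r => (r._1, r._3))                       // ORDER BY TRIP_ID, TIME_ASC
    }
}
```

This is only a sequential reference implementation of the semantics; the point of the question is how Spark preserves exactly this result under arbitrary partitioning.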
Answer (score: 0)
I checked this assertion on a small machine, applying various partitioning schemes to the DF/DS, and found the results correct and faster with more partitions. That is, the statement below can hammer the machine, but there is no loss of results; in fact it ran blisteringly fast, thanks to SPARK SQL's internal handling of partitions and columns:
val res = spark.sql("""with X as (select device, time_asc, trip_id from trips where trip_id <> 0)
select Y.TRIP_ID, Y.DEVICE, Y.TIME_ASC FROM (
select T1.TIME_ASC, T1.DEVICE, X.TRIP_ID, X.TIME_ASC AS TIME_ASC_COMPARE
,RANK() OVER (PARTITION BY T1.TIME_ASC, T1.DEVICE ORDER BY X.TIME_ASC) AS RANK_VAL
from trips T1, X
where T1.DEVICE = X.DEVICE
and T1.TIME_ASC <= X.TIME_ASC) Y
where RANK_VAL = 1
order by TRIP_ID, TIME_ASC""").cache
The posts that got different results may have been due to not caching...? In any event, the WITH clause also works fine on 2.3.1, trivial though that is.
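The partitioning and caching points above can be sketched as follows. This is a hypothetical setup, not the answer's exact test harness: it assumes a `tripsDf` DataFrame is already loaded, and the partition count of 200 is arbitrary. Repartitioning by the join key co-locates each device's rows, and counting after `cache` materialises the result before reuse.

```scala
import org.apache.spark.sql.functions.col

// Assumed: an existing SparkSession `spark` and a loaded DataFrame `tripsDf`.
// Repartition by the join key so each device's rows land in one partition.
val partitioned = tripsDf.repartition(200, col("device"))
partitioned.createOrReplaceTempView("trips")

// Same query as above; cache it, then force evaluation so later actions
// (and result comparisons across runs) read the cached data.
val res = spark.sql("""...""").cache()
res.count()
```

The elided query string is the one shown in the answer; the result should be identical regardless of the partition count chosen, only the runtime differs.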