I run this Spark SQL query:
select
  c.*,
  count(*) OVER (
    PARTITION BY brand, manufacturer, country
    ORDER BY (eventTime div (1000*3600))
    RANGE BETWEEN 24 PRECEDING AND 1 PRECEDING) as countBrandManuCountry24H,
  coalesce(
    sum(isMinor) OVER (
      PARTITION BY brand, manufacturer, country
      ORDER BY (eventTime div (1000*3600))
      RANGE BETWEEN 24 PRECEDING AND 1 PRECEDING), 0L) as minorCountBrandManuCountry24H,
  count(*) OVER (
    PARTITION BY brand, manufacturer, country
    ORDER BY (eventTime div (1000*3600))
    RANGE BETWEEN 336 PRECEDING AND 1 PRECEDING) as countBrandManuCountry336H,
  coalesce(
    sum(isMinor) OVER (
      PARTITION BY brand, manufacturer, country
      ORDER BY (eventTime div (1000*3600))
      RANGE BETWEEN 336 PRECEDING AND 1 PRECEDING), 0L) as minorCountBrandManuCountry336H
from records c
The execution plan is mostly a linear DAG: the window computations do not run fully in parallel, even though they all operate over the same partitioning.
Is there any way to improve the performance of this query?
Maybe:
write the DataFrame out without the window columns
sort it, cache it, and then run the query
compute each window in a separate DataFrame and join the results
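For reference, a minimal Scala sketch of what I mean by the last idea, written with the DataFrame API. The names `hourBucket` and `slidingWindow` are my own illustrative helpers, and `spark` is assumed to be an existing SparkSession with the `records` table registered:

```scala
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

// Compute the hour bucket once so every window shares the same sort key.
val base = spark.table("records")
  .withColumn("hourBucket", expr("eventTime div (1000*3600)"))

// All four aggregates share the same PARTITION BY / ORDER BY; only the
// frame length (24h vs 336h) differs.
def slidingWindow(hours: Int) = Window
  .partitionBy("brand", "manufacturer", "country")
  .orderBy("hourBucket")
  .rangeBetween(-hours, -1)

// One candidate split: compute each horizon on its own DataFrame,
// then join the results back on the row key.
val agg24 = base
  .withColumn("countBrandManuCountry24H", count(lit(1)).over(slidingWindow(24)))
  .withColumn("minorCountBrandManuCountry24H",
    coalesce(sum("isMinor").over(slidingWindow(24)), lit(0L)))
```

Whether splitting like this helps is exactly my question: since the windows share the same partition and order spec, I am not sure the extra join would pay for itself.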