I run this Spark SQL query:
select
  c.*,
  count(*) OVER (
    PARTITION BY brand, manufacturer, country
    ORDER BY (eventTime div (1000*3600))
    RANGE BETWEEN 24 PRECEDING AND 1 PRECEDING) as countBrandManuCountry24H,
  coalesce(
    sum(isMinor) OVER (
      PARTITION BY brand, manufacturer, country
      ORDER BY (eventTime div (1000*3600))
      RANGE BETWEEN 24 PRECEDING AND 1 PRECEDING), 0L) as minorCountBrandManuCountry24H,
  count(*) OVER (
    PARTITION BY brand, manufacturer, country
    ORDER BY (eventTime div (1000*3600))
    RANGE BETWEEN 336 PRECEDING AND 1 PRECEDING) as countBrandManuCountry336H,
  coalesce(
    sum(isMinor) OVER (
      PARTITION BY brand, manufacturer, country
      ORDER BY (eventTime div (1000*3600))
      RANGE BETWEEN 336 PRECEDING AND 1 PRECEDING), 0L) as minorCountBrandManuCountry336H
from records c
The execution plan is mostly a linear DAG: the window computations do not run fully in parallel, even though they all operate over the same partitioning.
Is there any way to improve the performance of this query?
Maybe:
write the DataFrame out without the window columns
sort it, cache it, and then run the query
compute each window in a separate DataFrame and join the results
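For reference, a minimal Scala sketch of what I mean by the last idea, written with the DataFrame API. The names `hourBucket` and `slidingWindow` are my own illustrative helpers, and `spark` is assumed to be an existing SparkSession with the `records` table registered:

```scala
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

// Compute the hour bucket once so every window shares the same sort key.
val base = spark.table("records")
  .withColumn("hourBucket", expr("eventTime div (1000*3600)"))

// All four aggregates share the same PARTITION BY / ORDER BY; only the
// frame length (24h vs 336h) differs.
def slidingWindow(hours: Int) = Window
  .partitionBy("brand", "manufacturer", "country")
  .orderBy("hourBucket")
  .rangeBetween(-hours, -1)

// One candidate split: compute each horizon on its own DataFrame,
// then join the results back on the row key.
val agg24 = base
  .withColumn("countBrandManuCountry24H", count(lit(1)).over(slidingWindow(24)))
  .withColumn("minorCountBrandManuCountry24H",
    coalesce(sum("isMinor").over(slidingWindow(24)), lit(0L)))
```

Whether splitting like this helps is exactly my question: since the windows share the same partition and order spec, I am not sure the extra join would pay for itself.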