Many window functions in the same Spark query

Time: 2017-10-25 13:27:18

Tags: apache-spark apache-spark-sql

I am running this Spark SQL query:

   select
    c.*,
    count(*)
      OVER (PARTITION BY brand, manufacturer, country
      ORDER BY (eventTime div (1000*3600))
      RANGE BETWEEN 24 PRECEDING and 1 PRECEDING) as countBrandManuCountry24H,
    coalesce(
      sum(isMinor)
      OVER(
        PARTITION BY brand, manufacturer, country
        ORDER BY (eventTime div (1000*3600))
        RANGE BETWEEN 24 PRECEDING and 1 PRECEDING), 0L) as minorCountBrandManuCountry24H,
    count(*)
      OVER (PARTITION BY brand, manufacturer, country
        ORDER BY (eventTime div (1000*3600))
        RANGE BETWEEN 336 PRECEDING and 1 PRECEDING
      ) as countBrandManuCountry336H,
    coalesce(
      sum(isMinor)
        OVER (PARTITION BY brand, manufacturer, country
        ORDER BY (eventTime div (1000*3600))
        RANGE BETWEEN 336 PRECEDING and 1 PRECEDING), 0L) as minorCountBrandManuCountry336H
  from records c
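For reference, because the `ORDER BY` key is `eventTime div (1000*3600)` (the event's hour bucket), the frame `RANGE BETWEEN 24 PRECEDING AND 1 PRECEDING` covers all rows in the same partition whose bucket falls in `[current - 24, current - 1]`, i.e. the previous 24 hours excluding the current hour. A minimal pure-Python simulation of that frame semantics (hypothetical events, not Spark itself):

```python
from collections import defaultdict

def window_counts(events, frame=24):
    """events: list of (key, hour_bucket, is_minor).
    For each row, returns (count, minor_sum) over rows of the same key
    whose bucket lies in [bucket - frame, bucket - 1], mimicking
    RANGE BETWEEN frame PRECEDING AND 1 PRECEDING on the hour bucket."""
    # pre-aggregate events per (key, bucket)
    per_bucket = defaultdict(lambda: [0, 0])  # (key, bucket) -> [count, minor_sum]
    for key, bucket, is_minor in events:
        per_bucket[(key, bucket)][0] += 1
        per_bucket[(key, bucket)][1] += is_minor
    out = []
    for key, bucket, _ in events:
        cnt = mins = 0
        for b in range(bucket - frame, bucket):  # excludes the current bucket
            c, m = per_bucket.get((key, b), (0, 0))
            cnt += c
            mins += m
        out.append((cnt, mins))
    return out

events = [("brandA", 10, 1), ("brandA", 11, 0), ("brandA", 12, 1), ("brandA", 40, 0)]
print(window_counts(events))
# [(0, 0), (1, 1), (2, 1), (0, 0)] -- bucket 40 is more than 24h after bucket 12
```

Note that all four aggregates in the query share the same `PARTITION BY` and `ORDER BY`, which is why Spark evaluates them over a single window specification rather than four independent ones.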

The execution plan is mostly a linear DAG; it does not run fully in parallel, even though all the window functions operate over the same partitioning.

(screenshot of the stage DAG omitted)

Is there any way to improve this performance?

Perhaps:

  1. Write it to a DataFrame without the window functions

  2. Sort and cache it, then run the query

  3. Compute the aggregates in separate DataFrames and join them
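On suggestion 3: since the `ORDER BY` key is the hour bucket, every row with the same (brand, manufacturer, country, bucket) gets identical window values, so one way to realize it is to pre-aggregate to bucket granularity, compute the rolling sums on that much smaller table, and join the results back onto the rows. A hypothetical pure-Python sketch of that shape (illustrative names, not Spark API):

```python
from collections import defaultdict

def rolling_by_join(events, frame=24):
    """events: list of (key, bucket, is_minor)."""
    # step 1: aggregate to bucket granularity (far fewer rows than the raw data)
    agg = defaultdict(lambda: [0, 0])  # (key, bucket) -> [count, minor_sum]
    for key, bucket, is_minor in events:
        agg[(key, bucket)][0] += 1
        agg[(key, bucket)][1] += is_minor
    # step 2: rolling sums over the previous `frame` buckets, per key,
    # computed only on the aggregated table
    rolled = {}
    for (key, bucket) in agg:
        cnt = mins = 0
        for b in range(bucket - frame, bucket):
            c, m = agg.get((key, b), (0, 0))
            cnt += c
            mins += m
        rolled[(key, bucket)] = (cnt, mins)
    # step 3: "join" the rolled values back onto the original rows
    return [rolled[(key, bucket)] for key, bucket, _ in events]

events = [("brandA", 10, 1), ("brandA", 11, 0), ("brandA", 12, 1)]
print(rolling_by_join(events))  # [(0, 0), (1, 1), (2, 1)]
```

In Spark terms this would be a `groupBy` on (brand, manufacturer, country, bucket), the window functions over the aggregated result, and a join back to `records`; whether that beats the single window pass depends on the data's cardinality.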

0 Answers:

No answers yet.