Are multiple named windows allowed in Spark?

Time: 2018-03-05 00:23:11

Tags: apache-spark pyspark apache-spark-sql spark-dataframe

Does PySpark support multiple named windows in the same query? I want to compute moving averages over several window sizes in a single query.

seconds_per_day = 86400
seconds_per_minute = 60
sql('''
    SELECT datetime,
           symbol,
           price,
           AVG (price) OVER past_7_days AS price_7_day_avg,
           AVG (price) OVER past_1_hour AS price_1_hour_avg
      FROM data_formatted
    WINDOW past_7_days AS (PARTITION BY symbol 
           ORDER BY CAST(datetime AS long)
           RANGE BETWEEN 7 * {days} PRECEDING AND 1 * {minutes} PRECEDING)
    WINDOW past_1_hour AS (PARTITION BY symbol 
           ORDER BY CAST(datetime AS long)
           RANGE BETWEEN 60 * {minutes} PRECEDING AND 1 * {minutes} PRECEDING)
     ORDER BY symbol ASC, datetime DESC
      '''.format(
        days=seconds_per_day,
        minutes=seconds_per_minute)).show(1)

However, my code produces the following error:

: org.apache.spark.sql.catalyst.parser.ParseException: 
mismatched input 'ORDER' expecting {<EOF>, ',', 'LIMIT'}(line 14, pos 5)

== SQL ==

    SELECT datetime,
           symbol,
           price,
           AVG (price) OVER past_7_days AS price_7_day_avg,
           AVG (price) OVER past_1_hour AS price_1_hour_avg
      FROM data_formatted
    WINDOW past_7_days AS (PARTITION BY symbol 
           ORDER BY CAST(datetime AS long)
           RANGE BETWEEN 7 * 86400 PRECEDING AND 1 * 60 PRECEDING)
    WINDOW past_1_hour AS (PARTITION BY symbol 
           ORDER BY CAST(datetime AS long)
           RANGE BETWEEN 60 * 60 PRECEDING AND 1 * 60 PRECEDING)
     ORDER BY symbol ASC, datetime DESC
-----^^^

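Removing the second named window (and the column that uses it) lets the code run without errors, but I have to compute a large number of moving averages, and I don't want to create a separate table for each column.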

2 answers:

Answer 0: (score: 0)

I use several windows in the same query, but I use them with the sql API, and it works.

Answer 1: (score: 0)

Rewrite your SQL statement with a single WINDOW declaration that lists the named windows separated by commas (,), as shown below, and it should work.

SELECT datetime,
       symbol,
       price,
       AVG (price) OVER past_7_days AS price_7_day_avg,
       AVG (price) OVER past_1_hour AS price_1_hour_avg
  FROM data_formatted
WINDOW past_7_days AS (PARTITION BY symbol 
       ORDER BY CAST(datetime AS long)
       RANGE BETWEEN 7 * {days} PRECEDING AND 1 * {minutes} PRECEDING),
       past_1_hour AS (PARTITION BY symbol 
       ORDER BY CAST(datetime AS long)
       RANGE BETWEEN 60 * {minutes} PRECEDING AND 1 * {minutes} PRECEDING)
 ORDER BY symbol ASC, datetime DESC
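
Since the question mentions needing a large number of moving averages, one possible way to avoid hand-writing each definition is to generate the single comma-separated WINDOW clause from a dictionary. The following is only a minimal sketch: the helper names `window_clause` and `build_query` and the output column naming scheme `avg_price_<window>` are hypothetical and not part of the original post, while the table, columns, and window bounds are taken from the question.

```python
SECONDS_PER_MINUTE = 60
SECONDS_PER_DAY = 86400


def window_clause(windows):
    """Build ONE WINDOW clause from {name: (start_seconds, end_seconds)}.

    Hypothetical helper: every named window shares the single WINDOW
    keyword, and the definitions are separated by commas.
    """
    definitions = []
    for name, (start, end) in windows.items():
        definitions.append(
            "{name} AS (PARTITION BY symbol "
            "ORDER BY CAST(datetime AS long) "
            "RANGE BETWEEN {start} PRECEDING AND {end} PRECEDING)".format(
                name=name, start=start, end=end))
    return "WINDOW " + ",\n       ".join(definitions)


def build_query(windows):
    """Build the full SELECT with one AVG column per named window.

    The column alias avg_price_<window> is an assumption made for this sketch.
    """
    avg_columns = [
        "AVG(price) OVER {0} AS avg_price_{0}".format(name)
        for name in windows
    ]
    select_list = ",\n       ".join(["datetime", "symbol", "price"] + avg_columns)
    return "\n".join([
        "SELECT " + select_list,
        "  FROM data_formatted",
        window_clause(windows),
        " ORDER BY symbol ASC, datetime DESC",
    ])


# Window bounds in seconds, matching the two windows from the question.
windows = {
    "past_7_days": (7 * SECONDS_PER_DAY, 1 * SECONDS_PER_MINUTE),
    "past_1_hour": (60 * SECONDS_PER_MINUTE, 1 * SECONDS_PER_MINUTE),
}

query = build_query(windows)
print(query)
```

The resulting string could then be passed to the same `sql(...)` call used in the question. Adding another moving average is a matter of adding one more entry to the `windows` dictionary.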