Does PySpark support multiple named windows in the same query? I want to compute moving averages of various sizes within a single query.
seconds_per_day = 86400
seconds_per_minute = 60

sql('''
SELECT datetime,
       symbol,
       price,
       AVG(price) OVER past_7_days AS price_7_day_avg,
       AVG(price) OVER past_1_hour AS price_1_hour_avg
FROM data_formatted
WINDOW past_7_days AS (PARTITION BY symbol
                       ORDER BY CAST(datetime AS long)
                       RANGE BETWEEN 7 * {days} PRECEDING AND 1 * {minutes} PRECEDING)
WINDOW past_1_hour AS (PARTITION BY symbol
                       ORDER BY CAST(datetime AS long)
                       RANGE BETWEEN 60 * {minutes} PRECEDING AND 1 * {minutes} PRECEDING)
ORDER BY symbol ASC, datetime DESC
'''.format(days=seconds_per_day,
           minutes=seconds_per_minute)).show(1)
But my code produces the following error:
: org.apache.spark.sql.catalyst.parser.ParseException:
mismatched input 'ORDER' expecting {<EOF>, ',', 'LIMIT'}(line 14, pos 5)
== SQL ==
SELECT datetime,
symbol,
price,
AVG (price) OVER past_7_days AS price_7_day_avg,
AVG (price) OVER past_1_hour AS price_1_hour_avg
FROM data_formatted
WINDOW past_7_days AS (PARTITION BY symbol
ORDER BY CAST(datetime AS long)
RANGE BETWEEN 7 * 86400 PRECEDING AND 1 * 60 PRECEDING)
WINDOW past_1_hour AS (PARTITION BY symbol
ORDER BY CAST(datetime AS long)
RANGE BETWEEN 60 * 60 PRECEDING AND 1 * 60 PRECEDING)
ORDER BY symbol ASC, datetime DESC
-----^^^
Removing the second named window (and the column that uses it) lets the code run without errors, but I need to compute a large number of moving averages, and I don't want to create a separate table for each column.
Answer 0 (score: 0)
I have used several named windows in the same query, but through the SQL API, and it works.
Answer 1 (score: 0)
Rewrite your SQL statement with a single WINDOW clause and the multiple named windows separated by commas (,), as shown below, and it should work.
SELECT datetime,
       symbol,
       price,
       AVG(price) OVER past_7_days AS price_7_day_avg,
       AVG(price) OVER past_1_hour AS price_1_hour_avg
FROM data_formatted
WINDOW past_7_days AS (PARTITION BY symbol
                       ORDER BY CAST(datetime AS long)
                       RANGE BETWEEN 7 * {days} PRECEDING AND 1 * {minutes} PRECEDING),
       past_1_hour AS (PARTITION BY symbol
                       ORDER BY CAST(datetime AS long)
                       RANGE BETWEEN 60 * {minutes} PRECEDING AND 1 * {minutes} PRECEDING)
ORDER BY symbol ASC, datetime DESC
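To see the fix without a running cluster, here is a minimal, Spark-free sketch: it formats the corrected statement the same way the question does and checks that the WINDOW keyword now appears exactly once, with the second window joined by a comma. The table name data_formatted and the surrounding SparkSession are assumed from the question; with a live session you would then run spark.sql(query).show(1).

```python
# Format the corrected query exactly as in the answer above; no Spark
# session is needed just to build and inspect the SQL string.
seconds_per_day = 86400
seconds_per_minute = 60

query = '''
SELECT datetime,
       symbol,
       price,
       AVG(price) OVER past_7_days AS price_7_day_avg,
       AVG(price) OVER past_1_hour AS price_1_hour_avg
FROM data_formatted
WINDOW past_7_days AS (PARTITION BY symbol
                       ORDER BY CAST(datetime AS long)
                       RANGE BETWEEN 7 * {days} PRECEDING AND 1 * {minutes} PRECEDING),
       past_1_hour AS (PARTITION BY symbol
                       ORDER BY CAST(datetime AS long)
                       RANGE BETWEEN 60 * {minutes} PRECEDING AND 1 * {minutes} PRECEDING)
ORDER BY symbol ASC, datetime DESC
'''.format(days=seconds_per_day, minutes=seconds_per_minute)

# The parser accepts one WINDOW clause per query block, so the keyword
# must appear exactly once; additional named windows are comma-separated.
assert query.count('WINDOW') == 1
assert '7 * 86400 PRECEDING' in query
assert '60 * 60 PRECEDING' in query
```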