About six months ago (September 2017), I posted a question asking whether SparkSQL
allows the results of select expressions to be reused in GROUP BY and ORDER BY. The answer was no. People were forced to use workarounds, such as writing a subquery or repeating the expression in GROUP BY (see the post here).
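For illustration only, here is a minimal sketch of the subquery workaround. Since a live SparkSQL session can't be assumed, it uses SQLite via Python's sqlite3 module; the table name and `starttime` column mirror the examples below, and SQLite's `date(..., 'unixepoch')` stands in for Spark's `date(from_unixtime(...))`. The idea is the same: compute the expression once in an inner query, then group and order by the alias in the outer query.

```python
import sqlite3

# Hypothetical in-memory stand-in for someTable; starttime is in milliseconds.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE someTable (starttime INTEGER)")
conn.executemany(
    "INSERT INTO someTable VALUES (?)",
    [(1500000000000,), (1500000005000,), (1600000000000,)],
)

# The subquery workaround: the expression is written once in the inner
# query, and the outer query refers only to the alias timeDate.
rows = conn.execute("""
    SELECT count(*) AS cnt, timeDate
    FROM (SELECT date(t.starttime / 1000, 'unixepoch') AS timeDate
          FROM someTable AS t)
    GROUP BY timeDate
    ORDER BY timeDate DESC
""").fetchall()
print(rows)  # one (cnt, timeDate) row per distinct date, newest first
```

This avoids repeating `date(from_unixtime(...))` in GROUP BY and ORDER BY, at the cost of an extra query level.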
But I just tried the query again. Perhaps we have since upgraded to a newer version of SparkSQL. I noticed that SparkSQL
now actually allows the reuse.
Update: new query example and new observations
Here is an example. This query is now allowed (the new way):
select
count(*) cnt,
date(from_unixtime(t.starttime/1000)) as timeDate
from someTable as t
group by
timeDate
order by
timeDate desc
/* Okay */
Can we still use the old way? Based on my experiments, the answer is yes and no. I don't know whether this is a bug or intended.
This query is allowed:
select
count(*) cnt,
date(from_unixtime(t.starttime/1000)) as timeDate
from someTable as t
group by
date(from_unixtime(t.starttime/1000))
order by
date(from_unixtime(t.starttime/1000)) desc
/* Okay */
But the following query triggers an exception:
select
count(*) cnt,
cast(date(from_unixtime(t.starttime/1000)) as date) as timeDate
from someTable as t
group by
date(from_unixtime(t.starttime/1000))
order by
starttime desc
/* Error: AnalysisException */
The error:
Error: org.apache.spark.sql.AnalysisException: \
cannot resolve '`starttime`' given input columns: \
[cnt, timeDate]; line 7 pos 9;
See more failure/success cases below.
I understand that in the example above, the cast(... as date)
is redundant, but here I just want to explore what the syntax allows. If we use the new way (reusing the result of the select expression), there is no problem:
select
count(*) cnt,
cast(date(from_unixtime(t.starttime/1000)) as date) as timeDate
from someTable as t
group by
timeDate
order by
timeDate desc
/* Okay */
I like this new way, but I want to make sure I have observed it correctly. Can someone confirm?
I forget which version we were on six months ago, but here are our current software versions:
Appendix: more failure/success cases
It seems that the ORDER BY must somehow match the GROUP BY, or an exception will be thrown.
select
count(*) cnt,
date(from_unixtime(t.starttime/1000)) as timeDate
from someTable as t
group by
date(from_unixtime(t.starttime/1000))
order by
t.starttime desc
/* Error: AnalysisException */
select
count(*) cnt,
cast(date(from_unixtime(t.starttime/1000)) as date) as timeDate
from someTable as t
group by
cast(date(from_unixtime(t.starttime/1000)) as date)
order by
cast(date(from_unixtime(t.starttime/1000)) as date) desc
/* Okay */